# Non-Convex Optimization with Spectral Radius Regularization

We develop a regularization method that finds flat minima during the training of deep neural networks and other machine learning models. These minima generalize better than sharp minima, allowing models to perform well on real-world test data that may be distributed differently from the training data. Specifically, we propose a method of regularized optimization that reduces the spectral radius of the Hessian of the loss function. We also derive algorithms to perform this optimization efficiently on neural networks and prove convergence results for these algorithms. Furthermore, we demonstrate that our algorithm works effectively on multiple real-world applications in several domains, including healthcare. To show that our models generalize well, we introduce different methods of testing generalizability.


## 1 Introduction

Finding flat minima is important in many optimization problems, especially those found in machine learning applications. Models at flat minima generalize better than those at sharp minima because the value of the objective function remains similar at a flat minimum if the data is shifted, distorted, or otherwise changed. Thus, in practice, machine learning models with flatter optima should perform better than those with sharper ones on test data that is distributed differently from the original training data.

Here, we define flat minima as those where the spectral radius (the largest absolute eigenvalue) of the Hessian of the objective function is small, and sharp minima as those where the spectral radius is large. At a minimum where this spectral radius is small, there is no direction away from the point in which the objective function immediately and rapidly increases. Therefore, by regularizing our optimization of models with respect to this spectral radius, we obtain solutions that are less susceptible to errors or biases in the training or test data.
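As a toy illustration of this definition (hypothetical one-dimensional functions, not the paper's models), the spectral radius of the Hessian reduces to |f''(w)|, and the same perturbation of the minimizer moves a flat objective far less than a sharp one:

```python
# Toy illustration: two one-dimensional minima at w = 0 whose sharpness is
# the spectral radius of the Hessian, here simply |f''(0)|.
sharp = lambda w: 10.0 * w**2   # spectral radius 20 (sharp minimum)
flat = lambda w: 0.1 * w**2     # spectral radius 0.2 (flat minimum)

# A perturbation of the minimizer changes the flat objective far less,
# mimicking a shift or distortion of the data.
eps = 0.5
print(sharp(eps) - sharp(0.0))  # 2.5
print(flat(eps) - flat(0.0))    # 0.025
```
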

However, this regularization presents certain challenges. For large neural networks, computing and storing the Hessian and the third derivative tensor is infeasible; therefore, we need methods to efficiently compute the regularization term and its gradient. We also need methods to introduce errors and biases into the data in order to test the generalizability of these models.

To tackle these challenges, we build a method to regularize the spectral radius using only Hessian-vector products. We present results with different regularization parameters to show that our methodology is stable. We introduce multiple methods to test generalizability, tailored to specific problems. Our contributions include:

• We develop an algorithm for regularizing neural networks with respect to the spectral radius of the Hessian, a novel use of a derivative measure for such regularization.

• We apply and extend differential operators used to efficiently compute Hessian-vector products for neural networks.

• We provide formal proofs on convergence and other properties of our algorithm.

• We present experimental results on multiple real world data sets across different domains, designing specific methods to test generalizability.

In Section 2, we review existing literature related to our research. In Section 3, we derive the algorithm used for our regularization. In Section 4, we discuss convergence results and other properties of the algorithm. In Section 5, we describe different generalizability tests and present results of our experiments with regularization on various data sets.

## 2 Related Work

Existing work discusses how different learning methods affect the ability of neural networks to converge to flat minima. Keskar et al. [2016] observed that large-batch stochastic gradient descent (SGD) and its variants, such as adaptive moment estimation (Adam), tend to converge to sharp minima, while small-batch methods converge to flat minima. This implies that small-batch methods generalize better than large-batch methods, as the training function at sharp minima is more sensitive. Possible causes include that large-batch methods over-fit, are attracted to saddle points, and lack the explorative properties of small-batch methods (they tend to converge to minima close to the initial weights).

Yao et al. [2018] showed that large-batch training of neural networks converges to points with a larger Hessian spectrum (both in terms of the dominant and other eigenvalues), which indicates poor robustness. Jastrzebski et al. [2018] extended these claims by showing that a larger learning rate also leads to flatter minima that generalize better than sharper minima.

Others suggested different ways to measure and find flat minima, including different objective functions and optimization algorithms. Ma et al. [2019] suggest that Kronecker-Factored Approximate Curvature (K-FAC) [Martens and Grosse, 2015], an approximate second-order method, may yield generalization improvements over first-order SGD. Chaudhari et al. [2016] proposed an entropy-based objective function to find solutions in flat regions, along with an algorithm (called Entropy-SGD) to conduct the optimization. He et al. [2019] observed that at local minima of deep networks there exist many asymmetric directions where the loss sharply increases on one side but not the other, which they call "asymmetric valleys." They propose weight averaging along the SGD trajectory to bias solutions in favor of the flat side. Chaudhari et al. [2016] also noted that many neural networks, trained on various data sets using SGD or Adam, converge to a point with a large number of near-zero Hessian eigenvalues, along with a long positive tail and a shorter negative tail. These observations about the imbalanced eigenspectrum by Chaudhari et al. [2016] and He et al. [2019] are of special interest to us, as our regularization method, which attempts to reduce the spectral radius of the Hessian, should be tailored to avoid these asymmetric valleys.

Yoshida and Miyato [2017] developed a spectral norm regularization method, but it targets the spectral norms of a neural network's weight matrices rather than the spectral radius of the Hessian of the loss function. While they experimentally showed that their regularization method has a small generalization gap (between the training and test sets), their method also yielded a higher Hessian spectral radius than vanilla SGD, weight decay, and adversarial training. We believe our regularization method and generalization tests more directly address the task of finding flat minima and evaluating their generalizability.

## 3 Algorithm

For reference, we provide a summary of the main variables used and their corresponding definitions in Table 1. We choose to express our problem as a regularized optimization problem, rather than a constrained optimization or min-max problem. Thus, our optimization problem is:

$$\min_w\; f(w) + \mu \max\{0,\, \rho(w) - K\},$$

where $w$ denotes the model weights, $f$ is a non-convex objective function, $\rho(w)$ is the spectral radius of the Hessian $H(w)$ of $f$ (i.e., the maximal absolute eigenvalue), and $\mu, K \ge 0$ are regularization parameters. For convenience, we denote the regularized objective $h(w) := f(w) + \mu \max\{0,\, \rho(w) - K\}$.

The caveat is that we cannot directly compute $\rho(w)$. For large neural networks with $n$ weights, computing and storing objects of size $n \times n$ (such as the Hessian) is infeasible. However, we can efficiently compute the Hessian-vector product $H(w)v$ for a given vector $v$ using methods discussed in Section 3.1.1.
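As a sketch of why Hessian-vector products suffice, $H(w)v$ can be formed from gradient evaluations alone, without materializing $H$. Here a finite-difference stand-in (illustrative only, not the paper's exact operator) is checked against an explicit Hessian on a small quadratic:

```python
import numpy as np

# Sketch (not the paper's R-operator): Hv ~ (grad(w + r v) - grad(w - r v)) / (2r),
# which never forms the n x n Hessian explicitly.
def grad(w):
    # Gradient of the toy objective f(w) = 0.5 * w^T A w, whose Hessian is A.
    return A @ w

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                     # symmetric stand-in Hessian
w, v = rng.standard_normal(n), rng.standard_normal(n)

r = 1e-5
hvp = (grad(w + r * v) - grad(w - r * v)) / (2 * r)
print(np.allclose(hvp, A @ v, atol=1e-6))   # True
```
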

Our goal is to design efficient algorithms for solving this minimization problem. In Section 3.1, we discuss how to compute the regularized term and its gradient. In Section 3.2, we present and explain different variants of our algorithm.

### 3.1 Gradients of Regularization Term

The spectral radius can be expressed as $\rho(w) = |v^T H(w) v|$, where $v$ is the unit eigenvector corresponding to the maximum absolute eigenvalue. In order to compute gradient update steps, we need to calculate $\nabla \rho(w)$.

###### Lemma 3.1.

For distinct eigenvalues $\lambda_k(x)$ of a symmetric matrix $A(x)$,

$$\frac{d\lambda_k(x)}{dx} = v_k^T \frac{dA(x)}{dx} v_k,$$

where $v_k$ is the unit eigenvector corresponding to eigenvalue $\lambda_k(x)$.

van der Aa et al. [2007] prove Lemma 3.1. The expression for this derivative is more complicated with repeated eigenvalues, so we assume that the eigenvalue in question is distinct (in practice, this is almost surely the case).

Using this result and assumption, we can express $\nabla\rho(w) = \pm\, v^T \nabla H(w)\, v$. Thus, if we can efficiently compute $H(w)v$ and $v^T \nabla H(w)\, v$ for a given $v$, we can respectively calculate $\rho(w)$ and $\nabla\rho(w)$.

#### 3.1.1 Hessian-Vector Operations

In order to compute $H(w)v$ and $\nabla\rho(w)$ for large neural networks, we extend the differential operator of Pearlmutter [1994]. We define this differential operator:

###### Definition 3.1.

$\mathcal{R}\{\phi(w)\} := \frac{\partial}{\partial \epsilon}\, \phi(w + \epsilon v)\big|_{\epsilon=0}$, for a fixed vector $v$.

Note that $\mathcal{R}\{\nabla f(w)\} = H(w)v$. Thus, by applying the differential operator to the forward and backward passes used to calculate the gradient, we can compute $H(w)v$ efficiently.

We extend this to $\mathcal{R}^2\{\cdot\} := \mathcal{R}\{\mathcal{R}\{\cdot\}\}$ by computing $\mathcal{R}^2\{x_k\}$ and $\mathcal{R}^2\{y_k\}$ during the forward pass and $\mathcal{R}^2\{\partial E/\partial y_k\}$, $\mathcal{R}^2\{\partial E/\partial x_k\}$, and $\mathcal{R}^2\{\partial E/\partial w_{kj}\}$ during the backward pass. Our formulas and their derivation can be found in Appendix A. Since $\mathcal{R}^2\{\nabla f(w)\} = \nabla(v^T H(w)\, v)$, this allows us to efficiently compute $\nabla\rho(w)$.
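The gradient of the regularization term needs the twice-applied directional derivative of the gradient. A finite-difference sketch (illustrative, not the paper's operator implementation) checks on a toy function that applying the directional derivative twice to $\nabla f$ yields $\nabla(v^T H v)$:

```python
import numpy as np

# For f(w) = sum(w_i^4) / 12: grad f = w^3 / 3 and H = diag(w_i^2), so
# grad(v^T H v) = 2 * v_i^2 * w_i. The second central difference of the
# gradient in direction v recovers exactly this quantity.
grad = lambda w: w**3 / 3.0

rng = np.random.default_rng(1)
w, v = rng.standard_normal(4), rng.standard_normal(4)

r = 1e-4
r2_grad = (grad(w + r * v) - 2 * grad(w) + grad(w - r * v)) / r**2
exact = 2 * v**2 * w                 # analytic grad(v^T diag(w^2) v)
print(np.allclose(r2_grad, exact, atol=1e-4))   # True
```
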

These methods keep the number of stored values at $O(n)$, whereas directly computing the Hessian and third-derivative tensor would require $O(n^2)$ and $O(n^3)$ storage (which for large networks is infeasible).

### 3.2 Algorithms

Here, we present two versions of our algorithm: an exact gradient descent algorithm (Algorithm 1) and a batch stochastic gradient descent algorithm (Algorithm 2). The first is ideal but impractical; hence, we need a more practical stochastic algorithm. For simplicity, we hide the dependence on $w_k$ (the value of the weights at iteration $k$) for many of the variables by defining $f_k := f(w_k)$, $h_k := h(w_k)$, $v_k := v(w_k)$, etc. The step size $\alpha_k$ is some predefined function of the iteration $k$.

We start with Algorithm 1, a gradient descent algorithm with exact values for the eigenvectors and spectral radius. In practice, these values are often difficult to compute exactly due to computational and storage constraints. For our stochastic algorithm, we relax this exact constraint and compute or approximate values based on a batch (rather than the full data set). We examine the convergence properties of Algorithm 1 in Section 4.1.

Building off the exact gradient descent algorithm, we develop a more practical stochastic variant. Algorithm 2 uses batch stochastic gradient descent (rather than gradient descent) and power iteration to compute the eigenvector (rather than an exact method). Due to the implementation of the $\mathcal{R}$ and $\mathcal{R}^2$ operators, the storage requirements are not extensive. Also, since the Hessian is symmetric, the power iteration converges at a rate proportional to the square of the ratio between the two largest eigenvalues, $(|\lambda_2|/|\lambda_1|)^2$, instead of the typical linear rate $|\lambda_2|/|\lambda_1|$. The convergence properties of Algorithm 2 are studied in Section 4.2.
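The eigenvector step of the batch algorithm can be sketched with a power iteration that touches the Hessian only through matrix-vector products; in a real network the product would be backed by the $\mathcal{R}$ operator, while here a small explicit matrix stands in:

```python
import numpy as np

def power_iteration(hvp, n, iters=200, seed=0):
    """Estimate the dominant eigenpair using only Hessian-vector products."""
    v = np.random.default_rng(seed).standard_normal(n)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        hv = hvp(v)
        v = hv / np.linalg.norm(hv)     # normalize to avoid overflow
    return v, v @ hvp(v)                # eigenvector, Rayleigh-quotient eigenvalue

# Stand-in Hessian (illustrative); its spectral radius is 5.
H = np.diag([5.0, -2.0, 1.0, 0.5])
v, lam = power_iteration(lambda u: H @ u, 4)
print(abs(lam))                          # ~ 5.0
```
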

## 4 Analysis

First, in Section 4.1, we prove that the exact gradient descent algorithm (Algorithm 1) converges to a critical point. Second, in Section 4.2, we show that the stochastic gradient descent algorithm (Algorithm 2) almost surely converges to a critical point.

### 4.1 Gradient Descent Convergence

Here, we show that Algorithm 1 converges to a critical point under some assumptions on the objective function and the learning rate. Specifically, we assume that $h$ is bounded from below, that $\nabla f$ is Lipschitz continuous with constant $L_1$, and that $f$ has a Lipschitz continuous third derivative tensor with constant $L_2$. The learning rate is sufficiently small; specifically, $\alpha_k \le \frac{1}{L_1 + \mu L_2}$, where $L_1 + \mu L_2$ is finite.

###### Theorem 4.1.

Given the above assumptions, Algorithm 1 converges to a critical point.

We use these assumptions and Taylor's theorem to show that $h_k$ decreases until it reaches a critical point. Since $h$ is bounded from below, this implies the algorithm converges. We provide more details in Appendix B.1.

### 4.2 Stochastic Gradient Descent Convergence

Here, we show that Algorithm 2 almost surely converges to a critical point, with some assumptions on the objective function $h$ and the learning rate $\alpha_k$.

#### 4.2.1 Assumptions

Assume that 1) $h$ is bounded from below (without loss of generality, $h \ge 0$); 2) the typical conditions on the learning rate hold:

$$\sum_{k=1}^\infty \alpha_k^2 < \infty, \qquad \sum_{k=1}^\infty \alpha_k = \infty;$$

3) the second, third, and fourth moments of the update term do not grow too quickly (in practice, this is usually true):

$$\mathbb{E}\big[\|p(w_k)\|^j\big] \le A_j + B_j\|w_k\|^j \quad \text{for } j = 2, 3, 4,$$

where $p(w_k)$ is the update term, an approximation of $\nabla h(w_k)$ computed on a single sample or batch of samples; 4) and outside a certain horizon, the gradient points towards the origin:

$$\inf_{\|w\|^2 \ge D} w^T \nabla h(w) > 0.$$

There are well-known tricks to ensure this assumption holds, such as adding a small linear term (which does not affect the Hessian part of our algorithm).
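For instance, the classic Robbins-Monro schedule $\alpha_k = a/(b + k)$ (an illustrative choice, not one prescribed by the paper) satisfies both learning-rate conditions in Assumption 2, which a quick numerical look suggests:

```python
# alpha_k = a / (b + k): the squared series converges while the series itself
# diverges (growing like log k), matching Assumption 2.
a, b = 1.0, 10.0
alphas = [a / (b + k) for k in range(1, 200001)]

sum_sq = sum(x * x for x in alphas)   # approaches a finite limit (< 0.1 here)
sum_lin = sum(alphas)                 # keeps growing without bound
print(sum_sq, sum_lin)
```
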

#### 4.2.2 Confinement and Convergence

First, we show that given our assumptions, the iterates are confined.

###### Lemma 4.1.

Given the Assumptions in 4.2.1, the iterates in Algorithm 2 are bounded.

We define a sequence $\tilde\psi_k$ that is a function of $\|w_k\|^2$ and show that the sum of its positive expected variations is finite. Then, we apply the Quasi-Martingale Convergence Theorem and show that, since the sequence converges almost surely, the norm of our weights is bounded. This also implies that all continuous functions of $w_k$ are bounded. Next, using our assumptions and Lemma 4.1, we prove almost sure convergence.

###### Theorem 4.2.

Given the Assumptions in 4.2.1, Algorithm 2 converges almost surely.

We use confinement of $w_k$ to show that the positive expected variations of $h_k$ between iterates are bounded by a constant times the learning rate squared. Using Assumption 2 and the Quasi-Martingale Convergence Theorem, we show that $h_k$ converges almost surely. Then, we show that $\nabla h_k$ almost surely converges to zero. Our proof is based on the proof by Bottou [1998] that stochastic gradient descent converges almost surely. We provide more detailed proofs of Lemma 4.1 and Theorem 4.2 in Appendix B.2.

#### 4.2.3 Power Iteration

Here, we show that with certain additional assumptions, the power iteration and its convergence criteria fit our earlier assumptions for Algorithm 2 to converge almost surely. We start by showing that the power iteration fits our bounds on the moments of the update term. 1) Assume:

$$\mathbb{E}\big[\|\nabla f_k\|^j\big] \le A^{(1)}_j + B^{(1)}_j\|w_k\|^j, \qquad \mathbb{E}\big[\,|\bar v_k^T \nabla H_k\, \bar v_k|^j\big] \le A^{(2)}_j + B^{(2)}_j\|w_k\|^j,$$

for $j = 2, 3, 4$, where $\bar v_k$ is the true unit eigenvector of $H_k$. 2) We also assume that the Hessian is Lipschitz continuous. 3) The power iteration algorithm converges to the dominant eigenvector with the following accuracy:

$$\|v_k - \bar v_k\| \le \varepsilon, \quad \forall\, k,$$

where $v_k$ is the computed eigenvector. 4) We also assume $\varepsilon \to 0$ as $k \to \infty$, as discussed later (see Lemma 4.3). 5) We define $\bar p_k := \nabla f_k + \mu_k \nabla g_k$, the update term computed with the true eigenvector $\bar v_k$, and assume:

$$\mathbb{E}\big[\|p_k - \bar p_k\|^j\big] \le \sigma^j \quad \text{for } j = 2, 3, 4.$$
###### Lemma 4.2.

Given the above assumptions,

$$\mathbb{E}\big[\|p_k\|^j\big] \le A_j + B_j\|w_k\|^j \quad \text{for } j = 2, 3, 4,$$

for some positive constants $A_j, B_j$.

We split $p_k$ into components in terms of $\nabla f_k$ and the regularization term. We use the above assumptions to bound each of these components, and then combine the results to show the lemma holds. Lastly, we discuss constraints on the power iteration accuracy $\varepsilon$ and show that $\varepsilon$ must decrease to 0. We additionally assume that $\bar p_k$ is unbiased, i.e., $\mathbb{E}[\bar p_k] = \nabla h_k$, and that $\varepsilon \to 0$ as $k \to \infty$.

###### Lemma 4.3.

Given the above assumptions, $\mathbb{E}[p_k] \to \nabla h_k$ as $k \to \infty$, where $\nabla h_k$ is the true gradient.

We split $p_k$ into components with respect to the true eigenvector $\bar v_k$ and our estimate $v_k$. Then, we take the expectation and bound it, showing that $\varepsilon \to 0$ is sufficient for the lemma to hold. We provide detailed proofs of Lemmas 4.2 and 4.3 in Appendix B.2.2.

## 5 Experiments

We tested our spectral radius regularization algorithm on the following data sets: forest cover type [Blackard and Dean, 1999], United States Postal Service (USPS) handwritten digits [LeCun et al., 1990], and chest X-ray [Wang et al., 2017]. The forest cover type data uses cartographic data to predict which of seven tree species is contained in a plot of land. The USPS digits data includes images from scanned envelopes, with the goal of identifying which digit, 0-9, each image corresponds to. The chest X-ray data uses images of patients' chest X-rays, with the goal of identifying which of fourteen lung diseases each patient has. We describe the data sets in further detail in Appendix C.1.

Additionally, we trained unregularized, Entropy-SGD [Chaudhari et al., 2016], K-FAC [Martens and Grosse, 2015], and asymmetric valley [He et al., 2019] models, which serve as baseline comparisons for our models.

### 5.1 Methodology

In order to test whether the models with a lower spectral radius generalized better than those with a higher spectral radius, we employed methods to create test sets that differ from the training set. These methods include covariate shifts, image augmentation techniques, and the introduction of new, different data. For each model, we measured the spectral radius $\rho$, estimated on a random batch of the training set.

For the forest cover type data, we weighted the test subjects in order to shift the mean of the features. Then, we compared the accuracy of our trained models, repeating this procedure one thousand times. These perturbations could simulate test conditions with poor measurements and/or changes in climate.
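A minimal sketch of this kind of covariate-shift test, assuming a simple exponential-tilting weighting scheme (illustrative; the paper's exact weighting may differ):

```python
import numpy as np

# Reweight test points so a feature's weighted mean shifts away from the
# training distribution, then measure weighted accuracy under the shift.
rng = np.random.default_rng(2)
x = rng.standard_normal(10000)            # one standardized feature of a test set
correct = rng.random(10000) < 0.8         # per-sample correctness of some model

t = 0.5                                   # tilt strength: larger t, larger shift
wts = np.exp(t * x)
wts /= wts.sum()                          # self-normalized importance weights

shifted_mean = np.sum(wts * x)            # weighted feature mean, shifted toward t
weighted_acc = np.sum(wts * correct)      # accuracy under the shifted distribution
print(shifted_mean, weighted_acc)
```
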

For the USPS handwritten digits data, we augmented the test set using random crops and rotations. These crops and rotations could simulate test conditions where digits are written on angles or poorly scanned.
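A minimal sketch of such a crop augmentation in plain NumPy (illustrative parameters; the paper's exact augmentations may differ), which pads a 16x16 digit image and takes a randomly offset crop of the original size:

```python
import numpy as np

def random_crop(img, pad=2, rng=np.random.default_rng(3)):
    """Zero-pad the image, then crop a randomly offset window of the same size."""
    padded = np.pad(img, pad)                      # zero-pad all four sides
    dy, dx = rng.integers(0, 2 * pad + 1, size=2)  # random crop offsets
    return padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]

img = np.arange(256, dtype=float).reshape(16, 16)  # stand-in USPS digit image
out = random_crop(img)
print(out.shape)   # (16, 16); same size, content shifted
```

Random rotations would be composed with the crops in the same fashion to build the augmented test sets.
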

For the chest X-ray data, we compared performance on two similar data sets, CheXpert [Irvin et al., 2019] and MIMIC-CXR [Johnson et al., 2019] (using the six conditions common to the three data sets). We kept the labeled training and validation data sets separate for each data set, as there are differences in labeling. As this new chest X-ray data contains patients with conditions not present in the training data, this tests how well the trained models perform on a larger segment of the population.

We give a more detailed description of our methodology in Appendix C.2.

### 5.2 Results

We trained a feed-forward neural network on the forest cover type data using different regularization parameters $\mu$ and $K$, and compared its accuracy on the randomly shifted test sets to that of the unregularized, Entropy-SGD, K-FAC, and asymmetric valley models. Figure 1 shows that, as the norm of the shifts increases, the regularized models and most comparison models retain a benefit over the unregularized model, though the K-FAC model shows no significant benefit. However, there is variability across the different shifts. While these plots help visualize our comparisons, Table 2 gives a more detailed picture: while the unregularized model is more accurate on the unshifted test data, it performs significantly worse as the norm of the shifts increases. On the other hand, the regularized, Entropy-SGD, and asymmetric valley models perform about the same regardless of these shifts. The slope of the trend line comparing accuracy to the norm of the shifts is not significantly different from 0 for the regularized models, but is significantly negative for the unregularized model (using a p-value of 0.05). While there are some differences between the various regularized models (there is some delineation between the more and less strictly regularized ones), all perform better than the unregularized model. The regularized models have significantly larger slopes than the unregularized model, and the order of the slopes (from high to low) follows the models' regularization strictness. Additionally, our spectral radius measure largely follows the regularization strictness. In terms of the regression slope, our most strictly regularized models are the best and third-best performing models. Entropy-SGD is the second-best performing model, although the difference between it and the prior two models is insignificant.

We trained a convolutional neural network on the USPS data with no regularization; with the Entropy-SGD, K-FAC, and asymmetric valley methods; and with various values of the regularization parameters $\mu$ and $K$, comparing accuracy on both the test and augmented test data sets. Per Table 3, while the models performed comparably on the test data (all had an accuracy of 94.47-95.91%), our regularized models performed significantly better than the unregularized model on both augmented test data sets (87.54-90.93% vs. 85.50% on Augmented Test 1; 66.17-69.61% vs. 63.43% on Augmented Test 2). On all test sets, the same regularized model was the most accurate. Of the comparison models, the asymmetric valleys model performed best, only 1.5% and 2.6% worse than our best model on the augmented test sets. The Entropy-SGD model had accuracy comparable to our regularized models on Augmented Test 1 (only 1.9% worse than our best model), but it performed worse on the trickier Augmented Test 2 (6.5% worse than our best model, though still better than the unregularized model). The K-FAC model performed worse on both augmented test sets: 5.3% worse than our best model on Augmented Test 1 and 11.0% worse on Augmented Test 2.

For the chest X-ray comparisons, we started with the Zoogzog implementation of CheXNet [Rajpurkar et al., 2017], a 121-layer DenseNet trained on the chest X-ray data set, as our baseline. Using this pre-trained model as the initialization, we trained for an additional epoch with our spectral radius regularization method, comparing the mean area under the curve (AUC) of the receiver operating characteristic curve over the 14 classes (or 6 for our comparison tests). Similarly, we used this initialization and trained for one additional epoch using the Entropy-SGD and K-FAC optimization methods. The results are displayed in Table 4. Since we measured the spectral radius on a random batch, our batch size of four was too small to yield a meaningful estimate for these models. Also, our strictest regularized model appeared to be too strictly regularized, as its test mean AUC is significantly worse than the other models'. While the other regularized models performed somewhat worse in terms of test mean AUC, they performed as well as or better than the baseline on most of the CheXpert and MIMIC-CXR data sets, with the second-most strictly regularized model performing the best on these four test sets. In fact, it outperformed the original model by 4.4% and 3.8%, respectively, on the CheXpert and MIMIC-CXR "validation" sets, and by 0.6% and 2.1% on the "training" sets, implying the regularized model generalizes better than the original model. K-FAC slightly outperformed this model on the MIMIC-CXR training set (by 0.4%), but performed up to 4.0% worse on the other sets. The Entropy-SGD model performed 5.9-10.2% worse than our regularized model on the CheXpert and MIMIC-CXR sets.

## 6 Conclusion

We created algorithms for the regularized optimization of machine learning models in order to find flat minima, and we developed tools for calculating the regularization term and its gradient for neural networks. We proved that these methods converge (exactly or almost surely) to a critical point. Finally, we showed that the regularization works on a range of applicable problems, which required us to design different methods of testing generalizability for each problem.

## References

• N. P. van der Aa, H. G. ter Morsche, and R. M. M. Mattheij (2007) Computation of eigenvalue and eigenvector derivatives for a general complex-valued eigensystem. Electronic Journal of Linear Algebra 16, pp. 300–314. External Links: ISSN 1081-3810 Cited by: §3.1.
• J. A. Blackard and D. J. Dean (1999) Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture 24, pp. 131–151. Cited by: §5.
• L. Bottou (1998) Online learning and stochastic approximations. In Online Learning in Neural Networks, D. Saad (Ed.), Cambridge University Press. Cited by: §4.2.
• P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. T. Chayes, L. Sagun, and R. Zecchina (2016) Entropy-sgd: biasing gradient descent into wide valleys. CoRR abs/1611.01838. External Links: Link, 1611.01838 Cited by: §2.
• H. He, G. Huang, and Y. Yuan (2019) Asymmetric valleys: beyond sharp and flat local minima. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32, pp. 2553–2564. External Links: Link Cited by: §2.
• J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. L. Ball, K. S. Shpanskaya, J. Seekins, D. A. Mong, S. S. Halabi, J. K. Sandberg, R. Jones, D. B. Larson, C. P. Langlotz, B. N. Patel, M. P. Lungren, and A. Y. Ng (2019) CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. CoRR abs/1901.07031. External Links: Link Cited by: §C.2, §5.1.
• S. Jastrzebski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. Storkey (2018) Finding flatter minima with SGD. External Links: Link Cited by: §2.
• A. E. W. Johnson, T. J. Pollard, S. J. Berkowitz, N. R. Greenbaum, M. P. Lungren, C. Deng, R. G. Mark, and S. Horng (2019) MIMIC-CXR: A large publicly available database of labeled chest radiographs. CoRR abs/1901.07042. External Links: Link Cited by: §C.2, §5.1.
• N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang (2016) On large-batch training for deep learning: generalization gap and sharp minima. Cited by: §2.
• Y. LeCun, O. Matan, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel, and H. S. Baird (1990) Handwritten zip code recognition with multilayer networks. In Proceedings - International Conference on Pattern Recognition, Vol. 2, pp. 35–40. Cited by: §5.
• L. Ma, G. Montague, J. Ye, Z. Yao, A. Gholami, K. Keutzer, and M. W. Mahoney (2019) Inefficiency of K-FAC for large batch size training. CoRR abs/1903.06237. External Links: Link, 1903.06237 Cited by: §2.
• J. Martens and R. B. Grosse (2015) Optimizing neural networks with kronecker-factored approximate curvature. CoRR abs/1503.05671. External Links: Link, 1503.05671 Cited by: §2.
• B. A. Pearlmutter (1994) Fast exact multiplication by the hessian. Neural Computation 6, pp. 147–160. Cited by: Appendix A.
• P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Y. Ding, A. Bagul, C. Langlotz, K. S. Shpanskaya, M. P. Lungren, and A. Y. Ng (2017) CheXNet: radiologist-level pneumonia detection on chest x-rays with deep learning. CoRR abs/1711.05225. External Links: Link Cited by: §5.2.
• X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. Summers (2017) ChestX-ray8: hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3462–3471. Cited by: §5.
• Z. Yao, A. Gholami, Q. Lei, K. Keutzer, and M. W. Mahoney (2018) Hessian-based analysis of large batch training and robustness to adversaries. External Links: 1802.08241 Cited by: §2.
• Y. Yoshida and T. Miyato (2017) Spectral norm regularization for improving the generalizability of deep learning. External Links: 1705.10941 Cited by: §2.

## Appendix A Hessian-Vector Operations Derivation

The forward computation for each layer of a network, with input $x_k$, output $y_k$, weights $w_{jk}$, activation $\sigma_k$, bias/input term $I_k$, error or objective measure $E$, and direct derivative $e_k(y_k)$, is given by:

$$x_k = \sum_j w_{jk}\, y_j, \qquad y_k = \sigma_k(x_k) + I_k.$$

The backward computation:

$$\frac{\partial E}{\partial y_k} = e_k(y_k) + \sum_j w_{kj} \frac{\partial E}{\partial x_j}, \qquad \frac{\partial E}{\partial x_k} = \sigma'_k(x_k) \frac{\partial E}{\partial y_k}, \qquad \frac{\partial E}{\partial w_{kj}} = y_k \frac{\partial E}{\partial x_j}.$$

Applying $\mathcal{R}$ to the forward pass:

$$\mathcal{R}\{x_k\} = \sum_j \big(w_{jk}\mathcal{R}\{y_j\} + v_{jk}\, y_j\big), \qquad \mathcal{R}\{y_k\} = \mathcal{R}\{x_k\}\, \sigma'_k(x_k).$$

The backward computation follows as:

$$\mathcal{R}\Big\{\frac{\partial E}{\partial y_k}\Big\} = e'_k(y_k)\mathcal{R}\{y_k\} + \sum_j \Big[w_{kj}\mathcal{R}\Big\{\frac{\partial E}{\partial x_j}\Big\} + v_{kj}\frac{\partial E}{\partial x_j}\Big],$$

$$\mathcal{R}\Big\{\frac{\partial E}{\partial x_k}\Big\} = \sigma'_k(x_k)\mathcal{R}\Big\{\frac{\partial E}{\partial y_k}\Big\} + \mathcal{R}\{x_k\}\, \sigma''_k(x_k)\frac{\partial E}{\partial y_k},$$

$$\mathcal{R}\Big\{\frac{\partial E}{\partial w_{kj}}\Big\} = y_k\mathcal{R}\Big\{\frac{\partial E}{\partial x_j}\Big\} + \mathcal{R}\{y_k\}\frac{\partial E}{\partial x_j}.$$

This yields the result found in Pearlmutter [1994]. However, we extend it one step further by applying $\mathcal{R}$ again, i.e., applying $\mathcal{R}^2$ to the original forward pass:

$$\mathcal{R}^2\{x_k\} = \sum_j \big(w_{jk}\mathcal{R}^2\{y_j\} + 2v_{jk}\mathcal{R}\{y_j\}\big), \qquad \mathcal{R}^2\{y_k\} = \mathcal{R}^2\{x_k\}\, \sigma'_k(x_k) + (\mathcal{R}\{x_k\})^2 \sigma''_k(x_k).$$

The backward computation follows as:

$$\mathcal{R}^2\Big\{\frac{\partial E}{\partial y_k}\Big\} = e''_k(y_k)(\mathcal{R}\{y_k\})^2 + e'_k(y_k)\mathcal{R}^2\{y_k\} + \sum_j \Big[w_{kj}\mathcal{R}^2\Big\{\frac{\partial E}{\partial x_j}\Big\} + 2v_{kj}\mathcal{R}\Big\{\frac{\partial E}{\partial x_j}\Big\}\Big],$$

$$\mathcal{R}^2\Big\{\frac{\partial E}{\partial x_k}\Big\} = 2\mathcal{R}\{x_k\}\, \sigma''_k(x_k)\mathcal{R}\Big\{\frac{\partial E}{\partial y_k}\Big\} + \sigma'_k(x_k)\mathcal{R}^2\Big\{\frac{\partial E}{\partial y_k}\Big\} + \mathcal{R}^2\{x_k\}\, \sigma''_k(x_k)\frac{\partial E}{\partial y_k} + (\mathcal{R}\{x_k\})^2 \sigma'''_k(x_k)\frac{\partial E}{\partial y_k},$$

$$\mathcal{R}^2\Big\{\frac{\partial E}{\partial w_{kj}}\Big\} = 2\mathcal{R}\{y_k\}\mathcal{R}\Big\{\frac{\partial E}{\partial x_j}\Big\} + y_k\mathcal{R}^2\Big\{\frac{\partial E}{\partial x_j}\Big\} + \mathcal{R}^2\{y_k\}\frac{\partial E}{\partial x_j}.$$

The original formulation allows us to efficiently compute $H(w)v$, which can be used to compute and/or estimate the eigenvector corresponding to the spectral radius (via power iteration), while the extended formulation allows us to efficiently compute $\mathcal{R}^2\{\nabla E\} = \nabla(v^T H(w)\, v)$ and thus $\nabla\rho(w)$. This enables us to efficiently compute the gradient of our optimization problem for use in gradient descent methods.

## Appendix B Analysis Proofs

Proof that the gradient descent algorithm converges (Theorem 4.1).

###### Proof.

Since $\nabla f$ is Lipschitz continuous with constant $L_1$, $\|H(\omega)\| \le L_1$ for all $\omega$. And similarly, since the third derivative tensor of $f$ is Lipschitz continuous with constant $L_2$, $\|v^T \nabla^2 H(\omega)\, v\| \le L_2$ for unit vectors $v$. We define $h_k := f(w_k) + \mu\, v_k^T H(w_k)\, v_k$.

Since our update step is $w_{k+1} = w_k - \alpha_k \nabla h_k$, by applying Taylor's Theorem:

$$h_{k+1} = f(w_k - \alpha_k \nabla h_k) + \mu\, v_k^T H(w_k - \alpha_k \nabla h_k)\, v_k = h_k - \alpha_k \nabla h_k^T \nabla h_k + \frac{\alpha_k^2}{2} \nabla h_k^T \big[H(\omega) + \mu\, v(\omega)^T \nabla^2 H(\omega)\, v(\omega)\big] \nabla h_k,$$

for some $\omega$ on the segment between $w_k$ and $w_{k+1}$, where $v(\omega)$ is the corresponding eigenvector. Applying the inequality derived from Lipschitz continuity:

$$h_{k+1} \le h_k - \alpha_k\|\nabla h_k\|^2 + \frac{\alpha_k^2 (L_1 + \mu L_2)}{2}\|\nabla h_k\|^2.$$

Given our assumption $\alpha_k \le \frac{1}{L_1 + \mu L_2}$,

$$\left(\frac{\alpha_k (L_1 + \mu L_2)}{2} - 1\right)\alpha_k \le \left(\frac{1}{2} - 1\right)\alpha_k = -\frac{\alpha_k}{2}.$$

Thus:

$$h_{k+1} \le h_k - \frac{\alpha_k}{2}\|\nabla h_k\|^2.$$

Either $\|\nabla h_k\| = 0$ or $\|\nabla h_k\| > 0$. In the first case, the algorithm has converged to a critical point; in the second, this step decreases the value of the objective function. Since the objective function is bounded from below, it cannot decrease in perpetuum. Thus, it must eventually converge to a critical point. ∎

### B.2 Stochastic Gradient Descent Convergence

#### B.2.1 Confinement and Convergence

Proof of confinement (Lemma 4.1). We first define (Definition B.1):

$$\varphi(x) := \max\{0,\, x - D\}^2 \quad \text{and} \quad \psi_k := \varphi(\|w_k\|^2).$$

###### Proof.

Definition B.1 implies that:

$$\varphi(\omega) - \varphi(w) \le (\omega - w)\varphi'(w) + (\omega - w)^2,$$

for all $\omega, w \in \mathbb{R}$. Note that this becomes an equality when $\omega, w \ge D$.

Applying this to $\psi_k$, using $\|w_{k+1}\|^2 = \|w_k\|^2 - 2\alpha_k w_k^T p_k + \alpha_k^2\|p_k\|^2$,

$$\psi_{k+1} - \psi_k \le \big(-2\alpha_k w_k^T p_k + \alpha_k^2\|p_k\|^2\big)\varphi'(\|w_k\|^2) + 4\alpha_k^2 (w_k^T p_k)^2 - 4\alpha_k^3\, w_k^T p_k\|p_k\|^2 + \alpha_k^4\|p_k\|^4.$$

By the Cauchy-Schwarz inequality,

$$\psi_{k+1} - \psi_k \le -2\alpha_k w_k^T p_k\, \varphi'(\|w_k\|^2) + \alpha_k^2\|p_k\|^2\varphi'(\|w_k\|^2) + 4\alpha_k^2\|w_k\|^2\|p_k\|^2 - 4\alpha_k^3\|w_k\|\|p_k\|^3 + \alpha_k^4\|p_k\|^4.$$

Taking the expectation,

$$\mathbb{E}[\psi_{k+1} - \psi_k] \le -2\alpha_k w_k^T \nabla h_k\, \varphi'(\|w_k\|^2) + \alpha_k^2\mathbb{E}\big[\|p_k\|^2\big]\varphi'(\|w_k\|^2) + 4\alpha_k^2\|w_k\|^2\mathbb{E}\big[\|p_k\|^2\big] - 4\alpha_k^3\|w_k\|\mathbb{E}\big[\|p_k\|^3\big] + \alpha_k^4\mathbb{E}\big[\|p_k\|^4\big].$$

Due to Assumption 3, there exist positive constants $A_0, B_0$ such that

$$\mathbb{E}[\psi_{k+1} - \psi_k] \le -2\alpha_k w_k^T \nabla h_k\, \varphi'(\|w_k\|^2) + \alpha_k^2\big(A_0 + B_0\|w_k\|^4\big),$$

and thus there exist positive constants $A, B$ such that

$$\mathbb{E}[\psi_{k+1} - \psi_k] \le -2\alpha_k w_k^T \nabla h_k\, \varphi'(\|w_k\|^2) + \alpha_k^2(A + B\psi_k).$$

If $\|w_k\|^2 < D$, then $\varphi'(\|w_k\|^2) = 0$ and the first term on the right-hand side is zero. And if $\|w_k\|^2 \ge D$, then by Assumption 4 the first term on the right-hand side is negative. Therefore,

$$\mathbb{E}[\psi_{k+1} - \psi_k] \le \alpha_k^2(A + B\psi_k).$$

We then transform the expectation inequality to

$$\mathbb{E}\big[\psi_{k+1} - (1 + \alpha_k^2 B)\psi_k\big] \le \alpha_k^2 A.$$
###### Definition B.2.

We define the sequences $\phi_k$ and $\tilde\psi_k$ as follows:

$$\phi_k := \prod_{i=1}^{k-1} \frac{1}{1 + \alpha_i^2 B} \quad \text{and} \quad \tilde\psi_k := \phi_k \psi_k.$$

Note that $\phi_k$ converges to a strictly positive limit $\phi_\infty$ (this can be shown by writing $\log \phi_k = -\sum_{i=1}^{k-1}\log(1 + \alpha_i^2 B)$ and using the condition on the sum of the squared learning rates). By substituting these sequences into the above inequality, we obtain

$$\mathbb{E}\big[\tilde\psi_{k+1} - \tilde\psi_k\big] \le \alpha_k^2 \phi_k A.$$

By defining $\delta_k(u) := \max\{0,\, \mathbb{E}[u_{k+1} - u_k \mid \mathcal{F}_k]\}$ for a process $u$ (where $\mathcal{F}_k$ is the filtration up to iteration $k$), we can bound the positive expected variations of $\tilde\psi$:

$$\mathbb{E}\big[\delta_k(\tilde\psi)\big] \le \alpha_k^2 \phi_k A.$$

Due to Assumption 2, the sum of these expectations is finite. The Quasi-Martingale Convergence Theorem states that if the sum of the positive expected variations of a stochastic process is finite, then the process converges almost surely. This implies that $\tilde\psi_k$ converges almost surely. And, since $\phi_k$ converges to $\phi_\infty > 0$, $\psi_k$ converges almost surely.

Assume $\psi_k$ converges to a value $\psi_\infty > 0$. Then, for $k$ sufficiently large, $\|w_k\|^2 > D$. This implies that the above inequality is an equality, which then implies

$$\sum_{k=1}^\infty \alpha_k\, w_k^T \nabla h_k\, \varphi'(\|w_k\|^2) < \infty \quad \text{a.s.}$$

But since $\sum_k \alpha_k = \infty$ and $w_k^T \nabla h_k\, \varphi'(\|w_k\|^2)$ is bounded away from zero (for $k$ sufficiently large), this result is not compatible with Assumption 4. Because of this contradiction, we must conclude that $\psi_k$ converges to 0.

Since $\psi_k$ converges to 0, the norm $\|w_k\|$ and the parameters $w_k$ are bounded. This also means that all continuous functions of $w_k$ are likewise bounded. ∎

Proof that SGD converges almost surely (Theorem 4.2):

###### Proof.

We can bound the variations of the loss criterion using a first-order Taylor expansion, bounding the second derivatives with a constant $K_1$:

$$\big|h_{k+1} - h_k + 2\alpha_k p_k^T \nabla h_k\big| \le \alpha_k^2\|p_k\|^2 K_1 \quad \text{a.s.},$$

which can be rewritten as:

$$h_{k+1} - h_k \le -2\alpha_k p_k^T \nabla h_k + \alpha_k^2\|p_k\|^2 K_1 \quad \text{a.s.}$$

Taking the expectation,

$$\mathbb{E}[h_{k+1} - h_k] \le -2\alpha_k\|\nabla h_k\|^2 + \alpha_k^2\mathbb{E}\big[\|p_k\|^2\big] K_1.$$

Bounding the expectation $\mathbb{E}\big[\|p_k\|^2\big] \le K_2$ (possible since the iterates are confined) yields

$$\mathbb{E}[h_{k+1} - h_k] \le \alpha_k^2 K_1 K_2.$$

Therefore, the positive expected variations are bounded by

$$\mathbb{E}[\delta_k(h)] \le \alpha_k^2 K_1 K_2.$$

By the Quasi-Martingale Convergence Theorem, $h_k$ converges almost surely:

$$h_k \xrightarrow[k \to \infty]{\text{a.s.}} h_\infty.$$

Additionally, taking the above and summing over $k$ implies the convergence of the following series:

$$\sum_{k=1}^\infty \alpha_k\|\nabla h_k\|^2 < \infty \quad \text{a.s.}$$

We define $\theta_k := \|\nabla h_k\|^2$, the variations of which are bounded using the Taylor expansion, similarly to the variations of $h$:

$$\theta_{k+1} - \theta_k \le -2\alpha_k p_k^T \nabla^2 h_k\, \nabla h_k + \alpha_k^2\|p_k\|^2 K_3 \quad \text{a.s.},$$

for some constant $K_3$. Taking the expectation and bounding the second derivative by $K_4$ yields

$$\mathbb{E}[\theta_{k+1} - \theta_k] \le \alpha_k\|\nabla h_k\|^2 K_4 + \alpha_k^2 K_2 K_3.$$

The positive expected variations are bounded:

$$\mathbb{E}[\delta_k(\theta)] \le \alpha_k\|\nabla h_k\|^2 K_4 + \alpha_k^2 K_2 K_3.$$

Since the terms on the right-hand side are summands of convergent infinite series (by the above and Assumption 2), by the Quasi-Martingale Convergence Theorem, $\theta_k$ converges almost surely. And since the series above converges almost surely, the limit must be zero:

$$\theta_k \xrightarrow[k \to \infty]{\text{a.s.}} 0 \quad \text{and} \quad \nabla h_k \xrightarrow[k \to \infty]{\text{a.s.}} 0. \qquad ∎$$

#### B.2.2 Power Iteration

Proof that power iteration follows the assumed update steps (Lemma 4.2):

###### Proof.

For ease, we define $\nabla g_k := \bar v_k^T \nabla H_k\, \bar v_k$, so that $\bar p_k = \nabla f_k + \mu_k \nabla g_k$. We begin by splitting $p_k$ into its components:

$$\|p_k\|^2 = \|p_k - \bar p_k + \bar p_k\|^2 = \|p_k - \bar p_k\|^2 + \|\bar p_k\|^2 + 2(p_k - \bar p_k)^T \bar p_k.$$

By the definition of $\bar p_k$ and the triangle inequality,

$$\|\bar p_k\|^2 \le \|\nabla f_k\|^2 + 2\mu_k\|\nabla f_k\|\|\nabla g_k\| + \mu_k^2\|\nabla g_k\|^2.$$

Taking the expectation and applying the Cauchy-Schwarz inequality yields:

$$\mathbb{E}\big[\|p_k\|^2\big] \le S_2 + \mathbb{E}\big[\|\nabla f_k\|^2\big] + 2\mu_k\big(\mathbb{E}\big[\|\nabla f_k\|^2\big]\big)^{\frac12}\big(\mathbb{E}\big[\|\nabla g_k\|^2\big]\big)^{\frac12} + \mu_k^2\mathbb{E}\big[\|\nabla g_k\|^2\big],$$

where $S_2$ collects the $\|p_k - \bar p_k\|$ terms bounded via Assumption 5. Note that, by Assumption 1, the right-hand side is bounded by $A_2 + B_2\|w_k\|^2$ for suitable positive constants. Similarly,

 E[||pk||3]≤ S3+E[||∇fk||3] +3μk(E[||∇fk||3])23(