# Backpropagation-Friendly Eigendecomposition

Eigendecomposition (ED) is widely used in deep networks. However, the backpropagation of its results tends to be numerically unstable, whether using ED directly or approximating it with the Power Iteration method, particularly when dealing with large matrices. While this can be mitigated by partitioning the data into small and arbitrary groups, doing so has no theoretical basis and makes it impossible to exploit the full power of ED. In this paper, we introduce a numerically stable and differentiable approach to leveraging eigenvectors in deep networks. It can handle large matrices without requiring them to be split. We demonstrate the better robustness of our approach over standard ED and PI for ZCA whitening, an alternative to batch normalization, and for PCA denoising, which we introduce as a new normalization strategy for deep networks, aiming to further denoise the network's features.


## 1 Introduction

In recent years, Eigendecomposition (ED) has been incorporated in deep learning algorithms to perform face recognition [1], PCA whitening [2], ZCA whitening [3], second-order pooling [4], generative modeling [5], keypoint detection and matching [6, 7, 8], pose estimation [9], camera self-calibration [10], and graph matching [11]. This requires backpropagating the loss through the ED, which can be done but is often unstable. This is because, as shown in [4], the partial derivatives of an ED-based loss depend on a matrix \tilde{K} with elements

\tilde{K}_{ij} = \begin{cases} \frac{1}{\lambda_i - \lambda_j}, & i \neq j \\ 0, & i = j \end{cases} \qquad (1)

where \lambda_i denotes the i-th eigenvalue of the matrix being processed. Thus, when two eigenvalues are very close, the partial derivatives become very large, causing arithmetic overflow.
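This instability is easy to reproduce. The following NumPy sketch (ours, for illustration only) builds the matrix of Eq. 1 from a vector of eigenvalues and shows its entries exploding as two eigenvalues approach each other:

```python
import numpy as np

def K_tilde(lams):
    """Build the matrix of Eq. 1 from a vector of eigenvalues."""
    n = len(lams)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                # blows up when lams[i] is close to lams[j]
                K[i, j] = 1.0 / (lams[i] - lams[j])
    return K
```

With well-separated eigenvalues such as [3, 1], the largest entry is 0.5; with [1, 1 + 1e-8] it is on the order of 1e8, which is precisely what causes overflow during backpropagation.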

The Power Iteration (PI) method [12] is one way around this problem. In its standard form, PI relies on an iterative procedure to approximate the dominant eigenvector of a matrix starting from an initial estimate of this vector. This has been successfully used for graph matching [11] and spectral normalization in generative models [5], which only require the largest eigenvalue. In theory, PI can be used in conjunction with a deflation procedure [13] to find all eigenvalues and eigenvectors. This involves computing the dominant eigenvector, removing the projection of the input matrix on this vector, and iterating. Unfortunately, two eigenvalues being close to each other or one being close to zero can trigger large round-off errors that eventually accumulate and result in inaccurate eigenvector estimates. The results are also sensitive to the number of iterations and to how the vector is initialized at the start of each deflation step. Finally, the convergence speed decreases significantly when the ratio between the dominant eigenvalue and the others becomes close to one.
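As a concrete illustration, the PI-plus-deflation scheme described above can be written in a few lines of NumPy (an illustrative sketch, not the implementation used here):

```python
import numpy as np

def power_iteration(M, num_iters=100, seed=0):
    """Approximate the dominant eigenpair of a symmetric matrix M."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(M.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = M @ v
        v /= np.linalg.norm(v)
    return v @ M @ v, v  # Rayleigh quotient and eigenvector estimate

def eigh_by_deflation(M, num_iters=100):
    """Estimate all eigenpairs of a symmetric PSD matrix by PI + deflation."""
    lams, vecs = [], []
    R = M.copy()
    for _ in range(M.shape[0]):
        lam, v = power_iteration(R, num_iters)
        lams.append(lam)
        vecs.append(v)
        R = R - lam * np.outer(v, v)  # remove the projection on v, then iterate
    return np.array(lams), np.stack(vecs, axis=1)
```

When the eigenvalues are well separated this converges quickly, but, as noted above, round-off errors accumulate across deflation steps when eigenvalues are close to each other or to zero.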

In short, both SVD [4] and PI [11] are unsuitable for use in a deep network that requires the computation of gradients to perform back-propagation. This is particularly true when dealing with large matrices, for which the chances of two eigenvalues being almost equal are higher than for small ones. This is why another popular way around these difficulties is to use smaller matrices, for example by splitting the feature channels into smaller groups before computing covariance matrices, as in [3] for ZCA whitening [14, 15]. This, however, imposes arbitrary constraints on learning that are not theoretically justified and may degrade performance.

In this paper, we therefore introduce a numerically stable and differentiable approach to performing ED within deep networks in such a way that their training is robust. To this end, we leverage the fact that the forward pass of ED is stable and yields accurate eigenvector estimates. As the aforementioned problems come from the backward pass, once the forward pass is complete we rely on PI for backpropagation purposes, leveraging the ED results for initialization. We will show that our hybrid training procedure consistently outperforms both ED and PI in terms of stability for

• ZCA whitening [3]: An alternative to Batch Normalization [16], which involves linearly transforming the feature vectors so that their covariance matrix becomes the identity matrix and they can be considered as decoupled.

• PCA denoising [17]: A transformation of the data that reduces the dimensionality of the input features by projecting them onto a subspace spanned by a subset of the eigenvectors of their covariance matrix, and then projects them back to the original space. Here, we introduce PCA denoising as a new normalization strategy for deep networks, aiming to remove the irrelevant signal from their feature maps.

Exploiting the full power of both these techniques requires performing ED on relatively large matrices and back-propagating the results, which our technique allows whereas competing ones tend to fail.

## 2 Numerically Stable Differentiable Eigendecomposition

Given an input matrix M, there are two standard ways to exploit the ED of M within a deep network:

1. Perform ED using SVD or QR decomposition and use analytical gradients for backpropagation.

2. Given randomly-chosen initial guesses for the eigenvectors, run a PI deflation procedure during the forward pass and compute the corresponding gradients for backpropagation purposes.

Unfortunately, as discussed above, the first option is prone to gradient explosion when two or more eigenvalues are close to each other while the accuracy of the second strongly depends on the initial vectors and on the number of iterations in each step of the deflation procedure. This can be problematic because the PI convergence rate depends geometrically on the ratio of the two largest eigenvalues. Therefore, when this ratio is close to 1, the eigenvector estimate may be inaccurate when performing a limited number of iterations. This can lead to training divergence and eventual failure, as we will see in the results section.

Our solution is to rely on the following hybrid strategy:

1. Use SVD during the forward pass because, by relying on a divide-and-conquer strategy, it tends to be numerically more stable than QR decomposition [12].

2. Compute the gradients for backpropagation from the PI derivations, but using the SVD-computed vectors for initialization purposes.

In the remainder of this section, we show that the resulting PI gradients not only converge to the analytical ED ones, but are bounded from above by a factor depending on the number of PI iterations, thus preventing their explosion in practice. In the results section, we will empirically confirm this by showing that training an ED-based deep network using our approach consistently converges, whereas the standard SVD and PI algorithms often diverge.
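A minimal NumPy sketch of this hybrid strategy (the class name and interface are ours, for illustration; `K = 19` is an arbitrary choice here, and `grad_v1` stands for a hypothetical upstream gradient w.r.t. the dominant eigenvector):

```python
import numpy as np

class StableEDLayer:
    """Sketch of the hybrid scheme: exact eigendecomposition in the forward
    pass, power-iteration-form gradients in the backward pass."""

    def forward(self, M):
        lam, V = np.linalg.eigh(M)            # stable forward pass
        order = np.argsort(lam)[::-1]         # sort eigenvalues in decreasing order
        self.M, self.lam, self.V = M, lam[order], V[:, order]
        return self.lam, self.V

    def backward_v1(self, grad_v1, K=19):
        """Gradient of the loss w.r.t. M, given the upstream gradient w.r.t.
        the dominant eigenvector, accumulated over K unrolled PI steps."""
        v1, lam1 = self.V[:, 0], self.lam[0]
        P = np.eye(len(v1)) - np.outer(v1, v1)    # I - v1 v1^T
        term = P / lam1                           # first term of the sum
        G = term.copy()
        for _ in range(K - 1):
            term = self.M @ term / lam1           # next term of the sum
            G = G + term
        return G @ np.outer(grad_v1, v1)
```

Even for a matrix with two nearly equal eigenvalues, where the analytical gradients would explode, this backward pass stays finite and bounded, which is the point of the bound derived below.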

Let M be a covariance matrix, which is therefore positive semi-definite and symmetric. We now focus on the leading eigenvector of M. Since the deflation procedure simply iteratively removes the projection of M on its leading eigenvector, the following derivations remain valid at any step of this procedure. To compute the leading eigenvector of M, PI relies on the iterative update

v^{(k)} = \frac{M v^{(k-1)}}{\| M v^{(k-1)} \|} \qquad (2)

where \|\cdot\| denotes the \ell_2 norm. The PI gradients can then be computed as [18]

\frac{\partial L}{\partial M} = \sum_{k=0}^{K-1} \frac{\left(I - v^{(k+1)} v^{(k+1)\top}\right)}{\left\|M v^{(k)}\right\|} \frac{\partial L}{\partial v^{(k+1)}} v^{(k)\top}, \qquad \frac{\partial L}{\partial v^{(k)}} = \frac{M \left(I - v^{(k+1)} v^{(k+1)\top}\right)}{\left\|M v^{(k)}\right\|} \frac{\partial L}{\partial v^{(k+1)}}. \qquad (3)

Typically, to initialize PI, v^{(0)} is taken as a random vector such that \|v^{(0)}\| = 1. Here, however, we rely on SVD to compute the true eigenvector v. Because v is an accurate estimate of the eigenvector, feeding it as the initial value in PI yields v^{(0)} = v^{(1)} = \cdots = v^{(K)} = v. Exploiting this in Eq. 3 and introducing the explicit form of \partial L / \partial v^{(k)} lets us write

\frac{\partial L}{\partial M} = \left( \sum_{k=1}^{K} \frac{M^{k-1} \left(I - v v^\top\right)}{\|M v\|^{k}} \right) \frac{\partial L}{\partial v^{(K)}} v^\top \qquad (4)

The details of this derivation are provided in the supplementary material. In our experiments, Eq. 4 is the form we adopt to compute the ED gradients.

### 2.2 Relationship between PI and Analytical ED Gradients

We now show that when the number of iterations K goes to infinity, the PI gradients of Eq. 4 are the same as the analytical ED ones.

To reformulate the PI gradients, we rely on the fact that

M^{k} = V \Sigma^{k} V^\top = \lambda_1^{k} v_1 v_1^\top + \lambda_2^{k} v_2 v_2^\top + \cdots + \lambda_n^{k} v_n v_n^\top, \qquad (5)

and that \|M v_1\| = \lambda_1, where v_1 is the dominant eigenvector and \lambda_1 the dominant eigenvalue. Introducing Eq. 5 into Eq. 4 lets us re-write the gradient as

\frac{\partial L}{\partial M} = \left( \frac{\sum_{i=2}^{n} v_i v_i^\top}{\lambda_1} + \frac{\sum_{i=2}^{n} \lambda_i v_i v_i^\top}{\lambda_1^{2}} + \cdots + \frac{\sum_{i=2}^{n} \lambda_i^{K-1} v_i v_i^\top}{\lambda_1^{K}} \right) \frac{\partial L}{\partial v_1^{(K)}} v_1^\top \qquad (6)

Eq. 6 defines a geometric progression. Given that

1 - \left( \frac{\lambda_i}{\lambda_1} \right)^{k} \to 1 \quad \text{when } k \to \infty, \quad \text{because} \quad \left| \frac{\lambda_i}{\lambda_1} \right| \le 1,

we have

\frac{1}{\lambda_1} + \frac{1}{\lambda_1}\left(\frac{\lambda_i}{\lambda_1}\right) + \cdots + \frac{1}{\lambda_1}\left(\frac{\lambda_i}{\lambda_1}\right)^{k-1} = \frac{\frac{1}{\lambda_1}\left(1 - \left(\frac{\lambda_i}{\lambda_1}\right)^{k}\right)}{1 - \frac{\lambda_i}{\lambda_1}} \to \frac{\frac{1}{\lambda_1}}{1 - \frac{\lambda_i}{\lambda_1}}, \quad \text{when } k \to \infty. \qquad (7)

Introducing Eq. 7 into Eq. 6 yields

\frac{\partial L}{\partial M} = \left( \sum_{i=2}^{n} \frac{\frac{1}{\lambda_1}}{1 - \frac{\lambda_i}{\lambda_1}} v_i v_i^\top \right) \frac{\partial L}{\partial v_1^{(k)}} v_1^\top = \left( \sum_{i=2}^{n} \frac{v_i v_i^\top}{\lambda_1 - \lambda_i} \right) \frac{\partial L}{\partial v_1^{(k)}} v_1^\top. \qquad (8)

As shown in Ionescu15 , the analytic form of the ED gradients can be written as

\frac{\partial L}{\partial M} = V \left\{ \left( \tilde{K}^\top \circ \left( V^\top \frac{\partial L}{\partial V} \right) \right) + \left( \frac{\partial L}{\partial \Sigma} \right)_{\mathrm{diag}} \right\} V^\top, \qquad (9)

where \tilde{K} is given by Eq. 1. By making use of the same properties as before, this can be re-written as

\frac{\partial L}{\partial M} = \sum_{i=2}^{n} \frac{1}{\lambda_1 - \lambda_i} v_i v_i^\top \frac{\partial L}{\partial v_1} v_1^\top + \frac{\partial L}{\partial \lambda_1} v_1 v_1^\top. \qquad (10)

The last term of Eq. 10 can be ignored because \lambda_1 is not involved in any computation during the forward pass. Instead, we compute the eigenvalue as the Rayleigh quotient \lambda_1 = v_1^\top M v_1, which only depends on the eigenvector and thus only requires the gradients w.r.t. it. A detailed derivation is provided in the supplementary material.

This shows that the partial derivatives of the loss w.r.t. M computed using PI have the same form as the analytical ED ones when K \to \infty. Similar derivations can be carried out for the other eigenvectors. This justifies our use of PI to approximate the analytical ED gradients during backpropagation. We now turn to showing that the resulting gradient estimates are upper-bounded and can therefore not explode.
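This convergence can also be checked numerically. The NumPy sketch below (ours; `g` stands for a hypothetical upstream gradient w.r.t. the dominant eigenvector) compares the K-term PI gradient of Eq. 4 with the analytical expression of Eq. 8 on a matrix with well-separated eigenvalues:

```python
import numpy as np

def pi_grad(M, v1, g, K):
    """PI gradient of Eq. 4, accumulated over K steps from the exact v1."""
    P = np.eye(len(v1)) - np.outer(v1, v1)
    norm = np.linalg.norm(M @ v1)          # equals lambda_1 for the exact v1
    term = P / norm                        # first term of the sum
    G = term.copy()
    for _ in range(K - 1):
        term = M @ term / norm             # next term, computed recursively
        G = G + term
    return G @ np.outer(g, v1)

def analytic_grad(lam, V, g):
    """Analytical ED gradient of Eq. 8, restricted to the dominant eigenvector."""
    n = len(lam)
    A = np.zeros((n, n))
    for i in range(1, n):
        A += np.outer(V[:, i], V[:, i]) / (lam[0] - lam[i])
    return A @ np.outer(g, V[:, 0])
```

For K large enough that (\lambda_i/\lambda_1)^K is negligible, the two agree to numerical precision.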

### 2.3 Upper Bounding the PI Gradients

Recall that, when the input matrix has two equal eigenvalues, \lambda_i = \lambda_1, the gradients computed from Eq. 10 go to infinity, as the denominator is 0. However, as Eq. 6 can be viewed as the geometric series expansion of Eq. 10, it provides an upper bound on the gradient magnitude. Specifically, we can write

\frac{\partial L}{\partial M} = \sum_{i=2}^{n} \frac{1}{\lambda_1 - \lambda_i} v_i v_i^\top \frac{\partial L}{\partial v_1} v_1^\top \approx \left( \sum_{i=2}^{n} \left( \frac{1}{\lambda_1} + \frac{1}{\lambda_1} q_i + \cdots + \frac{1}{\lambda_1} q_i^{K-1} \right) v_i v_i^\top \right) \frac{\partial L}{\partial v_1} v_1^\top, \qquad (11)

where q_i = \lambda_i / \lambda_1. Therefore,

\left\| \frac{\partial L}{\partial M} \right\| \le \left\| \sum_{i=2}^{n} \left( \frac{1}{\lambda_1} + \frac{1}{\lambda_1}\left(\frac{\lambda_i}{\lambda_1}\right) + \cdots + \frac{1}{\lambda_1}\left(\frac{\lambda_i}{\lambda_1}\right)^{K-1} \right) v_i v_i^\top \right\| \left\| \frac{\partial L}{\partial v_1} \right\| \left\| v_1^\top \right\|
\le \left\| \sum_{i=2}^{n} \left( \frac{1}{\lambda_1} + \cdots + \frac{1}{\lambda_1} \right) v_i v_i^\top \right\| \left\| \frac{\partial L}{\partial v_1} \right\| \left\| v_1^\top \right\|
\le \sum_{i=2}^{n} \left\| \frac{K}{\lambda_1} v_i v_i^\top \right\| \left\| \frac{\partial L}{\partial v_1} \right\| \left\| v_1^\top \right\|
\le \frac{nK}{\lambda_1} \left\| \frac{\partial L}{\partial v_1} \right\|. \qquad (12)

This yields an upper bound of \frac{nK}{\lambda_1} \| \partial L / \partial v_1 \|. However, if \lambda_1 \to 0, this upper bound also becomes infinite. To avoid this, knowing that M is symmetric positive semi-definite, we modify it as M + \varepsilon I, where I is the identity matrix. This guarantees that the eigenvalues of the modified matrix are greater than or equal to \varepsilon. In practice, we set \varepsilon to a small positive constant. Thus, we can write

\left\| \frac{\partial L}{\partial M} \right\| \le \frac{nK}{\lambda_1 + \varepsilon} \left\| \frac{\partial L}{\partial v_1} \right\| \le \frac{nK}{\varepsilon} \left\| \frac{\partial L}{\partial v_1} \right\|, \qquad (13)

where n is the dimension of the matrix and K is the power iteration number, which means that choosing a specific value of K amounts to choosing an upper bound for the gradient magnitudes. We now provide an empirical approach to doing so.

### 2.4 Choosing an Appropriate Power Iteration Number K

Recall from Eqs. 7 and 8 that, for the PI gradients to provide a good estimate of the analytical ones, we need (\lambda_i/\lambda_1)^k to go to zero. Fig. 1 shows how this value evolves for different power iteration numbers k and ratios \lambda_i/\lambda_1. This suggests the need to select an appropriate k for each ratio.

To this end, we assume that a value of 0.05 is a good approximation to 0. Then we have

(\lambda_i/\lambda_1)^{k} \le 0.05 \;\Leftrightarrow\; k \ln(\lambda_i/\lambda_1) \le \ln(0.05) \;\Leftrightarrow\; k \ge \frac{\ln(0.05)}{\ln(\lambda_i/\lambda_1)}. \qquad (14)

That is, the minimum value of k satisfying (\lambda_i/\lambda_1)^k \le 0.05 is \lceil \ln(0.05) / \ln(\lambda_i/\lambda_1) \rceil.

Table 1 shows the minimum number of iterations required to guarantee that our assumption holds for different values of \lambda_i/\lambda_1. Note that when two eigenvalues are very close to each other, e.g., \lambda_i/\lambda_1 = 0.999, we need about 3000 iterations to achieve a good approximation. However, this is rare in practice. In CIFAR-10, using ResNet18, we observed that the average ratio lies in a much smaller interval over the mini-batches. Given the values shown in Table 1, we therefore set the power iteration number accordingly, which yields a good approximation in all these cases.
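The criterion of Eq. 14 is straightforward to evaluate; this small helper (ours, for illustration) returns the minimum admissible number of iterations for a given eigenvalue ratio:

```python
import math

def min_power_iterations(ratio, tol=0.05):
    """Smallest k such that ratio**k <= tol, i.e. k >= ln(tol) / ln(ratio)."""
    assert 0.0 < ratio < 1.0
    return math.ceil(math.log(tol) / math.log(ratio))
```

For instance, a ratio of 0.5 needs only 5 iterations, while a ratio of 0.999 needs 2995, consistent with the roughly 3000 iterations mentioned above for nearly equal eigenvalues.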

### 2.5 Practical Considerations

In practice, we found that other numerical imprecisions due to the computing platform itself, such as the use of single vs double precision, could further affect training with eigendecomposition. Here, we discuss our practical approach to dealing with these issues.

The first issue we observed was that, in the forward pass, the eigenvalues computed using SVD may be inaccurate, with SVD sometimes crashing when the matrix is ill-conditioned. Recall that, to increase stability, we add a small value \varepsilon to the diagonal of the input matrix. As a consequence, all the eigenvalues should be greater than or equal to \varepsilon. However, this is not the case when we use float instead of double precision. To solve this problem, we employ truncated SVD. That is, we discard the eigenvectors whose eigenvalue falls below \varepsilon, together with all subsequent ones.

The second issue is related to the precision of the eigenvalues computed using the Rayleigh quotient. Because of round-off errors, the computed eigenvalue may be inaccurate and sometimes even negative. To avoid using incorrect eigenvalues, we also truncate them.

These two practical issues correspond to the breaking conditions defined in Alg. 1, which provides the pseudo-code for our ZCA whitening application. Furthermore, we add another constraint implying that we also truncate the eigendecomposition if the remaining energy in the eigenvalues is less than 0.0001.

Note that this last condition can be modified by using a different threshold. To illustrate this, we also perform experiments with a higher threshold, leading to our PCA denoising layer, which aims to discard the noise in the network features. As shown in our experiments, our PCA denoising layer achieves competitive performance compared with the ZCA one.
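The truncation logic just described can be sketched as follows (illustrative NumPy, not the paper's code; `eps` and `energy_tol` are the small constants discussed above):

```python
import numpy as np

def truncated_eigh(M, eps=1e-4, energy_tol=1e-4):
    """Eigendecompose M + eps*I and truncate unreliable trailing eigenpairs."""
    n = M.shape[0]
    lam, V = np.linalg.eigh(M + eps * np.eye(n))
    lam, V = lam[::-1], V[:, ::-1]          # sort in decreasing order
    total = lam.sum()
    keep = n
    for i in range(n):
        # stop if an eigenvalue falls below eps (round-off artifact)
        # or the remaining energy is negligible
        if lam[i] < eps or lam[i:].sum() / total < energy_tol:
            keep = i
            break
    return lam[:keep], V[:, :keep]
```

Raising the energy threshold turns this same truncation into the PCA denoising layer mentioned above.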

In Alg. 1, the Power Iteration operation involves no computation during the forward pass. It only serves to save the intermediate variables that will be used to compute the gradients during the backward pass. Furthermore, compared with standard batch normalization [16], we replace the running variance by a running subspace but keep the learnable affine parameters. The running mean and running subspace are used in the testing phase.

## 3 Experiments

We now demonstrate the effectiveness of our eigendecomposition layer by using it to perform ZCA whitening and PCA denoising. ZCA whitening has been shown to improve classification performance over batch normalization [3]. We will demonstrate that we can deliver a further boost by making it possible to handle larger covariance matrices within the network. PCA denoising has been used for image denoising [17] but not in a deep learning context, presumably because it requires performing ED on large matrices. We will show that our approach solves this problem.

In all the experiments below, we use either ResNet18 or ResNet50 [19] as our backbone. We retain their original architectures but introduce an additional layer between the first convolutional layer and the first pooling layer. For both ZCA and PCA, the new layer computes the covariance matrix of the feature vectors, eigendecomposes it, and uses the eigenvalues and eigenvectors as described below. As discussed in Section 2.4, we use the power iteration number chosen there when backpropagating the gradients, unless otherwise specified. To accommodate this additional processing, we change the stride and kernel sizes in the subsequent blocks to Conv1(3×3, s=1) -> Block1(s=1) -> Block2(s=2) -> Block3(s=2) -> Block4(s=2) -> AvgPool(4×4) -> FC.

### 3.1 ZCA Whitening

The ZCA whitening algorithm takes as input a d×m matrix X, where d represents the feature dimensionality and m the number of samples. It relies on the eigenvalues and eigenvectors of the covariance matrix of X to compute a transformation W such that the covariance matrix of WX is the identity matrix, meaning that the transformed features are decorrelated. ZCA has shown great potential to boost the classification accuracy of deep networks [3], but only when d can be kept small to prevent the training procedure from diverging. To this end, the output channels of a convolutional layer are partitioned into groups so that each one contains only a few features. ZCA whitening is then performed within each group independently. This can be understood as a block-diagonal approximation of the complete ZCA whitening. In [3], the group size is taken to be 3, and the resulting covariance matrices are then less likely to have similar or zero eigenvalues. The pseudo-code for ZCA whitening is given in Alg. 1.
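A plain NumPy sketch of the ZCA transform at the core of Alg. 1 (forward pass only, our illustration; X is d×m with d features and m samples, and `eps` is the regularizer discussed in Section 2.5):

```python
import numpy as np

def zca_whiten(X, eps=1e-4):
    """ZCA-whiten the columns of X so their covariance is close to identity."""
    d, m = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)
    cov = Xc @ Xc.T / m + eps * np.eye(d)    # regularized covariance
    lam, V = np.linalg.eigh(cov)
    W = V @ np.diag(lam ** -0.5) @ V.T       # ZCA transform: V diag(lam)^(-1/2) V^T
    return W @ Xc
```

Unlike PCA whitening, which stops after rotating into the eigenbasis, ZCA rotates back with V, keeping the whitened features as close as possible to the original ones.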

We first use CIFAR-10 [20] to compare the behavior of our approach with that of standard SVD and PI for different numbers of groups, corresponding to different matrix dimensions. Because of numerical instability, the training process of these methods often crashes. In short, we observed that

1. For SVD, even for the smallest matrix dimension we tested, 8 out of 15 trials failed; for all larger dimensions, the algorithm failed every time.

2. For PI, for the smallest dimension, all the trials succeeded; for the next one, only 1 out of 15 trials succeeded; beyond that, the algorithm failed every time.

3. For our algorithm, we never saw a training failure, independently of the matrix dimension, over the whole range we tested.

In Table 2, we report the mean classification error rates and their standard deviations over the trials that succeeded, along with the success rates. For the smallest matrix dimension, SVD unsurprisingly delivers the best accuracy, because using analytical derivatives is the best one can do when possible. However, even for such a small dimension, it succeeds in only a fraction of the trials and systematically fails for larger ones. PI can handle somewhat larger matrices, but not consistently, as it succeeds only part of the time. Our approach succeeds for all matrix dimensions up to at least 64 and delivers the smallest error rate for the largest one, which confirms that increasing the size of the groups can boost performance.

We report equivalent results on CIFAR-100 in Table 3, using either ResNet18 or ResNet50 as the backbone. Our approach again systematically succeeds. Being able to boost the matrix dimension to 64 for ResNet18 and to 32 for ResNet50 allows us to outperform batch normalization in terms of mean error in both cases.

In Fig. 2, we plot the training losses when using either PI or our method as a function of the number of epochs on both datasets. Note how much smoother ours are.

### 3.2 PCA Denoising

We now turn to PCA denoising, which computes the covariance matrix M of a matrix X, finds the eigenvectors of M, projects X onto the subspace spanned by the eigenvectors associated with the k largest eigenvalues, and projects the result back to the original space. Before doing PCA, the matrix needs to be standardized by removing the mean and scaling the data by its standard deviation.

During training, k can either be taken as a fixed number or set dynamically for each individual batch. In the latter case, a standard heuristic is to choose it so that the k largest eigenvalues account for a given fraction of the sum of all the eigenvalues, meaning that this fraction of the variance present in the original data is preserved. We will discuss both approaches below. As before, we run our training scheme 5 times for each setting of the parameters and report the mean classification accuracy and its variance on CIFAR-10.

#### Percentage of Preserved Information vs Performance.

We first set k dynamically to preserve a fixed percentage of the variance in each batch, as described above. In practice, this means that k is always much smaller than the number of channels. For instance, after the first convolutional layer, which has 64 channels, we observed that k remains smaller than 7 most of the time for the lowest preserved-variance setting, smaller than 15 for the intermediate one, and smaller than 31 for the highest. As can be seen in Fig. 3(a) and Table 4(a), retaining less than 90% of the variance hurts performance, and the best results are obtained for 99%, with a mean error of 4.63%. Our PCA denoising layer then outperforms Batch Normalization (BN), whose mean error is 4.81%, as shown in Table 5.

#### Number of Eigenvectors vs Performance.

We now set k to a fixed value. As shown in Fig. 3(b) and Table 4(b), the performance is stable over a range of values of k, and the optimum preserves approximately 99% of the variance on average. Note that the average accuracy is the same as in the previous scenario, but its standard deviation is a bit larger.

#### Number of Power Iterations vs Performance.

As preserving 99% of the variance yields good results, we now dynamically set k accordingly, and report errors as a function of the number of power iterations during backpropagation in Table 4(c). Note that, for our PCA denoising layer, 2 power iterations are enough to deliver an error rate very close to the best one, obtained after 29 iterations: 4.63% vs 4.61%. In this case, our method only incurs a small time penalty at run time. In practice, on a single Titan XP GPU, for one minibatch of batch size 128, using ResNet18 as backbone, 2 power iterations take 104.8 ms vs 82.7 ms for batch normalization. Note that we implemented our method in PyTorch [21], with the backward pass written in Python, which leaves room for improvement.

#### Training Stability.

Fig. 5 depicts the training curves for our method, PI, and standard BN. Note that the loss values for our method and BN decrease smoothly, which indicates that the training process is very stable. By contrast, the PI training curve exhibits far more instability and ultimately lower performance. In Table 5, we report the final error rates, where the NaN values associated with SVD denote the fact that using the analytical SVD derivatives failed all five times we tried.

## 4 Discussion & Conclusion

In this paper, we have introduced a numerically stable differentiable eigendecomposition method that relies on SVD during the forward pass and on Power Iterations to compute the gradients during the backward pass. Both the theory and the experimental results confirm the increased stability that our method brings compared with standard SVD or Power Iterations alone. In addition to performing ZCA more effectively than before, this has enabled us to introduce a PCA denoising layer, which has proven effective at improving performance on classification tasks.

The main limitation of our method is that its accuracy depends on the accuracy of the SVD in the forward pass, which is not always perfect. In future work, we will therefore explore approaches to increasing it.

## 5 Acknowledgments

This work was funded in part by the Swiss Innovation Agency Innosuisse.

## References

• [1] Matthew Turk and Alex Pentland. Eigenfaces for recognition. Journal of cognitive neuroscience, 3(1):71–86, 1991.
• [2] Guillaume Desjardins, Karen Simonyan, Razvan Pascanu, et al. Natural neural networks. 2015.
• [3] Lei Huang, Dawei Yang, Bo Lang, and Jia Deng. Decorrelated batch normalization. In CVPR, 2018.
• [4] Catalin Ionescu, Orestis Vantzos, and Cristian Sminchisescu. Matrix Backpropagation for Deep Networks with Structured Layers. In CVPR, 2015.
• [5] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. 2018.
• [6] S. Suwajanakorn, N. Snavely, J. Tompson, and M. Norouzi. Discovery of Latent 3D Keypoints via End-To-End Geometric Reasoning. In NIPS, 2018.
• [7] K. M. Yi, E. Trulls, Y. Ono, V. Lepetit, M. Salzmann, and P. Fua. Learning to Find Good Correspondences. In CVPR, 2018.
• [8] R. Ranftl and V. Koltun. Deep Fundamental Matrix Estimation. In ECCV, 2018.
• [9] Zheng Dang, Kwang Moo Yi, Yinlin Hu, Fei Wang, Pascal Fua, and Mathieu Salzmann. Eigendecomposition-Free Training of Deep Networks with Zero Eigenvalue-Based Losses. In ECCV, 2018.
• [10] T. Papadopoulo and M. Lourakis. Estimating the jacobian of the singular value decomposition: Theory and applications. In ECCV, pages 554–570, 2000.
• [11] Andrei Zanfir and Cristian Sminchisescu. Deep Learning of Graph Matching. In CVPR, 2018.
• [12] Yuji Nakatsukasa and Nicholas J Higham. Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the svd. SIAM Journal on Scientific Computing, 35(3):A1325–A1349, 2013.
• [13] Richard L. Burden and J. Douglas Faires. Numerical Analysis. Ninth edition, 1989.
• [14] Agnan Kessy, Alex Lewin, and Korbinian Strimmer. Optimal whitening and decorrelation. The American Statistician, 72(4):309–314, 2018.
• [15] Anthony J Bell and Terrence J Sejnowski. The “independent components” of natural scenes are edge filters. Vision research, 37(23):3327–3338, 1997.
• [16] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML, 2015.
• [17] Y Murali Mohan Babu, M Venkata Subramanyam, and MN Giri Prasad. Pca based image denoising. Signal & Image Processing, 3(2):236, 2012.
• [18] Mang Ye, Andy J Ma, Liang Zheng, Jiawei Li, and Pong C Yuen. Dynamic label graph matching for unsupervised video re-identification. In ICCV, 2017.
• [19] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, pages 770–778, 2016.
• [20] A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Master’s thesis, Department of Computer Science, University of Toronto, 2009.
• [21] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.

## 6 Approximating the ED Gradients with PI in Backpropagation

In the following two subsections, we prove that the gradients computed from PI equal those computed from ED.

To compute the leading eigenvector of M, PI uses the standard update

v^{(k)} = \frac{M v^{(k-1)}}{\| M v^{(k-1)} \|}, \qquad (15)

where \|\cdot\| denotes the \ell_2 norm, and v^{(0)} is usually initialized randomly with \|v^{(0)}\| = 1. Its gradients are [18]

\frac{\partial L}{\partial M} = \sum_{k} \frac{\left(I - v^{(k+1)} v^{(k+1)\top}\right)}{\left\|M v^{(k)}\right\|} \frac{\partial L}{\partial v^{(k+1)}} v^{(k)\top}, \qquad \frac{\partial L}{\partial v^{(k)}} = \frac{M \left(I - v^{(k+1)} v^{(k+1)\top}\right)}{\left\|M v^{(k)}\right\|} \frac{\partial L}{\partial v^{(k+1)}}. \qquad (16)

Using 3 power iteration steps for demonstration, we have

\frac{\partial L}{\partial v^{(2)}} = \frac{M \left(I - v^{(3)} v^{(3)\top}\right)}{\left\|M v^{(2)}\right\|} \frac{\partial L}{\partial v^{(3)}}, \qquad \frac{\partial L}{\partial v^{(1)}} = \frac{M \left(I - v^{(2)} v^{(2)\top}\right)}{\left\|M v^{(1)}\right\|} \frac{\partial L}{\partial v^{(2)}}. \qquad (17)

Then, because we use ED's result, denoted by v, as the initial vector, v^{(0)} = v^{(1)} = v^{(2)} = v^{(3)} = v. Therefore, \partial L / \partial M can be re-written as

\frac{\partial L}{\partial M} = \frac{\left(I - v^{(3)} v^{(3)\top}\right)}{\left\|M v^{(2)}\right\|} \frac{\partial L}{\partial v^{(3)}} v^{(2)\top} + \frac{\left(I - v^{(2)} v^{(2)\top}\right)}{\left\|M v^{(1)}\right\|} \frac{\partial L}{\partial v^{(2)}} v^{(1)\top} + \frac{\left(I - v^{(1)} v^{(1)\top}\right)}{\left\|M v^{(0)}\right\|} \frac{\partial L}{\partial v^{(1)}} v^{(0)\top}
= \left( \frac{\left(I - v v^\top\right)}{\|M v\|} + \frac{\left(I - v v^\top\right) M \left(I - v v^\top\right)}{\|M v\|^{2}} + \frac{\left(I - v v^\top\right) M \left(I - v v^\top\right) M \left(I - v v^\top\right)}{\|M v\|^{3}} \right) \frac{\partial L}{\partial v^{(3)}} v^\top. \qquad (18)

Since M and v v^\top are symmetric, and M v = \lambda v, we have

v v^\top M = \left(M^\top v v^\top\right)^\top = \left(M v v^\top\right)^\top = \left(\lambda v v^\top\right)^\top = \lambda v v^\top = M v v^\top.

Introducing the equation above into the numerator of the second term of Eq. 18 yields

\left(I - v v^\top\right) M \left(I - v v^\top\right) = \left(M - v v^\top M\right)\left(I - v v^\top\right) = \left(M - M v v^\top\right)\left(I - v v^\top\right) = M \left(I - v v^\top\right)\left(I - v v^\top\right) = M \left(I - 2 v v^\top + v \left(v^\top v\right) v^\top\right) = M \left(I - v v^\top\right). \qquad (19)

Similarly, for the numerator of the third term in Eq. 18, we have

\left(I - v v^\top\right) M \left(I - v v^\top\right) M \left(I - v v^\top\right) = M^{2} \left(I - v v^\top\right). \qquad (20)
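These projection identities are easy to verify numerically; a small NumPy check (ours) with a synthetic symmetric matrix and one of its eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
M = Q @ np.diag([3.0, 2.0, 1.0, 0.5]) @ Q.T   # symmetric, eigenvectors = columns of Q
v = Q[:, 0]                                    # eigenvector with eigenvalue 3
P = np.eye(4) - np.outer(v, v)                 # the projector I - v v^T

assert np.allclose(M @ v, 3.0 * v)             # M v = lambda v
assert np.allclose(P @ M @ P, M @ P)           # Eq. 19
assert np.allclose(P @ M @ P @ M @ P, M @ M @ P)  # Eq. 20
```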

Introducing Eq. 19 and Eq. 20 into Eq. 18, we obtain

\frac{\partial L}{\partial M} = \left( \frac{\left(I - v v^\top\right)}{\|M v\|} + \frac{M \left(I - v v^\top\right)}{\|M v\|^{2}} + \frac{M^{2} \left(I - v v^\top\right)}{\|M v\|^{3}} \right) \frac{\partial L}{\partial v^{(3)}} v^\top. \qquad (21)

When extending the iteration number from 3 to K, Eq. 18 becomes

\frac{\partial L}{\partial M} = \left( \sum_{k=1}^{K} \frac{M^{k-1} \left(I - v v^\top\right)}{\|M v\|^{k}} \right) \frac{\partial L}{\partial v^{(K)}} v^\top. \qquad (22)

Eq.22 is the form we adopt to compute the gradients of ED.

The analytic solution of the ED gradients is [4].

\frac{\partial L}{\partial M} = V \left\{ \left( \tilde{K}^\top \circ \left( V^\top \frac{\partial L}{\partial V} \right) \right) + \left( \frac{\partial L}{\partial \Sigma} \right)_{\mathrm{diag}} \right\} V^\top \qquad (23)

\tilde{K}_{ij} = \begin{cases} \frac{1}{\lambda_i - \lambda_j}, & i \neq j \\ 0, & i = j \end{cases} \qquad (24)

\tilde{K} = \begin{bmatrix}
0 & \frac{1}{\lambda_1 - \lambda_2} & \frac{1}{\lambda_1 - \lambda_3} & \cdots & \frac{1}{\lambda_1 - \lambda_n} \\
\frac{1}{\lambda_2 - \lambda_1} & 0 & \frac{1}{\lambda_2 - \lambda_3} & \cdots & \frac{1}{\lambda_2 - \lambda_n} \\
\frac{1}{\lambda_3 - \lambda_1} & \frac{1}{\lambda_3 - \lambda_2} & 0 & \cdots & \frac{1}{\lambda_3 - \lambda_n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\frac{1}{\lambda_n - \lambda_1} & \frac{1}{\lambda_n - \lambda_2} & \frac{1}{\lambda_n - \lambda_3} & \cdots & 0
\end{bmatrix} \qquad (25)

where \lambda_i is the i-th eigenvalue, and

V = \begin{bmatrix} v_1 & v_2 & v_3 & \cdots & v_n \end{bmatrix}, \qquad (26)

where v_i is the i-th eigenvector. Then,

\frac{\partial L}{\partial V} = \begin{bmatrix} \frac{\partial L}{\partial v_1} & \frac{\partial L}{\partial v_2} & \frac{\partial L}{\partial v_3} & \cdots & \frac{\partial L}{\partial v_n} \end{bmatrix} \qquad (27)

V^\top \frac{\partial L}{\partial V} = \begin{bmatrix}
v_1^\top \frac{\partial L}{\partial v_1} & v_1^\top \frac{\partial L}{\partial v_2} & \cdots & v_1^\top \frac{\partial L}{\partial v_n} \\
v_2^\top \frac{\partial L}{\partial v_1} & v_2^\top \frac{\partial L}{\partial v_2} & \cdots & v_2^\top \frac{\partial L}{\partial v_n} \\
\vdots & \vdots & \ddots & \vdots \\
v_n^\top \frac{\partial L}{\partial v_1} & v_n^\top \frac{\partial L}{\partial v_2} & \cdots & v_n^\top \frac{\partial L}{\partial v_n}
\end{bmatrix} \qquad (28)

\left( \tilde{K}^\top \circ \left( V^\top \frac{\partial L}{\partial V} \right) \right)_{ij} = \begin{cases} \frac{1}{\lambda_j - \lambda_i} \, v_i^\top \frac{\partial L}{\partial v_j}, & i \neq j \\ 0, & i = j \end{cases} \qquad (29)

V \left( \tilde{K}^\top \circ V^\top \frac{\partial L}{\partial V} \right) = \begin{bmatrix} \sum_{i \neq 1}^{n} \frac{1}{\lambda_1 - \lambda_i} v_i v_i^\top \frac{\partial L}{\partial v_1}, & \cdots, & \sum_{i \neq n}^{n} \frac{1}{\lambda_n - \lambda_i} v_i v_i^\top \frac{\partial L}{\partial v_n} \end{bmatrix} \qquad (30)

V \left( \tilde{K}^\top \circ V^\top \frac{\partial L}{\partial V} \right) V^\top = \sum_{i \neq 1}^{n} \frac{1}{\lambda_1 - \lambda_i} v_i v_i^\top \frac{\partial L}{\partial v_1} v_1^\top + \cdots + \sum_{i \neq n}^{n} \frac{1}{\lambda_n - \lambda_i} v_i v_i^\top \frac{\partial L}{\partial v_n} v_n^\top \qquad (31)

V \left( \frac{\partial L}{\partial \Sigma} \right)_{\mathrm{diag}} V^\top = \sum_{i=1}^{n} \frac{\partial L}{\partial \lambda_i} v_i v_i^\top \qquad (32)

Let us now consider the partial derivatives w.r.t. the dominant eigenvector v_1 and ignore the remaining ones. Then, Eq. 23 becomes

\frac{\partial L}{\partial M} = \sum_{i=2}^{n} \frac{1}{\lambda_1 - \lambda_i} v_i v_i^\top \frac{\partial L}{\partial v_1} v_1^\top + \frac{\partial L}{\partial \lambda_1} v_1 v_1^\top. \qquad (33)