Controllable Orthogonalization in Training DNNs

04/02/2020 · by Lei Huang, et al.

Orthogonality is widely used for training deep neural networks (DNNs) due to its ability to maintain all singular values of the Jacobian close to 1 and reduce redundancy in representation. This paper proposes a computationally efficient and numerically stable orthogonalization method using Newton's iteration (ONI), to learn a layer-wise orthogonal weight matrix in DNNs. ONI works by iteratively stretching the singular values of a weight matrix towards 1. This property enables it to control the orthogonality of a weight matrix by its number of iterations. We show that our method improves the performance of image classification networks by effectively controlling the orthogonality to provide an optimal tradeoff between optimization benefits and representational capacity reduction. We also show that ONI stabilizes the training of generative adversarial networks (GANs) by maintaining the Lipschitz continuity of a network, similar to spectral normalization (SN), and further outperforms SN by providing controllable orthogonality.


1 Introduction

Training deep neural networks (DNNs) is often difficult due to the occurrence of vanishing/exploding gradients [7, 16, 51]. Preliminary research [39, 16] has suggested that weight initialization techniques are essential for avoiding these issues. As such, various works have tried to tackle the problem by designing weight matrices that can provide nearly equal variance to activations from different layers [16, 21]. Such a property can be further amplified by orthogonal weight initialization [58, 46, 62], which shows excellent theoretical results in convergence due to its ability to obtain a DNN's dynamical isometry [58, 53, 72], i.e., all singular values of the input-output Jacobian are concentrated near 1. The improved performance of orthogonal initialization is empirically observed in [58, 46, 53, 70], and it makes training even 10,000-layer DNNs possible [70]. However, the initial orthogonality can break down during training and is not necessarily sustained [71].

Previous works have tried to maintain the orthogonal weight matrix by imposing an additional orthogonality penalty on the objective function, which can be viewed as a 'soft orthogonal constraint' [51, 66, 71, 5, 3]. These methods show improved performance in image classification [71, 76, 40, 5], resisting attacks from adversarial examples [13], neural photo editing [11] and training generative adversarial networks (GANs) [10, 47]. However, the introduced penalty works like a pure regularization, and it is unclear whether the orthogonality is truly maintained or whether training actually benefits from it. Other methods have been developed to directly solve the 'hard orthogonal constraint' [66, 5], either by Riemannian optimization [50, 20] or by orthogonal weight normalization [67, 27]. However, Riemannian optimization often suffers from training instability [20, 27], while orthogonal weight normalization [27] requires computationally expensive eigen decomposition, and the necessary back-propagation through this eigen decomposition may suffer from numerical instability, as shown in [33, 43].

Figure 1: ONI controls a weight matrix's magnitude of orthogonality (measured as $\|WW^T - I\|_F$) by iteratively stretching its singular values towards 1.

We propose to perform orthogonalization by Newton’s iteration (ONI) [44, 8, 29] to learn an exact orthogonal weight matrix, which is computationally efficient and numerically stable. To further speed up the convergence of Newton’s iteration, we propose two techniques: 1) we perform centering to improve the conditioning of the proxy matrix; 2) we explore a more compact spectral bounding method to make the initial singular value of the proxy matrix closer to .

We provide an insightful analysis and show that ONI works by iteratively stretching the singular values of the weight matrix towards 1 (Figure 1). This property makes ONI work well even if the weight matrix is singular (with multiple zero singular values), a case in which the eigen decomposition based method [27] often suffers from numerical instability [33, 43]. Moreover, we show that controlling orthogonality is necessary to balance the optimization benefits against the reduction in representational capacity, and ONI can elegantly achieve this through its iteration number (Figure 1). Besides, ONI provides a unified solution for row/column orthogonality, regardless of whether the weight matrix's output dimension is smaller or larger than its input dimension.

We also address practical strategies for effectively learning orthogonal weight matrices in DNNs. We introduce a constant of $\sqrt{2}$ to initially scale the orthonormal weight matrix so that dynamical isometry [58] can be well maintained for deep ReLU networks [49]. We conduct extensive experiments on multilayer perceptrons (MLPs) and convolutional neural networks (CNNs). Our proposed method benefits training and improves test performance over multiple datasets, including ImageNet [55]. We also show that our method stabilizes the training of GANs and achieves improved performance on unsupervised image generation, compared to the widely used spectral normalization [47].

2 Related Work

Orthogonal filters have been extensively explored in signal processing since they are capable of preserving activation energy and reducing redundancy in representation [77]. Saxe et al. [58] introduced an orthogonal weight matrix for DNNs and showed that it achieves approximate dynamical isometry [58] for deep linear neural networks, therefore significantly improving the optimization efficiency [46, 62]. Pennington et al. [53] further found that the nonlinear sigmoid network can also obtain dynamical isometry when combined with orthogonal weight initialization [53, 70, 72].

Research has also been conducted into using orthogonal matrices to avoid the gradient vanishing/explosion problems in recurrent neural networks (RNNs). These methods mainly focus on constructing square orthogonal/unitary matrices for the hidden-to-hidden transformations in RNNs [4, 67, 15, 66, 30, 35, 24]. This is done either by constructing a decomposed unitary weight matrix with restricted [4] or full representational capability [67, 24], or by using soft constraints [66]. Unlike these methods, which require a square weight matrix and are limited to hidden-to-hidden transformations in RNNs, our method is more general and can adapt to situations where the weight matrix is not square.

Our method is related to methods that impose orthogonal penalties on the loss function [51, 66, 5]. Most works use soft orthogonality regularization under the standard Frobenius norm [51, 66, 5], though alternative orthogonal penalties were explored in [5]. There are also methods that bound the singular values with periodical projection [34]. Our method targets the 'hard constraint' and provides controllable orthogonality.

One way to obtain exact orthogonality is through Riemannian optimization methods [50, 20]. These methods usually require a retraction operation [2] to project the updated weight back onto the Stiefel manifold [50, 20], which may result in training instability for DNNs [20, 27]. Our method avoids this by employing re-parameterization to construct the orthogonal matrix [27]. Our work is closely related to orthogonal weight normalization [27], which also uses re-parameterization to design an orthogonal transformation. However, [27] solves the problem by computationally expensive eigen decomposition, which may result in numerical instability [33, 43]. We use Newton's iteration [44, 8], which is more computationally efficient and numerically stable. We further argue that fully orthogonalizing the weight matrix limits the network's learning capacity, which may result in degenerated performance [47, 10]. Another related work is spectral normalization [47], which uses re-parameterization to bound only the spectral norm (the maximum singular value) of the weight matrix to 1. Our method can effectively interpolate between spectral normalization and full orthogonalization by altering the iteration number.

Newton’s iteration has also been employed in DNNs for constructing bilinear/second-order pooling [43, 41], or whitening the activations [29]. [43] and [41] focused on calculating the square root of the covariance matrix, while our method computes the square root inverse of the covariance matrix, like the work in [29]. However, our work has several main differences from [29]: 1) In [29], they aimed to whiten the activation [28] over batch data using Newton’s iteration, while our work seeks to learn the orthogonal weight matrix, which is an entirely different research problem [32, 56, 28]; 2) We further improve the convergence speed compared to the Newton’s iteration proposed in [29] by providing more compact bounds; 3) Our method can maintain the Lipschitz continuity of the network and thus has potential in stabilizing the training of GANs [47, 10]. It is unclear whether or not the work in [29] has such a property, since it is data-dependent normalization [32, 47, 10].

3 Proposed Method

Given a dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$ composed of inputs $x_i$ and their corresponding labels $y_i$, we represent a standard feed-forward neural network as a function $f(x; \theta)$ parameterized by $\theta$. $f$ is a composition of $L$ simple nonlinear functions. Each of these consists of a linear transformation $h^l = W^l x^l + b^l$ with learnable weights $W^l$ and biases $b^l$, followed by an element-wise nonlinearity: $x^{l+1} = \phi(h^l)$. Here $l = 1, \dots, L$ indexes the layers. We denote the learnable parameters as $\theta = \{W^l, b^l\}_{l=1}^{L}$. Training neural networks involves minimizing the discrepancy between the desired output $y$ and the predicted output $f(x; \theta)$, described by a loss function $\mathcal{L}(y, f(x; \theta))$. Thus, the optimization objective is: $\min_{\theta} \mathbb{E}_{(x, y) \in D}\big[\mathcal{L}(y, f(x; \theta))\big]$.

1:  Input: proxy parameters $V \in \mathbb{R}^{n \times d}$ and iteration number $T$.
2:  Bounding $V$'s singular values: $V_N = \frac{V}{\|V\|_F}$.
3:  Calculate covariance matrix: $S = V_N V_N^T$.
4:  $B_0 = I$.
5:  for $t = 1$ to $T$ do
6:     $B_t = \frac{1}{2}(3 B_{t-1} - B_{t-1}^3 S)$.
7:  end for
8:  $W = B_T V_N$.
9:  Output: orthogonalized weight matrix: $W$.
Algorithm 1 Orthogonalization by Newton's Iteration (ONI).
Figure 2: Convergence behavior of the proposed orthogonalization by Newton's iteration. The entries of the proxy matrix $V$ are sampled from a Gaussian distribution. We show (a) the magnitude of the orthogonality, measured as $\|WW^T - I\|_F$, with respect to the iterations, and (b) the distribution (log scale) of the eigenvalues of $WW^T$ at different iterations.

3.1 Preliminaries

This paper starts with learning orthogonal filter banks (row orthogonalization of a weight matrix $W \in \mathbb{R}^{n \times d}$) for deep neural networks (DNNs). We assume $n \leq d$ for simplicity, and will discuss the situation where $n > d$ in Section 3.4. This problem is formulated in [27] as an optimization with layer-wise orthogonality constraints, as follows:

$\min_{\theta} \mathbb{E}_{(x, y) \in D}\big[\mathcal{L}(y, f(x; \theta))\big], \quad \text{s.t.} \quad W^l (W^l)^T = I, \ l = 1, \dots, L. \qquad (1)$

To solve this problem directly, Huang et al. [27] proposed to use proxy parameters $V \in \mathbb{R}^{n \times d}$ and construct the orthogonal weight matrix $W$ as the closest orthogonal matrix to $V$ under the Frobenius norm, where the objective is:

$\Phi(V) = \arg\min_{W} \|W - V\|_F^2, \quad \text{s.t.} \quad W W^T = I. \qquad (2)$

They solved this in closed form, with the orthogonal transformation given by:

$W = \Phi(V) = D \Lambda^{-1/2} D^T V, \qquad (3)$

where $\Lambda = \mathrm{diag}(\sigma_1, \dots, \sigma_n)$ and $D$ are the eigenvalues and corresponding eigenvectors of the covariance matrix $S = V V^T$. Given the gradient $\frac{\partial \mathcal{L}}{\partial W}$, back-propagation must pass through the orthogonal transformation to calculate $\frac{\partial \mathcal{L}}{\partial V}$ for updating $V$. The closed formulation is concise; however, it encounters the following problems in practice: 1) Eigen decomposition is required, which is computationally expensive, especially on GPU devices [43]; 2) The back-propagation through the eigen decomposition requires the element-wise multiplication of a matrix $K$ [27], whose off-diagonal elements are given by $K_{ij} = \frac{1}{\sigma_i - \sigma_j}$ for $i \neq j$. This may cause numerical instability when $S$ has (nearly) equal eigenvalues, which is discussed in [33, 43] and observed in our preliminary experiments, especially in high-dimensional spaces.

We observe that the solution of Eqn. 2 can be represented as $W = (VV^T)^{-1/2} V$, where the square root inverse $(VV^T)^{-1/2}$ can be computed by Newton's iteration [44, 8, 29], which avoids eigen decomposition in the forward pass and the potential numerical instability during back-propagation.

1:  Input: proxy parameters $V \in \mathbb{R}^{n \times d}$ and iteration number $T$.
2:  Centering: $V_C = V - \frac{1}{d}(V \mathbf{1}) \mathbf{1}^T$.
3:  Bounding $V_C$'s singular values: $V_N = \frac{V_C}{\sqrt{\|V_C V_C^T\|_F}}$.
4:  Execute Steps 3 to 8 in Algorithm 1.
5:  Output: orthogonalized weight matrix: $W$.
Algorithm 2 ONI with Acceleration.

3.2 Orthogonalization by Newton’s Iteration

Newton’s iteration calculates as follows:

(4)

where is the number of iterations. Under the condition that , will converge to [8, 29].

in Eqn. 3.1 can be initialized to ensure that initially satisfies the convergence condition, e.g. ensuring , where are the singular values of . However, the condition is very likely to be violated when training DNNs, since varies.

To address this problem, we propose to maintain a proxy parameter $V$ and conduct a transformation $V_N = \phi(V)$ over it, such that $S = V_N V_N^T$ satisfies the convergence condition, inspired by the re-parameterization methods [57, 27]. One straightforward way to ensure this is to divide $V$ by its spectral norm, as the spectral normalization method does [47]. However, it is computationally expensive to accurately calculate the spectral norm, since singular value decomposition is required. We thus propose to divide $V$ by its Frobenius norm to perform spectral bounding:

$V_N = \frac{V}{\|V\|_F}. \qquad (5)$

It is easy to demonstrate that Eqn. 5 satisfies the convergence condition of Newton's iteration, and we show that this method is equivalent to the Newton's iteration proposed in [29] (see Appendix B for details). Algorithm 1 describes the proposed method, referred to as orthogonalization by Newton's iteration (ONI), and its corresponding back-propagation is shown in Appendix A. We find that Algorithm 1 converges well (Figure 2). However, a concern is the speed of convergence, since around 10 iterations are required to obtain a good orthogonalization. We thus further explore methods to speed up the convergence of ONI.
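Before turning to those speed-ups, the following is a minimal PyTorch sketch of Algorithm 1 (our own illustrative code, not the official implementation released with the paper; the helper name oni_orthogonalize is ours):

```python
import torch

def oni_orthogonalize(V, T=5):
    """Minimal sketch of Algorithm 1: orthogonalize an n x d proxy matrix V
    by Newton's iteration (row-orthogonal case, n <= d)."""
    n, _ = V.shape
    # Spectral bounding (Eqn. 5): dividing by the Frobenius norm keeps the
    # eigenvalues of S inside the convergence region of Newton's iteration.
    V_N = V / V.norm(p='fro')
    S = V_N @ V_N.t()                      # covariance matrix, n x n
    B = torch.eye(n, dtype=V.dtype, device=V.device)
    for _ in range(T):
        # B_t = (3 B_{t-1} - B_{t-1}^3 S) / 2, which converges to S^{-1/2}.
        B = 0.5 * (3.0 * B - B @ B @ B @ S)
    return B @ V_N                         # W = B_T V_N, so that W W^T ~= I


# Quick check: the orthogonality error ||W W^T - I||_F shrinks as T grows.
V = torch.randn(64, 256)
for T in (1, 3, 5, 10):
    W = oni_orthogonalize(V, T)
    print(T, (W @ W.t() - torch.eye(64)).norm().item())
```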

3.3 Speeding up Convergence of Newton’s Iteration

Our Newton’s iteration proposed for obtaining orthogonal matrix works by iteratively stretching the singular values of towards 1, as shown in Figure 2 (b). The speed of convergence depends on how close the singular values of initially are to [8]. We observe that the following factors benefit the convergence of Newton’s iteration: 1) The singular values of have a balanced distribution, which can be evaluated by the condition number of the matrix ; 2) The singular values of should be as close to 1 as possible after spectral bounding (Eqn. 5).

Centering

To achieve a more balanced distribution of the eigenvalues of $VV^T$, we perform a centering operation over the proxy parameters $V$, as follows:

$V_C = V - \frac{1}{d}(V \mathbf{1}) \mathbf{1}^T, \qquad (6)$

where $\mathbf{1}$ is the $d$-dimensional vector of ones. The orthogonal transformation is then performed over the centered parameters $V_C$. As shown in [39, 59], the covariance matrix $V_C V_C^T$ of the centered matrix is better conditioned than $V V^T$. We also experimentally observe that orthogonalization over centered parameters (indicated as 'ONI+Center') produces larger singular values on average at the initial stage (Figure 3 (b)), and thus converges faster than the original ONI (Figure 3 (a)).

Compact Spectral Bounding

To achieve larger singular values of $V_N$ after spectral bounding, we seek a more compact spectral bounding factor $c$ such that $V_N = V/c$ has singular values closer to 1 while still satisfying the convergence condition. We find that $c = \sqrt{\|VV^T\|_F}$ satisfies the requirements, which is demonstrated in Appendix B. We thus perform spectral bounding based on the following formulation:

$V_N = \frac{V}{\sqrt{\|VV^T\|_F}}. \qquad (7)$

More compact spectral bounding (CSB) is achieved using Eqn. 7, compared to Eqn. 5. For example, assuming that $V$ has $n$ equal singular values $\sigma$, we have $\|V\|_F = \sqrt{n}\,\sigma$ and $\|VV^T\|_F = \sqrt{n}\,\sigma^2$, so the initial singular values of $V_N$ after spectral bounding will be $n^{-1/4}$ when using Eqn. 7, but only $n^{-1/2}$ when using Eqn. 5. We also experimentally observe that using Eqn. 7 (denoted by '+CSB' in Figure 3) results in significantly faster convergence.

Algorithm 2 describes the accelerated ONI method with centering and more compact spectral bounding (Eqn. 7).
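A corresponding sketch of Algorithm 2 (again our illustrative code; it assumes Eqn. 6 removes the per-row mean and Eqn. 7 divides by $\sqrt{\|V_C V_C^T\|_F}$, as reconstructed above):

```python
import torch

def oni_accelerated(V, T=5):
    """Sketch of Algorithm 2: centering plus compact spectral bounding (CSB),
    followed by the same Newton iterations as Algorithm 1."""
    n, _ = V.shape
    # Centering (Eqn. 6): remove the per-row mean so V_C V_C^T is better conditioned.
    V_C = V - V.mean(dim=1, keepdim=True)
    # Compact spectral bounding (Eqn. 7): the initial singular values end up
    # closer to 1 than with the plain Frobenius-norm bounding of Eqn. 5.
    V_N = V_C / torch.sqrt((V_C @ V_C.t()).norm(p='fro'))
    S = V_N @ V_N.t()
    B = torch.eye(n, dtype=V.dtype, device=V.device)
    for _ in range(T):
        B = 0.5 * (3.0 * B - B @ B @ B @ S)
    return B @ V_N
```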

Figure 3: Analysis of speeding up Newton’s iteration. The entries of proxy matrix are sampled from the Gaussian distribution . (a) Comparison of convergence; (b) Comparison of the distribution of the eigenvalues of at iteration .
Figure 4: Unified row and column orthogonalization. The entries of proxy matrix are sampled from the Gaussian distribution . (a) Orthogonalization comparison between and ; (b) The distribution of the eigenvalues of with different iterations.

3.4 Unified Row and Column Orthogonalization

In the previous sections, we assumed $n \leq d$ and obtained a row-orthogonal solution. One question that remains is how to handle the situation when $n > d$. When $n > d$, the rows of $W$ cannot all be orthogonal, because the rank of $W$ is at most $d < n$. In this situation, full orthogonalization using the eigen-decomposition-based solution (Eqn. 3) may cause numerical instability, since there exist at least $n - d$ zero eigenvalues of the covariance matrix. These zero eigenvalues lead to numerical instability during back-propagation (when element-wise multiplying the scaling matrix $K$, as discussed in Section 3.1).

Our orthogonalization solution by Newton’s iteration can avoid such problems, since there are no operations relating to dividing the eigenvalues of the covariance matrix. Therefore, our ONI can solve Eqn. 3.1 under the situation . More interestingly, our method can achieve column orthogonality for the weight matrix (that is, ) by solving Eqn. 3.1 directly under . Figure 4 shows the convergence behaviors of the row and column orthogonalizations. We observe ONI stretches the non-zero eigenvalues of the covariance matrix towards 1 in an iterative manner, and thus equivalently stretches the singular values of the weight matrix towards 1. Therefore, it ensures column orthogonality under the situation . Our method unifies the row and column orthogonalizations, and we further show in Section 3.5 that they both benefit in preserving the norm/distribution of the activation/gradient when training DNNs.

Note that, for $n > d$, Huang et al. [27] proposed group based methods that divide the weights into groups of at most $d$ rows and perform orthogonalization over each group, such that the weights in each group are row orthogonal. However, such a method cannot ensure the whole matrix to be either row or column orthogonal (see Appendix C for details).

3.5 Controlling Orthogonality

One remarkable property of the orthogonal matrix is that it can preserve the norm and distribution of the activation for a linear transformation, given appropriate assumptions. Such properties are described in the following theorem.

Theorem 1.

Let $h = Wx$, where $W \in \mathbb{R}^{n \times d}$ and $x \in \mathbb{R}^{d}$. Assume: (1) $\mathbb{E}(x) = 0$ and $\mathrm{cov}(x) = \sigma_1^2 I$; and (2) $\mathbb{E}(\frac{\partial \mathcal{L}}{\partial h}) = 0$ and $\mathrm{cov}(\frac{\partial \mathcal{L}}{\partial h}) = \sigma_2^2 I$. If $W W^T = W^T W = I$ (i.e., $n = d$), we have the following properties: (1) $\|h\| = \|x\|$; (2) $\mathbb{E}(h) = 0$, $\mathrm{cov}(h) = \sigma_1^2 I$; (3) $\|\frac{\partial \mathcal{L}}{\partial x}\| = \|\frac{\partial \mathcal{L}}{\partial h}\|$; (4) $\mathbb{E}(\frac{\partial \mathcal{L}}{\partial x}) = 0$, $\mathrm{cov}(\frac{\partial \mathcal{L}}{\partial x}) = \sigma_2^2 I$. In particular, if $n < d$ and $W W^T = I$, properties (2) and (3) hold; if $n > d$ and $W^T W = I$, properties (1) and (4) hold.

The proof is provided in Appendix E. Theorem 1 shows the benefits of orthogonality in preventing gradients from exploding/vanishing, from an optimization perspective. Besides, the orthonormal weight matrix can be viewed as a point on the embedded Stiefel manifold with a reduced number of degrees of freedom [1, 27], which regularizes the neural network and can improve the model's generalization [1, 27].

However, this regularization may harm the representational capacity and result in degenerated performance, as shown in [27] and observed in our experiments. Therefore, controlling orthogonality is necessary to balance the increase in optimization benefit and reduction in representational capacity, when training DNNs. Our ONI can effectively control orthogonality using different numbers of iterations.

3.6 Learning Orthogonal Weight Matrices in DNNs

Based on Algorithm 2 and its corresponding backward pass, we can wrap our method in linear modules [57, 27], to learn filters/weights with orthogonality constraints for DNNs. After training, we calculate the weight matrix and save it for inference, as in the standard module.

Layer-wise Dynamical Isometry

Theorem 1 shows that the orthogonal matrix has remarkable properties for preserving the norms/distributions of activations during the forward and backward passes, for linear transformations. However, in practice, we need to consider the nonlinearity as well. Here, we show that we can use an additional constant of $\sqrt{2}$ to scale the magnitude of the weight matrix for the ReLU nonlinearity [49], such that the output-input Jacobian matrix of each layer achieves dynamical isometry.

Theorem 2.

Let $h = \max(Wx, 0)$, where $W \in \mathbb{R}^{n \times d}$ and $x \in \mathbb{R}^{d}$. Assume $x$ follows a normal distribution with $\mathbb{E}(x) = 0$ and $\mathrm{cov}(x) = I$. Denote the Jacobian matrix as $J = \frac{\partial h}{\partial x}$. If $W W^T = 2I$, we have $\mathbb{E}_x(J J^T) = I$.

The proof is shown in Appendix E. We therefore propose to multiply the orthogonal weight matrix by a factor of $\sqrt{2}$ for networks with ReLU activations. We experimentally show that this improves training efficiency in Section 4.1. Note that Theorems 1 and 2 are based on the assumption that the layer-wise input is Gaussian. Such a property can be approximately satisfied using batch normalization (BN) [32]. Besides, if we apply BN before the linear transformation, there is no need to apply it again after the linear module, since the normalized property of BN is preserved according to Theorem 1. We experimentally show that such a design improves performance in Section 4.1.3. A sketch of an ONI-wrapped linear layer with this scaling appears below.
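A hypothetical sketch of such a wrapped linear module (our code and naming, reusing the earlier oni_orthogonalize helper; the learnable per-filter scalar of the next paragraph is omitted for brevity):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ONILinear(nn.Linear):
    """Sketch of a linear module wrapped with ONI: the stored weight acts as
    the proxy V (shape out_features x in_features, i.e. n x d), is
    orthogonalized on the fly, and is scaled by sqrt(2) for ReLU networks
    (Theorem 2)."""
    def __init__(self, in_features, out_features, T=5, bias=True):
        super().__init__(in_features, out_features, bias=bias)
        self.T = T

    def forward(self, x):
        W = oni_orthogonalize(self.weight, T=self.T)   # from the earlier sketch
        return F.linear(x, math.sqrt(2.0) * W, self.bias)
```

After training, the orthogonalized (and scaled) weight would be computed once and saved for inference, as described at the start of this section.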

Learnable Scalar

Following [27], we relax the constraint from orthonormal to orthogonal, i.e., $W W^T = \Lambda$, where $\Lambda$ is a diagonal matrix. This can be viewed as allowing the orthogonal filters to make different contributions to the activations. To achieve this, we use a learnable scalar parameter per filter to fine-tune the norm of each filter [57, 27].

Convolutional Layer

For a convolutional layer parameterized by weights $W \in \mathbb{R}^{n \times d \times F_h \times F_w}$, where $F_h$ and $F_w$ are the height and width of the filter, we reshape $W$ as $W' \in \mathbb{R}^{n \times (d \cdot F_h \cdot F_w)}$, and the orthogonalization is executed over the unrolled weight matrix $W'$. A sketch is given below.
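As a sketch (our code, with the same caveats as the earlier examples), the unrolling for a convolutional layer can look as follows:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ONIConv2d(nn.Conv2d):
    """Sketch of a conv layer wrapped with ONI: the 4-D proxy weight
    (n, d, Fh, Fw) is unrolled to n x (d*Fh*Fw), orthogonalized, and
    reshaped back before the convolution."""
    def __init__(self, *args, T=5, **kwargs):
        super().__init__(*args, **kwargs)
        self.T = T

    def forward(self, x):
        n = self.weight.shape[0]
        V = self.weight.reshape(n, -1)                # unrolled proxy matrix
        W = oni_orthogonalize(V, T=self.T)            # from the earlier sketch
        W = W.reshape_as(self.weight)
        return F.conv2d(x, W, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)
```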

Figure 5: Effects of maintaining orthogonality. Experiments are performed on a 10-layer MLP. (a) The training (solid lines) and testing (dashed lines) errors with respect to the training epochs; (b) the distribution of the eigenvalues of the weight matrix of the 5th layer, at the 200th iteration.
Figure 6: Effects of scaling the orthogonal weights. '-NS' indicates orthogonalization without scaling by $\sqrt{2}$. We evaluate the training (solid lines) and testing (dashed lines) errors on (a) a 6-layer MLP and (b) a 20-layer MLP.
Computational Complexity

Consider a convolutional layer with filters $W \in \mathbb{R}^{n \times d \times F_h \times F_w}$ and a mini-batch of size $m$ producing $h \times w$ feature maps. The computational cost of our method, coming mainly from Lines 3, 6 and 8 in Algorithm 1, is $O(n^2 d F_h F_w + T n^3)$ for each iteration during training. The relative cost of ONI over the convolutional layer itself (which costs $O(n d F_h F_w m h w)$) is therefore small for typical mini-batch sizes and spatial resolutions. During inference, we use the orthogonalized weight matrix $W$ directly, and thus do not introduce additional computational or memory costs. We provide wall-clock times in Appendix D.

methods g=2,k=1 g=2,k=2 g=2,k=3 g=3,k=1 g=3,k=2 g=3,k=3 g=4,k=1 g=4,k=2 g=4,k=3
plain 11.34 9.84 9.47 10.32 8.73 8.55 10.66 9.00 8.43
WN 11.19 9.55 9.49 10.26 9.26 8.19 9.90 9.33 8.90
OrthInit 10.57 9.49 9.33 10.34 8.94 8.28 10.35 10.6 9.39
OrthReg 12.01 10.33 10.31 9.78 8.69 8.61 9.39 7.92 7.24
OLM-1 [27] 10.65 8.98 8.32 9.23 8.05 7.23 9.38 7.45 7.04
OLM-$\sqrt{2}$ 10.15 8.32 7.80 8.74 7.23 6.87 8.02 6.79 6.56
ONI 9.95 8.20 7.73 8.64 7.16 6.70 8.27 6.72 6.52
Table 1: Test errors (%) on VGG-style networks for CIFAR-10 classification. The results are averaged over three independent runs.

4 Experiments

4.1 Image Classification

We evaluate our ONI on the Fashion-MNIST [69], CIFAR-10 [37] and ImageNet [55] datasets. We provide an ablation study on the iteration number of ONI in Section 4.1.4. Due to space limitations, we only provide essential components of the experimental setup; for more details, please refer to Appendix F. The code is available at https://github.com/huangleiBuaa/ONI.

4.1.1 MLPs on Fashion-MNIST

We use an MLP with ReLU activations [49] and vary the depth. The number of neurons in each layer is 256. We employ stochastic gradient descent (SGD) with a batch size of 256, and the learning rates are selected, on a validation set held out from the training set, from a small grid of candidate values.

Maintaining Orthogonality

We first show that maintaining orthogonality can improve the training performance. We compare two baselines: 1) 'plain', the original network; and 2) 'OrthInit', in which orthogonal initialization [58] is used. The training performances are shown in Figure 5 (a). We observe that orthogonal initialization improves training efficiency in the initial phase (compared with 'plain'), after which the benefits of orthogonality degenerate (compared with 'ONI') due to the updating of the weights (Figure 5 (b)).

Effects of Scaling

We experimentally show the effects of initially scaling the orthogonal weight matrix by a factor of $\sqrt{2}$. We also apply this technique to the 'OLM' method [27], in which the orthogonalization is solved by eigen decomposition. We refer to 'OLM-NS'/'ONI-NS' as 'OLM'/'ONI' without scaling by $\sqrt{2}$. The results are shown in Figure 6. We observe that the scaling technique has no significant effect on shallow neural networks, e.g., the 6-layer MLP. However, for deeper neural networks, it produces significant performance boosts. For example, for the 20-layer MLP, neither 'OLM' nor 'ONI' can converge without the additional scaling factor, because the activations and gradients vanish exponentially (see Appendix F.1). Besides, our 'ONI' has nearly identical performance to 'OLM', which indicates the effectiveness of our approximate orthogonalization with few iterations (e.g., 5).

4.1.2 CNNs on CIFAR-10

Method BatchSize=128 BatchSize=2
w/BN* [73] 6.61
Xavier Init* [27] 7.78
Fixup-init* [74] 7.24
w/BN 6.82 7.24
Xavier Init 8.43 9.74
GroupNorm 7.33 7.36
ONI 6.56 6.67
Table 2: Test errors (%) on a 110-layer residual network [22] without BN [32] on CIFAR-10. 'w/BN' indicates with BN. We report the median of five independent runs. Methods marked with '*' use the results reported in the cited papers.
VGG-Style Networks

Here, we evaluate ONI on VGG-style neural networks with $3 \times 3$ convolutional layers. The network starts with a convolutional layer of $32k$ filters, where $k$ is the varying width based on different configurations. We then sequentially stack three blocks, each with $g$ convolutional layers, and with filter numbers of $32k$, $64k$ and $128k$, respectively. We vary the depth with $g$ in $\{2, 3, 4\}$ and the width with $k$ in $\{1, 2, 3\}$. We use SGD with a momentum of 0.9 and a batch size of 128. The best initial learning rate is chosen over a validation set of 5,000 samples held out from the training set, and we divide the learning rate by 5 at 80 and 120 epochs, ending training at 160 epochs. We compare our 'ONI' to several baselines, including orthogonal initialization [58] ('OrthInit'), soft orthogonal constraints used as a penalty term [71] ('OrthReg'), weight normalization [57] ('WN'), 'OLM' [27] and the 'plain' network. Note that OLM [27] originally uses a scale of 1 (indicated as 'OLM-1'); we also apply the proposed scaling by $\sqrt{2}$ (indicated as 'OLM-$\sqrt{2}$').

Table 1 shows the results. 'ONI' and 'OLM-$\sqrt{2}$' have significantly better performance under all network configurations (different depths and widths), which demonstrates the beneficial effects of maintaining orthogonality during training. We also observe that 'ONI' and 'OLM-$\sqrt{2}$' converge faster than the other baselines in terms of training epochs (see Appendix F.2). Besides, our proposed 'ONI' achieves slightly better performance than 'OLM-$\sqrt{2}$' on average over all configurations. Note that we train 'OLM-$\sqrt{2}$' with the group based orthogonalization, as suggested in [27]. We also try full orthogonalization for 'OLM-$\sqrt{2}$'; however, we observe either performance degeneration or numerical instability (e.g., the eigen decomposition does not converge). We argue that the main reason is that full orthogonalization solved by OLM over-constrains the weight matrix, which harms the performance. Moreover, eigen-decomposition-based methods are more likely to suffer numerical instability in high-dimensional spaces, due to the element-wise multiplication of the matrix $K$ during back-propagation [43], as discussed in Section 3.1.

Method Top-1 (%) Top-5 (%) Time (min./epoch)
plain 27.47 9.08 97
WN 27.33 9.07 98
OrthInit 27.75 9.21 97
OrthReg 27.22 8.94 98
ONI 26.31 8.38 104
Table 3: Test errors (%) on the ImageNet validation set (single model and single-crop test) evaluated with VGG-16 [61]. The time cost per epoch is averaged over the training epochs.
Residual Network without Batch Normalization

Batch normalization (BN) is essential for stabilizing and accelerating the training [32] of DNNs [22, 26, 23, 63]. It is a standard configuration in residual networks [22]. However, it sometimes suffers from the small batch size problem [31, 68] and introduces too much stochasticity [65] when debugging neural networks. Several studies have tried to train deep residual networks without BN [60, 74]. Here, we show that, when using our ONI, the residual network without BN can also be well trained.

The experiments are executed on a 110-layer residual network (Res-110). We follow the same experimental setup as [22], except that we run the experiments on one GPU. We also compare against Xavier Init [16, 9] and group normalization (GN) [68]. ONI can be trained with a large learning rate of 0.1 and converges faster than BN in terms of training epochs (see Appendix F.2). We observe that ONI has slightly better test performance than BN (Table 2). Finally, we also test performance with a small batch size of 2. ONI continues to outperform BN in this case and, like GN [68], is not sensitive to the batch size.

4.1.3 Large-scale ImageNet Classification

ResNet w/o BN ResNet ResNetVar
Method Train Test Train Test Train Test
plain 31.76 33.84 29.33 29.64 28.82 29.56
ONI 27.05 31.17 29.28 29.57 28.12 28.92
Table 4: Ablation study on ImageNet with an 18-layer ResNet. We evaluate the top-1 training and test errors (%).
Test error (%) Time (min./epoch)
Method 50 101 50 101
ResNet 23.85 22.40 66 78
ResNet  + ONI 23.55 22.17 74 92
ResNetVar 23.94 22.76 66 78
ResNetVar + ONI 23.30 21.89 74 92
Table 5: Results on ImageNet with the 50- and 101-layer ResNets.

To further validate the effectiveness of our ONI on a large-scale dataset, we evaluate it on the ImageNet-2012 dataset. We keep almost all experimental settings the same as the publicly available PyTorch implementation [52]: we apply SGD with a momentum of 0.9 and a weight decay of 0.0001. We train for 100 epochs in total, setting the initial learning rate to 0.1 and lowering it by a factor of 10 at epochs 30, 60 and 90. For more details on the slight differences among architectures and methods, see Appendix F.3.

VGG Network

Table 3 shows the results on the 16-layer VGG [61]. Our 'ONI' outperforms 'plain', 'WN', 'OrthInit' and 'OrthReg' by a significant margin. Besides, 'ONI' can be trained with a large learning rate of 0.1, while the other methods cannot (their results are reported for an initial learning rate of 0.01). We also provide the running times in Table 3. The additional cost introduced by 'ONI' compared to 'plain' is negligible (roughly 7% per epoch, from Table 3).

Residual Network

We first perform an ablation study applying our ONI to an 18-layer residual network (ResNet) [22]. We use the original ResNet and the ResNet without BN [32]. We also consider the architecture with BN inserted after the nonlinearity, which we refer to as 'ResNetVar'. We observe that our ONI improves the performance of all three architectures, as shown in Table 4. One interesting observation is that ONI achieves the lowest training error on the ResNet without BN, which demonstrates its ability to facilitate optimization on large-scale datasets. We also observe that ONI shows no significant difference in performance compared to 'plain' on the original ResNet. One possible reason is that the BN module and residual connection are already well suited for information propagation, so ONI yields a lower net gain on such a large-scale classification task. However, on ResNetVar, ONI obtains clearly better performance than 'plain'. We attribute this boost to the orthogonal matrix's ability to achieve approximate dynamical isometry, as described in Theorem 2.

We also apply our ONI on a 50- and 101-layer residual network. The results are shown in Table 5. We again observe that ONI can improve the performance, without introducing significant computational cost.

Figure 7: Effects of the iteration number $T$ of the proposed ONI. (a) 6-layer MLP on Fashion-MNIST; (b) VGG-style network on CIFAR-10; (c) 18-layer ResNet on ImageNet.

4.1.4 Ablation Study on Iteration Number

ONI controls the spectrum of the weight matrix through the iteration number $T$, as discussed before. Here, we explore the effect of $T$ on the performance of ONI over different datasets and architectures. We consider three configurations: 1) the 6-layer MLP for Fashion-MNIST; 2) the VGG-style network for CIFAR-10; and 3) the 18-layer ResNet without BN for ImageNet. The corresponding experimental setups are the same as described before. We vary $T$ and show the results in Figure 7. Our primary observation is that using either too small or too large a $T$ degrades performance. This indicates that we need to control the magnitude of orthogonality to balance the increased optimization benefit against the diminished representational capacity. Empirically, a larger iteration number usually works best for networks without residual connections, whereas a smaller one usually works better for residual networks. We argue that the residual network itself already has good optimization properties [22], which reduces the optimization benefits of orthogonality.

Besides, we also observe that larger values of $T$ have nearly equivalent performance on simple datasets, e.g., Fashion-MNIST, as shown in Figure 7 (a). This suggests that amplifying the eigenbasis directions corresponding to small singular values cannot help further, even though a network with a fully orthogonalized weight matrix can still fit the dataset well. We further show the distributions of the singular values of the orthogonalized weight matrix in Appendix F.4.

Figure 8: Comparison of SN and ONI on DCGAN. (a) The FID with respect to training epochs. (b) Stability experiments on six configurations, described in [47].

4.2 Stabilizing Training of GANs

How to stabilize GAN training is an open research problem [17, 56, 19]. One pioneering work is spectral normalization (SN) [47], which maintains the Lipschitz continuity of a network by bounding the spectral norm (maximum singular value) of its weight matrices to 1. This technique has been extensively used in current GAN architectures [48, 75, 10, 38]. As stated before, our method not only bounds the maximum singular value to 1, but can also control the orthogonality, amplifying the other eigenbasis directions with increased iterations; moreover, orthogonal regularization itself is a useful technique for training GANs [10]. Here, we conduct a series of experiments on unsupervised image generation on CIFAR-10, and compare our method against the widely used SN [47].

Experimental Setup

We strictly follow the network architecture and training protocol reported in the SN paper [47]. We use both DCGAN [54] and ResNet [22, 19] architectures. We provide implementation details in Appendix G. We replace all the SN modules in the corresponding network with our ONI. Our main metric for evaluating the quality of generated samples is the Fréchet Inception Distance (FID) [25] (the lower the better). We also provide the corresponding Inception Score (IS) [56] in Appendix G.

DCGAN

We use the standard non-saturating function as the adversarial loss [17, 38] in the DCGAN architecture, following [47]. For optimization, we use the Adam optimizer [36] with the same default hyper-parameters as in [47] (learning rate, first and second momentum, and the number of discriminator updates per generator update). We train the network for 200 epochs with a batch size of 64 (nearly 200k generator updates) to determine whether it suffers from training instability. Figure 8 (a) shows the FID of SN and ONI when varying Newton's iteration number $T$ from 0 to 5. One interesting observation is that ONI with only the initial spectral bounding of Eqn. 7 (i.e., $T = 0$) can also stabilize training, even though its performance is worse than SN's. As $T$ increases, ONI achieves better performance than SN. This is because, based on our observations, ONI stretches the maximum eigenvalue to nearly 1 while simultaneously amplifying the other eigenvalues. Finally, we find that ONI achieves its best performance at a moderate iteration number, yielding a lower FID than SN. Further increasing $T$ harms training, possibly because too strong an orthogonalization downgrades the capacity of the network, as discussed in [47, 10].

We also conduct experiments to validate the stability of our proposed ONI under different experimental configurations: following [47], we use six configurations that vary the optimizer hyper-parameters and the number of discriminator updates per generator update (denoted by A-F; for more details please see Appendix G.1). Figure 8 (b) shows the results of SN and ONI (with $T = 2$) under these six configurations. We observe that our ONI is consistently better than SN.

Figure 9: Comparison of SN and ONI on ResNet GAN. We show the FID with respect to training epochs when using (a) the non-saturating loss and (b) the hinge loss.
ResNet GAN

For the experiments on the ResNet architecture, we use the same setup as for the DCGAN. Besides the standard non-saturating loss [17], we also evaluate the recently popularized hinge loss [42, 47, 10]. Figure 9 shows the results. We again observe that our ONI achieves better performance than SN with the ResNet architecture, both for the non-saturating loss and the hinge loss.

5 Conclusion

In this paper, we proposed an efficient and stable orthogonalization method by Newton’s iteration (ONI) to learn layer-wise orthogonal weight matrices in DNNs. We provided insightful analysis for ONI and demonstrated its ability to control orthogonality, which is a desirable property in training DNNs. ONI can be implemented as a linear layer and used to learn an orthogonal weight matrix, by simply substituting it for the standard linear module.

ONI can effectively bound the spectrum of a weight matrix (keeping all singular values at most 1 and controllably close to 1) during the course of training. This property makes ONI a potential tool for validating theoretical results relating to DNNs' generalization (e.g., the margin bounds shown in [6]) and for resisting attacks from adversarial examples [13]. Furthermore, the advantage of ONI in stabilizing training without BN (BN usually complicates theoretical analysis since it depends on the sampled mini-batch input with stochasticity [32, 29]) makes it possible to validate these theoretical arguments in realistic scenarios.

Acknowledgement We thank Anna Hennig and Ying Hu for their help with proofreading.

References

  • [1] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton, NJ, 2008.
  • [2] Pierre-Antoine Absil and Jerome Malick. Projection-like retractions on matrix manifolds. SIAM Journal on Optimization, 22(1):135–158, 2012.
  • [3] Jaweria Amjad, Zhaoyan Lyu, and Miguel RD Rodrigues. Deep learning for inverse problems: Bounds and regularizers. arXiv preprint arXiv:1901.11352, 2019.
  • [4] Martín Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In ICML, 2016.
  • [5] Nitin Bansal, Xiaohan Chen, and Zhangyang Wang. Can we gain more from orthogonality regularizations in training deep cnns? In NeurIPS, 2018.
  • [6] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In NeurIPS. 2017.
  • [7] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. Trans. Neur. Netw., 5(2):157–166, Mar. 1994.
  • [8] Dario A. Bini, Nicholas J. Higham, and Beatrice Meini. Algorithms for the matrix pth root. Numerical Algorithms, 39(4):349–378, Aug 2005.
  • [9] Nils Bjorck, Carla P Gomes, Bart Selman, and Kilian Q Weinberger. Understanding batch normalization. In NeurIPS. 2018.
  • [10] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019.
  • [11] Andrew Brock, Theodore Lim, James M. Ritchie, and Nick Weston. Neural photo editing with introspective adversarial networks. In ICLR, 2017.
  • [12] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cudnn: Efficient primitives for deep learning. CoRR, abs/1410.0759, 2014.
  • [13] Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. In ICML, 2017.
  • [14] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.
  • [15] Victor Dorobantu, Per Andre Stromhaug, and Jess Renteria. Dizzyrnn: Reparameterizing recurrent neural networks for norm-preserving backpropagation. CoRR, abs/1612.04035, 2016.
  • [16] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
  • [17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS. 2014.
  • [18] Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training imagenet in 1 hour. CoRR, abs/1706.02677, 2017.
  • [19] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In NeurIPS. 2017.
  • [20] Mehrtash Harandi and Basura Fernando. Generalized backpropagation, etude de cas: Orthogonality. CoRR, abs/1611.05927, 2016.
  • [21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
  • [22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV, 2016.
  • [24] Kyle Helfrich, Devin Willmott, and Qiang Ye. Orthogonal recurrent neural networks with scaled Cayley transform. In ICML, 2018.
  • [25] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS. 2017.
  • [26] Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
  • [27] Lei Huang, Xianglong Liu, Bo Lang, Adams Wei Yu, Yongliang Wang, and Bo Li. Orthogonal weight normalization: Solution to optimization over multiple dependent stiefel manifolds in deep neural networks. In AAAI, 2018.
  • [28] Lei Huang, Dawei Yang, Bo Lang, and Jia Deng. Decorrelated batch normalization. In CVPR, 2018.
  • [29] Lei Huang, Yi Zhou, Fan Zhu, Li Liu, and Ling Shao. Iterative normalization: Beyond standardization towards efficient whitening. In CVPR, 2019.
  • [30] Stephanie Hyland and Gunnar Rätsch. Learning unitary operators with help from u(n). In AAAI, 2017.
  • [31] Sergey Ioffe. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. In NeurIPS, 2017.
  • [32] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • [33] Catalin Ionescu, Orestis Vantzos, and Cristian Sminchisescu. Training deep networks with structured layers by matrix backpropagation. In ICCV, 2015.
  • [34] Kui Jia. Improving training of deep neural networks via singular value bounding. In CVPR, 2017.
  • [35] Li Jing, Çaglar Gülçehre, John Peurifoy, Yichen Shen, Max Tegmark, Marin Soljacic, and Yoshua Bengio. Gated orthogonal recurrent units: On learning to forget. CoRR, abs/1706.02761, 2017.
  • [36] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [37] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
  • [38] Karol Kurach, Mario Lučić, Xiaohua Zhai, Marcin Michalski, and Sylvain Gelly. A large-scale study on regularization and normalization in GANs. In ICML, 2019.
  • [39] Yann LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Effiicient backprop. In Neural Networks: Tricks of the Trade, 1998.
  • [40] José Lezama, Qiang Qiu, Pablo Musé, and Guillermo Sapiro. OLÉ: Orthogonal low-rank embedding - a plug and play geometric loss for deep learning. In CVPR, June 2018.
  • [41] Peihua Li, Jiangtao Xie, Qilong Wang, and Zilin Gao. Towards faster training of global covariance pooling networks by iterative matrix square root normalization. In CVPR, 2018.
  • [42] Jae Hyun Lim and Jong Chul Ye. Geometric gan. CoRR, abs/1705.02894, 2017.
  • [43] Tsung-Yu Lin and Subhransu Maji. Improved bilinear pooling with cnns. In BMVC, 2017.
  • [44] Per-Olov Löwdin. On the non-orthogonality problem connected with the use of atomic wave functions in the theory of molecules and crystals. The Journal of Chemical Physics, 18(3):365–375, 1950.
  • [45] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In in ICML Workshop on Deep Learning for Audio, Speech and Language Processing, 2013.
  • [46] Dmytro Mishkin and Jiri Matas. All you need is a good init. In ICLR, 2016.
  • [47] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.
  • [48] Takeru Miyato and Masanori Koyama. cgans with projection discriminator. In ICLR, 2018.
  • [49] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, 2010.
  • [50] Mete Ozay and Takayuki Okatani. Optimization on submanifolds of convolution kernels in cnns. CoRR, abs/1610.07008, 2016.
  • [51] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In ICML, 2013.
  • [52] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NeurIPS Autodiff Workshop, 2017.
  • [53] Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. In NeurIPS. 2017.
  • [54] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
  • [55] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
  • [56] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, Xi Chen, and Xi Chen. Improved techniques for training gans. In NeurIPS, 2016.
  • [57] Tim Salimans and Diederik P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In NeurIPS, 2016.
  • [58] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. CoRR, abs/1312.6120, 2013.
  • [59] Nicol N. Schraudolph. Accelerated gradient descent by factor-centering decomposition. Technical report, 1998.
  • [60] Wenling Shang, Justin Chiu, and Kihyuk Sohn. Exploring normalization in deep residual networks with concatenated rectified linear units. In AAAI, 2017.
  • [61] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [62] Piotr A. Sokol and Il Memming Park. Information geometry of orthogonal initializations and training. CoRR, abs/1810.03785, 2018.
  • [63] Christian Szegedy, Sergey Ioffe, and Vincent Vanhoucke. Inception-v4, inception-resnet and the impact of residual connections on learning. CoRR, abs/1602.07261, 2016.
  • [64] C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  • [65] Mattias Teye, Hossein Azizpour, and Kevin Smith. Bayesian uncertainty estimation for batch normalized deep networks. In ICML, 2018.
  • [66] Eugene Vorontsov, Chiheb Trabelsi, Samuel Kadoury, and Chris Pal. On orthogonality and learning recurrent networks with long term dependencies. In ICML, 2017.
  • [67] Scott Wisdom, Thomas Powers, John Hershey, Jonathan Le Roux, and Les Atlas. Full-capacity unitary recurrent neural networks. In NeurIPS. 2016.
  • [68] Yuxin Wu and Kaiming He. Group normalization. In ECCV, 2018.
  • [69] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. CoRR, abs/1708.07747, 2017.
  • [70] Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S.Schoenholz, and Jeffrey Pennington. Dynamical isometry and a mean field theory of cnns: How to train 10,000-layer vanilla convolutional neural networks. In ICML, 2018.
  • [71] Di Xie, Jiang Xiong, and Shiliang Pu. All you need is beyond a good init: Exploring better solution for training extremely deep convolutional neural networks with orthonormality and modulation. In CVPR, 2017.
  • [72] Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, and Samuel S. Schoenholz. A mean field theory of batch normalization. In ICLR, 2019.
  • [73] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.
  • [74] Hongyi Zhang, Yann N. Dauphin, and Tengyu Ma. Fixup initialization: Residual learning without normalization. In ICLR, 2019.
  • [75] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In ICML, 2019.
  • [76] Liheng Zhang, Marzieh Edraki, and Guo-Jun Qi. CapProNet: Deep feature learning via orthogonal projections onto capsule subspaces. In NeurIPS. 2018.
  • [77] Jianping Zhou, Minh N. Do, and Jelena Kovacevic. Special paraunitary matrices, cayley transform, and multidimensional orthogonal filter banks. IEEE Trans. Image Processing, 15(2):511–519, 2006.

Appendix A Derivation of Back-Propagation

Given the layer-wise orthogonal weight matrix $W$, we can perform the forward pass to calculate the loss of the deep neural network (DNN). It is necessary to back-propagate through the orthogonalization transformation, because we aim to update the proxy parameters $V$. For illustration, we first restate the proposed orthogonalization by Newton's iteration (ONI) in Algorithm I. Given the gradient with respect to the orthogonalized weight matrix, $\frac{\partial \mathcal{L}}{\partial W}$, we aim to compute $\frac{\partial \mathcal{L}}{\partial V}$. The back-propagation is based on the chain rule. From Line 2 in Algorithm I, we have:

(A1)

where indicates the trace of the corresponding matrix and can be calculated from Lines 3 and 8 in Algorithm I:

(A2)

We thus need to calculate , which can be computed from Lines 5, 6 and 7 in Algorithm I:

(A3)

where and can be iteratively calculated from Line 6 in Algorithm I as follows:

(A4)

In summary, the back-propagation of Algorithm I is shown in Algorithm II.

1:  Input: proxy parameters $V \in \mathbb{R}^{n \times d}$ and iteration number $T$.
2:  Bounding $V$'s singular values: $V_N = \frac{V}{\|V\|_F}$.
3:  Calculate covariance matrix: $S = V_N V_N^T$.
4:  $B_0 = I$.
5:  for $t = 1$ to $T$ do
6:     $B_t = \frac{1}{2}(3 B_{t-1} - B_{t-1}^3 S)$.
7:  end for
8:  $W = B_T V_N$.
9:  Output: orthogonalized weight matrix: $W$.
Algorithm I Orthogonalization by Newton's Iteration.
1:  Input: and variables from respective forward pass: , , , .
2:  .
3:  for  down to 1 do
4:     
5:  end for
6:  .
7:  .
8:  .
9:  Output: .
Algorithm II Back-propagation of ONI.
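As a side note (our observation, not part of the paper's derivation): because every step of Algorithm I is a composition of differentiable matrix operations, an automatic-differentiation framework reproduces this backward pass without implementing Algorithm II by hand. A quick numerical check against the oni_orthogonalize sketch from Section 3.2, assuming double precision:

```python
import torch

V = torch.randn(8, 16, dtype=torch.double, requires_grad=True)
# gradcheck compares autograd's analytic gradients with finite differences.
assert torch.autograd.gradcheck(lambda v: oni_orthogonalize(v, T=5), (V,))
```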

We further derive the back-propagation of the accelerated ONI method with the centering and more compact spectral bounding operation, as described in Section 3.3 of the paper. For illustration, Algorithm III describes the forward pass of the accelerated ONI. Following the calculation in Algorithm II, we can obtain . To simplify the derivation, we represent Line 3 of Algorithm III as the following formulations:

(A5)
(A6)
(A7)

It’s easy to calculate from Eqn. A5 and Eqn. A7 as follows:

(A8)

where can be computed based on Eqn. A6 and Eqn. A7:

(A9)

Based on Line 2 in Algorithm III, we can achieve as follows:

(A10)

In summary, Algorithm IV describes the back-propagation of the Algorithm III.

1:  Input: proxy parameters $V \in \mathbb{R}^{n \times d}$ and iteration number $T$.
2:  Centering: $V_C = V - \frac{1}{d}(V \mathbf{1}) \mathbf{1}^T$.
3:  Bounding $V_C$'s singular values: $V_N = \frac{V_C}{\sqrt{\|V_C V_C^T\|_F}}$.
4:  Execute Steps 3 to 8 in Algorithm I.
5:  Output: orthogonalized weight matrix: $W$.
Algorithm III ONI with acceleration.

Appendix B Proof of Convergence Condition for Newton’s Iteration

In Section 3 of the paper, we show that bounding the spectrum of the proxy parameter matrix $V$ by

$\phi_1(V) = \frac{V}{\|V\|_F} \qquad \text{(A11)}$

and

$\phi_2(V) = \frac{V}{\sqrt{\|V V^T\|_F}} \qquad \text{(A12)}$

can satisfy the convergence condition of Newton's iteration:

$\|I - S\|_2 < 1, \qquad \text{(A13)}$

where $S = \phi(V)\phi(V)^T$ and the singular values of $V$ are nonzero. Here we prove this conclusion, and we also prove that $\sigma_{\max}(\phi_2(V)) \geq \sigma_{\max}(\phi_1(V))$.

Proof.

By definition, $\|V\|_F$ can be represented as $\sqrt{\mathrm{tr}(VV^T)}$. Given Eqn. A11, we calculate

$S = \phi_1(V)\phi_1(V)^T = \frac{VV^T}{\mathrm{tr}(VV^T)}. \qquad \text{(A14)}$

Let us denote $M = VV^T$ and let the eigenvalues of $M$ be $\lambda_1, \dots, \lambda_n$. We have $\lambda_i > 0$, since $M$ is a real symmetric matrix and the singular values of $V$ are nonzero. We also have $\mathrm{tr}(M) = \sum_j \lambda_j$, and the eigenvalues of $S$ are $\lambda_i / \sum_j \lambda_j \in (0, 1]$. Furthermore, the eigenvalues of $I - S$ are $1 - \lambda_i / \sum_j \lambda_j \in [0, 1)$, thus satisfying the convergence condition described by Eqn. A13.

Similarly, given Eqn. A12, we have $S = \phi_2(V)\phi_2(V)^T = \frac{VV^T}{\|VV^T\|_F}$ and its corresponding eigenvalues are $\lambda_i / \sqrt{\sum_j \lambda_j^2} \in (0, 1]$. Therefore, the eigenvalues of $I - S$ are $1 - \lambda_i / \sqrt{\sum_j \lambda_j^2} \in [0, 1)$, also satisfying the convergence condition described by Eqn. A13.

We have $\sigma_{\max}(\phi_1(V))^2 = \lambda_{\max} / \sum_j \lambda_j$ and $\sigma_{\max}(\phi_2(V))^2 = \lambda_{\max} / \sqrt{\sum_j \lambda_j^2}$. It is easy to demonstrate that $\sigma_{\max}(\phi_2(V)) \geq \sigma_{\max}(\phi_1(V))$, since $\sqrt{\sum_j \lambda_j^2} \leq \sum_j \lambda_j$. ∎

1:  Input: and variables from respective forward pass: , , , .
2:  Calculate from Line 2 to Line 7 in Algorithm II.
3:  Calculate and from Eqn. A5 and Eqn. A6.
4:  Calculate based on Eqn. A.
5:  Calculate based on Eqn. A8.
6:  Calculate based on Eqn. A10.
7:  Output: .
Algorithm IV Back-propagation of ONI with acceleration.

In Section 3 of the paper, we show that the Newton’s iteration by bounding the spectrum with Eqn. A11 is equivalent to the Newton’s iteration proposed in [29]. Here, we provide the details. In [29], they bound the covariance matrix by the trace of as . It’s clear that used in Algorithm I is equal to , based on Eqn. A14 shown in the previous proof.

configurations cudnn cudnn + ONI-T1 cudnn + ONI-T3 cudnn + ONI-T5 cudnn + ONI-T7
,  n=d=256,  m=256 118.6 122.1 122.9 124.4 125.7
,  n=d=256,  m=32 15.8 18.3 18.9 19.5 20.8
,  n=d=1024,  m=32 71.1 81.7 84.3 89.5 94.2
,  n=d=256,  m=256 28.7 31.5 32.1 33.7 34.6
,  n=d=256,  m=32 10.1 13 13.6 14.2 15.3
,  n=d=1024,  m=32 22.2 27.6 29.7 32.9 37.0
Table A1: Comparison of wall-clock time. We fix the spatial size of the input feature maps and vary the kernel size, the feature dimensions ($n = d$) and the batch size $m$. We evaluate the total wall-clock time of training for each iteration (forward pass + back-propagation pass). Note that 'cudnn + ONI-T5' indicates the 'cudnn' convolution wrapped in our ONI method, using an iteration number of 5.
methods $\|WW^T - I\|_F$ $\|W^TW - I\|_F$
ONI-Full 5.66 0
OLM-G32 8 5.66
OLM-G16 9.85 8.07
OLM-G8 10.58 8.94
Table A2: Evaluation of row and column orthogonality for the group based methods. The entries of the proxy matrix are sampled from a Gaussian distribution. We evaluate the row orthogonality $\|WW^T - I\|_F$ and the column orthogonality $\|W^TW - I\|_F$. 'OLM-G32' indicates the eigen decomposition based orthogonalization method described in [27], with a group size of 32.

Appendix C Orthogonality for Group Based Method

In Section 3.4 of the paper, we argue that group based methods cannot ensure the whole matrix to be either row or column orthogonal when $n > d$. Here we provide more details.

We follow the experiments described in Figure 3 of the paper, sampling the entries of the proxy matrix from a Gaussian distribution. We apply the eigen decomposition based orthogonalization method [27] with group size $G$ to obtain the orthogonalized matrix $W$, varying $G \in \{32, 16, 8\}$. We evaluate the corresponding row orthogonality $\|WW^T - I\|_F$ and column orthogonality $\|W^TW - I\|_F$. The results are shown in Table A2. We observe that the group based orthogonalization method cannot ensure the whole matrix to be either row or column orthogonal, while our ONI ensures column orthogonality. We also observe that the group based method shows degraded orthogonality as the group size decreases.

We also conduct an experiment for the case $n \leq d$, where we again sample the entries of the proxy matrix from a Gaussian distribution and vary the group size. Note that a group size equal to the number of rows corresponds to full orthogonalization. Figure A1 shows the distribution of the eigenvalues of $WW^T$. We again observe that the group based method cannot ensure the whole weight matrix to be row orthogonal. Furthermore, orthogonalization with a smaller group size tends to be worse.

Figure A1: The distribution of the eigenvalues of $WW^T$ with different group sizes. The entries of the proxy matrix are sampled from a Gaussian distribution.

Appendix D Comparison of Wall Clock Times

In Section 3.6 of the paper, we show that, given a convolutional layer with filters $W \in \mathbb{R}^{n \times d \times F_h \times F_w}$ and a mini-batch of size $m$, the relative computational cost of ONI over the convolutional layer itself is small. In this section, we compare the wall-clock time of a convolution wrapped with our ONI against the standard convolution. In this experiment, our ONI is implemented in Torch [14] and wrapped around the 'cudnn' convolution [12]. The experiments are run on a TITAN Xp.

We fix the spatial size of the input feature maps, and vary the kernel size, the feature dimensions ($n = d \in \{256, 1024\}$) and the batch size ($m \in \{32, 256\}$). Table A1 shows the wall-clock time under different configurations. We compare the standard 'cudnn' convolution (denoted as 'cudnn') and the 'cudnn' convolution wrapped with our ONI (denoted as 'cudnn + ONI').

We observe that our method introduces negligible computational cost for the configurations with a large batch size and moderate feature dimension. Based on the computational complexity analysis, the relative overhead grows with a smaller kernel size, a larger feature dimension and a smaller batch size. However, even under the worst configuration in Table A1 ($n = d = 1024$ and $m = 32$), our method with an iteration number of 5 ('cudnn + ONI-T5') costs at most about 1.5x the standard 'cudnn' convolution.

Appendix E Proof of Theorems

Here we prove the two theorems described in Sections 3.5 and 3.6 of the paper.

Theorem 1. Let $h = Wx$, where $W \in \mathbb{R}^{n \times d}$ and $x \in \mathbb{R}^{d}$. Assume: (1) $\mathbb{E}(x) = 0$ and $\mathrm{cov}(x) = \sigma_1^2 I$; and (2) $\mathbb{E}(\frac{\partial \mathcal{L}}{\partial h}) = 0$ and $\mathrm{cov}(\frac{\partial \mathcal{L}}{\partial h}) = \sigma_2^2 I$. If $W W^T = W^T W = I$ (i.e., $n = d$), we have the following properties: (1) $\|h\| = \|x\|$; (2) $\mathbb{E}(h) = 0$, $\mathrm{cov}(h) = \sigma_1^2 I$; (3) $\|\frac{\partial \mathcal{L}}{\partial x}\| = \|\frac{\partial \mathcal{L}}{\partial h}\|$; (4) $\mathbb{E}(\frac{\partial \mathcal{L}}{\partial x}) = 0$, $\mathrm{cov}(\frac{\partial \mathcal{L}}{\partial x}) = \sigma_2^2 I$. In particular, if $n < d$ and $W W^T = I$, properties (2) and (3) hold; if $n > d$ and $W^T W = I$, properties (1) and (4) hold.

Proof.

Based on $WW^T = I$ and $n = d$, we have that $W$ is a square orthogonal matrix, and thus $W^T W = I$. Besides, we have $\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial h} W$.¹

¹ We follow the common setup where the vectors are column vectors while their derivatives are row vectors.

(1) Therefore, we have

$\|h\|^2 = x^T W^T W x = x^T x = \|x\|^2. \qquad \text{(A15)}$

We thus get $\|h\| = \|x\|$.

(2) It is easy to calculate:

$\mathbb{E}(h) = W\,\mathbb{E}(x) = 0. \qquad \text{(A16)}$

The covariance of $h$ is given by