ONI
PyTorch and Torch implementation for our accepted CVPR 2020 paper (Oral): Controllable Orthogonalization in Training DNNs
Orthogonality is widely used for training deep neural networks (DNNs) due to its ability to maintain all singular values of the Jacobian close to 1 and reduce redundancy in representation. This paper proposes a computationally efficient and numerically stable orthogonalization method using Newton's iteration (ONI), to learn a layer-wise orthogonal weight matrix in DNNs. ONI works by iteratively stretching the singular values of a weight matrix towards 1. This property enables it to control the orthogonality of a weight matrix by its number of iterations. We show that our method improves the performance of image classification networks by effectively controlling the orthogonality to provide an optimal tradeoff between optimization benefits and representational capacity reduction. We also show that ONI stabilizes the training of generative adversarial networks (GANs) by maintaining the Lipschitz continuity of a network, similar to spectral normalization (SN), and further outperforms SN by providing controllable orthogonality.
Training deep neural networks (DNNs) is often difficult due to the occurrence of vanishing/exploding gradients [7, 16, 51]. Preliminary research [39, 16] has suggested that weight initialization techniques are essential for avoiding these issues. As such, various works have tried to tackle the problem by designing weight matrices that can provide nearly equal variance to activations from different layers [16, 21]. Such a property can be further amplified by orthogonal weight initialization [58, 46, 62], which shows excellent theoretical results in convergence due to its ability to obtain a DNN’s dynamical isometry [58, 53, 72], i.e. all singular values of the input-output Jacobian are concentrated near 1. The improved performance of orthogonal initialization is empirically observed in [58, 46, 53, 70] and it makes training even 10,000-layer DNNs possible [70]. However, the initial orthogonality can be broken down and is not necessarily sustained throughout training [71].
Previous works have tried to maintain the orthogonal weight matrix by imposing an additional orthogonality penalty on the objective function, which can be viewed as a ‘soft orthogonal constraint’ [51, 66, 71, 5, 3]. These methods show improved performance in image classification [71, 76, 40, 5], resisting attacks from adversarial examples [13], neural photo editing [11] and training generative adversarial networks (GANs) [10, 47]. However, the introduced penalty works like a pure regularization, and whether or not the orthogonality is truly maintained or training benefited is unclear. Other methods have been developed to directly solve the ‘hard orthogonal constraint’ [66, 5], either by Riemannian optimization [50, 20] or by orthogonal weight normalization [67, 27]. However, Riemannian optimization often suffers from training instability [20, 27], while orthogonal weight normalization [27] requires computationally expensive eigen decomposition, and the necessary back-propagation through this eigen decomposition may suffer from numerical instability, as shown in [33, 43].
We propose to perform orthogonalization by Newton’s iteration (ONI) [44, 8, 29] to learn an exact orthogonal weight matrix, which is computationally efficient and numerically stable. To further speed up the convergence of Newton’s iteration, we propose two techniques: 1) we perform centering to improve the conditioning of the proxy matrix; 2) we explore a more compact spectral bounding method to make the initial singular values of the proxy matrix closer to 1.
We provide an insightful analysis and show that ONI works by iteratively stretching the singular values of the weight matrix towards 1 (Figure 1). This property makes ONI work well even if the weight matrix is singular (with multiple zero singular values), under which the eigen decomposition based method [27] often suffers from numerical instability [33, 43]. Moreover, we show that controlling orthogonality is necessary to balance the increase in optimization benefit against the reduction in representational capacity, and ONI can elegantly achieve this through its iteration number (Figure 1). Besides, ONI provides a unified solution for the row/column orthogonality, regardless of whether the weight matrix’s output dimension is smaller or larger than the input.
We also address practical strategies for effectively learning orthogonal weight matrices in DNNs. We introduce a constant of $\sqrt{2}$ to initially scale the orthonormal weight matrix so that the dynamical isometry [58] can be well maintained for deep ReLU networks [49]. We conduct extensive experiments on multilayer perceptrons (MLPs) and convolutional neural networks (CNNs). Our proposed method benefits the training and improves the test performance over multiple datasets, including ImageNet [55]. We also show that our method stabilizes the training of GANs and achieves improved performance on unsupervised image generation, compared to the widely used spectral normalization [47].
Orthogonal filters have been extensively explored in signal processing since they are capable of preserving activation energy and reducing redundancy in representation [77]. Saxe et al. [58] introduced an orthogonal weight matrix for DNNs and showed that it achieves approximate dynamical isometry [58] for deep linear neural networks, therefore significantly improving the optimization efficiency [46, 62]. Pennington et al. [53] further found that the nonlinear sigmoid network can also obtain dynamical isometry when combined with orthogonal weight initialization [53, 70, 72].
Research has also been conducted into using orthogonal matrices to avoid the gradient vanishing/explosion problems in recurrent neural networks (RNNs). These methods mainly focus on constructing square orthogonal matrices/unitary matrices for the hidden-to-hidden transformations in RNNs [4, 67, 15, 66, 30, 35, 24]. This is done by either constructing a decomposed unitary weight matrix with a restricted [4] or full representational capability [67, 24], or by using soft constraints [66]. Unlike these methods, which require a square weight matrix and are limited to hidden-to-hidden transformations in RNNs, our method is more general and can adapt to situations where the weight matrix is not square.
Our method is related to the methods that impose orthogonal penalties on the loss functions [51, 66, 5]. Most works propose to use soft orthogonality regularization under the standard Frobenius norm [51, 66, 5], though other alternative orthogonal penalties were explored in [5]. There are also methods that propose to bound the singular values with periodical projection [34]. Our method targets the ‘hard constraint’ and provides controllable orthogonality.
One way to obtain exact orthogonality is through Riemannian optimization methods [50, 20]. These methods usually require a retract operation [2] to project the updated weight back to the Stiefel manifold [50, 20], which may result in training instability for DNNs [20, 27]. Our method avoids this by employing re-parameterization to construct the orthogonal matrix [27]. Our work is closely related to orthogonal weight normalization [27], which also uses re-parameterization to design an orthogonal transformation. However, [27] solves the problem by computationally expensive eigen decomposition and may result in numeric instability [33, 43]. We use Newton’s iteration [44, 8], which is more computationally efficient and numerically stable. We further argue that fully orthogonalizing the weight matrix limits the network’s learning capacity, which may result in degenerated performance [47, 10]. Another related work is spectral normalization [47], which uses reparametrization to bound only the largest singular value at 1. Our method can effectively interpolate between spectral normalization and full orthogonalization, by altering the iteration number.
Newton’s iteration has also been employed in DNNs for constructing bilinear/second-order pooling [43, 41], or whitening the activations [29]. [43] and [41] focused on calculating the square root of the covariance matrix, while our method computes the square root inverse of the covariance matrix, like the work in [29]. However, our work has several main differences from [29]: 1) In [29], they aimed to whiten the activation [28] over batch data using Newton’s iteration, while our work seeks to learn the orthogonal weight matrix, which is an entirely different research problem [32, 56, 28]; 2) We further improve the convergence speed compared to the Newton’s iteration proposed in [29] by providing more compact bounds; 3) Our method can maintain the Lipschitz continuity of the network and thus has potential in stabilizing the training of GANs [47, 10]. It is unclear whether or not the work in [29] has such a property, since it is data-dependent normalization [32, 47, 10].
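Given the dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{M}$ composed of an input $x_i \in \mathbb{R}^{d}$ and its corresponding labels $y_i \in \mathbb{R}^{c}$, we represent a standard feed-forward neural network as a function $f(x; \theta)$ parameterized by $\theta$. $f(x; \theta)$ is a composition of $L$ simple nonlinear functions. Each of these consists of a linear transformation $\hat{h}^{l} = W^{l} h^{l-1} + b^{l}$ with learnable weights $W^{l} \in \mathbb{R}^{n_l \times d_l}$ and biases $b^{l} \in \mathbb{R}^{n_l}$, followed by an element-wise nonlinearity: $h^{l} = \varphi(\hat{h}^{l})$. Here $l \in \{1, \dots, L\}$ indexes the layers. We denote the learnable parameters as $\theta = \{W^{l}, b^{l} \mid l = 1, \dots, L\}$. Training neural networks involves minimizing the discrepancy between the desired output $y$ and the predicted output $f(x; \theta)$, described by a loss function $\mathcal{L}(y, f(x; \theta))$. Thus, the optimization objective is: $\theta^{*} = \arg\min_{\theta} \mathbb{E}_{(x, y) \in \mathcal{D}}[\mathcal{L}(y, f(x; \theta))]$.
Figure 2: Convergence behavior of ONI. The entries of the proxy matrix $V$ are sampled from the Gaussian distribution. We show (a) the magnitude of the orthogonality, measured as $\delta = \|W W^{\top} - I\|_F$, with respect to the iterations and (b) the distribution (log scale) of the eigenvalues of $W W^{\top}$ with different iterations.
This paper starts with learning orthogonal filter banks (row orthogonalization of a weight matrix) for deep neural networks (DNNs). We assume $n \le d$ for simplicity, and will discuss the situation where $n > d$ in Section 3.4. This problem is formulated in [27] as an optimization with layer-wise orthogonal constraints, as follows:
$$\theta^{*} = \arg\min_{\theta} \mathbb{E}_{(x, y) \in \mathcal{D}}[\mathcal{L}(y, f(x; \theta))] \quad \text{s.t.} \quad W^{l} (W^{l})^{\top} = I, \quad l = 1, \dots, L. \tag{1}$$
To solve this problem directly, Huang et al. [27] proposed to use the proxy parameters $V$ and construct the orthogonal weight matrix $W$ by minimizing them in a Frobenius norm over the feasible transformation sets, where the objective is:
$$\min_{\phi(V)} \operatorname{tr}\big((W - V)(W - V)^{\top}\big) \quad \text{s.t.} \quad W = \phi(V) \ \text{and} \ W W^{\top} = I. \tag{2}$$
They solved this in a closed-form, with the orthogonal transformation as:
$$W = \phi(V) = D \Lambda^{-1/2} D^{\top} V, \tag{3}$$
where $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$ and $D$ are the eigenvalues and corresponding eigenvectors of the covariance matrix $S = V V^{\top}$. Given the gradient $\partial \mathcal{L} / \partial W$, back-propagation must pass through the orthogonal transformation to calculate $\partial \mathcal{L} / \partial V$ for updating $V$. The closed formulation is concise; however, it encounters the following problems in practice: 1) Eigen decomposition is required, which is computationally expensive, especially on GPU devices [43]; 2) The back-propagation through the eigen decomposition requires the element-wise multiplication of a matrix $K$ [27], whose elements are given by $K_{i,j} = \frac{1}{\lambda_i - \lambda_j}$, where $i \ne j$. This may cause numerical instability when there exist equal eigenvalues of $S$, which is discussed in [33, 43] and observed in our preliminary experiments, especially for high-dimensional space.
Newton’s iteration calculates $S^{-1/2}$ as follows:
$$B_0 = I, \qquad B_t = \tfrac{1}{2}\left(3 B_{t-1} - B_{t-1}^{3} S\right), \quad t = 1, 2, \dots, T, \tag{4}$$
where $T$ is the number of iterations. Under the condition that $\|I - S\|_2 < 1$, $B_T$ will converge to $S^{-1/2}$ [8, 29].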
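The iteration can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `newton_inverse_sqrt`, the hand-picked matrix `V`, and the choice of 10 iterations are assumptions made for the example, and the update rule is taken to be $B_t = \tfrac{1}{2}(3B_{t-1} - B_{t-1}^{3}S)$ with $B_0 = I$.

```python
import numpy as np

def newton_inverse_sqrt(S, num_iters):
    """Approximate S^{-1/2} via B_t = (3 B_{t-1} - B_{t-1}^3 S) / 2, with B_0 = I."""
    B = np.eye(S.shape[0])
    for _ in range(num_iters):
        B = 0.5 * (3.0 * B - B @ B @ B @ S)
    return B

# Small hand-picked proxy matrix (illustrative values only).
V = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
V = V / np.linalg.norm(V)      # Frobenius scaling keeps the eigenvalues of S in (0, 1]
S = V @ V.T                    # covariance matrix of the proxy parameters
B = newton_inverse_sqrt(S, num_iters=10)
W = B @ V                      # approximately row-orthogonal weight matrix
residual = np.linalg.norm(W @ W.T - np.eye(2))
```

With the scaling applied before the loop, `residual` should be close to zero; skipping the scaling generally violates the convergence condition and the iterates can diverge.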
$V$ in Eqn. 3.1 can be initialized to ensure that $S$ initially satisfies the convergence condition, e.g. ensuring $0 \le \sigma(V) \le 1$, where $\sigma(V)$ are the singular values of $V$. However, the condition is very likely to be violated when training DNNs, since $V$ varies.
To address this problem, we propose to maintain another proxy parameter $Z$ and conduct a transformation $V = \phi_N(Z)$, such that $0 \le \sigma(V) \le 1$, inspired by the re-parameterization method [57, 27]. One straightforward way to ensure $0 \le \sigma(V) \le 1$ is to divide $Z$ by its spectral norm, as the spectral normalization method does [47]. However, it is computationally expensive to accurately calculate the spectral norm, since singular value decomposition is required. We thus propose to divide $Z$ by its Frobenius norm to perform spectral bounding:
$$V = \phi_N(Z) = \frac{Z}{\|Z\|_F}. \tag{5}$$
It is easy to demonstrate that Eqn. 5 satisfies the convergence condition of Newton’s iteration, and we show that this method is equivalent to the Newton’s iteration proposed in [29] (see Appendix B for details). Algorithm 1 describes the proposed method, referred to as Orthogonalization by Newton’s Iteration (ONI), and its corresponding back-propagation is shown in Appendix A. We find that Algorithm 1 converges well (Figure 2). However, the concern is the speed of convergence, since 10 iterations are required in order to obtain a good orthogonalization. We thus further explore methods to speed up the convergence of ONI.
Our Newton’s iteration proposed for obtaining the orthogonal matrix $W$ works by iteratively stretching the singular values of $V$ towards 1, as shown in Figure 2 (b). The speed of convergence depends on how close the singular values of $V$ initially are to 1 [8]. We observe that the following factors benefit the convergence of Newton’s iteration: 1) The singular values of $Z$ have a balanced distribution, which can be evaluated by the condition number of the matrix $Z$; 2) The singular values of $V$ should be as close to 1 as possible after spectral bounding (Eqn. 5).
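The stretching behavior can be made explicit with a short calculation. The notation is assumed here for illustration: $V = U \Sigma Q^{\top}$ is a thin singular value decomposition of the bounded proxy matrix with singular values $\sigma_i \in [0, 1]$, $S = V V^{\top}$, the Newton iterates are $B_0 = I$ and $B_t = \tfrac{1}{2}(3B_{t-1} - B_{t-1}^{3} S)$, and $w_t^{(i)}$ denotes the $i$-th singular value of $B_t V$.

```latex
\begin{align*}
S = V V^\top = U \Sigma^2 U^\top
  &\;\Longrightarrow\; B_t = U \,\mathrm{diag}\big(b_t^{(i)}\big)\, U^\top,
  \qquad b_0^{(i)} = 1,\quad
  b_t^{(i)} = \tfrac{1}{2}\Big(3\, b_{t-1}^{(i)} - \big(b_{t-1}^{(i)}\big)^3 \sigma_i^2\Big), \\
w_t^{(i)} := b_t^{(i)} \sigma_i
  &\;\Longrightarrow\; w_t^{(i)} = \tfrac{1}{2}\Big(3\, w_{t-1}^{(i)} - \big(w_{t-1}^{(i)}\big)^3\Big),
  \qquad w_0^{(i)} = \sigma_i, \\
1 - w_t^{(i)} &= \tfrac{1}{2}\big(1 - w_{t-1}^{(i)}\big)^2 \big(2 + w_{t-1}^{(i)}\big).
\end{align*}
```

Under these assumptions, every singular value in $(0, 1)$ increases monotonically towards 1, and once the gap to 1 is small it is roughly squared at each step (up to a factor of at most $3/2$). A zero singular value stays at zero, which is consistent with the iteration remaining well defined for singular matrices.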
To achieve more balanced distributions for the eigenvalues of $S$, we perform a centering operation over the proxy parameters $Z$, as follows:
$$Z_c = Z - \frac{1}{d} Z \mathbf{1} \mathbf{1}^{\top}. \tag{6}$$
The orthogonal transformation is then performed over the centered parameters $Z_c$. As shown in [39, 59], the covariance matrix of the centered matrix $Z_c$ is better conditioned than that of $Z$. We also experimentally observe that orthogonalization over centered parameters (indicated as ‘ONI+Center’) produces larger singular values on average at the initial stage (Figure 3 (b)), and thus converges faster than the original ONI (Figure 3 (a)).
To achieve larger singular values of $V$ after spectral bounding, we seek a more compact spectral bounding factor $f(Z)$ such that $f(Z) \le \|Z\|_F$ and $V = Z / f(Z)$ satisfies the convergence condition. We find that $f(Z) = \sqrt{\|Z Z^{\top}\|_F}$ satisfies the requirements, which is demonstrated in Appendix B. We thus perform spectral bounding based on the following formulation:
$$V = \phi_N(Z) = \frac{Z}{\sqrt{\|Z Z^{\top}\|_F}}. \tag{7}$$
More compact spectral bounding (CSB) is achieved using Eqn. 7, compared to Eqn. 5. For example, assuming that $Z$ has $n$ equivalent singular values, the initial singular values of $V$ after spectral bounding will be $n^{-1/4}$ when using Eqn. 7, while $n^{-1/2}$ when using Eqn. 5. We also experimentally observe that using Eqn. 7 (denoted with ‘+CSB’ in Figure 3) results in a significantly faster convergence.
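The equal-singular-value example can be checked directly. The symbols are assumptions introduced for illustration: $Z \in \mathbb{R}^{n \times d}$ with $n \le d$ has $n$ equal singular values $s > 0$, Eqn. 5 is read as division by $\|Z\|_F$, and Eqn. 7 is read as division by $\sqrt{\|Z Z^{\top}\|_F}$.

```latex
\begin{align*}
\|Z\|_F &= \Big(\textstyle\sum_{i=1}^{n} s^2\Big)^{1/2} = s\sqrt{n}
  &&\Longrightarrow\quad \sigma_i\!\left(\frac{Z}{\|Z\|_F}\right) = n^{-1/2}, \\
\sqrt{\|Z Z^\top\|_F} &= \Big(\textstyle\sum_{i=1}^{n} s^4\Big)^{1/4} = s\, n^{1/4}
  &&\Longrightarrow\quad \sigma_i\!\left(\frac{Z}{\sqrt{\|Z Z^\top\|_F}}\right) = n^{-1/4}.
\end{align*}
```

For instance, with $n = 64$ the initial singular values would be $0.125$ under Frobenius bounding and roughly $0.354$ under the compact bound, so Newton’s iteration starts noticeably closer to 1.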
Algorithm 2 describes the accelerated ONI method with centering and more compact spectral bounding (Eqn. 7).
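A minimal NumPy sketch of the accelerated procedure, under stated assumptions, is given below. It is not the authors' Algorithm 2 verbatim: the function names `oni` and `orthogonality_gap`, the keyword flags, and the matrix sizes are hypothetical, centering is taken to mean subtracting each row's mean, and the compact bound is taken to be $\sqrt{\|Z Z^{\top}\|_F}$. Only the forward pass is shown.

```python
import numpy as np

def oni(Z, num_iters=5, center=True, compact_bound=True):
    """Sketch of orthogonalization by Newton's iteration for a proxy matrix Z of shape (n, d)."""
    if center:
        Z = Z - Z.mean(axis=1, keepdims=True)       # centering: subtract each row's mean
    if compact_bound:
        scale = np.sqrt(np.linalg.norm(Z @ Z.T))    # compact bound: sqrt(||Z Z^T||_F)
    else:
        scale = np.linalg.norm(Z)                   # Frobenius bound: ||Z||_F
    V = Z / scale
    S = V @ V.T
    B = np.eye(Z.shape[0])
    for _ in range(num_iters):                      # Newton's iteration towards S^{-1/2}
        B = 0.5 * (3.0 * B - B @ B @ B @ S)
    return B @ V

def orthogonality_gap(W):
    """Frobenius distance between W W^T and the identity."""
    return np.linalg.norm(W @ W.T - np.eye(W.shape[0]))

rng = np.random.default_rng(0)
Z = rng.standard_normal((8, 32))                    # n = 8, d = 32 (illustrative sizes)
gaps_csb = [orthogonality_gap(oni(Z, T)) for T in range(6)]
gaps_fro = [orthogonality_gap(oni(Z, T, compact_bound=False)) for T in range(6)]
```

In this sketch the gap shrinks as the iteration number grows, and for the same number of iterations the compact bound leaves a smaller gap than the Frobenius bound, which mirrors the role of the iteration number as a knob for the degree of orthogonality.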
In previous sections, we assume $n \le d$, and obtain an orthogonalization solution. One question that remains is how to handle the situation when $n > d$. When $n > d$, the rows of $W$ cannot be orthogonal, because the rank of $W$ is less than or equal to $d$. Under this situation, full orthogonalization using the eigenvalue decomposition based solution (Eqn. 3) may cause numerical instability, since there exist at least $n - d$ zero eigenvalues for the covariance matrix. These zero eigenvalues specifically lead to numerical instability during back-propagation (when element-wisely multiplying the scaling matrix $K$, as discussed in Section 3.1).
Our orthogonalization solution by Newton’s iteration can avoid such problems, since there are no operations relating to dividing the eigenvalues of the covariance matrix. Therefore, our ONI can solve Eqn. 3.1 under the situation $n > d$. More interestingly, our method can achieve column orthogonality for the weight matrix $W$ (that is, $W^{\top} W = I$) by solving Eqn. 3.1 directly under $n > d$. Figure 4 shows the convergence behaviors of the row and column orthogonalizations. We observe ONI stretches the non-zero eigenvalues of the covariance matrix towards 1 in an iterative manner, and thus equivalently stretches the singular values of the weight matrix towards 1. Therefore, it ensures column orthogonality under the situation $n > d$. Our method unifies the row and column orthogonalizations, and we further show in Section 3.5 that they both benefit in preserving the norm/distribution of the activation/gradient when training DNNs.
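The column-orthogonal case can be illustrated on a tiny tall matrix. The sketch below is an assumption-laden example, not the authors' code: `oni_frobenius` and `Z_tall` are hypothetical names, Frobenius-norm bounding is used, and centering is omitted for simplicity (subtracting row means would remove one direction from a matrix with only $d = 2$ columns).

```python
import numpy as np

def oni_frobenius(Z, num_iters):
    """ONI with Frobenius-norm spectral bounding (no centering), for Z of shape (n, d)."""
    V = Z / np.linalg.norm(Z)
    S = V @ V.T                       # (n, n); rank-deficient when n > d
    B = np.eye(Z.shape[0])
    for _ in range(num_iters):
        B = 0.5 * (3.0 * B - B @ B @ B @ S)
    return B @ V

Z_tall = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])       # n = 3 > d = 2, so S has one zero eigenvalue
W_tall = oni_frobenius(Z_tall, num_iters=7)
col_gap = np.linalg.norm(W_tall.T @ W_tall - np.eye(2))   # column orthogonality gap
row_gap = np.linalg.norm(W_tall @ W_tall.T - np.eye(3))   # cannot vanish: rank(W) <= 2
```

Here `col_gap` should be close to zero while `row_gap` stays near 1, since $W W^{\top}$ can at best be a rank-2 projection. No eigenvalue is ever inverted, so the zero eigenvalue of the covariance matrix causes no division problem.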
Note that, for $n > d$, Huang et al. [27] proposed the group based methods by dividing the weights $\{w_i\}_{i=1}^{n}$ into groups of size $N_G \le d$ and performing orthogonalization over each group, such that the weights in each group are row orthogonal. However, such a method cannot ensure the whole matrix $W$ to be either row or column orthogonal (see Appendix C for details).
One remarkable property of the orthogonal matrix is that it can preserve the norm and distribution of the activation for a linear transformation, given appropriate assumptions. Such properties are described in the following theorem.
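Theorem 1. Let $\hat{h} = W x$, where $W W^{\top} = I$ and $W \in \mathbb{R}^{n \times d}$. Assume: (1) $\mathbb{E}_x(x) = 0$, $\operatorname{cov}(x) = \sigma_1^2 I$, and (2) $\mathbb{E}_{\hat{h}}(\partial \mathcal{L} / \partial \hat{h}) = 0$, $\operatorname{cov}(\partial \mathcal{L} / \partial \hat{h}) = \sigma_2^2 I$. If $n = d$, we have the following properties: (1) $\|\hat{h}\| = \|x\|$; (2) $\mathbb{E}_{\hat{h}}(\hat{h}) = 0$, $\operatorname{cov}(\hat{h}) = \sigma_1^2 I$; (3) $\|\partial \mathcal{L} / \partial x\| = \|\partial \mathcal{L} / \partial \hat{h}\|$; (4) $\mathbb{E}_x(\partial \mathcal{L} / \partial x) = 0$, $\operatorname{cov}(\partial \mathcal{L} / \partial x) = \sigma_2^2 I$. In particular, if $n < d$, properties (2) and (3) hold; if $n > d$, properties (1) and (4) hold.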
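The norm-preservation part of the theorem can be checked numerically for the row-orthonormal case $n < d$. The snippet below is only an illustration under assumptions: it builds a row-orthonormal matrix with a QR factorization instead of ONI, the sizes and variable names are arbitrary, and it uses the fact that for $\hat{h} = W x$ the backward pass is $\partial \mathcal{L} / \partial x = W^{\top} \, \partial \mathcal{L} / \partial \hat{h}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 3, 5                                          # n < d: the rows can be orthonormal
Q, _ = np.linalg.qr(rng.standard_normal((d, n)))     # Q has orthonormal columns, shape (d, n)
W = Q.T                                              # W W^T = I_n

x = rng.standard_normal(d)                           # layer input
grad_out = rng.standard_normal(n)                    # stand-in for dL/dh_hat

# Forward pass: an orthogonal projection, so the norm cannot grow.
forward_ratio = np.linalg.norm(W @ x) / np.linalg.norm(x)
# Backward pass: dL/dx = W^T dL/dh_hat, whose norm is preserved exactly.
backward_ratio = np.linalg.norm(W.T @ grad_out) / np.linalg.norm(grad_out)
```

In this setting `backward_ratio` equals 1 up to rounding, while `forward_ratio` is at most 1, in line with the statement that the gradient norm is preserved when $n < d$ but the activation norm in general is not.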
The proof is provided in Appendix E. Theorem 1 shows the benefits of orthogonality in preventing gradients from exploding/vanishing, from an optimization perspective. Besides, the orthonormal weight matrix can be viewed as the embedded Stiefel manifold $\mathcal{O}^{n \times d}$ with a degree of freedom $nd - n(n+1)/2$ [1, 27], which regularizes the neural networks and can improve the model’s generalization [1, 27].
However, this regularization may harm the representational capacity and result in degenerated performance, as shown in [27] and observed in our experiments. Therefore, controlling orthogonality is necessary to balance the increase in optimization benefit and reduction in representational capacity, when training DNNs. Our ONI can effectively control orthogonality using different numbers of iterations.
Based on Algorithm 2 and its corresponding backward pass, we can wrap our method in linear modules [57, 27], to learn filters/weights with orthogonality constraints for DNNs. After training, we calculate the weight matrix and save it for inference, as in the standard module.
Theorem 1 shows that the orthogonal matrix has remarkable properties for preserving the norm/distributions of activations during the forward and backward passes, for linear transformations. However, in practice, we need to consider the nonlinearity function as well. Here, we show that we can use an additional constant to scale the magnitude of the weight matrix for ReLU nonlinearity [49], such that the output-input Jacobian matrix of each layer has dynamical isometry.
Theorem 2. Let $h = \max(0, W x)$, where $W W^{\top} = \sigma^2 I$ and $W \in \mathbb{R}^{n \times d}$. Assume $x$ is a normal distribution with $\mathbb{E}_x(x) = 0$, $\operatorname{cov}(x) = I$. Denote the Jacobian matrix as $J = \partial h / \partial x$. If $\sigma^2 = 2$, we have $\mathbb{E}_x(J J^{\top}) = I$.
The proof is shown in Appendix E. We propose to multiply the orthogonal weight matrix $W$ by a factor of $\sqrt{2}$ for networks with ReLU activation. We experimentally show this improves the training efficiency in Section 4.1. Note that Theorems 1 and 2 are based on the assumption that the layer-wise input is Gaussian. Such a property can be approximately satisfied using batch normalization (BN) [32]. Besides, if we apply BN before the linear transformation, there is no need to apply it again after the linear module, since the normalized property of BN is preserved according to Theorem 1. We experimentally show that such a process improves performance in Section 4.1.3.
Following [27], we relax the constraint of orthonormal to orthogonal, with $W W^{\top} = \Lambda$, where $\Lambda$ is the diagonal matrix. This can be viewed as the orthogonal filters having different contributions to the activations. To achieve this, we propose to use a learnable scalar parameter to fine-tune the norm of each filter [57, 27].
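The $\sqrt{2}$ factor proposed for ReLU networks can be motivated by a short calculation. The notation is assumed for illustration: $h = \max(0, W x)$ with $W W^{\top} = \sigma^2 I$, input $x \sim \mathcal{N}(0, I)$, Jacobian $J = \partial h / \partial x$, and $M$ the diagonal 0/1 mask of active units.

```latex
\begin{align*}
J &= M W, \qquad M = \mathrm{diag}\big(\mathbb{1}[(Wx)_i > 0]\big), \qquad M^2 = M, \\
J J^\top &= M\, W W^\top M = \sigma^2 M, \\
\mathbb{E}_x\big[J J^\top\big] &= \sigma^2\, \mathbb{E}_x[M] = \frac{\sigma^2}{2}\, I .
\end{align*}
```

The last step uses the fact that each pre-activation $(Wx)_i$ is a zero-mean Gaussian and is therefore positive with probability $1/2$. The expectation equals $I$ exactly when $\sigma^2 = 2$, that is, when an orthonormal matrix is multiplied by $\sqrt{2}$.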
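With regard to the convolutional layer parameterized by weights $W^{C} \in \mathbb{R}^{n \times d \times F_h \times F_w}$, where $F_h$ and $F_w$ are the height and width of the filter, we reshape $W^{C}$ as $W \in \mathbb{R}^{n \times p}$, where $p = d \cdot F_h \cdot F_w$, and the orthogonalization is executed over the unrolled weight matrix $W \in \mathbb{R}^{n \times (d \cdot F_h \cdot F_w)}$.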
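The unrolling step can be sketched as follows. This is a hypothetical forward-only helper, not the released module: the name `orthogonalize_conv_weight`, the `gain` argument (set to $\sqrt{2}$ to mimic the ReLU scaling), the tensor sizes, and the use of centering with the compact bound $\sqrt{\|Z Z^{\top}\|_F}$ are all assumptions made for illustration.

```python
import numpy as np

def orthogonalize_conv_weight(Z_conv, num_iters=5, gain=np.sqrt(2.0)):
    """Unroll a conv proxy weight (n, d, Fh, Fw) to (n, d*Fh*Fw), orthogonalize its rows, reshape back."""
    n = Z_conv.shape[0]
    Z = Z_conv.reshape(n, -1)                        # unrolled matrix, shape (n, d*Fh*Fw)
    Z = Z - Z.mean(axis=1, keepdims=True)            # centering over each filter
    V = Z / np.sqrt(np.linalg.norm(Z @ Z.T))         # compact spectral bounding
    S = V @ V.T
    B = np.eye(n)
    for _ in range(num_iters):                       # Newton's iteration towards S^{-1/2}
        B = 0.5 * (3.0 * B - B @ B @ B @ S)
    W = gain * (B @ V)                               # scaled, approximately row-orthogonal
    return W.reshape(Z_conv.shape)

rng = np.random.default_rng(0)
Z_conv = rng.standard_normal((8, 4, 3, 3))           # n = 8 filters, d = 4 channels, 3x3 kernels
W_conv = orthogonalize_conv_weight(Z_conv, num_iters=8)
W_mat = W_conv.reshape(8, -1)
gram = W_mat @ W_mat.T                               # approaches gain^2 * I as iterations increase
```

The returned tensor has the original convolutional shape, so it could be passed to a standard convolution. A learnable per-filter scale, as described for the orthonormal-to-orthogonal relaxation, would multiply each row of `W_mat` afterwards.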
Figure 5: Effects of maintaining orthogonality. Experiments are performed on a 10-layer MLP. (a) The training (solid lines) and testing (dashed lines) errors with respect to the training epochs; (b) The distribution of eigenvalues of the weight matrix $W$ of the 5th layer, at the 200th iteration.
Consider a convolutional layer with filters $W \in \mathbb{R}^{n \times d \times F_h \times F_w}$, and mini-batch data $\{x_i \in \mathbb{R}^{d \times h \times w}\}_{i=1}^{m}$. The computational cost of our method, coming mainly from Lines 3, 6 and 8 in Algorithm 1, is $2 n^2 d F_h F_w + 3 T n^3$ for each iteration during training. The relative cost of ONI over the convolutional layer is $\frac{2n}{mhw} + \frac{3 T n^2}{m d h w F_h F_w}$. During inference, we use the orthogonalized weight matrix $W$, and thus do not introduce additional computational or memory costs. We provide the wall-clock times in Appendix D.
methods | g=2,k=1 | g=2,k=2 | g=2,k=3 | g=3,k=1 | g=3,k=2 | g=3,k=3 | g=4,k=1 | g=4,k=2 | g=4,k=3 |
---|---|---|---|---|---|---|---|---|---|
plain | 11.34 | 9.84 | 9.47 | 10.32 | 8.73 | 8.55 | 10.66 | 9.00 | 8.43 |
WN | 11.19 | 9.55 | 9.49 | 10.26 | 9.26 | 8.19 | 9.90 | 9.33 | 8.90 |
OrthInit | 10.57 | 9.49 | 9.33 | 10.34 | 8.94 | 8.28 | 10.35 | 10.6 | 9.39 |
OrthReg | 12.01 | 10.33 | 10.31 | 9.78 | 8.69 | 8.61 | 9.39 | 7.92 | 7.24 |
OLM-1 [27] | 10.65 | 8.98 | 8.32 | 9.23 | 8.05 | 7.23 | 9.38 | 7.45 | 7.04 |
OLM-$\sqrt{2}$ | 10.15 | 8.32 | 7.80 | 8.74 | 7.23 | 6.87 | 8.02 | 6.79 | 6.56 |
ONI | 9.95 | 8.20 | 7.73 | 8.64 | 7.16 | 6.70 | 8.27 | 6.72 | 6.52 |
We evaluate our ONI on the Fashion-MNIST [69], CIFAR-10 [37] and ImageNet [55] datasets. We provide an ablation study on the iteration number of ONI in Section 4.1.4. Due to space limitations, we only provide essential components of the experimental setup; for more details, please refer to Appendix F. The code is available at https://github.com/huangleiBuaa/ONI.
We use an MLP with a ReLU activation [49], and vary the depth. The number of neurons in each layer is 256. We employ stochastic gradient descent (SGD) optimization with a batch size of 256, and the learning rates are selected based on the validation set (samples held out from the training set).
We first show that maintaining orthogonality can improve the training performance. We compare two baselines: 1) ‘plain’, the original network; and 2) ‘OrthInit’, in which the orthogonal initialization [58] is used. The training performances are shown in Figure 5 (a). We observe that orthogonal initialization can improve the training efficiency in the initial phase (compared with ‘plain’), after which the benefits of orthogonality degenerate (compared with ‘ONI’) due to the updating of weights (Figure 5 (b)).
We experimentally show the effects of initially scaling the orthogonal weight matrix by a factor of $\sqrt{2}$. We also apply this technique to the ‘OLM’ method [27], in which the orthogonalization is solved by eigen decomposition. We refer to ‘OLM-NS’/‘ONI-NS’ as the ‘OLM’/‘ONI’ without scaling by $\sqrt{2}$. The results are shown in Figure 6. We observe that the scaling technique has no significant effect on shallow neural networks, e.g., the 6-layer MLP. However, for deeper neural networks, it produces significant performance boosts. For example, for the 20-layer MLP, neither ‘OLM’ nor ‘ONI’ can converge without the additional scaling factors, because the activation and gradient exponentially vanish (see Appendix F.1). Besides, our ‘ONI’ has a nearly identical performance compared to ‘OLM’, which indicates the effectiveness of our approximate orthogonalization with few iterations (e.g. 5).
Method | BatchSize=128 | BatchSize=2 |
---|---|---|
w/BN* [73] | 6.61 | – |
Xavier Init* [27] | 7.78 | – |
Fixup-init* [74] | 7.24 | – |
w/BN | 6.82 | 7.24 |
Xavier Init | 8.43 | 9.74 |
GroupNorm | 7.33 | 7.36 |
ONI | 6.56 | 6.67 |
Here, we evaluate ONI on VGG-style neural networks with convolutional layers. The network starts with a convolutional layer whose number of filters is set by $k$, where $k$ is the varying width based on different configurations. We then sequentially stack three blocks, each of which has $g$ convolutional layers, with the filter numbers of the three blocks also set by $k$. We vary the depth with $g$ in $\{2, 3, 4\}$ and the width with $k$ in $\{1, 2, 3\}$ (the configurations listed in Table 1). We use SGD with a momentum of 0.9 and batch size of 128. The best initial learning rate is chosen over the validation set of 5,000 samples from the training set, and we divide the learning rate by 5 at 80 and 120 epochs, ending the training at 160 epochs. We compare our ‘ONI’ to several baselines, including orthogonal initialization [58] (‘OrthInit’), using soft orthogonal constraints as the penalty term [71] (‘OrthReg’), weight normalization [57] (‘WN’), ‘OLM’ [27] and the ‘plain’ network. Note that OLM [27] originally uses a scale of 1 (indicated as ‘OLM-1’), and we also apply the proposed scaling by $\sqrt{2}$ (indicated as ‘OLM-$\sqrt{2}$’).
Table 1 shows the results. ‘ONI’ and ‘OLM-$\sqrt{2}$’ have significantly better performance under all network configurations (different depths and widths), which demonstrates the beneficial effects of maintaining orthogonality during training. We also observe ‘ONI’ and ‘OLM-$\sqrt{2}$’ converge faster than other baselines, in terms of training epochs (see Appendix F.2). Besides, our proposed ‘ONI’ achieves slightly better performance than ‘OLM-$\sqrt{2}$’ on average, over all configurations. Note that we train ‘OLM-$\sqrt{2}$’ with the group size suggested in [27]. We also try full orthogonalization for ‘OLM-$\sqrt{2}$’. However, we observe either performance degeneration or numerical instability (e.g., the eigen decomposition cannot converge). We argue that the main reason for this is that full orthogonalization solved by OLM over-constrains the weight matrix, which harms the performance. Moreover, eigen decomposition based methods are more likely to result in numerical instability in high-dimensional space, due to the element-wise multiplication of a matrix during back-propagation [43], as discussed in Section 3.1.
Method | Top-1 (%) | Top-5 (%) | Time (min./epoch) |
---|---|---|---|
plain | 27.47 | 9.08 | 97 |
WN | 27.33 | 9.07 | 98 |
OrthInit | 27.75 | 9.21 | 97 |
OrthReg | 27.22 | 8.94 | 98 |
ONI | 26.31 | 8.38 | 104 |
Batch normalization (BN) is essential for stabilizing and accelerating the training [32] of DNNs [22, 26, 23, 63]. It is a standard configuration in residual networks [22]. However, it sometimes suffers from the small batch size problem [31, 68] and introduces too much stochasticity [65] when debugging neural networks. Several studies have tried to train deep residual networks without BN [60, 74]. Here, we show that, when using our ONI, the residual network without BN can also be well trained.
The experiments are executed on a 110-layer residual network (Res-110). We follow the same experimental setup as in [22], except that we run the experiments on one GPU. We also compare against the Xavier Init [16, 9], and group normalization (GN) [68]. ONI can be trained with a large learning rate of 0.1 and converge faster than BN, in terms of training epochs (See Appendix F.2). We observe that ONI has slightly better test performance than BN (Table 2). Finally, we also test the performance on a small batch size of 2. We find ONI continues to have better performance than BN in this case, and is not sensitive to the batch size, like GN [68].
Method | ResNet w/o BN (Train) | ResNet w/o BN (Test) | ResNet (Train) | ResNet (Test) | ResNetVar (Train) | ResNetVar (Test) |
---|---|---|---|---|---|---|
plain | 31.76 | 33.84 | 29.33 | 29.64 | 28.82 | 29.56 |
ONI | 27.05 | 31.17 | 29.28 | 29.57 | 28.12 | 28.92 |
Method | Test error (%), 50 layers | Test error (%), 101 layers | Time (min./epoch), 50 layers | Time (min./epoch), 101 layers |
---|---|---|---|---|
ResNet | 23.85 | 22.40 | 66 | 78 |
ResNet + ONI | 23.55 | 22.17 | 74 | 92 |
ResNetVar | 23.94 | 22.76 | 66 | 78 |
ResNetVar + ONI | 23.30 | 21.89 | 74 | 92 |
To further validate the effectiveness of our ONI on a large-scale dataset, we evaluate it on the ImageNet-2012 dataset. We keep almost all the experimental settings the same as the publicly available PyTorch implementation [52]: We apply SGD with a momentum of 0.9, and a weight decay of 0.0001. We train for 100 epochs in total and set the initial learning rate to 0.1, lowering it by a factor of 10 at epochs 30, 60 and 90. For more details on the slight differences among different architectures and methods, see Appendix F.3.
Table 3 shows the results on the 16-layer VGG [61]. Our ‘ONI’ outperforms ‘plain’, ‘WN’, ‘OrthInit’ and ‘OrthReg’ by a significant margin. Besides, ‘ONI’ can be trained with a large learning rate of 0.1, while the other methods cannot (the results are reported for an initial learning rate of 0.01). We also provide the running times in Table 3. The additional cost introduced by ‘ONI’ compared to ‘plain’ is small (104 vs. 97 minutes per epoch).
We first perform an ablation study on an 18-layer residual network (ResNet) [22], applying our ONI. We use the original ResNet and the ResNet without BN [32]. We also consider the architecture with BN inserted after the nonlinearity, which we refer to as ‘ResNetVar’. We observe that our ONI improves the performance over all three architectures, as shown in Table 4. One interesting observation is that ONI achieves the lowest training error on the ResNet without BN, which demonstrates its ability to facilitate optimization for large-scale datasets. We also observe that ONI has no significant difference in performance compared to ‘plain’ on ResNet. One possible reason is that the BN module and residual connection are well-suited for information propagation, causing ONI to have a lower net gain for such a large-scale classification task. However, we observe that, on ResNetVar, ONI obtains clearly better performance than ‘plain’. We argue that this boost is attributed to the orthogonal matrix’s ability to achieve approximate dynamical isometry, as described in Theorem 2.
We also apply our ONI on 50- and 101-layer residual networks. The results are shown in Table 5. We again observe that ONI can improve the performance, without introducing significant computational cost.
ONI controls the spectrum of the weight matrix by the iteration number $T$, as discussed before. Here, we explore the effect of $T$ on the performance of ONI over different datasets and architectures. We consider three configurations: 1) the 6-layer MLP for Fashion-MNIST; 2) the VGG-style network for CIFAR-10; and 3) the 18-layer ResNet without BN for ImageNet. The corresponding experimental setups are the same as described before. We vary $T$ and show the results in Figure 7. Our primary observation is that using either a small or large $T$ degrades performance. This indicates that we need to control the magnitude of orthogonality to balance the increased optimization benefit and diminished representational capacity. Our empirical observation is that the value of $T$ that works best for networks without residual connections differs from the one that works better for residual networks (Figure 7). We argue that the residual network itself already has good optimization [22], which reduces the optimization benefits of orthogonality.
Besides, we also observe that larger values of $T$ have nearly equivalent performance for simple datasets, e.g. Fashion-MNIST, as shown in Figure 7 (a). This suggests that amplifying the eigenbasis corresponding to a small singular value cannot help more, even though the network with a fully orthogonalized weight matrix can well fit the dataset. We further show the distributions of the singular values of the orthogonalized weight matrix in Appendix F.4.
How to stabilize GAN training is an open research problem [17, 56, 19]. One pioneering work is spectral normalization (SN) [47], which can maintain the Lipschitz continuity of a network by bounding the largest singular value of its weight matrices at 1. This technique has been extensively used in current GAN architectures [48, 75, 10, 38]. As stated before, our method is not only capable of bounding the largest singular value at 1, but can also control the orthogonality to amplify the other singular values with increased iterations. Orthogonal regularization has also been shown to be a good technique for training GANs [10]. Here, we conduct a series of experiments for unsupervised image generation on CIFAR-10, and compare our method against the widely used SN [47].
We strictly follow the network architecture and training protocol reported in the SN paper [47]. We use both DCGAN [54] and ResNet [22, 19] architectures. We provide implementation details in Appendix G. We replace all the SN modules in the corresponding network with our ONI. Our main metric for evaluating the quality of generated samples is the Fréchet Inception Distance (FID) [25] (the lower the better). We also provide the corresponding Inception Score (IS) [56] in Appendix G.
We use the standard non-saturating function as the adversarial loss [17, 38] in the DCGAN architecture, following [47]. For optimization, we use the Adam optimizer [36] with the default hyper-parameters, as in [47]: learning rate , first momentum , second momentum , and the number of discriminator updates per generator update . We train the network over 200 epochs with a batch size of 64 (nearly 200k generator updates) to determine whether it suffers from training instability. Figure 8 (a) shows the FID of SN and ONI when varying Newton's iteration number T from 0 to 5. One interesting observation is that ONI with only the initial spectral bounding described in Formula 7 (T=0) can also stabilize training, even though its performance is worse than that of SN. When , ONI achieves better performance than SN. This is because, based on what we observed, ONI stretches the maximum eigenvalue to nearly 1, while simultaneously amplifying the other eigenvalues. Finally, we find that ONI achieves the best performance when , yielding an , compared to SN's . Further increasing T harms the training, possibly because too strong an orthogonalization reduces the capacity of a network, as discussed in [47, 10].
We also conduct experiments to validate the stability of our proposed ONI under different experimental configurations: we use six configurations, following [47], by varying and (denoted by A-F; for more details, please see Appendix G.1). Figure 8 (b) shows the results of SN and ONI (with T=2) under these six configurations. We observe that our ONI is consistently better than SN.
For experiments on the ResNet architecture, we use the same setup as the DCGAN. Besides the standard non-saturating loss [17], we also evaluate the recently popularized hinge loss [42, 47, 10]. Figure 9 shows the results. We again observe that our ONI achieves better performance than SN under the ResNet architecture, both when using the non-saturating loss and hinge loss.
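The two adversarial losses mentioned above can be sketched as follows. This is a minimal NumPy version of the commonly used formulations, operating on raw discriminator outputs (logits); the function names are illustrative and the exact variants used in the experiments are those of the cited works, not necessarily these.

```python
import numpy as np

def softplus(x):
    # log(1 + exp(x)), computed in a numerically stable way
    return np.logaddexp(0.0, x)

def d_loss_nonsaturating(real_logits, fake_logits):
    # Discriminator: -log sigmoid(D(x)) - log(1 - sigmoid(D(G(z))))
    return softplus(-real_logits).mean() + softplus(fake_logits).mean()

def g_loss_nonsaturating(fake_logits):
    # Generator: -log sigmoid(D(G(z)))
    return softplus(-fake_logits).mean()

def d_loss_hinge(real_logits, fake_logits):
    # Discriminator: penalize real outputs below +1 and fake outputs above -1
    return (np.maximum(0.0, 1.0 - real_logits).mean()
            + np.maximum(0.0, 1.0 + fake_logits).mean())

def g_loss_hinge(fake_logits):
    # Generator: raise the discriminator output on generated samples
    return -fake_logits.mean()
```

Both pairs depend on the discriminator only through its raw outputs, so replacing SN by ONI in the discriminator leaves the loss code unchanged.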
In this paper, we proposed an efficient and stable orthogonalization method by Newton’s iteration (ONI) to learn layer-wise orthogonal weight matrices in DNNs. We provided insightful analysis for ONI and demonstrated its ability to control orthogonality, which is a desirable property in training DNNs. ONI can be implemented as a linear layer and used to learn an orthogonal weight matrix, by simply substituting it for the standard linear module.
ONI can effectively bound the spectrum of a weight matrix in (, ) during the course of training. This property makes ONI a potential tool for validating some theoretical results relating to the generalization of DNNs (e.g., the margin bounds shown in [6]) and for resisting attacks from adversarial examples [13]. Furthermore, the advantage of ONI in stabilizing training without BN (BN usually disturbs the theoretical analysis since it depends on the sampled mini-batch input with stochasticity [32, 29]) makes it possible to validate these theoretical arguments under real scenarios.
Acknowledgement We thank Anna Hennig and Ying Hu for their help with proofreading.
Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.
Training deep networks with structured layers by matrix backpropagation. In ICCV, 2015.
OlÉ: Orthogonal low-rank embedding - a plug and play geometric loss for deep learning. In CVPR, June 2018.
International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
Bayesian uncertainty estimation for batch normalized deep networks. In ICML, 2018.
Cappronet: Deep feature learning via orthogonal projections onto capsule subspaces. In NeurIPS, 2018.
Given the layer-wise orthogonal weight matrix , we can perform the forward pass to calculate the loss of the deep neural networks (DNNs). It’s necessary to back-propagate through the orthogonalization transformation, because we aim to update the proxy parameters . For illustration, we first describe the proposed orthogonalization by Newton’s iteration (ONI) in Algorithm I. Given the gradient with respect to the orthogonalized weight matrix , we aim to compute . The back-propagation is based on the chain rule. From Line 2 in Algorithm I, we have:
(A1) |
where indicates the trace of the corresponding matrix and can be calculated from Lines 3 and 8 in Algorithm I:
(A2) |
We thus need to calculate , which can be computed from Lines 5, 6 and 7 in Algorithm I:
(A3) |
where and can be iteratively calculated from Line 6 in Algorithm I as follows:
(A4) |
In summary, the back-propagation of Algorithm I is shown in Algorithm II.
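Since the algorithm listings are not reproduced in this text, the forward pass can be sketched as follows. This is a minimal NumPy version under stated assumptions: the proxy matrix is bounded by its Frobenius norm (consistent with the trace normalization discussed later in this appendix), the Newton update is B_t = 1.5 B_{t-1} - 0.5 B_{t-1}^3 S with B_0 = I, and the matrix has no more rows than columns. The names `oni_orthogonalize` and `eps` are illustrative.

```python
import numpy as np

def oni_orthogonalize(Z, T=5, eps=1e-12):
    """Sketch of orthogonalization by Newton's iteration for an n x d matrix, n <= d."""
    # Spectral bounding: all singular values of V lie in (0, 1].
    V = Z / (np.linalg.norm(Z) + eps)
    S = V @ V.T                          # n x n covariance of the bounded matrix
    B = np.eye(Z.shape[0])               # assumed initialization B_0 = I
    for _ in range(T):
        # Assumed Newton step towards S^{-1/2}
        B = 1.5 * B - 0.5 * (B @ B @ B) @ S
    return B @ V                         # approaches a row-orthogonal matrix as T grows

rng = np.random.default_rng(0)
Z = rng.standard_normal((8, 16))
for T in range(6):
    W = oni_orthogonalize(Z, T)
    print(T, np.linalg.norm(W @ W.T - np.eye(8)))
```

In an automatic differentiation framework the backward pass through these operations is obtained from the chain rule, which is what the derivation above spells out by hand.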
We further derive the back-propagation of the accelerated ONI method with the centering and more compact spectral bounding operation, as described in Section 3.3 of the paper. For illustration, Algorithm III describes the forward pass of the accelerated ONI. Following the calculation in Algorithm II, we can obtain . To simplify the derivation, we represent Line 3 of Algorithm III as the following formulations:
(A5) | ||||
(A6) | ||||
(A7) |
It’s easy to calculate from Eqn. A5 and Eqn. A7 as follows:
(A8) |
where can be computed based on Eqn. A6 and Eqn. A7:
(A9) |
Based on Line 2 in Algorithm III, we can achieve as follows:
(A10) |
In Section 3 of the paper, we show that bounding the spectrum of the proxy parameter matrix by
(A11) |
and
(A12) |
can satisfy the convergence condition of Newton’s Iterations as follows:
(A13) |
where and the singular values of are nonzero. Here we will prove this conclusion, and we also prove that .
By definition, can be represented as . Given Eqn. A11, we calculate
(A14) |
Let’s denote and the eigenvalues of are . We have , since is a real symmetric matrix and the singular values of are nonzero. We also have and the eigenvalues of are . Furthermore, the eigenvalues of are , thus satisfying the convergence condition described by Eqn. A13.
Similarly, given , we have and its corresponding eigenvalues are . Therefore, the singular values of are , also satisfying the convergence condition described by Eqn. A13.
We have and . It’s easy to demonstrate that , since . ∎
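The claim can also be checked numerically. The sketch below assumes that Eqn. A11 divides the proxy matrix by its Frobenius norm, that Eqn. A12 divides it by the square root of the Frobenius norm of its covariance, and that the convergence condition of Eqn. A13 is a spectral norm of I - VV^T below 1. The formulas themselves are not legible in this text, so these readings and the helper name `residual_norm` are assumptions.

```python
import numpy as np

def residual_norm(V):
    """Spectral norm of I - V V^T, the quantity assumed to appear in the convergence condition."""
    n = V.shape[0]
    return np.linalg.norm(np.eye(n) - V @ V.T, 2)

rng = np.random.default_rng(0)
Z = rng.standard_normal((16, 64))

fro = np.linalg.norm(Z)                       # assumed bound of Eqn. A11
compact = np.sqrt(np.linalg.norm(Z @ Z.T))    # assumed bound of Eqn. A12

r_fro = residual_norm(Z / fro)
r_compact = residual_norm(Z / compact)
print(r_fro, r_compact, fro >= compact)
```

Under these assumptions both residuals stay below 1, and the second bound is the smaller divisor, so it leaves the singular values closer to 1 before the iteration starts.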
In Section 3 of the paper, we show that the Newton’s iteration by bounding the spectrum with Eqn. A11 is equivalent to the Newton’s iteration proposed in [29]. Here, we provide the details. In [29], they bound the covariance matrix by the trace of as . It’s clear that used in Algorithm I is equal to , based on Eqn. A14 shown in the previous proof.
configurations | cudnn | cudnn + ONI-T1 | cudnn + ONI-T3 | cudnn + ONI-T5 | cudnn + ONI-T7 |
---|---|---|---|---|---|
, n=d=256, m=256 | 118.6 | 122.1 | 122.9 | 124.4 | 125.7 |
, n=d=256, m=32 | 15.8 | 18.3 | 18.9 | 19.5 | 20.8 |
, n=d=1024, m=32 | 71.1 | 81.7 | 84.3 | 89.5 | 94.2 |
, n=d=256, m=256 | 28.7 | 31.5 | 32.1 | 33.7 | 34.6 |
, n=d=256, m=32 | 10.1 | 13 | 13.6 | 14.2 | 15.3 |
, n=d=1024, m=32 | 22.2 | 27.6 | 29.7 | 32.9 | 37.0 |
Method | Row orthogonality | Column orthogonality |
---|---|---|
ONI-Full | 5.66 | 0 |
OLM-G32 | 8 | 5.66 |
OLM-G16 | 9.85 | 8.07 |
OLM-G8 | 10.58 | 8.94 |
In Section 3.4 of the paper, we argue that group based methods cannot ensure the whole matrix to be either row or column orthogonal, when . Here we provide more details.
We follow the experiments described in Figure 3 of the paper, where we sample the entries of proxy matrix from the Gaussian distribution . We apply the eigen decomposition based orthogonalization method [27] with group size , to obtain the orthogonalized matrix . We vary the group size . We evaluate the corresponding row orthogonality and column orthogonality . The results are shown in Table A2. We observe that the group based orthogonalization method cannot ensure the whole matrix to be either row or column orthogonal, while our ONI can ensure column orthogonality. We also observe that the group based method has degenerated orthogonality, with decreasing group size.
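The effect can be reproduced in a few lines. The sketch below makes several assumptions that are not stated in this text: the proxy matrix has shape 64 x 32, the orthogonality measures are the Frobenius norms of WW^T - I and W^TW - I, and the SVD polar factor stands in for both the eigen decomposition based method [27] and full orthogonalization. These choices were picked because they are consistent with the values 5.66 (the square root of 32) and 8 in Table A2; all function names are illustrative.

```python
import numpy as np

def polar_orthogonalize(Z):
    """Nearest matrix with orthonormal rows (if wide) or columns (if tall), via SVD."""
    U, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ Vt

def group_orthogonalize(Z, group_size):
    """Orthogonalize consecutive groups of rows independently and stack them."""
    blocks = [polar_orthogonalize(Z[i:i + group_size])
              for i in range(0, Z.shape[0], group_size)]
    return np.vstack(blocks)

def orthogonality_gaps(W):
    """Frobenius distances of W W^T and W^T W from the identity."""
    n, d = W.shape
    row = np.linalg.norm(W @ W.T - np.eye(n))
    col = np.linalg.norm(W.T @ W - np.eye(d))
    return row, col

rng = np.random.default_rng(0)
Z = rng.standard_normal((64, 32))
for g in (64, 32, 16, 8):
    row, col = orthogonality_gaps(group_orthogonalize(Z, g))
    print(f"group size {g:2d}: row gap {row:.2f}, column gap {col:.2f}")
```

With a group size equal to the number of rows the whole matrix is orthogonalized at once and the column gap vanishes, whereas orthogonalizing groups separately leaves both gaps nonzero because rows from different groups are not constrained against each other.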
We also conduct an experiment when , where we sample the entries of proxy matrix from the Gaussian distribution . We vary the group size . Note that represents full orthogonalization. Figure A1 shows the distribution of the eigenvalues of . We again observe that the group based method cannot ensure the whole weight matrix to be row orthogonal. Furthermore, orthogonalization with smaller group size tends to be worse.
In Section 3.6 of the paper, we show that, given a convolutional layer with filters and mini-batch data , the relative computational cost of ONI over the convolutional layer is . In this section, we compare the wall clock time of the convolution wrapped with our ONI against that of the standard convolution. In this experiment, our ONI is implemented based on Torch [14] and we wrap it around the ‘cudnn’ convolution [12]. The experiments are run on a TITAN Xp.
We fix the input to size , and vary the kernel size ( and ), the feature dimensions ( and ) and the batch size . Table A1 shows the wall clock time under different configurations. We compare the standard ‘cudnn’ convolution (denoted as ‘cudnn’) and the ‘cudnn’ wrapped with our ONI (denoted as ‘cudnn + ONI’).
We observe that our method introduces negligible computational costs when using a convolution, feature dimension and batch size of . Our method may degenerate in efficiency with a smaller kernel size, larger feature dimension and smaller batch size, based on the computational complexity analysis. However, our method (with iteration of 5) ‘cudnn + ONI-T5’ only costs over the standard convolution ‘cudnn’, under the worst configuration, , and m=32.
Theorem 1. Let , where and . Assume: (1) , , and (2) , . If , we have the following properties: (1) ; (2) , ; (3) ; (4) , . In particular, if , property (2) and (3) hold; if , property (1) and (4) hold.
Based on and , we have that is a square orthogonal matrix. We thus have . Besides, we have . (Footnote 1: We follow the common setup where the vectors are column vectors while their derivatives are row vectors.)
(1) Therefore, we have
(A15) |
We thus get .
(2) It’s easy to calculate:
(A16) |
The covariance of is given by