The inductive bias encoded in a neural network is generally determined by two major aspects: how the neural network is structured (i.e., network architecture) and how the neural network is optimized (i.e., training algorithm). For the same network architecture, using different training algorithms could lead to a dramatic difference in generalization performance (Keskar & Socher, 2017; Reddi et al., 2019) even if the training loss is already close to zero, implying that different training procedures lead to different inductive biases. Therefore, how to effectively train a neural network that generalizes well remains an open challenge.
Recent theories (Gunasekar et al., 2017; 2018; Kawaguchi, 2016; Li et al., 2018) suggest the importance of over-parameterization in linear neural networks. For example, (Gunasekar et al., 2017) shows that optimizing an underdetermined quadratic objective over a matrix with gradient descent on a factorization of the matrix leads to an implicit regularization that may improve generalization. There is also empirical evidence (Ding et al., 2019; Liu et al., 2019) showing that over-parameterizing the convolutional filters under some regularity is beneficial to generalization. Our paper aims to leverage the power of over-parameterization and explore more intrinsic structural priors for training a well-performing neural network.
Motivated by this goal, we propose a generic orthogonal over-parameterized training (OPT) framework for effectively training neural networks. Different from existing neural network training, OPT over-parameterizes a neuron $w \in \mathbb{R}^d$ as the multiplication of a learnable layer-shared orthogonal matrix $R \in \mathbb{R}^{d \times d}$ and a fixed, randomly initialized weight vector $v \in \mathbb{R}^d$. Therefore, the equivalent weight for the neuron is $w = Rv$. Once each element of the neuron weight $v$ has been randomly initialized following a zero-mean Gaussian distribution (He et al., 2015; Glorot & Bengio, 2010), we fix it throughout the entire training process. Then OPT learns a layer-shared orthogonal transformation $R$ that is applied to all the neurons in the same layer. An illustration of OPT is given in Fig. 1. In contrast to standard neural network training, OPT decomposes the neuron into two components: an orthogonal transformation $R$ that learns a proper coordinate system and a weight vector $v$ that controls the position of the neuron. Essentially, the weight vectors of different neurons in the same layer determine the relative positions of these neurons, while the layer-shared orthogonal matrix specifies the coordinate system for these neurons.
Another motivation for OPT comes from the observation (Liu et al., 2018) that neural networks with lower hyperspherical energy generalize better. Hyperspherical energy quantifies the diversity of neurons on a hypersphere, and essentially characterizes the relative positions of neurons via this form of diversity. (Liu et al., 2018) introduces hyperspherical energy as a regularization term in the network, but cannot guarantee that the hyperspherical energy is effectively minimized (due to the competing data-fitting loss). To address this issue, we leverage the property that hyperspherical energy is independent of the coordinate system in which the neurons live and depends only on their relative positions. Specifically, we prove that randomly initializing the neuron weights with certain distributions leads to minimum hyperspherical energy. It follows that OPT maintains the minimum energy throughout training by learning only a coordinate system (i.e., a layer-shared orthogonal matrix) for the neurons. Therefore, OPT can guarantee that the hyperspherical energy is well minimized.
We consider several ways to learn the orthogonal transformation. The first is to unroll different orthogonalization algorithms such as the Gram-Schmidt process, Householder reflection, and Löwdin’s symmetric orthogonalization. Different unrolled algorithms yield different implicit regularizations when constructing the neuron weights. For example, symmetric orthogonalization guarantees that the new orthogonal basis has the least distance in the Hilbert space from the original non-orthogonal basis. Second, we consider using a special parameterization (such as the Cayley parameterization) to construct the orthogonal matrix, which is more efficient in training. Third, we try an orthogonality-preserving gradient descent that ensures the matrix $R$ remains orthogonal after each gradient update. Last, we propose a relaxation of the optimization problem that turns the orthogonality of the coordinate system $R$ into a regularization term. These different ways of learning the orthogonal transformation for neurons encode different inductive biases into the neural network.
Moreover, we propose a refinement strategy to further reduce the hyperspherical energy of the randomly initialized neuron weights $v$. We directly minimize the hyperspherical energy of these randomly initialized neuron weights as a preprocessing step before training on actual data. Finally, we provide theoretical justifications for why OPT yields better generalization than standard training.
We summarize the advantages of OPT as follows:
OPT is a universal neural network training framework with strong flexibility. There are several ways of learning the coordinate system (i.e., orthogonal transformation) and each one may impose a different inductive bias.
OPT is the first training method that can provably achieve minimum hyperspherical energy, leading to better generalization. More interestingly, OPT reveals that learning a proper coordinate system is crucial to generalization, while the relative positions of neurons can be well characterized by hyperspherical energy.
There is no extra computational cost for OPT-trained neural networks in inference. It has the same inference speed and model size as its standard counterpart.
2 Related Work
Optimization for Deep Learning. A number of first-order optimization algorithms (Nesterov, 1983; Duchi et al., 2011; Kingma & Ba, 2014; Tieleman & Hinton, 2012; Zeiler, 2012; Reddi et al., 2019) have been proposed to improve the empirical convergence and generalization of deep neural networks. Our work is complementary to these optimization algorithms, since they can be easily applied within our framework.
Parameterization of Neurons. There are various ways to parameterize a neuron for different applications. (Ding et al., 2019) over-parameterizes a 2D convolution kernel by combining a 2D kernel of the same size and two additional 1D asymmetric kernels. The resulting convolution kernel has the same number of effective parameters at inference time but more parameters during training due to the additional asymmetric kernels. (Liu et al., 2019) constructs a neuron with a bilinear parameterization which regularizes the bilinear similarity matrix. (Yang et al., 2015) reparameterizes the neuron matrix with an adaptive fastfood transform to compress model parameters. (Jaderberg et al., 2014; Liu et al., 2015; Wang et al., 2017) employ sparse and low-rank structures to construct convolution kernels for efficient neural networks.
Hyperspherical learning. (Liu et al., 2017) proposes a neural network architecture that learns representations on a hypersphere and shows that the angular information in neural networks, in contrast to magnitude information, preserves the most semantic information. (Liu et al., 2018) defines the hyperspherical energy that quantifies the diversity of neurons on a hypersphere and empirically shows that minimizing hyperspherical energy improves generalization.
3 Orthogonal Over-Parameterized Training
3.1 General Framework
OPT parameterizes a neuron as the multiplication of an orthogonal matrix $R \in \mathbb{R}^{d \times d}$ and a neuron weight vector $v \in \mathbb{R}^d$, so the equivalent neuron weight becomes $w = Rv$. The output of this neuron can be represented by $\sigma((Rv)^\top x)$, where $x$ is the input vector and $\sigma(\cdot)$ is the activation. In the OPT framework, we fix the randomly initialized neuron weight $v$ and only the orthogonal matrix $R$ is learned. In contrast, the standard neuron is directly formulated as $\sigma(w^\top x)$, where the weight vector $w$ is learned in training.
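As a minimal NumPy sketch (with hypothetical toy dimensions, a ReLU activation, and an illustrative function name of our own), an OPT neuron computes $\sigma((Rv)^\top x)$ while keeping $v$ fixed:

```python
import numpy as np

def opt_neuron_output(R, v, x):
    """Output of a single OPT neuron: the equivalent weight is w = R v,
    where R is a layer-shared orthogonal matrix and v is a fixed,
    randomly initialized weight vector (ReLU activation assumed)."""
    w = R @ v
    return max(w @ x, 0.0)

# A toy layer: all neurons share R; only R would be learned in training.
rng = np.random.default_rng(0)
d = 4
V = rng.standard_normal((d, 3))                   # fixed random neuron weights (columns)
R, _ = np.linalg.qr(rng.standard_normal((d, d)))  # some orthogonal matrix
x = rng.standard_normal(d)
outputs = [opt_neuron_output(R, V[:, i], x) for i in range(3)]
```

With $R = I$, the OPT neuron reduces exactly to a standard neuron with weight $v$, which is the sense in which OPT is an over-parameterization rather than a restriction of the hypothesis class.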
Without loss of generality, we consider a two-layer MLP with a loss function $\ell(\cdot,\cdot)$ (e.g., we use the least square loss $\ell(\hat y, y) = (\hat y - y)^2$). Specifically, the learning objectives of standard training and OPT are

Standard: $\min_{\{w_i\}, u} \sum_m \ell\big(\textstyle\sum_i u_i \sigma(w_i^\top x_m),\, y_m\big)$;  OPT: $\min_{R, u} \sum_m \ell\big(\textstyle\sum_i u_i \sigma((R v_i)^\top x_m),\, y_m\big)$ s.t. $R^\top R = R R^\top = I$,   (1)

where $v_i$ is the $i$-th neuron in the first layer, and $u$ is the output neuron in the second layer. In OPT, each element of $v_i$ is usually sampled from a zero-mean Gaussian distribution, and is fixed throughout the entire training process. In general, OPT learns an orthogonal matrix that is applied to all the neurons instead of learning the individual neuron weights. Note that we usually do not apply OPT to neurons in the output layer (e.g., $u$ in this MLP example, and the final linear classifiers in CNNs), since it makes little sense to fix a set of random linear classifiers. Therefore, the central problem is how to learn these layer-shared orthogonal matrices.
3.2 Hyperspherical Energy Perspective
We take a closer look at OPT from the hyperspherical energy perspective. Following (Liu et al., 2018), the hyperspherical energy of $n$ neurons $\{w_1, \dots, w_n\}$ is defined as

$E(\hat w_1, \dots, \hat w_n) = \sum_{i \ne j} \|\hat w_i - \hat w_j\|^{-s}, \quad s > 0$,   (2)

in which $\hat w_i = w_i / \|w_i\|$ is the $i$-th neuron weight projected onto the unit hypersphere $\mathbb{S}^{d-1}$. Hyperspherical energy is used to characterize the diversity of neurons on the unit hypersphere. Assume that we have $n$ neurons $\{v_1, \dots, v_n\}$ in one layer, and we have learned an orthogonal matrix $R$ for these neurons. The hyperspherical energy of these OPT-trained neurons is given by

$E(\widehat{Rv_1}, \dots, \widehat{Rv_n}) = \sum_{i \ne j} \|R\hat v_i - R\hat v_j\|^{-s} = \sum_{i \ne j} \|\hat v_i - \hat v_j\|^{-s} = E(\hat v_1, \dots, \hat v_n)$,

which shows that OPT does not change the hyperspherical energy in each layer during training.
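This invariance is easy to verify numerically. The sketch below (NumPy; the exponent $s=1$ and the toy sizes are our assumptions) computes the energy before and after applying an arbitrary orthogonal matrix:

```python
import numpy as np

def hyperspherical_energy(W, s=1.0):
    """E_s = sum_{i != j} ||w_i_hat - w_j_hat||^{-s}, with w_i_hat = w_i / ||w_i||."""
    Wh = W / np.linalg.norm(W, axis=0, keepdims=True)  # project columns onto the sphere
    n = Wh.shape[1]
    e = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                e += np.linalg.norm(Wh[:, i] - Wh[:, j]) ** (-s)
    return e

rng = np.random.default_rng(1)
V = rng.standard_normal((5, 8))                   # 8 toy neurons in R^5 (columns)
R, _ = np.linalg.qr(rng.standard_normal((5, 5)))  # arbitrary orthogonal transform
```

Since an orthogonal $R$ preserves both norms and pairwise distances, the two energies agree up to floating-point error.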
Moreover, (Liu et al., 2018) proves that the minimum hyperspherical energy corresponds to the uniform distribution over the hypersphere. As a result, if the initialization of the neurons in the same layer follows the uniform distribution over the hypersphere, then we can guarantee that the hyperspherical energy is minimal in a probabilistic sense.

Theorem 1. For a neuron $v = (v^{(1)}, \dots, v^{(d)})$ whose elements are initialized i.i.d. following a zero-mean Gaussian distribution $\mathcal{N}(0, \sigma^2)$, its projection $\hat v = v / \|v\|$ onto the unit hypersphere is guaranteed to follow a uniform distribution on $\mathbb{S}^{d-1}$.
Theorem 1 implies that, if we initialize the neurons in the same layer with a zero-mean Gaussian distribution, then the corresponding hyperspherical energy is guaranteed to be small. This is because the neurons will be uniformly distributed on the unit hypersphere, and hyperspherical energy well quantifies this uniformity. More importantly, the prevailing neuron initializations such as Xavier (Glorot & Bengio, 2010) and Kaiming (He et al., 2015) are basically zero-mean Gaussian distributions. Therefore, our neurons naturally have very low hyperspherical energy from the very beginning in practice.
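As a quick numerical sanity check of Theorem 1 (a NumPy sketch; the sample size and dimension are arbitrary toy choices), normalized Gaussian samples land exactly on the unit sphere and their empirical mean is close to the origin, as expected under the uniform law:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20000, 3
v = rng.standard_normal((n, d))                          # i.i.d. zero-mean Gaussian neurons
v_hat = v / np.linalg.norm(v, axis=1, keepdims=True)     # projection onto S^{d-1}
mean = v_hat.mean(axis=0)  # close to the origin for a uniform distribution on the sphere
```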
3.3 Unrolling Orthogonalization Algorithms
In order to learn the orthogonal transformation, we propose to unroll classic orthogonalization algorithms in numerical linear algebra and embed them into the neural network such that the training is still end-to-end. We need to make sure every step of the orthogonalization algorithm is differentiable, and the training flow is shown in Fig. 2.
Gram-Schmidt Process. This method takes a linearly independent set and produces an orthogonal set based on it. The Gram-Schmidt process (GS) orthogonalizes a set of vectors $\{v_1, \dots, v_n\}$ with the following steps:

$u_1 = v_1, \qquad u_k = v_k - \sum_{j=1}^{k-1} \mathrm{proj}_{u_j}(v_k), \quad k = 2, \dots, n$,

where the projection operator is given by $\mathrm{proj}_u(v) = \frac{\langle v, u \rangle}{\langle u, u \rangle} u$ and $\{u_1, \dots, u_n\}$ (after normalization) is the obtained orthogonal set. In practice, we can use the modified GS for numerical stability. To achieve better orthogonality, we can also unroll an iterative GS (Hoffmann, 1989) with multiple iterative steps.
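The steps above can be sketched as follows (NumPy; the modified variant is shown, and the function name and toy sizes are illustrative):

```python
import numpy as np

def modified_gram_schmidt(V):
    """Orthonormalize the columns of V with modified Gram-Schmidt
    (numerically more stable than the classical variant)."""
    V = V.astype(float).copy()
    d, n = V.shape
    Q = np.zeros((d, n))
    for k in range(n):
        Q[:, k] = V[:, k] / np.linalg.norm(V[:, k])
        # Immediately remove the new direction from all remaining vectors.
        for j in range(k + 1, n):
            V[:, j] -= (Q[:, k] @ V[:, j]) * Q[:, k]
    return Q

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
Q = modified_gram_schmidt(A)
```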
Householder Reflection. As one of the classic transformations used in QR factorization, Householder reflection (HR) can also compute an orthogonal set from a group of vectors. A Householder reflector is defined as $H = I - 2\frac{uu^\top}{u^\top u}$, where $u$ is the normal vector of the reflection hyperplane. In QR factorization, HR is used to transform a (non-singular) square matrix into an orthogonal matrix and an upper triangular matrix. Given a matrix $V$, we consider its first column vector $x$. We use a Householder reflector to transform $x$ to $\|x\| e_1$. Specifically, we construct

$u_1 = x - \|x\| e_1, \qquad H_1 = I - 2\frac{u_1 u_1^\top}{u_1^\top u_1}$,

which is an orthogonal matrix, so that the first column of $H_1 V$ becomes $(\|x\|, 0, \dots, 0)^\top$. At the $k$-th step, we view the trailing sub-matrix as a new $V$, and use the same procedure to construct the Householder transformation $H_k$ (embedded in an identity matrix). Applying $H_n \cdots H_2 H_1$ gradually transforms $V$ into an upper triangular matrix with Householder reflections. Therefore, we have

$V = (H_n \cdots H_2 H_1)^{-1} T = H_1 H_2 \cdots H_n T$,

where $T$ is the upper triangular matrix (different from the matrix $R$ in Fig. 2) and the final obtained orthogonal set is $Q = H_1 H_2 \cdots H_n$ (using the fact that each $H_k$ is symmetric and orthogonal).
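A compact Householder QR sketch in NumPy (names and toy sizes are ours; the sign choice in the reflector is a standard numerical-stability trick not spelled out above):

```python
import numpy as np

def householder_qr(A):
    """QR factorization via Householder reflections: returns an orthogonal Q
    and an upper triangular T with A = Q @ T."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    T = A.copy()
    Q = np.eye(m)
    for k in range(min(m - 1, n)):
        x = T[k:, k]
        normx = np.linalg.norm(x)
        if normx == 0.0:
            continue
        # Reflect x onto (-sign(x_1) * ||x||) e_1; the sign choice avoids cancellation.
        alpha = -normx if x[0] >= 0 else normx
        u = x.copy()
        u[0] -= alpha
        u /= np.linalg.norm(u)
        Hk = np.eye(m)
        Hk[k:, k:] -= 2.0 * np.outer(u, u)
        T = Hk @ T
        Q = Q @ Hk  # each H_k is symmetric orthogonal, so Q = H_1 H_2 ... H_n
    return Q, T

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Q, T = householder_qr(A)
```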
Löwdin’s Symmetric Orthogonalization. Let the matrix $V = (v_1, \dots, v_n)$ be a given set of linearly independent vectors in an $n$-dimensional space. A non-singular linear transformation $A$ can transform the basis $V$ to an orthogonal basis $W$: $W = VA$. The matrix $W$ will be orthogonal if the following equation holds:

$W^\top W = A^\top V^\top V A = A^\top M A = I$,

where $M = V^\top V$ is the Gram matrix of the given set $V$. We obtain a general solution to the orthogonalization problem via the substitution $A = M^{-1/2} B$, where $B$ is an arbitrary unitary matrix. The specific choice $B = I$ gives Löwdin’s symmetric orthogonalization (LS): $W = V M^{-1/2}$. We can analytically obtain the symmetric orthogonalization from the singular value decomposition $V = U \Sigma P^\top$, which yields $M^{-1/2} = P \Sigma^{-1} P^\top$. Then LS gives $W = V M^{-1/2} = U P^\top$ as the orthogonal set for $V$.
LS possesses a remarkable property which the other orthogonalizations do not have: the orthogonal set resembles the original set in a nearest-neighbour sense. Specifically, LS guarantees that $\sum_i \|w_i - v_i\|^2$ (where $w_i$ and $v_i$ are the $i$-th columns of $W$ and $V$, respectively) is minimized over all orthogonal $W$. Intuitively, LS performs the gentlest pushing of the directions of the vectors in order to make them orthogonal.
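The SVD route and the defining formula $W = VM^{-1/2}$ can be checked against each other in a few lines (NumPy sketch with illustrative sizes):

```python
import numpy as np

def lowdin_orthogonalization(V):
    """Symmetric (Loewdin) orthogonalization W = V M^{-1/2}, M = V^T V,
    computed via the SVD V = U S P^T, which gives W = U P^T."""
    U, S, Pt = np.linalg.svd(V, full_matrices=False)
    return U @ Pt

rng = np.random.default_rng(0)
V = rng.standard_normal((6, 6))          # linearly independent columns (almost surely)
W = lowdin_orthogonalization(V)

# W = V M^{-1/2} computed explicitly, for comparison with the SVD route.
M = V.T @ V
evals, evecs = np.linalg.eigh(M)
M_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
```

The nearest-neighbour property above can also be observed numerically: the LS result is at least as close to $V$ in Frobenius norm as any other orthogonalization of $V$, e.g., the QR/Gram-Schmidt one.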
Discussion. These orthogonalization algorithms are fully differentiable and end-to-end trainable. For better orthogonality, these algorithms can be used iteratively, and we can unroll them with multiple iterations. Empirically, one-step unrolling usually works well already. We have also considered Givens rotation to construct the orthogonal matrix, but it needs to traverse all $O(n^2)$ lower triangular elements of the original set, which is too computationally expensive. More interestingly, each orthogonalization method encodes a unique inductive bias for the resulting neurons by imposing some implicit regularization (e.g., least distance in Hilbert space for LS). More details about the orthogonalization methods are provided in Appendix A.
3.4 Orthogonal Parameterization
A more convenient way to ensure orthogonality while learning the matrix $R$ is to use a special parameterization that inherently guarantees orthogonality. The exponential parameterization uses $R = \exp(S)$ (where $\exp$ denotes the matrix exponential) to represent an orthogonal matrix via a skew-symmetric matrix $S$. The Cayley parameterization (CP) is a Padé approximation of the exponential parameterization, and is a more natural choice due to its simplicity. The Cayley parameterization uses the following transform to construct an orthogonal matrix $R$ from a skew-symmetric matrix $S$:

$R = (I - S)^{-1}(I + S), \qquad S = -S^\top$.

We note that the Cayley parameterization only produces orthogonal matrices with determinant $+1$, which belong to the special orthogonal group, and thus $R \in SO(n)$. Specifically, it suffices to learn the upper (or lower) triangular part of the matrix $S$ with unconstrained optimization to obtain a desired orthogonal matrix $R$. The Cayley parameterization does not cover the entire orthogonal group and is less flexible in terms of representation power, which serves as an explicit regularization for the neurons.
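A minimal sketch of the Cayley map in NumPy (we use the form $(I-S)^{-1}(I+S)$; the equivalent form $(I+S)(I-S)^{-1}$ appears in some references, and the toy sizes are illustrative):

```python
import numpy as np

def cayley(S):
    """Cayley parameterization: maps a skew-symmetric S to an orthogonal
    R = (I - S)^{-1} (I + S), which always has determinant +1."""
    n = S.shape[0]
    I = np.eye(n)
    return np.linalg.solve(I - S, I + S)  # I - S is always invertible for skew S

rng = np.random.default_rng(0)
n = 5
B = rng.standard_normal((n, n))
S = (B - B.T) / 2.0  # skew-symmetrize: only the strictly triangular part of B is free
R = cayley(S)
```

This is why unconstrained optimization over the triangular part of $S$ suffices: every such $S$ yields a valid $R \in SO(n)$ by construction.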
3.5 Orthogonality-Preserving Gradient Descent
An alternative way to guarantee orthogonality is to modify the gradient update for the transformation matrix $R$. The general idea is to initialize $R$ with an arbitrary orthogonal matrix and then make sure every gradient update applies an orthogonal transformation to $R$. This essentially amounts to gradient descent on the Stiefel manifold. There is plenty of work (Li et al., 2020; Wen & Yin, 2013; Wisdom et al., 2016; Lezcano-Casado & Martínez-Rubio, 2019; Arjovsky et al., 2016; Henaff et al., 2016; Jing et al., 2017) focusing on optimization on the Stiefel manifold.
Given a matrix $R$ which is initialized as an orthogonal matrix, we aim to construct an orthogonal transformation as the gradient update. We use the Cayley transform to compute a parametric curve on the Stiefel manifold with a specific metric via a skew-symmetric matrix $W$ and use it as the update rule:

$R^{(t+1)} = \big(I - \tfrac{\alpha}{2} W^{(t)}\big)^{-1} \big(I + \tfrac{\alpha}{2} W^{(t)}\big) R^{(t)}$,   (9)

where $W^{(t)} = G^{(t)} (R^{(t)})^\top - R^{(t)} (G^{(t)})^\top$ and $\alpha$ is the learning rate. $R^{(t)}$ denotes the orthogonal matrix in the $t$-th iteration, and $G^{(t)}$ denotes the original gradient of the loss function w.r.t. $R^{(t)}$. We term such a gradient update orthogonality-preserving gradient descent (OGD). To reduce the computational cost of the matrix inverse in Eq. (9), we use an iterative method (Li et al., 2020) to approximate the Cayley transform without a matrix inverse. Moving terms in Eq. (9), we arrive at the following fixed-point iteration:

$Y^{(k+1)} = R^{(t)} + \tfrac{\alpha}{2} W^{(t)} \big(R^{(t)} + Y^{(k)}\big), \qquad Y^{(0)} = R^{(t)}$,

which converges to the closed-form Cayley transform ($k$ is the iteration number). In practice, we empirically find that two iterations usually suffice for a reasonable approximation accuracy.
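Both the closed-form update and its inverse-free approximation can be sketched as follows (NumPy; the learning rate, sizes, and the random stand-in gradient are our assumptions, not values from the paper):

```python
import numpy as np

def cayley_update_exact(R, G, lr):
    """One OGD step: R' = (I - lr/2 W)^{-1} (I + lr/2 W) R,
    with skew-symmetric W = G R^T - R G^T, so R' stays orthogonal."""
    n = R.shape[0]
    W = G @ R.T - R @ G.T
    I = np.eye(n)
    return np.linalg.solve(I - 0.5 * lr * W, (I + 0.5 * lr * W) @ R)

def cayley_update_iterative(R, G, lr, n_iter=2):
    """Inverse-free fixed-point approximation: Y <- R + lr/2 * W (R + Y)."""
    W = G @ R.T - R @ G.T
    Y = R.copy()
    for _ in range(n_iter):
        Y = R + 0.5 * lr * W @ (R + Y)
    return Y

rng = np.random.default_rng(0)
n = 4
R, _ = np.linalg.qr(rng.standard_normal((n, n)))
G = rng.standard_normal((n, n))   # stand-in for the loss gradient w.r.t. R
R_exact = cayley_update_exact(R, G, lr=0.01)
R_approx = cayley_update_iterative(R, G, lr=0.01, n_iter=2)
```

Setting $Y^{(k+1)} = Y^{(k)}$ in the fixed-point iteration recovers Eq. (9) exactly, and for a small step size the iterate after two steps is already very close to the closed-form update.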
3.6 Relaxation to Orthogonal Regularization
We relax the optimization with an orthogonality constraint to an unconstrained optimization with an orthogonality regularization (OR). Specifically, we remove the orthogonality constraint in Eq. (1), and adopt an orthogonality regularizer for $R$ in the objective function, i.e., $\lambda \|R^\top R - I\|_F^2$, where $\lambda$ is a hyperparameter. This serves as an approximation to the OPT objective. The relaxation cannot guarantee that the hyperspherical energy stays unchanged.
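The regularizer itself is one line; the sketch below (NumPy, with an illustrative $\lambda$) only shows the penalty term that would be added to the data loss:

```python
import numpy as np

def orthogonality_penalty(R, lam=1e-4):
    """Soft orthogonality regularizer lam * ||R^T R - I||_F^2, added to the
    data loss instead of enforcing the constraint exactly."""
    n = R.shape[0]
    D = R.T @ R - np.eye(n)
    return lam * np.sum(D ** 2)

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # an exactly orthogonal matrix
```

The penalty vanishes exactly on orthogonal matrices and grows as $R$ drifts away from the manifold, which is why it only approximates the hard constraint.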
4 Refining the Random Initialization
Minimizing hyperspherical energy. Because we randomly initialize the neuron weight vectors $v_i$, there will be some variance that makes the hyperspherical energy deviate from the minimum even though the hyperspherical energy is minimized in expectation. To more effectively reduce the hyperspherical energy, we propose to refine the random initialization by minimizing its hyperspherical energy as a preprocessing step. Specifically, before feeding these neuron weights to OPT, we first minimize the hyperspherical energy in Eq. (2) with gradient descent (without the data-fitting loss). More importantly, since randomly initialized neurons cannot minimize the half-space hyperspherical energy (Liu et al., 2018), in which the collinearity redundancy is removed, we can also perform half-space hyperspherical energy minimization as a preprocessing step.
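A minimal sketch of this preprocessing (NumPy; the step size, iteration count, and the simple projected-gradient scheme are our assumptions, not the exact procedure of the paper):

```python
import numpy as np

def energy_and_grad(V, s=1.0, eps=1e-12):
    """Hyperspherical energy of the unit-normalized columns of V, plus its
    gradient w.r.t. the normalized neurons."""
    Vh = V / np.linalg.norm(V, axis=0, keepdims=True)
    d, n = Vh.shape
    E, G = 0.0, np.zeros_like(Vh)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            diff = Vh[:, i] - Vh[:, j]
            r = np.linalg.norm(diff) + eps
            E += r ** (-s)
            # Each unordered pair appears twice in the double sum, hence the factor 2.
            G[:, i] += -2.0 * s * r ** (-s - 2.0) * diff
    return E, G

def refine_initialization(V, steps=300, lr=0.01):
    """Projected gradient descent on the unit sphere to reduce the energy of a
    random initialization before OPT training starts."""
    V = V / np.linalg.norm(V, axis=0, keepdims=True)
    for _ in range(steps):
        _, G = energy_and_grad(V)
        V = V - lr * G
        V = V / np.linalg.norm(V, axis=0, keepdims=True)  # re-project onto the sphere
    return V

rng = np.random.default_rng(0)
V0 = rng.standard_normal((3, 10))   # toy layer: 10 random neurons in R^3
V1 = refine_initialization(V0)
```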
Normalizing the randomly initialized neurons. Since the norms of the randomly initialized neuron weights serve a role similar to weighting the importance of different neurons, we further consider normalizing the neuron weights such that each neuron weight vector has unit norm.
We evaluate both refinements in Section 7.4, and we also show that OPT still performs well without these refinements.
5 Theoretical Insights on Generalization
The key question we aim to answer in this section is why OPT may lead to better generalization. We have already shown that OPT can guarantee the minimum hyperspherical energy (MHE) in a probabilistic sense. Although empirical evidence (Liu et al., 2018) has shown significant and consistent performance gains from minimizing hyperspherical energy, why lower hyperspherical energy leads to better generalization remains unclear. We argue that OPT leads to better generalization from two aspects: how OPT may affect training and generalization in theory, and why minimum hyperspherical energy serves as a good inductive bias.
Our goal here is to leverage and apply existing theoretical results (Kawaguchi, 2016; Xie et al., 2016; Soudry & Carmon, 2016; Lee et al., 2016; Du et al., 2017; Allen-Zhu et al., 2018) to explain the role that MHE plays, rather than proving sharp and novel generalization bounds. We simply consider one-hidden-layer networks as the hypothesis class: $f(x) = \sum_{i=1}^n u_i \sigma(w_i^\top x)$, where $\sigma$ is ReLU. Since the magnitude of $u_i$ can be absorbed into $w_i$, we can restrict $u_i$ to be $\pm 1$. Given a set of i.i.d. training samples $\{(x_m, y_m)\}_{m=1}^M$, where $x_m$ is drawn uniformly from the unit hypersphere, we minimize the least square loss $L = \frac{1}{2M}\sum_m (f(x_m) - y_m)^2$. The gradient w.r.t. $w_i$ is

$\nabla_{w_i} L = \frac{1}{M} \sum_{m=1}^M \big(f(x_m) - y_m\big)\, u_i\, \mathbb{1}\{w_i^\top x_m > 0\}\, x_m$.

Let $W = (w_1, \dots, w_n)$ be the column concatenation of the neuron weights. We aim to identify the conditions under which there are no spurious local minima. We can rewrite the gradient as

$\nabla_W L = \frac{1}{M} D\, r$,

where $r = (f(x_1) - y_1, \dots, f(x_M) - y_M)^\top$ is the residual vector and $D$ stacks the per-sample terms $u_i \mathbb{1}\{w_i^\top x_m > 0\} x_m$. Therefore, we can obtain

$\|\nabla_W L\| \ge \frac{1}{M}\, \sigma_{\min}(D)\, \|r\|$,

where $\frac{1}{M}\|r\|^2$ is proportional to the training error and $\sigma_{\min}(D)$ is the minimum singular value of $D$. If we need the training error to be small whenever the gradient is small, then we have to lower bound $\sigma_{\min}(D)$ away from zero. We have the following result from (Xie et al., 2016).
Lemma 1 (Xie et al., 2016, informal). With probability at least $1 - \delta$ over the random data, $\sigma_{\min}(D)$ is lower bounded by a quantity that becomes larger as a discrepancy term, which measures how far the normalized neurons $\{\hat w_i\}$ deviate from the uniform distribution on the unit hypersphere, becomes smaller. The bound is stated in terms of the arc-cosine kernel induced by the ReLU activation.

Once MHE is achieved, the neurons will be uniformly distributed on the unit hypersphere. From Lemma 1, we see that if the neurons are uniformly distributed on the unit hypersphere, the discrepancy term will be very close to zero, leading to a large lower bound on $\sigma_{\min}(D)$. Therefore, MHE implies a small training error once the gradient norm is small. This result suggests that there are no spurious local minima if we use OPT for training.
We further argue that the MHE induced by OPT serves as an important inductive bias for neural networks. As the standard regularizer for neural networks, weight decay controls the norm of the neuron weights, essentially regularizing one dimension of the weight. In contrast, MHE completes the missing pieces by regularizing the remaining dimensions of the weight. MHE encourages minimum hyperspherical redundancy between neurons. In the linear classifier case, MHE imposes a prior of maximal inter-class separability.
6 Discussions and Intriguing Insights
Semi-randomness. OPT fixes the randomly initialized neuron weight vectors and only learns a layer-shared orthogonal matrix for each layer, so OPT naturally imposes a strong randomness on the neurons and combines the generalization benefits of randomness with the strong approximation power of neural networks. Such randomness suggests that the specific configuration of relative positions among neurons does not matter that much, and that the coordinate system is more crucial for generalization. (Kawaguchi et al., 2018; Rahimi & Recht, 2008; Srivastava et al., 2014) also show that randomness can be beneficial to generalization.
Coordinate system vs. relative position. OPT shows that learning only the coordinate system yields much better generalization than learning the neuron weights directly. This implies that the coordinate system is of great importance to generalization. However, the relative positions cease to matter only when the hyperspherical energy is sufficiently low; in other words, hyperspherical energy well characterizes the relative positions among neurons, and lower hyperspherical energy leads to better generalization.
Flexible training. First, OPT can be used in multi-task training (Mallya et al., 2018), where each set of orthogonal matrices represents one task. OPT can learn different sets of orthogonal matrices for different tasks while the neuron weights remain the same. Second, we can perform progressive training with OPT. For example, after learning a set of orthogonal matrices on a large coarse-grained dataset (i.e., pretraining), we can multiply the orthogonal matrices back into the neuron weights and construct a new set of neuron weights. Then we can use the new neuron weights as a starting point and apply OPT to train on a small fine-grained dataset (i.e., finetuning).
Limitations and open problems. The limitations of OPT include higher GPU memory consumption and computational cost during training, more numerical issues when ensuring orthogonality, and weak scalability to ultra-wide neural networks. Therefore, there are plenty of open problems around OPT, such as scalable and efficient training. Most significantly, OPT opens up a new possibility for studying the theoretical generalization of deep networks: with the decomposition into hyperspherical energy and coordinate system, OPT provides a new perspective for future research.
7 Experiments and Results
7.1 Experimental settings
We evaluate OPT on various types of neural networks such as multi-layer perceptrons (for image classification), convolutional neural networks (for image classification), graph neural networks (for graph node classification), and point cloud neural networks (for point cloud classification). Our goal is to show the performance gain of OPT on training different neural networks rather than achieving the state-of-the-art performance on different tasks. More experimental details and network architectures are given in Appendix D.
7.2 Ablation Study and Exploratory Experiment
Necessity of orthogonality. We first examine whether orthogonality is necessary for OPT. We use both a 6-layer CNN and a 9-layer CNN (specified in Appendix D) on CIFAR-100. We compare OPT with a baseline with the same network architecture that learns an unconstrained matrix with only weight decay regularization. We term this baseline unconstrained over-parameterized training (UPT). “FN” in Table 1 denotes whether the randomly initialized neuron weights are fixed throughout the training (“Y” for yes and “N” for no). “LR” denotes whether the learnable transformation is unconstrained (“U”) or orthogonal (“GS” for Gram-Schmidt process). The results in Table 1 show that, without ensuring orthogonality, the performance of UPT is much worse than that of OPT unrolling the Gram-Schmidt process (no matter whether the neuron weights are fixed or not). Thus, orthogonality is indeed necessary.
Fixed weight vs. learnable weight. From Table 1, we can see that using fixed neuron weights is consistently better than learnable neuron weights in both UPT and OPT. It shows that fixing the neuron weights while learning the transformation matrix is beneficial to generalization.
High vs. low hyperspherical energy. We empirically verify that high hyperspherical energy corresponds to inferior generalization performance. To initialize neurons with high hyperspherical energy, we use random initializations with several different non-zero means (listed in Table 2).
We use CNN-6 to conduct experiments on CIFAR-100. The results in Table 2 (“N/C” denotes not converged) show that networks with higher hyperspherical energy are more difficult to converge. Moreover, we find that if the hyperspherical energy is larger than a certain value, then the network cannot converge at all. Note that, when the hyperspherical energy is small (near the minimum), even a small change in hyperspherical energy can lead to a dramatic generalization gap (see Table 2). One can observe that higher hyperspherical energy leads to worse generalization.
7.3 Multi-Layer Perceptrons
We evaluate different variants of OPT for MLPs on MNIST. We use a 3-layer MLP for all the training methods. Specific training hyperparameters are given in Appendix D. The results in Table 3 show the testing error on MNIST for cases where the neuron weights use normal initialization or Xavier initialization (Glorot & Bengio, 2010). OPT (GS/HR/LS) denote OPT with unrolled orthogonalization algorithms. OPT (CP) denotes OPT with the Cayley parameterization. OPT (OGD) is OPT with orthogonality-preserving gradient descent. OPT (OR) denotes OPT with the relaxed orthogonality regularization. We can see that OPT (GS) performs the best and all OPT variants outperform the baseline by a considerable margin.
7.4 Convolutional Neural Networks
OPT variants. We evaluate all the OPT variants with a plain 6-layer CNN and a plain 9-layer CNN on CIFAR-100. Detailed network architectures are given in Appendix D. All the neurons are initialized with (He et al., 2015) by default. The results in Table 4 show that nearly all OPT variants consistently outperform both the baseline and the HS-MHE regularization (Liu et al., 2018) by a significant margin. The HS-MHE regularization puts the half-space hyperspherical energy into the loss function and minimizes it with stochastic gradients, which is a naive way to minimize the hyperspherical energy. From the results, we observe that OPT (HR) performs the worst among all OPT variants. In contrast, OPT (GS) achieves the best testing error, implying that the Gram-Schmidt process imposes a suitable inductive bias for CNNs on CIFAR-100.
Training without batch normalization. We further evaluate how OPT performs without batch normalization. Specifically, we use CNN-6 as our backbone network and test on CIFAR-100. From Table 5, one can see that the OPT variants again outperform both the baseline and HS-MHE (Liu et al., 2018), validating that OPT works reasonably well without batch normalization. Among all the OPT variants, the Cayley parameterization achieves very competitive testing error, noticeably lower than standard training.
Training dynamics. We also look into how the hyperspherical energy and testing error change during training with OPT. For hyperspherical energy, we can see from Fig. 3 that the hyperspherical energy of the baseline increases dramatically at the beginning and then gradually goes down, but it still stays at a relatively high value at the end. MHE can effectively reduce the hyperspherical energy by the end of the training. In contrast, all OPT variants maintain a very low hyperspherical energy from the beginning. OPT (GS) and OPT (CP) keep exactly the same hyperspherical energy as the randomly initialized neurons, while OPT (OR) may increase the hyperspherical energy a little since it is a relaxation. For testing error, all OPT variants converge stably and their final accuracies outperform the others.
Refining neuron initialization. We also evaluate two refinement tricks for the neuron initialization. First, we consider hyperspherical energy minimization as a preprocessing step for the neuron weights. We conduct the experiment using CNN-6 on CIFAR-100. Specifically, we run gradient descent for 5k iterations to minimize the hyperspherical energy of the neuron weights before the training starts. We also report the hyperspherical energy (before the training starts, and after the preprocessing of energy minimization) in Table 6. All the methods use the same random initialization with the same random seed, so the hyperspherical energy starts at the same value. After the neuron preprocessing, the energy is reduced for both the MHE objective and the half-space MHE objective (see Table 6). More importantly, Table 6 shows that such a refinement can effectively improve the generalization of OPT and further reduce the testing error on CIFAR-100.
We then experiment with neuron weight normalization in OPT. Normalized neurons make sense in OPT because the scale of the randomly initialized weights does not carry any useful information. After randomly initializing the neurons, we directly normalize the weights to unit norm. These randomly initialized neurons still possess the important property of achieving minimum hyperspherical energy. Specifically, we use CNN-6 to perform classification on CIFAR-100. The results in Table 7 show that normalizing the neurons can largely boost the performance of OPT.
OPT for ResNet. To show that OPT is agnostic to different CNN architectures, we perform classification experiments on CIFAR-100 with both ResNet-20 and ResNet-32 (He et al., 2016). We use OPT (GS) and OPT (CP) to train ResNet-20 and ResNet-32. The results in Table 8 show that OPT achieves consistent improvements on ResNet compared to the standard training.
ImageNet. We test OPT on ImageNet-2012. Since OPT consumes more GPU memory in large-scale settings, we use the GPU memory-efficient OPT (CP) to train a plain 10-layer CNN (detailed structure specified in Appendix D) on ImageNet. Note that our purpose here is to validate the superiority of OPT over the corresponding baseline. From Table 9, we can see that OPT (CP) achieves consistent improvements on both top-1 and top-5 error.
|Method||5-shot Acc. (%)|
|MAML (Finn et al., 2017)||62.71 ± 0.71|
|ProtoNet (Snell et al., 2017)||64.24 ± 0.72|
|Baseline (Chen et al., 2019)||62.53 ± 0.69|
|Baseline w/ OPT||63.27 ± 0.68|
|Baseline++ (Chen et al., 2019)||66.43 ± 0.63|
|Baseline++ w/ OPT||66.68 ± 0.66|
Few-shot learning. To evaluate the cross-task generalization of OPT, we conduct few-shot learning on Mini-ImageNet, following the same experimental setting as (Chen et al., 2019). More detailed experimental settings are provided in Appendix D. Specifically, we apply OPT with CP to train both the baseline and baseline++ described in (Chen et al., 2019), and immediately obtain clear improvements. Therefore, OPT-trained networks generalize well in challenging few-shot scenarios.
7.5 Graph Neural Networks
We also test OPT with graph convolutional networks (GCNs) (Kipf & Welling, 2016) for graph node classification. For a fair comparison, we use exactly the same implementation, hyperparameters, and experimental setup as (Kipf & Welling, 2016). Training a GCN with OPT is not entirely straightforward. Specifically, GCN uses the following forward model:

$$Z = \mathrm{softmax}\big(\hat{A}\,\mathrm{ReLU}(\hat{A} X W^{(0)})\, W^{(1)}\big),$$

where $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$. We note that $A$ is the adjacency matrix of the graph, $\tilde{A} = A + I_N$ ($I_N$ is an identity matrix), and $\tilde{D}$ is the diagonal degree matrix with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. $X \in \mathbb{R}^{N \times C}$ is the feature matrix of the $N$ nodes in the graph (feature dimension is $C$). $W^{(1)} \in \mathbb{R}^{H \times F}$ contains the weights of the classifiers. $W^{(0)}$ is the weight matrix of size $C \times H$, where $H$ is the dimension of the hidden space. We treat each column vector of $W^{(0)}$ as a neuron, so there are $H$ neurons in total. Then we naturally apply OPT to train these neurons of dimension $C$ in GCN.
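For concreteness, the two-layer GCN forward pass above can be sketched in NumPy as follows (a dense, illustrative version; the function name and the comment about where OPT would plug in are ours):

```python
import numpy as np

def gcn_forward(A, X, W0, W1):
    # Normalized adjacency: A_hat = D~^{-1/2} (A + I) D~^{-1/2}
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    H = np.maximum(A_hat @ X @ W0, 0.0)      # hidden layer with ReLU
    logits = A_hat @ H @ W1
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)  # row-wise softmax over classes

# Under OPT, each of the H columns of W0 would instead be parameterized as
# R @ w_fixed, with a learnable layer-shared orthogonal R and a fixed random w_fixed.
```

The output is an $N \times F$ matrix of per-node class probabilities, one softmax distribution per row.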
We conduct experiments on the Cora and Pubmed datasets (Sen et al., 2008). The goal here is to verify the effectiveness of OPT on GCN rather than to achieve state-of-the-art performance on graph node classification. The results in Table 11 show a reasonable improvement achieved by OPT, validating OPT's universality for training different types of neural networks on different modalities.
7.6 Point Cloud Neural Networks
We further test OPT on PointNet (Qi et al., 2017), a type of neural network that takes raw point clouds as input and classifies them based on semantics. To simplify the comparison and remove all bells and whistles, we use a vanilla PointNet (without T-Net) as our backbone network. We apply OPT to train the MLPs in PointNet. We follow the same experimental settings as (Qi et al., 2017) and evaluate on the ModelNet-40 dataset (Wu et al., 2015). The results are given in Table 12. We observe that both OPT variants achieve better accuracy than the PointNet baseline, with OPT (CP) achieving a clear improvement. This gain is in fact significant because we do not add any additional parameters to the network. Although our accuracy is not state-of-the-art, it still validates the effectiveness of OPT. Most importantly, the improvement on PointNet further confirms that OPT is a generic and effective training framework for different types of neural networks.
8 Concluding Remarks
This paper proposes a novel training framework for all types of neural networks. OPT over-parameterizes a neuron with a randomly initialized and fixed weight vector and a learnable layer-shared orthogonal matrix. OPT provably achieves minimum hyperspherical energy and maintains this energy during training. We provide theoretical insights and extensive empirical evidence to validate OPT's superiority.
- Allen-Zhu et al. (2018) Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018.
- Annavarapu (2013) Annavarapu, R. N. Singular value decomposition and the centrality of löwdin orthogonalizations. American Journal of Computational and Applied Mathematics, 3(1):33–35, 2013.
- Arjovsky et al. (2016) Arjovsky, M., Shah, A., and Bengio, Y. Unitary evolution recurrent neural networks. In ICML, 2016.
- Bansal et al. (2018) Bansal, N., Chen, X., and Wang, Z. Can we gain more from orthogonality regularizations in training deep cnns? In NeurIPS, 2018.
- Bilyk & Lacey (2015) Bilyk, D. and Lacey, M. T. One bit sensing, discrepancy, and stolarsky principle. arXiv preprint arXiv:1511.08452, 2015.
- Chen et al. (2019) Chen, W.-Y., Liu, Y.-C., Kira, Z., Wang, Y.-C. F., and Huang, J.-B. A closer look at few-shot classification. arXiv preprint arXiv:1904.04232, 2019.
- Ding et al. (2019) Ding, X., Guo, Y., Ding, G., and Han, J. Acnet: Strengthening the kernel skeletons for powerful cnn via asymmetric convolution blocks. In ICCV, 2019.
- Du et al. (2017) Du, S. S., Lee, J. D., Tian, Y., Poczos, B., and Singh, A. Gradient descent learns one-hidden-layer cnn: Don’t be afraid of spurious local minima. arXiv preprint arXiv:1712.00779, 2017.
- Duchi et al. (2011) Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 2011.
- Finn et al. (2017) Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
- Glorot & Bengio (2010) Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
- Gunasekar et al. (2017) Gunasekar, S., Woodworth, B. E., Bhojanapalli, S., Neyshabur, B., and Srebro, N. Implicit regularization in matrix factorization. In NeurIPS, 2017.
- Gunasekar et al. (2018) Gunasekar, S., Lee, J. D., Soudry, D., and Srebro, N. Implicit bias of gradient descent on linear convolutional networks. In NeurIPS, 2018.
- Ha et al. (2016) Ha, D., Dai, A., and Le, Q. V. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.
- He et al. (2015) He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
- He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016.
- Henaff et al. (2016) Henaff, M., Szlam, A., and LeCun, Y. Recurrent orthogonal networks and long-memory tasks. arXiv preprint arXiv:1602.06662, 2016.
- Hoffmann (1989) Hoffmann, W. Iterative algorithms for gram-schmidt orthogonalization. Computing, 41(4):335–348, 1989.
- Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
- Jaderberg et al. (2014) Jaderberg, M., Vedaldi, A., and Zisserman, A. Speeding up convolutional neural networks with low rank expansions. In BMVC, 2014.
- Jia et al. (2016) Jia, X., De Brabandere, B., Tuytelaars, T., and Gool, L. V. Dynamic filter networks. In NeurIPS, 2016.
- Jing et al. (2017) Jing, L., Shen, Y., Dubcek, T., Peurifoy, J., Skirlo, S., LeCun, Y., Tegmark, M., and Soljačić, M. Tunable efficient unitary neural networks (eunn) and their application to rnns. In ICML, 2017.
- Kawaguchi (2016) Kawaguchi, K. Deep learning without poor local minima. In NeurIPS, 2016.
- Kawaguchi et al. (2018) Kawaguchi, K., Xie, B., and Song, L. Deep semi-random features for nonlinear function approximation. In AAAI, 2018.
- Keskar & Socher (2017) Keskar, N. S. and Socher, R. Improving generalization performance by switching from adam to sgd. arXiv preprint arXiv:1712.07628, 2017.
- Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Kipf & Welling (2016) Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
- Lee et al. (2016) Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. Gradient descent only converges to minimizers. In COLT, 2016.
- Lezcano-Casado & Martínez-Rubio (2019) Lezcano-Casado, M. and Martínez-Rubio, D. Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group. arXiv preprint arXiv:1901.08428, 2019.
- Li et al. (2020) Li, J., Li, F., and Todorovic, S. Efficient riemannian optimization on the stiefel manifold via the cayley transform. In ICLR, 2020.
- Li et al. (2018) Li, Y., Ma, T., and Zhang, H. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In COLT, 2018.
- Lin et al. (2013) Lin, M., Chen, Q., and Yan, S. Network in network. arXiv preprint arXiv:1312.4400, 2013.
- Liu et al. (2015) Liu, B., Wang, M., Foroosh, H., Tappen, M., and Pensky, M. Sparse convolutional neural networks. In CVPR, 2015.
- Liu et al. (2017) Liu, W., Zhang, Y.-M., Li, X., Yu, Z., Dai, B., Zhao, T., and Song, L. Deep hyperspherical learning. In NeurIPS, 2017.
- Liu et al. (2018) Liu, W., Lin, R., Liu, Z., Liu, L., Yu, Z., Dai, B., and Song, L. Learning towards minimum hyperspherical energy. In NeurIPS, 2018.
- Liu et al. (2019) Liu, W., Liu, Z., Rehg, J. M., and Song, L. Neural similarity learning. In NeurIPS, 2019.
- Mallya et al. (2018) Mallya, A., Davis, D., and Lazebnik, S. Piggyback: Adapting a single network to multiple tasks by learning to mask weights. In ECCV, 2018.
- Nesterov (1983) Nesterov, Y. E. A method for solving the convex programming problem with convergence rate O(1/k^2). In Dokl. Akad. Nauk SSSR, 1983.
- Peel et al. (2010) Peel, T., Anthoine, S., and Ralaivola, L. Empirical bernstein inequalities for u-statistics. In NeurIPS, 2010.
- Qi et al. (2017) Qi, C. R., Su, H., Mo, K., and Guibas, L. J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, 2017.
- Rahimi & Recht (2008) Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In NeurIPS, 2008.
- Reddi et al. (2019) Reddi, S. J., Kale, S., and Kumar, S. On the convergence of adam and beyond. arXiv preprint arXiv:1904.09237, 2019.
- Sen et al. (2008) Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., and Eliassi-Rad, T. Collective classification in network data. AI magazine, 2008.
- Snell et al. (2017) Snell, J., Swersky, K., and Zemel, R. Prototypical networks for few-shot learning. In NeurIPS, 2017.
- Soudry & Carmon (2016) Soudry, D. and Carmon, Y. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.
- Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 2014.
- Srivastava (2000) Srivastava, V. A unified view of the orthogonalization methods. Journal of Physics A: Mathematical and General, 33(35):6219, 2000.
- Tieleman & Hinton (2012) Tieleman, T. and Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.
- Wang et al. (2017) Wang, M., Liu, B., and Foroosh, H. Factorized convolutional neural networks. In ICCV Workshops, 2017.
- Wen & Yin (2013) Wen, Z. and Yin, W. A feasible method for optimization with orthogonality constraints. Mathematical Programming, 2013.
- Wisdom et al. (2016) Wisdom, S., Powers, T., Hershey, J., Le Roux, J., and Atlas, L. Full-capacity unitary recurrent neural networks. In NeurIPS, 2016.
- Wu et al. (2015) Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., and Xiao, J. 3d shapenets: A deep representation for volumetric shapes. In CVPR, 2015.
- Xie et al. (2016) Xie, B., Liang, Y., and Song, L. Diverse neural network learns true target functions. arXiv preprint arXiv:1611.03131, 2016.
- Yang et al. (2015) Yang, Z., Moczulski, M., Denil, M., de Freitas, N., Smola, A., Song, L., and Wang, Z. Deep fried convnets. In ICCV, 2015.
- Zeiler (2012) Zeiler, M. D. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Appendix A Details of Unrolled Orthogonalization Algorithms
A.1 Gram-Schmidt Process
Gram-Schmidt Process. The GS process is a method for orthonormalizing a set of vectors in an inner product space, i.e., the Euclidean space $\mathbb{R}^n$ equipped with the standard inner product. Specifically, the GS process performs the following operations to orthogonalize a set of vectors $\{v_1, \dots, v_n\}$:

$$u_1 = v_1, \qquad u_k = v_k - \sum_{j=1}^{k-1} \mathrm{proj}_{u_j}(v_k), \qquad e_k = \frac{u_k}{\|u_k\|},$$

where $\mathrm{proj}_{u}(v) = \frac{\langle v, u\rangle}{\langle u, u\rangle}\, u$ denotes the projection of the vector $v$ onto the vector $u$. The set $\{e_1, \dots, e_n\}$ denotes the output orthonormal set. The algorithm flowchart can be described as follows:
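A minimal NumPy sketch of the classical GS process (the function name is ours, not from the paper's implementation):

```python
import numpy as np

def gram_schmidt(V):
    # Classical GS: orthonormalize the columns of V (assumed linearly independent).
    d, n = V.shape
    Q = np.zeros((d, n))
    for k in range(n):
        u = V[:, k].astype(float).copy()
        for j in range(k):
            u -= (Q[:, j] @ V[:, k]) * Q[:, j]  # subtract projection onto e_j
        Q[:, k] = u / np.linalg.norm(u)          # normalize to obtain e_k
    return Q
```

The columns of the returned matrix are the orthonormal vectors $e_1, \dots, e_n$, and they span the same subspace as the input columns.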
Some quantities computed in the algorithm above are only used to form the QR factorization; they are not needed for orthogonalization and therefore do not need to be stored. When the GS process is implemented on a finite-precision computer, the resulting vectors are often not quite orthogonal because of rounding errors. Besides the standard GS process, there is a modified Gram-Schmidt (MGS) algorithm which enjoys better numerical stability. This approach gives the same result as the original formula in exact arithmetic and introduces smaller errors in finite-precision arithmetic. Specifically, GS computes the following formula:

$$u_k = v_k - \sum_{j=1}^{k-1} \mathrm{proj}_{u_j}(v_k). \qquad (18)$$
Instead of computing the vector $u_k$ as in Eq. (18), MGS computes the orthogonal basis differently. MGS does not subtract all the projections of the original vector at once; instead, it removes the projection onto each previously constructed orthogonal basis vector one at a time. Specifically, MGS computes the following series of formulas:

$$u_k^{(1)} = v_k - \mathrm{proj}_{u_1}(v_k), \qquad u_k^{(j)} = u_k^{(j-1)} - \mathrm{proj}_{u_j}\big(u_k^{(j-1)}\big),\ \ j = 2, \dots, k-1, \qquad u_k = u_k^{(k-1)},$$

where each step finds a vector $u_k^{(j)}$ that is orthogonal to $u_j$. Therefore, $u_k^{(j)}$ is also orthogonalized against any rounding errors introduced in the computation of $u_k^{(j-1)}$. In practice, although MGS enjoys better numerical stability, we find that the empirical performance of GS and MGS is almost the same in OPT. However, MGS takes longer to complete since the computation of each orthogonal basis vector is an iterative process. Therefore, we usually stick to classical GS for OPT.
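The sequential-projection variant above can be sketched as follows (again a minimal NumPy illustration with our own naming):

```python
import numpy as np

def modified_gram_schmidt(V):
    # MGS: once e_k is formed, immediately remove its component from all
    # remaining vectors (numerically more stable than classical GS).
    Q = V.astype(float).copy()
    n = Q.shape[1]
    for k in range(n):
        Q[:, k] /= np.linalg.norm(Q[:, k])
        for j in range(k + 1, n):
            Q[:, j] -= (Q[:, k] @ Q[:, j]) * Q[:, k]
    return Q
```

In exact arithmetic this returns the same orthonormal basis as classical GS; the difference only shows up in floating-point rounding behavior.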
Iterative Gram-Schmidt Process. The iterative Gram-Schmidt (IGS) process is an iterative version of the GS process. It is shown in (Hoffmann, 1989) that the GS process can be carried out iteratively to obtain a basis matrix that is orthogonal to almost full working precision. The IGS algorithm is given as follows:
The auxiliary vectors in the algorithm above are used to compute the QR factorization; they are not needed for orthogonalization and therefore do not have to be explicitly computed. The while loop in IGS is an iterative procedure; in practice, we can unroll a fixed number of its steps in order to improve the orthogonality. The basis vector produced in the $k$-th step corresponds to the solution of a linear system determined by the previously computed basis, and the IGS process corresponds to the Gauss-Jacobi iteration for solving this system.
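A simple unrolled sketch of IGS in NumPy, in which each new column is re-orthogonalized against the already-built basis a fixed number of times (the function name and the two-pass default are our illustrative choices):

```python
import numpy as np

def iterative_gram_schmidt(V, passes=2):
    # Re-orthogonalize each column `passes` times against the basis built so far;
    # in practice two passes already give near working-precision orthogonality.
    d, n = V.shape
    Q = np.zeros((d, n))
    for k in range(n):
        u = V[:, k].astype(float).copy()
        for _ in range(passes):
            u -= Q[:, :k] @ (Q[:, :k].T @ u)  # remove components along q_1..q_{k-1}
        Q[:, k] = u / np.linalg.norm(u)
    return Q
```

Because every operation here is differentiable, a fixed number of unrolled passes can be embedded in a network and trained end-to-end, which is how OPT uses these procedures.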
Both GS and IGS are easy to embed in neural networks, since they are both differentiable. In our experiments, we find that the performance gain of unrolling multiple IGS steps over GS is not very obvious (partially because GS already achieves nearly perfect orthogonality), but IGS incurs longer training time. Therefore, we unroll the classic GS process by default.
A.2 Householder Reflection
Let $v \in \mathbb{R}^n$ be a non-zero vector. A matrix of the form

$$H = I - 2\,\frac{v v^\top}{v^\top v}$$

is a Householder reflection. The vector $v$ is the Householder vector. If a vector $x$ is multiplied by the matrix $H$, it is reflected in the hyperplane $\mathrm{span}\{v\}^{\perp}$. Householder matrices are symmetric and orthogonal.
For a vector $x$, we let $v = x - \|x\|_2 e_1$, where $e_1 = (1, 0, \dots, 0)^\top$ (the first element is $1$ and the remaining elements are $0$). Then we construct the Householder reflection matrix $H$ with $v$ and multiply it to $x$:

$$Hx = x - 2\,\frac{v^\top x}{v^\top v}\, v = \|x\|_2 e_1,$$

which indicates that we can make any non-zero vector become $\alpha e_1$, where $\alpha$ is some constant, by using a Householder reflection. By left-multiplying a reflection we can turn a dense vector into a vector with the same length and with only a single nonzero entry. Repeating this $n$ times gives us the Householder QR factorization, which also orthogonalizes the original input matrix. Householder reflection orthogonalizes a matrix $A$ by triangularizing it:

$$H_n \cdots H_2 H_1 A = R,$$
where $R$ is an upper-triangular matrix in the QR factorization $A = QR$. The orthogonal matrix is constructed by $Q = H_1 H_2 \cdots H_n$, where $H_k$ is the Householder reflection performed on the $k$-th column of the partially triangularized matrix. The algorithm flowchart is given as follows:

The algorithm follows the Matlab notation, where $A(i{:}j,\, p{:}q)$ denotes the submatrix of $A$ from the $i$-th row to the $j$-th row and from the $p$-th column to the $q$-th column. Note that there are a number of variants of the Householder reflection orthogonalization, such as the implicit variant where we do not store each reflection explicitly. Here $Q$ is the final orthogonal matrix we need.
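The triangularization above can be sketched as follows. This is a minimal NumPy version with our own naming; note that, following common numerical practice, it uses the sign convention $v = x + \mathrm{sign}(x_1)\|x\|_2 e_1$ to avoid cancellation, which differs only in sign from the construction above:

```python
import numpy as np

def householder_qr(A):
    # Orthogonalize A by triangularization: H_n ... H_1 A = R, Q = H_1 ... H_n.
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for k in range(n):
        x = R[k:, k]
        v = x.copy()
        v[0] += (1.0 if x[0] >= 0 else -1.0) * np.linalg.norm(x)  # avoid cancellation
        v /= np.linalg.norm(v)
        Hk = np.eye(m)
        Hk[k:, k:] -= 2.0 * np.outer(v, v)   # Householder reflection I - 2 v v^T
        R = Hk @ R                            # zero out column k below the diagonal
        Q = Q @ Hk                            # accumulate the orthogonal factor
    return Q[:, :n], np.triu(R[:n, :])
```

The first return value has orthonormal columns and is the orthogonal matrix used by OPT; the second is the upper-triangular factor $R$.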
A.3 Löwdin’s Symmetric Orthogonalization
Let $\{a_1, \dots, a_n\}$ be a set of linearly independent vectors in an $n$-dimensional space, collected as the columns of a matrix $A$. We define a general non-singular linear transformation $T$ that transforms the basis $A$ to a new basis $B$:

$$B = A T,$$

where the basis $B$ will be orthonormal if (the transpose becomes the conjugate transpose in complex space)

$$B^\top B = T^\top S T = I,$$

where $S = A^\top A$ is the Gram matrix of the given basis $A$.
A general solution to this orthogonalization problem can be obtained via the substitution

$$T = S^{-1/2} W,$$

in which $W$ is an arbitrary orthogonal (or unitary) matrix. When $W = I$, we have the symmetric orthogonalization, namely

$$B_{\mathrm{sym}} = A S^{-1/2}.$$

When $W = V$, in which $V$ diagonalizes $S$, we have the canonical orthogonalization, namely

$$B_{\mathrm{can}} = A V \Lambda^{-1/2}.$$

Because $V$ diagonalizes $S$, we have $V^\top S V = \Lambda$. Therefore, we can write the transformation as $S^{-1/2} = V \Lambda^{-1/2} V^\top$, which is essentially an eigenvalue decomposition of the symmetric matrix $S$.
In order to compute Löwdin’s symmetrically orthogonalized basis sets, we can use the singular value decomposition (SVD). Specifically, the SVD of the original basis set is given by

$$A = U \Sigma V^\top,$$

where both $U$ and $V$ are orthogonal matrices and $\Sigma$ is the diagonal matrix of singular values. Therefore, we have that

$$S^{-1/2} = (A^\top A)^{-1/2} = (V \Sigma^2 V^\top)^{-1/2} = V \Sigma^{-1} V^\top,$$

where $\Lambda = \Sigma^2$ due to the connection between the eigenvalue decomposition and the SVD. Therefore, we end up with

$$B_{\mathrm{sym}} = A S^{-1/2} = U \Sigma V^\top V \Sigma^{-1} V^\top = U V^\top,$$

which is the output orthogonal matrix for Löwdin’s symmetric orthogonalization.
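The SVD route above makes the implementation a near one-liner; a minimal NumPy sketch (function name is ours):

```python
import numpy as np

def lowdin_orthogonalize(A):
    # Symmetric orthogonalization: B = A (A^T A)^{-1/2} = U V^T via the SVD A = U S V^T.
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ Vt
```

The result has orthonormal columns and coincides with the direct formula $A S^{-1/2}$ computed through the eigendecomposition of the Gram matrix.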
An interesting feature of the symmetric orthogonalization is that it ensures

$$B_{\mathrm{sym}} = \arg\min_{B \in \mathcal{O}(A)} \sum_{i} \|b_i - a_i\|^2,$$

where $a_i$ and $b_i$ are the $i$-th column vectors of $A$ and $B$, respectively, and $\mathcal{O}(A)$ denotes the set of all possible orthonormal sets in the range of $A$. This means that the symmetrically orthogonalized functions $b_i$ are the least distant in the Hilbert space from the original functions $a_i$. Therefore, symmetric orthogonalization performs the gentlest pushing of the directions of the vectors needed to make them orthogonal.
Appendix B Proof of Theorem 1
To be more specific, neurons with each element initialized by a zero-mean Gaussian distribution are uniformly distributed on a hypersphere. We formalize this argument with the following theorem.

Theorem. The normalized vector of independent Gaussian variables is uniformly distributed on the sphere. Formally, let $g_1, \dots, g_n \sim \mathcal{N}(0, 1)$ be independent. Then the vector

$$x = \frac{1}{Z}(g_1, \dots, g_n)^\top, \qquad Z = \Big(\sum_{i=1}^{n} g_i^2\Big)^{1/2},$$

follows the uniform distribution on the unit hypersphere $\mathbb{S}^{n-1}$, where $Z$ is a normalization factor.
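This is also exactly how one samples uniformly from a hypersphere in practice; a short NumPy sketch (function name is ours):

```python
import numpy as np

def sample_uniform_sphere(num, dim, rng):
    # Draw i.i.d. N(0, I) vectors and normalize them to the unit hypersphere.
    g = rng.normal(size=(num, dim))
    return g / np.linalg.norm(g, axis=1, keepdims=True)
```

By the rotational symmetry of the Gaussian, the empirical mean of such samples concentrates near the origin, consistent with uniformity on the sphere.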
A random variable $g$ has distribution $\mathcal{N}(0, 1)$ if it has the density function

$$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}.$$

An $n$-dimensional random vector $g$ has distribution $\mathcal{N}(0, I_n)$ if its components are independent and each has distribution $\mathcal{N}(0, 1)$. Then the density of $g$ is given by

$$f(x) = (2\pi)^{-n/2}\, e^{-\|x\|^2/2}.$$
Then we introduce the following lemma (Lemma 2) about the orthogonal invariance of the normal distribution.

Lemma 2. Let $g$ be an $n$-dimensional random vector with distribution $\mathcal{N}(0, I_n)$ and let $U$ be an orthogonal matrix ($U U^\top = U^\top U = I$). Then $Ug$ also has distribution $\mathcal{N}(0, I_n)$.
For any measurable set $A \subseteq \mathbb{R}^n$, we have that