1 Introduction
The inductive bias encoded in a neural network is generally determined by two major aspects: how the neural network is structured (i.e., network architecture) and how the neural network is optimized (i.e., training algorithm). For the same network architecture, using different training algorithms could lead to a dramatic difference in generalization performance (Keskar & Socher, 2017; Reddi et al., 2019) even if the training loss is already close to zero, implying that different training procedures lead to different inductive biases. Therefore, how to effectively train a neural network that generalizes well remains an open challenge.
Recent theories (Gunasekar et al., 2017, 2018; Kawaguchi, 2016; Li et al., 2018) suggest the importance of overparameterization in linear neural networks. For example, (Gunasekar et al., 2017) shows that optimizing an underdetermined quadratic objective over a matrix with gradient descent on a factorization of that matrix leads to an implicit regularization that may improve generalization. There is also empirical evidence (Ding et al., 2019; Liu et al., 2019) showing that overparameterizing the convolutional filters under some regularity is beneficial to generalization. Our paper aims to leverage the power of overparameterization and explore more intrinsic structural priors for training a well-performing neural network.
Motivated by this goal, we propose a generic orthogonal overparameterized training (OPT) framework for effectively training neural networks. Different from existing neural network training, OPT overparameterizes a neuron $w \in \mathbb{R}^d$ with the multiplication of a learnable layer-shared orthogonal matrix $R \in \mathbb{R}^{d\times d}$ and a fixed, randomly initialized weight vector $v \in \mathbb{R}^d$. Therefore, the equivalent weight for the neuron is $w = Rv$. Once each element of the neuron weight $v$ has been randomly initialized following a zero-mean Gaussian distribution (He et al., 2015; Glorot & Bengio, 2010), we fix it throughout the entire training process. Then OPT learns a layer-shared orthogonal transformation $R$ that is applied to all the neurons in the same layer. An illustration of OPT is given in Fig. 1. In contrast to standard neural network training, OPT decomposes the neuron into two components: an orthogonal transformation $R$ that learns a proper coordinate system and a weight vector $v$ that controls the position of the neuron. Essentially, the weight vectors of different neurons in the same layer determine the relative positions of these neurons, while the layer-shared orthogonal matrix specifies the coordinate system for these neurons.

Another motivation for OPT comes from the observation (Liu et al., 2018) that neural networks with lower hyperspherical energy generalize better. Hyperspherical energy quantifies the diversity of neurons on a hypersphere and essentially characterizes the relative positions of neurons via this form of diversity. (Liu et al., 2018) introduces hyperspherical energy as a regularization term in the network and therefore cannot guarantee that the hyperspherical energy is effectively minimized (due to the competing data-fitting loss). To address this issue, we leverage the property that hyperspherical energy is independent of the coordinate system in which the neurons live and depends only on their relative positions. Specifically, we prove that if we randomly initialize the neuron weights with certain distributions, these distributions attain the minimum hyperspherical energy in a probabilistic sense. It follows that OPT maintains the minimum energy throughout training by learning only a coordinate system (i.e., the layer-shared orthogonal matrix) for the neurons. Therefore, OPT can guarantee that the hyperspherical energy is well minimized.
We consider several ways to learn the orthogonal transformation. The first is to unroll different orthogonalization algorithms such as the Gram-Schmidt process, Householder reflection, and Löwdin's symmetric orthogonalization. Different unrolled algorithms yield different implicit regularizations when constructing the neuron weights; for example, symmetric orthogonalization guarantees that the new orthogonal basis has the least distance in the Hilbert space from the original non-orthogonal basis. Second, we consider using a special parameterization (such as the Cayley parameterization) to construct the orthogonal matrix, which is more efficient in training. Third, we try an orthogonality-preserving gradient descent to ensure that the matrix $R$ remains orthogonal after each gradient update. Last, we propose a relaxation of the optimization problem that turns orthogonality into a regularization on the coordinate system $R$. These different ways of learning the orthogonal transformation for neurons encode different inductive biases into the neural network.

Moreover, we propose a refinement strategy to further reduce the hyperspherical energy of the randomly initialized neuron weights: we directly minimize the hyperspherical energy of these weights as a preprocessing step before training on actual data. Finally, we provide theoretical justifications for why OPT yields better generalization than standard training.
We summarize the advantages of OPT as follows:


OPT is a universal neural network training framework with strong flexibility. There are several ways of learning the coordinate system (i.e., orthogonal transformation) and each one may impose a different inductive bias.

OPT is the first training method that can provably achieve minimum hyperspherical energy, leading to better generalization. More interestingly, OPT reveals that learning a proper coordinate system is crucial for generalization, while the relative positions of neurons can be well characterized by hyperspherical energy.

There is no extra computational cost for OPT-trained neural networks in inference. They have the same inference speed and model size as their standard counterparts.

OPT is shown to be useful for multi-layer perceptrons (MLPs), convolutional neural networks (CNNs), point cloud networks (PointNet) (Qi et al., 2017), and graph convolutional networks (GCN) (Kipf & Welling, 2016).
2 Related Work
Optimization for Deep Learning. A number of first-order optimization algorithms (Nesterov, 1983; Duchi et al., 2011; Kingma & Ba, 2014; Tieleman & Hinton, 2012; Zeiler, 2012; Reddi et al., 2019) have been proposed to improve the empirical convergence and generalization of deep neural networks. Our work is complementary to these optimization algorithms, since they can be easily applied within our framework.

Parameterization of Neurons. There are various ways to parameterize a neuron for different applications. (Ding et al., 2019) overparameterizes a 2D convolution kernel by combining a 2D kernel of the same size with two additional 1D asymmetric kernels; the resulting convolution kernel has the same number of effective parameters at inference time but more parameters during training due to the additional asymmetric kernels. (Liu et al., 2019) constructs a neuron with a bilinear parameterization which regularizes the bilinear similarity matrix. (Yang et al., 2015) reparameterizes the neuron matrix with an adaptive Fastfood transform to compress model parameters. (Jaderberg et al., 2014; Liu et al., 2015; Wang et al., 2017) employ sparse and low-rank structures to construct convolution kernels for efficient neural networks.
Hyperspherical learning. (Liu et al., 2017) proposes a neural network architecture that learns representations on a hypersphere and shows that the angular information in neural networks, in contrast to magnitude information, preserves the most semantic information. (Liu et al., 2018) defines the hyperspherical energy that quantifies the diversity of neurons on a hypersphere and empirically shows that minimizing hyperspherical energy improves generalization.
3 Orthogonal OverParameterized Training
3.1 General Framework
OPT parameterizes a neuron as the multiplication of an orthogonal matrix $R$ and a neuron weight vector $v$, so the equivalent neuron weight becomes $w = Rv$. The output of this neuron can be represented by $(Rv)^\top x$, where $x$ is the input vector. In the OPT framework, we fix the randomly initialized neuron weight $v$ and learn only the orthogonal matrix $R$. In contrast, a standard neuron is directly formulated as $w^\top x$, where the weight vector $w$ is learned during training.
Without loss of generality, we consider a two-layer linear MLP with a loss function $\ell(\cdot,\cdot)$ (e.g., we use the least square loss $\ell(\hat{y}, y) = (\hat{y} - y)^2$). Specifically, the learning objectives of standard training and OPT are

Standard: $\min_{\{w_i\},\,u}\ \sum_{(x,y)\in\mathcal{D}} \ell\Big(\sum_{i=1}^{n} u_i\, w_i^\top x,\ y\Big)$  (1)

OPT: $\min_{R,\,u}\ \sum_{(x,y)\in\mathcal{D}} \ell\Big(\sum_{i=1}^{n} u_i\, (R v_i)^\top x,\ y\Big)\quad \text{s.t. } R^\top R = R R^\top = I$

where $w_i \in \mathbb{R}^d$ is the $i$-th neuron in the first layer and $u = (u_1,\dots,u_n)$ is the output neuron in the second layer. In OPT, each element of $v_i$ is usually sampled from a zero-mean Gaussian distribution and is fixed throughout the entire training process. In general, OPT learns an orthogonal matrix $R$ that is applied to all the neurons instead of learning the individual neuron weights. Note that we usually do not apply OPT to neurons in the output layer (e.g., $u$ in this MLP example, and the final linear classifiers in CNNs), since it makes little sense to fix a set of random linear classifiers. Therefore, the central problem is how to learn these layer-shared orthogonal matrices.
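As a concrete illustration of this framework, here is a minimal numpy sketch of one OPT layer's forward pass. The dimensions, the random $R$ (produced by a QR factorization as a stand-in for a learned orthogonal matrix), and the toy input are all illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 16                      # input dim, number of neurons (toy values)

# Fixed, randomly initialized neuron weights v_i (one column per neuron).
V = rng.normal(0.0, 1.0, size=(d, n))

# Layer-shared orthogonal matrix; a random one here stands in for a learned R.
R, _ = np.linalg.qr(rng.normal(size=(d, d)))

x = rng.normal(size=d)            # a toy input vector

# OPT: the equivalent weight of neuron i is R @ V[:, i]; output is (R V)^T x.
out_opt = (R @ V).T @ x
# Standard training would instead learn V directly:
out_std = V.T @ x

# R preserves the norm of every neuron weight.
assert np.allclose(np.linalg.norm(R @ V, axis=0), np.linalg.norm(V, axis=0))
```

At inference time $RV$ can be multiplied out once, which is why OPT adds no inference cost.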
3.2 Hyperspherical Energy Perspective
We take a close look at OPT from the hyperspherical energy perspective. Following (Liu et al., 2018), the hyperspherical energy of $n$ neurons $\{w_1,\dots,w_n\}$ is defined as

$E_s(\hat{w}_1,\dots,\hat{w}_n) = \sum_{i=1}^{n}\sum_{j=1,\,j\neq i}^{n} \|\hat{w}_i - \hat{w}_j\|^{-s}, \quad s > 0$  (2)

in which $\hat{w}_i = w_i / \|w_i\|$ is the $i$-th neuron weight projected onto the unit hypersphere $\mathbb{S}^{d-1} = \{w \in \mathbb{R}^d : \|w\| = 1\}$. Hyperspherical energy is used to characterize the diversity of neurons on the unit hypersphere. Assume that we have $n$ neurons $\{v_1,\dots,v_n\}$ in one layer and we have learned an orthogonal matrix $R$ for these neurons. The hyperspherical energy of these OPT-trained neurons is given by

$E_s(\{\widehat{Rv_i}\}) = \sum_{i\neq j} \|R\hat{v}_i - R\hat{v}_j\|^{-s} = \sum_{i\neq j} \|\hat{v}_i - \hat{v}_j\|^{-s} = E_s(\{\hat{v}_i\})$  (3)

where we use $\widehat{Rv_i} = R\hat{v}_i$ and $\|R\hat{v}_i - R\hat{v}_j\| = \|\hat{v}_i - \hat{v}_j\|$, since orthogonal matrices preserve norms. This concludes that OPT will not change the hyperspherical energy of each layer during training.
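This invariance is easy to check numerically. The sketch below (with an illustrative energy exponent $s=1$ and toy dimensions) verifies that applying one orthogonal matrix to all neurons leaves the hyperspherical energy unchanged:

```python
import numpy as np

def hyperspherical_energy(W, s=1.0):
    """E_s = sum_{i != j} ||w_i_hat - w_j_hat||^(-s), neurons as columns of W."""
    Wh = W / np.linalg.norm(W, axis=0, keepdims=True)   # project onto the sphere
    n = Wh.shape[1]
    e = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                e += np.linalg.norm(Wh[:, i] - Wh[:, j]) ** (-s)
    return e

rng = np.random.default_rng(0)
V = rng.normal(size=(8, 16))                   # toy neuron weights
R, _ = np.linalg.qr(rng.normal(size=(8, 8)))   # any orthogonal matrix

# Orthogonal transforms preserve pairwise distances on the sphere,
# so the energy of {R v_i} equals the energy of {v_i}.
assert np.isclose(hyperspherical_energy(R @ V), hyperspherical_energy(V))
```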
Moreover, (Liu et al., 2018) proves that the minimum hyperspherical energy corresponds to the uniform distribution over the hypersphere. As a result, if the initialization of the neurons in the same layer follows the uniform distribution over the hypersphere, then the hyperspherical energy is guaranteed to be minimal in a probabilistic sense.
Theorem 1.
For a neuron $w = Rv$ where the entries of $v = (v^{(1)},\dots,v^{(d)})$ are initialized i.i.d. following a zero-mean Gaussian distribution (i.e., $v^{(i)} \sim \mathcal{N}(0,\sigma^2)$), its projection $\hat{v} = v/\|v\|$ onto the unit hypersphere is guaranteed to follow the uniform distribution on $\mathbb{S}^{d-1}$.
Theorem 1 implies that if we initialize the neurons in the same layer with a zero-mean Gaussian distribution, then the corresponding hyperspherical energy is guaranteed to be small: the neurons will be uniformly distributed on the unit hypersphere, and hyperspherical energy quantifies exactly this uniformity. More importantly, current prevailing neuron initializations such as Xavier (Glorot & Bengio, 2010) and Kaiming (He et al., 2015) are essentially zero-mean Gaussian distributions. Therefore, in practice our neurons naturally have very low hyperspherical energy from the very beginning.
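Theorem 1 can also be checked empirically. The sketch below projects i.i.d. Gaussian samples onto the unit sphere and tests two consequences of uniformity (zero mean and equal second moments across coordinates); the sample size and tolerances are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 200_000
V = rng.normal(0.0, 1.0, size=(d, n))      # i.i.d. zero-mean Gaussian entries
Vh = V / np.linalg.norm(V, axis=0)         # project onto the unit hypersphere

# All projected points lie exactly on the sphere.
assert np.allclose(np.linalg.norm(Vh, axis=0), 1.0)
# By rotational symmetry of the Gaussian, each coordinate of the projection
# has mean ~0 and second moment ~1/d, as the uniform distribution requires.
assert np.all(np.abs(Vh.mean(axis=1)) < 0.01)
assert np.allclose((Vh ** 2).mean(axis=1), 1.0 / d, atol=0.01)
```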
3.3 Unrolling Orthogonalization Algorithms
In order to learn the orthogonal transformation, we propose to unroll classic orthogonalization algorithms from numerical linear algebra and embed them into the neural network so that training remains end-to-end. We need to make sure every step of the orthogonalization algorithm is differentiable; the training flow is shown in Fig. 2.
Gram-Schmidt Process. This method takes a linearly independent set and produces an orthogonal set from it. The Gram-Schmidt process (GS) orthogonalizes a set of vectors $\{v_1,\dots,v_n\}$ with the following steps:

$u_1 = v_1,\qquad u_k = v_k - \sum_{j=1}^{k-1}\mathrm{proj}_{u_j}(v_k),\qquad e_k = \frac{u_k}{\|u_k\|},\quad k = 1,\dots,n$  (4)

where the projection operator is given by $\mathrm{proj}_{u}(v) = \frac{\langle u, v\rangle}{\langle u, u\rangle}u$ and $\{e_1,\dots,e_n\}$ is the obtained orthonormal set. In practice, we can use modified GS for numerical stability. To achieve better orthogonality, we can also unroll an iterative GS (Hoffmann, 1989) with multiple iterative steps.
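A minimal, non-vectorized sketch of the step in Eq. (4), applied to the columns of a toy matrix (the paper's unrolled version would operate on network weights inside autodiff):

```python
import numpy as np

def gram_schmidt(V):
    """Classic Gram-Schmidt over the columns of V; returns an orthonormal E."""
    d, n = V.shape
    E = np.zeros((d, n))
    for k in range(n):
        u = V[:, k].copy()
        for j in range(k):
            # Subtract the projection onto each already-orthonormal direction.
            u -= (E[:, j] @ V[:, k]) * E[:, j]
        E[:, k] = u / np.linalg.norm(u)
    return E

rng = np.random.default_rng(0)
V = rng.normal(size=(6, 6))        # linearly independent with probability 1
E = gram_schmidt(V)
assert np.allclose(E.T @ E, np.eye(6), atol=1e-8)
```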
Householder Reflection. As one of the classic transformations used in QR factorization, Householder reflection (HR) can also compute an orthogonal set from a group of vectors. A Householder reflector is defined as $H = I - 2\frac{uu^\top}{u^\top u}$, where $u$ is perpendicular to the reflection hyperplane. In QR factorization, HR is used to transform a (nonsingular) square matrix into the product of an orthogonal matrix and an upper triangular matrix. Given a matrix $M = [a_1,\dots,a_n]$, we consider the first column vector $a_1$. We use a Householder reflector to transform $a_1$ to $\|a_1\|e_1$. Specifically, we construct $H_1$ as

$H_1 = I - 2\frac{uu^\top}{u^\top u},\qquad u = a_1 - \|a_1\|e_1$  (5)

which is an orthogonal matrix, so that the first column of $H_1 M$ becomes $\|a_1\|e_1$. At the $k$-th step, we view the bottom-right submatrix of $H_{k-1}\cdots H_1 M$ as a new $M$ and use the same procedure to construct the Householder transformation $H_k$ (extended with an identity block so that the first $k-1$ rows and columns are untouched). We construct the final Householder transformation as $H = H_n\cdots H_2 H_1$, which gradually transforms $M$ into an upper triangular matrix with a sequence of Householder reflections. Therefore, we have that

$M = H^\top T = (H_n\cdots H_2 H_1)^\top T$  (6)

where $T$ is the upper triangular matrix (which is different from the matrix $R$ in Fig. 2) and the final obtained orthogonal set is $H^\top$ (i.e., $R = (H_n\cdots H_1)^\top$).
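The construction above can be sketched as a plain QR-via-reflections routine on a toy matrix (not the paper's unrolled layer):

```python
import numpy as np

def householder_qr(M):
    """QR via Householder reflections: M = Q @ T with Q orthogonal, T upper
    triangular (Q plays the role of the orthogonal set H^T)."""
    d = M.shape[0]
    H_total = np.eye(d)
    T = M.astype(float).copy()
    for k in range(d - 1):
        a = T[k:, k]
        u = a.copy()
        u[0] -= np.linalg.norm(a)          # u = a - ||a|| e_1
        if np.linalg.norm(u) < 1e-12:      # column already in the right form
            continue
        Hk = np.eye(d)
        Hk[k:, k:] -= 2.0 * np.outer(u, u) / (u @ u)
        T = Hk @ T                          # zero out column k below diagonal
        H_total = Hk @ H_total
    return H_total.T, T                     # H M = T  =>  M = H^T T

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
Q, T = householder_qr(M)
assert np.allclose(Q @ T, M, atol=1e-8)
assert np.allclose(Q.T @ Q, np.eye(5), atol=1e-8)
assert np.allclose(T, np.triu(T), atol=1e-8)
```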
Löwdin’s Symmetric Orthogonalization. Let the matrix $M = [m_1,\dots,m_n]$ collect a given set of linearly independent vectors in an $n$-dimensional space. A nonsingular linear transformation $A$ can transform the basis $M$ to an orthogonal basis $B$: $B = MA$. The matrix $B$ will be orthogonal if the following equation holds:

$B^\top B = A^\top M^\top M A = A^\top G A = I$  (7)

where $G = M^\top M$ is the Gram matrix of the given set $M$. We obtain a general solution to the orthogonalization problem via the substitution $A = G^{-1/2}P$, where $P$ is an arbitrary unitary matrix. The specific choice $P = I$ gives Löwdin’s symmetric orthogonalization (LS): $B = MG^{-1/2}$. We can analytically obtain the symmetric orthogonalization from the singular value decomposition $M = U\Sigma V^\top$: LS then gives $B = UV^\top$ as the orthogonal set for $M$.

LS possesses a remarkable property that the other orthogonalizations do not have: the orthogonal set resembles the original set in a nearest-neighbour sense. Specifically, LS guarantees that $\sum_i \|b_i - m_i\|^2$ (where $b_i$ and $m_i$ are the $i$-th columns of $B$ and $M$, respectively) is minimized. Intuitively, LS performs the gentlest pushing of the directions of the vectors in order to make them orthogonal.
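A short sketch of LS via the SVD, including a check of its nearest-orthogonal-matrix property against a QR-based orthogonalization of the same matrix:

```python
import numpy as np

def lowdin(M):
    """Symmetric orthogonalization: B = U V^T from the SVD M = U S V^T."""
    U, _, Vt = np.linalg.svd(M)
    return U @ Vt

rng = np.random.default_rng(1)
M = rng.normal(size=(5, 5))
B = lowdin(M)
assert np.allclose(B.T @ B, np.eye(5), atol=1e-8)

# Nearest-neighbour property: B = U V^T is the Frobenius-closest orthogonal
# matrix to M, so it is at least as close as the QR-based orthogonalization.
Q, _ = np.linalg.qr(M)
assert np.linalg.norm(B - M) <= np.linalg.norm(Q - M) + 1e-8
```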
Discussion. These orthogonalization algorithms are fully differentiable and end-to-end trainable. For better orthogonality, the algorithms can be applied iteratively, and we can unroll them with multiple iterations; empirically, one-step unrolling usually works well already. We have also considered Givens rotations for constructing the orthogonal matrix, but they need to traverse all lower triangular elements of the original set, which requires on the order of $n^2$ rotations and is therefore too computationally expensive. More interestingly, each orthogonalization method encodes a unique inductive bias for the resulting neurons by imposing an implicit regularization (e.g., least distance in Hilbert space for LS). More details about the orthogonalizations are provided in Appendix A.
3.4 Orthogonal Parameterization
A more convenient way to ensure orthogonality while learning the matrix $R$ is to use a special parameterization that inherently guarantees orthogonality. The exponential parameterization uses $R = \exp(Q)$ (where $\exp$ denotes the matrix exponential) to represent an orthogonal matrix via a skew-symmetric matrix $Q$. The Cayley parameterization (CP) is a Padé approximation of the exponential parameterization and is a more natural choice due to its simplicity. CP uses the following transform to construct an orthogonal matrix $R$ from a skew-symmetric matrix $Q$:

$R = (I - Q)(I + Q)^{-1}$  (8)

where $Q^\top = -Q$. We note that the Cayley parameterization only produces orthogonal matrices with determinant $+1$, which belong to the special orthogonal group $SO(d)$. Specifically, it suffices to learn the upper (or lower) triangular part of $Q$ with unconstrained optimization to obtain a desired orthogonal matrix $R$. The Cayley parameterization does not cover the entire orthogonal group and is less flexible in terms of representation power, which serves as an explicit regularization for the neurons.
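The Cayley parameterization is a one-liner to sketch; the skew-symmetric $Q$ is built from an arbitrary matrix, and the assertions check orthogonality and determinant $+1$:

```python
import numpy as np

def cayley(Q):
    """R = (I - Q)(I + Q)^{-1} for a skew-symmetric Q; R is orthogonal with
    det(R) = +1 (the special orthogonal group)."""
    I = np.eye(Q.shape[0])
    return (I - Q) @ np.linalg.inv(I + Q)

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6))
Q = A - A.T                      # skew-symmetric: only a triangle is free
R = cayley(Q)
assert np.allclose(R.T @ R, np.eye(6), atol=1e-8)
assert np.isclose(np.linalg.det(R), 1.0)
```

Since $I + Q$ is always invertible for skew-symmetric $Q$, the triangular entries of $Q$ can be optimized without any constraint.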
3.5 Orthogonality-Preserving Gradient Descent
An alternative way to guarantee orthogonality is to modify the gradient update for the transformation matrix $R$. The general idea is to initialize $R$ with an arbitrary orthogonal matrix and then ensure that every gradient update applies an orthogonal transformation to $R$; this essentially performs gradient descent on the Stiefel manifold. There is a large body of work (Li et al., 2020; Wen & Yin, 2013; Wisdom et al., 2016; Lezcano-Casado & Martínez-Rubio, 2019; Arjovsky et al., 2016; Henaff et al., 2016; Jing et al., 2017) focusing on optimization on the Stiefel manifold.
Given a matrix $R^{(t)}$ that is initialized as an orthogonal matrix, we aim to construct an orthogonal transformation as the gradient update. We use the Cayley transform to compute a parametric curve on the Stiefel manifold (with a specific metric) via a skew-symmetric matrix, and use it as the update rule:

$R^{(t+1)} = \Big(I + \frac{\eta}{2}W^{(t)}\Big)^{-1}\Big(I - \frac{\eta}{2}W^{(t)}\Big)R^{(t)}$  (9)

where $W^{(t)} = G^{(t)}(R^{(t)})^\top - R^{(t)}(G^{(t)})^\top$ is skew-symmetric and $\eta$ is the learning rate. $R^{(t)}$ denotes the orthogonal matrix in the $t$-th iteration, and $G^{(t)}$ denotes the original gradient of the loss function w.r.t. $R^{(t)}$. We term such a gradient update orthogonality-preserving gradient descent (OGD). To reduce the computational cost of the matrix inverse in Eq. (9), we use an iterative method (Li et al., 2020) to approximate the Cayley transform without any matrix inverse. By moving terms in Eq. (9), we arrive at the following fixed-point iteration:

$Y_{k+1} = R^{(t)} - \frac{\eta}{2}W^{(t)}\big(R^{(t)} + Y_k\big),\qquad Y_0 = R^{(t)}$  (10)

which converges to the closed-form Cayley transform as the number of iterations $k$ grows. In practice, we empirically find that two iterations usually suffice for a reasonable approximation accuracy.
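The update in Eq. (9) and its inverse-free approximation in Eq. (10) can be sketched as follows; the gradient G is a random placeholder and the learning rate is an arbitrary small value:

```python
import numpy as np

rng = np.random.default_rng(0)
d, eta = 6, 0.01
R, _ = np.linalg.qr(rng.normal(size=(d, d)))   # current orthogonal iterate
G = rng.normal(size=(d, d))                    # placeholder gradient dL/dR
I = np.eye(d)

W = G @ R.T - R @ G.T                          # skew-symmetric (W^T = -W)

# Closed-form Cayley update.
R_next = np.linalg.inv(I + 0.5 * eta * W) @ (I - 0.5 * eta * W) @ R

# Inverse-free fixed-point iteration: Y <- R - (eta/2) W (R + Y).
Y = R.copy()
for _ in range(2):                             # two iterations usually suffice
    Y = R - 0.5 * eta * W @ (R + Y)

assert np.allclose(R_next.T @ R_next, I, atol=1e-8)   # update stays orthogonal
assert np.linalg.norm(Y - R_next) < 1e-2              # close approximation
```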
3.6 Relaxation to Orthogonal Regularization
We relax the optimization with a hard orthogonality constraint to an unconstrained optimization with an orthogonality regularization (OR). Specifically, we remove the orthogonality constraint in Eq. (1) and add an orthogonality regularizer for $R$, i.e., $\|R^\top R - I\|_F^2$, to the objective function. Taking Eq. (1) as an example, the training objective becomes

$\min_{R,\,u}\ \sum_{(x,y)} \ell\Big(\sum_{i=1}^{n} u_i\,(R v_i)^\top x,\ y\Big) + \lambda\,\|R^\top R - I\|_F^2$  (11)

where $\lambda$ is a hyperparameter and $v_i$ are the fixed neuron weights. This serves as an approximation to the OPT objective; the relaxation cannot guarantee that the hyperspherical energy stays unchanged.
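A sketch of the soft-orthogonality penalty used in this relaxation ($\lambda$ is an arbitrary illustrative value):

```python
import numpy as np

def orth_penalty(R, lam=1e-4):
    """lam * ||R^T R - I||_F^2: a soft version of the orthogonality constraint,
    added to the data loss instead of being enforced exactly."""
    I = np.eye(R.shape[1])
    return lam * np.linalg.norm(R.T @ R - I, 'fro') ** 2

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))
assert np.isclose(orth_penalty(Q), 0.0)           # orthogonal -> zero penalty
assert orth_penalty(rng.normal(size=(5, 5))) > 0  # generic matrix -> penalized
```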
4 Refining the Random Initialization
Minimizing hyperspherical energy. Because the neuron weight vectors $v_i$ are randomly initialized, there is variance that makes the hyperspherical energy deviate from the minimum, even though the energy is minimized in expectation. To reduce the hyperspherical energy more effectively, we propose to refine the random initialization by minimizing its hyperspherical energy as a preprocessing step. Specifically, before feeding these neuron weights to OPT, we first minimize the hyperspherical energy in Eq. (2) with gradient descent (without the data-fitting loss). Moreover, since randomly initialized neurons cannot minimize the half-space hyperspherical energy (Liu et al., 2018), in which the collinearity redundancy is removed, we can also perform half-space hyperspherical energy minimization as a preprocessing step.

Normalizing the randomly initialized neurons. Since the norms of the randomly initialized neuron weights serve a role similar to weighting the importance of different neurons, we further consider normalizing the neuron weights so that each weight vector has unit norm.
We evaluate both refinements in Section 7.4, and we also show that OPT still performs well without these refinements.
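The energy-minimization preprocessing can be sketched as follows. For brevity this uses a numerical gradient and a crude backtracking step rather than the analytic gradient descent used in practice, and all sizes are toy values:

```python
import numpy as np

def energy(W, s=1.0):
    """Hyperspherical energy of the columns of W (Eq. (2) with s = 1)."""
    Wh = W / np.linalg.norm(W, axis=0, keepdims=True)
    n = W.shape[1]
    D = np.linalg.norm(Wh[:, :, None] - Wh[:, None, :], axis=0)
    mask = ~np.eye(n, dtype=bool)
    return np.sum(D[mask] ** (-s))

def num_grad(W, h=1e-5):
    """Central-difference gradient of the energy (autodiff stand-in)."""
    g = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += h
        Wm[idx] -= h
        g[idx] = (energy(Wp) - energy(Wm)) / (2 * h)
    return g

rng = np.random.default_rng(0)
V = rng.normal(size=(3, 8))            # randomly initialized toy neurons
e0 = energy(V)
for _ in range(100):
    g = num_grad(V)
    step = 1e-2
    # Backtracking: shrink the step until the energy decreases.
    while step > 1e-10 and energy(V - step * g) >= energy(V):
        step *= 0.5
    V = V - step * g

assert energy(V) < e0   # the preprocessing lowers the hyperspherical energy
```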
5 Theoretical Insights on Generalization
The key question we aim to answer in this section is why OPT may lead to better generalization. We have already shown that OPT can guarantee the minimum hyperspherical energy (MHE) in a probabilistic sense. Although empirical evidence (Liu et al., 2018) has shown significant and consistent performance gains from minimizing hyperspherical energy, it remains unclear why lower hyperspherical energy leads to better generalization. We argue that OPT leads to better generalization from two aspects: how OPT may affect training and generalization in theory, and why minimum hyperspherical energy serves as a good inductive bias.
Our goal here is to leverage and apply existing theoretical results (Kawaguchi, 2016; Xie et al., 2016; Soudry & Carmon, 2016; Lee et al., 2016; Du et al., 2017; Allen-Zhu et al., 2018) to explain the role that MHE plays, rather than proving sharp and novel generalization bounds. We simply consider one-hidden-layer networks $f(x) = \sum_{i=1}^{n} u_i\,\sigma(w_i^\top x)$ as the hypothesis class, where $\sigma(z) = \max(z, 0)$ is ReLU. Since the magnitude of $u_i$ can be scaled into $w_i$, we can restrict $u_i$ to be $\pm 1$. Given a set of $m$ i.i.d. training samples $\{(x_j, y_j)\}_{j=1}^{m}$ where $x_j$ is drawn uniformly from the unit hypersphere, we minimize the least square loss $L(w) = \frac{1}{2m}\sum_{j=1}^{m}(f(x_j) - y_j)^2$. The gradient w.r.t. $w_i$ is

$\frac{\partial L}{\partial w_i} = \frac{1}{m}\sum_{j=1}^{m}\big(f(x_j) - y_j\big)\,u_i\,\sigma'(w_i^\top x_j)\,x_j$  (12)
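The gradient formula in Eq. (12) can be verified against a numerical gradient on a toy instance (all sizes and data here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, m = 4, 3, 10
W = rng.normal(size=(d, n))              # neuron weights w_i as columns
u = rng.choice([-1.0, 1.0], size=n)      # output weights restricted to +-1
X = rng.normal(size=(d, m))
X /= np.linalg.norm(X, axis=0)           # inputs on the unit hypersphere
y = rng.normal(size=m)

def f(W):
    return (u[None, :] @ np.maximum(W.T @ X, 0.0)).ravel()

def loss(W):
    return 0.5 * np.mean((f(W) - y) ** 2)

# Analytic gradient: dL/dw_i = (1/m) sum_j r_j u_i 1[w_i^T x_j > 0] x_j.
r = f(W) - y                             # residuals, shape (m,)
act = (W.T @ X > 0).astype(float)        # sigma', shape (n, m)
grad = X @ (act * u[:, None] * r[None, :]).T / m    # shape (d, n)

# Check against a central-difference numerical gradient.
h = 1e-6
num = np.zeros_like(W)
for idx in np.ndindex(*W.shape):
    Wp, Wm = W.copy(), W.copy()
    Wp[idx] += h
    Wm[idx] -= h
    num[idx] = (loss(Wp) - loss(Wm)) / (2 * h)
assert np.allclose(grad, num, atol=1e-5)
```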
Let $w = [w_1^\top,\dots,w_n^\top]^\top \in \mathbb{R}^{nd}$ be the column concatenation of the neuron weights. We aim to identify the conditions under which there are no spurious local minima. We rewrite the gradient as

$\frac{\partial L}{\partial w} = \frac{1}{m}\,D\,r$  (13)

where $r = \big(f(x_1) - y_1,\dots,f(x_m) - y_m\big)^\top \in \mathbb{R}^{m}$, $D = [d_1,\dots,d_m] \in \mathbb{R}^{nd\times m}$, and $d_j = \big[u_1\sigma'(w_1^\top x_j)x_j^\top,\dots,u_n\sigma'(w_n^\top x_j)x_j^\top\big]^\top$. Therefore, we can obtain that

$\epsilon = \frac{1}{m}\|r\|^2 \;\le\; \frac{m\,\big\|\frac{\partial L}{\partial w}\big\|^2}{\sigma_{\min}(D)^2}$  (14)

where $\epsilon$ is the training error and $\sigma_{\min}(D)$ is the minimum singular value of $D$. For the training error to be small, we have to lower bound $\sigma_{\min}(D)$ away from zero. We have the following result (Lemma 1) from (Xie et al., 2016), which lower-bounds $\sigma_{\min}(D)$ in terms of the diversity of the neuron directions.
Once MHE is achieved, the neurons are uniformly distributed on the unit hypersphere. From Lemma 1, when the neurons are uniformly distributed on the unit hypersphere, the quantity in the bound that measures how clustered the neurons are will be very small and close to zero, leading to a large lower bound for $\sigma_{\min}(D)$. Therefore, MHE results in a small training error whenever the gradient norm is small. This result implies that there are no spurious local minima if we use OPT for training.
We further argue that the MHE induced by OPT serves as an important inductive bias for neural networks. As the standard regularizer for neural networks, weight decay controls the norm of the neuron weights, essentially regularizing one dimension of each weight. In contrast, MHE completes the missing pieces by regularizing the remaining dimensions of the weight. MHE encourages minimum hyperspherical redundancy between neurons; in the linear-classifier case, MHE imposes a prior of maximal inter-class separability.
6 Discussions
Semi-randomness. OPT fixes the randomly initialized neuron weight vectors and only learns a layer-shared orthogonal matrix for each layer, so OPT naturally imposes a strong randomness on the neurons and combines the generalization benefits of randomness with the strong approximation power of neural networks. Such randomness suggests that the specific configuration of relative positions among neurons does not matter that much, and that the coordinate system is more crucial for generalization. (Kawaguchi et al., 2018; Rahimi & Recht, 2008; Srivastava et al., 2014) also show that randomness can be beneficial to generalization.
Coordinate system vs. relative position. OPT shows that learning only the coordinate system yields much better generalization than learning the neuron weights directly, implying that the coordinate system is of great importance to generalization. However, the relative positions become unimportant only when the hyperspherical energy is sufficiently low; in other words, hyperspherical energy well characterizes the relative positions among neurons, and lower hyperspherical energy leads to better generalization.
Flexible training. First, OPT can be used in multi-task training (Mallya et al., 2018), where each set of orthogonal matrices represents one task: OPT can learn a different set of orthogonal matrices for each task while the neuron weights remain the same. Second, we can perform progressive training with OPT. For example, after learning a set of orthogonal matrices on a large coarse-grained dataset (i.e., pretraining), we can multiply the orthogonal matrices back into the neuron weights to construct a new set of neuron weights. We can then use the new neuron weights as a starting point and apply OPT to train on a small fine-grained dataset (i.e., finetuning).
Limitations and open problems. The limitations of OPT include higher GPU memory consumption and computation during training, more numerical issues when ensuring orthogonality, and weak scalability to ultra-wide neural networks. Therefore, there are plenty of open problems in OPT, such as scalable and efficient training. Most significantly, OPT opens up a new possibility for studying the theoretical generalization of deep networks: with the decomposition into hyperspherical energy and a coordinate system, OPT provides a new perspective for future research.
7 Experiments and Results
7.1 Experimental settings
We evaluate OPT on various types of neural networks such as multi-layer perceptrons (for image classification), convolutional neural networks (for image classification), graph neural networks (for graph node classification), and point cloud neural networks (for point cloud classification). Our goal is to show the performance gain of OPT on training different neural networks rather than achieving state-of-the-art performance on different tasks. More experimental details and network architectures are given in Appendix D.
7.2 Ablation Study and Exploratory Experiment
Table 1: Testing error (%) on CIFAR-100.

  Method   | FN | LR | CNN-6 | CNN-9
  ---------|----|----|-------|------
  Baseline | -  | -  | 37.59 | 33.55
  UPT      | N  | U  | 48.47 | 46.72
  UPT      | Y  | U  | 42.61 | 39.38
  OPT      | N  | GS | 37.24 | 32.95
  OPT      | Y  | GS | 33.02 | 31.03
Necessity of orthogonality. We first examine whether orthogonality is necessary for OPT. We use both a 6-layer CNN and a 9-layer CNN (specified in Appendix D) on CIFAR-100. We compare OPT with a baseline that uses the same network architecture but learns an unconstrained matrix with only weight decay regularization; we term this baseline unconstrained overparameterized training (UPT). "FN" in Table 1 denotes whether the randomly initialized neuron weights are fixed throughout training ("Y" for yes, "N" for no). "LR" denotes whether the learnable transformation is unconstrained ("U") or orthogonal ("GS" for Gram-Schmidt process). The results in Table 1 show that without ensuring orthogonality, UPT performs much worse than OPT, which unrolls the Gram-Schmidt process for orthogonality (no matter whether the neuron weights are fixed or not). Thus, orthogonality is indeed necessary.
Fixed weight vs. learnable weight. From Table 1, we can see that using fixed neuron weights is consistently better than learnable neuron weights in both UPT and OPT. It shows that fixing the neuron weights while learning the transformation matrix is beneficial to generalization.
High vs. low hyperspherical energy. We empirically verify that high hyperspherical energy corresponds to inferior generalization performance. To initialize neurons with high hyperspherical energy, we use random initializations whose mean is shifted to 0, 1e-3, 1e-2, 2e-2, 3e-2, and 5e-2.
Table 2: Hyperspherical energy and testing error of CNN-6 on CIFAR-100 under different initialization means.

  Mean | Energy | Error (%)
  -----|--------|----------
  0    | 3.5109 | 32.49
  1e-3 | 3.5117 | 33.11
  1e-2 | 3.5160 | 39.51
  2e-2 | 3.5531 | 53.89
  3e-2 | 3.6761 | N/C
  5e-2 | 4.2776 | N/C
We use CNN-6 to conduct experiments on CIFAR-100. The results in Table 2 ("N/C" denotes not converged) show that networks with higher hyperspherical energy are more difficult to converge. Moreover, we find that once the hyperspherical energy exceeds a certain value, the network cannot converge at all. Note that when the hyperspherical energy is small (near the minimum), even a slight change in hyperspherical energy (e.g., from 3.5109 to 3.5117) can lead to a dramatic generalization gap (e.g., from a 32.49% error rate to 33.11%). One can observe that higher hyperspherical energy leads to worse generalization.
7.3 MultiLayer Perceptrons
Table 3: Testing error (%) on MNIST.

  Method    | Normal | Xavier
  ----------|--------|-------
  Baseline  | 6.05   | 2.14
  OPT (GS)  | 5.11   | 1.45
  OPT (HR)  | 5.31   | 1.60
  OPT (LS)  | 5.32   | 1.54
  OPT (CP)  | 5.14   | 1.49
  OPT (OGD) | 5.38   | 1.56
  OPT (OR)  | 5.41   | 1.78
We evaluate different variants of OPT for MLPs on MNIST, using a 3-layer MLP for all training methods. Specific training hyperparameters are given in Appendix D. Table 3 shows the testing error on MNIST when the neuron weights use normal initialization or Xavier initialization (Glorot & Bengio, 2010). OPT (GS/HR/LS) denote OPT with the unrolled orthogonalization algorithms, OPT (CP) denotes OPT with the Cayley parameterization, OPT (OGD) is OPT with orthogonality-preserving gradient descent, and OPT (OR) denotes OPT with the relaxed orthogonality regularization. We can see that OPT (GS) performs the best and that all OPT variants outperform the baseline by a considerable margin.
7.4 Convolutional Neural Networks
Table 4: Testing error (%) on CIFAR-100.

  Method    | CNN-6 | CNN-9
  ----------|-------|------
  Baseline  | 37.59 | 33.55
  HSMHE     | 34.97 | 32.87
  OPT (GS)  | 33.02 | 31.03
  OPT (HR)  | 35.67 | 32.75
  OPT (LS)  | 34.48 | 31.22
  OPT (CP)  | 33.53 | 31.28
  OPT (OGD) | 33.33 | 31.47
  OPT (OR)  | 34.70 | 32.63
OPT variants. We evaluate all the OPT variants with a plain 6-layer CNN and a plain 9-layer CNN on CIFAR-100; detailed network architectures are given in Appendix D. All neurons are initialized following (He et al., 2015), and batch normalization (Ioffe & Szegedy, 2015) is used by default. Results in Table 4 show that nearly all OPT variants consistently outperform both the baseline and the HSMHE regularization (Liu et al., 2018) by a significant margin. The HSMHE regularization puts the half-space hyperspherical energy into the loss function and minimizes it with stochastic gradients, which is a naive way to minimize the hyperspherical energy. From the results, we observe that OPT (HR) performs the worst among all OPT variants. In contrast, OPT (GS) achieves the best testing error, implying that the Gram-Schmidt process imposes a suitable inductive bias for CNNs on CIFAR-100.

Table 5: Testing error (%) of CNN-6 on CIFAR-100 without batch normalization.

  Method    | Error (%)
  ----------|----------
  Baseline  | 38.95
  HSMHE     | 36.90
  OPT (GS)  | 35.61
  OPT (HR)  | 37.51
  OPT (LS)  | 35.83
  OPT (CP)  | 34.88
  OPT (OGD) | 35.38
  OPT (OR)  | N/C
Training without batch normalization. We further evaluate how OPT performs without batch normalization. Specifically, we use CNN-6 as the backbone network and test on CIFAR-100. From Table 5, one can see that the OPT variants again outperform both the baseline and HSMHE (Liu et al., 2018), validating that OPT works reasonably well without batch normalization. Among all the OPT variants, the Cayley parameterization achieves very competitive testing error, 4.07% lower than standard training.
Training dynamics. We also look into how the hyperspherical energy and testing error change during training with OPT. For hyperspherical energy, we can see from Fig. 3 that the energy of the baseline increases dramatically at the beginning and then gradually goes down, but it still stays at a relatively high value at the end; MHE can effectively reduce the hyperspherical energy by the end of training. In contrast, all OPT variants maintain a very low hyperspherical energy from the beginning. OPT (GS) and OPT (CP) keep exactly the same hyperspherical energy as the randomly initialized neurons, while OPT (OR) may increase the hyperspherical energy slightly since it is a relaxation. For testing error, all OPT variants converge stably and their final accuracies outperform the others.
Table 6: Testing error (%) of CNN-6 on CIFAR-100 with energy-minimization preprocessing.

  Method    | Standard | MHE    | HSMHE
  ----------|----------|--------|------
  OPT (GS)  | 33.02    | 32.99  | 32.78
  OPT (LS)  | 34.48    | 34.43  | 34.37
  OPT (CP)  | 33.53    | 33.50  | 33.42
  Energy    | 3.5109   | 3.5003 | 3.4976
Refining neuron initialization. We also evaluate the two refinement tricks for the neuron initialization. First, we consider hyperspherical energy minimization as a preprocessing step for the neuron weights. We conduct the experiment using CNN-6 on CIFAR-100. Specifically, we run gradient descent for 5k iterations to minimize the hyperspherical energy of the neuron weights before training starts. We also report the hyperspherical energy (before training starts and after the energy-minimization preprocessing) in Table 6. All methods share the same random initialization with the same random seed, so the hyperspherical energy always starts at 3.5109. After the neuron preprocessing, the energy is 3.5003 for the MHE objective and 3.4976 for the half-space MHE objective. More importantly, Table 6 shows that such a refinement can effectively improve the generalization of OPT and further reduce the testing error on CIFAR-100.
Table 7: Testing error (%) of CNN-6 on CIFAR-100 with and without neuron weight normalization.

  Method    | w/o Norm | w/ Norm
  ----------|----------|--------
  Baseline  | 37.59    | -
  OPT (GS)  | 33.02    | 32.54
  OPT (HR)  | 35.67    | 35.30
  OPT (LS)  | 34.48    | 32.11
  OPT (CP)  | 33.53    | 32.49
  OPT (OGD) | 33.37    | 32.70
  OPT (OR)  | 34.70    | 33.27
We then examine neuron weight normalization in OPT. Normalized neurons make sense in OPT because the scale of the randomly initialized weights does not carry any useful property. After randomly initializing the neurons, we directly normalize each weight vector to unit norm; these randomly initialized neurons still possess the important property of achieving minimum hyperspherical energy. Specifically, we use CNN-6 to perform classification on CIFAR-100. The results in Table 7 show that normalizing the neurons can largely boost the performance of OPT.
Method  ResNet-20  ResNet-32 
Baseline  31.11  30.16 
OPT (GS)  30.73  29.56 
OPT (CP)  30.47  29.31 
OPT for ResNet. To show that OPT is agnostic to the CNN architecture, we perform classification experiments on CIFAR-100 with both ResNet-20 and ResNet-32 (He et al., 2016), trained with OPT (GS) and OPT (CP). The results in Table 8 show that OPT achieves consistent improvements over standard training on ResNet.
Method  Top-1 Err.  Top-5 Err. 
Baseline  44.32  21.13 
OPT (CP)  43.67  20.26 
ImageNet. We test OPT on ImageNet-2012. Since OPT consumes more GPU memory in large-scale settings, we use the GPU-memory-efficient OPT (CP) to train a plain 10-layer CNN (the detailed structure is specified in Appendix D) on ImageNet. Note that our purpose here is to validate the superiority of OPT over the corresponding baseline. From Table 9, we can see that OPT (CP) improves both top-1 and top-5 error over the baseline.
Method  5-shot Acc. (%) 
MAML (Finn et al., 2017)  62.71 ± 0.71 
ProtoNet (Snell et al., 2017)  64.24 ± 0.72 
Baseline (Chen et al., 2019)  62.53 ± 0.69 
Baseline w/ OPT  63.27 ± 0.68 
Baseline++ (Chen et al., 2019)  66.43 ± 0.63 
Baseline++ w/ OPT  66.68 ± 0.66 
Few-shot learning. To evaluate the cross-task generalization of OPT, we conduct few-shot learning on Mini-ImageNet, following the same experimental setting as (Chen et al., 2019). More detailed experimental settings are provided in Appendix D. Specifically, we apply OPT (CP) to train both the Baseline and Baseline++ models described in (Chen et al., 2019), and obtain clear improvements in both cases. Therefore, OPT-trained networks generalize well in challenging few-shot scenarios.
7.5 Graph Neural Networks
We also test OPT with graph convolutional networks (GCN) (Kipf & Welling, 2016) for graph node classification. For a fair comparison, we use exactly the same implementation, hyperparameters and experimental setup as (Kipf & Welling, 2016). Training a GCN with OPT is not entirely straightforward, because we first need to identify the neurons to which OPT applies. Specifically, a two-layer GCN uses the following forward model:
$Z=\operatorname{softmax}\big(\hat{A}\,\mathrm{ReLU}(\hat{A}XW^{(0)})\,W^{(1)}\big)$  (16) 
where $\hat{A}=\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$. We note that $A$ is the adjacency matrix of the graph, $\tilde{A}=A+I_N$ ($I_N$ is an identity matrix), and $\tilde{D}_{ii}=\sum_j\tilde{A}_{ij}$. $X\in\mathbb{R}^{N\times C}$ is the feature matrix of the $N$ nodes in the graph (feature dimension is $C$). $W^{(1)}\in\mathbb{R}^{H\times F}$ is the weight matrix of the classifiers. $W^{(0)}\in\mathbb{R}^{C\times H}$ is the weight matrix of the hidden layer, where $H$ is the dimension of the hidden space. We treat each column vector of $W^{(0)}$ as a neuron, so there are $H$ neurons in total. Then we naturally apply OPT to train these neurons of dimension $C$ in GCN.
Method  Cora  Pubmed 
GCN Baseline  81.3  79.0 
OPT (CP)  82.0  79.4 
OPT (OGD)  82.3  79.5 
We conduct experiments on the Cora and Pubmed datasets (Sen et al., 2008). The goal here is to verify the effectiveness of OPT on GCN rather than to achieve state-of-the-art performance on graph node classification. The results in Table 11 show a reasonable improvement achieved by OPT, validating that OPT can universally train different types of neural networks on different data modalities.
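The forward model in Eq. (16) with the OPT parameterization of the hidden-layer neurons can be sketched as follows. This is an illustrative NumPy version, not the implementation of (Kipf & Welling, 2016); all function and variable names are our own.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax_rows(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def normalized_adjacency(A):
    """A_hat = D~^{-1/2} (A + I) D~^{-1/2}, as in Kipf & Welling (2016)."""
    A_tilde = A + np.eye(len(A))
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_opt_forward(A, X, R, W0_fixed, W1):
    """Two-layer GCN forward pass with OPT-parameterized hidden weights:
    the equivalent hidden weight matrix is R @ W0_fixed, where W0_fixed
    (C x H, one neuron per column) is random and frozen, and R (C x C) is
    the learnable layer-shared orthogonal matrix."""
    A_hat = normalized_adjacency(A)
    H = relu(A_hat @ X @ (R @ W0_fixed))
    return softmax_rows(A_hat @ H @ W1)

# Tiny synthetic usage example.
rng = np.random.default_rng(0)
N, C, H_dim, F = 5, 4, 6, 3
A = np.triu((rng.random((N, N)) < 0.4).astype(float), 1)
A = A + A.T                                   # symmetric adjacency, no self-loops
X = rng.normal(size=(N, C))
R, _ = np.linalg.qr(rng.normal(size=(C, C)))  # a random orthogonal R
W0 = rng.normal(size=(C, H_dim))              # fixed neurons
W1 = rng.normal(size=(H_dim, F))
Z = gcn_opt_forward(A, X, R, W0, W1)          # row-stochastic class scores
```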
7.6 Point Cloud Neural Networks
Method  Acc. (%) 
PointNet Baseline  87.1 
OPT (GS)  87.23 
OPT (CP)  87.86 
We further test OPT on PointNet (Qi et al., 2017), a type of neural network that takes raw point clouds as input and classifies them based on semantics. To simplify the comparison and remove all bells and whistles, we use a vanilla PointNet (without T-Net) as our backbone network and apply OPT to train its MLPs. We follow the same experimental settings as (Qi et al., 2017) and evaluate on the ModelNet40 dataset (Wu et al., 2015). The results are given in Table 12. Both OPT variants achieve better accuracy than the PointNet baseline, with OPT (CP) achieving a notable improvement. This is in fact significant because we do not add any parameters to the network. Although our accuracy is not state-of-the-art, it still validates the effectiveness of OPT. Most importantly, the improvement on PointNet further confirms that OPT is a generic and effective training framework for different types of neural networks.
8 Concluding Remarks
This paper proposes OPT, a novel training framework applicable to many types of neural networks. OPT overparameterizes neurons with neuron weights (randomly initialized and fixed) and a layer-shared orthogonal matrix (learnable). OPT provably achieves minimum hyperspherical energy and maintains this energy during training. We give theoretical insights and extensive empirical evidence to validate OPT's superiority.
References
 Allen-Zhu et al. (2018) Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018.
 Annavarapu (2013) Annavarapu, R. N. Singular value decomposition and the centrality of Löwdin orthogonalizations. American Journal of Computational and Applied Mathematics, 3(1):33–35, 2013.
 Arjovsky et al. (2016) Arjovsky, M., Shah, A., and Bengio, Y. Unitary evolution recurrent neural networks. In ICML, 2016.
 Bansal et al. (2018) Bansal, N., Chen, X., and Wang, Z. Can we gain more from orthogonality regularizations in training deep CNNs? In NeurIPS, 2018.
 Bilyk & Lacey (2015) Bilyk, D. and Lacey, M. T. One bit sensing, discrepancy, and Stolarsky principle. arXiv preprint arXiv:1511.08452, 2015.
 Chen et al. (2019) Chen, W.-Y., Liu, Y.-C., Kira, Z., Wang, Y.-C. F., and Huang, J.-B. A closer look at few-shot classification. arXiv preprint arXiv:1904.04232, 2019.
 Ding et al. (2019) Ding, X., Guo, Y., Ding, G., and Han, J. ACNet: Strengthening the kernel skeletons for powerful CNN via asymmetric convolution blocks. In ICCV, 2019.
 Du et al. (2017) Du, S. S., Lee, J. D., Tian, Y., Poczos, B., and Singh, A. Gradient descent learns one-hidden-layer CNN: Don't be afraid of spurious local minima. arXiv preprint arXiv:1712.00779, 2017.
 Duchi et al. (2011) Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 2011.
 Finn et al. (2017) Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
 Glorot & Bengio (2010) Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
 Gunasekar et al. (2017) Gunasekar, S., Woodworth, B. E., Bhojanapalli, S., Neyshabur, B., and Srebro, N. Implicit regularization in matrix factorization. In NeurIPS, 2017.
 Gunasekar et al. (2018) Gunasekar, S., Lee, J. D., Soudry, D., and Srebro, N. Implicit bias of gradient descent on linear convolutional networks. In NeurIPS, 2018.
 Ha et al. (2016) Ha, D., Dai, A., and Le, Q. V. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.
 He et al. (2015) He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016.
 Henaff et al. (2016) Henaff, M., Szlam, A., and LeCun, Y. Recurrent orthogonal networks and long-memory tasks. arXiv preprint arXiv:1602.06662, 2016.
 Hoffmann (1989) Hoffmann, W. Iterative algorithms for Gram-Schmidt orthogonalization. Computing, 41(4):335–348, 1989.
 Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
 Jaderberg et al. (2014) Jaderberg, M., Vedaldi, A., and Zisserman, A. Speeding up convolutional neural networks with low rank expansions. In BMVC, 2014.
 Jia et al. (2016) Jia, X., De Brabandere, B., Tuytelaars, T., and Gool, L. V. Dynamic filter networks. In NeurIPS, 2016.
 Jing et al. (2017) Jing, L., Shen, Y., Dubcek, T., Peurifoy, J., Skirlo, S., LeCun, Y., Tegmark, M., and Soljačić, M. Tunable efficient unitary neural networks (EUNN) and their application to RNNs. In ICML, 2017.
 Kawaguchi (2016) Kawaguchi, K. Deep learning without poor local minima. In NeurIPS, 2016.
 Kawaguchi et al. (2018) Kawaguchi, K., Xie, B., and Song, L. Deep semi-random features for nonlinear function approximation. In AAAI, 2018.
 Keskar & Socher (2017) Keskar, N. S. and Socher, R. Improving generalization performance by switching from Adam to SGD. arXiv preprint arXiv:1712.07628, 2017.
 Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kipf & Welling (2016) Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
 Lee et al. (2016) Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. Gradient descent only converges to minimizers. In COLT, 2016.
 Lezcano-Casado & Martínez-Rubio (2019) Lezcano-Casado, M. and Martínez-Rubio, D. Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group. arXiv preprint arXiv:1901.08428, 2019.
 Li et al. (2020) Li, J., Li, F., and Todorovic, S. Efficient riemannian optimization on the stiefel manifold via the cayley transform. In ICLR, 2020.
 Li et al. (2018) Li, Y., Ma, T., and Zhang, H. Algorithmic regularization in overparameterized matrix sensing and neural networks with quadratic activations. In COLT, 2018.
 Lin et al. (2013) Lin, M., Chen, Q., and Yan, S. Network in network. arXiv preprint arXiv:1312.4400, 2013.
 Liu et al. (2015) Liu, B., Wang, M., Foroosh, H., Tappen, M., and Pensky, M. Sparse convolutional neural networks. In CVPR, 2015.
 Liu et al. (2017) Liu, W., Zhang, Y.M., Li, X., Yu, Z., Dai, B., Zhao, T., and Song, L. Deep hyperspherical learning. In NeurIPS, 2017.
 Liu et al. (2018) Liu, W., Lin, R., Liu, Z., Liu, L., Yu, Z., Dai, B., and Song, L. Learning towards minimum hyperspherical energy. In NeurIPS, 2018.
 Liu et al. (2019) Liu, W., Liu, Z., Rehg, J. M., and Song, L. Neural similarity learning. In NeurIPS, 2019.
 Mallya et al. (2018) Mallya, A., Davis, D., and Lazebnik, S. Piggyback: Adapting a single network to multiple tasks by learning to mask weights. In ECCV, 2018.
 Nesterov (1983) Nesterov, Y. E. A method for solving the convex programming problem with convergence rate O(1/k^2). In Dokl. Akad. Nauk SSSR, 1983.
 Peel et al. (2010) Peel, T., Anthoine, S., and Ralaivola, L. Empirical bernstein inequalities for ustatistics. In NeurIPS, 2010.
 Qi et al. (2017) Qi, C. R., Su, H., Mo, K., and Guibas, L. J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, 2017.
 Rahimi & Recht (2008) Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In NeurIPS, 2008.
 Reddi et al. (2019) Reddi, S. J., Kale, S., and Kumar, S. On the convergence of adam and beyond. arXiv preprint arXiv:1904.09237, 2019.
 Sen et al. (2008) Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., and Eliassi-Rad, T. Collective classification in network data. AI Magazine, 2008.
 Snell et al. (2017) Snell, J., Swersky, K., and Zemel, R. Prototypical networks for fewshot learning. In NeurIPS, 2017.
 Soudry & Carmon (2016) Soudry, D. and Carmon, Y. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.
 Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 2014.
 Srivastava (2000) Srivastava, V. A unified view of the orthogonalization methods. Journal of Physics A: Mathematical and General, 33(35):6219, 2000.

 Tieleman & Hinton (2012) Tieleman, T. and Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.
 Wang et al. (2017) Wang, M., Liu, B., and Foroosh, H. Factorized convolutional neural networks. In ICCV Workshops, 2017.
 Wen & Yin (2013) Wen, Z. and Yin, W. A feasible method for optimization with orthogonality constraints. Mathematical Programming, 2013.
 Wisdom et al. (2016) Wisdom, S., Powers, T., Hershey, J., Le Roux, J., and Atlas, L. Full-capacity unitary recurrent neural networks. In NeurIPS, 2016.
 Wu et al. (2015) Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., and Xiao, J. 3d shapenets: A deep representation for volumetric shapes. In CVPR, 2015.
 Xie et al. (2016) Xie, B., Liang, Y., and Song, L. Diverse neural network learns true target functions. arXiv preprint arXiv:1611.03131, 2016.
 Yang et al. (2015) Yang, Z., Moczulski, M., Denil, M., de Freitas, N., Smola, A., Song, L., and Wang, Z. Deep fried convnets. In ICCV, 2015.
 Zeiler (2012) Zeiler, M. D. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Appendix A Details of Unrolled Orthogonalization Algorithms
A.1 Gram-Schmidt Process
Gram-Schmidt Process. The GS process is a method for orthonormalizing a set of vectors in an inner product space, i.e., the Euclidean space equipped with the standard inner product. Specifically, the GS process performs the following operations to orthogonalize a set of vectors $\{v_1,\dots,v_n\}$:
Step 1: $u_1=v_1$, $e_1=\frac{u_1}{\|u_1\|}$  (17)  
Step 2: $u_2=v_2-\operatorname{proj}_{u_1}(v_2)$, $e_2=\frac{u_2}{\|u_2\|}$  
Step 3: $u_3=v_3-\operatorname{proj}_{u_1}(v_3)-\operatorname{proj}_{u_2}(v_3)$, $e_3=\frac{u_3}{\|u_3\|}$  
Step 4: $u_4=v_4-\sum_{j=1}^{3}\operatorname{proj}_{u_j}(v_4)$, $e_4=\frac{u_4}{\|u_4\|}$  
Step n: $u_n=v_n-\sum_{j=1}^{n-1}\operatorname{proj}_{u_j}(v_n)$, $e_n=\frac{u_n}{\|u_n\|}$ 
where $\operatorname{proj}_{u}(v)=\frac{\langle v,u\rangle}{\langle u,u\rangle}u$ denotes the projection of the vector $v$ onto the vector $u$. The set $\{e_1,\dots,e_n\}$ denotes the output orthonormal set. The algorithm flowchart can be described as follows:
The vectors $u_k$ in the algorithm above are used to compute the QR factorization; they are not needed for orthogonalization and therefore do not need to be stored. When the GS process is implemented on a finite-precision computer, the vectors are often not quite orthogonal because of rounding errors. Besides the standard GS process, there is a modified Gram-Schmidt (MGS) algorithm which enjoys better numerical stability. This approach gives the same result as the original formula in exact arithmetic but introduces smaller errors in finite-precision arithmetic. Specifically, GS computes the following formula:
$u_k=v_k-\sum_{j=1}^{k-1}\operatorname{proj}_{u_j}(v_k)$  (18)  
Instead of computing the vector $u_k$ as in Eq. (18), MGS computes the orthogonal basis differently. MGS does not subtract all projections of the original vector $v_k$ at once; instead, it successively removes the projection onto each previously constructed orthogonal basis vector. Specifically, MGS computes the following series of formulas:
$u_k^{(1)}=v_k-\operatorname{proj}_{u_1}(v_k),\quad u_k^{(2)}=u_k^{(1)}-\operatorname{proj}_{u_2}(u_k^{(1)}),\quad\dots,\quad u_k=u_k^{(k-1)}=u_k^{(k-2)}-\operatorname{proj}_{u_{k-1}}(u_k^{(k-2)})$  (19)  
where each step finds a vector $u_k^{(i)}$ that is orthogonal to $u_i$. Therefore, $u_k^{(i)}$ is also orthogonalized against any errors introduced in the computation of $u_k^{(i-1)}$. In practice, although MGS enjoys better numerical stability, we find the empirical performance of GS and MGS to be almost the same in OPT. However, MGS takes longer to compute since each orthogonal basis vector is constructed iteratively. Therefore, we usually stick to classic GS for OPT.
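The classical and modified GS processes above can be sketched as follows. This is an illustrative NumPy version rather than the unrolled in-network implementation used by OPT (GS); both functions use only matrix products and norms, so they are differentiable and could be unrolled inside a network.

```python
import numpy as np

def gram_schmidt(V):
    """Classical GS (Eq. 18): orthonormalize the columns of V by subtracting
    the projections of the ORIGINAL column v_k onto all previous e_j."""
    E = np.zeros_like(V, dtype=float)
    for k in range(V.shape[1]):
        u = V[:, k].astype(float)
        for j in range(k):
            u = u - (E[:, j] @ V[:, k]) * E[:, j]
        E[:, k] = u / np.linalg.norm(u)
    return E

def modified_gram_schmidt(V):
    """MGS (Eq. 19): subtract the projection of the RUNNING residual instead,
    which damps accumulated rounding errors."""
    E = np.zeros_like(V, dtype=float)
    for k in range(V.shape[1]):
        u = V[:, k].astype(float)
        for j in range(k):
            u = u - (E[:, j] @ u) * E[:, j]
        E[:, k] = u / np.linalg.norm(u)
    return E

rng = np.random.default_rng(0)
V = rng.normal(size=(8, 8))
E_gs = gram_schmidt(V)
E_mgs = modified_gram_schmidt(V)
```

In exact arithmetic the two outputs coincide; for well-conditioned inputs like this one they agree to machine precision.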
Iterative Gram-Schmidt Process. The iterative Gram-Schmidt (IGS) process is an iterative version of the GS process. It is shown in (Hoffmann, 1989) that the GS process can be carried out iteratively to obtain a basis matrix that is orthogonal to almost full working precision. The IGS algorithm is given as follows:
The auxiliary vectors in the algorithm above are used to compute the QR factorization; they are not needed for orthogonalization and therefore do not need to be computed explicitly. The while loop in IGS is an iterative procedure; in practice, we unroll a fixed number of its steps in order to improve orthogonality. The iterate obtained in each step corresponds to the solution of an associated linear system, and the IGS process corresponds to the Gauss-Jacobi iteration for solving this system.
Both GS and IGS are easy to embed in neural networks, since they are differentiable. In our experiments, we find that the performance gain from unrolling multiple IGS steps over plain GS is not very obvious (partially because GS already achieves nearly perfect orthogonality), while IGS costs more training time. Therefore, we unroll the classic GS process by default.
A.2 Householder Reflection
Let $v$ be a nonzero vector. A matrix of the form
$H=I-\frac{2}{v^{\top}v}vv^{\top}$  (20) 
is a Householder reflection. The vector $v$ is the Householder vector. If a vector $x$ is multiplied by the matrix $H$, then it is reflected in the hyperplane $\operatorname{span}\{v\}^{\perp}$. Householder matrices are symmetric and orthogonal.
For a vector $x$, we let $v=x-\alpha e_1$ where $\alpha=\pm\|x\|$ and $e_1$ is the first standard basis vector (the first element is $1$ and the remaining elements are $0$). Then we construct the Householder reflection matrix $H$ with $v$ and multiply it to $x$:
$Hx=\alpha e_1$  (21) 
which indicates that we can make any nonzero vector $x$ become $\alpha e_1$, where $\alpha$ is some constant, by using a Householder reflection. By left-multiplying a reflection we can turn a dense vector into a vector of the same length with only a single nonzero entry. Repeating this $n$ times gives us the Householder QR factorization, which also orthogonalizes the original input matrix. Householder reflection orthogonalizes a matrix $A$ by triangularizing it:
$H_n\cdots H_2H_1A=R$  (22) 
where $R$ is an upper-triangular matrix in the QR factorization. The orthogonal factor is constructed by $Q=H_1H_2\cdots H_n$, where $H_i$ is the Householder reflection performed in the $i$-th step. The algorithm flowchart is given as follows:
The algorithm follows the Matlab notation, where $A(i{:}j,k{:}l)$ denotes the submatrix of $A$ from the $k$-th to the $l$-th column and from the $i$-th to the $j$-th row. Note that there are a number of variants of Householder orthogonalization, such as the implicit variant where we do not store each reflection explicitly. Here $Q$ is the final orthogonal matrix we need.
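The Householder QR procedure above can be sketched as follows. This is an illustrative NumPy version that stores each reflection explicitly (the implicit variant is omitted); the sign choice for $\alpha$ avoids cancellation.

```python
import numpy as np

def householder_qr(A):
    """QR via Householder reflections: H_n ... H_1 A = R, Q = H_1 ... H_n."""
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for k in range(min(m, n)):
        x = R[k:, k]
        # alpha = -sign(x_1) * ||x|| avoids catastrophic cancellation in v
        alpha = -np.linalg.norm(x) if x[0] >= 0 else np.linalg.norm(x)
        v = x.copy()
        v[0] -= alpha                      # v = x - alpha * e_1
        vnorm = np.linalg.norm(v)
        if vnorm < 1e-12:                  # column already triangularized
            continue
        v /= vnorm
        Hk = np.eye(m)
        Hk[k:, k:] -= 2.0 * np.outer(v, v)  # reflection acting on rows k..m
        R = Hk @ R                          # zero out below-diagonal entries
        Q = Q @ Hk                          # accumulate Q = H_1 H_2 ... H_n
    return Q, R

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
Q, R = householder_qr(A)
```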
A.3 Löwdin’s Symmetric Orthogonalization
Let $\{v_1,\dots,v_n\}$ be a set of linearly independent vectors in an $n$-dimensional space, collected as the columns of a matrix $M$. We define a general nonsingular linear transformation $A$ that transforms the basis $M$ to a new basis $\bar{M}$:
$\bar{M}=MA$  (23) 
where the basis $\bar{M}$ will be orthonormal if (the transpose becomes the conjugate transpose in complex space)
$A^{\top}SA=I$  (24) 
where $S=M^{\top}M$ is the Gram matrix of the given basis $M$.
A general solution to this orthogonalization problem can be obtained via the substitution:
$A=S^{-\frac{1}{2}}B$  (25) 
in which $B$ is an arbitrary orthogonal (or unitary) matrix. When $B=I$, we have the symmetric orthogonalization, namely
$A=S^{-\frac{1}{2}}$  (26) 
When $B=U$, in which $U$ diagonalizes $S$, we have the canonical orthogonalization, namely
$A=S^{-\frac{1}{2}}U$  (27) 
Because $U$ diagonalizes $S$, we have that $S=U\Lambda U^{\top}$ and hence $S^{-\frac{1}{2}}=U\Lambda^{-\frac{1}{2}}U^{\top}$. Therefore, the transformation becomes $A=S^{-\frac{1}{2}}U=U\Lambda^{-\frac{1}{2}}$. This is essentially an eigenvalue decomposition of the symmetric matrix $S$. In order to compute Löwdin’s symmetric orthogonalized basis set, we can use the singular value decomposition (SVD). Specifically, the SVD of the original basis set is given by
$M=U\Sigma V^{\top}$  (28) 
where both $U$ and $V$ are orthogonal matrices and $\Sigma$ is the diagonal matrix of singular values. Therefore, we have that
$MS^{-\frac{1}{2}}=U\Sigma V^{\top}\big(V\Sigma^{-1}V^{\top}\big)=UV^{\top}$  (29)  
where we use $S=M^{\top}M=V\Sigma^{2}V^{\top}$ (and hence $S^{-\frac{1}{2}}=V\Sigma^{-1}V^{\top}$) due to the connection between the eigenvalue decomposition and the SVD. Therefore, we end up with
$\bar{M}=MS^{-\frac{1}{2}}=UV^{\top}$  (30) 
which is the output orthogonal matrix of Löwdin’s symmetric orthogonalization.
An interesting feature of the symmetric orthogonalization is that it ensures
$\sum_i\|\bar{v}_i-v_i\|^2=\min_{\{w_i\}\in\mathcal{O}}\sum_i\|w_i-v_i\|^2$  (31) 
where $v_i$ and $\bar{v}_i$ are the $i$-th column vectors of $M$ and $\bar{M}$, respectively, and $\mathcal{O}$ denotes the set of all possible orthonormal sets in the range of $M$. This means that the symmetrically orthogonalized vectors $\bar{v}_i$ are the least distant in the Hilbert space from the original vectors $v_i$. Therefore, symmetric orthogonalization performs the gentlest possible pushing of the directions of the vectors in order to make them orthogonal.
Appendix B Proof of Theorem 1
To be more specific, neurons whose elements are each initialized from a zero-mean Gaussian distribution are, after normalization, uniformly distributed on a hypersphere. We formalize this argument in the following theorem.
Theorem 2.
The normalized vector of Gaussian variables is uniformly distributed on the sphere. Formally, let $x_1,\dots,x_n\sim\mathcal{N}(0,1)$ be independent. Then the vector
$g=\frac{1}{Z}(x_1,\dots,x_n)^{\top}$  (32) 
follows the uniform distribution on the unit hypersphere $\mathbb{S}^{n-1}$, where $Z=\sqrt{x_1^2+\cdots+x_n^2}$ is a normalization factor.
Proof.
A random variable $X$ has distribution $\mathcal{N}(0,1)$ if it has the density function
$f(x)=\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}$  (33) 
An $n$-dimensional random vector $x$ has distribution $\mathcal{N}(0,I_n)$ if its components are independent and each has distribution $\mathcal{N}(0,1)$. Then the density of $x$ is given by
$f(x)=\frac{1}{(2\pi)^{n/2}}e^{-\frac{\|x\|^2}{2}}$  (34) 
Then we introduce the following lemma (Lemma 2) about the orthogonal invariance of the normal distribution.
Lemma 2.
Let $x$ be an $n$-dimensional random vector with distribution $\mathcal{N}(0,I_n)$ and $U$ be an orthogonal matrix ($UU^{\top}=I$). Then $Ux$ also has distribution $\mathcal{N}(0,I_n)$.
Proof.
For any measurable set $A$, we have that
$\mathbb{P}(Ux\in A)=\int_{U^{-1}A}f(x)\,dx=\int_{A}f(U^{-1}y)\,|\det(U^{-1})|\,dy=\int_{A}f(y)\,dy=\mathbb{P}(x\in A)$  (35)  
since $f$ depends on $x$ only through $\|x\|$, $\|U^{-1}y\|=\|y\|$ for orthogonal $U$, and $|\det(U^{-1})|=1$.
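Theorem 2 and Lemma 2 can also be checked numerically. The sketch below (not part of the proof; sample sizes and tolerances are arbitrary choices) verifies the orthogonal invariance of normalized Gaussian samples and their second moments, which equal $1/n$ per coordinate for the uniform distribution on $\mathbb{S}^{n-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
x = rng.normal(size=(100_000, n))
g = x / np.linalg.norm(x, axis=1, keepdims=True)  # normalized Gaussian samples

# Rotate the samples by a random orthogonal matrix (Lemma 2 says the
# distribution of the rotated samples is unchanged).
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
g_rot = g @ Q.T

# Uniformity on the sphere implies E[g_i] = 0 and E[g_i^2] = 1/n for each
# coordinate, before and after rotation.
assert abs(g[:, 0].mean()) < 0.01
assert abs((g[:, 0] ** 2).mean() - 1.0 / n) < 0.01
assert abs((g_rot[:, 0] ** 2).mean() - 1.0 / n) < 0.01
```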