Orthogonal Over-Parameterized Training

04/09/2020 · by Weiyang Liu, et al.

The inductive bias of a neural network is largely determined by the architecture and the training algorithm. To achieve good generalization, how to effectively train a neural network is even more important than designing the architecture. We propose a novel orthogonal over-parameterized training (OPT) framework that can provably minimize the hyperspherical energy which characterizes the diversity of neurons on a hypersphere. By constantly maintaining the minimum hyperspherical energy during training, OPT can greatly improve the network generalization. Specifically, OPT fixes the randomly initialized weights of the neurons and learns an orthogonal transformation that applies to these neurons. We propose multiple ways to learn such an orthogonal transformation, including unrolling orthogonalization algorithms, applying orthogonal parameterization, and designing orthogonality-preserving gradient update. Interestingly, OPT reveals that learning a proper coordinate system for neurons is crucial to generalization and may be more important than learning a specific relative position of neurons. We further provide theoretical insights of why OPT yields better generalization. Extensive experiments validate the superiority of OPT.


1 Introduction

The inductive bias encoded in a neural network is generally determined by two major aspects: how the neural network is structured (i.e., network architecture) and how the neural network is optimized (i.e., training algorithm). For the same network architecture, using different training algorithms could lead to a dramatic difference in generalization performance (Keskar & Socher, 2017; Reddi et al., 2019) even if the training loss is already close to zero, implying that different training procedures lead to different inductive biases. Therefore, how to effectively train a neural network that generalizes well remains an open challenge.

Figure 1: An overview of orthogonal over-parameterized training.

Recent theories (Gunasekar et al., 2017, 2018; Kawaguchi, 2016; Li et al., 2018) suggest the importance of over-parameterization in linear neural networks. For example, (Gunasekar et al., 2017) shows that optimizing an underdetermined quadratic objective over a matrix $\mathbf{X}$ with gradient descent on a factorization of $\mathbf{X}$ leads to an implicit regularization that may improve generalization. There is also empirical evidence (Ding et al., 2019; Liu et al., 2019) showing that over-parameterizing the convolutional filters under some regularity is beneficial to generalization. Our paper aims to leverage the power of over-parameterization and explore more intrinsic structural priors for training a well-performing neural network.

Motivated by this goal, we propose a generic orthogonal over-parameterized training (OPT) framework for effectively training neural networks. Different from existing neural network training, OPT over-parameterizes a neuron $\mathbf{w}\in\mathbb{R}^d$ as the multiplication of a learnable layer-shared orthogonal matrix $\mathbf{R}\in\mathbb{R}^{d\times d}$ and a fixed randomly initialized weight vector $\mathbf{v}\in\mathbb{R}^d$. Therefore, the equivalent weight for the neuron is $\mathbf{w}=\mathbf{R}\mathbf{v}$. Once each element of the neuron weight $\mathbf{v}$ has been randomly initialized following a zero-mean Gaussian distribution (He et al., 2015; Glorot & Bengio, 2010), we fix it throughout the entire training process. Then OPT learns a layer-shared orthogonal transformation that is applied to all the neurons in the same layer. An illustration of OPT is given in Fig. 1. In contrast to standard neural training, OPT decomposes the neuron into two components: an orthogonal transformation that learns a proper coordinate system, and a weight vector that controls the position of the neuron. Essentially, the weight vectors of different neurons in the same layer determine the relative positions of these neurons, while the layer-shared orthogonal matrix specifies the coordinate system for these neurons.

Another motivation of OPT comes from the observation (Liu et al., 2018) that neural networks with lower hyperspherical energy generalize better. Hyperspherical energy quantifies the diversity of neurons on a hypersphere, and essentially characterizes the relative positions of neurons via this form of diversity. (Liu et al., 2018) introduces hyperspherical energy as a regularization term in the network, but it does not guarantee that the hyperspherical energy is effectively minimized (due to the competing data-fitting loss). To address this issue, we leverage the property that hyperspherical energy is independent of the coordinate system in which the neurons live and only depends on their relative positions. Specifically, we prove that randomly initializing the neuron weights with certain distributions leads to minimum hyperspherical energy in a probabilistic sense. It follows that OPT maintains the minimum energy throughout training by learning only a coordinate system (i.e., a layer-shared orthogonal matrix) for the neurons. Therefore, OPT can guarantee that the hyperspherical energy is well minimized.

We consider several ways to learn the orthogonal transformation. The first is to unroll classic orthogonalization algorithms such as the Gram-Schmidt process, Householder reflection, and Löwdin's symmetric orthogonalization. Different unrolled algorithms yield different implicit regularizations on the constructed neuron weights. For example, symmetric orthogonalization guarantees that the new orthogonal basis has the least distance in the Hilbert space from the original non-orthogonal basis. Second, we consider using a special parameterization (such as the Cayley parameterization) to construct the orthogonal matrix, which is more efficient in training. Third, we use an orthogonality-preserving gradient descent to ensure that the matrix $\mathbf{R}$ remains orthogonal after each gradient update. Last, we propose a relaxation of the optimization problem that turns the orthogonality constraint into a regularization on the coordinate system $\mathbf{R}$. These different ways of learning the orthogonal transformation for neurons encode different inductive biases into the neural network.

Moreover, we propose a refinement strategy to further reduce the hyperspherical energy of the randomly initialized neuron weights $\mathbf{v}$. We directly minimize the hyperspherical energy of these randomly initialized neuron weights as a preprocessing step before training on the actual data. Finally, we provide theoretical justifications for why OPT yields better generalization than standard training.

We summarize the advantages of OPT as follows:


  • OPT is a universal neural network training framework with strong flexibility. There are several ways of learning the coordinate system (i.e., orthogonal transformation) and each one may impose a different inductive bias.

  • OPT is the first training method that can provably achieve minimum hyperspherical energy, leading to better generalization. More interestingly, OPT reveals that learning a proper coordinate system is crucial to the generalization, while the relative positions of neurons can be well characterized by hyperspherical energy.

  • There is no extra computational cost for OPT-trained neural networks in inference. It has the same inference speed and model size as its standard counterpart.

  • OPT is shown to be useful in multi-layer perceptrons (MLPs), convolutional neural networks (CNNs), point cloud networks (PointNet) (Qi et al., 2017), and graph convolutional networks (GCNs) (Kipf & Welling, 2016).

2 Related Work

Optimization for Deep Learning. A number of first-order optimization algorithms (Nesterov, 1983; Duchi et al., 2011; Kingma & Ba, 2014; Tieleman & Hinton, 2012; Zeiler, 2012; Reddi et al., 2019) have been proposed to improve the empirical convergence and generalization of deep neural networks. Our work is complementary to these optimization algorithms, since they can be easily applied within our framework.

Parameterization of Neurons. There are various ways to parameterize a neuron for different applications. (Ding et al., 2019) over-parameterizes a 2D convolution kernel by combining a 2D kernel of the same size with two additional 1D asymmetric kernels. The resulting convolution kernel has the same number of effective parameters at inference time but more parameters during training due to the additional asymmetric kernels. (Liu et al., 2019) constructs a neuron with a bilinear parameterization and regularizes the bilinear similarity matrix. (Yang et al., 2015) reparameterizes the neuron matrix with an adaptive fastfood transform to compress model parameters. (Jaderberg et al., 2014; Liu et al., 2015; Wang et al., 2017) employ sparse and low-rank structures to construct convolution kernels for efficient neural networks.

Hyperspherical learning. (Liu et al., 2017) proposes a neural network architecture that learns representations on a hypersphere and shows that the angular information in neural networks, in contrast to magnitude information, preserves the most semantic content. (Liu et al., 2018) defines hyperspherical energy to quantify the diversity of neurons on a hypersphere and empirically shows that minimizing hyperspherical energy improves generalization.

3 Orthogonal Over-Parameterized Training

3.1 General Framework

OPT parameterizes a neuron as the multiplication of an orthogonal matrix $\mathbf{R}\in\mathbb{R}^{d\times d}$ and a neuron weight vector $\mathbf{v}\in\mathbb{R}^d$, so the equivalent neuron weight becomes $\mathbf{w}=\mathbf{R}\mathbf{v}$. The output of this neuron can be represented by $y=(\mathbf{R}\mathbf{v})^\top\mathbf{x}$, where $\mathbf{x}\in\mathbb{R}^d$ is the input vector. In the OPT framework, we fix the randomly initialized neuron weight $\mathbf{v}$ and only learn the orthogonal matrix $\mathbf{R}$. In contrast, the standard neuron is directly formulated as $y=\mathbf{w}^\top\mathbf{x}$, where the weight vector $\mathbf{w}$ is learned during training.

Without loss of generality, we consider a two-layer linear MLP with a loss function $\mathcal{L}(\cdot,\cdot)$ (e.g., the least square loss $\mathcal{L}(a,b)=\|a-b\|^2$). Specifically, the learning objectives of standard training and OPT are

Standard: $\ \min_{\{\mathbf{w}_i\},\,\mathbf{u}}\ \sum_{j=1}^{m}\mathcal{L}\Big(y_j,\ \sum_{i=1}^{n}u_i\,\mathbf{w}_i^\top\mathbf{x}_j\Big)$
OPT: $\ \min_{\mathbf{R},\,\mathbf{u}}\ \sum_{j=1}^{m}\mathcal{L}\Big(y_j,\ \sum_{i=1}^{n}u_i\,(\mathbf{R}\mathbf{v}_i)^\top\mathbf{x}_j\Big)\quad\text{s.t.}\ \mathbf{R}^\top\mathbf{R}=\mathbf{R}\mathbf{R}^\top=\mathbf{I}$ (1)

where $\{(\mathbf{x}_j,y_j)\}_{j=1}^{m}$ is the training set, $\mathbf{w}_i$ (or $\mathbf{R}\mathbf{v}_i$ in OPT) is the $i$-th neuron in the first layer, and $\mathbf{u}=\{u_1,\dots,u_n\}$ is the output neuron in the second layer. In OPT, each element of $\mathbf{v}_i$ is usually sampled from a zero-mean Gaussian distribution, and $\mathbf{v}_i$ is fixed throughout the entire training process. In general, OPT learns an orthogonal matrix that is applied to all the neurons instead of learning the individual neuron weights. Note that we usually do not apply OPT to neurons in the output layer (e.g., $\mathbf{u}$ in this MLP example, and the final linear classifiers in CNNs), since it makes little sense to fix a set of random linear classifiers. Therefore, the central problem is how to learn these layer-shared orthogonal matrices.
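To make the objective in Eq. (1) concrete, below is a minimal PyTorch sketch of an OPT-style linear layer (not the authors' released code; the class and variable names are hypothetical, and orthogonality of R must be enforced separately by one of the methods in Sections 3.3-3.6).

```python
import torch
import torch.nn as nn

class OPTLinear(nn.Module):
    """Sketch of an OPT layer: fixed random neurons V, learnable layer-shared R."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        # Fixed neuron weights v_i (rows), zero-mean Gaussian initialized, never updated.
        self.register_buffer("V", torch.randn(out_dim, in_dim) * (2.0 / in_dim) ** 0.5)
        # Layer-shared transformation R; must be kept orthogonal during training.
        self.R = nn.Parameter(torch.eye(in_dim))

    def forward(self, x):
        W = self.V @ self.R.t()   # equivalent neuron weights w_i = R v_i (as rows)
        return x @ W.t()
```

In standard training one would learn V directly; here V stays fixed and only R (together with the final classifier layer, which is trained as usual) is learned.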

3.2 Hyperspherical Energy Perspective

We take a closer look at OPT from the hyperspherical energy perspective. Following (Liu et al., 2018), the hyperspherical energy of $n$ neurons $\{\mathbf{w}_1,\dots,\mathbf{w}_n\}\subset\mathbb{R}^d$ is defined as

$$E_s(\hat{\mathbf{w}}_1,\dots,\hat{\mathbf{w}}_n)=\sum_{i=1}^{n}\sum_{j=1,\,j\neq i}^{n}\big\|\hat{\mathbf{w}}_i-\hat{\mathbf{w}}_j\big\|^{-s},\quad s>0,\qquad(2)$$

in which $\hat{\mathbf{w}}_i=\frac{\mathbf{w}_i}{\|\mathbf{w}_i\|}$ is the $i$-th neuron weight projected onto the unit hypersphere $\mathbb{S}^{d-1}=\{\mathbf{w}\in\mathbb{R}^d:\|\mathbf{w}\|=1\}$. Hyperspherical energy is used to characterize the diversity of neurons on a unit hypersphere. Assume that we have $n$ neurons $\{\mathbf{v}_1,\dots,\mathbf{v}_n\}$ in one layer, and we have learned an orthogonal matrix $\mathbf{R}$ for these neurons. The hyperspherical energy of these OPT-trained neurons is given by

$$E_s\big(\widehat{\mathbf{R}\mathbf{v}_1},\dots,\widehat{\mathbf{R}\mathbf{v}_n}\big)=\sum_{i\neq j}\big\|\mathbf{R}\hat{\mathbf{v}}_i-\mathbf{R}\hat{\mathbf{v}}_j\big\|^{-s}=\sum_{i\neq j}\big\|\hat{\mathbf{v}}_i-\hat{\mathbf{v}}_j\big\|^{-s}=E_s(\hat{\mathbf{v}}_1,\dots,\hat{\mathbf{v}}_n),\qquad(3)$$

since the orthogonal matrix $\mathbf{R}$ preserves norms and pairwise distances. This shows that OPT will not change the hyperspherical energy of each layer during training.
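The invariance in Eq. (3) is easy to check numerically. The sketch below (an illustration, not code from the paper) computes the energy with s = 1 before and after applying a random orthogonal matrix.

```python
import torch

def hyperspherical_energy(W, s=1.0, eps=1e-12):
    """E_s of the neurons stored as rows of W, after projecting them onto the sphere."""
    W_hat = W / W.norm(dim=1, keepdim=True).clamp_min(eps)
    dist = torch.cdist(W_hat, W_hat)                 # pairwise Euclidean distances
    off_diag = ~torch.eye(len(W), dtype=torch.bool)  # exclude the i == j terms
    return (dist[off_diag].clamp_min(eps) ** (-s)).sum()

V = torch.randn(64, 32)
R, _ = torch.linalg.qr(torch.randn(32, 32))          # a random orthogonal matrix
# Equal up to numerical precision, as predicted by Eq. (3):
print(hyperspherical_energy(V).item(), hyperspherical_energy(V @ R.t()).item())
```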

Moreover, (Liu et al., 2018) proves that the minimum hyperspherical energy corresponds to the uniform distribution over the hypersphere. As a result, if the initialization of the neurons in the same layer follows the uniform distribution over the hypersphere, then we can guarantee that the hyperspherical energy is minimal in a probabilistic sense.

Theorem 1.

For a neuron $\mathbf{v}=(v_1,\dots,v_d)\in\mathbb{R}^d$ whose elements $v_1,\dots,v_d$ are initialized i.i.d. following a zero-mean Gaussian distribution (i.e., $v_i\sim\mathcal{N}(0,\sigma^2)$), its projection $\hat{\mathbf{v}}=\frac{\mathbf{v}}{\|\mathbf{v}\|}$ onto the unit hypersphere is guaranteed to follow the uniform distribution on $\mathbb{S}^{d-1}$.

Theorem 1 implies that, if we initialize the neurons in the same layer with a zero-mean Gaussian distribution, then the corresponding hyperspherical energy is guaranteed to be small, because the neurons will be uniformly distributed on the unit hypersphere and hyperspherical energy quantifies exactly this uniformity. More importantly, the prevailing neuron initializations such as Xavier (Glorot & Bengio, 2010) and Kaiming (He et al., 2015) are essentially zero-mean Gaussian distributions. Therefore, in practice our neurons naturally have very low hyperspherical energy from the very beginning.
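A quick numerical illustration of Theorem 1 (a sketch for intuition only): normalized Gaussian vectors show the symmetry expected of the uniform distribution on the sphere.

```python
import torch

d, n = 8, 200_000
v = torch.randn(n, d)                      # elements drawn i.i.d. from N(0, 1)
v_hat = v / v.norm(dim=1, keepdim=True)    # projection onto the unit hypersphere

# Consistent with uniformity on S^{d-1}: zero mean direction and equal second moments.
print(v_hat.mean(dim=0))                   # each coordinate close to 0
print((v_hat ** 2).mean(dim=0))            # each coordinate close to 1/d = 0.125
```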

3.3 Unrolling Orthogonalization Algorithms

Figure 2: Unrolled orthogonalization.

In order to learn the orthogonal transformation, we propose to unroll classic orthogonalization algorithms in numerical linear algebra and embed them into the neural network such that the training is still end-to-end. We need to make sure every step of the orthogonalization algorithm is differentiable, and the training flow is shown in Fig. 2.

Gram-Schmidt Process. This method takes a linearly independent set and produces an orthogonal set spanning the same space. The Gram-Schmidt process (GS) orthogonalizes a set of vectors $\{\mathbf{m}_1,\dots,\mathbf{m}_n\}$ with the following steps:

$$\mathbf{u}_1=\mathbf{m}_1,\qquad \mathbf{u}_k=\mathbf{m}_k-\sum_{j=1}^{k-1}\operatorname{proj}_{\mathbf{u}_j}(\mathbf{m}_k),\quad k=2,\dots,n,\qquad(4)$$

where the projection operator is given by $\operatorname{proj}_{\mathbf{u}}(\mathbf{m})=\frac{\langle\mathbf{m},\mathbf{u}\rangle}{\langle\mathbf{u},\mathbf{u}\rangle}\mathbf{u}$, and $\{\mathbf{u}_1,\dots,\mathbf{u}_n\}$ (after normalization) is the obtained orthogonal set, whose elements form the columns of $\mathbf{R}$. In practice, we can use modified GS for better numerical stability. To achieve better orthogonality, we can also unroll an iterative GS (Hoffmann, 1989) with multiple iterative steps.
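The following is a minimal differentiable sketch of the classic GS step in Eq. (4) (assuming the columns are linearly independent; the unrolled version used in OPT may add further numerical safeguards).

```python
import torch

def gram_schmidt(M, eps=1e-8):
    """Orthonormalize the columns of M with the classic Gram-Schmidt process.

    Every operation is differentiable, so the routine can be unrolled inside a
    network and trained end-to-end as in Fig. 2.
    """
    basis = []
    for k in range(M.shape[1]):
        v = M[:, k]
        # Subtract the projections of the original column onto all previous bases.
        u = v - sum(((e @ v) * e for e in basis), torch.zeros_like(v))
        basis.append(u / u.norm().clamp_min(eps))
    return torch.stack(basis, dim=1)          # columns form the orthogonal matrix R

R = gram_schmidt(torch.randn(16, 16))
print(torch.dist(R.t() @ R, torch.eye(16)))   # close to 0
```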

Householder Reflection. As one of the classic transformations used in QR factorization, Householder reflection (HR) can also compute an orthogonal set from a group of vectors. A Householder reflector is defined as $\mathbf{H}=\mathbf{I}-2\mathbf{u}\mathbf{u}^\top$, where the unit vector $\mathbf{u}$ is perpendicular to the reflection hyperplane. In QR factorization, HR is used to transform a (non-singular) square matrix into the product of an orthogonal matrix and an upper triangular matrix. Given a matrix $\mathbf{M}=[\mathbf{m}_1,\dots,\mathbf{m}_n]$, we consider the first column vector $\mathbf{m}_1$. We use a Householder reflector to transform $\mathbf{m}_1$ to $\|\mathbf{m}_1\|\mathbf{e}_1$ with $\mathbf{e}_1=(1,0,\dots,0)^\top$. Specifically, we construct $\mathbf{H}_1$ as

$$\mathbf{H}_1=\mathbf{I}-2\mathbf{u}_1\mathbf{u}_1^\top,\qquad \mathbf{u}_1=\frac{\mathbf{m}_1-\|\mathbf{m}_1\|\mathbf{e}_1}{\big\|\mathbf{m}_1-\|\mathbf{m}_1\|\mathbf{e}_1\big\|},\qquad(5)$$

which is an orthogonal matrix, and the first column of $\mathbf{H}_1\mathbf{M}$ becomes $\|\mathbf{m}_1\|\mathbf{e}_1$. At the $i$-th step, we view the trailing sub-matrix as a new $\mathbf{M}$ and use the same procedure to construct the Householder transformation $\mathbf{H}_i$ (embedded in the lower-right block of an identity matrix). We construct the final Householder transformation as $\mathbf{H}=\mathbf{H}_{n-1}\cdots\mathbf{H}_2\mathbf{H}_1$, which gradually transforms $\mathbf{M}$ into an upper triangular matrix with Householder reflections. Therefore, we have

$$\mathbf{H}\mathbf{M}=\mathbf{H}_{n-1}\cdots\mathbf{H}_2\mathbf{H}_1\mathbf{M}=\mathbf{T},\qquad(6)$$

where $\mathbf{T}$ is the upper triangular matrix (not to be confused with the matrix shown in Fig. 2), and the final obtained orthogonal set is $\mathbf{R}=\mathbf{H}^\top$ (i.e., $\mathbf{M}=\mathbf{R}\mathbf{T}$).
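A sketch of Householder-based orthogonalization following Eqs. (5)-(6); in practice a differentiable QR routine (e.g., torch.linalg.qr) computes the same orthogonal factor.

```python
import torch

def householder_orthogonalize(M, eps=1e-12):
    """Return R (orthogonal) and T (upper triangular) such that M = R T."""
    n = M.shape[0]
    T = M.clone()
    R = torch.eye(n, dtype=M.dtype)
    for k in range(n - 1):
        x = T[k:, k]
        e1 = torch.zeros_like(x)
        e1[0] = 1.0
        u = x - x.norm() * e1                         # Householder vector for this column
        u = u / u.norm().clamp_min(eps)
        Hk = torch.eye(n - k, dtype=M.dtype) - 2.0 * torch.outer(u, u)
        T[k:, k:] = Hk @ T[k:, k:]                    # reflect the trailing sub-matrix
        R[:, k:] = R[:, k:] @ Hk                      # accumulate R = H_1 H_2 ... H_{n-1}
    return R, T

M = torch.randn(8, 8)
R, T = householder_orthogonalize(M)
print(torch.dist(R.t() @ R, torch.eye(8)), torch.dist(R @ T, M))   # both close to 0
```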

Löwdin’s Symmetric Orthogonalization. Let the matrix $\mathbf{M}=[\mathbf{m}_1,\dots,\mathbf{m}_n]$ be a given set of linearly independent vectors in an $n$-dimensional space. A non-singular linear transformation $\mathbf{A}$ can transform the basis $\mathbf{M}$ to an orthogonal basis $\mathbf{R}$: $\mathbf{R}=\mathbf{M}\mathbf{A}$. The matrix $\mathbf{R}$ will be orthogonal if the following equation holds:

$$\mathbf{R}^\top\mathbf{R}=\mathbf{A}^\top\mathbf{M}^\top\mathbf{M}\mathbf{A}=\mathbf{A}^\top\mathbf{S}\mathbf{A}=\mathbf{I},\qquad(7)$$

where $\mathbf{S}=\mathbf{M}^\top\mathbf{M}$ is the Gram matrix of the given set $\mathbf{M}$. We obtain a general solution to the orthogonalization problem via the substitution $\mathbf{A}=\mathbf{S}^{-\frac{1}{2}}\mathbf{B}$, where $\mathbf{B}$ is an arbitrary unitary matrix. The specific choice $\mathbf{B}=\mathbf{I}$ gives Löwdin’s symmetric orthogonalization (LS): $\mathbf{R}=\mathbf{M}\mathbf{S}^{-\frac{1}{2}}$. We can analytically obtain the symmetric orthogonalization from the singular value decomposition $\mathbf{M}=\mathbf{U}\mathbf{\Sigma}\mathbf{Q}^\top$: LS then gives $\mathbf{R}=\mathbf{U}\mathbf{Q}^\top$ as the orthogonal set for $\mathbf{M}$.

LS possesses a remarkable property which the other orthogonalizations do not have: the orthogonal set resembles the original set in a nearest-neighbour sense. Specifically, LS guarantees that $\sum_{i=1}^{n}\|\mathbf{r}_i-\mathbf{m}_i\|^2$ (where $\mathbf{r}_i$ and $\mathbf{m}_i$ are the $i$-th columns of $\mathbf{R}$ and $\mathbf{M}$, respectively) is minimized over all orthonormal sets. Intuitively, LS applies the gentlest pushing of the directions of the vectors in order to make them orthogonal.
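Because R = U Q^T follows directly from the SVD, LS takes only a few lines; the sketch below also illustrates the nearest-orthogonal-matrix property by comparing against the QR factor.

```python
import torch

def lowdin_orthogonalize(M):
    """Lowdin's symmetric orthogonalization: the orthogonal matrix nearest to M."""
    U, S, Qh = torch.linalg.svd(M)     # M = U diag(S) Qh
    return U @ Qh                      # R = U Q^T

M = torch.randn(16, 16)
R = lowdin_orthogonalize(M)
Q, _ = torch.linalg.qr(M)                          # another orthogonal basis, for comparison
print(torch.dist(R.t() @ R, torch.eye(16)))        # close to 0: R is orthogonal
print(torch.dist(R, M) <= torch.dist(Q, M))        # True: LS stays closest to M
```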

Discussion. These orthogonalization algorithms are fully differentiable and end-to-end trainable. For better orthogonality, they can be applied iteratively, and we can unroll them with multiple iterations; empirically, one-step unrolling usually already works well. We have also considered Givens rotations to construct the orthogonal matrix, but they need to traverse all lower-triangular elements of the original set $\mathbf{M}$, which requires $\mathcal{O}(n^2)$ rotations and is therefore too computationally expensive. More interestingly, each orthogonalization method encodes a unique inductive bias for the resulting neurons by imposing an implicit regularization (e.g., least distance in Hilbert space for LS). More details about the orthogonalization algorithms are provided in Appendix A.

3.4 Orthogonal Parameterization

A more convenient way to ensure orthogonality while learning the matrix $\mathbf{R}$ is to use a special parameterization that inherently guarantees orthogonality. The exponential parameterization uses $\mathbf{R}=\exp(\mathbf{W})$ (where $\exp(\cdot)$ denotes the matrix exponential) to represent an orthogonal matrix via a skew-symmetric matrix $\mathbf{W}$. The Cayley parameterization (CP) is a Padé approximation of the exponential parameterization, and is a more natural choice due to its simplicity. The Cayley parameterization uses the following transform to construct an orthogonal matrix $\mathbf{R}$ from a skew-symmetric matrix $\mathbf{W}$:

$$\mathbf{R}=(\mathbf{I}+\mathbf{W})(\mathbf{I}-\mathbf{W})^{-1},\qquad(8)$$

where $\mathbf{W}=-\mathbf{W}^\top$. We note that the Cayley parameterization only produces orthogonal matrices with determinant $1$, which belong to the special orthogonal group, i.e., $\mathbf{R}\in SO(n)$. Specifically, it suffices to learn the upper (or lower) triangular part of the matrix $\mathbf{W}$ with unconstrained optimization to obtain a desired orthogonal matrix $\mathbf{R}$. The Cayley parameterization does not cover the entire orthogonal group and is less flexible in terms of representation power, which serves as an explicit regularization for the neurons.
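A sketch of the Cayley parameterization in Eq. (8): only a strictly upper-triangular free parameter is learned, so any unconstrained optimizer keeps R in SO(n) by construction (module and parameter names are illustrative).

```python
import torch
import torch.nn as nn

class CayleyOrthogonal(nn.Module):
    """Produces R = (I + W)(I - W)^{-1} from a free parameter via a skew-symmetric W."""

    def __init__(self, n):
        super().__init__()
        self.theta = nn.Parameter(torch.zeros(n, n))   # unconstrained parameter
        self.register_buffer("eye", torch.eye(n))

    def forward(self):
        upper = torch.triu(self.theta, diagonal=1)
        W = upper - upper.t()                          # skew-symmetric: W = -W^T
        # Since (I + W) and (I - W)^{-1} commute, solving (I - W) R = (I + W) gives Eq. (8).
        return torch.linalg.solve(self.eye - W, self.eye + W)

R = CayleyOrthogonal(32)()
print(torch.dist(R.t() @ R, torch.eye(32)))            # close to 0 for any theta
```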

3.5 Orthogonality-Preserving Gradient Descent

An alternative way to guarantee orthogonality is to modify the gradient update for the transformation matrix $\mathbf{R}$. The general idea is to initialize $\mathbf{R}$ with an arbitrary orthogonal matrix and then make sure that every gradient update applies an orthogonal transformation to $\mathbf{R}$, which essentially amounts to gradient descent on the Stiefel manifold. There is a large body of work (Li et al., 2020; Wen & Yin, 2013; Wisdom et al., 2016; Lezcano-Casado & Martínez-Rubio, 2019; Arjovsky et al., 2016; Henaff et al., 2016; Jing et al., 2017) that focuses on optimization on the Stiefel manifold.

Given a matrix $\mathbf{R}$ that is initialized as an orthogonal matrix, we aim to construct an orthogonal transformation as the gradient update. We use the Cayley transform to compute a parametric curve on the Stiefel manifold (with a specific metric) via a skew-symmetric matrix $\mathbf{W}$ and use it as the update rule:

$$\mathbf{R}^{(t+1)}=\Big(\mathbf{I}+\tfrac{\alpha}{2}\mathbf{W}^{(t)}\Big)^{-1}\Big(\mathbf{I}-\tfrac{\alpha}{2}\mathbf{W}^{(t)}\Big)\mathbf{R}^{(t)},\qquad(9)$$

where $\mathbf{W}^{(t)}=\mathbf{G}^{(t)}\mathbf{R}^{(t)\top}-\mathbf{R}^{(t)}\mathbf{G}^{(t)\top}$ and $\alpha$ is the learning rate. $\mathbf{R}^{(t)}$ denotes the orthogonal matrix in the $t$-th iteration, and $\mathbf{G}^{(t)}$ denotes the original gradient of the loss function w.r.t. $\mathbf{R}^{(t)}$. We term such a gradient update orthogonality-preserving gradient descent (OGD). To reduce the computational cost of the matrix inverse in Eq. (9), we use an iterative method (Li et al., 2020) to approximate the Cayley transform without matrix inverse. By moving terms in Eq. (9), we arrive at the following fixed-point iteration:

$$\mathbf{R}^{(t+1)}_{k+1}=\mathbf{R}^{(t)}-\tfrac{\alpha}{2}\mathbf{W}^{(t)}\big(\mathbf{R}^{(t)}+\mathbf{R}^{(t+1)}_{k}\big),\qquad(10)$$

which converges to the closed-form Cayley transform at a geometric rate governed by $\tfrac{\alpha}{2}\|\mathbf{W}^{(t)}\|$ (where $k$ is the iteration number), provided $\tfrac{\alpha}{2}\|\mathbf{W}^{(t)}\|<1$. In practice, we empirically find that two iterations usually suffice for a reasonable approximation accuracy.
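Below is a sketch of one OGD step following Eqs. (9)-(10), with the matrix inverse replaced by the two-step fixed-point iteration; the learning rate and iteration count are illustrative, not the values used in the paper.

```python
import torch

@torch.no_grad()
def ogd_step(R, grad, lr=0.01, inner_iters=2):
    """One orthogonality-preserving update of R, given the loss gradient w.r.t. R."""
    W = grad @ R.t() - R @ grad.t()          # skew-symmetric matrix built from the gradient
    Y = R.clone()                             # fixed-point iterate, initialized at R
    for _ in range(inner_iters):              # approximates the Cayley transform in Eq. (9)
        Y = R - 0.5 * lr * W @ (R + Y)
    R.copy_(Y)
    return R

R, _ = torch.linalg.qr(torch.randn(16, 16))   # start from an orthogonal matrix
grad = torch.randn(16, 16)                    # stand-in for dL/dR
ogd_step(R, grad)
print(torch.dist(R.t() @ R, torch.eye(16)))   # remains close to 0
```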

3.6 Relaxation to Orthogonal Regularization

We relax the optimization with an orthogonality constraint to an unconstrained optimization with an orthogonality regularization (OR). Specifically, we remove the orthogonality constraint in Eq. (1) and add an orthogonality regularizer for $\mathbf{R}$ to the objective function, i.e., $\|\mathbf{R}^\top\mathbf{R}-\mathbf{I}\|_F^2$. Taking Eq. (1) as an example, the training objective becomes

$$\min_{\mathbf{R},\,\mathbf{u}}\ \sum_{j=1}^{m}\mathcal{L}\Big(y_j,\ \sum_{i=1}^{n}u_i\,(\mathbf{R}\mathbf{v}_i)^\top\mathbf{x}_j\Big)+\beta\,\big\|\mathbf{R}^\top\mathbf{R}-\mathbf{I}\big\|_F^2,\qquad(11)$$

where $\beta$ is a hyperparameter. This objective serves as an approximation to the OPT objective; the relaxation cannot guarantee that the hyperspherical energy stays unchanged.
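The relaxed objective in Eq. (11) only needs the soft penalty below added to the task loss (beta and the surrounding training loop are assumptions for illustration).

```python
import torch

def orthogonality_penalty(R):
    """||R^T R - I||_F^2, the soft regularizer used by OPT (OR)."""
    eye = torch.eye(R.shape[1], device=R.device, dtype=R.dtype)
    return ((R.t() @ R - eye) ** 2).sum()

# Inside a training step, with task_loss computed from the OPT-parameterized network:
# loss = task_loss + beta * sum(orthogonality_penalty(R) for R in layer_R_matrices)
```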

4 Refining the Random Initialization

Minimizing hyperspherical energy. Because the neuron weight vectors $\mathbf{v}_i$ are randomly initialized, there is variance that makes the hyperspherical energy deviate from the minimum even though the energy is minimized in expectation. To reduce the hyperspherical energy more effectively, we propose to refine the random initialization by minimizing its hyperspherical energy as a preprocessing step. Specifically, before feeding these neuron weights to OPT, we first minimize the hyperspherical energy in Eq. (2) with gradient descent (without the training data loss). Moreover, since randomly initialized neurons do not minimize the half-space hyperspherical energy (Liu et al., 2018), in which the collinearity redundancy is removed, we can also perform half-space hyperspherical energy minimization as a preprocessing step.
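A sketch of this preprocessing step (using the s = 1 energy and an illustrative step size and iteration count; the half-space variant would additionally include the negated neurons):

```python
import torch

def refine_by_energy_minimization(V, steps=5000, lr=0.01, s=1.0, eps=1e-9):
    """Lower the hyperspherical energy of randomly initialized neurons (rows of V)."""
    V = V.clone().requires_grad_(True)
    opt = torch.optim.SGD([V], lr=lr)
    for _ in range(steps):
        V_hat = V / V.norm(dim=1, keepdim=True).clamp_min(eps)
        dist = torch.pdist(V_hat)                       # pairwise distances, i < j only
        energy = (dist.clamp_min(eps) ** (-s)).sum()    # each pair once; the factor 2 is irrelevant
        opt.zero_grad()
        energy.backward()
        opt.step()
    return V.detach()

refined_V = refine_by_energy_minimization(torch.randn(64, 32), steps=500)
```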

Normalizing the randomly initialized neurons. Since the norms of the randomly initialized neuron weights play a role similar to weighting the importance of different neurons, we further consider normalizing the neuron weights so that each neuron weight vector has unit norm.

We evaluate both refinements in Section 7.4, and we also show that OPT still performs well without these refinements.

5 Theoretical Insights on Generalization

The key question we aim to answer in this section is why OPT may lead to better generalization. We have already shown that OPT can guarantee minimum hyperspherical energy (MHE) in a probabilistic sense. Although empirical evidence (Liu et al., 2018) has shown significant and consistent performance gains from minimizing hyperspherical energy, why lower hyperspherical energy leads to better generalization is still unclear. We argue for the benefit of OPT from two aspects: how OPT may affect training and generalization in theory, and why minimum hyperspherical energy serves as a good inductive bias.

Our goal here is to leverage and apply existing theoretical results (Kawaguchi, 2016; Xie et al., 2016; Soudry & Carmon, 2016; Lee et al., 2016; Du et al., 2017; Allen-Zhu et al., 2018) to explain the role that MHE plays, rather than proving sharp and novel generalization bounds. We simply consider one-hidden-layer networks as the hypothesis class, $f(\mathbf{x})=\sum_{k=1}^{n}v_k\,\sigma(\mathbf{w}_k^\top\mathbf{x})$, where $\sigma(\cdot)$ is ReLU. Since the magnitude of $v_k$ can be scaled into $\mathbf{w}_k$, we can restrict $v_k$ to be $\pm 1$. Given a set of i.i.d. training samples $\{(\mathbf{x}_i,y_i)\}_{i=1}^{m}$, where $\mathbf{x}_i$ is drawn uniformly from the unit hypersphere, we minimize the least square loss $\mathcal{L}=\frac{1}{2m}\sum_{i=1}^{m}\big(f(\mathbf{x}_i)-y_i\big)^2$. The gradient w.r.t. $\mathbf{w}_k$ is

$$\nabla_{\mathbf{w}_k}\mathcal{L}=\frac{1}{m}\sum_{i=1}^{m}\big(f(\mathbf{x}_i)-y_i\big)\,v_k\,\mathbb{1}\{\mathbf{w}_k^\top\mathbf{x}_i>0\}\,\mathbf{x}_i.\qquad(12)$$

Let $\mathbf{W}=[\mathbf{w}_1,\dots,\mathbf{w}_n]$ be the column concatenation of the neuron weights. We aim to identify the conditions under which there are no spurious local minima. We can rewrite the gradient as

$$\operatorname{vec}\big(\nabla_{\mathbf{W}}\mathcal{L}\big)=\frac{1}{m}\,\mathbf{D}^\top\mathbf{e},\qquad(13)$$

where $\mathbf{e}=\big(f(\mathbf{x}_1)-y_1,\dots,f(\mathbf{x}_m)-y_m\big)^\top$ is the residual vector and the $i$-th row of $\mathbf{D}$ is the concatenation of $v_k\,\mathbb{1}\{\mathbf{w}_k^\top\mathbf{x}_i>0\}\,\mathbf{x}_i^\top$ over $k=1,\dots,n$. Therefore, we can obtain that

$$\big\|\nabla_{\mathbf{W}}\mathcal{L}\big\|\ \geq\ \frac{\sigma_{\min}(\mathbf{D})}{m}\,\|\mathbf{e}\|,\qquad(14)$$

where $\|\mathbf{e}\|$ measures the training error and $\sigma_{\min}(\mathbf{D})$ is the minimum singular value of $\mathbf{D}$. If we want a small gradient to imply a small training error, then we have to lower bound $\sigma_{\min}(\mathbf{D})$ away from zero. We have the following result from (Xie et al., 2016).

Lemma 1 (Xie et al., 2016).

With probability larger than $1-\delta$, we have $\sigma_{\min}(\mathbf{D})\geq\lambda$, where

(15)

gives the explicit form of $\lambda$, which depends on the sample size and on the pairwise angles between the neurons through a kernel function: the smaller the pairwise kernel similarities between distinct neurons, the larger $\lambda$.

Once MHE is achieved, the neurons are uniformly distributed on the unit hypersphere. From Lemma 1, if the neurons are uniformly distributed on the unit hypersphere, the pairwise kernel similarities between distinct neurons will be very small and close to zero, leading to a large lower bound $\lambda$ for $\sigma_{\min}(\mathbf{D})$. Therefore, by Eq. (14), MHE results in a small training error once the gradient norm is small. This implies that there are no spurious local minima when we use OPT for training.

We further argue that the MHE induced by OPT serves as an important inductive bias for neural networks. Weight decay, the standard regularizer for neural networks, controls the norm of the neuron weights and thus essentially regularizes one dimension of the weight. In contrast, MHE completes the missing pieces by regularizing the remaining dimensions of the weight. MHE encourages minimum hyperspherical redundancy between neurons; in the linear classifier case, MHE imposes a prior of maximal inter-class separability.

6 Discussions

Semi-randomness. OPT fixes the randomly initialized neuron weight vectors and only learns a layer-shared orthogonal matrix for each layer, so OPT naturally imposes strong randomness on the neurons and combines the generalization benefits of randomness with the strong approximation power of neural networks. Such randomness suggests that the specific configuration of relative positions among neurons does not matter that much, and that the coordinate system is more crucial for generalization. (Kawaguchi et al., 2018; Rahimi & Recht, 2008; Srivastava et al., 2014) also show that randomness can be beneficial to generalization.

Coordinate system vs. relative position. OPT shows that only learning the coordinate system yields much better generalization than learning neuron weights directly. This implies that the coordinate system is of great importance to generalization. However, relative position does not matter only when the hyperspherical energy is sufficiently low. In other words, hyperspherical energy can well characterize the relative position among neurons. Lower hyperspherical energy will lead to better generalization.

Flexible training. First, OPT can be used in multi-task training (Mallya et al., 2018), where each set of orthogonal matrices represents one task. OPT can learn a different set of orthogonal matrices for each task while the neuron weights remain the same. Second, we can perform progressive training with OPT. For example, after learning a set of orthogonal matrices on a large coarse-grained dataset (i.e., pretraining), we can multiply the orthogonal matrices back into the neuron weights to construct a new set of neuron weights. Then we can use the new neuron weights as a starting point and apply OPT to train on a small fine-grained dataset (i.e., finetuning).

Limitations and open problems. The limitations of OPT include higher GPU memory consumption and more computation during training, numerical issues when ensuring orthogonality, and weak scalability for ultra-wide neural networks. Therefore, there are plenty of open problems around OPT, such as scalable and efficient training. Most significantly, OPT opens up a new possibility for studying the theoretical generalization of deep networks: with the decomposition into hyperspherical energy and coordinate system, OPT provides a new perspective for future research.

7 Experiments and Results

7.1 Experimental settings

We evaluate OPT on various types of neural networks such as multi-layer perceptrons (for image classification), convolutional neural networks (for image classification), graph neural networks (for graph node classification), and point cloud neural networks (for point cloud classification). Our goal is to show the performance gain of OPT on training different neural networks rather than achieving the state-of-the-art performance on different tasks. More experimental details and network architectures are given in Appendix D.

7.2 Ablation Study and Exploratory Experiment

Method FN LR CNN-6 CNN-9
Baseline - - 37.59 33.55
UPT N U 48.47 46.72
UPT Y U 42.61 39.38
OPT N GS 37.24 32.95
OPT Y GS 33.02 31.03
Table 1: Error (%) on CIFAR-100.

Necessity of orthogonality. We first examine whether orthogonality is necessary for OPT. We use both a 6-layer CNN and a 9-layer CNN (specified in Appendix D) on CIFAR-100. We compare OPT with a baseline that uses the same network architecture but learns an unconstrained matrix with only weight decay regularization; we term this baseline unconstrained over-parameterized training (UPT). "FN" in Table 1 denotes whether the randomly initialized neuron weights are fixed throughout training ("Y" for yes and "N" for no). "LR" denotes whether the learnable transformation is unconstrained ("U") or orthogonal ("GS" for Gram-Schmidt process). The results in Table 1 show that without ensuring orthogonality, the performance of UPT is much worse than that of OPT, which unrolls the Gram-Schmidt process for orthogonality (no matter whether the neuron weights are fixed or not). Thus, orthogonality is indeed necessary.

Fixed weight vs. learnable weight. From Table 1, we can see that using fixed neuron weights is consistently better than learnable neuron weights in both UPT and OPT. It shows that fixing the neuron weights while learning the transformation matrix is beneficial to generalization.

High vs. low hyperspherical energy. We empirically verify that high hyperspherical energy corresponds to inferior generalization performance. To initialize neurons with high hyperspherical energy, we use random initializations with the mean shifted to 0, 1e-3, 1e-2, 2e-2, 3e-2, and 5e-2 (see Table 2).

Mean Energy Error (%)
0 3.5109 32.49
1e-3 3.5117 33.11
1e-2 3.5160 39.51
2e-2 3.5531 53.89
3e-2 3.6761 N/C
5e-2 4.2776 N/C
Table 2: Different energy.

We use CNN-6 to conduct experiments on CIFAR-100. The results in Table 2 ("N/C" denotes not converged) show that networks with higher hyperspherical energy are more difficult to converge. Moreover, we find that if the hyperspherical energy is larger than a certain value, then the network cannot converge at all. Note that, when the hyperspherical energy is small (near the minimum), a little change in hyperspherical energy (e.g., from 3.5109 to 3.5160) can lead to a dramatic generalization gap (e.g., from a 32.49% error rate to 39.51%). One can observe that higher hyperspherical energy leads to worse generalization.

7.3 Multi-Layer Perceptrons

Method Normal Xavier
Baseline 6.05 2.14
OPT (GS) 5.11 1.45
OPT (HR) 5.31 1.60
OPT (LS) 5.32 1.54
OPT (CP) 5.14 1.49
OPT (OGD) 5.38 1.56
OPT (OR) 5.41 1.78
Table 3: Error (%) on MNIST.

We evaluate different variants of OPT for MLPs on MNIST. We use a 3-layer MLP for all the training methods. Specific training hyperparameters are given in Appendix D. Results in Table 3 show the testing error on MNIST for the cases where the neuron weights use normal initialization or Xavier initialization (Glorot & Bengio, 2010). OPT (GS/HR/LS) denote OPT with unrolled orthogonalization algorithms. OPT (CP) denotes OPT with the Cayley parameterization. OPT (OGD) is OPT with orthogonality-preserving gradient descent. OPT (OR) denotes OPT with the relaxed orthogonality regularization. We can see that OPT (GS) performs the best and all OPT variants outperform the baseline by a considerable margin.

7.4 Convolutional Neural Networks

Method CNN-6 CNN-9
Baseline 37.59 33.55
HS-MHE 34.97 32.87
OPT (GS) 33.02 31.03
OPT (HR) 35.67 32.75
OPT (LS) 34.48 31.22
OPT (CP) 33.53 31.28
OPT (OGD) 33.33 31.47
OPT (OR) 34.70 32.63
Table 4: Err. (%) on CIFAR-100.

OPT variants. We evaluate all the OPT variants with a plain 6-layer CNN and a plain 9-layer CNN on CIFAR-100. Detailed network architectures are given in Appendix D. All the neurons are initialized following (He et al., 2015), and batch normalization (Ioffe & Szegedy, 2015) is used by default. Results in Table 4 show that nearly all OPT variants consistently outperform both the baseline and the HS-MHE regularization (Liu et al., 2018) by a significant margin. The HS-MHE regularization puts the half-space hyperspherical energy into the loss function and minimizes it with stochastic gradients, which is a naive way to minimize the hyperspherical energy. From the results, we observe that OPT (HR) performs the worst among all OPT variants. In contrast, OPT (GS) achieves the best testing error, implying that the Gram-Schmidt process imposes a suitable inductive bias for CNNs on CIFAR-100.

Method Error (%)
Baseline 38.95
HS-MHE 36.90
OPT (GS) 35.61
OPT (HR) 37.51
OPT (LS) 35.83
OPT (CP) 34.88
OPT (OGD) 35.38
OPT (OR) N/C
Table 5: No BatchNorm.

Training without batch normalization. We further evaluate how OPT performs without batch normalization. Specifically, we use CNN-6 as our backbone network and test on CIFAR-100. From Table 5, one can see that the OPT variants again outperform both the baseline and HS-MHE (Liu et al., 2018), validating that OPT can work reasonably well without batch normalization. Among all the OPT variants, the Cayley parameterization achieves a very competitive testing error, more than 4% lower than standard training.

Figure 3: Training dynamics on CIFAR-100. Left: Hyperspherical energy vs. iteration. Right: Testing error vs. iteration.

Training dynamics. We also look into how the hyperspherical energy and testing error change while training with OPT. For hyperspherical energy, we can see from Fig. 3 that the energy of the baseline increases dramatically at the beginning and then gradually goes down, but it still stays at a relatively high value at the end. MHE can effectively reduce the hyperspherical energy by the end of training. In contrast, all OPT variants maintain a very low hyperspherical energy from the beginning. OPT (GS) and OPT (CP) keep exactly the same hyperspherical energy as the randomly initialized neurons, while OPT (OR) may increase the hyperspherical energy a little since it is a relaxation. For testing error, all OPT variants converge stably and their final accuracies outperform the others.

Method Standard MHE HS-MHE
OPT (GS) 33.02 32.99 32.78
OPT (LS) 34.48 34.43 34.37
OPT (CP) 33.53 33.50 33.42
Energy 3.5109 3.5003 3.4976
Table 6: Refining energy.

Refining neuron initialization. We also evaluate two refinement tricks for the neuron initialization. First, we consider hyperspherical energy minimization as a preprocessing step for the neuron weights. We conduct the experiment using CNN-6 on CIFAR-100. Specifically, we run gradient descent for 5k iterations to minimize the hyperspherical energy of the neuron weights before training starts. We also report the hyperspherical energy (before training starts, after the preprocessing of energy minimization) in Table 6. All the methods use the same random initialization with the same random seed, so the hyperspherical energy always starts at 3.5109. After the neuron preprocessing, the energy is 3.5003 for the MHE objective and 3.4976 for the half-space MHE objective. More importantly, Table 6 shows that such a refinement can effectively improve the generalization of OPT and further reduce the testing error on CIFAR-100.

Method w/o Norm w/ Norm
Baseline 37.59 -
OPT (GS) 33.02 32.54
OPT (HR) 35.67 35.30
OPT (LS) 34.48 32.11
OPT (CP) 33.53 32.49
OPT (OGD) 33.37 32.70
OPT (OR) 34.70 33.27
Table 7: Normalized neurons.

Then we experiment with neuron weight normalization in OPT. Normalized neurons make a lot of sense in OPT because the scale of the randomly initialized weights does not carry any useful information. After randomly initializing the neurons, we directly normalize the weights to unit norm. These randomly initialized neurons still possess the important property of achieving minimum hyperspherical energy. Specifically, we use CNN-6 to perform classification on CIFAR-100. The results in Table 7 show that normalizing the neurons can largely boost the performance of OPT.

Method ResNet-20 ResNet-32
Baseline 31.11 30.16
OPT (GS) 30.73 29.56
OPT (CP) 30.47 29.31
Table 8: ResNets (%).

OPT for ResNet. To show that OPT is agnostic to different CNN architectures, we perform classification experiments on CIFAR-100 with both ResNet-20 and ResNet-32 (He et al., 2016). We use OPT (GS) and OPT (CP) to train ResNet-20 and ResNet-32. The results in Table 8 show that OPT achieves consistent improvements on ResNet compared to the standard training.

Method Top-1 Err. Top-5 Err.
Baseline 44.32 21.13
OPT (CP) 43.67 20.26
Table 9: ImageNet (%).

ImageNet. We test OPT on ImageNet-2012. Since OPT consumes more GPU memory in large-scale settings, we use the GPU memory-efficient OPT (CP) to train a plain 10-layer CNN (detailed structure specified in Appendix D) on ImageNet. Note that our purpose here is to validate the superiority of OPT over the corresponding baseline. From Table 9, we can see that OPT (CP) achieves 0.65% and 0.87% improvements on top-1 and top-5 error, respectively.

Method 5-shot Acc. (%)
MAML (Finn et al., 2017) 62.71 ± 0.71
ProtoNet (Snell et al., 2017) 64.24 ± 0.72
Baseline (Chen et al., 2019) 62.53 ± 0.69
Baseline w/ OPT 63.27 ± 0.68
Baseline++ (Chen et al., 2019) 66.43 ± 0.63
Baseline++ w/ OPT 66.68 ± 0.66
Table 10: Few-shot learning.

Few-shot learning. To evaluate the cross-task generalization of OPT, we conduct few-shot learning on Mini-ImageNet, following the same experimental setting as (Chen et al., 2019). More detailed experimental settings are provided in Appendix D. Specifically, we apply OPT with CP to train both the baseline and baseline++ described in (Chen et al., 2019), and immediately obtain obvious improvements. Therefore, OPT-trained networks generalize well in challenging few-shot scenarios.

7.5 Graph Neural Networks

We also test OPT with graph convolutional networks (GCNs) (Kipf & Welling, 2016) for graph node classification. For a fair comparison, we use exactly the same implementation, hyperparameters and experimental setup as (Kipf & Welling, 2016). Training a GCN with OPT is not entirely straightforward. Specifically, GCN uses the following forward model:

$$\mathbf{Z}=\operatorname{softmax}\Big(\hat{\mathbf{A}}\,\operatorname{ReLU}\big(\hat{\mathbf{A}}\mathbf{X}\mathbf{W}^{(0)}\big)\,\mathbf{W}^{(1)}\Big),\qquad(16)$$

where $\hat{\mathbf{A}}=\tilde{\mathbf{D}}^{-\frac{1}{2}}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-\frac{1}{2}}$. We note that $\mathbf{A}$ is the adjacency matrix of the graph, $\tilde{\mathbf{A}}=\mathbf{A}+\mathbf{I}_N$ ($\mathbf{I}_N$ is an identity matrix), and $\tilde{\mathbf{D}}_{ii}=\sum_j\tilde{\mathbf{A}}_{ij}$. $\mathbf{X}\in\mathbb{R}^{N\times C}$ is the feature matrix of the nodes in the graph (the feature dimension is $C$). $\mathbf{W}^{(1)}$ contains the weights of the classifiers in the output layer. $\mathbf{W}^{(0)}$ is the weight matrix of size $C\times H$, where $H$ is the dimension of the hidden space. We treat each column vector of $\mathbf{W}^{(0)}$ as a neuron, so there are $H$ neurons in total. Then we naturally apply OPT to train these neurons of dimension $C$ in the GCN.

Method Cora Pubmed
GCN Baseline 81.3 79.0
OPT (CP) 82.0 79.4
OPT (OGD) 82.3 79.5
Table 11: Graph networks.

We conduct experiments on the Cora and Pubmed datasets (Sen et al., 2008). The goal here is to verify the effectiveness of OPT on GCN rather than to achieve state-of-the-art performance on graph node classification. The results in Table 11 show a reasonable improvement achieved by OPT, validating OPT's universality in training different types of neural networks on different modalities.

7.6 Point Cloud Neural Networks

Method Acc. (%)
PointNet Baseline 87.1
OPT (GS) 87.23
OPT (CP) 87.86
Table 12: PointNets.

We further test OPT on PointNet (Qi et al., 2017), a type of neural network that takes raw point clouds as input and classifies them based on semantics. To simplify the comparison and remove all the bells and whistles, we use a vanilla PointNet (without T-Net) as our backbone network and apply OPT to train the MLPs in PointNet. We follow the same experimental settings as (Qi et al., 2017) and evaluate on the ModelNet-40 dataset (Wu et al., 2015). The results are given in Table 12. We can observe that both OPT variants achieve better accuracy than the PointNet baseline, and OPT (CP) achieves a nearly 0.8% improvement. This is in fact significant because we do not add any additional parameters to the network. Although our accuracy is not state-of-the-art, we can still validate the effectiveness of OPT. Most importantly, the improvement on PointNet further validates that OPT is a generic and effective training framework for different types of neural networks.

8 Concluding Remarks

This paper proposes a novel training framework for various types of neural networks. OPT over-parameterizes neurons with neuron weights (randomly initialized and fixed) and a layer-shared orthogonal matrix (learnable). OPT provably achieves minimum hyperspherical energy and maintains this energy during training. We give theoretical insights and extensive empirical evidence to validate OPT's superiority.

References

Appendix A Details of Unrolled Orthogonalization Algorithms

A.1 Gram-Schmidt Process

Gram-Schmidt Process. The GS process is a method for orthonormalizing a set of vectors in an inner product space, i.e., the Euclidean space equipped with the standard inner product. Specifically, the GS process performs the following operations to orthogonalize a set of vectors $\{\mathbf{m}_1,\dots,\mathbf{m}_n\}$:

Step 1: $\mathbf{u}_1=\mathbf{m}_1$, $\mathbf{e}_1=\frac{\mathbf{u}_1}{\|\mathbf{u}_1\|}$ (17)
Step 2: $\mathbf{u}_2=\mathbf{m}_2-\operatorname{proj}_{\mathbf{u}_1}(\mathbf{m}_2)$, $\mathbf{e}_2=\frac{\mathbf{u}_2}{\|\mathbf{u}_2\|}$
Step 3: $\mathbf{u}_3=\mathbf{m}_3-\operatorname{proj}_{\mathbf{u}_1}(\mathbf{m}_3)-\operatorname{proj}_{\mathbf{u}_2}(\mathbf{m}_3)$, $\mathbf{e}_3=\frac{\mathbf{u}_3}{\|\mathbf{u}_3\|}$
Step 4: $\mathbf{u}_4=\mathbf{m}_4-\sum_{j=1}^{3}\operatorname{proj}_{\mathbf{u}_j}(\mathbf{m}_4)$, $\mathbf{e}_4=\frac{\mathbf{u}_4}{\|\mathbf{u}_4\|}$
Step n: $\mathbf{u}_n=\mathbf{m}_n-\sum_{j=1}^{n-1}\operatorname{proj}_{\mathbf{u}_j}(\mathbf{m}_n)$, $\mathbf{e}_n=\frac{\mathbf{u}_n}{\|\mathbf{u}_n\|}$

where $\operatorname{proj}_{\mathbf{u}}(\mathbf{m})=\frac{\langle\mathbf{m},\mathbf{u}\rangle}{\langle\mathbf{u},\mathbf{u}\rangle}\mathbf{u}$ denotes the projection of the vector $\mathbf{m}$ onto the vector $\mathbf{u}$. The set $\{\mathbf{e}_1,\dots,\mathbf{e}_n\}$ denotes the output orthonormal set. The algorithm flowchart can be described as follows:

Input: $\{\mathbf{m}_1,\dots,\mathbf{m}_n\}$

Output: $\{\mathbf{e}_1,\dots,\mathbf{e}_n\}$

for $k=1,\dots,n$ do

       $r_{jk}=\langle\mathbf{e}_j,\mathbf{m}_k\rangle$ for $j=1,\dots,k-1$;  $\mathbf{u}_k=\mathbf{m}_k-\sum_{j=1}^{k-1}r_{jk}\mathbf{e}_j$;  $r_{kk}=\|\mathbf{u}_k\|$;  $\mathbf{e}_k=\mathbf{u}_k/r_{kk}$
end for
Algorithm 1 Gram-Schmidt Process

The coefficients $r_{jk}$ in the algorithm above are used to compute the QR factorization; they are not needed for orthogonalization and therefore do not have to be stored. When the GS process is implemented on a finite-precision computer, the vectors $\mathbf{e}_k$ are often not quite orthogonal because of rounding errors. Besides the standard GS process, there is a modified Gram-Schmidt (MGS) algorithm which enjoys better numerical stability. This approach gives the same result as the original formula in exact arithmetic but introduces smaller errors in finite-precision arithmetic. Specifically, GS computes each basis vector from the original vector in one shot:

$$\mathbf{u}_k=\mathbf{m}_k-\sum_{j=1}^{k-1}\operatorname{proj}_{\mathbf{u}_j}(\mathbf{m}_k).\qquad(18)$$

Instead of computing the vector $\mathbf{u}_k$ as in Eq. (18), MGS computes the orthogonal basis differently. MGS does not subtract all the projections of the original vector at once; instead, it successively removes the projections onto the previously constructed orthogonal basis from a running remainder. Specifically, MGS computes the following series of formulas:

$$\mathbf{u}_k^{(1)}=\mathbf{m}_k-\operatorname{proj}_{\mathbf{u}_1}(\mathbf{m}_k),\quad \mathbf{u}_k^{(2)}=\mathbf{u}_k^{(1)}-\operatorname{proj}_{\mathbf{u}_2}\big(\mathbf{u}_k^{(1)}\big),\quad\dots,\quad \mathbf{u}_k=\mathbf{u}_k^{(k-1)}=\mathbf{u}_k^{(k-2)}-\operatorname{proj}_{\mathbf{u}_{k-1}}\big(\mathbf{u}_k^{(k-2)}\big),\qquad(19)$$

where each step produces a vector that is orthogonal to the previously constructed bases, so $\mathbf{u}_k$ is also orthogonalized against any errors introduced in the computation of $\mathbf{u}_1,\dots,\mathbf{u}_{k-1}$. In practice, although MGS enjoys better numerical stability, we find that the empirical performance of GS and MGS is almost the same in OPT. However, MGS takes longer to compute since each orthogonal basis vector is obtained through an iterative process. Therefore, we usually stick to the classic GS for OPT.
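For completeness, a short sketch of MGS in the same style as the earlier GS sketch (the repeated projection against the running remainder is what limits error accumulation):

```python
import torch

def modified_gram_schmidt(M, eps=1e-8):
    """Orthonormalize the columns of M with modified Gram-Schmidt."""
    cols = [M[:, k] for k in range(M.shape[1])]
    basis = []
    for k in range(len(cols)):
        e = cols[k] / cols[k].norm().clamp_min(eps)
        basis.append(e)
        # Immediately remove the new direction from every remaining column.
        cols[k + 1:] = [c - (e @ c) * e for c in cols[k + 1:]]
    return torch.stack(basis, dim=1)
```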

Iterative Gram-Schmidt Process. Iterative Gram-Schmidt (IGS) process is an iterative version of the GS process. It is shown in (Hoffmann, 1989) that GS process can be carried out iteratively to obtain a basis matrix that is orthogonal in almost full working precision. The IGS algorithm is given as follows:

Input: $\{\mathbf{m}_1,\dots,\mathbf{m}_n\}$

Output: $\{\mathbf{e}_1,\dots,\mathbf{e}_n\}$

for $k=1,\dots,n$ do

       $\mathbf{u}_k^{(0)}=\mathbf{m}_k$, $t=0$
       while IsOrthogonal($\mathbf{u}_k^{(t)}$, $\{\mathbf{e}_1,\dots,\mathbf{e}_{k-1}\}$) is False do
             $\mathbf{u}_k^{(t+1)}=\mathbf{u}_k^{(t)}-\sum_{j=1}^{k-1}\langle\mathbf{e}_j,\mathbf{u}_k^{(t)}\rangle\,\mathbf{e}_j$,  $t\leftarrow t+1$
       end while
       $\mathbf{e}_k=\mathbf{u}_k^{(t)}/\|\mathbf{u}_k^{(t)}\|$
end for
Algorithm 2 Iterative Gram-Schmidt Process

The projection coefficients in the algorithm above are those used to compute the QR factorization; they are not needed for orthogonalization and therefore do not have to be explicitly stored. The while loop in IGS is an iterative re-orthogonalization procedure. In practice, we can unroll a fixed number of steps of the while loop in order to improve the orthogonality. The iterate obtained in the $t$-th step corresponds to an approximate solution of the associated linear system, and the IGS process corresponds to the Gauss-Jacobi iteration for solving this equation.

Both GS and IGS are easy to embed in neural networks, since they are both differentiable. In our experiments, we find that the performance gain of unrolling multiple steps of IGS over GS is not very obvious (partially because GS has already achieved nearly perfect orthogonality), while IGS incurs longer training time. Therefore, we unroll the classic GS process by default.

A.2 Householder Reflection

Let $\mathbf{u}$ be a non-zero vector. A matrix of the form

$$\mathbf{H}=\mathbf{I}-\frac{2}{\mathbf{u}^\top\mathbf{u}}\mathbf{u}\mathbf{u}^\top\qquad(20)$$

is a Householder reflection, and the vector $\mathbf{u}$ is the Householder vector. If a vector $\mathbf{x}$ is multiplied by the matrix $\mathbf{H}$, then it is reflected in the hyperplane $\operatorname{span}\{\mathbf{u}\}^{\perp}$. Householder matrices are symmetric and orthogonal.

For a vector $\mathbf{x}$, we let $\mathbf{u}=\mathbf{x}-\|\mathbf{x}\|\mathbf{e}_1$, where $\mathbf{e}_1=(1,0,\dots,0)^\top$ (the first element is $1$ and the remaining elements are $0$). Then we construct the Householder reflection matrix $\mathbf{H}$ with $\mathbf{u}$ and multiply it to $\mathbf{x}$:

$$\mathbf{H}\mathbf{x}=\|\mathbf{x}\|\mathbf{e}_1,\qquad(21)$$

which indicates that we can make any non-zero vector become $\alpha\mathbf{e}_1$, where $\alpha$ is some constant, by using a Householder reflection. By left-multiplying a reflection we can turn a dense vector into a vector of the same length with only a single nonzero entry. Repeating this for every column gives us the Householder QR factorization, which also orthogonalizes the original input matrix. Householder reflection orthogonalizes a matrix $\mathbf{M}$ by triangularizing it:

$$\mathbf{H}_{n-1}\cdots\mathbf{H}_2\mathbf{H}_1\mathbf{M}=\mathbf{T},\qquad(22)$$

where $\mathbf{T}$ is the upper-triangular matrix in the QR factorization. The orthogonal matrix is constructed as $\mathbf{R}=(\mathbf{H}_{n-1}\cdots\mathbf{H}_1)^\top=\mathbf{H}_1\mathbf{H}_2\cdots\mathbf{H}_{n-1}$, where $\mathbf{H}_i$ is the Householder reflection performed on the $i$-th column. The algorithm flowchart is given as follows:


Algorithm 3 Householder Reflection Orthogonalization

Input: $\mathbf{M}\in\mathbb{R}^{n\times n}$

Output: $\mathbf{R}$, $\mathbf{T}$, where $\mathbf{R}$ is the orthogonal matrix and $\mathbf{T}$ is an upper triangular matrix such that $\mathbf{M}=\mathbf{R}\mathbf{T}$

$\mathbf{R}=\mathbf{I}_n$, $\mathbf{T}=\mathbf{M}$
for $k=1,\dots,n-1$ do

       $[\mathbf{u},\beta]=\text{house}\big(\mathbf{T}(k{:}n,k)\big)$
       $\mathbf{T}(k{:}n,k{:}n)=(\mathbf{I}-\beta\mathbf{u}\mathbf{u}^\top)\,\mathbf{T}(k{:}n,k{:}n)$
       $\mathbf{R}(1{:}n,k{:}n)=\mathbf{R}(1{:}n,k{:}n)\,(\mathbf{I}-\beta\mathbf{u}\mathbf{u}^\top)$
end for

Procedure house($\mathbf{x}$): compute $\mathbf{u}$ and $\beta$ such that $(\mathbf{I}-\beta\mathbf{u}\mathbf{u}^\top)\mathbf{x}$ is a multiple of $\mathbf{e}_1$
       $\sigma=\mathbf{x}(2{:}\mathrm{end})^\top\mathbf{x}(2{:}\mathrm{end})$, $\mathbf{u}=[1;\ \mathbf{x}(2{:}\mathrm{end})]$
       if $\sigma=0$ then
             $\beta=0$
       else
              $\mu=\sqrt{\mathbf{x}(1)^2+\sigma}$
              if $\mathbf{x}(1)\leq 0$ then
                    $\mathbf{u}(1)=\mathbf{x}(1)-\mu$
             else
                    $\mathbf{u}(1)=-\sigma/\big(\mathbf{x}(1)+\mu\big)$
              end if
             $\beta=2\,\mathbf{u}(1)^2/\big(\sigma+\mathbf{u}(1)^2\big)$, $\mathbf{u}=\mathbf{u}/\mathbf{u}(1)$
        end if

The algorithm follows the Matlab notation, where $\mathbf{M}(i{:}j,k{:}l)$ denotes the submatrix of $\mathbf{M}$ from the $k$-th column to the $l$-th column and from the $i$-th row to the $j$-th row. Note that there are a number of variants of the Householder reflection orthogonalization, such as the implicit variant where the reflections are not stored explicitly. Here $\mathbf{R}$ is the final orthogonal matrix we need.

A.3 Löwdin’s Symmetric Orthogonalization

Let $\mathbf{M}=[\mathbf{m}_1,\dots,\mathbf{m}_n]$ be a set of linearly independent vectors in an $n$-dimensional space. We define a general non-singular linear transformation $\mathbf{A}$ that transforms the basis $\mathbf{M}$ to a new basis $\mathbf{R}$:

$$\mathbf{R}=\mathbf{M}\mathbf{A},\qquad(23)$$

where the basis $\mathbf{R}$ will be orthonormal if (the transpose becomes the conjugate transpose in a complex space)

$$\mathbf{R}^\top\mathbf{R}=\mathbf{A}^\top\mathbf{S}\mathbf{A}=\mathbf{I},\qquad(24)$$

where $\mathbf{S}=\mathbf{M}^\top\mathbf{M}$ is the Gram matrix of the given basis $\mathbf{M}$.

A general solution to this orthogonalization problem can be obtained via the substitution

$$\mathbf{A}=\mathbf{S}^{-\frac{1}{2}}\mathbf{B},\qquad(25)$$

in which $\mathbf{B}$ is an arbitrary orthogonal (or unitary) matrix. When $\mathbf{B}=\mathbf{I}$, we have the symmetric orthogonalization, namely

$$\mathbf{R}_{\mathrm{LS}}=\mathbf{M}\mathbf{S}^{-\frac{1}{2}}.\qquad(26)$$

When $\mathbf{B}=\mathbf{P}$, in which $\mathbf{P}$ diagonalizes $\mathbf{S}$ (i.e., $\mathbf{P}^\top\mathbf{S}\mathbf{P}=\mathbf{\Lambda}$), we have the canonical orthogonalization, namely

$$\mathbf{R}_{\mathrm{CO}}=\mathbf{M}\mathbf{S}^{-\frac{1}{2}}\mathbf{P}=\mathbf{M}\mathbf{P}\mathbf{\Lambda}^{-\frac{1}{2}}.\qquad(27)$$

Because $\mathbf{P}$ diagonalizes $\mathbf{S}$, we have $\mathbf{S}^{-\frac{1}{2}}=\mathbf{P}\mathbf{\Lambda}^{-\frac{1}{2}}\mathbf{P}^\top$. Therefore, the transformation is $\mathbf{A}=\mathbf{S}^{-\frac{1}{2}}\mathbf{P}=\mathbf{P}\mathbf{\Lambda}^{-\frac{1}{2}}$. This is essentially an eigenvalue decomposition of the symmetric matrix $\mathbf{S}$.

In order to compute Löwdin’s symmetric orthogonalized basis set, we can use the singular value decomposition. Specifically, the SVD of the original basis set is given by

$$\mathbf{M}=\mathbf{U}\mathbf{\Sigma}\mathbf{Q}^\top,\qquad(28)$$

where both $\mathbf{U}$ and $\mathbf{Q}$ are orthogonal matrices and $\mathbf{\Sigma}$ is the diagonal matrix of singular values. Therefore, we have that

$$\mathbf{S}^{-\frac{1}{2}}=\big(\mathbf{M}^\top\mathbf{M}\big)^{-\frac{1}{2}}=\big(\mathbf{Q}\mathbf{\Sigma}^2\mathbf{Q}^\top\big)^{-\frac{1}{2}}=\mathbf{Q}\mathbf{\Sigma}^{-1}\mathbf{Q}^\top,\qquad(29)$$

where we use the connection between the eigenvalue decomposition and the SVD. Therefore, we end up with

$$\mathbf{R}_{\mathrm{LS}}=\mathbf{M}\mathbf{S}^{-\frac{1}{2}}=\mathbf{U}\mathbf{\Sigma}\mathbf{Q}^\top\mathbf{Q}\mathbf{\Sigma}^{-1}\mathbf{Q}^\top=\mathbf{U}\mathbf{Q}^\top,\qquad(30)$$

which is the output orthogonal matrix of Löwdin’s symmetric orthogonalization.

An interesting feature of the symmetric orthogonalization is that it ensures

$$\sum_{i=1}^{n}\big\|\mathbf{r}_i-\mathbf{m}_i\big\|^2=\min_{\{\mathbf{b}_i\}\in\mathcal{B}}\ \sum_{i=1}^{n}\big\|\mathbf{b}_i-\mathbf{m}_i\big\|^2,\qquad(31)$$

where $\mathbf{r}_i$ and $\mathbf{m}_i$ are the $i$-th column vectors of $\mathbf{R}_{\mathrm{LS}}$ and $\mathbf{M}$, respectively, and $\mathcal{B}$ denotes the set of all possible orthonormal sets in the range of $\mathbf{M}$. This means that the symmetric orthogonalized functions $\mathbf{r}_i$ are the least distant in the Hilbert space from the original functions $\mathbf{m}_i$. Therefore, symmetric orthogonalization performs the gentlest pushing of the directions of the vectors in order to make them orthogonal.

More interestingly, the symmetric orthogonalized basis set has unique geometric properties (Srivastava, 2000; Annavarapu, 2013) if we consider the Schweinler-Wigner matrix in terms of the sum of squared projections.

Appendix B Proof of Theorem 1

To be more specific, neurons whose elements are initialized by a zero-mean Gaussian distribution are, after projection onto the unit hypersphere, uniformly distributed on that hypersphere. We show this with the following theorem.

Theorem 2.

The normalized vector of Gaussian variables is uniformly distributed on the sphere. Formally, let $x_1,\dots,x_d\sim\mathcal{N}(0,1)$ be independent. Then the vector

$$\mathbf{z}=\frac{1}{Z}\big(x_1,\dots,x_d\big)^\top\qquad(32)$$

follows the uniform distribution on $\mathbb{S}^{d-1}$, where $Z=\sqrt{x_1^2+\cdots+x_d^2}$ is a normalization factor.

Proof.

A random variable has distribution $\mathcal{N}(0,1)$ if it has the density function

$$\phi(x)=\frac{1}{\sqrt{2\pi}}\,e^{-\frac{x^2}{2}}.\qquad(33)$$

A $d$-dimensional random vector $\mathbf{x}$ has distribution $\mathcal{N}(\mathbf{0},\mathbf{I}_d)$ if its components are independent and each has distribution $\mathcal{N}(0,1)$. Then the density of $\mathbf{x}$ is given by

$$\phi_d(\mathbf{x})=\frac{1}{(2\pi)^{d/2}}\,e^{-\frac{\|\mathbf{x}\|^2}{2}}.\qquad(34)$$

Then we introduce the following lemma (Lemma 2) about the orthogonal invariance of the normal distribution.

Lemma 2.

Let $\mathbf{x}$ be a $d$-dimensional random vector with distribution $\mathcal{N}(\mathbf{0},\mathbf{I}_d)$ and $\mathbf{U}$ be an orthogonal matrix ($\mathbf{U}\mathbf{U}^\top=\mathbf{U}^\top\mathbf{U}=\mathbf{I}$). Then $\mathbf{y}=\mathbf{U}\mathbf{x}$ also has distribution $\mathcal{N}(\mathbf{0},\mathbf{I}_d)$.

Proof.

For any measurable set $A\subseteq\mathbb{R}^d$, we have that

$$\mathbb{P}(\mathbf{U}\mathbf{x}\in A)=\int_{\mathbf{U}^{-1}A}\phi_d(\mathbf{x})\,\mathrm{d}\mathbf{x}=\int_{A}\phi_d\big(\mathbf{U}^{-1}\mathbf{y}\big)\,\big|\det(\mathbf{U}^{-1})\big|\,\mathrm{d}\mathbf{y}=\int_{A}\phi_d(\mathbf{y})\,\mathrm{d}\mathbf{y}=\mathbb{P}(\mathbf{x}\in A),\qquad(35)$$

where the last two equalities use $\|\mathbf{U}^{-1}\mathbf{y}\|=\|\mathbf{y}\|$ and $|\det(\mathbf{U}^{-1})|=1$.