Learning Neural Networks with Adaptive Regularization

07/14/2019 ∙ by Han Zhao, et al. ∙ Microsoft, Carnegie Mellon University

Feed-forward neural networks can be understood as a combination of an intermediate representation and a linear hypothesis. While most previous works aim to diversify the representations, we explore the complementary direction by performing an adaptive and data-dependent regularization motivated by the empirical Bayes method. Specifically, we propose to construct a matrix-variate normal prior (on weights) whose covariance matrix has a Kronecker product structure. This structure is designed to capture the correlations in neurons through backpropagation. Under the assumption of this Kronecker factorization, the prior encourages neurons to borrow statistical strength from one another. Hence, it leads to an adaptive and data-dependent regularization when training networks on small datasets. To optimize the model, we present an efficient block coordinate descent algorithm with analytical solutions. Empirically, we demonstrate that the proposed method helps networks converge to local optima with smaller stable ranks and spectral norms. These properties suggest better generalizations and we present empirical results to support this expectation. We also verify the effectiveness of the approach on multiclass classification and multitask regression problems with various network structures.




1 Introduction

Although deep neural networks have been widely applied in various domains [25, 27, 19], their parameters are usually learned via the principle of maximum likelihood, hence their success crucially hinges on the availability of large-scale datasets. When training rich models on small datasets, explicit regularization techniques are crucial to alleviate overfitting. Previous works have explored various regularization [39] and data augmentation [38, 19] techniques to learn diversified representations. In this paper, we look into an alternative direction by proposing an adaptive and data-dependent regularization method that encourages neurons of the same layer to share statistical strength. The goal of our method is to prevent overfitting when training (large) networks on small datasets. Our key insight stems from the famous argument by Efron [8] in the literature of the empirical Bayes method: “It is beneficial to learn from the experience of others.” From an algorithmic perspective, we argue that the connection weights of neurons in the same layer (row/column vectors of the weight matrix) will be correlated with each other through backpropagation learning. Hence, by learning the correlations of the weight matrix, a neuron can “borrow statistical strength” from other neurons in the same layer.

As an illustrating example, consider a simple setting where the input $\mathbf{x}\in\mathbb{R}^d$ is fully connected to a hidden layer $\mathbf{h}\in\mathbb{R}^p$, which is further fully connected to the single output $\hat{y}\in\mathbb{R}$. Let $\sigma(\cdot)$ be the nonlinear activation function, e.g., ReLU [33], $W\in\mathbb{R}^{p\times d}$ be the connection matrix between the input layer and the hidden layer, and $\mathbf{a}\in\mathbb{R}^p$ be the vector connecting the output and the hidden layer. Without loss of generality, ignoring the bias term in each layer, we have: $\hat{y} = \mathbf{a}^T\mathbf{h}$, $\mathbf{h} = \sigma(W\mathbf{x})$. Consider using the usual $\ell_2$ loss function $\ell(\hat{y}, y) = \frac{1}{2}|\hat{y} - y|^2$ and take the derivative of $\ell(\hat{y}, y)$ w.r.t. $W$. We obtain the update formula in backpropagation as $W \leftarrow W - \alpha(\hat{y} - y)(\mathbf{a}\circ\mathbf{h}')\mathbf{x}^T$, where $\mathbf{h}'$ is the componentwise derivative of $\mathbf{h}$ w.r.t. its input argument, $\circ$ is the componentwise product, and $\alpha > 0$ is the learning rate. Realize that $(\mathbf{a}\circ\mathbf{h}')\mathbf{x}^T$ is a rank 1 matrix, and each component of $\mathbf{h}'$ is either 0 or 1. Hence, the update for each row vector of $W$ is linearly proportional to $\mathbf{x}$. Note that the observation holds for any input pair $(\mathbf{x}, y)$, so the update formula implies that the row vectors of $W$ are correlated with each other. Although in this example we only discuss a one-hidden-layer network, it is straightforward to verify that the gradient update formula for general feed-forward networks admits the same rank one structure. The above observation leads us to the following question:

Can we define a prior distribution over $W$ that captures the correlations through the learning process for better generalization?
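The rank-one structure of the backpropagation update can be checked numerically. The following is a minimal NumPy sketch, assuming a ReLU activation, the squared loss $\frac{1}{2}(\hat{y}-y)^2$, and arbitrary illustrative sizes; the function name `grad_W` and the sizes `p`, `d` are our own choices, not taken from the paper.

```python
import numpy as np

def grad_W(W, a, x, y):
    """Gradient of 0.5 * (a^T relu(W x) - y)^2 with respect to W."""
    pre = W @ x                          # pre-activations, shape (p,)
    h = np.maximum(pre, 0.0)             # ReLU hidden layer
    h_prime = (pre > 0).astype(float)    # componentwise ReLU derivative: 0 or 1
    y_hat = a @ h                        # scalar prediction
    return (y_hat - y) * np.outer(a * h_prime, x)

rng = np.random.default_rng(0)
p, d = 6, 4                              # hidden units, input dimension (arbitrary)
W = rng.normal(size=(p, d))
a = rng.normal(size=p)
x = rng.normal(size=d)
y = 1.0

G = grad_W(W, a, x, y)
coeffs = G @ x / (x @ x)                 # per-row scaling factor along x
print("rank of gradient:", np.linalg.matrix_rank(G))
print("rows proportional to x:", np.allclose(G, np.outer(coeffs, x)))
```

Every row of `G` is a scalar multiple of `x`, which is the sense in which a single update moves all row vectors of the weight matrix along a common direction.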

Our Contributions

To answer the above question, we develop an adaptive regularization method for neural nets inspired by the empirical Bayes method. Motivated by the example above, we propose a matrix-variate normal prior whose covariance matrix admits a Kronecker product structure to capture the correlations between different neurons. Using tools from convex analysis, we present an efficient block coordinate descent algorithm with analytical solutions to optimize the model. Empirically, we show the proposed method helps the network converge to local optima with smaller stable ranks and spectral norms, and we verify the effectiveness of the approach on both multiclass classification and multitask regression problems with various network structures.

2 Preliminary

Notation and Setup

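We use lowercase letters, e.g., $y$, to represent scalars and lowercase bold letters, e.g., $\mathbf{x}$, to denote vectors. Capital letters, e.g., $X$, are reserved for matrices. Calligraphic letters, such as $\mathcal{D}$, are used to denote sets. We write $\mathrm{Tr}(A)$ for the trace of a matrix $A$, $\det(A)$ for the determinant of $A$ and $\mathrm{vec}(A)$ for $A$’s vectorization by column. $[n]$ is used to represent the set $\{1,\ldots,n\}$ for any integer $n$. Other notations will be introduced whenever needed. Suppose we have access to a training set $\mathcal{D}$ of $n$ pairs of data instances $(\mathbf{x}_i, y_i)$, $i\in[n]$. We consider the supervised learning setting where $\mathbf{x}_i\in\mathcal{X}\subseteq\mathbb{R}^d$ and $y_i\in\mathcal{Y}$. Let $p(y\mid\mathbf{x},\mathbf{w})$ be the conditional distribution of $y$ given $\mathbf{x}$ with parameter $\mathbf{w}$. The parametric form of the conditional distribution is assumed to be known. In this paper, we assume the model parameter $\mathbf{w}$ is sampled from a prior distribution $p(\mathbf{w}\mid\theta)$ with hyperparameter $\theta$. On the other hand, given $\mathcal{D}$, the posterior distribution of $\mathbf{w}$ is denoted by $p(\mathbf{w}\mid\mathcal{D},\theta)$.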

The Empirical Bayes Method

To compute the predictive distribution, we need access to the value of the hyperparameter $\theta$. However, complete information about the hyperparameter $\theta$ is usually not available in practice. To this end, the empirical Bayes method [36, 10, 1, 12, 9] proposes to estimate $\theta$ from the data directly using the marginal distribution:

$$\hat{\theta} = \arg\max_{\theta}\ p(\mathcal{D}\mid\theta) = \arg\max_{\theta}\int p(\mathcal{D}\mid\mathbf{w})\,p(\mathbf{w}\mid\theta)\,d\mathbf{w}. \tag{1}$$

Under specific choices of the likelihood function $p(\mathcal{D}\mid\mathbf{w})$ and the prior distribution $p(\mathbf{w}\mid\theta)$, e.g., conjugate pairs, we can solve the above integral in closed form. In certain cases we can even obtain an analytic solution for $\hat{\theta}$, which can then be plugged into the prior distribution. At a high level, by learning the hyperparameter $\theta$ in the prior distribution directly from data, the empirical Bayes method provides us with a principled and data-dependent way to obtain an estimator of $\mathbf{w}$. In fact, when both the prior and the likelihood functions are normal, it has been formally shown that the empirical Bayes estimators, e.g., the James-Stein estimator [23] and the Efron-Morris estimator [11], dominate the classic maximum likelihood estimator (MLE) in terms of quadratic loss for every choice of the model parameter $\mathbf{w}$. At a colloquial level, the success of the empirical Bayes method can be attributed to the effect of “borrowing statistical strength” [8], which also makes it a powerful tool in multitask learning [28, 43] and meta-learning [15].
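The dominance of the James-Stein estimator over the MLE can be illustrated with a small simulation. The sketch below assumes the textbook setting of a single observation $\mathbf{x}\sim\mathcal{N}(\boldsymbol{\theta}, I_d)$ with $d\ge 3$ and shrinkage toward the origin; the dimension, number of trials, and the function name `james_stein` are illustrative choices of ours.

```python
import numpy as np

def james_stein(x):
    """James-Stein estimate of the mean of x ~ N(theta, I), shrinking toward 0."""
    d = x.shape[-1]
    shrink = 1.0 - (d - 2) / np.sum(x ** 2, axis=-1, keepdims=True)
    return shrink * x

rng = np.random.default_rng(0)
d, trials = 20, 5000
theta = rng.normal(size=d)                   # fixed mean vector, unknown to the estimators
x = theta + rng.normal(size=(trials, d))     # one noisy observation per trial

# Monte Carlo estimate of the quadratic risk E||estimate - theta||^2.
risk_mle = np.mean(np.sum((x - theta) ** 2, axis=1))
risk_js = np.mean(np.sum((james_stein(x) - theta) ** 2, axis=1))
print(f"MLE risk: {risk_mle:.2f}, James-Stein risk: {risk_js:.2f}")
```

The shrinkage factor is estimated from the same observation across all coordinates, which is a simple instance of coordinates borrowing statistical strength from one another.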

3 Learning with Adaptive Regularization

In this section we first propose an adaptive regularization (AdaReg) method, which is inspired by the empirical Bayes method, for learning neural networks. We then combine our observation in Sec. 1 to develop an efficient adaptive learning algorithm with matrix-variate normal prior. Through our derivation, we provide several connections and interpretations with other learning paradigms.

3.1 The Proposed Adaptive Regularization

When the likelihood function $p(\mathcal{D}\mid\mathbf{w})$ is implemented as a neural network, the marginalization in (1) over the model parameter $\mathbf{w}$ cannot be computed exactly. Nevertheless, instead of performing expensive Monte-Carlo simulation, we propose to estimate both the model parameter $\mathbf{w}$ and the hyperparameter $\theta$ in the prior simultaneously from the joint distribution $p(\mathcal{D},\mathbf{w}\mid\theta) = p(\mathcal{D}\mid\mathbf{w})\,p(\mathbf{w}\mid\theta)$. Specifically, given an estimate $\hat{\mathbf{w}}$ of the model parameter, by maximizing the joint distribution w.r.t. $\theta$, we can obtain $\hat{\theta}$ as an approximation of the maximum marginal likelihood estimator. As a result, we can use $\hat{\theta}$ to further refine the estimate $\hat{\mathbf{w}}$ by maximizing the posterior distribution as follows:

$$\hat{\mathbf{w}} \leftarrow \arg\max_{\mathbf{w}}\ p(\mathbf{w}\mid\mathcal{D},\hat{\theta}) = \arg\max_{\mathbf{w}}\ p(\mathcal{D}\mid\mathbf{w})\,p(\mathbf{w}\mid\hat{\theta}). \tag{2}$$

The maximizer of (2) can in turn be used in an updated joint distribution. Formally, we can define the following optimization problem that characterizes our Adaptive Regularization (AdaReg) framework:

$$\max_{\mathbf{w}}\ \max_{\theta}\quad \log p(\mathcal{D}\mid\mathbf{w}) + \log p(\mathbf{w}\mid\theta). \tag{3}$$

It is worth connecting the optimization problem (3) to classic maximum a posteriori (MAP) inference and discussing their difference. If we drop the inner optimization over the hyperparameter $\theta$ in the prior distribution, then for any fixed value $\tilde{\theta}$, (3) reduces to MAP with the prior defined by the specific choice of $\tilde{\theta}$, and the maximizer $\hat{\mathbf{w}}$ corresponds to the mode of the posterior distribution given by $\tilde{\theta}$. From this perspective, the optimization problem in (3) actually defines a series of MAP inference problems, and the sequence of maximizers $\{\hat{\mathbf{w}}_j(\hat{\theta}_j)\}_j$ defines a solution path towards the final model parameter. On the algorithmic side, the optimization problem (3) also suggests a natural block coordinate descent algorithm where we alternately optimize over $\mathbf{w}$ and $\theta$ until the convergence of the objective function. An illustration of the framework is shown in Fig. 1.

Figure 1: Illustration of Bayes/empirical Bayes and our proposed adaptive regularization.

3.2 Neural Network with Matrix-Normal Prior

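Inspired by the observation from Sec. 1, we propose to define a matrix-variate normal distribution [16] over the connection weight matrix $W$: $W\sim\mathcal{MN}(0_{p\times d},\Sigma_r,\Sigma_c)$, where $\Sigma_r\in\mathbb{S}^p_{++}$ and $\Sigma_c\in\mathbb{S}^d_{++}$ are the row and column covariance matrices, respectively, and $\mathbb{S}^m_{++}$ denotes the cone of $m\times m$ positive definite matrices. The probability density function is given by

$$p(W\mid\Sigma_r,\Sigma_c) = \frac{\exp\left(-\mathrm{Tr}(\Sigma_r^{-1}W\Sigma_c^{-1}W^T)/2\right)}{(2\pi)^{pd/2}\det(\Sigma_r)^{d/2}\det(\Sigma_c)^{p/2}}.$$

Equivalently, one can understand the matrix-variate normal distribution over $W$ as a multivariate normal distribution with a Kronecker product covariance structure over $\mathrm{vec}(W)$: $\mathrm{vec}(W)\sim\mathcal{N}(0_{pd},\Sigma_c\otimes\Sigma_r)$. It is then easy to check that the marginal prior distributions over the row and column vectors of $W$ are given by:

$$W_{i:}\sim\mathcal{N}(0_d,[\Sigma_r]_{ii}\cdot\Sigma_c),\qquad W_{:j}\sim\mathcal{N}(0_p,[\Sigma_c]_{jj}\cdot\Sigma_r).$$

We point out that the Kronecker product structure of the covariance matrix exactly captures our prior about the connection matrix $W$: the fan-in/fan-out of neurons in the same layer (row/column vectors of $W$) are correlated with the same correlation matrix in the prior, and they only differ in their scales.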
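The equivalence between the matrix-variate form and the Kronecker-structured multivariate form can be verified numerically. The sketch below assumes column-stacking vectorization and the covariance ordering $\Sigma_c\otimes\Sigma_r$; the helper `random_spd` and all sizes are hypothetical and only serve the illustration.

```python
import numpy as np

def random_spd(n, rng):
    """Random symmetric positive definite n x n matrix (illustrative helper)."""
    A = rng.normal(size=(n, n))
    return A @ A.T + n * np.eye(n)

rng = np.random.default_rng(0)
p, d = 3, 4
Sigma_r, Sigma_c = random_spd(p, rng), random_spd(d, rng)
W = rng.normal(size=(p, d))

# Quadratic form of the matrix-variate density: Tr(Sigma_r^{-1} W Sigma_c^{-1} W^T).
quad_matrix = np.trace(np.linalg.solve(Sigma_r, W) @ np.linalg.solve(Sigma_c, W.T))

# Same quantity written with vec(W) (column stacking) and the Kronecker covariance.
vec_W = W.flatten(order="F")
K = np.kron(Sigma_c, Sigma_r)
quad_vec = vec_W @ np.linalg.solve(K, vec_W)

# Marginal covariance of row i of W, read off from the Kronecker covariance.
i = 1
idx = [i + p * j for j in range(d)]          # positions of W[i, :] inside vec(W)
row_cov = K[np.ix_(idx, idx)]

print(np.isclose(quad_matrix, quad_vec))
print(np.allclose(row_cov, Sigma_r[i, i] * Sigma_c))
```

Both checks print `True` when the two parametrizations agree, and the second one shows that every row shares the column covariance up to a row-specific scale.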

For illustration purposes, let us consider the simple feed-forward network discussed in Sec. 1. Consider a reparametrization of the model by defining $\Omega_r := \Sigma_r^{-1}$ and $\Omega_c := \Sigma_c^{-1}$ to be the corresponding precision matrices, and plug the prior distribution into our AdaReg framework (see (3)). After routine algebraic simplifications, we reach the following concrete optimization problem:

$$\min_{W}\ \min_{\Omega_r,\Omega_c}\quad \frac{1}{2n}\sum_{i\in[n]}\left(\hat{y}(\mathbf{x}_i;W,\mathbf{a}) - y_i\right)^2 + \lambda\|\Omega_r^{1/2}W\Omega_c^{1/2}\|_F^2 - \lambda\left(d\log\det(\Omega_r) + p\log\det(\Omega_c)\right)$$
$$\text{subject to}\quad uI_p\preceq\Omega_r\preceq vI_p,\quad uI_d\preceq\Omega_c\preceq vI_d \tag{4}$$

where $\lambda$ is a constant that only depends on $p$ and $d$, $0 < u \le v$ and $uv = 1$. Note that the constraint is necessary to guarantee that the feasible set is compact, so that the optimization problem is well formulated and a minimum is attainable. (The constraint $uv = 1$ is only for the ease of presentation in the following part and can be readily removed.) It is not hard to show that in general the optimization problem (4) is not jointly convex in terms of $\{W,\Omega_r,\Omega_c\}$, and this holds even if the activation function is linear. However, as we will show later, for any fixed $W$, the reparametrization makes the partial optimization over $\Omega_r$ and $\Omega_c$ bi-convex. More importantly, we can derive an efficient algorithm that finds the optimal $\Omega_r$ ($\Omega_c$) for any fixed $W,\Omega_c$ ($W,\Omega_r$) in $O(\max\{d^3,p^3\})$ time with closed form solutions. This allows us to apply our algorithm to networks of large sizes, where a typical hidden layer can contain thousands of nodes. Note that this is in contrast to solving a general semi-definite programming (SDP) problem using a black-box algorithm, e.g., the interior-point method [32], which is computationally intensive and hard to scale even to networks of moderate size. Before we delve into the details of solving (4), it is instructive to discuss some of its connections and differences to other learning paradigms.

Maximum-A-Posteriori Estimation

Essentially, for model parameter $W$, (4) defines a sequence of MAP problems where each MAP is indexed by the pair of precision matrices $(\Omega_r^{(t)},\Omega_c^{(t)})$ at iteration $t$. Equivalently, at each stage of the optimization, we can interpret (4) as placing a matrix-variate normal prior on $W$ where the precision matrix in the prior is given by $\Omega_c^{(t)}\otimes\Omega_r^{(t)}$. From this perspective, if we fix $\Omega_r^{(t)} = I_p$ and $\Omega_c^{(t)} = I_d$, $\forall t$, then (4) naturally reduces to learning with $\ell_2$ regularization [26]. More generally, for non-diagonal precision matrices, the regularization term for $W$ becomes:

$$\|\Omega_r^{1/2}W\Omega_c^{1/2}\|_F^2 = \|\mathrm{vec}(\Omega_r^{1/2}W\Omega_c^{1/2})\|_2^2 = \|(\Omega_c^{1/2}\otimes\Omega_r^{1/2})\,\mathrm{vec}(W)\|_2^2,$$

and this is exactly the Tikhonov regularization [13] imposed on $W$ where the Tikhonov matrix $\Gamma$ is given by $\Gamma := \Omega_c^{1/2}\otimes\Omega_r^{1/2}$. But instead of manually designing the regularization matrix $\Gamma$ to improve the conditioning of the estimation problem, we propose to also learn both precision matrices (so $\Gamma$ as well) from data. From an algorithmic perspective, $\Gamma^T\Gamma = \Omega_c\otimes\Omega_r$ serves as a preconditioning matrix w.r.t. the model parameter $W$ to reshape the gradient according to the geometry of the data [17, 18, 7].
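For implementation purposes, the regularizer can be evaluated as $\mathrm{Tr}(\Omega_rW\Omega_cW^T)$ without forming matrix square roots or Kronecker products, and its gradient w.r.t. $W$ is $2\,\Omega_rW\Omega_c$ for symmetric $\Omega_r,\Omega_c$. The sketch below checks both facts on random inputs; the function names and the helper `random_spd` are our own and are not part of the paper.

```python
import numpy as np

def adareg_penalty(W, Omega_r, Omega_c):
    """||Omega_r^{1/2} W Omega_c^{1/2}||_F^2, computed as Tr(Omega_r W Omega_c W^T)."""
    return np.trace(Omega_r @ W @ Omega_c @ W.T)

def adareg_penalty_grad(W, Omega_r, Omega_c):
    """Gradient of the penalty w.r.t. W, valid for symmetric precision matrices."""
    return 2.0 * Omega_r @ W @ Omega_c

def random_spd(n, rng):
    """Random symmetric positive definite n x n matrix (illustrative helper)."""
    A = rng.normal(size=(n, n))
    return A @ A.T + n * np.eye(n)

rng = np.random.default_rng(0)
p, d = 3, 4
W = rng.normal(size=(p, d))
Omega_r, Omega_c = random_spd(p, rng), random_spd(d, rng)

# Tikhonov form: vec(W)^T (Omega_c kron Omega_r) vec(W), with column-stacking vec.
vec_W = W.flatten(order="F")
tikhonov = vec_W @ np.kron(Omega_c, Omega_r) @ vec_W

penalty = adareg_penalty(W, Omega_r, Omega_c)
grad = adareg_penalty_grad(W, Omega_r, Omega_c)
print(np.isclose(penalty, tikhonov))
```

With identity precision matrices the penalty reduces to the squared Frobenius norm of `W`, i.e., ordinary weight decay.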

Volume Minimization

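Let us consider the function $\log\det(\cdot)$ over the positive definite cone. It is well known that the log-determinant function is concave [3]. Hence for any pair of matrices $A_1, A_2\in\mathbb{S}^m_{++}$, the following inequality holds:

$$\log\det(A_1) \le \log\det(A_2) + \langle\nabla\log\det(A_2), A_1 - A_2\rangle = \log\det(A_2) + \mathrm{Tr}(A_2^{-1}A_1) - m. \tag{5}$$

Applying the above inequality twice by fixing $A_1 = W\Omega_cW^T/2d$, $A_2 = \Sigma_r$ and $A_1 = W^T\Omega_rW/2p$, $A_2 = \Sigma_c$ respectively leads to the following inequalities:

$$d\log\det(W\Omega_cW^T/2d) \le -d\log\det(\Omega_r) + \tfrac{1}{2}\mathrm{Tr}(\Omega_rW\Omega_cW^T) - dp,$$
$$p\log\det(W^T\Omega_rW/2p) \le -p\log\det(\Omega_c) + \tfrac{1}{2}\mathrm{Tr}(\Omega_rW\Omega_cW^T) - dp.$$

Realize that $\mathrm{Tr}(\Omega_rW\Omega_cW^T) = \|\Omega_r^{1/2}W\Omega_c^{1/2}\|_F^2$. Summing the above two inequalities leads to:

$$d\log\det(W\Omega_cW^T) + p\log\det(W^T\Omega_rW) \le \|\Omega_r^{1/2}W\Omega_c^{1/2}\|_F^2 - \left(d\log\det(\Omega_r) + p\log\det(\Omega_c)\right) + c, \tag{6}$$

where $c$ is a constant that only depends on $d$ and $p$. Recall that $|\det(A^TA)|$ computes the squared volume of the parallelepiped spanned by the column vectors of $A$. Hence (6) gives us a natural interpretation of the objective function in (4): the regularizer essentially upper bounds the log-volume of the two parallelepipeds spanned by the row and column vectors of $W$. But instead of measuring the volume using the standard Euclidean inner product, it also takes into account the local curvatures defined by $\Omega_c$ and $\Omega_r$, respectively. For vectors with fixed lengths, the volume of the parallelepiped spanned by them becomes smaller when they are more linearly correlated, either positively or negatively. At a colloquial level, this means that the regularizer in (4) forces the fan-in/fan-out of neurons at the same layer to be either positively or negatively correlated with each other, and this corresponds exactly to the effect of sharing statistical strength.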
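Two facts used in this argument are easy to check numerically: the first-order upper bound $\log\det(A_1)\le\log\det(A_2)+\mathrm{Tr}(A_2^{-1}A_1)-m$ implied by concavity, and the shrinking of the spanned volume as vectors of fixed length become more correlated. The sketch below uses random positive definite matrices and two unit vectors in the plane; all names are illustrative assumptions of ours.

```python
import numpy as np

def logdet(A):
    """Log-determinant of a positive definite matrix."""
    return np.linalg.slogdet(A)[1]

def random_spd(m, rng):
    """Random symmetric positive definite m x m matrix (illustrative helper)."""
    B = rng.normal(size=(m, m))
    return B @ B.T + np.eye(m)

rng = np.random.default_rng(0)
m = 4
A1, A2 = random_spd(m, rng), random_spd(m, rng)

# First-order upper bound of the concave log-determinant at A2.
lhs = logdet(A1)
rhs = logdet(A2) + np.trace(np.linalg.solve(A2, A1)) - m
print("bound holds:", lhs <= rhs)

def squared_volume(rho):
    """det(A^T A) for two unit-norm columns with inner product rho."""
    A = np.array([[1.0, rho], [0.0, np.sqrt(1.0 - rho ** 2)]])
    return np.linalg.det(A.T @ A)

volumes = [squared_volume(rho) for rho in (0.0, 0.5, 0.9)]
print("squared volumes:", np.round(volumes, 4))
```

For two unit vectors the squared volume equals $1-\rho^2$, so it decreases as the correlation $|\rho|$ grows.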

3.3 The Algorithm

In this section we describe a block coordinate descent algorithm to optimize the objective function in (4) and detail how to efficiently solve the matrix optimization subproblems in closed form using tools from convex analysis. Due to space limits, we defer proofs and detailed derivations to the appendix. Given a pair of constants $0 < u \le v$, we define the following thresholding function $T_{[u,v]}(x)$:

$$T_{[u,v]}(x) := \max\{u, \min\{v, x\}\}. \tag{7}$$

We summarize our block coordinate descent algorithm to solve (4) in Alg. 1. In each iteration, Alg. 1 takes a first-order algorithm $\mathcal{A}$, e.g., stochastic gradient descent, to optimize the parameters of the neural network by backpropagation. It then proceeds to compute the optimal solutions for $\Omega_r$ and $\Omega_c$ using InvThreshold as a sub-procedure. Alg. 1 terminates when a stationary point is found.

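We now proceed to show that the procedure InvThreshold finds the optimal solution given all the other variables fixed. Due to the symmetry between $\Omega_r$ and $\Omega_c$ in (4), we will only prove this for $\Omega_r$, and similar arguments can be applied to $\Omega_c$ as well. Fixing both $W$ and $\Omega_c$, and ignoring all the terms that do not depend on $\Omega_r$, the sub-problem on optimizing $\Omega_r$ becomes:

$$\min_{\Omega_r}\quad \mathrm{Tr}(\Omega_rW\Omega_cW^T) - d\log\det(\Omega_r),\qquad\text{subject to}\quad uI_p\preceq\Omega_r\preceq vI_p. \tag{8}$$

It is not hard to show that the optimization problem (8) is convex. Define the constraint set $\mathcal{C} := \{A\in\mathbb{S}^p_{++}\mid uI_p\preceq A\preceq vI_p\}$ and the indicator function $I_{\mathcal{C}}(A) = 0$ iff $A\in\mathcal{C}$, else $\infty$. Given the convexity of (8), we can use the indicator function to first transform (8) into an unconstrained problem and then use the first-order optimality condition to characterize the optimal solution: $0\in W\Omega_cW^T/d - \Omega_r^{-1} + N_{\mathcal{C}}(\Omega_r)$, where $N_{\mathcal{C}}(A) := \{B\in\mathbb{S}^p\mid\mathrm{Tr}(B^T(Z - A))\le 0,\ \forall Z\in\mathcal{C}\}$ is the normal cone w.r.t. $\mathcal{C}$ at $A$. With the help of Lemma 1 in the appendix, equivalently, we have $W\Omega_cW^T/d - \Omega_r^{-1}\in N_{\mathcal{C}}(\Omega_r^{-1})$. Geometrically, this means that the optimum $\Omega_r^{-1}$ is the Euclidean projection of $W\Omega_cW^T/d$ onto $\mathcal{C}$. Hence in order to solve (8), it suffices to solve the following Euclidean projection problem efficiently, where $\tilde{\Sigma}_r\in\mathbb{S}^p$ is a real symmetric matrix:

$$\min_{\Sigma_r}\quad \|\Sigma_r - \tilde{\Sigma}_r\|_F^2,\qquad\text{subject to}\quad uI_p\preceq\Sigma_r\preceq vI_p. \tag{9}$$

The following theorem characterizes the optimal solution to the above Euclidean projection problem: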

Theorem 1.

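Let $\tilde{\Sigma}_r\in\mathbb{S}^p$ with eigendecomposition $\tilde{\Sigma}_r = Q\Lambda Q^T$ and let $\mathrm{Proj}_{\mathcal{C}}(\cdot)$ be the Euclidean projection operator onto $\mathcal{C}$, then $\mathrm{Proj}_{\mathcal{C}}(\tilde{\Sigma}_r) = Q\,T_{[u,v]}(\Lambda)\,Q^T$.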

Corollary 1.

Let $W\Omega_cW^T$ be eigendecomposed as $Q\,\mathrm{diag}(\mathbf{r})\,Q^T$, then the optimal solution to (8) is given by $Q\,T_{[u,v]}(d/\mathbf{r})\,Q^T$.

Similar arguments can be made to derive the solution for $\Omega_c$ in (4). The final algorithm is very simple as it only contains one SVD, hence its time complexity is $O(\max\{d^3,p^3\})$. Note that the total number of parameters in the network is at least $\Omega(dp)$, hence the algorithm is efficient as it scales sub-quadratically in terms of the number of parameters in the network.
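The projection in Theorem 1 and the closed-form solution in Corollary 1 can be sketched in a few lines of NumPy. This is a minimal illustration under our own assumptions: `np.linalg.eigh` replaces the SVD (the input is symmetric), a small floor guards against zero eigenvalues, and the bounds satisfy $uv = 1$. The function names and test sizes are hypothetical.

```python
import numpy as np

def project_onto_C(S, u, v):
    """Euclidean projection of a symmetric matrix S onto {A : u I <= A <= v I}."""
    lam, Q = np.linalg.eigh(S)
    return (Q * np.clip(lam, u, v)) @ Q.T

def inv_threshold(Delta, m, u, v):
    """Closed-form minimizer of Tr(Omega Delta) - m log det(Omega) over the same set."""
    r, Q = np.linalg.eigh(Delta)
    r = np.maximum(r, 1e-12)                 # guard against zero eigenvalues
    return (Q * np.clip(m / r, u, v)) @ Q.T

def subproblem_objective(Omega, Delta, m):
    """Objective of the sub-problem for a fixed symmetric PSD matrix Delta."""
    return np.trace(Omega @ Delta) - m * np.linalg.slogdet(Omega)[1]

rng = np.random.default_rng(0)
p, d, u, v = 3, 5, 0.5, 2.0                  # bounds chosen so that u * v = 1
W = rng.normal(size=(p, d))
Delta = W @ W.T                              # W Omega_c W^T with Omega_c = I_d

Omega_r = inv_threshold(Delta, d, u, v)
via_projection = np.linalg.inv(project_onto_C(Delta / d, u, v))
print("matches inverse of projection:", np.allclose(Omega_r, via_projection))
print("eigenvalues:", np.round(np.linalg.eigvalsh(Omega_r), 4))
```

The agreement between the two routes relies on $uv = 1$; `inv_threshold` itself clips $m/r$ directly and therefore does not need that assumption.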

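Input: Initial value $\phi^{(0)} := \{\mathbf{a}^{(0)}, W^{(0)}\}$, $\Omega_r^{(0)}\in\mathbb{S}^p_{++}$ and $\Omega_c^{(0)}\in\mathbb{S}^d_{++}$, first-order optimization algorithm $\mathcal{A}$.
for $t = 1, 2, \ldots$ until convergence do
     Fix $\Omega_r^{(t-1)}$, $\Omega_c^{(t-1)}$, optimize $\phi^{(t)}$ by backpropagation and algorithm $\mathcal{A}$
     $\Omega_r^{(t)}\leftarrow$ InvThreshold($W^{(t)}\Omega_c^{(t-1)}W^{(t)T}$, $d$, $u$, $v$)
     $\Omega_c^{(t)}\leftarrow$ InvThreshold($W^{(t)T}\Omega_r^{(t)}W^{(t)}$, $p$, $u$, $v$)
end for
procedure InvThreshold($\Delta$, $m$, $u$, $v$)
     Compute SVD: $Q\,\mathrm{diag}(\mathbf{r})\,Q^T = \mathrm{SVD}(\Delta)$
     Hard thresholding: $\mathbf{r}'\leftarrow T_{[u,v]}(m/\mathbf{r})$
     return $Q\,\mathrm{diag}(\mathbf{r}')\,Q^T$
end procedure
Algorithm 1 Block Coordinate Descent for Adaptive Regularization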
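The alternating scheme of Alg. 1 can be sketched on a toy problem. The code below is not the authors' implementation: it assumes a linear multi-output model $\hat{\mathbf{y}} = W\mathbf{x}$ in place of a neural network, uses full-batch gradient descent as the first-order algorithm, and picks `lam`, `u`, `v`, `lr` and the loop counts arbitrarily for illustration.

```python
import numpy as np

def inv_threshold(Delta, m, u, v):
    """Eigendecompose Delta and map each eigenvalue r to clip(m / r, u, v)."""
    r, Q = np.linalg.eigh(Delta)
    r = np.maximum(r, 1e-12)                 # guard against zero eigenvalues
    return (Q * np.clip(m / r, u, v)) @ Q.T

def objective(W, Om_r, Om_c, X, Y, lam):
    """Squared loss plus the penalty and log-determinant terms of the objective."""
    p, d = W.shape
    fit = 0.5 * np.mean(np.sum((X @ W.T - Y) ** 2, axis=1))
    pen = np.trace(Om_r @ W @ Om_c @ W.T)
    logdets = d * np.linalg.slogdet(Om_r)[1] + p * np.linalg.slogdet(Om_c)[1]
    return fit + lam * (pen - logdets)

def adareg_bcd(X, Y, lam=1e-2, u=0.1, v=10.0, outer=5, inner=200, lr=0.05):
    """Alternate gradient steps on W with closed-form precision updates."""
    n, d = X.shape
    p = Y.shape[1]
    W, Om_r, Om_c = np.zeros((p, d)), np.eye(p), np.eye(d)
    history = [objective(W, Om_r, Om_c, X, Y, lam)]
    for _ in range(outer):
        for _ in range(inner):               # gradient descent stands in for algorithm A
            grad = (X @ W.T - Y).T @ X / n + 2.0 * lam * Om_r @ W @ Om_c
            W = W - lr * grad
        Om_r = inv_threshold(W @ Om_c @ W.T, d, u, v)
        Om_c = inv_threshold(W.T @ Om_r @ W, p, u, v)
        history.append(objective(W, Om_r, Om_c, X, Y, lam))
    return W, Om_r, Om_c, history

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y = X @ rng.normal(size=(3, 5)).T + 0.1 * rng.normal(size=(200, 3))
W, Om_r, Om_c, history = adareg_bcd(X, Y)
print(np.round(history, 4))
```

Because each block update does not increase the objective (for a sufficiently small step size in the inner loop), the recorded `history` is non-increasing across outer iterations.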

4 Experiments

In this section we demonstrate the effectiveness of AdaReg in learning practical deep neural networks on real-world datasets. We report generalization, optimization as well as stability results.

4.1 Experimental Setup

Multiclass Classification (MNIST & CIFAR10)

In this experiment, we show that AdaReg provides an effective regularization on the network parameters. To this end, we use a convolutional neural network as our baseline model. To show the effect of regularization, we gradually increase the training set size. In MNIST we use the step from 60 to 60,000 (11 different experiments) and in CIFAR10 we consider the step from 5,000 to 50,000 (10 different experiments). For each training set size, we repeat the experiments 10 times and report the mean along with its standard deviation. Moreover, since both the optimization and generalization of neural networks are sensitive to the size of minibatches [24, 14], we study two minibatch settings, 256 and 2048. In our method, we place a matrix-variate normal prior over the weight matrix of the last softmax layer, and we use Alg. 1 to optimize both the model weights and the two covariance matrices.

Multitask Regression (SARCOS)

SARCOS relates to an inverse dynamics problem for a seven degree-of-freedom (DOF) SARCOS anthropomorphic robot arm [41]. The goal of this task is to map from a 21-dimensional input space (7 joint positions, 7 joint velocities, 7 joint accelerations) to the corresponding 7 joint torques. Hence there are 7 tasks and the inputs are shared among all the tasks. The training set and test set contain 44,484 and 4,449 examples, respectively. Again, we apply AdaReg on the last layer weight matrix, where each row corresponds to a separate task vector.

We compare AdaReg with classic regularization methods in the literature, including weight decay, dropout [39], batch normalization (BN) [22] and the DeCov method [6]. We also note that we fix all the hyperparameters, such as the learning rate, to be the same for all the methods. We report evaluation metrics on the test set as a measure of generalization. To understand how the proposed adaptive regularization helps in optimization, we visualize the trajectory of the loss function during training. Lastly, we also present the inferred correlation of the weight matrix for qualitative study.

4.2 Results and Analysis

Multiclass Classification (MNIST & CIFAR10)

Results on multiclass classification for different training sizes are shown in Fig. 2. For both MNIST and CIFAR10, we find that AdaReg, Weight Decay, and Dropout are effective regularization methods, while the results of Batch Normalization and DeCov vary across settings. Batch Normalization suffers from large batch size in CIFAR10 (comparing Fig. 2 (c) and (d)) but is not sensitive to batch size in MNIST (comparing Fig. 2 (a) and (b)). The performance deterioration of Batch Normalization with large batch size is also observed by [21]. DeCov, on the other hand, improves the generalization in MNIST with batch size 256 (see Fig. 2 (a)), while it demonstrates only comparable or even worse performance in other settings. To conclude, as the training set size grows, AdaReg consistently generalizes better than the other regularization methods. We also note that AdaReg is not sensitive to the size of minibatches while most of the methods suffer from large minibatches. In the appendix, we show that the combination of AdaReg with other generalization methods can usually lead to even better results.

Method        1st     2nd     3rd     4th     5th     6th     7th
MTL           0.4418  0.3472  0.5222  0.5036  0.6024  0.4727  0.5298
MTL-Dropout   0.4413  0.3271  0.5202  0.5063  0.6036  0.4711  0.5345
MTL-BN        0.4768  0.3770  0.5396  0.5216  0.6117  0.4936  0.5479
MTL-DeCov     0.4027  0.3137  0.4703  0.4515  0.5229  0.4224  0.4716
MTL-AdaReg    0.4769  0.3969  0.5485  0.5308  0.6202  0.5085  0.5561
Table 1: Explained variance of different methods on 7 regression tasks from the SARCOS dataset.

Multitask Regression (SARCOS)

In this experiment we are interested in investigating whether AdaReg can lead to better generalization for multiple related regression problems. To do so, we report in Table 1 the explained variance of each method as a normalized metric, i.e., one minus the ratio between the mean squared error and the variance of the target. The larger the explained variance, the better the predictive performance. In this case we observe a consistent improvement of AdaReg over the other competitors on all 7 regression tasks. We would like to emphasize that all the experiments share exactly the same experimental protocol, including network structure, optimization algorithm, training iterations, etc., so that the performance differences can only be explained by the different regularization methods. For better visualization, we also plot the results in the appendix.
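The explained variance metric can be computed per task as follows. This is a small sketch under the assumption that targets and predictions are stored as arrays with one column per task; the function name `explained_variance` and the toy data are ours.

```python
import numpy as np

def explained_variance(y_true, y_pred):
    """Per-task explained variance: 1 - MSE / Var(y_true), computed column-wise."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2, axis=0)
    return 1.0 - mse / np.var(y_true, axis=0)

rng = np.random.default_rng(0)
y = rng.normal(size=(100, 7))                       # 7 regression tasks (toy targets)
perfect = explained_variance(y, y)                  # exact predictions
baseline = explained_variance(y, np.tile(y.mean(axis=0), (100, 1)))  # predict the mean
print(perfect, baseline)
```

Exact predictions give a value of one for every task, while always predicting the per-task mean gives zero, so the metric measures the improvement over that constant baseline.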


It has recently been empirically shown that BN helps optimization not by reducing internal covariate shift, but instead by smoothing the landscape of the loss function [37]. To understand how AdaReg improves generalization, in Fig. 3, we plot the values of the cross entropy loss function on both the training and test sets during optimization using Alg. 1. The experiment is performed on MNIST with batch size 256/2048. In this experiment, we fix the number of outer loops to be 2/5 and each block optimization over network weights contains 50 epochs. Because of the stochastic optimization over model weights, we can see several unstable peaks in function value around iteration 50 when trained with AdaReg, which correspond to the transition phase between two consecutive outer loops with different row/column covariance matrices. In all the cases AdaReg converges to better local optima of the loss landscape, which also generalize better: they have smaller loss values on the test set when compared with training without AdaReg.

Figure 2: Generalization performance on MNIST and CIFAR10. AdaReg improves generalization under both minibatch settings.
(a) T/B: 600/256
(b) T/B: 6000/256
(c) T/B: 600/2048
(d) T/B: 6000/2048
Figure 3: Optimization trajectory of AdaReg on MNIST with training size/batch size (T/B) on training and test sets. AdaReg helps to converge to better local optima. Note the log-scale on the y-axis.
(a) MNIST: S. rank
(b) MNIST: S. norm
(c) CIFAR10: S. rank
(d) CIFAR10: S. norm
Figure 4: Comparisons of stable ranks (S. rank) and spectral norms (S. norm) from different methods on MNIST and CIFAR10. The x-axis corresponds to the training size.

Stable rank and spectral norm

Given a matrix $W$, the stable rank of $W$, denoted as $\mathrm{srank}(W)$, is defined as $\mathrm{srank}(W) := \|W\|_F^2/\|W\|_2^2$. As its name suggests, the stable rank is more stable than the rank because it is largely unaffected by tiny singular values. It has recently been shown [34, Theorem 1] that the generalization error of neural networks crucially depends on both the stable ranks and the spectral norms of connection matrices in the network. Specifically, it can be shown that the generalization error is upper bounded by $O\left(\sqrt{\prod_{j=1}^{L}\|W_j\|_2^2\,\sum_{j=1}^{L}\mathrm{srank}(W_j)/n}\right)$, where $L$ is the number of layers in the network. Essentially, this upper bound suggests that a smaller spectral norm (smoother function mapping) and a smaller stable rank (skewed spectrum) lead to better generalization.
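Both quantities are cheap to compute for a single weight matrix. The sketch below assumes the standard definition of the stable rank as the ratio of the squared Frobenius norm to the squared spectral norm; the function names are ours.

```python
import numpy as np

def spectral_norm(W):
    """Largest singular value of W."""
    return np.linalg.norm(W, 2)

def stable_rank(W):
    """Squared Frobenius norm divided by squared spectral norm."""
    return np.linalg.norm(W, "fro") ** 2 / spectral_norm(W) ** 2

identity = np.eye(4)                                  # flat spectrum
rank_one = np.outer([1.0, 2.0, 3.0], [1.0, -1.0])     # a single nonzero singular value
nearly_rank_one = np.diag([1.0, 1e-6])                # rank 2, but one tiny singular value

for name, M in [("identity", identity), ("rank one", rank_one),
                ("nearly rank one", nearly_rank_one)]:
    print(name, np.linalg.matrix_rank(M), stable_rank(M), spectral_norm(M))
```

The last matrix has rank 2 but a stable rank close to 1, which illustrates why the stable rank is largely unaffected by tiny singular values.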

To understand why AdaReg improves generalization, in Fig. 4, we plot both the stable rank and the spectral norm of the weight matrix in the last layer of the CNNs used in our MNIST and CIFAR10 experiments. We compare 3 methods: CNN without any regularization, CNN trained with weight decay and CNN with AdaReg. For each setting we repeat the experiments 5 times, and we plot the mean along with its standard deviation. From Fig. 4(a) and Fig. 4(c) it is clear that AdaReg leads to a significant reduction in terms of the stable rank when compared with weight decay, and this effect is consistent in all the experiments with different training sizes. Similarly, in Fig. 4(b) and Fig. 4(d) we plot the spectral norm of the weight matrix. Again, both weight decay and AdaReg help reduce the spectral norm in all settings, but AdaReg plays a more significant role than the usual weight decay. Combining the experiments with the generalization upper bound introduced above, we can see that training with AdaReg leads to an estimator of $W$ that has lower stable rank and smaller spectral norm, which explains why it achieves a better generalization performance. Furthermore, this observation holds on the SARCOS dataset as well, and we show the results in the appendix.

(a) CNN, Acc: 89.34
(b) AdaReg, Acc: 92.50
(c) CNN, Acc: 98.99
(d) AdaReg, Acc: 99.19
Figure 5: Correlation matrix of the weight matrix in the softmax layer. The left two correspond to dataset with training size 600 and the right two with size 60,000. Acc means the test set accuracy.

Correlation Matrix

To verify that AdaReg imposes the effect of “sharing statistical strength” during training, we visualize the weight matrix of the softmax layer by computing the corresponding correlation matrix, as shown in Fig. 5. In Fig. 5, darker color means stronger correlation. We conduct two experiments with training size 600 and 60,000 respectively. As we can observe, training with AdaReg leads to weight matrix with stronger correlations, and this effect is more evident when the training set is large. This is consistent with our analysis of sharing statistical strengths. As a sanity check, from Fig. 5 we can also see that similar digits, e.g., 1 and 7, share a positive correlation while dissimilar ones, e.g., 1 and 8, share a negative correlation.

5 Related Work

Despite the name, the empirical Bayes method is in fact a frequentist approach to obtain estimators with favorable properties. On the other hand, truly Bayesian inference would instead put a posterior distribution over model weights to characterize the uncertainty during training [30, 20, 2]. However, due to the complexity of nonlinear neural networks, an analytic posterior is not available, hence strong independence assumptions over model weights have to be made in order to achieve computationally tractable variational solutions. Typically, both the prior and the variational posterior are assumed to fully factorize over model weights. As an exception, Sun et al. [40] and Louizos and Welling [29] seek to learn Bayesian neural nets where they approximate the intractable posterior distribution using a matrix-variate Gaussian distribution. The prior for weights is still assumed to be known and fixed. As a comparison, we use a matrix-variate Gaussian as the prior distribution and we learn the hyperparameter in the prior from data. Hence our method does not belong to Bayesian neural nets: we instead use the empirical Bayes principle to derive an adaptive regularization method in order to have better generalization, as done in [4, 35].

Different kinds of regularization approaches have been studied and designed for neural networks, e.g., weight decay [26], early stopping [5], Dropout [39] and the more recent DeCov [6] method. BN was proposed to reduce the internal covariate shift during training, but recently it has been empirically shown to actually smooth the landscape of the loss function [37]. As a comparison, we propose AdaReg as an adaptive regularization method, with the aim to reduce overfitting by allowing neurons to share statistical strength. From the optimization perspective, learning the row and column covariance matrices helps to converge to a better local optimum that also generalizes better.

The Kronecker factorization assumption has also been applied in the literature of neural networks to approximate the Fisher information matrix in second-order optimization methods [31, 42]. The main idea here is to approximate the curvature of the loss function’s landscape, in order to achieve better convergence speed compared with first-order method while maintaining the tractability of such computation.

6 Conclusion

Inspired by empirical Bayes method, we propose an adaptive regularization (AdaReg) with matrix-variate normal prior for model parameters in deep neural networks. The prior encourages neurons to borrow statistical strength from other neurons during the learning process, and it provides an effective regularization when training networks on small datasets. To optimize the model, we design an efficient block coordinate descent algorithm to learn both model weights and the covariance structures. Empirically, on three datasets we demonstrate that AdaReg improves generalization by finding better local optima with smaller spectral norms and stable ranks.



In this appendix we present all the missing proofs in the main paper. We also provide detailed descriptions of our experiments.

Appendix A Detailed Derivation and Proofs of Our Algorithm

We first show that the optimization problem (8) is convex:

Proposition 1.

The optimization problem (8) is convex.


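Proof. It is clear that the objective function is convex: the trace term is linear in $\Omega_r$ and it is well known that $\log\det(\cdot)$ is concave over the positive definite cone [3], hence it trivially follows that $\mathrm{Tr}(\Omega_rW\Omega_cW^T) - d\log\det(\Omega_r)$ is convex in $\Omega_r$.

It remains to show that the constraint set is also convex. Let $A_1, A_2$ be any feasible points, i.e., $uI_p\preceq A_1\preceq vI_p$ and $uI_p\preceq A_2\preceq vI_p$. Let $t\in[0,1]$, we have:

$$\|tA_1 + (1-t)A_2\|_2 \le t\|A_1\|_2 + (1-t)\|A_2\|_2 \le tv + (1-t)v = v,$$

where we use $\|\cdot\|_2$ to denote the spectral norm of a matrix. Now since both $A_1$ and $A_2$ are positive definite, the spectral norm is also the largest eigenvalue, hence this shows that $tA_1 + (1-t)A_2\preceq vI_p$.

To show the other direction, we use the Courant-Fischer characterization of eigenvalues. Let $\lambda_{\min}(A)$ denote the minimum eigenvalue of a real symmetric matrix $A$, then by the Courant-Fischer min-max theorem, we have:

$$\lambda_{\min}(A) = \min_{\|\mathbf{x}\|_2 = 1}\ \mathbf{x}^TA\mathbf{x}.$$

For the matrix $tA_1 + (1-t)A_2$, let $\mathbf{x}^*$ be the unit vector corresponding to the minimum eigenvalue, hence we have:

$$\lambda_{\min}(tA_1 + (1-t)A_2) = t\,\mathbf{x}^{*T}A_1\mathbf{x}^* + (1-t)\,\mathbf{x}^{*T}A_2\mathbf{x}^* \ge t\lambda_{\min}(A_1) + (1-t)\lambda_{\min}(A_2) \ge tu + (1-t)u = u,$$

which also means that $tA_1 + (1-t)A_2\succeq uI_p$, and this completes the proof. ∎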

The following key lemma characterizes the structure of the normal cone:

Lemma 1.

Let $A\in\mathcal{C}$, then $N_{\mathcal{C}}(A) = -N_{\mathcal{C}}(A^{-1})$.


Let . We want to show . By definition of the normal cone, since , we have:

Now realize that and is a compact set, it follows

is the solution of the following linear program:

Since both and are real symmetric matrix, we can decompose them as and , where both are orthogonal matrices and are diagonal matrices with the corresponding eigenvalues in decreasing order. Plug them into the objective function, we have:

Define and , where we use to denote the Hadamard product between two matrices. Since both and are orthogonal matrices, we know that is also orthogonal, which implies:

As a result,

is a doubly stochastic matrix and we can further simplify the objective function as:

where and are dimensional vectors that contain the eigenvalues of and in decreasing order, respectively. Now for any and in decreasing order, we have:


From (10), in order for to maximize the linear program, it must hold that and all the eigenvalues of are . But due to the assumption that , in this case we also know that all the eigenvalues of are , hence also minimizes the above linear program, which implies:

In other words, we have . By exactly the same argument, the other direction also holds, hence we have . ∎
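The two linear-algebra facts used in the proof, namely that the Hadamard square of an orthogonal matrix is doubly stochastic and that the trace of a product of two symmetric matrices becomes a bilinear form in their eigenvalues, can be illustrated numerically. This is a sketch under assumptions: the matrices `A` and `B` are arbitrary random symmetric matrices (not the ones in the proof), all names are hypothetical, and `numpy.linalg.eigh` returns eigenvalues in increasing rather than decreasing order, which does not affect the identities as long as both spectra use the same order.

```python
import numpy as np


def random_symmetric(d, rng):
    """Draw a random real symmetric d x d matrix."""
    a = rng.standard_normal((d, d))
    return (a + a.T) / 2


rng = np.random.default_rng(1)
d = 5
A = random_symmetric(d, rng)
B = random_symmetric(d, rng)

lam_a, q_a = np.linalg.eigh(A)  # A = q_a diag(lam_a) q_a^T
lam_b, q_b = np.linalg.eigh(B)  # B = q_b diag(lam_b) q_b^T

K = q_a.T @ q_b   # product of orthogonal matrices, hence orthogonal
P = K * K         # Hadamard (entrywise) square of K

row_sums = P.sum(axis=1)  # each row of P sums to 1
col_sums = P.sum(axis=0)  # each column of P sums to 1

trace_direct = np.trace(A @ B)
trace_bilinear = lam_a @ P @ lam_b  # tr(AB) = lam_a^T P lam_b

# Rearrangement-type bounds: both spectra sorted the same way give the
# maximum, opposite orders give the minimum.
upper = lam_a @ lam_b
lower = lam_a @ lam_b[::-1]
print(lower <= trace_direct <= upper)
```

Because `P` is doubly stochastic, the bilinear form is a convex combination over permutations of one spectrum, which is why the extreme values are attained at the sorted and reverse-sorted pairings.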

By the previous first-order optimality condition, Lemma 1 implies . Geometrically, this means that the optimum is the Euclidean projection of onto . Hence we proceed to derive the projection operator: See 1


Since is real and symmetric, we can reparametrize as where is an orthogonal matrix and is a diagonal matrix whose entries correspond to the eigenvalues of . Recall that corresponds to a rigid transformation that preserves length, so we have:


Define . Now by the fact that can be eigendecomposed as , we can further simplify (11) as:

where the last inequality holds because . In order to achieve the first equality, should be a diagonal matrix, which means . In this case, . To achieve the second equality, simply let , which completes the proof. ∎
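The projection derived above amounts to keeping the eigenvectors and thresholding the eigenvalues. The following is a minimal sketch of that operation, assuming the constraint set is the set of symmetric matrices with eigenvalues in an interval [u, v] and that the distance is measured in the Frobenius norm; the function name `project_onto_spectral_box`, the test matrix `M`, and the bounds are hypothetical.

```python
import numpy as np


def project_onto_spectral_box(m, u, v):
    """Frobenius-norm projection of a symmetric matrix onto {S : u*I <= S <= v*I}.

    Keeps the eigenvectors of m and clips its eigenvalues to [u, v].
    """
    m = (m + m.T) / 2                      # guard against round-off asymmetry
    eig, q = np.linalg.eigh(m)             # m = q diag(eig) q^T
    return (q * np.clip(eig, u, v)) @ q.T  # q diag(clip(eig)) q^T


rng = np.random.default_rng(2)
d, u, v = 5, 0.5, 2.0                      # assumed dimension and bounds
a = rng.standard_normal((d, d))
M = (a + a.T) / 2                          # arbitrary symmetric input
projected = project_onto_spectral_box(M, u, v)

print(np.linalg.eigvalsh(projected))       # all values lie in [u, v]
```

Projecting a matrix that is already feasible leaves it unchanged, which gives a quick sanity check of an implementation.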

Appendix B More Experiments

In this section we first describe the network structures used in our main experiments and then present more experimental results.

B.1 Network Structures

Multiclass Classification (MNIST & CIFAR10)

We use a convolutional neural network as our baseline model. The network used in the experiment has the following structure: ---. The notation denotes a convolutional layer with kernel size from depth to ; the notation denotes a fully connected layer with size . Similarly, CIFAR10 considers the structure: ----.

Multitask Regression (SARCOS)

The network structure is given by --.

B.2 Stable Rank and Spectral Norm on SARCOS

We also show the experimental results of stable ranks and spectral norms on the SARCOS dataset. For the SARCOS dataset, the weight matrix being regularized is of dimension . Again, we compare the results using three methods: MTL, MTL-WeightDecay and MTL-AdaReg. As can be observed from Table 2, compared with weight decay regularization, AdaReg reduces the stable rank of the learned weight matrix from 4.83 to 2.88 and its spectral norm from 0.92 to 0.70, which also helps to explain why MTL-AdaReg generalizes better than MTL and MTL-WeightDecay.

Method            Stable Rank   Spectral Norm
MTL               4.48          0.96
MTL-WeightDecay   4.83          0.92
MTL-AdaReg        2.88          0.70
Table 2: Stable rank and spectral norm on SARCOS.
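For reference, both quantities in Table 2 can be computed from the singular values of a weight matrix. The sketch below assumes the usual definition of the stable rank as the squared Frobenius norm divided by the squared spectral norm; the function names and the random matrix `w` (including its shape) are hypothetical and unrelated to the actual SARCOS weights.

```python
import numpy as np


def spectral_norm(w):
    """Largest singular value of w."""
    return float(np.linalg.svd(w, compute_uv=False)[0])


def stable_rank(w):
    """Squared Frobenius norm divided by squared spectral norm."""
    s = np.linalg.svd(w, compute_uv=False)  # singular values, descending
    return float(np.sum(s ** 2) / s[0] ** 2)


rng = np.random.default_rng(0)
w = rng.standard_normal((100, 7))  # hypothetical weight matrix
print(stable_rank(w), spectral_norm(w))
```

The stable rank is always between 1 and the rank of the matrix, so a smaller value indicates that the spectrum is dominated by fewer directions.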

B.3 Combination

As discussed in the main text, combining the proposed AdaReg with BN can further improve the generalization performance, due to the complementary effects of the two approaches: BN helps smooth the landscape of the loss function, while AdaReg also changes the curvature via the row and column covariance matrices (see Fig. 6).

On the other hand, we do not observe a significant difference when combining AdaReg with Dropout on this dataset. While the exact reason for this effect is unclear, we conjecture that it arises because Dropout works as a regularizer that prevents co-adaptation, whereas AdaReg instead encourages neurons to learn from each other.

(a) Batch size = 256.
(b) Batch size = 2048.
Figure 6: Combine AdaReg with BN and Dropout on MNIST.

B.4 Ablations

In all the experiments, the AdaReg algorithm is performed on the softmax layer. Here, we study the effects of applying the AdaReg algorithm in all layers, all layers, all layers, and the last layer (i.e., softmax layer). We first discuss how we handle the convolutions in our AdaReg algorithm. Consider a convolutional layer with {input channel, output channel, kernel width, kernel height} being { }; we vectorize the original 4-D tensor to be a 2-D matrix of size . The AdaReg algorithm can therefore be directly applied to this transformed matrix. Next, we perform the experiment on MNIST with batch size 2048 in Fig. 7. The training set size here is chosen as {128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 60000}.
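The flattening of a convolutional kernel into a matrix can be sketched as follows. The exact matrix shape is not recoverable from the text here, so this example assumes one plausible layout: the first tensor axis indexes the rows and the remaining three axes are merged into the columns. The function names, the axis order, and the example kernel size are all assumptions made for illustration.

```python
import numpy as np


def flatten_conv_kernel(kernel):
    """Reshape a 4-D kernel into a 2-D matrix.

    Assumed layout: the first axis becomes the rows, and the remaining
    three axes are merged into the columns.
    """
    return kernel.reshape(kernel.shape[0], -1)


def unflatten_conv_kernel(matrix, shape):
    """Inverse of flatten_conv_kernel for a known 4-D shape."""
    return matrix.reshape(shape)


rng = np.random.default_rng(0)
kernel = rng.standard_normal((3, 8, 5, 5))  # hypothetical kernel tensor
matrix = flatten_conv_kernel(kernel)
print(matrix.shape)

restored = unflatten_conv_kernel(matrix, kernel.shape)
```

Since the reshape is invertible, a matrix-based regularizer can operate on the 2-D view while the network keeps using the original 4-D tensor.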

We find that applying the AdaReg algorithm only in the softmax layer yields the best generalization compared with applying AdaReg to more layers. The improvement is more pronounced when the training set size is small. We argue that neural networks can be viewed as a combination of a complex nonlinear transformation (i.e., feature extraction) and a linear model (i.e., softmax layer). Since AdaReg learns correlations in the weight matrix, it can also discover implicit correlations between neurons. In real-world settings, different tasks are expected to be correlated. Therefore, applying AdaReg in the linear model should improve the model performance by discovering these task correlations. In contrast, the nonlinear features should be decorrelated for the purpose of generalization. Hence, applying AdaReg in earlier layers may have an adverse effect.

Figure 7: Applying AdaReg on different layers in neural networks for MNIST with batch size 2048.

B.5 Covariance matrices in the prior

One byproduct of AdaReg is the learned row and column covariance matrices, which can be used in exploratory data analysis to understand the correlations between learned features and different output tasks. To this end, we visualize both the row and column covariance matrices in Fig. 8. The two covariance matrices in the first row are learned on a training set with 600 instances, while the two in the second row are learned on the full MNIST dataset.

(a) Row Cov. matrix trained on 600 instances.
(b) Column Cov. matrix trained on 600 instances.
(c) Row Cov. matrix trained on 60,000 instances.
(d) Column Cov. matrix trained on 60,000 instances.
Figure 8: Recovered row covariance matrix and column covariance matrix in the prior distribution on MNIST.

From Fig. 8 we can make the following observations. First, the structure of both covariance matrices becomes more evident when trained with a larger dataset, and this is consistent with the Bayesian principle because more data provide more evidence. Second, we observe in our experiments that the variances of both matrices are small. In fact, the variance of the row covariance matrix attains the lower bound at convergence. Lastly, comparing the row covariance matrix in Fig. 8 with the one computed from model weights in Fig. 5, we can see that both matrices exhibit the same correlation patterns, except that the pattern obtained from model weights is more evident, which is due to the fact that model weights are closer to data evidence than the row covariance matrix in the Bayesian hierarchy.

On the other hand, the column covariance matrix in Fig. 8 also exhibits rich correlations between the learned features. Again, these patterns become more evident with more data.

Figure 9: Explained variance of different methods on 7 regression tasks from the SARCOS dataset.