MMA Regularization: Decorrelating Weights of Neural Networks by Maximizing the Minimal Angles

06/06/2020 ∙ by Zhennan Wang, et al. ∙ Shenzhen University

The strong correlation between neurons or filters can significantly weaken the generalization ability of neural networks. Inspired by the well-known Tammes problem, we propose a novel diversity regularization method to address this issue, which makes the normalized weight vectors of neurons or filters distributed on a hypersphere as uniformly as possible, through maximizing the minimal pairwise angles (MMA). This method can easily exert its effect by plugging the MMA regularization term into the loss function with negligible computational overhead. The MMA regularization is simple, efficient, and effective. Therefore, it can be used as a basic regularization method in neural network training. Extensive experiments demonstrate that MMA regularization is able to enhance the generalization ability of various modern models and achieves considerable performance improvements on CIFAR100 and TinyImageNet datasets. In addition, experiments on face verification show that MMA regularization is also effective for feature learning.




1 Introduction

Although neural networks have achieved state-of-the-art results in a variety of tasks, they contain redundant neurons or filters because of over-parametrization (Shang et al., 2016; Li et al., 2017), an issue that is prevalent in networks (RoyChowdhury et al., 2017). This redundancy can limit the directions captured in feature space and lead to poor generalization performance (Morcos et al., 2018).

To address the redundancy problem and make neurons more discriminative, several methods have been developed to encourage angular diversity between pairwise weight vectors of neurons or filters in a layer. They fall into three types. The first type reduces redundancy by dropping some weights and then retraining them iteratively during optimization (Prakash et al., 2019; Han et al., 2016; Qiao et al., 2019), which suffers from a complex training scheme and a very long training phase. The second type is the widely used orthogonal regularization (Rodríguez et al., 2017; Xie et al., 2017b; Liu et al., 2017; Xie et al., 2017a), which adds a regularization term to the loss function to push the pairwise weight vectors to be as orthogonal as possible. However, it has been proven that orthogonal regularization tends to group neurons closer, especially when the number of neurons is greater than the dimension (Liu et al., 2018), and therefore it only produces marginal improvements (Prakash et al., 2019). The third type also uses a regularization term, but encourages the weight vectors to be uniformly spaced by minimizing the hyperspherical potential energy (Liu et al., 2018; Lin et al., 2020), inspired by the Thomson problem (Thomson, 1904; Smale, 1998). Its disadvantage is that both the time complexity and the space complexity are very high (Liu et al., 2018), and it suffers from a huge number of local minima and stationary points due to its highly non-convex and non-linear objective function (Lin et al., 2020).

In this paper, we propose a simple, efficient, and effective method of angular diversity regularization which penalizes small minimum angles between pairwise weight vectors in each layer. Similar to the intuition of the third type mentioned above, the most diverse state is that the normalized weight vectors are distributed on a hypersphere uniformly. To model the criterion of uniformity, we employ the well-known Tammes problem, that is, to find the arrangement of points on a unit sphere which maximizes the minimum distance between any two points (Tammes, 1930; Musin and Tarasov, 2015; Petković and Živić, 2013; Lovisolo and Da Silva, 2001; Pernici et al., 2019). However, optimal solutions for the Tammes problem are only known for some combinations of the number of points $n$ and the dimension $d$, which are collected on N. J. A. Sloane's homepage (Sloane et al., 2000), and obtaining a uniform distribution for an arbitrary combination of $n$ and $d$ is still an open mathematical problem (Musin and Tarasov, 2015). In this paper, we propose a numerical optimization method to get approximate solutions for the Tammes problem by maximizing the minimal pairwise angles between weight vectors, abbreviated as MMA. We further develop the MMA regularization for neural networks to promote the angular diversity of weight vectors in each layer and thus improve the generalization performance.

There are several advantages of MMA regularization: (a) As analyzed in Section 3.2, the gradient of the MMA loss is stable and consistent, so it is easy to optimize and reaches near-optimal solutions for the Tammes problem, as shown in Table 1; (b) As verified in Table 3, the MMA regularization is easy to implement with negligible computational overhead, yet yields considerable performance improvements; (c) The MMA regularization is effective for both the hidden layers and the output layer, decorrelating the filters and enlarging the inter-class separability respectively. Therefore, it can be applied to multiple tasks, such as image classification and face verification demonstrated in this paper. To give an intuition for the effectiveness of MMA regularization, we visualize the cosine similarity of filters from the first layer of VGG19-BN trained on CIFAR100 in Figure 1. We compare several different methods of angular diversity regularization, including orthogonal regularization in (Rodríguez et al., 2017), MHE regularization in (Liu et al., 2018), and the proposed MMA regularization. The results show that the MMA regularization gets the most uncorrelated filters. Besides, the MMA regularization keeps some negative correlations, which have been verified to be beneficial for neural networks (Chelaru and Dragoi, 2016).

Figure 1: Comparison of filter cosine similarity from the first layer of VGG19-BN trained on CIFAR100 with several different methods of angular diversity regularization. The number of similarity values above 0.2 is 495 (baseline), 120 (orthogonal), 51 (MHE), 0 (MMA), demonstrating the effectiveness of MMA regularization.

In summary, the main contributions of this paper are three-fold:

  • We propose a numerical method for the Tammes problem, called MMA, which can get near optimal solutions under arbitrary combinations of the number of points and dimensions.

  • We develop the novel MMA regularization which effectively promotes the angular diversity of weight vectors and therefore improves the generalization power of neural networks.

  • Various experiments on multiple tasks show that MMA regularization is generally effective and can become a basic regularization method for training neural networks.

2 Related Work

To improve the generalization power of neural networks, many regularization methods have been proposed, such as weight decay (Krogh and Hertz, 1992), decoupled weight decay (Loshchilov and Hutter, 2017), weight elimination (Weigend et al., 1991), nuclear norm (Recht et al., 2010), dropout (Srivastava et al., 2014), dropconnect (Wan et al., 2013), adding noise (An, 1996), and early stopping (Morgan and Bourlard, 1990).

Recently, diversity-promoting regularization approaches have emerged. These methods mainly penalize the neural networks by adding a regularization term to the loss function. The regularization term either promotes the diversity of activations by minimizing the cross-covariance of hidden activations (Cogswell et al., 2015), or directly promotes the diversity of neurons or filters by enforcing pairwise orthogonality (Rodríguez et al., 2017; Xie et al., 2017b; Liu et al., 2017; Xie et al., 2017a) or minimizing the global potential energy (Liu et al., 2018; Lin et al., 2020). For many tasks, these methods obtain only marginal improvements (Rodríguez et al., 2017; Prakash et al., 2019; Xie et al., 2017c; Brock et al., 2016). Another stream of approaches obtains comparatively diverse neurons or filters by cyclically dropping and relearning some of the weights (Prakash et al., 2019; Han et al., 2016; Qiao et al., 2019), which leads to substantial performance gains but suffers from complex training. In contrast, our proposed simple MMA regularization achieves significant performance improvements while employing the standard training procedures.

The work most related to our method is MHE (Liu et al., 2018), which also targets the uniform distribution of normalized weight vectors on a hypersphere. However, MHE is inspired by the Thomson problem (Thomson, 1904) and models the criterion of uniformity as the minimum global potential energy, which suffers from high computational complexity and many local minima (Lin et al., 2020). Inspired by the Tammes problem (Tammes, 1930; Lovisolo and Da Silva, 2001), our proposed MMA regularization models the criterion as maximizing the minimum angles, which is the key reason why our method is more efficient and effective.

3 MMA Regularization

As our proposed regularization is inspired by the Tammes problem, we firstly analyze the Tammes problem and propose a numerical method called MMA which maximizes the minimal pairwise angles between the vectors. Then we make a comparison of several numerical methods for the Tammes problem by gradient analysis, which demonstrates the advantage of the proposed MMA. Finally, we develop a novel angular diversity regularization for neural networks by the proposed MMA.

3.1 The Tammes Problem and Proposed Numerical Method MMA

Construction of points spaced uniformly on a unit hypersphere is an important problem for various applications ranging from coding theory to computational geometry (Petković and Živić, 2013). There are many ways to model the criterion of uniformity. One approach is to maximize the minimal pairwise distance between the points (Petković and Živić, 2013), i.e.

$$\max_{\{w_i\}} \; \min_{i \neq j} \; \left\| \hat{w}_i - \hat{w}_j \right\|, \tag{1}$$

where $w_i \in \mathbb{R}^{d}$ denotes the coordinate vector of the $i$-th point, $\hat{w}_i = w_i / \|w_i\|$ denotes the $\ell_2$-normalized vector, and $\|\cdot\|$ denotes the Euclidean norm. This criterion means the points on a unit sphere are spaced uniformly when the minimal pairwise distance is maximized, which is known as the Tammes problem (Tammes, 1930; Lovisolo and Da Silva, 2001) or the optimal spherical code (Ericson and Zinoviev, 2001; Sloane et al., 2000). Denoting the dimension with $d$ and the number of points with $n$, we first analyze the analytical solutions for the case of $n \le d+1$, and then propose the numerical solutions for the case of $n > d+1$.

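The analytical solutions for $n \le d+1$. On a unit hypersphere, the distance between two points decreases monotonically as their cosine similarity increases, since $\|\hat{w}_i - \hat{w}_j\|^2 = 2 - 2\cos\theta_{ij}$, where $\theta_{ij}$ is the angle between $w_i$ and $w_j$. The Tammes problem is therefore equivalent to minimizing the maximal pairwise cosine similarity, i.e.

$$\min_{\{w_i\}} \; \max_{i \neq j} \; \cos\theta_{ij}. \tag{2}$$

The maximum of $\cos\theta_{ij}$ cannot be smaller than the average over all pairs. Therefore, the minimum is derived as:

$$\max_{i \neq j} \cos\theta_{ij} \;\ge\; \frac{\sum_{i \neq j} \cos\theta_{ij}}{n(n-1)} \;=\; \frac{\left\| \sum_{i} \hat{w}_i \right\|^{2} - n}{n(n-1)} \;\ge\; -\frac{1}{n-1}. \tag{3}$$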

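Therefore, the minimum of the maximal pairwise cosine similarity is $-\frac{1}{n-1}$, which is reached when all pairwise angles between the points are equal to each other and the sum of all vectors is the zero vector. This criterion has a matrix form:

$$\hat{W}\hat{W}^{T} = \begin{bmatrix} 1 & -\frac{1}{n-1} & \cdots & -\frac{1}{n-1} \\ -\frac{1}{n-1} & 1 & \cdots & -\frac{1}{n-1} \\ \vdots & \vdots & \ddots & \vdots \\ -\frac{1}{n-1} & -\frac{1}{n-1} & \cdots & 1 \end{bmatrix}, \tag{4}$$

where $\hat{W} \in \mathbb{R}^{n \times d}$ denotes the set of $\ell_2$-normalized points, one per row. According to matrix theory, the eigenvalues of the matrix $\hat{W}\hat{W}^{T}$ are $0$ with algebraic multiplicity 1 and $\frac{n}{n-1}$ with algebraic multiplicity $n-1$. As all the eigenvalues of $\hat{W}\hat{W}^{T}$ are greater than or equal to zero, $\hat{W}\hat{W}^{T}$ is a positive semidefinite matrix. According to the spectral theorem (Axler, 1997), $\hat{W}$ can be obtained through the eigendecomposition of $\hat{W}\hat{W}^{T}$, which is the analytical solution for the Tammes problem. However, since the rank of $\hat{W}\hat{W}^{T}$ is $n-1$, the rank of $\hat{W}$ and the minimum dimension of the points are also $n-1$. Therefore, this analytical solution only exists for the case of $n \le d+1$.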
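The eigendecomposition argument above can be checked numerically. The following NumPy sketch builds the target Gram matrix for $n$ points, decomposes it, and recovers $n$ unit vectors in $n-1$ dimensions. The function name `simplex_points` and the tolerance `1e-9` used to discard the zero eigenvalue are illustrative assumptions, not part of the paper.

```python
import numpy as np


def simplex_points(n):
    """Return n unit vectors in R^(n-1) with pairwise cosine -1/(n-1)."""
    # Target Gram matrix: 1 on the diagonal, -1/(n-1) everywhere else.
    gram = np.full((n, n), -1.0 / (n - 1))
    np.fill_diagonal(gram, 1.0)
    # eigh is meant for symmetric matrices and returns real eigenpairs.
    eigvals, eigvecs = np.linalg.eigh(gram)
    # One eigenvalue is (numerically) zero; the other n-1 equal n/(n-1).
    keep = eigvals > 1e-9  # assumed tolerance for "non-zero"
    # Rows of V * sqrt(Lambda) reproduce the Gram matrix.
    return eigvecs[:, keep] * np.sqrt(eigvals[keep])


points = simplex_points(4)  # 4 points in 3 dimensions
cosines = points @ points.T
off_diag = cosines[~np.eye(4, dtype=bool)]
min_angle_deg = np.degrees(np.arccos(np.clip(off_diag, -1.0, 1.0))).min()
print(points.shape, min_angle_deg)
```

For $n = 4$ the construction yields a regular tetrahedron: every off-diagonal cosine is $-1/3$, which corresponds to the 109.5 degree entry for 4 points in 3 dimensions in Table 1.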

Figure 2: Comparison of the gradient norm changed with pairwise angle. The gradient of MMA loss is stable and consistent.

The numerical solutions for $n > d+1$. So far, for the case of $n > d+1$, analytical solutions for the Tammes problem are only known for some combinations of $n$ and $d$ (Sloane et al., 2000). For most combinations, no optimal solution is known. Consequently, numerical methods are used to get approximate solutions.

As the objective of the Tammes problem (Equation (1)) is not globally differentiable (Pintér, 2001), the conventional solution (Adams, 1997) instead optimizes a differentiable potential energy function to get approximate solutions, as discussed in the next subsection. Nonetheless, with the help of SGD (Rumelhart et al., 1986) and modern automatic differentiation libraries (Paszke et al., 2019), we can now optimize Equation (1) directly and get approximate solutions. However, the calculation of the Euclidean distance is expensive. Alternatively, as mentioned in Equation (2), we can use the cosine similarity as the objective function, called the cosine loss, which is formulated as follows:

$$l_{cosine} = \frac{1}{n} \sum_{i=1}^{n} \max_{j,\, j \neq i} C_{ij}, \qquad C = \hat{W}\hat{W}^{T}, \tag{5}$$

where $C$ denotes the cosine similarity matrix of the points. Employing the global maximum similarity as in Equation (2) is inefficient, as it only updates the closest pair of points. Therefore, we instead use the average of each vector's maximum similarity.

The cosine loss can be optimized quickly by taking advantage of its matrix form. However, we find that this loss is hard to converge, especially when $w_i$ is very close to $w_j$, which is very prevalent in neural networks (RoyChowdhury et al., 2017). As analyzed in the next subsection, this is because the gradient is too small to overcome random fluctuations during the optimization. Gaining insight from ArcFace (Deng et al., 2019), we propose the angular version of the cosine loss as the objective function:

$$l_{MMA} = -\frac{1}{n} \sum_{i=1}^{n} \min_{j,\, j \neq i} \theta_{ij}, \qquad \theta = \arccos\left(\hat{W}\hat{W}^{T}\right), \tag{6}$$

where $\theta$ denotes the pairwise angle matrix. As this loss maximizes the minimal pairwise angles, we name it the MMA loss. The MMA loss is very efficient and robust for optimization, so it is easy to get near-optimal numerical solutions for the Tammes problem. Besides, it can also get close solutions for the case $n \le d+1$, which is validated in Section 4. In the next subsection, we demonstrate the advantage of the proposed MMA loss through gradient analysis and comparison.
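As an illustration of the two objectives, here is a minimal NumPy sketch that evaluates the cosine loss and the MMA loss for a weight matrix with one vector per row. It only computes the loss values; in practice the paper relies on automatic differentiation for the gradients. The function names and the diagonal mask value `-2.0` are assumptions made for this example.

```python
import numpy as np


def max_cosine_per_row(weights):
    """For each row vector, cosine similarity to its closest other row."""
    unit = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = unit @ unit.T
    # Mask self-similarity with a value below any valid cosine.
    np.fill_diagonal(cos, -2.0)
    return cos.max(axis=1)


def cosine_loss(weights):
    """Average of each vector's maximum cosine similarity."""
    return max_cosine_per_row(weights).mean()


def mma_loss(weights):
    """Negative average of each vector's minimum pairwise angle (radians)."""
    # arccos is decreasing, so the minimum angle is arccos of the maximum cosine.
    max_cos = np.clip(max_cosine_per_row(weights), -1.0, 1.0)
    return -np.arccos(max_cos).mean()


# Three mutually orthogonal vectors with different lengths.
w = np.diag([1.0, 2.0, 3.0])
print(cosine_loss(w), mma_loss(w))
```

Because both losses work on normalized vectors, the different row lengths in `w` have no effect: every minimum angle is 90 degrees, so the cosine loss is 0 and the MMA loss is $-\pi/2$.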

3.2 The Gradient Analysis

In this subsection, we analyze and compare the gradients of the loss functions that generate approximate solutions for uniformly spaced points. To simplify the derivation, we only consider the norm of the gradient of the core function, i.e. the term inside the summation of each loss function, w.r.t. the corresponding weight vector $w_i$. For intuitive comparison, the analysis results are presented in Figure 2.

For the cosine loss in Equation (5), the gradient norm is derived as follows:

$$\left\| \frac{\partial \cos\theta_{ij}}{\partial w_i} \right\| = \left\| \frac{\left(I - M_{w_i}\right) \hat{w}_j}{\|w_i\|} \right\| = \frac{\sin\theta_{ij}}{\|w_i\|}, \tag{7}$$

where $M_{w_i} = \hat{w}_i \hat{w}_i^{T}$ represents the projection matrix of $w_i$. From the above derivation and Figure 2, we can see that the gradient norm is very small when the pairwise angle $\theta_{ij}$ is close to zero. That is why the cosine loss is hard to converge when $w_i$ and $w_j$ are close to each other, as observed in the experiments of Section 4. Next, we derive the gradient norm for the MMA loss in Equation (6):

$$\left\| \frac{\partial \left(-\theta_{ij}\right)}{\partial w_i} \right\| = \frac{1}{\sin\theta_{ij}} \left\| \frac{\partial \cos\theta_{ij}}{\partial w_i} \right\| = \frac{1}{\|w_i\|}. \tag{8}$$

Compared to the gradient norm of the cosine loss in Equation (7), the gradient norm of the MMA loss is independent of the pairwise angle $\theta_{ij}$, so it does not become very small even when $\theta_{ij}$ approaches zero. Figure 2 shows that the gradient norm of the MMA loss is stable and consistent. Therefore, the MMA loss is easy to optimize and reaches near-optimal solutions for the Tammes problem, as verified by the experiments in Section 4.
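The two gradient norms can be verified with finite differences. The sketch below is an illustrative check rather than the paper's code: it differentiates $\cos\theta_{ij}$ and $-\theta_{ij}$ numerically with respect to $w_i$ and compares the norms with $\sin\theta_{ij}/\|w_i\|$ and $1/\|w_i\|$. The helper names, the step size `eps`, and the test vectors are assumptions.

```python
import numpy as np


def cos_sim(wi, wj):
    """Cosine similarity between two vectors."""
    return wi @ wj / (np.linalg.norm(wi) * np.linalg.norm(wj))


def neg_angle(wi, wj):
    """Core function of the MMA loss: the negative pairwise angle."""
    return -np.arccos(np.clip(cos_sim(wi, wj), -1.0, 1.0))


def numeric_grad_norm(f, wi, wj, eps=1e-6):
    """Norm of the central-difference gradient of f(wi, wj) w.r.t. wi."""
    grad = np.zeros_like(wi)
    for k in range(wi.size):
        step = np.zeros_like(wi)
        step[k] = eps
        grad[k] = (f(wi + step, wj) - f(wi - step, wj)) / (2 * eps)
    return np.linalg.norm(grad)


wi = np.array([2.0, 0.0, 0.0])  # ||w_i|| = 2
theta = np.radians(10.0)        # a small pairwise angle
wj = np.array([np.cos(theta), np.sin(theta), 0.0])

g_cos = numeric_grad_norm(cos_sim, wi, wj)    # compare with sin(theta) / ||w_i||
g_mma = numeric_grad_norm(neg_angle, wi, wj)  # compare with 1 / ||w_i||
print(g_cos, np.sin(theta) / 2.0)
print(g_mma, 0.5)
```

At a 10 degree angle the cosine gradient is already several times smaller than the MMA gradient, and the gap widens as the angle shrinks, which is the behavior the analysis above attributes to the cosine loss.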

In addition to the above two loss functions, we also analyze the Riesz-Fisher loss (Petković and Živić, 2013) and the logarithmic loss (Petković and Živić, 2013), which are often used to get uniformly distributed points on a hypersphere. The philosophy behind the two loss functions is that the points on a hypersphere are uniformly spaced when the potential energy is minimum, and both of them are formulated as kernel functions of the potential energy. The Riesz-Fisher loss and the corresponding gradient norm are:

$$l_{RF} = \frac{1}{n(n-1)} \sum_{i \neq j} \left\| \hat{w}_i - \hat{w}_j \right\|^{-s}, \tag{9}$$

$$\left\| \frac{\partial \left\| \hat{w}_i - \hat{w}_j \right\|^{-s}}{\partial w_i} \right\| = \frac{s \, \sin\theta_{ij}}{\|w_i\| \, \left\| \hat{w}_i - \hat{w}_j \right\|^{s+2}}, \tag{10}$$

where $s > 0$ is a hyperparameter, and is set to 1 in Figure 2 for easy comparison. The logarithmic loss and the corresponding gradient norm are:

$$l_{log} = \frac{1}{n(n-1)} \sum_{i \neq j} \log \left\| \hat{w}_i - \hat{w}_j \right\|^{-1}, \tag{11}$$

$$\left\| \frac{\partial \log \left\| \hat{w}_i - \hat{w}_j \right\|^{-1}}{\partial w_i} \right\| = \frac{\sin\theta_{ij}}{\|w_i\| \, \left\| \hat{w}_i - \hat{w}_j \right\|^{2}}. \tag{12}$$

Due to limited space, more details of the derivation are presented in the supplementary material. As visualized in Figure 2, the Riesz-Fisher loss and the logarithmic loss have similar properties: the gradient norm is sharp around angles near zero and drops rapidly as the angle increases. Besides, the greater the $s$ of the Riesz-Fisher loss is, the sharper the gradient norm becomes. The very large gradient norm around angles near zero can cause instability and prevent the normal learning of neural networks, and the very small gradient norm around angles away from zero makes the updates inefficient. We argue that this is why the two loss functions only get inaccurate solutions for the Tammes problem in Section 4 and perform poorly in terms of accuracy in Table 3.
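To make the comparison concrete, the following sketch evaluates the gradient norms of the four core functions as closed-form functions of the pairwise angle. The expressions are derived here under the assumption that the core functions are $\cos\theta_{ij}$, $-\theta_{ij}$, $\|\hat{w}_i - \hat{w}_j\|^{-s}$, and $\log\|\hat{w}_i - \hat{w}_j\|^{-1}$, using the chord length $\|\hat{w}_i - \hat{w}_j\| = 2\sin(\theta_{ij}/2)$. The function name and the sampled angles are illustrative.

```python
import math


def gradient_norms(theta, w_norm=1.0, s=1.0):
    """Gradient norm of each core function w.r.t. w_i at pairwise angle theta."""
    base = math.sin(theta) / w_norm     # gradient norm of cos(theta)
    dist = 2.0 * math.sin(theta / 2.0)  # chord length between the unit vectors
    return {
        "cosine": base,
        "mma": 1.0 / w_norm,            # independent of theta
        "riesz_fisher": s * base / dist ** (s + 2),
        "logarithmic": base / dist ** 2,
    }


for degrees in (1, 10, 45, 90):
    norms = gradient_norms(math.radians(degrees))
    print(degrees, {name: round(value, 4) for name, value in norms.items()})
```

The printed values show the pattern described above: the Riesz-Fisher and logarithmic norms blow up as the angle approaches zero and decay quickly afterwards, the cosine norm vanishes near zero, and the MMA norm stays constant.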

3.3 MMA Regularization for Neural Networks

In this subsection, we develop the MMA regularization for neural networks, which steers learning towards weight vectors that are uniformly distributed in angular space. For $n \le d+1$, we can employ the cosine similarity matrix in Equation (4) to constrain the weights. However, as the MMA loss can generate accurate approximate solutions in any case and is easy to implement, we uniformly use the MMA loss in Equation (6) as the angular regularization:

$$l_{MMA\_reg} = \lambda \sum_{i=1}^{L} l_{MMA}\left(W_i\right), \tag{13}$$

where $\lambda$ denotes the regularization coefficient, $L$ denotes the total number of layers, including convolutional layers and fully connected layers, and $W_i$ denotes the weight matrix of the $i$-th layer with each row denoting a vectorized filter or neuron.

The MMA regularization is complementary and orthogonal to weight decay (Krogh and Hertz, 1992): weight decay regularizes the Euclidean norm of weight vectors, while MMA regularization promotes the directional diversity of weight vectors. MMA regularization can be applied to both the hidden layers and the output layer. For hidden layers, MMA regularization can reduce the redundancy of filters, which is very common in neural networks (RoyChowdhury et al., 2017). Consequently, the unnecessary overlap in the features captured by the network's filters is diminished. For the output layer, MMA regularization can maximize the inter-class separability and therefore enhance the discriminative power of neural networks.
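The layer-wise regularization term can be sketched as follows. This NumPy version only evaluates the penalty for a list of weight arrays; in a training framework with automatic differentiation (the paper uses PyTorch) the same quantity would be added to the task loss. The function names, the example layer shapes, and the choice of flattening each filter into one row are assumptions of this sketch; the coefficient 0.07 is the value the paper reports for VGG models.

```python
import numpy as np


def mma_loss(weights):
    """Negative mean of each row vector's minimum pairwise angle (radians)."""
    unit = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = unit @ unit.T
    np.fill_diagonal(cos, -2.0)  # exclude self-similarity
    max_cos = np.clip(cos.max(axis=1), -1.0, 1.0)
    return -np.arccos(max_cos).mean()


def mma_regularization(layer_weights, coefficient=0.07):
    """Sum the MMA loss over layers and scale by the coefficient."""
    total = 0.0
    for w in layer_weights:
        # Flatten each filter or neuron into one row vector.
        rows = w.reshape(w.shape[0], -1)
        total += mma_loss(rows)
    return coefficient * total


rng = np.random.default_rng(0)
layers = [
    rng.standard_normal((8, 3, 3, 3)),  # conv layer: 8 filters of shape 3x3x3
    rng.standard_normal((10, 8)),       # linear layer: 10 neurons, 8 inputs
]
penalty = mma_regularization(layers)
print(penalty)
```

The penalty is always negative because it is a scaled sum of negated angles; minimizing it pushes each vector away from its nearest neighbor within the same layer.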

4 Experiments for the Tammes Problem

This section compares several numerical methods for the Tammes problem, measured by the minimum angle, as shown in Table 1. The first column denotes the dimension $d$ and the second column denotes the number of points $n$. The third column reports the minimal pairwise angles of the optimal solutions collected in (Sloane et al., 2000). The remaining columns are the minimum angles obtained by several different numerical methods, including the MMA loss in Equation (6), the cosine loss in Equation (5), the Riesz-Fisher loss in Equation (9), and the logarithmic loss in Equation (11). The weights are initialized with values drawn from the standard normal distribution and then optimized by SGD (Rumelhart et al., 1986) for 10000 iterations. The initial learning rate is set to 0.1 and reduced by a factor of 5 once learning stagnates, and the momentum is set to 0.9.

d n Optimal MMA Cosine Riesz-Fisher Logarithmic
3 4 109.5 109.5 109.5 109.4 109.5
3 30 38.6 38.5 0 34.9 35.4
3 130 18.5 17.6 0 12.6 16.7
4 5 104.5 104.5 104.5 104.5 104.5
4 30 54.3 54.0 53.7 49.3 48.6
4 130 33.4 32.0 32.1 27.9 27.2
4 600 19.8 19.3 0 15.9 13.1
5 6 101.5 101.5 101.5 101.2 101.5
5 30 65.6 65.5 64.0 57.1 60.0
5 130 43.8 42.9 42.5 35.8 35.6
Table 1: Minimum angle (degree) obtained by several different loss functions for the Tammes problem. The best results are highlighted in bold.

For $n \le d+1$, as analyzed in Section 3.1, each pairwise angle of the optimal solutions is $\arccos\left(-\frac{1}{n-1}\right)$, verified by the second row ($d$=3, $n$=4), the fifth row ($d$=4, $n$=5), and the ninth row ($d$=5, $n$=6), from which we can observe that the optimal solutions can be easily achieved by any of the four numerical methods. For $n > d+1$, all the numerical solutions tend to be somewhat worse than the optimal solutions. However, the MMA loss robustly obtains the solutions closest to the optimal. The cosine loss can also achieve very close solutions, but it is not robust when there are too many points, as in the third row ($d$=3, $n$=30), the fourth row ($d$=3, $n$=130), and the eighth row ($d$=4, $n$=600). This is due to the overly small gradient analyzed in Section 3.2. The Riesz-Fisher loss and the logarithmic loss are also robust, but they converge to solutions far from the optimal.
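The kind of experiment reported in this section can be approximated with a small self-contained script. The sketch below is a simplified stand-in, not the paper's setup: it uses plain projected subgradient descent with hand-written gradients and an assumed decaying step size, instead of SGD with momentum and a plateau-based learning-rate schedule. Function names, the number of steps, and the seed are illustrative.

```python
import numpy as np


def min_angle_deg(w):
    """Smallest pairwise angle (degrees) among the row vectors of w."""
    unit = w / np.linalg.norm(w, axis=1, keepdims=True)
    cos = unit @ unit.T
    np.fill_diagonal(cos, -2.0)
    return float(np.degrees(np.arccos(np.clip(cos.max(), -1.0, 1.0))))


def optimize_mma(n, d, steps=5000, lr=0.1, seed=0):
    """Projected subgradient descent on the MMA loss for n points in R^d."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((n, d))
    w /= np.linalg.norm(w, axis=1, keepdims=True)
    start = w.copy()
    for t in range(steps):
        cos = w @ w.T
        np.fill_diagonal(cos, -2.0)
        nearest = cos.argmax(axis=1)  # index of each point's closest neighbor
        grad = np.zeros_like(w)
        for i, j in enumerate(nearest):
            c = float(np.clip(cos[i, j], -1.0, 1.0))
            sin = max(np.sqrt(1.0 - c * c), 1e-8)
            # Gradient of -theta_ij for unit vectors: the unit tangent pointing
            # toward the other point, so a descent step pushes the pair apart.
            grad[i] += (w[j] - c * w[i]) / sin
            grad[j] += (w[i] - c * w[j]) / sin
        step = lr / (1.0 + t / 100.0)  # assumed decay schedule
        w -= step * grad / n           # 1/n comes from the mean in the loss
        w /= np.linalg.norm(w, axis=1, keepdims=True)  # project back to the sphere
    return start, w


start, final = optimize_mma(n=4, d=3)
print(min_angle_deg(start), min_angle_deg(final))
```

For 4 points in 3 dimensions the minimum angle cannot exceed $\arccos(-1/3) \approx 109.47$ degrees, so the printed final angle can be compared directly against the optimal value in the first data row of Table 1.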

5 Experiments on Image Classification

5.1 Implementation Settings

We conduct image classification experiments on CIFAR100 (Krizhevsky and Hinton, 2009) and TinyImageNet (Le and Yang, 2015). For both datasets, we follow the simple data augmentation in (Lee et al., 2015). We employ various classic networks as the backbone networks, including ResNet56 (He et al., 2016), VGG19 (Simonyan and Zisserman, 2014) with batch normalization (Ioffe and Szegedy, 2015) denoted by VGG19-BN, VGG16 with batch normalization denoted by VGG16-BN, WideResNet (Zagoruyko and Komodakis, 2016) with 16 layers and a widening factor of 8 denoted by WRN-16-8, and DenseNet (Huang et al., 2017) with 40 layers and a growth rate of 12 denoted by DenseNet-40-12. We denote the MMA-regularized version of a model X by X-MMA. For fair comparison, not only the X-MMA models but also the corresponding backbones are trained from scratch, so our results may be slightly different from the ones presented in the original papers due to different random seeds and hardware settings. For CIFAR100, the hyperparameters and settings are the same as in the original papers. For TinyImageNet, we follow the settings in (Wang et al., 2020). Besides, all the random seeds are fixed, so the experiments are reproducible and the comparisons are fair. Moreover, in order to reduce the variance of evaluation, we use the average accuracy of the last five epochs as the evaluation criterion.

5.2 Ablation Study

To understand the behavior of MMA regularization, we conduct comprehensive ablation experiments on CIFAR100. Unless otherwise noted, we use VGG19-BN for the ablation experiments.

Impact of the hyperparameter. The MMA regularization coefficient is the only hyperparameter. As skip connections implicitly promote the angular diversity of neurons (Orhan and Pitkow, 2017), we separately select VGG19-BN and ResNet20 to investigate the impact of different coefficients for models without and with skip connections, as shown in Figure LABEL:fig:hyperparameter and Figure LABEL:fig:hyperparameter_skipnet respectively. From both figures, we can see that the effect of MMA regularization with too small a coefficient is not obvious, while too large a coefficient improves the performance only slightly or even decreases it. This is because overly strong regularization hinders the normal learning of neural networks to some extent. For VGG19-BN, MMA regularization is not very sensitive to the hyperparameter and works well from 0.03 to 0.2, which supports the effectiveness of MMA regularization. ResNet20 is more sensitive to the coefficient because of the skip connections. In the following experiments, we set the MMA regularization coefficient to 0.07 for VGG models and 0.03 for models with skip connections.

Model TOP-1 TOP-5
baseline 72.08 90.50
hidden 73.45 90.91
hidden+output 73.73 91.21
Table 2: Accuracy (%) of applying MMA regularization to different layers.

Effectiveness for hidden layers and output layer. The MMA regularization is applicable to both the hidden layers and the output layer. In Table 2, we study the effect of MMA regularization applied to hidden layers (hidden) and all layers (hidden+output). The results show that the hidden version improves over the VGG19-BN baseline with a considerable margin and, moreover, the hidden+output version improves the performance further. This indicates that the MMA regularization is effective for both the hidden layers and the output layer, and the effects can be accumulated. As analyzed in Section 3.3, the effectiveness for hidden layers comes from decorrelating the filters or neurons, and the effectiveness for output layer comes from enlarging the inter-class separability.

Comparison with other angular regularization. This section compares several angular regularization methods in terms of computation time per batch, occupied memory, accuracy, and the minimum pairwise angles of several layers, as shown in Table 3. Besides the MMA regularization, we also consider the MHE regularization (Liu et al., 2018) and the widely used orthogonal regularization (Rodríguez et al., 2017; Xie et al., 2017b; Liu et al., 2017; Xie et al., 2017a), which also penalize the pairwise angles. MHE uses the Riesz-Fisher loss (s>0) or the logarithmic loss (s=0) to implement regularization (Liu et al., 2018). The orthogonal regularization pushes all pairs of weight vectors to be orthogonal. Here, we adopt the orthogonal regularization in (Rodríguez et al., 2017):

$$l_{orth} = \lambda \sum_{i=1}^{L} \left\| \hat{W}_i \hat{W}_i^{T} - I \right\|_F^{2}, \tag{14}$$

where $\hat{W}_i$ denotes the $\ell_2$-normalized weight matrix of the $i$-th layer, $I$ denotes the identity matrix, and $\|\cdot\|_F$ denotes the Frobenius norm.
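For comparison with the MMA loss, the per-layer orthogonality penalty described above can be sketched in a few lines of NumPy. This assumes the squared Frobenius norm of the difference between the cosine similarity matrix and the identity; the function name and the example matrices are illustrative.

```python
import numpy as np


def orthogonal_penalty(weights):
    """Squared Frobenius distance between the cosine matrix and the identity."""
    unit = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    gram = unit @ unit.T
    return float(np.sum((gram - np.eye(len(weights))) ** 2))


orthogonal = np.diag([1.0, 2.0, 3.0])               # mutually orthogonal rows
parallel = np.array([[1.0, 0.0], [2.0, 0.0]])       # two rows, same direction
antiparallel = np.array([[1.0, 0.0], [-2.0, 0.0]])  # two rows, opposite directions

print(orthogonal_penalty(orthogonal))
print(orthogonal_penalty(parallel), orthogonal_penalty(antiparallel))
```

Because the cosines are squared, this penalty treats opposite directions exactly like identical ones, whereas the MMA loss regards antiparallel vectors as maximally separated.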

Regularization Time(s/batch) Memory(MiB) Acc-TOP-1(%) Acc-TOP-5(%) MinAngle-L3-3(deg) MinAngle-L4-3(deg) MinAngle-L5-3(deg) MinAngle-Classify(deg)
baseline 0.070 1127 72.08 90.50 70.1 16.0 30.8 54.0
MMA 0.095 1229 73.73 91.21 85.7 84.7 85.3 84.9
MHE(s=2) 0.259 5551 71.91 90.76 74.2 58.1 68.5 71.3
MHE(s=0) 0.253 5551 72.08 90.72 69.6 48.6 57.8 63.4
orthogonal 0.096 1237 72.83 90.98 79.3 75.3 81.2 63.1
Table 3: Comparison of several different methods of angular regularization. The MMA achieves the most diverse filters and highest accuracy with negligible computational overhead.

The coefficient is set to 0.07 for MMA, 1.0 for MHE (Liu et al., 2018), and 0.0001 for orthogonal regularization (Xie et al., 2017a). Due to the limit of GPU memory, we employ the mini-batch version of MHE (Liu et al., 2018), which iteratively takes a random batch (here, 30% of the total) of weight vectors to calculate the loss. For the comparison of minimal pairwise angles, we select the third layer of the third block, fourth block, and fifth block, and the classification layer, which are denoted by L3-3, L4-3, L5-3, and Classify respectively. These experiments are based on PyTorch (Paszke et al., 2019) and an NVIDIA GeForce GTX 1080 GPU.

Compared to the baseline, the MMA regularization and the orthogonal regularization slightly increase the computation time and occupied memory. The MHE regularization, however, greatly increases both because it computes all the pairwise distances. In terms of accuracy, the MMA regularization improves over the baseline by a substantial margin. The orthogonal regularization is also effective but inferior to the MMA regularization. The MHE regularization is only comparable to the baseline, which may be because of the unstable gradient analyzed in Section 3.2. We also observe a strong link between the minimal pairwise angles in hidden layers and the accuracy: the larger the minimal angles, the higher the accuracy. This is because a larger minimal angle means more diverse filters, which improves the generalizability of models. The MMA regularization is also the most effective at enlarging the minimal pairwise angle of the classification layer, which increases the inter-class separability and enhances the discriminative power of neural networks. More plots and comparisons of the minimal pairwise angles are shown in the supplementary material.

5.3 Results and Analysis

Model TOP-1 TOP-5
ResNet56 70.39 91.12
ResNet56-MMA 70.90 91.25
VGG19-BN 72.08 90.50
VGG19-BN-MMA 73.73 91.21
WRN-16-8 78.97 94.84
WRN-16-8-MMA 79.34 95.05
DenseNet-40-12 73.98 92.74
DenseNet-40-12-MMA 74.61 92.77
Table 4: Accuracy (%) on CIFAR100.

We first compare various modern architectures with their MMA-regularized versions on CIFAR100. From the results shown in Table 4, we can see that X-MMA consistently improves on the corresponding backbone models. In particular, MMA regularization improves the TOP-1 accuracy of VGG19-BN by 1.65 percentage points. MMA regularization also robustly improves the performance of models with skip connections like ResNet, DenseNet, and WideResNet, although the improvement is not as distinct as for VGG. This is because the skip connections have implicitly reduced feature correlations to some extent (Orhan and Pitkow, 2017).

Model TOP-1 TOP-5
ResNet56 54.80 78.71
ResNet56-MMA 55.24 78.92
VGG16-BN 62.16 82.41
VGG16-BN-MMA 63.37 82.68
Table 5: Accuracy (%) on TinyImageNet.

To further demonstrate the consistency of MMA's advantage, we also evaluate the MMA regularization with ResNet56 and VGG16-BN on TinyImageNet, with coefficients of 0.01 and 0.07 respectively. The results are reported in Table 5, where the X-MMA models outperform the original backbones on both TOP-1 and TOP-5 accuracy. It is worth emphasizing that the X-MMA models achieve the improvements with negligible computational overhead and without modifying the original network architecture.

6 ArcFace+: Applying MMA Regularization to ArcFace

ArcFace (Deng et al., 2019) is one of the state-of-the-art face verification methods, which proposes an additive angular margin between the learned feature and the target weight vector in the classification layer. This method essentially encourages intra-class feature compactness by pulling the learned features close to the target weight vectors. As analyzed in Section 3.3, MMA regularization can achieve diverse weight vectors and therefore improve inter-class separability for the classification layer. Consequently, the MMA regularization is complementary to the objective of ArcFace and should boost accuracy further. Motivated by this analysis, we propose ArcFace+ by applying MMA regularization to ArcFace. The objective function of ArcFace+ is defined as:

$$l_{ArcFace+} = l_{ArcFace}(m) + \lambda \, l_{MMA}(W), \tag{15}$$

where $m$ is the angular margin of ArcFace, $\lambda$ is the regularization coefficient, and $W$ is the weight matrix of the classification layer.
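A minimal NumPy sketch of this combined objective is given below. It evaluates a standard ArcFace-style loss (softmax cross-entropy on scaled cosines, with the margin added to the target angle) plus the MMA term on the classification weights. The function names, the scale parameter `scale`, and the defaults `margin=0.5` and `scale=64.0` are assumptions taken from common ArcFace practice rather than from this section; `lam=0.03` follows the default coefficient mentioned in the text. Practical implementations also handle the case where the target angle plus the margin exceeds $\pi$, which is omitted here.

```python
import numpy as np


def mma_loss(weights):
    """Negative mean of each row vector's minimum pairwise angle (radians)."""
    unit = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = unit @ unit.T
    np.fill_diagonal(cos, -2.0)  # exclude self-similarity
    return -np.arccos(np.clip(cos.max(axis=1), -1.0, 1.0)).mean()


def arcface_plus_loss(features, labels, class_weights,
                      margin=0.5, scale=64.0, lam=0.03):
    """ArcFace-style cross-entropy plus lam * MMA loss on the class weights."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos = np.clip(f @ w.T, -1.0, 1.0)  # shape: (batch, classes)
    theta = np.arccos(cos)
    rows = np.arange(len(labels))
    logits = scale * cos
    # Add the angular margin only to the target-class angle.
    logits[rows, labels] = scale * np.cos(theta[rows, labels] + margin)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    cross_entropy = -log_prob[rows, labels].mean()
    return cross_entropy + lam * mma_loss(class_weights)


rng = np.random.default_rng(0)
features = rng.standard_normal((6, 16))  # 6 embeddings of dimension 16
labels = np.array([0, 1, 2, 0, 1, 2])
class_weights = rng.standard_normal((3, 16))
print(arcface_plus_loss(features, labels, class_weights))
```

The two terms act on different quantities: the margin term pulls each feature towards its class weight vector, while the MMA term depends only on `class_weights` and spreads the class directions apart.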

For fair comparison, both the ArcFace and ArcFace+ are trained from scratch, therefore our results of the ArcFace may be slightly different from the ones presented in the original paper due to different settings and hardware. The implementation settings are detailed in the supplementary material.

Method LFW CFP-FP AgeDB-30
ArcFace 99.35 95.30 94.62
ArcFace+ 99.45 95.59 95.15
Table 6: Comparison of verification results (%).

From the results shown in Table 6, we can see that ArcFace+ outperforms ArcFace on all three verification datasets, by margins that are considered significant in the field of face verification. This comparison validates the effectiveness of MMA regularization in feature learning. Note that these results are obtained with the default coefficient 0.03; we expect that the results could improve further with hyperparameter tuning.

7 Conclusion

In this paper, we propose a novel regularization method for neural networks, called MMA regularization, to encourage the angularly uniform distribution of weight vectors and therefore decorrelate the filters or neurons. The MMA regularization has a stable and consistent gradient, is easy to implement with negligible computational overhead, and is effective for both the hidden layers and the output layer. Extensive experiments on image classification demonstrate that the MMA regularization enhances the generalization power of neural networks by considerable margins. Moreover, MMA regularization is also effective for feature learning, with significant margins, because it enlarges the inter-class separability. As MMA can be viewed as a basic regularization method for neural networks, we will explore the effectiveness of MMA regularization on other tasks, such as object detection, object tracking, and image captioning.


References

  • P. G. Adams (1997) A numerical approach to Tammes' problem in Euclidean n-space. Cited by: §3.1.
  • G. An (1996) The effects of adding noise during backpropagation training on a generalization performance. Neural computation 8 (3), pp. 643–674. Cited by: §2.
  • S. J. Axler (1997) Linear algebra done right. Vol. 2, Springer. Cited by: §3.1.
  • A. Brock, T. Lim, J. M. Ritchie, and N. Weston (2016) Neural photo editing with introspective adversarial networks. arXiv preprint arXiv:1609.07093. Cited by: §2.
  • M. I. Chelaru and V. Dragoi (2016) Negative correlations in visual cortical networks. Cerebral Cortex 26 (1), pp. 246–256. Cited by: §1.
  • M. Cogswell, F. Ahmed, R. Girshick, L. Zitnick, and D. Batra (2015) Reducing overfitting in deep networks by decorrelating representations. arXiv preprint arXiv:1511.06068. Cited by: §2.
  • J. Deng, J. Guo, N. Xue, and S. Zafeiriou (2019) ArcFace: additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4690–4699. Cited by: §3.1, §6.
  • T. Ericson and V. Zinoviev (2001) Codes on Euclidean spheres. Elsevier. Cited by: §3.1.
  • S. Han, J. Pool, S. Narang, H. Mao, E. Gong, S. Tang, E. Elsen, P. Vajda, M. Paluri, J. Tran, et al. (2016) DSD: dense-sparse-dense training for deep neural networks. arXiv preprint arXiv:1607.04381. Cited by: §1, §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §5.1.
  • G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §5.1.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §5.1.
  • A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §5.1.
  • A. Krogh and J. A. Hertz (1992) A simple weight decay can improve generalization. In Advances in neural information processing systems, pp. 950–957. Cited by: §2, §3.3.
  • Y. Le and X. Yang (2015) Tiny ImageNet visual recognition challenge. CS 231N. Cited by: §5.1.
  • C. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu (2015) Deeply-supervised nets. In Artificial Intelligence and Statistics, pp. 562–570. Cited by: §5.1.
  • H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf (2017) Pruning filters for efficient convnets. Cited by: §1.
  • R. Lin, W. Liu, Z. Liu, C. Feng, Z. Yu, J. M. Rehg, L. Xiong, and L. Song (2020) Regularizing neural networks via minimizing hyperspherical energy. arXiv preprint arXiv:1906.04892. Cited by: §1, §2, §2.
  • W. Liu, R. Lin, Z. Liu, L. Liu, Z. Yu, B. Dai, and L. Song (2018) Learning towards minimum hyperspherical energy. In Advances in neural information processing systems, pp. 6222–6233. Cited by: §1, §1, §2, §2, §5.2, §5.2.
  • W. Liu, Y. Zhang, X. Li, Z. Yu, B. Dai, T. Zhao, and L. Song (2017) Deep hyperspherical learning. In Advances in neural information processing systems, pp. 3950–3960. Cited by: §1, §2, §5.2.
  • I. Loshchilov and F. Hutter (2017) Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101. Cited by: §2.
  • L. Lovisolo and E. Da Silva (2001) Uniform distribution of points on a hyper-sphere with applications to vector bit-plane encoding. IEE Proceedings-Vision, Image and Signal Processing 148 (3), pp. 187–193. Cited by: §1, §2, §3.1.
  • A. S. Morcos, D. G. Barrett, N. C. Rabinowitz, and M. Botvinick (2018) On the importance of single directions for generalization. arXiv preprint arXiv:1803.06959. Cited by: §1.
  • N. Morgan and H. Bourlard (1990) Generalization and parameter estimation in feedforward nets: some experiments. In Advances in neural information processing systems, pp. 630–637. Cited by: §2.
  • O. R. Musin and A. S. Tarasov (2015) The Tammes problem for N = 14. Experimental Mathematics 24 (4), pp. 460–468. Cited by: §1.
  • A. E. Orhan and X. Pitkow (2017) Skip connections eliminate singularities. arXiv preprint arXiv:1701.09175. Cited by: §5.2, §5.3.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035. Cited by: §3.1, §5.2.
  • F. Pernici, M. Bruni, C. Baecchi, and A. Del Bimbo (2019) Fix your features: stationary and maximally discriminative embeddings using regular polytope (fixed classifier) networks. arXiv preprint arXiv:1902.10441. Cited by: §1.
  • M. D. Petković and N. Živić (2013) The Fekete problem and construction of the spherical coverage by cones. Facta universitatis-series: Mathematics and Informatics 28 (4), pp. 393–402. Cited by: §1, §3.1, §3.2.
  • J. D. Pintér (2001) Globally optimized spherical point arrangements: model variants and illustrative results. Annals of Operations Research 104 (1-4), pp. 213–230. Cited by: §3.1.
  • A. Prakash, J. Storer, D. Florencio, and C. Zhang (2019) RePr: improved training of convolutional filters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10666–10675. Cited by: §1, §2.
  • S. Qiao, Z. Lin, J. Zhang, and A. L. Yuille (2019) Neural rejuvenation: improving deep network training by enhancing computational resource utilization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 61–71. Cited by: §1, §2.
  • B. Recht, M. Fazel, and P. A. Parrilo (2010) Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM review 52 (3), pp. 471–501. Cited by: §2.
  • P. Rodríguez, J. Gonzàlez, G. Cucurull, J. M. Gonfaus, and F. X. Roca (2017) Regularizing CNNs with locally constrained decorrelations. In 5th International Conference on Learning Representations, ICLR 2017, Conference Track Proceedings. Cited by: §1, §1, §2, §5.2.
  • A. RoyChowdhury, P. Sharma, and E. G. Learned-Miller (2017) Reducing duplicate filters in deep neural networks. In NIPS workshop on Deep Learning: Bridging Theory and Practice. Cited by: §1, §3.1, §3.3.
  • D. E. Rumelhart, G. E. Hinton, and R. J. Williams (1986) Learning representations by back-propagating errors. Nature 323 (6088), pp. 533–536. Cited by: §3.1, §4.
  • W. Shang, K. Sohn, D. Almeida, and H. Lee (2016) Understanding and improving convolutional neural networks via concatenated rectified linear units. In International conference on machine learning, pp. 2217–2225. Cited by: §1.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §5.1.
  • N. J. A. Sloane, R. H. Hardin, W. D. Smith, et al. (2000) Tables of spherical codes. Cited by: §1, §3.1, §3.1, §4.
  • S. Smale (1998) Mathematical problems for the next century. The mathematical intelligencer 20 (2), pp. 7–15. Cited by: §1.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §2.
  • P. M. L. Tammes (1930) On the origin of number and arrangement of the places of exit on the surface of pollen-grains. Recueil des travaux botaniques néerlandais 27 (1), pp. 1–84. Cited by: §1, §2, §3.1.
  • J. J. Thomson (1904) XXIV. on the structure of the atom: an investigation of the stability and periods of oscillation of a number of corpuscles arranged at equal intervals around the circumference of a circle; with application of the results to the theory of atomic structure. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 7 (39), pp. 237–265. Cited by: §1, §2.
  • L. Wan, M. Zeiler, S. Zhang, Y. Le Cun, and R. Fergus (2013) Regularization of neural networks using dropconnect. In International conference on machine learning, pp. 1058–1066. Cited by: §2.
  • C. Wang, G. Zhang, and R. Grosse (2020) Picking winning tickets before training by preserving gradient flow. arXiv preprint arXiv:2002.07376. Cited by: §5.1.
  • A. S. Weigend, D. E. Rumelhart, and B. A. Huberman (1991) Generalization by weight-elimination with application to forecasting. In Advances in neural information processing systems, pp. 875–882. Cited by: §2.
  • D. Xie, J. Xiong, and S. Pu (2017a) All you need is beyond a good init: exploring better solution for training extremely deep convolutional neural networks with orthonormality and modulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6176–6185. Cited by: §1, §2, §5.2, §5.2.
  • P. Xie, Y. Deng, Y. Zhou, A. Kumar, Y. Yu, J. Zou, and E. P. Xing (2017b) Learning latent space models with angular constraints. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3799–3810. Cited by: §1, §2, §5.2.
  • P. Xie, B. Poczos, and E. P. Xing (2017c) Near-orthogonality regularization in kernel methods.. In UAI, Vol. 3, pp. 6. Cited by: §2.
  • S. Zagoruyko and N. Komodakis (2016) Wide residual networks. arXiv preprint arXiv:1605.07146. Cited by: §5.1.