1 Introduction
Although neural networks have achieved state-of-the-art results in a variety of tasks, they contain redundant neurons or filters due to over-parametrization (Shang et al., 2016; Li et al., 2017), which is prevalent in networks (RoyChowdhury et al., 2017). This redundancy can cause a network to capture only a limited set of directions in feature space and to generalize poorly (Morcos et al., 2018).
To address the redundancy problem and make neurons more discriminative, several methods have been developed to encourage angular diversity between pairwise weight vectors of neurons or filters in a layer. They can be categorized into three types. The first type reduces the redundancy by dropping some weights and then retraining them iteratively during optimization (Prakash et al., 2019; Han et al., 2016; Qiao et al., 2019), which suffers from a complex training scheme and a very long training phase. The second type is the widely used orthogonal regularization (Rodríguez et al., 2017; Xie et al., 2017b; Liu et al., 2017; Xie et al., 2017a), which adds a regularization term to the loss function to push pairwise weight vectors to be as orthogonal as possible. However, it has been proven that orthogonal regularization tends to group neurons closer, especially when the number of neurons is greater than the dimension (Liu et al., 2018), and therefore it only produces marginal improvements (Prakash et al., 2019). The third type also uses a regularization term, but encourages the weight vectors to be uniformly spaced by minimizing the hyperspherical potential energy (Liu et al., 2018; Lin et al., 2020), inspired by the Thomson problem (Thomson, 1904; Smale, 1998). Its disadvantage is that both the time complexity and the space complexity are very high (Liu et al., 2018), and it suffers from a huge number of local minima and stationary points due to its highly non-convex and non-linear objective function (Lin et al., 2020).
In this paper, we propose a simple, efficient, and effective method of angular diversity regularization which penalizes the minimum angles between pairwise weight vectors in each layer. Similar to the intuition of the third type mentioned above, the most diverse state is the one in which the normalized weight vectors are distributed uniformly on a hypersphere. To model the criterion of uniformity, we employ the well-known Tammes problem, that is, to find the arrangement of points on a unit sphere which maximizes the minimum distance between any two points (Tammes, 1930; Musin and Tarasov, 2015; Petković and Živić, 2013; Lovisolo and Da Silva, 2001; Pernici et al., 2019). However, optimal solutions for the Tammes problem are known only for some combinations of the number of points $n$ and dimensions $d$, which are collected on N.J.A. Sloane's homepage (Sloane et al., 2000), and obtaining a uniform distribution for an arbitrary combination of $n$ and $d$ is still an open mathematical problem (Musin and Tarasov, 2015). In this paper, we propose a numerical optimization method that obtains approximate solutions for the Tammes problem by maximizing the minimal pairwise angles between weight vectors, abbreviated as MMA. We further develop the MMA regularization for neural networks to promote the angular diversity of weight vectors in each layer and thus improve the generalization performance.

There are several advantages of MMA regularization: (a) As analyzed in Section 3.2, the gradient of the MMA loss is stable and consistent, so it is easy to optimize and reaches near-optimal solutions for the Tammes problem, as shown in Table 1; (b) As verified in Table 3, the MMA regularization is easy to implement and adds negligible computational overhead while delivering considerable performance improvements; (c) The MMA regularization is effective for both the hidden layers and the output layer, decorrelating the filters and enlarging the inter-class separability respectively. Therefore, it can be applied to multiple tasks, such as the image classification and face verification tasks demonstrated in this paper.

To give an intuition for the effectiveness of MMA regularization, we visualize the cosine similarity of filters from the first layer of VGG19-BN trained on CIFAR100 in Figure 1. We compare several methods of angular diversity regularization, including the orthogonal regularization of Rodríguez et al. (2017), the MHE regularization of Liu et al. (2018), and the proposed MMA regularization. The results show that the MMA regularization yields the most uncorrelated filters. Besides, the MMA regularization keeps some negative correlations, which have been verified to be beneficial for neural networks (Chelaru and Dragoi, 2016).

In summary, the main contributions of this paper are threefold:

We propose a numerical method for the Tammes problem, called MMA, which can get near optimal solutions under arbitrary combinations of the number of points and dimensions.

We develop the novel MMA regularization which effectively promotes the angular diversity of weight vectors and therefore improves the generalization power of neural networks.

Various experiments on multiple tasks show that MMA regularization is generally effective and can become a basic regularization method for training neural networks.
2 Related Work
To improve the generalization power of neural networks, many regularization methods have been proposed, such as weight decay (Krogh and Hertz, 1992), decoupled weight decay (Loshchilov and Hutter, 2017), weight elimination (Weigend et al., 1991), nuclear norm (Recht et al., 2010), dropout (Srivastava et al., 2014), dropconnect (Wan et al., 2013), adding noise (An, 1996), and early stopping (Morgan and Bourlard, 1990).
Recently, diversity-promoting regularization approaches have been emerging. These methods mainly penalize the neural networks by adding a regularization term to the loss function. The regularization term either promotes the diversity of activations by minimizing the cross-covariance of hidden activations (Cogswell et al., 2015), or directly promotes the diversity of neurons or filters by enforcing pairwise orthogonality (Rodríguez et al., 2017; Xie et al., 2017b; Liu et al., 2017; Xie et al., 2017a) or minimizing the global potential energy (Liu et al., 2018; Lin et al., 2020). For many tasks, these methods obtain marginal improvements (Rodríguez et al., 2017; Prakash et al., 2019; Xie et al., 2017c; Brock et al., 2016). Another stream of approaches obtains comparatively diverse neurons or filters by cyclically dropping and relearning some of the weights (Prakash et al., 2019; Han et al., 2016; Qiao et al., 2019), which leads to substantial performance gains but suffers from complex training. In contrast, our proposed MMA regularization is simple and achieves significant performance improvements while employing the standard training procedures.
The work most closely related to our method is MHE (Liu et al., 2018), which also targets a uniform distribution of normalized weight vectors on a hypersphere. However, MHE is inspired by the Thomson problem (Thomson, 1904) and models the criterion of uniformity as the minimum global potential energy, which suffers from high computational complexity and many local minima (Lin et al., 2020). Inspired by the Tammes problem (Tammes, 1930; Lovisolo and Da Silva, 2001), our proposed MMA regularization instead models the criterion as maximizing the minimum angles. This is the key reason why our method is more efficient and effective.
3 MMA Regularization
As our proposed regularization is inspired by the Tammes problem, we first analyze the Tammes problem and propose a numerical method, called MMA, which maximizes the minimal pairwise angles between the vectors. We then compare several numerical methods for the Tammes problem through gradient analysis, which demonstrates the advantage of the proposed MMA. Finally, we develop a novel angular diversity regularization for neural networks based on the proposed MMA.
3.1 The Tammes Problem and Proposed Numerical Method MMA
Construction of points spaced uniformly on a unit hypersphere is an important problem for various applications ranging from coding theory to computational geometry (Petković and Živić, 2013). There are many ways to model the criterion of uniformity. One approach is to maximize the minimal pairwise distance between the points (Petković and Živić, 2013), i.e.
$$\hat{w}^{*}_{1},\dots,\hat{w}^{*}_{n} = \arg\max_{\hat{w}_{1},\dots,\hat{w}_{n}} \; \min_{i \neq j} \left\| \hat{w}_i - \hat{w}_j \right\| \tag{1}$$
where $w_i \in \mathbb{R}^d$ denotes the coordinate vector of the $i$-th point, $\hat{w}_i = w_i / \|w_i\|$ denotes the normalized vector, and $\|\cdot\|$ denotes the Euclidean norm. This criterion means the points on a unit sphere are spaced uniformly when the minimal pairwise distance is maximized, which is known as the Tammes problem (Tammes, 1930; Lovisolo and Da Silva, 2001) or the optimal spherical code (Ericson and Zinoviev, 2001; Sloane et al., 2000). Denoting the dimension by $d$ and the number of points by $n$, we first analyze the analytical solutions for the case of $n \le d+1$, and then propose the numerical solutions for the case of $n > d+1$.
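The analytical solutions for $n \le d+1$. As the distance between any two points on a unit hypersphere decreases monotonically with their cosine similarity ($\|\hat{w}_i - \hat{w}_j\|^2 = 2 - 2\,\hat{w}_i^\top \hat{w}_j$), the Tammes problem is equivalent to minimizing the maximal pairwise cosine similarity, i.e.
$$\min_{\hat{w}_{1},\dots,\hat{w}_{n}} \; \max_{i \neq j} \; \hat{w}_i^\top \hat{w}_j \tag{2}$$
The maximum of the pairwise cosine similarities is at least as large as their average. Therefore, a lower bound is derived as:
$$\max_{i \neq j} \hat{w}_i^\top \hat{w}_j \;\ge\; \frac{1}{n(n-1)} \sum_{i \neq j} \hat{w}_i^\top \hat{w}_j \;=\; \frac{\left\| \sum_{i=1}^{n} \hat{w}_i \right\|^2 - n}{n(n-1)} \;\ge\; -\frac{1}{n-1} \tag{3}$$
Therefore, the minimum of the maximal pairwise cosine similarity is $-\frac{1}{n-1}$, which is reached when all pairwise angles between the points are equal to each other and the sum of all vectors is the zero vector. This criterion has a matrix form:
$$\hat{W}\hat{W}^\top = \frac{n}{n-1}\, I - \frac{1}{n-1}\, \mathbf{1}\mathbf{1}^\top \tag{4}$$
that is, a matrix with ones on the diagonal and $-\frac{1}{n-1}$ everywhere else, where $\hat{W} = [\hat{w}_1, \dots, \hat{w}_n]^\top \in \mathbb{R}^{n \times d}$ denotes the set of $n$ normalized points. According to matrix theory, the eigenvalues of $\hat{W}\hat{W}^\top$ are $0$ with algebraic multiplicity $1$ and $\frac{n}{n-1}$ with algebraic multiplicity $n-1$. As all the eigenvalues are greater than or equal to zero, $\hat{W}\hat{W}^\top$ is a positive semidefinite matrix. According to the spectral theorem (Axler, 1997), $\hat{W}$ can be obtained through the eigendecomposition of $\hat{W}\hat{W}^\top$, which is the analytical solution for the Tammes problem. However, since the rank of $\hat{W}\hat{W}^\top$ is $n-1$, the rank of $\hat{W}$ and the minimum dimension of the points are also $n-1$. Therefore, this analytical solution only exists for the case of $n \le d+1$.

The numerical solutions for $n > d+1$. So far, for the case of $n > d+1$, optimal solutions for the Tammes problem are known only for some combinations of $n$ and $d$ (Sloane et al., 2000). For most combinations, no optimal solution is known. Consequently, numerical methods are used to get approximate solutions.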
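The equal-angle configuration of the analytical case can be checked numerically. The sketch below is an illustration, not the eigendecomposition route described above: it uses a different, well-known construction (the standard basis vectors with their centroid subtracted), which gives $n$ unit vectors that span an $(n-1)$-dimensional subspace of $\mathbb{R}^n$. The function names are hypothetical.

```python
import math

def simplex_points(n):
    """Return n unit vectors with equal pairwise angles (a regular simplex).

    The vectors live in R^n but span only an (n-1)-dimensional subspace,
    which matches the rank argument for the case n <= d + 1.
    """
    points = []
    for i in range(n):
        # Standard basis vector e_i minus the centroid (1/n, ..., 1/n).
        v = [(1.0 if k == i else 0.0) - 1.0 / n for k in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        points.append([x / norm for x in v])
    return points

def pairwise_cosines(points):
    """Cosine similarity of every unordered pair of unit vectors."""
    n = len(points)
    return [sum(a * b for a, b in zip(points[i], points[j]))
            for i in range(n) for j in range(i + 1, n)]

points = simplex_points(4)
cosines = pairwise_cosines(points)            # each should equal -1/(n-1)
angle_deg = math.degrees(math.acos(cosines[0]))
```

For $n = 4$ every pairwise cosine equals $-1/3$, so the common angle is $\arccos(-1/3)$, roughly 109.47 degrees.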
As the objective (Equation 1) of the Tammes problem is not globally differentiable (Pintér, 2001), the conventional solution (Adams, 1997) instead optimizes a differentiable potential energy function to get approximate solutions, as discussed in the next subsection. Nonetheless, with the help of SGD (Rumelhart et al., 1986) and a modern automatic differentiation library (Paszke et al., 2019), we can now optimize Equation (1) directly to get approximate solutions. However, computing the Euclidean distances is expensive. Alternatively, as mentioned in Equation (2), we can use the cosine similarity as the objective function, called the cosine loss, which is formulated as follows:
$$l_{cosine} = \frac{1}{n} \sum_{i=1}^{n} \max_{j,\, j \neq i} C_{ij}, \qquad C = \hat{W}\hat{W}^\top \tag{5}$$
where $C$ denotes the cosine similarity matrix of the points. Employing the global maximum similarity as in Equation (2) is inefficient, as it only updates the closest pair of points. Therefore, we instead use the average of each vector's maximum similarity.
The cosine loss can be optimized quickly by taking advantage of its matrix form. However, we find that this loss is hard to converge, especially when $\hat{w}_i$ is very close to $\hat{w}_j$, which is very prevalent in neural networks (RoyChowdhury et al., 2017). As analyzed in the next subsection, this is because the gradient is too small to overcome random fluctuations during the optimization. Gaining insight from ArcFace (Deng et al., 2019), we propose the angular version of the cosine loss as the objective function:
$$l_{MMA} = -\frac{1}{n} \sum_{i=1}^{n} \min_{j,\, j \neq i} \theta_{ij}, \qquad \theta = \arccos\!\left(\hat{W}\hat{W}^\top\right) \tag{6}$$
where $\theta$ denotes the pairwise angle matrix. As this loss maximizes the minimal pairwise angles, we name it the MMA loss. The MMA loss is very efficient and robust to optimize, so it is easy to get near-optimal numerical solutions for the Tammes problem. Besides, it also reaches close solutions for the case $n \le d+1$, which is validated in Section 4. In the next subsection, we demonstrate the advantage of the proposed MMA loss through gradient analysis and comparison.
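The two losses referred to as Equation (5) and Equation (6) can be sketched in a few lines. This is a dependency-free illustration under the assumption that each row of the input is one weight vector. It is not the authors' implementation, which would normally use an automatic differentiation library, and the function names are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_matrix(weights):
    """Pairwise cosine similarities of the row vectors."""
    unit = [normalize(w) for w in weights]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]

def cosine_loss(weights):
    """Mean over rows of the largest off-diagonal cosine similarity."""
    c = cosine_matrix(weights)
    n = len(c)
    return sum(max(c[i][j] for j in range(n) if j != i) for i in range(n)) / n

def mma_loss(weights):
    """Negative mean over rows of the smallest off-diagonal angle (radians)."""
    c = cosine_matrix(weights)
    n = len(c)
    total = 0.0
    for i in range(n):
        # arccos is decreasing, so the smallest angle has the largest cosine.
        max_cos = max(c[i][j] for j in range(n) if j != i)
        total += math.acos(max(-1.0, min(1.0, max_cos)))  # clamp rounding error
    return -total / n

axes = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
clustered = [[1.0, 0.0, 0.0], [1.0, 0.1, 0.0], [0.0, 0.0, 1.0]]
```

For the mutually orthogonal rows in `axes`, the cosine loss is 0 and the MMA loss is $-\pi/2$ regardless of the row norms. The `clustered` rows contain two nearly parallel vectors and therefore give a larger (worse) MMA loss.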
3.2 The Gradient Analysis
In this subsection, we analyze and compare the gradients of loss functions that generate approximate solutions for uniformly spaced points. To simplify the derivation, we only consider the norm of the gradient of the core function that composes the summation in each loss function, w.r.t. the corresponding weight vector $w_i$. For intuitive comparison, the analysis results are presented in Figure 2.
For the cosine loss in Equation (5), the gradient norm is derived as follows:
$$\left\| \frac{\partial \cos\theta_{ij}}{\partial w_i} \right\| = \left\| \frac{(I - M_{w_i})\, \hat{w}_j}{\|w_i\|} \right\| = \frac{\sin\theta_{ij}}{\|w_i\|} \tag{7}$$
where $M_{w_i} = \hat{w}_i \hat{w}_i^\top$ represents the projection matrix of $w_i$. From the above derivation and Figure 2, we can see that the gradient norm is very small when the pairwise angle $\theta_{ij}$ is close to zero. That is why the cosine loss is hard to converge when $\hat{w}_i$ and $\hat{w}_j$ are close to each other, as shown experimentally in Section 4. Next, we derive the gradient norm corresponding to the MMA loss in Equation (6):
$$\left\| \frac{\partial (-\theta_{ij})}{\partial w_i} \right\| = \frac{1}{\sin\theta_{ij}} \left\| \frac{\partial \cos\theta_{ij}}{\partial w_i} \right\| = \frac{1}{\|w_i\|} \tag{8}$$
Compared to the gradient norm of the cosine loss in Equation (7), the gradient norm of the MMA loss is independent of the pairwise angle $\theta_{ij}$, so it does not become very small even when $\theta_{ij}$ approaches zero. Figure 2 shows that the gradient norm corresponding to the MMA loss is stable and consistent. Therefore, the MMA loss is easy to optimize and reaches near-optimal solutions for the Tammes problem, as verified by the experiments in Section 4.
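The two gradient norms discussed above can be checked with central finite differences. The sketch assumes a two-dimensional toy setting in which $w_j$ is fixed on the x-axis and $w_i$ has norm `radius` at a chosen angle. The helper names are hypothetical and the check is only illustrative.

```python
import math

def cos_angle(wi, wj):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(wi, wj))
    ni = math.sqrt(sum(a * a for a in wi))
    nj = math.sqrt(sum(b * b for b in wj))
    return dot / (ni * nj)

def neg_angle(wi, wj):
    """Core term of the MMA loss: the negative pairwise angle in radians."""
    return -math.acos(max(-1.0, min(1.0, cos_angle(wi, wj))))

def grad_norm(f, wi, wj, h=1e-6):
    """Norm of the central finite-difference gradient of f w.r.t. wi."""
    grad = []
    for k in range(len(wi)):
        plus = list(wi)
        minus = list(wi)
        plus[k] += h
        minus[k] -= h
        grad.append((f(plus, wj) - f(minus, wj)) / (2.0 * h))
    return math.sqrt(sum(g * g for g in grad))

radius = 2.0          # ||w_i||
wj = [1.0, 0.0]
results = {}
for theta_deg in (5.0, 45.0, 90.0):
    t = math.radians(theta_deg)
    wi = [radius * math.cos(t), radius * math.sin(t)]
    results[theta_deg] = (grad_norm(cos_angle, wi, wj),
                          grad_norm(neg_angle, wi, wj))
```

For each angle, the first entry should be close to $\sin\theta_{ij} / \|w_i\|$, which shrinks as the angle approaches zero, while the second stays close to $1 / \|w_i\|$.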
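In addition to the above two loss functions, we also analyze the Riesz-Fisher loss (Petković and Živić, 2013) and the logarithmic loss (Petković and Živić, 2013), which are often used to get uniformly distributed points on a hypersphere. The philosophy behind the two loss functions is that the points on a hypersphere are uniformly spaced when the potential energy is minimal, and both of them are formulated as kernel functions of the potential energy. The Riesz-Fisher loss and the corresponding gradient norm are:
$$l_{RF} = \frac{1}{n(n-1)} \sum_{i \neq j} \left\| \hat{w}_i - \hat{w}_j \right\|^{-s}, \qquad s > 0 \tag{9}$$
$$\left\| \frac{\partial \left\| \hat{w}_i - \hat{w}_j \right\|^{-s}}{\partial w_i} \right\| = \frac{s \cos(\theta_{ij}/2)}{\|w_i\| \left( 2 \sin(\theta_{ij}/2) \right)^{s+1}} \tag{10}$$
where $s$ is a hyperparameter, and is set to 1 in Figure 2 for easy comparison. The logarithmic loss and the corresponding gradient norm are:
$$l_{log} = \frac{1}{n(n-1)} \sum_{i \neq j} \log \frac{1}{\left\| \hat{w}_i - \hat{w}_j \right\|} \tag{11}$$
$$\left\| \frac{\partial \log \left\| \hat{w}_i - \hat{w}_j \right\|^{-1}}{\partial w_i} \right\| = \frac{\cos(\theta_{ij}/2)}{2\, \|w_i\| \sin(\theta_{ij}/2)} \tag{12}$$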
Due to limited space, more details of the derivation are presented in the supplementary material. As visualized in Figure 2, the Riesz-Fisher loss and the logarithmic loss have similar properties: the gradient norm is very large for angles near zero and drops rapidly as the angle increases. Besides, the greater the $s$ of the Riesz-Fisher loss is, the sharper the gradient norm becomes. The very large gradient norm near zero can cause instability and prevent the normal learning of neural networks, and the very small gradient norm away from zero makes the updates inefficient. We argue that this is why the two loss functions only reach inaccurate solutions for the Tammes problem in Section 4 and do not perform as well in terms of accuracy in Table 3.
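The qualitative behaviour described above can be tabulated from closed-form expressions. The sketch assumes the kernels $\|\hat{w}_i - \hat{w}_j\|^{-s}$ (Riesz-Fisher) and $\log(1/\|\hat{w}_i - \hat{w}_j\|)$ (logarithmic) and uses the chord length $\|\hat{w}_i - \hat{w}_j\| = 2\sin(\theta_{ij}/2)$. The Riesz-Fisher and logarithmic expressions are derived here from those assumptions rather than copied from the supplementary material, and the function name is hypothetical.

```python
import math

def gradient_norms(theta, s=1.0, weight_norm=1.0):
    """Gradient norm of each core function w.r.t. w_i at angle theta (radians)."""
    chord = 2.0 * math.sin(theta / 2.0)      # ||w_i_hat - w_j_hat||
    half_cos = math.cos(theta / 2.0)
    return {
        "cosine": math.sin(theta) / weight_norm,
        "mma": 1.0 / weight_norm,
        "riesz_fisher": s * half_cos / (chord ** (s + 1.0) * weight_norm),
        "logarithmic": half_cos / (chord * weight_norm),
    }

near = gradient_norms(math.radians(1.0))     # almost parallel vectors
far = gradient_norms(math.radians(120.0))    # well separated vectors
```

At one degree the cosine term is below 0.02 while the Riesz-Fisher term (with $s = 1$) is in the thousands. At 120 degrees both potential-energy terms fall below 0.3, and the MMA term equals 1 in both cases.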
3.3 MMA Regularization for Neural Networks
In this subsection, we develop the MMA regularization for neural networks, which promotes learning towards weight vectors that are uniformly distributed in angular space. For $n \le d+1$, we could employ the cosine similarity matrix in Equation (4) to constrain the weights. However, as the MMA loss generates accurate approximate solutions in any case and is easy to implement, we use the MMA loss in Equation (6) uniformly as the angular regularization:
$$l_{MMA\_regularization} = \lambda \sum_{i=1}^{L} l_{MMA}(W_i) \tag{13}$$
where $\lambda$ denotes the regularization coefficient, $L$ denotes the total number of layers, including convolutional layers and fully connected layers, and $W_i$ denotes the weight matrix of the $i$-th layer, with each row denoting a vectorized filter or neuron.
The MMA regularization is complementary and orthogonal to weight decay (Krogh and Hertz, 1992). Weight decay regularizes the Euclidean norm of weight vectors, while MMA regularization promotes the directional diversity of weight vectors. MMA regularization can be applied to both the hidden layers and the output layer. For hidden layers, MMA regularization can reduce the redundancy of filters, which is very common in neural networks (RoyChowdhury et al., 2017). Consequently, the unnecessary overlap in the features captured by the network's filters is diminished. For the output layer, MMA regularization can maximize the inter-class separability and therefore enhance the discriminative power of neural networks.
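The layer-wise regularizer described above can be sketched as follows: each filter or neuron is flattened into one row, the MMA loss is evaluated per layer, and the results are summed and scaled by the coefficient. The nested-list layout of the weights and all function names are assumptions made for this dependency-free illustration. A real training setup would compute the same quantity on framework tensors so that gradients are available.

```python
import math

def flatten(x):
    """Flatten an arbitrarily nested list of numbers into one flat list."""
    if isinstance(x, (int, float)):
        return [float(x)]
    flat = []
    for item in x:
        flat.extend(flatten(item))
    return flat

def mma_loss(rows):
    """Negative mean of each row's minimal angle (radians) to any other row."""
    unit = []
    for r in rows:
        norm = math.sqrt(sum(v * v for v in r))
        unit.append([v / norm for v in r])
    total = 0.0
    for i, u in enumerate(unit):
        max_cos = max(sum(a * b for a, b in zip(u, v))
                      for j, v in enumerate(unit) if j != i)
        total += math.acos(max(-1.0, min(1.0, max_cos)))
    return -total / len(unit)

def mma_regularization(layer_weights, coefficient=0.07):
    """Sum the MMA loss over layers and scale it by the coefficient."""
    penalty = 0.0
    for weight in layer_weights:
        rows = [flatten(unit) for unit in weight]  # one row per filter or neuron
        penalty += mma_loss(rows)
    return coefficient * penalty

# Hypothetical toy layers: two 1x2x2 convolution filters and a 3x2 dense layer.
conv = [
    [[[1.0, 0.0], [0.0, 0.0]]],
    [[[0.0, 1.0], [0.0, 0.0]]],
]
fc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
penalty = mma_regularization([conv, fc], coefficient=0.07)
```

In this toy case every row's nearest neighbour is at 90 degrees in both layers, so the penalty equals $0.07 \times (-\pi)$. The value is added to the task loss, and because it only depends on directions, rescaling a filter leaves it unchanged.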
4 Experiments for the Tammes Problem
This section compares several numerical methods for the Tammes problem, measured by the minimum angle, as shown in Table 1. The first column denotes the dimension $d$ and the second column denotes the number of points $n$. The third column gives the minimal pairwise angle of the optimal solutions collected in (Sloane et al., 2000). The remaining columns give the minimum angle obtained by several numerical methods: the MMA loss in Equation (6), the cosine loss in Equation (5), the Riesz-Fisher loss (with hyperparameter $s$) in Equation (9), and the logarithmic loss in Equation (11). The weights are initialized with values drawn from the standard normal distribution and then optimized by SGD (Rumelhart et al., 1986) for 10000 iterations. The initial learning rate is set to 0.1 and reduced by a factor of 5 once learning stagnates, and the momentum is set to 0.9.

Table 1: Minimum pairwise angle (degrees) obtained by each numerical method, compared with the optimal value.
d  n    optimal  MMA    cosine  Riesz-Fisher  logarithmic
3  4    109.5    109.5  109.5   109.4         109.5
3  30   38.6     38.5   0       34.9          35.4
3  130  18.5     17.6   0       12.6          16.7
4  5    104.5    104.5  104.5   104.5         104.5
4  30   54.3     54.0   53.7    49.3          48.6
4  130  33.4     32.0   32.1    27.9          27.2
4  600  19.8     19.3   0       15.9          13.1
5  6    101.5    101.5  101.5   101.2         101.5
5  30   65.6     65.5   64.0    57.1          60.0
5  130  43.8     42.9   42.5    35.8          35.6
For $n \le d+1$, as analyzed in Section 3.1, each pairwise angle of the optimal solutions is $\arccos\left(-\frac{1}{n-1}\right)$, which is verified by the rows ($d$=3, $n$=4), ($d$=4, $n$=5), and ($d$=5, $n$=6). In these rows the optimal solutions are reached, to within 0.3 degrees, by all four numerical methods. For $n > d+1$, all the numerical solutions are somewhat worse than the optimal solutions. However, the MMA loss robustly obtains the solutions closest to the optimum. The cosine loss can also achieve very close solutions, but it is not robust when there are many points, as in the rows ($d$=3, $n$=30), ($d$=3, $n$=130), and ($d$=4, $n$=600), where its minimum angle collapses to 0. This is due to the very small gradient analyzed in Section 3.2. The Riesz-Fisher loss and the logarithmic loss are also robust, but they converge to solutions far from the optimum.
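The experiment above can be mimicked at toy scale. The sketch below minimizes the MMA loss for $d = 3$, $n = 4$ with a hand-written subgradient. It deliberately simplifies the setup described in this section: it uses plain gradient descent with an exponentially decaying step size instead of SGD with momentum and a step schedule, and only 2000 iterations. All names and hyperparameters are assumptions for illustration.

```python
import math
import random

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def min_angle_deg(points):
    """Smallest pairwise angle of the point set, in degrees."""
    unit = [normalize(p) for p in points]
    best = math.pi
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            c = sum(a * b for a, b in zip(unit[i], unit[j]))
            best = min(best, math.acos(max(-1.0, min(1.0, c))))
    return math.degrees(best)

def mma_gradient(points):
    """Subgradient of the MMA loss with respect to every point."""
    n, d = len(points), len(points[0])
    unit = [normalize(p) for p in points]
    norms = [math.sqrt(sum(x * x for x in p)) for p in points]
    grad = [[0.0] * d for _ in range(n)]
    for i in range(n):
        # The nearest neighbour of point i is the one with the largest cosine.
        cos_ij, j = max((sum(a * b for a, b in zip(unit[i], unit[k])), k)
                        for k in range(n) if k != i)
        cos_ij = max(-1.0, min(1.0, cos_ij))
        sin_ij = max(math.sqrt(1.0 - cos_ij * cos_ij), 1e-8)
        # The term -theta_ij / n depends on both w_i and w_j.
        for a, b in ((i, j), (j, i)):
            for k in range(d):
                grad[a][k] += ((unit[b][k] - cos_ij * unit[a][k])
                               / (norms[a] * sin_ij * n))
    return grad

def solve_tammes(n, d, steps=2000, lr=0.1, decay=0.998, seed=0):
    """Return the minimum angle (degrees) before and after optimization."""
    rng = random.Random(seed)
    points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    start = min_angle_deg(points)
    for _ in range(steps):
        grad = mma_gradient(points)
        points = [[p - lr * g for p, g in zip(pt, gr)]
                  for pt, gr in zip(points, grad)]
        lr *= decay
    return start, min_angle_deg(points)

start_deg, final_deg = solve_tammes(n=4, d=3)
```

The final minimum angle cannot exceed the optimum $\arccos(-1/3)$, about 109.47 degrees, and how closely a run approaches it depends on the step-size schedule and the seed.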
5 Experiments on Image Classification
5.1 Implementation Settings
We conduct image classification experiments on CIFAR100 (Krizhevsky and Hinton, 2009) and TinyImageNet (Le and Yang, 2015). For both datasets, we follow the simple data augmentation in (Lee et al., 2015). We employ various classic networks as the backbone networks, including ResNet56 (He et al., 2016), VGG19 (Simonyan and Zisserman, 2014) with batch normalization (Ioffe and Szegedy, 2015) denoted by VGG19-BN, VGG16 with batch normalization denoted by VGG16-BN, WideResNet (Zagoruyko and Komodakis, 2016) with 16 layers and a widen factor of 8 denoted by WRN-16-8, and DenseNet (Huang et al., 2017) with 40 layers and a growth rate of 12 denoted by DenseNet-40-12. We denote the MMA-regularized version of a model X by X-MMA. For a fair comparison, both the X-MMA models and the corresponding backbones are trained from scratch, so our results may be slightly different from the ones presented in the original papers due to different random seeds and hardware settings. For CIFAR100, the hyperparameters and settings are the same as in the original papers. For TinyImageNet, we follow the settings in (Wang et al., 2020). All random seeds are fixed, so that the experiments are reproducible and each backbone is compared with its MMA version under identical conditions. Moreover, in order to reduce the variance of evaluation, we use the average accuracy of the last five epochs as the evaluation criterion.
5.2 Ablation Study
To understand the behavior of MMA regularization, we conduct comprehensive ablation experiments on CIFAR100. Unless otherwise noted, we use VGG19-BN for the ablation experiments.
Impact of the hyperparameter. The MMA regularization coefficient is the only hyperparameter. As skip connections implicitly promote the angular diversity of neurons (Orhan and Pitkow, 2017), we select VGG19-BN and ResNet20 to investigate the impact of different coefficients for models without and with skip connections respectively, as shown in the two hyperparameter figures (one per model). From both figures, we can see that the effect of MMA regularization with too small a coefficient is not obvious, whereas too large a coefficient improves the performance only slightly or even decreases it. This is because overly strong regularization hinders the normal learning of neural networks to some extent. For VGG19-BN, MMA regularization is not very sensitive to the hyperparameter and works well from 0.03 to 0.2, which supports the effectiveness of MMA regularization. ResNet20 is more sensitive to the coefficient because of the skip connections. In the following experiments, we set the MMA regularization coefficient to 0.07 for VGG models and 0.03 for models with skip connections.
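Table 2: Accuracy (%) of VGG19-BN on CIFAR100 with MMA regularization applied to the hidden layers only (hidden) and to all layers (hidden+output).
Model          TOP-1  TOP-5
baseline       72.08  90.50
hidden         73.45  90.91
hidden+output  73.73  91.21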
Effectiveness for hidden layers and output layer. The MMA regularization is applicable to both the hidden layers and the output layer. In Table 2, we study the effect of MMA regularization applied to the hidden layers (hidden) and to all layers (hidden+output). The results show that the hidden version improves over the VGG19-BN baseline by a considerable margin and that the hidden+output version improves the performance further. This indicates that the MMA regularization is effective for both the hidden layers and the output layer, and that the effects accumulate. As analyzed in Section 3.3, the effectiveness for hidden layers comes from decorrelating the filters or neurons, and the effectiveness for the output layer comes from enlarging the inter-class separability.
Comparison with other angular regularization. This section compares several angular regularization methods in terms of computation time per batch, memory usage, accuracy, and the minimum pairwise angles of several layers, as shown in Table 3. Besides the MMA regularization, we also consider the MHE regularization (Liu et al., 2018) and the widely used orthogonal regularization (Rodríguez et al., 2017; Xie et al., 2017b; Liu et al., 2017; Xie et al., 2017a), which also penalize the pairwise angles. MHE uses the Riesz-Fisher loss ($s>0$) or the logarithmic loss ($s=0$) as its regularization term (Liu et al., 2018). The orthogonal regularization pushes all pairwise weight vectors to be orthogonal. Here, we adopt the orthogonal regularization in (Rodríguez et al., 2017):
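$$l_{orthogonal} = \lambda \sum_{i=1}^{L} \left\| \hat{W}_i \hat{W}_i^\top - I \right\|_F^2 \tag{14}$$
where $\hat{W}_i$ denotes the normalized weight matrix of the $i$-th layer, $I$ denotes the identity matrix, and $\|\cdot\|_F$ denotes the Frobenius norm.

Table 3: Comparison of angular regularization methods with VGG19-BN on CIFAR100. The last four columns are minimum pairwise angles in degrees.
Regularization  Time (s/batch)  Memory (MiB)  TOP-1 (%)  TOP-5 (%)  L33   L43   L53   Classify
baseline        0.070           1127          72.08      90.50      70.1  16.0  30.8  54.0
MMA             0.095           1229          73.73      91.21      85.7  84.7  85.3  84.9
MHE(s=2)        0.259           5551          71.91      90.76      74.2  58.1  68.5  71.3
MHE(s=0)        0.253           5551          72.08      90.72      69.6  48.6  57.8  63.4
orthogonal      0.096           1237          72.83      90.98      79.3  75.3  81.2  63.1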
The coefficient is set to 0.07 for MMA, 1.0 for MHE (Liu et al., 2018), and 0.0001 for orthogonal regularization (Xie et al., 2017a). Due to the limit of GPU memory, we employ the mini-batch version of MHE (Liu et al., 2018), which iteratively takes a random batch (here, 30% of the total) of weight vectors to calculate the loss. For the comparison of the minimal pairwise angle, we select the third layer of the third block, fourth block, and fifth block, and the classification layer, which are denoted by L33, L43, L53, and Classify respectively. These experiments are based on PyTorch (Paszke et al., 2019) and an NVIDIA GeForce GTX 1080 GPU.

Compared to the baseline, the MMA regularization and the orthogonal regularization slightly increase the computation time and memory usage. The MHE regularization, however, increases both greatly due to the computation of all the pairwise distances. In terms of accuracy, the MMA regularization improves over the baseline by a substantial margin. The orthogonal regularization is also effective but inferior to the MMA regularization. The MHE regularization is only comparable to the baseline, which may be because of the unstable gradient analyzed in Section 3.2. We also observe that, for the two methods that improve accuracy, larger minimal pairwise angles in the hidden layers go together with higher accuracy. A plausible explanation is that a larger minimal angle means more diverse filters, which improves the generalizability of models. The MMA regularization is also the most effective at enlarging the minimal pairwise angle of the classification layer, which increases the inter-class separability and enhances the discriminative power of neural networks. More plots and comparisons of the minimal pairwise angles are shown in the supplementary material.
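For reference, the per-layer orthogonal penalty used as a baseline above can be sketched as the squared Frobenius norm of the Gram matrix of the normalized rows minus the identity. The function name and the toy inputs are assumptions for this illustration, and the coefficient and the sum over layers are omitted.

```python
import math

def orthogonal_penalty(weights):
    """Squared Frobenius norm of (W_hat W_hat^T - I) for one layer."""
    unit = []
    for w in weights:
        norm = math.sqrt(sum(x * x for x in w))
        unit.append([x / norm for x in w])
    penalty = 0.0
    for i, u in enumerate(unit):
        for j, v in enumerate(unit):
            gram = sum(a * b for a, b in zip(u, v))
            target = 1.0 if i == j else 0.0
            penalty += (gram - target) ** 2
    return penalty

# Two orthogonal rows in 2-D: the penalty can reach zero.
orthogonal_rows = [[1.0, 0.0], [0.0, 3.0]]
# Three rows in 2-D, spread evenly at 120 degrees: more vectors than dimensions.
overcomplete_rows = [[1.0, 0.0],
                     [-0.5, math.sqrt(3) / 2],
                     [-0.5, -math.sqrt(3) / 2]]
```

The second input shows why orthogonality is an imperfect target when a layer has more vectors than dimensions: even the evenly spread configuration keeps a penalty of 1.5, because its six off-diagonal cosines all equal $-0.5$.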
5.3 Results and Analysis
Table 4: Accuracy (%) of backbone models and their MMA versions on CIFAR100.
Model               TOP-1  TOP-5
ResNet56            70.39  91.12
ResNet56-MMA        70.90  91.25
VGG19-BN            72.08  90.50
VGG19-BN-MMA        73.73  91.21
WRN-16-8            78.97  94.84
WRN-16-8-MMA        79.34  95.05
DenseNet-40-12      73.98  92.74
DenseNet-40-12-MMA  74.61  92.77
We first compare various modern architectures with their MMA-regularized versions on CIFAR100. From the results shown in Table 4, we can see that X-MMA consistently improves over the corresponding backbone model. In particular, MMA regularization improves the TOP-1 accuracy of VGG19-BN by 1.65 percentage points. MMA regularization also robustly improves the performance of models with skip connections such as ResNet, DenseNet, and WideResNet, although the improvement is not as pronounced as for VGG. This is because skip connections already reduce feature correlations implicitly to some extent (Orhan and Pitkow, 2017).
Table 5: Accuracy (%) of backbone models and their MMA versions on TinyImageNet.
Model         TOP-1  TOP-5
ResNet56      54.80  78.71
ResNet56-MMA  55.24  78.92
VGG16-BN      62.16  82.41
VGG16-BN-MMA  63.37  82.68
To further demonstrate that the benefit of MMA is consistent, we also evaluate the MMA regularization with ResNet56 and VGG16-BN on TinyImageNet, with coefficients of 0.01 and 0.07 respectively. The results are reported in Table 5, where the X-MMA models outperform the original backbones on both TOP-1 and TOP-5 accuracy. It is worth emphasizing that the X-MMA models achieve these improvements with negligible computational overhead and without modifying the original network architecture.
6 ArcFace+: Applying MMA Regularization to ArcFace
ArcFace (Deng et al., 2019) is one of the state-of-the-art face verification methods. It adds an additive angular margin between the learned feature and the target weight vector in the classification layer. This method essentially encourages intra-class feature compactness by pulling the learned features close to the target weight vectors. As analyzed in Section 3.3, MMA regularization produces diverse weight vectors and therefore improves the inter-class separability of the classification layer. Consequently, the MMA regularization is complementary to the objective of ArcFace and should boost accuracy further. Motivated by this analysis, we propose ArcFace+ by applying MMA regularization to ArcFace. The objective function of ArcFace+ is defined as:
$$l_{ArcFace+} = l_{ArcFace}(m) + \lambda\, l_{MMA}(W) \tag{15}$$
where $m$ is the angular margin of ArcFace, $\lambda$ is the regularization coefficient, and $W$ is the weight matrix of the classification layer.
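The combined objective can be sketched for a single training sample. The sketch assumes the standard ArcFace form (scaled cosine logits with the margin added to the target angle only) and uses ArcFace's commonly cited defaults, margin 0.5 and scale 64, as placeholder defaults. The settings actually used for ArcFace+ are in the supplementary material and are not reproduced here. Practical implementations also treat the case where the target angle plus the margin exceeds $\pi$, which this sketch ignores. All function names are hypothetical.

```python
import math

def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def arcface_loss(feature, class_weights, label, margin=0.5, scale=64.0):
    """Additive angular margin softmax loss for a single sample."""
    f = unit(feature)
    logits = []
    for k, w in enumerate(class_weights):
        cos_k = max(-1.0, min(1.0, dot(f, unit(w))))
        if k == label:
            # Add the margin to the angle of the target class only.
            cos_k = math.cos(math.acos(cos_k) + margin)
        logits.append(scale * cos_k)
    top = max(logits)  # subtract the max for a stable log-sum-exp
    log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum - logits[label]

def mma_loss(weights):
    """Negative mean of each vector's minimal angle (radians) to the others."""
    u = [unit(w) for w in weights]
    total = 0.0
    for i in range(len(u)):
        max_cos = max(dot(u[i], u[j]) for j in range(len(u)) if j != i)
        total += math.acos(max(-1.0, min(1.0, max_cos)))
    return -total / len(u)

def arcface_plus_loss(feature, class_weights, label,
                      margin=0.5, scale=64.0, coefficient=0.03):
    """ArcFace loss plus the MMA penalty on the classification weights."""
    return (arcface_loss(feature, class_weights, label, margin, scale)
            + coefficient * mma_loss(class_weights))

# Toy example: three classes in 2-D, with a small scale so the terms are visible.
class_weights = [[1.0, 0.2], [0.0, 1.0], [-1.0, 0.0]]
feature = [1.0, 0.0]
loss = arcface_plus_loss(feature, class_weights, label=0, scale=8.0)
```

The margin term acts on the feature and its target weight, while the MMA term depends only on the class weight vectors, which is why the two objectives can be added without interfering with each other's roles.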
For a fair comparison, both ArcFace and ArcFace+ are trained from scratch, so our ArcFace results may be slightly different from the ones presented in the original paper due to different settings and hardware. The implementation settings are detailed in the supplementary material.
Table 6: Face verification accuracy (%) of ArcFace and ArcFace+.
Method    LFW    CFP-FP  AgeDB-30
ArcFace   99.35  95.30   94.62
ArcFace+  99.45  95.59   95.15
From the results shown in Table 6, we can see that ArcFace+ outperforms ArcFace on all three verification datasets, by margins that are considered significant in the field of face verification. This comparison validates the effectiveness of MMA regularization in feature learning. Note that these results are obtained with the default coefficient of 0.03, and we expect that tuning the hyperparameter may improve them further.
7 Conclusion
In this paper, we propose a novel regularization method for neural networks, called MMA regularization, which encourages an angularly uniform distribution of weight vectors and therefore decorrelates the filters or neurons. The MMA regularization has a stable and consistent gradient, is easy to implement with negligible computational overhead, and is effective for both the hidden layers and the output layer. Extensive experiments on image classification demonstrate that the MMA regularization enhances the generalization power of neural networks by considerable margins. Moreover, MMA regularization is also effective for feature learning, where it yields significant gains by enlarging the inter-class separability. As MMA can be viewed as a basic regularization method for neural networks, we will explore its effectiveness on other tasks, such as object detection, object tracking, and image captioning.
References
 A numerical approach to tamme’s problem in euclidean nspace. Cited by: §3.1.

The effects of adding noise during backpropagation training on a generalization performance
. Neural computation 8 (3), pp. 643–674. Cited by: §2.  Linear algebra done right. Vol. 2, Springer. Cited by: §3.1.
 Neural photo editing with introspective adversarial networks. arXiv preprint arXiv:1609.07093. Cited by: §2.
 Negative correlations in visual cortical networks. Cerebral Cortex 26 (1), pp. 246–256. Cited by: §1.
 Reducing overfitting in deep networks by decorrelating representations. arXiv preprint arXiv:1511.06068. Cited by: §2.

Arcface: additive angular margin loss for deep face recognition
. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, pp. 4690–4699. Cited by: §3.1, §6.  Codes on euclidean spheres. Elsevier. Cited by: §3.1.
 Dsd: densesparsedense training for deep neural networks. arXiv preprint arXiv:1607.04381. Cited by: §1, §2.
 Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §5.1.
 Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §5.1.
 Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §5.1.
 Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §5.1.
 A simple weight decay can improve generalization. In Advances in neural information processing systems, pp. 950–957. Cited by: §2, §3.3.

Tiny imagenet visual recognition challenge
. CS 231N. Cited by: §5.1.  Deeplysupervised nets. In Artificial Intelligence and Statistics, pp. 562–570. Cited by: §5.1.
 Pruning filters for efficient convnets. Cited by: §1.
 Regularizing neural networks via minimizing hyperspherical energy. arXiv preprint arXiv:1906.04892. Cited by: §1, §2, §2.
 Learning towards minimum hyperspherical energy. In Advances in neural information processing systems, pp. 6222–6233. Cited by: §1, §1, §2, §2, §5.2, §5.2.
 Deep hyperspherical learning. In Advances in neural information processing systems, pp. 3950–3960. Cited by: §1, §2, §5.2.
 Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101. Cited by: §2.
 Uniform distribution of points on a hypersphere with applications to vector bit-plane encoding. IEE Proceedings-Vision, Image and Signal Processing 148 (3), pp. 187–193. Cited by: §1, §2, §3.1.
 On the importance of single directions for generalization. arXiv preprint arXiv:1803.06959. Cited by: §1.

 Generalization and parameter estimation in feedforward nets: some experiments. In Advances in neural information processing systems, pp. 630–637. Cited by: §2.
 The Tammes problem for N = 14. Experimental Mathematics 24 (4), pp. 460–468. Cited by: §1.
 Skip connections eliminate singularities. arXiv preprint arXiv:1701.09175. Cited by: §5.2, §5.3.

 PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035. Cited by: §3.1, §5.2.
 Fix your features: stationary and maximally discriminative embeddings using regular polytope (fixed classifier) networks. arXiv preprint arXiv:1902.10441. Cited by: §1.
 The Fekete problem and construction of the spherical coverage by cones. Facta Universitatis-Series: Mathematics and Informatics 28 (4), pp. 393–402. Cited by: §1, §3.1, §3.2.
 Globally optimized spherical point arrangements: model variants and illustrative results. Annals of Operations Research 104 (1–4), pp. 213–230. Cited by: §3.1.
 Repr: improved training of convolutional filters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10666–10675. Cited by: §1, §2.
 Neural rejuvenation: improving deep network training by enhancing computational resource utilization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 61–71. Cited by: §1, §2.
 Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review 52 (3), pp. 471–501. Cited by: §2.
 Regularizing cnns with locally constrained decorrelations. In 5th International Conference on Learning Representations, ICLR 2017, Conference Track Proceedings, Cited by: §1, §1, §2, §5.2.
 Reducing duplicate filters in deep neural networks. In NIPS workshop on Deep Learning: Bridging Theory and Practice, Cited by: §1, §3.1, §3.3.
 Learning representations by back-propagating errors. Nature 323 (6088), pp. 533–536. Cited by: §3.1, §4.

 Understanding and improving convolutional neural networks via concatenated rectified linear units. In International conference on machine learning, pp. 2217–2225. Cited by: §1.
 Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §5.1.
 Tables of spherical codes. See http://neilsloane.com/packings/. Cited by: §1, §3.1, §3.1, §4.
 Mathematical problems for the next century. The mathematical intelligencer 20 (2), pp. 7–15. Cited by: §1.
 Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §2.
 On the origin of number and arrangement of the places of exit on the surface of pollen-grains. Recueil des travaux botaniques néerlandais 27 (1), pp. 1–84. Cited by: §1, §2, §3.1.
 XXIV. on the structure of the atom: an investigation of the stability and periods of oscillation of a number of corpuscles arranged at equal intervals around the circumference of a circle; with application of the results to the theory of atomic structure. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 7 (39), pp. 237–265. Cited by: §1, §2.
 Regularization of neural networks using dropconnect. In International conference on machine learning, pp. 1058–1066. Cited by: §2.
 Picking winning tickets before training by preserving gradient flow. arXiv preprint arXiv:2002.07376. Cited by: §5.1.
 Generalization by weight-elimination with application to forecasting. In Advances in neural information processing systems, pp. 875–882. Cited by: §2.
 All you need is beyond a good init: exploring better solution for training extremely deep convolutional neural networks with orthonormality and modulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6176–6185. Cited by: §1, §2, §5.2, §5.2.
 Learning latent space models with angular constraints. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3799–3810. Cited by: §1, §2, §5.2.
 Near-orthogonality regularization in kernel methods. In UAI, Vol. 3, pp. 6. Cited by: §2.
 Wide residual networks. arXiv preprint arXiv:1605.07146. Cited by: §5.1.