Recent Advances in Large Margin Learning

by   Yiwen Guo, et al.
ByteDance Inc.

This paper serves as a survey of recent advances in large margin training and its theoretical foundations, mostly for (nonlinear) deep neural networks (DNNs), which have probably been the most prominent machine learning models for large-scale data in the community over the past decade. We generalize the formulation of classification margins from classical research to the latest DNNs, summarize theoretical connections between the margin, network generalization, and robustness, and comprehensively introduce recent efforts in enlarging the margins for DNNs. Since different methods take different viewpoints, we categorize them into groups for ease of comparison and discussion in the paper. We hope that our discussions and overview inspire new research work in the community that aims to improve the performance of DNNs, and we also point to directions where the large margin principle can be verified to provide theoretical evidence of why certain regularizations for DNNs function well in practice. We have kept the paper short so that the crucial spirit of large margin learning and related methods is better emphasized.


1 Introduction

The concept of large margin learning arises along with the development of the support vector machine (SVM) [14, 87], which aims to keep the value of the empirical risk fixed and minimize the confidence interval, in contrast to neural networks, which mostly target the empirical risk instead [87, 88]. Benefiting from a solid theoretical basis in statistical learning, large margin classifiers show promise in both generalization ability and robustness. Since the late 1990s, they have been intensively studied and widely adopted [17, 85, 116, 15, 76].

Yet, every learning machine has its day. Recent years have witnessed a revival of neural networks, partially owing to advances in computational units that are capable of processing large-scale datasets. By learning representations from data, neural networks, especially deep neural networks (DNNs), have advanced the state of the art on many tasks for machine intelligence [38, 31]. One might be curious about the relationship between DNNs and conventional large margin classifiers (e.g., SVM), and in view of this, we would like to answer the following questions in this survey: 1) Is the large margin principle essential or beneficial to the performance of DNNs? 2) If yes, is it (implicitly) supported by a normal training mechanism used in practice? 3) If no (for both questions), how can one enlarge the margin for classification models that are nonlinear and structurally complex like DNNs? This paper presents an overview of existing work on these points. To the best of our knowledge, there is no such survey in the literature, and we hope that our work will bridge the information gap and inspire new research.

1.1 A Formal Definition of Classification Margin

To get started, let us first introduce a formal definition of the classification margin, which applies to a variety of different classifiers, including both linear and nonlinear ones.

The margin can be formulated slightly differently for binary and multi-class classification. We first consider the binary scenario, in which a label from $\{+1, -1\}$ is assigned by a classifier to its input $x$. Given $f: \mathbb{R}^n \to \mathbb{R}$, which maps the $n$-dimensional input into a one-dimensional decision space, an instance-specific margin is defined as

$$\gamma_x := \min_{\Delta} \|\Delta\|_p \quad \text{s.t.} \quad \mathrm{sign}(f(x+\Delta)) \neq \mathrm{sign}(f(x)) \tag{1}$$

for any $x$. $\|\cdot\|_p$ indicates the concerned $\ell_p$-norm, and, in the Euclidean space, we have $p = 2$. For a classifier whose prediction is given by a linear function of its input $x$, we can rewrite $f(x)$ as $w^\top x + b$, and the instance-specific margin further has a closed-form solution in this case: $\gamma_x = |w^\top x + b| / \|w\|_q$, where $\|\cdot\|_q$ is the dual norm of $\|\cdot\|_p$ (i.e., $1/p + 1/q = 1$). It can be of great interest to study a margin over the whole training set. SVM in the linear case enlarges such a classification margin by minimizing $\|w\|_2$ and constraining the value of $y_i (w^\top x_i + b)$ (to be more specific, by letting $y_i (w^\top x_i + b) \geq 1$ for all $i$). Taking advantage of the kernel trick [73] and utilizing nonlinear functions in reproducing kernel Hilbert spaces, it is natural to extend the expressions and analyses of margins to SVMs with higher expressivity [88].
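As a small illustration of the closed-form margin of a linear classifier, the following sketch evaluates $|w^\top x + b| / \|w\|_q$ for a few choices of $p$. The helper names (`lq_norm`, `dual_exponent`, `linear_margin`) and the numbers are hypothetical and chosen only for this example.

```python
import math

def lq_norm(v, q):
    """Return the l_q norm of a vector; q may be math.inf."""
    if q == math.inf:
        return max(abs(a) for a in v)
    return sum(abs(a) ** q for a in v) ** (1.0 / q)

def dual_exponent(p):
    """Return q such that 1/p + 1/q = 1 (p = 1 pairs with q = inf)."""
    if p == 1:
        return math.inf
    if p == math.inf:
        return 1.0
    return p / (p - 1.0)

def linear_margin(w, b, x, p=2):
    """Closed-form l_p margin of the linear classifier sign(w.x + b) at x."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return abs(score) / lq_norm(w, dual_exponent(p))

# Hypothetical classifier and input: the score is 3*2 + 4*1 - 1 = 9.
w, b, x = [3.0, 4.0], -1.0, [2.0, 1.0]
margin_l2 = linear_margin(w, b, x, p=2)            # 9 / ||w||_2 = 9 / 5
margin_linf = linear_margin(w, b, x, p=math.inf)   # 9 / ||w||_1 = 9 / 7
```

The same input is closer to the decision boundary when perturbations are measured in the $\ell_\infty$ norm than in the $\ell_2$ norm, because the dual norm of $w$ in the denominator changes from $\|w\|_2$ to $\|w\|_1$.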

With the expression of the classification margin provided for SVMs, we obtain guarantees on the superiority of their generalization ability and robustness [6, 71, 104]. For DNNs, it is very challenging, if not impossible, to derive analytic expressions for their classification margins, on account of the hierarchical model structure and complex nonlinearity of $f$. Therefore, the study of the classification margin, generalization ability, and robustness of DNNs has also attracted great attention recently. In the following sections, we will try to answer the three questions and give an overview of recent advances in margin-related studies for (nonlinear) DNNs. Before giving more details, it is worth stressing that the classification margin concerned in SVMs and in this paper is in the input space of the learning machine. We will also mention some margins in the output or representation spaces of DNNs in the rest of the paper.

2 The Relationships between Margin, Generalization, and Robustness

In this section, we attempt to answer the first two questions raised in Section 1. The first question concerns the benefits of large margin learning for DNN-based classification. Related theoretical studies have been performed for decades on the basis of shallow models like SVMs. In this section, we focus on research work based on DNNs.

2.1 Large Margin Is Beneficial to DNNs as Well

Before delving deep into these studies, we first introduce the concept of robustness (a.k.a. algorithmic robustness [105] and robustness in patterns [76]), which is an essential property of classifiers and closely related to the classification margin. As is known from the definition, for any reasonable input $x$ (e.g., any natural image that can be fed into a scene classification system), the classifier will hold its prediction under any pixel-wise perturbation $\Delta$ smaller than the margin (i.e., $\|\Delta\|_p < \gamma_x$). We consider $f$ a robust model if $\gamma_x$ is sufficiently large, so that any perturbation able to change the prediction leads to a sample seemingly from another class. Over the last few years, the robustness of DNN models has drawn more attention along with developments of adversarial attacks [82, 30, 61, 52, 103]. It has been demonstrated that one can easily manipulate the prediction of a state-of-the-art DNN model by adding subtle perturbations to its input. By definition, given the function $f$ and an input $x$, any perturbation to $x$ within the hyper-sphere $\{x' : \|x' - x\|_p < \gamma_x\}$ would not alter the model prediction. In other words, the concept of adversarial robustness, which describes the ability of a model to resist adversarial attacks, is intrinsically related to the classification margin.

Fig. 1: An adversarial example that is misclassified as purse by a ResNet-50 trained on ImageNet. The perturbation is enlarged for better illustration in the middle picture.

Related to robustness, and equally important, is the generalization ability. Suppose that a DNN model is to be learned to minimize the risk $R(\theta) = \mathbb{E}_{(x, y)}[\ell(f(x; \theta), y)]$, in which $\theta$ is a set that exhaustively collects all learnable parameters in the network. Since the joint distribution of $x$ and $y$ is generally unknown in practice, the objective of expected risk minimization cannot be pursued directly and we opt to minimize an empirical risk instead, using a set of training samples $\{(x_i, y_i)\}_{i=1}^{m}$, i.e., $\hat{R}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \ell(f(x_i; \theta), y_i)$. Although this is common practice, there is always a gap between minimizing $\hat{R}(\theta)$ and $R(\theta)$, and we can use the generalization error to characterize such a gap formally [104, 48].

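It has been discovered that the margin and some other robustness measures of a classifier bound the generalization error of the classifier in theory [104, 8]. Naturally, we prefer machine learning models with lower generalization error, and it has been demonstrated that, for a $(K, \epsilon)$-robust algorithm with a loss bounded by $M$ for all reasonable inputs, with probability at least $1 - \delta$, it holds that

$$|R(\theta) - \hat{R}(\theta)| \leq \epsilon + M \sqrt{\frac{2 K \ln 2 + 2 \ln(1/\delta)}{m}}, \tag{2}$$

in which the left-hand side is the generalization error (the gap between the expected and empirical risks) and $m$ is the number of training samples.

(Footnote 1: A formal definition of $(K, \epsilon)$-robustness can be found in [104, 77]; it characterizes how the training data is exploited by the algorithm. $K$ is the number of sample partitions and $\epsilon$ bounds the discrepancy between training and possible test losses in each partition.)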
The result establishes connections between the generalization ability and robustness of classification models. Inspired by the result, Sokolić et al. [77] proposed to bound the margin of a DNN such that both its robustness and generalization ability are guaranteed, and it was achieved by constraining the Jacobian matrix. For a classification model that was fed with an $n$-dimensional input each time and outputted $c$ neurons before softmax, the $c \times n$ Jacobian matrix should also be instance-specific, and it was obtained by calculating the gradient of the function with respect to its input. Superior test-set accuracies were obtained using the Jacobian regularization. They also generalized the theoretical and empirical analyses to stable invariant classifiers (e.g., convolutional neural networks, CNNs) [78, 41], and suggested methods that could enhance robustness to data variations. With all these facts, we know that, for Question 1, the large margin principle is beneficial to DNNs, in both generalization ability and robustness.
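To get a feeling for the robustness-based generalization bound discussed in this section, the sketch below evaluates its right-hand side numerically, assuming the bound takes the form $\epsilon + M \sqrt{(2K \ln 2 + 2 \ln(1/\delta)) / m}$ from the algorithmic robustness literature [104]. The function name and all numeric values are hypothetical.

```python
import math

def robustness_bound(eps, M, K, m, delta):
    """Right-hand side of the bound: eps + M * sqrt((2 K ln 2 + 2 ln(1/delta)) / m).

    eps   : discrepancy between training and test losses within a partition
    M     : upper bound on the loss
    K     : number of sample partitions
    m     : number of training samples
    delta : failure probability (the bound holds with probability >= 1 - delta)
    """
    complexity = 2.0 * K * math.log(2.0) + 2.0 * math.log(1.0 / delta)
    return eps + M * math.sqrt(complexity / m)

# Hypothetical setting: the second term shrinks as the training set grows,
# while eps (tied to the robustness of the algorithm) does not.
bounds = {m: robustness_bound(eps=0.05, M=1.0, K=100, m=m, delta=0.05)
          for m in (10**3, 10**5, 10**7)}
for m, value in bounds.items():
    print(m, round(value, 4))
```

With enough data the bound is dominated by $\epsilon$, which is why methods that make the learned function less sensitive within each partition (for instance by enlarging margins) are expected to tighten it.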

2.2 Large Margin Cannot be Trivially Obtained

Let us now turn to answering Question 2. Apparently, large margins cannot be naturally obtained with a normal training mechanism (i.e., using stochastic gradient descent to minimize just a cross-entropy loss) for DNN models in practice; otherwise, their adversarial vulnerability would not be considered severe. Note that although it has been proved that linear networks trained on separable data using stochastic gradient descent converge to maximum margin solutions as $t \to \infty$ [79, 32, 44, 63], where $t$ is the number of training iterations, it was also demonstrated that the convergence rate was extremely slow (e.g., $O(1/\log t)$ [79] using the cross-entropy loss), making it hardly achievable in practice. Similar results can also be derived for non-separable data [45]. In addition to the results that were derived based on the implicit bias of stochastic gradient descent, it has been shown that over-parameterization also leads to improved margins [95]. Under an infinite network width regime, stochastic gradient descent on a two-layer network model leads to an inference function in a reproducing kernel Hilbert space of the neural tangent kernel [19, 42, 2], and it can be proved that a proper explicit regularization can indeed improve margins.

3 Achieve Large Margin for DNNs

Now that we have answered the first two questions in the previous section, we will attempt to answer the third question in this section. Figure 2 summarizes what follows. Sections 3.1 to 3.3 group training mechanisms that affect the margin with theoretical guarantees into several categories for better clarity.

Fig. 2: The concerned classification margin is closely related to adversarial robustness and generalization ability, and we will introduce methods that aim to enlarge the margins for DNNs, by local linearization [106, 107, 25], certification [84, 101, 13], or relying on a margin in the decision space [57, 89, 20] in combination with some (possibly implicit) Lipschitz constraints.

3.1 Regularization on Lipschitz Constant and Output Margins

We know that conventional regularization strategies like weight decay [51] and dropout [40] do not help in enlarging the margin for nonlinear DNNs, which seems different from SVMs. This indicates that the hierarchical structure of DNNs is worth exploring further, considering that the optimization of an SVM can be roughly considered as training the final layer of a DNN with weight decay.

Unlike for SVMs, derivation and calculation of a classification margin seem infeasible for nonlinear DNNs, and hence some approximations are pursued as surrogates. For instance, as mentioned, Sokolić et al. [77] took advantage of the fact that nonlinear DNNs were Lipschitz continuous and used the inequality

$$\|f(x) - f(x')\|_p \leq L \, \|x - x'\|_p \tag{3}$$

to constrain the margin, which was basically an $\ell_p$ distance between two points in the input space. We know from Eq. (3) that the distance between any two points in the prediction space is an essential factor of the classification margin. Fixing $\|f(x) - f(x')\|_p$, the classification margin can be guaranteed by minimizing a Lipschitz constant (i.e., $L$). It was also theoretically proven that the Lipschitz constant has direct relationships with both robustness [39, 100] and generalization [5, 64, 96] for a variety of nonlinear DNNs. Unsurprisingly, the Lipschitz constant and some “margins” in the model decision space, which laid the foundation of a spectrum of methods introduced in the following paragraphs, jointly guarantee the (adversarial) robustness of DNNs. The norm of the Jacobian matrix is then suggested to be penalized during training for improved robustness [77, 43], since it is actually a local Lipschitz constant fulfilling the inequality in Eq. (3) for any specific $x$ and it measures the sensitivity of a learning machine by definition [66]. In addition to the norm of the Jacobian matrix, the Euclidean norm of the input gradient [69], the curvature of the Hessian matrix [62], the spectral norm of weight matrices [108], and a cross-Lipschitz functional [39] can all be used to enhance the robustness or generalization ability of DNNs, and a larger margin is simultaneously anticipated. In fact, these network properties are all closely related to the Lipschitz constant and are linked by chained inequalities [43, 33]. In binary classification, some of the regularizations share the same essential ingredients (i.e., maximizing prediction confidence and minimizing local Lipschitz constants) in theory and show similar empirical results [33].
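The role of the Jacobian norm as a local sensitivity measure can be sketched with a finite-difference Jacobian. The example below is a minimal illustration and not the training procedure of [77]: the one-layer model, the weight matrix `A`, and all helper names are hypothetical, and finite differences stand in for automatic differentiation.

```python
import math

A = [[1.0, -2.0], [0.5, 1.5], [-1.0, 0.3]]   # hypothetical 3 x 2 weight matrix

def linear_map(x):
    """Linear model x -> A x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def tanh_layer(x):
    """Nonlinear model x -> tanh(A x), applied element-wise."""
    return [math.tanh(u) for u in linear_map(x)]

def jacobian_fd(f, x, h=1e-6):
    """Central finite-difference Jacobian of f at x, one row per output."""
    cols = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        cols.append([(p - m) / (2.0 * h) for p, m in zip(f(xp), f(xm))])
    n_out = len(cols[0])
    return [[cols[i][k] for i in range(len(x))] for k in range(n_out)]

def frobenius(matrix):
    """Frobenius norm, an upper bound on the spectral norm."""
    return math.sqrt(sum(v * v for row in matrix for v in row))

x = [0.3, -0.2]
norm_linear = frobenius(jacobian_fd(linear_map, x))   # equals ||A||_F
norm_tanh = frobenius(jacobian_fd(tanh_layer, x))     # at most ||A||_F
jacobian_penalty = norm_tanh ** 2                     # candidate regularization term
```

For the linear model the Jacobian is `A` everywhere, whereas for the `tanh` model it depends on `x` and is never larger than $\|A\|_F$ because the derivative of `tanh` is at most one. A penalty such as `jacobian_penalty` would be added to the training loss with a weight chosen by the practitioner.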

As mentioned, there is also a line of work focusing on “margins” in the decision or a representation space (i.e., output margins) of DNNs. For instance, Tang [83] proposed to replace the softmax cross-entropy loss with an SVM-derived one, leading to a positive margin in the representation space characterized by the penultimate layer and a winning solution to the ICML’13 representation learning challenge [29]. Sun et al. [81] introduced margin-based penalties to the objective of training DNNs, motivated by theoretical analyses from the perspective of the margin bound. The triplet loss [98], which imposes a margin of representation distance between each positive sample pair and each negative sample pair, was also considered, e.g., in FaceNet [72]. Since then, such output margins have been actively discussed and studied in the face recognition community. Besides what was applied in FaceNet, angular softmax (A-Softmax) and additive margin softmax (AM-Softmax) were further developed and used in SphereFace [56] and CosineFace [89], respectively, to directly enhance the conventional cross-entropy loss with softmax. (Footnote 2: See also large margin softmax (L-Softmax) in [57], which is very similar to A-Softmax.) There are also ensemble soft-margin softmax (M-Softmax) [93], virtual softmax (V-Softmax) [11], large margin cosine loss (LMCL) [90], and additive angular margin loss (AAML) [20], just to name a few. They achieved remarkable success in the tasks of face identification and verification, and some of them also show promising accuracy for classifying natural scene images. The difference between these methods mostly lies in the way of decomposing the cross-entropy loss. Table I compares them. It is also worth mentioning that the cross-entropy loss itself can be interpreted as a margin-based loss, with input-specific margins in the output space, and it was shown that enlarging such input-specific margins could be used as a regularization and led to improved test-set accuracy [50].

Similar ideas for enlarging output margins have also been considered in speech recognition methods [111, 112, 91], in which computational modules like feedback connections [31] and self-attentions [3] often serve in the backbone models. The aim is to obtain deep feature representations that lead to the largest SVM margins. A two-stage pipeline was initially developed for training such models [111, 112], in which learnable parameters in the final and prior layers were updated separately. For more recent methods, automatic differentiation [67] was adopted, such that the feature representations and large margin SVM classifiers can be optimized jointly. Such a large margin principle was also utilized in few-shot learning [94], PU learning [24, 28], and anomaly detection.


Method            Test data      Margin     Add   Mul
L-Softmax [57]    image, face    Angular
A-Softmax [56]    face           Angular
M-Softmax [93]    image          Logit
V-Softmax [11]    image, face    Logit
AM-Softmax [89]   face           Cosine
LMCL [90]         face           Cosine
AAML [20]         face           Angular

TABLE I: Large margin classification for face recognition. The methods differ from each other by introducing margins in angular, logit, or cosine spaces and by incorporating scaling factors (Mul) or additive terms (Add). Given $W$ as a learnable matrix before softmax and $z$ as the feature representation of a DNN, if the norm $\|W_j\|$ of its $j$-th column is fixed (e.g., to one), then we can rewrite the linear transformation $W_j^\top z$ as $\|z\| \cos\theta_j$ and incorporate a scaling factor $m_1$ and an additive term $m_2$ into it to encourage the angular margin as $\|z\| \cos(m_1 \theta_j + m_2)$. Note that although M-Softmax targets image classification, it is closely related to the other methods and thus we list it here as well.
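The way an output margin enters the softmax cross-entropy loss can be sketched as follows. This is a simplified illustration under stated assumptions, not the exact loss of any method in Table I: the target-class cosine is replaced by `cos(m1 * theta + m2) - m3`, where `m1` and `m2` play the role of the scaling factor and additive angular term described in the caption, while the cosine-space margin `m3`, the logit scale `s`, and all function names are hypothetical additions for this example.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def margin_logits(z, W, target, s=30.0, m1=1.0, m2=0.0, m3=0.0):
    """Scaled cosine logits; only the target class is penalized by the margins.

    Assumes m1 * theta + m2 <= pi for the target class, so that the modified
    cosine stays monotone (published methods handle larger angles piecewise).
    """
    logits = []
    for j, w in enumerate(W):
        c = max(-1.0, min(1.0, cosine(z, w)))
        if j == target:
            theta = math.acos(c)
            c = math.cos(m1 * theta + m2) - m3
        logits.append(s * c)
    return logits

def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for one sample."""
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_sum - logits[target]

# Hypothetical feature z and class weight vectors W (one row per class).
z, W, target = [1.0, 0.2], [[1.0, 0.0], [0.0, 1.0]], 0
plain = margin_logits(z, W, target)
angular = margin_logits(z, W, target, m2=0.5)    # additive angular margin
cos_margin = margin_logits(z, W, target, m3=0.35)  # additive cosine margin
```

With a margin, the target logit is lowered while the other logits stay the same, so the loss remains large until the feature is separated from the other classes by more than the margin.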

While these methods considered classification margins of all training instances, it seems more efficient and reasonable to mainly focus on support vectors, just like in SVMs. In this spirit, Wang et al. [92] combined the margin-based softmax with hard example mining [74, 55], such that training focused more on harder samples, which were considered more informative. Since only the “support vectors” were used for calculating gradients and updating parameters, the training process became more efficient. Viewed as a functional abstraction of the training dataset, the set of “support vectors” has also been utilized to address catastrophic forgetting [49] in incremental learning of DNNs [54].

3.2 Local Linearization

The methods introduced in Section 3.1 enlarged “margins” mostly in the decision space of DNN models. Efforts might also be devoted to other representation spaces [114]; however, as discussed in [1], owing to the distance distortions between input and representation spaces, the classification margins in the input space of DNNs were not necessarily maximized by methods in this category. (Footnote 3: This is the case although it has been theoretically shown that enlarging the output margin is beneficial to the generalization ability of DNNs, along with constrained classifier norms [5, 65] or constrained complexity of each layer [97].) An et al. [1] hence further enforced the transformations of middle layers of a DNN to be contraction mappings, in order to achieve large classification margins. One step further, Bansal et al. [4] advocated a learning objective similar to that of SVMs, in which the method of Lagrange multipliers is utilized. Both of them can be seen as using layer-wise approximations to restrict some whole-network property (e.g., the Lipschitz constant). Recently, independent work from Yan et al. [106] and Elsayed et al. [25] proposed to enlarge classification margins via local linearization and reasonable approximations. In fact, by simply rewriting the constraint using Taylor’s approximation of $f(x + \Delta)$ with respect to $\Delta$, one can obtain a closed-form solution to the worst-case perturbations even for nonlinear DNNs. Elsayed et al. proposed to maximize

$$\frac{|f(x)|}{\|\nabla_x f(x)\|_q}, \tag{4}$$

in which $\|\cdot\|_q$ was the dual norm of $\|\cdot\|_p$, with $1/p + 1/q = 1$. In essence, the method can be considered as a simplified and efficient version of that of Yan et al. [106], which took one step further and followed a similar mechanism to DeepFool [61]. Specifically, an iterative local linearization was utilized by DeepFool and the method of Yan et al. to approximate the classification margins [106], which can be more accurate yet more computationally demanding. Shortly after, Ding et al. [22] also discussed large margin training for DNNs, and they proposed to incorporate

$$\sum_{x_i \in \mathcal{S}^+} \max\{0, \; d_{\max} - \hat{\gamma}_{x_i}\} \tag{5}$$

into the learning objective of DNNs, in which $\hat{\gamma}_{x_i}$ was an estimation of the instance-specific margin $\gamma_{x_i}$ for $x_i$, $\mathcal{S}^+$ indicated the set of correctly classified training samples using the network model, and $d_{\max}$ was a hyper-parameter. To approximate $\gamma_{x_i}$ well, they took advantage of the PGD attack [59] and adopted its variant as a proxy of the “shortest successful perturbation”.
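The first-order margin estimate obtained by local linearization can be sketched in a few lines. The example assumes a scalar binary scorer and the $\ell_2$ norm, so the estimate is $|f(x)| / \|\nabla_x f(x)\|_2$; the two scoring functions and the finite-difference gradient are hypothetical stand-ins for a DNN and automatic differentiation.

```python
import math

def grad_fd(f, x, h=1e-6):
    """Central finite-difference gradient of a scalar function f at x."""
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        grad.append((f(xp) - f(xm)) / (2.0 * h))
    return grad

def linearized_margin(f, x):
    """First-order estimate |f(x)| / ||grad f(x)||_2 of the l_2 margin."""
    grad = grad_fd(f, x)
    return abs(f(x)) / math.sqrt(sum(v * v for v in grad))

def linear_f(x):
    """Hypothetical linear scorer: the estimate is exact here."""
    return 3.0 * x[0] + 4.0 * x[1] - 1.0

def nonlinear_f(x):
    """Hypothetical nonlinear scorer: the estimate is only an approximation."""
    return math.tanh(3.0 * x[0] + 4.0 * x[1] - 1.0) + 0.1 * x[0] * x[1]

est_linear = linearized_margin(linear_f, [2.0, 1.0])        # closed form: 9 / 5
est_nonlinear = linearized_margin(nonlinear_f, [0.5, 0.1])
```

For the linear scorer the estimate coincides with the closed-form margin, while for the nonlinear scorer its accuracy depends on how well the function is described by its tangent plane around `x`, which is what motivates the iterative linearization schemes mentioned above.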

[Figure 3 panels: (a) min, (b) average, (c) aggregation, (d) shrinkage]
Fig. 3: Different choices of the key components result in different penalties on instance-specific margins. More specifically, (a) considers only the worst case in the training set, (b) combined with an identity function treats all training samples equally, and (c) and (d) incorporate the aggregation function and the shrinkage function, respectively.

3.2.1 Key Components and More Discussions

In existing work that explicitly incorporates margin-based regularizers into learning objectives, given the set of training samples $\{(x_i, y_i)\}_{i=1}^{m}$, one might write the regularization term as

$$-\mathop{\mathcal{A}}_{x_i \in \mathcal{S}} \; h(\gamma_{x_i}), \tag{6}$$

in which $h$ is a monotonically increasing function, $\mathcal{A}$ is an operation that combines the per-sample terms (e.g., $\min$ or the average), and $\mathcal{S}$ is a subset of the training samples. It has been discussed that the specific choice of $\min$ (which focuses on the minimum margin solely) or the average indicates a preference for improving robustness or generalization ability [107, 102]. The function $h$ is introduced to optionally put more stress on samples with smaller margins. $\mathcal{S}$ can be used to enlarge margins for correctly classified training samples only, since the margins of incorrectly classified samples are not well defined. They form the three key components in developing different margin-based regularizers.

Somewhat surprisingly, Wu et al. [102] showed that there might exist an interesting trade-off between minimum margins and average margins. It is uncontroversial, at least for SVMs, that maximizing the minimum margin (i.e., choosing the formulation with $\min$ in Eq. (6)) leads to improved generalization ability, while the average-based term in Eq. (6) favors adversarial robustness. (Footnote 4: It has also been shown that, in addition to the average margin, which characterizes the first-order statistics of the margin distribution, the variance of the margin is also of importance [113].) In fact, we typically use the average magnitude to evaluate the (adversarial) robustness of DNNs [61, 60, 9, 35]. However, it is also discovered that using an identity function for $h$ (i.e., $h(\gamma) = \gamma$) results in poor prediction accuracy on benign inputs. The regularizer is found to be dominated by some “extremely robust” samples, so that if all the samples are treated equally in the regularizer, it would be difficult to trade off the benign-set accuracy and adversarial robustness well. A nonlinear “shrinkage” function that penalizes samples with small margins more heavily can be introduced to relieve the problem [106, 80].

Although using an extremal operation such as $\min$ instead of the average operation in the regularizer aligns better with the generalization ability, it may lead to slower training convergence, since at most one (instance-specific) margin in a whole batch of samples is effectively optimized at each training iteration. Yan et al. [107] proposed to address this issue by incorporating an “aggregation” function that aggregates some samples in the batch rather than using only one.

For $\mathcal{S}$, some methods chose it to be a set of correctly classified samples [106, 107, 22] while the others simply used the whole training set. For datasets where nearly 100% training accuracy can be obtained, the two options actually lead to similar performance. See Figure 3 for a comparison of different settings in the functions.
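The three key components discussed in this subsection (how per-sample terms are combined, the transform applied to each margin, and the subset of samples considered) can be sketched as one small function. The function names, the exponential form of the shrinkage transform, and the "mean of the `k` smallest terms" aggregation are hypothetical choices made for illustration and are not the exact functions used in [106, 107].

```python
import math

def shrinkage(margin, scale=1.0):
    """Hypothetical saturating transform: steep for small margins, flat for large ones."""
    return 1.0 - math.exp(-margin / scale)

def margin_regularizer(margins, correct, reduce="average", h=None,
                       only_correct=True, k=2):
    """Negative combined margin; minimizing it enlarges the selected margins."""
    if h is None:
        h = lambda g: g                       # identity transform
    kept = [h(g) for g, ok in zip(margins, correct) if ok or not only_correct]
    if not kept:
        return 0.0
    if reduce == "min":                       # worst case only
        return -min(kept)
    if reduce == "average":                   # all kept samples equally
        return -sum(kept) / len(kept)
    if reduce == "bottom_k":                  # illustrative aggregation
        smallest = sorted(kept)[:k]
        return -sum(smallest) / len(smallest)
    raise ValueError("unknown reduce option: " + reduce)

# Hypothetical per-sample margin estimates; the last sample is misclassified.
margins = [0.1, 0.5, 4.0, 0.3]
correct = [True, True, True, False]
reg_min = margin_regularizer(margins, correct, reduce="min")
reg_avg = margin_regularizer(margins, correct, reduce="average")
reg_agg = margin_regularizer(margins, correct, reduce="bottom_k", k=2)
reg_shrink = margin_regularizer(margins, correct, reduce="average", h=shrinkage)
```

In this toy batch the plain average is dominated by the single sample with margin 4.0, whereas the shrinkage transform caps its contribution, which mirrors the observation about "extremely robust" samples made above.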

3.3 DNN Certification

A related and rapidly growing category of methods for DNN certification (also known as DNN verification) has also been widely explored in the community [84, 12, 47, 23, 101, 75, 7, 13, 110, 26]. These DNN certification methods aim at maximizing the volume of the hyper-sphere centered at each training instance in which all data points are predicted into the same class. The radius of such a certified hyper-sphere is a lower bound of the classification margin; therefore, the techniques could also be regarded as encouraging large classification margins.

The certifications of DNNs are normally rigorous, such that the margins and DNN robustness can be theoretically guaranteed. One of their demerits might be the high computational complexity. In fact, it is challenging for most of them to be generalized to large networks on large datasets (e.g., ImageNet [70]). For fast certification, Lee et al. [53] and Croce et al. [18] proposed to expand linear regions where training samples reside. The linear regions can be smaller than the certified hyper-spheres, and they are related to the classification margins similarly. The relationship between margins and the Lipschitz constant is also exploited to achieve better certification efficiency [86].
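How a certificate yields a lower bound on the margin can be illustrated with interval bound propagation, used here only as a simple stand-in certifier and not as the method of any work cited above. The tiny ReLU network, its weights, and all function names are hypothetical. The sketch certifies an $\ell_\infty$ ball by propagating a box through the network and searching for the largest radius at which the output sign provably cannot change.

```python
# Hypothetical two-layer ReLU network f(x) = w2 . relu(W1 x + b1) + b2.
W1 = [[1.0, -2.0], [0.5, 1.5], [-1.0, 0.3]]
b1 = [0.1, -0.2, 0.0]
w2 = [1.0, -1.0, 0.5]
b2 = 0.2

def f(x):
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sum(a * h for a, h in zip(w2, hidden)) + b2

def affine_bounds(W, b, lo, hi):
    """Propagate the box [lo, hi] through x -> W x + b with interval arithmetic."""
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        l = u = bias
        for w, a, c in zip(row, lo, hi):
            l += w * a if w >= 0 else w * c
            u += w * c if w >= 0 else w * a
        out_lo.append(l)
        out_hi.append(u)
    return out_lo, out_hi

def output_bounds(x, eps):
    """Sound lower/upper bounds on f over the l_inf ball of radius eps around x."""
    lo, hi = [v - eps for v in x], [v + eps for v in x]
    lo, hi = affine_bounds(W1, b1, lo, hi)
    lo, hi = [max(0.0, v) for v in lo], [max(0.0, v) for v in hi]  # ReLU is monotone
    lo, hi = affine_bounds([w2], [b2], lo, hi)
    return lo[0], hi[0]

def is_certified(x, eps):
    lo, hi = output_bounds(x, eps)
    return lo > 0.0 or hi < 0.0

def certified_linf_radius(x, eps_max=1.0, steps=40):
    """Bisection for the largest certified radius, capped at eps_max."""
    if not is_certified(x, 0.0):
        return 0.0
    low, high = 0.0, eps_max
    for _ in range(steps):
        mid = 0.5 * (low + high)
        if is_certified(x, mid):
            low = mid
        else:
            high = mid
    return low

x = [1.0, 0.5]
radius = certified_linf_radius(x)
```

Because the interval bounds are sound but loose, `radius` never exceeds the true $\ell_\infty$ margin at `x`; tighter certifiers narrow the gap at a higher computational cost.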

4 Data Augmentation and DNN Compression May Affect Margins

In addition to the methods introduced in Section 3, there exist other methods that possibly achieve large classification margins as a byproduct. In general, methods that benefit the generalization ability and test accuracy of DNNs may unintentionally enlarge margins to some extent. From this point of view, it has been discussed under what condition(s) data augmentation can lead to margin improvement [68]. One result is that, for linear classifiers or linearly separable data, polynomially many more samples are required for a very specific data augmentation strategy to obtain optimal margins. Adversarial training [59] is often also regarded as a data augmentation strategy, and we know that it has to enlarge the margin to guarantee performance. In fact, it has been shown that adversarial training converges to maximum margin solutions faster than normal training [10]. Other augmentations, including cutout [21] and mixup [109], may also be related to enlarged margins, considering their success in improving the generalization ability of DNNs, and we encourage future work to explore this direction.

Though there is no obvious evidence demonstrating the relationship between DNN margins and other learning technologies, we would like to discuss directions that can possibly be explored. The first set of technologies that attracts our attention is DNN compression, including network pruning [37, 34, 99] and quantization [16, 115], since improved or at least similar test-set accuracy can be achieved with significantly fewer learnable parameters (or fewer bits, for quantization) using these methods [36, 115]. It is also discovered that such network compression leads to improved adversarial robustness in certain circumstances [35, 27]. As has been introduced, both the generalization ability and robustness are closely related to the classification margins; thus, we conjecture that compression along with re-training may also help to achieve models with enlarged margins.

5 Estimation of The Margin

Classification margins have been used for a variety of goals, e.g., improving and estimating DNN model robustness and generalization ability [25, 106, 107, 46, 22]. However, it is still an open problem to find an accurate approximation for the DNN classification margin. As mentioned, the radius of a certified hyper-sphere bounds the margin from below, while adversarial examples act as upper bounds of the margin. Hence, taking advantage of DNN certification [84, 12, 47, 23, 101, 75, 7, 13, 110, 26] and adversarial attacks [9, 59], one can reasonably estimate the range in which the classification margin resides. Other methods for estimating the margin, probably not rigorously, can also be found in Section 3.
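The idea of bracketing the margin between a certified lower bound and an attack-based upper bound can be sketched as follows, for the $\ell_2$ norm. Everything here is a hypothetical illustration: the tiny ReLU network, a loose Lipschitz bound (product of a Frobenius norm and a vector norm) standing in for a certifier, and a line search along the gradient direction standing in for an adversarial attack.

```python
import math

# Hypothetical two-layer ReLU network f(x) = w2 . relu(W1 x + b1) + b2.
W1 = [[1.0, -2.0], [0.5, 1.5], [-1.0, 0.3]]
b1 = [0.1, -0.2, 0.0]
w2 = [1.0, -1.0, 0.5]
b2 = 0.2

def f(x):
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sum(a * h for a, h in zip(w2, hidden)) + b2

def lipschitz_lower_bound(x):
    """|f(x)| / L with L = ||W1||_F * ||w2||_2, a loose l_2 Lipschitz bound."""
    frob_w1 = math.sqrt(sum(v * v for row in W1 for v in row))
    norm_w2 = math.sqrt(sum(v * v for v in w2))
    return abs(f(x)) / (frob_w1 * norm_w2)

def grad_fd(x, h=1e-6):
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        grad.append((f(xp) - f(xm)) / (2.0 * h))
    return grad

def attack_upper_bound(x, t_max=10.0, steps=60):
    """Bisect along the gradient direction for a step at which the sign flips.

    Returns (step_length, perturbed_point), or (None, None) if no flip is found.
    """
    sign = 1.0 if f(x) > 0 else -1.0
    grad = grad_fd(x)
    norm = math.sqrt(sum(v * v for v in grad))
    if norm == 0.0:
        return None, None
    direction = [-sign * v / norm for v in grad]   # unit-length direction

    def point(t):
        return [xi + t * di for xi, di in zip(x, direction)]

    if sign * f(point(t_max)) > 0:
        return None, None
    low, high = 0.0, t_max
    for _ in range(steps):
        mid = 0.5 * (low + high)
        if sign * f(point(mid)) > 0:
            low = mid
        else:
            high = mid
    return high, point(high)

x = [1.0, 0.5]
lower = lipschitz_lower_bound(x)
upper, adversarial = attack_upper_bound(x)
```

The true $\ell_2$ margin at `x` lies in the interval `[lower, upper]`: no perturbation shorter than `lower` can change the sign of `f`, and `adversarial` is an explicit input at distance `upper` on which the sign has changed. Stronger certifiers and attacks shrink this interval.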

6 Conclusion

In this paper, we survey recent research efforts on the classification margin for (nonlinear) DNNs. Unlike for SVMs, the studies are more challenging for DNNs on account of their hierarchical structure and complex nonlinearity. We first revisit some classical work from the last century and highlight the focus of this paper, and we then summarize connections between the margin, generalization, and robustness, mostly from a theoretical point of view, which highlights the importance of large margins even in state-of-the-art DNN models. We review methods that target large margin DNNs over the past few years and categorize them into groups, in a comprehensive but summarized manner. We have kept the paper short so that the crucial spirit of large margin learning and related methods is better emphasized. We share our view on the key components of current winning methods and point to directions that can possibly be explored.


This work is supported by NSFC.


  • [1] S. An, M. Hayat, S. H. Khan, M. Bennamoun, F. Boussaid, and F. Sohel (2015) Contractive rectifier networks for nonlinear maximum margin classification. In ICCV, pp. 2515–2523. Cited by: §3.2.
  • [2] S. Arora, S. S. Du, W. Hu, Z. Li, R. R. Salakhutdinov, and R. Wang (2019) On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, pp. 8141–8150. Cited by: §2.2.
  • [3] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §3.1.
  • [4] Y. Bansal, M. Advani, D. Cox, and A. Saxe (2018) Minnorm training: an algorithm for training overcomplete deep neural networks. arXiv preprint arXiv:1806.00730. Cited by: §3.2.
  • [5] P. L. Bartlett, D. J. Foster, and M. J. Telgarsky (2017) Spectrally-normalized margin bounds for neural networks. In NeurIPS, pp. 6240–6249. Cited by: §3.1, footnote 3.
  • [6] P. Bartlett and J. Shawe-Taylor (1999) Generalization performance of support vector machines and other pattern classifiers. Advances in Kernel methods - support vector learning, pp. 43–54. Cited by: §1.1.
  • [7] A. Boopathy, T. Weng, P. Chen, S. Liu, and L. Daniel (2019) Cnn-cert: an efficient framework for certifying robustness of convolutional neural networks. In AAAI, Vol. 33, pp. 3240–3247. Cited by: §3.3, §5.
  • [8] K. Cao, C. Wei, A. Gaidon, N. Arechiga, and T. Ma (2019) Learning imbalanced datasets with label-distribution-aware margin loss. In Advances in Neural Information Processing Systems, pp. 1567–1578. Cited by: §2.1.
  • [9] N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy (SP), pp. 39–57. Cited by: §3.2.1, §5.
  • [10] Z. Charles, S. Rajput, S. Wright, and D. Papailiopoulos (2019) Convergence and margin of adversarial training on separable data. arXiv preprint arXiv:1905.09209. Cited by: §4.
  • [11] B. Chen, W. Deng, and H. Shen (2018) Virtual class enhanced discriminative embedding learning. In Advances in Neural Information Processing Systems, pp. 1942–1952. Cited by: §3.1, TABLE I.
  • [12] C. Cheng, G. Nührenberg, and H. Ruess (2017) Maximum resilience of artificial neural networks. In International Symposium on Automated Technology for Verification and Analysis, pp. 251–268. Cited by: §3.3, §5.
  • [13] J. Cohen, E. Rosenfeld, and Z. Kolter (2019) Certified adversarial robustness via randomized smoothing. In ICML, pp. 1310–1320. Cited by: Fig. 2, §3.3, §5.
  • [14] C. Cortes and V. Vapnik (1995) Support-vector networks. Machine learning 20 (3), pp. 273–297. Cited by: §1.
  • [15] A. Cotter, S. Shalev-Shwartz, and N. Srebro (2013) Learning optimally sparse support vector machines. In International Conference on Machine Learning, pp. 266–274. Cited by: §1.
  • [16] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio (2016) Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830. Cited by: §4.
  • [17] K. Crammer and Y. Singer (2001) On the algorithmic implementation of multiclass kernel-based vector machines. Journal of machine learning research 2 (Dec), pp. 265–292. Cited by: §1.
  • [18] F. Croce, M. Andriushchenko, and M. Hein (2019) Provable robustness of relu networks via maximization of linear regions. In AISTATS, Cited by: §3.3.
  • [19] A. Daniely (2017) SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems, pp. 2422–2430. Cited by: §2.2.
  • [20] J. Deng, J. Guo, and S. Zafeiriou (2019) Arcface: additive angular margin loss for deep face recognition. In CVPR, Cited by: Fig. 2, §3.1, TABLE I.
  • [21] T. DeVries and G. W. Taylor (2017) Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552. Cited by: §4.
  • [22] G. W. Ding, Y. Sharma, K. Y. C. Lui, and R. Huang (2019) Max-margin adversarial (mma) training: direct input space margin maximization through adversarial training. arXiv preprint arXiv:1812.02637. Cited by: §3.2.1, §3.2, §5.
  • [23] K. Dvijotham, R. Stanforth, S. Gowal, T. A. Mann, and P. Kohli (2018) A dual approach to scalable verification of deep networks. In UAI, pp. 550–559. Cited by: §3.3, §5.
  • [24] C. Elkan and K. Noto (2008) Learning classifiers from only positive and unlabeled data. In SIGKDD international conference on Knowledge discovery and data mining, pp. 213–220. Cited by: §3.1.
  • [25] G. F. Elsayed, D. Krishnan, H. Mobahi, K. Regan, and S. Bengio (2018) Large margin deep networks for classification. In NeurIPS, Cited by: Fig. 2, §3.2, §5.
  • [26] A. Fromherz, K. Leino, M. Fredrikson, B. Parno, and C. Păsăreanu (2021) Fast geometric projections for local robustness certification. In ICLR, Cited by: §3.3, §5.
  • [27] A. Galloway, G. W. Taylor, and M. Moussa (2017) Attacking binarized neural networks. In ICLR, Cited by: §4.
  • [28] T. Gong, G. Wang, J. Ye, Z. Xu, and M. Lin (2018) Margin based pu learning. In AAAI, Cited by: §3.1.
  • [29] I. J. Goodfellow, D. Erhan, P. L. Carrier, A. Courville, M. Mirza, B. Hamner, W. Cukierski, Y. Tang, D. Thaler, D. Lee, et al. (2013) Challenges in representation learning: a report on three machine learning contests. arXiv preprint arXiv:1307.0414. Cited by: §3.1.
  • [30] I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In ICLR, Cited by: §2.1.
  • [31] A. Graves, A. Mohamed, and G. Hinton (2013) Speech recognition with deep recurrent neural networks. In ICASSP, Cited by: §1, §3.1.
  • [32] S. Gunasekar, J. D. Lee, D. Soudry, and N. Srebro (2018) Implicit bias of gradient descent on linear convolutional networks. In NeurIPS, pp. 9461–9471. Cited by: §2.2.
  • [33] Y. Guo, L. Chen, Y. Chen, and C. Zhang (2020) On connections between regularizations for improving dnn robustness. IEEE transactions on pattern analysis and machine intelligence. Cited by: §3.1.
  • [34] Y. Guo, A. Yao, and Y. Chen (2016) Dynamic network surgery for efficient dnns. In NeurIPS, pp. 1379–1387. Cited by: §4.
  • [35] Y. Guo, C. Zhang, C. Zhang, and Y. Chen (2018) Sparse dnns with improved adversarial robustness. In NeurIPS, pp. 242–251. Cited by: §3.2.1, §4.
  • [36] S. Han, J. Pool, S. Narang, H. Mao, E. Gong, S. Tang, E. Elsen, P. Vajda, M. Paluri, J. Tran, et al. (2016) DSD: dense-sparse-dense training for deep neural networks. In ICLR, Cited by: §4.
  • [37] S. Han, J. Pool, J. Tran, and W. Dally (2015) Learning both weights and connections for efficient neural network. In NeurIPS, pp. 1135–1143. Cited by: §4.
  • [38] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §1.
  • [39] M. Hein and M. Andriushchenko (2017) Formal guarantees on the robustness of a classifier against adversarial manipulation. In NeurIPS, pp. 2266–2276. Cited by: §3.1.
  • [40] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov (2012) Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580. Cited by: §3.1.
  • [41] J. Huang, Q. Qiu, G. Sapiro, and R. Calderbank (2015) Discriminative robust transformation learning. In NeurIPS, pp. 1333–1341. Cited by: §2.1.
  • [42] A. Jacot, F. Gabriel, and C. Hongler (2018) Neural tangent kernel: convergence and generalization in neural networks. In Advances in neural information processing systems, pp. 8571–8580. Cited by: §2.2.
  • [43] D. Jakubovitz and R. Giryes (2018) Improving dnn robustness to adversarial attacks using jacobian regularization. In ECCV, Cited by: §3.1.
  • [44] Z. Ji and M. Telgarsky (2018) Gradient descent aligns the layers of deep linear networks. In ICLR, Cited by: §2.2.
  • [45] Z. Ji and M. Telgarsky (2019) The implicit bias of gradient descent on nonseparable data. In Conference on Learning Theory, pp. 1772–1798. Cited by: §2.2.
  • [46] Y. Jiang, D. Krishnan, H. Mobahi, and S. Bengio (2019) Predicting the generalization gap in deep networks with margin distributions. In ICLR, Cited by: §5.
  • [47] G. Katz, C. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer (2017) Reluplex: an efficient smt solver for verifying deep neural networks. In International Conference on Computer Aided Verification, pp. 97–117. Cited by: §3.3, §5.
  • [48] K. Kawaguchi, L. P. Kaelbling, and Y. Bengio (2017) Generalization in deep learning. arXiv preprint arXiv:1710.05468. Cited by: §2.1.
  • [49] R. Kemker, M. McClure, A. Abitino, T. L. Hayes, and C. Kanan (2018) Measuring catastrophic forgetting in neural networks. In AAAI, Cited by: §3.1.
  • [50] T. Kobayashi (2019) Large margin in softmax cross-entropy loss. In BMVC, pp. 139. Cited by: §3.1.
  • [51] A. Krogh and J. A. Hertz (1992) A simple weight decay can improve generalization. In NeurIPS, pp. 950–957. Cited by: §3.1.
  • [52] A. Kurakin, I. Goodfellow, and S. Bengio (2017) Adversarial machine learning at scale. In ICLR, Cited by: §2.1.
  • [53] G. Lee, D. Alvarez-Melis, and T. S. Jaakkola (2019) Towards robust, locally linear deep networks. In ICLR, Cited by: §3.3.
  • [54] Y. Li, Z. Li, L. Ding, Y. Pan, C. Huang, Y. Hu, W. Chen, and X. Gao (2018) Supportnet: solving catastrophic forgetting in class incremental learning with support data. arXiv preprint arXiv:1806.02942. Cited by: §3.1.
  • [55] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In ICCV, pp. 2980–2988. Cited by: §3.1.
  • [56] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song (2017) Sphereface: deep hypersphere embedding for face recognition. In CVPR, Cited by: §3.1, TABLE I.
  • [57] W. Liu, Y. Wen, Z. Yu, and M. Yang (2016) Large-margin softmax loss for convolutional neural networks. In ICML, pp. 507–516. Cited by: Fig. 2, TABLE I, footnote 2.
  • [58] W. Liu, W. Luo, Z. Li, P. Zhao, and S. Gao (2019) Margin learning embedded prediction for video anomaly detection with a few anomalies. In IJCAI, pp. 3023–3030. Cited by: §3.1.
  • [59] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2018) Towards deep learning models resistant to adversarial attacks. In ICLR, Cited by: §3.2, §4, §5.
  • [60] S. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard (2017) Universal adversarial perturbations. In CVPR, Cited by: §3.2.1.
  • [61] S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2016) DeepFool: a simple and accurate method to fool deep neural networks. In CVPR, Cited by: §2.1, §3.2.1, §3.2.
  • [62] S. Moosavi-Dezfooli, A. Fawzi, J. Uesato, and P. Frossard (2019) Robustness via curvature regularization, and vice versa. In CVPR, Cited by: §3.1.
  • [63] M. S. Nacson, J. Lee, S. Gunasekar, P. H. P. Savarese, N. Srebro, and D. Soudry (2019) Convergence of gradient descent on separable data. In AISTATS, pp. 3420–3428. Cited by: §2.2.
  • [64] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro (2017) Exploring generalization in deep learning. In NeurIPS, pp. 5947–5956. Cited by: §3.1.
  • [65] B. Neyshabur, S. Bhojanapalli, and N. Srebro (2017) A pac-bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564. Cited by: footnote 3.
  • [66] R. Novak, Y. Bahri, D. A. Abolafia, J. Pennington, and J. Sohl-Dickstein (2018) Sensitivity and generalization in neural networks: an empirical study. In ICLR, Cited by: §3.1.
  • [67] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, Cited by: §3.1.
  • [68] S. Rajput, Z. Feng, Z. Charles, P. Loh, and D. Papailiopoulos (2019) Does data augmentation lead to positive margin? In ICML, Cited by: §4.
  • [69] A. S. Ross and F. Doshi-Velez (2018) Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. In AAAI, Cited by: §3.1.
  • [70] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet large scale visual recognition challenge. IJCV. Cited by: §3.3.
  • [71] B. Schölkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett (2000) New support vector algorithms. Neural computation 12 (5), pp. 1207–1245. Cited by: §1.1.
  • [72] F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In CVPR, pp. 815–823. Cited by: §3.1.
  • [73] J. Shawe-Taylor, N. Cristianini, et al. (2004) Kernel methods for pattern analysis. Cambridge university press. Cited by: §1.1.
  • [74] A. Shrivastava, A. Gupta, and R. Girshick (2016) Training region-based object detectors with online hard example mining. In CVPR, pp. 761–769. Cited by: §3.1.
  • [75] G. Singh, T. Gehr, M. Mirman, M. Püschel, and M. Vechev (2018) Fast and effective robustness certification. In NeurIPS, pp. 10802–10813. Cited by: §3.3, §5.
  • [76] A. J. Smola, P. L. Bartlett, D. Schuurmans, and B. Schölkopf (2000) Advances in large margin classifiers. MIT press. Cited by: §1, §2.1.
  • [77] J. Sokolić, R. Giryes, G. Sapiro, and M. R. Rodrigues (2017) Robust large margin deep neural networks. IEEE Transactions on Signal Processing 65 (16), pp. 4265–4280. Cited by: §2.1, §3.1, footnote 1.
  • [78] J. Sokolic, R. Giryes, G. Sapiro, and M. Rodrigues (2017) Generalization error of invariant classifiers. In AISTATS, pp. 1094–1103. Cited by: §2.1.
  • [79] D. Soudry, E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro (2018) The implicit bias of gradient descent on separable data. Journal of Machine Learning Research 19 (1), pp. 2822–2878. Cited by: §2.2.
  • [80] D. Stutz, M. Hein, and B. Schiele (2019) Confidence-calibrated adversarial training: towards robust models generalizing beyond the attack used during training. arXiv preprint arXiv:1910.06259. Cited by: §3.2.1.
  • [81] S. Sun, W. Chen, L. Wang, X. Liu, and T. Liu (2016) On the depth of deep neural networks: a theoretical view. In AAAI, pp. 2066–2072. Cited by: §3.1.
  • [82] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In ICLR, Cited by: §2.1.
  • [83] Y. Tang (2013) Deep learning using linear support vector machines. ICML Workshop. Cited by: §3.1.
  • [84] V. Tjeng, K. Xiao, and R. Tedrake (2017) Evaluating robustness of neural networks with mixed integer programming. arXiv preprint arXiv:1711.07356. Cited by: Fig. 2, §3.3, §5.
  • [85] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun (2005) Large margin methods for structured and interdependent output variables. Journal of machine learning research 6 (Sep), pp. 1453–1484. Cited by: §1.
  • [86] Y. Tsuzuku, I. Sato, and M. Sugiyama (2018) Lipschitz-margin training: scalable certification of perturbation invariance for deep neural networks. In NeurIPS, pp. 6541–6550. Cited by: §3.3.
  • [87] V. Vapnik (1999) An overview of statistical learning theory. IEEE transactions on neural networks 10 (5), pp. 988–999. Cited by: §1.
  • [88] V. Vapnik (2013) The nature of statistical learning theory. Springer science & business media. Cited by: §1.1, §1.
  • [89] F. Wang, J. Cheng, W. Liu, and H. Liu (2018) Additive margin softmax for face verification. IEEE Signal Processing Letters 25 (7), pp. 926–930. Cited by: Fig. 2, §3.1, TABLE I.
  • [90] H. Wang, Y. Wang, Z. Zhou, X. Ji, Z. Li, D. Gong, J. Zhou, and W. Liu (2018) CosFace: large margin cosine loss for deep face recognition. In CVPR, pp. 5265–5274. Cited by: §3.1, TABLE I.
  • [91] P. Wang, J. Cui, C. Weng, and D. Yu (2019) Large margin training for attention based end-to-end speech recognition. In Interspeech, pp. 246–250. Cited by: §3.1.
  • [92] X. Wang, S. Wang, S. Zhang, T. Fu, H. Shi, and T. Mei (2018) Support vector guided softmax loss for face recognition. arXiv preprint arXiv:1812.11317. Cited by: §3.1.
  • [93] X. Wang, S. Zhang, Z. Lei, S. Liu, X. Guo, and S. Z. Li (2018) Ensemble soft-margin softmax loss for image classification. In IJCAI, Cited by: §3.1, TABLE I.
  • [94] Y. Wang, X. Wu, Q. Li, J. Gu, W. Xiang, L. Zhang, and V. O. Li (2018) Large margin few-shot learning. arXiv preprint arXiv:1807.02872. Cited by: §3.1.
  • [95] C. Wei, J. D. Lee, Q. Liu, and T. Ma (2019) Regularization matters: generalization and optimization of neural nets v.s. their induced kernel. In NeurIPS, Cited by: §2.2.
  • [96] C. Wei and T. Ma (2019) Data-dependent sample complexity of deep neural networks via lipschitz augmentation. In NeurIPS, Cited by: §3.1.
  • [97] C. Wei and T. Ma (2020) Improved sample complexities for deep networks and robust classification via an all-layer margin. In ICML, Cited by: footnote 3.
  • [98] K. Q. Weinberger and L. K. Saul (2009) Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research 10 (Feb), pp. 207–244. Cited by: §3.1.
  • [99] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li (2016) Learning structured sparsity in deep neural networks. In NeurIPS, pp. 2074–2082. Cited by: §4.
  • [100] T. Weng, H. Zhang, P. Chen, J. Yi, D. Su, Y. Gao, C. Hsieh, and L. Daniel (2018) Evaluating the robustness of neural networks: an extreme value theory approach. In ICLR, Cited by: §3.1.
  • [101] E. Wong and Z. Kolter (2018) Provable defenses against adversarial examples via the convex outer adversarial polytope. In ICML, pp. 5283–5292. Cited by: Fig. 2, §3.3, §5.
  • [102] K. Wu and Y. Yu (2019) Understanding adversarial robustness: the trade-off between minimum and average margin. arXiv preprint arXiv:1907.11780. Cited by: §3.2.1, §3.2.1.
  • [103] C. Xie, J. Wang, Z. Zhang, Y. Zhou, L. Xie, and A. Yuille (2017) Adversarial examples for semantic segmentation and object detection. In ICCV, Cited by: §2.1.
  • [104] H. Xu, C. Caramanis, and S. Mannor (2009) Robustness and regularization of support vector machines. Journal of Machine Learning Research 10 (Jul), pp. 1485–1510. Cited by: §1.1, §2.1, §2.1, footnote 1.
  • [105] H. Xu and S. Mannor (2012) Robustness and generalization. Machine learning 86 (3), pp. 391–423. Cited by: §2.1.
  • [106] Z. Yan, Y. Guo, and C. Zhang (2018) Deep defense: training dnns with improved adversarial robustness. In NeurIPS, Cited by: Fig. 2, §3.2.1, §3.2.1, §3.2, §5.
  • [107] Z. Yan, Y. Guo, and C. Zhang (2019) Adversarial margin maximization networks. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: Fig. 2, §3.2.1, §3.2.1, §3.2.1, §5.
  • [108] Y. Yoshida and T. Miyato (2017) Spectral norm regularization for improving the generalizability of deep learning. arXiv preprint arXiv:1705.10941. Cited by: §3.1.
  • [109] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2017) Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412. Cited by: §4.
  • [110] H. Zhang, T. Weng, P. Chen, C. Hsieh, and L. Daniel (2018) Efficient neural network robustness certification with general activation functions. In NeurIPS, pp. 4939–4948. Cited by: §3.3, §5.
  • [111] S. Zhang, C. Liu, K. Yao, and Y. Gong (2015) Deep neural support vector machines for speech recognition. In ICASSP, pp. 4275–4279. Cited by: §3.1.
  • [112] S. Zhang, R. Zhao, C. Liu, J. Li, and Y. Gong (2016) Recurrent support vector machines for speech recognition. In ICASSP, pp. 5885–5889. Cited by: §3.1.
  • [113] T. Zhang and Z. Zhou (2019) Optimal margin distribution machine. IEEE Transactions on Knowledge and Data Engineering 32 (6), pp. 1143–1156. Cited by: footnote 4.
  • [114] Y. Zhong and W. Deng (2019) Adversarial learning with margin-based triplet embedding regularization. arXiv preprint arXiv:1909.09481. Cited by: §3.2.
  • [115] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen (2017) Incremental network quantization: towards lossless cnns with low-precision weights. In ICLR, Cited by: §4.
  • [116] J. Zhu, S. Rosset, R. Tibshirani, and T. J. Hastie (2004) 1-norm support vector machines. In Advances in neural information processing systems, pp. 49–56. Cited by: §1.