1 Introduction
Although deep neural networks (DNNs) have shown state-of-the-art performance on a variety of challenging computer vision tasks, most of them are still “notorious” for requiring massive amounts of training data. In addition, a number of recent works demonstrate that DNNs are vulnerable to adversarial examples [12, 27, 25, 30], indicating that the models have problems generalizing to unseen data with possible types of distortions [38, 24]. These undesirable facts motivate us to analyze the generalization ability of DNNs and further find a principled way to improve it.
Our work in this paper stems specifically from a geometric point of view. We delve deeply into the margin of DNNs and advocate a large margin principle for training networks. Generally, such a principle can enhance the obtained models in two aspects: (a) maximum margin classifiers usually possess better generalization ability, which is both theoretically guaranteed and empirically verified [6, 42, 32, 51]; and (b) they can also be naturally more robust to adversarial examples, since it takes more effort to perturb the inputs towards decision boundaries. This paper substantially extends the work of Yan et al. published at NeurIPS 2018 [52].
The original concept of the large margin principle dates back to the last century. Novikoff, Cortes and Vapnik proved essential theorems for the perceptron [28] and support vector machines (SVMs) [6], based on a geometric margin and an assumption that the training data can be separated with it:
(1) 
Unlike the well-known scheme in linear classification [6], the geometric margin of nonlinear DNNs scarcely has closed-form solutions, making it nontrivial to incorporate in the training objective. Although many attempts have been made to pursue this target, most of them focus only on a margin in the representation space (a.k.a. feature space) [39, 17, 23, 45]. Representations learned in such a manner may show better intra-class compactness and inter-class discriminability, but in practice a large margin in the feature space does not necessarily guarantee a large margin in the input space [2], on account of the distance distortions of nonlinear hierarchical models. To address this problem, several recent methods have been developed to pursue a large margin directly in the input space. For example, An et al. [2] study contractive mappings in DNNs and propose contractive rectifier networks, while Sokolić et al. [34] try penalizing the Frobenius norm of a Jacobian matrix instead. These methods show significantly superior performance in scenarios where, for instance, the amount of training data is severely limited. However, aggressive assumptions and approximations seem inevitable in their implementations, making them less effective in practical scenarios where the assumptions are not really satisfied.
In this paper, we propose adversarial margin maximization (AMM), a learning-based regularization that exploits an adversarial perturbation as a proxy of the margin. Our core idea is to incorporate a norm-based adversarial attack into the training process, and leverage its perturbation magnitude as an estimation of the geometric margin. Current state-of-the-art attacks typically achieve a 100% success rate on powerful DNNs [25, 4, 10], while the norm of the perturbation can be reasonably small and thus fairly close to the real margin values. Since the adversarial perturbation is also parameterized by the network parameters (including weights and biases), our AMM regularizer can be jointly learned with the original objective through backpropagation. We conduct extensive experiments on the MNIST, CIFAR10/100, SVHN and ImageNet datasets to verify the effectiveness of our method. The results demonstrate that our AMM significantly improves the test-set accuracy of a variety of DNN architectures, indicating an enhanced generalization ability as the training-set accuracies remain similar.
The rest of this paper is organized as follows. In Section 2, we introduce representative margin-based methods for training DNNs. In Section 3, we highlight our motivation and describe our AMM in detail. In Section 4, we experimentally validate the effectiveness of our proposed method. Finally, Section 5 draws the conclusions.
2 Related Work
Margin-based methods have a long research history in the field of pattern recognition and machine learning. Theoretical relationships between the generalization ability and geometric margin of linear classifiers have been comprehensively studied [42], and the ideas of leveraging the large margin principle and constructing maximal-margin separating hyperplanes [42] also act as essential ingredients of many classical learning machines, cf. the famous SVM [6].
Benefiting from its solid theoretical foundation and intuitive geometric explanations, the large margin principle has been widely applied to a variety of real-world applications such as face detection [29], text classification [40], gene selection for cancer classification [13], etc. Nevertheless, there are as yet few methods for applying such a principle to DNNs, which are ubiquitous tools for solving modern machine learning and pattern recognition tasks (but are also structurally very complex and generally considered as black boxes). This is mostly because the margin cannot, in general, be calculated analytically as it can with SVMs.
In view of such opportunities and challenges, one line of research aims at improving a “margin” in the representation space of DNNs instead. For instance, Tang [39]
replaces the final softmax layer with a linear SVM to maximize a margin in the last layer. Hu et al. [17] propose a discriminative learning method for deep face verification, by enforcing the distance between each positive pair in the representation space to be smaller than a fixed threshold, and the distance between each negative pair to be larger than another threshold, so that a margin is naturally formed. A similar strategy, first invented by Weinberger et al. [47] and dubbed the triplet loss, has also been widely adopted in many face recognition systems, e.g., FaceNet [33]. Sun et al. [36] theoretically study a margin in the output layer of DNNs, and propose a way of reducing empirical margin errors. In the same spirit, Wang et al. [45] propose to further enhance the discriminability of DNN features via an ensemble strategy.
Sticking with the representation space, some recent works also advocate large margins under different “metrics”, e.g., cosine similarity
[44] and the angular distance between logit vectors and the ground truth [23, 22, 8]. In essence, these methods maximize the inter-class variance while minimizing the intra-class variance, and thus the learned representations can be more discriminative. However, as previously discussed [2], owing to the high structural complexity of DNNs and possible distance distortions of nonlinear models, a large margin in the feature space does not necessarily assure a large margin in the input space. Consequently, the benefits mentioned in Section 1 are not guaranteed. See Section 4 for some empirical analyses. It is also worth mentioning that some previous works suggest that DNNs trained using stochastic gradients converge to large margin classifiers, but the convergence speed is very slow [35, 46].
A few attempts have also been made towards enlarging the margin in the input space. In a recent work, An et al. [2] propose contractive rectifier networks, in which the input margin is proved to be bounded from below by an output margin, which is inherently easier to optimize by leveraging off-the-shelf methods like SVM. Sokolić et al. [34] reveal connections between the spectral norm of a Jacobian matrix and the margin, and try to regularize a simplified version of its Frobenius norm. Contemporaneous with our work, Elsayed et al. [11] propose to use a one-step linear approximation to DNN mappings and enlarge the margin explicitly. These methods are closely related to ours, but their implementations require rough approximations and can be suboptimal in many practical applications. Some detailed discussions are deferred to Section 3.3.4, and experimental results for comparing with them will be provided in Section 4. Ding et al. [9] also aim to approximate the margin more accurately, but their method differs from ours in multiple ways, as will be discussed in the Appendix.
3 Adversarial Margin Maximization
In this section, we introduce our method for pursuing a large margin in DNNs. First, we briefly review our recent work on Deep Defense [52], which is aimed at training DNNs with improved adversarial robustness. We believe its functionality can be naturally regarded as maximizing the margin. We then formalize the definition of margin and discuss the generalization ability based on it. Finally, we introduce our AMM for improving the generalization ability of DNNs.
3.1 Our Deep Defense
Deep Defense improves the adversarial robustness of DNNs by endowing models with the ability of learning from attacks [52]. On account of the high success rate and reasonable computational complexity, it chooses DeepFool [25] as a backbone and tries to enlarge the norm of its perturbations. (Footnote 1: There exist stronger attacks which approximate the margin more precisely, such as Carlini and Wagner's [4], but they are computationally more complex; we have demonstrated that defending against DeepFool also helps to defend against Carlini and Wagner's attack [52], so it can be a reasonable proxy.)
Suppose a binary classifier whose input is a real-valued vector and whose predictions are made by taking the sign of the classifier's output. DeepFool generates the perturbation with an iterative procedure. At each step, the perturbation is obtained by applying a first-order approximation to the classifier and solving:
(2) 
where denotes , and . Problem (2) can be solved analytically:
(3) 
After computing all the per-step perturbations sequentially, the final DeepFool perturbation is obtained by adding up the perturbations obtained from each step:
(4) 
where the number of summed steps is bounded by the maximum number of iterations allowed. If the predicted class of the perturbed input changes at any iteration before the last one, the loop should terminate in advance. Such a procedure directly generalizes to multi-class cases, as long as a target label is properly chosen at each iteration. Given a baseline network, the procedure may take about 1-3 iterations to converge on small datasets such as MNIST/CIFAR10, and 3-6 iterations on large datasets like ImageNet [7]. (Footnote 2: For example, it converges within 3 iterations on all the MNIST images with a 5-layer LeNet reference model, and within 6 iterations on 99.63% of the ImageNet images with a ResNet18 reference model.)
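The iterative procedure described above can be illustrated with a minimal sketch for a scalar-valued binary classifier. It assumes the standard Euclidean-norm form of the DeepFool step (each step moves along the gradient by the current output divided by the squared gradient norm) and the small overshoot factor used in the reference DeepFool implementation; the names `deepfool_binary`, `circle`, and `circle_grad` are hypothetical and chosen only for this illustration.

```python
import math


def deepfool_binary(f, grad_f, x, max_iter=50, overshoot=0.02):
    """Accumulate per-step perturbations until the sign of f flips.

    f      : callable returning a scalar score; its sign is the prediction.
    grad_f : callable returning the gradient of f w.r.t. the input.
    Assumption: Euclidean-norm steps and a small overshoot factor.
    """
    x = list(x)
    delta = [0.0] * len(x)
    original_sign = f(x) >= 0
    for _ in range(max_iter):
        x_adv = [a + (1.0 + overshoot) * d for a, d in zip(x, delta)]
        value = f(x_adv)
        if (value >= 0) != original_sign:
            break  # prediction changed: terminate in advance
        g = grad_f(x_adv)
        g_sq = sum(c * c for c in g)
        if g_sq == 0.0:
            break  # no descent direction available
        # Smallest Euclidean step r that zeroes the linearized classifier:
        # value + g . r = 0  =>  r = -(value / ||g||^2) * g
        step = [-(value / g_sq) * c for c in g]
        delta = [d + s for d, s in zip(delta, step)]
    return [(1.0 + overshoot) * d for d in delta]


def circle(x):
    """Toy nonlinear classifier: positive outside the unit circle."""
    return x[0] ** 2 + x[1] ** 2 - 1.0


def circle_grad(x):
    return [2.0 * x[0], 2.0 * x[1]]


if __name__ == "__main__":
    pert = deepfool_binary(circle, circle_grad, [2.0, 0.0])
    print("perturbation:", pert)
    print("norm:", math.sqrt(sum(p * p for p in pert)))
```

For the toy input at distance 1 from the unit circle, the accumulated perturbation norm ends up close to that true distance, which is the sense in which the perturbation norm can serve as a margin estimate.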
3.1.1 Regularization and High Order Gradients
In fact, the norm of the DeepFool perturbation is a popular metric for evaluating adversarial attacks and the robustness of DNNs [25, 4, 5]. Given an input vector, the gradient of the network with respect to it, as well as the resulting perturbation, are both parameterized by the learnable parameters of the network. Consequently, in order to give preference to those functions with stronger robustness, one can simply penalize the norm of the perturbation as a regularization term during training. With modern deep learning frameworks such as PyTorch [31] and TensorFlow [1], differentiating this term can be done via automatic differentiation with higher-order gradients. One might also achieve this by building a “reverse” network to mimic the backward process of the network, as described in [52].
We emphasize that the high-order gradients form an essential component of our method. In principle, if no gradient flows through the input gradient, the regularization can be viewed as maximizing the norm in a normalized logit space, with the gradient norm as the normalizer. Considering possible distance distortions of nonlinear hierarchical DNNs, such a variant is less effective in influencing the perturbation. Experimental analysis will be further given in Section 4.3.1 to verify the importance of the high-order gradients.
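To make the role of the gradient flowing through the normalizer concrete, the following sketch uses a linear classifier as a stand-in, for which the Euclidean perturbation norm has the closed form |w·x + b| / ||w|| and the input gradient is simply w. This is an illustrative assumption rather than the setting used for DNNs; the function names are hypothetical. The "masked" gradient treats ||w|| as a constant, while the "full" gradient also differentiates through it.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def perturbation_norm(w, b, x):
    """Distance from x to the hyperplane w.x + b = 0."""
    return abs(dot(w, x) + b) / math.sqrt(dot(w, w))


def full_gradient(w, b, x):
    """Gradient of the perturbation norm w.r.t. w, including the normalizer."""
    f = dot(w, x) + b
    n = math.sqrt(dot(w, w))
    s = 1.0 if f >= 0 else -1.0
    return [s * xi / n - abs(f) * wi / n ** 3 for xi, wi in zip(x, w)]


def masked_gradient(w, b, x):
    """Same gradient with the normalizer ||w|| treated as a constant."""
    f = dot(w, x) + b
    n = math.sqrt(dot(w, w))
    s = 1.0 if f >= 0 else -1.0
    return [s * xi / n for xi in x]


def numeric_gradient(w, b, x, eps=1e-6):
    """Central finite differences, used only to check full_gradient."""
    grad = []
    for i in range(len(w)):
        hi = list(w)
        lo = list(w)
        hi[i] += eps
        lo[i] -= eps
        grad.append((perturbation_norm(hi, b, x) - perturbation_norm(lo, b, x)) / (2 * eps))
    return grad


if __name__ == "__main__":
    w, b, x = [3.0, 4.0], 1.0, [1.0, 2.0]
    print("full   :", full_gradient(w, b, x))
    print("masked :", masked_gradient(w, b, x))
    print("numeric:", numeric_gradient(w, b, x))
```

In this toy case the two gradients differ even in sign for one coordinate, which illustrates why masking the normalizer changes the update direction and not only its scale.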
3.1.2 Correctly and Incorrectly Classified Samples
Deep Defense applies different regularization strategies on correctly and incorrectly classified samples. Specifically, if an input is correctly classified during training, we expect it to be pushed further away from the decision boundary, thus a smaller value of is anticipated. In practice, the target class at each iteration is chosen to be the one results in the smallest . Conversely, if an input is misclassified by the current model, we instead expect it to be closer to the decision boundary (between its current prediction and the groundtruth label), since we may intuitively hope the input sample to be correctly classified by the model in the future. The target class is always set to be the groundtruth and a larger value of is anticipated.
In summary, the Deep Defense regularizer can be written as:
(5) 
or similarly
(6) 
where is the number of training samples, is the index set of correctly classified training examples, is its complement, are two hyperparameters balancing these two groups of samples, is the shrinkage function that balances examples within the same group (details in Section 3.3). The sets and are updated in each training iteration. The whole optimization problem is given by:
(7) 
where is the original classification objective (e.g., crossentropy or hinge loss), is the coefficient for regularizer, and is the weight decay term. We adopt the unnormalized version (6) in this paper, since it connects the most to the margin to be defined in Section 3.2, and further the generalization ability.
3.2 Margin, Robustness and Generalization
Deep Defense achieves remarkable performance in resisting different adversarial attacks. Apart from the improved robustness, we also observe increased inference accuracies on the benign test sets (cf. the fourth column of Table 1 in the paper [52]). We believe the superiority of our inference accuracies is related to the nature of the large margin principle. In order to analyze this conjecture in detail, we first formalize the definition of an instance-specific margin and introduce some prior theoretical results [51] as below.
Definition 3.1
Let us denote by a decision function, then the instancespecific margin of a sample w.r.t. is the minimal distance from to the decision boundary.
Definition 3.2
[51] Let be a sampled training set, and
be the loss function. An algorithm is
robust if the sample space can be partitioned into K disjoint sets denoted by , , such that for all and ,(8) 
Theorem 3.1
[51] If there exists for all , then the learning algorithm is robust, in which is the covering number (footnote 3: the definition of the covering number can be found in [41], and it is monotonically decreasing w.r.t. its first argument) of the input space , and is the norm, in which the 0/1 loss is chosen.
Theorem 3.1 establishes an intrinsic connection between the concerned instance-specific margin and a defined “robustness”. Such robustness is different from the adversarial robustness by definition, but it theoretically connects to the generalization error of , as shown in Theorem 3.2.
Theorem 3.2
[51] Let be the underlying distribution of the sample . If an algorithm is robust and for all , then for any
, with probability at least
, it holds that(9) 
in which is the generalization error of , given by
(10) 
Theorem 3.2, along with Theorem 3.1, advocates a large margin in the input space and guarantees the generalization ability of learning machines. Also, it partially explains the superiority of our Deep Defense trained DNNs on benign-set inference accuracies. However, since the regularizer is originally designed for resisting attacks, it may be suboptimal for improving the generalization ability (or reducing the generalization error) of learning machines.
In fact, for linear binary classifiers where , assuming the training data is fully separable, the regularization boils down to minimizing (footnote 4: we choose , and for simplicity):
(11) 
Since scaling the classifier parameters by any positive scalar does not change the value of our regularization term, we constrain the norm of the weight vector to make the problem well-posed. We denote the index sets of samples from the positive class and the negative class separately, and further assume the numbers of training samples in the positive and negative classes to be identical; then Eq. (11) can be rewritten as:
(12) 
in which the bias term has been canceled out. Obviously, minimizing Eq. (12) under the norm constraint yields a weight vector proportional to the difference between the two class means, scaled by a normalizer that enforces the constraint. Geometrically, the resulting decision boundary is orthogonal to the line segment connecting the center of mass of the positive training samples and that of the negative training samples. Note that all training samples in (12), no matter how far away from the decision boundary, contribute equally to the solution. An undesired consequence of such a formulation is that the regularizer can be severely influenced by samples not really close to the decision boundary. As a result, such a solution may possess a poor global margin, since the global margin is generally the “worst-case” distance from training samples to the decision boundary, dominated mainly by those close to it.
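The geometric claim above can be checked numerically with a small sketch. It assumes balanced classes and a unit-norm weight vector, in which case the average signed distance to the boundary no longer depends on the bias; the data points and function names below are made up for illustration only.

```python
import math


def class_mean(points):
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def centroid_direction(pos, neg):
    """Unit vector pointing from the negative class mean to the positive one."""
    diff = [a - b for a, b in zip(class_mean(pos), class_mean(neg))]
    length = math.sqrt(sum(d * d for d in diff))
    return [d / length for d in diff]


def average_signed_score(w, pos, neg):
    """Average of y_i * (w . x_i) over both classes.

    With balanced classes the bias term cancels, so it is omitted here.
    """
    total = sum(sum(wi * xi for wi, xi in zip(w, p)) for p in pos)
    total -= sum(sum(wi * xi for wi, xi in zip(w, p)) for p in neg)
    return total / (len(pos) + len(neg))


if __name__ == "__main__":
    pos = [(2.0, 1.0), (3.0, 3.0), (4.0, 0.0), (3.0, 2.0)]
    neg = [(-2.0, 0.0), (-3.0, -1.0), (-1.0, -2.0), (-2.0, -3.0)]
    w_star = centroid_direction(pos, neg)
    best = average_signed_score(w_star, pos, neg)
    # Sweep unit vectors on a one-degree grid and compare.
    sweep = max(
        average_signed_score(
            [math.cos(math.radians(k)), math.sin(math.radians(k))], pos, neg
        )
        for k in range(360)
    )
    print("centroid direction:", w_star)
    print("objective at centroid direction:", best, "best on grid:", sweep)
```

No direction on the grid exceeds the value reached by the centroid direction, yet nothing in this objective looks at the sample closest to the boundary, which is the weakness discussed above.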
3.3 Our Main Framework
We know from the previous section that although the margin, robustness and generalization ability of DNNs are theoretically connected, there is an intrinsic distinction between our current method and a desired margin maximization. In this section, we provide further analyses and introduce an aggregation function (in Section 3.3.1) and a shrinkage function (in Section 3.3.2) designed specifically to exploit the large margin principle more effectively in practice.
3.3.1 Aggregation Function
Deep Defense in Eq. (6) aggregates regularization information from training samples (in a batch) by averaging. However, this aggregation strategy can be suboptimal for improving the generalization ability, since Theorem 3.2 suggests a minimal perturbation (rather than the adopted average). Ideally, one can apply regularization only on the sample with the minimal perturbation over the whole training set to maximize the margin. Nevertheless, the gradient of such a regularizer will be zero for most of the training samples, and in practice it takes much longer to train and achieve satisfactory results. This is different from the well-known scheme in linear SVM.
To achieve a reasonable trade-off between the theoretical requirement and regularization strength, we consider using a min aggregation function within each batch instead of the whole training set during training. Specifically, for a correctly classified sample, we apply regularization to it iff two conditions are fulfilled simultaneously: (a) its perturbation norm is the smallest among all samples with the same ground-truth label in this batch, and (b) its perturbation norm belongs to the smallest 20% in this batch. If a correctly classified sample does not satisfy these two conditions, we simply set its regularization term to zero in the current training step. If instead a sample is misclassified by the current model, we expect to decrease its distance to the correct prediction. In analogy to the abbreviation MIN for the above strategy, we denote the original Deep Defense strategy (i.e., averaging all) as AVG.
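The two selection conditions for correctly classified samples can be sketched as a small batch-level routine. This is only an illustration of the rule as stated: the function name `select_min_samples` is hypothetical, and two details are assumptions not fixed by the text, namely that the per-class minimum is taken over correctly classified samples only and that the smallest-20% ranking is taken over the whole batch.

```python
def select_min_samples(norms, labels, correct, fraction=0.2):
    """Return sorted indices of correctly classified samples to regularize.

    norms   : perturbation norm of each sample in the batch
    labels  : ground-truth label of each sample
    correct : True where the current model classifies the sample correctly
    """
    n = len(norms)
    k = max(1, round(fraction * n))
    # Condition (b): among the k smallest perturbation norms in the batch.
    order = sorted(range(n), key=lambda i: (norms[i], i))
    smallest_k = set(order[:k])
    # Condition (a): smallest norm among samples sharing the same label
    # (assumption: only correctly classified samples compete here).
    best_per_class = {}
    for i in range(n):
        if not correct[i]:
            continue
        c = labels[i]
        if c not in best_per_class or norms[i] < norms[best_per_class[c]]:
            best_per_class[c] = i
    return sorted(i for i in best_per_class.values() if i in smallest_k)


if __name__ == "__main__":
    norms = [0.5, 0.1, 0.9, 0.2, 0.3, 0.8, 0.7, 0.6, 0.4, 1.0]
    labels = [0, 0, 1, 1, 2, 2, 0, 1, 2, 0]
    correct = [True] * 10
    print(select_min_samples(norms, labels, correct))
```

Every other correctly classified sample in the batch would receive a zero regularization term for that step, so only a few samples per batch drive the regularizer.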
3.3.2 Shrinkage Function
As discussed in [52], if we penalize the perturbation norm directly (i.e., using a linear function in Eqs. (5) and (6)), some “extremely robust” samples may dominate the regularization term, which has a negative impact on the training procedure. What is worse, the regularization term will never diminish to zero with a linear function. To alleviate these problems, we choose a nonlinear “shrinkage” function instead. It should be monotonically increasing, such that correctly classified samples with abnormally small perturbation norms are penalized more than those with relatively large values. Essentially, concentrating more on samples with small instance-specific margins coheres with Theorem 3.2, since we know the minimal (instance-specific) margin probably connects the most to the generalization ability. We will demonstrate the performance of different choices on training DNNs: (a) a linear function, denoted by LIN, (b) an exponential function, denoted by EXP, and (c) an inverse function, denoted by INV; this also differs from the well-known scheme in linear SVM. For INV, we keep its argument within the valid range by setting appropriate hyperparameter values and truncating abnormally large values with a threshold.
3.3.3 Experiments on Toy Data
We first conduct an exploratory experiment on synthetic 2D data to illustrate how the choices of the functions may affect classification in a binary case. Suppose that the 2D data from the two classes are uniformly distributed in two rectangles, respectively. For each class, we synthesize 200 samples for training and hold out another 200 for testing. We train linear classifiers to minimize the regularization term taking various forms. We set the batch size to 20, and train models for 1000 epochs with the standard SGD optimizer. The learning rate is initially set to 0.1, and cut by half every 250 epochs. We use the popular momentum of 0.9 and a weight decay of 1e-4.
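The learning-rate policy of the toy experiment (start at 0.1, halve every 250 epochs, 1000 epochs in total) amounts to a simple step schedule. The sketch below assumes zero-based epoch indices; the function name is hypothetical.

```python
def toy_learning_rate(epoch, base_lr=0.1, drop_every=250, factor=0.5):
    """Step schedule: multiply the rate by `factor` every `drop_every` epochs.

    Assumption: epochs are counted from 0.
    """
    return base_lr * factor ** (epoch // drop_every)


if __name__ == "__main__":
    for epoch in (0, 249, 250, 500, 750, 999):
        print(epoch, toy_learning_rate(epoch))
```

Under this schedule the last 250 epochs run at one eighth of the initial rate.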
The learned decision boundaries in different configurations are illustrated in Fig. 1 together with the test samples. With purely AVG+LIN, the obtained boundary is roughly orthogonal to the line connecting the centers of the two classes, which is consistent with our previous analysis in Section 3.2. Although we have striven to penalize incorrect classifications, the obtained model in this setting is still unable to gain excellent accuracy, because all training samples contribute equally to pushing the decision boundary. Better generalization ability and thus improved test-set accuracies can be obtained in the AVG+EXP and AVG+INV settings, as depicted in Fig. 1 (b) and (c). This verifies the effectiveness of the nonlinear shrinkage functions, which are aimed at regularizing “extremely robust” samples less. It can also be seen that the MIN aggregation function leads to a more sensible margin. Optimal (or near-optimal) boundaries are obtained in the MIN settings, which attain the largest possible margin of 1. As discussed in Section 3.2, the MIN aggregation function is more closely related to the geometric margin than AVG is, so it can directly facilitate the margin as well as the generalization ability.
Method | Aggregation | Shrinkage | Error Rate (%)
baseline | - | - | 1.79 ± 0.06
AMM | AVG | LIN | 2.26 ± 0.05
AMM | AVG | INV | 1.18 ± 0.03
AMM | AVG | EXP | 0.94 ± 0.02
AMM | MIN | LIN | 1.28 ± 0.02
AMM | MIN | INV | 0.97 ± 0.03
AMM | MIN | EXP | 0.90 ± 0.03
We further conduct an experiment on MNIST [20], which is a real-world dataset. Here a simple multilayer perceptron (MLP) is adopted and trained with the cross-entropy loss. To achieve the best performance within each configuration, we first run a grid search over the hyperparameters. Table I shows the final error rates, while Fig. 2 illustrates the convergence curves of our AMM with different aggregation and shrinkage functions. We repeat the training five times with different random initializations to also report the standard deviations of the error rates. We see that with AVG+LIN the obtained mean error rate even increases from 1.79% to 2.26% in comparison with the reference model, indicating that treating all samples equally in the regularizer can have a negative impact on the generalization ability. With the help of the MIN aggregation function, our AMM with LIN achieves a 1.28% error rate, which is far lower than that with AVG+LIN (2.26%), as well as that of the reference model (1.79%). Such a positive effect of the MIN aggregation function is consistent with our observation on the synthetic 2D data.
The benefit of the nonlinear shrinkage functions is also highlighted. When compared with the LIN function, the nonlinear shrinkage functions INV and EXP gain relative error decreases of 47% and 58% within the same AVG setting, and 24% and 29% within the MIN setting. Such results support our intuitions and insights in Section 3.3.2.
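The relative error decreases quoted above follow directly from the mean error rates in Table I. The short sketch below redoes that arithmetic; the dictionary layout and function name are illustrative choices, and only the mean values from the table are used.

```python
# Mean error rates (%) taken from Table I.
MEAN_ERROR = {
    ("AVG", "LIN"): 2.26,
    ("AVG", "INV"): 1.18,
    ("AVG", "EXP"): 0.94,
    ("MIN", "LIN"): 1.28,
    ("MIN", "INV"): 0.97,
    ("MIN", "EXP"): 0.90,
}


def relative_decrease(reference, value):
    """Relative decrease of `value` w.r.t. `reference`, in percent."""
    return 100.0 * (reference - value) / reference


if __name__ == "__main__":
    for aggregation in ("AVG", "MIN"):
        lin = MEAN_ERROR[(aggregation, "LIN")]
        for shrinkage in ("INV", "EXP"):
            drop = relative_decrease(lin, MEAN_ERROR[(aggregation, shrinkage)])
            print(aggregation, shrinkage, "%.1f%%" % drop)
```

The whole-percent parts of the four results are 47, 58, 24 and 29, matching the figures in the text.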
3.3.4 Implementation Details and Discussions
Our framework has an intrinsic connection with SVM. For a linear binary classification problem with separable training data, SVM can be viewed as a special case of our framework, if we remove the classification objective from the whole optimization problem (7). It can be easily verified that the MIN+LIN regularization (along with the weight decay term) is equivalent to a hard-margin linear SVM, if the current model achieves perfect accuracy (i.e., 100%) on the training set and we use all training samples in each batch. Moreover, as verified in the previous section, the aggregation function ought to be more essential than the shrinkage function in linear binary cases.
The landscape and margin of nonlinear DNNs are much more complex than those of linear models and generally infeasible to compute, especially when multiple classes are involved. Our framework exploits an adversarial perturbation as a proxy of the margin. With different configurations of the aggregation and shrinkage functions, it formulates a variety of regularization types. They might contribute more to the generalization ability (e.g., the one with MIN+EXP) or to robustness (e.g., our Deep Defense with AVG and approximately EXP). All the encompassed variants share a similar core idea, which is to incorporate an adversarial attack into the training process. In particular, we utilize the perturbation norm as an estimation of the margin. Current state-of-the-art attacks typically achieve a 100% success rate on powerful DNNs, while the norm of the perturbation can be reasonably small and thus fairly close to the real margin values. We choose the norm specifically to comply with the previous theoretical analysis. Also, we know from Section 3.3.3 that MIN+EXP trades off test-set performance in favor of theoretical margins. We leave the choice of the classification objective to customized network configurations, in parallel with our AMM configurations. In fact, we have tested our AMM with popular choices for the classification objective, including the cross-entropy loss and the hinge loss, but never found a significant difference in the experiments.
By delving deeply into the geometric margin, we unify a set of learning-based regularizers within the proposed AMM framework. Guidelines are correspondingly provided in case one prefers the generalization ability to robustness or DNNs to linear models. Contemporaneous with our work, Elsayed et al. [11] propose to linearize the forward mapping of DNNs, which is somewhat similar to a single-step Deep Defense without utilizing the high-order gradients (as in Section 3.1) and the nonlinear shrinkage function (as in Section 3.3.2). See more discussions in Section 4.3.1.
4 Experimental Verifications
In this section, we experimentally verify the remarks and conjectures presented in previous sections and evaluate the performance of our AMM, specifically with MIN+EXP, on various datasets (including MNIST, CIFAR10/100, SVHN and ImageNet). We compare our derived method with state-of-the-art methods to demonstrate its effectiveness.
4.1 Datasets and Models
We perform extensive experiments on five commonly used classification datasets: MNIST [20], CIFAR10/100 [19], SVHN [26] and ImageNet [7]. Dataset and network configurations are described below. For MNIST, CIFAR10/100, and SVHN, we construct a held-out validation set for hyperparameter selection by randomly choosing 5k images from the training set. For ImageNet, as a common practice, we train models on the 1.2 million training images, and report top-1 error rates on the 50k validation images.
4.1.1 MNIST
MNIST consists of 70k grayscale images, of which 60k are used for training and the remaining ones for testing. We train deep networks with different architectures on MNIST: (a) a four-layer MLP (2 hidden layers, 800 neurons in each) with ReLU activations, (b) LeNet [20], and (c) a deeper CNN with 12 weight layers named “LiuNet” [23, 45]. Similar to many previous works, we subtract the mean for both training and test data in preprocessing, and no data augmentation is adopted. For more details about these architectures, please see our appendix.
4.1.2 CIFAR10/100, and SVHN
Both CIFAR10 and CIFAR100 contain 60k color images, including 50k training images and 10k test images. SVHN is composed of 630k color images, of which 604k are used for training and the remaining ones for testing. For these datasets, we train six networks: (a) a light ConvNet with the same architecture as in [49], (b) the network-in-network (NIN) [21], (c) the “LiuNet” as applied in the CIFAR10 experiments in [23], (d)-(e) the standard ResNet20/56 [15] architectures, and (f) a DenseNet40 [18] in which all layers are connected. We uniformly resize each image to 36x36, and randomly crop a 32x32 patch during training as data augmentation. Moreover, we apply random horizontal flipping with a probability of 0.5 to combat overfitting, except on SVHN.
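The crop-and-flip augmentation described above can be sketched on a plain nested-list image. The resizing to 36x36 is assumed to have been done beforehand (it needs an image library and is omitted), and the function name is hypothetical; passing `flip_prob=0.0` corresponds to the SVHN setting without flipping.

```python
import random


def random_crop_and_flip(image, crop=32, flip_prob=0.5, rng=random):
    """Randomly crop a crop x crop patch and optionally mirror it.

    image : list of rows, each row a list of pixels (e.g. 36x36 after resizing)
    rng   : the `random` module or a `random.Random` instance
    """
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop)    # inclusive bounds
    left = rng.randint(0, width - crop)
    patch = [row[left:left + crop] for row in image[top:top + crop]]
    if rng.random() < flip_prob:
        patch = [row[::-1] for row in patch]  # horizontal flip
    return patch


if __name__ == "__main__":
    # Each "pixel" stores its own coordinates so the crop is easy to inspect.
    image = [[(r, c) for c in range(36)] for r in range(36)]
    patch = random_crop_and_flip(image, rng=random.Random(0))
    print(len(patch), len(patch[0]), patch[0][0])
```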
4.1.3 ImageNet
ImageNet is a highly challenging image classification benchmark which consists of millions of high-resolution images over 1,000 classes. Starting from ResNet [15], deep models with skip connections have advanced the state of the art on this dataset [53, 18, 50, 16]. We adopt ResNet18/50 [15] and SENet50 [16], which include numerous skip connections, as representative architectures to validate our method. Following previous works [37], we randomly crop a patch whose size is uniformly distributed between 8% and 100% of the original image size, with its aspect ratio drawn uniformly from a fixed range. Then we resize the cropped patch to 224x224 and feed it into the network. As a common practice, random horizontal flipping is also applied.
4.2 Training Protocol
We use the cross-entropy loss as the classification objective, as in previous works. Table 2 in the appendix summarizes the batch size, maximal number of epochs, and learning rate policy used in our experiments. We start training with an initial learning rate (shown in the 5th column of Table 2) and anneal it by some multipliers at certain epochs (specified in the 6th column). The standard stochastic gradient descent optimizer with a momentum of 0.9 and a weight decay of 1e-4 is adopted in all experiments. All the hyperparameters are tuned on the validation set with reference networks in order to achieve their best performance.
For relatively small models and datasets, we initialize models with the so-called “MSRA” strategy [14] and train from scratch; otherwise we fine-tune from our trained reference models (more details can be found in Table 2 in the appendix). To avoid abnormally large gradients and a possible drift of the classification loss, we project gradient tensors onto a Euclidean sphere with radius 10 if their norm exceeds the threshold of 10. This technique is also known as “clip gradients” and has been widely adopted in the community. When calculating adversarial perturbations for our AMM, we cap the number of attack iterations at a fixed maximum, which is sufficient to fool DNNs on 100% of the training samples in most cases. Hyperparameters in our regularizer are determined by cross validation on the held-out validation set, as described.
4.3 Exploratory Experiments on MNIST
As a popular dataset for evaluating the generalization performance of classifiers [39, 2, 3, 23, 49], MNIST is a reasonable choice for us to get started. We shall analyze the impacts of different configurations in our framework.
4.3.1 Effect of High Order Gradients and Others
Let us first investigate the effect of introducing high-order gradients, which serve as an essential component in our framework. They are triggered when backpropagating gradients through the perturbation term in our regularizer, and are usually difficult to formalize and compute for DNNs. We invoke the automatic differentiation mechanism in PyTorch [31] to achieve this. One can also build inverse networks to mimic the backward process of DNNs, as in [52].
Model | Error (%)
Reference | 1.79 ± 0.06
w/o high order gradients | 1.46 ± 0.04
w/ high order gradients | 0.90 ± 0.03
We try masking the gradient flow through the input gradient by treating its entries as constants, as done by Elsayed et al. [11]. In general, they expect to enlarge the margin by penalizing the norm of a linear perturbation if it goes below a threshold. Such an approximation may lead to conceptually easier implementations, but it also deviates from the gradient direction that pursues a large margin. With or without the high-order gradients, we obtain different MLP models using our AMM. They are compared in Table II and Fig. 3, along with the “Reference” that indicates the baseline MLPs trained without the regularizer (i.e., no AMM). Means and standard deviations of the error rates calculated from all five runs are shown.
Though both variants achieve lower error rates than the “Reference”, models trained with the gradient masked (i.e., without high-order gradients) perform clearly worse (1.46%, pink in Fig. 3) than those trained with full gradients (0.90%, yellow in Fig. 3). Apart from the high-order gradients, Elsayed et al.’s method [11] also misses several other components, which may further hinder it from achieving performance comparable with ours. For empirical validation, we implement the method following its main technical insights, adopting the single-step setting and a threshold-based shrinkage function, and summarize its results in Table II and Fig. 3 for comparison. Its prediction error is even higher than that of our method without high-order gradients.
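To make the difference between full and masked gradients concrete, the following minimal sketch uses a linear binary classifier f(x) = w·x + b, for which the distance to the decision boundary is |f(x)| / ||w|| in closed form. This is an illustrative toy case under that assumption, not the AMM regularizer itself, and all function names are hypothetical. Masking treats the norm of the input gradient (here ||w||) as a constant, so the term that would flow back through it is dropped.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def margin(w, b, x):
    """Exact distance from x to the hyperplane w.x + b = 0."""
    return abs(dot(w, x) + b) / math.sqrt(dot(w, w))

def full_grad(w, b, x):
    """Gradient of the margin w.r.t. w, differentiating through ||w|| as well."""
    f = dot(w, x) + b
    sign = 1.0 if f >= 0 else -1.0
    norm = math.sqrt(dot(w, w))
    return [sign * xi / norm - abs(f) * wi / norm ** 3 for wi, xi in zip(w, x)]

def masked_grad(w, b, x):
    """Same gradient with ||w|| treated as a constant (gradient flow masked)."""
    f = dot(w, x) + b
    sign = 1.0 if f >= 0 else -1.0
    norm = math.sqrt(dot(w, w))
    return [sign * xi / norm for xi in x]

def numeric_grad(w, b, x, eps=1e-6):
    """Central finite differences of the margin, used as a reference."""
    grad = []
    for i in range(len(w)):
        hi = list(w)
        hi[i] += eps
        lo = list(w)
        lo[i] -= eps
        grad.append((margin(hi, b, x) - margin(lo, b, x)) / (2 * eps))
    return grad

if __name__ == "__main__":
    w, b, x = [3.0, 4.0], 0.0, [1.0, 2.0]
    print("full   :", full_grad(w, b, x))
    print("masked :", masked_grad(w, b, x))
    print("numeric:", numeric_grad(w, b, x))
```

With b = 0, rescaling w does not change the margin, and the full gradient is accordingly orthogonal to w. The masked gradient keeps a component along w, which only inflates the weight norm without moving the decision boundary.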
4.3.2 Training with Fewer Samples
Our AMM enhances the generalization ability and robustness of DNNs in different aspects. In this section, we consider training DNN models with fewer samples. Specifically, we randomly sample subsets of images from the MNIST training set and train MLP models on them. Once sampled, these subsets are fixed for all training procedures. We adopt the MIN+EXP setting and the same hyperparameter values as in our previous experiments.
Fig. 4 illustrates how the test error rates vary with the number of training samples. As in previous experiments, we perform five runs for each method, and the shaded areas indicate the standard deviations. Clearly, models with our full-gradient AMM achieve consistently better performance than the “Reference” models and other competitors. With only 20k training images, our method achieves a test error that is even slightly lower than that of the vanilla models trained with all 60k training images.
4.3.3 Training with Noisy Labels
In this experiment, we train models with possibly noisy labels to simulate the unreliable human annotations found in many real-world applications. We use all 60k MNIST training images, but for a portion of them we substitute random integers in [0, 9] for their “ground-truth” labels. We construct 12 training partitions, each of which consists of a set of images with the original labels and a set of images with random labels.
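The label corruption described above can be sketched as follows. The function name `corrupt_labels`, its parameters, and the stand-in label list are hypothetical, and the number of corrupted images per partition is left as an argument rather than assumed.

```python
import random

def corrupt_labels(labels, num_noisy, num_classes=10, seed=0):
    """Return a copy of `labels` in which `num_noisy` randomly chosen entries
    are replaced by integers drawn uniformly from [0, num_classes - 1],
    together with the sorted indices of the re-drawn entries."""
    if not 0 <= num_noisy <= len(labels):
        raise ValueError("num_noisy must lie between 0 and len(labels)")
    rng = random.Random(seed)  # a fixed seed keeps the partition reproducible
    noisy = list(labels)
    chosen = rng.sample(range(len(labels)), num_noisy)
    for i in chosen:
        # The random label may coincide with the original one by chance,
        # which is why the labels are only "possibly" noisy.
        noisy[i] = rng.randrange(num_classes)
    return noisy, sorted(chosen)

if __name__ == "__main__":
    clean = [i % 10 for i in range(1000)]  # stand-in for the MNIST labels
    noisy, chosen = corrupt_labels(clean, num_noisy=300)
    changed = sum(a != b for a, b in zip(clean, noisy))
    print(f"{len(chosen)} labels re-drawn, {changed} actually changed")
```

Because the replacement is drawn from all ten classes, roughly one in ten re-drawn labels is expected to remain correct.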
The “Reference” and our AMM models are trained on these partitions, and their error rates are shown in Fig. 5. Without specific regularization, the error rates of the reference models increase drastically when a portion of the labels is corrupted. For instance, when training on a set containing 55k images with random labels, models with our adversarial regularization still achieve an average test error below 10%, whereas that of the reference models rises above 40%, which is obviously too high for a 10-class classification problem. Our implementation of Elsayed et al.’s method also achieves promising performance on the test set, though it is consistently inferior to ours.
4.4 Image Classification Experiments
In this section, we verify the effectiveness of our method with benchmark DNN architectures on different image classification datasets, including MNIST, CIFAR-10/100, SVHN and ImageNet. It is compared with a variety of state-of-the-art margin-inspired methods. As in the experiments of Section 4.3, we adopt the MIN+EXP regularizer during training for all experiments. For evaluation, we report both the error rates and the margins (estimated using the DeepFool perturbation) on the test set.
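For intuition about the margin estimate, recall that for an affine multi-class classifier the minimal L2 perturbation has a closed form, which is what DeepFool computes in a single step; for DNNs, DeepFool iterates this step on a local linearization. The sketch below covers only the affine case, and the helper name `affine_margin` is hypothetical.

```python
import math

def affine_margin(W, b, x):
    """Predicted class and L2 distance from x to the nearest decision boundary
    of the affine classifier with scores W[k].x + b[k]."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) + bk for w, bk in zip(W, b)]
    c = max(range(len(scores)), key=scores.__getitem__)  # predicted class
    best = math.inf
    for k in range(len(scores)):
        if k == c:
            continue
        diff_w = [wk - wc for wk, wc in zip(W[k], W[c])]
        norm = math.sqrt(sum(d * d for d in diff_w))
        if norm == 0.0:
            continue  # identical weight vectors: no boundary between k and c
        # Distance to the hyperplane on which classes k and c score equally.
        best = min(best, abs(scores[k] - scores[c]) / norm)
    return c, best

if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
    b = [0.0, 0.0, 0.0]
    print(affine_margin(W, b, [3.0, 1.0]))
```

Averaging this distance over the test set would give a margin figure of the kind reported in the tables, with the caveat that the numbers there come from the iterative DeepFool attack on nonlinear networks.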
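Table III. MNIST results.

| Method | Architecture | Error (%) | Margin | Augmentation |
| --- | --- | --- | --- | --- |
| Bayes by Backprop [3] | MLP (800) | 1.32 | - | - |
| DropConnect [43] | MLP (800) | 1.20 ± 0.03 | - | - |
| DLSVM [39] | MLP (512) | 0.87 | - | Gaussian |
| CRN [2] | LeNet | 0.73 | - | - |
| DropConnect [43] | LeNet | 0.63 ± 0.03 | - | - |
| DisturbLabel [49] | LeNet | 0.63 | - | - |
| L-Softmax [23] | LiuNet | 0.31 | - | - |
| EM-Softmax [45] | LiuNet | 0.27 (a) | - | - |
| Reference | MLP (800) | 1.79 ± 0.06 | 0.76 | - |
| Ours | MLP (800) | 0.90 ± 0.03 | 1.90 | - |
| Reference | LeNet | 0.87 ± 0.02 | 1.07 | - |
| Ours | LeNet | 0.56 ± 0.02 | 2.21 | - |
| Reference | LiuNet | 0.41 ± 0.02 | 1.60 | - |
| Ours | LiuNet | 0.33 ± 0.03 | 3.41 | - |

(a) An ensemble of 2 LiuNet models is used.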
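The relative error reductions and margin gains implied by the Reference and Ours rows of the MNIST table (Table III) can be recomputed with a short script. The numbers are copied from the table; the helper names are hypothetical.

```python
# (reference, ours) pairs copied from the MNIST table: error in % and margin.
results = {
    "MLP (800)": {"error": (1.79, 0.90), "margin": (0.76, 1.90)},
    "LeNet": {"error": (0.87, 0.56), "margin": (1.07, 2.21)},
    "LiuNet": {"error": (0.41, 0.33), "margin": (1.60, 3.41)},
}

def relative_error_reduction(reference, ours):
    """Relative reduction of the error rate, in percent."""
    return 100.0 * (reference - ours) / reference

def margin_gain(reference, ours):
    """Factor by which the estimated margin grows."""
    return ours / reference

for name, r in results.items():
    reduction = relative_error_reduction(*r["error"])
    gain = margin_gain(*r["margin"])
    print(f"{name:10s} error reduction {reduction:5.1f}%   margin x{gain:.2f}")
```

Truncated to integers, the reductions are 49%, 35% and 19% for MLP, LeNet and LiuNet, and the estimated margin more than doubles for each architecture.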
4.4.1 MNIST
The aforementioned MLP, LeNet and LiuNet architectures are used as reference models. Generally, DNNs have to be deterministic during the attack process so that we can find a reasonable approximation of the minimal distance to the fixed decision boundaries. However, for DNNs equipped with batch normalization layers, if we implement the DeepFool attack naïvely, the perturbation of a particular sample may depend on other samples in the same batch, since all of them share the same mean and variance in the batch normalization procedure. To bypass such dependency and achieve better efficiency, our implementation follows that described in [23], with one exception: we replace all batch normalization layers with group normalization [48] with group size 32. Empirically, we found that this difference has little (often negative) impact on the error rates of the reference networks.
We adopt the MIN+EXP setting and, for simplicity, use the hyperparameter values selected on MLPs for all three architectures, although potentially better hyperparameters may be obtained by running a grid search on each of them. The error rates of different methods are shown in Table III. For a fair comparison, we also list the architectures in the second column of Table III. The number in parentheses after MLP is the number of neurons in each of its two hidden layers. Our method outperforms competitive methods with the MLP and LeNet architectures, except for one case where the comparison is not completely fair: DLSVM [39] obtains 0.87% error (ours: 0.90%) using an MLP with additional Gaussian noise added to the input images during training, whereas we do not use any data augmentation techniques. For LiuNet, our method achieves an error rate (0.33%) comparable with that of L-Softmax [23] (0.31%). EM-Softmax [45] achieves the lowest error of 0.27% using an ensemble of 2 LiuNets, while our performance is measured on a single model. Moreover, our method significantly and consistently decreases the error rates of the reference models, with relative improvements of 49%, 35%, and 19% on MLP, LeNet, and LiuNet, respectively.

Table IV. CIFAR-10 results.

| Method | Architecture | Error (%) | Margin | Augmentation |
| --- | --- | --- | --- | --- |
| DropConnect [43] | LeNet | 18.7 | - | - |
| DisturbLabel [49] | LeNet | 14.48 | - | hflip & crop |
| DLSVM [39] | LeNet | 11.9 | - | hflip & jitter |
| CRN [2] | VGG-16 | 8.8 | - | hflip & crop |
| L-Softmax [23] | LiuNet | 5.92 | - | hflip & crop |
| EM-Softmax [45] | LiuNet | 4.98 (a) | - | hflip & crop |
| Reference | LeNet | 14.93 | 0.16 | hflip & crop |
| Ours | LeNet | 13.87 | 0.24 | hflip & crop |
| Reference | NIN | 10.39 | 0.21 | hflip & crop |
| Ours | NIN | 9.87 | 0.30 | hflip & crop |
| Reference | LiuNet | 6.25 | 0.15 | hflip & crop |
| Ours | LiuNet | 5.85 | 0.29 | hflip & crop |
| Reference | ResNet-20 | 8.20 | 0.10 | hflip & crop |
| Ours | ResNet-20 | 7.62 | 0.22 | hflip & crop |
| Reference | ResNet-56 | 5.96 | 0.16 | hflip & crop |
| Ours | ResNet-56 | 5.75 | 0.32 | hflip & crop |
| Reference | DenseNet-40 | 5.75 | 0.11 | hflip & crop |
| Ours | DenseNet-40 | 5.61 | 0.18 | hflip & crop |

(a) An ensemble of 2 LiuNet models is used.
4.4.2 CIFAR-10
For CIFAR-10, we evaluate our method with LeNet, NIN, LiuNet, ResNet-20/56, and DenseNet-40. The architectures of LeNet and NIN are directly copied from our previous work [52]. For LiuNet, we choose the CIFAR-10 LiuNet architecture described in [23] and replace all batch normalization layers with group normalization layers of group size 32, as in our MNIST experiments. For ResNets and DenseNets, we adopt the standard architectures described in previous works [15, 18] and simply freeze all batch normalization layers during both training and testing to break the dependency among samples within a batch described in Section 4.4.1. The hyperparameters are only lightly tuned on the hold-out validation set as described in Section 4.1, and final error rates are reported using models trained on the full training set of 50k images. Table IV summarizes the results of the CIFAR-10 experiments. For a fair comparison, we also show the data augmentation strategies in the last column of Table IV. The majority of methods use horizontal flip and random crop, while Tang et al. [39] use horizontal flip and color jitter, which may partially explain the surprisingly low error rate (11.9%) obtained with LeNet. In most test cases with the same architecture and data augmentation strategy, our regularizer produces lower error rates than all other competitive methods. The only exception is EM-Softmax with LiuNet, which achieves 4.98% error (ours: 5.85%). However, their result is obtained with an ensemble of 2 LiuNet models, while ours is measured on a single LiuNet model without any ensemble. Moreover, our method also provides significant absolute improvements over all six reference models with different architectures.
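The dependency among samples in a batch that motivates these changes can be reproduced in a few lines. The sketch below is a simplified, pure-Python illustration (no learnable scale or shift, and a single group for the per-sample variant); the function names are hypothetical and this is not the normalization code used in the experiments.

```python
import math

def batch_norm(batch, eps=1e-5):
    """Normalize each feature with the mean and variance of the current batch."""
    n, d = len(batch), len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in batch]
        mu = sum(col) / n
        var = sum((v - mu) ** 2 for v in col) / n
        for i in range(n):
            out[i][j] = (batch[i][j] - mu) / math.sqrt(var + eps)
    return out

def per_sample_norm(batch, eps=1e-5):
    """Group-norm-like variant: statistics come from each sample's own features."""
    out = []
    for row in batch:
        mu = sum(row) / len(row)
        var = sum((v - mu) ** 2 for v in row) / len(row)
        out.append([(v - mu) / math.sqrt(var + eps) for v in row])
    return out

def frozen_batch_norm(batch, mean, var, eps=1e-5):
    """Batch norm with fixed (frozen) statistics instead of batch statistics."""
    return [[(v - m) / math.sqrt(s + eps) for v, m, s in zip(row, mean, var)]
            for row in batch]

if __name__ == "__main__":
    x0 = [1.0, 2.0, 3.0, 4.0]
    batch_a = [x0, [0.0, 0.0, 0.0, 0.0], [2.0, 1.0, 0.0, 3.0]]
    batch_b = [x0, [5.0, -1.0, 2.0, 8.0], [-3.0, 4.0, 1.0, 0.0]]
    # The same sample x0 gets different outputs under batch statistics ...
    print(batch_norm(batch_a)[0] == batch_norm(batch_b)[0])
    # ... but identical outputs with per-sample or frozen statistics.
    print(per_sample_norm(batch_a)[0] == per_sample_norm(batch_b)[0])
    mean, var = [0.0] * 4, [1.0] * 4
    print(frozen_batch_norm(batch_a, mean, var)[0]
          == frozen_batch_norm(batch_b, mean, var)[0])
```

Only in the first case does the output for a sample change with its batch-mates, which is what would make a perturbation computed through the layer depend on the rest of the batch.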
Table V. CIFAR-100 results.

| Method | Architecture | Error (%) | Margin | Augmentation |
| --- | --- | --- | --- | --- |
| DisturbLabel [49] | LeNet | 41.84 | - | hflip & crop |
| CRN [2] | VGG-16 | 34.4 | - | - |
| L-Softmax [23] | LiuNet | 29.53 | - | - |
| L-Softmax [23] | LiuNet | 28.04 (a) | - | hflip & crop |
| EM-Softmax [45] | LiuNet | 24.04 (b) | - | hflip & crop |
| Reference | LeNet | 43.30 | 0.11 | hflip & crop |
| Ours | LeNet | 41.68 | 0.23 | hflip & crop |
| Reference | NIN | 37.75 | 0.12 | hflip & crop |
| Ours | NIN | 34.49 | 0.21 | hflip & crop |
| Reference | LiuNet | 26.87 | 0.10 | hflip & crop |
| Ours | LiuNet | 25.91 | 0.18 | hflip & crop |
| Reference | ResNet-20 | 33.20 | 0.05 | hflip & crop |
| Ours | ResNet-20 | 32.96 | 0.16 | hflip & crop |
| Reference | ResNet-56 | 26.70 | 0.06 | hflip & crop |
| Ours | ResNet-56 | 26.54 | 0.14 | hflip & crop |
| Reference | DenseNet-40 | 25.93 | 0.04 | hflip & crop |
| Ours | DenseNet-40 | 25.62 | 0.11 | hflip & crop |

(a) Result copied from [45].
(b) An ensemble of 2 LiuNet models is used.
4.4.3 CIFAR-100
As in the CIFAR-10 experiments, we evaluate our method on LeNet, NIN, LiuNet, ResNet-20/56, and DenseNet-40 for CIFAR-100. LeNets, NINs, ResNets and DenseNets are kept the same as in the CIFAR-10 experiments, except that the output width of the last fully-connected layer is increased from 10 to 100 for 100-way classification. For LiuNet, we adopt the CIFAR-100 LiuNet architecture described in [23], which is slightly larger than the CIFAR-10 LiuNet, for a fair comparison. The hyperparameter tuning and final evaluation protocols are the same as in all previous experiments in this paper. Results are summarized in Table V. As the original L-Softmax paper [23] only provides results without data augmentation on CIFAR-100, we copy the result with augmentation from [45], as denoted by superscript “a” in the table. Our method outperforms DisturbLabel [49] and L-Softmax [23] under the same architectures. Again, EM-Softmax [45] achieves a lower error rate (24.04%) than ours (25.91%) using model ensembling, while we only measure single-model performance. For all six considered architectures, our method provides a performance gain over the reference model.
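Table VI. SVHN results.

| Method | Architecture | Error (%) | Margin | Augmentation |
| --- | --- | --- | --- | --- |
| DisturbLabel [49] | LeNet | 3.27 | - | crop |
| Reference | LeNet | 3.32 | 0.45 | crop |
| Ours | LeNet | 3.12 | 0.99 | crop |
| Reference | NIN | 2.67 | 0.48 | crop |
| Ours | NIN | 2.46 | 1.14 | crop |
| Reference | LiuNet | 1.79 | 0.50 | crop |
| Ours | LiuNet | 1.61 | 1.24 | crop |
| Reference | ResNet-20 | 1.91 | 0.40 | crop |
| Ours | ResNet-20 | 1.82 | 1.17 | crop |
| Reference | ResNet-56 | 1.72 | 0.54 | crop |
| Ours | ResNet-56 | 1.63 | 1.04 | crop |
| Reference | DenseNet-40 | 1.79 | 0.46 | crop |
| Ours | DenseNet-40 | 1.66 | 0.99 | crop |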
4.4.4 SVHN
For SVHN, we again validate our method on the same six architectures as in the CIFAR-10 experiments. The protocol for hyperparameter tuning and final evaluation is also the same. Since SVHN is a digit classification task, where the semantics of a sample are generally not preserved under horizontal flipping, we do not use flipping for data augmentation on this dataset. Table VI summarizes the results of our SVHN experiments. Many of the competitive methods we consider do not report SVHN experiments, hence their results are absent from the table. For LeNet, our method achieves a lower error (3.12%) than DisturbLabel (3.27%). Compared with the reference models, our method provides consistent performance improvements for all six network architectures.
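Table VII. ImageNet results.

| Method | Architecture | Margin | Error (%) |
| --- | --- | --- | --- |
| Reference | ResNet-18 | 0.70 | 30.23 |
| Ours | ResNet-18 | 1.33 | 29.94 |
| Reference | ResNet-50 | 0.82 | 23.85 |
| Ours | ResNet-50 | 1.74 | 23.54 |
| Reference | SENet-50 | 1.19 | 22.37 |
| Ours | SENet-50 | 1.92 | 22.19 |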
4.4.5 ImageNet
ImageNet is a large-scale image classification benchmark dataset containing millions of high-resolution images in 1000 classes. We test our method on it using three DNN architectures: ResNet-18/50 [15] and SENet-50 [16]. For efficiency, we collect well-trained models from the community (https://github.com/Cadene/pretrained-models.pytorch) and fine-tune them with our regularizer. Results are summarized in Table VII. Our method provides consistent accuracy gains for all considered architectures, validating the effectiveness of our regularizer on large-scale datasets with modern DNN architectures.
4.5 Computational Cost
Since our method invokes iterative updates to approximate the classification margin and utilizes high-order gradients during optimization, a higher computational cost may be inevitable. In practice, our method usually requires 6 to 14 times more wall-clock time per epoch and 2 to 4 times more GPU memory than natural cross-entropy training, depending on the network architecture. Note that far fewer epochs are required when fine-tuning a pre-trained model; we therefore advocate a two-step training pipeline, as introduced in Section 4.4.5, for large-scale problems.
5 Conclusion
In this paper, we study the generalization ability of DNNs and aim to improve it by investigating the classification margin in the input data space and deriving a novel and principled regularizer to enlarge it. We exploit the DeepFool adversarial perturbation as a proxy for the margin and incorporate the norm-based perturbation into the regularizer. The proposed regularization can be jointly optimized with the original classification objective, just like training a recursive network. By developing proper aggregation functions and shrinkage functions, we improve the classification margin in a direct way. Extensive experiments on MNIST, CIFAR-10/100, SVHN and ImageNet with modern DNN architectures demonstrate the effectiveness of our method.
Acknowledgments
This work is funded by the NSFC (Grant No. 61876095), and the Beijing Academy of Artificial Intelligence (BAAI).
References
[1] (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
[2] (2015) Contractive rectifier networks for nonlinear maximum margin classification. In ICCV, pp. 2515–2523.
[3] (2015) Weight uncertainty in neural network. In ICML, pp. 1613–1622.
[4] (2017) Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy (SP).
[5] (2017) Parseval networks: improving robustness to adversarial examples. In ICML.
[6] (1995) Support-vector networks. Machine Learning 20 (3), pp. 273–297.
[7] (2009) ImageNet: a large-scale hierarchical image database. In CVPR, pp. 248–255.
[8] (2018) ArcFace: additive angular margin loss for deep face recognition. arXiv preprint arXiv:1801.07698.
[9] (2018) Max-margin adversarial (MMA) training: direct input space margin maximization through adversarial training. arXiv preprint arXiv:1812.02637.
[10] (2018) Boosting adversarial attacks with momentum. In CVPR.
[11] (2018) Large margin deep networks for classification. arXiv preprint arXiv:1803.05598.
[12] (2015) Explaining and harnessing adversarial examples. In ICLR.
[13] (2002) Gene selection for cancer classification using support vector machines. Machine Learning 46 (1-3), pp. 389–422.
[14] (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In ICCV.
[15] (2016) Deep residual learning for image recognition. In CVPR.
[16] (2018) Squeeze-and-excitation networks. In CVPR.
[17] (2014) Discriminative deep metric learning for face verification in the wild. In CVPR, pp. 1875–1882.
[18] (2017) Densely connected convolutional networks. In CVPR, pp. 4700–4708.
[19] (2009) Learning multiple layers of features from tiny images. Technical report, Citeseer.
[20] (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
[21] (2014) Network in network. In ICLR.
[22] (2017) SphereFace: deep hypersphere embedding for face recognition. In CVPR, Vol. 1, pp. 1.
[23] (2016) Large-margin softmax loss for convolutional neural networks. In ICML, pp. 507–516.
[24] (2017) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. arXiv preprint arXiv:1704.03976.
[25] (2016) DeepFool: a simple and accurate method to fool deep neural networks. In CVPR.
[26] (2011) Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, Vol. 2011, pp. 5.
[27] (2015) Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. In CVPR.
[28] (1962) On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata.
[29] (1997) Training support vector machines: an application to face detection. In CVPR, pp. 130–136.
[30] (2018) Towards the science of security and privacy in machine learning. In IEEE European Symposium on Security and Privacy.
[31] (2017) Automatic differentiation in PyTorch. In NIPS Workshop.
[32] (2000) Large margin DAGs for multiclass classification. In NIPS, pp. 547–553.
[33] (2015) FaceNet: a unified embedding for face recognition and clustering. In CVPR, pp. 815–823.
[34] (2017) Robust large margin deep neural networks. IEEE Transactions on Signal Processing 65 (16), pp. 4265–4280.
[35] (2018) The implicit bias of gradient descent on separable data. JMLR 19 (Nov), pp. 1–57.
[36] (2016) On the depth of deep neural networks: a theoretical view. In AAAI, pp. 2066–2072.
[37] (2015) Going deeper with convolutions. In CVPR, pp. 1–9.
[38] (2014) Intriguing properties of neural networks. In ICLR.
[39] (2013) Deep learning using linear support vector machines. In ICML Workshop.
[40] (2001) Support vector machine active learning with applications to text classification. JMLR 2 (Nov), pp. 45–66.
[41] (1996) Weak convergence and empirical processes: with applications to statistics. Springer Science & Business Media.
[42] (1999) An overview of statistical learning theory. IEEE Transactions on Neural Networks 10 (5), pp. 988–999.
[43] (2013) Regularization of neural networks using DropConnect. In ICML, pp. 1058–1066.
[44] (2018) CosFace: large margin cosine loss for deep face recognition. In CVPR, pp. 5265–5274.
[45] (2018) Ensemble soft-margin softmax loss for image classification. In IJCAI.
[46] (2018) On the margin theory of feedforward neural networks. arXiv preprint arXiv:1810.05369.
[47] (2009) Distance metric learning for large margin nearest neighbor classification. JMLR 10 (Feb), pp. 207–244.
[48] (2018) Group normalization. In ECCV.
[49] (2016) DisturbLabel: regularizing CNN on the loss layer. In CVPR, pp. 4753–4762.
[50] (2017) Aggregated residual transformations for deep neural networks. In CVPR, pp. 5987–5995.
[51] (2012) Robustness and generalization. Machine Learning 86 (3), pp. 391–423.
[52] (2018) Deep defense: training DNNs with improved adversarial robustness. In NeurIPS.
[53] (2017) PolyNet: a pursuit of structural diversity in very deep networks. In CVPR, pp. 3900–3908.