1 Introduction
The large margin principle has played a key role in the course of machine learning history, producing remarkable theoretical and empirical results for classification (Vapnik, 1995) and regression problems (Drucker et al., 1997). However, exact large margin algorithms are only suitable for shallow models. In fact, for deep models, computation of the margin itself becomes intractable. This is in contrast to classic setups such as kernel SVMs, where the margin has an analytical form (the norm of the parameters). Desirable benefits of large margin classifiers include better generalization properties and robustness to input perturbations (Cortes & Vapnik, 1995; Bousquet & Elisseeff, 2002).
To overcome the limitations of classical margin approaches, we design a novel loss function based on a first-order approximation of the margin. This loss function is applicable to any network architecture (e.g., arbitrary depth, activation function, use of convolutions, residual networks), and complements existing general-purpose regularization techniques such as weight decay, dropout and batch normalization.
We illustrate the basic idea of a large margin classifier within a toy setup in Figure 1. For demonstration purposes, consider a binary classification task and assume there is a model that can perfectly separate the data. Suppose the model is parameterized by a vector $w$, and the model $g(x; w)$ maps the input vector $x$ to a real number, as shown in Figure 1(a), where the yellow region corresponds to positive values of $g(x; w)$ and the blue region to negative values; the red and blue dots represent training points from the two classes. Such a $g$ is sometimes called a discriminant function. For a fixed $w$, $g(x; w)$ partitions the input space into two sets of regions, depending on whether $g(x; w)$ is positive or negative at those points. We refer to the boundary of these sets as the decision boundary, which can be characterized by $\{x \mid g(x; w) = 0\}$ when $g$ is a continuous function. For a fixed $w$, consider the distance of each training point to the decision boundary. We call the smallest non-negative such distance the margin. A large margin classifier seeks model parameters $w$ that attain the largest margin. Figure 1(b) shows the decision boundaries attained by our new loss (right), and another solution attained by the standard cross-entropy loss (left). The yellow squares show regions where the large margin solution better captures the correct data distribution.
Margin may be defined based on the values of $g$ (i.e. the output space) or on the input space. Despite similar naming, the two are very different. Margin based on output space values is the conventional definition. In fact, output margin can be computed exactly even for deep networks (Sun et al., 2015). In contrast, the margin in the input space is computationally intractable for deep models. Despite that, the input margin is often of more practical interest. For example, a large margin in the input space implies immunity to input perturbations. Specifically, if a classifier attains a margin of $\gamma$, i.e. the decision boundary is at least $\gamma$ away from all the training points, then any perturbation of the input that is smaller than $\gamma$ will not be able to flip the predicted label. More formally, a model with a margin of $\gamma$ is robust to perturbations $x + \delta$ where $\operatorname{sign}(g(x)) = \operatorname{sign}(g(x + \delta))$, when $\|\delta\| < \gamma$. It has been shown that standard deep learning methods lack such robustness (Szegedy et al., 2013).
In this work, our main contribution is to derive a new loss for obtaining a large margin classifier for deep networks, where the margin can be based on any $l_p$ norm ($p \geq 1$), and the margin may be defined on any chosen set of layers of a network. We empirically evaluate our loss function on deep networks across different applications, datasets and model architectures. Specifically, we study the performance of these models on tasks of adversarial learning, generalization from limited training data, and learning from data with noisy labels. We show that the proposed loss function consistently outperforms baseline models trained with conventional losses, e.g. for adversarial perturbation, we outperform common baselines by up to on MNIST, on CIFAR-10 and on Imagenet.
2 Related Work
Prior work (Liu et al., 2016; Sun et al., 2015; Sokolic et al., 2016; Liang et al., 2017) has explored the benefits of encouraging large margin in the context of deep networks. Sun et al. (2015) state that cross-entropy loss does not have margin-maximization properties, and add terms to the cross-entropy loss to encourage large margin solutions. However, these terms encourage margins only at the output layer of a deep neural network. Other recent work (Soudry et al., 2017) proved that one can attain a max-margin solution by using cross-entropy loss with stochastic gradient descent (SGD) optimization. Yet this was only demonstrated for a linear architecture, making it less useful for deep, nonlinear networks.
Sokolic et al. (2016) introduced a regularizer based on the Jacobian of the loss function with respect to network layers, and demonstrated that their regularizer can lead to larger margin solutions. This formulation only offers $l_2$ distance metrics and therefore may not be robust to deviation of data based on other metrics (e.g., adversarial perturbations). In contrast, our work formulates a loss function that directly maximizes the margin at any layer, including input, hidden and output layers. Our formulation is general to margin definitions in different distance metrics (e.g. $l_1$, $l_2$, and $l_\infty$ norms). We provide empirical evidence of superior performance in generalization tasks with limited data and noisy labels, as well as robustness to adversarial perturbations. Finally, Hein & Andriushchenko (2017) propose a linearization similar to ours, but use a very different loss function for optimization. Their setup and optimization are specific to the adversarial robustness scenario, whereas we also consider generalization and noisy labels; their resulting loss function is computationally expensive and possibly difficult to scale to large problems such as Imagenet. Matyasko & Chau (2017) also derive a similar linearization and apply it to adversarial robustness with promising results on MNIST and CIFAR-10.
In real applications, training data is often not as copious as we would like, and collected data might have noisy labels. Generalization has been extensively studied as part of the semi-supervised and few-shot learning literature, e.g. (Vinyals et al., 2016; Rasmus et al., 2015). Specific techniques to handle noisy labels for deep networks have also been developed (Sukhbaatar et al., 2014; Reed et al., 2014). Our margin loss provides generalization benefits and robustness to noisy labels and is complementary to these works.
Deep networks are susceptible to adversarial attacks (Szegedy et al., 2013) and a number of attacks (Papernot et al., 2017; Sharif et al., 2016; Hosseini et al., 2017), and defenses (Kurakin et al., 2016; Madry et al., 2017; Guo et al., 2017; Athalye & Sutskever, 2017) have been developed. A natural benefit of large margins is robustness to adversarial attacks, as we show empirically in Sec. 4.
3 Large Margin Deep Networks
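Consider a classification problem with $n$ classes. Suppose we use a function $f_i : \mathcal{X} \to \mathbb{R}$, for $i = 1, \dots, n$, that generates a prediction score for classifying the input vector $x \in \mathcal{X}$ to class $i$. The predicted label is decided by the class with maximal score, i.e. $i^* = \arg\max_i f_i(x)$.
Define the decision boundary for each class pair $\{i, j\}$ as:
$$\mathcal{D}_{\{i,j\}} \triangleq \{x \mid f_i(x) = f_j(x)\} \tag{1}$$
Under this definition, the distance of a point $x$ to the decision boundary $\mathcal{D}_{\{i,j\}}$ is defined as the smallest displacement of the point that results in a score tie:
$$d_{f,x,\{i,j\}} \triangleq \min_{\delta} \|\delta\|_p \quad \text{s.t.} \quad f_i(x + \delta) = f_j(x + \delta) \tag{2}$$
Here $\|\cdot\|_p$ is any $l_p$ norm ($p \geq 1$). Using this distance, we can develop a large margin loss. We start with a training set consisting of pairs $(x_k, y_k)$, where the label $y_k \in \{1, \dots, n\}$. We penalize the displacement of each $x_k$ to satisfy the margin constraint for separating class $y_k$ from class $i$ ($i \neq y_k$). This implies using the following loss function:
$$\max\{0,\ \gamma + d_{f,x_k,\{i,y_k\}} \operatorname{sign}(f_i(x_k) - f_{y_k}(x_k))\} \tag{3}$$
where the $\operatorname{sign}(\cdot)$ adjusts the polarity of the distance. The intuition is that, if $x_k$ is already correctly classified, then we only want to ensure it has distance $\gamma$ from the decision boundary, and penalize proportional to the distance $d$ it falls short (so the penalty is $\max\{0, \gamma - d\}$). However, if it is misclassified, we also want to penalize the point for not being correctly classified. Hence, the penalty includes the distance $x_k$ needs to travel to reach the decision boundary as well as another $\gamma$ distance to travel on the correct side of the decision boundary to attain a $\gamma$ margin. Therefore, the penalty becomes $\max\{0, \gamma + d\}$. In a multiclass setting, we aggregate individual losses arising from each $i \neq y_k$ by some aggregation operator $\mathcal{A}$:
$$\mathcal{A}_{i \neq y_k} \max\{0,\ \gamma + d_{f,x_k,\{i,y_k\}} \operatorname{sign}(f_i(x_k) - f_{y_k}(x_k))\} \tag{4}$$
In this paper we use two aggregation operators, namely the max operator $\max$ and the sum operator $\sum$. In order to learn the $f_i$'s, we assume they are parameterized by a vector $w$ and should use the notation $f_i(x; w)$; for brevity we keep using the notation $f_i(x)$. The goal is to minimize the loss w.r.t. $w$:
$$w^* \triangleq \arg\min_{w} \sum_k \mathcal{A}_{i \neq y_k} \max\{0,\ \gamma + d_{f,x_k,\{i,y_k\}} \operatorname{sign}(f_i(x_k) - f_{y_k}(x_k))\} \tag{5}$$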
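The above formulation depends on $d$, whose exact computation from (2) is intractable when the $f_i$'s are nonlinear. Instead, we present an approximation to $d$ by linearizing $f_i$ w.r.t. $\delta$ around $\delta = 0$:
$$\tilde{d}_{f,x,\{i,j\}} \triangleq \min_{\delta} \|\delta\|_p \quad \text{s.t.} \quad f_i(x) + \langle \delta, \nabla_x f_i(x) \rangle = f_j(x) + \langle \delta, \nabla_x f_j(x) \rangle \tag{6}$$
This problem now has the following closed form solution (see supplementary for proof):
$$\tilde{d}_{f,x,\{i,j\}} = \frac{|f_i(x) - f_j(x)|}{\|\nabla_x f_i(x) - \nabla_x f_j(x)\|_q} \tag{7}$$
where $\|\cdot\|_q$ is the dual norm of $\|\cdot\|_p$. $l_q$ is the dual norm of $l_p$ when it satisfies $q \triangleq \frac{p}{p-1}$ (Boyd & Vandenberghe, 2004). For example if distances are measured w.r.t. $l_1$, $l_2$, or $l_\infty$ norm, the norm in (7) will respectively be $l_\infty$, $l_2$, or $l_1$ norm. Using the linear approximation, the loss function becomes:
$$\hat{w} \triangleq \arg\min_{w} \sum_k \mathcal{A}_{i \neq y_k} \max\left\{0,\ \gamma + \frac{|f_i(x_k) - f_{y_k}(x_k)|}{\|\nabla_x f_i(x_k) - \nabla_x f_{y_k}(x_k)\|_q} \operatorname{sign}(f_i(x_k) - f_{y_k}(x_k))\right\} \tag{8}$$
Since $|a| \operatorname{sign}(a) = a$, this further simplifies to the following problem:
$$\hat{w} \triangleq \arg\min_{w} \sum_k \mathcal{A}_{i \neq y_k} \max\left\{0,\ \gamma + \frac{f_i(x_k) - f_{y_k}(x_k)}{\|\nabla_x f_i(x_k) - \nabla_x f_{y_k}(x_k)\|_q}\right\} \tag{9}$$
In (Huang et al., 2015), (7) has been derived (independently of us) to facilitate adversarial training with different norms. In contrast, we develop a novel margin-based loss function that uses this distance metric at multiple hidden layers, and show benefits for a wide range of problems. In the supplementary material, we show that (7) coincides with an SVM for the special case of a linear classifier.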
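As a minimal sketch of the closed form in (7), that is, the absolute score difference divided by the dual norm of the gradient difference, the snippet below evaluates the linearized distance for a hypothetical two-class linear model. The weights, bias, input and helper names are illustrative assumptions and not part of the paper's code. For a linear model the gradient of each score is its weight vector, so the linearization is exact and the $l_2$ result can be checked against the usual point-to-hyperplane distance.

```python
import math

def dual_exponent(p):
    """Return q with 1/p + 1/q = 1 (q = inf for p = 1, q = 1 for p = inf)."""
    if p == 1:
        return math.inf
    if p == math.inf:
        return 1.0
    return p / (p - 1.0)

def lq_norm(v, q):
    """l_q norm of a list of floats, including q = inf."""
    if q == math.inf:
        return max(abs(t) for t in v)
    return sum(abs(t) ** q for t in v) ** (1.0 / q)

def linearized_distance(f_i, f_j, grad_i, grad_j, p):
    """Closed form (7): |f_i - f_j| / ||grad_i - grad_j||_q, with q dual to p."""
    diff = [a - b for a, b in zip(grad_i, grad_j)]
    return abs(f_i - f_j) / lq_norm(diff, dual_exponent(p))

# Hypothetical linear model: f_c(x) = <w_c, x> + b_c, so grad_x f_c(x) = w_c.
w = [[3.0, 4.0], [0.0, 0.0]]
b = [0.0, 5.0]
x = [1.0, 1.0]
scores = [sum(wc * xc for wc, xc in zip(w_c, x)) + b_c for w_c, b_c in zip(w, b)]

d_l2 = linearized_distance(scores[0], scores[1], w[0], w[1], p=2)
d_linf = linearized_distance(scores[0], scores[1], w[0], w[1], p=math.inf)
d_l1 = linearized_distance(scores[0], scores[1], w[0], w[1], p=1)
```

Here the decision boundary is the line $3x_1 + 4x_2 = 5$, and `d_l2` agrees with the Euclidean distance from $(1, 1)$ to that line. For a nonlinear network the gradients would come from automatic differentiation and the value is only a first-order estimate.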
3.1 Margin for Hidden Layers
The classic notion of margin is defined based on the distance of input samples from the decision boundary; in shallow models such as SVM, input/output association is the only way to define a margin. In deep networks, however, the output is shaped from the input by going through a number of transformations (layers). In fact, the activations at each intermediate layer could be interpreted as some intermediate representation of the data for the following part of the network. Thus, we can define the margin based on any intermediate representation and the ultimate decision boundary. We leverage this structure to enforce that the entire representation maintain a large margin with the decision boundary. The idea, then, is to simply replace the input $x$ in the margin formulation (9) with the intermediate representation of $x$. More precisely, let $h_l$ denote the output of the $l$'th layer ($h_0 = x$) and $\gamma_l$ be the margin enforced for its corresponding representation. Then the margin loss (9) can be adapted as below to incorporate intermediate margins (where the $\epsilon$ in the denominator is used to prevent numerical problems, and is set to a small value such as in practice):
$$\hat{w} \triangleq \arg\min_{w} \sum_{l,k} \mathcal{A}_{i \neq y_k} \max\left\{0,\ \gamma_l + \frac{f_i(x_k) - f_{y_k}(x_k)}{\epsilon + \|\nabla_{h_l} f_i(x_k) - \nabla_{h_l} f_{y_k}(x_k)\|_q}\right\} \tag{10}$$
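The per-sample term of the multi-layer loss can be sketched as follows, assuming the class scores and the per-layer gradients of each score have already been computed (in practice by automatic differentiation). The function name, argument layout, default `eps` and example numbers are hypothetical choices for illustration; only the formula itself (hinge on the margin plus the score difference divided by `eps` plus the dual-norm of the gradient difference, aggregated over wrong classes and summed over layers) follows the text.

```python
import math

def margin_loss_single(scores, layer_grads, label, gammas, q=2.0, eps=1e-6, aggregate="max"):
    """Loss of one sample, summed over layers.

    scores:      class scores f_i(x_k)
    layer_grads: layer_grads[l][i] is the gradient of f_i w.r.t. activations h_l
    label:       true class y_k
    gammas:      gammas[l] is the margin for layer l
    q:           exponent of the dual norm (math.inf allowed)
    """
    total = 0.0
    for grads, gamma in zip(layer_grads, gammas):
        terms = []
        for i, f_i in enumerate(scores):
            if i == label:
                continue  # only classes i != y_k contribute
            diff = [a - b for a, b in zip(grads[i], grads[label])]
            if q == math.inf:
                norm = max(abs(t) for t in diff)
            else:
                norm = sum(abs(t) ** q for t in diff) ** (1.0 / q)
            # Signed linearized distance plus margin, clipped at zero.
            terms.append(max(0.0, gamma + (f_i - scores[label]) / (eps + norm)))
        total += max(terms) if aggregate == "max" else sum(terms)
    return total

# Illustrative values: three classes, margin applied at the input layer only.
scores = [2.0, 1.0, -1.0]
input_grads = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
loss_correct = margin_loss_single(scores, [input_grads], label=0, gammas=[1.0])
loss_wrong = margin_loss_single(scores, [input_grads], label=1, gammas=[1.0], aggregate="sum")
```

With true label 0 the sample is correctly classified but sits closer than the margin to the class-1 boundary, so a small penalty remains; with true label 1 the sample is misclassified and the penalty exceeds the margin, matching the two cases described for (3).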
4 Experiments
Here we provide empirical results using formulation (10) on a number of tasks and datasets. We consider the following datasets and models: a deep convolutional network for MNIST (LeCun et al., 1998), a deep residual convolutional network for CIFAR-10 (Zagoruyko & Komodakis, 2016) and an Imagenet model with the Inception v3 architecture (Szegedy et al., 2016). Details of the architectures and hyperparameter settings are provided in the supplementary material. Our code was written in Tensorflow (Abadi et al., 2016). The tasks we consider are: training with noisy labels, training with limited data, and defense against adversarial perturbations. In all these cases, we expect that the presence of a large margin provides robustness and improves test accuracies. As shown below, this is indeed the case across all datasets and scenarios considered.
4.1 Optimization of Parameters
Our loss function (10) differs from the cross-entropy loss due to the presence of gradients in the loss itself. We compute these gradients for each class $i \neq y_k$ ($y_k$ is the true label corresponding to sample $x_k$). To reduce computational cost, we choose a subset of the total number of classes. We pick these classes by choosing the $i \neq y_k$ that have the highest value from the forward propagation step. For MNIST and CIFAR-10, we used all (other) classes. For Imagenet we used only class (increasing the number of classes increased computational cost without helping performance). The backpropagation step for parameter updates requires the computation of second-order mixed gradients. To further reduce computation cost to a manageable level, we use a first-order Taylor approximation to the gradient with respect to the weights. This approximation simply corresponds to treating the denominator ($\epsilon + \|\nabla_{h_l} f_i(x_k) - \nabla_{h_l} f_{y_k}(x_k)\|_q$) in (10) as a constant with respect to $w$ for backpropagation. The value of the denominator is recomputed at every forward propagation step. We compared performance with and without this approximation for MNIST and found minimal difference in accuracy, but significantly higher GPU memory requirement due to the computation of second-order mixed derivatives without the approximation (a derivative with respect to activations, followed by another with respect to weights). Using these optimizations, we found, for example, that training is around to more expensive in wall-clock time for the margin model compared to cross-entropy, measured on the same NVIDIA P100 GPU (but note that there is no additional cost at inference time). Finally, to improve stability when the denominator is small, we found it beneficial to clip the loss at some threshold. We use standard optimizers such as RMSProp (Tieleman & Hinton, 2012).
4.2 MNIST
We train a 4 hidden-layer model with 2 convolutional layers and 2 fully connected layers, with rectified linear unit (ReLU) activation functions, and a softmax output layer. The first baseline model uses a cross-entropy loss function, trained with stochastic gradient descent optimization with momentum and learning rate decay. A natural question is whether having a large margin loss defined at the network output, such as the standard hinge loss, could be sufficient to give good performance. Therefore, we trained a second baseline model using a hinge loss combined with a small weight of cross-entropy.
The large margin model has the same architecture as the baseline, but we use our new loss function in formulation (10). We considered margin models using an $l_\infty$, $l_1$ or $l_2$ norm on the distances, respectively. For each norm, we train a model with margin either only on the input layer, or on all hidden layers and the output layer. Thus there are six margin models in all (three norms, two layer configurations each). For models with margin at all layers, the hyperparameter $\gamma_l$ is set to the same value for each layer (to reduce the number of hyperparameters). Furthermore, we observe that using a weighted sum of margin and cross-entropy facilitates training and speeds up convergence. (We emphasize that the performance achieved with this combination cannot be obtained by cross-entropy alone, as shown in the performance plots.) We tested all models with both stochastic gradient descent with momentum and RMSProp (Hinton et al.) optimizers, and chose the one that worked best on the validation set. In the case of cross-entropy and hinge loss we used momentum; in the case of margin models for MNIST, we used RMSProp with no momentum.
For all our models, we perform a hyperparameter search including with and without dropout, with and without weight decay and different values of $\gamma_l$ for the margin model (same value for all layers where margin is applied). We hold out samples of the training set as a validation set, and the remaining samples are used for training. The full evaluation set of samples is used for reporting all accuracies. Under this protocol, the cross-entropy and margin models trained on the sample training set achieve a test accuracy of and the hinge loss model achieves .
4.2.1 Noisy Labels
In this experiment, we choose, for each training sample, whether to flip its label to some other label chosen at random. E.g. an instance of “1” may be labeled as digit “6”. The percentage of such flipped labels varies from to in increments of . Once a label is flipped, that label is fixed throughout training. Fig. 2 (left) shows the performance of the best performing (all-layer margin and cross-entropy) of the algorithms, with test accuracy plotted against noise level. It is seen that the margin and models perform better than cross-entropy across the entire range of noise levels, while the margin model is slightly worse than cross-entropy. In particular, the margin model achieves an evaluation accuracy of at label noise, compared to for cross-entropy. The input-only margin models were outperformed by the all-layer margin models and are not shown in Fig. 2. We find that this holds true across all our tests. The performance of all methods is shown in the supplementary material.
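The label corruption described above could be implemented along the following lines. This is only a sketch under stated assumptions: each label is flipped independently with probability `noise_fraction`, the replacement is drawn uniformly from the other classes, and a fixed seed makes the corrupted labels identical for every model and every epoch. The function name and parameters are hypothetical.

```python
import random

def flip_labels(labels, num_classes, noise_fraction, seed=0):
    """Return a copy of `labels` in which each entry is, with probability
    `noise_fraction`, replaced by a different label drawn uniformly at random."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < noise_fraction:
            # Draw from the num_classes - 1 labels other than y.
            other = rng.randrange(num_classes - 1)
            noisy.append(other if other < y else other + 1)
        else:
            noisy.append(y)
    return noisy

# Computed once before training, so a flipped label stays fixed throughout.
clean = [0, 1, 2, 3, 4] * 20
noisy_half = flip_labels(clean, num_classes=10, noise_fraction=0.5)
```

Because the corrupted list is generated once and reused, the noise is part of the dataset rather than resampled per step, which matches the statement that a flipped label is fixed throughout training.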
4.2.2 Generalization
In this experiment we consider models trained with significantly lower amounts of training data. This is a problem of practical importance, and we expect that a large margin model should have better generalization abilities. Specifically, we randomly remove some fraction of the training set, going down from of training samples to only , which is samples. In Fig. 2 (right), the performance of cross-entropy, hinge and margin (all layers) is shown. The test accuracy is plotted against the fraction of data used for training. We also show the generalization results of a Bayesian active learning approach presented in (Gal et al., 2017). The all-layer margin models outperform both cross-entropy and (Gal et al., 2017) over the entire range of testing, and the amount by which the margin models outperform increases as the dataset size decreases. The all-layer margin model outperforms cross-entropy by around in the smallest training set of samples. We use the same randomly drawn training set for all models.
4.2.3 Adversarial Perturbation
Beginning with (Goodfellow et al., 2014), a number of papers (Papernot et al., 2016; Kurakin et al., 2016; Moosavi-Dezfooli et al., 2016) have examined the presence of adversarial examples that can “fool” deep networks. These papers show that there exist small perturbations to images that can cause a deep network to misclassify the resulting perturbed examples. We use the Fast Gradient Sign Method (FGSM) and the iterative version (IFGSM) of perturbation introduced in (Goodfellow et al., 2014; Kurakin et al., 2016) to generate adversarial examples. (There are also other methods of generating adversarial perturbations, not considered here.) Details of FGSM and IFGSM are given in the supplementary.
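For orientation, the two attacks can be sketched on a toy model as below. This follows the standard definitions from the cited papers (a step of size $\epsilon$ along the sign of the input gradient of the loss, and an iterated version with per-step clipping to the $\epsilon$-ball), not the exact settings used in our experiments. The binary linear classifier, logistic loss, step size `alpha` and all numeric values are assumptions chosen so that the gradient can be written by hand; clipping to a valid pixel range is omitted.

```python
import math

def sign(t):
    return (t > 0) - (t < 0)

def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def loss_grad_x(w, b, x, y):
    """Gradient w.r.t. x of the logistic loss log(1 + exp(-y * score)), y in {-1, +1}."""
    coeff = -y / (1.0 + math.exp(y * score(w, b, x)))
    return [coeff * wi for wi in w]

def fgsm(w, b, x, y, eps):
    """Single step: move every coordinate by eps in the direction that increases the loss."""
    g = loss_grad_x(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]

def ifgsm(w, b, x, y, eps, alpha, steps):
    """Iterated steps of size alpha, clipped to the l_inf ball of radius eps around x."""
    x_adv = list(x)
    for _ in range(steps):
        g = loss_grad_x(w, b, x_adv, y)
        x_adv = [xa + alpha * sign(gi) for xa, gi in zip(x_adv, g)]
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv

# Toy example: the l_inf distance of x to the boundary is |score| / ||w||_1 = 1/3.
w, b, x, y = [1.0, -2.0], 0.0, [1.0, 0.0], 1
small = fgsm(w, b, x, y, eps=0.2)   # below the margin: label is not flipped
large = fgsm(w, b, x, y, eps=0.5)   # above the margin: label is flipped
iterated = ifgsm(w, b, x, y, eps=0.5, alpha=0.1, steps=10)
```

In this linear case the attack succeeds exactly when $\epsilon$ exceeds the $l_\infty$ margin of the point, which illustrates why a larger input margin translates into robustness against sign-gradient perturbations.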
For each method, we generate a set of perturbed adversarial examples using one network, and then measure the accuracy of the same (white-box) or another (black-box) network on these examples. Fig. 3 (left, middle) shows the performance of the models for IFGSM attacks (which are stronger than FGSM). FGSM performance is given in the supplementary. We plot test accuracy against different values of $\epsilon$ used to generate the adversarial examples. In both FGSM and IFGSM scenarios, all margin models significantly outperform cross-entropy, with the all-layer margin models outperforming the input-only margin, showing the benefit of margin at hidden layers. This is not surprising as the adversarial attacks are specifically defined in input space. Furthermore, since FGSM/IFGSM are defined in the $l_\infty$ norm, we see that the $l_\infty$ margin model performs the best among the three norms. In the supplementary, we also show the white-box performance of the method from (Madry et al., 2017) (we used values provided by the authors), which is an algorithm specifically designed for adversarial defenses against FGSM attacks. One of the margin models outperforms this method, and another is very competitive. For black-box, the attacker is a cross-entropy model. It is seen that the margin models are robust against black-box attacks, significantly outperforming cross-entropy. For example at , cross-entropy is at accuracy, while the best margin model is at .
Kurakin et al. (2016) suggested adversarial training as a defense against adversarial attacks. This approach augments training data with adversarial examples. However, they showed that adding FGSM examples in this manner often does not confer robustness to IFGSM attacks, and is also computationally costly. Our margin models provide a mechanism for robustness that is independent of the type of attack. Further, our method is complementary and can still be used with adversarial training. To demonstrate this, Fig. 3 (right) shows the improved performance of the margin model compared to the cross-entropy model for black-box attacks from a cross-entropy model, when the models are adversarially trained. While the gap between cross-entropy and margin models is reduced in this scenario, we continue to see greater performance from the margin model at higher values of $\epsilon$. Importantly, we saw no benefit for the generalization or noisy label tasks from adversarial training, thus showing that this type of data augmentation provides very specific robustness. In the supplementary, we also show the performance against input corrupted with varying levels of Gaussian noise, showing the benefit of margin for this type of perturbation as well.
4.3 CIFAR-10
Next, we test our models for the same tasks on the CIFAR-10 dataset (Krizhevsky & Hinton, 2009). We use the ResNet model proposed in Zagoruyko & Komodakis (2016), consisting of an input convolutional layer, blocks of residual convolutional layers where each block contains layers, for a total of convolutional layers. Similar to MNIST, we set aside of the training data for validation, leaving a total of training samples, and validation samples. We train margin models with multiple layers of margin, choosing evenly spaced layers (input layer, output layer and other convolutional layers in the middle) across the network. We perform a hyperparameter search across margin values. We also train with data augmentation (random image transformations such as cropping and contrast/hue changes). Hyperparameter details are provided in the supplementary material. With these settings, we achieve a baseline accuracy of around for the following models: cross-entropy, hinge, margin $l_\infty$, $l_1$, and $l_2$. (With a better CIFAR network, WRN-40-10 from Zagoruyko & Komodakis (2016), we were able to achieve accuracy on full data.)
4.3.1 Noisy Labels
Fig. 4 (left) shows the performance of the models under the same noisy label regime, with fractions of noise ranging from to . The margin and models consistently outperform cross-entropy by to across the range of noise levels.
4.3.2 Generalization
Fig. 4 (right) shows the performance of the CIFAR-10 models on the generalization task. We consistently see superior performance of the and margin models w.r.t. cross-entropy, especially as the amount of data is reduced. For example at and of the total data, the margin model outperforms the cross-entropy model by .
4.3.3 Adversarial Perturbations
Fig. 5 shows the performance of cross-entropy and margin models for IFGSM attacks, for both white-box and black-box scenarios. The and margin models perform well for both sets of attacks, giving a clear boost over cross-entropy. For , the model achieves an improvement over cross-entropy of about when defending against a cross-entropy attack. Another approach for robustness is in (Cisse et al., 2017), where the Lipschitz constant of network layers is kept small, thereby directly insulating the network from small perturbations. Our models trained with margin significantly outperform their reported results in Table 1 for CIFAR-10. For an SNR of (as computed in their paper), we achieve 82% accuracy compared to 69.1% by them (for non-adversarial training), an 18.7% relative improvement.
4.4 Imagenet
We tested our margin model against cross-entropy for a full-scale Imagenet model based on the Inception architecture (Szegedy et al., 2016), with data augmentation. Our margin model and cross-entropy achieved a top-1 validation precision of respectively, close to the reported in (Szegedy et al., 2016). We test the Imagenet models for white-box FGSM and IFGSM attacks, as well as for black-box attacks defending against cross-entropy model attacks. Results are shown in Fig. 6. We see that the margin model consistently outperforms cross-entropy for black-box and white-box FGSM and IFGSM attacks. For example at , we see that cross-entropy achieves a white-box FGSM accuracy of , whereas margin achieves white-box accuracy and black-box accuracy. Note that our FGSM accuracy numbers on the cross-entropy model are quite close to those achieved in (Kurakin et al., 2016) (Table 2, top row); also note that we use a wider range of $\epsilon$ in our experiments.
5 Discussion
We have presented a new loss function inspired by the theory of large margin that is amenable to deep network training. This new loss is flexible and can establish a large margin that can be defined on input, hidden or output layers, and using $l_\infty$, $l_1$, and $l_2$ distance definitions. Models trained with this loss perform well in a number of practical scenarios compared to baselines on standard datasets. The formulation is independent of network architecture and input domain and is complementary to other regularization techniques such as weight decay and dropout. Our method is computationally practical: for Imagenet, our training was about times more expensive than cross-entropy (per step). Finally, our empirical results show the benefit of margin at the hidden layers of a network.
Acknowledgments
We are grateful to Alan Mackey, Alexey Kurakin, Julius Adebayo, Nicolas Le Roux, Sergey Ioffe, Shankar Krishnan, all from Google Research, Aleksander Madry of MIT, and anonymous reviewers for discussions and helpful feedback on the manuscript.
References
 Abadi et al. (2016) Abadi, Martín, Barham, Paul, Chen, Jianmin, Chen, Zhifeng, Davis, Andy, Dean, Jeffrey, Devin, Matthieu, Ghemawat, Sanjay, Irving, Geoffrey, Isard, Michael, et al. Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, pp. 265–283, 2016.
 Athalye & Sutskever (2017) Athalye, Anish and Sutskever, Ilya. Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397, 2017.
 Bousquet & Elisseeff (2002) Bousquet, Olivier and Elisseeff, André. Stability and generalization. Journal of machine learning research, 2(Mar):499–526, 2002.
 Boyd & Vandenberghe (2004) Boyd, Stephen and Vandenberghe, Lieven. Convex optimization. 2004.
 Cisse et al. (2017) Cisse, Moustapha, Bojanowski, Piotr, Grave, Edouard, Dauphin, Yann, and Usunier, Nicolas. Parseval networks: Improving robustness to adversarial examples. In International Conference on Machine Learning, pp. 854–863, 2017.
 Cortes & Vapnik (1995) Cortes, Corinna and Vapnik, Vladimir. Support-vector networks. Machine learning, 20(3):273–297, 1995.
 Drucker et al. (1997) Drucker, Harris, Burges, Chris J. C., Kaufman, Linda, Smola, Alex, and Vapnik, Vladimir. Support vector regression machines. In NIPS, pp. 155–161. MIT Press, 1997.
 Gal et al. (2017) Gal, Yarin, Islam, Riashat, and Ghahramani, Zoubin. Deep bayesian active learning with image data. arXiv preprint arXiv:1703.02910, 2017.
 Goodfellow et al. (2014) Goodfellow, Ian J, Shlens, Jonathon, and Szegedy, Christian. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
 Guo et al. (2017) Guo, Chuan, Rana, Mayank, Cissé, Moustapha, and van der Maaten, Laurens. Countering adversarial images using input transformations. arXiv preprint arXiv:1711.00117, 2017.
 Hein & Andriushchenko (2017) Hein, Matthias and Andriushchenko, Maksym. Formal guarantees on the robustness of a classifier against adversarial manipulation. In Advances in Neural Information Processing Systems, pp. 2266–2276, 2017.
 Hinton et al. Hinton, Geoffrey, Srivastava, Nitish, and Swersky, Kevin. Neural networks for machine learning, lecture 6a: overview of mini-batch gradient descent.
 Hosseini et al. (2017) Hosseini, Hossein, Xiao, Baicen, and Poovendran, Radha. Google’s cloud vision api is not robust to noise. arXiv preprint arXiv:1704.05051, 2017.
 Huang et al. (2015) Huang, Ruitong, Xu, Bing, Schuurmans, Dale, and Szepesvári, Csaba. Learning with a strong adversary. arXiv preprint arXiv:1511.03034, 2015.
 Krizhevsky & Hinton (2009) Krizhevsky, Alex and Hinton, Geoffrey. Learning multiple layers of features from tiny images. 2009.
 Kurakin et al. (2016) Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial Machine Learning at Scale. ArXiv e-prints, November 2016.
 LeCun et al. (1998) LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Liang et al. (2017) Liang, Xuezhi, Wang, Xiaobo, Lei, Zhen, Liao, Shengcai, and Li, Stan Z. Soft-margin softmax for deep classification. In International Conference on Neural Information Processing, pp. 413–421. Springer, 2017.
 Liu et al. (2016) Liu, Weiyang, Wen, Yandong, Yu, Zhiding, and Yang, Meng. Large-margin softmax loss for convolutional neural networks. In ICML, pp. 507–516, 2016.
 Madry et al. (2017) Madry, Aleksander, Makelov, Aleksandar, Schmidt, Ludwig, Tsipras, Dimitris, and Vladu, Adrian. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
 Matyasko & Chau (2017) Matyasko, Alexander and Chau, Lap-Pui. Margin maximization for robust classification using deep learning. In Neural Networks (IJCNN), 2017 International Joint Conference on, pp. 300–307. IEEE, 2017.
 Moosavi-Dezfooli et al. (2016) Moosavi-Dezfooli, Seyed-Mohsen, Fawzi, Alhussein, Fawzi, Omar, and Frossard, Pascal. Universal adversarial perturbations. arXiv preprint arXiv:1610.08401, 2016.
 Papernot et al. (2016) Papernot, Nicolas, McDaniel, Patrick, Goodfellow, Ian, Jha, Somesh, Celik, Z Berkay, and Swami, Ananthram. Practical black-box attacks against deep learning systems using adversarial examples. arXiv preprint arXiv:1602.02697, 2016.
 Papernot et al. (2017) Papernot, Nicolas, McDaniel, Patrick, Goodfellow, Ian, Jha, Somesh, Celik, Z Berkay, and Swami, Ananthram. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pp. 506–519. ACM, 2017.
 Rasmus et al. (2015) Rasmus, Antti, Berglund, Mathias, Honkala, Mikko, Valpola, Harri, and Raiko, Tapani. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pp. 3546–3554, 2015.
 Reed et al. (2014) Reed, Scott, Lee, Honglak, Anguelov, Dragomir, Szegedy, Christian, Erhan, Dumitru, and Rabinovich, Andrew. Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596, 2014.
 Sharif et al. (2016) Sharif, Mahmood, Bhagavatula, Sruti, Bauer, Lujo, and Reiter, Michael K. Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 1528–1540. ACM, 2016.
 Sokolic et al. (2016) Sokolic, Jure, Giryes, Raja, Sapiro, Guillermo, and Rodrigues, Miguel R. D. Robust large margin deep neural networks. CoRR, abs/1605.08254, 2016. URL http://arxiv.org/abs/1605.08254.
 Soudry et al. (2017) Soudry, Daniel, Hoffer, Elad, and Srebro, Nathan. The implicit bias of gradient descent on separable data. arXiv preprint arXiv:1710.10345, 2017.
 Sukhbaatar et al. (2014) Sukhbaatar, Sainbayar, Bruna, Joan, Paluri, Manohar, Bourdev, Lubomir, and Fergus, Rob. Training convolutional networks with noisy labels. arXiv preprint arXiv:1406.2080, 2014.
 Sun et al. (2015) Sun, Shizhao, Chen, Wei, Wang, Liwei, and Liu, TieYan. Large margin deep neural networks: Theory and algorithms. CoRR, abs/1506.05232, 2015. URL http://arxiv.org/abs/1506.05232.
 Szegedy et al. (2013) Szegedy, Christian, Zaremba, Wojciech, Sutskever, Ilya, Bruna, Joan, Erhan, Dumitru, Goodfellow, Ian J., and Fergus, Rob. Intriguing properties of neural networks. CoRR, abs/1312.6199, 2013. URL http://arxiv.org/abs/1312.6199.

 Szegedy et al. (2016) Szegedy, Christian, Vanhoucke, Vincent, Ioffe, Sergey, Shlens, Jon, and Wojna, Zbigniew. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.
 Tieleman & Hinton (2012) Tieleman, Tijmen and Hinton, Geoffrey. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.

 Vapnik (1995) Vapnik, Vladimir N. The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc., New York, NY, USA, 1995. ISBN 0387945598.
 Vinyals et al. (2016) Vinyals, Oriol, Blundell, Charles, Lillicrap, Tim, Wierstra, Daan, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638, 2016.
 Zagoruyko & Komodakis (2016) Zagoruyko, Sergey and Komodakis, Nikos. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
Appendix A Derivation of Equation (7) in paper
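Consider the following optimization problem,
$$d \triangleq \min_{\delta} \|\delta\|_p \quad \text{s.t.} \quad a^T \delta = b. \tag{11}$$
We show that $d$ has the form $d = \frac{|b|}{\|a\|_*}$, where $\|\cdot\|_*$ is the dual norm of $\|\cdot\|_p$. For a given norm $\|\cdot\|_p$, the length of any vector $u$ w.r.t. the dual norm is defined as $\|u\|_* \triangleq \max_{\|v\|_p \le 1} u^T v$. Since $u^T v$ is linear in $v$, the maximizer is attained at the boundary of the feasible set, i.e. $\|v\|_p = 1$. Therefore, it follows that:
$$\|u\|_* = \max_{\|v\|_p = 1} u^T v = \max_{v} u^T \frac{v}{\|v\|_p} = \max_{v} \left| u^T \frac{v}{\|v\|_p} \right|. \tag{12}$$
In particular for $u = a$ and $v = \delta$:
$$\|a\|_* = \max_{\delta} \left| a^T \frac{\delta}{\|\delta\|_p} \right|. \tag{13}$$
So far we have just used properties of the dual norm. In order to prove our result, we start from the trivial identity (assuming $\|\delta\|_p \neq 0$, which is guaranteed when $b \neq 0$): $a^T \delta = \|\delta\|_p \left( a^T \frac{\delta}{\|\delta\|_p} \right)$. Consequently, $|a^T \delta| = \|\delta\|_p \left| a^T \frac{\delta}{\|\delta\|_p} \right|$. Applying the constraint $a^T \delta = b$, we obtain $|b| = \|\delta\|_p \left| a^T \frac{\delta}{\|\delta\|_p} \right|$. Thus, we can write,
$$\|\delta\|_p = \frac{|b|}{\left| a^T \frac{\delta}{\|\delta\|_p} \right|} \;\Longrightarrow\; \min_{\delta} \|\delta\|_p = \frac{|b|}{\max_{\delta} \left| a^T \frac{\delta}{\|\delta\|_p} \right|}. \tag{14}$$
The ratio $\left| a^T \frac{\delta}{\|\delta\|_p} \right|$ is invariant to rescaling of $\delta$, so restricting the maximization to the feasible set $a^T \delta = b$ does not change its value. We thus continue as below (using (13) in the last step):
$$d = \min_{\delta} \|\delta\|_p = \frac{|b|}{\max_{\delta} \left| a^T \frac{\delta}{\|\delta\|_p} \right|} = \frac{|b|}{\|a\|_*}. \tag{15}$$
It is well known from Hölder’s inequality that $\|\cdot\|_* = \|\cdot\|_q$, where $q = \frac{p}{p-1}$.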
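As a sanity check on the closed form, the following minimal Python sketch compares $|b| / \|a\|_q$ with the norm of an explicitly constructed feasible perturbation. It assumes the problem is to find the smallest $\ell_p$-norm vector $\delta$ satisfying a linear constraint $a^T \delta = b$ with $b \neq 0$; the symbol names and the helper functions `lp_norm`, `dual_exponent`, `distance` and `minimizer` are illustrative and not taken from the paper. Only $p \in \{1, 2, \infty\}$ are handled, since these cases have simple explicit minimizers.

```python
import math

def lp_norm(v, p):
    """l_p norm of a list of floats; p may be math.inf."""
    if p == math.inf:
        return max(abs(x) for x in v)
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

def dual_exponent(p):
    """Hoelder conjugate q satisfying 1/p + 1/q = 1."""
    if p == 1:
        return math.inf
    if p == math.inf:
        return 1.0
    return p / (p - 1.0)

def distance(a, b, p):
    """Closed form |b| / ||a||_q for min ||delta||_p s.t. a^T delta = b."""
    return abs(b) / lp_norm(a, dual_exponent(p))

def minimizer(a, b, p):
    """A feasible delta that attains the closed form (p in {1, 2, inf} only)."""
    if p == 2:
        # Move along a itself.
        s = sum(x * x for x in a)
        return [b * x / s for x in a]
    if p == math.inf:
        # Move every coordinate by the same magnitude, following sign(a).
        s = sum(abs(x) for x in a)
        return [b * math.copysign(1.0, x) / s for x in a]
    if p == 1:
        # Put all of the perturbation on the largest-magnitude coordinate of a.
        k = max(range(len(a)), key=lambda i: abs(a[i]))
        return [b / a[k] if i == k else 0.0 for i in range(len(a))]
    raise ValueError("this sketch only handles p in {1, 2, inf}")

a, b = [3.0, -4.0], 10.0
for p in (1, 2, math.inf):
    delta = minimizer(a, b, p)
    print(p, distance(a, b, p), lp_norm(delta, p))
```

For each $p$, the two printed values agree, and each constructed `delta` satisfies the constraint.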
Appendix B SVM as a Special Case
In the special case of a linear classifier, our large margin formulation coincides with an SVM. Consider a binary classification task, so that $f_i(x) \triangleq w_i^T x + b_i$, for $i = 1, 2$. Let $g(x) \triangleq f_1(x) - f_2(x) = w^T x + b$ where $w \triangleq w_1 - w_2$, $b \triangleq b_1 - b_2$. Recall from Eq. (7) (in the paper) that the distance to the decision boundary in our formulation is defined as:
$$d = \frac{|g(x)|}{\|\nabla_x g(x)\|_2} = \frac{|w^T x + b|}{\|w\|_2}. \tag{16}$$
In this (linear) case, there is a scaling redundancy; if the model parameter $(w, b)$ yields distance $d$, so does $(cw, cb)$ for any $c > 0$. For SVMs, we make the problem well-posed by assuming $|w^T x + b| \ge 1$ (the subset of training points that attain equality are called support vectors). Thus, denoting the evaluation of $d$ at $x = x_k$ by $d_k$, the inequality constraint implies that $d_k \ge \frac{1}{\|w\|_2}$ for any training point $x_k$. The margin is defined as the smallest such distance, $\gamma \triangleq \min_k d_k = \frac{1}{\|w\|_2}$. Maximizing $\gamma$ is therefore equivalent to minimizing $\|w\|_2$, the well-known result for SVMs.
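The scaling redundancy and the resulting margin can be illustrated with a small Python sketch. It assumes a linear discriminant $g(x) = w^T x + b$ and the Euclidean distance $|g(x)| / \|w\|_2$; the toy data and the helper names `linear_distance`, `canonical_rescale` and `margin` are made up for illustration.

```python
import math

def score(w, b, x):
    """Linear discriminant g(x) = w^T x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def linear_distance(w, b, x):
    """Euclidean distance from x to the hyperplane g(x) = 0."""
    return abs(score(w, b, x)) / math.sqrt(sum(wi * wi for wi in w))

def margin(w, b, points):
    """Smallest distance over the training points."""
    return min(linear_distance(w, b, x) for x in points)

def canonical_rescale(w, b, points):
    """Rescale (w, b) so that the smallest |g(x_k)| equals 1 (SVM convention)."""
    m = min(abs(score(w, b, x)) for x in points)
    return [wi / m for wi in w], b / m

points = [[3.0, 0.0], [2.0, 5.0], [-1.0, 1.0], [0.0, 0.0]]
w, b = [2.0, 0.0], -2.0

# Scaling (w, b) by any c > 0 leaves the margin unchanged.
print(margin(w, b, points), margin([3 * wi for wi in w], 3 * b, points))

# Under the canonical scaling, the margin equals 1 / ||w||_2.
w_c, b_c = canonical_rescale(w, b, points)
print(margin(w_c, b_c, points), 1.0 / math.sqrt(sum(wi * wi for wi in w_c)))
```

Because the margin is $1 / \|w\|_2$ only after the canonical rescaling, the SVM objective can minimize $\|w\|_2$ subject to the constraints instead of maximizing the distance directly.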
Appendix C MNIST - Additional Results
Appendix D FGSM/IFGSM Adversarial Example Generation
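Given an input image $x$ and a loss function $L(x)$, the perturbed FGSM image $\hat{x}$ is generated as $\hat{x} \triangleq x + \epsilon \, \mathrm{sign}(\nabla_x L(x))$. For IFGSM, this process is slightly modified as $\hat{x}^{k} \triangleq \mathrm{Clip}_{x,\epsilon}\left(\hat{x}^{k-1} + \alpha \, \mathrm{sign}(\nabla_x L(\hat{x}^{k-1}))\right)$ with $\hat{x}^{0} = x$, where $\mathrm{Clip}_{x,\epsilon}$ clips the values of each pixel $i$ of $x$ to the range $(x_i - \epsilon, x_i + \epsilon)$. This process is repeated for a number of iterations $k$. The value of $\epsilon$ is usually set to a small number, so that the perturbations are imperceptible yet can still significantly degrade the accuracy of a network.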
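A minimal Python sketch of both attacks follows, using the standard definitions: FGSM takes a single step of size `eps` along the sign of the input gradient of the loss, and IFGSM takes repeated steps of size `alpha`, clipping each pixel back to within `eps` of the original after every step. The logistic-regression loss, its parameters, and the function names are hypothetical stand-ins for a real network and its gradient.

```python
import math

def sign(v):
    """Return -1, 0 or +1."""
    return (v > 0) - (v < 0)

def logistic_loss_and_grad(x, y, w, b):
    """Loss log(1 + exp(-y * z)) with z = w^T x + b, and its gradient w.r.t. x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    loss = math.log1p(math.exp(-y * z))
    coeff = -y / (1.0 + math.exp(y * z))  # d(loss)/dz
    return loss, [coeff * wi for wi in w]

def fgsm(x, grad_fn, eps):
    """Single step of size eps along the sign of the input gradient."""
    g = grad_fn(x)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]

def ifgsm(x, grad_fn, eps, alpha, steps):
    """Repeated steps of size alpha, each clipped to within eps of the original x."""
    x_adv = list(x)
    for _ in range(steps):
        g = grad_fn(x_adv)
        stepped = [xi + alpha * sign(gi) for xi, gi in zip(x_adv, g)]
        x_adv = [min(max(s, x0 - eps), x0 + eps) for s, x0 in zip(stepped, x)]
    return x_adv

# Toy example: one "image" with two pixels and label y = +1.
w, b, y = [1.0, -2.0], 0.0, 1
x = [0.5, -0.5]
grad_fn = lambda z: logistic_loss_and_grad(z, y, w, b)[1]

x_fgsm = fgsm(x, grad_fn, eps=0.1)
x_ifgsm = ifgsm(x, grad_fn, eps=0.1, alpha=0.05, steps=4)
print(logistic_loss_and_grad(x, y, w, b)[0],
      logistic_loss_and_grad(x_fgsm, y, w, b)[0],
      logistic_loss_and_grad(x_ifgsm, y, w, b)[0])
```

In practice the perturbed image is often also clipped to the valid pixel range; that step is omitted here because the text does not mention it.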
Appendix E CIFAR-10 - Additional Results
Figure 11 shows the performance of the CIFAR-10 models against FGSM black-box and white-box attacks.
Appendix F MNIST Model Architecture and Hyperparameter Details
We use a hidden layer network. The first two hidden layers are convolutional, with filter sizes of and and units. The hidden layers are of size each. We use a learning rate of . Dropout is set to either or (hyperparameter sweep for each run), weight decay to or , and to or (for margin only). The same value of is used at all layers where margin is applied for ease of hyperparameter tuning. The aggregation operator (see (4) in the paper) is set to . For margin training, we add cross-entropy loss with a small weight of . For example, as noise levels increase, we find that dropout hurts performance. The best-performing validation accuracy among the parameter sweeps is used for reporting final test accuracy. We run for steps of training with minibatch size of . We use either SGD with momentum or RMSProp. We fix at for all experiments (see Equation 16 in the paper); this applies to all datasets.
Appendix G CIFAR-10 Model Architecture and Hyperparameter Details
We use the depth , k= architecture from Zagoruyko & Komodakis (2016). This consists of a first convolutional layer with filters of size and units, followed by sets of residual units ( residual units each). No dropout is used. Learning rate is set to for margin and for cross-entropy and hinge, with decay of every or steps (we choose from a hyperparameter sweep). is set to either , or (as for MNIST, the same value of is used at all layers where margin is applied). The aggregation operator for margin is set to . We use either SGD with momentum or RMSProp. For margin training, we add cross-entropy loss with a small weight of .
Appendix H ImageNet Model Architecture and Hyperparameter Details
We follow the architecture in Szegedy et al. (2016). We always use RMSProp for optimization. For margin training, we add cross-entropy loss with a small weight of and an additional auxiliary loss (as suggested in the paper) in the middle of the network.
Appendix I Evolution of distance metric
Below we show plots of our distance approximation (see (7)) averaged over each minibatch for steps of training a CIFAR-10 model with of the data. This is a screengrab from a TensorFlow run (TensorBoard). The orange curve represents training distance, the red curve is validation and the blue curve is test. We show plots for different layers, for both cross-entropy and margin. For each layer (input, layer and output), the margin loss achieves a higher mean distance to the boundary than cross-entropy (for input, margin achieves mean distance about for test set, whereas cross-entropy achieves about ). Note that for cross-entropy we only measure this distance.
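The kind of quantity plotted here can be sketched in Python as follows. The sketch assumes the distance approximation is the first-order form $|f_i(x) - f_j(x)| / \|\nabla_x f_i(x) - \nabla_x f_j(x)\|_q$ suggested by Appendix A. It also assumes that each example contributes its distance to the closest competing class, and that the input gradient is approximated by central finite differences rather than automatic differentiation. The function names and the toy score function are hypothetical.

```python
def numerical_grad(f, x, h=1e-5):
    """Central finite-difference gradient of a scalar function f at x."""
    grad = []
    for k in range(len(x)):
        up, dn = list(x), list(x)
        up[k] += h
        dn[k] -= h
        grad.append((f(up) - f(dn)) / (2.0 * h))
    return grad

def linearized_distance(scores, x, i, j, q=2.0):
    """First-order distance from x to the boundary between classes i and j."""
    diff = lambda z: scores(z)[i] - scores(z)[j]
    g = numerical_grad(diff, x)
    g_norm = sum(abs(v) ** q for v in g) ** (1.0 / q)
    return abs(diff(x)) / g_norm  # undefined if the gradient vanishes

def mean_batch_distance(scores, batch, labels, q=2.0):
    """Average, over a minibatch, of the distance to the nearest competing class."""
    total = 0.0
    for x, y in zip(batch, labels):
        num_classes = len(scores(x))
        total += min(linearized_distance(scores, x, y, j, q)
                     for j in range(num_classes) if j != y)
    return total / len(batch)

# Toy two-class score function (linear, so the approximation is exact here).
scores = lambda z: [3.0 * z[0], 4.0 * z[1] - 5.0]
print(mean_batch_distance(scores, [[0.0, 0.0], [5.0, 0.0]], [0, 1]))
```

For a deep network, `scores` would be the logits (or the same computation started from a hidden layer), and the gradient would come from the framework's automatic differentiation.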