1 Introduction
State-of-the-art machine learning algorithms can achieve high accuracy on classification tasks but misclassify minor perturbations in the data known as adversarial attacks
(Goodfellow et al., 2014; Papernot et al., 2016b; Kurakin et al., 2016; Carlini & Wagner, 2017; Brendel et al., 2017). Data corruptions that go beyond adversarial perturbations, such as changes in image brightness and contrast or the addition of fog and snow, also pose a challenge to machine learning methods (Hendrycks & Dietterich, 2019). The 01 loss is known to be robust to outliers (Bartlett et al., 2004) and to label noise in the training data (Manwani & Sastry, 2013; Ghosh et al., 2015). Does the robustness of 01 loss also extend to adversarial data? We study this in the setting of white box and substitute model black box attacks. Computationally, 01 loss presents a considerable challenge because it is NP-hard to optimize (Ben-David et al., 2003). Previous attempts (Zhai et al., 2013; Tang et al., 2014; Shalev-Shwartz et al., 2011; Li & Lin, 2007; Nguyen & Sanner, 2013) lack on-par test accuracy with convex solvers and are too slow to be practical for large multiclass image benchmarks, with the exception of the recent stochastic coordinate descent solver (Xie et al., 2019).
We propose a two layer neural network with sign activation (01 loss), which to the best of our knowledge is the first such network to be proposed. We train it with stochastic coordinate descent and show that it achieves on-par test accuracy with equivalent convex models. We then proceed with white box and substitute model black box attacks on the image benchmarks MNIST (LeCun et al., 1998), CIFAR10 (Krizhevsky, 2009), and Mini ImageNet (a ten class subset of the original ImageNet (Russakovsky et al., 2015)), where we make interesting findings.
2 Results
We refer to our linear (no hidden layer) and nonlinear (single hidden layer with 20 nodes) models as SCD01 and MLP01 respectively. See our Supplementary Material for their objectives, optimization algorithms, and runtimes and accuracies on image classification benchmarks. As convex counterparts we select the linear support vector machine (with a cross-validated regularization parameter), denoted SVM, and a two layer neural network with 20 hidden nodes and logistic loss (MLP). For multiclass classification we use one-vs-all for all four methods. We use the majority vote of 32 runs for our 01 loss models to improve stability, and do the same for SVM and MLP by majority voting over 32 bootstrapped samples. Our implementation and experimental platform details are given in the Supplementary Material. Our SCD01 and MLP01 source codes, supplementary programs, and data are available from
https://github.com/zerooneloss/mlp01. We refer to the accuracy on the test data as the clean data test accuracy. An incorrectly classified adversarial example is considered a successful attack whereas a correctly classified one is a failed attack. Thus when we refer to the accuracy on adversarial examples it is equivalent to one minus the attack success rate: the lower the accuracy, the more effective the attack.
2.1 White box attacks
In this section we study white box attacks for binary classification on classes 0 and 1 in each of the three datasets. We use single runs of each of the four models to generate adversaries from the model parameters. We use the same white box attack method (Papernot et al., 2016a) for SVM and SCD01 since both are linear classifiers: for a given datapoint x with label y ∈ {−1, +1}, the adversary is x' = x − ε·y·sign(w), where w is the model's weight vector and ε is the distortion.
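As a minimal sketch (our own illustrative numpy version, not the paper's code), the linear attack is:

```python
import numpy as np

def linear_whitebox(x, y, w, eps):
    """White box adversary for a linear classifier sign(w.x + b).

    Moving x against y * sign(w) pushes it toward (and eventually
    across) the decision boundary with max-norm distortion eps.
    """
    return x - eps * y * np.sign(w)

# With w = (1, 0) the point (2, 0) sits at distance 2 from the boundary;
# a distortion of 0.5 moves it halfway there:
x_adv = linear_whitebox(np.array([2.0, 0.0]), 1, np.array([1.0, 0.0]), 0.5)
# x_adv == [1.5, 0.0]
```

Note that only coordinates where w is nonzero are perturbed, since sign(0) = 0.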
For MLP we use the fast gradient sign method (FGSM) (Goodfellow et al., 2014). In this method we generate an adversary as x' = x + ε·sign(∇_x J), where J is the model objective and ∇_x J is its gradient with respect to the input x. We generate white box adversaries for MLP01 with a simple heuristic: for each hidden node i we modify the input as x' = x − ε·y_i·sign(w_i) (where y_i is the output of x from hidden node i and w_i is that node's weight vector) and accept the first modification that is misclassified by the final node output. If x is already misclassified, or if none of the hidden node distortions misclassify it, we distort x with a randomly selected hidden node. We provide the full algorithm in the Supplementary Material. We use ε values on MNIST, CIFAR10, and ImageNet that are typical in the literature.
In Table 1 we see that the clean accuracies of our 01 loss models are comparable to those of their convex counterparts (with more shown in the Supplementary Material). As expected, adversaries from a source model are effective against the same model as target, except for MLP01. More interestingly, while adversaries from SVM and MLP affect each other considerably, their effect on SCD01 and MLP01 is far less pronounced. We see this very clearly on CIFAR10, where both SVM and MLP adversaries give almost 0% accuracy when attacking each other, indicating high transferability (Papernot et al., 2016a), yet adversaries from MLP attain 63.7% accuracy on MLP01 and 43.5% on SCD01. Another interesting observation is that adversaries barely transfer between SCD01 and MLP01. We see similar behavior on Mini ImageNet and to a lesser degree on MNIST.
Table 1: Accuracy of target models (columns) on clean test data and on white box adversaries generated from each source model (rows).

                 SVM    SCD01   MLP    MLP01
MNIST
  Clean          100    99.9    100    100
  SVM            11.9   8.1     40.4   43.5
  SCD01          97     0       98.5   53.2
  MLP            25.5   16.1    31     42.3
  MLP01          99.9   99.8    99.6   69.5
CIFAR10
  Clean          82.2   81.1    88.7   84.2
  SVM            0      41.3    0.5    70.1
  SCD01          76     0.8     86     84.5
  MLP            0      43.5    0.4    63.7
  MLP01          81.7   80      88.5   66.9
Mini ImageNet
  Clean          60.7   67.5    66.1   68.7
  SVM            0      54.9    21.2   53.8
  SCD01          58.6   1       65     60.3
  MLP            0.5    42      21.6   52.3
  MLP01          60.8   65.1    65.8   35.7
We argue that the difference in loss functions (01 vs convex) may be responsible for the different boundaries and the non-transferability. We illustrate this with two examples. First we consider the effect of outliers on 01 loss and hinge loss linear classifiers. Recall that the hinge loss is max(0, 1 − y·f(x)),
where y is the label and f(x) is the prediction of x given by the classifier f. In Figure 1(a) the misclassified outlier forces the hinge loss to give a skewed linear boundary with two misclassifications. This happens because even though two points are misclassified by the red boundary, they are closer to it than the single misclassified point is to the blue one. The 01 loss is unaffected by distances and thus gives the blue boundary with one misclassification. Since the two boundaries have different orientations, their adversaries are also likely to be different. In a dataset like MNIST where our accuracies are high we don't expect many misclassified outliers and thus the boundaries are unlikely to differ. As a result, many adversaries transfer between SVM and SCD01 on MNIST. But on CIFAR10 and Mini ImageNet, which are more complex and likely to contain misclassified outliers, we expect different boundaries, which in turn gives fewer adversaries that transfer between the two.
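The distance sensitivity of the hinge loss versus the indifference of 01 loss can be checked numerically; below is a small illustrative sketch with margins we chose ourselves (not values from the paper):

```python
import numpy as np

def hinge(margins):
    # hinge loss max(0, 1 - y*f(x)) summed over points, given margins y*f(x)
    return np.maximum(0, 1 - margins).sum()

def zero_one(margins):
    # 01 loss: count of misclassified points (non-positive margin)
    return int((margins <= 0).sum())

one_far_outlier = np.array([-5.0])        # one badly misclassified point
two_near_misses = np.array([-0.1, -0.1])  # two barely misclassified points

# The hinge loss prefers the boundary with two near misses (lower loss),
# while the 01 loss prefers the boundary with a single error:
assert hinge(one_far_outlier) > hinge(two_near_misses)        # 6.0 > 2.2
assert zero_one(one_far_outlier) < zero_one(two_near_misses)  # 1 < 2
```

This mirrors Figure 1(a): a single far outlier dominates the hinge objective but counts the same as any other error under 01 loss.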
Next we consider the difference between convex and 01 loss in a simple two-hidden-node network. In Figure 1(b) we see two hyperplanes h1 and h2 on the left whose logistic outputs give the hidden feature space on the right. The two hyperplanes h1 and h2 represent two hidden nodes in a two layer network. Recall that the logistic activation σ(f(x)) = 1/(1 + e^(−f(x))) (where f(x) is the prediction of x given by a hidden node) is similar to 01 loss: for large values of |f(x)| it approaches 0 or 1 depending upon the sign of f(x), and it approaches 1/2 as f(x) approaches 0. Thus if we move the red circle towards the "corner" in the original feature space (as shown in Figure 1(b)) its outputs from h1 and h2 approach 1/2 in the hidden space. Consequently it crosses the linear boundary in the hidden space and becomes adversarial. However, if the activation is the sign function (as in 01 loss) the red point remains unmoved in the hidden space. In fact, with sign activation a datapoint's value in the hidden space changes only if it crosses a boundary in the original space.
While neither example is a formal proof, they give some intuition for why fewer adversarial examples transfer between 01 loss and convex loss models than between two convex models. In particular, for 01 loss a datapoint becomes potentially adversarial if and only if it crosses a boundary in the original feature space, whereas this is not true for convex losses.
Figure 1: Toy example showing different 01 loss and hinge boundaries, and adversarial examples in a simple logistic loss network. (a) Just one point is misclassified by the blue boundary, but its hinge loss (shown with dotted lines) is much higher than the loss of the two points misclassified by the red boundary. Thus the hinge loss favors the skewed red line. (b) The logistic activation in the original space gives a linear separation in the hidden space. If we move the red circle towards the "corner" of the boundaries, its distance to h1 and h2 decreases. This in turn makes its activation values approach 1/2, and it becomes misclassified in the hidden space. If the activation is the sign function the red circle is unaffected in the hidden space.
Interestingly, we also see that MLP01 adversaries don't transfer to the other three models. When applied to MLP01 itself, the adversaries lower its accuracy relative to clean data, but to a lesser degree than the other models attacking themselves. Thus our white box attack method for MLP01 may not be the most powerful one; we leave this as an open problem.
2.2 Substitute model black box attacks
We have seen that white box adversaries don't transfer between convex and 01 loss models, but can we attack a 01 loss model with a convex substitute model (Papernot et al., 2016a)? In this subsection we consider binary and multiclass classification on all three datasets. For all four methods we use 32 votes and one-vs-all multiclass classification. We use adversarial data augmentation (Papernot et al., 2016a) to iteratively train a substitute model on the label outputs of the target model. In each epoch we generate white box adversaries targeting the substitute model with the FGSM method (Goodfellow et al., 2014) and evaluate them on the target. Note that our black box attack is untargeted: we are mainly interested in misclassifying the data, not in the misclassification label. See the Supplementary Material for the full substitute model learning algorithm; it is essentially the method of Papernot et al. (2016a).
2.2.1 Convex substitute model
Figure 2: Accuracy of target models on adversaries generated from a convex substitute model. (a) MNIST, (b) CIFAR10 binary (classes 0 and 1), (c) CIFAR10, (d) Mini ImageNet.
In Figure 2 we see the accuracy of the target models on adversaries generated from a convex substitute model. Specifically, we use a dual hidden layer neural network with logistic loss and 200 nodes in each hidden layer as the substitute model. As in the white box attacks, we use ε values commonly used on these datasets. On MNIST (Figure 2(a)) we see a rapid drop in accuracy in the first few epochs and a somewhat flat curve after epoch 10. We don't see a considerable difference between the 01 loss models and their convex siblings on MNIST, although MLP01 has the highest accuracy.
On CIFAR10 and Mini ImageNet we see much more pronounced differences. In CIFAR10 binary classification (Figure 2(b)), even though MLP and MLP01 start off with clean test accuracies of 88% and 86% respectively, at the end of the 20th epoch MLP01 has 58% accuracy on adversarial examples while MLP has 7%. We see similar results on Mini ImageNet binary classification in the Supplementary Material. In CIFAR10 multiclass (Figure 2(c)), at the end of the 20th epoch the difference in accuracy between MLP and MLP01 is 24% even though both methods start off with about the same accuracy on clean test data. Similarly, on Mini ImageNet MLP01 is 20.7% higher in accuracy than MLP at the 20th epoch. This is particularly interesting since MLP01 started off with higher accuracy on Mini ImageNet, and in general we expect more accurate models to be less robust (Raghunathan et al., 2019; Zhang et al., 2019; Tsipras et al., 2018). That is not the case here. Even if we give MLP the advantage of 400 hidden nodes in a shared weight network instead of one-vs-all, its accuracy at the 20th epoch is 13% lower than MLP01's.
We have already seen in the white box attacks that adversaries transfer between SVM and SCD01 on MNIST but not so much on CIFAR10 and Mini ImageNet. The same phenomenon can explain the results we see here. On MNIST the convex substitute model can attack SCD01 and MLP01 as effectively as the convex models due to the better transferability on MNIST. Due to poor transferability on CIFAR10 and Mini ImageNet, the attack is less effective on SCD01 and MLP01 there. In the next subsection we explore what happens if the substitute model is SCD01.
2.2.2 01 loss substitute model
In Figure 3 we see the results of a black box attack with a single run of SCD01 as the substitute model attacking single runs of the target models. Adversaries produced from this substitute hardly affect any of the target models in any of the epochs. Even when the target is SCD01 trained with the same initial seed as the substitute, the adversaries are ineffective.
Further investigation reveals that the percentage of test data whose labels match between the 01 loss substitute and its target (known as the label match rate) is high, but the label match rate on adversarial examples is much lower (shown in the Supplementary Material). Thus even though SCD01 manages to approximate the target boundary, its direction is different, which gives ineffective adversaries. This is due to the non-uniqueness of 01 loss, which makes single run solutions differ from each other. Thus as a substitute model in black box attacks, 01 loss is ineffective even when attacking itself.
3 Conclusion
There is nothing to indicate that 01 loss models are robust to black box attacks that do not require substitute model training (Brendel et al., 2017; Chen et al., 2019). These attacks are, however, computationally more expensive and require separate computations for each example. A transfer-based attack can be more effective (and dangerous) once the substitute has approximated the target model boundary.
Can we further decrease transferability by introducing artificial noise so that 01 loss and convex boundaries are even more different, particularly on datasets like MNIST? We explore this in a separate study.
References
 Alpaydin (2004) Alpaydin, E. Machine Learning. MIT Press, 2004.
 Bartlett et al. (2004) Bartlett, P. L., Jordan, M. I., and Mcauliffe, J. D. Large margin classifiers: Convex loss, low noise, and convergence rates. In Thrun, S., Saul, L., and Schölkopf, B. (eds.), Advances in Neural Information Processing Systems 16, pp. 1173–1180. MIT Press, 2004.
 Ben-David et al. (2003) Ben-David, S., Eiron, N., and Long, P. M. On the difficulty of approximately maximizing agreements. Journal of Computer and System Sciences, 66(3):496–514, 2003.

 Bottou (2010) Bottou, L. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pp. 177–186. Springer, 2010.
 Brendel et al. (2017) Brendel, W., Rauber, J., and Bethge, M. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. arXiv preprint arXiv:1712.04248, 2017.
 Carlini & Wagner (2017) Carlini, N. and Wagner, D. Towards evaluating the robustness of neural networks. In 2017 ieee symposium on security and privacy (sp), pp. 39–57. IEEE, 2017.
 Chen et al. (2019) Chen, J., Jordan, M. I., and Wainwright, M. J. HopSkipJumpAttack: A query-efficient decision-based attack. arXiv preprint arXiv:1904.02144, 3, 2019.
 Ghosh et al. (2015) Ghosh, A., Manwani, N., and Sastry, P. Making risk minimization tolerant to label noise. Neurocomputing, 160:93–107, 2015.
 Goodfellow et al. (2014) Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
 Hendrycks & Dietterich (2019) Hendrycks, D. and Dietterich, T. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019.
 Krizhevsky (2009) Krizhevsky, A. Learning multiple layers of features from tiny images. 2009.
 Kurakin et al. (2016) Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016.
 LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

 Li & Lin (2007) Li, L. and Lin, H.-T. Optimizing 0/1 loss for perceptrons by random coordinate descent. In Neural Networks, 2007. IJCNN 2007. International Joint Conference on, pp. 749–754. IEEE, 2007.
 Lyu & Tsang (2019) Lyu, Y. and Tsang, I. W. Curriculum loss: Robust learning and generalization against label corruption. arXiv preprint arXiv:1905.10045, 2019.
 Manwani & Sastry (2013) Manwani, N. and Sastry, P. Noise tolerance under risk minimization. IEEE transactions on cybernetics, 43(3):1146–1151, 2013.
 Nguyen & Sanner (2013) Nguyen, T. and Sanner, S. Algorithms for direct 0–1 loss optimization in binary classification. In Proceedings of The 30th International Conference on Machine Learning, pp. 1085–1093, 2013.
 Papernot et al. (2016a) Papernot, N., McDaniel, P., and Goodfellow, I. Transferability in machine learning: from phenomena to blackbox attacks using adversarial samples. arXiv preprint arXiv:1605.07277, 2016a.

 Papernot et al. (2016b) Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik, Z. B., and Swami, A. The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 372–387. IEEE, 2016b.
 Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019.
 Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikitlearn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
 Raghunathan et al. (2019) Raghunathan, A., Xie, S. M., Yang, F., Duchi, J. C., and Liang, P. Adversarial training can hurt generalization. arXiv preprint arXiv:1906.06032, 2019.

 Russakovsky et al. (2015) Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
 Shalev-Shwartz et al. (2011) Shalev-Shwartz, S., Shamir, O., and Sridharan, K. Learning linear and kernel predictors with the 0-1 loss function. In IJCAI Proceedings-International Joint Conference on Artificial Intelligence, 22(3), 2011.
 Tang et al. (2014) Tang, Y., Li, X., Xu, Y., Liu, S., and Ouyang, S. A mixed integer programming approach to maximum margin 0–1 loss classification. In 2014 International Radar Conference, pp. 1–6. IEEE, 2014.
 Tsipras et al. (2018) Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., and Madry, A. Robustness may be at odds with accuracy. arXiv preprint arXiv:1805.12152, 2018.
 Xie et al. (2019) Xie, M., Xue, Y., and Roshan, U. Stochastic coordinate descent for 0/1 loss and its sensitivity to adversarial attacks. In Proceedings of 18th IEEE International Conference on Machine Learning and Applications  ICMLA 2019, pp. to appear, 2019.
 Zhai et al. (2013) Zhai, S., Xia, T., Tan, M., and Wang, S. Direct 01 loss minimization and margin maximization with boosting. In Burges, C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. (eds.), Advances in Neural Information Processing Systems 26, pp. 872–880. Curran Associates, Inc., 2013.
 Zhang et al. (2019) Zhang, H., Yu, Y., Jiao, J., Xing, E. P., Ghaoui, L. E., and Jordan, M. I. Theoretically principled tradeoff between robustness and accuracy. arXiv preprint arXiv:1901.08573, 2019.
4 Supplementary Material
4.1 Background
The problem of determining the hyperplane with the minimum number of misclassifications in a binary classification problem is known to be NP-hard (Ben-David et al., 2003). In mainstream machine learning literature this is called minimizing the 01 loss (Shalev-Shwartz et al., 2011), given in Objective 1,

  min_{w, w0} (1/2n) Σ_i (1 − sign(y_i(wᵀx_i + w0)))   (1)

where w ∈ Rᵈ and w0 ∈ R define our hyperplane and x_i ∈ Rᵈ, y_i ∈ {−1, +1} are our training data. Popular linear classifiers such as the linear support vector machine, perceptron, and logistic regression (Alpaydin, 2004) can be considered convex approximations to this problem that yield fast gradient descent solutions (Bartlett et al., 2004). However, they are also more sensitive to outliers than the 01 loss (Bartlett et al., 2004; Nguyen & Sanner, 2013; Xie et al., 2019) and more prone to mislabeled data (Manwani & Sastry, 2013; Ghosh et al., 2015; Lyu & Tsang, 2019).
4.2 A two layer 01 loss neural network
We extend the 01 loss to a simple two layer neural network with h hidden nodes and sign activation that we call the MLP01 loss. The objective for binary classification can be given as

  min_{W, b, w, w0} (1/2n) Σ_i (1 − sign(y_i(wᵀ sign(Wᵀx_i + b) + w0)))   (2)

where W ∈ R^(d×h) and b ∈ Rʰ are the hidden layer parameters, w ∈ Rʰ and w0 ∈ R are the final node parameters, x_i ∈ Rᵈ, y_i ∈ {−1, +1} are our training data, and sign is applied elementwise in the hidden layer. While this is a straightforward model to define, optimizing it is a different story altogether: optimizing even a single node is NP-hard, which makes optimizing this network much harder.
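As a concrete reading of Objective 2, here is a minimal numpy sketch of our own (shapes follow the notation above; this is illustrative, not the paper's implementation):

```python
import numpy as np

def mlp01_objective(W, b, w, w0, X, y):
    """01 loss of a two layer sign-activation network.

    W: (d, h) hidden weights, b: (h,) hidden thresholds, w: (h,) final
    node weights, w0: scalar threshold, X: (n, d) data, y: (n,) in {-1, +1}.
    Returns the fraction of misclassified training points.
    """
    H = np.sign(X @ W + b)        # hidden layer sign activations
    pred = np.sign(H @ w + w0)    # final node +/-1 output
    return float((1 - y * pred).sum() / (2 * len(y)))
```

Each term (1 − y_i·pred_i)/2 is exactly 1 when the point is misclassified and 0 otherwise, so the objective counts errors.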
4.3 Stochastic coordinate descent for 01 loss
We solve both problems with stochastic coordinate descent based on earlier work (Xie et al., 2019). We initialize all parameters to random values from the normal distribution with mean 0 and variance 1. We then randomly select a subset of the training data (known as a batch) and perform the coordinate descent analog of a single-step gradient update in stochastic gradient descent (Bottou, 2010). We first describe this for a linear 01 loss classifier, which we obtain if we set the number of hidden nodes to zero. In this case the parameters to optimize are the final weight vector w and the threshold w0.
When the gradient is known we step in its negative direction by a factor of the learning rate: w = w − η∇J(w), where J is the objective. In our case, since the gradient does not exist, we randomly select a set of features (set to 64, 128, and 256 for MNIST, CIFAR10, and ImageNet in our experiments), modify the corresponding entries of w by the learning rate (set to 0.17) one at a time, and accept the modification that gives the largest decrease in the objective. Key to our search is a heuristic to determine the optimal threshold w0 each time we modify an entry of w. In this heuristic we perform a linear search on a subset of the projection wᵀx_i and select the w0 that minimizes the objective.
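The threshold step can be sketched as follows; this is our own minimal version that scans the midpoints of all sorted projections rather than the subset used by the paper's heuristic:

```python
import numpy as np

def best_threshold(proj, y):
    """Find w0 minimizing the 01 loss of sign(proj + w0) against y.

    proj = X @ w are the projections of the data onto w; candidate
    thresholds are midpoints between consecutive sorted projections.
    """
    p = np.sort(proj)
    mids = (p[:-1] + p[1:]) / 2
    candidates = np.concatenate([[p[0] - 1], mids, [p[-1] + 1]])
    best_w0, best_err = 0.0, np.inf
    for c in candidates:
        err = int((np.sign(proj - c) != y).sum())  # threshold w0 = -c
        if err < best_err:
            best_w0, best_err = -c, err
    return best_w0, best_err
```

Scanning midpoints suffices because the 01 loss only changes when the threshold crosses a projected point.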
We repeat the above update step on randomly selected batches for a specified number of iterations given by the user. In Figure S1 we show the effect of the batch size (as a percentage of each class to ensure fair sampling) on a linear 01 loss search on CIFAR10 between classes 0 and 1. We see that a batch size of 75% reaches a train accuracy of 80% faster than the other batch sizes. Thus we use this batch size in all our experiments going forward.
We also see that for this batch size the search flattens after 15 iterations (or epochs, as they are labeled in the figure). We run 1000 iterations to ensure a deep search with the intent of maximizing test accuracy. For imbalanced data (which appears in the one-vs-all design) we find that optimizing a balanced version of our objective for half the iterations followed by the default (imbalanced) version gives a lower final objective.
In a two layer network we have to optimize our hidden nodes as well. In each of the 1000 iterations of our search we apply the same coordinate update described above, first to the final output node and then a randomly selected hidden node. In preliminary experiments we find this to be fast and almost as effective as optimizing all hidden nodes and the final node in each iteration.
Our intuition is that by searching on just the sampled data we avoid local minima and across several iterations we can explore a broad portion of the search space. Throughout iterations we keep track of the best parameters that minimize our objective on the full dataset. Below we provide full details of our algorithms.
The problem with our search described above is that it will return different solutions depending upon the initial starting point. To make it more stable we run it 32 times from different random seeds and use the majority vote for prediction.
We extend both our linear and nonlinear models to a simple one-vs-all approach for multiclass classification. For a dataset with c classes we create c one-vs-all classifiers, one for each class. From the 32 models of each classifier we obtain frequency outputs for a test point by simple counting and use them as confidence scores for each class. We then output the predicted class as the one with the highest confidence. This is similar in spirit to the typical softmax objective used in convex neural networks, except that there one can optimize to obtain exact confidences given by sigmoid probabilities.
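The vote-counting prediction can be sketched as below (our own minimal version; `votes` holds the +/-1 outputs of the 32 runs of each one-vs-all classifier on one test point):

```python
import numpy as np

def ova_predict(votes):
    """Predict a class from one-vs-all 01 loss vote counts.

    votes: (n_classes, n_models) array of +/-1 outputs.  The confidence
    of a class is the fraction of its models voting +1; we return the
    argmax class and the confidence vector.
    """
    confidence = (votes == 1).mean(axis=1)
    return int(np.argmax(confidence)), confidence

# Three classes, three runs each: class 2 gets all +1 votes and wins.
cls, conf = ova_predict(np.array([[1, 1, -1], [-1, -1, -1], [1, 1, 1]]))
# cls == 2
```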
4.4 Implementation, experimental platform, and image benchmarks
We implement our 01 loss models in Python and Pytorch (Paszke et al., 2019), and both MLP and SVM (LinearSVC class) in scikitlearn (Pedregosa et al., 2011). We optimize MLP with stochastic gradient descent that has a batch size of 200, momentum of 0.9, and learning rate of 0.01 (.001 for ImageNet data). We ran all experiments on Intel Xeon 6142 2.6GHz CPUs and NVIDIA Titan RTX GPU machines (for parallelizing multiple votes). Our SCD01 and MLP01 source codes, supplementary programs, and data are available from https://github.com/zerooneloss/mlp01.
We experiment on three popular image benchmarks: MNIST (LeCun et al., 1998), CIFAR10 (Krizhevsky, 2009), and ImageNet (Russakovsky et al., 2015). Briefly, MNIST is composed of 28×28 grayscale handwritten digit images with 60000 training and 10000 test images, and CIFAR10 has 32×32 color images with 50000 training and 10000 test images. ImageNet is a large benchmark with 1000 classes of color images. We extract images from 10 random classes and split them to give a training set of 6144 images and a test set of 6369. We normalize each image in each benchmark by dividing each pixel value by 255.
4.5 Clean accuracy and runtimes
Before going into robustness we first compare the clean test data accuracies and training runtimes of our 01 loss models to their convex counterparts. In Table S1 we see that ensembling SVM and MLP models does not improve the test accuracy over single runs, thus we use a shared weight MLP network with 400 nodes on ImageNet to boost accuracy there. In fact the SVM boundary depends only upon the support vectors and so each ensemble will be the same as long as the support vectors are included. As a reminder we ensemble by taking the majority vote on multiple bootstrapped samples.
The 01 loss models improve considerably on all three datasets with ensembling. This is not too surprising since the 01 loss is non-unique and gives different solutions when run multiple times from different initializations. As a result of ensembling, their accuracy is comparable to their convex peers. This makes it easier to compare robustness since we don't have to worry about the robustness vs accuracy tradeoff (Raghunathan et al., 2019; Zhang et al., 2019; Tsipras et al., 2018).
Table S1: Clean test accuracies of single runs and 32-vote ensembles.

Single run
                 SVM    SCD01   MLP    MLP01
MNIST            91.7   83.7    97.6   91.2
CIFAR10          39.9   30.7    50.2   34.3
Mini ImageNet    26     25      32     25.5

32 votes
                 SVM    SCD01   MLP    MLP01
MNIST            91.7   90.8    97.1   96
CIFAR10          40.2   39.7    47.4   46.4

Single run
                 MLP400  SCD01  MLP01
Mini ImageNet    36      34.7   41
In Table S2 we show the runtime of a single run of our 01 loss and convex models on class 0 vs all for each of the three datasets. We don’t claim the most optimized implementation but our runtimes are still somewhat comparable to the convex loss models. Interestingly the convex models take much longer on complex and higher dimensional images in ImageNet compared to MNIST. Our 01 loss model runtimes are similar on MNIST and CIFAR10 because their sizes are similar. On Mini ImageNet since it has fewer training samples than MNIST and CIFAR10 the 01 loss runtimes are also lower.
Table S2: Training runtime of a single run on class 0 vs all.

                 SVM    SCD01   MLP    MLP01
MNIST            0.8    171     64     875
CIFAR10          80     150     267    838
Mini ImageNet    659    83      8564   199
4.6 Label match rates between SCD01 substitute model and target models in black box attack
(a) Percentage of labels that are the same between the substitute model and the target model on clean data. (b) Percentage of labels that are the same between the substitute model and the target model on adversarial data.
4.7 Black box adversarial attacks on class 0 and 1 on MiniImageNet with convex substitute model
(a) Black box attack on classes 0 and 1 on Mini ImageNet with the convex substitute model and distortion ε.
4.8 Coordinate descent
This is our core coordinate descent algorithm. We perform just one iterative update per call instead of running to convergence; we find this to be both more accurate and faster.
4.9 Optimal threshold and 01 loss objective value
This is our fast algorithm to update w0 and the model objective. Once we have the objective for one threshold we can calculate it for the next candidate threshold in constant time.
4.10 Stochastic coordinate descent for linear 01 loss
Our stochastic descent search performs coordinate descent on the model parameters w and w0. We keep track of the best parameters across iterations by evaluating the model objective on the full dataset after each iteration.
4.11 Stochastic coordinate descent for two layer 01 loss network
Our stochastic descent search performs coordinate descent on the final node and then a random hidden node in each iteration. We keep track of the best parameters across iterations by evaluating the model objective on the full dataset after each iteration.
4.12 White box adversarial attacks
If the datapoint is already misclassified by our model, our attack simply performs the perturbation given by a random hidden node (since the ordering of nodes is chosen randomly). Otherwise it picks the distortion of the first node in the random ordering that makes the datapoint misclassified. If no distortion misclassifies the point, it distorts the datapoint by the first hidden node in the random ordering.
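The procedure above can be sketched as follows (a hypothetical minimal version of our own: `predict` stands for the trained network's +/-1 output, and W, b for the hidden layer weights and thresholds; the perturbation form follows the description in Section 2.1):

```python
import numpy as np

def mlp01_whitebox(x, y, W, b, eps, predict):
    """White box adversary for MLP01 via the hidden-node heuristic.

    W: (h, d) hidden weights, b: (h,) thresholds, predict(x) gives the
    network's +/-1 label.  Tries each hidden node's perturbation in a
    random order and accepts the first that flips the prediction.
    """
    order = np.random.permutation(W.shape[0])

    def perturb(i):
        h_i = np.sign(W[i] @ x + b[i])      # node i's +/-1 output on x
        return x - eps * h_i * np.sign(W[i])

    if predict(x) != y:                     # already misclassified:
        return perturb(order[0])            # distort with a random node
    for i in order:
        x_adv = perturb(i)
        if predict(x_adv) != y:             # accept the first flip
            return x_adv
    return perturb(order[0])                # fall back to a random node
```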
4.13 Black box adversarial attacks
In the above procedure we use the test data as the input when attacking a model on a benchmark. We set the attack parameters separately for MNIST, CIFAR10, and ImageNet to the values that produce the most effective attack, using ε values that are typical in the literature. Since we normalize each pixel by 255, a given ε corresponds to a proportional change in the raw pixel values.
When our substitute model is the dual hidden layer network with 200 nodes in each hidden layer, we train it with stochastic gradient descent, a batch size of 200, a learning rate of 0.01, and momentum of 0.9. When it is SCD01 we run 1000 iterations with a batch size of 75% of the rows.
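The overall substitute-model loop can be sketched as below (our own schematic version of the Papernot et al. (2016a) procedure; `fit` and `fgsm` are placeholders for the substitute trainer and the white box attack):

```python
import numpy as np

def substitute_attack(target_predict, x_init, eps, lam, epochs, fit, fgsm):
    """Black box attack via substitute model training with data augmentation.

    Each epoch: label the query set with the black box target, train the
    substitute on those labels, craft FGSM adversaries against the
    substitute, and augment the query set by stepping lam toward the
    adversarial direction.  Returns the final adversaries, which are
    then evaluated on the target.
    """
    x = x_init
    x_adv = x_init
    for _ in range(epochs):
        y = target_predict(x)                             # query the target
        sub = fit(x, y)                                   # train substitute
        x_adv = fgsm(sub, x, y, eps)                      # attack substitute
        x = np.vstack([x, x + lam * np.sign(x_adv - x)])  # augment queries
    return x_adv
```

Note the query set doubles each epoch, which is why the substitute only needs label (not gradient) access to the target.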