Certifying Joint Adversarial Robustness for Model Ensembles

by   Mainuddin Ahmad Jonas, et al.
University of Virginia

Deep Neural Networks (DNNs) are often vulnerable to adversarial examples. Several proposed defenses deploy an ensemble of models with the hope that, although the individual models may be vulnerable, an adversary will not be able to find an adversarial example that succeeds against the ensemble. Depending on how the ensemble is used, an attacker may need to find a single adversarial example that succeeds against all, or a majority, of the models in the ensemble. The effectiveness of ensemble defenses against strong adversaries depends on the vulnerability spaces of the models in the ensemble being disjoint. We consider the joint vulnerability of an ensemble of models, and propose a novel technique for certifying the joint robustness of ensembles, building upon prior work on single-model robustness certification. We evaluate the robustness of various model ensembles, including models trained using cost-sensitive robustness to be diverse, to improve understanding of the potential effectiveness of ensemble models as a defense against adversarial examples.



1 Introduction

Deep Neural Networks (DNNs) have been found to be very successful at many tasks, including image classification Krizhevsky et al. (2012); He et al. (2016), but have also been found to be quite vulnerable to misclassifications from small adversarial perturbations to inputs Szegedy et al. (2014); Goodfellow et al. (2015). Many defenses have been proposed to protect models from these attacks. Most focus on making a single model robust, but there may be fundamental limits to the robustness that can be achieved by a single model Schmidt et al. (2017); Gilmer et al. (2018); Fawzi et al. (2018); Mahloujifar et al. (2019); Shafahi et al. (2019). Several of the most promising defenses employ multiple models in various ways Feinman et al. (2017); Tramèr et al. (2018); Meng and Chen (2017); Xu et al. (2018); Pang et al. (2019). These ensemble-based defenses work on the general principle that it should be more difficult for an attacker to find adversarial examples that succeed against two or more models at the same time, compared to attacking a single model. However, an attack crafted against one model may be successful against a different model trained to perform the same task. This leads to a notion of joint vulnerability to capture the risk of adversarial examples that compromise a set of models, illustrated in Figure 1. Joint vulnerability makes ensemble-based defenses less effective. Thus, reducing the joint vulnerability of models is important to ensure stronger ensemble-based defenses.

Figure 1: Illustration of joint adversarial vulnerability of two binary classification models. The seed input is $x$, and the dotted square box around $x$ represents its true decision boundary. The blue and green lines depict the decision boundaries of the two models. The models are jointly vulnerable in the regions marked with red stars, where both models output the same incorrect class.

Although the above ensemble defenses have shown promise when evaluated against experimental attacks, these attacks often assume adversaries do not adapt to the ensemble defense, and no previous work has certified the joint robustness of an ensemble defense. On the other hand, several recent works have developed methods to certify robustness for single models Wong and Kolter (2018); Tjeng et al. (2019); Gowal et al. (2019). In this work, we introduce methods for providing robustness guarantees for an ensemble of models, building upon the approaches of Wong and Kolter Wong and Kolter (2018) and Tjeng et al. Tjeng et al. (2019).

Contributions. Our main contribution is a framework for certifying the robustness of an ensemble of models against adversarial examples. First, we define three simple ensemble frameworks (Section 3) and provide robustness guarantees for each of them, while evaluating the tradeoffs between them, and we propose a novel technique that extends prior work on single-model robustness to verify the joint robustness of ensembles of two or more models (Section 4). Second, we demonstrate that the cost-sensitive training approach Zhang and Evans (2019) can be used to train diverse robust models that can be used to certify a high fraction of test examples (Section 5). Our results show that, for the MNIST dataset, we can train diverse ensembles of two, five, and ten models using different cost-sensitive robustness matrices. When these diverse models are combined using our ensemble frameworks, the ensembles can certify a larger number of test seeds than a single overall-robust model. For example, 78.1% of test examples can be certified robust for a two-model averaging ensemble and 85.6% for a ten-model ensemble, compared with 72.7% for a single model. We further show that using ensemble models does not significantly reduce accuracy on benign inputs, and, when rejection is used as an option, can reduce the error rate to essentially zero with a 9.7% rejection rate.

2 Background and Related Work

In this section, we briefly introduce adversarial examples, provide background on robust training and certification, and describe defenses using model ensembles.

2.1 Adversarial Examples

Several definitions of adversarial example have been proposed. For this paper, we use this definition Biggio et al. (2013); Goodfellow et al. (2015): given a model $f$, an input $x$, a distance metric $\Delta$, and a distance limit $\epsilon$, an adversarial example for the input $x$ is an input $x'$ where $\Delta(x, x') \le \epsilon$ and $f(x') \ne f(x)$.

In recent years, there has been a significant amount of research on adversarial examples against DNN models, including attacks such as FGSM Goodfellow et al. (2015), DeepFool Moosavi-Dezfooli et al. (2016), PGD Madry et al. (2018), Carlini-Wagner Carlini and Wagner (2017), and JSMA Papernot et al. (2016). The FGSM attack takes the sign of the gradient of the loss with respect to the input and adds a small perturbation to each input feature in the direction that increases the loss. This simple strategy is surprisingly successful. The PGD attack is considered a very strong state-of-the-art attack. It is essentially an iterative version of FGSM: instead of taking one large step, many smaller steps are taken, subject to some constraints and with some randomization.
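The FGSM step described above can be sketched in a few lines. The following is a minimal illustration on a toy linear model; all names and values here are ours for illustration, not from the cited papers:

```python
def fgsm_perturb(x, grad_loss, epsilon):
    # One FGSM step: move each feature by epsilon in the sign
    # direction of the loss gradient (increasing the loss).
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + epsilon * sign(gi) for xi, gi in zip(x, grad_loss)]

# Toy model: loss = w . x, so the gradient of the loss w.r.t. x is w.
w = [0.5, -2.0, 0.0]
x = [0.1, 0.2, 0.3]
x_adv = fgsm_perturb(x, w, epsilon=0.1)   # each feature moves by +/-0.1 (or 0)
```

PGD would repeat this step with a smaller step size, projecting back into the allowed $\epsilon$-ball after each iteration.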

One interesting property of these attacks is that the adversarial examples they find are often transferable Evtimov et al. (2018): a successful attack against one model is often successful against a second model. Transfer attacks enable black-box attacks where the adversary does not have full access to the target model. More importantly for our purposes, they demonstrate that an adversarial example found against one model is often also effective against other models, and hence can be effective against ensemble-based defenses Xie et al. (2018); Tramèr et al. (2018). In our work, we consider the threat model where the adversary has white-box access to all of the models in the ensemble and knowledge of the ensemble construction.

2.2 Robust training

While many proposed adversarial example defenses look promising, adaptive attacks that compromise them are nearly always found Tramer et al. (2020). The failures of ad hoc defenses motivate increased focus on robust training and provable defenses. Madry et al. Madry et al. (2018), Wong et al. Wong and Kolter (2018), and Raghunathan et al. Raghunathan et al. (2018) have proposed robust training methods to defend against adversarial examples. Madry et al. use the PGD attack to find adversarial examples with high loss around training points, and then iteratively adversarially train their models on those seeds. Wong et al. define an adversarial polytope for a given input and robustly train the model to guarantee adversarial robustness over the polytope by reducing the problem to a linear program. These works focus on single models; we propose a way to make ensemble models jointly robust by training a set of models to be both robust and diverse.

2.3 Certified robustness

Several recent works aim to provide guarantees of robustness for models against constrained adversarial examples Wong and Kolter (2018); Tjeng et al. (2019); Raghunathan et al. (2018); Cohen et al. (2019). All of these works provide certification for individual models: a model $f$ is certifiably robust for an input $x$ if $f(x')$ is correct for all $x'$ where $\Delta(x, x') \le \epsilon$. We extend these techniques to ensemble models. In particular, we extend Tjeng et al.'s Tjeng et al. (2019) MIP verification technique and Wong et al.'s Wong and Kolter (2018) convex adversarial polytope method. Both techniques use linear programming to compute bounds on the outputs given the allowable input perturbations, and use those output bounds to provide robustness guarantees. MIPVerify uses mixed integer linear programming solvers, which are computationally very expensive for deep neural networks. To get around this issue, Wong et al. Wong and Kolter (2018) use a dual network formulation of the original network that over-approximates the adversarial region, and apply widely used techniques such as stochastic gradient descent to solve the optimization problem efficiently. This scales to larger networks and provides a sound certificate, but may fail to certify robust examples because of the over-approximation.

2.4 Ensemble models as defense

In classical machine learning, there has been extensive work on ensembles of models and on diversity measures. Kuncheva and Whitaker Kuncheva and Whitaker (2003) provide a comparison of those measures and their usefulness for ensemble accuracy. However, both the diversity measures and the evaluation of their usefulness were done in the benign setting. Assumptions that are valid in the benign setting, such as independently and identically distributed inputs, no longer apply in the adversarial setting.

In the adversarial setting, several ensemble-based defenses have been proposed Feinman et al. (2017); Tramèr et al. (2018); Pang et al. (2019) that work on the principle of making models diverse from each other. Feinman et al. Feinman et al. (2017) use randomness in the dropout layers to build an ensemble that is robust to adversarial examples. Tramèr et al. Tramèr et al. (2018) use ensembles to introduce diversity in the adversarial examples on which a model is trained to be robust. Pang et al. Pang et al. (2019) promote diversity among non-maximal class prediction probabilities to make the ensembles diverse. Sharif et al. Sharif et al. (2019) proposed the nML approach for adversarial defense: they explicitly train the models in the ensemble to be diverse from each other, and show experimentally that this leads to robust ensembles. Similarly, Meng et al. Meng et al. (2020) have shown that an ensemble of weak but diverse models can serve as a strong adversarial defense. While all of these works focus on making models diverse, they evaluate their ensembles using existing attack methods; none of these prior works has attempted to provide any certification of their diverse ensemble models against adversaries.

3 Ensemble Defenses

Our goal is to provide robustness guarantees against adversarial examples for an ensemble of models. The effectiveness of an ensemble defense depends on the models used in the ensemble and how they are combined.

First, we present a general framework for ensemble defenses. Next, we define three different ensemble composition frameworks: unanimity, majority, and averaging. Section 4 describes the techniques we use to certify each type of ensemble framework. In Sections 5.2 and 5.3, we describe different ways to train the individual models in these ensemble frameworks, and discuss the results. Our methods make no assumptions about the models in an ensemble (for example, that they pre-process the input in some way and then run the same model). This means our frameworks are general purpose and agnostic of the input domain, but we cannot handle ensemble mechanisms that are nondeterministic (such as sampling Gaussian noise around the input Salman et al. (2020), which can only provide probabilistic guarantees).

General framework of ensemble defense. We use $F(x)$ to represent the output of a model ensemble composed of $n$ models $f_1, \ldots, f_n$, combined using one of the composition mechanisms. Furthermore, given an input $x$, true output class $y$, and the ensemble output $F(x)$, we use a decision function $D$ to decide whether the given input is adversarial, benign, or rejected. The functions $F$ and $D$ together define an ensemble defense framework. We discuss three such frameworks in this paper.

Unanimity. In the unanimity framework, the ensemble output is class $y$ only if all of the component models output $y$: $F(x) = y$ when $f_i(x) = y$ for all $i \in \{1, \ldots, n\}$. If there is any disagreement among the models, the input is rejected. For the unanimity framework, joint robustness is achieved when the unanimity-robust property defined below is satisfied.

Definition 3.1.

Given an input $x$ with true output class $y$ and allowable adversarial distance $\epsilon$, we call a model ensemble unanimity-robust for input $x$ if there exists no adversarial example $x'$ with $\Delta(x, x') \le \epsilon$ and class $y' \ne y$ such that $f_i(x') = y'$ for all $i \in \{1, \ldots, n\}$.

Majority. In the majority framework, the output class is $y$ only if more than half of the $n$ models agree on it. If there is no majority output class, the input is rejected. Joint robustness is achieved when the majority-robust property defined below is satisfied:

Definition 3.2.

Given an input $x$ with true output class $y$ and allowable adversarial distance $\epsilon$, a model ensemble is majority-robust for input $x$ if there exists no adversarial example $x'$ with $\Delta(x, x') \le \epsilon$ and no class $y' \ne y$ such that $|\{i : f_i(x') = y'\}| > n/2$.

Averaging. In the averaging framework, we take the average of the second-to-last layer output vectors of the component models to produce the final output. This second-to-last layer vector is typically a softmax or logits layer. We use $z_i(x)$ to denote the second-to-last layer output vector of model $f_i$, and define the average of the second-to-last layer vectors as:

$\bar{z}(x) = \frac{1}{n} \sum_{i=1}^{n} z_i(x)$

Then, the output of the ensemble is:

$F(x) = \arg\max_j \bar{z}(x)_j$

An averaging ensemble is jointly robust when the averaging-robust property defined below is satisfied.

Definition 3.3.

Given an input $x$ with true output class $y$ and allowable adversarial distance $\epsilon$, we call a model ensemble averaging-robust for input $x$ if there exists no adversarial example $x'$ with $\Delta(x, x') \le \epsilon$ and no class $y' \ne y$ such that $\bar{z}(x')_{y'} > \bar{z}(x')_y$, where $\bar{z}$ is as defined above.
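The three composition mechanisms defined above can be summarized as simple decision functions over the component models' outputs. The sketch below uses our own illustrative function names, with `None` playing the role of rejection:

```python
def unanimity(preds):
    # preds: predicted class of each component model.
    # Output the agreed class, or reject (None) on any disagreement.
    return preds[0] if all(p == preds[0] for p in preds) else None

def majority(preds):
    # Output a class predicted by more than half the models, else reject.
    top = max(set(preds), key=preds.count)
    return top if preds.count(top) > len(preds) / 2 else None

def averaging(logits):
    # logits: one second-to-last layer vector per model.
    # Average the vectors and output the argmax class.
    n, k = len(logits), len(logits[0])
    avg = [sum(z[j] for z in logits) / n for j in range(k)]
    return max(range(k), key=avg.__getitem__)
```

For example, `unanimity([3, 1, 3])` rejects, while `majority([3, 1, 3])` outputs class 3.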

4 Certifying ensemble defenses

In this section we introduce our techniques to certify that a model ensemble is robust for a given input. Our approach extends the single-model methods of Wong and Kolter Wong and Kolter (2018) and Tjeng et al. Tjeng et al. (2019) to support certification for model ensembles using the different composition mechanisms.

4.1 Unanimity and majority frameworks

The simplest approach for certifying joint robustness for the unanimity and majority frameworks would be to certify the robustness of each model in the ensemble individually for a given input, and then make a joint certification decision based on those individual certifications. This strategy is simple but prone to false negatives.

For the unanimity framework, we can verify that an ensemble is unanimity-robust for input $x$ if at least one of the models is individually robust for $x$. This provides a simple way to use single-model certifiers to verify robustness for an ensemble, but is stricter than what Definition 3.1 requires, since compromising a unanimity ensemble requires finding a single input that is a successful adversarial example against every component model. Hence, this method may substantially underestimate the actual robustness, especially when the component models have mostly disjoint vulnerability regions. An input that cannot be certified using this technique may still be unanimity-robust. Nevertheless, this technique is an easy way to establish a lower bound on joint robustness.

Similarly, for the majority framework, we can use this approach to verify that an ensemble satisfies majority-robustness (Definition 3.2) by checking whether more than half of the models are individually robust for input $x$. As with the unanimity case, this underestimates the actual robustness, but provides a valid joint robustness lower bound. As we will see in Section 5.3, the independent evaluation strategy works fairly well for the unanimity framework, but it is almost useless for the majority framework when the number of models in the ensemble gets large.
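Both independent-certification checks reduce to counting per-model certification results. The following is a sketch under the assumption that a single-model certifier has produced a hypothetical boolean robustness flag for each model:

```python
def unanimity_lower_bound(robust_flags):
    # Sound but incomplete: if any one model is certified robust for the
    # input, no single x' can fool every model, so the unanimity
    # ensemble is robust for that input.
    return any(robust_flags)

def majority_lower_bound(robust_flags):
    # Sound but incomplete: if more than half the models are certified
    # robust, no wrong class can win a majority vote.
    return sum(robust_flags) > len(robust_flags) / 2
```

With flags `[True, False, False]` the unanimity check succeeds while the majority check fails, matching the observation that independent certification is far weaker for majority ensembles.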

4.2 Averaging models

As the averaging framework essentially forms a single model by combining the models in the ensemble, we can apply the single-model certification techniques directly to that combined model. This gives us robust certification according to Definition 3.3. Furthermore, we can show that this certification implies a certification guarantee for the unanimity framework. In fact, the unanimity guarantee achieved this way has a lower false negative rate than the independent technique described in the previous subsection. We state this formally in Theorem 4.1, and provide a proof below.

Theorem 4.1.

If, for a given input $x$, the averaging ensemble is certified to be robust, then the component models combined with the unanimity framework are also certifiably robust according to Definition 3.1.


Proof. Let $f_1, \ldots, f_n$ be the component models of the averaging ensemble, and let $z_1(x), \ldots, z_n(x)$ be the second-to-last layer output vectors for each of these models. As described in Section 3, given input $x$ we define the average of the second-to-last layer outputs as:

$\bar{z}(x) = \frac{1}{n} \sum_{i=1}^{n} z_i(x)$

The final output class is defined as:

$F(x) = \arg\max_j \bar{z}(x)_j$

For input $x$, let $y$ be the true output class and $y' \ne y$ be a target output class, and consider any $x'$ with $\Delta(x, x') \le \epsilon$. If the averaging ensemble is robust, we can write $\bar{z}(x')_y > \bar{z}(x')_{y'}$. It follows that $\sum_{i=1}^{n} z_i(x')_y > \sum_{i=1}^{n} z_i(x')_{y'}$. This implies that $z_i(x')_y > z_i(x')_{y'}$ for at least one $i$, so at least one component model does not output class $y'$ on $x'$. Thus, a unanimity ensemble of $f_1, \ldots, f_n$ is unanimity-robust for target class $y'$ according to Definition 3.1. Hence, if we can show that a model ensemble is averaging-robust for all target classes for an input $x$, then the unanimity ensemble formed from the same models is also unanimity-robust for input $x$. ∎

Averaging-robustness is again a stricter definition of robustness than the unanimity-robustness defined in Definition 3.1. This means that, although certification of averaging-robustness implies unanimity-robustness, the converse does not hold: unanimity-robustness does not imply averaging-robustness. Therefore, we again obtain a lower bound on unanimity-robustness. However, averaging-robustness is a less strict definition of robustness than the implicit definition used by the independent certification approach of the previous subsection. Thus, this formulation gives us a better estimate of true unanimity-robustness.
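The key step of the argument, that the averaged vector can rank the true class above a target class only if at least one component model does, is easy to check numerically. The sketch below is only an illustration on random data, not part of the paper's experiments:

```python
import random

random.seed(0)
for _ in range(1000):
    n, k = 3, 4     # 3 models, 4 classes (arbitrary choices)
    z = [[random.uniform(-1, 1) for _ in range(k)] for _ in range(n)]
    avg = [sum(zi[j] for zi in z) / n for j in range(k)]
    y, y_target = 0, 1
    if avg[y] > avg[y_target]:
        # If the average favors y over y_target, the sum does too, so
        # at least one component model must favor y over y_target.
        assert any(zi[y] > zi[y_target] for zi in z)
```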

In this work, we extend two different single-model certification techniques to provide robustness certification for ensembles. The two techniques are described below:

Using MIP verification: Tjeng et al. Tjeng et al. (2019) used mixed integer programming (MIP) techniques to evaluate the robustness of models against adversarial examples. We apply their certification technique to our averaging ensemble model to certify its joint robustness. However, we found this approach to be computationally intensive and hard to scale to larger models. Nevertheless, we found some interesting results for two very simple MNIST models, which we report in the next section.

Using convex adversarial polytope: To scale our verification technique to larger models, we extended the dual network formulation of Wong and Kolter Wong and Kolter (2018) to handle the final averaging layer of the averaging ensemble model. Because this layer is a linear operation, it can be simulated using a fully connected linear layer in the neural network, and because linear layers are already supported by their framework, our averaging model can thus be verified.
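To see why the averaging layer fits a linear-layer formulation, note that averaging two length-$k$ vectors is a single fully connected layer with fixed weight matrix $[\frac{1}{2}I \mid \frac{1}{2}I]$ applied to their concatenation. A minimal sketch with assumed shapes (not the paper's actual implementation):

```python
def avg_as_linear(z1, z2):
    # Build W = [I/2 | I/2] and apply it to the concatenation [z1; z2].
    k = len(z1)
    concat = z1 + z2
    W = [[0.0] * (2 * k) for _ in range(k)]
    for j in range(k):
        W[j][j] = 0.5        # weight on z1[j]
        W[j][k + j] = 0.5    # weight on z2[j]
    return [sum(W[j][m] * concat[m] for m in range(2 * k)) for j in range(k)]

# The fixed linear layer reproduces element-wise averaging exactly.
assert avg_as_linear([2.0, 0.0], [0.0, 4.0]) == [1.0, 2.0]
```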

5 Experiments

This section reports on our experiments extending two different certification techniques, MIPVerify (Section 5.2) and the convex adversarial polytope (Section 5.3), for use with model ensembles in different frameworks. To conduct the experiments, we produced a set of robust models that are trained to be diverse in particular ways (Section 5.1) and can be combined in various ensembles. Because of the computational challenges in scaling these techniques to large models, most of our results use only the convex adversarial polytope method, and for now we have experimental results only on MNIST. Although this is a simple dataset, and may not be representative of typical tasks, it is sufficient for exploring methods for testing joint vulnerability, and for providing some insights into the effectiveness of different types of ensembles.

5.1 Training Diverse Robust Models

To train the models in the ensemble frameworks, we used the cost-sensitive robustness framework of Zhang and Evans Zhang and Evans (2019), which is implemented on top of the convex adversarial polytope work. Cost-sensitive robustness provides a principled way to train diverse models.

Cost-sensitive robust training uses a cost matrix to specify the seed-target class pairs that are trained to be robust. If $C$ is the cost matrix, $i$ is a seed class, and $j$ is a target class, then $C_{ij} = 1$ is set when we want to make the trained model robust against adversarial attacks from seed class $i$ to target class $j$, and $C_{ij} = 0$ is set when we do not want to make the model robust for this particular seed-target pair. For the MNIST dataset, $C$ is a $10 \times 10$ matrix. We configure this cost matrix in different ways to produce different types of model ensembles. This provides a controlled way to produce models with diverse robustness properties, in contrast to ad hoc diverse training methods that vary model architectures or randomize aspects of training. We expect both types of diversity will be useful in practice, but leave exploring ad hoc diversity methods to future work.
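A cost matrix of this form is easy to construct. The sketch below builds 0/1 matrices for seed- or target-restricted robustness; the helper name and its diagonal-zeroing convention are our own illustrative choices, not taken from Zhang and Evans's implementation:

```python
def cost_matrix(robust_seeds=None, robust_targets=None, classes=10):
    # C[i][j] = 1 marks seed i -> target j pairs trained to be robust.
    # None means "no restriction" on that axis; the diagonal stays 0
    # since a class is never an attack target from itself.
    C = [[0] * classes for _ in range(classes)]
    for i in range(classes):
        for j in range(classes):
            if i == j:
                continue
            if (robust_seeds is None or i in robust_seeds) and \
               (robust_targets is None or j in robust_targets):
                C[i][j] = 1
    return C

even_seeds = cost_matrix(robust_seeds={0, 2, 4, 6, 8})      # even-seeds robust
odd_targets = cost_matrix(robust_targets={1, 3, 5, 7, 9})   # odd-targets robust
```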

We conduct experiments on ensembles of two, five, and ten models, trained using different cost matrices. The different ensembles we used are listed below:

  • Two model ensembles where individual models are:

    1. Even seed digits robust and odd seed digits robust.

    2. Even target digits robust and odd target digits robust.

    3. Adversarially-clustered seed digits robust.

    4. Adversarially-clustered target digits robust.

  • Five model ensembles with individual models that are:

    1. Seed digits modulo-5 robust.

    2. Target digits modulo-5 robust.

    3. Adversarially-clustered seed digits robust.

    4. Adversarially-clustered target digits robust.

  • Ten model ensembles:

    1. Seed digits robust models.

    2. Target digits robust models.

A representative selection of the models we use is described in Table 1. The overall robust model is a single model trained to be robust on all seed-target pairs (this is the same as standard certifiable robustness training using the convex adversarial polytope). The other models were trained using the different cost matrices shown in Table 1. All models had the same architecture, with two convolutional layers and three linear layers, and were trained for robustness at adversarial distance $\epsilon = 0.1$.

Model (Cost Matrix)    Overall Certified Robust Accuracy    Cost-Sensitive Robust Accuracy
Overall Robust % %
Even-seeds Robust % %
Odd-targets Robust % %
Seeds (2,3,5,6,8) Robust % %
Targets (0,1,4,7,9) Robust % %
Seed-modulo-5 = 0 Robust % %
Target-modulo-5 = 3 Robust % %
Seeds (3,5) Robust % %
Targets (1,7) Robust % %
Seed-modulo-10 = 3 Robust % %
Target-modulo-10 = 7 Robust % %
Table 1: Models trained using cost-sensitive robustness for use in ensembles. One representative model is shown from each ensemble for brevity. We show each model's robust cost matrix by listing the seed ($i$) and target ($j$) values where $C_{ij} = 1$ ($C_{ij} = 0$ for all others), as well as its overall robust and cost-sensitive accuracy.

The adversarial clustering was done to ensure that digits that appear visually most similar to each other are grouped together. The similarity between a pair of digits was measured by how easily either digit of the pair can be adversarially targeted to the other. These results are consistent with our intuitions about visual similarity: for example, MNIST digits 2, 3, 5, and 8 are visually quite similar, and we also found them to be adversarially similar, and hence clustered them together.

5.2 Certifying using MIPVerify

We used the MIP verifier on two shallow MNIST networks. One of the networks had two fully-connected layers, and the other had three fully-connected layers. The two-layer network was trained to be robust on even seeds and the three-layer network on odd seeds, using adversarial training with PGD attacks. Even with adversarial training, however, the models were not really robust: at a perturbation bound that is very low for the MNIST dataset, the models had robust accuracy of only 23% and 28%, respectively. The lack of robustness is due to the networks being very shallow and lacking convolutional layers; we could not make the models more complex, because doing so makes robust certification too computationally expensive. Still, even with these non-robust models, we see some interesting results for the ensemble of the two models, which we discuss below.

To understand the robustness possible by constructing an ensemble of the two models, we computed the minimal adversarial perturbation for 100 test seeds for the two single networks and for the ensemble average network built from them. We used a distance metric for which the MIP verifier, due to its linear nature, performs better than for other distances. More than 90% of the seeds were verified within 240 seconds. Figure 2 shows the number of seeds that can be proven robust using MIP verification at a given distance for each model independently, for the maximum of the two models, and for the ensemble average model. The verifier was not always able to find the minimal necessary perturbation for the ensemble network within the 240-second time limit. In those cases, we report the maximum adversarial distance proven safe when the time limit was exceeded, which is a lower bound on the minimal adversarial perturbation. We note from the figure that the number of examples certified by the ensemble average model is higher than for either individual model at all minimal distances.

In general, though, we found that MIP verification does not scale to networks complex enough to be useful in practice. Deeper networks and convolutional layers make the performance of MIPVerify significantly worse. Furthermore, we found that robust networks were harder to verify than non-robust networks with this framework. Because of this, we did not use this approach for the remaining experiments, which involve more practical networks.

Figure 2: Number of test seeds certified to be robust by the single models and the ensemble average model for different perturbation distance constraints.

5.3 Convex adversarial polytope certification

As the MIP verification does not scale well to larger networks, for our remaining experiments we use the convex adversarial polytope formulation by Wong et al. Wong and Kolter (2018). We conduct experiments with ensembles of two, five, and ten models, using the models described in Table 1. Table 2 summarizes the results.

Joint robustness of two-model ensembles. We evaluated two-model ensembles with different choices for the models, using the three composition methods. We ensured that the averaging ensemble could be treated as a single sequential model made of fully-connected linear layers, so that the robust verification formulation remained valid when applied to it. To do this, we first converted the convolutional layers of the single models into equivalent linear layers, and then combined the linear layers of the two models into larger linear layers for the joint model. We can then calculate the robust error rates of the ensemble average model, as well as of the unanimity and majority ensembles, for the two-model ensemble. The key point is that no changes to the existing verification framework were needed.
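The merging step can be illustrated with toy single-unit linear networks: the joint first layer vertically stacks the two models' first layers (both read the same input), subsequent layers are block-diagonal, and a final fixed layer averages the two outputs. This is a hypothetical minimal sketch, not the actual MNIST construction:

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# Two toy linear "models" over a 2-feature input, one hidden unit each.
W1a, W1b = [[1.0, 0.0]], [[0.0, 1.0]]   # first layers
W2a, W2b = [[2.0]], [[3.0]]             # output layers (1 logit each)

def joint_forward(x):
    h = matvec(W1a + W1b, x)             # stacked first layer (shared input)
    block = [[2.0, 0.0],                 # block-diagonal second layer:
             [0.0, 3.0]]                 # W2a and W2b on separate blocks
    z = matvec(block, h)
    return [0.5 * z[0] + 0.5 * z[1]]     # fixed averaging layer

def separate_forward(x):
    za = matvec(W2a, matvec(W1a, x))[0]
    zb = matvec(W2b, matvec(W1b, x))[0]
    return [(za + zb) / 2]

# The single sequential joint model matches averaging the two models.
assert joint_forward([1.0, 2.0]) == separate_forward([1.0, 2.0])
```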

Table 2 shows each ensemble's robust accuracy. For two-model ensembles, the unanimity and the majority frameworks are identical, so we can use the same ensemble average technique to certify both. For adversarial clustering into two models, we used two clusters: one for digits (2, 3, 5, 6, 8) and the other for digits (0, 1, 4, 7, 9).

Compared to the single overall robust model, where 72.7% of the test examples can be certified robust, with two-model ensembles we can certify up to 78.1% of seeds as robust (using the averaging composition with the adversarially clustered seed robust models).

Models Composition Certified Robust Normal Test Error Rejection
Overall Robust Single 72.7% 5.0% -
Even/Odd-seed Unanimity 74.7% 1.3% 5.0%
Average 75.9% 3.3% -
Clustered seed (2) Unanimity 77.3% 1.5% 6.0%
Average 78.1% 3.0% -
Seed-modulo-5 Unanimity 84.1% 0.3% 8.1%
Average 85.3% 1.7% -
Clustered seed (5) Unanimity 83.8% 0.7% 7.1%
Average 84.3% 1.4% -
Seed-modulo-10 Unanimity 85.4% 0.1% 9.7%
Average 85.6% 1.5% -
Table 2: Robust certification, normal test error, and rejection rates for the single model and for two-, five-, and ten-model ensembles against bounded adversarial perturbations.
(a) Cluster Two-Model Ensemble
(b) Modulo-5 Seeds Ensemble
(c) Ten-model ensemble
Figure 3: Number of test examples certified to be jointly robust using each model ensemble for different $\epsilon$ values.

We reran all the above experiments for all $\epsilon$ values from 0.01 to 0.20 to see how the joint robustness changes as the attacks get stronger. Figure 3(a) shows the results for the adversarially clustered seeds two-model ensemble; the results for the other ensembles show similar patterns and are deferred to Appendix A. For $\epsilon$ values up to 0.1, the value used for training the robust models, the ensemble model certifies more seeds than the single overall robust model. We also note that models trained to be target-robust, rather than seed-robust, perform much worse: with even and odd target-robust models we were able to certify only 35.6% of test examples. We believe this is because the robustness evaluation criterion is inherently biased against models trained to be target-robust: when evaluating, we always start from some test seed and try to find an adversarial example from that seed, which is not what target-robust models are explicitly trained to prevent.

Five-model Ensembles. Our joint certification framework can be extended to ensembles of any number of models. We trained the five models to be robust on modulo-5 seed digits. Ensembles of these models had better certified robustness than the best two-model ensembles: for example, with averaging composition, 85.3% of test examples can be certified robust (compared to our previous best result of 78.1% with two models). Figure 3(b) shows how the number of certifiable test seeds drops with increasing $\epsilon$; worth noting is the large gap between any individual model's certifiable robustness and that of the average ensemble. We also trained models by adversarially clustering the digits into five groups: (4, 9), (3, 5), (2, 8), (0, 6), and (1, 7). For the clustered seed robust ensemble, the results were slightly worse (84.3%) than for the modulo-5 seeds robust models. One difference between the two-model and five-model ensembles is that in the latter, the unanimity and majority frameworks differ. We found that independent certification does not really work for the majority framework: we were able to certify almost no test seeds for the majority framework for five-model ensembles.

Ten-model Ensembles. Finally, we tried ensembles of ten models, each trained to be robust for a single seed digit (one model per digit). The certified robust rate of the 10-model seed-robust ensemble was 85.6%. This is slightly higher than the 5-model ensemble (85.3%), but perhaps not worth the extra performance cost. It is notable, though, that the unanimity model reduces the normal test error for this ensemble to 0.1%. This means that out of 1000 test seeds, 853 were certified to be robust, 48 were correctly classified but could not be certified, 97 were rejected due to disagreement among the models, and 1 was incorrectly classified by all 10 models. Figure 4 shows the one test example where all models agree on a predicted class that is not the given label (Figure 4(a)), and selected typical rejected examples from the 97 tests where the models disagree (Figure 4(b)).
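The four-way breakdown above (certified / correct but uncertified / rejected / misclassified) can be computed mechanically from per-example results. This hypothetical helper is ours, written only to make the accounting explicit:

```python
def unanimity_outcome(label, preds, certified):
    """Classify one test example's outcome under the unanimity framework.

    label:     the true class of the test seed
    preds:     list of per-model predicted classes
    certified: True if joint robustness was certified for this example
    """
    if len(set(preds)) > 1:
        return "rejected"            # models disagree, so the input is rejected
    if preds[0] != label:
        return "misclassified"       # all models agree on a wrong class
    return "certified" if certified else "correct_uncertified"
```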

(a) The one test example that is incorrectly classified by all ten models. (Labeled as 6, predicted as 1.)
(b) Examples of rejected test examples for which models disagree on the predicted class.
Figure 4: Test examples that are misclassified or rejected by the ten-model ensemble.

Summary. Figure 5 compares the robust certification rates for the two-, five-, and ten-model ensembles. Clustered seed-robust models generally perform well, although the simple modulo seed-robust models perform almost as well.

One potential issue with any ensemble model is the possibility of false positives. In our case, the use of multiple models in the unanimity and majority frameworks also introduces the possibility of rejecting benign inputs. As the number of models in a unanimity ensemble increases, the rejection rate on normal inputs increases, since the input is rejected if any one model disagrees. However, if the false rejection rate is reasonably low, in many situations that may be an acceptable trade-off for higher adversarial robustness. The results in Table 2 are consistent with this, showing that even the ten-model unanimity ensemble has a rejection rate below 10%. For more challenging classification tasks, strict unanimity composition may not be an option if rejection rates become unacceptable, but it could be replaced by relaxed notions (for example, considering a set of related classes as equivalent for agreement purposes, or allowing some small fraction of models to disagree).
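The second relaxation mentioned above could be implemented as a simple threshold on disagreement. This sketch is our illustration, with hypothetical names; it accepts the plurality class whenever at most `max_dissent` models deviate from it:

```python
from collections import Counter

def relaxed_unanimity(preds, max_dissent=0):
    """Accept the plurality class if at most `max_dissent` models disagree
    with it; otherwise reject (return None). max_dissent=0 recovers strict
    unanimity."""
    top, count = Counter(preds).most_common(1)[0]
    return top if len(preds) - count <= max_dissent else None
```

Raising `max_dissent` trades a lower benign rejection rate for a weaker agreement requirement, so the certification condition would have to be relaxed correspondingly.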

Figure 5: Number of test seeds certified to be jointly robust using ten models, five models, two models, and a single model for different ε values.

6 Conclusion

We extended robustness certification techniques designed for single models to provide joint robustness guarantees for ensembles of models. Our novel joint-model formulation can be used to extend certification frameworks to provide certifiable robustness guarantees that are substantially stronger than what can be obtained by applying the verification techniques independently. Furthermore, we have shown that cost-sensitive robustness training with diverse cost matrices can produce models that are diverse with respect to joint robustness goals. The results from our experiments suggest that ensembles of models can be useful for increasing robustness against adversarial examples. There is a vast space of possible ways to train models to be diverse, and ways to use multiple models in an ensemble, that may lead to even more robustness. As we noted in our motivation, however, without efforts to certify joint robustness, or to ensure that models in an ensemble are diverse in their vulnerability regions, the apparent effectiveness of an ensemble may be misleading. Although the methods we have used cannot yet scale beyond tiny models, our results provide encouragement that ensembles can be constructed that provide strong robustness against even the most sophisticated adversaries.


Open source code for our implementation and for reproducing our experiments is available at: https://github.com/jonas-maj/ensemble-adversarial-robustness.


We thank members of the Security Research Group, Mohammad Mahmoody, Vicente Ordóñez Román, and Yuan Tian for helpful comments on this work, and thank Xiao Zhang, Eric Wong, Vincent Tjeng, Kai Xiao, and Russ Tedrake for their open source projects that we made use of in our experiments. This research was sponsored in part by the National Science Foundation #1804603 (Center for Trustworthy Machine Learning, SaTC Frontier: End-to-End Trustworthiness of Machine-Learning Systems), and additional support from Amazon, Google, and Intel.


  • [1] B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Šrndić, P. Laskov, G. Giacinto, and F. Roli (2013) Evasion attacks against machine learning at test time. In Machine Learning and Knowledge Discovery in Databases, Cited by: §2.1.
  • [2] N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, Cited by: §2.1.
  • [3] J. Cohen, E. Rosenfeld, and Z. Kolter (2019) Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, Cited by: §2.3.
  • [4] I. Evtimov, K. Eykholt, E. Fernandes, T. Kohno, B. Li, A. Prakash, A. Rahmati, and D. Song (2018) Robust physical-world attacks on deep learning models. In Conference on Computer Vision and Pattern Recognition, Cited by: §2.1.
  • [5] A. Fawzi, H. Fawzi, and O. Fawzi (2018) Adversarial vulnerability for any classifier. In Conference on Neural Information Processing Systems, Cited by: §1.
  • [6] R. Feinman, R. R. Curtin, S. Shintre, and A. B. Gardner (2017) Detecting adversarial samples from artifacts. arXiv preprint arXiv:1703.00410. Cited by: §1, §2.4.
  • [7] J. Gilmer, L. Metz, F. Faghri, S. S. Schoenholz, M. Raghu, M. Wattenberg, and I. Goodfellow (2018) Adversarial spheres. arXiv preprint arXiv:1801.02774. Cited by: §1.
  • [8] I. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In International Conference on Learning Representations, Cited by: §1, §2.1, §2.1.
  • [9] S. Gowal, K. Dvijotham, R. Stanforth, R. Bunel, C. Qin, J. Uesato, R. Arandjelovic, T. Mann, and P. Kohli (2019) Scalable verified training for provably robust image classification. In International Conference on Computer Vision, Cited by: §1.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  • [11] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, Cited by: §1.
  • [12] L. I. Kuncheva and C. J. Whitaker (2003) Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine learning 51 (2), pp. 181–207. Cited by: §2.4.
  • [13] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2018) Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, Cited by: §2.1, §2.2.
  • [14] S. Mahloujifar, D. Diochnos, and M. Mahmoody (2019) The curse of concentration in robust learning: evasion and poisoning attacks from concentration of measure. In AAAI Conference on Artificial Intelligence, Cited by: §1.
  • [15] D. Meng and H. Chen (2017) MagNet: a two-pronged defense against adversarial examples. In ACM Conference on Computer and Communications Security, Cited by: §1.
  • [16] Y. Meng, J. Su, J. O’Kane, and P. Jamshidi (2020) Ensembles of many diverse weak defenses can be strong: defending deep neural networks against adversarial attacks. arXiv preprint arXiv:2001.00308. Cited by: §2.4.
  • [17] S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2016) DeepFool: a simple and accurate method to fool deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1.
  • [18] T. Pang, K. Xu, C. Du, N. Chen, and J. Zhu (2019) Improving adversarial robustness via promoting ensemble diversity. arXiv preprint arXiv:1901.08846. Cited by: §1, §2.4.
  • [19] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami (2016) The limitations of deep learning in adversarial settings. In IEEE European Symposium on Security and Privacy, Cited by: §2.1.
  • [20] A. Raghunathan, J. Steinhardt, and P. Liang (2018) Certified defenses against adversarial examples. In International Conference on Learning Representations, Cited by: §2.2, §2.3.
  • [21] H. Salman, M. Sun, G. Yang, A. Kapoor, and J. Z. Kolter (2020) Black-box smoothing: a provable defense for pretrained classifiers. arXiv:2003.01908. Cited by: §3.
  • [22] L. Schmidt, S. Santurkar, D. Tsipras, K. Talwar, and A. Madry (2017) Adversarially robust generalization requires more data. In Advances in Neural Information Processing Systems, Cited by: §1.
  • [23] A. Shafahi, W. R. Huang, C. Studer, S. Feizi, and T. Goldstein (2019) Are adversarial examples inevitable?. In International Conference on Learning Representations, Cited by: §1.
  • [24] M. Sharif, L. Bauer, and M. K. Reiter (2019) n-ML: mitigating adversarial examples via ensembles of topologically manipulated classifiers. arXiv preprint arXiv:1912.09059. Cited by: §2.4.
  • [25] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In International Conference on Learning Representations, Cited by: §1.
  • [26] V. Tjeng, K. Xiao, and R. Tedrake (2019) Evaluating robustness of neural networks with Mixed Integer Programming. In International Conference on Learning Representations, Cited by: §1, §2.3, §4.2, §4.
  • [27] F. Tramer, N. Carlini, W. Brendel, and A. Madry (2020) On adaptive attacks to adversarial example defenses. arXiv:2002.08347. Cited by: §2.2.
  • [28] F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel (2018) Ensemble adversarial training: attacks and defenses. In International Conference on Learning Representations, Cited by: §1, §2.1, §2.4.
  • [29] E. Wong and Z. Kolter (2018) Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, Cited by: §1, §2.2, §2.3, §4.2, §4, §5.3.
  • [30] C. Xie, J. Wang, Z. Zhang, Z. Ren, and A. Yuille (2018) Mitigating adversarial effects through randomization. In International Conference on Learning Representations, Cited by: §2.1.
  • [31] W. Xu, D. Evans, and Y. Qi (2018) Feature Squeezing: detecting adversarial examples in deep neural networks. In Network and Distributed Systems Security Symposium, Cited by: §1.
  • [32] X. Zhang and D. Evans (2019) Cost-sensitive robustness against adversarial examples. In International Conference on Learning Representations, Cited by: §1, §5.1.

Appendix A Additional Experimental Results

(a) Even/odd seeds robust
(b) Even/odd targets robust
(c) Clustered Targets
Figure 6: Number of test seeds certified to be jointly robust using the individual models and different two-model ensembles with the averaging framework for different ε values.
(a) Modulo-5 Targets Robust
(b) Modulo-5 Seeds Robust
(c) Clustered Targets
Figure 7: Number of test examples certified to be jointly robust using the individual models and the 5-model ensembles with the averaging framework for different ε values.
Figure 8: Number of test examples certified to be jointly robust using the individual models and the 10-model targets-robust ensemble with the averaging framework for different ε values.