1 Introduction
Neural networks have achieved excellent performance on many computer vision tasks, but they are often vulnerable to small, adversarially chosen perturbations that are barely perceptible to humans yet have a catastrophic impact on model performance (szegedy2013intriguing; goodfellow2014explaining). Making classifiers robust to these adversarial perturbations is of great interest, especially when neural networks are applied to safety-critical applications. Several heuristic methods exist for obtaining robust classifiers; however, powerful adversarial examples can be found against most of these defenses (carlini2017adversarial; uesato2018adversarial).
Recent studies focus on verifying or enforcing the certified accuracy of deep classifiers, especially for networks with ReLU activations. They provide guarantees of a network's robustness to any perturbation $\delta$ with norm bounded by $\epsilon$ (wong2017provable; wong2018scaling; raghunathan2018certified; dvijotham2018dual; zhang2018efficient; salman2019convex). Formal verification methods can find the exact minimum adversarial distortions needed to fool a classifier (ehlers2017formal; katz2017reluplex; tjeng2017evaluating), but require solving an NP-hard problem. To make verification efficient and scalable, convex relaxations are adopted, resulting in a lower bound on the norm of adversarial perturbations (zhang2018efficient; weng2018towards), or an upper bound on the robust error (dvijotham2018dual; gehr2018ai2; singh2018fast). Linear programming (LP) relaxations (wong2018scaling) are efficient enough to estimate the lower bound of the margin in each iteration for training certifiably robust networks. However, due to the relaxation of the underlying problem, a wide gap remains between the optimal values of the original and relaxed problems (salman2019convex).
In this paper, we focus on improving the certified robustness of neural networks trained with convex relaxation bounds. To achieve this, we first give a more interpretable explanation for the bounds achieved in (weng2018towards; wong2018scaling). Namely, the constraints of the relaxed problem are defined by a simple linear network with adversaries injecting bounded perturbations to both the input of the network and the preactivations of intermediate layers. The optimal solution of the relaxed problem can be written as a forward pass of the clean image through the linear network, plus the cumulative adversarial effects of all the perturbations added to the linear transforms. This view makes it easier to identify the optimality conditions and serves as a bridge between the relaxed problem and the original nonconvex problem. We further identify conditions for the bound to be tight, and we propose two indicators for the gap between the original nonconvex problem and the relaxed problem. Adding the proposed indicators into the loss function results in classifiers with better certified accuracy.
2 Background and Related Work
Adversarial defenses roughly fall into two categories: heuristic defenses and verifiable defenses. The heuristic defenses either try to identify adversarial examples and remove adversarial perturbations from images, or make the network invariant to small perturbations through training (papernot2018deep; shan2019gotta; samangouei2018defense; hwang2019puvae). In addition, adversarial training uses adversarial examples as opposed to clean examples during training, so that the network can learn how to classify adversarial examples directly (madry2017towards; shafahi2019adversarial; zhang2019you).
In response, a line of work has proposed to verify the robustness of neural nets. Exact methods obtain the perturbation $\delta$ with minimum $\|\delta\|_p$ such that $f(x) \neq f(x + \delta)$, where $f$ is a classifier and $x$ is the data point. Nevertheless, the problem itself is NP-hard and the methods can hardly scale (cheng2017maximum; lomuscio2017approach; dutta2018output; fischetti2017deep; tjeng2017evaluating; scheibler2015towards; katz2017reluplex; carlini2017provably; ehlers2017formal).
A body of work focuses on relaxing the nonlinearities in the original problem into linear inequality constraints (singh2018fast; gehr2018ai2; zhang2018efficient; mirman2018differentiable), sometimes using the dual of the relaxed problem (wong2017provable; wong2018scaling; dvijotham2018dual). Recently, (salman2019convex) unified the primal and dual views into a common convex relaxation framework, and suggested there is an inherent gap between the actual and the lower bound of robustness given by verifiers based on LP relaxations, which they called a convex relaxation barrier.
Some defense approaches integrate the verification methods into the training of a network to minimize robust loss directly. (hein2017formal) uses a local Lipschitz regularization to improve certified robustness. In addition, a bound based on semidefinite programming (SDP) relaxation was developed and minimized as the objective (raghunathan2018certified). (wong2017provable) presents an upper bound on the robust loss caused by norm-bounded perturbation via LP relaxation, and minimizes this upper bound during training. (wong2018scaling) further extends this method to much more general network structures with skip connections and general nonlinearities, and provides a memory-friendly training strategy using random projections. Since LP relaxation is adopted, the aforementioned convex relaxation barrier exists for their methods.
While another line of work (IBP) has shown that an intuitively looser interval bound can be used to train much more robust networks than convex relaxation for large perturbations (gowal2018effectiveness; zhang2019towards), it is still important to study convex relaxation bounds, since they can provide better certificates against a broader class of adversaries that IBP struggles to certify in some cases, such as adversaries for convolutional networks. We discuss these motivations in more detail in Appendix F.
We seek to enforce the tightness of the convex relaxation certificate during training. We reduce the optimality gap between the original and the relaxed problem by using various tightness indicators as regularizers during training. Compared with previous approaches, we make the following contributions. First, based upon the same relaxation as in (weng2018towards), we illustrate a more intuitive view for the bounds on intermediate ReLU activations achieved by (wong2018scaling), which can be viewed as a linear network facing adversaries that add bounded perturbations to both the input and the intermediate layers. Second, starting from this view, we identify conditions where the bound from the relaxed problem is tight for the original nonconvex problem. Third, based on the conditions, we propose regularizers that encourage the bound to be tight for the obtained network, which improves the certificate on both MNIST and CIFAR-10.
3 Problem Formulation
In general, to train an adversarially robust network, we solve a constrained minimax problem where the adversary tries to maximize the loss given the norm constraint, and the parameters of the network are trained to minimize this maximal loss. Due to nonconvexity and the complexity of neural networks, it is expensive to solve the inner max problem exactly. To obtain certified robustness, like many related works (wong2018scaling; gowal2018effectiveness), we minimize an upper bound of the inner max problem, which is a cross entropy loss on the negation of the lower bounds of margins over each other class, as shown in Eq. 3 (Section 4.4). Without loss of generality, in this section we analyze the original and relaxed problems for minimizing the margin between the ground truth class and some other class under norm-bounded adversaries, which can be adapted directly to compute the loss in Eq. 3.
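In standard notation, the minimax problem and its certified surrogate described above can be sketched as follows. The symbols $\theta$ (network parameters), $f_\theta$ (the network), $\mathcal{D}$ (the data distribution), $\mathcal{L}_{\mathrm{CE}}$ (cross entropy) and $\underline{m}_\theta(x, y)$ (the vector of certified lower bounds on the margins of class $y$ over every other class, with a zero entry for class $y$) are assumptions introduced here for illustration and may differ from the notation used elsewhere in the paper.

```latex
\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}}
  \Big[ \max_{\|\delta\|_p \le \epsilon}
        \mathcal{L}_{\mathrm{CE}}\big(f_\theta(x+\delta),\, y\big) \Big]
\;\le\;
\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}}
  \Big[ \mathcal{L}_{\mathrm{CE}}\big(-\underline{m}_\theta(x, y),\, y\big) \Big]
```

The inequality holds because the cross entropy depends on the logits only through the margins and is decreasing in each of them, so replacing every margin by a valid lower bound can only increase the loss.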
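The original nonconvex constrained optimization problem for finding the $\ell_p$ norm-bounded adversary that minimizes the margin can be formulated as
$$\begin{aligned} \text{minimize} \quad & c_t^\top x_L \qquad\qquad (\mathcal{O}) \\ \text{subject to} \quad & z_1 \in \mathcal{B}_{p,\epsilon}(x), \quad x_i = f_i(z_i) \ \text{for } i = 1, \dots, L, \quad z_{i+1} = \sigma(x_i) \ \text{for } i = 1, \dots, L-1, \end{aligned}$$
where $\mathcal{B}_{p,\epsilon}(x) = \{z \mid \|z - x\|_p \le \epsilon\}$, $c_t = e_y - e_t$, $e_y$ and $e_t$ are one-hot vectors corresponding to the label $y$ and some other class $t$, $\sigma(\cdot)$ is the ReLU activation, and $f_i$ is one functional block of the neural network. This can be a linear layer ($f_i(z_i) = W_i z_i + b_i$), or even a residual block. We use $h_i(x)$ to denote the ReLU network up to the $i$-th layer, and $p^*_{\mathcal{O}}$ to denote the optimal value of $\mathcal{O}$.
3.1 Efficient Convex Relaxations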
Grouping of ReLU Activations
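The nonconvexity of the original problem $\mathcal{O}$ stems from the nonconvex feasible set given by the ReLU activations. Since the network is a continuous function, the preactivations $x_i$ have lower and upper bounds $\underline{x}_i$ and $\bar{x}_i$ when the input $z_1$ lies in the $\epsilon$-ball around $x$. If a certain preactivation $x_{ij}$ has $\underline{x}_{ij} < 0 < \bar{x}_{ij}$, its corresponding ReLU constraint $z_{i+1,j} = \sigma(x_{ij})$ gives rise to a nonconvex feasible set as shown in the left of Figure 1, making $\mathcal{O}$ a nonconvex optimization problem. On the other hand, if $\bar{x}_{ij} \le 0$ or $\underline{x}_{ij} \ge 0$, the constraints degenerate into the linear constraints $z_{i+1,j} = 0$ and $z_{i+1,j} = x_{ij}$ respectively, which do not affect convexity. Based on $\underline{x}_i$ and $\bar{x}_i$, we divide the ReLU activations into three disjoint subsets
$$\mathcal{I}_i^- = \{j \mid \bar{x}_{ij} \le 0\}, \quad \mathcal{I}_i^+ = \{j \mid \underline{x}_{ij} \ge 0\}, \quad \mathcal{I}_i = \{j \mid \underline{x}_{ij} < 0 < \bar{x}_{ij}\}. \qquad (1)$$
If $j \in \mathcal{I}_i$, we call the corresponding ReLU activation an unstable neuron.
Convex relaxation expands the nonconvex feasible sets into convex ones and solves a convex optimization problem $\mathcal{C}$. The feasible set of $\mathcal{O}$ is a subset of the feasible set of $\mathcal{C}$, so the optimal value of $\mathcal{C}$ lower bounds the optimal value of $\mathcal{O}$. Moreover, we want problem $\mathcal{C}$ to be solved efficiently, preferably with a closed-form solution, so that it can be integrated into the training process.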
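The grouping above can be illustrated with a minimal NumPy sketch for the first linear layer under an $\ell_\infty$ perturbation, where the preactivation bounds follow from simple interval arithmetic. The function names and the toy weights are hypothetical and chosen only for illustration; deeper layers need the bounds described in the following subsections.

```python
import numpy as np

def preactivation_bounds(W, b, x, eps):
    """Elementwise bounds of W z + b over all z with ||z - x||_inf <= eps."""
    center = W @ x + b
    radius = eps * np.abs(W).sum(axis=1)  # worst-case shift of each output row
    return center - radius, center + radius

def group_neurons(lower, upper):
    """Split neuron indices into always-inactive, always-active and unstable sets."""
    inactive = np.flatnonzero(upper <= 0)
    active = np.flatnonzero(lower >= 0)
    unstable = np.flatnonzero((lower < 0) & (upper > 0))
    return inactive, active, unstable

# Toy layer with three neurons (illustrative values only).
W = np.array([[1.0, -1.0], [2.0, 1.0], [-1.0, -1.0]])
b = np.array([0.0, 3.0, -3.0])
x = np.array([0.5, 0.5])

lower, upper = preactivation_bounds(W, b, x, eps=0.1)
inactive, active, unstable = group_neurons(lower, upper)
print(inactive, active, unstable)  # [2] [1] [0]
```

Only the unstable group contributes nonconvex constraints; the other two groups behave like fixed linear maps inside the perturbation ball.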
Computational Challenge for the “optimal” Relaxation
As pointed out by (salman2019convex), the optimal layerwise convex relaxation, i.e., the optimal convex relaxation for the nonlinear constraint of a single layer, can be obtained independently for each neuron. For each unstable neuron $j$ of layer $i$ in a ReLU network, with preactivation bounds $\underline{x}_{ij} < 0 < \bar{x}_{ij}$, the optimal layerwise convex relaxation is the closed convex hull of $\{(x_{ij}, \sigma(x_{ij})) \mid \underline{x}_{ij} \le x_{ij} \le \bar{x}_{ij}\}$, which is just $\{(x_{ij}, z_{i+1,j}) \mid z_{i+1,j} \ge 0,\ z_{i+1,j} \ge x_{ij},\ z_{i+1,j} \le \frac{\bar{x}_{ij}}{\bar{x}_{ij} - \underline{x}_{ij}} (x_{ij} - \underline{x}_{ij})\}$, corresponding to the triangle region in the middle of Figure 1. Despite being relatively tight, this relaxed problem has no closed-form solution. LP solvers are typically adopted to solve a linear programming problem for each neuron. Therefore, such a relaxation hardly scales to larger networks without additional tricks (like (xiao2018training)). (weng2018towards) find it to be 34 to 1523 times slower than FastLin, and it has difficulty verifying MLPs with more than 3 layers on MNIST. In (salman2019convex), it takes 10,000 CPU cores to parallelize the LP solvers for bounding the activations of every neuron in a two-hidden-layer MLP with 100 neurons per layer. Since solving LP problems for all neurons is usually impractical, it is even more difficult to optimize the network to maximize the lower bounds of margin found by solving this relaxation problem, as differentiating through the LP optimization process is even more expensive.
Computationally Efficient Relaxations
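In the layerwise convex relaxation, instead of using a boundary nonlinear in $x_i$, (zhang2018efficient) has shown that for any nonlinearity, when both the lower and upper boundaries are linear in $x_i$, there exist closed-form solutions to the relaxed problem, which avoids using LP solvers and improves efficiency. Specifically, the following relaxation of $\mathcal{O}$ has closed-form solutions:
$$\begin{aligned} \text{minimize} \quad & c_t^\top x_L \qquad\qquad (\mathcal{C}) \\ \text{subject to} \quad & \|z_1 - x\|_p \le \epsilon, \quad x_i = W_i z_i + b_i \ \text{for } i = 1, \dots, L, \\ & \underline{a}_i \odot x_i + \underline{b}_i \le z_{i+1} \le \bar{a}_i \odot x_i + \bar{b}_i \ \text{for } i = 1, \dots, L-1, \end{aligned}$$
where $\odot$ denotes elementwise product, $\underline{a}_i, \bar{a}_i$ and $\underline{b}_i, \bar{b}_i$ are the slopes and intercepts of the linear lower and upper boundaries, and for simplicity, we have only considered networks with no skip connections, and represent both fully connected and convolutional layers as a linear transform $W_i z_i + b_i$.
Before we can solve the relaxed problem $\mathcal{C}$ to get the lower bound of margin, we need to know the range $[\underline{x}_i, \bar{x}_i]$ of the preactivations $x_i$. As in (wong2017provable; weng2018towards; zhang2018efficient), we can solve the same optimization problem for each neuron starting from layer 1 to $L$, by replacing $c_t$ with $e_j$ or $-e_j$ for $\underline{x}_{ij}$ or $\bar{x}_{ij}$ respectively (for $\bar{x}_{ij}$, take an extra negation on the solution).
The most efficient approach in this category is FastLin (weng2018towards), which sets $\underline{a}_{ij} = \bar{a}_{ij} = \frac{\bar{x}_{ij}}{\bar{x}_{ij} - \underline{x}_{ij}}$, $\underline{b}_{ij} = 0$ and $\bar{b}_{ij} = -\frac{\bar{x}_{ij}\underline{x}_{ij}}{\bar{x}_{ij} - \underline{x}_{ij}}$ for every unstable neuron, as shown in the right of Figure 1. A tighter choice is CROWN (zhang2018efficient), which chooses different $\underline{a}_i$ and $\bar{a}_i$ such that the convex feasible set is minimized. However, CROWN has much higher complexity than FastLin due to its varying slopes. We give detailed analysis of the closed-form solutions of both bounds and their complexities in Appendix D. Recently, CROWN-IBP (zhang2019towards) has been proposed to provide a better initialization to IBP, which uses IBP to estimate the preactivation range for CROWN. In this case, both CROWN and FastLin have the same complexity and CROWN is a better choice.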
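As a small numerical illustration of the FastLin relaxation, the sketch below builds the two parallel lines for unstable neurons with preactivation bounds `lower < 0 < upper` and checks on a grid that ReLU is sandwiched between them. The slope `upper / (upper - lower)` is the standard FastLin choice; the function and variable names are hypothetical.

```python
import numpy as np

def fastlin_lines(lower, upper):
    """Shared slope and largest intercept for unstable neurons (lower < 0 < upper)."""
    slope = upper / (upper - lower)
    max_intercept = -upper * lower / (upper - lower)
    return slope, max_intercept

# Two unstable neurons with different preactivation ranges (illustrative values).
lower = np.array([-1.0, -0.5])
upper = np.array([3.0, 0.5])
slope, max_intercept = fastlin_lines(lower, upper)

# Evaluate ReLU and both boundary lines on a grid covering each range.
xs = np.linspace(lower, upper, 101)       # shape (101, 2), one column per neuron
relu = np.maximum(xs, 0.0)
lower_line = slope * xs                   # intercept 0
upper_line = slope * xs + max_intercept   # passes through (lower, 0) and (upper, upper)

sandwiched = bool(np.all(lower_line <= relu + 1e-12) and np.all(relu <= upper_line + 1e-12))
print(slope, max_intercept, sandwiched)
```

The upper line touches ReLU at both ends of the range and the lower line touches it only at the origin, which is the geometric fact the tightness conditions of Section 4 build on.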
4 Tighter Bounds via Regularization
Despite being relatively efficient to compute, FastLin and CROWN are not even the tightest layerwise convex relaxation. Using tighter bounds to train the networks could potentially lead to higher certified robustness by preventing such bounds from over-regularizing the networks.
Nevertheless, there exist certain parameters and inputs such that the seemingly looser FastLin is tight for the original problem $\mathcal{O}$, i.e., the optimal value of FastLin is the same as the optimal value of $\mathcal{O}$. The immediate trivial case one can think of is where no unstable neuron exists for the samples inside the allowed perturbation interval.
In fact, even when unstable neurons exist, the optimal solution to the relaxed problem can still be a feasible solution to the original nonconvex problem for a significant portion of input samples. We give an illustrative example where FastLin is tight for a significant portion of the samples even when unstable neurons exist, as shown in Figure 2. In this figure, FastLin is tight at every sample of the region and network described in Appendix E.
It is therefore interesting to check the conditions for FastLin or CROWN to be tight for $\mathcal{O}$, and to enforce such conditions during training so that the network can be better verified by efficient verifiers like FastLin, CROWN, and even IBP.
4.1 Conditions for Tightness
Here we look into conditions that make the optimal value $p^*_{\mathcal{C}}$ of the convex problem $\mathcal{C}$ equal to $p^*_{\mathcal{O}}$, the optimal value of the original problem $\mathcal{O}$. Let $\{x_i, z_i\}$ be some feasible solution of $\mathcal{C}$, from which the objective value of $\mathcal{C}$ can be determined as $p_{\mathcal{C}} = c_t^\top x_L$. Let $\{x'_i, z'_i\}$ be some feasible solution of $\mathcal{O}$ computed by passing $z_1$ through the ReLU sub-networks defined in $\mathcal{O}$, and denote the resulting feasible objective value as $p'_{\mathcal{O}} = c_t^\top x'_L$.
Generally, for a given network with the set of weights $W$, as long as the optimal solution of $\mathcal{C}$ is equal to a feasible solution of $\mathcal{O}$, we will have $p^*_{\mathcal{C}} = p^*_{\mathcal{O}}$, since any feasible solution of $\mathcal{O}$ satisfies $p'_{\mathcal{O}} \ge p^*_{\mathcal{O}}$, and by the nature of relaxation $p^*_{\mathcal{C}} \le p^*_{\mathcal{O}}$.
Therefore, for a given network and input $x$, to check the tightness of the convex relaxation, we can check whether its optimal solution is feasible for $\mathcal{O}$. This can be achieved by passing the optimal input $z^*_1$ of $\mathcal{C}$ through the ReLU network, and either directly checking the resultant objective value $p'_{\mathcal{O}}$, or comparing the optimal solution of $\mathcal{C}$ with the resultant feasible solution $\{x'_i, z'_i\}$. Further, we can encourage such conditions to happen during the training process to improve the tightness of the bound. Based on such mechanisms, we propose two regularizers to enforce the tightness. Notice that such regularizers are different from the RS Loss (xiao2018training) introduced to reduce the number of unstable neurons, since we have shown in Appendix E that $\mathcal{C}$ can be tight even when unstable neurons exist.
4.2 An Intuitive Indicator of Tightness: Difference in Output Bounds
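The observation above motivates us to consider the nonnegative value
$$d = p'_{\mathcal{O}} - p^*_{\mathcal{C}} \qquad (2)$$
as an indicator of the difference between $p^*_{\mathcal{O}}$ and $p^*_{\mathcal{C}}$, the optimal values of the original and the relaxed problem, where $p'_{\mathcal{O}}$ is the margin over class $t$ computed by passing the optimal perturbation $\delta^*_0$ for $\mathcal{C}$ through the original network. $\delta^*_0$ can be computed efficiently from the optimality condition of FastLin or CROWN, as demonstrated in Eq. 8. For example, writing $g$ for the coefficient vector that multiplies the input in the closed-form relaxed objective, when $p = \infty$, the optimal input perturbation of $\mathcal{C}$ is $\delta^*_0 = -\epsilon\,\mathrm{sign}(g)$, which corresponds to sending $x - \epsilon\,\mathrm{sign}(g)$ through the ReLU network; when $p = 2$, $\delta^*_0 = -\epsilon\, g / \|g\|_2$, which corresponds to sending $x - \epsilon\, g / \|g\|_2$.
The larger $d$ is, the more relaxed $\mathcal{C}$ is, and the larger the gap $p^*_{\mathcal{O}} - p^*_{\mathcal{C}}$ could be. Therefore, we can regularize the network to minimize $d$ during training and maximize the lower bound of the margin $p^*_{\mathcal{C}}$, so that we can obtain a network where $p^*_{\mathcal{C}}$ is a better estimate of $p^*_{\mathcal{O}}$ and the robustness is better represented by $p^*_{\mathcal{C}}$. Such an indicator avoids comparing the intermediate variables, which gives more flexibility for adjustment. It bears some similarities to knowledge distillation (hinton2015distilling), in that it encourages learning a network whose relaxed lower bound gives outputs similar to those of the corresponding ReLU network. It is worth noting that minimizing $d$ does not necessarily lead to decreasing $p'_{\mathcal{O}}$ or increasing $p^*_{\mathcal{C}}$. In fact, both $p'_{\mathcal{O}}$ and $p^*_{\mathcal{C}}$ can be increased or decreased at the same time with their difference decreasing.
The tightest indicator should give the minimum gap $p^*_{\mathcal{O}} - p^*_{\mathcal{C}}$, where we need to find the optimal perturbation for $\mathcal{O}$. However, the minimum gap cannot be found in polynomial time, due to the nonconvex nature of $\mathcal{O}$. (weng2018towards) also proved that there is no polynomial time algorithm to find the minimum $\ell_1$ norm adversarial distortion with a $0.99 \ln n$ approximation ratio unless NP=P, where $n$ is the number of neurons, a problem equivalent to finding the minimum margin here.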
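To make the output-bound indicator concrete, here is a minimal NumPy sketch for a network with one hidden ReLU layer under an $\ell_\infty$ perturbation, using the FastLin relaxation. It computes the relaxed lower bound in closed form, sends the relaxed problem's worst-case input through the real ReLU network, and returns the difference. All names (`fastlin_gap`, `W1`, `c`, and so on) are hypothetical, and the sketch is an illustration under these simplifying assumptions rather than the implementation used in the experiments.

```python
import numpy as np

def fastlin_gap(W1, b1, W2, b2, c, x, eps):
    """Return (relaxed bound, ReLU margin at the relaxed optimum, their difference)."""
    # Interval bounds of the hidden preactivations over the l_inf ball.
    center = W1 @ x + b1
    radius = eps * np.abs(W1).sum(axis=1)
    lower, upper = center - radius, center + radius

    # FastLin: z = slope * x1 + delta with 0 <= delta <= max_delta.
    unstable = (lower < 0) & (upper > 0)
    slope = (lower >= 0).astype(float)  # 1 if always active, 0 if always inactive
    max_delta = np.zeros_like(center)
    slope[unstable] = upper[unstable] / (upper[unstable] - lower[unstable])
    max_delta[unstable] = (-upper[unstable] * lower[unstable]
                           / (upper[unstable] - lower[unstable]))

    v = W2.T @ c              # coefficient of each hidden activation in the margin
    g = W1.T @ (slope * v)    # coefficient of the input in the relaxed margin

    delta0 = -eps * np.sign(g)                 # worst-case input perturbation
    delta1 = np.where(v < 0, max_delta, 0.0)   # worst-case intercepts
    p_relaxed = v @ (slope * (W1 @ (x + delta0) + b1) + delta1) + c @ b2

    # Margin of the actual ReLU network at the same perturbed input.
    hidden = np.maximum(W1 @ (x + delta0) + b1, 0.0)
    p_relu = c @ (W2 @ hidden + b2)
    return p_relaxed, p_relu, p_relu - p_relaxed

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=3)
c = np.array([1.0, -1.0, 0.0])  # e_y - e_t for y = 0, t = 1
x = rng.normal(size=4)
p_relaxed, p_relu, gap = fastlin_gap(W1, b1, W2, b2, c, x, eps=0.1)
print(p_relaxed, p_relu, gap)
```

The returned difference is nonnegative because the perturbed input is feasible for the original problem, and it vanishes when no neuron is unstable (for example, when `eps` is zero).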
4.3 A Better Indicator for Regularization: Difference in Optimal Preactivations
Despite being intuitive and able to achieve improvements, Eq. 2, which enforces similarity between objective values, does not work as well in practice as enforcing similarity between the solution $\{x_i, z_i\}$ of the relaxed problem and the solution $\{x'_i, z'_i\}$ of the ReLU network, an approach we elaborate below. For both CROWN and FastLin, unless $d = 0$, the relaxed solution may deviate a lot from $\{x'_i, z'_i\}$ and does not correspond to any ReLU network, even if $d$ may seem small. For example, it is possible that $z_{i+1,j} > 0$ for a given $x_{ij} < 0$, but a ReLU network will always have $z_{i+1,j} = 0$ in that case.
We find an alternative regularizer more effective at improving verifiable accuracy. The regularizer encourages the feasible solution of $\mathcal{O}$ to exactly match the optimal solution of $\mathcal{C}$. Since we are adopting the layerwise convex relaxation, the optimal solutions of the unstable neurons can be considered independently.
Here we derive a sufficient condition for tightness for FastLin, which also serves as a sufficient condition for CROWN. For linear programming, the optimal solution occurs on the boundaries of the feasible set. Since FastLin is a layerwise convex relaxation, the solution to each of its neurons in the set of unstable neurons $\mathcal{I}_i$ can be considered independently, and therefore for a specific layer $i$ and $j \in \mathcal{I}_i$, the pair of optimal solutions $(x_{ij}, z_{i+1,j})$ should occur on the boundary in the right of Figure 1. It follows that the only 3 optimal solutions of $\mathcal{C}$ that are also feasible for $\mathcal{O}$ are $(\underline{x}_{ij}, 0)$, $(0, 0)$ and $(\bar{x}_{ij}, \bar{x}_{ij})$. Notice they are also in the intersection between the boundary of CROWN and the feasible set of $\mathcal{O}$.
In practice, out of efficiency concerns, both FastLin and CROWN identify the boundaries that the optimal solution lies on and compute the optimal value by accumulating the contribution of each layer in a backward pass, without explicitly computing $x_i$ for each layer with a forward pass (see Appendix D for more details). It is therefore beneficial to link the feasible solutions of $\mathcal{O}$ to the parameters of the boundaries. Specifically, let $\delta_{ij}$ be the intercept of the line that the optimal solution lies on. We want to find a rule based on $\delta_{ij}$ to determine whether the bound is tight from the values of $x'_{ij}$. For both FastLin and CROWN, $0 \le \delta_{ij} \le -\frac{\bar{x}_{ij}\underline{x}_{ij}}{\bar{x}_{ij} - \underline{x}_{ij}}$. For FastLin, when $\delta_{ij}$ takes its maximum value, only $x_{ij} = \underline{x}_{ij}$ or $x_{ij} = \bar{x}_{ij}$ are feasible for $\mathcal{O}$; when $\delta_{ij} = 0$, only $x_{ij} = 0$ is feasible for $\mathcal{O}$. Meanwhile, $\delta_{ij}$ is deterministic once the network parameters and the input are given. Therefore, when the bound is tight for FastLin, if $\delta_{ij} = 0$, then $x'_{ij} = 0$. Otherwise, if $\delta_{ij}$ is at its maximum, then $x'_{ij} = \underline{x}_{ij}$ or $x'_{ij} = \bar{x}_{ij}$. For CROWN, this condition is also feasible, though it could be either $x'_{ij} \le 0$ or $x'_{ij} \ge 0$ when $\delta_{ij} = 0$, depending on the optimal slope $\underline{a}_{ij}$.
Indeed, we achieve optimal tightness ($p^*_{\mathcal{C}} = p^*_{\mathcal{O}}$) for both FastLin and CROWN if $x'_{ij}$ satisfies these conditions at all unstable neurons. Specifically,
Proposition 1.
We provide the proof of this simple proposition in the Appendix.
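It remains to be discussed how to best enforce the similarity between the optimal solutions of $\mathcal{O}$ and FastLin or CROWN. Like before, we choose to enforce the similarity between $x'$ and the closest optimal solution of FastLin, where $x'$ is constructed by setting $z'_1 = z^*_1$, the optimal input of the relaxed problem, and passing it through the ReLU network to obtain $x'_i$. By Proposition 1, the distance can be computed by considering the values of the intercepts as
$$r = \sum_{i} \Big( \sum_{j \in \mathcal{I}_i,\ \delta_{ij} = 0} |x'_{ij}| + \sum_{j \in \mathcal{I}_i,\ \delta_{ij} = \bar{\delta}_{ij}} \min\big(|x'_{ij} - \underline{x}_{ij}|,\ |x'_{ij} - \bar{x}_{ij}|\big) \Big),$$
where $\mathcal{I}_i$ is the set of unstable neurons of layer $i$, $\bar{\delta}_{ij} = -\frac{\bar{x}_{ij}\underline{x}_{ij}}{\bar{x}_{ij} - \underline{x}_{ij}}$ is the largest intercept, the first term corresponds to $\delta_{ij} = 0$ and the condition $x'_{ij} = 0$, and the second term corresponds to $\delta_{ij} = \bar{\delta}_{ij}$ and the condition $x'_{ij} = \underline{x}_{ij}$ or $x'_{ij} = \bar{x}_{ij}$. To minimize the second term, the original ReLU network only needs to be optimized towards the nearest feasible optimal solution. It is easy to see from Proposition 1 that if $r = 0$, then $p^*_{\mathcal{O}} = p^*_{\mathcal{C}}$, where $\mathcal{C}$ could be either FastLin or CROWN.
Compared with $d$, $r$ puts more constraints on the parameters $W$, since it requires all unstable neurons of the ReLU network to match the optimal solutions of FastLin, instead of only matching the objective values $p'_{\mathcal{O}}$ and $p^*_{\mathcal{C}}$. In this way, it provides stronger guidance towards a network whose optimal solution for $\mathcal{O}$ and FastLin or CROWN agree. However, again, this is not equivalent to trying to kill all unstable neurons, since FastLin can be tight even when unstable neurons exist.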
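The preactivation-based penalty described in this subsection can be sketched for a single layer as follows. This is one reading of the text, not the exact formula used in the experiments: any normalization is omitted, and the argument `v` is a hypothetical name for the coefficient that multiplies each neuron's intercept in the relaxed objective, so that a negative coefficient places the relaxed optimum on the upper FastLin line and a nonnegative one on the lower line (ties at zero are assigned to the lower line by assumption).

```python
import numpy as np

def preactivation_penalty(x_relu, lower, upper, v):
    """Sum over unstable neurons of the distance to the nearest tight preactivation.

    x_relu: preactivations of the ReLU network at the relaxed problem's optimal input
    lower, upper: preactivation bounds over the perturbation ball
    v: coefficient of each neuron's intercept in the relaxed objective (assumed given)
    """
    unstable = (lower < 0) & (upper > 0)
    on_upper_line = unstable & (v < 0)    # intercept at its maximum
    on_lower_line = unstable & (v >= 0)   # intercept equal to zero

    dist_to_zero = np.abs(x_relu)         # lower line is tight only at 0
    dist_to_ends = np.minimum(np.abs(x_relu - lower), np.abs(x_relu - upper))
    return dist_to_zero[on_lower_line].sum() + dist_to_ends[on_upper_line].sum()

# Three neurons: two unstable, one always active (illustrative values).
x_relu = np.array([0.1, -0.9, 2.0])
lower = np.array([-1.0, -1.0, 1.0])
upper = np.array([1.0, 1.0, 3.0])
v = np.array([1.0, -1.0, -1.0])
penalty = preactivation_penalty(x_relu, lower, upper, v)
print(penalty)
```

Stable neurons contribute nothing, which reflects the point made above that the penalty targets agreement between the two solutions rather than the removal of unstable neurons.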
4.4 Certified Robust Training in Practice
In practice, for classification problems with more than two classes, we will compute the lower bound of the margins with respect to multiple classes. Denote $p^*_{\mathcal{C}}$ and $p'_{\mathcal{O}}$ as the concatenated vectors of lower bounds of the relaxed problem and the original problem for multiple classes, and $d_t$, $r_t$ as the regularizers for the margins with respect to class $t$.
Together with the regularizers, we optimize the following objective
$$\underset{W}{\text{minimize}} \quad L_{CE}(-p^*_{\mathcal{C}}, y) + \lambda \sum_t d_t + \gamma \sum_t r_t, \qquad (3)$$
where $L_{CE}(\cdot, y)$ is the cross entropy loss with label $y$, as adopted by many related works (wong2018scaling; gowal2018effectiveness), $\lambda$ and $\gamma$ are the regularizer weights, and we have implicitly abbreviated the inner maximization problem w.r.t. the perturbation into the optimal values $p^*_{\mathcal{C}}$ and the corresponding optimal solution. More details for computing the intermediate and output bounds can be found in Algorithm 1, which uses row-wise norms and column indexing.
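A minimal sketch of how such an objective could be assembled for one sample is given below. It assumes that the vector of margin lower bounds and the two regularizer values have already been computed, and that the regularizers enter as a weighted sum; the names `certified_loss`, `lam` and `gamma` are hypothetical.

```python
import numpy as np

def cross_entropy(logits, label):
    """Numerically stable cross entropy for a single sample."""
    shifted = logits - logits.max()
    return -(shifted[label] - np.log(np.exp(shifted).sum()))

def certified_loss(margin_lower_bounds, label, d_reg, r_reg, lam, gamma):
    """Cross entropy on negated margin lower bounds plus weighted regularizers.

    margin_lower_bounds[t] is a certified lower bound on the margin of the true
    class over class t, with a zero entry at t == label.
    """
    robust_ce = cross_entropy(-margin_lower_bounds, label)
    return robust_ce + lam * d_reg + gamma * r_reg

# Large certified margins give a small robust loss; negative ones a large loss.
well_certified = np.array([0.0, 5.0, 5.0])
not_certified = np.array([0.0, -1.0, 2.0])
loss_good = certified_loss(well_certified, 0, d_reg=0.2, r_reg=0.1, lam=0.0, gamma=0.0)
loss_bad = certified_loss(not_certified, 0, d_reg=0.2, r_reg=0.1, lam=0.0, gamma=0.0)
loss_reg = certified_loss(well_certified, 0, d_reg=0.2, r_reg=0.1, lam=1.0, gamma=2.0)
print(loss_good, loss_bad, loss_reg)
```

Setting both weights to zero recovers plain certified training on the relaxation bound, which matches the baseline configuration used in the experiments.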
One major challenge of the convex relaxation approach is the high memory consumption. To compute the preactivation bounds $\underline{x}_i$ and $\bar{x}_i$, we need to pass an identity matrix with the same number of diagonal entries as the total dimensions of the input images, which can make the batch size thousands of times larger than usual. To mitigate this, one can adopt the random projection from (wong2018scaling), which projects identity matrices into lower dimensions to estimate the norms required by the bounds. Such projections add noise/variance to the estimated bounds, and the regularizers are affected as well.
5 Experiments
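| Dataset | Model | Base Method | $\epsilon$ | $\lambda$ | $\gamma$ | Rob. Err | PGD Err | Std. Err |
|---|---|---|---|---|---|---|---|---|
| MNIST | 2x100, Exact | CP | 0.1 | 0 | 0 | 14.85% | 10.9% | 3.65% |
| MNIST | 2x100, Exact | CP | 0.1 | 2e-3 | 1 | 13.32% | 10.9% | 4.73% |
| MNIST | Small, Exact | CP | 0.1 | 0 | 0 | 4.47% | 2.4% | 1.19% |
| MNIST | Small, Exact | CP | 0.1 | 5e-3 | 5e-1 | 3.65% | 2.2% | 1.09% |
| MNIST | - | Best of PV | 0.1 | - | - | 4.44% | 2.87% | 1.20% |
| MNIST | - | Best of RS | 0.1 | - | - | 4.40% | 3.42% | 1.05% |
| MNIST | Small | CP | 0.1 | 0 | 0 | 4.47% | 3.3% | 1.19% |
| MNIST | Small | CP | 0.1 | 0 | 5e-1 | 4.32% | 3.4% | 1.51% |
| MNIST | Large | DAI | 0.1 | - | - | 3.4% | 2.4% | 1.0% |
| MNIST | 2x100, Exact | CP | 0.3 | 0 | 0 | 61.39% | 49.4% | 33.16% |
| MNIST | 2x100, Exact | CP | 0.3 | 5e-3 | 5e-3 | 56.05% | 44.3% | 26.10% |
| MNIST | Small, Exact | CP | 0.3 | 0 | 0 | 31.25% | 15.0% | 7.88% |
| MNIST | Small, Exact | CP | 0.3 | 5e-3 | 5e-1 | 29.65% | 13.7% | 7.28% |
| MNIST | Small | CP | 0.3 | 0 | 0 | 42.7% | 26.0% | 15.93% |
| MNIST | Small | CP | 0.3 | 2e-3 | 2e-1 | 41.36% | 24.0% | 14.29% |
| MNIST | XLarge | IBP | 0.3 | - | - | 8.05% | 6.12% | 1.66% |
| MNIST | XLarge | CROWN-IBP | 0.3 | - | - | 7.01% | 5.88% | 1.88% |
| MNIST | XLarge | CROWN-IBP | 0.3 | 0 | 5e-1 | 6.64% | - | 1.76% |
| CIFAR-10 | Small | CP | 2/255 | 0 | 0 | 53.19% | 48.0% | 38.19% |
| CIFAR-10 | Small | CP | 2/255 | 5e-3 | 5e-1 | 51.52% | 47.0% | 37.30% |
| CIFAR-10 | Large | DAI | 2/255 | - | - | 61.4% | 55.6% | 55.0% |
| CIFAR-10 | XLarge | IBP | 2/255 | - | - | 49.98% | 45.09% | 29.84% |
| CIFAR-10 | XLarge | CROWN-IBP | 2/255 | - | - | 46.03% | 40.28% | 28.48% |
| CIFAR-10 | Large | CP | 2/255 | 0 | 0 | 45.78% | 38.5% | 29.42% |
| CIFAR-10 | Large | CP | 2/255 | 5e-3 | 5e-1 | 45.19% | 38.8% | 29.76% |
| CIFAR-10 | Small | CP | 8/255 | 0 | 0 | 75.45% | 68.3% | 62.79% |
| CIFAR-10 | Small | CP | 8/255 | 1e-3 | 1e-1 | 74.70% | 67.9% | 62.50% |
| CIFAR-10 | Large | CP | 8/255 | 0 | 0 | 74.04% | 68.8% | 59.73% |
| CIFAR-10 | Large | CP | 8/255 | 5e-3 | 5e-1 | 73.74% | 68.5% | 59.82% |
| CIFAR-10 | XLarge | IBP | 8/255 | - | - | 67.96% | 65.23% | 50.51% |
| CIFAR-10 | XLarge | CROWN-IBP | 8/255 | - | - | 66.94% | 65.42% | 54.02% |
| CIFAR-10 | XLarge | CROWN-IBP | 8/255 | 0 | 5e-1 | 66.64% | - | 53.78% |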
We evaluate the proposed regularizers on two datasets (MNIST and CIFAR-10) with two different $\epsilon$ each. We consider only $\ell_\infty$ adversaries. Our implementation is based on the code released by (wong2018scaling) for Convex Outer Adversarial Polytope (CP), and (zhang2019towards) for CROWN-IBP, so when $\lambda = \gamma = 0$, we obtain the same results as CP or CROWN-IBP. We use up to 4 GTX 1080Ti or 2080Ti GPUs for all our experiments.
Architectures:
We experiment with a variety of different network structures, including an MLP (2x100) with two 100-neuron hidden layers as in (salman2019convex), two conv nets (Small and Large) that are the same as in (wong2018scaling), a family of 10 small conv nets and a family of 8 larger conv nets, all the same as in (zhang2019towards), and also the same 5-layer convolutional network (XLarge) as in the latest version of CROWN-IBP (zhang2019towards).
The Small convnet has two convolutional layers with 16 and 32 output channels and two FC layers with 100 hidden neurons. The Large convnet has four conv layers with 32, 32, 64 and 64 output channels, plus three FC layers of 512 neurons. The XLarge convnet has five conv layers with 64, 64, 128, 128 and 128 output channels, with two FC layers of 512 neurons.
Hyperparameters:
For experiments on CP, we use Adam (kingma2014adam) with a fixed base learning rate and no weight decay. Like (wong2018scaling), we train the models for 80 epochs, where in the first 20 epochs the learning rate is fixed but $\epsilon$ increases from 0.01/0.001 to its maximum value for MNIST/CIFAR-10, and in the following epochs, we halve the learning rate every 10 epochs. Unless labelled with “Exact” in the model names of Table 1, we use random projection as in (wong2018scaling) for CP experiments to reduce the memory consumption. Due to the noisy estimation of the optimal solutions from these random projections, we also adopt a warm-up schedule for the regularizers in all CP experiments to prevent overregularization, where the regularizer weights increase from 0 to the preset values in the first 20 epochs.
For CROWN-IBP, we use the updated expensive training schedule of (zhang2019towards), which uses 200 epochs with batch size 256 for MNIST and 3200 epochs with batch size 1024 for CIFAR-10. We also use the aforementioned warm-up schedule for the regularizer weights.
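The CP training schedule described above can be sketched as three small functions. The linear shape of the ramps, the choice that the first halving takes effect 10 epochs after the ramp ends, and the parameter names (including the placeholder `base_lr`, whose value is not stated here) are assumptions made for illustration.

```python
def epsilon_at(epoch, eps_start, eps_max, ramp_epochs=20):
    """Perturbation radius: linear ramp from eps_start to eps_max, then constant."""
    if epoch >= ramp_epochs:
        return eps_max
    return eps_start + (eps_max - eps_start) * epoch / ramp_epochs

def learning_rate_at(epoch, base_lr, ramp_epochs=20, halve_every=10):
    """Learning rate: fixed during the ramp, then halved every `halve_every` epochs."""
    if epoch < ramp_epochs:
        return base_lr
    return base_lr * 0.5 ** ((epoch - ramp_epochs) // halve_every)

def regularizer_weight_at(epoch, target, warmup_epochs=20):
    """Regularizer weight: linear warm-up from 0 to its preset value."""
    return target * min(1.0, epoch / warmup_epochs)

# MNIST-style setting: epsilon ramps from 0.01 to 0.1 over the first 20 of 80 epochs.
for epoch in (0, 10, 20, 30, 79):
    print(epoch,
          epsilon_at(epoch, 0.01, 0.1),
          learning_rate_at(epoch, base_lr=1.0),
          regularizer_weight_at(epoch, target=0.5))
```

Ramping the regularizer weight together with the radius keeps the noisy early estimates from dominating the loss, which is the motivation given above for the warm-up.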
Method  error  model A  model B  model C  model D  model E  model F  model G  model H  model I  model J 

Copied  std. (%)  
verified (%)  
Baseline  std. (%)  
verified (%)  
With  std. (%)  
verified (%) 
Table 2: Mean and standard deviation of the family of 10 small models on MNIST. Here we use a cheaper training schedule with a total of 100 epochs, all in the same setting as the IBP baseline results of (zhang2019towards). Baseline is CROWN-IBP with epoch=140 and lr_decay_step=20. Like in CROWN-IBP, we run each model 5 times to compute the mean and standard deviation. “Copied” are results from (zhang2019towards).
5.1 Improving Convex Outer Adversarial Polytope
Table 1 shows comparisons with various approaches. All of our baseline implementations of CP have already improved upon (wong2018scaling). After adding the proposed regularizers, the certified robust accuracy is further improved upon our baseline in all cases. We also provide results against a 100-step PGD adversary for our CP models. Since both PGD errors and standard errors are reduced in most cases, the regularizers appear to improve not only the certified upper bound but also the actual robust error.
Despite the fact that we start from a stronger baseline, the relative improvements on 2x100 with our regularizer (10.3%/8.7%) are comparable to the improvements (5.9%/10.0%) under the same setting from (salman2019convex), which solves for the lower and upper bounds of all intermediate layers via the tightest layerwise LP relaxation (Figure 1). This indicates that the improvement brought by using our regularizer during training and efficient verifiers (FastLin in this case) for verification is comparable with that of using the expensive and unstable optimal layerwise convex relaxation.
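The relative improvements quoted for the 2x100 model follow directly from the certified robust errors in Table 1, as the short calculation below shows. The helper name `relative_improvement` is hypothetical.

```python
def relative_improvement(baseline_err, new_err):
    """Fraction of the baseline error removed by the new method."""
    return (baseline_err - new_err) / baseline_err

# Certified robust errors (%) of the 2x100 Exact model on MNIST from Table 1:
# (baseline, with regularizers) for each perturbation radius.
rows = {"eps=0.1": (14.85, 13.32), "eps=0.3": (61.39, 56.05)}

for name, (base, reg) in rows.items():
    print(f"{name}: {100 * relative_improvement(base, reg):.1f}%")
# eps=0.1: 10.3%
# eps=0.3: 8.7%
```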
Our results with Small are better than the best results of (dvijotham2018training; xiao2018training) on MNIST with $\epsilon = 0.1$, though not as good as the best of (mirman2018differentiable), which uses a larger model. When applying the same model on CIFAR-10, we achieve better robust error than (mirman2018differentiable).
The relative improvements in certified robust error for $\epsilon = 0.1$ and 0.3 are 18.3%/5.1% for the Small exact model on MNIST, compared with 3.4%/3.1% for the random projection counterparts (computed from Table 1). This is mainly because in the exact models, we have better estimates of the bounds and the optimal solutions. Still, these consistent improvements validate that our proposed regularizers improve the performance.
5.2 Improving CROWNIBP
In its first stage of training, CROWN-IBP (zhang2019towards) trains the network with CROWN (zhang2018efficient) to compute the bounds of the final outputs based on the interval bounds of intermediate activations, and in its second stage, CROWN-IBP uses only IBP. We apply our regularizer to the first stage of CROWN-IBP, using interval bounds to over-approximate the preactivation bounds required by our second regularizer on the optimal preactivations, to obtain a better initialization for its second stage, and demonstrate improvements in Table 1.
Methods based on interval bounds, including IBP, CROWN-IBP and DAI (mirman2018differentiable), tend to perform worse than CP when $\epsilon$ is small. Our regularizers are able to further improve CP on CIFAR-10 ($\epsilon = 2/255$), and demonstrate the best result among all approaches compared in this setting, as shown in Table 1. To our knowledge, these are the best results for CIFAR-10 ($\epsilon = 2/255$) reported on comparable sized models. By using our regularizers on CROWN-IBP to provide a better initialization for the later training stage of pure IBP, our method also achieves the best certified accuracy on MNIST ($\epsilon = 0.3$) and CIFAR-10 ($\epsilon = 8/255$).
To verify the significance of the regularizers, Table 2 shows the mean and standard deviation of the results with the family of 10 small models on MNIST, demonstrating consistent improvements of our model, while Table 4 (in the appendix) gives the best, median and worst case results with the large models on the MNIST dataset and compares with both IBP and CROWN-IBP.
6 Conclusions
We propose two regularizers based on the convex relaxation bounds for training robust neural networks that can be better verified by efficient verifiers including FastLin and IBP for certifiable robustness. Extensive experiments validate that the regularizers improve robust accuracy over non-regularized baselines, and outperform state-of-the-art approaches. This work is a step towards closing the gap between certified and empirical robustness. Future directions include methods to improve computational efficiency for LP relaxations (and certified methods in general), and better ways to leverage random projections for acceleration.
References
Appendix A Proof of Proposition 1
Proposition 1.
Proof.
We only need to prove is an optimal solution of both FastLin and CROWN. After that, is both a lower bound and feasible solution of , and therefore is the optimal solution of .
Here we define for , for , and . By definition, is an optimal solution of FastLin or CROWN. Also, since , we have . Next, we will prove if the assumption holds, we will have for both FastLin and CROWN.
For , by definition of FastLin and CROWN, and , so , .
For , again, by definition, , and , so , .
For :

If and as assumed in the conditions, since , we know
where is the th row of . No matter what value is, , , the equality still holds.

If , for both FastLin and CROWN,