Provably Robust Boosted Decision Stumps and Trees against Adversarial Attacks [preprint, June 2019]
The problem of adversarial samples has been studied extensively for neural networks. However, for boosting, in particular boosted decision trees and decision stumps, there are almost no results, even though boosted decision trees, such as XGBoost, are quite popular due to their interpretability and good prediction performance. We show in this paper that for boosted decision stumps the exact min-max optimal robust loss and test error for an $l_\infty$-attack can be computed in $O(n T \log T)$, where $T$ is the number of decision stumps and $n$ the number of data points, as well as an optimal update of the ensemble in $O(n^2 T \log T)$. While not exact, we show how to optimize an upper bound on the robust loss for boosted trees. To our knowledge, these are the first algorithms directly optimizing provable robustness guarantees in the area of boosting. We make the code of all our experiments publicly available at https://github.com/max-andr/provably-robust-boosting
Deep neural networks achieve excellent performance on complex prediction tasks in computer vision or natural language processing. However, it has recently been shown that they are easily fooled by imperceptible perturbations
Szegedy et al. (2014); Goodfellow et al. (2015) or tend to output high-confidence predictions on out-of-distribution inputs Nguyen et al. (2015); Nalisnick et al. (2019); Hein et al. (2019) that have nothing to do with the original classes. Moreover, Hein et al. (2019) suggest that the latter behavior cannot be prevented unless one changes the neural network architecture. One of the most popular defenses against adversarial examples for neural networks is adversarial training Goodfellow et al. (2015); Madry et al. (2018). It has been formulated within a principled framework of robust optimization Shaham et al. (2018); Madry et al. (2018); however, the inner maximization problem is difficult as the underlying problem is non-convex for neural networks. A large variety of suggested sophisticated defenses Kannan et al. (2018); Buckman et al. (2018); Lu et al. (2017) could be broken again via more sophisticated attacks Athalye et al. (2018); Engstrom et al. (2018); Mosbach et al. (2018). Moreover, empirical robustness can also arise from gradient masking or obfuscation Athalye et al. (2018), and thus one can never be sure whether more powerful attacks can break a given heuristic defense. A solution to this problem is offered by methods which lead to neural networks with provable robustness guarantees
Hein and Andriushchenko (2017); Wong and Kolter (2018); Raghunathan et al. (2018); Zhang et al. (2018); Xiao et al. (2019); Croce et al. (2019); Gowal et al. (2018); Cohen et al. (2019) or lead to networks which can be certified e.g. via the mixed-integer programming formulation of Tjeng et al. (2019). However, this certification process does not scale at the moment to large networks and networks having provable robustness guarantees are lacking in terms of prediction performance compared to standard ones.
While the adversarial problem has been studied extensively for neural networks, other classifier models have received much less attention e.g. kernel machines
Xu et al. (2009); Papernot et al. (2016); Hein and Andriushchenko (2017); Bertsimas et al. (2018), k-nearest neighbors Wang et al. (2018), and decision trees Papernot et al. (2016); Chen et al. (2019); Bertsimas et al. (2018). Boosting, in particular boosted decision trees, is quite popular, e.g. XGBoost Chen and Guestrin (2016), due to its interpretability and competitive prediction performance. While robust boosting has been considered Warmuth et al. (2007); Lutz et al. (2008); Freund (2009), in those works it refers to a large functional margin or to robustness with respect to outliers, e.g. via a robust loss function, but not to the adversarial robustness we are considering in this paper.
In this paper, we show how to exactly compute the robust loss and test error for an ensemble of decision stumps with coordinate-aligned splits and robustness with respect to the $l_\infty$-norm. Even better, we show that the update problem of the ensemble of decision stumps can be solved to global optimality, and thus one can directly minimize the robust min-max loss without any approximation. The difference of the resulting robust boosted decision stumps compared to normal boosted stumps is visualized in Figure 1.
Very recently, Chen et al. (2019) considered the robust min-max loss for an ensemble of decision trees with coordinate-aligned splits. They propose an approximation of the inner maximization problem but without any guarantees. The robustness guarantees are obtained via certification with Kantchelian et al. (2016), which have proposed a mixed-integer-programming formulation for the computation of the minimal adversarial perturbation for tree ensembles. However, their approach does not scale to large problems. While the approach of Chen et al. (2019) leads to tree ensembles with improved empirical and certified robustness, they have no approximation guarantee on the robust loss or robust error during training time. In contrast we show how to derive an upper bound on the robust loss for tree ensembles based on our results for an ensemble of decision stumps and we show how that upper bound can be minimized during training. Our derived upper bound is quite tight and leads directly to provable guarantees on the robustness of decision splits. Moreover, we obtain tight guarantees for the resulting tree ensemble when the tree construction is combined with a pruning scheme that ensures minimization of the upper bound on the robust loss for the whole ensemble.
In this section we fix the notation and the framework of boosting we want to tackle and define briefly the basis of robust optimization for adversarial robustness, underlying adversarial training. In the next section we derive the specific robust training procedure for an ensemble of decision stumps where we optimize the exact robust loss and for a tree ensemble where we optimize an upper bound.
While the main ideas can be generalized to the multi-class setting, for simplicity of the derivations we restrict ourselves to the binary classification case, that is, our labels $y$ are in $\{-1, 1\}$ and we assume $d$ real-valued features. Boosting can be described as the task of fitting an ensemble of weak learners $f_t: \mathbb{R}^d \to \mathbb{R}$ given as
$$F(x) = \sum_{t=1}^{T} f_t(x).$$
The final classification is done via the sign of $F(x)$. In boosting the ensemble is fitted in a greedy way: given the already estimated ensemble $F$, we determine an update $F' = F + f_{T+1}$ by fitting the new weak learner $f_{T+1}$, guided by the performance of the current ensemble $F$. In this paper we focus on the exponential loss $L(y\,f(x)) = \exp(-y\,f(x))$, where we use the functional margin formulation, which for a point $(x, y)$ is defined as $y\,f(x)$. However, all the following algorithms can be generalized to any strictly monotonically decreasing, convex loss function $L$, e.g. the logistic loss $L(y\,f(x)) = \ln(1 + \exp(-y\,f(x)))$. The advantage of the exponential loss is that it decouples $F$ and the update $f_{T+1}$ in the estimation process, so that fitting $f_{T+1}$ can be seen as fitting a weighted exponential loss where the weight of $(x_i, y_i)$ is given by $\exp(-y_i F(x_i))$:
$$L\big(y_i (F(x_i) + f_{T+1}(x_i))\big) = \exp(-y_i F(x_i)) \, \exp(-y_i f_{T+1}(x_i)).$$

In this paper we consider as weak learners: a) decision stumps of the form $f_t(x) = w_l + w_r \mathbb{1}_{x_{c_t} \ge b_t}$, with $w_l, w_r, b_t \in \mathbb{R}$, where one does a coordinate-aligned split, and b) decision trees (binary trees) of the form $f_t(x) = u_{q_t(x)}$, where $u: V \to \mathbb{R}$ is a mapping from the set of leaves $V$ of the tree to $\mathbb{R}$ and $q_t: \mathbb{R}^d \to V$ is a mapping which assigns to every input the leaf of the tree it ends up in. While the approach can be generalized to general linear splits of the form $w_l + w_r \mathbb{1}_{\langle v, x \rangle \ge b}$, we concentrate on coordinate-aligned splits, which are easier for humans to interpret. The trees are first grown to maximum depth, and post hoc one recursively prunes all the leaf splits with negative gain.
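The decoupling property of the exponential loss can be checked numerically. The following minimal sketch is an illustration only and is not taken from the paper's repository; the function names and the toy values are assumptions made for the example.

```python
import math

def exp_loss(margin):
    """Exponential loss as a function of the functional margin y * f(x)."""
    return math.exp(-margin)

def boosting_weights(ys, ensemble_outputs):
    """Per-point weights exp(-y_i * F(x_i)) induced by the current ensemble F."""
    return [exp_loss(y * F) for y, F in zip(ys, ensemble_outputs)]

# Toy values (hypothetical): labels, current ensemble outputs F(x_i),
# and outputs of a candidate weak learner f_{T+1}(x_i).
ys = [1, -1, 1]
F_values = [0.5, 0.2, -1.0]
f_new_values = [0.3, -0.4, 0.8]

weights = boosting_weights(ys, F_values)

# Loss of the updated ensemble F + f_{T+1} ...
full_loss = [exp_loss(y * (F + f)) for y, F, f in zip(ys, F_values, f_new_values)]
# ... equals the weighted loss of the new weak learner alone.
weighted_loss = [w * exp_loss(y * f) for w, y, f in zip(weights, ys, f_new_values)]
```

Both lists agree up to floating-point error, which is why the new weak learner can be fitted against fixed per-point weights.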
The problem of adversarial perturbations is known in spam classification and has been rediscovered in Szegedy et al. (2014). It can be formulated as finding the minimal perturbation with respect to some metric such that the classifier decision is wrong (footnote 1: there are variants where one instead requires the decision of the classifier to change, but we stick to the formulation related to the robust test error):
$$\min_{\delta \in \mathbb{R}^d} \|\delta\|_p \quad \text{such that} \quad y_i f(x_i + \delta) \le 0, \quad x_i + \delta \in C, \tag{1}$$
where $p \ge 1$ and $C$ is a constraint set the data has to fulfill. We denote by $\delta_i^*$ the optimal solution of this problem for $(x_i, y_i)$. Furthermore, let $\Delta_p(\epsilon) = \{\delta \in \mathbb{R}^d : \|\delta\|_p \le \epsilon\}$ be the set of perturbations with respect to which we want to be robust (attack model). Then the robust test error with respect to $\Delta_p(\epsilon)$ is defined for a test set $(x_i, y_i)_{i=1}^n$ as the fraction $\frac{1}{n} \sum_{i=1}^n \mathbb{1}_{\|\delta_i^*\|_p \le \epsilon}$. The optimization problem (1) is non-convex for neural networks and can only be solved exactly via mixed-integer programming Tjeng et al. (2019), which does not scale, especially during training. Thus lower bounds on the robust test error are obtained via heuristic attacks Madry et al. (2018); Carlini and Wagner (2017), whereas provable robustness aims at providing upper bounds on the robust test error and at optimizing these bounds during training Hein and Andriushchenko (2017); Wong and Kolter (2018); Raghunathan et al. (2018); Zhang et al. (2018); Xiao et al. (2019); Croce et al. (2019); Gowal et al. (2018); Cohen et al. (2019). For an ensemble of trees the optimization problem (1) can also be reformulated as a mixed-integer program Kantchelian et al. (2016), which does not scale to large datasets.
The goal of improving adversarial robustness can be formulated as a robust optimization problem with respect to the set of allowed perturbations $\Delta_p(\epsilon) = \{\delta : \|\delta\|_p \le \epsilon\}$ Shaham et al. (2018); Madry et al. (2018):
$$\min_{\theta} \sum_{i=1}^{n} \max_{\delta \in \Delta_p(\epsilon)} L\big(y_i \, f(x_i + \delta; \theta)\big). \tag{2}$$
For neural networks the corresponding training process is called adversarial training Goodfellow et al. (2015), where at each update step one tries to approximately solve the inner maximization problem, which again is non-convex, so that globally optimal solutions are very difficult to obtain. Our goal in the following is to obtain provable robustness guarantees for boosted stumps and trees which are optimized during training; we show how to do this in the following two sections.
We first show how the exact robust loss can be computed for an ensemble of decision stumps. While decision stumps are very simple weak learners, they have been used in the original AdaBoost Freund and Schapire (1996) and were successfully employed in object detection Viola and Jones (2001); Viola and Jones (2004), which could be done in real time due to the simplicity of the classifier. The ensemble of decision stumps can be written as
$$F(x) = \sum_{t=1}^{T} f_t(x) = \sum_{t=1}^{T} \Big( w_l^{(t)} + w_r^{(t)} \mathbb{1}_{x_{c_t} \ge b_t} \Big), \tag{3}$$
where $c_t$ is the coordinate on which $f_t$ makes a split. First, observe that a point $x \in \mathbb{R}^d$ with label $y$ is correctly classified when $y F(x) > 0$. In order to determine whether the point $x$ is adversarially robust wrt $l_\infty$-perturbations, one has to solve the following optimization problem:
$$G(x, y) := \min_{\|\delta\|_\infty \le \epsilon} y F(x + \delta). \tag{4}$$
If $G(x, y) \le 0$, then the point $x$ is non-robust. If $G(x, y) > 0$, then the point $x$ is robust, i.e. it is not possible to change the class. Thus the exact minimization of (4) over the test set yields the exact robust test error. For many state-of-the-art classifiers, this problem is NP-hard. For particular MIP formulations for tree ensembles see Kantchelian et al. (2016), or for neural networks see Tjeng et al. (2019). Closed-form solutions are known for linear classifiers Goodfellow et al. (2015).
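We can solve this certification problem for the robust test error exactly and efficiently by noting that the objective is separable wrt the input dimensions, and then solving up to $d$ one-dimensional optimization problems, as our attack model is also separable. We denote by $S_k = \{s \in \{1, \dots, T\} \mid c_s = k\}$ the set of stump indices that split coordinate $k$. Then
$$\min_{\|\delta\|_\infty \le \epsilon} y F(x + \delta) = \sum_{k=1}^{d} \min_{|\delta_k| \le \epsilon} \sum_{s \in S_k} y f_s(x + \delta) = \sum_{t=1}^{T} y\, w_l^{(t)} + \sum_{k=1}^{d} \min_{|\delta_k| \le \epsilon} \sum_{s \in S_k} y\, w_r^{(s)} \mathbb{1}_{x_k + \delta_k \ge b_s}. \tag{5}$$
The one-dimensional optimization problem can be solved by simply checking all (at most $|S_k| + 1$) piece-wise constant regions of the classifier for $\delta_k \in [-\epsilon, \epsilon]$. The overall time complexity of the exact certification is $O(T \log T)$ per data point, since we need to sort all thresholds (up to $T$ of them) in ascending order to efficiently calculate the partial sums depending on thresholds.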
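The per-coordinate procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from the paper's repository: stumps are represented as hypothetical tuples `(w_l, w_r, b, c)` with $f(x) = w_l + w_r \mathbb{1}_{x_c \ge b}$, and the function name is made up for the example.

```python
from collections import defaultdict

def stump_ensemble_min_margin(stumps, x, y, eps):
    """Exact min over ||delta||_inf <= eps of y * F(x + delta).

    stumps: list of (w_l, w_r, b, c) tuples (assumed representation).
    """
    # Constant part: sum of y * w_l over all stumps.
    total = sum(y * w_l for (w_l, _, _, _) in stumps)

    # Group the threshold-dependent parts by splitting coordinate.
    by_coord = defaultdict(list)
    for (_, w_r, b, c) in stumps:
        by_coord[c].append((b, y * w_r))

    for c, items in by_coord.items():
        lo, hi = x[c] - eps, x[c] + eps
        # Stumps with b <= lo are active everywhere on [lo, hi];
        # stumps with b > hi are never active and contribute nothing.
        base = sum(v for b, v in items if b <= lo)
        inside = sorted((b, v) for b, v in items if lo < b <= hi)
        best = running = base
        i = 0
        while i < len(inside):
            b0 = inside[i][0]
            # Stumps sharing a threshold switch on together.
            while i < len(inside) and inside[i][0] == b0:
                running += inside[i][1]
                i += 1
            best = min(best, running)  # value on the next constant region
        total += best
    return total

# Toy ensemble (hypothetical values).
stumps = [(-0.5, 2.0, 0.5, 0), (0.3, -1.0, 0.3, 1)]
x, y = [0.6, 0.2], 1
margin_large_eps = stump_ensemble_min_margin(stumps, x, y, eps=0.2)
margin_small_eps = stump_ensemble_min_margin(stumps, x, y, eps=0.05)
```

A point is certified robust when the returned value is positive. In the toy example the point is robust for the smaller radius but not for the larger one. Sorting the thresholds dominates the cost, in line with the $O(T \log T)$ statement.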
Moreover, using this result, we can obtain provably minimal adversarial examples. By noting that the function is piece-wise constant with constant regions, it suffices to solve this minimization problem for every (where is as small as precision allows) sorted in ascending order and stop when
is large enough to change the original class. In order to get the final perturbation vector
, we have to save the indices that minimize for every splitting coordinate which are used in the ensemble. We provide visualizations of adversarial examples in the experimental section.

Finally, as we assume that $L$ is monotonically decreasing, it holds that
$$\max_{\|\delta\|_\infty \le \epsilon} L\big(y F(x + \delta)\big) = L\Big(\min_{\|\delta\|_\infty \le \epsilon} y F(x + \delta)\Big),$$
and thus the above algorithm can also be used directly to compute the robust loss.
For updating the ensemble $F$ with a new stump $f$ splitting coordinate $j$, we first have to solve the inner maximization problem over $\delta$ in (2) before we optimize the parameters of the new stump (footnote 2: the order is very important, as a min-max problem is not the same as a max-min problem):
In order to solve the remaining optimization problem for we have to make a case distinction based on the values of . However, first we define the minimal values of the ensemble part on and as
These problems can be solved analogously to . Then we get the case distinction:
(6) | |||
Note that as a function of is concave. The following Lemma shows that the full loss is jointly convex in .
Let be concave and convex and monotonically decreasing. Then defined as is convex for any .
Proof. Let and be in . Then for ,
Thus the loss term for each data point is jointly convex in and consequently the sum of the losses is convex as well.
This means that for the overall robust optimization problem over the parameters (for fixed ) we have to minimize the piecewise defined convex function with up to case distinctions on :
where the vector is obtained by sorting the values augmenting the first and last elements and with and respectively in ascending order . There is no closed-form minimizer wrt even when is fixed. Thus we apply coordinate descent to minimize the loss where the minimum wrt is found via bisection, and wrt via a closed-form minimizer when is fixed. Concretely, if we denote to be equal to if the first condition of (6) is true, and otherwise, and also
then the total exponential loss is . We further denote and
(7) | ||||
Then the coordinate descent update for can be derived by setting to zero:
We note that this does not create a significant overhead, since we perform only operations on scalars , , , . The overall complexity for a particular coordinate and fixed threshold is in the number of training examples times the effort for coordinate descent which is logarithmic in the desired precision (cost for bisection).
Finally, we have to minimize over all possible thresholds. We choose the potential thresholds , where can be as small as precision allows and is just introduced so that the thresholds lie outside of . We optimize the robust loss for all thresholds and determine the minimum. For each contiguous set of minimizers we determine the nearest neighbors in and check the thresholds half-way to them (note that they have at most the same robust loss but never a better one) and then take the threshold in the middle of all the ones having equal loss. As there are in the worst case thresholds, the overall complexity of one update step is . Finally, at each update step one typically selects a small random subset of the coordinates and takes the one which yields the smallest overall robust loss of the ensemble.
Similarly to decision stumps we first provide an upper bound on the robust test error which is then used further on in the update step of tree ensemble by minimizing an upper bound on the robust loss.
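Our goal is to solve the optimization problem (4). While the exact minimization is NP-hard for trees Kantchelian et al. (2016), we will, similarly to Wong et al. (2018); Raghunathan et al. (2018) for neural networks, derive a tractable lower bound $\tilde{G}(x, y)$ on the minimum $G(x, y)$ in (4) for an ensemble of trees:
$$G(x, y) = \min_{\|\delta\|_\infty \le \epsilon} \sum_{t=1}^{T} y f_t(x + \delta) \;\ge\; \sum_{t=1}^{T} \min_{\|\delta\|_\infty \le \epsilon} y f_t(x + \delta) =: \tilde{G}(x, y). \tag{8}$$
If $\tilde{G}(x, y) > 0$, then the point $x$ is provably robust. However, if $\tilde{G}(x, y) \le 0$, the point may be either robust or non-robust. In this way, we get an upper bound on the number of non-robust points, which yields an upper bound on the robust test error. We note that for a single decision tree, $\min_{\|\delta\|_\infty \le \epsilon} y f_t(x + \delta)$ can be found exactly by checking all leaves which are reachable for points in the $l_\infty$-ball of radius $\epsilon$ around $x$. The complexity is at most exponential in the depth of the tree (the number of leaves), but this remains tractable for shallow trees in the ensemble, e.g. depth up to 8 as used in Chen et al. (2019) for efficient training.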
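The tree-wise lower bound can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the paper's implementation: a tree is a hypothetical nested tuple `(c, b, left, right)` that sends $x$ to `right` when $x_c \ge b$ and to `left` otherwise, a leaf is a plain number, and the feasible box is tracked along each path so that repeated splits on the same coordinate are handled correctly.

```python
def tree_min_margin(node, y, lo, hi, strict):
    """Min of y * leaf_value over leaves reachable from the current box.

    lo[c] is an inclusive lower bound, hi[c] an upper bound that is
    exclusive when strict[c] is True (assumed bookkeeping for this sketch).
    """
    if not isinstance(node, tuple):  # leaf: node holds the leaf value
        return y * node
    c, b, left, right = node
    best = float("inf")
    if lo[c] < b:  # some point in the box satisfies x_c < b
        new_hi, new_strict = (b, True) if b <= hi[c] else (hi[c], strict[c])
        best = min(best, tree_min_margin(
            left, y, lo, {**hi, c: new_hi}, {**strict, c: new_strict}))
    if b < hi[c] or (b == hi[c] and not strict[c]):  # some point has x_c >= b
        best = min(best, tree_min_margin(
            right, y, {**lo, c: max(lo[c], b)}, hi, strict))
    return best

def ensemble_lower_bound(trees, x, y, eps):
    """Sum of per-tree minima: a lower bound on min_delta y * F(x + delta)."""
    lo = {c: v - eps for c, v in enumerate(x)}
    hi = {c: v + eps for c, v in enumerate(x)}
    strict = {c: False for c in range(len(x))}
    return sum(tree_min_margin(t, y, lo, hi, strict) for t in trees)

# Toy ensemble of two depth-1 trees (hypothetical values).
trees = [(0, 0.5, -1.0, 2.0), (1, 0.3, 0.5, -0.5)]
bound_small_eps = ensemble_lower_bound(trees, [0.6, 0.2], 1, eps=0.05)
bound_large_eps = ensemble_lower_bound(trees, [0.6, 0.2], 1, eps=0.2)

# A tree that splits coordinate 0 twice: the leaf with value -5.0 needs
# x_0 >= 0.5 and x_0 < 0.3 at the same time, so it is never reachable.
nested = (0, 0.5, -1.0, (0, 0.3, -5.0, 1.0))
bound_nested = ensemble_lower_bound([nested], [0.4], 1, eps=0.2)
```

A positive bound certifies the point; a non-positive bound is inconclusive, because each tree is minimized independently and the per-tree minimizers need not coincide.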
The goal is to bound the inner maximization problem of Equation (2) based on the certificate that we derived. Note that we aim to bound the loss of the whole ensemble $F + f$, and thus we do not use approximations of the loss, and we also do not use the approximate split suggested in Chen and Guestrin (2016). We use $p = \infty$, that is, the attack model is $\Delta_\infty(\epsilon) = \{\delta : \|\delta\|_\infty \le \epsilon\}$. Let $F$ be a fixed ensemble of trees and $f$ a new tree with which we update the ensemble.
(9) |
The inner maximization problem can be upper bounded for every tree separately given that is monotonically decreasing wrt , and using our certificate for the ensemble of trees:
We can efficiently calculate as described in the previous subsection. But note that depends on the tree . The exact tree fitting is known to be NP-complete Laurent and Rivest (1976), although it is still possible to scale it to some moderate-sized problems with recent advances in MIP-solvers and hardware as shown in Bertsimas and Dunn (2017). We want to keep the overall procedure scalable to large datasets, so we will stick to the standard greedy recursive algorithm for fitting the tree. On every step of this process, we fit for some coordinate and for some splitting threshold , a single decision stump . Therefore, for a particular decision stump with threshold and coordinate we have to solve the following problem:
(10) |
where are all the points which can reach this leaf for some with .
Finally, we have to make a case distinction depending on the values of and :
(11) |
where we denote the case distinction for brevity as . Note that the right side of (11) is concave as a function of . Thus the overall robust optimization amounts to finding the minimum of the following objective, which is again by Lemma 3.1 jointly convex in :
(12) | |||
Since we have only two intervals and , we first find the minimum loss on each interval separately via coordinate descent. By using the notation from (7), where now , the minimizers of and are given by setting and to zero:
We iterate these updates of and until convergence. After finding the minimum of the objective on a particular interval, we then combine the results from both intervals by taking the smallest loss out of them.
Then we also consider the threshold selection as described in Section 3.2. Finally, as in other tree building methods such as Breiman et al. (1984); Chen and Guestrin (2016), we perform pruning after a tree is constructed based on the training robust loss (12) to ensure that it decreases at every iteration of tree boosting. This cannot be guaranteed with robust splits alone since the tree construction process is greedy, and splits at one branch of the tree may influence another. We note that in the extreme case, pruning may leave only a robust stump at the root, for which we are guaranteed to decrease the robust loss. Thus every new tree is guaranteed to reduce the training robust loss, and in practice pruning that leads to just a single decision stump happens extremely rarely. We note that the total worst case complexity is in the number of training examples compared to for XGBoost, which is a relatively low price given that the overall optimization problem is significantly more complicated than the original XGBoost formulation.
General setup: We are primarily interested in two quantities: test error (TE) and robust test error (RTE) wrt $l_\infty$-perturbations. For boosted stumps, we can compute RTE exactly as described in Section 3.1, but we also report the upper bound (URTE) to illustrate that it is actually tight for almost all models. For boosted trees, we cannot compute RTE efficiently, thus we report the upper bound (URTE) together with a lower bound (LRTE) given by solving the left-hand side of (8) by sampling (250 trials). We observe that such a simple lower bound is actually tight enough for low-dimensional datasets such as breast-cancer, diabetes or cod-rna, and also reasonably tight for MNIST, FMNIST and GTS. Thus this is a reasonable quantity to assess the tightness of URTE obtained via (8). For evaluation we use 8 binary classification tasks: breast-cancer Dua and Graff (2017), diabetes Smith et al. (1988), cod-rna Uzilov et al. (2006), MNIST 1-5 (digit 1 vs digit 5) LeCun (1998), MNIST 2-6 (digit 2 vs digit 6, following Kantchelian et al. (2016); Chen et al. (2019)), FMNIST shoes (sandals vs sneakers) Xiao et al. (2017), GTS 100-rw (speed 100 vs roadworks sign), and GTS 30-70 (speed 30 vs speed 70) Stallkamp et al. (2012).
We consider three boosted stumps models: the plain model, robust stumps where each stump is bounded independently as described in Section 4.2, and exact robust stumps as in Section 3.2. We consider two boosted trees models: the plain model, and robust trees as described in Section 4.2. We perform model selection based on a validation set of 20% randomly selected points from the original training set, and we train on the rest of the training set. We do up to 500 iterations of stumps and up to 50 iterations of trees, and then select the final model based on the best validation test error for plain models, and based on the best validation robust test error for robustly trained models. For boosted stumps, we optimize over up to 10 coordinates for every split. For boosted trees, we grow trees up to depth 4, for every split we optimize over up to 100 coordinates, and we split a node only if it contains at least 10 examples. All models are trained with the exponential loss or with its robust version. More details about the experimental setup are available at our repository http://github.com/max-andr/provably-robust-boosting.
Table 1: Boosted stumps.
Dataset | $\epsilon$ | Plain stumps TE | Plain stumps RTE | Plain stumps URTE | Robust stumps TE | Robust stumps RTE | Robust stumps URTE | Exact robust stumps TE | Exact robust stumps RTE | Exact robust stumps URTE
---|---|---|---|---|---|---|---|---|---|---
breast-cancer | 0.3 | 1.5 | 99.3 | 99.3 | 5.1 | 11.7 | 11.7 | 5.1 | 11.7 | 11.7
diabetes | 0.05 | 28.6 | 77.9 | 77.9 | 27.9 | 32.5 | 32.5 | 29.2 | 32.5 | 32.5
cod-rna | 0.025 | 4.8 | 37.4 | 38.5 | 12.0 | 22.4 | 22.4 | 12.0 | 22.4 | 22.4
MNIST 1-5 | 0.3 | 0.4 | 98.6 | 98.7 | 0.4 | 3.2 | 3.2 | 0.5 | 3.7 | 4.3
MNIST 2-6 | 0.3 | 1.8 | 100 | 100 | 2.6 | 8.6 | 8.6 | 2.6 | 8.9 | 8.9
FMNIST shoes | 0.1 | 2.4 | 100 | 100 | 5.9 | 13.1 | 13.1 | 5.3 | 13.1 | 13.8
GTS 100-rw | 8/255 | 1.5 | 30.0 | 30.0 | 3.5 | 14.3 | 14.3 | 3.7 | 12.4 | 12.4
GTS 30-70 | 8/255 | 14.4 | 76.7 | 76.7 | 18.1 | 32.8 | 32.8 | 17.0 | 32.7 | 32.7
Table 2: Boosted trees.
Dataset | $\epsilon$ | Plain trees TE | Plain trees LRTE | Plain trees URTE | Robust trees TE | Robust trees LRTE | Robust trees URTE
---|---|---|---|---|---|---|---
breast-cancer | 0.3 | 2.2 | 90.5 | 99.3 | 2.9 | 10.2 | 10.2
diabetes | 0.05 | 27.3 | 52.6 | 53.2 | 28.6 | 33.1 | 33.1
cod-rna | 0.025 | 4.2 | 39.1 | 64.1 | 8.3 | 22.8 | 23.2
MNIST 1-5 | 0.3 | 0.3 | 57.5 | 87.5 | 0.3 | 1.5 | 2.0
MNIST 2-6 | 0.3 | 1.2 | 98.1 | 100 | 0.7 | 3.5 | 5.0
FMNIST shoes | 0.1 | 2.8 | 72.2 | 100.0 | 4.7 | 9.4 | 10.5
GTS 100-rw | 8/255 | 2.5 | 12.3 | 21.1 | 4.7 | 9.2 | 10.1
GTS 30-70 | 8/255 | 14.2 | 38.8 | 63.9 | 14.9 | 25.6 | 27.2
The results for boosted stumps are given in Table 1. We observe that plain models are not robust within the considered perturbations. However, both variants of robust boosted stumps that we propose significantly improve RTE on all datasets, which shows the effectiveness of our method. The most extreme improvements compared to plain models are obtained on the breast-cancer dataset, from 99.3% RTE to 11.7%, and on MNIST 2-6, from 100% RTE to 8.6% (robust stumps) and 8.9% (exact robust stumps). We also notice that robust models perform slightly worse in terms of test error, which is in line with the empirical observation made for adversarial training of neural networks Madry et al. (2018); Tsipras et al. (2019). Although RTE is the exact quantity and is sufficient to judge the robustness of the considered models, we still report URTE to show that it is very close to RTE. Remarkably, when URTE is integrated into training, i.e. for robust stumps, it is equal to RTE for all models. This suggests that bounding the sum over weak learners element-wise, as done in (8), might be tight enough to lead to robust models also for tree ensembles, which we discuss next.
For boosted stumps or trees, unlike for neural networks, we can directly inspect the model and the classification rules it learned. In particular, in Figure 2, we plot the distribution of the splitting thresholds for the three boosted stumps models on MNIST 2-6 reported in Table 1. We can observe that both robust models always select splits in the range between 0.3 and 0.7, which is reasonable given that more than 80% of the pixels of MNIST are either 0 or 1, and the considered $l_\infty$-perturbations are within $\epsilon = 0.3$. At the same time, the plain model splits arbitrarily close to 0 or 1, which suggests that its decisions might be easily flipped if the adversary is allowed to change the pixels within $\epsilon = 0.3$. We also note that robust stumps and exact robust stumps lead to almost identical histograms of splitting thresholds, which again suggests that bounding every stump independently has a similar effect to solving the corresponding minimization problem exactly.
The results for boosted trees are given in Table 2. Note that now we cannot compute RTE efficiently, so we rely on LRTE and URTE to judge robustness. Similarly to boosted stumps, we observe that robust training for boosted trees is also effective in improving the robustness of the models. We draw this conclusion since for every dataset the URTE of robust trees is lower than the LRTE of the corresponding plain model, often by a large margin. For example, on MNIST 2-6, the LRTE of the plain model is 98.1%, while the URTE of the robust model is 5.0%. We observe that URTE is very close to LRTE, or even the same in some cases, which allows us to assess the exact RTE. We also note that while robust trees are often slightly worse than their plain counterparts in terms of test error, they outperform the robust stumps models. Moreover, the URTE of robust trees is in many cases better than the exact RTE of both versions of robust stumps. This suggests that there is a benefit in using more expressive weak learners in boosting, such as trees, to get more robust and accurate models.
In Section 3.1, we described how we can efficiently obtain provably minimal (exact) adversarial examples for boosted stumps. We show them for the MNIST 1-5 and MNIST 2-6 datasets in Figure 3. We show the size of the $l_\infty$-perturbation needed to flip the class in the title of each image. First, we can observe that the $l_\infty$-perturbations are sparse, which is due to the fact that we modify only the pixels that influence particular decision stumps that contribute to the minimization of (5). The main observation is that the perturbations for plain models are imperceptible, while for robust models they are much larger in terms of the $l_\infty$-norm. In particular, their $l_\infty$-norm is usually slightly larger than 0.3, which makes sense since the $\epsilon$ that we used during training was equal to 0.3. Moreover, for robust models, the perturbations are situated at the locations where one can expect pixels of the opposite class.
We introduce efficient robust optimization methods for boosted decision stumps and boosted decision trees wrt $l_\infty$-perturbations. In particular, we show how to solve the underlying min-max problem for boosted stumps exactly. Our experimental results confirm the effectiveness of the proposed methods. For boosted trees, we provably improve robustness over plain models. As future work it will be interesting to identify other non-trivial classifiers for which one can perform robust optimization exactly and efficiently. Alternatively, deriving tight upper bounds on the robust loss is another promising direction, as shown in this work.
Croce et al. (2019). Provable robustness of ReLU networks via maximization of linear regions. AISTATS, 2019.
Engstrom et al. (2018). Evaluating and understanding the robustness of adversarial logit pairing. NeurIPS 2018 Workshop on Security in Machine Learning, 2018.
LeCun (1998). The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
Madry et al. (2018). Towards deep learning models resistant to adversarial attacks. ICLR, 2018.
Tsipras et al. (2019). Robustness may be at odds with accuracy. ICLR, 2019.
Xu et al. (2009). Robustness and regularization of support vector machines. JMLR, 2009.
Zhang et al. (2018). Efficient neural network robustness certification with general activation functions. NeurIPS, 2018.