Robust Classification using Robust Feature Augmentation

05/26/2019 ∙ by Kevin Eykholt, et al. ∙ University of Michigan

Existing deep neural networks, say for image classification, have been shown to be vulnerable to adversarial images that can cause a DNN misclassification without any perceptible change to an image. In this work, we propose shock-absorbing robust features, such as binarization (e.g., rounding) and group extraction (e.g., color or shape), to augment the classification pipeline, resulting in more robust classifiers. Experimentally, we show that augmenting ML models with these techniques leads to improved overall robustness on adversarial inputs as well as significant improvements in training time. On the MNIST dataset, we achieved a 14x speedup in training time to obtain 90% adversarial accuracy compared to the state-of-the-art adversarial training method of Madry et al., and retained higher adversarial accuracy over a broader range of attacks. We also find robustness improvements on traffic sign classification using robust feature augmentation. Finally, we give theoretical insights into why one can expect robust feature augmentation to reduce the adversarial input space.




1 Introduction

Deep neural networks (DNNs) are used for various tasks, including image classification with applications to character recognition, traffic sign classification, and autonomous driving. However, the pervasive use of DNNs has also raised concerns as to their robustness, and thus trustworthiness. Namely, existing DNNs have been shown to be vulnerable to adversarial inputs [20]. These are inputs that, to a human, appear similar to each other, but are assigned different labels by the DNN.

Currently, there is an interest in designing networks that are robust to adversarial examples. Shafahi et al. argue that adversarial robustness is limited by the dimensionality of the input space [18]. Schmidt et al. suggest that accurate but non-robust models result from an insufficient number of training samples [17]. Under a theoretical model in which an accurate classifier can be learned from a single sample, they demonstrate that learning a robust classifier requires substantially more samples. This problem manifests itself during training as the classifier learns to rely on predictive but non-robust features. For example, Malhotra et al. added pixel noise to training inputs based on the true label of the input and found that the classifier learned to value the position of the noise pixel over any other feature when classifying the data [13]. Other works make similar findings, showing that traditional training results in a classifier that learns highly predictive but non-robust features and is thus exploitable [8, 5, 21, 2].

The main contribution of this paper is to propose a new approach, robust feature augmentation, as a component of standard machine learning techniques. In this approach, we augment a classification pipeline with robust features that are designed to absorb most adversarial perturbations, thus improving the overall robustness of the classifier. Under a theoretical model, we provide results and characterizations that help explain why this approach improves robustness. Our work is also interesting in light of recent work on certifiable robustness. For example, Cohen et al. [4] mention that “it is typically impossible to tell whether a prediction by an empirically robust classifier is truly robust to adversarial perturbations”; with robust feature augmentation in the classification pipeline itself, however, one can expect robustness to bounded adversarial perturbations by construction.

Adversarial training, popularized by Madry et al., is the current standard approach for designing robust machine learning models, in which $L_\infty$-bounded adversarial examples are generated during training. However, adversarial training is costly. As an alternative approach, we suggest augmenting a classification pipeline with robust features. Compared to Madry et al. [12], robust feature augmentation without adversarial training achieved 80% adversarial accuracy 33x faster on MNIST. Additionally, combining robust feature augmentation with adversarial training achieved a 14x training time speedup for reaching 90% adversarial accuracy. Since robust feature augmentation works well with any DNN, we can get higher accuracy compared to recent attempts at certifiable robustness [16] on the MNIST dataset.

A concurrently developed approach is to create a dataset that only contains robust features [8]. Previously, this approach was shown to improve the robustness of a trained model, but required precise manipulations of the dataset [5]. Using adversarial training on CIFAR, Ilyas et al. created an adversarially robust model, from which they identified robust and non-robust features. They removed the non-robust features from the dataset and showed that standard training on the robust dataset improved adversarial performance by about 45% while decreasing test accuracy by 10%. However, this improvement still failed to outperform the model adversarially trained on the original dataset. In this paper, we choose to improve robustness by identifying robust features that can be added directly to the classification pipeline, thus preserving standard training techniques. Our intuition is that since adversarial examples are a human-defined phenomenon, robust features can be similarly defined.

Summary of contributions and outline of the paper:
  • We define the notion of a robust feature, computed from an input. Informally, a robust feature is a feature that does not change as the input is adversarially perturbed. Typically, we intend these features to be meaningful attributes, such as the color and shape of an object being classified. However, a robust feature can also be a coarser categorization of the input that is expected to be stable under permitted adversarial perturbations (Section 3).

  • We show that a function computed on a set of robust features is also robust. In other words, we can use robust features as an input for robust classification decisions (Section 3).

  • We show theoretical connections between (1) adversarially-trained classifiers that attempt to discover a non-linear separation boundary to maximize the separation between natural inputs and (2) using robust functions to map natural inputs to "pure" natural inputs and then using a linear classifier to separate the points (Section 3).

  • On MNIST, we use a binarization function as a robust feature and show that it improves the robustness of a standard classifier from 0% to 74.64% without any adversarial training. For an adversarially trained classifier, binarization reduces the training time by 14x for comparable adversarial accuracy and retains better accuracy as the attack radius increases, e.g., 87.13% adversarial accuracy vs. 34.88% for [12] at a larger attack radius (Section 4).

  • On a traffic sign dataset, we design a robust color extractor to augment a standard traffic sign classifier. Our augmented classifier prevents more than 90% of adversarial attacks between signs of different colors (Section 4).

Before delving into the details of the main contributions of our work, we give an overview of useful notation and definitions in Section 2.

2 Preliminaries

In this section, we establish some notation and definitions that will be useful in the remainder of the paper. We often refer to the set $\{1, \ldots, k\}$ as $[k]$ for ease of notation. We assume there is an underlying data distribution $\mathcal{D}$ to which the input set $X$ belongs, and each $x \in X$ has a corresponding ground truth label $g(x) \in [k]$. In particular, one can think of $X$ as the set of inputs that a human (or an oracle) is able to classify. We follow the supervised machine learning setup, where the basic goal is, given a training dataset and corresponding labels, to learn a function $f$ that is a good approximation of the unknown function $g$. We will assume that is such that in the given data. More specifically, the goal is to minimize the loss over a random sample from the input space $X$, which is often approximated by minimizing an empirical loss over a random training sample.

It has been shown that although highly accurate approximations of $g$ can be learned, these approximations are not robust with respect to perturbations of a majority of inputs. Let $d$ be a distance function that measures the distance between inputs in $X$, and let us denote the $\epsilon$-neighborhood of $x \in X$, $N_\epsilon(x)$, as the set of points at a distance at most $\epsilon$ away from $x$, i.e., $N_\epsilon(x) = \{x' : d(x, x') \le \epsilon\}$ for some given $\epsilon > 0$. We call a function robust if it does not change its output over small neighborhoods around a subset of desired inputs $\bar{X} \subseteq X$.

Definition 1.

A function $f$ is said to be robust over a subset $S \subseteq X$ with respect to $\epsilon$ if for all $x \in S$: $f(x') = f(x)$ for all $x' \in N_\epsilon(x)$.

We refer to $\bar{X}$ as “pure” inputs. Since the set of all possible inputs, $X$, encountered in practice is assumed classifiable by a human (or an oracle), we can assume that $X = \bigcup_{x \in \bar{X}} N_\epsilon(x)$ for some $\epsilon$ [Footnote 1: Note that by this definition, the classifiable set of inputs is not convex, since a convex combination of two points in different neighborhoods may not lie in the neighborhood of any pure data point.]. As an example, a constant function is robust over all inputs by definition; however, it may not be accurate. For some large enough $\epsilon$, the ground truth $g$ may itself not be robust, although it is accurate. Combining accuracy and robustness, we can define an adversarial input as follows:

Definition 2.

Given a ground truth function and a learned classification function , suppose for some and , then is an adversarial example for the classification function .

Suppose a function is robust [Footnote 2: A related notion is that of certifiable robustness, which deals with the user being able to certify robustness of a given classifier, in the sense of property testing [16].] on a subset with respect to (i.e., for all , ). By definition, any input in cannot be an adversarial example for the function with respect to and any arbitrary ground truth function .

Ideally, we would like a robust classifier that is also accurate on this input space, i.e., a classifier that minimizes the loss on pure inputs as well as the percentage of pure inputs that have adversarial inputs. Increases in robustness may result in a loss of accuracy, and the goal is to find a feasible trade-off. For the rest of the paper, we will assume that $\epsilon$ is chosen small enough such that perturbing inputs in $\bar{X}$ within an $\epsilon$-neighborhood does not change the ground truth classification, and we would like to compute classifiers that are robust over $\bar{X}$.

3 Robust Feature Augmentation

We propose two general techniques for developing robust classifiers: binarization (Section 3.1) and group feature extraction (Section 3.2).


Figure 1: (a) Original image, (b) adversarial image, (c) original image, (d) adversarial image. The MNIST image in (a) is correctly classified as a “4”, whereas the image in (b) is misclassified as an “8” despite only minor visual distortions. Similarly, the image in (c) is correctly classified as a Stop sign, but the image in (d) is misclassified as a German Keep Left sign.

First, we propose that if the first stage of a deep learning pipeline is robust to a class of perturbations, then the overall pipeline will also be robust against those perturbations. An example of binarization is a simple rounding filter that, when applied to an image, can remove perturbations on most pixels. Such a function is useful in images where there is a notion of a static background and only the presence of a single type of pixel defines the object. Previous works have demonstrated that, for MNIST, binarization is remarkably effective in improving adversarial robustness with respect to small pixel perturbations [3, 6]. In Section 3.1, we present theoretical reasons why the use of a binarizer improves robustness even without adversarial training for the special case of a linear classifier, and in Section 4 we present experimental results on MNIST that show improved adversarial accuracy with this simple, yet powerful idea.

Our second proposal is a generalization of binarization: to use one or more simpler image features (e.g., color and shape for objects) that are expected to be robust to adversarial perturbations. Consider the domain of traffic sign images in the US: a standard Stop traffic sign is known to be predominantly red and with octagonal shape. Traditional adversarial attacks on images change neither feature as there is a constraint to maintain the visual appearance of the original input (e.g., Figure 1). Thus, it is apparent that standard classifiers do not learn to prioritize these features, shape and color, for labeling the sign. Rather, other predictive, non-robust features, are learned, which are then exploited by the adversary so as to maintain the visual appearance of the Stop sign, while causing the predicted label to change. Our goal, then, is to make classifiers more robust by explicitly factoring in any known discriminating features that are robust to perturbations on a large subset of the input space.

3.1 Binarization

Figure 2: (a) A max-margin linear classifier, trained over pure data points $\bar{X}$, results in a large adversarial input space. Binarizing test data to the nearest neighbor in $\bar{X}$ before classification removes these adversarial inputs completely. (b) When $\bar{X}$ is not known, binarization to the nearest lattice point reduces the adversarial input space.

In order to remove spurious noise learned by a DNN, we propose binarization or a snapping of input data to desired intervals. Experimentally, we found that about 82% of the pixels in MNIST images are concentrated near 0 and 8% are concentrated near 1. The remaining pixels are somewhat evenly distributed between 0.1 and 0.9. We observed that adversarial attacks often changed background pixels, and if the changes were removed, the classifier would correctly label the example. Previous work suggests that a binarization function, which rounds all the pixels to {0,1} based on a threshold, can improve robustness of the resulting classifier [3, 6]. Although its name suggests rounding values in [0, 1] to {0, 1}, we define binarization more generally:

Definition 3.

Consider a set $B$. Any function $b$ that maps each data point in the input space to an element of $B$ is called a binarizer, and $b(x)$ is referred to as the binarization of $x$.

Typically, $B$ is chosen to be much smaller in cardinality than $X$. Suppose the binarizer $b$ is defined with respect to a distance $d$ such that $b(x)$ is the nearest neighbor of $x$ in $B$, i.e., $b(x) = \arg\min_{z \in B} d(x, z)$. If $B = \bar{X}$, we get a binarizer that maps any data point to its nearest neighbor in the pure set of points $\bar{X}$. If $B = \{0, 1\}^n$ and the inputs lie in $[0, 1]^n$, we get the vanilla form of binarization, where every pixel is rounded to 0 or 1. One could also define a binarizer with respect to a threshold $t$, e.g., $b_t(x) = \mathbb{1}[x \ge t]$, which rounds each element to $\{0, 1\}$ based on whether the coordinate-wise value is less than the threshold or not [Footnote 3: Here, $\mathbb{1}[x \ge t]$ is simply an indicator vector for whether $x \ge t$ is true or not.].
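As a concrete illustration of the two kinds of binarizer described above, the following is a minimal NumPy sketch. The function names, the default threshold of 0.5, the toy pixel values, and the choice of the L-infinity distance for the nearest-neighbor variant are all assumptions made for this example rather than details taken from the paper.

```python
import numpy as np

def threshold_binarizer(x, t=0.5):
    """Round each coordinate to 0 or 1: values below t become 0, all others 1.
    The default t = 0.5 is an assumption for illustration."""
    return (np.asarray(x, dtype=float) >= t).astype(float)

def nearest_neighbor_binarizer(x, candidates):
    """Map x to its nearest neighbor among the rows of `candidates`.
    The L-infinity distance is an assumed choice of distance function."""
    candidates = np.asarray(candidates, dtype=float)
    dists = np.max(np.abs(candidates - np.asarray(x, dtype=float)), axis=1)
    return candidates[np.argmin(dists)]

# A toy four-pixel "image" and a perturbed copy (each pixel moved by 0.2).
x = np.array([0.02, 0.97, 0.10, 0.88])
x_perturbed = x + np.array([0.2, -0.2, 0.2, -0.2])

# Hypothetical candidate set for the nearest-neighbor binarizer.
candidates = np.array([[0, 0, 0, 0], [1, 1, 1, 1], [0, 1, 0, 1]])

print(threshold_binarizer(x), threshold_binarizer(x_perturbed))
print(nearest_neighbor_binarizer(x_perturbed, candidates))
```

In this toy case both binarizers map the clean and the perturbed input to the same point, so any classifier applied afterwards cannot distinguish them.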

We show in Section 4 that the proposed binarization has a minimal effect on the standard accuracy of the classifier, but greatly improves the adversarial robustness. Furthermore, binarization can be combined with adversarial training. The combination achieved both an order of magnitude faster training time and higher adversarial accuracy as compared to Madry et al. [12], with similar test accuracy.

Why does binarization help?

To see why binarization works in practice, consider the example of a support vector machine that computes a max-margin linear classifier. In Figure 2(a), suppose the set of “pure” data points $\bar{X}$ consists of the green and red dots, and their $\epsilon$-neighborhoods in the $L_\infty$ norm are the colored squares enclosing them. In this example, the pure data points are linearly separable, although the $\epsilon$-neighborhoods are not. We depict the max-margin linear classifier with a solid line that separates the green points from the red points. Clearly, this example has a large adversarial instance space (pink and bright green regions in Figure 2(a)): these points belong to the $\epsilon$-neighborhood of some pure data point, yet they would be misclassified by the linear classifier. On the other hand, if a data point were first binarized to its nearest neighbor in $\bar{X}$, this would completely remove adversarial instances, and we could obtain a perfect classifier even with a support vector machine as the underlying classification technique. This point is important enough to be stated again:

Augmenting the classification pipeline with a nearest-neighbor mapping increases the power of linear classification to allow non-linear separability (blue decision boundary in Figure 2).


Note that the resultant model from augmenting binarization and linear classification (see Figure 3) is not only powerful in removing adversarial samples but also does so in an interpretable way. We formalize this example in the theorem below.

Theorem 1.

Consider a max-margin classifier $f$ trained on a set $\bar{X}$ that is linearly separable. Consider a distance function $d$ and a parameter $\epsilon > 0$. For any two data points $x, y \in \bar{X}$, suppose that the neighborhoods $N_\epsilon(x)$ and $N_\epsilon(y)$ are disjoint whenever the ground truth labels differ, i.e., $g(x) \neq g(y)$. Consider a nearest-neighbor binarizer $b$ and the max-margin linear classifier $f$ (trained over $\bar{X}$); then the augmented classifier $f \circ b$ is robust over $\bar{X}$ with respect to $\epsilon$ and exact over the neighborhood [Footnote 4: By exact, we mean no errors in classification.].

Figure 3: The MNIST model with a binarization function $b$ and classifier $f$.

The theorem holds because any perturbed point $x' \in N_\epsilon(x)$ is uniquely (and correctly) mapped to the original (unperturbed) data point $x$ by the nearest-neighbor binarizer, and these points are perfectly classified by $f$ since $\bar{X}$ is linearly separable (therefore $f$ did not introduce errors on data points in $\bar{X}$). Since the $\epsilon$-neighborhoods of oppositely classified points do not overlap, we are able to perfectly classify the perturbed points using a linear classifier composed with nearest-neighbor matching.
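The argument behind Theorem 1 can be checked on a small toy instance. The sketch below is only an illustration under stated assumptions: the four 2-D pure points, the radius `EPS`, and the hand-picked separating line (standing in for a trained max-margin classifier) are hypothetical, and the L-infinity distance is assumed. The pure points are linearly separable and opposite-class neighborhoods are disjoint, yet a corner of one neighborhood crosses the linear boundary.

```python
import numpy as np

# Toy pure data points (rows) with ground truth labels; values are illustrative.
X_pure = np.array([[0.0, 0.0], [2.0, 1.0], [0.0, 1.0], [2.0, 2.0]])
y_pure = np.array([0, 0, 1, 1])
EPS = 0.45  # L-infinity radius; opposite-class points are at distance >= 1 > 2 * EPS

def linear_classifier(x):
    """Hand-picked separating line x2 = 0.5 * x1 + 0.5 (stand-in for a trained SVM)."""
    return int(x[1] - 0.5 * x[0] - 0.5 > 0)

def nn_binarizer(x):
    """Snap x to the nearest pure point under the L-infinity distance."""
    return X_pure[np.argmin(np.max(np.abs(X_pure - x), axis=1))]

def augmented_classifier(x):
    """Binarize first, then apply the linear classifier."""
    return linear_classifier(nn_binarizer(x))

# A corner of the neighborhood around the pure point (0, 0), whose label is 0.
x_adv = X_pure[0] + np.array([-EPS, EPS])
print("linear only:", linear_classifier(x_adv))
print("binarizer + linear:", augmented_classifier(x_adv))
```

Here the plain linear classifier assigns the perturbed point to the wrong class, while the augmented classifier recovers the label of the pure point it was perturbed from.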

In the case when $\bar{X}$ is not known, we use binarization to map training/testing data points to the nearest points in a lattice, e.g., the set of all 0/1 vectors $\{0, 1\}^n$. This binarizer naturally acts as a regularizer for the output function, since outputs in the neighborhoods of lattice points cannot change with small perturbations. The classification boundary over 0/1 vectors is much simpler than over the original (non-binarized) adversarial data. In our experiments, we augment a DNN with a lattice binarizer, which already gives compelling experimental results without adversarial training. We depict the reduction in adversarial input space in Figure 2(b).

3.2 Group Feature Extraction

A generalization of binarization is to extend the notion to a collection of features (such as color, shape, or size) that are found to be robust to perturbations. We think of a data point as a member of a group defined by the value of such a feature (e.g., Stop and Do Not Enter are members of the “red” color group). Given a predominantly red US traffic sign, it will require a large perturbation to change the majority of the sign to another sign color. However, unlike binarization to a lattice or to $\bar{X}$, features like color or shape lie in a much lower-dimensional space and lose the finer classification information. We propose two architectures for classification that can incorporate robust group features: (i) intersection of multiple group features, and (ii) augmentation with the original classifier.

In the first architecture (Figure 4(a)), we propose to use multiple robust group feature extractors $h_1, \ldots, h_m$, each of which feeds its feature into a mapping $p_i$ to get a subset of possible labels. For example, suppose $h_1$ extracts the dominant sign color (e.g., red); then $p_1$ can map the color to the set of possible road signs with that color (e.g., map “red” to {Stop, Do Not Enter}). This may not be enough information for the finer classification; however, adding another group extractor (e.g., for shape) would allow one to classify more precisely (e.g., distinguish Stop from Do Not Enter). The classifier output is simply the intersection of the possible labels given the extracted robust group features. We show that if all $h_i$ are robust, the resultant classifier formed by intersection is also robust.

Figure 4: Basic architecture of a robust classification network using group classifiers. (a) Multiple group feature extractors. (b) Augmented classifier design.

Why do group features help improve robustness?

Recall that in Section 2, we defined $f$ as robust over a subset $S \subseteq X$ with respect to $\epsilon$ if for all $x \in S$: $f(x') = f(x)$ for all $x' \in N_\epsilon(x)$. For a given classification task that attempts to classify $X$ to labels in $[k]$, a group feature extractor can be viewed as a function $h$ that is robust with respect to $\epsilon$ and maps $X$ to features in a set $G$ (typically, $|G| < k$). When referring to the architecture, we also refer to $h$ as a group feature extractor. The intuition here is that, if a group feature is known, then designing a feature extractor $h$ that is robust and accurate is an easier task than learning a robust and accurate function $f$. Further, let $p$ map a group feature in $G$ to the set of possible labels given that feature. We next show that the robustness guarantees naturally follow under function composition of $p$ and $h$:

Theorem 2.

Consider a group feature extractor $h$ that is robust on some subset of inputs $S \subseteq X$ with respect to $\epsilon$, and a potential-label mapping $p$. Then the composition $p \circ h$ is also robust on $S$ with respect to $\epsilon$.

Theorem 2 holds since the internal group feature extractor $h$ acts as a shock absorber and the function $p$ is oblivious to the noise. Indeed, for any $x' \in N_\epsilon(x)$ with $x \in S$, $h(x') = h(x)$ (due to the robustness of $h$), and therefore $p(h(x')) = p(h(x))$, i.e., $p \circ h$ is robust on $S$ with respect to $\epsilon$. Robustness guarantees also hold in the case of intersection of multiple robust features:

Theorem 3.

Consider a set of robust feature extractors $h_1, \ldots, h_m$ that are robust on some subset of inputs $S \subseteq X$ with respect to $\epsilon$, and a sequence of potential-label mappings $p_i$ for $i \in [m]$. Then the classifier that results from intersecting these, $F(x) = \bigcap_{i=1}^{m} p_i(h_i(x))$, is robust on $S$ with respect to $\epsilon$.

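Theorem 3 holds trivially if $m = 1$. Now consider $m > 1$ and the intersection classifier $F$ as defined. Then, for any $i \in [m]$, we have $p_i(h_i(x')) = p_i(h_i(x))$ using Theorem 2. Therefore, $F(x') = F(x)$ for all $x \in S$ and $x' \in N_\epsilon(x)$.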
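The composition and intersection in Theorems 2 and 3 can be sketched in a few lines of Python. Everything named here is hypothetical: the inputs are toy dictionaries rather than images, the extractors `extract_color` and `extract_shape` are stubs, and the lookup tables list only a handful of signs for illustration.

```python
# Hypothetical lookup tables: group feature value -> set of labels sharing it.
LABELS_BY_COLOR = {
    "red": {"Stop", "Do Not Enter"},
    "yellow": {"Signal Ahead"},
}
LABELS_BY_SHAPE = {
    "octagon": {"Stop"},
    "square": {"Do Not Enter"},
    "diamond": {"Signal Ahead"},
}

def extract_color(x):
    """Stub color extractor: treats hues within 30 degrees of 0 as red."""
    return "red" if x["hue"] < 30 or x["hue"] > 330 else "yellow"

def extract_shape(x):
    """Stub shape extractor keyed on the number of corners."""
    return {8: "octagon", 4: "square"}.get(x["corners"], "diamond")

def classify_by_groups(x, extractors, mappings):
    """Intersect the label sets obtained from each (extractor, mapping) pair."""
    label_sets = [mapping[extract(x)] for extract, mapping in zip(extractors, mappings)]
    return set.intersection(*label_sets)

EXTRACTORS = [extract_color, extract_shape]
MAPPINGS = [LABELS_BY_COLOR, LABELS_BY_SHAPE]

sign = {"hue": 5.0, "corners": 8}
perturbed_sign = {"hue": 12.0, "corners": 8}  # small change that leaves both features intact
print(classify_by_groups(sign, EXTRACTORS, MAPPINGS))
print(classify_by_groups(perturbed_sign, EXTRACTORS, MAPPINGS))
```

Because neither extractor changes its output under the small perturbation, the intersected label set is identical for both inputs, which is the behavior the theorems describe.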

One limitation of the above architecture is that we may not know sufficient robust features to make an unambiguous classification. To address this, we propose the augmented architecture (Figure 4(b)). Specifically, we deploy two networks in parallel: the group-feature-based network $F$ of Figure 4(a) and a standard classifier $f$. The output of the group-based network is a set of classification possibilities, and we require the outputs of $F$ and $f$ to be consistent with each other. This prevents targeted attacks on $f$ that change its output to a label outside $F(x)$, e.g., changing a Stop label (red) to a traffic light ahead (yellow) label, thus reducing the adversarial attack space.

This idea itself is quite powerful since it helps the DNN flag outputs where there might be an inconsistency. Consider an augmented classifier that pairs a standard classifier $f$ with a group-based network $F$. When $f(x) \notin F(x)$ and $F$ is exact (i.e., makes no errors), we know that $x$ was definitely misclassified by $f$. This can be very useful in practice, where a machine can flag a difficult instance of data and let an oracle (or a human) take over in these cases until $f$ can be made more accurate. We formalize this in Appendix C. We show next that in some datasets, like US traffic signs, these ideas can help develop more robust classifiers.
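The consistency check in the augmented architecture can be sketched as follows. This is a minimal illustration, not the paper's implementation: `augmented_predict`, the toy classifiers, and the dictionary-based inputs are all hypothetical stand-ins.

```python
def augmented_predict(x, base_classifier, group_classifier):
    """Return the base classifier's label plus a flag that is True when the
    label is inconsistent with the label set allowed by the group features."""
    label = base_classifier(x)
    allowed = group_classifier(x)
    return label, label not in allowed

# Toy stand-ins: inputs carry the (possibly attacked) DNN output and the
# dominant color that a robust extractor would report.
def toy_base_classifier(x):
    return x["dnn_label"]

def toy_group_classifier(x):
    by_color = {"red": {"Stop", "Do Not Enter"}, "yellow": {"Signal Ahead"}}
    return by_color[x["dominant_color"]]

clean = {"dnn_label": "Stop", "dominant_color": "red"}
attacked = {"dnn_label": "Signal Ahead", "dominant_color": "red"}

print(augmented_predict(clean, toy_base_classifier, toy_group_classifier))
print(augmented_predict(attacked, toy_base_classifier, toy_group_classifier))
```

A flagged input could then be handed to a human or another fallback instead of being acted on directly.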

4 Experimental Results

In this section, we present two sets of experiments demonstrating how our robust feature theory can be applied to improve classifier performance in an adversarial environment.

1. Binarization Augmentation on MNIST:

We start with a simple classification task, digit classification on the MNIST dataset [14], and show that a binarization function both improves adversarial robustness and reduces training time compared to adversarial training to achieve a similar level of adversarial robustness. We use the pre-trained natural and adversarially trained MNIST classifiers used by Madry et al. [12]. For the attack, we use the PGD momentum attack code created by Zheng et al. [23]. Our experiments compare four models, two of which use proposed binary augmentation:

  1. Natural Model (Natural): Madry et al.’s pre-trained natural classifier.

  2. Madry et al.’s Adv. Trained Model (MAT): Madry et al.’s pre-trained robust classifier.

  3. Binarized Natural Model (BIN): A natural classifier with a binarization function as the first processing step, trained on the natural training data (no adversarial training).

  4. Binarized Adv. Trained Model (BAT): A classifier with a binarization function at the input, with the overall classifier trained on adversarially perturbed training data.

Model      Test Acc.   Adv. Acc.
Natural    99.17%      0%
BIN*       98.93%      74.64%
MAT        98.04%      89.72%
BAT*       99.29%      91%

Table 1: The accuracy of each model evaluated against the MNIST test set and $L_\infty$ perturbations within the $\epsilon$-neighborhood.

All models use the same model architecture (the same as used in [12]). BIN and BAT include a binarization function, encoded as a step function centered at a threshold $t$, at the input of the network. Any pixel below $t$ is set to 0; otherwise it is set to 1. For BAT and MAT, to generate adversarial examples in $N_\epsilon(x)$ for any given $x$, we run 100 iterations of the PGD attack with a step size of 0.0075 and 20 random restarts. As in the original experiments by Madry et al. [12], an adversarial attack on a particular input sample is considered successful if at least one of the 20 generated adversarial perturbations succeeds in changing the predicted label. For BAT, since the step function is non-differentiable, we use the Backward Pass Differentiable Approximation (BPDA) technique to generate good adversarial examples, as suggested by Athalye et al. [1].

We first evaluated the test and adversarial accuracy of all four models for a fixed $\epsilon$ (i.e., using PGD to find adversarial examples for an input $x$ within $N_\epsilon(x)$; see Table 1). We observe that binarization greatly improves the adversarial accuracy of Natural from 0% to 74.64% despite no adversarial examples being used during training. We see that BAT, the binarized implementation of MAT, improved adversarial accuracy from 89.72% to 91.14%. Test accuracy was over 98% for all models.

We next measured the adversarial accuracy of all four models for different values of $\epsilon$ between 0 and 0.5. We emphasize that MAT and BAT are still trained for the original $\epsilon$; only the attacker’s capabilities are changed. In Figure 5, we see that binarization is likely reducing the attack space for large $\epsilon$ (e.g., at one such larger $\epsilon$, BIN outperforms MAT with an adversarial accuracy of 64.11% versus 34.88%). Also, adversarial training used with binarization further improves the robustness of the classifier (e.g., BAT has an adversarial accuracy of 87.13% at the same $\epsilon$, more than double that of MAT).

Figure 5: The adversarial performance during testing (left) and training (right). Not shown in the figure: MAT and BAT take approximately 10x more time per training iteration than BIN.

The above findings can be particularly important in settings where adversarial training is infeasible, say for learning on edge computing devices with a smaller computational budget. BIN itself, with no adversarial training, results in significant initial adversarial robustness. In MAT and BAT, each iteration of training is much more expensive since a PGD attack is executed to create a set of adversarial training examples. To further analyze the training efficiency, we evaluated the adversarial accuracy every 300 training iterations for both binarized models and MAT [Footnote 5: All training was done on a 12GB Titan X Pascal GPU.]. Adversarial examples within the $\epsilon$-neighborhood were generated using 100 iterations of the PGD attack with a step size of 0.0075 and no random restarts [Footnote 6: 40 iterations with a step size of 0.01 is about twice as fast, but the adversarial accuracy of the model suffers.]. These results are shown in Figure 5. We observe that although MAT starts achieving a higher adversarial accuracy than BIN after about 30,000 iterations, each training iteration for MAT took 236 ms versus 22 ms for BIN. As a result, BIN achieved 80% adversarial accuracy after about 2.9 minutes of training versus 96 minutes of training for MAT. In BAT, where binarization is used during adversarial training, we see large reductions in the training time required for comparable adversarial accuracy. BAT achieved 80% adversarial accuracy in about 3.6 minutes and 90% accuracy in about 19 minutes. MAT only achieved 90% after 273 minutes of training (14x slower than BAT).
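The speedup figures quoted above follow directly from the reported timings. The short calculation below simply redoes that arithmetic; the variable names are ours, and the implied iteration counts are back-of-the-envelope estimates derived from the reported per-iteration times, not numbers stated in the text.

```python
# Timings as reported in the text.
MS_PER_ITERATION = {"BIN": 22, "MAT": 236}
MINUTES_TO_80 = {"BIN": 2.9, "BAT": 3.6, "MAT": 96}
MINUTES_TO_90 = {"BAT": 19, "MAT": 273}

# Wall-clock speedups relative to MAT.
speedup_80 = MINUTES_TO_80["MAT"] / MINUTES_TO_80["BIN"]
speedup_90 = MINUTES_TO_90["MAT"] / MINUTES_TO_90["BAT"]

# Cost of one MAT iteration relative to one BIN iteration.
per_iteration_ratio = MS_PER_ITERATION["MAT"] / MS_PER_ITERATION["BIN"]

# Rough iteration counts implied by the time to reach 80% adversarial accuracy.
bin_iterations_to_80 = MINUTES_TO_80["BIN"] * 60_000 / MS_PER_ITERATION["BIN"]
mat_iterations_to_80 = MINUTES_TO_80["MAT"] * 60_000 / MS_PER_ITERATION["MAT"]

print(f"80% adv. accuracy: BIN is {speedup_80:.1f}x faster than MAT")
print(f"90% adv. accuracy: BAT is {speedup_90:.1f}x faster than MAT")
print(f"MAT/BIN time per iteration: {per_iteration_ratio:.1f}x")
print(f"implied iterations to 80%: BIN ~{bin_iterations_to_80:.0f}, MAT ~{mat_iterations_to_80:.0f}")
```

The ratios come out at roughly 33x and 14x, matching the figures quoted in the text, with a per-iteration cost ratio of roughly 10x.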

2. Group Feature Extraction:

We now move to a more complex task, traffic sign classification, and demonstrate how using a robust function to extract a robust feature, the dominant color of a sign, can help reduce the adversarial attack space by preventing attacks that would change a classification across colors (e.g., a red Stop sign to a blue minimum speed 30 sign in Germany).

Based on the architecture shown in Figure 4(b), we augment a traffic sign classifier with a robust feature extraction pipeline responsible for determining the dominant color of the sign and mapping the color to a set of possible traffic signs. Simply described, the color extractor first determines the sign’s position in the image. Once the sign is located, the extractor assigns each pixel a label based on the closest color center in the hue color space (“red”, “blue”, or “yellow”), then outputs the color based on a weighted majority vote. A more detailed description of the extractor can be found in the appendix.
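To make the voting step concrete, here is a rough sketch of a hue-based dominant color extractor. It is not the paper's extractor: sign localization is skipped (the image is assumed to be already cropped to the sign), the hue centers of 0, 60, and 240 degrees are assumed values, and weighting each pixel's vote by saturation times brightness is our own choice of weighting.

```python
import colorsys
import numpy as np

# Assumed hue centers in degrees; the paper does not list its exact values here.
HUE_CENTERS = {"red": 0.0, "yellow": 60.0, "blue": 240.0}

def hue_distance(h1, h2):
    """Circular distance between two hues in degrees."""
    d = abs(h1 - h2) % 360.0
    return min(d, 360.0 - d)

def dominant_color(rgb_image):
    """rgb_image: H x W x 3 array with values in [0, 1], assumed cropped to the sign.
    Each pixel votes for its nearest hue center, weighted by saturation * value."""
    votes = {name: 0.0 for name in HUE_CENTERS}
    for r, g, b in np.asarray(rgb_image, dtype=float).reshape(-1, 3):
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        hue_deg = h * 360.0
        nearest = min(HUE_CENTERS, key=lambda name: hue_distance(hue_deg, HUE_CENTERS[name]))
        votes[nearest] += s * v
    return max(votes, key=votes.get)

# Toy 8x8 "sign": mostly red, with a band of white pixels (e.g., lettering).
image = np.tile(np.array([0.8, 0.1, 0.1]), (8, 8, 1))
image[3:5, :, :] = 1.0

# Small bounded perturbation (about 8/255 per channel), clipped to the valid range.
rng = np.random.default_rng(0)
perturbed = np.clip(image + rng.uniform(-0.03, 0.03, size=image.shape), 0.0, 1.0)

print(dominant_color(image), dominant_color(perturbed))
```

In this toy example the vote is dominated by strongly saturated red pixels, so a small per-pixel perturbation does not change the extracted color.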

Adversarial Target       # Adv. Images   Correction Rate
Blue Signs (GTSRB)       13633           93.53%
Yellow Signs (LISA)      2389            95.33%
Total # of Stop signs    3021

Table 2: “# Adv. Images” is the number of adversarial images for which the predicted label matched the adversarial target. The correction rate is the percentage of adversarial examples for which the color extractor outputs red.

We train a traffic sign classifier on a dataset composed of traffic sign images from both the LISA [15, 11] and GTSRB [7, 19] sign datasets (normal training). We then perform 20 iterations of a targeted $L_\infty$-bounded PGD attack with a fixed $\epsilon$ and a step size of 2. The goal is to perturb a Stop sign into a target sign class that is either blue or yellow. The performance is evaluated on 9 target sign classes (8 blue sign classes, 1 yellow sign class) and reported in Table 2.

Overall, the color extractor prevents over 93% of the above adversarial attacks that change Stop to a blue or yellow sign (Table 2). Of course, an attacker could attempt to attack the color extractor’s robustness assumption directly. Using the same set of Stop sign images, we explored the edges of the $\epsilon$-neighborhood of each image and checked whether the color extractor’s output changed at any point. From this, we found that the extractor is robust on approximately 75% of the Stop sign images. The images on which it is not robust were of poorer quality (e.g., very dark images), a potentially preventable problem given better lighting and better cameras. We include robustness measurements for attacks on color for different values of $\epsilon$ in Appendix B.


The existence of adversarial examples is attributed to a network’s reliance on predictive, but easily exploitable, features it learned during training. In this work, we introduced two methods of robust feature augmentation to mitigate this problem: binarizers and robust group features. Both map the input space to a smaller, more robust, subspace (like a lattice or group labels) and we formally describe these two methods to improve DNN robustness. Experimentally, we demonstrated how these methods can improve the adversarial robustness of a digit classifier and a traffic sign classifier. Furthermore, when adversarial training is used in conjunction with these methods, we were able to train a more adversarially robust model for MNIST 14x faster than without these methods.

We recognize that human identification of robust features may not be applicable to all machine learning tasks, especially if non-interpretable robust features exist. As such, it is important to develop techniques to identify such features, though doing so is beyond the scope of this work. However, concurrent work by Ilyas et al. has already shown some progress in this area through the use of adversarial training to remove non-robust features from training data [8]. We expect to see further research in which robust feature augmentation serves as a method for achieving adversarial robustness.


We thank Yashu Liu of Didi Labs for his helpful feedback on the earlier draft of the paper. This project is supported by Didi Chuxing. This material is based in part upon work supported by the National Science Foundation under Grant No. 1646392.


Appendix A Additional Background

Adversarial Training

Madry et al. [12] proposed the use of adversarial training in which they solve

$$\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}} \left[ \max_{\delta \in \mathcal{S}} L(\theta, x+\delta, y) \right]$$

In their formulation, $\mathcal{S}$ is the set of allowed perturbations. The loss function $L(\theta, x+\delta, y)$ quantifies the loss relative to the perturbed input $x+\delta$ and the original label $y$. The inner maximization problem seeks to find a perturbation $\delta \in \mathcal{S}$ that maximizes the loss for a given input $x$. The outer minimization problem aims to find the model parameters $\theta$ such that the expected adversarial loss in the inner maximization problem is minimized.

Projected Gradient Descent (PGD)

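Adversarial training of a model on input $x$ requires generating an adversarial example and then training the model on the adversarial example. Madry et al. use the PGD attack to generate adversarial examples [12]. The PGD attack is an iterative method in which at each step the input is modified based on the sign of the gradient of the loss function, which corresponds to a projected gradient descent step on the negative loss:

$$x^{t+1} = \mathrm{Clip}_{x+\mathcal{S}}\left(x^{t} + \alpha \, \mathrm{sgn}\left(\nabla_{x} L(\theta, x^{t}, y)\right)\right)$$

$\mathcal{S}$ is the set of allowed perturbations as defined previously, and $\alpha$ is the step size. $\mathrm{Clip}$ is a clip function, which ensures the perturbed input is within the allowable range.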
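The PGD iteration described above can be sketched in a few lines. This is a minimal illustration, not the implementation of Madry et al.: it assumes an L-infinity perturbation set of radius `eps`, a toy logistic-regression model (weights `w`, bias `b`) so that the input gradient has a closed form, and hypothetical default values for the step size `alpha` and the number of steps.

```python
import numpy as np

def loss_grad_wrt_input(w, b, x, y):
    """Gradient of the logistic loss with respect to the input x (label y in {0, 1})."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    return (p - y) * w

def pgd_attack(w, b, x, y, eps=0.3, alpha=0.01, steps=40, lo=0.0, hi=1.0):
    """L-infinity PGD: take a signed gradient step that increases the loss, then clip."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(loss_grad_wrt_input(w, b, x_adv, y))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # stay inside the allowed perturbation set
        x_adv = np.clip(x_adv, lo, hi)            # stay inside the valid input range
    return x_adv
```

For a DNN, the closed-form gradient would be replaced by the gradient computed through automatic differentiation; the clipping steps stay the same.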

Appendix B Traffic Experiment Details

B.1 Traffic Sign Dataset

Traffic signs are fairly standard across countries (e.g., see for classes of traffic signs and examples). LISA [11, 15] and German Traffic Sign Recognition Benchmark (GTSRB [7, 19]) are two popular traffic sign datasets that have been extensively used in previous studies.

We created a traffic sign dataset using images from both the LISA traffic sign dataset [11, 15] and the German Traffic Sign Recognition Benchmark (GTSRB [7, 19]). The LISA dataset contains images of 47 different U.S. traffic signs. However, there are large class imbalances (e.g., Stop has 1821 images and Speed Limit 55 has 2 images). To address this problem, we first combine the LISA training dataset with the GTSRB training dataset, which has images for 43 German traffic sign classes. The images labelled as Stop in both datasets are combined as they have the same visual appearance. Similarly, the images labelled as Do Not Enter and StreetClosedOneWay are combined.

The combined dataset still has low representation for some of the individual U.S. traffic signs. To address that, we created two superclasses composed of white rectangular U.S. traffic signs and yellow U.S. traffic signs. The first superclass contains U.S. Speed Limit signs and Right Lane Must Turn. The second superclass contains U.S. Warning signs and School, which are yellow. The 45 class labels in the augmented dataset are provided in Table 3.

With respect to the color extractor, we focused on signs of one of three dominant colors: red, yellow, and blue. U.S. red signs are generally regulatory in nature (e.g., Stop, DoNotEnter) and some examples are shown in Figure 6(a). U.S. yellow signs (see Figure 6(b)) are used for cautioning a user (e.g., IntersectionAhead, CurveRight, CurveLeft, School Zone). Blue signs (see Figure 6(c)) are common in Germany and can be restrictive or mandatory (e.g., KeepLeft, MandatoryLeftTurn, TrafficCircle, MandatoryAhead). Table 4 identifies the sign labels in the modified dataset that are either red or blue. For the purpose of classification, yellow signs are grouped together in a single class due to low representation with respect to the original sign labels (e.g., Intersection: 13 images, CurveLeft: 24 images, TurnRight: 24 images).

(a) Red sign examples
(b) Yellow sign examples
(c) Blue sign examples
Figure 6: Example images of signs for the three color classes we evaluated.
| Class Label | Class Label | Class Label |
| --- | --- | --- |
| speedLimit20 | streetClosedBothWays | wildlifeWarning |
| speedLimit30 | noTrucks | allRestrictionsEnd |
| speedLimit50 | generalWarning | mandatoryRightTurn |
| speedLimit60 | sharpLeftTurnAhead | mandatoryLeftTurn |
| speedLimit70 | sharpRightTurnAhead | mandatoryAhead |
| speedLimit80 | sequenceSharpTurnsAhead | mandatoryAheadOrRight |
| endSpeedLimit80 | bumpsInRoad | mandatoryAheadOrLeft |
| speedLimit100 | slipperyRoad | keepRight |
| speedLimit120 | tighterRoadOnRight | keepLeft |
| noPassing | construction | trafficCircle |
| noPassingTrucks | trafficLight | endNoPassing |
| intersectionWarning | pedestrianCrossing | endNoPassingTrucks |
| rightOfWay | schoolCrossing | Yellow Signs |
| yield | bicycles | doNotEnter |
| stop | icyRoads | White Rectangles |

Total # of Signs: 44121

Table 3: Class labels of the LISA-GTSRB traffic sign dataset used in the experiments.
| Red | Blue |
| --- | --- |
| Stop | mandatoryRightTurn |
| Do Not Enter | mandatoryLeftTurn |

Table 4: Red and blue sign groupings. Yellow signs are not included as they have been grouped into a single label with respect to classification.

B.2 Model Description

We use a publicly available implementation of a multi-scale DNN architecture [22]. The architecture description is given in Table 5. Before training, we triple the size of any class with fewer than 200 images through oversampling and random perturbations of each image. We use K-fold cross-validation with 10 splits and, for each split, we train the model for 50 passes over the entire training split. Our trained model has 97.51% test accuracy based on the GTSRB test dataset containing 12630 images. Our model was not adversarially trained, but the augmented pipeline does allow for an adversarially trained classifier, which may further improve robustness.
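The oversampling step can be illustrated with a short sketch. The kind of random perturbation is not specified in the text, so the additive Gaussian noise below (and its scale `noise_std`) is an assumption made purely for illustration; the function name and parameters are hypothetical. Pixel values are clipped to [-0.5, 0.5], the input range stated for the model.

```python
import numpy as np

def oversample_small_classes(images, labels, min_count=200, factor=3,
                             noise_std=0.02, seed=0):
    """Grow every class with fewer than min_count images to factor times its size."""
    rng = np.random.default_rng(seed)
    out_images, out_labels = [images], [labels]
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        if len(idx) >= min_count:
            continue  # class is large enough, leave it untouched
        for _ in range(factor - 1):
            # Assumed perturbation: small Gaussian pixel noise on each copy.
            noisy = images[idx] + rng.normal(0.0, noise_std, images[idx].shape)
            out_images.append(np.clip(noisy, -0.5, 0.5))
            out_labels.append(labels[idx])
    return np.concatenate(out_images), np.concatenate(out_labels)
```

With the default `factor=3`, a class of 40 images ends up with 120 images (the originals plus two perturbed copies of each), while classes at or above the threshold are unchanged.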

| Layer Type | Number of Channels | Filter Size | Stride | Activation |
| --- | --- | --- | --- | --- |
| conv | 3 | 1x1 | 1 | ReLU |
| conv | 32 | 5x5 | 1 | ReLU |
| conv | 32 | 5x5 | 1 | ReLU |
| maxpool | 32 | 2x2 | 2 | - |
| conv | 64 | 5x5 | 1 | ReLU |
| conv | 64 | 5x5 | 1 | ReLU |
| maxpool | 64 | 2x2 | 2 | - |
| conv | 128 | 5x5 | 1 | ReLU |
| conv | 128 | 5x5 | 1 | ReLU |
| maxpool | 128 | 2x2 | 2 | - |
| FC | 1024 | - | - | ReLU |
| FC | 1024 | - | - | ReLU |
| FC | 43 | - | - | Softmax |

Table 5: Traffic sign model architecture. The model expects images as input with values in the range [-0.5, 0.5].

B.3 Color Extraction Algorithm

We designed a basic color extractor for traffic sign classification. The extractor involves 2 steps:

  1. Sign Localization - Determine the sign’s location in the image

  2. Color Classification - Determine the dominant color of the sign

The full pipeline is shown in Figure 7.

Figure 7: The color extractor pipeline. We show the step-by-step process for a Stop image.

B.3.1 Sign Localization

Before we can evaluate the dominant color of the sign, we must first identify which pixels in the image correspond to the surface of the sign. Due to the presence of noisy images, like those shown in Figure 9, edge detection and contour extraction algorithms perform poorly. Instead, given a three channel color image, (r,g,b), we normalize each individual channel by the image intensity and compute a chromaticity map (C) and 4 color maps (R, G, B, Y) [9, 10].
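The exact map definitions are not reproduced in this text, so the following is only a sketch under stated assumptions. The four color maps use the broadly tuned color channels common in saliency models, and the chromaticity map is assumed to be the spread between the largest and smallest intensity-normalized channel. Both choices, and the function name `color_maps`, are illustrative rather than the definitions used in the experiments.

```python
import numpy as np

def color_maps(img):
    """img: H x W x 3 float array in [0, 1]. Returns the maps (C, R, G, B, Y)."""
    intensity = img.mean(axis=2)
    safe = np.where(intensity > 0, intensity, 1.0)  # avoid dividing by zero on black pixels
    r, g, b = (img[..., k] / safe for k in range(3))
    # Assumed broadly tuned color channels, with negative responses set to zero.
    R = np.maximum(r - (g + b) / 2, 0)
    G = np.maximum(g - (r + b) / 2, 0)
    B = np.maximum(b - (r + g) / 2, 0)
    Y = np.maximum((r + g) / 2 - np.abs(r - g) / 2 - b, 0)
    # Assumed chromaticity measure: zero for gray pixels, large for saturated ones.
    stacked = np.stack([r, g, b])
    C = stacked.max(axis=0) - stacked.min(axis=0)
    return C, R, G, B, Y
```

Under these assumptions a gray pixel produces zero in every map, while a saturated red pixel responds only in R and C.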

Afterwards, all of the maps are converted to binary images based on the mean of the non-zero values in each map. We use the binary image of C as a mask on each of the binarized color maps to isolate the chromatic colors in each map. Finally, each channel is scored based on the number of non-zero pixels in the image. If fewer than 10% of the pixels in each of the four color channels are white (non-zero), the inverted binary chromaticity map is output. Otherwise, the binarized color channel with the highest score is output. We make one optimization based on the fact that in most of the images, the traffic sign is centered in the image. As such, we restrict thresholding and scoring to a small box around the center of the image. In our experiments, we used a 10 by 10 box.
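The thresholding and scoring procedure above can be sketched as follows. This is one possible reading of the description: thresholds and scores are computed inside the center box, while the returned mask covers the whole image. The function names, the dictionary input, and the `None` return value for the achromatic case are hypothetical.

```python
import numpy as np

def threshold(m):
    """Mean of the non-zero values of a map (infinite if the map is all zero)."""
    nonzero = m[m > 0]
    return nonzero.mean() if nonzero.size else np.inf

def localize_sign(C, maps, box=10, min_frac=0.10):
    """C: chromaticity map; maps: dict of name -> color map. Returns (name, mask)."""
    h, w = C.shape
    top, left = max((h - box) // 2, 0), max((w - box) // 2, 0)
    win = (slice(top, top + box), slice(left, left + box))
    c_bin = C >= threshold(C[win])
    # Binarize each color map and keep only its chromatic pixels.
    binarized = {k: (m >= threshold(m[win])) & c_bin for k, m in maps.items()}
    scores = {k: int(bm[win].sum()) for k, bm in binarized.items()}
    best = max(scores, key=scores.get)
    if scores[best] < min_frac * c_bin[win].size:
        return None, ~c_bin  # no color channel is strong enough: inverted chromaticity map
    return best, binarized[best]
```

Checking only the best score against the 10% threshold is equivalent to checking all four channels, since the best score is the largest of them.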

B.3.2 Color Classification

The output of the sign localization step is used to mask the original color image and remove background pixels during color extraction. The image is converted to a hue-based representation (e.g., HSV or HSL). Three predefined color centers (red, yellow, and blue) are used to label each non-zero pixel in the masked image based on the closest color center. Afterwards, a weighted majority vote is computed (i.e., the weight of a pixel’s vote increases the closer it is to the center) and the color with the most votes is chosen.
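A minimal sketch of this voting scheme is given below. The hue centers (0, 60, and 240 degrees) and the linear weighting are assumptions chosen for illustration; the text does not state the values or the weighting function used in the experiments.

```python
import colorsys
import numpy as np

# Assumed hue centers in degrees; the actual centers are not given in the text.
CENTERS = {"red": 0.0, "yellow": 60.0, "blue": 240.0}

def hue_distance(h1, h2):
    """Circular distance between two hues, in degrees."""
    d = abs(h1 - h2) % 360.0
    return min(d, 360.0 - d)

def classify_color(img, mask):
    """img: H x W x 3 RGB floats in [0, 1]; mask: H x W boolean sign mask."""
    votes = dict.fromkeys(CENTERS, 0.0)
    for r, g, b in img[mask]:
        hue = colorsys.rgb_to_hsv(r, g, b)[0] * 360.0
        name = min(CENTERS, key=lambda c: hue_distance(hue, CENTERS[c]))
        # Assumed weighting: the vote shrinks linearly with distance to the center.
        votes[name] += 1.0 - hue_distance(hue, CENTERS[name]) / 180.0
    return max(votes, key=votes.get)
```

Because hue is circular, the distance wraps around at 360 degrees so that hues slightly below 360 are still counted as close to red.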

For these proof-of-concept experiments, we chose to detect only red, yellow, and blue, as these are the three most common colors in the dataset. We did not handle colors such as brown or green as there were no signs in the dataset with these colors. Traffic signs that are white do exist, but white is not characterized by hue and is instead based on the values of the other channels. As such, the color extractor is not robust for predominantly white signs, and our analysis did not focus on such signs. This does not hurt the test accuracy of the augmented model, though, as we can include “white” sign labels in the group labels for all three colors. When we augment the classifier with the color extractor, the test accuracy on the GTSRB test dataset is still 97.51%. Extending the color extractor to extract other colors, or even multiple colors, for finer-grained color-based classification, is future work.

B.4 Robustness of the Color Extractor

In Section 4, we presented the results on the robustness of the color extractor on STOP images for an -bounded attack with . The evaluation involved shifting one or more color channels by in both the positive and negative directions. In Figure 8, we show the measure of the color extractor on STOP images for varying values of . We observe that the robustness of the color extractor is extremely high for small values of , and then steadily decreases. Upon closer examination, we find that many of the points the color extractor is non-robust on for small values of are points that are very close to a different color boundary, often due to noisy images. We provide a few examples in Figure 9. In some cases, like in Figures 9(a) and 9(b), the sign has a bluish tint, often due to poor lighting. In other cases, like Figure 9(c), the blurriness hinders correct sign localization (see Section B.3.1). The difference in robustness between blue and yellow for higher values is due to a smaller hue distance between red and yellow as compared to between red and blue. For smaller values of , the difference is due to dataset artifacts: more Stop signs with very poor lighting in the dataset were closer to having a bluish hue than a yellowish hue (see Figure 9 for a few examples).
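The channel-shift evaluation described above can be sketched as follows. The sketch assumes that each of the three color channels is shifted uniformly by the negative bound, zero, or the positive bound, and that every such combination is checked; the function name and the `extract_color` callback are hypothetical stand-ins for the color extractor.

```python
import itertools
import numpy as np

def is_color_robust(extract_color, img, eps, lo=0.0, hi=1.0):
    """Check whether the extracted color survives every per-channel shift of +/- eps."""
    base = extract_color(img)
    for shift in itertools.product((-eps, 0.0, eps), repeat=3):
        if not any(shift):
            continue  # skip the unshifted image
        shifted = np.clip(img + np.array(shift), lo, hi)
        if extract_color(shifted) != base:
            return False
    return True
```

The fraction of images for which this check returns true, computed over a range of bounds, gives a robustness curve of the kind reported for the color extractor.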

Figure 8: The robustness of the color classifier for STOP when changing to blue or yellow signs as L bound increases.
(a) Close to Blue
(b) Close to Blue
(c) Close to Yellow
Figure 9: Some examples of inputs the color classifier is not robust on. Often, this occurs due to either the image being too dark (which tends to shift colors to blue) or the image being too blurry (which causes errors during sign localization).

Appendix C Additional Theorem

Theorem 4.

Consider a classifier and suppose we have access to a group feature extractor as well as a labels mapping . Consider the augmented classifier . If is robust over with respect to , then for all , is non-empty if and only if .

The above theorem holds because robustness of implies robustness of from Theorem 2. Thus, the label of for both and must be in , ruling out targeted attacks that change the label of to a label not in .

As an example scenario of the above theorem, suppose is an image of a stop sign. is determined to be red. Then, is the set of sign labels that can be red, e.g., a set including the Stop sign and Do Not Enter sign. Let’s assume the normal case in which classifies the sign correctly. Then, will also give a correct classification. Furthermore, for an arbitrary input , since due to robustness of , the label of is restricted to be either or in the set of red signs, . implies an inconsistency between the two outputs of and on input , suggesting a problem, which may require human inspection or another intervention to resolve. A non-empty result implies that the two inputs are of the same color, though not necessarily the same label.
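The consistency check in this scenario can be illustrated with a small sketch. It assumes the augmented classifier returns the intersection of the classifier's predicted label with the set of labels allowed for the extracted group, so that an empty result signals an inconsistency. All names below are hypothetical, and the symbols of the theorem are not reproduced here.

```python
def augmented_classify(classify, extract_group, group_labels, x):
    """Return {label} if it is consistent with the group feature, else an empty set."""
    label = classify(x)
    allowed = group_labels[extract_group(x)]  # labels permitted for this group, e.g. red signs
    return {label} & allowed
```

For example, if the color extractor reports red and the classifier outputs a blue sign label, the result is empty and the input can be flagged for inspection.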