1 Introduction
Models produced by machine learning algorithms, especially deep neural networks, are being deployed in domains where trustworthiness is a major concern, creating the need for higher accuracy and assurance [russell2015letter, seshiaarxiv16]. However, learning high-accuracy models using deep learning is limited by the need for large amounts of data and, even further, by the need for labor-intensive labeling.
Data augmentation overcomes the lack of data by inflating training sets with label-preserving transformations, i.e., transformations which do not alter the label. Traditional data augmentation schemes [dataAugmentation, simard2003best, cirecsan2011high, ciregan2012multi, krizhevsky2012imagenet] involve geometric transformations, which alter the geometry of the image (e.g., rotation, scaling, cropping, or flipping), and photometric transformations, which vary color channels. The efficacy of these techniques has been demonstrated recently (see, e.g., [xu2016improved, wong2016understanding]). Traditional augmentation schemes, like the aforementioned methods, add data to the training set hoping to improve the model accuracy without taking into account what kind of features the model has already learned. More recently, a sophisticated data augmentation technique has been proposed [liang2017recurrent, marchesi2017megapixel] which uses Generative Adversarial Networks [goodfellow2014generative], a particular kind of neural network able to generate synthetic data, to inflate training sets. There are also augmentation techniques, such as hard negative mining [shrivastava2016training], that inflate the training set with targeted negative examples with the aim of reducing false positives.
In this work, we propose a new augmentation scheme, counterexample-guided data augmentation. The main idea is to augment the training set only with new misclassified examples rather than modified images coming from the original training set. The proposed augmentation scheme consists of the following steps: 1) Generate synthetic images that are misclassified by the model, i.e., the counterexamples; 2) Add the counterexamples to the training set; 3) Train the model on the augmented dataset. These steps can be repeated until the desired accuracy is reached. Note that our augmentation scheme depends on the ability to generate misclassified images. For this reason, we developed an image generator that cooperates with a sampler to produce images that are given as input to the model. The images are generated in a manner such that the ground truth labels can be automatically added. The incorrectly classified images constitute the augmentation set that is added to the training set. In addition to the pictures, the image generator provides information on the misclassified images, such as the disposition of the elements, brightness, contrast, etc. This information can be used to find features that frequently recur in counterexamples. We collect information about the counterexamples in a data structure we term the “error table”. Error tables are extremely useful to provide explanations about counterexamples and find recurring patterns that can lead an image to be misclassified. The error table analysis can also be used to generate images which are likely to be counterexamples, and thus, efficiently build augmentation sets.
In summary, the main contributions of this work are:

A counterexample-guided data augmentation approach where only misclassified examples are iteratively added to training sets;

A synthetic image generator that renders realistic counterexamples;

Error tables that store information about counterexamples and whose analysis provides explanations and facilitates the generation of counterexample images.
We conducted experiments on Convolutional Neural Networks (CNNs) for object detection by analyzing different counterexample data augmentation sampling schemes and compared the proposed methods with classic data augmentation. Our experiments show the benefits of using a counterexample-driven approach against a traditional one. The improvement comes from the fact that a counterexample augmentation set contains information that the model had not been able to learn from the training set, a fact that was not considered by classic augmentation schemes. In our experiments, we use synthetic data sets generated by our image generator. This ensures that all treated data comes from the same distribution.
Overview
Fig. 1 summarizes the proposed counterexample-guided augmentation scheme. The procedure takes as input a modification space, i.e., the space of possible configurations of our image generator. This space is constructed based on domain knowledge to be a space of “semantic modifications”; i.e., each modification must have a meaning in the application domain in which the machine learning model is being used. This allows us to perform more meaningful data augmentation than simply through adversarial data generation performed by perturbing an input vector (e.g., adversarially selecting and modifying a small number of pixel values in an image).
In each loop, the sampler selects a modification from the modification space. The sample is determined by a sampling method that can be biased by a precomputed error table, a data structure that stores information about images that are misclassified by the model. The sampled modification is rendered into a picture by the image generator. The image is given as input to the model, which returns a prediction. We then check whether the image is a counterexample, i.e., whether the prediction is wrong. If so, we add the image to our augmentation set and store its information in the error table that will be used by the sampler at the next iteration. The loop is repeated until the augmentation set is large enough (or the modification space has been sufficiently covered).
This scheme returns an augmentation set, which will be used to retrain the treated model, along with an error table, whose analysis identifies common features among counterexamples and helps the sampler select candidate counterexamples.
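The loop described above can be sketched in a few lines of Python. This is only an illustrative outline under stated assumptions, not the authors' implementation: the function name `generate_counterexamples`, the callables `sample`, `render`, `predict`, and `is_correct`, and the stopping parameters `target_size` and `max_iterations` are all hypothetical names introduced here.

```python
from typing import Any, Callable, Dict, List, Tuple


def generate_counterexamples(
    sample: Callable[[List[Dict[str, Any]]], Any],
    render: Callable[[Any], Any],
    predict: Callable[[Any], Any],
    is_correct: Callable[[Any, Any], bool],
    target_size: int,
    max_iterations: int = 10_000,
) -> Tuple[List[Tuple[Any, Any]], List[Dict[str, Any]]]:
    """Collect misclassified images and an error table (hypothetical sketch)."""
    augmentation_set: List[Tuple[Any, Any]] = []
    error_table: List[Dict[str, Any]] = []
    for _ in range(max_iterations):
        if len(augmentation_set) >= target_size:
            break
        # The sampler may inspect the error table to bias its choice.
        modification = sample(error_table)
        image = render(modification)
        prediction = predict(image)
        # The ground truth is known from the modification, so correctness
        # can be checked without manual labeling.
        if not is_correct(modification, prediction):
            augmentation_set.append((image, modification))
            error_table.append(
                {"modification": modification, "prediction": prediction}
            )
    return augmentation_set, error_table
```

The returned augmentation set would then be merged into the training set before retraining, and the whole procedure can be repeated on the retrained model.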
The paper structure mostly follows the scheme of Fig. 1: Sec. 2 introduces some notation; Sec. 3 describes the image generator used to render synthetic images; Sec. 4 introduces some sampling techniques that can be used to efficiently sample the modification space; Sec. 5 introduces error tables and details how they can be used to provide explanations about counterexamples; Sec. 6 concludes the paper by evaluating the proposed techniques, comparing different tunings of our counterexample-guided augmentation scheme with each other and against classic augmentation. The implementation of the proposed framework and the reported experiments are available at https://github.com/dreossi/analyzeNN.
2 Preliminaries
This section provides the notation used throughout this paper.
We use standard notation for vectors (individual elements and contiguous ranges of elements are selected by index) and for sets. A training set is a set of examples, each paired with its associated label. A model is a function from inputs to predictions; in the object detection context, the prediction for an image encodes the bounding boxes, scores, and categories predicted by the model. A model is always considered together with the training set on which it was trained. Let $b_1$ and $b_2$ be bounding boxes. The Intersection over Union (IoU) is defined as $\mathrm{IoU}(b_1, b_2) = A(b_1 \cap b_2) / A(b_1 \cup b_2)$, where $A(b)$ is the area of box $b$. We consider a predicted box to be a detection for a ground truth box if their IoU exceeds a fixed threshold. True positives $tp$ is the number of correct detections; false positives $fp$ is the number of predicted boxes that do not match any ground truth box; false negatives $fn$ is the number of ground truth boxes that are not detected.
Precision and recall are defined as $p = tp / (tp + fp)$ and $r = tp / (tp + fn)$. In this work, we consider an input to be misclassified if its precision or recall is less than a fixed threshold. The average precision and recall of a model on a test set are the means of the per-example precision and recall over all examples of the test set. We use average precision and recall to measure the accuracy of a model, succinctly represented as the pair of the two values.
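As a concrete illustration of these definitions, the following sketch computes IoU for axis-aligned boxes and the per-image precision and recall. Several details are assumptions made here and not taken from the paper: the box encoding `(x_min, y_min, x_max, y_max)`, the greedy one-to-one matching of predictions to ground truth boxes, the convention of returning 1.0 when a denominator would be zero, and the names `iou` and `precision_recall`.

```python
from typing import List, Tuple

# Assumed box encoding: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def area(b: Box) -> float:
    """Area of a box; degenerate or inverted boxes have area 0."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def iou(b1: Box, b2: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter = (max(b1[0], b2[0]), max(b1[1], b2[1]),
             min(b1[2], b2[2]), min(b1[3], b2[3]))
    inter_area = area(inter)
    union_area = area(b1) + area(b2) - inter_area
    return inter_area / union_area if union_area > 0 else 0.0


def precision_recall(predicted: List[Box], ground_truth: List[Box],
                     iou_threshold: float) -> Tuple[float, float]:
    """Greedy matching: each ground truth box is detected at most once."""
    matched = set()
    tp = 0
    for p in predicted:
        best_index, best_iou = None, iou_threshold
        for k, g in enumerate(ground_truth):
            value = iou(p, g)
            if k not in matched and value > best_iou:
                best_index, best_iou = k, value
        if best_index is not None:
            matched.add(best_index)
            tp += 1
    fp = len(predicted) - tp      # predictions matching no ground truth box
    fn = len(ground_truth) - tp   # ground truth boxes left undetected
    precision = tp / (tp + fp) if predicted else 1.0
    recall = tp / (tp + fn) if ground_truth else 1.0
    return precision, recall
```

For instance, with one ground truth box and two predictions of which only one overlaps it, the sketch yields a precision of 0.5 and a recall of 1.0.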
3 Image Generator
At the core of our counterexample augmentation scheme is an image generator (similar to the one defined in [dreossinfm17, dreossirmlw17]) that renders realistic synthetic images of road scenarios. Since counterexamples are generated by the synthetic data generator, we have full knowledge of the ground truth labels for the generated data. In our case, for instance, when the image generator places a car in a specific position, we know exactly its location and size, hence the ground truth bounding box is accordingly determined. In this section, we describe the details of our image generator.
3.1 Modification Space
The image generator implements a generation function that maps every modification to a feature. Intuitively, a modification describes the configuration of an image. For instance, a three-dimensional modification space can characterize a car's lateral and away displacement on the road and the image brightness. A generator can be used to abstract and compactly represent a subset of a high-dimensional image space.
We implemented an image generator based on a 14D modification space whose dimensions determine a road background; number of cars (one, two, or three) and their lateral and away positions on the road; brightness, sharpness, contrast, and color of the picture. Fig. 2 depicts some images rendered by our image generator.
We can define a metric over the modification space to measure the diversity of different pictures. Intuitively, the distance between two configurations is large if the concretized images are visually diverse and, conversely, it is small if the concretized images are similar.
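The following is an example of a distance metric that can be defined over our 14D modification space. Let $m$ and $m'$ be two modifications. The distance is defined as:
$$d(m, m') = \sum_{i \in D} \mathbb{1}[m_i \neq m'_i] + \lVert m_C - m'_C \rVert \qquad (1)$$
where $D$ indexes the discrete dimensions (background and car models), $C$ indexes the remaining continuous dimensions, $\mathbb{1}[\cdot]$ is $1$ if the condition is true and $0$ otherwise, and $\lVert \cdot \rVert$ is the $L^2$ norm. The distance counts the differences between background and car models and adds the Euclidean distance of the points corresponding to the car positions, brightness, sharpness, contrast, and color of the images.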
Fig. 2 depicts three images with their modifications. For brevity, captions report only the dimensions that differ among the images, namely background, car models, and positions. Note how similar images, like Fig. 2 (a) and (b) (same backgrounds and car models, slightly different car positions), have a smaller distance than diverse images, like Fig. 2 (a) and (c) or (b) and (c) (different backgrounds, car models, and vehicle positions).
Later on, we use this metric to generate sets whose elements ensure a certain amount of diversity (see Sec. 6.1).
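A minimal sketch of such a distance is given below. It assumes that a modification is stored as a sequence whose discrete entries (background, car models) and continuous entries (positions, brightness, and so on) are identified by index lists; the function name `modification_distance` and the index-list parameters are hypothetical and chosen only for illustration.

```python
import math
from typing import Any, Sequence


def modification_distance(
    m1: Sequence[Any],
    m2: Sequence[Any],
    discrete_idx: Sequence[int],
    continuous_idx: Sequence[int],
) -> float:
    """Count mismatching discrete entries and add the Euclidean distance
    between the continuous entries."""
    mismatches = sum(1 for i in discrete_idx if m1[i] != m2[i])
    euclidean = math.sqrt(sum((m1[i] - m2[i]) ** 2 for i in continuous_idx))
    return mismatches + euclidean


# Toy 4D example: (background, car model, lateral position, brightness).
a = ("forest", "bmw", 0.2, 0.5)
b = ("forest", "toyota", 0.5, 0.9)
d = modification_distance(a, b, discrete_idx=[0, 1], continuous_idx=[2, 3])
# One discrete mismatch plus a Euclidean term of about 0.5.
```

Because the discrete part contributes whole units while the continuous part depends on the value ranges, the relative weight of the two terms depends on how the continuous dimensions are scaled.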
3.2 Picture Concretization
Once a modification is fixed, our picture generator renders the corresponding image. The concretization is done by superimposing basic images (such as road background and vehicles) and adjusting image parameters (such as brightness, color, or contrast) according to the values specified by the modification. Our image generator comes with a database of backgrounds and car models used as basic images. Our database consists of 35 road scenarios (e.g., desert, forest, or freeway scenes) and 36 car models (e.g., economy, family, or sports vehicles, from both front and rear views). The database can be easily extended or replaced by the user.
3.3 Annotation Tool
In order to render realistic images, the picture generator must place cars on the road and scale them accordingly. To facilitate the conversion of a modification point describing the lateral and away position of a car into a proper superimposition of the car image on a road, we equipped the image generator with an annotation tool that can be used to specify the sampling area on a road and the scaling factor of a vehicle. For a particular road, the user draws a trapezoid designating the area where the image generator is allowed to place a car. The user also specifies the scale of the car image on the trapezoid bases, i.e., at the closest and furthest points from the observer (see Fig. 3). When sampling a point at an intermediate position, i.e., inside the trapezoid, the tool interpolates the provided car scales and determines the scaling at the given point. Moreover, the image generator superimposes different vehicles respecting the perspective of the image. The image generator also performs several checks to ensure that the rendered cars are visible.
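The interpolation step can be illustrated as follows. The text does not state which interpolation the tool uses, so this sketch assumes plain linear interpolation, normalized coordinates in [0, 1], and trapezoid corners given in pixels; the names `interpolate_scale` and `trapezoid_point` are hypothetical.

```python
from typing import Tuple

Point = Tuple[float, float]


def interpolate_scale(z: float, near_scale: float, far_scale: float) -> float:
    """Car scale at depth z, where z=0 is the closest base and z=1 the furthest."""
    if not 0.0 <= z <= 1.0:
        raise ValueError("z must lie in [0, 1]")
    return near_scale + z * (far_scale - near_scale)


def _lerp(p: Point, q: Point, t: float) -> Point:
    """Linear interpolation between two 2D points."""
    return (p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1]))


def trapezoid_point(x: float, z: float, near_left: Point, near_right: Point,
                    far_left: Point, far_right: Point) -> Point:
    """Map normalized (x, z) in [0, 1]^2 to pixel coordinates inside the trapezoid."""
    left = _lerp(near_left, far_left, z)     # point on the left edge at depth z
    right = _lerp(near_right, far_right, z)  # point on the right edge at depth z
    return _lerp(left, right, x)
```

With a near base from (100, 400) to (500, 400) and a far base from (250, 200) to (350, 200), the normalized point (0.5, 0.0) maps to the middle of the near base, and the car scale shrinks linearly as z grows.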
4 Sampling Methods
The goal of the sampler is to provide a good coverage of the modification space and identify samples whose concretizations lead to counterexamples.
We now briefly describe some sampling methods (similar to those defined in [dreossinfm17, dreossirmlw17]) that we integrated into our framework:

Uniform Random Sampling: Uniform random sampling ensures an equal probability of sampling any possible point from the modification space, which guarantees a good mix of generated images for both training and testing. Although it is a simple and effective technique for both training and testing, it may not provide good coverage of the modification space;

Low-Discrepancy Sampling: A low-discrepancy (or quasi-random) sequence is a sequence of n-tuples that fills an n-dimensional space more uniformly than uncorrelated random points. Low-discrepancy sequences are useful to cover boxes by reducing gaps and clustering of points, which ensures uniform coverage of the sample space;

Cross-Entropy Sampling: The cross-entropy method was developed as a general Monte Carlo approach to combinatorial optimization and importance sampling. It is an iterative sampling technique, where we sample from a given probability distribution, and update the distribution by minimizing the cross-entropy.
Some examples of low-discrepancy sequences are the Van der Corput, Halton [halton1960efficiency], or Sobol [sobol1976uniformly] sequences. In our experiments, we use the Halton [niederreiter1988low] sequence. There are two main advantages in having optimal coverage: first, we increase the chances of quickly discovering counterexamples, and second, the set of counterexamples will have high diversity, implying that the concretized images will look different and thus the model will learn diverse new features.
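For reference, a Halton sequence can be generated with the standard radical-inverse construction, one prime base per dimension. The sketch below is a generic textbook version, not the code used in the experiments; the function names and the default bases (2, 3) are choices made here for illustration.

```python
from typing import List, Sequence, Tuple


def van_der_corput(index: int, base: int) -> float:
    """Radical inverse of `index` in the given base, a value in [0, 1)."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


def halton(n_points: int, bases: Sequence[int] = (2, 3)) -> List[Tuple[float, ...]]:
    """First n_points of the Halton sequence, one coordinate per base."""
    return [
        tuple(van_der_corput(i, b) for b in bases)
        for i in range(1, n_points + 1)
    ]


points = halton(4)
# First coordinates (base 2): 0.5, 0.25, 0.75, 0.125
```

Each coordinate lies in the unit interval, so a point would still have to be rescaled to the range of the corresponding dimension of the modification space before rendering.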
5 Error Tables
Every iteration of our augmentation scheme produces a counterexample that contains information pointing to a limitation of the learned model. It would be desirable to extract patterns that relate counterexamples, and use this information to efficiently generate new counterexamples. For this reason, we define error tables that are data structures whose columns are formed by important features across the generated images. The error table analysis is useful for:

Providing explanations about counterexamples, and

Generating feedback to sample new counterexamples.
In the first case, by finding common patterns across counterexamples, we provide feedback to the user like “The model does not detect white cars driving away from us in forest roads”; in the second case, we can bias the sampler towards modifications that are more likely to lead to counterexamples.
5.1 Error Table Features
We first provide the details of the kinds of features supported by our error tables. We categorize features along two dimensions:

Explicit vs. implicit features: Explicit features are sampled from the modification space (e.g., position, brightness, contrast, etc.) whereas implicit features are user-provided aspects of the generated image (e.g., car model, background scene, etc.).

Ordered vs. unordered features: Some features have a domain with a well-defined total ordering (e.g., sharpness) whereas others do not have a notion of ordering (e.g., car model, identifier of background scene, etc.).
The sets of implicit and explicit features are mutually exclusive. In general, implicit features are more descriptive and characterize the generated images. These are useful for providing feedback to explain the vulnerabilities of the classifier. While implicit features are unordered, explicit features can be ordered or unordered. Rows of error tables are the realizations of the features for misclassified images.
Car model  Background ID  Environment  Brightness  

Toyota  Tunnel  0.9  0.2  0.9  
BMW  Forest  1.1  0.4  0.7  
Toyota  Forest  1.2  0.4  0.8 
Tab. 1 is an illustrative error table. The table includes car model and environment scene (implicit unordered features), brightness, car coordinates (explicit ordered features), and background ID (explicit unordered feature). The first row of Tab. 1 actually refers to Fig. 3. The actual error tables generated by our framework are larger than Tab. 1. They include, for instance, our 14D modification space (see Sec. 3.1) and features like number of vehicles, vehicle orientations, dominant color in the background, etc.
Given an error table populated with counterexamples, we would like to analyze it to provide feedback and utilize this feedback to sample new images.
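One simple way to represent such a table in code is a list of rows together with the names of the ordered and unordered columns. The class below is a hypothetical sketch (the name `ErrorTable`, its methods, and the example values are illustrative assumptions, not the data structure of the released framework).

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class ErrorTable:
    ordered: List[str]      # names of features with a total ordering
    unordered: List[str]    # names of features without an ordering
    rows: List[Dict[str, Any]] = field(default_factory=list)

    def add(self, **features: Any) -> None:
        """Store the feature values of one misclassified image."""
        missing = set(self.ordered + self.unordered) - set(features)
        if missing:
            raise KeyError(f"missing features: {sorted(missing)}")
        self.rows.append(features)

    def ordered_matrix(self) -> List[List[float]]:
        """Rows restricted to the ordered columns, e.g. as input to PCA."""
        return [[row[name] for name in self.ordered] for row in self.rows]

    def most_common(self, column: str) -> Tuple[Any, int]:
        """Most frequent value of a single column and its count."""
        return Counter(row[column] for row in self.rows).most_common(1)[0]


# Illustrative values only.
table = ErrorTable(ordered=["brightness"], unordered=["car_model", "environment"])
table.add(car_model="Toyota", environment="Tunnel", brightness=0.9)
table.add(car_model="BMW", environment="Forest", brightness=1.1)
table.add(car_model="Toyota", environment="Forest", brightness=1.2)
```

Keeping the ordered and unordered column names separate makes it straightforward to route each group of features to a different analysis.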
5.2 Feature Analysis
A naive analysis technique is to treat all the features equally, and search for the most commonly occurring element in each column of the error table. However, this assumes no correlation between the features, which is often not the case. Instead, we develop separate analysis techniques for ordered and unordered features. In the following we discuss how we can best capture correlations between the two sets:

Ordered features: Since these features are ordered, a meaningful analysis technique would be to find the direction in the feature space where most of the falsifying samples occur. This is very similar to model order reduction using Principal Component Analysis (PCA [wold1987principal]). Specifically, we are interested in the first principal component, which is the singular vector corresponding to the largest singular value in the Singular Value Decomposition (SVD [wold1987principal]) of the matrix consisting of all the samples of the ordered features. We can use the singular vector to find how sensitive the model is with respect to the ordered features. If the value corresponding to a feature is small in the vector, it implies that the model is not robust to changes in that feature, i.e., changes in that feature would affect the misclassification. Alternatively, features corresponding to larger values in the singular vector act as “don’t cares”, i.e., by fixing all other features, the model misclassifies the image regardless of the value of this feature;

Unordered features: Since these features are unordered, their value holds little importance. The most meaningful information we can gather from them is the subsets of features which occur together most often. To correctly capture this, we must explore all possible subsets, which is a combinatorial problem. This proves to be problematic when the space of unordered features is large. One way to overcome this is by limiting the size of the maximum subset to explore.
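To make the ordered-feature analysis concrete, the sketch below estimates the first principal direction of a small sample matrix with power iteration on the Gram matrix, which converges to the singular vector of the largest singular value. It is a dependency-free illustration under assumptions made here: columns are mean-centered by default (the `center` flag), the starting vector is arbitrary, and the function name is hypothetical. In practice one would more likely call an SVD routine from a linear algebra library.

```python
import math
from typing import List, Sequence


def first_principal_direction(
    samples: Sequence[Sequence[float]],
    center: bool = True,
    iterations: int = 200,
) -> List[float]:
    """Unit vector along the dominant direction of the sample matrix."""
    n, d = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / n if center else 0.0
             for j in range(d)]
    x = [[row[j] - means[j] for j in range(d)] for row in samples]
    # Gram matrix G = X^T X; its top eigenvector is the first right
    # singular vector of X.
    gram = [[sum(row[a] * row[b] for row in x) for b in range(d)]
            for a in range(d)]
    # Non-symmetric start to reduce the chance of starting orthogonal
    # to the dominant direction.
    v = [float(j + 1) for j in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    v = [c / norm for c in v]
    for _ in range(iterations):
        w = [sum(gram[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break  # degenerate data or unlucky start; keep the last vector
        v = [c / norm for c in w]
    return v


# Toy counterexample features: (position, brightness); position varies most.
direction = first_principal_direction(
    [(0.0, 0.0), (1.0, 0.1), (2.0, -0.1), (3.0, 0.0)]
)
```

The magnitudes of the entries of the returned vector play the role of the per-feature values discussed above; their sign is arbitrary.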
We conducted an experiment on a set of counterexamples. The ordered features included and positions of each car; along with the brightness, contrast, sharpness, and color of the overall image. The explicit features include the ordered features along with the discrete set of all possible cars and backgrounds. The implicit features include details like color of the cars, color of the background, orientation of the cars, etc. The PCA on the explicit ordered features revealed high values corresponding to the position of the first car (0.74), brightness (0.45) and contrast (0.44). We can conclude that the model is not robust to changes in these ordered features. For the unordered features, the combination of forest road with one white car with its rear towards the camera and the other cars facing the camera, appeared times. This provides an explanation of recurrent elements in counterexamples, specifically “The model does not detect white cars driving away from us in forest roads”.
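The occurrence counting for unordered features, with a cap on the subset size, might look like the following sketch. The function name `most_frequent_subsets`, the dictionary-per-row representation, and the default limits are assumptions introduced for illustration.

```python
from collections import Counter
from itertools import combinations
from typing import Any, Dict, List, Tuple

FeatureSubset = Tuple[Tuple[str, Any], ...]


def most_frequent_subsets(
    rows: List[Dict[str, Any]],
    max_size: int = 2,
    top: int = 3,
) -> List[Tuple[FeatureSubset, int]]:
    """Count co-occurring (feature, value) pairs up to a maximum subset size."""
    counts: Counter = Counter()
    for row in rows:
        items = sorted(row.items())  # canonical order so equal subsets collide
        for size in range(1, max_size + 1):
            for subset in combinations(items, size):
                counts[subset] += 1
    return counts.most_common(top)


rows = [
    {"background": "forest", "car": "white"},
    {"background": "forest", "car": "white"},
    {"background": "tunnel", "car": "white"},
]
ranking = most_frequent_subsets(rows, max_size=2, top=10)
```

Bounding `max_size` keeps the number of enumerated subsets polynomial in the number of unordered features, at the cost of missing larger recurring combinations.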
5.3 Sampling Using Feedback
We can utilize the feedback provided by the error table analysis to guide the sampling for subsequent training. Note that we can only sample from the explicit features:

Feedback from Ordered Features: The ordered features, which are a subset of the explicit features, already tell us which features need to vary more during the sampling process. For example, in the example of Sec. 5.2, our sampler must prioritize sampling different positions for the first car, then brightness, and finally contrast among the other ordered features;

Feedback from Unordered Features: Let be the subset of most occurring unordered features returned by the analysis, where and are the mutually exclusive sets of explicit and implicit features, respectively. The information of can be directly incorporated into the sampler. The information provided by require some reasoning since implicit features are not directly sampled. However, they are associated with particular elements of the image (e.g., background or vehicle). We can use the image generator library and error table to recognize which elements in the library the components of correspond to, and set the feature values accordingly. For instance, in the example of Sec. 5.2, the analysis of the unordered explicit features revealed that the combination of bridge road with a Tesla, Mercedes, and Mazda was most often misclassified. We used this information to generate more images with this particular combination by varying brightness and contrast.
Sec. 6.3 shows how this technique leads to a larger fraction of counterexamples that can be used for retraining.
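A very small sketch of feedback-driven sampling is shown below: the unordered explicit features singled out by the analysis are held fixed, and only selected ordered features are varied. The function name, the dictionary representation of a modification, and the feature names and ranges in the example are hypothetical.

```python
import random
from typing import Any, Dict, Tuple


def sample_with_feedback(
    fixed_features: Dict[str, Any],
    varied_ranges: Dict[str, Tuple[float, float]],
    rng: random.Random,
) -> Dict[str, Any]:
    """Keep the recurring unordered features fixed and vary the rest uniformly."""
    modification = dict(fixed_features)
    for name, (low, high) in varied_ranges.items():
        modification[name] = rng.uniform(low, high)
    return modification


rng = random.Random(0)
candidate = sample_with_feedback(
    fixed_features={"background": "bridge", "car_1": "Tesla"},
    varied_ranges={"brightness": (0.5, 1.5), "contrast": (0.5, 1.5)},
    rng=rng,
)
```

Each candidate produced this way would still be rendered and checked against the model, since the analysis only makes a misclassification more likely, not certain.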
6 Experimental Evaluation
In this section, we show how the proposed techniques can be used to augment training sets and improve the accuracy of the considered models. We will experiment with different sampling methods, compare counterexample-guided augmentation against classic augmentation, iterate over several augmentation cycles, and finally show how error tables are useful tools for analyzing models. The implementation of the proposed framework and the reported experiments are available at https://github.com/dreossi/analyzeNN.
In all the experiments we analyzed squeezeDet [squeezedet], a CNN realtime object detector for autonomous driving. All models were trained for epochs.
The original training and test sets and contain and pictures, respectively, randomly generated by our image generator. The initial accuracy is relatively high (see Tab. 3). However, we will be able to generate sets of counterexamples as large as on which the accuracy of drops down. The highlighted entries in the tables show the best performances. Reported values are the averages across five different experiments.
6.1 Augmentation Methods Comparison
As the first experiment, we run the counterexample augmentation scheme using different sampling techniques (see Sec. 4). Specifically, we consider uniform random sampling, low-discrepancy Halton sequence, cross-entropy sampling, and uniform random sampling with a diversity constraint on the sampled points. For the latter, we adopt the distance defined in Sec. 3.1 and we require that the modifications of the counterexamples be separated from each other by at least a fixed minimum distance.
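The diversity constraint can be enforced by simple rejection: a candidate is kept only if it is far enough from everything kept so far. The sketch below assumes this rejection strategy and uses hypothetical names (`sample_diverse`, `max_tries`); in the full pipeline the check would be applied to candidates that have already been confirmed as counterexamples.

```python
import math
import random
from typing import Callable, List, Tuple

Point = Tuple[float, float]


def sample_diverse(
    sample: Callable[[], Point],
    distance: Callable[[Point, Point], float],
    min_distance: float,
    n_points: int,
    max_tries: int = 100_000,
) -> List[Point]:
    """Keep a candidate only if it is at least min_distance from all kept ones."""
    accepted: List[Point] = []
    for _ in range(max_tries):
        if len(accepted) == n_points:
            break
        candidate = sample()
        if all(distance(candidate, kept) >= min_distance for kept in accepted):
            accepted.append(candidate)
    return accepted


rng = random.Random(1)
diverse = sample_diverse(
    sample=lambda: (rng.random(), rng.random()),
    distance=math.dist,       # Euclidean distance on a toy 2D space
    min_distance=0.2,
    n_points=5,
)
```

A larger minimum distance yields a more diverse set but rejects more candidates, which is consistent with diversity-constrained generation needing more iterations.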
For every sampling method, we generate counterexamples, half of which are injected into the original training set and half are used as test sets. Let denote uniform random, Halton, crossentropy, and diversity (i.e., random with distance constraint) sampling methods. Let be a sampling technique. is the augmentation of , and is a test set, both generated using . For completeness, we also defined the test set containing an equal mix of counterexamples generated by all the sampling methods.
Tab. 2 reports the accuracy of the models trained with various augmentation sets evaluated on test sets of counterexamples generated with different sampling techniques. The first row reports the accuracy of the model trained on the original training set . Note that, despite the high accuracy of the model on the original test set (), we were able to generate several test sets from the same distribution of and on which the model poorly performs.
The first augmentation that we consider is the standard one, i.e., we alter the images of using imgaug (https://github.com/aleju/imgaug), a Python library for image augmentation. We augmented of the images in by randomly cropping on each side, flipping horizontally with probability , and applying Gaussian blur with . Standard augmentation improves the accuracies on every test set. The average precision and recall improvements on the various test sets are and , respectively (see Row 1 Tab. 2).
Next, we augment the original training set with our counterexample-guided schemes (uniform random, low-discrepancy Halton, cross-entropy, and random with distance constraint) and test the retrained models on the various test sets. The average precision and recall improvements for uniform random are and , for low-discrepancy Halton and , for cross-entropy and , and for random with distance constraint and . First, notice that the improvement in the accuracy of the original model using counterexample-guided augmentation methods is larger than with the classic augmentation method. Second, among the proposed techniques, cross-entropy has the highest improvement in precision but low-discrepancy tends to perform better than the other methods on average for both precision and recall. This is due to the fact that low-discrepancy sequences ensure more diversity on the samples than the other techniques, resulting in different pictures from which the model can learn new features or reinforce the weak ones.
The generation of a counterexample for the original model takes on average for uniform random sampling s, for Halton s, and for uniform random sampling with constraints s. This shows the trade-off between time and gain in model accuracy. The maximization of the diversity of the augmentation set (and the subsequent accuracy increase) requires more iterations.
6.2 Augmentation Loop
For this experiment, we consider only the uniform random sampling method and we incrementally augment the training set over several augmentation loops. The goal of this experiment is to understand whether the model overfits the counterexamples and to see if it is possible to reach a saturation point, i.e., a model for which we are not able to generate counterexamples. We are also interested in investigating the relationship between the quantity of injected counterexamples and the accuracy of the model.
Consider the th augmentation cycle. For every augmentation round, we generate the set of counterexamples by considering the model with highest average precision and recall. Given , our analysis tool generates a set of counterexamples. We split in halves and . We use to augment the original training set and as a test set. Specifically, the augmented training set is obtained by adding misclassified images of to . are the ratios of misclassified images to original training examples. For instance, , where is the cardinality of . We consider the ratios . We evaluate every model against every test set.
Tab. 3 shows the accuracies for three augmentation cycles. For each model, the table shows the average precision and recall with respect to the original test set and the test sets of misclassified images. The generation of the first loop took around hours, the second hours, the third hours. We stopped the fourth cycle after more than hours. This shows how it is increasingly hard to generate counterexamples for models trained over several augmentations. This growing computational hardness of counterexample generation with the number of cycles is an informal, empirical measure of increasing assurance in the machine learning model.
Notice that for every cycle, our augmentation improves the accuracy of the model with respect to the test set. Even more interesting is the fact that the model accuracy on the original test set does not decrease, but actually improves over time (at least for the chosen augmentation ratios).
6.3 Error Table-Guided Sampling
In this last experimental evaluation, we use error tables to analyze the counterexamples generated for with uniform random sampling. We analyzed both the ordered and unordered features (see Sec. 5.2). The PCA analysis of ordered features revealed the following relevant values: sharpness , contrast , brightness , and position . This tells us that the model is more sensitive to image alterations rather than to the disposition of its elements. The occurrence counting of unordered features revealed that the top three most occurring car models in misclassifications are white Porsche, yellow Corvette, and light green Fiat. It is interesting to note that all these models have uncommon designs if compared to popular cars. The top three most recurring background scenes are a narrow bridge in a forest, an indoor parking lot, and a downtown city environment. All these scenarios are characterized by a high density of details that lead to false positives. Using the gathered information, we narrowed the sampler space to the subsets of the modification space identified by the error table analysis. The counterexample generator was able to produce misclassification with k iterations, against the of pure uniform random sampling, of Halton, and of uniform random with distance constraint.
Finally, we retrained on the training set that includes images generated using the error table analysis. The obtained accuracies are , , , and . Note how error table-guided sampling reaches levels of accuracy comparable to other counterexample-guided augmentation schemes (see Tab. 2) but with a third of the augmenting images.
7 Conclusion
In this paper, we present a technique for augmenting machine learning (ML) data sets with counterexamples. The counterexamples we generate are synthetically generated data items that are misclassified by the ML model. Since these items are synthesized algorithmically, their ground truth labels are also automatically generated. We show how error tables can be used to effectively guide the augmentation process. Results on training deep neural networks illustrate that our augmentation technique performs better than standard augmentation methods for image classification. Moreover, as we iterate the augmentation loop, it gets computationally harder to find counterexamples. We also show that error tables can be effective at achieving improved accuracy with a smaller data augmentation.
We note that our proposed methodology can also be extended to the use of counterexamples from “system-level” analysis and verification, where one analyzes the correctness of an overall system (e.g., an autonomous driving function) in the context of a surrounding environment [dreossinfm17]. Performing data augmentation with such “semantic counterexamples” is an interesting direction for future work [dreossicav18].
Our approach can be viewed as an instance of counterexample-guided inductive synthesis (CEGIS), a common paradigm in program synthesis [solarasplos06, alurfmcad13]. In our case, the program being synthesized is the ML model. CEGIS itself is a special case of oracle-guided inductive synthesis (OGIS) [jhaacta17]. For future work, it would be interesting to explore the use of oracles other than counterexample-generating oracles to augment the data set, and to compare our counterexample-guided data augmentation technique to other oracle-guided data augmentation methods.
Finally, in this work we decided to rely exclusively on simulated, synthesized data to make sure that the training, testing, and counterexample sets come from the same data source. It would be interesting to extend our augmentation method to real-world data; e.g., images of road scenes collected during driving. For this, one would need to use techniques such as domain adaptation or transfer learning [tobin2017domain] that can adapt the synthetically generated data to the real world.
Acknowledgments
This work was supported in part by NSF grants 1545126, 1646208, and 1739816, the DARPA BRASS program under agreement number FA875016C0043, the DARPA Assured Autonomy program, the iCyPhy center, and Berkeley Deep Drive. We acknowledge the support of NVIDIA Corporation via the donation of the Titan Xp GPU used for this research. Hisahiro (Isaac) Ito’s suggestions on cross-entropy sampling are gratefully acknowledged.