Image categorization is a well-studied problem in computer vision, where a model is trained to classify an image into one or more predefined categories. It has a plethora of applications in safety-critical domains like self-driving cars, health care, etc. Even in our day-to-day life we often use image classifiers: for example, Google Photos search, Facebook image tagging, etc. With the advent of Deep Neural Networks (DNN), such image classification tasks have seen major breakthroughs over the past few years, sometimes even matching human-level accuracy under some conditions.
In spite of such spectacular success, we often encounter high-impact classification errors made by these models, as shown in Table I. For example, in 2015, Google faced a huge backlash due to a notorious error in its photo-tagging app, which tagged pictures of dark-skinned people as "gorillas". After manual investigation of some related public reports, we find two main causes behind such mistakes: (i) Confusion: the model cannot differentiate one class from another. For example, it has been recently reported that Google Photos confuses skiers with mountains. (ii) Bias: the model shows disparate outcomes between two related groups. For example, Zhao et al., in their paper "Men Also Like Shopping", find classification bias towards women on activities like shopping, cooking, washing, etc.
|Error Type||Name||Report Date||Outcome|
|Confusion||Gorilla Tag||Jul 1, 2015||Black people were tagged as gorillas by the Google Photos app.|
|Confusion||Elephant is detected in a room||Aug 9, 2018||Image transplantation (i.e., replacing a sub-region of an image with another image containing a trained object) leads to misclassification.|
|Confusion||Google Photo||Dec 10, 2018||Google Photos confuses skier and mountain.|
|Bias||Nikon Camera||Jan 22, 2010||Camera shows bias toward Caucasian faces when detecting people's blinks.|
|Bias||Men Like Shopping||July 29, 2017||Multilabel object classification models show bias towards women on activities like shopping, cooking, washing, etc.|
|Bias|| || ||Open source face recognition services provided by IBM, Microsoft, and Face++ have higher error rates on darker-skin females for gender classification.|
These errors are specific to a class of images rather than any particular input image; at a high level, the intuition is that some class properties are getting violated in such cases. For example, in the case of bias reported by Zhao et al., a DNN model should not have different error rates while classifying the gender of a person in the shopping category. Thus, unlike individual image properties, this is a class property defined over all the shopping images with men and women. Any violation of such a property affects the whole class, e.g., a man is more likely to be predicted as a woman when he is shopping, although many individual images in this category can still be predicted correctly.
Due to the lack of such formally specified properties, while designing a DNN, developers usually follow some mental model of informal specifications; an error occurs when the application "produced harmful and unexpected results" w.r.t. that informal specification, as observed by Google Brain researchers. Without such a specification, traditional software testing techniques that test w.r.t. some oracle will remain inadequate [10, 11, 12].
One simple workaround to detect class-level violations could be to analyze the class separations directly, since the DNN models are, after all, supposed to learn this separation. To identify possible sources of confusion/bias-related errors, we should check whether the class separation is enough to clearly distinguish between two classes. In fact, the traditional ML testing/validation step indirectly evaluates such class separation by measuring the model's accuracy. However, doing this in a black-box manner is inadequate, especially for a pre-trained model, because (i) the training and testing distributions can differ widely; hence the labeled data used to validate the model might not measure the class separation accurately in a real-world test environment and may miss corner cases, and (ii) the input space (e.g., pixel space) is not well-specified enough to identify the class properties or their separation.
In this work, we propose a novel white-box technique to capture the separation between two classes. For a set of test input images, we compute the probability of activation of each neuron per predicted class. Thus, for each class, we create a vector of neuron activations where each vector element corresponds to a neuron activation probability. If two such vectors are too close to each other (compared to other class-vector pairs), the DNN under test cannot effectively distinguish the two classes, because a similar set of neurons is often activated by the corresponding class members.
To this end, we propose a novel white-box test strategy, DeepInspect. We evaluate DeepInspect for both single- and multi-label classification models in 10 different settings. Our experiments demonstrate that DeepInspect, unlike existing white-box techniques, can efficiently detect both Bias and Confusion errors in popular neural image classifiers. For all the models we have tested, DeepInspect reports a large number of classification errors with high precision. We further check whether DeepInspect can detect such classification errors in state-of-the-art models designed to be robust against norm-bounded adversarial attacks; DeepInspect finds hundreds of errors proving the need for orthogonal testing strategies to detect such class-level mispredictions. Further, unlike common DNN testing techniques [13, 14], we do not need to generate additional transformed images to find these errors.
We summarize our contributions as follows:
- We design a novel white-box testing framework to automatically detect confusion and bias errors in DNN-based visual recognition models for image classification.
- We implement the proposed techniques in DeepInspect and exhaustively evaluate it, detecting many errors in state-of-the-art DNN models.
- We have made the errors reported by DeepInspect public at https://deeplearninginspect.github.io/DeepInspect. We plan to release the code and processed data for public use.
II-A DNN Background
Deep Neural Networks (DNN) are popular machine learning models loosely inspired by the neural networks of human brains. A DNN model learns the logic to perform a task from a set of training examples. For example, an image recognition model learns to recognize cows through training with lots of sample images of cows.
A typical feed-forward DNN consists of a set of connected computational units, often referred to as neurons, which are arranged sequentially in a series of layers. The neurons in different layers are connected with each other through edges. Each edge has a corresponding weight. Each neuron applies a nonlinear activation function (e.g., ReLU, Sigmoid) to its inputs and sends the output to the subsequent neurons. For image classification, convolutional neural networks (CNNs) are typically used, which consist of layers with local spatial connectivity and sets of neurons with shared parameters across space. Since our methods are general, we will refer to DNNs more broadly.
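As a background illustration, the per-neuron computation described above (a weighted sum of inputs plus a bias, followed by a nonlinear activation such as ReLU) can be sketched in a few lines; the weights and inputs below are arbitrary toy values:

```python
import numpy as np

def relu(z):
    """ReLU activation: elementwise max(0, z)."""
    return np.maximum(0.0, z)

# A toy 2-layer feed-forward pass: 3 inputs -> 2 hidden neurons -> 1 output.
x = np.array([1.0, 2.0, 0.5])                          # input features
W1 = np.array([[0.2, -0.5], [0.4, 0.1], [-0.3, 0.8]])  # 3x2 edge weights
b1 = np.array([0.1, -0.2])                             # per-neuron biases
h = relu(x @ W1 + b1)                                  # hidden-layer activations
W2 = np.array([[0.7], [-0.6]])                         # 2x1 edge weights
y = relu(h @ W2)                                       # output neuron
```

Here the second hidden neuron's pre-activation is negative, so ReLU drives its output to zero; only the first hidden neuron is "activated", which is exactly the notion of neuron activation used throughout the paper.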
To build a DNN application, developers typically start with a set of pre-annotated experimental data and divide it into three sets: (i) training: to fit the model in a supervised setting (e.g., using stochastic gradient descent with gradients computed using back-propagation), (ii) validation: to tune the model hyper-parameters, and (iii) evaluation: to evaluate the accuracy of the trained model w.r.t. a pre-annotated test dataset. Typically, the training, validation, and testing data are drawn from the same dataset. The semantics of the underlying task learned by a DNN model highly depend on the training dataset and are encoded as the weights of the edges in the network. If the training data are changed, the semantics learned by the model also change, and essentially a different program is generated. Thus, a final deliverable DNN application is a combination of the training data and the underlying DNN structure.
For image classification, a DNN can be trained in the following two settings: (i) Single-label Classification. In a traditional single-label classification problem, each data point is associated with a single label l from a set of disjoint labels L, where |L| > 1. If |L| = 2, the classification problem is called a binary classification problem; if |L| > 2, it is called a multi-class classification problem. Popular image classification tasks such as MNIST, CIFAR-10/CIFAR-100, and ImageNet are all single-label classification tasks where each image can be categorized into only one class. (ii) Multi-label Classification. In a multi-label classification problem, each data point is associated with a set of labels Y, where Y ⊆ L. COCO, NUS, and imSitu are multi-label classification tasks. For example, an image from the COCO dataset can be labeled as car, person, traffic light, and bicycle; a multi-label classification model is supposed to predict all four labels from this image.
Given any single- or multi-label classification task, the DNN classifier tries to learn the decision boundary between the classes: all members of a class, say a, should be categorized identically irrespective of their individual features, and members of another class, say b, should not be categorized as a. The DNN represents the input image in an embedded space and then uses a linear classifier (e.g., softmax) to classify these representations. A class separation estimates how well the DNN has learned to separate one class from another. If the embedded distance between two classes is too small compared to other classes, or lower than some pre-defined threshold, we assume that the DNN could not separate them from each other.
II-B Testing DNN Classifiers
Traditionally in ML, how well a DNN can classify and learn class separations is tested w.r.t. annotated ground-truth data [27, 28, 29]. Since the training and testing samples are usually drawn from the same distribution during development, previous papers showed that such techniques are not adequate to test real-world corner cases [14, 30, 13, 31, 32]. To detect class-level violations, as shown in Table I, testing each class separately w.r.t. a high-level class property (e.g., all possible cow images, such as a black cow with long horns, a white cow with short horns, etc., should be classified as ranch animal) can be a viable option. In fact, this is somewhat similar to equivalence-partition testing in traditional software engineering. However, in traditional SE it is usually enough to test one candidate per equivalence class (EC) plus the boundary conditions; this only holds when the members of an EC are homogeneous. For heterogeneous class members, such as ours (e.g., white cow, black cow, etc.), we should test each EC exhaustively. But each EC can potentially have a large number of members, and finding representative candidates requires intelligently sampling a nonlinear, high-dimensional space, which is inherently difficult. Thus, exhaustive black-box testing is infeasible.
Instead, we need a white-box metric, analogous to the branch/path coverage metric, that can be used to approximately represent each class. The popular white-box testing approaches in the literature are based on the DNN structure, often targeting activated neurons and layers [14, 30, 13, 31]. However, these techniques are more suitable for testing the network structure (e.g., how many neurons are activated) rather than class properties. They are also better suited to detecting image-level violations than class-level errors. For example, a previously proposed method can detect adversarial images at the point level but cannot be easily extended to identify violations at the group level. In this work, we address these issues by introducing novel white-box techniques.
In this section, we provide a detailed technical description of DeepInspect. Our experimental setting reflects a real-world scenario where a customer gets a pre-trained model running in a production system. The customer has white-box access to the model for profiling, although all the data in the production system is unlabeled. Here, we primarily focus on inspecting how well a pre-trained DNN has learned the class separation by testing it with a set of unlabeled data. In the absence of ground-truth labels, the classes are defined by the predicted labels. We finally output class-pairs that are too close compared to the rest of the class-pairs, and report potential confusion and bias errors accordingly. Figure 1 shows the DeepInspect workflow.
Before describing the DeepInspect methodology, we introduce some definitions that we will use in the rest of the paper.
Neural-Path (NP). For an input image x, we define a neural-path as the set of neurons that are activated by x.
Neural-Paths per Class (NPC). For a class C, this metric represents the set of unique neural-paths activated by all the inputs representing C.
For example, consider a class cow containing two images: a brown cow and a black cow (see Figure 2). Let's assume that they activate two neural-paths, say NP1 = {n1, n2} and NP2 = {n1, n3}. Thus, the neural paths for class cow would be NPC(cow) = {NP1, NP2}. NPC is further represented by a vector, here (n1^2, n2^1, n3^1), where the superscripts represent the number of times each neuron is activated by C. In fact, each C in a dataset can be expressed with such a neuron activation frequency vector, which captures how the model interacts with that C.
Neuron Activation Probability. Leveraging how frequently a neuron n is activated by the members of a class C, this metric estimates the probability of n being activated by C. Thus, we define:

P(n | C) = (number of inputs in C that activate n) / (total number of inputs in C)

We then construct an N x |C|-dimensional neuron activation probability matrix P (N is the number of neurons and |C| is the number of classes), with its ij-th entry being P(n_i | C_j).
This matrix captures how a model interacts with a set of input data. The column vectors P_C represent the interaction of a class C with the model. Note that, in our setting, classes are defined by predicted labels.
Since P is designed to represent each class, it should be able to distinguish between different classes. Next, we use P to find two different classes of errors often found in DNN systems: confusion and bias (see Table I).
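The construction of the neuron activation probability matrix can be sketched as follows. This is a simplified illustration with hypothetical activation values and a 0.5 activation threshold; in practice the activations would come from profiling the DNN under test, and the labels are the model's predicted labels:

```python
import numpy as np

def activation_probability_matrix(activations, predicted_labels, num_classes,
                                  threshold=0.5):
    """
    Build the N x |C| neuron activation probability matrix P.
    activations: (num_images, num_neurons) array of neuron outputs.
    predicted_labels: per-image predicted class ids (classes are defined
                      by *predicted* labels, since test data is unlabeled).
    Entry [i, j] estimates P(neuron i activated | class j) as the fraction
    of class-j images whose neuron-i output exceeds `threshold`.
    """
    activations = np.asarray(activations)
    labels = np.asarray(predicted_labels)
    active = activations > threshold          # boolean activation per image
    num_neurons = activations.shape[1]
    P = np.zeros((num_neurons, num_classes))
    for c in range(num_classes):
        members = active[labels == c]         # all images predicted as class c
        if len(members):
            P[:, c] = members.mean(axis=0)    # activation frequency -> probability
    return P

# Toy example: 4 images, 3 neurons, 2 predicted classes.
acts = [[0.9, 0.1, 0.6],
        [0.8, 0.2, 0.4],
        [0.1, 0.7, 0.9],
        [0.2, 0.6, 0.8]]
P = activation_probability_matrix(acts, [0, 0, 1, 1], num_classes=2)
```

Each column of `P` is then the class's activation-probability vector used in the confusion and bias analyses below.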
III-B Finding Confusion Errors
In an object classification task, confusion occurs when the model cannot distinguish one object class from another. For example, as shown in Table I, the Google Photos app model confuses a skier with a mountain. Thus, finding confusion errors means checking how well the model can distinguish between objects of different classes. An error happens when the model under test classifies an object with a wrong class or, for a multi-label classification task, predicts a class even though no object of that class is present in the test image.
We argue that the model makes these errors because during the training process it has not learned to distinguish well between the two classes, say a and b. Therefore, the neurons activated by objects of these classes are similar, and the column vectors corresponding to these classes, P_a and P_b, will be very close to each other. Thus, we compute the confusion score between two classes as the distance between their two probability vectors:

napvd(a, b) = ||P_a - P_b||

where ||.|| denotes the Euclidean distance.
If this value is less than some pre-defined threshold for a pair of classes, the model will potentially make mistakes in distinguishing one from the other, resulting in confusion errors. We call this metric napvd (Neuron Activation Probability Vector Distance).
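A minimal sketch of this confusion-detection step, assuming napvd is the Euclidean distance between two column vectors of the activation-probability matrix P and taking the threshold to be 1 standard deviation below the mean napvd (the default cutoff used later in the evaluation); function names are ours:

```python
import numpy as np

def napvd(P, a, b):
    """Neuron Activation Probability Vector Distance between classes a and b:
    Euclidean distance between the corresponding columns of P."""
    return float(np.linalg.norm(P[:, a] - P[:, b]))

def confused_pairs(P, num_std=1.0):
    """Flag class pairs whose napvd is more than `num_std` standard
    deviations below the mean napvd over all pairs (likely confusions)."""
    n = P.shape[1]
    pairs = [(a, b) for a in range(n) for b in range(a + 1, n)]
    dists = np.array([napvd(P, a, b) for a, b in pairs])
    cutoff = dists.mean() - num_std * dists.std()
    return [p for p, d in zip(pairs, dists) if d < cutoff]
```

For a matrix where two class columns nearly coincide while a third is far away, only the close pair falls below the cutoff and is reported.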
III-C Finding Bias
In an object classification task, bias occurs if the model under test shows disparate outcomes between two related groups. For example, we find that a ResNet-34 pretrained on the imSitu dataset often misclassifies a man with a baby as a woman. We observe that in the embedded matrix P, the distance between woman and baby is much smaller than that between man and baby. Therefore, during testing, whenever the model finds an image with a baby, it is biased towards associating the baby image with a woman. Based on this observation, we propose an inter-class-distance-based metric to calculate the bias learned by the model. We define the bias between two classes a and b over a third class c as follows:

bias(a, b, c) = |d(a, c) - d(b, c)| / (d(a, c) + d(b, c))

where d is the distance (napvd) between two class vectors in the embedded space.
If a model treats objects of classes a and b similarly under the presence of a third object class c, then a and b should have similar distances w.r.t. c in the embedded space; thus, the numerator of the above equation will be small. Intuitively, the model's output can be more influenced by nearer object classes, i.e., if a and b are closer to c. Thus, we normalize the disparity between the two distances to increase the influence of closer classes.
This bias score is used to measure how differently the given model treats two classes in the presence of a third object class. An average bias (abbreviated as avg_bias) between two classes a and b over all third classes is defined as:

avg_bias(a, b) = (1 / |C'|) * sum over c in C' of bias(a, b, c), where C' is the set of all classes other than a and b.
The above score captures the overall bias of the model between two classes. If the bias score is larger than some pre-defined threshold, we report potential bias errors.
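The bias and avg_bias computations can be sketched as below. This assumes the normalized-disparity form described above (the difference of the two distances to c divided by their sum), with napvd as the embedded-space distance; function names are ours:

```python
import numpy as np

def dist(P, a, b):
    """Embedded-space distance (napvd) between classes a and b."""
    return float(np.linalg.norm(P[:, a] - P[:, b]))

def bias(P, a, b, c):
    """Bias between classes a and b w.r.t. a third class c: the disparity
    of their distances to c, normalized so closer classes weigh more."""
    d_ac, d_bc = dist(P, a, c), dist(P, b, c)
    denom = d_ac + d_bc
    return abs(d_ac - d_bc) / denom if denom else 0.0

def avg_bias(P, a, b):
    """Average bias between a and b over every third class c."""
    others = [c for c in range(P.shape[1]) if c not in (a, b)]
    return float(np.mean([bias(P, a, b, c) for c in others])) if others else 0.0
```

In the toy test below, class 2 coincides with class 0 in the embedded space, so the model is maximally "biased" between classes 0 and 1 w.r.t. class 2, while classes 0 and 2 are equidistant from class 1 and show no bias.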
Using the above equations, we develop a novel testing framework, DeepInspect, to systematically inspect a DNN implementing image classification tasks and look for potential confusion- and bias-related errors. We implemented DeepInspect in the PyTorch deep learning framework with Python 2.7. All of our experiments were run on Ubuntu 18.04.2 with two TITAN Xp GPUs.
IV Experimental Design
|Task||Name||#classes||CNN Model||#Neurons||#Layers||Reported Result|
|Multi-label classification||COCO||80||ResNet-50||26,560||53 Conv||0.73 mean average precision|
|Multi-label classification||COCO with gender||81||ResNet-50||26,560||53 Conv||0.71 mean average precision|
|Multi-label classification||NUS||1000||ResNet-18||5,800||21 Conv||0.26 mean average precision|
|Multi-label classification||imSitu||205,095||ResNet-34||8,448||36 Conv||0.37 mean accuracy|
|Single-label classification||Robust CIFAR-10 (R CIFAR-10)||10||Small (S) CNN||158||8||0.69 accuracy|
|Single-label classification||R CIFAR-10||10||Large (L) CNN||1,226||14||0.73 accuracy|
|Single-label classification||R CIFAR-10||10||ResNet (R)||1,410||34||0.70 accuracy|
|Single-label classification||ImageNet||1000||ResNet-50||26,560||53 Conv||0.75 accuracy|
|Single-label classification||Robust Tiny ImageNet||200||ResNet||1,410||34||0.27 accuracy|
IV-A Study Subjects
We apply DeepInspect to both multi-label and single-label DNN-based classification. Under different settings, DeepInspect automatically inspects 10 DNN models for 8 datasets. Table II summarizes our study subjects. We used pre-trained models, as shown in the table, for all the settings except COCO with gender; for that model, we used gender labels from prior work and trained the model in the same way. There are 11,538 entities and 1,788 roles in total in the imSitu dataset. When inspecting the imSitu model, we only considered the top 100 most frequent entities or roles in the test dataset.
Among the 10 DNN models, four are pre-trained robust models that are trained using adversarial images along with regular images, following a provably robust training approach. Three robust models with different network structures are trained on the CIFAR-10 dataset. The last robust model is pre-trained on the Tiny ImageNet dataset.
IV-B Identifying Ground Truth (GT) Errors
To collect the ground truth we refer to the test images truly misclassified by a given model. We then aggregate these misclassified image points by their real and predicted class-labels and estimate pair-wise confusion/bias.
IV-B1 GT of Confusion Errors
Confusion occurs when a DNN often makes mistakes in disambiguating members of two different classes. In particular, if a DNN is confused between two classes, the classification error rate is higher between the two classes than the rest of the class-pairs. Based on this, we define two types of confusion errors for single-label classification and multi-label classification separately:
Type1 confusion: In single-label classification, Type1 confusion occurs when an object of true label a (e.g., violin) is misclassified to another class b (e.g., cello). Over all the objects of classes a and b, it can be quantified as the DNN's probability of misclassifying a as b and vice-versa, averaged over the two directions:

type1conf(a, b) = (P(a misclassified as b) + P(b misclassified as a)) / 2

For example, given the two classes cello and violin, type1conf estimates the mean probability of violin being misclassified as cello and vice-versa. Note that this is a bi-directional score, i.e., misclassification of a as b is treated the same as misclassification of b as a.
Type2 confusion: For multi-label classification, Type2 confusion occurs when an input image contains an object of class a (e.g., keyboard) and no object of class b (e.g., mouse), but the model predicts both classes (see Figure 8). For a pair of classes, this can be quantified as the probability of detecting two objects in the presence of only one:

type2conf(a, b) = (P(b predicted | only a present) + P(a predicted | only b present)) / 2

For example, given the two classes keyboard and mouse, type2conf estimates the mean probability of mouse being predicted while only keyboard is present, and vice-versa. Similar to Type1 confusion, this is also a bi-directional score.
We measure type1conf and type2conf using a DNN's true classification errors on a set of test images. These create the DNN's true confusion characteristics between all possible class-pairs. We then draw the distributions of type1conf and type2conf, as shown in Figure 2(a). The class-pairs having confusion scores greater than 1 standard deviation above the mean are then marked as pairs truly confused by the model and form our ground truth of confusion errors. For example, the COCO dataset has 80 classes and thus 3,160 class-pairs (80*79/2); 178 of these class-pairs are ground-truth confusion errors.
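For single-label models, the ground-truth Type1 confusion score and the 1-standard-deviation cutoff can be sketched from an ordinary confusion matrix (function names are ours):

```python
import numpy as np

def type1_confusion(conf_matrix, a, b):
    """Ground-truth Type1 confusion between classes a and b: the mean of
    P(a misclassified as b) and P(b misclassified as a), estimated from a
    single-label confusion matrix (rows = true class, cols = predicted)."""
    cm = np.asarray(conf_matrix, dtype=float)
    p_ab = cm[a, b] / cm[a].sum()   # fraction of a's images predicted as b
    p_ba = cm[b, a] / cm[b].sum()   # fraction of b's images predicted as a
    return (p_ab + p_ba) / 2.0

def ground_truth_pairs(conf_matrix, num_std=1.0):
    """Pairs whose confusion score is more than `num_std` standard
    deviations above the mean over all class-pairs."""
    n = len(conf_matrix)
    pairs = [(a, b) for a in range(n) for b in range(a + 1, n)]
    scores = np.array([type1_confusion(conf_matrix, a, b) for a, b in pairs])
    cutoff = scores.mean() + num_std * scores.std()
    return [p for p, s in zip(pairs, scores) if s > cutoff]
```

In the toy test below, classes 0 and 1 exchange misclassified images while class 2 is never confused, so only the pair (0, 1) is flagged as ground truth.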
Note that, different from how a bug/error is defined in traditional software engineering, our suspicious confusion pairs have an inherent probabilistic nature. For example, even if a and b represent a confusion pair, it does not mean that all the images containing a or b will be misclassified by the model. Rather, it means that, compared with other pairs, images containing a or b tend to have a higher chance of being misclassified.
IV-B2 GT of Bias
A DNN model is biased if it treats two classes differently. For example, consider three classes: man, woman, and surfboard. An unbiased model should not have different error rates while classifying man or woman in the presence of surfboard. To measure such bias formally, we define confusion disparity (cd) as the difference between the error for classes a and c and that for classes b and c:

cd(a, b, c) = |err(a, c) - err(b, c)|

where the error measure err can be either type1conf or type2conf as defined earlier. cd essentially estimates the disparity of the model's errors between classes a and b (e.g., man, woman) w.r.t. a third class c (e.g., surfboard).
We also define an aggregated measure, average confusion disparity (abbreviated as avg_cd), between two classes a and b by summing up the bias between them over all third classes and taking the average:

avg_cd(a, b) = (1 / |C'|) * sum over c in C' of cd(a, b, c), where C' is the set of all classes other than a and b.
Depending on the error type used to estimate avg_cd, we refer to type1_avg_cd and type2_avg_cd. We measure avg_cd using the true classification error rates reported by the DNN for the test images. Similar to confusion errors, we draw the distribution of avg_cd for all possible class pairs and then consider a pair as truly biased if its avg_cd score is higher than 1 standard deviation above the mean. Such truly biased pairs form our ground truth of bias errors.
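A sketch of cd and avg_cd, parameterized by any pairwise error measure err (e.g., a type1 or type2 confusion function); function names are ours:

```python
import numpy as np

def confusion_disparity(err, a, b, c):
    """cd(a, b, c): difference between the model's error confusing a with c
    and that confusing b with c. `err(x, y)` is any symmetric pairwise
    error measure, such as type1conf or type2conf."""
    return abs(err(a, c) - err(b, c))

def avg_cd(err, a, b, num_classes):
    """Average confusion disparity between a and b over all third classes."""
    others = [c for c in range(num_classes) if c not in (a, b)]
    return float(np.mean([confusion_disparity(err, a, b, c) for c in others]))
```

For instance, if the model confuses (man, surfboard) with error 0.3 but (woman, surfboard) with error 0.1, cd over surfboard is 0.2, signaling disparate treatment.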
IV-C Evaluating DeepInspect
We evaluate DeepInspect using the test set.
IV-C1 DeepInspect's Error Reporting
DeepInspect reports confusion errors based on napvd (see Equation 2) scores: lower napvd indicates errors. We draw the distributions of napvds for all possible class pairs, as shown in Figure 2(b). Class pairs having napvd scores lower than 1 standard deviation below the mean score are marked as potential confusion errors.
As discussed in Section III-C, DeepInspect reports bias errors based on the avg_bias score (see Equation 4), where a higher avg_bias means a class pair is more prone to bias errors. Similar to the above, from the distribution of avg_bias scores, DeepInspect predicts pairs with avg_bias greater than 1 standard deviation above the mean score to be erroneous. Note that, while calculating the error disparity between classes a and b w.r.t. c (see Equation 3), if both a and b are far from c in the embedded space, the disparity of their distances (d(a,c) and d(b,c)) may not reflect true bias. Thus, while calculating avg_bias we further filter out the triplets whose distances to c exceed a pre-defined threshold; in our experiment, the threshold is set to 1 standard deviation below the mean of all pairwise distances across all the class-pairs.
IV-C2 Evaluation Metric
We evaluate DeepInspect in two ways.

Precision & Recall: We use precision and recall to measure DeepInspect's accuracy. For each error type t, suppose that E is the set of errors detected by DeepInspect and A is the set of true errors in the ground-truth set. Then the precision and recall of DeepInspect are |E ∩ A| / |E| and |E ∩ A| / |A|, respectively.
Area Under Cost-Effectiveness Curve (AUCEC): Similar to how static analysis warnings are ranked based on their priority levels, we also rank the erroneous class-pairs identified by DeepInspect in decreasing order of error-proneness, i.e., the most error-prone pairs appear at the top. To evaluate such a ranking we use a cost-effectiveness measure, AUCEC (Area Under the Cost-Effectiveness Curve), which has become standard for evaluating rank-based bug-prediction systems [45, 43, 46].
Cost-effectiveness evaluates, when we inspect/test the top n% of class-pairs in the ranked list (i.e., inspection cost), how many true errors are found (i.e., effectiveness). Both cost and effectiveness are normalized to 100%. Figure 7 shows cost on the x-axis and effectiveness on the y-axis, indicating the portion of ground-truth errors found; AUCEC is the area under this curve. We compare DeepInspect's performance against a random model that picks random class-pairs for inspection. We also show the performance of an optimal model that ranks the class-pairs perfectly: if x% of all the class-pairs are truly erroneous, the optimal model would rank them at the top, so that with a lower inspection budget most of the errors are detected. The optimal curve gives the upper bound of the ranking scheme.
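A sketch of the AUCEC computation for a ranked list of class-pairs, using the trapezoidal rule over the normalized cost axis (a simplified rendering; the paper's exact normalization may differ slightly):

```python
import numpy as np

def aucec(ranked_is_error):
    """Area under the cost-effectiveness curve for a ranked list.
    ranked_is_error[i] is truthy iff the i-th ranked class-pair is a
    ground-truth error. Cost (x-axis) is the fraction of pairs inspected;
    effectiveness (y-axis) is the fraction of true errors found."""
    flags = np.asarray(ranked_is_error, dtype=float)
    found = np.cumsum(flags) / flags.sum()       # effectiveness after each pair
    ys = np.concatenate([[0.0], found])          # curve starts at (0, 0)
    # trapezoidal area; each inspection step spans 1/len(flags) of the x-axis
    return float(((ys[:-1] + ys[1:]) / 2.0).sum() / len(flags))
```

A perfect ranking (all true errors first) maximizes the area, the worst ranking minimizes it, and a random ranking falls in between, which is exactly the comparison made in Figure 7.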
We begin our investigation by checking whether de-facto neuron coverage based metrics can capture class separation.
Here we first investigate whether popular white-box metrics can distinguish between different classes. Then we investigate whether DeepInspect can capture these differences. We evaluate this RQ w.r.t. the training data since the DNN behaviors are not tainted with inaccuracies associated with the test images.
RQ1a. Can Neuron Coverage distinguish between different classes? Neuron Coverage (NC), proposed by Pei et al., computes the ratio of unique neurons activated by an input set to the total number of neurons in a DNN. Here we compute NC per class-label, i.e., for a given class-label, we measure the number of neurons activated by the images tagged with that label w.r.t. the total neurons. The activation threshold we use is 0.5, the same as used by Pei et al. We perform this experiment on COCO and CIFAR-100 to study multi- and single-label classification. Figure 4 shows results for COCO; we observe similar results for CIFAR-100.
Each boxplot in the figure shows the distribution of neuron coverage per class-label across all the relevant images. These boxplots visually show that different labels have very similar distributions. We further compare these distributions using the Kruskal-Wallis test, a non-parametric way of comparing more than two groups; the test finds a statistically significant difference, i.e., some differences exist across these distributions. However, a pairwise Cohen's d effect size for each class-label pair, as shown in the following table, shows that more than 56% and 78% of class-pairs for CIFAR-100 and COCO, respectively, have small to negligible effect sizes. This means neuron coverage cannot reliably distinguish a majority of the class-labels.
(Table: Effect size of neuron coverage across different classes.)
RQ1b. Can DeepGauge distinguish between different classes? Ma et al. argue that each neuron has a primary region of operation; they identify this region by using a boundary condition on its output during training, and outputs outside this region are marked as corner cases. Leveraging this notion, they introduce multi-granular neuron- and layer-level coverage criteria. For neuron coverage they propose: (i) k-multisection coverage, to evaluate how thoroughly the primary region of a neuron is covered, (ii) boundary coverage, to compute how many corner cases are covered, and (iii) strong neuron activation coverage, to measure how much of the corner-case region above the upper boundary is covered. For layer-level coverage, they define (iv) top-k neuron coverage, identifying the k most active neurons in each layer, and (v) top-k neuron patterns, finding for each test case a sequence of neurons drawn from the top-k most active neurons across each layer.
We investigate whether each of these metrics can distinguish between different classes by measuring them for individual input classes, following the original paper's methodology. We first profiled every neuron's upper and lower bounds for each class using the training images containing that class-label. Next, we computed per-class neuron coverage using the test images containing that class, with a fixed k for k-multisection coverage. For layer-level coverage, we directly used the input images containing each class, again with a fixed k.
Figure 5 shows the results, i.e., histograms of the above five coverage criteria for the COCO dataset. For all five coverage criteria, there are many class-labels that share similar coverage. For example, in COCO, many labels have k-multisection neuron coverage values within a narrow range (see Figure 5), and similarly many labels have 0 neuron-boundary coverage. Therefore, none of the five coverage criteria is an effective way to distinguish between different classes. A similar conclusion is drawn for the CIFAR-100 dataset.
RQ1c. Can DeepInspect distinguish between different classes? Our white-box metric, the Neuron Activation Probability Matrix (P), is by construction designed per class; hence it would be unfair to directly measure its capability to distinguish between different classes. Thus, we pose the question in a slightly different way, as described below. For multi-label classification, each image contains multiple class-labels; for example, an image can have labels for both mouse and keyboard. Such coincidence of labels may create confusion: if two labels always appear together in the ground-truth set, no classifier can distinguish between them. To check how often two labels coincide, we define a coincidence score between two labels a and b as:

coincidence_score(a, b) = min(P(a|b), P(b|a))
The above formula computes the minimum probability of labels a and b occurring together in an image given that one of them is present. Note that this is a bi-directional score, i.e., we treat the two labels similarly; the min operation ensures the score accounts for both directions of coincidence. A low value of coincidence_score indicates two class-labels are easy to separate, and vice-versa.
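The coincidence score can be sketched directly from multi-label ground-truth annotations (the function name and the toy labels are ours):

```python
def coincidence_score(labels_per_image, a, b):
    """min(P(a|b), P(b|a)): the minimum probability that labels a and b
    occur together in an image, given that one of them is present.
    labels_per_image: a list of per-image label sets (multi-label GT)."""
    with_a = sum(1 for s in labels_per_image if a in s)
    with_b = sum(1 for s in labels_per_image if b in s)
    both = sum(1 for s in labels_per_image if a in s and b in s)
    p_a_given_b = both / with_b if with_b else 0.0
    p_b_given_a = both / with_a if with_a else 0.0
    return min(p_a_given_b, p_b_given_a)

# Toy multi-label ground truth: mouse always co-occurs with keyboard,
# but keyboard sometimes appears alone, so the min keeps the score below 1.
images = [{"mouse", "keyboard"}, {"keyboard"}, {"mouse", "keyboard"}, {"dog"}]
score = coincidence_score(images, "mouse", "keyboard")
```

Correlating this score with napvd over all label pairs is the basis of the Spearman analysis reported next.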
Now, to check DeepInspect’s capability to capture class separation, we simply check the correlation between coincidence_score and confusion score (napvd) from Equation 2 for all possible class-label pairs. Since only multi-label objects can have label coincidences, we perform this experiment for a pre-trained ResNet-50 model on the COCO multi-label classification task. A Spearman correlation coefficient between the confusion and coincidence scores reaches a value as high as 0.96, showing strong statistical significance. The result indicates that DeepInspect can disambiguate most of the classes that have a low confusion score.
Interestingly, we found some pairs with high coincidence scores that DeepInspect was nevertheless able to separate, e.g., (cup, chair) and (toilet, sink). Manually investigating such cases reveals that although these pairs often appear together in the input images, there are also enough instances where they appear by themselves. Thus, DeepInspect disambiguates between these classes and puts them far apart in the embedded space. These results indicate that DeepInspect can also learn hidden patterns from context and thus go beyond inspecting training-data coincidence for evaluating model bias/confusion, which is the de-facto technique among machine learning researchers.
We now investigate DeepInspect’s capability in detecting confusion and bias errors in DNN models.
In this RQ, we report DeepInspect’s ability to detect Type1/Type2 confusions w.r.t. the ground truth confusion errors described in Section IV-B1.
|||napvd < mean-1std||||||Top 1%||||
|Dataset||Technique||#TP||#FP||Precision||Recall||#TP||#FP||Precision||Recall|
|R CIFAR-10 S||DeepInspect||4||6||0.4||0.8||-||-||-||-|
|R CIFAR-10 L||DeepInspect||3||4||0.43||0.6||-||-||-||-|
|R CIFAR-10 R||DeepInspect||5||3||0.625||1||-||-||-||-|
We first explore the correlation between napvd and the ground truth Type1/Type2 confusion scores. We found a strong correlation in all 10 experimental settings; Figure 6 gives examples on COCO and CIFAR-10. These results indicate that napvd can be used to detect confusion errors: lower napvd means more confusion.
By default, DeepInspect reports as error-prone all class-pairs with napvd scores more than 1 standard deviation below the mean napvd score (see Figure 2(b)). In this setting, as shown in Table III, DeepInspect reports errors with high recall under most settings. Specifically, on NUS, CIFAR-100, and robust CIFAR-10 ResNet, DeepInspect achieves recall as high as 86.9%, 71.8%, and 100%, respectively. In total, DeepInspect identified thousands of confusion errors.
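The default cutoff can be sketched as a one-standard-deviation filter over per-pair napvd scores; the napvd values below are illustrative placeholders, not measured scores:

```python
from statistics import mean, pstdev

def flag_confused_pairs(napvd):
    """Report class pairs whose napvd is more than one standard
    deviation below the mean (lower napvd = more confusion)."""
    scores = list(napvd.values())
    cutoff = mean(scores) - pstdev(scores)
    return [pair for pair, score in napvd.items() if score < cutoff]

# hypothetical napvd scores for a few class pairs
napvd = {("mouse", "keyboard"): 0.10,
         ("cat", "dog"): 0.15,
         ("cup", "chair"): 0.80,
         ("toilet", "sink"): 0.85,
         ("car", "tree"): 0.90}
flagged = flag_confused_pairs(napvd)
```

Here the cutoff lands around 0.20, so only the two low-napvd pairs are reported as error-prone.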
If higher precision is desired, a user can choose to inspect only a small set of the most confused pairs ranked by napvd. As Table III also shows, when only the top 1% of confusion errors are reported, much higher precision is achieved for all the datasets. In particular, DeepInspect identifies 31 and 39 confusion errors for the COCO and CIFAR-100 models with 100% and 79.6% precision, respectively. The trade-off between precision and recall can be seen in the cost-effectiveness curves in Figure 7, which show the overall performance of DeepInspect at different inspection cutoffs. Overall, w.r.t. a random baseline, DeepInspect gains AUCEC performance from to . In fact, for the robust CIFAR-10 models, DeepInspect’s performance is close to optimal.
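The cost-effectiveness comparison can be sketched with a simplified AUCEC computation: the area under the curve of recall versus fraction of pairs inspected. This is our own discrete approximation, not the paper's evaluation script:

```python
def aucec(ranked_pairs, true_errors):
    """Approximate area under the cost-effectiveness curve:
    recall of true errors as a function of the fraction of
    class pairs inspected, in the given inspection order."""
    n = len(ranked_pairs)
    found, recalls = 0, []
    for pair in ranked_pairs:
        if pair in true_errors:
            found += 1
        recalls.append(found / len(true_errors))
    # trapezoidal rule over inspection fractions 1/n, 2/n, ..., 1
    return sum((recalls[i] + recalls[i + 1]) / 2 for i in range(n - 1)) / n

ranked = ["p1", "p2", "p3", "p4"]   # best case: both true errors ranked first
errors = {"p1", "p2"}
best = aucec(ranked, errors)
worst = aucec(list(reversed(ranked)), errors)
```

A ranking that surfaces true errors early (e.g., sorting pairs by ascending napvd) yields a larger area than one that buries them, which is what the comparison against the random baseline measures.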
Figure 8 and Figure 9 show some specific confusion errors found by DeepInspect in the COCO and ImageNet settings. In particular, as shown in Figure 7(a), when there is only a keyboard but no mouse in the image, the COCO model reports both. Similarly, Figure 8(a) shows a confusion error for (cello, violin): there are several cellos in the image, but the model predicts a violin.
Across all three robust CIFAR-10 models, DeepInspect identifies (cat, dog), (bird, deer), and (automobile, truck) as erroneous pairs, where one class of the pair is very likely to be misclassified as the other. This indicates that these confusion errors are tied to the training data, so all models trained on this dataset, including the robust models, may share them. These results further show that confusion errors are orthogonal to norm-based adversarial perturbations and require a different technique to address.
We evaluate this RQ by estimating a model’s bias (avg_bias) using Equation 4 w.r.t. the ground truth (avg_cd) computed following Section IV-B2. We first explore the correlation between pairwise avg_cd and our proposed pairwise avg_bias; Figure 10 shows the results for the COCO and CIFAR-100 datasets, and similar trends hold in the other datasets we study. The results show a strong correlation between avg_cd and avg_bias; in other words, our proposed avg_bias is a good proxy for detecting bias errors.
|||avg_bias > mean+1std||||||Top 1%||||
|Dataset||Technique||#TP||#FP||Precision||Recall||#TP||#FP||Precision||Recall|
|R CIFAR-10 S||DeepInspect||7||4||0.636||0.778||-||-||-||-|
|R CIFAR-10 L||DeepInspect||6||7||0.462||0.667||-||-||-||-|
|R CIFAR-10 R||DeepInspect||6||3||0.667||0.667||-||-||-||-|
As in RQ2, we also perform a precision-recall analysis of the reported bias errors across all the datasets at two cutoffs: Top1(avg_bias) and mean(avg_bias)+std(avg_bias). The results are shown in Table IV. At cutoff Top1(avg_bias), DeepInspect has high precision: it detects ground-truth suspicious pairs with precision as high as 75% and 84% for COCO and imSitu, respectively. At cutoff mean(avg_bias)+std(avg_bias), DeepInspect has high recall but relatively low precision: it detects ground-truth suspicious pairs with recall as high as 75.9% and 71.8% for COCO and imSitu, respectively, reporting 657 (=249+408) true bias errors in total for the two models. DeepInspect outperforms the random baseline by a large margin at both cutoffs. As in the case of detecting confusion errors, a trade-off between precision and recall exists here and can be customized based on a user’s need. The cost-effectiveness analysis in Figure 11 shows the entire spectrum.
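The mean-plus-one-standard-deviation cutoff and its precision/recall accounting can be sketched as follows; the avg_bias values and ground-truth set are illustrative placeholders, not the experimental data:

```python
from statistics import mean, pstdev

def bias_errors_at_cutoff(avg_bias, ground_truth):
    """Flag class pairs whose avg_bias exceeds mean + 1 std and
    report precision/recall against ground-truth bias errors."""
    scores = list(avg_bias.values())
    cutoff = mean(scores) + pstdev(scores)
    reported = {pair for pair, score in avg_bias.items() if score > cutoff}
    tp = reported & ground_truth
    precision = len(tp) / len(reported) if reported else 0.0
    recall = len(tp) / len(ground_truth) if ground_truth else 0.0
    return reported, precision, recall

# hypothetical per-pair bias scores and ground-truth bias errors
avg_bias = {("man", "skis"): 0.9, ("man", "surfboard"): 0.85,
            ("cat", "dog"): 0.2, ("cup", "chair"): 0.1, ("car", "bus"): 0.15}
truth = {("man", "skis"), ("cat", "dog")}
reported, precision, recall = bias_errors_at_cutoff(avg_bias, truth)
```

Raising the cutoff (e.g., to Top1(avg_bias)) shrinks the reported set and trades recall for precision, which is the trade-off shown in Table IV.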
As shown in the figure, DeepInspect outperforms the random baseline by a large margin; the AUCEC gains of DeepInspect over the random baseline range from to across the 10 settings. The performance of DeepInspect is close to the optimal curve under some settings: the AUCEC gains of the optimal curve over DeepInspect are only 7.11% and 7.95% under the COCO and imSitu settings, respectively.
Inspired by , which shows that bias exists between man and woman in COCO in the task of image captioning, we analyze the most biased third class when the class pair is (man, woman) in COCO and imSitu. As shown in Figure 12, we found that sports classes like skis, snowboard, and surfboard are more closely associated with man and thus mislead the model into predicting women in the images to be men. Figure 13 shows the results on imSitu: the model tends to associate the class “inside” with woman while associating the class “outside” with man.
We generalize the idea by allowing the class pair to be any pair of classes and find that similar bias also exists in single-label classification settings. For example, in ImageNet, one of the strongest biases is between Eskimo_dog and rapeseed w.r.t. Siberian_husky: the model tends to confuse the two dog breeds but not Eskimo_dog and rapeseed. This makes sense, since Eskimo_dog and Siberian_husky are both dogs and thus more easily misclassified as each other.
Note that one kind of fairness violation in a DNN system is a drastic difference in accuracy across groups divided according to some sensitive feature(s). In black-box testing, the tester can obtain a number indicating the degree to which fairness has been violated by feeding a validation set into the model. In contrast, DeepInspect provides a new angle on fairness violations: the neuron-distance difference between two classes w.r.t. a third class sheds light on why the model is more likely to confuse one of them with the third class than the other. We leave a more comprehensive examination of interpreting bias/fairness violations for future work.
VI. Threats to Validity
Although DeepInspect can find confusion and bias errors, its performance relies on the accuracy of the model it tests. Since DeepInspect groups images according to the predicted labels, when the model has very low testing accuracy, the embedding of objects will not be accurate, leading to inferior testing performance.
Another limitation is that DeepInspect needs to decide thresholds for both confusion errors and bias errors. We minimize this threat by choosing thresholds that are 1 standard deviation away from the corresponding mean values.
The task of classifying any possible object accurately is notoriously difficult. Here we simplify the problem by testing the DNN model only on classes it has seen during training. For example, if the model is trained with images of cows, we test whether it classifies variations of cows correctly; however, we do not test the model on dinosaurs if it has never seen one during training.
VII. Related Work
Software Testing & Verification of DNN. Prior research proposed different white-box testing criteria based on neuron coverage [14, 30, 13] and neuron-pair coverage . Sun et al.  further presented a concolic testing approach for DNNs and showed that it can effectively increase coverage and find adversarial examples. There are also efforts to verify DNNs [50, 51, 35, 52] against adversarial attacks. However, most of the verification efforts are limited to small DNNs and limited system-wide properties (e.g., ranges of pixel values). In contrast, we propose testing strategies that test class separations. Our initial results indicate that such metrics can identify bias and weaknesses of end-to-end DNN applications.
Adversarial Deep Learning. DNNs are known to be vulnerable to well-crafted inputs called adversarial examples, which are imperceptible to a human but can easily make DNNs fail [53, 54, 55, 56, 57, 58, 59, 60]. Much work has been done to defend against adversarial attacks [61, 62, 63, 64, 65, 66, 67, 68, 69]. Our method has the potential to identify adversarial inputs. Moreover, adversarial examples are usually out-of-distribution and not realistic, whereas we do not need to generate any new transformed images to find errors. Further, we identify general weaknesses or errors rather than focusing on crafted attacks, which often require a strong attacker model.
Interpreting DNN. There is much research on model interpretability and visualization [70, 71, 72, 73, 74, 75]. In particular, Dong et al.  observed that instead of learning the semantic features of whole objects, neurons tend to react to different parts of the objects in a recurrent manner. Our probabilistic view of neuron activation per class aims to capture the holistic behavior of an entire class instead of individual objects, so that diverse features of class members can be captured. Closest to ours is the work of Papernot et al. , who used nearest training points to explain adversarial attacks. In comparison, we analyze the DNN’s dependencies on the entire training/testing data and represent them in a matrix. By inspecting this matrix, we can explain the bias and weaknesses of the DNN.
Evaluating model’s Bias/Fairness. Evaluating bias and fairness of a system is important both from a theoretical and a practical perspective [77, 78, 79, 80]. At a high level, the related studies first define a fairness criterion and then try to optimize the original objective while satisfying that criterion [81, 82, 83, 84, 85, 86]. These properties are defined either at the individual [81, 87, 88] or group level [89, 82, 90]. In this work, we showed the potential of DeepInspect for detecting group-level fairness violations.
Galhotra et al.  first applied the notion of software testing to evaluating software fairness: they mutate the sensitive features of the inputs and check whether the output changes. One major problem with their proposed method, Themis, is that it assumes the model under test takes sensitive attribute(s) as input during training and inference. Such an assumption is not realistic, since most existing fairness-aware models drop sensitive input feature(s). Besides, Themis does not work for image classification, where a sensitive attribute (e.g. gender, race) is a visual concept that cannot be flipped easily. In our work, we use a white-box approach to measure the bias learned by the model during training. Our testing method does not assume the model under test takes any sensitive feature(s) as input. We propose a new fairness notion for the setting of multi-object classification, average confusion disparity, and a proxy, average bias, to measure it for any deep learning model even when only unlabeled testing data is provided. In addition, our method tries to provide an explanation behind such discrimination. A complementary approach by Papernot et al.  shows such explainability behind model bias in the single-label classification setting.
VIII. Conclusion
In this paper, we propose a white-box DNN testing framework that can automatically detect confusion and bias errors in DNN-based image classification models. We implemented DeepInspect and applied it to 8 popular image classification datasets and 10 pretrained DNN models, including 4 pre-trained robust models. We show that DeepInspect detects errors for both single- and multi-label classification models with high precision. We also make all the errors DeepInspect detected available on a public website.
In this work, we mainly focus on detecting confusion/bias errors. A natural follow-up question is how to fix these errors. Unlike fixing bugs in traditional software, fixing errors in DNNs is an open problem and often requires retraining the models. Preliminary results on the COCO dataset show that augmenting the training dataset with images from confusing class pairs can reduce the confusion errors. We leave a more comprehensive examination of how to fix the confusion and bias errors found by DeepInspect for future work.
-  P. Kamavisdar, S. Saluja, and S. Agrawal, “A survey on image classification approaches and techniques,” International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 1, pp. 1005–1009, 2013.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
-  L. Grush, “Google engineer apologizes after photos app tags two black people as gorillas,” 2015. [Online]. Available: https://www.theverge.com/2015/7/1/8880363/google-apologizes-photos-app-tags-two-black-people-gorillas
-  MalletsDarker, “I took a few shots at lake louise today and google offered me this panorama,” 2018. [Online]. Available: https://www.reddit.com/r/funny/comments/7r9ptc/i_took_a_few_shots_at_lake_louise_today_and/dsvv1nw/
-  J. Zhao, T. Wang, M. Yatskar, V. Ordonez, and K.-W. Chang, “Men also like shopping: Reducing gender bias amplification using corpus-level constraints,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 2941–2951. [Online]. Available: https://www.aclweb.org/anthology/D17-1319
-  A. Rosenfeld, R. S. Zemel, and J. K. Tsotsos, “The elephant in the room,” CoRR, vol. abs/1808.03305, 2018. [Online]. Available: http://arxiv.org/abs/1808.03305
-  A. Rose, “Are face-detection cameras racist?” 2010. [Online]. Available: http://content.time.com/time/business/article/0,8599,1954643,00.html
-  J. Buolamwini and T. Gebru, “Gender shades: Intersectional accuracy disparities in commercial gender classification,” in FAT, 2018.
-  D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané, “Concrete problems in AI safety.” [Online]. Available: http://arxiv.org/abs/1606.06565
-  B. J. Taylor and M. A. Darrah, “Rule extraction as a formal method for the verification and validation of neural networks,” in Neural Networks, 2005. IJCNN’05. Proceedings. 2005 IEEE International Joint Conference on, vol. 5. IEEE, 2005, pp. 2915–2920.
-  S. A. Seshia, D. Sadigh, and S. S. Sastry, “Towards verified artificial intelligence,” arXiv preprint arXiv:1606.08514, 2016.
-  T. Dreossi, S. Jha, and S. A. Seshia, “Semantic adversarial deep learning,” in International Conference on Computer-Aided Verification (CAV), 2018.
-  Y. Tian, K. Pei, S. Jana, and B. Ray, “Deeptest: Automated testing of deep-neural-network-driven autonomous cars,” in International Conference of Software Engineering (ICSE), 2018 IEEE conference on. IEEE, 2018.
-  K. Pei, Y. Cao, J. Yang, and S. Jana, “Deepxplore: Automated whitebox testing of deep learning systems,” pp. 1–18, 2017. [Online]. Available: http://doi.acm.org/10.1145/3132747.3132785
-  T. M. Mitchell, Machine Learning, 1st ed. New York, NY, USA: McGraw-Hill, Inc., 1997.
-  Y. LeCun, L. Bottou, Y. Bengio, P. Haffner et al., “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
-  D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Cognitive modeling, vol. 5, no. 3, p. 1, 1988.
-  D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, and M. Young, “Machine learning: The high interest credit card of technical debt,” 2014.
-  G. Tsoumakas and I. Katakis, “Multi-label classification: An overview,” International Journal of Data Warehousing and Mining (IJDWM), vol. 3, no. 3, pp. 1–13, 2007.
-  A. Krizhevsky, “Learning multiple layers of features from tiny images,” University of Toronto, 05 2012.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision. Springer, 2014, pp. 740–755.
-  T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y.-T. Zheng, “Nus-wide: A real-world web image database from national university of singapore,” in Proc. of ACM Conf. on Image and Video Retrieval (CIVR’09), Santorini, Greece., July 8-10, 2009.
-  M. Yatskar, L. Zettlemoyer, and A. Farhadi, “Situation recognition: Visual semantic role labeling for image understanding,” in Conference on Computer Vision and Pattern Recognition, 2016.
-  Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
-  I. H. Witten, E. Frank, M. A. Hall, and C. J. Pal, Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2016.
-  “Inside waymo’s secret world for training self-driving cars,” https://www.theatlantic.com/technology/archive/2017/08/inside-waymos-secret-testing-and-simulation-facilities/537648/, 2017.
-  “Google auto waymo disengagement report for autonomous driving,” https://www.dmv.ca.gov/portal/wcm/connect/946b3502-c959-4e3b-b119-91319c27788f/GoogleAutoWaymo_disengage_report_2016.pdf?MOD=AJPERES, 2016.
-  L. Ma, F. Juefei-Xu, F. Zhang, J. Sun, M. Xue, B. Li, C. Chen, T. Su, L. Li, Y. Liu, J. Zhao, and Y. Wang, “Deepgauge: Multi-granularity testing criteria for deep learning systems,” pp. 120–131, 2018. [Online]. Available: http://doi.acm.org/10.1145/3238147.3238202
-  Y. Sun, X. Huang, and D. Kroening, “Testing deep neural networks,” arXiv preprint arXiv:1803.04792, 2018.
-  I. Goodfellow and N. Papernot, “The challenge of verification and testing of machine learning,” http://www.cleverhans.io/security/privacy/ml/2017/06/14/verification.html, 2017.
-  T. J. Ostrand and M. J. Balcer, “The category-partition method for specifying and generating fuctional tests,” Communications of the ACM, vol. 31, no. 6, pp. 676–686, 1988.
-  D. Hamlet and R. Taylor, “Partition testing does not inspire confidence (program testing),” IEEE Transactions on Software Engineering, vol. 16, no. 12, pp. 1402–1411, 1990.
-  X. Huang, M. Kwiatkowska, S. Wang, and M. Wu, “Safety verification of deep neural networks,” in International Conference on Computer Aided Verification. Springer, 2017, pp. 3–29.
-  J. Kim, R. Feldt, and S. Yoo, “Guiding deep learning system testing using surprise adequacy,” in Proceedings of the 41th International Conference on Software Engineering, ser. ICSE 2019, 2019.
-  T. Wang, K. Yamaguchi, and V. Ordonez, “Feedback-prop: Convolutional neural network inference under partial evidence,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  “Base pretrained models and datasets in pytorch,” 2017. [Online]. Available: https://github.com/aaron-xichen/pytorch-playground
-  E. Wong, F. Schmidt, J. H. Metzen, and J. Z. Kolter, “Scaling provable adversarial defenses,” in Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds. Curran Associates, Inc., 2018, pp. 8410–8419. [Online]. Available: http://papers.nips.cc/paper/8060-scaling-provable-adversarial-defenses.pdf
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations (ICLR), 2015.
-  “Tiny imagenet visual recognition challenge,” 2017. [Online]. Available: https://tiny-imagenet.herokuapp.com/
-  S. Wang, Y. Chen, A. Abdou, and S. Jana, “Mixtrain: Scalable training of formally robust neural networks,” CoRR, vol. abs/1811.02625, 2018. [Online]. Available: http://arxiv.org/abs/1811.02625
-  F. Rahman and P. Devanbu, “How, and why, process metrics are better,” in 2013 35th International Conference on Software Engineering (ICSE). IEEE, 2013, pp. 432–441.
-  E. Arisholm, L. C. Briand, and E. B. Johannessen, “A systematic and comprehensive investigation of methods to build and evaluate fault prediction models.” JSS, vol. 83, no. 1, pp. 2–17, 2010.
-  F. Rahman, D. Posnett, A. Hindle, E. Barr, and P. Devanbu, “Bugcache for inspections: hit or miss?” in Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering. ACM, 2011, pp. 322–331.
-  B. Ray, V. Hellendoorn, S. Godhane, Z. Tu, A. Bacchelli, and P. Devanbu, “On the" naturalness" of buggy code,” in 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE). IEEE, 2016, pp. 428–439.
-  I. H. Witten and E. Frank, Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2005.
-  W. H. Kruskal and W. A. Wallis, “Use of ranks in one-criterion variance analysis,” Journal of the American Statistical Association, vol. 47, no. 260, pp. 583–621, 1952. [Online]. Available: https://www.tandfonline.com/doi/abs/10.1080/01621459.1952.10483441
-  Y. Sun, M. Wu, W. Ruan, X. Huang, M. Kwiatkowska, and D. Kroening, “Concolic testing for deep neural networks,” in Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ser. ASE 2018. New York, NY, USA: ACM, 2018, pp. 109–119. [Online]. Available: http://doi.acm.org/10.1145/3238147.3238172
-  K. Pei, Y. Cao, J. Yang, and S. Jana, “Towards practical verification of machine learning: The case of computer vision systems,” arXiv preprint arXiv:1712.01785, 2017.
-  G. Katz, C. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer, Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks. Cham: Springer International Publishing, 2017, pp. 97–117.
-  S. Wang, K. Pei, J. Whitehouse, J. Yang, and S. Jana, “Formal security analysis of neural networks using symbolic intervals,” 2018.
-  X. Yuan, P. He, Q. Zhu, and X. Li, “Adversarial examples: Attacks and defenses for deep learning,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–20, 2019.
-  A. Raghunathan, J. Steinhardt, and P. Liang, “Certified defenses against adversarial examples,” 6th International Conference on Learning Representations (ICLR), 2018.
-  I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in International Conference on Learning Representations (ICLR), 2015.
-  A. Nguyen, J. Yosinski, and J. Clune, “Deep neural networks are easily fooled: High confidence predictions for unrecognizable images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 427–436.
-  N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami, “The limitations of deep learning in adversarial settings,” in 2016 IEEE European Symposium on Security and Privacy (EuroS&P). IEEE, 2016, pp. 372–387.
-  C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” in International Conference on Learning Representations (ICLR), 2014.
-  S. Huang, N. Papernot, I. Goodfellow, Y. Duan, and P. Abbeel, “Adversarial attacks on neural network policies,” arXiv preprint arXiv:1702.02284, 2017.
-  A. Kurakin, I. Goodfellow, and S. Bengio, “Adversarial examples in the physical world,” in Workshop track, 2017.
-  O. Bastani, Y. Ioannou, L. Lampropoulos, D. Vytiniotis, A. Nori, and A. Criminisi, “Measuring neural net robustness with constraints,” in Advances in Neural Information Processing Systems, 2016, pp. 2613–2621.
-  N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” in Security and Privacy (SP), 2017 IEEE Symposium on. IEEE, 2017, pp. 39–57.
-  S. Gu and L. Rigazio, “Towards deep neural network architectures robust to adversarial examples,” in International Conference on Learning Representations (ICLR), 2015.
-  J. H. Metzen, T. Genewein, V. Fischer, and B. Bischoff, “On detecting adversarial perturbations,” in International Conference on Learning Representations (ICLR), 2017.
-  N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami, “Distillation as a defense to adversarial perturbations against deep neural networks,” in Security and Privacy (SP), 2016 IEEE Symposium on. IEEE, 2016, pp. 582–597.
-  U. Shaham, Y. Yamada, and S. Negahban, “Understanding adversarial training: Increasing local stability of neural nets through robust optimization,” arXiv preprint arXiv:1511.05432, 2015.
-  W. Xu, D. Evans, and Y. Qi, “Feature squeezing: Detecting adversarial examples in deep neural networks,” arXiv preprint arXiv:1704.01155, 2017.
-  S. Zheng, Y. Song, T. Leung, and I. Goodfellow, “Improving the robustness of deep neural networks via stability training,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4480–4488.
-  W. He, J. Wei, X. Chen, N. Carlini, and D. Song, “Adversarial example defenses: Ensembles of weak defenses are not strong,” in Proceedings of the 11th USENIX Conference on Offensive Technologies, ser. WOOT’17. Berkeley, CA, USA: USENIX Association, 2017, pp. 15–15. [Online]. Available: http://dl.acm.org/citation.cfm?id=3154768.3154783
-  Z. C. Lipton, “The mythos of model interpretability,” Proceedings of the 33rd International Conference on Machine Learning Workshop, 2016.
-  Q.-s. Zhang and S.-C. Zhu, “Visual interpretability for deep learning: a survey,” Frontiers of Information Technology & Electronic Engineering, vol. 19, no. 1, pp. 27–39, 2018.
-  R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in 2017 IEEE International Conference on Computer Vision (ICCV), Oct 2017, pp. 618–626.
-  G. Montavon, W. Samek, and K.-R. Müller, “Methods for interpreting and understanding deep neural networks,” Digital Signal Processing, 2017.
-  D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba, “Network dissection: Quantifying interpretability of deep visual representations,” in Computer Vision and Pattern Recognition, 2017.
-  Y. Dong, H. Su, J. Zhu, and F. Bao, “Towards interpretable deep neural networks by leveraging adversarial examples,” arXiv preprint arXiv:1708.05493, 2017.
-  N. Papernot and P. McDaniel, “Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning,” arXiv preprint arXiv:1803.04765, 2018.
-  B. T. Luong, S. Ruggieri, and F. Turini, “k-nn as an implementation of situation testing for discrimination discovery and prevention,” in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2011, pp. 502–510.
-  R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork, “Learning fair representations,” in Proceedings of the 30th International Conference on Machine Learning, 2013, pp. 325–333.
-  M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K. P. Gummadi, “Fairness constraints: Mechanisms for fair classification,” vol. 54, 2017.
-  Y. Brun and A. Meliou, “Software fairness,” in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE 2018. New York, NY, USA: ACM, 2018, pp. 754–759. [Online]. Available: http://doi.acm.org/10.1145/3236024.3264838
-  C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. S. Zemel, “Fairness through awareness,” In Proceedings of the Innovations in Theoretical Computer Science Conference, vol. abs/1104.3913, pp. 214–226, 2012.
-  M. Hardt, E. Price, and N. Srebro, “Equality of opportunity in supervised learning,” in Proceedings of the 30th International Conference on Neural Information Processing Systems, ser. NIPS’16, USA, 2016, pp. 3323–3331.
-  S. Barocas, M. Hardt, and A. Narayanan, Fairness and Machine Learning. fairmlbook.org, 2018, http://www.fairmlbook.org.
-  A. K. Menon and R. C. Williamson, “The cost of fairness in binary classification,” in Conference on Fairness, Accountability and Transparency, FAT 2018, 23-24 February 2018, New York, NY, USA, 2018, pp. 107–118. [Online]. Available: http://proceedings.mlr.press/v81/menon18a.html
-  M. Donini, L. Oneto, S. Ben-David, J. Shawe-Taylor, and M. Pontil, “Empirical risk minimization under fairness constraints,” in Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada., 2018, pp. 2796–2806. [Online]. Available: http://papers.nips.cc/paper/7544-empirical-risk-minimization-under-fairness-constraints
-  A. L. Lamy, Z. Zhong, A. K. Menon, and N. Verma, “Noise-tolerant fair classification,” CoRR, vol. abs/1901.10837, 2019. [Online]. Available: http://arxiv.org/abs/1901.10837
-  M. J. Kusner, J. Loftus, C. Russell, and R. Silva, “Counterfactual fairness,” in Advances in Neural Information Processing Systems 30, 2017, pp. 4066–4076.
-  M. P. Kim, O. Reingold, and G. N. Rothblum, “Fairness through computationally-bounded awareness,” 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), 2018.
-  T. Calders, F. Kamiran, and M. Pechenizkiy, “Building classifiers with independency constraints,” in 2009 IEEE International Conference on Data Mining Workshops, Dec 2009, pp. 13–18.
-  M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K. P. Gummadi, “Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment,” in Proceedings of the 26th International Conference on World Wide Web, 2017, pp. 1171–1180.
-  S. Galhotra, Y. Brun, and A. Meliou, “Fairness testing: testing software for discrimination,” in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. ACM, 2017, pp. 498–510.