Testing Deep Learning Models for Image Analysis Using Object-Relevant Metamorphic Relations

Deep learning models are widely used for image analysis. While they offer high performance in terms of accuracy, people are concerned about whether these models inappropriately make inferences using irrelevant features that are not encoded from the target object in a given image. To address this concern, we propose a metamorphic testing approach that assesses whether a given inference is made based on irrelevant features. Specifically, we propose two novel metamorphic relations to detect such inappropriate inferences. We applied our approach to 10 image classification models and 10 object detection models, with three large datasets, i.e., ImageNet, COCO, and Pascal VOC. Over 5.3% of the top-5 correct predictions made by the image classification models are subject to inappropriate inferences using irrelevant features. The corresponding rate for the object detection models is over 8.5%. Based on these findings, we further designed a new image generation strategy that can effectively attack existing models. Compared with a baseline approach, our strategy can double the success rate of attacks.

I Introduction

Deep learning models have been widely deployed for image analysis applications, such as image classification [15, 38, 18], object detection [26, 35, 34] and human keypoint detection [41, 7, 23, 33]. While these image analysis models outperform classical machine learning algorithms, recent studies [36, 39, 30] have raised concerns about such models’ reliability.

Various testing techniques [32, 40, 45, 44, 10, 9, 29] have been proposed to help assess the reliability of deep learning models for image analysis. For instance, Pei et al. [32] proposed an optimization strategy to generate test inputs for image classification and digit recognition applications. However, a major limitation of these techniques is that they do not consider whether the inferences made by a model are based on the features encoded from the target objects instead of those encoded from these objects’ background. We refer to the former as object-relevant features and the latter as object-irrelevant features. For example, the features encoded from the rectangular region occupied by the keyboard object in the image shown in Fig. 1(a) are considered object-relevant for a keyboard detection model. Other features encoded from the rest of this image are object-irrelevant. Such relevant and irrelevant features vary with target objects. For example, a mouse detection model would consider the features encoded from the “mouse” in Fig. 1(a) as object-relevant.

Fig. 1: (a): An Image from ImageNet. (b): Object (Mouse) Preserving Mutation. (c): Object (Mouse) Removing Mutation

Deep learning models do not necessarily make inferences based on object-relevant features. For instance, a recent study showed that a model would classify an image with a bright background as “wolf” regardless of the objects in the image [36]. Even though the model could output accurate results with respect to the test inputs (i.e., test images), such inferences are unreliable. More seriously, unreliable inferences based on object-irrelevant features are vulnerable to malicious attacks. For example, Gu et al. [14] showed that attackers could inject a backdoor trigger, such as a yellow square in an image’s background, into a deep neural network (DNN) model. A model that makes inferences based on object-irrelevant features (e.g., the yellow square in the background) will then classify any image containing this trigger into a specific label, regardless of the target object in the image. Deploying such a model in mission-critical applications could cause catastrophic consequences. Therefore, it is important to develop effective techniques to assess deep learning inference results from the perspective of object relevancy.

However, there are two major challenges that prevent us from easily validating inference results generated by deep learning models with respect to object relevancy. First, obtaining oracles for testing deep learning models is hard [44]. We resort to metamorphic testing [5] to tackle this challenge. Metamorphic testing has been widely leveraged to test deep learning models for image analysis [45, 44, 10, 9]. Second, it is difficult to measure whether a model makes an inference based on object-relevant features. Efforts have been made to explain whether an inference made by a model is trustworthy [36]. However, model explanation is still an open challenge. Recent research focuses mostly on image classification models. Further, existing studies cannot quantitatively measure to what extent an inference is made with respect to object relevancy. Besides, they are designed for the purpose of interpretation instead of testing, and thus cannot be easily adapted to validate inferences made by deep learning models (e.g., they cannot generate test inputs). To address this challenge, we propose two novel metamorphic relations (MRs) to quantitatively assess a model’s inferences from the perspective of object relevancy as follows:

  • MR-1: An image after altering the regions unoccupied by the target object should lead to a similar inference result.

  • MR-2: An image after removing the target object should lead to a dissimilar inference result.

Formulation of these two metamorphic relations is given in Section III. Based on these two relations, we propose a metamorphic testing technique to assess whether an inference made by a deep learning model for image analysis is based on object-relevant features. Essentially, we design image mutation operations concerning the two relations to generate test inputs. To test the object relevancy of an image inference, we apply these operations on the given image to construct mutated images and check whether the subsequent inferences on such mutated images satisfy the metamorphic relations. Based on the metamorphic testing results, we devise an “object-relevancy score” as a metric to measure the extent to which an inference made by a deep learning model is based on object-relevant features.

We evaluated our technique using 10 common image classification and 10 object detection models on 3 popular large datasets: ImageNet [8], VOC [11] and COCO [25]. We found that over 5.3% of the correct classification results made by the image classification models are not based on object-relevant features. The corresponding rate for the object detection models is over 8.5%. For specific models, the rate can be as high as 29.1%. We additionally defined and demonstrated a simple yet effective strategy to attack deep learning models by leveraging the object relevancy scores.

To summarize, this paper makes three major contributions:

  1. We proposed a metamorphic testing technique to assess the reliability of inferences generated by deep learning models for image analysis using object-relevant metamorphic relations.

  2. We proposed a metric “object-relevancy score” to measure the object relevancy of an inference result. We further show that our metric could be used to effectively facilitate an existing attacking method.

  3. We conducted experiments on 20 common deep learning models for image analysis. We found that the inference results with low object-relevancy scores commonly exist in these models.

II Preliminaries

II-A Metamorphic Testing

Metamorphic testing [5, 6] was proposed to address the test oracle problem. It works in two steps. First, it constructs a new set of test inputs (called follow-up inputs) from a given set of test inputs (called source inputs). Second, it checks whether the program outputs based on the source inputs and follow-up inputs satisfy some desirable properties, known as metamorphic relations (MR).

For example, suppose p is a program implementing the sin function. We know that the equation sin(x) = sin(π - x) holds for any numeric value x. Leveraging this knowledge, we can apply metamorphic testing to p as follows. Given a set of source inputs {x1, ..., xn}, we first construct a set of follow-up inputs {t1, ..., tn}, where ti = π - xi. Then, we check whether the metamorphic relation p(xi) = p(ti) holds. A violation of it indicates the presence of faults in p. These two steps can be applied repetitively by treating the follow-up inputs in one cycle as the source inputs in the next cycle.
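As a concrete illustration, a minimal sketch of this workflow in Python, using the sin relation above as the metamorphic relation (the program under test is a stand-in):

```python
import math
import random

def program_under_test(x):
    # Stand-in for the program p whose correctness we cannot check directly.
    return math.sin(x)

def metamorphic_test(source_inputs, tol=1e-9):
    """Check the relation p(x) == p(pi - x) for each source input."""
    violations = []
    for x in source_inputs:
        follow_up = math.pi - x  # construct the follow-up input
        if abs(program_under_test(x) - program_under_test(follow_up)) > tol:
            violations.append(x)  # relation violated: a possible fault in p
    return violations

sources = [random.uniform(-10.0, 10.0) for _ in range(100)]
print("violating inputs:", metamorphic_test(sources))
```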

II-B Image Analysis Based on Deep Learning

Image analysis is a key application area of deep learning algorithms, covering image classification [15, 38, 18], object detection [26, 35, 34], human keypoint detection [41, 7, 23, 33] and so on.

II-B1 Image Classification

An image classifier is built to classify a given image into a category. AlexNet [22], VGG [38], DenseNet [19], and MobileNets [18] are popular models for image classification. MNIST [24], CIFAR-10 [21], and ImageNet [8] are datasets that have been widely used to evaluate these models. The performance of the models is mostly evaluated based on the top-1/5 error rate, which refers to the percentage of test images whose correct labels are not in the top-1/5 inference(s) made by the models [22, 15, 38, 19, 18].

II-B2 Object Detection

An object detector is built to identify the locations of target objects in a given image and label their categories. An object detection result usually contains multiple regions of interest, each of which is marked by a bounding box or a mask. The bounding box and the object mask show the region of the object, in the form of a rectangle or a contour, respectively. Each region of interest is annotated with a confidence value, indicating the confidence in the inference. Single Shot MultiBox Detector (SSD) [26], YOLO [34] and Faster R-CNN [35] are popular detectors. PASCAL VOC [11] and COCO [25] are datasets widely used by studies on object detection. The performance of object detection models is evaluated using several metrics. The VOC challenge uses the Precision x Recall curve and Average Precision. The COCO challenge uses mAP (mean Average Precision) [17]. The metrics used in both challenges require the computation of IOU (Intersection Over Union): IOU = area(B_gt ∩ B_dt) / area(B_gt ∪ B_dt), where B_gt is the object’s bounding box in the ground truth and B_dt is the bounding box of the object detected by a model. Here, the bounding boxes can be substituted by object masks.
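For reference, a minimal sketch of the IOU computation for axis-aligned bounding boxes (assuming the (x1, y1, x2, y2) corner convention):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```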

III Object-Relevant Metamorphic Relations

With the aim to quantitatively measure to what extent an inference made by deep learning models is based on object-relevant features, we propose two novel metamorphic relations as mentioned in Section I. This section presents the details of these two relations. Specifically, we follow a common metamorphic testing framework to define the metamorphic relations [6]. In the subsequent formulation, let f(x) denote the inference made by a deep learning model f on an image x, and d(f(x1), f(x2)) denote the distance between two inferences f(x1) and f(x2).

MR-1: An image after altering the regions unoccupied by the target object should lead to a similar inference result.

Relation Formulation: Let x_p be a follow-up image constructed from a source image x for a model f by preserving the target object but mutating the other parts. We consider such a mutation object-preserving. An example of an object-preserving mutation for a keyboard detection model is given by Fig. 1(a) (source image) and Fig. 1(c) (follow-up image). MR-1 mandates that f(x) and f(x_p) should satisfy the relation d(f(x), f(x_p)) ≤ ε_p. Here, ε_p denotes a threshold for the distance between two inference results made by a model under metamorphic testing using object-preserving mutations.

Explanation: If an inference made by a specific model is based on object-relevant features, after object-preserving mutations, the new inference results should be similar since the object-relevant features are preserved and should still be leveraged by the model.

MR-2: An image after removing the target object should lead to a dissimilar inference result.

Relation Formulation: Let x_r be a follow-up image constructed from a source image x for a model f by removing the target object but preserving its background. We consider such a mutation object-removing. An example of an object-removing mutation for a keyboard detection model is given by Fig. 1(a) (source image) and Fig. 1(b) (follow-up image). MR-2 mandates that f(x) and f(x_r) should satisfy the relation d(f(x), f(x_r)) > ε_r. Here, ε_r denotes a threshold for the distance between two inference results made by a model under metamorphic testing using object-removing mutations.

Explanation: If an inference made by a specific model is based on object-relevant features, after object-removing mutations, the new inference results should be affected since the object-relevant features disappear and cannot be leveraged by the model.
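Putting the two relations together, a minimal sketch of how MR-1 and MR-2 can be checked for one source image (the model, distance function, and thresholds are assumed interfaces):

```python
def check_mrs(model, distance, source_img, preserving_imgs, removing_imgs,
              eps_preserving, eps_removing):
    """Return whether MR-1 and MR-2 hold for a single source image."""
    base = model(source_img)
    # MR-1: object-preserving follow-ups should yield similar inferences.
    mr1_holds = all(distance(base, model(img)) <= eps_preserving
                    for img in preserving_imgs)
    # MR-2: object-removing follow-ups should yield dissimilar inferences.
    mr2_holds = all(distance(base, model(img)) > eps_removing
                    for img in removing_imgs)
    return mr1_holds, mr2_holds
```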

IV Overview

Fig. 2: Overview of Our Approach

In this section, we present the overview of our approach, which consists of the following three steps:

Object-Relevant Feature Identification: We treat the images in a given model’s validation/test set as source images. We apply image analysis techniques to each source image and identify its object-relevant features. Specifically, we leverage the ground truth in the validation/test set to divide an image semantically into two parts, an object region and a background region. We consider those segments (i.e., an area of pixels) belonging to the object region as relevant features and the others irrelevant.

Follow-up Tests Construction: Mutation functions are designed to generate follow-up inputs from the source inputs. Specifically, we design a set of object-preserving mutation functions for MR-1 and a set of object-removing mutation functions for MR-2. The details of these functions are explained in Section V.

Test Result Validation: We define distance functions for image classification tasks and object detection tasks, respectively. We validate whether the distance between the result of a source input and that of its follow-up input fulfills the metamorphic relations described in Section III. Finally, we define the object-relevancy score as a metric to measure to what extent an inference is based on object-relevant features.

V Approach

We present the details of our approach for two common image analysis tasks in deep learning: image classification and object detection.

V-A Image Classification

V-A1 Object-Relevant Feature Identification

Since the images used for image classification usually contain one object, we mark the pixels where the object resides as the object region and the remaining pixels as the background region. For an image whose ground truth indicates multiple objects, we examine whether any of the labels in the ground truth is ranked in the top-5 by the model. If so, we regard the object whose label has the highest rank as the object region. All other objects, together with the rest of the image, are regarded as the background region. If not, we regard the union of all objects as the object region and the rest as the background region. We examine the ‘top-5’ since existing evaluations mostly consider the top-5 results, as discussed in Section II-B.
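As an illustration of this selection rule, a small sketch (the record format for ground-truth objects and the top-5 label list are assumptions):

```python
def object_region(ground_truth_objects, top5_labels):
    """ground_truth_objects: list of (label, pixel_mask) pairs.
    top5_labels: the model's top-5 predicted labels, best first.
    Returns the list of pixel masks treated as the object region."""
    ranked = [(top5_labels.index(label), mask)
              for label, mask in ground_truth_objects if label in top5_labels]
    if ranked:
        # Use the object whose label has the highest rank among the top-5.
        return [min(ranked, key=lambda pair: pair[0])[1]]
    # Otherwise, use the union of all labelled objects.
    return [mask for _, mask in ground_truth_objects]
```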

V-A2 Follow-up Tests Construction

We generate follow-up test input images by semantically mutating a source test input image using the two aforementioned image mutation types: object-preserving mutation and object-removing mutation. For each mutation type, we design multiple mutation functions (e.g., MvObjToImg), as shown in Table I. Each of them can use different ingredients (e.g., background image 1). A mutation function together with an ingredient defines a mutation operation (e.g., MvObjToImg using background image 1). In total, 38 mutation operations (25 for object preserving and 13 for object removing) are designed.

Mutation Function Type | Mutation Function Name | Description | Number of Mutation Operations
Object Preserving | MvObjToImg | First, directly move the object to a new background image. Second, blur the object boundary with a median filter. | 12
Object Preserving | BldObjToImg | Use OpenCV::seamlessClone to blend the object with a new background image. | 12
Object Preserving | PsvObj | First, change the value of the pixels in the background region to gray. Second, blur the object boundary with a median filter. | 1
Object Removing | RmvObjByRGB | Remove the object by inpainting its pixels with a specific color. Then, blur the object boundary with a median filter. | 9
Object Removing | RmvObjByTool | Remove the object by inpainting its pixels with existing tools. We use two tools from OpenCV: INPAINT_NS and INPAINT_TELEA. | 2
Object Removing | RmvObjByMM | Remove the object by inpainting it with the mean/median value of all pixels in the margin between the mask and the bounding box. This operation is only applicable if both the mask and the bounding box exist. | 2

  • Median filter: https://en.wikipedia.org/wiki/Median_filter. New background images: the top 12 different images obtained by searching ”background” online. Specific colors: 9 common RGB colors. Existing tools: 2 inpainting tools from OpenCV. Mean/median: the mean and the median values.

TABLE I: Image Mutation Functions & Operations
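To illustrate how such mutation operations can be realized, the following sketch applies one object-preserving and one object-removing operation with OpenCV, given a binary (uint8) object mask; the function names and parameter values are our own illustrative choices, not the paper’s exact implementation:

```python
import cv2
import numpy as np

def preserve_object(src, new_background, mask):
    """Object-preserving mutation: blend the masked object onto a new
    background with seamlessClone, then apply a median filter."""
    ys, xs = np.where(mask > 0)
    center = (int(xs.mean()), int(ys.mean()))  # where to paste the object
    bg = cv2.resize(new_background, (src.shape[1], src.shape[0]))
    blended = cv2.seamlessClone(src, bg, mask, center, cv2.NORMAL_CLONE)
    return cv2.medianBlur(blended, 5)  # smooth boundary artifacts

def remove_object(src, mask):
    """Object-removing mutation: inpaint the pixels covered by the mask."""
    return cv2.inpaint(src, mask, 3, cv2.INPAINT_TELEA)
```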

V-A3 Metamorphic Relation Validation

Before formulating the object-relevancy score for an inference, let us introduce our distance function.

Distance Function: Given a source input image, an image classification model will generate a probability vector P = <p_1, p_2, ..., p_n>, where p_i denotes the probability that this image belongs to label l_i. Suppose the image belongs to the label l_g (i.e., the ground truth). Its probability p_g is the k-th largest element in the vector P, i.e., its rank is k. After feeding the follow-up image into the model, suppose the new result generated by the model is P' = <p'_1, p'_2, ..., p'_n>. Similarly, each element p'_i is associated with a specific label l_i. We assume that in P', the ground truth label l_g has probability p'_g and rank k'. We then compute the distance between P and P', according to the type of the mutation function, as follows:

Object Preserving: If the follow-up image is constructed by an object-preserving function, we follow the convention in Section III and denote it as x_p. We measure the difference using the changes of the prediction probability and the rank of the ground-truth label l_g. The first factor captures the change in the probability value while the second captures the change in the rank. If the inference is made by the model based on object-relevant features, there should be no change in the probability or the rank of l_g, and hence the distance should be 0.

Object Removing: We measure how much the prediction probability and the rank of label l_g are lowered. In the ideal case, since the object has been removed, the new probability p'_g should be reduced to 0 and l_g's rank should be lowered.
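Since the exact display equations are omitted above, the following sketch shows one plausible way to combine the probability change and the rank change into distances in [0, 1]; the equal weighting is an assumption for illustration:

```python
def preserving_distance(p_g, rank, p_g_new, rank_new, num_labels):
    """Distance for object-preserving mutations: 0 when the ground-truth
    label's probability and rank are unchanged."""
    prob_change = abs(p_g - p_g_new)
    rank_change = abs(rank - rank_new) / (num_labels - 1)
    return 0.5 * (prob_change + rank_change)

def removing_distance(p_g, rank, p_g_new, rank_new, num_labels):
    """Distance for object-removing mutations: large when the ground-truth
    label's probability drops towards 0 and its rank is lowered."""
    prob_drop = max(0.0, p_g - p_g_new)
    rank_drop = max(0, rank_new - rank) / (num_labels - 1)
    return 0.5 * (prob_drop + rank_drop)
```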

Object-relevancy Score: We devise a new metric, called the Object-relevancy Score, to measure to what extent an inference made by a model on an input is based on object-relevant features, by integrating the distances between the inference on the source input and the inference on each follow-up input.

We define the Preserving Object-relevancy Score using the weighted average of all distances between the inference on the source input and the inference on each of its follow-up inputs generated by object-preserving mutation operations.

Assume a follow-up image is constructed by the k-th mutation operation of the j-th preserving mutation function; its weight is defined in terms of n_j, the total number of mutation operations in the j-th mutation function, and m, the total number of mutation functions.

Similarly, we define Removing Object-relevancy Score as follows:

Again, if a follow-up image is constructed by the k-th mutation operation of the j-th removing mutation function, its weight is defined in the same way, with n_j the total number of mutation operations in the j-th mutation function and m the total number of mutation functions.

Finally, we define the Object-relevancy Score as follows:
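For concreteness, a sketch of one way the per-operation distances can be aggregated into the final score; the per-function weight normalization, the direction of each sub-score, and the averaging of the two sub-scores are illustrative assumptions rather than the paper’s exact formulas:

```python
def weights(groups):
    """Give each mutation function equal total weight, split evenly across
    its operations. groups: one list of per-operation values per function."""
    m = len(groups)
    return [[1.0 / (m * len(group)) for _ in group] for group in groups]

def weighted_average(groups):
    return sum(w * v
               for ws, vs in zip(weights(groups), groups)
               for w, v in zip(ws, vs))

def object_relevancy_score(preserving_dists, removing_dists):
    """Both arguments hold per-operation distances in [0, 1], grouped by
    mutation function. Small preserving distances and large removing
    distances indicate object-relevant behaviour."""
    score_preserving = 1.0 - weighted_average(preserving_dists)
    score_removing = weighted_average(removing_dists)
    return 0.5 * (score_preserving + score_removing)
```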

V-B Object Detection

The object detection task differs from the image classification problem in several aspects, and thus our approach needs to be adapted accordingly.

Model Output: In image classification, the model predicts a label for the whole image. In object detection, given an image, the model generates a result that contains multiple records. Each record is a tuple representing a detected object, which contains the detected object’s bounding box (and/or mask), label, and the corresponding confidence probability.

Therefore, in our approach, when comparing the output of a source input with that of a follow-up input, the smallest element for comparison is each record, instead of the whole result. We measure to what extent a record, instead of a result, is based on object-relevant features.

Further, we need to map the records in the new result to those in the original result, in order to select the corresponding record to compare with the record being measured. We introduce how we solve this mapping problem with a new concept, the Associated Object.

Dataset: Compared with image classification, the dataset for object detection has more objects per image. For example, in the COCO dataset (2017 val), each image contains 7.3 objects on average, as shown in Table II. In particular, it is common that multiple objects with the same label exist in a single image.

Such a difference brings a new challenge in feature identification. If we treat all objects as the object region and mutate them together, noise might be introduced when comparing the outputs of the source input and the follow-up input. For example, assume we want to measure to what extent a record ‘dog’ in an image is based on object-relevant features, and there are multiple dogs in this image. If we simply mutate all ‘dog’s together in this image, this record could be affected, and it would be hard to tell which dog’s features cause the effect. For example, it could be the dog overlapping with this record, or it could be another dog in the corner of the image, far from the region labeled by this record. We leverage the Associated Object to address this challenge, as detailed below.

Associated Object: Given a detection record, we first locate the object in the ground truth that best matches this record, and denote it as the Associated Object.

We describe the method to find the Associated Object for a given detection record r as follows. Suppose that the input image contains n objects in the ground truth. We mark the i-th object as o_i, with label l_i and bounding box b_i. Suppose the detected object in record r is o_r, its label is l_r, and its bounding box is b_r.

Then we compute the IOU score between the detected bounding box b_r and the bounding box b_i of each ground-truth object o_i that has the same label (l_i = l_r). We then select the ground-truth object that has the highest IOU with the record as the Associated Object, denoted as o_a, and the corresponding IOU is denoted as IOU_a.

In feature identification, we use the region belonging to the Associated Object as the object region. After feeding the mutated image into the model, we denote the record that has the highest IOU with the Associated Object as r', and we select r' to compare with the original record r.
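A minimal sketch of this matching step (reusing an IOU helper like the one sketched in Section II-B2; the record structure is an assumption):

```python
def find_associated_object(record, ground_truth):
    """record: dict with 'label' and 'box'. ground_truth: list of such dicts.
    Returns the same-label ground-truth object with the highest IOU."""
    best_obj, best_iou = None, 0.0
    for obj in ground_truth:
        if obj["label"] != record["label"]:
            continue
        score = iou(record["box"], obj["box"])
        if score > best_iou:
            best_obj, best_iou = obj, score
    return best_obj, best_iou
```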

V-B1 Object-Relevant Feature Identification

After locating the Associated Object, we treat the pixels it occupies as the object region and the remaining area as the background region.

V-B2 Follow-up Tests Construction

We apply the same mutation functions as listed in Table I.

V-B3 Metamorphic Relation Validation

First, we define the distance function that compares a record in the output of a source input with a record in the output of the follow-up input. Second, we define the object-relevancy score of a single detection record. Based on multiple records, we compute the model’s object-relevancy score.

Distance Function: For a record r, we denote its Associated Object in the ground truth as o_a, with label l_a and bounding box b_a, and the IOU between r and o_a as IOU_a, computed as described above.

We then compute the distance between r and r' based on the mutation function type as follows:

Object Preserving: If the mutation function is object-preserving, we denote the record after mutation as r_p and its IOU with the Associated Object as IOU_p. We measure the degree to which the record, including its detection IOU and label, remains the same after the mutation.

Object Removing: We measure the degree to which the record, including its detection IOU and label, changes after the mutation.

Object-relevancy Score: Similar to the image classification task, we devise a new metric, called the Object-relevancy Score, to measure to what extent an inference is based on object-relevant features. Note that here the inference refers to one detection record.

First, we define the Preserving Object-relevancy Score via a weighted average of all distances between r and the records r_p obtained from follow-up images constructed by object-preserving mutation operations. The weight function is the same as in Section V-A.

Similarly, we define Removing Object-relevancy Score as follows:

Finally, we define the Object-relevancy Score as follows:

VI Experiment Design

We conducted experiments with the aim to evaluate the effectiveness and usefulness of our proposed approach. Specifically, we propose the following two research questions:

First, we investigate the performance of the state-of-the-art deep learning models with respect to their object-relevancy scores via answering the following research question:

Research Question 1: Do the state-of-the-art models make inferences based on object-relevant features?

We answer this question via investigating the following two sub-questions:

1). Are correct inferences made by existing models problematic if they have low object-relevancy scores?

To answer this question, we first selected those correct inferences with high probabilities but with low object-relevancy scores. We then examined whether such inferences are problematic via manual checking with the help of LIME [36], a visualization tool that can explain an inference result made by deep learning models. If the answer to this research question is “yes”, it means that even though the inferences made by existing models are correct with respect to their “labels”, they might still be problematic. We are then curious about the distribution of such inferences with high probabilities but low object-relevancy scores. If they are frequently observed in a certain number of image instances, we should pay more attention to such images when evaluating new models. Therefore, we are motivated to propose the second sub-question:

2) How are those inferences with high probabilities but low object-relevancy scores distributed among different models and different image instances?

To answer this question, we evaluated 20 models for tasks of image classification and object detection under three large-scale datasets. For each model, we calculated the proportion of correct inferences that have high probabilities but low object-relevancy scores.

Second, we investigate the usefulness of our proposed approach. Specifically, we investigate whether the object-relevancy score can be leveraged to attack existing state-of-the-art models. This is motivated by a previous study [37], which reveals that existing object detection models can be easily attacked via image transplanting. In that study, moving the object in an image to another image with a different background could prevent the detector from successfully recognizing it, thereby attacking existing models. Our proposed object-relevancy score can guide the operation of image transplanting when generating attacking images, since it reveals whether an inference is made mainly based on objects or backgrounds. Therefore, we propose the following research question:

Research Question 2: Can the object-relevancy score be used to guide attacks on existing state-of-the-art models?

To answer this question, we designed a new strategy that effectively generates new images guided by the object-relevancy score. We then fed these images to the state-of-the-art models with the aim of attacking them.

VII Evaluation I: Effectiveness

To evaluate the effectiveness of our approach, we systematically evaluated the performance of existing state-of-the-art deep learning models in terms of their object-relevancy scores.

VII-A Experimental Setup

In this experiment, we evaluated 20 models for image classification and object detection on three different datasets quantitatively with respect to their object-relevancy. Specifically, we selected 10 models for image classification: ResNet [15] (including ResNet-50/101/152), MobileNets [18], VGG [38] (including VGG-16 and VGG-19), DenseNet [19] (including DenseNet-121/161), SqueezeNet [20] and ResNeXt [42]. We chose ImageNet to evaluate the performance of these image classification models. For object detection, the selected models are SSD [26], YOLOv3 [34] and Faster R-CNN [35]. For SSD and YOLOv3, variants using different feature extraction networks (e.g., MobileNets, VGG-16) were also considered in our evaluation. We evaluated these object detection models on the COCO and VOC datasets. The information of the selected datasets is listed in Table II. All pre-trained models are obtained from GluonCV [16, 46], an open-source model zoo providing implementations of common deep learning models. All of their implementations reproduce the results presented in the original publications.

Image Classification Object Detection
Name ImageNet COCO VOC
Version 2012 val 2017 val 2007 test
# images 50000 5000 4952
# categories 1000 80 20
# objects - 36781 14976
TABLE II: Datasets Information
Model Hashtag
Top-1
Accuracy
Top-5
Accuracy
DenseNet-121 f27dbf2d 0.750 0.923
DenseNet-161 b6c8a957 0.777 0.938
MobileNets efbb2ca3 0.733 0.913
ResNet-50 117a384e 0.792 0.946
ResNet-101 1b2b825f 0.805 0.951
ResNet-152 cddbc86f 0.806 0.953
ResNeXt 8654ca5d 0.807 0.952
SqueezeNet 264ba497 0.561 0.791
VGG-16 e660d456 0.732 0.913
VGG-19 ad2f660d 0.741 0.914
TABLE III: Image Classification Models and Their Accuracy
Dataset Model Hashtag mAP
COCO Faster RCNN 5b4690fb 0.370
SSD(ResNet-50) c4835162 0.306
YOLOv3(Darknet-53) 09767802 0.370
YOLOv3(MobileNets) 66dbbae6 0.280
VOC Faster RCNN 447328d8 0.783
SSD(MobileNets) 37c18076 0.754
SSD(ResNet-50) 9c8b225a 0.801
SSD(VGG-16) daf8181b 0.792
YOLOv3(Darknet-53) f5ece5ce 0.815
YOLOv3(MobileNets) 3b47835a 0.758
TABLE IV: Object Detection Models and Their mAP

VII-B Results and Findings

VII-B1 Are correct inferences made by existing models problematic if they have low object-relevancy scores?

To investigate whether correct inferences made by existing models are problematic when they have low object-relevancy scores, we selected images that are correctly classified by the image classification model ResNet-152 with high probabilities but low object-relevancy scores for investigation. Specifically, we divided the low-score range into five equal sub-ranges and selected 20 images whose object-relevancy scores fall in each sub-range. In total, 100 images were selected.

We then manually investigated to what extent these inference results are problematic. Specifically, we presented the original images and the corresponding explanations generated by LIME to 5 senior undergraduate students. They were then asked to what extent they think the inference results are problematic, on a scale of [0%, 100%], where 0% refers to ‘Not Problematic’ and 100% refers to ‘Very Problematic’. The experiment was conducted for each student individually, and they were not aware of the corresponding object-relevancy scores. For each question, they were also asked about their confidence in the estimation. We filtered out the answers with low confidence to control the data quality.

Finally, we collected 80 results, and 49 of them were labeled as problematic by at least one student, as shown in Fig. 3. Besides, the lower the object-relevancy score, the more students labeled the inferences as problematic. For instance, the proportion of inferences labeled as problematic by over 3 students is highest for the lowest object-relevancy score range and decreases for the higher score ranges.

We also selected three examples to demonstrate that correct inferences with low object-relevancy scores are likely to be problematic, as shown in Fig. 4. The left column in Fig. 4 shows the original test images, and the other images show correct inferences that predict the images as “wolf” with high probabilities made by different models. The object-relevancy scores are high (> 0.5) for the inferences displayed in the middle column while they are low (< 0.5) for the inferences displayed in the right column. As we can see from the interpretations made by LIME (i.e., the green areas are generated by LIME), the correct inferences in the right column are more likely to be problematic since they are mainly made based on object-irrelevant features (i.e., background areas). Such problematic inferences are also reflected by their low object-relevancy scores.

Fig. 3: Percentage of Problematic Images
Fig. 4: Left Column: Test Images. Middle Column: Explanations of Inferences Based on Object-Relevant Features (Object-Relevancy Scores: 0.553, 0.671, 0.703). Right Column: Explanations of Inferences Based on Object-Irrelevant Features (Object-Relevancy Scores: 0.399, 0.451, 0.472).

VII-B2 How are those inferences with high probabilities but low object-relevancy scores distributed among different models and different image instances?

To investigate the distribution of those inferences that have high probabilities but low object-relevancy scores among different models and different image instances, we investigated all the correct (top-5) inference results with high probabilities but low object-relevancy scores for each image classification model. The statistical information of the selected images is displayed in Table V. For each image classification model, around 6% of the correct inference results have low object-relevancy scores.

Model Number Percentage Total
DenseNet-121 3066 6.6% 46106
DenseNet-161 2995 6.4% 46911
MobileNets 2726 6.0% 45647
ResNet-50 2986 6.3% 47302
ResNet-101 3171 6.7% 47551
ResNet-152 3046 6.4% 47664
ResNeXt 3064 6.4% 47534
SqueezeNet 2130 5.3% 40017
VGG-16 2872 6.3% 45659
VGG-19 2895 6.3% 45877
  • Total: The Number of Correctly Classified Images by This Model

TABLE V: The Number and Percentage of Correctly Classified Images with High Classification Probability and Low Inference Object-Relevancy Score in All Correctly Classified Images

We further collected the union of all images with a high classification probability but a low inference object-relevancy score from the 10 models. In total, we obtained 6317 images, and the histogram of these images according to the number of occurrences in the 10 models is shown in Fig. 5. It shows that 771 images can be correctly classified with high probabilities, but the object-relevancy scores evaluated by all the 10 models are low. From the perspective of the object-relevancy score, these images should receive more attention since none of the models we evaluated makes inferences on them based on object-relevant features. We then investigated the distribution of these images, which belong to 231 distinct labels. Fig. 6 shows the top 17 labels with the highest frequencies in terms of the number of images. From the perspective of the object-relevancy score, these labels should be paid more attention in future model evaluations.

Fig. 5: Distribution of Correctly Classified Images with High Probability but Low Object-Relevancy Score
Fig. 6: Distribution of Correctly Classified Images with High Probability but Low Object-Relevancy Score by All the 10 Models w.r.t. Labels (only labels with frequency larger than 10 are shown).

A similar investigation was conducted for the task of object detection. We selected all the detection records with high IOU but low object-relevancy scores. The statistical information of the selected images is displayed in Table VI. For most object detection models, around 10% to 20% of the correct results have high IOUs but low object-relevancy scores.

Dataset Model Number Percentage Total
COCO Faster RCNN 2648 27.0% 9819
SSD(ResNet-50) 3296 21.2% 15557
YOLOv3(Darknet-53) 2248 15.1% 14919
YOLOv3(MobileNets) 4450 29.1% 15281
VOC Faster RCNN 260 28.3% 918
SSD(MobileNets) 2156 21.5% 10046
SSD(VGG-16) 1162 14.4% 8063
SSD(ResNet-50) 1645 16.7% 9821
YOLOv3(MobileNets) 1865 17.4% 10708
YOLOv3(Darknet-53) 906 8.5% 10610
  • Total: The Number of Objects Detected with High IOU by This Model

TABLE VI: The Number and Percentage of Detected Objects with High IOU and Low Object-Relevancy Score in All Detected Objects with High IOU

VIII Evaluation II: Usefulness

To demonstrate the usefulness of our proposed approach, we designed an approach that leverages the object-relevancy score to facilitate an existing model attacking approach [37]. A previous study [37] showed that state-of-the-art object detection models failed to detect the objects in object-transplanted images, which are generated by replacing an image’s sub-region with another sub-region that contains an object from another image. The transplanted objects come from the original dataset and can be correctly detected in their original images. We extended this attacking method to image classification models. Specifically, given an image classification model and a label l it is trained to recognize, there are two attacking scenarios with respect to the two defined metamorphic relations.

Scenario 1: Synthesize an image containing an object with label l that forces the model to incorrectly classify it as other labels.

Scenario 2: Synthesize an image that does not contain any object with label l but forces the model to incorrectly classify it as l.

In both scenarios, images are synthesized through transplanting as defined in [37]. More specifically, to realize the first attacking scenario, we first select an object-source image from all images with label l that are correctly classified by the target model, and a background-source image with another label l′ (l′ ≠ l). We then replace the object in the background-source image with the object extracted from the object-source image, with appropriate adjustment in scale. We finally feed the synthesized image to the target image classifier. If the top-1 prediction result is not equal to label l, we regard it as a successful attack.

Fig. 7 shows an example of such an attacking scenario. In this example, we extracted the object in Fig. 7(a) with label ‘eggnog’ and transplanted it to Fig. 7(b). The ‘cup’ in Fig. 7(b) was replaced by ‘eggnog’. The synthesized image is shown in Fig. 7(c), and it is classified as ‘can opener’ by the model. This is a successful attack, since Fig. 7(c) contains the object ‘eggnog’ but is incorrectly classified as ‘can opener’.

Fig. 7: (a): Image with Label ‘eggnog’. (b): Image with Label ‘cup’. (c): Successful Attack, which is Predicted as Label ‘can opener’ with a High Probability of 0.9987.

The selection of the object-source and background-source images can be performed either randomly or guided by the object-relevancy score. Intuitively, we should select an object-source image with a lower preserving object-relevancy score, since a lower preserving score indicates that changing the background of this image would significantly affect the classification result. In other words, the object in this image is largely ignored by the target model, so even if a new image contains this object, the target model is less likely to recognize it and would not classify the new image as l. Similarly, the background-source image should be selected with a lower removing object-relevancy score, since a lower removing score indicates that removing the object from this image would not significantly affect the classification result. In other words, the background of this image significantly affects the target model’s inference. Therefore, such a background could lead the model to label the synthesized image as l′, regardless of the real object in the image.

To guide the synthesis of an attacking image, we sort the images with label l according to their preserving object-relevancy scores and select the object-source image starting from the one with the lowest score. Similarly, we sort the images with label l′ according to their removing object-relevancy scores and select the background-source image starting from the one with the lowest score.
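A sketch of this guided pair selection (the score functions and image records are assumed interfaces):

```python
def guided_pairs(images_with_label_l, images_with_label_l2,
                 preserving_score, removing_score):
    """Yield (object-source, background-source) pairs for attack scenario 1,
    ordered so that the most promising combinations come first."""
    object_sources = sorted(images_with_label_l, key=preserving_score)
    background_sources = sorted(images_with_label_l2, key=removing_score)
    for obj_src in object_sources:
        for bg_src in background_sources:
            # Transplant obj_src's object into bg_src, then query the model.
            yield obj_src, bg_src
```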

We compared the results using random selection and the guided selection. We used the ImageNet dataset and the model ResNet-152 (Model Hashtag: cddbc86f) from the GluonCV model zoo [16, 46], which achieves the highest top-5 accuracy on the ImageNet dataset among the whole model zoo (ver. 0.3.0) at the time of this experiment. We randomly generated 50 pairs of labels (l, l′). For each pair, we selected the two images (either randomly or guided by the object-relevancy score) 100 times and synthesized 100 images for attacking. We recorded the number of successful attacks. The results are displayed in Fig. 8. In total, our guided selection generated 3310 successful attacks while random selection generated only 1800 successful attacks. For 45 out of the 50 pairs, our strategy is more effective in terms of the number of successful attacks, with improvements ranging from 1.02x to 16.17x. In 40 out of the 45 pairs, our strategy is also more efficient since it could synthesize the first successful attacking image more quickly.

Fig. 8: Number of Successful Attacks using Guided Selection and Random Selection for the Attack Scenario 1
Fig. 9: Number of Successful Attacks using Guided Selection and Random Selection for the Attack Scenario 2

We also designed a similar approach for the second attack scenario. First, we selected an image from all images with label l, and then selected another image with label l′ (l′ ≠ l). After that, we substituted the object in the first image with the object from the second image, after appropriate adjustments. We then fed the synthesized image to the image classifier and obtained the inference result. If the top-1 prediction is equal to label l, we regard it as a successful attack. Similar to the design of the first attack, in the guided selection, for the first image we prefer those with lower removing object-relevancy scores; for the second image we prefer those with lower preserving object-relevancy scores.

We conducted experiments following the methodology of the first attack. The results are displayed in Fig. 9. It shows that our guided selection generated 1288 successful attacks while the random selection generated only 629 successful attacks. In particular, our guided selection outperformed the random one for 33 out of the 50 pairs. The two selection strategies achieved the same performance for 3 pairs. For the remaining 14 pairs, random selection generated 113 successful attacks while our guided selection generated 40. Although our strategy does not outperform random selection in these cases, the guided selection is much more effective in general.

IX Related Work

IX-A Metamorphic Testing in Deep Learning Systems

Several studies have applied metamorphic testing to validate machine learning systems [10, 44, 9], including deep learning ones [10, 45]. Dwarakanath et al. [10] leveraged two sets of metamorphic relations to identify faults in machine learning implementations. For example, one metamorphic relation for deep learning systems is that the “permutation of input channels (i.e., RGB channels) for the training and test data” should not affect inference results. To validate whether a specific implementation of a DNN satisfies this relation, they re-order the RGB channels of images in both the training set and the test set. They then examine the impact on the accuracy or precision of the DNN model after it is trained using the permuted dataset. Their relations treat the pixels in an image as independent units and do not consider the objects and the background in the image.

Xie et al. [44] performed metamorphic testing on two machine learning algorithms: k-Nearest Neighbors and Naïve Bayes Classifier. Their work targets testing attribute-based machine learning models instead of deep learning systems. Ding et al. [9] proposed metamorphic relations for deep learning at three different levels of validation: system level, dataset level and data item level. For example, a metamorphic relation at the system level asserts that a DNN should perform better than an SVM classifier for image classification. Since both studies require retraining of the machine learning systems under test, they are inapplicable to pre-trained models.

Other studies [45, 40, 47] leveraged metamorphic testing in validating autonomous driving systems. DeepTest [40] designed a systematic testing approach to detecting the inconsistent behaviors of autonomous driving systems using metamorphic relations. Its relations focus on general image transformations, including scaling, shearing, rotation and so on. Further, DeepRoad [45] leverages GANs (Generative Adversarial Networks) to improve the quality of transformed images. Given an autonomous driving system, DeepRoad mutates the original images to simulate weather conditions, such as adding fog to an image. An inconsistency is identified if a deep learning system makes inconsistent decisions on an image and its mutated version (e.g., the difference of the steering degrees exceeds a certain threshold). To the best of our knowledge, we are the first to design metamorphic relations to assess whether an inference is based on object-relevant features or not.

IX-B Testing Deep Learning Systems

Besides metamorphic testing, studies have also been made to adapt other classical testing techniques to deep learning systems. DeepXplore [32] proposed neuron coverage to quantify the adequacy of a testing dataset. DeepGauge [27] proposed a collection of testing criteria. TensorFuzz [31] and DeepHunter [43] leveraged fuzz testing to facilitate the debugging process in DNNs. DeepMutation [28] applied mutation testing to measure the quality of test data in deep learning.

Our study falls into the research direction of testing deep learning systems. The major contribution of our study is to test deep learning systems from a new perspective, i.e., the object relevancy of inferences. This new perspective has not attracted enough attention from the community.

X Threats to Validity

The validity of our study is subject to the following two threats. First, we collected and tested 20 models for image classification and object detection. These models may not include all models used by deep learning applications. To mitigate this threat, all collected models are representative and are built on popular model architectures in image analysis. We also ensured that every model in our evaluation achieved an accuracy no worse than the one reported in its original research publication. Second, our manual check is subject to human mistakes. To address this threat, all results were cross-validated by 5 senior students independently, and they were not aware of the object-relevancy scores of the dataset.

XI Conclusion

In this work, we proposed to leverage metamorphic testing to test whether the inferences made by pre-trained deep learning models are based on object-relevant features. We proposed two novel metamorphic relations from the perspective of object relevancy. We devised a metric, i.e., the object-relevancy score, to measure to what extent an inference is based on object-relevant features. We applied our approach to 20 popular deep learning models with 3 large-scale datasets. We found that inferences based on object-irrelevant features commonly exist in the output of these models. We further leveraged the object-relevancy score to facilitate an existing attacking method.

References

  • [1] (2017) 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. IEEE Computer Society. External Links: Link, ISBN 978-1-5386-0457-1 Cited by: 21, 19.
  • [2] (2017) 2017 IEEE symposium on security and privacy, SP 2017, san jose, ca, usa, may 22-26, 2017. IEEE Computer Society. External Links: Link, ISBN 978-1-5090-5533-3 Cited by: 21.
  • [3] (2018) 2018 IEEE conference on computer vision and pattern recognition, CVPR 2018, salt lake city, ut, usa, june 18-22, 2018. IEEE Computer Society. External Links: Link Cited by: 21, 7.
  • [4] (2019) 26th annual network and distributed system security symposium, NDSS 2019, san diego, california, usa, february 24-27, 2019. The Internet Society. External Links: Link, ISBN 1-891562-55-X Cited by: 21.
  • [5] T. Y. Chen and S. M. Yiu (1998-01) Metamorphic testing: a new approach for generating next test cases. Technical report Department of Computer Science, Hong Kong University of Science and Technology, Hong Kong. Cited by: §I, §II-A.
  • [6] T. Y. Chen, F. Kuo, H. Liu, P. Poon, D. Towey, T. H. Tse, and Z. Q. Zhou (2018-01) Metamorphic testing: a review of challenges and opportunities. 51 (1), pp. 4:1–4:27. External Links: ISSN 0360-0300, Link, Document Cited by: §II-A, §III.
  • [7] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun (2018) Cascaded pyramid network for multi-person pose estimation. See 3, pp. 7103–7112. External Links: Link, Document Cited by: §I, §II-B.
  • [8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, Cited by: §I, §II-B1.
  • [9] J. Ding, X. Kang, and X. Hu (2017-05) Validating a deep learning framework by metamorphic testing. In 2017 IEEE/ACM 2nd International Workshop on Metamorphic Testing (MET), Vol. , pp. 28–34. External Links: Document, ISSN Cited by: §I, §I, §IX-A, §IX-A.
  • [10] A. Dwarakanath, M. Ahuja, S. Sikand, R. M. Rao, R. P. J. C. Bose, N. Dubash, and S. Podder (2018) Identifying implementation bugs in machine learning based image classifiers using metamorphic testing. In Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2018, New York, NY, USA, pp. 118–128. External Links: ISBN 978-1-4503-5699-2, Link, Document Cited by: §I, §I, §IX-A.
  • [11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman (2010-06) The pascal visual object classes (voc) challenge. 88 (2), pp. 303–338. Cited by: §I, §II-B2.
  • [12] V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss (Eds.) (2018) Computer vision - ECCV 2018 - 15th european conference, munich, germany, september 8-14, 2018, proceedings, part VI. Lecture Notes in Computer Science, Vol. 11210, Springer. External Links: Link, Document, ISBN 978-3-030-01230-4 Cited by: 21, 41.
  • [13] S. Ghosh, R. Natella, B. Cukic, R. Poston, and N. Laranjeiro (Eds.) (2018) 29th IEEE international symposium on software reliability engineering, ISSRE 2018, memphis, tn, usa, october 15-18, 2018. IEEE Computer Society. External Links: Link, ISBN 978-1-5386-8321-7 Cited by: 21, 28, 29.
  • [14] T. Gu, K. Liu, B. Dolan-Gavitt, and S. Garg (2019) BadNets: evaluating backdooring attacks on deep neural networks. 7 (), pp. 47230–47244. External Links: Document, ISSN 2169-3536 Cited by: §I.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun (2016-06) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 770–778. External Links: Document, ISSN 1063-6919 Cited by: §I, §II-B1, §II-B, §VII-A.
  • [16] T. He, Z. Zhang, H. Zhang, Z. Zhang, J. Xie, and M. Li (2018) Bag of tricks for image classification with convolutional neural networks. External Links: Link Cited by: §VII-A, §VIII.
  • [17] J. H. Hosang, R. Benenson, P. Dollár, and B. Schiele (2015) What makes for effective detection proposals?. CoRR abs/1502.05082. External Links: Link, 1502.05082 Cited by: §II-B2.
  • [18] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. CoRR abs/1704.04861. External Links: Link, 1704.04861 Cited by: §I, §II-B1, §II-B, §VII-A.
  • [19] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. See 1, pp. 2261–2269. External Links: Link, Document Cited by: §II-B1, §VII-A.
  • [20] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer (2016) SqueezeNet: alexnet-level accuracy with 50x fewer parameters and <1mb model size. abs/1602.07360. External Links: Link, 1602.07360 Cited by: §VII-A.
  • [21] A. Krizhevsky, V. Nair, and G. Hinton. The CIFAR-10 dataset. External Links: Link Cited by: §II-B1.
  • [22] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), pp. 1097–1105. External Links: Link Cited by: §II-B1.
  • [23] H. Law, Y. Teng, O. Russakovsky, and J. Deng (2019) CornerNet-lite: efficient keypoint based object detection. abs/1904.08900. External Links: Link, 1904.08900 Cited by: §I, §II-B.
  • [24] Y. LeCun and C. Cortes (2010) MNIST handwritten digit database. Note: http://yann.lecun.com/exdb/mnist/ External Links: Link Cited by: §II-B1.
  • [25] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. abs/1405.0312. External Links: Link, 1405.0312 Cited by: §I, §II-B2.
  • [26] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) SSD: single shot multibox detector. In Computer Vision – ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling (Eds.), Cham, pp. 21–37. External Links: ISBN 978-3-319-46448-0 Cited by: §I, §II-B2, §II-B, §VII-A.
  • [27] L. Ma, F. Juefei-Xu, F. Zhang, J. Sun, M. Xue, B. Li, C. Chen, T. Su, L. Li, Y. Liu, J. Zhao, and Y. Wang (2018) DeepGauge: multi-granularity testing criteria for deep learning systems. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE 2018, New York, NY, USA, pp. 120–131. External Links: ISBN 978-1-4503-5937-5, Link, Document Cited by: §IX-B.
  • [28] L. Ma, F. Zhang, J. Sun, M. Xue, B. Li, F. Juefei-Xu, C. Xie, L. Li, Y. Liu, J. Zhao, and Y. Wang (2018) DeepMutation: mutation testing of deep learning systems. See 29th IEEE international symposium on software reliability engineering, ISSRE 2018, memphis, tn, usa, october 15-18, 2018, Ghosh et al., pp. 100–111. External Links: Link, Document Cited by: §IX-B.
  • [29] L. Ma, F. Zhang, J. Sun, M. Xue, B. Li, F. Juefei-Xu, C. Xie, L. Li, Y. Liu, J. Zhao, and Y. Wang (2018) DeepMutation: mutation testing of deep learning systems. See 29th IEEE international symposium on software reliability engineering, ISSRE 2018, memphis, tn, usa, october 15-18, 2018, Ghosh et al., pp. 100–111. External Links: Link, Document Cited by: §I.
  • [30] S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2015) DeepFool: a simple and accurate method to fool deep neural networks. CoRR abs/1511.04599. External Links: Link, 1511.04599 Cited by: §I.
  • [31] A. Odena and I. Goodfellow (2018) Tensorfuzz: debugging neural networks with coverage-guided fuzzing. Cited by: §IX-B.
  • [32] K. Pei, Y. Cao, J. Yang, and S. Jana (2017) DeepXplore: automated whitebox testing of deep learning systems. In Proceedings of the 26th Symposium on Operating Systems Principles, SOSP ’17, New York, NY, USA, pp. 1–18. External Links: ISBN 978-1-4503-5085-3, Link, Document Cited by: §I, §IX-B.
  • [33] C. Peng, T. Xiao, Z. Li, Y. Jiang, X. Zhang, K. Jia, G. Yu, and J. Sun (2018) Megdet: a large mini-batch object detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6181–6189. Cited by: §I, §II-B.
  • [34] J. Redmon and A. Farhadi (2018) YOLOv3: an incremental improvement. CoRR abs/1804.02767. External Links: Link, 1804.02767 Cited by: §I, §II-B2, §II-B, §VII-A.
  • [35] S. Ren, K. He, R. Girshick, and J. Sun (2017-06) Faster r-cnn: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (6), pp. 1137–1149. External Links: Document, ISSN 0162-8828 Cited by: §I, §II-B2, §II-B, §VII-A.
  • [36] M. T. Ribeiro, S. Singh, and C. Guestrin (2016) ”Why should I trust you?”: explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pp. 1135–1144. Cited by: §I, §I, §I, §VI.
  • [37] A. Rosenfeld, R. S. Zemel, and J. K. Tsotsos (2018) The elephant in the room. abs/1808.03305. External Links: Link, 1808.03305 Cited by: §VI, §VIII, §VIII.
  • [38] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556. External Links: Link, 1409.1556 Cited by: §I, §II-B1, §II-B, §VII-A.
  • [39] P. Stock and M. Cissé (2017) ConvNets and imagenet beyond accuracy: explanations, bias detection, adversarial examples and model criticism. CoRR abs/1711.11443. External Links: Link, 1711.11443 Cited by: §I.
  • [40] Y. Tian, K. Pei, S. Jana, and B. Ray (2018) DeepTest: automated testing of deep-neural-network-driven autonomous cars. In Proceedings of the 40th International Conference on Software Engineering, ICSE ’18, New York, NY, USA, pp. 303–314. External Links: ISBN 978-1-4503-5638-1, Link, Document Cited by: §I, §IX-A.
  • [41] B. Xiao, H. Wu, and Y. Wei (2018) Simple baselines for human pose estimation and tracking. See Computer vision - ECCV 2018 - 15th european conference, munich, germany, september 8-14, 2018, proceedings, part VI, Ferrari et al., pp. 472–487. External Links: Link, Document Cited by: §I, §II-B.
  • [42] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017-07) Aggregated residual transformations for deep neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 5987–5995. External Links: Document, ISSN 1063-6919 Cited by: §VII-A.
  • [43] X. Xie, L. Ma, F. Juefei-Xu, H. Chen, M. Xue, B. Li, Y. Liu, J. Zhao, J. Yin, and S. See (2018) Coverage-guided fuzzing for deep neural networks. abs/1809.01266. External Links: Link, 1809.01266 Cited by: §IX-B.
  • [44] X. Xie, J. W.K. Ho, C. Murphy, G. Kaiser, B. Xu, and T. Y. Chen (2011) Testing and validating machine learning classifiers by metamorphic testing. 84 (4), pp. 544 – 558. Note: The Ninth International Conference on Quality Software External Links: ISSN 0164-1212, Document, Link Cited by: §I, §I, §IX-A, §IX-A.
  • [45] M. Zhang, Y. Zhang, L. Zhang, C. Liu, and S. Khurshid (2018) DeepRoad: gan-based metamorphic testing and input validation framework for autonomous driving systems. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE 2018, New York, NY, USA, pp. 132–142. External Links: ISBN 978-1-4503-5937-5, Link, Document Cited by: §I, §I, §IX-A, §IX-A.
  • [46] Z. Zhang, T. He, H. Zhang, Z. Zhang, J. Xie, and M. Li (2019) Bag of freebies for training object detection neural networks. External Links: Link Cited by: §VII-A, §VIII.
  • [47] Z. Q. Zhou and L. Sun (2019-02) Metamorphic testing of driverless cars. 62 (3), pp. 61–67. External Links: ISSN 0001-0782, Link, Document Cited by: §IX-A.