Fast Test Input Generation for Finding Deviated Behaviors in Compressed Deep Neural Network

by Yongqiang Tian, et al.

Model compression can significantly reduce the sizes of deep neural network (DNN) models so that large, sophisticated models can be deployed on resource-limited mobile and IoT devices after compression. However, model compression often introduces deviated behaviors into a compressed model: the original and compressed models output different prediction results for the same input. Hence, it is critical to warn developers and help them comprehensively evaluate the possible consequences of such behaviors before deployment. To this end, we propose TriggerFinder, a novel, effective and efficient testing approach that automatically identifies inputs that trigger deviated behaviors in compressed models. Given an input i as a seed, TriggerFinder iteratively applies a series of mutation operations to change i until the resulting input triggers a deviated behavior. However, compressed models usually hide their architecture and gradient information; without such internal information as guidance, it becomes difficult to effectively and efficiently trigger deviated behaviors. To tackle this challenge, we propose a novel fitness function to determine which mutated input is closer to the inputs that can trigger deviated predictions. Furthermore, TriggerFinder models this search problem as a Markov Chain process and leverages the Metropolis-Hastings algorithm to guide the selection of mutation operators. We evaluated TriggerFinder on 18 compressed models with two datasets. The experiment results demonstrate that TriggerFinder can successfully find triggering inputs for all seed inputs while the baseline fails in certain cases. As for efficiency, TriggerFinder is 5.2x-115.8x as fast as the baseline. Furthermore, TriggerFinder needs 51.8x-535.6x fewer queries than the baseline to find one triggering input.






Compressed models are increasingly deployed for deep learning tasks on mobile and embedded devices. Compared to original models, compressed ones achieve similar accuracy on the original test data but require significantly less time and fewer computational resources, e.g., disk, memory and energy, for inference Choudhary et al. (2020); Wang et al. (2019). However, model compression is a lossy process. A compressed model can make predictions that deviate from those of its original model for the same input Xie et al. (2019b, a). For example, given the two images in Figure 1, the LeNet-4 Lecun et al. (1998) model correctly predicts both as 4 while its compressed model predicts the left one as 9 and the right one as 6. We say a deviated behavior occurs if a compressed model makes a prediction different from its original model. An input that triggers such a deviated behavior is referred to as a triggering input; otherwise, it is a non-triggering input. Our objective is to effectively find triggering inputs for a given pair of a compressed model and its original model, so that users and developers of the compressed model can understand the consequences before deploying it Akhtar and Mian (2018); Liu et al. (2018).

Figure 1: Images Triggering Deviated Behaviors between LeNet-4 and its Quantized Model.

However, it is non-trivial to identify triggering inputs. Specifically, to accelerate inference and reduce storage consumption, compressed models usually hide their architectures and intermediate results. Without such information as guidance, it is difficult for input generation approaches to effectively and efficiently find triggering inputs. For example, the state-of-the-art approach, DiffChaser Xie et al. (2019b), is not necessarily able to find triggering inputs for compressed models. The reason is that the fitness function of DiffChaser is not able to capture the difference between the two models’ predictions. Further, as a genetic algorithm, DiffChaser needs to cross over a considerably large portion of inputs and feed them into DNN models in each iteration. As a result, DiffChaser is computationally expensive, requiring thousands of queries to the two models to find a triggering input.

In this paper, we propose TriggerFinder, an effective and efficient technique to automatically trigger deviated behaviors for compressed DNN models. Given a non-triggering input as a seed, TriggerFinder mutates it successively until a triggering input is found. The mutation is guided by a specially designed fitness function, which measures (1) the difference between the prediction outputs of the original and compressed models, and (2) whether the input triggers previously unobserved states of the two models. The fitness function of TriggerFinder does not require the model’s intermediate results, and thus TriggerFinder can be applied to any compressed model. Unlike DiffChaser, TriggerFinder selects only one mutation operator and generates only one mutated input at each iteration, resulting in far fewer queries than DiffChaser. To achieve this, TriggerFinder models the selection of mutation operators as a Markov Chain process and adopts the Metropolis-Hastings (MH) algorithm Kass et al. (1998) to guide the selection. Specifically, TriggerFinder prefers a mutation operator that is likely to increase the fitness function value of subsequent mutated inputs.

To evaluate TriggerFinder, we conduct experiments using 18 pairs of models (i.e., the original model and its compressed one) on 2 datasets, MNIST LeCun and Cortes (2010) and CIFAR-10 Krizhevsky, Nair, and Hinton (2009). The compressed models are prepared by diverse representative techniques: weight pruning Li et al. (2017); Han et al. (2015), quantization Zhou et al. (2017); Rastegari et al. (2016) and knowledge distillation Polino, Pascanu, and Alistarh (2018); Mishra and Marr (2018). The model architectures include both small- and large-scale ones, from LeNet to VGG-16. We use DiffChaser, the state-of-the-art blackbox approach, as the baseline for comparison.

We evaluate the effectiveness and efficiency of TriggerFinder. For effectiveness, we feed a fixed number of seed inputs to TriggerFinder and measure the ratio of seed inputs given which TriggerFinder can successfully generate triggering inputs. For efficiency, we measure the time and queries that TriggerFinder needs to find one triggering input given a seed input. We repeat each experiment five times using five sets of seed inputs.

TriggerFinder achieves a 100% success rate while the baseline DiffChaser cannot generate triggering inputs for some seed inputs. On average, TriggerFinder can generate a triggering input within 0.28s while the baseline needs more than 4.96s. Further, the number of queries needed by TriggerFinder is much smaller: TriggerFinder finds one triggering input with an average of 24.97 queries while DiffChaser needs 4,059.53.

In summary, this paper makes the following contributions.

  1. We propose a novel method by leveraging a novel fitness function and the MH algorithm to find the triggering input for original DNN models and the compressed ones.

  2. We implement TriggerFinder as a tool and collect a benchmark to facilitate related future research.

  3. We conduct experiments on TriggerFinder and the state-of-the-art technique. Evaluation results show that TriggerFinder significantly outperforms the state-of-the-art technique in terms of both effectiveness and efficiency.



Let $f$ denote a DNN model designed for single-label classification, and $g$ denote a corresponding compressed model. Given an arbitrary input $x$, model $f$ outputs a probability vector $f(x) = [p_1, p_2, \dots, p_n]$, where $n$ is the total number of all possible classification labels. We refer to the highest probability in $f(x)$ as the top-1 probability and denote it as $p_{f(x)}$. We refer to the label whose probability is $p_{f(x)}$ in $f(x)$ as the top-1 label and denote it as $l_{f(x)}$. Similarly, the probability vector of the compressed model, its top-1 probability and its top-1 label are denoted as $g(x)$, $p_{g(x)}$ and $l_{g(x)}$, respectively.
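The definitions of top-1 probability, top-1 label and deviated behavior can be illustrated with a minimal Python sketch. The function names `top1` and `is_triggering` and the example probability vectors are hypothetical and chosen only for illustration; they are not part of the TriggerFinder tool.

```python
from typing import Sequence, Tuple


def top1(probs: Sequence[float]) -> Tuple[int, float]:
    """Return (top-1 label, top-1 probability) of a probability vector."""
    label = max(range(len(probs)), key=lambda i: probs[i])
    return label, probs[label]


def is_triggering(original_probs: Sequence[float],
                  compressed_probs: Sequence[float]) -> bool:
    """An input is triggering if the two top-1 labels differ."""
    return top1(original_probs)[0] != top1(compressed_probs)[0]


# Hypothetical outputs of an original and a compressed model on one input.
original_out = [0.05, 0.80, 0.15]
compressed_out = [0.10, 0.40, 0.50]

print(top1(original_out))                           # label 1 with probability 0.8
print(is_triggering(original_out, compressed_out))  # labels 1 and 2 differ
```

Only the two probability vectors are needed, which matches the blackbox setting in which intermediate results of the compressed model are unavailable.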

We assume that the compressed model $g$ is a blackbox and that only the information $g(x)$, $p_{g(x)}$ and $l_{g(x)}$ is available Guo et al. (2019); Cheng et al. (2019); Bhagoji et al. (2018); Shi, Wang, and Han (2019). The reason is that in practice, the intermediate results of compressed models, such as activation values and gradients, are not available due to the lack of such API support in deep learning frameworks. Modern deep learning frameworks, such as TensorFlow Lite and ONNX Inference, usually provide APIs only for end-to-end inference of the compressed model, not for the acquisition of intermediate results. Moreover, if the compressed model under test requires special devices such as mobile phones, accessing the intermediate results requires support from system vendors, which is not always feasible. Therefore, it is impractical to adopt test generation approaches designed for DNN models that rely on such intermediate results, such as DeepHunter Xie et al. (2019a), DeepGauge Ma et al. (2018), DeepXplore Pei et al. (2017) and others Kim, Feldt, and Yoo (2019); Tian et al. (2018). Further, the blackbox assumption of compressed models increases the generalizability of TriggerFinder.

State of the Art

DiffChaser Xie et al. (2019b) is a blackbox genetic-algorithm-based approach to finding triggering inputs for a compressed model. In the beginning, it creates a pool of inputs by mutating a given non-triggering input. In each iteration, DiffChaser crosses over two branches of the selected inputs and then selectively feeds them back to the pool until a triggering input is found. To determine whether each mutated input is fed back to the pool or discarded, DiffChaser proposes the k-Uncertainty fitness function. k-Uncertainty measures the difference between the highest probability and the k-th highest probability of either $f(x)$ or $g(x)$. Note that k-Uncertainty does not capture the difference between the two models, which makes it ineffective in certain cases, as shown later in the Evaluation section.
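To make the limitation concrete, the following sketch computes a k-Uncertainty-style score from the description above: the gap between the highest and the k-th highest probability of a single model's output. This is an assumption-laden illustration, not DiffChaser's actual implementation, whose exact definition (sign, normalization) may differ. The function name `k_uncertainty` and the example vectors are hypothetical.

```python
def k_uncertainty(probs, k=2):
    """Gap between the highest and the k-th highest probability.

    The score is computed from ONE model's probability vector only.
    """
    ranked = sorted(probs, reverse=True)
    return ranked[0] - ranked[k - 1]


# Two hypothetical models that disagree on the top-1 label for the same input.
original_out = [0.5, 0.25, 0.25]    # top-1 label 0
compressed_out = [0.25, 0.5, 0.25]  # top-1 label 1

# Both outputs receive the same score although the predictions differ,
# so a per-model score cannot tell how far apart the two models are.
print(k_uncertainty(original_out), k_uncertainty(compressed_out))
```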

Difference from Adversarial Samples

Please note that adversarial samples are different from triggering inputs. The adversarial attack approach targets a single model using a malicious input, which is crafted by applying human-imperceptible perturbation on a benign input Carlini and Wagner (2017); Goodfellow, Shlens, and Szegedy (2015); Odena et al. (2019); Pei et al. (2017); Zhang, Chowdhury, and Christakis (2020). In contrast, a triggering input is the one that can cause an inconsistent prediction between an original model and its compressed model. Note that adversarial samples of the original model are often not triggering inputs for compressed models. In our preliminary exploration, we have leveraged FGSM Goodfellow, Shlens, and Szegedy (2015) and CW Carlini and Wagner (2017) to generate adversarial samples for three compressed models using MNIST. On average, only 18.6 out of 10,000 adversarial samples are triggering inputs.

Model Compression

Various model compression algorithms such as weight pruning Li et al. (2017); Han et al. (2015), quantization Zhou et al. (2017); Rastegari et al. (2016) and knowledge distillation Buciluǎ, Caruana, and Niculescu-Mizil (2006); Polino, Pascanu, and Alistarh (2018); Mishra and Marr (2018) have been proposed to compress deep learning models. Weight pruning sets a portion of model parameters, selected by predefined criteria, to zero. The intuition behind weight pruning is that some weights are redundant or make a negligible contribution to the whole inference process. Quantization compresses a model by reducing the number of bits used for number representation. For example, a common approach in quantization is to use an 8-bit integer representation for numeric parameters that are originally represented by 32-bit floating-point numbers. Knowledge distillation aims to train a compact model based on its large original model. During distillation, the knowledge is transferred from the original model into the compact model.
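The lossy nature of pruning and quantization can be seen in a small Python sketch. It assumes magnitude-based pruning and symmetric uniform 8-bit quantization, which are common choices but only two of many possible schemes; it does not reproduce how TensorFlow Lite or Distiller compress a model. All function names and the weight values are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def quantize_int8(weights):
    """Map floats to signed 8-bit integers with one shared scale factor."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 127.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the 8-bit representation."""
    return [q * scale for q in quantized]


weights = [0.9, -0.05, 0.3, -0.6, 0.01, 0.2]

print(magnitude_prune(weights, sparsity=0.5))

quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
# The round trip is not exact: each weight moves by at most scale / 2.
print(max(abs(w - r) for w, r in zip(weights, restored)))
```

The small per-weight errors introduced here are the kind of change that can shift a prediction near a decision boundary, which is why the compressed model may deviate from the original on some inputs.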


This section formulates the targeted problem, and then details how we tackle this problem in TriggerFinder.

Problem Formulation

Given a non-triggering input as the seed input $x_s$, TriggerFinder strives to find a new input $x_t$ such that the top-1 label $l_{f(x_t)}$ predicted by the original model $f$ is different from the top-1 label $l_{g(x_t)}$ predicted by the compressed model $g$, i.e., $l_{f(x_t)} \neq l_{g(x_t)}$. Similar to mutation-based test generation approaches Xie et al. (2019b); Odena et al. (2019), TriggerFinder attempts to find $x_t$ by applying a series of input mutation operators to the seed input $x_s$. Conceptually, $x_t = x_s + \epsilon$, where $\epsilon$ is the perturbation made by the applied input mutation operators.
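The idea that a triggering input is the seed plus an accumulated perturbation can be sketched with two hypothetical mutation operators on a flattened image with pixel values in [0, 1]. The operators `add_noise` and `shift_brightness` are assumptions for illustration; the operators actually used are those of DiffChaser and are not specified here.

```python
import random


def add_noise(image, rng, strength=0.1):
    """Hypothetical operator: add independent uniform noise to each pixel."""
    return [min(1.0, max(0.0, p + rng.uniform(-strength, strength)))
            for p in image]


def shift_brightness(image, rng, strength=0.1):
    """Hypothetical operator: add the same random offset to all pixels."""
    delta = rng.uniform(-strength, strength)
    return [min(1.0, max(0.0, p + delta)) for p in image]


rng = random.Random(0)
seed_input = [0.0, 0.5, 1.0, 0.25]

mutated = add_noise(seed_input, rng)
# The perturbation is whatever the applied operators changed.
epsilon = [m - s for m, s in zip(mutated, seed_input)]
print(mutated)
print(epsilon)
```

Applying several operators in sequence simply accumulates their individual changes into one overall perturbation.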

Overview of TriggerFinder

Algorithm 1 shows the overview of TriggerFinder. TriggerFinder takes four inputs: a seed input $x_s$, the original model $f$, the compressed model $g$, and a list pool of predefined input mutation operators; it returns a triggering input $x_t$ if found.

TriggerFinder finds $x_t$ via multiple iterations. Throughout all iterations, TriggerFinder maintains two variables: $op$ is the input mutation operator to apply, which is initially randomly picked from pool on line 1 and updated in each iteration on line 10; $x$ is the input with the maximum fitness value among all generated inputs, which is initialized with $x_s$ on line 2.

In each iteration, TriggerFinder applies the input mutation operator $op$ to the input $x$, which has the highest fitness value, to generate a new, mutated input $x'$, i.e., $x' = op(x)$ on line 4. If $x'$ triggers a deviated behavior between $f$ and $g$ on line 5, then $x'$ is returned as the triggering input $x_t$. Otherwise, TriggerFinder compares the fitness values of $x'$ and $x$ on line 7, and uses the one that has the higher value for the next iteration (line 8). The mutation operators are implemented separately from the main logic of TriggerFinder, so it is easy to integrate more mutation operators. In our implementation, we used the same operators as DiffChaser.

Two factors can significantly affect the performance of Algorithm 1: the fitness function and the strategy to select mutation operators. In the following sections, we give a detailed illustration of each of them, including their intuitions and mechanisms.

Input: $x_s$: a seed input; $f$: the original model; $g$: the compressed model; pool: a list of predefined input mutation operators
Output: a triggering input $x_t$

1:  $op \leftarrow$ an operator randomly selected from pool
2:  $x \leftarrow x_s$
3:  while true do
4:     $x' \leftarrow op(x)$
5:     if $l_{f(x')} \neq l_{g(x')}$ then
6:        return $x'$ // $x'$ is a triggering input
7:     if $H(x') > H(x)$ then // $H$: the fitness function
8:        $x \leftarrow x'$ // if it has higher fitness value
9:        $op$.update() // update its ranking value
10:    $op \leftarrow$ pool.select($op$) // select the next operator
Algorithm 1 Overview of TriggerFinder
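The loop in Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the TriggerFinder implementation: the two toy models, the scalar input, the toy fitness function, the bounded iteration count and the uniform stand-in for operator selection are all hypothetical and serve only to make the example runnable.

```python
import random


def top1_label(probs):
    return max(range(len(probs)), key=lambda i: probs[i])


def find_triggering_input(seed, f, g, pool, fitness, select, rng,
                          max_iters=1000):
    """Mutate the seed until the two models disagree on the top-1 label."""
    op = rng.choice(pool)                        # line 1
    x = seed                                     # line 2
    for _ in range(max_iters):                   # line 3, bounded in this sketch
        x_new = op(x)                            # line 4
        if top1_label(f(x_new)) != top1_label(g(x_new)):  # line 5
            return x_new                         # line 6
        improved = fitness(x_new) > fitness(x)   # line 7
        if improved:
            x = x_new                            # line 8
        op = select(pool, op, improved, rng)     # lines 9-10
    return None


def clamp(v):
    return min(1.0, max(0.0, v))


# Toy two-class "models" on a scalar input; they disagree for 1/3 < x <= 0.5.
def f(x):
    s = clamp(x)
    return [1.0 - s, s]


def g(x):
    s = clamp(1.5 * x)
    return [1.0 - s, s]


def fitness(x):
    # Toy fitness: gap between the two top-1 probabilities. A real
    # implementation would reuse cached model outputs to save queries.
    return abs(max(f(x)) - max(g(x)))


def uniform_select(pool, op, improved, rng):
    # Stand-in for the MH-based strategy: pick any operator at random.
    return rng.choice(pool)


pool = [lambda x: clamp(x + 0.05), lambda x: clamp(x - 0.05)]
found = find_triggering_input(0.0, f, g, pool, fitness, uniform_select,
                              random.Random(0))
print(found, f(found), g(found))
```

Each iteration generates exactly one mutated input, which is the property that keeps the number of model queries low compared with a population-based search.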

Fitness Function

Following existing test generation approaches in software testing Chen et al. (2016); Odena et al. (2019); Xie et al. (2019b), if the mutated input $x'$ is a non-triggering input, TriggerFinder uses the fitness function to determine whether $x'$ should be used in the subsequent iterations of mutation (Algorithm 1, lines 7-8). By selecting the proper mutated input in each iteration, we aim to move increasingly close to the triggering input $x_t$ from the initial seed input $x_s$.

Intuitions of TriggerFinder

We design the fitness function from two perspectives. First, if $x'$ can cause a larger distance between the outputs of $f$ and $g$ than $x$, $x'$ is favored over $x$. The intuition is that if $x'$ can, then future inputs generated by mutating $x'$ are more likely to further enlarge the difference. Eventually, one input generated in the future will increase the distance so substantially that the labels predicted by $f$ and $g$ become different, and this input is a triggering input that TriggerFinder has been searching for.

Second, when $x'$ and $x$ cause the same distance between the outputs of $f$ and $g$, $x'$ is preferred over $x$ if $x'$ triggers a previously unobserved model state in $f$ or $g$. Conceptually, a model state refers to the internal status of the original or compressed model during inference, including but not limited to the model’s activation status. If an input triggers a model state that is different from the previously observed ones, it is likely that it triggers a new logic flow in $f$ or $g$. By selecting such an input for the next iterations, we encourage TriggerFinder to explore more new logic flows of the two models, resulting in new model behaviors, even deviated ones.

Definition of Fitness Function

Now we present the formal definition of our fitness function for a non-triggering input $x$ as a combination of the two intuitions.

For the first intuition, given an input $x$, we denote the distance between the two DNN models’ outputs as $D(x)$. Since $x$ is a non-triggering input, the top-1 labels of $f(x)$ and $g(x)$ are the same, and we simply use the top-1 probabilities to measure the distance, i.e., $D(x)$ is the difference between $p_{f(x)}$ and $p_{g(x)}$.

For our second intuition, since we assume that the compressed model is a blackbox and its internal status is not available, we use the probability vector to approximate the model state. When executing Algorithm 1, we track the probability vectors produced by $f$ and $g$ on all generated inputs. When calculating the fitness value of $x$ at each iteration, we check whether the pair of probability vectors $\langle f(x), g(x) \rangle$ output by the two DNN models has been observed previously. This check is denoted as $O(x)$. We adopt the Nearest Neighborhood algorithm Muja and Lowe (2014) to determine $O(x)$, i.e., whether $\langle f(x), g(x) \rangle$ is close to any previously observed state.

The fitness function for a non-triggering input $x$, denoted as $H(x)$, combines $D(x)$ and $O(x)$. Specifically, according to $H$, for two non-triggering inputs, we choose the one with the higher $D$ component. If their $D$ components are very close (i.e., the difference is less than the tolerance $\delta$), they are chosen based on $O$. The tolerance $\delta$ is a constant set in our implementation.
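The two components of the fitness function and the tolerance-based comparison can be sketched as follows. Several parts are assumptions rather than the paper's definitions: the distance is taken as the absolute gap between top-1 probabilities, novelty is decided by a brute-force nearest-neighbor scan with a hypothetical `radius` threshold instead of an approximate nearest-neighbor library, and the default values of `delta` and `radius` are placeholders.

```python
import math


class FitnessTracker:
    """Tracks observed output pairs and compares candidate inputs."""

    def __init__(self, delta=0.01, radius=0.05):
        self.delta = delta    # tolerance on the distance (assumed value)
        self.radius = radius  # novelty threshold (assumed value)
        self.seen = []        # previously observed output pairs

    @staticmethod
    def distance(f_probs, g_probs):
        """Gap between the top-1 probabilities of the two models."""
        return abs(max(f_probs) - max(g_probs))

    def is_new_state(self, f_probs, g_probs):
        """True if the output pair is far from every observed pair."""
        state = list(f_probs) + list(g_probs)
        return all(math.dist(state, old) > self.radius for old in self.seen)

    def observe(self, f_probs, g_probs):
        self.seen.append(list(f_probs) + list(g_probs))

    def prefers(self, new, old):
        """Compare two (distance, novelty) pairs.

        The larger distance wins; novelty breaks near-ties within delta.
        """
        (d_new, o_new), (d_old, o_old) = new, old
        if abs(d_new - d_old) < self.delta:
            return o_new > o_old
        return d_new > d_old


tracker = FitnessTracker()
f_out, g_out = [0.9, 0.1], [0.8, 0.2]
print(tracker.distance(f_out, g_out))
print(tracker.is_new_state(f_out, g_out))  # nothing observed yet
tracker.observe(f_out, g_out)
print(tracker.is_new_state([0.89, 0.11], g_out))  # close to an observed pair
```

Using only probability vectors as a proxy for model state keeps the check compatible with the blackbox assumption.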

The Strategy to Select Mutation Operators

Existing work on test generation for conventional software has shown that the selection strategy of mutation operators can have a significant impact on the performance of mutation-based test input generation techniques, such as the one adopted by TriggerFinder Le, Sun, and Su (2015); Chen et al. (2016). Following prior work, in each iteration, TriggerFinder favors a mutation operator with a high probability of making the next mutated input have a higher fitness value than $x$. Unfortunately, it is non-trivial to obtain such a probability for each mutation operator before the mutation process starts.

To tackle the challenge of selecting effective mutation operators, TriggerFinder models the problem as a Markov Chain Meyn and Tweedie (2009) and uses Monte Carlo Kass et al. (1998) to guide the selection. During test generation, TriggerFinder selects one mutation operator from a pool of operators and applies it to the input. This process can be modeled as a stochastic process $\{op_0, op_1, \dots, op_t\}$, where $op_t$ is the operator selected at the $t$-th iteration. Since the selection of $op_{t+1}$ from all possible states depends only on $op_t$ Le, Sun, and Su (2015); Wang et al. (2020); Chen et al. (2016), this process is a typical Markov Chain. Given this modeling, TriggerFinder further uses Markov Chain Monte Carlo (MCMC) Kass et al. (1998) to guide the selection of mutation operators in order to mimic the selection from the actual probability.

Specifically, TriggerFinder adopts the Metropolis-Hastings algorithm Kass et al. (1998), a popular MCMC method, to guide the selection of mutation operators from the pool. Throughout all iterations, TriggerFinder associates each operator $op$ with a ranking value:

$$v(op) = \frac{N_i}{N_{op}}$$

where $N_{op}$ is the number of times that operator $op$ is selected and $N_i$ is the number of times that the fitness value of the input is increased after applying $op$. These numbers are dynamically updated during the generation, as shown in Algorithm 1, line 9.

The detailed algorithm for the operator selection, given the operator $op_a$ used in the last iteration, is shown in Algorithm 2. Based on each operator’s ranking value $v$, TriggerFinder first sorts the mutation operators from the largest to the smallest (line 1) and denotes the index of $op_a$ as $k_a$ (line 2). Then TriggerFinder selects one mutation operator $op_b$ from the pool (line 4) and calculates the acceptance probability for $op_b$ given $op_a$ (line 6), in which the parameter $p$ is the multiplicative inverse of the number of mutation operators in the pool. Following the Metropolis-Hastings algorithm, TriggerFinder randomly accepts or rejects this mutation operator based on its acceptance probability (line 7). The above process repeats until one operator is accepted.

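Input: $op_a$: the mutation operator used in the last iteration; pool: a list of predefined input mutation operators
Output: $op_b$: the mutation operator for this iteration

1:  sort the operators in pool into a list in descending order of the operators’ ranking values
2:  $k_a \leftarrow$ index of $op_a$ in the sorted list
3:  repeat
4:     $op_b \leftarrow$ an operator selected from pool
5:     $k_b \leftarrow$ index of $op_b$ in the sorted list
6:     calculate the acceptance probability of $op_b$ given $op_a$
7:  until random.rand(0, 1) < the acceptance probability
8:  return $op_b$
Algorithm 2 Mutation Operator Selection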
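The ranking value and the accept/reject loop of Algorithm 2 can be sketched in Python. The acceptance probability used here, $\min(1, (1-p)^{k_b - k_a})$ with $p$ equal to one over the pool size, is an assumption borrowed from the geometric-distribution form commonly used in MH-based operator selection in prior testing work; it is not confirmed to be the exact formula of TriggerFinder. The `Operator` class, its field names and the default ranking of 0 for unused operators are also hypothetical.

```python
import random


class Operator:
    """A mutation operator with the statistics behind its ranking value."""

    def __init__(self, name, selected=0, improved=0):
        self.name = name
        self.selected = selected  # times the operator was selected
        self.improved = improved  # times it increased the fitness value

    def ranking(self):
        # Ratio of fitness-increasing applications; 0 if never used (assumed).
        return self.improved / self.selected if self.selected else 0.0


def select_operator(prev, pool, rng):
    """Propose operators until one is accepted (Algorithm 2 sketch)."""
    ranked = sorted(pool, key=lambda op: op.ranking(), reverse=True)  # line 1
    k_a = ranked.index(prev)                                          # line 2
    p = 1.0 / len(pool)
    while True:                                                       # line 3
        candidate = rng.choice(pool)                                  # line 4
        k_b = ranked.index(candidate)                                 # line 5
        # Assumed acceptance probability: better-ranked candidates
        # (k_b <= k_a) are always accepted, worse ones less often.
        accept = min(1.0, (1.0 - p) ** (k_b - k_a))                   # line 6
        if rng.random() < accept:                                     # line 7
            return candidate                                          # line 8


pool = [Operator("good", 10, 9), Operator("mid", 10, 5), Operator("bad", 10, 1)]
rng = random.Random(0)

counts = {op.name: 0 for op in pool}
current = pool[1]
for _ in range(3000):
    current = select_operator(current, pool, rng)
    counts[current.name] += 1
print(counts)
```

With fixed statistics, operators that increased the fitness value more often are selected more frequently, while lower-ranked operators still get a chance to be explored.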
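| Source | Dataset | Model | Original Accuracy (%) | Compression | Compressed Accuracy (%) |
| --- | --- | --- | --- | --- | --- |
| DiffChaser | MNIST | LeNet-1 | 97.88 | Quan-8-bit | 97.88 |
| DiffChaser | MNIST | LeNet-5 | 98.81 | Quan-8-bit | 98.81 |
| DiffChaser | CIFAR-10 | ResNet-20 | 91.20 | Quan-8-bit | 91.20 |
| This study | MNIST | CNN | 99.11 | Pruning | 99.23 |
| This study | MNIST | CNN | 99.11 | Quan | 99.13 |
| This study | MNIST | LeNet-4 | 99.21 | Pruning | 99.13 |
| This study | MNIST | LeNet-4 | 99.21 | Quan | 99.21 |
| This study | MNIST | LeNet-5 | 99.13 | Pruning | 98.99 |
| This study | MNIST | LeNet-5 | 99.13 | Quan | 99.15 |
| This study | CIFAR-10 | PlainNet-20 | 87.33 | Know. Distil. | 75.89 |
| This study | CIFAR-10 | PlainNet-20 | 87.33 | Pruning | 85.98 |
| This study | CIFAR-10 | PlainNet-20 | 87.33 | Quan | 87.12 |
| This study | CIFAR-10 | ResNet-20 | 89.42 | Know. Distil. | 74.60 |
| This study | CIFAR-10 | ResNet-20 | 89.42 | Pruning | 89.88 |
| This study | CIFAR-10 | ResNet-20 | 89.42 | Quan | 88.89 |
| This study | CIFAR-10 | VGG-16 | 87.48 | Know. Distil. | 87.59 |
| This study | CIFAR-10 | VGG-16 | 87.48 | Pruning | 88.44 |
| This study | CIFAR-10 | VGG-16 | 87.48 | Quan | 87.06 |

Table 1: The Top-1 Accuracy of the Original Models and Compressed Models used in Evaluation. The first three rows (upper half) are the models from DiffChaser and the remaining rows (lower half) are the models prepared by this study. “Quan”: Quantization; “Know. Distil.”: Knowledge Distillation.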

Experiment Design

Datasets and Seed Inputs

We use the two datasets: MNIST LeCun and Cortes (2010) and CIFAR-10 Krizhevsky, Nair, and Hinton (2009) to evaluate the performance of TriggerFinder. We choose them as they are widely used for image classification tasks, and there are many models trained on them so that we can collect a sufficient number of compressed models for evaluation. The test set of each dataset consists of 10,000 images that are equally distributed in 10 classes. In our experiments, we randomly select 50 non-triggering inputs from each class of the test set. Thus, for each dataset, we use 500 seed inputs for evaluation. To mitigate the impact of randomness, we repeat the experiments five times using five unique random seeds.

Compressed Models

The compressed models used in our evaluation come from two sources. First, we use three pairs of an original model and its corresponding quantized model used by DiffChaser. More specifically, they are LeNet-1 and LeNet-5 for MNIST, and ResNet-20 for CIFAR-10. They are compressed by the authors of DiffChaser using TensorFlow Lite Abadi et al. (2015) with 8-bit quantization. The upper half of Table 1 shows their top-1 accuracy.

Second, to comprehensively evaluate the performance of TriggerFinder on other kinds of compressed models, we also prepare 15 pairs of models. Specifically, six of them are for MNIST and the remaining nine are for CIFAR-10. These compressed models are prepared by three kinds of techniques, namely, quantization, pruning, and knowledge distillation, using Distiller, an open-source model compression toolkit built by the Intel AI Lab Zmora et al. (2019). The lower half of Table 1 shows their top-1 accuracy.

Evaluation Metrics

For effectiveness, we measure the success rate to find a triggering input for selected seed inputs. In terms of efficiency, we measure the average time and model queries it takes to find a triggering input for each seed input. All of them are commonly used by previous studies Xie et al. (2019b); Guo et al. (2019); Pei et al. (2017); Odena et al. (2019). Their detailed meanings are explained as follows.

Success Rate.

It measures the ratio of the seed inputs for which a triggering input is successfully found over the total number of seed inputs. The higher the success rate, the more effective the underlying methodology. Specifically, $SR = \frac{1}{N} \sum_{i=1}^{N} s_i$, where $s_i$ is an indicator: it is equal to 1 if a triggering input based on seed input $i$ is found; otherwise, $s_i$ is 0. $N$ is the total number of seed inputs, i.e., 500 in our experiments.


Time.

It measures the average time to find a triggering input for each seed input. The shorter the time, the more efficient the input generation.


Query.

It measures the average number of model queries issued by TriggerFinder in order to find a triggering input for each seed input. A model query means that one input is fed into both the original DNN model and the compressed one. Since the computation of DNN models is expensive, it is preferable to issue as few queries as possible. The fewer the average queries, the more efficient the test generation.
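The three metrics can be computed from per-seed records as in the sketch below. The record layout (keys `found`, `seconds`, `queries`) and the function name `summarize` are hypothetical. Averaging time and queries over successful seeds only is an assumption that follows the "if successful" wording used when the results are reported.

```python
def summarize(results):
    """Compute success rate, average time and average queries.

    `results` holds one record per seed input.
    """
    total = len(results)
    successes = [r for r in results if r["found"]]
    success_rate = len(successes) / total
    if not successes:
        return success_rate, None, None
    avg_time = sum(r["seconds"] for r in successes) / len(successes)
    avg_queries = sum(r["queries"] for r in successes) / len(successes)
    return success_rate, avg_time, avg_queries


# Hypothetical records for four seed inputs.
records = [
    {"found": True, "seconds": 0.25, "queries": 10},
    {"found": True, "seconds": 0.5, "queries": 20},
    {"found": True, "seconds": 0.75, "queries": 30},
    {"found": False, "seconds": 240.0, "queries": 5000},
]
print(summarize(records))
```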

Experiments Setting

Baseline and its Parameters.

We use DiffChaser Xie et al. (2019b) as the baseline, since it is the state-of-the-art blackbox approach to the best of our knowledge. Specifically, we use the source code provided by its authors. For the timeout to find triggering inputs for each seed input, we use 240s for both DiffChaser and TriggerFinder, which is longer than the 180s timeout used in DiffChaser. We use a longer timeout to mitigate the potential threat that the success rate of either DiffChaser or TriggerFinder might not saturate within a short period of time, which would bias the comparison. For the population size of DiffChaser, we set it to the value stated in their paper, i.e., 1,000. For a fair comparison, we do not include the white-box approach Xie et al. (2019a) in our evaluation.


The experiments are conducted on a CentOS 8 server with two E5-2683 v4 2.1 GHz CPUs and eight 2080 Ti GPUs.

Results and Discussions

Figure 2: Triggering Inputs Found by TriggerFinder
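| Dataset | Model | Compression | TriggerFinder Success Rate | TriggerFinder Time (s) | TriggerFinder Queries | DiffChaser Success Rate | DiffChaser Time (s) | DiffChaser Queries |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MNIST | LeNet-1 | Quantization-8-bit | 100% | 0.513 | 83.97 | 99.24% | 4.618 | 5,781.54 |
| MNIST | LeNet-5 | Quantization-8-bit | 100% | 0.706 | 117.02 | 99.72% | 5.513 | 6,065.78 |
| CIFAR-10 | ResNet-20 | Quantization-8-bit | 100% | 0.509 | 30.43 | 99.68% | 21.889 | 2,346.71 |
| MNIST | LeNet-4 | Prune | 100% | 0.056 | 18.34 | 99.20% | 3.930 | 6,155.65 |
| MNIST | LeNet-4 | Quantization | 100% | 0.187 | 27.83 | 99.44% | 5.025 | 6,643.60 |
| MNIST | LeNet-5 | Prune | 100% | 0.071 | 22.03 | 98.76% | 3.935 | 6,318.75 |
| MNIST | LeNet-5 | Quantization | 100% | 0.225 | 28.08 | 98.56% | 4.494 | 6,675.32 |
| MNIST | CNN | Prune | 100% | 0.068 | 22.51 | 99.36% | 3.831 | 6,017.31 |
| MNIST | CNN | Quantization | 100% | 0.173 | 25.34 | 99.44% | 4.488 | 6,442.48 |
| CIFAR-10 | PlainNet-20 | Prune | 100% | 0.051 | 4.31 | 99.92% | 2.234 | 1,939.15 |
| CIFAR-10 | PlainNet-20 | Quantization | 100% | 0.470 | 9.13 | 99.48% | 3.140 | 2,011.66 |
| CIFAR-10 | PlainNet-20 | Knowledge Distillation | 100% | 0.029 | 3.97 | 99.68% | 2.202 | 1,951.44 |
| CIFAR-10 | ResNet-20 | Prune | 100% | 0.063 | 4.70 | 99.84% | 2.801 | 2,155.45 |
| CIFAR-10 | ResNet-20 | Quantization | 100% | 0.685 | 10.16 | 99.68% | 4.444 | 2,213.08 |
| CIFAR-10 | ResNet-20 | Knowledge Distillation | 100% | 0.032 | 3.91 | 99.96% | 2.615 | 2,095.24 |
| CIFAR-10 | VGG-16 | Prune | 100% | 0.041 | 5.84 | 99.72% | 3.751 | 2,543.52 |
| CIFAR-10 | VGG-16 | Quantization | 100% | 1.183 | 26.16 | 99.72% | 6.184 | 2,958.87 |
| CIFAR-10 | VGG-16 | Knowledge Distillation | 100% | 0.036 | 5.78 | 99.68% | 4.217 | 2,756.02 |

Table 2: Evaluation Results using TriggerFinder and DiffChaser.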
Figure 3: Histogram of the Number of Queries Required by TriggerFinder and DiffChaser to Find the Triggering Input for the Given Seed Input. The value is averaged over five repeated experiments. Panels: (a) TriggerFinder, LeNet-5, Quantization-8-bit; (b) TriggerFinder, ResNet-20, Knowledge Distillation; (c) DiffChaser, LeNet-5, Quantization-8-bit; (d) DiffChaser, ResNet-20, Knowledge Distillation.


Success Rate.

The two Success Rate columns in Table 2 show the success rates of TriggerFinder and DiffChaser, respectively. TriggerFinder achieves a 100% success rate for all pairs of models. In contrast, DiffChaser fails to find a triggering input for certain seed inputs in every pair. The ratios of such failures range from 0.04% to 1.44%, with an average of 0.52%. This result demonstrates that TriggerFinder outperforms DiffChaser in terms of effectiveness. The reason is that DiffChaser leverages the k-Uncertainty fitness function to guide the generation, and this function does not properly measure the differences between the two models. In contrast, our fitness function not only measures the differences between the prediction outputs of the original and compressed models, but also measures whether the input triggers previously unobserved states of the two models.

To further investigate the effectiveness of TriggerFinder, we use all the non-triggering inputs in the whole test set as seed inputs and measure the success rate of TriggerFinder on the 18 pairs of models. We found that TriggerFinder can consistently achieve a 100% success rate for all 18 pairs. Note that due to the poor efficiency of DiffChaser as shown in the next section, we are not able to conduct the same experiments using DiffChaser.

Figure 2 shows two examples of the triggering inputs found by TriggerFinder in the MNIST and CIFAR-10 datasets, respectively. The original models correctly classify the two inputs as “5” and “cat”, respectively. However, the inputs are misclassified as “6” and “deer” by the associated compressed models, respectively.



Time.

The two Time columns in Table 2 show the average time spent by TriggerFinder and DiffChaser to find a triggering input for each seed input, if successful. The time needed by TriggerFinder to find one triggering input ranges from 0.029s to 1.183s, with an average of 0.283s. DiffChaser takes much longer than TriggerFinder. Specifically, DiffChaser takes 2.202s to 21.889s to find one triggering input. On average, TriggerFinder is 43.8x (5.2x to 115.8x) as fast as DiffChaser.


Query.

The two Query columns in Table 2 show the average number of queries issued by TriggerFinder and DiffChaser for each seed input for which a triggering input can be found. Generally, TriggerFinder needs about 30 queries or fewer to find a triggering input, with only two exceptions. On average, TriggerFinder requires only 24.97 queries (3.91 to 117.02). DiffChaser needs thousands of queries for each triggering input (4,059.53 on average), far more than TriggerFinder. For example, the smallest number of queries needed by DiffChaser is 1,939.15, for PlainNet-20 and its compressed model produced by pruning. For the same pair of models, TriggerFinder needs only 4.31 queries on average. Overall, DiffChaser requires 289.8x (51.8x to 535.6x) as many queries as TriggerFinder.
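The reported ratios can be checked against individual rows of Table 2 with a few lines of Python. The three rows below are copied from the table; the dictionary layout and the function name `ratio` are only for illustration.

```python
def ratio(baseline, ours):
    """How many times larger the baseline value is."""
    return baseline / ours


# (TriggerFinder time, TriggerFinder queries, DiffChaser time, DiffChaser queries)
rows = {
    "MNIST LeNet-5 Quantization-8-bit": (0.706, 117.02, 5.513, 6065.78),
    "CIFAR-10 PlainNet-20 Prune": (0.051, 4.31, 2.234, 1939.15),
    "CIFAR-10 VGG-16 Quantization": (1.183, 26.16, 6.184, 2958.87),
}

for name, (tf_time, tf_queries, dc_time, dc_queries) in rows.items():
    print(name,
          round(ratio(dc_time, tf_time), 1),
          round(ratio(dc_queries, tf_queries), 1))
```

The LeNet-5 row gives the smallest query ratio (about 51.8x) and the VGG-16 quantization row gives the smallest time ratio (about 5.2x), matching the lower ends of the reported ranges.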

We further visualize the queries of TriggerFinder and DiffChaser in Figure 3 on two pairs of models: LeNet-5 Quantization-8-bit and ResNet-20 Knowledge Distillation. They are selected since the ratio of queries needed by DiffChaser over the one needed by TriggerFinder is the smallest (51.8x) and largest (535.6x) in all the 18 pairs of models. Figure 3 shows the histogram of the number of queries needed by TriggerFinder and DiffChaser, respectively, as well as the mean and median. It can be observed that TriggerFinder significantly outperforms DiffChaser in terms of queries. The reason is that DiffChaser adopts a genetic algorithm to generate many inputs via crossover and feed them into DNN models in each iteration. As a result, it requires thousands of queries from the two models to find a triggering input. In contrast, TriggerFinder only needs to generate one mutated input and query once in each iteration.

Ablation Study

We further investigate the effects of our fitness function and mutation operator selection strategy. Specifically, we create the following two variants of TriggerFinder and compare their performance with TriggerFinder.

  1. Simplified fitness function: the fitness function in TriggerFinder is replaced by a simpler one that does not trace the model states triggered by inputs.

  2. Uniform operator selection: the selection strategy for mutation operators in TriggerFinder is changed to uniform random selection.
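The difference between the two selection strategies can be sketched as follows. This section does not give the exact acceptance rule, so the rank-based geometric target, the parameter `p`, the function names, and the operator names and success rates below are all assumptions made for illustration. In an actual search, the success rates would be updated as operators are applied.

```python
import random
from collections import Counter


def select_uniform(operators, rng):
    """Ablation-style selection: every operator is equally likely."""
    return rng.choice(operators)


def select_mh(operators, success_rate, current, rng, p=0.3):
    """One Metropolis-Hastings step over mutation operators.

    Operators are ranked by success rate (rank 0 is best) and the target
    distribution is proportional to (1 - p) ** rank. The proposal is uniform,
    hence symmetric, so the acceptance probability is the target ratio.
    """
    ranked = sorted(operators, key=lambda op: success_rate[op], reverse=True)
    rank = {op: k for k, op in enumerate(ranked)}
    proposal = rng.choice(operators)
    accept_prob = min(1.0, (1.0 - p) ** (rank[proposal] - rank[current]))
    return proposal if rng.random() < accept_prob else current


# Hypothetical operators and success rates, for illustration only.
operators = ["blur", "noise", "shift"]
success_rate = {"blur": 0.6, "noise": 0.3, "shift": 0.1}

rng = random.Random(0)
current = operators[-1]
mh_counts = Counter()
for _ in range(20000):
    current = select_mh(operators, success_rate, current, rng)
    mh_counts[current] += 1

uniform_counts = Counter(select_uniform(operators, rng) for _ in range(20000))
print(mh_counts.most_common())
print(uniform_counts.most_common())
```

Under these assumptions the MH chain spends most of its time on the operator with the highest success rate, while uniform selection spreads its choices evenly. When a search ends after only a handful of iterations, the success-rate estimates carry little information and the two strategies behave similarly.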

We measured the success rate, time, and queries needed by each variant using the same seed inputs as in the previous experiments. For the first variant, the success rate across the 18 pairs ranges from 40.8% to 100%, with an average of only 86.6%. This implies that it is important to encourage the mutated inputs to explore more model states, as formulated by our fitness function. The second variant achieves a 100% success rate. In terms of efficiency, the average time it spends per seed input to find a triggering input is 0.950x to 1.131x that of TriggerFinder, with an average of 1.037x, and the queries it requires are 0.946x to 1.134x those of TriggerFinder. More specifically, in four of the 18 pairs the second variant is marginally more efficient than TriggerFinder. The reason is that, with our fitness function, these four pairs need very few iterations (fewer than 10) to find a triggering input, so our selection strategy has not yet gathered enough samples to learn the effectiveness of each mutation operator. Nevertheless, in the other 14 pairs, TriggerFinder is 1.06x as fast as the second variant. The ablation study shows that both our fitness function and our selection strategy contribute to the effectiveness and efficiency of TriggerFinder.


Conclusion

In this paper, we proposed TriggerFinder, a novel and effective input generation method for triggering deviated behaviors between an original DNN model and its compressed model. Specifically, TriggerFinder leverages the Metropolis-Hastings (MH) algorithm to select a mutation operator at each iteration and successively mutates a given seed input. It also incorporates a novel fitness function to decide whether a mutated input is used in the next iteration. The results show that TriggerFinder outperforms prior work in both effectiveness and efficiency: it achieves a 100% success rate using significantly less time and far fewer queries than DiffChaser, the state-of-the-art technique.

