Hierarchical Distribution-Aware Testing of Deep Learning

by   Wei Huang, et al.
University of Liverpool

With its growing use in safety- and security-critical applications, Deep Learning (DL) has raised increasing concerns regarding its dependability. In particular, DL has a notorious problem of lacking robustness. Despite recent efforts to detect Adversarial Examples (AEs) via state-of-the-art attack and testing methods, these methods are normally agnostic to the input distribution and/or disregard the perceptual quality of AEs. Consequently, the detected AEs are either irrelevant inputs in the application context or so unnatural/unrealistic that they can be easily noticed by humans. This may lead to a limited effect on improving the DL model's dependability, as the testing budget is likely to be wasted on detecting AEs that are encountered only rarely in its real-life operations. In this paper, we propose a new robustness testing approach for detecting AEs that considers both the input distribution and the perceptual quality of inputs. The two considerations are encoded by a novel hierarchical mechanism. First, at the feature level, the input data distribution is extracted and approximated by data compression techniques and probability density estimators. Such a quantified feature-level distribution, together with indicators that are highly correlated with local robustness, is considered in selecting test seeds. Given a test seed, we then develop a two-step genetic algorithm for local test case generation at the pixel level, in which two fitness functions work alternately to control the quality of detected AEs. Finally, extensive experiments confirm that our holistic approach, considering hierarchical distributions at feature and pixel levels, is superior to state-of-the-art methods that either disregard any input distribution or only consider a single (non-hierarchical) distribution, in terms of both the quality of detected AEs and improving the overall robustness of the DL model under testing.





1. Introduction

DL is being explored to provide transformational capabilities to many industrial sectors, including automotive, healthcare and finance. However, the reality that DL is not as dependable as required has become a major impediment. For instance, key industrial foresight reviews identified that the biggest obstacle to gaining the benefits of DL is its dependability (lane_new_2016). There is an urgent need for methods that enable the dependable use of DL, towards which great efforts have been made in recent years in the field of DL Verification and Validation (VnV) (huang_survey_2020; zhang_machine_2020).

DL robustness is arguably the property in the limelight. Informally, robustness requires that the decision of the DL model is invariant against small perturbations on inputs. That is, all inputs in a small input region (e.g., a norm ball defined in some L_p-norm distance) should share the same prediction label by the DL model. Inside that region, if an input is predicted differently to the given label, then this input is normally called an AE. Most VnV methods designed for DL robustness are essentially about detecting AEs, e.g., adversarial attack based methods (DBLP:journals/corr/GoodfellowSS14; DBLP:conf/iclr/MadryMSTV18) and coverage-guided testing (du2019deepstellar; xie_npc_2022; ma2018deepgauge; DBLP:conf/sosp/PeiCYJ17; huang2021coverage; DBLP:conf/icse/SunHKSHA19).

As recently noticed by the software engineering community, emerging studies that systematically evaluate AEs detected by the aforementioned state-of-the-art methods reveal two major drawbacks: (i) they do not take the input data distribution into consideration, so it is hard to judge whether the identified AEs are meaningful to the DL application (berend_cats_2020; dola_distribution_aware_2021); (ii) most detected AEs are of poor perceptual quality, i.e., too unnatural/unrealistic (harel_canada_is_2020) to be seen in real-life operations. That said, not all AEs are equal, nor can all be eliminated given limited resources. A wise strategy is to detect those AEs that are both "distribution-aware" and carry natural/realistic pixel-level perturbations, which motivates this work.

Prior to this work, a few decent attempts at distribution-aware testing for DL have been made. Broadly speaking, the field has developed two types of approaches: OOD-detector based (dola_distribution_aware_2021; DBLP:conf/icse/Berend21) and feature-only based (toledodistribution; byun_manifold_based_2020). The former can only detect anomalies/outliers, rather than being "fully aware" of the distribution. While the latter indeed generates new test cases according to the learnt distribution (in a latent space), it ignores pixel-level information due to the compression nature of the generative models used (zhong2020generative). To this end, our approach advances in this direction with the following novelties and contributions:

a) We provide a “divide and conquer” solution—HDA testing—by decomposing the input distribution into two levels (named as global and local) capturing how the feature-wise and pixel-wise information are distributed, respectively. At the global level, isolated problems of estimating the feature distribution and selecting best test seeds can be solved by dedicated techniques. At the local level where features are fixed, the clear objective is to precisely generate test cases considering perceptual quality. Our extensive experiments show that such hierarchical consideration is more effective to detect high-quality AEs than state-of-the-art that either disregards any data distribution or only considers a single (non-hierarchical) distribution. Consequently, we also show the DL model under testing exhibits higher robustness after “fixing” the high-quality AEs detected.

b) At the global level, we propose novel methods to select test seeds based on the approximated feature distribution of the training data and predictive robustness indicators, so that the norm balls of the selected seeds are both from the high-density area of the distribution and relatively unrobust (thus more cost-effective to detect AEs in later stages). Notably, state-of-the-art DL testing methods normally select test seeds randomly from the training dataset without any principled rules. Thus, from a software engineering perspective, our test seed selection is more practically useful in the given application context.

c) Given a carefully selected test seed, we propose a novel two-step GA to generate test cases locally (i.e. within a norm ball) to control the perceptual quality of detected AEs. At this local level, the perceptual quality distribution of data-points inside a norm ball requires pixel-level information that cannot be sufficiently obtained from the training data alone. Thus, we innovatively use common perceptual metrics that quantify image quality as an approximation of such local distribution. Our experiments confirm that the proposed GA is not only effective after being integrated into HDA (as a holistic testing framework), but also outperforms other pixel level AE detectors in terms of perception quality when applied separately.

d) We investigate black-box (to the DL model under testing) methods for the main tasks at both levels. Thus, to the best of our knowledge, our HDA approach provides an end-to-end, black-box solution, which is the first of its kind and more versatile in software engineering practice.

e) We provide a publicly accessible tool of our HDA testing framework, together with all source code, datasets, DL models and experimental results.

2. Preliminaries and Related Work

In this section, we first introduce preliminaries and related work on DL robustness, together with formal definitions of concepts adopted in our HDA approach. Then existing works on distribution-aware testing are discussed. Since our HDA testing also considers the naturalness of detected AEs, some common perception quality metrics are introduced. In summary, we present Fig. 1 to show the stark contrast of our proposed HDA testing (the green route) to other related works (the red and amber routes).

Figure 1. Comparison between our proposed Hierarchical Distribution-Aware (HDA) testing and related works.

2.1. DL Robustness and Adversarial Examples

We denote the prediction output of the DL model f on an input x as the vector f(x), with size equal to the total number of labels. The predicted label is y = argmax_i f_i(x), where f_i(x) is the i-th attribute of the vector f(x).

DL robustness requires that the decision of the DL model is invariant against small perturbations on the input x. That is, all inputs in an input region η have the same prediction label, where η is usually a small norm ball (defined with some L_p-norm distance; common choices are p = 1, 2 and ∞, with the L_∞ norm most commonly used) around an input x. If an input x' inside η is predicted differently to f(x) by the DL model, then x' is called an Adversarial Example (AE).

DL robustness VnV can be based on formal methods (huang_safety_2017; ruan_reachability_2018) or statistical approaches (DBLP:conf/iclr/WengZCYSGHD18; webb_statistical_2019), and normally aims at detecting AEs. In general, we may classify methods into two types (the two branches in the red route of Fig. 1), depending on how the test cases are generated: (i) Adversarial attack based methods are normally optimised for the DL prediction loss to find AEs; they include white-box attack methods like FGSM (DBLP:journals/corr/GoodfellowSS14) and PGD (DBLP:conf/iclr/MadryMSTV18), as well as black-box attacks (alzantot2019genattack; DBLP:conf/cec/WuLZXZ21) using GAs with gradient-free optimisation. (ii) Coverage-guided testing is optimised for certain coverage metrics on the DL model's internal structure, inspired by coverage testing for traditional software. Several popular test metrics have been proposed, like neuron coverage (DBLP:conf/sosp/PeiCYJ17; ma2018deepgauge), modified condition/decision coverage (DBLP:conf/icse/SunHKSHA19) for CNNs and temporal coverage (huang2021coverage; du2019deepstellar) for RNNs. While it is argued that coverage metrics are not strongly correlated with DL robustness (yan_correlations_2020; harel_canada_is_2020), they are seen as providing insights into the internal behaviours of DL models and hence may guide test selection to find more diverse AEs (huang2021coverage).

Without loss of generality, we reuse the formal definition of DL robustness in (webb_statistical_2019; weng_proven_2019) in this work:

Definition 1 (Local Robustness).

The local robustness of the DL model f, w.r.t. a local region η and a target label y, is:

R_l(η, y) := ∫_{x' ∈ η} I_{f(x')=y}(x') · p_l(x' | x' ∈ η) dx'

where p_l(x' | x' ∈ η) is the local distribution of region η, which is precisely the "input model" used by both (webb_statistical_2019; weng_proven_2019). I_S(x) is an indicator function: I_S(x) = 1 when x satisfies S, and 0 otherwise.
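As an illustration (not from the paper), the integral above can be estimated by plain Monte Carlo sampling when the local distribution is taken to be uniform over the norm ball; the model, seed and radius below are toy stand-ins:

```python
import numpy as np

def local_robustness(model, seed, y, radius, n_samples=1000, rng=None):
    # Monte Carlo estimate: the fraction of inputs, drawn uniformly from the
    # L-infinity norm ball around `seed`, that keep the target label y.
    rng = rng or np.random.default_rng(0)
    noise = rng.uniform(-radius, radius, size=(n_samples,) + seed.shape)
    preds = np.array([model(x) for x in seed + noise])
    return float(np.mean(preds == y))

toy_model = lambda x: int(x.sum() > 0)   # toy classifier: label 1 iff sum > 0

seed = np.full(4, 0.5)                   # comfortably inside class 1
score = local_robustness(toy_model, seed, y=1, radius=0.1)   # -> 1.0
```

With a radius of 0.1 the perturbed sums stay well above 0, so the estimate is exactly 1; shrinking the margin or enlarging the radius drives it below 1.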

To detect as many AEs as possible, normally the first question is—which local region shall we search for those AEs? I.e. how to select test seeds? To be cost-effective, we want to explore unrobust regions, rather than regions where AEs are relatively rare. This requires the local robustness of a region to be known a priori, which may imply a paradox (cf. Remark 3 later). In this regard, we can only predict the local robustness of some regions before doing the actual testing in those regions. We define:

Definition 2 (Local Robustness Indicator).

Auxiliary information that is strongly correlated with the local robustness R_l(η, y) (and thus can be leveraged in its prediction) is named a local robustness indicator.

We later seek such indicators (and empirically show their correlation with the local robustness), which form one of the two key factors considered in selecting test seeds in our method.

Given a test seed, we search in a local region η around it for AEs, i.e., inputs with different predicted labels. This raises the question of what the size of η should be, for which we later utilise the following property:

Remark 1 (r-separation).

For real-world image datasets, any data-points with different ground truth labels are at least distance 2r apart in the input (pixel) space, with r being estimated case by case depending on the dataset.

The r-separation property was first observed by (yang_closer_2020): intuitively it says that there is a minimum distance between two real-world objects of different labels.

Finally, not all AEs are equal in terms of the “strength of being adversarial” (stronger AEs may lead to greater robustness improvement in, e.g., adversarial training (DBLP:conf/icml/WangM0YZG19)), for which we define:

Definition 3 (Prediction Loss).

Given a test seed x with label y, the prediction loss of an input x' to the test seed is defined as:

J(x') := max_{j ≠ y} f_j(x') − f_y(x')

where f_j(x') returns the probability of label j after input x' is processed by the DL model f.

Note, J(x') > 0 implies f(x') ≠ y and thus x' is an AE of x.

Next, to measure a DL model's overall robustness across the whole input domain, we introduce a notion of global robustness. Different from some existing definitions in which all local regions are treated equally (wang2021robot; wang2021statistically), ours is essentially a "weighted sum" of the local robustness of local regions, where each weight is the probability of the associated region under the input data distribution. Defining global robustness in such a "distribution-aware" manner aligns with our motivation: as revealed later by the empirically estimated global robustness, our HDA appears to be more effective in supporting the growth of the overall robustness after "fixing" those distribution-aware AEs.

Definition 4 (Global Robustness).

The global robustness of the DL model f is defined as:

R_g(f) := Σ_η p_g(η) · R_l(η, y)

where p_g(η) is the global probability of region η (i.e., a pooled probability of all inputs in the region η) and R_l(η, y) is the local robustness of region η to the label y.

The estimation of R_g(f), unfortunately, is very expensive, as it requires computing the local robustness of a large number of regions. Thus, from a practical standpoint, we adopt an empirical definition of global robustness in our later experiments, which has been commonly used for DL robustness evaluation in adversarial training (wang_robot_2021; DBLP:conf/iclr/MadryMSTV18; DBLP:conf/icml/WangM0YZG19; zhang2019theoretically).

Definition 5 (Empirical Global Robustness).

Given a DL model f and a validation dataset D, we define the empirical global robustness as a function R̂_g(f, D, T), where T denotes a given type of AE detection method and R̂_g is the weighted accuracy of f on the AEs obtained by conducting T on D.

To be "distribution-aware", the synthesis of D should conform to the global distribution, while locally AEs are searched for by T according to the local distribution. Consequently, the set of AEs may represent the input distribution.
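To make the empirical global robustness concrete, one plausible reading (our sketch, not the authors' exact formula) computes a weighted accuracy over the AEs returned by T, with weights given by the seeds' global probabilities:

```python
import numpy as np

def empirical_global_robustness(true_labels, adv_preds, global_probs):
    # Weighted accuracy of the model on the AEs returned by method T:
    # each seed's weight is its (re-normalised) global probability.
    y = np.asarray(true_labels)
    yhat = np.asarray(adv_preds)
    w = np.asarray(global_probs, dtype=float)
    w = w / w.sum()
    return float(np.sum(w * (y == yhat)))

# Three seeds; the attack flips the prediction only for the third seed.
r = empirical_global_robustness([0, 1, 2], [0, 1, 0], [0.5, 0.3, 0.2])
```

Under this reading, surviving attacks on high-probability seeds matters more than on rare ones, which is exactly the distribution-aware intent.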

2.2. Distribution-Aware Testing for DL

There is an increasing amount of DL testing work developed towards being distribution-aware (as summarised in the amber route of Fig. 1). Deep generative models, such as VAEs and GANs, are applied to approximate the training data distribution, since the inputs (like images) to DNNs are usually in a high-dimensional space. Previous works heavily rely on OOD detection (dola_distribution_aware_2021; DBLP:conf/icse/Berend21) or synthesise new test cases directly from latent spaces (DBLP:conf/icse/ByunR20a; byun_manifold_based_2020; toledodistribution; DBLP:conf/sigsoft/RiccioT20). The former does not really consider the whole distribution, but rather flags outliers; thus a more pertinent name for it would be out-of-distribution-aware (OODA) testing. For both types of methods, another problem arises: the distribution encoded by generative models only contains the feature-wise information and filters out the pixel-wise perturbations (zhong2020generative). Consequently, directly searching and generating test cases from the latent space of generative models may only perturb features; we therefore call such methods Feature-Only Distribution-Aware (FODA) in this paper (the resulting AEs are also named semantic AEs in some literature (DBLP:conf/cvpr/HosseiniP18; DBLP:conf/iclr/ZhaoDS18)). Our approach, the green route in Fig. 1, differs from the aforementioned works by considering both the global (feature level) distribution in latent spaces and the local (pixel level) perceptual quality distribution in the input space.

2.3. Perception Quality of Images

Locally, data-points (around the selected seed) sharing the same feature information may differ in naturalness. To capture such a distribution, perceptual quality metrics can be utilised to compare the perceptual difference between the original images and the perturbed images. Some common metrics for perceptual quality include:

  • Mean Squared Error (MSE) between the original image and the perturbed image.

  • Peak Signal-to-Noise Ratio (PSNR) (rafael1992gonzalez), defined as PSNR = 20·log₁₀(MAX_I) − 10·log₁₀(MSE), where MAX_I is the maximum possible pixel value of the image.

  • Structural Similarity Index Measure (SSIM) (DBLP:journals/tip/WangBSS04) that considers image degradation as perceived change in structural information.

  • FID (DBLP:conf/nips/HeuselRUNH17) that compares the distribution between a set of original images and a set of perturbed images by squared Wasserstein metric.

Notably, all these metrics are current standards for assessing the quality of images, as widely used in the experiments of aforementioned related works.
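For concreteness, MSE and PSNR can be computed with a few lines of NumPy (SSIM and FID need more involved implementations and are omitted here):

```python
import numpy as np

def mse(a, b):
    return float(np.mean((np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) ** 2))

def psnr(a, b, max_val=255.0):
    # Higher PSNR (in dB) means the perturbed image is perceptually closer
    # to the original; identical images give infinity.
    m = mse(a, b)
    if m == 0:
        return float("inf")
    return 20 * np.log10(max_val) - 10 * np.log10(m)

orig = np.zeros((8, 8))
pert = np.full((8, 8), 255.0)
worst = psnr(orig, pert)    # ≈ 0 dB: maximal distortion for 8-bit images
```

Flipping every pixel from 0 to 255 yields the worst achievable PSNR, while small adversarial perturbations typically stay in the 30 to 50 dB range.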

3. The Proposed Method

Figure 2. An example of Hierarchical Distribution Aware Testing

We first present an overview of our HDA testing, cf. the green route in Fig. 1, and then dive into details of how we implement each stage by referring to an illustrative example in Fig. 2.

3.1. Overview of HDA Testing

The core of HDA testing is the hierarchical structure of two distributions. We formally define the following two levels of distributions:

Definition 6 (Global Distribution).

The global distribution captures how feature level information is distributed in some (low-dimensional) latent space after data compression.

Definition 7 (Local Distribution).

Given a data-point sampled from the latent space, we consider its norm ball in the input pixel space. The local distribution is a conditional distribution capturing the perceptual quality of all data-points within the norm ball.

The latent space is a representation of compressed data, in which data points with similar features are closer to each other. DNNs, e.g. the encoder of a VAE, map data points from the high-dimensional input space to the low-dimensional latent space. This implies that the input space can be divided into distinct regions, each corresponding to a data point in the latent space. By fitting a global distribution in the latent space, we actually model the distribution of these distinct regions over the input space. The local distribution is then defined as a conditional distribution within each region, whose members share the same features. Thus, we propose the following remark.

Remark 2 (Decompose one distribution into two levels).

Given the definitions of global and local distributions, denoted as p_g and p_l respectively, we may decompose a single distribution p(x) over the entire input domain as:

p(x) = p_g(z) · p_l(x | x ∈ η_z)

where variable z represents a set of features while η_z represents the region in the input space that "maps" to the point z in the latent space.
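The decomposition in Remark 2 suggests a two-stage sampling procedure: draw features z from the global distribution, then draw a pixel-level sample within the corresponding region. A toy sketch with a hypothetical decode function standing in for a region representative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy global distribution: a mixture of two Gaussians in a 1-D latent space.
mus, sigmas = np.array([-2.0, 2.0]), np.array([0.5, 0.5])

def sample_hierarchically(n, decode, local_radius=0.05):
    # Global step: draw features z from the (mixture) global distribution.
    comp = rng.integers(0, len(mus), size=n)
    z = rng.normal(mus[comp], sigmas[comp])
    # Map each z to a representative input of its region.
    x = np.stack([decode(zi) for zi in z])
    # Local step: perturb at pixel level within the region.
    return x + rng.uniform(-local_radius, local_radius, x.shape)

decode = lambda z: np.full(4, np.tanh(z))   # hypothetical stand-in decoder
xs = sample_hierarchically(10, decode)
```

Samples cluster around the decoded representatives (the feature-level structure) while differing by small pixel-level offsets (the local distribution).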

Intuitively, compared to modelling a single distribution, our hierarchical structure of distributions is superior in that the global distribution guides which regions of the input space to test, while the local distribution can be leveraged to precisely control the perceptual quality of test cases. The green route in Fig. 1 shows our HDA testing process, which comprises three stages:

Stage 1: Explicitly Approximate the Global Distribution

We first extract feature-level information from the given dataset by using data compression techniques (the encoder of VAEs in our case), and then explicitly approximate the global distribution in the latent feature space using KDE.

Stage 2: Select Test Seeds Based on the Global Distribution and Local Robustness Indicators

Given the limited testing budget, we want to test in those local input regions that are both more error-prone and representative of the input distribution. Thus, when selecting test seeds, we consider two factors: the local robustness indicators (cf. Definition 2) and the global distribution. For the former, we propose several types of auxiliary information, with empirical studies showing their correlation with the local robustness, while the latter has already been quantified in the first stage via KDE.

Stage 3: Generate Test Cases Around Test Seeds Considering the Local Distribution and Prediction Loss of AEs

When searching for AEs locally around a test seed given by the second stage, we develop a two-step GA in which the objective function is defined as a fusion of the prediction loss (cf. Definition 3) and the local distribution (modelled by common perceptual quality metrics). Such a fusion of two fitness functions allows a trade-off between the "strength of being adversarial" and the perceptual quality of the detected AEs. The optimisation is subject to the constraint of only exploring a norm ball whose central point is the test seed and whose radius is smaller than the r-separation distance (cf. Remark 1).

While our chosen technical solutions are effective and popular, alternatives may also suffice for the purpose of each stage.

3.2. Approximation of the Global Distribution

Given the training dataset D, the task of approximating the input distribution is equivalent to estimating a PDF over the input domain from D. Although this is a common problem with many established solutions, it is hard to approximate the distribution accurately, due to the relative sparsity of D compared to the high dimensionality of the input domain. So the practical solution is to first do dimensionality reduction and then estimate the global distribution, which is indeed the first step of all existing methods of distribution-aware DL testing.

Specifically, we choose VAE-Encoder+KDE for their simplicity and popularity. (We only use the encoder of VAEs for feature extraction, rather than generating new data from the decoder, which is different to the other methods mentioned in Section 2.2.) Assume D contains n samples, and each sample x_i is encoded by the VAE-Encoder as a Gaussian distribution N(μ_i, σ_i²) in the latent space; we can then estimate the PDF of the latent variable z (denoted as p̂(z)) based on the encoded samples. p̂(z) conforms to a mixture of Gaussian distributions. Notably, this mixture of Gaussian distributions nicely aligns with the gist of adaptive KDE (DBLP:conf/snn/LokerseVB95), which uses the following estimator:

p̂(z) = (1/n) Σ_{i=1}^{n} K_{h_i}(z − μ_i)

That is, when choosing a Gaussian kernel for K in Eqn. (5) and adaptively setting the bandwidth parameter h_i = σ_i (i.e., the standard deviation of the Gaussian distribution representing the compressed sample x_i), the VAE-Encoder and KDE are combined "seamlessly". Finally, our global distribution p_g (a pooled probability of all inputs in the region that corresponds to a point in the latent space) is proportional to the approximated distribution of z with the PDF p̂(z).

Running Example: The left diagram in Fig. 2 depicts the global distribution learnt by KDE, projected to a two-dimensional space for visualisation. The peaks, in and around which most training data lie, are evaluated by KDE with the highest probability density over the latent space.

3.3. Test Seeds Selection

Selecting test seeds is actually about choosing which norm balls (around the test seeds) to test for AEs. To be cost-effective, we want to test those with higher global probabilities and lower local robustness at the same time. For the latter requirement, there is potentially a paradox:

Remark 3 (A Paradox of Selecting Unrobust Norm Balls).

To be informative on which norm balls to test for AEs, we need to estimate the local robustness of candidate norm balls (by invoking robustness estimators to quantify R_l(η, y), e.g., (webb_statistical_2019)). However, local robustness evaluation itself is usually about sampling for AEs (then fed into statistical estimators), which consumes the testing resources.

To this end, instead of directly evaluating the local robustness of a norm ball, we can only indirectly predict it (i.e., without testing/searching for AEs) via auxiliary information that we call local robustness indicators (cf. Definition 2). In doing so, we save all the testing budget for the later stage when generating local test cases.

Given a test seed x with label y, we propose two robustness indicators (both related to the vulnerability of the test seed to adversarial attacks): the prediction-gradient based score (denoted as score_grad) and the score based on the separation distance of the output-layer activation (denoted as score_sep):

score_grad(x) := ||∇_x J(x)||,   score_sep(x) := min_{x' ∈ D, f(x') ≠ y} ||f(x) − f(x')||

These allow predicting a whole norm ball's local robustness from the limited information of its central point (the test seed). The gradient of a DNN's prediction with respect to the input is a white-box metric that is widely used in adversarial attacks, such as the FGSM (DBLP:journals/corr/GoodfellowSS14) and PGD (DBLP:conf/iclr/MadryMSTV18) attacks. A greater gradient calculated at a test seed implies that AEs are more likely to be found around it. The activation separation distance is a black-box metric and refers to the minimum norm between the output activations of the test seed and any other data with different labels. Intuitively, a smaller separation distance implies a greater vulnerability of the seed to adversarial attacks. We later show empirically that these two indicators are indeed highly correlated with the local robustness.
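The black-box separation-distance indicator, for instance, can be sketched as follows (variable names are ours, not the paper's):

```python
import numpy as np

def separation_score(seed_act, acts, labels, seed_label):
    # Minimum L2 distance between the seed's output-layer activation and
    # the activations of data-points carrying a different label; smaller
    # values suggest the seed's norm ball is easier to attack.
    acts, labels = np.asarray(acts, dtype=float), np.asarray(labels)
    other = acts[labels != seed_label]
    return float(np.min(np.linalg.norm(other - np.asarray(seed_act), axis=1)))

acts = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]]
labels = [0, 1, 1]
score = separation_score([0.9, 0.1], acts, labels, seed_label=0)
```

Only the model's output activations are needed, so the indicator stays black-box; the gradient-based counterpart would instead require access to the model's internals.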

After quantifying the two required factors, we combine them in a way that was inspired by (zhao_assessing_2021). In (zhao_assessing_2021), the DL reliability metric is formalised as a weighted sum of local robustness where the weights are operational probabilities of local regions. To align with that reliability metric, we do the following steps to select test seeds:

(i) For each data-point in the test set, we calculate its global probability (i.e., p̂(z), where z is its compressed point in the VAE latent space) and one of the local robustness indicators (either white-box or black-box, depending on the available information).

(ii) Normalise both quantities to the same scale.

(iii) Rank all data-points by the product of their global probability and local robustness indicator.

(iv) Finally, we select the top-k data-points as our test seeds, where k depends on the testing budget.
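Steps (i) to (iv) can be sketched as follows; note the indicator scores here are assumed to be oriented so that a higher score means a less robust (more attackable) seed:

```python
import numpy as np

def select_seeds(global_probs, indicator_scores, k):
    # Rank by the product of normalised global probability and normalised
    # predicted unrobustness, then keep the top-k candidates as test seeds.
    p = np.asarray(global_probs, dtype=float)
    s = np.asarray(indicator_scores, dtype=float)
    norm = lambda v: (v - v.min()) / (v.max() - v.min() + 1e-12)
    rank = norm(p) * norm(s)
    return np.argsort(rank)[::-1][:k]

# Candidate 2 is both likely under the global distribution and unrobust.
idx = select_seeds([0.5, 0.1, 0.4], [0.2, 0.9, 0.8], k=2)
```

The product scoring deliberately penalises seeds that score highly on only one factor, mirroring the "weighted sum of local robustness" reliability metric the selection is aligned with.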

Running Example: In the middle diagram of Fig. 2, we add the local robustness indicator results of the training data, represented by a scale of colours: darker means lower predicted local robustness while lighter means higher predicted local robustness. By our method, the test seeds selected are both from the highest peak (the high probability density area of the global distribution) and among the relatively darker points (lower predicted local robustness).

3.4. Local Test Cases Generation

Not all AEs are equal in terms of the "strength of being adversarial", and stronger AEs are associated with higher prediction loss (cf. Definition 3). Detecting AEs with higher prediction loss may bring more benefit in the future "debugging" step, e.g., by adversarial retraining (DBLP:conf/icml/WangM0YZG19). Thus, at this stage, we want to search for AEs that are both "strongly adversarial" and "less likely to be noticed by humans". That is, the local test case generation can be formulated as the following optimisation, given a seed s:


where is the prediction loss, is the local distribution (note, represents the latent features of test seed ), is the -separation distance, and is a coefficient to balance the two terms. As what follows, we note two points on Eqn. (7): why we need the constraint and how we quantify the local distribution.

The constraint in Eqn. (7) determines the right locality of local robustness, i.e., the "neighbours" that should have the same ground truth label as the test seed. We notice that the r-separation property of real-world image datasets (cf. Remark 1) provides a sound basis for answering this question. Thus, it is formalised as a constraint that the optimiser can only search in a norm ball with a radius smaller than r, to guarantee that the detected AEs are indeed "adversarial" to the label y.

While the feature-level information is captured by the global distribution over a latent space, we only consider how the pixel-level information is distributed in terms of perceptual quality to humans. Three common quantitative metrics, MSE, PSNR and SSIM, as introduced in Section 2.3, are investigated. We note that these three metrics are by no means the true local distribution representing perceptual quality, but rather quantifiable indicators of it from different aspects. Thus, in the optimisation problem of Eqn. (7), replacing the local distribution term with them suffices for our purpose. So, we redefine the optimisation problem as:

max_{x': ||x' − s|| ≤ γ < r}  α·J(x') + (1 − α)·PQ(x')

where PQ represents those perceptual quality metrics correlated with the local distribution of the seed s. Certainly, implementing PQ requires some preprocessing, e.g., normalisation and negation, depending on which metric is adopted.

Considering that the second term of the objective function in Eqn. (8) may not be differentiable and/or the DL model’s parameters are not always accessible, we propose a black-box approach to solve the optimisation problem that generates local test cases. It is based on a GA with two fitness functions to effectively and efficiently detect AEs, as shown in Algorithm 1.

1: Input: test seed s with label y, neural network function f, local perceptual quality metric PQ, population size N, maximum iterations maxitr, norm ball radius γ, weight parameter α.
2: Output: a set of test cases T_s
3: define the fitness functions F1(x) := J(x) and F2(x) := α·J(x) + (1 − α)·PQ(x)
4: for i = 1, …, N do
5:     add to the population P the seed s perturbed by uniform noise in [−γ, γ]
6: end for
7: while itr < maxitr or P does not converge do
8:     if most individuals in P are AEs, i.e., J(x) > 0, then
9:         evaluate the individuals in P with F2
10:     else
11:         evaluate the individuals in P with F1
12:     end if
13:     select parents, conduct crossover and mutation to form the new population P
14: end while
15: return the best fitted individuals in P as the test set T_s
Algorithm 1 Two-Step GA Based Local Test Cases Generation

Algorithm 1 presents the process of generating a set of test cases from a given seed s with label y. First, we define the two fitness functions (the reason behind this will be discussed next). We then initialise the population by adding uniform noise in the range [−γ, γ] to the test seed. Next, the population is iteratively updated by evaluating the fitness functions, selecting the parents, and conducting crossover and mutation. Finally, the best fitted individuals in the population are chosen as test cases. The crossover and mutation are regular operations in GA based test case generation for DL models (alzantot2019genattack), while the two fitness functions work alternately to guide the selection of parents.

The reason why we propose two fitness functions is because, we notice that there is a trade-off between the two objectives and in the optimisation. Prediction loss is related to the adversarial strength, while indicates the local distribution. Intuitively, generating the test cases with high local probability tends to add small amount of perturbations to the seed, while a greater perturbation is more likely to induce high prediction loss. To avoid the competition between the two terms that may finally leads to a failure of detecting AEs, we define two fitness functions to precisely control the preference at different stages:


At the early stage, the prediction loss alone is optimised to quickly direct the generation of AEs, with a positive loss meaning an AE has been detected. When most individuals in the population are AEs, the optimisation moves to the second stage, in which the fitness function is replaced to optimise the local distribution indicator as well as the prediction loss. It is possible (especially when a large weight is used, i.e., with a preference for detecting AEs with high local probability over high adversarial strength, cf. the aforementioned trade-off) that the prediction loss of most individuals again becomes negative, in which case the optimisation goes back to the first stage. With such a mechanism of alternately using the two fitness functions in the optimisation, the proportion of AEs in the population is effectively prevented from decreasing.
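The alternating mechanism can be sketched as follows; the surrogate `prediction_loss` and `quality` functions below are hypothetical stand-ins (a real implementation would query the DL model under test and a perceptual quality metric such as MSE/PSNR/SSIM):

```python
import numpy as np

def prediction_loss(pop, seed):
    # Toy surrogate: loss becomes positive (an AE) once the perturbation
    # is large enough. A real implementation would query the DL model.
    return np.sum((pop - seed) ** 2, axis=-1) - 0.05

def quality(pop, seed):
    # Negative MSE as a stand-in local-distribution indicator:
    # higher means the test case looks more natural (closer to the seed).
    return -np.mean((pop - seed) ** 2, axis=-1)

def two_step_fitness(pop, seed, lam=1.0, switch_frac=0.9):
    """Stage 1 optimises the prediction loss alone; once at least
    `switch_frac` of the population are AEs (loss > 0), stage 2 adds
    the weighted local-quality term."""
    loss = prediction_loss(pop, seed)
    if np.mean(loss > 0) < switch_frac:
        return loss                          # stage 1: chase adversarial strength
    return loss + lam * quality(pop, seed)   # stage 2: also pursue naturalness
```

Because `two_step_fitness` is re-evaluated every generation, the optimisation automatically falls back to stage 1 whenever the AE proportion drops below the threshold, which is exactly the fallback behaviour described above.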

Algorithm 1 describes the process for generating local test cases given a single test seed. Suppose a number of test seeds were selected earlier and a total budget of local test cases is affordable; we can then allocate to each test seed a number of local test cases according to its (re-normalised) global probability, which further emphasises the role of the distribution in our detected set of AEs.
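Such a proportional allocation can be sketched as follows (the integer rounding scheme is our own assumption; the paper does not prescribe one):

```python
import numpy as np

def allocate_local_budget(global_probs, total_budget):
    """Split a total budget of local test cases across the selected seeds
    in proportion to their re-normalised global probabilities."""
    p = np.asarray(global_probs, dtype=float)
    p = p / p.sum()                                   # re-normalise
    alloc = np.floor(p * total_budget).astype(int)    # integer base share
    # hand leftover cases to the highest-probability seeds (stable order)
    leftover = total_budget - alloc.sum()
    for i in np.argsort(-p, kind="stable")[:leftover]:
        alloc[i] += 1
    return alloc
```

For example, three seeds with (re-normalised) probabilities 0.5, 0.25 and 0.25 and a budget of 8 receive 4, 2 and 2 local test cases respectively.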

Running Example: The right diagram in Fig. 2 plots the local distribution using MSE as its indicator, and visualises the detected AEs by different testing methods. Unsurprisingly, all AEs detected by our proposed HDA testing are located at the high density regions (and very close to the central test seed), given it considers the perceptual quality metric as one of the optimisation objectives in the two-step GA based test case generation. In contrast, other methods (PGD and coverage-guided) are less effective.

4. Evaluation

We evaluate the proposed HDA method by performing extensive experiments to address the following research questions (RQs):

RQ1 (Effectiveness): How effective are the methods adopted in the three main stages of HDA?

Namely, we conduct experiments to i) examine the accuracy of combining VAE-Encoder+KDE to approximate the global distribution; ii) check the correlation significance of the two proposed local robustness indicators with the local robustness; iii) investigate the effectiveness of our two-step GA for local test cases generation.

RQ2 (AE Quality): How is the quality of AEs detected by HDA?

We compare HDA against conventional attack-based and coverage-guided methods, as well as the more recent distribution-aware testing methods OODA and FODA, introducing a comprehensive set of metrics to evaluate the quality of the AEs each method detects.

RQ3 (Sensitivity): How sensitive is HDA to the DL models under testing?

We carry out experiments to assess the capability of HDA when applied to DL models (adversarially trained) with different levels of robustness.

RQ4 (Robustness Growth): How useful is HDA to support robustness growth of the DL model under testing?

We examine the global robustness of DL models after “fixing” the AEs detected by various testing methods.

4.1. Experiment Setup

In RQ1 and RQ2, we consider five popular benchmark datasets and five diverse model architectures for evaluation. Details of the datasets and trained DL models under testing are listed in Table 1. The norm ball radius is calculated based on the -separation distance (cf. Remark 1) of each dataset. In RQ3, we add DL models enhanced by PGD-based adversarial training for the sensitivity analysis. Table 1 also records the accuracy of these adversarially trained models. As expected, adversarial training trades generalisation accuracy for robustness, hence the noticeable decrease in training and testing accuracy (zhang2019theoretically).

In RQ4, we first sample 10000 data points from the global distribution as a validation set and detect AEs around them with the different methods. Then, we fine-tune the normally trained models on the training dataset augmented with these AEs. 10 epochs are run with a ‘slow start, fast decay’ learning rate schedule (jeddi_simple_2021) to reduce the computational cost while mitigating the accuracy drop and improving robustness. To empirically estimate the global robustness on the validation set, we find another set of AEs according to the local distribution, different from the fine-tuning data. These validating AEs are misclassified by the normally trained models; thus, the empirical global robustness of the normally trained models is set to 0 as the baseline.
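The ‘slow start, fast decay’ schedule can be sketched as a small helper; the linear warm-up followed by halving below is a hypothetical shape, and the exact schedule of (jeddi_simple_2021) may differ:

```python
def slow_start_fast_decay(epoch, base_lr=0.01, warmup=2):
    """'Slow start': linear warm-up over the first `warmup` epochs;
    'fast decay': halve the learning rate every epoch afterwards."""
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup
    return base_lr * 0.5 ** (epoch - warmup + 1)
```

With the defaults, the rate climbs to `base_lr` by the end of the warm-up and then decays quickly, limiting how far the fine-tuning can drift from the normally trained weights.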

For readers’ convenience, all the metrics used in RQ2, RQ3 and RQ4 for comparisons are listed in Table 2. The metrics are introduced to comprehensively evaluate the quality of detected AEs and the DL models from different aspects.

Dataset Image Size DL Model Normal Training Adversarial Training
Train Acc. Test Acc. Train Acc. Test Acc.
MNIST 0.1 LeNet5
Fashion-MNIST 0.08 AlexNet
SVHN 0.03 VGG11
CIFAR-10 0.03 ResNet20
CelebA 0.05 MobileNetV1
Table 1. Details of the datasets and DL models under testing.
Metrics Meanings
AE Prop. Proportion of AEs in the set of test cases generated from selected test seeds
Pred. Loss Adversarial strength of AEs as formally defined by Definition 3
Normalised global probability density of test-seeds/AEs
Local robustness to the correct classification label, as formally defined by Definition 1
Empirical global robustness of DL models over input domain as defined in Definition 5
FID Distribution difference between original images (test seeds) and perturbed images (AEs)
Average perturbation distance between test seeds and AEs
% of Valid AEs Percentage of “in-distribution” AEs in all detected AEs
Table 2. Evaluation metrics for the quality of detected AEs and DL models

All experiments were run on a machine of Ubuntu 18.04.5 LTS x86_64 with Nvidia A100 GPU and 40G RAM. The source code, DL models, datasets and all experiment results are publicly available at https://github.com/havelhuang/HDA-Testing.

4.2. Evaluation Results and Discussions

4.2.1. RQ1

There are 3 sets of experiments in RQ1 to examine the accuracy of technical solutions in our tool-chain, corresponding to the 3 main stages respectively.

First, to approximate the global distribution, we essentially proceed in two steps—dimensionality reduction and PDF fitting, for which we adopt the VAE-Encoder+KDE solution. Notably, the VAE trained in this step is for data-compression only (not for generating new data). To reflect the effectiveness of both aforementioned steps, we (i) compare VAE-Encoder with the Principal Component Analysis (PCA), and (ii) then measure the FID between the training dataset and a set of random samples drawn from the fitted global distribution by KDE.

PCA is a common approach for dimensionality reduction. We compare the performance of VAE-Encoder and PCA from two perspectives: the clustering quality and the reconstruction accuracy of the latent representation. To learn the global distribution from latent data, we require that latent representations group data points together based on semantic features and can be decoded to reconstruct the original images with little information loss. Therefore, we apply K-means clustering to the latent data and calculate the Completeness Score (CS), Homogeneity Score (HS) and V-measure Score (VS) (DBLP:conf/emnlp/RosenbergH07) to measure the clustering ability, while the reconstruction loss is calculated based on the MSE. As shown in Table 3, VAE-Encoder achieves higher CS, HS and VS scores and lower reconstruction loss than PCA. In other words, the latent representations encoded by VAE-Encoder capture features better than those of PCA.
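The three clustering scores can be computed directly from entropies, following the definitions of (DBLP:conf/emnlp/RosenbergH07); a self-contained NumPy sketch with toy labels:

```python
import numpy as np

def _entropy(labels):
    # Shannon entropy of a discrete label array (natural log)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def _cond_entropy(a, b):
    # H(a | b): entropy of a within each group of b, weighted by group size
    h = 0.0
    for v in np.unique(b):
        mask = b == v
        h += mask.mean() * _entropy(a[mask])
    return h

def cluster_scores(true_labels, cluster_ids):
    """Homogeneity (HS), completeness (CS) and V-measure (VS),
    computed from entropies as in Rosenberg & Hirschberg (2007)."""
    t, c = np.asarray(true_labels), np.asarray(cluster_ids)
    hs = 1.0 - _cond_entropy(t, c) / _entropy(t)
    cs = 1.0 - _cond_entropy(c, t) / _entropy(c)
    vs = 2 * hs * cs / (hs + cs)
    return hs, cs, vs
```

A perfect clustering scores 1.0 on all three metrics; splitting one class across two clusters hurts completeness but not homogeneity.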

Dataset PCA VAE-Encoder
CS HS VS Recon. Loss CS HS VS Recon. Loss
MNIST 0.505 0.508 0.507 44.09 0.564 0.566 0.565 27.13
F.-MNIST 0.497 0.520 0.508 55.56 0.586 0.601 0.594 23.72
SVHN 0.007 0.007 0.007 65.75 0.013 0.012 0.013 66.21
CIFAR-10 0.084 0.085 0.085 188.22 0.105 0.105 0.105 168.44
CelebA 0.112 0.092 0.101 764.94 0.185 0.150 0.166 590.54
Table 3. Quality of Latent Representation in PCA & VAE-Encoder
Dataset Global Dist. Uni. Dist.
MNIST 0.395 13.745
Fashion-MNIST 0.936 90.235
SVHN 0.875 143.119
CIFAR-10 0.285 12.053
CelebA 0.231 8.907


Figure 3. Samples drawn from the approximated global distribution by KDE and a uniform distribution over the latent feature space (Figure); and FID to the ground truth based on 1000 samples (Table).

To evaluate the accuracy of using KDE to fit the global distribution, we calculate the FID between a new dataset (with 1000 samples) drawn from the fitted global distribution by KDE and the training dataset. The FID scores are shown in the table of Fig. 3. As a baseline, we also present the results of using a uniform distribution over the latent space. As expected, all FID scores based on the approximated distributions are significantly smaller (better). We further decode the newly generated dataset for visualisation in Fig. 3, from which we see that the images generated by KDE keep high fidelity, while the uniformly sampled images are not recognisable as natural images.
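The KDE fitting and sampling step can be sketched in plain NumPy (Gaussian kernel; the latent codes and fixed bandwidth below are illustrative, whereas a real pipeline would tune the bandwidth on the VAE-encoded training set):

```python
import numpy as np

rng = np.random.default_rng(0)

def kde_logpdf(x, latent, bandwidth=0.2):
    """Log density of a Gaussian KDE fitted on latent codes."""
    d = latent.shape[1]
    sq = ((x[:, None, :] - latent[None, :, :]) ** 2).sum(-1)
    log_kernel = -0.5 * sq / bandwidth**2 - 0.5 * d * np.log(2 * np.pi * bandwidth**2)
    return np.logaddexp.reduce(log_kernel, axis=1) - np.log(len(latent))

def kde_sample(latent, n, bandwidth=0.2):
    """Sample from the KDE: pick a latent code uniformly, add Gaussian noise."""
    idx = rng.integers(0, len(latent), size=n)
    return latent[idx] + bandwidth * rng.standard_normal((n, latent.shape[1]))
```

The drawn samples would then be passed through the VAE decoder for visual inspection, as in Fig. 3.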

Answer to RQ1 on HDA stage 1: The combination of VAE-Encoder+KDE may accurately approximate the global distribution.

Moving on to stage 2, we study the correlations between a norm ball’s local robustness and its two indicators proposed earlier: the prediction-gradient based score and the score based on the separation distance of output-layer activations (cf. Eq. 6).

We invoke the tool of (webb_statistical_2019) for estimating the local robustness defined in Definition 1. Based on 1000 randomly selected data points from the test set as the centres of 1000 norm balls, we calculate the local robustness of each norm ball (the radius is usually small by definition, cf. Remark 1, yielding very small values) as well as the two proposed indicators. Then, we produce scatter plots in log-log scale, as shown in Fig. 4. (Some dots collapse onto a vertical line due to a limitation of the estimator (webb_statistical_2019): it terminates at a specified threshold when the estimation falls below that value. The correlation calculated with such noise does not undermine our conclusion; rather, the real correlation would be even higher.) For all 5 datasets, the indicator based on the activation separation distance is negatively correlated (1st row), while the gradient indicator is positively correlated with the estimated local robustness (2nd row). We further quantify the correlation by calculating the Pearson coefficients, recorded in Table 4. Both indicators are highly correlated with the local robustness, while the gradient based indicator is stronger. This is unsurprising, because the activation separation distance is a black-box metric, which is usually weaker than white-box gradient information.
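The correlation analysis itself is straightforward to reproduce; below is a self-contained sketch with synthetic stand-in data (the real values come from the robustness estimator and the two indicators, so the coefficients here are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)

def pearson(a, b):
    """Pearson correlation coefficient of two 1-D samples."""
    a, b = a - a.mean(), b - b.mean()
    return float((a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum()))

# Synthetic stand-ins: log local robustness plus two noisy indicators,
# one positively and one negatively correlated (mimicking Fig. 4).
log_robustness = rng.uniform(-8, -2, 1000)
grad_indicator = 0.9 * log_robustness + 0.3 * rng.standard_normal(1000)
sep_indicator = -0.7 * log_robustness + 0.5 * rng.standard_normal(1000)
```

On real data, the same `pearson` call applied to the estimated local robustness and each indicator yields the coefficients reported in Table 4.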

Figure 4. Scatter plots of the local robustness evaluation vs. its two indicators, based on 1000 random norm balls.
Dataset Gradient-based Activation-based
MNIST 0.672 0.379
Fashion-MNIST 0.872 0.716
SVHN 0.848 0.612
CIFAR-10 0.832 0.646
CelebA 0.699 0.468
Table 4. Pearson correlation coefficients (in absolute values) between the local robustness & its two indicators.

Answer to RQ1 on HDA stage 2: The two proposed local robustness indicators are significantly correlated with the local robustness.

Figure 5. The prediction loss (red) and the three quantified local distribution indicators (blue) of the best fitted test case during the iterations of our two-step GA based local test case generation.

For the local test case generation in stage 3, by configuring the weight parameter in our two-step GA, we may trade off between the “strength of being adversarial” (measured by the prediction loss) and the perceptual quality (measured by a specific quality metric), so that the quality of detected AEs can be optimised.

Figure 6. Comparison between regular GA and two-step GA.

In Fig. 5, we visualise the changes of the two fitness values over the iterations of the GA. As shown in the first plot, only the prediction loss is taken as the objective function during the whole iteration process. The GA can effectively find AEs with maximised adversarial strength, which can be observed from the fact that the prediction loss of the best fitted test case in the population converges after hundreds of iterations. From the second to the last plot, the other fitness term representing the local distribution information is added to the objective function, namely MSE, PSNR and SSIM respectively. Intuitively, higher local probability density implies smaller MSE and greater PSNR and SSIM.

Thanks to the two-step setting of the fitness functions, the prediction loss of the best fitted test case goes above 0 quickly, in fewer than 200 iterations, meaning a first AE has been detected in the population. The fitness of the best fitted test case is always quite close to that of the rest of the population, so we may confidently claim that many AEs are efficiently detected not long after the first one. The optimisation then enters the second stage, in which the quantified local distribution indicator is pursued as well. The two objectives finally converge and reach a balance. If we configure the weight coefficient differently, the balance point changes correspondingly: a greater weight (as in the plots) detects more natural AEs (i.e., with higher local probability density), at the price of weaker adversarial strength (i.e., smaller but still positive prediction loss).

Figure 7. AEs detected by our two-step GA (last 3 columns) & other methods

We further investigate the advantages of our two-step GA over the regular GA (which uses the combined objective throughout). In Fig. 6, as the weight increases, the proportion of AEs in the population exhibits a sharp drop to 0 under the regular GA. In contrast, the two-step GA prevents such a decrease, preserving the AE proportion at a high level of 0.6 even when the weight is quite large. Moreover, a larger weight represents situations where the AEs are more natural: as shown by the blue curves (the blue dashed line stops earlier because there are no AEs in the population when the weight is big), the local distribution indicator (SSIM in this case) is only sufficiently high when the weight is big enough. Thus, compared to the regular GA, our novel two-step GA is more robust (in detecting AEs) to the choice of weight and more suitable in our framework for detecting AEs with high local probabilities.

Fig. 7 displays some selected AEs from the five datasets. As with PGD and coverage-guided testing, if we only use the prediction loss as the objective function in the GA, the perturbations added to the images are easily noticeable. In stark contrast, AEs generated by our two-step GA (with the 3 perceptual quality metrics in the last 3 columns) are of high quality (all 3 perceptual quality metrics perform well on grey-scale images, while SSIM performs best on colour images but tends to add noise to the background) and are indistinguishable to the human eye from the original images (first column).

Answer to RQ1 on HDA stage 3: Two-step GA based local test case generation can effectively detect AEs with high perception quality.

4.2.2. RQ2

We compare our HDA with state-of-the-art AE detection methods in two sets of experiments. In the first set, we focus on comparing with adversarial attack and coverage-guided testing (the typical PGD attack and neuron coverage metric, for brevity, though the conclusions generalise to other attacks and coverage metrics). In the second set of experiments, we show the advantages of our HDA over other distribution-aware testing methods.

In fact, neither PGD attack nor coverage-guided testing contributes to test seed selection; by default, they simply use randomly sampled data from the test set as test seeds. Thus, we only need to compare randomly selected test seeds with our test seeds, selected by the global distribution probability (see Section 3.2 for the calculation; the probability density is further normalised over the training dataset for better presentation) plus local robustness indicators, as shown in Table 5. Specifically, for each test seed, we calculate two metrics: the local robustness of its norm ball and its corresponding global probability. We invoke the estimator of (webb_statistical_2019) to calculate the former. To reduce sampling noise, we repeat the test seed selection 100 times and present the averaged results in Table 5.
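The seed-ranking idea can be sketched as follows (a minimal sketch: the min-max normalisation and the weight `alpha` are our own illustrative choices, not values prescribed by the paper):

```python
import numpy as np

def select_seeds(global_prob, vulnerability, k, alpha=0.5):
    """Rank candidate seeds by a weighted sum of normalised global
    probability and a local-vulnerability indicator (higher = less
    robust), and return the indices of the top-k seeds."""
    p = np.asarray(global_prob, dtype=float)
    v = np.asarray(vulnerability, dtype=float)
    norm = lambda a: (a - a.min()) / (a.max() - a.min() + 1e-12)
    score = alpha * norm(p) + (1 - alpha) * norm(v)
    return np.argsort(-score, kind="stable")[:k]
```

A seed that is both representative (high global probability) and likely fragile (high vulnerability indicator) is ranked first, which is the combination Table 5 evaluates.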

Dataset Random Test Seeds Our Test Seeds
MNIST -48.7 -45.6
Fashion-MNIST -21.9 -18.4
SVHN -22.1 -21.2
CIFAR-10 -23.3 -19.8
CelebA -36.3 -32.7
Table 5. Comparison between randomly selected test seeds and our indicator-based test seeds (averaging over 100 test seeds).

From Table 5, we observe: (i) test seeds selected by our method have much higher global probability density, meaning their norm balls are much more representative of the data distribution; (ii) the norm balls of our test seeds have worse local robustness, meaning it is more cost-effective to detect AEs in them. These are unsurprising, because we have explicitly considered the distribution and local robustness information in the test seed selection.

Finally, the overall evaluation of the generated test cases and the AEs they detect is shown in Table 6. The results are presented in two dimensions: 3 types of testing methods versus 2 ways of selecting test seeds, yielding 6 combinations (by default, PGD and coverage-guided methods use random seeds, while our method uses the indicator-based seeds). For each combination, we study 4 metrics (cf. Table 2 for their meanings): (i) the AE proportion; (ii) the average prediction loss; (iii) the FID of the test set, quantifying the image quality (to show how close the perturbed test cases are to the test seeds in the latent space, we use the last convolutional layer of InceptionV3, a well-trained CNN commonly used for FID to capture perturbation levels, e.g., in (DBLP:conf/nips/HeuselRUNH17), to extract latent representations of colour images; the VAE is used instead for the grey-scale datasets MNIST and Fashion-MNIST); and (iv) the computational time (plus an additional coverage rate for coverage-guided testing). We discuss the observations on these 4 metrics in the following paragraphs.
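For reference, FID is the Fréchet distance between Gaussians fitted to the two feature sets. A NumPy-only sketch (using the identity Tr((C1 C2)^(1/2)) = Tr((C1^(1/2) C2 C1^(1/2))^(1/2)) to keep the matrix square roots symmetric; the InceptionV3/VAE feature extraction is omitted):

```python
import numpy as np

def _sqrtm_psd(a):
    # Matrix square root of a symmetric positive semi-definite matrix
    w, v = np.linalg.eigh(a)
    return (v * np.sqrt(np.clip(w, 0, None))) @ v.T

def fid(feats1, feats2):
    """||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1 C2)^(1/2)), with the trace
    term computed via the symmetric form (C1^(1/2) C2 C1^(1/2))^(1/2)."""
    mu1, mu2 = feats1.mean(0), feats2.mean(0)
    c1 = np.cov(feats1, rowvar=False)
    c2 = np.cov(feats2, rowvar=False)
    s1 = _sqrtm_psd(c1)
    cross = _sqrtm_psd(s1 @ c2 @ s1)
    return float(((mu1 - mu2) ** 2).sum() + np.trace(c1 + c2 - 2 * cross))
```

Identical feature sets yield an FID of (numerically) zero; a pure mean shift of the features contributes exactly its squared norm.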

Seed Dataset PGD Attack
Coverage Guided Testing
Hierarchical Distribution-Aware Testing
AE Prop. Pred. Loss FID Time(s) Cov. Rate AE Prop. Pred. Loss FID Time(s) AE Prop. Pred. Loss FID Time(s)
MNIST 0.205 7.01 0.46 0.48 0.859 0.001 1.48 0.76 187.92 0.600 1.48 0.16 51.63
F.-MNIST 0.957 19.62 1.89 0.44 0.936 0.228 2.15 3.65 131.47 0.999 6.08 0.11 63.36
SVHN 0.866 2.81 95.81 11.11 0.976 0.004 0.09 98.49 343.47 0.922 2.37 95.22 156.46
CIFAR-10 1.000 39.74 87.51 11.22 0.988 0.196 3.32 93.32 542.73 1.000 37.79 75.59 221.29
CelebA 0.979 119.95 78.53 12.64 0.992 0.052 12.09 84.42 931.65 1.000 96.03 69.39 233.78
MNIST 0.986 11.95 0.21 0.47 0.873 0.076 1.37 0.82 187.73 1.000 3.59 0.01 53.05
F.-MNIST 1.000 26.24 0.69 0.44 0.950 0.322 2.41 1.33 132.39 1.000 9.73 0.03 62.61
SVHN 0.992 2.91 87.50 11.44 0.979 0.038 0.09 93.05 336.64 1.000 2.12 83.02 156.39
CIFAR-10 1.000 40.08 83.35 12.74 0.989 0.221 3.98 87.05 543.32 1.000 33.45 70.27 221.38
CelebA 0.998 120.51 74.83 11.54 0.988 0.067 8.72 80.49 939.78 1.000 93.48 67.77 233.97
Table 6. Evaluation of the generated test cases and detected AEs by PGD Attack, coverage-guided testing and the proposed HDA testing (all results are averaged over 100 seeds)

Regarding the AE proportion in the set of generated test cases, the default setting of our proposed approach is clearly the best (with a score of 1) among the 6 combinations. Both our novel test seed selection and the two-step GA local test case generation contribute to this win, as can be seen from the decreased AE proportion when using random seeds in our method, which is nevertheless still higher than most other combinations. PGD, as a white-box approach using gradient information, is also quite efficient in detecting AEs, especially when paired with our new test seed selection method. On the other hand, coverage-guided testing is comparatively less effective in detecting AEs (even with a high coverage rate), yet our test seed selection method can improve it.

As for the prediction loss results, PGD, as a gradient-descent based attacking method, unsurprisingly finds the AEs with the largest prediction loss. With better test seeds selected by our method considering local robustness indicators, the prediction loss of PGD can be even higher. Both coverage-guided and our HDA testing detect AEs with relatively lower prediction loss, i.e., AEs of “weaker adversarial strength”. The reason for the low prediction loss of AEs detected by our approach is that our two-step GA makes the trade-off, sacrificing adversarial strength for AEs with higher local probabilities (i.e., more natural ones). This can be seen from the small FID of our test set, while both PGD and coverage-guided testing yield relatively high FID scores.

Regarding computational overheads, we observe that PGD is the most efficient, given it is by nature a white-box approach using gradient-descent information. Our approach, in contrast, is an end-to-end black-box approach (when not using the gradient-based indicator for selecting test seeds) that requires less information and is more generic, at the price of being relatively less efficient. That said, the computational time of our approach is still acceptable and better than that of coverage-guided testing.

Answer to RQ2 on comparing with adversarial attack and coverage-guided testing: HDA shows advantages over adversarial attack and coverage-guided testing on test seeds selection and generation of high perception quality AEs.

Next, we examine the differences between our HDA testing and the other distribution-aware testing methods summarised earlier (the amber route of Fig. 1). We study not only the common evaluation metrics from the earlier RQs, but also the input validation method of (dola_distribution_aware_2021), which flags the validity of AEs according to a user-defined reconstruction probability threshold.

Dataset Tool Global Prob. % of Valid AEs Avg. Distance FID
MNIST OODA 0.0055 29 0.81 2.29
FODA 0.0030 100 0.73 0.47
HDA 0.0835 100 0.05 0.01
SVHN OODA 0.0046 21 0.82 128.84
FODA 0.0021 100 0.31 110.73
HDA 0.0804 100 0.03 83.02
Table 7. Evaluation of AEs detected by OODA, FODA and our HDA testing methods (based on 100 test seeds).

As shown in Table 7, HDA selects test seeds from much higher density regions of the global distribution and finds more valid AEs than OODA. The reason is that OODA aims at detecting outliers: only AEs with lower reconstruction probabilities (from the test seed) than the given threshold are marked as invalid test cases, while HDA explicitly explores high-density yet error-prone regions by combining the global distribution and local robustness indicators. In other words, HDA performs priority ordering (according to the global distribution and local robustness) and then selects the best, while OODA merely rules out the worst. As expected, FODA performs similarly poorly to OODA in terms of global probability, since both use randomly selected seeds. FODA, however, has a high proportion of valid AEs, since its test cases are directly sampled from the distribution in the latent space.

Regarding the perceptual quality of detected AEs, HDA always finds AEs with small pixel-level perturbations, in consideration of the -separation constraint, and with small FID, thanks to the use of perceptual quality metrics (SSIM in this case) as objective functions. OODA, in contrast, only utilises the reconstruction probability (from the VAE) to choose AEs, and FODA directly samples test cases from the VAE without any restrictions (and thus may suffer from the oracle problem, cf. Remark 4). Due to the compression nature of generative models, which are good at extracting feature-level information but ignore pixel-level information (zhong2020generative), AEs detected by OODA and FODA are all distant from the original test seeds, yielding large distance and FID scores. Notably, the average distances between test seeds and AEs detected by OODA and FODA are much (728 times) greater than the -separation constraints (cf. Table 1), leading to potential oracle issues for those AEs, for which we have the following remark:

Remark 4 (Oracle Issues of AEs Detected by OODA and FODA).

AEs detected by OODA and FODA are normally distant from the test seeds, with a perturbation distance even greater than the -separation constraint. Consequently, there is a risk that the perturbed image does not share the ground truth label of the test seed, making it hard to determine the ground truth label of the “AE” (in quotes, because the perturbed image could be a benign example with a correct predicted label, albeit different from the test seed’s).
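The constraint in the remark can be checked mechanically; a minimal sketch (assuming an L-infinity norm ball, with illustrative radius values):

```python
import numpy as np

def inherits_label(seed, perturbed, radius):
    """True iff the perturbed image stays inside the L-infinity norm ball
    of the seed, so the seed's ground-truth label can safely serve as the
    test oracle for the perturbed image."""
    delta = np.abs(np.asarray(perturbed) - np.asarray(seed))
    return float(delta.max()) <= radius
```

Perturbations exceeding the radius (as with OODA/FODA above) fail this check, so their “AEs” lack a trustworthy oracle.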

To visualise the differences between AEs detected by HDA, FODA and OODA, we present 4 examples in Fig. 8. We may observe that the AEs detected by HDA are almost indistinguishable from the original images. Moreover, the AEs by FODA are concrete evidence for Remark 4: it is actually quite hard to tell the ground truth label of some perturbed images (e.g., the bottom left one), while others appear to have a label different from the seed’s (e.g., the bottom right one should be labelled “1” instead of “7”).

Figure 8. Example AEs detected by different distribution-aware testing methods. AEs detected by our HDA are indistinguishable from the original images, while AEs detected by FODA and OODA are of low perceptual quality and subject to the oracle issues noted in Remark 4.

Answer to RQ2 on comparing with other distribution-aware testing: Compared to OODA and FODA, the proposed HDA testing can detect more valid AEs, free of oracle issues, with higher global probabilities and perception quality.

4.2.3. RQ3

In the earlier RQs, we varied the datasets and model architectures to check the effectiveness of HDA. In RQ3, we are concerned with HDA’s sensitivity to DL models with different levels of robustness. Adversarial training may greatly improve the robustness of DL models and is normally used as a defence against adversarial attacks. To this end, we apply HDA to both normally and adversarially trained models (trained by (DBLP:conf/iclr/MadryMSTV18), to be exact), and then compare with the three most representative attacking methods: the most classic, FGSM; the most popular, PGD; and the most advanced, AutoAttack (DBLP:conf/icml/Croce020a). Experimental results are presented in Table 8.

Model Dataset FGSM PGD AutoAttack HDA
AE Prop. FID AE Prop. FID AE Prop. FID AE Prop. FID
MNIST 0.0099 0.34 1.085 0.0099 0.53 0.639 0.0100 0.92 0.954 0.0835 1.00 0.011
F.-MNIST 0.0109 0.78 3.964 0.0109 1.00 2.611 0.0109 1.00 4.505 0.0635 1.00 0.013
SVHN 0.0041 0.77 114.65 0.0042 0.97 107.04 0.0042 0.99 108.41 0.0804 1.00 79.21
CIFAR10 0.0115 0.93 112.32 0.0114 1.00 101.92 0.0115 1.00 108.15 0.3442 1.00 67.13
CelebA 0.0090 0.81 99.226 0.0090 0.97 89.413 0.0091 1.00 91.591 0.1285 1.00 67.71
MNIST 0.0100 0.09 0.993 0.0100 0.08 0.728 0.0105 0.11 0.634 0.1944 0.62 0.049
F.-MNIST 0.0112 0.29 4.187 0.0112 0.34 3.492 0.0122 0.74 2.888 0.0632 0.81 0.297
SVHN 0.0043 0.45 127.89 0.0040 0.50 121.25 0.0054 0.63 120.97 0.0821 0.71 83.88
CIFAR10 0.0137 0.46 122.74 0.0136 0.50 118.55 0.0139 0.55 93.779 0.3263 0.64 55.47
CelebA 0.0096 0.37 108.56 0.0095 0.39 105.96 0.0097 0.43 106.72 0.2007 0.49 71.06
Table 8. Evaluation of AEs generated by FGSM, PGD, AutoAttack and HDA on normally and adversarially trained DL models (all results are averaged over 100 test seeds).

As expected, after the adversarial training by (DBLP:conf/iclr/MadryMSTV18), the robustness of all five DL models is greatly improved. This can be observed from the AE Prop. metric: for all four methods, the proportion of AEs detected in the set of test cases decreases sharply for adversarially trained models, while HDA still outperforms the others. Since the rationales behind the three adversarial attacks consider neither the input data distribution nor the perception quality, it is unsurprising that their two sets of results for normally and adversarially trained models are quite similar in terms of the probability and FID metrics. On the other hand, the FID scores of AEs detected by HDA get worse but remain better than the others’. The global probability measured on AEs detected by HDA also changes, due to variations of the robustness indicators before and after adversarial training, yet remains much higher than that of all the attacking methods.

Answer to RQ3: HDA is shown to be capable and superior to common adversarial attacks when applied on DL models with different levels of robustness.

4.2.4. RQ4

The ultimate goal of developing HDA testing is to improve the global robustness of DL models. To this end, we refer to a validation set of 10000 test seeds. We fine-tune (jeddi_simple_2021) the DL models with the AEs detected for the validation set by the different methods. Then, we calculate the train accuracy, test accuracy and empirical global robustness before and after adversarial fine-tuning. Empirical global robustness is measured on a new set of on-distribution AEs for the validation set, different from the fine-tuning data. Fine-tuning requires knowing the ground truth labels of the AEs, which cannot be guaranteed due to the potential oracle issues of OODA and FODA (cf. Remark 4). Thus, we omit the comparison with OODA and FODA; the other results are presented in Table 9.

Dataset No. of Test Cases PGD Attack HDA Testing Coverage Guided Testing
Train Acc. Test Acc. Glob. Rob. Train Acc. Test Acc. Glob. Rob. Train Acc. Test Acc. Glob. Rob.
MNIST 300 98.26% 97.64% 49.27% 99.98% 98.77% 90.94% 100.00% 98.93% 34.83%
3000 99.10% 98.05% 84.09% 99.95% 98.72% 99.28% 100.00% 98.99% 47.91%
30000 99.94% 98.65% 99.88% 100.00% 98.88% 100.00% 100.00% 98.77% 71.10%
F.-MNIST 300 97.41% 91.27% 68.04% 98.92% 91.35% 70.00% 97.62% 90.93% 47.05%
3000 89.48% 87.34% 88.78% 94.49% 89.96% 92.39% 88.06% 84.94% 63.30%
30000 86.67% 85.06% 95.00% 93.54% 89.70% 97.89% 88.60% 85.56% 84.71%
SVHN 300 95.01% 93.63% 48.84% 89.26% 87.78% 62.93% 97.25% 90.95% 16.79%
3000 88.81% 88.94% 75.83% 92.96% 92.66% 83.78% 87.21% 84.06% 37.59%
30000 78.72% 80.14% 80.14% 92.81% 91.91% 94.91% 87.32% 84.18% 66.58%
CIFAR-10 300 92.39% 85.60% 46.88% 93.38% 86.78% 48.96% 95.56% 86.22% 0.03%
3000 88.78% 84.10% 76.46% 92.07% 86.42% 92.92% 93.22% 84.40% 0.48%
30000 88.26% 82.99% 94.47% 91.62% 86.13% 97.58% 93.46% 84.30% 13.43%
Table 9. Evaluation of DL models’ train accuracy, test accuracy, and empirical global robustness (based on 10000 on-distribution AEs) after adversarial fine-tuning.

We first observe that adversarial fine-tuning is effective in improving the DL models’ empirical global robustness, measured by the prediction accuracy on AEs for normally trained models, while compromising the train/test accuracy as expected (in contrast to normal training in Table 1). In most cases, DL models enhanced by HDA testing suffer the least from the drop in generalisation. The reason is that HDA testing targets AEs from high-density regions of the distribution, usually with small prediction loss, as shown in Fig. 5. Thus, eliminating AEs detected by HDA testing requires only a relatively minor adjustment to the DL models, whose generalisation can otherwise easily be compromised during fine-tuning with new samples.

In terms of empirical global robustness, HDA testing detects AEs around test seeds from high-density regions of the global distribution, which are more significant for the global robustness improvement. It can be seen that with 3000 test cases generated from 1000 test seeds, HDA testing can improve the empirical global robustness to a level very close to that of fine-tuning with 30000 test cases from 10000 test seeds. This means the distribution-based test seed selection is more efficient than random test seed selection. Moreover, even when fine-tuning with 30000 test cases, leveraging all the test seeds in the validation set, HDA is still better than PGD attack and coverage-guided testing, due to its consideration of local distributions (approximated by naturalness). We notice that PGD-based adversarial fine-tuning minimises the maximum prediction loss within the local region, which is also effective in eliminating natural AEs, but sacrifices more train/test accuracy. DL models fine-tuned with HDA testing achieve the best balance between generalisation and global robustness.

Answer to RQ4: Compared with adversarial attacks and coverage-guided testing, HDA contributes more to the growth of global robustness, while mitigating the drop in train/test accuracy during adversarial fine-tuning.

5. Threats to Validity

5.1. Internal Validity

Threats may arise from bias in establishing cause-effect relationships, and from the simplifications and assumptions made in our experiments. In what follows, we list the main threats to each research question and discuss how we mitigate them.

5.1.1. Threats from HDA Techniques

In RQ1, the performance of both the VAE-Encoder and the KDE poses a threat. The former is mitigated by using four established quality metrics (in Table 3) for evaluating dimensionality reduction techniques and by comparing against the common PCA method. It is known that KDE performs poorly on high-dimensional data and works well when the data dimension is modest (scott1991feasibility; liu2007sparse). The data dimensions in our experiments are relatively low, since the datasets have been compressed by the VAE-Encoder, which mitigates the second threat. When studying the local robustness indicators, quantifying both the indicators and the local robustness may be subject to errors; we reduce these by carefully inspecting the correctness of the script that calculates the indicators and by invoking a reliable local robustness estimator (webb_statistical_2019) with fine-tuned hyper-parameters. For the two-step GA used to generate local test cases, a threat arises from the calculation of the norm ball radius, which is mitigated by the -separation distance presented in (yang_closer_2020). Also, the threat related to estimating the local distribution is mitigated by quantifying its three indicators (MSE, PSNR and SSIM), which are typically used to represent image quality as perceived by humans.
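The mitigation above relies on KDE behaving well in the compressed space. This can be illustrated with a minimal sketch, assuming (hypothetically) a 2-D latent space standing in for the VAE-Encoder's output; `scipy.stats.gaussian_kde` is one off-the-shelf estimator of the kind used for such feature-level density estimation:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Hypothetical stand-in for VAE-compressed features: KDE is fitted in the
# low-dimensional latent space, where it behaves well (here d = 2).
latent = rng.normal(size=(2, 500))   # shape (d, n), as gaussian_kde expects

kde = gaussian_kde(latent)

# Density estimates of this kind can be used to rank candidate test seeds:
# seeds in high-density regions of the approximated input distribution are preferred.
dense_point = np.array([[0.0], [0.0]])    # near the mode of the toy data
sparse_point = np.array([[8.0], [8.0]])   # far out in the tail
d_hi = float(kde(dense_point)[0])
d_lo = float(kde(sparse_point)[0])
print(d_hi, d_lo)
```

As expected, the estimated density near the data's mode is much higher than in the tail, which is the signal exploited when selecting test seeds from high-density regions.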

5.1.2. Threats from AEs’ Quality Measurement

A threat to RQ1, RQ2 and RQ3 (when examining how effectively our method models the global and local distributions, respectively) is the use of FID as the metric quantifying how “similar” two image datasets are. Given that FID is currently the standard metric for this purpose, this threat is sufficiently mitigated for now, and can be further mitigated as new metrics emerge. RQ2 includes the AE-validation method developed in (dola_distribution_aware_2021), which utilises generative models and OOD techniques to flag as valid those AEs whose reconstruction probabilities exceed a threshold. The determination of this threshold is critical and thus poses a threat to RQ2. To mitigate it, we use the same settings across all experiments for fair comparisons.
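A minimal sketch of the thresholded validity check described above follows. The percentile-based choice of threshold and all names here are our own illustrative assumptions, not necessarily the settings of (dola_distribution_aware_2021); the sketch only shows the mechanics of flagging AEs whose reconstruction log-probability exceeds a threshold.

```python
import numpy as np

def flag_valid_aes(recon_log_probs, train_log_probs, percentile=5.0):
    """Flag AEs as valid (on-distribution) when their reconstruction
    log-probability under a generative model exceeds a threshold, here set
    (illustratively) as a low percentile of the training data's scores."""
    threshold = float(np.percentile(np.asarray(train_log_probs, dtype=float),
                                    percentile))
    flags = np.asarray(recon_log_probs, dtype=float) > threshold
    return flags, threshold

# Toy scores: training data spans [-100, -50]; the 5th percentile is -97.5.
train_scores = np.linspace(-100.0, -50.0, 101)
flags, thr = flag_valid_aes([-120.0, -97.0, -60.0], train_scores)
print(flags, thr)
```

The experiments' mitigation, using the same threshold settings everywhere, corresponds to fixing `percentile` (or the threshold itself) across all comparisons.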

5.1.3. Threats from Adversarial Training and Fine-Tuning

In RQ3 and RQ4, the first threat arises from the fact that adversarial training and adversarial fine-tuning sacrifice the DL model’s generalisation for robustness. Since the training process is data-driven and of a black-box nature, it is hard to know how the prediction of a single data-point will be affected, while it is meaningless to study the robustness of an incorrectly predicted seed. To mitigate this threat when comparing robustness before and after adversarial training/fine-tuning, we select a sufficient number of seeds and check the prediction of each selected seed (filtering out incorrect ones if necessary) to ensure that test seeds are always predicted correctly. For the global robustness computation in RQ4, we refer to a validation dataset, where a threat may arise if the empirical result based on that dataset does not represent the global robustness. To mitigate it, we synthesise the validation set with enough data (10000 inputs sampled from the global distribution). We further attack the validation dataset to find one AE per seed according to the local distribution, so that the DL models’ prediction accuracy on this dataset empirically represents the global robustness as defined. For the training/fine-tuning to be effective, we need a sufficient number of AEs to augment the training dataset; a threat may arise if the proportion of AEs in the augmented training dataset is small (the DL model will be dominated by the original training data during training/fine-tuning). To mitigate this threat, we generate a large proportion of AEs in our experiments.
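The empirical global robustness computation described above reduces to a prediction-accuracy measurement over the synthesised validation set (one AE per seed). A minimal sketch with a hypothetical predictor (names and data are illustrative only):

```python
import numpy as np

def empirical_global_robustness(predict, val_aes, val_labels):
    """Prediction accuracy over a validation set containing one AE per seed,
    with seeds sampled from the (approximated) global input distribution."""
    preds = np.array([predict(x) for x in val_aes])
    return float(np.mean(preds == np.asarray(val_labels)))

# Toy check with a hypothetical predictor: class 1 iff the input's sum is positive.
toy_predict = lambda x: int(np.sum(x) > 0)
aes = [np.array([0.4, 0.8]), np.array([-0.2, -0.1]), np.array([0.3, -0.9])]
labels = [1, 0, 1]
rob = empirical_global_robustness(toy_predict, aes, labels)  # 2 of 3 correct
print(rob)
```

In the actual experiments the validation set is far larger (10000 seeds), so this accuracy is a stable empirical estimate of the defined global robustness.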

5.2. External Validity

Threats may challenge the generalisability of our findings, e.g. regarding the number of models and datasets considered in the experiments; we mitigate them as follows. All our experiments are conducted on 5 popular benchmark datasets, covering 5 typical types of DL models, cf. Table 1. Experimental results on the effectiveness of each stage in our framework are all based on averaging over a large number of samples, reducing random noise in the experiments. In the two-step GA based test case generation, a wide range of the parameter has been studied, showing converging trends. Finally, we enable replication by making all experimental results publicly available and reproducible on our project website, further mitigating this threat.

6. Conclusion & Future Work

In this paper, we propose HDA testing, an approach for detecting AEs that considers both the data distribution (thus with higher operational impact, assuming the training data statistically represents future inputs) and perceptual quality (thus the detected AEs look natural and realistic to humans). The key novelty lies in the hierarchical consideration of two levels of distributions. To the best of our knowledge, it is the first DL testing approach that explicitly and collectively models both (i) the feature-level information when selecting test seeds and (ii) the pixel-level information when generating local test cases. To this end, we have developed a tool chain that provides technical solutions for each stage of our HDA testing. Our experiments not only show the effectiveness of each testing stage, but also the overall advantages of HDA testing over the state-of-the-art. From a software engineering perspective, HDA is cost-effective (by focusing on practically meaningful AEs), flexible (with end-to-end, black-box technical solutions), and may effectively contribute to the robustness growth of the DL software under testing.

The purpose of detecting AEs is to fix them. Although existing DL retraining/repairing techniques (e.g. (jeddi_simple_2021) used in RQ4 and (DBLP:conf/icml/WangM0YZG19; 9508369)) may satisfy this purpose to some extent, bespoke “debugging” methods with more emphasis on the feature distribution and perceptual quality can be integrated into our framework more efficiently. To this end, an important piece of future work is to close the loop of “detect-fix-assess” as depicted in (zhao_detecting_2021) and then organise all generated evidence as safety cases (zhao_safety_2020). Finally, like other distribution-aware testing methods, we assume the input data distribution is the same as the training data distribution. To relax this assumption, we plan to take distribution shift into consideration in future versions of HDA.

This work is supported by the U.K. DSTL (through the project of Safety Argument for Learning-enabled Autonomous Underwater Vehicles) and the U.K. EPSRC (through End-to-End Conceptual Guarding of Neural Architectures [EP/T026995/1]). Xingyu Zhao and Alec Banks’ contribution to the work is partially supported through Fellowships at the Assuring Autonomy International Programme. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 956123. This document is an overview of U.K. MOD (part) sponsored research and is released for informational purposes only. The contents of this document should not be interpreted as representing the views of the U.K. MOD, nor should it be assumed that they reflect any current or future U.K. MOD policy. The information contained in this document cannot supersede any statutory or contractual requirements or liabilities and is offered without prejudice or commitment.