Deep Learning (DL) has become an essential component of complex software systems, including autonomous vehicles and medical diagnosis systems. As a consequence, the problem of ensuring the dependability of DL systems is critical.
Unlike traditional software, in which developers explicitly program the system’s behaviour, one peculiarity of DL systems is that they mimic the human ability to learn how to perform a task from training examples (Manning et al., 2008). Therefore, it is essential to understand to what extent they can be trusted in response to the diversity of inputs they will process once deployed in the real world, as they could face scenarios that might not be sufficiently represented in the data from which they have learned (Humbatova et al., 2020).
The Software Engineering research community is working hard at adequately testing the functionality of DL systems by proposing a steadily growing number of approaches. Since part of the program logic of these systems is determined by the training data, traditional code coverage metrics are not effective in determining whether their logic has been adequately exercised. Therefore, recent testing solutions aim at maximising ad hoc white-box adequacy metrics, such as neuron (Pei et al., 2019; Guo et al., 2018; Tian et al., 2018; Xie et al., 2019) or surprise coverage (Kim et al., 2019), or at exposing misbehaviours (Abdessalem et al., 2016; Zhang et al., 2018; Gambi et al., 2019). A limitation of these approaches is that their output cannot be directly used to explain the behaviour of the DL system under test, e.g. coverage reports do not provide enough information to understand what input features might have caused misbehaviours. Consequently, the usefulness of these approaches for developers is strongly limited in practice.
Few approaches (Abdessalem et al., 2018; Riccio and Tonella, 2020) use behavioural properties during test generation, but none of them considers the combination of interpretable features of the DL system under test as the target of test generation. This hinders them from exploring the feature space at large and providing a detailed explanation on how the system behaves for qualitatively different inputs.
In this paper, we introduce a novel way to assess the quality of DL systems by automatically generating a large, diverse set of high-performing (i.e., misbehaving or near-misbehaving), but qualitatively different test inputs that provide developers with a human-interpretable picture of the system’s quality. With our approach, developers can understand how different structural and behavioural features of the inputs combine to affect the system’s performance. To this aim we developed DeepHyperion, an open source automated test input generator for DL systems that leverages the key advantages of Illumination Search, i.e. a family of search algorithms that “illuminate” the input space by returning the highest-performing solution at each point of the search space defined by features of interest to the user (Mouret and Clune, 2015).
DeepHyperion is the first approach to apply Illumination Search to DL system testing and to provide developers with a feature map, where the automatically generated inputs are positioned based on their characteristics and where the misbehaviours they expose can be interpreted (see Figure 1 for an example).
A crucial element of our approach is the choice of the dimensions that define the feature space of interest. In particular, the features should represent meaningful properties of the test scenarios, i.e. discriminative and interpretable properties of the inputs, or behavioural properties manifested by the DL system when exercised by the test inputs. To this aim, we propose a novel systematic methodology that can be used in conjunction with DeepHyperion to define the feature dimensions in a domain of interest, making it possible to generate test cases that illuminate the associated map in such domain. This methodology supports the identification of the features that better characterise the generated inputs and the definition of metrics that quantify the selected features.
We evaluated the proposed technique on both a classification problem (handwritten digit recognition) and a regression problem (steering angle prediction in a self-driving car). Results show that, for both problems, DeepHyperion is effective in generating failure-inducing inputs that are structurally or behaviourally different from one another, as they cover different regions of the feature space. We compared DeepHyperion with state-of-the-art test input generators. Our results show that DeepHyperion can explore the feature space at large, whereas existing tools ignore parts of the feature space and expose only misbehaviours that belong to a narrow region of that space.
To foster research and replication, we release the code implementing DeepHyperion, the dataset, and all the scripts to replicate the experimental evaluation (Zohdinasab et al., 2021).
2. The DeepHyperion Technique
DeepHyperion aims to explore extensively the feature space of a DL system to find the most misbehaving solutions (i.e., those that deviate the most from the expected behaviour) with diverse characteristics. DeepHyperion implements the Illumination Search algorithm proposed by J.B. Mouret and J. Clune, named Multi-dimensional Archive of Phenotypic Elites (MAP-Elites) (Mouret and Clune, 2015). Given N dimensions of variation of interest, which define the feature space, DeepHyperion looks for the most misbehaving solution at each point in the space defined by those dimensions, trying to fill the entire feature map. The degree of misbehaviour exhibited by a solution is measured by a properly defined fitness function. For example, in a greyscale digit classification problem, two dimensions of interest could be the boldness and discontinuity of the image, whereas the fitness function could be the misclassification probability computed from the softmax layer output of the Deep Neural Network (DNN) (Goodfellow et al., 2016). In such a case, the output of DeepHyperion would be a map where each cell is associated with a specific level of boldness and discontinuity, while the entry of each cell would provide an input image with such boldness and discontinuity, which is either misclassified or close to being misclassified, i.e., a corner case input with the given features (see Figure 1).
Algorithm 1 outlines the top level steps of the Illumination Search implemented in DeepHyperion. The map to be generated is initially empty (line 2). Then a pool of candidate valid inputs (seeds) is generated (line 4) and evaluated (lines 5-7), which means each seed is associated with its feature values and its fitness function value. The resulting pool of seeds is used to initialise the population to be evolved by the algorithm (line 8). The position of each initial individual in the map is determined and the highest fitness individual is added to the map in the corresponding position (lines 10-12). Then, the main evolutionary loop is executed, with a termination condition determined by the execution budget (lines 14-19). At each loop iteration, an individual randomly selected from the current map is mutated and evaluated (lines 15-17) and if it has a higher fitness value than the individual in the map cell it occupies, it replaces the existing entry in the map (this is done also if the map entry is currently empty). In the next sections, we describe the key design choices behind DeepHyperion’s algorithm and how we applied it to the chosen application domains.
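The steps of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration rather than DeepHyperion's implementation: the function name `illuminate` and the callables `generate_seeds`, `evaluate`, `mutate` and `cell_of` are hypothetical, the initial population is taken as the first `pop_size` seeds instead of the diversity-based selection of Section 2.4, and "higher fitness" is read as "closer to a misbehaviour", i.e. a lower value of the minimised fitness functions of Section 2.2.

```python
import random


def illuminate(generate_seeds, evaluate, mutate, cell_of, pop_size, budget, rng=None):
    """Minimal MAP-Elites-style loop (illustrative sketch).

    evaluate(ind) -> (features, fitness); lower fitness means closer to misbehaviour.
    cell_of(features) -> hashable map coordinates.
    mutate(ind, rng) -> a new candidate individual.
    """
    rng = rng or random.Random(0)
    feature_map = {}  # cell coordinates -> (fitness, individual)

    def update_map(ind):
        features, fitness = evaluate(ind)
        cell = cell_of(features)
        best = feature_map.get(cell)
        # Fill an empty cell, or replace its occupant if the newcomer is fitter.
        if best is None or fitness < best[0]:
            feature_map[cell] = (fitness, ind)

    seeds = generate_seeds()
    population = seeds[:pop_size]  # assumption: stand-in for diversity-based selection
    for ind in population:
        update_map(ind)

    for _ in range(budget):  # termination condition: execution budget
        cell = rng.choice(list(feature_map))  # uniform choice among filled cells
        _, parent = feature_map[cell]
        update_map(mutate(parent, rng))

    return feature_map
```

Storing the map as a dictionary keyed by cell coordinates means that only filled cells consume memory, which fits a map whose extent is not known in advance.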
2.1. Model-Based Input Representation
DeepHyperion belongs to the family of the model-based input generation techniques (Utting et al., 2012). It does not directly modify raw input data (e.g., pixels) but it manipulates a model of the input that is later used to derive the actual raw input data. This enables DeepHyperion to generate more realistic inputs, belonging to the input validity domain (Riccio and Tonella, 2020). This implies that DeepHyperion is applicable to a given domain if we have a generative model of the input data processed in such domain.
The development of input models is standard practice in several domains, including safety-critical ones (Larman, 1997). Generative input models are largely domain-specific. In the following, we present two examples of such models for the domains we considered in our experimental evaluation: handwritten digits, processed by a digit classifier, and driving scenarios for self-driving cars.
As regards the handwritten digits, the test inputs are images in the MNIST (LeCun et al., 1998) format. MNIST is a database of handwritten digits, originally encoded as 28 x 28 images with greyscale levels that range from 0 to 255. We model them as combinations of Bézier curves, adopting the Scalable Vector Graphics (SVG, https://www.w3.org/Graphics/SVG/) representation. The control parameters that determine the shape of the modelled digit are: the start point, the end point and the control points that define each Bézier segment. This representation ensures that the realism of handwritten shapes is preserved even after minor manipulation of the Bézier curve parameters (Riccio and Tonella, 2020). We use the Potrace algorithm (Selinger, 2003) to transform an MNIST input into its SVG model representation. To transform an SVG model back to a greyscale image, we perform a rasterisation operation by means of the functionalities offered by two popular open source graphic libraries, i.e. LibRsvg (https://wiki.gnome.org/Projects/LibRsvg) and Cairo (https://www.cairographics.org).
In the autonomous driving domain, the test input is the scenario in which the car drives. A simulated scenario can be modelled as the composition of the roads, the driving task (i.e., start point, end point and lane to keep), and the environment, which includes the weather and lighting conditions. We consider input scenarios similar to those generated by the state-of-the-art testing tool DeepJanus (Riccio and Tonella, 2020). These scenarios consist of plain asphalt roads surrounded by green grass on which the car has to drive keeping the right lane. The environment is set to a clear day without fog. The roads are composed of two lanes with fixed width in which there is a yellow centre line plus two white lines that separate each lane from the non-drivable area. Our model of a road is a sequence of consecutive points in a bi-dimensional space. To produce a smooth and realistic shape for the road being modelled, we use Catmull-Rom cubic splines (Catmull and Rom, 1974). The control parameters that determine the shape of the splines are the coordinates of the control points of the centre line spline. To transform the model into a road to be rendered in the simulator, we calculate the road points by means of the recursive algorithm for the evaluation of Catmull-Rom cubic splines proposed by Barry and Goldman (Barry and Goldman, 1988) and the functionality offered by the Shapely library (https://github.com/Toblerity/Shapely) for the manipulation and analysis of planar geometric objects.
2.2. Fitness Function
The Evaluate function (lines 6 and 17 in Algorithm 1) evaluates an individual ind by determining the values of its features $ind.f_1, \ldots, ind.f_N$ and of its fitness function ind.fitness, both of which are domain/problem specific.
As for the definition of the relevant input features in a given domain, we propose a novel, systematic methodology, described in detail in Section 3. As for the fitness function, the general idea is that it should quantify how close the DL system is to a misbehaviour. In the following, we illustrate the definition of a sensible fitness function for each of the two domains of handwritten digit recognition and autonomous driving.
For the digit classification problem, we exploit the activation levels available in the output softmax layer of the DNN that classifies the input image. In fact, the softmax output can be interpreted as a confidence level assigned to each of the possible classes (Goodfellow et al., 2016), where the selected class for the given input is the one with highest confidence. More specifically, we calculate the difference between the confidence level associated with the expected class and the maximum confidence level associated with any other class. In this way, we get a fitness value that is close to zero when the correct class and the second highest class have similar activation levels, while we get a negative number when the input is misclassified. Hence, this fitness function is to be minimised.
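The margin described above can be computed directly from a softmax vector. The following sketch assumes the softmax output is available as a plain sequence of per-class confidences; the function name `digit_fitness` is hypothetical.

```python
def digit_fitness(softmax_output, expected_label):
    """Confidence of the expected class minus the best confidence of any other class.

    Close to zero: the classifier is near the decision boundary.
    Negative: another class wins, i.e. the input is misclassified.
    """
    expected_conf = softmax_output[expected_label]
    best_other = max(
        conf for label, conf in enumerate(softmax_output) if label != expected_label
    )
    return expected_conf - best_other
```

For instance, with confidences `[0.1, 0.7, 0.2]` and expected class 1 the margin is positive (correct classification), whereas with `[0.6, 0.3, 0.1]` and expected class 1 it is negative (misclassification).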
For the steering angle prediction problem, we adopt a fitness function similar to that used by DeepJanus (Riccio and Tonella, 2020). The behaviour of the self-driving system is characterised by the distance of the car from the centre of the lane during the simulation (Stocco and Tonella, 2020; Jahangirova et al., 2021). The fitness is calculated as $w/2 - d$, where $w$ is the width of the lane and $d$ the distance of the car from the lane centre. The position of the car is approximated by its centre of mass. The fitness function returns its maximum value ($w/2$) when the car is at the centre of the lane, whereas it returns a negative number when the car is out of bounds. Hence, this fitness function is also to be minimised.
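As a sketch, the lane-keeping fitness can be read as half the lane width minus the distance from the lane centre, which is consistent with the stated maximum at the lane centre and with negative values out of bounds. The function names below are hypothetical, and aggregating a whole simulation by taking the minimum over the observed positions is an assumption made here for illustration.

```python
def lane_fitness(lane_width, distance_from_centre):
    """Half the lane width minus the distance of the car from the lane centre."""
    return lane_width / 2.0 - distance_from_centre


def simulation_fitness(lane_width, distances):
    """Assumption: a simulation is scored by its worst (smallest) observed value."""
    return min(lane_fitness(lane_width, d) for d in distances)
```

With a 4 m lane, a car at the lane centre scores 2.0, and a car 2.5 m from the centre scores -0.5, signalling that it has left the lane.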
2.3. Feature Map
The feature map represents the feature space defined by N dimensions of variation that characterise the input or the behaviour of the DL system under test. Given the values of an individual’s features, $ind.f_i$, function UpdateMap (lines 11 and 18 in Algorithm 1) computes the individual’s coordinates in the map by applying the following mapping function:

$$ind.c_i = \mathrm{int}(K_i \cdot ind.f_i)$$

The value $ind.f_i$ is converted to an integer after multiplying it by a constant $K_i$ that ensures approximately the desired grid size, given the expected range of each feature $ind.f_i$. For instance, if the desired grid size is 100 and $ind.f_i$ is known to range between 0 and 1, a proper selection of $K_i$ could be 100. The resulting integer $ind.c_i$ is used as an index in the map to get the cell to be assigned to the individual ind (in a 2D map, the cell at position $(ind.c_1, ind.c_2)$).
During the search, the size of the map grows dynamically as higher/lower index values are discovered. In fact, initially the map has zero entries along all its dimensions. As soon as a new index is discovered during the search process, the map is extended to accommodate the newly discovered range of each dimension. For instance, if the first mapped individual has indexes , the initial map will have size 1 in both directions and will contain just one cell, at position . If later another individual is mapped to , the map will be extended to cover the integer range [2:5] along its first dimension and [1:3] along the second dimension. Hence, at this point of the search the map will cover the rectangle [2:5] × [1:3], which means it will contain 4 × 3 = 12 cells.
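The mapping and the dynamic growth of the map can be sketched as follows. The helper names (`to_cell`, `bounding_box`, `box_size`) are hypothetical, and the two example cells in the usage note are illustrative values chosen so that the map spans [2:5] along the first dimension and [1:3] along the second.

```python
def to_cell(features, scale):
    """Map feature values to integer map coordinates: int(constant * feature value)."""
    # Note: int() truncates towards zero, which differs from floor for negative values.
    return tuple(int(k * f) for k, f in zip(scale, features))


def bounding_box(cells):
    """Smallest axis-aligned integer box containing every discovered cell."""
    dims = range(len(cells[0]))
    lower = tuple(min(c[d] for c in cells) for d in dims)
    upper = tuple(max(c[d] for c in cells) for d in dims)
    return lower, upper


def box_size(lower, upper):
    """Number of cells in the box, counting both endpoints of each range."""
    size = 1
    for lo, hi in zip(lower, upper):
        size *= hi - lo + 1
    return size
```

For example, `bounding_box([(2, 3), (5, 1)])` gives `((2, 1), (5, 3))`, and `box_size((2, 1), (5, 3))` gives 12, matching the 4 × 3 rectangle covered by the ranges [2:5] and [1:3].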
At the end of the search, the final map can be adjusted to allow the user to define a granularity different from the dynamically discovered one. This is particularly useful if users want to compare maps produced in different runs/configurations or by different algorithms. The final remapping is a linear rescaling function:

$$ind.c_i = \mathrm{int}\left(G \cdot \frac{ind.f_i - min_i}{max_i - min_i}\right)$$

where $G$ is the desired grid size, while $min_i$, $max_i$ are the minimum/maximum value of the $i$-th feature observed across all maps being rescaled to the new grid.
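A linear rescaling of this kind can be sketched as below. The function name `rescale_cell` is hypothetical, and two details are assumptions made for the sketch rather than statements about DeepHyperion: the observed maximum is clamped into the last cell so that indexes stay within the grid, and a degenerate range (minimum equal to maximum) maps to index 0.

```python
def rescale_cell(value, observed_min, observed_max, grid_size):
    """Linearly rescale a feature value onto a grid with `grid_size` cells."""
    if observed_max == observed_min:
        return 0  # assumption: a single observed value occupies the first cell
    index = int(grid_size * (value - observed_min) / (observed_max - observed_min))
    # Assumption: the observed maximum falls into the last cell, not past it.
    return min(index, grid_size - 1)
```

Because the minimum and maximum are taken across all maps being compared, the same feature value lands in the same cell regardless of which run or tool produced it.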
2.4. Initial Population Generation
The generation of the initial population consists of choosing an initial set of diverse individuals from the feature space, given a set of seeds of size seedsize and the population size popsize. Function GenerateSeeds (line 4 in Algorithm 1) generates seeds that are valid inputs for the system under test. More specifically, for digit classification, seeds are randomly chosen inputs from the MNIST database and converted to SVG. For steering angle prediction, we randomly generate valid roads.
DeepHyperion evaluates the fitness and the feature values of the generated seeds and then it finds the map cells to which they belong (lines 5-7 in Algorithm 1). Then, function InitialisePopulation (line 8 in Algorithm 1) selects as initial population the most diverse inputs among the available seeds, by computing the pairwise Manhattan distance (Krause, 1986) (sum of the absolute differences of the map coordinates) and greedily constructing the set of most diverse seeds, starting from a randomly selected first seed, up to the desired population size.
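The greedy construction can be sketched as a farthest-point selection over the seeds' map coordinates. The text does not specify the exact greedy criterion, so the one used here (pick the seed whose minimum Manhattan distance to the already selected seeds is largest) is an assumption, as are the function names.

```python
import random


def manhattan(a, b):
    """Sum of the absolute differences of the map coordinates."""
    return sum(abs(x - y) for x, y in zip(a, b))


def select_diverse(cells, pop_size, rng=None):
    """Return indexes of a greedily built, diverse subset of seeds.

    cells[i] holds the map coordinates of seed i.
    """
    rng = rng or random.Random(0)
    remaining = list(range(len(cells)))
    first = rng.choice(remaining)  # randomly selected first seed
    selected = [first]
    remaining.remove(first)
    while remaining and len(selected) < pop_size:
        # Assumed criterion: maximise the distance to the closest selected seed.
        farthest = max(
            remaining,
            key=lambda i: min(manhattan(cells[i], cells[j]) for j in selected),
        )
        selected.append(farthest)
        remaining.remove(farthest)
    return selected
```

With seeds at `(0, 0)`, `(0, 1)`, `(10, 10)` and `(0, 2)` and a population size of 2, the isolated seed at `(10, 10)` is always part of the selection, whichever seed is drawn first.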
2.5. Selection and Mutation
Function RandomSelection (line 15 in Algorithm 1) randomly chooses an already filled cell in the map (Mouret and Clune, 2015). The individual that occupies the chosen cell is selected for the next genetic evolution. In this way, DeepHyperion is not biased towards the solution with highest fitness, like classic evolutionary algorithms, and it can explore the feature space at large, with the ultimate goal of “illuminating” it as completely as possible.
The selected individual is mutated by the Mutate operator (line 16 in Algorithm 1). This operator manipulates the input model’s control parameters by applying a small perturbation to them. The extent of the perturbation is uniformly sampled in a customisable range. After applying the operator, DeepHyperion verifies that the mutant complies with the constraints of the input domain. For the digit classification problem, the mutated control points must remain within the 28 x 28 input grid. For the steering angle prediction problem, DeepHyperion enforces the following constraints to ensure that the mutant is a valid road: (1) the start point and the end point of the road should be different, (2) the road should fall within a square bounding box of fixed size, and (3) a road should not self-intersect. Moreover, DeepHyperion also verifies that, once concretised into an actual input for the DL system, the mutant is different from its parent. If any of these checks fails, the operator is applied repeatedly, until a valid input is obtained.
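The mutate-and-validate cycle can be sketched as follows. All names are hypothetical: `is_valid` stands for the domain constraints, `concretise` for the transformation of the model into an actual input, and `low`/`high` for the customisable perturbation range. Restarting each attempt from the parent and bounding the number of attempts with `max_tries` are assumptions of this sketch.

```python
import random


def mutate_until_valid(params, is_valid, concretise, rng, low=0.01, high=0.6, max_tries=1000):
    """Perturb one control parameter until the mutant is valid and differs from its parent."""
    parent_input = concretise(params)
    for _ in range(max_tries):
        mutant = list(params)
        index = rng.randrange(len(mutant))
        # Extent of the perturbation: uniformly sampled in [low, high], random sign.
        delta = rng.uniform(low, high) * rng.choice((-1, 1))
        mutant[index] += delta
        # Keep the mutant only if it satisfies the domain constraints and
        # yields a concrete input that differs from the parent's.
        if is_valid(mutant) and concretise(mutant) != parent_input:
            return mutant
    raise RuntimeError("no valid mutant found within max_tries attempts")
```

Comparing the concretised inputs, rather than the model parameters, catches perturbations that are too small to change the raw input, e.g. a control point that moves by less than one pixel after rasterisation.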
3. Definition of the Map Dimensions
A crucial element of our approach is the choice of the dimensions of variation of the automatically generated test cases. Such dimensions define both the feature space of interest to the user (Mouret and Clune, 2015) and the search space of DeepHyperion. In the case of DL testing, they should represent meaningful properties of the test scenarios: either discriminative and interpretable structural features of the inputs, or behavioural features observed as the DL system processes the input and produces its output.
We propose an empirical methodology that can be used to define the feature dimensions in a new domain of interest. Our methodology consists of two macro-steps (see Figure 2): (1) open coding: select the features that better characterise the generated inputs, and (2) metric identification: quantify the selected features. The second step is needed to provide DeepHyperion with quantitative feature values to position the generated tests in the feature map.
This methodology relies on the experts’ ability to define meaningful features and metrics to quantify them. Therefore, it can be challenging for DL systems with complex input/output spaces.
3.1. Open Coding
The first step entails an open coding procedure (Seaman, 1999) in which a set of existing inputs is manually analysed by human assessors to select the relevant features in a given domain. Since we are interested in both structural and behavioural features, the information provided to the human assessors is not restricted to the bare inputs (i.e. digit images and roads): it also includes the output of the DL system when processing the given existing inputs (e.g., the class predicted by an image classifier), as well as any relevant behavioural data (e.g., the trajectory of the car driving on the input road).
The assessors independently tag the inputs assigned to them by either reusing an existing feature label or defining a new one. Each feature label is composed of a feature name, paired with the corresponding feature value, chosen from a rating scale, usually with five levels. For instance, a hypothetical label for the speed of a self-driving car will have values that range between -2 and +2, where -2 means “very low”, while +2 means “very high”. This procedure is supported by a web application that we developed, which ensures that unlabelled inputs are equally distributed among the assessors, enables assessors to label inputs according to the existing features as well as to define new features, and supports conflict resolution when assessors evaluate the same input differently.
In our methodology, it is strongly advised to run a preliminary pilot study on a subset of inputs to gain confidence in the labelling procedure and, more importantly, agree on the meaning of the features and on the interpretation of the corresponding values. The pilot is concluded with a consensus meeting in which the disagreements are solved either by consensus among the assignees or arbitration by the other assessors. In our experience, a disagreement is worth being discussed in the consensus meeting when the assigned values differ by more than 1 position in the rating scale (e.g., a disagreement between “very low” and “low” speed can be just ignored, while one between “low” and “high” is worth being discussed and solved). It might happen that the assessors realise through the discussion that some important features have been overlooked. Therefore, as part of the consensus meeting, assessors are allowed to agree upon additional features to be considered during the labelling procedure.
We suggest switching from the pilot study mode to the final study mode only when a common understanding of the features and of their possible values has been reached. In the final study, it is usually enough that each remaining unlabelled input is evaluated by a single assessor. In fact, while during the pilot study the number of inputs being labelled is kept small, in the final labelling phase we typically want to label as many inputs as possible.
3.2. Metric Identification
The second step of our methodology aims to define a set of metrics that can accurately quantify the domain-relevant features. The metrics can be either (1) selected from the most used in the literature or (2) designed ad-hoc to accurately quantify the features identified in the Open Coding step.
To select the most accurate metrics for the features that have been identified in the previous step, we compute the Pearson correlation coefficient (23) and the associated p-value, between the manually defined feature values, converted from the rating scale to a numeric scale (e.g., in the range [1:5]), and the values returned by the candidate metrics. The metrics with highest, statistically significant (p-value < 0.05) correlation are chosen to quantify the selected features. In the following, we provide the details about how this methodology was applied to each of our case studies, i.e. digit recognition and autonomous driving.
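The correlation step can be sketched with the standard library alone. The function names are hypothetical; the sketch computes the Pearson coefficient and the associated t-statistic, from which the p-value follows via a Student t distribution with n - 2 degrees of freedom. In practice a statistics library would typically be used for the p-value, e.g. SciPy's `scipy.stats.pearsonr`, which returns both the coefficient and the p-value.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t-statistic of r under the null hypothesis of no correlation (requires |r| < 1)."""
    return r * math.sqrt((n - 2) / (1 - r * r))
```

Here `xs` would hold the assessors' ratings on the numeric scale and `ys` the values of one candidate metric on the same inputs; the candidate with the highest significant correlation is retained.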
3.3. Dimensions for Digit Recognition
3.3.1. Open Coding
In this phase, three authors acted as assessors. In the pilot, we randomly selected images from the MNIST database and each assessor was assigned images, such that each input was evaluated by two assessors. The assessors identified the following features, to which they assigned values within a range from -2 to 2:
Boldness: indicates how strong the stroke of the handwriting is. It ranges from very thin (-2) to very thick line (+2).
Smoothness: indicates the absence of sharp angles in the digit. It ranges from sharp angles (-2) to smooth angles (+2).
Discontinuity: indicates how continuous the stroke of the handwriting is. It ranges from continuous line (-2) to digits made of multiple disconnected segments (+2).
Rotation: rotation with respect to the vertical axis. It ranges from strongly tilted to the left (-2) to strongly tilted to the right (+2).
Examples of images of handwritten digits at various levels of Boldness and Discontinuity can be found in Figure 1. The inter-rater agreement during the pilot study, measured as the percentage of inputs that were assigned the same feature value or feature values with a difference of 1, is reported in Table 1 under Agree. We observed that assessors strongly agreed over Boldness and Discontinuity (i.e., no conflicts have been registered). Noticeably, Table 1 does not report any agreement value for Rotation, as the assessors introduced this feature during the consensus meeting, i.e., after the data collection for the pilot study ended. In the final phase, we randomly selected images from MNIST and each of the three assessors labelled images.
3.3.2. Metric Identification
To measure each feature resulting from the labelling procedure, we designed several candidate metrics and applied them to the images labelled by the assessors. Table 1 (top) shows the metric with highest correlation for each MNIST feature, together with the corresponding correlation and p-value:
Luminosity (Lum): number of light pixels of the image, i.e., pixels whose value is above .
Average Angle (AvgAng): the average angle of the Bézier curves in the SVG representation of the digit.
Moves (Mov): sum of the Euclidean distances between pairs of consecutive sections of the digit. To obtain the sections of a digit, we convert its bitmap to SVG.
Orientation (Or): vertical orientation of the digit, obtained by computing the angular coefficient of the linear regression of the non-black pixels, i.e., pixels with value greater than.
As shown in Table 1, for Boldness, Discontinuity and Rotation we were able to define metrics that significantly correlate with the human assessment, whereas this was not possible for Smoothness, which turned out to be both difficult to evaluate for humans (see low inter-rater agreement) and difficult to quantify precisely. Hence, this feature was not included among those used by DeepHyperion for input generation.
3.4. Dimensions for Autonomous Driving
3.4.1. Open Coding
In this phase, all the authors acted as assessors. In the pilot, we randomly generated virtual roads according to our model representation. Each assessor was assigned images representing roads, so that each road was evaluated by two assessors. To simplify the job of the assessors, the web application supporting the labelling procedure provides some interaction facilities for the inspection of the road, such as: (1) zoom in/out; (2) selection of specific road segments; (3) navigation along the road; (4) toggling the visualisation of the car. The images abstract the roads over a two-dimensional plane but retain their geometrical properties and the proportions to the vehicle. In the images we draw boxes that represent the vehicle and cones that represent its field of view, to give assessors more context.
The assessors identified the following features, to which they assigned values within a range from to :
Smoothness, indicates how smooth the turns of the road are. It ranges from sharp turns () to gentle turns ().
Complexity, indicates how complex the road’s shape is. It ranges from almost straight roads () to roads with many turns ().
Orientation, indicates how many directions (i.e., N, NE, E, SE, S, SW, W, NW) the road covers. It ranges from straight road which is oriented to one direction only () to road that covers the whole spectrum of directions ().
Figure 3.a reports a smooth but complex road, whereas the road in Figure 3.b is complex and has sharper turns. The road in Figure 3.c is smooth but includes a very sharp turn. Figure 3.d shows a road covering almost the whole spectrum of directions.
As reported at the bottom of Table 1, during the pilot, we observed that assessors generally agreed upon all the features. In the final phase, we randomly generated roads and each assessor tagged of them.
3.4.2. Metric Identification
We designed a set of candidate metrics and applied them to the images labelled by the assessors. We eventually selected the following metrics that best correlate with the corresponding features, as reported in Table 1 (bottom):
Minimum radius of curvature (MinRad): minimum value of the radius for the circles passing through triplets of consecutive road waypoints.
Turn Count (TurCnt): number of turns in the road, where a turn is a change of direction between consecutive road segments by more than .
Direction Coverage (DirCov): number of different angular sectors covered by the directions of the road segment. In particular, we consider sectors, each spanning .
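The three structural road metrics can be sketched over a list of 2D waypoints. The function names are hypothetical, and the turn threshold and the number of angular sectors are left as parameters because their values are not given here; aligning the first sector with a heading of 0 radians is also an assumption of the sketch.

```python
import math


def circumradius(p, q, r):
    """Radius of the circle through three points (infinite if they are collinear)."""
    a, b, c = math.dist(p, q), math.dist(q, r), math.dist(p, r)
    # Absolute cross product = twice the triangle area; R = abc / (4 * area).
    cross = abs((q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0]))
    return math.inf if cross == 0 else a * b * c / (2 * cross)


def min_radius(points):
    """MinRad: smallest circumradius over triplets of consecutive waypoints."""
    return min(circumradius(*points[i:i + 3]) for i in range(len(points) - 2))


def headings(points):
    """Direction (in radians) of each segment between consecutive waypoints."""
    return [math.atan2(y2 - y1, x2 - x1) for (x1, y1), (x2, y2) in zip(points, points[1:])]


def turn_count(points, threshold_deg):
    """TurCnt: consecutive segments whose direction changes by more than the threshold."""
    hs = headings(points)
    count = 0
    for h1, h2 in zip(hs, hs[1:]):
        diff = abs(math.degrees(h2 - h1))
        diff = min(diff, 360 - diff)  # smallest angle between the two directions
        if diff > threshold_deg:
            count += 1
    return count


def direction_coverage(points, n_sectors):
    """DirCov: number of distinct angular sectors hit by the segment directions."""
    width = 2 * math.pi / n_sectors
    return len({int((h % (2 * math.pi)) / width) % n_sectors for h in headings(points)})
```

A straight road yields an infinite minimum radius, zero turns and a direction coverage of 1, which matches the lowest ratings of the corresponding features.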
In addition to the features that characterise the structure of the test input, we considered further features to capture the behaviour of the car during the simulation. In particular, we used the following metrics that have been proposed as quality metrics for self-driving cars (Jahangirova et al., 2021) to measure the quality of driving:
Standard deviation of the steering angle (StdSA): standard deviation of the sequence of steering angles collected along the road during self-driving.
Mean lateral position of the car (MLP): mean distance between the center of the car and the center of the driving lane, where the mean is computed across all car positions observed along the road.
4. Experimental Evaluation
4.1. Subject Systems
We evaluate DeepHyperion on two DL systems which address different tasks and belong to different domains. Moreover, they have been widely used in the literature to evaluate testing techniques for DL systems (Riccio et al., 2020; Zhang et al., 2020). Hereafter, we refer to these systems as MNIST and BeamNG, respectively.
The MNIST system performs a classification task, which consists of recognising handwritten digits from the MNIST dataset (LeCun et al., 1998). It is a DNN that predicts which digit is represented in a greyscale image. We considered the DNN instance provided by Keras (Chollet, 2020) because of its popularity. It has % test accuracy, obtained after training it ourselves on the MNIST training set with its default configuration, i.e. epochs, batches of size , and a learning rate equal to .
The BeamNG system is a self-driving car equipped with a Lane Keeping Assist System (LKAS). The DL component solves a regression problem, i.e., it predicts the steering angle of the car given the image of its onboard cameras. We tested the whole DL system which includes the LKAS by using the BeamNG.research driving simulator (BeamNG GmbH, 2018), a freely available research-oriented version of the commercial game BeamNG.drive. The DL component driving the car utilises the DAVE-2 architecture designed by Bojarski et al. at NVIDIA (Bojarski et al., 2016), consisting of three convolutional layers, followed by five fully-connected layers. The DNN was trained with images captured by the camera sensors of the car, paired with the steering angles provided by the simulator’s autopilot, which takes advantage of global knowledge and computes the optimal steering angle geometrically. We trained the model for epochs, with batches of size and a learning rate equal to . We used a training dataset obtained by letting the autopilot drive up to 15 mph on randomly generated roads.
4.2. Research Questions
The goal of our evaluation is to understand whether coupling feature maps and automated test generation is an effective technique for DL testing, which (1) can thoroughly stress the DL system under different conditions, and (2) can provide information useful to characterise problems in DL systems. Therefore, we seek to answer the following research questions:
RQ1 (Failure Diversity): How effective is DeepHyperion in generating test inputs that expose diverse failures?
Generating tests that trigger failures is more useful when these failures are diverse. In contrast, a test generator that repeatedly exposes the same problem is not desirable, as it wastes computational resources.
Metrics: To assess how many different failures are triggered during a run, we measure the number of Mapped Misbehaviours (MM), i.e. how many cells of the feature map M contain at least one failure-inducing input.
To measure how diverse the mapped misbehaviours of M are, we compute the Misbehaviour Sparseness (MS), defined as the average maximum Manhattan distance between cells containing misbehaviours.
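Read literally, the definition of MS can be written as the formula below. This is a reconstruction under stated assumptions, not necessarily the formula used by the authors: the symbol C (the set of cells of M that contain at least one failure-inducing input) and d_1 (the Manhattan distance between cell coordinates) are names introduced here for illustration only.

```latex
MM = |C|, \qquad
MS = \frac{1}{|C|} \sum_{c_i \in C} \; \max_{c_j \in C} d_1(c_i, c_j), \qquad
d_1(c_i, c_j) = \sum_{k} \left| c_{i,k} - c_{j,k} \right|
```

Under this reading, MS grows when misbehaving cells lie far apart on the map, whereas MM only counts them.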
RQ2 (Search Exploration): How extensively does DeepHyperion explore the feature space?
Effective test generation should exercise different behaviours of the systems under test. This can be achieved by exploring the feature space extensively.
Metrics: We measure the map coverage as the number of Filled Cells in the map. Moreover, to measure how broadly our tool explores the feature space by generating inputs in diverse cells, we use the Coverage Sparseness (CS), defined as the average maximum Manhattan distance between filled cells.
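The coverage and sparseness metrics can be illustrated with a minimal Python sketch. It assumes a two-dimensional map stored as a dictionary from cell coordinates to the list of outcomes observed in that cell (True meaning a misbehaviour); this data layout, the function names, and the toy values are hypothetical and are not taken from the DeepHyperion implementation.

```python
def manhattan(a, b):
    """Manhattan distance between two cells given as coordinate tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def sparseness(cells):
    """Average, over all cells, of the maximum Manhattan distance to any other cell."""
    cells = list(cells)
    if len(cells) < 2:
        return 0.0  # assumption: a single cell (or none) has no spread
    return sum(max(manhattan(c, o) for o in cells) for c in cells) / len(cells)


# Hypothetical toy map: cell coordinates -> outcomes (True = misbehaviour).
feature_map = {
    (0, 0): [False, True],
    (0, 3): [False],
    (4, 3): [True],
}

filled = [cell for cell, outcomes in feature_map.items() if outcomes]
misbehaving = [cell for cell, outcomes in feature_map.items() if any(outcomes)]

filled_cells = len(filled)                    # map coverage
coverage_sparseness = sparseness(filled)      # CS
mapped_misbehaviours = len(misbehaving)       # MM
misbehaviour_sparseness = sparseness(misbehaving)  # MS
```

The same `sparseness` helper serves both CS and MS, since the two metrics differ only in the set of cells they consider.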
RQ3 (Feature Discrimination): How strongly do different combinations of features characterise the failure-inducing inputs?
The existence of regions of the map where the probability of misbehaviours is very high indicates that the corresponding feature value combinations are very likely to induce a failure, since most of the times when the combination was generated by DeepHyperion, a misbehaviour was observed. This could provide developers with a powerful tool to understand the conditions responsible for misbehaviours (a form of root-cause analysis).
Metrics: To answer this research question, we compute the Misbehaviour Probability (MP) for each cell of a map as the ratio between the number of failure-inducing inputs and the total number of inputs generated by DeepHyperion during the search process for that cell. Since occasionally only a small number of inputs may be generated by DeepHyperion for a given cell during the search, our estimate of MP might be affected by a large error. Hence, we also compute the confidence interval of MP. In particular, we use Wilson’s confidence interval estimator for binomial random variables: in our case, such a binomial variable indicates whether a misbehaviour is induced or not. We consider a combination of feature values to have a high probability of failure if its MP value is greater than 0.8 and the lower bound of its confidence interval is above 0.65.
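The check on MP and its confidence interval can be sketched in Python as follows. The sketch uses the standard Wilson score interval; the 95% confidence level (z = 1.96) is an assumption, since the text does not state the level used, and the function names are hypothetical. The thresholds 0.8 and 0.65 are the ones reported for the highlighted cells of the feature maps.

```python
import math


def wilson_interval(failures, total, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to a 95% confidence level (an assumption here).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = failures / total
    z2 = z * z
    denom = 1 + z2 / total
    centre = (p + z2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    return centre - half, centre + half


def is_high_failure_cell(failures, total, mp_threshold=0.8, lower_threshold=0.65):
    """True if MP exceeds mp_threshold and the Wilson lower bound exceeds lower_threshold."""
    mp = failures / total
    lower, _ = wilson_interval(failures, total)
    return mp > mp_threshold and lower > lower_threshold
```

For example, a cell with 3 failures out of 3 inputs has MP = 1 but a Wilson lower bound of roughly 0.44, so it would not be flagged, whereas a cell with 95 failures out of 100 inputs would be. This is why the lower bound is checked in addition to MP itself.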
4.3. Experimental Procedure
Table 2. Configurations of DeepHyperion
|parameter||MNIST||BeamNG|
|seed pool size||900||40|
|time budget (s)||3600||36000|
|mutation lower bound||0.01||1|
|mutation upper bound||0.6||6|
To answer our research questions, we ran DeepHyperion with different combinations of the features we identified following our methodology (see Section 3). We limited DeepHyperion to pairwise combinations of features to ease the visualisation of the maps and the discussion of the results; however, the algorithm is general and also works with maps that have more than two dimensions. For MNIST, we report the results achieved by considering all three features that significantly correlate with the corresponding metrics (see Table 1). As regards BeamNG, we conducted our experiments on five feature combinations that cover the three combination types: two structural features, two behavioural features, and a combination of a structural and a behavioural feature.
DLFuzz is representative of approaches that generate test inputs for image classifiers by applying perturbations to the raw input (i.e., pixels), often used to generate adversarial examples and test the robustness of DL components. It has been applied to the MNIST system. However, it cannot be applied to BeamNG since it could only manipulate individual camera inputs, without affecting the road shape.
DeepJanus is a search-based tool that generates inputs at the frontier of behaviours of DL systems, i.e. similar inputs that trigger different system behaviours. It is a model-based approach that can be applied to both the BeamNG and MNIST systems. Moreover, it shares the same input representation with DeepHyperion, which guarantees a consistent measurement of the features and, thus, a fair comparison of the approaches.
We ran each tool the same number of times on each subject system, i.e. times on MNIST and times on BeamNG, respectively. To ensure a fair comparison, we ran them on the same computing nodes and used the same time budget for each tool, i.e. 3600 seconds for MNIST and 36000 seconds for BeamNG, respectively. The reason for the different time budgets is that testing MNIST only requires feeding it an image and getting the corresponding prediction, which is usually a matter of milliseconds, while testing BeamNG requires executing driving simulations that take several minutes to complete.
The configurations of DeepHyperion were obtained in a few preliminary runs and are reported in Table 2. With the other tools, we either used the configuration reported as the one achieving the best performance or directly contacted the tools’ corresponding authors when some details about the configuration were missing.
The initial seeds for MNIST were obtained by randomly selecting inputs from the MNIST test set, all belonging to the class “5”. We obtained similar results for other digit classes, but we do not report them for space reasons. For BeamNG, the seeds were defined by 10 control points in which the initial point was always at a fixed position, whereas the others were placed at a random position 25 meters away from the previous one.
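The construction of a BeamNG seed described above can be sketched as follows. This is a minimal illustration under explicit assumptions: the direction of each step is drawn uniformly at random, the fixed initial point is placed at the origin, and the function name is hypothetical. The actual tool may constrain the angles, for instance to avoid self-intersecting roads.

```python
import math
import random


def random_road_seed(num_points=10, step=25.0, start=(0.0, 0.0), rng=None):
    """Return control points where each one lies `step` metres from the previous one.

    The first point is fixed at `start`; the uniform random direction is an assumption.
    """
    rng = rng or random.Random()
    points = [start]
    for _ in range(num_points - 1):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        x, y = points[-1]
        points.append((x + step * math.cos(angle), y + step * math.sin(angle)))
    return points


# Usage: a reproducible seed made of 10 control points.
seed_points = random_road_seed(rng=random.Random(42))
```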
At the end of the runs, we used the inputs generated by each tool and the outputs generated by the subjects to compute the feature map of each run. All the maps were generated with the same number of cells for each feature, i.e. up to cells, by using the rescaling function described in Equation 2. The min/max values defining the range for each feature were the ones observed across the runs of all the tools. We used these feature maps to compute the metrics associated with each research question. To assess the statistical significance of the comparisons between DeepHyperion and the considered state-of-the-art tools, we performed the Mann-Whitney U-test and measured the effect size by means of the Vargha-Delaney’s Â statistic (Arcuri and Briand, 2014).
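For reference, the Vargha-Delaney effect size can be computed with a few lines of Python. The sketch below is a plain implementation of the standard definition, i.e. the probability that a value drawn from the first sample exceeds one drawn from the second, with ties counted as one half. The function name and the sample values are hypothetical and are not results from the study. The accompanying significance test is available in common statistics libraries and is not reimplemented here.

```python
def vargha_delaney_a12(xs, ys):
    """Probability that a random value from xs is larger than one from ys (ties count 0.5)."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    ties = sum(1 for x in xs for y in ys if x == y)
    return (greater + 0.5 * ties) / (len(xs) * len(ys))


# Hypothetical metric values from repeated runs of two tools.
tool_a_runs = [5, 6, 7]
tool_b_runs = [1, 2, 3]
effect = vargha_delaney_a12(tool_a_runs, tool_b_runs)
```

A value of 0.5 means that neither tool tends to score higher, while values close to 1 (or 0) indicate that the first (or second) tool almost always scores higher.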
5.1. RQ1: Failure Diversity
Figure 4 shows the number of diverse misbehaviours found by each tool in the MNIST system and their sparseness on the map.
As shown in Figure 4 (top), DeepHyperion found more than diverse misbehaviours for each feature combination. DeepHyperion outperformed the other tools by generating a higher number of diverse failures for all feature combinations (p-values ; large effect size). In particular, for the Or-Lum feature combination DeepHyperion produced a number of mapped misbehaviours that is remarkably above all the other tools (exceeding the second best by more than misbehaviours).
Figure 4 (bottom) shows that DeepHyperion produced failure-inducing inputs that are sparser on the feature map for all feature combinations, as its Misbehaviour Sparseness (MS) metric is always significantly higher than that of the compared tools, with p-values (effect size is always large with the exception of Mov-Lum vs DLFuzz, for which the effect size is small). This result is achieved even though DeepJanus explicitly rewards diversity, having a fitness function that promotes the Euclidean distance among solutions. DeepHyperion can expose a large number of misbehaviours (Figure 4, top), and, more importantly, it can reveal highly diverse misbehaviours, associated with very distant feature combinations (Figure 4, bottom).
Figure 5 shows Mapped Misbehaviours and Misbehaviour Sparseness of the tools that have been applied to the BeamNG system. Figure 5 (top) shows that DeepHyperion was always able to expose several diverse failures of the BeamNG system (at least on average). In comparison with DeepJanus, it produced significantly more misbehaviours for all the feature combinations (p-values , large effect size). In particular, MLP-StdSA produced an impressive number of mapped misbehaviours ( on average) which is remarkably above the competitor (exceeding it by ). The effectiveness of the combination of behavioural features is also confirmed by the sparseness of the misbehaviours, as shown in Figure 5 (bottom), since it performed better than the other combinations, i.e. on average. With respect to the competitor, DeepHyperion generated significantly sparser inputs for four out of five combinations. Only for MLP-TurnCnt, the sparseness values of the inputs generated by the two tools do not show any statistically significant difference (p-value , slightly above the conventional threshold of 0.05; medium effect size in favour of DeepHyperion).
We also compared the total number of misbehaviours exposed by each tool, regardless of their diversity. Table 3 shows that DeepHyperion exposed a total number of misbehaviours significantly larger than the competitors (p-values , large effect size).
This confirms that the good results achieved by DeepHyperion are not biased by the size of the feature maps we adopted in the experiments.
Summary: DeepHyperion can find diverse failure-inducing inputs for all feature combinations. It can detect up to 10X more misbehaviours than the competitor in BeamNG.
5.2. RQ2: Search Exploration
Figure 6 shows the number and the sparseness of the filled cells in the maps produced by each tool for the MNIST system. Figure 6 (top) shows that DeepHyperion covered all feature maps more extensively than the other tools (with large effect size and p-value always ). Similarly to RQ1, the Or-Lum combination shows dramatically better results of DeepHyperion in comparison to the other tools, i.e. additional cells filled by DeepHyperion.
Figure 6 (bottom) shows that DeepHyperion produced sparser inputs for all three feature combinations. As regards the Mov-Lum combination, DeepHyperion has significantly better Coverage Sparseness than DeepJanus and DLFuzz (p-value ), with large effect size vs DeepJanus and small vs DLFuzz.
Figure 7 reports the Filled Cells and Coverage Sparseness achieved by the tools applied to the BeamNG system. Figure 7 (top) confirms that DeepHyperion is particularly good at covering the map corresponding to the combination of behavioural features MLP-StdSA ( filled cells on average). In comparison with DeepJanus, it always filled significantly more cells (at least cells more). Figure 7 (bottom) shows that DeepHyperion produced inputs that are always significantly sparser than DeepJanus (p-values , large effect size).
Summary: DeepHyperion can always explore the feature space more extensively than the other tools (up to 8X for MNIST).
5.3. RQ3: Feature Discrimination
In Figures 8 and 9, we report feature maps where the cell colour indicates the average Misbehaviour Probability (MP) across the runs for the corresponding feature combination (darker colour indicates higher probability). Blank cells are combinations that have never been explored by DeepHyperion. We highlight with a dark border the cells for which the MP is > 0.8 and the lower bound of its confidence interval is > 0.65, which means that whenever an input is placed in these cells it is very likely to trigger a misbehaviour.
As regards MNIST, Figure 8 shows that each feature map produced by DeepHyperion has multiple regions where the probability of failure is high. For instance, Figure 8 (center) suggests that left-oriented and thin digits are very likely to cause a classification failure. DeepHyperion can further help the user to interpret its results by showing the most representative inputs for each cell. As an example, in Figure 1 we show the actual inputs generated by DeepHyperion representing the map in Figure 8 (left). We can see that bold digits are hard to recognise as “5” when part of the figure forms a circle, as they can be considered as “6” or “9”. The bottom of the map shows that thin and discontinuous figures are hard to classify.
As regards BeamNG, Figure 9 shows that each of the maps corresponding to the combinations MLP-StdSA and MLP-MinRad has a clear region where the probability of failure is high. Figure 9 (left) suggests that the car is likely to go astray on roads that cause it to drive closer to the lane margins (lower MLP) and to change the steering angle direction often (higher StdSA). Figure 9 (right) suggests that roads with at least one very sharp turn, which cause the car to drive close to the lane margins, are likely to cause a failure. The absence of high failure probability regions for the combination of the structural features TurnCnt-MinRad (see Figure 9 (center)) may indicate that behavioural features are more useful for characterising the conditions that trigger a misbehaviour.
Summary: For both subjects, DeepHyperion can detect well-characterised regions of the feature space that are likely to expose failures.
5.4. Threats to Validity
Construct Validity: DeepHyperion highly depends on map dimensions corresponding to measurable features. There is the risk that the selected features are not accurately quantified by the adopted metrics. To mitigate this threat, we (1) developed an empirical methodology to define the features of interest and the associated metrics, and (2) used metrics widely adopted in the literature.
External Validity: The choice of subject DL systems is a possible threat to the external validity. To mitigate this threat, we chose two diverse DL systems. One solves a classification problem, while the other is a self-driving car software that solves a regression problem. However, further studies with a wider set of DL systems should be carried out to fully assess the generalisability of our findings.
Conclusion Validity: Random variations might have affected the results, given the highly stochastic nature of DL systems. To mitigate this threat, we ran each experiment multiple times and performed statistical tests to assess the significance of our results, according to the guidelines for comparing randomised test generation algorithms proposed by Arcuri and Briand (Arcuri and Briand, 2014).
6. Related Work
DL systems’ quality has been mainly assessed by generating new inputs that expose misbehaviours (Pei et al., 2019; Guo et al., 2018; Tian et al., 2018; Ma et al., 2018; Gambi et al., 2019) and by proposing novel adequacy criteria that guide the input generation process (Pei et al., 2019; Xie et al., 2019; Kim et al., 2019). To the best of our knowledge, no technique aims at covering the feature space of DL systems and few works (Abdessalem et al., 2018; Riccio and Tonella, 2020) make use of interpretable properties for test generation.
6.1. Input Generation and Adequacy
DeepXplore (Pei et al., 2019) is a testing technique to detect behaviour inconsistencies among different DNNs. It is guided by neuron coverage, i.e., the percentage of neurons whose activation level is above a certain threshold. DLFuzz (Guo et al., 2018) is also a test input generator guided by neuron coverage. DeepTest (Tian et al., 2018) maximises the neuron coverage of a DNN-based steering angle predictor by applying different image transformations to images captured by the on-board camera of an autonomous car. DeepGauge (Ma et al., 2018) uses a set of coverage criteria that extend neuron coverage by taking the distribution of training data into consideration. DeepCT (Ma et al., 2019) uses a set of combinatorial testing criteria for DNNs based on the interactions between neurons. DeepHunter (Xie et al., 2019) leverages multiple coverage criteria originally proposed by Ma et al. (Ma et al., 2018) as feedback to guide test generation. DeepSmartFuzzer (Demir et al., 2020) uses Monte Carlo Tree Search (MCTS) to exploit the coverage increase patterns. The degree of “surprise” of an input was measured by means of the two metrics proposed by Kim et al. (Kim et al., 2019), associated with a surprise adequacy coverage criterion.
The above test input generation techniques focus on generating adversarial inputs by adding perturbations to the original inputs. Other input generators (Riccio and Tonella, 2020; Gambi et al., 2019; Abdessalem et al., 2018) are instead based on the manipulation of a model of the inputs, which ensures more control over the validity and realism of the generated inputs, going beyond adversarial attacks that expose security vulnerabilities. For instance, DeepJanus (Riccio and Tonella, 2020) manipulates the way-points that define the shape of a road within a self-driving car simulator. AsFault (Gambi et al., 2019) is a model-based approach that applies a search-based algorithm to test the lane-keeping system of self-driving cars. DeepHyperion belongs to this category of test generators.
All test generators mentioned above aim at maximising some internal adequacy metric, such as neuron or surprise coverage, or at exposing misbehaviours. None of them considers the value combinations of interpretable features of the DL system under test as the target of test generation. DeepHyperion is the first tool to provide developers with a map of such features, where the automatically generated inputs, as well as the exposed misbehaviours, are positioned and can be interpreted. Hence, existing test generators might completely ignore parts of a feature map or might expose only misbehaviours that belong to a narrow map region.
6.2. Structural and Behavioural Properties
NSGAII-DT (Abdessalem et al., 2018) is a model-based approach for testing vision-based control systems. This approach builds on evolutionary multi-objective algorithms and uses decision trees to guide the generation of new test scenarios within the multidimensional space of the model parameters. Decision trees are used to identify the critical regions of the input space, i.e., the combinations of model parameter values that are more likely to cause misbehaviours. While decision trees provide interpretable information to developers as DeepHyperion does with its feature maps, the variables that appear in decision nodes are limited to the control parameters of the input model, which might not be fully representative of all relevant behavioural features of the system under test. Moreover, decision trees are used to focus the search on critical scenarios (collisions or near-collisions at high speed with pedestrians), so as to increase the search efficiency, while DeepHyperion aims at covering the feature map at large, so as to ensure that as many regions as possible are tested and that regions with misbehaviours are not left untested.
DeepJanus (Riccio and Tonella, 2020) characterises the quality of a DL system as its frontier of behaviours, i.e., pairs of similar inputs that trigger different (expected vs failing) behaviours of the system. The output of DeepJanus provides users with a set of the system’s frontier inputs, but it does not explicitly characterise them based on structural or behavioural features. Instead, DeepHyperion’s maps allow developers to interpret the inputs that trigger a misbehaviour in terms of their feature values.
The properties we use as feature dimensions are identified by experts during the open coding step of our empirical methodology. In the literature, weak supervision approaches (Zhou, 2017), e.g., the Data Programming paradigm (Ratner et al., 2016), also exploit domain-experts’ knowledge to create and assign output labels to the training set elements. Unlike these approaches, our open coding identifies input features that can be quantified by metrics, without considering their relationship with the network’s output.
7. Conclusions and Future Work
DeepHyperion provides a unique characterisation of a DL system’s quality through an interpretable map which represents the highest-performing (i.e., misbehaving or closest to misbehaving) inputs in the space of the relevant, domain-specific features.
Our empirical study shows that DeepHyperion is more effective than state-of-the-art DL testing tools in generating failure-inducing inputs associated with highly diverse features. In the reverse direction, we showed that DeepHyperion is useful to detect the feature combinations that are most likely to induce a system misbehaviour. In our future work, we plan to generalise our results to a wider sample of DL systems, including industrial ones.
Acknowledgements. This work was partially supported by the H2020 project PRECRIME, funded under the ERC Advanced Grant 2017 Program (ERC Grant Agreement n. 787703). The driving simulator has been provided by BeamNG GmbH.
- Testing advanced driver assistance systems using multi-objective search and neural networks. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, ASE 2016, Singapore, September 3-7, 2016, pp. 63–74.
- Testing vision-based control systems using learnable evolutionary algorithms. In Proceedings of the 40th International Conference on Software Engineering, ICSE ’18, pp. 1016–1026.
- A hitchhiker’s guide to statistical tests for assessing randomized algorithms in software engineering. Software Testing, Verification and Reliability 24 (3), pp. 219–250.
- A recursive evaluation algorithm for a class of Catmull-Rom splines. SIGGRAPH Comput. Graph. 22 (4), pp. 199–204.
- BeamNG.research. BeamNG GmbH.
- End to end learning for self-driving cars. CoRR abs/1604.07316, pp. 1–9.
- A class of local interpolating splines. In Computer Aided Geometric Design, R. E. Barnhill and R. F. Riesenfeld (Eds.), pp. 317–326.
- Simple MNIST convnet. Note: https://github.com/keras-team/keras-io/blob/master/examples/vision/mnist_convnet.py
- DeepSmartFuzzer: reward guided test generation for deep learning. In Proceedings of the Workshop on Artificial Intelligence Safety 2020 (IJCAI-PRICAI 2020), Yokohama, Japan, January, 2021, CEUR Workshop Proceedings, Vol. 2640, pp. 134–140.
- Automatically testing self-driving cars with search-based procedural content generation. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2019, Beijing, China, July 15-19, 2019, pp. 318–328.
- Deep learning. MIT Press. Note: http://www.deeplearningbook.org
- DLFuzz: differential fuzzing testing of deep learning systems. In Proceedings of the 2018 ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/SIGSOFT FSE 2018, Lake Buena Vista, FL, USA, November 04-09, 2018, pp. 739–743.
- Taxonomy of real faults in deep learning systems. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, ICSE ’20, pp. 1110–1121.
- Quality metrics and oracles for autonomous vehicles testing. In Proceedings of 14th IEEE International Conference on Software Testing, Verification and Validation, ICST ’21, pp. 194–204.
- Guiding deep learning system testing using surprise adequacy. In Proceedings of the 41st International Conference on Software Engineering, ICSE 2019, Montreal, QC, Canada, May 25-31, 2019, pp. 1039–1049.
- Taxicab geometry: an adventure in non-Euclidean geometry. Courier Corporation.
- Applying UML and patterns: an introduction to object-oriented analysis and design. Prentice Hall.
- Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
- DeepCT: tomographic combinatorial testing for deep learning systems. In 26th IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2019, Hangzhou, China, February 24-27, 2019, pp. 614–618.
- DeepGauge: multi-granularity testing criteria for deep learning systems. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE 2018, pp. 120–131.
- Introduction to information retrieval. Cambridge University Press.
- Illuminating search spaces by mapping elites.
- (2008) Pearson’s correlation coefficient. In Encyclopedia of Public Health, W. Kirch (Ed.), pp. 1090–1091.
- DeepXplore: automated whitebox testing of deep learning systems. Commun. ACM 62 (11), pp. 137–145.
- Data programming: creating large training sets, quickly. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, pp. 3574–3582.
- Testing machine learning based systems: a systematic mapping. Empir. Softw. Eng. 25 (6), pp. 5193–5254.
- Model-based exploration of the frontier of behaviours for deep learning system testing. In Proceedings of the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE ’20, 13 pages.
- Qualitative methods in empirical studies of software engineering. IEEE Transactions on Software Engineering 25, pp. 557–572.
- Potrace: a polygon-based tracing algorithm.
- Towards anomaly detectors that learn continuously. In 2020 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), pp. 201–208.
- DeepTest: automated testing of deep-neural-network-driven autonomous cars. In Proceedings of the 40th International Conference on Software Engineering, ICSE ’18, pp. 303–314.
- A taxonomy of model-based testing approaches. Software Testing, Verification and Reliability 22 (5), pp. 297–312.
- DeepHunter: a coverage-guided fuzz testing framework for deep neural networks. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2019, pp. 146–157.
- Machine learning testing: survey, landscapes and horizons. IEEE Transactions on Software Engineering, Early Access, pp. 1–1.
- DeepRoad: GAN-based metamorphic testing and input validation framework for autonomous driving systems. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE 2018, Montpellier, France, September 3-7, 2018, pp. 132–142.
- A brief introduction to weakly supervised learning. National Science Review 5 (1), pp. 44–53.
- DeepHyperion: replication package. Note: https://github.com/testingautomated-usi/DeepHyperion