Towards Dependability Metrics for Neural Networks

06/06/2018 ∙ by Chih-Hong Cheng, et al. ∙ fortiss

Neural networks and other data engineered models are instrumental in developing automated driving components such as perception or intention prediction. The safety-critical aspect of such a domain makes dependability of neural networks a central concern for long living systems. Hence, it is of great importance to support the development team in evaluating important dependability attributes of the machine learning artifacts during their development process. So far, there is no systematic framework available in which a neural network can be evaluated against these important attributes. In this paper, we address this challenge by proposing eight metrics that characterize the robustness, interpretability, completeness, and correctness of machine learning artifacts, enabling the development team to efficiently identify dependability issues.




I Introduction

Artificial neural networks (NN) are instrumental in realizing a number of important features in safety-relevant applications such as highly-automated driving. In particular, vision-based perception, the prediction of drivers’ intention, and even end-to-end autonomous control are usually based on NN technology. State-of-the-practice safety engineering processes (cf. ISO 26262) require that safety-relevant components, including NN-enabled ones, demonstrably satisfy their respective safety goals.

Notice that transferring traditional testing methods and the corresponding test coverage metrics such as MC/DC (cf. DO-178C) to NN may lead to a number of branches to be investigated that is exponential in the number of neurons [1]. Such an exponential blow-up is not practical, as a typical NN may comprise millions of neurons. Moreover, a straightforward adaptation of structural coverage metrics for NN, e.g., the percentage of activated neurons for a given test set [2], does not take into account that the activation of single neurons is usually not strongly connected to the result of the whole network. The challenge therefore is to develop a set of NN-specific and efficiently computable metrics for measuring various aspects of the dependability of NN.

In previous work we generated test cases for NN testing based on finite partitions of the input space, relying on predefined sets of application-specific scenario attributes [3]. Besides correctness and completeness of NN, we also identified robustness [4] and interpretability [1] as important NN dependability attributes.

Here we build on our previous work on testing NN, and we propose a set of metrics for measuring the RICC dependability attributes of NN, which are informally described as follows.

  • Robustness of a NN against various effects such as distortion or adversarial perturbation (which is closely related to security).

  • Interpretability in terms of understanding important aspects of what a NN has actually learned.

  • Completeness in terms of ensuring that the data used in training covers, as far as possible, all important scenarios.

  • Correctness in terms of a NN being able to perform the perception task without errors.

The main contribution of this paper is a complete and efficiently computable set of NN-specific metrics for measuring RICC dependability attributes. Fig. 1 illustrates how the metrics cover the space of RICC, where at least two metrics relate to each criterion.

Fig. 1: Relations between RICC criteria and the proposed metrics. The group of metrics A., B. and E. cover completeness and correctness.

II Quality Metrics

II-A Scenario coverage metric

Similar to the class imbalance problem [5] when training classifiers in machine learning, one needs to account for the presence of all relevant scenarios in training datasets for NN for autonomous driving. A scenario over a list $\mathcal{C} = \langle C_1, \dots, C_n \rangle$ of operating conditions (e.g., weather and road condition) is given by a valuation of each condition. E.g., let $C_1$ represent the weather condition, $C_2$ represent the road surfacing, and $C_3$ represent the incoming road orientation. Then any two distinct triples $(c_1, c_2, c_3)$ and $(c_1', c_2', c_3')$ with $c_i, c_i' \in C_i$ constitute two possible scenarios.

Since checking the coverage of all scenarios is infeasible for realistic specifications of operating conditions due to combinatorial explosion, our proposed scenario coverage metric is based on the concept of 2-projection and is tightly connected to existing work on combinatorial testing, covering arrays, and their quantitative extensions [6, 7, 3].

Fig. 2: Computing the scenario coverage metric via a 2-projection table


Computing the scenario coverage metric requires that the dataset is semantically labeled according to the specified operating conditions, such that for each data point it can be determined whether it belongs to a certain scenario.


The computation starts by preparing a table recording all possible value combinations for each pair of operating conditions, followed by iterating over the data points to mark the cells they occupy. Lastly, the ratio between the number of occupied cells and the total number of cells is computed. Eq. 1 summarizes the formula, and an illustration for a dataset of two data points can be found in Fig. 2.

$$M_{\mathrm{scene}} := \frac{\#\,\text{cells occupied by the dataset}}{\#\,\text{cells in the 2-projection table}} \qquad (1)$$


Provided that for each $C_i$ the size of $C_i$ is bounded by a constant $\alpha$ (i.e., the categorization is finite and discrete), the denominator is at most $\binom{n}{2}\alpha^2$, i.e., the number of data points required for full coverage is polynomially bounded.
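The table-based computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tool: the function name `scenario_coverage`, the dictionary-based data layout, and the concrete condition values are all assumptions made for the example.

```python
from itertools import combinations


def scenario_coverage(conditions, data_points):
    """Ratio of occupied cells to all cells in the 2-projection table.

    conditions: dict mapping a condition name to its list of possible values.
    data_points: iterable of dicts mapping a condition name to its labeled value.
    """
    names = sorted(conditions)
    # One cell per pair of conditions and per combination of their values.
    total_cells = sum(
        len(conditions[a]) * len(conditions[b]) for a, b in combinations(names, 2)
    )
    occupied = set()
    for point in data_points:
        # Each data point occupies exactly one cell per pair of conditions.
        for a, b in combinations(names, 2):
            occupied.add((a, point[a], b, point[b]))
    return len(occupied) / total_cells


# Hypothetical operating conditions and labels, for illustration only.
conditions = {
    "weather": ["sunny", "cloudy", "rainy"],
    "surface": ["stone", "mud", "tarmac"],
    "orientation": ["straight", "curvy"],
}
dataset = [
    {"weather": "sunny", "surface": "stone", "orientation": "straight"},
    {"weather": "rainy", "surface": "tarmac", "orientation": "curvy"},
]
coverage = scenario_coverage(conditions, dataset)
```

With these assumed conditions the table has 3·3 + 3·2 + 3·2 = 21 cells, and the two data points occupy six of them. Adding a data point that repeats an already seen scenario leaves the value unchanged, which is why new scenarios are needed to improve the metric.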

Relations to RICC & improving

The metric reflects the completeness and correctness attributes of RICC. To improve the metric, one needs to discover new scenarios. For the example in Fig. 2, adding an image whose scenario occupies previously empty cells of the table efficiently increases the metric.

II-B Neuron k-activation metric

The previously described input space partitioning may also be observed through the activation of neurons. By considering ReLU activation as an indicator of successfully detecting a feature, for close-to-output layers where high-level features are captured, the combination of neuron activations in the same layer also forms scenarios (which are independent of the specified operating conditions). Again, we encounter combinatorial explosion, e.g., for a layer of $c$ neurons, there is a total of $2^c$ scenarios to be covered. Therefore, similar to the 2-projection in the scenario coverage metric, this metric only monitors whether the input set has enabled all activation patterns for every neuron pair or triple in the same layer.


The user specifies an integer constant $k$ and a specific layer to be analyzed. Assume that the layer has $c$ neurons.


The computation starts by preparing a table recording all possible $k$-tuples of on-off activation for neurons in the layer being analyzed (similar to Fig. 2, with each entry now having only on and off status), followed by iterating over the data points to mark the cells they occupy. The denominator is given by the number of cells, which has value $\binom{c}{k} 2^k$.


Note that when $k = 1$, our defined neuron $k$-activation metric subsumes commonly seen neuron coverage acting over a single layer [8, 2], where one analyzes the on-off cases for each individual neuron.
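A minimal Python sketch of this table-based computation follows. The function name `k_activation_coverage` and the toy layer outputs are assumptions for illustration; a neuron is treated as "on" when its post-ReLU output is strictly positive.

```python
from itertools import combinations
from math import comb


def k_activation_coverage(activations, k):
    """Fraction of on-off patterns seen over all k-tuples of neurons in a layer.

    activations: list of per-input lists holding the layer's post-ReLU outputs.
    """
    num_neurons = len(activations[0])
    # Every k-subset of neurons has 2^k possible on-off patterns.
    total_cells = comb(num_neurons, k) * 2 ** k
    occupied = set()
    for row in activations:
        on_off = tuple(value > 0 for value in row)
        for neurons in combinations(range(num_neurons), k):
            occupied.add((neurons, tuple(on_off[i] for i in neurons)))
    return len(occupied) / total_cells


# Toy outputs of a three-neuron layer for three inputs (assumed values).
layer_outputs = [
    [0.0, 1.2, 0.0],
    [0.7, 0.0, 0.0],
    [0.3, 0.9, 0.0],
]
coverage_k1 = k_activation_coverage(layer_outputs, 1)
coverage_k2 = k_activation_coverage(layer_outputs, 2)
```

In this toy data the third neuron is never activated, so single-neuron coverage is 5 of 6 cells, while pairwise coverage reaches only 7 of 12 cells. The pairwise table exposes missing combinations that per-neuron coverage cannot see.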

Relations to RICC & improving

The metric reflects the completeness and correctness attributes of RICC. To improve the metric, one needs to provide inputs that enable different neuron activation patterns.

II-C Neuron activation pattern metric

While the $k$-activation metric addresses completeness in the presence of combinatorial explosion, our neuron activation pattern metric is used to understand the distribution of activation. For inputs within the same scenario, intuitively the activation pattern should be similar, implying that the number of activated neurons should be similar.


The user provides an input set $In$, where all images belong to the same scenario, and specifies a layer of the NN (with $c$ neurons) to be analyzed. Furthermore, the user chooses the number of groups $\gamma$, for a partition of $In$ into $\gamma$ groups $G_1, \dots, G_\gamma$, where for group $G_i$, $i \in \{1, \dots, \gamma\}$, the number of activated neurons in the specified layer is within the range $[\frac{c}{\gamma}(i-1), \frac{c}{\gamma}i]$ for each input in this group.


Let $G_j$ be the largest set among $G_1, \dots, G_\gamma$. Then the metric is evaluated by considering all inputs whose activation pattern, aggregated using the number of neurons being activated, significantly deviates from the majority.
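One way to make the grouping and the deviation check concrete is sketched below. The text does not fix what "significantly deviates" means, so this sketch assumes equal-width groups over the number of active neurons and counts an input as deviating when its group index differs from the largest group's index by more than a hypothetical `tolerance` parameter. The function name and the sample counts are likewise illustrative assumptions.

```python
def activation_pattern_metric(active_counts, num_neurons, num_groups, tolerance=1):
    """Fraction of inputs whose active-neuron count lies far from the majority.

    active_counts: number of activated neurons in the layer, one entry per input.
    tolerance: assumed cut-off; groups within this index distance of the
        largest group are not counted as deviating.
    """
    width = num_neurons / num_groups
    group_sizes = [0] * num_groups
    for count in active_counts:
        # Clamp so that count == num_neurons falls into the last group.
        index = min(int(count / width), num_groups - 1)
        group_sizes[index] += 1
    largest = max(range(num_groups), key=group_sizes.__getitem__)
    deviating = sum(
        size for i, size in enumerate(group_sizes) if abs(i - largest) > tolerance
    )
    return deviating / len(active_counts)


# Assumed active-neuron counts for eight inputs of one scenario, 100-neuron layer.
counts = [41, 44, 47, 52, 48, 45, 90, 5]
metric = activation_pattern_metric(counts, num_neurons=100, num_groups=10)
```

Here five inputs fall into the group covering 40 to 50 active neurons, one input lands in the neighboring group, and the inputs with 90 and 5 active neurons are flagged, giving a value of 0.25. Those two inputs are the ones an engineer would inspect first.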


Relations to RICC & improving

This metric reflects the robustness and completeness attributes of RICC, as well as interpretability. To improve the metric, one needs to carefully examine the reasons for the diversity in activation patterns under the same scenario.

II-D Adversarial confidence loss metric

Vulnerability w.r.t. adversarial inputs [9] is an important quality attribute of NNs that are used for image processing and designed to be used in safety-critical systems. As providing a formally provable guarantee against all possible adversarial inputs is hard, our proposed adversarial confidence loss metric is useful in providing engineers with an estimate of how robust a NN is.


Computing $M_{\mathrm{adv}}$ requires a list of input transformers $\langle T_1, \dots, T_n \rangle$, where for each $T_j$, given a parameter $\epsilon$ specifying the allowed perturbation, one derives a new input $in' = T_j(in, \epsilon)$ by transforming input $in$. Each $T_j$ is one of the known image perturbation techniques, ranging from simple rotation and distortion to advanced techniques such as FGSM [10] or DeepFool [11].


Given a test set $In$, a predefined perturbation bound $\epsilon$, and the list of input transformers, let $NN(in)$, where $in \in In$, be the output of the NN being analyzed, with a larger value being better. (The formulation assumes that the NN has a single output, but it can easily be extended to multi-output scenarios.) The following equation computes the adversarial confidence loss metric:

$$M_{\mathrm{adv}} := \frac{\sum_{in \in In} \min_{j \in \{1, \dots, n\}} \big( NN(T_j(in, \epsilon)) - NN(in) \big)}{|In|}$$


Intuitively, $M_{\mathrm{adv}}$ analyzes the change of output value for input $in$ due to a perturbation, and selects the perturbation that leads to the largest performance drop among all perturbation techniques, i.e., the one that makes the computed difference most negative. A real example is shown in Fig. 3(a), where the FGSM attack yields the largest classification performance drop among three perturbation techniques by lowering the predicted probability of the class car. The probability difference under FGSM is therefore the value taken for this image. Lastly, the computed values are averaged over all inputs being analyzed.
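The per-input minimum followed by averaging can be sketched as below. This is an illustration only: the model and the two transformers are toy stand-ins operating on plain numbers, not real image perturbations such as FGSM, and all names are assumptions.

```python
def adversarial_confidence_loss(model, inputs, transformers, epsilon):
    """Average, over all inputs, of the most negative output change.

    model: callable returning a single score per input (larger is better).
    transformers: callables taking (input, epsilon) and returning a perturbed input.
    """
    total = 0.0
    for x in inputs:
        baseline = model(x)
        # Keep the perturbation technique that hurts this input the most.
        total += min(model(t(x, epsilon)) - baseline for t in transformers)
    return total / len(inputs)


# Toy stand-ins (assumptions): a scalar "input" and a confidence that
# decreases with the distance of the input from zero.
def toy_model(x):
    return max(0.0, 1.0 - abs(x))


def shift_up(x, epsilon):
    return x + epsilon


def shift_down(x, epsilon):
    return x - epsilon


loss = adversarial_confidence_loss(
    toy_model, [0.0, 0.5], [shift_up, shift_down], epsilon=0.25
)
```

For the input 0.5, shifting down actually raises the toy confidence, but the metric keeps only the worst case, so both inputs contribute a drop of 0.25. Values closer to zero indicate a network that is less affected by the chosen perturbations.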

(a) A vehicle image and three perturbed images. The largest classification performance drop is achieved by the FGSM technique.

(b) This heatmap for a pedestrian contains nine hot pixels in orange, 30 occluding pixels and five hot and occluding pixels.
Fig. 3: Illustrating $M_{\mathrm{adv}}$ and a heatmap for $M_{\mathrm{interpret}}$ and $M_{\mathrm{OccSen}}$.

Relations to RICC & improving

The metric has a clear impact on robustness and correctness. To improve the metric, one needs to introduce perturbed images into the training set, or apply alternative training techniques with provable bounds [12].

II-E Scenario based performance degradation metric

Here we omit details, but for commonly seen performance metrics such as validation accuracy or even quantitative statistic measures such as MTBF, one may perform detailed analysis by either considering each scenario, or by discounting the value due to missing input scenarios (the discount factor can be taken from the computed scenario coverage metric).

II-F Interpretation precision metric

The interpretation precision metric is intended to judge if a NN for image classification or object detection makes its decision on the correct part of the image. E.g., the metric can reveal that a certain class of objects is mostly identified by its surroundings, maybe because it only exists in similar surroundings in the training and validation data. In this case, engineers should test whether this class of object can also be detected in different contexts.


For computing this metric, we need a validation set that has image segmentation ground truth in addition to the ground truth classes (and bounding boxes), e.g., as in VOC2012 data set [13].


Here we describe how the metric can be computed for a single detected object, where one can extend the computation to a set of images by posing average or min/max operators. A real example demonstrating the exact computation is shown in Fig. 4.

  1. Run the NN on the image to classify an object with probability $p$ (and obtain a bounding box in the case of object detection).

  2. Compute an occlusion sensitivity heatmap $H$, where each pixel $h \in H$ of the heatmap maps to a position of the occlusion on the image [14]. The value of $h$ is given by the probability of the original class for the occluded image. For object detection we take the maximum probability of the correct class over all detected boxes that have a significant Jaccard similarity with the ground truth bounding box.

  3. For a given probability threshold $\rho$ that defines the set of hot pixels as $P_{\mathrm{hot}} = \{h \in H \mid h < \rho\}$, and the set of pixels that partly occlude the segmentation ground truth, denoted by $P_{\mathrm{occluding}}$, the metric is computed as follows:

$$M_{\mathrm{interpret}} = \frac{|P_{\mathrm{hot}} \cap P_{\mathrm{occluding}}|}{|P_{\mathrm{hot}}|}$$


An illustrative example of computing $M_{\mathrm{interpret}}$ can be found in Fig. 3(b), where for the human figure only five out of nine hot pixels intersect the region of the human body. Thus $M_{\mathrm{interpret}} = 5/9$. The set of thirty pixels constituting the human forms $P_{\mathrm{occluding}}$.
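The pedestrian example can be reproduced with a small Python sketch. It assumes that a heatmap position is "hot" when occluding it drops the class probability below the threshold, and it represents pixel sets as sets of (row, column) pairs. The heatmap values, the grid size, and the function names are made up for illustration; only the counts (nine hot pixels, thirty occluding pixels, five in both) follow the example.

```python
def hot_pixels(heatmap, threshold):
    """Positions whose occlusion pushes the class probability below threshold."""
    return {
        (row, col)
        for row, values in enumerate(heatmap)
        for col, probability in enumerate(values)
        if probability < threshold
    }


def interpretation_precision(hot, occluding):
    """Fraction of hot pixels that lie on the object's segmentation."""
    return len(hot & occluding) / len(hot)


# Toy 6x8 heatmap: probability 0.9 everywhere except nine hot positions.
heatmap = [[0.9] * 8 for _ in range(6)]
for row, col in [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1),  # on the object
                 (4, 6), (4, 7), (5, 6), (5, 7)]:         # off the object
    heatmap[row][col] = 0.1

# Object segmentation: a 5x6 block of 30 pixels (assumed shape).
occluding = {(row, col) for row in range(5) for col in range(6)}

hot = hot_pixels(heatmap, threshold=0.5)
precision = interpretation_precision(hot, occluding)
```

The sketch does not guard against an empty set of hot pixels; in that case the ratio is undefined and a caller would need to decide how to report it.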

Relations to RICC & improving

The interpretation precision metric contributes to the interpretability and correctness of the RICC criteria. It may reveal that a NN uses a lot of context to detect some objects, e.g., regions surrounding the object or background of the image. In this case, adding images where these objects appear in different surroundings can improve the metric.

(a) Result of object detection
(b) Heatmap for red car (bottom left)
(c) for
(d) Heatmap for the right person
(e) for
Fig. 4: Computing $M_{\mathrm{interpret}}$ for the red car and the right person in front of the red car. The metric shows that the red car is mostly identified by the correct areas. On the other hand, for the person there are a lot of hot pixels in incorrect regions.

II-G Occlusion sensitivity covering metric

This metric measures the fraction of the object that is sensitive to occlusion. Generally speaking, it is undesirable to have a significant probability drop if only a small part of the object is occluded.

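Furthermore, care should be taken about the location of the occlusion sensitive area. If a certain part of an object class is occlusion sensitive in many cases (e.g., the head of a dog), it should be tested whether the object can still be detected when this part is occluded (e.g., the head of a dog is behind a sign post). $M_{\mathrm{OccSen}}$ is computed in a similar way and based on the same inputs as $M_{\mathrm{interpret}}$:

  1. Perform steps 1) and 2) and determine $P_{\mathrm{hot}}$ and $P_{\mathrm{occluding}}$ as for $M_{\mathrm{interpret}}$.

  2. Derive $M_{\mathrm{OccSen}} = \frac{|P_{\mathrm{hot}} \cap P_{\mathrm{occluding}}|}{|P_{\mathrm{occluding}}|}$.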

If the value is high, it indicates that many positions of small occlusions can lead to a detection error. A low value indicates that there is a greater chance of still detecting the object when it is partly occluded. An illustrative example of computing $M_{\mathrm{OccSen}}$ can be found in Fig. 3(b), where for the human figure the heatmap only contains five hot pixels intersecting the human body (the head). As there are 30 pixels intersecting the region of the human, we have $M_{\mathrm{OccSen}} = 5/30 = 1/6$.
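The same set-based view gives a short sketch for this metric. As before, the pixel coordinates and the function name are assumptions chosen to match the counts of the pedestrian example (five hot pixels on a 30-pixel object, four hot pixels elsewhere).

```python
def occlusion_sensitivity_covering(hot, occluding):
    """Fraction of the object's pixels that are sensitive to occlusion."""
    return len(hot & occluding) / len(occluding)


# Object region: 30 pixels arranged as a 5x6 block (assumed shape).
occluding = {(row, col) for row in range(5) for col in range(6)}

# Nine hot pixels: five on the object, four in the background.
hot = {(0, 0), (0, 1), (0, 2), (1, 0), (1, 1),
       (4, 6), (4, 7), (5, 6), (5, 7)}

covering = occlusion_sensitivity_covering(hot, occluding)
```

Note that hot pixels outside the object do not influence this value. They only lower the interpretation precision, which is why the two metrics are reported side by side.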

Relations to RICC & improving

Occlusion sensitivity coverage covers the robustness and interpretability of RICC. If the metric values are too high for certain kinds of objects, an approach to improve it is to augment the training set with more images where these objects are only partly visible.

II-H Weighted accuracy/confusion metric

In object classification, not all errors have the same severity, e.g., confusing a pedestrian for a tree is more critical than the opposite. Apart from pure accuracy measures, one may employ fine-grained analysis such as specifying penalty terms as weights to capture different classification misses.

As such a technique is standard in the performance evaluation of machine learning algorithms, the distinguishing aspect is how the confusion weights are determined. Table I provides a summary of penalties to be applied in traffic scenarios, reflecting the safety aspect. Misclassifying a pedestrian (or bicycle) as background (i.e., no object exists) should receive the highest penalty, as pedestrians are unprotected and such a miss may easily lead to life-threatening situations.

A is classified to B | B (pedestrian) | B (vehicle)    | B (background)
A (pedestrian)       | n.a. (correct) |                |
A (vehicle)          |                | n.a. (correct) |
A (background)       |                |                | n.a. (correct)
TABLE I: Qualitative severity of safety to be reflected as weights

Relations to RICC & improving

The metric is a fine-grained indicator of correctness. To improve the metric, one either trains the network with more examples, or modifies the loss function such that it is aligned with the weighted confusion, e.g., by setting a higher penalty term for misclassifying a “pedestrian” as “background”.
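A weighted confusion score of this kind can be sketched as follows. The numeric weights and the confusion counts below are purely hypothetical and are not taken from Table I; they only encode the stated principle that missing a pedestrian as background is the most severe error.

```python
def weighted_confusion_penalty(confusion_counts, weights):
    """Average penalty per classified object.

    confusion_counts: dict mapping (true_class, predicted_class) to a count.
    weights: dict mapping (true_class, predicted_class) to a penalty;
        pairs without an entry (e.g., correct classifications) cost nothing.
    """
    total = sum(confusion_counts.values())
    penalty = sum(
        weights.get(pair, 0.0) * count for pair, count in confusion_counts.items()
    )
    return penalty / total


# Hypothetical penalties: missing an unprotected pedestrian costs the most.
weights = {
    ("pedestrian", "background"): 10.0,
    ("vehicle", "background"): 5.0,
    ("pedestrian", "vehicle"): 2.0,
    ("vehicle", "pedestrian"): 1.0,
    ("background", "pedestrian"): 1.0,
    ("background", "vehicle"): 1.0,
}

# Hypothetical evaluation result on 200 objects.
confusion_counts = {
    ("pedestrian", "pedestrian"): 90,
    ("pedestrian", "background"): 2,
    ("vehicle", "vehicle"): 100,
    ("vehicle", "pedestrian"): 3,
    ("background", "background"): 5,
}
score = weighted_confusion_penalty(confusion_counts, weights)
```

In this made-up result the two missed pedestrians contribute 20 of the 23 penalty points, although they are fewer than the three vehicle confusions. Plain accuracy would treat all five errors alike.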

III Outlook

We propose a set of NN-specific and efficiently computable metrics for measuring the RICC dependability attributes of NN. At this point, we have also implemented a NN testing tool for evaluating the usefulness of our proposed set of metrics in on-going industrial NN developments. Our ultimate goal is to obtain a complete and validated set of NN dependability metrics. In this way, corresponding best practices can be identified as the basis of new safety processes for engineering NN-enabled components and systems.