A Programmatic and Semantic Approach to Explaining and Debugging Neural Network Based Object Detectors

by   Edward Kim, et al.
berkeley college

Even as deep neural networks have become very effective for tasks in vision and perception, it remains difficult to explain and debug their behavior. In this paper, we present a programmatic and semantic approach to explaining, understanding, and debugging the correct and incorrect behaviors of a neural network based perception system. Our approach is semantic in that it employs a high-level representation of the distribution of environment scenarios that the detector is intended to work on. It is programmatic in that the representation is a program in a domain-specific probabilistic programming language, from which synthetic data can be generated to train and test the neural network. We present a framework that assesses the performance of the neural network to identify correct and incorrect detections, extracts rules from those results that semantically characterize the correct and incorrect scenarios, and then specializes the probabilistic program with those rules in order to more precisely characterize the scenarios in which the neural network operates correctly or not, without human intervention to identify important features. We demonstrate our results using the SCENIC probabilistic programming language and a neural network-based object detector. Our experiments show that it is possible to automatically generate compact rules that significantly increase the correct detection rate (or conversely the incorrect detection rate) of the network and can thus help with debugging and understanding its behavior.




1 Introduction

Models produced by Machine Learning (ML) algorithms, especially deep neural networks (DNNs), have proved very effective at tasks in computer vision and perception. Moreover, ML models are being deployed in domains where trustworthiness is a big concern, such as automotive systems [NVIDIATegra], health care [alipanahi2015predicting], and cyber-security [dahl2013large]. Research in adversarial machine learning [goodfellow-cacm18], verification [dreossi-nfm17], and testing [DBLP:conf/icse/TianPJR18] has shown that DNN-based vision/perception systems are not always robust and can be fooled, sometimes leading to unsafe situations for the overall system (e.g., an autonomous vehicle).

Given this lack of robustness and potential for unsafe behavior, it is crucial that we develop methods to better understand, debug, and characterize scenarios where DNN-based perception components fail and where they perform correctly. The emerging literature on explaining and understanding ML models provides one approach to address this concern. However, while there are several ideas proposed to explain the behavior of ML-based perception (e.g. [DeepDream, ActivationAtlas, NIPS2017_7062, DBLP:journals/corr/SmilkovTKVW17, Qi_2019_CVPR_Workshops]), almost all of these operate on the concrete input feature space of the network. For example, attribution-based methods (e.g. [DBLP:journals/corr/SundararajanTY17, DBLP:journals/corr/abs-1711-05611, SelvarajuCDVPB17]) indicate pixels in an input image that are associated with the output of a DNN on that input. These methods, while very useful, do not directly identify the higher-level “semantic” features of the scene that are associated with that decision; they require a human to make that judgment. Additionally, in many cases it is important to generate “population-level explanations” of correct/incorrect behavior on such higher-level features. For example, it would be useful to identify whether a DNN-based object detector used on an autonomous vehicle generally misses cars of a certain model or color, or cars on a particular region of a road, and to leverage this knowledge to describe a high-level success/failure scenario of the object detector without human intervention.

In this paper, we present a programmatic and semantic approach to explaining and debugging DNN-based perception systems, with a focus on object detection. In this approach, we begin by formalizing the semantic feature space as a distribution over a set of scenes, where a scene is a configuration of objects in three dimensional space and semantic features are features of the scene that capture its semantics (e.g., the position and orientation of a car, its model and color, the time of day, weather, etc.). We then represent the semantic feature space using a program in a domain-specific programming language – hence the term programmatic. Given such a representation, and data corresponding to correct and incorrect behaviors of our DNN-based object detector, we seek to compute specializations of the program corresponding to those correct/incorrect behaviors. The specialized programs serve as interpretable representations of environment scenes that result in those correct/incorrect behaviors, enabling us to debug failure cases and to understand where the DNN succeeds.

We implement our approach using the Scenic probabilistic programming language [fremont-pldi19, scenic-www]. Probabilistic programming has already been demonstrated to be applicable to various computer vision tasks (see, e.g., [kulkarni2015picture]). Scenic is a domain-specific language used to model semantic feature spaces, i.e., distributions over scenes. It has a generative back-end that allows one to automatically produce synthetic data when it is connected to a renderer or simulator, such as that of the Grand Theft Auto V (GTA-V) video game. It is thus a particularly good fit for our approach. Using Scenic, we implement the workflow shown in Fig. 1. We begin with a Scenic program that captures a distribution that we would like our DNN-based detector to work on. Generating test data from this program, we evaluate the performance of the detector, partitioning the test set into correct and incorrect detections. For each partition, we use a rule extraction algorithm to generate rules over the semantic features that are highly correlated with successes/failures of the detector. Rule extraction is performed using decision trees and anchors. We further propose a novel white-box approach that analyzes the neuron activation patterns of the neural network to get insights into its inner workings. Using these activation patterns, we show how to derive semantically understandable rules over the high-level input features to characterize scenarios.

The generated rules are then used to refine the original program, yielding two programs that characterize more precisely the correct and incorrect feature spaces, respectively. Using this framework, we evaluate a DNN-based object detector for autonomous vehicles, using data generated with Scenic and GTA-V. We demonstrate that our approach is very effective, producing rules and refined programs that significantly increase the correct detection rate and the incorrect detection rate of the network (see Tables 6 and 7) and can thus help with debugging, understanding and retraining the network.

In summary, we make the following contributions:

  • Formulation of a programming language-based semantic framework to characterize success/failure scenarios for an ML-based perception module;

  • An approach based on anchors and decision tree learning methods for deriving rules for refining scenario programs;

  • A novel white-box technique that uses activation patterns of convolutional neural networks (CNN) to enhance feature space refinement;

  • Experimental results demonstrating that our framework is effective for a complex convolutional neural network for autonomous driving, and

  • Development of a data generation platform for the community for research into debugging and explaining DNN-based perception.

2 Background

Feature        Range
Weather        Neutral, Clear, Extrasunny, Smog, Clouds, Overcast, Rain, Thunder, Clearing, Xmas, Foggy, Snowlight, Blizzard, Snow
Time           [00:00, 24:00)
Car Model      Blista, Bus, Ninef, Asea, Baller, Bison, Buffalo, Bobcatxl, Dominator, Granger, Jackal, Oracle, Patriot, Pranger
Car Color      R = [0, 255], G = [0, 255], B = [0, 255]
Car Heading    [0, 360) deg
Car Position   Anywhere on a road on GTA-V’s map

Table 1: Environment features and their ranges in GTA-V

Scenic [fremont-pldi19, scenic-www] is a probabilistic programming language for scenario specification and scene generation. The language can be used to describe environments for autonomous systems, e.g., autonomous cars or robots. Such environments are scenes, i.e. configurations of objects and agents. Scenic allows assigning distributions to the features of the scenes, as well as hard and soft mathematical constraints over the features in the scene. Generating scenes from a Scenic program requires sampling from the distributions defined in the program. Scenic comes with efficient sampling techniques that take advantage of the structure of the Scenic program to perform aggressive pruning. The generated scenes are rendered into images with the help of a simulator. In this paper (similarly to [fremont-pldi19]) we use the Grand Theft Auto V (GTA-V) game engine [gtav] to create realistic images, and our case study uses SqueezeDet [squeezeDet], a convolutional neural network for object detection in autonomous cars. Note that the framework we put forth is not specific to SqueezeDet, and can be used with other object detectors as well.

The semantic features that we use in our case study are described in Table 1. These features are determined and limited by the environment parameters that are controllable by users in the simulator. If distributions over these environment features are not specified in a Scenic program, then, by default, they are uniformly randomly selected from ranges shown in Table 1. Note that for a different application domain, we would have a different set of features.

Scenic is designed to be easily understandable, with simple and intuitive syntax. We illustrate it via an example, shown in Figure 2. The formal syntax and semantics can be found in [fremont-pldi19].

As shown in Figure 2, the program describes a rare situation where a car is illegally intruding over a white striped traffic island, either to cut in or to belatedly avoid entering an elevated highway. In line 1, “param time = (6*60, 18*60)” means that the time of day is uniformly randomly sampled from 6:00 to 18:00. In line 2, an ego car is placed at a specific x @ y coordinate on GTA-V’s map. In line 4, a spot on the traffic island that is within the visible region of a camera mounted on the ego car is selected; the spot is sampled uniformly at random from the visible part of the traffic island. In lines 7 and 8, otherCar is placed on the spot facing -90 to 90 degrees off of where the ego car is facing, simulating cases when a car may be protruding into a traffic flow. Lastly, Scenic allows users to define hard and soft constraints using “require” statements. In this scenario, all four require statements define hard constraints. In line 10, the entire surface of otherCar must be within the view region of the ego car, so a scene where only the front half of otherCar is visible is not allowed. In line 11, otherCar must be positioned in the right half of the ego car’s visible region. In lines 12 and 13, the distance of otherCar from the ego car must be between 5 and 20 meters.

Figure 2: Example Scenic program
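To make the sampling semantics concrete, the following Python sketch mimics the described scenario with naive rejection sampling: features are drawn from uniform distributions, and a draw is kept only if it satisfies the hard constraints. This is not Scenic code and not Scenic's actual sampler (which prunes the search space); the rectangular stand-in for the traffic island, the coordinate frame, and all names are assumptions made for illustration.

```python
import math
import random

def sample_scene(rng, ego=(0.0, 0.0), max_tries=10_000):
    """Draw one scene by rejection sampling; the geometry is illustrative only."""
    for _ in range(max_tries):
        scene = {
            # time of day in minutes, uniform between 6:00 and 18:00
            "time": rng.uniform(6 * 60, 18 * 60),
            # hypothetical rectangle standing in for the visible traffic island
            "other_x": rng.uniform(-25.0, 25.0),
            "other_y": rng.uniform(0.0, 25.0),
            # heading of otherCar relative to the ego car, in degrees
            "heading_offset": rng.uniform(-90.0, 90.0),
        }
        dist = math.hypot(scene["other_x"] - ego[0], scene["other_y"] - ego[1])
        # hard constraints, in the spirit of the `require` statements:
        # otherCar in the right half of the view and 5 to 20 meters away
        in_right_half = scene["other_x"] > ego[0]
        if in_right_half and 5.0 <= dist <= 20.0:
            scene["distance"] = dist
            return scene
    raise RuntimeError("no scene satisfied the hard constraints")

rng = random.Random(0)
scenes = [sample_scene(rng) for _ in range(100)]
```

Every returned scene satisfies the constraints by construction; rejected draws are simply discarded, which is the simplest (if inefficient) way to realize hard constraints.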

3 Approach

Figure 3: Analysis Pipeline

The key idea of our approach is to leverage the high-level semantic features formally encoded in a Scenic program to derive rules (sufficient conditions) that explain the behavior of a detection module in terms of those features. Our hypothesis is that since these features describe the important characteristics that should be present in an image and furthermore they are much fewer than the raw, low-level pixels, they should lead to small, compact rules that have a clear meaning for the developer.

A high-level overview of our analysis pipeline is illustrated in Figure 3. We start with a Scenic program that encodes constraints (and distributions) over high-level semantic features that are relevant for a particular application domain, in our case object detection for autonomous driving. Intuitively, the program (henceforth called scenario) encodes the environments that the user wants to focus on in order to test the module. Based on this scenario, Scenic generates a set of feature vectors by sampling from the specified distributions. A simulator is then used to generate a set of realistic images (i.e. raw low-level pixel values) based on those features.

The images are fed to the object detector. Each image is assigned a binary label, correct or incorrect, based on the performance of the object detector on the image (see Section 3.1). The labels obtained for the images are mapped back to the feature vectors that led to the generation of the respective images. The result is a labeled data set that maps each high-level feature vector to its respective label.

We then use off-the-shelf methods to extract rules from this data set. The rule extraction is described in more detail below. The result is a set of rules encoding the conditions on high-level features that lead to likely correct or incorrect detection. The obtained rules can be used to refine the Scenic program, which in turn can be sampled to generate more images that can be used to test, debug or retrain the detection module. This iterative process can continue until one obtains refined rules, and Scenic programs, of desired precision.

In the following we provide more details about our approach.

3.1 Labelling

The label (correct/incorrect) for an image is obtained using the F1 score (the harmonic mean of precision and recall), a metric commonly used in the statistical analysis of binary classification. The F1 score is computed as follows. For each image, the number of true positives (TP) is the number of ground truth bounding boxes correctly predicted by the detection module, where correctly predicted means that the intersection-over-union (IoU) between the predicted and ground truth boxes is greater than 0.5. The number of false positives (FP) is the number of predicted bounding boxes that do not correctly match a ground truth box; this includes duplicate predictions on one ground truth box. The number of false negatives (FN) is the number of ground truth boxes that are not detected correctly.

We computed the F1 score for each image; if it was greater than a threshold, we assigned the label correct, and otherwise incorrect. The threshold used in our experiments was 0.8.
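The labelling step can be sketched in Python as follows. The IoU threshold (0.5) and F1 threshold (0.8) come from the text; the greedy one-to-one matching of predictions to ground truth boxes, the (x1, y1, x2, y2) box format, and the treatment of an image with no boxes at all are assumptions of this sketch.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def label_image(truth, preds, iou_thresh=0.5, f1_thresh=0.8):
    """Return (label, f1) for one image from ground truth and predicted boxes."""
    matched = set()
    tp = fp = 0
    for p in preds:
        # best ground truth box that has not been claimed by an earlier prediction
        candidates = [g for g in range(len(truth)) if g not in matched]
        best = max(candidates, key=lambda g: iou(truth[g], p), default=None)
        if best is not None and iou(truth[best], p) > iou_thresh:
            matched.add(best)
            tp += 1
        else:
            fp += 1  # unmatched or duplicate prediction
    fn = len(truth) - tp
    denom = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); an image with no boxes at all is scored 1.0 (assumption)
    f1 = 2 * tp / denom if denom else 1.0
    return ("correct" if f1 > f1_thresh else "incorrect"), f1

truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
perfect = label_image(truth, [(0, 0, 10, 10), (20, 20, 30, 30)])
missed = label_image(truth, [(0, 0, 10, 10)])
duplicate = label_image(truth, [(0, 0, 10, 10), (0, 0, 10, 10), (20, 20, 30, 30)])
```

In the last call both cars are found, but the duplicate box counts as a false positive, so F1 = 4/5 = 0.8, which is not strictly greater than the threshold, and the image is labelled incorrect.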

3.2 Challenges and Assumptions

We note that by using semantic features that are interpretable to humans, we heavily abstract an image with 1920 x 1200 resolution to a vector of, at most, a few dozen features. For example, Scenario 1 as shown in Figure 2 represents an image with a vector of only 16 different features. This is a severely under-determined representation of an image. Hence, it is certainly possible that there are no noticeable patterns within these features, or that any patterns that do exist are not precise enough to filter out a set of feature vectors pertaining to one class. However, our experiments indicate that it is indeed possible to derive compact rules, with high precision and good coverage, in terms of these abstract features.

We also note that all the Scenic programs we experimented with contained only uniform distributions. Also, for each of the scenario programs that we analyzed, we fixed the location and heading angle of the camera, since they are not features that appear in the Scenic programs that we analyzed. Given the challenge above, we restricted our settings to validate whether there exists any rule to extract, even in this simplified setting.

3.3 Rule Extraction

3.3.1 Methods

We experimented with two methods, decision tree (DT) learning for classification [quinlan1986induction] and anchors [DBLP:conf/aaai/Ribeiro0G18], to extract rules capturing the subspace of the feature space defined in the given Scenic program.

Decision tree learning is commonly used to extract rules explaining the global behavior of a complex system, while the anchors method is a state-of-the-art technique for extracting explanation rules that are locally faithful.

Decision trees encode decisions (and their consequences) in a tree-like structure. They are highly interpretable, provided that the trees are short. One can easily extract rules for explaining different classes by simply following the paths through the trees and collecting the decisions encoded in the tree nodes. We used the rpart [rpart] package in R, which implements the corresponding algorithm in [cart84], with the default parameters.
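The path-following step can be illustrated with a small Python sketch. It is not the rpart implementation used in our experiments; the nested-dictionary tree encoding, the toy tree, and its thresholds are hypothetical and chosen only to show how a rule is read off a root-to-leaf path.

```python
def extract_rules(node, target, path=()):
    """Collect the root-to-leaf conditions of every leaf that predicts `target`."""
    if "label" in node:  # leaf node
        return [list(path)] if node["label"] == target else []
    feature, threshold = node["feature"], node["threshold"]
    # left branch: feature < threshold; right branch: feature >= threshold
    left = extract_rules(node["left"], target, path + ((feature, "<", threshold),))
    right = extract_rules(node["right"], target, path + ((feature, ">=", threshold),))
    return left + right

# Hypothetical two-level tree over semantic features (values are illustrative).
toy_tree = {
    "feature": "distance", "threshold": 10.0,
    "left": {"label": "incorrect"},
    "right": {
        "feature": "hour", "threshold": 7.0,
        "left": {"label": "incorrect"},
        "right": {"label": "correct"},
    },
}

correct_rules = extract_rules(toy_tree, "correct")
incorrect_rules = extract_rules(toy_tree, "incorrect")
```

Each rule is a conjunction of conditions; here the single rule for the correct class reads "distance >= 10.0 and hour >= 7.0", while the incorrect class is covered by two shorter rules.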

The anchor method is a state-of-the-art technique that aims to explain the behavior of complex ML models with high-precision rules called anchors, representing local, sufficient conditions for predictions. The system can efficiently compute these explanations for any black-box model with high-probability guarantees. We used an existing implementation of the method with the default parameters. Applying the method to the object detector directly would result in anchors describing conditions on low-level pixel values, which would be difficult to interpret and use. Instead, what we want is to extract anchors in terms of high-level features. While one can use the simulator together with the object detector as the black-box model, this would be very inefficient. Instead, we built a surrogate model mapping high-level Scenic features to output labels; we used a random forest learning algorithm for this purpose. This surrogate model was then passed to the anchor method to extract the rules.

3.3.2 Blackbox vs Whitebox Analysis

So far we explained how we can obtain rules when treating the detection module as a black box. We also investigated a white-box analysis, to determine whether we can exploit the information about the internal workings of the module to improve the rule inference.

The white-box analysis is one of our novel contributions in this paper. We leverage recent work on property inference for neural networks [DBLP:journals/corr/abs-1904-13215]. The properties are in terms of on/off activation patterns (at different internal layers) that lead to the same predictions. These patterns are computed by applying decision-tree learning over the activations observed during the execution of the network on the training or testing set.

We analyzed the architecture of the SqueezeDet network and we determined that there are three maxpool layers which provide a natural decomposition of the network. Furthermore they have relatively low dimensionality making them a good target for property inference.

We consider activation patterns over maxpool neurons based on whether the neuron output is greater than or equal to zero. A decision tree can then be learned over these patterns to fit the prediction labels. For our experiments we selected patterns from the maxpool layer 5, which turned out to be highly correlated to images that lead to correct/incorrect predictions.

We then augmented the assigned correct and incorrect labels with the corresponding decision patterns in the following way. Using the decision pattern for correct labels (i.e. the decision pattern most correlated with images labelled correct), we created two sub-classes of the correct class: we fed only the images labelled correct to the perception module, and re-labelled the images inducing the decision pattern as “correct-decision-pattern” and the remaining ones as “correct-unlabelled.” Likewise, the incorrect class is augmented using the decision pattern that is most correlated with images labelled incorrect. Our intuition is that the decision pattern captures more focused properties (or rules) among images belonging to a target class. Hence, we hypothesize that this label augmentation would help the anchor and decision tree methods to better identify rules.
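The label augmentation can be sketched as follows in Python. A decision pattern is represented here as a mapping from neuron index to a required on/off state, where a neuron is "on" when its output is greater than or equal to zero; this encoding, the toy patterns, and the activation values are assumptions for illustration, and the sub-class names for the incorrect class are formed by analogy with those for the correct class.

```python
def matches(activations, pattern):
    """True when every neuron named in `pattern` has the required on/off state."""
    return all((activations[i] >= 0) == on for i, on in pattern.items())

def augment_label(label, activations, patterns):
    """Split `label` ('correct' or 'incorrect') into two sub-classes."""
    if matches(activations, patterns[label]):
        return f"{label}-decision-pattern"
    return f"{label}-unlabelled"

# Hypothetical decision patterns over three maxpool neurons.
patterns = {
    "correct": {0: True, 2: False},  # neuron 0 on, neuron 2 off
    "incorrect": {1: False},         # neuron 1 off
}

a = augment_label("correct", [0.3, -1.0, -0.2], patterns)
b = augment_label("correct", [0.3, 0.0, 0.5], patterns)
c = augment_label("incorrect", [0.0, 0.4, 0.0], patterns)
```

Note that each image is checked only against the pattern of its own class, which matches the description of feeding only the images of one class to the module.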

3.3.3 Rule Selection Criteria

Once we extracted rules with either DT or anchors, we selected the best rule using the following criteria. First, we chose the rule with the highest precision on a held-out test set of feature vectors. If more than one rule had the same highest precision, we chose among them the rule with the highest coverage (i.e. the number of feature vectors satisfying the rule). Finally, if more than one rule was still left, we broke the tie by choosing the most compact rule, i.e. the one with the fewest features. The last two criteria are established to select the most general high-precision rule.
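These criteria amount to a lexicographic ordering, which the following Python sketch expresses with a single sort key. The `Rule` record and the candidate rules are hypothetical; only the ordering (precision, then coverage, then compactness) comes from the text.

```python
from typing import NamedTuple

class Rule(NamedTuple):
    conditions: tuple   # one entry per feature condition in the rule
    precision: float    # precision on the held-out test set
    coverage: int       # number of feature vectors satisfying the rule

def select_best(rules):
    """Highest precision, then highest coverage, then fewest conditions."""
    return max(rules, key=lambda r: (r.precision, r.coverage, -len(r.conditions)))

# Hypothetical candidates, for illustration only.
candidates = [
    Rule(("distance >= 10",), 0.90, 300),
    Rule(("distance >= 10", "hour >= 7"), 0.95, 200),
    Rule(("hour >= 7", "weather != NEUTRAL"), 0.95, 250),
    Rule(("hour >= 7",), 0.95, 250),
]
best = select_best(candidates)
```

The last two candidates tie on precision and coverage, so the one with a single condition is selected.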

4 Experiments

In this section we report on our experiments with the proposed approach on the object detector. We investigate whether we can generate rules that are effective in explaining the correct/incorrect behavior with increased precision. We evaluate the proposed techniques along the following dimensions: DT vs anchors, black-box (BB) vs white-box (WB).

4.1 Scenarios

For the experiments, we wrote four different scenarios in the Scenic language. The corresponding Scenic programs are shown in Figures 5 and 6.

Scenario 1 describes the situation where a car is illegally intruding over a white striped traffic island at the entrance of an elevated highway. This scenario captures scenes where a car abruptly switches into or away from the elevated highway by crossing over the traffic island. The images from Scenario 1 are shown in Figure 7.

Scenario # Rules
(BaselineRule Precision)
Scenario 1 x coordinate -198.1
hour 7.5
weather = all except neutral
Scenario 2 car0 distance from ego 11.3m
() car0 model = {Asea, Bison, Blista, Buffalo, Dominator, Jackal, Ninef, Oracle}
Scenario 3 car0 red color 74.5
() car0 heading 220.3 deg
car0 model = {Asea, Baller, Blista,
Scenario 4 Buffal, Dominator, Jackal, Ninef,
() Oracle}

Table 2: Rules for correct behaviors of the detection module with the highest precision from Table 6
Scenario # Rules
(BaselineRule Precision)
x coordinate -200.76
Scenario1 distance 8.84
() car model = PRANGER
hour 7.5
Scenario 2 weather = all except Neutral
() car0 distance from ego 11.3
weather = neutral
Scenario 3 agent0 heading = 218.08 deg
() hour 8.00
car2 red color 95.00
car0 model = PATRIOT
car1 model = NINEF
Scenario 4 car2 model = BALLER
() 92.25 car0 green color 158
car0 blue color 84.25
178.00 car2 red color 224

Table 3: Rules for incorrect behaviors of detection module with the highest precision from Table 7

Scenario 2 describes a two-car scenario where a car occludes the ego car’s view of another car at a T-junction intersection. In the Scenic program, to cause occlusion in scenes, we place an agent0 car within the visible region of the ego car. Then, we place an agent1 car in close vicinity to the agent0 car, defined by small horizontal (i.e. leftRight) and vertical (i.e. frontback) perturbations in the program. These perturbations are measured in meters. The images from Scenario 2 are shown in Figure 8.

Scenario 3 describes scenes where three cars are merging into the ego car’s lane. The location for Scenario 3 is carefully chosen such that the sun rises in front of the ego car, causing a glare. The Scenic program describes three cars merging in a platoon-like manner where one car is following another car in front with some variations in distance between front and rear cars. The metric for distance perturbation is in meters. The images from Scenario 3 are shown in Figure 9.

Finally, Scenario 4 describes a set of scenes in which the nearest car is abruptly switching into the ego car’s lane while another car in the opposite-direction lane is slightly intruding over the middle yellow line into the ego car’s lane. Failure to detect these two cars out of the four cars may potentially be safety-critical. The images from Scenario 4 are shown in Figure 10. The locations of all four cars in the Scenario 4 Scenic program are hard-coded with respect to the ego car’s location. The Scenic program would have been much more interpretable had we described car locations with respect to lanes. We had to code it in this undesirable manner because of a limitation of the simulator, as discussed in Section 6.

5 Success and Failure Scenario Descriptions

The refined Scenic programs characterizing success/failure scenarios are shown in Figures 11, 12, 13, and 14. The red/green parts of the programs represent the rules automatically generated by our technique, which are cut and pasted into the original Scenic programs. These success/failure inducing rules are shown in Tables 2 and 3. As mentioned above, we generated new images using Scenic programs that characterize failure scenarios. Examples of these images from failure scenarios are shown in Figures 15, 16, 17, and 18.

5.1 Setup

The object detector was trained on 10,000 GTA images with one to four cars in various locations of the map producing different background scenes. The GTA-V simulator provided images, ground truth boxes, and values of the environment features.

For each scenario, we generated 950 images as a train set and another 950 images as a test set.

We denote the labels corresponding to the maxpool layer 5 decision pattern as p5c (correct) and p5_ic (incorrect), and the remaining ones as correct_unlabelled and incorrect_unlabelled, respectively. We augmented the feature vector with some extra features that are not part of the feature values provided by the simulator but could help with extracting meaningful rules. For example, in Scenario 1 the distance from ego to otherCar is not part of the feature values provided by GTA-V. However, it can be computed with the Euclidean distance metric using the (x,y) location coordinates of ego and otherCar. The difference in heading angle between ego and otherCar is also added as an extra feature, to represent the “badAngle” variable in the program.
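A minimal Python sketch of these two derived features is given below. The dictionary keys and the wrapping of the heading difference into [-180, 180) degrees are assumptions of the sketch; the text only specifies a Euclidean distance over (x,y) coordinates and a heading-angle difference.

```python
import math

def derived_features(ego, other):
    """Extra features computed from simulator values (key names are illustrative)."""
    distance = math.hypot(other["x"] - ego["x"], other["y"] - ego["y"])
    # signed heading difference, wrapped into [-180, 180) degrees (assumption)
    heading_diff = (other["heading"] - ego["heading"] + 180.0) % 360.0 - 180.0
    return {"distance": distance, "heading_diff": heading_diff}

ego = {"x": 0.0, "y": 0.0, "heading": 350.0}
other_car = {"x": 3.0, "y": 4.0, "heading": 10.0}
extra = derived_features(ego, other_car)
```

Wrapping matters because headings lie in [0, 360): a car facing 10 degrees and one facing 350 degrees differ by 20 degrees, not 340.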

From the train set, we extracted rules to predict each label based on the feature vectors. These rules were evaluated on the test set based on precision, recall, and F1 score metrics.

For DT learning we adjusted the label weight to account for the uneven ratio among labels for both black-box and white-box labels.

For the Anchors method, we applied it on each instance of the training set until we had covered a maximum of 50 instances for every label (correct and incorrect for black-box; p5c, p5_ic, correct_unlabelled, and incorrect_unlabelled for white-box). The best anchor rule for every label is selected based on the rule selection criteria described in Section 3.3.3.

Figure 4: The cumulative ratio of incorrectly detected images generated from refined Scenic programs (using incorrect rules) stabilizes over 500 samples. Each color has four graphs representing four different rule extraction methods

Scenario #      1       2       3       4
Correct DP      0.626   0.651   0.514   0.824
Incorrect DP    0.276   0.175   0.234   0.212

Table 4: Support for correct and incorrect decision patterns

Scenario #          1       2       3       4
BB Decision Tree    0.723   0.342   0.631   0.622
WB Decision Tree    0.727   0.696   0.601   0.778
BB Anchor           0.361   0.457   0.302   0.438
WB Anchor           0.520   0.188   0.149   0.438

Table 5: F1 score of correct rules on testset

Scenario #          1       2       3       4
Original Program    0.653   0.723   0.617   0.896
BB Decision Tree    0.843   0.778   0.787   0.950
WB Decision Tree    0.826   0.823   0.788   0.962
BB Anchor           0.727   0.811   0.652   0.928
WB Anchor           0.894   0.817   0.794   0.928

Table 6: Precision of correct rules on testset

5.2 Results

Tables 2 and 3 show the best rules (w.r.t. both precision and recall) extracted with our proposed framework, along with the baseline correct/incorrect detection rate for each scenario and the detection rate (i.e. precision) for the generated rules. The results indicate that our framework can indeed generate rules that significantly increase the correct and incorrect detection rates of the module. Furthermore, the generated rules are compact and easily interpretable.

For example, the rule for correct behavior for Scenario 1 is a single condition on the x coordinate, with threshold -198.1 (Table 2). In GTA-V, at the ego car’s specific location, this condition is equivalent to otherCar’s distance from the ego car being greater than 11m. On the other hand, the rule for incorrect behavior for Scenario 1 requires otherCar to be within 8.84m and its car model to be PRANGER. These rules indicate that the object detector fails when otherCar is close by, and performs well when it is located further away.

In the following sections we describe the results in more detail.

Scenario #          1       2       3       4
Original Program    0.347   0.277   0.383   0.104
BB Decision Tree    0.703   0.418   0.506   0.375
WB Decision Tree    0.73    0.449   0.494   0.099
BB Anchor           0.872   0.357   0.834   0.573
WB Anchor           0.674   0.422   0.365   0.176

Table 7: Incorrectly detected image ratio among five hundred new data generated from each refined Scenic program

Scenario #                1       2       3       4
Feature Space Coverage    0.692   0.956   0.898   0.871

Table 8: The proportion of original feature space covered by the best incorrect rule for each scenario from Table 7

5.2.1 Results for Correct Behavior

Tables 5 and 6 summarize the results for the rules explaining correct behavior. The results indicate that there are clear signals in the heavily abstracted feature space and that they can be used effectively for scenario characterization via the generated high-precision rules. The results also indicate that, in every scenario, the best DT rule has a higher F1 score than the best anchor rule (Table 5). This could be attributed to the difference in the nature of the techniques. The anchor approach aims to construct rules that have high precision in the locality of a given instance. Decision trees, on the other hand, aim to construct global rules that discriminate one label from another. Given that a large proportion of instances were detected correctly by the analyzed module, the decision tree was able to build rules with high precision and coverage for correct behavior.

The results also highlight the benefit of using white-box information to extract rules for correct behavior. Table 4 shows that the support for the correct decision pattern is significant (just over 65% on average across the four scenarios). Using this information to augment the labels of the dataset improves the precision of the rules (w.r.t. Scenic features) in most cases for both DT learning and the anchor method (Table 6), and improves the F1 score of the DT rules in three of the four scenarios (Table 5).

5.2.2 Results for Incorrect Behavior

Tables 7 and 3 summarize the results for the rules explaining incorrect behavior. Rule derivation for incorrect behavior is more challenging than for correct behavior due to the low percentage of inputs that lead to incorrect detection for a well-trained network.

In fact, the F1 scores (computed on the test set) for rules predicting incorrect behavior were too low to be useful, due to very low (in some cases zero) recall values.

To properly validate the efficacy of the generated rules, we refined the Scenic programs by encoding the rules as constraints and we generated 500 new images. We then evaluated our module’s performance on these new datasets. Figure 4 justifies our choice of 500 as the number of new images that we generate for evaluation.
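The quantity plotted in Figure 4 is a running ratio, which can be computed as in the following Python sketch. The function name and the short label sequence are illustrative assumptions; in our setting the input would be the sequence of labels assigned to the images generated from a refined program.

```python
def cumulative_incorrect_ratio(labels):
    """Running fraction of 'incorrect' labels after each new sample."""
    ratios, wrong = [], 0
    for n, label in enumerate(labels, start=1):
        wrong += label == "incorrect"  # a bool counts as 0 or 1
        ratios.append(wrong / n)
    return ratios

# Illustrative label sequence only.
ratios = cumulative_incorrect_ratio(["incorrect", "correct", "incorrect", "incorrect"])
```

Plotting such a curve against the number of samples shows when the ratio stops fluctuating, which is the sense in which a sample size can be judged sufficient.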

All four methods contributed to more precisely identifying the subsets of the feature space in which the module performs worse. Specifically, Table 7 illustrates that the black-box anchor method increased the generation rate of incorrectly detected images by 48% on average in Scenarios 1, 3, and 4 compared to the baseline. This is a significant increase in the ratio of incorrectly labelled images generated from the program, providing evidence that the refined programs characterize the failure scenarios more precisely. We also note that the anchor method outperforms DT learning. This is expected, because the anchor method extracts rules that are highly precise within a local feature space. The exception is Scenario 2. We conjecture that the anchor method did not outperform DT learning there because of uncontrollable non-determinism in GTA-V. The location where Scenario 2 takes place has pedestrians moving around in close vicinity to the camera of the ego car. We observed that our perception module often incorrectly predicted the pedestrians as cars. While we could handle the case where pedestrians are deterministically placed at the same location, non-deterministic behavior poses a challenge to our approach. GTA-V does not allow users to control or eliminate these pedestrians, and it does not provide pedestrian-related features during the data collection process. In future work, we plan to incorporate simulators with deterministic control (such as CARLA [carla]) for further experimentation.

We note that the refined programs do not trivially define very narrow scenarios. Table 8 quantifies the proportion of each refined Scenic program’s feature space relative to that of the corresponding original Scenic program. Even for Scenarios 1 and 3, with highly precise rules, feature coverage is close to 70%. This indicates that the extracted rules are not only precise but also general, capturing a wide variety of scenes.
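One plausible way to compute such a proportion is sketched below, under the simplifying assumptions that features are independent, uniformly distributed over an interval, and constrained by the rule only through sub-intervals. The feature names and ranges are hypothetical, and the measure actually used for Table 8 may differ.

```python
from typing import Dict, Tuple

Interval = Tuple[float, float]


def coverage(original: Dict[str, Interval], rule: Dict[str, Interval]) -> float:
    """Fraction of the original feature space that still satisfies the rule.

    Assumes independent, uniformly distributed features, so the fraction is
    the product of per-feature interval-length ratios.
    """
    fraction = 1.0
    for name, (lo, hi) in original.items():
        # Features the rule does not mention keep their full range.
        rule_lo, rule_hi = rule.get(name, (lo, hi))
        kept = max(0.0, min(hi, rule_hi) - max(lo, rule_lo))
        fraction *= kept / (hi - lo)
    return fraction


# Hypothetical feature ranges of the original program.
original = {"distance": (5.0, 50.0), "heading": (-30.0, 30.0)}

# Hypothetical rule that only restricts the distance feature.
rule = {"distance": (18.5, 50.0)}

ratio = coverage(original, rule)  # (50 - 18.5) / (50 - 5) = 31.5 / 45 = 0.7
```

For non-uniform distributions or rules that couple several features, the same quantity could instead be estimated by sampling from the original program and counting the scenes that satisfy the rule.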

Unlike the results for correct behavior, the white-box approach tends to perform worse than the black-box approach when focusing on incorrect behavior. This outcome can be attributed to the very low support for decision patterns computed for incorrect behavior, with a maximum of 27.6% among the four scenarios.

However, we do observe that the white-box approach for both DT learning and anchors does enhance the ratio of incorrectly detected images as shown in Table 7, compared to those of the original programs.

6 Limitations of Scenic Description

The expressiveness and interpretability of the Scenic language is enhanced by leveraging detailed information about the map used in the simulator. For example, in Scenario 1, as shown in Figure 5(a), we can directly refer to a specific part of the road such as “curb” because we have already identified the curb regions in the map. Describing the placement of “otherCar” with respect to a spot on the curb makes the program easily understandable to humans. However, because GTA-V is not an open-source simulator (in fact, it is not even meant to be used as a simulator, but is widely used for its realistic rendering), we could not parse out detailed map information such as the regions of different lanes, the locations of traffic lights, etc. As the number of cars scaled, describing increasingly complicated geometric relations among cars without any reference points, objects, or regions, such as lanes, became more challenging. As a result, we were limited to describing geometric relations in scenarios by referencing only cars, which resulted in much less understandable Scenic programs and diminished the value of Scenic as a scenario description. We emphasize that this limitation is due to the inaccessible map information of the simulator, not to Scenic. In fact, Scenic has syntax to refer to other entities in the scene if that information is accessible in the simulator’s map. Such references would significantly enhance the interpretability of the Scenic programs we generate with our method.

7 Related Work

Most of the techniques that aim to provide explainability and interpretability for deep neural networks (DNNs) in the field of computer vision focus on attributing the network’s decisions to portions of the input images [NIPS2017_7062, DBLP:journals/corr/SmilkovTKVW17, Qi_2019_CVPR_Workshops, DBLP:journals/corr/SundararajanTY17, DBLP:journals/corr/abs-1711-05611]. GradCAM [SelvarajuCDVPB17] is a popular approach for interpreting CNN models that visualizes how parts of the image affect the neural network’s output by looking into class activation maps (CAM). Other techniques focus on understanding the internal layers by visualizing their activation patterns [DeepDream, ActivationAtlas]. Our approach, on the other hand, aims to provide characterizations at a higher level than raw image pixels, namely at the level of abstract features defined in a Scenic program. These features can be more meaningful, can characterize populations of images on which a network behaves similarly, and are also much fewer in number than the pixels in an image. This allows one to synthesize compact, meaningful rules that characterize the correct and incorrect behavior of a network.

Several techniques focus on rule extraction from DNNs using decision trees or sets, e.g. [DeepRED, LakkarajuBL16]. The technique in [DeepRED] aims to represent the entire functionality of a network as a set of rules, which makes the result too complex, while the approach in [LakkarajuBL16] requires a pre-mined set of rules, which would be difficult to obtain for object detection. Anchors [DBLP:conf/aaai/Ribeiro0G18], which improves on LIME [LIME], is the rule extraction technique closest to our work (and we have already discussed it in the body of the paper).

Recent work aims to explain the decisions of DNNs in terms of higher-level concepts. The technique in [KimWGCWVS18] introduces the idea of concept activation vectors, which provide an interpretation of a neural network’s internal state in terms of human-friendly concepts. The approach presented in [Alyuz_2019_CVPR_Workshops] also interprets machine learning models and attempts to translate them to human-understandable attributes. Feature Guided Exploration [DBLP:conf/tacas/WickerHK18] aims to analyze the robustness of networks used in computer vision applications by applying perturbations over high-level input features extracted from raw images. It uses feature detection techniques (such as SIFT, the Scale-Invariant Feature Transform) to extract the features from an image. In contrast to these techniques, we directly leverage Scenic, which defines the high-level features in a way that is already understandable to humans.

Existing approaches typically use classification networks whose output directly corresponds to the decision being made and rely on the derivative of the output with respect to the input to calculate importance. In our application, there is no direct correlation between the output of the detector network (e.g. SqueezeDet) and the validity of the bounding boxes. Furthermore, unlike all previous work, we can use the synthesized rules to automatically generate more input instances, by refining the original Scenic program and then using it to generate data. These instances can be used to test, debug and retrain the network.

8 Conclusion and Future Work

We presented a semantic and programmatic framework for characterizing the success and failure scenarios of a given perception module in the form of programs. The method leverages the Scenic language to derive rules in terms of high-level, meaningful features and to generate new inputs that conform to these rules. The method is general and can be applied to other domains beyond the autonomous driving applications we have explored here.

For future work, we plan to look into more general refinements of the distributions and more general transformations of the Scenic programs. We also plan to look into retraining based on data generated from the refined programs. Finally, we plan to release the images we generated from success and failure scenarios so that the vision community can analyze the effects of environment features on the inner workings of object detectors.