Deep Neural Networks (networks) are being integrated into commercial, safety critical, autonomous systems that operate in unconstrained environments, e.g., perception for autonomous vehicles. When a network is deployed in an unconstrained environment, the operating domain, i.e., the distribution of context features that describe the network’s environment, can change significantly from the testing domain, i.e., the distribution of context features that describe the test data. Safety critical systems are regulated by international functional safety standards, e.g., ISO 26262 for the automotive industry, IEC 61508 for electronics and software. Functional safety standards leverage various techniques to verify the safety of software, including requirement specification, i.e., linking required system behavior to specific code modules, white box testing, i.e., testing specific inputs that cover all branches or behavior in the code, and code inspection and review to identify human error. These techniques are challenging or impossible to apply directly to networks, e.g., labeled data is used to implicitly specify the correct behavior in supervised learning, networks are black box systems, and network weights cannot be manually inspected to identify failure cases.
New techniques are needed to bridge the gap between the high performance of deep networks and the verification required for safety critical systems. In particular, the ability to predict how a network’s performance will change in a novel operating domain can enable verifying the required level of performance before a network is deployed; we denote this task Network Generalization Prediction. We propose a methodology for Network Generalization Prediction for networks trained via supervised learning. Our contributions are as follows:
We introduce the concept of a Context Subspace, a low-dimensional space, encoding the context features most informative about the network performance.
We propose a greedy feature selection algorithm for identifying the Context Subspace by 1) ranking the context features by the information they provide about the network loss, and 2) selecting the subspace dimensionality that leads to accurate Network Generalization Prediction.
We leverage a Context Subspace for accurate Network Generalization Prediction for pedestrian detection in diverse operating domains, with a prediction error from to for not safety critical pedestrians (pedestrians not in the road), and a prediction error from to for safety critical pedestrians (pedestrians in the road).
We demonstrate that the Context Subspace identified for the Berkeley Deep Drive Dataset can be used to predict pedestrian recall in completely unseen datasets, the JAAD and Cityscapes Datasets, with a prediction bias of or less.
2.1 Network Dependability
Avizienis et al. defined software dependability as “the ability to deliver service that can justifiably be trusted,” where dependability encompasses availability, reliability, safety, integrity, and maintainability. To describe the dependability of a learned model, O’Brien et al. defined ML Dependability as “the probability that a model will succeed when operated under specified conditions”. Cheng et al. proposed that Robustness, Interpretability, Completeness, and Correctness contribute to a network’s Dependability. Ponn et al. trained a random forest to predict whether a network would detect a pedestrian, based on pedestrian attributes; they denote this task Detection Performance Modeling. Where Detection Performance Modeling predicts whether one specific object will be detected, Network Generalization Prediction predicts the expected network performance for a given operating domain, described by a distribution of context features.
2.2 Network Generalization
It has been shown that underspecification causes network performance to degrade when the network is deployed in operating conditions different from the training and testing conditions. The WILDS benchmark was released to provide datasets with “in-the-wild” distribution shifts between the training and test data. Subbaswamy et al. propose to evaluate a model’s robustness to distribution shifts with one fixed evaluation set. Common techniques to improve network generalization include extracting features robust to changing conditions, zero- or few-shot learning, and identifying when an input is outside the network’s training distribution.
2.3 Feature Selection
Feature selection algorithms aim to select a subset of the available features, typically to use the features as input to train a model for a given task. Feature selection algorithms can be classified as filter methods, i.e., features are scored according to their association with the task label, wrapper methods, i.e., features are selected to minimize task error, and embedded methods, i.e., features are selected in the model training process. The Mutual Information is often used in filter methods to measure the information between a given feature and the desired label. As exhaustive feature selection search is typically intractable, greedy feature selection algorithms are often used. Note, greedy feature selection is related to matching pursuit in the sparse approximation literature and has applications in compressed sensing.
3.1 Problem Formulation
It is well known that in supervised learning, a network, $\mathcal{N}$, is trained to produce a label, $y$, from data, $x$, and a loss function, $\ell$, is used to drive training. In Network Generalization Prediction, we are not training $\mathcal{N}$. Instead, we aim to predict the performance of a fixed network $\mathcal{N}$, trained via supervised learning, when deployed in an operating domain, $D_O$, that differs from the testing domain, $D_T$, see Figure 1. The performance of $\mathcal{N}$ is measured using test data, $X = \{x_1, \dots, x_n\}$, and test labels, $Y = \{y_1, \dots, y_n\}$, via a loss function $L = \ell(\mathcal{N}(X), Y)$, where the elements of $L$, $l_i$, are assumed to be discrete and bounded, e.g., an object detection flag, whether a safety criterion was satisfied, or a discretized classification error.

$D_T$ is described via context features, $C = \{c_1, \dots, c_n\}$, where $c_i$ indicates a $k$-dimensional context vector associated with $x_i$. Context features, e.g., image brightness, weather, or robot speed, can be categorical or numerical; numerical features are assumed to be discrete or discretized. It is possible for multiple test samples to map to the same context, i.e., $c_i = c_j$ for $i \neq j$. $P_T(c)$ denotes the probability of encountering c in $D_T$. $D_O$ is described by the probability of encountering c in $D_O$, $P_O(c)$. In many practical applications, the likelihood of encountering a context may be known without annotated data, e.g., there is a chance of snow in Boston, etc. Note, labeled test data from $D_O$ is not required. We assume that while the distribution of contexts shifts between the testing and operating domains, i.e., $P_T(c) \neq P_O(c)$, the expected network performance in context c is stable for both the testing and operating domains. Table 1 describes the notation used in the Methods Section.

As is typical, we approximate the posterior expected loss in $D_T$, $\mathbb{E}_{D_T}[L]$, using the empirical loss, $\hat{L}_T$:

$$\hat{L}_T = \frac{1}{n} \sum_{i=1}^{n} l_i \tag{1}$$

We define $L(c) = \mathbb{E}[L \mid c]$. Let $\mathbb{1}_c(c_i)$ be an indicator function that is equal to $1$ if $c_i = c$ and $0$ otherwise. $L(c)$ can be computed as:

$$L(c) = \frac{\sum_{i=1}^{n} \mathbb{1}_c(c_i)\, l_i}{\sum_{i=1}^{n} \mathbb{1}_c(c_i)} \tag{2}$$

With $P_T(c)$ estimated as the fraction of test samples that fall in context $c$, $\hat{L}_T$ can equivalently be computed as:

$$\hat{L}_T = \sum_{c} P_T(c)\, L(c) \tag{3}$$

Likewise, we can now express the Network Generalization Prediction, $\hat{L}_O$, as:

$$\hat{L}_O = \sum_{c} P_O(c)\, L(c) \tag{4}$$

This formulation holds theoretically for any number of context features $k$. However, as $k$ grows linearly, computing Eqn. 4 requires exponentially more test samples to cover every possible $c$. Thus, we introduce the Context Subspace, $C_S$, a low-dimensional space, encoding the context features most informative about the network performance.
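Table 1: Notation used in the Methods Section.

|Symbol|Description|
|---|---|
|$X$|The Test Data|
|$Y$|The Test Labels|
|$\mathcal{N}$|The trained network|
|$L$|The Test Set Loss|
|$C$|The context features|
|$c$|A context vector|
|$L(c)$|The expected loss of $\mathcal{N}$ in c|
|$C_S$|The Context Subspace|
|$D_T$|The Testing Domain|
|$D_O$|The Operating Domain|
|$P_T(c)$, $P_O(c)$|The probability of c in $D_T$, $D_O$|
|$L_O$|The observed loss in $D_O$|
|$\hat{L}_O$|The predicted loss in $D_O$|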
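The per-context expected loss (Eqn. 2) and the reweighted prediction (Eqn. 4) can be illustrated with a minimal sketch. The function names, the toy contexts, and the probability values below are hypothetical and chosen only for illustration; contexts are represented as tuples of discrete feature values.

```python
from collections import defaultdict

def expected_loss_per_context(contexts, losses):
    """Mean observed loss for each context seen in the test set (Eqn. 2)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for c, l in zip(contexts, losses):
        totals[c] += l
        counts[c] += 1
    return {c: totals[c] / counts[c] for c in counts}

def predict_loss(context_probs, loss_per_context):
    """Weight each per-context loss by the context probability (Eqns. 3 and 4).

    Raises KeyError for a context that was never tested; handling untested
    contexts is a separate design decision.
    """
    return sum(p * loss_per_context[c] for c, p in context_probs.items())

# Toy test set: one context feature (time of day); loss 0 = detected, 1 = missed.
contexts = [("day",), ("day",), ("day",), ("night",)]
losses = [0, 0, 1, 1]
loss_c = expected_loss_per_context(contexts, losses)

p_test = {("day",): 0.75, ("night",): 0.25}       # matches the toy test set
p_operating = {("day",): 0.25, ("night",): 0.75}  # hypothetical shifted domain

print(predict_loss(p_test, loss_c))       # recovers the empirical test loss
print(predict_loss(p_operating, loss_c))  # predicted loss after the shift
```

Because only the context probabilities change between the two calls, no labeled data from the operating domain is needed.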
3.2 Defining a Context Subspace
We are interested in selecting the context features that provide the most information about the network loss, to include these features in the Context Subspace, $C_S$. Let $j \in \{1, \dots, k\}$ be the indices of context features of interest and $C = [C_1, \dots, C_k]$, where $C_j$ are the annotated attributes for each example in the test set for context feature $j$. To select the context features to include in $C_S$, we 1) rank the context features by how much information they provide about the network loss, and 2) select the dimensionality that enables accurate Network Generalization Prediction.
3.2.1 Ranking Context Features
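Recall, the Mutual Information is often used to rank features in filter feature selection algorithms and is computed as follows for loss $L$ and context feature $C_j$:

$$I(L; C_j) = \sum_{l \in L} \sum_{c_j \in C_j} p(l, c_j) \log \frac{p(l, c_j)}{p(l)\, p(c_j)} \tag{5}$$

where $p(l, c_j)$ indicates the joint probability of $l$ and $c_j$, and $p(l)$ and $p(c_j)$ indicate the marginal probabilities for $l$ and $c_j$, respectively. The Interaction Information is a generalization of the Mutual Information to $m$ features. The Interaction Information between $L$ and the context features $C_1, \dots, C_m$ is defined as:

$$I(L; C_1; \dots; C_m) = I(L; C_1; \dots; C_{m-1}) - I(L; C_1; \dots; C_{m-1} \mid C_m) \tag{6}$$

For two features, this becomes:

$$I(L; C_1; C_2) = I(L; C_1) - I(L; C_1 \mid C_2) \tag{7}$$

Where $I(L; C_1 \mid C_2)$ can be computed as:

$$I(L; C_1 \mid C_2) = \sum_{l} \sum_{c_1} \sum_{c_2} p(l, c_1, c_2) \log \frac{p(c_2)\, p(l, c_1, c_2)}{p(l, c_2)\, p(c_1, c_2)} \tag{8}$$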
The computational complexity of $I(L; C_1; \dots; C_m)$ grows combinatorially with $m$. We are interested in ranking the context features by the Interaction Information, but computing the exact Interaction Information becomes intractable as $m$ grows. To make computation tractable, we propose $\tilde{I}(L; C_j)$ to approximate how much more information including context feature $C_j$ in the Context Subspace provides about $L$:

$$\tilde{I}(L; C_j) = I(L; C_j) - \sum_{s \in S} I(C_j; C_s) \tag{9}$$

where $S$ denotes the indices of the features already included in the Context Subspace. Intuitively, $\tilde{I}(L; C_j)$ subtracts the redundant information in $C_j$, $\sum_{s \in S} I(C_j; C_s)$, from the information it provides about the loss, $I(L; C_j)$. Note that the computational complexity of computing $\tilde{I}$ grows linearly with $m$. Like the Interaction Information, $\tilde{I}$ can be positive or negative. In Appendix A, we show that for independent features in the Context Subspace, $\tilde{I}$ approaches the Interaction Information as $C_j$ approaches perfect information on $L$. We propose a greedy algorithm to iteratively select the most informative features from the context, see Algorithm 1.
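The greedy ranking described above can be sketched as follows. This is not a reproduction of Algorithm 1: it assumes the approximate score of a candidate feature is its Mutual Information with the loss minus the sum of its Mutual Information with each already-selected feature, which is one reading of the description above. The names `mutual_information` and `greedy_rank` and the toy data are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical Mutual Information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((nxy / n) * math.log(nxy * n / (px[x] * py[y]))
               for (x, y), nxy in joint.items())

def greedy_rank(features, losses):
    """Rank features by relevance to the loss minus redundancy with prior picks.

    features: dict mapping a feature name to its list of discrete values.
    Assumption: score = I(loss; feature) - sum of I(feature; selected feature).
    """
    relevance = {name: mutual_information(vals, losses)
                 for name, vals in features.items()}
    selected, remaining = [], list(features)
    while remaining:
        scores = {
            name: relevance[name]
            - sum(mutual_information(features[name], features[s]) for s in selected)
            for name in remaining
        }
        best = max(remaining, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: brightness is most informative; the copy is fully redundant.
brightness = [0, 0, 0, 0, 1, 1, 1, 1]
scene = [0, 0, 1, 1, 0, 0, 1, 1]
losses = [0, 0, 0, 1, 1, 1, 1, 1]
features = {"brightness": brightness, "scene": scene,
            "brightness_copy": list(brightness)}
ranking = greedy_rank(features, losses)
print(ranking)
```

In this toy example the redundant copy is ranked last even though, taken alone, it is as informative as the first-ranked feature, because its score becomes negative once the original is selected.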
3.2.2 Selecting the Context Subspace Dimensionality
Selecting the number of features, $d$, to include in $C_S$ is not trivial: including more features can lead to a more descriptive $C_S$ but can also lead to many untested contexts in $C_S$. To select $d$, we compute the expected prediction error for a given subspace dimensionality, $\bar{e}_d$. Using the $d$ most informative context features, $L(c_d)$ can be computed according to Eqn. 2, where $c_d$ is a $d$-dimensional feature vector in $C_S$. We iteratively compute the prediction error within the test set, $e_d$, to estimate the expected prediction error, see Algorithm 2. First, we randomly partition the test set into a fit set and a validation set: $X_{fit}$, $Y_{fit}$, with $n_{fit}$ samples and $X_{val}$, $Y_{val}$, with $n_{val}$ samples, respectively. We estimate $L(c_d)$ using the fit set. We compute the observed loss from the validation set, $L_{val}$. Let $P_{val}(c_d)$ indicate the probability of encountering context $c_d$ in the validation set. The prediction error, $e_d$, is the difference between the observed validation loss, $L_{val}$, and the predicted validation loss using $L(c_d)$. This procedure can be iterated multiple times, and the subsequent $e_d$’s averaged, to estimate the expected prediction error, $\bar{e}_d$, for different random fit and validation partitions of the test set. We select the $d$ that minimizes $\bar{e}_d$.

$\bar{e}_d$ measures the expected prediction error within $D_T$. When the context is informative about the loss, we expect $\bar{e}_d$ to decrease as $d$ increases until an optimal $d$ is reached, then $\bar{e}_d$ will begin to rise as $d$ increases and there are many untested contexts. If $\bar{e}_d$ is flat or increasing as $d$ increases, it indicates that the context features available are not informative about the loss.

After we have ranked the context features and selected the number of features to include in the subspace, we can form $C_S$. The $d$ most informative context features form the axes of the subspace. Recall, we assumed the context features are categorical or numerical and discrete; this yields a finite set of context partitions, $c \in C_S$.
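The dimensionality selection procedure can be sketched as repeated random fit/validation splits of the test set. This is an illustrative sketch rather than Algorithm 2 itself: the function names, the 50/50 split fraction, the use of the absolute difference as the prediction error, and the maximum loss assigned to contexts missing from the fit set are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

def mean_loss_by_context(contexts, losses):
    """Mean loss for each context observed in the given samples."""
    totals, counts = defaultdict(float), Counter()
    for c, l in zip(contexts, losses):
        totals[c] += l
        counts[c] += 1
    return {c: totals[c] / counts[c] for c in counts}

def split_error(contexts, losses, fit_fraction, rng, max_loss=1.0):
    """Prediction error for one random fit/validation partition."""
    order = list(range(len(losses)))
    rng.shuffle(order)
    n_fit = int(fit_fraction * len(order))
    fit, val = order[:n_fit], order[n_fit:]
    loss_c = mean_loss_by_context([contexts[i] for i in fit],
                                  [losses[i] for i in fit])
    val_counts = Counter(contexts[i] for i in val)
    # Contexts absent from the fit set get the maximum loss (assumption).
    predicted = sum(n / len(val) * loss_c.get(c, max_loss)
                    for c, n in val_counts.items())
    observed = sum(losses[i] for i in val) / len(val)
    return abs(observed - predicted)

def select_dimensionality(ranked_columns, losses, n_iter=50,
                          fit_fraction=0.5, seed=0):
    """Return the dimensionality with the lowest mean error, plus all errors."""
    rng = random.Random(seed)
    errors = {}
    for d in range(1, len(ranked_columns) + 1):
        contexts = list(zip(*ranked_columns[:d]))  # one tuple per sample
        errors[d] = sum(split_error(contexts, losses, fit_fraction, rng)
                        for _ in range(n_iter)) / n_iter
    return min(errors, key=errors.get), errors

# Toy data: the first feature determines the loss; the second is a unique
# identifier, so adding it leaves every validation context untested.
informative = [i % 2 for i in range(200)]
sample_id = list(range(200))
losses = list(informative)
best_d, errors = select_dimensionality([informative, sample_id], losses)
print(best_d, errors)
```

The toy data mirrors the trade-off in the text: one informative feature predicts the validation loss well, while adding a feature that fragments the test set produces many untested contexts and a larger error.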
3.3 Using the Context Subspace
We use $C_S$ to describe the expected network loss in different contexts, $L(c)$, and to describe the probability of encountering a context in the operating domain, $P_O(c)$. We can compute $L(c)$ using Eqn. 2; note we use the entire test set to compute $L(c)$ once we have selected the subspace dimensionality $d$. Recall, we do not assume to have labeled test data in $D_O$, but we do assume to know $P_O(c)$. Individual context feature probabilities can be multiplied to obtain a joint probability distribution if the context feature probabilities are assumed to be independent.
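As a small illustration of the independence assumption, the sketch below multiplies per-feature marginal probabilities to obtain a joint distribution over contexts. The function name and the marginal values are hypothetical.

```python
import math
from itertools import product

def joint_from_marginals(marginals):
    """Joint context distribution from independent per-feature marginals.

    marginals: list of dicts, one per context feature, mapping value -> probability.
    Returns a dict mapping each context tuple to the product of its marginals.
    """
    joint = {}
    for combo in product(*(m.items() for m in marginals)):
        context = tuple(value for value, _ in combo)
        joint[context] = math.prod(p for _, p in combo)
    return joint

# Hypothetical marginals for two context features.
brightness = {"dark": 0.3, "bright": 0.7}
scene = {"city street": 0.5, "residential": 0.3, "highway": 0.2}
joint = joint_from_marginals([brightness, scene])
print(joint[("dark", "highway")])
```

If the features are in fact correlated in the operating domain, this product will misstate the joint distribution, so the independence assumption should be checked where data permits.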
3.4 Network Generalization Prediction
We can now perform Network Generalization Prediction, where $\hat{L}_O$ is the predicted loss in $D_O$:

$$\hat{L}_O = \sum_{c \in C_S} P_O(c)\, L(c) \tag{10}$$

Recall, we selected a small number of informative context features so that it would be practical to describe the unique contexts $c \in C_S$, but there may be untested contexts in $C_S$. For conservative predictions, we assume the maximum loss in untested contexts. The maximum loss may correspond to a binary failure or a large expected error. Leveraging $C_S$ renders Network Generalization Prediction practical for interestingly complex applications, like perception for autonomous vehicles.
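The conservative treatment of untested contexts can be sketched as below. The function name, the context labels, and all numeric values are hypothetical; the only behavior taken from the text is that a context with no test samples is assigned the maximum loss.

```python
def conservative_prediction(context_probs, loss_per_context, max_loss=1.0):
    """Predicted loss and the probability mass that falls in untested contexts.

    Contexts missing from loss_per_context are assigned max_loss.
    """
    predicted = sum(p * loss_per_context.get(c, max_loss)
                    for c, p in context_probs.items())
    untested_mass = sum(p for c, p in context_probs.items()
                        if c not in loss_per_context)
    return predicted, untested_mass

# Hypothetical per-context losses estimated from a test set.
loss_c = {("bright", "SC"): 0.1, ("dark", "SC"): 0.4}
# Hypothetical operating domain that includes one untested context.
p_operating = {("bright", "SC"): 0.5, ("dark", "SC"): 0.3, ("dark", "NSC"): 0.2}

predicted_loss, untested = conservative_prediction(p_operating, loss_c)
print(predicted_loss, untested)
```

Reporting the untested probability mass alongside the prediction indicates how much of the predicted loss comes from the conservative assumption rather than from measured performance.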
4 Experimental Results
4.1 Pedestrian Detection Generalization
Perception for autonomous vehicles is an active area of research, and systems that use deep networks to detect and avoid obstacles, like pedestrians, while driving are commercially available. Some of these commercial systems can be used in any driving conditions, at the user’s discretion, and the operating domains can vary significantly in terms of the lighting conditions, e.g., daytime compared to night, road conditions, e.g., clear weather compared to rainy or snowy weather, and obstacle density, e.g., a residential street compared to a restricted access highway. It would be impractical for autonomous vehicle developers to test a perception system in every possible operating domain, but it is also imperative to know whether it is safe to use a perception system in a given operating domain. We perform experiments analogous to an autonomous vehicle developer: we test a fixed network in one testing domain, $D_T$, and predict the network’s performance in novel operating domains, where the distribution of context features varies significantly from $D_T$. Our goal is to accurately predict the observed network performance when the network is used in a novel operating domain, $D_O$.
We test a pretrained Faster RCNN object detector for pedestrian detection, where the objects detected as are used as pedestrian detections. In our analysis, we consider pedestrians whose ground truth bounding box area is pixels. We evaluate the network performance at the pedestrian level. Pedestrians correctly detected with an and a confidence score are assigned a loss of $0$; pedestrians that are not detected are assigned a loss of $1$. That is, we are predicting the network’s recall. We do not assign a loss for false positive detections; this same methodology can be used to predict network precision if that is of interest. We focus on recall because failing to predict a pedestrian who is truly present in the scene is a higher safety risk than trying to avoid a pedestrian who is not present. Pedestrians in images with multiple people are considered independently; images with no pedestrians present are not assigned any loss.
The Berkeley Deep Drive (BDD) Dataset  was recorded across the continental US and includes data from varying times of day (daytime, dawn/dusk, or night), weather conditions (clear, partly cloudy, overcast, rainy, foggy, or snowy), and scene types (city street, residential, or highway). BDD images are of size . We use images from the BDD Dataset for testing, denoted the BDD Test Set. We use the remaining images in the BDD Dataset, denoted the BDD Operating Set, to define novel operating domains. The BDD Test Set and BDD Operating Set correspond to the BDD “Validation” and “Train” folds, respectively.
4.2 Defining the Context Subspace
We evaluate the network performance at the pedestrian level; therefore, context features are assigned to individual pedestrians. We do not know a priori which pedestrian attributes are informative about the network loss, so we include all available context features. The BDD dataset includes metadata on the image time of day, weather, and scene type. We include the metadata as context features. We also include the image brightness and the pedestrian bounding box brightness. We define the road(s) to be the safety critical (SC) region(s) in the images. Pedestrians in the road are labeled SC, pedestrians outside the road, e.g., on the sidewalk, are labeled not safety critical (NSC). The road is defined using the drivable area annotations. Whether a pedestrian is SC, denoted the safety critical flag, is included as a context feature. To capture information about the obstacle density in the scene, we include the total number of pedestrians, the number of SC pedestrians, and the number of NSC pedestrians in the image as context features.
4.2.1 Ranking Context Features
We use Algorithm 1 to rank the context features by how much information they provide about the network loss. When computing the mutual information for a numerical feature with more than 10 unique values, we uniformly partition the feature into 10 discrete bins. Categorical features are labeled discretely with their assigned labels. See Figure 2 for the approximate information, $\tilde{I}$, computed for the first three iterations of Algorithm 1. The 6 most informative features were found to be: 1) image brightness, 2) safety critical flag, 3) scene type, 4) number of SC pedestrians, 5) time of day, and 6) bounding box brightness.
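The discretization step for numerical features can be sketched as follows. The text states only that features are partitioned uniformly, so equal-width bins between the minimum and maximum value are an assumption here, and the function name is hypothetical.

```python
def discretize(values, n_bins=10):
    """Map a numerical feature to bin indices 0..n_bins-1 using equal-width bins.

    Features with at most n_bins unique values are returned unchanged.
    Assumption: "uniform" partitioning means equal-width bins over [min, max].
    """
    if len(set(values)) <= n_bins:
        return list(values)
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

brightness = list(range(100))          # toy brightness values
bins = discretize(brightness)
print(sorted(set(bins)))

few_values = [1, 2, 3, 2, 1]           # already discrete, left untouched
print(discretize(few_values))
```

Equal-frequency (quantile) bins would be an alternative reading of "uniformly" and would give more balanced sample counts per bin for skewed features.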
4.2.2 Selecting the Context Subspace Dimensionality
To select the number of features to include in the Context Subspace, we compute the expected prediction error, $\bar{e}_d$, for values of $d$ from to . For each dimensionality, $d$, we compute the prediction error, $e_d$, 50 times by randomly partitioning the test data into for fitting and for validation. We select the $d$ with the minimum expected prediction error over the iterations. $d = 3$ was found to be optimal, with an average prediction error of , see Figure 2 center. We subsequently define the Context Subspace with three dimensions: 1) image brightness, 2) safety critical flag, and 3) scene type.

The image brightness is a continuous feature; we uniformly partition the image brightness into 10 bins. The safety critical flag and the scene type are discrete and categorical features with 2 and 3 possible values, respectively. This results in a Context Subspace, $C_S$, with $10 \times 2 \times 3 = 60$ discrete contexts, $c \in C_S$.
4.3 Using the Context Subspace
We use $C_S$ to estimate the expected network loss in a context, $L(c)$, and to describe the probability of encountering a context in $D_T$, $P_T(c)$, see Figure 2 right. For all tested contexts, $L(c)$ is computed according to Eqn. 2. All untested contexts are assigned an expected loss of $1$, i.e., a $100\%$ chance of failing to detect a pedestrian. The BDD Operating Set is used to define four novel operating domains: 1) daytime, small groups; 2) daytime, large groups; 3) night, small groups; and 4) night, large groups. The time of day annotated in the images was used to assign “daytime” or “night”. The SC and NSC pedestrians are considered independently. Pedestrians in images with fewer than (N)SC pedestrians are categorized as small groups; pedestrians in images with or more (N)SC pedestrians are categorized as large groups, e.g., in an image with 2 SC pedestrians and 15 NSC pedestrians, the SC pedestrians would be labeled ‘small group’ and the NSC pedestrians would be labeled ‘large group’. We compute $P_O(c)$ for each $D_O$ by counting the number of pedestrians that fall into each $c$ and dividing by the total number of pedestrians.
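Estimating the context probabilities of an operating domain by counting can be sketched as below. The function name and the toy pedestrian contexts are hypothetical; each entry stands for one pedestrian's context, here a tuple of brightness bin, safety critical flag, and scene type.

```python
from collections import Counter

def context_distribution(contexts):
    """Fraction of pedestrians that fall into each context."""
    counts = Counter(contexts)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

# Toy operating domain with four pedestrians.
pedestrian_contexts = [
    (2, "SC", "city street"),
    (2, "SC", "city street"),
    (7, "NSC", "residential"),
    (2, "NSC", "city street"),
]
p_operating = context_distribution(pedestrian_contexts)
print(p_operating[(2, "SC", "city street")])
```

Only context annotations are counted here, so no detection labels from the operating domain are required to build the distribution.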
4.4 Pedestrian Detection Generalization Prediction
We predict the network loss in the novel operating domains defined in 4.3 using Eqn. 10. The heatmaps of $P_O(c)$ in Figure 3 illustrate that the novel operating domains are significantly different from each other and the testing domain, see $P_T(c)$ in Figure 2. Our network loss is equivalent to the fraction of pedestrians that are not detected by the network; we convert the predictions into the predicted network recall by subtracting the fraction of pedestrians that are not detected from 1, see Figure 3. We then pass the BDD Operating Set through the network; the observed recall is computed as the fraction of pedestrians that were correctly detected. Figure 3 illustrates that our predictions are accurate with Network Generalization Prediction accuracy between and for NSC pedestrian recall and and for SC pedestrian recall. All the SC predictions underpredict the observed recall; this demonstrates that our predictions are conservative. Note, the only prediction with significant error is for night, large group SC pedestrians. Only one image in the BDD Operating Set falls into this category, so the observed performance is based on minimal data.
4.5 Generalization Prediction for Unseen Datasets
As a preliminary study, we investigate whether the Context Subspace, $C_S$, defined using the BDD Test Set and the network loss, $L(c)$, estimated from the BDD Test Set provide information about completely unseen datasets. Unseen datasets include shifts in the context feature distributions, as well as changes in camera parameters and physical setup that are not captured by the test set. As such, we expect predictions for unseen datasets to contain bias, i.e., the prediction error for an unseen dataset will have a consistent non-zero offset. We are interested in determining the magnitude of this prediction bias to evaluate the usefulness of Network Generalization Prediction across datasets. We perform Network Generalization Prediction for the JAAD Dataset and the Cityscapes Dataset with the gtFine labels, see Figure 4 for sample images. For both datasets, the (N)SC pedestrian image brightness distribution is computed from the images.
The JAAD Dataset was recorded in North America and Europe; it includes primarily daytime images from residential and city streets in varying weather conditions. JAAD images are of size . For the JAAD Dataset, we sampled images every three seconds from the videos to limit temporal correspondence between frames; this resulted in 1,031 images. Pedestrians in the road were manually annotated as SC; all others were labeled NSC. Scene annotations are not available for the JAAD dataset. To estimate the probability distribution of scenes, the scene type was annotated for a subset of 100 images; we assume the distribution holds for the entire dataset. The marginal (N)SC image brightness distributions and scene type distribution are multiplied to obtain the joint probability distributions for the JAAD Dataset.
The Cityscapes Dataset contains images recorded in 50 cities across Germany in the daytime during fair weather conditions. Cityscapes images are of size . We defined the pedestrian bounding boxes using the outermost edges of the labeled person instance segmentations, and we used the semantic segmentation of the road to define the SC region in the image. For Cityscapes, the scene type is known to be “city street”.
We make Network Generalization Predictions for the JAAD and Cityscapes Datasets using $L(c)$, estimated using the BDD Test Set. The prediction bias is consistently around , with a minimum prediction error of for SC pedestrian recall in the JAAD Dataset. We underpredict pedestrian recall for the JAAD Dataset and we overpredict pedestrian recall for the Cityscapes Dataset.
We make accurate Network Generalization Predictions for the BDD Operating Set, where the observed recall varies from to . This demonstrates that a fixed test set can be used to predict a network’s performance in diverse, novel operating domains. The observed recall for SC pedestrians is about higher than for NSC pedestrians. This makes intuitive sense, as SC pedestrians tend to be central in the image and closer to the vehicle. This is encouraging, because the performance of perception systems for autonomous vehicles will ultimately be determined by how well they detect SC pedestrians and obstacles. However, in the BDD Test Set there are many more examples of NSC pedestrians, , than SC pedestrians, . This leads to more untested contexts for the SC pedestrians, which in turn leads to the slight underprediction of SC recall.
For unseen datasets, we find a Network Generalization Prediction bias of ; we believe these results are promising and that the results indicate the Context Subspace identified for one dataset, e.g., one camera setup and one physical setup, can be informative for unseen datasets. Investigating how network performance changes between datasets and identifying what physical changes lead to performance differences is a direction for future work.
Network Generalization Prediction can be used to link network behavior in novel operating domains to required levels of performance. The Context Subspace can be leveraged for quasi-white box testing by testing the network across variations in context features that are known to impact network behavior. The Context Subspace also makes the network behavior interpretable by elucidating where failure is more likely. In addition to making the Network Generalization Prediction tractable, we believe the Context Subspace can be used during network training to extract features that are robust to changes in the Context Subspace. The Context Subspace can also be used for online error monitoring, e.g., an autonomous vehicle could notify the driver if it detects the surrounding scene is a context with subpar expected performance. We believe the Context Subspace is a tool that can make network performance more interpretable during training, testing, and deployment.
We propose the task Network Generalization Prediction and leverage a Context Subspace to render Network Generalization Prediction tractable with scarce test samples. We identify the Context Subspace automatically and demonstrate accurate Network Generalization Prediction for Faster RCNN used for pedestrian detection in diverse operating domains. We show that the Context Subspace identified for the BDD Dataset is informative for completely unseen datasets. We believe that accurate Network Generalization Prediction, with an interpretable Context Subspace, is a step towards bridging the gap between the high performance of deep networks and the verification required for safety critical systems.
References
- (2004) Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing 1 (1), pp. 11–33.
- (2015) Info-greedy sequential adaptive compressed sensing. IEEE Journal of Selected Topics in Signal Processing 9 (4), pp. 601–611.
- Feature selection in machine learning: a new perspective. Neurocomputing 300, pp. 70–79.
- (2018) Towards dependability metrics for neural networks. In 2018 16th ACM/IEEE International Conference on Formal Methods and Models for System Design (MEMOCODE), pp. 1–4.
- The cityscapes dataset for semantic urban scene understanding.
- (2020) Underspecification presents challenges for credibility in modern machine learning. arXiv preprint arXiv:2011.03395.
- (2020) Generalized odin: detecting out-of-distribution image without learning from out-of-distribution data. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- (2021) Greedy-based feature selection for efficient lidar slam. arXiv preprint arXiv:2103.13090.
- (2017) Scalable greedy feature selection via weak submodularity. In Artificial Intelligence and Statistics, pp. 1560–1568.
- (2019) Learning not to learn: training deep neural networks with biased data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9012–9020.
- (2020) WILDS: a benchmark of in-the-wild distribution shifts. arXiv preprint arXiv:2012.07421.
- (2004) Estimating mutual information. Physical Review E 69 (6), pp. 066138.
- (2017) Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690.
- (2020) Dependable neural networks for safety critical tasks. In International Workshop on Engineering Dependable and Secure Machine Learning Systems, pp. 126–140.
- (2020) Identification and explanation of challenging conditions for camera-based object detection of automated vehicles. Sensors (Basel, Switzerland) 20 (13).
- (2017) Are they going to cross? a benchmark dataset and baseline for pedestrian crosswalk behavior. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 206–213.
- (2015) Faster r-cnn: towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497.
- (2020) Evaluating model robustness to dataset shift. arXiv preprint arXiv:2010.15100.
- Jigsaw-vae: towards balancing features in variational autoencoders. arXiv preprint arXiv:2005.05496.
- (2007) Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory 53 (12), pp. 4655–4666.
- (2019) A greedy feature selection algorithm for big data of high dimensionality. Machine Learning 108 (2), pp. 149–202.
- (2014) A review of feature selection methods based on mutual information. Neural Computing and Applications 24 (1), pp. 175–186.
- (2019) A survey of zero-shot learning: settings, methods, and applications. ACM Transactions on Intelligent Systems and Technology (TIST) 10 (2), pp. 13.
- (2018) Zero-shot learning - a comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (9), pp. 2251–2265.
- (2018) Bdd100k: a diverse driving video database with scalable annotation tooling. arXiv preprint arXiv:1805.04687.
Appendix A Comparing and
We propose to approximate the Interaction Information between context features and the network loss , . The computational complexity of computing grows linearly with , as compared to the computational complexity of computing which grows combinatorially with . We investigate the difference between and . To simplify the notation, we denote as and as . It is trivial to compute the Mutual Information between the context features and and select to be the feature most informative about the loss. We assume has been selected and we compare and .
The difference between and is:
As we would like the context features in to be roughly independent, let us assume that is not informative of , i.e., .
The reader is reminded that the conditional mutual information is computed as:
For simplicity, let us consider the point wise conditional mutual information at , , and :
Recall, it was assumed that and are independent, thus . The joint probability can also be factored as .
Note and . Thus, the difference between the proposed and the Interaction Information is proportional to
If we consider only combinations of and that exist in the test set, . As the new context feature becomes more informative, and the difference . This demonstrates that, if the context features are informative about the loss, is a good approximation of the Interaction Information.