ML models are increasingly deployed in settings with real-world interactions such as vehicles, but unfortunately, these models can fail in systematic ways. To prevent errors, ML engineering teams monitor and continuously improve these models. We propose a new abstraction, model assertions, that adapts the classical use of program assertions as a way to monitor and improve ML models. Model assertions are arbitrary functions over a model's input and output that indicate when errors may be occurring, e.g., a function that triggers if an object rapidly changes its class in a video. We propose methods of using model assertions at all stages of ML system deployment, including runtime monitoring, validating labels, and continuously improving ML models. For runtime monitoring, we show that model assertions can find high confidence errors, where a model returns the wrong output with high confidence, which uncertainty-based monitoring techniques would not detect. For training, we propose two methods of using model assertions. First, we propose a bandit-based active learning algorithm that can sample from data flagged by assertions and show that it can reduce labeling costs by up to 40% over traditional uncertainty-based methods. Second, we propose an API for generating "consistency assertions" (e.g., the class change example) and weak labels for inputs where the consistency assertions fail, and show that these weak labels can improve relative model quality by up to 46%. We evaluate model assertions on four real-world tasks with video, LIDAR, and ECG data.
ML is increasingly deployed in complex contexts that require inference about the physical world, from autonomous vehicles (AVs) to precision medicine. However, ML models can misbehave in unexpected ways. For example, AVs have accelerated toward highway lane dividers lee2018tesla and can rapidly change their classification of objects over time, causing erratic behavior coldewey2018uber; ntsb2019vehicle. As a result, quality assurance (QA) of models, including continuous monitoring and improvement, is of paramount concern.
Unfortunately, performing QA for complex, real-world ML applications is challenging: ML models fail for diverse reasons that are unknown before deployment. Thus, existing solutions that focus on verifying training, including formal verification katz2017reluplex, whitebox testing pei2017deepxplore, monitoring training metrics renggli2019continuous, and validating training code odena2018tensorfuzz, only give guarantees on a test set and perturbations thereof, so models can still fail on the huge volumes of deployment data that are not part of the test set (e.g., billions of images per day in an AV fleet). Validating input schemas polyzotis2019data; baylor2017tfx does not work for applications with unstructured inputs that lack meaningful schemas, e.g., images. Solutions that check whether model performance remains consistent over time baylor2017tfx only apply to deployments that have ground truth labels, e.g., click-through rate prediction, but not to deployments that lack labels.
As a step towards more robust QA for complex ML applications, we have found that ML developers can often specify systematic errors made by ML models: certain classes of errors are repetitive and can be checked automatically, via code. For example, in developing a video analytics engine, we noticed that object detection models can identify boxes of cars that flicker rapidly in and out of the video (Figure 1), indicating some of the detections are likely wrong. Likewise, our contacts at an AV company reported that LIDAR and camera models sometimes disagree. While seemingly simple, similar errors were involved with a fatal AV crash ntsb2019vehicle. These systematic errors can arise for diverse reasons, including domain shift between training and deployment data (e.g., still images vs. video), incomplete training data (e.g., no instances of snow-covered cars), and noisy inputs.
To leverage the systematic nature of these errors, we propose model assertions, an abstraction to monitor and improve ML model quality. Model assertions are inspired by program assertions goldstine1947planning; turing1949checking, one of the most common ways to monitor software. A model assertion is an arbitrary function over a model’s input and output that returns a Boolean (0 or 1) or continuous (floating point) severity score to indicate when faults may be occurring. For example, a model assertion that checks whether an object flickers in and out of video could return a Boolean value over each frame or the number of objects that flicker. While assertions may not offer a complete specification of correctness, we have found that assertions are easy to specify in many domains (§2).
We explore several ways to use model assertions, both at runtime and training time.
First, we show that model assertions can be used for runtime monitoring: they can be used to log unexpected behavior or automatically trigger corrective actions, e.g., shutting down an autopilot. Furthermore, model assertions can often find high confidence errors, where the model has high certainty in an erroneous output; these errors are problematic because prior uncertainty-based monitoring would not flag these errors. Additionally, and perhaps surprisingly, we have found that many groups are also interested in validating human-generated labels, which can be done using model assertions.
Second, we show that assertions can be used for active learning, in which data is continuously collected to improve ML models. Traditional active learning algorithms select data to label based on uncertainty, with the intuition that “harder” data where the model is uncertain will be more informative settles2009active; coleman2020selection. Model assertions provide another natural way to find “hard” examples. However, using assertions in active learning presents a challenge: how should the active learning algorithm select between data when several assertions are used? A data point can be flagged by multiple assertions or a single assertion can flag multiple data points, in contrast to a single uncertainty metric. To address this challenge, we present a novel bandit-based active learning algorithm (BAL). Given a set of data that have been flagged by potentially multiple model assertions, our bandit algorithm uses the assertions’ severity scores as context (i.e., features) and maximizes the marginal reduction in the number of assertions fired (§3). We show that our bandit algorithm can reduce labeling costs by up to 40% over traditional uncertainty-based methods.
Third, we show that assertions can be used for weak supervision mintz2009distant; ratner2017weak. We propose an API for writing consistency assertions about how attributes of a model’s output should relate that can also provide weak labels for training. Consistency assertions specify that data should be consistent between attributes and identifiers, e.g., a TV news host (identifier) should have consistent gender (attribute), or that certain predictions should (or should not) exist in temporally related outputs, e.g., cars in adjacent video frames (Figure 1). We demonstrate that this API can apply to a range of domains, including medical classification and TV news analytics. These weak labels can be used to improve relative model quality by up to 46% with no additional human labeling.
We implement model assertions in a Python library, OMG (a recursive acronym for OMG Model Guardian), that can be used with existing ML frameworks. We evaluate assertions on four ML applications: understanding TV news, AVs, video analytics, and classifying medical readings. We implement assertions for systematic errors reported by ML users in these domains, including checking for consistency between sensors, domain knowledge about object locations in videos, and medical knowledge about heart patterns. Across these domains, we find that the model assertions we consider can be written with at most 60 lines of code and with 88-100% precision, that these assertions often find high-confidence errors (e.g., top 90th percentile by confidence), and that our new algorithms for active learning and weak supervision via assertions improve model quality over existing methods.
In summary, we make the following contributions:
We introduce the abstraction of model assertions for monitoring and continuously improving ML models.
We show that model assertions can find high confidence errors, which would not be flagged by uncertainty metrics.
We propose a bandit algorithm to select data points for active learning via model assertions and show that it can reduce labeling costs by up to 40%.
We propose an API for consistency assertions that can automatically generate weak labels for data where the assertion fails, and show that weak supervision via these labels can improve relative model quality by up to 46%.
We describe the model assertion interface, examples of model assertions, how model assertions can integrate into the ML development/deployment cycle, and its implementation in OMG.
We formalize the model assertions interface. Model assertions are arbitrary functions that can indicate when an error is likely to have occurred. They take as input a list of inputs and outputs from one or more ML models. They return a severity score, a continuous value that indicates the severity of an error of a specific type. By convention, the 0 value represents an abstention. Boolean values can be implemented in model assertions by only returning 0 and 1. The severity score does not need to be calibrated, as our algorithms only use the relative ordering of scores.
As a concrete example, consider an AV with a LIDAR sensor and camera and object detection models for each sensor. To check that these models agree, a developer may write:
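A minimal sketch of such an agreement check follows. It assumes the LIDAR detections have already been projected into image coordinates as 2D boxes; the names `Box`, `iou`, and `sensors_agree`, and the 0.5 IoU threshold, are illustrative assumptions rather than part of OMG. The returned severity counts detections that one sensor reports and the other does not, with 0 meaning abstention.

```python
from typing import List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in image pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def sensors_agree(camera_boxes: List[Box],
                  lidar_boxes_2d: List[Box],
                  iou_threshold: float = 0.5) -> float:
    """Severity score: number of boxes with no counterpart in the other sensor."""

    def unmatched(src: List[Box], dst: List[Box]) -> int:
        # A box is unmatched if it overlaps no box from the other sensor enough.
        return sum(1 for s in src if all(iou(s, d) < iou_threshold for d in dst))

    return float(unmatched(camera_boxes, lidar_boxes_2d)
                 + unmatched(lidar_boxes_2d, camera_boxes))
```

A nonzero score does not say which sensor is wrong, only that at least one of them is likely in error on this input.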
Notably, our library OMG can register arbitrary Python functions as model assertions.
In this section, we provide use cases for model assertions that arose in discussions with industry and academic contacts, including AV companies and academic labs. We show examples of errors caught by the model assertions described in this section in Appendix A and describe how one might look for assertions in other domains in Appendix B.
Our discussions revealed two key properties in real-world ML systems. First, ML models are deployed on orders of magnitude more data than can reasonably be labeled, so a labeled sample cannot capture all deployment conditions. For example, the fleet of Tesla vehicles will see over 100× more images in a day than in the largest existing image dataset sun2017revisiting. Second, complex ML deployments are developed by large teams, of which some developers may not have the ability to manage all parts of the application. As a result, it is critical to be able to do QA collaboratively to cover the application end-to-end.
Analyzing TV news.
We spoke to a research lab studying bias in media via automatic analysis. This lab collected over 10 years of TV news (billions of frames) and executed face detection every three seconds. These detections are subsequently used to identify the faces, detect gender, and classify hair color using ML models. Currently, the researchers have no method of identifying errors and manually inspect data. However, they additionally compute scene cuts. Given that most TV news hosts do not move much between scenes, we can assert that the identity, gender, and hair color of faces that highly overlap within the same scene are consistent (Figure 6, Appendix). We further describe how model assertions can be implemented via our consistency API for TV news in §4.
Autonomous vehicles (AVs). AVs are required to execute a variety of tasks, including detecting objects and tracking lane markings. These tasks are accomplished with ML models from different sensors, such as visual, LIDAR, or ultrasound sensors davies2018how. For example, a vision model might be used to detect objects in video and a point cloud model might be used to do 3D object detection.
Our contacts at an AV company noticed that models from video and point clouds can disagree. We implemented a model assertion that projects the 3D boxes onto the 2D camera plane to check for consistency. If the assertion triggers, then at least one of the sensors returned an incorrect answer.
Video analytics. Many modern, academic video analytics systems use an object detection method kang2017noscope; kang2018blazeit; hsieh2018focus; jiang2018chameleon; xu2019vstore; canel2019scaling trained on MS-COCO lin2014microsoft, a corpus of still images. These still image object detection methods are deployed on video for detecting objects. None of these systems aim to detect errors, even though errors can affect analytics results.
In developing such systems, we noticed that objects flicker in and out of the video (Figure 1) and that vehicles overlap in unrealistic ways (Figure 7, Appendix). We implemented assertions to detect these.
Medical classification. Deep learning researchers have created deep networks that can outperform cardiologists for classifying atrial fibrillation (AF, a form of heart condition) from single-lead ECG data rajpurkar2017cardiologist. Our researcher contacts mentioned that AF predictions from DNNs can rapidly oscillate. The European Society of Cardiology guidelines for detecting AF require at least 30 seconds of signal before calling a detection developed2010guidelines. Thus, predictions should not rapidly switch between two states. A developer could specify this model assertion, which could be implemented to monitor ECG classification deployments.
We describe how model assertions can be integrated with ML development and deployment pipelines. Importantly, model assertions are complementary to a range of other ML QA techniques, including verification, fuzzing, and statistical techniques, as shown in Figure 2.
First, model assertions can be used for monitoring and validating all parts of the ML development/deployment pipeline. Namely, model assertions are agnostic to the source of the output, whether they be ML models or human labelers. Perhaps surprisingly, we have found several groups to also be interested in monitoring human label quality. Thus, concretely, model assertions can be used to validate human labels (data collection) or historical data (validation), and to monitor deployments (e.g., to populate dashboards).
Second, model assertions can be used at training time to select which data points to label in active learning. We describe BAL, our algorithm for data selection, in §3.
Third, model assertions can be used to generate weak labels to further train ML models without additional human labels. We describe how OMG accomplishes this via consistency assertions in §4. Users can also register their own weak supervision rules.
We implement a prototype library for model assertions, OMG, that works with existing Python ML training and deployment frameworks. We briefly describe OMG’s implementation.
OMG logs user-defined assertions as callbacks. The simplest way to add an assertion is through AddAssertion(func), where func is a function of the inputs and outputs (see below). OMG also provides an API to add consistency assertions as described in §4. Given this database, OMG requires a callback after model execution that takes the model’s input and output as input. Given the model’s input and output, OMG will execute the assertions and record any errors. We assume the assertion signature is similar to the following; this assertion signature is for the example in Figure 1:
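As an illustration only, the registration and callback flow described above might look like the sketch below. `AssertionRegistry`, `add_assertion`, and `after_model_execution` are hypothetical stand-ins, not OMG's actual classes or methods, and the `flickering` function assumes each model output is the set of object identifiers detected in one frame.

```python
from typing import Any, Callable, Dict, List, Set

# An assertion maps (recent inputs, recent outputs) to a severity score.
Assertion = Callable[[List[Any], List[Any]], float]


class AssertionRegistry:
    """Hypothetical stand-in for the register-then-callback flow."""

    def __init__(self) -> None:
        self._assertions: List[Assertion] = []
        self.log: List[Dict[str, Any]] = []

    def add_assertion(self, func: Assertion) -> None:
        self._assertions.append(func)

    def after_model_execution(self, inputs: List[Any],
                              outputs: List[Any]) -> Dict[str, float]:
        """Run every registered assertion and record those that fire."""
        scores: Dict[str, float] = {}
        for func in self._assertions:
            severity = float(func(inputs, outputs))
            scores[func.__name__] = severity
            if severity > 0:  # 0 means the assertion abstains
                self.log.append({"assertion": func.__name__,
                                 "severity": severity})
        return scores


def flickering(recent_frames: List[Any],
               recent_outputs: List[Set[str]]) -> float:
    """Count objects present, then absent, then present in consecutive frames."""
    count = 0
    for prev, cur, nxt in zip(recent_outputs,
                              recent_outputs[1:],
                              recent_outputs[2:]):
        count += len((prev & nxt) - cur)
    return float(count)
```

Returning a count rather than a Boolean gives downstream consumers a severity score to rank flagged frames by.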
For active learning, OMG will take a batch of data and return indices for which data points to label. For weak supervision, OMG will take data and return weak labels where valid. Users can specify weak labeling functions associated with assertions to help with this.
In the following two sections, we describe two key methods that OMG uses to improve model quality: BAL for active learning and consistency assertions for weak supervision.
We introduce an algorithm called BAL to select data for active learning via model assertions. BAL assumes that a set of data points has been collected and a subset will be labeled in bulk. We found that labeling services scale and our industrial contacts usually label data in bulk.
Given a set of data points that triggered model assertions, OMG must select which points to label. There are two key challenges which make data selection intractable in its full generality. First, we do not know the marginal utility of selecting a data point to label without labeling the data point. Second, even with labels, estimating the marginal gain of data points is expensive to compute as training modern ML models is expensive.
To address these issues, we make simplifying assumptions. We describe the statistical model we assume, the resource-unconstrained algorithm, our simplifying assumptions, and BAL. We note that, while the resource-unconstrained algorithm can produce statistical guarantees, BAL does not. We instead empirically verify its performance in Section 5.
Data selection as multi-armed bandits. We cast the data selection problem as a multi-armed bandit (MAB) problem auer2002finite; berry1985bandit. In MABs, a set of “arms” (i.e., individual data points) is provided and the user must select a set of arms (i.e., points to label) to achieve the maximal expected utility (e.g., maximize validation accuracy, minimize number of assertions that fire). MABs have been studied in a wide variety of settings radlinski2008learning; lu2010contextual; bubeck2009pure, but we assume that the arms have context associated with them (i.e., severity scores from model assertions) and give submodular rewards (defined below). The rewards are possibly time-varying. We further assume there is an (unknown) smoothness parameter that determines the similarity between arms of similar contexts (formally, the $\alpha$ in the Hölder condition evans1998graduate). The following presentation is inspired by chen2018contextual.
Concretely, we assume the data will be labeled in $T$ rounds and denote the rounds $t = 1, \ldots, T$. We refer to the set of $n$ data points as $N = \{1, \ldots, n\}$. Each data point has a $d$-dimensional feature vector associated with it, where $d$ is the number of model assertions. We refer to the feature vector as $x_i^t$, where $i$ is the data point index and $t$ is the round index; from here, we will refer to the data points as $x_i^t$. Each entry in a feature vector is the severity score from a model assertion. The feature vectors can change over time as the model predictions, and therefore assertions, change over the course of training.
We assume there is a budget on the number of arms (i.e., data points to label), $B^t$, at every round. The user must select a set of arms $S^t \subseteq N$ such that $|S^t| \le B^t$. We assume that the reward from the arms, $R(S^t)$, is submodular in $S^t$. Intuitively, submodularity implies diminishing marginal returns: adding the 100th data point will not improve the reward as much as adding the 10th data point. Formally, we first define the marginal gain of adding an extra arm:
$$\Delta R(\{m\} \mid A) = R(A \cup \{m\}) - R(A),$$
where $A \subseteq N$ is a subset of arms and $m \in N$ is an additional arm such that $m \notin A$. The submodularity condition states that, for any $A \subseteq C \subseteq N$ and $m \notin C$,
$$\Delta R(\{m\} \mid A) \ge \Delta R(\{m\} \mid C).$$
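To make the diminishing-returns property concrete, the sketch below uses a toy coverage reward, a standard example of a submodular function. The arms, the `COVERS` table, and the reward itself are illustrative assumptions and are unrelated to the rewards used in our experiments.

```python
from typing import Dict, Iterable, Set

# Toy setup: each arm "covers" some items; the reward is the number of
# distinct items covered by the selected arms.
COVERS: Dict[str, Set[int]] = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4}}


def reward(selected: Iterable[str]) -> int:
    covered: Set[int] = set()
    for arm in selected:
        covered |= COVERS[arm]
    return len(covered)


def marginal_gain(arm: str, selected: Iterable[str]) -> int:
    """Gain from adding `arm` to the already selected set."""
    selected = set(selected)
    return reward(selected | {arm}) - reward(selected)


# The gain of arm "b" shrinks as the selected set grows.
gains = [marginal_gain("b", s) for s in (set(), {"a"}, {"a", "c"})]
```

Here `gains` evaluates to `[2, 1, 0]`: the same arm is worth less the more the selected set already covers.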
Resource-unconstrained algorithm. Assuming an infinite labeling and computational budget, we describe an algorithm that selects data points to train on. Unfortunately, this algorithm is not feasible as it requires labels for every point and training the ML model many times.
If we assume that rewards for individual arms can be queried, then a recent bandit algorithm, CC-MAB chen2018contextual, can achieve sublinear regret, at a rate that depends on the smoothness parameter $\alpha$ (the exact bound is given in chen2018contextual). A regret bound is the (asymptotic) difference with respect to an oracle algorithm. Briefly, CC-MAB explores under-explored arms until it is confident that certain arms have highest reward. Then, it greedily takes the highest reward arms. Full details are given in chen2018contextual and summarized in Algorithm 1.
Unfortunately, CC-MAB requires access to an estimate of the marginal gain of selecting a single arm. Estimating the gain of a single arm requires a label and requires retraining and reevaluating the model, which is computationally infeasible for expensive-to-train ML models, especially modern deep networks.
Resource-constrained algorithm. We make simplifying assumptions and use these to modify CC-MAB for the resource-constrained setting. Our simplifying assumptions are that 1) data points with similar contexts (i.e., similar vectors of severity scores) are interchangeable, 2) data points with higher severity scores have higher expected marginal gain, and 3) reducing the number of triggered assertions will increase accuracy.
Under these assumptions, we do not require an estimate of the marginal reward for each arm. Instead, we can approximate the marginal gain from selecting arms with similar contexts by the total number of these arms that were selected. This has two benefits. First, we can train a model on a set of arms (i.e., data points) in batches instead of adding single arms at a time. Second, we can select data points of similar contexts at random, without having to compute their marginal gains.
Leveraging these assumptions, we can simplify Algorithm 1 to require less computation for training models and to not require labels for all data points. Our algorithm is described in Algorithm 2. Briefly, we approximate the marginal gain of selecting batches of arms and select arms proportional to the marginal gain. We additionally allocate 25% of the budget in each round to randomly sample arms that triggered different model assertions, uniformly; this is inspired by $\epsilon$-greedy algorithms tokic2011value. This ensures that no contexts (i.e., model assertions) are underexplored as training progresses. Finally, in some cases (e.g., with noisy assertions), it may not be possible to reduce the number of assertions that fire. In this case, BAL will default to random sampling or uncertainty sampling, as specified by the user.
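The selection step described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not a reproduction of Algorithm 2: the function name `select_points`, its arguments, and the use of per-assertion fire counts from two consecutive rounds as the marginal-gain estimate are all hypothetical choices made for the example.

```python
import random
from typing import Dict, List, Optional


def select_points(flagged: Dict[str, List[int]],
                  fired_prev: Dict[str, int],
                  fired_now: Dict[str, int],
                  budget: int,
                  explore_frac: float = 0.25,
                  rng: Optional[random.Random] = None) -> List[int]:
    """Pick up to `budget` distinct data point indices to label this round."""
    rng = rng or random.Random(0)
    names = [n for n in flagged if flagged[n]]
    pool = sorted({i for n in names for i in flagged[n]})
    budget = min(budget, len(pool))
    if budget == 0:
        return []
    # Estimated marginal gain: reduction in fires since the previous round.
    gains = {n: max(0, fired_prev.get(n, 0) - fired_now.get(n, 0))
             for n in names}
    if sum(gains.values()) == 0:
        # No assertion is improving: fall back to random sampling.
        return rng.sample(pool, budget)

    chosen: List[int] = []
    seen = set()

    def draw(weights: Dict[str, float]) -> None:
        # Choose an assertion by weight, then a random unseen point it flagged.
        candidates = [n for n in names
                      if any(i not in seen for i in flagged[n])]
        w = [weights[n] for n in candidates]
        if sum(w) == 0:
            w = [1.0] * len(candidates)
        name = rng.choices(candidates, weights=w, k=1)[0]
        point = rng.choice([i for i in flagged[name] if i not in seen])
        seen.add(point)
        chosen.append(point)

    n_explore = int(round(explore_frac * budget))
    for _ in range(n_explore):           # uniform exploration across assertions
        draw({n: 1.0 for n in names})
    for _ in range(budget - n_explore):  # proportional to marginal gain
        draw(gains)
    return chosen
```

Points within an assertion are drawn at random, which mirrors the assumption that data points with similar contexts are interchangeable.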
Although developers can write arbitrary Python functions as model assertions in OMG, we found that many assertions can be specified using an even simpler, high-level abstraction that we called consistency assertions. This interface allows OMG to generate multiple Boolean model assertions from a high-level description of the model’s output, as well as automatic correction rules that propose new labels for data that fail the assertion to enable weak supervision.
The key idea of consistency assertions is to specify which attributes of a model’s output are expected to match across many invocations to the model. For example, consider a TV news application that tries to locate faces in TV footage and then identify their name and gender (one of the real-world applications we discussed in §2.2). The ML developer may wish to assert that, within each video, each person should consistently be assigned the same gender, and should appear on the screen at similar positions on most nearby frames. Consistency assertions let developers specify such requirements by providing two functions:
An identification function that returns an identifier for each model output. For example, in our TV application, this could be the person’s name as identified by the model.
An attributes function that returns a list of named attributes expected to be consistent for each identifier. In our example, this could return the gender attribute.
Given these two functions, OMG generates multiple Boolean assertions that check whether the various attributes of outputs with a common identifier match. In addition, it generates correction rules that can replace an inconsistent attribute with a guess at that attribute’s value based on other instances of the identifier (we simply use the most common value). By running the model and these generated assertions over unlabeled data, OMG can thus automatically generate weak labels for data points that do not satisfy the consistency assertions. Notably, OMG provides another way of producing labels for training that is complementary to human-generated labels and other sources of weak labels. OMG is especially suited for unstructured sources, e.g., video. We show in §5 that these weak labels can automatically increase model quality.
The consistency assertions API supports ML applications that run over multiple inputs and produce zero or more outputs for each input. For example, each output could be an object detected in a video frame. The user provides two functions over outputs $o_i$:
Id($o_i$) returns an identifier for the output $o_i$, which is simply an opaque value.
Attrs($o_i$) returns zero or more attributes for the output $o_i$, which are key-value pairs.
In addition to checking attributes, we found that many applications also expect their identifiers to appear in a “temporally consistent” fashion, where objects do not disappear and reappear too quickly. For example, one would expect cars identified in the video to stay on the screen for multiple frames instead of “flickering” in and out in most cases. To express this expectation, developers can provide a temporal consistency threshold, $T$, which specifies that each identifier should not appear or disappear for intervals less than $T$ seconds. For example, we might set $T$ to one second for TV footage that frequently cuts across frames, or 30 seconds for an activity classification algorithm that distinguishes between walking and biking. The full API for adding a consistency assertion is therefore AddConsistencyAssertion(Id, Attrs, $T$).
Examples. We briefly describe how one can use consistency assertions in several ML tasks motivated in §2.2:
Face identification in TV footage: This application uses multiple ML models to detect faces in images, match them to identities, classify their gender, and classify their hair color. We can use the detected identity as our Id function and gender/hair color as attributes.
Video analytics for traffic cameras: This application aims to detect vehicles in video street traffic, and suffers from problems such as flickering or changing classifications for an object. The model’s output is bounding boxes with classes on each frame. Because we lack a globally unique identifier (e.g., license plate number) for each object, we can assign a new identifier for each box that appears and assign the same identifier as it persists through the video. We can treat the class as an attribute and set $T$ as well to detect flickering.
Heart rhythm classification from ECGs: In this application, domain experts informed us that atrial fibrillation heart rhythms need to persist for at least 30 seconds to be considered a problem. We used the detected class as our identifier and set $T$ to 30 seconds.
Given the Id, Attrs, and $T$ values, OMG automatically generates Boolean assertions to check for matching attributes and to check that when an identifier appears in the data, it persists for at least $T$ seconds. These assertions are treated the same as user-provided ones in the rest of the system.
OMG also automatically generates corrective rules that propose a new label for outputs that do not match their identifier’s other outputs on an attribute. The default behavior is to propose the most common value of that attribute (e.g., the class detected for an object on most frames), but users can also provide a function to suggest an alternative based on all of that object’s outputs.
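The default correction rule (propose the most common value) can be sketched as below. This is an illustrative reimplementation under assumptions, not OMG's code: `weak_labels`, `id_fn`, and `attrs_fn` are hypothetical names, outputs are assumed to be plain dictionaries, and ties between equally common values are broken arbitrarily.

```python
from collections import Counter, defaultdict
from typing import Any, Callable, Dict, Hashable, List, Tuple


def weak_labels(outputs: List[Any],
                id_fn: Callable[[Any], Hashable],
                attrs_fn: Callable[[Any], Dict[str, Any]]
                ) -> List[Tuple[int, str, Any]]:
    """Return (output index, attribute name, proposed value) for each
    attribute that disagrees with the majority for its identifier."""
    # Group output indices by identifier.
    groups: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, out in enumerate(outputs):
        groups[id_fn(out)].append(idx)

    proposals: List[Tuple[int, str, Any]] = []
    for indices in groups.values():
        # Tally the values seen for every attribute of this identifier.
        votes: Dict[str, Counter] = defaultdict(Counter)
        for idx in indices:
            for key, value in attrs_fn(outputs[idx]).items():
                votes[key][value] += 1
        # Propose the majority value wherever an output disagrees.
        for idx in indices:
            for key, value in attrs_fn(outputs[idx]).items():
                majority = votes[key].most_common(1)[0][0]
                if value != majority:
                    proposals.append((idx, key, majority))
    return proposals


# Example in the spirit of the TV news application (made-up data).
detections = [
    {"name": "host_a", "hair": "brown"},
    {"name": "host_a", "hair": "brown"},
    {"name": "host_a", "hair": "blond"},
    {"name": "host_b", "hair": "black"},
]
fixes = weak_labels(detections,
                    id_fn=lambda o: o["name"],
                    attrs_fn=lambda o: {"hair": o["hair"]})
```

Here `fixes` contains a single proposal, `(2, "hair", "brown")`, which could be used as a weak label for the third detection.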
For temporal consistency constraints via $T$, OMG will assert by default that at most one transition can occur within a $T$-second window; this can be overridden. For example, an identifier appearing is valid, but an identifier appearing, disappearing, then appearing is invalid. If a violation occurs, OMG will propose to remove, modify, or add predictions. In the latter case, OMG needs to know how to generate an expected output on an input where the object was not identified (e.g., frames where the object flickered out in Figure 1). OMG requires the user to provide a function to cover this case, since it may require domain specific logic, e.g., averaging the locations of the object on nearby video frames.
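One way to read the default temporal rule is as a check on the spacing between consecutive state changes. The sketch below is an assumption-laden illustration rather than OMG's implementation: `temporal_violations` is a hypothetical name, timestamps are assumed to be sorted and in seconds, and a state can be either a presence flag (for flickering) or a predicted class (for the ECG example).

```python
from typing import Any, List, Sequence, Tuple


def temporal_violations(timestamps: Sequence[float],
                        states: Sequence[Any],
                        threshold_s: float) -> List[Tuple[float, float]]:
    """Return pairs of consecutive transition times closer than threshold_s."""
    # A transition happens wherever the state differs from the previous one.
    transitions = [timestamps[i]
                   for i in range(1, len(states))
                   if states[i] != states[i - 1]]
    # Two transitions inside one window violate the "at most one" rule.
    return [(a, b)
            for a, b in zip(transitions, transitions[1:])
            if b - a < threshold_s]


# An object that flickers out for one frame (one frame per second).
flicker = temporal_violations([0, 1, 2, 3, 4],
                              [True, True, False, True, True],
                              threshold_s=2)

# A rhythm classification that switches away and back within 30 seconds.
ecg = temporal_violations([0, 10, 20], ["N", "AF", "N"], threshold_s=30)
```

Each returned pair brackets the suspicious interval, which is where a correction rule would remove, modify, or add predictions.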
We evaluated OMG and model assertions on four diverse ML workloads based on real industrial and academic use-cases: analyzing TV news, video analytics, autonomous vehicles, and medical classification. For each domain, we describe the task, dataset, model, training procedure, and assertions. A summary is given in Table 1.
|Task||Model||Assertions|
|TV news||Custom||Consistency (§4, news)|
|Object detection (video)||SSD liu2016ssd||Three vehicles should not highly overlap (multibox), identity consistency assertions (flicker and appear)|
|Vehicle detection (AVs)||Second yan2018second, SSD||Agreement of point cloud and image detections (agree), multibox|
|AF classification||ResNet rajpurkar2017cardiologist||Consistency assertion within a 30s time window (ECG)|
TV news. Our contacts analyzing TV news provided us 50 hour-long segments that were known to be problematic. They further provided pre-computed boxes of faces, identities, and hair colors; this data was computed from a range of models and sources, including hand-labeling, weak labels, and custom classifiers. We implemented the consistency assertions described in §4. We did not have access to the training code for this domain, so we could not perform retraining experiments.
Video analytics. Many modern video analytics systems use object detection as a core primitive kang2017noscope; kang2018blazeit; hsieh2018focus; jiang2018chameleon; xu2019vstore; canel2019scaling, in which the task is to localize and classify the objects in a frame of video. We focus on the object detection portion of these systems. We used a ResNet-34 SSD liu2016ssd (henceforth SSD) model pretrained on MS-COCO lin2014microsoft. We deployed SSD for detecting vehicles in the night-street (i.e., jackson) video that is commonly used kang2017noscope; xu2019vstore; canel2019scaling; hsieh2018focus. We used a separate day of video for training and testing.
We deployed three model assertions: multibox, flicker, and appear. The multibox assertion fires when three boxes highly overlap (Figure 7, Appendix). The flicker and appear assertions are implemented with our consistency API as described in §4.
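For illustration, a multibox-style check could be written as below. The exact overlap criterion is not specified above, so this sketch assumes that "highly overlap" means pairwise IoU at or above a threshold; the function names and the 0.7 threshold are hypothetical.

```python
from itertools import combinations
from typing import List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def multibox(boxes: List[Box], iou_threshold: float = 0.7) -> float:
    """Severity: number of box triples in which every pair highly overlaps."""
    count = 0
    for trio in combinations(boxes, 3):
        if all(iou(a, b) >= iou_threshold for a, b in combinations(trio, 2)):
            count += 1
    return float(count)
```

Enumerating triples is cubic in the number of boxes, which is acceptable for the handful of detections in a typical frame.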
Autonomous vehicles. We studied the problem of object detection for autonomous vehicles using the NuScenes dataset nuscenes2019, which contains labeled LIDAR point clouds and associated visual images. We split the data into separate train, unlabeled, and test splits. We detected vehicles only. We use the open-source Second model with PointPillars yan2018second; lang2019pointpillars for LIDAR detections and SSD for visual detections. We improve SSD via active learning and weak supervision in our experiments.
As NuScenes contains time-aligned point clouds and images, we deployed a custom assertion for 2D and 3D boxes agreeing, and the multibox assertion. We deployed a custom weak supervision rule that imputed boxes from the 3D predictions. While other assertions could have been deployed (e.g., flicker), we found that the dataset was not sampled frequently enough (at 2 Hz) for these assertions.
Medical classification. We studied the problem of classifying atrial fibrillation (AF) via ECG signals. We used a convolutional network that was shown to outperform cardiologists rajpurkar2017cardiologist. Unfortunately, the full dataset used in rajpurkar2017cardiologist is not publicly available, so we used the CINC17 dataset cinc17. CINC17 contains 8,528 data points that we split into train, validation, unlabeled, and test splits.
We consulted with medical researchers and deployed an assertion that asserts that the classification should not change between two classes in under a 30 second time period (i.e., the assertion fires when the classification changes from one class to another and back, A → B → A, within 30 seconds), as described in §4.
We first asked whether model assertions could be written succinctly. To test this, we implemented the model assertions described above and counted the lines of code (LOC) necessary for each assertion. We count the LOC for the identity and attribute functions for the consistency assertions (see Table 1 for a summary of assertions). We counted the LOC with and without the shared helper functions (e.g., computing box overlap); we double counted the helper functions when used between assertions. As we show in Table 2, both consistency and domain-specific assertions can be written in under 25 LOC excluding shared helper functions and under 60 LOC when including helper functions. Thus, model assertions can be written with few LOC.
|Assertion||LOC (no helpers)||LOC (inc. helpers)|
|Assertion||(identifier and output)||(model output only)|
We then asked whether model assertions could be written with high precision. To test this, we randomly sampled 50 data points that triggered each assertion and manually checked whether that data point had an incorrect output from the ML model. The consistency assertions return clusters of data points (e.g., appear), so we report precision in two ways: counting errors in either the identifier or the ML model outputs, and counting errors in the ML model outputs only. As we show in Table 3, model assertions achieve at least 88% precision in all cases.
We asked whether model assertions can identify high-confidence errors, or errors where the model returns the wrong output with high confidence. High-confidence errors are important to identify as confidence is used in downstream tasks, such as analytics queries and actuation decisions kang2017noscope; kang2018blazeit; hsieh2018focus; chinchali2019network. Furthermore, sampling solutions that are based on confidence would be unable to identify these errors.
To determine whether model assertions could identify high-confidence errors, we collected the 10 highest-confidence errors for each of the model assertions deployed for video analytics. We then plotted the percentile of the confidence among all the boxes for each error.
As shown in Figure 3, model assertions can identify errors within the top 94th percentile of boxes by confidence (the flicker confidences were from the average of the surrounding boxes). Importantly, uncertainty-based methods of monitoring would not catch these errors.
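The percentile computation behind this kind of analysis can be sketched as follows. This is a minimal illustration, assuming the confidences of all detected boxes are available as a flat list; the function name and the example values are hypothetical and are not taken from the experiments.

```python
from bisect import bisect_right
from typing import Sequence


def confidence_percentile(all_confidences: Sequence[float], error_confidence: float) -> float:
    """Percentage of boxes whose confidence is at or below error_confidence."""
    ranked = sorted(all_confidences)
    # bisect_right counts how many confidences are <= error_confidence.
    return 100.0 * bisect_right(ranked, error_confidence) / len(ranked)


# Hypothetical confidences for ten boxes (illustrative values only).
boxes = [i / 10 for i in range(1, 11)]
print(confidence_percentile(boxes, 0.95))
```

An error whose confidence lands at a high percentile would be missed by monitoring that only inspects low-confidence outputs.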
We further show that model assertions can identify errors in human labels, which effectively have a confidence of 1. These results are shown in Appendix E.
We evaluated OMG’s active learning capabilities and BAL using the three domains for which we had access to the training code (visual analytics, ECG, AVs).
Multiple model assertions. We asked whether multiple model assertions could be used to improve model quality via continuous data collection. We deployed three assertions over night-street and two assertions for NuScenes. We used random sampling, uncertainty sampling with “least confident” settles2009active, uniform sampling from data that triggered assertions, and BAL for the active learning strategies. We used the mAP metric for both datasets, which is widely used for object detection lin2014microsoft; he2017mask. We defer hyperparameters to Appendix C.
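Two of the baseline strategies can be sketched compactly. The code below is an illustration under stated assumptions: each data point is summarized by a single confidence score (for detection, how per-box confidences are aggregated per frame is a design choice not specified here), and the function names are hypothetical. BAL itself is not reproduced in this sketch.

```python
import random
from typing import Dict, List, Sequence


def least_confident(confidences: Dict[str, float], budget: int) -> List[str]:
    """Uncertainty sampling: pick the `budget` points with the lowest confidence."""
    return sorted(confidences, key=confidences.get)[:budget]


def uniform_from_flagged(flagged: Sequence[str], budget: int, seed: int = 0) -> List[str]:
    """Sample uniformly, without replacement, from points that triggered any assertion."""
    rng = random.Random(seed)
    return rng.sample(list(flagged), min(budget, len(flagged)))


# Hypothetical per-frame confidences (illustrative values only).
scores = {"frame_a": 0.9, "frame_b": 0.4, "frame_c": 0.6, "frame_d": 0.95}
print(least_confident(scores, budget=2))
```

Note that least-confident sampling never selects a high-confidence frame, so high-confidence errors can only reach the labeling queue through the assertion-based strategies.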
As we show in Figure 4, BAL outperforms both random sampling and uncertainty sampling on both datasets after the first round, which is required for calibration. BAL also outperforms uniform sampling from model assertions by the last round. For night-street, at a fixed accuracy threshold of 62%, BAL uses 40% fewer labels than random and uncertainty sampling. By the fifth round, BAL outperforms both random sampling and uncertainty sampling by 1.5% mAP. While the absolute change in mAP may seem small, doubling the model depth, which doubles the computational budget, on MS-COCO achieves a 1.7% improvement in mAP (ResNet-50 FPN vs. ResNet-101 FPN) Detectron2018.
These results are expected, as prior work has shown that uncertainty sampling can be unsuited for deep networks sener2017active.
Single model assertion. Due to the limited data quantities for the ECG dataset, we were unable to deploy more than one assertion. Nonetheless, we further asked whether a single model assertion could be used to improve model quality. We ran five rounds of data labeling with 100 examples each round for the ECG dataset. We ran the experiment 8 times and report averages. We show results in Figure 5. As shown, data collection with a single model assertion generally matches or outperforms both uncertainty and random sampling.
We used our consistency assertions to evaluate the impact of weak supervision using assertions for the domains we had weak labels for (video analytics, AVs, and ECG).
For night-street, we used 1,000 additional frames with 750 frames that triggered flicker and 250 random frames with a learning rate of for a total of 6 epochs. For the NuScenes dataset, we used the same 350 scenes to bootstrap the LIDAR model as in the active learning experiments. We trained with 175 scenes of weakly supervised data for one epoch with a learning rate of. For the ECG dataset, we used 1,000 weak labels and the same training procedure as in active learning.
|Video analytics (mAP)||34.4||49.9|
|ECG (% accuracy)||70.7||72.1|
Table 4 shows that model assertion-based weak supervision can improve relative performance by 46.4% for video analytics and 33% for AVs. Similarly, the ECG classification can also improve with no human-generated labels. These results show that model assertions can be useful as a primitive for improving model quality with no additional data labeling.
ML QA. A range of existing ML QA tools focus on validating inputs via schemas or tracking performance over time polyzotis2019data; baylor2017tfx. However, these systems apply to situations with meaningful schemas (e.g., tabular data) and ground-truth labels at test time (e.g., predicting click-through rate). While model assertions could also apply to these cases, they also cover situations that do not contain meaningful schemas or labels at test time.
Other ML QA systems focus on training pipelines renggli2019continuous or validating numerical errors odena2018tensorfuzz. These approaches are important for finding pre-deployment bugs, but do not apply to test-time scenarios; they are complementary to model assertions.
White-box testing systems, e.g., DeepXplore pei2017deepxplore, test ML models by taking inputs and perturbing them. However, as discussed, a validation set cannot cover all possibilities in the deployment set. Furthermore, these systems do not give guarantees under model drift.
Since our initial workshop paper kang2018model, several works have extended model assertions arechiga2019better; henzinger2019outside.
Verified ML. Verification has been applied to ML models in simple cases. For example, Reluplex (katz2017reluplex) can verify that extremely small networks will make correct control decisions given a fixed set of inputs and other work has shown that similarly small networks can be verified against minimal perturbations of a fixed set of input images (raghunathan2018certified). However, verification requires a specification, which may not be feasible to implement, e.g., even humans may disagree on certain predictions kirillov2018panoptic. Furthermore, the largest verified networks we are aware of katz2017reluplex; raghunathan2018certified; wang2018formal; sun2019formal are orders of magnitude smaller than the networks we consider.
Software Debugging. Writing and verifying correct software has a long history, with many proposals from the research community. We hope that many such practices are adopted in deploying machine learning models; we focus on assertions in this work (goldstine1947planning; turing1949checking). Assertions have been shown to reduce the prevalence of bugs when deployed correctly (kudrjavets2006assessing; mahmood1984executable). Many other such methods exist, including formal verification (klein2009sel4; leroy2009formal; keller1976formal), large-scale testing (e.g., fuzzing) (takanen2008fuzzing; godefroid2012sage), and symbolic execution to trigger assertions (king1976symbolic; cadar2008klee). Probabilistic assertions have been used to verify simple distributional properties of programs, e.g., that differentially private programs should return an expected mean sampson2014expressing. However, ML developers may not be able to specify distributions and data may shift in deployment.
Structured Prediction, Inductive Bias. Several ML methods encode structure/inductive biases into training procedures or models (bakir2007predicting; haussler1988quantifying). While promising, designing algorithms and models with specific inductive biases can be challenging for non-experts. Additionally, these methods generally do not contain runtime checks for aberrant behavior.
Weak Supervision, Semi-supervised Learning. Weak supervision leverages higher-level and/or noisier input from human experts to improve model quality (mintz2009distant; ratner2017weak; jin2018unsupervised). In semi-supervised learning, structural assumptions over the data are used to leverage unlabeled data (zhu2011semi). However, to our knowledge, neither of these methods contains runtime checks or is used in model-agnostic active learning methods.
While we believe model assertions are an important step towards a practical solution for monitoring and continuously improving ML models, we highlight three important limitations of model assertions, which may be fruitful directions for future work.
First, certain model assertions may be difficult to express in our current API. While arbitrary code can be expressed in OMG’s API, certain temporal assertions may be better expressed in a complex event processing language wu2006high. We believe that domain-specific languages for model assertions will be a fruitful area of future research.
Second, we have not thoroughly evaluated model assertions’ performance in real-time systems. Model assertions may add overhead to systems where actuation has tight latency constraints, e.g., AVs. Nonetheless, model assertions can be used over historical data for these systems. We are actively collaborating with an AV company to explore these issues.
Third, certain issues in ML systems, such as bias in training sets, are out of scope for model assertions. We hope that complementary systems, such as TFX baylor2017tfx, can help improve quality in these cases.
In this work, we introduced model assertions, a model-agnostic technique that allows domain experts to indicate errors in ML models. We showed that model assertions can be used at runtime to detect high-confidence errors, which prior methods would not detect. We proposed methods to use model assertions for active learning and weak supervision to improve model quality. We implemented model assertions in a novel library, OMG, and demonstrated that they can apply to a wide range of real-world ML tasks, improving monitoring, active learning, and weak supervision for ML models.
This research was supported in part by affiliate members and other supporters of the Stanford DAWN project—Ant Financial, Facebook, Google, Infosys, NEC, and VMware—as well as Toyota Research Institute, Northrop Grumman, Cisco, SAP, and the NSF under CAREER grant CNS-1651570 and Graduate Research Fellowship grant DGE-1656518. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. Toyota Research Institute (“TRI”) provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.
We further acknowledge Kayvon Fatahalian, James Hong, Dan Fu, Will Crichton, Nikos Arechiga, and Sudeep Pillai for their productive discussions on ML applications.
In this section, we illustrate several errors caught by the model assertions used in our evaluation.
First, we show an example error in the TV news use case in Figure 6. Recall that these assertions were generated with our consistency API (§4). In this example, the identifier is the box’s sceneid and the attribute is the identity.
Second, we show an example error for the visual analytics use case in Figure 7 for the multibox assertion. Here, SSD erroneously detects multiple cars when there should be one.
Third, we show two example errors for the AV use case in Figure 8 from the multibox and agree assertions.
We present a non-exhaustive list of common classes of model assertions in Table 5 and below. Namely, we describe how one might look for assertions in other domains.
Our taxonomization is not exact and several examples will contain features from several classes of model assertions. Prior work on schema validation polyzotis2019data; baylor2017tfx and data augmentation wang2017effectiveness; taylor2017improving can be cast in the model assertion framework. As these have been studied, we do not focus on these classes of assertions in this work.
Consistency assertions. An important class of model assertions checks the consistency across multiple models or sources of data. The multiple sources of data could be the output of multiple ML models on the same data, multiple sensors, or multiple views of the same data. The output from the various sources should agree and consistency model assertions specify this constraint. These assertions can be generated via our API as described in §4.
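The core check behind a consistency assertion can be sketched as follows. This is a minimal illustration, assuming each model output is a dictionary and that the user supplies an identifier function and an attribute function, as in the TV news example; the function name and data layout are hypothetical and do not reproduce OMG's API.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Hashable, Iterable, Set


def inconsistent_identifiers(
    outputs: Iterable[Dict[str, Any]],
    identifier: Callable[[Dict[str, Any]], Hashable],
    attribute: Callable[[Dict[str, Any]], Hashable],
) -> Dict[Hashable, Set[Hashable]]:
    """Return identifiers whose outputs disagree on the attribute value."""
    seen: Dict[Hashable, Set[Hashable]] = defaultdict(set)
    for out in outputs:
        seen[identifier(out)].add(attribute(out))
    # Outputs sharing an identifier should agree, so more than one value is a violation.
    return {ident: values for ident, values in seen.items() if len(values) > 1}


# Hypothetical tracked detections: track 1 changes class between frames.
detections = [
    {"track": 1, "cls": "car"},
    {"track": 1, "cls": "truck"},
    {"track": 2, "cls": "car"},
]
print(inconsistent_identifiers(detections, lambda d: d["track"], lambda d: d["cls"]))
```

The same function covers multiple models or sensors if the identifier maps their outputs for one physical object to a shared key.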
Domain knowledge assertions. In many physical domains, domain experts can express physical constraints or unlikely scenarios. As an example of a physical constraint, when predicting how proteins will interact, atoms should not physically overlap. As an example of an unlikely scenario, boxes of the visible part of cars should not highly overlap (Figure 7). In particular, model assertions of unlikely scenarios may not be 100% precise, i.e., will be soft assertions.
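An unlikely-scenario assertion of this kind can be sketched with a box-overlap helper. The code below is an illustration, assuming boxes are given as (x1, y1, x2, y2) corner coordinates and using intersection over union (IoU) as the overlap measure; the 0.8 threshold and the function names are hypothetical choices, not the settings used in the experiments.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def overlapping_pairs(boxes: Sequence[Box], threshold: float = 0.8) -> List[Tuple[int, int]]:
    """Return index pairs of boxes in one frame whose IoU exceeds the threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(boxes), 2)
        if iou(a, b) > threshold
    ]
```

Because heavily overlapping boxes are unlikely rather than impossible, a non-empty result is best treated as a soft signal that the frame deserves review.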
Perturbation assertions. Many domains contain input and output pairs that can be perturbed (perhaps jointly) such that the output does not change. These perturbations have been widely studied through the lens of data augmentation wang2017effectiveness; taylor2017improving and adversarial examples goodfellow2015explaining; athalye2018synthesizing.
Input validation assertions. Domains that contain schemas for the input data can have model assertions that validate the input data based on the schema polyzotis2019data; baylor2017tfx. For example, boolean inputs that are encoded with integral values (i.e., 0 or 1) should never be negative. This class of assertions is an instance of preconditions for ML models.
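The boolean-encoding example can be written as a precondition check. This is a minimal sketch, assuming each input is a dictionary of named fields; the function and field names are hypothetical.

```python
from typing import Any, Dict, List, Sequence


def invalid_boolean_fields(record: Dict[str, Any], boolean_fields: Sequence[str]) -> List[str]:
    """Return the boolean-encoded fields whose value is not equal to 0 or 1.

    A missing field is also reported, since record.get returns None for it.
    """
    return [name for name in boolean_fields if record.get(name) not in (0, 1)]


# Hypothetical input record with one badly encoded boolean.
record = {"is_night": 1, "has_rain": -1, "speed": 3}
print(invalid_boolean_fields(record, ["is_night", "has_rain"]))
```

Running such a check before inference keeps malformed inputs from silently producing model outputs.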
Hyperparameters for active learning experiments. For night-street, we used 300,000 frames of one day of video for the training and unlabeled data. We sampled 100 frames per round for five rounds and used 25,000 frames of a different day of video for the test set. Due to the cost of obtaining labels, we ran each trial twice.
For the NuScenes dataset, we used 350 scenes to bootstrap the LIDAR model, 175 scenes for unlabeled/training data for SSD, and 75 scenes for validation (out of the original 850 labeled scenes). We trained for one epoch at a learning rate of . We ran 8 trials.
For the ECG dataset, we trained for 5 rounds of active learning with 100 samples per round. We used a learning rate of 0.001 until the loss plateaued, following the original training code.
We show active learning results for all rounds in Figure 9.
We further asked whether model assertions could be used to identify errors in human-generated labels, i.e., a human is acting as an “ML model.” While verification of human labels has been studied in the context of crowd-sourcing hirth2013analyzing; tran2013efficient, several production labeling services (e.g., Scale scale) do not provide annotator identification, which is necessary to perform this verification. We deployed a model assertion in which we tracked objects across frames of a video using an automated method and verified that the same object in different frames had the same label.
We obtained labels for 1,000 random frames from night-street from Scale AI scale, which is used by several autonomous vehicle companies. Table 6 summarizes our results. Scale returned 469 boxes, which we manually verified for correctness. There were no localization errors, but there were 32 classification errors, of which the model assertion caught 12.5%. Thus, we see that model assertions can also be used to verify human labels.