
Model Assertions for Monitoring and Improving ML Models

ML models are increasingly deployed in settings with real-world interactions such as vehicles, but unfortunately, these models can fail in systematic ways. To prevent errors, ML engineering teams monitor and continuously improve these models. We propose a new abstraction, model assertions, that adapts the classical use of program assertions as a way to monitor and improve ML models. Model assertions are arbitrary functions over a model's input and output that indicate when errors may be occurring, e.g., a function that triggers if an object rapidly changes its class in a video. We propose methods of using model assertions at all stages of ML system deployment, including runtime monitoring, validating labels, and continuously improving ML models. For runtime monitoring, we show that model assertions can find high confidence errors, where a model returns the wrong output with high confidence, which uncertainty-based monitoring techniques would not detect. For training, we propose two methods of using model assertions. First, we propose a bandit-based active learning algorithm that can sample from data flagged by assertions and show that it can reduce labeling costs by up to 40% over traditional uncertainty-based methods. Second, we propose an API for generating "consistency assertions" (e.g., the class change example) and weak labels for inputs where the consistency assertions fail, and show that these weak labels can improve relative model quality by up to 46%. We evaluate model assertions on four real-world tasks with video, LIDAR, and ECG data.



1 Introduction

ML is increasingly deployed in complex contexts that require inference about the physical world, from autonomous vehicles (AVs) to precision medicine. However, ML models can misbehave in unexpected ways. For example, AVs have accelerated toward highway lane dividers lee2018tesla and can rapidly change their classification of objects over time, causing erratic behavior coldewey2018uber; ntsb2019vehicle. As a result, quality assurance (QA) of models, including continuous monitoring and improvement, is of paramount concern.

Unfortunately, performing QA for complex, real-world ML applications is challenging: ML models fail for diverse reasons that are unknown before deployment. Thus, existing solutions that focus on verifying training, including formal verification katz2017reluplex, whitebox testing pei2017deepxplore, monitoring training metrics renggli2019continuous, and validating training code odena2018tensorfuzz, only give guarantees on a test set and perturbations thereof, so models can still fail on the huge volumes of deployment data that are not part of the test set (e.g., billions of images per day in an AV fleet). Validating input schemas polyzotis2019data; baylor2017tfx does not work for applications with unstructured inputs that lack meaningful schemas, e.g., images. Solutions that check whether model performance remains consistent over time baylor2017tfx only apply to deployments that have ground truth labels, e.g., click-through rate prediction, but not to deployments that lack labels.

As a step towards more robust QA for complex ML applications, we have found that ML developers can often specify systematic errors made by ML models: certain classes of errors are repetitive and can be checked automatically, via code. For example, in developing a video analytics engine, we noticed that object detection models can identify boxes of cars that flicker rapidly in and out of the video (Figure 1), indicating some of the detections are likely wrong. Likewise, our contacts at an AV company reported that LIDAR and camera models sometimes disagree. While seemingly simple, similar errors were involved with a fatal AV crash ntsb2019vehicle. These systematic errors can arise for diverse reasons, including domain shift between training and deployment data (e.g., still images vs. video), incomplete training data (e.g., no instances of snow-covered cars), and noisy inputs.

To leverage the systematic nature of these errors, we propose model assertions, an abstraction to monitor and improve ML model quality. Model assertions are inspired by program assertions goldstine1947planning; turing1949checking, one of the most common ways to monitor software. A model assertion is an arbitrary function over a model’s input and output that returns a Boolean (0 or 1) or continuous (floating point) severity score to indicate when faults may be occurring. For example, a model assertion that checks whether an object flickers in and out of video could return a Boolean value over each frame or the number of objects that flicker. While assertions may not offer a complete specification of correctness, we have found that assertions are easy to specify in many domains (§2).

(a) Frame 1, SSD
(b) Frame 2, SSD
(c) Frame 3, SSD
(d) Frame 1, SSD
(e) Frame 2, assertion corrected
(f) Frame 3, SSD
Figure 1: Top row: example of flickering in three consecutive frames of a video. The object detection method, SSD liu2016ssd, failed to identify the car in the second frame. Bottom row: example of correcting the output of a model. The car bounding box in the second frame can be inferred using nearby frames based on a consistency assertion.

We explore several ways to use model assertions, both at runtime and training time.

First, we show that model assertions can be used for runtime monitoring: they can be used to log unexpected behavior or automatically trigger corrective actions, e.g., shutting down an autopilot. Furthermore, model assertions can often find high confidence errors, where the model has high certainty in an erroneous output; these errors are problematic because prior uncertainty-based monitoring would not flag these errors. Additionally, and perhaps surprisingly, we have found that many groups are also interested in validating human-generated labels, which can be done using model assertions.

Second, we show that assertions can be used for active learning, in which data is continuously collected to improve ML models. Traditional active learning algorithms select data to label based on uncertainty, with the intuition that “harder” data where the model is uncertain will be more informative settles2009active; coleman2020selection. Model assertions provide another natural way to find “hard” examples. However, using assertions in active learning presents a challenge: how should the active learning algorithm select between data when several assertions are used? A data point can be flagged by multiple assertions or a single assertion can flag multiple data points, in contrast to a single uncertainty metric. To address this challenge, we present a novel bandit-based active learning algorithm (BAL). Given a set of data that have been flagged by potentially multiple model assertions, our bandit algorithm uses the assertions’ severity scores as context (i.e., features) and maximizes the marginal reduction in the number of assertions fired (§3). We show that our bandit algorithm can reduce labeling costs by up to 40% over traditional uncertainty-based methods.

Third, we show that assertions can be used for weak supervision mintz2009distant; ratner2017weak. We propose an API for writing consistency assertions about how attributes of a model’s output should relate that can also provide weak labels for training. Consistency assertions specify that data should be consistent between attributes and identifiers, e.g., a TV news host (identifier) should have consistent gender (attribute), or that certain predictions should (or should not) exist in temporally related outputs, e.g., cars in adjacent video frames (Figure 1). We demonstrate that this API can apply to a range of domains, including medical classification and TV news analytics. These weak labels can be used to improve relative model quality by up to 46% with no additional human labeling.

We implement model assertions in a Python library, OMG (a recursive acronym for OMG Model Guardian), that can be used with existing ML frameworks. We evaluate assertions on four ML applications: understanding TV news, AVs, video analytics, and classifying medical readings. We implement assertions for systematic errors reported by ML users in these domains, including checking for consistency between sensors, domain knowledge about object locations in videos, and medical knowledge about heart patterns. Across these domains, we find that the model assertions we consider can be written with at most 60 lines of code and with 88-100% precision, that these assertions often find high-confidence errors (e.g., top 90th percentile by confidence), and that our new algorithms for active learning and weak supervision via assertions improve model quality over existing methods.

In summary, we make the following contributions:

  1. We introduce the abstraction of model assertions for monitoring and continuously improving ML models.

  2. We show that model assertions can find high confidence errors, which would not be flagged by uncertainty metrics.

  3. We propose a bandit algorithm to select data points for active learning via model assertions and show that it can reduce labeling costs by up to 40%.

  4. We propose an API for consistency assertions that can automatically generate weak labels for data where the assertion fails, and show that weak supervision via these labels can improve relative model quality by up to 46%.

2 Model Assertions

We describe the model assertion interface, examples of model assertions, how model assertions can integrate into the ML development/deployment cycle, and their implementation in OMG.

2.1 Model Assertions Interface

We formalize the model assertions interface. Model assertions are arbitrary functions that can indicate when an error is likely to have occurred. They take as input a list of inputs and outputs from one or more ML models. They return a severity score, a continuous value that indicates the severity of an error of a specific type. By convention, the 0 value represents an abstention. Boolean values can be implemented in model assertions by only returning 0 and 1. The severity score does not need to be calibrated, as our algorithms only use the relative ordering of scores.

As a concrete example, consider an AV with a LIDAR sensor and camera and object detection models for each sensor. To check that these models agree, a developer may write:

def sensor_agreement(lidar_boxes, camera_boxes):
  failures = 0
  for lidar_box in lidar_boxes:
    if no_overlap(lidar_box, camera_boxes):
      failures += 1
  return failures

Notably, our library OMG can register arbitrary Python functions as model assertions.
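The `sensor_agreement` assertion above relies on a helper, `no_overlap`, that the text does not define. The following is a minimal runnable sketch under stated assumptions: boxes are axis-aligned `(x_min, y_min, x_max, y_max)` tuples already projected into a shared 2D image plane, and "overlap" means an intersection-over-union of at least a hypothetical threshold `min_iou`. Neither the box format nor the threshold comes from OMG.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in a shared 2D image plane.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def no_overlap(box: Box, others: List[Box], min_iou: float = 0.1) -> bool:
    """True if `box` overlaps none of `others` by at least `min_iou` (hypothetical threshold)."""
    return all(iou(box, other) < min_iou for other in others)


def sensor_agreement(lidar_boxes: List[Box], camera_boxes: List[Box]) -> int:
    """Severity score: number of LIDAR boxes with no matching camera box (0 = abstain)."""
    return sum(1 for box in lidar_boxes if no_overlap(box, camera_boxes))


# One LIDAR box is matched by the camera, the other is not.
lidar = [(0, 0, 10, 10), (50, 50, 60, 60)]
camera = [(1, 1, 11, 11)]
severity = sensor_agreement(lidar, camera)
```

Returning a count rather than a Boolean follows the severity-score convention of §2.1: 0 means the assertion abstains, and larger values indicate more disagreements.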

2.2 Example Use Cases and Assertions

In this section, we provide use cases for model assertions that arose in discussions with industry and academic contacts, including AV companies and academic labs. We show examples of errors caught by the model assertions described in this section in Appendix A and describe how one might look for assertions in other domains in Appendix B.

Our discussions revealed two key properties in real-world ML systems. First, ML models are deployed on orders of magnitude more data than can reasonably be labeled, so a labeled sample cannot capture all deployment conditions. For example, the fleet of Tesla vehicles will see over 100× more images in a day than in the largest existing image dataset sun2017revisiting. Second, complex ML deployments are developed by large teams, of which some developers may not have the ability to manage all parts of the application. As a result, it is critical to be able to do QA collaboratively to cover the application end-to-end.

Analyzing TV news. We spoke to a research lab studying bias in media via automatic analysis. This lab collected over 10 years of TV news (billions of frames) and executed face detection every three seconds. These detections are subsequently used to identify the faces, detect gender, and classify hair color using ML models. Currently, the researchers have no method of identifying errors and manually inspect data. However, they additionally compute scene cuts. Given that most TV news hosts do not move much between scenes, we can assert that the identity, gender, and hair color of faces that highly overlap within the same scene are consistent (Figure 6, Appendix). We further describe how model assertions can be implemented via our consistency API for TV news in §4.

Autonomous vehicles (AVs). AVs are required to execute a variety of tasks, including detecting objects and tracking lane markings. These tasks are accomplished with ML models from different sensors, such as visual, LIDAR, or ultrasound sensors davies2018how. For example, a vision model might be used to detect objects in video and a point cloud model might be used to do 3D object detection.

Our contacts at an AV company noticed that models from video and point clouds can disagree. We implemented a model assertion that projects the 3D boxes onto the 2D camera plane to check for consistency. If the assertion triggers, then at least one of the sensors returned an incorrect answer.

Video analytics. Many modern, academic video analytics systems use an object detection method kang2017noscope; kang2018blazeit; hsieh2018focus; jiang2018chameleon; xu2019vstore; canel2019scaling trained on MS-COCO lin2014microsoft, a corpus of still images. These still image object detection methods are deployed on video for detecting objects. None of these systems aim to detect errors, even though errors can affect analytics results.

In developing such systems, we noticed that objects flicker in and out of the video (Figure 1) and that vehicles overlap in unrealistic ways (Figure 7, Appendix). We implemented assertions to detect these.

Medical classification. Deep learning researchers have created deep networks that can outperform cardiologists for classifying atrial fibrillation (AF, a form of heart condition) from single-lead ECG data rajpurkar2017cardiologist. Our researcher contacts mentioned that AF predictions from DNNs can rapidly oscillate. The European Society of Cardiology guidelines for detecting AF require at least 30 seconds of signal before calling a detection developed2010guidelines. Thus, predictions should not rapidly switch between two states. A developer could specify this model assertion, which could be implemented to monitor ECG classification deployments.
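As an illustration of the ECG assertion just described, the sketch below counts runs of a predicted class that last less than 30 seconds. It assumes, hypothetically, that the classifier emits one label per fixed-length window (`seconds_per_prediction`) and that the first and last runs should be ignored because the recording boundaries may truncate them; neither assumption comes from the paper.

```python
from itertools import groupby
from typing import List


def ecg_oscillation(predictions: List[str],
                    seconds_per_prediction: float,
                    min_duration_s: float = 30.0) -> int:
    """Severity score: number of interior runs of one class shorter than min_duration_s."""
    # Collapse consecutive identical labels into (label, run_length) pairs.
    runs = [(label, sum(1 for _ in group)) for label, group in groupby(predictions)]
    # Skip the first and last runs: they may be cut off by the recording boundaries.
    interior = runs[1:-1]
    return sum(1 for _, length in interior
               if length * seconds_per_prediction < min_duration_s)


# A 10-second AF blip between two 60-second normal segments triggers the assertion.
oscillating = ["N"] * 6 + ["AF"] * 1 + ["N"] * 6
stable = ["N"] * 6 + ["AF"] * 6
blip_severity = ecg_oscillation(oscillating, seconds_per_prediction=10.0)
stable_severity = ecg_oscillation(stable, seconds_per_prediction=10.0)
```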

Figure 2: A system diagram of how model assertions can integrate into the ML development/deployment pipeline. Users can collaboratively add to an assertion database. We also show how related work can be integrated into the pipeline. Notably, verification only gives guarantees on a test set and perturbations thereof, but not on arbitrary runtime data.

2.3 Using Model Assertions for QA

We describe how model assertions can be integrated with ML development and deployment pipelines. Importantly, model assertions are complementary to a range of other ML QA techniques, including verification, fuzzing, and statistical techniques, as shown in Figure 2.

First, model assertions can be used for monitoring and validating all parts of the ML development/deployment pipeline. Namely, model assertions are agnostic to the source of the output, whether they be ML models or human labelers. Perhaps surprisingly, we have found several groups to also be interested in monitoring human label quality. Thus, concretely, model assertions can be used to validate human labels (data collection) or historical data (validation), and to monitor deployments (e.g., to populate dashboards).

Second, model assertions can be used at training time to select which data points to label in active learning. We describe BAL, our algorithm for data selection, in §3.

Third, model assertions can be used to generate weak labels to further train ML models without additional human labels. We describe how OMG accomplishes this via consistency assertions in §4. Users can also register their own weak supervision rules.

2.4 Implementing Model Assertions in OMG

We implement a prototype library for model assertions, OMG, that works with existing Python ML training and deployment frameworks. We briefly describe OMG’s implementation.

OMG logs user-defined assertions as callbacks. The simplest way to add an assertion is through AddAssertion(func), where func is a function of the inputs and outputs (see below). OMG also provides an API to add consistency assertions as described in §4. Given this database, OMG requires a callback after model execution that takes the model’s input and output as input. Given the model’s input and output, OMG will execute the assertions and record any errors. We assume the assertion signature is similar to the following; this assertion signature is for the example in Figure 1:

def flickering(recent_frames: List[PixelBuf],
  recent_outputs: List[BoundingBox]) -> Float

For active learning, OMG will take a batch of data and return indices for which data points to label. For weak supervision, OMG will take data and return weak labels where valid. Users can specify weak labeling functions associated with assertions to help with this.
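To make the callback flow concrete, here is a minimal sketch of an assertion registry. The class `AssertionRegistry` and its methods `add_assertion` and `after_inference` are hypothetical stand-ins written for illustration; they are not OMG's actual API. The `flickering` body is likewise an assumption: it treats `recent_outputs` as a list of per-frame box lists and counts frames whose detection count dips below both neighbors.

```python
from typing import Any, Callable, Dict, List

# An assertion maps recent inputs and outputs to a severity score (0 = abstain).
Assertion = Callable[[List[Any], List[Any]], float]


class AssertionRegistry:
    """Hypothetical assertion database; a sketch, not OMG's implementation."""

    def __init__(self) -> None:
        self._assertions: Dict[str, Assertion] = {}
        self.log: List[Dict[str, float]] = []

    def add_assertion(self, func: Assertion) -> None:
        """Register an arbitrary Python function under its own name."""
        self._assertions[func.__name__] = func

    def after_inference(self, recent_inputs: List[Any],
                        recent_outputs: List[Any]) -> Dict[str, float]:
        """Callback to run after model execution; records assertions that fired."""
        scores = {name: float(fn(recent_inputs, recent_outputs))
                  for name, fn in self._assertions.items()}
        fired = {name: score for name, score in scores.items() if score != 0}
        if fired:
            self.log.append(fired)
        return scores


def flickering(recent_frames: List[Any], recent_outputs: List[List[Any]]) -> float:
    """Count frames with fewer detections than both the previous and next frame."""
    counts = [len(boxes) for boxes in recent_outputs]
    return float(sum(1 for prev, cur, nxt in zip(counts, counts[1:], counts[2:])
                     if cur < prev and cur < nxt))


registry = AssertionRegistry()
registry.add_assertion(flickering)
# A car is detected in frames 1 and 3 but missed in frame 2.
scores = registry.after_inference([None, None, None],
                                  [[(0, 0, 1, 1)], [], [(0, 0, 1, 1)]])
```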

In the following two sections, we describe two key methods that OMG uses to improve model quality: BAL for active learning and consistency assertions for weak supervision.

3 Using Model Assertions for Active Learning with BAL

We introduce an algorithm called BAL to select data for active learning via model assertions. BAL assumes that a set of data points has been collected and a subset will be labeled in bulk. We found that labeling services scale and our industrial contacts usually label data in bulk.

Given a set of data points that triggered model assertions, OMG must select which points to label. There are two key challenges that make data selection intractable in its full generality. First, we do not know the marginal utility of selecting a data point to label without labeling the data point. Second, even with labels, estimating the marginal gain of data points is expensive to compute as training modern ML models is expensive.

To address these issues, we make simplifying assumptions. We describe the statistical model we assume, the resource-unconstrained algorithm, our simplifying assumptions, and BAL. We note that, while the resource-unconstrained algorithm can produce statistical guarantees, BAL does not. We instead empirically verify its performance in Section 5.

Data selection as multi-armed bandits. We cast the data selection problem as a multi-armed bandit (MAB) problem auer2002finite; berry1985bandit. In MABs, a set of “arms” (i.e., individual data points) is provided and the user must select a set of arms (i.e., points to label) to achieve the maximal expected utility (e.g., maximize validation accuracy, minimize number of assertions that fire). MABs have been studied in a wide variety of settings radlinski2008learning; lu2010contextual; bubeck2009pure, but we assume that the arms have context associated with them (i.e., severity scores from model assertions) and give submodular rewards (defined below). The rewards are possibly time-varying. We further assume there is an (unknown) smoothness parameter that determines the similarity between arms of similar contexts (formally, the $\alpha$ in the Hölder condition evans1998graduate). The following presentation is inspired by chen2018contextual.

Concretely, we assume the data will be labeled in $T$ rounds and denote the rounds $t = 1, \dots, T$. We refer to the set of $n$ data points as $N = \{1, \dots, n\}$. Each data point has a $d$-dimensional feature vector associated with it, where $d$ is the number of model assertions. We refer to the feature vector as $x_i^t$, where $i$ is the data point index and $t$ is the round index; from here, we will refer to the data points as $x_i^t$. Each entry in a feature vector is the severity score from a model assertion. The feature vectors can change over time as the model predictions, and therefore assertions, change over the course of training.

We assume there is a budget on the number of arms (i.e., data points to label), $B^t$, at every round. The user must select a set of arms $S^t \subseteq N$ such that $|S^t| \le B^t$. We assume that the reward from the arms, $R(S^t)$, is submodular in $S^t$. Intuitively, submodularity implies diminishing marginal returns: adding the 100th data point will not improve the reward as much as adding the 10th data point. Formally, we first define the marginal gain of adding an extra arm:

$$\Delta R(\{m\}, A) = R(A \cup \{m\}) - R(A), \tag{1}$$

where $A \subset N$ is a subset of arms and $m \in N$ is an additional arm such that $m \notin A$. The submodularity condition states that, for any $A \subseteq C \subseteq N$ and $m \notin C$,

$$\Delta R(\{m\}, A) \ge \Delta R(\{m\}, C). \tag{2}$$
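A small worked example may help make the diminishing-returns condition concrete. The sketch below uses a hypothetical coverage reward, where each arm covers a set of error types and the reward counts the distinct types covered; this toy reward is an illustration only and is not the reward used in the paper. It compares the marginal gain of adding the same arm to a smaller set and to a superset of it.

```python
from typing import Dict, FrozenSet, Set

# Hypothetical toy data: each arm "covers" some error types.
COVERAGE: Dict[str, Set[int]] = {"a": {1, 2}, "b": {2, 3}, "m": {3, 4}}


def reward(arms: FrozenSet[str]) -> int:
    """R(S): number of distinct error types covered by the selected arms."""
    covered: Set[int] = set()
    for arm in arms:
        covered |= COVERAGE[arm]
    return len(covered)


def marginal_gain(m: str, arms: FrozenSet[str]) -> int:
    """Delta R({m}, A) = R(A union {m}) - R(A)."""
    return reward(arms | {m}) - reward(arms)


small = frozenset({"a"})        # A
large = frozenset({"a", "b"})   # C, a superset of A
gain_small = marginal_gain("m", small)   # R({a, m}) - R({a})       = 4 - 2
gain_large = marginal_gain("m", large)   # R({a, b, m}) - R({a, b}) = 4 - 3
```

Adding arm `m` to the smaller set gains 2, while adding it to the superset gains only 1, so the gain with respect to the smaller set is at least the gain with respect to the larger one.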


Resource-unconstrained algorithm. Assuming an infinite labeling and computational budget, we describe an algorithm that selects data points to train on. Unfortunately, this algorithm is not feasible as it requires labels for every point and training the ML model many times.

If we assume that rewards for individual arms can be queried, then a recent bandit algorithm, CC-MAB chen2018contextual, can achieve a regret of $O\big(c\,T^{\frac{2\alpha+d}{3\alpha+d}} \log T\big)$, where $\alpha$ is the smoothness parameter, $d$ is the number of model assertions (the context dimension), and $T$ is the number of rounds. A regret bound is the (asymptotic) difference with respect to an oracle algorithm. Briefly, CC-MAB explores under-explored arms until it is confident that certain arms have highest reward. Then, it greedily takes the highest reward arms. Full details are given in chen2018contextual and summarized in Algorithm 1.

Input: $T$, $B^t$, $N$, $R$
Output: choice of arms $S^t$ at rounds $t \in \{1, \dots, T\}$
for $t = 1, \dots, T$ do
        if Underexplored arms then
               Select arms $S^t$ from under-explored contexts at random
        else
               Select arms $S^t$ by highest marginal gain (Eq. 1):
               for $i = 1, \dots, B^t$ do
                      Add $\arg\max_{m \notin S^t} \Delta R(\{m\}, S^t)$ to $S^t$
               end for
        end if
end for
Algorithm 1: A summary of the CC-MAB algorithm. CC-MAB first explores under-explored arms, then greedily selects arms with highest marginal gain. Full details are given in chen2018contextual.
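The greedy step of Algorithm 1 can be sketched as follows. This is only the exploitation branch, written under the assumption that the reward of any set of arms can be queried directly (which is exactly what makes it infeasible in practice); the exploration branch and the context partitioning of CC-MAB are omitted. The coverage-style `toy_reward` is a hypothetical example, not the paper's reward.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List, Set


def greedy_select(arms: Iterable[str], budget: int,
                  reward: Callable[[FrozenSet[str]], float]) -> List[str]:
    """Pick up to `budget` arms, each time adding the arm with the highest marginal gain."""
    selected: List[str] = []
    remaining = list(arms)
    for _ in range(min(budget, len(remaining))):
        current = frozenset(selected)
        base = reward(current)
        # Marginal gain of arm m with respect to the arms chosen so far.
        best = max(remaining, key=lambda m: reward(current | {m}) - base)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical reward: number of distinct error types covered.
COVERS: Dict[str, Set[int]] = {"a": {1, 2, 3}, "b": {3, 4}, "c": {1, 2}}


def toy_reward(chosen: FrozenSet[str]) -> float:
    covered: Set[int] = set()
    for arm in chosen:
        covered |= COVERS[arm]
    return float(len(covered))


picked = greedy_select(["a", "b", "c"], budget=2, reward=toy_reward)
```

Here arm `a` is taken first (gain 3), then `b` (gain 1), because `c` adds nothing once `a` is selected.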

Unfortunately, CC-MAB requires access to an estimate of the marginal gain of selecting a single arm. Estimating the gain of a single arm requires a label and requires retraining and reevaluating the model, which is computationally infeasible for expensive-to-train ML models, especially modern deep networks.

Resource-constrained algorithm. We make simplifying assumptions and use these to modify CC-MAB for the resource-constrained setting. Our simplifying assumptions are that 1) data points with similar contexts (i.e., $x_i^t$) are interchangeable, 2) data points with higher severity scores have higher expected marginal gain, and 3) reducing the number of triggered assertions will increase accuracy.

Under these assumptions, we do not require an estimate of the marginal reward for each arm. Instead, we can approximate the marginal gain from selecting arms with similar contexts by the total number of these arms that were selected. This has two benefits. First, we can train a model on a set of arms (i.e., data points) in batches instead of adding single arms at a time. Second, we can select data points of similar contexts at random, without having to compute their marginal gains.

Input: $T$, $B^t$, $N$, $R$
Output: choice of arms $S^t$ at rounds $t \in \{1, \dots, T\}$
for $t = 1, \dots, T$ do
        if $t = 1$ (first round) then
               Select data points uniformly at random from the $d$ model assertions
        else
               Compute the marginal reduction $r_m$ of the number of times model assertion $m = 1, \dots, d$ triggered from the previous round;
               if all $r_m \le 0$ (no assertion fired less often) then
                      Fall back to baseline method;
               end if
               for $i = 1, \dots, B^t$ do
                      Select model assertion $m$ proportional to $r_m$;
                      Select $x_i$ that triggers $m$, sample proportional to severity score rank;
                      Add $x_i$ to $S^t$;
               end for
        end if
end for
Algorithm 2: BAL algorithm for data selection for continuous training. BAL samples from the assertions at random in the first round, then selects the assertions that result in highest marginal reduction in the number of assertions that fire in subsequent rounds. BAL will default to random sampling or uncertainty sampling if none of the assertions reduce.

Leveraging these assumptions, we can simplify Algorithm 1 to require less computation for training models and to not require labels for all data points. Our algorithm is described in Algorithm 2. Briefly, we approximate the marginal gain of selecting batches of arms and select arms proportional to the marginal gain. We additionally allocate 25% of the budget in each round to randomly sample arms that triggered different model assertions, uniformly; this is inspired by $\epsilon$-greedy algorithms tokic2011value. This ensures that no contexts (i.e., model assertions) are underexplored as training progresses. Finally, in some cases (e.g., with noisy assertions), it may not be possible to reduce the number of assertions that fire. In this case, BAL will default to random sampling or uncertainty sampling, as specified by the user.
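One round of this selection procedure might look like the sketch below. It is an illustrative reading of the description, not OMG's implementation: the function name `bal_select`, the use of relative reduction as the per-assertion weight, the rank weighting (rank 1 for the lowest severity), and returning `None` to signal the fallback are all assumptions made here. The 25% exploration share is taken from the text.

```python
import random
from typing import Dict, List, Optional


def bal_select(scores: Dict[int, List[float]], prev_fired: Optional[List[int]],
               budget: int, rng: random.Random,
               explore_frac: float = 0.25) -> Optional[List[int]]:
    """scores[i][m] is the severity of assertion m on point i (0 = abstain).

    Returns the indices to label, or None when no assertion fired less often
    than in the previous round (the caller then uses random/uncertainty sampling).
    """
    if not scores:
        return []
    num_assertions = len(next(iter(scores.values())))
    cur_fired = [sum(1 for s in scores.values() if s[m] > 0)
                 for m in range(num_assertions)]
    if prev_fired is None:
        reductions = [1.0] * num_assertions          # first round: uniform over assertions
    else:
        reductions = [max(0.0, (p - c) / p) if p > 0 else 0.0
                      for p, c in zip(prev_fired, cur_fired)]
        if not any(r > 0 for r in reductions):
            return None
    n_explore = int(explore_frac * budget)           # share of budget sampled uniformly
    selected: List[int] = []
    for pick in range(budget):
        weights = [1.0] * num_assertions if pick < n_explore else reductions
        # Unselected points flagged by each assertion, sorted by increasing severity.
        candidates = {
            m: sorted((i for i, s in scores.items() if s[m] > 0 and i not in selected),
                      key=lambda i: scores[i][m])
            for m in range(num_assertions)
        }
        usable = [m for m in candidates if candidates[m] and weights[m] > 0]
        if not usable:
            break
        chosen = rng.choices(usable, weights=[weights[m] for m in usable])[0]
        ranks = list(range(1, len(candidates[chosen]) + 1))
        selected.append(rng.choices(candidates[chosen], weights=ranks)[0])
    return selected


# Two assertions over four points; point 3 triggers nothing.
example_scores = {0: [1.0, 0.0], 1: [2.0, 0.0], 2: [0.0, 1.0], 3: [0.0, 0.0]}
first_round = bal_select(example_scores, None, budget=2, rng=random.Random(0))
later_round = bal_select(example_scores, [4, 1], budget=4, rng=random.Random(0))
no_progress = bal_select(example_scores, [2, 1], budget=4, rng=random.Random(0))
```

In the later round only the first assertion fired less often than before (4 to 2), so apart from the exploration pick, all selections come from the points it flags.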

4 Consistency Assertions and Weak Supervision

Although developers can write arbitrary Python functions as model assertions in OMG, we found that many assertions can be specified using an even simpler, high-level abstraction that we called consistency assertions. This interface allows OMG to generate multiple Boolean model assertions from a high-level description of the model’s output, as well as automatic correction rules that propose new labels for data that fail the assertion to enable weak supervision.

The key idea of consistency assertions is to specify which attributes of a model’s output are expected to match across many invocations to the model. For example, consider a TV news application that tries to locate faces in TV footage and then identify their name and gender (one of the real-world applications we discussed in §2.2). The ML developer may wish to assert that, within each video, each person should consistently be assigned the same gender, and should appear on the screen at similar positions on most nearby frames. Consistency assertions let developers specify such requirements by providing two functions:

  • An identification function that returns an identifier for each model output. For example, in our TV application, this could be the person’s name as identified by the model.

  • An attributes function that returns a list of named attributes expected to be consistent for each identifier. In our example, this could return the gender attribute.

Given these two functions, OMG generates multiple Boolean assertions that check whether the various attributes of outputs with a common identifier match. In addition, it generates correction rules that can replace an inconsistent attribute with a guess at that attribute’s value based on other instances of the identifier (we simply use the most common value). By running the model and these generated assertions over unlabeled data, OMG can thus automatically generate weak labels for data points that do not satisfy the consistency assertions. Notably, OMG provides another way of producing labels for training that is complementary to human-generated labels and other sources of weak labels. OMG is especially suited for unstructured sources, e.g., video. We show in §5 that these weak labels can automatically increase model quality.
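The two-function interface and the most-common-value correction rule can be sketched as follows. The function name `weak_labels`, the dictionary-shaped model outputs, and the tuple format of the proposed corrections are hypothetical choices for illustration; only the idea of grouping by identifier and proposing the most common attribute value comes from the text.

```python
from collections import Counter, defaultdict
from typing import Any, Callable, Dict, Hashable, List, Tuple


def weak_labels(outputs: List[Any],
                id_fn: Callable[[Any], Hashable],
                attrs_fn: Callable[[Any], Dict[str, Any]]) -> List[Tuple[int, str, Any]]:
    """Return (output_index, attribute, proposed_value) for every output whose
    attribute disagrees with the most common value for its identifier."""
    by_id: Dict[Hashable, List[Tuple[int, Dict[str, Any]]]] = defaultdict(list)
    for idx, out in enumerate(outputs):
        by_id[id_fn(out)].append((idx, attrs_fn(out)))

    corrections: List[Tuple[int, str, Any]] = []
    for members in by_id.values():
        keys = sorted({key for _, attrs in members for key in attrs})
        for key in keys:
            values = Counter(attrs[key] for _, attrs in members if key in attrs)
            majority, _ = values.most_common(1)[0]
            corrections += [(idx, key, majority) for idx, attrs in members
                            if key in attrs and attrs[key] != majority]
    return corrections


# Hypothetical TV news outputs: one detection of host_a has an inconsistent gender.
detections = [
    {"name": "host_a", "gender": "F"},
    {"name": "host_a", "gender": "F"},
    {"name": "host_a", "gender": "M"},
    {"name": "host_b", "gender": "M"},
]
proposed = weak_labels(detections,
                       id_fn=lambda o: o["name"],
                       attrs_fn=lambda o: {"gender": o["gender"]})
```

A non-empty result doubles as the Boolean assertion: the consistency assertion fails exactly on the outputs that receive a proposed correction.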

4.1 API Details

The consistency assertions API supports ML applications that run over multiple inputs and produce zero or more outputs for each input. For example, each output could be an object detected in a video frame. The user provides two functions over outputs $o$:

  • $\mathrm{Id}(o)$ returns an identifier for the output $o$, which is simply an opaque value.

  • $\mathrm{Attrs}(o)$ returns zero or more attributes for the output $o$, which are key-value pairs.

In addition to checking attributes, we found that many applications also expect their identifiers to appear in a “temporally consistent” fashion, where objects do not disappear and reappear too quickly. For example, one would expect cars identified in the video to stay on the screen for multiple frames instead of “flickering” in and out in most cases. To express this expectation, developers can provide a temporal consistency threshold, $T$, which specifies that each identifier should not appear or disappear for intervals less than $T$ seconds. For example, we might set $T$ to one second for TV footage that frequently cuts across frames, or 30 seconds for an activity classification algorithm that distinguishes between walking and biking. The full API for adding a consistency assertion is therefore AddConsistencyAssertion(Id, Attrs, T), taking the identification function, the attributes function, and the threshold.

Examples. We briefly describe how one can use consistency assertions in several ML tasks motivated in §2.2:

Face identification in TV footage: This application uses multiple ML models to detect faces in images, match them to identities, classify their gender, and classify their hair color. We can use the detected identity as our identifier and gender/hair color as attributes.

Video analytics for traffic cameras: This application aims to detect vehicles in video street traffic, and suffers from problems such as flickering or changing classifications for an object. The model’s output is bounding boxes with classes on each frame. Because we lack a globally unique identifier (e.g., license plate number) for each object, we can assign a new identifier for each box that appears and assign the same identifier as it persists through the video. We can treat the class as an attribute and set the temporal consistency threshold $T$ as well to detect flickering.

Heart rhythm classification from ECGs: In this application, domain experts informed us that atrial fibrillation heart rhythms need to persist for at least 30 seconds to be considered a problem. We used the detected class as our identifier and set the temporal consistency threshold $T$ to 30 seconds.

4.2 Generating Assertions and Labels from the API

Given the identification function, the attributes function, and the temporal consistency threshold $T$, OMG automatically generates Boolean assertions to check for matching attributes and to check that when an identifier appears in the data, it persists for at least $T$ seconds. These assertions are treated the same as user-provided ones in the rest of the system.

OMG also automatically generates corrective rules that propose a new label for outputs that do not match their identifier’s other outputs on an attribute. The default behavior is to propose the most common value of that attribute (e.g., the class detected for an object on most frames), but users can also provide a function to suggest an alternative based on all of that object’s outputs.

For temporal consistency constraints via the threshold $T$, OMG will assert by default that at most one transition can occur within a $T$-second window; this can be overridden. For example, an identifier appearing is valid, but an identifier appearing, disappearing, then appearing is invalid. If a violation occurs, OMG will propose to remove, modify, or add predictions. In the latter case, OMG needs to know how to generate an expected output on an input where the object was not identified (e.g., frames where the object flickered out in Figure 1). OMG requires the user to provide a function to cover this case, since it may require domain specific logic, e.g., averaging the locations of the object on nearby video frames.
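The default temporal rule and a user-supplied fill-in function can be sketched as below. The representation is an assumption made for illustration: an identifier's history is given as parallel lists of timestamps and presence flags, and the fill-in function simply averages the box coordinates from the frames before and after the gap. The function names are hypothetical and not part of OMG.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)


def transition_times(times: List[float], present: List[bool]) -> List[float]:
    """Timestamps at which the identifier appears or disappears."""
    return [t for t, prev, cur in zip(times[1:], present, present[1:]) if prev != cur]


def violates_temporal_consistency(times: List[float], present: List[bool],
                                  window_s: float) -> bool:
    """True if two transitions occur less than `window_s` seconds apart."""
    flips = transition_times(times, present)
    return any(later - earlier < window_s for earlier, later in zip(flips, flips[1:]))


def interpolate_box(before: Box, after: Box) -> Box:
    """User-provided fill-in: average the boxes from the neighboring frames."""
    x0, y0, x1, y1 = ((b + a) / 2 for b, a in zip(before, after))
    return (x0, y0, x1, y1)


# The object is seen, missed, then seen again within 0.2 s: a flicker.
flicker = violates_temporal_consistency([0.0, 0.1, 0.2], [True, False, True], window_s=1.0)
# The object appears once and stays: valid.
appear = violates_temporal_consistency([0.0, 0.1, 0.2], [False, True, True], window_s=1.0)
proposed_box = interpolate_box((0, 0, 10, 10), (2, 2, 12, 12))
```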

5 Evaluation

5.1 Experimental Setup

We evaluated OMG and model assertions on four diverse ML workloads based on real industrial and academic use-cases: analyzing TV news, video analytics, autonomous vehicles, and medical classification. For each domain, we describe the task, dataset, model, training procedure, and assertions. A summary is given in Table 1.

Task | Model | Assertions
TV news | Custom | Consistency (§4, news)
Object detection (video) | SSD liu2016ssd | Three vehicles should not highly overlap (multibox), identity consistency assertions (flicker and appear)
Vehicle detection (AVs) | Second yan2018second, SSD | Agreement of point cloud and image detections (agree), multibox
AF classification | ResNet rajpurkar2017cardiologist | Consistency assertion within a 30s time window (ECG)
Table 1: A summary of tasks, models, and assertions used in our evaluation.

TV news. Our contacts analyzing TV news provided us 50 hour-long segments that were known to be problematic. They further provided pre-computed boxes of faces, identities, and hair colors; this data was computed from a range of models and sources, including hand-labeling, weak labels, and custom classifiers. We implemented the consistency assertions described in §4. Because we could not access the training code for this domain, we were unable to perform retraining experiments.

Video analytics. Many modern video analytics systems use object detection as a core primitive kang2017noscope; kang2018blazeit; hsieh2018focus; jiang2018chameleon; xu2019vstore; canel2019scaling, in which the task is to localize and classify the objects in a frame of video. We focus on the object detection portion of these systems. We used a ResNet-34 SSD liu2016ssd (henceforth SSD) model pretrained on MS-COCO lin2014microsoft. We deployed SSD for detecting vehicles in the night-street (i.e., jackson) video that is commonly used kang2017noscope; xu2019vstore; canel2019scaling; hsieh2018focus. We used separate days of video for training and testing.

We deployed three model assertions: multibox, flicker, and appear. The multibox assertion fires when three boxes highly overlap (Figure 7, Appendix). The flicker and appear assertions are implemented with our consistency API as described in §4.
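
To make the multibox check concrete, the sketch below counts triples of boxes that mutually overlap. It is an illustration only: the `(x1, y1, x2, y2)` box format, the use of intersection over union as the overlap measure, and the 0.5 threshold are assumptions, not values taken from our implementation.

```python
from itertools import combinations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def multibox(boxes, threshold=0.5):
    """Count triples of boxes in which every pair overlaps above `threshold`.

    A nonzero count means the assertion fires on this frame.
    """
    return sum(
        all(iou(p, q) > threshold for p, q in combinations(triple, 2))
        for triple in combinations(boxes, 3)
    )


# Three near-duplicate detections of one vehicle plus one distant box.
boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (0, 1, 10, 11), (50, 50, 60, 60)]
print(multibox(boxes))  # 1
```

Returning a count rather than a Boolean lets the same function double as a severity score.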

Autonomous vehicles. We studied the problem of object detection for autonomous vehicles using the NuScenes dataset nuscenes2019, which contains labeled LIDAR point clouds and associated visual images. We split the data into separate train, unlabeled, and test splits. We detected vehicles only. We use the open-source Second model with PointPillars yan2018second; lang2019pointpillars for LIDAR detections and SSD for visual detections. We improve SSD via active learning and weak supervision in our experiments.

As NuScenes contains time-aligned point clouds and images, we deployed a custom assertion for 2D and 3D boxes agreeing, and the multibox assertion. We deployed a custom weak supervision rule that imputed boxes from the 3D predictions. While other assertions could have been deployed (e.g., flicker), we found that the dataset was not sampled frequently enough (at 2 Hz) for these assertions.

Medical classification. We studied the problem of classifying atrial fibrillation (AF) via ECG signals. We used a convolutional network that was shown to outperform cardiologists rajpurkar2017cardiologist. Unfortunately, the full dataset used in rajpurkar2017cardiologist is not publicly available, so we used the CINC17 dataset cinc17. CINC17 contains 8,528 data points that we split into train, validation, unlabeled, and test splits.

We consulted with medical researchers and deployed an assertion that the classification should not change between two classes within a 30-second time period (i.e., the assertion fires when the classification changes within 30 seconds), as described in §4.

5.2 Model Assertions can be Written with High Precision and Few LOC

We first asked whether model assertions could be written succinctly. To test this, we implemented the model assertions described above and counted the lines of code (LOC) necessary for each assertion. For the consistency assertions, we counted the LOC for the identity and attribute functions (see Table 1 for a summary of assertions). We counted the LOC with and without the shared helper functions (e.g., computing box overlap); helper functions shared between assertions were counted once for each assertion that uses them. As we show in Table 2, both consistency and domain-specific assertions can be written in under 25 LOC excluding shared helper functions and in at most 60 LOC including helper functions. Thus, model assertions can be written with few LOC.

Assertion | LOC (no helpers) | LOC (inc. helpers)
news | 7 | 39
ECG | 23 | 50
flicker | 18 | 60
appear | 18 | 35
multibox | 14 | 28
agree | 11 | 28
Table 2: Number of lines of code (LOC) for each assertion. Consistency assertions are on the top and custom assertions are on the bottom. All assertions could be written in at most 60 LOC including helper functions, when double counting between assertions. The assertion main body could be written in under 25 LOC in all cases. The helper functions included utilities such as computing the overlap between boxes.
Assertion | Precision (identifier and output) | Precision (model output only)
news | 100% | 100%
ECG | 100% | 100%
flicker | 100% | 96%
appear | 100% | 88%
multibox | N/A | 100%
agree | N/A | 98%
Table 3: Precision of the model assertions we deployed, measured on 50 randomly selected examples. The top are consistency assertions and the bottom are custom assertions. For consistency assertions, we report precision both when counting errors in the ML model outputs only and when counting errors in the identification function and ML model outputs. As shown, model assertions can be written with 88-100% precision across all domains when only counting errors in the model outputs.

We then asked whether model assertions could be written with high precision. To test this, we randomly sampled 50 data points that triggered each assertion and manually checked whether that data point had an incorrect output from the ML model. The consistency assertions return clusters of data points (e.g., appear) and we report the precision for errors in both the identifier and ML model outputs and only the ML model outputs. As we show in Table 3, model assertions achieve at least 88% precision in all cases.

5.3 Model Assertions can Identify High-Confidence Errors

We asked whether model assertions can identify high-confidence errors, or errors where the model returns the wrong output with high confidence. High-confidence errors are important to identify as confidence is used in downstream tasks, such as analytics queries and actuation decisions kang2017noscope; kang2018blazeit; hsieh2018focus; chinchali2019network. Furthermore, sampling solutions that are based on confidence would be unable to identify these errors.

To determine whether model assertions could identify high-confidence errors, we collected the 10 highest-confidence errors for each of the model assertions deployed for video analytics. We then plotted the percentile of the confidence among all the boxes for each error.

Figure 3: Percentile of confidence of the top-10 errors, ranked by confidence, found by OMG for video analytics. The x-axis is the rank of the errors caught by model assertions, ordered by confidence. The y-axis is the percentile of confidence among all the boxes. As shown, model assertions can find errors where the original model has high confidence (94th percentile), allowing them to complement existing confidence-based methods for data selection.

As shown in Figure 3, model assertions can identify errors within the top 94th percentile of boxes by confidence (the flicker confidences were from the average of the surrounding boxes). Importantly, uncertainty-based methods of monitoring would not catch these errors.
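
One way to compute such a percentile is sketched below. The definition used here (the percentage of boxes whose confidence is at most that of the flagged box) and the function name `confidence_percentile` are assumptions for illustration; other percentile conventions would give slightly different numbers.

```python
from bisect import bisect_right


def confidence_percentile(conf, all_confs_sorted):
    """Percentage of boxes whose confidence is <= `conf`.

    `all_confs_sorted` must be sorted in ascending order.
    """
    return 100.0 * bisect_right(all_confs_sorted, conf) / len(all_confs_sorted)


# Toy confidences for all boxes, and one flagged error with confidence 0.5.
all_confs = sorted([0.9, 0.1, 0.5, 0.2])
print(confidence_percentile(0.5, all_confs))  # 75.0
```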

We further show that model assertions can identify errors in human labels, which effectively have a confidence of 1. These results are shown in Appendix E.

5.4 Model Assertions can Improve Model Quality via Active Learning

We evaluated OMG’s active learning capabilities and BAL using the three domains for which we had access to the training code (video analytics, ECG, AVs).

Multiple model assertions. We asked whether multiple model assertions could be used to improve model quality via continuous data collection. We deployed three assertions over night-street and two assertions for NuScenes. We used random sampling, uncertainty sampling with “least confident” settles2009active, uniform sampling from data that triggered assertions, and BAL for the active learning strategies. We used the mAP metric for both datasets, which is widely used for object detection lin2014microsoft; he2017mask. We defer hyperparameters to Appendix C.
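
Two of the baseline strategies can be sketched in a few lines. This is an illustrative sketch, not our experimental code: the dictionary keys `confidence` and `fired` and both function names are hypothetical, and BAL itself is not shown here.

```python
import random


def least_confident(pool, budget):
    """Uncertainty sampling: pick the `budget` points with lowest confidence."""
    return sorted(pool, key=lambda p: p["confidence"])[:budget]


def uniform_from_assertions(pool, budget, rng):
    """Sample uniformly from points that triggered at least one assertion."""
    flagged = [p for p in pool if p["fired"]]
    return rng.sample(flagged, min(budget, len(flagged)))


# Each point records the model confidence and the assertions that fired on it.
pool = [
    {"id": 0, "confidence": 0.90, "fired": []},
    {"id": 1, "confidence": 0.20, "fired": ["flicker"]},
    {"id": 2, "confidence": 0.95, "fired": ["multibox"]},
    {"id": 3, "confidence": 0.40, "fired": []},
]
print([p["id"] for p in least_confident(pool, 2)])  # [1, 3]
print(sorted(p["id"] for p in uniform_from_assertions(pool, 2, random.Random(0))))  # [1, 2]
```

Note that point 2 is flagged by an assertion despite having the highest confidence, so least-confident sampling would select it last.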

(a) Active learning for night-street.
(b) Active learning for NuScenes.
Figure 4: Performance of random sampling, uncertainty sampling, uniform sampling from model assertions, and BAL for active learning. The round is the round of data collection (see §3). As shown in (a), BAL improves accuracy on unseen data and can achieve an accuracy target (62% mAP) with 40% fewer labels compared to random and uncertainty sampling for night-street. BAL also outperforms both baselines for the NuScenes dataset as shown in (b). We show figures with all rounds of active learning in Appendix D.

As we show in Figure 4, BAL outperforms both random sampling and uncertainty sampling on both datasets after the first round, which is required for calibration. BAL also outperforms uniform sampling from model assertions by the last round. For night-street, at a fixed accuracy threshold of 62%, BAL uses 40% fewer labels than random and uncertainty sampling. By the fifth round, BAL outperforms both random sampling and uncertainty sampling by 1.5% mAP. While the absolute change in mAP may seem small, doubling the model depth, which doubles the computational budget, on MS-COCO achieves a 1.7% improvement in mAP (ResNet-50 FPN vs. ResNet-101 FPN) Detectron2018.

These results are expected, as prior work has shown that uncertainty sampling can be unsuited for deep networks sener2017active.

Single model assertion. Due to the limited data quantities for the ECG dataset, we were unable to deploy more than one assertion. Nonetheless, we further asked whether a single model assertion could be used to improve model quality. We ran five rounds of data labeling with 100 examples each round for the ECG dataset. We ran the experiment 8 times and report averages. We show results in Figure 5. As shown, data collection with a single model assertion generally matches or outperforms both uncertainty and random sampling.

Figure 5: Active learning results with a single assertion for the ECG dataset. As shown, with just a single assertion, model-assertion based active learning can match uncertainty sampling and outperform random sampling.

5.5 Model Assertions can Improve Model Quality via Weak Supervision

We used our consistency assertions to evaluate the impact of weak supervision using assertions for the domains we had weak labels for (video analytics, AVs, and ECG).

For night-street, we used 1,000 additional frames with 750 frames that triggered flicker and 250 random frames with a learning rate of  for a total of 6 epochs. For the NuScenes dataset, we used the same 350 scenes to bootstrap the LIDAR model as in the active learning experiments. We trained with 175 scenes of weakly supervised data for one epoch with a learning rate of . For the ECG dataset, we used 1,000 weak labels and the same training procedure as in active learning.

Domain | Pretrained | Weakly supervised
Video analytics (mAP) | 34.4 | 49.9
AVs (mAP) | 10.6 | 14.1
ECG (% accuracy) | 70.7 | 72.1
Table 4: Accuracy of the pretrained and weakly supervised models for video analytics, AV and ECG domains. Weak supervision can improve accuracy with no human-generated labels.

Table 4 shows that model assertion-based weak supervision can improve relative performance by 46.4% for video analytics and 33% for AVs. Similarly, the ECG classification can also improve with no human-generated labels. These results show that model assertions can be useful as a primitive for improving model quality with no additional data labeling.
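
For reference, relative improvement is the change in the metric divided by its pretrained value. The short sketch below applies that formula to the entries of Table 4; the helper name `relative_improvement` is our own for this example. Because Table 4 lists rounded values, figures recomputed from it can differ somewhat from the percentages quoted in the text.

```python
def relative_improvement(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


# (pretrained, weakly supervised) pairs copied from Table 4.
table4 = {
    "Video analytics (mAP)": (34.4, 49.9),
    "AVs (mAP)": (10.6, 14.1),
    "ECG (% accuracy)": (70.7, 72.1),
}
for domain, (pretrained, weak) in table4.items():
    print(f"{domain}: {relative_improvement(pretrained, weak):.1f}%")
```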

6 Related Work

ML QA. A range of existing ML QA tools focus on validating inputs via schemas or tracking performance over time polyzotis2019data; baylor2017tfx. However, these systems apply to situations with meaningful schemas (e.g., tabular data) and ground-truth labels at test time (e.g., predicting click-through rate). While model assertions could also apply to these cases, they also cover situations that do not contain meaningful schemas or labels at test time.

Other ML QA systems focus on training pipelines renggli2019continuous or validating numerical errors odena2018tensorfuzz. These approaches are important at finding pre-deployment bugs, but do not apply to test-time scenarios; they are complementary to model assertions.

White-box testing systems, e.g., DeepXplore pei2017deepxplore, test ML models by taking inputs and perturbing them. However, as discussed, a validation set cannot cover all possibilities in the deployment set. Furthermore, these systems do not give guarantees under model drift.

Since our initial workshop paper kang2018model, several works have extended model assertions arechiga2019better; henzinger2019outside.

Verified ML. Verification has been applied to ML models in simple cases. For example, Reluplex (katz2017reluplex) can verify that extremely small networks will make correct control decisions given a fixed set of inputs and other work has shown that similarly small networks can be verified against minimal perturbations of a fixed set of input images (raghunathan2018certified). However, verification requires a specification, which may not be feasible to implement, e.g., even humans may disagree on certain predictions kirillov2018panoptic. Furthermore, the largest verified networks we are aware of katz2017reluplex; raghunathan2018certified; wang2018formal; sun2019formal are orders of magnitude smaller than the networks we consider.

Software Debugging. Writing correct software and verifying software has a long history, with many proposals from the research community. We hope that many such practices are adopted in deploying machine learning models; we focus on assertions in this work (goldstine1947planning; turing1949checking). Assertions have been shown to reduce the prevalence of bugs, when deployed correctly (kudrjavets2006assessing; mahmood1984executable). There are many other such methods, such as formal verification (klein2009sel4; leroy2009formal; keller1976formal), conducting large-scale testing (e.g., fuzzing) (takanen2008fuzzing; godefroid2012sage), and symbolic execution to trigger assertions (king1976symbolic; cadar2008klee). Probabilistic assertions have been used to verify simple distributional properties of programs, e.g., that differentially private programs return an expected mean sampson2014expressing. However, ML developers may not be able to specify distributions and data may shift in deployment.

Structured Prediction, Inductive Bias. Several ML methods encode structure/inductive biases into training procedures or models (bakir2007predicting; haussler1988quantifying). While promising, designing algorithms and models with specific inductive biases can be challenging for non-experts. Additionally, these methods generally do not contain runtime checks for aberrant behavior.

Weak Supervision, Semi-supervised Learning. Weak supervision leverages higher-level and/or noisier input from human experts to improve model quality (mintz2009distant; ratner2017weak; jin2018unsupervised). In semi-supervised learning, structural assumptions over the data are used to leverage unlabeled data (zhu2011semi). However, to our knowledge, neither of these methods contains runtime checks or is used in model-agnostic active learning methods.

7 Discussion

While we believe model assertions are an important step towards a practical solution for monitoring and continuously improving ML models, we highlight three important limitations of model assertions, which may be fruitful directions for future work.

First, certain model assertions may be difficult to express in our current API. While arbitrary code can be expressed in OMG’s API, certain temporal assertions may be better expressed in a complex event processing language wu2006high. We believe that domain-specific languages for model assertions will be a fruitful area of future research.

Second, we have not thoroughly evaluated model assertions’ performance in real-time systems. Model assertions may add overhead to systems where actuation has tight latency constraints, e.g., AVs. Nonetheless, model assertions can be used over historical data for these systems. We are actively collaborating with an AV company to explore these issues.

Third, certain issues in ML systems, such as bias in training sets, are out of scope for model assertions. We hope that complementary systems, such as TFX baylor2017tfx, can help improve quality in these cases.

8 Conclusion

In this work, we introduced model assertions, a model-agnostic technique that allows domain experts to indicate errors in ML models. We showed that model assertions can be used at runtime to detect high-confidence errors, which prior methods would not detect. We proposed methods to use model assertions for active learning and weak supervision to improve model quality. We implemented model assertions in a novel library, OMG, and demonstrated that they can apply to a wide range of real-world ML tasks, improving monitoring, active learning, and weak supervision for ML models.


This research was supported in part by affiliate members and other supporters of the Stanford DAWN project—Ant Financial, Facebook, Google, Infosys, NEC, and VMware—as well as Toyota Research Institute, Northrop Grumman, Cisco, SAP, and the NSF under CAREER grant CNS-1651570 and Graduate Research Fellowship grant DGE-1656518. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. Toyota Research Institute (“TRI”) provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.

We further acknowledge Kayvon Fatahalian, James Hong, Dan Fu, Will Crichton, Nikos Arechiga, and Sudeep Pillai for their productive discussions on ML applications.


Appendix A Examples of Errors Caught by Model Assertions

(a) Frame 1
(b) Frame 2
Figure 6: Two example frames from the same scene with an inconsistent attribute (the identity) from the TV news use case.
(a) Example error 1.
(b) Example error 2.
Figure 7: Example errors when three boxes highly overlap (see multibox in Section 5). Best viewed in color.
(a) Example error flagged by multibox. SSD predicts three trucks when only one should be detected.
(b) Example error flagged by agree. SSD misses the car on the right and the LIDAR model predicts the truck on the left to be too large.
Figure 8: Examples of errors that the multibox and agree assertions catch for the NuScenes dataset. LIDAR model boxes are in pink and SSD boxes are in green. Best viewed in color.

In this section, we illustrate several errors caught by the model assertions used in our evaluation.

First, we show an example error in the TV news use case in Figure 6. Recall that these assertions were generated with our consistency API (§4). In this example, the identifier is the box’s sceneid and the attribute is the identity.

Second, we show an example error for the video analytics use case in Figure 7 for the multibox assertion. Here, SSD erroneously detects multiple cars when there should be one.

Third, we show two example errors for the AV use case in Figure 8 from the multibox and agree assertions.

Appendix B Classes of Model Assertions

Assertion class description, followed by examples:

Model outputs from multiple sources should agree
  • Verifying human labels (e.g., number of labelers that disagree)
  • Multiple models (e.g., number of models that disagree)

Model outputs from multiple modes of data should agree
  • Multiple sensors (e.g., number of disagreements from LIDAR and camera models)
  • Multiple data sources (e.g., text and images)

Model outputs from multiple views of the same data should agree
  • Video analytics (e.g., results from overlapping views of different cameras should agree)
  • Medical imaging (e.g., different angles should agree)

Physical constraints on model outputs
  • Video analytics (e.g., cars should not flicker)
  • Earthquake detection (e.g., earthquakes should appear across sensors in physically consistent ways)
  • Protein-protein interaction (e.g., number of overlapping atoms)

Scenarios that are unlikely to occur
  • Video analytics (e.g., maximum confidence of 3 vehicles that highly overlap)
  • Text generation (e.g., two of the same word should not appear sequentially)

Inserting certain types of data should not modify model outputs
  • Visual analytics (e.g., synthetically adding a car to a frame of video should be detected as a car)
  • LIDAR detection (e.g., similar to visual analytics)

Replacing parts of the input with similar data should not modify model outputs
  • Sentiment analysis (e.g., classification should not change with synonyms)
  • Object detection (e.g., painting objects different colors should not change the detection)

Adding noise should not modify model outputs
  • Image classification (e.g., small Gaussian noise should not affect classification)
  • Time series (e.g., small Gaussian noise should not affect time series classification)

Inputs should conform to a schema
  • Boolean features should not have inputs that are not 0 or 1
  • All features should be present

Table 5: Examples of model assertions. We describe several assertion classes, sub-classes, and concrete instantiations of each class. In parentheses, we describe a potential severity score or an application.

We present a non-exhaustive list of common classes of model assertions in Table 5 and below. Namely, we describe how one might look for assertions in other domains.

Our taxonomization is not exact and several examples will contain features from several classes of model assertions. Prior work on schema validation polyzotis2019data; baylor2017tfx and data augmentation wang2017effectiveness; taylor2017improving can be cast in the model assertion framework. As these have been studied, we do not focus on these classes of assertions in this work.

Consistency assertions. An important class of model assertions checks the consistency across multiple models or sources of data. The multiple sources of data could be the output of multiple ML models on the same data, multiple sensors, or multiple views of the same data. The output from the various sources should agree and consistency model assertions specify this constraint. These assertions can be generated via our API as described in §4.

Domain knowledge assertions. In many physical domains, domain experts can express physical constraints or unlikely scenarios. As an example of a physical constraint, when predicting how proteins will interact, atoms should not physically overlap. As an example of an unlikely scenario, boxes of the visible part of cars should not highly overlap (Figure 7). In particular, model assertions of unlikely scenarios may not be 100% precise, i.e., will be soft assertions.

Perturbation assertions. Many domains contain input and output pairs that can be perturbed (perhaps jointly) such that the output does not change. These perturbations have been widely studied through the lens of data augmentation wang2017effectiveness; taylor2017improving and adversarial examples goodfellow2015explaining; athalye2018synthesizing.
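
A noise-based perturbation assertion might look like the following sketch. It assumes a hypothetical `model` callable that maps a list of floats to a class label; the noise scale, number of trials, and function name are likewise illustrative choices rather than values from our experiments.

```python
import random


def noise_assertion(model, x, sigma=0.01, trials=5, seed=0):
    """Fraction of noisy copies of `x` whose prediction differs from model(x).

    A nonzero return value means the assertion fires; the fraction can
    serve as a severity score.
    """
    rng = random.Random(seed)
    base = model(x)
    changed = sum(
        model([v + rng.gauss(0.0, sigma) for v in x]) != base
        for _ in range(trials)
    )
    return changed / trials


def toy_model(x):
    """Stand-in classifier for this sketch: 1 if the features sum above zero."""
    return int(sum(x) > 0)


# The input is far from the decision boundary, so small noise changes nothing.
print(noise_assertion(toy_model, [1.0, 2.0]))  # 0.0
```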

Input validation assertions. Domains that contain schemas for the input data can have model assertions that validate the input data based on the schema polyzotis2019data; baylor2017tfx. For example, boolean inputs that are encoded with integral values (i.e., 0 or 1) should never be negative. This class of assertions is an instance of preconditions for ML models.
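
As a sketch of such a precondition, the function below checks one input record against a small schema. The schema format (feature name mapped to a kind string) and the function name `validate_input` are assumptions made for this example and are not tied to any particular validation library.

```python
def validate_input(record, schema):
    """Return a list of schema violations for one input record.

    `schema` maps each feature name to a kind; only the kind "bool"
    (encoded as 0 or 1) is checked for its value in this sketch.
    """
    errors = []
    for name, kind in schema.items():
        if name not in record:
            errors.append(f"missing feature: {name}")
        elif kind == "bool" and record[name] not in (0, 1):
            errors.append(f"{name} must be 0 or 1, got {record[name]!r}")
    return errors


schema = {"clicked": "bool", "age": "float"}
print(validate_input({"clicked": 1, "age": 3.0}, schema))  # []
print(validate_input({"clicked": -1}, schema))
```

The second call reports both a Boolean feature outside {0, 1} and a missing feature.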

Appendix C Hyperparameters

Hyperparameters for active learning experiments. For night-street, we used 300,000 frames of one day of video for the training and unlabeled data. We sampled 100 frames per round for five rounds and used 25,000 frames of a different day of video for the test set. Due to the cost of obtaining labels, we ran each trial twice.

For the NuScenes dataset, we used 350 scenes to bootstrap the LIDAR model, 175 scenes for unlabeled/training data for SSD, and 75 scenes for validation (out of the original 850 labeled scenes). We trained for one epoch at a learning rate of . We ran 8 trials.

For the ECG dataset, we trained for 5 rounds of active learning with 100 samples per round. We used a learning rate of 0.001 until the loss plateaued, as in the original training code.

(a) Active learning for night-street.
(b) Active learning for NuScenes.
Figure 9: Performance of random sampling, uncertainty sampling, uniform sampling from model assertions, and BAL for active learning. The round is the round of data collection (see §3). As shown, BAL improves accuracy on unseen data and can achieve the same accuracy (62% mAP) as random sampling with 40% fewer labels for night-street. BAL also outperforms both baselines for the NuScenes dataset.

Appendix D Full Active Learning Figures

We show active learning results for all rounds in Figure 9.

Appendix E Model Assertions can Identify Errors in Human Labels

We further asked whether model assertions could be used to identify errors in human-generated labels, i.e., where a human is acting as an “ML model.” While verification of human labels has been studied in the context of crowd-sourcing hirth2013analyzing; tran2013efficient, several production labeling services (e.g., Scale scale) do not provide annotator identification, which is necessary to perform this verification. We deployed a model assertion in which we tracked objects across frames of a video using an automated method and verified that the same object in different frames had the same label.

Description | Number
All labels | 469
Errors | 32
Errors caught | 4
Table 6: Number of labels, errors, and errors caught from model assertions for Scale-annotated images for the video analytics task. As shown, model assertions caught 12.5% of the errors in this data.

We obtained labels for 1,000 random frames from night-street from Scale AI scale, which is used by several autonomous vehicle companies. Table 6 summarizes our results. Scale returned 469 boxes, which we manually verified for correctness. There were no localization errors, but there were 32 classification errors, of which the model assertion caught 12.5%. Thus, we see that model assertions can also be used to verify human labels.