MLDemon: Deployment Monitoring for Machine Learning Systems

04/28/2021 ∙ by Antonio Ginart, et al. ∙ Harvard University, Stanford University

Post-deployment monitoring of the performance of ML systems is critical for ensuring reliability, especially as new user inputs can differ from the training distribution. Here we propose a novel approach, MLDemon, for ML DEployment MONitoring. MLDemon integrates both unlabeled features and a small amount of on-demand labeled examples over time to produce a real-time estimate of the ML model's current performance on a given data stream. Subject to budget constraints, MLDemon decides when to acquire additional, potentially costly, supervised labels to verify the model. On temporal datasets with diverse distribution drifts and models, MLDemon substantially outperforms existing monitoring approaches. Moreover, we provide theoretical analysis to show that MLDemon is minimax rate optimal up to logarithmic factors and is provably robust against broad distribution drifts whereas prior approaches are not.






1 Introduction

As machine learning (ML) automation permeates increasingly varied domains, engineers find that deploying, monitoring, and managing the model lifecycle increasingly dominate the technical costs of ML systems [zaharia2018accelerating, kumar2017data]. Furthermore, classical ML relies on the assumption that training data is collected from the same distribution as data encountered in deployment [friedman2001elements, abu2012learning]. However, this assumption is increasingly shown to be fragile or false in many real-world applications [koh2020wilds]. Thus, even when ML models achieve expert-level performance in laboratory conditions, many automated systems still require deployment monitoring capable of alerting engineers to unexpected behavior due to distribution shift in the deployment data.

Figure 1: A schematic for the deployment monitoring workflow. For example, an ML system could be deployed to help automate content moderation on a social media platform. In real time, a trained model determines if a post or tweet should result in a ban. A human expert content moderator can review whether the content has been correctly classified by the model, though this review is expensive. A deployment monitoring (demon) policy prioritizes expert attention by determining when tweets get forwarded to the expert for labeling. The demon policy also estimates how well the model is performing at all times during deployment.

When ground-truth labels are not readily available at deployment time, which is often the case since labels are expensive, the most common solution is to use an unsupervised anomaly detector that is purely feature-based [lu2018learning, rabanser2018failing]. In some cases, these detectors work well. However, they may also fail catastrophically, since model accuracy can fall precipitously without any detectable change in the features alone. This can happen in one of two ways. First, for high-dimensional data, feature detectors simply lack a sufficient number of samples to detect all covariate drifts. Second, it is possible that drift only occurs in the conditional $P(Y \mid X)$, which can by construction never be detected without supervision. One potential approach is proposed in [yu2018request]. The policy proposed in [yu2018request] applies statistical tests to estimate a change in distribution in the features and requests expert labels only when such a change is detected. While it is natural to assume that distribution drift in features should be indicative of a drift in the model's accuracy, in reality feature drift is neither necessary nor sufficient as a predictor of accuracy drift (as described above). In fact, we find that unsupervised anomaly detectors are often brittle and are not highly reliable. Thus, any monitoring policy that only triggers supervision from feature-based anomaly detection can fail both silently and catastrophically.

In this work, we focus on a setting where an automated deployment monitoring policy can query experts for labels during deployment (Fig. 1). The goal of the policy is to estimate the model's real-time accuracy throughout deployment while querying as few expert labels as possible. Of course, these two objectives are in direct tension. Thus, we seek to design a policy that can effectively prioritize expert attention at key moments.


In this paper, we formulate ML deployment monitoring as an online decision problem and propose a principled adaptive deployment monitoring policy, MLDemon, that substantially improves over prior art both empirically and in its theoretical guarantees. We summarize our primary contributions below.

(1) Our new formulation is tractable and captures the key trade-off between monitoring cost and risk.

(2) Our proposed adaptive monitoring policy, MLDemon, is minimax rate optimal up to logarithmic factors. Additionally, MLDemon is provably robust to broad types of distribution drifts.

(3) We empirically validate our policy across diverse, real timeseries data streams. Our experiments reveal that feature-based anomaly detectors can be brittle with respect to real distribution shifts and that MLDemon simultaneously provides robustness to errant detectors while reaping the benefits of informative detectors.

2 Problem Formulation

We consider a novel online streaming setting [munro1980selection, karp1992line] where for each time point $t = 1, \dots, T$, the data point $X_t$ and the corresponding label $Y_t$ are generated from a distribution $P_t$ that may vary over time: $(X_t, Y_t) \sim P_t$. For a given model $f$, let $\mu_t = P_t(f(X_t) = Y_t)$ denote its accuracy at time $t$. The total time $T$ can be understood as the life-cycle of the model as measured by the number of user queries. In addition, we assume that we have an anomaly detector $g$, which can depend on both present and past observations and is potentially informative of the accuracy $\mu_t$. For example, the detector can quantify the distributional shift of the feature stream, and a large drift may imply a deterioration of the model accuracy.

We consider scenarios where high-quality labels are costly to obtain and are only available upon request from an expert. Therefore, we wish to monitor the model performance over time while obtaining the minimum number of labels. We consider two settings that are common in machine learning deployments: 1) point estimation of the model accuracy across all time points (estimation problem), and 2) determining whether the model's current accuracy is above or below a user-specified threshold (decision problem).

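At time $t$, the policy receives a data point $X_t$ and submits a pair of actions $(a_t, \hat{\mu}_t)$, where $a_t \in \{0, 1\}$ denotes whether or not to query for an expert label on $X_t$ and $\hat{\mu}_t$ is the estimate of the model's current accuracy $\mu_t$. We use $o_t$ to denote the observed prediction outcome, whose value is $\mathbb{1}\{f(X_t) = Y_t\}$ if the policy asks for $Y_t$ (namely, $a_t = 1$) and $\varnothing$ otherwise. We only consider a class of natural policies that, for the decision problem, predict the accuracy to be above the threshold $\rho$ if $\hat{\mu}_t > \rho$ and vice versa.

We wish to balance two types of costs: the average number of queries $Q = \frac{1}{T} \sum_{t=1}^{T} a_t$ and the monitoring risk $R$. In the estimation problem, without loss of generality, we consider the mean absolute error (MAE) for the monitoring risk:

$$R_{\mathrm{MAE}} = \frac{1}{T} \sum_{t=1}^{T} \left| \mu_t - \hat{\mu}_t \right| \tag{1}$$

In the decision problem, we consider a binary version and a continuous version for the monitoring risk:

$$R_{\mathrm{binary}} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\{ \mathbb{1}\{\mu_t > \rho\} \neq \mathbb{1}\{\hat{\mu}_t > \rho\} \} \tag{2}$$

$$R_{\mathrm{hinge}} = \frac{1}{T} \sum_{t=1}^{T} \left| \mu_t - \rho \right| \cdot \mathbb{1}\{ \mathbb{1}\{\mu_t > \rho\} \neq \mathbb{1}\{\hat{\mu}_t > \rho\} \} \tag{3}$$

where we note that the summand in Eq. (2) is $1$ if the predicted accuracy $\hat{\mu}_t$ and the true accuracy $\mu_t$ incur different decisions when compared to the threshold $\rho$. We use $R$ to denote the monitoring risk in general when there is no need to distinguish between the risk functions. Therefore, the combined loss can be written as:

$$L = cQ + R \tag{4}$$

where $c$ indicates the cost per label query and controls the trade-off between the two types of loss. Our goal is to design a policy to minimize the expected loss $\mathbb{E}[L]$.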
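
To make the two costs concrete, the following minimal sketch computes the amortized query rate, the three monitoring risks, and the combined loss from recorded sequences. The function names are illustrative. Two forms are assumptions rather than definitions taken from the text: the hinge risk is taken to weight each wrong decision by the true accuracy's distance to the threshold, and the combined loss is taken to be the cost per label times the query rate plus the risk.

```python
from typing import Sequence


def query_rate(queried: Sequence[int]) -> float:
    """Fraction of rounds in which an expert label was requested."""
    return sum(queried) / len(queried)


def mae_risk(true_acc: Sequence[float], est_acc: Sequence[float]) -> float:
    """Mean absolute error between true and estimated accuracy."""
    return sum(abs(m - e) for m, e in zip(true_acc, est_acc)) / len(true_acc)


def binary_risk(true_acc, est_acc, threshold: float) -> float:
    """Fraction of rounds where estimate and truth fall on opposite sides of the threshold."""
    wrong = sum((m > threshold) != (e > threshold) for m, e in zip(true_acc, est_acc))
    return wrong / len(true_acc)


def hinge_risk(true_acc, est_acc, threshold: float) -> float:
    """Continuous variant (assumed form): wrong decisions weighted by |true - threshold|."""
    total = sum(
        abs(m - threshold)
        for m, e in zip(true_acc, est_acc)
        if (m > threshold) != (e > threshold)
    )
    return total / len(true_acc)


def combined_loss(risk: float, rate: float, cost_per_label: float) -> float:
    """Assumed form of the combined loss: label cost times query rate, plus risk."""
    return cost_per_label * rate + risk
```

A larger cost per label pushes the minimizer of the combined loss toward policies that query less and tolerate more risk.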

Assumption on Distributional Drift

We are especially interested in settings for which the distribution $P_t$ varies in time. Without any assumption about how $P_t$ changes over time, it is impossible to guarantee that any labeling strategy achieves reasonable performance. Fortunately, many real-world data drifts tend to be gradual. Many ML systems can process hundreds or thousands of user queries per hour, while many real-world data drifts take place over days or weeks. Motivated by this, we consider distribution drifts that are Lipschitz-continuous [o2006metric] over time in total variation [villani2008optimal]: $d_{\mathrm{TV}}(P_t, P_{t+1}) \le \Delta$. (If $P_t$ is $\Delta$-Lipschitz in total variation, then it follows that $\mu_t$ is $\Delta$-Lipschitz in absolute value: $|\mu_t - \mu_{t+1}| \le \Delta$. This gives a natural interpretation to $\Delta$ in terms of controlling the maximal change in model accuracy over time.) The $\Delta$-Lipschitz constraint captures that the distribution shift must happen in a way that is controlled over time, which is a natural assumption for many cases. The magnitude of $\Delta$ captures the inherent difficulty of the monitoring problem. All instances are $1$-Lipschitz because $d_{\mathrm{TV}} \le 1$. When $\Delta = 0$, we are certain that no drift can occur, and thus do not need to monitor at deployment at all. When $\Delta$ is small, the deployment is easier to monitor because the stream drifts slowly. For our theory, we focus on asymptotic regret in terms of $\Delta$, amortized over the length of the deployment $T$. While our theoretical analysis relies on the $\Delta$-Lipschitz assumption, our algorithms do not require it to work well empirically.

Feature-Based Anomaly Detection

We assume that our policy has access to a feature-based anomaly detector $g$ that computes an anomaly signal from the online feature stream, namely $X_1, \dots, X_t$. We let $s_t$ denote the anomaly detection signal at time $t$: $s_t = g(X_1, \dots, X_t)$. Signal $s_t$ captures the magnitude of the anomaly in the feature stream such that $s_t = 0$ indicates that no feature drift is detected and large $s_t$ indicates that feature drift is likely or significant. The design of the specific anomaly detector is usually domain-specific and is out of scope of the current work (see [rabanser2018failing, yu2018request, wang2019drifted, pinto2019automatic, kulinski2020feature, xuan2020bayesian] for recent examples). At a high level, most detectors first apply some type of dimensionality reduction to the raw features to produce summary statistics. Then, they apply a statistical test to the summary statistics, such as a KS-test [lilliefors1967kolmogorov], to generate a drift $p$-value. This drift $p$-value can be interpreted as an anomaly signal. Common summary statistics include embedding layers of deep models or even just the model confidence. In the case of ML APIs, only the confidence score is typically available [chen2020frugalml]. Our MLDemon framework is compatible with any anomaly detector.

3 Algorithms

We present MLDemon along with two baselines. The first baseline, Periodic Querying (PQ), is a simple non-adaptive policy that periodically queries according to a predetermined cyclical schedule. The second baseline, Request-and-Reverify (RR) [yu2018request], is the state of the art for our problem. All of the policies run in constant space and amortized constant time, an important requirement for a scalable long-term monitoring system. Intuitively, for the deployment monitoring problem, adaptive policies can adjust the sampling rate based on the anomaly scores from the feature-based detector. PQ is non-adaptive, while RR adapts only to the anomaly score. MLDemon, in comparison, uses both anomaly information and some infrequent surveillance labeling to improve performance.

Periodic Querying

PQ works for both the estimation problem and the decision problem. As shown in Alg. 1, given a budget $B$ for the average number of queries per round, PQ periodically queries for a batch of $n$ labels every $n/B$ rounds, and uses the estimate from the current batch of labels for the entire period. (Another possible variant of this policy queries once every $1/B$ rounds and combines the previous $n$ labels. When $\Delta$ is known or upper bounded, we may instead set the query rate to guarantee some worst-case monitoring risk.)

Inputs: At each time $t$, observed outcome $o_t$

Outputs: At each time $t$, $(a_t, \hat{\mu}_t)$

Hyperparameters: Window length $n$, query budget $B$

  1. Query ($a_t = 1$) for $n$ consecutive labels and then do not query ($a_t = 0$) for $n/B - n$ rounds

  2. Compute $\hat{\mu}_t$ from most recent $n$ labels as empirical mean

Algorithm 1 Periodic Querying (PQ)
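
As an illustration of Alg. 1, here is a minimal Python sketch of a periodic querying policy. The class name, method names, and the `initial_estimate` used before the first batch completes are hypothetical choices, not part of the original algorithm. The sketch queries for `window` consecutive rounds at the start of every period of `window / budget` rounds and reports the mean of the latest completed batch.

```python
from typing import List, Optional


class PeriodicQuerying:
    """Non-adaptive policy: query a batch of labels on a fixed cyclical schedule."""

    def __init__(self, window: int, budget: float, initial_estimate: float = 1.0):
        if not 0 < budget <= 1:
            raise ValueError("budget must be an average query rate in (0, 1]")
        self.window = window
        # One batch of `window` queries per period gives an average rate of `budget`.
        self.period = max(window, round(window / budget))
        self.step = 0
        self.batch: List[int] = []
        self.estimate = initial_estimate  # assumption: placeholder until the first batch

    def should_query(self) -> bool:
        """True during the first `window` rounds of each period."""
        return self.step % self.period < self.window

    def update(self, outcome: Optional[int]) -> float:
        """Record the outcome (1 correct, 0 wrong, None if not queried); return the estimate."""
        if outcome is not None:
            self.batch.append(outcome)
            if len(self.batch) == self.window:
                self.estimate = sum(self.batch) / self.window
                self.batch = []
        self.step += 1
        return self.estimate
```

In each round the caller first asks `should_query()`, obtains an expert label only if it returns `True`, and then calls `update(...)` with the outcome or `None`.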

Request-and-Reverify

RR sets a threshold $\phi$ for the anomaly signal and queries for a batch of $n$ labels whenever the predetermined threshold is exceeded by the anomaly signal $s_t$. As for the anomaly detector, RR applies a statistical test on a sliding window of model confidence scores. Threshold $\phi$ is directly compared to the $p$-value of the statistical test. By varying the threshold $\phi$, RR can vary the number of labels queried in a deployment. While training data can be used to calibrate the threshold, in our theoretical analysis we show that for any $\phi$, RR cannot provide a non-trivial worst-case guarantee for monitoring risk, regardless of the choice of anomaly detector.

Inputs: At each time $t$, observed outcome $o_t$, anomaly signal $s_t$

Outputs: At each time $t$, $(a_t, \hat{\mu}_t)$

Hyperparameters: Window length $n$, anomaly threshold $\phi$

  if $s_t > \phi$ then
     (1) Query ($a_t = 1$) for $n$ consecutive labels
     (2) Compute $\hat{\mu}_t$ from most recent $n$ observed outcomes as empirical mean
  else
     Do not query for labels ($a_t = 0$) and keep $\hat{\mu}_t$ fixed
  end if
Algorithm 2 Request-and-Reverify [yu2018request] (RR)
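
A minimal Python sketch of a request-and-reverify style policy follows. It is an illustration under stated assumptions, not the reference implementation from [yu2018request]: the anomaly signal is assumed to be a score where larger means more anomalous (for instance, one minus a drift p-value), triggers that arrive while a batch is still being collected are ignored, and all names are hypothetical.

```python
from typing import List, Optional


class RequestAndReverify:
    """Query a batch of labels only when the anomaly signal exceeds a threshold."""

    def __init__(self, window: int, threshold: float, initial_estimate: float = 1.0):
        self.window = window
        self.threshold = threshold
        self.remaining = 0  # labels still to collect in the current batch
        self.batch: List[int] = []
        self.estimate = initial_estimate  # assumption: e.g. validation accuracy

    def should_query(self, anomaly_signal: float) -> bool:
        """Start a new batch on a trigger; keep querying until the batch is full."""
        if self.remaining == 0 and anomaly_signal > self.threshold:
            self.remaining = self.window
            self.batch = []
        return self.remaining > 0

    def update(self, outcome: Optional[int]) -> float:
        """Record the outcome (None if not queried) and return the current estimate."""
        if outcome is not None:
            self.batch.append(outcome)
            self.remaining -= 1
            if self.remaining == 0:
                self.estimate = sum(self.batch) / len(self.batch)
        return self.estimate
```

Note that the estimate never changes unless the detector fires, which is exactly the failure mode discussed above when accuracy drifts without a visible change in the features.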

3.1 MLDemon

Figure 2: A schematic highlighting the main components of the MLDemon policy on a facial recognition data stream example (FACE-R). At some time $t$, there is a sudden shift towards public masking in the population (for example, due to an unexpected pandemic). This makes facial recognition more difficult, especially because prior systems were trained on unmasked faces. Step (1) computes anomaly signals for the data stream. Step (2) computes an empirical histogram over the signal collected over time, and performs quantile normalization. Step (3) sets the policy's final query rate by scaling the normalized anomaly score onto a range of allowed query rates. The minimum and maximum query rates can be determined from the worst-case risk tolerance and the total budget for querying expert labels.

MLDemon consists of three steps (Fig. 2). First, an anomaly score $s_t$ is computed for the data point at time $t$. From our vantage point, the first step is computed by a black-box routine, as discussed in previous sections. Second, the quantile of $s_t$ is determined from the histogram of all previous scores $s_1, \dots, s_{t-1}$. This quantile tells us how anomalous the $t$-th score is compared to what we have previously observed. Finally, the normalized anomaly score is mapped onto a labeling rate (more anomalous scores get more frequent labeling and vice versa). The upper and lower ends of the range of labeling rates are determined by the label query budget and monitoring risk tolerance, respectively. We describe steps two and three in more detail below in Alg. 3 and Alg. 4.

Normalization and Scaling

See Alg. 3 for a code sketch of the quantile normalization step. The key decision is the range onto which we map our quantiles, denoted $[\alpha_{\min}, \alpha_{\max}]$. A large (small) range means that the anomaly score has more (less) modulation on the labeling period. The quantile-normalized anomaly score is linearly scaled onto $[\alpha_{\min}, \alpha_{\max}]$ so that, for example, quantile 0 (low anomaly) maps to one endpoint of the range.

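Inputs: Anomaly signal $s_t$, signal history $s_1, \dots, s_{t-1}$

Outputs: Modulation factor $\alpha_t$

Hyperparameters: Interval $[\alpha_{\min}, \alpha_{\max}]$ for $\alpha_t$

   //Get percentile for $s_t$ by evaluating the empirical CDF
Algorithm 3 quantile_norm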
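
The quantile normalization step can be sketched in Python as follows. This is a hedged illustration with hypothetical names. It assumes the modulation factor multiplies the labeling period, so that quantile 0 (low anomaly) maps to the upper end of the interval and quantile 1 maps to the lower end. It also stores the full sorted history for simplicity, whereas a constant-space variant would keep a fixed-bin histogram instead.

```python
import bisect
from typing import List


class QuantileNormalizer:
    """Map each anomaly signal to a modulation factor via its empirical quantile."""

    def __init__(self, low: float, high: float):
        self.low = low    # factor applied to the most anomalous scores
        self.high = high  # factor applied to the least anomalous scores
        self.history: List[float] = []  # sorted past signals (not constant space)

    def __call__(self, signal: float) -> float:
        if self.history:
            # Empirical CDF of past signals evaluated at the new signal.
            quantile = bisect.bisect_right(self.history, signal) / len(self.history)
        else:
            quantile = 0.5  # assumption: neutral value before any history exists
        bisect.insort(self.history, signal)
        # Linear scaling: quantile 0 -> high, quantile 1 -> low.
        return self.high - quantile * (self.high - self.low)
```

Because only the rank of the signal matters, the output is insensitive to the scale of the underlying detector, which is one reason quantile normalization is convenient for black-box anomaly scores.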

Adaptive Querying with MLDemon

Since we cannot guarantee that the anomaly signal always correlates with changes in the model's accuracy, we would also like to incorporate some robustness to counteract the possible failure event of an uninformative feature-based detector. In Alg. 4, we present a code sketch for MLDemon's adaptive policy. We think of Alg. 4 as a routine that runs at each time step. One of the safeguards that we implement to achieve robustness is to establish a range of possible labeling periods, $[k_{\min}, k_{\max}]$. (The longest period we would go without getting an expert label is $k_{\max}$. The query period range $[k_{\min}, k_{\max}]$ is set based on label budgets and risk tolerances, whereas the quantile normalization range $[\alpha_{\min}, \alpha_{\max}]$ is set based on controlling the anomaly-based modulation.) The lower bound $k_{\min}$ is determined by the total budget that can be spent on expert labels. The upper bound $k_{\max}$ controls the worst-case expected monitoring risk. For the decision problem, we can additionally adapt the query period based on the estimated margin to the target threshold using our estimate of $\mu_t$. With a larger margin, a looser confidence interval suffices to guarantee the same monitoring risk. This translates into fewer label queries. To deploy MLDemon, the engineer specifies, along with a maximum query rate, a monitoring risk tolerance $\epsilon$ such that the expected monitoring risk is at most $\epsilon$ for any deployment. For decision problems, based on our statistical analysis, we can leverage the estimated threshold margin to safely increase $k_{\max}$ while still preserving a risk tolerance of $\epsilon$.

Inputs: Anomaly signal , point estimate history ,

Outputs: At each time ,

Hyperparameters: Window length , Risk tolerance , Maximum query rate , Drift bound (optional)

  if decision problem then
  end if
  if estimation problem then
  end if
   //Compute max query period to meet requirements
   //Compute min period given query budget
   //Subroutine defined in Alg. 3
   //Use anomaly signal to modulate query period
  if  then
     //Condition is true if at least rounds since previous query
  end if
Algorithm 4 Code Sketch for MLDemon
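
The scheduling logic described for the adaptive policy can be illustrated with the short Python sketch below. It is not the authors' algorithm: the derivation of the maximum period from the risk tolerance is not reproduced, so both period bounds are taken as given inputs, and the rule "period = modulation factor times the maximum period, clipped to the allowed range" is an assumption made for illustration. All names are hypothetical.

```python
from dataclasses import dataclass


def clip(value: int, low: int, high: int) -> int:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


@dataclass
class AdaptiveQueryScheduler:
    min_period: int  # shortest gap between queries (set by the label budget)
    max_period: int  # longest gap between queries (set by the risk tolerance)
    rounds_since_query: int = 0

    def step(self, modulation: float) -> bool:
        """Return True if an expert label should be requested this round."""
        # Assumed rule: anomaly-driven modulation shrinks or stretches the period,
        # but never beyond the safe range of labeling periods.
        period = clip(round(modulation * self.max_period),
                      self.min_period, self.max_period)
        self.rounds_since_query += 1
        if self.rounds_since_query >= period:
            self.rounds_since_query = 0
            return True
        return False
```

The clipping is what provides the safeguard: even a detector that always reports "no anomaly" cannot push the gap between surveillance queries past `max_period`, and a detector that always fires cannot exceed the label budget implied by `min_period`.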

4 Experiments

In this section we present an empirical study that benchmarks MLDemon, PQ, and RR on eight realistic data streams. For full details on the experiments, refer to the Appendix.

4.1 Experimental Protocol

Data Streams

We benchmark MLDemon, PQ, and RR on 8 data streams, which are summarized below and in Table 1. KEYSTROKE, 4CR and 5CVT were used in [yu2018request], so we include them as reference points.

  1. SPAM-CORPUS [katakis2006dynamic]: A non-stationary data set for detecting spam mail over time based on text. It represents a real, chronologically ordered email inbox from the early 2000s.

  2. KEYSTROKE [killourhy2009comparing]: A non-stationary data set of biometric keystroke features representing four individuals typing over time. The goal is to identify which individual is typing as their keystroke biometrics drift over time.

  3. WEATHER-AUS: A non-stationary data set for predicting rain in Australia based on other weather and geographic features. The data is gathered from a range of locations and time spanning years.

  4. EMO-R: A stream based on RAFDB [li2017reliable, li2019reliable] for emotion recognition in faces. The distribution drift mimics a change in demographics by increasing the elderly black population.

  5. FACE-R [wang2020masked]: A data set that contains multiple images of hundreds of individuals, both masked and unmasked. The distribution drift mimics the onset of a pandemic by increasing the percentage of masked individuals.

  6. IMAGE-DRIFT [recht2019imagenet]: A new data set, called ImageNetV2, for the ImageNet benchmark [deng2009imagenet] was collected about a decade later. This stream mimics temporal drift in natural images on the web by increasing the fraction of V2 images over time.

  7. 4CR [souzaSDM:2015]: A non-stationary data set that was synthetically generated for the express purpose of benchmarking distribution drift detection. It features 4 Gaussian clusters rotating in Euclidean space.

  8. 5CVT [souzaSDM:2015]: A non-stationary data set that was synthetically generated for the express purpose of benchmarking distribution drift detection. It features 5 Gaussian clusters translating in Euclidean space.

Data Stream Details

Data Stream    T       # classes   Model
SPAM-CORPUS    7400    2           Logistic
KEYSTROKE      1500    4           Logistic
WEATHER-AUS    7000    2           Logistic
EMO-R          1000    7           Face++
FACE-R         1000    400         Residual CNN
IMAGE-DRIFT    1000    1000        SqueezeNet
4CR            20000   4           Logistic
5CVT           6000    5           Logistic

Table 1: Eight data streams used in our empirical study. $T$ is the length of the stream in our benchmark. Also reported are the number of classes in the classification task and the classifier used. Face++ is a commercial API based on deep learning.

Implementation Details

Each data stream is a time-series of labeled data $(X_t, Y_t)$. As a proxy for the true $\mu_t$, which is unknown for real data, we compute a moving average of the empirical accuracy with sliding window length . To produce a trade-off frontier for each method, we sweep the hyperparameters for each method. For PQ, we sweep the amortized query budget $B$. For MLDemon, we sweep the risk tolerance $\epsilon$. For RR, we also sweep the anomaly threshold $\phi$ for label request. In order to have the strongest baselines, we set the optimal hyperparameter values for PQ and RR (see Appendix for details). For consistency, we set , , and for MLDemon in all the experiments.
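
As a small illustration of the accuracy proxy, the sketch below computes a moving average over the per-example correctness indicators of a stream. The function name is hypothetical, and the use of a trailing (causal) window that is shorter at the start of the stream is an assumption; the exact window placement used in the experiments is not specified here.

```python
from collections import deque
from typing import List, Sequence


def moving_average_accuracy(outcomes: Sequence[int], window: int) -> List[float]:
    """Trailing moving average of 0/1 correctness indicators."""
    buffer = deque(maxlen=window)
    running_total = 0
    averages = []
    for outcome in outcomes:
        if len(buffer) == window:
            running_total -= buffer[0]  # value about to be evicted by append
        buffer.append(outcome)
        running_total += outcome
        averages.append(running_total / len(buffer))
    return averages
```

Keeping a running total makes each update constant time, so the proxy can be computed in a single pass over a long stream.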

Anomaly Detector

Following [yu2018request, rabanser2018failing], for all streams except FACE-R, we base our anomaly signal on the model's confidence. If the confidence score at time $t$ is given by , we use a KS-test [lilliefors1967kolmogorov] to determine a $p$-value between empirical samples and . We set . For logistic and neural models, we obtain model confidence in the usual way. When using the commercial Face++ API for EMO-R, we use the confidence scores provided by the service. For FACE-R, we use an embedding-based detector using the model's face embeddings (see Appendix for more details).
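
A confidence-based drift signal of this kind can be sketched with a two-sample Kolmogorov-Smirnov test. The version below is a dependency-free illustration that uses the standard asymptotic approximation for the p-value; in practice a library routine such as `scipy.stats.ks_2samp` would normally be used instead. The function names and the choice of comparing a reference window against a recent window are assumptions.

```python
import math
from typing import Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest gap between the empirical CDFs of two samples."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_p_value(d: float, n: int, m: int) -> float:
    """Asymptotic p-value from the Kolmogorov distribution (approximate for small samples)."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.3:
        return 1.0  # the series converges slowly near zero, where the p-value is ~1
    total = sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
                for k in range(1, 101))
    return min(1.0, max(0.0, 2 * total))


def drift_p_value(reference: Sequence[float], recent: Sequence[float]) -> float:
    """Drift p-value between a reference and a recent window of confidence scores."""
    d = ks_statistic(reference, recent)
    return ks_p_value(d, len(reference), len(recent))
```

A small p-value suggests that the recent confidence scores no longer look like the reference scores; one minus the p-value can then serve as an anomaly score where larger means more anomalous.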

Model Training

For the logistic regression models, we train models on the first of the drift, then treat the rest as the deployment test. We obtain reasonable validation accuracy (at least ) for all of the models we trained. For EMO-R we use the Face++ dataset collected in [chen2020frugalml]. For FACE-R, we use an open-source facial recognition package that is powered by a pre-trained residual CNN [he2015deep] that computes face embeddings. For IMAGE-DRIFT we use the pre-trained SqueezeNet [iandola2016squeezenet] from the PyTorch [paszke2019pytorch] model zoo.

4.2 Results


Figure 3: Trade-off frontiers for monitoring risk vs. amortized queries on 8 benchmark data streams. The top two rows show MAE as the monitoring risk, and the bottom two rows show hinge loss. MLDemon is blue. Periodic Querying is denoted PQ (green). Request-and-Reverify is denoted RR (red). Error bars (in both $x$ and $y$ axes) denote std. error of the mean over at least 5 iterates. For hinge and binary losses, the target threshold is set to be 10% below the validation accuracy at time 0. Additional thresholds are reported in the supplementary materials.
MAE Hinge Binary
Data Stream MLDemon RR PQ MLDemon RR PQ MLDemon RR PQ
SPAM-CORPUS 0.261 0.228 0.381
KEYSTROKE 0.213 0.136 0.263
WEATHER-AUS 0.226 0.157 0.260
EMO-R 0.280 0.242 0.304
FACE-R 0.209 0.229 0.267
IMAGE-DRIFT 0.156 0.139 0.323
5CVT 0.198 0.072 0.212
4CR 0.308 0.173 0.188
Table 2: Normalized AUC for trade-off frontier in Fig. 3. Lower is better, indicative of a more label-efficient deployment monitoring policy. The lowest (best) score for each data stream under each risk is in bold. For hinge and binary risk, the target threshold is always set at 10% below the model’s validation accuracy at time 0. The AUC is computed using the mean values reported in Fig. 3. Results for higher and lower thresholds can be found in Appendix.
Combined loss for varying label cost
MLDemon (0.101, 0.028) (0.093, 0.022) (0.088, 0.018)
RR (0.127, 0.037) (0.118, 0.030) (0.110, 0.025)
PQ (0.171, 0.058) (0.146, 0.045) (0.130, 0.035)
Table 3: The average combined losses across eight data streams for varying label costs . The first value for each field is using MAE risk and the second value is using hinge risk. The cost per label normalizes the online cost of querying an expert for a label versus the online monitoring risk (Eqn. 4). The combined loss is computed by minimizing Eqn. 4 over the empirical trade-off frontiers of vs. at the given value of .

Holistically across eight data streams, MLDemon's empirical trade-off frontier between monitoring risk and query rate is superior to both RR and PQ for MAE and hinge risk (Fig. 3). The same holds true for binary risk, reported in the Appendix. For decision problems, the monitoring risk can vary significantly depending on the chosen threshold, so we include two additional thresholds in the Appendix for both binary and hinge loss. RR tends to outperform PQ; however, in some cases it can actually perform significantly worse. When the anomaly scores are very informative, RR can modestly outperform MLDemon in some parts of the trade-off curve. This is expected, since MLDemon attains monitoring robustness at the cost of a minimal amount of surveillance querying that might prove unnecessary in the most fortuitous of circumstances. Empirically, in the limit of few labels, MLDemon averages about a reduction in MAE risk and a reduction in hinge risk at a given label amount compared to RR. We can also summarize a policy's performance by its normalized AUC (Table 2). Because policies simultaneously minimize monitoring risk $R$ and amortized queries $Q$, a lower AUC is better. Additional AUC scores for varying thresholds are reported in the Appendix.

Consistent with the trends in Fig. 3, across the eight streams and three risk functions, MLDemon achieves the lowest AUC on 19 out of 24 benchmarks. Of the scores in Table 2, even when MLDemon does not score the lowest AUC, it only does worse than the lowest scoring policy by at most , whereas RR averages worse than the lowest scoring policy, and is at least worse than the lowest scoring policy seven times. This supports the conclusion that it is a risky policy to rely purely on potentially brittle anomaly detection instead of balancing surveillance queries with anomaly-driven queries. In our theoretical analysis we mathematically confirm this empirical trend. We find that, compared to RR and PQ, MLDemon consistently decreases the combined loss for varying labeling costs (Table 3), indicating that MLDemon could improve monitoring efficiency in a variety of domains that each have a different relative cost between expert supervision and monitoring risk.
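
The two summary statistics used above can be illustrated with a short Python sketch operating on an empirical trade-off frontier, represented as (query rate, monitoring risk) pairs. The frontier values in the usage note are made up for illustration. The normalization of the AUC by the bounding rectangle of the frontier is an assumption, since the exact normalization is not specified here, and the combined loss is assumed to be the cost per label times the query rate plus the risk.

```python
from typing import List, Tuple

Frontier = List[Tuple[float, float]]  # (amortized query rate, monitoring risk)


def combined_loss_on_frontier(frontier: Frontier, cost_per_label: float) -> float:
    """Smallest combined loss over the operating points of a frontier."""
    return min(cost_per_label * rate + risk for rate, risk in frontier)


def normalized_auc(frontier: Frontier) -> float:
    """Trapezoidal area under the frontier, divided by its bounding rectangle (assumed)."""
    points = sorted(frontier)
    area = sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
    width = points[-1][0] - points[0][0]
    height = max(risk for _, risk in points)
    return area / (width * height)
```

For example, with the hypothetical frontier `[(0.01, 0.20), (0.05, 0.10), (0.10, 0.06)]`, a cost per label of 1.0 selects the middle operating point, while a cost of 0.1 makes the most label-hungry point the cheapest overall.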

5 Theoretical Analysis

Our analysis can be summarized as follows. First, we show that PQ is worst-case rate optimal up to logarithmic factors over the class of $\Delta$-Lipschitz drifts, while RR is not even close. Second, we show that MLDemon matches PQ's worst-case performance while achieving a significantly better average-case performance under a realistic probabilistic model. All the proofs are in the appendix.

Our asymptotic analysis in this section is concerned with an asymptotic rate in terms of small $\Delta$ and amortized by a large $T$. When using asymptotic notation, by loss we mean , for some constant . Recall that amortization is implicit in the definition of . We use tildes to denote the omission of logarithmic factors. For example, means , for some constants . We let be the combined loss when using anomaly detector and policy .

5.1 Minimax Analysis

Theorem 5.1.

Let be the set of -Lipschitz drifts and let be the space of deployment monitoring policies. On both estimation problems with MAE risk and decision problems with hinge risk, for any model and anomaly detector , the following hold:

(i) MLDemon and PQ achieve worst-case expected loss

(ii) RR has a worst-case expected loss

(iii) No policy can achieve a better worst-case expected loss than MLDemon and PQ

The above result confirms that MLDemon is minimax rate optimal up to logarithmic factors. In contrast to the robustness of MLDemon, RR can fail catastrophically regardless of the choice of detector. For hard problem instances, the anomaly signal is always errant and the threshold margin is always small, in which case it is understandable why MLDemon cannot outperform PQ.

Lemma 5.2.

For both estimation and decision problems, MLDemon and PQ achieve a worst-case expected monitoring risk of with a query rate of and no policy can achieve a query rate of

Lem. 5.2 is used to prove Thm. 5.1, but we include it here because it is of independent interest to understand the trade-off between monitoring risk and query costs and it also gives intuition for Thm. 5.1. The emergence of the rate also follows from Lem. 5.2 by considering the combined loss optimizing over to minimize subject to the constraints imposed by Lem. 5.2. Lem. 5.2 itself follows from an analysis that pairs a lower bound derived with Le Cam’s method [yu1997assouad] and an upper bound constructed with an extension to Hoeffding’s inequality [hoeffding1994probability] that enables us to wield it for samples from -Lipschitz drifts. We turn our attention to analyzing more optimistic regimes next.

5.2 Average-Case Analysis

To perform an average-case analysis, we introduce a stochastic model to define a distribution over problem instances in . Our model assumes the following law for generating the sequence from any arbitrary initial condition :


The accuracy drift is modeled as a simple random walk. As discussed in [szpankowski2011average], the maximum entropy principle (used by our model at each time step under the $\Delta$-Lipschitz constraint) is often a reasonable stochastic model for average-case analysis.
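
To give a feel for such a model, the sketch below simulates one possible random walk for the accuracy sequence. The generating law itself is not reproduced here, so the specific choice is an assumption: at each step the next accuracy is drawn uniformly from the values within `delta` of the current accuracy, truncated to the unit interval, which is the maximum-entropy choice over an interval. The function name and seed handling are hypothetical.

```python
import random
from typing import List


def simulate_accuracy_walk(mu0: float, delta: float, steps: int,
                           seed: int = 0) -> List[float]:
    """Simulate a drift in accuracy whose per-step change is at most delta."""
    rng = random.Random(seed)
    path = [mu0]
    for _ in range(steps):
        low = max(0.0, path[-1] - delta)
        high = min(1.0, path[-1] + delta)
        # Assumed law: uniform over the admissible interval.
        path.append(rng.uniform(low, high))
    return path
```

Simulated paths like this are useful for stress-testing a monitoring policy: small values of `delta` produce long stable stretches, while larger values force more frequent surveillance queries.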

We already know that MLDemon is robust in the worst-case. For estimation problems, MLDemon outperforms PQ on average, although only by a constant factor. The reason we have a constant factor gain in the estimation case is because we limit the minimum query rate in order to guarantee robustness against an errant detector. So even if the detector is perfectly informative, we would not completely stop surveillance queries even during the stability periods. On the other hand, we can obtain an actual rate improvement for the hinge case.

Theorem 5.3.

Let be the distribution over problem instances implied by the stochastic model. For the decision problem with hinge risk and model , and detector :

The reason we have a better asymptotic gain in the decision problem is illuminated below in Lem. 5.4.

Lemma 5.4.

For decision problems with hinge risk under model , MLDemon achieves an expected monitoring hinge risk with an amortized query amount .

MLDemon can save an average factor in query cost, which ultimately translates into the rate improvement in Thm. 5.3. MLDemon does this by leveraging the margin between estimate and threshold to increase the slack in the confidence interval around the estimate without increasing risk.
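
The intuition that a larger margin permits fewer labels can be illustrated with the classical Hoeffding bound for i.i.d. Bernoulli outcomes. This is a simplification: it ignores drift, whereas the analysis above relies on an extension of Hoeffding's inequality to Lipschitz drifts. The function name and the failure probability parameter are hypothetical.

```python
import math


def labels_needed(half_width: float, failure_prob: float) -> int:
    """Labels needed so the empirical mean is within half_width of the true
    accuracy with probability at least 1 - failure_prob (Hoeffding, no drift)."""
    return math.ceil(math.log(2 / failure_prob) / (2 * half_width ** 2))
```

With a failure probability of 0.05, a half-width of 0.05 requires 738 labels, while a half-width of 0.10 requires only 185. When the estimate sits far from the decision threshold, the wider interval still determines the correct decision, so the policy can afford the smaller label count.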

6 Discussion

Related Works

While our problem setting is novel, there are a variety of settings relating to ML deployment and distribution drift. One such line of work focuses on reweighting data to detect and counteract label drift [lipton2018detecting, garg2020unified]. Another related problem is when one wants to combine expert and model labels to maximize the accuracy of a joint classification system [cesa1997use, farias2006combining, morris1977combining, fern2003online]. The problem is similar in that some policy needs to decide which user queries are answered by an expert versus an AI, but the problem is different in that it is interesting even in the absence of online drift or high labeling costs. It would be interesting to augment our formulation with a reward for the policy when it can use an expert label to correct a user query that the model got wrong. This setting would combine both the online monitoring aspects of our setting along with ensemble learning [polikar2012ensemble, minku2009impact] under concept drift with varying costs for using each classifier depending on whether it is an expert or an AI.

Adaptive sampling rates are a well-studied topic in the signal processing literature [dorf1962adaptive, mahmud1989high, peng2009adaptive, feizi2010locally]. The essential difference is that in signal processing, measurements tend to be exact, whereas in our setting a measurement just reveals the outcome of a single Bernoulli trial. Another popular related direction is online learning or lifelong learning during concept drift. Despite a large and growing body of work in this direction, including [fontenla2013online, hoi2014libol, nallaperuma2019online, gomes2019machine, chen2019novel, nagabandi2018deep, hayes2020lifelong, chen2018lifelong, liu2017lifelong, hong2018lifelong], this problem is by no means solved. Our setting assumes that after some time $T$, the model will eventually be retired for an updated one. It would be interesting to augment our formulation to allow a policy to update the model based on the queried labels. Model robustness is a related topic that focuses on designing $f$ such that accuracy does not fall during distribution drift [zhao2019learning, lecue2020robust, li2018principled, goodfellow2018making, zhang2019building, shafique2020robust, miller2020effect].

Conclusion and Future Directions

We pose and analyze a novel formulation for studying automated ML deployment monitoring policies. Understanding the trade-off between expert attention and monitoring quality is of both research and practical interest. Our proposed policy comes with theoretical guarantees and performs favorably on empirical benchmarks. The potential impact of this work is that MLDemon could be used to improve the reliability and efficiency of ML systems in deployment. Since this is a relatively new research direction, there are several interesting directions of future work. We have implicitly assumed that experts respond to label requests approximately instantly. In the future, we can allow the policy to incorporate labeling delay. Also, we have assumed that an expert label is as good as a ground-truth label. We can relax this assumption to allow for noisy expert labels. We could also let the policy more actively evaluate the apparent informativeness of the anomaly signal over time or even input an ensemble of different anomaly signals and learn which are most relevant at a given time. While MLDemon is robust even if the feature-based anomaly detector is not good, it is more powerful with an informative detector. Improving the robustness of anomaly detectors for specific applications and domains is a promising area of research.



We begin with supplementary figures in Appendix A. In Appendix B, we include additional details regarding our experiments. In Appendix C, we include additional mathematical details and all proofs.

Appendix A Supplementary Figures

a.1 Additional Thresholds for Decision Experiments

We report additional trade-off frontiers. Fig. 4 reports a higher threshold and Fig. 5 reports a lower threshold. Both report both hinge and binary risk. Fig. 6 reports a medium threshold for binary risk. We include normalized AUC values in Table 4 for higher and lower thresholds as well.

Figure 4: Trade-off frontiers for monitoring risk vs. query rate on 8 Data Streams. On top is hinge loss. On the bottom is binary loss. MLDemon is denoted MLD (blue). Periodic Querying is denoted PQ (green). Request-and-Reverify is denoted RR (red). Error bars (in both x and y axes) denote std. error of the mean over at least 5 iterates. For hinge and binary losses, the target threshold is set to be below the validation accuracy at time . Additional frontiers are reported in Fig. 5 and Fig. 6.
Figure 5: On top is hinge loss. On the bottom is binary loss. The target threshold is set to be below the validation accuracy at time . Refer to Fig. 4 for more details and legend.
Figure 6: Binary risk is used. The target threshold is set to be below the validation accuracy at time . Refer to Fig. 4 for more details and legend.
Table 4: Normalized AUC for trade-off frontier. Lower is better, indicative of a more label-efficient deployment monitoring policy.

a.2 Anomaly Detector Ablation

It is also important to understand the sensitivity of each policy to the particular choice of detector. Since MLDemon is designed from first principles to be robust, we should expect it to work sufficiently well for any choice of detector. On the other hand, RR’s performance is highly variable depending on the choice of detector.

Recall that we use a KS-test to detect drift in our main results. Another reasonable choice of anomaly score is based on a -test (essentially this boils down to computing and comparing the empirical means over two consecutive sliding windows) [rabanser2018failing]. Like for our KS-test, we select a window size of .

It is also reasonable to explore a baseline set by a detector that is driven entirely by random noise.

For this ablation study, we opt to include the six non-synthetic data streams. However, recall that the detector for FACE-R was based on the face embeddings generated by the model, and we already used a moving average based detector in the main experiments. Thus, we include 4CR instead of FACE-R for the

Figure 7: An ablation study reporting the trade-off frontier when using a moving average based means test instead of the KS-test.
-test KS-test Noise
Table 5: Normalized AUC for trade-off frontier. Risk is measured in MAE.

Appendix B Experimental Details

b.1 Data Stream Details

We describe each of the eight data streams in greater detail. All data sets are public and may be found in the references. All lengths for each data stream were determined by ensuring that the stream was long enough to capture interesting drift dynamics.

b.1.1 Data Stream Construction

  1. SPAM-CORPUS: We take the first points from the data in the order that it comes in the data file.

  2. KEYSTROKE: We take the first points from the data in the order that it comes in the data file.

  3. WEATHER-AUS: We take the first points from the data in the order that it comes in the data file.

  4. EMO-R: We randomly sample from the entire population (i.e. every data point in the data set) for the first points in the stream, after which we randomly sample from the elderly black population.

  5. FACE-R: We randomly subsample 400 individuals out of the data set that have at least 3 unmasked images and 1 masked image to create a reference set. For these 400 individuals, we randomly sample from the set of unmasked images for the first points, after which we randomly sample from the set of masked images for the same 400 individuals.


  6. The ImageNet-based stream: We sampled random images from the standard ImageNet validation set for the first segment, and we sampled a random set out of the MatchedFrequency set within ImageNetV2 for the second segment.

  7. 4CR: We take the first points from the data in the order that it comes in the data file.

  8. 5CVT: We take the first points from the data in the order that it comes in the data file.

b.1.2 Bootstrapping

In order to get iterates for each data set, we generate the stream by bootstrap as follows. We block the data sequence into blocks of fixed length and uniformly at random permute the data within each block. Because the permutation is confined to each block, this bootstrapping does not materially alter the structure of the drift.
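The blocked permutation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `block_bootstrap`, the parameter `block_len`, and the handling of a shorter final block are assumptions.

```python
import random


def block_bootstrap(stream, block_len, seed=None):
    """Return a copy of `stream` with items shuffled inside consecutive blocks.

    `block_len` is a hypothetical name for the block length; a trailing block
    shorter than `block_len` is shuffled on its own (an assumption).
    """
    rng = random.Random(seed)
    out = []
    for start in range(0, len(stream), block_len):
        block = list(stream[start:start + block_len])
        rng.shuffle(block)  # uniform random permutation within the block
        out.extend(block)
    return out


# Each seed yields one bootstrap iterate of the same underlying stream.
iterates = [block_bootstrap(list(range(10)), block_len=4, seed=s) for s in range(3)]
```

Since no item ever leaves its block, any drift that unfolds over a time scale much longer than the block length is preserved in every iterate.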

b.2 Models

b.2.1 Logistic Regression

For the logistic regression model, we used the default solver provided in the scikit-learn library [pedregosa2011scikit].

b.2.2 Facial Recognition

For the facial recognition system, we used the open-source model referenced in the main text. The model computes face embeddings given images. The embeddings are then used to compute a similarity score as described in the API. For any query image belonging to one of 400 individuals, the model looks for the best match among the 400 individuals by comparing the query image to each of the 3 reference images for each individual and taking an average similarity score. The highest average similarity score out of the 400 individuals is returned as the predicted matching individual.
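The matching rule can be illustrated with a short sketch. The similarity function used by the actual API is not specified here, so cosine similarity is used purely as a stand-in; `predict_identity`, `cosine`, and the toy embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; a stand-in for the API's similarity score (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_identity(query, references, similarity=cosine):
    """Return (identity, score) with the highest average similarity to `query`.

    `references` maps each identity to its list of reference embeddings
    (3 per individual in the setup described above).
    """
    best_id, best_score = None, float("-inf")
    for identity, refs in references.items():
        avg = sum(similarity(query, r) for r in refs) / len(refs)
        if avg > best_score:
            best_id, best_score = identity, avg
    return best_id, best_score


refs = {
    "a": [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]],
    "b": [[0.0, 1.0], [0.1, 0.9], [0.1, 1.0]],
}
match, score = predict_identity([0.8, 0.2], refs)
```

The prediction counts as correct for monitoring purposes when the returned identity equals the true identity of the query image.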

b.2.3 ImageNet

For the ImageNet model, we used SqueezeNet as referenced in the main text.

b.2.4 Emotion Recognition

For the emotion recognition system, we used the Face++ API as referenced in the main text.

b.3 Anomaly Detectors

We describe further details for our anomaly detection protocol. As mentioned in the main text, the detectors compare a window of the most recent features to an adjacent window of the next most recent features, so that the two windows together cover the most recent features in the stream. These anomaly signal window lengths are determined heuristically by trying a range of widths across various data streams.

b.3.1 Confidence-Based Detector

For most of our benchmarks, we follow the confidence-based detector proposed in [yu2018request]. To reiterate the main text, we run a KS-test on the model’s confidence scores over time. The p-value from the KS-test serves as an anomaly metric for the policy.
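As a rough sketch of such a detector, the snippet below compares the two most recent adjacent windows of confidence scores with a two-sample KS statistic and converts it to an approximate p-value. It is written in plain Python for self-containment and uses the standard asymptotic Kolmogorov series; in practice a library routine such as `scipy.stats.ks_2samp` would be the natural choice. The function names and the `window` parameter are assumptions.

```python
import math


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic p-value 2 * sum_k (-1)^(k-1) exp(-2 k^2 lam^2), clipped to [0, 1]."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:  # the series is numerically 1 here and converges slowly
        return 1.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
            for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * s))


def confidence_anomaly_pvalue(confidences, window):
    """Compare the latest `window` confidences with the `window` before them."""
    if len(confidences) < 2 * window:
        raise ValueError("need at least 2 * window confidence scores")
    recent = confidences[-window:]
    previous = confidences[-2 * window:-window]
    return ks_pvalue(ks_statistic(recent, previous), len(recent), len(previous))
```

A small p-value indicates that the recent confidence scores look unlike the preceding ones, which a policy can treat as evidence of drift.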

b.3.2 Embedding-Based Detector

For the FACE-R task, the model computes a face embedding. This embedding vector already summarizes the face. By comparing recent and less recent samples of face embeddings from the data stream, we can interpret the Euclidean distance between the empirical means as a metric for anomaly. More formally, we require the use of a multi-dimensional location test. In principle, any such test should work roughly the same. Because the sample sizes are always the same, the main quantity influencing the p-value of the test is just the distance between the empirical means. We use the test in [srivastava2013two].
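The quantity driving the test can be sketched directly. Note that this computes only the raw distance between window means, not the test statistic of [srivastava2013two]; the function names are hypothetical.

```python
import math


def mean_vector(vectors):
    """Coordinate-wise empirical mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def mean_shift_distance(recent, previous):
    """Euclidean distance between the empirical means of two embedding windows."""
    mu_recent = mean_vector(recent)
    mu_previous = mean_vector(previous)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mu_recent, mu_previous)))


# Toy 2-dimensional "embeddings": the recent window has shifted along axis 0.
previous = [[0.0, 0.0], [0.0, 0.0]]
recent = [[1.0, 0.0], [3.0, 0.0]]
shift = mean_shift_distance(recent, previous)
```

A larger distance corresponds to a smaller p-value in a location test with fixed sample sizes, so either quantity can serve as the anomaly signal after an appropriate monotone transformation.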

b.4 Policies

b.4.1 Periodic Querying

The only hyperparameter we sweep for PQ is the query budget. The other hyperparameter is the window length and batch size for the querying, which we fix in our benchmarks to a value that performs well empirically.

b.4.2 Request-and-Reverify

The only hyperparameters for RR are the window length and batch size for the querying, and the query threshold for the anomaly signal. We sweep both. For the threshold, the effective range is data stream dependent, but it is always swept exhaustively from one extreme to the other.

b.4.3 MLDemon

The hyperparameters for MLDemon are all specified in the main text, except for the interval . We set and .

b.5 Computing Infrastructure and Resource Usage

Experiments are all run on high-end consumer grade CPUs. Depending on the benchmark, a single simulation (i.e. random seed) could range from minutes to several hours. A single simulation used at most 4 GB of memory, and for most benchmarks much less. For all data streams except FACE-R, this includes all preprocessing time such as training the model. For FACE-R, we precomputed all of the face embeddings for all images in the dataset, a process that took less than 64 GB of memory and took less than 24 hours with 32 cores working in parallel. In total, including repetitions and preliminary experiments, we estimate the entire project took approximately 10,000 CPU-hours.

Appendix C Mathematical Details & Proofs

We begin with reviewing definitions and notations.

c.1 Definitions & Notation

Definition C.1.

Accuracy at time

We use the convention that is known from the training data. All policies can make use of this as their initial estimate:


c.1.1 Instantaneous & Amortized Monitoring Risk

We first define the instantaneous monitoring risk, which we distinguish here from the amortized monitoring risk (defined in the main text). Instantaneous monitoring risk is the risk for a particular data point whereas is the amortized risk over the entire deployment.

Definition C.2.

(Instantaneous) Monitoring Risk

We define the monitoring risks in MAE and hinge settings for a single data point in the stream below.


As mentioned in the main text, we omit the subscript when the loss function is clear from context or is not relevant. In the context of our online problem, at time , we may generally infer that , and is fixed over time. In this case, we might use the shorthand , as below:


where is the usual amortized monitoring risk term defined in Section 2.

c.1.2 Policies

We use the following abbreviations to formally denote the PQ, RR, and MLDemon policies: , , and , respectively.

Definition C.3.

Formal Request-and-Reverify Policy

We let denote RR policy as defined in Section 3 of the main text. We write to emphasize the dependence on a particular threshold hyperparameter . Recall that is set at time and fixed throughout deployment. When is omitted, the dependence is to be inferred. Policy is the same for both decision and estimation problems.

Definition C.4.

Formal Periodic Querying Policy

We let denote PQ policy as defined in Section 3 of the main text. Recall that can be parameterized by a particular query rate budget as defined in the main text. Alternatively, we can parameterize by a worst-case risk tolerance such that in any problem instance. Using the theory we will presently develop, we can convert a risk tolerance into a constant average query rate given by as computed in Alg. 5. The guaranteed risk tolerance implicitly depends on the Lipschitz constant , which we can assume to be known or upper bounded for the purposes of our mathematical analysis. We write for policy instances parameterized by budget and for policy instances parameterized by risk tolerance . When the particular choice of parameterization can be inferred or is otherwise irrelevant, we omit it. By convention, if is not updated at a given time , then .

Inputs: Time

Outputs: At each time , ,

Hyperparameters: Drift bound , Risk tolerance , Decision threshold

Constants: ;

//The state vars are persistent variables initialized to the values below
//By convention, if they are not updated at any given time, the value persists to the next round
State Vars: Query counter , Buffer counter , Point estimate

  //Only one of the 4 if-statements below will execute
  if   then
  end if
  if  then
  end if
  if  then
      empirical mean from most recent label queries
  end if
  if  then
  end if
Algorithm 5 Periodic Querying parameterized by risk tolerance

Although the algorithm looks different from the one presented in the main text, it is actually the same, just more formally keeping track of the query period and the waiting period with counters.
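To make the counter-based bookkeeping concrete, the following is a minimal sketch of a periodic querying loop. It is an illustration under assumptions rather than a transcription of Alg. 5: the class and attribute names are hypothetical, the batch size and waiting period are passed in directly instead of being derived from the risk tolerance, and the waiting period is counted separately from the batch.

```python
class PeriodicQuerying:
    """Query a batch of labels, hold the batch mean as the estimate, wait, repeat."""

    def __init__(self, batch_size, wait_rounds, initial_estimate):
        self.batch_size = batch_size
        self.wait_rounds = wait_rounds
        self.estimate = initial_estimate  # accuracy known from training data
        self.queries_left = batch_size    # query counter
        self.wait_left = 0                # buffer (waiting) counter
        self._outcomes = []

    def step(self, query_label):
        """Advance one round; `query_label()` returns 1 if the model was correct."""
        if self.queries_left > 0:
            self._outcomes.append(query_label())
            self.queries_left -= 1
            queried = True
            if self.queries_left == 0:
                # The estimate is only updated once the whole batch is in.
                self.estimate = sum(self._outcomes) / len(self._outcomes)
                self._outcomes = []
                self.wait_left = self.wait_rounds
        else:
            self.wait_left -= 1
            queried = False
        if self.queries_left == 0 and self.wait_left == 0:
            self.queries_left = self.batch_size  # start the next batch
        return self.estimate, queried


pq = PeriodicQuerying(batch_size=2, wait_rounds=3, initial_estimate=0.9)
outcomes = iter([1, 0, 1, 1])
log = [pq.step(lambda: next(outcomes)) for _ in range(7)]
```

In this sketch the long-run query rate is `batch_size / (batch_size + wait_rounds)`, and the policy is open-loop: it never looks at the anomaly signal.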

Definition C.5.

Formal MLDemon Policy

We let denote the MLDemon policy. We give formal versions of the routines comprising . As for PQ and RR, we let specify an exact risk tolerance for the policy. is nearly identical to the sketch given in the main text, but there are a few minor clarifications and distinctions that need to be made for formal reasoning. When necessary, we will write and to distinguish the estimation and decision variants of the MLDemon policies. By convention, if a state variable, such as, is not updated at a given time , then . Also, if a state variable gets updated more than once in any given round, the final value of the state variable persists to the next round at time .

The differences between the formal variant and the sketch given in the main text are minor and are as follows. First, the lower bound on query period is forced to be within a constant factor of . This is a technical point in that in order to achieve the minimax optimality results, we need to control how often the policy is allowed to query. Letting the policy set an arbitrary budget makes sense in practice, but for theoretical analysis we will assume the policy is trying to be as frugal as possible with label queries while respecting the specified risk tolerance.

Another minor point is the technical safety condition that is needed for decision problems. In practice, this condition triggers only exceptionally infrequently, but in order to complete the technical parts of the proofs, it is expedient to keep the condition. We also point out that our theoretical analysis specifies particular asymptotics for window length , reflected in the constants in Alg. 6.

Finally, we point out the unbiased sample flag. When this flag is set, formally speaking, one requires an additional assumption, namely, that the average expected increment to is mean zero. Strictly speaking, does not require this assumption and the formal guarantees hold without it. However, we found stronger empirical performance with the flag turned on, and thus recommend it for applications.

Inputs: Anomaly signal , point estimate history ,

Outputs: At each time ,

Hyperparameters: Window length , Risk tolerance , Maximum query rate factor , Drift bound , Unbiased increments flag , Margin surplus factor

Constants: ;

//The state vars are persistent variables initialized to the values below
//By convention, if they are not updated at any given time, the value persists to the next round
State Vars: Query counter , Buffer counter , Point estimate , Margin surplus

  if decision problem then
  end if
  if estimation problem then
  end if
   //Compute max query period to meet requirements
   //Minimum period is a fixed factor of max
   //Formally defined below
   //Use anomaly signal to modulate query period
  if  then
     //Condition is true if at least rounds since previous query
  end if
   // is the set of all indices for we have labels since time
   //Compute steps back such that we have labels
   // is a set of indices for the most recent labels
  if  then
      //Compute empirical mean from most recent labels
  end if
  if decision problem then
     //A technical safety condition required for proving robustness
     //While formally required, the condition below triggers infrequently in practice
     //Only one of the 4 if-statements below will execute
     if   then
     end if
     if  then
     end if
     if  then
          empirical mean from most recent label queries
     end if
     if  then
     end if
  end if
Algorithm 6
Quantile Scaling

We proceed to more formally define the quantile_norm subroutine. The quantile normalization step itself is standard and we refer the reader to [amaratunga2001analysis]. In short, we compute the empirical inverse cumulative distribution function and use it to map data back to a normalized value in [0, 1]. Once we have normalized, we scale our value onto the target interval:


Thus, a formal definition of Alg. 3 (quantile_norm) is given by the scaling defined above in Eqn. 10. In order for to be well-posed, we require the following condition on :


c.2 Preliminary Results

c.2.1 Bounding Accuracy Drift in Absolute Value Based on Total Variation

We begin with proving the claim from the introduction regarding the equivalence of a Lipschitz bound on accuracy drift and a total variation bound on the distribution drift. Although Prop. C.1 may not be immediately obvious to one unfamiliar with total variation distance on probability measures, the result is in fact trivial. To clarify notation, for probability measure and event we let .

Proposition C.1.

Let and be two supervised learning tasks (formally, distributions over ). Let be a model. If then where and .


Let be the sample space for distributions and . TV-distance has many equivalent definitions. One of them is given in Eqn. 12 below [villani2008optimal]:
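In assumed notation (the symbols $P$, $P'$ for the two tasks, $\Omega$ for the sample space, $f$ for the model, $A$, $A'$ for the two accuracies, and $\Delta$ for the drift bound are chosen here for illustration), the supremum-over-events form of TV-distance and the one-line argument it yields can be sketched as:

```latex
% Sketch under assumed notation; E ranges over measurable events in \Omega.
\[
  \mathrm{TV}(P, P') \;=\; \sup_{E \subseteq \Omega} \bigl| P(E) - P'(E) \bigr| .
\]
% Take the particular event on which the model is correct:
\[
  E_f \;=\; \{ (x, y) \in \Omega : f(x) = y \},
  \qquad A = P(E_f), \qquad A' = P'(E_f).
\]
% The supremum dominates the gap on any single event, so
\[
  |A - A'| \;=\; \bigl| P(E_f) - P'(E_f) \bigr|
  \;\le\; \sup_{E \subseteq \Omega} \bigl| P(E) - P'(E) \bigr|
  \;=\; \mathrm{TV}(P, P') \;\le\; \Delta .
\]
```

The only step is to specialize the supremum to the event that the model predicts correctly, which is why the bound on total variation transfers directly to a bound on accuracy drift.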


c.2.2 Adapting Hoeffding’s Inequality for Lipschitz Sequences

One of the key ingredients in many of the proofs is the following modification to Hoeffding’s inequality that enables us to use it to construct a confidence interval for the empirical mean of observed outcomes even though is drifting (rather than i.i.d. as is usual).

Lemma C.2.

Hoeffding’s Inequality for Bernoulli Samples with Bounded Bias

Assume we have a random sample of Bernoulli trials such that each trial is biased by some small amount: . Let denote a sample mean. Let .



We can invoke the classical version of Hoeffding’s inequality [hoeffding1994probability]:


Notice that . Plugging in below yields:


Also, notice that due to the reverse triangle inequality [sutherland2009introduction]:




Moving the to the other side of the inequality finishes the result:


All of the work has already been done above in Lem. C.2, but in order to make it more clear how it is applied to a Lipschitz sequence of distributions we also state Lem. C.3.

Lemma C.3.

Hoeffding’s Inequality for Lipschitz Sequences

Assume drift is -Lipschitz. Let be any subset of the set of rounds for which policy has queried:




denote a sample mean of observed outcomes for rounds in .



be defined analogously to Lemma C.2.



The result follows by setting and applying Lemma C.2. ∎
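One way to read Lemmas C.2 and C.3 numerically is as a confidence radius made of a sampling term plus a drift term. The sketch below assumes independent Bernoulli outcomes, a per-round accuracy drift of at most `lipschitz`, and a failure probability `fail_prob`; all names are hypothetical, and the drift term uses the bound that a label from round i can be off by at most `lipschitz * |t - i|` at time t.

```python
import math


def drift_confidence_radius(query_rounds, t, lipschitz, fail_prob):
    """Radius r with |sample mean - accuracy at time t| <= r w.p. >= 1 - fail_prob.

    Sampling term: solve 2 * exp(-2 * n * eps**2) = fail_prob for eps (Hoeffding).
    Drift term: average of lipschitz * |t - i| over the queried rounds i.
    """
    n = len(query_rounds)
    if n == 0:
        raise ValueError("at least one queried round is required")
    sampling = math.sqrt(math.log(2.0 / fail_prob) / (2.0 * n))
    drift = lipschitz * sum(abs(t - i) for i in query_rounds) / n
    return sampling + drift


# Three labels queried at rounds 8, 9, 10, evaluated at round 10.
radius = drift_confidence_radius([8, 9, 10], t=10, lipschitz=0.01, fail_prob=0.05)
```

The sampling term shrinks as more labels are averaged, while the drift term grows with the age of those labels, which is the tension a monitoring policy has to balance.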

c.3 Proof of Lemma 5.2

We study the average query rate required to guarantee a worst-case monitoring risk, taken over all -Lipschitz drifts. Note that in the worst-case, the anomaly signal is uninformative and thus adaptivity with respect to the detection signal will not be helpful. The following results hold both for MAE loss and hinge loss. Lemma 5.2 has two parts. The first statement is about PQ whereas the second is for MLDemon.

c.3.1 Lemma 5.2 for Periodic Querying

Lemma C.4.

Let be the class of -Lipschitz drifts. Assume .

For both estimation and decision problems (using MAE and hinge loss), PQ achieves a worst-case expected monitoring risk of with a query rate of :


Because for all , it will be sufficient to prove the result for the estimation case (doing so directly implies the decision case).

It should be clear from Alg. 5 that the amortized query complexity is just , which indeed satisfies the query rate condition. This holds for any choice of (PQ is an open-loop policy).


It remains to verify that the choice of and results in a worst-case expected monitoring risk of .

Consider Lemma C.3. Based on Lemma C.3, imagine we are applying Eqn. 22 at each point in time with quantity being used as our point estimate . If, for all time:


then we can be assured that PQ maintains an expected monitoring risk of because the probability of an error greater than is at most and such an error event is bounded by in the worst-case.

Fixing , it is easy to verify that .

Recall that is fixed:


Plugging in the above produces:


It remains to verify the first inequality, , at all . This is slightly more involved, as is not constant over time. What is the worst case that would be possible in Eqn. 22 when using PQ? The PQ policy specifies a query batch of size , and maintains the empirical accuracy from this batch as the point estimate for the next rounds in the stream. Thus, is largest precisely when the policy is one query away from completing a batch. At this point in time, the point estimate has not yet been updated, because the policy only updates it at the end of the batch once all queries have been made. Thus, we are using an estimate that is the empirical mean of label queries such that this batch was started iterations ago and was completed iterations ago.

The maximal value possible for is hence:


It is easy to upper bound this sum as follows:


Recall :


Plugging in into , along with the upper bound for yields: