1 Introduction
As machine learning (ML) automation permeates increasingly varied domains, engineers find that deployment, monitoring, and model lifecycle management increasingly dominate the technical costs of ML systems [zaharia2018accelerating, kumar2017data]. Furthermore, classical ML relies on the assumption that training data is collected from the same distribution as the data encountered in deployment [friedman2001elements, abu2012learning]. However, this assumption is increasingly shown to be fragile or false in many real-world applications [koh2020wilds]. Thus, even when ML models achieve expert-level performance in laboratory conditions, many automated systems still require deployment monitoring capable of alerting engineers to unexpected behavior caused by distribution shift in the deployment data.
When ground-truth labels are not readily available at deployment time, which is often the case since labels are expensive, the most common solution is to use an unsupervised anomaly detector that is purely feature-based [lu2018learning, rabanser2018failing]. In some cases, these detectors work well. However, they may also fail catastrophically, since model accuracy can fall precipitously without any detectable change in the features alone. This can happen in one of two ways. First, for high-dimensional data, feature detectors simply lack a sufficient number of samples to detect all covariate drifts. Second, drift may occur only in the conditional distribution of the labels given the features, which by construction can never be detected without supervision. One potential approach is the policy proposed in [yu2018request], which applies statistical tests to estimate a change in the distribution of the features and requests expert labels only when such a change is detected. While it is natural to assume that distribution drift in the features should be indicative of a drift in the model’s accuracy, in reality feature drift is neither necessary nor sufficient as a predictor of accuracy drift (as described above). In fact, we find that unsupervised anomaly detectors are often brittle and unreliable. Thus, any monitoring policy that triggers supervision only from feature-based anomalies can fail both silently and catastrophically.
In this work, we focus on a setting where an automated deployment monitoring policy can query experts for labels during deployment (Fig. 1). The goal of the policy is to estimate the model’s real-time accuracy throughout deployment while querying the fewest expert labels. Of course, these two objectives are in direct contention. Thus, we seek to design a policy that can effectively prioritize expert attention at key moments.
Contributions
In this paper, we formulate ML deployment monitoring as an online decision problem and propose a principled adaptive deployment monitoring policy, MLDemon, that substantially improves over prior art both empirically and in its theoretical guarantees. We summarize our primary contributions below.
(1) Our new formulation is tractable and captures the key tradeoff between monitoring cost and risk.
(2) Our proposed adaptive monitoring policy, MLDemon, is minimax rate optimal up to logarithmic factors. Additionally, MLDemon is provably robust to broad types of distribution drifts.
(3) We empirically validate our policy across diverse, real time-series data streams. Our experiments reveal that feature-based anomaly detectors can be brittle with respect to real distribution shifts and that MLDemon simultaneously provides robustness to errant detectors while reaping the benefits of informative detectors.
2 Problem Formulation
We consider a novel online streaming setting [munro1980selection, karp1992line] where for each time point , the data point and the corresponding label are generated from a distribution that may vary over time: . For a given model , let denote its accuracy at time . The total time can be understood as the lifecycle of the model as measured by the number of user queries. In addition, we assume that we have an anomaly detector , which can depend on both present and past observations and is potentially informative of the accuracy . For example, the detector can quantify the distributional shift of the feature stream and a large drift may imply a deterioration of the model accuracy.
We consider scenarios where high-quality labels are costly to obtain and are only available upon request from an expert. Therefore, we wish to monitor the model performance over time while obtaining the minimum number of labels. We consider two settings that are common in machine learning deployments: 1) point estimation of the model accuracy across all time points (estimation problem), and 2) determining whether the model’s current accuracy is above or below a user-specified threshold (decision problem).
At time , the policy receives a data point and submits a pair of actions , where denotes whether or not to query for an expert label on and is the estimate of the model’s current accuracy. We use to denote the observed prediction outcome, whose value is if the policy asks for (namely, ) and otherwise. We only consider a class of natural policies that, for the decision problem, predict the accuracy to be above the threshold if and vice versa.
We wish to balance two types of costs: the average number of queries and the monitoring risk. In the estimation problem, without loss of generality, we consider the mean absolute error (MAE) for the monitoring risk:
(1) 
In the decision problem, we consider a binary version and a continuous version for the monitoring risk:
(2)  
(3) 
where we note that the summand in Eq. (2) is if the predicted accuracy and the true accuracy incur different decisions when compared to the threshold . We use to denote the monitoring risk in general when there is no need to distinguish between the risk functions. Therefore, the combined loss can be written as:
(4) 
where indicates the cost per label query and controls the tradeoff between the two types of loss. Our goal is to design a policy to minimize the expected loss .
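The risk functions and the combined loss described above can be illustrated with a short sketch. This is a minimal illustration under stated assumptions rather than the paper's implementation: the names `mu` (true accuracy per time step), `mu_hat` (the policy's estimates), `queried` (0/1 query decisions), `rho` (decision threshold), and `c` (cost per label query) are hypothetical, and the hinge summand is assumed to be the distance between the true accuracy and the threshold whenever the two decisions disagree.

```python
def mae_risk(mu, mu_hat):
    """Mean absolute error between true and estimated accuracy."""
    return sum(abs(m - h) for m, h in zip(mu, mu_hat)) / len(mu)


def binary_risk(mu, mu_hat, rho):
    """Fraction of steps where truth and estimate fall on different sides of rho."""
    return sum((m > rho) != (h > rho) for m, h in zip(mu, mu_hat)) / len(mu)


def hinge_risk(mu, mu_hat, rho):
    """Assumed continuous version: a wrong decision costs the true margin |mu - rho|."""
    total = sum(abs(m - rho) for m, h in zip(mu, mu_hat) if (m > rho) != (h > rho))
    return total / len(mu)


def combined_loss(risk, queried, c):
    """Cost per label times the average query rate, plus the monitoring risk."""
    return c * sum(queried) / len(queried) + risk
```

For instance, with `mu = [0.9, 0.8, 0.6]`, `mu_hat = [0.9, 0.7, 0.8]`, and `rho = 0.75`, the estimate lands on the wrong side of the threshold at the second and third steps, so the binary risk is 2/3 while the hinge risk only charges the small margins at those steps.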
Assumption on Distributional Drift
We are especially interested in settings for which the distribution varies in time. Without any assumption about how changes over time, it is impossible to guarantee that any labeling strategy achieves reasonable performance. Fortunately, many real-world data drifts tend to be more gradual over time. Many ML systems can process hundreds or thousands of user queries per hour, while many real-world data drifts tend to take place over days or weeks. Motivated by this, we consider distribution drifts that are Lipschitz-continuous [o2006metric] over time in total variation [villani2008optimal]: . (Footnote 1: If is Lipschitz in , then it follows that is Lipschitz in absolute value . This gives a natural interpretation to in terms of controlling the maximal change in model accuracy over time.) The Lipschitz constraint captures that the distribution shift must happen in a way that is controlled over time, which is a natural assumption for many cases. The magnitude of captures the inherent difficulty of the monitoring problem. All instances are Lipschitz because . When , we are certain that no drift can occur, and thus do not need to monitor at deployment at all. When is small, the deployment is easier to monitor because the stream drifts slowly. For our theory, we focus on asymptotic regret, in terms of , but amortized over the length of the deployment . While our theoretical analysis relies on the Lipschitz assumption, our algorithms do not require it to work well empirically.
Feature-Based Anomaly Detection
We assume that our policy has access to a feature-based anomaly detector that computes an anomaly signal from the online feature stream, namely . We let denote the anomaly detection signal at time : . Signal captures the magnitude of the anomaly in the feature stream such that indicates that no feature drift is detected and large indicates that feature drift is likely or significant. The design of the specific anomaly detector is usually domain-specific and is out of the scope of the current work (see [rabanser2018failing, yu2018request, wang2019drifted, pinto2019automatic, kulinski2020feature, xuan2020bayesian] for recent examples). At a high level, most detectors first apply some type of dimensionality reduction to the raw features to produce summary statistics. Then, they apply a statistical test to the summary statistics, such as a KS-test [lilliefors1967kolmogorov], to generate a drift value. This drift value can be interpreted as an anomaly signal. Common summary statistics include embedding layers of deep models or even just the model confidence. In the case of ML APIs, only the confidence score is typically available [chen2020frugalml]. Our MLDemon framework is compatible with any anomaly detector.
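As one concrete illustration of the confidence-based detectors described above, the following sketch computes a two-sample Kolmogorov-Smirnov statistic between two adjacent sliding windows of model confidence scores. It is a sketch under assumptions, not the detector used in the paper: the function names and the `window` parameter are hypothetical, and the raw KS statistic is used directly as the anomaly signal (a simplification; a detector could instead use the p-value of the test).

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    # The supremum of the gap between two step functions is attained at a sample point.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def anomaly_signal(confidences, window):
    """Compare the latest `window` confidences with the `window` scores before them."""
    if len(confidences) < 2 * window:
        return 0.0  # assumed convention: no signal until both windows are full
    recent = confidences[-window:]
    previous = confidences[-2 * window:-window]
    return ks_statistic(previous, recent)
```

A signal near 0 means the two windows look alike, while a signal near 1 means the confidence distribution has shifted almost completely between them.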
3 Algorithms
We present MLDemon along with two baselines. The first baseline, Periodic Querying (PQ), is a simple non-adaptive policy that periodically queries according to a predetermined cyclical schedule. The second baseline, Request-and-Reverify (RR) [yu2018request], is the state of the art for our problem. All of the policies run in constant space and amortized constant time, an important requirement for a scalable long-term monitoring system. Intuitively, for the deployment monitoring problem, adaptive policies can adjust the sampling rate based on the anomaly scores from the feature-based detector. PQ is non-adaptive, while RR adapts only to the anomaly score. MLDemon, in comparison, uses both anomaly information and some infrequent surveillance labeling to improve performance.
Periodic Querying
PQ works for both the estimation problem and the decision problem. As shown in Alg. 1, given a budget for the average number of queries per round, PQ periodically queries for a batch of labels in every rounds, and uses the estimate from the current batch of labels for the entire period. (Footnote 2: Another possible variant of this policy queries once every rounds and combines the previous labels. When is known or upper bounded, we may instead set the query rate to guarantee some worst-case monitoring risk.)
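The periodic schedule can be sketched as follows. This is an illustrative reading of the description above, not a transcription of Alg. 1: the parameter names `period` and `batch`, the placement of the batch at the start of each period, and the placeholder estimate of 0.5 before the first batch completes are all assumptions.

```python
def periodic_querying(outcomes, period, batch):
    """Query the first `batch` rounds of every `period` rounds.

    outcomes[t] is 1 if the model is correct at time t; the policy only
    looks at outcomes[t] on rounds where it queries.
    Returns the 0/1 query decisions and the running accuracy estimates.
    """
    queries, estimates = [], []
    estimate = 0.5  # assumed placeholder before any batch has completed
    current = []
    for t, outcome in enumerate(outcomes):
        query = (t % period) < batch
        if query:
            current.append(outcome)
            if len(current) == batch:
                # The finished batch sets the estimate until the next batch completes.
                estimate = sum(current) / batch
                current = []
        queries.append(int(query))
        estimates.append(estimate)
    return queries, estimates
```

Under these assumptions the amortized query rate is simply `batch / period`, which is how a label budget would translate into a schedule.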
Request-and-Reverify
RR sets a threshold for anomaly signal and queries for a batch of labels whenever the predetermined threshold is exceeded by the anomaly signal . As for the anomaly detector, RR applies a statistical test on a sliding window of model confidence scores. Threshold is directly compared to the value of the statistical test. By varying the threshold , RR can vary the number of labels queried in a deployment. While training data can be used to calibrate the threshold, in our theoretical analysis we show that for any , RR cannot provide a non-trivial worst-case guarantee for monitoring risk, regardless of the choice of anomaly detector.
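The triggering logic described above can be sketched in a few lines. This is a simplified illustration, not the implementation from [yu2018request]: the parameter names `threshold`, `batch`, and `initial_estimate` (for example, a validation accuracy) are assumptions, and the anomaly signals are taken as given.

```python
def request_and_reverify(outcomes, signals, threshold, batch, initial_estimate):
    """Query a batch of labels only when the anomaly signal exceeds the threshold."""
    queries, estimates = [], []
    estimate = initial_estimate
    remaining = 0  # labels still to be collected in the current batch
    current = []
    for outcome, signal in zip(outcomes, signals):
        if remaining == 0 and signal > threshold:
            remaining = batch
            current = []
        query = remaining > 0
        if query:
            current.append(outcome)
            remaining -= 1
            if remaining == 0:
                estimate = sum(current) / batch
        queries.append(int(query))
        estimates.append(estimate)
    return queries, estimates
```

If the signal never crosses the threshold, this policy never queries and the estimate stays at its initial value no matter how the outcomes change, which is the silent failure mode discussed above.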
3.1 MLDemon
MLDemon consists of three steps (Fig. 2). First, an anomaly score is computed for the data point at time . From our vantage point, the first step is computed by a black-box routine, as discussed in previous sections. Second, the quantile of is determined among the histogram of all previous scores . This quantile informs us how anomalous the th score is compared to what we have previously observed. Finally, the normalized anomaly score is mapped onto a labeling rate (more anomalous scores get more frequent labeling and vice versa). The upper and lower range of the labeling rates are determined by the label query budget and monitoring risk tolerance, respectively. We describe steps two and three in more detail below in Algs. 3 and 6.
Normalization and Scaling
See Alg. 3 for a code sketch of the quantile normalization step. The key decision is the range onto which we map our quantiles, denoted . A large (small) range means that the anomaly score has more (less) modulation on the labeling period. The quantile-normalized anomaly score is linearly scaled onto so that, for example, quantile 0 (low anomaly) maps to .
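The normalization and scaling steps can be illustrated with a small constant-space sketch. It is not a transcription of Alg. 3: the fixed-bin histogram, the assumption that anomaly scores lie in [0, 1], the default quantile of 0.5 when there is no history, and the choice that quantile 0 maps to the endpoint `lo` and quantile 1 to `hi` are all hypothetical.

```python
class QuantileNormalizer:
    """Tracks a fixed-size histogram of past anomaly scores (constant space)."""

    def __init__(self, bins=20):
        self.counts = [0] * bins
        self.total = 0

    def update(self, score):
        """Return the quantile of `score` among earlier scores, then record it."""
        score = min(max(score, 0.0), 1.0)  # assumed score range [0, 1]
        idx = min(int(score * len(self.counts)), len(self.counts) - 1)
        if self.total == 0:
            quantile = 0.5  # assumed default when there is no history yet
        else:
            # Fraction of earlier scores that fell into strictly lower bins.
            quantile = sum(self.counts[:idx]) / self.total
        self.counts[idx] += 1
        self.total += 1
        return quantile


def scale(quantile, lo, hi):
    """Linearly map a quantile in [0, 1] onto the range [lo, hi]."""
    return lo + quantile * (hi - lo)
```

Working with quantiles rather than raw scores makes the mapping insensitive to the units of any particular detector, since only the rank of the current score against its own history matters.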
Adaptive Querying with MLDemon
Since we cannot guarantee that the anomaly signal always correlates with changes in the model’s accuracy, we would also like to incorporate some robustness to counteract the possible failure event of an uninformative feature-based detector. In Alg. 6, we present a code sketch for MLDemon’s adaptive policy. We think of Alg. 6 as a routine that runs at each time step. One of the safeguards that we implement to achieve robustness is to establish a range of possible labeling periods, . (Footnote 3: The longest period we would go without getting an expert label is .) Query period range is set based on label budgets and risk tolerances, whereas quantile normalization range is set based on controlling the anomaly-based modulation. The lower bound is determined by the total budget that can be spent on expert labels. The upper bound controls the worst-case expected monitoring risk. For the decision problem, we can additionally adapt the query period based on the estimated margin to the target threshold using our estimate of . With a larger margin, we need a looser confidence interval to guarantee the same monitoring risk. This translates into fewer label queries. In Alg. 6, we sketch the high-level blueprint for MLDemon. To deploy MLDemon, the engineer specifies, along with a maximum query rate , a monitoring risk tolerance such that for any deployment. For decision problems, based on our statistical analysis, we can leverage the estimated threshold margin to safely increase while still preserving a risk tolerance of .
4 Experiments
In this section, we present an empirical study that benchmarks MLDemon, PQ, and RR on eight realistic data streams. For full details on the experiments, refer to the Appendix.
4.1 Experimental Protocol
Data Streams
We benchmark MLDemon, PQ, and RR on 8 data streams, which are summarized below and in Table 1. KEYSTROKE, 4CR and 5CVT were used in [yu2018request], so we include them as reference points.

SPAMCORPUS [katakis2006dynamic]: A nonstationary data set for detecting spam mail over time based on text. It represents a real, chronologically ordered email inbox from the early 2000s.

KEYSTROKE [killourhy2009comparing]: A nonstationary data set of biometric keystroke features representing four individuals typing over time. The goal is to identify which individual is typing as their keystroke biometrics drift over time.

WEATHERAUS (footnote 4: https://www.kaggle.com/jsphyg/weatherdatasetrattlepackage): A nonstationary data set for predicting rain in Australia based on other weather and geographic features. The data is gathered from a range of locations and times spanning years.

EMOR: A stream based on RAFDB [li2017reliable, li2019reliable] for emotion recognition in faces. The distribution drift mimics a change in demographics by increasing the elderly black population.

FACER [wang2020masked]: A data set that contains multiple images of hundreds of individuals, both masked and unmasked. The distribution drift mimics the onset of a pandemic by increasing the percentage of masked individuals.

IMAGEDRIFT [recht2019imagenet]: A new data set, called ImageNetV2, for the ImageNet benchmark [deng2009imagenet] was collected about a decade later. This stream mimics temporal drift in natural images on the web by increasing the fraction of V2 images over time.

4CR [souzaSDM:2015]: A nonstationary data set that was synthetically generated for the express purpose of benchmarking distribution drift detection. It features 4 Gaussian clusters rotating in Euclidean space.

5CVT [souzaSDM:2015]: A nonstationary data set that was synthetically generated for the express purpose of benchmarking distribution drift detection. It features 5 Gaussian clusters translating in Euclidean space.
Table 1: Data Stream Details
Data Stream  |  Length  |  # Classes  |  Model
SPAMCORPUS  |  7400  |  2  |  Logistic
KEYSTROKE  |  1500  |  4  |  Logistic
WEATHERAUS  |  7000  |  2  |  Logistic
EMOR  |  1000  |  7  |  Face++
FACER  |  1000  |  400  |  Residual CNN
IMAGEDRIFT  |  1000  |  1000  |  SqueezeNet
4CR  |  20000  |  4  |  Logistic
5CVT  |  6000  |  5  |  Logistic
Length is the length of the stream in our benchmark. Also reported are the number of classes in the classification task and the classifier used. Face++ is a commercial API based on deep learning.
Implementation Details
Each data stream is a time series of labeled data . As a proxy for the true , which is unknown for real data, we compute a moving average of the empirical with sliding window length . To produce a tradeoff frontier for each method, we sweep its hyperparameters. For PQ, we sweep the amortized query budget . For MLDemon we sweep the risk tolerance . For RR we also sweep the anomaly threshold for label request. In order to have the strongest baselines, we set the optimal hyperparameter values for PQ and RR (see the Appendix for details). For consistency, we set , , and for MLDemon in all the experiments.
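The moving-average proxy for the true accuracy can be sketched as below. This is a minimal illustration with assumptions: the name `window` stands in for the sliding window length, and a trailing window (current and preceding outcomes only) is assumed, since the text does not say whether the window is trailing or centered.

```python
from collections import deque


def moving_accuracy(correct, window):
    """Trailing moving average of 0/1 correctness outcomes."""
    buf = deque(maxlen=window)
    running = 0
    averages = []
    for outcome in correct:
        if len(buf) == window:
            running -= buf[0]  # the oldest outcome is about to fall out of the window
        buf.append(outcome)
        running += outcome
        averages.append(running / len(buf))
    return averages
```

Keeping a running sum makes each update constant time, which matters when the same proxy has to be recomputed for every policy and hyperparameter setting in a sweep.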
Anomaly Detector
Following [yu2018request, rabanser2018failing], for all streams except FACER, we base our anomaly signal on the model’s confidence. If the confidence score at time is given by , we use a KS-test [lilliefors1967kolmogorov] to determine a value between empirical samples and . We set . For logistic and neural models, we obtain model confidence in the usual way. When using the commercial Face++ API (footnote 5: https://www.faceplusplus.com/) for EMOR, we use the confidence scores provided by the service. For FACER, we use an embedding-based detector using the model’s face embeddings (see the Appendix for more details).
Model Training
For the logistic regression models, we train models on the first of the drift, then treat the rest as the deployment test. We obtain reasonable validation accuracy (at least ) for all of the models we trained. For EMOR we use the Face++ dataset collected in [chen2020frugalml]. For FACER, we use an open-source facial recognition package (footnote 6: https://github.com/ageitgey/face_recognition) that is powered by a pretrained residual CNN [he2015deep] that computes face embeddings. For IMAGEDRIFT we use the pretrained SqueezeNet [iandola2016squeezenet] from the PyTorch [paszke2019pytorch] model zoo.
4.2 Results
MAE  Hinge  Binary  
Data Stream  MLDemon  RR  PQ  MLDemon  RR  PQ  MLDemon  RR  PQ 
SPAMCORPUS  0.261  0.228  0.381  
KEYSTROKE  0.213  0.136  0.263  
WEATHERAUS  0.226  0.157  0.260  
EMOR  0.280  0.242  0.304  
FACER  0.209  0.229  0.267  
IMAGEDRIFT  0.156  0.139  0.323  
5CVT  0.198  0.072  0.212  
4CR  0.308  0.173  0.188 
Combined loss for varying label cost  

Policy  
MLDemon  (0.101, 0.028)  (0.093, 0.022)  (0.088, 0.018) 
RR  (0.127, 0.037)  (0.118, 0.030)  (0.110, 0.025) 
PQ  (0.171, 0.058)  (0.146, 0.045)  (0.130, 0.035) 
Holistically across eight data streams, MLDemon’s empirical tradeoff frontier between monitoring risk and query rate is superior to both RR and PQ for MAE and hinge risk (Fig. 3). The same holds true for binary risk, reported in the Appendix. For decision problems, the monitoring risk can vary significantly depending on the chosen threshold, so we include two additional thresholds in the Appendix for both binary and hinge loss. As for RR, it tends to outperform PQ; however, in some cases it can actually perform significantly worse. When the anomaly scores are very informative, RR can modestly outperform MLDemon in some parts of the tradeoff curve. This is expected, since MLDemon attains monitoring robustness at the cost of a minimal amount of surveillance querying that might prove unnecessary in the most fortuitous of circumstances. Empirically, in the limit of few labels, MLDemon averages about a reduction in MAE risk and a reduction in hinge risk at a given label amount compared to RR. We can also summarize a policy’s performance by its normalized AUC (Table 2). Because policies simultaneously minimize monitoring risk and amortized queries , a lower AUC is better. Additional AUC scores for varying thresholds are reported in the Appendix.
Consistent with the trends in Fig. 3, across the eight streams and three risk functions, MLDemon achieves the lowest AUC on 19 out of 24 benchmarks. Of the scores in Table 2, even when MLDemon does not score the lowest AUC, it only does worse than the lowest scoring policy by at most , whereas RR averages worse than the lowest scoring policy, and is at least worse than the lowest scoring policy seven times. This supports the conclusion that it is a risky policy to rely purely on potentially brittle anomaly detection instead of balancing surveillance queries with anomaly-driven queries. In our theoretical analysis we mathematically confirm this empirical trend. We find that, compared to RR and PQ, MLDemon consistently decreases the combined loss for varying labeling costs (Table 3), indicating that MLDemon could improve monitoring efficiency in a variety of domains that each have different relative costs between expert supervision and monitoring risk.
5 Theoretical Analysis
Our analysis can be summarized as follows. First, we show that PQ is worst-case rate optimal up to logarithmic factors over the class of Lipschitz drifts while RR is not even close. Second, we show that MLDemon matches PQ’s worst-case performance while achieving a significantly better average-case performance under a realistic probabilistic model. All the proofs are in the appendix.
Our asymptotic analysis in this section is concerned with an asymptotic rate in terms of small and amortized by a large . When using asymptotic notation, by loss we mean , for some constant . Recall that amortization is implicit in the definition of . We use tildes to denote the omission of logarithmic factors. For example, means , for some constants . We let be the combined loss when using anomaly detector and policy .
5.1 Minimax Analysis
Theorem 5.1.
Let be the set of Lipschitz drifts and let be the space of deployment monitoring policies. On both estimation problems with MAE risk and decision problems with hinge risk, for any model and anomaly detector , the following hold:
(i) MLDemon and PQ achieve worst-case expected loss
(ii) RR has a worst-case expected loss
(iii) No policy can achieve a better worst-case expected loss than MLDemon and PQ
The above result confirms that MLDemon is minimax rate optimal up to logarithmic factors. In contrast to the robustness of MLDemon, RR can fail catastrophically regardless of the choice of detector. For hard problem instances, the anomaly signal is always errant and the threshold margin is always small, in which case it is understandable why MLDemon cannot outperform PQ.
Lemma 5.2.
For both estimation and decision problems, MLDemon and PQ achieve a worst-case expected monitoring risk of with a query rate of and no policy can achieve a query rate of
Lem. 5.2 is used to prove Thm. 5.1, but we include it here because it is of independent interest to understand the tradeoff between monitoring risk and query costs and it also gives intuition for Thm. 5.1. The emergence of the rate also follows from Lem. 5.2 by considering the combined loss optimizing over to minimize subject to the constraints imposed by Lem. 5.2. Lem. 5.2 itself follows from an analysis that pairs a lower bound derived with Le Cam’s method [yu1997assouad] and an upper bound constructed with an extension to Hoeffding’s inequality [hoeffding1994probability] that enables us to wield it for samples from Lipschitz drifts. We turn our attention to analyzing more optimistic regimes next.
5.2 Average-Case Analysis
To perform an average-case analysis, we introduce a stochastic model to define a distribution over problem instances in . Our model assumes the following law for generating the sequence from an arbitrary initial condition :
(5) 
The accuracy drift is modeled as a simple random walk. As discussed in [szpankowski2011average], the maximum entropy principle (used by our model at each time step under the Lipschitz constraint) is often a reasonable stochastic model for average-case analysis.
We already know that MLDemon is robust in the worst case. For estimation problems, MLDemon outperforms PQ on average, although only by a constant factor. The gain in the estimation case is limited to a constant factor because we limit the minimum query rate in order to guarantee robustness against an errant detector. So even if the detector is perfectly informative, we would not completely stop surveillance queries, even during stable periods. On the other hand, we can obtain an actual rate improvement for the hinge case.
Theorem 5.3.
Let be the distribution over problem instances implied by the stochastic model. For the decision problem with hinge risk and model , and detector :
The reason we have a better asymptotic gain in the decision problem is illuminated below in Lem. 5.4.
Lemma 5.4.
For decision problems with hinge risk under model , MLDemon achieves an expected monitoring hinge risk with an amortized query amount .
MLDemon can save an average factor in query cost, which ultimately translates into the rate improvement in Thm. 5.3. MLDemon does this by leveraging the margin between estimate and threshold to increase the slack in the confidence interval around the estimate without increasing risk.
6 Discussion
Related Works
While our problem setting is novel, there are a variety of settings relating to ML deployment and distribution drift. One such line of work focuses on reweighting data to detect and counteract label drift [lipton2018detecting, garg2020unified]. Another related problem is when one wants to combine expert and model labels to maximize the accuracy of a joint classification system [cesa1997use, farias2006combining, morris1977combining, fern2003online]. The problem is similar in that some policy needs to decide which user queries are answered by an expert versus an AI, but the problem is different in that it is interesting even in the absence of online drift or high labeling costs. It would be interesting to augment our formulation with a reward for the policy when it can use an expert label to correct a user query that the model got wrong. This setting would combine the online monitoring aspects of our setting with ensemble learning [polikar2012ensemble, minku2009impact] under concept drift, with varying costs for using each classifier depending on whether it is an expert or an AI.
Adaptive sampling rates are a well-studied topic in the signal processing literature [dorf1962adaptive, mahmud1989high, peng2009adaptive, feizi2010locally]. The essential difference is that in signal processing, measurements tend to be exact whereas in our setting a measurement just reveals the outcome of a single Bernoulli trial. Another popular related direction is online learning or lifelong learning during concept drift. Despite a large and growing body of work in this direction, including [fontenla2013online, hoi2014libol, nallaperuma2019online, gomes2019machine, chen2019novel, nagabandi2018deep, hayes2020lifelong, chen2018lifelong, liu2017lifelong, hong2018lifelong], this problem is by no means solved. Our setting assumes that after some time , the model will eventually be retired for an updated one. It would be interesting to augment our formulation to allow a policy to update based on the queried labels. Model robustness is a related topic that focuses on designing such that accuracy does not fall during distribution drift [zhao2019learning, lecue2020robust, li2018principled, goodfellow2018making, zhang2019building, shafique2020robust, miller2020effect].
Conclusion and Future Directions
We pose and analyze a novel formulation for studying automated ML deployment monitoring policies. Understanding the tradeoff between expert attention and monitoring quality is of both research and practical interest. Our proposed policy comes with theoretical guarantees and performs favorably on empirical benchmarks. The potential impact of this work is that MLDemon could be used to improve the reliability and efficiency of ML systems in deployment. Since this is a relatively new research direction, there are several interesting directions of future work. We have implicitly assumed that experts respond to label requests approximately instantly. In the future, we can allow the policy to incorporate labeling delay. Also, we have assumed that an expert label is as good as a ground-truth label. We can relax this assumption to allow for noisy expert labels. We could also let the policy more actively evaluate the apparent informativeness of the anomaly signal over time or even input an ensemble of different anomaly signals and learn which are most relevant at a given time. While MLDemon is robust even if the feature-based anomaly detector is not good, it is more powerful with an informative detector. Improving the robustness of anomaly detectors for specific applications and domains is a promising area of research.
References
Appendix
We begin with supplementary figures in A. In B, we include additional details regarding our experiments. In C, we include additional mathematical details and all proofs.
Appendix A Supplementary Figures
A.1 Additional Thresholds for Decision Experiments
We report additional tradeoff frontiers. Fig. 4 reports a higher () threshold and Fig. 5 reports a lower threshold (). Both report both hinge and binary risk. Fig. 6 reports a medium threshold () for binary risk. We include normalized AUC values in Table 4 for higher and lower thresholds as well.
Data Stream  MLD  PQ  RR  MLD  PQ  RR  MLD  PQ  RR  MLD  PQ  RR 

SPAMCORPUS  
KEYSTROKE  
WEATHERAUS  
EMOR  
FACER  
IMAGEDRIFT  
5CVT  
4CR 
A.2 Anomaly Detector Ablation
It is also important to understand the sensitivity of each policy to the particular choice of detector. Since MLDemon is designed from first principles to be robust, we should expect it to work sufficiently well for any choice of detector. On the other hand, RR’s performance is highly variable depending on the choice of detector.
Recall that we use a KS-test to detect drift in our main results. Another reasonable choice of anomaly score is based on a test (essentially this boils down to computing and comparing the empirical means over two consecutive sliding windows) [rabanser2018failing]. As with our KS-test, we select a window size of .
It is also reasonable to explore a baseline set by a detector that is driven entirely by random noise.
For this ablation study, we opt to include the six nonsynthetic data streams. However, recall that the detector for FACER was based on the face embeddings generated by the model, and we already used a moving average based detector in the main experiments. Thus, we include 4CR instead of FACER for the
test  KStest  Noise  

Data Stream  MLD  RR  PQ  MLD  RR  PQ  MLD  RR  PQ 
SPAMCORPUS  
KEYSTROKE  
WEATHERAUS  
EMOR  
FACER  
IMAGEDRIFT  
4CR 
Appendix B Experimental Details
B.1 Data Stream Details
We describe each of the eight data streams in greater detail. All data sets are public and may be found in the references. All lengths for each data stream were determined by ensuring that the stream was long enough to capture interesting drift dynamics.
B.1.1 Data Stream Construction

SPAMCORPUS: We take the first points from the data in the order that it comes in the data file.

KEYSTROKE: We take the first points from the data in the order that it comes in the data file.

WEATHERAUS: We take the first points from the data in the order that it comes in the data file.

EMOR: We randomly sample from the entire population (i.e. every data point in the data set) for the first points in the stream, after which we randomly sample from the elderly black population.

FACER: We randomly subsample 400 individuals out of the data set that have at least 3 unmasked images and 1 masked image to create a reference set. For these 400 individuals, we randomly sample from the set of unmasked images for the first points, after which we randomly sample from the set of masked images for the same 400 individuals.

IMAGEDRIFT: We sampled random images from the standard ImageNet validation set for the first , and we sampled a random set out of the MatchedFrequency set within ImageNetV2 for the second .

4CR: We take the first points from the data in the order that it comes in the data file.

5CVT: We take the first points from the data in the order that it comes in the data file.
b.1.2 Bootstrapping
In order to get iterates for each data set, we generate the stream by bootstrap as follows. We split the data sequence into blocks of length and permute the data uniformly at random within each block. Because the permutation is confined to each block, this bootstrapping does not materially alter the structure of the drift.
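The block-wise permutation described above can be sketched as follows. This is a minimal illustration, not the authors' code; `block_len` and the seeded `rng` are hypothetical parameters, and the final block is allowed to be shorter than the rest.

```python
import random


def block_bootstrap(stream, block_len, rng):
    """Return a copy of `stream` with points shuffled inside each block.

    Points never move across block boundaries, so the coarse temporal
    ordering of the stream (and hence the drift) is preserved.
    """
    if block_len < 1:
        raise ValueError("block_len must be positive")
    out = []
    for start in range(0, len(stream), block_len):
        block = list(stream[start:start + block_len])
        rng.shuffle(block)  # uniform random permutation within the block
        out.extend(block)
    return out


if __name__ == "__main__":
    # One bootstrap iterate per seed.
    for seed in range(3):
        print(block_bootstrap(list(range(12)), 4, random.Random(seed)))
```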
b.2 Models
b.2.1 Logistic Regression
For the logistic regression model, we used the default solver provided in the scikit-learn library [pedregosa2011scikit].
b.2.2 Facial Recognition
For the facial recognition system, we used the open-source model referenced in the main text. The model computes face embeddings given images. The embeddings are then used to compute a similarity score as described in the API. For any query image belonging to one of the 400 individuals, the model looks for the best match among the 400 individuals by comparing the query image to each of the 3 reference images for each individual and taking an average similarity score. The individual with the highest average similarity score is returned as the predicted match.
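The matching rule above can be sketched as follows. The cosine similarity used here is only a stand-in (an assumption), since the actual similarity score is the one defined by the model's API; `predict_identity` and the data layout are likewise hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Stand-in similarity score between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def predict_identity(query, references, similarity=cosine_similarity):
    """Return (identity, score) with the highest average similarity.

    `references` maps each identity to its list of reference embeddings
    (3 per individual in the setup described above).
    """
    best_id, best_score = None, float("-inf")
    for identity, embeddings in references.items():
        avg = sum(similarity(query, e) for e in embeddings) / len(embeddings)
        if avg > best_score:
            best_id, best_score = identity, avg
    return best_id, best_score


if __name__ == "__main__":
    refs = {
        "a": [(1.0, 0.0), (0.9, 0.1), (1.0, 0.1)],
        "b": [(0.0, 1.0), (0.1, 0.9), (0.1, 1.0)],
    }
    print(predict_identity((0.95, 0.05), refs))
```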
b.2.3 ImageNet
For the ImageNet model, we used SqueezeNet as referenced in the main text.
b.2.4 Emotion Recognition
For the emotion recognition system, we used the Face++ API as referenced in the main text.
b.3 Anomaly Detectors
We describe further details for our anomaly detection protocol. As mentioned in the main text, the detectors compare a window of the recent features to an adjacent window of the next most recent, for a total width of the recent features in the stream. These anomaly signal window lengths are determined heuristically by trying a range of widths across various data streams.
b.3.1 Confidence-Based Detector
For most of our benchmarks, we follow the confidence-based detector proposed in [yu2018request]. To reiterate the main text, we run a KS-test on the model’s confidence scores over time. The p-value from the KS-test serves as an anomaly metric for the policy.
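A minimal, dependency-free sketch of such a detector is given below. It assumes the anomaly metric is a two-sample KS p-value computed on two adjacent windows of confidence scores; the asymptotic p-value approximation, the function names, and the window size `n` are illustrative assumptions rather than the implementation used in the experiments.

```python
import math


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic p-value from the Kolmogorov series (an approximation)."""
    ne = n * m / (n + m)
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    if lam < 0.2:  # the series converges slowly here; the value is ~1
        return 1.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
            for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * s))


def confidence_drift_pvalue(confidences, n):
    """Compare the n most recent confidences with the n before them."""
    if len(confidences) < 2 * n:
        raise ValueError("need at least 2 * n confidence scores")
    recent, previous = confidences[-n:], confidences[-2 * n:-n]
    return ks_pvalue(ks_statistic(recent, previous), n, n)
```

A small p-value indicates that the two windows of confidences are unlikely to share a distribution, which is what the policy treats as evidence of drift.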
b.3.2 Embedding-Based Detector
For the FACER task, the model computes a face embedding. This embedding vector already summarizes the face. By comparing recent and less recent samples of face embeddings from the data stream, we can interpret the Euclidean distance between the empirical means as a metric for anomaly. More formally, we require the use of a multidimensional location test. In principle, any such test should work roughly the same. Because the sample sizes are always the same, the main quantity influencing the p-value of the test is just the distance between the empirical means. We use the test in [srivastava2013two].
b.4 Policies
b.4.1 Periodic Querying
The only hyperparameter we sweep for PQ is the query budget. The other hyperparameter, , is the window length and batch size for the querying. For our benchmarks, we fix , which performs well empirically.
b.4.2 Request-and-Reverify
The only hyperparameters for RR are , the window length and batch size for the querying, and , the query threshold for the anomaly signal. We sweep both. For our benchmarks, we sweep . For , the effective range depends on the data stream, but it is always swept exhaustively from the extreme of to the extreme of for all .
b.4.3 MLDemon
The hyperparameters for MLDemon are all specified in the main text, except for the interval . We set and .
b.5 Computing Infrastructure and Resource Usage
Experiments are all run on high-end consumer-grade CPUs. Depending on the benchmark, a single simulation (i.e., one random seed) could take anywhere from minutes to several hours. A single simulation used at most 4 GB of memory, and for most benchmarks much less. For all data streams except FACER, this includes all preprocessing time, such as training the model. For FACER, we precomputed the face embeddings for all images in the dataset, a process that took less than 64 GB of memory and less than 24 hours with 32 cores working in parallel. In total, including repetitions and preliminary experiments, we estimate the entire project took approximately 10,000 CPU-hours.
Appendix C Mathematical Details & Proofs
We begin by reviewing definitions and notation.
c.1 Definitions & Notation
Definition C.1.
Accuracy at time
We use the convention that is known from the training data. All policies can make use of this as their initial estimate:
(6) 
c.1.1 Instantaneous & Amortized Monitoring Risk
We first define the instantaneous monitoring risk, which we distinguish here from the amortized monitoring risk (defined in the main text). Instantaneous monitoring risk is the risk for a particular data point whereas is the amortized risk over the entire deployment.
Definition C.2.
(Instantaneous) Monitoring Risk
We define the monitoring risks in MAE and hinge settings for a single data point in the stream below.
(7) 
(8) 
As mentioned in the main text, we omit the subscript when the loss function is clear from context or is not relevant. In the context of our online problem, at time , we may generally infer that , and is fixed over time. In this case, we might use the shorthand , as below:
(9) 
where is the usual amortized monitoring risk term defined in Section 2.
c.1.2 Policies
We use the following abbreviations to formally denote the PQ, RR, and MLDemon policies: , , and , respectively.
Definition C.3.
Formal RequestandReverify Policy
We let denote the RR policy as defined in Section 3 of the main text. We write to emphasize the dependence on a particular threshold hyperparameter . Recall that is set at time and fixed throughout deployment. When is omitted, the dependence is to be inferred. Policy is the same for both decision and estimation problems.
Definition C.4.
Formal Periodic Querying Policy
We let denote the PQ policy as defined in Section 3 of the main text. Recall that can be parameterized by a particular query rate budget as defined in the main text. Alternatively, we can parameterize by a worst-case risk tolerance such that in any problem instance. Using the theory we will presently develop, we can convert a risk tolerance into a constant average query rate given by as computed in Alg. 5. The guaranteed risk tolerance implicitly depends on the Lipschitz constant , which we can assume to be known or upper-bounded for the purposes of our mathematical analysis. We write for policy instances parameterized by budget and for policy instances parameterized by risk tolerance . When the particular choice of parameterization can be inferred or is otherwise irrelevant, we omit it. By convention, if is not updated at a given time , then .
Although the algorithm looks different from the one presented in the main text, it is actually the same; it just keeps track of the query period and the waiting period more formally, using counters.
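To make the counter-based bookkeeping concrete, here is a minimal sketch of a periodic querying loop. It is not Alg. 5 itself: the names `batch_size`, `wait_len`, and `initial_estimate`, and the choice to begin deployment with a query batch rather than a waiting period, are assumptions made for illustration.

```python
def periodic_querying(outcomes, batch_size, wait_len, initial_estimate):
    """Simulate PQ on a stream of 0/1 outcomes (1 = model was correct).

    An outcome is only observed on rounds where the policy queries. The
    point estimate changes only when a batch completes and is otherwise
    carried over unchanged from the previous round.
    """
    estimate = initial_estimate  # e.g. accuracy known from training data
    query_left, wait_left, batch_sum = batch_size, 0, 0
    estimates, queried = [], []
    for y in outcomes:
        if wait_left > 0:
            wait_left -= 1  # waiting period: no label is requested
            queried.append(False)
        else:
            batch_sum += y  # query period: request and record the label
            query_left -= 1
            queried.append(True)
            if query_left == 0:
                estimate = batch_sum / batch_size
                batch_sum, query_left, wait_left = 0, batch_size, wait_len
        estimates.append(estimate)
    return estimates, queried


if __name__ == "__main__":
    est, q = periodic_querying([1, 1, 1, 1, 0, 0, 0, 0], 2, 2, 0.5)
    print(est, q)
```

Under these assumptions the long-run query rate is `batch_size / (batch_size + wait_len)`, independent of the data, which reflects the open-loop nature of the policy.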
Definition C.5.
Formal MLDemon Policy
We let denote the MLDemon policy. We give formal versions of the routines comprising . As for PQ and RR, we let specify an exact risk tolerance for the policy. is nearly identical to the sketch given in the main text, but there are a few minor clarifications and distinctions that need to be made for formal reasoning. When necessary, we will write and to distinguish the estimation and decision variants of the MLDemon policies. By convention, if a state variable, such as, is not updated at a given time , then . Also, if a state variable gets updated more than once in any given round, the final value of the state variable persists to the next round at time .
The differences between the formal variant and the sketch given in the main text are minor and are as follows. First, the lower bound on query period is forced to be within a constant factor of . This is a technical point in that in order to achieve the minimax optimality results, we need to control how often the policy is allowed to query. Letting the policy set an arbitrary budget makes sense in practice, but for theoretical analysis we will assume the policy is trying to be as frugal as possible with label queries while respecting the specified risk tolerance.
Another minor point is the technical safety condition that is needed for decision problems. In practice, this condition triggers only exceptionally infrequently, but in order to complete the technical parts of the proofs, it is expedient to keep the condition. We also point out that our theoretical analysis specifies particular asymptotics for window length , reflected in the constants in Alg. 6.
Finally, we point out the unbiased sample flag. When this flag is set, formally speaking, one requires an additional assumption, namely, that the average expected increment to is mean zero. Strictly speaking, does not require this assumption and the formal guarantees hold without it. However, we found stronger empirical performance with the flag turned on, and thus recommend it for applications.
Quantile Scaling
We proceed to more formally define the quantile_norm subroutine. The quantile normalization step itself is standard and we refer the reader to [amaratunga2001analysis]. In short, we compute the empirical inverse cumulative distribution function, denoted here by , and use it to map data back to a normalized value in . Once we have normalized, we scale our value onto :
(10) 
Thus, a formal definition of Alg. 3 (quantile_norm) is given by the scaling defined above in Eqn. 10. In order for to be well-posed, we require the following condition on :
(11) 
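As a rough sketch of the idea behind quantile_norm, the function below maps a value to its empirical quantile within a history of past values and then rescales it affinely onto a target interval. The target interval `[lo, hi]`, the requirement `lo < hi`, and the signature are hypothetical; they stand in for the scaling of Eqn. 10 and do not reproduce it.

```python
from bisect import bisect_right


def quantile_norm(value, history, lo=0.0, hi=1.0):
    """Empirical-quantile normalization followed by an affine rescaling.

    `history` is a non-empty collection of past values; the returned
    number lies in [lo, hi].
    """
    if not history:
        raise ValueError("history must be non-empty")
    if not lo < hi:
        raise ValueError("need lo < hi")
    ranked = sorted(history)
    # Fraction of past values that are <= value, i.e. a number in [0, 1].
    q = bisect_right(ranked, value) / len(ranked)
    return lo + (hi - lo) * q


if __name__ == "__main__":
    print(quantile_norm(2.5, [1, 2, 3, 4]))
    print(quantile_norm(2.5, [1, 2, 3, 4], lo=2.0, hi=4.0))
```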
c.2 Preliminary Results
c.2.1 Bounding Accuracy Drift in Absolute Value Based on Total Variation
We begin by proving the claim from the introduction regarding the equivalence of a Lipschitz bound in terms of accuracy drift and total variation between the distribution drift. Although Prop. C.1 may not be immediately obvious to a reader unfamiliar with total variation distance on probability measures, the result is in fact trivial. To clarify notation, for probability measure and event we let .
Proposition C.1.
Let and be two supervised learning tasks (formally, distributions over ). Let be a model. If then where and .
Proof.
Let be the sample space for distributions and . TVdistance has many equivalent definitions. One of them is given in Eqn. 12 below [villani2008optimal]:
(12) 
(13) 
∎
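For concreteness, the argument can be written out in one chain. The symbols below ($P$, $Q$ for the two tasks, $\Omega$ for the sample space, $f$ for the model, $\Delta$ for the total variation bound, and $\mathrm{acc}$ for accuracy) are assumed notation chosen for this sketch.

```latex
% Assumed notation: P, Q are the two task distributions over \Omega,
% f is the model, and \Delta bounds the total variation distance.
\begin{align*}
A &:= \{(x, y) \in \Omega : f(x) = y\}, \\
\left|\mathrm{acc}_P(f) - \mathrm{acc}_Q(f)\right|
  &= \left|P(A) - Q(A)\right| \\
  &\le \sup_{E \subseteq \Omega} \left|P(E) - Q(E)\right| \\
  &= \mathrm{TV}(P, Q) \;\le\; \Delta .
\end{align*}
```

The only step that uses anything about the model is the first: accuracy is the probability of the single event $A$ on which the model is correct, and total variation bounds the gap in probability for every event.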
c.2.2 Adapting Hoeffding’s Inequality for Lipschitz Sequences
One of the key ingredients in many of the proofs is the following modification to Hoeffding’s inequality that enables us to use it to construct a confidence interval for the empirical mean of observed outcomes even though is drifting (rather than i.i.d. as is usual).
Lemma C.2.
Hoeffding’s Inequality for Bernoulli Samples with Bounded Bias
Assume we have a random sample of Bernoulli trials such that each trial is biased by some small amount: . Let denote a sample mean. Let .
Then:
Proof.
We can invoke the classical version of Hoeffding’s [hoeffding1994probability]:
(14) 
Notice that . Plugging in below yields:
(15) 
Also, notice that due to the reverse triangle inequality [sutherland2009introduction]:
(16) 
Implying:
(17) 
Moving the to the other side of the inequality finishes the result:
(18) 
∎
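A quick Monte Carlo check can make the lemma tangible. The sketch below assumes the bound takes the standard form $\Pr(|\bar{X} - p| \ge \delta + \varepsilon) \le 2\exp(-2n\delta^2)$, where $\varepsilon$ is the average absolute bias; this form, the symbol names, and the chosen parameters are assumptions for illustration.

```python
import math
import random


def biased_hoeffding_bound(n, delta):
    """Assumed tail bound 2 * exp(-2 * n * delta^2)."""
    return 2.0 * math.exp(-2.0 * n * delta ** 2)


def violation_rate(p, biases, delta, trials, rng):
    """Fraction of trials with |sample mean - p| >= delta + mean |bias|.

    Trial i of each sample is Bernoulli with success probability
    p + biases[i], so the sample is independent but not identically
    distributed.
    """
    n = len(biases)
    eps = sum(abs(b) for b in biases) / n
    hits = 0
    for _ in range(trials):
        xbar = sum(rng.random() < p + b for b in biases) / n
        hits += abs(xbar - p) >= delta + eps
    return hits / trials


if __name__ == "__main__":
    biases = [0.002 * i for i in range(50)]  # slowly growing bias
    rate = violation_rate(0.6, biases, 0.15, 2000, random.Random(0))
    print(rate, biased_hoeffding_bound(len(biases), 0.15))
```

As with the classical inequality, the bound is typically loose; the simulation only confirms that the observed violation rate stays below it.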
All of the work has already been done above in Lem. C.2, but to make it clearer how it applies to a Lipschitz sequence of distributions, we also state Lem. C.3.
Lemma C.3.
Hoeffding’s Inequality for Lipschitz Sequences
Assume drift is Lipschitz. Let be any subset of the set of rounds for which policy has queried:
(19) 
Let
(20) 
denote a sample mean of observed outcomes for rounds in .
Let
(21) 
be defined analogously to Lemma C.2.
Then:
(22) 
Proof.
The result follows by setting and applying Lemma C.2. ∎
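The way the lemma is used later can be sketched numerically. Assuming accuracy moves by at most `lipschitz` per round, an outcome queried at round `i` is biased by at most `lipschitz * (t - i)` relative to the accuracy at round `t`; averaging these gives the bias term, and inverting the assumed tail bound $2\exp(-2n\delta^2)$ gives the sampling term. All names and the exact form of the bound are assumptions for illustration.

```python
import math


def lipschitz_confidence_radius(query_rounds, t, lipschitz, fail_prob):
    """Radius r such that the mean of queried outcomes is within r of the
    accuracy at round t, except with probability at most fail_prob.

    `query_rounds` lists the (past) rounds whose labels were queried.
    """
    if not query_rounds or any(i > t for i in query_rounds):
        raise ValueError("need a non-empty set of rounds no later than t")
    if not 0.0 < fail_prob < 1.0:
        raise ValueError("fail_prob must lie strictly between 0 and 1")
    n = len(query_rounds)
    # Average worst-case drift between each queried round and round t.
    eps = sum(lipschitz * (t - i) for i in query_rounds) / n
    # Solve 2 * exp(-2 * n * delta^2) = fail_prob for delta.
    delta = math.sqrt(math.log(2.0 / fail_prob) / (2.0 * n))
    return delta + eps


if __name__ == "__main__":
    print(lipschitz_confidence_radius(range(90, 100), 100, 0.001, 0.05))
```

The two terms pull in opposite directions: older queries enlarge the sample and shrink the sampling term, but they also increase the bias term, which is the trade-off the window length has to balance.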
c.3 Proof of Lemma 5.2
We study the average query rate required to guarantee a worst-case monitoring risk, taken over all Lipschitz drifts. Note that in the worst case, the anomaly signal is uninformative and thus adaptivity with respect to the detection signal will not be helpful. The following results hold both for MAE loss and hinge loss. Lemma 5.2 has two parts. The first statement is about PQ whereas the second is for MLDemon.
c.3.1 Lemma 5.2 for Periodic Querying
Lemma C.4.
Let be the class of Lipschitz drifts. Assume .
For both estimation and decision problems (using MAE and hinge loss), PQ achieves a worst-case expected monitoring risk of with a query rate of :
Proof.
Because for all , it will be sufficient to prove the result for the estimation case (doing so directly implies the decision case).
It should be clear from Alg. 5 that the amortized query complexity is just , which indeed satisfies the query rate condition. This holds for any choice of (PQ is an open-loop policy).
(23) 
It remains to verify that the choice of and results in a worst-case expected monitoring risk of .
Consider Lemma C.3. Based on Lemma C.3, imagine we are applying Eqn. 22 at each point in time with quantity being used as our point estimate . If, for all time:
(24) 
then we can be assured that PQ maintains an expected monitoring risk of because the probability of an error greater than is at most and such an error event is bounded by in the worst case.
Fixing , it is easy to verify that .
Recall that is fixed:
(25) 
Plugging in the above produces:
(26) 
It remains to verify the first inequality, , at all . This is slightly more involved, as is not constant over time. What is the worst-case that is possible in Eqn. 22 when using PQ? The PQ policy specifies a query batch of size , and maintains the empirical accuracy from this batch as the point estimate for the next rounds in the stream. Thus, is largest precisely when the policy is one query away from completing a batch. At this point in time, because the policy has not yet updated , as it only does so at the end of the batch once all queries have been made. Thus, we are using an estimate that is the empirical mean of label queries such that this batch was started iterations ago and was completed iterations ago.
The maximal value possible for is hence:
(27) 
It is easy to upper bound this sum as follows:
(28) 
Recall :
(29) 
Plugging in into , along with the upper bound for yields:
(30) 