1 Introduction
The fast advancement of machine learning technologies has triggered tremendous interest in their adoption in a large range of applications, both in science and business. Developing robust machine learning (ML) solutions to real-world, mission-critical applications, however, is challenging. It is a continuous procedure that requires many iterations of training, tuning, validating, and testing various machine learning models before a good one can be found, which is tedious, time-consuming, and error-prone. This dilemma has inspired much recent work on simplifying ML application development, covering aspects such as distributed computation [14, 16], resource management [15, 27], and AutoML [2, 11].
During the last couple of years, we have been working together with many developers, most of whom are not computer science experts, in building a range of scientific and business applications [4, 8, 24, 25, 28, 32] and, in the meantime, have observed the challenges that they face. On the positive side, recent research on efficient model training and AutoML has significantly improved their productivity. However, as training machine learning models becomes less of an issue, new challenges arise: in the iterative development process of an ML model, users are left with a powerful tool but not enough principled guidelines regarding many design decisions that cannot be automated by AutoML.
(Motivating Example) In this paper, we focus on two of the questions most commonly asked by our users (as an anecdotal note, the single most frequently asked question is actually “how large does my training set need to be?”):
(Q1) How large does my validation set need to be?
(Q2) How large does my test set need to be?
Both questions essentially ask about the generalization property of these two datasets: they need to be large enough to form a representative sample of the (unknown) underlying true distribution. However, giving meaningful answers to these two questions (rather than answers like “as large as possible” or “maybe a million”) is not an easy task. The reasons are threefold. First, the answers depend on the error tolerance of the ML application: a mission-critical application definitely needs a larger validation/test set. Second, the answers depend on the history of how these data were used: when a dataset (especially the validation set) is used multiple times, it loses its statistical power, and as a result its required size depends on the set of all historical operations ever conducted. Third, the answers need to be practically feasible/affordable: simply applying standard concentration inequalities can lead to answers that require millions of labels (or more), which may be intractable.
A Data Management System for Generalization
In this paper, we present ease.ml/meter, a system that takes the first step in (not completely) tackling the above challenges. Specifically, ease.ml/meter is a data management system designed to manage the statistical power of the validation and test data sets. Figure 1 illustrates the functionality of ease.ml/meter, which interacts with its user in the following way:

The user inputs a set of parameters specifying the error tolerance of an ML application (defined and scoped later).

The system returns the required sizes of the validation and test sets that can satisfy the error tolerance.

The user provides a validation set and a test set.

The user queries the validation and test sets. The system returns an answer that satisfies the user-defined error tolerance specified in Step 1. The system monitors and constrains the usage of the given validation set and test set.

When the system discovers that the validation set or test set has lost its statistical power, it asks the user for another validation or test set of the corresponding required size.
(Scope) In this paper, we focus on a very specific (yet typical) special case of the above generic interaction framework, in which the user has full access to the validation set. She starts from the current machine learning model, conducts error analysis by looking at the errors that this model makes on the validation set, and proposes a new machine learning model (e.g., by adding new feature extractors, trying a different ML model, or adding more training examples). However, as the validation set is open to the user, after many development iterations the user’s decisions might start to overfit to this specific validation set. Meanwhile, although the test set is hidden from the user, the signals returned by the system inevitably carry some information about the test set as well. As a result, the user’s decisions might also overfit to this particular test set, which undermines its plausibility as a proxy for the underlying true data distribution. The goal of our system is to (1) inform the user whenever she needs to use a new validation set for error analysis, and (2) inform the user whenever she needs to use a new test set to measure the validation set’s overfitting behavior.
(System Overview) In the above workload, the “error tolerance” is the generalization power of the current validation set. Specifically, let $D_{test}$ be the test set and $D_{val}$ be the validation set. Let $H$ be a machine learning model and let $l(H, \cdot)$ return the loss of $H$ on $D_{val}$, $D_{test}$, and $\mathcal{D}_{test}$, respectively, where $l(H, \mathcal{D}_{test})$ represents the expected loss over the true distribution $\mathcal{D}_{test}$, which is unknown. We assume that the user hopes to be alerted whenever
$$|l(H, D_{val}) - l(H, \mathcal{D}_{test})| > \epsilon.$$
One critical design decision in ease.ml/meter is to decompose the LHS of the above inequality into two terms:

Empirical Overfitting: $|l(H, D_{val}) - l(H, D_{test})|$;

Distributional Overfitting: $|l(H, D_{test}) - l(H, \mathcal{D}_{test})|$.
The rationale behind this decomposition is that it separates the roles of the validation set and the test set when overfitting occurs. When the empirical overfitting term is large, the validation set “diverges from” the test set, and therefore the system should ask for a new validation set; when the distributional overfitting term is large, the test set “diverges from” the true distribution, and therefore the system should ask for a new test set. Moreover, empirical overfitting can be directly computed, whereas distributional overfitting has to be estimated using nontrivial techniques, as we will see.
ease.ml/meter provides a “meter” to communicate the current level of empirical overfitting and distributional overfitting to the user. For each model the user develops, the system returns one out of five possible signals as illustrated in Figure 1: the solid color bar represents the range of the empirical overfitting, and the gray color bar represents the upper bound of distributional overfitting. The user decides whether to replace the validation set according to the signals returned by the system, and the system asks for a new test set whenever it cannot guarantee that the distributional overfitting is smaller than a user-defined upper bound.
Technical Challenges and Contributions
The key technical challenge in building ease.ml/meter is to estimate distributional overfitting, which is nontrivial because subsequent versions of the ML application can depend on prior versions. As demonstrated by recent work [6], this kind of adaptivity can accelerate the degradation of the test set, quickly eroding its statistical power. To accommodate dependency between successive versions of the ML application, one may have to use a larger test set (compared with the case where all versions are independent of each other) to provide the same guarantee on distributional overfitting.
C1. The first technical contribution of this work is to adapt techniques developed recently by the theory and ML community on adaptive statistical querying [5, 6] to our scenario. Although the underlying method (i.e., bounding the description length) is the same, it is the first time that such techniques are applied to enabling a scenario that is similar to what ease.ml/meter tries to support. Compared with the naive approach that draws a new test set for each new version of the application, the amount of test samples required by ease.ml/meter can be an order of magnitude smaller, which makes it more practical.
C2. The second technical contribution of this work is a set of simple but novel optimizations that further reduce the number of test samples required. These optimizations go beyond traditional adaptive analytics techniques [5, 6] by taking into consideration the specific operation modes that ease.ml/meter provides to its user, namely (1) non-uniform error tolerance, (2) multi-tenant isolation, (3) non-adversarial developers, and (4) “time travel.” Each of these techniques focuses on one specific application scenario of ease.ml/meter, and can further reduce the (expected) size of the test set significantly.
Relationship with Previous Work
The most similar work to ease.ml/meter is our recent paper ease.ml/ci [22]. ci is a “continuous integration” system designed for the ML development process: given a new model provided by the user, ci checks whether a certain statistical property holds (e.g., the new model is better than the old model). At a very high level, meter shares a similar goal with ci; from the technical perspective, however, it is significantly more difficult to build, for two reasons. First, ease.ml/meter cannot use the properties of the continuous integration process (e.g., the new model will not change too much) to optimize for the sample complexity. As we will see, to achieve practical sample complexity, meter relies on a completely different set of optimizations. Second, ease.ml/meter needs to support multiple signals, instead of a binary pass/fail signal as in ci. As a result, we see ci and meter as falling under the same “conceptual umbrella” of data management for statistical generalization, but focusing on different scenarios and thus requiring different technical optimizations and system designs.
Limitations
We believe that ease.ml/meter is an “innovative system” that provides functionalities that we have not seen in current ML ecosystems. Although ease.ml/meter works reasonably well for our target workload in this paper, it has several limitations/assumptions that require future investment. First, ease.ml/meter requires new test data points as well as their labels from developers. Although recent work on automated labeling (e.g., Snorkel [20]) alleviates the dependency on human labor for generating labeled training data points, it does not address our problem as we require labels for test data rather than training data. Second, the question of what actions should be taken upon receiving overfitting signals is left to developers. A specific reaction strategy can lead to a more specific type of adaptive analysis that may have further impact on reducing the size of the test set. Moreover, while this work focuses on monitoring overfitting, there are other aspects regarding quality control in ML application development. For instance, one may wish to ensure that there is no performance regression, i.e., the next version of the ML application must improve over the current version [22]. We believe that quality control and lifecycle management in ML application development is a promising area that deserves further research.
2 Preliminaries
The core of ease.ml/meter is based on the theory of answering adaptive statistical queries [6]. In this section, we start by introducing the traditional data model in supervised machine learning, and the existing theory of supporting adaptive statistical queries which will serve as the baseline that we will compare with in ease.ml/meter. As we will see in Section 5, with a collection of simple, but novel optimization techniques, we are able to significantly bring down the requirement of the number of human labels, sometimes by several orders of magnitude.
2.1 Training, Validation, and Testing
In the current version of ease.ml/meter, we focus on the supervised machine learning setting, which consists of three data distributions: (1) the training distribution $\mathcal{D}_{train}$, (2) the validation distribution $\mathcal{D}_{val}$, and (3) the test distribution $\mathcal{D}_{test}$, each of which defines a probability distribution over $\mathcal{X} \times \mathcal{Y}$, where $x \in \mathcal{X}$ is the feature vector of dimension $d$ and $y \in \mathcal{Y}$ is the label.
(Application Scenarios) In the traditional supervised learning setting, one often assumes that all three distributions are the same, i.e., $\mathcal{D}_{train} = \mathcal{D}_{val} = \mathcal{D}_{test}$. In ease.ml/meter, we intentionally distinguish between these three distributions to incorporate two emerging scenarios that we see from our users. First, in weakly supervised learning paradigms such as data programming [21] or distant supervision [17], the training distribution is a noisy version of the real distribution. As a result, $\mathcal{D}_{train} \neq \mathcal{D}_{test}$. The second example is motivated by an anomaly detection application we built together with a telecommunication company, in which the validation and training distributions are injected with anomalies and the real distribution is an empirical distribution composed of real anomalies collected over history. As a result, $\mathcal{D}_{train} = \mathcal{D}_{val} \neq \mathcal{D}_{test}$. The functionality provided by ease.ml/meter does not rely on the assumption that these three distributions are the same.
(Sampling from Distribution) When building ML applications, in many, if not all, cases the user does not have access to the data distributions. Instead, the user only has access to a finite set of samples from each distribution: $D_{train}$, $D_{val}$, and $D_{test}$. As noted in the introduction, one common question from our users is: How large does the training/validation/test set need to be? The goal of ease.ml/meter is to provide one way of deciding the test set size $|D_{test}|$, as well as when to draw a new test set.
2.2 Human-in-the-loop ML Development
An ML application is a function $H$ that maps a feature vector $x$ to its predicted label $y$. Coming up with this function is not a one-shot process, as indicated in previous work [13, 12, 30, 31, 33]. Instead, it often involves human developers who (1) start from a baseline application, (2) conduct error analysis by looking at the predictions of the current application and summarize a taxonomy of errors the application is making, and (3) try out a “fix” to produce a new application. A potential “fix” could be (1) adding a new feature, (2) using a different noise model of the data, or (3) using a different model family, architecture, or hyperparameter.
There are different frameworks for modeling human behavior. In this paper, we adopt one that is commonly used by previous work on answering adaptive statistical queries [5, 6, 9, 29]. Specifically, we assume that the user, at step $t$, is a deterministic mapping $\mathcal{H}_t$ that maps the current application $H_t$ into
$$H_{t+1} = \mathcal{H}_t(D_{train}, D_{val}, H_t, g(H_t, D_{test}), \xi_t).$$
We explain the parameterization of $\mathcal{H}_t$ in the following:

The first three parameters $D_{train}$, $D_{val}$, and $H_t$ capture the scenario that the human developer has full access to the training and validation sets, as well as the current version of the ML application, and can use them to develop the next version of the application.

The fourth parameter $g(H_t, D_{test})$ captures the scenario in which the human developer only has limited access to the test set. Here $g$ is a set function mapping from $H_t$ and $D_{test}$ to a set of feedback returned by ease.ml/meter to the developer. As a special case, if $g(H_t, D_{test}) = \emptyset$, it models the scenario in which the developer does not have access to the test set at all (i.e., does not have any feedback) during development.

The fifth parameter $\xi_t$ models the “environment effect” that is orthogonal to the developer. $\xi_t$ is a variable that does not rely on past decisions, and is only a function of the step ID $t$.
When it is clear from the context that $D_{train}$, $D_{val}$, and $D_{test}$ are given, we abbreviate the notation as
$$H_{t+1} = \mathcal{H}_t(H_t, g(H_t), \xi_t).$$
(Limitations and Assumptions) There are multiple limitations that are inherent to the above model of human behaviours. Some we believe are fixable with techniques similar to what we propose in this paper, while others are more fundamental. The most fundamental assumption is that human decision does not have impact on the environment. In other words, is only a function of and all past feedback signals . There are also other potential extensions that one could develop. For example, instead of treating human behavior as a deterministic function , one can extend it to a class of deterministic functions following some (known or unknown) probabilistic distribution.
2.3 Generalization and Test Set Size
The goal of ML is to learn a model over a finite sample that can be generalized to the underlying distribution that the user does not have access to. In this paper, we focus on a loss function $l$ that maps each data point, along with its prediction, to either 0 or 1. (We refer to this loss as “0-1 loss.”) We also focus on the following notion of “generalization” for a given model $H$ ([26]):
$$\Pr\left[\,|l(H, D_{test}) - l(H, \mathcal{D}_{test})| > \epsilon\,\right] < \delta,$$
where the first term is the empirical loss $l(H, D_{test}) = \frac{1}{|D_{test}|}\sum_{(x,y) \in D_{test}} l((x,y), H(x))$ and the second is the expected loss $l(H, \mathcal{D}_{test}) = \mathbb{E}_{(x,y) \sim \mathcal{D}_{test}}\left[l((x,y), H(x))\right]$. Given an ML model $H$, checking whether $H$ generalizes according to this definition is simple if one only uses the test set once. In this case, one can simply apply Hoeffding’s inequality (Appendix A.1) to obtain:
$$|D_{test}| \geq \frac{\ln(2/\delta)}{2\epsilon^2}.$$
This provides a way of deciding the required number of samples in the test set. It becomes tricky, however, when the test set is used multiple times, and the goal of ease.ml/meter is to automatically manage this scenario and decrease the required size of $D_{test}$.
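As a worked illustration of how a sample size follows from Hoeffding’s inequality, the derivation below assumes the standard two-sided form for a loss bounded in $[0, 1]$, with $n = |D_{test}|$. The values $\epsilon = 0.05$ and $\delta = 0.05$ are illustrative choices, not settings taken from this paper.

```latex
\begin{align*}
\Pr\left[\,|l(H, D_{test}) - l(H, \mathcal{D}_{test})| > \epsilon\,\right]
  &\leq 2\exp(-2 n \epsilon^2) \leq \delta \\
\Longleftrightarrow\quad n
  &\geq \frac{\ln(2/\delta)}{2\epsilon^2}. \\[4pt]
\text{For } \epsilon = 0.05,\ \delta = 0.05:\quad
n &\geq \frac{\ln 40}{2 \cdot 0.0025} \approx \frac{3.689}{0.005} \approx 737.8,
\quad \text{so } n = 738 \text{ suffices.}
\end{align*}
```

Because $n$ grows as $1/\epsilon^2$, halving the tolerance quadruples the required test set, whereas tightening $\delta$ only adds a logarithmic term.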
2.4 Adaptive Analysis and Statistical Queries
In recent years, there has been an emerging research field on so-called adaptive analysis or reusable holdout [6], which focuses on ML scenarios where the test set can be accessed multiple times. In our setting, consider ML models $H_1, \dots, H_T$ where
$$H_{t+1} = \mathcal{H}_t(H_t, g(H_t), \xi_t).$$
The goal is to make sure that
$$\Pr\left[\,\exists t \in \{1, \dots, T\}: |l(H_t, D_{test}) - l(H_t, \mathcal{D}_{test})| > \epsilon\,\right] < \delta.$$
When $g$ is nontrivial, simply applying the union bound and requiring a test set of size (see Appendix A.2.2 for details)
$$|D_{test}| \geq \frac{\ln(2T/\delta)}{2\epsilon^2} \qquad (1)$$
does not provide the desired probabilistic guarantee because of the dependency between $H_t$ and $D_{test}$. We will discuss adaptive analysis in detail when we discuss the overfitting meter in Section 4.
(Baseline: Resampling) If a new test set is sampled from the distribution in each step $t$, to make sure that with probability $1 - \delta$ all $T$ models return a generalized loss, one only needs to apply the union bound to make sure that each sampled test set generalizes with probability $1 - \delta/T$. Thus, to support $T$ adaptive steps, one needs a test set of size (see Appendix A.2.3 for details)
$$|D_{test}| \geq T \cdot \frac{\ln(2T/\delta)}{2\epsilon^2} \qquad (2)$$
This gives us a simple baseline approach that provides the above generalization guarantee. Unfortunately, it usually requires a huge number of samples, as there is essentially no “reuse” of the test set.
3 System Design
We describe in detail (1) the interaction model between a user and ease.ml/meter, (2) different system components of ease.ml/meter, and (3) the formalization of the guarantee that ease.ml/meter provides. Last but not least, we provide a concrete example illustrating how ease.ml/meter would operate using real ML development traces.
3.1 User Interactions
ease.ml/meter assumes that there are two types of users — (1) developers, who develop ML models, and (2) labelers, who can provide labels for data points in the validation set or the test set. We assume that the developers and labelers do not have offline communication that ease.ml/meter is not aware of (e.g., labelers cannot send developers the test set via email).
Access Control
The separation between developers and labelers is to allow ease.ml/meter to manage the access of data:

Developers have full access to the validation set;

Developers have no access to the test set (ease.ml/meter encrypts the test set and only labelers can decrypt it);

Labelers have full access to the test set;

Labelers have full access to the validation set.
The rationale for the above protocol is that ease.ml/meter can measure the amount of information that is “leaked” from the test set to the developers, which, as we will see, is the key to bounding the degree of overfitting over the test set.
Interaction with Developers
A meter (as illustrated in Figure 2) is specified by a set of $m$ triples
$$\{(\underline{r}_i, \overline{r}_i, \epsilon_i)\}_{i=1}^{m}.$$
Each triple defines one of the $m$ possible overfitting signals (e.g., $m = 5$ in Figure 1) returned by ease.ml/meter to the developer: (1) $[\underline{r}_i, \overline{r}_i]$ specifies the range of empirical overfitting that this signal covers (i.e., the solid color bar in Figure 2); and (2) $\epsilon_i$ specifies the upper bound of distributional overfitting that this signal guarantees (i.e., the gray color bar in Figure 2).
We assume in ease.ml/meter that
$$\epsilon_1 \leq \epsilon_2 \leq \dots \leq \epsilon_m.$$
The rationale behind the nondecreasing $\epsilon_i$ is the intuition that when the empirical overfitting is already quite large, the developer often cares less about small distributional overfitting.
Interaction with Developers
There are two phases when the developer interacts with ease.ml/meter, initialization and model development:

Initialization Phase (user initiates). The developer initializes an ease.ml/meter session by specifying the length of the development cycle, $T$, i.e., the number of iterative development steps the developer hopes that this session can support; the developer also provides the performance metric, $l$, a loss function (e.g., accuracy) whose output is bounded by $[0, 1]$. The developer further submits the current validation set $D_{val}$ to ease.ml/meter.

Initialization Phase (system response). Given $T$, the meter returns $|D_{test}|$, the number of examples required to support $T$ development steps from the developer. The system will then request $|D_{test}|$ labels from the labeler.
The developer starts development after the initialization phase:

Model Development Phase (developer initiates). The developer submits the new ML model $H_t$ to ease.ml/meter.

Model Development Phase (system response). Given the model $H_t$, the system calculates the empirical overfitting, $|l(H_t, D_{val}) - l(H_t, D_{test})|$ (i.e., the gap between the losses over the validation and test sets), and finds the response $i$ s.t.
$$\underline{r}_i \leq |l(H_t, D_{val}) - l(H_t, D_{test})| \leq \overline{r}_i$$
and returns the value $i$ to the developer. In the meantime, the system guarantees that the distributional overfitting, i.e.,
$$|l(H_t, D_{test}) - l(H_t, \mathcal{D}_{test})|$$
is smaller than $\epsilon_i$, with (high) probability $1 - \delta$.

Model Development Phase (developer). The developer receives the response $i$ and decodes it to $(\underline{r}_i, \overline{r}_i, \epsilon_i)$. The developer then decides whether the empirical overfitting is too large. If so, she might choose to collect a new validation set.
After $T$ development cycles (i.e., after the developer has checked in $T$ models) the system terminates. The developer may then initiate a new ease.ml/meter session.
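The two phases above can be sketched as a small session object. This is a minimal illustration under stated assumptions, not the system's actual interface: the names `Signal`, `MeterSession`, and `submit` are hypothetical, the losses are passed in directly rather than computed by the system on hidden data, and the test-set sizing step is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """One meter signal (hypothetical representation of a triple)."""
    low: float   # lower end of the empirical-overfitting range
    high: float  # upper end of the empirical-overfitting range
    eps: float   # guaranteed upper bound on distributional overfitting


class MeterSession:
    """Hypothetical session: tracks the step budget and maps gaps to signals."""

    def __init__(self, signals, max_steps):
        self.signals = list(signals)
        self.max_steps = max_steps  # length of the development cycle
        self.steps_used = 0

    def submit(self, val_loss, test_loss):
        """Return only the index of the signal covering the empirical overfitting."""
        if self.steps_used >= self.max_steps:
            # Budget exhausted: the session ends and a fresh test set is needed.
            raise RuntimeError("budget exhausted: a new test set is required")
        self.steps_used += 1
        gap = abs(val_loss - test_loss)
        for index, signal in enumerate(self.signals):
            if signal.low <= gap <= signal.high:
                return index
        # Gap lies outside every configured range: report the last signal.
        return len(self.signals) - 1


# Illustrative three-signal meter supporting two submissions.
meter = MeterSession(
    [Signal(0.0, 0.02, 0.01), Signal(0.02, 0.04, 0.01), Signal(0.04, 1.0, 0.01)],
    max_steps=2,
)
first = meter.submit(val_loss=0.10, test_loss=0.13)    # gap of about 0.03
second = meter.submit(val_loss=0.10, test_loss=0.105)  # gap of about 0.005
```

The developer sees only the returned index, never the test loss itself, which is what keeps the amount of information leaked from the test set small and countable.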
Interaction with Labelers
Labelers are responsible for providing labeled test data. Whenever the preset budget is used up, that is, the developer has submitted $T$ versions of the ML application, ease.ml/meter issues a new request to the labeler to ask for a new, independent test set. The old test set can be replaced by the new one and released to the developer for development use.
3.2 Overfitting Signals
In standard ML settings, overfitting is connected with model training, not testing. This is due to the (implicit) assumption that the test set will only be accessed once by the ML model. In the context of continuous ML application development, this assumption is no longer valid and the test set is subject to overfitting as well. The presence of adaptive analysis further accelerates the process towards overfitting. We now formally define the semantics of the overfitting signals returned by ease.ml/meter. Without loss of generality, we assume that $\epsilon_1 = \dots = \epsilon_m = \epsilon$ and only discuss the case of them being different in Section 5.1.
3.2.1 Formalization of User Behavior
To formalize this notion of overfitting, we need precise characterization of the behavior of the developer:

The developer does not have access to $D_{test}$.

At the beginning, the developer specifies: (1) $T$, the number of development iterations; (2) $\epsilon$, the tolerance of distributional overfitting (defined later); and (3) $\delta$, the confidence. In response, the system returns the required size $|D_{test}|$.

At every single step , the system returns to the user an indicator which is a function of . As we will see, indicates that degree of overfitting is bounded by , with probability at least .
3.2.2 Formal Semantics of Overfitting
There could be various definitions and semantics of overfitting. It is not our goal to investigate all those alternatives in this work, which is itself an interesting topic. Instead, we settle on the following definitions that we believe are useful via our conversations with ML application developers.
“Distributional Overfitting” and “Empirical Overfitting”. Formally, let $H$ be an ML application and $l$ be a performance measure (e.g., a loss function). For a given data set $D$ drawn i.i.d. from $\mathcal{D}$, we use $l(H, D)$ to represent the performance of $H$ over $D$. We also use $l(H, \mathcal{D})$ to represent the expected performance of $H$ over the distribution $\mathcal{D}$. We define the degree of overfitting of the validation set as
$$|l(H, D_{val}) - l(H, \mathcal{D}_{test})|.$$
We decompose this term into two terms: the empirical difference between $D_{val}$ and $D_{test}$, i.e.,
$$|l(H, D_{val}) - l(H, D_{test})|,$$
and the “quality” of the estimator relying on $D_{test}$, i.e.,
$$|l(H, D_{test}) - l(H, \mathcal{D}_{test})|.$$
We call the first term empirical overfitting as it is measured in terms of the empirical loss, and call the second term distributional overfitting as it measures the gap between the current test set and the (unknown) true distribution. One crucial design decision we made in ease.ml/meter is to decouple these two terms and only report the empirical overfitting to the user while treating distributional overfitting as a hard error tolerance constraint.
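Since empirical overfitting is directly computable, a minimal sketch may help make it concrete. It assumes the 0-1 loss introduced in Section 2.3 and takes model predictions as plain lists; the function names are hypothetical and chosen for illustration only.

```python
def zero_one_loss(predictions, labels):
    """Average 0-1 loss: the fraction of predictions that differ from the labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equally long")
    return sum(p != y for p, y in zip(predictions, labels)) / len(labels)


def empirical_overfitting(val_preds, val_labels, test_preds, test_labels):
    """Absolute gap between the validation loss and the test loss of one model."""
    val_loss = zero_one_loss(val_preds, val_labels)
    test_loss = zero_one_loss(test_preds, test_labels)
    return abs(val_loss - test_loss)


# Illustrative data: a model that is perfect on validation but 50% wrong on test.
gap = empirical_overfitting(
    val_preds=[1, 0], val_labels=[1, 0],
    test_preds=[1, 0, 1, 1], test_labels=[1, 1, 1, 0],
)
```

Distributional overfitting has no such direct computation, because the expected loss over the true distribution is unknown; it can only be bounded probabilistically.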
Distributional Overfitting (Overfitting). As the empirical overfitting term can be measured directly by calculating the difference between the validation set and the test set, the technical challenge of ease.ml/meter hinges on the measurement/control of distributional overfitting. In the rest of this paper, we use the term “ overfits by ” to specifically refer to distributional overfitting. When the context of and is clear, we use
to denote distributional overfitting.
We want to measure not just for a single model , but in the context of a series of models.
Definition 1.
We say that overfits by with respect to a submission history and performance measure , if and only if Here
Intuitively, this guarantees that, as long as the test set does not overfit up to step , all decisions made by the developer are according to a test set that closely matches the real distribution – at least in terms of some aggregated statistics (e.g., accuracy).
3.3 Example Use Cases
To illustrate how ease.ml/meter can be used in the development process of ML models, we use the development trace data from two real-world ML applications we developed in the past and showcase the signals ease.ml/meter would return to its user.
Development Trace 1: Emotion Detection in Text
As our first case study, we took the development history of System X, a participant in the “EmoContext” task in SemEval 2019 (https://www.humanizingai.com/emocontext.html). This task aims to detect emotions from text by leveraging contextual information, which is deemed challenging due to the lack of facial expressions and voice modulations (https://competitions.codalab.org/competitions/19790). It took the developers eight iterations before delivering the final version of System X. Changes in each individual step include adding various word representations such as ELMo [19] and GloVe [18], which lead to significant performance increases or drops. Figure 3(a) plots the accuracy of System X on the validation set and the test set (assuming that the accuracy on the test set was reported to the user in every development step), respectively, in each of the eight development steps.
Development Trace 2: Relation Extraction
Our second case study comes from System Y [23], a participant in the task “Semantic Relation Extraction and Classification in Scientific Papers” in SemEval 2018 (https://competitions.codalab.org/competitions/17422). This task aims to identify concepts from scientific documents and recognize the semantic relation that holds between the concepts. In particular, it requires semantic relation extraction and classification into six categories specific to scientific literature. The development history of System Y indicates that it involved eight steps before reaching the final version of the system. Figure 3(b) presents the accuracy of System Y on the training set (using 5-fold cross validation) and the test set, respectively, for each step in the development cycle.
Meter in Action
Figure 3 illustrates the signals that developers would receive when applying ease.ml/meter to these two development traces. At each step, the current empirical overfitting is visible while the distributional overfitting is guaranteed to be smaller than the tolerance $\epsilon$ with probability at least $1 - \delta$.
Figure 3 also reveals two working modes of ease.ml/meter: (1) the regular meter and (2) the incremental meter. The regular meter simply returns an overfitting signal for each submission that indicates its degree of overfitting, as we have been discussing so far. However, this is often unnecessary in practice, as developers usually only care about the maximum (or worst) degree of overfitting of all the models that have been submitted. The rationale is that the tolerance for overfitting usually only depends on the application, not a particular model: a submitted model is acceptable as long as its overfitting is below the application-wide tolerance threshold. The incremental meter is designed around this observation.
It is worth mentioning some tradeoffs if developers choose to use the incremental meter. On the positive side, it can significantly reduce the number of human labels, compared with the regular meter (see Sections 4.3 and 4.4). For instance, for the particular setting here, the incremental meter would have required only 50K labels, compared with the 80K labels required by the regular meter, to support the steps shown in Figure 3(c). On the negative side, developers may lose track of the performance of an individual submission if the incremental meter does not move: in that case, developers only know that this submission is better than the worst one in the history.
3.4 Discussion
One may wonder why we do not take a more straightforward approach that bounds $|l(H, D_{val}) - l(H, \mathcal{D}_{test})|$ directly, rather than the decomposition strategy ease.ml/meter uses. The rationale is that the former becomes much more challenging, if not impossible, when the validation distribution drifts from the true distribution (i.e., $\mathcal{D}_{val} \neq \mathcal{D}_{test}$). When there is no distribution drift, we can indeed apply the same techniques in Section 4 below to $D_{val}$ (in lieu of $D_{test}$) to derive a lower bound for $|D_{val}|$. However, given that the developer has full access to $D_{val}$, and that $D_{val}$ is often used for hyperparameter tuning that involves lots of iterations (i.e., very large $T$’s), the required $|D_{val}|$ can easily blow up.
In fact, it is not even our goal to bound $|l(H, D_{val}) - l(H, \mathcal{D}_{test})|$. Recall that ease.ml/meter aims to understand the sizes of both the validation set and the test set. Even if directly bounding $|l(H, D_{val}) - l(H, \mathcal{D}_{test})|$ were possible, it would only give us an answer to the question of the desired validation set size, and the question about the desired test set size would remain unanswered. Our decision to decompose $|l(H, D_{val}) - l(H, \mathcal{D}_{test})|$ is indeed a design choice, not a compromise. Instead of providing a specific number for the validation set size, ease.ml/meter answers the question in probably the strongest sense: pick whatever size you like, because the validation set size no longer matters. In practice, one can simply take a conservative, progressive approach: start with a validation set of moderate size, and let the meter report the degree of overfitting (via explicit control over the test set size); if the degree of overfitting exceeds the tolerance, replace the validation set (e.g., by adding more samples). Therefore, our decomposition design is indeed a “two birds, one stone” approach that simultaneously addresses the two concerns regarding both validation and test set sizes.
Given that we do not explicitly bound $|l(H, D_{val}) - l(H, \mathcal{D}_{test})|$, one may raise the question about the semantics of the overfitting signals returned by ease.ml/meter, in terms of $|l(H, D_{val}) - l(H, \mathcal{D}_{test})|$. When ease.ml/meter returns an overfitting signal $i$, it indicates the corresponding range $[\underline{r}_i, \overline{r}_i]$ on the meter. Since $\underline{r}_i \leq |l(H, D_{val}) - l(H, D_{test})| \leq \overline{r}_i$ and $|l(H, D_{test}) - l(H, \mathcal{D}_{test})| \leq \epsilon_i$, it follows that, with probability (i.e., confidence) at least $1 - \delta$,
$$\underline{r}_i - \epsilon_i \leq |l(H, D_{val}) - l(H, \mathcal{D}_{test})| \leq \overline{r}_i + \epsilon_i.$$
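The step from a signal to a bound on the overall gap is an application of the triangle inequality, spelled out below. The shorthand $a$, $b$, $c$, the symbols $\underline{r}_i$, $\overline{r}_i$, $\epsilon_i$ for the signal's range and its distributional bound, and the numbers in the example are assumptions introduced here for illustration.

```latex
\begin{align*}
&\text{Let } a = l(H, D_{val}),\quad b = l(H, D_{test}),\quad c = l(H, \mathcal{D}_{test}), \\
&\text{with } \underline{r}_i \leq |a - b| \leq \overline{r}_i
  \text{ and } |b - c| \leq \epsilon_i. \\[4pt]
&\text{Upper bound: } |a - c| \leq |a - b| + |b - c| \leq \overline{r}_i + \epsilon_i. \\
&\text{Lower bound: } |a - c| \geq |a - b| - |b - c| \geq \underline{r}_i - \epsilon_i. \\[4pt]
&\text{Example: } [\underline{r}_i, \overline{r}_i] = [0.02, 0.04],\ \epsilon_i = 0.01
  \ \Longrightarrow\ 0.01 \leq |a - c| \leq 0.05.
\end{align*}
```

The bound on $|b - c|$ holds only with probability at least $1 - \delta$, so the resulting bounds on $|a - c|$ inherit the same confidence.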
Limitations and Assumptions
Although the above setup is quite generic, a range of limitations remain. One important limitation is that ease.ml/meter assumes that the user is able to draw samples from each distribution at any time/step. This is not always true, especially in many medical applications and anomaly detection applications in physical systems. Another limitation is that ease.ml/meter assumes that each data distribution is stationary, i.e., all three distributions do not change or drift over time, though in many applications concept/domain drift is inevitable [7, 35]. We believe that these limitations are all interesting directions to explore. However, as this is one of the early efforts on overfitting management, we leave them as future work.
4 Monitoring Overfitting
We now present techniques in ease.ml/meter that monitor the degree of (distributional) overfitting. We first piggyback on techniques recently developed in adaptive analysis [5, 6] and apply them to our new scenario. We have also developed multiple simple, but novel, optimizations, which we will discuss later in Section 5. The main technical question we aim to answer is the following: given the error tolerance parameters , , and the length of development cycles , how large should the test set be?
4.1 Recap: Adaptive Analysis
As we have discussed in Section 2.4, we cannot simply draw a fresh test set for each new submission (i.e., the Resampling baseline), as it would become prohibitively expensive in practice for reasonable choices of and in many circumstances. For instance, if we set and , it would require 380K examples to test just models (by Equation 2).
To reduce this sample complexity, it is natural to consider reusing the same test set for subsequent submissions. As one special case (which is unrealistic for ease.ml/meter), if all submissions are independent (i.e., the next submission does not depend on the overfitting signal returned by ease.ml/meter for the present submission), then we can simply apply the union bound combined with Hoeffding’s inequality to obtain the sample complexity shown in Equation 1. Using the previous setting again (, , and ), we now need only 38K examples in the test set.
However, this independence assumption seldom holds in practice, as developers always receive the overfitting signals returned by ease.ml/meter, which, in the worst case, always influence their choice of the next model (see Figure 4). This implies that the models submitted by developers can be dependent. We now formally examine this kind of adaptive analysis during ML application development in more detail, studying its impact on the size of the test set.
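The two figures quoted above (380K and 38K) can be reproduced with a short calculation. The sketch below is an illustration, not the paper's Equations 1 and 2 verbatim: it assumes a two-sided Hoeffding bound with the confidence budget split evenly over the submissions, and it assumes the setting is a tolerance of 0.01, a confidence parameter of 0.01, and 10 submissions, values chosen here because they land on the quoted numbers. The function names are hypothetical.

```python
import math


def hoeffding_size(eps, delta):
    """Smallest n such that 2 * exp(-2 * n * eps^2) <= delta (two-sided Hoeffding)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


def no_adaptivity_size(eps, delta, steps):
    """One shared test set; union bound over `steps` independent submissions."""
    return hoeffding_size(eps, delta / steps)


def resampling_size(eps, delta, steps):
    """A fresh test set per submission, each sized for confidence delta / steps."""
    return steps * hoeffding_size(eps, delta / steps)


# Assumed parameter values (not stated in the surrounding text).
EPS, DELTA, STEPS = 0.01, 0.01, 10

shared = no_adaptivity_size(EPS, DELTA, STEPS)
fresh = resampling_size(EPS, DELTA, STEPS)
print(f"shared test set: {shared}, fresh test set per step: {fresh}")
```

Under these assumptions the shared test set comes out at roughly 38K examples and the per-step resampling total at roughly 380K, which is the tenfold gap described in the text.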
4.2 Cracking Model Dependency
The basic technique remains similar to when the submissions are independent: We can (1) apply Hoeffding’s inequality to each submission and then (2) apply union bound to all possible submissions. While (1) is the same as the independent case, (2) requires additional work as the set of all possible submissions expands significantly under adaptive analysis. We use a technique based on description length, which is similar to those used by other adaptive analysis work [5, 6, 22].
Specifically, consider step . If the submission is independent of , the number of all possible submissions is simply after steps. If, on the other hand, depends on both and (i.e., the indicator returned by the meter, which specifies the range that the degree of overfitting of falls into), then for each different value of we can have a different . To count the total number of possible submissions, we can naturally use a tree to capture dependencies between and .
In more detail, the tree contains levels, where level represents the corresponding step in the submission history . In particular, the root represents . Each node at level represents a particular realization of , i.e., a possible submission made by developers at step . Meanwhile, the children of represent all possible ’s that are realizations of given that the submission at step is .
Example 1 (Dependency Tree).
Figure 5(a) showcases the corresponding tree for a regular meter when and . It contains levels. The root represents , the initial submission. Since , the meter contains two overfitting ranges and therefore can return one of the two possible overfitting signals, Signal 1 or Signal 2. Depending on which signal is returned for , developers may come up with different realizations for . This is why the root has two children at level . The same reasoning applies to these two nodes at level as well, which results in four nodes at level .
The problem of applying union bound to all possible submissions under adaptive analysis therefore boils down to computing the size of the tree . We next analyze for the regular meter and the incremental meter, respectively.
4.3 Regular Meter
In the regular meter, each signal can take values () for a meter with possible signals. One can then easily see that, in general, the number of nodes in the (model) dependency tree is . This leads to the following result on sample complexity for the regular meter (the complete proof is in Appendix A.3.1):
Theorem 1 (Regular Meter).
The test set size (in the adaptive setting) of the regular meter satisfies
where . As a result, it follows that
(3) 
by using the approximation .
(Comparison to Baseline) Compared to the Resampling baseline (Equation 2), the regular meter (with ) can reduce the number of test examples from 380K to 108K when , , and , a 3.5× improvement.
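The counting argument behind the regular meter can be sketched as follows. This is a minimal illustration under stated assumptions, not the statement of Theorem 1: it assumes the dependency tree is a complete tree in which every node has one child per overfitting signal, and that the test set size follows from Hoeffding’s inequality plus a union bound over all tree nodes. The function names, and the choice of 5 signals and 11 levels, are assumptions made here because they land near the 108K figure quoted above.

```python
import math


def regular_tree_size(num_signals, levels):
    """Number of nodes in a complete tree where each node has `num_signals` children."""
    return sum(num_signals ** level for level in range(levels))


def sample_size_for_tree(tree_size, eps, delta):
    """Smallest n such that tree_size * 2 * exp(-2 * n * eps^2) <= delta."""
    return math.ceil(math.log(2.0 * tree_size / delta) / (2.0 * eps ** 2))


# Small case in the spirit of Example 1: two signals, three levels -> 1 + 2 + 4 nodes.
small_tree = regular_tree_size(2, 3)

# Assumed setting: 5 signals, 11 levels, tolerance 0.01, confidence parameter 0.01.
big_tree = regular_tree_size(5, 11)
n_regular = sample_size_for_tree(big_tree, 0.01, 0.01)
print(small_tree, big_tree, n_regular)
```

Because the tree size enters only through a logarithm, an exponentially large tree still yields a test set that grows linearly in the number of levels.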
4.4 Incremental Meter
As we have discussed in Section 3.3, the incremental meter reports the worst degree of overfitting for all models that have been submitted so far. Formally, at step it returns
where is the overfitting signal that would have been returned by the regular meter at step . As a result, the incremental meter can only move (indeed, increase) towards one direction. This constraint allows us to further reduce the required amount of test examples, often significantly (compared to the regular meter):
Theorem 2 (Incremental Meter).
The test set size (in the adaptive setting) of the incremental meter satisfies
where
As a result, it follows that
(4) 
The proof is in Appendix A.3.2. [Footnote 6: The proof is quite involved but the idea is simple: Count the number of tree nodes with respect to (1) the overfitting signal returned by ease.ml/meter and (2) the level , and observe that We can further show that .] Compared to the regular meter, the size of the (model) dependency tree can be further pruned. Figure 5(b) illustrates this for the incremental meter when and : Shadowed nodes are pruned with respect to the tree of the corresponding regular meter (Figure 5(a)).
(Comparison to Baseline) Compared to the Resampling baseline (Equation 2), the incremental meter (with ) can reduce the number of test examples from 380K to 66K (, , and ), a 5.8× improvement.
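The pruning effect of the incremental meter can be checked by brute force. The sketch below is an illustration under assumptions, not the proof in Appendix A.3.2: it assumes that a tree node corresponds to a history of returned signals, and that the incremental meter admits only histories that never decrease. Under that assumption the number of histories of a given length is a binomial coefficient (a standard stars-and-bars count). The function names and the setting of 5 signals and 11 levels are hypothetical, chosen because they land near the 66K figure quoted above.

```python
import math
from itertools import product


def incremental_tree_size_bruteforce(num_signals, levels):
    """Count nodes by enumerating signal histories that never decrease."""
    total = 0
    for depth in range(levels):
        for history in product(range(num_signals), repeat=depth):
            if all(a <= b for a, b in zip(history, history[1:])):
                total += 1
    return total


def incremental_tree_size(num_signals, levels):
    """Closed form: non-decreasing histories of length d number C(d + m - 1, m - 1)."""
    return sum(
        math.comb(depth + num_signals - 1, num_signals - 1)
        for depth in range(levels)
    )


def sample_size_for_tree(tree_size, eps, delta):
    """Smallest n such that tree_size * 2 * exp(-2 * n * eps^2) <= delta."""
    return math.ceil(math.log(2.0 * tree_size / delta) / (2.0 * eps ** 2))


pruned = incremental_tree_size(5, 11)
n_incremental = sample_size_for_tree(pruned, 0.01, 0.01)
print(pruned, n_incremental)
```

The pruned tree grows polynomially rather than exponentially in the number of levels, which is why the resulting test set is so much smaller than for the regular meter.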
5 Optimizations
In the previous section, we adapted existing techniques directly to ease.ml/meter. However, these techniques were developed for general adaptive analysis without considering the specific application scenario that ease.ml/meter is designed for. We now describe a set of simple, but novel, optimizations that further reduce the number of labels required by ease.ml/meter.
5.1 Nonuniform Error Tolerance
Our first observation is that a uniform error tolerance for all signals, as was assumed by all previous work [5, 22], is perhaps “overkill”: when empirical overfitting is large, the user might have a higher error tolerance (e.g., when validation accuracy and test accuracy are already off by 20 points, the user might not need to control distributional overfitting to a single precision point). Our first optimization is then to extend the existing results to support different ’s for different signals , such that . This leads to the following extensions of the results in Sections 4.3 and 4.4.
Corollary 1 (Nonuniform, Regular Meter).
The test set size (in the adaptive setting) of the regular meter with nonuniform satisfies
(5) 
where remains the same as in Theorem 1.
The proof can be found in Appendix A.3.3. The basic idea is the same as that in Section 4.2: We can use a tree to capture the dependencies between historical submissions and count the number of tree nodes (i.e., possible submissions). However, in the nonuniform case, we have to count the number of tree nodes for each individual overfitting signal separately, since we need to apply Hoeffding’s inequality to each group of nodes corresponding to a particular , with respect to . For the regular meter, it turns out that .
Corollary 2 (Nonuniform, Incremental Meter).
The test set size (in the adaptive setting) of the incremental meter with nonuniform satisfies
(6) 
where
The previous remark on the proof of Corollary 1 applies to the nonuniform, incremental meter, too: We can compute and then apply Hoeffding’s inequality with respect to , for each separately. The complete proof can be found in Appendix A.3.4. Note that the tree size is the same as that of Theorem 2.
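When each signal has its own tolerance, the test set size can be found numerically rather than in closed form. The sketch below assumes the bound has the general shape suggested by the remarks above: one Hoeffding term per signal, weighted by the number of tree nodes for that signal, with the sum required to stay below the confidence parameter. This shape, the function names, and the per-signal node counts used in the example are all assumptions for illustration; they are not taken from Equations 5 and 6.

```python
import math


def failure_bound(n, node_counts, tolerances):
    """Sum over signals of (#nodes) * 2 * exp(-2 * n * eps_k^2)."""
    return sum(
        2.0 * count * math.exp(-2.0 * n * eps * eps)
        for count, eps in zip(node_counts, tolerances)
    )


def smallest_test_size(node_counts, tolerances, delta):
    """Smallest integer n whose failure bound is at most delta (bisection)."""
    low, high = 0, 1
    # Grow the upper end until the bound is satisfied; the bound decreases in n.
    while failure_bound(high, node_counts, tolerances) > delta:
        high *= 2
    # Invariant: the bound fails at `low` and holds at `high`.
    while high - low > 1:
        mid = (low + high) // 2
        if failure_bound(mid, node_counts, tolerances) <= delta:
            high = mid
        else:
            low = mid
    return high


# Hypothetical per-signal node counts, paired with increasing tolerances.
counts = [1000, 1000, 1000]
n_nonuniform = smallest_test_size(counts, [0.01, 0.02, 0.03], 0.01)
n_uniform = smallest_test_size([sum(counts)], [0.01], 0.01)
print(n_nonuniform, n_uniform)
```

In this toy setting the term with the smallest tolerance dominates the sum, so the nonuniform size is close to what that single group would require on its own.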
Impact on Sample Complexity
The difficulty of finding a closed-form solution for nonuniform makes it challenging to directly compare this optimization with those that we derived in Section 4. To better understand the sample complexity of nonuniform meters, in the following we conduct an analysis based on the assumption that is dominated by . [Footnote 7: Given that and the exponential terms that enclose the ’s, one can expect that the left-hand sides of Equations 5 and 6 are dominated by the terms that contain .]
(Sample Complexity assuming Dominance) Specifically, for the nonuniform, regular meter, we have
(7) 
On the other hand, for the nonuniform, incremental meter, we have
(8) 
(Improvement over Section 4) We now compare the sample complexity of the nonuniform meters to their uniform counterparts. The nonuniform, regular meter can reduce sample complexity to when setting to (0.01, 0.02, 0.03, 0.04, 0.05), , and , compared to with a uniform error tolerance . We do not observe significant improvement, though. In fact, if we compare Equation 7 with Equation 3, the improvement is upper bounded by . For and , it means that the best improvement would be regardless of . Nonetheless, one can increase either or to boost the expected improvement. On the other hand, the nonuniform, incremental meter can further reduce sample complexity from to (which matches Equation 1, the ideal sample complexity when all submissions are independent), a improvement.
5.2 Multitenancy
Our second observation is that, when multiple users have access to the same meter, it is possible to reduce the number of examples required if we assume that these users do not communicate with each other. Multitenancy is a natural requirement in practice given that developing ML applications is usually team work. We implemented a multitenancy management subsystem to enable concurrent access to the meter from different users.
Figure 6 illustrates the specific multitenancy scenario targeted by ease.ml/meter. Unlike the traditional multitenancy setting, where multiple users access a system simultaneously, our multitenancy scenario captures more of a collaboration pattern between two developers in which one starts from the checkpoint made by the other, similar to the well-known “git branch” and “git merge” development pattern (if we have to make an analogy).
An interesting, perhaps counterintuitive observation is that multitenancy can further reduce the desired size of the test set. We illustrate this using a simple, twotenant case where there are only two developers, each working on steps. We provide the result for more than two developers in Appendix B. We focus on nonuniform meters in our discussion, as uniform meters (i.e., with a single ) can be viewed as special cases.
(Two Tenants in Nonuniform, Regular Meter) Consider the regular meter first. Since each tenant has development cycles, it follows from Corollary 1 that, in the presence of two tenants,
(9)  
(Two Tenants in Nonuniform, Incremental Meter) Similarly, for the incremental meter, it follows from Corollary 2 that, in the presence of two tenants,
(10)  
Intuition
This result may be counterintuitive at first glance. One might wonder what the fundamental difference is between multiple developers and a single developer, given our two-tenancy setting. The observation here is that the second developer can forget about the “development history” made by the first developer, since it is irrelevant to her own development (see Figure 7 for an example when and ).
Impact on Sample Complexity
We can illustrate the impact of multitenancy by again analyzing sample complexity under the dominance assumption:
(Sample Complexity Assuming Dominance) Specifically, for the nonuniform, regular meter, we have
(11) 
On the other hand, for the nonuniform, incremental meter, we have
(12) 
(Improvement over Section 5.1) We now compare the sample complexity of the twotenancy meters to the singletenancy ones. The twotenancy, regular meter can reduce the number of test examples from to when setting = (0.01, 0.02, 0.03, 0.04, 0.05), , and , a improvement. One can indeed achieve asymptotically improvement as increases (see Section 6.2.2). On the other hand, the twotenancy, incremental meter cannot further improve the sample complexity. This is not surprising, as Equation 12 and Equation 8 are exactly the same. In fact, both have matched Equation 1, which represents the ideal sample complexity when all submissions are independent.
5.3 “Time Travel”
Our third observation is that it is a natural action for developers to revert to a previous step once an overfitting signal is observed, i.e., “traveling back in timeline.” In practice, it makes little sense to revert to older steps except for the latest one prior to the one that resulted in overfitting. Therefore, we only consider the case of taking one step back, just like what the “git revert HEAD” command does. Intuitively, “time travel” permits “regret” in development history. As a result, we can use a smaller test set to support the same number of development steps. In the following, we present a formal analysis that verifies this intuition.
Analysis
For ease of exposition, we start by assuming a single budget for “time travel,” i.e., only one reversion is allowed in the development history. Again, we are interested in the total number of possible model submissions. Suppose that the user decides to revert at step . One can decompose the entire procedure into three phases: (1) keep submitting models until step ; (2) revert and go back to step ; (3) continue submitting times until step . For each , the numbers of possible submissions in the three phases are then (1) , (2) , and (3) .
It is straightforward to generalize the results to budgets. Suppose that the user decides to revert at steps . Both phases (1) and (2) can repeat for each , though each should be replaced by to accommodate the “time shift” effect due to “time travel.” Phase (3) follows afterwards. Therefore, the total number of possible submissions is
(“Time Travel” in Nonuniform, Regular Meter) By Corollary 1, the total number of possible submissions is
(13) 
The test set size therefore satisfies
(14) 
(“Time Travel” in Nonuniform, Incremental Meter) By Corollary 2, the total number of possible submissions is
(15) 
The test set size therefore satisfies
(16) 
Impact on Sample Complexity
As before, we illustrate the impact of “time travel” by analyzing sample complexity under the dominance assumption.
(Sample Complexity Assuming Dominance) Specifically, for the nonuniform, regular meter, we have
(17) 
On the other hand, for the nonuniform, incremental meter, regardless of the choice of and to , we have
(18) 
(Improvement over Section 5.1) We now compare the sample complexity of the “time travel” meters to the ones in Section 5.1. We set and = (1, 2, 3). The “time travel” regular meter can reduce sample complexity from to when setting = (0.01, 0.02, 0.03, 0.04, 0.05), , and , a improvement. We can further improve the sample complexity by increasing the “time travel” budget — for we can achieve an improvement up to (see Section 6.2.3). We do not see improvement for the incremental meters under the dominance assumption, though, as Equation 18 already matches the lower bound in Equation 1.
6 Experimental Evaluation
We report experimental evaluation results in this section. Our evaluation covers the following aspects of ease.ml/meter:

How can ease.ml/meter be fit into real ML application development lifecycle management? — see Section 6.3.
We compare the participating techniques in terms of their induced sample complexity (i.e., the number of human labels required), under various parameter settings that are typical in practice.
6.1 Meters vs. Baselines
Baseline
We have presented the Resampling baseline in Section 2.4 in the context of a single , which can only serve as a baseline for uniform meters. It is straightforward to extend it to the nonuniform case, though: Given that , the required sample size in each single step is bounded by Hence, the total number of samples satisfies
(19) 
Computation of Sample Size for Nonuniform Meters
Parameter Settings
In our experiments, we choose from the set , which consists of common confidence thresholds encountered in practice. We choose from and vary from 10 to 100.
Evaluation Results
We choose from . Figure 8 compares the uniform meters with the baseline approaches when setting , , and . In each of the four subfigures, we plot the number of samples required by the baseline (“Resampling”), the regular meter (“Regular Meter”), the incremental meter (“Incremental Meter”), and the ideal case where the submissions are independent (“No Adaptivity”), respectively. Note that the axis is in log scale.
(Impact of ) We have the following observations on the impact of , regardless of the choices for and :

As increases, the number of samples required by each participating approach increases.

However, the rates of growth differ dramatically: Both Regular Meter and Incremental Meter grow much more slowly than Resampling, and moreover, Incremental Meter grows much more slowly than Regular Meter. In fact, based on Equations 2, 3, and 4, we can show that the sample sizes required by Resampling, Regular Meter, and Incremental Meter are , , and , respectively, with respect to . [Footnote 8: We need to apply Stirling’s approximation (to Equation 4) to obtain the result for Incremental Meter.]

Although both Incremental Meter and No Adaptivity grow at the rate of , there is still a visible gap between them, which indicates an opportunity for further improvement.
(Impact of ) By comparing Figure 8(a) and Figure 8(b), where we keep unchanged but vary from 0.05 to 0.01, we observe that the sample complexity increases only slightly (hardly noticeable given that the axis is in log scale). Comparing Figure 8(c) and Figure 8(d) leads to the same observation. This is understandable, as the impact of on the sample complexity of all participating approaches is (the same) .
(Impact of ) On the other hand, the impact of on the sample complexity of all participating approaches is much more significant. This is evidenced by comparing Figure 8(a) and Figure 8(c), where we keep but change from 0.05 to 0.01. As we can see, the sample complexity increases by around 25×! This observation remains true if we compare Figure 8(b) and Figure 8(d). The rationale is simple: the impact of on the sample complexity is (the same) for all approaches.
(Impact of ) Figure 9 further compares the sample complexity of the uniform meters when fixing and . Figures 9(a) and 9(b) depict results when and , respectively. (The axis is in log scale.) We see that sample complexity increases for both Regular Meter and Incremental Meter. In fact, we can show that both meters actually grow at the (same) rate with respect to . [Footnote 9: Again, we need to apply Stirling’s approximation (to Equation 4) to obtain the result for Incremental Meter.]
6.2 Optimizations for Meters
We next evaluate the effectiveness of the optimization techniques presented in Section 5: (1) nonuniform ’s and (2) multitenancy.
6.2.1 Nonuniform Error Tolerance
Figure 10 compares nonuniform meters when setting (0.01,0.02,0.03,0.04,0.05) with uniform ones. For uniform meters and the two baselines Resampling and No Adaptivity, we set . We observe the following regardless of :

On the other hand, the nonuniform incremental meter (“Incremental, Nonuniform”) significantly improves over its uniform counterpart Incremental Meter. Again, we can verify this by comparing Equation 8 with Equation 6 — the uniform version has sample complexity whereas the nonuniform version has sample complexity .