Unknown Examples & Machine Learning Model Generalization

Over the past decades, researchers and ML practitioners have come up with better and better ways to build, understand and improve the quality of ML models, but mostly under the key assumption that the training data is distributed identically to the testing data. In many real-world applications, however, some potential training examples are unknown to the modeler, due to sample selection bias or, more generally, covariate shift, i.e., a distribution shift between the training and deployment stage. The resulting discrepancy between training and testing distributions leads to poor generalization performance of the ML model and hence biased predictions. We provide novel algorithms that estimate the number and properties of these unknown training examples---unknown unknowns. This information can then be used to correct the training set, prior to seeing any test data. The key idea is to combine species-estimation techniques with data-driven methods for estimating the feature values for the unknown unknowns. Experiments on a variety of ML models and datasets indicate that taking the unknown examples into account can yield a more robust ML model that generalizes better.




1. Introduction

Over the past decades, researchers and Machine Learning (ML) practitioners have come up with better and better ways to build, understand and improve the quality of ML models, but mostly under the key assumption that the training data is distributed identically to the testing data. This assumption usually holds true in algorithm-development environments and data-science competitions, where a single dataset is split into training and testing sets, but does it hold more generally? If not, what are the consequences for ML?

Experience shows that the foregoing assumption can fail dramatically in many real-world scenarios, especially when the data needs to be collected and integrated over multiple sources and over a long period of time. This issue is well known, for example, to the Census Bureau. A 2016 Census Advisory Committee report (Dowling et al., 2016) highlights the difficulties in reaching groups such as racial and ethnic minorities, poor English speakers, low income and homeless persons, undocumented immigrants, children, and more. Some of these groups do not have access to smartphones or the internet, or they fear interactions with authorities, so the prospects for data collection will remain difficult into the foreseeable future. Similarly, a recent report on fairness in precision medicine (Ferryman and Pitcan, 2018) documents bias in labeled medical datasets and asserts that “insofar as we still have a systematically describable group who are not in a health care system with data being collected upon them, from them, then that will be a source of bias.” In each of these cases, factors such as income and ethnicity can result in exclusion of items from a training set, yielding unrepresentative training data. We emphasize that the issue here is not just underrepresentation of classes of data items, but the complete absence of these items from consideration because they are unknown to the ML modeler.

Besides sampling bias, population shifts over time can lead to unrepresentative training data. For example, a regression model for predicting height based on weight that was trained on the US population in the 80’s may not be usable today, because the variables (population-wide height and weight distributions) have changed over time, so that the old training data do not represent the actual testing data of today.

In either case, the unrepresentativeness of the training data will adversely affect an ML model’s ability to handle unseen test data. Indeed, the better the fit to biased training data, the harder it becomes for the model to handle new test data. Clearly, mitigation of biased training data is crucial for achieving fair ML.

In this work, we focus on the impact of unknown training instances on ML model performance. Unknown examples during training can arise if the training distribution differs from the testing distribution due to sample selection bias—where some data items from the testing distribution are more or less likely to be sampled into the training data—and, more generally, covariate shift, where the training and testing data distributions can differ for any reason. Although the feature distributions differ, i.e., p_train(x) != p_test(x), we assume that p_train(y | x) = p_test(y | x), so that the conditional distribution of the class variable of interest is the same for both training and test data. That is, the predictive relationship between x and y is the same; only the distribution of the x-values differs.

Generalization is the ability of a trained ML model to predict accurately on examples that were not used for training (Bishop, 2006). Good generalization performance is a key goal of any practical learning algorithm. Ideally, we want to fit the model on a training set that well represents the hidden testing data or the target population, i.e., the training and testing data are drawn from the same distribution. If this is not the case, then we end up with unknown instances that are missing during training. If the training data is biased, in the sense that some parts of the population are under-represented or missing in the training data (i.e., we have unknown instances), then the model fitted on that training data will be biased away from the optimal function and have poor generalization performance. This issue is orthogonal to the usual trade-off between model complexity and generalization (Wikipedia contributors, 2017), where training and testing distributions are assumed to be identical. In Section 4, we show that both simple and complex models can suffer if the training data is biased.
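To make the failure mode concrete, here is a minimal, self-contained sketch (hypothetical data, pure-Python least squares; all names are ours) in which a linear model is fit to a mildly nonlinear relationship. When the training x-values are drawn from the same range as the test data the fit is serviceable, but a selection-biased sample that observes only small x-values generalizes far worse, even though p(y | x) is unchanged:

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a*x + b (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def mse(model, xs, ys):
    """Mean squared error of the line (a, b) on paired data."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

random.seed(0)
truth = lambda x: x * x  # true relationship, mildly nonlinear

test_x = [random.uniform(0, 3) for _ in range(1000)]
test_y = [truth(x) for x in test_x]

# Representative training sample: same x-distribution as the test data.
fair_x = [random.uniform(0, 3) for _ in range(200)]
fair_err = mse(fit_line(fair_x, [truth(x) for x in fair_x]), test_x, test_y)

# Biased training sample: only small x-values are observed (selection bias);
# the conditional relationship y = x^2 is identical in both samples.
biased_x = [random.uniform(0, 1) for _ in range(200)]
biased_err = mse(fit_line(biased_x, [truth(x) for x in biased_x]), test_x, test_y)
```

The effect is strongest when the model class is misspecified, as here; with a perfectly specified model, covariate shift in x alone is far less damaging.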

1.1. Learning under covariate shift

Learning under covariate shift has been studied extensively (Bickel et al., 2007; Sugiyama et al., 2013; Zadrozny, 2004; Liu and Ziebart, 2014; Huang et al., 2007). An important observation from importance sampling is that the loss on the test distribution can be minimized by weighting the loss on the training distribution with the scaling factor p_test(x) / p_train(x) (Shimodaira, 2000). Prior work proposes many techniques to estimate the scaling factor, or the training, testing, or conditional densities, more accurately and efficiently; these techniques, in turn, require both training and “unlabeled” testing data.

Access to the unlabeled testing data (during training) is only feasible in a setting where the actual test data is provided, e.g., in a data science competition. However, using such a target dataset (or re-training the model after seeing the test data) may not be possible in many real applications. We therefore propose the first techniques for learning under covariate shift that require just the training data.

1.2. Our goal and approach

We aim to develop methods for mitigating unrepresentativeness in training data arising from sampling bias, or covariate shift more generally, thereby improving ML generalization performance. Our key idea is to exploit the fact that training data is typically created by combining overlapping data sets, so that instances often appear multiple times in the combined data. We first apply species-estimation techniques that use the multiplicity counts for the existing training instances to estimate the number of unknown instances. We then use this information to correct the sample by either weighting existing instances or generating synthetic instances. For the latter approach, we investigate both kernel density and interpolation techniques for generating feature values for the synthetic instances. As shown in our experiments over different types of ML models and datasets, correcting a training set by taking unknown instances into account can indeed improve model generalization.

2. The Impact of Unknown Examples

In this section, we define unknown unknowns (Chung et al., 2018), and describe how common data collection procedures can produce biased training data with unknown unknowns. Our goal is twofold: First, we are interested in training a model that generalizes better to the unseen examples (e.g., testing data); Second, we want to minimize the generalization gap between the training and the testing/validation scores. We formally state the problem and the learning objectives at the end of the section.

2.1. Problem Setup

A typical training data collection process involves sourcing and integrating multiple data sources, e.g., data crowdsourcing where each worker is an independent source (Hsueh et al., 2009; Lease, 2011). In this work, we assume that the data sources are independent but overlapping samples S_1, ..., S_k, each obtained by sampling data items from the underlying distribution p_test; the sampling is without replacement, because a data source typically mentions a data item only once. p_test is also our target distribution for learning, and each data item has a sampling likelihood and consists of d features/variables. The data sources are then integrated into a training data set D of size n. D contains duplicates because every data source samples from the same underlying population. If we integrate a sufficiently large number of sources, then D approximates a sample with replacement from p_test. The duplicate counts resulting from the overlap of the S_i's enable the use of species estimation techniques to estimate the number of the missing, unseen test instances.
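The data-integration setup can be simulated in a few lines. The sketch below (all names and the popularity weights are illustrative, not from the paper) integrates ten overlapping sources drawn from a skewed population and tabulates the duplicate counts, i.e., the "frequencies of frequencies" that species estimation will use later:

```python
import random
from collections import Counter

random.seed(2)
population = list(range(1000))                 # underlying items, ids 0..999
# Skewed sampling likelihoods: low-id items are far more "popular".
weights = [1.0 / (i + 1) for i in population]

def draw_source(size):
    """One source: a sample without replacement (a source mentions an item at most once)."""
    out, seen = [], set()
    for item in random.choices(population, weights=weights, k=5000):
        if item not in seen:
            seen.add(item)
            out.append(item)
            if len(out) == size:
                break
    return out

sources = [draw_source(80) for _ in range(10)]
train = [x for s in sources for x in s]        # integrated training data, with duplicates

counts = Counter(train)                        # per-item multiplicity in the integrated data
f_stats = Counter(counts.values())             # frequencies of frequencies: f1, f2, ...
```

Because the sources overlap, popular items appear many times in `train`, while a large part of the population is never seen at all; those unseen items are the unknown unknowns.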

Ideally, we would like the integrated sample D to follow the target distribution p_test. However, we assume covariate shift between p_test and the actual distribution of the training data, denoted p_train. That is, p_train(x) != p_test(x), but p_train(y | x) = p_test(y | x) for all x. As discussed previously, this situation may arise if (1) any of the integrated sources exhibits a strong sample selection bias or (2) a source is outdated, but the fundamental relationship between x and y is unchanged.

For training, we assume that, for each example in D, the class label is perfectly curated. Neither the testing data D_test nor the distribution p_test is available during training. We assume that the hidden test data D_test comprises an i.i.d. sample from p_test and well represents this distribution.

2.2. The Unknown Examples

We focus on the missing training examples that actually exist in the underlying data distribution or the test set, and now formally define such missing examples as unknown unknowns.

Definition 0 (Unknown Unknowns).

Given an integrated data set D for training and a (hidden) testing data set D_test, the set of unknown unknowns is defined as U = D_test \ D.

The terminology “unknown unknowns” stems from the fact that both the cardinality of U and the feature values of its elements are unknown. The existence of unknown unknowns critically impacts a model’s generalization ability.
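In code, the definition is simply a set difference (illustrative toy data; the function name is ours):

```python
def unknown_unknowns(train, test):
    """U = D_test \\ D: test items absent from the integrated training data."""
    return set(test) - set(train)

D_train = ["a", "b", "b", "c"]            # integrated training data, with duplicates
D_test = ["a", "c", "d", "e"]             # hidden test set
U = unknown_unknowns(D_train, D_test)     # items the modeler never saw
```

Of course, this computation is only possible in hindsight: during training, neither D_test nor U is observable, which is exactly why their size and feature values must be estimated.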

2.3. Problem Statement

We quantify a model’s generalization ability via the generalization error E_gen, the difference between the error (expected loss) with respect to the underlying joint probability distribution and the error (average loss) on the finite training data:

E_gen(f) = E_{(x,y) ~ p_test}[ L(f(x), y) ] - (1/n) * sum_{(x_i, y_i) in D} L(f(x_i), y_i),

where f is the trained model and L is a loss function such as the squared loss L(f(x), y) = (f(x) - y)^2.

Our goal is to minimize the generalization error rather than the training error, in order to maximize the predictive ability on new data. However, p_test is not available during training, and so ML training algorithms instead minimize the empirical risk (1/n) * sum_{(x_i, y_i) in D} L(f(x_i), y_i). Thus, any significant discrepancy between p_train and p_test will be reflected in E_gen through the average loss over the unknown examples. Training on D when p_train != p_test can result in worse model performance on the actual testing data.
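A toy computation of the resulting gap (hypothetical model and data, our own helper names): the model fits the training examples perfectly, and the entire generalization gap comes from the unknown examples that appear only in the test set.

```python
def avg_loss(model, data, loss):
    """Average loss of a predictor over (x, y) pairs."""
    return sum(loss(model(x), y) for x, y in data) / len(data)

sq = lambda yhat, y: (yhat - y) ** 2   # squared loss

model = lambda x: 2 * x                # a fitted predictor
train = [(1, 2), (2, 4)]               # examples the model fits exactly
test = train + [(5, 13), (6, 15)]      # test set also contains unknown examples

# The train/test gap is driven entirely by the two unknown examples.
gap = avg_loss(model, test, sq) - avg_loss(model, train, sq)
```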

We now define the impact of unknown unknowns in the context of the generalization error.

Definition 0 (The Impact of Unknown Unknowns).

Given an integrated data set D for training and a (hidden) testing data set D_test, the impact of unknown unknowns is defined as the generalization gap

I(U) = (1/|D_test|) * sum_{(x,y) in D_test} L(f(x), y) - (1/|D|) * sum_{(x,y) in D} L(f(x), y).

Under the sample selection bias model (Zadrozny, 2004), where D is sampled from D_test (so that D is a subset of D_test and U = D_test \ D), the generalization error is equivalent to the difference between the training and the testing scores over the unknown examples.

In a more general setting, in which training and testing data can differ arbitrarily, I(U) only approximates the generalization error E_gen, because U contains only the missing testing instances, not the additional training instances that are absent from D_test. In this work, we simply take I(U) to approximate E_gen in the evaluation, and leave an estimation-error bound for the general setting as future work.

Our goal is to capture the distributional difference between p_train and p_test via unknown unknowns, and to make the unknown examples part of model training. The challenge arises because neither U nor D_test is available during training; we cannot directly compute U. In Section 3, we explain how adding estimated unknown examples to D can un-bias, or correct, the training distribution to resemble p_test.

3. Learning The Unknown

In this section, we focus on a simple regression problem to illustrate our techniques for learning the unknown examples. The proposed techniques can easily be extended to other problem types, such as classification (apply the same technique for each class label). In Section 4, we present the experimental results for both regression and classification problems.

Figure 1. A simple regression example. From left to right: a target population and the optimal linear regression model (gray line); a biased training data set (in this case, examples with smaller values for the dependent variable are less likely to be sampled) and the biased model (black line); training examples weighted by the unknown-count estimates (darker colors correspond to higher weights) and the resulting model (red line); and enriched training data sets with two different kinds of synthetic unknown examples and their fitted models (blue and cyan lines). Notice that the biased model (black line) trained on the original training data does not fit the target population, while the models (red, blue, cyan lines) trained with the unknown examples more closely resemble the ground-truth model (gray line) fitted on the hidden target/test data.

We propose two approaches to model the unknown examples during training. Figure 1 provides an overview of different techniques and their results using a simple regression example.

3.1. Weighting by Unknown Example Count

The first approach simply weights an existing training example by the estimated number of unknown unknowns that surround it. This is similar to previously studied importance-sampling techniques, where instance-specific weights are computed based on the density ratio p_test(x) / p_train(x) (Shimodaira, 2000; Bickel et al., 2007; Zadrozny, 2004). Other techniques exist for learning the weights, directly or indirectly, from the training and testing distributions (Huang et al., 2007; Liu and Ziebart, 2014), but they are not applicable if the target/test data or the ground-truth distribution is unavailable. We use a sample-coverage-based species estimation technique to estimate the size of the missing mass. The underlying assumption of the species estimator is that the rare species in a sample with replacement are the best indicators of the missing, unseen species in the target distribution. In our case, we assume that the rare examples in our collected training data D are the best indicators of the unknown unknowns. If we estimate a large number of unknown unknowns near a given training example, then we expect a large number of similar instances to be present in the actual test data or target population; thus, we attach more importance to such a training example during the training phase.

3.1.1. Chao92 Species Estimator

There are several species estimation techniques, and no single estimator performs well in all settings (Haas et al., 1995). In this work, we use the popular Chao92 estimator, which is defined as:

D_hat = c / C_hat + ( n * (1 - C_hat) / C_hat ) * gamma_hat^2,

where gamma_hat^2 is the squared coefficient of variation of the item frequencies and can be estimated as:

gamma_hat^2 = max{ (c / C_hat) * sum_i i * (i - 1) * f_i / ( n * (n - 1) ) - 1, 0 }.

Here c is the number of unique examples in the training data D of size n, C_hat is the sample coverage estimate—i.e., the estimated percentage of the distribution mass covered by D—and D_hat is our estimate of the total number of unique examples in the population. The sample coverage is estimated using the Good-Turing estimator (Good, 1953):

C_hat = 1 - f_1 / n.

The Good-Turing estimator leverages what are called the f-statistics, or the “frequencies of frequencies.” Specifically, f_1 denotes the number of singletons, the examples that appear exactly once in the integrated training data D. Similarly, f_2 denotes the number of doubletons, the examples that appear exactly twice in D, and so on. Given the f-statistics, we can estimate the missing distribution mass of all the unknown unknowns as 1 - C_hat = f_1 / n.
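A minimal sketch of the Chao92 estimator as written above (the function name is ours); given a sample with duplicates, it returns the estimated total number of unique items in the population:

```python
from collections import Counter

def chao92(sample):
    """Chao92 estimate of the total number of unique items, from duplicate counts."""
    n = len(sample)
    counts = Counter(sample)
    c = len(counts)                        # unique items observed
    f = Counter(counts.values())           # f-statistics: frequencies of frequencies
    C_hat = 1.0 - f[1] / n                 # Good-Turing sample coverage
    if n < 2 or C_hat == 0.0:
        return float("inf")                # all singletons: no usable estimate
    # squared coefficient of variation of the item frequencies
    num = (c / C_hat) * sum(i * (i - 1) * fi for i, fi in f.items())
    gamma2 = max(num / (n * (n - 1)) - 1.0, 0.0)
    return c / C_hat + n * (1.0 - C_hat) / C_hat * gamma2

# n = 11, c = 5, two singletons: the estimator predicts a few more unseen items.
sample = ["a"] * 4 + ["b"] * 3 + ["c", "c"] + ["d", "e"]
est = chao92(sample)
```

With no singletons the coverage estimate is 1 and the estimator reduces to the observed unique count, as expected.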

3.1.2. Dynamic Bucketization

To estimate the number of unknown unknowns near each training example, we partition the data into buckets based on the values of a selected feature and then perform the estimation for each partition. Instead of partitioning the feature space statically, with fixed boundaries and sizes, we define the buckets dynamically, making sure that each partition contains enough examples and duplicates to permit high-quality estimation.

Input : integrated training data D, feature index j, minimum sample coverage threshold τ
Output : data partitions (buckets) B
1  B ← ∅   /* buckets */
2  Q ← priority queue over D, sorted by ascending value of feature j
3  b ← ∅   /* current bucket to fill */
4  while Q not empty do
5      x ← Q.pop()   /* the next example by ascending feature value */
6      b ← b ∪ {x}
7      if sample_coverage(b) ≥ τ then   /* b has enough examples and duplicates */
8          B ← B ∪ {b}
9          b ← ∅   /* new bucket to fill */
10     end if
11 end while
12 return B
ALGORITHM 1 Dynamic Bucketization

Algorithm 1 illustrates the mechanism. First, we push the training data onto a priority queue sorted by ascending value of a selected feature (line 2). The feature index is selected based on the feature's correlation with the class label, so that the count estimates are computed with respect to the most informative feature. This is important because we can obtain different unknown-unknown count estimates depending on the feature we choose; our strategy ensures that we do not scale examples by estimates along a feature dimension that is less relevant to the final ML task. Alternatively, we can choose the feature with the highest entropy or variance. Afterwards, we group nearby examples so that each bucket has enough examples and duplicates, according to the sample coverage estimate, to permit a quality species estimate of the unknown unknowns (lines 4-12).
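The bucketization can be sketched as follows (a simplified, list-based version of Algorithm 1 that sorts once instead of using a priority queue, plus an extra rule, our own choice, that merges leftover examples into the last bucket):

```python
from collections import Counter

def sample_coverage(bucket):
    """Good-Turing coverage estimate 1 - f1/n for one bucket of examples."""
    n = len(bucket)
    if n == 0:
        return 0.0
    f1 = sum(1 for v in Counter(bucket).values() if v == 1)
    return 1.0 - f1 / n

def dynamic_buckets(train, feature, tau):
    """Greedily grow buckets along one feature until each reaches coverage >= tau."""
    buckets, current = [], []
    for x in sorted(train, key=lambda r: r[feature]):
        current.append(x)
        if sample_coverage(current) >= tau:
            buckets.append(current)
            current = []
    if current:                                  # leftovers join the last bucket
        (buckets[-1] if buckets else buckets.append(current) or buckets[-1]).extend(
            current if buckets[-1] is not current else [])
    return buckets

# Toy records as 1-tuples; duplicates raise a bucket's coverage and close it.
data = [(1,), (1,), (2,), (3,), (3,), (4,)]
buckets = dynamic_buckets(data, 0, 0.5)
```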

3.2. Synthetic Unknown Examples

The second approach models the unknown unknowns more explicitly and generates synthetic training examples. Generating synthetic examples for unknown unknowns requires estimating both the number of unknown unknowns and their feature values. For each bucket produced by dynamic bucketization (Algorithm 1), we first use the species estimation technique to estimate the number of unknown unknowns, and then we generate that many synthetic unknown unknowns as described below.

The weighting approach in Section 3.1 is cleaner in the sense that it avoids the difficult feature-value estimation step, a source of additional uncertainty and error. Indeed, if we generate bad examples, then they might harm the final model performance on the test data. A naïve approach to unknown-value estimation would be mean substitution (Schafer and Graham, 2002), where the observed average feature values are used for any unknown unknowns. We can also do this at the bucket level (Chung et al., 2018), but, all in all, this does not add much value to the learner. Instead, we use two data-driven oversampling techniques for unknown-unknown value estimation that are more aggressive than the weighting approach, yet do not generate values that depart arbitrarily far from the observed data distribution. We hew to these more conservative, data-driven approaches to avoid adding bad examples that can harm the model’s generalization performance.

3.2.1. KDE-Based Value Estimator

We use a Kernel Density Estimation (KDE) approach to estimate the probability density of each bucket and sample the missing unknown examples from it. This is effective especially when the covariate shift is mainly due to sample selection bias and the unknown unknowns are therefore similar to the observed training examples. On the other hand, the value estimator can actually mislead the training if p_train is very far from p_test.

For the purpose of this work, we used a Gaussian kernel and a “normal reference” rule of thumb (Henderson and Parmeter, 2012) to determine the smoothing bandwidth, but other parameters are subject to study.
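Sampling from a Gaussian KDE has a convenient form: pick a training value uniformly at random and add Gaussian noise whose standard deviation is the bandwidth. The sketch below (1-D, our own helper; the normal-reference bandwidth is the 1.06·σ·m^(-1/5) rule mentioned above) generates synthetic values for one bucket:

```python
import random

def kde_sample(bucket, n_new, bandwidth=None):
    """Draw n_new synthetic 1-D values from a Gaussian KDE fit to the bucket."""
    m = len(bucket)
    if bandwidth is None:
        mean = sum(bucket) / m
        sd = (sum((v - mean) ** 2 for v in bucket) / m) ** 0.5 or 1e-6
        bandwidth = 1.06 * sd * m ** -0.2   # "normal reference" rule of thumb
    # Sampling from an equal-weight Gaussian mixture = pick a point, add kernel noise.
    return [random.choice(bucket) + random.gauss(0.0, bandwidth)
            for _ in range(n_new)]

random.seed(3)
bucket = [1.0, 1.2, 0.9, 1.1, 1.0]   # observed values in one bucket
synth = kde_sample(bucket, 100)      # synthetic unknown-example values
```

Because the bandwidth scales with the bucket's spread, the synthetic values stay near the observed data, matching the conservative intent described above.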

3.2.2. SMOTE-Based Value Estimator

Synthetic Minority Oversampling Technique (SMOTE) is a widely accepted technique for balancing a dataset (Chawla et al., 2002). A dataset is said to be imbalanced if the different classes are not equally represented. SMOTE generates extra training examples for the minority class in a very conservative way: the algorithm randomly generates synthetic examples in between minority-class examples and their closest neighbors. Motivated by this class-balancing algorithm, we generate synthetic unknown examples in a similar fashion.

Input : integrated training data D, expected number of unknown examples E_hat, number of nearest neighbors k
Output : synthetic unknown examples U_syn
1  U_syn ← ∅   /* synthetic unknown examples */
2  for i = 1 to E_hat do
3      x ← random example from D   /* randomly pick */
4      N ← the k nearest neighbors of x in D
5      x′ ← random example from N   /* randomly pick a neighbor */
6      u ← new example with d features
7      for j = 1 to d do
8          δ ← x′_j − x_j
9          r ← random number between 0 and 1
10         u_j ← x_j + r · δ   /* generate new feature value */
11     end for
12     U_syn ← U_syn ∪ {u}
13 end for
14 return U_syn
ALGORITHM 2 SMOTE-Based Value Estimator

Algorithm 2 illustrates how the synthetic unknown examples are generated. We generate exactly E_hat unknown examples, where E_hat is computed using the species estimation technique (line 2). For each synthetic example, we randomly pick an existing example x, take its k nearest neighbors, and pick a random neighbor x′ (lines 3-5). Setting k high results in more aggressive data generation, since the distance between x and x′ can be larger. Finally, we randomly generate each feature value in between x and x′ (lines 7-11).
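A compact sketch of Algorithm 2 for numeric feature vectors (brute-force nearest neighbors; the names are ours):

```python
import random

def smote_unknowns(train, n_new, k=3):
    """Generate n_new synthetic examples between points and their k nearest neighbors."""
    synth = []
    for _ in range(n_new):
        x = random.choice(train)
        # k nearest neighbors of x by squared Euclidean distance (excluding x itself)
        nbrs = sorted((p for p in train if p is not x),
                      key=lambda p: sum((a - b) ** 2 for a, b in zip(p, x)))[:k]
        xp = random.choice(nbrs)
        r = random.random()
        # interpolate each feature between x and its chosen neighbor
        synth.append(tuple(a + r * (b - a) for a, b in zip(x, xp)))
    return synth

random.seed(4)
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
synth = smote_unknowns(train, 50, k=2)
```

Because every synthetic point lies on a segment between two observed points, the generated examples never leave the convex hull of the training data, which keeps the approach conservative.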

4. Experiments

We evaluate our unknown example learning techniques using real-world crowdsourced datasets as well as simulated ones. For the simulations, we used datasets from the UCI Machine Learning repository (https://archive.ics.uci.edu/ml/datasets.html) as the base datasets (population) and re-sampled them to get a biased training dataset. We designed our experiments to address the following questions:

  • Do the learning-the-unknown techniques improve model generalization on real-world datasets?

  • How do the proposed techniques compare to each other?

  • How do the techniques compare to the prior work that requires unlabeled testing data?

  • What is the sensitivity of different ML algorithms to unknown unknowns?

4.1. Real Crowdsourced Examples

We used Amazon Mechanical Turk (AMT) for real-world data crowdsourcing. Each worker was paid per example/data item provided. Because we collect data from multiple crowdworkers, each treated as a data source, our final dataset contains duplicate information. Our data crowdsourcing spans the following scenarios:


U.S. Tech Employees & Revenue. We used the crowd to collect a training dataset of 2722 records in order to see if there is a positive correlation between the size of a tech company and its revenue, and to build a model to predict a tech company’s revenue based on its number of employees. We do not have ground truth in the form of a complete list of U.S. tech companies. Instead, we use the entire crowdsourced data as testing data. In this case, the training data is more biased at the beginning of data collection, with fewer crowd answers (HITs), and less so with more HITs.


NBA Player Body Measurement. To determine the correlation between height and weight, we crowdsourced body measurements of active NBA players, and the final dataset contains 471 records, many of them redundant (e.g., more popular player information). We have ground truth test data with all 439 active NBA players, which we hide during training.


Hollywood Movie Budget & Revenue. Do blockbusters always make lots of money?

We want to build a classifier that predicts a movie’s success (i.e., making triple the production budget) based on the budget. We used the crowd to collect movie production budget and worldwide gross information (300 records with duplicates) for movies released from 1995 to 2018. The scale of movie production and gross income has changed a lot over the years; we use the Hollywood movies released between 1995 and 2015 as test data, which should cause more covariate shift between the training and testing data.


U.S. College Ranking & Tuition. We are also interested in the relationship between U.S. college rankings and school tuition. We used the crowd to collect this year’s U.S. college rankings and tuition from U.S. News Ranking (https://www.usnews.com/best-colleges/rankings), fixing the ranking system and hence the target population. We have a separate test dataset, which lists rankings and tuition for 220 U.S. colleges for the same year. We collected 300 college records with duplicates.

Workers sample from the same underlying population (for each question), and they are assumed to be independent. Yet, the combined datasets might still be biased (i.e., not a uniform random sample from the population) and contain more of the popular movies or NBA players, while missing many others due to this sample selection bias.

Figure 2. Mean absolute error on test datasets (the lower the better) for real-world crowdsourcing problems. Considering the unknown examples in training can improve the final model generalization (i.e., better test scores) in the first three cases, where we have more biased training datasets; however, the techniques did worse than the model trained on the original data (Original) on US College Ranking & Tuition. It is interesting to see that the different techniques win in different problem cases; Synthetic Unknown Examples approaches (SynUnk) seem to perform at least as well as the baseline (Original) in all four cases. We do not show WeightByUnk for Hollywood Movie Budget & Revenue, which performs much worse than the others.

For each problem case, we train a simple ML model (e.g., linear regression) with default hyper-parameters. We focus on evaluating the learning techniques, given a blackbox ML algorithm.

Figure 2 compares the testing error of the base model trained on the original training data (Original), the model trained on weighted training data (WeightByUnk, Section 3.1), and the models trained with synthetic unknown examples using the two different unknown-value estimators (SynUnk (KDE) and SynUnk (SMOTE), Section 3.2), on the real-world crowdsourcing problems. We use the mean absolute error metric:

MAE = (1/m) * sum_{i=1}^{m} | y_i - y_hat_i |,

where m is the number of test examples and y_hat_i is the model’s prediction for the i-th test example.
The scales on the y-axes are not normalized and depend on the actual target variables of the problems. It is interesting to see that different techniques win in different problem cases; the Synthetic Unknown Examples approaches (SynUnk) seem to perform at least as well as the baseline (Original) in all four cases. Adding synthetic examples can be useful (Ha and Bunke, 1997), and we generate feature values conservatively, near the existing training points, in order to minimize bad examples.

On the other hand, and surprisingly, weighting by the unknown count estimates is less consistent even though it does not require estimating the unknown example features; WeightByUnk performs the best on NBA Player Body Measurement, but the worst on the other three problems. This is because the unknown-example count estimates can be inaccurate, but, more importantly, they are not necessarily the best weighting factors: there is no guarantee that putting more importance on rare items corrects the sampling bias, nor that sample selection bias actually existed in the first place. Unlike the prior work, we do not assume access to p_test or unlabeled testing data to figure this out. And yet, WeightByUnk can be useful under covariate shift.

4.2. Simulation Study

Next, we evaluate the proposed techniques in simulations using UCI ML repository datasets:


Adult Dataset (Kohavi, 1996). The prediction task is to predict if a person makes over 50K/year or not. The repository has separate training and testing data, and each example instance consists of 14 numerical and categorical features. We use only the original training data for training, and the original testing data for the evaluation (testing scores/errors).


Auto MPG Dataset (Quinlan, 1993). The prediction task is to predict city-cycle fuel consumption in miles per gallon. There are 8 features, both numerical and categorical. The original dataset was collected over three different cities; to simulate a sample selection bias, we sample examples mostly from city 1 for training and use the examples from all the cities (1, 2, and 3) for testing.

For all simulations, we permute/re-sample the training dataset to repeat the experiments. For the Adult dataset, we have separate training and testing datasets. We re-sample from the training data with replacement, so the new simulated training data contains duplicates. The sampling procedure is slightly biased to favor people with higher-education backgrounds. For the Auto MPG dataset, we inject a sampling bias such that our training dataset mostly consists of examples from a particular location. We also sample this training data with replacement. This simulates a data collection process that combines multiple data sources.

Figure 3. UCI Adult Dataset classification test scores (F1 scores; the higher the better). Overall, the proposed techniques either improve on or perform as well as the baseline (Original) in terms of model generalization. It is interesting to see that a simple ML algorithm, like logistic regression, is more sensitive to unknown unknowns, or the covariate shift. The random forest classifier is known for better generalization, and we see that it performs better than the other classifiers even with the original data. For the neural network classifier, training with instance-specific weights is not applicable; thus, we do not show any result for WeightByUnk.

Figure 3 shows the classification scores on the Adult test dataset, using the proposed learning techniques and different ML algorithms. For some ML algorithms, like the neural network, training with instance-specific weights is not possible, so for such classifiers we do not show WeightByUnk scores. For the evaluation metric, we used the F1 score:

F1 = 2 * (precision * recall) / (precision + recall).

In binary classification with positive and negative class labels, precision is the fraction of correct positive predictions among all positive predictions, and recall is the fraction of correct positive predictions among all true positives. It is interesting to see that a simple ML algorithm, like logistic regression, is more sensitive to unknown unknowns; SynUnk (SMOTE) improves its generalization by a large margin. The random forest classifier is known for better generalization (Wyner et al., 2017), and we see that it performs better than the other classifiers even with the original data. As this algorithm is less sensitive to covariate shift, our techniques can help only so much. Once again, SynUnk (SMOTE) is a relatively safer technique in that it performs at least as well as Original with all three algorithms.

Figure 4. Comparison with existing techniques (Two-Stage LR and Two-Stage LR (SSB)) on the Adult Dataset. Both techniques look at the test data and use it to re-scale the original training data. Though given the target distribution (or its estimate) to adjust the importance of each training example, the existing techniques do not perform any better than the original. We suspect that the training and test sets provided by the UCI repo are indeed similarly distributed. On the other hand, adding synthetic examples improves the generalization.

Figure 4 illustrates how our learning-the-unknown techniques compare to existing techniques on the Adult dataset. We take two existing techniques based on the importance-sampling result (Shimodaira, 2000). Two-Stage LR first learns a logistic regression model f to classify whether an example x belongs to the training set or the test set (this requires access to the test data); next, the scale factor is computed as follows (Bickel et al., 2007):

w(x) ∝ f(x) / (1 − f(x))

where f(x) outputs the likelihood that x belongs to the test set. Two-Stage LR (SSB) is simpler in that it uses the (normalized) f(x) directly to scale each training example. The existing techniques do not perform better for this particular example, but adding synthetic examples did improve the model generalization.
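A minimal sketch of this two-stage weighting, assuming scikit-learn and eliding the regularization and calibration details of Bickel et al. (2007); `importance_weights` is an illustrative name:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def importance_weights(X_train, X_test):
    """Stage 1: train a logistic regression to separate training from test
    examples; stage 2: convert its probabilities into per-example weights
    proportional to p_test(x) / p_train(x)."""
    X = np.vstack([X_train, X_test])
    s = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])  # 1 = test
    f = LogisticRegression().fit(X, s)
    p_test = f.predict_proba(X_train)[:, 1]
    # odds ratio, rescaled by the train/test sample-size ratio
    return (p_test / (1.0 - p_test)) * (len(X_train) / len(X_test))

rng = np.random.default_rng(0)
X_tr = rng.normal(0.0, 1.0, (500, 1))  # biased "training" sample
X_te = rng.normal(1.5, 1.0, (500, 1))  # shifted "test" sample
w = importance_weights(X_tr, X_te)
```

Training examples that look more like the test data receive larger weights; note that, unlike our techniques, this baseline requires the test features.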

Figure 5. UCI Auto MPG Dataset model evaluation error on training and testing, and the impact of unknown unknowns (generalization error). The leftmost column (Target: City 1/2/3) is an ideal case where we train on the target distribution itself. Our techniques did not improve the final model generalization, in that the testing errors are not much different from the original case; in fact, WeightByUnk is much worse (we do not show its test error to keep the shared y-axis reasonably scaled). However, we see that SynUnk (KDE) improves the generalization error, i.e., the gap between training and testing error.

Figure 5 illustrates how our techniques can reduce the generalization error, i.e., the impact of unknown unknowns. We evaluated our techniques using the Auto MPG dataset, showing the evaluation error on both the training and the testing data over increasing training data sizes. The techniques do not perform better than the original in terms of the actual test error after training on the full training data; however, adding synthetic unknown examples (SynUnk (KDE)) results in a reduced generalization error. This means that the training error better represents what is to be expected on the actual testing data. This is highly desirable in practice, since an ML algorithm assumes that the training distribution follows the testing distribution and optimizes for the best training score/error; under covariate shift or sample selection bias, this can be problematic.
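SynUnk (KDE) can be sketched as follows, using scikit-learn's `KernelDensity`; the fixed bandwidth here is illustrative rather than the normal-reference rule of Henderson and Parmeter (2012), and the number of synthetic examples would come from the species estimate:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def synthesize_unknowns(X_train, n_unknown, bandwidth=0.5, seed=0):
    """Fit a Gaussian KDE to the observed features and sample `n_unknown`
    synthetic examples as stand-ins for the unseen training examples."""
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(X_train)
    return kde.sample(n_samples=n_unknown, random_state=seed)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X_unknown = synthesize_unknowns(X, n_unknown=50)
```

Because the KDE places its mass around observed examples, the synthetic points stay near the training data, which is why this variant behaves conservatively.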

4.3. Which Technique To Use?

Learning under covariate shift is not an easy problem; doing so without any reference dataset (e.g., the unlabeled test data) to detect the distributional shift and un-bias the training data is even more challenging. In this work, we have proposed learning techniques, combined with species estimation, that allow us to reason about the unknown missing mass.

As we have seen in this evaluation section, no single technique always performs better than working with the original data. Worse yet, there are cases where all techniques fail and sometimes perform worse than the original. For this reason, we have been conservative about generating synthetic training examples, interpolating or re-sampling only near the existing examples.

More study is needed to produce thorough guidelines, but as a rule of thumb, based on our experience, SynUnk (SMOTE) is worth trying whenever any sampling bias or covariate shift is suspected.
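For reference, the SMOTE-style interpolation behind SynUnk (Chawla et al., 2002) can be sketched as follows; `smote_like_examples` is an illustrative name, and the original SMOTE operates per minority class:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_examples(X, n_new, k=5, seed=0):
    """SMOTE-style interpolation: each synthetic example lies on the segment
    between a real example and one of its k nearest neighbors, so we only
    ever interpolate near observed data."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)                      # idx[:, 0] is the point itself
    base = rng.integers(len(X), size=n_new)        # pick a real example
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]  # and a neighbor
    gap = rng.random((n_new, 1))                   # position along the segment
    return X[base] + gap * (X[neigh] - X[base])

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X_syn = smote_like_examples(X, n_new=30)
```

Interpolating only between near neighbors is what makes this the safer choice in our experiments: synthetic examples never stray far from the observed data.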

5. Related Work

To the best of our knowledge, this is the first work to consider learning under covariate shift without access to the (unlabeled) test data, instead using species estimation techniques.

Learning under covariate shift or sample selection bias has been studied extensively (Bickel et al., 2007; Sugiyama et al., 2013; Zadrozny, 2004; Liu and Ziebart, 2014; Huang et al., 2007; Shimodaira, 2000), as training and test distributions diverge quite often, and for many reasons, in practice. As mentioned in Section 1.1, most known learning techniques under covariate shift require the (unlabeled) test data, which may not be available in many real applications.

The situation where training and test data follow different distributions is also related to transfer learning, domain adaptation, and dataset-shift adaptation (Pan et al., 2010; Daume III and Marcu, 2006; Sugiyama et al., 2017). In transfer learning, knowledge or a model built for one problem is transferred to a similar problem. Here, the conditional distribution p(y|x) is not constant, and the learner may even be asked to predict different labels; in that case, the model is re-trained, at least partially, to adapt to the new problem.

There are also techniques to detect covariate shift. The most intuitive and direct approach is to take the two distributions, training and testing, and use the Kullback-Leibler divergence (Wikipedia contributors, 2018a) or the Wald-Wolfowitz test (Wikipedia contributors, 2018b) to detect any significant data shift. Researchers have also looked at covariate shift detection where the distribution is non-stationary (Raza et al., 2015). In our case, we cannot test whether the training distribution really differs from the testing distribution, which is hidden.
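As an illustration of such a two-sample check, the sketch below uses the Kolmogorov-Smirnov test from SciPy as a stand-in for the KL and Wald-Wolfowitz approaches above; all of them require test data we do not have:

```python
import numpy as np
from scipy.stats import ks_2samp

def shift_detected(x_train, x_test, alpha=0.05):
    """Two-sample Kolmogorov-Smirnov test on a single feature: flag a
    shift when the p-value falls below `alpha`."""
    _, p_value = ks_2samp(x_train, x_test)
    return bool(p_value < alpha)

rng = np.random.default_rng(0)
x_train = rng.normal(0.0, 1.0, 1000)
x_shifted = rng.normal(1.0, 1.0, 1000)  # mean-shifted "deployment" data
```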

Species estimation techniques have been studied in prior work for distinct-count estimation, data quality estimation, and crowdsourced data enumeration (Haas et al., 1995; Chung et al., 2017, 2018; Trushkowsky et al., 2016). In this work, we use species estimation to model the unknown examples missing from the training data. This allows us to correct biased training data without the test data.
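The species-estimation idea can be illustrated with one classical estimator; the sketch below uses a Chao1-style formula for the number of unseen distinct items, while our pipeline follows the estimators in the cited literature (e.g., Good, 1953; Haas et al., 1995):

```python
from collections import Counter

def estimate_unseen(sample):
    """Chao1-style lower bound on the number of distinct items never
    observed: f1^2 / (2 * f2), where f1 and f2 count the items seen
    exactly once and exactly twice. (Good-Turing's f1/n similarly
    estimates the unseen items' probability mass.)"""
    freq = Counter(Counter(sample).values())
    f1, f2 = freq.get(1, 0), freq.get(2, 0)
    return f1 * f1 / (2.0 * f2) if f2 > 0 else f1 * (f1 - 1) / 2.0

# example ids observed after merging two overlapping data sources
sample = ["a", "a", "b", "c", "c", "d", "e", "f"]
```

The duplicate structure created by combining data sources is exactly what these estimators consume: many singletons relative to doubletons signals a large unseen population.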

6. Conclusion

Good model generalization is critical for any practical learning algorithm. Hoping that the training distribution closely follows the testing distribution, many ML algorithms simply optimize for the best training score, taken as the expected testing score on a hidden target population (i.e., the test data). Unfortunately, this is risky under covariate shift (i.e., when training and testing data are not sampled from the same distribution).

In this work, we have developed novel techniques for learning the unknown examples that account for the distributional shift. The key challenge is that we do not have access to any test data (or the ground-truth distribution), labeled or unlabeled. Prior work on learning under covariate shift compared the training and the (unlabeled) testing data to detect and correct the shift. Instead, we use the fact that training data is created by combining multiple sources, yielding duplicate instances, to apply species estimation; we then either explicitly model the missing unknown examples or scale the existing examples, all without using the test data in training. Our experimental results using real-world data and simulations show that the proposed techniques can help improve model generalization.

There are a number of interesting directions for future work. So far, we have focused on blackbox approaches where we work directly with the data; instead, we plan to work directly with the ML algorithms. We saw that not all ML models are created equal (i.e., some are more sensitive to unknown unknowns), and by studying this further, we hope to develop a more robust learning algorithm. We are also interested in developing a species-estimation-based covariate shift detector, as well as better estimators for the unknown examples' feature values.


  • Bickel et al. (2007) Steffen Bickel, Michael Brückner, and Tobias Scheffer. 2007. Discriminative learning for differing training and test distributions. In Proceedings of the 24th international conference on Machine learning. ACM, 81–88.
  • Bishop (2006) Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg.
  • Chawla et al. (2002) Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. 2002. SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research 16 (2002), 321–357.
  • Chung et al. (2017) Yeounoh Chung, Sanjay Krishnan, and Tim Kraska. 2017. A Data Quality Metric (DQM): How to Estimate the Number of Undetected Errors in Data Sets. Proc. VLDB Endow. 10, 10 (June 2017), 1094–1105. https://doi.org/10.14778/3115404.3115414
  • Chung et al. (2018) Yeounoh Chung, Michael Lind Mortensen, Carsten Binnig, and Tim Kraska. 2018. Estimating the Impact of Unknown Unknowns on Aggregate Query Results. ACM Trans. Database Syst. 43, 1, Article 3 (March 2018), 37 pages. https://doi.org/10.1145/3167970
  • Daume III and Marcu (2006) Hal Daume III and Daniel Marcu. 2006. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research 26 (2006), 101–126.
  • Dowling et al. (2016) J. Dowling et al. 2016. Final Report: National Advisory Committee on Racial, Ethnic, and Other Populations Administrative Records, Internet, and Hard to Count Population Working Group. Technical Report. U.S. Census Bureau. www2.census.gov/cac/nac/reports/2016-07-admin_internet-wg-report.pdf
  • Ferryman and Pitcan (2018) Kadija Ferryman and Mikaela Pitcan. 2018. Fairness in Precision Medicine. Technical Report. Data&Society. tinyurl.com/y7q5xnzs
  • Good (1953) I. J. Good. 1953. The Population Frequencies of Species and the Estimation of Population Parameters. Biometrika 40, 3/4 (1953).
  • Ha and Bunke (1997) Thien M. Ha and Horst Bunke. 1997. Off-Line, Handwritten Numeral Recognition by Perturbation Method. IEEE Trans. Pattern Anal. Mach. Intell. 19, 5 (May 1997), 535–539. https://doi.org/10.1109/34.589216
  • Haas et al. (1995) Peter J. Haas, Jeffrey F. Naughton, S. Seshadri, and Lynne Stokes. 1995. Sampling-Based Estimation of the Number of Distinct Values of an Attribute. In PVLDB. 311–322. http://www.vldb.org/conf/1995/P311.PDF
  • Henderson and Parmeter (2012) Daniel J Henderson and Christopher F Parmeter. 2012. Normal reference bandwidths for the general order, multivariate kernel density derivative estimator. Statistics & Probability Letters 82, 12 (2012), 2198–2205.
  • Hsueh et al. (2009) Pei-Yun Hsueh, Prem Melville, and Vikas Sindhwani. 2009. Data quality from crowdsourcing: a study of annotation selection criteria. In Proceedings of the NAACL HLT 2009 workshop on active learning for natural language processing. Association for Computational Linguistics, 27–35.
  • Huang et al. (2007) Jiayuan Huang, Arthur Gretton, Karsten M Borgwardt, Bernhard Schölkopf, and Alex J Smola. 2007. Correcting sample selection bias by unlabeled data. In Advances in neural information processing systems. 601–608.
  • Kohavi (1996) Ron Kohavi. 1996. Scaling up the accuracy of Naive-Bayes classifiers: a decision-tree hybrid. In KDD, Vol. 96. Citeseer, 202–207.
  • Lease (2011) Matthew Lease. 2011. On quality control and machine learning in crowdsourcing. Human Computation 11, 11 (2011).
  • Liu and Ziebart (2014) Anqi Liu and Brian Ziebart. 2014. Robust classification under sample selection bias. In Advances in neural information processing systems. 37–45.
  • Pan et al. (2010) Sinno Jialin Pan, Qiang Yang, et al. 2010. A survey on transfer learning. IEEE Transactions on knowledge and data engineering 22, 10 (2010), 1345–1359.
  • Quinlan (1993) J Ross Quinlan. 1993. Combining instance-based and model-based learning. In Proceedings of the tenth international conference on machine learning. 236–243.
  • Raza et al. (2015) Haider Raza, Girijesh Prasad, and Yuhua Li. 2015. EWMA model based shift-detection methods for detecting covariate shifts in non-stationary environments. Pattern Recognition 48, 3 (2015), 659–669.
  • Schafer and Graham (2002) Joseph L Schafer and John W Graham. 2002. Missing data: our view of the state of the art. Psychological methods 7, 2 (2002), 147.
  • Shimodaira (2000) Hidetoshi Shimodaira. 2000. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference 90, 2 (2000), 227–244.
  • Sugiyama et al. (2017) Masashi Sugiyama, Neil D Lawrence, Anton Schwaighofer, et al. 2017. Dataset shift in machine learning. The MIT Press.
  • Sugiyama et al. (2013) Masashi Sugiyama, Makoto Yamada, and Marthinus Christoffel du Plessis. 2013. Learning under nonstationarity: covariate shift and class-balance change. Wiley Interdisciplinary Reviews: Computational Statistics 5, 6 (2013), 465–477.
  • Trushkowsky et al. (2016) Beth Trushkowsky, Tim Kraska, and Purnamrita Sarkar. 2016. Answering enumeration queries with the crowd. Commun. ACM 59, 1 (2016), 118–127. https://doi.org/10.1145/2845644
  • Wikipedia contributors (2017) Wikipedia contributors. 2017. Overfitting — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/wiki/Overfitting [Online; accessed 22-July-2018].
  • Wikipedia contributors (2018a) Wikipedia contributors. 2018a. Kullback-Leibler divergence — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence [Online; accessed 22-July-2018].
  • Wikipedia contributors (2018b) Wikipedia contributors. 2018b. Wald-Wolfowitz runs test — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/wiki/Wald%E2%80%93Wolfowitz_runs_test [Online; accessed 22-July-2018].
  • Wyner et al. (2017) Abraham J Wyner, Matthew Olson, Justin Bleich, and David Mease. 2017. Explaining the success of adaboost and random forests as interpolating classifiers. The Journal of Machine Learning Research 18, 1 (2017), 1558–1590.
  • Zadrozny (2004) Bianca Zadrozny. 2004. Learning and evaluating classifiers under sample selection bias. In Proceedings of the twenty-first international conference on Machine learning. ACM, 114.