In machine learning systems, we typically divide training data into the training and validation datasets. The use of the validation dataset is to understand how the model is expected to perform on test data. However, if the feature distributions in the train and test datasets are different, the model performance on the validation and test datasets will be different.
At Uber, the automated machine learning system MaLTA (Machine Learning based Targeting Automation) is responsible for building, deploying and using various models for internal business stakeholders. These models often have different prediction goals and use different data sources, including both raw and derived features. Furthermore, they normally have different life cycles: a model could be built, deployed, used, deprecated and resurrected at different points in time. Such modeling aspects are of the utmost importance to the accuracy and robustness of automated machine learning systems. In practice, however, they are often outside the control of the automated machine learning system due to business and organizational challenges. The predictions from our models are often used in other systems many weeks in the future without any immediate feedback loop.
In MaLTA, we leverage multiple snapshots of the user’s past behavior, normally expressed through handcrafted derived features and metrics produced by different business units, to model the evolution of the user’s future behavior. This process, however, can cause concept drift, as the distributions of these derived features are prone to shifting over time. This issue is inherent to the data collection and modeling process at many businesses. To overcome this problem, we used feature selection through adversarial validation.
To automatically identify features affected by concept drift, and to improve model performance by “correcting” the distribution, we propose leveraging adversarial validation.
Adversarial validation is an approach to address the difference between the training and test datasets [16], particularly when these datasets are collected at different points in time. In adversarial validation, we train a binary classifier, the adversarial classifier, whose target is a dummy variable indicating whether a sample belongs to the test dataset. If the distributions of features in the training and test datasets are similar, the classification performance will be close to random guessing. On the other hand, if the distributions of features in the training and test datasets are different, the classification performance will be better than random guessing.
Once the adversarial classifier is trained, we can identify features or subsamples with potential concept drift, and remove the resulting bias either by excluding such features or by selecting subsamples of the training data whose feature distributions match those of the test dataset.
2 Related Work
2.1 Generative Adversarial Networks
The name adversarial validation comes from Generative Adversarial Networks (GAN) [5], which have become increasingly popular in content generation. A GAN consists of two neural networks, the generator and the discriminator. The generator produces new data resembling the training data provided, while the discriminator distinguishes between the original training data and newly generated data.
Similarly, adversarial validation consists of two models. One model is trained to predict a target variable with training data, while the other model, the adversarial classifier, is trained to distinguish between the training data and new test data. In adversarial validation, the predictions of the adversarial classifier are used to help the first model generalize better to the test data, rather than to generate new test data as in a GAN.
2.2 Heterogeneous Treatment Effect Estimation
The adversarial validation approach is similar to propensity score modeling in causal inference [12, 1]. In causal inference, propensity score modeling addresses the inhomogeneity between the treatment and control group data by training a classifier to predict whether a sample belongs to the treatment group. Rosenbaum and Rubin argue in [12] that, to balance the distributions between the treatment and control groups, it is sufficient to match on the one-dimensional propensity score alone, which is significantly more efficient than matching on the joint distribution of all confounding variables.
More generally, analogous to the model prediction problem with adversarial validation, heterogeneous treatment effect estimation aims to estimate the effect of a treatment variable on an outcome variable from observational data at the individual sample level. With randomized controlled trial (RCT) data, we can estimate the average treatment effect (ATE) by calculating the difference between the average outcomes of the treatment and control groups, since the control group forms a valid counterfactual prediction given its similar distribution to the treatment group. However, with observational data, the split between the treatment and control groups is not random and might depend on other confounding variables. It is generally known that a naive counterfactual prediction, such as the T-learner in [7], generates bias, because it relies on a model trained on one group to make counterfactual predictions for another group with a different distribution. Therefore, to address this key issue, different methods have been designed to use the propensity score as a critical part of the estimation procedure [7, 13, 9]. In adversarial validation, we also explore the use of the propensity score in the main prediction model to reduce the bias in the test dataset.
2.3 AutoML3 for Lifelong Machine Learning Challenge
At NeurIPS 2018, the AutoML3 for Lifelong Machine Learning challenge was hosted, where “the aim is assessing the robustness of methods to concept drift and its lifelong learning capabilities”.
The winning solutions at AutoML3 used similar approaches. First, a model was trained only with the training data and used to predict on the test data. Then, after observing the model performance on the test data, models were retrained with new training data consisting of the old training data and the latest labeled test data, using techniques such as incremental training and sliding windows to give more weight to the latest data.
3 Adversarial Validation Methods
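Adversarial validation can be used to detect and address the concept drift problem between the training and test data.
We start with a labeled training dataset $\{(X_{\mathrm{train}}, y_{\mathrm{train}})\}$ and an unlabeled test dataset $\{X_{\mathrm{test}}\}$ with an unknown conditional probability $P(y \mid X)$. Then, we train an adversarial classifier that predicts $P(\mathrm{test} \mid X)$ to separate train and test, and generate the propensity score $p(X) = P(\mathrm{test} \mid X)$ on both $X_{\mathrm{train}}$ and $X_{\mathrm{test}}$.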
The feature importance and propensity score from the adversarial classifier can be used to detect concept drift between the training and test data, and provide insights on the cause of the concept drift such as which features and subsamples in the training data are most different from ones in the test data.
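As a minimal sketch of this detection step, the snippet below trains an adversarial classifier on toy data and reports its AUC, propensity scores and feature importances. The function name `adversarial_validation`, the synthetic data, the choice of a scikit-learn random forest and the use of out-of-fold predictions are illustrative assumptions, not details taken from MaLTA.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict


def adversarial_validation(X_train, X_test, seed=0):
    """Return (AUC, propensity scores, feature importances) of a train-vs-test classifier."""
    X = np.vstack([X_train, X_test])
    # Dummy target: 1 if the row comes from the test set, 0 if it comes from the training set.
    is_test = np.concatenate([np.zeros(len(X_train), dtype=int),
                              np.ones(len(X_test), dtype=int)])
    clf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5, random_state=seed)
    # Out-of-fold probabilities, so no row is scored by a model that was fit on it (assumption).
    propensity = cross_val_predict(clf, X, is_test, cv=5, method="predict_proba")[:, 1]
    auc = roc_auc_score(is_test, propensity)
    # Impurity-based importances from a fit on all rows, used to rank drifting features.
    importances = clf.fit(X, is_test).feature_importances_
    return auc, propensity, importances


rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 3))
X_test = rng.normal(size=(1000, 3))
X_test[:, 2] += 2.0  # only feature 2 drifts in this toy example

auc, propensity, importances = adversarial_validation(X_train, X_test)
print(f"adversarial AUC = {auc:.3f}")
print("most drifted feature:", int(np.argmax(importances)))
```

An AUC near 0.5 would suggest similar feature distributions, while an AUC well above 0.5 together with a dominant feature importance points at the drifting feature.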
In addition to concept drift detection, here, we propose three adversarial validation methods that address concept drift between the training and test data, and generate predictions adapted to the test dataset.
3.1 Automated Feature Selection
If the features from the train and test data are distributed similarly, we expect the adversarial classifier to perform no better than random guessing. However, if the adversarial classifier can distinguish between train and test data well (i.e. an AUC score well above 50%), the top features from the adversarial classifier are potential candidates exhibiting concept drift between the train and test data. We can then exclude these features from model training, based on the feature importance ranking.
Such feature selection can be automated by determining the number of features to exclude based on the performance of adversarial classifier (e.g. AUC score) and raw feature importance values (e.g. mean decrease impurity (MDI) in Decision Trees) as follows:
1. Train an adversarial classifier that predicts $P(\mathrm{test} \mid X)$ to separate train and test.
2. If the AUC score of the adversarial classifier is greater than an AUC threshold, remove the features that rank within the top $x\%$ of the remaining features by feature importance and whose raw feature importance values are higher than a threshold (e.g. MDI $>$ 0.1).
3. Go back to Step 1.
4. Once the adversarial AUC drops below the AUC threshold, train an outcome classifier with the selected features and the original target variable.
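The loop above can be sketched as follows. This is an illustration under stated assumptions rather than the MaLTA implementation: the function name `select_stable_features`, the default thresholds, the scikit-learn decision tree, and the use of out-of-fold AUC are all hypothetical choices, and the extra stopping rule (when no top-ranked feature exceeds the importance threshold) is added only to guarantee termination.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier


def select_stable_features(X_train, X_test, auc_threshold=0.6, top_frac=0.2,
                           importance_threshold=0.1, seed=0):
    """Iteratively drop the features that best separate train from test."""
    X = np.vstack([X_train, X_test])
    is_test = np.concatenate([np.zeros(len(X_train), dtype=int),
                              np.ones(len(X_test), dtype=int)])
    kept = list(range(X.shape[1]))
    while kept:
        clf = DecisionTreeClassifier(max_depth=5, min_samples_leaf=20, random_state=seed)
        # Step 1: adversarial classifier, scored on out-of-fold predictions (assumption).
        proba = cross_val_predict(clf, X[:, kept], is_test, cv=5,
                                  method="predict_proba")[:, 1]
        if roc_auc_score(is_test, proba) <= auc_threshold:
            break  # Step 4: train and test are no longer separable enough
        # scikit-learn reports normalized mean decrease in impurity (MDI).
        importances = clf.fit(X[:, kept], is_test).feature_importances_
        n_top = max(1, int(np.ceil(top_frac * len(kept))))
        top = np.argsort(importances)[::-1][:n_top]
        # Step 2: drop top-ranked features whose raw importance exceeds the threshold.
        drop = {kept[i] for i in top if importances[i] > importance_threshold}
        if not drop:
            break  # nothing qualifies for removal; stop instead of looping forever
        kept = [f for f in kept if f not in drop]  # Step 3: repeat with the remainder
    return kept


rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))
X_test = rng.normal(size=(1000, 5))
X_test[:, [0, 3]] += 2.0  # features 0 and 3 drift in this toy example

kept = select_stable_features(X_train, X_test)
print("features kept for the outcome classifier:", kept)
```

The returned column indices would then be used to train the outcome classifier on the original target variable.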
Figure 3 shows the diagram of adversarial validation with automated feature selection in MaLTA.
Automated feature selection prevents a model from overfitting to features with potential concept drift, and, as a result, leads to an outcome model that generalizes well on the test data. There is a trade-off for this method between losing information by dropping features from the model and reducing the size of training data using other approaches proposed below.
3.2 Validation Data Selection
With validation data selection, we construct a new validation dataset by selecting from the training data so that the empirical distribution of its features is similar to that of the test data, $P(X_{\mathrm{val}}) \approx P(X_{\mathrm{test}})$. This way, model evaluation metrics on the validation set should be similar to those on the test set: if the model works well on the validation data, it should work well on the test data.
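Specifically, we apply the propensity score matching (PSM) methodology [1] to reduce the selection bias due to concept drift as follows:
1. Train an adversarial classifier that predicts $P(\mathrm{test} \mid X)$ to separate train and test, and generate the propensity score $p(X)$ on both $X_{\mathrm{train}}$ and $X_{\mathrm{test}}$.
2. Run nearest-neighbor propensity score matching on the propensity scores and check the standardized mean difference (SMD), defined below, for the propensity scores and covariates:
$$\mathrm{SMD} = \frac{\bar{x}_{\mathrm{test}} - \bar{x}_{\mathrm{train}}}{\sqrt{\left(s^2_{\mathrm{test}} + s^2_{\mathrm{train}}\right)/2}}$$
where $\bar{x}$ and $s^2$ denote the sample mean and variance of a propensity score or covariate in the test data and in the matched training data. We consider the matched data to be balanced if $|\mathrm{SMD}|$ is below a small threshold (0.1 is a common convention). Tune the nearest neighbor threshold if necessary to achieve balance.
3. Pick the matched examples (e.g. 20% of train) as the new validation dataset and use the remaining examples as the training data to train the model.
It is possible that there are not enough matched samples from the training dataset because the training and test features are significantly different. In that case, we can select the subset of training data with the highest propensity scores as in [16].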
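A minimal sketch of this matching procedure is shown below. The greedy one-to-one matching without replacement, the caliper value, the logistic regression used as the adversarial classifier, and the helper names `smd` and `match_on_propensity` are assumptions made for illustration; the SMD helper uses the standard pooled-variance definition.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def smd(a, b):
    """Standardized mean difference between two 1-D samples (pooled-variance form)."""
    pooled_sd = np.sqrt((np.var(a, ddof=1) + np.var(b, ddof=1)) / 2.0)
    return (np.mean(a) - np.mean(b)) / pooled_sd


def match_on_propensity(p_train, p_test, caliper=0.02):
    """Greedy nearest-neighbor matching without replacement.

    Returns the indices of training rows matched to some test row within the caliper.
    """
    available = np.ones(len(p_train), dtype=bool)
    matched = []
    for p in p_test:
        dist = np.where(available, np.abs(p_train - p), np.inf)
        j = int(np.argmin(dist))
        if dist[j] <= caliper:  # unmatched test rows are simply skipped
            matched.append(j)
            available[j] = False
    return np.array(matched, dtype=int)


rng = np.random.default_rng(0)
X_train = rng.normal(size=(5000, 2))
X_test = rng.normal(size=(200, 2))
X_test[:, 0] += 1.0  # feature 0 drifts in this toy example

# Step 1: adversarial classifier and propensity scores P(test | X).
X = np.vstack([X_train, X_test])
is_test = np.concatenate([np.zeros(len(X_train), dtype=int),
                          np.ones(len(X_test), dtype=int)])
propensity = LogisticRegression().fit(X, is_test).predict_proba(X)[:, 1]
p_train, p_test = propensity[:len(X_train)], propensity[len(X_train):]

# Steps 2 and 3: matched training rows become the validation set, the rest stay in train.
val_idx = match_on_propensity(p_train, p_test)
train_idx = np.setdiff1d(np.arange(len(X_train)), val_idx)

smd_before = smd(X_test[:, 0], X_train[:, 0])
smd_after = smd(X_test[:, 0], X_train[val_idx, 0])
print(f"SMD of feature 0 before matching: {smd_before:.2f}, after: {smd_after:.2f}")
```

If the SMD after matching is still too large, the caliper can be tightened at the cost of fewer matched rows.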
3.3 Inverse Propensity Weighting
As an alternative to the validation data selection method, which extracts a matched validation set while leaving the training set unchanged, one can also use the inverse propensity weighting (IPW) [1] technique to generate weights for the training set for weighted training. Specifically, we derive the weight of each training sample from its propensity score as follows:
$$w(X) = \frac{p(X)}{1 - p(X)}, \qquad p(X) = P(\mathrm{test} \mid X)$$
Therefore, the weighted distribution of the features in the training data follows
$$w(X)\, P(X \mid \mathrm{train}) = \frac{P(\mathrm{test} \mid X)}{P(\mathrm{train} \mid X)}\, P(X \mid \mathrm{train}) \propto P(X \mid \mathrm{test}),$$
which reproduces the distribution of the features in the test data. We trim the weights for observations with propensity scores close to 1 to avoid the pathological case of over-reliance on those observations and the consequent reduction of the effective sample size.
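The effect of trimming can be illustrated with a small numerical sketch. It assumes the odds-form weight $p/(1-p)$ with $p$ the probability of belonging to the test set; the trimming cap of 0.9, the normalization to mean 1, and the function names are hypothetical choices for illustration only.

```python
import numpy as np


def inverse_propensity_weights(p_train, max_propensity=0.9):
    """Odds-form weights p / (1 - p), with propensity scores capped before weighting."""
    p = np.clip(p_train, 0.0, max_propensity)  # trim scores that are close to 1
    w = p / (1.0 - p)
    return w / w.mean()  # normalize so that the weights average to 1


def effective_sample_size(w):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    return w.sum() ** 2 / (w ** 2).sum()


p_train = np.array([0.1, 0.5, 0.9, 0.999])  # toy propensity scores for four training rows

w_trimmed = inverse_propensity_weights(p_train, max_propensity=0.9)
w_untrimmed = inverse_propensity_weights(p_train, max_propensity=1.0)

print("ESS without trimming:", round(effective_sample_size(w_untrimmed), 2))
print("ESS with trimming:   ", round(effective_sample_size(w_trimmed), 2))
```

Without trimming, the single row with a propensity score of 0.999 dominates and the effective sample size collapses toward 1. The resulting weights would typically be passed as `sample_weight` when fitting the outcome classifier.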
4 Experiments
The three adversarial validation methods, feature selection, validation selection, and inverse propensity weighting (IPW), are applied to seven datasets from the AutoML3 for Lifelong Machine Learning Challenge as well as to the MaLTA dataset. For adversarial validation with feature selection, three different algorithms, Decision Trees (DT) [14], Random Forests (RF) [2], and Gradient Boosted Decision Trees (GBDT) [4], are used for model training.
For outcome classifiers with the original target variables as target, we use LightGBM [6] to train GBDT models. For adversarial classifiers with the train / test split as target, we use LightGBM for GBDT models, and scikit-learn [10] for DT and RF models.
As feature preprocessing, missing values in numerical features are replaced with zero for scikit-learn models that do not handle missing values. Categorical features are label-encoded with missing values as a new label.
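The preprocessing described above might look like the following pandas sketch. The helper name `preprocess`, the `"__missing__"` placeholder and the toy columns are assumptions for illustration; in practice the same category-to-integer mapping would have to be shared between the training and test data.

```python
import numpy as np
import pandas as pd


def preprocess(df, categorical_cols):
    """Zero-fill numerical columns and label-encode categorical ones."""
    out = df.copy()
    for col in out.columns:
        if col in categorical_cols:
            # Treat missing values as their own category, then map categories to integers.
            filled = out[col].astype(object).where(out[col].notna(), "__missing__")
            out[col] = pd.Categorical(filled).codes
        else:
            # Replace missing numerical values with zero.
            out[col] = out[col].fillna(0)
    return out


df = pd.DataFrame({
    "trips": [3.0, np.nan, 7.0],   # numerical feature with a missing value
    "city": ["sf", None, "nyc"],   # categorical feature with a missing value
})
out = preprocess(df, categorical_cols=["city"])
print(out)
```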
4.1 AutoML3 for Lifelong Machine Learning Challenge Datasets
At the AutoML3 challenge, participants started with a labeled training dataset, and a series of test datasets were provided sequentially. Once participants submitted predictions for a test dataset, the labels for the test dataset were released, and the next test dataset without label was available (see Figure 4). The model performance was determined by taking the average performance across all test datasets.
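The sequential protocol can be sketched as a simple loop. The code below is an illustrative assumption rather than the challenge's evaluation code: it uses a logistic regression as a stand-in model, synthetic data, and the retrain-on-released-labels strategy that Section 2.3 attributes to the winning solutions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def lifelong_evaluation(X_train, y_train, test_batches):
    """Predict each test batch in order, then absorb its labels once they are released."""
    X_seen, y_seen = X_train, y_train
    scores = []
    for X_batch, y_batch in test_batches:
        model = LogisticRegression().fit(X_seen, y_seen)
        # Predictions are made before the labels of the batch are used for training.
        scores.append(roc_auc_score(y_batch, model.predict_proba(X_batch)[:, 1]))
        # Labels are released after submission, so the batch joins the training pool.
        X_seen = np.vstack([X_seen, X_batch])
        y_seen = np.concatenate([y_seen, y_batch])
    return float(np.mean(scores)), scores


rng = np.random.default_rng(0)


def make_batch(n):
    """Toy binary classification data where feature 0 drives the label."""
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)
    return X, y


X_train, y_train = make_batch(500)
test_batches = [make_batch(300) for _ in range(3)]

mean_auc, scores = lifelong_evaluation(X_train, y_train, test_batches)
print(f"average test AUC over {len(scores)} batches: {mean_auc:.3f}")
```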
We use the public datasets made available during the feedback phase at AutoML3. They consist of seven datasets: ADA, RL, AA, B, C, D and E. Datasets ADA and RL have one training and three test datasets, while datasets AA, B, C, D and E have one training and four test datasets. All seven datasets have binary target variables.
For demonstration purposes, we exclude date-time and multi-value categorical features, and use only numerical and categorical features for model training. Table 1 describes the datasets: the number of features, the sizes of the training and test datasets, and the percentage of missing values after label encoding of categorical features.
| | ADA | RL | AA | B | C | D | E |
|---|---|---|---|---|---|---|---|
| # of Total Features | 48 | 22 | 82 | 25 | 79 | 76 | 34 |
| Training Dataset Size | 4K | 31K | 10M | 1.6M | 1.8M | 1.5M | 16M |
| Test Dataset #1 Size | 41K | 5K | 9M | 1.7M | 1.9M | 1.5M | 17M |
| Test Dataset #2 Size | 41K | 5K | 9M | 1.6M | 1.3M | 1.6M | 18M |
| Test Dataset #3 Size | 41K | 15K | 10M | 1.4M | 1.6M | 1.5M | 18M |
| Test Dataset #4 Size | - | - | 9M | 1.7M | 1.8M | 1.6M | 18M |
| Missing Values % | 0% | 4% | 0% | 2% | 3% | 1% | 0% |
4.2 MaLTA Dataset
MaLTA constructs automated machine learning models with real datasets to target users who show a higher propensity toward being cross-sold into new products and services. For this experiment, we used data on Uber users in the Asia-Pacific region who had been users of product X, together with whether they used new products or services between November 2019 and February 2020.
The dataset consists of four snapshots at four different timestamps over the above period. Each snapshot has 309 features, including 297 numerical features, 5 categorical features and 7 date-time features, predominantly describing how each user has interacted with Uber’s different services. 304 features have missing values, and 53 of them have more than 90% of the data points missing. Overall, the dataset has 28% missing values.
The first snapshot is used for modeling, and the performance of the model is measured by the average over the remaining three snapshots.
5 Results
5.1 AutoML3 for Lifelong Machine Learning Challenge Datasets
Table 2 shows the average test AUC scores of outcome classifiers across all methods on AutoML3 datasets. All AutoML3 datasets except ADA show perfect (100%) or almost perfect adversarial validation AUC scores (see Table 2), indicating different feature distributions between the training and test datasets. On the other hand, ADA shows close-to-random (50%) adversarial validation AUC score, indicating the consistent feature distribution between the training and test datasets.
Among the three adversarial validation methods, feature selection outperforms validation selection and IPW as well as the baseline without adversarial validation. Validation selection and IPW perform even worse than the baseline. Among the three training algorithms for feature selection, DT and RF outperform GBDT except on dataset B, where GBDT performs slightly better than DT and RF. DT and RF score 1.6% to 4.6% better than the baseline in average test AUC (closing the gap by 2.5% to 10.0%).
| Method | ADA | RL | AA | B | C | D | E |
|---|---|---|---|---|---|---|---|
| Feature Selection (GBDT) | 92.13 | 53.73 | 72.18 | 60.16 | 67.35 | 62.08 | 83.73 |
| Feature Selection (DT) | 92.13 | 64.80 | 72.26 | 59.62 | 70.60 | 65.56 | 83.83 |
| Feature Selection (RF) | 92.13 | 64.69 | 72.47 | 59.77 | 70.44 | 65.32 | 83.97 |
5.2 MaLTA Dataset
Table 3 shows training, validation and test AUC scores of outcome classifiers across all methods on the MaLTA dataset. The adversarial classifier built with the Training and Test1 datasets has a perfect (100%) AUC score, which indicates that concept drift exists across different time snapshots.
Among the three adversarial validation methods, feature selection with the GBDT algorithm outperforms the others. It enables the outcome classifier to be trained on fewer features (281 instead of 309), yet achieves a 3.9% increase in average test AUC (closing the gap by 6.3%) over the baseline without adversarial validation. Among the three training algorithms for feature selection, GBDT outperforms DT and RF, unlike on the AutoML3 datasets, where DT and RF outperform GBDT.
This difference in the results of feature selection with GBDT between the AutoML3 and MaLTA datasets might be caused by the high proportion of missing values (28%) as well as the large number of features (309) in the MaLTA dataset compared to the AutoML3 datasets (0% to 4% missing values and 22 to 82 features). LightGBM’s GBDT handles missing values differently from scikit-learn’s DT and RF. scikit-learn’s DT and RF require imputation, and we replace missing values with zero. On the other hand, LightGBM’s GBDT ignores missing values during splitting in each tree node and classifies them into the optimal default direction [3]. This way, an adversarial classifier with LightGBM’s GBDT might be able to detect a feature whose distribution differs between the training and test datasets, even when the feature has many missing values.
Adversarial validation based feature selection provides an automated machine learning system with a way to identify concept drift automatically and improve model performance. This frees practitioners from manually investigating the distributions of hundreds of features, and indicates which features deserve further examination.
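| Method | Training | Validation | Test1 | Test2 | Test3 |
|---|---|---|---|---|---|
| Feature Selection (GBDT) | 70.52 | 68.26 | 65.04 | 65.05 | 63.05 |
| Feature Selection (DT) | 70.40 | 68.90 | 61.27 | 63.32 | 61.67 |
| Feature Selection (RF) | 70.02 | 68.81 | 61.25 | 63.77 | 62.06 |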
6 Conclusion
This paper proposes a set of new approaches that address the issue of concept drift in automated machine learning systems. The novelty of these approaches derives from the relation between concept drift and adversarial learning. Using datasets from AutoML3 Lifelong Machine Learning Challenge and Uber’s automated machine learning system, MaLTA, we demonstrate that one can improve the model performance in a large scale machine learning setting by adjusting the training process through 1) adversarial feature selection, 2) adversarial validation selection, or 3) adversarial inverse propensity weighting, all of which leverage the distributional differences between the training and test datasets.
By comparing these different approaches, we show that adversarial feature selection consistently outperforms both adversarial validation selection and adversarial inverse propensity weighting, even though the latter two methods, which rely on propensity score estimation, are more widely used in practice [16] and in the literature [12, 1]. We conjecture that in large scale machine learning, the high noise level in the propensity score brings more variance in the estimation, so that even though it reduces the bias, overall it increases the total risk. In contrast, adversarial feature selection, which is analogous to model selection and outlier removal in regular machine learning, is more robust to errors in estimating the distributional differences. In addition, the study shows the importance of choosing the right adversarial detection procedure. It suggests that, for the performance of the subsequent training task, one should put emphasis on extracting the features that discriminate between the train and test data, instead of just minimizing the classification error.
The study also opens new questions for further investigation. First, how does one design a validation process for optimizing the hyper-parameters that specifically relate to the mitigation of the concept drift issue? In a regular machine learning setting, one benefits from the assumption that the training set is a representative sample of the entire population. In the concept drift world, one possible solution is to assume that the concept drift follows some underlying process, so that the use of backtesting as validation can be rationalized. Second, even though adversarial inverse propensity weighting has relatively inferior performance, it would be interesting to find ways to reduce the noise of the inverse propensity weights by combining direct outcome regression with adversarial feature selection, similar to doubly robust estimators in the causal inference literature [11]. Lastly, as shown in the study, the adversarial detection/estimation method plays a critical role in the overall procedure. A promising direction is to further investigate the performance of different first-stage estimation methods, or a joint estimation of adversarial detection and adversarial training in a single neural network (e.g. the neural network architecture in [13] for individual treatment effect estimation).
We would like to express our appreciation to Zhenyu Zhao, Will (Youzhi) Zou and Byung-Woo Hong for their valuable and constructive feedback during the development of this work.
[1] (2011) An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies. Multivariate Behavioral Research 46 (3), pp. 399–424.
[2] (2001) Random forests. Machine Learning 45 (1), pp. 5–32.
[3] (2016) XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794.
[4] (2001) Greedy function approximation: a gradient boosting machine. Annals of Statistics, pp. 1189–1232.
[5] (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
[6] (2017) LightGBM: a highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pp. 3146–3154.
[7] (2019) Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences of the United States of America 116 (10), pp. 4156–4165.
[8] (2019) Towards AutoML in the presence of Drift: first results. arXiv preprint arXiv:1907.10772.
[9] (2017) Quasi-oracle estimation of heterogeneous treatment effects. arXiv preprint arXiv:1712.04912.
[10] (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12 (Oct), pp. 2825–2830.
[11] (1995) Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association 90 (429), pp. 122–129.
[12] (1983) The central role of the propensity score in observational studies for causal effects. Biometrika 70 (1), pp. 41–55.
[13] (2017) Estimating individual treatment effect: generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 3076–3085.
[14] (2009) CART: classification and regression trees. In The Top Ten Algorithms in Data Mining, pp. 193–216.
[15] (2020) Automatically optimized gradient boosting trees for classifying large volume high cardinality data streams under concept drift. In The NeurIPS’18 Competition, pp. 317–335.
[16] (2016) Adversarial validation, part one.