Missing values occur in datasets for many reasons: an erroneous data entry process, irregular data collection, or values intentionally not supplied by users, especially when they fill in forms with non-mandatory fields. Missing values make downstream tasks, such as predicting the value of a target variable, much more challenging. Since the performance of algorithms degrades in the presence of sparsity, and algorithms are often not even designed to handle missing data, one may either discard rows that have missing values altogether, which causes problems when the sparsity level is high, or apply a data imputation technique before running the algorithm. The situation is further aggravated when one tries to detect concept drifts on such sparse datasets. We experienced this firsthand with our deployed machine learning solution for change risk assessment Gupta et al. (2021). This solution is currently operational for changes targeted across Walmart's US, UK and Mexico stores, US Sam's Clubs and e-Commerce. Post deployment, the production team has confirmed that the number of major incidents was reduced by 33%, with net savings running into multiple millions of dollars as of Q2 of 2021. Other factors, e.g., software design changes, may have contributed to these savings; however, it is acknowledged that our ML-based prediction system has been the primary contributor. To improve the performance further, we studied concept drift detection in the presence of sparsity, and based on our findings, the main contributions of this work are as follows:
Provide an empirical guideline on applying data imputation given a data distribution and a sparsity pattern.
Suggest various metrics for concept drift detection – we found that some of the metrics in literature can be misleading in practice, and propose some new ones that we found helpful.
Provide a majority voting based ensemble of concept drift detectors for abrupt and gradual drifts that performs well across the whole spectrum of concept drift detection metrics – a feat that individual concept drift detectors may not attain.
2.1 Types of Missingness
The nomenclature for the types of missing values was introduced by Rubin (1976) and is considered the de facto standard in any kind of statistical analysis with incomplete data. This nomenclature distinguishes between three cases:
Missing Completely At Random (MCAR). In MCAR, the missingness is completely independent of the data.
Missing At Random (MAR).
In MAR, the probability of missingness depends only on observed values.
Missing Not At Random (MNAR). In MNAR, the probability of missingness depends on the unobserved values, and therefore it leads to important biases in the data.
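The three mechanisms can be illustrated with a small simulation. The sketch below uses a hypothetical two-feature example (age and income, names of our choosing) in which income may go missing under each mechanism; under MCAR the observed mean of income stays unbiased, whereas the MAR and MNAR masks bias it downward because high-income rows vanish more often.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
age = rng.normal(40, 10, n)                # always observed
income = 20 + age + rng.normal(0, 8, n)    # will receive missing values

# MCAR: missingness is independent of the data.
mask_mcar = rng.random(n) < 0.2

# MAR: missingness depends only on the *observed* feature (age).
mask_mar = rng.random(n) < np.where(age > 45, 0.4, 0.05)

# MNAR: missingness depends on the *unobserved* value itself
# (e.g., high earners decline to report their income).
mask_mnar = rng.random(n) < np.where(income > 70, 0.5, 0.05)

income_mcar = np.where(mask_mcar, np.nan, income)
income_mar = np.where(mask_mar, np.nan, income)
income_mnar = np.where(mask_mnar, np.nan, income)
```

Comparing `np.nanmean` of each masked array against `income.mean()` shows the bias introduced by MAR and MNAR but not by MCAR.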
2.2 Data Imputation Techniques
These techniques can be broadly divided into two categories. First, techniques that look at a single feature at a time: mean, median, mode and zero imputation replace the missing values with the mean, the median, the most frequent value (mode) and the constant zero, respectively. Second, techniques that also exploit correlations with other features: k-nearest neighbours (kNN), the iterative imputer van Buuren and Groothuis-Oudshoorn (2011), soft impute Hastie et al. (2015) and optimal transport Muzellec et al. (2020).
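Several of these techniques are available in scikit-learn; the sketch below shows the single-feature strategies and two of the correlation-aware ones (soft impute and optimal transport are not part of scikit-learn). The tiny matrix is illustrative only.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, np.nan], [5.0, 6.0]])

# Single-feature strategies: replace with a per-column statistic or constant.
mean_imp = SimpleImputer(strategy="mean")
median_imp = SimpleImputer(strategy="median")
mode_imp = SimpleImputer(strategy="most_frequent")
zero_imp = SimpleImputer(strategy="constant", fill_value=0)

# Correlation-aware strategies: use the other features to fill the gaps.
knn_imp = KNNImputer(n_neighbors=2)          # kNN imputation
iter_imp = IterativeImputer(random_state=0)  # MICE-style iterative imputer

X_mean = mean_imp.fit_transform(X)   # column means fill the NaNs
X_knn = knn_imp.fit_transform(X)     # neighbour averages fill the NaNs
```

Here `X_mean` fills the first column's gap with 3.0 (the mean of 1, 3 and 5) and the second column's gap with 4.0.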
2.3 Concept Drift
2.4 Concept Drift Detection Algorithms
The concept drift detection algorithms discussed here are as follows. Page-Hinkley (PH) Page (1954) signals concept drift when the cumulative difference of the observed values from the mean crosses a user-defined threshold. Drift Detection Method (DDM) Gama et al. (2004) detects concept drifts in streams by analyzing the error rate and its standard deviation; if the error rate increases, DDM concludes that the current predictor is outdated. Early Drift Detection Method (EDDM) monitors the distance between two consecutive errors rather than the error rate: while the concepts remain stationary the distance grows larger, whereas a decrease in the distance signals drift. Hoeffding's inequality based Drift Detection Method (HDDM) Frías-Blanco et al. (2015) has two variants: HDDMA, which uses moving averages to detect drifts, and HDDMW, which uses exponentially weighted moving averages. ADaptive WINdowing (ADWIN) Bifet and Gavaldà (2007) maintains two sub-windows, one for historic data and the other for new data; a significant difference between the means of these sub-windows indicates a concept drift. Kolmogorov-Smirnov WINdowing (KSWIN) Raab et al. (2020) maintains a sliding window of fixed size in which the most recent samples represent the latest concept; a concept drift is detected if a Kolmogorov-Smirnov test between the distributions of the older and the most recent samples in the window yields a significant difference.
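To make the flavor of these detectors concrete, the following is a minimal from-scratch sketch of the Page-Hinkley test (libraries such as river ship production implementations of these detectors); the parameters delta and lambda_ are illustrative choices, not values from this paper.

```python
class PageHinkley:
    """Minimal Page-Hinkley test: signals drift when the cumulative
    deviation of the observations from their running mean exceeds lambda_."""

    def __init__(self, delta=0.005, lambda_=100.0):
        self.delta = delta      # tolerance for small fluctuations
        self.lambda_ = lambda_  # user-defined drift threshold
        self.mean = 0.0
        self.cum = 0.0          # cumulative deviation m_t
        self.cum_min = 0.0      # running minimum of m_t
        self.n = 0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n       # incremental mean
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return (self.cum - self.cum_min) > self.lambda_  # True => drift


# A stream whose mean jumps from ~0 to ~5 at instance 1000.
import random
random.seed(0)
stream = [random.gauss(0, 1) for _ in range(1000)] + \
         [random.gauss(5, 1) for _ in range(1000)]

ph = PageHinkley(delta=0.005, lambda_=100.0)
drift_at = next((i for i, x in enumerate(stream) if ph.update(x)), None)
```

With this abrupt shift, `drift_at` lands shortly after instance 1000; the gap between the signal and the true change point is exactly the detection delay discussed in the metrics section below.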
2.5 Metrics for Concept Drift
It is important to note that the first two metrics apply to the model that makes the predictions, while the latter four apply to the concept drift detector (CDD). Note that the last two metrics have been proposed by us.
Prequential Error: For a given sequence of n instances, it is the accumulated sum S = Σ_{i=1}^{n} L(ŷ_i, y_i) of a loss function L between the prediction ŷ_i and the observed value y_i.
Accuracy: It is the fraction of the accurate predictions made to the total number of predictions.
Average Detection Delay (ADD): It is the average of the distances (in terms of instances) between the actual and the detected concept drifts.
True Positive Rate (TPR): Fraction of the detected drifts that were true, i.e., within the acceptable detection interval (ADI), to the total number of drifts detected.
Due to detection delays, in our experiments, we keep the ADI four times the drift width as per existing literature Pesaranghader and Viktor (2016); Yan (2020).
True Positives per Drift (TPD): Fraction of detected drifts which were true to the total number of actual drifts.
It is possible that a CDD declares two or more drifts within the ADI, and accordingly, its TPR may be high, which can be misleading; therefore, we introduce TPD, whose optimal value is 1 (though it can be higher or lower), i.e., the CDD declares a drift exactly once for every actual drift in the dataset.
Drift Count: Total number of actual drifts detected.
Since detecting multiple drifts within the ADI of one actual drift may offset the case where no drift is detected for another actual drift, we include the drift count in our set of metrics.
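The CDD-side metrics above can be computed from two lists of instance indices. The sketch below is our illustration (function and variable names are ours, not from the paper); it follows the text's convention that the ADI is four times the drift width.

```python
def drift_metrics(actual, detected, drift_width, adi_factor=4):
    """Compute ADD, TPR, TPD and drift count from the positions (instance
    indices) of the actual drifts and of the detections. A detection is
    'true' if it falls within the ADI after some actual drift."""
    adi = adi_factor * drift_width
    true_detections = []   # detections inside some drift's ADI
    delays = []            # delay of the first detection per caught drift
    drifts_caught = 0

    for a in actual:
        hits = [d for d in detected if a <= d <= a + adi]
        if hits:
            drifts_caught += 1
            delays.append(min(hits) - a)
        true_detections.extend(hits)

    add = sum(delays) / len(delays) if delays else float("inf")
    tpr = len(true_detections) / len(detected) if detected else 0.0
    tpd = len(true_detections) / len(actual)
    return {"ADD": add, "TPR": tpr, "TPD": tpd, "drift_count": drifts_caught}


metrics = drift_metrics(actual=[1000, 2000],
                        detected=[1040, 1100, 2500], drift_width=50)
```

This toy case also shows why the drift count matters: the detector fires twice inside the first drift's ADI and misses the second drift entirely, yet its TPD is still the "optimal" 1.0, while the drift count of 1 exposes the miss.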
3 Background & Motivation
The concept drift problem exists in many real-world situations, such as automated change risk assessment in technology-driven industry. It results in poor and degrading performance of predictive models that assume a static relationship between the input and output variables.
One brute-force way to mitigate this problem is to periodically retrain the static model with more recent data. However, the costs involved (computational, labor and implementation) put up a significant impediment to retraining the model frequently.
A more elegant way to approach this problem is to algorithmically detect concept drift in the data and retrain the model depending on the outcome of the algorithm. There are different categories of drift detection algorithms, such as error rate-based algorithms, distribution-based algorithms, and a few others Lu et al. (2019). However, such drift detection methods suffer from severe impairment in the presence of data sparsity. At the core of this problem lies the non-consideration of the inaccuracies that imputation methods introduce when estimating the underlying data distribution Liu et al. (2020). A fuzzy distance estimation based method for detecting concept drifts in the presence of missing values has been proposed in Liu et al. (2020). To formally discuss the impact of data sparsity on drift detection methods, we need to delve into the following questions:
Question 1. How can imputing missing values perturb the original distribution of the data?
Question 2. How can a drift detection method be impacted by the perturbed distribution of data?
To analyze Question 1, we first introduce a few notations. We assume that we are given a sample space X, a finite set of data points. From this sample space, we are given a set of labeled points S. We assume these labeled points are drawn i.i.d. from some unknown target distribution D over X. A data point x is characterized by a set of features x_j for j = 1, …, d. We define the missing indicator variable M_j, which takes the value 1 when x_j is missing and 0 otherwise.
In any setting of missingness, such as MCAR, MAR or MNAR, the true expectation of the data can deviate considerably from the observable expectation, depending on the accuracy of the missing value imputation. Letting P(x) represent the prior on x and P(M | x) the conditional distribution of missingness, which can be MCAR, MAR or MNAR, we can write, applying Bayes' rule:
We can estimate the empirical expected value of a single feature x_j in the following way:
In other words, the estimated expected value of a feature is taken with respect to the conditional probability P(x | M = 0) instead of P(x). We now discuss the situation when we impute the missing values. Let f(x_j), which we denote as x̂_j for simplicity, represent the output of any imputation method invoked when x_j is missing. In other words, we use x̂_j to approximate the ground truth x_j when it is missing.
Observe that two weighted expectations contribute to the estimated expectation of feature x_j. The first term in Equation 3 represents the expectation of the feature estimated on the basis of only the observed values. The second term, however, is the expectation over another conditional distribution of the imputed feature x̂_j. The deviation between the imputed feature value x̂_j and the corresponding ground truth x_j depends on the accuracy of the imputation method. Notice also that the higher the degree of sparsity, the more pronounced the effect of x̂_j on the estimated expectation.
Equation 3 essentially means that, as an effect of missing value imputation, the marginal distribution of every feature x_j, the overall data distribution P(x), and the conditional distribution P(y | x), where y represents the label of the data point x, all admit some degree of perturbation.
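This perturbation is easy to observe numerically. The sketch below uses a hypothetical MNAR mechanism of our own choosing (larger values go missing more often) and mean imputation; the bias of the imputed expectation grows with the degree of sparsity, exactly as the argument above predicts.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 100_000)   # ground-truth feature, E[x] ~ 0

biases = []
for p in (0.1, 0.2, 0.4):
    # MNAR mask: larger values are more likely to go missing.
    missing = rng.random(x.size) < np.where(x > 0, 2 * p, 0.0)
    observed = x[~missing]
    # Mean imputation: every gap is filled with the observed mean, so the
    # estimated expectation collapses to the (biased) observed mean.
    estimated = (observed.sum() + missing.sum() * observed.mean()) / x.size
    biases.append(abs(estimated - x.mean()))
```

The list `biases` is strictly increasing in the sparsity level, confirming that higher sparsity makes the effect of the imputed values on the estimated expectation more pronounced.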
For Question 2, we first take the example of an error rate-based method such as DDM Gama et al. (2004). The essence of this method lies in comparing the error rate of a classifier between two consecutive batches of data points and checking whether the change in the error rate between these two batches is statistically significant.
Suppose a sequence of examples in the form of pairs (x_i, y_i). For each example, the classification model's prediction can be True (correct) or False (incorrect). For a set of examples, the error is a random variable from Bernoulli trials Forbes et al. (2010). The binomial distribution gives the general form of the probability for the random variable that represents the number of errors in a sample of n examples. For each point i in the sequence, the error rate is the probability p_i of observing False, with standard deviation s_i = sqrt(p_i (1 − p_i) / i). Statistical theory Mitchell (1997) guarantees a decline in the error rate of the learning algorithm as i increases, on the condition that the class distribution is stationary. Suppose the length of the first batch is n_1 and that of the second batch is n_2, with proportions of error p_1 and p_2, respectively, so that the error counts in the two batches are binomially distributed with parameters (n_1, p_1) and (n_2, p_2). Error rate-based concept drift detection methods, such as DDM, end up constructing a test to validate the null hypothesis H_0: p_1 = p_2.
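A test of this null hypothesis can be sketched as a two-proportion z-test over the two batches. This is an illustrative simplification of our own (DDM's actual mechanism tracks p_i + s_i online rather than running an explicit z-test); the significance level alpha is an assumed parameter.

```python
from math import sqrt
from statistics import NormalDist

def drift_test(e1, n1, e2, n2, alpha=0.01):
    """Two-proportion z-test for H0: p1 == p2, where e1/n1 and e2/n2 are
    the error counts and lengths of two consecutive batches. Returns True
    when the rise in error rate is statistically significant."""
    p1, p2 = e1 / n1, e2 / n2
    p = (e1 + e2) / (n1 + n2)                    # pooled error rate
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))   # std. error under H0
    z = (p2 - p1) / se
    # One-sided test: only a *rise* in the error rate signals drift.
    return z > NormalDist().inv_cdf(1 - alpha)

drift_test(50, 1000, 60, 1000)   # 5% -> 6%: not significant
drift_test(50, 1000, 120, 1000)  # 5% -> 12%: drift signalled
```

The perturbation argument that follows applies directly here: if imputation distorts the two batches, the error counts e1 and e2 no longer reflect the true underlying distributions, and the test's verdict becomes unreliable.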
The limitation of the above approach becomes prominent in the presence of a high degree of sparsity in the two consecutive sequences of data samples. As explained in the previous section, the distributions of the data in these two batches get perturbed following missing value imputation. Therefore, the binomial distribution used to model the error rate in the two batches also ends up perturbed, eventually resulting in a wrong perception of concept drift.
Data distribution-based drift detection algorithms are considered the second largest category of drift detection methods; here, a distance function is employed to quantify the dissimilarity between the distributions of two consecutive sequences or batches of data samples. As missing value imputation perturbs the batches of data samples, this category of algorithms suffers from a similar problem in the presence of a high degree of sparsity.
Therefore, we need a strategy that ensures a guarantee, at least to some extent, on the performance of the drift detection method even in the presence of non-ideal data characteristics such as a high degree of sparsity.
Our methodology involves the following broad steps.
1. Find the distribution-wise best data imputation scheme. We started by creating various distributions (one may think of these as synthetic datasets with a single feature, except for the multivariate normal case), and applied MCAR, MAR and MNAR sparsities at different levels. The data imputation techniques applied were mean, median, mode, zero (constant), and kNN (we tried multiple values of k). Our findings are summarized in Table 1. Note that for some distributions, some of the data imputation schemes may be identical, e.g., mean, median and mode for the normal distribution; however, we mention only one of these identical schemes for a given distribution in this table. Surprisingly, other than for the multivariate normal distribution, the best performing scheme is identical irrespective of the type of missingness and the sparsity level.
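A miniature version of this protocol can be sketched as follows; the exponential distribution, the 30% MCAR rate, and the function names are illustrative choices of ours, not the paper's exact setup. Each single-feature scheme fills the masked entries with one constant, and the schemes are ranked by RMSE against the held-out ground truth.

```python
import numpy as np

def rmse_of_schemes(x, missing_rate=0.3, seed=0):
    """Apply an MCAR mask to x, fill the gaps with each single-feature
    scheme, and return the RMSE against the held-out ground truth."""
    rng = np.random.default_rng(seed)
    mask = rng.random(x.size) < missing_rate
    observed = x[~mask]
    vals, counts = np.unique(np.round(observed, 1), return_counts=True)
    schemes = {
        "mean": observed.mean(),
        "median": np.median(observed),
        "mode": vals[counts.argmax()],  # mode of the (rounded) observed values
        "zero": 0.0,
    }
    truth = x[mask]
    return {name: float(np.sqrt(np.mean((truth - fill) ** 2)))
            for name, fill in schemes.items()}

rng = np.random.default_rng(0)
scores = rmse_of_schemes(rng.exponential(2.0, 50_000))
best = min(scores, key=scores.get)
```

For a constant fill, the RMSE is minimized in expectation by the mean, so mean imputation wins on this skewed distribution; distributions can still differ sharply in how badly the other schemes (notably zero and mode) trail behind.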
2. Find the best data imputation scheme for a given dataset. For a given dataset with sparsity, we first identify the level of sparsity for each feature. Next, we replace each missing value with 0 and each non-missing value with 1, and apply the runs test for randomness Bradley (1968); Paindaveine (2009). Given our application domain and the corresponding data, which is mostly obtained through forms filled in by human change requestors who tend to leave out non-mandatory fields, thus leading to sparsity, we ruled out the possibility of MCAR (although we do perform experiments with MCAR on synthetic datasets in a subsequent section), and strongly believe that there is a (possibly hidden) pattern to the missingness in our data. We envision that most domain experts should be able to use prior knowledge to tell whether MCAR is indeed a possibility for their use case; in our experience, it is a rare phenomenon. Therefore, if the test declares that the data is missing at random, then we conclude that it is a case of MNAR, i.e., we are currently not collecting the variables on which the missingness depends; otherwise, we consider it a case of MAR. We also use quantile-quantile (Q-Q) plots to find the probability distribution of each feature. Subsequently, we isolate the rows that do not have any missing values. Having already identified the sparsity level and the type of missingness, we apply the same to these rows (without sparsity). Since we know the original values for these newly introduced missing values, we use them to find the best data imputation scheme, using the root-mean-square error (RMSE), among the winner identified in the previous step and the correlation-dependent imputation techniques mentioned in Section 2.2. This strategy is captured in Figure 1.
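The runs test on the 0/1 missingness indicator can be implemented with the classical Wald-Wolfowitz normal approximation; this is one possible realization (the paper does not specify the exact variant used), and the example patterns are ours.

```python
from math import sqrt
from statistics import NormalDist

def runs_test(indicator):
    """Wald-Wolfowitz runs test on a binary sequence (1 = missing,
    0 = observed). Returns the two-sided p-value; a small p-value
    rejects the hypothesis that the missingness pattern is random."""
    n1 = sum(indicator)
    n0 = len(indicator) - n1
    if n1 == 0 or n0 == 0:
        return 1.0  # degenerate sequence: nothing to test
    runs = 1 + sum(a != b for a, b in zip(indicator, indicator[1:]))
    mu = 2 * n1 * n0 / (n1 + n0) + 1                 # expected #runs
    var = (mu - 1) * (mu - 2) / (n1 + n0 - 1)        # its variance
    z = (runs - mu) / sqrt(var)
    return 2 * (1 - NormalDist().cdf(abs(z)))

random_pattern = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0]
clustered_pattern = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```

The clustered pattern (all missing values bunched together) yields a tiny p-value, signalling non-random missingness, while the interleaved pattern does not reject randomness.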
3. Apply the majority voting based ensemble of concept drift detectors on the dataset with imputed values. The central idea behind our approach is to construct an ensemble of multiple CDDs and infer on concept drift based on the majority voting among the individual CDDs. The intuition is that the majority voting strategy will provide a lower bound on the resulting accuracy by taking advantage of the strengths of individual CDDs in diverse situations.
On a formal note, consider a concept drift detection problem in which an algorithm takes two consecutive sequences of data instances, S_1 and S_2, from a data stream and predicts whether drift exists between S_1 and S_2. Let h_i(S_1, S_2) ∈ {−1, +1} represent the prediction of the i-th base drift detection algorithm, where −1 and +1 represent absence and presence of drift, respectively. The overall decision is based on the average prediction of the base CDDs. Let the average prediction of the K base CDDs be the score s(S_1, S_2) = (1/K) Σ_i h_i(S_1, S_2).
In the usual case, if the score is negative then the overall decision is −1 (no drift), and if positive then +1 (drift), i.e., the ensemble outputs the sign of the score.
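The voting scheme itself is a few lines of code. In the sketch below, the base detectors are deliberately simple toy stand-ins of our own (mean, spread, and maximum comparisons with arbitrary thresholds), not ADWIN, HDDM or KSWIN; the ensemble logic is the part that matters.

```python
class MajorityVoteCDD:
    """Ensemble CDD: each base detector votes +1 (drift) or -1 (no drift)
    on a pair of consecutive batches; the ensemble follows the sign of
    the average vote."""

    def __init__(self, detectors):
        self.detectors = detectors  # callables: (batch1, batch2) -> bool

    def score(self, s1, s2):
        votes = [1 if d(s1, s2) else -1 for d in self.detectors]
        return sum(votes) / len(votes)

    def detect(self, s1, s2):
        return self.score(s1, s2) > 0  # majority says drift


# Illustrative base detectors (stand-ins for real CDDs).
def mean_shift(s1, s2):
    return abs(sum(s2) / len(s2) - sum(s1) / len(s1)) > 0.5

def spread_shift(s1, s2):
    return abs((max(s2) - min(s2)) - (max(s1) - min(s1))) > 1.0

def max_shift(s1, s2):
    return abs(max(s2) - max(s1)) > 1.0

ens = MajorityVoteCDD([mean_shift, spread_shift, max_shift])
s1 = [0.0, 0.1, 0.2, 0.1]   # batch before the change
s2 = [1.0, 1.2, 1.1, 1.3]   # batch after an upward mean shift
```

On this pair, two of the three detectors vote for drift, so the ensemble declares one even though the spread-based detector disagrees, which is precisely the robustness the majority vote is meant to buy.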
To analyze the properties of the ensemble CDD, consider a generic framework in which the ensemble CDD takes the overall decision:
where t > 0 is a rejection threshold and [−t, t] is the most uncertain region for the ensemble drift detector. Define a random variable Z = y · s(S_1, S_2), where y represents the ground truth of the existence of drift between S_1 and S_2 and admits the value −1 or +1 depending on the existence of drift, while s(S_1, S_2) is the output of the ensemble drift detector as defined above. Due to this encoding, Z is negative for incorrect predictions and positive for correct predictions. If Z lies in the range [−t, t], the ensemble drift detector suffers from indecision. With Z being a random variable, the probability of a wrong prediction is
and the probability of rejection
We now define the risk of the ensemble drift detector as below:
Here, the two coefficients represent the cost of a wrong prediction and the cost of rejection (indecision), respectively. We now look for an upper bound on the risk to analyze the worst-case situation. As discussed in Breiman (2001), we have
where ρ̄ denotes the average pairwise correlation between the base CDDs. Following Cantelli's inequality (Wu et al., 2021), we have
Substituting, we obtain
Similarly, using the inequality 10, we have
Thus, combining 9, 12 and 13, we can find the upper bound on the risk as below:
In our model, substituting our parameter choices, we obtain
Observe that the above inequality applies only when the expected value of Z is positive, and the upper bound comes down with a decrease in the average pairwise correlation ρ̄ between the base CDDs. In other words, with more diversity among the base CDDs, the ensemble, as defined in equation 6, becomes less prone to wrong predictions about concept drift even if the base CDDs suffer from deteriorated performance due to sparsity.
5 Experimental Setup & Results
5.1 Dataset Description
We use the publicly available Harvard dataverse Lobo (2020) that contains synthetic datasets for abrupt and gradual concept drift detection. For our change risk assessment project, we have 50K samples (change requests) that are labelled as either “risky” (i.e., potentially may lead to some major incident) or “not risky” – thus, our primary task is binary classification. Now, we take the 50K samples and shuffle these to remove any pre-existing concept drifts in the data, and then create separate datasets of abrupt and gradual concept drifts by flipping the class labels; we tried various combinations of sample points where the drifts are introduced and various drift widths. Note that this process of shuffling the original dataset (which ideally has no adverse effect on the original classifier), and then introducing drifts artificially is a common process for real-world datasets because in reality, it is not always feasible to clearly identify the onset and width of a concept drift Souza et al. (2020); Sethi and Kantardzic (2017).
5.2 Experimental Results
Initially, we check whether data imputation aids concept drift detection. Experimenting with the Harvard dataverse and our data, we find that data imputation has a positive effect on all the metrics mentioned in Section 2.5 and for all the algorithms in Section 2.4; this is shown in Figure 2 for our data, for which kNN (k = 4) had the lowest RMSE.
Next, we introduce different levels of sparsity using the different mechanisms (MCAR, MAR, MNAR) on the Harvard dataverse, and record the performance of the 7 concept drift detectors on the different metrics. We notice that although prequential error is widely used as a metric for comparing CDDs Baena-Garcia et al. (2006); Yan (2020); Lu et al. (2019), it can sometimes be misleading: consider a bad CDD that detects drifts often (even when there are none); the consequent retrainings, which are typically quite costly, lead to a low prequential error value. A similar problem may arise with accuracy. We have already discussed the challenges with the other metrics in Section 2.5. Nevertheless, taken together, these metrics serve as key indicators of CDD performance.
Empirically, we find that none of the 7 drift detectors performs well across all the metrics, as can be seen in Figure 3, for any given missingness or drift type. Hence, we decide to go with a majority voting based ensemble of detectors. Ranking the algorithms based on their performance for the different metrics, we find that for abrupt drifts, ADWIN + HDDMA + KSWIN, and for gradual drifts, HDDMA + HDDMW + Page-Hinkley achieve the best performance across all the metrics.
A challenge in working with an ensemble of detectors is finding the optimal window size such that, if two of the detectors detect a drift at any instant within that window, the ensemble declares a drift at that instant. Our experiments determine the optimal window size separately for the Harvard dataverse and for our data. The ensemble detector for the Harvard dataverse is also shown in Figure 3. We omit the details of the experiment for finding the optimal window size due to the page limit.
Finally, we test our ensemble detector on the change risk assessment data, as shown in Figure 4. As the figure shows, the ensemble detector, if not the best, always features in the top 3 for all the metrics. Therefore, we plan to deploy it in the near future: the majority voting based ensemble detector serves as a strong baseline, and it is unlikely to miss a concept drift.
Concept drift detection poses a major challenge for real-world deployments of machine learning solutions. The problem is further worsened in the presence of sparsity, especially since the ground truths for the missing values can never be obtained in practice. One of our major findings in the course of tackling this problem is that none of the popular concept drift detectors exhibits optimal performance across all metrics in all situations. Therefore, we designed a majority voting based ensemble of detectors for abrupt and gradual drifts; it delivers the best or close to the best performance across the whole spectrum of metrics.
- Early drift detection method. In Proc. 4th Int. Workshop on Knowledge Discovery from Data Streams, 2006.
- Learning from time-changing data with adaptive windowing. In SDM, pp. 443–448, 2007.
- Distribution-free statistical tests. Prentice-Hall, 1968.
- Random forests. Machine Learning 45(1), pp. 5–32, 2001.
- Statistical distributions. 4th edition, John Wiley & Sons, 2010.
- Online and non-parametric drift detection methods based on Hoeffding's bounds. IEEE Transactions on Knowledge and Data Engineering 27(3), pp. 810–823, 2015.
- Learning with drift detection. In SBIA, LNCS Vol. 3171, pp. 286–295, 2004.
- Look before you leap! Designing a human-centered AI system for change risk assessment. CoRR abs/2108.07951, 2021.
- Matrix completion and low-rank SVD via fast alternating least squares. Journal of Machine Learning Research 16, pp. 3367–3402, 2015.
- Concept drift detection: dealing with missing values via fuzzy distance estimations. CoRR abs/2008.03662, 2020.
- Synthetic datasets for concept drift detection purposes. Harvard Dataverse, 2020.
- Learning under concept drift: a review. IEEE Transactions on Knowledge and Data Engineering 31(12), pp. 2346–2363, 2019.
- Machine learning. McGraw-Hill, 1997.
- Missing data imputation using optimal transport. In ICML, Proceedings of Machine Learning Research Vol. 119, pp. 7130–7140, 2020.
- Continuous inspection schemes. Biometrika 41(1/2), pp. 100–115, 1954.
- On multivariate runs tests for randomness. Journal of the American Statistical Association 104(488), pp. 1525–1538, 2009.
- Fast Hoeffding drift detection method for evolving data streams. In Machine Learning and Knowledge Discovery in Databases, pp. 96–111, 2016.
- Reactive soft prototype computing for concept drift streams. Neurocomputing 416, pp. 340–351, 2020.
- Inference and missing data. Biometrika 63(3), pp. 581–592, 1976.
- On the reliable detection of concept drift from streaming unlabeled data. Expert Systems with Applications 82, pp. 77–99, 2017.
- Challenges in benchmarking stream learning algorithms with real-world data. Data Mining and Knowledge Discovery 34, pp. 1805–1858, 2020.
- mice: Multivariate imputation by chained equations in R. Journal of Statistical Software 45(3), pp. 1–67, 2011.
- Accurate detecting concept drift in evolving data streams. ICT Express 6(4), pp. 332–338, 2020.