A common challenge in mining data streams is that the streams are not always strictly stationary, i.e., the concept of the data (the underlying distribution of incoming data) drifts unpredictably over time. This creates a need to detect such concept drifts in a timely manner, whether for business intelligence or as a means of tracking the performance of statistical prediction models that use these data streams as input.
This paper focuses on detecting concept drifts affecting binary classification models. For a binary classification problem, concept drift is said to occur when the joint distribution $P(X_t, y_t)$ changes over time, where $X_t$ are the predictor variables at time step $t$ and $y_t$ is the corresponding binary response variable. Intuitively, concept drift refers to the scenario in which the underlying distribution that generates the response variable changes over time. Popular approaches for detecting concept drift identify the change point [1, 2]. DDM is the most widely used concept drift detection algorithm that is designed strictly for streaming data. The test statistic DDM employs is the sum of the overall classification error and its empirical standard deviation. Because DDM focuses on the overall error rate, it fails to detect a drift unless the sum of false positives and false negatives changes. An example of such a scenario is a confusion matrix whose false positives and false negatives shift in opposite directions, so that the overall error rate is preserved. This limitation is accentuated in imbalanced classification tasks, as in that example. Unfortunately, this failure to detect a drastic drop in the recall of the minority class is often critical. For instance, if the minority class in the above example corresponded to products at a manufacturing plant that were classified as defective, the critical threefold decrease in the true positive rate (i.e., from 0.75 to 0.25) would go unnoticed by DDM.
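The blind spot described above can be checked with a few lines of arithmetic. The sketch below uses two hypothetical confusion matrices (the counts are illustrative assumptions, not the paper's own example), chosen so that the overall error rate stays at 3% while the true positive rate falls from 0.75 to 0.25.

```python
from fractions import Fraction

def error_rate(tp, fn, fp, tn):
    """Overall error rate: the quantity DDM monitors."""
    return Fraction(fp + fn, tp + fn + fp + tn)

def true_positive_rate(tp, fn):
    """Recall on the positive (minority) class."""
    return Fraction(tp, tp + fn)

# Hypothetical counts per 100 examples, chosen for illustration only.
before = dict(tp=3, fn=1, fp=2, tn=94)
after = dict(tp=1, fn=3, fp=0, tn=96)

for name, cm in (("before", before), ("after", after)):
    err = error_rate(**cm)
    tpr = true_positive_rate(cm["tp"], cm["fn"])
    print(f"{name}: error rate = {float(err):.2f}, TPR = {float(tpr):.2f}")
```

Both matrices have three misclassified examples out of 100, so a detector that watches only the overall error sees no change.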
Drift Detection Method for Online Class Imbalance (DDM-OCI) addresses the limitation of DDM when the class ratio is imbalanced. However, DDM-OCI triggers a number of false alarms due to an inherent weakness in the model. DDM-OCI assumes that concept drift in an imbalanced classification task is indicated by a change in the underlying true positive rate (i.e., minority-class recall). This hypothesis unfortunately does not cover the case in which concept drift occurs without affecting the recall of the minority class. It can be shown that a concept can drift from imbalanced class data to balanced class data while the true positive rate (TPR), positive predicted value (PPV) and F1-score remain unchanged. Thus, this type of drift is unlikely to be detected by DDM-OCI unless other rates such as the true negative rate (TNR) or negative predicted value (NPV) are also considered. Additionally, under a stable concept the test statistic used by DDM-OCI does not approximately follow the distribution on which its confidence levels are built. The rationale for constructing those confidence levels is therefore not consistent with the null distribution of the test statistic, which is why DDM-OCI triggers false alarms quickly and frequently.
Early Drift Detection Method (EDDM) achieves better detection results than DDM if the data stream has slow gradual change. EDDM monitors the distance between two classification errors. The PerfSim algorithm considers all the components of a confusion matrix and monitors the cosine similarity coefficient of all components from two batches of data. If the similarity coefficient drops below a user-specified threshold, a concept drift is signaled. However, EDDM must wait for a minimum number of classification errors before calculating the monitoring statistic at each decision point. That is, the length of the time interval between decision points is a random number that depends on when classification errors appear, and a great many examples may arrive between two classification errors. The PerfSim algorithm is likewise constrained by the need to collect mini-batches of data to calculate its monitoring statistics. The method of partitioning the data stream in [3, 4] is either user-specified from practical experience or must be learned before detection starts. Hence, EDDM and PerfSim are not well suited for streaming environments in which decisions are made instantly. A further approach uses SVM to monitor three measures over time: overall accuracy, recall, and precision. This approach also computes the three measures by assuming that the data arrive in batches, on which the SVM is learned.
To address the limitations of existing approaches, we present Linear Four Rates (LFR) for detecting the drift of . Unlike other proposed approaches, LFR can detect all possible variants of concept drift, even in the presence of imbalanced class labels, as shown in Section IV. LFR outperforms existing approaches in terms of earliest detection of concept drift, with the least false alarms and best recall. Additionally, LFR does not require the data to arrive in batches and is independent of the underlying classifier employed.
II Problem Formulation
Given that detection of concept drift is equivalent to detecting a change-point in $P(X_t, y_t)$, an intuitive approach is to test a statistical hypothesis on the multivariate variable $(X_t, y_t)$ in the data stream [6, 7, 8]. The limitation of this approach is that statistical power degrades when the dimension of $X_t$ is extremely large or when the magnitude of the drift is small. To overcome these limitations, the proposed approach instead identifies changes in $P(f(X_t), y_t)$, where $f$ is the classifier used for prediction. This is motivated by the fact that any drift in $P(f(X_t), y_t)$ implies a drift in $P(X_t, y_t)$ with probability 1.
Let $f$ be a binary classifier for the given data stream $(X_t, y_t)$. We define the corresponding confusion probability matrix ($CP$) for $f$ as the $2 \times 2$ matrix with entries $TP$, $FP$, $FN$ and $TN$, which denote the underlying proportions of true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN), respectively, for classifier $f$, i.e., $TP + FP + FN + TN = 1$.
The four characteristic rates (True Positive Rate, True Negative Rate, Positive Predicted Value, Negative Predicted Value) are computed as follows: $P_{tpr} = TP/(TP+FN)$, $P_{tnr} = TN/(TN+FP)$, $P_{ppv} = TP/(TP+FP)$ and $P_{npv} = TN/(TN+FN)$. All four characteristic rates are equal to 1 if there is no misclassification.
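As a concrete illustration of the four characteristic rates, the following sketch computes them from confusion-matrix entries using the standard definitions. The function name `four_rates` and the example values are assumptions made for illustration.

```python
def four_rates(tp, tn, fp, fn):
    """Return TPR, TNR, PPV and NPV from confusion-matrix entries.

    Entries may be counts or proportions; a rate whose denominator is
    zero is returned as None because it is undefined.
    """
    def ratio(num, den):
        return num / den if den > 0 else None

    return {
        "tpr": ratio(tp, tp + fn),  # recall of the positive class
        "tnr": ratio(tn, tn + fp),  # recall of the negative class
        "ppv": ratio(tp, tp + fp),  # precision of positive predictions
        "npv": ratio(tn, tn + fn),  # precision of negative predictions
    }

# Hypothetical proportions that sum to 1.
rates = four_rates(tp=0.03, tn=0.94, fp=0.02, fn=0.01)

# With no misclassification every rate equals 1.
perfect = four_rates(tp=0.1, tn=0.9, fp=0.0, fn=0.0)
```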
Under a stable concept (i.e., $P(X_t, y_t)$ remains unchanged), the four rates remain the same. Thus, a significant change in any of them implies a change in the underlying joint distribution, or concept. It is worth noting that at every time step $t$, for any possible $(y_t, \hat{y}_t)$ pair, only two of the four empirical rates will change; these two rates are referred to as “influenced by $(y_t, \hat{y}_t)$”. Also, note that in certain applications the detection of concept drift is not of interest, and an alarm is unnecessary, if all empirical rates are increasing, because this suggests that an old model learned from historical data performs even better on the current data stream. We do not use this assumption in this paper, but all methodologies and arguments we propose below can be easily adapted for it.
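The observation that each labeled prediction touches only two of the four rates follows from what each rate is conditioned on. A minimal sketch, assuming labels are encoded as 1 (positive) and 0 (negative) and using the hypothetical helper name `influenced_rates`:

```python
def influenced_rates(y_true, y_pred):
    """Return the two rates whose empirical values change for one pair.

    TPR/TNR are conditioned on the true label and PPV/NPV on the
    predicted label, so each pair touches exactly one rate of each kind.
    """
    by_truth = "tpr" if y_true == 1 else "tnr"
    by_prediction = "ppv" if y_pred == 1 else "npv"
    return by_truth, by_prediction

for pair in [(1, 1), (1, 0), (0, 1), (0, 0)]:
    print(pair, "->", influenced_rates(*pair))
```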
III Concept Drift Detection Framework
Given the efficacy of the four rates in detecting concept drift, the proposed framework uses estimators of these rates as test statistics to conduct statistical hypothesis testing at each time step. Specifically, at each time step the framework conducts statistical tests with the following null and alternative hypotheses:
The concept is stable under $H_0$ and is considered to have drifted if $H_0$ is rejected. The idea is to compare the statistical significance level of the running test statistic under $H_0$ at each time step to the user-defined warning and detection significance levels. This type of test is called a “continuing test”, and in our problem all time stamps are decision points of acceptance or rejection. Consequently, when the concept is stable, false alarms will be triggered unnecessarily at a long-run rate determined by the detection significance level. In this paper, we assume the spacing of decision points is fixed. Accordingly, the familywise error rate and its cost in our continuing test can be controlled by using a simultaneous inference method such as the classical Bonferroni correction. In a more general case where the spacings of decision points are unequal and the test statistics are strongly positively correlated, we should instead consider the average run length of the test or more powerful alternatives that control the familywise error rate.
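The arithmetic behind the Bonferroni adjustment and the long-run false alarm spacing can be sketched as follows. This is an illustration under stated assumptions: the correction is applied across the four rates tested at each step, the familywise level of 0.04 is a made-up value, and both function names are hypothetical.

```python
def bonferroni_level(familywise_level, n_tests):
    """Per-test significance level under the classical Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return familywise_level / n_tests

def mean_steps_between_false_alarms(per_step_level):
    """Long-run average spacing of false alarms for one monitored statistic,
    given its per-step type-I error."""
    return 1.0 / per_step_level

# Hypothetical setting: four rates tested simultaneously at each time step.
per_rate = bonferroni_level(familywise_level=0.04, n_tests=4)
spacing = mean_steps_between_false_alarms(per_rate)
print(per_rate, spacing)
```

A smaller per-test level lengthens the expected spacing between false alarms, at the cost of wider bounds and later detections.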
A naïve implementation of the “continuing test” framework (Naïve Four Rates) would use the empirical rates as the estimators and test statistics. However, as shown in Section III-C, better estimates are available.
In the following section, Linear Four Rates (LFR) algorithm will be used to elaborate on the concept drift detection framework. LFR differs from Naïve Four Rates (NFR) in terms of the estimator used. However, both LFR as well as NFR perform better than DDM and DDM-OCI due to the more comprehensive detection framework utilized.
III-A Linear Four Rates Algorithm (LFR)
III-A1 Algorithm Outline
LFR uses modified rates as the test statistics for . is a modified version of the empirical rate . At each , is updated as : for those empirical rates “influenced by ”. is essentially a linear combination of classifier’s previous performance and current performance , where is a time decay factor for weighting the classifier’s performance at current instance. has been used as a class imbalance detector and as a revised recall test statistic in . The probabilistic characteristic of our test statistic are investigated in § III-A2. The pseudocode of the framework (using as an estimator of for required test statistic), is detailed in Algorithm 1.
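The update described in this paragraph can be sketched in a few lines. The form used here, new value = decay × previous value + (1 − decay) × correctness indicator, is an assumption reconstructed from the prose description of a linear combination of previous and current performance. The starting value of 0.5, the label encoding (1 positive, 0 negative) and the function name `update_rates` are likewise illustrative choices, not taken from Algorithm 1.

```python
def update_rates(rates, y_true, y_pred, eta=0.9):
    """Apply one assumed time-decayed update for a single (y_true, y_pred) pair.

    Only the two rates conditioned on y_true and y_pred are updated;
    the other two are left unchanged. Returns a new dict.
    """
    correct = 1.0 if y_true == y_pred else 0.0
    touched = ("tpr" if y_true == 1 else "tnr",
               "ppv" if y_pred == 1 else "npv")
    updated = dict(rates)
    for name in touched:
        # Linear combination of previous performance and current outcome.
        updated[name] = eta * rates[name] + (1.0 - eta) * correct
    return updated

# Hypothetical starting value of 0.5 for every rate.
state = {name: 0.5 for name in ("tpr", "tnr", "ppv", "npv")}
state = update_rates(state, y_true=1, y_pred=1)  # true positive
state = update_rates(state, y_true=1, y_pred=0)  # false negative
print(state)
```

With a decay factor of 0.9, a single error lowers an influenced rate by a tenth of its current value, which is what makes the statistic react quickly to a run of recent mistakes.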
The three user-defined parameters for each rate are the time decaying factor, the warning significance level and the detection significance level. The time decaying factor is the weight used to evaluate the classifier's performance at the current prediction. Given that the detection methodology conducts hypothesis testing at each time step, the warning and detection levels are interpretable statistical significance levels, i.e., type I error (false alarm rate), in the standard testing framework. In practice, the allowable false warning rate and false detection rate in applications such as quality control of a moving assembly line are guidelines that help the user choose these two levels. For a fair comparison, the time decaying factor is set to 0.9 for all experiments in this paper, the same value used in prior work. The optimal selection of the decaying factor is domain dependent and can be pre-learned if necessary.
The test statistic is a geometrically weighted sum of i.i.d. Bernoulli random variables, which emphasizes the most recent prediction accuracy and places exponentially decaying weights on the historical prediction accuracies. By taking advantage of this weighting scheme, it is more sensitive to concept drifts, foreshadowing the non-stationarity of the classifier's performance.
To construct a more reliable running confidence interval for the test statistic and thereby control the type-I error, we use the fact that the statistic is distributed as a geometrically weighted sum of Bernoulli random variables. Bhati et al. investigate the closed-form distribution function for a special case of the decaying factor. However, a closed-form distribution function for other values is unattainable. Alternatively, according to Theorem 1, a reasonable empirical distribution can be obtained independently by Monte Carlo simulation for given inputs, including the time decaying factor. The pseudocode for the Monte Carlo sampling procedure is provided in Algorithm 2. As the underlying rate is unknown, its empirical estimate is used as a surrogate to generate the empirical distribution of the test statistic. Based on this empirical distribution, the lower and upper quantiles for the given significance level serve as the required (warning/detection) bounds. The selection of the empirical rate as the best surrogate is supported by Lemma 1.
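A minimal sketch of the Monte Carlo idea is given below. It is not Algorithm 2 itself: the update form, the starting value `r0`, the number of simulated paths, and the convention of cutting a tail of mass `level` on each side are all assumptions made for illustration, and `monte_carlo_bounds` is a hypothetical name.

```python
import random

def monte_carlo_bounds(p_hat, eta, n_steps, level, n_sim=2000, seed=0, r0=0.5):
    """Estimate lower/upper bounds for a time-decayed rate statistic.

    Simulates n_sim independent paths of
        r <- eta * r + (1 - eta) * Bernoulli(p_hat)
    over n_steps updates and returns the empirical `level` and
    `1 - level` quantiles of the final values.
    """
    rng = random.Random(seed)
    finals = []
    for _ in range(n_sim):
        r = r0
        for _ in range(n_steps):
            hit = 1.0 if rng.random() < p_hat else 0.0
            r = eta * r + (1.0 - eta) * hit
        finals.append(r)
    finals.sort()
    lo_idx = int(level * (n_sim - 1))
    hi_idx = int((1.0 - level) * (n_sim - 1))
    return finals[lo_idx], finals[hi_idx]

# Hypothetical inputs: surrogate rate 0.9, decay 0.9, 50 updates.
warn_lb, warn_ub = monte_carlo_bounds(0.9, 0.9, 50, level=0.05)
det_lb, det_ub = monte_carlo_bounds(0.9, 0.9, 50, level=0.01)
print((warn_lb, warn_ub), (det_lb, det_ub))
```

Because the detection level is smaller than the warning level, the detection bounds enclose the warning bounds, so a statistic drifting away from its null distribution crosses the warning bound first.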
and denote warning and detection significance levels respectively, where . The corresponding and are obtained from Monte Carlo simulations as described. The bounds of four rates of the framework, can be independently set based on importance, by having distinct . For instance, in some imbalanced classification tasks, performance of the classifier on the minority class is a higher priority than on the majority class.
Having computed the bounds, the framework considers that a concept drift is likely to occur and sets the warning signal when any test statistic crosses the corresponding warning bounds for the first time. If any test statistic reaches the corresponding detection bound, the concept drift is affirmed at that time step.
All examples stored between the warning time and the detection time are extracted to relearn a new classifier, since the stored examples are considered samples of the new concept. If the stored examples are too few to relearn a reasonable classifier, one has to wait for sufficient training examples. However, if the test statistics cross the corresponding warning bounds but fail to reach the detection bounds, the previous warning flag is erased. After a concept drift is detected, the test statistics are reset to their initial values so that a new monitoring cycle can start.
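The warning and detection logic can be illustrated for a single rate with a small scanning function. This is a simplified sketch: it holds the bounds fixed over time (the framework recomputes them as the estimated rate changes), it treats a return inside the warning bounds as the event that erases a warning, and the name `monitor` and the example values are hypothetical.

```python
def monitor(values, warn_bounds, detect_bounds):
    """Scan a sequence of test-statistic values for a single rate.

    Returns (t_warn, t_detect): the time step of the first warning in the
    current cycle and the time step at which drift is confirmed, or
    (t_warn, None) if no detection bound is crossed. A warning is erased
    when the statistic returns inside the warning bounds before detection.
    """
    warn_lb, warn_ub = warn_bounds
    det_lb, det_ub = detect_bounds
    t_warn = None
    for t, r in enumerate(values):
        outside_warn = r < warn_lb or r > warn_ub
        outside_detect = r < det_lb or r > det_ub
        if outside_warn and t_warn is None:
            t_warn = t          # first crossing: raise the warning flag
        elif not outside_warn:
            t_warn = None       # back inside: erase any earlier warning
        if outside_detect:
            return t_warn, t    # drift confirmed
    return t_warn, None

values = [0.90, 0.75, 0.90, 0.75, 0.70, 0.60]
t_warn, t_detect = monitor(values, warn_bounds=(0.80, 0.99),
                           detect_bounds=(0.65, 1.00))
print(t_warn, t_detect)
```

In this run the first warning is erased when the statistic recovers, and the examples from the second warning up to the detection are the ones that would be used to relearn the classifier.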
The following theorems investigate the statistical properties of LFR test statistic .
For any , is a geometrically weighted sum of Bernoulli random variables, when there is a stable concept up to time : i.e., , where and is the underlying rate.
Among total time steps, suppose is changed according to line 7 at time step where . Hence,
where the last equation hold by the stable concept assumption and all indicators are i.i.d Bernoulli random variables with underlying rate .
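The geometric weighting in Theorem 1 can be made explicit by unrolling the update. The derivation below is a sketch under assumed notation: $R^{(n)}$ is the statistic after its $n$-th update, $R^{(0)}$ its initial value, $\eta$ the time decaying factor, $I_i$ the i.i.d. Bernoulli correctness indicators with underlying rate $P$, and the update is assumed to take the form $R^{(n)} = \eta R^{(n-1)} + (1-\eta) I_n$.

```latex
\begin{aligned}
R^{(n)} &= \eta R^{(n-1)} + (1-\eta) I_n \\
        &= \eta^{2} R^{(n-2)} + (1-\eta)\left(\eta I_{n-1} + I_n\right) \\
        &= \eta^{n} R^{(0)} + (1-\eta) \sum_{i=1}^{n} \eta^{\,n-i} I_i, \\
\mathbb{E}\left[R^{(n)}\right]
        &= \eta^{n} R^{(0)} + (1-\eta) \sum_{i=1}^{n} \eta^{\,n-i} P
         = \eta^{n} R^{(0)} + \left(1-\eta^{n}\right) P .
\end{aligned}
```

Under these assumptions the influence of the initial value decays geometrically, and the expectation approaches $P$ as $n$ grows.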
is an unbiased estimator of . This is because where are i.i.d Bernoulli random variables realized at time with parameter . By factorization theorem, is a sufficient statistic. Also,
If , it implies because is a polynomial of . Thereby and is a complete sufficient statistic by definition. By Lehmann-Scheffe Theorem, is the unique UMVUE.
The per-step cost of the Linear Four Rates (LFR) detection algorithm can be kept low by using a table of bounds precomputed by Algorithm 2. The four-dimensional table, indexed by the varying inputs, can be computed and stored before running Algorithm 1. No computational resources then need to be spent on quantile calculation during stream monitoring, because the observer can find the table entry closest to the current estimate and look up the lower and upper quantiles. Thus, the LFR algorithm takes O(1) time to test for drift at each time point and is well suited to streaming environments.
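The lookup step can be sketched as a nearest-grid-point search over a precomputed table. The table contents, the grid spacing and the name `nearest_bounds` are hypothetical; a real table would also be indexed by the other inputs to Algorithm 2.

```python
def nearest_bounds(table, p_hat):
    """Look up precomputed bounds for the grid point closest to p_hat.

    `table` maps a grid value of the rate to a (lower, upper) pair. The
    search cost depends only on the grid size, not on the stream length.
    """
    closest = min(table, key=lambda p: abs(p - p_hat))
    return table[closest]

# Hypothetical precomputed bounds on a coarse grid of rate values.
table = {0.80: (0.55, 0.97), 0.85: (0.61, 0.98), 0.90: (0.68, 0.99)}
print(nearest_bounds(table, 0.87))
```

With a uniform grid the search could be replaced by rounding the estimate to the nearest grid index, which makes the lookup constant time regardless of grid size.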
III-B Naïve Four Rates Algorithm (NFR)
For the purpose of comparison, this section details the characteristics of a naïve implementation of the proposed framework that uses as the test statistic. A benefit of choosing this test statistic, is that there exists a closed-form distribution as shown in Lemma 1. Using the same strategy of LFR algorithm, NFR algorithm monitors the four rates sequentially. At each time stamp, for each rate, hypothesis testing is done with null distribution and the warning / detection alarms set when exceeds the expected bounds.
The main difference with respect to LFR is the estimate of the underlying rate used to find the null distribution. The LFR algorithm uses the empirical rate as a surrogate of the unknown underlying rate, while the NFR algorithm uses a running average of all previous empirical rates. This update rule lets old prediction performance contribute more to the estimate and recent predictions contribute less. Thus, the running average is more robust for estimating the underlying rate when concept drift occurs. Additionally, it remains an MSE-consistent estimator under the stable concept, as presented in Lemma 2.
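The difference between the two surrogates can be seen on a short sequence of correctness indicators. The sketch below assumes that the empirical rate is the cumulative mean of the indicators and that the NFR estimate is the mean of all empirical rates so far; the names `p_hat`, `p_bar` and `running_estimates` are introduced here for illustration.

```python
from fractions import Fraction

def running_estimates(indicators):
    """Yield (p_hat, p_bar) after each indicator.

    p_hat is the empirical rate so far; p_bar is the running average of
    all p_hat values so far, which weights early outcomes more heavily.
    """
    hits = 0
    p_hat_sum = Fraction(0)
    for n, ind in enumerate(indicators, start=1):
        hits += ind
        p_hat = Fraction(hits, n)
        p_hat_sum += p_hat
        yield p_hat, p_hat_sum / n

# Two correct predictions followed by two errors.
estimates = list(running_estimates([1, 1, 0, 0]))
for p_hat, p_bar in estimates:
    print(p_hat, p_bar)
```

After the two errors the empirical rate has dropped to 1/2 while the running average is still 19/24, showing how the latter stays closer to the earlier performance.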
Assume the setting in Theorem 1. Under the stable concept up to , for any , in NFR algorithm is a MSE-consistent estimator of .
Among total time steps, suppose is changed at time step where . Hence,
By IID assumption of indicators, we obtain
where the last limit hold by the fact that as . Thus, as and , is a MSE-consistent estimator.
III-C Comparison between NFR and LFR
To empirically compare the test statistics of NFR and LFR, we use Figure 1 to illustrate a single run of both LFR and NFR algorithm on the same synthetic streaming data . The data stream of pairs with one change-point at is generated by sampling from two confusion probability matrices and . The two concepts are characterized by and respectively. The type of drift is determined by particular settings of . In this example, to generate a balanced stream of pairs representing the scenario that overall accuracy of classifier drops but remains constant, we chose and . The objective of detection algorithms is to identify the change-point .
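The data generation mechanism described here, sampling label/prediction pairs from one confusion probability matrix before the change-point and from another afterwards, can be sketched as follows. The two matrices, the stream length and the function names are hypothetical; only the change-point at time step 5000 follows the example in the text.

```python
import random

def sample_pair(cp, rng):
    """Draw one (y_true, y_pred) pair from a confusion probability matrix.

    `cp` maps "tp", "fn", "fp", "tn" to probabilities summing to 1.
    """
    outcomes = {"tp": (1, 1), "fn": (1, 0), "fp": (0, 1), "tn": (0, 0)}
    names = list(outcomes)
    choice = rng.choices(names, weights=[cp[n] for n in names], k=1)[0]
    return outcomes[choice]

def generate_stream(cp_before, cp_after, change_point, length, seed=0):
    """Generate pairs from cp_before up to change_point, then from cp_after."""
    rng = random.Random(seed)
    return [sample_pair(cp_before if t < change_point else cp_after, rng)
            for t in range(length)]

# Hypothetical balanced concepts: accuracy drops from 0.8 to 0.6.
cp1 = {"tp": 0.4, "fn": 0.1, "fp": 0.1, "tn": 0.4}
cp2 = {"tp": 0.3, "fn": 0.2, "fp": 0.2, "tn": 0.3}
stream = generate_stream(cp1, cp2, change_point=5000, length=10000)

def accuracy(pairs):
    return sum(y == y_hat for y, y_hat in pairs) / len(pairs)

print(accuracy(stream[:5000]), accuracy(stream[5000:]))
```

Generating pairs directly, rather than features and a fitted model, keeps the comparison of detectors independent of any particular classifier.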
It is clear that the test statistic in LFR algorithm has a larger variance than for each rate. LFR algorithm reports an earlier detection at (true detection point t=5000) when compared to NFR in this run, even though . This observation matches well with the rationale of constructing , described in §III-A1, to gain detection sensitivity through introducing large variances. To rigorously compare detection performance of and , more investigations are provided below.
Power characteristics of two competing test statistics (LFR) and (NFR), are compared empirically on synthetic data. We denote by and the power estimates of and respectively. The and against varying time lag and are presented in Figure 2. Figure 2 indicates that neither nor dominates all the time because (red surface) achieves a larger statistical power when the time lag is small but a smaller power when is large. This is because the update rule line 7 enables the estimator to shift from to at an exponential rate which leads the power dominance in a short lag. The price is that limiting distributions of under both null and alternative have larger variances than and thus limiting power, when is large and is small, is degraded.
In order to compare sensitivities of and with regard to detecting concept drift in more general settings, we used . The result is illustrated in Figure 3.
Except when, , we see that for any fixed pair of , when is small and when is large. This is because decreases, as time lag increases. This suggests that LFR is preferable if earlier detection is highly desired. The alarms are more likely to be triggered in the earliest time after the occurrence of concept drift. Earlier detection allows observer to adjust the model and avoid costs of incorrect predictions immediately. On the other hand, if observers are only concerned with detecting the occurrence of drift in the data stream but unconcerned with its detection promptness, then NFR algorithm provides a higher power test statistic to detect the drift. This is because with convergence rate . In the long run, as , implies that .
To guide the selection between LFR and NFR, Figure 4 is a heatmap of limiting power estimates on all pairs using . We can see that is already close to 1 for , when and are significantly different.
In this section, we compared the detection performance of LFR to NFR, DDM and DDM-OCI approaches using both synthetic data and public datasets. We considered simulated class-balance datasets, simulated class-imbalance datasets and public datasets to demonstrate LFR algorithm performs well across various types of concept drifts, including those where the baseline performs poorly.
To generalize the performance and evaluate the confidence of the algorithms, we use the bootstrapping technique. For each synthetic dataset, we generate data streams of $(y_t, \hat{y}_t)$ pairs rather than $(X_t, y_t)$, so that the comparison of detection algorithms is independent of the classifiers employed. For each public dataset, the order of the pairs within each concept is permuted to create 100 bootstrapped data streams. Each stream is fed to all detection algorithms to obtain single-run detections for each method. To illustrate detection accuracy, we use overlapped histograms to visualize the distribution of detection points obtained from the concept drift detection models across the 100 runs. To avoid redundancy, we present histograms for only a subset of the experiments; the remaining ones are similar. As shown below, LFR consistently outperformed the baseline approaches. When compared to NFR, LFR correctly identifies more true drift points with higher probability and fewer false alarms, even with a smaller significance level.
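The within-concept permutation used to build bootstrapped streams can be sketched as below. The function name `bootstrap_stream`, the toy stream and the single drift point are assumptions for illustration; only the count of 100 replicates follows the text.

```python
import random

def bootstrap_stream(stream, drift_points, seed):
    """Shuffle examples within each concept while keeping concept order.

    `drift_points` lists the indices at which a new concept begins.
    """
    rng = random.Random(seed)
    edges = [0, *sorted(drift_points), len(stream)]
    shuffled = []
    for start, end in zip(edges[:-1], edges[1:]):
        segment = list(stream[start:end])
        rng.shuffle(segment)
        shuffled.extend(segment)
    return shuffled

# Toy stream of 10 items with a new concept starting at index 4.
stream = list(range(10))
replicates = [bootstrap_stream(stream, drift_points=[4], seed=s)
              for s in range(100)]
print(replicates[0])
```

Because shuffling never moves an example across a drift point, the true drift times are identical in every replicate.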
IV-A Synthetic Data
Numerous experiments were run on synthetic data, covering various types of concept drift. In each bootstrap, a data stream of pairs with one change-point at is generated by using the same mechanism introduced in §III-C. The objective of detection algorithms is to identify the change-point . Six challenging and interesting scenarios are discussed below.
IV-A1 Balanced Dataset
In balanced datasets, is required in underlying data generation. Class-balance data are the most typical scenario in classification task and hence investigated with following three representative experiments.
Balance1: Overall accuracy of classifier drops but remains constant with and .
Balance2: Gradual drift in which overall accuracy () remains the same with .
Balance3: Overall accuracy () increases and remains unchanged with , .
IV-A2 Imbalanced Dataset
For imbalanced datasets, we used the same data generation mechanism as in the balanced case but made the class proportions imbalanced. We considered the following three types of concept drift, which receive considerable attention in real applications.
Imbalance1: From class balance dataset to class imbalance dataset with and . Without loss of generality, let be the minority class. It is also noteworthy that and are unchanged after drift occurrence. Hence, many detectors in imbalance data learning society, using F1 score as a measure to monitor classifier performance, is unable to alarm this type of drift. However, Fig. 7 shows that LFR performs very well by dominating both high early detection rate and trivial false alarms. Besides, DDM and DDM-OCI has no detection after change-point due to the increment of () and , respectively.
Imbalance2: The class ratio and remain unchanged but decreases with and .
Imbalance3: All , and decreases. Though class ratio remains the same, both F1-score and overall accuracy decreases. Two conditional probability matrices are selected as and .
IV-B Public Datasets
All detection algorithms are evaluated on four public datasets used in the literature. Without loss of generality, we chose the Support Vector Machine (SVM) with an RBF kernel as the classifier, because all detection algorithms are independent of the type of classifier. Misclassification of the minority class is penalized 100 times more than that of the majority class. If a potential concept drift is reported by the algorithm, examples from the new concept are stored to retrain a new SVM classifier adapted to the new concept. The number of examples used for retraining is set separately for the SEA and Rotating Hyperplane datasets and for the USENET1 and USENET2 datasets.
|Dataset||True Drift Time||dimensions ()|
The SEA Concepts dataset is available at http://www.liaad.up.pt/kdus/products/datasets-for-concept-drift and is widely used as a testbed by concept drift detection algorithms. The Rotating Hyperplane dataset and the specific pairs of each concept are available at http://www.win.tue.nl/~mpechen/data/DriftSets/. The USENET1 and USENET2 datasets are available at http://mlkd.csd.auth.gr/concept_drift.html. They are stream collections of messages from different newsgroups (e.g., medicine, space, baseball) to a user. The difference between USENET1 and USENET2 is the magnitude of drift: the user in USENET1 has a sharper topic shift than the one in USENET2.
All above datasets are in form of and their key features are summarized in Table I. Other details such as the imbalance status and type of drift of each dataset are available through above links.
In SEA Concepts Dataset experiment, Fig. 8 shows that LFR dominates other three approaches in terms of early detections and fewer false or delayed detections.
Fig. 9 shows that LFR has a dominant performance in the Rotating Hyperplane dataset experiment. At the second true drift time point, the underlying concept change is very minor, and hence the drift is missed by all detection algorithms.
In the USENET1 dataset experiment, Fig. 10 indicates that LFR dominates the other approaches and that all drift points are detected. Similarly, in the USENET2 dataset experiment, LFR also outperforms the other approaches, but detections are delayed with a longer time lag. The reduced advantage of LFR from USENET1 to USENET2 is due to the smaller magnitude of the concept drifts.
IV-C Summary Statistics
|Parameters||Detect Sig.||Warn Sig.||Decay|
In general, the best algorithm has the minimal number of false alarms and the maximal number of early detections, whereas poor algorithms produce a large number of false alarms and missing or severely delayed true detections. A summary of the counts of correct detections at the true drift timestamps and the counts of false detections during the false detection period, for the simulated and public datasets, is provided in Tables II and III.
The false detection period refers to the period preceding the data points that belong to the new concept. For the synthetically generated datasets in §IV-A, there were two concepts, and the false detection period is the span of time steps preceding the change-point. For the datasets specified in §IV-B, if there were more than two concepts, the false detection period corresponds to the range from the concept midway up to the next true drift point. Each bin in the histograms corresponds to a fixed number of time steps in §IV-A and is dataset-dependent in §IV-B. Since it has been observed in [17, 18] that false alarms may have a smaller influence on predictive performance than late drift detections, the true detection period in our experiments refers to the period immediately following the change-point in §IV-A and each true drift point in §IV-B, measured in histogram bins. Other parameter settings of the detection algorithms are summarized in Tables IV and V. They are selected specifically to show the dominating performance of LFR over the benchmark algorithms, i.e., the smallest allowable type-I error with the largest statistical power.
The paper presents a concept drift detection framework (LFR) that detects the occurrence of a concept drift and identifies the data points that belong to the new concept. The versatility of LFR allows it to work with both batch and stream datasets and with imbalanced datasets, and, unlike other popular concept drift detection approaches, it uses user-specified parameters that are intuitively comprehensible. LFR significantly outperforms existing benchmark approaches in terms of early detection of concept drifts, high detection rate and low false alarm rate across the types of concept drifts.
-  J. Gama, P. Medas, G. Castillo, and P. Rodrigues, “Learning with drift detection,” in Advances in Artificial Intelligence–SBIA 2004. Springer, 2004, pp. 286–295.
-  S. Wang, L. L. Minku, D. Ghezzi, D. Caltabiano, P. Tino, and X. Yao, “Concept drift detection for online class imbalance learning,” in Neural Networks (IJCNN), The 2013 International Joint Conference on. IEEE, 2013, pp. 1–10.
-  M. Baena-García, J. del Campo-Ávila, R. Fidalgo, A. Bifet, R. Gavaldà, and R. Morales-Bueno, “Early drift detection method,” 2006.
-  D. K. Antwi, H. L. Viktor, and N. Japkowicz, “The perfsim algorithm for concept drift detection in imbalanced data,” in Data Mining Workshops (ICDMW), 2012 IEEE 12th International Conference on. IEEE, 2012, pp. 619–628.
-  R. Klinkenberg and T. Joachims, “Detecting concept drift with support vector machines,” in Proceedings of the Seventeenth International Conference on Machine Learning. Morgan Kaufmann Publishers Inc., 2000, pp. 487–494.
-  D. S. Matteson and N. A. James, “A nonparametric approach for multiple change point analysis of multivariate data,” Journal of the American Statistical Association, vol. 109, no. 505, pp. 334–345, 2014.
-  X. Song, M. Wu, C. Jermaine, and S. Ranka, “Statistical change detection for multi-dimensional data,” in Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2007, pp. 667–676.
-  A. Dries and U. Rückert, “Adaptive concept drift detection,” Statistical Analysis and Data Mining, vol. 2, no. 5-6, pp. 311–327, 2009.
-  L. A. Aroian and H. Levene, “The effectiveness of quality control charts,” Journal of the American Statistical Association, vol. 45, no. 252, pp. 520–529, 1950.
-  M. Basseville, I. V. Nikiforov et al., Detection of abrupt changes: theory and application, vol. 104.
-  S. Wang, L. L. Minku, and X. Yao, “A learning framework for online class imbalance learning,” in Computational Intelligence and Ensemble Learning (CIEL), 2013 IEEE Symposium on. IEEE, 2013, pp. 36–45.
-  D. Bhati, P. Kgosi, and R. N. Rattihalli, “Distribution of geometrically weighted sum of bernoulli random variables,” Applied Mathematics, vol. 2, p. 1382, 2011.
-  D. Meyer and F. T. Wien, “Support vector machines,” 2014.
-  W. N. Street and Y. Kim, “A streaming ensemble algorithm (sea) for large-scale classification,” in Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2001, pp. 377–382.
-  W. Fan, “Systematic data selection to mine concept-drifting data streams,” in Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2004, pp. 128–137.
-  I. Katakis, G. Tsoumakas, and I. Vlahavas, “An ensemble of classifiers for coping with recurring contexts in data streams,” in Proceedings of the 2008 conference on ECAI 2008: 18th European Conference on Artificial Intelligence. IOS Press, 2008, pp. 763–764.
-  C. Alippi, G. Boracchi, and M. Roveri, “Just-in-time classifiers for recurrent concepts.” IEEE transactions on neural networks and learning systems, vol. 24, no. 4, pp. 620–634, 2013.
-  L. L. Minku and X. Yao, “Ddd: A new ensemble approach for dealing with concept drift,” Knowledge and Data Engineering, IEEE Transactions on, vol. 24, no. 4, pp. 619–633, 2012.