Concept Drift Detection for Streaming Data

04/04/2015 ∙ by Heng Wang, et al. ∙ Bosch ∙ Johns Hopkins University

Common statistical prediction models often require and assume stationarity in the data. However, in many practical applications, changes in the relationship of the response and predictor variables are regularly observed over time, resulting in the deterioration of the predictive performance of these models. This paper presents Linear Four Rates (LFR), a framework for detecting these concept drifts and subsequently identifying the data points that belong to the new concept (for relearning the model). Unlike conventional concept drift detection approaches, LFR can be applied to both batch and stream data; is not limited by the distribution properties of the response variable (e.g., datasets with imbalanced labels); is independent of the underlying statistical model; and uses user-specified parameters that are intuitively comprehensible. The performance of LFR is compared to benchmark approaches using both simulated and commonly used public datasets that span the gamut of concept drift types. The results show that LFR significantly outperforms benchmark approaches in terms of recall, accuracy and delay in detection of concept drifts across datasets.

I Introduction

A common challenge when mining data streams is that the data streams are not always strictly stationary, i.e., the concept of data (underlying distribution of incoming data) unpredictably drifts over time. This has encouraged the need to detect these concept drifts in the data streams in a timely manner, be it for business intelligence or as a means to track the performance of statistical prediction models that use these data streams as input.

This paper focuses on detecting concept drifts affecting binary classification models. For a binary classification problem, concept drift is said to occur when the joint distribution $P(X_t, y_t)$ changes over time, where $X_t$ are the predictor variables at time step $t$ and $y_t$ is the corresponding binary response variable. Intuitively, concept drift refers to the scenario in which the underlying distribution that generates the response variable changes over time. Popular approaches for detecting concept drift identify the change point [1, 2]. DDM, the most widely used concept drift detection algorithm, is designed strictly for streaming data [1]. The test statistic DDM employs is the sum of the overall classification error and its empirical standard deviation. DDM focuses on the overall error rate and hence fails to detect a drift unless the sum of false positives and false negatives changes. An example of such a scenario is a confusion matrix whose false positive and false negative counts shift in opposite directions, thus preserving the overall error rate. This limitation is accentuated in imbalanced classification tasks [2]. Unfortunately, this failure to detect a drastic drop in recall of the minority class is often critical. For instance, if the minority class in such an example corresponded to products at a manufacturing plant that were classified as defective, a critical threefold decrease in the true positive rate (i.e., from 0.75 to 0.25) would go unnoticed by DDM.
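The blind spot described above can be made concrete with a small sketch. The counts below are hypothetical (they are not the matrices used in the paper); they are chosen only so that the true positive rate falls from 0.75 to 0.25 while the overall error rate stays fixed.

```python
def error_rate(tp, fn, fp, tn):
    """Overall misclassification rate, the quantity DDM tracks."""
    return (fp + fn) / (tp + fn + fp + tn)


def true_positive_rate(tp, fn, fp, tn):
    """Recall of the positive class (the minority class in this example)."""
    return tp / (tp + fn)


# Hypothetical counts per 100 examples; 20 of them belong to the minority class.
before = {"tp": 15, "fn": 5, "fp": 10, "tn": 70}
after = {"tp": 5, "fn": 15, "fp": 0, "tn": 80}

for name, cm in (("before", before), ("after", after)):
    # The error rate is identical in both rows, while the recall collapses.
    print(name, error_rate(**cm), true_positive_rate(**cm))
```

A detector that only watches the first printed number sees no change between the two rows, which is the motivation for monitoring several rates at once.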

Drift Detection Method for Online Class Imbalance (DDM-OCI) addresses the limitation of DDM when the class ratio is imbalanced [2]. However, DDM-OCI triggers a number of false alarms due to an inherent weakness in the model. DDM-OCI assumes that concept drift in an imbalanced classification task is indicated by a change in the underlying true positive rate (i.e., minority-class recall). This hypothesis unfortunately does not consider the case in which concept drift occurs without affecting the recall of the minority class. It can be shown that the concept can drift from imbalanced class data to balanced class data while the true positive rate ($P_{tpr}$), positive predicted value ($P_{ppv}$) and F1-score remain unchanged. Thus, this type of drift is unlikely to be detected by DDM-OCI unless other rates such as the true negative rate ($P_{tnr}$) or negative predicted value ($P_{npv}$) are also considered. Additionally, the test statistic used by DDM-OCI is not approximately normally distributed under the stable concept. Thus, the rationale for constructing confidence levels specified in [1] does not fit the null distribution of this test statistic. This is the reason DDM-OCI triggers false alarms quickly and frequently.

Early Drift Detection Method (EDDM) achieves better detection results than DDM if the data stream has slow gradual change. EDDM monitors the distance between two classification errors [3]. The PerfSim algorithm considers all the components of a confusion matrix and monitors the cosine similarity coefficient of all components from two batches of data [4]. If the similarity coefficient drops below a user-specified threshold, a concept drift is signified. However, EDDM must wait for a minimum number of classification errors before calculating the monitoring statistic at each decision point. That is, the length of the time interval between decision points is a random number that depends on the appearance of classification errors, and there may be a great many examples between classification errors. The PerfSim algorithm is also constrained by the requirement to collect mini-batch data to calculate its monitoring statistics. The method used to partition the data stream in [3, 4] is either user-specified from practical experience or learned before the start of detection. Hence, EDDM and PerfSim are not well suited to streaming environments in which decisions are made instantly. The approach specified in [5] makes use of SVM to monitor three measures over time: overall accuracy, recall, and precision. This approach also computes the three measures by assuming that the data arrives in batches, on which the SVM is learned.

To address the limitations of existing approaches, we present Linear Four Rates (LFR) for detecting the drift of $P(X_t, y_t)$. Unlike other proposed approaches, LFR can detect all possible variants of concept drift, even in the presence of imbalanced class labels, as shown in Section IV. LFR outperforms existing approaches in terms of earliest detection of concept drift, with the fewest false alarms and the best recall. Additionally, LFR does not require the data to arrive in batches and is independent of the underlying classifier employed.

II Problem formulation

Given that detection of concept drift is equivalent to detecting a change-point in $P(X_t, y_t)$, an intuitive approach is to test the statistical hypothesis upon the multivariate variable $(X_t, y_t)$ in the data stream [6, 7, 8]. The limitation of this approach is that the statistical power degrades when the dimension of $X_t$ is extremely large or the magnitude of the drift is small. Hence, to overcome these limitations, the proposed approach identifies the change in $P(f(X_t), y_t)$, where $f$ is the classifier used for prediction. This is motivated by the fact that any drift of $P(f(X_t), y_t)$ would imply a drift in $P(X_t, y_t)$, with probability 1.

Let $f$ be a binary classifier for the given data stream $\{(X_t, y_t)\}$, with prediction $\hat{y}_t = f(X_t)$. We define the corresponding confusion probability matrix ($CP$) for $f$ to be

$$CP = \begin{array}{c|cc} \text{Pred} \backslash \text{True} & 0 & 1 \\ \hline 0 & TN & FN \\ 1 & FP & TP \end{array}$$

where $TP$, $TN$, $FP$ and $FN$ denote the underlying proportions of true positives, true negatives, false positives and false negatives respectively, for classifier $f$.

The four characteristic rates (True Positive Rate, True Negative Rate, Positive Predicted Value, Negative Predicted Value) can be computed as follows: $P_{tpr} = TP/(TP+FN)$, $P_{tnr} = TN/(TN+FP)$, $P_{ppv} = TP/(FP+TP)$ and $P_{npv} = TN/(TN+FN)$. All four characteristic rates in $P_\star$, $\star \in \{tpr, tnr, ppv, npv\}$, are equal to 1 if there is no misclassification.
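As a quick illustration of these definitions, the sketch below computes the four rates from the entries of a confusion (probability) matrix. The function name and the example values are illustrative assumptions, and the sketch assumes every denominator is non-zero.

```python
def four_rates(tp, tn, fp, fn):
    """Return the four characteristic rates from confusion-matrix entries.

    Works for counts or proportions; assumes each denominator is non-zero.
    """
    return {
        "tpr": tp / (tp + fn),  # true positive rate (recall of class 1)
        "tnr": tn / (tn + fp),  # true negative rate (recall of class 0)
        "ppv": tp / (tp + fp),  # positive predicted value (precision)
        "npv": tn / (tn + fn),  # negative predicted value
    }


# Hypothetical balanced concept with 80% accuracy in each cell pair.
rates = four_rates(tp=0.4, tn=0.4, fp=0.1, fn=0.1)
perfect = four_rates(tp=0.5, tn=0.5, fp=0.0, fn=0.0)
```

With no misclassification (`perfect`), all four rates equal 1, matching the remark above.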

Under a stable concept (i.e., $P(X_t, y_t)$ remains unchanged), $P_\star$ remains the same. Thus, a significant change in any $P_\star$ implies a change in the underlying joint distribution $P(X_t, y_t)$, or concept. It is worth noting that at every time step $t$, for any possible $(y_t, \hat{y}_t)$ pair, where $\hat{y}_t = f(X_t)$ is the prediction, only two of the four empirical rates in $P_\star$ will change, and these two rates are referred to as “influenced by $(y_t, \hat{y}_t)$”. Also, note that in certain applications a concept drift is not of interest, and an alarm is unnecessary, if all empirical rates in $P_\star$ are increasing, because this suggests that the old model learned from historical data performs even better on the current data stream. We do not use this assumption in this paper, but all methodologies and arguments we propose below can be easily adapted to it.
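The notion of a rate being “influenced by” an observed pair follows directly from the denominators of the four rates: the true label selects between TPR and TNR, and the predicted label selects between PPV and NPV. A minimal sketch, with a hypothetical function name:

```python
def influenced_rates(y_true, y_pred):
    """Return the two rates whose empirical value changes for this pair.

    The true label decides which recall-type rate is touched (tpr or tnr);
    the predicted label decides which precision-type rate is touched
    (ppv or npv).
    """
    by_true_class = "tpr" if y_true == 1 else "tnr"
    by_predicted_class = "ppv" if y_pred == 1 else "npv"
    return (by_true_class, by_predicted_class)
```

For instance, a false positive (`y_true=0`, `y_pred=1`) lowers the empirical TNR and PPV and leaves TPR and NPV untouched.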

III Concept Drift Detection Framework

Given the efficacy of the rates $P_\star$ (where $\star \in \{tpr, tnr, ppv, npv\}$) for detecting concept drift, the proposed concept drift detection framework uses estimators of the rates in $P_\star$ as test statistics to conduct statistical hypothesis testing at each time step. Specifically, at each time step $t$ the framework tests the null hypothesis $H_0$ that none of the four rates $P_\star$ has changed against the alternative hypothesis $H_A$ that at least one of them has changed.

The concept is stable under $H_0$ and is considered to have drifted if $H_0$ is rejected. The idea is to compare the statistical significance level of the running test statistic under $H_0$ at each time step to the user-defined warning ($\delta_\star$) and detection ($\epsilon_\star$) significance levels. This type of test is called a “continuing test” [9], and in our problem all time stamps are decision points of acceptance or rejection. When the concept is stable, false alarms on $P_\star$ will then be triggered unnecessarily once in every $1/\epsilon_\star$ time steps in the long run. In this paper, we assume the spacing of decision points is fixed. Accordingly, the familywise error rate and its cost in our continuing test can be controlled by using a simultaneous inference method such as the classical Bonferroni correction on $\epsilon_\star$. In a more general case where the spacings of decision points are unequal and the test statistics are strongly positively correlated, we should instead consider the average run length of the test [10] or more powerful alternatives that control the familywise error rate.

A naïve implementation of the “continuing test” framework (Naïve Four Rates) would be to use $\hat{P}_\star^{(t)}$ (the empirical rate of $P_\star$) as the estimators and test statistics. But as shown in Section III-C, there are better estimates of $P_\star$.

In the following section, Linear Four Rates (LFR) algorithm will be used to elaborate on the concept drift detection framework. LFR differs from Naïve Four Rates (NFR) in terms of the estimator used. However, both LFR as well as NFR perform better than DDM and DDM-OCI due to the more comprehensive detection framework utilized.

III-A Linear Four Rates algorithm (LFR)

III-A1 Algorithm Outline

LFR uses modified rates $R_\star^{(t)}$ as the test statistics for $P_\star$. $R_\star^{(t)}$ is a modified version of the empirical rate $\hat{P}_\star^{(t)}$. At each $t$, $R_\star^{(t)}$ is updated as $R_\star^{(t)} \leftarrow \eta_\star R_\star^{(t-1)} + (1 - \eta_\star) 1_{\{y_t = \hat{y}_t\}}$ for those empirical rates “influenced by $(y_t, \hat{y}_t)$”. $R_\star^{(t)}$ is essentially a linear combination of the classifier’s previous performance $R_\star^{(t-1)}$ and its current performance $1_{\{y_t = \hat{y}_t\}}$, where $\eta_\star$ is a time decay factor that weights the two. This time-decayed form has been used as a class imbalance detector and as a revised recall test statistic in [11][2]. The probabilistic characteristics of our test statistic are investigated in §III-A2. The pseudocode of the framework (using $R_\star^{(t)}$ as an estimator of $P_\star$ for the required test statistic) is detailed in Algorithm 1.

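Input: Data $\{(X_t, y_t)\}_{t=1}^{\infty}$ where $X_t$ are the predictors and $y_t \in \{0, 1\}$; binary classifier $f(\cdot)$; time decaying factors $\eta_\star$; warn significance level $\delta_\star$; detect significance level $\epsilon_\star$.
Output: Detected concept drift time (detect.time).
1: Initialize $R_\star^{(0)}$ and $\hat{P}_\star^{(0)}$, where $\star \in \{tpr, tnr, ppv, npv\}$, and the confusion matrix $C^{(0)}$;
2: for $t = 1$ to $\infty$ do
3:     $\hat{y}_t \leftarrow f(X_t)$
4:     $C^{(t)}[\hat{y}_t][y_t] \leftarrow C^{(t-1)}[\hat{y}_t][y_t] + 1$
5:     for each $\star \in \{tpr, tnr, ppv, npv\}$ do
6:         if ($\star$ is influenced by $(y_t, \hat{y}_t)$) then
7:             $R_\star^{(t)} \leftarrow \eta_\star R_\star^{(t-1)} + (1 - \eta_\star) 1_{\{y_t = \hat{y}_t\}}$
8:         else
9:             $R_\star^{(t)} \leftarrow R_\star^{(t-1)}$
10:         end if
11:         if ($\star \in \{tpr, tnr\}$) then
12:             $N_\star \leftarrow C^{(t)}[0][k] + C^{(t)}[1][k]$, with $k = 1_{\{\star = tpr\}}$
13:             $\hat{P}_\star^{(t)} \leftarrow C^{(t)}[k][k] / N_\star$
14:         else
15:             $N_\star \leftarrow C^{(t)}[k][0] + C^{(t)}[k][1]$, with $k = 1_{\{\star = ppv\}}$
16:             $\hat{P}_\star^{(t)} \leftarrow C^{(t)}[k][k] / N_\star$
17:         end if
18:         warn.bd$_\star$ $\leftarrow$ BoundTable($\hat{P}_\star^{(t)}, \eta_\star, \delta_\star, N_\star$)
19:         detect.bd$_\star$ $\leftarrow$ BoundTable($\hat{P}_\star^{(t)}, \eta_\star, \epsilon_\star, N_\star$)
20:     end for
21:     if (any $R_\star^{(t)}$ exceeds warn.bd$_\star$ & warn.time $= 0$) then
22:         warn.time $\leftarrow t$
23:     else if (no $R_\star^{(t)}$ exceeds warn.bd$_\star$) then
24:         warn.time $\leftarrow 0$
25:     end if
26:     if (any $R_\star^{(t)}$ exceeds detect.bd$_\star$) then
27:         detect.time $\leftarrow t$;
28:         relearn $f(\cdot)$ using $\{(X_i, y_i)\}_{i = \text{warn.time}}^{\text{detect.time}}$
29:         reset $R_\star$, $\hat{P}_\star$, $C$ as done in Step 1
30:         return detect.time
31:     end if
32: end for
Algorithm 1 Linear Four Rates method (LFR)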
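The monitoring loop of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it consumes $(y_t, \hat{y}_t)$ pairs directly instead of calling a classifier, omits relearning, assumes initial values of 0.5 for the modified rates and unit counts for the confusion matrix, and takes the bound table as a hypothetical callable `bound_table(p_hat, eta, level, n)` returning a `(lower, upper)` pair. The default significance levels are placeholders.

```python
RATES = ("tpr", "tnr", "ppv", "npv")


def lfr_detect(stream, bound_table, eta=0.9, warn_level=0.01, detect_level=0.0001):
    """Return (warn_time, detect_time) for a stream of (y_true, y_pred) pairs.

    detect_time is None if no drift is flagged before the stream ends.
    """
    r = {s: 0.5 for s in RATES}   # modified rates R (assumed initial value)
    counts = [[1, 1], [1, 1]]     # counts[pred][true] (assumed initial counts)
    warn_time = None
    for t, (y, y_hat) in enumerate(stream, start=1):
        counts[y_hat][y] += 1
        influenced = ("tpr" if y == 1 else "tnr", "ppv" if y_hat == 1 else "npv")
        warn = detect = False
        for s in RATES:
            if s in influenced:
                # Time-decayed update (line 7 of Algorithm 1).
                r[s] = eta * r[s] + (1 - eta) * (y == y_hat)
            k = 1 if s in ("tpr", "ppv") else 0   # class the rate refers to
            if s in ("tpr", "tnr"):
                n = counts[0][k] + counts[1][k]   # examples whose true class is k
            else:
                n = counts[k][0] + counts[k][1]   # examples predicted as class k
            p_hat = counts[k][k] / n              # empirical rate
            lo, hi = bound_table(p_hat, eta, warn_level, n)
            warn = warn or not lo <= r[s] <= hi
            lo, hi = bound_table(p_hat, eta, detect_level, n)
            detect = detect or not lo <= r[s] <= hi
        if warn and warn_time is None:
            warn_time = t          # first crossing of a warning bound
        elif not warn:
            warn_time = None       # warning erased when all rates are back in bounds
        if detect:
            return warn_time, t
    return warn_time, None
```

Passing the bounds as a callable keeps the loop independent of how the bounds are produced (Monte Carlo simulation, a precomputed table, or a closed form), which mirrors the separation between Algorithm 1 and the BoundTable it consults.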

The three user-defined parameters are the time decaying factor ($\eta_\star$), the warning significance level ($\delta_\star$) and the detection significance level ($\epsilon_\star$) for each rate. The time decaying factor is the weight in $R_\star^{(t)}$ that governs how much the classifier's performance on the current prediction counts. Given that the detection methodology conducts hypothesis testing at each time step, $\delta_\star$ and $\epsilon_\star$ are interpretable statistical significance levels, i.e., type I error (false alarm rate), in the standard testing framework. In practice, the allowable false warning rate and false detection rate in applications such as quality control of a moving assembly line are guidelines to help the user choose the parameters $\delta_\star$ and $\epsilon_\star$. For a fair comparison, $\eta_\star$ is set to the same value of 0.9 as in [2] for all experiments in this paper. The optimal selection of $\eta_\star$ is domain dependent and can be pre-learned if necessary.

Theorem 1 in Section III-A2 shows that under the stable concept, $R_\star^{(t)}$ is a geometrically weighted sum of i.i.d. Bernoulli random variables, which emphasizes the most recent prediction accuracy and places exponentially decaying weights on the historical prediction accuracies. By taking advantage of this weighting scheme, $R_\star^{(t)}$ is more sensitive to concept drifts, foreshadowing the non-stationarity of the classifier’s performance.

Building on Theorem 1, we are able to overcome the shortcoming of [2] and construct a more reliable running confidence interval for $R_\star^{(t)}$ to control the type I error. $R_\star^{(t)}$ is distributed as a geometrically weighted sum of Bernoulli random variables. Bhati et al. investigate the closed-form distribution function of such a sum for a special case [12]. However, a closed-form distribution function is unattainable outside that special case. Alternatively, according to Theorem 1, a reasonable empirical distribution can be obtained independently by Monte Carlo simulation for a given underlying rate $P_\star$, number of time steps $N_\star$ and time decaying factor $\eta_\star$. The pseudocode for the Monte Carlo sampling procedure is provided in Algorithm 2. As $P_\star$ is unknown, $\hat{P}_\star$ is used as its surrogate to generate the empirical distribution of $R_\star^{(t)}$. Based on the empirical distribution, the lower and upper quantiles for the given significance level $\alpha$ serve as the required (warning/detection) bounds. The selection of $\hat{P}_\star$ as the best surrogate of $P_\star$ is supported by Lemma 1.

$\delta_\star$ and $\epsilon_\star$ denote the warning and detection significance levels respectively, where $\delta_\star > \epsilon_\star$. The corresponding warn.bd$_\star$ and detect.bd$_\star$ are obtained from Monte Carlo simulations as described. The bounds of the four rates of the framework can be set independently based on importance, by having distinct $\delta_\star$ and $\epsilon_\star$ for each rate. For instance, in some imbalanced classification tasks, performance of the classifier on the minority class is a higher priority than on the majority class.

Having computed the bounds, the framework considers that a concept drift is likely to occur and sets the warning signal (warn.time $\leftarrow t$) when any $R_\star^{(t)}$ crosses the corresponding warning bounds (warn.bd$_\star$) for the first time. If any $R_\star^{(t)}$ reaches the corresponding detection bound (detect.bd$_\star$), the concept drift is affirmed at time $t$ (detect.time $\leftarrow t$).

All examples stored between warn.time and detect.time are extracted to relearn a new classifier, since the stored examples are considered samples of the new concept. In case the stored examples are too few to relearn a reasonable classifier, one will have to wait for sufficient training examples. However, if the $R_\star^{(t)}$ cross the corresponding warning bounds warn.bd$_\star$ but return within them without reaching detect.bd$_\star$, the previous warning flag will be erased. After a concept drift is detected, $R_\star$, $\hat{P}_\star$ and $C$ are reset to their initial values, so that a new monitoring cycle can restart.

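Input: Estimate of underlying rate $\hat{P}$; time decaying factor $\eta$; significance level $\alpha$; number of time steps $N$; number of random variables $MC$.
Output: Numeric bound for significance level $\alpha$.
1: for $j = 1$ to $MC$ do
2:     Generate $N$ independent Bernoulli random variables $\{I_i\}_{i=1}^{N}$ where $I_i \sim \mathrm{Bernoulli}(\hat{P})$
3:     $R_j \leftarrow (1 - \eta) \sum_{i=1}^{N} \eta^{N-i} I_i$
4: end for
5: $\{R_j\}_{j=1}^{MC}$ forms an empirical distribution $\hat{F}$; find the $\alpha$ level quantile as the lower bound and the $1 - \alpha$ level quantile as the upper bound
Algorithm 2 Generation of BoundTable in LFR algorithm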
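A minimal Python sketch of this Monte Carlo procedure is given below. It is an illustration rather than the authors' code: the function name, the default number of trials and the simple index-based empirical quantile are assumptions, and the geometrically weighted sum is accumulated recursively starting from zero, which is algebraically the same sum.

```python
import random


def simulate_bounds(p_hat, eta, alpha, n_steps, n_trials=10000, seed=None):
    """Monte Carlo (lower, upper) bounds for the geometrically weighted sum.

    Each trial draws n_steps Bernoulli(p_hat) indicators and accumulates
    (1 - eta) * sum_i eta**(n_steps - i) * I_i via the recursion below.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_trials):
        r = 0.0
        for _ in range(n_steps):
            indicator = 1.0 if rng.random() < p_hat else 0.0
            r = eta * r + (1 - eta) * indicator
        samples.append(r)
    samples.sort()
    # Simple empirical quantiles by index (an assumed convention).
    lower = samples[int(alpha * (n_trials - 1))]
    upper = samples[int((1 - alpha) * (n_trials - 1))]
    return lower, upper
```

For example, `simulate_bounds(0.8, 0.9, 0.05, 50, n_trials=2000, seed=1)` returns an interval around 0.8; a smaller `alpha` widens the interval and therefore lowers the false alarm rate.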

III-A2 Analysis

The following theorems investigate the statistical properties of the LFR test statistic $R_\star^{(t)}$.

Theorem 1

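For any $\star \in \{tpr, tnr, ppv, npv\}$, $R_\star^{(t)}$ is a geometrically weighted sum of Bernoulli random variables when there is a stable concept up to time $t$: i.e., $R_\star^{(t)} = (1 - \eta_\star) \sum_{i=1}^{N_\star} \eta_\star^{N_\star - i} I_i + \eta_\star^{N_\star} R_\star^{(0)}$, where $I_i \overset{i.i.d.}{\sim} \mathrm{Bernoulli}(P_\star)$ and $P_\star$ is the underlying rate.

Among the total $t$ time steps, suppose $R_\star$ is changed according to line 7 at time steps $t_1 < t_2 < \dots < t_{N_\star}$, where $N_\star \le t$. Hence,

$$R_\star^{(t)} = \eta_\star^{N_\star} R_\star^{(0)} + (1 - \eta_\star) \sum_{i=1}^{N_\star} \eta_\star^{N_\star - i} 1_{\{y_{t_i} = \hat{y}_{t_i}\}} = \eta_\star^{N_\star} R_\star^{(0)} + (1 - \eta_\star) \sum_{i=1}^{N_\star} \eta_\star^{N_\star - i} I_i$$

where the last equation holds by the stable concept assumption, under which all indicators are i.i.d. Bernoulli random variables with underlying rate $P_\star$. The initialization term $\eta_\star^{N_\star} R_\star^{(0)}$ is deterministic and vanishes geometrically in $N_\star$, so it can be neglected when simulating the distribution.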
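The unrolling of the update rule into a geometrically weighted sum can be checked numerically. The sketch below is only an illustration; the function names and the example indicator sequence are hypothetical, and the initial value `r0` is kept as an explicit argument.

```python
def recursive_rate(indicators, eta, r0):
    """Apply the update r <- eta * r + (1 - eta) * indicator step by step."""
    r = r0
    for ind in indicators:
        r = eta * r + (1 - eta) * ind
    return r


def closed_form_rate(indicators, eta, r0):
    """Geometrically weighted sum plus the decayed initial value."""
    n = len(indicators)
    weighted = sum(eta ** (n - i) * ind for i, ind in enumerate(indicators, start=1))
    return eta ** n * r0 + (1 - eta) * weighted


# 1 marks a correct prediction, 0 a misclassification (hypothetical sequence).
example = [1, 0, 1, 1, 0, 1]
```

Both functions agree up to floating-point rounding for any indicator sequence, and the contribution of `r0` shrinks by a factor of `eta` with every update.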

Lemma 1

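Assume the setting in Theorem 1. Under the stable concept, for any $\star \in \{tpr, tnr, ppv, npv\}$, $\hat{P}_\star^{(t)}$ is the unique Uniformly Minimum Variance Unbiased Estimator (UMVUE) of $P_\star$. As $N_\star \to \infty$, $\hat{P}_\star^{(t)}$ is approximately distributed as $\mathcal{N}(P_\star, P_\star(1 - P_\star)/N_\star)$.

$\hat{P}_\star^{(t)}$ is an unbiased estimator of $P_\star$. This is because $\hat{P}_\star^{(t)} = \frac{1}{N_\star} \sum_{i=1}^{N_\star} I_i$, where $I_i$ are i.i.d. Bernoulli random variables realized at time $t_i$ with parameter $P_\star$. By the factorization theorem, $T = \sum_{i=1}^{N_\star} I_i$ is a sufficient statistic. Also, for any function $g$,

$$E[g(T)] = \sum_{k=0}^{N_\star} g(k) \binom{N_\star}{k} P_\star^{k} (1 - P_\star)^{N_\star - k} = (1 - P_\star)^{N_\star} \sum_{k=0}^{N_\star} g(k) \binom{N_\star}{k} \left( \frac{P_\star}{1 - P_\star} \right)^{k}.$$

If $E[g(T)] = 0$ for all $P_\star \in (0, 1)$, it implies $g(k) = 0$ for all $k$, because $E[g(T)] / (1 - P_\star)^{N_\star}$ is a polynomial of $P_\star / (1 - P_\star)$. Thereby $P(g(T) = 0) = 1$ and $T$ is a complete sufficient statistic by definition. By the Lehmann-Scheffé theorem, $\hat{P}_\star^{(t)}$ is the unique UMVUE.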

The cost of the Linear Four Rates (LFR) detection algorithm at each time step is dominated by the computation of the bounds. The LFR algorithm can be optimized by using a BoundTable precomputed by Algorithm 2. The 4-dimensional BoundTable with varying input ($\hat{P}_\star$, $\eta_\star$, significance level, $N_\star$) can itself be precomputed and stored before running Algorithm 1. It is unnecessary to spend any computational resource on quantile calculation during stream monitoring, because the observer can find the tabulated rate closest to $\hat{P}_\star^{(t)}$ in BoundTable to look up the lower and upper quantiles. Thus, the LFR algorithm takes $O(1)$ to test drift occurrence at each time point and suits a streaming environment.
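The precompute-then-look-up idea can be sketched as follows. For brevity this sketch tabulates only one of the four dimensions (the estimated rate) for a fixed decay factor, significance level and number of time steps; the grid size, function names and quantile convention are assumptions made for illustration.

```python
import random


def simulate_bounds(p, eta, alpha, n_steps, n_trials, rng):
    """Monte Carlo (lower, upper) quantiles of the weighted Bernoulli sum."""
    samples = []
    for _ in range(n_trials):
        r = 0.0
        for _ in range(n_steps):
            r = eta * r + (1 - eta) * (rng.random() < p)
        samples.append(r)
    samples.sort()
    return samples[int(alpha * (n_trials - 1))], samples[int((1 - alpha) * (n_trials - 1))]


def build_bound_table(eta, alpha, n_steps, grid_size=101, n_trials=2000, seed=0):
    """Precompute bounds on an evenly spaced grid of rates in [0, 1]."""
    rng = random.Random(seed)
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return [simulate_bounds(p, eta, alpha, n_steps, n_trials, rng) for p in grid]


def lookup_bounds(table, p_hat):
    """O(1) lookup: round p_hat to the nearest grid point and read its bounds."""
    index = round(p_hat * (len(table) - 1))
    return table[min(max(index, 0), len(table) - 1)]
```

All simulation cost is paid once in `build_bound_table`; during monitoring each test reduces to an index computation and a list access.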

III-B Naïve Four Rates algorithm (NFR)

For the purpose of comparison, this section details the characteristics of a naïve implementation of the proposed framework that uses $\hat{P}_\star^{(t)}$ as the test statistic. A benefit of choosing this test statistic is that there exists a closed-form distribution, as shown in Lemma 1. Using the same strategy as the LFR algorithm, the NFR algorithm monitors the four rates sequentially. At each time stamp, for each rate, hypothesis testing is done with the null distribution of Lemma 1, and the warning/detection alarms are set when $\hat{P}_\star^{(t)}$ exceeds the expected bounds.

The main difference with respect to LFR is the estimate of $P_\star$ used to find the null distribution. The LFR algorithm uses $\hat{P}_\star^{(t)}$ as a surrogate of the unknown $P_\star$, while the NFR algorithm uses $\bar{P}_\star^{(t)}$, where $\bar{P}_\star^{(t)}$ is a running average of all previous $\hat{P}_\star$. This update rule allows old prediction performance to contribute more to the estimate of $P_\star$ and recent predictions to contribute less. Thus, $\bar{P}_\star^{(t)}$ is more robust in terms of estimating the underlying $P_\star$ when concept drift occurs. Additionally, $\bar{P}_\star^{(t)}$ is still an MSE-consistent estimator under the stable concept, as presented in Lemma 2.

Lemma 2

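Assume the setting in Theorem 1. Under the stable concept up to $t$, for any $\star \in \{tpr, tnr, ppv, npv\}$, $\bar{P}_\star^{(t)}$ in the NFR algorithm is an MSE-consistent estimator of $P_\star$.

Among the total $t$ time steps, suppose $\hat{P}_\star$ is changed at time steps $t_1 < \dots < t_{N_\star}$, where $N_\star \le t$, so that $\bar{P}_\star^{(t)}$ is an average of empirical rates, each of which is a mean of indicators. By the i.i.d. assumption on the indicators, we obtain that the bias and the variance of $\bar{P}_\star^{(t)}$ both tend to zero as $N_\star \to \infty$. Thus, as $t \to \infty$ and $N_\star \to \infty$, $\bar{P}_\star^{(t)}$ is an MSE-consistent estimator.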

III-C Comparison between NFR and LFR

To empirically compare the test statistics of NFR and LFR, we use Figure 1 to illustrate a single run of both the LFR and NFR algorithms on the same synthetic streaming data $\{(y_t, \hat{y}_t)\}$. The data stream of $(y_t, \hat{y}_t)$ pairs, with one change-point at $t = 5000$, is generated by sampling from two confusion probability matrices, one for each concept. The type of drift is determined by the particular settings of these matrices. In this example, the matrices are chosen to generate a balanced stream of $(y_t, \hat{y}_t)$ pairs representing the scenario in which the overall accuracy of the classifier drops. The objective of the detection algorithms is to identify the change-point.
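The data generation mechanism described here can be sketched as sampling each $(y_t, \hat{y}_t)$ pair from one of two confusion probability matrices, switching at the change-point. The sketch below is an assumption-laden illustration: the function name and the `cp[pred][true]` layout are hypothetical conventions, and the matrices must be supplied by the caller.

```python
import random


def sample_stream(cp_before, cp_after, change_point, length, seed=None):
    """Sample (y_true, y_pred) pairs from two confusion probability matrices.

    cp[pred][true] holds the probability of each (prediction, label) cell.
    Time steps t >= change_point (1-indexed) are drawn from cp_after.
    """
    rng = random.Random(seed)
    cells = [(0, 0), (0, 1), (1, 0), (1, 1)]   # (pred, true)
    stream = []
    for t in range(1, length + 1):
        cp = cp_before if t < change_point else cp_after
        weights = [cp[pred][true] for pred, true in cells]
        pred, true = rng.choices(cells, weights=weights)[0]
        stream.append((true, pred))
    return stream
```

Because the stream consists of label and prediction pairs only, any detector fed with it is evaluated independently of a particular classifier.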

It is clear that the test statistic $R_\star^{(t)}$ in the LFR algorithm has a larger variance than $\hat{P}_\star^{(t)}$ for each rate. The LFR algorithm reports an earlier detection than NFR in this run (the true drift point is $t = 5000$). This observation matches well with the rationale for constructing $R_\star^{(t)}$, described in §III-A1, which gains detection sensitivity by introducing a larger variance. To rigorously compare the detection performance of $R_\star^{(t)}$ and $\hat{P}_\star^{(t)}$, more investigations are provided below.

Fig. 1: A single run of LFR and NFR on the same synthetic streaming data. Black, red and green vertical lines are the ’true drift time’, LFR detection time and NFR detection time respectively. Four colored dots (black, red, green, blue) are the running $R_\star^{(t)}$ and four colored horizontal lines (indigo, pink, yellow, grey) are the running $\hat{P}_\star^{(t)}$, where $\star \in \{tpr, tnr, ppv, npv\}$.

The power characteristics of the two competing test statistics, $R_\star^{(t)}$ (LFR) and $\hat{P}_\star^{(t)}$ (NFR), are compared empirically on synthetic data. The power estimates of the two statistics against varying time lag after the drift are presented in Figure 2. Figure 2 indicates that neither statistic dominates all the time: $R_\star^{(t)}$ (red surface) achieves a larger statistical power when the time lag is small but a smaller power when the time lag is large. This is because the update rule in line 7 enables the estimator $R_\star^{(t)}$ to shift from the old underlying rate to the new one at an exponential rate, which leads to the power dominance at a short lag. The price is that the limiting distributions of $R_\star^{(t)}$ under both the null and the alternative have larger variances than those of $\hat{P}_\star^{(t)}$, and thus the limiting power is degraded when the time lag is large and the magnitude of the drift is small.

   

Fig. 2: Power comparison between $R_\star^{(t)}$ and $\hat{P}_\star^{(t)}$ along the time lag after the underlying rate drifts.

In order to compare the sensitivities of $R_\star^{(t)}$ and $\hat{P}_\star^{(t)}$ with regard to detecting concept drift in more general settings, we varied the underlying rates before and after the drift. The result is illustrated in Figure 3.

Fig. 3: Power difference along the time lag for different combinations of concept change.

We see that, for any fixed pair of pre-drift and post-drift rates, $R_\star^{(t)}$ has the larger power when the time lag is small and $\hat{P}_\star^{(t)}$ has the larger power when the time lag is large. This is because the power advantage of $R_\star^{(t)}$ decreases as the time lag increases. This suggests that LFR is preferable if earlier detection is highly desired: alarms are more likely to be triggered at the earliest time after the occurrence of concept drift. Earlier detection allows the observer to adjust the model and avoid the costs of incorrect predictions immediately. On the other hand, if observers are only concerned with detecting the occurrence of drift in the data stream and are unconcerned with the promptness of detection, then the NFR algorithm provides a test statistic with higher power to detect the drift, because $\hat{P}_\star^{(t)}$ converges to the new underlying rate as the time lag grows.

To guide the selection between LFR and NFR, Figure 4 is a heatmap of limiting power estimates on all pairs of pre-drift and post-drift rates. We can see that the limiting power is already close to 1 when the two rates are significantly different.

Fig. 4: Heatmap of Power estimates

IV Experiments

In this section, we compared the detection performance of LFR to NFR, DDM and DDM-OCI approaches using both synthetic data and public datasets. We considered simulated class-balance datasets, simulated class-imbalance datasets and public datasets to demonstrate LFR algorithm performs well across various types of concept drifts, including those where the baseline performs poorly.

To generalize the performance and evaluate the confidence of the algorithms, we utilize the bootstrapping technique. For each synthetic dataset, we generate data streams of $(y_t, \hat{y}_t)$ pairs rather than $(X_t, y_t)$ pairs, so that the comparison of detection algorithms is independent of the classifiers employed. For each public dataset, the order of the $(X_t, y_t)$ pairs within each concept is permuted to create 100 bootstrapped dataset streams. Each stream is fed to all detection algorithms to obtain single-run detections for each method. To illustrate the accuracy of the detections, we use overlapped histograms to visualize the distribution of detection points obtained from the concept drift detection models across the 100 runs. To avoid redundancy, we present only a subset of the histograms; the remaining ones are similar. As shown below, LFR consistently outperformed the baseline approaches. When compared to NFR, LFR correctly identifies more true drift points with higher probability and a smaller number of false alarms, even with a smaller significance level.
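The within-concept permutation used for the public datasets can be sketched as follows. The function name is hypothetical, and `drift_points` is assumed to hold the 0-based index of the first example of each new concept.

```python
import random


def permute_within_concepts(examples, drift_points, seed=None):
    """Shuffle examples inside each concept while keeping the concept order.

    drift_points: 0-based indices at which a new concept starts.
    """
    rng = random.Random(seed)
    edges = [0] + sorted(drift_points) + [len(examples)]
    shuffled = []
    for start, end in zip(edges[:-1], edges[1:]):
        segment = list(examples[start:end])
        rng.shuffle(segment)          # permute only within this concept
        shuffled.extend(segment)
    return shuffled


# 100 bootstrapped streams from one dataset (toy data for illustration).
toy = list(range(10))
streams = [permute_within_concepts(toy, [4], seed=run) for run in range(100)]
```

Since examples never cross a concept boundary, the true drift points stay at the same positions in every bootstrapped stream.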

IV-A Synthetic Data

Numerous experiments were run on synthetic data, covering various types of concept drift. In each bootstrap, a data stream of $(y_t, \hat{y}_t)$ pairs with one change-point is generated using the same mechanism introduced in §III-C. The objective of the detection algorithms is to identify the change-point. Six challenging and interesting scenarios are discussed below.

IV-A1 Balanced Dataset

In balanced datasets, $P(y_t = 0) = P(y_t = 1)$ is required in the underlying data generation. Class-balanced data are the most typical scenario in classification tasks and are hence investigated with the following three representative experiments.

  1. Balance1: Overall accuracy of classifier drops but remains constant with and .

  2. Balance2: Gradual drift in which overall accuracy () remains the same with .

  3. Balance3: Overall accuracy () increases and remains unchanged with , .

Fig. 5: Overlapping histograms comparing detection timestamps on Balance1 dataset in which overall accuracy of classifier drops but remains unchanged. Number of counts of LFR is above the top bar of each bin.

   

Fig. 6: Overlapping histograms comparing detection timestamps on Balance2 dataset in which gradual drift occurs but overall accuracy remains the same. Number of counts of LFR is above the top bar of each bin.

IV-A2 Imbalanced Dataset

   

Fig. 7: Overlapping histograms comparing detection timestamps on the Imbalance1 dataset, in which the class ratio shifts from 1:1 to 9:1 but the F1-score remains unchanged. The count for LFR is shown above the top bar of each bin.

For imbalanced datasets, we used the same data generation mechanism as in the balanced case but made the two classes imbalanced. We considered the following three interesting types of concept drift, which receive much attention in real applications.

  1. Imbalance1: The stream changes from a class-balanced dataset to a class-imbalanced dataset. Without loss of generality, let the positive class be the minority class. It is also noteworthy that $P_{tpr}$ and $P_{ppv}$ are unchanged after the drift occurs. Hence, many detectors in the imbalanced learning community that use the F1-score to monitor classifier performance are unable to raise an alarm for this type of drift. However, Fig. 7 shows that LFR performs very well, with both a high early detection rate and few false alarms. Besides, DDM and DDM-OCI have no detections after the change-point, due to increments in the quantities they respectively monitor.

  2. Imbalance2: The class ratio and remain unchanged but decreases with and .

  3. Imbalance3: All , and decreases. Though class ratio remains the same, both F1-score and overall accuracy decreases. Two conditional probability matrices are selected as and .

IV-B Public Datasets

All detection algorithms are evaluated on four public datasets used in the literature. Without loss of generality, we chose the Support Vector Machine (SVM) [13] with an RBF kernel as the classifier $f$, because all detection algorithms are independent of the type of classifier. Misclassification of the minority class is penalized 100 times more than that of the majority class. If a potential concept drift is reported by the algorithm, examples from the new concept are stored to retrain a new SVM classifier adapted to the new concept. The number of examples used for retraining is set separately for the SEA and Rotating Hyperplane datasets and for the USENET1 and USENET2 datasets.

IV-B1 Datasets

Dataset True Drift Time dimensions ()
SEA
HYPER.
USENET1 100
USENET2 100
TABLE I: Key features of datasets.

The SEA Concepts dataset is used in [14]. The dataset is available at http://www.liaad.up.pt/kdus/products/datasets-for-concept-drift, and is widely used as a testbed by concept drift detection algorithms. The Rotating Hyperplane dataset was created by [15]. The dataset and the specific pairs of each concept are available at http://www.win.tue.nl/~mpechen/data/DriftSets/. The USENET1 and USENET2 datasets, used in [16], are available at http://mlkd.csd.auth.gr/concept_drift.html. They are stream collections of messages from different newsgroups (e.g. medicine, space, baseball) to a user. The difference between USENET1 and USENET2 is the magnitude of drift. The user in USENET1 has a sharper topic shift than the one in USENET2.

Fig. 8: Overlapping histograms comparing detection timestamps on SEA.

Fig. 9: Overlapping histograms comparing detection timestamps on HYPERPLANE.

Fig. 10: Overlapping histograms comparing detection timestamps on USENET1.

All the above datasets are in the form of $(X_t, y_t)$ pairs and their key features are summarized in Table I. Other details, such as the imbalance status and type of drift of each dataset, are available through the above links.

Dataset      LFR   NFR   DDM   DDM-OCI
Balance1      38    12     4        12
Balance2      16     3     0        11
Balance3      25     4     0         3
Imbalance1    95    59     0         4
Imbalance2    91    21     0        43
Imbalance3    95    38    36        39
SEA          142    29    17        26
HYPRPLN      671   598   345       149
USENET1      207    47   108        66
USENET2        3    17     3        21
TABLE II: Number of correct detections at the true drift point for the simulated datasets, and the sum of such counts over the multiple true drift points for the public datasets.
Dataset      LFR   NFR   DDM   DDM-OCI
Balance1       6    77    36       304
Balance2      13    19    33       339
Balance3      18    54    11       219
Imbalance1    18    81    16       259
Imbalance2    10    91    23       165
Imbalance3     9    86    55       204
SEA           72    32    54       658
HYPRPLN       84    56    73       826
USENET1       12    50    43       322
USENET2       43    80    65       272
TABLE III: Number of false detections for the simulated datasets, and the sum of such counts over the multiple false detection periods for the public datasets.

IV-B2 Evaluation

In the SEA Concepts dataset experiment, Fig. 8 shows that LFR dominates the other three approaches, with earlier detections and fewer false or delayed detections.

Fig. 9 shows that LFR has a dominant performance in the Rotating Hyperplane dataset experiment. At the second true drift point, the underlying concept change is very minor. Hence the drift is missed by all detection algorithms.

In the USENET1 dataset experiment, Fig. 10 indicates that LFR dominates the other approaches and raises an alarm at every drift point. Similarly, in the USENET2 dataset experiment, LFR also outperforms the other approaches, but its detections are delayed by a longer time lag. The reduced advantage of LFR from USENET1 to USENET2 is due to the smaller magnitude of the concept drifts.

IV-C Summary statistics

Parameters Detect Sig. Warn Sig. Decay
LFR
NFR
DDM
DDM-OCI
TABLE IV: Parameter settings used in §IV-A experiments
Para. SEA HYPRPLN. USENET1&2
LFR
NFR
DDM
DDM-OCI
TABLE V: Parameter settings used in §IV-B experiments

In general, the best algorithm will have the minimal number of false alarms and the maximal number of early detections, whereas poor algorithms give a large number of false alarms and missing or severely delayed true detections. A summary of the counts of correct detections at the true drift timestamp and the counts of false detections during the false detection period for the simulated and public datasets is provided in Tables II and III.

False detection period refers to the period preceding the data points that belong to the new concept. For the synthetically generated datasets in §IV-A, there were two concepts spanning the data points, so the false detection period is the period before the change-point. For the datasets specified in §IV-B, if there were more than two concepts, the false detection period corresponds to the range from the concept midway up to the next true drift point. Each bin in the histograms corresponds to a fixed number of time steps in §IV-A and is dataset-dependent in §IV-B. Since it has been observed in [17, 18] that false alarms may have a smaller influence on predictive performance than late drift detections, the true detection period in our experiments refers to a short window of bins immediately after the change-point in §IV-A and immediately after each true drift point in §IV-B. Other parameter settings of the detection algorithms are summarized in Tables IV and V. They are specifically selected to show the dominating performance of LFR, i.e., the smallest allowable type I error but the largest statistical power, over the benchmark algorithms.

As summarized in Table II, LFR fared best in terms of recall of true change point detection on nine of the ten datasets, USENET2 being the exception. Equally importantly, LFR had high precision with regard to detecting change points: it produced the fewest false and delayed detections on six of the ten datasets and far fewer than DDM-OCI on all ten (Table III).

V Conclusion

This paper presents a concept drift detection framework (LFR) that detects the occurrence of a concept drift and identifies the data points that belong to the new concept. The versatility of LFR allows it to work with both batch and stream datasets as well as imbalanced datasets, and, unlike other popular concept drift detection approaches, it uses user-specified parameters that are intuitively comprehensible. LFR significantly outperforms existing benchmark approaches in terms of early detection of concept drifts, high detection rate and low false alarm rate across the types of concept drifts.

References

  • [1] J. Gama, P. Medas, G. Castillo, and P. Rodrigues, “Learning with drift detection,” in Advances in Artificial Intelligence–SBIA 2004.   Springer, 2004, pp. 286–295.
  • [2] S. Wang, L. L. Minku, D. Ghezzi, D. Caltabiano, P. Tino, and X. Yao, “Concept drift detection for online class imbalance learning,” in Neural Networks (IJCNN), The 2013 International Joint Conference on.   IEEE, 2013, pp. 1–10.
  • [3] M. Baena-García, J. del Campo-Ávila, R. Fidalgo, A. Bifet, R. Gavaldà, and R. Morales-Bueno, “Early drift detection method,” 2006.
  • [4] D. K. Antwi, H. L. Viktor, and N. Japkowicz, “The PerfSim algorithm for concept drift detection in imbalanced data,” in Data Mining Workshops (ICDMW), 2012 IEEE 12th International Conference on.   IEEE, 2012, pp. 619–628.
  • [5] R. Klinkenberg and T. Joachims, “Detecting concept drift with support vector machines,” in Proceedings of the Seventeenth International Conference on Machine Learning.   Morgan Kaufmann Publishers Inc., 2000, pp. 487–494.
  • [6] D. S. Matteson and N. A. James, “A nonparametric approach for multiple change point analysis of multivariate data,” Journal of the American Statistical Association, vol. 109, no. 505, pp. 334–345, 2014.
  • [7] X. Song, M. Wu, C. Jermaine, and S. Ranka, “Statistical change detection for multi-dimensional data,” in Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining.   ACM, 2007, pp. 667–676.
  • [8] A. Dries and U. Rückert, “Adaptive concept drift detection,” Statistical Analysis and Data Mining, vol. 2, no. 5-6, pp. 311–327, 2009.
  • [9] L. A. Aroian and H. Levene, “The effectiveness of quality control charts,” Journal of the American Statistical Association, vol. 45, no. 252, pp. 520–529, 1950.
  • [10] M. Basseville, I. V. Nikiforov et al., Detection of abrupt changes: theory and application, vol. 104.
  • [11] S. Wang, L. L. Minku, and X. Yao, “A learning framework for online class imbalance learning,” in Computational Intelligence and Ensemble Learning (CIEL), 2013 IEEE Symposium on.   IEEE, 2013, pp. 36–45.
  • [12] D. Bhati, P. Kgosi, and R. N. Rattihalli, “Distribution of geometrically weighted sum of Bernoulli random variables,” Applied Mathematics, vol. 2, p. 1382, 2011.
  • [13] D. Meyer and F. T. Wien, “Support vector machines,” 2014.
  • [14] W. N. Street and Y. Kim, “A streaming ensemble algorithm (SEA) for large-scale classification,” in Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining.   ACM, 2001, pp. 377–382.
  • [15] W. Fan, “Systematic data selection to mine concept-drifting data streams,” in Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining.   ACM, 2004, pp. 128–137.
  • [16] I. Katakis, G. Tsoumakas, and I. Vlahavas, “An ensemble of classifiers for coping with recurring contexts in data streams,” in Proceedings of the 2008 conference on ECAI 2008: 18th European Conference on Artificial Intelligence.   IOS Press, 2008, pp. 763–764.
  • [17] C. Alippi, G. Boracchi, and M. Roveri, “Just-in-time classifiers for recurrent concepts.” IEEE transactions on neural networks and learning systems, vol. 24, no. 4, pp. 620–634, 2013.
  • [18] L. L. Minku and X. Yao, “DDD: A new ensemble approach for dealing with concept drift,” Knowledge and Data Engineering, IEEE Transactions on, vol. 24, no. 4, pp. 619–633, 2012.