Governments and companies are generating huge amounts of streaming data and urgently need efficient data analytics and machine learning techniques to support them in making predictions and decisions. However, the rapidly changing environment of new products, new markets and new customer behaviors inevitably gives rise to the problem of concept drift. Concept drift means that the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways. If concept drift occurs, the patterns induced from past data may no longer be relevant to the new data, leading to poor predictions and decision outcomes. The phenomenon of concept drift has been recognized as the root cause of decreased effectiveness in many data-driven information systems such as data-driven early warning systems and data-driven decision support systems. In an ever-changing big data environment, how to provide more reliable data-driven predictions and decision facilities has become a crucial issue.
The concept drift problem exists in many real-world situations. An example can be seen in the changes of behavior in mobile phone usage, as shown in Fig. 1. From the bars in this figure, the time percentage distribution of the mobile phone usage pattern has changed from “Audio Call” to “Camera” and then to “Mobile Internet” over the past two decades.
Recent research in the field of concept drift targets more challenging problems, such as how to accurately detect concept drift in unstructured and noisy datasets [83, 84], how to quantitatively understand concept drift in an explainable way [79, 80], and how to effectively react to drift by adapting related knowledge [75, 99].
Solving these challenges endows prediction and decision-making with adaptability in an uncertain environment. Conventional research related to machine learning has been significantly improved by introducing concept drift techniques in data science and artificial intelligence in general, and in pattern recognition and data stream mining in particular. These new studies enhance the effectiveness of analogical and knowledge reasoning in an ever-changing environment. A new topic has formed during this development: adaptive data-driven prediction/decision systems. In particular, concept drift is a highly prominent and significant issue in the big data era, because uncertainty about data types and data distribution is inherent in big data.
Conventional machine learning has two main components: training/learning and prediction. Research on learning under concept drift presents three new components: drift detection (whether or not drift occurs), drift understanding (when, how, where it occurs) and drift adaptation (reaction to the existence of drift), as shown in Fig. 2. These will be discussed in Sections 3-5.
In the literature, a detailed concept drift survey paper  was published in 2014, but it intentionally left certain sub-problems of concept drift to other publications, such as the details of the data distribution change () as mentioned in its Section 2.1. In 2015, another comprehensive survey paper  was published, which surveys and gives a tutorial on both the established and the state-of-the-art approaches. It provides a hybrid view of concept drift from two primary perspectives, active and passive. Both survey papers are comprehensive and are a good introduction to concept drift research. However, many new publications have become available in the last three years, and a new category of drift detection methods has even arisen, named multiple hypothesis test drift detection. It is therefore necessary to review past research focuses and present the most recent research trends in concept drift, which is one of the main contributions of this survey paper.
Besides these two publications, four related survey papers [52, 75, 99, 107] have also provided valuable insights into how to address concept drift, but their specific research focus is only on data stream learning, rather than on analyzing concept drift adaptation algorithms and understanding concept drift. Specifically, paper  focuses on data reduction for stream learning incorporating concept drift, while  only focuses on investigating the development in learning ensembles for data stream learning in a dynamic environment.  concerns the evolution of data stream clustering, and  focuses on investigating the current and future trends of data stream learning. There is therefore a gap in the current literature that calls for a fuller picture of established and newly emerged research on concept drift; a comprehensive review of the three major aspects of concept drift: concept drift detection, understanding and adaptation, as shown in Fig. 2; and a discussion of the new trends in concept drift research.
The selection of references in this survey paper was performed according to the following steps:
Step 1. Publication database: Science Direct, ACM Digital Library, IEEE Xplore and SpringerLink.
Step 2. Preliminary screening of articles: The first search was based on keywords. The articles were then selected as references if they 1) present new theory, algorithm or methodology in the area of concept drift, or 2) report a concept drift application.
Step 3. Result filtering for articles: The articles selected in Step 2 were divided into three groups: concept drift detection, understanding, and adaptation. The references in each group were filtered again, based on 1) Time: published mainly within the last 10 years, or 2) Impact: published in high quality journals/conferences or having high citations.
Step 4. Dataset selection: To help readers test their research results, this paper lists popular datasets and their characteristics, the dataset providers, and how each dataset can be used.
On completion of this process, 137 research articles, 10 widely used synthetic datasets for evaluating the performance of learning algorithms dealing with concept drift, and 14 publicly available and widely used real-world datasets were listed for discussion.
The main contributions of this paper are:
It perceptively summarizes concept drift research achievements and clusters the research into three categories: concept drift detection, understanding and adaptation, providing a clear framework for concept drift research development (Fig. 2);
It proposes a new component, concept drift understanding, for retrieving information about the status of concept drift in aspects of when, how, and where. This also creates a connection between drift detection and drift adaptation;
It uncovers several very new concept drift techniques, such as active learning under concept drift and fuzzy competence model-based drift detection, and identifies related research involving concept drift;
It systematically examines two sets of concept drift datasets, Synthetic datasets and Real-world datasets, through multiple dimensions: dataset description, availability, suitability for type of drift, and existing applications;
It suggests several emerging research topics and potential research directions in this area.
The remainder of this paper is structured as follows. In Section 2, the definitions of concept drift are given and discussed. Section 3 presents research methods and algorithms in concept drift detection. Section 4 discusses research developments in concept drift understanding. Research results on drift adaptation (concept drift reaction) are reported in Section 5. Section 6 presents evaluation systems and related datasets used to test concept drift algorithms. Section 7 summarizes related research concerning the concept drift problem. Section 8 presents a comprehensive analysis of main findings and future research directions.
2 Problem Description
2.1 Concept drift definition and the sources
Concept drift is a phenomenon in which the statistical properties of a target domain change over time in an arbitrary way . It was first proposed by  who aimed to point out that noise data may turn into non-noise information at different times. These changes might be caused by changes in hidden variables which cannot be measured directly . Formally, concept drift is defined as follows:
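Given a time period $[0, t]$, a set of samples, denoted as $S_{0,t} = \{d_0, \ldots, d_t\}$, where $d_i = (X_i, y_i)$ is one observation (or a data instance), $X_i$ is the feature vector, $y_i$ is the label, and $S_{0,t}$ follows a certain distribution $F_{0,t}(X, y)$. Concept drift occurs at timestamp $t+1$ if $F_{0,t}(X, y) \neq F_{t+1,\infty}(X, y)$, denoted as $\exists t: P_t(X, y) \neq P_{t+1}(X, y)$ [51, 82, 83, 139].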
’s work, the authors proposed that concept drift or shift is only one subcategory of dataset shift, and that dataset shift consists of covariate shift, prior probability shift and concept shift. These definitions clearly state the research scope of each research topic. However, concept drift is usually associated with covariate shift and prior probability shift, and an increasing number of publications [51, 82, 83, 139] use the term “concept drift” for the problem in which $\exists t: P_t(X, y) \neq P_{t+1}(X, y)$. Therefore, we apply the same definition of concept drift in this survey. Accordingly, concept drift at time $t$ can be defined as the change in the joint probability of $X$ and $y$ at time $t$. Since the joint probability $P_t(X, y)$ can be decomposed into two parts as $P_t(X, y) = P_t(X) \times P_t(y|X)$, concept drift can be triggered by three sources:
Source I: $P_t(X) \neq P_{t+1}(X)$ while $P_t(y|X) = P_{t+1}(y|X)$, that is, the research focus is the drift in $P_t(X)$ while $P_t(y|X)$ remains unchanged. Since $P_t(X)$ drift does not affect the decision boundary, it has also been considered as virtual drift , Fig. 3(a).
Source II: $P_t(y|X) \neq P_{t+1}(y|X)$ while $P_t(X) = P_{t+1}(X)$, that is, $P_t(X)$ remains unchanged. This drift will cause the decision boundary to change and lead to a decrease in learning accuracy; it is also called actual drift, Fig. 3(b).
Source III: a mixture of Source I and Source II, namely $P_t(X) \neq P_{t+1}(X)$ and $P_t(y|X) \neq P_{t+1}(y|X)$. Concept drift research focuses on the drift of both $P_t(y|X)$ and $P_t(X)$, since both changes convey important information about the learning environment, Fig. 3(c).
Fig. 3 demonstrates how these sources differ from each other in a two-dimensional feature space. Source I is feature space drift, and Source II is decision boundary drift. In many real-world applications, Source I and Source II occur together, which creates Source III.
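The three sources can be illustrated with a minimal sketch that checks which factor of the joint probability has moved between two time steps. The dictionary representation of discrete distributions, the tolerance `tol`, and the function names are assumptions made purely for illustration; they are not taken from the surveyed literature.

```python
from typing import Dict, Hashable

Dist = Dict[Hashable, float]


def differs(p: Dist, q: Dist, tol: float = 1e-9) -> bool:
    """Return True if two discrete distributions differ on any outcome."""
    keys = set(p) | set(q)
    return any(abs(p.get(k, 0.0) - q.get(k, 0.0)) > tol for k in keys)


def drift_source(px_old: Dist, py_x_old: Dict[Hashable, Dist],
                 px_new: Dist, py_x_new: Dict[Hashable, Dist]) -> str:
    """Classify the change from time t to t+1 by the factor of
    P(X, y) = P(X) * P(y|X) that has changed."""
    x_changed = differs(px_old, px_new)
    xs = set(py_x_old) | set(py_x_new)
    y_changed = any(
        differs(py_x_old.get(x, {}), py_x_new.get(x, {})) for x in xs
    )
    if x_changed and y_changed:
        return "Source III"
    if x_changed:
        return "Source I (virtual drift)"
    if y_changed:
        return "Source II (actual drift)"
    return "no drift"
```

In practice the true distributions are unknown and must be estimated from samples, which is why the detection methods in Section 3 rely on hypothesis tests rather than exact comparisons.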
2.2 The types of concept drift
Research into concept drift adaptation in Types 1-3 focuses on how to minimize the drop in accuracy and achieve the fastest recovery rate during the concept transformation process. In contrast, the study of Type 4 drift emphasizes the use of historical concepts, that is, how to find the best matched historical concepts in the shortest time. The new concept may suddenly, incrementally, or gradually reoccur.
To better demonstrate the differences between these types, the term “intermediate concept” was introduced by  to describe the transformation between concepts. As mentioned by , a concept drift may not only take place at an exact timestamp, but may also last for a long period. As a result, intermediate concepts may appear during the transformation as one concept (starting concept) changes to another (ending concept). An intermediate concept can be a mixture of the starting concept and the ending concept, as in incremental drift, or one of the starting or ending concepts, as in gradual drift.
3 Concept Drift detection
This section focuses on summarizing concept drift detection algorithms. Section 3.1 introduces a typical drift detection framework. Then, Section 3.2 systematically reviews and categorizes drift detection algorithms according to their implementation details for each component in the framework. Finally, Section 3.3 lists the state-of-the-art drift detection algorithms with comparisons of their implementation details.
3.1 A general framework for drift detection
Drift detection refers to the techniques and mechanisms that characterize and quantify concept drift by identifying change points or change time intervals . A general framework for drift detection contains four stages, as shown in Fig. 5.
Stage 1 (Data Retrieval) aims to retrieve data chunks from data streams. Since a single data instance cannot carry enough information to infer the overall distribution , knowing how to organize data chunks to form a meaningful pattern or knowledge is important in data stream analysis tasks .
Stage 2 (Data Modeling) aims to abstract the retrieved data and extract the key features containing sensitive information, that is, the features of the data that most impact a system if they drift. This stage is optional, because it mainly concerns dimensionality reduction, or sample size reduction, to meet storage and online speed requirements .
Stage 3 (Test Statistics Calculation) is the measurement of dissimilarity, or distance estimation. It quantifies the severity of the drift and forms test statistics for the hypothesis test. It is considered to be the most challenging aspect of concept drift detection. The problem of how to define an accurate and robust dissimilarity measurement is still an open question. A dissimilarity measurement can also be used in clustering evaluation, and to determine the dissimilarity between sample sets .
Stage 4 (Hypothesis Test) uses a specific hypothesis test to evaluate the statistical significance of the change observed in Stage 3, or the p-value. Hypothesis tests are used to determine drift detection accuracy by proving the statistical bounds of the test statistics proposed in Stage 3. Without Stage 4, the test statistics acquired in Stage 3 are meaningless for drift detection, because they cannot determine the drift confidence interval, that is, how likely it is that the change is caused by concept drift and not by noise or random sample selection bias. The most commonly used hypothesis tests are: estimating the distribution of the test statistics [4, 48], bootstrapping [25, 33], the permutation test , and Hoeffding’s inequality-based bound identification .
It is also worth mentioning that, without Stage 1, the concept drift detection problem can be considered as a two-sample test problem which examines whether the populations of two given sample sets are from the same distribution . In other words, any multivariate two-sample test is an option that can be adopted in Stages 2-4 to detect concept drift . However, in some cases, the distribution drift may not be included in the target features; the selection of the target features therefore affects the overall performance of a learning system and is a critical problem in concept drift detection .
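The four stages can be sketched end to end on a one-dimensional stream. This is an illustrative sketch only: the absolute difference of means as the test statistic, the permutation test as the hypothesis test, the window size, and all function names are assumptions chosen for brevity. Stage 2 is skipped because it is optional.

```python
import random
from statistics import mean
from typing import Sequence


def mean_difference(hist: Sequence[float], new: Sequence[float]) -> float:
    """Stage 3: a simple dissimilarity measure (absolute difference of means)."""
    return abs(mean(hist) - mean(new))


def permutation_p_value(hist: Sequence[float], new: Sequence[float],
                        n_perm: int = 500, seed: int = 0) -> float:
    """Stage 4: estimate how often a random relabelling of the pooled
    samples yields a statistic at least as large as the observed one."""
    rng = random.Random(seed)
    observed = mean_difference(hist, new)
    pooled = list(hist) + list(new)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mean_difference(pooled[:len(hist)], pooled[len(hist):]) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


def detect_drift(stream: Sequence[float], window: int = 50,
                 alpha: float = 0.05) -> bool:
    """Stage 1: cut the last 2 * window instances into a historical chunk
    and a new chunk, then run Stages 3 and 4 on them."""
    if len(stream) < 2 * window:
        return False
    hist = stream[-2 * window:-window]
    new = stream[-window:]
    return permutation_p_value(hist, new) < alpha
```

Any other two-sample statistic could replace `mean_difference` without changing the surrounding structure, which is the sense in which the framework separates Stage 3 from Stage 4.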
3.2 Concept drift detection algorithms
This section surveys drift detection methods and algorithms, which are classified into three categories in terms of the test statistics they apply.
3.2.1 Error rate-based drift detection
Learner error rate-based drift detection algorithms form the largest category of algorithms. These algorithms focus on tracking changes in the online error rate of base classifiers. If an increase or decrease of the error rate is proven to be statistically significant, an upgrade process (drift alarm) will be triggered.
One of the most-referenced concept drift detection algorithms is the Drift Detection Method (DDM) . It was the first algorithm to define the warning level and drift level for concept drift detection. In this algorithm, Stage 1 is implemented by a landmark time window, as shown in Fig. 6. When a new data instance becomes available for evaluation, DDM detects whether the overall online error rate within the time window has increased significantly. If the confidence level of the observed error rate change reaches the warning level, DDM starts to build a new learner while using the old learner for predictions. If the change reaches the drift level, the old learner will be replaced by the new learner for further prediction tasks. To acquire the online error rate, DDM needs a classifier to make the predictions. This process converts training data to a learning model, which is considered as Stage 2 (Data Modeling). The test statistic in Stage 3 is the online error rate. The hypothesis test, Stage 4, is conducted by estimating the distribution of the online error rate and calculating the warning level and drift threshold.
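A minimal sketch of the DDM logic over a stream of 0/1 prediction errors is given below. It assumes the commonly described formulation in which the error rate $p_i$ and its standard deviation $s_i = \sqrt{p_i(1-p_i)/i}$ are compared with their recorded minimum using factors of 2 (warning) and 3 (drift). The class name, the `min_instances` guard of 30 and the string return values are illustrative assumptions, and the base learner is left out.

```python
import math


class DDM:
    """Minimal Drift Detection Method sketch over a 0/1 error stream."""

    def __init__(self, min_instances: int = 30,
                 warning_factor: float = 2.0, drift_factor: float = 3.0):
        self.min_instances = min_instances    # assumption: warm-up length
        self.warning_factor = warning_factor
        self.drift_factor = drift_factor
        self.reset()

    def reset(self) -> None:
        self.n = 0
        self.p = 0.0              # running error rate
        self.p_min = math.inf     # error rate at the best point seen so far
        self.s_min = math.inf     # its standard deviation

    def update(self, error: int) -> str:
        """Feed 1 for a misclassification and 0 otherwise; return the state."""
        self.n += 1
        self.p += (error - self.p) / self.n
        s = math.sqrt(self.p * (1.0 - self.p) / self.n)
        if self.n < self.min_instances:
            return "stable"
        if self.p + s <= self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, s
        if self.p + s > self.p_min + self.drift_factor * self.s_min:
            self.reset()          # start monitoring the new concept
            return "drift"
        if self.p + s > self.p_min + self.warning_factor * self.s_min:
            return "warning"
        return "stable"
```

A caller would typically start training a replacement learner on the first "warning" and swap it in on "drift"; that adaptation step is outside this sketch.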
Similar implementations have been adopted and applied in Learning with Local Drift Detection (LLDD) , Early Drift Detection Method (EDDM) , Hoeffding’s inequality based Drift Detection Method (HDDM) , Fuzzy Windowing Drift Detection Method (FW-DDM) , and Dynamic Extreme Learning Machine (DELM) . LLDD modifies Stages 3 and 4, dividing the overall drift detection problem into a set of decision tree node-based drift detection problems; EDDM improves Stage 3 of DDM using the distance between two correct classifications to improve the sensitivity of drift detection; HDDM modifies Stage 4 using Hoeffding’s inequality to identify the critical region of a drift; FW-DDM improves Stage 1 of DDM using a fuzzy time window instead of a conventional time window to address the gradual drift problem; DELM does not change the DDM detection algorithm but uses a novel base learner, a single hidden layer feedforward neural network called Extreme Learning Machine (ELM), to improve the adaptation process after a drift has been confirmed.
EWMA for Concept Drift Detection (ECDD)  takes advantage of the error rate to detect concept drift. ECDD employs the EWMA chart to track changes in the error rate. The implementation of Stages 1-3 of ECDD is the same as for DDM, while Stage 4 is different. ECDD modifies the conventional EWMA chart using a dynamic mean $\hat{p}_{0,t}$ instead of the conventional static mean $p_0$, where $\hat{p}_{0,t}$ is the estimated online error rate within time $[0, t]$, and $p_0$ implies the theoretical error rate when the learner was initially built. Accordingly, the dynamic variance can be calculated by $\sigma^2_{Z_t} = \hat{p}_{0,t}(1 - \hat{p}_{0,t}) \frac{\lambda}{2 - \lambda}\left(1 - (1 - \lambda)^{2t}\right)$, where $\lambda$ controls how much weight is given to more recent data as opposed to older data, and $\lambda = 0.2$ is recommended by the authors. Also, when the test statistic of the conventional EWMA chart is $Z_t > \hat{p}_{0,t} + 0.5 L \sigma_{Z_t}$, ECDD will report a concept drift warning; when $Z_t > \hat{p}_{0,t} + L \sigma_{Z_t}$, ECDD will report a concept drift. The control limit $L$ is given by the authors through experimental evaluation.
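The EWMA-based monitoring can be sketched as follows, using the standard EWMA chart recursion $Z_t = (1-\lambda) Z_{t-1} + \lambda X_t$ and the standard EWMA variance for a Bernoulli error stream. Several choices here are assumptions for illustration rather than the published configuration: the control limit is a fixed constant `control_limit=3.0` (the original work tunes it experimentally), the chart starts at $Z_0 = 0$, a warm-up of 30 instances is enforced, and the warning level is placed at half the control limit.

```python
import math


class ECDD:
    """EWMA chart over a 0/1 error stream with a dynamic mean estimate."""

    def __init__(self, lam: float = 0.2, control_limit: float = 3.0,
                 min_instances: int = 30):
        self.lam = lam
        self.control_limit = control_limit    # assumption: fixed constant L
        self.min_instances = min_instances    # assumption: warm-up length
        self.t = 0
        self.p_hat = 0.0   # running estimate of the error rate
        self.z = 0.0       # EWMA statistic, assumed to start at 0

    def update(self, error: int) -> str:
        self.t += 1
        self.p_hat += (error - self.p_hat) / self.t
        self.z = (1.0 - self.lam) * self.z + self.lam * error
        # Standard EWMA variance with Bernoulli variance p(1 - p).
        var_z = (self.p_hat * (1.0 - self.p_hat)
                 * self.lam / (2.0 - self.lam)
                 * (1.0 - (1.0 - self.lam) ** (2 * self.t)))
        sigma_z = math.sqrt(var_z)
        if self.t < self.min_instances:
            return "stable"
        if self.z > self.p_hat + self.control_limit * sigma_z:
            return "drift"
        if self.z > self.p_hat + 0.5 * self.control_limit * sigma_z:
            return "warning"
        return "stable"
```

The detector is not reset after a drift in this sketch; a caller would normally create a fresh instance once the learner has been replaced.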
In contrast to DDM and other similar algorithms, Statistical Test of Equal Proportions Detection (STEPD)  detects error rate change by comparing the most recent time window with the overall time window; for each timestamp, there are two time windows in the system, as shown in Fig. 7. The size of the new window must be defined by the user. According to , the test statistic $\theta_{STEPD}$ conforms to the standard normal distribution, denoted as $\theta_{STEPD} \sim \mathcal{N}(0, 1)$. The significance levels of the warning level and the drift level were suggested as $\alpha_w = 0.05$ and $\alpha_d = 0.003$ respectively. As a result, the warning threshold and drift threshold can be easily calculated.
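A minimal sketch of a two-window comparison in this style is shown below. It uses the textbook test of equal proportions with continuity correction and a one-sided normal tail probability. The function names, the use of error counts rather than accuracy counts, and the rule that only an increased error rate can raise an alarm are assumptions for illustration.

```python
import math


def equal_proportions_statistic(err_hist: int, n_hist: int,
                                err_new: int, n_new: int) -> float:
    """Test of equal proportions with continuity correction."""
    p_hist = err_hist / n_hist
    p_new = err_new / n_new
    p_pool = (err_hist + err_new) / (n_hist + n_new)
    inv = 1.0 / n_hist + 1.0 / n_new
    denom = math.sqrt(p_pool * (1.0 - p_pool) * inv)
    if denom == 0.0:
        return 0.0    # both windows are all-correct or all-wrong
    return (abs(p_hist - p_new) - 0.5 * inv) / denom


def normal_upper_tail(z: float) -> float:
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def window_state(err_hist: int, n_hist: int, err_new: int, n_new: int,
                 alpha_warning: float = 0.05,
                 alpha_drift: float = 0.003) -> str:
    """Compare the recent window against the older one."""
    if err_new / n_new <= err_hist / n_hist:
        return "stable"    # assumption: only a worse error rate is flagged
    p_value = normal_upper_tail(
        equal_proportions_statistic(err_hist, n_hist, err_new, n_new))
    if p_value < alpha_drift:
        return "drift"
    if p_value < alpha_warning:
        return "warning"
    return "stable"
```

For example, 20 errors in 200 older instances against 15 errors in the 30 most recent ones gives a statistic well above 5, so the sketch reports a drift.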
Another popular two-time window-based drift detection algorithm is ADaptive WINdowing (ADWIN) . Unlike STEPD, ADWIN does not require users to define the size of the compared windows in advance; it only needs to specify the total size $n$ of a “sufficiently large” window $W$. It then examines all possible cuts of $W$ and computes optimal sub-window sizes $n_{hist}$ and $n_{new}$ according to the rate of change between the two sub-windows $w_{hist}$ and $w_{new}$. The test statistic is the difference of the two sample means, $\theta_{ADWIN} = |\hat{\mu}_{hist} - \hat{\mu}_{new}|$. An optimal cut is found when the difference exceeds a threshold with a predefined confidence interval $\delta$. The author proved that both the false positive rate and false negative rate are bounded by $\delta$. It is worth noting that many concept drift adaptation methods/algorithms in the literature are derived from or combined with ADWIN, such as [16, 19, 20, 53]. Since their drift detection methods are implemented with almost the same strategy, we will not discuss them in detail.
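The cut search can be sketched naively as below. This version checks every split point of the window against a Hoeffding-style threshold $\epsilon_{cut} = \sqrt{\frac{1}{2m}\ln\frac{4n}{\delta}}$, with $m$ the harmonic mean term $1/(1/n_{hist} + 1/n_{new})$, and drops the oldest instances while any split is significant. It assumes values in $[0, 1]$, uses `delta=0.002` as an arbitrary default, and ignores the bucket compression that makes the real algorithm efficient; function names are illustrative.

```python
import math
from typing import List, Sequence


def has_significant_cut(window: Sequence[float], delta: float) -> bool:
    """Return True if some split of the window has sub-window means that
    differ by at least the Hoeffding-style threshold."""
    n = len(window)
    total = sum(window)
    head = 0.0
    log_term = math.log(4.0 * n / delta)
    for split in range(1, n):
        head += window[split - 1]
        n_hist, n_new = split, n - split
        m = 1.0 / (1.0 / n_hist + 1.0 / n_new)
        eps_cut = math.sqrt(log_term / (2.0 * m))
        if abs(head / n_hist - (total - head) / n_new) >= eps_cut:
            return True
    return False


def adwin_shrink(window: Sequence[float], delta: float = 0.002) -> List[float]:
    """Drop the oldest instances until no split is significant.
    Naive O(n^2) version, suitable only for small windows."""
    w = list(window)
    while len(w) > 1 and has_significant_cut(w, delta):
        w.pop(0)
    return w
```

A shortened window after calling `adwin_shrink` signals that the older part of the stream no longer matches the recent part.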
3.2.2 Data Distribution-based Drift Detection
The second largest category of drift detection algorithms is data distribution-based drift detection. Algorithms of this category use a distance function/metric to quantify the dissimilarity between the distribution of historical data and the new data. If the dissimilarity is proven to be statistically significant, the system will trigger a learning model update process. These algorithms address concept drift at its root source, which is the distribution drift. Not only can they accurately identify the time of drift, they can also provide location information about the drift. However, these algorithms are usually reported as incurring higher computational cost than the algorithms mentioned in Section 3.2.1 . In addition, these algorithms usually require users to predefine the historical time window and new data window. The commonly used strategy is two sliding windows with the historical time window fixed while sliding the new data window [33, 84, 106], as shown in Fig. 8.
According to the literature, the first formal treatment of change detection in data streams was proposed by . In their study, the authors point out that the most natural notion of distance between distributions is total variation, as defined by $TV(P_1, P_2) = 2 \sup_{E \in \varepsilon} |P_1(E) - P_2(E)|$, or equivalently, when the distributions have the density functions $f_1$ and $f_2$, $dist_{L^1} = \int |f_1(x) - f_2(x)| \, dx$. This provides practical guidance on the design of a distance function for distribution discrepancy analysis. Accordingly,  proposed a family of distances, called Relativized Discrepancy (RD). The authors also present the significance level of the distance according to the number of data instances. The bounds on the probabilities of missed detections and false alarms are theoretically proven, using Chernoff bounds and the Vapnik-Chervonenkis dimension. The authors of  do not propose novel high-dimensional friendly data models for Stage 2 (data modeling); instead, they stress that a suitable model choice is an open question.
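For discrete distributions the two forms of total variation can be checked against each other directly. The sketch below computes the $L^1$ form as a sum of absolute differences and the supremum form by evaluating the event on which the first distribution exceeds the second, which is where the supremum is attained. The dictionary representation and function names are assumptions for illustration.

```python
from typing import Dict, Hashable

Dist = Dict[Hashable, float]


def l1_distance(p1: Dist, p2: Dist) -> float:
    """Sum of |p1(x) - p2(x)| over all outcomes (discrete L1 distance)."""
    keys = set(p1) | set(p2)
    return sum(abs(p1.get(k, 0.0) - p2.get(k, 0.0)) for k in keys)


def total_variation_sup(p1: Dist, p2: Dist) -> float:
    """2 * sup_E |P1(E) - P2(E)|, using the maximising event
    E = {x : p1(x) > p2(x)}."""
    keys = set(p1) | set(p2)
    gap = sum(max(p1.get(k, 0.0) - p2.get(k, 0.0), 0.0) for k in keys)
    return 2.0 * gap
```

With `p1 = {"a": 0.5, "b": 0.5}` and `p2 = {"a": 0.25, "b": 0.25, "c": 0.5}`, both functions return 1.0.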
Another typical density-based drift detection algorithm is the Information-Theoretic Approach (ITA) . The intuitive idea underlying this algorithm is to use kdqTree to partition the historical and new data (multi-dimensional) into a set of bins, denoted as $\mathcal{A}$, and then use Kullback-Leibler divergence to quantify the difference of the density $\theta_{ITA}$ in each bin. The hypothesis test applied by ITA is bootstrapping by merging $W_{hist}$, $W_{new}$ as $W_{all}$ and resampling as $W'_{hist}$, $W'_{new}$ to recompute the $\theta^*_{ITA}$. Once the estimated probability $P(\theta^*_{ITA} \geq \theta_{ITA}) < 1 - \alpha$, concept drift is confirmed, where $\alpha$ is the significance level controlling the sensitivity of drift detection.
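The combination of binning, Kullback-Leibler divergence and bootstrapping can be sketched on one-dimensional data. To keep the example short, equal-width bins are swapped in for the kdqTree partition, so this is not the ITA algorithm itself. The additive smoothing of 0.5, the bin count, the number of bootstrap rounds, the decision rule on the bootstrap p-value, and all names are assumptions for illustration.

```python
import math
import random
from typing import List, Sequence


def bin_counts(sample: Sequence[float], lo: float, width: float,
               n_bins: int) -> List[int]:
    """Histogram with equal-width bins (a stand-in for a kdqTree)."""
    counts = [0] * n_bins
    for v in sample:
        counts[min(int((v - lo) / width), n_bins - 1)] += 1
    return counts


def kl_divergence(counts_p: Sequence[int], counts_q: Sequence[int],
                  smoothing: float = 0.5) -> float:
    """KL divergence between two smoothed bin-frequency estimates."""
    k = len(counts_p)
    total_p = sum(counts_p) + smoothing * k
    total_q = sum(counts_q) + smoothing * k
    kl = 0.0
    for cp, cq in zip(counts_p, counts_q):
        p = (cp + smoothing) / total_p
        q = (cq + smoothing) / total_q
        kl += p * math.log(p / q)
    return kl


def bootstrap_drift(hist: Sequence[float], new: Sequence[float],
                    n_bins: int = 8, n_boot: int = 200,
                    alpha: float = 0.05, seed: int = 0) -> bool:
    """Flag drift when few bootstrap resamples of the pooled data reach
    the observed divergence."""
    pooled = list(hist) + list(new)
    lo, hi = min(pooled), max(pooled)
    width = (hi - lo) / n_bins or 1.0   # guard against constant data

    def divergence(a: Sequence[float], b: Sequence[float]) -> float:
        return kl_divergence(bin_counts(a, lo, width, n_bins),
                             bin_counts(b, lo, width, n_bins))

    observed = divergence(hist, new)
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_boot):
        resampled = [rng.choice(pooled) for _ in pooled]
        if divergence(resampled[:len(hist)], resampled[len(hist):]) >= observed:
            exceed += 1
    return exceed / n_boot < alpha
```

The bootstrap step plays the role of Stage 4: it turns the raw divergence into a statement about how unusual that divergence would be if both windows came from the same distribution.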
Similar distribution-based drift detection methods/algorithms are: Statistical Change Detection for multi-dimensional data (SCD) , Competence Model-based drift detection (CM) , a prototype-based classification model for evolving data streams called SyncStream , PCA-based change detection framework (PCA-CD) , Equal Density Estimation (EDE) , Least Squares Density Difference-based Change Detection Test (LSDD-CDT) , Incremental version of LSDD-CDT (LSDD-INC)  and Local Drift Degree-based Density Synchronized Drift Adaptation (LDD-DSDA) .
3.2.3 Multiple Hypothesis Test Drift Detection
Multiple hypothesis test drift detection algorithms apply similar techniques to those mentioned in the previous two categories. The novelty of these algorithms is that they use multiple hypothesis tests to detect concept drift in different ways. These algorithms can be divided into two groups: 1) parallel multiple hypothesis tests; and 2) hierarchical multiple hypothesis tests.
The idea of parallel multiple hypothesis drift detection algorithms is demonstrated in Fig. 9. According to the literature, Just-In-Time adaptive classifiers (JIT)  is the first algorithm that sets multiple drift detection hypotheses in this way. The core idea of JIT is to extend the CUSUM chart into the Computational Intelligence-based CUSUM test (CI-CUSUM), which detects the mean change in the features of interest to the learning system. The authors of  gave the following four configurations for the drift detection target. Config1: the features extracted by Principal Component Analysis (PCA), which removes eigenvalues whose sum is below a threshold, e.g. 0.001. Config2: PCA extracted features plus one generic component of the original features. Config3: detects the drift in each feature individually. Config4: detects drift in all possible combinations of the feature space. The authors stated that Config2 is the preferred setting for most situations, according to their experimentation, and also mentioned that Config1 may have a high missed detection rate, Config3 suffers from a high false alarm rate, and Config4 has exponential computational complexity. The same drift detection strategy has also been applied in [5, 6, 7, 8] for concept drift adaptation.
Similar implementations have been applied in Linear Four Rate drift detection (LFR) , which maintains and tracks the changes in True Positive rate (TP), True Negative rate (TN), False Positive rate (FP) and False Negative rate (FN) in an online manner. The drift detection process also includes warning and drift levels.
Another parallel multiple hypothesis drift detection algorithm is three-layer drift detection, based on Information Value and Jaccard similarity (IV-Jac) . IV-Jac aims to individually address label drift (Layer I), feature space drift (Layer II), and decision boundary drift (Layer III). It extracts the Weight of Evidence (WoE) and Information Value (IV) from the available data and then detects whether a significant change exists between the WoE and IV extracted from the historical data and from the new data, by measuring the contribution of a feature value to the label. The hypothesis test thresholds are predefined parameters by default, which are chosen empirically.
Ensemble of Detectors (e-Detector)  was proposed to detect concept drift via an ensemble of heterogeneous drift detectors. The authors consider two drift detectors to be homogeneous if they are equivalent in finding concept drifts; otherwise they are heterogeneous. e-Detector groups homogeneous drift detectors via a diversity measurement, named the diversity vector. For each group, it selects the one with the smallest coefficient of failure as the base detector to form the ensemble. e-Detector reports concept drift following the early-find-early-report rule, which means that no matter which base detector detects a drift, the e-Detector reports a drift. A similar strategy has been applied in drift detection ensemble (DDE) .
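The early-find-early-report rule amounts to a logical OR over the base detectors. The sketch below shows only that rule; the grouping by diversity vector and the selection by coefficient of failure are not modelled. `ThresholdDetector` is a hypothetical stand-in for any base detector exposing an `update` method, and both class names are assumptions.

```python
from typing import List


class ThresholdDetector:
    """Hypothetical base detector: flags drift when a monitored value
    exceeds a fixed threshold."""

    def __init__(self, threshold: float):
        self.threshold = threshold

    def update(self, value: float) -> bool:
        return value > self.threshold


class EarlyReportEnsemble:
    """Report drift as soon as any base detector reports it."""

    def __init__(self, detectors: List[ThresholdDetector]):
        self.detectors = detectors

    def update(self, value: float) -> bool:
        # Feed every detector (no short-circuit) so that stateful
        # detectors all see the full stream.
        flags = [d.update(value) for d in self.detectors]
        return any(flags)
```

Because a single alarm is enough, the ensemble is at least as sensitive as its most sensitive member, at the price of inheriting the false alarms of every member.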
Hierarchical drift detection is an emerging drift detection category that has a multiple verification schema. The algorithms in this category usually detect drift using an existing method, called the detection layer, and then apply an extra hypothesis test, called the validation layer, to obtain a second validation of the detected drift in a hierarchical way. The overall workflow is shown in Fig. 10.
According to the claim made by , Hierarchical Change-Detection Tests (HCDTs) is the first attempt to address concept drift using a hierarchical architecture. The detection layer can be any existing drift detection method that has a low drift delay rate and low computational burden. The validation layer will be activated and deactivated based on the results returned by the detection layer. The authors recommend two strategies for designing the validation layer: 1) estimating the distribution of the test statistics by maximizing the likelihood; 2) adapting an existing hypothesis test, such as the Kolmogorov-Smirnov test or the Cramér-von Mises test.
Hierarchical Linear Four Rate (HLFR)  is another recently proposed hierarchical drift detection algorithm. It applies the drift detection algorithm LFR as the detection layer. Once a drift is confirmed by the detection layer, the validation layer will be triggered. The validation layer of HLFR is simply a zero-one loss, denoted as , over the ordered train-test split. If the estimated zero-one loss exceeds a predefined threshold, , the validation layer will confirm the drift and report to the learning system to trigger a model update process.
Two-Stage Multivariate Shift-Detection based on EWMA (TSMSD-EWMA)  has a very similar implementation; however, the authors do not claim that their method is a hierarchy-based algorithm.
Hierarchical Hypothesis Testing with Classification Uncertainty (HHT-CU) and Hierarchical Hypothesis Testing with Attribute-wise “Goodness-of-fit” (HHT-AG) are two drift detection algorithms based on a request-and-reverify strategy . For HHT-CU, the detection layer is a hypothesis test based on Hoeffding’s inequality that monitors the change in the classification uncertainty measurement. The validation layer is a permutation test that evaluates the change in the zero-one loss of the learner. For HHT-AG, the detection layer is based on a Kolmogorov-Smirnov (KS) test for each feature distribution. HHT-AG then validates the potential drift points by requesting the true labels of data that come from , and performing an independent two-dimensional (2D) KS test on each feature-label bivariate distribution. Compared with other drift detection algorithms, HHT-AG can handle concept drift with fewer true labels, which makes it more powerful when dealing with high verification latency.
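An attribute-wise KS detection layer of this kind can be sketched as follows. The two-sample KS statistic and the asymptotic critical value $c(\alpha)\sqrt{(n+m)/(nm)}$ with $c(\alpha) = \sqrt{-\ln(\alpha/2)/2}$ are standard; the Bonferroni split of $\alpha$ across features, the row-tuple data layout and the function names are assumptions for illustration, and the label-based validation layer is not included.

```python
import math
from typing import List, Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample KS statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_critical_value(n: int, m: int, alpha: float) -> float:
    """Asymptotic two-sided critical value for the KS statistic."""
    return math.sqrt(-0.5 * math.log(alpha / 2.0)) * math.sqrt((n + m) / (n * m))


def drifting_features(hist_rows: Sequence[Sequence[float]],
                      new_rows: Sequence[Sequence[float]],
                      alpha: float = 0.05) -> List[int]:
    """Return indices of features whose marginal distribution changed."""
    n_features = len(hist_rows[0])
    # Assumption: Bonferroni correction across features.
    crit = ks_critical_value(len(hist_rows), len(new_rows), alpha / n_features)
    flagged = []
    for f in range(n_features):
        d = ks_statistic([r[f] for r in hist_rows], [r[f] for r in new_rows])
        if d > crit:
            flagged.append(f)
    return flagged
```

Because this layer looks only at feature distributions, it needs no labels; labels are requested only when a feature is flagged and the drift has to be validated.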
3.3 Summary of concept drift detection methods/algorithms
TABLE I lists the most popular concept drift detection methods/algorithms against the general framework summarized in Section 3.1 (Fig. 5). A comparative study on eight popular drift detection methods can be found in .
|Category||Algorithms||Stage 1||Stage 2||Stage 3||Stage 4|
|Error rate-based||DDM||Landmark||Learner||Online error rate||Distribution estimation|
|Error rate-based||EDDM||Landmark||Learner||Online error rate||Distribution estimation|
|Error rate-based||FW-DDM||Landmark||Learner||Online error rate||Distribution estimation|
|Error rate-based||DELM||Landmark||Learner||Online error rate||Distribution estimation|
|Error rate-based||STEPD||Predefined historical and new windows||Learner||Error rate difference||Distribution estimation|
|Error rate-based||ADWIN||Auto cut historical and new windows||Learner||Error rate difference||Hoeffding’s Bound|
|Error rate-based||ECDD||Landmark||Learner||Online error rate||EWMA Chart|
|Error rate-based||HDDM||Landmark||Learner||Online error rate||Hoeffding’s Bound|
|Error rate-based||LLDD||Landmark, or sliding historical and new windows||Decision trees||Tree node error rate||Hoeffding’s Bound|
|Data distribution-based||kdqTree (ITA)||Fixed historical window, sliding new window||kdqTree||KL divergence||Bootstrapping|
|Data distribution-based||CM [83, 84]||Fixed historical window, sliding new window||Competence model||Competence distance||Permutation test|
|Data distribution-based||RD||Fixed historical window, sliding new window||KS structure||Relativized Discrepancy||VC-Dimension|
|Data distribution-based||SCD||Fixed historical window, sliding new window||Kernel density estimator||Log-likelihood||Distribution estimation|
|Data distribution-based||EDE||Fixed historical window, sliding new window||Nearest neighbor||Density scale||Permutation test|
|Data distribution-based||SyncStream||Fixed historical window, sliding new window||PCA||P-Tree||Wilcoxon test|
|Data distribution-based||PCA-CD||Fixed historical window, sliding new window||PCA||Change-Score||Page-Hinkley test|
|Data distribution-based||LSDD-CDT||Fixed historical window, sliding new window||Learner||Relative difference||Distribution estimation|
|Data distribution-based||LSDD-INC||Fixed historical window, sliding new window||Learner||Relative difference||Distribution estimation|
|Data distribution-based||LDD-DSDA||Fixed historical window, sliding new window||k-nearest neighbor||Local drift degree||Distribution estimation|
|Multiple Hypothesis Tests||JIT||Landmark||Selected features||4 configurations||Distribution estimation|
|Multiple Hypothesis Tests||LFR||Landmark||Learner||TP, TN, FP, FN||Distribution estimation|
|Multiple Hypothesis Tests||Three-layer (IV-Jac)||Sliding both windows||Learner||WoE and IV of each layer||Distribution estimation|
|Multiple Hypothesis Tests||e-Detector||Depends on base detector||Depends||Depends||Depends|
|Multiple Hypothesis Tests||DDE||Depends on base detector||Depends||Depends||Depends|
|Multiple Hypothesis Tests||TSMSD-EWMA||Landmark||Learner||Online error rate||EWMA Chart|
|Multiple Hypothesis Tests||HCDTs||Landmark||Depending on layers||Depending on layers||Depending on layers|
|Multiple Hypothesis Tests||HLFR||Landmark||Learner||TP, TN, FP, FN||Distribution estimation|
|Multiple Hypothesis Tests||HHT-CU||Landmark||Learner||Classification uncertainty||Layer-I Hoeffding’s Bound, Layer-II Permutation Test|
|Multiple Hypothesis Tests||HHT-AG||Fixed historical window, sliding new window||N/A||KS statistic on each attribute||Layer-I KS test, Layer-II 2D KS test|
4 Concept Drift understanding
Drift understanding refers to retrieving concept drift information about “When” (the time at which the concept drift occurs and how long the drift lasts), “How” (the severity/degree of concept drift), and “Where” (the drift regions of concept drift). This status information is the output of the drift detection algorithms, and is used as input for drift adaptation.
4.1 The time at which concept drift occurs (When)
The most basic function of drift detection is to identify the timestamp when a drift occurs. Recalling the definition of concept drift , the variable represents the time at which a concept drift occurs. In drift detection methods/algorithms, an alarm signal is used to indicate whether the concept drift has or has not occurred or not at the current timestamp. It is also a signal for a learning system to adapt to a new concept. Accurately identifying the time a drift occurs is critical to the adaptation process of a learning system; a delay or a false alarm will lead to failure of the learning system to track new concepts.
A drift alarm usually has a statistical guarantee with a predefined false alarm rate. Error rate-based drift detection algorithms monitor the performance of the learning system, based on statistical process control. For example, DDM sends a drift signal when the learning accuracy of the learner drops below a predefined threshold, which is chosen by the three-sigma rule. ECCD reports a drift when the online error rate exceeds the control limit of EWMA. Most data distribution-based drift detection algorithms report a drift alarm when two data samples have a statistically significant difference. PCA-based drift detection outputs a drift signal when the $p$-value of the generalized Wilcoxon test statistic is significantly large. A competence-based method confirms that a drift has occurred by verifying, through a permutation test, whether the empirical competence-based distance is significantly large.
Taking into account the various drift types, concept drift understanding needs to explore the start time point, the change period, and the end time point of concept drift. This timing information could be a useful input for the adaptation process of the learning system. However, the drift timestamp reported by existing drift detection algorithms is delayed compared to the actual drift timestamp, since most drift detectors require a minimum number of new data instances to evaluate the status of the drift, as shown in Fig. 11. The emergence time of the new concept is therefore still vague. Some concept drift detection algorithms, such as DDM, EDDM, STEPD, and HDDM, trigger a warning level to indicate that a drift may have occurred. The threshold used to trigger the warning level is a relaxed version of the threshold used for the drift level; for example, the warning level is set at a confidence level of 95%, or $2\sigma$, and the drift level at a confidence level of 99%, or $3\sigma$. The data accumulated between the warning level and the drift level are used as the training set for updating a learning model.
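The two-level scheme described above can be illustrated with a minimal sketch of a DDM-style detector. It assumes the commonly used formulation in which the running error rate $p_i$ and its standard deviation $s_i = \sqrt{p_i(1 - p_i)/i}$ are compared against their recorded minimum; the class name `DDMSketch`, the 30-instance warm-up, and the strict inequalities are illustrative choices rather than details taken from this survey.

```python
import math


class DDMSketch:
    """Minimal DDM-style detector with a warning level and a drift level."""

    def __init__(self, warn_sigmas=2.0, drift_sigmas=3.0, min_samples=30):
        self.warn_sigmas = warn_sigmas      # relaxed threshold (warning level)
        self.drift_sigmas = drift_sigmas    # strict threshold (drift level)
        self.min_samples = min_samples      # assumed warm-up before testing
        self.reset()

    def reset(self):
        self.n = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        """Feed 1 for a misclassification, 0 otherwise; return the state."""
        self.n += 1
        self.errors += int(error)
        p = self.errors / self.n
        s = math.sqrt(p * (1.0 - p) / self.n)
        if self.n < self.min_samples:
            return "stable"
        # Remember the best (lowest) error level seen so far.
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + self.drift_sigmas * self.s_min:
            self.reset()                    # start monitoring the new concept
            return "drift"
        if p + s > self.p_min + self.warn_sigmas * self.s_min:
            return "warning"
        return "stable"


def run_ddm(errors):
    """Run the detector over a 0/1 error sequence and collect the states."""
    detector = DDMSketch()
    return [detector.update(e) for e in errors]
```

In such a scheme, the instances observed while the state is "warning" are the natural candidates for the training set of the replacement model.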
4.2 The severity of concept drift (How)
The severity of concept drift refers to using a quantified value to measure the difference between the new concept and the previous concept, as shown in Fig. 11. Formally, the severity of concept drift can be represented as $\Delta = \delta(P_t(X, y), P_{t+1}(X, y))$, where $\delta$ is a function to measure the discrepancy of two data distributions, and $t$ is the timestamp when the concept drift occurred. $\Delta$ is usually a non-negative value indicating the severity of concept drift. The greater the value of $\Delta$, the larger the severity of the concept drift.
In general, error rate-based drift detection cannot directly measure the severity of concept drift, because it mainly focuses on monitoring the performance of the learning system, not the changes in the concept itself. However, the degree of decrease in learning accuracy can be used as an indirect measurement to indicate the severity of concept drift. If learning accuracy has dropped significantly when drift is observed, this indicates that the new concept is different from the previous one. For example, the severity of concept drift could be reflected by the difference between the two monitored error statistics in [48, 122]; by the difference between overall accuracy and recent accuracy; or by the difference between the test statistics in the former window and those in the later window. However, the meaning of these differences is not discussed in existing publications. The ability of error rate-based drift detection to output the severity of concept drift is therefore still unclear.
Data distribution-based drift detection methods can directly quantify the severity of concept drift, since the measurement used to compare two data samples already reflects the difference. For example, a relaxation of the total variation distance has been employed to measure the difference between two data distributions, and a competence-based empirical distance has been proposed to show the difference between two data samples. Other drift detection methods have used information-theoretic distances; for example, the Kullback-Leibler divergence, also called relative entropy, has been used. The range of these distances is $[0, 1]$. The greater the distance, the larger the severity of the concept drift. A distance of “1” means that the new concept is different from the previous one, while a distance of “0” means that the two data concepts are identical. One test statistic in the literature gives an extremely small negative value if the new concept is quite different from the previous concept. In another method, the degree of concept drift is measured by the resulting $p$-value of the test statistic and a quantity defined on the two datasets being compared.
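As a rough illustration of how a distance between two data samples can serve as a severity score, the sketch below compares an old window and a new window of a single numeric attribute through equal-width histograms. The histogram binning, the smoothing constant, and all function names are assumptions made for this example; it uses the plain total variation distance (not the relaxation mentioned above) and a smoothed Kullback-Leibler divergence.

```python
import math


def histogram(sample, bins, low, high):
    """Relative bin frequencies of a 1-D sample over [low, high]."""
    counts = [0] * bins
    width = (high - low) / bins
    for value in sample:
        index = int((value - low) / width)
        index = min(max(index, 0), bins - 1)  # clamp values on or past the edges
        counts[index] += 1
    return [c / len(sample) for c in counts]


def total_variation(p, q):
    """Total variation distance between two discrete distributions, in [0, 1]."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kl_divergence(p, q, eps=1e-9):
    """KL divergence D(p || q) with additive smoothing to avoid log(0).

    Unlike total variation, the raw value is not bounded above by 1.
    """
    k = len(p)
    ps = [(a + eps) / (1.0 + k * eps) for a in p]
    qs = [(b + eps) / (1.0 + k * eps) for b in q]
    return sum(a * math.log(a / b) for a, b in zip(ps, qs))


def drift_severity(old_window, new_window, bins=10, low=0.0, high=1.0):
    """Return (total variation, KL divergence) between the two windows."""
    p = histogram(old_window, bins, low, high)
    q = histogram(new_window, bins, low, high)
    return total_variation(p, q), kl_divergence(p, q)
```

Identical windows give a total variation of 0, and windows with no shared bins give 1, which matches the interpretation of the two extreme values given above.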
The severity of concept drift can be used as a guideline for choosing drift adaptation strategies. For example, if the severity of drift in a classification task is low, the decision boundary may not move much in the new concept. Thus, adjusting the current learner by incremental learning will be adequate. In contrast, if the severity of the concept drift is high, the decision boundary could change significantly, so discarding the old learner and retraining a new one could be better than incrementally updating the old learner. We would like to mention that, even though some studies have advanced the ability to describe and quantify the severity of the detected drift, this information is not yet widely utilized in drift adaptation.
4.3 The drift regions of concept drift (Where)
The drift regions of concept drift are the regions of conflict between a new concept and the previous concept. Drift regions are located by finding regions in the data feature space where $P_t(X, y)$ and $P_{t+1}(X, y)$ differ significantly. To illustrate this, we give an example of a classification task in Fig. 12, in which the data are uniformly sampled over a bounded feature range. The left sub-figure shows the data sample accumulated at time $t$, with its decision boundary. The middle sub-figure shows the data accumulated at time $t+1$, where the decision boundary has moved. Intuitively, data located in the regions between the two boundaries have different classes at time $t$ and time $t+1$, since the decision boundary has changed. The right sub-figure shows the data located in the drift regions.
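The idea of a drift region can be made concrete with a small sketch. The feature range $[0, 10]^2$ and the two linear boundaries $x_1 + x_2 = 8$ and $x_1 + x_2 = 9$ used here are hypothetical values chosen only for illustration; they are not claimed to be the settings of Fig. 12.

```python
import random


def label(point, threshold):
    """Class 1 if the point lies on or below the line x1 + x2 = threshold."""
    x1, x2 = point
    return 1 if x1 + x2 <= threshold else 0


def drift_region_points(points, old_threshold, new_threshold):
    """Points whose class differs between the old and the new concept."""
    return [p for p in points
            if label(p, old_threshold) != label(p, new_threshold)]


rng = random.Random(0)
points = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(2000)]

# Hypothetical boundaries: x1 + x2 = 8 at time t, x1 + x2 = 9 at time t + 1.
region = drift_region_points(points, 8.0, 9.0)
```

Only the points in the band between the two lines change class, so `region` is a small fraction of the sample, while every other point keeps its label under both concepts.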
The techniques to identify drift regions are highly dependent on the data model used in the drift detection methods/algorithms. One approach detects drift data in local regions of the instance space by using the online error rate inside the inner nodes of a decision tree. The whole data feature space is partitioned by the decision tree. Each leaf of this decision tree corresponds to a hyper-rectangle in the data feature space. All leaf nodes contain a drift detector. When the leaf nodes are alerted to a drift, the corresponding hyper-rectangles indicate the regions of drift in the data feature space. Similarly, the kdq-tree-based method uses the nodes of the kdq-tree with Kulldorff’s spatial scan statistic to identify the regions in which data have changed the most. Once a drift has been reported, Kulldorff’s statistic measures how the two datasets differ only with respect to the region associated with the leaf node of the kdq-tree. The leaf nodes with the greater values of the statistic indicate the drift regions. A competence model-based method highlights the most severe regions of concept drift through top-$p$-competence areas. Utilizing the RelatedSets of the competence model, the data feature space is partitioned by a set of overlapping hyperspheres. The RelatedSet-based empirical distance defines the distance between two datasets on a particular hypersphere. The drift regions are identified by the corresponding hyperspheres with large empirical distance at the top $p$% level. A local drift degree-based method identifies drift regions by monitoring the discrepancy in the regional density of data, which is called the local drift degree. A local region with a corresponding increase or decrease in density will be highlighted as a critical drift region.
Locating concept drift regions benefits drift adaptation. It has been pointed out that even if the concept of the entire dataset drifts, there are regions of the feature space that will remain stable significantly longer than other regions. In an ensemble scenario, the old learning models of stable regions could still be useful for predicting data instances located within stable regions, or data instances located in drift regions could be used to build a more up-to-date current model. It has also been pointed out that identifying drift regions can help in recognizing obsolete data that conflict with current concepts and in distinguishing noise data from novel data. In later research by the same authors, top-$p$-competence areas were used to edit cases in a case-based reasoning system for fast new concept switching. One step in their drift adaptation is to remove conflict instances. To preserve as many instances of a new concept as possible, they only remove obsolete conflict instances which are outside the drift regions. LDD-DSDA utilized drift regions as an instance selection strategy to construct a training set that continually tracked a new concept.
4.4 Summary of drift understanding
We summarize concept drift detection algorithms according to their ability to identify when, how, and where concept drift occurs, as shown in TABLE II. All drift detection algorithms can identify the occurrence time of concept drift (when), and most data distribution-based drift detection algorithms can also measure the severity of concept drift (how) through the test statistics, but only a few algorithms have the ability to locate drift regions (where).
|Error rate-based||DDM |
|Data distribution-based||kdqTree |
|CM [83, 84]|
|Multiple hypothesis tests||JIT |
|Three-layer drift detection |
5 Drift adaptation
This section focuses on strategies for updating existing learning models according to the drift, which is known as drift adaptation or reaction. There are three main groups of drift adaptation methods, namely simple retraining, ensemble retraining and model adjusting, that aim to handle different types of drift.
5.1 Training new models for global drift
Perhaps the most straightforward way of reacting to concept drift is to retrain a new model with the latest data to replace the obsolete model, as shown in Fig. 13. An explicit concept drift detector is required to decide when to retrain the model (see Section 3 on drift detection). A window strategy is often adopted in this method to preserve the most recent data for retraining and/or old data for distribution change test. Paired Learners  follows this strategy and uses two learners: the stable learner and the reactive learner. If the stable learner frequently misclassifies instances that the reactive learner correctly classifies, a new concept is detected and the stable learner will be replaced with the reactive learner. This method is simple to understand and easy to implement, and can be applied at any point in the data stream.
When adopting a window-based strategy, a trade-off must be made in order to decide an appropriate window size. A small window can better reflect the latest data distribution, but a large window provides more data for training a new model. A popular window scheme algorithm that aims to mitigate this problem is ADWIN . Unlike most earlier works, it does not require the user to guess a fixed size of the windows being compared in advance; instead, it examines all possible cuts of the window and computes optimal sub-window sizes according to the rate of change between the two sub-windows. After the optimal window cut has been found, the window containing old data is dropped and a new model can be trained with the latest window data.
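The cut-checking idea can be sketched as follows. This is a deliberately naive version that rescans every possible cut of the window on each update; it assumes the Hoeffding-style threshold $\epsilon_{cut} = \sqrt{\frac{1}{2m}\ln\frac{4n}{\delta}}$ with $m$ the harmonic mean of the two sub-window sizes, bounded inputs in $[0, 1]$, and it omits the compressed bucket structure that makes practical implementations efficient. The class name `AdwinSketch` is hypothetical.

```python
import math


class AdwinSketch:
    """Naive adaptive window: drop old data while some cut looks significant."""

    def __init__(self, delta=0.002):
        self.delta = delta      # confidence parameter (assumed default)
        self.window = []        # values are expected to lie in [0, 1]

    def _has_significant_cut(self):
        n = len(self.window)
        total = sum(self.window)
        head = 0.0
        for cut in range(1, n):
            head += self.window[cut - 1]
            n0, n1 = cut, n - cut
            m = 1.0 / (1.0 / n0 + 1.0 / n1)
            eps = math.sqrt(math.log(4.0 * n / self.delta) / (2.0 * m))
            # Compare the mean of the older part with the mean of the newer part.
            if abs(head / n0 - (total - head) / n1) > eps:
                return True
        return False

    def add(self, value):
        """Append a value; return True if old data had to be dropped."""
        self.window.append(value)
        changed = False
        while len(self.window) > 1 and self._has_significant_cut():
            self.window.pop(0)  # discard the oldest observation
            changed = True
        return changed
```

After a change is reported, the remaining window holds mostly recent data and can be used to train the replacement model.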
Instead of directly retraining the model, researchers have also attempted to integrate the drift detection process with the retraining process for specific machine learning algorithms. DELM extends the traditional ELM algorithm with the ability to handle concept drift by adaptively adjusting the number of hidden layer nodes. When the classification error rate increases, which could indicate the emergence of a concept drift, more nodes are added to the network layers to improve its approximation capability. Similarly, FP-ELM is an ELM-extended method that adapts to drift by introducing a forgetting parameter to the ELM model. A parallel version of the ELM-based method has also been developed for high-speed classification tasks under concept drift. OS-ELM is another online learning ensemble of regressor models that integrates ELM using an ordered aggregation (OA) technique, which overcomes the problem of defining the optimal ensemble size.
Instance-based lazy learners for handling concept drift have also been studied intensively. The Just-in-Time adaptive classifier [4, 5] is such a method, which follows the “detect and update model” strategy. For drift detection, it extends the traditional CUSUM test to a pdf-free form. This detection method is then integrated with a kNN classifier. When a concept drift is detected, old instances (those preceding the most recent samples) are removed from the case base. In later work, the authors of [8, 107] extended this algorithm to handle recurrent concepts by computing and comparing the current concept to previously stored concepts. NEFCS is another kNN-based adaptive model. A competence model-based drift detection algorithm was used to locate drift instances in the case base and distinguish them from noise instances. A redundancy removal algorithm, Stepwise Redundancy Removal (SRR), was developed to remove redundant instances in a uniform way, guaranteeing that the reduced case base would still preserve enough information for future drift detection.
5.2 Model ensemble for recurring drift
In the case of recurring concept drift, preserving and reusing old models can save the significant effort of retraining a new model for recurring concepts. This is the core idea of using ensemble methods to handle concept drift. Ensemble methods have received much attention in the stream data mining research community in recent years. Ensemble methods comprise a set of base classifiers that may have different types or different parameters. The output of each base classifier is combined using certain voting rules to predict the newly arrived data. Many adaptive ensemble methods have been developed that aim to handle concept drift by extending classical ensemble methods or by creating specific adaptive voting rules. An example is shown in Fig. 14, where a new base classifier is added to the ensemble when concept drift occurs.
Bagging, Boosting and Random Forests are classical ensemble methods used to improve the performance of single classifiers. They have all been extended for handling streaming data with concept drift. An online version of the bagging method was first proposed that uses each instance only once to simulate batch-mode bagging. In a later study, this method was combined with the ADWIN drift detection algorithm to handle concept drift. When a concept drift is reported, the newly proposed method, called Leveraging Bagging, trains a new classifier on the latest data to replace the existing classifier with the worst performance. Similarly, an adaptive boosting method was developed which handles concept drift by monitoring prediction accuracy using a hypothesis test, assuming that classification errors on non-drifting data should follow a Gaussian distribution. In a recent work, the Adaptive Random Forest (ARF) algorithm was proposed, which extends the random forest algorithm with a concept drift detection method, such as ADWIN, to decide when to replace an obsolete tree with a new one. A similar work uses the Hoeffding bound to distinguish concept drift from noise within decision trees.
Besides extending classical methods, many new ensemble methods have been developed to handle concept drift using novel voting techniques. Dynamic Weighted Majority (DWM) is such an ensemble method that is capable of adapting to drifts with a simple set of weighted voting rules. It manages base classifiers according to the performance of both the individual classifiers and the global ensemble. If the ensemble misclassifies an instance, DWM will train a new base classifier and add it to the ensemble. If a base classifier misclassifies an instance, DWM reduces its weight by a factor. When the weight of a base classifier drops below a user-defined threshold, DWM removes it from the ensemble. The drawback of this method is that the process of adding classifiers may be triggered too frequently, introducing performance issues on some occasions, such as when gradual drift occurs. A well-known ensemble method, Learn++NSE, mitigates this issue by weighting base classifiers according to their prediction error rate on the latest batch of data. If the error rate of the youngest classifier exceeds 50%, a new classifier will be trained based on the latest data. This method has several other benefits: it can easily adopt almost any base classifier algorithm; it does not store history data, only the latest batch of data, which it only uses once to train a new classifier; and it can handle sudden drift, gradual drift, and recurrent drift because underperforming classifiers can be reactivated or deactivated as needed by adjusting their weights. Voting strategies other than standard weighted voting have also been applied to handle concept drift. Examples include hierarchical ensemble structure [128, 133], short term and long term memory [82, 123] and dynamic ensemble sizes [94, 129].
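The weighted voting rules described for DWM can be sketched in a few lines. The example below is a simplified illustration that applies the rules on every instance; the toy base learner `CentroidLearner`, the weight factor `beta=0.5`, and the removal threshold `0.01` are assumptions for this sketch rather than values prescribed by the method.

```python
from collections import defaultdict


class CentroidLearner:
    """Toy incremental learner: predicts the class with the closest running mean."""

    def __init__(self):
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def predict(self, x):
        if not self.counts:
            return None  # nothing learned yet
        return min(self.counts,
                   key=lambda c: abs(x - self.sums[c] / self.counts[c]))

    def learn(self, x, y):
        self.sums[y] += x
        self.counts[y] += 1


class DWMSketch:
    """Simplified Dynamic Weighted Majority over 1-D inputs."""

    def __init__(self, beta=0.5, threshold=0.01):
        self.beta = beta
        self.threshold = threshold
        self.experts = [[CentroidLearner(), 1.0]]  # [learner, weight] pairs

    def update(self, x, y):
        """Predict with the ensemble, then apply the DWM rules and train."""
        votes = defaultdict(float)
        for expert in self.experts:
            guess = expert[0].predict(x)
            if guess != y:
                expert[1] *= self.beta          # penalize a wrong base classifier
            votes[guess] += expert[1]
        ensemble_guess = max(votes, key=votes.get)
        # Normalize weights and drop experts that fell below the threshold.
        top = max(weight for _, weight in self.experts)
        self.experts = [[learner, weight / top]
                        for learner, weight in self.experts
                        if weight / top >= self.threshold]
        if ensemble_guess != y:
            self.experts.append([CentroidLearner(), 1.0])  # add a new expert
        for learner, _ in self.experts:
            learner.learn(x, y)
        return ensemble_guess
```

Because a new expert is added on every ensemble error, a long run of errors quickly inflates the ensemble, which is the drawback noted above for gradual drift.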
A number of research efforts have focused on developing ensemble methods for handling concept drift of certain types. Accuracy Update Ensemble (AUE2) was proposed with an emphasis on handling both sudden drift and gradual drift equally well. It is a batch mode weighted voting ensemble method based on incremental base classifiers. By re-weighting, the ensemble is able to react quickly to sudden drift. All classifiers are also incrementally trained with the latest data, which ensures that the ensemble evolves with gradual drift. The Optimal Weights Adjustment (OWA) method achieves the same goal by building ensembles using both weighted instances and weighted classifiers for different concept drift types. A special case of concept drift has also been considered: class evolution, the phenomenon of class emergence and disappearance. Recurring concepts are handled in [47, 54], which monitor concept information to decide when to reactivate previously stored obsolete models. Another method handles recurring concepts by refining the concept pool to avoid redundancy.
5.3 Adjusting existing models for regional drift
An alternative to retraining an entire model is to develop a model that adaptively learns from the changing data. Such models have the ability to partially update themselves when the underlying data distribution changes, as shown in Fig. 15. This approach is arguably more efficient than retraining when the drift only occurs in local regions. Many methods in this category are based on the decision tree algorithm because trees have the ability to examine and adapt to each sub-region separately.
In a foundational work, an online decision tree algorithm called Very Fast Decision Tree classifier (VFDT) was proposed, which is especially tailored for high-speed data streams. It uses the Hoeffding bound to limit the number of instances required for node splitting. This method has become very popular because of its several distinct advantages: 1) it only needs to process each instance once and does not store instances in memory or on disk; 2) the tree itself only consumes a small amount of space and does not grow with the number of instances it processes unless there is new information in the data; 3) the cost of tree maintenance is very low. An extended version, called CVFDT, was later proposed to handle concept drift. In CVFDT, a sliding window is maintained to hold the latest data. An alternative sub-tree is trained based on the window and its performance is monitored. If the alternative sub-tree outperforms its original counterpart, it will be used for future prediction and the original obsolete sub-tree will be pruned. VFDTc is another attempt to improve VFDT with several enhancements: the ability to handle numerical attributes; the application of naive Bayes classifiers in tree leaves; and the ability to detect and adapt to concept drift. Two node-level drift detection methods were proposed based on monitoring differences between a node and its sub-nodes. The first method uses the classification error rate and the second directly checks the distribution difference. When a drift is detected on a node, the node becomes a leaf and its descending sub-tree is removed. Later work [125, 126] further extended VFDTc using an adaptive leaf strategy that chooses the best classifier from three options: majority voting, Naive Bayes and Weighted Naive Bayes.
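To make the role of the Hoeffding bound in node splitting concrete, the sketch below assumes its usual form $\epsilon = \sqrt{R^2 \ln(1/\delta) / (2n)}$, where $R$ is the range of the split measure, $\delta$ the allowed failure probability, and $n$ the number of instances seen at the node. The function names and the example values are illustrative only.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound epsilon for n observations of a variable with range R."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, value_range, delta, n):
    """Split once the best attribute beats the runner-up by more than epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# With 100 instances the gap of 0.15 is not yet convincing; with 1000 it is.
early = should_split(0.30, 0.15, value_range=1.0, delta=1e-7, n=100)
later = should_split(0.30, 0.15, value_range=1.0, delta=1e-7, n=1000)
```

Because $\epsilon$ shrinks as $n$ grows, a node only needs to accumulate enough instances to separate the two best candidate attributes, which is what lets the tree process each instance once.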
Despite the success of VFDT, recent studies [103, 104] have shown that its foundation, the Hoeffding bound, may not be appropriate for the node splitting calculation because the quantity it is applied to, the information gain, is not a sum of independent random variables. A new online decision tree model was developed based on an alternative impurity measure. The paper shows that this measure also reflects concept drift and can be used as a replacement measure in CVFDT. In the same spirit, another decision tree algorithm (IADEM-3) aims to rectify the use of the Hoeffding bound by computing the sum of independent random variables, called relative frequencies. The error rates of sub-trees are monitored to detect drift and are used for tree pruning.
6 Evaluation, Datasets and Benchmarks
Section 6.1 discusses the evaluation systems used for learning algorithms handling concept drift. Section 6.2 introduces synthetic datasets, which are used to simulate specific and controllable types of concept drift. Section 6.3 describes real-world datasets, which are used to test the overall performance in a real-life scenario.
6.1 Evaluation Systems
Evaluation is an important part of developing learning algorithms. Some evaluation methodologies used in learning under concept drift have been mentioned in previous research. We enrich this previous research by reviewing the evaluation systems from three aspects: 1) validation methodology, 2) evaluation metrics, and 3) statistical significance. Each evaluation is followed by its computation equation and an introduction to its usage.
Validation methodology refers to the procedure for a learning algorithm to determine which data instances are used as the training set and which are used as the testing set. There are three procedures peculiar to the evaluation of learning algorithms capable of handling concept drift: 1) holdout, 2) prequential, and 3) controlled permutation. In the scenario of a dataset involving concept drift, holdout should follow the rule: when testing a learning algorithm at time $t$, the holdout set represents exactly the same concept at that time $t$. Unfortunately, it can only be applied to synthetic datasets with predefined concept drift times. Prequential is a popular evaluation scheme used in streaming data. Each data instance is first used to test the learning algorithm, and then to train the learning algorithm. This scheme has the advantage that there is no need to know the drift time of concepts, and it makes maximum use of the available data. The prequential error is computed based on an accumulated sum of a loss function between the prediction and the observed label: $S = \sum_{t=1}^{n} f(\hat{y}_t, y_t)$. There are three prequential error rate estimates: a landmark window (interleaved-test-then-train), a sliding window, and a forgetting mechanism. Controlled permutation runs multiple test datasets in which the data order has been permuted in a controlled way to preserve the local distribution, which means that data instances that were originally close to one another in time need to remain close after a permutation. Controlled permutation reduces the risk that prequential evaluation may produce biased results for the fixed order of data in a sequence.
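The test-then-train loop and the three error estimates can be sketched as below. The sketch assumes a 0/1 loss and, for the forgetting mechanism, a fading-factor recursion of the form $S_i = L_i + \alpha S_{i-1}$, $B_i = 1 + \alpha B_{i-1}$ with estimate $S_i / B_i$; the function signature and the default window size and $\alpha$ are illustrative assumptions.

```python
from collections import deque


def prequential(stream, predict, learn, window=50, alpha=0.99):
    """Interleaved test-then-train evaluation with three error estimates."""
    total_loss, seen = 0.0, 0
    recent = deque(maxlen=window)       # losses inside the sliding window
    faded_loss, faded_count = 0.0, 0.0  # fading-factor accumulators
    for x, y in stream:
        loss = 0.0 if predict(x) == y else 1.0  # test on the instance first
        learn(x, y)                             # then use it for training
        seen += 1
        total_loss += loss
        recent.append(loss)
        faded_loss = loss + alpha * faded_loss
        faded_count = 1.0 + alpha * faded_count
    if seen == 0:
        raise ValueError("the stream is empty")
    return {
        "landmark": total_loss / seen,          # all instances weigh equally
        "sliding": sum(recent) / len(recent),   # only the latest instances
        "fading": faded_loss / faded_count,     # older losses decay smoothly
    }
```

The landmark estimate is slow to reflect a change in performance after a drift, which is why the sliding window and fading estimates are often reported alongside it.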
Evaluation metrics for datasets involving concept drift could be selected from traditional accuracy measures, such as precision/recall in retrieval tasks, mean absolute scaled error in regression, or root mean square error in recommender systems. In addition, the following measures should be examined: 1) RAM-hours for the computation cost of the mining process; 2) the Kappa statistic $\kappa = \frac{p - p_{ran}}{1 - p_{ran}}$ for classification taking into account class imbalance, where $p$ is the accuracy of the classifier under consideration (reference classifier) and $p_{ran}$ is the accuracy of the random classifier; 3) the Kappa-Temporal statistic $\kappa_{per} = \frac{p - p_{per}}{1 - p_{per}}$ for the classification of streaming data with temporal dependence, where $p_{per}$ is the accuracy of the persistent classifier (a classifier that predicts the same label as previously observed); 4) the Combined Kappa statistic $\kappa^{+} = \sqrt{\max(0, \kappa)\max(0, \kappa_{per})}$, which combines $\kappa$ and $\kappa_{per}$ by taking the geometric average; 5) Prequential AUC; and 6) the Averaged Normalized Area Under the Curve (NAUC) values for the Precision-Range curve and the Recall-Range curve, for the classification of streaming data involving concept drift. Apart from evaluating the performance of learning algorithms, the accuracy of the concept drift detection method/algorithm can be assessed according to the following criteria: 1) true detection rate, 2) false detection rate, 3) miss detection rate, and 4) delay of detection.
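A short sketch shows how the three Kappa-style measures could be computed from a sequence of true and predicted labels. It assumes the usual form (accuracy minus baseline accuracy) divided by (1 minus baseline accuracy), estimates the random-classifier accuracy from the label frequencies, and evaluates the persistent classifier on all instances except the first; these choices and the function names are assumptions for illustration.

```python
import math
from collections import Counter


def kappa(p, p_ref):
    """Generic Kappa-style score of accuracy p against a baseline accuracy."""
    if p_ref >= 1.0:
        raise ValueError("the baseline is already perfect")
    return (p - p_ref) / (1.0 - p_ref)


def stream_kappas(y_true, y_pred):
    """Return (Kappa, Kappa-Temporal, Combined Kappa) for a labelled stream."""
    n = len(y_true)
    p = sum(t == q for t, q in zip(y_true, y_pred)) / n
    # Chance agreement of a random classifier with the same label frequencies.
    true_freq, pred_freq = Counter(y_true), Counter(y_pred)
    p_ran = sum(true_freq[c] * pred_freq[c] for c in true_freq) / (n * n)
    # Persistent classifier: predict the previously observed label.
    p_per = sum(a == b for a, b in zip(y_true[1:], y_true[:-1])) / (n - 1)
    k = kappa(p, p_ran)
    k_per = kappa(p, p_per)
    k_comb = math.sqrt(max(0.0, k) * max(0.0, k_per))
    return k, k_per, k_comb
```

A classifier that always predicts one class obtains a Kappa of zero regardless of how frequent that class is, which is the motivation for reporting these measures alongside plain accuracy.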
Statistical significance is used to compare learning algorithms on achieved error rates. The three most frequently used statistical tests for comparing two learning algorithms [22, 66] are: 1) McNemar test: denote the number of data instances misclassified by the first classifier and correctly classified by the second classifier by $a$, and denote by $b$ the number in the opposite case. The McNemar statistic is computed as $M = \mathrm{sign}(a - b) \times (a - b)^2 / (a + b)$ to test whether two classifiers perform equally well. The test follows the $\chi^2$ distribution; 2) Sign test: for $N$ data instances, denote the number of data instances misclassified by the first classifier and correctly classified by the second classifier by $B$ and the number of ties by $N_e$. Conduct a one-sided sign test by computing $p = \sum_{k=B}^{N - N_e} \binom{N - N_e}{k} 0.5^{k} \, 0.5^{N - N_e - k}$. If $p$ is less than a significance level, then the second classifier is better than the first classifier; and 3) Wilcoxon’s sign-rank test: for testing two classifiers on $N$ datasets, let $x_{i,1}$ and $x_{i,2}$ $(i = 1, \ldots, N)$ denote the measurements. The number of ties is $N_e$ and $d_i = x_{i,2} - x_{i,1}$. The test statistic is $W = \sum_{i=1}^{N - N_e} \mathrm{sign}(d_i) R_i$, where $R_i$ is the rank of $|d_i|$ in increasing order. The hypothesis that the two classifiers perform equally is rejected if $|W| > W_{\alpha, N - N_e}$ (two-sided), where $W_{\alpha, N - N_e}$ can be acquired from the statistical table. All three tests are non-parametric. The Nemenyi test is used to compare more than two learning algorithms. It is an appropriate test for comparing all learning algorithms with multiple datasets, based on the average rank of algorithms over all datasets. The Nemenyi test consists of the following: two classifiers are performing differently if the corresponding average ranks differ by at least the critical difference $CD = q_{\alpha} \sqrt{\frac{k(k+1)}{6N}}$, where $k$ is the number of learners, $N$ is the number of datasets, and the critical values $q_{\alpha}$ are based on the Studentized range statistic divided by $\sqrt{2}$. Other tests can be used to compare learning algorithms with a control algorithm.
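The arithmetic behind three of these tests can be sketched with the standard library alone. The sketch assumes the signed McNemar statistic $\mathrm{sign}(a-b)(a-b)^2/(a+b)$ with its magnitude referred to a $\chi^2$ distribution with one degree of freedom, a one-sided binomial sign test, and the Nemenyi critical difference $q_\alpha\sqrt{k(k+1)/(6N)}$. The critical value $q_\alpha$ must be looked up in a table and is passed in as a parameter; the value 2.343 in the example is the commonly tabulated one for three learners at $\alpha = 0.05$ and should be treated as an assumption.

```python
import math


def mcnemar(a, b):
    """Signed McNemar statistic and the chi-square (1 d.o.f.) tail probability."""
    if a + b == 0:
        return 0.0, 1.0
    magnitude = (a - b) ** 2 / (a + b)
    statistic = math.copysign(magnitude, a - b)
    # For one degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(magnitude / 2.0))
    return statistic, p_value


def sign_test_p(wins, n, ties):
    """One-sided sign test: P(at least `wins` successes) under a fair coin."""
    trials = n - ties
    return sum(math.comb(trials, k) for k in range(wins, trials + 1)) / 2 ** trials


def nemenyi_cd(q_alpha, k, n_datasets):
    """Critical difference between average ranks for the Nemenyi test."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))


m_stat, m_p = mcnemar(30, 10)              # 30 vs 10 disagreements
sign_p = sign_test_p(wins=8, n=10, ties=0)
cd = nemenyi_cd(q_alpha=2.343, k=3, n_datasets=10)
```

With 8 wins out of 10 the sign test gives a probability slightly above 0.05, so the same data can be judged significant by one test and not by another; reporting which test was used is therefore important.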
6.2 Synthetic datasets
We list several widely used synthetic datasets for evaluating the performance of learning algorithms dealing with concept drift. Since data instances are generated by predefined rules and specific parameters, a synthetic dataset is a good option for evaluating the performance of learning algorithms in different concept drift scenarios. The dataset provider, the number of instances (#Insts.), the number of attributes (#Attrs.), the number of classes (#Cls.), types of drift (Types), sources of drift (Sources), and used by references, are listed in TABLE III.
|#||Dataset||#Insts.||#Attrs.||#Cls.||Types||Sources||Used by references|
|1||STAGGER||Custom||3||2||Sudden||II||[12, 26, 44, 45, 48, 73, 77, 92, 121, 122, 129]|
|2||SEA||Custom||3||2||Sudden||II||[12, 16, 21, 23, 42, 43, 47, 48, 53, 74, 77, 80, 81, 82, 83, 113, 121, 122, 130]|
|3||Rotating hyperplane [65, 117]||Custom||10||2||Gradual; Incremental||II||[2, 16, 21, 23, 25, 26, 44, 53, 57, 65, 74, 81, 82, 83, 92, 94, 106, 122, 126, 129, 130]|
|4||Random RBF||Custom||Custom||Custom||Sudden; Gradual; Incremental||III||[3, 13, 21, 23, 25, 26, 42, 44, 48, 53, 74, 82, 92, 101, 112, 122, 129, 134]|
|5||Random Tree [17, 39]||Custom||Custom||Custom||Sudden; Reoccurring||II||[23, 44, 53, 102, 103, 104, 122, 125]|
|6||LED||Custom||24||10||Sudden||II||[21, 23, 44, 45, 49, 53, 74, 121, 122, 125]|
|7||Waveform||Custom||40||3||Sudden||II||[2, 40, 44, 49, 74, 122, 125, 126]|
|8||Sine||Custom||2||2||Sudden||II||[13, 25, 48, 60, 101, 129]|
|9||Circle||Custom||2||2||Gradual||III||[13, 25, 26, 43, 48, 60, 92, 129]|
|10||Rotating chessboard||Custom||2||2||Gradual||II||[8, 42, 60, 82, 130]|
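As an example of how such a generator produces controllable drift, the sketch below imitates the SEA dataset listed above: three attributes, two classes, and sudden drift. The attribute range $[0, 10]$, the rule that only the first two attributes determine the class, the threshold sequence (8, 9, 7, 9.5), and the 10% label noise follow the commonly described SEA setup, but they are stated here as assumptions rather than taken from the table.

```python
import random


def sea_stream(n_per_concept, thresholds=(8.0, 9.0, 7.0, 9.5), noise=0.1, seed=0):
    """Yield (attributes, label) pairs with a sudden drift at each threshold change."""
    rng = random.Random(seed)
    for theta in thresholds:
        for _ in range(n_per_concept):
            x = [rng.uniform(0, 10) for _ in range(3)]  # third attribute is irrelevant
            y = 1 if x[0] + x[1] <= theta else 0
            if rng.random() < noise:
                y = 1 - y                               # flip the label as noise
            yield x, y


# 4 concepts of 1000 instances each; drifts occur at positions 1000, 2000, 3000.
data = list(sea_stream(1000))
```

Because the drift positions are known exactly, such a stream can be used to measure detection delay and false alarms, which is not possible with most real-world datasets.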
6.3 Real-world datasets
In this section, we collect several publicly available real-world datasets, including real-world datasets with synthetic drifts. The dataset provider, the number of instances (#Insts.), the number of attributes (#Attrs.), the number of classes (#Cls.), and used by references, are shown in TABLE IV.
Most of these datasets contain temporal concept drift spanning different time scales, e.g., daily (Sensor), seasonal (Electricity) or yearly (Airlines, NOAA weather). Others include geographical (Covertype) or categorical (Poker-Hand) concept drift. Certain datasets, mainly text-based, target specific drift types, such as sudden drift (Email_data), gradual drift (Spam assassin corpus), recurrent drift (Usenet) or novel class (KDDCup’99, ECUE drift dataset 2). These datasets provide realistic benchmarks for evaluating different concept drift handling methods. There are, however, two limitations of real-world datasets: 1) the ground truth of the precise start and end times of drifts is unknown; 2) some real datasets may include mixed drift types. These limitations make it difficult to evaluate methods for understanding the drift, and could introduce bias when comparing different machine learning models.
|#||Dataset||#Insts.||#Attrs.||#Cls.||Used by references|
|1||Airlines||539384||7||2||[23, 53, 74, 79, 80, 118, 137]|
|2||Covertype||581012||54||7||[21, 23, 44, 45, 49, 53, 57, 74, 82, 102, 106, 125, 126, 137]|
|3||Electricity ||45312||8||2||[2, 12, 13, 15, 21, 23, 43, 44, 45, 48, 53, 74, 79, 80, 82, 101, 102, 106, 129, 137]|
|4||Poker-Hand||1025010||10||10||[16, 21, 23, 74, 82]|
|5||NOAA weather||18159||8||2||[2, 42, 79, 82, 83, 112, 128]|
|7||KDDCup’99||494021||41||23||[53, 74, 77, 102, 103, 121, 133, 134, 135]|
|8||Usenet1||1500||99||2||[44, 45, 130]|
|10||Email_data||1500||913||2||[8, 47, 54]|
|11||Spam_data||9324||499||2||[45, 74, 79, 80, 106, 109]|
|12||Spam assassin corpus||9324||39916||2||[44, 47, 53, 79]|
|13||ECUE drift dataset 1||10983||287034||2||[83, 84]|
|14||ECUE drift dataset 2||11905||166047||2||[83, 84]|
7 The Concept Drift Problem in Other Research Areas
We have observed that handling the concept drift problem is not a standalone research subject but has a large number of indirect usage scenarios. In this section, we adopt this new perspective to review recent developments in other research areas that benefit from handling the concept drift problem.
7.1 Class imbalance
Class imbalance is a common problem in stream data mining in addition to concept drift. Research effort has been made to develop effective learning algorithms to tackle both problems at the same time. Two ensemble methods have been presented for learning under concept drift with imbalanced classes. The first method, Learn++.CDS, is extended from Learn++.NSE and combined with the Synthetic Minority class Oversampling Technique (SMOTE). The second algorithm, Learn++.NIE, improves on the previous method by employing a different penalty constraint to prevent prediction accuracy bias and replacing SMOTE with bagging to avoid oversampling. ESOS-ELM is another ensemble method, which uses Online Sequential Extreme Learning Machine (OS-ELM) as a basic classifier to improve performance with class imbalanced data. A concept drift detector is integrated to retrain the classifier when drift occurs. The same author then developed another algorithm, which is able to tackle multi-class imbalanced data with concept drift. Two further learning algorithms, OOB and UOB, build an ensemble model to overcome class imbalance in real time through resampling and time-decayed metrics. Another ensemble method handles concept drift and class imbalance under the additional limitation of scarce true label data.
7.2 Big data mining
Data mining in big data environments faces similar challenges to stream data mining: data is generated at a fast rate (Velocity) and distribution uncertainty always exists in the data, which means that handling concept drift is also crucial in big data applications. Additionally, scalability is an important consideration because in big data environments, a data stream may arrive in very large and potentially unpredictable quantities (Volume) and cannot be processed on a single computer server. An attempt to handle concept drift in a distributed computing environment was made with the Online Map-Reduce Drift Detection Method (OMR-DDM), which uses the combined online error rate of the parallel classification algorithms to identify changes in a big data stream. A recent study proposed another scalable stream data mining algorithm, called Micro-Cluster Nearest Neighbor (MC-NN), based on the nearest neighbor classifier. This method extends the original Micro-Cluster algorithm to adapt to concept drift by monitoring classification error. The micro-cluster algorithm was further extended to a parallel version using the map-reduce technique and applied to the label-drift classification problem, where class labels are not known in advance.
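The idea of detecting drift from a combined online error rate can be sketched as follows. This is not the OMR-DDM implementation; it is a minimal single-process illustration that assumes each parallel classifier reports a list of 0/1 error flags per batch and that a DDM-style test is applied to the pooled counts. The class name `PooledErrorDDM`, the `min_samples` default and the batch interface are hypothetical.

```python
import math


class PooledErrorDDM:
    """DDM-style monitor fed with error flags pooled from several workers."""

    def __init__(self, min_samples=30):
        self.min_samples = min_samples  # hypothetical warm-up length
        self._reset()

    def _reset(self):
        self.n = 0
        self.errors = 0
        self.p_min = math.inf
        self.s_min = math.inf

    def add_batch(self, worker_flags):
        """worker_flags: one list of 0/1 error flags per parallel classifier."""
        for flags in worker_flags:
            self.n += len(flags)
            self.errors += sum(flags)
        if self.n < self.min_samples:
            return "stable"
        p = self.errors / self.n               # pooled error rate
        s = math.sqrt(p * (1.0 - p) / self.n)  # its standard deviation
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s      # remember the best level seen
        # Strict comparisons are used here so that a zero-variance start
        # does not trigger immediately; this is a choice of this sketch.
        if p + s > self.p_min + 3.0 * self.s_min:
            self._reset()                      # start a fresh concept
            return "drift"
        if p + s > self.p_min + 2.0 * self.s_min:
            return "warning"
        return "stable"
```

In a distributed setting, the per-worker counts would be produced by the parallel learners and merged in a reduce step; only two integers per worker need to be communicated, which is what makes this kind of test inexpensive to scale.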
7.3 Active learning and semi-supervised learning
Active learning is based on the assumption that there is a large amount of unlabeled data but only a fraction of it can be labeled by human effort. This is a common situation in stream data applications, which are often also subject to the concept drift problem. A general framework has been presented that combines active learning and concept drift adaptation. It first compares different instance-sampling strategies for labeling to guarantee that the labeling cost stays under budget and that distribution bias is prevented. A drift adaptation mechanism is then adopted, based on the DDM detection method. Another study proposed a new active learning algorithm that primarily aims to avoid bias in the sampling process of choosing instances for labeling. The authors also introduced a memory loss factor to the model, enabling it to adapt to concept drift.
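As an illustration of a budget-constrained instance-sampling strategy, the sketch below requests a label at random with probability equal to the budget and refuses once the labeled fraction has reached the budget. It is an assumed, simplified strategy rather than any specific published one; `BudgetedRandomLabeler`, `budget` and `seed` are hypothetical names. Random selection is used because it does not favor any region of the input space.

```python
import random


class BudgetedRandomLabeler:
    """Ask for labels at random while keeping the labeled fraction in budget."""

    def __init__(self, budget, seed=0):
        self.budget = budget            # target fraction of labeled instances
        self.rng = random.Random(seed)  # seeded for reproducible runs
        self.seen = 0
        self.labeled = 0

    def should_label(self):
        self.seen += 1
        # Only spend when the labeled count is still below the allowance.
        within_budget = self.labeled < self.budget * self.seen
        if within_budget and self.rng.random() < self.budget:
            self.labeled += 1
            return True
        return False
```

A stream learner would call `should_label()` once per incoming instance and query the true label only when it returns `True`; the labeled instances can then feed an error-based drift detector.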
Semi-supervised learning concerns how to use limited true label data more efficiently by leveraging unsupervised techniques. In this scenario, additional design effort is required to handle concept drift. For example, one study applied a Gaussian Mixture model to both labeled and unlabeled data and assigned labels, an approach that is able to adapt to gradual drift. Similarly, [63, 121, 132] are all cluster-based semi-supervised ensemble methods that aim to adapt to drift with limited true label data. The latter are also able to recognize recurring concepts. Another study adopted a new perspective on the true label scarcity problem by treating the true labeled data and unlabeled data as two independent non-stationary data generating processes. Concept drift is handled asynchronously on these two streams. The SAND algorithm [58, 59] is another semi-supervised adaptive method, which detects concept drift on cluster boundaries. There are also studies [90, 91] that focus on adapting to concept drift in circumstances where true label data is completely unavailable.
7.4 Decision rules
Data-driven decision support systems need to be able to adapt to concept drift in order to make accurate decisions, and decision rules are the main technique for this purpose. Very Fast Decision Rules (VFDR) is a decision rule induction algorithm proposed to process stream data effectively. An extended version, Adaptive VFDR, was developed to handle concept drift by dynamically adding and removing decision rules according to their error rate, which is monitored by a drift detector. Instead of inducing rules from decision trees, another decision rule algorithm based on PRISM induces rules directly from data. This algorithm is also able to adapt to drift by monitoring the performance of each rule on a sliding window of the latest data. An adaptive decision-making algorithm based on fuzzy rules has also been developed. The algorithm includes a rule pruning procedure, which removes obsolete rules to adapt to changes, and a rule recall procedure to adapt to recurring concepts.
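The mechanism shared by these methods, monitoring each rule on recent data and discarding rules that have become unreliable, can be sketched as follows. This is a simplified illustration, not VFDR, Adaptive VFDR or the PRISM-based algorithm: a fixed error threshold on a sliding window stands in for a proper drift detector, and the names `MonitoredRule` and `prune_rules` as well as the `window`, `max_error` and `min_evidence` values are hypothetical.

```python
from collections import deque


class MonitoredRule:
    """A rule 'if condition(x) then label' with a sliding window of outcomes."""

    def __init__(self, condition, label, window=50):
        self.condition = condition            # callable: x -> bool
        self.label = label
        self.outcomes = deque(maxlen=window)  # 1 = wrong, 0 = correct

    def covers(self, x):
        return self.condition(x)

    def record(self, y_true):
        self.outcomes.append(int(self.label != y_true))

    def error_rate(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 0.0


def prune_rules(rules, x, y_true, max_error=0.5, min_evidence=10):
    """Update the rules covering x and drop those whose recent error is too high."""
    kept = []
    for rule in rules:
        if rule.covers(x):
            rule.record(y_true)
        if len(rule.outcomes) >= min_evidence and rule.error_rate() > max_error:
            continue  # rule is considered obsolete under the current concept
        kept.append(rule)
    return kept
```

Because each rule is evaluated only on the examples it covers, a drift that affects one region of the input space removes the rules for that region while leaving the others untouched. A full system would also induce replacement rules, which this sketch omits.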
This section by no means attempts to cover every research field in which concept drift handling is used. There are many other studies that also consider concept drift as a dual problem. For example, a dimension reduction algorithm that separates classes based on least squares linear discriminant analysis (LSLDA) has been extended to adapt to drift; the concept drift problem has been considered in time series, for which an online explicit drift detection method was developed by monitoring time series features; and an incremental scaffolding classification algorithm has been developed for complex tasks that also involve concept drift.
8 Conclusions: findings and future directions
We summarize the recent developments of concept drift research, and the following important findings can be extracted:
Error rate-based and data distribution-based drift detection methods still play a dominant role in concept drift detection research, while multiple hypothesis test methods have emerged in recent years;
Regarding concept drift understanding, all drift detection methods can answer “When”, but very few methods have the ability to answer “How” and “Where”;
Adaptive models and ensemble techniques have played an increasingly important role in recent concept drift adaptation developments. In contrast, research on retraining models with explicit drift detection has slowed;
Most existing drift detection and adaptation algorithms assume either that the ground-truth label is available after classification/prediction or that verification latency is extreme. Very little research has been conducted on unsupervised or semi-supervised drift detection and adaptation;
Some computational intelligence techniques, such as fuzzy logic and competence models, have been applied to concept drift;
There is no comprehensive analysis of real-world data streams from the concept drift perspective, covering aspects such as the drift occurrence time, the severity of drift, and the drift regions;
An increasing number of other research areas have recognized the importance of handling concept drift, especially in the big data community.
Based on these findings, we suggest four new directions in future concept drift research:
Drift detection research should not only focus on identifying the drift occurrence time accurately, but should also provide information on drift severity and drift regions. This information could be utilized for better concept drift adaptation.
In real-world scenarios, acquiring true labels can be expensive; unsupervised or semi-supervised drift detection and adaptation therefore remain promising directions for future work.
A framework for selecting real-world data streams should be established for evaluating learning algorithms handling concept drift.
Research on effectively integrating concept drift handling techniques with machine learning methodologies for data-driven applications is highly desired.
We hope this paper can provide researchers with state-of-the-art knowledge on concept drift research developments and provide guidelines about how to apply concept drift techniques in different domains to support users in various prediction and decision activities.
The work presented in this paper was supported by the Australian Research Council (ARC) under discovery grant DP150101645. We sincerely thank Yiliao Song for her help in the preparation of the datasets and applications shown in Section 6.
-  (2003) A framework for clustering evolving data streams. Conference Proceedings In Proc. 29th Int. Conf. Very Large Databases, Vol. 29, pp. 81–92. External Links: Cited by: §7.2.
-  (2017) Modeling recurring concepts in data streams: a graph-based framework. Knowledge and Information Systems. External Links: Cited by: §5.2, TABLE III, TABLE IV.
-  (2017) Hierarchical change-detection tests. IEEE Trans. Neural Networks Learn. Syst. 28 (2), pp. 246–258. External Links: Cited by: §3.2.3, TABLE I, TABLE II, TABLE III.
-  (2008) Just-in-time adaptive classifiers part i: detecting nonstationary changes. IEEE Trans. Neural Networks 19 (7), pp. 1145–1153. External Links: Cited by: §3.1, §3.2.3, TABLE I, TABLE II, §5.1.
-  (2008) Just-in-time adaptive classifiers part ii: designing the classifier. IEEE Trans. Neural Networks 19 (12), pp. 2053–2064. External Links: Cited by: §3.2.3, §5.1.
-  (2011) A just-in-time adaptive classification system based on the intersection of confidence intervals rule. Neural Networks 24 (8), pp. 791–800. External Links: Cited by: §3.2.3.
-  (2012) Just-in-time ensemble of classifiers. Conference Proceedings In Proc. 2012 Int. Joint Conf. Neural Networks, pp. 1–8. External Links: Cited by: §3.2.3.
-  (2013) Just-in-time classifiers for recurrent concepts. IEEE Trans. Neural Networks Learn. Syst. 24 (4), pp. 620–634. External Links: Cited by: §3.2.3, §5.1, TABLE III, TABLE IV.
-  (2009) When training and test sets are different: characterizing learning transfer. Dataset Shift in Machine Learning, pp. 3–28. External Links: Cited by: §2.1.
-  (2012) Parallel concept drift detection with online map-reduce. Conference Proceedings In Proc. 12th Int. Conf. Data Mining Workshops, pp. 402–407. External Links: Cited by: §7.2.
-  (2017) SOM-based partial labeling of imbalanced data stream. Neurocomputing 262, pp. 120–133. External Links: Cited by: §7.1.
-  (2008) Paired learners for concept drift. Conference Proceedings In Proc. 8th Int. Conf. Data Mining, pp. 23–32. External Links: Cited by: §5.1, TABLE III, TABLE IV.
-  (2006) Early drift detection method. Conference Paper In Proc. 4th Int. Workshop Knowledge Discovery from Data Streams, Cited by: §3.2.1, TABLE I, §4.1, TABLE II, TABLE III, TABLE IV.
-  (1993) Detection of abrupt changes: theory and application. Book, Vol. 104, Prentice Hall Englewood Cliffs. Cited by: §3.1.
-  (2007) Learning from time-changing data with adaptive windowing. Conference Proceedings In Proc. 2007 SIAM Int. Conf. Data Mining, Vol. 7, pp. 2007. External Links: Cited by: §3.2.1, TABLE I, TABLE II, §5.1, §5.2, TABLE IV.
-  (2009) Adaptive learning from evolving data streams. Conference Proceedings In Proc. 8th Int. Symp. Intelligent Data Analysis, pp. 249–260. External Links: Cited by: §3.2.1, TABLE III, TABLE IV.
-  (2010) MOA: massive online analysis. Journal of Machine Learning Research 99, pp. 1601–1604. Cited by: §6.3, TABLE III, TABLE IV.
- Fast perceptron decision tree learning from evolving data streams. Book Section In Proc. 14th Pacific-Asia Conf. Knowledge Discovery and Data Mining, M. J. Zaki, J. X. Yu, B. Ravindran, and V. Pudi (Eds.), Berlin, Heidelberg, pp. 299–310. External Links: Cited by: §6.1.
-  (2009) Improving adaptive bagging methods for evolving data streams. Book Section In Proc. 1st Asian Conf. Machine Learning, Z. Zhou and T. Washio (Eds.), Lecture Notes in Computer Science, Berlin, Heidelberg, pp. 23–37. External Links: Cited by: §3.2.1.
-  (2009) New ensemble methods for evolving data streams. Conference Proceedings In Proc. 15th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, pp. 139–148. External Links: Cited by: §3.2.1.
-  (2010) Leveraging bagging for evolving data streams. Conference Proceedings In Proc. 2010 Joint European Conf. Machine Learning and Knowledge Discovery in Databases, pp. 135–150. External Links: Cited by: §5.2, TABLE III, TABLE IV.
-  (2015) Efficient online evaluation of big data stream classifiers. Conference Paper In Proc. 21st ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, Sydney, NSW, Australia, pp. 59–68. External Links: Cited by: §6.1.
-  (2014) Reacting to different types of concept drift: the accuracy updated ensemble algorithm. IEEE Trans. Neural Networks Learn. Syst. 25 (1), pp. 81–94. External Links: Cited by: §5.2, TABLE III, TABLE IV.
-  (2014) Prequential auc for classifier evaluation and drift detection in evolving data streams. Book Section In Proc. 3rd Int. Workshop New Frontiers in Mining Complex Patterns, A. Appice, M. Ceci, C. Loglisci, G. Manco, E. Masciari, and Z. W. Ras (Eds.), Cham, pp. 87–101. External Links: Cited by: §6.1.
-  (2016) A pdf-free change detection test based on density difference estimation. IEEE Trans. Neural Networks Learn. Syst. PP (99), pp. 1–11. External Links: Cited by: §3.1, §3.2.2, TABLE I, TABLE II, TABLE III.
-  (2017) An incremental change detection test based on density difference estimation. IEEE Transactions on Systems, Man, and Cybernetics: Systems PP (99), pp. 1–13. External Links: Cited by: §3.2.2, TABLE I, TABLE II, TABLE III.
-  (2016) FEDD: feature extraction for explicit concept drift detection in time series. Conference Proceedings In Proc. 2016 Int. Joint Conf. Neural Networks, pp. 740–747. External Links: Cited by: §7.4.
-  (1987) PRISM: an algorithm for inducing modular rules. Int. J. Man Mach. Stud. 27 (4), pp. 349–370. External Links: Cited by: §7.4.
-  (2016) An adaptive framework for multistream classification. Conference Paper In Proc. 25th ACM Int. on Conf. Information and Knowledge Management, Indianapolis, Indiana, USA, pp. 1181–1190. External Links: Cited by: §7.3.
-  (2004) Fast and light boosting for adaptive mining of data streams. Book Section In Proc. 8th Pacific-Asia Conf. Knowledge Discovery and Data Mining, H. Dai, R. Srikant, and C. Zhang (Eds.), Berlin, Heidelberg, pp. 282–292. External Links: Cited by: §5.2.
-  (2011) Unbiased online active learning in data streams. Conference Paper In Proc. 17th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, San Diego, California, USA, pp. 195–203. External Links: Cited by: §7.3.
-  (1960) A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20 (1), pp. 37–46. External Links: Cited by: §6.1.
-  (2006) An information-theoretic approach to detecting changes in multi-dimensional data streams. Conference Proceedings In Proc. Symp. the Interface of Statistics, Computing Science, and Applications, pp. 1–24. Cited by: §3.1, §3.2.2, §3.2.2, TABLE I, §4.2, §4.3, TABLE II, §6.1.
-  (2005) A case-based technique for tracking concept drift in spam filtering. Knowledge-Based Systems 18 (4–5), pp. 187–195. External Links: Cited by: §6.3, TABLE IV.
-  (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7 (Jan), pp. 1–30. Cited by: §6.1.
-  (2011) Semi-supervised learning in nonstationary environments. Conference Proceedings In Proc. 2011 Int. Joint Conf. Neural Networks, pp. 2741–2748. External Links: Cited by: §7.3.
-  (2013) Incremental learning of concept drift from streaming imbalanced data. IEEE Trans. Knowl. Data Eng. 25 (10), pp. 2283–2301. External Links: Cited by: §7.1.
-  (2015) Learning in nonstationary environments: a survey. IEEE Comput. Intell. Mag. 10 (4), pp. 12–25. Cited by: §1.
-  (2000) Mining high-speed data streams. Conference Proceedings In Proc. 6th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, pp. 71–80. External Links: Cited by: §5.3, TABLE III.
-  (2009) Adaptive concept drift detection. Statistical Analysis and Data Mining: The ASA Data Science Journal 2 (5–6), pp. 311–327. External Links: Cited by: §3.1, §3.1, TABLE III.
-  (2014) A selective detector ensemble for concept drift detection. The Computer Journal 58 (3), pp. 457–471. Cited by: §3.2.3, TABLE I, TABLE II.
-  (2011) Incremental learning of concept drift in nonstationary environments. IEEE Trans. Neural Networks 22 (10), pp. 1517–31. External Links: Cited by: §5.2, §6.3, TABLE III, TABLE IV.
-  (2017) Mining evolving data streams with particle filters. Comput. Intell. 33 (2), pp. 147–180. External Links: Cited by: TABLE III, TABLE IV.
-  (2016) Online adaptive decision trees based on concentration inequalities. Knowledge-Based Systems 104, pp. 179–194. External Links: Cited by: §5.3, TABLE III, TABLE IV.
-  (2015) Online and non-parametric drift detection methods based on hoeffding’s bounds. IEEE Trans. Knowl. Data Eng. 27 (3), pp. 810–823. External Links: Cited by: §3.1, §3.2.1, TABLE I, §4.1, §4.2, TABLE II, TABLE III, TABLE IV.
-  (2006) Learning with local drift detection. Conference Proceedings In Proc. 2nd Int. Conf. Advanced Data Mining and Applications, pp. 42–55. External Links: Cited by: §3.2.1, TABLE I, §4.3, TABLE II.
-  (2013) Recurrent concepts in data streams classification. Knowledge and Information Systems 40 (3), pp. 489–507. External Links: Cited by: §5.2, TABLE III, TABLE IV.
-  (2004) Learning with drift detection. Book Section In Proc. 17th Brazilian Symp. Artificial Intelligence, Lecture Notes in Computer Science, pp. 286–295. External Links: Cited by: §3.1, §3.2.1, TABLE I, §4.1, §4.1, §4.2, TABLE II, TABLE III, TABLE IV, §7.3.
-  (2003) Accurate decision trees for mining high-speed data streams. Conference Proceedings In Proc. 9th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, pp. 523–528. External Links: Cited by: §5.3, TABLE III, TABLE IV.
-  (2012) On evaluating stream learning algorithms. Machine Learning 90 (3), pp. 317–346. External Links: Cited by: §6.1.
-  (2014) A survey on concept drift adaptation. ACM Comput. Surv. 46 (4), pp. 1–37. External Links: Cited by: §1, §2.1, §2.1, §2.2, §2.2, §6.1.
-  (2012) A survey on learning from data streams: current and future trends. Progress in Artificial Intelligence 1 (1), pp. 45–55. External Links: Cited by: §1.
-  (2017) Adaptive random forests for evolving data stream classification. Machine Learning. External Links: Cited by: §3.2.1, §5.2, TABLE III, TABLE IV.
-  (2014) Mining recurring concepts in a dynamic feature space. IEEE Trans. Neural Networks Learn. Syst. 25 (1), pp. 95–110. External Links: Cited by: §5.2, TABLE IV.
-  (2014) A comparative study on concept drift detectors. Expert Systems with Applications 41 (18), pp. 8144–8156. Cited by: §3.3.
-  (2016) Concept drift detection based on equal density estimation. Conference Proceedings In Proc. 2016 Int. Joint Conf. Neural Networks, pp. 24–30. External Links: Cited by: §3.2.2, TABLE I, TABLE II.
-  (2015) Efficient mining of high-speed uncertain data streams. Applied Intelligence 43 (4), pp. 773–785. External Links: Cited by: §5.1, TABLE III, TABLE IV.
-  (2003) Efficient handling of concept drift and concept evolution over stream data. Conference Proceedings In Proc. 32nd Int. Conf. Data Engineering, pp. 481–492. External Links: Cited by: §7.3.
-  (2016) SAND: semi-supervised adaptive novel class detection and classification over data stream. Conference Proceedings In 30th AAAI Conf. Artificial Intelligence, pp. 1652–1658. Cited by: §7.3.
-  (2014) Concept drift detection through resampling. Conference Proceedings In Proc. 31st Int. Conf. Machine Learning, pp. 1009–1017. Cited by: TABLE III.
-  (1999) Splice-2 comparative evaluation: electricity pricing. Journal Article, Citeseer. Cited by: §6.3, TABLE IV.
-  (2015) Concept drift detection for streaming data. Conference Proceedings In Proc. 2015 Int. Joint Conf. Neural Networks, pp. 1–9. External Links: Cited by: §3.2.3, TABLE I, TABLE II.
-  (2015) An ensemble of cluster-based classifiers for semi-supervised classification of non-stationary data streams. Knowledge and Information Systems 46 (3), pp. 567–597. External Links: Cited by: §7.3.
-  (2006) Extreme learning machine: theory and applications. Neurocomputing 70 (1–3), pp. 489–501. External Links: Cited by: §3.2.1.
-  (2001) Mining time-changing data streams. Conference Paper In Proc. 7th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, San Francisco, California, pp. 97–106. External Links: Cited by: §5.3, TABLE III.
-  (2011) Evaluating learning algorithms: a classification perspective. Book, Cambridge University Press. Cited by: §6.1.
-  (2008) An adaptive personalized news dissemination system. Journal of Intelligent Information Systems 32 (2), pp. 191–212. External Links: Cited by: §6.3, TABLE IV.
-  (2008) An ensemble of classifiers for coping with recurring contexts in data streams. Conference Proceedings In 18th European Conf. Artificial Intelligence, pp. 763–764. Cited by: §6.3, TABLE IV.
-  (2009) Tracking recurring contexts using ensemble classifiers: an application to email filtering. Knowledge and Information Systems 22 (3), pp. 371–391. External Links: Cited by: §6.3, TABLE IV.
-  (2013) Big data: issues, challenges, tools and good practices. Conference Proceedings In Proc. 6th Int. Conf. Contemporary Computing (IC3), pp. 404–409. External Links: Cited by: §7.2.
-  (2004) Detecting change in data streams. Conference Proceedings In Proc. 30th Int. Conf. Very Large Databases, Vol. 30, pp. 180–191. Cited by: §3.2.2, TABLE I, §4.2, TABLE II.
-  (2007) Dynamic weighted majority: an ensemble method for drifting concepts. Journal of Machine Learning Research. Cited by: §5.2.
-  (2005) Using additive expert ensembles to cope with concept drift. Conference Paper In Proc. 22nd Int. Conf. Machine Learning, Bonn, Germany, pp. 449–456. External Links: Cited by: TABLE III.
-  (2015) Very fast decision rules for classification in data streams. Data Mining and Knowledge Discovery 29 (1), pp. 168–202. External Links: Cited by: TABLE III, TABLE IV, §7.4.
-  (2017) Ensemble learning for data stream analysis: a survey. Information Fusion 37, pp. 132–156. External Links: Cited by: §1, §1.
-  (2017) On expressiveness and uncertainty awareness in rule-based classification for data streams. Neurocomputing 265, pp. 127–141. External Links: Cited by: §7.4.
-  (2015) Learning concept-drifting data streams with random ensemble decision trees. Neurocomputing 166, pp. 68–83. External Links: Cited by: §5.2, TABLE III, TABLE IV.
-  (2013) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Cited by: §6.3, TABLE III, TABLE IV.
-  (2017) Regional concept drift detection and density synchronized drift adaptation. Conference Proceedings In Proc. 26th Int. Joint Conf. Artificial Intelligence, External Links: Cited by: §1, §2.1, §2.2, §3.1, §3.2.2, TABLE I, §4.3, §4.3, TABLE II, TABLE IV.
-  (2017) Fuzzy time windowing for gradual concept drift adaptation. Conference Proceedings In Proc. 26th IEEE Int. Conf. Fuzzy Systems, Cited by: §1, §3.2.1, TABLE I, TABLE II, TABLE III, TABLE IV.
-  (2016) FP-elm: an online sequential learning algorithm for dealing with concept drift. Neurocomputing 207, pp. 322–334. External Links: Cited by: §5.1, TABLE III.
-  (2016) KNN classifier with self adjusting memory for heterogeneous concept drift. Conference Proceedings In Proc. 16th Int. Conf. Data Mining, pp. 291–300. External Links: Cited by: §2.1, §2.1, §5.2, TABLE III, TABLE IV.
-  (2016) A concept drift-tolerant case-base editing technique. Artif. Intell. 230, pp. 108–133. External Links: Cited by: §1, §2.1, §2.1, §3.1, §3.2.2, §3.2.2, TABLE I, §4.3, §4.3, TABLE II, §5.1, TABLE III, TABLE IV.
-  (2014) Concept drift detection via competence models. Artif. Intell. 209, pp. 11–28. External Links: Cited by: §1, §2.1, §3.1, §3.2.2, TABLE I, §4.1, §4.2, §4.3, TABLE II, §5.1, TABLE IV.
-  (2015) A lightweight concept drift detection ensemble. In Proc. 27th IEEE Int. Conf. on Tools with Artificial Intelligence, pp. 1061–1068. Cited by: §3.2.3, TABLE I, TABLE II.
-  (2000) A cumulative sum type of method for environmental monitoring. Environmetrics 11 (2), pp. 151–166. External Links: Cited by: §5.1.
-  (1947) Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12 (2), pp. 153–157. External Links: Cited by: §6.1.
-  (2015) Ensemble of subset online sequential extreme learning machine for class imbalance and concept drift. Neurocomputing 149, pp. 316–329. External Links: Cited by: §7.1.
-  (2016) Meta-cognitive online sequential extreme learning machine for imbalanced and concept-drifting data classification. Neural Networks 80, pp. 79–94. External Links: Cited by: §7.1.
-  (2012) A unifying view on dataset shift in classification. Pattern Recognit. 45 (1), pp. 521–530. External Links: Cited by: §2.1.
- One-pass logistic regression for label-drift and large-scale classification on distributed systems. Conference Proceedings In Proc. 16th Int. Conf. Data Mining, pp. 1113–1118. External Links: Cited by: §7.2.
-  (2007) Detecting concept drift using statistical testing. Conference Proceedings In Proc. 10th Int. Conf. Discovery Science, V. Corruble, M. Takeda, and E. Suzuki (Eds.), Berlin, Heidelberg, pp. 264–269. External Links: Cited by: §3.2.1, TABLE I, §4.1, §4.2, TABLE II, TABLE III.
-  (2001) Experimental comparisons of online and batch versions of bagging and boosting. Conference Proceedings In Proc. 7th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 502565, pp. 359–364. External Links: Cited by: §5.2.
-  (2016) A method for automatic adjustment of ensemble size in stream data mining. Conference Proceedings In Proc. 2016 Int. Joint Conf. Neural Networks, pp. 9–15. External Links: Cited by: §5.2, TABLE III.
-  (2015) PClass: an effective classifier for streaming examples. IEEE Trans. Fuzzy Syst. 23 (2), pp. 369–386. External Links: Cited by: §7.4.
-  (2016) Scaffolding type-2 classifier for incremental learning under concept drifts. Neurocomputing 191, pp. 304–329. External Links: Cited by: §7.4.
-  (1994) The three sigma rule. The American Statistician 48 (2), pp. 88–91. External Links: Cited by: §4.1.
-  (2015) A pca-based change detection framework for multidimensional data streams. Conference Proceedings In Proc. 21st Int. Conf. on Knowledge Discovery and Data Mining, pp. 935–944. Cited by: §3.2.2, TABLE I, TABLE II.
-  (2017) A survey on data preprocessing for data stream mining: current status and future directions. Neurocomputing 239, pp. 39–57. External Links: Cited by: §1, §1, 1st item, §3.1.
-  (2015) EWMA model based shift-detection methods for detecting covariate shifts in non-stationary environments. Pattern Recognit. 48 (3), pp. 659–669. External Links: Cited by: §3.2.3, TABLE I, TABLE II.
-  (2012) Exponentially weighted moving average charts for detecting concept drift. Pattern Recognit. Lett. 33 (2), pp. 191–198. External Links: Cited by: §3.2.1, TABLE I, §4.1, TABLE II, TABLE III, TABLE IV.
-  (2015) A new method for data stream mining based on the misclassification error. IEEE Trans. Neural Networks Learn. Syst. 26 (5), pp. 1048–1059. External Links: Cited by: §5.3, TABLE III, TABLE IV.
-  (2014) Decision trees for mining data streams based on the gaussian approximation. IEEE Trans. Knowl. Data Eng. 26 (1), pp. 108–119. External Links: Cited by: §5.3, TABLE III, TABLE IV.
-  (2013) Decision trees for mining data streams based on the mcdiarmid’s bound. IEEE Trans. Knowl. Data Eng. 25 (6), pp. 1272–1279. External Links: Cited by: §5.3, TABLE III.
-  (1986) Incremental learning from noisy data. Machine learning 1 (3), pp. 317–354. Cited by: §2.1.
-  (2014) Prototype-based learning on concept-drifting data streams. Conference Proceedings In Proc. 20th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2623609, pp. 412–421. External Links: Cited by: §3.2.2, §3.2.2, TABLE I, §4.1, §4.2, TABLE II, TABLE III, TABLE IV.
-  (2013) Data stream clustering: a survey. ACM Comput. Surv. 46 (1), pp. 1–31. External Links: Cited by: §1, §3.1, §5.1.
-  (2016) An adaptive ensemble of on-line extreme learning machines with variable forgetting factor for dynamic system prediction. Neurocomputing 171, pp. 693–707. External Links: Cited by: §5.1.
-  (2016) Dynamic clustering forest: an ensemble framework to efficiently classify textual data stream with concept drift. Information Sciences 357, pp. 125–143. External Links: Cited by: TABLE IV.
-  (2016) A data streams analysis strategy based on hoeffding tree with concept drift on hadoop system. Conference Proceedings In Proc. 4th Int. Conf. Advanced Cloud and Big Data, pp. 45–48. External Links: Cited by: §7.2.
-  (2007) Statistical change detection for multi-dimensional data. Conference Paper In Proc. 13th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, San Jose, California, USA, pp. 667–676. External Links: Cited by: §3.2.2, TABLE I, §4.2, TABLE II.
-  (2015) Data stream classification guided by clustering on nonstationary environments and extreme verification latency. In Proceedings of the 2015 SIAM International Conference on Data Mining, pp. 873–881. Cited by: TABLE III, TABLE IV.
-  (2001) A streaming ensemble algorithm (sea) for large-scale classification. Conference Proceedings In Proc. Seventh ACM Int. Conf. Knowledge Discovery and Data Mining, 502568, pp. 377–382. External Links: Cited by: TABLE III.
-  (2016) Online ensemble learning of data streams with gradually evolved classes. IEEE Trans. Knowl. Data Eng. 28 (6), pp. 1532–1545. External Links: Cited by: §5.2.
-  (2017) Scalable real-time classification of data streams with concept drift. Future Generation Computer Systems 75, pp. 187–199. External Links: Cited by: §7.2.
-  (2008) Dynamic integration of classifiers for handling concept drift. Information Fusion 9 (1), pp. 56–68. External Links: Cited by: §4.3.
-  (2003) Mining concept-drifting data streams using ensemble classifiers. Conference Paper In Proc. 9th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, Washington, D.C., pp. 226–235. External Links: Cited by: TABLE III.
-  (2017) Tracking concept drift using a constrained penalized regression combiner. Comput. Stat. Data Anal. 108, pp. 52–69. External Links: Cited by: TABLE IV.
-  (2015) Resampling-based ensemble methods for online class imbalance learning. IEEE Trans. Knowl. Data Eng. 27 (5), pp. 1356–1368. External Links: Cited by: §7.1.
-  (1996) Learning in the presence of concept drift and hidden contexts. Machine Learning 23 (1), pp. 69–101. External Links: Cited by: §1, §2.1, TABLE III.
-  (2012) Learning from concept drifting data streams with unlabeled data. Neurocomputing 92, pp. 145–155. External Links: Cited by: TABLE III, TABLE IV, §7.3.
-  (2017) Dynamic extreme learning machine for data stream classification. Neurocomputing 238, pp. 433–449. External Links: Cited by: §3.2.1, TABLE I, §4.2, TABLE II, §5.1, TABLE III.
-  (2017) Concept drift learning with alternating learners. Conference Proceedings In Proc. 2017 Int. Joint Conf. Neural Networks, pp. 2104–2111. External Links: Cited by: §5.2.
- Change-point detection with feature selection in high-dimensional time-series data. Conference Proceedings In Proc. 23rd Int. Joint Conf. Artificial Intelligence, pp. 1827–1833. Cited by: §3.1.
-  (2012) Incrementally optimized decision tree for noisy big data. Conference Paper In Proc. 1st Int. Workshop Big Data, Streams and Heterogeneous Source Mining Algorithms, Systems, Programming Models and Applications, Beijing, China, pp. 36–44. External Links: Cited by: §5.3, TABLE III, TABLE IV.
-  (2015) Countering the concept-drift problems in big data by an incrementally optimized stream mining model. Journal of Systems and Software 102, pp. 158–166. External Links: Cited by: §5.3, TABLE III, TABLE IV.
-  (2013) A rank-one update method for least squares linear discriminant analysis with concept drift. Pattern Recognit. 46 (5), pp. 1267–1276. External Links: Cited by: §7.4.
-  (2015) DE2: dynamic ensemble of ensembles for learning nonstationary data. Neurocomputing 165, pp. 14–22. External Links: Cited by: §5.2, TABLE IV.
-  (2016) A simple unlearning framework for online learning under concept drifts. Conference Proceedings In Proc. 20th Pacific-Asia Conf. Knowledge Discovery and Data Mining, pp. 115–126. External Links: Cited by: §5.2, TABLE III, TABLE IV.
-  (2017) Concept drift detection with hierarchical hypothesis testing. Conference Proceedings In Proc. 2017 SIAM Int. Conf. Data Mining, pp. 768–776. External Links: Cited by: §3.2.3, TABLE I, TABLE II, TABLE III, TABLE IV.
-  (2018) Request-and-reverify: hierarchical hypothesis testing for concept drift detection with expensive labels. arXiv preprint arXiv:1806.10131. Cited by: §3.2.3, TABLE I, TABLE II, §6.1.
-  (2010) Classifier and cluster ensembles for mining concept drifting data streams. Conference Proceedings In Proc. 10th Int. Conf. Data Mining, pp. 1175–1180. External Links: Cited by: §7.3.
-  (2011) Enabling fast prediction for ensemble models on data streams. Conference Paper In Proc. 17th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, San Diego, California, USA, pp. 177–185. External Links: Cited by: §5.2, TABLE IV.
-  (2008) Categorizing and mining concept drifting data streams. Conference Paper In Proc. 14th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, pp. 812–820. External Links: Cited by: §5.2, TABLE III, TABLE IV.
-  (2017) Three-layer concept drifting detection in text data streams. Neurocomputing 260, pp. 393–403. External Links: Cited by: §3.2.3, TABLE I, TABLE II, TABLE IV.
-  (2010) Stream data mining repository. External Links: Cited by: §6.3, TABLE IV.
-  (2014) Active learning with drifting streaming data. IEEE Trans. Neural Networks Learn. Syst. 25 (1), pp. 27–39. External Links: Cited by: TABLE IV, §7.3.
-  (2015) Evaluation methods and decision theory for classification of streaming data with temporal dependence. Machine Learning 98 (3), pp. 455–482. External Links: Cited by: §6.1.
-  (2014) Optimizing regression models for data streams with missing values. Machine Learning 99 (1), pp. 47–73. External Links: Cited by: §2.1, §2.1.
-  (2014) Controlled permutations for testing adaptive learning models. Knowledge and Information Systems 39 (3), pp. 565–578. External Links: Cited by: §6.1.