I Introduction
With the exponential growth of data, it becomes increasingly challenging to design and implement effective techniques for analyzing and detecting changes in a streaming environment [1, 2]. As a result, early approaches for detecting statistical changes in a time series (such as change point detectors) have had to be extended for online detection of changes in multivariate data streams [3]. Some of these techniques for detecting intrinsic changes in the relationship of the incoming data have been successfully applied to various real-world applications, such as email filtering, network traffic analysis and user preference prediction [4, 5].
Online classification is another common task performed on multivariate streaming data that takes advantage of these statistical relationships to predict a class label at each time index [6]. If the underlying source (or joint data distribution) that generates the data is not stationary, the optimal decision rule for the classifier changes over time, a phenomenon known as concept drift [7]. Given the impact of concept drift on the predictive performance of an online classifier, there is often a need to detect these concept drifts as early as possible. The inability of change point detectors to detect these concept drifts has motivated the need for concept drift detectors that monitor not only the joint distribution of a multivariate data stream but also changes in its relationship to the class labels of the streaming data.
There are two different approaches to address concept drifts in streaming data [6]. The first automatically adapts the parameters of a statistical model in an incremental fashion [8, 9, 10] or employs an ensemble of classifiers, trained on different windows over the stream, to give the optimal decision [11, 12, 13, 14]. These methods do not detect drifts explicitly; they only keep retraining new classifiers. The second approach integrates a statistical model and a concept drift detector, whose purpose is to signal the need for updating the statistical model once a concept drift is detected. Existing methods in this category monitor the error rate or an error-driven statistic and make a decision based on statistical learning theory [15, 16, 17, 18]. Unlike the first approach, which only mitigates deteriorating classification performance over time, the second approach enables identification of the time instants at which concept drifts occur. The promptness of the alert, i.e., the time elapsed between the start of a drift and its detection, is crucially important in applications like malware detection or network monitoring [4, 5].
In this paper, we use the second approach and present a novel hierarchical hypothesis testing (HHT) framework for concept drift detection and adaptation. This framework is inspired by the hierarchical architecture that was recently proposed for change point detection [19, 20] (see Section II-A for more discussion on the difference between concept drift detection and change point detection). The presented work intends to bring new perspectives to the field of concept drift detection and adaptation by building on recent advances in hierarchical mechanisms (e.g., [21, 20]), and provides the following contributions. First, we present the Hierarchical Linear Four Rates (HLFR) detector [22], a novel HHT-based concept drift detection method, which is applicable to different types of concept drifts (e.g., recurrent or irregular, gradual or abrupt). A detailed analysis of the Type-I and Type-II errors of the proposed HLFR is also performed. Second, we present an adaptive training approach to replace the commonly used retraining strategy once a drift is confirmed. The motivation is to leverage knowledge from the historical concept (rather than discard this information as in the retraining strategy) to enhance the classification performance in the new concept. We term this improvement adaptive HLFR (A-HLFR). Admittedly, leveraging previous knowledge to boost classification performance is not novel in the streaming classification scenario. However, to the best of our knowledge, previous work either uses the first approach, which does not explicitly identify the timestamps or the types of drifts (e.g., [23, 24, 25]), or relies heavily on previously stored samples (e.g., [21]), which contradicts the single pass criterion [6, 26] (a sample from the data stream should be discarded rather than stored in memory once it has been processed). From this perspective, we are among the first to investigate feasible solutions to perform “knowledge transfer” without losing intrinsic drift detection capability and without relying on previous samples. Third, we carry out comprehensive experiments to investigate the benefits of HLFR (in detection) and A-HLFR (in detection and adaptation), and validate the advantage of the adaptive training strategy.
The rest of the paper is organized as follows. In Section II, we give the problem formulation of concept drift and briefly review related work. In Section III, we present HLFR and elaborate on the Layer-I and Layer-II tests employed. This section also includes the derivation of the Type-I and Type-II errors associated with HLFR. Additionally, we present A-HLFR, which not only detects drifts but also adapts the classifier to handle concept drifts. In Section IV, experiments are presented and discussed. Finally, we present the conclusion in Section V.
II Previous approaches
II-A Problem Formulation
Given a continuous stream of labeled samples $\{(X_t, y_t)\}$, $t = 1, 2, \ldots$, a classifier $f$ can be learned so that $f(X_t) \mapsto y_t$. Here, $X_t$ is a $d$-dimensional feature vector in a predefined vector space $\mathcal{X}$ and $y_t$ is a binary class label (this paper only considers binary classification). At every time instant $t$, we split the samples into a set $S_A$ (containing recent samples) and a set $S_B$ (containing examples that appeared prior to those in $S_A$). A concept drift means that the joint distribution $P(X, y)$ that generates the samples in $S_A$ differs from that in $S_B$ [7, 4, 27]. From a Bayesian perspective, concept drifts can manifest as two fundamental forms of change [28]: 1) a change in the posterior probability $P(y|X)$; and 2) a change in the marginal probability $P(X)$ or $P(y)$. Existing studies tend to prioritize detecting posterior distribution change [5], also known as real concept drift [29], because it clearly indicates a change in the optimal decision rule.
A closely related problem to concept drift detection is classical change point detection, which has been well studied both theoretically and practically. Unlike concept drift detectors, change point detectors are targeted at detecting changes in the generating distribution of the streaming data (i.e., $P(X)$) [30]. The standard change point detection methods are typically based on statistical decision theory; reference books include [3, 31, 32, 33]. Although a change point detector may benefit the performance of a concept drift detector, purely modeling $P(X)$ is insufficient to solve the problem of concept drift detection [34]. An intuitive example is shown in Fig. 1, in which $P(X)$ remains unchanged while the class labels change. On the other hand, it still remains a big challenge to detect any type of distributional change, especially for multivariate or high-dimensional data [30, 17]. For these reasons, instead of selecting the intermediate solution of change point detection, we solve the problem by monitoring the “significant” drift in the prediction risk of the underlying predictor based on the risk minimization principle [35].
II-B Benchmarking concept drift detection approaches
An extensive review of learning under concept drifts is beyond the scope of this paper, and we refer interested readers to recently published surveys [4, 5, 27] for classical methods and recent progress. In this section, we only review the previous work most relevant to the presented method, i.e., concept drift detection approaches.
The method that renewed attention to this problem was the Drift Detection Method (DDM) [15]. DDM monitors the sum of the overall classification error and its empirical standard deviation. Despite its simplicity, DDM fails to detect real drift points unless the sum of the Type-I and Type-II errors changes. The Early Drift Detection Method (EDDM) [37], on the other hand, suggests monitoring the distance between two consecutive classification errors. EDDM performs better than DDM, especially in the scenario of slow gradual changes. However, it requires waiting for a minimum number of classification errors before calculating the monitoring statistic at each time instant, an impractical condition for imbalanced data. A third error-based method, STEPD [38], applies a test of equal proportions to compare the classification accuracy in a recent window with the historical classification accuracy excluding this recent window.
Following this early work, a few new methods have been proposed to improve DDM from different perspectives. The Drift Detection Method for Online Class Imbalance (DDM-OCI) [16] deals with imbalanced data. Unfortunately, DDM-OCI is prone to trigger many false positives due to an inherent weakness in the model: the test statistic used by DDM-OCI does not follow the distribution assumed for it under the null hypothesis [17] (the statistic is a modified estimator of the minority-class recall that incorporates a time decaying factor [16]). PerfSim [18] also deals with imbalanced data. Unlike DDM-OCI, PerfSim tracks the cosine similarity of the four entries of the confusion matrix to determine the occurrence of a concept drift. However, the threshold used to identify a concept drift is user-specified. Moreover, PerfSim assumes that the data arrive in a batch-incremental manner [39], which makes it impractical in real applications, especially when decisions must be made instantly. Other related work includes the Exponentially Weighted Moving Average (EWMA) for concept drift detection (ECDD) [6] and the Drift Detection Method based on Hoeffding’s inequality (HDDM) [40]. An experimental comparative study is available in [41].
II-C Hierarchical architecture on change-point or concept drift detection
Hierarchical architectures have been extensively studied in the machine learning community in the last decades. One of the most recent examples is the Deep Predictive Coding Network (DPCN) [42], a neural-inspired hierarchical generative model which is effective at modeling sensory data.
However, hierarchical architectures for change point (or concept drift) detection have seldom been investigated. The first hierarchical change point test (HCDT) was proposed in [19] based on the Intersection of Confidence Intervals (ICI) rule [43]. It was later generalized into a methodology for designing HCDTs [20]. However, as a change point detector, HCDT has its intrinsic limitations, as emphasized in Section II-A. Although it can be modified for concept drift detection by tracking the classification error with a Bernoulli distribution assumption, a univariate indicator (or statistic) is insufficient to provide accurate concept drift detection [17], especially when the classifier becomes unstable. Moreover, we have already proved that the derived statistics (in Layer-I) are geometrically weighted sums of Bernoulli random variables [17], rather than simply following the Bernoulli distribution in the common sense.
This work is motivated by [20]. However, in order to make the designed algorithm well suited for broader classes of concept drift detection (rather than change point detection) without losing accuracy or proper classifier adaptation, we propose HLFR, a novel hierarchical architecture (together with a novel testing method in each of its two layers) for concept drift detection that is applicable to different concept drift types and data stream distributions (e.g., balanced or imbalanced labels). Moreover, we present an adaptive training approach to replace the commonly employed retraining scheme once a drift is confirmed. The proposed adaptation approach is not limited to a single concept drift type and strictly follows the single pass criterion, i.e., it does not need any historical data. Results show that the proposed approach captures more information from the data than previous work.
III Hierarchical Linear Four Rates (HLFR)
This section presents a novel hierarchical hypothesis testing (HHT) framework for concept drift detection and adaptation. As shown in Fig. 2, HHT features two layers of hypothesis tests. The Layer-I test is executed online. Once it detects a potential drift, the Layer-II test is activated to confirm (or deny) the validity of the suspected drift. Depending on the decision of the Layer-II test, HHT reconfigures or restarts the Layer-I test correspondingly. A new concept drift detector, namely Hierarchical Linear Four Rates (HLFR), is developed under the HHT framework. HLFR implements sequential hypothesis testing [44, 45], and the two layers cooperate closely to jointly improve online classification capability. HLFR is summarized in Algorithm 1.
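The control flow described above can be illustrated with a minimal Python sketch. It is not Algorithm 1: the function name `hht_detect`, the two toy layer tests and their thresholds are hypothetical stand-ins chosen only to show how a cheap online test hands suspected drifts over to a validation test, and how the online test is restarted only after a confirmed drift.

```python
from typing import Callable, Iterable, List


def hht_detect(
    errors: Iterable[int],
    layer1_suspects: Callable[[List[int]], bool],
    layer2_confirms: Callable[[List[int]], bool],
) -> List[int]:
    """Return the time indices at which a drift is confirmed.

    errors: stream of 0/1 prediction errors (1 = misclassified sample).
    layer1_suspects: cheap online test, called at every time step.
    layer2_confirms: validation test, called only when Layer-I raises an alarm.
    """
    confirmed: List[int] = []
    history: List[int] = []  # errors observed since the last confirmed drift
    for t, err in enumerate(errors):
        history.append(err)
        if not layer1_suspects(history):
            continue
        if layer2_confirms(history):
            confirmed.append(t)
            history = []  # restart Layer-I on the new concept
        # Otherwise the alarm is rejected and Layer-I keeps monitoring.
    return confirmed


# Hypothetical toy tests (assumptions, not the LFR or permutation tests).
def toy_layer1(history: List[int]) -> bool:
    # Alarm when at least half of the last 10 predictions were wrong.
    return len(history) >= 20 and sum(history[-10:]) / 10 >= 0.5


def toy_layer2(history: List[int]) -> bool:
    # Confirm only if the recent error rate clearly exceeds the earlier one.
    recent = sum(history[-10:]) / 10
    past = sum(history[:-10]) / len(history[:-10])
    return recent - past >= 0.3


drifting = [0] * 50 + [1] * 20   # error rate jumps from 0 to 1 at t = 50
noisy = [0, 1] * 30              # constantly poor but stable classifier
print(hht_detect(drifting, toy_layer1, toy_layer2))
print(hht_detect(noisy, toy_layer1, toy_layer2))
```

In the second stream the toy Layer-I test keeps raising alarms because the error rate is high, but the toy Layer-II test rejects them because the error rate has not changed, which is the role the validation layer plays in the framework.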
HLFR does not make use of any intrinsic property of, or impose any assumption on, the underlying classifier. This modular property enables HLFR to be easily deployed with any classifier (support vector machine (SVM), $k$-nearest neighbors (KNN), etc.). It is worth noting that an ensemble of detectors [46, 47] may appear to share similarities with the HHT framework proposed in this paper. However, the two architectures differ significantly in the way they organize the hypothesis tests. In HHT, the Layer-II test is only activated when the Layer-I test detects a suspected drift point (i.e., Layer-II is an auxiliary validation module to Layer-I in the hierarchical architecture), whereas the ensemble of detectors conducts different tests in parallel (i.e., each test is performed independently and synchronously with no priority, and the final decision is made by a voting scheme). To further illustrate the differences, an analysis of the Type-I and Type-II errors of our HHT framework and of the ensemble of detectors is given in Section III-C.
III-A Layer-I Hypothesis Test
HLFR uses our recently developed Linear Four Rates (LFR) detector [17] as its Layer-I test. According to the results shown in [17], LFR exhibits promising performance in terms of shorter detection delay and higher detection precision compared with other prevalent concept drift detectors. This is not surprising, as LFR simultaneously monitors four rates (or statistics) associated with a confusion matrix, i.e., the true positive rate ($P_{tpr}$), the true negative rate ($P_{tnr}$), the positive predictive value ($P_{ppv}$) and the negative predictive value ($P_{npv}$), and can therefore make full and precise use of the error information.
The key idea of LFR is straightforward: $P_{tpr}$, $P_{tnr}$, $P_{ppv}$ and $P_{npv}$ should remain the same in a stable or stationary concept. Therefore, a significant change of any $P_\star$ ($\star \in \{tpr, tnr, ppv, npv\}$) may imply a change in the underlying joint distribution or concept. Specifically, at each time instant $t$, LFR conducts four independent tests with the following null and alternative hypotheses:
$$H_0: \forall \star,\; P_\star^{(t-1)} = P_\star^{(t)} \qquad\qquad H_A: \exists \star,\; P_\star^{(t-1)} \neq P_\star^{(t)}$$
The concept is stable if hypothesis $H_0$ holds and is considered to have a potential drift if $H_0$ is rejected. Intuitively, LFR should be more sensitive to any type of drift, as it keeps track of four rates simultaneously. By contrast, almost all previous methods use a single specific statistic that can only capture part of the distributional information: DDM, ECDD and HDDM use the overall error rate, EDDM relies on the average distance between adjacent classification errors, DDM-OCI deals with the minority class recall, STEPD monitors a ratio of recent accuracy and overall accuracy, whereas PerfSim considers the cosine similarity coefficient of the four entries of the confusion matrix.
LFR is summarized in Algorithm 2. During implementation, LFR replaces $P_\star^{(t)}$ with a modified estimator $R_\star^{(t)}$, as employed in [16, 48] (see also the note on DDM-OCI in Section II-B). $R_\star^{(t)}$ is essentially a weighted linear combination of the classifier’s current and previous performances. In [17], we proved that $R_\star^{(t)}$ is distributed as a weighted sum of independent and identically distributed (i.i.d.) Bernoulli random variables. Given this property, we are able to obtain the “BoundTable” by conducting Monte-Carlo simulations. Based upon these bound values, LFR considers that a concept drift is likely to occur when any $R_\star^{(t)}$ exceeds the warning bound (warn.bd), and sets the warning signal. If any $R_\star^{(t)}$ reaches the corresponding detection bound (detect.bd), the concept drift is affirmed at time $t$. Interested readers can refer to [17] for more details.
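The bookkeeping behind the four time-decayed rates can be sketched as follows. This is a simplified illustration rather than Algorithm 2: the class name `FourRates`, the initial value 0.5, the decay factor `eta` and the interval-style bounds passed to `status` are assumptions made for the example, whereas the actual bounds come from the Monte-Carlo “BoundTable” of [17]. The sketch assumes that each sample updates the two rates it is relevant to (one conditioned on the true label, one conditioned on the predicted label).

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

RATES = ("tpr", "tnr", "ppv", "npv")


@dataclass
class FourRates:
    """Time-decayed estimates of the four confusion-matrix rates."""

    eta: float = 0.9  # assumed time decaying factor
    r: Dict[str, float] = field(
        default_factory=lambda: {k: 0.5 for k in RATES}  # assumed initial value
    )

    def update(self, y_true: int, y_pred: int) -> Dict[str, float]:
        correct = 1.0 if y_true == y_pred else 0.0
        touched = (
            "tpr" if y_true == 1 else "tnr",  # rate conditioned on the true label
            "ppv" if y_pred == 1 else "npv",  # rate conditioned on the prediction
        )
        for k in touched:
            # Weighted combination of previous and current performance.
            self.r[k] = self.eta * self.r[k] + (1.0 - self.eta) * correct
        return dict(self.r)


def status(
    rates: Dict[str, float],
    warn: Dict[str, Tuple[float, float]],
    detect: Dict[str, Tuple[float, float]],
) -> str:
    """Compare each rate with hypothetical (low, high) warning/detection bounds."""

    def outside(value: float, bounds: Tuple[float, float]) -> bool:
        return value < bounds[0] or value > bounds[1]

    if any(outside(rates[k], detect[k]) for k in RATES):
        return "detect"
    if any(outside(rates[k], warn[k]) for k in RATES):
        return "warn"
    return "stable"


monitor = FourRates(eta=0.5)
monitor.update(y_true=1, y_pred=1)          # a true positive
rates = monitor.update(y_true=1, y_pred=0)  # a false negative
warn_bd = {k: (0.3, 0.7) for k in RATES}
detect_bd = {k: (0.2, 0.8) for k in RATES}
print(rates, status(rates, warn_bd, detect_bd))
```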
III-B Layer-II Hypothesis Test
The four rates are sensitive metrics that enable LFR to promptly detect any type of concept drift. However, this sensitivity also makes LFR more likely to trigger “false positive” detections. The Layer-II test serves to validate the detections raised by the Layer-I test, and thus significantly reduces these “false positive” detections. In HLFR, we use a permutation test (see Algorithm 3) as the Layer-II test. Permutation tests have been well studied both theoretically and practically, and they do not require a priori information regarding the monitored process or the nature of the data [49].
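A permutation test of this kind can be sketched in a few lines of Python. The sketch is an illustration under stated assumptions, not Algorithm 3 itself: the names `permutation_test`, `train_centroid`, `n_perm` and `alpha` are hypothetical, the nearest-centroid learner is a toy stand-in for the base classifier, and the p-value uses the common "add one" correction. The idea is to train on the segment before the suspected drift, measure the zero-one loss on the segment after it, and compare that loss with the losses obtained when the two segments are pooled and shuffled.

```python
import random
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]
Model = Callable[[Sequence[float]], int]


def zero_one_loss(model: Model, data: List[Sample]) -> float:
    return sum(model(x) != y for x, y in data) / len(data)


def train_centroid(data: List[Sample]) -> Model:
    """Toy nearest-centroid learner for labels {0, 1} (stand-in for any classifier)."""
    dim = len(data[0][0])
    sums = {0: [0.0] * dim, 1: [0.0] * dim}
    counts = {0: 0, 1: 0}
    for x, y in data:
        counts[y] += 1
        for j, v in enumerate(x):
            sums[y][j] += v
    cents = {c: [s / counts[c] for s in sums[c]] for c in (0, 1) if counts[c] > 0}

    def predict(x: Sequence[float]) -> int:
        return min(cents, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, cents[c])))

    return predict


def permutation_test(
    before: List[Sample],
    after: List[Sample],
    train: Callable[[List[Sample]], Model],
    n_perm: int = 200,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[bool, float]:
    """Return (drift_confirmed, p_value) for the suspected split point."""
    rng = random.Random(seed)
    ordered = zero_one_loss(train(before), after)  # loss on the ordered split
    pooled = list(before) + list(after)
    n_train = len(before)
    at_least_as_bad = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # shuffled split: realization under "no drift"
        loss = zero_one_loss(train(pooled[:n_train]), pooled[n_train:])
        if loss >= ordered:
            at_least_as_bad += 1
    p_value = (1 + at_least_as_bad) / (1 + n_perm)
    return p_value <= alpha, p_value


xs = [i / 40 for i in range(40)]
old = [([x], int(x >= 0.5)) for x in xs]
same = [([x], int(x >= 0.5)) for x in xs]     # same concept
flipped = [([x], int(x < 0.5)) for x in xs]   # labels reversed: abrupt drift
print(permutation_test(old, same, train_centroid))
print(permutation_test(old, flipped, train_centroid))
```

When the concept is unchanged, the ordered loss is no larger than the shuffled losses and the suspected drift is rejected; when the labels flip, the ordered loss is far larger than any shuffled loss and the drift is confirmed.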
Specifically, we partition the streaming observations into two consecutive segments based on the suspected drift instant provided by the Layer-I test, and employ a new statistical hypothesis test that compares the inherent properties of these two segments to assess a possible variation in the joint distribution $P(X, y)$. The general idea behind our permutation test is to test whether the average prediction risk (evaluated over the second segment using a classifier trained on the first segment) is significantly different from its sampling distribution under the null hypothesis (i.e., no drift occurs). Here, we measure the average prediction risk with the zero-one loss, which contains partial information of the four rates. Intuitively, if no concept drift has occurred, the zero-one loss on the ordered train-test split should not deviate too much from that on the shuffled splits (both computed in Algorithm 3), which are realizations of its sampling distribution under the null hypothesis [30].
III-C Error analysis on Hierarchical Hypothesis Testing
To further give credence to the success of the HHT framework in practical applications, we present a theoretical analysis of its associated Type-I and Type-II errors.
In the problem of concept drift detection, the Type-I error (also known as the “false positive” rate) refers to the incorrect rejection of a true null hypothesis $H_0$ (i.e., no drift occurs). By contrast, the Type-II error (also known as the “false negative” rate) refers to incorrectly retaining a false null hypothesis when the alternative hypothesis $H_A$ is true. For any (single-layer) hypothesis test, the Type-I error is exactly the selected significance level $\alpha$, whereas the Type-II error (denoted by $\beta$) is determined by the power of the test, which is exactly $1-\beta$.
Let us denote by $\alpha_1$ and $\beta_1$ the Type-I and Type-II errors of the Layer-I test, and by $\alpha_2$ and $\beta_2$ the Type-I and Type-II errors of the Layer-II test. Also, we denote by $\alpha_{hht}$ and $\beta_{hht}$ the overall Type-I and Type-II errors of the HHT framework.
By definition, the Type-I error of HHT is given by:
$$\alpha_{hht} = P\big(\text{Layer-I rejects } H_0 \,\wedge\, \text{Layer-II rejects } H_0 \,\big|\, H_0\big) = \alpha_1 \alpha_2, \qquad (1)$$
where “$\wedge$” denotes the AND logic operator.
Eq. (1) assumes that the performances of the Layer-I and Layer-II tests are independent, i.e., the detection results of the two tests do not influence each other. Given that the test statistics and testing procedures are totally different in the Layer-I and Layer-II tests of HLFR, this assumption makes sense. In fact, even if the performances of the Layer-I and Layer-II tests are related to each other, $\alpha_{hht}$ still satisfies $\alpha_{hht} \le \min(\alpha_1, \alpha_2)$, which suggests that the HHT framework will not increase the Type-I error even in the worst case.
Similarly, the overall Type-II error is given by:
$$\beta_{hht} = \beta_1 + (1-\beta_1)\,\beta_2 = 1 - (1-\beta_1)(1-\beta_2). \qquad (2)$$
Again, we assume that the performances of the Layer-I and Layer-II tests are independent. However, even if this condition is not met, we still have $\beta_{hht} \ge \max(\beta_1, \beta_2)$. This is an unfortunate fact, as it suggests a fundamental limitation of the HHT framework: it may increase the Type-II error. Given that the majority of current concept drift detectors have high detection power (i.e., $\beta$ is small) yet suffer from a relatively high “false positive” rate, the cost is acceptable.
As emphasized earlier, a similar architecture to the proposed HHT framework is the ensemble of detectors [46, 47]. The most widely used decision rule for an ensemble of detectors is that, given a pool of candidate detectors, the system declares a drift if any one of the detectors finds a drift. Supposing there are $n$ candidate detectors whose Type-I and Type-II errors are $\alpha_i$ and $\beta_i$ ($i = 1, \ldots, n$), the Type-I and Type-II errors of the ensemble of detectors, $\alpha_{ens}$ and $\beta_{ens}$, are given by (assuming pairwise performance independence [50]):
$$\alpha_{ens} = 1 - \prod_{i=1}^{n} (1 - \alpha_i), \qquad (3)$$
$$\beta_{ens} = \prod_{i=1}^{n} \beta_i. \qquad (4)$$
By referring to Eqs. (1)-(4), it is easy to see that, although the architectures of HHT and of the ensemble of detectors look similar, their functionalities and mechanisms are totally different. HHT attempts to remove as many “false positive” detections as possible, and thus significantly decreases the Type-I error. However, HHT may increase the Type-II error at the same time. The ensemble of detectors, on the other hand, aims to further improve the detection power (and thus decrease the Type-II error) at the cost of an increased Type-I error (under the “any detector” rule, the ensemble Type-I error is at least that of every member, and the ensemble Type-II error is at most that of every member). Given that the prevalent concept drift detectors typically have high detection power (e.g., LFR and HDDM) yet suffer from many “false positive” detections, it may not be necessary to naively combine different detectors in an ensemble manner. This is also the reason why the ensemble of detectors does not demonstrate any performance gain over single-layer drift detectors in a recent experimental survey paper [50].
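The contrast between the two architectures can be made concrete with a small numerical sketch. It assumes independent tests, so that a hierarchy raises a false alarm only when both layers do (product of Type-I errors) and misses a drift when either layer does, while an "any detector fires" ensemble behaves the opposite way. The function names and the numerical error rates are illustrative assumptions, not values from the experiments.

```python
from math import prod
from typing import Sequence, Tuple


def hht_errors(a1: float, b1: float, a2: float, b2: float) -> Tuple[float, float]:
    """Overall (Type-I, Type-II) errors of a two-layer hierarchy, assuming independence."""
    alpha = a1 * a2              # both layers must raise a false alarm
    beta = b1 + (1.0 - b1) * b2  # Layer-I misses, or Layer-I fires and Layer-II misses
    return alpha, beta


def ensemble_errors(alphas: Sequence[float], betas: Sequence[float]) -> Tuple[float, float]:
    """Overall (Type-I, Type-II) errors of an 'any detector fires' ensemble."""
    alpha = 1.0 - prod(1.0 - a for a in alphas)  # at least one false alarm
    beta = prod(betas)                           # every detector misses
    return alpha, beta


# Hypothetical detectors: (alpha, beta) = (0.10, 0.05) and (0.05, 0.10).
print(hht_errors(0.10, 0.05, 0.05, 0.10))
print(ensemble_errors([0.10, 0.05], [0.05, 0.10]))
```

With these example values the hierarchy has overall errors of about (0.005, 0.145) and the ensemble about (0.145, 0.005): the same two tests trade false positives for false negatives in opposite directions depending on how they are combined.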
Having illustrated the analytical expressions for the overall Type-I and Type-II errors of the HHT framework (i.e., $\alpha_{hht}$ and $\beta_{hht}$), we now specify, for completeness, the Type-I and Type-II errors of the Layer-I test (i.e., $\alpha_1$ and $\beta_1$) as well as the Type-I and Type-II errors of the Layer-II test (i.e., $\alpha_2$ and $\beta_2$) of our proposed HLFR algorithm.
III-C1 The Type-I error $\alpha_1$ and Type-II error $\beta_1$ of the Layer-I test
The Type-I error of the Layer-I test is upper bounded by its detection significance level (see Algorithm 2). On the other hand, although the test statistics $R_\star^{(t)}$ are geometrically weighted sums of Bernoulli random variables under a stable concept (i.e., under the $H_0$ hypothesis) up to time $t$, i.e., $R_\star^{(t)} = (1-\eta)\sum_{i=1}^{t} \eta^{t-i} I_i$, where the $I_i$ are i.i.d. Bernoulli($P_\star$) variables, $\eta$ is the time decaying factor and $P_\star$ is the underlying rate, two reasons make it impossible to obtain a closed-form expression or upper bound for the Type-II error of the Layer-I test.
1) It is hard to obtain the closed-form distribution function of $R_\star^{(t)}$ under $H_0$. Although [51] investigated the closed-form distribution function of $R_\star^{(t)}$ under $H_0$ for a special case of $\eta$, it still remains an open question for other values of $\eta$.
2) The closed-form distribution function of $R_\star^{(t)}$ under $H_A$ is unattainable. This is because $P_\star$ could have an arbitrary (or unconstrained) distribution when the concept changes.
Therefore, this section only empirically investigates the power of the Layer-I test using synthetic data to illustrate and reveal its properties. We work with a power estimate of the test on $R_\star^{(t)}$. The null distribution is taken at the time instant of the drift and the alternative distribution at a later time instant that lies within the maximal detection time delay. Then suppose the underlying rate drifts from $p_0$ (the first concept) to $p_1$ (the second concept). Fig. 3 is a heatmap of the limiting power estimates on all $(p_0, p_1)$ pairs. We can see that the power estimate is already close to $1$ when $p_0$ and $p_1$ are significantly different. In this case, the Type-II error reduces to nearly $0$, because the Type-II error equals one minus the power.
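An empirical power estimate of this kind can be sketched with a short Monte-Carlo simulation. The sketch below is an illustration only and does not reproduce the setup behind Fig. 3: the function names, the decay factor, the significance level, the burn-in length and the use of empirical quantiles as detection bounds are all assumptions. It simulates the time-decayed statistic under the first concept to obtain two-sided bounds, then counts how often the statistic leaves these bounds within a maximal delay after the underlying rate switches from `p0` to `p1`.

```python
import random
from typing import List, Tuple


def simulate_r(p: float, steps: int, eta: float, r0: float, rng: random.Random) -> List[float]:
    """Path of the geometrically weighted statistic for i.i.d. Bernoulli(p) inputs."""
    r, path = r0, []
    for _ in range(steps):
        hit = 1.0 if rng.random() < p else 0.0
        r = eta * r + (1.0 - eta) * hit
        path.append(r)
    return path


def null_bounds(p0: float, eta: float, delta: float, burn_in: int = 200,
                n_sim: int = 2000, seed: int = 0) -> Tuple[float, float]:
    """Empirical two-sided bounds of the statistic under the stable concept."""
    rng = random.Random(seed)
    finals = sorted(simulate_r(p0, burn_in, eta, p0, rng)[-1] for _ in range(n_sim))
    lo = finals[int(delta / 2 * n_sim)]
    hi = finals[min(n_sim - 1, int((1 - delta / 2) * n_sim))]
    return lo, hi


def power_estimate(p0: float, p1: float, eta: float = 0.9, delta: float = 0.01,
                   max_delay: int = 100, n_sim: int = 500, seed: int = 1) -> float:
    """Fraction of simulated drifts p0 -> p1 detected within max_delay steps."""
    lo, hi = null_bounds(p0, eta, delta)
    rng = random.Random(seed)
    detected = 0
    for _ in range(n_sim):
        start = simulate_r(p0, 200, eta, p0, rng)[-1]   # state just before the drift
        after = simulate_r(p1, max_delay, eta, start, rng)
        if any(r < lo or r > hi for r in after):
            detected += 1
    return detected / n_sim


print(power_estimate(0.9, 0.1))  # large drift in the underlying rate
print(power_estimate(0.9, 0.9))  # no drift: fraction of false alarms in the window
```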
III-C2 The Type-I error $\alpha_2$ and Type-II error $\beta_2$ of the Layer-II test
As with the Layer-I test, the Type-I error of the Layer-II test is upper bounded by its selected significance level (a parameter of Algorithm 3). Thus, we focus our analysis on its power. Before that, we give the following two definitions.
Definition III.1. [52] An algorithm has error stability with respect to loss function if:
(5) 
where refers to a predictor obtained using trained on set with cardinality , is the set with the sample removed, and decreases with . is the risk of with respect to distribution , and denotes expectation.
Definition III.2. [30] A stream segment is said to have permitted variations, if for some , with respect to , if:
(6) 
Given two subsequences of equal length, the Layer-II test in our HLFR method aims to determine whether the average prediction risk on the ordered train-test split deviates too much from that on the shuffled splits. The tested hypothesis compares the risk on the ordered train-test split with the risk on the shuffled splits (both evaluated in Algorithm 3), where the shuffled training sets follow the uniform distribution over all possible training sets of the given size drawn from the two segments of samples. The hypothesis also involves a parameter that controls the maximum allowable change rate and a related function that is elaborated in the following corollary.
Having illustrated the essence of the Layer-II test, and given Definitions III.1 and III.2, the following corollary upper bounds its Type-II error.
Corollary III.0.1.
For an algorithm with stability and any , we have that under , the probability of obtaining a “false negative” detection is bounded as follows:
(7) 
The quantities involved in the bound (7) are the permutation window size and the significance rate of Algorithm 3, the small variation of Definition III.2, the maximum allowable change rate, and the estimated zero-one losses of the ordered train-test split and of the shuffled splits (see Algorithm 3 for more details). For simplicity, we set in our Algorithm 3 to avoid introducing extra hyperparameters. Note that the above corollary is a special case of a theorem in [30]. Interested readers can refer to the supplementary material of [30] for the complete proof.
III-D Adaptive Hierarchical Linear Four Rates (A-HLFR)
Although HLFR can be used for streaming data classification with concept drifts (just like its DDM [15], EDDM [37] and STEPD [38] counterparts), naively retraining a new classifier after each concept drift detection severely deteriorates its classification performance. This stems from the fact that, once a drift is confirmed, retraining discards all the (relevant) information from previous experience and uses only the limited samples from the current concept to train a classifier. A promising way to avoid this is to first extract the relevant knowledge from past experience and then “transfer” this knowledge to the new classifier [4, 25, 53]. To this end, Adaptive Hierarchical Linear Four Rates (A-HLFR) is an integral part of the proposed solution. A-HLFR makes a simple yet strategic modification to HLFR: it replaces the retraining scheme in the HLFR framework with an adaptive learning strategy. Specifically, we substitute SVM (this paper selects soft margin SVM as the base classifier due to its accuracy and robustness [52]) with adaptive SVM (A-SVM) [54] once a concept drift is confirmed. The pseudocode of A-HLFR is the same as Algorithm 1. The only exception comes from the Layer-I test, where the retraining scheme with standard SVM (in Algorithm 2) is substituted with A-SVM.
III-D1 Adaptive SVM - Motivations and Formulations
A fundamental difficulty in learning supervised models once a concept drift is confirmed is that the training samples from the new and previous concepts are drawn from different distributions. A short detection delay (especially for state-of-the-art concept drift detection methods) results in extremely limited training samples from the new concept. These limited training samples, coupled with the fact that consecutive concepts are likely to be closely related or relevant, inspire the idea of adapting the previous models with samples from the new concept to boost the concept drift adaptation capability.
Recalling the earlier problem formulation, we are required to classify samples in the new concept, where only a limited number of labeled samples (i.e., a newly observed primary dataset $D^p$) are available for updating a classifier. To circumvent the drawbacks of limited training samples, the auxiliary classifier $f^a$ trained on the previously observed, fully labeled auxiliary dataset $D^a$ should also be considered. This is because the dataset $D^a$ is sampled from a joint distribution that is related to, yet different from, the joint distribution of dataset $D^p$ in an unknown manner. If we apply the auxiliary classifier $f^a$ to the primary dataset $D^p$, the performance is poor since $f^a$ is biased towards $D^a$. On the other hand, although we can retrain a classifier using the samples in $D^p$ such that the new classifier is unbiased with respect to $D^p$, the classification accuracy may suffer from high variance due to the limited training samples.
In order to achieve an improved bias-variance tradeoff, we employ adaptive SVM (A-SVM), initiated in [54], to adapt $f^a$ to $D^p$. Intuitively, the key idea of A-SVM is to learn an adaptive classifier $f$ from $D^p$ by regularizing the distance between $f$ and $f^a$, which can be formulated as:
$$\min_{w}\; \frac{1}{2}\,\|w - w^a\|^2 + C \sum_{i} \xi_i \quad \text{s.t.}\quad \xi_i \ge 0,\;\; y_i\, w^{T}\phi(x_i) \ge 1 - \xi_i,\;\; \forall (x_i, y_i) \in D^p, \qquad (8)$$
where $\phi$ represents a feature mapping that projects a sample $x$ into a high-dimensional space or reproducing kernel Hilbert space (for linear SVM, $\phi(x) = x$), and $w^a$ denotes the classifier parameters estimated from $D^a$. Eq. (8) jointly optimizes the distance between $f$ and $f^a$ as well as the classification error. The optimization of A-SVM is presented in [54, 55].
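For the linear case, the adaptation idea can be sketched with plain subgradient descent on an objective that penalizes both the distance to the auxiliary parameters and the hinge loss on the new samples. This is a minimal sketch under stated assumptions rather than the solver of [54, 55]: labels are taken in {-1, +1}, no bias term is used, and the function names, step size `lr`, number of epochs and toy data are hypothetical.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def asvm_objective(w: Sequence[float], w_aux: Sequence[float],
                   X: List[Sequence[float]], y: List[int], C: float) -> float:
    """0.5 * ||w - w_aux||^2 + C * sum of hinge losses on the primary data."""
    reg = 0.5 * sum((a - b) ** 2 for a, b in zip(w, w_aux))
    hinge = sum(max(0.0, 1.0 - yi * dot(w, xi)) for xi, yi in zip(X, y))
    return reg + C * hinge


def fit_linear_asvm(X: List[Sequence[float]], y: List[int], w_aux: Sequence[float],
                    C: float = 1.0, lr: float = 0.01, epochs: int = 500) -> List[float]:
    """Adapt the auxiliary weights w_aux to the primary samples (X, y)."""
    w = list(w_aux)  # start from the auxiliary classifier
    for _ in range(epochs):
        # Subgradient of the regularizer pulls w back towards w_aux.
        grad = [wi - ai for wi, ai in zip(w, w_aux)]
        for xi, yi in zip(X, y):
            if yi * dot(w, xi) < 1.0:  # margin violated: hinge term is active
                for j, v in enumerate(xi):
                    grad[j] -= C * yi * v
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


# Toy example: the old concept used feature 0, the new concept depends on feature 1.
w_old = [1.0, 0.0]
X_new = [[0.0, 2.0], [0.0, -2.0]]
y_new = [1, -1]
w_new = fit_linear_asvm(X_new, y_new, w_old)
print(w_new, asvm_objective(w_old, w_old, X_new, y_new, 1.0),
      asvm_objective(w_new, w_old, X_new, y_new, 1.0))
```

In this toy run the weight on the old feature is left untouched while the weight on the newly informative feature grows just enough to satisfy the margin, which illustrates how previous knowledge is retained rather than discarded.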
IV Experiments
This section presents three sets of experiments to demonstrate the superiority of HLFR and A-HLFR over the prevalent baseline methods in terms of concept drift detection and adaptation. Section IV-A validates the benefits and advantages of HLFR on concept drift detection, using both quantitative metrics and visual evaluation. Section IV-B uses two real-world examples (one for email filtering, another for weather prediction) to illustrate the effectiveness of using an adaptive training method to improve the capability of concept drift adaptation. In Section IV-C, we empirically demonstrate that 1) the benefits of adaptive training are not limited to HLFR, i.e., it provides a general solution to classifier adaptation for all concept drift detectors, like DDM, EDDM, etc.; and 2) the concept drift detection capability is not impacted by the adaptive training strategy, i.e., HLFR and A-HLFR achieve almost the same concept drift detection precision. Finally, we give a brief analysis of the computational complexity of all competing methods in Section IV-D. All the experiments mentioned in this work were conducted in MATLAB a on an Intel i53337 GHz PC with GB RAM.
IV-A Concept Drift Detection with HLFR
We first compare the performance of HLFR against five state-of-the-art concept drift detection methods: DDM [15], EDDM [37], DDM-OCI [16], STEPD [38], as well as the recently proposed LFR [17]. The parameters used in these methods were taken as recommended by their authors: the warning and detection thresholds of DDM (EDDM) are () and () respectively; the warning and detection significance levels of LFR (STEPD) are () and () respectively; whereas the parameters of DDM-OCI vary across the different data under testing. For our proposed HLFR, the significance level in the Layer-II test is set to , and permutations were used throughout this paper.
Four benchmark data streams are selected for evaluation, namely “SEA” [15], “Checkerboard” [14], “Rotating hyperplane”, and USENET1 [13]. These datasets include both synthetic and real-world data. A comprehensive description of these datasets is given in [22]. Drifts are synthesized in the data, thus controlling the ground truth concept drift locations and enabling precise quantitative analysis. Table I summarizes the drift types and the data properties for each stream. The selected datasets span the gamut of concept drift types.
Data property  |  SEA  |  Checkerboard  |  Hyperplane  |  USENET1
gradual  |  |  |  |
abrupt  |  |  |  |
recurrent  |  |  |  |
imbalance  |  |  |  |
high dimensional  |  |  |  |
Each stream was generated and tested independently for times. The base classifier used for all competing methods in all streams is a (soft-margin) linear SVM with regularization parameter . The only exception is USENET1, for which a radial basis function (RBF) kernel SVM with kernel width is selected. Fig. 4 shows the detection results of the different methods averaged over these trials. As can be seen, HLFR and LFR significantly outperform their competitors in terms of promptly detecting concept drifts with fewer missed or false detections, regardless of drift types or data properties. By integrating the Layer-II test, HLFR further improves on LFR by effectively reducing even the few false positives triggered by LFR.
Quantitative comparisons are performed as well. We define a True Positive (TP) as a detection within a fixed delay range after a concept drift occurred, a False Negative (FN) as missing a detection within the delay range, and a False Positive (FP) as a detection outside this range or an extra detection in the range. For each detector, its detection quality is then evaluated by the Precision and the Recall values, i.e., Precision = TP/(TP+FP) and Recall = TP/(TP+FN). The Precision and Recall values with respect to a predefined (largest allowable) detection delay are shown in Fig. 5. At first glance, HLFR, LFR and STEPD always achieve higher Precision or Recall values across different delay ranges. On closer inspection, the Precision values are significantly improved with HLFR, while the Recall values of HLFR and LFR are similar (except for the Rotating hyperplane dataset). This result corroborates our Type-I and Type-II error analysis in Section III-C: the Layer-II test aims to confirm or deny the validity of the Layer-I detection results, thus it cannot compensate for detections missed by the Layer-I test. In other words, the Type-I error of HLFR should theoretically be smaller than that of LFR, whereas the Type-II error of HLFR is lower bounded by that of LFR. In fact, the relatively lower Recall of HLFR (compared to LFR) suggests that the Layer-II test used is slightly conservative, i.e., it has a small probability of rejecting a true positive detection triggered by the Layer-I test (i.e., LFR). On the other hand, STEPD appears to have much higher Recall values on the SEA and Rotating hyperplane datasets. However, this result is not meaningful, because STEPD triggers significantly more false alarms (as seen in the fifth row of Fig. 4(a) and Fig. 4(c)), such that its Precision values on these two datasets are consistently smaller than . Table II summarizes the detection delays (ensemble average) for all competing algorithms. HLFR has the shortest average detection delay on three of the four datasets (tied with LFR on USENET1).
Algorithms  SEA  Checkerboard  Hyperplane  USENET1

STEPD  463  57  140  19
DDM-OCI  844  58  198  26
EDDM  939  93  166  36
DDM  1209  69  125  26
LFR  458  56  127  17
HLFR  482  55  120  17
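The true positive, false positive and false negative bookkeeping described in this subsection can be sketched as follows. This is a minimal Python illustration (the experiments themselves were run in MATLAB), not the evaluation code behind Fig. 5. The function names `score_detections` and `precision_recall`, the inclusive window [t, t + max_delay], and the rule that the earliest detection in a window is the one counted as the true positive are assumptions made for the example.

```python
from typing import List, Tuple


def score_detections(drifts: List[int], detections: List[int],
                     max_delay: int) -> Tuple[int, int, int]:
    """Count (TP, FP, FN) for one stream.

    A detection d is a true positive if it falls in [t, t + max_delay] for a
    drift t that has not been matched yet (assumed convention). Any other
    detection, including an extra one inside an already matched window, is a
    false positive. Drifts left unmatched are false negatives.
    """
    matched = set()
    tp = fp = 0
    for d in sorted(detections):
        hit = next((i for i, t in enumerate(drifts)
                    if i not in matched and t <= d <= t + max_delay), None)
        if hit is None:
            fp += 1
        else:
            matched.add(hit)
            tp += 1
    fn = len(drifts) - len(matched)
    return tp, fp, fn


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN); 0.0 if undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Sweeping `max_delay` over a range of values and averaging the two scores over independent runs would give curves of the kind discussed above; a detector that fires often can reach a high Recall this way while its Precision stays low.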
IV-B Concept Drift Adaptation with AHLFR
In this section, we perform two case studies using representative real-world concept drift datasets from the email filtering and weather prediction domains respectively, aiming to validate the rationale of HLFR for concept drift detection as well as the potency of AHLFR for concept drift adaptation. Performance is compared to DDM, EDDM, STEPD and LFR. Note that the results of DDM-OCI are omitted, as it has difficulty detecting “reasonable” concept drift points in the selected data.
The spam filtering dataset [12], consisting of instances and attributes, is used herein. This data represents email messages from the Spam Assassin Collection (http://spamassassin.apache.org/) and contains natural concept drifts [12, 56]. The spam ratio is approximately . In addition, the weather dataset [53, 14], a subset of the National Oceanic and Atmospheric Administration (NOAA) data (ftp://ftp.ncdc.noaa.gov/pub/data/gsod), consisting of daily observations recorded at Offutt Air Force Base in Bellevue, Nebraska, is also used in the study. This data was collected over years, and contains not only short-term seasonal changes but also (possibly) a long-term climate trend. Daily measurements include temperature, pressure, wind speed, visibility, and a variety of other features. The task is to predict whether it is going to rain from these features. Minority class cardinality varied between and throughout these years.
On Parameter Tuning and Experimental Setting. A common phenomenon for classification of real-world streaming data with concept drifts and temporal dependency is that “the more random alarms fire the classifier, the better the accuracy [57]”. Thus, to provide a fair comparison, the parameters of all competing methods are tuned to detect a similar number of concept drifts. Table III and Table IV summarize the key parameters regarding significance levels (or thresholds) of the different methods on the two selected real-world datasets respectively. For the spam data, an extensive search for an appropriate partition of training and testing sets was performed based on two criteria. First, there should be no strong autocorrelation in the classification error sequence on the training set, because once the errors are highly autocorrelated, it is very probable that the training data is no longer or the training data spans different concepts. Second, the classifier trained on the training set should achieve promising classification accuracy on both minority and majority classes, i.e., a sufficient number of training data is required. With these two considerations, the length of the training set is set to . As for the weather data, the training size is set to instances (days), approximately one season as suggested in [53].
Algorithms  Parameter settings on significance levels (or thresholds) 

STEPD  , 
EDDM  , 
DDM  , 
LFR  , 
HLFR  , , 
AHLFR  , , 
Algorithms  Parameter settings on significance levels (or thresholds) 

STEPD  , 
EDDM  , 
DDM  , 
LFR  , 
HLFR  , , 
AHLFR  , , 
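The first criterion used to choose the spam training/testing partition (no strong autocorrelation in the training error sequence) can be checked with a sample autocorrelation function. The sketch below is one possible way to do this and is not taken from the paper: the function names, the number of lags, and the use of the usual white-noise band of 1.96/sqrt(n) as the meaning of "strong" are all assumptions.

```python
import numpy as np


def error_autocorrelation(errors, max_lag=20):
    """Sample autocorrelation of a 0/1 error sequence at lags 1..max_lag."""
    e = np.asarray(errors, dtype=float)
    e = e - e.mean()
    denom = float(np.dot(e, e))
    if denom == 0.0:
        # A constant sequence has no variance; report zero correlation.
        return np.zeros(max_lag)
    return np.array([np.dot(e[:-k], e[k:]) / denom
                     for k in range(1, max_lag + 1)])


def has_strong_autocorrelation(errors, max_lag=20, z=1.96):
    """Flag the sequence if any lag leaves the +/- z/sqrt(n) band.

    The band is the standard approximate 95% limit for white noise; using it
    as the decision rule here is an assumption made for illustration.
    """
    n = len(errors)
    acf = error_autocorrelation(errors, max_lag)
    return bool(np.any(np.abs(acf) > z / np.sqrt(n)))
```

Because several lags are tested at once, a candidate partition can be flagged by chance alone, so in practice the band or the number of lags would need to be chosen with that in mind.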
Case study on spam dataset. We first evaluate the performance of the different methods on the spam dataset. According to the authors of [12], there are three dominating concepts distributed over different time periods, and these concept drifts occur approximately in the neighborhoods of time instants and in Region I, time instants and in Region II, and time instant in Region III. In addition, there are many abrupt drifts in Region II. A possible reason for these abrupt and frequent drifts may be batches of outliers or noisy messages. According to the concept drift detection results shown in Fig. 6, AHLFR and HLFR best match these descriptions, except that they both miss a potential drift around time instant . By contrast, although the other methods are able to detect this point, they have other limitations: 1) LFR also triggers some false positive detections; and 2) DDM and EDDM not only miss obvious drift points, but also report unconvincing drift locations in Region I or Region III.
We then applied a recently proposed measure, the Kappa Plus Statistic (KPS) [58], to assess the experimental results. KPS, defined as $\kappa^{+} = (p_0 - p'_e)/(1 - p'_e)$, aims to evaluate a data stream classifier's performance, taking into account the temporal dependence and the effectiveness of classifier adaptation. Here $p_0$ is the classifier's prequential accuracy [59] and $p'_e$ is the accuracy of the No-Change classifier (defined as a classifier that predicts the same label as previously observed, i.e., $\hat{y}_t = y_{t-1}$, for any observation $x_t$ [58]). We partition the training set into approximately consecutive time periods. The KPS prequential representation over these periods is shown in Fig. 7(a). As can be seen, the HLFR and AHLFR adaptations are most effective in periods , but suffer from a sudden drop in periods . These observations corroborate the detection results shown in Fig. 6: HLFR and AHLFR accurately detect the first drift point without any false positives in Region I, but both miss a target in Region II. On the other hand, there is almost no performance difference between the classifier updates in AHLFR and HLFR.
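As a hedged illustration of the Kappa Plus Statistic, the sketch below follows the definition in [58], (p - p_nc) / (1 - p_nc), where p is the classifier's accuracy and p_nc is the accuracy of the No-Change classifier that predicts the previously observed label. The function name `kappa_plus`, the choice to score from the second observation onward (the No-Change classifier has no prediction for the first one), and returning NaN when the statistic is undefined are assumptions of this example.

```python
from typing import Sequence


def kappa_plus(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Kappa Plus Statistic over one evaluation period."""
    n = len(y_true)
    if n < 2 or len(y_pred) != n:
        raise ValueError("need two or more aligned labels and predictions")
    # Accuracy of the classifier under test, from the second observation on.
    p = sum(yt == yp for yt, yp in zip(y_true[1:], y_pred[1:])) / (n - 1)
    # Accuracy of the No-Change classifier: predict the previous true label.
    p_nc = sum(y_true[t] == y_true[t - 1] for t in range(1, n)) / (n - 1)
    if p_nc == 1.0:
        return float("nan")  # labels never change, statistic undefined
    return (p - p_nc) / (1.0 - p_nc)
```

A value of 0 means the classifier does no better than repeating the last label, and negative values mean it does worse, which is why the statistic is informative on streams with strong temporal dependence. Applying the function to each consecutive period in turn gives a per-period representation like the one plotted in Fig. 7.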
We further employ several quantitative measures to evaluate streaming classification performance more thoroughly. The first is the most commonly used overall accuracy (OAC). Although OAC is an important metric, it is inadequate for imbalanced data. Therefore, we also include the F-measure [60], $F = 2 \cdot \mathrm{Precision} \cdot \mathrm{Recall} / (\mathrm{Precision} + \mathrm{Recall})$, and the G-mean [61], $G = \sqrt{TPR \times TNR}$, where $TPR$ and $TNR$ denote the true positive rate and true negative rate respectively. All metrics are calculated at each time instant, creating a time series representation that resembles learning curves. Fig. 8(a)-(c) plot the time series representations of OAC, F-measure and G-mean for all competing methods. As can be seen, AHLFR and HLFR typically provide a significant improvement in F-measure and G-mean while maintaining good OAC when compared to their DDM, EDDM and LFR counterparts, with AHLFR performing slightly better than HLFR. STEPD seems to demonstrate the best overall classification performance on the spam dataset. However, AHLFR and HLFR provide more accurate (or rational) concept drift detections, which best match the cluster assignment results in [12].
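The three metrics can be turned into time series with a running confusion matrix. The following is a minimal sketch under stated assumptions: counts are accumulated from the start of the stream with no forgetting, the minority class is encoded as `positive=1`, and undefined ratios are reported as 0.0. The paper does not specify these details, and the function name `prequential_metrics` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def prequential_metrics(y_true: Sequence[int], y_pred: Sequence[int],
                        positive: int = 1) -> List[Tuple[float, float, float]]:
    """Return (OAC, F-measure, G-mean) after each observation."""
    tp = fp = tn = fn = 0
    series = []
    for yt, yp in zip(y_true, y_pred):
        if yt == positive and yp == positive:
            tp += 1
        elif yt == positive:
            fn += 1
        elif yp == positive:
            fp += 1
        else:
            tn += 1
        oac = (tp + tn) / (tp + tn + fp + fn)
        # 2TP / (2TP + FP + FN) equals 2PR / (P + R) for precision P, recall R.
        f = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
        tpr = tp / (tp + fn) if tp + fn else 0.0
        tnr = tn / (tn + fp) if tn + fp else 0.0
        series.append((oac, f, math.sqrt(tpr * tnr)))
    return series
```

A classifier that always predicts the majority class illustrates why OAC alone is inadequate: on labels `[0, 0, 0, 1]` with predictions `[0, 0, 0, 0]` the final OAC is 0.75 while both F-measure and G-mean are 0.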
Case study on weather dataset. We then evaluate the performance of the different methods on the weather dataset. Because the ground-truth drift point locations are not available, we only report the concept drift adaptation comparison results. Fig. 7(b) plots the KPS prequential representations. As can be seen, AHLFR performs (or updates) best in the majority of time segments. Fig. 8(d)-(f) plot the corresponding OAC, F-measure and G-mean time series representations for all competing algorithms. Although no adaptation (i.e., using the initially trained classifier for prequential classification without any classifier update) enjoys an overwhelming advantage in OAC compared to DDM, EDDM, LFR and STEPD, this advantage is not meaningful, as the corresponding F-measure and G-mean tend to zero as time evolves. This suggests that if no adaptation is adopted, the initial classifier gradually identifies all remaining data as belonging to the majority class, i.e., no-rain days, which is not realistic. AHLFR and HLFR achieve OAC values close to those of the non-adaptive classifier, yet show significant improvements in F-measure and G-mean. Again, AHLFR performs slightly better than HLFR.
From these two real applications, we can summarize some key observations:
1) The given data has severe concept drifts, as the classification performance of no adaptation deteriorates dramatically.
2) The adaptive training will not affect the performance of concept drift detection, as the concept drift detection results given by HLFR and AHLFR are almost the same (see Fig. 6). This argument is further empirically validated and elucidated in the next subsection.
3) AHLFR and HLFR consistently produce the best overall performance in terms of OAC, F-measure, G-mean and the rationality of the drifts detected. For real data, AHLFR only performs slightly better than HLFR. This is because the temporal relatedness between consecutive concepts in real-world data is weak, or the concept changes gradually and slowly, such that simply transferring previous knowledge to the current domain (or concept) cannot significantly improve the generalization capacity of the new classifier. Therefore, adaptive training has great potential, but it deserves further investigation and improvement.
4) There is still plenty of room for performance improvement in incremental learning under concept drift in nonstationary environments, as the OAC, F-measure and G-mean values are far from optimal. In fact, even with state-of-the-art methods that focus only on automatically adapting classifier behavior (or parameters) to stay up to date with the streaming data dynamics, the OAC only reaches approximately in [12, 56] for the spam data and in [53, 14] for the weather data, let alone the relatively lower F-measure and G-mean values.
5) An ensemble of classifiers seems to be a promising direction for future work. However, most existing ensemble-learning-based methods (e.g., [53, 14]) are developed for batch-incremental data [39], which is not suitable for a fully online setting, where samples are provided one by one in a sequential manner [4].
IV-C Benefits of adaptive learning
In this section, we demonstrate, via concept drift adaptation on the USENET1 and Checkerboard datasets, that the superiority of adaptive SVM for concept drift adaptation is not limited to the HLFR framework. To this end, we consider the performance of integrating adaptive SVM into the DDM, EDDM, DDM-OCI, STEPD and LFR frameworks. We term these combinations ADDM, AEDDM, ADDM-OCI, ASTEPD and ALFR, respectively.
In Fig. 9, we plot the Precision and Recall curves of HLFR, LFR, DDM, EDDM, DDM-OCI, STEPD, AHLFR, ALFR, ADDM, AEDDM, ADDM-OCI and ASTEPD on USENET1 and Checkerboard, respectively. For clearer visualization, we separate the competing algorithms into two groups: group I includes HLFR, AHLFR, LFR, ALFR, STEPD and ASTEPD, as they always perform better than their counterparts, while group II contains DDM, ADDM, EDDM, AEDDM, DDM-OCI and ADDM-OCI. In each subfigure, the dashed line represents the baseline algorithm without adaptive training (e.g., HLFR), while the solid line denotes its adaptive version (e.g., AHLFR). For each baseline algorithm, its adaptive version is marked with the same color for comparison. The adaptive training does not noticeably affect the performance of concept drift detection. (Admittedly, there is a performance gap for DDM or STEPD; the difference is, however, data-dependent. For example, DDM seems to be better than ADDM on the Checkerboard dataset, but this advantage does not hold on USENET1.) This is because the drift is determined by keeping track of “significant” changes in classification performance, rather than the specific performance measurement itself.
In Fig. 10 and Fig. 11, we plot the time series representations of OAC, F-measure and G-mean on these two datasets over Monte Carlo simulations. The shading enveloping each curve in the figures represents percent confidence interval. In each subfigure, the red dashed (or blue solid) line represents mean values for the drift detection algorithm with (or without) the adaptive training scheme, while the red (or blue) shaded envelope represents the corresponding confidence intervals. For almost all the competing algorithms, the adaptive versions achieve much better classification results than their non-adaptive counterparts. This performance boost begins with the first concept drift adaptation and grows gradually with an increasing number of adaptations. As seen, AHLFR and ALFR achieve more compelling learning performance than ADDM, AEDDM, ADDM-OCI and ASTEPD. (The comparable performance of ADDM on the Checkerboard dataset results from more frequent adaptations, which is however unreasonable as the adaptation alarms are false alarms. This also coincides with the quantitative analysis results of concept drift detection shown in Fig. 9.) These results empirically validate the potential and superiority of using adaptive classifier techniques for concept drift adaptation, instead of the retraining strategy adopted in previous work. It is also worth noting that the adaptive classifier is not limited to soft-margin SVM. In fact, adaptive logistic regression [62], adaptive single-layer perceptron [63] and adaptive decision tree [64] frameworks have all been developed in recent years with the advance of statistical machine learning. We leave investigations of concept drift adaptation using other adaptive classifiers as future work.
IV-D On the computational complexity analysis of concept drift detection
Having demonstrated the benefits and effectiveness of the HHT framework, this section discusses the computational complexity of the aforementioned concept drift detectors, particularly the additional computational cost incurred by incorporating the Layer-II test. DDM, EDDM, DDM-OCI, STEPD and LFR have a constant time complexity ($O(1)$) at each time point, as all of them follow a single-layer-based hypothesis testing framework that monitors one or four error-related statistics [17]. The computational complexity of generating the bound tables used by LFR or HLFR to determine the warning and detection bounds with respect to different rate values is , where is the number of Monte Carlo simulations used. However, since the bound tables can be computed offline, the time complexity of looking up the bound table values once is given (see line and of Algorithm 2) remains $O(1)$. Due to the introduction of the Layer-II test, HLFR is more computationally expensive than the other single-layer-based methods. This is because HLFR requires training classifiers ( in this work) to validate the occurrence of a potential concept drift time point. (HLFR has the same computational complexity as LFR if the Layer-I test does not reject the null hypothesis at the tested time point.) Suppose the computational complexity of training a new classifier is ; then the total computational complexity of HLFR at a suspected time point is .
Despite this limitation, the HHT framework introduces a new perspective to the field of concept drift detection, especially considering its advantages in detection precision and detection delay. Finally, it should be noted that the permutations in the Layer-II test can be run in parallel, as the classifiers trained are independent across different permutations.
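To make the cost structure concrete, the sketch below runs a generic permutation test around a suspected drift point, with one classifier training per permutation dispatched to a thread pool. It is an illustration of the idea, not the paper's Layer-II algorithm: the nearest-centroid classifier stands in for the SVM, and the function names, the p-value formula (1 + count) / (1 + n_perm), and the defaults `n_perm=200` and `alpha=0.05` are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def centroid_loss(X_train, y_train, X_test, y_test):
    """Train a nearest-centroid classifier (stand-in) and return zero-one loss."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    pred = classes[dists.argmin(axis=1)]
    return float(np.mean(pred != y_test))


def permuted_loss(X, y, split, seed):
    """Shuffle the window, then train on the first part and test on the rest."""
    idx = np.random.default_rng(seed).permutation(len(y))
    Xp, yp = X[idx], y[idx]
    return centroid_loss(Xp[:split], yp[:split], Xp[split:], yp[split:])


def layer2_permutation_test(X, y, split, n_perm=200, alpha=0.05, workers=4):
    """Confirm a suspected drift at index `split` inside the window (X, y).

    The ordered loss (train before the split, test after it) is compared with
    losses obtained on shuffled copies of the window. Each permutation uses
    its own seed and its own classifier, so the runs are independent and can
    be executed concurrently.
    """
    ordered = centroid_loss(X[:split], y[:split], X[split:], y[split:])
    with ThreadPoolExecutor(max_workers=workers) as pool:
        losses = list(pool.map(lambda s: permuted_loss(X, y, split, s),
                               range(n_perm)))
    p_value = (1 + sum(l >= ordered for l in losses)) / (1 + n_perm)
    return p_value <= alpha, p_value
```

The total work is one training for the ordered split plus one per permutation, which matches the observation that the extra cost is paid only at suspected time points. For classifiers that hold the interpreter lock during training, a process pool would be the more natural choice than threads.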
V Conclusions
This paper proposed a novel concept drift detector, namely Hierarchical Linear Four Rates (HLFR), under the hierarchical hypothesis testing (HHT) framework. Unlike previous work, HLFR is able to detect all possible variants of concept drifts regardless of data characteristics, and it is independent of the underlying classifier. Using adaptive SVM as its base classifier, HLFR can be easily extended to a concept drift-agnostic framework, i.e., AHLFR. The performance of HLFR and AHLFR in detecting and adapting to concept drifts is compared to state-of-the-art methods using both simulated and real-world datasets that span the gamut of concept drift types (recurrent or irregular, gradual or abrupt, etc.) and data distributions (balanced or imbalanced labels). Experimental results corroborate our theoretical analysis of the Type-I and Type-II errors of HLFR, and also demonstrate that our methods can significantly outperform competing methods in terms of early detection of concept drift, detection precision, and adaptability across different concepts. Two real examples on email filtering and weather prediction are finally presented to illustrate the effectiveness and potential of our methods.
In the future, we will extend HLFR and AHLFR to the multi-class classification scenario. One possible solution is to use the one-vs-all strategy to convert the $K$-class classification problem into $K$ binary-class classification problems. Since the four rates associated with each binary-class classification are still geometrically weighted sums of Bernoulli random variables, HLFR and AHLFR might be applicable in a straightforward manner. Additionally, we are interested in investigating the performance of more sensitive metrics, from an information theoretic learning (ITL) perspective [65], to monitor the streaming environment. Finally, we will continue designing more powerful tests under the HHT framework for industrial-level noisy data.
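The one-vs-all idea can be sketched as follows: each class gets its own binary problem and its own set of four rates (TPR, TNR, PPV, NPV), each updated as a geometrically weighted sum of correctness indicators. The update rule below follows our reading of LFR [17] and should be treated as an assumption, as should the function names, the time decay `eta=0.9` and the initial value `init=0.5`.

```python
from typing import Dict, Hashable, Sequence

RATES = ("tpr", "tnr", "ppv", "npv")


def update_four_rates(rates: Dict[str, float], y: int, y_hat: int,
                      eta: float = 0.9) -> Dict[str, float]:
    """Geometrically weighted update for one binary observation.

    A rate is touched only if the pair (y, y_hat) influences it: TPR when
    y == 1, TNR when y == 0, PPV when y_hat == 1, NPV when y_hat == 0.
    """
    hit = 1.0 if y == y_hat else 0.0
    affected = {"tpr": y == 1, "tnr": y == 0,
                "ppv": y_hat == 1, "npv": y_hat == 0}
    return {k: eta * rates[k] + (1.0 - eta) * hit if affected[k] else rates[k]
            for k in RATES}


def one_vs_all_rates(y_true: Sequence[Hashable], y_pred: Sequence[Hashable],
                     classes: Sequence[Hashable], eta: float = 0.9,
                     init: float = 0.5) -> Dict[Hashable, Dict[str, float]]:
    """Track the four rates of every one-vs-all binary problem."""
    state = {c: dict.fromkeys(RATES, init) for c in classes}
    for yt, yp in zip(y_true, y_pred):
        for c in classes:
            # Binarize: class c is the positive class, all others negative.
            state[c] = update_four_rates(state[c], int(yt == c), int(yp == c), eta)
    return state
```

Each per-class rate sequence could then be monitored by the same hypothesis tests as in the binary case; how to combine the resulting alarms across classes is left open here.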
References
 [1] K. Slavakis, S.-J. Kim, G. Mateos, and G. B. Giannakis, “Stochastic approximation vis-à-vis online learning for big data analytics [lecture notes],” IEEE Signal Processing Magazine, vol. 31, no. 6, pp. 124–129, 2014.
 [2] H. Hu, Y. Wen, T.-S. Chua, and X. Li, “Toward scalable systems for big data analytics: A technology tutorial,” IEEE Access, vol. 2, pp. 652–687, 2014.
 [3] M. Basseville, I. V. Nikiforov et al., Detection of abrupt changes: theory and application. Prentice Hall Englewood Cliffs, 1993, vol. 104.
 [4] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, and A. Bouchachia, “A survey on concept drift adaptation,” ACM Computing Surveys (CSUR), vol. 46, no. 4, p. 44, 2014.
 [5] S. Wang, L. L. Minku, and X. Yao, “A systematic study of online class imbalance learning with concept drift,” arXiv preprint arXiv:1703.06683, 2017.
 [6] G. J. Ross, N. M. Adams, D. K. Tasoulis, and D. J. Hand, “Exponentially weighted moving average charts for detecting concept drift,” Pattern Recognition Letters, vol. 33, no. 2, pp. 191–198, 2012.
 [7] G. Widmer and M. Kubat, “Learning in the presence of concept drift and hidden contexts,” Machine learning, vol. 23, no. 1, pp. 69–101, 1996.
 [8] R. Klinkenberg, “Learning drifting concepts: Example selection vs. example weighting,” Intelligent Data Analysis, vol. 8, no. 3, pp. 281–300, 2004.
 [9] A. Bifet and R. Gavalda, “Learning from time-changing data with adaptive windowing,” in Proceedings of the 2007 SIAM International Conference on Data Mining. SIAM, 2007, pp. 443–448.
 [10] L. Du, Q. Song, and X. Jia, “Detecting concept drift: An information entropy based method using an adaptive sliding window,” Intelligent Data Analysis, vol. 18, no. 3, pp. 337–364, 2014.
 [11] W. N. Street and Y. Kim, “A streaming ensemble algorithm (SEA) for large-scale classification,” in Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2001, pp. 377–382.
 [12] I. Katakis, G. Tsoumakas, and I. Vlahavas, “Tracking recurring contexts using ensemble classifiers: an application to email filtering,” Knowledge and Information Systems, vol. 22, no. 3, pp. 371–391, 2010.
 [13] I. Katakis, G. Tsoumakas, and I. P. Vlahavas, “An ensemble of classifiers for coping with recurring contexts in data streams.” in ECAI, 2008, pp. 763–764.

 [14] R. Elwell and R. Polikar, “Incremental learning of concept drift in nonstationary environments,” IEEE Transactions on Neural Networks, vol. 22, no. 10, pp. 1517–1531, 2011.
 [15] J. Gama, P. Medas, G. Castillo, and P. Rodrigues, “Learning with drift detection,” in Brazilian Symposium on Artificial Intelligence. Springer, 2004, pp. 286–295.
 [16] S. Wang, L. L. Minku, D. Ghezzi, D. Caltabiano, P. Tino, and X. Yao, “Concept drift detection for online class imbalance learning,” in Neural Networks (IJCNN), The 2013 International Joint Conference on. IEEE, 2013, pp. 1–10.
 [17] H. Wang and Z. Abraham, “Concept drift detection for streaming data,” in 2015 International Joint Conference on Neural Networks (IJCNN). IEEE, 2015, pp. 1–9.
 [18] D. K. Antwi, H. L. Viktor, and N. Japkowicz, “The perfsim algorithm for concept drift detection in imbalanced data,” in Data Mining Workshops (ICDMW), 2012 IEEE 12th International Conference on. IEEE, 2012, pp. 619–628.
 [19] C. Alippi, G. Boracchi, and M. Roveri, “A hierarchical, nonparametric, sequential change-detection test,” in Neural Networks (IJCNN), The 2011 International Joint Conference on. IEEE, 2011, pp. 2889–2896.
 [20] ——, “Hierarchical change-detection tests,” IEEE transactions on neural networks and learning systems, vol. 28, no. 2, pp. 246–258, 2017.
 [21] ——, “Just-in-time classifiers for recurrent concepts,” IEEE transactions on neural networks and learning systems, vol. 24, no. 4, pp. 620–634, 2013.
 [22] S. Yu and Z. Abraham, “Concept drift detection with hierarchical hypothesis testing,” in Proceedings of the 2017 SIAM International Conference on Data Mining. SIAM, 2017, pp. 768–776.
 [23] L. L. Minku, A. P. White, and X. Yao, “The impact of diversity on online ensemble learning in the presence of concept drift,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 5, pp. 730–742, 2010.
 [24] L. L. Minku and X. Yao, “Ddd: A new ensemble approach for dealing with concept drift,” IEEE transactions on knowledge and data engineering, vol. 24, no. 4, pp. 619–633, 2012.
 [25] Y. Sun, K. Tang, Z. Zhu, and X. Yao, “Concept drift adaptation by exploiting historical knowledge,” IEEE Transactions on Neural Networks and Learning Systems, 2018.
 [26] P. Domingos and G. Hulten, “A general framework for mining massive data streams,” Journal of Computational and Graphical Statistics, vol. 12, no. 4, pp. 945–949, 2003.
 [27] B. Krawczyk, L. L. Minku, J. Gama, J. Stefanowski, and M. Woźniak, “Ensemble learning for data stream analysis: a survey,” Information Fusion, vol. 37, pp. 132–156, 2017.
 [28] M. G. Kelly, D. J. Hand, and N. M. Adams, “The impact of changing populations on classifier performance,” in Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 1999, pp. 367–371.
 [29] G. Widmer and M. Kubat, “Effective learning in dynamic environments by explicit context tracking,” in Machine learning: ECML-93. Springer, 1993, pp. 227–243.
 [30] M. Harel, S. Mannor, R. El-Yaniv, and K. Crammer, “Concept drift detection through resampling.” in ICML, 2014, pp. 1009–1017.
 [31] I. W. Sandberg, J. T. Lo, C. L. Fancourt, J. C. Principe, S. Katagiri, and S. Haykin, Nonlinear dynamical systems: feedforward neural network perspectives. John Wiley & Sons, 2001, vol. 21.
 [32] E. Brodsky and B. S. Darkhovsky, Nonparametric methods in change point problems. Springer Science & Business Media, 2013, vol. 243.
 [33] J. Chen and A. K. Gupta, Parametric statistical change point analysis: with applications to genetics, medicine, and finance. Springer Science & Business Media, 2011.
 [34] T. S. Sethi and M. Kantardzic, “On the reliable detection of concept drift from streaming unlabeled data,” Expert Systems with Applications, vol. 82, pp. 77–99, 2017.
 [35] V. Vapnik, “Principles of risk minimization for learning theory,” in NIPS, 1991, pp. 831–838.
 [36] V. M. Souza, D. F. Silva, J. Gama, and G. E. Batista, “Data stream classification guided by clustering on nonstationary environments and extreme verification latency,” in Proceedings of the 2015 SIAM International Conference on Data Mining. SIAM, 2015, pp. 873–881.
 [37] M. Baena-García, J. del Campo-Ávila, R. Fidalgo, A. Bifet, R. Gavalda, and R. Morales-Bueno, “Early drift detection method,” in Fourth international workshop on knowledge discovery from data streams, vol. 6, 2006, pp. 77–86.
 [38] K. Nishida and K. Yamauchi, “Detecting concept drift using statistical testing,” in International conference on discovery science. Springer, 2007, pp. 264–269.
 [39] J. Read, A. Bifet, B. Pfahringer, and G. Holmes, “Batch-incremental versus instance-incremental learning in dynamic and evolving data,” Advances in Intelligent Data Analysis XI, pp. 313–323, 2012.
 [40] I. Frías-Blanco, J. del Campo-Ávila, G. Ramos-Jiménez, R. Morales-Bueno, A. Ortiz-Díaz, and Y. Caballero-Mota, “Online and non-parametric drift detection methods based on Hoeffding's bounds,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 3, pp. 810–823, 2015.
 [41] P. M. Gonçalves, S. G. de Carvalho Santos, R. S. Barros, and D. C. Vieira, “A comparative study on concept drift detectors,” Expert Systems with Applications, vol. 41, no. 18, pp. 8144–8156, 2014.
 [42] J. C. Principe and R. Chalasani, “Cognitive architectures for sensory processing,” Proceedings of the IEEE, vol. 102, no. 4, pp. 514–525, 2014.
 [43] C. Alippi, G. Boracchi, and M. Roveri, “Change detection tests using the ici rule,” in Neural Networks (IJCNN), The 2010 International Joint Conference on. IEEE, 2010, pp. 1–7.
 [44] C. W. Helstrom, Statistical Theory of Signal Detection. New York, NY, USA: Pergamon Press, 1968.
 [45] D. Siegmund, Sequential Analysis: Tests and Confidence Intervals. Springer, New York, 1985.
 [46] L. Du, Q. Song, L. Zhu, and X. Zhu, “A selective detector ensemble for concept drift detection,” Computer Journal, no. 3, 2015.
 [47] B. I. F. Maciel, S. G. T. C. Santos, and R. S. M. Barros, “A lightweight concept drift detection ensemble,” in Tools with Artificial Intelligence (ICTAI), 2015 IEEE 27th International Conference on. IEEE, 2015, pp. 1061–1068.
 [48] S. Wang, L. L. Minku, and X. Yao, “A learning framework for online class imbalance learning,” in Computational Intelligence and Ensemble Learning (CIEL), 2013 IEEE Symposium on. IEEE, 2013, pp. 36–45.
 [49] P. Good, Permutation tests: a practical guide to resampling methods for testing hypotheses. Springer Science & Business Media, 2013.
 [50] M. Woźniak, P. Ksieniewicz, B. Cyganek, and K. Walkowiak, “Ensembles of heterogeneous concept drift detectors-experimental study,” in IFIP International Conference on Computer Information Systems and Industrial Management. Springer, 2016, pp. 538–549.
 [51] D. Bhati, P. Kgosi, and R. N. Rattihalli, “Distribution of geometrically weighted sum of bernoulli random variables,” Applied Mathematics, vol. 2, no. 11, p. 1382, 2011.
 [52] O. Bousquet and A. Elisseeff, “Stability and generalization,” Journal of Machine Learning Research, vol. 2, no. Mar, pp. 499–526, 2002.
 [53] G. Ditzler and R. Polikar, “Incremental learning of concept drift from streaming imbalanced data,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 10, pp. 2283–2301, 2013.
 [54] J. Yang, R. Yan, and A. G. Hauptmann, “Cross-domain video concept detection using adaptive svms,” in Proceedings of the 15th ACM international conference on Multimedia. ACM, 2007, pp. 188–197.
 [55] ——, “Adapting svm classifiers to data with shifted distributions,” in Seventh IEEE International Conference on Data Mining Workshops (ICDMW 2007). IEEE, 2007, pp. 69–76.

 [56] I. Katakis, G. Tsoumakas, and I. Vlahavas, “Dynamic feature space and incremental feature selection for the classification of textual data streams,” Knowledge Discovery from Data Streams, pp. 107–116, 2006.
 [57] I. Zliobaite, “How good is the electricity benchmark for evaluating concept drift adaptation,” arXiv preprint arXiv:1301.3524, 2013.
 [58] I. Žliobaitė, A. Bifet, J. Read, B. Pfahringer, and G. Holmes, “Evaluation methods and decision theory for classification of streaming data with temporal dependence,” Machine Learning, vol. 98, no. 3, pp. 455–482, 2015.
 [59] J. Gama, R. Sebastião, and P. P. Rodrigues, “On evaluating stream learning algorithms,” Machine learning, vol. 90, no. 3, pp. 317–346, 2013.
 [60] C. J. V. Rijsbergen, Information Retrieval. Butterworth-Heinemann, 1979.
 [61] M. Kubat, R. Holte, and S. Matwin, “Learning when negative examples abound,” in European Conference on Machine Learning. Springer, 1997, pp. 146–153.
 [62] C. Anagnostopoulos, D. K. Tasoulis, N. M. Adams, and D. J. Hand, “Temporally adaptive estimation of logistic classifiers on data streams,” Advances in data analysis and classification, vol. 3, no. 3, pp. 243–261, 2009.
 [63] N. G. Pavlidis, D. K. Tasoulis, N. M. Adams, and D. J. Hand, “λ-perceptron: An adaptive classifier for data streams,” Pattern Recognition, vol. 44, no. 1, pp. 78–96, 2011.
 [64] C. Alippi and M. Roveri, “Just-in-time adaptive classifiers in nonstationary conditions,” in 2007 International Joint Conference on Neural Networks. IEEE, 2007, pp. 1014–1019.
 [65] J. C. Principe, Information theoretic learning: Renyi’s entropy and kernel perspectives. Springer Science & Business Media, 2010.