Concept Drift Detection and Adaptation with Hierarchical Hypothesis Testing

07/25/2017 · Shujian Yu et al. · Bosch, University of Florida, Machine Zone

In a streaming environment, there is often a need for statistical prediction models to detect and adapt to concept drifts (i.e., changes in the joint distribution between predictor and response variables) so as to mitigate deteriorating predictive performance over time. Various concept drift detection approaches have been proposed in the past decades. However, they do not perform well across different concept drift types (e.g., gradual or abrupt, recurrent or irregular) and different data stream distributions (e.g., balanced and imbalanced labels). This paper presents a novel framework that can detect and also adapt to the various concept drift types, even in the presence of imbalanced data labels. The framework leverages a hierarchical set of hypothesis tests in an online fashion to detect concept drifts and employs an adaptive training strategy to significantly boost its adaptation capability. The performance of the proposed framework is compared to benchmark approaches using both simulated and real-world datasets spanning the breadth of concept drift types. The proposed approach significantly outperforms benchmark solutions in terms of precision, delay of detection as well as the adaptability across different concepts.


I Introduction

With the exponential growth of data, it becomes increasingly challenging to design and implement effective techniques for analyzing and detecting changes in a streaming environment [1, 2]. As a result, early approaches for detecting statistical changes in a time series (such as change point detectors), have had to be extended for online detection of changes in multivariate data streams [3]. Some of these techniques for detecting intrinsic changes in the relationship of the incoming data have been successfully applied to various real-world applications, such as email filtering, network traffic analysis and user preference prediction [4, 5].

Online classification is another common task performed on multivariate streaming data, one that takes advantage of these statistical relationships to predict a class label at each time index [6]. If the underlying source (or joint data distribution) that generates the data is not stationary, the optimal decision rule for the classifier changes over time, a phenomenon known as concept drift [7]. Given the impact of concept drift on the predictive performance of an online classifier, there is often a need to detect these concept drifts as early as possible. The inability of change point detectors to detect concept drifts has motivated the need for concept drift detectors that monitor not only the joint distribution of a multivariate data stream but also changes in its relationship to the class labels of the streaming data.

There are two different approaches to addressing concept drifts in streaming data [6]. The first automatically adapts the parameters of a statistical model in an incremental fashion [8, 9, 10] or employs an ensemble of classifiers, trained on different windows over the stream, to give the optimal decision [11, 12, 13, 14]. These methods do not explicitly detect drifts; they simply retrain new classifiers. The second approach integrates a statistical model with a concept drift detector, whose purpose is to signal the need for updating the statistical model once a concept drift is detected. Existing methods in this category monitor the error rate or an error-driven statistic and make a decision based on statistical learning theory [15, 16, 17, 18]. Unlike the first approach, which only mitigates deteriorating classification performance over time, the second approach enables identification of the time instants at which concept drifts occur. The promptness of the alert, i.e., the time between the onset of a drift and its detection, is crucially important in applications like malware detection or network monitoring [4, 5].

In this paper, we take the second approach and present a novel hierarchical hypothesis testing (HHT) framework for concept drift detection and adaptation. This framework is inspired by the hierarchical architecture recently proposed for change point detection [19, 20] (see Section II-A for more discussion on the difference between concept drift detection and change point detection). The presented work intends to bring new perspectives to the field of concept drift detection and adaptation, building on recent advances in hierarchical mechanisms (e.g., [21, 20]), and provides the following contributions. First, we present the Hierarchical Linear Four Rates (HLFR) detector [22], a novel HHT-based concept drift detection method that is applicable to different types of concept drifts (e.g., recurrent or irregular, gradual or abrupt). A detailed analysis of the Type-I and Type-II errors of the proposed HLFR is also performed. Second, we present an adaptive training approach, used instead of the commonly adopted retraining strategy once a drift is confirmed. The motivation is to leverage knowledge from the historical concept (rather than discard this information, as in the retraining strategy) to enhance classification performance in the new concept. We term this improvement adaptive HLFR (A-HLFR). Admittedly, leveraging previous knowledge to boost classification performance is not novel in the streaming classification scenario. However, to the best of our knowledge, previous work either follows the first approach, which does not explicitly identify the timestamps or types of drifts (e.g., [23, 24, 25]), or relies heavily on previously stored samples (e.g., [21]), which contradicts the single pass criterion (a sample from the data stream should be discarded rather than stored in memory once it has been processed [6, 26]). From this perspective, we are among the first to investigate feasible solutions for performing "knowledge transfer" without losing intrinsic drift detection capability or requiring the storage of previous samples. Third, we carry out comprehensive experiments to investigate the benefits of HLFR (in detection) and A-HLFR (in detection and adaptation), and validate the advantage of the adaptive training strategy.

The rest of the paper is organized as follows. In Section II, we formulate the concept drift problem and briefly review related work. In Section III, we present HLFR and elaborate on the Layer-I and Layer-II tests employed. This section also includes the derivation of the Type-I and Type-II errors associated with HLFR. Additionally, we present A-HLFR, which not only detects drifts but also adapts the classifier to handle them. In Section IV, experiments are presented and discussed. Finally, we present the conclusion in Section V.

II Previous approaches

II-A Problem Formulation

Given a continuous stream of labeled samples (X_t, y_t), t = 1, 2, …, a classifier f can be learned so that f(X_t) predicts y_t. Here, X_t is a d-dimensional feature vector in a predefined vector space and y_t is a binary class label (this paper only considers binary classification). At every time instant t, we split the samples into a set containing the most recent samples and a set containing the examples that appeared prior to those. A concept drift occurs when the joint distribution P(X, y) that generates the recent samples differs from the one that generated the earlier samples [7, 4, 27]. From a Bayesian perspective, concept drift can manifest in two fundamental forms of change [28]: 1) a change in the posterior probability P(y|X); and 2) a change in the marginal probability P(X) or P(y). Existing studies tend to prioritize detecting posterior distribution change [5], also known as real concept drift [29], because it directly alters the optimal decision rule.

A closely related problem to concept drift detection is classical change point detection, which has been well studied both theoretically and practically. Unlike concept drift detectors, change point detectors are targeted at detecting changes in the generating distribution of the streaming data, i.e., P(X) [30]. Standard change point detection methods are typically based on statistical decision theory; some reference books include [3, 31, 32, 33]. Although a change point detector may benefit the performance of a concept drift detector, purely modeling P(X) is insufficient to solve the problem of concept drift detection [34]. An intuitive example is shown in Fig. 1, in which P(X) remains unchanged while the class labels change. On the other hand, it remains a big challenge to detect any type of distributional change, especially for multivariate or high-dimensional data [30, 17]. For these reasons, instead of settling for the intermediate solution of change point detection, we solve the problem by monitoring "significant" drifts in the prediction risk of the underlying predictor, based on the risk minimization principle [35].

(a) Data distribution in concept I
(b) Data distribution in concept II
Fig. 1: The limitations of a change point detector for concept drift detection. (a) and (b) show the feature distribution in the 2-D plane for two consecutive concepts (selected from the Two Classes Rotating (2CR) dataset [36]), where the "red rectangle" denotes one class and the "blue triangle" represents the other. There is no distribution change in P(X) or P(y) (because the labels are balanced). The only factor that evolves over time is P(y|X), the optimal decision rule (see the black dashed line).
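To make Fig. 1's point concrete, the following short Python sketch builds a toy stream in the same spirit (not the actual 2CR data): the feature distribution P(X) is held fixed while only the labeling rule rotates between the two concepts, so a detector that monitors P(X) alone sees no change. All names and values are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def sample_concept(n, angle_deg, rng):
    # Draw n points from a fixed feature distribution P(X) (uniform on a square)
    # and label them with a linear boundary rotated by angle_deg.
    # Only P(y|X) depends on the angle; P(X) never changes.
    X = rng.uniform(-1.0, 1.0, size=(n, 2))
    theta = np.deg2rad(angle_deg)
    normal = np.array([np.cos(theta), np.sin(theta)])   # normal vector of the boundary
    y = np.where(X @ normal >= 0, 1, -1)
    return X, y

# Concept I and Concept II share the same P(X); only the boundary rotates.
X1, y1 = sample_concept(5000, angle_deg=0, rng=rng)
X2, y2 = sample_concept(5000, angle_deg=90, rng=rng)

print("feature means, concept I :", X1.mean(axis=0).round(2))
print("feature means, concept II:", X2.mean(axis=0).round(2))
# Fraction of points whose label would differ between the two decision rules:
old_rule = np.where(X2 @ np.array([1.0, 0.0]) >= 0, 1, -1)
print("labels that change between concepts:", np.mean(old_rule != y2).round(2))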

II-B Benchmarking concept drift detection approaches

An extensive review of learning under concept drift is beyond the scope of this paper; we refer interested readers to recently published surveys [4, 5, 27] for classical methods and recent progress. In this section, we only review the previous work most relevant to the presented method, i.e., concept drift detection approaches.

The method that renewed attention to this problem was the Drift Detection Method (DDM) [15]. DDM monitors the sum of the overall classification error rate and its empirical standard deviation. Despite its simplicity, DDM fails to detect real drift points unless the sum of the Type-I and Type-II errors changes. The Early Drift Detection Method (EDDM) [37], on the other hand, monitors the distance between two consecutive classification errors. EDDM performs better than DDM, especially in the scenario of slow gradual changes. However, it requires waiting for a minimum number of classification errors before calculating the monitoring statistic at each time instant, an impractical condition for imbalanced data. A third error-based method, STEPD [38], applies a test of equal proportions to compare the classification accuracy in a recent window with the historical classification accuracy excluding this recent window.

Following this early work, several new methods have been proposed to improve DDM from different perspectives. The Drift Detection Method for Online Class Imbalance (DDM-OCI) [16] deals with imbalanced data. Unfortunately, DDM-OCI is prone to triggering many false positives due to an inherent weakness in the model: its test statistic, a time-decayed estimator of the minority-class recall [16], does not follow its assumed distribution under the null hypothesis [17]. PerfSim [18] also deals with imbalanced data. Different from DDM-OCI, PerfSim tracks the cosine similarity of the four entries of the confusion matrix to determine the occurrence of a concept drift. However, the threshold used to declare a concept drift is user-specified. Moreover, PerfSim assumes the data arrive in a batch-incremental manner [39], which makes it impractical in real applications, especially when decisions must be made instantly. Other related work includes the Exponentially Weighted Moving Average (EWMA) chart for concept drift detection (ECDD) [6] and the Drift Detection Method based on Hoeffding's inequality (HDDM) [40]. An experimental comparative study is available in [41].

II-C Hierarchical architectures for change-point or concept drift detection

Hierarchical architectures have been extensively studied in the machine learning community over the last decades. One of the most recent examples is the Deep Predictive Coding Network (DPCN) [42], a neural-inspired hierarchical generative model that is effective at modeling sensory data.

However, hierarchical architectures for change point (or concept drift) detection have seldom been investigated. The first hierarchical change-point test (HCDT) was proposed in [19], based on the Intersection of Confidence Intervals (ICI) rule [43]. It was later generalized by incorporating a broader methodology for designing HCDTs [20]. However, as a change point detector, HCDT has the intrinsic limitations emphasized in Section II-A. Although it can be modified for concept drift detection by tracking the classification error under a Bernoulli distribution assumption, a univariate indicator (or statistic) is insufficient to provide accurate concept drift detection [17], especially when the classifier becomes unstable. Moreover, we have already proved that the statistics derived in Layer-I are geometrically weighted sums of Bernoulli random variables [17], rather than simply following a Bernoulli distribution in the usual sense.

This work is motivated by [20]. However, in order to make the designed algorithm well suited for broader classes of concept drift detection (rather than change point detection) without losing accuracy or proper classifier adaptation, we propose HLFR, a novel hierarchical architecture (together with two novel testing methods, one in each layer) for concept drift detection that is applicable to different concept drift types and data stream distributions (e.g., balanced or imbalanced labels). Moreover, we present an adaptive training approach to replace the retraining scheme commonly employed once a drift is confirmed. The proposed adaptation approach is not limited to a single concept drift type and strictly follows the single pass criterion, requiring no historical data. Results show that the proposed approach captures more information from the data than previous work.

III Hierarchical Linear Four Rates (HLFR)

Fig. 2: The proposed hierarchical hypothesis testing (HHT) framework for concept drift detection and adaptation.

This section presents a novel hierarchical hypothesis testing (HHT) framework for concept drift detection and adaptation. As shown in Fig. 2, HHT features two layers of hypothesis tests. The Layer-I test is executed online. Once it detects a potential drift, the Layer-II test is activated to confirm (or deny) the validity of the suspected drift. Depending on the decision of the Layer-II test, HHT reconfigures or restarts the Layer-I test accordingly. A new concept drift detector, namely Hierarchical Linear Four Rates (HLFR), is developed under the HHT framework. HLFR implements a sequential hypothesis test [44, 45], and the two layers cooperate closely to jointly improve online classification capability. HLFR is summarized in Algorithm 1.

HLFR does not make use of any intrinsic property of, or impose any assumption on, the underlying classifier. This modular property enables HLFR to be easily deployed with any classifier (support vector machine (SVM), k-nearest neighbors (KNN), etc.). It is worth noting that an ensemble of detectors [46, 47] may appear to share similarities with the proposed HHT framework. However, the two architectures differ significantly in how the different hypothesis tests are organized. In HHT, the Layer-II test is only activated when the Layer-I test detects a suspected drift point (i.e., Layer-II is an auxiliary validation module for Layer-I in the hierarchical architecture), whereas an ensemble of detectors conducts the different tests in parallel (i.e., each test is performed independently and synchronously with no priority, and the final decision is made by a voting scheme). To further illustrate the differences, a rigorous analysis of the Type-I and Type-II errors of our HHT framework and of the ensemble of detectors is presented in Section III-C.

Input: Data stream (X_t, y_t); initially trained classifier f.
Output: Concept drift time points.
for each time instant t do
    Perform the Layer-I hypothesis test.
    if Layer-I detects a potential drift point then
        Perform the Layer-II hypothesis test on the suspected point.
        if Layer-II confirms the suspected point then
            Report it as a concept drift; update f.
        else
            Discard the suspected point; restart the Layer-I test.
        end if
    end if
end for
Algorithm 1 Hierarchical Linear Four Rates (HLFR)
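For readers who prefer code to pseudocode, the Python sketch below mirrors the control flow of Algorithm 1 under some assumed interfaces (a stateful Layer-I detector with update/reset methods and a Layer-II callback); it is an illustrative outline, not the authors' implementation.

from typing import Callable, Iterable, Tuple

def hlfr_stream(
    stream: Iterable[Tuple[object, int]],
    classifier,                               # assumed to expose .predict(x)
    layer1,                                   # assumed Layer-I detector: .update(y, y_hat) -> int or None, .reset()
    layer2_confirms: Callable[[int], bool],   # assumed Layer-II permutation test on a suspected drift point
) -> list:
    # Minimal sketch of the HLFR outer loop (Algorithm 1): Layer-I runs online
    # on prediction outcomes; whenever it flags a suspected drift point,
    # Layer-II is invoked to confirm or reject it.
    confirmed = []
    for x, y in stream:
        y_hat = classifier.predict(x)
        suspected = layer1.update(y, y_hat)   # returns a suspected time index, or None
        if suspected is not None:
            if layer2_confirms(suspected):
                confirmed.append(suspected)
                # here one would retrain (HLFR) or adapt (A-HLFR) the classifier
            else:
                layer1.reset()                # discard the false alarm and restart Layer-I
    return confirmed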

III-A Layer-I Hypothesis Test

HLFR selects our recently developed Linear Four Rates (LFR) test [17] for its Layer-I test. According to the results in [17], LFR consistently exhibits promising performance in terms of shorter detection delay and higher detection precision compared with other prevalent concept drift detectors. This is not surprising, as LFR simultaneously monitors four rates (or statistics) associated with the confusion matrix, namely the true positive rate (TPR), the true negative rate (TNR), the positive predictive value (PPV) and the negative predictive value (NPV), and can therefore make fuller and more precise use of the error information.

The key idea behind LFR is straightforward: the four rates should remain the same under a stable or stationary concept. Therefore, a significant change in any of them may imply a change in the underlying joint distribution or concept. Specifically, at each time instant, LFR conducts four independent tests, each with the null hypothesis that the corresponding rate equals its value under the current (stable) concept and the alternative hypothesis that the rate has changed.

The concept is considered stable if the null hypotheses hold and is considered to have a potential drift if any null hypothesis is rejected. Intuitively, LFR should be more sensitive to any type of drift, as it keeps track of four rates simultaneously. By contrast, almost all previous methods use a single specific statistic that captures only part of the distributional information: DDM, ECDD and HDDM use the overall error rate, EDDM relies on the average distance between adjacent classification errors, DDM-OCI monitors the minority-class recall, STEPD monitors a ratio of recent accuracy to overall accuracy, whereas PerfSim considers the cosine similarity of the four entries of the confusion matrix.

LFR is summarized in Algorithm 2. In the implementation, LFR replaces each empirical rate with a time-decayed estimate, as employed in [16, 48] (see also the earlier footnote). The decayed estimate is essentially a weighted linear combination of the classifier's current and previous performance. In [17], we proved that this statistic is a geometrically weighted sum of independent and identically distributed (i.i.d.) Bernoulli random variables. Given this property, we are able to obtain the "BoundTable" by Monte-Carlo simulation. Based on these bound values, LFR considers a concept drift likely to occur when any of the four statistics exceeds its warning bound (warn.bd), and sets the warning signal. If any statistic subsequently reaches the corresponding detection bound (detect.bd), the concept drift is affirmed. Interested readers can refer to [17] for more details.

Input: Data stream (X_t, y_t); binary classifier f; time decaying factors for the four rates; warning significance level; detection significance level.
Output: Potential concept drift time points.
Initialize the four decayed rate estimates and the confusion matrix.
for each time instant t do
    Obtain the prediction of f on X_t and update the confusion matrix.
    for each rate (TPR, TNR, PPV, NPV) do
        if the rate is influenced by the current (label, prediction) pair then
            update its decayed estimate with the new prediction outcome;
        else
            keep its decayed estimate unchanged;
        end if
        Obtain the warning bound (warn.bd) and detection bound (detect.bd) for the rate from the precomputed BoundTable.
    end for
    if any decayed estimate exceeds its warn.bd and warn.time is NULL then
        warn.time ← t
    else if no decayed estimate exceeds its warn.bd and warn.time is not NULL then
        warn.time ← NULL
    end if
    if any decayed estimate exceeds its detect.bd then
        detect.time ← t;
        relearn f or wait for sufficient new instances;
        reset the rate estimates and confusion matrix as in the initialization;
        reset warn.time and detect.time.
    end if
end for
Algorithm 2 Linear Four Rates (Layer-I)
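As a rough illustration of the Layer-I bookkeeping, the sketch below maintains time-decayed estimates of the four confusion-matrix rates online; the decay factor and the toy drift are arbitrary, and the warning/detection bounds of the actual algorithm (derived from Monte-Carlo BoundTables) are not reproduced.

import numpy as np

class DecayedRates:
    # Time-decayed estimates of TPR, TNR, PPV and NPV, updated online.
    # The decay factor eta is an illustrative placeholder.
    def __init__(self, eta: float = 0.98):
        self.eta = eta
        self.rates = {"tpr": 0.5, "tnr": 0.5, "ppv": 0.5, "npv": 0.5}

    def update(self, y_true: int, y_pred: int) -> dict:
        # Update only the rates influenced by this (label, prediction) pair,
        # using R <- eta * R + (1 - eta) * 1[prediction correct].
        correct = float(y_true == y_pred)
        touched = ["tpr" if y_true == 1 else "tnr",   # recall of the actual class
                   "ppv" if y_pred == 1 else "npv"]   # precision of the predicted class
        for r in touched:
            self.rates[r] = self.eta * self.rates[r] + (1 - self.eta) * correct
        return dict(self.rates)

# Toy usage: a stable stream, then a drift that degrades performance on class 1.
rng = np.random.default_rng(1)
mon = DecayedRates(eta=0.98)
for t in range(2000):
    y = int(rng.random() < 0.5)
    acc = 0.9 if t < 1000 else (0.6 if y == 1 else 0.9)
    y_hat = y if rng.random() < acc else 1 - y
    rates = mon.update(y, y_hat)
print("final decayed rates:", {k: round(v, 2) for k, v in rates.items()})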

III-B Layer-II Hypothesis Test

The four rates are sensitive metrics that enable LFR to promptly detect any type of concept drift. However, this sensitivity also makes LFR more likely to trigger "false positive" detections. The Layer-II test serves to validate the detections raised by the Layer-I test and thus significantly reduces these false positives. In HLFR, we use a permutation test (see Algorithm 3) as the Layer-II test. The permutation test has been well studied both theoretically and practically; it does not require a priori information regarding the monitored process or the nature of the data [49].

Specifically, we partition the streaming observations into two consecutive segments based on the suspected drift instant provided by the Layer-I test, and employ a statistical hypothesis test that compares the inherent properties of these two segments to assess possible variations in the joint distribution P(X, y). The general idea behind the designed permutation test is to check whether the average prediction risk, evaluated over the second segment using a classifier trained on the first segment, differs significantly from its sampling distribution under the null hypothesis (i.e., no drift occurs). Here, we measure the average prediction risk with the zero-one loss, which contains partial information about the four rates. Intuitively, if no concept drift has occurred, the zero-one loss on the ordered train-test split (the ordered split in Algorithm 3) should not deviate too much from the losses obtained on shuffled splits (the permuted splits in Algorithm 3), which form a realization of its sampling distribution under the null hypothesis [30].

Input: Potential drift time point; permutation window size; permutation number; classification algorithm; significance rate.
Output: decision (drift confirmed or drift rejected).
Take the streaming segment of the given window size before the potential drift point.
Take the streaming segment of the given window size after the potential drift point.
Train a classifier on the first segment and test it on the second segment to obtain the ordered zero-one loss.
for each permutation do
    Randomly split the union of the two segments into a training part and a testing part of the same sizes.
    Train a classifier on the shuffled training part and test it on the shuffled testing part to obtain a permuted zero-one loss.
end for
if the ordered zero-one loss deviates significantly (at the given significance rate) from the permuted losses then
    decision ← drift confirmed.
else
    decision ← drift rejected.
end if
return decision
Algorithm 3 Permutation Test (Layer-II)
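The following Python sketch shows one way to realize the Layer-II check of Algorithm 3; the choice of classifier (scikit-learn logistic regression), the one-sided empirical p-value and the threshold rule are illustrative assumptions, not the authors' exact procedure.

import numpy as np
from sklearn.linear_model import LogisticRegression

def permutation_test(X_before, y_before, X_after, y_after,
                     n_perm=100, alpha=0.05, seed=0):
    # Train on the segment before the suspected drift, test on the segment
    # after it, and compare the resulting zero-one loss against losses on
    # shuffled train/test splits (the null distribution). Returns True if
    # the suspected drift is confirmed.
    rng = np.random.default_rng(seed)

    def zero_one_loss(Xtr, ytr, Xte, yte):
        clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
        return float(np.mean(clf.predict(Xte) != yte))

    n = len(y_before)
    loss_ordered = zero_one_loss(X_before, y_before, X_after, y_after)

    X_all = np.vstack([X_before, X_after])
    y_all = np.concatenate([y_before, y_after])
    perm_losses = []
    for _ in range(n_perm):
        idx = rng.permutation(len(y_all))          # random split of the pooled samples
        tr, te = idx[:n], idx[n:]
        perm_losses.append(zero_one_loss(X_all[tr], y_all[tr], X_all[te], y_all[te]))

    # Confirm the drift when the ordered loss is extreme relative to the
    # permutation distribution (one-sided empirical p-value).
    p_value = float(np.mean(np.array(perm_losses) >= loss_ordered))
    return p_value < alpha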

III-C Error analysis on Hierarchical Hypothesis Testing

To further give credence to the HHT framework in practical applications, we present a theoretical analysis of its associated Type-I and Type-II errors.

In the problem of concept drift detection, the Type-I error (also known as the "false positive" rate) refers to the incorrect rejection of a true null hypothesis (i.e., no drift occurs). By contrast, the Type-II error (also known as the "false negative" rate) is the incorrect retention of a false null hypothesis when the alternative hypothesis is true. For any (single-layer) hypothesis test, the Type-I error is exactly the selected significance level, whereas the Type-II error (denoted β) is determined by the power of the test, the power being exactly 1 − β.

Let us denote by α_1 and β_1 the Type-I and Type-II errors of the Layer-I test, and by α_2 and β_2 the Type-I and Type-II errors of the Layer-II test. We also denote by α_HHT and β_HHT the overall Type-I and Type-II errors of the HHT framework.

By definition, the Type-I error of HHT is given by:

α_HHT = P(Layer-I rejects H_0 ∧ Layer-II rejects H_0 | H_0 is true) = α_1 · α_2,    (1)

where "∧" denotes the AND logic operator.

Eq. (1) assumes that the performances of the Layer-I and Layer-II tests are independent, i.e., the detection results of the Layer-I and Layer-II tests do not influence each other when they are conducted. Given that the test statistics and testing manners are totally different in the Layer-I and Layer-II tests of HLFR, this assumption is reasonable. In fact, even if the performances of the Layer-I and Layer-II tests are related to each other, α_HHT still satisfies α_HHT ≤ α_1, which suggests that the HHT framework will not increase the Type-I error even in the worst case.

Similarly, the overall Type-II error is given by:

β_HHT = β_1 + (1 − β_1) · β_2.    (2)

Again, we assume performance independence of the Layer-I and Layer-II tests. However, even when this condition is not met, we still have β_HHT ≥ β_1. This is an unfortunate fact, as it suggests a fundamental limitation of the HHT framework: it may increase the Type-II error. Given that the majority of current concept drift detectors have high detection power (i.e., β_1 is small) yet suffer from a relatively high "false positive" rate, this cost is acceptable.

As emphasized earlier, an architecture similar to the proposed HHT framework is the ensemble of detectors [46, 47]. The most widely used decision rule for an ensemble of detectors is that, given a pool of candidate detectors, the system declares a drift if any one of the detectors finds a drift. In this way, supposing there are K candidate detectors, the Type-I and Type-II errors of the ensemble of detectors are given by (assuming pairwise performance independence [50]):

α_ens = 1 − ∏_{i=1}^{K} (1 − α_i),    (3)

β_ens = ∏_{i=1}^{K} β_i,    (4)

where α_i and β_i denote the Type-I and Type-II errors of the i-th detector.

By referring to Eqs. (1)-(4), it is easy to see that, although the architectures of HHT and the ensemble of detectors look similar, their functionalities and mechanisms are totally different. HHT attempts to remove "false positive" detections as much as possible, thus significantly decreasing the Type-I error; however, it may increase the Type-II error at the same time. The ensemble of detectors, on the other hand, aims to further improve detection power (thus decreasing the Type-II error) at the cost of an increased Type-I error (α_ens ≥ α_i and β_ens ≤ β_i for every i). Given that prevalent concept drift detectors already have high detection power (e.g., LFR and HDDM) yet suffer from many "false positive" detections, it may not be necessary to naively combine different detectors in an ensemble manner. This is also the reason why the ensemble of detectors does not demonstrate any performance gain over single-layer drift detectors in a recent experimental survey [50].
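A tiny numerical sketch may help contrast the two composition rules. The error values below are invented for illustration only; the formulas follow Eqs. (1)-(4) under the independence assumption.

# Illustrative numbers only: compose Type-I/Type-II errors of two independent
# tests under the HHT rule (Layer-II must confirm Layer-I) and under the
# "any detector fires" ensemble rule.
a1, b1 = 0.05, 0.01    # Layer-I (or detector 1): assumed Type-I and Type-II errors
a2, b2 = 0.05, 0.10    # Layer-II (or detector 2): assumed Type-I and Type-II errors

alpha_hht = a1 * a2                     # both layers must falsely reject, Eq. (1)
beta_hht = b1 + (1 - b1) * b2           # Layer-I misses, or Layer-II vetoes a true drift, Eq. (2)

alpha_ens = 1 - (1 - a1) * (1 - a2)     # a false alarm from either detector, Eq. (3)
beta_ens = b1 * b2                      # both detectors must miss, Eq. (4)

print(f"HHT:      alpha = {alpha_hht:.4f}, beta = {beta_hht:.4f}")
print(f"Ensemble: alpha = {alpha_ens:.4f}, beta = {beta_ens:.4f}")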

Having presented the analytical expressions for the overall Type-I and Type-II errors of the HHT framework (i.e., α_HHT and β_HHT), we now characterize, for completeness, the Type-I and Type-II errors of the Layer-I test (i.e., α_1 and β_1) as well as those of the Layer-II test (i.e., α_2 and β_2) of our proposed HLFR algorithm.

III-C1 The α_1 and β_1 of the Layer-I test

The Type-I error of the Layer-I test is upper bounded by its detection significance level (the detect significance level in Algorithm 2). On the other hand, although each test statistic is, under a stable concept (i.e., under the null hypothesis) up to time t, a geometrically weighted sum of i.i.d. Bernoulli random variables with the underlying rate as success probability, two reasons make it impossible to obtain a closed-form expression or upper bound for the Type-II error of the Layer-I test.

1) It is hard to obtain the closed-form distribution function of the statistic under the null hypothesis. Although [51] derived the closed-form distribution function under the null hypothesis for a special value of the decaying factor, it remains an open question for other values.

2) The closed-form distribution function of the statistic under the alternative hypothesis is unattainable. This is because the statistic can have an arbitrary (or unconstrained) distribution when the concept changes.

Therefore, this section only empirically investigates the power of the Layer-I test using synthetic data to illustrate and reveal its properties. Suppose the null distribution corresponds to a pre-drift rate, the alternative distribution corresponds to a post-drift rate, and a maximal detection time delay is allowed; that is, the underlying rate drifts from the pre-drift value (the first concept) to the post-drift value (the second concept). Fig. 3 is a heatmap of the limiting power estimates over all pairs of pre-drift and post-drift rates. We can see that the estimated power is already close to 1 when the two rates are significantly different. In this case, the Type-II error reduces to 0, since it equals one minus the power.

Fig. 3: Heatmap of the power estimate of the Layer-I test over pairs of pre-drift and post-drift rates.
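A simplified Monte-Carlo power estimate along the lines described above can be sketched as follows; the decay factor, bound calibration and parameter values are illustrative stand-ins for the paper's BoundTable procedure.

import numpy as np

def power_estimate(p1, p2, eta=0.95, n_null=2000, n_trials=500,
                   warmup=500, max_delay=200, alpha=0.01, seed=0):
    # Rough power estimate for a single decayed-rate statistic
    # R_t = eta * R_{t-1} + (1 - eta) * X_t on Bernoulli data: bounds are the
    # empirical alpha/2 and 1-alpha/2 quantiles of R_t under the null (rate p1);
    # power is the fraction of trials in which R_t leaves these bounds within
    # max_delay steps after the rate drifts to p2.
    rng = np.random.default_rng(seed)

    def run(p_pre, p_post, steps_post):
        r = p_pre
        for _ in range(warmup):
            r = eta * r + (1 - eta) * (rng.random() < p_pre)
        trace = []
        for _ in range(steps_post):
            r = eta * r + (1 - eta) * (rng.random() < p_post)
            trace.append(r)
        return np.array(trace)

    # Calibrate detection bounds under the null (no drift).
    null_vals = np.concatenate([run(p1, p1, 50) for _ in range(n_null // 50)])
    lo, hi = np.quantile(null_vals, [alpha / 2, 1 - alpha / 2])

    # Does the statistic escape the bounds soon after the drift?
    hits = sum(bool(np.any((t < lo) | (t > hi)))
               for t in (run(p1, p2, max_delay) for _ in range(n_trials)))
    return hits / n_trials

print("power for a large drift (0.9 -> 0.6):", power_estimate(0.9, 0.6))
print("power for a small drift (0.9 -> 0.88):", power_estimate(0.9, 0.88))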

III-C2 The α_2 and β_2 of the Layer-II test

As with the Layer-I test, the Type-I error of the Layer-II test is upper bounded by its selected significance level (the significance rate in Algorithm 3). Thus, we focus our analysis on its power. Before that, we give the following two definitions.

Definition III.1.

[52] An algorithm A has error stability β_n with respect to a loss function ℓ if:

|E_{z∼D}[ℓ(A_S, z)] − E_{z∼D}[ℓ(A_{S\i}, z)]| ≤ β_n,  for all S of cardinality n and all i,    (5)

where A_S refers to the predictor obtained by training A on a set S of cardinality n, S\i is the set S with the i-th sample removed, and β_n decreases with n. E_{z∼D}[ℓ(·, z)] is the risk of a predictor with respect to the generating distribution D, and E denotes expectation.

Definition III.2.

[30] A stream segment is said to have permitted variations of a bounded magnitude, with respect to the loss function, if:

(6)

Given two subsequences of equal length, the Layer-II test in our HLFR method aims to determine whether the average prediction risk on the ordered train-test split deviates too much from that of the shuffled splits, by testing the corresponding null hypothesis against its alternative. In this formulation, one quantity denotes the risk on the ordered train-test split (the ordered zero-one loss in Algorithm 3), another denotes the risk on the shuffled splits (the permuted zero-one losses in Algorithm 3), the expectation is taken with respect to the uniform distribution over all possible training sets of the given size drawn from the two segments of samples, one parameter controls the maximum allowable change rate, and a related function is elaborated in the following corollary.

Having illustrated the essence of the Layer-II test, and given Definitions III.1 and III.2, the following corollary upper bounds its Type-II error β_2.

Corollary III.0.1.

For an algorithm with the error stability of Definition III.1, under the alternative hypothesis the probability of obtaining a "false negative" detection is bounded as follows:

(7)

Here the bound depends on the permutation window size and the significance rate of Algorithm 3, on the small permitted variation of Definition III.2, and on the maximum allowable change rate; the quantities compared are the estimated zero-one losses of the ordered train-test split and of the shuffled splits (see Algorithm 3 for more details). For simplicity, we fix the change-rate parameter in Algorithm 3 to avoid introducing extra hyperparameters. Note that the above corollary is a special case of a theorem in [30]; interested readers can refer to the supplementary material of [30] for the complete proof.

III-D Adaptive Hierarchical Linear Four Rates (A-HLFR)

Although HLFR can be used for streaming data classification under concept drift (just like its DDM [15], EDDM [37] and STEPD [38] counterparts), naively retraining a new classifier after each detected drift severely deteriorates classification performance. This stems from the fact that, once a drift is confirmed, retraining discards all (relevant) information from previous experience and uses only the limited samples from the current concept. A promising way to avoid this is to first extract the relevant knowledge from past experience and then "transfer" it to the new classifier [4, 25, 53]. To this end, Adaptive Hierarchical Linear Four Rates (A-HLFR) is an integral part of the proposed solution. A-HLFR makes a simple yet strategic modification to HLFR: it replaces the retraining scheme in the HLFR framework with an adaptive learning strategy. Specifically, we substitute the SVM (this paper selects the soft-margin SVM as the base classifier due to its accuracy and robustness [52]) with the adaptive SVM (A-SVM) [54] once a concept drift is confirmed. The pseudocode of A-HLFR is the same as Algorithm 1; the only difference appears in the Layer-I test, where the retraining step with a standard SVM (the relearning step in Algorithm 2) is replaced with A-SVM.

III-D1 Adaptive SVM - Motivations and Formulations

A fundamental difficulty in learning supervised models once a concept drift is confirmed is that the training samples from the new and previous concepts are drawn from different distributions. A short detection delay (especially for state-of-the-art concept drift detection methods) leaves extremely limited training samples from the new concept. These limited training samples, coupled with the fact that consecutive concepts are often closely related or relevant, inspire the idea of adapting the previous model with samples from the new concept to boost concept drift adaptation capability.

Recalling the earlier problem formulation, we are required to classify samples in the new concept, where only a limited number of labeled samples (a newly observed primary dataset) are available for updating the classifier. To circumvent the drawback of limited training samples, the auxiliary classifier trained on the previously observed, fully labeled auxiliary dataset should also be considered. The auxiliary dataset is sampled from a joint distribution that is related to, yet different from, the joint distribution of the primary dataset in an unknown manner. If we apply the auxiliary classifier directly to the primary dataset, its performance is poor, since it is biased towards the previous concept. On the other hand, although we can retrain a classifier using only the primary samples so that the new classifier is unbiased, its classification accuracy may suffer from high variance due to the limited number of training samples.

In order to achieve an improved bias-variance tradeoff, we employ the adaptive SVM (A-SVM), introduced in [54], to adapt the auxiliary classifier to the new concept. Intuitively, the key idea of A-SVM is to learn an adapted classifier from the auxiliary one by regularizing the distance between their parameters, which can be formulated as:

min_w  (1/2) ‖w − w_aux‖² + C Σ_i ξ_i   s.t.  y_i w^T φ(X_i) ≥ 1 − ξ_i,  ξ_i ≥ 0,  for all samples (X_i, y_i) from the new concept,    (8)

where φ(·) represents a feature mapping that projects a sample into a high-dimensional (reproducing kernel Hilbert) space (for the linear SVM, φ(X) = X), and w_aux denotes the classifier parameters estimated from the auxiliary dataset. Eq. (8) jointly optimizes the distance between the new and auxiliary classifiers and the classification error on the new samples. The optimization of A-SVM is presented in [54, 55].
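A minimal linear sketch of the biased-regularization idea behind A-SVM is given below; it uses plain sub-gradient descent and invented toy data, and is only an illustration of the trade-off in Eq. (8), not the solver of [54, 55].

import numpy as np

def adaptive_linear_svm(X_new, y_new, w_aux, lam=1.0, lr=0.01,
                        n_epochs=200, seed=0):
    # Fit a linear classifier to the few post-drift samples (labels in {-1,+1})
    # while penalising the distance to the auxiliary weights w_aux learned on
    # the previous concept: sub-gradient descent on
    #     lam/2 * ||w - w_aux||^2 + mean_i hinge(y_i, w . x_i).
    rng = np.random.default_rng(seed)
    w = w_aux.astype(float).copy()
    n = len(y_new)
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            grad = lam * (w - w_aux)
            if y_new[i] * (X_new[i] @ w) < 1:     # hinge-loss sub-gradient
                grad = grad - y_new[i] * X_new[i]
            w = w - lr * grad
    return w

# Toy usage: the new concept is a rotated version of the old boundary, and
# only a handful of post-drift samples are available.
rng = np.random.default_rng(2)
w_aux = np.array([1.0, 0.0])                       # old decision boundary
X_new = rng.normal(size=(30, 2))
y_new = np.where(X_new @ np.array([0.8, 0.6]) >= 0, 1, -1)
print("adapted weights:", adaptive_linear_svm(X_new, y_new, w_aux).round(2))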

IV Experiments

This section presents three sets of experiments that demonstrate the superiority of HLFR and A-HLFR over prevalent baseline methods in terms of concept drift detection and adaptation. Section IV-A validates the benefits and advantages of HLFR for concept drift detection, using both quantitative metrics and visual evaluation. Section IV-B uses two real-world examples (one on email filtering, the other on weather prediction) to illustrate the effectiveness and potency of using an adaptive training method to improve concept drift adaptation capability. In Section IV-C, we empirically demonstrate that 1) the benefits of adaptive training are not limited to HLFR, i.e., it provides a general solution to classifier adaptation for any concept drift detector, such as DDM, EDDM, etc.; and 2) the concept drift detection capability is not impaired by the adaptive training strategy, i.e., HLFR and A-HLFR achieve almost the same concept drift detection precision. Finally, we give a brief analysis of the computational complexity of all competing methods in Section IV-D. All the experiments mentioned in this work were conducted in MATLAB on an Intel i5-3337 PC.

IV-A Concept Drift Detection with HLFR

We first compare the performance of HLFR against five state-of-the-art concept drift detection methods: DDM [15], EDDM [37], DDM-OCI [16], STEPD [38], as well as the recently proposed LFR [17]. The parameters used in these methods, namely the warning and detection thresholds of DDM and EDDM and the warning and detection significance levels of LFR and STEPD, were set as recommended by their authors, whereas the parameters of DDM-OCI vary across the different data under testing. For our proposed HLFR, the significance rate of the Layer-II test and the number of permutations were kept fixed throughout this paper.

Four benchmark data streams are selected for evaluation, namely "SEA" [15], "Checkerboard" [14], "Rotating hyperplane", and "USENET1" [13]. These datasets include both synthetic and real-world data; a comprehensive description of them is given in [22]. Drifts are synthesized in the data, thus controlling the ground-truth concept drift locations and enabling precise quantitative analysis. Table I summarizes the drift types and data properties for each stream. The selected datasets span the gamut of concept drift types.

Data property SEA Checkerboard Hyperplane USENET1
gradual
abrupt
recurrent
imbalance
high dimensional
TABLE I: Summary of properties of selected datasets

Each stream was generated and tested independently over repeated trials. The base classifier used by all competing methods on all streams is a (soft-margin) linear SVM; the only exception is USENET1, for which a radial basis function (RBF) kernel SVM is selected. Fig. 4 shows the detection results of the different methods averaged over these trials. As can be seen, HLFR and LFR significantly outperform their competitors in terms of promptly detecting concept drifts with fewer missed or false detections, regardless of drift type or data properties. By integrating the Layer-II test, HLFR further improves on LFR by effectively removing even the few false positives triggered by LFR.

(a) SEA dataset
(b) Checkerboard dataset
(c) Rotating hyperplane dataset
(d) USENET1 dataset
Fig. 4: The histograms of detected concept drift points, generated using the different methods, over (a) SEA; (b) Checkerboard; (c) Rotating hyperplane and (d) USENET1 datasets. In each row, the red bars denote the ground-truth locations of concept drift points, whereas the blue bars form the histogram of detected points summarized over the independent trials.

Quantitative comparisons are performed as well. We define a True Positive (TP) as a detection within a fixed delay range after a concept drift occurred, a False Negative (FN) as missing a detection within that delay range, and a False Positive (FP) as a detection outside this range or an extra detection within the range. For each detector, the detection quality is then evaluated by its Precision and Recall values. The Precision and Recall values with respect to a predefined (largest allowable) detection delay are shown in Fig. 5. At first glance, HLFR, LFR and STEPD always achieve higher Precision or Recall values across different delay ranges. Looking deeper, the Precision is significantly improved by HLFR, while the Recall values of HLFR and LFR are similar (except on the Rotating hyperplane dataset). This result corroborates our Type-I and Type-II error analysis in Section III-C: the Layer-II test aims to confirm or deny the validity of the Layer-I detections, so it cannot compensate for detections missed by the Layer-I test. In other words, the Type-I error of HLFR should theoretically be smaller than that of LFR, whereas the Type-II error of HLFR is lower bounded by that of LFR. In fact, the relatively lower Recall of HLFR (compared to LFR) suggests that the Layer-II test used is a little conservative, i.e., it has a small probability of rejecting a true positive detection triggered by the Layer-I test (i.e., LFR). On the other hand, STEPD seems to have much higher Recall values on the SEA and Rotating hyperplane datasets. However, this result is of little value, because STEPD triggers significantly more false alarms (as seen in the fifth row of Fig. 4(a) and Fig. 4(c)), such that its Precision values on these two datasets remain consistently low. Table II summarizes the (ensemble-average) detection delays for all competing algorithms. Out of the four datasets, our HLFR has the shortest average detection delay on three of them.
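The evaluation protocol just described can be scored with a few lines of code; the sketch below is a hypothetical helper (not the authors' evaluation script) that counts TP, FN and FP for a given maximal detection delay and returns Precision and Recall.

def score_detections(detected, true_drifts, max_delay):
    # A detection within max_delay steps after a true drift is a true positive (TP);
    # a true drift with no such detection is a false negative (FN); every
    # remaining detection is a false positive (FP).
    detected = sorted(detected)
    used = set()
    tp = 0
    for drift in true_drifts:
        match = next((d for d in detected
                      if d not in used and drift <= d <= drift + max_delay), None)
        if match is not None:
            used.add(match)
            tp += 1
    fn = len(true_drifts) - tp
    fp = len(detected) - tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical example: two true drifts, one prompt detection, one late, one spurious.
print(score_detections(detected=[1030, 2500, 3300], true_drifts=[1000, 3000], max_delay=100))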

Algorithms SEA Checkerboard Hyperplane USENET1
STEPD 463 57 140 19
DDM-OCI 844 58 198 26
EDDM 939 93 166 36
DDM 1209 69 125 26
LFR 458 56 127 17
HLFR 482 55 120 17
TABLE II: Average detection delay for all competing algorithms. The best performance in each dataset is highlighted in bold.
(a) Precision over SEA dataset
(b) Precision over Checkerboard dataset
(c) Precision over Hyperplane dataset
(d) Precision over USENET1 dataset
(e) Recall over SEA dataset
(f) Recall over Checkerboard dataset
(g) Recall over Hyperplane dataset
(h) Recall over USENET1 dataset
Fig. 5: The Precision and Recall values of the different methods over the SEA, Checkerboard, Rotating hyperplane and USENET1 datasets. In each figure, the X-axis represents the predefined (largest allowable) detection delay, and the Y-axis denotes the corresponding metric value. For a specific delay range, a higher Precision or Recall value suggests better performance.

IV-B Concept Drift Adaptation with A-HLFR

In this section, we perform two case studies using representative real-world concept drift datasets, one from the email filtering domain and one from the weather prediction domain, aiming to validate the rationale of HLFR for concept drift detection as well as the potency of A-HLFR for concept drift adaptation. Performance is compared to DDM, EDDM, STEPD and LFR. Note that the results of DDM-OCI are omitted, as it fails to detect "reasonable" concept drift points in the selected data.

The spam filtering dataset [12] is used herein. It represents email messages from the Spam Assassin Collection (http://spamassassin.apache.org/) and contains natural concept drifts [12, 56], with an imbalanced spam ratio. In addition, the weather dataset [53, 14], a subset of the National Oceanic and Atmospheric Administration (NOAA) data (ftp://ftp.ncdc.noaa.gov/pub/data/gsod), consisting of daily observations recorded at Offutt Air Force Base in Bellevue, Nebraska, is also used in the study. This data was collected and recorded over many years, containing not only short-term seasonal changes but also (possibly) a long-term climate trend. Daily measurements include temperature, pressure, wind speed, visibility, and a variety of other features. The task is to predict whether it is going to rain from these features. The minority-class cardinality varies throughout these years.


On Parameter Tuning and Experimental Settings. A common phenomenon in the classification of real-world streaming data with concept drifts and temporal dependence is that "the more random alarms fire and retrain the classifier, the better the accuracy" [57]. Thus, to provide a fair comparison, the parameters of all competing methods are tuned to detect a similar number of concept drifts. Table III and Table IV summarize the key parameters regarding significance levels (or thresholds) of the different methods on the two selected real-world datasets. For the spam data, an extensive search for an appropriate partition into training and testing sets was performed based on two criteria. First, there should be no strong autocorrelation in the classification error sequence on the training set, because highly autocorrelated errors very probably indicate that the training data spans different concepts. Second, the classifier trained on the training set should achieve promising classification accuracy on both the minority and majority classes, i.e., a sufficient amount of training data is required. With these two considerations, the length of the training set was fixed. As for the weather data, the training size is set to approximately one season of daily instances, as suggested in [53].

Algorithms Parameter settings on significance levels (or thresholds)
STEPD
EDDM
DDM
LFR
HLFR
A-HLFR
TABLE III: Parameter settings in spam email filtering.
Algorithms Parameter settings on significance levels (or thresholds)
STEPD
EDDM
DDM
LFR
HLFR
A-HLFR
TABLE IV: Parameter settings in weather prediction.

Case study on the spam dataset. We first evaluate the performance of the different methods on the spam dataset. According to the authors of [12], there are three dominating concepts distributed over different time periods, and concept drifts occurred approximately in the neighborhood of a few time instants in Region I, a few time instants in Region II, and one time instant in Region III. Besides, there are many abrupt drifts in Region II; a possible reason for these abrupt and frequent drifts may be batches of outliers or noisy messages. According to the concept drift detection results shown in Fig. 6, A-HLFR and HLFR best match these descriptions, except that they both miss one potential drift. By contrast, although the other methods are able to detect this point, they have many other limitations: 1) LFR also triggers some false positive detections; and 2) DDM and EDDM not only miss obvious drift points, but also return unconvincing drift locations in Region I or Region III.

We then apply a recently proposed measure, the Kappa Plus Statistic (KPS) [58], to assess the experimental results. KPS, defined as κ⁺ = (p₀ − p_e′)/(1 − p_e′), where p₀ is the classifier's prequential accuracy [59] and p_e′ is the accuracy of the No-Change classifier (i.e., the classifier that predicts for any observation the same label as the previously observed one [58]), evaluates a data stream classifier's performance while taking into account the temporal dependence and the effectiveness of classifier adaptation. We partition the data into approximately equal consecutive time periods. The KPS prequential representation over these periods is shown in Fig. 7(a). As can be seen, the HLFR and A-HLFR adaptations are most effective in the earlier periods, but suffer from a sudden drop in some later periods. These observations corroborate the detection results shown in Fig. 6: HLFR and A-HLFR accurately detect the first drift point without any false positives in Region I, but they both miss a target in Region II. On the other hand, there is almost no performance difference between the classifier updates in A-HLFR and HLFR.

Fig. 6: Concept drift detection results on the spam dataset.
(a) KPS prequential representation on the spam dataset
(b) KPS prequential representation on the weather dataset
Fig. 7: Kappa Plus Statistic (KPS) prequential representations.

We further employ several quantitative measures to provide a thorough evaluation of streaming classification performance. The first is the most commonly used overall accuracy (OAC). Although OAC is an important metric, it is inadequate for imbalanced data. Therefore, we also include the F-measure (the harmonic mean of precision and recall) [60] and the G-mean (the geometric mean of the true positive rate and the true negative rate) [61]. All metrics are calculated at each time instant, creating time series representations that resemble learning curves. Fig. 8(a)-(c) plot the time series representations of OAC, F-measure and G-mean for all competing methods. As can be seen, A-HLFR and HLFR typically provide a significant improvement in F-measure and G-mean while maintaining a good OAC when compared to their DDM, EDDM and LFR counterparts, with A-HLFR performing slightly better than HLFR. STEPD seems to demonstrate the best overall classification performance on the spam dataset. However, A-HLFR and HLFR provide more accurate (or rational) concept drift detections, which best match the cluster assignment results in [12].
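For completeness, the sketch below computes the three metrics from streaming predictions (assuming labels coded as 0/1 with 1 as the positive/minority class); the toy example illustrates why OAC alone is misleading on imbalanced streams.

import numpy as np

def streaming_metrics(y_true, y_pred):
    # Overall accuracy (OAC), F-measure and G-mean from binary labels and
    # predictions, following the standard definitions cited in the footnotes.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))

    oac = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0       # true positive rate
    tnr = tn / (tn + fp) if (tn + fp) else 0.0          # true negative rate
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    g_mean = float(np.sqrt(recall * tnr))
    return {"OAC": float(oac), "F-measure": float(f_measure), "G-mean": g_mean}

# Toy imbalanced example: ignoring the minority class looks fine on OAC
# but collapses on F-measure and G-mean.
y_true = [1] * 10 + [0] * 90
y_pred = [0] * 100
print(streaming_metrics(y_true, y_pred))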


Case study on the weather dataset. We then evaluate the performance of the different methods on the weather dataset. Because ground-truth drift locations are not available, we only present the concept drift adaptation comparison. Fig. 7(b) plots the KPS prequential representations. As can be seen, A-HLFR performs (or updates) best in the majority of time segments. Fig. 8(d)-(f) plot the corresponding OAC, F-measure and G-mean time series representations for all competing algorithms. Although no adaptation (i.e., using the initially trained classifier for prequential classification without any classifier update) enjoys an overwhelming advantage in OAC compared to DDM, EDDM, LFR and STEPD, it is nevertheless invalid, as the corresponding F-measure and G-mean tend to zero as time evolves. This suggests that, without adaptation, the initial classifier gradually identifies the remaining data as belonging to the majority class, i.e., no-rain days, which is not realistic. A-HLFR and HLFR achieve OAC values close to the non-adaptive classifier, while showing significant improvements in F-measure and G-mean. Again, A-HLFR performs slightly better than HLFR.

From these two real applications, we can summarize some key observations:
1) The given data exhibit severe concept drifts, as the classification performance without adaptation deteriorates dramatically.
2) The adaptive training does not affect the performance of concept drift detection, as the concept drift detection results given by HLFR and A-HLFR are almost the same (see Fig. 6). This argument is further empirically validated in the next subsection.
3) A-HLFR and HLFR consistently produce the best overall performance in terms of OAC, F-measure, G-mean and the rationality of the detected drifts. On real data, A-HLFR performs only slightly better than HLFR. This is because the temporal relatedness between consecutive concepts in real-world data is weak, or the concept changes gradually and slowly, so that simply transferring previous knowledge to the current concept cannot significantly improve the generalization capacity of the new classifier. Therefore, adaptive training has great potency, but it deserves further investigation and future improvement.
4) There is still plenty of room for performance improvement in incremental learning under concept drift in nonstationary environments, as the OAC, F-measure and G-mean values are far from optimal. In fact, even state-of-the-art methods that only focus on automatically adapting classifier behavior (or parameters) to stay up-to-date with the streaming data dynamics achieve only moderate OAC values in [12, 56] for the spam data and in [53, 14] for the weather data, let alone the relatively lower F-measure and G-mean values.
5) The ensemble of classifiers seems to be a promising direction for future work. However, most existing ensemble-learning-based methods (e.g., [53, 14]) are developed for batch-incremental data [39], which is not suitable for a fully online setting, where samples arrive one by one in a sequential manner [4].

Fig. 8: The time series representations of different metrics for all competing algorithms. (a) and (b): the OAC representations for spam data and weather data, respectively. (c) and (d): the F-measure representations for spam data and weather data, respectively. (e) and (f): the G-mean representations for spam data and weather data, respectively.

IV-C Benefits of adaptive learning

In this section, we demonstrate, through concept drift adaptation on the USENET1 and Checkerboard datasets, that the superiority of the adaptive SVM for concept drift adaptation is not limited to the HLFR framework. To this end, we consider the performance obtained by integrating the adaptive SVM into the DDM, EDDM, DDM-OCI, STEPD and LFR frameworks. We term these combinations A-DDM, A-EDDM, A-DDM-OCI, A-STEPD and A-LFR, respectively.

In Fig. 9, we plot the Precision and Recall curves of HLFR, LFR, DDM, EDDM, DDM-OCI, STEPD, A-HLFR, A-LFR, A-DDM, A-EDDM, A-DDM-OCI and A-STEPD on USENET1 and Checkerboard, respectively. For better visualization, we separate the competing algorithms into two groups: group I includes HLFR, A-HLFR, LFR, A-LFR, STEPD and A-STEPD, as they always perform better than their counterparts, while group II contains DDM, A-DDM, EDDM, A-EDDM, DDM-OCI and A-DDM-OCI. In each subfigure, the dashed line represents the baseline algorithm without adaptive training (e.g., HLFR), while the solid line denotes its adaptive version (e.g., A-HLFR); for each baseline algorithm, its adaptive version is marked with the same color for comparison purposes. Clearly, the adaptive training does not affect the performance of concept drift detection (admittedly, there is a performance gap for DDM and STEPD; the difference is, however, data-dependent: for example, DDM seems better than A-DDM on the Checkerboard dataset, but this advantage does not hold on USENET1). This is because a drift is determined by tracking "significant" changes in classification performance, rather than by the specific performance measurement itself.

In Fig. 10 and Fig. 11, we plot the time series representations of OAC, F-measure and G-mean on these two datasets over the Monte-Carlo simulations. The shading enveloping each curve represents the confidence interval. In each sub-figure, the red dashed (or blue solid) line represents the mean values for the drift detection algorithm with (or without) the adaptive training scheme, while the red (or blue) shading represents the corresponding confidence interval. For almost all competing algorithms, the adaptive versions achieve much better classification results than their non-adaptive counterparts. This performance boost begins from the first concept drift adaptation and grows gradually with the increasing number of adaptations. As seen, A-HLFR and A-LFR achieve more compelling learning performance compared with A-DDM, A-EDDM, A-DDM-OCI and A-STEPD (the comparable performance of A-DDM on the Checkerboard dataset results from a larger number of adaptations, which is however unreasonable, as the adaptation alarms are false alarms; this also coincides with the quantitative analysis of concept drift detection shown in Fig. 9). These results empirically validate the potential and superiority of using adaptive classifier techniques for concept drift adaptation, instead of the retraining strategy adopted in previous work. It is also worth noting that the adaptive classifier is not limited to the soft-margin SVM. In fact, adaptive logistic regression [62], adaptive single-layer perceptron [63] and adaptive decision tree [64] frameworks have all been developed in recent years with the advance of statistical machine learning. We leave the investigation of concept drift adaptation using other adaptive classifiers as future work.

IV-D On the computational complexity analysis of concept drift detection

Having demonstrated the benefits and effectiveness of the HHT framework, this section discusses the computational complexity of the aforementioned concept drift detectors, particularly the additional computation cost incurred by incorporating the Layer-II test. DDM, EDDM, DDM-OCI, STEPD and LFR all have constant time complexity, O(1), at each time point, as they follow a single-layer hypothesis testing framework that monitors one (or four) error-related statistics [17]. The computational complexity of generating the bound tables used by LFR or HLFR to determine the warning and detection bounds for different rate values is proportional to the number of Monte-Carlo simulations used. However, since the bound tables can be computed offline, the time complexity of looking up the bound values at run time (see Algorithm 2) remains O(1). Due to the introduction of the Layer-II test, HLFR is more computationally expensive than the other single-layer methods, because it requires training a number of classifiers, equal to the number of permutations, to validate the occurrence of a potential concept drift time point (HLFR has the same computational complexity as LFR if the Layer-I test does not reject the null hypothesis at the tested time point). Hence, the total computational complexity of HLFR at a suspected time point is the number of permutations multiplied by the cost of training one classifier.

Despite this limitation, the HHT framework introduces a new perspective to the field of concept drift detection, especially considering its overwhelming advantages in detection precision and detection delay. Finally, it should be noted that the permutations in the Layer-II test can be run in parallel, as the classifiers trained in different permutations are independent of each other.

V Conclusions

This paper proposed a novel concept drift detector, namely Hierarchical Linear Four Rates (HLFR), under the hierarchical hypothesis testing (HHT) framework. Unlike previous work, HLFR is able to detect all possible variants of concept drift regardless of data characteristics, and it is independent of the underlying classifier. Using the adaptive SVM as its base classifier, HLFR can be easily extended to a concept-drift-agnostic framework, i.e., A-HLFR. The performance of HLFR and A-HLFR in detecting and adapting to concept drifts is compared to state-of-the-art methods using both simulated and real-world datasets that span the gamut of concept drift types (recurrent or irregular, gradual or abrupt, etc.) and data distributions (balanced or imbalanced labels). Experimental results corroborate our theoretical analysis of the Type-I and Type-II errors of HLFR and also demonstrate that our methods significantly outperform their competitors in terms of earliest detection of concept drift, highest detection precision, and powerful adaptability across different concepts. Two real examples on email filtering and weather prediction are finally presented to illustrate the effectiveness and great potential of our methods.

In the future, we will extend HLFR and A-HLFR to multi-class classification scenarios. One possible solution is to use the one-vs-all strategy to convert a $K$-class classification problem into $K$ binary classification problems (a minimal sketch is given below). Since the four rates associated with each binary problem are still geometrically weighted sums of Bernoulli random variables, HLFR and A-HLFR can likely be applied straightforwardly. Additionally, we are interested in investigating more sensitive metrics, from an information theoretic learning (ITL) perspective [65], to monitor the streaming environment. Finally, we will continue designing more powerful tests under the HHT framework for industrial-scale noisy data.
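A minimal sketch of the one-vs-all decomposition mentioned above: each class induces its own binary prediction/label stream, and each stream can in principle be monitored by its own binary drift detector. The `detector_factory` interface is hypothetical and merely stands in for an HLFR-style per-stream detector.

```python
class OneVsAllDriftMonitor:
    """Decompose a K-class stream into K binary streams, one per class,
    and feed each binary error sequence to its own drift detector.

    `detector_factory` is a placeholder for any binary drift detector
    exposing an `add(pred, truth)` method that returns True when it
    signals a drift on its stream."""

    def __init__(self, classes, detector_factory):
        self.classes = list(classes)
        self.detectors = {c: detector_factory() for c in self.classes}

    def update(self, y_pred, y_true):
        """Push one (prediction, label) pair; return the classes whose
        binary detector signalled a drift at this time step."""
        drifted = []
        for c, det in self.detectors.items():
            # Binary view of the multi-class decision for class c.
            if det.add(int(y_pred == c), int(y_true == c)):
                drifted.append(c)
        return drifted
```

In practice, `detector_factory` would return a detector configured for the (possibly imbalanced) binary error stream of that class.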

References

  • [1] K. Slavakis, S.-J. Kim, G. Mateos, and G. B. Giannakis, “Stochastic approximation vis-a-vis online learning for big data analytics [lecture notes],” IEEE Signal Processing Magazine, vol. 31, no. 6, pp. 124–129, 2014.
  • [2] H. Hu, Y. Wen, T.-S. Chua, and X. Li, “Toward scalable systems for big data analytics: A technology tutorial,” IEEE Access, vol. 2, pp. 652–687, 2014.
  • [3] M. Basseville, I. V. Nikiforov et al., Detection of abrupt changes: theory and application.   Prentice Hall Englewood Cliffs, 1993, vol. 104.
  • [4] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, and A. Bouchachia, “A survey on concept drift adaptation,” ACM Computing Surveys (CSUR), vol. 46, no. 4, p. 44, 2014.
  • [5] S. Wang, L. L. Minku, and X. Yao, “A systematic study of online class imbalance learning with concept drift,” arXiv preprint arXiv:1703.06683, 2017.
  • [6] G. J. Ross, N. M. Adams, D. K. Tasoulis, and D. J. Hand, “Exponentially weighted moving average charts for detecting concept drift,” Pattern Recognition Letters, vol. 33, no. 2, pp. 191–198, 2012.
  • [7] G. Widmer and M. Kubat, “Learning in the presence of concept drift and hidden contexts,” Machine learning, vol. 23, no. 1, pp. 69–101, 1996.
  • [8] R. Klinkenberg, “Learning drifting concepts: Example selection vs. example weighting,” Intelligent Data Analysis, vol. 8, no. 3, pp. 281–300, 2004.
  • [9] A. Bifet and R. Gavalda, “Learning from time-changing data with adaptive windowing,” in Proceedings of the 2007 SIAM International Conference on Data Mining.   SIAM, 2007, pp. 443–448.
  • [10] L. Du, Q. Song, and X. Jia, “Detecting concept drift: An information entropy based method using an adaptive sliding window,” Intelligent Data Analysis, vol. 18, no. 3, pp. 337–364, 2014.
  • [11] W. N. Street and Y. Kim, “A streaming ensemble algorithm (SEA) for large-scale classification,” in Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining.   ACM, 2001, pp. 377–382.
  • [12] I. Katakis, G. Tsoumakas, and I. Vlahavas, “Tracking recurring contexts using ensemble classifiers: an application to email filtering,” Knowledge and Information Systems, vol. 22, no. 3, pp. 371–391, 2010.
  • [13] I. Katakis, G. Tsoumakas, and I. P. Vlahavas, “An ensemble of classifiers for coping with recurring contexts in data streams.” in ECAI, 2008, pp. 763–764.
  • [14] R. Elwell and R. Polikar, “Incremental learning of concept drift in nonstationary environments,” IEEE Transactions on Neural Networks, vol. 22, no. 10, pp. 1517–1531, 2011.
  • [15] J. Gama, P. Medas, G. Castillo, and P. Rodrigues, “Learning with drift detection,” in Brazilian Symposium on Artificial Intelligence.   Springer, 2004, pp. 286–295.
  • [16] S. Wang, L. L. Minku, D. Ghezzi, D. Caltabiano, P. Tino, and X. Yao, “Concept drift detection for online class imbalance learning,” in Neural Networks (IJCNN), The 2013 International Joint Conference on.   IEEE, 2013, pp. 1–10.
  • [17] H. Wang and Z. Abraham, “Concept drift detection for streaming data,” in 2015 International Joint Conference on Neural Networks (IJCNN).   IEEE, 2015, pp. 1–9.
  • [18] D. K. Antwi, H. L. Viktor, and N. Japkowicz, “The perfsim algorithm for concept drift detection in imbalanced data,” in Data Mining Workshops (ICDMW), 2012 IEEE 12th International Conference on.   IEEE, 2012, pp. 619–628.
  • [19] C. Alippi, G. Boracchi, and M. Roveri, “A hierarchical, nonparametric, sequential change-detection test,” in Neural Networks (IJCNN), The 2011 International Joint Conference on.   IEEE, 2011, pp. 2889–2896.
  • [20] ——, “Hierarchical change-detection tests,” IEEE transactions on neural networks and learning systems, vol. 28, no. 2, pp. 246–258, 2017.
  • [21] ——, “Just-in-time classifiers for recurrent concepts,” IEEE transactions on neural networks and learning systems, vol. 24, no. 4, pp. 620–634, 2013.
  • [22] S. Yu and Z. Abraham, “Concept drift detection with hierarchical hypothesis testing,” in Proceedings of the 2017 SIAM International Conference on Data Mining.   SIAM, 2017, pp. 768–776.
  • [23] L. L. Minku, A. P. White, and X. Yao, “The impact of diversity on online ensemble learning in the presence of concept drift,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 5, pp. 730–742, 2010.
  • [24] L. L. Minku and X. Yao, “Ddd: A new ensemble approach for dealing with concept drift,” IEEE transactions on knowledge and data engineering, vol. 24, no. 4, pp. 619–633, 2012.
  • [25] Y. Sun, K. Tang, Z. Zhu, and X. Yao, “Concept drift adaptation by exploiting historical knowledge,” IEEE Transactions on Neural Networks and Learning Systems, 2018.
  • [26] P. Domingos and G. Hulten, “A general framework for mining massive data streams,” Journal of Computational and Graphical Statistics, vol. 12, no. 4, pp. 945–949, 2003.
  • [27] B. Krawczyk, L. L. Minku, J. Gama, J. Stefanowski, and M. Woźniak, “Ensemble learning for data stream analysis: a survey,” Information Fusion, vol. 37, pp. 132–156, 2017.
  • [28] M. G. Kelly, D. J. Hand, and N. M. Adams, “The impact of changing populations on classifier performance,” in Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining.   ACM, 1999, pp. 367–371.
  • [29] G. Widmer and M. Kubat, “Effective learning in dynamic environments by explicit context tracking,” in Machine learning: ECML-93.   Springer, 1993, pp. 227–243.
  • [30] M. Harel, S. Mannor, R. El-Yaniv, and K. Crammer, “Concept drift detection through resampling.” in ICML, 2014, pp. 1009–1017.
  • [31] I. W. Sandberg, J. T. Lo, C. L. Fancourt, J. C. Principe, S. Katagiri, and S. Haykin, Nonlinear dynamical systems: feedforward neural network perspectives.   John Wiley & Sons, 2001, vol. 21.
  • [32] E. Brodsky and B. S. Darkhovsky, Nonparametric methods in change point problems.   Springer Science & Business Media, 2013, vol. 243.
  • [33] J. Chen and A. K. Gupta, Parametric statistical change point analysis: with applications to genetics, medicine, and finance.   Springer Science & Business Media, 2011.
  • [34] T. S. Sethi and M. Kantardzic, “On the reliable detection of concept drift from streaming unlabeled data,” Expert Systems with Applications, vol. 82, pp. 77–99, 2017.
  • [35] V. Vapnik, “Principles of risk minimization for learning theory,” in NIPS, 1991, pp. 831–838.
  • [36] V. M. Souza, D. F. Silva, J. Gama, and G. E. Batista, “Data stream classification guided by clustering on nonstationary environments and extreme verification latency,” in Proceedings of the 2015 SIAM International Conference on Data Mining.   SIAM, 2015, pp. 873–881.
  • [37] M. Baena-Garcıa, J. del Campo-Ávila, R. Fidalgo, A. Bifet, R. Gavalda, and R. Morales-Bueno, “Early drift detection method,” in Fourth international workshop on knowledge discovery from data streams, vol. 6, 2006, pp. 77–86.
  • [38] K. Nishida and K. Yamauchi, “Detecting concept drift using statistical testing,” in International conference on discovery science.   Springer, 2007, pp. 264–269.
  • [39] J. Read, A. Bifet, B. Pfahringer, and G. Holmes, “Batch-incremental versus instance-incremental learning in dynamic and evolving data,” Advances in Intelligent Data Analysis XI, pp. 313–323, 2012.
  • [40] I. Frías-Blanco, J. del Campo-Ávila, G. Ramos-Jiménez, R. Morales-Bueno, A. Ortiz-Díaz, and Y. Caballero-Mota, “Online and non-parametric drift detection methods based on Hoeffding's bounds,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 3, pp. 810–823, 2015.
  • [41] P. M. Gonçalves, S. G. de Carvalho Santos, R. S. Barros, and D. C. Vieira, “A comparative study on concept drift detectors,” Expert Systems with Applications, vol. 41, no. 18, pp. 8144–8156, 2014.
  • [42] J. C. Principe and R. Chalasani, “Cognitive architectures for sensory processing,” Proceedings of the IEEE, vol. 102, no. 4, pp. 514–525, 2014.
  • [43] C. Alippi, G. Boracchi, and M. Roveri, “Change detection tests using the ICI rule,” in Neural Networks (IJCNN), The 2010 International Joint Conference on.   IEEE, 2010, pp. 1–7.
  • [44] C. W. Helstrom, Statistical Theory of Signal Detection.   New York, NY, USA: Pergamon Press, 1968.
  • [45] D. Siegmund, Sequential Analysis: Tests and Confidence Intervals.   Springer, New York, 1985.
  • [46] L. Du, Q. Song, L. Zhu, and X. Zhu, “A selective detector ensemble for concept drift detection,” Computer Journal, no. 3, 2015.
  • [47] B. I. F. Maciel, S. G. T. C. Santos, and R. S. M. Barros, “A lightweight concept drift detection ensemble,” in Tools with Artificial Intelligence (ICTAI), 2015 IEEE 27th International Conference on.   IEEE, 2015, pp. 1061–1068.
  • [48] S. Wang, L. L. Minku, and X. Yao, “A learning framework for online class imbalance learning,” in Computational Intelligence and Ensemble Learning (CIEL), 2013 IEEE Symposium on.   IEEE, 2013, pp. 36–45.
  • [49] P. Good, Permutation tests: a practical guide to resampling methods for testing hypotheses.   Springer Science & Business Media, 2013.
  • [50] M. Woźniak, P. Ksieniewicz, B. Cyganek, and K. Walkowiak, “Ensembles of heterogeneous concept drift detectors-experimental study,” in IFIP International Conference on Computer Information Systems and Industrial Management.   Springer, 2016, pp. 538–549.
  • [51] D. Bhati, P. Kgosi, and R. N. Rattihalli, “Distribution of geometrically weighted sum of bernoulli random variables,” Applied Mathematics, vol. 2, no. 11, p. 1382, 2011.
  • [52] O. Bousquet and A. Elisseeff, “Stability and generalization,” Journal of Machine Learning Research, vol. 2, no. Mar, pp. 499–526, 2002.
  • [53] G. Ditzler and R. Polikar, “Incremental learning of concept drift from streaming imbalanced data,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 10, pp. 2283–2301, 2013.
  • [54] J. Yang, R. Yan, and A. G. Hauptmann, “Cross-domain video concept detection using adaptive svms,” in Proceedings of the 15th ACM international conference on Multimedia.   ACM, 2007, pp. 188–197.
  • [55] ——, “Adapting svm classifiers to data with shifted distributions,” in Seventh IEEE International Conference on Data Mining Workshops (ICDMW 2007).   IEEE, 2007, pp. 69–76.
  • [56] I. Katakis, G. Tsoumakas, and I. Vlahavas, “Dynamic feature space and incremental feature selection for the classification of textual data streams,” Knowledge Discovery from Data Streams, pp. 107–116, 2006.
  • [57] I. Zliobaite, “How good is the electricity benchmark for evaluating concept drift adaptation,” arXiv preprint arXiv:1301.3524, 2013.
  • [58] I. Žliobaitė, A. Bifet, J. Read, B. Pfahringer, and G. Holmes, “Evaluation methods and decision theory for classification of streaming data with temporal dependence,” Machine Learning, vol. 98, no. 3, pp. 455–482, 2015.
  • [59] J. Gama, R. Sebastião, and P. P. Rodrigues, “On evaluating stream learning algorithms,” Machine learning, vol. 90, no. 3, pp. 317–346, 2013.
  • [60] C. J. V. Rijsbergen, Information Retrieval.   Butterworth-Heinemann, 1979.
  • [61] M. Kubat, R. Holte, and S. Matwin, “Learning when negative examples abound,” in European Conference on Machine Learning.   Springer, 1997, pp. 146–153.
  • [62] C. Anagnostopoulos, D. K. Tasoulis, N. M. Adams, and D. J. Hand, “Temporally adaptive estimation of logistic classifiers on data streams,” Advances in data analysis and classification, vol. 3, no. 3, pp. 243–261, 2009.
  • [63] N. G. Pavlidis, D. K. Tasoulis, N. M. Adams, and D. J. Hand, “λ-perceptron: An adaptive classifier for data streams,” Pattern Recognition, vol. 44, no. 1, pp. 78–96, 2011.
  • [64] C. Alippi and M. Roveri, “Just-in-time adaptive classifiers in non-stationary conditions,” in 2007 International Joint Conference on Neural Networks.   IEEE, 2007, pp. 1014–1019.
  • [65] J. C. Principe, Information theoretic learning: Renyi’s entropy and kernel perspectives.   Springer Science & Business Media, 2010.