1 Introduction
In recent decades, considerable effort has gone into algorithms that can learn from data streams. Most traditional methods for this purpose assume that the data are stationary. However, when the underlying source generating the data stream, i.e., the joint distribution $P(X, y)$, is not stationary, the optimal decision rule should change over time. This phenomenon is known as concept drift [Ditzler et al.2015, Krawczyk et al.2017]. Detecting such concept drifts is essential for the algorithm to adapt itself to the evolving data.
Concept drift can manifest in two fundamental forms of change from the Bayesian perspective [Kelly et al.1999]: 1) a change in the marginal probability $P(X)$; 2) a change in the posterior probability $P(y|X)$. Existing studies in this field primarily concentrate on detecting the posterior distribution change $P(y|X)$, also known as the real drift [Widmer and Kubat1993], as it clearly indicates the optimal decision rule. On the other hand, only a little work aims at detecting the virtual drift [Hoens et al.2012], which only affects $P(X)$. In practice, one type of concept drift typically appears in combination with the other [Tsymbal2004]. Most methods for real drift detection assume that the true labels are available immediately after the classifier makes a prediction. However, this assumption is overoptimistic, since obtaining labels can require annotation that is expensive in both cost and labor time. Virtual drift detection, though making no use of the true labels $y$, suffers from wrong interpretation (i.e., interpreting a virtual drift as a real drift). Such a wrong interpretation can lead to a wrong decision about classifier update, which still requires labeled data [Krawczyk et al.2017].

To address these issues simultaneously, we propose a novel Hierarchical Hypothesis Testing (HHT) framework with a Request-and-Reverify strategy for concept drift detection. HHT incorporates two layers of hypothesis tests. Different from the existing HHT methods [Alippi et al.2017, Yu and Abraham2017], our HHT framework is the first attempt to use labels for concept drift detection only when necessary. It ensures that the test statistic (derived in a fully unsupervised manner) in Layer-I captures the most important properties of the underlying distributions, and adjusts itself well in a more powerful yet conservative manner that only requires labeled data when necessary in Layer-II. Two methods, namely Hierarchical Hypothesis Testing with Classification Uncertainty (HHT-CU) and Hierarchical Hypothesis Testing with Attribute-wise “Goodness-of-fit” (HHT-AG), are proposed under this framework in this paper. The first method incrementally tracks the distribution change with the defined classification uncertainty measurement in Layer-I, and uses a permutation test in Layer-II, whereas the second method uses the standard Kolmogorov-Smirnov (KS) test in Layer-I and the two-dimensional (2D) KS test [Peacock1983] in Layer-II. We test both proposed methods on benchmark datasets. Our methods show clear advantages over state-of-the-art unsupervised methods. Moreover, though using significantly fewer labels, our methods outperform supervised methods like DDM [Gama et al.2004].

2 Background Knowledge
2.1 Problem Formulation
Given a continuous stream of labeled samples $\{X_i, y_i\}$, $i = 1, \dots, N$, a classification model $f$ can be learned so that $f(X_i) \mapsto y_i$. Here, $X_i \in \mathbb{R}^d$ represents a $d$-dimensional feature vector, and $y_i$ is a discrete class label. Let $X_{N+1}, X_{N+2}, \dots$ be a sequence of new samples that comes chronologically with unknown labels. At time $t$, we split the samples into a set $S_A$ of recent ones and a set $S_B$ containing the samples that appear prior to those in $S_A$. The problem of concept drift detection is identifying whether or not the source (i.e., the joint distribution $P_t(X, y)$; the distributions are deliberately subscripted with the time index $t$ to explicitly emphasize their time-varying characteristics) that generates samples in $S_A$ is the same as that in $S_B$ (even without access to the true labels $y$) [Ditzler et al.2015, Krawczyk et al.2017]. Once such a drift is found, the machine can request a window of labeled data to update $f$ and employ the new classifier to predict labels of incoming data.

2.2 Related Work
The techniques for concept drift detection can be divided into two categories depending on their reliance on labels [Sethi and Kantardzic2017]: supervised (or explicit) drift detectors and unsupervised (or implicit) drift detectors. Supervised drift detectors rely heavily on true labels, as they typically monitor an error metric associated with classification loss. Although much progress has been made on concept drift detection in the supervised manner, the assumption that ground truth labels are available immediately for all already classified instances is typically overoptimistic. Unsupervised drift detectors, on the other hand, attempt to detect concept drifts without using true labels. Most unsupervised concept drift detection methods concentrate on performing multivariate statistical tests to detect changes of the feature values $P(X)$, such as the Conjunctive Normal Form (CNF) density estimation test [Dries and Rückert2009] and the Hellinger distance based density estimation test [Ditzler and Polikar2011]. Considering their high computational complexity, an alternative approach is to conduct a univariate test on each attribute of the features independently. For example, [Reis et al.2016] develops an incremental (sequential) KS test which can achieve exactly the same performance as the conventional batch-based KS test.

Besides modeling virtual drifts of $P(X)$, recent research in unsupervised drift detection attempts to model the real drifts by monitoring the classifier output $\hat{y}$ or posterior probability as an alternative to $y$. The Confidence Distribution Batch Detection (CDBD) approach [Lindstrom et al.2011] uses the Kullback-Leibler (KL) divergence to compare the classifier output values from two batches. A drift is signaled if the divergence exceeds a threshold. This work is extended in [Kim and Park2017] by substituting the classifier output value with the classifier confidence measurement. Another representative method is the Margin Density Drift Detection (MD3) algorithm [Sethi and Kantardzic2017], which tracks the proportion of samples that are within a classifier (i.e., SVM) margin and uses the active learning strategy in [Žliobaite et al.2014] to interactively query the information source to obtain true labels. Though they do not require true labels for concept drift detection, the major drawback of these unsupervised drift detectors is that they are prone to false positives, as it is difficult to distinguish noise from distribution changes. Moreover, the wrong interpretation of virtual drifts could cause a wrong decision for classifier update, which requires not only more labeled data but also unnecessary classifier retraining [Krawczyk et al.2017].

3 Request-and-Reverify HHT Approach
The observations on the existing supervised and unsupervised concept drift detection methods motivate us to propose the Request-and-Reverify Hierarchical Hypothesis Testing framework (see Fig. 1). Specifically, our Layer-I test operates in a fully unsupervised manner that does not require any labels. Once a potential drift is signaled by Layer-I, the Layer-II test is activated to confirm (or deny) the validity of the suspected drift. The result of Layer-II is fed back to Layer-I to reconfigure or restart Layer-I when needed.
In this way, the upper bound of HHT’s Type-I error is determined by the significance level of its Layer-I test, whereas the lower bound of HHT’s Type-II error is determined by the power of its Layer-I test. Our Layer-I test (like most existing single-layer concept drift detectors) has low Type-II error (i.e., it is able to accurately detect concept drifts), but has relatively higher Type-I error (i.e., it is prone to generate false alarms). The incorporation of the Layer-II test is intended to reduce false alarms, thus decreasing the Type-I error. The cost is that the Type-II error could increase at the same time. In our work, we request true labels to conduct a more precise Layer-II test, so that we can significantly decrease the Type-I error with a minimal increase in the Type-II error.
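The control flow of the two layers can be sketched as follows. This is only an illustrative skeleton, not the authors' implementation: the names `request_and_reverify`, `layer1_alarm` and `layer2_confirm` are hypothetical, and restarting Layer-I after every Layer-II decision is a simplifying assumption.

```python
from typing import Callable, List, Sequence, Tuple


def request_and_reverify(
    stream: Sequence[float],
    layer1_alarm: Callable[[Sequence[float]], bool],
    layer2_confirm: Callable[[int], bool],
) -> Tuple[List[int], int]:
    """Run a two-layer detector over `stream`.

    layer1_alarm sees only the unlabeled statistics observed since the last
    restart. layer2_confirm receives the suspected time index and stands in
    for "request labels, then run the confirmatory test".
    Returns the confirmed drift indices and the number of label requests.
    """
    confirmed: List[int] = []
    label_requests = 0
    start = 0  # beginning of the current Layer-I monitoring segment
    for t in range(len(stream)):
        if not layer1_alarm(stream[start : t + 1]):
            continue
        label_requests += 1  # labels are requested only after a Layer-I alarm
        if layer2_confirm(t):
            confirmed.append(t)
        start = t + 1  # assumption: Layer-I restarts after either outcome
    return confirmed, label_requests
```

The point of the sketch is that the (expensive) labeled test runs only when the unsupervised test raises an alarm, so the number of label requests is bounded by the number of Layer-I alarms.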
3.1 HHT with Classification Uncertainty (HHT-CU)
Our first method, HHT-CU, detects concept drift by tracking the classification uncertainty measurement $u_t = \|\hat{\mathbf{y}}_t - P(y_t|X_t)\|_2$, where $\|\cdot\|_2$ denotes the $\ell_2$ distance, $P(y_t|X_t)$ is the posterior probability estimated by the classifier at time index $t$, and $\hat{\mathbf{y}}_t$ is the target label encoded from the predicted label $\hat{y}_t$ using the 1-of-$K$ coding scheme [Bishop2006]. Intuitively, the distance between $\hat{\mathbf{y}}_t$ and $P(y_t|X_t)$ measures the classification uncertainty for the current classifier, and the statistic derived from this measurement should be stationary (i.e., no “significant” distribution change) in a stable concept. Therefore, a dramatic change of the uncertainty mean value may suggest a potential concept drift.
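As a concrete illustration, the uncertainty of a single prediction can be computed as below. This is a minimal sketch under two stated assumptions: the distance is the Euclidean ($\ell_2$) one, and the encoded target is the one-hot vector of the class with the largest posterior. The function name is hypothetical.

```python
import math
from typing import Sequence


def classification_uncertainty(posterior: Sequence[float]) -> float:
    """l2 distance between a posterior vector and the 1-of-K code of its argmax."""
    # Assumption: the encoded target is the predicted (most probable) class.
    k_hat = max(range(len(posterior)), key=lambda k: posterior[k])
    return math.sqrt(
        sum(((1.0 if k == k_hat else 0.0) - p) ** 2 for k, p in enumerate(posterior))
    )
```

Under these assumptions a fully confident prediction gives 0, and a uniform posterior over $K$ classes gives $\sqrt{(K-1)/K}$, the largest value the measurement can take.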
Different from the existing work that typically monitors the derived statistic with the three-sigma rule in statistical process control [Montgomery2009], we use Hoeffding’s inequality [Hoeffding1963] to monitor the moving average of $u_t$ in our Layer-I test.
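Theorem 1 (Hoeffding’s inequality)
Let $X_1, X_2, \dots, X_n$ be independent random variables such that $X_i \in [a_i, b_i]$, and let $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then for $\varepsilon \ge 0$:
$$P\{\bar{X} - E[\bar{X}] \ge \varepsilon\} \le \exp\left(-\frac{2 n^2 \varepsilon^2}{\sum_{i=1}^{n}(b_i - a_i)^2}\right) \tag{1}$$
where $E$ denotes the expectation. Using this theorem, given a specific significance level $\alpha$, the error $\varepsilon_\alpha$ can be computed as:
$$\varepsilon_\alpha = \sqrt{\frac{\sum_{i=1}^{n}(b_i - a_i)^2}{2 n^2}\ln\frac{1}{\alpha}} \tag{2}$$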
Hoeffding’s inequality does not require any assumption on the probability distribution of the monitored values. This makes it well suited to learning from real data streams [Frías-Blanco et al.2015]. Moreover, Corollary 1.1, proposed by Hoeffding [Hoeffding1963], can be directly applied to detect significant changes in the moving average of streaming values.
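For intuition, the error bound at a given significance level can be evaluated numerically. The sketch below assumes all values share one interval of width `value_range` (that is, $b_i - a_i$ is the same for every $i$), in which case setting the right-hand side of Hoeffding’s inequality equal to $\alpha$ gives $\varepsilon_\alpha = (b - a)\sqrt{\ln(1/\alpha)/(2n)}$. The function name is hypothetical.

```python
import math


def hoeffding_epsilon(n: int, alpha: float, value_range: float = 1.0) -> float:
    """One-sided Hoeffding deviation for the mean of n values in a common interval."""
    if n <= 0 or not 0.0 < alpha < 1.0:
        raise ValueError("need n > 0 and 0 < alpha < 1")
    # Solve exp(-2 n eps^2 / range^2) = alpha for eps.
    return value_range * math.sqrt(math.log(1.0 / alpha) / (2.0 * n))
```

The bound shrinks at rate $1/\sqrt{n}$, so a longer stable segment allows smaller shifts in the mean to be flagged at the same significance level.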
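Corollary 1.1 (Layer-I test of HHT-CU)
If $X_1, X_2, \dots, X_n, X_{n+1}, \dots, X_{n+m}$ are independent random variables with values in the interval $[a, b]$, and if $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $\bar{Z} = \frac{1}{n+m}\sum_{i=1}^{n+m} X_i$, then for $\varepsilon > 0$:
$$P\{\bar{X} - \bar{Z} - (E[\bar{X}] - E[\bar{Z}]) \ge \varepsilon\} \le \exp\left(-\frac{2 n (n+m) \varepsilon^2}{m (b-a)^2}\right) \tag{3}$$
By definition, $u_t \in [0, \sqrt{(K-1)/K}]$, where $K$ is the number of classes. $\bar{X}$ denotes the classification uncertainty moving average before a cutoff point, and $\bar{Z}$ denotes the moving average over the whole sequence. The rule to reject the null hypothesis $H_0: E[\bar{X}] \ge E[\bar{Z}]$ against the alternative one $H_1: E[\bar{X}] < E[\bar{Z}]$ at the significance level $\alpha$ will be $\bar{Z} - \bar{X} \ge \varepsilon_\alpha$, where
$$\varepsilon_\alpha = \sqrt{\frac{K-1}{K}}\sqrt{\frac{m}{2 n (n+m)}\ln\frac{1}{\alpha}} \tag{4}$$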
Regarding the cutoff point, a reliable location can be estimated from the minimum value of $\bar{X}_i + \varepsilon_{\bar{X}_i}$ ($1 \le i \le n+m$) [Gama et al.2004, Frías-Blanco et al.2015]. This is because $\bar{X}_i$ stays approximately constant in a stable concept, so $\varepsilon_{\bar{X}_i}$ must decrease correspondingly as more samples arrive.
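The Layer-I monitoring described above can be sketched as an incremental procedure. This is a hedged illustration rather than the authors' Algorithm 1: the class name is hypothetical, the cutoff is taken where the running mean plus its one-sample Hoeffding bound is smallest, and the alarm threshold assumes the two-sample bound $\varepsilon_\alpha = (b-a)\sqrt{\frac{m}{2n(n+m)}\ln\frac{1}{\alpha}}$, with $n$ samples before the cutoff and $m$ after it.

```python
import math


class UncertaintyMeanMonitor:
    """Layer-I style monitor: flags an increase in the running mean of u_t."""

    def __init__(self, alpha: float, value_range: float) -> None:
        self.alpha = alpha              # significance level of the test
        self.value_range = value_range  # b - a, width of the interval containing u_t
        self.reset()

    def reset(self) -> None:
        self.total_n = 0      # n + m: samples seen since the last restart
        self.total_sum = 0.0
        self.cut_n = 0        # n: samples before the current cutoff point
        self.cut_sum = 0.0

    def _eps_mean(self, n: int) -> float:
        # One-sample Hoeffding deviation of a mean of n bounded values.
        return self.value_range * math.sqrt(math.log(1.0 / self.alpha) / (2.0 * n))

    def update(self, u: float) -> bool:
        """Consume one uncertainty value; return True if a drift is suspected."""
        self.total_n += 1
        self.total_sum += u
        z_bar = self.total_sum / self.total_n
        # Move the cutoff here whenever mean + bound reaches a new minimum.
        if self.cut_n == 0 or (
            z_bar + self._eps_mean(self.total_n)
            <= self.cut_sum / self.cut_n + self._eps_mean(self.cut_n)
        ):
            self.cut_n, self.cut_sum = self.total_n, self.total_sum
            return False
        n, m = self.cut_n, self.total_n - self.cut_n
        x_bar = self.cut_sum / n
        eps = self.value_range * math.sqrt(
            m / (2.0 * n * (n + m)) * math.log(1.0 / self.alpha)
        )
        return z_bar - x_bar >= eps
```

A caller would feed one uncertainty value per incoming sample and call `reset()` once a suspected drift has been handled by the second layer.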
The Layer-II test aims to reduce false positives signaled by Layer-I. Here, we use the permutation test described in [Yu and Abraham2017]. Different from [Yu and Abraham2017], which trains only one classifier on the samples before the suspected drift point and evaluates it on the samples after that point to get a zero-one loss, we also train another classifier on the later samples and evaluate it on the earlier ones to get a second zero-one loss. We reject the null hypothesis if either loss deviates too much from the prediction loss of the shuffled splits. The proposed HHT-CU is summarized in Algorithm 1, where the window size $N$ is set as the number of labeled samples used to train the initial classifier $f$.
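A two-directional permutation test of this kind can be sketched as follows. Several choices here are assumptions made for illustration only: a nearest-centroid classifier stands in for the SVM, the p-values use the common $(1 + \text{count})/(1 + \text{permutations})$ estimate, and the two directions are combined by taking the smaller p-value without any multiple-testing correction. All names are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (feature vector, class label)


def centroid_loss(train: Sequence[Sample], test: Sequence[Sample]) -> float:
    """Zero-one loss of a nearest-centroid classifier (a stand-in for the SVM)."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for x, y in train:
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

    def predict(x: Sequence[float]) -> int:
        return min(
            centroids,
            key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
        )

    return sum(predict(x) != y for x, y in test) / len(test)


def permutation_test(
    s_old: Sequence[Sample],
    s_new: Sequence[Sample],
    n_perm: int = 200,
    alpha: float = 0.05,
    seed: int = 0,
) -> bool:
    """Return True if the ordered losses are extreme relative to shuffled splits."""
    rng = random.Random(seed)
    obs_fwd = centroid_loss(s_old, s_new)  # train on old, test on new
    obs_bwd = centroid_loss(s_new, s_old)  # train on new, test on old
    pooled = list(s_old) + list(s_new)
    n_old = len(s_old)
    ge_fwd = ge_bwd = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        a, b = pooled[:n_old], pooled[n_old:]
        ge_fwd += centroid_loss(a, b) >= obs_fwd
        ge_bwd += centroid_loss(b, a) >= obs_bwd
    p_fwd = (1 + ge_fwd) / (1 + n_perm)
    p_bwd = (1 + ge_bwd) / (1 + n_perm)
    # Assumption: reject if either direction is significant (no correction).
    return min(p_fwd, p_bwd) <= alpha
```

If the concept is unchanged, the ordered split behaves like any shuffled split and the p-values stay large; if the labeling rule changes at the split, the ordered losses are much larger than the shuffled ones.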
3.2 HHT with Attribute-wise “Goodness-of-fit” (HHT-AG)
The general idea behind HHT-AG is to explicitly model $P(X, y)$ with limited access to $y$. To this end, a feasible solution is to detect potential drift points in Layer-I by just modeling $P(X)$, and then require limited labeled data to confirm (or deny) the suspected time index in Layer-II.
The Layer-I test of HHT-AG conducts a “Goodness-of-fit” test on each attribute $x_k$ individually to determine whether $P(X)$ from two windows differs: a baseline (or reference) window $W_1$ containing the first $N$ items of the stream that occur after the last detected change; and a sliding window $W_2$ containing $N$ items that follow $W_1$. We slide $W_2$ one step forward whenever a new item appears on the stream. A potential concept drift is signaled if there is a distribution change for at least one attribute. Factoring $P(X)$ into $\prod_{k} P(x_k)$ for multivariate change detection was initially proposed in [Kifer et al.2004]. Since then, this factorization strategy has become widely used [Žliobaite2010, Reis et al.2016]. Unfortunately, no existing work provides a theoretical foundation for this factorization strategy. In our perspective, one possible explanation is Sklar’s Theorem [Sklar1959], which states that if $F$ is a $d$-dimensional joint distribution function and if $F_1$, $F_2$, …, $F_d$ are its corresponding marginal distribution functions, then there exists a copula $C: [0,1]^d \to [0,1]$ such that:
$$F(x_1, x_2, \dots, x_d) = C(F_1(x_1), F_2(x_2), \dots, F_d(x_d)) \tag{5}$$
The density function (if it exists) can thus be represented as:
$$f(x_1, x_2, \dots, x_d) = c(F_1(x_1), F_2(x_2), \dots, F_d(x_d)) \prod_{k=1}^{d} f_k(x_k)$$
where $c$ is the density of the copula $C$.
Though Sklar does not show practical ways to calculate $C$, this theorem suggests that if $P(X)$ changes, we can infer that one of the $P(x_k)$ should also change; otherwise, if none of the $P(x_k)$ changes, $P(X)$ would not be likely to change.
This paper selects the Kolmogorov-Smirnov (KS) test to measure the discrepancy of $P(x_k)$ in two windows. Specifically, the KS test rejects the null hypothesis, i.e., the observations in sets $A$ and $B$ originate from the same distribution, at significance level $\alpha$ if the following inequality holds:
$$\sup_x |F_A(x) - F_B(x)| > c(\alpha)\sqrt{\frac{n+m}{nm}} \tag{6}$$
where $F_A(x)$ denotes the empirical distribution function of set $A$ (an estimation of the cumulative distribution function $P(X \le x)$), $c(\alpha)$ is a specific value that can be retrieved from a known table, and $n$ and $m$ are the cardinalities of set $A$ and set $B$ respectively.

We then validate the potential drift points by requiring true labels of data that come from $W_1$ and $W_2$ in Layer-II. The Layer-II test of HHT-AG makes the conditionally independent factor assumption [Bishop2006] (a.k.a. the “naive Bayes” assumption), i.e., $P(x_i|x_j, y) = P(x_i|y)$ ($i \ne j$). Thus, the joint distribution $P(X, y)$ can be represented as:
$$P(X, y) = P(y)\prod_{k=1}^{d} P(x_k|y) = P(y)^{1-d}\prod_{k=1}^{d} P(x_k, y) \tag{7}$$
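Before turning to the labeled test, the attribute-wise KS check used in Layer-I can be sketched in a few lines. This is an illustrative batch version, not the incremental algorithm of [Reis et al.2016]. It assumes the usual two-sample rejection rule $D > c(\alpha)\sqrt{(n+m)/(nm)}$ and replaces the table lookup for $c(\alpha)$ with the common asymptotic approximation $c(\alpha) = \sqrt{-\tfrac{1}{2}\ln(\alpha/2)}$. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """sup_x |F_a(x) - F_b(x)| computed by merging the two sorted samples."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] == x:  # advance past ties in a
            i += 1
        while j < len(b) and b[j] == x:  # advance past ties in b
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_reject(a: Sequence[float], b: Sequence[float], alpha: float = 0.05) -> bool:
    """Two-sample KS test using the asymptotic critical value (an assumption)."""
    n, m = len(a), len(b)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return ks_statistic(a, b) > c_alpha * math.sqrt((n + m) / (n * m))


def attributewise_drift(
    reference: Sequence[Sequence[float]],
    current: Sequence[Sequence[float]],
    alpha: float = 0.05,
) -> List[int]:
    """Indices of attributes whose marginal differs between the two windows."""
    d = len(reference[0])
    return [
        k
        for k in range(d)
        if ks_reject([r[k] for r in reference], [c[k] for c in current], alpha)
    ]
```

A non-empty result corresponds to "at least one attribute changed", which is the condition for signaling a potential drift to the second layer.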
According to Eq. (7), we perform $d$ independent two-dimensional (2D) KS tests [Peacock1983] on each bivariate distribution $P(x_k, y)$ individually. The 2D KS test is a generalization of the KS test to the 2D plane. Although the cumulative probability distribution is not well-defined in more than one dimension, Peacock’s insight is that a good surrogate is the integrated probability in each of the four quadrants for a given point $(x_0, y_0)$, i.e., $P(x \le x_0, y \le y_0)$, $P(x \le x_0, y \ge y_0)$, $P(x \ge x_0, y \le y_0)$ and $P(x \ge x_0, y \ge y_0)$. Similarly, a potential drift is confirmed if the 2D KS test rejects the null hypothesis for at least one of the bivariate distributions. HHT-AG is summarized in Algorithm 2, where the window size $N$ is set as the number of labeled samples used to train the initial classifier $f$.

4 Experiments
Two sets of experiments are performed to evaluate the performance of HHT-CU and HHT-AG. First, quantitative metrics and plots are presented to demonstrate the effectiveness of HHT-CU and HHT-AG and their superiority over state-of-the-art approaches on benchmark synthetic data. Then, we validate, via three real-world applications, the effectiveness of the proposed HHT-CU and HHT-AG on streaming data classification and the accuracy of their detected concept drift points. This paper selects the soft-margin SVM as the baseline classifier because of its accuracy and robustness.
4.1 Experimental Setup
We compare the results with three baseline methods, three topline supervised methods, and two state-of-the-art unsupervised methods for concept drift detection. The first two baselines, DDM [Gama et al.2004] and EDDM [Baena-García et al.2006], are the most popular supervised drift detectors. The third one, which we refer to as the Attribute-wise KS test (AKS) [Žliobaite2010, Reis et al.2016], is a benchmark unsupervised drift detector that has proved effective in real applications. Note that AKS is equivalent to the Layer-I test of HHT-AG. The toplines selected for comparison are LFR [Wang and Abraham2015], HLFR [Yu and Abraham2017] and HDDM [Frías-Blanco et al.2015]. HLFR is the first method for concept drift detection with an HHT framework, whereas HDDM introduces Hoeffding’s inequality to concept drift detection. All of these methods operate in a supervised manner and significantly outperform DDM. However, LFR and HLFR can only support binary classification. In addition, we also compare with MD3 [Sethi and Kantardzic2017] and CDBD [Lindstrom et al.2011], the state-of-the-art concept drift detectors that attempt to model $P(y|X)$ without access to $y$. We use the parameters recommended in the respective papers for each competing method. The detailed values of the significance levels or thresholds (where they exist) are shown in Table 1.
Table 1: Significance levels (or thresholds) for each algorithm.
Algorithms  Significance levels (or thresholds) 

HHT-CU  , 
HHT-AG  , 
AKS  
MD3  
HLFR  , , 
LFR  , 
DDM  , 
EDDM  , 
HDDM  , 
4.2 Results on Benchmark Synthetic Data
We first compare the performance of HHT-CU and HHT-AG against the aforementioned concept drift approaches on benchmark synthetic data. Eight datasets are selected from [Souza et al.2015, Dyer et al.2014], namely 2CDT, 2CHT, UG2C2D, MG2C2D, 4CR, 4CREV1, 4CE1CF, 5CVT. Among them, 2CDT, 2CHT, UG2C2D and MG2C2D are binary-class datasets, while 4CR, 4CREV1, 4CE1CF and 5CVT have multiple classes. To facilitate detection evaluation, we cluster each dataset into segments to introduce abrupt drift points, thus controlling ground truth drift points and allowing precise quantitative analysis. Quantitative comparison is performed by evaluating detection quality. To this end, a True Positive (TP) detection is defined as a detection within a fixed delay range after the precise concept change time. A False Negative (FN) is defined as missing a detection within the delay range, and a False Positive (FP) is defined as a detection outside the delay range or an extra detection in the range. The detection quality is measured jointly with Precision, Recall and detection delay using the Precision-Range curve and the Recall-Range curve respectively (see Fig. 2 for an example), where $\text{Precision} = \text{TP}/(\text{TP}+\text{FP})$ and $\text{Recall} = \text{TP}/(\text{TP}+\text{FN})$.
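The counting rules above can be made concrete with a short sketch. It is illustrative only: the function names are hypothetical, the delay range is taken to include the change time itself, the ranges of different drift points are assumed not to overlap, and Precision and Recall use the standard definitions TP/(TP+FP) and TP/(TP+FN).

```python
from typing import Sequence, Tuple


def score_detections(
    true_drifts: Sequence[int], detections: Sequence[int], delay: int
) -> Tuple[int, int, int]:
    """Count (TP, FP, FN) for one allowed detection delay."""
    tp = fp = 0
    matched = set()
    for d in sorted(detections):
        # First unmatched true drift whose delay range contains this detection.
        hit = next(
            (c for c in true_drifts if c <= d <= c + delay and c not in matched),
            None,
        )
        if hit is None:
            fp += 1  # outside every range, or an extra detection inside a range
        else:
            matched.add(hit)
            tp += 1
    fn = len(true_drifts) - len(matched)
    return tp, fp, fn


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Sweeping `delay` over a grid of values and plotting the resulting precision and recall gives curves of the kind used in the evaluation.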
For a straightforward comparison, Table 2 reports the number of required labeled samples (in percentage) for each algorithm, whereas Table 3 summarizes the Normalized Area Under the Curve (NAUC) values for the two kinds of curves. As can be seen, HLFR and LFR provide the most accurate detection, as expected. However, they are only applicable to binary-class datasets and require true labels for the entire data stream. Our proposed HHT-CU and HHT-AG, although slightly inferior to HLFR or LFR, strike the best tradeoff between detection accuracy and the portion of required labels, especially considering their large advantage over MD3 and CDBD, which are the most relevant counterparts. Although the detection modules of MD3 and CDBD operate in a fully unsupervised manner, they either fail to provide reliable detection results or generate too many false positives, which may in turn require even more true labels (for classifier update). Meanwhile, it is very encouraging to find that HHT-CU can achieve comparable or even better results than DDM (i.e., the most popular supervised drift detector) with significantly fewer labels. This suggests that our classification uncertainty is as sensitive as the total classification accuracy in DDM for monitoring the nonstationary environment. We can also see that HHT-AG significantly improves the Precision value compared to AKS. This suggests the effectiveness of the Layer-II test in reverifying the validity of suspected drifts and denying false alarms. In addition, in the extreme case when $P(X)$ remains unchanged but $P(y|X)$ does change, our methods (and the state-of-the-art unsupervised methods) are not able to detect the concept drift, which is a change of the joint distribution $P(X, y)$. This limitation is demonstrated in our experiments on the synthetic 4CR dataset, where $P(X)$ remains the same.
Table 2: Percentage of required labeled samples for each algorithm.
HHT-CU  HHT-AG  AKS  MD3  CDBD  

2CDT  28.97  13.69  
2CHT  28.12  11.71  
UG2C2D  28.43  18.68  31.21  
MG2C2D  21.36  11.83  
4CR  
4CREV1  29.15  20.22  
4CE1CF  12.69  8.33  
5CVT  34.65  35.98 
Table 3: Normalized Area Under the Curve (NAUC) values for the two kinds of curves.
Our methods  Unsupervised methods  Supervised methods  

HHT-CU  HHT-AG  AKS  MD3  CDBD  HLFR  LFR  DDM  EDDM  HDDM  
2CDT  
2CHT  
UG2C2D  
MG2C2D  
4CR  
4CREV1  
4CE1CF  
5CVT 
4.3 Results on Real-world Data
In this section, we evaluate algorithm performance on real-world streaming data classification in a nonstationary environment. Three widely used real-world datasets are selected, namely USENET1 [Katakis et al.2008], Keystroke [Souza et al.2015] and Posture [Kaluža et al.2010]. Descriptions of these three datasets are available in [Yu and Abraham2017, Reis et al.2016]. For each dataset, we also select the same number of labeled instances to train the initial classifier as suggested in [Yu and Abraham2017, Reis et al.2016].
The concept drift detection results and streaming classification results are summarized in Table 4. We measure the cumulative classification accuracy and the portion of required labels to evaluate prediction quality. Since the classes are balanced, the classification accuracy is also a good indicator. In these experiments, our proposed HHT-CU and HHT-AG consistently produce significantly fewer false positives, while maintaining a good true positive rate for concept drift detection. This suggests the effectiveness of the proposed hierarchical architecture for concept drift reverification. HHT-CU achieves the best overall performance in terms of accurate drift detection, streaming classification, and rational utilization of labeled data.
5 Conclusion
This paper presents a novel Hierarchical Hypothesis Testing (HHT) framework with a Request-and-Reverify strategy to detect concept drifts. Two methods, namely HHT with Classification Uncertainty (HHT-CU) and HHT with Attribute-wise “Goodness-of-fit” (HHT-AG), are proposed under this framework. Our methods significantly outperform the state-of-the-art unsupervised counterparts, and are comparable or even superior to the popular supervised methods while using significantly fewer labels. The results indicate our progress on using far fewer labels to perform accurate concept drift detection. The HHT framework is highly effective in deciding label requests and validating detection candidates.



References
 [Alippi et al.2017] Cesare Alippi, Giacomo Boracchi, and Manuel Roveri. Hierarchical change-detection tests. IEEE Trans. Neural Netw. Learn. Syst., 28(2):246–258, 2017.
 [Baena-García et al.2006] Manuel Baena-García, José del Campo-Ávila, et al. Early drift detection method. In Int. Workshop Knowledge Discovery from Data Streams, 2006.
 [Bishop2006] Christopher M Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
 [Brzezinski and Stefanowski2014] Dariusz Brzezinski and Jerzy Stefanowski. Prequential auc for classifier evaluation and drift detection in evolving data streams. In Workshop New Frontiers in Mining Complex Patterns, 2014.
 [Ditzler and Polikar2011] Greg Ditzler and Robi Polikar. Hellinger distance based drift detection for nonstationary environments. In IEEE Symp. CIDUE, pages 41–48, 2011.
 [Ditzler et al.2015] Greg Ditzler, Manuel Roveri, et al. Learning in nonstationary environments: A survey. IEEE Comput. Intell. Mag., 10(4):12–25, 2015.
 [Dries and Rückert2009] Anton Dries and Ulrich Rückert. Adaptive concept drift detection. Statistical Anal. Data Mining: The ASA Data Sci. J., 2(5-6):311–327, 2009.
 [Dyer et al.2014] Karl B Dyer, Robert Capo, and Robi Polikar. Compose: A semi-supervised learning framework for initially labeled nonstationary streaming data. IEEE Trans. Neural Netw. Learn. Syst., 25(1):12–26, 2014.
 [Frías-Blanco et al.2015] Isvani Frías-Blanco, José del Campo-Ávila, et al. Online and non-parametric drift detection methods based on Hoeffding’s bounds. IEEE Trans. Knowl. Data Eng., 27(3):810–823, 2015.

 [Gama et al.2004] Joao Gama, Pedro Medas, et al. Learning with drift detection. In Brazilian Symp. on Artificial Intelligence, pages 286–295. Springer, 2004.
 [Gonçalves et al.2014] Paulo Gonçalves, Silas de Carvalho Santos, et al. A comparative study on concept drift detectors. Expert Syst. Appl., 41(18):8144–8156, 2014.
 [Hoeffding1963] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American statistical association, 58(301):13–30, 1963.
 [Hoens et al.2012] T. Hoens, R. Polikar, and N. Chawla. Learning from streaming data with concept drift and imbalance. Progress in Artificial Intell., 1(1):89–101, 2012.
 [Kaluža et al.2010] Boštjan Kaluža, Violeta Mirchevska, et al. An agent-based approach to care in independent living. Ambient intelligence, pages 177–186, 2010.
 [Katakis et al.2008] I. Katakis, G. Tsoumakas, and I. Vlahavas. An ensemble of classifiers for coping with recurring contexts in data streams. In ECAI, pages 763–764, 2008.
 [Kelly et al.1999] Mark G Kelly, David J Hand, and Niall M Adams. The impact of changing populations on classifier performance. In KDD, pages 367–371, 1999.
 [Kifer et al.2004] Daniel Kifer, Shai Ben-David, and Johannes Gehrke. Detecting change in data streams. In Int. Conf. on Very Large Data Bases, pages 180–191, 2004.
 [Kim and Park2017] Youngin Kim and Cheong Hee Park. An efficient concept drift detection method for streaming data under limited labeling. IEICE Trans. Inf. Syst., 100(10):2537–2546, 2017.
 [Krawczyk et al.2017] Bartosz Krawczyk, Leandro L Minku, et al. Ensemble learning for data stream analysis: a survey. Information Fusion, 37:132–156, 2017.
 [Lindstrom et al.2011] Patrick Lindstrom, Brian Namee, and Sarah Delany. Drift detection using uncertainty distribution divergence. In ICDM Workshops. IEEE, 2011.
 [Montgomery2009] Douglas Montgomery. Introduction to statistical quality control. John Wiley & Sons, 2009.
 [Peacock1983] JA Peacock. Two-dimensional goodness-of-fit testing in astronomy. Monthly Notices of the Royal Astronomical Society, 202(3):615–627, 1983.
 [Reis et al.2016] Denis dos Reis, Peter Flach, et al. Fast unsupervised online drift detection using incremental Kolmogorov-Smirnov test. In KDD, 2016.
 [Sethi and Kantardzic2017] T. Sethi and M. Kantardzic. On the reliable detection of concept drift from streaming unlabeled data. Expert Syst. Appl., 82:77–99, 2017.
 [Sklar1959] M Sklar. Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris, 8:229–231, 1959.
 [Sobolewski and Wozniak2013] Piotr Sobolewski and Michal Wozniak. Concept drift detection and model selection with simulated recurrence and ensembles of statistical detectors. J. UCS, 19(4):462–483, 2013.
 [Souza et al.2015] Vinícius Souza, Diego Silva, et al. Data stream classification guided by clustering on nonstationary environments and extreme verification latency. In SIAM Int. Conf. Data Mining, pages 873–881, 2015.
 [Tsymbal2004] Alexey Tsymbal. The problem of concept drift: definitions and related work. Computer Science Department, Trinity College Dublin, 106(2), 2004.
 [Wang and Abraham2015] Heng Wang and Zubin Abraham. Concept drift detection for streaming data. In IJCNN, pages 1–9. IEEE, 2015.
 [Wang et al.2013] Shuo Wang, Leandro L. Minku, et al. Concept drift detection for online class imbalance learning. In IJCNN, pages 1–10. IEEE, 2013.
 [Widmer and Kubat1993] G. Widmer and M. Kubat. Effective learning in dynamic environments by explicit context tracking. In ECML, pages 227–243. Springer, 1993.
 [Yu and Abraham2017] Shujian Yu and Z. Abraham. Concept drift detection with hierarchical hypothesis testing. In SIAM Int. Conf. Data Mining, pages 768–776, 2017.
 [Žliobaite et al.2014] Indre Žliobaite, Albert Bifet, et al. Active learning with drifting streaming data. IEEE Trans. Neural Netw. Learn. Syst., 25(1):27–39, 2014.
 [Žliobaite2010] Indre Žliobaite. Change with delayed labeling: When is it detectable? In ICDM Workshops, pages 843–850. IEEE, 2010.