In the last decades, numerous efforts have been made to develop algorithms that can learn from data streams. Most traditional methods for this purpose assume the stationarity of the data. However, when the underlying source generating the data stream, i.e., the joint distribution $P_t(X, y)$, is not stationary, the optimal decision rule should change over time. This phenomenon is known as concept drift [Ditzler et al.2015, Krawczyk et al.2017]. Detecting such concept drifts is essential for the algorithm to adapt itself to the evolving data.
Concept drift can manifest two fundamental forms of changes from the Bayesian perspective [Kelly et al.1999]: 1) a change in the marginal probability $P_t(X)$; 2) a change in the posterior probability $P_t(y|X)$. Existing studies in this field primarily concentrate on detecting changes in the posterior distribution $P_t(y|X)$, also known as real drift [Widmer and Kubat1993], as it clearly indicates a change in the optimal decision rule. On the other hand, only a little work aims at detecting virtual drift [Hoens et al.2012], which only affects $P_t(X)$. In practice, one type of concept drift typically appears in combination with the other [Tsymbal2004]. Most methods for real drift detection assume that the true labels are available immediately after the classifier makes a prediction. However, this assumption is over-optimistic, since obtaining labels may require annotation that is expensive in both cost and labor time. Virtual drift detection, though making no use of the true labels $y$, is prone to misinterpretation (i.e., interpreting a virtual drift as a real drift). Such misinterpretation can lead to a wrong decision about updating the classifier, which still requires labeled data [Krawczyk et al.2017].
To address these issues simultaneously, we propose a novel Hierarchical Hypothesis Testing (HHT) framework with a Request-and-Reverify strategy for concept drift detection. HHT incorporates two layers of hypothesis tests. Different from the existing HHT methods [Alippi et al.2017, Yu and Abraham2017], our HHT framework is the first attempt to use labels for concept drift detection only when necessary. It ensures that the test statistic (derived in a fully unsupervised manner) in Layer-I captures the most important properties of the underlying distributions, and adjusts itself well in a more powerful yet conservative manner that only requires labeled data when necessary in Layer-II. Two methods, namely Hierarchical Hypothesis Testing with Classification Uncertainty (HHT-CU) and Hierarchical Hypothesis Testing with Attribute-wise “Goodness-of-fit” (HHT-AG), are proposed under this framework in this paper. The first method incrementally tracks the distribution change with the defined classification uncertainty measurement in Layer-I, and uses a permutation test in Layer-II, whereas the second method uses the standard Kolmogorov-Smirnov (KS) test in Layer-I and the two-dimensional (2D) KS test [Peacock1983] in Layer-II. We test both proposed methods on benchmark datasets. Our methods demonstrate overwhelming advantages over state-of-the-art unsupervised methods. Moreover, though using significantly fewer labels, our methods outperform supervised methods like DDM [Gama et al.2004].
2 Background Knowledge
2.1 Problem Formulation
Given a continuous stream of labeled samples $\{X_i, y_i\}$, $i = 1, \ldots, t$, a classification model $\hat{f}$ can be learned so that $\hat{f}(X_i) \mapsto y_i$. Here, $X_i$ represents a $d$-dimensional feature vector, and $y_i$ is a discrete class label. Let $(X_{t+1}, X_{t+2}, \ldots, X_{t+N})$ be a sequence of new samples that comes chronologically with unknown labels. At time $t+N$, we split the samples in a set $S_A$ of recent ones and a set $S_B$ containing the samples that appear prior to those in $S_A$. The problem of concept drift detection is identifying whether or not the source (i.e., the joint distribution $P_t(X, y)$; the distributions are deliberately subscripted with the time index $t$ to explicitly emphasize their time-varying characteristics) that generates samples in $S_A$ is the same as that in $S_B$ (even without access to the true labels $y_t$) [Ditzler et al.2015, Krawczyk et al.2017]. Once such a drift is found, the machine can request a window of labeled data to update $\hat{f}$ and employ the new classifier to predict labels of incoming data.
2.2 Related Work
The techniques for concept drift detection can be divided into two categories depending on their reliance on labels [Sethi and Kantardzic2017]: supervised (or explicit) drift detectors and unsupervised (or implicit) drift detectors. Supervised Drift Detectors rely heavily on true labels, as they typically monitor an error metric associated with classification loss. Although much progress has been made on concept drift detection in the supervised manner, its assumption that the ground truth labels are available immediately for all already classified instances is typically over-optimistic. Unsupervised Drift Detectors, on the other hand, attempt to detect concept drifts without using true labels. Most unsupervised concept drift detection methods concentrate on performing multivariate statistical tests to detect changes in the feature values $P_t(X)$, such as the Conjunctive Normal Form (CNF) density estimation test [Dries and Rückert2009] and the Hellinger distance based density estimation test [Ditzler and Polikar2011]. Considering their high computational complexity, an alternative approach is to conduct a univariate test on each attribute of the features independently. For example, [Reis et al.2016] develops an incremental (sequential) KS test which can achieve exactly the same performance as the conventional batch-based KS test.
Besides modeling virtual drifts of $P_t(X)$, recent research in unsupervised drift detection attempts to model the real drifts by monitoring the classifier output $\hat{y}_t$ or posterior probability as an alternative to $y_t$. The Confidence Distribution Batch Detection (CDBD) approach [Lindstrom et al.2011] uses Kullback-Leibler (KL) divergence to compare the classifier output values from two batches. A drift is signaled if the divergence exceeds a threshold. This work is extended in [Kim and Park2017] by substituting the classifier output value with the classifier confidence measurement. Another representative method is the Margin Density Drift Detection (MD3) algorithm [Sethi and Kantardzic2017], which tracks the proportion of samples that are within a classifier (i.e., SVM) margin and uses the active learning strategy in [Žliobaite et al.2014] to interactively query the information source to obtain true labels. Though they do not require true labels for concept drift detection, the major drawback of these unsupervised drift detectors is that they are prone to false positives, as it is difficult to distinguish noise from distribution changes. Moreover, the misinterpretation of virtual drifts can lead to a wrong decision about updating the classifier, which requires not only more labeled data but also unnecessary classifier re-training [Krawczyk et al.2017].
3 Request-and-Reverify HHT Approach
The observations on the existing supervised and unsupervised concept drift detection methods motivate us to propose the Request-and-Reverify Hierarchical Hypothesis Testing framework (see Fig. 1). Specifically, our Layer-I test operates in a fully unsupervised manner that does not require any labels. Once a potential drift is signaled by Layer-I, the Layer-II test is activated to confirm (or deny) the validity of the suspected drift. The result of Layer-II is fed back to Layer-I to reconfigure or restart it when needed.
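The control flow described above can be sketched as follows. This is only an illustrative skeleton: the callables `layer1_suspects`, `layer2_confirms` and `request_labels` are hypothetical placeholders for the two tests and the labeling oracle, and restarting Layer-I after every Layer-II decision is a simplifying assumption rather than the exact reconfiguration rule of the framework.

```python
from typing import Callable, Iterable, List, Sequence

Features = Sequence[float]


def request_and_reverify(
    stream: Iterable[Features],
    layer1_suspects: Callable[[List[Features]], bool],
    layer2_confirms: Callable[[List[Features], List[int]], bool],
    request_labels: Callable[[List[Features]], List[int]],
) -> List[int]:
    """Return the time indices at which a suspected drift was confirmed."""
    confirmed: List[int] = []
    window: List[Features] = []
    for t, x in enumerate(stream):
        window.append(x)
        # Layer-I: fully unsupervised test on the unlabeled window.
        if not layer1_suspects(window):
            continue
        # Labels are requested only after Layer-I raises a suspicion.
        labels = request_labels(window)
        # Layer-II: supervised test that confirms or denies the suspicion.
        if layer2_confirms(window, labels):
            confirmed.append(t)
        # Simplification: restart Layer-I from an empty window either way.
        window = []
    return confirmed
```

In this sketch the number of calls to `request_labels` equals the number of Layer-I alarms, which is how the framework keeps label usage tied to suspected drifts only.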
In this way, the upper bound of HHT’s Type-I error is determined by the significance level of its Layer-I test, whereas the lower bound of HHT’s Type-II error is determined by the power of its Layer-I test. Our Layer-I test (and most existing single layer concept drift detectors) has low Type-II error (i.e., is able to accurately detect concept drifts), but has relatively higher Type-I error (i.e., is prone to generate false alarms). The incorporation of the Layer-II test is supposed to reduce false alarms, thus decreasing the Type-I error. The cost is that the Type-II error could be increased at the same time. In our work, we request true labels to conduct a more precise Layer-II test, so that we can significantly decrease the Type-I error with minimum increase in the Type-II error.
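The two bounds can be written out explicitly under the assumption that HHT reports a drift only when both layers reject. The symbols $R_1$ and $R_2$ (rejection events of Layer-I and Layer-II), $\alpha_1$ (Type-I error of Layer-I) and $\beta_1$ (Type-II error of Layer-I) are introduced here for illustration only:

```latex
\begin{align*}
P(\text{false alarm})  &= P(R_1 \cap R_2 \mid H_0) \;\le\; P(R_1 \mid H_0) = \alpha_1, \\
P(\text{missed drift}) &= 1 - P(R_1 \cap R_2 \mid H_1) \;\ge\; 1 - P(R_1 \mid H_1) = \beta_1 .
\end{align*}
```

Both inequalities follow from $R_1 \cap R_2 \subseteq R_1$: adding Layer-II can only remove alarms, so it can lower the false alarm rate but never recover a drift that Layer-I missed.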
3.1 HHT with Classification Uncertainty (HHT-CU)
Our first method, HHT-CU, detects concept drift by tracking the classification uncertainty measurement $u_t = \|\hat{y}_t - \hat{P}(y_t|X_t)\|_2$, where $\|\cdot\|_2$ denotes the $\ell_2$ distance, $\hat{P}(y_t|X_t)$ is the posterior probability estimated by the classifier at time index $t$, and $\hat{y}_t$ is the target label encoded from $\hat{P}(y_t|X_t)$ using the $1$-of-$K$ coding scheme [Bishop2006]. Intuitively, the distance between $\hat{y}_t$ and $\hat{P}(y_t|X_t)$ measures the classification uncertainty for the current classifier, and the statistic derived from this measurement should be stationary (i.e., no “significant” distribution change) in a stable concept. Therefore, a dramatic change in the mean uncertainty may suggest a potential concept drift.
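As a minimal sketch of this measurement, assuming it is the $\ell_2$ distance between the estimated posterior vector and its 1-of-K (one-hot) encoding at the most probable class, the uncertainty of a single prediction could be computed as below. The function name `classification_uncertainty` is hypothetical.

```python
import math
from typing import Sequence


def classification_uncertainty(posterior: Sequence[float]) -> float:
    """l2 distance between a posterior vector and its 1-of-K encoding."""
    # Index of the most probable class (first one on ties).
    k_star = max(range(len(posterior)), key=lambda k: posterior[k])
    # 1-of-K encoding of the predicted label.
    one_hot = [1.0 if k == k_star else 0.0 for k in range(len(posterior))]
    return math.sqrt(sum((h - p) ** 2 for h, p in zip(one_hot, posterior)))
```

A fully confident prediction gives 0, while a uniform posterior over K classes gives the largest possible value, sqrt((K-1)/K).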
Different from the existing work that typically monitors the derived statistic with the three-sigma rule in statistical process control [Montgomery2009], we use Hoeffding’s inequality [Hoeffding1963] to monitor the moving average of $u_t$ in our Layer-I test.
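Theorem 1 (Hoeffding’s inequality)
Let $X_1$, $X_2$, …, $X_n$ be independent random variables such that $X_i \in [a_i, b_i]$, and let $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$, then for $\varepsilon \geq 0$:
$$P\{\bar{X} - E[\bar{X}] \geq \varepsilon\} \leq \exp\left(\frac{-2n^2\varepsilon^2}{\sum_{i=1}^{n}(b_i - a_i)^2}\right),$$
where $E$ denotes the expectation. Using this theorem, given a specific significance level $\alpha$, the error $\varepsilon_\alpha$ can be computed as:
$$\varepsilon_\alpha = \sqrt{\frac{\sum_{i=1}^{n}(b_i - a_i)^2}{2n^2}\ln\frac{1}{\alpha}}.$$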
Hoeffding’s inequality does not require any assumption on the probability distribution of the $X_i$. This makes it well suited to learning from real data streams [Frías-Blanco et al.2015]. Moreover, Corollary 1.1, proposed by Hoeffding [Hoeffding1963], can be directly applied to detect significant changes in the moving average of streaming values.
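Corollary 1.1 (Layer-I test of HHT-CU)
If $X_1$, $X_2$, …, $X_n$, $X_{n+1}$, …, $X_{n+m}$ are independent random variables with values in the interval $[a, b]$, and if $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $\bar{Z} = \frac{1}{n+m}\sum_{i=1}^{n+m} X_i$, then for $\varepsilon \geq 0$:
$$P\{\bar{X} - \bar{Z} - (E[\bar{X}] - E[\bar{Z}]) \geq \varepsilon\} \leq \exp\left(\frac{-2n(n+m)\varepsilon^2}{m(b-a)^2}\right).$$
By definition, $u_t \in [0, \sqrt{(K-1)/K}]$, where $K$ is the number of classes. $\bar{X}$ denotes the classification uncertainty moving average before a cutoff point, and $\bar{Z}$ denotes the moving average over the whole sequence. The rule to reject the null hypothesis $H_0: E[\bar{X}] \geq E[\bar{Z}]$ against the alternative $H_1: E[\bar{X}] < E[\bar{Z}]$ at the significance level $\alpha$ will be $\bar{Z} - \bar{X} \geq \varepsilon_\alpha$, where
$$\varepsilon_\alpha = \sqrt{\frac{K-1}{K}} \times \sqrt{\frac{m}{2n(n+m)}\ln\frac{1}{\alpha}}.$$
Regarding the cutoff point, a reliable location can be estimated from the minimum value of $\bar{X}_i + \varepsilon_{\bar{X}_i}$ ($1 \leq i \leq n+m$) [Gama et al.2004, Frías-Blanco et al.2015]. This is because $\bar{X}_i$ keeps approximately constant in a stable concept, thus $\varepsilon_{\bar{X}_i}$ must reduce its value correspondingly.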
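The rejection rule of the Layer-I test can be sketched in a few lines, assuming the bound of Corollary 1.1 with values in $[0, \sqrt{(K-1)/K}]$ and a cutoff point supplied by the caller. The function names and the batch (non-incremental) computation of the two averages are illustrative assumptions; an online implementation would update the sums incrementally.

```python
import math
from typing import Sequence


def hoeffding_epsilon(n: int, m: int, num_classes: int, alpha: float) -> float:
    """Threshold on Z_bar - X_bar for values in [0, sqrt((K-1)/K)]."""
    value_range = math.sqrt((num_classes - 1) / num_classes)
    return value_range * math.sqrt(m / (2.0 * n * (n + m)) * math.log(1.0 / alpha))


def layer1_rejects(u: Sequence[float], cutoff: int, num_classes: int,
                   alpha: float = 0.01) -> bool:
    """Reject H0 when the overall mean exceeds the pre-cutoff mean by epsilon."""
    n, m = cutoff, len(u) - cutoff
    if n <= 0 or m <= 0:
        raise ValueError("cutoff must split the sequence into two non-empty parts")
    x_bar = sum(u[:cutoff]) / n   # moving average before the cutoff point
    z_bar = sum(u) / len(u)       # moving average over the whole sequence
    return z_bar - x_bar >= hoeffding_epsilon(n, m, num_classes, alpha)
```

For example, with 200 uncertainty values of 0.1 followed by 200 values of 0.6 in a binary problem, the gap between the two averages is 0.25 while the threshold at $\alpha = 0.01$ is roughly 0.054, so the test signals a potential drift.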
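The Layer-II test aims to reduce false positives signaled by Layer-I. Here, we use the permutation test described in [Yu and Abraham2017]. Different from [Yu and Abraham2017], which trains only one classifier $f_{\mathrm{ord}}$ using $S_{\mathrm{ord}}$ and evaluates it on $S'_{\mathrm{ord}}$ to get a zero-one loss $\hat{E}_{\mathrm{ord}}$, we train another classifier $f'_{\mathrm{ord}}$ using $S'_{\mathrm{ord}}$ and evaluate it on $S_{\mathrm{ord}}$ to get another zero-one loss $\hat{E}'_{\mathrm{ord}}$, where $S_{\mathrm{ord}}$ and $S'_{\mathrm{ord}}$ denote the two ordered sample sets split at the suspected drift point. We reject the null hypothesis if either $\hat{E}_{\mathrm{ord}}$ or $\hat{E}'_{\mathrm{ord}}$ deviates too much from the prediction loss of the shuffled splits. The proposed HHT-CU is summarized in Algorithm 1, where the window size $N$ is set as the number of labeled samples used to train the initial classifier $\hat{f}$.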
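A rough sketch of such a two-directional permutation test is given below. Several choices are assumptions made only to keep the example self-contained: samples are one-dimensional, the classifier is a nearest-class-mean rule rather than the SVM used in the experiments, "deviates too much" is implemented as an upper-tail empirical p-value, and the names `zero_one_loss` and `permutation_test` are hypothetical.

```python
import random
from typing import Sequence, Tuple

Sample = Tuple[float, int]  # (1-D feature, class label); illustrative only


def zero_one_loss(train: Sequence[Sample], test: Sequence[Sample]) -> float:
    """Fit a nearest-class-mean classifier on `train`; return its error on `test`."""
    means = {}
    for label in {y for _, y in train}:
        xs = [x for x, y in train if y == label]
        means[label] = sum(xs) / len(xs)
    errors = sum(min(means, key=lambda c: abs(x - means[c])) != y for x, y in test)
    return errors / len(test)


def permutation_test(before: Sequence[Sample], after: Sequence[Sample],
                     n_perm: int = 200, alpha: float = 0.05, seed: int = 0) -> bool:
    """Return True when the ordered losses are extreme among shuffled splits."""
    rng = random.Random(seed)
    loss_fwd = zero_one_loss(before, after)  # train before the point, test after
    loss_bwd = zero_one_loss(after, before)  # train after the point, test before
    pooled = list(before) + list(after)
    n_before = len(before)
    ge_fwd = ge_bwd = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # destroy the temporal ordering
        a, b = pooled[:n_before], pooled[n_before:]
        ge_fwd += zero_one_loss(a, b) >= loss_fwd
        ge_bwd += zero_one_loss(b, a) >= loss_bwd
    p_fwd = (ge_fwd + 1) / (n_perm + 1)
    p_bwd = (ge_bwd + 1) / (n_perm + 1)
    # Confirm the drift if either direction is significant.
    return min(p_fwd, p_bwd) <= alpha
```

Testing both directions means a drift is still caught when only one of the two classifiers degrades on the other window. No multiple-testing correction is applied across the two directions in this sketch.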
3.2 HHT with Attribute-wise “Goodness-of-fit” (HHT-AG)
The general idea behind HHT-AG is to explicitly model $P_t(X, y)$ with limited access to $y$. To this end, a feasible solution is to detect potential drift points in Layer-I by just modeling $P_t(X)$, and then require limited labeled data to confirm (or deny) the suspected time index in Layer-II.
The Layer-I test of HHT-AG conducts a “Goodness-of-fit” test on each attribute $x_i$ individually to determine whether $P_t(X)$ from two windows differ: a baseline (or reference) window $W_1$ containing the first $N$ items of the stream that occur after the last detected change; and a sliding window $W_2$ containing $N$ items that follow $W_1$. We slide $W_2$ one step forward whenever a new item appears on the stream. A potential concept drift is signaled if the distribution changes for at least one attribute. Factoring $P_t(X)$ into $\prod_{i=1}^{d} P_t(x_i)$ for multivariate change detection was initially proposed in [Kifer et al.2004]. Since then, this factorization strategy has become widely used [Žliobaite2010, Reis et al.2016]. Unfortunately, no existing work provides a theoretical foundation for this factorization strategy. In our perspective, one possible explanation is Sklar’s Theorem [Sklar1959], which states that if $F$ is a $d$-dimensional joint distribution function and if $F_1$, $F_2$, …, $F_d$ are its corresponding marginal distribution functions, then there exists a $d$-copula $C$: $[0,1]^d \to [0,1]$ such that:
$$F(x_1, x_2, \ldots, x_d) = C(F_1(x_1), F_2(x_2), \ldots, F_d(x_d)).$$
The density function (if it exists) can thus be represented as:
$$f(x_1, x_2, \ldots, x_d) = c(F_1(x_1), F_2(x_2), \ldots, F_d(x_d)) \prod_{i=1}^{d} f_i(x_i),$$
where $c$ is the density of the copula $C$.
Though Sklar does not show practical ways to calculate $C$, this theorem suggests that if $P_t(X)$ changes, we can infer that one of the $P_t(x_i)$ should also change; otherwise, if none of the $P_t(x_i)$ changes, $P_t(X)$ would not be likely to change.
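A worked special case may make the argument concrete. Assuming the attributes are independent (an assumption made here only for illustration), the copula is the product copula and the joint distribution factorizes exactly into its marginals:

```latex
\begin{align*}
C(u_1, \ldots, u_d) = \prod_{i=1}^{d} u_i
\quad &\Longrightarrow \quad
F(x_1, \ldots, x_d) = \prod_{i=1}^{d} F_i(x_i), \\
c(u_1, \ldots, u_d) = 1
\quad &\Longrightarrow \quad
f(x_1, \ldots, x_d) = \prod_{i=1}^{d} f_i(x_i).
\end{align*}
```

In this case any change in the joint distribution must appear in at least one marginal. With a general copula, a change confined to $C$ alone would leave every marginal untouched, which is why the statement above is a plausibility argument rather than a guarantee.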
This paper selects the Kolmogorov-Smirnov (KS) test to measure the discrepancy of $P_t(x_i)$ in two windows. Specifically, the KS test rejects the null hypothesis, i.e., the observations in sets $A$ and $B$ originate from the same distribution, at significance level $\alpha$ if the following inequality holds:
$$\sup_x |F_A(x) - F_B(x)| > s(\alpha)\sqrt{\frac{m+n}{mn}},$$
where $F_C(x) = \frac{1}{|C|}\sum_{c \in C} \mathbf{1}_{\{c \leq x\}}$ denotes the empirical distribution function (an estimation to the cumulative distribution function), $s(\alpha)$ is an $\alpha$-specific value that can be retrieved from a known table, and $m$ and $n$ are the cardinality of set $A$ and set $B$ respectively.
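The attribute-wise test can be sketched as follows. This is a batch version for illustration only, not the incremental KS test of [Reis et al.2016]; the function names are hypothetical, and the table lookup for the $\alpha$-specific value is replaced by the common large-sample approximation $\sqrt{-\frac{1}{2}\ln(\alpha/2)}$, which is an assumption of this sketch.

```python
import math
from bisect import bisect_right
from typing import Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """sup_x |F_A(x) - F_B(x)|, evaluated at every observed value."""
    sa, sb = sorted(a), sorted(b)
    # Both empirical CDFs are step functions, so the supremum of their
    # difference is attained at one of the observed values.
    return max(
        abs(bisect_right(sa, x) / len(sa) - bisect_right(sb, x) / len(sb))
        for x in sa + sb
    )


def ks_rejects(a: Sequence[float], b: Sequence[float], alpha: float = 0.05) -> bool:
    """Two-sample KS test using the asymptotic critical value (assumed here)."""
    m, n = len(a), len(b)
    s_alpha = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return ks_statistic(a, b) > s_alpha * math.sqrt((m + n) / (m * n))


def layer1_signals_drift(reference: Sequence[Sequence[float]],
                         current: Sequence[Sequence[float]],
                         alpha: float = 0.05) -> bool:
    """Signal a potential drift if any single attribute fails the KS test."""
    d = len(reference[0])
    return any(
        ks_rejects([r[i] for r in reference], [c[i] for c in current], alpha)
        for i in range(d)
    )
```

Because the same significance level is applied to each of the $d$ attributes without correction, the chance of at least one false rejection grows with $d$, which is consistent with Layer-I being sensitive but prone to false alarms.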
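We then validate the potential drift points by requiring true labels of data that come from the baseline window $W_1$ and the sliding window $W_2$ in Layer-II. The Layer-II test of HHT-AG makes the conditionally independent factor assumption [Bishop2006] (a.k.a. the “naive Bayes” assumption), i.e., $P_t(x_i|x_j, y) = P_t(x_i|y)$ ($1 \leq i \neq j \leq d$). Thus, the joint distribution $P_t(X, y)$ can be represented as:
$$P_t(X, y) = P_t(y)\prod_{i=1}^{d} P_t(x_i|y) = P_t(y)^{1-d}\prod_{i=1}^{d} P_t(x_i, y).$$
Accordingly, we perform $d$ independent 2D KS tests [Peacock1983], one on each bivariate distribution $P_t(x_i, y)$ individually. The 2D KS test is a generalization of the KS test to the 2D plane. Although the cumulative probability distribution is not well-defined in more than one dimension, Peacock’s insight is that a good surrogate is the integrated probability in each of the four quadrants for a given point $(a, b)$, i.e., $P(x_i \leq a, y \leq b)$, $P(x_i \leq a, y \geq b)$, $P(x_i \geq a, y \leq b)$ and $P(x_i \geq a, y \geq b)$. Similarly, a potential drift is confirmed if the 2D KS test rejects the null hypothesis for at least one of the $d$ bivariate distributions. HHT-AG is summarized in Algorithm 2, where the window size $N$ is set as the number of labeled samples used to train the initial classifier $\hat{f}$.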
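The quadrant-based statistic can be sketched with a brute-force search. The code below only computes the test statistic, taking quadrant origins on the grid formed by all observed coordinates; the significance computation is omitted, the inclusive quadrant boundaries follow the description above, and the function names are hypothetical.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]  # (attribute value, class label)


def quadrant_fractions(points: Sequence[Point], x0: float, y0: float):
    """Fraction of points falling in each of the four quadrants around (x0, y0)."""
    n = len(points)
    return (
        sum(x <= x0 and y <= y0 for x, y in points) / n,
        sum(x <= x0 and y >= y0 for x, y in points) / n,
        sum(x >= x0 and y <= y0 for x, y in points) / n,
        sum(x >= x0 and y >= y0 for x, y in points) / n,
    )


def ks2d_statistic(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Largest quadrant-probability gap between two bivariate samples."""
    pooled = list(a) + list(b)
    xs = sorted({x for x, _ in pooled})
    ys = sorted({y for _, y in pooled})
    best = 0.0
    for x0 in xs:          # every observed x-coordinate ...
        for y0 in ys:      # ... paired with every observed y-coordinate
            fa = quadrant_fractions(a, x0, y0)
            fb = quadrant_fractions(b, x0, y0)
            best = max(best, max(abs(p - q) for p, q in zip(fa, fb)))
    return best
```

A small example shows why the bivariate test adds information: if the class labels of two clusters are swapped between the windows, both the attribute marginal and the label marginal stay the same, yet the statistic is 0.5.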
4 Experiments
Two sets of experiments are performed to evaluate the performance of HHT-CU and HHT-AG. First, quantitative metrics and plots are presented to demonstrate HHT-CU and HHT-AG’s effectiveness and superiority over state-of-the-art approaches on benchmark synthetic data. Then, we validate, via three real-world applications, the effectiveness of the proposed HHT-CU and HHT-AG on streaming data classification and the accuracy of their detected concept drift points. This paper selects the soft margin SVM as the baseline classifier because of its accuracy and robustness.
4.1 Experimental Setup
We compare the results with three baseline methods, three topline supervised methods, and two state-of-the-art unsupervised methods for concept drift detection. The first two baselines, DDM [Gama et al.2004] and EDDM [Baena-García et al.2006], are the most popular supervised drift detectors. The third one, which we refer to as the Attribute-wise KS test (A-KS) [Žliobaite2010, Reis et al.2016], is a benchmark unsupervised drift detector that has proved effective in real applications. Note that A-KS is equivalent to the Layer-I test of HHT-AG. The toplines selected for comparison are LFR [Wang and Abraham2015], HLFR [Yu and Abraham2017] and HDDM [Frías-Blanco et al.2015]. HLFR is the first concept drift detection method built on the HHT framework, whereas HDDM introduces Hoeffding’s inequality to concept drift detection. All of these methods operate in a supervised manner and significantly outperform DDM. However, LFR and HLFR only support binary classification. In addition, we also compare with MD3 [Sethi and Kantardzic2017] and CDBD [Lindstrom et al.2011], the state-of-the-art concept drift detectors that attempt to model $P_t(y|X)$ without access to $y$. We use the parameters recommended in the papers for each competing method. The detailed values of the significance levels or thresholds (if any) are shown in Table 1.
Table 1: Algorithms and their significance levels (or thresholds).
4.2 Results on Benchmark Synthetic Data
We first compare the performance of HHT-CU and HHT-AG against the aforementioned concept drift approaches on benchmark synthetic data. Eight datasets are selected from [Souza et al.2015, Dyer et al.2014], namely 2CDT, 2CHT, UG-2C-2D, MG-2C-2D, 4CR, 4CRE-V1, 4CE1CF, 5CVT. Among them, 2CDT, 2CHT, UG-2C-2D and MG-2C-2D are binary-class datasets, while 4CR, 4CRE-V1, 4CE1CF and 5CVT have multiple classes. To facilitate detection evaluation, we cluster each dataset into segments to introduce abrupt drift points, thus controlling ground truth drift points and allowing precise quantitative analysis. Quantitative comparison is performed by evaluating detection quality. To this end, a True Positive (TP) detection is defined as a detection within a fixed delay range after the precise concept change time. A False Negative (FN) is defined as missing a detection within the delay range, and a False Positive (FP) is defined as a detection outside the delay range or an extra detection in the range. The detection quality is measured jointly with Precision, Recall and detection delay using the Precision-Range curve and the Recall-Range curve respectively (see Fig. 2 for an example), where $\mathrm{Precision} = \mathrm{TP}/(\mathrm{TP}+\mathrm{FP})$, and $\mathrm{Recall} = \mathrm{TP}/(\mathrm{TP}+\mathrm{FN})$.
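The TP, FN and FP definitions above translate into a short scoring routine. The sketch below assumes distinct ground-truth drift times and matches each detection to the earliest unmatched drift whose delay range contains it; the function name `score_detections` and this tie-breaking rule are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def score_detections(true_drifts: Sequence[int], detections: Sequence[int],
                     delay_range: int) -> Tuple[float, float]:
    """Return (precision, recall) for one value of the allowed delay range."""
    matched = set()
    tp = fp = 0
    for det in sorted(detections):
        # Ground-truth drifts whose delay range contains this detection
        # and that have not been credited with a detection yet.
        candidates = [t for t in sorted(true_drifts)
                      if t <= det <= t + delay_range and t not in matched]
        if candidates:
            matched.add(candidates[0])
            tp += 1
        else:
            fp += 1  # outside every range, or an extra detection in a range
    fn = len(true_drifts) - len(matched)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

For instance, with true drifts at 100, 300 and 500, detections at 105, 120, 250 and 310, and a delay range of 50, there are two true positives, two false positives and one false negative, giving a precision of 0.5 and a recall of 2/3. Sweeping `delay_range` over a grid of values yields the points of the two curves.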
For a straightforward comparison, Table 2 reports the number of required labeled samples (in percentage) for each algorithm, whereas Table 3 summarizes the Normalized Area Under the Curve (NAUC) values for the two kinds of curves. As can be seen, HLFR and LFR provide the most accurate detection, as expected. However, they are only applicable to binary-class datasets and require true labels for the entire data stream. Our proposed HHT-CU and HHT-AG, although slightly inferior to HLFR or LFR, strike the best tradeoff between detection accuracy and the portion of required labels, especially considering their overwhelming advantage over MD3 and CDBD, the most relevant counterparts. Although the detection modules of MD3 and CDBD operate in a fully unsupervised manner, they either fail to provide reliable detection results or generate too many false positives, which may in turn require even more true labels (for classifier update). Meanwhile, it is very encouraging to find that HHT-CU can achieve comparable or even better results than DDM (i.e., the most popular supervised drift detector) with significantly fewer labels. This suggests that our classification uncertainty is as sensitive as the total classification accuracy in DDM for monitoring the nonstationary environment. We can also see that HHT-AG significantly improves the Precision value compared to A-KS. This suggests the effectiveness of the Layer-II test in reverifying the validity of suspected drifts and denying false alarms. In addition, in the extreme cases when $P_t(X)$ remains unchanged but $P_t(y|X)$ does change, our methods (and the state-of-the-art unsupervised methods) are not able to detect the concept drift, which is a change of the joint distribution $P_t(X, y)$. This limitation is demonstrated in our experiments on the synthetic 4CR dataset, where $P_t(X)$ remains the same.
|Our methods||Unsupervised methods||Supervised methods|
4.3 Results on Real-world Data
In this section, we evaluate algorithm performance on real-world streaming data classification in a non-stationary environment. Three widely used real-world datasets are selected, namely USENET1 [Katakis et al.2008], Keystroke [Souza et al.2015] and Posture [Kaluža et al.2010]. Descriptions of these three datasets are available in [Yu and Abraham2017, Reis et al.2016]. For each dataset, we also select the same number of labeled instances to train the initial classifier as suggested in [Yu and Abraham2017, Reis et al.2016].
The concept drift detection results and streaming classification results are summarized in Table 4. We measure the cumulative classification accuracy and the portion of required labels to evaluate prediction quality. Since the classes are balanced, the classification accuracy is also a good indicator. In these experiments, our proposed HHT-CU and HHT-AG always produce significantly fewer false positives, while maintaining a good true positive rate for concept drift detection. This suggests the effectiveness of the proposed hierarchical architecture for concept drift reverification. HHT-CU achieves the best overall performance in terms of accurate drift detection, streaming classification, and rational utilization of labeled data.
5 Conclusions
This paper presents a novel Hierarchical Hypothesis Testing (HHT) framework with a Request-and-Reverify strategy to detect concept drifts. Two methods, namely HHT with Classification Uncertainty (HHT-CU) and HHT with Attribute-wise “Goodness-of-fit” (HHT-AG), are proposed respectively under this framework. Our methods significantly outperform the state-of-the-art unsupervised counterparts, and are even comparable or superior to the popular supervised methods with significantly fewer labels. The results indicate our progress on using far fewer labels to perform accurate concept drift detection. The HHT framework is highly effective in deciding label requests and validating detection candidates.
References
- [Alippi et al.2017] Cesare Alippi, Giacomo Boracchi, and Manuel Roveri. Hierarchical change-detection tests. IEEE Trans. Neural Netw. Learn. Syst., 28(2):246–258, 2017.
- [Baena-García et al.2006] Manuel Baena-García, José del Campo-Ávila, et al. Early drift detection method. In Int. Workshop Knowledge Discovery from Data Streams, 2006.
- [Bishop2006] Christopher M Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
- [Brzezinski and Stefanowski2014] Dariusz Brzezinski and Jerzy Stefanowski. Prequential AUC for classifier evaluation and drift detection in evolving data streams. In Workshop New Frontiers in Mining Complex Patterns, 2014.
- [Ditzler and Polikar2011] Greg Ditzler and Robi Polikar. Hellinger distance based drift detection for nonstationary environments. In IEEE Symp. CIDUE, pages 41–48, 2011.
- [Ditzler et al.2015] Greg Ditzler, Manuel Roveri, et al. Learning in nonstationary environments: A survey. IEEE Comput. Intell. Mag., 10(4):12–25, 2015.
- [Dries and Rückert2009] Anton Dries and Ulrich Rückert. Adaptive concept drift detection. Statistical Anal. Data Mining: The ASA Data Sci. J., 2(5-6):311–327, 2009.
- [Dyer et al.2014] Karl B Dyer, Robert Capo, and Robi Polikar. Compose: A semisupervised learning framework for initially labeled nonstationary streaming data. IEEE Trans. Neural Netw. Learn. Syst., 25(1):12–26, 2014.
- [Frías-Blanco et al.2015] Isvani Frías-Blanco, José del Campo-Ávila, et al. Online and non-parametric drift detection methods based on Hoeffding’s bounds. IEEE Trans. Knowl. Data Eng., 27(3):810–823, 2015.
- [Gama et al.2004] Joao Gama, Pedro Medas, et al. Learning with drift detection. In Brazilian Symp. on Artificial Intelligence, pages 286–295. Springer, 2004.
- [Gonçalves et al.2014] Paulo Gonçalves, Silas de Carvalho Santos, et al. A comparative study on concept drift detectors. Expert Syst. Appl., 41(18):8144–8156, 2014.
- [Hoeffding1963] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American statistical association, 58(301):13–30, 1963.
- [Hoens et al.2012] T. Hoens, R. Polikar, and N. Chawla. Learning from streaming data with concept drift and imbalance. Progress in Artificial Intell., 1(1):89–101, 2012.
- [Kaluža et al.2010] Boštjan Kaluža, Violeta Mirchevska, et al. An agent-based approach to care in independent living. Ambient intelligence, pages 177–186, 2010.
- [Katakis et al.2008] I. Katakis, G. Tsoumakas, and I. Vlahavas. An ensemble of classifiers for coping with recurring contexts in data streams. In ECAI, pages 763–764, 2008.
- [Kelly et al.1999] Mark G Kelly, David J Hand, and Niall M Adams. The impact of changing populations on classifier performance. In KDD, pages 367–371, 1999.
- [Kifer et al.2004] Daniel Kifer, Shai Ben-David, and Johannes Gehrke. Detecting change in data streams. In Int. Conf. on Very Large Data Bases, pages 180–191, 2004.
- [Kim and Park2017] Youngin Kim and Cheong Hee Park. An efficient concept drift detection method for streaming data under limited labeling. IEICE Trans. Inf. Syst., 100(10):2537–2546, 2017.
- [Krawczyk et al.2017] Bartosz Krawczyk, Leandro L Minku, et al. Ensemble learning for data stream analysis: a survey. Information Fusion, 37:132–156, 2017.
- [Lindstrom et al.2011] Patrick Lindstrom, Brian Namee, and Sarah Delany. Drift detection using uncertainty distribution divergence. In ICDM Workshops. IEEE, 2011.
- [Montgomery2009] Douglas Montgomery. Introduction to statistical quality control. John Wiley & Sons, 2009.
- [Peacock1983] JA Peacock. Two-dimensional goodness-of-fit testing in astronomy. Monthly Notices of the Royal Astronomical Society, 202(3):615–627, 1983.
- [Reis et al.2016] Denis dos Reis, Peter Flach, et al. Fast unsupervised online drift detection using incremental Kolmogorov-Smirnov test. In KDD, 2016.
- [Sethi and Kantardzic2017] T. Sethi and M. Kantardzic. On the reliable detection of concept drift from streaming unlabeled data. Expert Syst. Appl., 82:77–99, 2017.
- [Sklar1959] M Sklar. Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris, 8:229–231, 1959.
- [Sobolewski and Wozniak2013] Piotr Sobolewski and Michal Wozniak. Concept drift detection and model selection with simulated recurrence and ensembles of statistical detectors. J. UCS, 19(4):462–483, 2013.
- [Souza et al.2015] Vinícius Souza, Diego Silva, et al. Data stream classification guided by clustering on nonstationary environments and extreme verification latency. In SIAM Int. Conf. Data Mining, pages 873–881, 2015.
- [Tsymbal2004] Alexey Tsymbal. The problem of concept drift: definitions and related work. Computer Science Department, Trinity College Dublin, 106(2), 2004.
- [Wang and Abraham2015] Heng Wang and Zubin Abraham. Concept drift detection for streaming data. In IJCNN, pages 1–9. IEEE, 2015.
- [Wang et al.2013] Shuo Wang, Leandro L. Minku, et al. Concept drift detection for online class imbalance learning. In IJCNN, pages 1–10. IEEE, 2013.
- [Widmer and Kubat1993] G. Widmer and M. Kubat. Effective learning in dynamic environments by explicit context tracking. In ECML, pages 227–243. Springer, 1993.
- [Yu and Abraham2017] Shujian Yu and Z. Abraham. Concept drift detection with hierarchical hypothesis testing. In SIAM Int. Conf. Data Mining, pages 768–776, 2017.
- [Žliobaite et al.2014] Indre Žliobaite, Albert Bifet, et al. Active learning with drifting streaming data. IEEE Trans. Neural Netw. Learn. Syst., 25(1):27–39, 2014.
- [Žliobaite2010] Indre Žliobaite. Change with delayed labeling: When is it detectable? In ICDM Workshops, pages 843–850. IEEE, 2010.