Streaming data classification follows a routine in which a model is trained on historical data and then used to classify upcoming samples. When the labels of newly arrived samples become available, those samples become part of the training data. Concept drift refers to inconsistencies in data generation at different times, which means the training data and the testing data have different distributions [6, 36, 29]. Drift detection aims to identify these differences with a statistical guarantee through what is, typically, a four-step process: 1) cut the data stream into chunks as training/testing sets; 2) abstract the data sets into a comparable model; 3) develop a test statistic or similarity measurement to quantify the distance between the models; and 4) design a hypothesis test to investigate the null hypothesis (most often, the null hypothesis is that there is no concept drift).
Concept drift detection is also referred to as a change detection test or covariate shift detection, and it is highly relevant in machine learning [42, 44, 9, 6, 25]. Application domains include mobile tracking systems that monitor user behaviour, intrusion detection systems that identify unusual operations, and remote sensing systems that reveal faulty sensors. In these scenarios, the systems can infer a change of situation by comparing data distributions at different time points, where the discrepancy between the distributions is estimated from the observed sample sets. Learning under concept drift consists of three major components: concept drift detection, concept drift understanding, and concept drift adaptation. In this paper, we focus on improving concept drift detection accuracy on multi-cluster data sets. Regarding the drift adaptation process, we recommend retraining the learner if a drift is confirmed as significant.
Online and batch are two modes for drift detection [5, 10, 32]. Batch mode drift detection is also referred to as change detection, or the two-sample test, where the idea is to infer whether two sample sets have been drawn from the same population. This is a fundamental component of statistical data processing. For most change-detection algorithms, the batch size affects the drift threshold of the test statistic. Hence, extra computation is required when the batch size is not fixed [6, 28, 22]. The online approach is more flexible because the drift threshold is self-adaptive. Alternatively, it can be calculated directly from new samples without a complicated estimation process, especially when the change is simply an insertion and/or removal of observations.
Histograms are the most widely used density estimators. A histogram is a set of intervals, i.e., bins, and density is estimated by counting the number of samples in each bin. Designing the bins to reach the best density estimate is a nontrivial problem. Most methods are based on regular grids, and the number of bins grows exponentially with the dimensionality of the data. A few methods instead use a tree-based partitioning scheme, which tends to scale well with high-dimensional data [6, 15]. Recent research shows that bins of equal density result in better detection performance than regular grids. For example, Boracchi et al. developed a space partitioning algorithm, named QuantTree, that creates bins of uniform density and proved that the probabilities of these bins are independent of the data distribution. As a result, the thresholds of the test statistics calculated on these histograms can be computed numerically from univariate, synthetically generated data with a guaranteed false positive rate.
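To make the contrast between regular grids and equal-density bins concrete, the following minimal one-dimensional sketch builds both kinds of histogram on skewed data. It only illustrates the binning idea and is not QuantTree itself; the function names `regular_grid_counts` and `equal_density_counts` and the toy data are hypothetical.

```python
from bisect import bisect_right

def regular_grid_counts(values, k):
    """Count samples in k equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    counts = [0] * k
    for v in values:
        # Clamp so that the maximum value falls into the last bin.
        counts[min(int((v - lo) / width), k - 1)] += 1
    return counts

def equal_density_counts(values, k):
    """Count samples in k bins whose cut points are empirical quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = [ordered[i * n // k] for i in range(1, k)]
    counts = [0] * k
    for v in values:
        counts[bisect_right(cuts, v)] += 1
    return counts

# Skewed toy data: squares of 0..99 (illustrative only).
skewed = [x * x for x in range(100)]
print(regular_grid_counts(skewed, 4))
print(equal_density_counts(skewed, 4))
```

On this toy data the regular grid places half of the samples in its first bin, while the quantile-based bins each hold the same number of samples.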
Tree-based methods have achieved outstanding results with batch mode drift detection. However, they are less suitable for online modes because of the extra effort needed to recalculate the drift threshold, which depends on the sample size. This is a critical issue in real-world distribution change monitoring problems, particularly for streams with no explicit data batch indicators. In addition, tree-based space partitioning does not consider the clustering properties of the data. Therefore, the partitioning results for data with complex distributions may be arbitrary and unexplainable, and may cause drift blind spots in the leaf nodes. For example, Fig. 1 demonstrates the difference in space partitioning between the QuantTree, kdqTree, and kMeans algorithms. It can be seen that tree-based space partitioning produces hyper-rectangles that cross multiple clusters, so the detected distribution change area may not be easily understood.
To address the problems caused by irregular partitions, we propose a novel space partitioning algorithm, called equal-intensity kMeans (EI-kMeans). The first priority of EI-kMeans is to build a histogram that dynamically partitions the data into an appropriate number of small clusters; Pearson’s chi-square test ($\chi^2$ test) is then applied to conduct the null hypothesis test. Pearson’s chi-square test ensures the test statistics remain independent of the sample distribution and the sample size. The proposed EI-kMeans drift detection consists of three major components, which are the main contributions of this paper:
A greedy equal-intensity cluster-initialization algorithm to initialize the kMeans cluster centroids. This helps the clustering algorithm to select an appropriate initialization status, and reduces the randomness of the algorithm.
An intensity-based cluster amplify-shrink algorithm to unify the cluster intensity ratio and ensure that each cluster has enough samples for the Pearson’s chi-square test. In addition, an automatic partition number searching method that satisfies the requirements of a Pearson’s chi-square test is integrated.
A Pearson’s chi-square test-based concept drift detection algorithm that achieves higher drift sensitivity while preserving a low false alarm rate.
The rest of this paper is organized as follows. In Section II, the problem of concept drift is formulated and the preliminaries of Pearson’s chi-square test are introduced. Section III presents the proposed EI-kMeans space partitioning algorithm and the drift detection algorithm. Section IV evaluates the space partitioning performance and the drift detection accuracy. Section V concludes this study with a discussion of future work.
II Preliminaries and Related Works
In this section, we define concept drift, discuss the state-of-the-art concept drift detection algorithms, and outline the preliminaries of the Pearson’s chi-square test for the proposed drift detection algorithm.
II-A Concept drift definitions and related works of drift detection
Concept drift is characterized by variations in the distribution of data. In a non-stationary learning environment, the distribution of available training samples may vary with time [30, 46, 41, 11]. Consider a feature space denoted as $\mathcal{X} \subseteq \mathbb{R}^d$, where $d$ is the dimensionality of the feature space. A tuple $(X, y)$ denotes a data instance, where $X \in \mathcal{X}$ is the feature vector, $y \in \{y_1, \ldots, y_c\}$ is the class label and $c$ is the number of classes. A data stream can then be represented as a sequence of data instances denoted as $D$. A sample set chunked from a stream via a time window strategy is a set of data instances arriving within a time interval, denoted as $D_{T}$, where $T = [t_1, t_2]$ is the given time interval that defines the time window. A concept drift has occurred between two time windows $T_i$ and $T_j$ if the joint probability of $X$ and $y$ is different, that is, $p_{T_i}(X, y) \neq p_{T_j}(X, y)$ [18, 27, 23, 1].
Covariate shift focuses on the drift in the feature distribution $p(X)$ while the posterior $p(y \mid X)$ remains unchanged. This is considered to be virtual drift.
Concept shift focuses on the drift in $p(y \mid X)$ while $p(X)$ remains unchanged. This is most commonly referred to as real drift.
It is worth mentioning that $p(X)$ and $p(y \mid X)$ are not the only implications of drift. The prior probabilities of classes $p(y)$ and the class conditional probabilities $p(X \mid y)$ may also change, which could lead to a change in $p(y \mid X)$ and would affect the predictions [18, 34]. This issue is another research topic in concept drift learning that closely relates to class imbalance in data streams.
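The link between these drift types follows from the product rule and Bayes' theorem. As a brief worked derivation using only standard probability identities, with $X$ standing for the feature vector and $y$ for the class label:

```latex
\begin{align}
p(X, y) &= p(X)\, p(y \mid X) = p(y)\, p(X \mid y), \\
p(y \mid X) &= \frac{p(y)\, p(X \mid y)}{\sum_{y'} p(y')\, p(X \mid y')}.
\end{align}
```

Read this way, a change in the class priors or in the class conditional probabilities can alter the posterior and hence the decision boundary, whereas a change confined to the feature distribution leaves the posterior intact.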
[Table I: drift detection categories versus drift types and learning settings]
Concept drift detection algorithms can be summarized in three major categories: i) error rate-based; ii) distribution-based; and iii) multiple hypothesis tests (multi-hypo). The algorithms can also be distinguished by learning setting, such as supervised, unsupervised [37, 24], semi-supervised, and active learning settings. In a supervised setting, the target variable is available for drift detection. Most error rate-based drift detection algorithms are developed for this setting [38, 3]. Later work has acknowledged the problem of label availability in data streams with concept drift [45, 20], pointing out that concept drift could occur within unsupervised and semi-supervised learning environments. Accordingly, active learning strategies have been adopted to address concept drift and improve learning performance.
Real and virtual are the two major drift types. Error-based, distribution-based and multiple hypothesis tests are the three major categories of drift detection algorithms. Supervised, unsupervised, semi-supervised and active learning are the four major settings of learning under concept drift. Table I uses four markers. The first indicates that the algorithms in a category can detect and distinguish different drift types in the given setting. The second indicates that they can detect drifts but cannot distinguish the types. The third indicates that they are unable to detect the drift. The fourth indicates that the algorithms in the category cannot be applied in the given setting. With regard to multiple hypothesis tests, the capability of these algorithms varies significantly, since they could be a combination of multiple error-based algorithms or a hybrid of error-based and distribution-based algorithms. Therefore, it is hard to draw a general conclusion for this category. In addition, it is worth mentioning that Mahardhika has proposed a method to handle concept drift in a weakly supervised setting.
EI-kMeans is a distribution-based drift detection algorithm. Most Hoeffding bound-based algorithms, like [16, 31], belong to error rate-based drift detection and can only detect real drift in supervised, semi-supervised or active learning settings. The main contribution of EI-kMeans differs from that of conventional distribution-based drift detection. Conventional distribution-based drift detection algorithms aim to find a novel test statistic to measure the discrepancy between two distributions and to design a tailored hypothesis test to determine the drift significance level, such as [22, 10, 21]. In contrast, EI-kMeans focuses on how to efficiently convert multivariate samples into a multinomial distribution and then use an existing hypothesis test to detect the drift. Since EI-kMeans uses Pearson’s chi-square test as the hypothesis test, the drift threshold can be calculated directly from the chi-square distribution, and the test can be implemented in an online manner. Other distribution-based algorithms may need to re-compute the drift threshold as new samples become available.
II-B Histogram-based distribution change detection
Histograms are the oldest and most widely used density estimator. The bins of the histogram are the intervals, i.e., partitions, of the feature space. Hence, a $K$-bin histogram is a set of $K$ partitions, denoted as $\{S_k\}_{k=1}^{K}$, where each $S_k \subseteq \mathcal{X}$ is a partition of the feature space $\mathcal{X}$, $\bigcup_{k=1}^{K} S_k = \mathcal{X}$, and $S_i \cap S_j = \emptyset$ for $i \neq j$. Histograms are often built upon regular grids, which means the number of bins will grow exponentially with the dimensionality of the data. Dasu et al. extended QuadTree based on the idea of a k-dimensional tree and developed the kdqTree space partitioning scheme. In the kdqTree scheme, the feature space is partitioned into adaptable cells of a minimum size and a minimum number of training samples. Then, the Kullback–Leibler divergence is used to quantify the distribution discrepancy, and bootstrap sampling is used to estimate the confidence interval. Another recent tree-based space partitioning algorithm, named QuantTree, was proposed by Boracchi et al.; it splits the feature space into partitions of uniform density. The advantage of QuantTree is that the test statistics computed from it are distribution-free.
Distribution change detection with histograms can be considered from the perspective of granularity and can be categorized into two groups: higher resolution histograms and lower resolution histograms, as demonstrated in Fig. 2.
Lower resolution partitioning requires a large number of training samples so that each partition has enough samples to estimate the density. Without adequate training samples, the density estimate may suffer from randomness. To mitigate this problem, Lu et al. [27, 28] proposed a competence-based space partitioning method that uses related sets to enrich sample sets, then applies space partitioning and calculates the distribution discrepancy. Liu et al. applied a similar strategy by partitioning the feature space based on k-nearest neighbor particles. These higher-resolution partitions resulted in higher drift detection accuracy on small sample sets, but also suffered from higher computational costs.
II-C Pearson’s chi-square test
Pearson’s chi-square test, or $\chi^2$ test for short, is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more sets of data. The test statistic follows a chi-square distribution when there is no significant difference. The purpose of the test is to assume the null hypothesis is true and then evaluate how likely a specific observation would be.
The standard process of the $\chi^2$ test is to use sample data to find: the degrees of freedom, the expected frequencies, the test statistic, and the $p$-value associated with the test statistic. Given a contingency table with $r$ rows and $c$ categorical variables (columns), the degrees of freedom are equal to
$$df = (r - 1)(c - 1).$$
The expected frequency counts are computed separately for each level of one categorical variable at each level of the other categorical variable. The expected frequency at row $i$ and column $j$ of the contingency table is calculated with the equation
$$E_{i,j} = \frac{n_{i\cdot}\, n_{\cdot j}}{n},$$
where $n_{i\cdot}$ is the sum of the frequencies for all columns in row $i$, $n_{\cdot j}$ is the sum of the frequencies for all rows of column $j$, and $n$ is the sum of all rows and columns. The test statistic is a chi-square random variable $\chi^2$ defined by
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}, \tag{1}$$
where $O_{i,j}$ is the observed frequency count at row $i$ and column $j$, and $E_{i,j}$ is the expected frequency count at row $i$ and column $j$. The $p$-value is the probability of observing a sample statistic as extreme as the test statistic. Since the test statistic follows a chi-square distribution under the null hypothesis, the $p$-value can be computed from the chi-square cumulative distribution function.
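As a worked illustration of these four quantities, consider a hypothetical $2 \times 2$ contingency table whose first row holds the observed counts 60 and 40 and whose second row holds 40 and 60 (numbers chosen purely for illustration). Every row and every column sums to 100 and the grand total is 200, so, writing $O_{i,j}$ and $E_{i,j}$ for the observed and expected counts:

```latex
\begin{align}
df &= (2 - 1)(2 - 1) = 1, \\
E_{i,j} &= \frac{100 \times 100}{200} = 50 \quad \text{for all } i, j, \\
\chi^2 &= \sum_{i,j} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}} = 4 \times \frac{10^2}{50} = 8, \\
p &= \Pr\left(\chi^2_{1} \ge 8\right) \approx 0.0047.
\end{align}
```

At a significance level of 0.05, the null hypothesis of identical distributions would be rejected for this table.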
Pearson’s chi-square test should only be used when its standard conditions hold, which assume there is a sufficiently large sample set. If the test is applied to a small sample set, the $\chi^2$ test will yield an inaccurate inference and will result in a high Type II error. The true positive detection accuracy will be impaired, but the false alarm rate will not increase. According to the central limit theorem, a $\chi^2$ distribution, being the sum of independent random variables with finite mean and variance, converges to a normal distribution as the number of terms grows. For many practical purposes, Box et al. claim that for observed frequencies greater than 50 and expected frequencies greater than 5, the distribution of the estimated test statistic is sufficiently close to a normal distribution for the difference to be ignored. In other words, to avoid the bias raised by asymptotic issues, the observed and expected frequencies should be greater than a particular threshold.
III EI-kMeans Space Partitioning and Drift Detection with Pearson’s Chi-square Test
This section presents our EI-kMeans space partitioning histogram and our drift detection algorithm based on Pearson’s chi-square test. The algorithm implementation detail is given, and the complexity is discussed at the end of this section.
III-A The risk of offset partitions in histogram-based drift detection
Let us begin by restating the concept drift detection problem and our proposition.
Problem 1. Let $X$ and $X'$ be random variables defined on a topological space $\mathcal{X}$, with respect to Borel probability measures $P, Q \in \mathcal{P}$, where $\mathcal{P}$ consists of all Borel probability measures on $\mathcal{X}$. Given the observations $\{x_i\}_{i=1}^{n}$ and $\{x'_j\}_{j=1}^{m}$ from $P$ and $Q$, respectively, how much confidence do we have that $P \neq Q$? At present, most distribution change detection methods assume that the observations are i.i.d., which makes the assumption and objective equivalent to a two-sample test problem.
The problem of analyzing a data stream to detect changes in the data generating distribution is very relevant in machine learning and is typically addressed in an unsupervised manner. However, it can easily be extended to handle a supervised setting. There are two options for doing so with our proposed solution without changing the algorithms. Option 1: consider the label or target variable as one feature of the observations in the sample set and then apply the proposed concept drift detection algorithms. Option 2: separate the observations based on their labels and detect concept drift on each group individually.
The design of the space partitioning algorithm is critical to how the histogram is constructed, but nowhere in the literature is there a definitive conclusion on how to build a perfect histogram. Tree-based histogram construction is one of the most popular methods for change detection. QuantTree is a representative algorithm that creates partitions of uniform density in a tree structure. Given that all the distributions are the same, the drift threshold is independent of the data samples and can be numerically computed from univariate, synthetically generated data. Although some studies claim that uniform-density partition schemes are superior based on experimental evaluations [7, 22], no study includes a detailed justification of its claims.
The fundamental idea of drift detection via histograms is to convert the problem of a multivariate two-sample test into a goodness-of-fit test for multinomial distributions. If the data is categorical and belongs to a collection of discrete non-overlapping classes, it has a multinomial population. In this case, each partition (each bin in the histogram) constitutes a categorical, non-overlapping class. The null hypothesis of the goodness-of-fit test is that the observed frequencies match the expected frequencies; that is, the number of testing samples in a partition is expected to fall within a range estimated from the training data. The hypothesis is rejected if the p-value of the observed test statistic is less than a given significance level [1, 2, 40].
Pearson’s chi-square test is a commonly used hypothesis test for this task if the expected frequency for each category is larger than 5, and the observed frequency for each category is larger than 50. In histogram-based drift detection, this requirement can be satisfied by controlling the number of samples in the partitions, for example by reducing the number of partitions K to ensure all partitions contain enough samples. Recalling the test statistic in Eq. (1), we know that, given the same number of partitions, the higher the value of the test statistic, the more likely it is that a distribution drift has occurred. Therefore, the objective is to design a partition algorithm that yields the highest test statistic. If the highest test statistic, which represents the highest distribution discrepancy, does not refute the null hypothesis, then there is no drift. Theoretically, the expected frequency counts for all partitions become known once the partition scheme is determined.
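The sample-size requirement can be checked directly on the partition counts. The sketch below is a minimal illustration, assuming a two-row contingency table (training counts stacked on testing counts) and assuming the observed-frequency condition is applied to the training counts; the function name `meets_chi2_conditions` and the example counts are hypothetical.

```python
def meets_chi2_conditions(train_counts, test_counts,
                          min_observed=50, min_expected=5):
    """Check the observed/expected frequency thresholds for every partition."""
    table = [train_counts, test_counts]
    row_totals = [sum(row) for row in table]
    col_totals = [a + b for a, b in zip(train_counts, test_counts)]
    n = sum(row_totals)
    # Expected count of a cell is (row total * column total) / grand total.
    expected_ok = all(r * c / n > min_expected
                      for r in row_totals for c in col_totals)
    # Assumption: the observed threshold is applied to the training counts.
    observed_ok = all(o > min_observed for o in train_counts)
    return expected_ok and observed_ok

print(meets_chi2_conditions([60, 70, 80], [55, 65, 75]))
print(meets_chi2_conditions([60, 40, 80], [55, 65, 75]))
```

When the check fails, merging partitions (that is, lowering K) raises the per-partition counts until the thresholds are met.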
To maximize the test statistic, the space partitioning strategy needs to avoid partitions that have distribution discrepancies that cannot be measured by subtracting the observations and expectations, as illustrated in Fig. 3. Related definitions are given below.
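Definition 1 (Partition Absolute Variation) The absolute variation of a partition $S_k$ is defined as the integral of the absolute probability density difference of the training and testing data in partition $S_k$, denoted as
$$\delta_{\mathrm{abs}}(S_k) = \int_{S_k} \left| p(x) - q(x) \right| dx,$$
where $p$ and $q$ denote the probability density functions of the training and testing data, and $S_k$ is the partition interval.
Definition 2 (Partition Probability Variation) The probability variation of a partition $S_k$ is defined as the difference between the integrals of the probability densities $p$ and $q$ in partition $S_k$, denoted as
$$\delta_{\mathrm{prob}}(S_k) = \left| \int_{S_k} p(x)\, dx - \int_{S_k} q(x)\, dx \right|.$$
Then we have the offset partition defined as follows.
Definition 3 (Offset Partition) Given two probability density functions $p$ and $q$, a space partition $S_k$ is an offset partition if the absolute variation is larger than the probability variation, denoted as $\delta_{\mathrm{abs}}(S_k) > \delta_{\mathrm{prob}}(S_k)$.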
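A small numerical example may help to show how an offset partition hides drift. The sketch below assumes that the absolute variation integrates the absolute density difference and that the probability variation is the absolute difference of the two partition probabilities; the densities, the interval choices and all function names are hypothetical and chosen only for illustration.

```python
def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

def absolute_variation(p, q, a, b):
    """Integral of |p - q| over the partition [a, b]."""
    return integrate(lambda x: abs(p(x) - q(x)), a, b)

def probability_variation(p, q, a, b):
    """Absolute difference between the partition probabilities under p and q."""
    return abs(integrate(p, a, b) - integrate(q, a, b))

def p(x):
    return 1.0        # uniform density on [0, 1]

def q(x):
    return 2.0 * x    # triangular density on [0, 1]

for a, b in [(0.0, 1.0), (0.0, 0.5), (0.5, 1.0)]:
    av = absolute_variation(p, q, a, b)
    pv = probability_variation(p, q, a, b)
    kind = "offset" if av - pv > 1e-6 else "not offset"
    print((a, b), round(av, 4), round(pv, 4), kind)
```

Over the single partition [0, 1] both densities carry the same probability mass, so a count-based test sees no difference even though the densities differ. Splitting the interval at 0.5 removes the offset, and the difference becomes visible in the bin counts.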
Concept drift detection requires the histogram to be built on training samples only. Admittedly, the impact of offset partitions on distribution estimation can be reduced by learning methods that optimize the density difference between the training and testing samples. However, this approach can be time-consuming and is not feasible when the testing data is small or not available at all. Additionally, it may not be suitable for streaming data, since data may arrive much faster than it can be tested. Therefore, for concept drift detection, histograms need to be designed based only on training data while minimizing the occurrence of offset partitions. In other words, to achieve the best drift detection results, the histogram should have the fewest offset partitions. To detect concept drift, we propose the following strategies to reduce the appearance of offset partitions.
Partitions should avoid cluster gaps. With multi-cluster training sample sets, there are gaps between clusters. If a partition steps across multiple clusters, its sensitivity to drift will be affected.
Partitions should be as compact as possible. The distances between samples within a partition should be minimized. If the drift direction is unknown, one rule of thumb to avoid offset partitions is to keep the shape of the partition as compact as possible, as shown in Fig. 4. However, this strategy must be constrained by a predefined minimum partition size. Otherwise, the partitions will be too small to yield statistical information. In our case, the $\chi^2$ test requires the number of observations to be as large as possible. The minimum requirement is 50 observations for each partition, and the expected frequency count has to be greater than or equal to 5.
This strategy also conforms to Boracchi et al.’s  conclusion that histogram bins of equal density provide better detection performance than regular grids. For example, given a sample set with 1000 samples and 50 as the minimum number of points in a partition with no identical samples, the smallest average interval size of 1000/50=20 partitions is always smaller than the smallest average interval size of 19 partitions. However, histogram bins of equal density may not always have the smallest average interval size. Therefore, bins of non-uniform density may provide superior performance to uniform density bins in some cases.
The distribution discrepancy within partitions is also important; it is another issue resulting from offset partitions that may influence the drift detection results. A $\chi^2$ test cannot identify a distribution discrepancy inside a partition, so the histogram design should ensure the distributions in the partitions are as simple as possible. For example, kernel density estimation-based methods assume the data follows Gaussian mixture distributions. However, this can cause bandwidth selection problems. Therefore, we need an indicator that represents whether or not the density of samples in the same partition is similar. Definition 4 defines this indicator as the offset margin:
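Definition 4 (Offset Margin) The offset margin of a partition $S_k$ is defined as the difference between the absolute variation and the probability variation of $S_k$, denoted as $\delta_{\mathrm{abs}}(S_k) - \delta_{\mathrm{prob}}(S_k)$, where $\delta_{\mathrm{abs}}$ and $\delta_{\mathrm{prob}}$ are the absolute variation and the probability variation given in Definitions 1 and 2.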
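Under the reading that the absolute variation integrates $|p - q|$ over the partition and the probability variation is the absolute difference of the two partition probabilities (the symbols $\delta_{\mathrm{abs}}$ and $\delta_{\mathrm{prob}}$ are used here for illustration), the offset margin can never be negative. This follows from the triangle inequality for integrals:

```latex
\begin{align}
\delta_{\mathrm{prob}}(S)
  = \left| \int_{S} \bigl( p(x) - q(x) \bigr)\, dx \right|
  \le \int_{S} \bigl| p(x) - q(x) \bigr|\, dx
  = \delta_{\mathrm{abs}}(S).
\end{align}
```

Equality holds when $p - q$ does not change sign on $S$ (up to a set of measure zero), so a positive margin signals that density gains and losses inside the partition cancel each other in the bin count.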
The offset partition is only one of many issues that might influence the detection results. Intuitively, the more partitions we have, the less likely offset partitions occur. Also, different partitioning schemes will result in different drift detection results with different sample sets. Minimizing the risk of offset partitions may result in better performance generally, but it may not be the best choice for a particular sample set.
III-B EI-kMeans Space Partitioning
Since the main objective is to keep the risk of offset regions as small as possible without knowing the testing data, the simplest method is to create as many partitions as possible. To this end, we propose using the average partition interval size as an indicator for constructing a histogram. The $\chi^2$ test requires there be more than 50 observations with an expected frequency greater than 5. This requirement can be satisfied by adding constraints onto the indicator. The general form of the objective function is to find the centroids with the smallest average interval size:
$$\min_{C} \; \frac{1}{K} \sum_{k=1}^{K} V(S_k) \quad \text{s.t.} \quad O_k > 50, \; E_k > 5, \; k = 1, \ldots, K,$$
where $C$ is the set of centroids. As the interval in high dimensional cases is a volume, the interval size is denoted as $V(S_k)$. $O_k$ denotes the count of observations in $S_k$, and $E_k$ denotes the expected frequency count in $S_k$.
The nature of kMeans makes it a good option for this task. Adding constraints can be addressed by introducing an algorithm to monitor the number of samples in each cluster. Here, the volume indicator represents the average distance to the centroids. The overall workflow of EI-kMeans space partition is shown in Fig. 5.
As shown in Fig. 5, the procedure begins by initializing the cluster centroids with a greedy equal-intensity k-means initialization algorithm. The objective of this algorithm is to segment the feature space into a set of partitions with the same number of samples. Let $D$ denote the training data set for the histogram, and $D_k$ be the samples located in partition $k$. There are $K$ partitions. The greedy equal-intensity kMeans initialization evenly divides the samples into $K$ groups. The centroids of these groups are input into kMeans as the initial centroids. Once the kMeans converges or reaches the maximum iteration criterion, the returned sample labels and the centroids are used for equal-intensity cluster amplification.
Greedy equal-intensity k-means initialization finds the farthest sample, i.e., the sample with the longest distance to its nearest neighbor. The $\lfloor n/K \rfloor$ nearest neighbors of this sample are labelled as the first partition, where $n$ is the cardinality of the training sample set. The labelled samples are then removed from the training set, and the above process is repeated until all the samples are labelled.
Remark: if the remainder of $n/K$ is not equal to 0, the remainder will be evenly distributed into the first few partitions, that is, samples with the $\lfloor n/K \rfloor + 1$ nearest neighbors will be labelled instead of those with the $\lfloor n/K \rfloor$ nearest neighbors.
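The greedy procedure described above can be sketched in a few lines. This is a minimal illustration rather than the reference implementation: it uses brute-force distance computations instead of a tree-based nearest neighbor search, it counts the selected sample as a member of its own group, and the function name `greedy_equal_intensity_init` is hypothetical. It assumes the number of groups does not exceed the number of samples.

```python
import math

def greedy_equal_intensity_init(points, k):
    """Split points into k near-equal groups and return (groups, centroids)."""
    remaining = [tuple(p) for p in points]
    base, extra = divmod(len(remaining), k)
    groups = []
    for g in range(k):
        # The first `extra` groups absorb the remainder, one sample each.
        size = base + (1 if g < extra else 0)

        def nn_dist(i):
            # Distance from sample i to its nearest other remaining sample.
            return min((math.dist(remaining[i], other)
                        for j, other in enumerate(remaining) if j != i),
                       default=0.0)

        # Farthest sample: the one whose nearest neighbor is farthest away.
        seed = remaining[max(range(len(remaining)), key=nn_dist)]
        # The seed and its nearest neighbors form the next group.
        remaining.sort(key=lambda other: math.dist(seed, other))
        groups.append(remaining[:size])
        remaining = remaining[size:]
    centroids = [tuple(sum(axis) / len(grp) for axis in zip(*grp))
                 for grp in groups]
    return groups, centroids

toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_groups, toy_centroids = greedy_equal_intensity_init(toy, 2)
print([len(grp) for grp in toy_groups], toy_centroids)
```

The returned centroids would then be passed to a k-means routine as its initial centroids.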
Algorithm 1 shows the pseudocode for the greedy equal-intensity kMeans centroid initialization. The inputs are a set of training samples and the number of clusters to initialize. In this algorithm, one trick we use to control the computation cost is sub-sampling. The input training set could be the entire training set or just a subset of it. Some data pre-processing techniques, such as dimensionality reduction or data normalization, should be applied before running our algorithm. Since different data sets may require different data pre-processing techniques, this is outside the main scope of our algorithm.
Denote the number of samples in dataset as , and . The runtime complexity in line 3 is with an appropriate nearest neighbor search algorithm, such as -d tree. The sorting complexity for line 4 is with a merge-sort algorithm. The complexity for lines 6 and 7 are and , respectively. Therefore, the total complexity for each iteration is according to the rule of sums. The total complexity for the greedy equal-intensity k-means centroids initialization is
Based on the labels, the cluster sample intensity ratio is computed by dividing the count of samples in a cluster by the total number of training samples, i.e.,
$$r_k = \frac{n_k}{n},$$
where $n_k$ is the number of samples located in partition $k$ and $n$ is the total number of training samples.
The intensity ratios for all clusters can be represented as a vector , where the shape of is . The amplify coefficient function for the cluster distance is calculated based on this vector:
where is a parameter to control the shape of the coefficient function. To convert the amplify coefficients to matrix, the amplify coefficient vector is multiplied by a vector to create an amplify coefficient matrix, denoted as . Calculating the paired Euclidean distance matrix between the centroids and the data samples as , the amplified distance matrix is derived by
and the amplified cluster labels are derived by finding the centroid index with the minimum amplified distance,
In the cluster amplify-shrink algorithm, is chosen through a grid search from a predefined set . When , the amplify coefficients are all equal to 1, which will not amplify or shrink any of the clusters. As increases, the clusters are amplified or shrunk sharply. If the minimum number of samples in a partition is larger than the desired value, the amplify-shrink algorithm will terminate, denoted as
where is the desired value of the minimum number of samples in the partitions. According to the requirements of Pearson’s chi-square test, the desired value is . If no can satisfy the desired value, the number of partitions is reduced by 1, namely , and the above process is repeated.
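The amplify-shrink search can be sketched as follows. The coefficient function used here, an exponential of the gap between a cluster's intensity ratio and the uniform ratio, is an assumption made for illustration and is not taken from the paper; it is chosen so that a parameter value of zero leaves every coefficient at 1 and larger values penalize over-populated clusters more sharply. The names `amplify_coefficients`, `amplified_labels` and `amplify_shrink` are hypothetical.

```python
import math

def amplify_coefficients(labels, k, theta):
    """Assumed form: exp(theta * (intensity ratio - 1/k)) for each cluster."""
    n = len(labels)
    ratios = [labels.count(c) / n for c in range(k)]
    return [math.exp(theta * (r - 1.0 / k)) for r in ratios]

def amplified_labels(points, centroids, coefficients):
    """Assign each point to the centroid with the smallest scaled distance."""
    k = len(centroids)
    new_labels = []
    for p in points:
        scaled = [coefficients[c] * math.dist(p, centroids[c]) for c in range(k)]
        new_labels.append(scaled.index(min(scaled)))
    return new_labels

def amplify_shrink(points, centroids, labels, thetas, min_count):
    """Grid search: return the first (theta, labels) meeting the minimum count."""
    k = len(centroids)
    for theta in thetas:
        coefficients = amplify_coefficients(labels, k, theta)
        new_labels = amplified_labels(points, centroids, coefficients)
        if min(new_labels.count(c) for c in range(k)) >= min_count:
            return theta, new_labels
    return None  # caller would reduce the number of partitions and retry

# Toy one-dimensional example with an unbalanced initial clustering.
pts = [(0,), (1,), (2,), (3,), (4,), (10,)]
cents = [(2,), (10,)]
init_labels = [0, 0, 0, 0, 0, 1]
print(amplify_shrink(pts, cents, init_labels, [0, 1, 2, 3], min_count=2))
```

In this toy run the small cluster grows until it holds two samples; when no value in the grid succeeds, the function returns `None`, which corresponds to the case where the number of partitions is reduced by 1.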
Algorithm 2 shows the pseudocode for the equal-intensity kMeans space partitioning. The inputs are the training set, the minimum number of samples in a partition, and the grid search range of the amplify coefficient function parameter. The aim of lines 4-8 is to build kMeans clusters of similar intensity. Then, from lines 10 to 19, the clusters are amplified or shrunk based on their intensity ratio. The amplify-shrink process ends when the minimum number of samples is satisfied or when the end of the grid search range is reached. If a desired partition set cannot be found after the amplify-shrink process, the number of clusters is reduced by 1 and the process is repeated.
From lines 10 to 19, the main cost is the multiplication of the matrix, which has a runtime complexity equal to . Because is constant, the complexity is actually . The greedy initialization in line 4 is , the k-means in line 6 is . Considering the while loop starts from K to 2, the worst-case complexity of EI-kMeans space partition is .
III-C EI-kMeans Drift Detection
EI-kMeans treats the clustering property of samples as important when drift occurs. We assume that a distribution change is more likely to occur in a closely located group of samples than in an arbitrary shape. EI-kMeans space partitions are cluster-prioritized and are more sensitive to drift in multi-cluster datasets. The drift detection workflow, shown in Fig. 6, is simple and fast once the space partitioning is finished. Based on how the partitions are constructed, the testing samples are assigned to partitions. The observation counts of the training and testing sample sets are vertically stacked to form a contingency table, and the $\chi^2$ test is applied to evaluate the distribution discrepancy between them.
Algorithm 3 is the drift detection algorithm. It counts the observation frequencies of both the training and testing data, and conducts the $\chi^2$ test. The counting process is implemented using the same steps as in Algorithm 2, lines 12-14. If no drift occurs, the observation frequencies of the training data set are stored in the system buffer for the next test. A contingency table is formed for each test by vertically stacking the stored vector and the observation frequencies of the testing data set. The test returns whether the null hypothesis is rejected or retained.
The optimized complexity of the 1NN classifier in the EI-kMeans drift detection algorithm is . The test complexity is . The overall EI-kMeans drift detection runtime complexity is . In this algorithm,
The overall EI-kMeans drift detection algorithm can be summarized into 3 steps.
Step 1. Initialize the greedy equal-intensity cluster centroids.
Step 2. Segment the feature space into small clusters. This step is based on k-means clustering, which divides the data set into a set of individual clusters. This ensures that no partition spans multiple clusters. The number of partitions is reduced iteratively while the number of samples in any partition does not satisfy the desired value.
Step 3. Detect drift with Pearson’s chi-square test.
IV Experiments and Evaluation
In this section, we compare the proposed EI-kMeans with other state-of-the-art drift detection algorithms to demonstrate how EI-kMeans performs on drift detection tasks. The selected histogram-based drift detection algorithms are QuantTree with both Pearson’s chi-square and total variation statistics, which are reported as the best methods in their paper, and kdqTree with the chi-square test. We also include one multivariate two-sample test as a baseline, the multivariate Wald-Wolfowitz test (MWW test). We chose the MWW test as the baseline because it is designed to solve the problem by statistical analysis and its runtime complexity is low enough for a stream learning scenario. To support the reproducible research initiative, the source code of EI-kMeans is available online at https://github.com/Anjin-Liu/TCYB2019-EIkMeansDriftDetection
IV-A A comparison of space partitioning
Experiment 1. For this experiment, we generated three data sets with different configurations to demonstrate the difference in the space partitioning. The partitioning results are shown in Fig. 7. The first data set, denoted as 1G, has a Gaussian distribution with a mean of , a variance matrix of , and 1350 data samples. The second data set, denoted as 3G[1:1:1], has three Gaussian distributions with different means: , , . The variance matrixes are the same, which form three clusters with the same number of data samples in each cluster. The third data set, denoted as 3G[1:3:5], has the same settings as 3G[1:1:1] but with a diverse sample ratio for each cluster, i.e., the cluster with the mean of contains 150 data samples; has 450 samples; and has 750. The number of desired partitions is set as .
Findings and discussion: The intStv stands for the standard deviation of the partitions’ intensity, which is calculated via Eq. (3). A low intStv implies that the samples are evenly distributed across the partitions. The results show that, regardless of the shape of the data set, EI-kMeans always has a smaller intensity variation than kMeans, which is what we want to achieve.
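As a small illustration of the intStv measure, the sketch below computes the standard deviation of per-partition sample counts. Eq. (3) is not reproduced in this section, so the use of the population standard deviation over raw counts, and the function name `intensity_std`, are assumptions.

```python
import statistics


def intensity_std(partition_counts):
    """Standard deviation of the number of samples per partition.
    Assumes the population form; Eq. (3) may define it differently."""
    return statistics.pstdev(partition_counts)


# Evenly filled partitions give zero variation; uneven ones give a larger value.
even = intensity_std([450, 450, 450])
uneven = intensity_std([150, 450, 750])
```

A partitioning scheme with a lower value spreads the samples more evenly, which is the property the chi-square test benefits from.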
|Data type||Description||Configurations||Drift margin|
|EI-kMeans test||kMeans test||kdqTree test||QuantTree stat||QuantTree stat||MWW test|
Each detection algorithm was run 50 times on 250 data sets generated with different random seeds; the average and standard deviation of the Type-I error are reported. The underlined values are Type-I errors that exceed the predefined false positive rate.
|EI-kMeans test||kMeans test||kdqTree test||QuantTree stat||QuantTree stat||MWW test|
|Average||58.50 (1)||65.96 (3)||59.43 (2)||70.84 (4)||79.66 (6)||72.00 (5)|
|Average||4.27||58.50 (4)||4.31||57.10 (3)||4.35||56.55 (2)||4.23||56.31 (1)|
IV-B Drift detection accuracy with synthetic data sets
Experiment 2: In this experiment, we generated six 2-dimensional data sets to evaluate the power of EI-kMeans to detect drift. We compared EI-kMeans with the state-of-the-art QuantTree and kdqTree algorithms, and with k-means space partitioning plus the chi-square test. The training set contained 2000 samples, and the testing set contained 200 samples. For each data type, we generated 250 stationary testing sets and 250 drift testing sets and evaluated both Type I and Type II errors. A Type I error is the rejection of a true null hypothesis (also known as a "false positive"). A Type II error is the failure to reject a false null hypothesis (a "false negative"). Type-I and Type-II errors are the most common evaluation metrics for distribution change detection. To evaluate the stability, we ran the test 50 times and recorded the mean and standard deviation. Table II presents the data set configurations, Table III shows the mean of the drift detection results, and Table IV shows the standard deviation. To evaluate the influence of the training batch size, we changed the training set size to 3000, 4000 and 5000. The detection results are shown in Table V. Fig. 8 shows the space partitioning results of each algorithm.
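The two error rates can be computed from the per-set detection outcomes as in the following sketch. The function name and the boolean encoding (True means a drift alarm was raised) are illustrative assumptions.

```python
def error_rates(stationary_alarms, drift_alarms):
    """Return (Type-I, Type-II) error rates in percent.

    stationary_alarms: alarm flags for testing sets with no drift.
    drift_alarms: alarm flags for testing sets that contain drift.
    """
    # Type I: an alarm was raised although the null hypothesis was true.
    type_1 = 100.0 * sum(stationary_alarms) / len(stationary_alarms)
    # Type II: no alarm was raised although drift was present.
    missed = len(drift_alarms) - sum(drift_alarms)
    type_2 = 100.0 * missed / len(drift_alarms)
    return type_1, type_2
```

For example, one alarm on four stationary sets and two alarms on four drift sets give a Type-I error of 25% and a Type-II error of 50%.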
Findings and discussion: The results show that all drift detection algorithms outperformed the baseline multivariate two-sample test, the MWW test. EI-kMeans with the chi-square test has an average Type-I error below the predefined false positive rate as well as the lowest average Type-II error. Compared to kMeans-based space partitioning, the improvement of EI-kMeans space partitioning is significant, raising the rank from (3) to (1). The kdqTree space partition with the chi-square test performed well in this experiment and showed no significant disadvantages compared to the others. QuantTree space partitioning did not perform well in general with either statistic, because the partitioning strategy is not designed for multi-cluster data sets. As we can see, the Type-II error of QuantTree stat is very close to that of the kMeans test on the 2d-U-mean, 2d-1G-mean, 2d-1G-var and 2d-1G-cov data sets, which are all single-cluster data sets. Its average performance dropped significantly on the multi-cluster data sets 2d-2G-mean and 2d-4G-mean. Based on these results, we conclude that the design of the histogram scheme contributes significantly to drift detection accuracy across different data distributions, and that designing it is a nontrivial problem. Regarding the batch size, because we use Pearson’s chi-square test as the drift detection hypothesis test, the drift threshold of the test statistic is determined by the chi-square distribution at a given significance level. A sufficiently large sample set is assumed. If a chi-square test is conducted on a small sample, it will yield an inaccurate inference, which might result in a Type II error. As we can see, the Type-II error increases as the training size decreases.
Experiment 3: To evaluate the proposed algorithm on high-dimensional data, we expanded the 2d-1G-mean and 2d-4G-mean data sets to 4, 6, 8, 10 and 20 dimensions by adding normally distributed data. For example, in the 4d-1G-mean data set, the first two features are the same as in 2d-1G-mean, while the third and fourth features are generated from a normal distribution with covariance equal to 0. Since adding unrelated dimensions reduces the drift sensitivity, we increased the drift margin for HD-1G-mean to 0.5 and for HD-4G-mean to 1.0, and doubled the training data. The results are given in Fig. 9.
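The dimension expansion can be sketched as follows. The source states only that the added features are normally distributed with zero covariance; the zero mean and unit standard deviation used as defaults here, and the function name, are assumptions.

```python
import random


def add_noise_dimensions(samples, target_dim, rng, mean=0.0, std=1.0):
    """Append independent Gaussian features until each sample has target_dim features.
    The original features are kept unchanged at the front of each sample."""
    expanded = []
    for x in samples:
        extra = [rng.gauss(mean, std) for _ in range(target_dim - len(x))]
        expanded.append(list(x) + extra)
    return expanded


# Expand two 2-dimensional samples to 4 dimensions with a seeded generator.
rng = random.Random(0)
expanded_4d = add_noise_dimensions([(1.0, 2.0), (3.0, 4.0)], 4, rng)
```

Because each added feature is drawn independently, it carries no information about the drift in the first two features, which is why detection becomes harder as the dimension grows.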
Findings and discussion: In Fig. 9, the Type-II errors increased as the number of non-drift dimensions increased. Most algorithms preserved a low Type-I error, except the MWW test. Although the MWW test has the lowest Type-II error on the HD-4G-mean drift data sets, its Type-I error is above the desired significance level. The kdqTree with the chi-square test has the best performance on the HD-1G-mean data sets, but it becomes powerless on the HD-4G-mean data sets. We believe this is because kdqTree does not have an effective method to control the number of samples in each partition, so directly applying the chi-square test to the kdqTree partitions is risky. In this experiment, EI-kMeans outperforms the others in most cases and keeps its false positive rate below the predefined threshold, which indicates that it is stable on high-dimensional data.
IV-C Drift detection accuracy with real-world data sets
Experiment 4. Drift detection on real-world data sets. For this experiment, we created 8 train-test drift detection tests from 5 real-world data applications. For each test, we generated one training data set and 500 testing data sets. Among these testing data sets, half were drawn from the same distribution, the other half were drawn from a different distribution. Again, the results were evaluated in terms of Type-I and Type-II errors. The characteristics of these data sets are summarized in Table IX.
HIGGS Bosons and Background Data Set. The objective of this data set is to distinguish the signatures of the processes that produce Higgs boson particles from those background processes that do not. Four low-level indicators of the azimuthal angular momenta for four particle jets were selected as features, which means the distributions were . The jet momenta distributions of the background processes are denoted as Back, and the processes that produce Higgs bosons are denoted as Higgs. The total sample size of mixed Backs and Higgs is . We randomly selected 2000 samples without replacement from each distribution as the training data. 1000 samples were used as the testing set. There were three types of data integration: Back-Back, where both sample sets were drawn from Back; Higgs-Higgs, where both sample sets were drawn from Higgs; and Higgs-Back, where one sample set was drawn from Higgs and the other from Back.
MiniBooNE Particle Identification Data Set. This data set contains 36,499 signal events and 93,565 background events. Each event has 50 particle ID variables. The drift detection task is to distinguish between signal events and background events. The training set size was 2000, and the testing set size was 500.
Arabic Digit Mixture Data Set. This data set contains audio features of 88 people (44 females and 44 males) pronouncing Arabic digits between 0 and 9. We applied the same configuration as Denis et al. The data set was originally i.i.d. and contained a time series for 13 different attributes. The revised configuration has 26 attributes instead of 13 time series, with each time series replaced by its mean and standard deviation. Mixture distributions were generated by grouping female and male labels. Mixture distribution contained randomly selected samples of both males and females, with male labels from 0 to 4 and female labels from 5 to 9. Mixture distribution reversed the labels at 9, i.e., drawing the samples of with label male. We configured the data set this way to create multiple clusters, where the pronunciation of each digit formed a cluster. The configuration is summarized in Table VI. The training set size was 2000, and the testing set size was 500.
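The conversion from 13 time series to 26 attributes can be sketched as below. The interleaved ordering of the output (mean, then standard deviation, per series) and the use of the population standard deviation are assumptions; the source does not specify either.

```python
import statistics


def summarize_series(series_list):
    """Replace each time series with two attributes: its mean and standard deviation."""
    features = []
    for series in series_list:
        features.append(statistics.mean(series))
        features.append(statistics.pstdev(series))
    return features


# One recording with 13 attribute time series yields a 26-dimensional feature vector.
recording = [[0.0, 1.0, 2.0] for _ in range(13)]
feature_vector = summarize_series(recording)
```

This turns variable-length recordings into fixed-length vectors, which is what the space partitioning step requires.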
Localization Mixture Data Set. The localization data set contains data from a sensor carried by 5 different people (A, B, C, D, E). The original data has 11 different movements with imbalanced samples. To use this data set for drift detection, we selected the three movements with the most samples: ’lying’, ’walking’ and ’sitting’. To simulate multiple clusters with drift, we grouped samples from different people together at different percentages, which results in varied data distributions. The training set size was 3000, and the testing set size was 600. The sample proportion of each person is summarized in Table VII.
Insects Mixture Data Set. This data set contains features from a laser sensor. The task is to distinguish between 5 possible specimens of flying insects that pass through a laser in a controlled environment (Flies, Aedes, Tarsalis, Quinx, and Fruit). A preliminary analysis showed no drift in the feature space. However, the class distributions gradually change over time. To simulate drift in multiple clusters, we selected the samples from different insects and grouped them together at different percentages, so that the data distribution varies. The training set size was 2000, and the testing set size was 500. The sample proportion of each specimen is summarized in Table VIII.
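The mixture construction used for the Localization and Insects data sets can be sketched as follows. The group names, the proportions in the example, and sampling without replacement are hypothetical choices for illustration; the actual proportions are those listed in Tables VII and VIII.

```python
import random


def draw_mixture(pools, proportions, size, rng):
    """Draw a sample set whose groups follow the given proportions.

    pools: dict mapping a group name to its list of samples.
    proportions: dict mapping a group name to its fraction of the mixture.
    Rounding may make the result differ slightly from `size`.
    """
    mixture = []
    for name, fraction in proportions.items():
        k = round(size * fraction)
        # Sample without replacement from this group's pool.
        mixture.extend(rng.sample(pools[name], k))
    rng.shuffle(mixture)
    return mixture


# Hypothetical pools: group "A" holds values 0-99, group "B" holds 100-199.
pools = {"A": list(range(100)), "B": list(range(100, 200))}
mix = draw_mixture(pools, {"A": 0.25, "B": 0.75}, 40, random.Random(1))
```

Drawing the training set with one set of proportions and a testing set with another changes the relative intensity of the clusters, which simulates drift without altering any individual sample.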
|Data set ID||Data set name||# Features||# Training||# Testing|
|EI-kMeans test||kMeans test||kdqTree test||QuantTree stat||QuantTree stat||MWW test|
|EI-kMeans test||kMeans test||kdqTree test||QuantTree stat||QuantTree stat||MWW test|
|Average||32.57 (1)||33.74 (2)||38.54 (3)||57.21 (5)||61.35 (6)||53.36 (4)|
Findings and discussion: The average drift detection accuracy and its standard deviation are shown in Tables X and XI. The results show that all tested methods except the MWW test returned an average false positive rate below the predefined threshold. EI-kMeans had the lowest average Type-II error of 32.57%, which is 1.17 percentage points lower than the next best result, k-means with the chi-square test. However, the improved drift detection power of EI-kMeans comes at the cost of an increased false positive rate, which sometimes exceeds the predefined threshold. This result conforms to our expectation, since EI-kMeans places more restrictive constraints on the number of samples in each partition to meet the requirements of the chi-square test. With a small sample set, the test will yield an inaccurate inference and is prone to Type II errors. Notably, however, while the true positive detection accuracy may be impaired, the false alarm rate does not surpass the predefined threshold.
Across the Real-I to Real-VII data sets, the cluster-based algorithms performed just as well as the tree-based algorithms. However, the QuantTree algorithms completely lost their power to detect drift on the cluster-based Real-VIII Insects mixture data set, while EI-kMeans showed the best performance.
V Conclusions and Future Work
In this paper, we proposed a novel space partitioning algorithm, called EI-kMeans, for drift detection on multi-cluster data sets. EI-kMeans is a modified k-means algorithm that searches for the best centroids to create partitions. The distances between samples and centroids are amplified or shrunk based on the cluster intensity ratios. The proposed algorithm detects concept drift from a data distribution perspective. Similar to most distribution-based drift detection algorithms, in a supervised learning setting it will trigger a drift alarm if there is a real or virtual drift, but it may not be able to distinguish between the drift types. The results of our experiments show the power of EI-kMeans to detect drift in multi-cluster data sets and demonstrate that histogram design is critical to drift detection accuracy. The results also show that uniform space partitioning may not always outperform other schemes; the performance of a space partitioning algorithm is data-dependent.
The version of EI-kMeans considered in this paper is designed for a Pearson’s chi-square test, but different hypothesis tests may require different methods of histogram construction. This is something we intend to explore in future work. In addition, concept drift detection is only one aspect of learning in a dynamic stream. How to design a tailored drift adaptation algorithm that leverages the drift detection result to achieve better performance in stream learning is our next target.
The work presented in this paper was supported by the Australian Research Council (ARC) under Discovery Project DP190101733. We acknowledge the support of NVIDIA Corporation with the donation of GPU used for this research.
- Hierarchical change-detection tests. IEEE Transactions on Neural Networks and Learning Systems 28 (2), pp. 246–258.
- (2013) Just-in-time classifiers for recurrent concepts. IEEE Transactions on Neural Networks and Learning Systems 24 (4), pp. 620–634.
- (2017) RDDM: reactive drift detection method. Expert Systems with Applications 90, pp. 344–355.
- (1975) Multidimensional binary search trees used for associative searching. Communications of the ACM 18 (9), pp. 509–517.
- (2007) Learning from time-changing data with adaptive windowing. In Proceedings of the 2007 SIAM International Conference on Data Mining, pp. 443–448.
- (2018) QuantTree: histograms for change detection in multivariate data streams. In Proceedings of the 2018 International Conference on Machine Learning, pp. 638–647.
- (2017) Uniform histograms for change detection in multivariate data. In Proceedings of the 2017 International Joint Conference on Neural Networks, pp. 1732–1739.
- (1978) Statistics for experimenters: an introduction to design, data analysis, and model building. Vol. 1, JSTOR.
- (2018) A pdf-free change detection test based on density difference estimation. IEEE Transactions on Neural Networks and Learning Systems 29 (2), pp. 324–334.
- (2017) An incremental change detection test based on density difference estimation. IEEE Transactions on Systems, Man, and Cybernetics: Systems PP (99), pp. 1–13.
- MUSE-rnn: a multilayer self-evolving recurrent neural network for data stream classification. In Proceedings of the 19th IEEE International Conference on Data Mining.
- (2006) An information-theoretic approach to detecting changes in multi-dimensional data streams. In Proceedings of the 28th Symposium on the Interface of Statistics, Computing Science, and Applications, pp. 1–24.
- (2015) Learning in nonstationary environments: a survey. IEEE Computational Intelligence Magazine 10, pp. 12–25.
- (2016) Fast unsupervised online drift detection using incremental kolmogorov-smirnov test. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1545–1554.
- (1974) Quad trees: a data structure for retrieval on composite keys. Acta Informatica 4 (1), pp. 1–9.
- (2015) Online and non-parametric drift detection methods based on hoeffding’s bounds. IEEE Transactions on Knowledge and Data Engineering 27 (3), pp. 810–823.
- (1979) Multivariate generalizations of the wald-wolfowitz and smirnov two-sample tests. The Annals of Statistics, pp. 697–717.
- (2014) A survey on concept drift adaptation. ACM Computing Surveys 46 (4), pp. 1–37.
- Sand: semi-supervised adaptive novel class detection and classification over data stream. In Proceedings of the 2016 AAAI conference on artificial intelligence.
- (2007) An active learning system for mining time-changing data streams. Intelligent Data Analysis 11 (4), pp. 401–419.
- PCA feature extraction for change detection in multidimensional unlabeled data. IEEE Transactions on Neural Networks and Learning Systems 25 (1), pp. 69–80.
- (2018) Accumulating regional density dissimilarity for concept drift detection in data streams. Pattern Recognition 76, pp. 256–272.
- (2017) Regional concept drift detection and density synchronized drift adaptation. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 2280–2286.
- (2020) Diverse instances-weighting ensemble based on region drift disagreement for concept drift adaptation. IEEE Transactions on Neural Networks and Learning Systems Early Access, pp. 1–16.
- (2020) Heterogeneous domain adaptation: an unsupervised approach. IEEE Transactions on Neural Networks and Learning Systems Early Access, pp. 1–15.
- (2018) Learning under concept drift: a review. IEEE Transactions on Knowledge and Data Engineering.
- (2016) A concept drift-tolerant case-base editing technique. Artificial Intelligence 230, pp. 108–133.
- (2014) Concept drift detection via competence models. Artificial Intelligence 209, pp. 11–28.
- (2012) DDD: a new ensemble approach for dealing with concept drift. IEEE Transactions on Knowledge and Data Engineering 24 (4), pp. 619–633.
- (2012) A unifying view on dataset shift in classification. Pattern Recognition 45 (1), pp. 521–530.
- (2016) Fast hoeffding drift detection method for evolving data streams. In Joint European conference on machine learning and knowledge discovery in databases, pp. 96–111.
- (2001) Learn++: an incremental learning algorithm for supervised neural networks. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 31 (4), pp. 497–508.
- Weakly supervised deep learning approach in streaming environments. In Proceedings of the 2019 IEEE International Conference on Big Data.
- Automatic construction of multi-layer perceptron network from streaming examples. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1171–1180.
- (2017) A survey on data preprocessing for data stream mining: current status and future directions. Neurocomputing 239, pp. 39–57.
- (2014) Decision trees for mining data streams based on the gaussian approximation. IEEE Transactions on Knowledge and Data Engineering 26 (1), pp. 108–119.
- (2017) On the reliable detection of concept drift from streaming unlabeled data. Expert Systems with Applications 82, pp. 77–99.
- (2017) Robust prototype-based learning on data streams. IEEE Transactions on Knowledge and Data Engineering 30 (5), pp. 978–991.
- (2018) Density estimation for statistics and data analysis. Routledge.
- (2019) Fuzzy clustering-based adaptive regression for drifting data streams. IEEE Transactions on Fuzzy Systems.
- Concept drift-oriented adaptive and dynamic support vector machine ensemble with time window in corporate financial risk prediction. IEEE Transactions on Systems, Man, and Cybernetics: Systems 43 (4), pp. 801–813.
- (2016) Online ensemble learning of data streams with gradually evolved classes. IEEE Transactions on Knowledge and Data Engineering 28 (6), pp. 1532–1545.
- (2018) A systematic study of online class imbalance learning with concept drift. IEEE Transactions on Neural Networks and Learning Systems.
- (2015) Resampling-based ensemble methods for online class imbalance learning. IEEE Transactions on Knowledge and Data Engineering 27 (5), pp. 1356–1368.
- (2005) Relevant data expansion for learning concept drift from sparsely labeled data. IEEE Transactions on Knowledge and Data Engineering 17 (3), pp. 401–412.
- (2016) Exploiting attribute correlations: a novel trace lasso-based weakly supervised dictionary learning method. IEEE Transactions on Cybernetics 47 (12), pp. 4497–4508.
- (2013) Active learning with drifting streaming data. IEEE Transactions on Neural Networks and Learning Systems 25 (1), pp. 27–39.
- (2016) An overview of concept drift applications. In Big data analysis: new algorithms for a new society, pp. 91–114.