I Introduction
Streaming data classification follows a routine in which a model is trained on historical data and then used to classify upcoming samples. When the labels of the newly arrived samples become available, these samples become part of the training data. Concept drift refers to inconsistencies in data generation at different times, which means the training data and the testing data have different distributions [6, 36, 29]. Drift detection aims to identify these differences with a statistical guarantee through what is, typically, a four-step process [26]: 1) cut the data stream into chunks as training/testing sets; 2) abstract the data sets into a comparable model; 3) develop a test statistic or similarity measurement to quantify the distance between the models; and 4) design a hypothesis test to investigate the null hypothesis (most often, the null hypothesis is that there is no concept drift).
Concept drift detection is also referred to as a change detection test or covariate shift, and it is highly relevant in machine learning [42, 44, 9, 6, 25]. Application domains include mobile tracking systems that monitor user behaviour, intrusion detection systems that identify unusual operations, and remote sensing systems that reveal false sensors. In these scenarios, the systems can infer a change of situation by comparing data distributions at different time points, where the discrepancy between the distributions is estimated from the observed sample sets [48]. Learning under concept drift consists of three major components: concept drift detection, concept drift understanding, and concept drift adaptation [26]. In this paper, we focus on improving concept drift detection accuracy on multi-cluster data sets. Regarding the drift adaptation process, we recommend retraining the learner if a drift is confirmed as significant.
Online and batch are two modes for drift detection [5, 10, 32]. Batch mode drift detection is also referred to as change detection, or the two-sample test, where the idea is to infer whether two sample sets have been drawn from the same population. This is a fundamental component of statistical data processing. For most change-detection algorithms, the batch size affects the drift threshold of the test statistic. Hence, extra computation is required when the batch size is not fixed [6, 28, 22]. The online approach is more flexible because the drift threshold is self-adaptive [10]. Alternatively, it can be calculated directly from new samples without a complicated estimation process [32], especially when the change is simply an insertion and/or removal of observations [14].
Histograms are the most widely used density estimators [39]. A histogram is a set of intervals, i.e., bins, and density is estimated by counting the number of samples in each bin. Designing the bins to reach the best density estimate is a nontrivial problem. Most methods are based on regular grids, and the number of bins grows exponentially with the dimensionality of the data [6]. A few methods instead use a tree-based partitioning scheme, which tends to scale well with high-dimensional data [6, 15]. Recent research shows that bins of equal density result in better detection performance than regular grids [7]. For example, Boracchi et al. [6] developed a space partitioning algorithm, named QuantTree, that creates bins of uniform density and proved that the probabilities of these bins are independent of the data distribution. As a result, the thresholds of the test statistics calculated on these histograms can be computed numerically from univariate and synthetically generated data with a guaranteed false positive rate [6].
Tree-based methods have achieved outstanding results with batch mode drift detection. However, their results are less satisfactory in online mode because of the extra effort needed to recalculate the drift threshold, which depends on the sample size [14]. This is a critical issue in real-world distribution change monitoring problems, particularly for streams with no explicit data batch indicators [46]. In addition, tree-based space partitioning does not consider the clustering properties of the data. Therefore, the partitioning results for data with complex distributions may be arbitrary and unexplainable, and may cause drift blind spots in the leaf nodes. For example, Fig. 1 demonstrates the difference in space partitioning between the QuantTree, kdqTree, and kMeans algorithms. It can be seen that tree-based space partitioning produces hyper-rectangles that cross multiple clusters, so the detected distribution change area may not be easily understood.
To address the problems caused by irregular partitions, we propose a novel space partitioning algorithm, called equal-intensity kMeans (EIkMeans). The first priority of EIkMeans is to build a histogram that dynamically partitions the data into an appropriate number of small clusters, and then to apply Pearson’s chi-square test ($\chi^2$ test) to conduct the null hypothesis test. Pearson’s chi-square test ensures that the test statistic remains independent of the sample distribution and the sample size. The proposed EIkMeans drift detection consists of three major components, which are the main contributions of this paper:

A greedy equal-intensity cluster-initialization algorithm to initialize the kMeans cluster centroids. This helps the clustering algorithm to select an appropriate initialization status and reduces the randomness of the algorithm.

An intensity-based cluster amplify-shrink algorithm to unify the cluster intensity ratio and ensure that each cluster has enough samples for Pearson’s chi-square test. In addition, an automatic partition number searching method that satisfies the requirements of Pearson’s chi-square test is integrated.

A Pearson’s chi-square test-based concept drift detection algorithm that achieves higher drift sensitivity while preserving a low false alarm rate.
The rest of this paper is organized as follows. In Section II, the problem of concept drift is formulated and the preliminaries of Pearson’s chi-square test are introduced. Section III presents the proposed EIkMeans space partitioning algorithm and the drift detection algorithm. Section IV evaluates the space partitioning performance and the drift detection accuracy. Section V concludes this study with a discussion of future work.
II Preliminaries and Related Works
In this section, we define concept drift, discuss state-of-the-art concept drift detection algorithms, and outline the preliminaries of Pearson’s chi-square test for the proposed drift detection algorithm.
II-A Concept drift definitions and related works on drift detection
Concept drift is characterized by variations in the distribution of data. In a nonstationary learning environment, the distribution of available training samples may vary with time [30, 46, 41, 11]. Consider a topological feature space denoted as $\mathcal{X} \subseteq \mathbb{R}^d$, where $d$ is the dimensionality of the feature space. A tuple $(X, y)$ denotes a data instance, where $X \in \mathcal{X}$ is the feature vector, $y \in \{y_1, \ldots, y_c\}$ is the class label and $c$ is the number of classes. A data stream can then be represented as a sequence of data instances denoted as $D$. A sample set chunked from a stream via a time window strategy is a set of data instances arriving within a time interval, denoted as $D_t$, where $t$ is the given time interval that defines the time window. A concept drift has occurred between two time windows $t$ and $t+1$ if the joint probability of $X$ and $y$ is different, that is, $p_t(X, y) \neq p_{t+1}(X, y)$ [18, 27, 23, 1].
According to the definition of joint probability $p(X, y) = p(X)\,p(y \mid X)$, if we only consider problems that use $X$ to infer $y$, concept drift can be divided into two sub-research topics [30, 18, 13, 35]:

Covariate shift focuses on the drift in $p(X)$ while $p(y \mid X)$ remains unchanged. This is considered to be virtual drift.

Concept shift focuses on the drift in $p(y \mid X)$ while $p(X)$ remains unchanged. This is most commonly referred to as real drift.
It is worth mentioning that $p(X)$ and $p(y \mid X)$ are not the only implications of $p(X, y)$ drift. The prior probabilities of classes $p(y)$ and the class conditional probabilities $p(X \mid y)$ may also change, which could lead to a change in $p(y \mid X)$ and would affect the predictions [18, 34]. This issue is another research topic in concept drift learning that closely relates to class imbalance in data streams [43].
Drift type    | Error rate      | Distribution    | Multihypo
Real drift    | supervised      | supervised      | depends
              | unsupervised    | unsupervised    | depends
              | semisupervised  | semisupervised  | depends
              | active learning | active learning | depends
Virtual drift | supervised      | supervised      | depends
              | unsupervised    | unsupervised    | depends
              | semisupervised  | semisupervised  | depends
              | active learning | active learning | depends
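As a minimal illustration of the two drift types defined above, the following Python sketch decomposes a discrete joint distribution into the marginal over the features and the conditional over the labels, and reports which factor changed between two time windows. The function names, the dictionary representation and the toy probabilities are assumptions made for illustration only; the sketch also assumes every feature value has non-zero probability in both windows.

```python
def marginal_and_conditional(joint):
    """joint maps (x, y) -> probability; returns p(x) and p(y | x)."""
    p_x = {}
    for (x, _), prob in joint.items():
        p_x[x] = p_x.get(x, 0.0) + prob
    p_y_given_x = {(x, y): prob / p_x[x] for (x, y), prob in joint.items()}
    return p_x, p_y_given_x


def changed(a, b, tol=1e-9):
    """True if two probability tables differ on any key."""
    keys = set(a) | set(b)
    return any(abs(a.get(k, 0.0) - b.get(k, 0.0)) > tol for k in keys)


def drift_type(joint_t, joint_t1):
    """Classify the change between two joint distributions."""
    px_t, pyx_t = marginal_and_conditional(joint_t)
    px_t1, pyx_t1 = marginal_and_conditional(joint_t1)
    x_changed = changed(px_t, px_t1)      # drift in p(X)
    yx_changed = changed(pyx_t, pyx_t1)   # drift in p(y | X)
    if x_changed and yx_changed:
        return "both"
    if x_changed:
        return "virtual"
    if yx_changed:
        return "real"
    return "none"


# Toy joint distributions over x in {0, 1} and y in {0, 1} (hypothetical values).
base = {(0, 0): 0.40, (0, 1): 0.10, (1, 0): 0.10, (1, 1): 0.40}
virtual = {(0, 0): 0.64, (0, 1): 0.16, (1, 0): 0.04, (1, 1): 0.16}  # only p(X) moves
real = {(0, 0): 0.20, (0, 1): 0.30, (1, 0): 0.10, (1, 1): 0.40}     # only p(y | X) moves

print(drift_type(base, virtual), drift_type(base, real))
```

In practice the two factors are estimated from finite samples, so an exact comparison with a tolerance would be replaced by a hypothesis test.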
Concept drift detection algorithms can be summarized in three major categories: i) error rate-based; ii) distribution-based; and iii) multiple hypothesis tests (multihypo) [26]. The algorithms can also be distinguished by learning setting, such as supervised, unsupervised [37, 24], semi-supervised [19], and active learning settings [47]. In a supervised setting, the target variable is available for drift detection. Most error rate-based drift detection algorithms are developed with this setting [38, 3]. Later work has acknowledged the problem of label availability in data streams with concept drift [45, 20], pointing out that concept drift can occur within unsupervised and semi-supervised learning environments. Accordingly, an active learning strategy is adopted by [47] to address concept drift and improve the learning performance.
Real and virtual are two major drift types. Error rate-based, distribution-based and multiple hypothesis tests are three major categories of drift detection algorithms. Supervised, unsupervised, semi-supervised and active learning are four major settings of learning under concept drift. Table I uses four symbols: one indicates that the algorithms in this category can detect and distinguish different drift types with the given setting; another indicates that they can detect drifts but cannot distinguish the types; a third indicates that they are unable to detect the drift; and the last indicates that the algorithms in this category cannot be applied in the given setting. With regard to multiple hypothesis tests, the capability of these algorithms varies significantly, since they could be a combination of multiple error rate-based algorithms or a hybrid of both error rate-based and distribution-based algorithms. Therefore, it is hard to draw a conclusion for this category. In addition, it is worth mentioning that Mahardhika [33] has proposed a method to handle concept drift in a weakly supervised setting.
EIkMeans is a distribution-based drift detection algorithm. Most Hoeffding bound-based algorithms, like [16, 31], belong to error rate-based drift detection, which can only detect real drift in supervised, semi-supervised or active learning settings. The main contribution of EIkMeans differs from that of conventional distribution-based drift detection. Conventional distribution-based drift detection algorithms aim to find novel test statistics to measure the discrepancy between two distributions and to design a tailored hypothesis test to determine the drift significance level, such as [22, 10, 21]. In contrast, EIkMeans focuses on how to efficiently convert multivariate samples into a multinomial distribution and then use an existing hypothesis test to detect the drift. Since EIkMeans uses Pearson’s chi-square test as the hypothesis test, the drift threshold can be calculated directly from the chi-square distribution, and the algorithm can be implemented in an online manner. Other distribution-based algorithms may need to recompute the drift threshold as new samples become available.
II-B Histogram-based distribution change detection
Histograms are the oldest and most widely used density estimators [39]. The bins of the histogram are the intervals, i.e., partitions, of the feature space. Hence, a $K$-bin histogram is a set of $K$ partitions, denoted as $\{S_k\}_{k=1}^{K}$, where $S_k \subseteq \mathcal{X}$ is a partition of the feature space $\mathcal{X}$, $\bigcup_{k=1}^{K} S_k = \mathcal{X}$, and $S_i \cap S_j = \emptyset$ for $i \neq j$ [6]. Histograms are often built upon regular grids, which means the number of bins grows exponentially with the dimensionality of the data [6]. Dasu et al. extended QuadTree [15] based on the idea of a k-dimensional tree [4] and developed the kdqTree space partitioning scheme [12]. In the kdqTree scheme, the feature space is partitioned into adaptable cells of a minimum size and a minimum number of training samples. Then, the Kullback–Leibler divergence is used to quantify the distribution discrepancy, and bootstrap sampling is used to estimate the confidence interval. Another recent tree-based space partitioning algorithm, named QuantTree, was proposed by Boracchi et al. [6]; it splits the feature space into partitions of uniform density. The advantage of QuantTree is that the test statistics computed from it are distribution-free [6].
Distribution change detection with histograms can be considered from the perspective of granularity and can be categorized into two groups: higher-resolution histograms and lower-resolution histograms, as demonstrated in Fig. 2.
Lower-resolution partitioning requires a large number of training samples so that each partition has enough samples to estimate the density. Without adequate training samples, the density estimate may suffer from randomness. To mitigate this problem, Lu et al. [27, 28] proposed a competence-based space partitioning method that uses related sets to enrich the sample sets, then applies space partitioning and calculates the distribution discrepancy. Liu et al. applied a similar strategy [22] by partitioning the feature space based on k-nearest neighbor particles. These higher-resolution partitions resulted in higher drift detection accuracy on small sample sets, but also suffered from higher computational costs.
II-C Pearson’s chi-square test
Pearson’s chi-square test, or $\chi^2$ test for short, is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more sets of data [8]. The test statistic follows a chi-square distribution when there is no significant difference. The purpose of the test is to assume the null hypothesis is true and then evaluate how likely a specific observation would be.
The standard process of the $\chi^2$ test is to use sample data to find: the degrees of freedom, the expected frequencies, the test statistic, and the $p$-value associated with the test statistic [8]. Given a contingency table with $r$ rows and $c$ categorical variables (columns), the degrees of freedom are equal to
$$df = (r - 1)(c - 1).$$
The expected frequency counts are computed separately for each level of one categorical variable at each level of the other categorical variable. The expected frequency at row $i$ and column $j$ of the contingency table is calculated with the equation
$$E_{i,j} = \frac{n_{i\cdot}\, n_{\cdot j}}{n},$$
where $n_{i\cdot}$ is the sum of the frequencies for all columns in row $i$, $n_{\cdot j}$ is the sum of the frequencies for all rows of column $j$, and $n$ is the sum of all rows and columns. The test statistic is a chi-square random variable $\chi^2$ defined by
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}, \tag{1}$$
where $O_{i,j}$ is the observed frequency count at row $i$ and column $j$, and $E_{i,j}$ is the expected frequency count at row $i$ and column $j$. The $p$-value is the probability of observing a sample statistic as extreme as the test statistic. Since the test statistic is a $\chi^2$ random variable, its $p$-value can be computed with the chi-square probability distribution function.
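The procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `chi2_sf` and `chi_square_test` and the example table are hypothetical, every row and column total is assumed to be positive, and the p-value is obtained from the standard recurrence for the regularized upper incomplete gamma function with integer degrees of freedom rather than from a statistics library.

```python
import math


def chi2_sf(x, df):
    """P(X >= x) for a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)               # Q(1, z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))    # Q(1/2, z)
    # Recurrence: Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    while a < df / 2.0:
        q += math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
        a += 1.0
    return q


def chi_square_test(table):
    """Return (statistic, degrees of freedom, p-value) for a contingency table."""
    rows, cols = len(table), len(table[0])
    row_sums = [sum(row) for row in table]
    col_sums = [sum(table[i][j] for i in range(rows)) for j in range(cols)]
    total = sum(row_sums)
    stat = 0.0
    for i in range(rows):
        for j in range(cols):
            expected = row_sums[i] * col_sums[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    df = (rows - 1) * (cols - 1)
    return stat, df, chi2_sf(stat, df)


# Hypothetical 2 x 2 table: every expected count is 50, so the statistic is 8.0.
stat, df, p_value = chi_square_test([[60, 40], [40, 60]])
print(stat, df, p_value)
```

Where SciPy is available, `scipy.stats.chi2_contingency` performs the same computation; note that it applies Yates' continuity correction to 2 x 2 tables by default, so its result for this table would differ unless `correction=False` is passed.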
Pearson’s chi-square test should be used under the conditions described in [8], which assume a sufficiently large sample set. If the test is applied to a small sample set, the $\chi^2$ test will yield an inaccurate inference and will result in a high Type II error. The true positive detection accuracy will be impaired, but the false alarm rate will not increase. According to the central limit theorem, a $\chi^2$ distribution is the sum of $k$ independent random variables with a finite mean and variance, and it converges to a normal distribution for large $k$. For many practical purposes, Box et al. [8] claim that, above certain thresholds, the distribution of the estimated test statistic is sufficiently close to a normal distribution for the difference to be ignored. In other words, to avoid the bias raised by asymptotic issues, the observed and expected frequencies should be greater than a particular threshold.
III EIkMeans Space Partitioning and Drift Detection with Pearson’s Chi-square Test
This section presents our EIkMeans space partitioning histogram and our drift detection algorithm based on Pearson’s chi-square test. The implementation details of the algorithm are given, and its complexity is discussed at the end of this section.
III-A The risk of offset partitions in histogram-based drift detection
Let us begin by restating the concept drift detection problem and our proposition.
Problem 1. Let $X$ and $Y$ be random variables defined on a topological space $\mathcal{X}$, with respect to $P, Q \in \mathcal{P}$, where $\mathcal{P}$ consists of all Borel probability measures on $\mathcal{X}$. Given the observations $\{x_1, \ldots, x_n\}$ and $\{y_1, \ldots, y_m\}$ from $P$ and $Q$, respectively, how much confidence do we have that $P \neq Q$? At present, most distribution change detection methods assume that the observations are i.i.d., which makes the assumption and objective equivalent to a two-sample test problem.
The problem of analyzing a data stream to detect changes in the data-generating distribution is highly relevant in machine learning and is typically addressed in an unsupervised manner [6]. However, it can easily be extended to handle a supervised setting. For this, there are two options for implementing our proposed solution without changing the algorithms. Option 1: Consider the label or target variable as one feature of the observations in the sample set and then apply the proposed concept drift detection algorithms. Option 2: Separate the observations based on their labels and detect concept drift for each label individually.
The design of the space partitioning algorithm is critical to how the histogram is constructed, but nowhere in the literature is there a definitive conclusion on how to build a perfect histogram. Tree-based histogram construction is one of the most popular methods for change detection. QuantTree [6] is a representative algorithm that creates partitions of uniform density in a tree structure. Given all the distributions are the same, the drift threshold is independent of the data samples and can be numerically computed from univariate and synthetically generated data. Although some studies claim that uniform-density partition schemes are superior based on experimental evaluations [7, 22], no study includes a detailed justification of its claims.
The fundamental idea of drift detection via histograms is to convert the problem of a multivariate two-sample test into a goodness-of-fit test for multinomial distributions. If the data is categorical and belongs to a collection of discrete non-overlapping classes, it has a multinomial population [26]. In this case, each partition (each bin in the histogram) constitutes a categorical, non-overlapping class. The null hypothesis of the goodness-of-fit test is that the observed frequencies match the expected frequencies, that is, the number of testing samples in a partition is expected to fall into a range estimated from the training data [8]. The hypothesis is rejected if the $p$-value of the observed test statistic is less than a given significance level [1, 2, 40].
Pearson’s chi-square test is a commonly used hypothesis test for this task if the expected frequency for each category is larger than 5, and the observed frequency for each category is larger than 50 [8]. In histogram-based drift detection, this requirement can be satisfied by controlling the number of samples in the partitions, for example by reducing the number of partitions $K$ to ensure all partitions contain enough samples. Recalling the test statistic in Eq. (1), we know that, given the same number of partitions, the higher the value of the test statistic, the more likely it is that a distribution drift has occurred. Therefore, the objective is to design a partition algorithm that yields the highest test statistic. If the highest test statistic, which represents the highest distribution discrepancy, does not refute the null hypothesis, then there is no drift. Theoretically, the expected frequency counts for all partitions become known once the partition scheme is determined.
To maximize the test statistic, the space partitioning strategy needs to avoid partitions that have distribution discrepancies that cannot be measured by subtracting the observations and expectations, as illustrated in Fig. 3. Related definitions are given below.
Definition 1.
(Partition Absolute Variation) The absolute variation of a partition $S_k$ is defined as the integration of the probability density difference of $P$ and $Q$ in partition $S_k$, denoted as
$$\delta_{abs}(S_k) = \int_{S_k} |p(x) - q(x)|\,dx,$$
where $p(x)$, $q(x)$ denote the probability density functions of the training and testing data, and $S_k$ is the partition interval.
Definition 2.
(Partition Probability Variation) The probability variation of a partition $S_k$ is defined as the difference of the integration of the probability density in partition $S_k$ of $P$ and $Q$, denoted as
$$\delta_{prob}(S_k) = \left| \int_{S_k} p(x)\,dx - \int_{S_k} q(x)\,dx \right|.$$
Then we have the offset partition defined as follows.
Definition 3.
(Offset Partition) Given two probability density distributions $P$ and $Q$, a space partition $S_k$ is an offset partition if the absolute variation is larger than the probability variation, denoted as $\delta_{abs}(S_k) > \delta_{prob}(S_k)$.
Concept drift detection requires the histogram to be built on training samples only. Admittedly, the impact of offset partitions on distribution estimation can be reduced by learning methods that optimize the density difference between the training and testing samples. However, such methods can be time-consuming and are not feasible when the testing data is small or not available at all. Additionally, they may not be suitable for streaming data, since data may arrive much faster than it can be tested. Therefore, for concept drift detection, histograms need to be designed based only on training data while minimizing the occurrence of offset partitions. In other words, to achieve the best drift detection results, the histogram should have the fewest possible offset partitions. To detect concept drift, we propose the following strategies to reduce the appearance of offset partitions.

Partitions should avoid cluster gaps. With multi-cluster training sample sets, there are gaps between clusters. If a partition spans multiple clusters, its sensitivity to drift will be affected.

Partitions should be as compact as possible. The distances between samples within a partition should be minimized. If the drift direction is unknown, one rule of thumb to avoid offset partitions is to keep the shape of the partition as compact as possible, as shown in Fig. 4. However, this strategy must be constrained by a predefined minimum partition size. Otherwise, the partitions will be too small to yield statistical information. In our case, the $\chi^2$ test requires the number of observations to be as large as possible. The minimum requirement is 50 observations for each partition, and the expected frequency count has to be greater than or equal to 5 [8].
This strategy also conforms to Boracchi et al.’s [7] conclusion that histogram bins of equal density provide better detection performance than regular grids. For example, given a sample set with 1000 samples and 50 as the minimum number of points in a partition with no identical samples, the smallest average interval size of 1000/50=20 partitions is always smaller than the smallest average interval size of 19 partitions. However, histogram bins of equal density may not always have the smallest average interval size. Therefore, bins of non-uniform density may provide superior performance to uniform-density bins in some cases.
The distribution discrepancy within partitions is also important; it is another issue resulting from offset partitions that may influence the drift detection results. A $\chi^2$ test cannot identify a distribution discrepancy inside a partition, so the histogram design should ensure the distributions in the partitions are as simple as possible. For example, kernel density estimation-based methods assume the data follows Gaussian mixture distributions. However, this can cause bandwidth selection problems. Therefore, we need an indicator that represents whether or not the density of samples in the same partition is similar. Definition 4 defines this indicator as the offset margin:
Definition 4.
(Offset Margin) The offset margin of partition $S_k$ is defined as the difference between the absolute variation and the probability variation of $S_k$, i.e., the absolute variation minus the probability variation, denoted as $\delta_{abs}(S_k) - \delta_{prob}(S_k)$.
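A small numerical example may help to see why offset partitions hide drift. The Python sketch below approximates the integrals in Definitions 1 to 4 by sums over fine-grained cells inside one partition; the function names and the cell masses are hypothetical and chosen only for illustration.

```python
def partition_variations(p_cells, q_cells):
    """Discrete approximation of the variations of one partition.

    p_cells, q_cells: probability mass of the training and testing
    distributions in each fine-grained cell inside the partition.
    Returns (absolute variation, probability variation, offset margin).
    """
    absolute = sum(abs(p - q) for p, q in zip(p_cells, q_cells))
    probability = abs(sum(p_cells) - sum(q_cells))
    return absolute, probability, absolute - probability


def is_offset_partition(p_cells, q_cells, tol=1e-12):
    """A partition is offset when the absolute variation exceeds the probability variation."""
    absolute, probability, _ = partition_variations(p_cells, q_cells)
    return absolute - probability > tol


# Mass moves from one half of the partition to the other: the total mass in
# the partition is unchanged, so a frequency count cannot see the change.
shifted = partition_variations([0.2, 0.1], [0.1, 0.2])

# Mass drops uniformly inside the partition: the count reflects the full change.
uniform = partition_variations([0.2, 0.2], [0.1, 0.1])

print(shifted, is_offset_partition([0.2, 0.1], [0.1, 0.2]))
print(uniform, is_offset_partition([0.2, 0.2], [0.1, 0.1]))
```

In the first case the offset margin equals the whole absolute variation, which is the situation the partitioning strategies above try to avoid.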
The offset partition is only one of many issues that might influence the detection results. Intuitively, the more partitions we have, the less likely offset partitions occur. Also, different partitioning schemes will result in different drift detection results with different sample sets. Minimizing the risk of offset partitions may result in better performance generally, but it may not be the best choice for a particular sample set.
III-B EIkMeans Space Partitioning
Since the main objective is to keep the risk of offset regions as small as possible without knowing the testing data, the simplest method is to create as many partitions as possible. To this end, we propose using the average partition interval size as an indicator for constructing a histogram. The $\chi^2$ test requires more than 50 observations with an expected frequency greater than 5. This requirement can be satisfied by adding constraints to the indicator. The general form of the objective function is to find the centroids with the smallest average interval size:
$$\min_{c_1, \ldots, c_K} \; \frac{1}{K} \sum_{k=1}^{K} V(S_k) \quad \text{s.t.} \quad O_k > 50, \; E_k > 5, \quad k = 1, \ldots, K. \tag{2}$$
As the interval in high-dimensional cases is a volume, the interval size is denoted as $V(S_k)$. Here $O_k$ denotes the count of observations in $S_k$, and $E_k$ denotes the expected frequency count in $S_k$.
The nature of kMeans makes it a good option for this task. The constraints can be handled by introducing an algorithm that monitors the number of samples in each cluster. Here, the volume indicator represents the average distance to the centroids. The overall workflow of EIkMeans space partitioning is shown in Fig. 5.
As shown in Fig. 5, the procedure begins by initializing the cluster centroids with a greedy equal-intensity kMeans initialization algorithm. The objective of this algorithm is to segment the feature space into a set of partitions with the same number of samples. Let $D_{train}$ denote the training data set for the histogram, and $D_k$ be the samples located in partition $S_k$. There are $K$ partitions. The greedy equal-intensity kMeans initialization evenly divides the samples into $K$ groups. The centroids of these groups are input into kMeans as the initial centroids. Once kMeans converges or reaches the maximum iteration criterion, the returned sample labels and the centroids are used for equal-intensity cluster amplification.
Greedy equal-intensity kMeans initialization finds the farthest sample, i.e., the sample with the longest distance to its nearest neighbor. The $\lfloor n/K \rfloor$ nearest neighbors of this sample are labelled as the first partition, where $n$ is the cardinality of the training sample set. The labelled samples are then removed from the training set, and the above process is repeated until all the samples are labelled.
Remark: if the remainder of $n/K$ is not equal to 0, the remainder is evenly distributed among the first few partitions, that is, samples with the $\lfloor n/K \rfloor + 1$ nearest neighbors are labelled instead of those with the $\lfloor n/K \rfloor$ nearest neighbors.
Algorithm 1 shows the pseudocode for the greedy equal-intensity kMeans centroid initialization. The inputs are a set of training samples and the number of clusters to initialize. In this algorithm, one trick we use to control the computation cost is subsampling: the input could be the entire training set or just a subset of it. Some data preprocessing techniques, such as dimensionality reduction or data normalization, can be applied before running our algorithm. Since different data sets may require different preprocessing techniques, preprocessing is outside the main scope of our algorithm.
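The greedy initialization described above can be sketched in pure Python as follows. This is a minimal sketch rather than a reproduction of Algorithm 1: the function name is hypothetical, the seed sample is counted as the first member of its own group, ties are broken by input order, and nearest neighbors are found by brute force instead of a tree-based search.

```python
import math


def greedy_equal_intensity_init(samples, k):
    """Split samples into k groups of (almost) equal size and return their centroids."""
    if not 1 <= k <= len(samples):
        raise ValueError("k must be between 1 and the number of samples")
    remaining = list(samples)
    base, extra = divmod(len(remaining), k)
    centroids = []
    for group_index in range(k):
        # The first `extra` groups take one additional sample (the remainder rule).
        size = base + (1 if group_index < extra else 0)

        def nn_distance(i):
            # Distance from sample i to its nearest neighbor among the unlabelled samples.
            dists = [math.dist(remaining[i], remaining[j])
                     for j in range(len(remaining)) if j != i]
            return min(dists) if dists else 0.0

        # Farthest sample: the one whose nearest neighbor is the most distant.
        seed = remaining[max(range(len(remaining)), key=nn_distance)]
        # Label the seed and its nearest neighbors, then remove them.
        remaining.sort(key=lambda s: math.dist(seed, s))
        group, remaining = remaining[:size], remaining[size:]
        centroids.append(tuple(sum(axis) / size for axis in zip(*group)))
    return centroids


# Two well-separated toy clusters of three points each (hypothetical data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
initial_centroids = greedy_equal_intensity_init(points, 2)
print(initial_centroids)
```

The brute-force search makes each round quadratic in the number of unlabelled samples, which is one reason subsampling the training set can be attractive.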
Denote the number of samples in dataset as , and . The runtime complexity in line 3 is with an appropriate nearest neighbor search algorithm, such as a k-d tree. The sorting complexity for line 4 is with a merge sort algorithm. The complexities of lines 6 and 7 are and , respectively. Therefore, the total complexity for each iteration is according to the rule of sums. The total complexity for the greedy equal-intensity kMeans centroid initialization is
Based on the labels, the cluster sample intensity ratio is computed by dividing the count of samples in a cluster by the total number of training samples, i.e.,
$$r_k = \frac{n_k}{n}, \tag{3}$$
where $n_k$ is the number of samples located in partition $S_k$ and $n$ is the total number of training samples.
The intensity ratios for all clusters can be represented as a vector , where the shape of is . The amplify coefficient function for the cluster distance is calculated based on this vector:
where is a parameter to control the shape of the coefficient function. To convert the amplify coefficients into a matrix, the amplify coefficient vector is multiplied by a vector to create an amplify coefficient matrix, denoted as . Calculating the paired Euclidean distance matrix between the centroids and the data samples as , the amplified distance matrix is derived by
and the amplified cluster labels are derived by finding the centroid index with the minimum amplified distance,
In the cluster amplifyshrink algorithm, is chosen through a grid search from a predefined set . When , the amplify coefficients are all equal to 1, which will not amplify or shrink any of the clusters. As increases, the clusters are amplified or shrunk sharply. If the minimum number of samples in a partition is larger than the desired value, the amplifyshrink algorithm will terminate, denoted as
where is the desired value of the minimum number of samples in the partitions. According to the requirements of Pearson’s chisquare test, the desired value is . If no can satisfy the desired value, the number of partitions is reduced by 1, namely , and the above process is repeated.
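The amplify-shrink step and its grid search can be sketched as below. The exponential coefficient used here is an assumption for illustration, chosen only so that all coefficients equal 1 when the parameter is 0 and so that over-populated clusters see their distances amplified; it is not claimed to be the coefficient function of the paper. The function names, the toy data and the use of "at least `min_count`" as the stopping rule are likewise hypothetical.

```python
import math


def amplify_shrink_labels(samples, centroids, labels, theta):
    """Relabel samples by the nearest centroid under amplified distances."""
    k, n = len(centroids), len(samples)
    # Intensity ratio of each cluster: share of the training samples it holds.
    ratios = [labels.count(i) / n for i in range(k)]
    # ASSUMED coefficient form: greater than 1 for over-populated clusters
    # (they shrink), less than 1 for under-populated clusters (they grow).
    coeffs = [math.exp(theta * (r - 1.0 / k)) for r in ratios]
    new_labels = []
    for s in samples:
        amplified = [coeffs[i] * math.dist(s, centroids[i]) for i in range(k)]
        new_labels.append(min(range(k), key=amplified.__getitem__))
    return new_labels


def search_theta(samples, centroids, labels, thetas, min_count):
    """Return the first theta whose relabelling gives every cluster min_count samples."""
    for theta in thetas:
        new_labels = amplify_shrink_labels(samples, centroids, labels, theta)
        if min(new_labels.count(i) for i in range(len(centroids))) >= min_count:
            return theta, new_labels
    return None, labels  # the caller would then reduce the number of partitions


# One-dimensional toy data: a cluster of six points and a cluster of two.
data = [(0,), (1,), (2,), (3,), (4,), (5,), (9,), (10,)]
centers = [(2.5,), (9.5,)]
kmeans_labels = [0, 0, 0, 0, 0, 0, 1, 1]
best_theta, balanced = search_theta(data, centers, kmeans_labels, [0, 1, 2, 3], 3)
print(best_theta, balanced)
```

With a requirement of three samples per cluster, the small cluster must grow by one sample, which in this toy example happens once the parameter is large enough to move the boundary past the point at 5.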
Algorithm 2 shows the pseudocode for the equal-intensity kMeans space partitioning. The inputs are the training set, the minimum number of samples in a partition, and the grid search range of the amplify coefficient function parameter. The aim of lines 4 to 8 is to build kMeans clusters of similar intensity. Then, from lines 10 to 19, the clusters are amplified or shrunk based on their intensity ratio. The amplify-shrink process ends when the minimum number of samples is satisfied or when the end of the search range is reached. If a desired partition set cannot be found after the amplify-shrink process, the number of clusters is reduced by 1, namely $K$ is updated to $K - 1$, and the process is repeated.
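The control flow described for Algorithm 2 can be sketched as follows. This is a simplified illustration, not the published algorithm: `spread_init` is a hypothetical stand-in for the greedy initialization, `kmeans` is a plain Lloyd iteration, and the exponential amplify coefficient is an assumed form that equals 1 when the parameter is 0.

```python
import math


def assign(samples, centroids, coeffs=None):
    """Label each sample with its nearest centroid under optionally amplified distances."""
    k = len(centroids)
    coeffs = coeffs or [1.0] * k
    return [min(range(k), key=lambda i: coeffs[i] * math.dist(s, centroids[i]))
            for s in samples]


def kmeans(samples, centroids, max_iter=50):
    """Plain Lloyd iterations starting from the given centroids."""
    labels = assign(samples, centroids)
    for _ in range(max_iter):
        updated = []
        for i, old in enumerate(centroids):
            members = [s for s, lab in zip(samples, labels) if lab == i]
            updated.append(tuple(sum(axis) / len(members) for axis in zip(*members))
                           if members else old)
        centroids = updated
        new_labels = assign(samples, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return centroids, labels


def ei_partition(samples, k_start, min_count, thetas, init_fn):
    """Search for the largest k whose clusters all hold at least min_count samples."""
    k = k_start
    while k >= 2:
        centroids, labels = kmeans(samples, init_fn(samples, k))
        ratios = [labels.count(i) / len(samples) for i in range(k)]
        for theta in thetas:
            coeffs = [math.exp(theta * (r - 1.0 / k)) for r in ratios]  # assumed form
            new_labels = assign(samples, centroids, coeffs)
            if min(new_labels.count(i) for i in range(k)) >= min_count:
                return centroids, coeffs, new_labels
        k -= 1  # no theta worked: retry with one partition fewer
    return None


def spread_init(samples, k):
    """Hypothetical stand-in initializer: evenly spaced samples in sorted order."""
    ordered = sorted(samples)
    step = len(ordered) // k
    return [ordered[i * step] for i in range(k)]


toy = [(0.0,), (0.1,), (0.2,), (0.3,), (5.0,), (5.1,), (5.2,), (5.3,)]
result = ei_partition(toy, k_start=4, min_count=3, thetas=[0, 1, 2], init_fn=spread_init)
print(result)
```

With eight samples and a minimum of three per partition, neither four nor three partitions can satisfy the constraint, so the loop falls back to two partitions of four samples each.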
From lines 10 to 19, the main cost is the multiplication of the matrix, which has a runtime complexity equal to . Because is constant, the complexity is actually . The greedy initialization in line 4 is , the kmeans in line 6 is . Considering the while loop starts from K to 2, the worstcase complexity of EIkMeans space partition is .
III-C EIkMeans Drift Detection
EIkMeans treats the clustering property between samples as important when drift occurs. We assume that a distribution change is more likely to occur in a closely located group of samples than in an arbitrary shape. EIkMeans space partitions are cluster-prioritized and are more sensitive to drift in multi-cluster data sets. The drift detection workflow, shown in Fig. 6, is simple and fast once the space partitioning is finished. Based on how the partitions were constructed, the testing samples are clustered into partitions. The observations in the training and the testing sample sets are vertically stacked to form a contingency table, and the $\chi^2$ test is applied to evaluate the distribution discrepancy between them.
Algorithm 3 is the drift detection algorithm. It counts the observation frequencies of both the training and testing data, and conducts the $\chi^2$ test. The counting process is implemented using the same steps as in Algorithm 2, lines 12 to 14. If no drift occurs, the observation frequencies of the training data set are stored in the system buffer for the next test. A contingency table is formed for each test by vertically stacking the stored vector and the observation frequencies of the testing data set. The test returns whether the null hypothesis is rejected or accepted.
The optimized complexity of the 1NN classifier in the EIkMeans drift detection algorithm is . The test complexity is . The overall EIkMeans drift detection runtime complexity is . In this algorithm,
The overall EIkMeans drift detection algorithm can be summarized in 3 steps.

Step 1. Initialize the greedy equal-intensity cluster centroids.

Step 2. Segment the feature space into small clusters. This step is based on k-means clustering, which divides the data set into a set of individual clusters. This ensures that no partition spans more than one cluster. The number of partitions is reduced repeatedly until the number of samples in each partition satisfies the desired minimum.

Step 3. Detect drift with Pearson’s chi-square test.
IV Experiments and Evaluation
In this section, we compare the proposed EIkMeans with other state-of-the-art drift detection algorithms to demonstrate how EIkMeans performs on drift detection tasks. The selected histogram-based drift detection algorithms are QuantTree with both the χ² and total variation statistics, which are reported as the best methods in the original paper [6], and kdqTree with the χ² test [12]. We also include one multivariate two-sample test baseline, the multivariate Wald-Wolfowitz test (MWW test) [17]. We chose the MWW test as the baseline because it is designed to solve the problem by statistical analysis and its runtime complexity is low enough for a stream learning scenario. To support the reproducible research initiative, the source code of EIkMeans is available online at https://github.com/AnjinLiu/TCYB2019EIkMeansDriftDetection
IV-A A comparison of space partitioning
Experiment 1. For this experiment, we generated three data sets with different configurations to demonstrate the difference in the space partitioning. The partitioning results are shown in Fig. 7. The first data set, denoted as 1G, has a Gaussian distribution with a mean of , a variance matrix of , and 1350 data samples. The second data set, denoted as 3G[1:1:1], has three Gaussian distributions with different means: , , . The variance matrices are the same, so the data form three clusters with the same number of samples in each cluster. The third data set, denoted as 3G[1:3:5], has the same settings as 3G[1:1:1] but with a different sample ratio for each cluster, i.e., the cluster with the mean of contains 150 data samples; has 450 samples; and has 750. The number of desired partitions is set as .
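A data set of this kind could be generated along the following lines. This is a sketch under stated assumptions: the cluster means below are hypothetical placeholders, the covariance is taken to be the identity, and 3G[1:1:1] is assumed to use 450 samples per cluster so that all three data sets contain 1350 samples.

```python
import random


def make_gaussian_mixture(means, sizes, std=1.0, seed=0):
    """Draw sizes[i] samples from an isotropic Gaussian centred at means[i]."""
    rng = random.Random(seed)
    data, labels = [], []
    for label, (mean, n) in enumerate(zip(means, sizes)):
        for _ in range(n):
            data.append(tuple(rng.gauss(m, std) for m in mean))
            labels.append(label)
    return data, labels


# Hypothetical cluster means, chosen only for illustration.
means = [(0.0, 0.0), (6.0, 0.0), (3.0, 6.0)]

one_g = make_gaussian_mixture(means[:1], [1350])              # 1G
three_g_equal = make_gaussian_mixture(means, [450, 450, 450])  # 3G[1:1:1]
three_g_skewed = make_gaussian_mixture(means, [150, 450, 750])  # 3G[1:3:5]
```

Fixing the seed keeps the generated sets reproducible across runs.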
Findings and discussion: The intStv stands for the standard deviation of the partitions’ intensity, which is calculated via Eq. (3). A low intStv implies that the samples are evenly distributed across the partitions. The results show that, whatever the shape of the data set, EIkMeans always has a smaller intensity variation than kMeans, which is what we want to achieve.

Table II: Synthetic data set configurations.

| Data type | Description | Configurations | Drift margin |
|---|---|---|---|
| 2dUmean | | | |
| 2d1Gmean | | | |
| 2d1Gvar | | | |
| 2d1Gcov | | | |
| 2d2Gmean | | | |
| 2d4Gmean | | | |
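Table III: Type-I error (%) on the synthetic data sets (mean ± standard deviation). Each detection algorithm was run 50 times on 250 data sets generated with different random seeds, and the average and standard deviation of the Type-I error are reported. In the original table, the Type-I errors that exceed the predefined false positive rate are underlined.

| Data type | EIkMeans χ² test | kMeans χ² test | kdqTree χ² test | QuantTree χ² stat | QuantTree TV stat | MWW test |
|---|---|---|---|---|---|---|
| 2dUmean | 4.79±1.57 | 4.02±1.35 | 5.00±1.94 | 4.89±1.48 | 3.77±1.36 | 10.11±4.62 |
| 2d1Gmean | 4.90±1.94 | 3.65±1.62 | 4.88±2.26 | 4.72±1.69 | 3.70±1.40 | 10.80±4.60 |
| 2d1Gvar | 4.90±1.94 | 3.65±1.62 | 4.88±2.26 | 4.72±1.69 | 3.70±1.40 | 10.80±4.60 |
| 2d1Gcov | 4.90±1.94 | 3.65±1.62 | 4.88±2.26 | 4.72±1.69 | 3.70±1.40 | 10.80±4.60 |
| 2d2Gmean | 3.82±1.70 | 3.14±1.54 | 4.01±2.29 | 4.51±1.97 | 3.04±1.43 | 10.68±5.28 |
| 2d4Gmean | 2.31±1.39 | 2.06±1.02 | 2.72±1.62 | 2.66±1.01 | 2.11±0.99 | 9.36±4.65 |
| Average | 4.27 | 3.36 | 4.39 | 4.37 | 3.34 | 10.43 |

Table IV: Type-II error (%) on the synthetic data sets (mean ± standard deviation). The rank of each average is given in parentheses.

| Data type | EIkMeans χ² test | kMeans χ² test | kdqTree χ² test | QuantTree χ² stat | QuantTree TV stat | MWW test |
|---|---|---|---|---|---|---|
| 2dUmean | 45.00±10.18 | 47.87±9.60 | 47.28±11.77 | 40.27±21.28 | 57.25±23.72 | 13.57±5.64 |
| 2d1Gmean | 40.84±11.34 | 58.28±11.45 | 43.71±12.07 | 61.34±11.39 | 70.33±7.47 | 83.62±7.34 |
| 2d1Gvar | 80.46±6.02 | 81.76±6.78 | 76.51±8.95 | 84.93±4.43 | 89.85±3.22 | 83.72±7.38 |
| 2d1Gcov | 80.84±5.51 | 87.47±4.58 | 83.39±5.82 | 90.72±5.39 | 92.45±4.57 | 87.20±6.18 |
| 2d2Gmean | 57.43±10.80 | 66.78±9.66 | 59.78±11.41 | 70.57±10.42 | 83.74±5.20 | 83.22±6.90 |
| 2d4Gmean | 46.42±13.37 | 53.58±12.70 | 45.88±13.27 | 77.21±12.17 | 84.32±9.08 | 80.70±7.77 |
| Average | 58.50 (1) | 65.96 (3) | 59.43 (2) | 70.84 (4) | 79.66 (6) | 72.00 (5) |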
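Table V: Type-I and Type-II errors (%) of EIkMeans with training set sizes of 2000, 3000, 4000 and 5000. The rank of each average Type-II error is given in parentheses.

| Data type | 2000: Type-I | 2000: Type-II | 3000: Type-I | 3000: Type-II | 4000: Type-I | 4000: Type-II | 5000: Type-I | 5000: Type-II |
|---|---|---|---|---|---|---|---|---|
| 2dUmean | 4.79 | 45.00 | 4.83 | 43.32 | 4.84 | 41.57 | 4.78 | 40.86 |
| 2d1Gmean | 4.90 | 40.84 | 5.06 | 38.39 | 5.17 | 37.55 | 5.00 | 36.77 |
| 2d1Gvar | 4.90 | 80.46 | 5.06 | 79.57 | 5.17 | 79.85 | 5.00 | 79.57 |
| 2d1Gcov | 4.90 | 80.84 | 5.06 | 80.10 | 5.17 | 80.39 | 5.00 | 80.72 |
| 2d2Gmean | 3.82 | 57.43 | 3.73 | 55.69 | 3.68 | 54.74 | 3.64 | 55.52 |
| 2d4Gmean | 2.31 | 46.42 | 2.13 | 45.55 | 2.06 | 45.18 | 1.98 | 44.43 |
| Average | 4.27 | 58.50 (4) | 4.31 | 57.10 (3) | 4.35 | 56.55 (2) | 4.23 | 56.31 (1) |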
IV-B Drift detection accuracy with synthetic data sets
Experiment 2: In this experiment, we generated six 2-dimensional data sets to evaluate the power of EIkMeans to detect drift. We compared EIkMeans with the state-of-the-art QuantTree, kdqTree, and k-means space partitioning with the χ² test. The training set contained 2000 samples, and the testing set contained 200 samples. For each data type, we generated 250 stationary testing sets and 250 drift testing sets and evaluated both Type-I and Type-II errors. A Type-I error is the rejection of a true null hypothesis (a "false positive"), and a Type-II error is the failure to reject a false null hypothesis (a "false negative"). These two errors are the most common evaluation metrics for distribution change detection. To evaluate stability, we ran the test 50 times and recorded the mean and standard deviation. Table II presents the data set configurations, and Tables III and IV show the mean and standard deviation of the drift detection results. To evaluate the influence of the training batch size, we changed the training set size to 3000, 4000 and 5000; the detection results are shown in Table V. Fig. 8 shows the space partitioning results of each algorithm.

Findings and discussion: The results show that all drift detection algorithms outperformed the baseline multivariate two-sample test, the MWW test. EIkMeans with the χ² test has an average Type-I error below the predefined threshold as well as the lowest average Type-II error. Compared with the kMeans-based space partitioning, the improvement of EIkMeans space partitioning is significant: it raised the rank from (3) to (1). The kdqTree space partitioning with the χ² test performed well in this experiment and showed no significant disadvantage compared to the others. The QuantTree space partitioning with the χ² and TV statistics did not perform well in general, because the partitioning strategy is not designed for multi-cluster data sets. As we can see, the Type-II error of QuantTree with the χ² statistic is very close to that of the kMeans χ² test on the 2dUmean, 2d1Gmean, 2d1Gvar and 2d1Gcov data sets, which are all single-cluster data sets, whereas its performance dropped significantly on the multi-cluster data sets 2d2Gmean and 2d4Gmean. Based on these results, we conclude that the design of the histogram scheme makes a significant contribution to drift detection accuracy under different data distributions, which is a nontrivial problem. Regarding the batch size, because we use Pearson’s chi-square test as the drift detection hypothesis test, the drift threshold of the test statistic is determined by the chi-square distribution at a given significance level, which assumes a sufficiently large sample set. If a chi-square test is conducted on a small sample, it will yield an inaccurate inference, which might end up committing a Type-II error. As we can see, the Type-II error increases as the training size decreases.
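The evaluation protocol used in this experiment can be sketched as a small helper that runs a detector over the stationary and drift testing sets and reports both error rates in percent. The `detector` callable and the toy threshold rule are hypothetical stand-ins for any of the compared algorithms.

```python
def error_rates(detector, stationary_sets, drift_sets):
    """Return (type_i, type_ii) in percent.

    detector(test_set) is assumed to return True when it signals drift.
    Type-I: drift signalled on a stationary set (false positive).
    Type-II: no drift signalled on a drifted set (false negative).
    """
    false_alarms = sum(1 for s in stationary_sets if detector(s))
    misses = sum(1 for s in drift_sets if not detector(s))
    return (100.0 * false_alarms / len(stationary_sets),
            100.0 * misses / len(drift_sets))


def toy_detector(test_set):
    """Toy rule for illustration only: signal drift when the mean exceeds 0.5."""
    return sum(test_set) / len(test_set) > 0.5


stationary = [[0.0, 0.2], [0.4, 1.0], [0.1, 0.1], [0.0, 0.0]]
drifted = [[1.0, 1.0], [0.2, 0.4], [2.0, 0.0], [0.9, 0.9]]
type_i, type_ii = error_rates(toy_detector, stationary, drifted)
```

Repeating this over several random seeds and taking the mean and standard deviation of the two rates gives figures of the kind reported in the tables.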
Experiment 3: To evaluate the proposed algorithm on high-dimensional data, we expanded the 2d1Gmean and 2d4Gmean data sets to 4, 6, 8, 10 and 20 dimensions by adding normally distributed features. For example, in the 4d1Gmean data set, the first two features are the same as in 2d1Gmean, while the third and fourth features are generated from a normal distribution with covariance equal to 0. Since adding unrelated dimensions reduces drift sensitivity, we increased the drift margin to 0.5 for HD1Gmean and to 1.0 for HD4Gmean, and doubled the training data. The results are given in Fig. 9.
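The dimension expansion can be sketched as appending independent normal features to each 2-dimensional sample. The mean and variance of the added features are not stated in the text, so a standard normal distribution is assumed here, and the function name is hypothetical.

```python
import random


def add_noise_dimensions(samples, target_dim, seed=0):
    """Append independent N(0, 1) features until each sample has target_dim features.

    The original features are kept unchanged; the added features carry no
    drift, which is what makes the detection task harder.
    """
    rng = random.Random(seed)
    expanded = []
    for x in samples:
        extra = [rng.gauss(0.0, 1.0) for _ in range(target_dim - len(x))]
        expanded.append(tuple(x) + tuple(extra))
    return expanded


base = [(0.1, 0.2), (1.5, -0.3), (0.0, 0.0)]
four_d = add_noise_dimensions(base, 4)
```

Because the extra features are drawn independently of the original ones and of each other, their covariance with every other feature is zero in expectation.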
Findings and discussion: In Fig. 9, the Type-II errors increased as the number of non-drift dimensions increased. Most algorithms preserved a low Type-I error, except the MWW test. Although the MWW test has the lowest Type-II error on the HD4Gmean drift data sets, its Type-I error is above the desired threshold. The kdqTree with the χ² test has the best performance on the HD1Gmean data sets, but it becomes powerless on the HD4Gmean data sets. We believe this is because kdqTree does not have an effective method to control the number of samples in each partition, so directly applying the χ² test on the kdqTree partitions is risky. In this experiment, EIkMeans outperforms the others in most cases and keeps its false positive rate below the predefined threshold, which indicates that it is stable on high-dimensional data.
IV-C Drift detection accuracy with real-world data sets
Experiment 4. Drift detection on real-world data sets. For this experiment, we created 8 train-test drift detection tests from 5 real-world data applications. For each test, we generated one training data set and 500 testing data sets. Half of the testing data sets were drawn from the same distribution as the training set, and the other half were drawn from a different distribution. Again, the results were evaluated in terms of Type-I and Type-II errors. The characteristics of these data sets are summarized in Table IX.
HIGGS Bosons and Background Data Set. The objective of this data set is to distinguish the signatures of the processes that produce Higgs boson particles from those of background processes that do not. Four low-level indicators of the azimuthal angular momenta for four particle jets were selected as features, which means the distributions were . The jet momenta distributions of the background processes are denoted as Back, and those of the processes that produce Higgs bosons are denoted as Higgs. The total sample size of mixed Back and Higgs is . We randomly selected 2000 samples without replacement from each distribution as the training data, and 1000 samples were used as the testing set. There were three types of data integration: BackBack, where both sample sets were drawn from Back; HiggsHiggs, where both sample sets were drawn from Higgs; and HiggsBack, where one sample set was drawn from Higgs and the other from Back.
MiniBooNE Particle Identification Data Set. This data set contains 36,499 signal events and 93,565 background events. Each event has 50 particle ID variables. The drift detection task is to distinguish between signal events and background events. The training set size was 2000, and the testing set size was 500.
Arabic Digit Mixture Data Set. This data set contains audio features of 88 people (44 females and 44 males) pronouncing Arabic digits between 0 and 9. We applied the same configuration as Denis et al. [14]. The data set was originally i.i.d. and contained a time series for 13 different attributes. The revised configuration has 26 attributes instead of 13 time series, replacing each time series with its mean and standard deviation. Mixture distributions were generated by grouping female and male labels. The first mixture distribution contained randomly selected samples of both males and females, with male labels from 0 to 4 and female labels from 5 to 9. The second mixture distribution reversed the label at 9, i.e., the samples of digit 9 were drawn from the male speakers. We configured the data set this way to create multiple clusters, where the pronunciation of each digit formed a cluster. The configuration is summarized in Table VI. The training set size was 2000, and the testing set size was 500.
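Table VI: Speaker gender (M: male, F: female) per digit in the two Arabic digit mixture distributions.

| Digit | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| First mixture | M | M | M | M | F | F | F | F | F | F |
| Second mixture | M | M | M | M | F | F | F | F | F | M |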
Localization Mixture Data Set. The localization data set contains data from a sensor carried by 5 different people (A, B, C, D, E). The original data has 11 different movements with imbalanced samples. To use this data set for drift detection, we selected the three movements with the most samples: ’lying’, ’walking’ and ’sitting’. To simulate multiple clusters with drift, we grouped samples from different people together at different percentages to produce varied data distributions. The training set size was 3000, and the testing set size was 600. The sample proportion of each person is summarized in Table VII.
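Table VII: Sample proportion of each person in the two Localization mixture distributions.

| Person | A | B | C | D | E |
|---|---|---|---|---|---|
| First distribution | 0.0 | 0.2 | 0.2 | 0.2 | 0.4 |
| Second distribution | 0.4 | 0.4 | 0.2 | 0.0 | 0.0 |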
Insects Mixture Data Set. This data set contains features from a laser sensor. The task is to distinguish between 5 possible specimens of flying insects that pass through a laser in a controlled environment (Flies, Aedes, Tarsalis, Quinx, and Fruit). A preliminary analysis showed no drift in the feature space. However, the class distributions gradually change over time. To simulate drift in multiple clusters, we selected samples from different insects and grouped them together at different percentages, so that the data distribution varies. The training set size was 2000, and the testing set size was 500. The sample proportion of each specimen is summarized in Table VIII.
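Table VIII: Sample proportion of each specimen in the two Insects mixture distributions.

| Specimen | Flies | Aedes | Tarsalis | Quinx | Fruit |
|---|---|---|---|---|---|
| First distribution | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 |
| Second distribution | 0.14 | 0.14 | 0.2 | 0.2 | 0.32 |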
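The proportion-based mixing used for the Localization and Insects data sets can be sketched as drawing a fixed share of each testing set from every group. The helper name and the placeholder pools below are hypothetical, and sampling without replacement within each group is an assumption; the proportions are the two Insects rows listed above.

```python
import random
from collections import Counter


def sample_mixture(pools, proportions, n, seed=0):
    """Draw about n samples in total, taking round(p * n) from each named pool."""
    rng = random.Random(seed)
    mixture = []
    for name, p in proportions.items():
        mixture.extend(rng.sample(pools[name], round(p * n)))
    rng.shuffle(mixture)
    return mixture


names = ["Flies", "Aedes", "Tarsalis", "Quinx", "Fruit"]
# Placeholder pools: each sample is just a (group, index) pair.
pools = {name: [(name, i) for i in range(200)] for name in names}

first = dict(zip(names, [0.2, 0.2, 0.2, 0.2, 0.2]))
second = dict(zip(names, [0.14, 0.14, 0.2, 0.2, 0.32]))

test_same = sample_mixture(pools, first, 500)
test_drift = sample_mixture(pools, second, 500, seed=1)
drift_counts = Counter(group for group, _ in test_drift)
```

Changing only the proportions shifts the overall distribution while each group's own feature distribution stays the same, which is how drift is simulated across multiple clusters.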
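Table IX: Characteristics of the real-world data sets.

| Data set ID | Data set name | # Features | # Training | # Testing |
|---|---|---|---|---|
| RealI | HiggsBack | 4 | 2000 | 1000 |
| RealII | BackHiggs | 4 | 2000 | 1000 |
| RealIII | SignBack | 50 | 2000 | 500 |
| RealIV | BackSign | 50 | 2000 | 500 |
| RealV | Arabic | 26 | 2000 | 500 |
| RealVI | Arabic | 26 | 2000 | 500 |
| RealVII | Localization | 3 | 3000 | 600 |
| RealVIII | Insect | 49 | 2000 | 500 |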
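Table X: Type-I error (%) on the real-world data sets (mean ± standard deviation).

| Data set ID | EIkMeans χ² test | kMeans χ² test | kdqTree χ² test | QuantTree χ² stat | QuantTree TV stat | MWW test |
|---|---|---|---|---|---|---|
| RealI | 6.34±5.97 | 3.50±2.72 | 4.80±4.18 | 6.70±5.73 | 5.77±4.85 | 7.14±5.94 |
| RealII | 9.09±8.13 | 3.44±3.37 | 4.06±3.24 | 5.26±3.15 | 4.34±2.71 | 6.54±5.72 |
| RealIII | 4.67±2.74 | 1.8±1.35 | 3.12±3.59 | 3.83±1.9 | 3.69±1.80 | 3.14±3.60 |
| RealIV | 4.06±2.03 | 1.94±1.52 | 2.94±2.50 | 4.91±2.68 | 4.59±2.39 | 3.26±3.21 |
| RealV | 4.40±4.14 | 3.96±3.97 | 5.37±5.33 | 4.42±4.11 | 4.30±3.96 | 1.46±3.05 |
| RealVI | 4.45±3.94 | 2.96±2.35 | 4.81±4.20 | 5.34±6.45 | 4.87±5.16 | 1.33±1.84 |
| RealVII | 0.00±0.00 | 0.00±0.00 | 0.00±0.00 | 2.00±14.14 | 2.00±14.14 | 10.00±30.30 |
| RealVIII | 2.81±2.59 | 1.58±1.60 | 3.42±7.90 | 5.66±4.96 | 4.97±4.24 | 10.45±9.39 |
| Average | 4.48 | 2.40 | 3.57 | 4.77 | 4.32 | 5.42 |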
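Table XI: Type-II error (%) on the real-world data sets (mean ± standard deviation). The rank of each average is given in parentheses.

| Data set ID | EIkMeans χ² test | kMeans χ² test | kdqTree χ² test | QuantTree χ² stat | QuantTree TV stat | MWW test |
|---|---|---|---|---|---|---|
| RealI | 87.16±9.69 | 85.34±9.46 | 78.61±11.69 | 80.00±13.10 | 82.76±10.84 | 89.45±7.85 |
| RealII | 74.02±17.21 | 92.73±5.77 | 77.71±13.48 | 79.66±11.62 | 84.46±8.80 | 69.78±19.22 |
| RealIII | 0.00±0.00 | 0.00±0.00 | 25.94±43.94 | 0.00±0.00 | 0.00±0.00 | 0.00±0.00 |
| RealIV | 0.00±0.00 | 0.00±0.00 | 3.90±12.02 | 0.00±0.00 | 0.00±0.00 | 0.00±0.00 |
| RealV | 12.92±12.61 | 18.94±13.18 | 16.78±10.90 | 66.68±17.64 | 84.71±9.06 | 98.15±1.99 |
| RealVI | 12.10±12.92 | 4.26±5.38 | 17.15±10.51 | 92.03±7.82 | 93.98±6.51 | 98.82±1.23 |
| RealVII | 62.00±49.03 | 36.00±48.49 | 58.00±49.86 | 62.00±49.03 | 64.00±48.49 | 2.00±14.14 |
| RealVIII | 12.37±10.98 | 32.62±17.45 | 30.22±17.35 | 77.32±14.40 | 80.90±11.92 | 68.66±18.43 |
| Average | 32.57 (1) | 33.74 (2) | 38.54 (3) | 57.21 (5) | 61.35 (6) | 53.36 (4) |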
Findings and discussion: The mean and standard deviation of the drift detection results are shown in Tables X and XI. The results show that all tested methods except the MWW test returned an average false positive rate below the predefined threshold. EIkMeans had the lowest average Type-II error of 32.57%, which is 1.17 percentage points lower than the next best performance, achieved by k-means with the χ² test. However, the improved drift detection power of EIkMeans comes at the cost of an increased false positive rate, which on some data sets exceeds the predefined threshold. This result conforms to our expectation, since EIkMeans places more restrictive constraints on the number of samples in each partition to meet the requirements of the χ² test. With a small sample set, the χ² test will yield an inaccurate inference and is prone to Type-II errors. Notably, however, while the true positive detection accuracy may be impaired, the average false alarm rate does not surpass the predefined threshold.
Across the RealI to RealVII data sets, the cluster-based algorithms performed just as well as the tree-based algorithms. However, the QuantTree algorithms completely lost their power to detect drift on the cluster-based RealVIII Insect mixture data set, while EIkMeans showed the best performance.
V Conclusions and Future Work
In this paper, we proposed a novel space partitioning algorithm, called EIkMeans, for drift detection on multi-cluster data sets. EIkMeans is a modified k-means algorithm that searches for the best centroids to create partitions. The distances between samples and centroids are amplified or shrunk based on the cluster intensity ratios. The proposed algorithm detects concept drift from a data distribution perspective. As with most distribution-based drift detection algorithms, in a supervised learning setting it will trigger a drift alarm if there is a real or virtual drift, but it may not be able to distinguish between the drift types. The results of our experiments show the power of EIkMeans to detect drift on multi-cluster data sets and show that histogram design is critical to drift detection accuracy. The results also show that uniform space partitioning may not always outperform other schemes; the performance of a space partitioning algorithm is data-dependent.
The version of EIkMeans considered in this paper is designed for Pearson’s chi-square test, but different hypothesis tests may require different methods of histogram construction. This is something we intend to explore in future work. In addition, concept drift detection is only one aspect of learning in a dynamic stream. How to design a tailored drift adaptation algorithm that leverages the drift detection result to achieve better performance in stream learning is our next target.
Acknowledgment
The work presented in this paper was supported by the Australian Research Council (ARC) under Discovery Project DP190101733. We acknowledge the support of NVIDIA Corporation with the donation of GPU used for this research.
References

[1] (2017) Hierarchical change-detection tests. IEEE Transactions on Neural Networks and Learning Systems 28 (2), pp. 246–258.
[2] (2013) Just-in-time classifiers for recurrent concepts. IEEE Transactions on Neural Networks and Learning Systems 24 (4), pp. 620–634.
[3] (2017) RDDM: reactive drift detection method. Expert Systems with Applications 90, pp. 344–355.
[4] (1975) Multidimensional binary search trees used for associative searching. Communications of the ACM 18 (9), pp. 509–517.
[5] (2007) Learning from time-changing data with adaptive windowing. In Proceedings of the 2007 SIAM International Conference on Data Mining, pp. 443–448.
[6] (2018) QuantTree: histograms for change detection in multivariate data streams. In Proceedings of the 2018 International Conference on Machine Learning, pp. 638–647.
[7] (2017) Uniform histograms for change detection in multivariate data. In Proceedings of the 2017 International Joint Conference on Neural Networks, pp. 1732–1739.
[8] (1978) Statistics for experimenters: an introduction to design, data analysis, and model building. Vol. 1, JSTOR.
[9] (2018) A pdf-free change detection test based on density difference estimation. IEEE Transactions on Neural Networks and Learning Systems 29 (2), pp. 324–334.
[10] (2017) An incremental change detection test based on density difference estimation. IEEE Transactions on Systems, Man, and Cybernetics: Systems PP (99), pp. 1–13.
[11] (2019) MUSE-RNN: a multilayer self-evolving recurrent neural network for data stream classification. In Proceedings of the 19th IEEE International Conference on Data Mining.
[12] (2006) An information-theoretic approach to detecting changes in multidimensional data streams. In Proceedings of the 28th Symposium on the Interface of Statistics, Computing Science, and Applications, pp. 1–24.
[13] (2015) Learning in nonstationary environments: a survey. IEEE Computational Intelligence Magazine 10, pp. 12–25.
[14] (2016) Fast unsupervised online drift detection using incremental Kolmogorov-Smirnov test. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1545–1554.
[15] (1974) Quad trees a data structure for retrieval on composite keys. Acta Informatica 4 (1), pp. 1–9.
[16] (2015) Online and nonparametric drift detection methods based on Hoeffding’s bounds. IEEE Transactions on Knowledge and Data Engineering 27 (3), pp. 810–823.
[17] (1979) Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. The Annals of Statistics, pp. 697–717.
[18] (2014) A survey on concept drift adaptation. ACM Computing Surveys 46 (4), pp. 1–37.
[19] (2016) Sand: semi-supervised adaptive novel class detection and classification over data stream. In Proceedings of the 2016 AAAI Conference on Artificial Intelligence.
[20] (2007) An active learning system for mining time-changing data streams. Intelligent Data Analysis 11 (4), pp. 401–419.
[21] (2014) PCA feature extraction for change detection in multidimensional unlabeled data. IEEE Transactions on Neural Networks and Learning Systems 25 (1), pp. 69–80.
[22] (2018) Accumulating regional density dissimilarity for concept drift detection in data streams. Pattern Recognition 76, pp. 256–272.
[23] (2017) Regional concept drift detection and density synchronized drift adaptation. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 2280–2286.
[24] (2020) Diverse instances-weighting ensemble based on region drift disagreement for concept drift adaptation. IEEE Transactions on Neural Networks and Learning Systems Early Access, pp. 1–16.
[25] (2020) Heterogeneous domain adaptation: an unsupervised approach. IEEE Transactions on Neural Networks and Learning Systems Early Access, pp. 1–15.
[26] (2018) Learning under concept drift: a review. IEEE Transactions on Knowledge and Data Engineering.
[27] (2016) A concept drift-tolerant case-base editing technique. Artificial Intelligence 230, pp. 108–133.
[28] (2014) Concept drift detection via competence models. Artificial Intelligence 209, pp. 11–28.
[29] (2012) DDD: a new ensemble approach for dealing with concept drift. IEEE Transactions on Knowledge and Data Engineering 24 (4), pp. 619–633.
[30] (2012) A unifying view on dataset shift in classification. Pattern Recognition 45 (1), pp. 521–530.
[31] (2016) Fast Hoeffding drift detection method for evolving data streams. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 96–111.
[32] (2001) Learn++: an incremental learning algorithm for supervised neural networks. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 31 (4), pp. 497–508.
[33] (2019) Weakly supervised deep learning approach in streaming environments. In Proceedings of the 2019 IEEE International Conference on Big Data.
[34] (2019) Automatic construction of multilayer perceptron network from streaming examples. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1171–1180.
[35] (2017) A survey on data preprocessing for data stream mining: current status and future directions. Neurocomputing 239, pp. 39–57.
[36] (2014) Decision trees for mining data streams based on the Gaussian approximation. IEEE Transactions on Knowledge and Data Engineering 26 (1), pp. 108–119.
[37] (2017) On the reliable detection of concept drift from streaming unlabeled data. Expert Systems with Applications 82, pp. 77–99.
[38] (2017) Robust prototype-based learning on data streams. IEEE Transactions on Knowledge and Data Engineering 30 (5), pp. 978–991.
[39] (2018) Density estimation for statistics and data analysis. Routledge.
[40] (2019) Fuzzy clustering-based adaptive regression for drifting data streams. IEEE Transactions on Fuzzy Systems.
[41] (2013) Concept drift-oriented adaptive and dynamic support vector machine ensemble with time window in corporate financial risk prediction. IEEE Transactions on Systems, Man, and Cybernetics: Systems 43 (4), pp. 801–813.
[42] (2016) Online ensemble learning of data streams with gradually evolved classes. IEEE Transactions on Knowledge and Data Engineering 28 (6), pp. 1532–1545.
[43] (2018) A systematic study of online class imbalance learning with concept drift. IEEE Transactions on Neural Networks and Learning Systems.
[44] (2015) Resampling-based ensemble methods for online class imbalance learning. IEEE Transactions on Knowledge and Data Engineering 27 (5), pp. 1356–1368.
[45] (2005) Relevant data expansion for learning concept drift from sparsely labeled data. IEEE Transactions on Knowledge and Data Engineering 17 (3), pp. 401–412.
[46] (2016) Exploiting attribute correlations: a novel trace lasso-based weakly supervised dictionary learning method. IEEE Transactions on Cybernetics 47 (12), pp. 4497–4508.
[47] (2013) Active learning with drifting streaming data. IEEE Transactions on Neural Networks and Learning Systems 25 (1), pp. 27–39.
[48] (2016) An overview of concept drift applications. In Big data analysis: new algorithms for a new society, pp. 91–114.