Conventional machine learning methods assume learning environments are stationary, that is, the testing data has the same data generation distribution as the training data. However, this assumption is substantially undermined in the context of the Internet of Things and Big Data [40, 26, 41], where data distributions can easily change over time. A data distribution is just a reflection of how frequently a real-world concept appears in some data. Hence, this phenomenon of changing distributions has been termed concept drift and, with today’s advancing data streams, learning in its presence has become an important topic of research. A typical symptom of concept drift is a shift in the decision boundary of a classifier, which reduces prediction accuracy. For example, consider a user preference prediction or fraud detection task on streaming data. The performance of a static predictor trained on historical data will inevitably degrade over time because personal preferences and fraudulent attacks are always evolving. As concepts change, new data may no longer conform to old patterns, which negatively impacts subsequent data analysis tasks. More importantly, these changes may be barely perceptible in real-world scenarios. For this reason, a continuous learning system should vigilantly monitor concept drift and adapt to it quickly, rather than assuming a learning environment is stationary.
Ensemble algorithms are useful for data stream learning as they can be integrated with drift detection algorithms and updated dynamically. Two comprehensive surveys on data stream ensemble learning [28, 34] have cataloged the advantages and disadvantages of current ensemble learning algorithms. Both point out that dedicated diversity measurements for data stream classifier ensembles are a worthwhile direction of research. Additionally, the literature shows that most existing ensemble diversity measurements are based on the outputs of classifiers, for example, the Kohavi-Wolpert variance, double fault, the interrater agreement, Yule’s Q statistic, and coincident failure diversity [18, 36]. Such diversity is created via input manipulation, output manipulation, base learner manipulation, or heterogeneous base learners. Existing ensemble diversity measurements are reportedly designed for static learning, and there have been no proposals for an ensemble diversity measurement specifically for evolving data streams.
To the best of our knowledge, no established ensemble diversity measurement can reflect disagreements between the base learners as to whether a concept drift has occurred or not. We propose that an ensemble diversity measurement for changing data streams should be able to address this issue since it can help the ensemble system to select the most appropriate ensemble members (ensemblers) for the final prediction. High diversity between two base learners indicates that one learner has found a drift while the other has not. Since every base learner updates itself with its own rule, different opinions on the existence of drift will result in different update statuses. This creates diversity in itself and is the inspiration behind our thinking. To implement this idea, we propose to detect drift using different change detection settings, and adjust instances’ weights to adapt to drift. In other words, drift detection works at the instance level, where a data instance is considered less important if it is located in a region of drift [38, 37]. A region drift disagreement index is proposed as a tool to measure ensemble diversity and to select the most appropriate ensemblers to vote for in the final prediction.
The predictive accuracy of current concept drift detection-adaptation strategies is highly reliant on accurate change detection and a low false positive rate. Missing a change or raising a false alarm may impair their overall performance. In contrast, in an innovative way, our stance is that whether concept drift is present or not is uncertain. Under this assumption, it is reasonable that different change detection tests could have different change detection results. Therefore, instead of updating the learning models whenever an ensemble detects a drift, we have opted for a voting strategy where the ensemblers with the most controversial change detection results act as the voting representatives. In broad terms, the innovation here is to leverage the base learner’s update mechanism as a way to create ensemble diversity. Assigning different change detection settings to each ensembler will give rise to disagreement and then the members with the most controversial results can be selected to maximize diversity.
The main contribution of this paper is a diverse instance-weighting ensemble algorithm (DiwE) based on region drift disagreement for concept drift adaptation. DiwE consists of an instance weighting method to incrementally adjust sample weights and an ensemble diversity measurement to select the ensemblers. By defining different schemes for constructing the region sets, DiwE can dynamically change the weights of instances according to newly-emerging concepts and, further, can select the combination of ensemblers with the highest diversity. Compared to other concept drift adaptation ensemble algorithms, DiwE has the following advantages:
It can incrementally adapt instance weights based on the estimated risk of region drift and pass this information to learning models before a drift becomes statistically significant.
It can dynamically select different ensemblers in different concept drift situations and automatically adapt to the diversity present, which is not possible with existing methods.
The rest of this paper is organized as follows. The definition of concept drift is provided in Section II along with a review of relevant works. Section III formally describes the proposed DiwE algorithm for concept drift adaptation and its associated adjuncts, including the region drift disagreement index and procedure for selecting the ensembler with the maximum disagreement in region drift. Section IV presents the evaluation of DiwE using benchmarks that include artificial streams with known drift characteristics and highly-referenced real-world applications. Section V concludes this study with a discussion on future work.
II Preliminary and Related Works
This section begins by introducing the preliminaries, definitions, and types of concept drift. The state-of-the-art ensemble-based algorithms for handling concept drift are then categorized based on their drift detection and adaptation strategies.
II-A The definition of concept drift
Concept drift is caused by variations in the distribution of data, which, in turn, leads to a disparity between the training samples and the data streams associated with non-stationary learning environments. Denote the feature space as $\mathcal{X} \subseteq \mathbb{R}^n$, where $n$ is the dimensionality of the feature space. A data instance $d_i = (X_i, y_i)$ is a pair of a feature vector $X_i \in \mathcal{X}$ and a label $y_i \in \{1, \ldots, c\}$, where $c$ is the number of classes. A data stream can then be represented as an infinite sequence of data instances denoted as $D = \{d_1, d_2, \ldots\}$. A concept drift occurs at time $t+1$ if the joint probability of $X$ and $y$ changes, that is, $P_t(X, y) \neq P_{t+1}(X, y)$ [42, 38, 26, 54], where $t$ is the total number of available data instances. In this paper, time is considered to be discrete.
If we further decompose $P(X, y)$, we have $P(X, y) = P(X)\,P(y \mid X)$. In considering problems that use $X$ to infer $y$, concept drift is generally divided into two sub-research topics, depending on whether the change occurs in $P(X)$ or in $P(y \mid X)$.
By this definition, a drift consists of both the time information and the location information in the feature space. Note that $P(X)$ and $P(y \mid X)$ are not the only elements affected by drift. The prior probabilities of classes $P(y)$ and the class conditional probabilities $P(X \mid y)$ may also change, which could lead to $P(y \mid X)$ changing, again affecting $P(X, y)$. This issue is the next challenge to consider in concept drift learning, i.e., understanding the reason for the drift.
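The effect of a drift that is confined to one part of the feature space can be illustrated with a small simulation. The following is a minimal sketch, not part of the proposed method: the one-dimensional stream, the label rule, the drifting interval [0.6, 0.8], and the names `true_label` and `static_predict` are all assumptions chosen for illustration.

```python
import numpy as np

def true_label(X, drifted):
    """Label rule y = 1[X > 0.5]; after the drift, labels are flipped only
    inside 0.6 <= X <= 0.8, i.e., a local change in P(y | X)."""
    y = (X > 0.5).astype(int)
    if drifted:
        in_region = (X >= 0.6) & (X <= 0.8)
        y = np.where(in_region, 1 - y, y)
    return y

def static_predict(X):
    """A fixed predictor that matches the pre-drift concept."""
    return (X > 0.5).astype(int)

rng = np.random.default_rng(0)
X_old = rng.uniform(0.0, 1.0, size=2000)   # before the drift
X_new = rng.uniform(0.0, 1.0, size=2000)   # after the drift; P(X) is unchanged

acc_before = float(np.mean(static_predict(X_old) == true_label(X_old, False)))
acc_after = float(np.mean(static_predict(X_new) == true_label(X_new, True)))

# Accuracy split by location: the error is concentrated in the drifted region.
inside = (X_new >= 0.6) & (X_new <= 0.8)
acc_inside = float(np.mean(static_predict(X_new[inside]) == true_label(X_new[inside], True)))
acc_outside = float(np.mean(static_predict(X_new[~inside]) == true_label(X_new[~inside], True)))

print(acc_before, acc_after, acc_inside, acc_outside)
```

The overall accuracy drops only moderately, while the accuracy inside the drifted interval collapses, which is why a drift is described by both its time and its location in the feature space.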
Tsymbal et al. first defined region drift as changes in the concepts and the data distributions at the instance level rather than at the dataset level. They noted that lazy learning is a good option for addressing region drift problems. Other related studies that address region drift with decision tree models [31, 23, 6, 5] have also demonstrated good results. However, decision tree models require a minimum number of instances to perform their splitting algorithm. For example, the CVFDT algorithm normally requires observation of 200 data instances before attempting to split the nodes. If a region drift occurs within those 200 data instances, a tree node will be updated before splitting, which means no region drift will be identified in that node. In [38, 37], the authors proposed quantifying the discrepancies in regional density with a metric they called the local drift degree. These discrepancies are accumulated to determine whether the overall change is sufficiently significant to report concept drift. These studies, like many other drift detection algorithms [22, 19, 4], require prior knowledge to organize the data samples in a stream into time-window sample sets. If a drift is confirmed, the old time window is replaced with the latest time window, but the non-drift information in the old window is not reused. An empirical method for selecting the region size has been developed, based on a metric called the information granularity indicator. However, how a region of concept drift should be defined at the instance level, and how large that region should be, both remain unsolved.
II-B Ensemble-based algorithms for concept drift handling
The research on handling concept drift mainly covers drift detection and drift adaptation. The aim of drift detection is to determine the time at which a drift occurs and notify a learning model to update itself. Drift adaptation focuses on how to update a learning model with the least effort to achieve the best learning results.
An ensemble is a set of individual classifiers whose predictions are combined to predict (e.g., classify) new incoming instances. Ensemble learning is considered to be one of the most promising research directions for intelligent data stream analysis. Ensembles for concept drift detection seek to improve the precision of identifying a change (drift) and reduce false alarm rates. These types of algorithms for drift detection are also known as multiple hypothesis testing. Ensembles for concept drift adaptation aim to improve overall prediction accuracy by decomposing a complex learning problem into easier sub-problems [34, 28]. Ensemble algorithms are efficient at drift adaptation because they can easily incorporate dynamic updates, such as selective removal or addition of classifiers. These algorithms operate in one of two modes, online mode or chunk mode, and fall into two categories: active and passive. Online mode algorithms process data instances one-by-one. Chunk mode algorithms process data instances in fixed batch sizes (i.e., chunks).
The active category relies on change detection tests to trigger the adaptation process. The tests inspect the features extracted from the data generation process and/or from an analysis of the classification errors (evaluated over labeled samples). Any detected changes are then accommodated by either updating or retraining the classifier(s). Of course, to perform well, changes need to be detected promptly and false positive rates need to be controlled [4, 11, 12]. Unlike the active category, the passive category does not actively detect drift as new data arrives. Rather it simply accepts that the underlying data distributions may change at any time and at any rate . To accommodate this uncertainty, the model is adapted every time new data arrives.
From a drift detection perspective, ensemble algorithms fall into two categories [43, 20, 40]: active drift detection with an adaptive ensemble, or a passive ensemble with a forgetting mechanism. Algorithms like ADWIN-ARF and Leveraging Bagging fall into the first category. With these, the ensemble actively searches for concept drift and builds new ensemblers if a drift is detected. In the second category, the ensemblers are built without considering the conflicts between concepts, and the base learners are built according to a predefined time frame without explicit drift detection. Example algorithms in this second category include DWM, Learn++.NSE, AUE1, AUE2, and OnlineAUE. These algorithms attempt to learn drift incrementally with each new piece of arriving data, eliminating old ensemblers through a forgetting mechanism. The major difference between the two categories is whether the ensemble algorithm contains an explicit drift detection method. Other interesting research about how ensemble diversity may affect drift adaptation is discussed in [46, 45].
Class imbalance is another problem with concept drift. The class imbalance problem in sequential learning has garnered increasing attention from researchers in various application domains. The two most prominent solutions are ensemble of subset online sequential extreme learning machine (ESOS-ELM)  and meta-cognitive online sequential extreme learning machine (MOS-ELM) . A study of online class imbalance learning with concept drift can be found in .
II-C Ensemble diversity measurement
Ensemble diversity seeks to quantitatively analyze the dissimilarity between a set of individual classifiers. As illustrated in , the importance of ensemble diversity can be intuitively explained using the anthropomorphic example of a group of individuals with different knowledge backgrounds who need to make decisions together. If the group had the same knowledge backgrounds, they would not be able to think about the problem from different angles, while a diverse group of individuals is more capable of lateral thinking.
Moreover, some correlations between accuracy and specific diversity measurements have been found in special cases [35, 46, 45]. For example, Minku et al.  argue that, before a drift, ensembles with less diversity have lower test errors, while, after a drift, maintaining highly diverse ensembles could result in lower test errors. The authors also find that diversity is beneficial for reducing the errors caused by a drift, but it does not speed up recovery from a drift over the long term . Their argument is well supported by comprehensive evaluations. However, theoretical guarantees for more general cases are yet to be discovered . In addition, since there is no generally-accepted definition of diversity, a way of proving the correlations between accuracy and diversity is still not clear .
Most ensemble algorithms are accompanied by strategies to create diversity, even if the strategies are not part of the core algorithm . Only a few studies [28, 18, 46, 45] have been undertaken to devise specific diversity measurements and their properties for ensemble learning with data streams. Brzezinski and Stefanowski’s approach  is to visualize a diversity measurement over time and use those values as complementary information to the data stream. Overall, diversity creation methods can be divided into four broad categories: input manipulation, output manipulation, base learner manipulation, and heterogeneous base learners . However, there is still a gap in how to define ensemble diversity for concept drift detection and adaptation .
III A Diverse Instance Weighting Ensemble Algorithm via Drift Risk Estimation on Different Region Sets
Our proposal for diverse instance weighting is based on region drift estimation, where a region is defined as an $n$-ball, i.e., a ball in an $n$-dimensional Euclidean feature space. The intuition behind the idea is to estimate concept drift at the instance level rather than across the entire feature space, because detecting drift across the entire feature space carries the risk that sub-spaces with insignificant changes are combined, which dilutes the overall discrepancy in density. In general, the proposed method chooses different region sizes within a range to perform ensemble learning, and the ensemblers with the highest regional drift disagreement are selected to perform the final voting. In this paper, we assume that data instances arrive independently and identically distributed (i.i.d.) in an online mode.
III-A A phi-level region set
III-A1 The definition of a phi-level region set
Large regions are not sensitive to local drifts, but they are robust to noise. Conversely, small regions are sensitive to local drifts, but may be affected by noise. Setting a specific distance for constructing a region is one option; however, using a fixed distance has several drawbacks when dealing with a high-dimensional feature space, arbitrary shapes, or distributions with high-density variations between different regions. For example, sparse regions may have no data instances, while dense regions may have too many. Similar problems can occur when using kernel density estimation. Therefore, we have considered region size from a probability perspective rather than a geometric view. In other words, the region size should be determined by the relative proportion of the region sample.
A $\varphi$-level region set is defined as a set of $n$-balls of the feature space $\mathcal{X}$, $R_\varphi = \{r_{d_1}, \ldots, r_{d_t}\}$, where $d_i$ is the core data instance of the region $r_{d_i}$, $\epsilon_{d_i}$ is the radius, and $t$ is the number of available data instances. The sample proportion in each region is equal to $\varphi$, namely $|r_{d_i}| / t = \varphi$. The parameter $\varphi$ ranges between 0 and 1, and determines the radius $\epsilon_{d_i}$.
To implement a $\varphi$-level region set for computation, we opted for a $k$-nearest neighbor-based region construction method. Consider a data instance $d_i$ in a non-drift period as the center and a distance $\epsilon_{d_i}$ as the radius. The empirical $\varphi$ of the region can be estimated by the number of instances in the region divided by the total number of instances, denoted as $\hat{\varphi} = |r_{d_i}| / t$, where $|r_{d_i}|$ is the cardinality of $r_{d_i}$, $r_{d_i} = \{d_j : \lVert X_i - X_j \rVert \le \epsilon_{d_i}\}$, and $\lVert X_i - X_j \rVert$ denotes the Euclidean distance between the features of $d_i$ and $d_j$. The interval or radius of a region is determined by the distance between the data instance and its $k$-th nearest neighbor $d_{(k)}$, denoted as
$$\epsilon_{d_i} = \lVert X_i - X_{(k)} \rVert, \qquad k = \lceil \varphi t \rceil.$$
As such, a larger $\varphi$ value implies a larger region size, which is less sensitive to small discrepancies in density. Figure 1 illustrates the components of a region and the region set in a 2-dimensional space.
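The nearest-neighbor construction described above can be sketched in a few lines. This is a hedged illustration rather than the paper's implementation: the names `build_region_set` and `in_region` are hypothetical, the neighbor count is assumed to be the region sample proportion times the number of buffered instances, rounded up, and a brute-force distance matrix is used for clarity.

```python
import math
import numpy as np

def build_region_set(X, phi):
    """Return (k, radii), where radii[i] is the distance from X[i] to its
    k-th nearest neighbour (the point itself excluded).

    Assumption: k = ceil(phi * len(X)); the small rounding step guards
    against floating-point noise in phi * len(X)."""
    n = len(X)
    k = max(1, math.ceil(round(phi * n, 9)))
    if k >= n:
        raise ValueError("phi is too large for the number of buffered instances")
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # After sorting each row, column 0 holds the zero self-distance,
    # so column k holds the distance to the k-th nearest neighbour.
    radii = np.sort(dist, axis=1)[:, k]
    return k, radii

def in_region(x, center, radius):
    """True if x falls inside the ball around center."""
    return float(np.linalg.norm(np.asarray(x) - np.asarray(center))) <= radius

# Ten equally spaced points on a line, 20% of the sample per region.
X = np.arange(10, dtype=float).reshape(-1, 1)
k, radii = build_region_set(X, phi=0.2)
print(k, radii[0], radii[5])
```

Points at the edge of the sample need a larger radius than interior points to capture the same share of instances, which is the behaviour a fixed-distance region would not provide.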
III-A2 The minimum sample size to initialize a region set
With sufficient data, a $\varphi$-level region can be constructed by choosing $k = \lceil \varphi t \rceil$ and calculating the distance between $d_i$ and its $k$-th nearest neighbor as the radius. According to Box et al., the distribution of the number of instances in a region, $|r_{d_i}|$, and the region sample proportion are approximately normal for large values of $t$. The mean and variance of $|r_{d_i}|$ are $t\varphi$ and $t\varphi(1-\varphi)$, respectively, which is identical to the mean and variance of a binomial distribution. Similarly, the mean and variance for an approximately normal distribution of the sample proportion are $\varphi$ and $\varphi(1-\varphi)/t$, respectively. However, because a normal approximation is not accurate for small values of $t$, a good rule of thumb is to use the normal approximation only if $t\varphi$ and $t(1-\varphi)$ are both sufficiently large. In other words, the number of available data instances must exceed a lower bound that grows as $\min(\varphi, 1-\varphi)$ shrinks. For example, if we want to create a region set with , the minimum number of data instances we need is
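The sample-size requirement can be turned into a small helper. The sketch below is illustrative only: the function name `min_sample_size` is hypothetical, and the threshold of 5 is a commonly quoted textbook value for the normal approximation to the binomial, assumed here because the constant used in the text is not recoverable.

```python
import math

def min_sample_size(phi, threshold=5):
    """Smallest n such that n * phi >= threshold and n * (1 - phi) >= threshold.

    `threshold` is an assumed rule-of-thumb constant, not a value taken
    from the text."""
    if not 0.0 < phi < 1.0:
        raise ValueError("phi must lie strictly between 0 and 1")
    p = min(phi, 1.0 - phi)
    # Rounding before ceil avoids an off-by-one caused by float division.
    return math.ceil(round(threshold / p, 9))

for phi in (0.25, 0.5, 0.75):
    print(phi, min_sample_size(phi))
```

Smaller region proportions need more buffered instances before a region set can be initialized reliably, so the smallest value in the parameter grid determines the warm-up length.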
III-B An incremental instance-weighting function
III-B1 The intuition behind diverse instance weighting
The fundamental idea of instance weighting is that, if there is no concept drift in the streaming data, whether or not each of the next incoming data instances is located in a region can be considered as a Bernoulli process. For the region $r_{d_i}$ of a core instance $d_i$, the set of data instances that enter the region from the next $\tau$ continuously arriving samples is denoted as $S_\tau$. If no drift exists between the time points $t$ and $t+\tau$, the cardinality of $S_\tau$ is a random variable that follows the binomial distribution, denoted as
$$|S_\tau| \sim B(\tau, \varphi).$$
Therefore, the probability of observing a number $s$ at time $t+\tau$ can be calculated as
$$P(|S_\tau| = s) = f(s; \tau, \varphi) = \binom{\tau}{s} \varphi^{s} (1-\varphi)^{\tau - s},$$
where $f$ is the probability mass function of the binomial distribution. If $P(|S_\tau| = s) < \alpha$, the event has a small probability of occurring, and that drift level is reported for data instance $d_i$, where $\alpha$ controls the sensitivity to drift.
To implement this approach incrementally, we have placed the focus on calculating $P(|S_\tau| = 0)$, which is the probability that no other data instance will be in region $r_{d_i}$ in the next period. This online region drift condition can be rewritten as
$$P(|S_\tau| = 0) = (1-\varphi)^{\tau} < \alpha.$$
Then, we have $w_{d_i} = (1-\varphi)^{\tau}$, if $|S_\tau| = 0$, which is used as the weighting function.
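The binomial argument above can be checked numerically. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the number of subsequent arrivals is written as `tau`, which is notation introduced here. It computes the probability that a region stays empty and the number of arrivals after which that probability falls below the significance level.

```python
from math import comb

def binom_pmf(s, tau, phi):
    """P(exactly s of tau arrivals land in a region with proportion phi)."""
    return comb(tau, s) * phi ** s * (1.0 - phi) ** (tau - s)

def empty_region_prob(tau, phi):
    """P(no arrival lands in the region within tau arrivals) = (1 - phi)^tau."""
    return (1.0 - phi) ** tau

def arrivals_until_drift(phi, alpha):
    """Smallest tau with (1 - phi)^tau < alpha."""
    tau = 1
    while empty_region_prob(tau, phi) >= alpha:
        tau += 1
    return tau

print(binom_pmf(0, 10, 0.1), empty_region_prob(10, 0.1))
print(arrivals_until_drift(0.5, 0.1), arrivals_until_drift(0.1, 0.05))
```

A smaller region proportion makes an empty region less surprising, so more arrivals are needed before the same significance level is reached; this is one way different region sets come to disagree about the same instance.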
The region will be updated if there exists a time that satisfies the following conditions:
This ensures that the radii of the regions become more accurate as the amount of available data increases.
Essentially this means that each region of drift is detected based on the sequence of data instances arriving from time point 1 to time point , rather than using the data instances in a region. For example, if a data instance arrives at time point , denoted as , region will be built based on the buffered 499 data instances . Setting the region set parameter to and the drift significance level at , will be reported as a drift instance only if no data instances are located in region over the next instances, which is the period .
In addition, the region and the time counter will be reset if a newly arriving instance is located in the neighbourhood of . This ensures the tested neighbourhoods and time periods are independent of previous tests so that DiwE will not perform repeated hypothesis tests. For example, if a data instance that arrived at is located in the neighbourhood of , the will be rebuilt based on the buffered data instances, and the will be reset to . Then the will be reported as a drift instance only if no data instances are located in region for the period
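The reset behaviour described above can be sketched as a small stateful object. This is a simplified illustration, not the paper's algorithm: the class name `RegionTracker` is hypothetical, the weight is assumed to be the probability that the region stays empty over the arrivals seen since the last hit, and the rebuild of the radius from the buffer on a hit is omitted.

```python
import numpy as np

class RegionTracker:
    """Tracks one region and the number of arrivals since it was last hit."""

    def __init__(self, center, radius, phi):
        self.center = np.asarray(center, dtype=float)
        self.radius = float(radius)
        self.phi = float(phi)
        self.elapsed = 0  # arrivals since the last instance fell in the region

    @property
    def weight(self):
        # Assumed weighting: probability of an empty region over `elapsed` arrivals.
        return (1.0 - self.phi) ** self.elapsed

    def observe(self, x):
        """Update the counter with a new arrival and return the current weight."""
        hit = np.linalg.norm(np.asarray(x, dtype=float) - self.center) <= self.radius
        if hit:
            self.elapsed = 0  # reset; rebuilding the radius is not shown here
        else:
            self.elapsed += 1
        return self.weight

    def is_drift(self, alpha):
        return self.weight < alpha

tracker = RegionTracker(center=[0.0], radius=1.0, phi=0.5)
for _ in range(4):
    tracker.observe([5.0])           # four arrivals far from the region
print(tracker.weight, tracker.is_drift(alpha=0.1))
tracker.observe([0.5])               # one arrival inside the region
print(tracker.weight, tracker.is_drift(alpha=0.1))
```

Because the counter restarts on every hit, each test covers a fresh, non-overlapping period, which matches the stated goal of avoiding repeated hypothesis tests on the same data.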
III-B2 The phi-level region set instance weighting function
In a concept drift adaptation scenario, the region will be removed from if . Also, the region with the lowest weight will be replaced by the latest data instance and its region if the region set size reaches a predefined threshold, called the maximum buffer size, denoted as . A $\varphi$-level ensembler is trained based on the core instance set with the weights from the $\varphi$-level region set. That is, the training set is .
III-B3 A strategy to control the impact of false alarms
As argued by Tsymbal et al. , region drift should be defined as changes in concepts (data distributions) at the instance level, not at the dataset level. Therefore, to avoid detecting redundant drifts, and to mitigate the impact of false alarms, only the core instance of a region should be updated/removed. Since each region has a unique core instance, the weights of the overlapped instances will not change, and the weight of the core data instances can be incrementally adapted based on the risk of drift in their region.
III-C A maximum region drift disagreement ensemble
III-C1 Region drift disagreement
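Given two region set parameters $\varphi_1$, $\varphi_2$, and a training set $D$, we can build two region sets, $R_{\varphi_1}$ and $R_{\varphi_2}$. Without concept drift, we assume the data instances have arrived i.i.d. in an online manner. The RDD index of two region sets is defined as the Jaccard dissimilarity of the sets of core instances for those regions, denoted as
$$\mathrm{RDD}(R_{\varphi_1}, R_{\varphi_2}) = 1 - \frac{|C_{\varphi_1} \cap C_{\varphi_2}|}{|C_{\varphi_1} \cup C_{\varphi_2}|},$$
where $C_{\varphi}$ denotes the set of core instances kept in $R_{\varphi}$.
The RDD ensemble diversity is then defined as the average RDD of all pairs of ensemblers in the set of region sets $\mathcal{R} = \{R_{\varphi_1}, \ldots, R_{\varphi_m}\}$, formulated as
$$\mathrm{RDD}(\mathcal{R}) = \frac{2}{m(m-1)} \sum_{i < j} \mathrm{RDD}(R_{\varphi_i}, R_{\varphi_j}).$$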
Since determines the radius for all regions in a region scheme, we can approach the power set of the feature space if there are sufficient data and covers all possible region sample proportion values. However, given there is always some level of constraint on computation costs, cannot be infinite in real-world applications. Therefore, given that is a set of grid values between 0 and 1, we want to select a subset of with a limit number of to reach the maximum RDD diversity (max-RDD), namely
where parameter governs how many ensemblers should be used for the final prediction. The goal of maximum ensemble diversity selection is to select a subset of region sets so that the diversity reaches the maximum value. Many ensemble algorithms use 10 ensemblers as the default setting [27, 16], so we set as default as well. The grid search range of is set to as a default. In this case, the max-RDD will select 10 out of 20 region sets from which to build the base classifiers for making classifications
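The RDD index and the max-RDD selection can be sketched directly from their verbal definitions. The code below is an illustration under assumptions: core instances are represented by integer ids, the function names are hypothetical, and the selection is an exhaustive search over subsets, which is only practical for small grids.

```python
from itertools import combinations

def rdd(cores_a, cores_b):
    """Jaccard dissimilarity between two sets of core-instance ids."""
    a, b = set(cores_a), set(cores_b)
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)

def rdd_diversity(core_sets):
    """Average RDD over all pairs of region sets (needs at least two sets)."""
    pairs = list(combinations(core_sets, 2))
    return sum(rdd(a, b) for a, b in pairs) / len(pairs)

def max_rdd(core_sets_by_phi, m):
    """Return the m region-set parameters with the highest average pairwise RDD."""
    candidates = combinations(sorted(core_sets_by_phi), m)
    return max(candidates,
               key=lambda keys: rdd_diversity([core_sets_by_phi[k] for k in keys]))

# Toy core-instance sets kept by four region sets (ids are made up).
core_sets = {
    0.1: {1, 2, 3, 4},
    0.2: {1, 2, 3, 5},
    0.3: {4, 6, 9},
    0.4: {7, 8},
}
print(rdd(core_sets[0.1], core_sets[0.2]))
print(max_rdd(core_sets, m=3))
```

Region sets that have kept nearly the same core instances agree on where drift occurred and add little to each other, so the search favours sets whose surviving instances overlap least.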
III-C2 The ensemble-voted classifier
The ensemble-voted classifier is a meta-classifier that combines either similar or conceptually different machine learning classifiers for classification via majority voting. Two voting strategies can be used: "hard" and "soft". In hard voting, the final class label prediction is the most-frequently predicted class label by the classification models. In soft voting, the final class label is predicted by averaging the class-probabilities. For example, suppose we have three ensemble classifiers that have calculated their prediction probabilities on a binary classification task as , , , respectively. The combined prediction probability with hard voting would be ; hence, the final prediction result would be the first class. With soft voting, the combined prediction probability would be , resulting in a final prediction of the second class. Soft voting is recommended if the classifiers are well-calibrated. Since the voting ensemblers in DiwE are selected via max-RDD, we want to leverage the advantage of soft voting to reflect their disagreement on the combined result. Thus, we chose soft majority voting as our final prediction strategy.
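The difference between the two voting strategies can be reproduced with a short example. The probability values below are assumptions chosen so that the two strategies disagree; they are not the figures from the original example, and the function names are hypothetical.

```python
import numpy as np

def hard_vote(probas):
    """Each classifier votes for its most probable class; the majority wins."""
    votes = np.argmax(probas, axis=1)
    return int(np.bincount(votes, minlength=probas.shape[1]).argmax())

def soft_vote(probas):
    """Average the class probabilities, then pick the most probable class."""
    return int(np.mean(probas, axis=0).argmax())

# Rows are classifiers, columns are classes (illustrative values only).
probas = np.array([
    [0.6, 0.4],
    [0.6, 0.4],
    [0.1, 0.9],
])
print(hard_vote(probas), soft_vote(probas))
```

Two weakly confident classifiers outvote one strongly confident classifier under hard voting, whereas soft voting lets the confident minority prevail, which is the property used to reflect disagreement among the selected ensemblers.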
Recall that the fundamental idea of a max-RDD ensemble-voted classifier is to select the most controversial drift detection region sets for ensemble learning. The ensemble determines whether two region sets are inconsistent according to an RDD index. If two region sets have no argument about the drift detection result, it is enough to preserve only one. However, if one detects a drift and the other does not, both need to be preserved for classification to boost the diversity. The soft majority vote will balance the final classification result, which is
where is the class label, and is the classification probability given by the base learner trained on the core instance set . The structure of max-RDD DiwE is illustrated in Fig. 2.
It is worth mentioning that different base classifiers calculate the classification probability differently. Here, we present its formulation in terms of the IBk classifier. The classification probability of each IBk base learner is calculated as follows:
The algorithm parses the entire time window, computing the distance between the query instance $x$ and each training observation. The $k$ points in the training data that are closest to $x$ are denoted as the set $\mathcal{N}_k(x)$.
Then the conditional probability for each class is estimated, i.e., the fraction of the number of points in $\mathcal{N}_k(x)$ with that given class label. Binary classification problems are given a class label $y \in \{0, 1\}$, and
$$P(y = j \mid x) = \frac{1}{k} \sum_{(X_i, y_i) \in \mathcal{N}_k(x)} I(y_i = j),$$
where $(X_i, y_i)$ is a feature vector paired with a class label, $j \in \{0, 1\}$. $I(\cdot)$ is an indicator function: if $y_i = j$ then $I(y_i = j) = 1$, otherwise $I(y_i = j) = 0$. Therefore, the probability vector is $\big(P(y = 0 \mid x), P(y = 1 \mid x)\big)$. The inverse-distance-weighted IBk uses the distance between samples as the weight to adjust the probability vector. The prediction probabilities are calculated differently for different base classifiers. However, considering drift detection occurs at the instance level, we recommend using the IBk classifier as the default.
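The nearest-neighbor probability estimate can be sketched as follows. This is a plain NumPy illustration of the general technique, not the IBk implementation itself: the function name `knn_proba` is hypothetical, and the small constant added to the distances is an assumption to avoid division by zero.

```python
import numpy as np

def knn_proba(x, X_train, y_train, k, n_classes, inverse_distance=False):
    """Class-probability vector for x from its k nearest training points.

    Unweighted: the fraction of neighbours carrying each label.
    Inverse-distance: each neighbour votes with weight 1 / distance."""
    dist = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dist, kind="stable")[:k]
    if inverse_distance:
        weights = 1.0 / (dist[nearest] + 1e-12)
    else:
        weights = np.ones(k)
    proba = np.bincount(y_train[nearest], weights=weights, minlength=n_classes)
    return proba / proba.sum()

X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([0, 0, 1, 1])
x = np.array([1.8])

print(knn_proba(x, X_train, y_train, k=3, n_classes=2))
print(knn_proba(x, X_train, y_train, k=3, n_classes=2, inverse_distance=True))
```

In this toy case the unweighted estimate favours class 0 while the inverse-distance estimate favours class 1, because the single closest neighbour dominates once distances are taken into account.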
III-D The implementation of the DiwE algorithm
III-D1 The initialization of a region set
To initialize a region set, we need a training dataset and a region set parameter . The size of the training set is denoted as . If the training set is not large enough, that is , as discussed in Section III-A2, a very large value is assigned to the radius. Therefore, the next data instance to arrive will definitely be located in this region. The region updating process is then triggered during which the radius is recalculated. This process continues until there are enough data instances.
In the worst case of Algorithm 1, the runtime complexity of each -nearest neighbor search is , where denotes the dimensionality. Given a fixed-size training set of , the worst-case runtime complexity is .
III-D2 Max-RDD diversity ensembler selection
The intuition behind Max-RDD diversity is to quantitatively measure the disagreement between ensemblers about whether region drift exists, and only select the ensemblers that do not reach a consensus for the final prediction. As such, region drift is empirically estimated at the instance level.
In Algorithm 2, the inputs are the set of region sets , and the maximum number of ensemblers for ensemble learning is . The RDD index is calculated in Line 9. The intersection and union runtime complexity are both , where denotes the cardinality of the region set. The RDD index between two given region sets has a runtime complexity of Hence, the complexity of calculating the RDD index for all pairs of region sets is , . To iterate over all the possible combinations of the region set parameters with a given voting ensemble size , we have options, where stands for all the possible combinations of choosing out of region sets. The cardinality of is calculated by the combination function . In this case, the cardinality is , so the runtime complexity is . The overall complexity of Max-RDD is , where is the stored region set size, and according to the buffer size constraints. The complexity of Max-RDD is independent of the size of the dataset.
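The size of the search space can be made concrete with the default setting mentioned in the text, namely 20 region sets from which 10 are selected. The following sketch only counts candidates; the function names are hypothetical.

```python
from math import comb

def n_candidate_subsets(n_region_sets, m):
    """Number of ways to choose m voting ensemblers out of n region sets."""
    return comb(n_region_sets, m)

def n_pairwise_rdd(n_sets):
    """Number of pairwise RDD indices among n_sets region sets."""
    return comb(n_sets, 2)

print(n_candidate_subsets(20, 10))  # candidate voting subsets
print(n_pairwise_rdd(20))           # RDD indices if all pairs are cached once
print(n_pairwise_rdd(10))           # pairs averaged per candidate subset
```

Since the pairwise indices can be computed once and reused across candidate subsets, the combinatorial term comes from the subset enumeration rather than from the set operations, which is why a larger grid of region sets raises the cost so quickly.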
III-D3 Diverse instance weighting ensemble
The idea behind DiwE is to use a buffer to store the regions that are most relevant to the current concept and to update the learning models regularly according to the core instances in the stored regions, i.e., . We set so that by default, which means there are 20 different values. Then we selected 10 as representatives, to vote on the final prediction result using the Max-RDD ensembler selection algorithm. The number of representatives selected is determined by a voting ensemble size parameter, denoted as . In general, we manipulated 20 ensemblers at all times and dynamically chose 10 of them for voting. The configuration of should be selected according to the available computational resources. The larger the size, the more different region sets DiwE can investigate. The downside is that a larger will increase the algorithm complexity in a combinatorial manner. Therefore, we recommend determining the size of according to the computational resources, then fill in the values by a grid searching from to . Considering DiwE detect drift at the instance level, we recommend using the IBk classifier as the default. The input parameter indicates the maximum number of regions that are allowed to be stored in a region set. In common practice, the default setting of is [37, 40].
In Algorithm 3, the system is initialized on the training data in Lines 1-3. Line 4 starts processing the streaming data. Line 5 selects the ensemblers according to max-RDD diversity. The base learners are built in Lines 6-8, and Line 9 is the soft majority vote for the new data instance following Eq. (10). Lines 11-16 track region drift and update instance weight according to Eq. (6). Line 17 is where the new regions are constructed as new data becomes available. Similar to the region initialization algorithm, we find the -th nearest neighbor of , and build the region with . Lines 18-20 ensure the buffer size does not exceed the maximum limit by removing the most likely drift regions. Lines 22 and 23 detect the end of the stream and output the prediction/classification results.
|Lines of Algorithm 3||Step|
|Line 2||Region set construction|
|Lines 6-8||IBk classifier ensemble|
|Lines 11-16||Weight updating|
|Line 17||Creating a new region|
|Lines 18-20||Least important instance removal|
In terms of complexity, the worst case for the region set construction (Lines 1-3) is . The complexity of Max-RDD is in Line 5 is . The training complexity of the IBk classifier is because IBk classifier does not require a training process. The complexity of the for-loop between Line 6 and 8 is . The softMajorityVote complexity of IBk classifiers in Line 9 is . The complexity of calculating the distance between and , then compare the distance with to update the weights is . Updating the weights for a region set (Lines 11-16) is . In Line 17, the complexity for creating a new region after computing the distance to all data instances in the region is , given the calculations in Lines 11-16. Removing the least important instance in Lines 18-20 is based on a minimum value search iteration of the buffer. So, the overall complexity for Lines 10-21 is Extending this to Lines 1-22, we have runtime complexity of . The details are summarised in Table I
Simplified, this is
where is the region set initialization algorithm, is the Max-RDD algorithm, and is the drift adaptation process for the IBk classifiers. Note that the runtime complexity of Algorithm 3 is controlled by the input parameters, , , , and . If all parameters are set to their default values, then the overall complexity is , where is the buffer size, and the worst case is . Therefore, the complexity is , which is similar to that of most stream learning algorithms.
III-D4 A scalability analysis and the data pre-processing requirements for DiwE
Scalability is a system’s capacity for handling a growing amount of work by adding resources to the system, and is an important property in data stream learning algorithms. Resources fall into two broad categories: horizontal and vertical. From a data perspective, horizontal resources are the number of features (), and vertical resources are the number of training/testing samples (). A common way to increase the vertical scalability of an algorithm is sub-sampling - that is, using bagging or boosting algorithms to select a relatively small subset of samples to build the learning model. In time series and data stream mining tasks, variable time windowing strategies are an alternative approach [13, 39]. In terms of horizontal scalability, dimension reduction is the most popular way to reduce the number of random variables under consideration. Many feature selection and feature projection techniques have been developed to address scalability, such as principal component analysis and auto-encoders. These techniques are usually used as a pre-processing step followed by clustering using k-nearest neighbor on feature vectors in a reduced-dimension space. In machine learning, this process is sometimes called low-dimensional embedding.
In DiwE, the runtime complexity is closely related to the algorithm parameters, , , , and . DiwE can fit large datasets by adjusting these parameters to suit the available system capacity - for example, by reducing the window size to control memory costs. However, directly applying DiwE to data with high dimensionality is risky, because it may cause memory overflows and significantly increase runtime complexity. Hence, we recommend applying a dimension reduction process before applying DiwE to high-dimension data. Which dimension reduction technique is best to use depends on the dataset and learning task at hand.
Another important issue that may affect DiwE’s performance is the chosen distance metric. Euclidean distance may not be effective when dealing with data that has many nominal attributes. Therefore, data pre-processing is essential, and feature normalization with one-hot encoding is recommended in most cases.
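The pre-processing recommended above can be sketched as follows. This is a minimal illustration only, not the pre-processing pipeline used in the experiments: the helper names `fit_preprocessor`, `transform`, and `euclidean` are hypothetical, numeric features are min-max scaled, and nominal features are one-hot encoded so that Euclidean distance treats every category as equally far from every other.

```python
from typing import Dict, List, Sequence, Union

Value = Union[float, str]


def fit_preprocessor(rows: Sequence[Sequence[Value]], nominal: Sequence[int]) -> Dict[int, object]:
    """Collect (min, max) for numeric columns and the category list for nominal ones."""
    stats: Dict[int, object] = {}
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        if j in nominal:
            stats[j] = sorted(set(column))         # fixed category order
        else:
            stats[j] = (min(column), max(column))  # range used for min-max scaling
    return stats


def transform(row: Sequence[Value], stats: Dict[int, object], nominal: Sequence[int]) -> List[float]:
    """Scale numeric values to [0, 1] and expand nominal values into indicator features."""
    out: List[float] = []
    for j, value in enumerate(row):
        if j in nominal:
            out.extend(1.0 if value == category else 0.0 for category in stats[j])
        else:
            lo, hi = stats[j]
            out.append(0.0 if hi == lo else (value - lo) / (hi - lo))
    return out


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain Euclidean distance between two encoded feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Hypothetical toy data: one numeric feature and one nominal feature.
rows = [[20000.0, "car"], [60000.0, "house"], [100000.0, "car"]]
stats = fit_preprocessor(rows, nominal=[1])
encoded = [transform(row, stats, nominal=[1]) for row in rows]
```

In a streaming setting the minimum and maximum would have to be estimated from the initial training data or updated online, so later values can fall outside [0, 1]; the sketch does not handle that case.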
IV Experiments and Evaluation
This section contains the evaluations of the proposed DiwE algorithm on both synthetic and real-world datasets. In Section IV-A, we demonstrate how a single DiwE member incrementally adapts to concept drift. In Section IV-B, we outline the ten synthetic datasets with both drifting and non-drifting streams that were used to compare accuracy. Section IV-C includes seven real-world benchmark datasets, and an evaluation of the Max-RDD ensembler selection. Performance was measured as accuracy, and all the results were evaluated by a prequential, basic classification performance evaluator.
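For readers unfamiliar with prequential evaluation, the test-then-train loop can be sketched as below. This is an illustrative stand-in, not the MOA evaluator used in the experiments; `MajorityClass` is a hypothetical placeholder learner chosen only to keep the example self-contained.

```python
from collections import Counter
from typing import Hashable, Iterable, Optional, Sequence, Tuple


class MajorityClass:
    """Placeholder incremental learner: predicts the most frequent label seen so far."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def predict(self, x: Sequence[float]) -> Optional[Hashable]:
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, x: Sequence[float], y: Hashable) -> None:
        self.counts[y] += 1


def prequential_accuracy(stream: Iterable[Tuple[Sequence[float], Hashable]], model) -> float:
    """Each instance is first used for testing and only afterwards for training."""
    correct = total = 0
    for x, y in stream:
        correct += int(model.predict(x) == y)  # test first ...
        model.learn(x, y)                      # ... then train on the same instance
        total += 1
    return correct / total if total else 0.0


# Hypothetical toy stream: six instances with labels 0, 0, 1, 0, 0, 0.
toy_stream = [((float(i),), label) for i, label in enumerate([0, 0, 1, 0, 0, 0])]
toy_accuracy = prequential_accuracy(toy_stream, MajorityClass())
```

Because every instance is predicted before its label is revealed, the accuracy reflects how the model performs on data it has not yet seen, which is what matters under drift.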
IV-A An evaluation of DiwE members on drift instance removal
We first assessed how well a single DiwE member maintains its region set. This experiment was designed to evaluate whether the buffer size changes with concept drift, and whether the reserved core data instances in each region convey information about the most recent concept. To illustrate how the adaptation works, we used sliding windows with the same buffer limitation as a contrast. We also applied the Kolmogorov-Smirnov two-sample test (KS test) as a baseline to illustrate the difference between DiwE and the conventional concept drift retrain procedure.
Experiment 1. (Evaluation of a single DiwE member on drift instance removal.) The datasets were generated based on the three different distributions given in Table II. One data instance was independently generated for each time point according to the current distribution. To simulate sudden and incremental drift, the data distributions were incrementally changed for and suddenly changed at . To maintain the KS test data buffer, we applied the most commonly used drift adaptation strategy [22, 24], that is, building a new buffer at a specified warning level and replacing the old buffer at a specified drift level. The warning level was set as , and the drift level was set to .
|Drift type||Time slot||distribution|
|sudden region drift|
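The KS-test baseline with a warning level and a drift level can be sketched as follows. This is a rough illustration under several assumptions, not the exact baseline used in Experiment 1: the data are treated as one-dimensional, the p-value uses the standard asymptotic Kolmogorov series, the class name `KSBufferAdapter` and the window size are hypothetical, and the default levels of 0.05 and 0.01 are placeholders rather than the values used in the experiment.

```python
import math
from bisect import bisect_right
from typing import List, Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    return max(
        abs(bisect_right(sa, v) / len(sa) - bisect_right(sb, v) / len(sb))
        for v in sa + sb
    )


def ks_pvalue(d: float, n: int, m: int) -> float:
    """Asymptotic two-sided p-value from the Kolmogorov series."""
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.3:  # the series converges slowly here and the p-value is ~1
        return 1.0
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2) for k in range(1, 101))
    return min(1.0, max(0.0, 2.0 * series))


class KSBufferAdapter:
    """Replace-and-retrain buffer: collect a new buffer on warning, swap it in on drift."""

    def __init__(self, reference: Sequence[float], window: int = 50,
                 warning_level: float = 0.05, drift_level: float = 0.01) -> None:
        self.reference: List[float] = list(reference)  # old buffer
        self.recent: List[float] = []                  # most recent observations
        self.candidate: List[float] = []               # new buffer built since the warning
        self.window = window
        self.warning_level = warning_level             # placeholder value
        self.drift_level = drift_level                 # placeholder value

    def update(self, x: float) -> str:
        self.recent.append(x)
        if len(self.recent) > self.window:
            # The oldest recent observation joins the reference buffer.
            self.reference.append(self.recent.pop(0))
        if len(self.recent) < self.window:
            return "stable"
        d = ks_statistic(self.reference, self.recent)
        p = ks_pvalue(d, len(self.reference), len(self.recent))
        if p < self.drift_level:
            # Drift confirmed: the old buffer is discarded entirely.
            self.reference = self.candidate + [x]
            self.candidate, self.recent = [], []
            return "drift"
        if p < self.warning_level:
            self.candidate.append(x)
            return "warning"
        self.candidate = []
        return "stable"
```

Feeding the adapter one observation at a time and recording `len(adapter.reference)` after each call gives a buffer-size trace; the full replacement on `"drift"` is what makes the buffer shrink abruptly in this kind of baseline.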
Findings and Discussion. The experimental results are shown in Fig. 3, in a similar format to . In general, both the KS test and the DiwE member were able to take corrective actions no matter what type of drift occurred. However, we can see from the buffer size that the conventional replace and retrain method discarded all historical data after confirming a sudden drift, even though some of that data might still have been useful. DiwE, in contrast, was able to trim irrelevant information from the buffer while retaining historical data that conformed to the current distribution. In addition, the KS test triggered more than one true positive alarm during the incremental drift, which is correct from a drift detection perspective. However, the available training data in the buffer was reduced more than drift adaptation requires. Compared to the sliding window strategy, DiwE is more sensitive to drift and can preserve the data instances that convey information about the most recent concept, as shown in the buffer snapshot at different time points. Another interesting result shown in this experiment is that the KS test did not trigger a warning level but rather triggered a drift level directly on the incremental drift. This phenomenon inspired us to reconsider incremental drift as a series of sudden drifts. Notably, the warning level threshold of may not always be the best option.
IV-B An evaluation of DiwE on synthetic concept drift datasets
In Experiment 2, we evaluated DiwE on ten synthetic datasets and compared it with eight state-of-the-art concept drift detection-adaptation algorithms.
Experiment 2. (Evaluation of DiwE on synthetic datasets) Synthetic datasets are good for generating and testing performance with specific and/or varied drift behaviors [61, 25]. In this experiment, we applied seven data stream generators based on MOA  with common parameterization [27, 60, 19]. Table III shows the main characteristics of the datasets. The selected algorithms were ADWIN-ARF , , OnlineAUE , Learn++NSE , SAMkNN , IBLStream , and NN- , all of which are online mode classifiers. We ran the experiment using the MOA software framework to allow for easy reproducibility. Since different base classifiers may affect the results , the base classifiers for , Learn++NSE, SAMkNN, IBLStream, and NN- were set as IBk, with a window size equal to 1000 and . Neighbors were weighted by the inverse of their distance. ADWIN-ARF and OnlineAUE were only available with Hoeffding decision tree as the base classifiers. These two algorithms were selected because they are two benchmark ensemble algorithms for drift adaptation.
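The shared base-classifier setting (a windowed nearest-neighbour learner with inverse-distance weighting) can be sketched as follows. This is an illustrative approximation, not the MOA or WEKA IBk implementation: the class name `WindowedKNN` is hypothetical, the default `k=3` is an arbitrary placeholder, and the small constant added to the distance is only there to avoid division by zero.

```python
import math
from collections import defaultdict, deque
from typing import Deque, Dict, Hashable, Optional, Sequence, Tuple


class WindowedKNN:
    """k-nearest-neighbour classifier over a sliding window of labelled instances."""

    def __init__(self, k: int = 3, window: int = 1000) -> None:
        self.k = k  # placeholder default, not the value used in the experiments
        # deque(maxlen=...) silently evicts the oldest instance once the window is full.
        self.buffer: Deque[Tuple[Tuple[float, ...], Hashable]] = deque(maxlen=window)

    def learn(self, x: Sequence[float], y: Hashable) -> None:
        self.buffer.append((tuple(x), y))

    def predict(self, x: Sequence[float]) -> Optional[Hashable]:
        if not self.buffer:
            return None
        scored = sorted(
            ((math.dist(x, xi), yi) for xi, yi in self.buffer),
            key=lambda pair: pair[0],
        )
        votes: Dict[Hashable, float] = defaultdict(float)
        for distance, label in scored[: self.k]:
            votes[label] += 1.0 / (distance + 1e-9)  # closer neighbours count more
        return max(votes, key=votes.get)


# Hypothetical toy usage with a window of four instances.
knn = WindowedKNN(k=3, window=4)
for point, label in [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((5, 6), "b")]:
    knn.learn(point, label)
first_prediction = knn.predict((0.2, 0.2))
knn.learn((6, 6), "b")                  # evicts the oldest instance, (0, 0)
second_prediction = knn.predict((0, 0))
```

In the second query two of the three nearest neighbours are labelled "b", yet the single, much closer "a" neighbour carries more weight, which is the effect inverse-distance weighting is meant to have.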
Similar to , ten synthetic datasets were generated for evaluation: SEA sudden, gradual drift, Hyperplane incremental drift, LED sudden, gradual drift, AGR sudden, gradual drift, RTG no-drift, RBF global, and region drift. The characteristics of these datasets are summarized in Table III.
|Dataset||Drift Type||#Instances||#Attributes||# Class|
The SEA generator  produces data streams with three continuous attributes, and . An inequality determines the label of each data instance, , where is a threshold to control the label boundary. The entire data stream was divided into four subsets with different data distributions (“Concepts”) of equal size, and was 8, 9, 7, and 9.5, respectively. This evaluation method has been widely used in sudden drift detection and adaptation [39, 60, 27, 59]. There were 10,000 data instances at a noise ratio of 10%. To simulate gradual drift, Concepts 1 and 2 were changed every 50 data instances from to , that is, the data for was generated based on the new concept, while the data for was generated based on the old concept up to .
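A sudden-drift SEA stream of this kind can be sketched as below. The sketch follows the usual SEA definition, which is an assumption here because the exact inequality is not reproduced in this section: attributes are drawn uniformly from [0, 10], only the first two are relevant, and an instance is labelled 1 when their sum does not exceed the current threshold. The function name `sea_stream` and the seed are hypothetical, and the gradual-drift variant is not covered.

```python
import random
from typing import Iterator, List, Tuple

THRESHOLDS = (8.0, 9.0, 7.0, 9.5)  # one threshold per concept, as listed in the text


def sea_stream(n_instances: int = 10_000, noise: float = 0.10,
               seed: int = 1) -> Iterator[Tuple[List[float], int, float]]:
    """Yield (attributes, label, active threshold) with four equal-sized concepts."""
    rng = random.Random(seed)
    block = n_instances // len(THRESHOLDS)
    for t in range(n_instances):
        theta = THRESHOLDS[min(t // block, len(THRESHOLDS) - 1)]
        x = [rng.uniform(0.0, 10.0) for _ in range(3)]  # third attribute is irrelevant
        label = int(x[0] + x[1] <= theta)               # assumed SEA labelling rule
        if rng.random() < noise:                        # class noise flips the label
            label = 1 - label
        yield x, label, theta


# Noise-free toy stream, so every label follows the rule exactly.
sample = list(sea_stream(n_instances=400, noise=0.0, seed=0))
```

With 10,000 instances the concept changes at instances 2,500, 5,000, and 7,500, which is where a drift-aware learner is expected to adapt.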
The rotating Hyperplane generator  produces data streams with ten continuous attributes, and . The label boundary for classification was determined by , where is the number of features related to drift, and are weights that are randomly initialized in the range of . Incrementally changing the threshold produces a rotating hyperplane label boundary, thereby generating incremental concept drifts. In this experiment we set , that is, only the first two features had incremental drifts. Again, there were 10,000 data instances, and the noise ratio was set to .
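One way to picture a rotating hyperplane stream is the sketch below. It is a loose illustration under explicit assumptions, not the MOA generator configuration used in the experiment: attributes are drawn uniformly from [0, 1], an instance is labelled 1 when the weighted sum reaches half the total weight, and drift is produced by nudging the weights of the first `n_drift` features by a fixed `step` after every instance. The function name, the step size, the noise default, and the seed are all hypothetical.

```python
import random
from typing import Iterator, List, Tuple


def hyperplane_stream(n_instances: int = 10_000, n_features: int = 10, n_drift: int = 2,
                      step: float = 0.001, noise: float = 0.05,
                      seed: int = 1) -> Iterator[Tuple[List[float], int, List[float]]]:
    """Yield (attributes, label, weights in force) from a slowly rotating hyperplane."""
    rng = random.Random(seed)
    weights = [rng.random() for _ in range(n_features)]
    for _ in range(n_instances):
        x = [rng.random() for _ in range(n_features)]
        threshold = 0.5 * sum(weights)  # assumed boundary: half of the total weight
        label = int(sum(w * v for w, v in zip(weights, x)) >= threshold)
        if rng.random() < noise:        # class noise flips the label
            label = 1 - label
        yield x, label, list(weights)
        for i in range(n_drift):        # only the drifting features change
            weights[i] += step


# Noise-free toy stream to inspect how the boundary moves.
items = list(hyperplane_stream(n_instances=50, noise=0.0, seed=3))
```

Because only the first two weights move, the boundary rotates within the subspace of those features while the remaining eight stay fixed.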
The LED  generator creates instances with 24 Boolean features, but only seven features determine the class labels. The configurations to simulate four different concepts were as follows: the first three features were swapped for ; the first five features were swapped for ; and the first seven features were swapped for . The gradual drift configuration was the same as the SEA gradual drift.
The AGRAWAL  generator creates instances with six nominal and three continuous attributes. Ten functions are available to map instances into two classes. We used the first four functions in MOA to simulate four concepts of equal length. The same gradual drift configuration was applied to AGRg.
The Random Tree Generator (RTG) randomly builds a decision tree and randomly assigns a class label to each leaf node, after which the data is uniformly distributed to the leaf nodes. For this dataset, we applied the MOA default setting to create a non-drifting dataset.
The RBF generator creates data instances using a radial basis function. It creates centroids at random positions and associates them with a standard deviation value, a weight, and a class label. Incremental drifts are simulated by continuously moving the centroids. Both RBF and RBFr were parameterized with 50 centroids with a speed of change equal to 0.001. For the RBF incremental drift, 50/50 centroids are drifting, and for the RBF incremental region drift, 10/50 are drifting.
The evaluation results were calculated based on 50 runs of each dataset. The average accuracy and standard deviation of accuracy are given in Table IV.
Findings and Discussion.
The results show that DiwE reached an average rank of 2.0 on the evaluated datasets, the best of all the algorithms. We conducted a Friedman test to determine whether the difference in results was significant and found a significant difference at . From a further investigation of the difference between each pair with the Nemenyi post-hoc test, we found that only the difference between DiwE and Learn++NSE was significant (). All other pairs had a significance level above .
Overall, the results show that DiwE was the most accurate on most datasets, with the exception of AGRa and AGRg. This might be due to the distance metric used for constructing the regions, which in this case was Euclidean distance. Euclidean distance performed well on the normalized numerical datasets, but appeared to have difficulties with the datasets containing nominal attributes. We therefore recommend choosing the distance metric carefully according to the feature type in the dataset(s).
|AvgRank||2.0 (1)||4.0 (2)||4.1 (3)||4.4 (4)||4.9 (5)||5.0 (6.5)||5.0 (6.5)||6.6 (8)|
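The rank-based comparison can be approximated from the average ranks in the final row of Table IV. The sketch below is a back-of-the-envelope check, not the authors' analysis: it assumes a significance level of 0.05, uses the rounded average ranks without a tie correction, and takes 3.031 as the standard tabulated Nemenyi critical value for eight algorithms at that level.

```python
import math
from typing import List, Tuple

avg_ranks = [2.0, 4.0, 4.1, 4.4, 4.9, 5.0, 5.0, 6.6]  # final row of Table IV
n_datasets = 10                                        # ten synthetic datasets
Q_ALPHA = 3.031  # assumed: tabulated Nemenyi critical value, 8 algorithms, alpha = 0.05


def friedman_chi2(ranks: List[float], n: int) -> float:
    """Friedman statistic computed from average ranks over n datasets."""
    k = len(ranks)
    return 12.0 * n / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4.0)


def nemenyi_cd(k: int, n: int, q_alpha: float) -> float:
    """Critical difference: two average ranks further apart than this differ significantly."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))


chi2 = friedman_chi2(avg_ranks, n_datasets)
cd = nemenyi_cd(len(avg_ranks), n_datasets, Q_ALPHA)

# Pairs of column indices whose average ranks differ by more than the critical difference.
significant_pairs: List[Tuple[int, int]] = [
    (i, j)
    for i in range(len(avg_ranks))
    for j in range(i + 1, len(avg_ranks))
    if abs(avg_ranks[i] - avg_ranks[j]) > cd
]
```

Under these assumptions the statistic is about 19.6 on seven degrees of freedom and the critical difference is about 3.32, so only the best-ranked and worst-ranked columns are separated by more than the critical difference.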
IV-C An evaluation of DiwE on real-world applications
In this set of experiments, we evaluated DiwE on real-world applications. Experiment 3 shows the buffer size of the ensembler using a region set parameter . Experiment 4 shows the effectiveness of maximum diversity ensembler selection by comparing it with random ensembler selection. Experiment 5 evaluates the robustness of DiwE with different parameter settings.
Experiment 3. (Evaluation of DiwE on seven real-world applications) To evaluate the ability of DiwE to address real-world problems, we compared it with the same algorithms as introduced in Section IV-B but with real-world datasets. As discussed in [25, 62, 9], execution time and memory cost are important in streaming data learning, so this information has been provided alongside the results. The characteristics of the datasets used are summarized in Table V. Tables VI, VII, and VIII show the performance of the tested algorithms, and Fig. 4 shows the changes in the size of the region set .
The Electricity dataset contains 45,312 instances, collected every 30 minutes from the Australian New South Wales Electricity Market between 7 May 1996 and 5 Dec 1998. In this market, prices are not fixed; rather, they are affected by supply and demand. This dataset contains eight features and two classes (up, down) and has been widely used to evaluate concept drift adaptation.
The Nebraska Weather prediction dataset was compiled by the US National Oceanic and Atmospheric Administration. It contains eight features and 18,159 instances with 31% positive (rain) classes, and 69% negative (no rain) classes. The dataset is summarized in  and is available at .
The Spam filtering dataset is a collection of 9324 email messages derived from the Spam Assassin collection and is available at http://spamassassin.apache.org/. The original dataset contains 39,916 features and 9324 emails. It is commonly considered to be a typical gradual drift dataset . According to Katakis  500 attributes can be retrieved using the Chi-square feature selection approach.
The Usenet1 and Usenet2 datasets are derived from Usenet posts in the 20 Newsgroup collection with simulated region drifts. The task is to classify messages as either interesting or junk as they arrive. Each dataset is split into five periods, and the data in each period covers different user interest topics. All data instances were concatenated to simulate sudden/reoccurring drift.
The Airline dataset consists of flight arrival and departure details for all commercial flights within the US from October 1987 to April 2008. The dataset was originally designed for regression problems as part of the Data Expo Competition, 2009. It was subsequently modified by the MOA team  for prediction analysis. Each data instance has seven features and two classes with 539,383 records in total.
The forest cover type (Covtype) dataset is designed to test predictions on the type of forest cover from a given observation as determined by the US Forest Service (USFS) Region 2 resource information system. Each instance is derived from data originally obtained from the US Geological Survey (USGS) and USFS data.
|Dataset||#Instances||#Attributes||#Class|
|Elec||45312||8||2 (up, down)|
|Weather||18159||8||2 (rain, no rain)|
|Spam||9324||500||2 (spam, legitimate)|
|Usenet1||1500||99||2 (interested, non-interested)|
|Usenet2||1500||99||2 (interested, non-interested)|
|Airline||539383||7||2 (delay, not delay)|
|Elec||83.84 (5)||88.17 (1)||87.39 (3)||84.35 (4)||82.78 (6)||87.74 (2)||77.05 (7)||69.67 (8)|
|Weather||80.20 (1)||78.74 (2)||74.63 (7)||74.97 (6)||77.73 (3)||75.24 (5)||75.69 (4)||73.24 (8)|
|Spam||96.69 (1)||95.60 (3)||93.47 (4)||91.08 (6)||95.79 (2)||84.29 (7)||92.78 (5)||72.54 (8)|
|usenet1||68.53 (1)||68.40 (2)||66.80 (4)||66.87 (3)||65.67 (5)||63.47 (6)||56.00 (7)||46.93 (8)|
|usenet2||73.20 (1)||71.93 (4)||72.47 (2)||72.27 (3)||71.00 (5)||68.87 (6)||67.67 (7)||65.67 (8)|
|Airline||78.55 (1)||65.24 (4)||64.55 (5)||66.06 (3)||60.35 (8)||67.51 (2)||63.74 (6)||63.04 (7)|
|Covtype||89.84 (6)||92.11 (3)||94.04 (1)||84.74 (7)||91.71 (4)||90.01 (5)||92.26 (2)||68.43 (8)|
|AveRank||2.29 (1)||2.71 (2)||3.71 (3)||4.57 (4)||4.71 (5.5)||4.71 (5.5)||5.43 (7)||7.86 (8)|
|AveRank||5.14 (5)||3.71 (4)||6.86 (7.5)||2.57 (2)||3.00 (3)||2.29 (1)||5.57 (6)||6.86 (7.5)|
|AveRank||7.29 (8)||4.14 (4)||2.57 (2)||2.00 (1)||4.57 (5)||2.86 (3)||6.00 (6)||6.57 (7)|
Findings and Discussion. From the accuracy and execution efficiency results in Tables VI, VII, and VIII, we conclude that different drift adaptation algorithms are suited to different applications; there is no perfect algorithm that can achieve the best performance for all datasets. While the average ranking only demonstrates the effectiveness of DiwE on the tested datasets, the results do provide strong evidence that DiwE performs as well as the other methods in the tested situations. More concretely, what the results show is that considering diversity in region drift disagreement is a suitable alternative method for ensemble learning to address concept drift.
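The average ranks in the accuracy table can be recomputed directly from the per-dataset accuracies, as in the sketch below. Columns are referred to by position because the algorithm names are not reproduced alongside the values here, and the helper `ranks_descending` is a hypothetical name; it does not handle ties, which do not occur in these rows.

```python
from typing import Dict, List

# Accuracy values copied from the accuracy table, one list per dataset (columns by position).
accuracy: Dict[str, List[float]] = {
    "Elec":    [83.84, 88.17, 87.39, 84.35, 82.78, 87.74, 77.05, 69.67],
    "Weather": [80.20, 78.74, 74.63, 74.97, 77.73, 75.24, 75.69, 73.24],
    "Spam":    [96.69, 95.60, 93.47, 91.08, 95.79, 84.29, 92.78, 72.54],
    "usenet1": [68.53, 68.40, 66.80, 66.87, 65.67, 63.47, 56.00, 46.93],
    "usenet2": [73.20, 71.93, 72.47, 72.27, 71.00, 68.87, 67.67, 65.67],
    "Airline": [78.55, 65.24, 64.55, 66.06, 60.35, 67.51, 63.74, 63.04],
    "Covtype": [89.84, 92.11, 94.04, 84.74, 91.71, 90.01, 92.26, 68.43],
}


def ranks_descending(values: List[float]) -> List[int]:
    """Rank 1 goes to the highest accuracy (no tie handling)."""
    order = sorted(range(len(values)), key=lambda j: values[j], reverse=True)
    ranks = [0] * len(values)
    for position, j in enumerate(order, start=1):
        ranks[j] = position
    return ranks


rank_rows = [ranks_descending(row) for row in accuracy.values()]
average_rank = [round(sum(column) / len(column), 2) for column in zip(*rank_rows)]
```

The first column ranks first on five of the seven datasets but fifth and sixth on the other two, which illustrates the point that no single algorithm is best everywhere.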
The memory cost of DiwE is higher than that of the other algorithms because several region sets need to be kept in memory. However, this issue could be mitigated with distributed computing. From the results, we observe that the Covtype dataset has more attributes and data instances than the Airline dataset. However, the execution times for DiwE, ADWIN-ARF, and NN- on Covtype were much shorter. From this, we surmise that the execution time of concept drift detection-adaptation algorithms might be related to the number of drifts in the dataset. Hence, differences in drift detection accuracy might result in different drift-adapt execution times. According to the complexity analysis of DiwE discussed in Section III-D, DiwE has complexity, where denotes the buffer size. As shown in Fig. 4, the average buffer size of the Airline dataset is much higher than that of the Covtype dataset, which accords with our conclusion. This phenomenon has inspired us to reconsider the balance between detection-adaptation performance and execution time.
|Max-RDD DiwE||Random DiwE||Max-RDD DiwE||Random DiwE|
|Elec||83.84||83.40±0.05 (0.44)||84.96||82.59±0.09 (2.37)|
|Weather||80.20||80.11±0.14 (0.09)||79.93||78.82±0.21 (1.11)|
|Spam||96.69||96.44±0.10 (0.25)||96.58||78.82±0.21 (0.98)|
|usenet1||68.53||67.92±0.72 (0.61)||67.20||64.34±0.62 (2.86)|
|usenet2||73.20||72.64±0.42 (0.56)||69.93||69.32±0.89 (0.61)|
|Airline||78.55||77.32±0.12 (1.23)||79.28||76.23±0.18 (3.05)|
|Covtype||89.84||89.42±0.04 (0.42)||89.36||88.42±0.03 (0.94)|
Experiment 4. (Evaluation of Max-RDD diversity ensembler selection.) To evaluate whether Max-RDD ensembler selection improves the overall classification results, we compared it with a random ensembler selection with the same range. The aim of the Max-RDD ensembler selection is to select the most controversial region sets for ensemble learning so that the ensembles can reach a high drift sensitivity without losing robustness. Given this assumption, Max-RDD should be able to highlight the ensemblers with the highest diversity, no matter what type of drift, with a limited ensembler size . Random ensembler selection does not have this property, which means Max-RDD should outperform random ensembler selection. To verify our assumption, we chose , and evaluated DiwE on the seven real-world datasets. For the random ensembler selection, we ran DiwE 50 times and calculated the mean and standard deviation. The results are shown in Table IX.
Findings and Discussion. According to the Friedman test, there is a significant difference () in classification accuracy between the Max-RDD and random ensembler selection methods, but there is no significant difference for Max-RDD with different values. The value in brackets indicates how much Max-RDD improved classification accuracy over the mean accuracy of the random method. From the results, we see that a smaller caused the random ensembler selection to become unstable, while Max-RDD maintained accuracy with no significant drops.
Experiment 5. (Evaluation of the selection of the voting ensemble size and the maximum window size .) The voting ensemble size and the maximum window size are two critical parameters that may affect DiwE’s performance. To evaluate how these parameters may influence prediction accuracy, we varied the settings of