A data stream is a potentially unbounded sequence of observations. As such, data streams are subject to a number of external factors, e.g. seasonal or catastrophic events. Hence, the distributions of a data stream are usually not stationary, but change over time, which is known as concept drift.
Concept drift can seriously affect the quality of predictions, if it goes unnoticed. Concept drift detection models help identify and handle distributional changes, allowing us to maintain a high predictive performance over time. Ideally, concept drift detection models are sensitive enough to detect drift with only a short delay. However, concept drift detection should also be robust against small perturbations of the input in order to avoid false positives and thus be reliable.
Let $X$ and $Y$ be random variables that correspond to the streaming observations and the associated labels. Following the common definition, concept drift corresponds to a difference in the joint probability $P_t(X, Y)$ at different time steps $t \neq u$, i.e. $P_t(X, Y) \neq P_u(X, Y)$.
We call $P_t(X, Y)$ the active concept at time step $t$. Moreover, we distinguish between real and virtual concept drift. Virtual concept drift describes a change in the input distribution $P_t(X)$, i.e. $P_t(X) \neq P_u(X)$. Hence, virtual concept drift is independent of the target distribution and does not change the decision boundary $P(Y|X)$. On the other hand, real concept drift, sometimes called concept shift, corresponds to a change in the conditional target distribution, i.e. $P_t(Y|X) \neq P_u(Y|X)$. Real concept drift shifts the decision boundary, which may influence subsequent predictions. It is therefore crucial to detect changes of $P_t(Y|X)$ in time to avoid dramatic drops in predictive performance. In this paper, we investigate the effective and robust identification of real concept drift.
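To make the distinction concrete, the following minimal numpy sketch (not part of the original paper; the distributions and the threshold rule are illustrative assumptions) simulates a virtual drift, where only $P(X)$ shifts, and a real drift, where the labeling rule flips. Only the real drift changes the empirical conditional $P(y = 1 \mid x > 0)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Concept 1: P(X) = N(0, 1), decision rule y = 1[x > 0]
x1 = rng.normal(0.0, 1.0, 10_000)
y1 = (x1 > 0).astype(int)

# Virtual drift: P(X) shifts to N(2, 1), but the rule P(Y|X) is unchanged
x2 = rng.normal(2.0, 1.0, 10_000)
y2 = (x2 > 0).astype(int)

# Real drift: P(X) unchanged, but the decision rule flips to y = 1[x < 0]
x3 = rng.normal(0.0, 1.0, 10_000)
y3 = (x3 < 0).astype(int)

def p_y1_given_x_pos(x, y):
    """Empirical estimate of P(y = 1 | x > 0)."""
    mask = x > 0
    return y[mask].mean()

print(p_y1_given_x_pos(x1, y1))  # 1.0 under concept 1
print(p_y1_given_x_pos(x2, y2))  # still 1.0: virtual drift leaves P(Y|X) intact
print(p_y1_given_x_pos(x3, y3))  # 0.0: real drift moved the decision boundary
```

A detector that only monitors $P(X)$ would flag the second stream but miss the third, which is precisely the case that matters for predictive performance.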
Unfortunately, concept drift does not follow a clear pattern in practice. Instead, we might observe large differences in the duration and magnitude of concept drift. To this end, we distinguish between different types of concept drift [26, 10, 29]: Sudden drift describes an abrupt change from one concept to another. Incremental drift is a steady transition of concepts over some time period. In a gradual drift, the concepts alternate temporarily, until a new concept ultimately replaces the old one. Sometimes we also observe mixtures of different concept drift types and recurring or cyclic concepts. For further information, we refer the interested reader to the cited surveys. In general, concept drift detection models should allow timely and accurate detection of all types of concept drift.
In a data stream, we can only access a fraction of the data at every time step $t$. To detect real concept drift, we thus need to approximate $P_t(Y|X)$ by using a predictive model. Accordingly, we get $P_t(Y|X) \approx f(X; \theta_t)$, with model parameters $\theta_t$. We optimize the model parameters, given the new observations in every time step. Consequently, $\theta_t$ represents our most current information about the active concept at time step $t$. A concept drift detection model should therefore adhere to changes of the model parameters through the following two properties:
Model-Aware Concept Drift Detection. Let $\theta_t$ and $\theta_u$ be the parameters of a predictive model at two time steps $t$ and $u$. Let further $D$ be a statistical divergence measure (e.g., Kullback–Leibler, Jensen–Shannon, etc.). Concept drift detection is model-aware if, for a detected drift between any two time steps $t$ and $u$, we observe $D(\theta_t, \theta_u) > 0$.
Accordingly, we associate concept drift with updates of the predictive model $f$. Given that $f$ is robust, model-awareness reduces the sensitivity of a concept drift detection scheme to random input perturbations, which in turn reduces the risk of false alarms.
Explainable Concept Drift Detection. Concept drift detection at time step $t$ is explainable with respect to the predictive model $f$, if the concept drift can be associated with individual model parameters, i.e. each dimension of $\theta_t$.
If we associate concept drift with individual parameters, we can make more targeted model updates. Hence, we may avoid unnecessary and costly adaptations of the predictive model. Moreover, some parameter distributions even allow us to relate concept drift to specific input features. In this way, concept drift becomes much more transparent.
In this paper, we propose a novel framework for the Effective and Robust Identification of Concept Shift (ERICS). ERICS complies with Properties 1 and 2. We build on an established probabilistic framework to model the distribution of the parameters at every time step. Specifically, we express real concept drift in terms of the marginal likelihood $P(Y|X; \psi)$ and the parameter distribution $P(\theta; \psi)$, which is itself parameterized by $\psi$. Unlike many existing models, ERICS does not need to access the streaming data directly. Instead, we detect concept drift by investigating the differential entropy and Kullback–Leibler (KL) divergence of $P(\theta; \psi)$ at different time steps. In this context, we show that concept drift corresponds to changes in the distributional uncertainty of the model parameters. In other words, real concept drift can be measured as a change in the average number of bits required to encode the parameters of the predictive model. By specifying an adequate parameter distribution, we can identify concept drift at the input level, which offers a significant advantage over existing approaches in terms of explainability. In fact, the proposed framework can be applied to almost any parameter distribution and online predictive model. For illustration, we apply ERICS to a Probit model. In experiments on both synthetic and real-world data sets, we show that the proposed framework can detect different types of concept drift, while having a lower average delay than state-of-the-art methods. Indeed, ERICS outperforms existing approaches with respect to the recall and precision of concept drift alerts.
II ERICS: A Concept Drift Detection Framework
Real concept drift corresponds to a change of the conditional target distribution $P_t(Y|X)$. However, data streams are potentially infinite and so the true distribution remains unknown. Hence, we may use a predictive model $f$ with parameters $\theta_t$ to approximate $P_t(Y|X)$. Since we update the model parameters for every new observation, $\theta_t$ represents our most current information about the active concept at time step $t$. Consequently, we may identify concept drift by investigating changes in $\theta_t$ over time.
To this end, we adopt a general probabilistic framework and treat the parameters as a random variable, i.e. $\theta \sim P(\theta; \psi)$. Analogously, we optimize the distribution parameters $\psi_t$ at every time step with respect to the log-likelihood. This optimization problem can be expressed in terms of the marginal likelihood $P(Y|X; \psi)$. Hence, the marginal likelihood relates to the optimal parameter distribution under the active concept. Accordingly, we may associate concept drift between two time steps $t$ and $u$ with a difference of the marginal likelihood for the distribution parameters $\psi_t$ and $\psi_u$:
We can express this difference in terms of the differential entropy and KL-divergence, which are common measures from information theory. The entropy of a random variable corresponds to the average degree of uncertainty of the possible outcomes. Besides, entropy is often described as the average number of bits required to encode a sample of the distribution. On the other hand, the KL-divergence measures the difference between two probability distributions. It is frequently applied in Bayesian inference models, where it describes the information gained by updating from a prior to a posterior distribution. We can derive the following proportionality:
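The proportionality rests on the standard cross-entropy decomposition; a hedged reconstruction in the notation used above ($P(\theta; \psi_t)$ for the parameter distribution, $h$ for differential entropy) reads:

```latex
H\big(P(\theta;\psi_t),\,P(\theta;\psi_u)\big)
  \;=\; h\big(P(\theta;\psi_t)\big)
  \;+\; D_{\mathrm{KL}}\big(P(\theta;\psi_t)\,\Vert\,P(\theta;\psi_u)\big)
```

That is, the expected code length for parameters drawn at time step $t$ but encoded under the distribution at time step $u$ splits into the intrinsic uncertainty at $t$ plus the divergence between the two distributions.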
where $h(\psi_t)$ is the differential entropy of the parameter distribution at time step $t$. Note that we have rephrased the cross-entropy by using the KL-divergence. We may now substitute this proportionality into the drift condition above to derive a general scheme for concept drift detection:
Intuitively, real concept drift thus corresponds to a change in the uncertainty of the optimal parameters and a divergence of the parameter distribution. On the other hand, stable concepts are characterized by a static parameter distribution and uncertainty.
Note that (3) has another interpretation in the context of Bayesian inference. As mentioned before, the KL-divergence can be interpreted as the information gained from inferring the posterior from a prior. According to (3), we thus find that every difference in parameter uncertainty (entropy) between time steps $t$ and $u$ which cannot be attributed to the inference of posterior parameters may be traced back to a concept drift.
By construction, any parametric distribution used in Equation (3) can be evaluated for each parameter individually, i.e. we can compute the entropy and KL-divergence terms per dimension of $\theta$. ∎
II-A Continuous Concept Drift Detection
Based on the general scheme (3), we are able to identify concept drift between any two time steps $t$ and $u$. In practice, we are mainly interested in concept drifts between successive time steps $t-1$ and $t$. However, if we were to study (3) for two time steps only, our concept drift detection model might become too sensitive to random variations of the predictive model. To be more robust, we examine the moving average of (3) instead. Specifically, we compute the moving average at time step $t$ over the $M$ most recent time steps as
As before, the moving average contains our latest information on the model parameters and the active concept. We can adjust the sensitivity of our framework by selecting $M$ appropriately. In general, the larger we select $M$, the more robust the framework becomes. However, a large $M$ might also hide concept drifts of small magnitude or short duration.
So far, we have treated all changes of the parameter distribution as an indication of concept drift. Indeed, this is in line with the general definition of concept drift. Still, we argue that only certain changes in the parameter distribution have practical relevance. For example, suppose that we use stochastic gradient descent (SGD) to optimize the model parameters at every time step. If we start from an arbitrary initialization, the distribution of optimal parameters usually changes significantly in early training iterations. However, given that the concept is stationary, SGD will almost surely converge to a local optimum. Consequently, we will ultimately minimize the entropy and KL-divergence of $P(\theta; \psi)$ in successive time steps. In other words, the moving average (4) will tend to decrease as long as we optimize the parameters with respect to the active concept. However, if the decision boundary changes due to a real concept drift, SGD updates will aim for a different optimum. This change of the objective will temporarily lead to more uncertainty in the model and thus increase the entropy of the parameter distribution.
[Figure 4: Components of ERICS on the KDD data set for three different values of the hyperparameter $\beta$. Blue vertical lines mark four artificially generated sudden concept drifts; black vertical lines mark time steps where ERICS detects a concept drift. By adjusting $\beta$, we can control the iterative updates of the adaptive threshold $\alpha$ and thus regulate the sensitivity of ERICS after a drift is detected: the larger we choose $\beta$, the more sensitive ERICS becomes to changes of the parameter distribution. In this example, small update steps (i.e. small $\beta$) are preferable to give the predictive model enough time to adapt to the new concept. The early alerts correspond to the initial training phase of the predictive model and would be ignored in practice.]
We exploit this temporal pattern for concept drift detection. To this end, we measure the total change of the moving average (4) in a sliding window of size $W$:
where $\alpha$ is an adaptive threshold. As before, we may control the robustness of the concept drift detection with the sliding window size $W$. Whenever we detect concept drift, i.e. (5) evaluates to true, we redefine $\alpha$ as
In this way, we temporarily tolerate all changes to the predictive model up to the magnitude given by (6). We consider these changes to be the after-effects of the concept drift. We then update $\alpha$ in an iterative fashion. Let $\beta$ be a user-defined hyperparameter in the interval $(0, 1)$. Each update depends on the current $\alpha$-value, the $\beta$-hyperparameter and the time elapsed since the last concept drift alert, which we denote by $t_\Delta$:
Note that $\alpha$ will asymptotically approach 0 over time, if there is no concept drift. In this way, we gradually reduce the tolerance of our framework after a drift is detected.
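The exact update rule (7) is not reproduced here; the sketch below only mirrors its stated properties (the new $\alpha$ depends on the current value, on $\beta \in (0, 1)$ and on the elapsed time $t_\Delta$, and it approaches 0 asymptotically). The exponential decay $\alpha \cdot (1 - \beta)^{t_\Delta}$ is our illustrative assumption, not the paper's formula:

```python
def updated_threshold(alpha_at_drift, beta, t_delta):
    """Hypothetical decay of the adaptive threshold alpha.

    Mirrors the stated properties only: depends on the alpha set at the
    last drift alert, on beta in (0, 1), and on the elapsed time t_delta,
    and tends to 0 as t_delta grows. The exact ERICS update may differ.
    """
    assert 0.0 < beta < 1.0
    return alpha_at_drift * (1.0 - beta) ** t_delta

alpha0 = 0.8                                             # tolerance right after a drift
print(updated_threshold(alpha0, beta=0.3, t_delta=1))    # 0.56
print(updated_threshold(alpha0, beta=0.3, t_delta=10))   # ~0.023, tolerance nearly gone
```

Under this form, a larger $\beta$ shrinks the tolerance faster, matching the observation that larger $\beta$ makes ERICS more sensitive after a detected drift.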
The choice of a suitable $\beta$ usually depends on the application at hand. By way of illustration, we applied ERICS with different $\beta$-values to the KDD data set. We used an established permutation method to induce sudden concept drift after every 20% of observations. For more information, see Section V. Figure 4 illustrates the components of ERICS for three different $\beta$-values. Notably, the larger we chose $\beta$, the more drifts we detected. Since we were dealing with a sudden concept drift in this particular example, we could be less sensitive and apply smaller update steps. With a small default $\beta$, we achieved good first results in all our experiments. Therefore, such a value can generally be used as a starting point for further optimization.
To conclude our general framework, we provide a pseudo code implementation in Figure 5.
II-B Limitations and Advantages
The proposed framework does not access streaming observations directly, but uses the parameters of a predictive model instead. Accordingly, our approach is much more memory efficient than many related works. Yet, if the parameter distribution does not change in a drift period, concept drift may go unnoticed. In general, however, ERICS can detect all concept drifts that affect the predictive outcome.
One should also be aware that some predictive models are prone to adversarial attacks. Accordingly, ERICS can only be as robust as its underlying predictive model. This sensitivity to the predictive model is shared by most existing works. With ERICS, the possibility of misuse is drastically reduced, as we closely monitor the distribution of the model parameters at all times.
III Illustrating ERICS
ERICS is model-agnostic. This means that the framework can be applied to different predictive models and parameter distributions. In this way, we enable maximum flexibility with regard to possible streaming applications. By way of illustration, we adopt a Probit model with independent normally distributed parameters. This setup has achieved state-of-the-art results in online feature selection. Besides, it offers dramatic computational advantages due to its low complexity. In line with prior work, we optimize the distribution parameters $\psi_t = (\mu_t, \sigma_t)$ at every time step with respect to the log-likelihood of the Probit model.
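For reference, a Probit model predicts $P(y = 1 \mid x) = \Phi(x^\top \theta)$, with $\Phi$ the standard normal CDF. The sketch below shows only this plain likelihood building block (the paper optimizes the distribution parameters $\psi$, which is not reproduced here); variable names are our own:

```python
import numpy as np
from math import erf, sqrt

def probit_prob(x, theta):
    """P(y = 1 | x) under a Probit model: standard normal CDF of x . theta."""
    z = float(np.dot(x, theta))
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def log_likelihood(X, y, theta, eps=1e-12):
    """Log-likelihood of binary labels y in {0, 1} for a batch X."""
    ll = 0.0
    for x_i, y_i in zip(X, y):
        p = probit_prob(x_i, theta)
        ll += y_i * np.log(p + eps) + (1 - y_i) * np.log(1 - p + eps)
    return ll

X = np.array([[1.0, 0.5], [-1.0, 2.0]])
y = np.array([1, 0])
print(log_likelihood(X, y, theta=np.array([1.0, -1.0])))
```

Because the model has exactly one parameter per input feature, any per-parameter drift statistic maps directly onto a feature, which is what enables the input-level explanations discussed below.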
The assumption of independent model parameters may appear restrictive, but in practice it often leads to good results, e.g. in the case of local feature attributions [16, 15] or feature selection [13, 5]. In fact, the independence assumption allows us to identify the parameters affected by concept drift and thus to comply with Property 2. Since the Probit model comprises one parameter per input feature, we can readily associate concept drift with individual input variables.
Accordingly, let $\theta \sim \mathcal{N}(\mu, \Sigma)$, where $\mu$ is a vector of mean values and $\Sigma = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_K^2)$ is the diagonal covariance matrix, whose diagonal entries correspond to the vector $\sigma^2$. The differential entropy of $\mathcal{N}(\mu, \Sigma)$ is
$$h(\psi) = \frac{1}{2} \sum_{k=1}^{K} \ln\!\left(2 \pi e \, \sigma_k^2\right).$$
The KL-divergence between $\mathcal{N}(\mu_t, \Sigma_t)$ and $\mathcal{N}(\mu_u, \Sigma_u)$ is
$$D_{KL} = \frac{1}{2} \sum_{k=1}^{K} \left[\ln \frac{\sigma_{u,k}^2}{\sigma_{t,k}^2} + \frac{\sigma_{t,k}^2 + (\mu_{t,k} - \mu_{u,k})^2}{\sigma_{u,k}^2} - 1\right].$$
Substituting these expressions into the moving average (4), we then write the moving average as
Note that (8) scales linearly with the number of parameters $K$, i.e. it has $\mathcal{O}(K)$ time complexity.
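The two closed forms above are standard for diagonal Gaussians and translate directly into $\mathcal{O}(K)$ numpy code (a minimal sketch; function names are ours):

```python
import numpy as np

def entropy_diag_gauss(sigma2):
    """Differential entropy of N(mu, diag(sigma2)):
    h = 0.5 * sum_k ln(2*pi*e*sigma2_k)."""
    return 0.5 * np.sum(np.log(2.0 * np.pi * np.e * sigma2))

def kl_diag_gauss(mu_t, sigma2_t, mu_u, sigma2_u):
    """KL( N(mu_t, diag(sigma2_t)) || N(mu_u, diag(sigma2_u)) ),
    summed over the K independent dimensions."""
    return 0.5 * np.sum(
        np.log(sigma2_u / sigma2_t)
        + (sigma2_t + (mu_t - mu_u) ** 2) / sigma2_u
        - 1.0
    )

mu, s2 = np.zeros(3), np.ones(3)
print(entropy_diag_gauss(s2))         # 3 * 0.5 * ln(2*pi*e), about 4.257
print(kl_diag_gauss(mu, s2, mu, s2))  # 0.0 for identical distributions
```

Because both quantities are sums over the $K$ dimensions, each summand can also be read off individually, which is exactly what the per-parameter detection in (9) exploits.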
In order to identify concept drift at individual parameters (which is equivalent to examining individual features, since we use a Probit model), we can investigate the moving average of a specific parameter $\theta_k$:
In this case, we maintain a different threshold $\alpha_k$ per parameter. Note that (9) has constant, i.e. $\mathcal{O}(1)$, time complexity.
IV Related Work
In this section, we briefly introduce some of the most prominent and recent contributions to concept drift detection.
DDM monitors changes in the classification error of a predictive model. Whenever the observed error changes significantly, DDM issues a warning or an alert. We find various modifications of this general scheme, including EDDM and RDDM. Another well-known method for concept drift adaptation is ADWIN. Here, the authors maintain a sliding window, whose size changes dynamically according to the current rate of distributional change. Other authors also employ a sliding window approach and provide a feasible implementation of Fisher's Exact test, which they use for concept drift detection. Similar to our framework, some works use a sliding window and the entropy to detect concept drift. However, they examine entropy with regard to the predictive result and disregard the model parameters. FHDDM applies a sliding window to classification results and tracks significant differences between the current probability of correct predictions and the previously observed maximal probability. To this end, FHDDM employs a threshold that is based on the Hoeffding bound. In a later approach, the same authors instead use McDiarmid's inequality to detect concept drift. EWMA is a method that monitors an increase in the probability that observations are misclassified. The authors use an exponentially weighted moving average, which places greater weight on the most recent instances in order to detect changes. Related approaches also focus on the predictive outcome. Specifically, some investigate the distribution of the loss function via resampling. Likewise, the LFR method uses certain test statistics to detect concept drift by identifying changes through statistical hypothesis testing. Finally, there are methods that compare the labels of close data points in successive batches to detect concept drift.
In addition, we find various approaches that examine ensembles of online learners to deal with concept drift. For example, the paired-learners approach compares two models: one that is trained with all streaming observations and another that is trained only with the latest observations. Likewise, a related approach analyzes the density of the posterior distributions of an incremental and a static estimator.
Conceptually, our work differs substantially from the remaining literature. Instead of directly examining the streaming observations or the predictive outcome, ERICS monitors changes in the parameters of a predictive model.
V Experiments
We evaluated ERICS in multiple experiments. All experiments were conducted on an i5-8250U CPU with 8 GB of RAM, running 64-bit Windows 10 and Python 3.7.3. We compared our framework to the popular concept drift detection methods ADWIN, DDM, EWMA, FHDDM, MDDM and RDDM. We used the predefined implementations of these models as provided by the Tornado framework. Besides, we applied the default set of parameters throughout all experiments. Note that all related models require the classifications of a predictive model. To this end, we trained a Very Fast Decision Tree (VFDT) in an interleaved test-then-train evaluation. The VFDT is a state-of-the-art online learner, which uses the Hoeffding bound to incrementally construct a decision tree for streaming data. We used the VFDT implementation of scikit-multiflow in our experiments. Note that we consider a simple binary classification scenario in all our experiments, since it should be handled well by all models.
Table II lists all hyperparameters per data set. The "Epochs" and learning-rate ("LR") hyperparameters control the training of the Probit model, which we adopted from prior work.
[Table I excerpt: KDD (sample) — 100,000 observations, 41 features (continuous, categorical)]
V-A Data Sets
In order to evaluate the timeliness and precision of a concept drift detection model, we require ground truth. Consequently, we generated multiple synthetic data sets using the scikit-multiflow package. Detailed information about each generator can be obtained from the corresponding documentation. We exhibit the properties of all data sets in Table I. Note that we simulated multiple types of concept drift. Specifically, we produced sudden concept drifts with the SEA generator. To this end, we specified a drift duration (width parameter) of 1. We alternated between the classification functions 0-3 to produce the different concepts. With the Agrawal generator, we simulated gradual drift of different duration. Again, we alternated between the classification functions 0-3 to shift the data distribution. With the rotating Hyperplane generator, we simulated an incremental drift over the full length of the data set. We generated 20 features with the Hyperplane generator, out of which 10 features were subject to concept drift by a magnitude of 0.5. Finally, we produced a Mixed drift using the Agrawal generator. The Mixed data contains both sudden and gradual drift, which we obtained by alternating the classification functions 0-4. All synthetic data sets contain 10% noisy data. We obtained 100,000 observations from each data stream generator.
In addition, we evaluated the proposed framework on real-world data. However, since real-world data usually does not provide any ground truth information, we had to artificially induce concept drift. For this reason, we applied an established methodology to induce sudden concept drift in five well-known data sets from the online learning literature. First, we randomly shuffled the data to remove any natural (unknown) concept drifts. Next, we ranked all features according to their information gain. We then selected the top 50% of the ranked features and randomly permuted their values. In this way, we generated sudden drifts after every 20% of the observations. Specifically, we introduced concept drift to the real-world data sets Spambase, Adult, Human Activity Recognition (HAR), KDD 1999 and Dota2, which we took from the UCI Machine Learning Repository. Note that we drew a random sample of 100,000 observations from the KDD 1999 data to allow for feasible computations.
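The drift-induction protocol above can be sketched as follows. As a stand-in for the information-gain ranking, features are ranked here by absolute correlation with the label; this substitution, and all function names, are our assumptions:

```python
import numpy as np

def induce_sudden_drifts(X, y, n_drifts=4, top_frac=0.5, seed=0):
    """Sketch of the drift-induction protocol: shuffle rows, rank features,
    then randomly permute the top-ranked features within each post-drift
    segment (one drift boundary after every 1/(n_drifts + 1) of the rows)."""
    rng = np.random.default_rng(seed)
    X, y = X.copy(), y.copy()

    # 1. Shuffle to remove any natural (unknown) drift.
    order = rng.permutation(len(X))
    X, y = X[order], y[order]

    # 2. Rank features by a relevance score (stand-in for information gain)
    #    and keep the top fraction.
    scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    top = np.argsort(scores)[::-1][: int(top_frac * X.shape[1])]

    # 3. Permute the top features' values within each post-drift segment.
    bounds = np.linspace(0, len(X), n_drifts + 2, dtype=int)
    for start, end in zip(bounds[1:-1], bounds[2:]):
        for j in top:
            X[start:end, j] = rng.permutation(X[start:end, j])
    return X, y, top

# Usage: Xd, yd, top = induce_sudden_drifts(X, y) yields four sudden drifts
# at 20%, 40%, 60% and 80% of the (shuffled) observations.
```

Permuting a relevant feature's values destroys its relation to the label from the drift point onward, so the boundaries are known exactly, which is what makes delay, recall and precision measurable.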
Besides, we used the MNIST data set to evaluate partial concept drift detection at the input level. We selected all observations that are either labelled 3 or 8, since these numbers are difficult to distinguish. In the first half of the observations, we treated 3 as the true class. In the second half of the observations, we switched the true class to 8. In this way, we simulated a sudden concept drift of all input features.
For all real-world data sets, we normalized the continuous features to a fixed range.
V-B Delay, Recall and Precision
In our first experiment, we applied the concept drift detection models to all synthetic and real-world data sets. Figure 15 exhibits the drift alerts of every model. The blue vertical lines and shaded areas indicate periods of concept drift. Each black vertical line corresponds to one drift alert. Most models identify concept drift in early iterations. This is due to the initial training phase of the predictive model and therefore has no practical relevance. For the upcoming evaluations, we have therefore ignored all drift alerts in the first 80 batches.
As Figure 15 shows, the proposed framework ERICS performs well on all data sets. Given the low complexity of the underlying Probit model, some concept drifts do not induce an immediate change of the parameter distribution. This can be seen in small delays, for example on the Agrawal data. Still, ERICS achieves the smallest average delay of all concept drift detection models, as shown in Table III.
Strikingly, ERICS generally seems to produce fewer false alarms than related models. We find support for this intuition by examining the average recall (Figure 16) and precision (Figure 17) over all data sets. Similar to related work, we evaluated the detected drifts for different detection ranges. The detection range corresponds to the number of batches after a known drift during which we consider an alert as a true positive. Whenever there is no drift alert in the detection range, we count this as a false negative. Besides, all drift alerts outside of the detection range are false positives. We used these scores to compute the recall and precision values. Again, we find that ERICS tends to struggle in the early stages, right after a drift happens. As mentioned before, we attribute this to the slowly updating Probit model that we used for illustration. The VFDT, which is used by all related models, is much more complex and can thus adapt to changes faster. Additionally, we must treat some recall scores with care. For example, on four data sets, the DDM model detects drift in almost every time step. Hence, it achieves perfect recall, although the drift alerts are not reliable at all. Still, ERICS ultimately outperforms all related models in terms of both recall and precision. The superiority of our framework is even more apparent if we look at the harmonic mean of precision and recall, i.e. the F1 score, which we show in Figure 18.
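The evaluation protocol described above can be made precise in a short function; the exact matching convention (an alert counts from the drift batch up to, but excluding, drift + range) is our assumption:

```python
def drift_detection_scores(true_drifts, alerts, detection_range):
    """Recall and precision of drift alerts under a detection range.

    An alert within `detection_range` batches after a known drift counts
    toward one true positive per drift; a drift with no alert in its range
    is a false negative; alerts outside every range are false positives.
    """
    tp = fn = 0
    matched = set()
    for drift in true_drifts:
        in_range = [a for a in alerts if drift <= a < drift + detection_range]
        if in_range:
            tp += 1
            matched.update(in_range)
        else:
            fn += 1
    fp = len([a for a in alerts if a not in matched])
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

# Two known drifts; one is detected promptly, one is missed, and there is
# one false alarm far from any drift:
print(drift_detection_scores([100, 200], alerts=[103, 150], detection_range=10))
```

This also illustrates the caveat about DDM: a detector that alerts at almost every batch trivially hits every detection range (perfect recall) while its precision collapses.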
V-C Detecting Drift at the Input Level
As mentioned before, by using a Probit model and treating the parameters as independently Gaussian distributed, we are able to associate concept drift with specific input features. By means of illustration, we apply ERICS to a sample of the MNIST data set, into which we induced concept drift. In Figure 19, we exhibit the mean of all observations corresponding to the true class before and after the concept drift (left subplots). We also show the absolute difference between those mean values. In the outermost subplot on the right, we illustrate the drift alerts per input feature in the first 15 batches after the concept drift. The color intensity corresponds to the number of drift alerts (darker patterns indicate more alerts). Strikingly, the frequency of drift alerts closely maps the absolute difference between the two concepts. This shows that ERICS is generally able to identify the input features that are most affected by concept drift. We expect this pattern to become even clearer when using more complex base models.
VI Conclusion
In this work, we proposed a novel and generic framework for the detection of concept drift in streaming applications. Our framework monitors changes in the parameters of a predictive model to effectively identify distributional changes of the input. We exploit common measures from information theory, showing that real concept drift corresponds to changes of the uncertainty regarding the optimal parameters. Given an appropriate parameter distribution, the proposed framework can also attribute drift to specific input features. In experiments, we highlighted the advantages of our approach over multiple existing methods, using both synthetic and real-world data. Strikingly, ERICS detects concept drift with less delay on average, while outperforming existing models in terms of both recall and precision.
- (2008) Paired learners for concept drift. In 2008 Eighth IEEE International Conference on Data Mining, pp. 23–32.
- (2006) Early drift detection method. In Fourth International Workshop on Knowledge Discovery from Data Streams, Vol. 6, pp. 77–86.
- (2017) RDDM: reactive drift detection method. Expert Systems with Applications 90, pp. 344–355.
- (2007) Learning from time-changing data with adaptive windowing. In Proceedings of the 2007 SIAM International Conference on Data Mining, pp. 443–448.
- CancelOut: a layer for feature selection in deep neural networks. In International Conference on Artificial Neural Networks, pp. 72–83.
- (2018) Concept drift detection based on Fisher's exact test. Information Sciences 442, pp. 220–234.
- (2014) Detecting concept drift: an information entropy based method using an adaptive sliding window. Intelligent Data Analysis 18 (3), pp. 337–364.
- (2017) UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences.
- Learning with drift detection. In Brazilian Symposium on Artificial Intelligence, pp. 286–295.
- (2014) A survey on concept drift adaptation. ACM Computing Surveys (CSUR) 46 (4), pp. 44.
- (2014) A comparative study on concept drift detectors. Expert Systems with Applications 41 (18), pp. 8144–8156.
- (2014) Concept drift detection through resampling. In International Conference on Machine Learning, pp. 1009–1017.
- (2020) Leveraging model inherent variable importance for stable online feature selection. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1478–1502.
- (2001) Mining time-changing data streams. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 97–106.
- (2016) LICON: a linear weighting scheme for the contribution of input variables in deep artificial neural networks. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, pp. 45–54.
- (2017) A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pp. 4765–4774.
- (2018) Scikit-multiflow: a multi-output streaming framework. The Journal of Machine Learning Research 19 (1), pp. 2915–2924.
- (2018) McDiarmid drift detection methods for evolving data streams. In 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–9.
- (2016) Fast Hoeffding drift detection method for evolving data streams. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 96–111.
- (2018) Reservoir of diverse adaptive learners and stacking fast Hoeffding drift detection methods for evolving data streams. Machine Learning 107 (11), pp. 1711–1743.
- (2012) Exponentially weighted moving average charts for detecting concept drift. Pattern Recognition Letters 33 (2), pp. 191–198.
- (2017) On the reliable detection of concept drift from streaming unlabeled data. Expert Systems with Applications 82, pp. 77–99.
- (2011) New drift detection method for data streams. In International Conference on Adaptive and Intelligent Systems, pp. 88–97.
- (2019) Online semi-supervised concept drift detection with density estimation. arXiv preprint arXiv:1909.11251.
- (2015) Concept drift detection for streaming data. In 2015 International Joint Conference on Neural Networks (IJCNN), pp. 1–9.
- (2016) Characterizing concept drift. Data Mining and Knowledge Discovery 30 (4), pp. 964–994.
- (2017) Concept drift detection with hierarchical hypothesis testing. In Proceedings of the 2017 SIAM International Conference on Data Mining, pp. 768–776.
- (2020) Handling concept drift via model reuse. Machine Learning 109 (3), pp. 533–568.
- (2010) Learning under concept drift: an overview. arXiv preprint arXiv:1010.4784.