1 Introduction
It is crucial to adapt predictive models built on live nonstationary data streams to the ever-changing underlying processes. Timely adaptation of these models to this phenomenon, known as concept drift, has drawn significant attention in the contemporary literature. Adaptation to concept drift has a wide range of applications, from Internet-of-Things (IoT) analytics to the analysis of signals generated by autonomous robots, and from spam detection to natural language processing.
Most predictive models today operate under the assumption of a stationary environment. However, certain real-time data, for example financial, climate, medical, energy demand, and pricing data, are generated by underlying nonstationary sources that change constantly with time. Figure 1 illustrates the effect of concept drift on prediction models: the prediction accuracy of the classifier (Classes A and B) degrades because the data, described by two features, is undergoing smooth concept drift. It is therefore imperative to adapt the prediction models to the new incoming data stream.
In the manufacturing domain, there is a need to deploy predictive models for predictive maintenance, quality assessment and condition monitoring. However, changes in the machine configuration, in the machine calibration, or in the thresholds for quality assessment are the usual sources of concept drift in the data. Thus, the deployed models must be adapted to this gradual drift in the data.
The task of model adaptation becomes especially challenging when the incoming data streams are unlabeled or very sparsely labeled. Figure 2 illustrates a specific use case in which the features and labels, collected over a period of time, are available to learn a predictive model (the training phase). Once the model is deployed, we have access only to the trained model and no longer to the training data. When the model is online, it must make predictions over a batch of data by adapting the previously trained model. The labels may not be available during adaptation, which motivates model adaptation with unsupervised learning on batch data.
Although model adaptation has gained a lot of attention in the recent past, little has been done to develop a quantitative definition of concept drift. In this paper, we propose a novel quantitative expression for the drift of each data point in terms of changes in posterior probability distributions. We then develop a novel iterative algorithm that learns from the nonstationary data, estimates the pointwise drifts and adapts the prediction model to improve its accuracy. We evaluate the proposed algorithm on synthetic and real data and show significant improvement over an unadapted solution.
The rest of the paper is organized as follows. In Section 2, we review the related literature and draw comparisons with our work. Section 3 formulates a quantitative definition of concept drift. The drift adaptation algorithm, together with its convergence properties, is presented in Section 4. Section 5 presents the evaluation and results of the algorithm on synthetic and real data. We summarize and conclude in Section 6.
2 Related Work
Various approaches to address different problems in concept drift are summarized by Moreno-Torres et al. (2012), Gama et al. (2014), Heywood (2015), Ditzler et al. (2015) and Žliobaitė et al. (2016). Moreno-Torres et al. (2012) present a unifying framework to review and compare works in four different categories of dataset shift. Gama et al. (2014) cover various aspects of concept drift in an integrated way to reflect on the existing state-of-the-art techniques, focusing specifically on supervised learning. Heywood (2015) surveys developments in model building under both evolutionary and non-evolutionary streaming environments. Žliobaitė et al. (2016) compile potential applications of concept drift adaptation to financial, climate, medical, energy demand and pricing data based on tasks, characteristics of changes and operational settings. Kuznetsov and Mohri (2016) and Mohri and Medina (2012) present a series of models for time-series prediction in nonstationary environments.
In some applications, it is also crucial to detect when the data has undergone significant drift such that the existing model is no longer valid. This goal leads to drift detection, which is relevant to batch adaptation. Wang and Abraham (2015) present the Linear Four Rates (LFR) framework to detect concept drift and identify the data points that belong to the new concept. However, this detection technique is supervised and can be used only with binary classification models. Dries and Rückert (2009) propose three methods for adaptive concept drift detection in which the test statistics are dynamically adapted to the nonstationary data. Such methods are useful when the drift affects different characteristics of the underlying distribution at different time points.
The primary objective of this paper is to continuously adapt a prediction model undergoing smooth/gradual concept drift (see Bartlett (1992)), irrespective of the degree of drift at a given time point. Dyer et al. (2014) introduce a computational-geometry-based framework to learn from nonstationary data where labels are unavailable after initialization. They define nonstationarity in terms of the time-varying probability distribution of the features, and they consider gradual drift, i.e., change in nonstationarity, as translation, rotation or compaction of this distribution. However, they do not quantify the drift. Hanneke et al. (2015) study bounds on the error rates of a predictive model given a sequence of independent data points under concept drift. The paper further provides an adaptation method of active-learning type and studies the bound on the number of labels needed to achieve a desired bound on the error rate.
Kuznetsov and Mohri (2016) present theoretical guarantees of ensemble-based methods for forecasting nonstationary time series. Chaudhuri et al. (2010) propose a tracking algorithm where the observations follow a slightly drifted distribution. Bousquet and Warmuth (2002) and Herbster and Warmuth (1998) provide loss bounds of online algorithms for tracking the best set of experts, which Herbster and Warmuth (2001) extend to shifting bounds for shifting predictors. Kuznetsov and Mohri (2014) and Kuznetsov and Mohri (2015) present time-series prediction error bounds for nonstationary mixing and non-mixing stochastic processes.
We are particularly interested in an unsupervised adaptation technique where the predictive model does not have access to the labels. In this context, Hofer and Krempl (2013) present an unsupervised statistical methodology for analyzing population drift in classification. They define nonstationarity in terms of a time-varying distribution of the class priors, while the conditional feature distribution/density given the class is stationary. They define drift as the ratio of and . Such definitions of nonstationarity and drift are restrictive, since in most practical applications the conditional feature distribution also undergoes gradual changes. We overcome this limitation and formulate, in Section 3, a quantitative definition of concept drift that encompasses more general nonstationarity in data distributions.
Bach and Maloof (2010) and Kolter and Maloof (2007) define concept drift in terms of the KL divergence between the overall posterior probability distributions. In this paper, we quantify the pointwise drift, i.e., the change in the posterior probability of each data point, in terms of a quantity similar to the KL divergence. Further, we estimate the pointwise drift in an unsupervised manner to improve the prediction accuracy. We point out that learning in a nonstationary environment (concept drift) is different from domain adaptation (Jiang and Zhai (2007), Cortes and Mohri (2014)) and/or transfer learning (Long et al. (2014)). In both domain adaptation and transfer learning, learning is done on source data (one distribution) and prediction is done on different but related target data (a different distribution). In this work, by contrast, we address a different problem setup, where the incoming data stream is generated by a gradually drifting distribution, as considered in Souza et al. (2015), Hofer (2015) and Long et al. (2014). There may or may not be any change in distribution between two subsequent batches of data.
3 Drift Formulation
Concept drift is the change in the statistical properties of the target variable over time; Figs. 1 and 2 visually portray this qualitative definition. In this paper, we define drift in the context of classification problems. To obtain a quantitative definition of drift, we observe the changes in the joint and conditional probability distributions/densities of the features and classes.
3.1 Problem Setup
We represent the data in terms of the predictor variables $X \in \mathcal{X}$, where $\mathcal{X}$ is the feature space, and the class labels $Y$. The data can be represented by the joint probability distribution $P(X, Y)$. In most machine learning and data mining applications, predictive models are designed assuming that $P(X, Y)$ is time-invariant. However, in practice $P(X, Y)$ gradually changes with time, $t$, thereby causing concept drift. Since, at any time $t$, $P(X, Y)$ can be expressed as
$$P(X, Y) = P(Y \mid X)\, P(X), \tag{1}$$
the drift can be sufficiently described by the time-varying nature of the posterior distribution $P(Y \mid X)$ and the feature distribution $P(X)$.
For notational simplicity, from now on we drop the subscripts in the probability functions. In our formulation, we consider the posterior distribution $P(Y \mid X)$ and the feature distribution $P(X)$ for two reasons: for a given data point $x$, its class label, $\hat{y}$, is computed using the posterior probability
$$\hat{y} = \arg\max_{y} P(y \mid x), \tag{2}$$
and, for unsupervised model adaptation, at any time point we only observe the features, $X$, which can provide us with an estimate of the feature distribution $P(X)$. In this paper we study unsupervised model adaptation under concept drift. We aim to update the prediction rule based on the changes observed in the predictor variables $X$ only. To be able to detect, and hence estimate, drift in the underlying model based on the variables $X$ alone, we must observe some drift or change in the feature distribution with time. Otherwise, drift detection and adaptation would require labels for at least some of the data points. Since in this paper we consider unsupervised learning, we assume that the feature distribution changes whenever there is concept drift.
At time , based on the labeled initial data (training phase in Figure 2) , we learn the posterior distribution and the feature density . The underlying model and are saved, but we do not store the initial training data. Then at time (online phase in Figure 2), we receive drifted unlabeled data . Based on , and the new drifted data , we aim to compute the updated posterior distribution and feature density , keep them and discard the data . We repeat this process for every subsequent time step .
Hence, at any time , we have the old model and , and receive the new drifted data . Our objective is to design a method to obtain the updated and . This method will update the classifier at all time points. For simplicity of notation, we denote by . To compute the updates, we first need to estimate the drifts which we define in the next subsection.
3.2 Drift definition
We define the pointwise drift of each new data point, for each class and at each time, as
(3) 
From time to , the change/divergence in posterior probability of data point being in class is captured in the drift, . We use Kullback–Leibler (KL) divergence to measure the difference between two probability distributions. The conditional KL divergence of from is
(4) 
where, . The pointwise drifts defined in (3) are the building blocks of the conditional KL divergence . Using (3) and (4), we establish the relation
(5) 
The KL divergence is a nonnegative quantity. A zero value denotes that there is no drift, and the higher the value, the stronger the drift. We have thus developed a quantitative definition of drift that accords well with its qualitative notion. The overall drift between time points and
can be expressed by the KL divergence of the discretized joint distributions
and ,(6) 
Any nonzero drift , is reflected in the divergence between the feature distributions, . The cases where the drift results only in the changes of posterior probabilities, i.e., , but no changes in the feature distribution, i.e., , is beyond the scope of this paper.
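As an illustration of how a KL-based drift measure behaves, the following minimal Python sketch compares per-point class posteriors from two time steps. It is not the paper's estimator: the function name `conditional_kl`, the direction of the divergence (new relative to old), and the use of a plain batch average in place of the expectation over the feature distribution are all assumptions made for this example.

```python
import math

def conditional_kl(post_new, post_old, eps=1e-12):
    """Batch-averaged KL(new || old) between per-point class posteriors.

    Assumptions (not taken from the paper): each row is one data point's
    posterior over classes, and the expectation over the feature
    distribution is approximated by a plain average over the batch.
    """
    total = 0.0
    for p_row, q_row in zip(post_new, post_old):
        total += sum(
            p * math.log((p + eps) / (q + eps))
            for p, q in zip(p_row, q_row)
            if p > 0.0
        )
    return total / len(post_new)

# Hypothetical posteriors for three points and two classes.
old = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
new = [[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]

no_drift = conditional_kl(old, old)   # identical posteriors give zero
drifted = conditional_kl(new, old)    # changed posteriors give a positive value
print(no_drift, drifted)
```

Identical posteriors yield a divergence of zero and any change yields a positive value, which matches the qualitative reading of drift given above.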
3.3 Prediction using drift estimates
We first estimate the pointwise drifts, . The updated posterior probabilities, , follow from the estimated drifts using (3) as:
(7) 
The drift captures the divergence of the posterior probability of the data point belonging to class from time to time . Once we obtain the estimated posterior probabilities using the drift estimates, we predict the class of data point as:
(8) 
We see that the class labels are the maximum a posteriori probability (MAP) estimates, which, in turn, are dependent on the drift estimates . In the next section, we propose an iterative algorithm that simultaneously estimates the drifts and the posterior probabilities of each data point .
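The prediction step can be sketched in a few lines of Python. Because the exact forms of (3) and (7) are not reproduced here, the sketch assumes, purely for illustration, that the drift acts additively on the log-posterior (so the old posterior is reweighted by `exp(drift)` and renormalized). The helper names `apply_drift` and `map_label` are hypothetical.

```python
import math

def apply_drift(post_old, drift):
    """Reweight one point's class posterior by per-class drifts.

    Assumed form (hypothetical, for illustration only): the drift is added
    to the log-posterior, then the result is renormalized to sum to one.
    """
    weights = [p * math.exp(d) for p, d in zip(post_old, drift)]
    z = sum(weights)
    return [w / z for w in weights]

def map_label(posterior):
    """Maximum a posteriori class index for one data point."""
    return max(range(len(posterior)), key=posterior.__getitem__)

post_old = [0.7, 0.3]
unchanged = apply_drift(post_old, [0.0, 0.0])   # zero drift keeps the posterior
shifted = apply_drift(post_old, [-1.0, 1.0])    # drift toward class 1

print(map_label(unchanged), map_label(shifted))
```

With zero drift the MAP label is unchanged, while a sufficiently large drift toward the other class flips it.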
4 Model Adaptation
In this section, we propose the novel model adaptation algorithm for unsupervised learning of class labels from nonstationary data under concept drift.
4.1 Algorithm
The unsupervised algorithm for model adaptation runs two companion posterior-probability estimation subroutines at each iteration until it converges. In the first subroutine, we update the pointwise drift estimates and then update the drift-based posterior probability estimates using (7). The second subroutine computes the class labels from (2) and then, using these labels, updates the label-based posterior probability estimates. The algorithm proposed in this paper iteratively decreases the divergence between the drift-based and label-based posterior probability estimates. The decrease in KL divergence is guaranteed by a gradient descent step (derivative with respect to the drift) on the drift updates. The steps of the model adaptation iterations are presented in Algorithm 1.
In the algorithm, the drift-based posterior is initialized with the posterior distribution from the previous time step, and the drift estimates are set to their initial values. At each iteration, the labels, and thereafter the label-based posterior, are estimated using the drift-based distribution obtained from the previous iteration. We compute the KL divergence between the label-based and drift-based posterior distributions and test for convergence. If the divergence is greater than the predefined threshold, the gradient of the KL divergence is used to update the drifts and the corresponding posterior distribution. The algorithm continues until it reaches convergence. In Subsection 4.2, we discuss the convergence properties and the design of the parameters of Algorithm 1.
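The loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not a reproduction of Algorithm 1: the drift is assumed to act additively on the log-posterior, the label-based posterior is refit with one-dimensional Gaussian class models on the pseudo-labels, the drifts start at zero, and the step size is taken to be `c / k`. All function names and numerical values are hypothetical.

```python
import math

EPS = 1e-12  # numerical floor that keeps logarithms finite

def drift_posterior(post_old, drift):
    """Drift-based posterior: old posterior reweighted by exp(drift)."""
    w = [max(p, EPS) * math.exp(d) for p, d in zip(post_old, drift)]
    z = sum(w)
    return [v / z for v in w]

def label_posteriors(xs, labels, n_classes):
    """Label-based posterior: refit a 1-D Gaussian per class on pseudo-labels."""
    params = []
    for c in range(n_classes):
        pts = [x for x, y in zip(xs, labels) if y == c]
        if not pts:  # class has no pseudo-labeled points: zero prior
            params.append((0.0, 1.0, 0.0))
            continue
        mu = sum(pts) / len(pts)
        var = max(sum((x - mu) ** 2 for x in pts) / len(pts), 1e-6)
        params.append((mu, var, len(pts) / len(xs)))
    rows = []
    for x in xs:
        w = [pi * math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(var)
             for mu, var, pi in params]
        z = sum(w)
        rows.append([v / z for v in w] if z > 0.0 else [1.0 / n_classes] * n_classes)
    return rows

def mean_kl(p_rows, q_rows):
    """Batch-averaged KL(p || q) over per-point posteriors."""
    total = 0.0
    for p_row, q_row in zip(p_rows, q_rows):
        total += sum(p * math.log(p / max(q, EPS))
                     for p, q in zip(p_row, q_row) if p > 0.0)
    return total / len(p_rows)

def adapt(xs, post_old, c=1.0, tol=1e-3, max_iter=10):
    n_classes = len(post_old[0])
    drift = [[0.0] * n_classes for _ in xs]   # assumed zero initialization
    post_d = [row[:] for row in post_old]     # start from the old posterior
    history = []
    for k in range(1, max_iter + 1):
        labels = [max(range(n_classes), key=r.__getitem__) for r in post_d]
        post_l = label_posteriors(xs, labels, n_classes)
        kl = mean_kl(post_l, post_d)
        history.append(kl)
        if kl < tol:
            break
        eta = c / k                           # assumed decreasing step size
        for i in range(len(xs)):
            # Per-point gradient of KL(label || drift) w.r.t. the drift.
            for j in range(n_classes):
                drift[i][j] -= eta * (post_d[i][j] - post_l[i][j])
            post_d[i] = drift_posterior(post_old[i], drift[i])
    labels = [max(range(n_classes), key=r.__getitem__) for r in post_d]
    return labels, post_d, history

def old_posterior(x):
    # Hypothetical previous model: unit-variance Gaussians at 0 and 3.
    w0 = math.exp(-(x - 0.0) ** 2 / 2)
    w1 = math.exp(-(x - 3.0) ** 2 / 2)
    return [w0 / (w0 + w1), w1 / (w0 + w1)]

xs = [0.3, 0.6, 0.9, 1.2, 1.4, 3.2, 3.6, 3.9, 4.1, 4.5]  # drifted batch
labels, post_new, history = adapt(xs, [old_posterior(x) for x in xs])
print(labels, history[0], history[-1])
```

Under the assumed softmax-style parameterization, the per-point gradient of the KL divergence with respect to the drift is simply the difference between the drift-based and label-based posteriors, which keeps the update inexpensive. Only the old posterior and the unlabeled batch are needed, in line with the setup of Section 3.1.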
4.2 Convergence Properties and Step Size Design
In model adaptation Algorithm 1, the cost function is the KL divergence between the two posterior probabilities, and . Since, KL divergence is a convex function and we are minimizing it using a gradient descent step, the algorithm monotonically converges due to the convex property of the cost function. Now the rate of convergence and the asymptotic bounds depend on the choice of the step size. Hence, the key design parameter in model adaptation Algorithm 1 is the step size . Here, we consider step size monotonically decreasing with iterations , i.e.,
(9) 
where, is a constant. The dynamic, decreasing step size ensures that the algorithm achieves a faster convergence rate and at the same time converges with a low KL divergence between the drift-based and label-based posterior probability estimates. Thus, the algorithm ensures that they converge to the true posterior probabilities of each data point. We evaluate and discuss the predictive performance of the proposed algorithm in the following section.
5 Evaluation and Results
We extensively evaluate the performance of the proposed algorithm on synthetic, SEA (Street and Kim (2001)), and manufacturing datasets. We benchmark our solution with respect to supervised and unadapted approaches. In the supervised approach, we train and test a new model using the features and labels of the drifted data at time by fold cross validation. Although we have the constraint that the labels are not available, we evaluate our model against supervised model to assess the benchmark performance. The unadapted model uses the model learnt at time to predict the labels at time . The performance of a supervised learning solution is the best case scenario, whereas the unadapted learning is the worst case scenario. We empirically demonstrate that the performance of Algorithm 1 improves on the unadapted solution, but yields a performance gap compared to the supervised learning. To further reduce this gap is the scope for future work.
5.1 Evaluation on synthetic dataset
The synthetic data is a mixture of two Gaussians, i.e., , where and
. The prior on the class labels is considered to follow Bernoulli distribution, i.e.,
Bernoulli. The prior probability, mean and covariances of the Gaussian distributions for the initial
and drifted data are:
We considered initial data points and drifted data points for our experimental evaluations. The drift in the mean of class data points from to is graphically displayed in Fig. 3 (Top). We report the classification error for the drifted data in Table 1 and observe that the prediction accuracy of the adapted model is higher than that of the unadapted model. The last plot in Fig. 3 (Top) shows the labels of the drifted data estimated using Algorithm 1.
Table 1: Classification errors.
| Classification errors | Synthetic | SEA | Manufacturing |
|---|---|---|---|
| Supervised learning (best case) | 2.20% | 4.45% | 1.40% |
| Model adaptation (Algo. 1) | 3.10% | 4.97% | 1.67% |
| Without adaptation (worst case) | 4.10% | 5.61% | 1.85% |
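A synthetic setup of the kind described in Section 5.1 can be generated with the short Python sketch below. The prior, means, standard deviations, sample sizes and seeds are hypothetical placeholders rather than the values used in the experiments, and diagonal covariances are assumed for simplicity.

```python
import random

def sample_mixture(n, prior_class1, means, stds, seed):
    """Draw n points from a two-class Gaussian mixture with a Bernoulli prior.

    Assumption: independent features per class (diagonal covariance).
    """
    rng = random.Random(seed)
    features, labels = [], []
    for _ in range(n):
        y = 1 if rng.random() < prior_class1 else 0
        x = tuple(rng.gauss(m, s) for m, s in zip(means[y], stds[y]))
        features.append(x)
        labels.append(y)
    return features, labels

def class_mean(features, labels, cls):
    """Empirical mean of the points carrying label cls."""
    pts = [x for x, y in zip(features, labels) if y == cls]
    return tuple(sum(col) / len(pts) for col in zip(*pts))

unit = [(1.0, 1.0), (1.0, 1.0)]
# Initial data: hypothetical class means at (0, 0) and (3, 3).
x_init, y_init = sample_mixture(1000, 0.5, [(0.0, 0.0), (3.0, 3.0)], unit, seed=0)
# Drifted data: only the mean of class 0 moves, to (1, 1).
x_drift, y_drift = sample_mixture(1000, 0.5, [(1.0, 1.0), (3.0, 3.0)], unit, seed=1)

print(class_mean(x_init, y_init, 0), class_mean(x_drift, y_drift, 0))
```

The labels of the drifted batch are generated here only so that a classification error can be computed afterwards; an unsupervised adaptation method would see the features alone.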
5.2 Evaluation on SEA dataset
We adapted the SEA concepts dataset from Street and Kim (2001) and removed data points corresponding to one of the three classes. We changed the classification rule so as to incorporate the change in the feature distribution from the initial data to the drifted data. The initial and drifted data for the modified SEA dataset can be seen in Fig. 3 (Bottom). Similar to the results of the synthetic data, we see an improvement in the accuracy of the adapted model over the unadapted one, see Table 1. We have also plotted the estimated labels for the drifted data in Fig. 3 (Bottom). We observe from this plot that the adapted model was able to estimate a class boundary that is much closer to the class boundary in the drifted data rather than the initial data.
5.3 Evaluation on Manufacturing dataset
In this study we have evaluated our algorithm on a real dataset from a manufacturing plant. This data contains 27 different measurements taken from a product that is manufactured in the plant and the quality of the end product (Good/Bad) is used as a class label. This is a binary classification problem and there is a gradual drift in the data feature space. We evaluated a total of 11,000 products that have been produced and divided them into two phases, training (initial data) and testing (drifted data). We applied our algorithm to this data and the results are chronicled in Table 1. Again in this case, we observe that the adapted model outperforms the unadapted model which is surpassed by the supervised model. In all of our above experiments, we have chosen step size empirically to be , the convergence threshold to be and number of iterations as .
5.4 Comparative Evaluation
Here, we compare the performance of our model adaptation Algorithm 1 with the state-of-the-art method Stream Classification Algorithm Guided by Clustering (SGARC) presented in Souza et al. (2015). In Fig. 4, we present the classification error on the drifted data over time on the five datasets MG_2C_2D, UG_2C_3D, UG_2C_2D, FG_2C_2D and UG_2C_5D from Souza et al. (2015), of which the first three were originally proposed in Dyer et al. (2014). Souza et al. (2015) compared the performance of their algorithm SGARC with Compacted Object Sample Extraction (COMPOSE) from Dyer et al. (2014) and Arbitrary Subpopulation Tracker (APT) from Krempl (2011). Souza et al. (2015) demonstrated that SGARC outperforms both COMPOSE and APT on the MG_2C_2D dataset, whereas on the UG_2C_2D dataset SGARC outperforms APT while matching the performance of COMPOSE.
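Table 2: Classification accuracy.
| Classification accuracy | MG_2C_2D | UG_2C_3D | UG_2C_2D | FG_2C_2D | UG_2C_5D |
|---|---|---|---|---|---|
| SGARC (1NN) | 92.71% | 94.77% | 95.56% | 95.16% | 90.98% |
| SGARC (SVM) | 92.75% | 94.79% | 95.53% | 95.23% | 88.24% |
| Our Model (Algo. 1) | 92.20% | 95.11% | 96.08% | 92.03% | 93.65% |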
In Table 2, we see that our method outperforms SGARC on the UG_2C_2D, UG_2C_3D and UG_2C_5D datasets, whereas it lags behind (but remains comparable) on the MG_2C_2D and FG_2C_2D datasets. Note that our method is less computationally complex and converges within 10 iterations. Table 3 shows that our method achieves comparable performance (see Table 2) in significantly less time than SGARC. Souza et al. (2015) showed that SGARC runs much faster than COMPOSE and APT on the same five datasets. Hence, our model adaptation Algorithm 1 provides comparable (or better, on three of the five datasets) classification accuracy with significantly shorter run times.
6 Conclusions
Making classifiers robust to drift is an important requirement for practical applications. In this work, we have presented an example of model adaptation for scenarios where drift is gradual and labels are unavailable during the adaptation period. The three primary contributions of this paper are: (i) quantification of concept drift in classification applications; (ii) determination of sample importance in the estimation of drift; and (iii) development of a novel algorithm that estimates the drift of each data point and adapts the classifier to improve prediction accuracy. While the adaptation algorithm has shown promising results for a naive Bayes classifier, extensions to other popular classifiers remain to be examined. Similarly, an extension from the current batch implementation to a streaming implementation is desirable. Future work will also investigate the convergence of the algorithm in situations where the drift is not gradual.
References
Bach and Maloof (2010). A Bayesian approach to concept drift. In Advances in Neural Information Processing Systems, pp. 127–135.
Bartlett (1992). Learning with a slowly changing distribution. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 243–252.
Bousquet and Warmuth (2002). Tracking a small set of experts by mixing past posteriors. Journal of Machine Learning Research 3 (Nov), pp. 363–396.
Chaudhuri et al. (2010). An online learning-based framework for tracking. In Uncertainty in Artificial Intelligence (UAI).
Cortes and Mohri (2014). Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science 519, pp. 103–126.
Ditzler et al. (2015). Learning in nonstationary environments: a survey. Computational Intelligence Magazine, IEEE 10 (4), pp. 12–25.
Dries and Rückert (2009). Adaptive concept drift detection. Statistical Analysis and Data Mining 2, pp. 311–327.
Dyer et al. (2014). COMPOSE: a semisupervised learning framework for initially labeled nonstationary streaming data. IEEE Transactions on Neural Networks and Learning Systems.
Gama et al. (2014). A survey on concept drift adaptation. ACM Computing Surveys (CSUR) 46 (4), pp. 44.
Hanneke et al. (2015). Learning with a drifting target concept. In Algorithmic Learning Theory, pp. 149–164.
Herbster and Warmuth (1998). Tracking the best expert. Machine Learning 32 (2), pp. 151–178.
Herbster and Warmuth (2001). Tracking the best linear predictor. Journal of Machine Learning Research 1 (Sep), pp. 281–309.
Heywood (2015). Evolutionary model building under streaming data for classification tasks: opportunities and challenges. Genetic Programming and Evolvable Machines 16 (3), pp. 283–326.
Hofer and Krempl (2013). Drift mining in data: a framework for addressing drift in classification. Computational Statistics & Data Analysis 57 (1), pp. 377–391.
Hofer (2015). Adapting a classification rule to local and global shift when only unlabelled data are available. European Journal of Operational Research 243 (1), pp. 177–189.
Jiang and Zhai (2007). Instance weighting for domain adaptation in NLP. ACL 7, pp. 264–271.
Kolter and Maloof (2007). Dynamic weighted majority: an ensemble method for drifting concepts. Journal of Machine Learning Research 8 (Dec), pp. 2755–2790.
Krempl (2011). The algorithm APT to classify in concurrence of latency and drift. In International Symposium on Intelligent Data Analysis, pp. 222–233.
Kuznetsov and Mohri (2014). Generalization bounds for time series prediction with nonstationary processes. In International Conference on Algorithmic Learning Theory, pp. 260–274.
Kuznetsov and Mohri (2015). Learning theory and algorithms for forecasting nonstationary time series. In Advances in Neural Information Processing Systems, pp. 541–549.
Kuznetsov and Mohri (2016). Time series prediction and online learning. In 29th Annual Conference on Learning Theory, pp. 1190–1213.
Long et al. (2014). Adaptation regularization: a general framework for transfer learning. IEEE Transactions on Knowledge and Data Engineering.
Mohri and Medina (2012). New analysis and algorithm for learning with drifting distributions. In International Conference on Algorithmic Learning Theory, pp. 124–138.
Moreno-Torres et al. (2012). A unifying view on dataset shift in classification. Pattern Recognition 45 (1), pp. 521–530.
Souza et al. (2015). Data stream classification guided by clustering on nonstationary environments and extreme verification latency. In Proceedings of SIAM International Conference on Data Mining (SDM), pp. 873–881.
Street and Kim (2001). A streaming ensemble algorithm (SEA) for large-scale classification. In International Conference on Knowledge Discovery and Data Mining (SIGKDD), pp. 377–382.
Wang and Abraham (2015). Concept drift detection for streaming data. In 2015 International Joint Conference on Neural Networks (IJCNN), pp. 1–9.
Žliobaitė et al. (2016). An overview of concept drift applications. In Big Data Analysis: New Algorithms for a New Society, pp. 91–114.