1 Introduction
Predictive modeling for data streams is becoming an increasingly relevant task, in particular with the advent of sensor networks and tasks in artificial intelligence, including robotics, reinforcement learning, system monitoring, anomaly detection, and social network and media analysis.
In a data stream, we assume that data arrives over time. A model observes test instance $\mathbf{x}_t$ and is required to make a prediction $\hat{y}_t$ at time $t$. Hence the amount of computational time spent per instance must be less than the rate of arrival of new instances (i.e., the real clock time between time steps $t$ and $t+1$). A usual assumption is that the true label $y_t$ becomes available at time $t+1$, thus allowing the model to be updated.
This is in contrast to a standard batch setting, where a dataset of fixed size is observed prior to inducing the model. See [11] for an introduction and definitions.
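To make the protocol concrete, the test-then-train (prequential) loop can be sketched as follows; the stand-in majority-class "model" and all names are illustrative, not part of any method evaluated in this paper:

```python
# Prequential (test-then-train) evaluation sketch.
# The "model" here is just a running majority-class predictor, used as a
# placeholder for any incremental learner with predict/update operations.
from collections import Counter

def prequential_accuracy(stream):
    """stream: iterable of (x_t, y_t); y_t is revealed only after predicting."""
    counts = Counter()
    correct, total = 0, 0
    for x_t, y_t in stream:
        # 1) Predict before seeing the label (default to class 0 when empty).
        y_hat = counts.most_common(1)[0][0] if counts else 0
        correct += int(y_hat == y_t)
        total += 1
        # 2) The true label arrives at time t+1; update the model.
        counts[y_t] += 1
    return correct / total

# Toy stream in which class 1 dominates.
acc = prequential_accuracy([(None, 1), (None, 1), (None, 0), (None, 1)])
```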
1.1 Building Predictive Models from Data Streams
We wish to build a model that approximates, either directly or indirectly, the generating distribution; for example, a MAP estimate for classification, $\hat{y}_t = \arg\max_y p(y \mid \mathbf{x}_t)$.
The incremental nature of data streams has encouraged a focus on fast and incremental, i.e., updateable, models, in both the classification and regression contexts. Incremental decision trees such as the Hoeffding tree (HT,
[15]) have had a huge impact on the data-stream literature and dominate recent output in this area. They are fast, incremental, easy to deploy, and offer powerful nonlinear decision boundaries. Dozens of modifications have been made, including ensembles [6, 18, 13, 12] and adaptive trees [4]. High performance in data streams has also been obtained by nearest-neighbor (NN) methods, e.g., [19, 26, 25]. As a lazy method, NN has no training-time requirement other than simply storing examples to which – as a distance-based approach – it compares current instances in order to make a prediction. The buffer of examples should be large enough to represent the current concept adequately, but not so large as to be prohibitively expensive to query.
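A buffer-based NN learner of this kind can be sketched in a few lines; the window size and the 1-NN decision rule are illustrative choices:

```python
# Sliding-window 1-NN sketch: the buffer *is* the model, and forgetting
# is implicit because old examples fall out of the fixed-size deque.
from collections import deque

class WindowNN:
    def __init__(self, window=100):
        self.buf = deque(maxlen=window)  # oldest examples evicted automatically

    def predict(self, x):
        if not self.buf:
            return 0
        # 1-nearest neighbour under squared Euclidean distance.
        _, label = min(self.buf,
                       key=lambda xy: sum((a - b) ** 2 for a, b in zip(xy[0], x)))
        return label

    def update(self, x, y):
        self.buf.append((x, y))

nn = WindowNN(window=3)
for x, y in [((0.0,), 0), ((1.0,), 1), ((0.1,), 0)]:
    nn.update(x, y)
```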
Methods employing stochastic gradient descent (SGD) have been surprisingly underutilized in this area of the literature. (We use SGD to mean incremental gradient descent; SGD usually implies drawing randomly from a dataset in i.i.d. fashion. Typically a stream is assumed to be i.i.d., and thus equivalent in that sense. We challenge the i.i.d. assumption later, but keep this terminology to be in line with the related literature.) Baseline linear approaches obtained relatively poor results, but can be competitive with appropriate nonlinearities [20] and have been used within other methods, e.g., at the leaves of a tree [1]. In this work, we argue that the effectiveness of SGD methods on data-stream learning has been underappreciated, and that they in fact offer great potential for future work in streams.

1.2 Dealing with Concept Drift
Dealing with concept drift, where the generating distribution changes in at least some part of the stream, is a major focus of the data-stream literature, since it means that the current model has become partially or fully invalid. Almost all papers on the topic propose some way to tackle its implications, e.g., [14, 28, 11, 6, 18, 16, 17]. A comprehensive survey of concept drift in streams is given in [11].
The limited-size buffer of NN methods implies a natural forgetting mechanism where old examples are purged, as dictated by available computational (memory and CPU) resources. Any impact of concept drift is inherently temporary in these contexts. Of course, adaptation can be accelerated by flushing the buffer when drift is detected.
HTs can efficiently and incrementally build up a model over an immense number of instances without needing to prune or purge from a batch. However, a permanent change in concept (generating distribution) will permanently invalidate the current tree (or at least weaken its relevance, depending on the severity of drift); therefore dealing with drift becomes essential. The usual approach is to deploy a mechanism to detect the change, and reset a tree or part of a tree when this happens, so that more relevant parts may be grown on the new concept. Common detection mechanisms include ADWIN [3], CUSUM [22], Page-Hinkley [7], and various geometric moving average and statistical tests [10]. For example, the Hoeffding Adaptive Tree (HAT, [4]) uses an off-the-shelf ADWIN detector at each node of a tree, and cuts a branch at a node where change is registered. It is expected that the branch will be regrown on new data and thus represent the new concept. Tree approaches are almost universally employed in ensembles to mitigate potential fallout from misdetections.
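As an illustration of such detection machinery, a minimal Page-Hinkley test over the error signal might look as follows; the `delta` and `threshold` values here are illustrative defaults, not those of any cited implementation:

```python
class PageHinkley:
    """Minimal Page-Hinkley test for an increase in the mean of an error signal."""
    def __init__(self, delta=0.005, threshold=2.0):
        self.delta, self.threshold = delta, threshold
        self.mean, self.n = 0.0, 0
        self.cumulative, self.minimum = 0.0, float("inf")

    def add(self, error):
        """Feed one error observation; return True if drift is signalled."""
        self.n += 1
        self.mean += (error - self.mean) / self.n        # running mean of errors
        self.cumulative += error - self.mean - self.delta
        self.minimum = min(self.minimum, self.cumulative)
        return (self.cumulative - self.minimum) > self.threshold

ph = PageHinkley()
signal = [0.0] * 50 + [1.0] * 20   # error jumps after a (simulated) drift at t=50
alarms = [t for t, e in enumerate(signal) if ph.add(e)]
```

In a detect-and-reset pipeline, the alarm would trigger resetting the tree (or branch); in this paper we argue for adapting the model instead.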
In this paper we argue strongly for the potential importance of a third option – continuous adaptation – where knowledge (e.g., a set of parameters determining a decision boundary) is transferred as well as possible to a newer/updated concept, rather than discarded or reset as is currently the case with the popular detect-and-reset approach. This possibility can be enacted with SGD. SGD is intimately known and widely used across swathes of the machine learning literature; however, we note that it is markedly absent from the bulk of data-stream methods, and often compared to only as a baseline in experiments. In this paper we argue that it has been discarded prematurely and underappreciated. We analyse and parameterize it specifically with regard to performance in concept-drifting data streams, and show it to be very competitive with other approaches in the literature (results in sec:discussion).
To summarize the main mechanisms to deal with concept drift:

- Forgetting mechanisms (e.g., NN, and batch-incremental ensembles)
- Detect-and-reset mechanisms (e.g., HATs and HT-ensembles with ADWIN)
- Continuous adaptation (e.g., SGD-based models, incl. neural networks)
1.3 Organization and contributions
In spite of the enormous popularity of ensembled trees and distance-based (NN) approaches, we will show that continuous adaptation can be better suited to concept-drifting data streams. We do this by breaking with a common assumption.
Namely, existing work in data streams is mostly based on the assumption of i.i.d. data within a particular concept; therefore seeking as an objective to detect a change in concept, so that off-the-shelf i.i.d. models can be (re)deployed. A model belonging to a previous concept is seen as invalid. This leads to the detect-and-reset approach to dealing with concept drift mentioned above. In this work, on the contrary, we show that drift inherently implies temporal dependence; that all concept-drifting streams are in some way a time series, and can be treated as such. We propose to treat the concept as a temporal sequence, to enable continuous adaptation as an effective alternative to detect-and-reset approaches. For this purpose we derive gradient descent approaches; and we show scenarios where they compare very favorably with more popular tree and distance-based methods. The contributions of this work can be summarized as follows:

We show that concept-drifting data streams can be considered as time series problems (sec:time_series)

Following a fresh analysis of concept drift (sec:revision), we conduct a bias-variance analysis of learning under drift, and derive gradient descent approaches for concept-drifting data streams in the framework of continuous adaptation (sec:novel)

We give an analytical and empirical investigation (sec:experiments) which highlights properties of our suggested approach; from the results (displayed and discussed in sec:discussion) we outline the implications and make a number of recommendations before making concluding remarks (sec:conclusion).
2 Concept-drifting Streams are Time Series
Conceptdrifting data streams have been widely treated under the assumption of independent samples (see, e.g., [10, 6, 11, 27] and references therein). In this section, we argue that if a data stream involves concept drift, then the independence assumption is violated.
If data points are drawn independently, we should be able to write

(1)  $p(\mathbf{x}_1, \dots, \mathbf{x}_T) = \prod_{t=1}^{T} p_t(\mathbf{x}_t)$

see fig:main_figure.a for illustration, where the lack of an edge between timesteps indicates the independence.
The subscript $t$ of $p_t$ reminds us of the possibility of concept drift (in which case, $p_t \neq p_{t+1}$).
Lemma 1.
A data stream that exhibits a concept drift also exhibits temporal dependence.
Proof.
Let $p_t(\mathbf{x}) = p(\mathbf{x} \mid c_t)$, where $c_t$ denotes the concept at timestep $t$, and let the drift occur at point $t^*$. Thus $c_t = c^{(1)}$ for $t < t^*$, and $c_t = c^{(2)}$ for $t \geq t^*$. Under independence,
$p(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \dots, \mathbf{x}_1) = p_t(\mathbf{x}_t).$
However, it is obvious that
$p(\mathbf{x}_t \mid c_t = c^{(2)}) \neq p(\mathbf{x}_t \mid c_t = c^{(1)}),$
namely, after the drift we no longer expect instances from the first concept.
We can use the joint distribution, eq:ind, to check for independence at any particular time step $t$, marginalizing out $c_t$, which is not observed:

(2)  $p(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \dots, \mathbf{x}_1) = \sum_{c_t} p(\mathbf{x}_t \mid c_t)\, p(c_t \mid \mathbf{x}_{t-1}, \dots, \mathbf{x}_1)$

which does not equal eq:ind, thereby indicating temporal dependence (via the presence of $\mathbf{x}_{t-1}, \dots, \mathbf{x}_1$). ∎
This can be visualized as a probabilistic graphical model in fig:main_figure.b (observed variables are shaded); and in fig:main_figure.c with concept drift marginalized out.
We have shown above that data streams with concept drift exhibit temporal dependence, which essentially means that all such streams can be seen as a time series.
We have done this analysis in the most extreme case of a switch in concepts over a single time step (a sudden change in concept). One might argue that, as the stream grows, the importance of the dependence resulting from this one-time drift becomes negligible, since $c_t = c^{(2)}$ after the drift, and could thus be considered a constant, essentially rendering independence within each concept. However, we do not observe $c_t$; we cannot know exactly when the drift will occur or if it has occurred. As a result, an instantaneous drift between two time steps can manifest itself as temporal dependence in the error signal over many instances. It is surprising that this is not explicit across the literature, since it is indeed implicit in most change-detection algorithms, in the sense that they measure the change in the error signal of predictive models.
The relationship between the predictive model and the error is clearly seen in the relation
$e_t = E(y_t, \hat{y}_t) = E(y_t, h_\theta(\mathbf{x}_t))$
where $E$ is the error function (e.g., mean-squared or classification error, depending on the problem). Clearly, if $h_\theta$ is poorly adapted to deal with a concept drift, this will appear as an increase in $e_t$ (i.e., a time series). This is illustrated (for incremental drift, a type we review in sec:revision) in fig:drift_series. Rather than monitoring $e_t$ for drift so as to reset $h_\theta$, in this paper we look at adapting $h_\theta$ directly.
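This manifestation can be checked numerically: the 0/1 error signal of a fixed model on a stream with an abrupt drift shows strong lag-1 autocorrelation, while an i.i.d. error signal does not. The setup below is a toy sketch (a model that never adapts, so its error jumps with the drift):

```python
import random

def lag1_autocorr(e):
    """Lag-1 autocorrelation of a sequence of error values."""
    n = len(e)
    mean = sum(e) / n
    num = sum((e[t] - mean) * (e[t - 1] - mean) for t in range(1, n))
    den = sum((v - mean) ** 2 for v in e)
    return num / den

random.seed(0)
# A fixed model always predicts the old concept; a drift at t* = 500 flips
# the true labels, so the 0/1 error signal steps from 0 to 1.
drift_errors = [0.0] * 500 + [1.0] * 500
# By contrast, the error signal of a coin-flip (no-drift) stream is i.i.d.
iid_errors = [float(random.random() < 0.5) for _ in range(1000)]
```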
Having argued that concept-drifting streams are time series – should we just apply off-the-shelf time series models? Expanding on some differences mentioned in [27], we can point out that

Only data streams exhibiting concept drift are guaranteed to have time-series elements, and only specifically in those parts of the stream

In data streams, time dependence is seen as a problem (something to deal with), rather than as part of the solution (something to explicitly model)

In data streams the final estimate $\hat{y}_t$ is required at time $t$, and thus retrospective/forward-backward (smoothing) inference (as is typical in state-space models) is not applicable

A common assumption in data streams is that the ground truth $y_t$ is available at time $t+1$, providing a stream of training examples, whereas time series models are typically built offline before being deployed live

Data streams are assumed to be of infinite length (and therefore, also the training data, on account of item 4)
Some of these assumptions are broken on a paper-by-paper level. For example, changes to point 4 have been addressed in, e.g., [12, 14].
The most closely related time-series task to prediction of the current time step in streams is that known as filtering. Actually, eq:deriv is a starting point for state-space models such as hidden Markov models, Kalman and particle filters (see [2] for a thorough review), for which filtering is a main task, although these models are not usually expected to be updated at each timestep as in the streaming case (a possibility due to point 4 above). Although we cannot always apply state-space models directly to data streams, we nevertheless remark again that temporal dependence plays a non-negligible role, and can be leveraged as an advantage. In particular, in reflection of this section, we next make a revision of concept drift – in sec:revision – which then allows us to employ efficient and effective methods (sec:novel), and to draw conclusions that have important repercussions in data-stream mining.
3 Types of Concept Drift: A Fresh Analysis
Sudden, complete, and immediate drift is widely considered in the literature (and for this reason we worked with it in the previous section), yet we could also argue that gradual or incremental drift fits better a dictionary definition of drift, which implies a movement or tendency (the Cambridge dictionary offers, among others: “a slow movement”), and is more widespread in practice. A complete change in concept simply means we have changed problem domains. The idea of a movement inherently includes the implication of dependence (e.g., across time and space) – and dependence (among instances) naturally implies a time series – as we have elaborated in the previous section.
Let us review and revise concept drift, based on the types of drift identified in [11], which we relist for convenience:

- Sudden/abrupt
- Incremental
- Gradual

additionally noting the possibility of reoccurring drift, which may involve any of these types, and noting also the related task of dealing with outliers, which are not concept drift.

In the following, we denote by $\theta^*_t$ the true (unknown) parameters defining the current concept at time $t$, i.e., $p_t(\mathbf{x}) = p(\mathbf{x} \mid \theta^*_t)$. This allows for a smooth drift across concept space (for example, a set of coefficients defining a hyperplane; see fig:drift_series), but also allows for a qualitative view of categorical concepts, such that $\theta^{(m)}$ represents the parameters of the $m$-th concept; i.e., we would speak of drift between concepts $\theta^{(1)}$ and $\theta^{(2)}$.

3.1 Abrupt change
If the concept changes abruptly, in either a partial or complete manner, we may denote

(3)  $\theta^*_t = \begin{cases} \theta^{(1)} & t < t^* \\ \theta^{(2)} & t \geq t^* \end{cases}$

for some time index $t^*$ where the ‘drift’ occurs (as noted above, we would prefer the term shift for this particular case; nevertheless, we inherit this terminology from the literature for the sake of consistency). The drift may be total ($\theta^{(1)}$ and $\theta^{(2)}$ are drawn independently) or partial (only a local change, to some part of $\theta^{(1)}$, recalling that $\theta$ is likely to be multidimensional).
3.2 Incremental drift
Incremental drift denotes a change over time. It can be denoted as an additive operation, where the current concept can be written as

(4)  $\theta^*_t = \theta^*_{t-1} + \delta_t$

i.e., an increment of $\delta_t$. We generally assume that drift is active in a range $t \in [t_1, t_2]$ (and that $\delta_t = \mathbf{0}$ outside of this range), with concept $\theta^{(1)}$ before, $\theta^{(2)}$ after, and a blended mixture in between.
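A toy instantiation of this additive update (the increment vector and drift window are illustrative):

```python
def incremental_drift(theta0, delta, t1, t2, T):
    """Return theta_t for t = 0..T-1: constant outside [t1, t2), drifting inside."""
    thetas, theta = [], list(theta0)
    for t in range(T):
        if t1 <= t < t2:
            # theta_t = theta_{t-1} + delta_t (here a constant increment)
            theta = [a + b for a, b in zip(theta, delta)]
        thetas.append(list(theta))
    return thetas

# Concept drifts from (1, 0) towards (1.1, -0.1) during steps 10..19.
path = incremental_drift(theta0=[1.0, 0.0], delta=[0.01, -0.01], t1=10, t2=20, T=30)
```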
3.3 Gradual drift
In gradual drift, drift also occurs over time, but in a stochastic way. We may write

(5)  $\theta^*_t = \lambda_t \theta^{(2)} + (1 - \lambda_t)\, \theta^{(1)}$

where $\lambda_t = 1$ with probability $\pi_t$, and $\lambda_t = 0$ otherwise; $\lambda_t \sim \mathrm{Bernoulli}(\pi_t)$ being a Bernoulli distribution. Note that $\pi_t$ is itself an incremental drift (see eq:2) between values $0$ and $1$. The stream thus increasingly often generates examples from the new concept $\theta^{(2)}$, where $\pi_t \in (0,1)$ during drift and $\pi_t \in \{0,1\}$ outside of the drift range. Note that neither incremental nor gradual drift need be smooth or monotonically increasing, although that is a common simplification. A sigmoid function is often used in the literature, as in many of the stream generators of the MOA framework
[5].

3.4 Reoccurring drift
Reoccurring drift may be any of the above cases (sudden, gradual, or incremental) where a concept may repeat at different parts of the stream. It is very much related to the idea of state-space models such as hidden Markov models and switching models (both are reviewed in [2]). We remark that there is no technical difference between modelling states and tracking concepts. Usually we can distinguish a state as something that we want to model (e.g., a weather system), and a concept drift as something we wish to adapt to or take into account (e.g., change/degradation in the monitoring sensors, or climate change).
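Tying the drift types together, the Bernoulli mixing of gradual drift, with its probability ramped incrementally from 0 to 1, can be sketched as a toy stream generator (the linear ramp and all names are illustrative):

```python
import random

def gradual_stream(theta_old, theta_new, t1, t2, T, seed=1):
    """Return the concept generating each instance: the old concept before t1,
    the new one after t2, and a Bernoulli(pi_t) mixture in between, where
    pi_t itself drifts incrementally (here: a linear ramp from 0 to 1)."""
    rng = random.Random(seed)
    concepts = []
    for t in range(T):
        pi_t = min(1.0, max(0.0, (t - t1) / (t2 - t1)))  # incremental drift of pi_t
        lam = rng.random() < pi_t                         # lambda_t ~ Bernoulli(pi_t)
        concepts.append(theta_new if lam else theta_old)
    return concepts

cs = gradual_stream("A", "B", t1=100, t2=200, T=300)
```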
4 Learning under Drift: Theoretical Insights
In this section we investigate an approach to adapt to drift continuously as part of the learning process, rather than reactively (i.e., the detect-and-reset approach) using explicit drift-detection mechanisms, as has previously been the main approach (see sec:intro).
It is well known that the prediction error of supervised learners breaks down into variance, bias, and irreducible error (see, e.g., [2, 9]). Let $f$ represent the true underlying (unknown) model parametrized by $\theta^*$, which produces observations $y_t = f_{\theta^*}(\mathbf{x}_t) + \epsilon_t$, where $\epsilon_t \sim \mathcal{N}(0, \sigma^2)$. The expected mean squared error (MSE) over the data, with some estimated model $h_\theta$, can be expressed as

(6)  $\mathbb{E}\big[(y - h_\theta(\mathbf{x}))^2\big] = \sigma^2 + \mathrm{Var}\big[h_\theta(\mathbf{x})\big] + \big(\mathbb{E}[h_\theta(\mathbf{x})] - f_{\theta^*}(\mathbf{x})\big)^2$

i.e., irreducible noise, variance, and bias terms, respectively.
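The decomposition can be verified by simulation at a fixed query point. Below is a sketch under simplifying assumptions: the "model" is just a sample mean with a deliberately added offset, so that all three terms are nonzero:

```python
import random

random.seed(42)
f_x, sigma, trials = 2.0, 0.5, 20000   # true value f(x), noise level, repetitions

preds, errs = [], []
for _ in range(trials):
    # Fit a (deliberately biased) estimator on a fresh 5-point training sample.
    sample = [f_x + random.gauss(0.0, sigma) for _ in range(5)]
    h = sum(sample) / len(sample) + 0.3
    # Evaluate against a fresh noisy observation of f(x).
    y = f_x + random.gauss(0.0, sigma)
    preds.append(h)
    errs.append((y - h) ** 2)

mse = sum(errs) / trials
mean_h = sum(preds) / trials
var = sum((p - mean_h) ** 2 for p in preds) / trials
bias2 = (mean_h - f_x) ** 2
# Expect mse ≈ sigma^2 + var + bias2, matching the three terms above.
```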
The result in eq:bvt_normal hinges on the assumption that $f_{\theta^*}$ is fixed, due to the fact that $\theta^*$ is constant.
However, recall that in the case of concept-drifting data streams, $\theta^*_t$ is not constant, but inherits randomness from the random variable $c_t$ (see Sections 2 and 3). We can get around this problem by taking the expectation of the MSE within each concept; with respect to the point of a single change (which we denote $t^*$), for time $t \geq t^*$ the third term (bias) of eq:bvt_normal is now essentially the difference between the current true concept and an estimated old concept. In other words (in terms of parameters),

(7)  $\mathrm{bias} \approx \big\| \theta^{(2)} - \hat{\theta}^{(1)} \big\|$

and clearly the obvious goal (in terms of reducing bias) is moving $\hat{\theta}^{(1)}$ (an estimate of the previous concept) towards $\theta^{(2)}$ (the true current concept), and, over a concept-drifting stream of multiple drifts of different types, to model the journey of $\theta^*_t$.
In the data-streams literature, ensemble models with drift-detection strategies have blossomed (see [13, 18] for surveys), aside from the popularity and effectiveness of ensembles in supervised learning in general. We can now give theoretical insight into this popularity in this particular area: by resetting a model when drift is detected, it is possible to reduce the bias implied by the drift in those models. However, this may increase the variance component of the error (since variance can be higher on smaller datasets [8]). Ensembles are precisely renowned for reducing the variance component of the error (see, e.g., [9] and references therein) and are thus desirable to counteract it. This analysis also leads us once more to the downside of this approach: under long and intensive drift, a vicious circle develops; increased efforts to detect drift lead to more frequent detections and thus frequent resetting of models (implicitly, to reduce bias), which encourages the deployment of ever-larger ensembles (to reduce the variance caused precisely by resetting models). As seen in the literature, ensemble sizes continue to grow; and our experiment section confirms that such implementations can require significant computational time.
Methods with an explicit forgetting mechanism (e.g., NN, batch-incremental methods – also popular in the literature) will automatically reestablish ‘normal’ bias once the time since drift becomes as large as the maximum number of instances stored in memory. However, this can be a long time if that size is large; and it is not a solution when the drift is sustained over a long time or occurs regularly.
Finally, we remark again that detectors will fire when the error signal has already shown a significant change, by which time many (possibly very biased) predictions may have been made.
In the following section we discuss how to avoid this tradeoff. Namely, we propose strategies of continuous adaptation which do not detect drift, but rather track drift through time.
5 Continuous Adaptation under Concept Drift
Drift-detection methods usually monitor the error signal retrospectively for change (if the recent error is statistically different from the historical error, a warning can be raised). However, since – as argued above – concept drift can be seen as a time series, we can attempt to forecast and track the drift, and adapt continuously.
In this sense, solving the concept-drift problem is identical to solving the forecasting problem of predicting $\theta^*_{t+1}$, which defines the true unknown concept (see eq:tracking). Using all the stream we have seen up to timestep $t$, we could write
$\hat{\theta}_{t+1} = g(\mathbf{x}_1, \dots, \mathbf{x}_t)$
which at first glance appears unusable because it is a function over an increasingly large window of data. However, using a recursion on $\hat{\theta}_t$, for minimizing least mean squares, there is a closed-form solution: recursive least squares (RLS), which uses a recursion of the form
$\hat{\theta}_{t+1} = \hat{\theta}_t + K_t \big( y_t - \hat{\theta}_t^\top \mathbf{x}_t \big)$
where $K_t$ is a gain vector.
RLS is a well-known adaptive filter, which can be easily extended to forgetting RLS and Kalman filters (see, e.g., [9]). Noting the gradient form of this update, and replacing the gain with a learning rate $\eta$, we derive stochastic gradient descent (SGD), as follows.
For $t = 1, 2, \dots$:
$\hat{\theta}_{t+1} = \hat{\theta}_t - \eta\, \nabla E\big(y_t, h_{\hat{\theta}_t}(\mathbf{x}_t)\big)$
where the last line is essentially forecasting the concept for the following timestep (note the connection to, e.g., eq:2). Note the data-stream assumption that at timepoint $t+1$ we have already observed the true value $y_t$.
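A minimal instantiation for a linear model under squared error illustrates the update; the constant learning rate and the toy drifting stream are illustrative choices:

```python
def sgd_track(stream, dim, eta=0.1):
    """Prequential SGD on a linear model y ≈ theta·x with squared error.
    A constant (non-decayed) learning rate keeps the model tracking drift."""
    theta = [0.0] * dim
    sq_errs = []
    for x, y in stream:
        y_hat = sum(t * xi for t, xi in zip(theta, x))   # predict at time t
        sq_errs.append((y - y_hat) ** 2)
        # Gradient step once y_t is revealed: theta <- theta - eta * grad
        g = 2.0 * (y_hat - y)
        theta = [t - eta * g * xi for t, xi in zip(theta, x)]
    return theta, sq_errs

# Incrementally drifting target: the true concept moves from 1.0 towards 2.0.
stream = [((1.0,), 1.0 + 0.001 * t) for t in range(1000)]
theta, sq_errs = sgd_track(stream, dim=1)
```

Despite the target moving at every step, the model tracks it with a small constant lag and the squared error stays low after the initial transient.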
This may be viewed as a trivial result, but it has important implications regarding much of the data-stream research and practice, which traditionally relies predominantly on ensembles of decision trees or nearest-neighbor-based methods; see sec:learning_from_data_streams. We have argued that it has been underappreciated. Unlike Hoeffding tree methods, a gradient-descent-based approach can learn from a concept-drifting stream without explicit concept drift detection. Unlike NN, its time complexity is much more favourable.
We remark that for SGD to perform robustly over the length of a stream, we have to ensure certain conditions; in particular, the learning rate must not decay towards zero over time. In batch scenarios we wish to converge to a fixed point, and such learning-rate scheduling is common and effective practice. However, on a stream this would cause SGD to react more and more slowly to concept drift until eventually becoming stuck in one concept.
An illustration of how SGD performs in a constant-drift setting is given in fig:missing on a synthetic data stream (which is detailed later in sec:experiments). We clearly see how no drift detection or model reset is necessary, and a concept can be smoothly tracked in its drift through concept space.
Recall that under gradual drift, it is $\pi_t$ (see eq:3) rather than $\theta^*_t$ which forms a time series. A detailed treatment is left for future work.
Since abrupt drift cannot necessarily be forecast in advance, one might argue that traditional drift detectors are best suited to this case. We remark that this argument can be clearly accepted only under the condition of a complete change in concept, where the two tasks are not at all related; a scenario unlikely to be the case in practice. If the drift is partial (i.e., the two concepts are partially related), then we wish to transfer part of the old concept (i.e., not discard it when drift is detected). We note that SGD, in this sense, performs a kind of transfer learning; namely continuous transfer learning. The literature on transfer learning (see, e.g., [23]) indicates that we are thus likely to learn the new concept much faster.

6 Experiments
We carried out a series of experiments to enrich the discussion and support the arguments made in this work. All methods are implemented in Python and evaluated in the Scikit-MultiFlow framework [21] under prequential evaluation (testing then training with each instance). Experiments are carried out on a desktop machine (2.60 GHz, 16 GB RAM).
First, we generated synthetic data using a weight matrix $W^{(m)}$ to represent the $m$-th concept. We introduced drift using the equations eq:1, eq:2, eq:3, under the parameters in tab:synth. For incremental drift, $W_t = R(\alpha) W_{t-1}$, where $R(\alpha)$ is a rotational matrix (of angle $\alpha$ in radians); for gradual drift, instances are drawn from the new concept with increasing probability; and, for sudden drift, a new $W$ is simply resampled after the drift point.
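The incremental (rotational) drift can be sketched in two dimensions; the per-step angle below is illustrative, not our exact experimental value:

```python
import math

def rotate(w, alpha):
    """Rotate a 2-D weight vector w by alpha radians: w_t = R(alpha) w_{t-1}."""
    c, s = math.cos(alpha), math.sin(alpha)
    return [c * w[0] - s * w[1], s * w[0] + c * w[1]]

w = [1.0, 0.0]
for _ in range(1000):                # incremental drift over 1000 steps
    w = rotate(w, math.pi / 2000)    # tiny rotation of the concept per step
# After 1000 steps of pi/2000 each, the concept has turned by pi/2 in total.
```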
Sym. | Value | Description
     |       | pre-training ends
     | 5K    | start of drift
     | 6K    | end of drift (gradual, incr.)
     |       | end of drift (sudden)
     | 10K   | length of stream
We additionally look at two common benchmark datasets from the data-streams literature involving real-world data: the Electricity and CoverType datasets. Electricity contains 45,312 instances, with attributes describing an electricity market (the goal is to predict the demand). CoverType contains 581,012 instances of 54 attributes, as input to predict one of seven classes representing forest cover type. See, e.g., [10, 24, 4, 12] for details (the data is available at https://moa.cms.waikato.ac.nz/datasets/).
We employed the three main approaches discussed in this work (listed in tab:methods), in both a vanilla configuration (‘standard’ NN, SGD, HT) and an ‘advanced’ configuration of the same methodologies. For HT and NN we used state-of-the-art adaptations. For SGD, we simply employed basis expansion to accommodate nonlinear decision boundaries.
NN      | buffer size 100
SGD     | regularization; hinge loss
HT      | split confidence, 0.05 tie threshold, naive Bayes at leaves
SAM-NN  | Self-adjusting memory NN [19]
PBF-SGD | SGD with deg. polynomial basis expansion, e.g., [9]
RF-HT   | Adaptive Random Forest [12]: an ensemble of HTs, ADWIN drift detection
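PBF-SGD amounts to SGD over a polynomial feature map. With scikit-learn (on which such pipelines can be built), it can be approximated as follows; the dataset, degree, and learning rate here are illustrative and may differ from our exact configuration:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
poly = PolynomialFeatures(degree=3, include_bias=False)
model = SGDClassifier(loss="hinge", learning_rate="constant", eta0=0.01)

# Nonlinear toy concept: label = inside/outside the unit circle.
X = rng.uniform(-2, 2, size=(2000, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)

Z = poly.fit_transform(X)          # expand once; then stream row by row
correct = 0
for t in range(len(Z)):
    x_t = Z[t:t + 1]
    if t > 0:
        correct += int(model.predict(x_t)[0] == y[t])   # test ...
    model.partial_fit(x_t, [y[t]], classes=[0, 1])      # ... then train
acc = correct / (len(Z) - 1)
```

The expansion makes a circular boundary linearly separable for the hinge-loss SGD learner, while the `partial_fit` loop keeps it incremental in the prequential sense.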
First we look at performance on the synthetic data (fig:synth), then on real-world data (fig:difficult), and finally on an especially challenging scenario (fig:syncremental) with a sustained incremental drift. tab:accuracy summarizes the predictive performance and tab:time provides some running-time results.
7 Discussion
We present and discuss results from the experiments outlined in the previous section.
Hoeffding Trees are fast but conservative
We observe that HTs grow conservatively (f:synth.a). This behaviour is intended by design: since they have no natural forgetting mechanism, it is important to make a statistically safe split (based on the Hoeffding bound).
Indeed, note the point after which accuracy jumps 10 percentage points, indicating the initial split.
This conservative approach provides strong confidence that we may produce a tree equivalent to one built from a batch, but only within a single concept. It necessarily means that Hoeffding trees will struggle when the true target concept is a moving target rather than a fixed point in concept space.
Destructive adaptation is costly
In the literature, sophisticated concept-drift detectors and ensembles are used to counteract this disadvantage and provide robust prediction in dynamic data streams. We confirm that this approach is effective in many cases (as seen in fig:difficult), but at a significant cost: the computational time (see tab:time) of the detect-and-reset approach clearly reveals the overhead of constantly destroying (resetting) and regrowing HTs so as to adapt to a changing concept.
Buffer-based methods are limited
Subject to its buffer size, NN methods can respond to drift by forgetting old instances (from old concepts) and taking in (i.e., learning from) new ones. This mechanism allows them to recover quickly from drift (f:synth.b–f:synth.d). Nevertheless, predictive power is always limited in proportion to the number of instances stored in this buffer and, as observable in f:synth.a, sometimes this is insufficient (we note there is no upward trend here, as opposed to the other methods, despite more instances from the same concept). If the buffer size is widened, performance can be higher, but adaptation to drift will take correspondingly longer or require explicit drift-detection methods as with HT approaches.
SGD for Efficient Continuous Adaptation
SGD is a simple method which has been around a long time. With a non-decayed learning rate, we find that it behaves as well as we hypothesized on synthetic data: it continues to learn a static concept better over time; in f:synth.a it recovers quickly from sudden and gradual drift, and its performance is almost unaffected under incremental drift, where (as we see in f:synth.c) it is able to adapt continuously.
We suspect that SGD has not been widely considered in state-of-the-art data-stream evaluations because it performs poorly on real-world and complex data when deployed in an off-the-shelf manner, especially if the learning rate is decayed – as is often the standard. However, we put together the PBF-SGD method from elementary components and find that it performs strongly in these scenarios (fig:syncremental, fig:difficult).
We do see that the performance of the advanced/state-of-the-art methods (HT- and NN-based) is also competitive, as expected, yet it is crucial to emphasise the difference in computational performance (see tab:time): running time is up to an order of magnitude or more higher for the decision-tree ensemble compared to the other methods; even greater than for PBF-SGD, which has a feature space cubically larger than the original.
An analysis of time and memory complexity
The worst-case complexity is outlined in tab:complexity (for the vanilla methods; this does not take into account the additional overhead of ensembles and drift detectors, nor basis expansion for PBF-SGD). We can further remark that the computational time and memory expectations of SGD are constant across time (the expected running time is the same for each instance in the stream), as also with NN (given a fixed buffer size). On the other hand, HT costs are not constant, but continue to grow with the depth of the tree (the depth term in tab:complexity). This is an issue which has, to our knowledge, not been considered in depth in the literature: as trees in an ensemble grow and are reset under drift, time and space complexity fluctuates – making practical requirements difficult to estimate precisely in advance. If there is no drift, then the trees may, in theory, grow unbounded and use up all available memory.
            | SAM-NN | PBF-SGD | RF-HT
Electricity | 79.8   | 85.9    | 86.2
RTG         | 78.8   | 81.8    | 77.9
CoverType   | 93.3   | 92.6    | 93.9
Synthetic   | 96.0   | 95.1    | 93.6
Method | Time | Space
NN     |      |
HT     |      |
SGD    |      |
Limitations and Further Considerations
We must acknowledge that an incremental decision-tree-based approach continues to be a powerful competitor, and more efficient implementations exist than the Python framework we used in this work. Furthermore, it is clear that more work is needed to investigate performance under sudden, gradual, and mixed types of drift. There are other state-of-the-art HT methods, as well as NN methods, which could additionally be experimented with.
However, we have shown analytically and empirically that the most desirable and efficient option is supported naturally by gradient descent (and by extension, neural networks), given certain constraints with respect to the learning rate, namely that it remains significantly greater than zero. This indicates that more attention in the streaming literature should be paid to neural networks, in particular to ways of parameterizing them effectively for data-stream scenarios, so that they are easier to deploy.
In general, it is a more promising strategy to model the drift and preempt its development, rather than waiting for an indication that drift has already occurred and then, following such an indication, retrospectively resetting models that have been previously built.
8 Conclusions
The data-streams literature is well furnished with a great number and diversity of ensembles and drift-detection methodologies for employing incremental decision-tree learners such as Hoeffding trees. These methods continue to obtain success in empirical evaluations. In this paper we have taken a closer analytical look at the reasons for this performance, but we have also been able to highlight its limitations, namely the cost involved in its performance under sustained drift, where it is forced to carry out continued destructive adaptation.
In particular, we showed that concept-drifting data streams can be treated as time series, which in turn suggests predictability, thus encouraging an approach of tracking and estimating the evolution of a concept, and carrying out continuous adaptation to concept drift. To demonstrate this we derived an appropriate approach based on stochastic gradient descent. The method we used was simple, but the results clearly supported our analytical and theoretical argument, which carries important implications: gradient-based methods offer an effective and parsimonious way to deal with dynamic concept-drifting data streams, and should be considered more seriously in future research in this area. This is especially true with the advent of more powerful and easy-to-deploy neural networks and recent improvements in gradient descent.
In future work we will investigate more complex scenarios involving oscillating drift and mixtures of different types of drift, and experiment with more state-of-the-art gradient-descent approaches, such as deep learning. Our study already indicates that further investigation along these lines will yield promising results.
References
 [1] Ezilda Almeida, Carlos Ferreira, and João Gama. Adaptive model rules from data streams. In ECML-PKDD ’13: European Conference on Machine Learning and Knowledge Discovery in Databases, pages 480–492. Springer Berlin Heidelberg, 2013.
 [2] David Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2012.
 [3] Albert Bifet and Ricard Gavaldà. Learning from time-changing data with adaptive windowing. In SDM ’07: 2007 SIAM International Conference on Data Mining, 2007.
 [4] Albert Bifet and Ricard Gavaldà. Adaptive learning from evolving data streams. In IDA 2009: 8th International Symposium on Intelligent Data Analysis, pages 249–260. Springer Berlin Heidelberg, September 2009.
 [5] Albert Bifet, Geoff Holmes, Richard Kirkby, and Bernhard Pfahringer. Moa: Massive Online Analysis. Journal of Machine Learning Research (JMLR), 11:1601–1604, August 2010.
 [6] Albert Bifet, Geoffrey Holmes, and Bernhard Pfahringer. Leveraging bagging for evolving data streams. In ECML-PKDD ’10, pages 135–150. Springer-Verlag, 2010.
 [7] Hanen Borchani, Pedro Larrañaga, João Gama, and Concha Bielza. Mining multi-dimensional concept-drifting data streams using Bayesian network classifiers. Intell. Data Anal., 20(2):257–280, 2016.
 [8] Damien Brain and G. Webb. On the effect of data set size on bias and variance in classification learning. In Proceedings of the Fourth Australian Knowledge Acquisition Workshop, pages 117–128, 1999.
 [9] Ke-Lin Du and M. N. S. Swamy. Neural Networks and Statistical Learning. Springer Publishing Company, Incorporated, 2013.
 [10] João Gama, Pedro Medas, Gladys Castillo, and Pedro Rodrigues. Learning with drift detection. In SBIA 2004: 17th Brazilian Symposium on Artificial Intelligence. Proceedings, pages 286–295. Springer Berlin Heidelberg, 2004.
 [11] João Gama, Indre Zliobaite, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. A survey on concept drift adaptation. ACM Computing Surveys, 46(4):44:1–44:37, 2014.
 [12] Heitor M. Gomes, Albert Bifet, Jesse Read, Jean Paul Barddal, Fabrício Enembreck, Bernhard Pfahringer, Geoff Holmes, and Talel Abdessalem. Adaptive random forests for evolving data stream classification. Machine Learning Journal, 106(9-10):1469–1495, 2017.
 [13] Heitor Murilo Gomes, Jean Paul Barddal, Fabrício Enembreck, and Albert Bifet. A survey on ensemble learning for data stream classification. ACM Comput. Surv., 50(2):23:1–23:36, March 2017.
 [14] Vera Hofer and Georg Krempl. Drift mining in data: A framework for addressing drift in classification. Computational Statistics & Data Analysis, 57(1):377–391, 2013.
 [15] Geoff Hulten, Laurie Spencer, and Pedro Domingos. Mining time-changing data streams. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’01), pages 97–106, 2001.
 [16] Ioannis Katakis, Grigorios Tsoumakas, and Ioannis Vlahavas. Tracking recurring contexts using ensemble classifiers: an application to email filtering. Knowledge and Information Systems, 22(3):371–391, Mar 2010.
 [17] J. Zico Kolter and Marcus A. Maloof. Dynamic weighted majority: An ensemble method for drifting concepts. J. Mach. Learn. Res., 8:2755–2790, 2007.
 [18] Bartosz Krawczyk, Leandro L. Minku, João Gama, Jerzy Stefanowski, and Michał Woźniak. Ensemble learning for data stream analysis: A survey. Information Fusion, 37:132–156, 2017.
 [19] Viktor Losing, Barbara Hammer, and Heiko Wersing. KNN classifier with self adjusting memory for heterogeneous concept drift. In ICDM’16: IEEE 16th International Conference on Data Mining, pages 291–300, 2016.
 [20] Diego Marron, Jesse Read, and Albert Bifet. Data stream classification using random feature functions and novel method combinations. Journal of Systems and Software, 127(May):195–204, 2017.
 [21] Jacob Montiel, Jesse Read, and Albert Bifet. Scikit-MultiFlow: A multi-output streaming framework. CoRR, abs/1807.04662:1–5, 2018. ArXiv.
 [22] S. Muthukrishnan, Eric van den Berg, and Yihua Wu. Sequential change detection on data streams. In Workshops Proceedings of the 7th IEEE International Conference on Data Mining (ICDM 2007).
 [23] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Trans. on Knowl. and Data Eng., 22(10):1345–1359, October 2010.
 [24] Jesse Read, Albert Bifet, Bernhard Pfahringer, and Geoff Holmes. Batch-incremental versus instance-incremental learning in dynamic and evolving data. In IDA 2012: 11th International Symposium on Advances in Intelligent Data Analysis, pages 313–323, 2012.
 [25] A. Shaker and E. Hüllermeier. IBLStreams: a system for instance-based classification and regression on data streams. Evolving Systems, 3:235–249, 2012.
 [26] Mark Tennant, Frederic Stahl, Omer Rana, and João Bártolo Gomes. Scalable real-time classification of data streams with concept drift. Future Generation Computer Systems, 75:187–199, 2017.
 [27] Indrė Žliobaitė, Albert Bifet, Jesse Read, Bernhard Pfahringer, and Geoff Holmes. Evaluation methods and decision theory for classification of streaming data with temporal dependence. Machine Learning, 98(3):455–482, 2014.
 [28] Ge Xie, Yu Sun, Minlong Lin, and Ke Tang. A Selective Transfer Learning Method for Concept Drift Adaptation, pages 353–361. Springer International Publishing, Cham, 2017.