1 Introduction
The Maximum Likelihood Estimation (MLE) and Bayesian estimation families operate under the premise that the distribution of the data being estimated is stationary over time. Under such settings, convergence to the true value of the parameter being estimated takes place with probability 1 as the number of samples tends to infinity. However, in many real-life applications the assumption of stationarity does not hold, and the true underlying parameter being estimated changes over time. In this paper, we consider the problem of estimating the parameters of binomial and multinomial random variables which vary over time. The Stochastic Learning Weak Estimators (SLWE) are known to be the state-of-the-art approach for such an estimation problem
[23, 36]. The SLWE enjoys a multiplicative update form that makes it superior to state-of-the-art estimation approaches, which are mainly of an additive flavor. However, the right choice of the intrinsic parameter of the SLWE, $\lambda$, is still an open issue. This parameter controls the forgetting of old data and thus the ability of the scheme to adapt to changes in the environment. If the system changes rapidly, $\lambda$ should be chosen so that old, stale data is rapidly forgotten. On the other hand, if the environment is stabilizing, the rate of forgetting should decrease. The SLWE has found numerous successful applications in the literature, including adaptive classifiers for spam filtering
[36], adaptive file encoding with non-stationary distributions [26], intrusion detection in computer networks [30], tracking shifts of languages in online discussions [29], learning user preferences under concept shift [22, 33], fault-tolerant routing in ad-hoc networks [21], digital content forensics for detecting illicit images [8], detection and tracking of malicious nodes in both ad-hoc networks [24] and vehicular mobile WiMAX networks [18], and optimizing firewall matching time via dynamic rule ordering [19], to mention a few. In many such practical problems the dynamical system changes abruptly, followed by periods where the system is almost stationary. Unfortunately, the SLWE is not well suited for such cases. By choosing a low value of $\lambda$, the estimator will rapidly adjust after an abrupt change, but on the other hand this results in a higher estimation uncertainty when the system stabilizes. By choosing a high value of $\lambda$, the estimation uncertainty will be low in the stationary parts, but the estimator will adjust too slowly after an abrupt change.
In this paper, we suggest a computationally efficient estimation procedure for a dynamical system that contains both abrupt changes and stationary parts. The procedure combines an estimator that is suitable for the stationary parts with an event detection procedure. When an abrupt change is detected, the estimator rapidly jumps to a more suitable estimate. By far the most common event detection approach is to compare the properties of the data stream on a long-term time window with those on a short-term time window [6]. In such window-based approaches each sample in the window is given equal weight, but intuitively it is more reasonable to give more weight to the most recent data, which is exactly what the SLWE does. In this paper we therefore suggest building the event detection procedure by comparing the estimate of the stationary estimator with an SLWE estimate. Through a lightweight hypothesis testing mechanism, we decide in each iteration whether the stationary estimate should jump to a new value (event detected) or not. Quite surprisingly, we have found only one other paper in the literature using the advantages of the SLWE for event detection, namely the paper by Ross et al. (2012) [25]. Compared to [25], our suggested approach is simpler and better founded theoretically. We present the estimation procedure for the binomial and multinomial distributions, but it can be applied to other distributions as well. The article is organized as follows. In Section 2 we review related work. In Section 3 we present the SLWE estimator for a stream of Bernoulli variables, and in Section 4 we present the details of our approach. In Section 5 we extend the scheme to the multinomial case. Finally, in Section 6 we perform a thorough evaluation of the algorithms, and we draw conclusions in Section 7.
2 Related Work and State-of-the-Art
In this Section we review related work. First, in Section 2.1 we review legacy schemes for estimation in non-stationary environments. Then, in Section 2.2 we review different approaches for controlling the parameters of estimators operating in non-stationary environments.
2.1 Estimation in Non-Stationary Environments
Probably the most classical and widely used method for dealing with non-stationary estimation problems is the sliding window approach, which can be seen as a short-memory version of the MLE. According to the sliding window approach, only the last samples that fit in the window are used to compute the estimates. Nevertheless, the sliding window method suffers from a tuning problem: if the size of the window is chosen too large, the quality of the estimates deteriorates due to stale data values, while choosing a too small window size leads to poor estimates with low confidence.
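As a concrete illustration, a minimal sliding-window estimator for a Bernoulli stream can be sketched as follows (the function and variable names are ours, not from any particular library):

```python
from collections import deque

def sliding_window_estimates(stream, window_size):
    """Estimate the Bernoulli parameter p from the last `window_size` samples."""
    window = deque(maxlen=window_size)  # oldest samples fall out automatically
    estimates = []
    for x in stream:
        window.append(x)
        estimates.append(sum(window) / len(window))
    return estimates
```

With a stream 1, 1, 0, 0 and a window of two samples this yields the estimates 1.0, 1.0, 0.5, 0.0, illustrating how quickly old data is discarded, and thereby the tuning trade-off: a small window forgets quickly but estimates noisily.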
A myriad of works have addressed the detection of change points. These methods fall under two main families: Page's cumulative sum (CUSUM) detection procedure [2], and the Shiryaev-Roberts-Pollak detection procedure. In [28], Shiryayev resorted to a Bayesian formulation in which the change point is assumed to have a geometric prior distribution. CUSUM uses a maximum likelihood ratio test to discern change points. However, a downside of these two approaches is their computational complexity, which renders the SLWE, as well as the estimator in this paper, lightweight alternatives.
When it comes to extensions of the sliding window, Koychev et al. proposed a paradigm called Gradual Forgetting (GF) [11, 13, 14]. According to the principles of GF, observations in the same window are treated unequally when computing the estimates, based on a weight assignment where recent observations receive more weight than distant ones. Different forgetting functions have been proposed, ranging from linear [12] to exponential [10].
In [23], Oommen and Rueda presented the SLWE to estimate the underlying parameters of time-varying binomial/multinomial distributions. The SLWE originally stems from the theory of variable structure Learning Automata [20], and more particularly its reward-inaction flavor. The most appealing property of the SLWE, which makes it the state-of-the-art, is its multiplicative form of updates. Two different counterparts of the SLWE [23] for discretized spaces were recently proposed in [35] and [34]. In a similar manner to the SLWE, the latter solutions also suffer from the problem of tuning the resolution parameter.
2.2 Estimation using Adjustable Parameters
In this Section, we survey some of the most pertinent techniques for estimation in dynamic environments that are orthogonal to the SLWE. For a thorough overview we refer the reader to the surveys [6, 15], which provide a comprehensive taxonomy of estimation methods in non-stationary environments, namely adaptive windowing, aging factors, instance selection and instance weighting.
Gama et al. [6] present a clear distinction between memory management and forgetting mechanisms. Adaptive windowing [32] works under the premise of growing the size of the sliding window indefinitely until a change is detected via a change detection technique, at which point the size of the window is reset.
Another interesting family of approaches assumes that the true value of the parameter being estimated is revealed after some delay, which enables quantifying the error of the estimator. In such settings, some works [31] have used ensemble methods where the outputs of different estimators are combined using weighted majority voting. The weight of each estimator is adjusted based on its error, so that estimators producing high error see their weight decrease.
In the same perspective, the estimated error can be used for reinitializing the estimation, as performed in [25]. In all brevity, changes are detected by comparing sections of data, using statistical analysis to detect distributional changes, i.e., abrupt or gradual changes in the mean of the data points when compared with a baseline mean with a random noise component. One option is also to keep a reference window and compare recent windows with the reference window to detect changes [5]. This can, for example, be done by comparing the probability distributions of the reference window and the recent window using the Kullback-Leibler divergence [4, 27].
3 Stochastic Learning Weak Estimator
Let $x_1, x_2, \ldots$ represent a stream of independent and identically distributed Bernoulli stochastic variables with parameter $p$. That is,
(1)  $P(x_n = 1) = p, \quad P(x_n = 0) = 1 - p$
for $n = 1, 2, \ldots$.
We now want to estimate the parameter $p$ from the stream of Bernoulli variables. Using the weak estimator, the estimate of $p$ is updated by the following recursion
(2)  $\hat{p}_{n+1} = \lambda_{n+1}\, \hat{p}_n$ if $x_{n+1} = 0$, $\quad \hat{p}_{n+1} = 1 - \lambda_{n+1}(1 - \hat{p}_n)$ if $x_{n+1} = 1$,
where $\hat{p}_n$ represents the estimate of $p$ after the arrival of $x_n$, and the $\lambda_n$ are constants between zero and one. The intuition is that if $x_{n+1} = 0$ we should reduce our current estimate of $p$ (the probability of one), which is achieved by multiplying the current estimate of $p$ by $\lambda_{n+1}$. On the other hand, if $x_{n+1} = 1$ we should reduce the estimate of $1 - p$ (the probability of zero), which gives
$1 - \hat{p}_{n+1} = \lambda_{n+1}(1 - \hat{p}_n),$
which is equal to the last equation in (2).
The recursions in (2) can be written as follows
$\hat{p}_{n+1} = (1 - x_{n+1})\,\lambda_{n+1}\hat{p}_n + x_{n+1}\big(1 - \lambda_{n+1}(1 - \hat{p}_n)\big)$
with $x_{n+1} \in \{0, 1\}$. Using straightforward calculations this simplifies to
(3)  $\hat{p}_{n+1} = \lambda_{n+1}\,\hat{p}_n + (1 - \lambda_{n+1})\,x_{n+1}$
which can be recognized as an exponentially weighted moving average.
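One step of this recursion is straightforward to implement; the following sketch (in our own notation) applies the update over a whole stream:

```python
def slwe_update(p_hat, x, lam):
    """One step of the recursion (3): p_{n+1} = lam * p_n + (1 - lam) * x_{n+1}."""
    return lam * p_hat + (1.0 - lam) * x

def slwe_track(stream, lam, p0=0.5):
    """Run the weak estimator with constant lam over a stream of 0/1 samples."""
    p_hat, estimates = p0, []
    for x in stream:
        p_hat = slwe_update(p_hat, x, lam)
        estimates.append(p_hat)
    return estimates
```

Note the multiplicative nature of the update: each step shrinks the old estimate by the factor `lam` before mixing in the new observation.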
We can prove by induction that $\hat{p}_n$ is an unbiased estimator of $p$ for every $n$ as follows: assuming $E[\hat{p}_n] = p$, the recursion in (3) gives
$E[\hat{p}_{n+1}] = \lambda_{n+1} E[\hat{p}_n] + (1 - \lambda_{n+1}) E[x_{n+1}] = \lambda_{n+1}\, p + (1 - \lambda_{n+1})\, p = p.$
The variance depends on the choice of the $\lambda_n$'s. We look at two special cases.
Constant $\lambda$: It can be proved that if we set all $\lambda_n = \lambda$, the limiting variance is given by [36]
$\lim_{n \to \infty} \mathrm{Var}(\hat{p}_n) = \frac{1 - \lambda}{1 + \lambda}\; p\,(1 - p).$
An advantage of the constant-$\lambda$ approach is that if the value of $p$ in the underlying Bernoulli data stream is changing with time, the estimator will rapidly adjust to these changes [36]. A disadvantage is that if $p$ is not changing, the variance of the estimator has a lower limit and never reaches zero.
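The limiting variance can be checked numerically: ignoring the start-up term, the weak estimator with constant $\lambda$ is a weighted sum of the observations with weights $(1-\lambda)\lambda^k$, so its variance after $n$ steps is $p(1-p)(1-\lambda)^2 \sum_{k=0}^{n-1} \lambda^{2k}$, which converges to the closed form above. A small verification sketch (our own code):

```python
def ewma_variance(p, lam, n):
    """Variance of the constant-lambda weak estimator after n steps, ignoring
    the start-up term: p(1-p) * (1-lam)^2 * sum_{k=0}^{n-1} lam^(2k)."""
    s = sum(lam ** (2 * k) for k in range(n))
    return p * (1 - p) * (1 - lam) ** 2 * s

p, lam = 0.3, 0.9
limiting = (1 - lam) / (1 + lam) * p * (1 - p)  # closed-form limit
gap = abs(ewma_variance(p, lam, 500) - limiting)  # -> essentially zero
```

The geometric series $(1-\lambda)^2/(1-\lambda^2) = (1-\lambda)/(1+\lambda)$ is exactly the factor appearing in the limiting variance.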
Sample mean: The sample mean is the maximum likelihood estimator of $p$ and is the natural estimator to use if $p$ is not changing with time. Let $\bar{p}_n$ denote the sample mean of the first $n$ Bernoulli variables from the stream,
$\bar{p}_n = \frac{1}{n} \sum_{i=1}^{n} x_i.$
When $x_{n+1}$ arrives, the sample mean can be updated as follows
(4)  $\bar{p}_{n+1} = \frac{n}{n+1}\,\bar{p}_n + \frac{1}{n+1}\,x_{n+1}$
which is equivalent to (3) with $\lambda_{n+1} = n/(n+1)$. This means that the sample mean is a special case of the general recursion in (3). It is well known that $\mathrm{Var}(\bar{p}_n) = p(1-p)/n$. A disadvantage of the sample mean is that if $p$ is changing with time, it becomes very slow at adjusting to these changes. On the other hand, if $p$ is not changing, the sample mean is the optimal estimator in the sense that no other unbiased estimator can achieve less variance.
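That the sample mean is a special case of (3) is easy to verify in code; the sketch below (our own naming) runs the general recursion with $\lambda_n = (n-1)/n$ and recovers the arithmetic mean:

```python
def recursive_mean(stream):
    """Sample mean computed via the general recursion (3) with lam_n = (n-1)/n."""
    p_bar, n = 0.0, 0
    for x in stream:
        n += 1
        lam = (n - 1) / n
        p_bar = lam * p_bar + (1 - lam) * x  # same update form as the weak estimator
    return p_bar
```

For the stream 1, 0, 1, 1 this returns 0.75, the ordinary arithmetic mean, confirming the equivalence.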
4 Estimation in a shifting environment
Suppose a situation where $p$ switches between different values with time. An example could be a news stream where the topic suddenly changes due to different real-life events. Another example could be a machine operated by different employees at different time periods, each with their own error rate characterized by $p$. We assume that the instants at which $p$ switches value are unknown.
For such systems a natural strategy would be to use the sample mean whenever $p$ is not changing, combined with a mechanism to “jump” fast towards a new estimate if the value of $p$ has changed. In this paper we suggest a method that combines the sample mean and a weak estimator with constant $\lambda$. Ross et al. (2012) [25] is the only paper we have found in the literature that uses the same idea. Let $\hat{p}_n$ and $\bar{p}_n$ denote the weak estimator with constant $\lambda$ and the sample mean, respectively, after the arrival of $x_n$. If $p$ switches value, $\hat{p}_n$ will rapidly adjust to the new value of $p$, while only minor changes will appear in $\bar{p}_n$. This can be used to build an efficient method to detect changes in $p$ and “jump” to the new value of $p$. The key ingredient will be the distribution of the difference between the estimators, $d_n = \hat{p}_n - \bar{p}_n$. If $p$ switches value we expect $d_n$ to be large in absolute value, larger than what we would expect if $p$ remains constant. This can be used to build a statistical test of whether $p$ has changed value or not. We start by presenting the expectation and variance of this distribution.
Theorem 1.
Let $x_1, x_2, \ldots$ represent a stream of independent and identically distributed Bernoulli stochastic variables with parameter $p$. Further, let $\hat{p}_n$ and $\bar{p}_n$ denote the weak estimator with constant $\lambda$ and the sample mean, respectively, after the arrival of $x_n$. Then
(5)  $E[\hat{p}_n - \bar{p}_n] = 0$
(6)  $\mathrm{Var}(\hat{p}_n - \bar{p}_n) = p(1-p) \sum_{i=1}^{n} \Big( w_{n,i} - \frac{1}{n} \Big)^2$
where $w_{n,1} = \lambda^{n-1}$ and $w_{n,i} = (1-\lambda)\lambda^{n-i}$ for $i \geq 2$ are the weights the weak estimator assigns to the observations when started from the first observation.
Proof.
We start by computing $\hat{p}_n$ through the recursions in (3):
$\hat{p}_n = \sum_{i=1}^{n} (1 - \lambda_i) \Big( \prod_{j=i+1}^{n} \lambda_j \Big) x_i + \Big( \prod_{j=1}^{n} \lambda_j \Big) \hat{p}_0.$
Setting $\lambda_1 = 0$ (and still $\lambda_i = \lambda$ for $i \geq 2$) we get
$\hat{p}_n = \lambda^{n-1} x_1 + (1 - \lambda) \sum_{i=2}^{n} \lambda^{n-i} x_i = \sum_{i=1}^{n} w_{n,i}\, x_i,$
and setting $\lambda_i = (i-1)/i$ we get the sample mean $\bar{p}_n = \frac{1}{n}\sum_{i=1}^{n} x_i$.
Now we are ready to compute the expectation and variance. Since the weights of both estimators sum to one,
$E[\hat{p}_n - \bar{p}_n] = \sum_{i=1}^{n} \Big( w_{n,i} - \frac{1}{n} \Big) E[x_i] = p \sum_{i=1}^{n} \Big( w_{n,i} - \frac{1}{n} \Big) = 0,$
and, by the independence of the $x_i$,
$\mathrm{Var}(\hat{p}_n - \bar{p}_n) = \sum_{i=1}^{n} \Big( w_{n,i} - \frac{1}{n} \Big)^2 \mathrm{Var}(x_i) = p(1-p) \sum_{i=1}^{n} \Big( w_{n,i} - \frac{1}{n} \Big)^2.$
∎
Please note that $\mathrm{Var}(\hat{p}_n - \bar{p}_n)$ can be computed recursively, such that all variances up to time $n$ can be computed in $O(n)$ time. The actual recursions are not shown, but are straightforward to derive from (6). Another appealing property is that the variance does not depend on the stream of observations, and can therefore be computed before the data streaming starts. This lays the foundation for building very efficient algorithms.
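As a sketch of such a precomputation, assume the weak estimator is started from the first observation, so that it is a weighted sum with weights $w_{n,1} = \lambda^{n-1}$ and $w_{n,i} = (1-\lambda)\lambda^{n-i}$ for $i \geq 2$. Since the weights of both estimators sum to one, the variance factor $\mathrm{Var}(d_n)/(p(1-p))$ reduces to $q_n - 1/n$, where $q_n$ is the sum of squared weights and obeys the recursion $q_1 = 1$, $q_{n+1} = \lambda^2 q_n + (1-\lambda)^2$ (our derivation, under the stated start-up assumption):

```python
def diff_variance_factors(lam, n_max):
    """Precompute c_n = Var(d_n) / (p (1 - p)) for n = 1, ..., n_max in O(n_max),
    where d_n is the difference between the weak estimator (started from the
    first observation) and the sample mean.  c_n = q_n - 1/n, with q_n the sum
    of squared weights of the weak estimator."""
    factors, q = [], 1.0
    for n in range(1, n_max + 1):
        if n > 1:
            q = lam * lam * q + (1.0 - lam) ** 2  # squared-weight recursion
        factors.append(q - 1.0 / n)
    return factors
```

For $n = 1$ the factor is zero, since both estimators then equal $x_1$; for $n = 2$ it equals $2(\lambda - 1/2)^2$, matching the direct weight computation.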
Theorem 1 stated the expectation and variance of the distribution of $d_n = \hat{p}_n - \bar{p}_n$. Next we investigate other properties of this distribution. From the proof we saw that $d_n$ can be written as
$d_n = \sum_{i=1}^{n} \Big( w_{n,i} - \frac{1}{n} \Big) x_i,$
which is a weighted sum of the independent Bernoulli variables. If the sum satisfies the Lindeberg criterion (and thus the Lyapunov criterion), it will, according to the central limit theorem, converge to a normal distribution
[9]. Unfortunately the sum does not satisfy this criterion (proofs omitted). A second option is to study the distribution of $d_n$ by stochastic simulation. We performed the following experiment: we generated $n$ independent outcomes from the Bernoulli distribution and computed $d_n$, and repeated this procedure a large number of times. The upper left panel in Figure 1 shows the histogram of the resulting values of $d_n$ for the first combination of $p$ and $n$. The black curve is the normal distribution with expectation and variance as given by Theorem 1. The upper right panel shows the same for a larger value of $n$, and the second and third rows show the same for other values of $p$. Overall we see that the distribution of $d_n$ is almost identical to a normal distribution. We only observe that when $p$ is small (or large) the distribution is slightly asymmetric compared to a normal distribution. Based on these observations it is a reliable choice to build a test assuming that $d_n$ is normally distributed. We then get the following test.
Theorem 2.
Let $x_1, x_2, \ldots$ represent a stream of independent and identically distributed Bernoulli stochastic variables with parameter $p$. Further, let $z_{1-\alpha/2}$ denote the $1 - \alpha/2$ quantile of the standard normal distribution. Define the hypotheses

$H_0$: The underlying $p$ has not changed value

$H_1$: The underlying $p$ has changed value

Suppose that we decide to reject $H_0$ if
(7)  $\dfrac{|\hat{p}_n - \bar{p}_n|}{\sqrt{\mathrm{Var}(\hat{p}_n - \bar{p}_n)}} > z_{1-\alpha/2}.$
Then the probability of rejecting $H_0$ if $H_0$ is true is approximately $\alpha$, and the rejection rule (7) controls the type I error.
Proof.
Let $N(0, \sigma_n^2)$ denote a normal distribution with expectation zero and variance $\sigma_n^2 = \mathrm{Var}(\hat{p}_n - \bar{p}_n)$. From the discussion above and Figure 1 we know that $d_n = \hat{p}_n - \bar{p}_n$ is approximately $N(0, \sigma_n^2)$ distributed under $H_0$, which means that
$P\Big( \frac{|\hat{p}_n - \bar{p}_n|}{\sigma_n} > z_{1-\alpha/2} \,\Big|\, H_0 \Big) \approx \alpha.$
∎
From Theorem 1 we see that $\mathrm{Var}(\hat{p}_n - \bar{p}_n)$ depends on $p$, which of course is unknown. To perform the test above, a natural choice is to substitute $p$ with the sample mean $\bar{p}_n$, since this is our best estimate of $p$ under the hypothesis that $p$ is constant.
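Putting Theorem 2 and this substitution together, one iteration of the change test can be sketched as follows (our own function; `var_factor` is the precomputed quantity $\mathrm{Var}(d_n)/(p(1-p))$):

```python
import math

def change_detected(p_weak, p_mean, var_factor, z):
    """Two-sided test of Theorem 2: reject H0 (no change) when |d_n| / sd(d_n)
    exceeds the standard normal quantile z = z_{1 - alpha/2}.  The unknown p
    in the variance is replaced by the sample mean p_mean."""
    var = p_mean * (1.0 - p_mean) * var_factor
    if var <= 0.0:
        return False  # degenerate estimate; no evidence either way
    return abs(p_weak - p_mean) / math.sqrt(var) > z
```

For example, with a variance factor of 0.04 and `z = 3`, a gap of 0.4 between the two estimators around $\bar{p}_n = 0.5$ triggers a detection, while a gap of 0.05 does not.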
The basic idea of our method is to estimate $p$ using the sample mean, but to perform occasional jumps whenever the test in Theorem 2 brings evidence that $p$ has switched value. In the next Section we discuss different alternatives for performing the jumps.
4.1 Performing a jump
Let $\tilde{p}_n$ denote the estimate using the sample-mean-with-jumps method after the arrival of $x_n$. Further, let $\tilde{n}$ denote the value used for $n$ in the recursions in (3) to compute $\tilde{p}_n$. Assume now that the test in Theorem 2 brings evidence that $p$ has switched value, which means that the current estimate $\tilde{p}_n$ is not reliable (since it is based on the sample mean). Therefore we need to adjust the estimate (jump). Two options seem natural to perform the jump.

Forget the whole estimation history and set $\tilde{p}_n = x_n$.

Assume that the current estimate based on the weak estimator with constant $\lambda$, $\hat{p}_n$, is reliable, since it adjusts fast, and set $\tilde{p}_n = \hat{p}_n$.
A third option could be to set $\tilde{p}_n$ equal to some weighting of these two alternatives.
To continue the update of the estimator after the jump, we also need to decide a new value for $\tilde{n}$. There are at least two natural alternatives.

Recall that by setting $\lambda_{n+1} = n/(n+1)$ in (3), we get the sample mean. If we decide to follow the first option above and set $\tilde{p}_n = x_n$, then $\tilde{p}_n$ is just the sample mean of one observation, which means that it is natural to set $\tilde{n} = 1$.

If we decide to follow the second option above and set $\tilde{p}_n = \hat{p}_n$, it seems natural to do the next update of $\tilde{p}_n$ similarly to an update of $\hat{p}_n$, which means to relate $\tilde{n}$ to $\lambda$. Since we will continue to update $\tilde{p}_n$ according to the sample mean, we must relate such a choice to the number of terms in a sample mean. We do this as follows. Define $n^*$ as the solution of the equation
$\frac{p(1-p)}{n^*} = \frac{1-\lambda}{1+\lambda}\, p(1-p),$
i.e., the number of terms for which the variance of a sample mean equals the limiting variance of the weak estimator. Solving with respect to $n^*$ and rounding off to the nearest integer we get
$n^* = \Big[ \frac{1+\lambda}{1-\lambda} \Big],$
where $[x]$ denotes the value of $x$ rounded off to the nearest integer. The interpretation of $n^*$ is the number of terms in a sample mean for which an update of the estimate will be similar to an update of the weak estimator $\hat{p}_n$.
Note that the choice of $\tilde{n}$ in the first alternative above is equivalent to setting $n^* = 1$. It may be that when the test in Theorem 2 detects a change in $p$, the value of $\hat{p}_n$ has not yet converged completely around the new value of $p$. Therefore a value of $\tilde{n}$ somewhere between $1$ and $n^*$ may be an even better alternative. By relating the variance of $\tilde{p}_n$ to a sample mean with $\tilde{n}$ terms, the variance $\mathrm{Var}(\hat{p}_n - \tilde{p}_n)$, which we need in the test in Theorem 2, can be computed recursively. In addition, all the variances can be computed recursively in advance, before the data stream starts.
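Equating the variance of a sample mean of $n^*$ terms with the limiting variance of the weak estimator gives the simple closed form used above; for example (our helper):

```python
def effective_sample_size(lam):
    """n* such that a sample mean of n* terms has approximately the same
    variance as the weak estimator with constant lam:
    p(1-p)/n* = (1-lam)/(1+lam) * p(1-p)  =>  n* = (1+lam)/(1-lam)."""
    return round((1.0 + lam) / (1.0 - lam))
```

With $\lambda = 0.9$ this gives $n^* = 19$, i.e. after a jump the estimator behaves as a sample mean based on 19 observations; with $\lambda = 0.5$, only 3.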
Before the algorithm can be run, we also need to decide a value for $\alpha$ in the test proposed in Theorem 2. When we run the test once, the probability of wrongly detecting a change in $p$ is approximately $\alpha$. In practice we may run the test many times, for example every tenth iteration. If we run the test many times, the chance of wrongly detecting a change in $p$ in some of these tests will naturally be larger than $\alpha$. This refers to the multiple testing problem in the statistical literature, see e.g. [3]. A simple and much used approach is the Bonferroni correction, where a significance level of $\alpha/m$ is used instead of $\alpha$, where $m$ is the number of tests. There are two challenges with applying this approach (and other standard corrections). First, we do not know the number of tests we need to run. Second, the Bonferroni correction assumes that all the tests are independent. In our case this is far from true, since two subsequent tests are based on almost the same data stream (only a few extra observations have been added since the last test) and the outcomes are highly correlated. Using the Bonferroni correction will result in a too low significance level, and the tests may never detect that $p$ has changed. In practice, setting $\alpha$ well below the standard significance level of 0.05, but well above Bonferroni-corrected levels, overall performs well.
The algorithm using the second option above is shown in Algorithm 1.
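Since Algorithm 1 itself is not reproduced here, the following is one possible reading of it as Python code, under the assumptions made above (weak estimator started from the first observation, recursive squared-weight factor, jump to the weak estimate with $\tilde{n} = n^*$). The parameter defaults are illustrative only, and the reset of the variance factor after a jump is a heuristic of ours, not necessarily the paper's exact bookkeeping:

```python
import math

def jump_estimator(stream, lam=0.9, z=3.0):
    """Sketch of the binomial jump estimator: a sample mean and a weak
    estimator run in parallel; when the test of Theorem 2 fires, the mean
    jumps to the weak estimate and behaves as if based on n* terms."""
    n_star = round((1.0 + lam) / (1.0 - lam))  # effective sample size of the weak estimator
    p_mean = p_weak = None
    n = 0        # (effective) number of terms behind the running mean
    q = 1.0      # sum of squared weights of the weak estimator
    estimates = []
    for x in stream:
        n += 1
        if p_mean is None:
            p_mean = p_weak = float(x)  # both estimators start at x_1
        else:
            p_mean += (x - p_mean) / n                 # sample mean update
            p_weak = lam * p_weak + (1.0 - lam) * x    # weak estimator update
            q = lam * lam * q + (1.0 - lam) ** 2       # squared-weight recursion
        var = p_mean * (1.0 - p_mean) * max(q - 1.0 / n, 0.0)
        if var > 0.0 and abs(p_weak - p_mean) / math.sqrt(var) > z:
            p_mean = p_weak   # jump to the fast estimate
            n = n_star        # pretend the mean is based on n* terms
            q = 1.0 / n_star  # heuristic reset so the two estimators re-align
        estimates.append(p_mean)
    return estimates
```

On a stream of 50 ones followed by 50 zeros, the estimate stays near 1 during the first half and then jumps down shortly after the change, instead of decaying slowly like a plain sample mean.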
5 Extension to the multinomial case
We now show how the jump algorithm above can be extended to the multinomial case. As described above, a Bernoulli variable takes the values 0 or 1 with probabilities $p$ and $1-p$, respectively. For the multinomial case this is extended such that $x$ takes one of the values $1, \ldots, K$ with probabilities $p_1, \ldots, p_K$, such that $\sum_{k=1}^{K} p_k = 1$. For ease of presentation below, define a stochastic vector $s(x) = (s_1(x), \ldots, s_K(x))$, which is a map from $\{1, \ldots, K\}$ to $\{0, 1\}^K$ as follows
(8)  $s_k(x) = I(x = k), \quad k = 1, \ldots, K$
where $I(\cdot)$ denotes the indicator function returning one if the argument is true and zero if it is false. We see that $s(x)$ is a vector with value one in position $x$ and zero in all the other positions.
Let $x_1, x_2, \ldots$ denote a stream of independent stochastic variables identical to $x$. We now want to maintain running estimates of the probabilities $p_1, \ldots, p_K$. The SLWE in (3) can easily be extended to the multinomial case as follows
(9)  $\hat{p}_{n+1} = \lambda_{n+1}\, \hat{p}_n + (1 - \lambda_{n+1})\, s(x_{n+1})$
where $\hat{p}_n = (\hat{p}_{n,1}, \ldots, \hat{p}_{n,K})$ denotes the estimate of $(p_1, \ldots, p_K)$ after receiving the variable $x_n$ from the data stream.
Now let $\hat{p}_n$ denote estimates based on (9) using a constant value of $\lambda$, and let $\bar{p}_n$ denote the sample mean, i.e. using $\lambda_{n+1} = n/(n+1)$. Following the same argumentation as in Section 4 we have, for each class $k$,
(10)  $\dfrac{\hat{p}_{n,k} - \bar{p}_{n,k}}{\sqrt{\mathrm{Var}(\hat{p}_{n,k} - \bar{p}_{n,k})}} \;\approx\; N(0, 1), \quad \mathrm{Var}(\hat{p}_{n,k} - \bar{p}_{n,k}) = p_k (1 - p_k) \sum_{i=1}^{n} \Big( w_{n,i} - \frac{1}{n} \Big)^2.$
As an extension of Section 4, we now want to construct a statistical test to check whether the unknown probability vector has changed value. A common statistical test on the probability vector of the multinomial distribution is Pearson's $\chi^2$ test [1]. Adapting the test to the application in this paper, we get the following theorem.
Theorem 3.
Let $x_1, x_2, \ldots$ represent a stream of independent and identically distributed multinomial stochastic variables with probability vector $(p_1, \ldots, p_K)$. Further, let $\chi^2_{1-\alpha, K-1}$ denote the $1-\alpha$ quantile of the $\chi^2$ distribution with $K-1$ degrees of freedom. Define the hypotheses

$H_0$: The underlying probability vector has not changed value

$H_1$: The underlying probability vector has changed value

Suppose that we decide to reject $H_0$ if
(11)  $\sum_{k=1}^{K} \dfrac{(\hat{p}_{n,k} - \bar{p}_{n,k})^2}{\mathrm{Var}(\hat{p}_{n,k} - \bar{p}_{n,k})} > \chi^2_{1-\alpha, K-1}.$
Then the probability of rejecting $H_0$ if $H_0$ is true is approximately $\alpha$, and the rejection rule (11) controls the type I error.
Proof.
It is well known that the sum of $d$ independent squared standard normally distributed stochastic variables is $\chi^2_d$ distributed, $\chi^2_d$ denoting a $\chi^2$ distribution with $d$ degrees of freedom. From (10) we see that the sum in (11) is a sum of approximately squared standard normally distributed stochastic variables. Knowing $K-1$ terms in the sum, the last term can be computed, since the probability estimates sum to one; the sum therefore has $K-1$ degrees of freedom. The sum in (11) thus is approximately distributed as
(12)  $\sum_{k=1}^{K} \dfrac{(\hat{p}_{n,k} - \bar{p}_{n,k})^2}{\mathrm{Var}(\hat{p}_{n,k} - \bar{p}_{n,k})} \;\approx\; \chi^2_{K-1}.$
Algorithm 2 shows the resulting jump algorithm for the multinomial case.
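The test statistic in (11) can be sketched in code as follows (our own helper; the caller compares the returned value against the $\chi^2_{1-\alpha, K-1}$ quantile, e.g. from statistical tables, and the unknown $p_k$ in the variances are replaced by the sample means):

```python
def multinomial_change_statistic(p_weak, p_mean, var_factor):
    """Test statistic of Theorem 3: the sum over the K classes of the squared,
    standardised differences between the weak estimator and the sample mean.
    var_factor is the shared factor Var(d_n) / (p_k (1 - p_k))."""
    stat = 0.0
    for pw, pm in zip(p_weak, p_mean):
        var = pm * (1.0 - pm) * var_factor
        if var > 0.0:
            stat += (pw - pm) ** 2 / var
    return stat
```

When the two estimate vectors coincide the statistic is zero; large values indicate a change in the underlying probability vector.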
6 Experiments
In this Section we evaluate the methodology above on both synthetic and real-life data. In all the experiments reported below, we set $\tilde{n} = n^*$ after a jump, i.e. the second alternative discussed in Section 4.1, as given in Algorithms 1 and 2. We have not found any paper in the literature dealing with tracking the probabilities of binomial and multinomial distributions in abruptly changing environments. The most related papers, in our opinion, are [25] and [7], but they consider a slightly different problem, namely the problem of concept drift. Those methods may be modified to accommodate online estimation as for the devised algorithms in this paper, but we have not looked into that. Therefore, in our experiments we compare the performance of the suggested algorithms with the SLWE estimator from [23], i.e. (2) with constant $\lambda$, which we denote $\hat{p}_n$.
6.1 Synthetic data example
We evaluate the binomial case (Algorithm 1) and the multinomial case (Algorithm 2). Figure 2 shows a comparison of the different estimators for the binomial case when the changes in $p$ are large.
The gray, green and blue curves show the jump estimator ($\tilde{p}_n$), the SLWE with constant $\lambda$ ($\hat{p}_n$) and the sample mean, respectively. The black curve shows the true value of $p$ in each iteration. We see that the test in Theorem 2 detects the changes in $p$ very efficiently, such that on average $\tilde{p}_n$ (gray) performs better than $\hat{p}_n$. We also see, as expected, that the sample mean is not very useful in a dynamic environment.
Figure 3 shows results from a similar experiment, but where the changes in $p$ are smaller.
We see that the changes in $p$ are also here efficiently detected and that $\tilde{p}_n$ performs better than $\hat{p}_n$. In both experiments above the tuning parameters were kept fixed at the same values.
In Figure 4 we also evaluate the jump estimator in an environment where $p$ is changing smoothly. More specifically, the true $p$ changes following a cosine function. For such an environment, $\tilde{p}_n$ and $\hat{p}_n$ seem to perform almost equally well. Note that even though the jump estimator is not constructed for such environments, it still performs well. In this experiment the tuning parameters were again kept at the same values as above.
For the jump algorithms described by Algorithms 1 and 2 there are three tuning parameters, namely $\lambda$, the significance level $\alpha$ (equivalently, the test quantile) and $\tilde{n}$. We now want to choose values for these parameters such that the estimation error is as small as possible. We measure the estimation error as the absolute difference between the true $p$ and the estimate, averaged over all iterations.
We start by investigating reasonable values of the test quantile. For the binomial and multinomial algorithms we computed the estimation error for different choices of $z_{1-\alpha/2}$ and $\chi^2_{1-\alpha, K-1}$, respectively. To reduce the Monte Carlo error we ran the Bernoulli and multinomial data streams for a large number of iterations. We assumed that the system shifted state at regular intervals, similar to the examples in Figures 2 and 3.
For the binomial case we considered three different cases.

Large changes: $p$ changed with time as shown in Figure 2.

Small changes: $p$ changed with time as shown in Figure 3.

Dynamic: $p$ changed with time as shown in Figure 4.
For the multinomial case we considered two different cases.

Every fixed number of iterations, we changed the probability vector as follows:

Draw a random class $k$ uniformly from $\{1, \ldots, K\}$

Set $p_k$ to a value close to one

Set $p_j$ to a small common value for every $j \neq k$

Below we refer to this alternative as 'spike probability'.


Every fixed number of iterations, we updated the probability vector as an outcome from the Dirichlet distribution with all parameter values equal to one. This is referred to as the flat Dirichlet distribution, and the resulting probability distribution is uniform over the simplex of possible probability vectors, i.e. the vectors satisfying $p_k \geq 0$ and $\sum_{k=1}^{K} p_k = 1$. Below we refer to this alternative as 'flat Dirichlet'.
The results are shown in Figure 5.
We start by discussing the binomial case. For the blue curve in the left panel of Figure 5 we see that an optimal value of the quantile $z_{1-\alpha/2}$ is about 3. By choosing smaller values, the test too often wrongly detects changes. By choosing a too high value, the test detects changes in $p$ too late or never. With values above 5, the test never detects the changes in $p$, and we reach a limit in the estimation error which is equal to the estimation error using the sample mean. When the changes in $p$ are large (black curve), we can allow higher values, since we are still able to detect the large changes in $p$; an optimal value is around 4. Choosing an even higher value slightly reduces the performance, because the method uses a few more iterations before detecting that $p$ has changed value. For the dynamic system, the estimation error is higher, as expected, since the method in this paper is not directly constructed for such environments. We see that the optimal value is around 2.2, which seems reasonable: since $p$ is continuously changing value, the threshold should not be chosen too high, in order to keep track of these changes.
For the multinomial case (right panel), we see that for the flat Dirichlet alternative there is a clear optimal value of the $\chi^2$ threshold, while for the spike probability alternative any threshold value between 25 and 100 performs well. We see that lower significance levels perform better in the multinomial cases than in the binomial cases. The reason is that it is easier to detect a change in the probability vector in the multinomial case than a change in $p$ in the binomial case.
We now turn our attention to evaluating optimal values of $\tilde{n}$. The results are shown in Figure 6. Also in this experiment the remaining tuning parameters were kept fixed as above.
Overall we see that the estimation error does not depend strongly on the choice of $\tilde{n}$. Please note that the increase in estimation error for the spike probability alternative is an actual effect and not Monte Carlo error.
Finally, we investigate how the estimation error depends on the choice of $\lambda$. We evaluate both the jump estimator $\tilde{p}_n$ and the estimator using a constant $\lambda$, i.e. the original SLWE $\hat{p}_n$. The results are shown in Figure 7.
We start by discussing the binomial case (left panel). Comparing the solid and dashed black curves, we see that $\tilde{p}_n$ outperforms $\hat{p}_n$ by a large margin. We also observe that for $\hat{p}_n$ an optimal value of $\lambda$ is about 0.96, while the optimal value of $\lambda$ for the jump estimator is about 0.9. Recall that the latter is the $\lambda$ we should choose for the weak estimator with constant $\lambda$ that runs in parallel with the sample mean. This difference may come as a surprise, but remember that the purpose of the weak estimator with constant $\lambda$ is different in the two cases. For $\hat{p}_n$ (the original SLWE) we choose $\lambda$ to minimize the estimation error. For the jump estimator, we choose $\lambda$ to detect changes in $p$ as fast as possible, in order to rapidly perform a jump. When the changes in $p$ are small, we also observe that $\tilde{p}_n$ outperforms $\hat{p}_n$ for all choices of $\lambda$ (blue curves). When $p$ is changing dynamically (red curves), the picture is, as expected, less clear, and which estimator performs best depends on the choice of $\lambda$.
For the multinomial case (right panel), we see that the jump algorithm (Algorithm 2) outperforms the multinomial SLWE by a large margin for both the spike probability and the flat Dirichlet alternatives.
From both panels in Figure 7 we see that, in all cases, the performance of the jump estimator is less sensitive to the choice of $\lambda$ than the SLWE with constant $\lambda$. In other words, the jump estimator $\tilde{p}_n$ performs well for a large range of choices of $\lambda$, while the SLWE with constant $\lambda$, $\hat{p}_n$, performs well only within a small interval of values of $\lambda$. This is a very attractive property of the jump estimator, since in practical situations we do not know the optimal value of $\lambda$.
6.2 Real-life data example
In this section, we investigate the problem of tracking topics or sentiment in online streams of text. Examples of such text streams are online discussion threads and news/social media feeds like Twitter. A popular approach is to use keyword lists, such as sentiment lexicons. A keyword list is a set of words for each topic or sentiment type (for example: happy, sad, angry, etc.). Such an approach is usually more robust to domain changes than machine learning approaches [16], which makes the keyword approach ideal for online tracking of topics or sentiment in discussion threads and news/social media feeds.
In the experiment in this section we consider the problem of online tracking of the current topic in a news feed. We assumed four topics, namely news about the European Union (EU), news about economy, sports and entertainment. We collected a large set of news articles about the four topics from the popular Norwegian online newspaper site vg.no. We assumed that the instants when the text changed between the different topics were unknown to our algorithm. The task was to track the probabilities that the current topic is EU, economy, sports or entertainment.
We now apply the algorithms described in this paper to the topic tracking problem. We started by generating a keyword list for each of the four topics. The keyword list for a given topic was generated by choosing words that had a high pointwise mutual information with the given topic [17]. We assumed that we received one word at a time from the news feed, and every time we received a new word, we updated our probability estimates that the current topic was EU, economy, sports or entertainment. If the current word received from the news feed was part of the EU keyword list, we can think of this as an outcome '1' from a multinomial distribution; if the word was part of the economy keyword list, we can think of this as an outcome '2', and so on. Using the weak estimator in equation (9), we can then update our estimate of the probability vector, namely the probabilities that the current topic is EU, economy, sports or entertainment. Similarly, we can update the estimate of the probability vector using the jump algorithm in Algorithm 2. All words that were not part of any of the keyword lists were removed from the text corpus.
A natural offline way to estimate the probability that the current topic was EU (economy, sports, entertainment) based on the keyword lists is to compute the proportion of all the keywords in an article that were EU (economy, sports, entertainment) keywords. We denote this the offline approach, and it can be seen as giving the optimal estimates of the probabilities of the different topics based on the keyword lists. In an online setting it is not possible to compute the offline estimates, but ideally we want the online estimators in (9) and in Algorithm 2 to be as close as possible to the optimal offline estimates. We therefore compare the performance of the online estimators in this paper by measuring how close they are to the optimal offline approach. When performing the experiments we ran a two-fold cross-validation, where we used half of the articles to compute the keyword lists and the other half to track the probabilities that the current topic was EU, economy, sports or entertainment. Next, we switched and trained and tested in the opposite direction.
Figure 8 shows the tracking of the probabilities for the different topics.
We see that the jump estimator in Algorithm 2 adapts faster when the text stream changes topic, and also tracks the offline estimates more closely in the stationary parts than the SLWE with constant λ. The jump estimator performed well for a large range of values of its parameters, while the SLWE estimator performed best only for one specific choice of λ. The mean absolute estimation errors compared to the offline estimator were 0.0235 and 0.0505 for the jump and the SLWE estimators, respectively, which means that the jump estimator clearly outperforms the SLWE estimator for this application.
7 Closing remarks
In this paper we have constructed an estimation procedure that combines the strengths of a weak estimator with constant λ and an estimator with decreasing step size (the sample mean). We have developed a hypothesis test procedure to rapidly detect a change in the underlying distribution. Further, we have proposed an efficient procedure to jump to a new estimate when a change is detected. The experiments show that the procedure efficiently detects changes in the underlying distribution and outperforms the original SLWE with constant λ by a large margin.
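The overall detect-and-jump loop can be sketched as follows for the binomial case. The sliding-window z-test, its threshold, and all parameter values below are simplifications of our own and not the exact test or jump rule of Algorithm 2:

```python
import math

def jump_estimator(observations, lam=0.95, window=30, z_crit=2.5):
    """Illustrative binomial tracker in the spirit of the detect-and-jump
    procedure: a weak estimate with constant lam runs continuously, and a
    z-test on a sliding window of recent outcomes triggers a 'jump' to
    the window's sample mean when the window disagrees with the
    current estimate."""
    p, recent, estimates = 0.5, [], []
    for x in observations:                        # x is 0 or 1
        p = lam * p + (1.0 - lam) * x             # weak-estimator update
        recent.append(x)
        if len(recent) > window:
            recent.pop(0)
        if len(recent) == window:
            mean = sum(recent) / window
            se = math.sqrt(max(p * (1.0 - p), 1e-9) / window)
            if abs(mean - p) / se > z_crit:       # change detected
                p = mean                          # jump to the window mean
                recent.clear()                    # restart the test window
        estimates.append(p)
    return estimates
```

On a stream that switches abruptly from outcome 0 to outcome 1, the estimate stays close to the truth in the stationary part and then jumps once the window evidence contradicts it, rather than crawling towards the new value at the rate dictated by lam.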
The experiments also showed that the performance of the jump estimator is less sensitive to the choice of λ compared to the SLWE with constant λ. Said in another way, the jump estimator performs well for a large range of different choices of λ, while the SLWE with constant λ performed well only for a small range of choices. This is a very attractive property of the jump estimator, since in practical situations we do not know the optimal value of λ in advance.
References
 [1] Alan Agresti and Maria Kateri. Categorical data analysis. Springer, 2011.
 [2] Michèle Basseville and Igor V. Nikiforov. Detection of abrupt changes: theory and application. Prentice-Hall, Inc., 1993.
 [3] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), pages 289–300, 1995.
 [4] Tamraparni Dasu, Shankar Krishnan, Suresh Venkatasubramanian, and Ke Yi. An information-theoretic approach to detecting changes in multidimensional data streams. In Proc. Symp. on the Interface of Statistics, Computing Science, and Applications. Citeseer, 2006.
 [5] Anton Dries and Ulrich Rückert. Adaptive concept drift detection. Stat. Anal. Data Min., 2(5):311–327, December 2009.
 [6] João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. A survey on concept drift adaptation. ACM Comput. Surv., 46(4):44:1–44:37, March 2014.
 [7] João Gama, Pedro Medas, Gladys Castillo, and Pedro Rodrigues. Learning with drift detection. In Brazilian Symposium on Artificial Intelligence, pages 286–295. Springer, 2004.
 [8] Amin Ibrahim and Miguel Vargas Martin. Detecting and preventing the electronic transmission of illicit images and its network performance. In International Conference on Digital Forensics and Cyber Crime, pages 139–150. Springer, 2009.
 [9] A. F. Karr. Probability. Springer, New York, 2012.
 [10] Ralf Klinkenberg. Learning drifting concepts: Example selection vs. example weighting. Intell. Data Anal., 8(3):281–300, August 2004.
 [11] Ivan Koychev. Gradual forgetting for adaptation to concept drift. In Proceedings of ECAI 2000 Workshop Current Issues in Spatio-Temporal Reasoning, pages 101–106, 2000.
 [12] Ivan Koychev. Gradual forgetting for adaptation to concept drift. In Proceedings of ECAI 2000 Workshop on Current Issues in Spatio-Temporal Reasoning, 2000.
 [13] Ivan Koychev and Robert Lothian. Tracking drifting concepts by time window optimisation. In Max Bramer, Frans Coenen, and Tony Allen, editors, Research and Development in Intelligent Systems XXII, pages 46–59. Springer London, 2006.
 [14] Ivan Koychev and Ingo Schwab. Adaptation to drifting user’s interests. In Proceedings of ECML 2000 Workshop: Machine Learning in New Information Age, pages 39–46, 2000.
 [15] Pallavi Kulkarni and Roshani Ade. Incremental learning from unbalanced data with concept class, concept drift and missing features: A review. International Journal of Data Mining and Knowledge Management Process (IJDKP), 4(6):15–29, November 2014.
 [16] Bing Liu. Sentiment analysis and opinion mining. Synthesis lectures on human language technologies, 5(1):1–167, 2012.
 [17] Christopher D. Manning and Hinrich Schütze. Foundations of statistical natural language processing, volume 999. MIT Press, 1999.
 [18] Sudip Misra, Nayan Ranjan Kapri, and Bernd E. Wolfinger. Selfishness-aware target tracking in vehicular mobile WiMAX networks. Telecommunication Systems, 58(4):313–328, 2015.
 [19] Ratish Mohan, Anis Yazidi, Boning Feng, and B. John Oommen. Dynamic ordering of firewall rules using a novel swapping window-based paradigm. In Proceedings of the 6th International Conference on Communication and Network Security, pages 11–20. ACM, 2016.
 [20] Kumpati S Narendra and Mandayam AL Thathachar. Learning automata: an introduction. Courier Corporation, 2012.
 [21] B. J. Oommen and S. Misra. Faulttolerant routing in adversarial mobile ad hoc networks: an efficient route estimation scheme for nonstationary environments. Telecommunication Systems, 44:159–169, 2010.
 [22] B. J. Oommen, A. Yazidi, and O.-C. Granmo. An adaptive approach to learning the preferences of users in a social network using weak estimators. Journal of Information Processing Systems, 8(2), 2012.
 [23] B. John Oommen and Luis Rueda. Stochastic learning-based weak estimation of multinomial random variables and its applications to pattern recognition in nonstationary environments. Pattern Recogn., 39(3):328–341, 2006.
 [24] Nasser-Eddine Rikli and Aljawharah Alnasser. Lightweight trust model for the detection of concealed malicious nodes in sparse wireless ad hoc networks. International Journal of Distributed Sensor Networks, 12(7):1550147716657246, 2016.
 [25] Gordon J. Ross, Niall M. Adams, Dimitris K. Tasoulis, and David J. Hand. Exponentially weighted moving average charts for detecting concept drift. Pattern Recognition Letters, 33(2):191–198, 2012.
 [26] L. Rueda and B. John Oommen. Stochastic automata-based estimators for adaptively compressing files with nonstationary distributions. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 36(5):1196–1200, October 2006.
 [27] Raquel Sebastião and João Gama. Change detection in learning histograms from data streams. In Proceedings of the Artificial Intelligence 13th Portuguese Conference on Progress in Artificial Intelligence, EPIA’07, pages 112–123, Berlin, Heidelberg, 2007. Springer-Verlag.
 [28] Albert Nikolaevich Shiryayev. Optimal Stopping Rules. Springer, 1978.
 [29] A. Stensby, B. J. Oommen, and O.-C. Granmo. The use of weak estimators to achieve language detection and tracking in multilingual documents. International Journal of Pattern Recognition and Artificial Intelligence, 27(04):1350011, 2013.
 [30] A. G. Tartakovsky, B. L. Rozovskii, R. B. Blazek, and Hongjoong Kim. A novel approach to detection of intrusions in computer networks via adaptive sequential and batch-sequential change-point detection methods. IEEE Transactions on Signal Processing, 54:3372–3382, September 2006.
 [31] Alexey Tsymbal, Mykola Pechenizkiy, Pádraig Cunningham, and Seppo Puuronen. Dynamic integration of classifiers for handling concept drift. Inf. Fusion, 9(1):56–68, January 2008.
 [32] Gerhard Widmer and Miroslav Kubat. Learning in the presence of concept drift and hidden contexts. Machine Learning, 23(1):69–101, 1996.
 [33] A. Yazidi, O.-C. Granmo, B. J. Oommen, M. Gerdes, and F. Reichert. A user-centric approach for personalized service provisioning in pervasive environments. Wireless Personal Communications, 61(3):543–566, 2011.
 [34] Anis Yazidi and B. John Oommen. Novel discretized weak estimators based on the principles of the stochastic search on the line problem. IEEE Trans. Cybernetics, 46(12):2732–2744, 2016.
 [35] Anis Yazidi, B. John Oommen, Geir Horn, and Ole-Christoffer Granmo. Stochastic discretized learning-based weak estimation: a novel estimation method for nonstationary environments. Pattern Recognition, 60:430–443, 2016.
 [36] Justin Zhan, B. John Oommen, and Johanna Crisostomo. Anomaly detection in dynamic systems using weak estimators. ACM Trans. Internet Technol., 11:3:1–3:16, July 2011.