1.1 Purpose of this paper
This study is concerned with online learning in data streams. We consider a situation where each datum arrives in an online fashion. In such a situation, we would like to (i) learn robustly against outliers or anomalies in the observed data and (ii) adapt to changes in the underlying data-generating mechanism. As for (i), if a data point is an outlier, we would like to learn with as little influence from that outlier as possible. In this paper, we refer to this property of online learning algorithms as robustness. As for (ii), we would like to quickly follow changes in the data-generating mechanism. We refer to this property as adaptivity. Figure 1 illustrates the concepts of robustness and adaptivity.
A tradeoff exists between robustness and adaptivity: robustness generally decreases if we try to adapt to changes, and conversely, adaptivity decreases if we try to learn robustly. Although many online learning algorithms have been introduced and some studies have addressed this issue (Tsay, 1988; Gama et al., 2014; Chu et al., 2004; Huang et al., 2016; Odakura, 2018; Cejnek and Bukovsky, 2018; Fearnhead and Rigaill, 2019; Guo, 2019), to the best of our knowledge, no algorithm has quantitatively considered the tradeoff between robustness and adaptivity.
This study aims to propose an online learning algorithm that considers the tradeoff between robustness and adaptivity. We introduce a novel algorithm, called the stochastic approximation-based robustness–adaptivity (SRA) algorithm, provide a theoretical analysis of its non-asymptotic convergence in the presence of outliers, and demonstrate its effectiveness on both synthetic and real datasets. The key idea of SRA is to update the parameters of a distribution or its sufficient statistics with stochastic approximation (SA) (Robbins and Monro, 1951) while dropping data points whose stochastic updates (drift terms) take large values.
1.2 Related work
This study is concerned with the robustness and adaptivity of online learning algorithms. We also briefly review studies related to SA (Robbins and Monro, 1951) and the online expectation–maximization (EM) algorithm (Cappé and Moulines, 2009; Karimi et al., 2019a) because SRA builds on them.
1.2.1 Robustness and adaptivity of online learning algorithms
The robustness and adaptivity of online learning algorithms have often been discussed in the context of concept drift (Gama et al., 2014; Chu et al., 2004; Huang et al., 2016; Cejnek and Bukovsky, 2018). Yamanishi et al. proposed an online learning algorithm called the sequentially discounting EM algorithm (SDEM) (Yamanishi et al., 2004). Although SDEM can handle complicated distributions, it is sensitive to noise and easily overfits the data. Odakura proposed an online nonstationary robust learning algorithm (Odakura, 2018), which introduced two parameters to independently control robustness and adaptivity. Fearnhead and Rigaill proposed a change-detection algorithm that is robust to the presence of outliers (Fearnhead and Rigaill, 2019). Its key idea is to adapt existing penalized cost approaches for detecting changes so that they use loss functions that are less sensitive to outliers. Guo proposed an algorithm based on an online sequential extreme learning machine for robust and adaptive learning (Guo, 2019).
1.2.2 Online (stochastic) EM algorithms
The EM algorithm (Dempster et al., 1977) is a popular class of inference algorithms for minimizing loss functions. The original EM algorithm does not scale to large datasets because it requires the entire dataset at each iteration. To overcome this problem, several studies have proposed online versions of the EM algorithm.
Neal and Hinton proposed an EM algorithm in an incremental scheme, referred to as the incremental EM (iEM) (Neal and Hinton, 1999). Cappé and Moulines proposed the stochastic (online) EM (sEM) algorithm (Cappé and Moulines, 2009), which updates the sufficient statistics in an SA scheme (Robbins and Monro, 1951). Chen et al. proposed the variance-reduced sEM (sEM-VR) algorithm (Chen et al., 2018). Meanwhile, Karimi et al. showed non-asymptotic bounds for the global convergence of iEM, sEM-VR, and the fast incremental EM (fiEM) (Karimi et al., 2019b).
By contrast, only a few studies have considered the online EM algorithm in a situation where a fresh sample is drawn at each iteration. Cappé and Moulines proved the asymptotic convergence of the online EM algorithm (Cappé and Moulines, 2009). Balakrishnan et al. analyzed the non-asymptotic convergence of a variant of the online EM algorithm (Balakrishnan et al., 2017), where the initial radius around the optimal parameter must be known in advance. Karimi et al. considered an SA scheme (Robbins and Monro, 1951) whose stochastic update (drift term) depends on a state-dependent Markov chain and whose mean field is not necessarily of a gradient type, thereby covering an approximate second-order method and allowing an asymptotic bias for one-step updates (Karimi et al., 2019a). They illustrated these settings with the online EM algorithm and the policy-gradient method for average reward maximization in reinforcement learning.
1.3 Significance of this paper
1.3.1 Novel online learning algorithm for tradeoff between robustness and adaptivity
We propose a novel online learning algorithm, called SRA, that considers the tradeoff between robustness and adaptivity. Previous studies (Chu et al., 2004; Huang et al., 2016; Cejnek and Bukovsky, 2018; Yamanishi et al., 2004; Odakura, 2018; Fearnhead and Rigaill, 2019; Guo, 2019) considered only one of the two, and even when both were considered, the relation between them was not made clear. This study considers both and gives a theoretical analysis of the non-asymptotic convergence of SRA. To do so, we adopt the SA scheme (Robbins and Monro, 1951) in a setting where outliers and change points might exist. As SRA is formulated on the majorization–minimization principle (Lange, 2016; Mairal, 2015), it is a general algorithm that includes many schemes, such as the online EM algorithm (Cappé and Moulines, 2009; Balakrishnan et al., 2017; Karimi et al., 2019a) and stochastic gradient descent (SGD). Our approach can be regarded as an extension of the work of (Karimi et al., 2019a); however, they presented a convergence analysis of the biased SA in a setting with no outliers or change points. In contrast, we analyze convergence in a setting where outliers and change points might exist. Our study is novel in that we show a non-asymptotic convergence analysis in this broader setting and apply it to quantify and evaluate the tradeoff between the robustness and adaptivity of online learning algorithms.
Note that many studies have already addressed the tradeoff between exploration and exploitation in bandit algorithms (e.g., (Lattimore and Szepesvári, 2018)). However, our problem setting is different. Bandit algorithms search for parameters independently of changes in the environment. In contrast, SRA does not change its parameters greatly when the data-generating mechanism does not change much; it adapts to changes in the data-generating mechanism. Therefore, although both our study and those on bandit algorithms consider a tradeoff between global and local information, our motivation is different from theirs.
1.3.2 Empirical demonstration of the proposed algorithm
We evaluated the effectiveness of SRA on both synthetic and real datasets. We empirically showed the characteristics of SRA by inspecting its dependencies on the parameters, and these were consistent with the theoretical analysis. We also compared the performance of SRA with those of previously proposed algorithms (Neal and Hinton, 1999; Yamanishi et al., 2004; Cappé and Moulines, 2009) on important tasks such as change detection and anomaly detection. SRA was superior to the other algorithms.
2.1 Problem setting
We consider a situation where each datum arrives in an online fashion at each time . If no noise exists, we assume that is drawn from
where is an element of a parametric class of distribution , is a parameter space, and is the associated parameter. However, in the real world, data are sometimes contaminated by noise. In this case, we assume that
is drawn from a mixture of probability density functions:
where denotes a mixture ratio (). Equation (2) means that a datum is generated from true distribution with probability and from noisy distribution with probability . is an element of a parametric class of data distributions , where is a parameter space, and is the associated parameter. This study addresses the convergence property of Equation (2).
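The contaminated-data model of Equation (2) can be sketched as a sampling procedure. The following is a minimal sketch assuming, purely for illustration, that the true distribution is Gaussian, the noise distribution is uniform, and the mixture ratio is denoted `alpha` (matching the synthetic setting used later in Section 5.1); all concrete values are our own choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_contaminated(n, mu, sigma, alpha, noise_low, noise_high):
    """Draw n points from (1 - alpha) * N(mu, sigma^2) + alpha * U(noise_low, noise_high)."""
    is_noise = rng.random(n) < alpha            # each datum is noisy with probability alpha
    clean = rng.normal(mu, sigma, size=n)       # samples from the true distribution
    noise = rng.uniform(noise_low, noise_high, size=n)  # samples from the noisy distribution
    return np.where(is_noise, noise, clean)

x = sample_contaminated(1000, mu=0.0, sigma=1.0, alpha=0.1,
                        noise_low=-10.0, noise_high=10.0)
```

With a symmetric noise range, the contamination inflates the variance of the stream without shifting its mean, which is the kind of situation SRA's truncation is designed to handle.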
We assume that a change point is given, and each datum before and after the change point is drawn from different distributions. This means that in Equation (2) varies as
where . It means a change abruptly occurs at .
2.2 Nonasymptotic analysis of SA
where denotes the -th iterate of parameters or the sufficient statistics of distribution, is the step size.
denotes the random variable at time , and denotes its realization. is the stochastic update at time . When
is an i.i.d. sequence of random vectors, the mean field for the SA is defined as , where is the filtration generated by the random variables at time . When is a state-dependent Markov chain, one works under the assumption that , where denotes the norm of a vector in and is the true distribution. In this study, we consider the former case, that is, is an i.i.d. sequence of random vectors. Karimi et al. assumed that is related to a smooth Lyapunov function , where . The SA scheme in Equation (4) aims to find a minimizer or a stationary point of the non-convex Lyapunov function .
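As a concrete instance of the SA scheme in Equation (4), the classical Robbins–Monro iteration for estimating a mean can be sketched as follows. This is an illustrative example rather than the paper's algorithm; the drift term and the step-size schedule are our own choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Robbins–Monro stochastic approximation: theta_{n+1} = theta_n + rho_n * H(theta_n, X_{n+1}).
# Here the drift term is H = x - theta, whose mean field h(theta) = E[X] - theta
# vanishes at the target (the true mean), a simple instance of the SA scheme.
theta = 0.0
true_mean = 3.0
for n in range(1, 5001):
    x = rng.normal(true_mean, 1.0)   # fresh i.i.d. sample at each iteration
    rho = 1.0 / n                    # step sizes with sum(rho) = inf, sum(rho^2) < inf
    theta += rho * (x - theta)
```

With the 1/n step sizes, the iterate equals the running sample mean and converges to the true mean, illustrating how the mean field drives the iterates toward a stationary point.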
Here, is twice differentiable and convex, is concave and differentiable, is a convex open subset of , denotes the sufficient statistics, and denotes the dot product. The Lyapunov function is defined for the sufficient statistics as
where is the Kullback–Leibler (KL) divergence between and defined as
Therefore, is represented as
Karimi et al. considered the following assumptions for and .
(Karimi et al., 2019a)
, , s.t. .
, , s.t. .
The Lyapunov function is L-smooth: .
Here, means the norm of the mean field, which takes small values as the SA scheme in Equation (4) converges. Assumptions 1(a) and 1(b) state that the mean field is indirectly related to the Lyapunov function , but it is not necessarily equal to . The constants and characterize the bias between the mean field and the gradient of the Lyapunov function. Note that the Lyapunov function can be a nonconvex function under Assumption 1(c).
For any , we denote
as a discrete random variable independent of . When we adopt a randomized stopping rule in SA as in (Ghadimi and Lan, 2013), we define , where is the terminating iteration of Equation (4). We consider the following expectation:
We then define the following noise vector:
Equation (11) represents the difference between the stochastic update and the mean field at time .
We make the following assumption:
(Karimi et al., 2019a) The noise vectors form a martingale difference sequence for any , , with .
The following theorem then holds:
3 Proposed algorithm
In this section, we introduce the proposed algorithm for online learning from data streams, called SRA, which considers the tradeoff between robustness and adaptivity. First, we describe SRA in Section 3.1 and its application to the online EM algorithm (Cappé and Moulines, 2009; Karimi et al., 2019a) in Section 3.2. Since SRA is formulated on the majorization–minimization principle (e.g., (Lange, 2016; Mairal, 2015)), it is applicable to a broad class of algorithms, such as SGD (e.g., (Bottou et al., 2018)). We explain this point in Section 3.3. Notation follows Section 2.2 unless otherwise defined.
We consider the convergence property of Equation (2) under the following SA scheme:
where is the step size, as in Equation (4), and is defined for given as
Equation (13) differs from Equation (4) in that Equation (13) does not update the parameters of the distribution or the sufficient statistics when . This means that SRA drops data points with large values of the stochastic updates and otherwise updates the parameters of the distribution or the sufficient statistics with SA. The former corresponds to the robustness of SRA, whereas the latter corresponds to its adaptivity. They are controlled by the threshold parameter and the step sizes , respectively. The step size is sometimes referred to as the discounting parameter (e.g., (Yamanishi et al., 2004)). Although the step size of the SA is generally different from the discounting parameter, it is related to the adaptivity in the sense that both bring in the effects of new samples. The step sizes bring particularly high adaptivity when their decrease rate is relatively small. Therefore, it suffices to discuss the step size when considering adaptivity in the SA setting. The relation between and , and the determination of the optimal value of given , are addressed in Section 4. The dropping procedure of SRA is somewhat similar to that in the study of (Hara et al., 2019), although they inspected influential instances for models trained with SGD.
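The thresholded update rule above can be sketched in a few lines. The following is a minimal sketch assuming, for illustration, a scalar mean-estimation drift term and hypothetical names `rho` (step size, controlling adaptivity) and `tau` (threshold, controlling robustness):

```python
import numpy as np

def sra_step(theta, x, rho, tau):
    """One SRA update: apply the SA step only when the drift term is small.

    theta: current parameter (here, a running mean estimate).
    rho:   step size (adaptivity); tau: threshold (robustness).
    The names `tau` and the mean-estimation drift are illustrative choices.
    """
    drift = x - theta                               # stochastic update H(theta, x)
    if np.linalg.norm(np.atleast_1d(drift)) > tau:
        return theta                                # outlier-like point: drop it
    return theta + rho * drift                      # ordinary SA update

theta = 0.0
rng = np.random.default_rng(2)
for t in range(2000):
    x = rng.normal(1.0, 0.5)
    if t % 100 == 50:                               # inject occasional outliers
        x = 50.0
    theta = sra_step(theta, x, rho=0.05, tau=5.0)
```

The injected outliers produce drift terms far above the threshold, so they are dropped and the estimate tracks the clean mean; a larger `rho` would track changes faster at the cost of more influence from points that pass the threshold.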
3.2 Application to the Online EM algorithm
Next, we consider SRA in the online EM setting (Cappé and Moulines, 2009). The SA with the online EM algorithm is described as
denotes the estimated sufficient statistics at time . The E-step of the online EM algorithm updates the sufficient statistics, while the M-step updates the parameters. in Equation (15) is defined as
where and are the observed and latent variables, respectively, and denotes the complete-data sufficient statistics. We consider the curved exponential family in Equation (5). The negated complete-data log-likelihood of Equation (5) is defined in Equation (8). Moreover, in Equation (16) is defined in Equation (9). Accordingly, Equations (13), (15), and (16) show that the stochastic update and its mean field are represented as
for the application to the Gaussian mixture model (GMM).
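The online-EM form of SRA can be sketched for a single Gaussian, updating the sufficient statistics in SA form with the truncation applied to the drift term. The single-component case is an illustrative simplification of the GMM setting; the step-size schedule and threshold below are our own choices:

```python
import numpy as np

# Online-EM-style update of the sufficient statistics s = (E[x], E[x^2]) of a
# single Gaussian, with the SRA truncation applied to the stochastic update.
rng = np.random.default_rng(3)
s = np.array([0.0, 1.0])                  # running estimates of (x, x^2)
tau = 20.0                                # illustrative threshold
for t in range(3000):
    x = rng.normal(2.0, 1.0)
    if t % 200 == 0:
        x = 100.0                         # injected outlier
    drift = np.array([x, x * x]) - s      # stochastic update of the statistics
    if np.linalg.norm(drift) <= tau:      # SRA: drop large drift terms
        s += drift / (t + 2)              # E-step in SA form (decreasing step)
mu, var = s[0], s[1] - s[0] ** 2          # M-step: mean and variance
```

The M-step here is the usual moment inversion; for a full GMM it would additionally carry per-component responsibilities inside the sufficient statistics.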
3.3 Surrogate functions of SRA
As SRA is formulated on the majorization–minimization principle (e.g., (Lange, 2016; Mairal, 2015)), it is naturally applicable to a wider class of algorithms, such as SGD. For example, stochastic optimization with an -regularizer is described as
where is a loss function, is a learning rate, and is a penalty parameter. We obtain the solution of Equation (20) as
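As an illustration of this point, a stochastic gradient step with an ℓ2 penalty combined with the SRA dropping rule can be sketched as follows. The closed-form solution of Equation (20) is not reproduced here; the plain gradient step, the quadratic loss, and all parameter values are our own illustrative assumptions:

```python
import numpy as np

def sra_sgd_step(theta, grad_loss, rho, lam, tau):
    """One SGD step with an l2 penalty under the SRA rule: the update
    direction g = grad_loss + lam * theta is applied only if its norm is
    at most tau; otherwise the point is dropped (hypothetical names)."""
    g = grad_loss + lam * theta
    if np.linalg.norm(np.atleast_1d(g)) > tau:
        return theta
    return theta - rho * g

# Usage: minimize E[(x - theta)^2] / 2 + lam * theta^2 / 2, whose minimizer
# is E[x] / (1 + lam) = 2.0 / 1.1 for x ~ N(2, 1) and lam = 0.1.
rng = np.random.default_rng(4)
theta, lam = 0.0, 0.1
for t in range(4000):
    x = rng.normal(2.0, 1.0)
    theta = sra_sgd_step(theta, grad_loss=(theta - x), rho=0.01, lam=lam, tau=5.0)
```

The regularizer shrinks the solution toward zero, and the SRA rule leaves the iteration untouched except when a gradient is implausibly large.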
4 Convergence analysis
This section presents the convergence analysis of SRA. All proofs are given in the Appendix.
4.1 Upper Bound of Expectation of the Mean Field
We are concerned with how Equation (13) converges. In particular, our concern is on how Theorem 2.1 would be altered when each datum is generated from Equation (2) instead of Equation (1). In this case, we define the following noise vector:
The following lemma holds with respect to the expectation of the dot product of the gradient of the Lyapunov function and the noise vector.
There exists , such that the following inequality holds for :
The proof of Lemma 1 is given in A.1. The left-hand side of Equation (25) represents the magnitude of the bias of . In contrast, regarding the right-hand side of Equation (25), the sharper the distribution of is, the smaller becomes; as a result, the bound improves. We address this point for the situation where is bounded in the discussion of Corollary 3.
We set the following assumption for the noise distribution in Equation (2):
Then, the following lemma holds:
Under Assumption 3 and , , the following inequality holds:
The proof of Lemma 2 is given in A.2. Note that the right-hand side of Equation (27) represents a weighted sum of the variances of the noise vector in Equation (11) under the true distribution and the noisy one. In particular, the first term on the right-hand side of Equation (27) has an additional term in comparison with the noiseless case in Assumption 2. It reflects the bias of the noise vector caused by the truncation in Equation (14).
The following theorem then holds:
The proof of Theorem 4.1 is given in A.3. Note that is an inevitable bias term between the mean field and the gradient of the Lyapunov function, defined in Assumption 1(a). It also appeared in the result in Equation (12). When we set in Equation (30), Theorem 4.1 is represented as
Equation (32) asserts that the SA scheme in Equation (13) finds an stationary point within iterations. Note that when is a constant or the decay rate of is small, whenever a change occurs according to Equation (3), the convergence rate in Equation (30) depends on , , , , , , , , and ; consequently, it is independent of the change point in Equation (3). This means that when a change of the distribution occurs according to Equation (3), if the distribution satisfies the assumptions of Theorem 4.1, the scheme converges at an almost constant rate irrespective of the time point at which the change occurs. In that sense, SRA is guaranteed to possess adaptivity. In contrast, when we adopt decreasing step sizes, the convergence rate becomes worse if the change happens later, because the step sizes become small. In this case, the adaptivity decreases.
Since , , , , and are not known in general, we have to tune these parameters, for example, with cross-validation.
The following corollary holds regarding the relationship between the threshold parameter and the step size :
If , and we set , the right-hand side of Equation (30) is minimized by
4.2 Effect of
Next, we address how the upper bound in Equation (32) behaves when goes to infinity. The following corollary holds regarding the expectation of the norm of the mean field :
The following inequality holds:
The following corollary then holds as regards the decrease of the upper bound by setting .
The proof of Corollary 3 is given in A.6. Equation (36) represents the effect of setting . The first term on the right-hand side of Equation (36) represents the decrease of the upper bound obtained by setting as the threshold parameter. In contrast, as its demerit, the second term appears on the right-hand side. As mentioned after Lemma 1, if is bounded, the cost of the second term disappears in a finite region. In such a case, the advantage of SRA becomes clearer. In fact, if holds ( ), we obtain the following inequality with Hoeffding's inequality (Vershynin, 2018):
We then obtain the following equation for :
Therefore, the following equation holds for :
When satisfies , Equation (40) shows that the effect of setting is proportional to .
This section presents the experimental results of SRA. All the source codes are available at https://github.com/s-fuku/robustadapt.
5.1 Synthetic dataset
We generated the following one-dimensional sequence:
denotes the univariate normal distribution with mean . denotes the uniform distribution with range . We evaluated the performances of SRA and the compared algorithms using the following mean squared errors (MSEs):
where is the estimated mean at , is the sequence length ( ), is the change point ( ), and is a transient period. , , and represent the MSEs for the overall sequence, the time points before the change point, and those after the change point, respectively. Each MSE excludes the transient period between and . We set and the number of mixture components of the GMM to for each algorithm throughout the following experiments on synthetic datasets. For each algorithm, we initialized the parameters or the sufficient statistics with the data in the first 10 time steps.
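The data generation and evaluation protocol above can be sketched as follows. The concrete values (change point, outlier ratio, noise range, transient length) are illustrative assumptions, since the paper's exact settings are elided here:

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic stream with a change point and uniform outliers (illustrative values:
# mean 0 -> 5 at t_c, outlier ratio alpha = 0.05, noise range U(-20, 20)).
T, t_c, alpha = 2000, 1000, 0.05
mu_true = np.where(np.arange(T) < t_c, 0.0, 5.0)
x = rng.normal(mu_true, 1.0)
noisy = rng.random(T) < alpha
x[noisy] = rng.uniform(-20.0, 20.0, size=int(noisy.sum()))

def mse_scores(mu_hat, mu_true, t_c, transient=50):
    """Overall / pre-change / post-change MSEs, excluding the transient
    period [t_c, t_c + transient) after the change point."""
    err = (mu_hat - mu_true) ** 2
    keep = np.ones(len(err), dtype=bool)
    keep[t_c:t_c + transient] = False
    return err[keep].mean(), err[:t_c].mean(), err[t_c + transient:].mean()
```

Excluding the transient window keeps the post-change score from being dominated by the unavoidable adaptation delay right after the change.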
5.1.1 Tradeoff between and
We empirically confirmed the tradeoff between the threshold parameter and the step size . In practice, the hyperparameters , , and must be tuned to determine in Equation (33). As is regarded as one parameter, we changed , , and to estimate the optimal value of in Equation (33). We generated 10 data streams according to Equation (41) with and and evaluated the MSE between and , denoted as . Figure 2 shows the estimated , , , , and . For each , we selected the optimal combination of , , , and estimated in Equation (33), which minimized . Figure 2 shows that , , , and were minimized when , indicating that the choice of based on Equation (33) is reasonable because it provides both robustness and adaptivity. The best combination of hyperparameters was , and the estimated was . We also observe from Figure 2 that and were not so different for , whereas and were. This indicates that did not have much influence before the change point between and . In contrast, after the change point, the mean of the distribution changed to from . Therefore, the difference between and became significant. In fact, the sufficient statistics correspond to the mean in this case. The result with indicates that dropping more data points than led to a decrease in the accuracy of the mean estimation. For , we see a decrease of the MSEs in comparison with even before the change point, which is due to the influence of outliers.
5.1.2 Comparison with other algorithms
We compared the performance of SRA with those of the rival algorithms. We chose the following algorithms for comparison:
SDEM (Yamanishi et al., 2004): an online learning algorithm based on GMM. SDEM sequentially updates parameters and adapts to non-stationary changes with the discounting parameter.
sEM (Cappé and Moulines, 2009): a stochastic (online) EM algorithm. sEM updates the sufficient statistics in an SA scheme.
SDEM (Yamanishi et al., 2004) and sEM (Cappé and Moulines, 2009) have a discounting parameter to adapt to new data. We chose , , , and , , . We set the parameters of SRA to , , , , , , , that is, the best combination in the previous experiment. We then evaluated , the MSE between and , as before. As a result, the discounting parameters were selected as and for SDEM and sEM, respectively. We generated data streams using Equation (41) 10 times with and . Table 1 shows the average MSEs , , and for each algorithm. SRA was superior to the other algorithms for the time periods both before and after the change point. This result indicates that SRA is better equipped with both robustness and adaptivity than the other algorithms.
5.1.3 Dependency on
We investigated the dependency of the upper bound in Equation (30) on the ratio of outliers, which is characterized by in Equation (41). Based on Equation (36), the smaller is, the larger the upper bound of the expectation of the mean field becomes. In other words, the upper bound increases as the amount of noisy data increases.
For , we set , , and . That is, we used the best combination from the previous experiment, but was modified by the value of ; therefore, was also modified according to Equation (33). We generated data streams using Equation (41) 10 times with and . We then estimated the MSEs , , and . Figure 3 shows , , and . Each MSE decreased as increased. This result is consistent with Equation (36).
5.2 Real dataset: change detection
We applied SRA to the Well-log dataset for change detection. This dataset is available, for example, at https://github.com/alan-turing-institute/rbocpdms/. The Well-log dataset was first studied by Ruanaidh et al. (Ruanaidh et al., 1996) and has become a benchmark dataset for univariate change detection. It is a data stream consisting of 4050 measurements of nuclear magnetic response taken during the drilling of a well. Although this dataset has been used in several studies (e.g., (Adams and MacKay, 2007; Levy-leduc and Harchaoui, 2008; Ruggieri and Antonellis, 2016; Fearnhead and Rigaill, 2019)), the outliers have often been removed before change detection, except in a few studies (e.g., (Fearnhead and Rigaill, 2019)). Figure 4 shows the annotated change points proposed by (Burg and Williams, 2020). There are five sets of annotated changes, each provided by a different annotator:
Annotation 1: , , , , , , , , , , .
Annotation 2: , , , , , , , , .
Annotation 3: , , , , , , , , .
Annotation 4: , .
Annotation 5: , , , , , , , , , , , , , , , , .
We used the first 1550 data points as the training dataset and the remaining points as the test dataset. We calculated the change score as , where is the parameter estimated at . We evaluated the performance of each algorithm on the training dataset in terms of detection delay and overdetection, using the area under the curve (AUC) score (Fawcett and Provost, 1999; Yamanishi and Miyaguchi, 2016). The AUC score was calculated as follows: we first fixed the threshold parameter and converted the change scores into binary alarms . That is, , where denotes the binary function that takes if and only if is true. We let be the maximum tolerable delay of change detection; in this experiment, we set . When a change actually occurred at , we defined the benefit of an alarm at time as
The number of false alarms was calculated as
where and are the starting and end time points for evaluation, respectively, and denotes the sequence of change scores in that period. We visualized the performance by plotting the recall rate of the total benefit, , against the false alarm rate, , while varying the threshold.
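The benefit and false-alarm computation above can be sketched as follows, assuming a linearly decaying benefit within the maximum tolerable delay (the exact benefit function used in the paper may differ; the shape here is an assumption):

```python
import numpy as np

def benefit_and_false_alarms(alarms, change_points, t_max):
    """Benefit / false-alarm counts for binary alarms: an alarm within t_max
    steps after a true change earns a benefit that decays linearly with the
    delay; every other alarm counts as a false alarm (assumed benefit shape)."""
    alarms = np.asarray(alarms)
    benefit, n_false = 0.0, 0
    for t in np.flatnonzero(alarms):                      # times with an alarm
        delays = [t - c for c in change_points if 0 <= t - c < t_max]
        if delays:
            benefit += 1.0 - min(delays) / t_max          # reward early detection
        else:
            n_false += 1                                  # alarm with no nearby change
    return benefit, n_false
```

Sweeping the score threshold, recomputing the alarms, and plotting recall of total benefit against false-alarm rate then traces the curve whose area gives the AUC.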
We chose SDEM (Yamanishi et al., 2004), iEM (Neal and Hinton, 1999), and sEM (Cappé and Moulines, 2009) for comparison. Each algorithm employed the univariate normal distribution. We chose the hyperparameters of each algorithm by their AUC scores between and : the hyperparameters of SRA were chosen among , , , , , , and , , , and those of SDEM and sEM were chosen among , , , , , , . We initialized the parameters or sufficient statistics of each algorithm 10 times and selected the combination that gave the best performance on average. To initialize the parameters or sufficient statistics, we drew 20 initial points from the uniform distribution whose range was , where is a sequence between and .
We applied each algorithm and calculated the AUC scores on the test dataset after the parameters were determined. Table 2 shows the AUCs on the test dataset for each algorithm and annotation. We observe that SRA was superior to the other algorithms for each annotation.
The best combinations of hyperparameters were for all the annotations for SDEM. For sEM, (Annotation 1, 2, and 3), (Annotation 4), and (Annotation 5). For SRA, , , for Annotation 1, 2, and 3, , , for Annotation 4, and , ,