# Online Robust and Adaptive Learning from Data Streams

In online learning from non-stationary data streams, it is necessary both to learn robustly against outliers and to adapt quickly to changes in the underlying data-generating mechanism. In this paper, we refer to the former property of online learning algorithms as robustness and to the latter as adaptivity. There is an obvious tradeoff between them, and quantifying and evaluating this tradeoff is a fundamental issue because it provides important information on the data-generating mechanism. However, no previous work has considered the tradeoff quantitatively. We propose a novel algorithm, called the Stochastic approximation-based Robustness-Adaptivity algorithm (SRA), to evaluate the tradeoff. The key idea of SRA is to update the parameters of the distribution, or its sufficient statistics, with a biased stochastic approximation scheme while dropping data points whose stochastic updates are large in norm. We address the relation between two parameters: the step size of the stochastic approximation, which controls the adaptivity, and the threshold on the norm of the stochastic update, which controls the robustness. We give a theoretical analysis of the non-asymptotic convergence of SRA in the presence of outliers, which depends on both the step size and the threshold parameter. Since SRA is formulated on the majorization-minimization principle, it is a general algorithm that includes many algorithms, such as the online EM algorithm and stochastic gradient descent. Empirical experiments on both synthetic and real datasets demonstrated that SRA was superior to previous methods.


## 1 Introduction

### 1.1 Purpose of this paper

This study is concerned with online learning from data streams. We consider a situation where each datum arrives in an online fashion. In such a situation, we would like to (i) learn robustly against outliers or anomalies in the observed data; and (ii) adapt to changes of the underlying data-generating mechanism. As for (i), if a data point is an outlier, we would like to learn with as little influence from it as possible. In this paper, we refer to this property of online learning algorithms as robustness. As for (ii), we would like to follow changes of the data-generating mechanism quickly. We refer to this property as adaptivity. Figure 1 illustrates the concepts of robustness and adaptivity.

A tradeoff exists between robustness and adaptivity: robustness generally decreases if we try to adapt to changes, and conversely, adaptivity decreases if we try to learn robustly. Although many online learning algorithms have been introduced and some studies have addressed this issue (Tsay, 1988; Gama et al., 2014; Chu et al., 2004; Huang et al., 2016; Odakura, 2018; Cejnek and Bukovsky, 2018; Fearnhead and Rigaill, 2019; Guo, 2019), to the best of our knowledge, no algorithm has quantitatively considered the tradeoff between robustness and adaptivity.

This study aims to propose an online learning algorithm that considers the tradeoff between robustness and adaptivity. We introduce a novel algorithm, called the stochastic approximation-based robustness–adaptivity (SRA) algorithm, give a theoretical analysis of its non-asymptotic convergence in the presence of outliers, and demonstrate its effectiveness on both synthetic and real datasets. The key idea of SRA is to update the parameters of the distribution, or its sufficient statistics, with stochastic approximation (SA) (Robbins and Monro, 1951) while dropping data points with large values of the stochastic updates (drift terms).

### 1.2 Related work

This study is concerned with the robustness and adaptivity of online learning algorithms. Moreover, we briefly review studies related to SA (Robbins and Monro, 1951) and the online expectation–maximization (EM) algorithm (Cappé and Moulines, 2009; Karimi et al., 2019a) because SRA builds on them.

#### 1.2.1 Robustness and adaptivity of online learning algorithms

The robustness and adaptivity of online learning algorithms have often been discussed in the context of concept drift (Gama et al., 2014; Chu et al., 2004; Huang et al., 2016; Cejnek and Bukovsky, 2018). Yamanishi et al. proposed an online learning algorithm called the sequentially discounting EM algorithm (SDEM) (Yamanishi et al., 2004). Although SDEM can handle complicated distributions, it is sensitive to noise and easily overfits the data. Odakura proposed an online non-stationary robust learning algorithm (Odakura, 2018), which introduced two parameters to control robustness and adaptivity independently. Fearnhead and Rigaill proposed an algorithm for change detection that is robust to the presence of outliers (Fearnhead and Rigaill, 2019). The key idea of the algorithm is to adapt existing penalized cost approaches for detecting changes such that they use loss functions that are less sensitive to outliers. Guo proposed an algorithm based on an online sequential extreme learning machine for robust and adaptive learning (Guo, 2019).

#### 1.2.2 Online (stochastic) EM algorithms

The EM algorithm (Dempster et al., 1977) is a popular class of inference procedures for minimizing loss functions. The original EM algorithm (Dempster et al., 1977) does not scale to large datasets because it requires the entire dataset at each iteration. To overcome this problem, several studies have proposed online versions of the EM algorithm.

Neal and Hinton proposed an EM algorithm in an incremental scheme referred to as the incremental EM (iEM) (Neal and Hinton, 1999). Cappé and Moulines proposed the stochastic (online) EM (sEM) algorithm (Cappé and Moulines, 2009), which updates sufficient statistics in an SA scheme (Robbins and Monro, 1951). Chen et al. proposed the variance-reduced sEM (sEM-VR) algorithm (Chen et al., 2018). Meanwhile, Karimi et al. showed non-asymptotic convergence bounds for the global convergence of iEM, sEM-VR, and the fast incremental EM (fiEM) (Karimi et al., 2019b).

By contrast, only a few studies have considered the online EM algorithm in the situation where a fresh sample is drawn at each iteration. Cappé and Moulines proved the asymptotic convergence of the online EM algorithm (Cappé and Moulines, 2009). Balakrishnan et al. analyzed the non-asymptotic convergence of a variant of the online EM algorithm (Balakrishnan et al., 2017), where the initial radius around the optimal parameter must be known in advance. Karimi et al. considered the SA scheme (Robbins and Monro, 1951) whose stochastic update (drift term) depends on a state-dependent Markov chain and whose mean field is not necessarily of a gradient type, thereby covering an approximate second-order method and allowing an asymptotic bias for one-step updates (Karimi et al., 2019a). They illustrated these settings with the online EM algorithm and the policy-gradient method for average reward maximization in reinforcement learning.

### 1.3 Significance of this paper

In the context of Sections 1.1 and 1.2, the contributions of this paper are summarized below.

#### 1.3.1 Novel online learning algorithm for tradeoff between robustness and adaptivity

We propose a novel online learning algorithm, called SRA, to consider the tradeoff between robustness and adaptivity. Previous studies (Chu et al., 2004; Huang et al., 2016; Cejnek and Bukovsky, 2018; Yamanishi et al., 2004; Odakura, 2018; Fearnhead and Rigaill, 2019; Guo, 2019) considered only one of them, and even when some considered both, the relation between them was not made clear. This study considers both and gives a theoretical analysis of the non-asymptotic convergence of SRA. To do so, we adopt the SA scheme (Robbins and Monro, 1951) in a setting where outliers and change points might exist. As SRA is formulated on the majorization–minimization principle (Lange, 2016; Mairal, 2015), it is a general algorithm that includes many schemes, such as the online EM algorithm (Cappé and Moulines, 2009; Balakrishnan et al., 2017; Karimi et al., 2019a) and stochastic gradient descent (SGD). Our approach can be regarded as an extension of the work of (Karimi et al., 2019a), but they presented a convergence analysis of the biased SA in the setting without outliers or change points. In contrast, we consider a convergence analysis in a setting where outliers and change points might exist. Our study is novel in that we show a non-asymptotic convergence analysis in this broader setting and apply it to quantify and evaluate the tradeoff between the robustness and adaptivity of online learning algorithms.

Note that many studies have already addressed the tradeoff between exploration and exploitation in bandit algorithms (e.g., (Lattimore and Szepesvári, 2018)). However, our problem setting is different from these studies. Bandit algorithms search for parameters independently of changes of the environment. In contrast, SRA does not change its parameters greatly when the data-generating mechanism changes little; it adapts to actual changes of the data-generating mechanism. Therefore, although both our study and those concerned with bandit algorithms consider a tradeoff between global and local information, our motivation is different from theirs.

#### 1.3.2 Empirical demonstration of the proposed algorithm

We evaluated the effectiveness of SRA on both synthetic and real datasets. We empirically showed the characteristics of SRA by inspecting its dependencies on the parameters, and these were consistent with the theoretical analysis. We also compared the performance of SRA with those of previously proposed algorithms (Neal and Hinton, 1999; Yamanishi et al., 2004; Cappé and Moulines, 2009) on important tasks such as change detection and anomaly detection. Consequently, SRA was superior to the other algorithms.

## 2 Preliminaries

### 2.1 Problem setting

We consider a situation where each datum yt arrives in an online fashion at each time t. If no noise exists, we assume that yt is drawn from

 yt∼f(yt;θt), (1)

where f is an element of a parametric class of distributions, Θ is a parameter space, and θt ∈ Θ is the associated parameter. However, in the real world, data are sometimes contaminated by noise. In this case, we assume that yt is drawn from a mixture of probability density functions:

 yt∼αf(yt;θt)+(1−α)fnoise(yt;ξ), (2)

where α denotes a mixture ratio (0 ≤ α ≤ 1). Equation (2) means that a datum is generated from the true distribution f with probability α and from the noise distribution fnoise with probability 1−α. fnoise is an element of a parametric class of noise distributions with associated parameter ξ. This study addresses the convergence property under Equation (2).

We assume that a change point t∗ is given, and each datum before and after the change point is drawn from a different distribution. This means that θt in Equation (2) varies as

 θt={θ1(t=1,…,t∗−1),θ2(t=t∗,…), (3)

where θ1 ≠ θ2. That is, a change occurs abruptly at t = t∗.
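For concreteness, the data-generating process of Equations (2) and (3) can be simulated as follows. This is an illustrative sketch only: the Gaussian form of f, the uniform noise range, and all numeric defaults are assumptions for the example, not settings from the paper.

```python
import numpy as np

def contaminated_stream(n, t_star, theta1, theta2, alpha=0.9,
                        noise_low=-5.0, noise_high=5.0, seed=0):
    """Yield y_t ~ alpha * N(theta_t, 1) + (1 - alpha) * Uniform(noise_low, noise_high),
    with an abrupt change of theta_t from theta1 to theta2 at t = t_star
    (cf. Equations (2) and (3))."""
    rng = np.random.default_rng(seed)
    for t in range(1, n + 1):
        theta = theta1 if t < t_star else theta2
        if rng.random() < alpha:
            yield rng.normal(theta, 1.0)                  # inlier from f(y; theta_t)
        else:
            yield rng.uniform(noise_low, noise_high)      # outlier from f_noise
```

A stream drawn this way is stationary except for the single abrupt change at t∗, matching the setting analyzed in Section 4.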

### 2.2 Nonasymptotic analysis of SA

Karimi et al. showed a convergence analysis (Karimi et al., 2019a) of a non-convex objective function under the SA scheme (Robbins and Monro, 1951) applied to Equation (1):

 θt+1=θt−ρt+1Hθt(Yt+1), (4)

where θt denotes the t-th iterate of the parameters or the sufficient statistics of the distribution, and ρt+1 is the step size. Yt+1 denotes the random variable at time t+1, and yt+1 does its realization. Hθt(Yt+1) is the stochastic update at time t. When {Yt} is an i.i.d. sequence of random vectors, the mean field for the SA is defined as h(θt)=E[Hθt(Yt+1)|Ft], where Ft is the filtration generated by the random variables up to time t. When {Yt} is a state-dependent Markov chain, the mean field is defined as h(θ)=Eπ[Hθ(Y)] under the assumption that Eπ[∥Hθ(Y)∥]<∞, where ∥⋅∥ denotes the norm of a vector in ℝd and π is the true distribution. In this study, we consider the former case, that is, {Yt} is an i.i.d. sequence of random vectors. Karimi et al. assumed that h is related to a smooth Lyapunov function V:ℝd→ℝ, where V is bounded from below. The SA scheme in Equation (4) aims to find a minimizer or a stationary point of the non-convex Lyapunov function V.

For example, let us consider the online EM algorithm (Cappé and Moulines, 2009; Karimi et al., 2019a) applied to the curved exponential family:

 f(Y,Z;θ)=h(Y,Z)exp(⟨S(Y,Z)|ϕ(θ)⟩−ψ(θ)). (5)

Here, ψ is twice differentiable and convex, ϕ is concave and differentiable, the space of sufficient statistics is a convex open subset of ℝd, S(Y,Z) denotes the sufficient statistics, and ⟨⋅|⋅⟩ does the dot product. The Lyapunov function is defined for the sufficient statistics s as

 V(s) def= KL(π,g(⋅;¯θ(s)))+R(¯θ(s)), (6)

where KL(π,g) is the Kullback–Leibler (KL) divergence between π and g, defined as

 KL(π,g) def= Eπ[log(π(Y)/g(Y;θ))], (7)

and R(θ) is a penalization term assumed to be twice differentiable (Karimi et al., 2019a). ¯θ(s) in Equation (6) is defined as the minimizer of the loss function:

 ℓ(s;θ)=ψ(θ)+R(θ)−⟨s|ϕ(θ)⟩. (8)

Therefore, ¯θ(s) is represented as

 ¯θ(s) def= argminθ ℓ(s;θ)=argminθ {ψ(θ)+R(θ)−⟨s|ϕ(θ)⟩}. (9)

Karimi et al. considered the following assumptions for h and V.

###### Assumption 1

(Karimi et al., 2019a)

(a)

There exist c0 ≥ 0 and c1 > 0 s.t., for all θ, c0+c1⟨∇V(θ)|h(θ)⟩ ≥ ∥h(θ)∥2.

(b)

There exist d0 ≥ 0 and d1 > 0 s.t., for all θ, d0+d1∥∇V(θ)∥ ≥ ∥h(θ)∥.

(c)

The Lyapunov function V is L-smooth: ∥∇V(θ)−∇V(θ′)∥ ≤ L∥θ−θ′∥ for all θ, θ′.

Here, ∥h(θ)∥ means the norm of the mean field, which takes small values as the SA scheme in Equation (4) converges. Assumption 1 (a) and (b) state that the mean field h is indirectly related to the Lyapunov function V, but it is not necessarily equal to ∇V. The constants c0 and d0 characterize the bias between the mean field and the gradient of the Lyapunov function. We note that the Lyapunov function V can be a non-convex function under Assumption 1 (c).

For any n ∈ ℕ, we denote by N ∈ {0,…,n} a discrete random variable independent of {Ft}. When we adopt a randomized stopping rule in SA as in (Ghadimi and Lan, 2013), we define P(N=k)=ρk+1/∑nt=0ρt+1, where n is the terminating iteration for Equation (4). We consider the following expectation:

 E[∥h(θN)∥2]=∑nk=0P(N=k)∥h(θk)∥2, (10)

where θk is computed with Equation (4). The left side of Equation (10) is the expectation of the squared norm of the mean field under this weighting of the iterates.

We then define the following noise vector:

 et+1 def= Hθt(Yt+1)−h(θt). (11)

Equation (11) is the difference between the stochastic update and the mean field at time t.

We make the following assumption:

###### Assumption 2

(Karimi et al., 2019a) The noise vectors {et} form a martingale difference sequence: for any t, E[et+1|Ft]=0, with E[∥et+1∥2|Ft] ≤ σ20+σ21∥h(θt)∥2.

The following theorem then holds:

###### Theorem 2.1

(Karimi et al., 2019a) If Assumption 1 (a), (c) and Assumption 2 hold, and the step sizes {ρt+1} are sufficiently small for all t ≥ 0, then we obtain the following inequality:

 E[∥h(θN)∥2]≤2c1(V0,n+σ20L∑nt=0ρ2t+1)∑nt=0ρt+1+2c0, (12)

where V0,n def= V(θ0)−E[V(θn+1)].

In particular, when we set ρt+1 = ρ ∝ 1/√(n+1), the right-hand side of Equation (12) evaluates to O(1/√n)+2c0. This means that the SA scheme in Equation (4) finds an (O(1/√n)+2c0)-stationary point within n iterations. Note that 2c0 is the inevitable bias between the mean field h and the gradient of the Lyapunov function V.

## 3 Proposed algorithm

In this section, we introduce the proposed online learning algorithm for data streams, called SRA, which considers the tradeoff between robustness and adaptivity. First, we describe SRA in Section 3.1 and its application to the online EM algorithm (Cappé and Moulines, 2009; Karimi et al., 2019a) in Section 3.2. Since SRA is formulated on the majorization–minimization principle (e.g., (Lange, 2016; Mairal, 2015)), it is widely applicable to a broad class of algorithms, such as SGD (e.g., (Bottou et al., 2018)); we explain this point in Section 3.3. Notation follows Section 2.2 unless we define it otherwise.

### 3.1 SRA

We consider the convergence property of Equation (2) under the following SA scheme:

 θt+1 =θt−ρt+1Gθt(Yt+1), (13)

where ρt+1 is the step size, as in Equation (4), and Gθt(Y) is defined for a given threshold parameter γ>0 as

 Gθt(Y) ={Hθt(Y)(∥Hθt(Y)∥≤γ),0(∥Hθt(Y)∥>γ). (14)

We call the SA scheme in Equation (13) SRA; it is summarized in Algorithm 1. The computational cost of SRA is constant at each time step.

Equation (13) differs from Equation (4) in that Equation (13) does not update the parameters of the distribution or the sufficient statistics when ∥Hθt(Yt+1)∥>γ. This means that SRA drops data points with large values of the stochastic updates and otherwise updates the parameters of the distribution or the sufficient statistics with SA. The former corresponds to the robustness of SRA, and the latter to its adaptivity; they are controlled by the threshold parameter γ and the step sizes ρt+1, respectively. The step size is sometimes referred to as the discounting parameter (e.g., (Yamanishi et al., 2004)). Although the step size of the SA is in general different from the discounting parameter, it is related to the adaptivity in the sense that both determine the effect of new samples. The step sizes bring particularly high adaptivity when their decrease rate is relatively small. Therefore, it is sufficient to discuss the step size when considering adaptivity in the SA setting. The relation between γ and ρt+1, and the determination of the optimal value of ρ for a given γ, are addressed in Section 4. The dropping procedure of SRA is somewhat similar to that in the study of (Hara et al., 2019), although they inspected influential instances for models trained with SGD.
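The update rule of Equations (13) and (14) can be written as a short generic loop. This is a minimal sketch rather than a transcription of the paper's Algorithm 1: `H` (the stochastic update), `rho` (the step-size schedule), and the threshold `gamma` are all supplied by the caller.

```python
import numpy as np

def sra(theta0, stream, H, rho, gamma):
    """SRA (Equations (13)-(14)): run the SA update
    theta <- theta - rho_t * H_theta(y), but skip the update (G = 0)
    whenever the drift term ||H_theta(y)|| exceeds the threshold gamma."""
    theta = np.asarray(theta0, dtype=float)
    for t, y in enumerate(stream, start=1):
        drift = H(theta, y)
        if np.linalg.norm(drift) <= gamma:   # robustness: drop large updates
            theta = theta - rho(t) * drift   # adaptivity: SA step
    return theta
```

For instance, with Hθ(y) = θ − y this loop robustly tracks a mean while ignoring gross outliers, illustrating how γ gates the updates that ρ would otherwise apply.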

### 3.2 Application to the Online EM algorithm

Next, we consider SRA in the online EM setting (Cappé and Moulines, 2009). The SA scheme with the online EM algorithm is described as

 E-step:^st+1 =^st−ρt+1(^st−¯s(Yt+1;^θt)), (15) M-step:^θt+1 =¯θ(^st+1), (16)

where ŝt denotes the estimated sufficient statistics at time t. The E-step of the online EM algorithm updates the sufficient statistics, while the M-step updates the parameters. ¯s(y;θ) in Equation (15) is defined as

 ¯s(y;θ) def= Eθ[s(Y=y,Z)|Y=y], (17)

where Y and Z are the observed and latent variables, respectively, and s(Y,Z) denotes the complete-data sufficient statistics. We consider the curved exponential family in Equation (5). The negated complete-data log-likelihood of Equation (5) is defined in Equation (8), and ¯θ(⋅) in Equation (16) is defined in Equation (9). Accordingly, Equations (13), (15), and (16) show that the stochastic update and its mean field are represented as

 H^sn(Yn+1) =^sn−¯s(Yn+1;¯θ(^sn)), (18) h(^sn) =Eπ[H^sn(Yn+1)|Fn]=^sn−Eπ[¯s(Yn+1;¯θ(^sn))]. (19)

We use Equation (18) in Equation (14). Please refer to (Karimi et al., 2019a) as regards the application to the Gaussian mixture model (GMM).
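As a minimal illustration of Equations (15)–(18), the following applies the SRA gate to the online EM recursion for a single univariate Gaussian, a degenerate case with no latent variable in which ¯s(y) = (y, y²) and the M-step is closed-form; the GMM case discussed in the text follows the same pattern with posterior responsibilities in the E-step. All parameter values below are illustrative assumptions.

```python
import numpy as np

def sra_online_em_gaussian(ys, rho=0.05, gamma=5.0, s0=(0.0, 1.0)):
    """Online EM (Equations (15)-(16)) for a univariate Gaussian, with the SRA
    gate of Equation (14) applied to the E-step drift s_hat - s_bar(y).
    Sufficient statistics: s = (E[Y], E[Y^2]); M-step: mu = s1, var = s2 - s1^2."""
    s = np.array(s0, dtype=float)
    for y in ys:
        s_bar = np.array([y, y * y])          # complete-data sufficient statistics
        drift = s - s_bar                     # H_s(y) as in Equation (18)
        if np.linalg.norm(drift) <= gamma:    # SRA truncation
            s = s - rho * drift               # E-step (SA update)
    mu, var = s[0], s[1] - s[0] ** 2          # M-step in closed form
    return mu, var
```

Because an outlier produces a huge drift in the second sufficient statistic (y² is far from ŝ2), the gate discards it before it can corrupt the variance estimate.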

### 3.3 Surrogate functions of SRA

As SRA is formulated on the majorization–minimization principle (e.g., (Lange, 2016; Mairal, 2015)), it is naturally applicable to a wider class of algorithms, such as SGD. For example, stochastic optimization with an ℓ2-regularizer is described as

 θt+1=argminθ{ρt+1⟨∇ℓ(θt)+λθt | θ−θt⟩+(1/2)∥θ−θt∥2+(ρt+1λ/2)∥θt∥2}(t=1,…), (20)

where ℓ is a loss function, ρt+1 is a learning rate, and λ is a penalty parameter. We obtain the solution of Equation (20) as

 −(θt+1−θt) =ρt+1∇ℓ(θt)+ρt+1λθt, (21) ⟺ θt+1 =(1−ρt+1λ)θt−ρt+1∇ℓ(θt), (22) ⟺ θt+1 =θt−ρt+1(λθt+∇ℓ(θt)). (23)

The final equation, Equation (23), corresponds to Equation (13) with Hθt(Yt+1)=λθt+∇ℓ(θt). Please refer to (Ghadimi and Lan, 2013; Bottou et al., 2018) for details on stochastic optimization in the SA scheme.
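The update of Equation (23) combined with the SRA gate of Equation (14) can be sketched for an ℓ2-regularized squared loss as follows; the linear data model and all constants are illustrative assumptions, not choices from the paper.

```python
import numpy as np

def sra_sgd_ridge(data, theta0, rho=0.1, lam=0.01, gamma=10.0):
    """SGD for l2-regularized squared loss in the SA form of Equation (23):
    H_theta = lam * theta + grad l(theta), with the SRA gate of Equation (14)
    dropping updates whose norm exceeds gamma."""
    theta = np.asarray(theta0, dtype=float)
    for x, y in data:
        grad = (theta @ x - y) * x            # gradient of 0.5 * (x.theta - y)^2
        drift = lam * theta + grad            # H_theta in Equation (23)
        if np.linalg.norm(drift) <= gamma:    # drop outlier-driven updates
            theta = theta - rho * drift
    return theta
```

The fixed point satisfies λθ + ∇ℓ(θ) = 0, i.e., the stationary point of the regularized objective, exactly as the derivation in Equations (21)–(23) prescribes.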

## 4 Convergence analysis

This section shows the convergence analysis of SRA. All the proofs are given in the Appendix.

### 4.1 Upper Bound of Expectation of the Mean Field

We are concerned with how Equation (13) converges; in particular, with how Theorem 2.1 is altered when each datum is generated from Equation (2) instead of Equation (1). In this case, we define the following noise vector:

 ξt+1 def= Gθt(Yt+1)−h(θt). (24)

We then address the convergence property of E[∥h(θN)∥2] under Equation (13), where N denotes a discrete random variable in {0,…,n} for any n ∈ ℕ, and the expectation is calculated as in Equation (10) of Section 2.2.

The following lemma holds with respect to the expectation of the dot product of the gradient of the Lyapunov function and the noise vector.

###### Lemma 1

There exists M>0 such that the following inequality holds for k=0,…,n:

 ∣E[⟨∇V(θk)|ξk+1⟩|Fk]∣≤∥∇V(θk)∥∫∞γexp(−z2/M2)dz. (25)

The proof of Lemma 1 is given in A.1. The left-hand side of Equation (25) represents the magnitude of the bias of the noise vector ξk+1. As for the right-hand side of Equation (25), the sharper the distribution of ∥Hθk(Yk+1)∥ is, the smaller the integral gets, and the bound improves accordingly. We address this point for a situation where ∥Hθk(Yk+1)∥ is bounded in the discussion of Corollary 3.

We set the following assumption for the noise distribution in Equation (2):

###### Assumption 3

We assume that the noise distribution fnoise in Equation (2) obeys the uniform distribution:

 fnoise(yt;ξ) =1/(2U)d, (26)

where yt ∈ [−U,U]d, U>0, and d is the dimension of the data.

Then, the following lemma holds:

###### Lemma 2

If we assume that Assumptions 2 and 3 hold, the following inequality holds for k=0,…,n:

 E[∥ξk+1∥2|Fk] ≤α(σ20+(σ21+1)∥h(θk)∥2)+(1−α)min(dU2,γ2). (27)

The proof of Lemma 2 is given in A.2. Note that the right-hand side of Equation (27) represents a weighted sum of the variances of the noise vector in Equation (11) under the true distribution and the noisy one. In particular, the first term on the right-hand side of Equation (27) has an additional ∥h(θk)∥2 term in comparison with the noiseless case in Assumption 2. It reflects the bias of the noise vector caused by the truncation in Equation (14).

The following theorem then holds:

###### Theorem 4.1

Let us consider the SA scheme in Equation (13). If we assume that Assumptions 1, 2, and 3 hold, the following inequality holds for any n ∈ ℕ:

 E[∥h(θN)∥2] =(∑nk=0ρk+1E[∥h(θk)∥2|Fk])/(∑nk=0ρk+1) (28) ≤2(c0+c1(d0+1)∫∞γexp(−z2/M2)dz) (29) +2c1(V0,n+L(ασ20+(1−α)min(dU2,γ2))∑nk=0ρ2k+1)/(∑nk=0ρk+1), (30)

where V0,n def= V(θ0)−E[V(θn+1)], L is the constant in Assumption 1 (c), d is the dimension of the data, and α is the mixture ratio in Equation (2).

The proof of Theorem 4.1 is given in A.3. Note that 2c0 is an inevitable bias term between the mean field and the gradient of the Lyapunov function, defined in Assumption 1 (a); it also appeared in the result in Equation (12). When we set ρk+1 = ρ (a constant) in Equation (30), Theorem 4.1 is represented as

 E[∥h(θN)∥2] ≤2c0+2c1(d0+1)∫∞γexp(−z2/M2)dz (31) +2c1V0,n/(ρ(n+1))+2c1ρL(ασ20+(1−α)min(dU2,γ2)). (32)

Equation (32) asserts that, with ρ ∝ 1/√(n+1), the SA scheme in Equation (13) finds a stationary point within n iterations up to the bias terms in Equation (31). Note that when ρt is a constant or the decay rate of ρt is small, whenever a change occurs according to Equation (3), the convergence rate in Equation (30) depends only on the constants appearing in Equation (30). Consequently, it is independent of the change point t∗ in Equation (3). This means that when a change of the distribution occurs according to Equation (3), if the distribution satisfies the assumptions of Theorem 4.1, SRA converges at an almost constant rate irrespective of the time point at which the change occurs. In that sense, SRA is guaranteed to possess adaptivity. In contrast, when we adopt decreasing step sizes, the convergence rate becomes worse because the step sizes become small if the change happens later; in this case, the adaptivity decreases.

Since constants such as c0, c1, d0, M, and α are not known in general, we have to tune the corresponding hyperparameters, for example, with cross-validation.

The following corollary holds as regards the relationship between the threshold parameter γ and the step size ρ:

###### Corollary 1

If γ2 ≤ dU2 and we set ρk+1 = ρ (a constant), the right-hand side of Equation (30) is minimized with respect to γ by

 ρ =(d0+1)exp(−γ2/M2)/(2L(1−α)γ). (33)

The proof of Corollary 1 is given in A.4.

### 4.2 Effect of γ

Next, let us address how the upper bound of Equation (32) behaves when γ goes to infinity. The following corollary holds as regards the expectation of the norm of the mean field:

###### Corollary 2

The following inequality holds:

 limγ→∞E[∥h(θN)∥2]≤2c0+2c1(V0,n+L(ασ20+(1−α)dU2)∑nk=0ρ2k+1)/(∑nk=0ρk+1). (34)

The proof of Corollary 2 is given in A.5. Note that the right-hand side of Equation (34) recovers that of Equation (12) when α=1 (the noiseless case).

The following corollary then holds as regards the decrease of the upper bound obtained by setting a finite γ.

###### Corollary 3

The difference of the upper bounds between Equation (34) and Equation (30) is calculated as

 g(γ) =2c1L(1−α)max(0,dU2−γ2)(∑nk=0ρ2k+1)/(∑nk=0ρk+1) (35) −2c1(d0+1)∫∞γexp(−z2/M2)dz. (36)

The proof of Corollary 3 is given in A.6. Equation (36) represents the effect of setting a finite γ. The first term on the right-hand side of Equation (36) is the decrease of the upper bound obtained by setting γ as the threshold parameter. In contrast, as its demerit, the second term appears on the right-hand side. As mentioned after Lemma 1, if ∥Hθk(Yk+1)∥ is bounded, the cost of the second term disappears in a finite region. In such a case, the advantage of SRA becomes clearer. In fact, if ∥Hθk(Yk+1)∥ ≤ γ∗ holds (γ ≤ γ∗), we get the following inequality with Hoeffding's inequality (Vershynin, 2018):

 P[∥Hθk(Yk+1)∥≥z]≤exp(−z2/(γ∗−γ)2)(γ≤z≤γ∗). (37)

We then obtain the following expression for g(γ):

 g(γ) =2c1L(1−α)max(0,dU2−γ2)(∑nk=0ρ2k+1)/(∑nk=0ρk+1) (38) −2c1(d0+1)∫γ∗γexp(−z2/(γ∗−γ)2)dz. (39)

Therefore, taking γ = γ∗, the following equation holds:

 g(γ∗)=2c1L(1−α)max(0,dU2−γ∗2)(∑nk=0ρ2k+1)/(∑nk=0ρk+1). (40)

When γ∗ satisfies γ∗2<dU2, Equation (40) shows that the effect of setting the threshold to γ∗ is proportional to dU2−γ∗2.

## 5 Experiments

This section presents the experimental results of SRA. All the source codes are available at https://github.com/s-fuku/robustadapt.

### 5.1 Synthetic dataset

We generated the following one-dimensional sequence:

 xt∼f=αf1+(1−α)f2(t=1,…,20000), (41)

where

 f1 (42) μ =(μ1,μ2)={(0.5,−0.5)⊤(t≤10000),(1.0,−1.0)⊤(10001≤t≤20000), (43) σ1 =σ2=0.1. (44)

Here, f1 is a mixture of univariate normal distributions with means μ1, μ2 and standard deviations σ1, σ2, and f2 denotes the uniform distribution. We evaluated the performances of SRA and the compared algorithms using the following mean squared errors (MSE):

 (45)

where ˆμt is the estimated mean at time t, T is the sequence length (T=20000), t∗ is the change point (t∗=10001), and T0 is a transient period. MSEall, MSEbefore, and MSEafter represent the MSEs for the overall sequence, the time points before the change point, and those after it, respectively. Each MSE excludes the transient period of length T0 after t=1 and after t=t∗. We set the transient period and the number of mixture components of the GMM identically for each algorithm throughout the following experiments on synthetic datasets. For each algorithm, we initialized the parameters or the sufficient statistics with the data in the first 10 time steps.
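The three MSEs described above can be computed as follows. This is a sketch following the verbal description of Equation (45); the burn-in length `transient` is an assumed placeholder, not a value from the paper.

```python
import numpy as np

def mse_scores(mu_hat, mu_true, t_star, transient=100):
    """Overall, before-change, and after-change mean squared errors between
    the estimated and true means, excluding a transient period after t = 1
    and after the change point t_star (cf. the description of Equation (45))."""
    mu_hat, mu_true = np.asarray(mu_hat), np.asarray(mu_true)
    err = (mu_hat - mu_true) ** 2
    keep = np.ones(len(err), dtype=bool)
    keep[:transient] = False                      # burn-in after t = 1
    keep[t_star:t_star + transient] = False       # burn-in after the change
    idx = np.arange(len(err))
    before = keep & (idx < t_star)
    after = keep & (idx >= t_star)
    return err[keep].mean(), err[before].mean(), err[after].mean()
```

Excluding both transient windows means the scores measure steady-state accuracy before and after the change rather than the unavoidable adaptation lag itself.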

#### 5.1.1 Tradeoff between γ and ρ

We empirically confirmed the tradeoff between the threshold parameter γ and the step size ρ. In practice, the hyperparameters d0, M, L, and α must be tuned to determine ρ for a given γ in Equation (33). As (d0+1)/L can be regarded as one parameter, we varied it together with M and α to estimate the optimal value of ρ in Equation (33). We generated 10 data streams according to Equation (41) and evaluated the MSEs between the estimated and true means. Figure 2 shows the estimated MSEs for a range of values of γ. For each γ, the combination of the hyperparameters estimated with Equation (33) that minimized the MSE was selected. Figure 2 shows that the MSEs were minimized at an intermediate value of γ, indicating that the choice of ρ using Equation (33) is reasonable because it provides both robustness and adaptivity. We also observe from Figure 2 that the MSEs before the change point did not differ much across values of γ, whereas those after the change point did. This indicates that γ did not have much influence before the change point. In contrast, after the change point, the mean of the distribution changed, so the differences became significant; the sufficient statistics correspond to the mean in this case. The result with the smallest γ indicates a decrease of accuracy in estimating the mean caused by dropping more data points than necessary. For smaller γ, we also see a decrease of the MSEs in comparison with larger γ even before the change point, due to the influence of outliers.

#### 5.1.2 Comparison with other algorithms

We compared the performance of SRA with those of the rival algorithms. We chose the following algorithms for comparison:

• SDEM (Yamanishi et al., 2004): an online learning algorithm based on GMM. SDEM sequentially updates parameters and adapts to non-stationary changes with the discounting parameter.

• iEM (Neal and Hinton, 1999): an EM algorithm in an incremental scheme. As pointed out in (Yamanishi et al., 2004), iEM can be regarded as a version of SDEM in which the discounting parameter is set to 1/t at time t when a fresh sample is drawn at each time.

• sEM (Cappé and Moulines, 2009): a stochastic (online) EM algorithm. sEM updates the sufficient statistics in an SA scheme.

SDEM (Yamanishi et al., 2004) and sEM (Cappé and Moulines, 2009) have a discounting parameter to adapt to new data, which we chose from grids of candidate values; the parameters of SRA were set to the best combination in the previous experiment. We then evaluated the MSE between the estimated and true means as before and selected the discounting parameters of SDEM and sEM accordingly. We generated data streams using Equation (41) ten times. Table 1 shows the average MSEs for the overall sequence and for the periods before and after the change point for each algorithm. SRA was superior to the other algorithms both before and after the change point. This result indicates that SRA is better equipped with both robustness and adaptivity than the other algorithms.

#### 5.1.3 Dependency on α

We investigated the dependency of the upper bound in Equation (30) on the ratio of outliers, which is characterized by α in Equation (41). Based on Equation (30), the smaller α is, the larger the upper bound of the expectation of the mean field. In other words, the upper bound increases as the amount of noisy data increases.

For each value of α, we set the other hyperparameters to the best combination in the previous experiment, with the α-dependent quantities modified accordingly; thus ρ was also modified according to Equation (33). We generated data streams using Equation (41) ten times for each α and estimated the MSEs for the overall sequence and for the periods before and after the change point. Figure 3 shows the results. Each MSE decreased as α increased. This result is consistent with the bound in Equation (30).

### 5.2 Real dataset: change detection

We applied SRA to the Well-log dataset for change detection. This dataset is available, for example, at https://github.com/alan-turing-institute/rbocpdms/. The Well-log dataset was first studied by Ruanaidh et al. (Ruanaidh et al., 1996) and has become a benchmark dataset for univariate change detection. It is a data stream consisting of 4050 measurements of nuclear magnetic response during the drilling of a well. Although this dataset has been used in several studies (e.g., (Adams and MacKay, 2007; Levy-leduc and Harchaoui, 2008; Ruggieri and Antonellis, 2016; Fearnhead and Rigaill, 2019)), the outliers have often been removed before change detection, except in a few studies (e.g., (Fearnhead and Rigaill, 2019)). Figure 4 shows the annotated change points proposed by (Burg and Williams, 2020). There are five sets of annotated changes, each provided by a different annotator:

• Annotation 1: , , , , , , , , , , .

• Annotation 2: , , , , , , , , .

• Annotation 3: , , , , , , , , .

• Annotation 4: , .

• Annotation 5: , , , , , , , , , , , , , , , , .

We used the first 1550 data points as the training dataset and the remaining points as the test dataset. We calculated the change score at each time t from yt and the parameter estimated at time t. We evaluated the performance of each algorithm on the training dataset in terms of detection delay and overdetection, using the area under the curve (AUC) score (Fawcett and Provost, 1999; Yamanishi and Miyaguchi, 2016). The AUC score was calculated as follows: we first fixed a threshold ϵ and converted the change scores to binary alarms αt ∈ {0,1} with αt = 1(scoret ≥ ϵ), where 1(A) denotes the binary function that takes 1 if and only if A is true. We let τ be a maximum tolerant delay of change detection, fixed in advance in this experiment. When a change actually occurred at t∗, we defined the benefit of an alarm at time t as

 b(t;t∗)={1−|t−t∗|τ(0≤|t−t∗|<τ),0(otherwise). (46)

The number of false alarms was calculated as

 n(stendtstart)=m∑k=1αtk1(b(tk,t∗)=0), (47)

where tstart and tend are the starting and end time points for evaluation, respectively, and stendtstart denotes the sequence of change scores in that period. We visualized the performance by plotting the recall rate of the total benefit against the false alarm rate, with the threshold ϵ varying.
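The benefit of Equation (46) and the false-alarm count of Equation (47) can be sketched as follows for a single threshold. This is a simplified sketch: the exact normalization the paper uses for the recall and false-alarm rates is not reproduced here, so the function returns the raw totals from which those rates would be computed.

```python
def benefit(t, t_star, tau):
    """Benefit of an alarm at time t for a true change at t_star (Equation (46)):
    1 - |t - t_star| / tau inside the tolerance window, 0 outside."""
    d = abs(t - t_star)
    return 1.0 - d / tau if d < tau else 0.0

def alarm_benefit_stats(scores, t_star, tau, threshold):
    """Total benefit and number of false alarms for one threshold
    (cf. Equation (47)); normalizing these over thresholds yields the
    recall / false-alarm curve whose area is the AUC."""
    alarms = [t for t, s in enumerate(scores) if s >= threshold]
    total_benefit = sum(benefit(t, t_star, tau) for t in alarms)
    n_false = sum(1 for t in alarms if benefit(t, t_star, tau) == 0.0)
    return total_benefit, n_false
```

Sweeping `threshold` over the observed score range traces out the curve: a low threshold raises both the total benefit and the number of false alarms, and a high threshold lowers both.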

We chose SDEM (Yamanishi et al., 2004), iEM (Neal and Hinton, 1999), and sEM (Cappé and Moulines, 2009) for comparison. Each algorithm employed the univariate normal distribution. We chose the hyperparameters of each algorithm by the AUC scores on the training period: the hyperparameters γ and ρ of SRA, together with the constants in Equation (33), and the discounting parameters of SDEM and sEM were chosen from grids of candidate values. We initialized the parameters or sufficient statistics of each algorithm ten times and selected the combination that gave the best performance on average. To initialize the parameters or sufficient statistics, we drew 20 initial points from a uniform distribution whose range was determined from the training sequence.

We applied each algorithm and calculated the AUC scores on the test dataset after the parameters were determined. Table 2 shows the AUCs on the test dataset for each algorithm and annotation. We observe that SRA was superior to the other algorithms for each annotation.

For SDEM, the best combination of hyperparameters was the same for all the annotations. For sEM, Annotations 1, 2, and 3 selected one discounting parameter, while Annotations 4 and 5 each selected a different one. For SRA, Annotations 1, 2, and 3 shared one combination of γ and ρ, and Annotations 4 and 5 each selected another.