# Outlier Robust Online Learning

We consider the problem of learning from noisy data in practical settings where the size of the data is too large to store on a single machine. More challengingly, data coming from the wild may contain malicious outliers. To address the scalability and robustness issues, we present an online robust learning (ORL) approach. ORL is simple to implement and has a provable robustness guarantee -- in stark contrast to existing online learning approaches that are generally fragile to outliers. We specialize the ORL approach for two concrete cases: online robust principal component analysis and online linear regression. We demonstrate the efficiency and robustness advantages of ORL through comprehensive simulations and by predicting image tags on a large-scale data set. We also discuss the extension of ORL to distributed learning and provide experimental evaluations.


## 1 Introduction

In the era of big data, traditional statistical learning methods face two significant challenges: (1) how to scale current machine learning methods to large-scale data? And (2) how to obtain accurate inference results when the data are noisy and may even contain malicious outliers? These two important challenges naturally lead to a need for developing scalable robust learning methods.

Traditional robust learning methods generally rely on optimizing certain robust statistics [16, 21] or applying sample trimming strategies [7], whose calculations require loading all the samples into memory or going through the data multiple times [9]. Thus, the computational time of those robust learning methods is usually at least linear in the size of the sample set; for RPCA [21], the computational time also grows with the intrinsic dimension of the subspace and the ambient dimension, and for robust linear regression [3], the computational time is super-linear in the sample size. This rapidly increasing computation time becomes a major obstacle to applying robust learning methods to big data in practice, where the data size easily reaches the terabyte or even petabyte scale.

Online learning and distributed learning are natural solutions to the scalability issue. Most existing online statistical learning methods optimize a surrogate function in an online fashion, e.g., by employing stochastic gradient descent [10, 15, 8] to update the estimates, which however cannot handle outlier samples in the streaming data [12]. Similarly, most existing distributed learning approaches (e.g., MapReduce [6]) are not robust to contamination from outliers, communication errors or computation node breakdown.

In this work, we propose an online robust learning (ORL) framework to efficiently process big data with outliers while preserving robustness and statistical consistency of the estimates. The core technique is a two-level online learning procedure, one level of which employs a novel median filtering process. The robustness of the median has been investigated in statistical estimation for heavy-tailed distributions [17, 11]. However, to the best of our knowledge, this work is among the first to employ such an estimator to deal with outlier samples in the context of online learning.

The implementation of ORL follows mini-batch based online optimization, which is popular in a wide range of machine learning problems over large-scale data (e.g., deep learning, large-scale SVM). Within each mini-batch, ORL computes an independent estimate. However, outliers may be heterogeneously distributed over the mini-batches, and some mini-batches may contain overwhelmingly many outliers; the corresponding estimates can be arbitrarily bad and break down the overall online learning. Therefore, on top of such streaming estimates, ORL performs another level of robust estimation, median filtering, to obtain a reliable estimate. The ORL approach is general and compatible with many popular learning algorithms. Besides its obvious advantage of enhancing computational efficiency for handling big data, ORL incurs negligible robustness loss compared to centralized (and computationally unaffordable) robust learning methods. In fact, we provide analysis demonstrating that ORL is robust to a constant fraction of "bad" estimates generated from streaming mini-batches that are corrupted by outliers.

We specify the ORL approach for two concrete problems: online robust principal component analysis (PCA) and online robust linear regression. Comprehensive experiments on both synthetic and real large-scale datasets demonstrate the efficiency and robustness advantages of the proposed ORL approach. In addition, ORL can be adapted straightforwardly to the distributed learning setting, offering additional robustness to the corruption of several computation nodes or to communication errors, as demonstrated in the experiments.

In short, we make the following contributions in this work. First, we develop an outlier robust online learning framework, the first with provable robustness to a constant fraction of outliers. Second, we introduce two concrete online robust learning approaches, one for unsupervised learning and the other for supervised learning; other examples can be developed easily in a similar way. Finally, we present the application of the ORL approach to the distributed learning setting, which is equally attractive for learning from large-scale data.

## 2 Preliminaries

### 2.1 Problem Set-up

We consider a set of observation samples containing a mixture of authentic samples and outliers. The authentic samples are generated according to an underlying model (i.e., the ground truth) parameterized by $\theta_\star$. The target of a statistical learning procedure is to estimate the model parameter according to the provided observations. Throughout the paper, we assume the authentic samples are sub-Gaussian random vectors in $\mathbb{R}^p$, which thus satisfy

$$\mathbb{P}\left(|\langle x, u\rangle| > t\right) \le 2e^{-t^2/L^2} \quad \text{for } t > 0 \text{ and } u \in \mathcal{S}^{p-1}, \tag{1}$$

for some constant $L > 0$. Here $\mathcal{S}^{p-1}$ denotes the unit sphere in $\mathbb{R}^p$.

In this work, we focus on the case where a constant fraction of the observations are outliers, and we use $\lambda$ to denote this outlier fraction. In the context of online learning, samples are provided in a sequence of mini-batches, each of which contains $b$ observations. Denote the sequence as $X^{(1)}, \dots, X^{(T)}$. The target of online statistical learning is to estimate the parameter $\theta_\star$ based only on the observations revealed so far.

### 2.2 Geometric Median

We first introduce the geometric median here—a core concept underlying the median filtering procedure that is important for developing the proposed online robust learning approach.

###### Definition 1 (Geometric Median).

Given a finite collection of i.i.d. estimates $\theta_1, \dots, \theta_T$, their geometric median is the point which minimizes the total distance to all the given estimates, i.e.,

$$\hat\theta = \mathrm{median}(\theta_1, \dots, \theta_T) := \arg\min_{\theta \in \Theta} \sum_{j=1}^{T} \|\theta - \theta_j\|. \tag{2}$$
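Concretely, the minimizer in Eqn. (2) can be computed by the classical Weiszfeld fixed-point iteration. A minimal NumPy sketch (the iteration count and tolerance below are arbitrary choices, not taken from the paper):

```python
import numpy as np

def geometric_median(points, n_iter=200, tol=1e-8):
    """Weiszfeld fixed-point iteration for the minimizer of
    sum_j ||theta - theta_j||; `points` has shape (T, p)."""
    theta = points.mean(axis=0)                    # initialize at the mean
    for _ in range(n_iter):
        d = np.linalg.norm(points - theta, axis=1)
        d = np.maximum(d, 1e-12)                   # guard against division by zero
        w = 1.0 / d
        new_theta = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(new_theta - theta) < tol:
            return new_theta
        theta = new_theta
    return theta
```

Each iterate is a weighted mean of the points, with weights inversely proportional to current distances, so far-away points receive little influence.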

An important property of the geometric median is that it aggregates a collection of independent estimates into a single estimate with strong concentration guarantees, even in the presence of a constant fraction of outlying estimates in the collection. The following lemma, derived straightforwardly from Lemma 2.1 in [17], characterizes this robustness property of the geometric median.

###### Lemma 1.

Let $\hat\theta$ be the geometric median of the points $\theta_1, \dots, \theta_k$. Fix $\gamma \in (0, 1/2)$ and $R > 0$. Suppose there exists a subset $J \subseteq \{1, \dots, k\}$ of cardinality $|J| > (1-\gamma)k$ such that $\|\theta_j - \theta\| \le R$ for all $j \in J$ and some point $\theta$. Then we have $\|\hat\theta - \theta\| \le C_\gamma R$, where $C_\gamma$ is a constant depending only on $\gamma$.

In words, given a set of points, their geometric median will be close to the "true" point as long as at least half of them are close to it. In particular, the geometric median will not be skewed severely even if some of the points deviate significantly away from it.
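A quick numerical illustration of this robustness property: with 21 of 31 estimates close to the truth and 10 grossly corrupted, the mean is dragged far away while the geometric median stays put. The cluster and corruption scales below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 5
theta_star = np.zeros(p)

# 31 independent "estimates": 21 good ones near theta_star, 10 corrupted.
good = theta_star + 0.1 * rng.standard_normal((21, p))
bad = 50.0 + rng.standard_normal((10, p))
estimates = np.vstack([good, bad])

# geometric median via a plain Weiszfeld loop (same objective as Eqn. (2))
theta = estimates.mean(axis=0)
for _ in range(200):
    d = np.maximum(np.linalg.norm(estimates - theta, axis=1), 1e-12)
    theta = (estimates / d[:, None]).sum(axis=0) / (1.0 / d).sum()

mean_err = np.linalg.norm(estimates.mean(axis=0) - theta_star)
median_err = np.linalg.norm(theta - theta_star)
# the mean is dragged toward the outliers; the median stays near theta_star
```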

## 3 Online Robust Learning

In this section, we present how to scale up robust learning algorithms to process large-scale data (containing outliers) through online learning without losing robustness. We term the proposed approach online robust learning (ORL).

The idea behind ORL is intuitive: instead of equally incorporating the estimate generated at each time step, ORL aggregates the estimates sequentially generated by mini-batch based learning methods via an online computation of the robust geometric median. Basically, ORL runs online learning at two levels. At the bottom level, ORL employs an appropriate robust learning procedure (e.g., a robust PCA algorithm on a mini-batch of samples) to obtain a sequence of estimates $\theta_1, \dots, \theta_T$ of $\theta_\star$ based on the observation mini-batches $X^{(1)}, \dots, X^{(T)}$. At the top level, ORL updates the running estimate $\hat\theta_t$ through a geometric median filtering algorithm (explained later) over the estimates seen so far and outputs a robust estimate $\hat\theta_T$ after going through all the mini-batches. Intuitively, according to Lemma 1, as long as a majority of the mini-batch estimates are not skewed by outliers, the produced $\hat\theta_T$ will be robust and accurate. This new two-level robust learning gives ORL stronger robustness to outliers compared with ordinary online learning.

To develop the top-level geometric median filtering procedure, recall the definition of the geometric median in (2). A natural estimate of the geometric median is the minimizer of the following empirical loss function:

$$\hat\theta_T = \arg\min_{\theta \in \Theta}\Big\{\hat G_T(\theta) \triangleq \frac{1}{T}\sum_{i=1}^{T}\|\theta_i - \theta\|\Big\}. \tag{3}$$

The empirical function $\hat G_T$ is differentiable everywhere except at the points $\theta = \theta_i$, and can be optimized by applying stochastic gradient descent (SGD) [1]. More concretely, at time step $t$, given a new estimate $\theta_{t+1}$ (based on the $(t+1)$-st mini-batch) and the current estimate $\hat\theta_t$, ORL computes the gradient of the empirical function in Eqn. (3) evaluated only at $\theta_{t+1}$:

$$\hat g(\theta; \theta_{t+1}) \triangleq \frac{\partial \hat G_T(\theta; \theta_{t+1})}{\partial \theta} = \frac{2(\theta - \theta_{t+1})}{\|\theta - \theta_{t+1}\|}. \tag{4}$$

Then ORL updates the estimate by the following filtering step:

$$\hat\theta_{t+1} \leftarrow \hat\theta_t - \eta_t\,\hat g(\hat\theta_t; \theta_{t+1}) = (1 - w_t)\hat\theta_t + w_t\theta_{t+1}. \tag{5}$$

Here $\eta_t$ is a predefined step size which usually decays inversely with $t$, scaled by a constant characterizing the convexity of the empirical function to optimize. Accordingly, the weight $w_t = 2\eta_t / \|\hat\theta_t - \theta_{t+1}\|$ controls the contribution of each new estimate conservatively in updating the global estimate $\hat\theta_t$. Details of ORL are provided in Algorithm 1.
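The filtering update (5) can be sketched as follows. This is a simplified illustration rather than the paper's Algorithm 1: the step-size constant `c` is a placeholder, and the constant factor in the gradient (4) is absorbed into the step size. Because each new estimate moves the running estimate by at most $\eta_t$, a single corrupted estimate has bounded influence:

```python
import numpy as np

def median_filter_stream(estimates, c=1.0):
    """Online geometric-median filtering, a sketch of the update in Eqn. (5).

    `estimates` is the stream of per-mini-batch estimates theta_1, theta_2, ...;
    `c` is a hypothetical step-size constant (an assumption of this sketch)."""
    theta = np.asarray(estimates[0], dtype=float).copy()
    for t, new in enumerate(estimates[1:], start=1):
        eta = 1.0 / (c * (t + 1))              # decaying step size eta_t
        dist = np.linalg.norm(new - theta)
        if dist < 1e-12:
            continue                           # zero sub-gradient at theta = theta_i
        w = min(eta / dist, 1.0)               # adaptive weight w_t
        theta = (1 - w) * theta + w * new      # the filtering update (5)
    return theta
```

Note that the running estimate moves toward each new estimate by at most the step size, which is exactly what filters out occasional wildly corrupted estimates.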

This second level of filtering is important because certain mini-batches may contain overwhelmingly many outliers. Therefore, even though a robust learning procedure is employed on each mini-batch, the resulting estimate cannot be guaranteed to be accurate. In fact, a mini-batch containing more outliers than the breakdown point of the procedure would corrupt any robust learning procedure: the resulting estimate can be arbitrarily bad and break down the overall online learning. To address this critical issue, ORL performs another level of online learning that updates the "global" estimate with adaptive weights on each new estimate and thereby "filters out" possibly bad estimates.

## 4 Performance Guarantees

We provide the performance guarantees for ORL in this section. Throughout this section, we use the following asymptotic inequality notation: for positive numbers $a$ and $b$, the asymptotic inequality $a \lesssim_\delta b$ means that $a \le C_\delta b$, where $C_\delta$ is a constant depending only on $\delta$. Suppose $N$ samples, a constant fraction of which are authentic and have sub-Gaussian distributions as specified in (1) for some $L$, are evenly divided into $T$ mini-batches, with outlier fractions $\lambda_1, \dots, \lambda_T$ on the mini-batches respectively. Let $\theta_1, \dots, \theta_T$ be a collection of independent estimates of $\theta_\star$ output by implementing the robust learning procedure on the mini-batches independently. We assume each estimate $\theta_i$, i.e., the robust learning procedure, provides the following composite deviation bound:

$$\mathbb{P}\left(\|\theta_i - \theta_\star\| \lesssim_{\delta, L} \sqrt{\frac{1}{b}} + \frac{\lambda_i}{1 - \lambda_i}\sqrt{p}\right) \ge 1 - \delta, \tag{6}$$

where $b$ is the size of each mini-batch, whose value can be tuned according to the desired accuracy (e.g., through data augmentation). We will specify the value of the hidden constant, depending on $\delta$ and $L$, explicitly in concrete applications. The above bound indicates that the estimation error depends on the standard statistical error and the outlier fraction. If $\lambda_i$ is overwhelmingly large, the estimate $\theta_i$ can be arbitrarily bad.

We now proceed to demonstrate that the ORL approach is robust to outliers: even if the estimates obtained on a constant fraction of mini-batches are bad, ORL can still provide a reliable estimate with bounded error. Given the sequence of estimates produced internally in ORL, we analyze and guarantee the performance of ORL through the following two steps. We first demonstrate that the geometric median loss function is in fact strongly convex, and thus geometric median filtering provides a good estimate of the "true" geometric median of the estimates. Then we derive the following performance guarantee for ORL by invoking the robustness property of the geometric median.

###### Proposition 1.

Suppose in total $N$ samples, a constant fraction of which have sub-Gaussian distribution as in (1), are divided into $T$ sequential mini-batches of size $b = N/T$ with outlier fractions $\lambda_1, \dots, \lambda_T$. We run a base robust learning algorithm having a deviation bound as in (6) on each mini-batch. Denote the ground truth of the parameter to estimate as $\theta_\star$ and the output of ORL (Alg. 1) as $\hat\theta_T$. Then with probability at least $1 - \delta$, $\hat\theta_T$ satisfies:

$$\|\hat\theta_T - \theta_\star\| \lesssim_{\delta, L, p, \gamma} \frac{\log(\log(T)) + 1}{T} + \sqrt{\frac{1}{b}} + \lambda(\gamma)\sqrt{p}.$$

Here $\gamma \in (0, 1/2)$, and $\lambda(\gamma)$ denotes the smallest outlier fraction such that at least $(1-\gamma)T$ of the mini-batches have outlier fraction at most $\lambda(\gamma)$.

The above result demonstrates that the estimation error of ORL consists of two components. The first term accounts for the deviation between the solution and the "true" geometric median of the sequential estimates. When $T$ is sufficiently large, i.e., after ORL has seen sufficiently many mini-batches of observations, this error vanishes at a rate of $(\log\log(T) + 1)/T$. The second term captures the deviation of the geometric median of the estimates from the ground truth. The significant part of this result is that the error of ORL depends only on the smallest outlier fraction among a majority of the mini-batches, no matter how severely the other estimates are corrupted. This explains why ORL is robust to outliers in the samples.

## 5 Application Examples

In this section, we provide two concrete examples of the ORL approach: an unsupervised learning algorithm, principal component analysis (PCA), and a supervised one, linear regression (LR). Both algorithms are popular in practice, but online versions with robustness guarantees have been absent. Finally, we also discuss an extension of ORL to distributed robust learning.

### 5.1 Online Robust PCA

Classical PCA is known to be fragile to outliers, and many robust PCA methods have been proposed so far (see [21] and references therein). However, most of those methods require loading all the data into memory and have computational cost (super-)linear in the sample size, which prevents them from being applied to big data. In this section, we first develop a new robust PCA method which robustifies PCA via a robust sample covariance matrix estimation, and then demonstrate how to implement it with the ORL approach to enhance efficiency.

Given a sample matrix $X$, the standard sample covariance matrix is computed from the inner products of the feature vectors, i.e., its $(j, k)$-th entry is proportional to $\langle x^{(j)}, x^{(k)}\rangle$, where $x^{(j)}$ denotes the $j$-th row vector of the matrix $X$. To obtain a robust estimate of the covariance matrix, we replace the vector inner product by a trimmed inner product, as proposed in [4] for linear regressor estimation. Intuitively, the trimmed inner product removes the outliers having large magnitude, and the remaining outliers are bounded by inliers. Thus, the obtained covariance matrix, after proper symmetrization, is close to the authentic sample covariance. The calculation of the trimmed inner product for a robust estimation of the sample covariance matrix is given in Algorithm 2.
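A minimal sketch of this entrywise robust covariance estimate, assuming a simplified trimming rule (drop the largest-magnitude coordinate products); the exact trimming rule and normalization of Algorithm 2 may differ:

```python
import numpy as np

def trimmed_inner_product(u, v, n_trim):
    """Drop the n_trim coordinate-wise products with largest magnitude,
    then sum the rest (a hypothetical, simplified trimming rule)."""
    prods = u * v
    keep = np.argsort(np.abs(prods))[: len(prods) - n_trim]
    return prods[keep].sum()

def robust_covariance(X, trim_frac=0.1):
    """Entrywise robust covariance: entry (j, k) is a trimmed inner product
    of feature columns j and k over the n samples, then symmetrized.
    X has shape (n, p); dividing by n is a simplification of this sketch."""
    n, p = X.shape
    n_trim = int(trim_frac * n)
    C = np.empty((p, p))
    for j in range(p):
        for k in range(p):
            C[j, k] = trimmed_inner_product(X[:, j], X[:, k], n_trim) / n
    return (C + C.T) / 2.0
```

On data where a small fraction of rows are grossly corrupted, the trimmed estimate stays near the authentic covariance while the naive estimate blows up.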

Then we perform a standard eigenvector decomposition on the covariance matrix to produce the principal component estimates. The details of the new Robust Covariance PCA (RC-PCA) algorithm are provided in Algorithm 3.

Applying the proposed ORL approach on top of RC-PCA yields a new online robust PCA algorithm, called ORL-PCA, as explained in Algorithm 4.

Based on the above result, along with Proposition 1, we provide the following performance guarantee for ORL-PCA.

###### Theorem 1.

Suppose $N$ samples are divided into $T$ mini-batches of size $b$, and the authentic samples satisfy the sub-Gaussian distribution with parameter $L$. Let $\lambda(\gamma)$ be the smallest outlier fraction among a $(1-\gamma)$ fraction of the mini-batches, as in Proposition 1. Let $\hat P_U^{(T)}$ denote the projection operator given by ORL-PCA, and let $P_U^\star$ denote the projection operator onto the ground-truth $d$-dimensional subspace. Then, with probability at least $1 - \delta$, we have

$$\|\hat P_U^{(T)} - P_U^\star\|_F \le C_a\frac{\log(\log(T)/\delta) + 1}{T} + c_1 p\sqrt{\frac{d\log(1/\delta)}{b}} + c_2\lambda(\gamma)\sqrt{d}\,p.$$

Here $C_a$, $c_1$ and $c_2$ are positive constants.

### 5.2 Online Robust Linear Regression

We then showcase another application of ORL: online robust linear regression. As aforementioned, the target of linear regression is to estimate the parameter $\theta_\star$ of the linear regression model $y = \langle x, \theta_\star\rangle + e$ given the observation pairs $(x_i, y_i)$, where a fraction of the samples are corrupted. Here $e$ is additive noise. Similar to ORL-PCA, we use the robustified thresholding regression (RoTR) proposed in Algorithm 5 (ref. [4]) as the robust learning procedure for parameter estimation within each mini-batch.

Due to the blessing of the online robust learning framework, ORL-LR has the following performance guarantee.

###### Theorem 2.

Adopt the notations in Theorem 1. Suppose the authentic samples have the sub-Gaussian distribution as in (1) with noise level $\sigma_e$, and the samples are divided into $T$ sequential mini-batches. Let $\hat\theta_T$ be the output of ORL-LR and $\theta_\star$ be the ground truth. Then, with probability at least $1 - \delta$, the following holds:

$$\|\hat\theta_T - \theta_\star\|_2 \le C_a\frac{\log(\log(T)/\delta) + 1}{T} + C_\gamma\|\theta_\star\|_2\sqrt{1 + \frac{\sigma_e^2}{\|\theta_\star\|_2^2}}\left(\sqrt{\frac{p\log(1/\delta)}{b}} + \lambda(\gamma)\sqrt{p\log\left(\frac{1}{\delta}\right)}\right).$$
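A toy end-to-end sketch of the ORL-LR pipeline described above: one estimate per mini-batch, aggregated by the median filtering update of Eqn. (5). Plain least squares stands in for the RoTR base learner here (an assumption of this sketch), so the robustness demonstrated comes entirely from the top-level filtering:

```python
import numpy as np

def orl_lr(stream, c=1.0):
    """Sketch of ORL for linear regression. `stream` yields (X, y) mini-batches;
    plain least squares replaces the RoTR base learner of Algorithm 5, and
    `c` is a hypothetical step-size constant."""
    theta = None
    for t, (X, y) in enumerate(stream):
        est, *_ = np.linalg.lstsq(X, y, rcond=None)   # per-batch estimate
        if theta is None:
            theta = est                               # initialize with first batch
            continue
        eta = 1.0 / (c * (t + 1))                     # decaying step size
        dist = np.linalg.norm(est - theta)
        if dist < 1e-12:
            continue
        w = min(eta / dist, 1.0)                      # adaptive weight, cf. Eqn. (5)
        theta = (1 - w) * theta + w * est
    return theta
```

Even when a constant fraction of mini-batches are grossly corrupted, the filtered estimate stays close to the ground truth, whereas a naive average of the per-batch estimates would be dragged away.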

### 5.3 Distributed Robust Learning

Following the spirit of ORL, we can also develop a distributed robust learning (DRL) approach. Suppose that in a distributed computing platform, $m$ machines are available for parallel computation. Then, for processing a large-scale dataset, one can evenly distribute the samples onto the $m$ machines and run the robust learning procedure in parallel. Each machine provides an independent estimate of the parameter of interest $\theta_\star$. Aggregating these estimates via the geometric median (ref. Eqn. (2)) provides additional robustness to inaccuracy, breakdown and communication errors on a fraction of the machines in the computing cluster, as stated in Lemma 1. Of particular interest, DRL can provide much stronger robustness than the commonly used averaging of the estimates, since the average or mean is notoriously fragile to corruption: even a single corrupted estimate out of the $m$ estimates can make the final estimate arbitrarily bad.

## 6 Simulations

In this section, we investigate the robustness of the ORL approach by evaluating the ORL-PCA and ORL-LR algorithms and comparing them with their centralized and non-robust counterparts. We also perform a similar investigation on DRL (ref. Section 5.3), since robustness is also critical for distributed learning in practice. In the simulations, we report results across a range of values of the outlier fraction $\lambda$, computed as the ratio of outliers to all samples.

Data generation In simulations of the PCA problems, samples are generated according to $x = Uz + e$. Here the signal $z$ is sampled from a normal distribution, and the noise $e$ is Gaussian. The underlying matrix $U$ is randomly generated, with its columns then orthogonalized. The entries of outliers are i.i.d. random variables from a uniform distribution. We use the distance between two projection matrices to measure the subspace estimation error for PCA: $\|\hat P_U - P_U^\star\|$, where $\hat P_U$ is the output estimate and $P_U^\star$ is the ground truth.
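The PCA simulation setup above can be sketched as follows; all distribution parameters (noise scale, uniform range, outlier fraction) are placeholder choices, since the paper's exact values are not recoverable from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
p, d, n, lam = 50, 5, 1000, 0.2   # ambient dim, intrinsic dim, samples, outlier fraction
n_auth = int((1 - lam) * n)

# authentic samples x = U z + e, with U an orthonormal basis of the subspace
U, _ = np.linalg.qr(rng.standard_normal((p, d)))
Z = rng.standard_normal((n_auth, d))
X_auth = Z @ U.T + 0.01 * rng.standard_normal((n_auth, p))

# outliers with i.i.d. uniform entries (the range is an arbitrary choice)
X_out = rng.uniform(-5, 5, size=(n - n_auth, p))
X = np.vstack([X_auth, X_out])

def subspace_error(U_hat, U_true):
    """Distance between projection matrices, the PCA error metric above."""
    return np.linalg.norm(U_hat @ U_hat.T - U_true @ U_true.T)
```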

In simulations of the LR problems, samples are generated according to $y = \langle x, \theta_\star\rangle + e$. Here the model parameter $\theta_\star$ is randomly sampled from a normal distribution, and $x$ is also sampled from a normal distribution. The noise $e$ is Gaussian. The entries of outlier samples are also i.i.d. from a uniform distribution, and the responses of outliers are generated correspondingly. We use $\|\hat\theta - \theta_\star\|$ to measure the error, where $\hat\theta$ is the output estimate.

Online Setting Results shown in Figure 1(a) give the following observations. First, ORL-PCA converges to performance comparable to batch RC-PCA, which accesses the entire data set; this demonstrates the rapid convergence of ORL-PCA. It is worth noting that ORL-PCA incurs considerably lower memory cost and computation time than batch RC-PCA, since ORL-PCA performs SVD on much smaller data. Second, ORL-PCA offers much stronger robustness than naive averaged aggregation when the outlier order is adversarial and corrupts a fraction of the mini-batches. As shown in Figure 1(a), when some batches have overwhelmingly many outliers, base RC-PCA fails on these batches and outputs completely corrupted estimates. The corruption of mini-batches also breaks online-averaging RPCA. In contrast, ORL-PCA still offers correct estimation, even when a constant fraction of the estimates from mini-batches are corrupted. We also report the results of ORL-LR and the comparison with online-averaging baselines in Figure 1(b). Similar to ORL-PCA, one observes that ORL-LR offers superior robustness to sample outliers and batch corruptions, in contrast to the naive averaging algorithm.

Distributed setting All the simulations are implemented on a single quad-core PC. Centralized RPCA takes substantially longer to handle the full sample set than distributed RPCA, which uses parallel procedures. The communication cost here is negligible since only eigenvector matrices of small size are communicated. For RLR simulations, we also observe a similar efficiency enhancement.

As for the performance, from Fig. 1(c) we observe that for moderate outlier fractions, DRL-RPCA, RPCA with division-averaging (Div.-Avg. RPCA) and centralized RPCA (i.e., RC-PCA) achieve similar performance, which is much better than non-robust standard PCA. When $\lambda = 0$, i.e., when there are no outliers, the performance of DRL-RPCA and Div.-Avg. RPCA is slightly worse than standard PCA, as the quality of each mini-batch estimate deteriorates due to the smaller sample size. However, the distributed algorithms offer significantly higher efficiency. Similar observations hold for the LR simulations in Fig. 1(d). In fact, standard PCA and LR begin to break down at small outlier fractions. These results demonstrate that DRL preserves the robustness of centralized algorithms well.

When the outlier fraction increases further, the centralized (blue lines) and division-averaging algorithms (green lines) break down sharply, as the outliers exceed their maximal breakdown point. In contrast, DRL-RPCA and DRL-RLR still present strong robustness and perform much better, which demonstrates that the DRL framework is indeed robust to computing nodes breaking down, and even enhances the robustness of the base robust learning methods under favorable outlier distributions across the machines.

Comparison with Averaging Taking the average instead of the geometric median is a natural alternative to DRL. Here we provide further simulations on the RPCA problem to compare these two aggregation strategies in the presence of different errors on the computing nodes.

In distributed computation of learning problems, besides outliers, significant performance deterioration may result from unreliability, such as latency of some machines or communication errors. For instance, it is not uncommon for machines to solve their own sub-problems at different speeds, and sometimes users may need to stop the learning before all the machines output their final results. In this case, results from the slow machines may not be accurate enough and can hurt the quality of the aggregated solution. Similarly, communication errors may also damage the overall performance. We simulate machine latency by stopping the algorithms once over half of the machines finish their computation. To simulate communication errors, we randomly sample estimates and flip the signs of a fraction of their elements. The estimation errors of the solutions aggregated by averaging and by DRL are given in Table 1. Clearly, DRL offers stronger resilience to unreliability of the computing nodes.

Real large-scale data We further apply ORL-LR to an image tag prediction problem on a large-scale image set, the Flickr-10M image set, which contains images with noisy user-contributed tags. We employ robust linear regression to predict semantic tags for each image, where each image is described by a deep CNN feature (the output of the fc6 layer) [13]. Performing such a large-scale regression task is impossible on a single PC, as merely storing the features already costs a substantial amount of memory. Therefore, we solve this problem via the proposed online and distributed learning algorithms. We randomly sample a training set and a test set from the entire dataset.

We perform experiments in the online learning setting and compare the performance of the proposed ORL-LR with Online Averaging LR. We also implement a non-robust baseline, stochastic gradient descent, to solve the LR problem. The size of each mini-batch is fixed. From the results in Table 2, one can observe that ORL-LR achieves significantly higher accuracy than the non-robust baseline algorithms, by a substantial margin.

## 7 Proofs

### 7.1 Technical Lemmas

###### Lemma 2 (Hoeffding’s Inequality).

Let $X_1, \dots, X_n$ be independent random variables taking values in $[0, 1]$. Let $\bar X_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $\mu = \mathbb{E}[\bar X_n]$. Then for $t > 0$,

$$\mathbb{P}\left(\bar X_n - \mu \ge t\right) \le \exp\{-2nt^2\}.$$
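A quick numerical sanity check of the bound for Bernoulli(1/2) variables (the sample size and number of trials are arbitrary choices): the empirical tail probabilities stay below $\exp\{-2nt^2\}$.

```python
import numpy as np

# Empirical check of Hoeffding's bound for Bernoulli(1/2) variables.
rng = np.random.default_rng(0)
n, trials = 500, 20000
means = rng.binomial(n, 0.5, size=trials) / n    # 20000 draws of the sample mean

def hoeffding_bound(n, t):
    return np.exp(-2 * n * t ** 2)

emp_tail_small = np.mean(means - 0.5 >= 0.02)    # empirical tail at t = 0.02
emp_tail_large = np.mean(means - 0.5 >= 0.15)    # empirical tail at t = 0.15
```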
###### Lemma 3 (A coupling result [14]).

Let $B_1, \dots, B_k$ be independent random variables taking values in $\{0, 1\}$, let $q$ be a real number such that $\mathbb{P}(B_j = 1) \le q$ for all $j = 1, \dots, k$, and let $S = \sum_{j=1}^{k} B_j$. Let $W$ be a random variable with binomial law $\mathcal{B}(k, q)$. There exists a coupling $(W', S')$ such that $W'$ has the same distribution as $W$, $S'$ has the same distribution as $S$, and $S' \le W'$. In particular, for all $t$, $\mathbb{P}(S > t) \le \mathbb{P}(W > t)$.

The following lemma demonstrates that aggregating estimates via their geometric median can enhance the confidence significantly.

###### Lemma 4.

Given $k$ independent estimates $\theta_1, \dots, \theta_k$ of $\theta_\star$ satisfying $\mathbb{P}(\|\theta_j - \theta_\star\| \le R) \ge p_*$ for all $j = 1, \dots, k$, let $\hat\theta = \mathrm{median}(\theta_1, \dots, \theta_k)$. Then we have

$$\mathbb{P}\left(\|\hat\theta - \theta_\star\| > C_\gamma R\right) \le \exp\left\{-2k(\gamma - 1 + p_*)^2\right\}.$$

Here $\gamma$ and $C_\gamma$ are defined in Lemma 1.

###### Proof.

According to Lemma 1, we have

$$\mathbb{P}\left(\|\hat\theta - \theta_\star\| \ge C_\gamma R\right) \le \mathbb{P}\left(\sum_{j=1}^{k}\mathbb{1}\left(\|\theta_j - \theta_\star\| \ge R\right) \ge \gamma k\right).$$

Let $B_j = \mathbb{1}\{\|\theta_j - \theta_\star\| > R\}$, and let $W$ be a random variable with binomial law $\mathcal{B}(k, 1 - p_*)$; then

$$\mathbb{P}\left(\sum_{j=1}^{k}\mathbb{1}\{\|\theta_j - \theta_\star\| > R\} > \gamma k\right) \le \mathbb{P}(W > \gamma k),$$

according to Lemma 3. Applying Hoeffding's inequality in Lemma 2 to $W$ gives

$$\mathbb{P}\left(\|\hat\theta - \theta_\star\| > C_\gamma R\right) \le \mathbb{P}\left(\sum_{j=1}^{k}\mathbb{1}\{\|\theta_j - \theta_\star\| > R\} > \gamma k\right) \le \mathbb{P}(W > \gamma k) \le \exp\left\{-2k(\gamma - 1 + p_*)^2\right\}. \qquad\blacksquare$$

### 7.2 Proofs of Main Results on ORL

#### 7.2.1 Conditions

We suppose from now on that the following conditions hold.

###### Condition 1.

Assume that $\theta_\star$ is the parameter of interest. Let $\theta_1, \dots, \theta_k$ be a collection of independent estimates of $\theta_\star$ which are not concentrated on a straight line: for every direction $v$, there exists a direction $w$ with $\langle v, w\rangle = 0$ along which the estimates have strictly positive variance.

As noted in [2], Condition 1 ensures that the geometric median of the estimates is uniquely defined.

###### Condition 2.

The distribution of the independent estimates of $\theta_\star$ is a mixture of two "nice" distributions. The first component is not strongly concentrated around single points: if $B(0, a)$ is the ball of radius $a$ centered at the origin, and $Y$ is a random variable with the first component's distribution, then for any constant $a$,

$$\exists\, C_a \in [0, \infty),\ \forall u \in B(0, a),\quad \mathbb{E}_Y\left[\|Y - u\|^{-1}\right] \le C_a.$$

In addition, the second component is a discrete measure, i.e., a finite mixture of Dirac measures. We denote its support by $\mathcal{S}$ and assume that the geometric median does not lie in $\mathcal{S}$.

Conditions 1 and 2 are only technical conditions to avoid pathologies in the convergence analysis for Algorithm 1. In practical implementations, when the running estimate coincides with one of the estimates $\theta_i$, we can simply set the sub-gradient of $\hat G$ at that point to zero (a valid sub-gradient, as proved in [2]).

#### 7.2.2 Convergence Rate of Geometric Median Filtering

Given the definition of the geometric median in (2), we can define the following population geometric median loss function $G(u)$, which we minimize to compute the geometric median:

$$G(u) \triangleq \mathbb{E}\left[\|\Theta - u\| - \|\Theta\|\right]. \tag{7}$$

In this subsection, we first show that the geometric median loss function in (7) is indeed strongly convex under Conditions 1 and 2. Thus SGD optimization provides solutions converging to the true geometric median at a rate of order $1/T$ (up to logarithmic factors), given $T$ independent estimates.

###### Definition 2 (β-strongly convex function [18]).

A function $G$ is $\beta$-strongly convex if, for all $u_1, u_2$ and any sub-gradients $g(u_1), g(u_2)$ of $G$ at $u_1$ and $u_2$, we have

$$\langle g(u_2) - g(u_1), u_2 - u_1\rangle \ge \beta\|u_2 - u_1\|^2.$$

The following theorem establishes the strong convexity of the geometric median function in (7).

###### Theorem 3.

Let $g(u)$ denote a sub-gradient of $G$ at $u$. Under Conditions 1 and 2, there is a strictly positive constant $c_a$ such that

$$\forall u_1, u_2 \in B(0, a),\quad \langle g(u_2) - g(u_1), u_2 - u_1\rangle \ge c_a\|u_2 - u_1\|^2,$$

and thus $G$ is $c_a$-strongly convex on $B(0, a)$.

The proof can be derived straightforwardly from the proof of Proposition 2.1 in [2], and we omit the details here.

Given the strong convexity of the geometric median loss function $G$, we can apply the convergence argument of SGD for strongly convex functions (e.g., Proposition 1 in [19]) and obtain the following convergence rate for online geometric median filtering.

###### Theorem 4.

Assume Conditions 1 and 2 hold, and all the estimates lie in the ball $B(0, a)$. Pick the step size $\eta_t$ in Algorithm 1 as described in Section 3, and let $\hat\theta_t$ denote the output at time step $t$. Furthermore, let $\hat\theta$ be the geometric median of $\theta_1, \dots, \theta_t$. Then for any $\delta \in (0, 1)$,

$$\|\hat\theta_t - \hat\theta\| \le \frac{C'_a(\log(\log(t)/\delta) + 1)}{t},$$

with probability at least $1 - \delta$. Here $C'_a$ is a constant depending on $a$.

The bound on the gradient follows from the definition of the gradient in (4), Condition 2, and the assumption that all the estimates are bounded.

#### 7.2.3 Proof of Proposition 1

From now on, we slightly abuse notation and use $\tilde\theta$ to denote the geometric median of a collection of estimates.

###### Proof.

Proposition 1 can be derived from the following triangle inequality:

$$\|\hat\theta_t - \theta_\star\| \le \|\hat\theta_t - \tilde\theta\| + \|\tilde\theta - \theta_\star\|,$$

where $\tilde\theta$ denotes the "true" geometric median of the estimates $\theta_1, \dots, \theta_t$. We now proceed to bound the above two terms separately. Based on Theorem 4, we have

$$\|\hat\theta_t - \tilde\theta\| \le \frac{C'_a(\log(\log(t)/\delta) + 1)}{t},$$

with probability at least $1 - \delta$. The second term can be bounded as follows by applying Lemma 4:

$$\mathbb{P}\left(\|\tilde\theta - \theta_\star\| \lesssim_{\delta, L} \sqrt{\frac{1}{b}} + \lambda(\gamma)\sqrt{p}\right) \ge 1 - \delta,$$

where $\gamma \in (0, 1/2)$ and $\lambda(\gamma)$ denotes the smallest outlier fraction as defined in Proposition 1. Combining these two bounds together gives

$$\|\hat\theta_t - \theta_\star\| \lesssim_{\delta, L} C'_a\frac{\log(\log(t)/\delta) + 1}{t} + C_\gamma\sqrt{\frac{1}{b}} + C'\lambda(\gamma)\sqrt{p}. \qquad\blacksquare$$

### 7.3 Proofs of Application Examples

Before proving the performance guarantees for ORL-PCA and ORL-LR, we provide robustness analysis for the base robust learning procedures, RC-PCA and RoTR.

#### 7.3.1 Robustness Guarantee of RC-PCA

###### Theorem 5.

Suppose in total $n$ samples are provided, with $n - n_1$ authentic samples and $n_1$ outliers, and let $\lambda = n_1/n$. Assume the authentic samples follow a sub-Gaussian design with parameter $L$. Let $\sigma_1$ denote the largest eigenvalue of the ground-truth sample covariance matrix $C^\star$. Let $P_U$ be the output $d$-dimensional subspace projector from RC-PCA. Then for a constant $c$, we have with probability $1 - \delta$,

$$\|P_U - P_U^\star\|_\infty \le \frac{2L}{\Delta_d}\left\{\sqrt{4c\log\left(\frac{4}{\delta}\right)}\sqrt{\frac{p}{n}} + \frac{\lambda}{1-\lambda}\log\left(\frac{2}{\delta}\right)\right\},$$

where $\Delta_d$ is the eigenvalue gap defined in the proof below.
###### Proof.

According to the proof of Theorem 4 in [4] and the deviation bound on empirical covariance matrix estimation in [20], when the authentic samples are from a sub-Gaussian distribution with parameter $L$, we have, for the covariance matrix $\hat C$ constructed in Algorithm 3,

$$\|\hat C - C^\star\|_\infty \le L\sqrt{4c\log\left(\frac{4}{\delta}\right)}\sqrt{\frac{p}{n}} + \frac{n_1}{n}L\log\left(\frac{2}{\delta}\right)$$

with probability at least $1 - \delta$. Here $c$ is a constant, $n - n_1$ is the number of authentic samples and $n_1$ is the number of outliers.

Let $\Delta_d = \sigma_d - \sigma_{d+1}$ be the eigenvalue gap, where $\sigma_d$ denotes the $d$-th largest eigenvalue of $C^\star$. Then, applying the Davis-Kahan perturbation theorem [5], whenever the perturbation $\|\hat C - C^\star\|_\infty$ is sufficiently small relative to the gap, the projector error is bounded by the covariance error scaled by $2/\Delta_d$. Thus,

$$\|P_U - P_U^\star\|_\infty \le \frac{2L}{\Delta_d}\left\{\sqrt{4c\log\left(\frac{4}{\delta}\right)}\sqrt{\frac{p}{n}} + \frac{n_1}{n}\log\left(\frac{2}{\delta}\right)\right\},$$

with probability at least $1 - \delta$. ∎

#### 7.3.2 Proof of Theorem 1

###### Proof.

Theorem 1 can be derived directly from the following triangle inequality:

$$\|\hat P_U^{(T)} - P_U^\star\|_F \le \|\hat P_U^{(T)} - \tilde P_U\|_F + \|\tilde P_U - P_U^\star\|_F,$$

and we bound the above two terms separately. The first term can be bounded by Theorem 4 as

$$\|\hat P_U^{(T)} - \tilde P_U\|_F \le \frac{C'_a(\log(\log(T)/\delta) + 1)}{T},$$

with probability $1 - \delta$. The second term can be bounded as in Theorem 5: with probability $1 - \delta$,

$$\|\tilde P_U - P_U^\star\|_F \le c_1 p\sqrt{\frac{d\log(1/\delta)}{N}} + c_2\lambda(\gamma)\sqrt{d}\,p.$$

Combining the above two bounds (with a union bound) proves the theorem. ∎

#### 7.3.3 Proof of Theorem 2

Before proving Theorem 2, we first state the following performance guarantee for the RoTR algorithm from [4]. The estimation error of RoTR is bounded as in Lemma 5.

###### Lemma 5 (Performance of RoTR [4]).

Suppose the samples are from a sub-Gaussian design with parameter $L$, with dimension $p$ and noise level $\sigma_e$. Then with probability at least $1 - \delta$, the output $\hat\theta$ of RoTR satisfies the bound

$$\|\hat\theta - \theta_\star\|_2 \le c\|\theta_\star\|_2\sqrt{1 + \frac{\sigma_e^2}{\|\theta_\star\|_2^2}}\left(\sqrt{\frac{p\log(1/\delta)}{n}} + \frac{\lambda}{1-\lambda}\sqrt{p\log(1/\delta)}\right).$$

Here $c$ is a constant independent of the problem parameters.

###### Proof.

Based on the results in Lemma 5 and Lemma 4, it is straightforward to get

$$\|\tilde\theta - \theta_\star\|_2 \le C'_\gamma\|\theta_\star\|_2\sqrt{1 + \frac{\sigma_e^2}{\|\theta_\star\|_2^2}}\left(\sqrt{\frac{p\log(1/\delta)}{N}} + \lambda(\gamma)\sqrt{p\log(1/\delta)}\right),$$

where $C'_\gamma = c\,C_\gamma$ with $c$ being the constant in Lemma 5, and $\lambda(\gamma)$ is the smallest outlier fraction among the mini-batches as defined in Proposition 1. ∎

As in the proof of Theorem 1, Theorem 2 can then be derived based on the results in Theorem 4. For simplicity, we omit the details here.

## 8 Conclusions

We developed a generic Online Robust Learning (ORL) approach with a provable robustness guarantee, and we also demonstrated its application to Distributed Robust Learning (DRL). The proposed approaches not only significantly enhance the time and memory efficiency of robust learning but also preserve the robustness of the centralized learning procedures. Moreover, when the outliers are not uniformly distributed, the proposed approaches remain robust to adversarial outlier distributions. We provided two concrete examples: online and distributed robust principal component analysis and linear regression.