# Choosing the Sample with Lowest Loss makes SGD Robust

Outliers can significantly skew the parameters of machine learning models trained via stochastic gradient descent (SGD). In this paper we propose a simple variant of the vanilla SGD method: in each step, first choose a set of k samples, then from these choose the one with the smallest current loss, and do an SGD-like update with this chosen sample. Vanilla SGD corresponds to k = 1, i.e. no choice; k >= 2 represents a new algorithm that is, however, effectively minimizing a non-convex surrogate loss. Our main contribution is a theoretical analysis of the robustness properties of this idea for ML problems which are sums of convex losses; these are backed up with linear regression and small-scale neural network experiments.

## 1 Introduction

This paper focuses on machine learning problems that can be formulated as optimizing the sum of $n$ convex loss functions:

$$\min_w F(w) \tag{1}$$

where $F(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)$ is the sum of convex, continuously differentiable loss functions.

Stochastic gradient descent (SGD) is a popular way to solve such problems when $n$ is large; the simplest SGD update is:

$$\text{SGD:}\quad w_{t+1} = w_t - \eta_t \nabla f_{i_t}(w_t) \tag{2}$$

where the sample $i_t$ is typically chosen uniformly at random from $[n]$.

However, as is well known, the performance of SGD and most other stochastic optimization methods is highly sensitive to the quality of the available training data. A small fraction of outliers can cause SGD to converge far away from the true optimum. While there has been a significant amount of work on more robust algorithms for special problem classes (e.g. linear regression, PCA etc.), in this paper our objective is to modify the basic SGD method itself, in a way that can be easily applied to the many settings where vanilla SGD is already used in the training of machine learning models.

We call our method Min-$k$ Loss SGD (MKL-SGD), given below (Algorithm 1). In each iteration, we first choose a set of $k$ samples and then select the sample with the smallest current loss in that set; the gradient of this sample is then used for the update step.

The effectiveness of our algorithm relies on a simple observation: in a situation where most samples adhere to a model but a few are outliers skewing the output, the outlier points that contribute the most to the skew are often those with high loss. In this paper, our focus is on the stochastic setting for standard convex functions. We show that MKL-SGD provides a certain degree of robustness against outliers/bad training samples that may otherwise skew the estimate.
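The update described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' Algorithm 1: the function name `mkl_sgd`, the callbacks `loss_fn` and `grad_fn`, the constant step size, and the toy data are all assumptions made for the example, and the $k$ candidates are drawn with replacement.

```python
import numpy as np

def mkl_sgd(loss_fn, grad_fn, w0, n, k, step_size, num_steps, seed=0):
    """Min-k Loss SGD sketch: draw k sample indices, step on the lowest-loss one.

    loss_fn(i, w) and grad_fn(i, w) return the loss and gradient of sample i
    (hypothetical callbacks). Setting k = 1 recovers vanilla SGD.
    """
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    for _ in range(num_steps):
        candidates = rng.integers(0, n, size=k)        # k indices, with replacement
        losses = [loss_fn(i, w) for i in candidates]   # k loss evaluations
        chosen = candidates[int(np.argmin(losses))]    # lowest current loss
        w = w - step_size * grad_fn(chosen, w)         # SGD-like update
    return w

# Toy usage (assumed data): scalar squared losses, 8 clean samples centered
# at 0 and 2 outliers centered at 10.
centers = np.array([0.0] * 8 + [10.0] * 2)
loss = lambda i, w: (w - centers[i]) ** 2
grad = lambda i, w: 2.0 * (w - centers[i])
w_sgd = mkl_sgd(loss, grad, w0=0.0, n=10, k=1, step_size=0.05, num_steps=2000)
w_mkl = mkl_sgd(loss, grad, w0=0.0, n=10, k=2, step_size=0.05, num_steps=2000)
print(float(w_sgd), float(w_mkl))
```

Each step costs $k$ loss evaluations and one gradient evaluation, which is the only overhead relative to vanilla SGD in this sketch.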

##### Our Contributions
• To keep the analysis simple yet insightful, we define four natural and deterministic problem settings - noiseless with no outliers, noiseless with outliers, and noisy with and without outliers - in which we study the performance of MKL-SGD. In each of these settings the individual losses are assumed to be convex, and the overall loss is additionally strongly convex. We are interested in finding the optimum of the “good” samples, but we do not a-priori know which samples are good and which are outliers.

• The expected MKL-SGD update (over the randomness of sample choice) is not the gradient of the original loss function (as would have been the case with vanilla SGD); it is instead the gradient of a different non-convex surrogate loss, even for the simplest and friendliest setting of noiseless with no outliers. Our first result establishes that this non-convexity however does not yield any bad local minima or fixed points for MKL-SGD in this particular setting, ensuring its success.

• We next turn to the setting of noiseless with outliers, where the surrogate loss can now potentially have many spurious local minima. We show that by picking a value of $k$ high enough (depending on the condition number of the loss functions that we define), the local minimum of MKL-SGD closest to $w^*$ is better than the (unique) fixed point of SGD.

• We establish the convergence rates of MKL-SGD, with and without outliers, for both the noiseless and noisy settings.

• We back up our theoretical results with both synthetic linear regression experiments that provide insight, as well as encouraging results on the MNIST and CIFAR-10 datasets.

## 2 Related Work

The related work can be divided into the following four main subparts:

##### Stochastic optimization and weighted sampling

The proposed MKL-SGD algorithm inherently implements a weighted sampling strategy to pick samples. Weighted sampling is one of the popular variants of SGD that can be used for matching one distribution to another (importance sampling), improving the rate of convergence, variance reduction, or all of them, and has been considered in [16, 33, 38, 19]. Other popular weighted sampling techniques include [26, 25, 23]. Without the assumption of strong convexity for each $f_i$, the weighted sampling techniques often lead to biased estimators which are difficult to analyze. Another idea that is analogous to weighted sampling is boosting [11], where harder samples are used to train subsequent classifiers.

However, in presence of outliers and label noise, learning the hard samples may often lead to over-fitting the solution to these bad samples. This serves as a motivation for picking samples with the lowest loss in MKL-SGD.

##### Robust linear regression

Learning with bad training samples is challenging and often intractable even for simple convex optimization problems. For example, OLS is quite susceptible to arbitrary corruptions by even a small fraction of outliers. The least median squares (LMS) and least trimmed squares (LTS) estimators proposed in [31, 34, 35] are both sample efficient and have a relatively high break-down point, but require exponential running time to converge. [14] provides a detailed survey of some of these robust estimators for the OLS problem. Recently, [5, 6, 32, 20, 18] have proposed robust learning algorithms for linear regression which require the computation of gradients over the entire dataset, which may be computationally intractable for large datasets. Another line of recent work considers robustness in the high-dimensional setting ([27, 36, 8, 3, 24]). In this version, our focus is on general stochastic optimization in presence of outliers.

##### Robust optimization

Robust optimization has received a renewed impetus following the works in [10, 22, 7]. In most modern machine learning problems, however, simultaneous access to gradients over the entire dataset is time consuming and often infeasible. [9, 28] provide robust meta-algorithms for stochastic optimization under adversarial corruptions. However, both these algorithms require the computation of one or many principal components per epoch, which is computationally expensive ([1]). In contrast, the MKL-SGD algorithm needs only $k$ loss evaluations per update step. In this paper, we do not consider the stronger adversarial model; our focus is on a tractable method that provides robustness under a simpler corruption model (as defined in the next section).

##### Label noise in deep learning

[2, 21, 4] describe different techniques to learn in presence of label noise and outliers. [30] showed that deep neural networks are robust to random label noise especially for datasets like MNIST and CIFAR10. [15, 29] propose optimization methods based on re-weighting samples that often require significant pre-processing. In this paper, our aim is to propose a computationally inexpensive optimization approach that can also provide a certain degree of robustness.

## 3 Problem Setup

We make the following assumptions about our problem setting described in (1). Let $\mathcal{O}$ be the set of outlier samples; this set is of course unknown to the algorithm. We denote the optimum of the non-outlier samples by $w^*$, i.e.

$$w^* := \arg\min_w \sum_{i \notin \mathcal{O}} f_i(w)$$

In this paper we show that MKL-SGD allows us to estimate $w^*$ without a priori knowledge of the set $\mathcal{O}$, under certain conditions. We now spell these conditions out.

###### Assumption 1 (Individual losses).

Each $f_i$ is convex in $w$, with Lipschitz continuous gradients with constant $L_i$:

$$\|\nabla f_i(w_1) - \nabla f_i(w_2)\| \le L_i \|w_1 - w_2\|$$

and define $L := \max_i L_i$.

It is common to also assume strong convexity of the overall loss function $F$. Here, since we are dropping samples, we need a very slightly stronger assumption.

###### Assumption 2 (Overall loss).

For any size-$(n-k)$ subset $S$ of the samples, we assume the loss function $\sum_{i \in S} f_i(w)$ is strongly convex in $w$. Recall that here $k$ is the size of the sample set in the MKL-SGD algorithm.

Lastly, we also assume that all the functions share the same minimum value. Assumption 3 is satisfied by most standard loss functions with a finite minimum [12], such as squared loss, hinge loss, etc.

###### Assumption 3 (Equal minimum values).

Each of the functions $f_i$ shares the same minimum value.

We are now in a position to formally define the three problem settings we consider in this paper. For each $i$, let $C_i$ denote the set of optimal solutions of $f_i$ (there can be more than one because $f_i$ is only convex but not strongly convex). Let $d(a, S)$ denote the shortest distance between a point $a$ and a set $S$.

##### Noiseless setting with no outliers:

As a first step and sanity check, we consider what happens in the easiest case, where there are no outliers. There is also no “noise”, by which we mean that the optimum $w^*$ we seek is also in the optimal set of every one of the individual sample losses, i.e.

$$w^* \in C_i \quad \text{for all } i$$

Of course in this case vanilla SGD (and many other methods) will converge to $w^*$ as well; we study this setting only as a first step and to build insight.

##### Outlier setting:

We next consider the case where a subset of the samples are outliers. Specifically, we assume that for outlier samples the $w^*$ we seek lies far from their optimal sets, while for the others it is in the optimal sets:

$$d(w^*, C_i) \ge 2\delta \quad \text{for all } i \in \mathcal{O}$$

$$w^* \in C_i \quad \text{for all } i \notin \mathcal{O}$$

Note that now vanilla SGD on the entire loss function will not converge to $w^*$.
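To make the last remark concrete, here is a small sketch for scalar squared losses $l_i (w - c_i)^2$, where the minimizer of the full loss is the $l_i$-weighted mean of the centers $c_i$. The helper name `full_loss_minimizer` and the toy numbers (8 clean samples centered at 0, 2 outliers centered at 10) are assumptions chosen for illustration.

```python
import numpy as np

def full_loss_minimizer(centers, curvatures):
    """Minimizer of sum_i l_i * (w - c_i)^2: the l-weighted mean of the centers."""
    centers = np.asarray(centers, dtype=float)
    curvatures = np.asarray(curvatures, dtype=float)
    return float(np.sum(curvatures * centers) / np.sum(curvatures))

w_star, w_outlier = 0.0, 10.0
centers = [w_star] * 8 + [w_outlier] * 2      # 20% outliers (assumed toy data)
curvatures = [1.0] * 10
w_all = full_loss_minimizer(centers, curvatures)
print(w_all)  # 2.0, pulled away from w_star = 0.0 by the outliers
```

With 20% of the samples centered at 10, the point that vanilla SGD targets moves from 0 to 2, even though every clean sample is minimized exactly at 0.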

##### Noisy setting:

We also consider the case when samples are noisy. In particular, we model noise by allowing $w^*$ to now be outside of the individual optimal sets $C_i$, but not too far; specifically,

No outliers:

$$d(w^*, C_i) \le \delta \quad \text{for all } i$$

With outliers:

$$d(w^*, C_i) \le \delta \quad \text{for all } i \notin \mathcal{O}$$

$$d(w^*, C_i) > 2\delta \quad \text{for all } i \in \mathcal{O}$$

For the noisy setting, we will focus only on the convergence guarantees. We will show that MKL-SGD gets close to $w^*$ in this setting; in the no-outliers case vanilla SGD will of course do so as well.

## 4 Understanding MKL-SGD

To build some intuition for MKL-SGD, we describe the notation and look at some simple settings. Recall MKL-SGD takes $k$ samples and then retains the one with the lowest current loss; this means it is sampling non-uniformly. For any $w$, let $m_1(w), m_2(w), \dots, m_n(w)$ be the sorted order w.r.t. the loss at that $w$, i.e.

$$f_{m_1(w)}(w) \le f_{m_2(w)}(w) \le \dots \le f_{m_n(w)}(w)$$

Recall that for a sample to be the one picked by MKL-SGD for updating $w$, it needs to first be part of the set of $k$ samples, and then have the lowest loss among them. A simple calculation shows that the probability that the $i$-th best sample is the one picked by MKL-SGD is given by

$$p_{m_i(w)} = \begin{cases} \dbinom{n-i}{k-1} \Big/ \dbinom{n}{k} & \text{without replacement} \\[2ex] \dfrac{(n-i+1)^k - (n-i)^k}{n^k} & \text{with replacement} \end{cases} \tag{3}$$

In the rest of the paper, we will focus on the “with replacement” scenario for ease of presentation; this choice does not change our main ideas or results. With this notation, we can rewrite the expected update step of MKL-SGD  as
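The selection probabilities follow from elementary counting, sketched below under the assumption that the losses are distinct (or ties are broken by a fixed order). Without replacement, the $i$-th best sample is picked when it is drawn together with $k-1$ of the $n-i$ worse samples; with replacement, it is picked when the smallest rank among $k$ independent draws equals $i$. The function names are hypothetical.

```python
from math import comb

def pick_prob_without_replacement(n, k, i):
    """P(i-th lowest-loss sample is picked) when k distinct indices are drawn; i is 1-based."""
    return comb(n - i, k - 1) / comb(n, k)

def pick_prob_with_replacement(n, k, i):
    """P(i-th lowest-loss sample is picked) when k indices are drawn independently; i is 1-based."""
    return ((n - i + 1) ** k - (n - i) ** k) / n ** k

n, k = 10, 3
for i in range(1, n + 1):
    print(i, pick_prob_without_replacement(n, k, i), pick_prob_with_replacement(n, k, i))
```

Both distributions sum to 1 over $i = 1, \dots, n$, reduce to the uniform $1/n$ when $k = 1$, and put more mass on low-loss samples as $k$ grows.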

$$\mathbb{E}[w^+ \mid w] = w - \eta \sum_i p_{m_i(w)} \nabla f_{m_i(w)}(w)$$

For simplicity of notation in the rest of the paper, we will relabel the update term in the above by defining $\nabla \tilde{F}(w)$ as follows:

$$\nabla \tilde{F}(w) := \sum_i p_{m_i(w)} \nabla f_{m_i(w)}(w)$$

Underlying this notation is the idea that, in expectation, MKL-SGD is akin to gradient descent on a surrogate loss function $\tilde{F}$ which is different from the original loss function $F$; indeed, if needed, this surrogate loss can be found (up to a constant shift) from the above gradient. We will not do that explicitly here, but instead note that even with all our assumptions, indeed even without any outliers or noise, this surrogate loss can be non-convex. It is thus important to see that MKL-SGD does the right thing in all of our settings, which is what we describe next.

### 4.1 Noiseless setting with no outliers

As a first step (and as a sanity check), we look at MKL-SGD in the simplest setting, when there are no outliers and no noise. Recall from above that this means that $w^*$ is in the optimal set of every single individual loss $f_i$. However, as mentioned above, even in this case the surrogate loss can be non-convex, as seen e.g. in Figure 1 for a simple example.

However, in the following lemma we show that even though the overall surrogate loss is non-convex, in this no-noise no-outlier setting it has a special property with regard to the point $w^*$.

###### Lemma 1.

In the noiseless setting, for any $w$ there exists a $\lambda_w > 0$ such that

$$\nabla \tilde{F}(w)^\top (w - w^*) \ge \lambda_w \|w - w^*\|^2.$$

In other words, this lemma says that on the line between any point $w$ and the point $w^*$, the surrogate loss function is convex from any point, even though it is not convex overall. This is akin to the restricted secant inequality condition described in [17, 37]. The following theorem uses this lemma to establish our first result: in the noiseless setting with no outliers, $w^*$ is the only fixed point (in expectation) of MKL-SGD.
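The inequality in Lemma 1 can be checked numerically in a toy case. The sketch below assumes scalar squared losses $f_i(w) = l_i (w - w^*)^2$ with a common minimizer, with-replacement sampling, and hypothetical curvature values; in this special case the expected update direction times $(w - w^*)$ is at least $2 \min_i l_i \, (w - w^*)^2$, because the selection probabilities sum to one.

```python
import numpy as np

def surrogate_grad(w, centers, curvatures, k):
    """Expected MKL-SGD direction for scalar losses l_i * (w - c_i)^2 (with replacement)."""
    centers = np.asarray(centers, dtype=float)
    curvatures = np.asarray(curvatures, dtype=float)
    n = len(centers)
    losses = curvatures * (w - centers) ** 2
    order = np.argsort(losses, kind="stable")            # m_1(w), ..., m_n(w)
    ranks = np.arange(1, n + 1)
    probs = ((n - ranks + 1) ** k - (n - ranks) ** k) / n ** k
    grads = 2.0 * curvatures[order] * (w - centers[order])
    return float(np.sum(probs * grads))

w_star = 1.0
curvatures = [0.5, 1.0, 2.0, 4.0]                         # assumed toy values
centers = [w_star] * 4                                    # noiseless, no outliers
for w in np.linspace(-3.0, 5.0, 9):
    lhs = surrogate_grad(w, centers, curvatures, k=2) * (w - w_star)
    rhs = 2.0 * min(curvatures) * (w - w_star) ** 2
    print(round(float(w), 2), lhs >= rhs - 1e-12)
```

The constant $2 \min_i l_i$ is only a convenient lower bound for this toy family, not the $\lambda_w$ of the lemma in general.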

###### Theorem 1 (Unique stationary point).

For the noiseless setting with no outliers, and under Assumptions 1-3, the expected MKL-SGD update satisfies $\nabla \tilde{F}(w) = 0$ if and only if $w = w^*$.

### 4.2 Noiseless setting with Outliers

In presence of outliers, the surrogate loss can have multiple local minima that are far from $w^*$ and indeed potentially even worse than what we could have obtained with vanilla SGD on the original loss function. We now analyze MKL-SGD in the simple setting of symmetric squared loss functions and try to gain useful insights into the landscape of the loss function for the scalar setting. We would like to point out that the analysis in the next part serves as a clean template and can be extended to many other standard loss functions used in convex optimization.

##### Squared loss in the scalar setting

Figure 2 will be a handy tool for visualizing and understanding both the notation and results of this subsection. Consider the case where all losses are squared losses, with all the clean samples centered at $w^*$ and all the outliers at $w_B$, but all having different Lipschitz constants. Specifically, consider:

$$f_i(w) = \begin{cases} l_i (w - w^*)^2 & \forall\, i \notin \mathcal{O} \\ l_i (w - w_B)^2 & \forall\, i \in \mathcal{O} \end{cases} \tag{4}$$

Let $l_M = \max_i l_i$ and $l_m = \min_i l_i$, and let us define the condition number $\kappa = l_M / l_m$. We initialize MKL-SGD at $w_B$, a point where the losses of outlier samples are 0 and all the clean samples have non-zero losses. As a result, at $w_B$ MKL-SGD has a tendency to pick all the outlier samples with a higher probability than any of the clean samples. This does not bode well for the algorithm, since it implies that the final stationary point will be heavily influenced by outliers. Let $\bar{w}_{MKL}$ be the stationary point of MKL-SGD for this scalar case when initialized at $w_B$.
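The effect of the initialization can be quantified with a short sketch. Assuming with-replacement sampling and the scalar losses of equation (4), the code below computes the total probability that the selected sample is an outlier at a given point; the function name and the toy curvatures are assumptions. At the outlier center the outliers occupy the lowest-loss ranks, so this mass is $1 - (1-\epsilon)^k$ for an outlier fraction $\epsilon$, while at $w^*$ they occupy the highest-loss ranks and the mass drops to $\epsilon^k$.

```python
import numpy as np

def outlier_pick_mass(w, centers, curvatures, is_outlier, k):
    """Total probability that MKL-SGD (with replacement) selects an outlier at w."""
    centers = np.asarray(centers, dtype=float)
    curvatures = np.asarray(curvatures, dtype=float)
    is_outlier = np.asarray(is_outlier, dtype=bool)
    n = len(centers)
    losses = curvatures * (w - centers) ** 2
    order = np.argsort(losses, kind="stable")             # lowest loss first
    ranks = np.arange(1, n + 1)
    probs = ((n - ranks + 1) ** k - (n - ranks) ** k) / n ** k
    return float(np.sum(probs[is_outlier[order]]))

w_star, w_B = 0.0, 10.0                                    # assumed toy centers
centers = [w_star] * 8 + [w_B] * 2
curvatures = [1.0, 2.0, 0.5, 1.5, 1.0, 3.0, 0.8, 1.2, 1.0, 2.0]
is_outlier = [False] * 8 + [True] * 2
for k in (1, 2, 4):
    print(k,
          outlier_pick_mass(w_B, centers, curvatures, is_outlier, k),
          outlier_pick_mass(w_star, centers, curvatures, is_outlier, k))
```

For $k = 1$ both values equal the outlier fraction 0.2; larger $k$ amplifies the outliers near $w_B$ and suppresses them near $w^*$, which is why the landscape analysis that follows matters.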

Let us define $\tilde{w}$ as follows:

 (5)

Thus, $\tilde{w}$ is the closest point to $w^*$ on the line joining $w^*$ and $w_B$ where the loss functions of one of the clean samples and one of the outliers intersect, as illustrated in Figure 2.

By observation, we know for the above scalar case . Let represent the total probability of picking outliers at the starting point . The maximum possible value that can be attained over the entire landscape is given as:

 (6)

The next condition is sufficient to ensure that all the bad local minima are avoided no matter where we initialize. For the simple scalar case, the condition is:

###### Condition 1.

To further elaborate on this, for the loss functions $f_i$ and the point $\tilde{w}$ defined in equations (4) and (5) respectively, if Condition 1 is not satisfied, then we cannot say anything about where MKL-SGD converges. However, if Condition 1 holds, then we are in Case 1 (Figure 2), i.e. the stationary point attained by MKL-SGD will be such that the first bad local minimum is avoided. The first bad local minimum is obtained by solving the optimization problem in which the highest probabilities are assigned to the bad samples.

Following the above analysis recursively, we can show that all other subsequent bad local minima are avoided as well, until we reach the local minimum which assigns the largest probabilities to the clean samples (refer to Appendix Section 8.2.3 for further details on this discussion). This indicates that irrespective of where we initialize in the landscape, we are bound to end up at a local minimum with the highest probabilities assigned to the clean samples. In the latter part of this section, we will show that the MKL-SGD solution attained when Case 1 holds is provably better than the SGD solution. However, if Condition 1 is false (Case 2, Figure 2), then it is possible that MKL-SGD gets stuck at any one of the many local minima that exist close to the outlier center, and we cannot say anything about the relative distance from $w^*$.

A key takeaway from the above condition is that, with the other quantities held fixed, as the condition number $\kappa$ increases we can tolerate only a smaller $\hat{p}$ and consequently a smaller fraction of corruptions $\epsilon$. For a fixed fraction of corruptions, increasing the parameter $k$ in MKL-SGD leads to an increase in $\hat{p}$, and thus increasing $k$ can lead to the violation of the above condition. This happens because samples with lower loss will be picked with increasing probability as $k$ increases, and as a result the propensity of MKL-SGD to converge towards the closest stationary point it encounters is higher.

##### Squared loss in the vector setting

The loss functions are redefined as follows:

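$$f_i(w) = \begin{cases} l_i \|w - w^*\|^2 & \forall\, i \notin \mathcal{O} \\ l_i \|w - w_{b_i}\|^2 & \forall\, i \in \mathcal{O} \end{cases} \tag{7}$$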

Without loss of generality, assume that and . Let be any stationary attained by MKL-SGD. Suppose be the angle between the line passing through and and the line connecting and . Let us define Consider . Let represent the total probability of picking outliers at the starting point . The maximum possible value that can be attained is given as:

 (8)

where for any , are ordered i.e. .

At , by definition, we know that , and , . By continuity arguments, there exists a ball of radius around , , defined as follows:

 (9)

In the subsequent lemma, we show that it is possible to drift into the ball where the clean samples have the highest probability or the lowest loss. (It is trivial to show the existence of such a ball for any set of continuously differentiable $f_i$.)

###### Lemma 2.

Consider the loss function and as defined in equations (7) and (9) respectively. Suppose and as defined in Equation (8) satisfies . Starting from any initialization , for any stationary point attained by MKL-SGD, we have that

In other words, initializing at any point in the landscape, the final stationary point attained by MKL-SGD will inevitably assign the largest probabilities to the clean samples. The proof is availabe in Appendix Section 8.2.3. For the scalar case, , we have . If and all the outliers are centered at the same point, then in the scalar setting the condition in Lemma 2 reduces to condition 1.

Note that the above lemma leads to a very strong worst-case guarantee. It states that the farthest optimum will always lie within a ball around $w^*$, no matter where we initialize. Moreover, as long as the condition is satisfied, no matter where the outliers lie (they can be adversarially chosen), MKL-SGD always has the propensity to bring the iterates into this ball around $w^*$. However, when the conditions required for its convergence are violated, the guarantees are initialization dependent. Thus, all the discussions in the rest of this section will be with respect to these worst-case guarantees. However, as we see in the experimental section for both neural networks and linear regression, MKL-SGD with random initialization also seems to perform better than SGD.

##### Effect of κ

A direct result of Lemma 2 is that the higher the condition number $\kappa$ of the set of quadratic loss functions, the lower the fraction of outliers that MKL-SGD can tolerate. This is because a large $\kappa$ results in a small upper bound on $\hat{p}$ in the condition of Lemma 2. This implies that $\hat{p}$ has to be small, which in turn requires a smaller fraction of corruptions, $\epsilon$.

##### Effect of γ:

The relative distance of the outliers from $w^*$ plays a critical role in the condition for Lemma 2. We know that $\gamma \le 1$; $\gamma = 1$ implies that the outliers are equidistant from the optimum $w^*$. Low values of $\gamma$ make the right-hand side of the condition very small and lead to its violation, which implies that a single bad outlier can guarantee that the condition in Lemma 2 is violated. The guarantees in the above lemma hold only when the outliers are not adversarially chosen to lie at very high relative distances from $w^*$. One way to avoid outliers that lie very far from the optimum is to have a filtering step at the start of the algorithm, like the one in [9]. We will refer to this in the Experiments section.

##### Effect of cosθj,¯w:

At first glance, it may seem that may cause and since , the condition in Lemma 2 may never be satisfied. Since, the term shows up in the denominator of the loss associated with outlier centered at . Thus, low values of implies high value of loss associated with the function centered at which in turn implies the maximum probability attained by that sample can never be in the top- probabilities for that .

##### Analysis for the general outlier setting:

In this part, we analyze the fixed point equations associated with MKL-SGD and SGD and try to understand the behavior in a ball around the optimum? For the sake of simplicity, we will assume that . Next, we analyze the following two quantities: i) distance of from and distance of the any of the solutions attained by from .

###### Lemma 3.

Let indicate the solution attained SGD. Under assumptions 1-3, there exists an such that for all ,

$$\epsilon G \le (1 - \epsilon) L \|\bar{w}_{SGD} - w^*\|$$

Using Lemma 1, we will define as follows:

$$\lambda := \min_w \lambda_w \tag{10}$$

In Appendix Section 8.2.2, we show that , however the exact lower bounds for this are loss function dependent. Naively,

###### Lemma 4.

Let be any first order stationary point attained by MKL-SGD. Under assumptions 1-3, for a given and as defined in equation (10), there exists a such that for all ,

$$\|\bar{w}_{MKL} - w^*\| \le \frac{\epsilon^k G}{\lambda}$$

Finally, we show that any solution attained by MKL-SGD is provably better than the solution attained by SGD. We would like to emphasize that this is a very strong result. MKL-SGD has numerous local minima, and here we show that even the worst solution attained by MKL-SGD (the one farthest from $w^*$) is closer to $w^*$ than the solution attained by SGD. Let us define

###### Theorem 2.

Let and be the the stationary points attained by SGD and MKL-SGD algorithms respectively for the noiseless setting with outliers. Under assumptions 1-3, for any and defined in equation (10), there exists an and such that for all and , we have and,

$$\|\bar{w}_{MKL} - w^*\| < \alpha(\epsilon, L, k, \lambda)\, \|\bar{w}_{SGD} - w^*\| \tag{11}$$

For squared loss in scalar setting, we claimed that for a fixed and , using a large may not be a good idea. Here, however once we are in the ball, , using larger (any ), reduces and allows MKL-SGD to get closer to .

The conditions required in Lemma 2 and Theorem 2 enable us to provide guarantees for only a subset of relatively well-conditioned problems. We would like to emphasize that the bounds we obtain are worst-case bounds and not bounds in expectation. However, as we will note in Section 6 and the Appendix, these bounds may not be necessary: for standard convex optimization problems MKL-SGD easily outperforms SGD.

## 5 Convergence Rates

In this section, we go back to the in expectation convergence analysis which is standard for the stochastic settings. For smooth functions with strong convexity, [25, 26] provided guarantees for linear rate of convergence. We restate the theorem here and show that the theorem still holds for the non-convex landscape obtained by MKL-SGD in noiseless setting.

###### Lemma 5 (Linear Convergence[26]).

Let be -strongly convex. Set with . Suppose . Let . After iterations, SGD satisfies:

$$\mathbb{E}\left[\|\Delta_T\|^2\right] \le (1 - 2\eta \hat{C})^T \|\Delta_0\|^2 + \eta R_\sigma \tag{12}$$

where and .

In the noiseless setting, we have and so . in (12) is the same as stated in Theorem 1. Even though above theorem is for SGD, it still can be applied to our algorithm 1. At each iteration there exists a parameter that could be seen as the strong convexity parameter (c.f. Lemma 1). For MKL-SGD, the parameter in (12) should be . Thus, MKL-SGD algorithm still guarantees linear convergence result but with an implication of slower speed of convergence than standard SGD.

However, Lemma 5 will not hold for MKL-SGD in the noisy setting since there exists no strong convexity parameter. Even for the noiseless setting, the rate of convergence for MKL-SGD given in Lemma 5 is not tight. The upper bound in (12) is loosely set to a constant for all the iterations. We fix this by looking concretely at each iteration. We give a general bound for any stochastic algorithm (c.f. Theorem 3) for both the noiseless and noisy settings, in the absence and presence of outliers.

###### Theorem 3 (Distance to w∗).

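Let $\Delta_t = w_t - w^*$. Denote by $\lambda_{good}$ the strong convexity parameter for all the good samples. Let

$$\psi = 2 \eta_t \lambda_{good} \left(1 - \eta_t \sup_i L_i\right) \min_{i \notin \mathcal{O}} p_i(w_t)$$

Suppose at the $t$-th iteration the stepsize is set as $\eta_t$; then, conditioned on the current parameter $w_t$, the expectation of the distance between $w_{t+1}$ and $w^*$ can be upper bounded as:

$$\mathbb{E}_i\left[\|\Delta_{t+1}\|^2 \mid w_t\right] \le (1 - \psi) \|\Delta_t\|^2 + \eta_t R_t \tag{13}$$

where

$$R_t = -2 \sum_{i \notin \mathcal{O}} p_i(w_t) \langle w_t - w^*, \nabla f_i(w^*) \rangle + \sum_{i \in \mathcal{O}} p_i(w_t) \left( 2\eta_t \|\nabla f_i(w^*)\|^2 + \eta_t \|\nabla f_i(w_t)\|^2 + 2\left(f_i(w^*) - f_i(w_t)\right) \right)$$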

Theorem 3 implies that for any stochastic algorithm in the both noisy and noiseless setting, outliers can make the upper bound () much worse as it produces an extra term (the third term in ). The third term in has a lower bound that could be an increasing function of . However, its impact can be reduced by appropriately setting , for instance using a larger in MKL-SGD. In the appendix, we also provide a sufficient condition (Corollary 1 in the Appendix) when MKL-SGD is always better than standard SGD (in terms of its distance from in expectation).

The convergence rate depends on the constant . Note that this term is not too small for our algorithm MKL-SGD since it is a minimum among all the good sample (not including the outliers). However, when compared with vanilla SGD where , with defined in (3) for MKL-SGD, in some sense, could be smaller than . For instance, in the experiments given in Figure 5 (a)-(c) (Appendix 8.4.1), the slope of SGD is steeper than MKL-SGD, which implies that .

To understand the residual term . Let us take the noiseless setting with outliers for an example. We have and for all . But for , and . Then the term can be reduced to

$$R_t = \sum_{i \in \mathcal{O}} p_i(w_t) \left( 2\eta_t \|\nabla f_i(w^*)\|^2 + \eta_t \|\nabla f_i(w_t)\|^2 + 2\left(f_i(w^*) - f_i(w_t)\right) \right) \tag{14}$$

If we are at the same point for both SGD and MKL-SGD and for , we have . It means that MKL-SGD could reach to a neighbor with a radius that is possibly smaller than vanilla SGD algorithm, with a rate proportional to but not necessarily faster than vanilla SGD.

## 6 Experiments

In this section, we compare the performance of MKL-SGD  and SGD for synthetic datasets for linear regression and small-scale neural networks.

### 6.1 Linear Regression

For simple linear regression, we assume that

are sampled from normal distribution with different condition numbers.

where is a diagonal matrix such that and for all ). For the noisy case, we assume additive Gaussian noise with mean and variance . We compare the performance of MKL-SGD and SGD for different values of (Fig. 3) under noiseless and noisy settings against varying levels of corruption . It is important to note that different values correspond to different rates of convergence. To ensure fair comparison, we run the algorithms till the error values stop decaying and take the distance of from the exponential moving average of the iterates.
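A synthetic comparison in the spirit of this setup can be sketched as follows. Everything specific here is an assumption made for illustration, since the exact values are not recoverable from the text: features are Gaussian with a diagonal covariance whose variances range from $1/\kappa$ to 1, corrupted labels are drawn independently of $w^*$, and the step size, dimensions, iteration count and moving-average decay are placeholders. Setting `k=1` gives vanilla SGD.

```python
import numpy as np

def make_regression(n, d, kappa, eps, noise_std, rng):
    """Synthetic regression data with a fraction eps of corrupted labels."""
    variances = np.linspace(1.0 / kappa, 1.0, d)      # diagonal covariance, condition number kappa
    X = rng.normal(size=(n, d)) * np.sqrt(variances)
    w_star = rng.normal(size=d)
    y = X @ w_star + noise_std * rng.normal(size=n)
    n_bad = int(round(eps * n))
    bad = rng.choice(n, size=n_bad, replace=False)
    y[bad] = rng.normal(scale=5.0, size=n_bad)        # outliers: labels unrelated to w_star
    return X, y, w_star

def run_mkl_sgd(X, y, k, step_size, num_steps, rng, ema_decay=0.99):
    """MKL-SGD on squared loss; returns an exponential moving average of the iterates."""
    n, d = X.shape
    w = np.zeros(d)
    w_avg = np.zeros(d)
    for _ in range(num_steps):
        idx = rng.integers(0, n, size=k)              # k candidates, with replacement
        residuals = X[idx] @ w - y[idx]
        j = idx[np.argmin(residuals ** 2)]            # lowest current loss
        w = w - step_size * 2.0 * (X[j] @ w - y[j]) * X[j]
        w_avg = ema_decay * w_avg + (1.0 - ema_decay) * w
    return w_avg

rng = np.random.default_rng(0)
X, y, w_star = make_regression(n=500, d=10, kappa=10.0, eps=0.1, noise_std=0.0, rng=rng)
for k in (1, 2, 3):
    w_hat = run_mkl_sgd(X, y, k=k, step_size=0.01, num_steps=20000, rng=rng)
    print(k, np.linalg.norm(w_hat - w_star))
```

The printed distances depend on the seed and on the assumed corruption model, so they should be read as a qualitative check rather than a reproduction of Fig. 3.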

### 6.2 Neural Networks

For deep learning experiments, our results are in presence of corruptions via the directed noise model. In this corruption model, all the samples of class

that are in error are assigned the same wrong label . This is a stronger corruption model than corruption by random noise (results in Appendix). For the MKL-SGD algorithm, we run a more practical batched (size ) variant such that if the algorithm picks samples out of sample loss evaluations. The oracle contains results obtained by running SGD over only non-corrupted samples. More experimental results on neural networks for MNIST and CIFAR10 datasets can be found in the Appendix.
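The batched variant can be sketched as a selection step: evaluate the per-sample losses of a candidate batch and keep only the lowest-loss samples for the gradient step. The function name `select_lowest_loss` and the batch sizes in the usage lines are hypothetical; the text does not preserve the sizes used in the experiments.

```python
import numpy as np

def select_lowest_loss(per_sample_losses, num_keep):
    """Return the (sorted) indices of the num_keep samples with the smallest loss."""
    losses = np.asarray(per_sample_losses, dtype=float)
    if not 1 <= num_keep <= losses.size:
        raise ValueError("num_keep must be between 1 and the number of candidates")
    keep = np.argpartition(losses, num_keep - 1)[:num_keep]
    return np.sort(keep)

# Usage: 6 candidate losses, keep the 3 smallest; the gradient step would then
# use only these samples.
candidate_losses = [0.9, 0.1, 5.0, 0.3, 2.0, 0.2]
kept = select_lowest_loss(candidate_losses, num_keep=3)
print(kept)  # [1 3 5]
```

In a deep learning framework the same idea amounts to computing unreduced per-sample losses on the candidate batch, selecting indices as above, and backpropagating only through the mean loss of the kept samples.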

##### MNIST:

We train standard 2 layer convolutional network on subsampled MNIST ( samples with labels). We train over 80 epochs using an initial learning rate of with the decaying schedule of factor after every epochs. The results of the MNIST dataset are averaged over 5 runs.

##### CIFAR-10:

We train Resnet-18 [13] on CIFAR-10 ( training samples with labels) for over epochs using an initial learning rate of with the decaying schedule of factor after every epochs. The reported accuracy is based on the true validation set. The results of the CIFAR-10 dataset are averaged over runs.

Lastly, in Fig. 4, we show that for a neural network MKL-SGD typically has a higher training loss but smaller test loss which partially explains its superior generalization performance.

## 7 Conclusion and Future Work

In this paper, we propose MKL-SGD, which is computationally inexpensive, has linear convergence (up to a certain neighborhood), and is robust against outliers. We analyze the MKL-SGD algorithm under noiseless and noisy settings, with and without outliers. MKL-SGD outperforms SGD in terms of generalization in both linear regression and neural network experiments. More importantly, MKL-SGD opens up a plethora of challenging questions with respect to understanding convex optimization in a non-convex landscape.

To ensure consistency, i.e. , we require that . In all other cases, there will be a non-zero contribution from the outliers which keeps the MKL-SGD solution from exactly converging to $w^*$. In this paper, we consider the fraction of corruption to be unknown, and thus $k$ should be treated as a hyperparameter. However, if we knew the fraction of corruption, then with the right $k$ and smart initialization, it is possible to guarantee consistency. For neural network experiments in the Appendix, we show that tuning $k$ as a hyperparameter can lead to significant improvements in performance in presence of outliers.

Preliminary experiments indicate that smarter initialization techniques can improve the performance of MKL-SGD. The obvious question then is whether worst-case guarantees can be provided for a larger subset of problems using smarter initialization techniques. It will be interesting to analyze the tradeoff between the rate of convergence of MKL-SGD and its robustness to outliers. The worst-case analysis in the noisy setting, with and without outliers, also remains an open problem.

## 8 Appendix

### 8.1 Additional Results for Section 3

The following lemma provides upper bounds on the expected gradient of the worst-possible MKL-SGD solution that lies in a ball around . Simultaneously satisfying the following bound with the one in Lemma 3 may lead to an infeasible set of and . And thus we use Lemma 4 in conjunction with 3.

###### Lemma 6.

Let us assume that MKL-SGD converges to . For any that satisfies assumptions N1, N2, A4 and A5, there exists and such that,

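$$\left\| \sum_{i \notin \mathcal{O}} p_i(\bar{w}_{MKL}) \nabla f_i(\bar{w}_{MKL}) \right\| \le \min\left\{ (1 - \epsilon^k) L \|\bar{w}_{MKL} - w^*\|,\ \epsilon^k G(w) \right\}$$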

The proof for lemma 2 can be found in the Appendix Section 8.2.7

### 8.2 Proofs and supporting lemmas

#### 8.2.1 Proof of Lemma 1

###### Proof.

. Let us fix a such that . We know that for any , is strongly convex in with parameter . This implies

$$\nabla \tilde{F}(w)^\top (w - w^*) \ge \lambda_w \|w - w^*\|^2$$

A naive bound for the above Lemma can be:

 ∇˜F(w)⊤(w−w∗)≥minipi∑ifi(w)≥λminipiλw\normw−w∗2

#### 8.2.2 Proof of Theorem 1

###### Proof.

By the definition of the noiseless framework, is the unique optimum of and lies in the optimal set of each . We will prove this theorem by contradiction. Assume there exists some that also satisfies optimum of . At , we have . This implies . ∎

Theorem 1 and Assumption 2 guarantee that . If is strongly convex and is convex, then we know that is strongly convex. On similar lines we can show that by splitting the terms in as and . The first term has (Assumption 2) and the second term has (since it is convex). Note, is a positive constant independent of and so the above lemma is for all .

#### 8.2.3 Proof of Lemma 2

Let be a stationary point of MKL-SGD. Now, we analyze the loss landscape on the line joining and where is any arbitrary point 555Note that we just need for the purpose of landscape analysis and it is not a parameter of the algorithm in the landscape at a distance as far as the farthest outlier from . Let be a very large number.

The loss functions and are redefined as follows:

$$f_i(w) = \begin{cases} l_i \|w - w^*\|^2 & \forall\, i \notin \mathcal{O} \\ l_i \|w - w_{b_i}\|^2 & \forall\, i \in \mathcal{O} \end{cases}$$

where such that . Let and Let and , . Let us define .

Now at , we have . Let us assume that the outliers are chosen in such a way that at , all the outliers have the lowest loss. As stated in the previous lemma, the results hold irrespective of that. This implies:

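$$\begin{aligned}
\sum_{i \notin \mathcal{O}} p_i(w_C) \nabla f_i(\bar{w}) &= -\sum_{j \in \mathcal{O}} p_j(w_C) \nabla f_j(\bar{w}) \\
\sum_{i \notin \mathcal{O}} p_i(w_C)\, l_i (\bar{w} - w^*) &= -\sum_{j \in \mathcal{O}} p_j(w_C)\, l_j (\bar{w} - w_{b_j}) \\
\bar{w} &= \frac{\sum_{i \notin \mathcal{O}} p_i(w_C)\, l_i\, w^* + \sum_{j \in \mathcal{O}} p_j(w_C)\, l_j\, w_{b_j}}{\sum_{i \notin \mathcal{O}} p_i(w_C)\, l_i + \sum_{j \in \mathcal{O}} p_j(w_C)\, l_j}
\end{aligned}$$

By the triangle inequality,

$$\|\bar{w} - w^*\| \le \frac{\sum_{j \in \mathcal{O}} p_j(w_C)\, l_j \|w_{b_j} - w^*\|}{\sum_{i \notin \mathcal{O}} p_i(w_C)\, l_i + \sum_{j \in \mathcal{O}} p_j(w_C)\, l_j}$$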

Without loss of generality assume that the outliers are ordered as follows: .

Now be some point of intersection of function in the set of clean samples and a function in the set of outliers to . Let be the angle between the line connecting and to the line connecting to . For any two curves with Lipschitz constants and , the halfspaces passing through the weighted mean are also the region where both functions have equal values.

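Thus,

$$\tilde{w} = \frac{\sqrt{l_i}\, w^* + \sqrt{l_j}\, w_{b_j}}{\sqrt{l_i} + \sqrt{l_j}}$$

and

$$\|\tilde{w} - w^*\| = \frac{\sqrt{l_j}\, \|w_{b_j} - w^*\|}{\sqrt{l_j} + \sqrt{l_i}}$$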
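The equal-loss point can be verified numerically. The sketch below (with a hypothetical helper name and toy values) computes the $\sqrt{l}$-weighted mean of a clean center $w^*$ and an outlier center $w_{b_j}$, then checks that the two quadratic losses coincide there and that the distance to $w^*$ matches the closed form.

```python
import numpy as np

def equal_loss_point(w_star, w_b, l_i, l_j):
    """Point on the segment [w_star, w_b] where l_i*||w - w_star||^2 = l_j*||w - w_b||^2."""
    a, b = np.sqrt(l_i), np.sqrt(l_j)
    w_star = np.asarray(w_star, dtype=float)
    w_b = np.asarray(w_b, dtype=float)
    return (a * w_star + b * w_b) / (a + b)

w_star = np.array([0.0, 0.0])      # clean center (toy value)
w_b = np.array([3.0, 4.0])         # outlier center (toy value)
l_i, l_j = 1.0, 4.0                # clean and outlier curvatures (toy values)

w_tilde = equal_loss_point(w_star, w_b, l_i, l_j)
clean_loss = l_i * np.sum((w_tilde - w_star) ** 2)
outlier_loss = l_j * np.sum((w_tilde - w_b) ** 2)
distance = np.linalg.norm(w_tilde - w_star)
closed_form = np.sqrt(l_j) * np.linalg.norm(w_b - w_star) / (np.sqrt(l_j) + np.sqrt(l_i))
print(w_tilde, clean_loss, outlier_loss, distance, closed_form)
```

A larger outlier curvature $l_j$ pushes the equal-loss point away from $w^*$ and towards the outlier center.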

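Let $\gamma$ denote the following ratio:

$$\gamma = \frac{\min_{j \in \mathcal{O}} \|w_{b_j} - w^*\|}{\max_{j \in \mathcal{O}} \|w_{b_j} - w^*\|} = \frac{2\delta}{\delta_{\max}}$$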

Now, we want:

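$$\begin{aligned}
\frac{\sum_{j \in \mathcal{O}} p_j(w_C)\, l_j \|w_{b_j} - w^*\|}{\sum_{i \notin \mathcal{O}} p_i(w_C)\, l_i + \sum_{j \in \mathcal{O}} p_j(w_C)\, l_j} &\le \frac{\sqrt{l_{t_j}}}{\sqrt{l_{t_j}} + \sqrt{l_g}} \|w_{b_j} - w^*\| \cos\theta_j = \|\tilde{w} - w^*\| \cos\theta_j \\
\frac{\sum_{j \in \mathcal{O}} p_j(w_C)\, l_j \|w_{b_j} - w^*\|}{\sum_{i \notin \mathcal{O}} p_i(w_C)\, l_i + \sum_{j \in \mathcal{O}} p_j(w_C)\, l_j} \le \frac{\sum_{j \in \mathcal{O}} p_j(w_C)\, l_j \|w_{b_{|\mathcal{O}|}} - w^*\|}{\sum_{i \notin \mathcal{O}} p_i(w_C)\, l_i + \sum_{j \in \mathcal{O}} p_j(w_C)\, l_j} &\le \frac{\sqrt{l_{t_j}}}{\sqrt{l_{t_j}} + \sqrt{l_g}} \|w_{b_j} - w^*\| \cos\theta_j \\
\frac{\sum_{j \in \mathcal{O}} p_j(w_C)\, l_j}{\sum_{i \notin \mathcal{O}} p_i(w_C)\, l_i + \sum_{j \in \mathcal{O}} p_j(w_C)\, l_j} &\le \frac{\sqrt{l_{t_j}}}{\sqrt{l_{t_j}} + \sqrt{l_g}} \cdot \frac{\|w_{b_j} - w^*\| \cos\theta_j}{\|w_{b_{|\mathcal{O}|}} - w^*\|} \\
\frac{\sum_{j \in \mathcal{O}} p_j(w_C)\, l_j}{\sum_{i \notin \mathcal{O}} p_i(w_C)\, l_i + \sum_{j \in \mathcal{O}} p_j(w_C)\, l_j} &\le \frac{\sqrt{l_{t_j}}}{\sqrt{l_{t_j}} + \sqrt{l_g}}\, \gamma \cos\theta_j
\end{aligned}$$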

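For simplicity, let $\Gamma = \gamma \cos\theta_j$; then we have:

$$\begin{aligned}
\frac{\sum_{j \in \mathcal{O}} p_j(w_C)\, l_j}{\sum_{i \notin \mathcal{O}} p_i(w_C)\, l_i + \sum_{j \in \mathcal{O}} p_j(w_C)\, l_j} &\le \frac{\sqrt{l_{t_j}}}{\sqrt{l_{t_j}} + \sqrt{l_g}}\, \Gamma \\
\frac{1}{\Gamma} \left( \frac{\sqrt{l_g}}{\sqrt{l_{t_j}}} + 1 \right) - 1 &\le \frac{(1 - \hat{p})\, l_m}{\hat{p}\, l_M} \le \frac{\sum_{i \notin \mathcal{O}} p_i(w_C)\, l_i}{\sum_{j \in \mathcal{O}} p_j(w_C)\, l_j} \\
\frac{\hat{p}}{1 - \hat{p}} &\le \frac{l_m}{l_M} \cdot \frac{1}{\frac{1}{\Gamma} - 1 + \frac{1}{\Gamma} \frac{\sqrt{l_g}}{\sqrt{l_{t_j}}}} \\
\hat{p} &\le \frac{1}{1 + \kappa \left( \frac{1}{\Gamma} - 1 + \frac{\sqrt{\kappa}}{\Gamma} \right)} \le \frac{1}{1 + \frac{l_M}{l_m} \left( \frac{1}{\Gamma} - 1 + \frac{1}{\Gamma} \frac{\sqrt{l_g}}{\sqrt{l_{t_j}}} \right)}
\end{aligned}$$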

Replacing , and let the condition to guarantee that bad local minima do no exist is and . Now, we can repeat the above analysis recursively for every corresponding and