The Benefits of Implicit Regularization from SGD in Least Squares Problems

08/10/2021 ∙ by Difan Zou, et al. ∙ Johns Hopkins University ∙ University of Washington

Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice, which has been hypothesized to play an important role in the generalization of modern machine learning approaches. In this work, we seek to understand these issues in the simpler setting of linear regression (including both underparameterized and overparameterized regimes), where our goal is to make sharp instance-based comparisons of the implicit regularization afforded by (unregularized) average SGD with the explicit regularization of ridge regression. For a broad class of least squares problem instances (that are natural in high-dimensional settings), we show: (1) for every problem instance and for every ridge parameter, (unregularized) SGD, when provided with logarithmically more samples than that provided to the ridge algorithm, generalizes no worse than the ridge solution (provided SGD uses a tuned constant stepsize); (2) conversely, there exist instances (in this wide problem class) where optimally-tuned ridge regression requires quadratically more samples than SGD in order to have the same generalization performance. Taken together, our results show that, up to the logarithmic factors, the generalization performance of SGD is always no worse than that of ridge regression in a wide range of overparameterized problems, and, in fact, could be much better for some problem instances. More generally, our results show how algorithmic regularization has important consequences even in simpler (overparameterized) convex settings.


1 Introduction

Deep neural networks often exhibit powerful generalization in numerous machine learning applications, despite being overparameterized. It has been conjectured that the optimization algorithm itself, e.g., stochastic gradient descent (SGD), implicitly regularizes such overparameterized models (Zhang et al., 2016); here, (unregularized) overparameterized models could admit numerous global and local minima (many of which generalize poorly (Zhang et al., 2016; Liu et al., 2019)), yet SGD tends to find solutions that generalize well, even in the absence of explicit regularizers (Neyshabur et al., 2014; Zhang et al., 2016; Keskar et al., 2016). This regularizing effect due to the choice of the optimization algorithm is often referred to as implicit regularization (Neyshabur et al., 2014).

Before moving to the non-convex regime, we may hope to start by understanding this effect in the (overparameterized) convex regime. At least for linear models, there is a growing body of evidence suggesting that the implicit regularization of SGD is closely related to an explicit, $\ell_2$-type of (ridge) regularization (Tihonov, 1963). For example, (multi-pass) SGD for linear regression converges to the minimum-norm interpolator, which corresponds to the limit of the ridge solution with a vanishing penalty (Zhang et al., 2016; Gunasekar et al., 2018). Tangential evidence for this also comes from examining gradient descent, where a continuous-time (gradient flow) analysis shows how the optimization path of gradient descent is (pointwise) closely connected to an explicit $\ell_2$-regularization (Suggala et al., 2018; Ali et al., 2019).

However, as of yet, a precise comparison between the implicit regularization afforded by SGD and the explicit regularization of ridge regression (in terms of the generalization performance) is still lacking. This motivates the central question in this work:

How does the generalization performance of SGD compare with that of ridge regression in least squares problems?

In particular, even in the arguably simplest setting of linear regression, we seek to understand if/how SGD behaves differently from using an explicit $\ell_2$-regularizer, with a particular focus on the overparameterized regime.

Our Contributions.

Due to recent advances on sharp, instance-dependent excess risk bounds for both (one-pass) SGD and ridge regression in overparameterized least squares problems (Tsigler and Bartlett, 2020; Zou et al., 2021), a nearly complete answer to the above question is now possible using these tools. In this work, we deliver an instance-based risk comparison between SGD and ridge regression in several interesting settings, including one-hot distributed data and Gaussian data. In particular, for a broad class of least squares problem instances that are natural in high-dimensional settings, we show that:

  • For every problem instance and for every ridge parameter, (unregularized) SGD, when provided with logarithmically more samples than those provided to ridge regression, generalizes no worse than the ridge solution, provided SGD uses a tuned constant stepsize.

  • Conversely, there exist instances in our problem class where optimally-tuned ridge regression requires quadratically more samples than SGD in order to achieve the same generalization performance.

Quite strikingly, the above results show that, up to logarithmic factors, the generalization performance of SGD is always no worse than that of ridge regression in a wide range of overparameterized least squares problems and, in fact, could be much better for some problem instances. As a special case (for the above two claims), our problem class includes a setting in which (i) the signal-to-noise ratio is bounded and (ii) the eigenspectrum decays at a polynomial rate (which permits a relatively fast decay). This one-sided near-domination phenomenon (in these natural overparameterized problem classes) further supports a preference for the implicit regularization brought by SGD over explicit ridge regularization.

Several novel technical contributions are required to make the above risk comparisons possible. For the one-hot data, we derive the corresponding risk upper bound for SGD and risk lower bound for ridge regression. For the Gaussian data, while a sharp risk bound for SGD is borrowed from (Zou et al., 2021), we prove a sharp lower bound for ridge regression by adapting the proof techniques developed in (Tsigler and Bartlett, 2020; Bartlett et al., 2020). By carefully comparing these upper and lower bounds (and exhibiting particular instances to show that our sample size inflation bounds are sharp), we are able to provide nearly complete conditions characterizing when SGD generalizes better than ridge regression.

Notation.

For two functions $f(n) \geq 0$ and $g(n) \geq 0$ defined on the positive integers, we write $f(n) \lesssim g(n)$ if $f(n) \leq c \cdot g(n)$ for some absolute constant $c > 0$; we write $f(n) \gtrsim g(n)$ if $g(n) \lesssim f(n)$; and we write $f(n) \eqsim g(n)$ if both $f(n) \lesssim g(n)$ and $g(n) \lesssim f(n)$. For a vector $\mathbf{w}$ and a positive semidefinite matrix $\mathbf{A}$, we denote $\|\mathbf{w}\|_{\mathbf{A}}^2 := \mathbf{w}^\top \mathbf{A} \mathbf{w}$.

2 Related Work

In terms of making sharp risk comparisons with ridge regression, the work of (Dhillon et al., 2013) shows that OLS (after a PCA projection is applied to the data) is instance-wise competitive with ridge regression on fixed-design problems. The insights in our analysis are drawn from this work, though there are a number of technical challenges in dealing with the random-design setting. We start with a brief discussion of the technical advances in the analysis of ridge regression and SGD, and then briefly overview more related work comparing SGD to explicit norm-based regularization.

Excess Risk Bounds for Ridge Regression.

In the underparameterized regime, excess risk bounds for ridge regression are well understood (Hsu et al., 2012). In the overparameterized regime, a large body of work (Dobriban et al., 2018; Hastie et al., 2019; Xu and Hsu, 2019; Wu and Xu, 2020) has focused on characterizing the excess risk of ridge regression in the asymptotic regime where both the sample size and the dimension go to infinity at a fixed ratio. More recently, Bartlett et al. (2020) developed sharp non-asymptotic risk bounds for ordinary least squares in the overparameterized setting, which were further extended to ridge regression by Tsigler and Bartlett (2020). These bounds are of additional interest because they are instance-dependent, in particular, depending on the data covariance spectrum. The risk bounds for ridge regression derived in Tsigler and Bartlett (2020) are highly nontrivial in the overparameterized setting, as they hold even when the ridge parameter equals zero or is negative. This line of results provides one part of the theoretical toolkit for this paper.

Excess Risk Bounds for SGD.

Risk bounds for one-pass, constant-stepsize (averaged) SGD have been derived in the finite-dimensional case (Bach and Moulines, 2013; Défossez and Bach, 2015; Jain et al., 2017a, b; Dieuleveut et al., 2017). Very recently, the work of (Zou et al., 2021) extended these analyses, providing sharp instance-dependent risk bounds applicable to the overparameterized regime; in particular, Zou et al. (2021) provide nearly matching upper and lower excess risk bounds for constant-stepsize SGD, which are sharply characterized in terms of the full eigenspectrum of the population covariance matrix. This result plays a pivotal role in our paper.

Implicit Regularization of SGD vs. Explicit Norm-based Regularization.

For least squares problems, multi-pass SGD converges to the minimum-norm solution (Neyshabur et al., 2014; Zhang et al., 2016; Gunasekar et al., 2018), which is widely cited as (one of) the implicit biases of SGD. However, in more general settings, e.g., convex but non-linear models, a (distribution-independent) norm-based regularizer is no longer sufficient to characterize the optimization behavior of SGD (Arora et al., 2019; Dauber et al., 2020; Razin and Cohen, 2020). Those discussions, however, exclude the possibility of hyperparameter tuning, e.g., the stepsize for SGD and the penalty strength for ridge regression, and are not instance-based, either. Our aim in this paper is to provide an instance-based excess risk comparison between optimally tuned (one-pass) SGD and optimally tuned ridge regression.

3 Problem Setup and Preliminaries

We seek to compare the generalization ability of the SGD and ridge regression algorithms for least squares problems. We use $\mathbf{x} \in \mathcal{H}$ to denote a feature vector in a (separable) Hilbert space $\mathcal{H}$. We use $d$ to refer to the dimensionality of $\mathcal{H}$, where $d = \infty$ if $\mathcal{H}$ is infinite-dimensional. We use $y \in \mathbb{R}$ to denote a response that is generated by $y = \langle \mathbf{w}^*, \mathbf{x} \rangle + \epsilon$, where $\mathbf{w}^* \in \mathcal{H}$ is an unknown true model parameter and $\epsilon \in \mathbb{R}$ is the model noise. The following regularity assumption is made throughout the paper.

[Well-specified noise] The second moment of $\mathbf{x}$, denoted by $\mathbf{H} := \mathbb{E}[\mathbf{x} \otimes \mathbf{x}]$, is strictly positive definite and has finite trace. The noise $\epsilon$ is independent of $\mathbf{x}$ and satisfies $\mathbb{E}[\epsilon] = 0$ and $\mathbb{E}[\epsilon^2] = \sigma^2$.

In order to characterize the interplay between $\mathbf{w}^*$ and $\mathbf{H}$ in the excess risk bound, we introduce the spectral decomposition $\mathbf{H} = \sum_{i} \lambda_i \mathbf{v}_i \mathbf{v}_i^\top$, where $\lambda_1 \geq \lambda_2 \geq \cdots$ are the eigenvalues of $\mathbf{H}$ sorted in non-increasing order and the $\mathbf{v}_i$'s are the corresponding eigenvectors. We will also refer to the head and tail eigenspaces of $\mathbf{H}$ defined by this decomposition.

The least squares problem is to estimate the true parameter $\mathbf{w}^*$. Assumption 3 implies that $\mathbf{w}^*$ is the unique solution that minimizes the population risk:

$$\mathbf{w}^* = \arg\min_{\mathbf{w}} L(\mathbf{w}), \qquad L(\mathbf{w}) := \tfrac{1}{2}\, \mathbb{E}\big[ (y - \langle \mathbf{w}, \mathbf{x} \rangle)^2 \big]. \qquad (1)$$

Moreover, we have that $L(\mathbf{w}^*) = \sigma^2 / 2$. For an estimator $\mathbf{w}$ found by some algorithm, e.g., SGD or ridge regression, its performance is measured by the excess risk, $L(\mathbf{w}) - L(\mathbf{w}^*)$.
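To make the excess risk concrete, the following NumPy sketch (ours, not from the paper; the function names and the toy instance are illustrative) evaluates the population risk and the excess risk in closed form, using the identity that, under the $\tfrac{1}{2}$-squared-loss convention above, the excess risk equals $\tfrac{1}{2}\|\mathbf{w} - \mathbf{w}^*\|_{\mathbf{H}}^2$.

```python
import numpy as np

def population_risk(w, w_star, H, sigma2):
    """L(w) = 1/2 E[(y - <w, x>)^2] = 1/2 * ((w - w*)^T H (w - w*) + sigma^2)."""
    diff = w - w_star
    return 0.5 * (diff @ H @ diff + sigma2)

def excess_risk(w, w_star, H):
    """L(w) - L(w*) = 1/2 * ||w - w*||_H^2."""
    diff = w - w_star
    return 0.5 * diff @ H @ diff

# Toy 3-dimensional instance (illustrative numbers only).
H = np.diag([1.0, 0.1, 0.01])
w_star = np.array([1.0, 1.0, 1.0])
w_hat = np.array([0.9, 1.2, 0.5])
sigma2 = 0.25

# The two ways of computing the excess risk agree.
print(population_risk(w_hat, w_star, H, sigma2) - population_risk(w_star, w_star, H, sigma2))
print(excess_risk(w_hat, w_star, H))
```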

Constant-Stepsize SGD with Tail-Averaging.

We consider constant-stepsize SGD with tail-averaging (Bach and Moulines, 2013; Jain et al., 2017a, b; Zou et al., 2021): at the $t$-th iteration, a fresh example $(\mathbf{x}_t, y_t)$ is sampled independently from the data distribution, and SGD makes the following update on the current estimator $\mathbf{w}_{t-1}$:

$$\mathbf{w}_t = \mathbf{w}_{t-1} + \gamma\, \big( y_t - \langle \mathbf{w}_{t-1}, \mathbf{x}_t \rangle \big)\, \mathbf{x}_t,$$

where $\gamma > 0$ is a constant stepsize. After $N$ iterations (which is also the number of samples observed), SGD outputs the tail-averaged iterates as the final estimator:

$$\bar{\mathbf{w}}_N := \frac{2}{N} \sum_{t = N/2}^{N-1} \mathbf{w}_t.$$

In the underparameterized setting ($N \gtrsim d$), constant-stepsize SGD with tail-averaging is known to achieve the minimax optimal rate for least squares (Jain et al., 2017a, b). More recently, Zou et al. (2021) investigated the performance of constant-stepsize SGD with tail-averaging in the overparameterized regime ($d \gtrsim N$), and established instance-dependent, nearly optimal excess risk bounds under mild assumptions on the data distribution. Notably, the results of (Zou et al., 2021) cover the underparameterized case as well.
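As a minimal sketch (ours, not the authors' code) of the procedure above: one pass of constant-stepsize SGD on synthetic Gaussian data, returning the average of the last half of the iterates. The zero initialization, the stepsize of roughly $1/\operatorname{tr}(\mathbf{H})$, and the synthetic spectrum are assumptions made for illustration.

```python
import numpy as np

def sgd_tail_average(X, y, gamma):
    """One-pass, constant-stepsize SGD; returns the average of the last N/2 iterates."""
    n, d = X.shape
    w = np.zeros(d)                      # assumed zero initialization
    tail_start, w_sum = n // 2, np.zeros(d)
    for t in range(n):
        w = w + gamma * (y[t] - X[t] @ w) * X[t]   # stochastic gradient step
        if t >= tail_start:
            w_sum += w
    return w_sum / (n - tail_start)

# Synthetic Gaussian problem instance (illustrative values only).
rng = np.random.default_rng(0)
d, n, sigma = 20, 2000, 0.5
lam = 1.0 / np.arange(1, d + 1)                # polynomially decaying spectrum
X = rng.normal(size=(n, d)) * np.sqrt(lam)     # rows x_t ~ N(0, diag(lam))
w_star = rng.normal(size=d)
y = X @ w_star + sigma * rng.normal(size=n)

w_sgd = sgd_tail_average(X, y, gamma=1.0 / (2.0 * lam.sum()))
print(0.5 * (w_sgd - w_star) @ (np.diag(lam) @ (w_sgd - w_star)))   # excess risk
```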

Ridge Regression.

Given $n$ i.i.d. samples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, let us denote $\mathbf{X} := (\mathbf{x}_1, \dots, \mathbf{x}_n)^\top$ and $\mathbf{y} := (y_1, \dots, y_n)^\top$. Then ridge regression outputs the following estimator for the true parameter (Tihonov, 1963):

$$\hat{\mathbf{w}}^{\mathrm{ridge}}(n; \lambda) := \arg\min_{\mathbf{w}} \big\{ \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 + \lambda \|\mathbf{w}\|_2^2 \big\}, \qquad (2)$$

where $\lambda$ (which could possibly be negative) is a regularization parameter. We remark that the ridge regression estimator takes the following two equivalent forms:

$$\hat{\mathbf{w}}^{\mathrm{ridge}}(n; \lambda) = \big( \mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I} \big)^{-1} \mathbf{X}^\top \mathbf{y} = \mathbf{X}^\top \big( \mathbf{X} \mathbf{X}^\top + \lambda \mathbf{I} \big)^{-1} \mathbf{y}. \qquad (3)$$

The first expression is useful in the classical, underparameterized setting ($n > d$) (Hsu et al., 2012), while the second expression is more useful in the overparameterized setting ($d > n$), where the empirical covariance is usually not invertible (Kobak et al., 2020; Tsigler and Bartlett, 2020). As a final remark, when $\lambda = 0$, the ridge estimator reduces to the ordinary least squares (OLS) estimator (Friedman et al., 2001).
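A quick numerical check (ours) of the two equivalent closed forms in (3): the first solves a $d \times d$ system, the second an $n \times n$ system, which is what makes it convenient when $d > n$. The sizes and the regularization value below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 30, 100, 1e-2                      # overparameterized: d > n
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Form 1: (X^T X + lam I_d)^{-1} X^T y  -- a d x d solve, natural when n > d.
w1 = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Form 2: X^T (X X^T + lam I_n)^{-1} y  -- an n x n solve, natural when d > n.
w2 = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)

print(np.allclose(w1, w2))                     # True: the two forms coincide

# As lam -> 0, the second form recovers the minimum-norm interpolator; for n > d,
# setting lam = 0 in the first form recovers the OLS estimator.
```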

Generalizable Regime.

In the following sections we make instance-based risk comparisons between SGD and ridge regression. To make the comparison meaningful, we focus on the regime where SGD and ridge regression are "generalizable", i.e., the regime where the SGD and ridge regression estimators, with optimally tuned hyperparameters, can achieve excess risk that is smaller than the optimal population risk $L(\mathbf{w}^*)$. The formal definition is as follows.

[Generalizability] Consider an algorithm $\mathcal{A}$ and a least squares problem instance $\mathcal{P}$. Let $\mathcal{A}(n)$ be the output of the algorithm when provided with $n$ i.i.d. samples from the problem instance $\mathcal{P}$ and a set of hyperparameters (that could be a function of $n$). Then we say that the algorithm $\mathcal{A}$ with sample size $n$ and hyperparameter configuration is generalizable on problem instance $\mathcal{P}$ if its expected excess risk is smaller than the optimal population risk $L(\mathbf{w}^*)$, where the expectation is over the randomness of $\mathcal{A}$ and the data drawn from the problem instance $\mathcal{P}$. Clearly, the generalizable regime is defined by conditions on the sample size, the hyperparameter configuration, the problem instance, and the algorithm. For example, in the $d$-dimensional setting with $n > d$, the ordinary least squares (OLS) solution (ridge regression with $\lambda = 0$) has excess risk on the order of $\sigma^2 d / n$, so OLS is in the generalizable regime if $n \gtrsim d$.
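To make the definition concrete, here is a small Monte Carlo sketch (ours; the "excess risk below $L(\mathbf{w}^*)$" threshold follows the informal description above, and the instance is illustrative) that checks whether OLS is generalizable on a $d$-dimensional problem at several sample sizes.

```python
import numpy as np

def expected_excess_risk_ols(H_diag, w_star, sigma, n, trials=200, seed=0):
    """Monte Carlo estimate of E[L(w_ols)] - L(w*), with L(w) = 1/2 E[(y - <w, x>)^2]."""
    rng = np.random.default_rng(seed)
    d = len(w_star)
    risks = []
    for _ in range(trials):
        X = rng.normal(size=(n, d)) * np.sqrt(H_diag)      # x ~ N(0, diag(H_diag))
        y = X @ w_star + sigma * rng.normal(size=n)
        w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
        diff = w_ols - w_star
        risks.append(0.5 * diff @ (H_diag * diff))
    return float(np.mean(risks))

H_diag = 1.0 / np.arange(1, 11)        # d = 10
w_star = np.ones(10)
sigma = 1.0
optimal_risk = 0.5 * sigma**2          # L(w*)

for n in (12, 50, 500):
    excess = expected_excess_risk_ols(H_diag, w_star, sigma, n)
    print(n, round(excess, 4), "generalizable" if excess < optimal_risk else "not generalizable")
```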

Sample Inflation vs. Risk Inflation Comparisons.

This work characterizes the sample inflation of SGD, i.e., it bounds the sample size required by SGD to achieve an instance-wise comparable excess risk to that of ridge regression (which is essentially the notion of Bahadur statistical efficiency (Bahadur, 1967, 1971)). Another natural comparison would examine the risk inflation of SGD, i.e., the instance-based increase in risk at any fixed sample size. Our preference for the former is due to the relative instability of the risk with respect to the sample size (in some cases, given a slightly different sample size, the risk can change rapidly).

4 Warm-Up: One-Hot Least Squares Problems

Let us begin with a simpler data distribution, the one-hot data distribution (inspired by settings where the input distribution is sparse). In detail, assume each input vector is sampled from the set of natural basis vectors $\{\mathbf{e}_1, \dots, \mathbf{e}_d\}$ according to the data distribution given by $\mathbb{P}(\mathbf{x} = \mathbf{e}_i) = \lambda_i$, where $\lambda_i > 0$ and $\sum_i \lambda_i = 1$. The class of one-hot least squares instances is then completely characterized by the distribution $\{\lambda_i\}_i$, the true parameter $\mathbf{w}^*$, and the noise level $\sigma^2$.

Clearly, the population data covariance matrix is $\mathbf{H} = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$. The next two theorems give instance-based sample inflation comparisons for this problem class.
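A small simulation (ours; the probability vector below is illustrative) of this one-hot data model, checking that the empirical second moment indeed matches $\mathbf{H} = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
lam = np.array([0.5, 0.25, 0.15, 0.07, 0.03])   # lambda_i > 0 with sum = 1
w_star, sigma = rng.normal(size=d), 0.1

def sample_one_hot(n):
    """Draw x = e_i with probability lambda_i and y = <w*, x> + noise."""
    idx = rng.choice(d, size=n, p=lam)
    X = np.eye(d)[idx]
    y = X @ w_star + sigma * rng.normal(size=n)
    return X, y

X, y = sample_one_hot(100_000)
print(np.round(X.T @ X / len(X), 3))            # approximately diag(lam), i.e., H
```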

[Instance-wise comparison, one-hot data] Let $\bar{\mathbf{w}}_{\mathrm{sgd}}$ and $\hat{\mathbf{w}}^{\mathrm{ridge}}$ be the solutions found by SGD and ridge regression when using their respective numbers of training examples. Then, for any one-hot least squares problem instance such that the ridge regression solution is generalizable, and for any ridge parameter, there exists a choice of stepsize for SGD such that

provided the sample size of SGD satisfies

Theorem 4 suggests that for every one-hot problem instance, when provided with the same number of samples or more, the SGD solution with a properly tuned stepsize generalizes at most a constant factor worse than the optimally tuned ridge regression solution. In other words, with the same number of samples, SGD is always competitive with ridge regression.

[Best-case comparison, one-hot data] There exists a one-hot least squares problem instance, together with an SGD solution using a constant stepsize and a given sample size, such that for any ridge regression solution with sample size

it holds that,

Theorem 4 shows that for some one-hot least squares instances, ridge regression, even with optimally tuned regularization, needs at least (nearly) quadratically more samples than those provided to SGD in order to compete with optimally tuned SGD. In other words, ridge regression can be much worse than SGD for one-hot least squares problems.

The above two results together indicate the superior performance of the implicit regularization of SGD in comparison with the explicit regularization of ridge regression for one-hot least squares problems. This is not the only case in which SGD is no worse than the ridge estimator. In fact, we next turn to comparing SGD with ridge regression for the class of Gaussian least squares instances, where both SGD and ridge regression exhibit richer behavior but SGD still retains its superiority over the ridge estimator.

5 Gaussian Least Squares Problems

In this section, we consider least squares problems with a Gaussian data distribution. In particular, assume the population distribution of the input vector $\mathbf{x}$ is Gaussian, i.e., $\mathbf{x} \sim \mathcal{N}(\mathbf{0}, \mathbf{H})$.¹ We further make the following regularity assumption for simplicity: $\mathbf{H}$ is strictly positive definite and has a finite trace. Gaussian least squares problems are then completely characterized by the covariance $\mathbf{H}$, the true parameter $\mathbf{w}^*$, and the noise level $\sigma^2$.

¹We restrict ourselves to the Gaussian distribution for simplicity. Our results hold under more general assumptions, e.g., when the (whitened) input has sub-Gaussian tails and independent components (Bartlett et al., 2020) and is symmetrically distributed.
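The following helper (ours; the polynomial spectrum and the scaling of $\mathbf{w}^*$ are illustrative assumptions) generates a Gaussian least squares instance $\mathbf{x} \sim \mathcal{N}(\mathbf{0}, \mathbf{H})$ of the kind considered in this section.

```python
import numpy as np

def make_gaussian_instance(d, decay, seed=3):
    """Return (H, w_star) with eigenvalues lambda_i = i^(-decay)."""
    rng = np.random.default_rng(seed)
    lam = np.arange(1, d + 1, dtype=float) ** (-decay)
    H = np.diag(lam)
    w_star = rng.normal(size=d) / np.sqrt(d)
    return H, w_star

def sample(H, w_star, sigma, n, seed=4):
    """Draw n i.i.d. pairs (x, y) with x ~ N(0, H) and y = <w*, x> + noise."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(H)
    X = rng.normal(size=(n, len(w_star))) @ L.T
    y = X @ w_star + sigma * rng.normal(size=n)
    return X, y

H, w_star = make_gaussian_instance(d=200, decay=2.0)
X, y = sample(H, w_star, sigma=0.5, n=100)      # overparameterized: d > n
print(X.shape, y.shape)
```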

The next theorem gives an instance-based sample inflation comparison between SGD and ridge regression for Gaussian least squares instances.

[Instance-wise comparison, Gaussian data] Let $\bar{\mathbf{w}}_{\mathrm{sgd}}$ and $\hat{\mathbf{w}}^{\mathrm{ridge}}$ be the solutions found by SGD and ridge regression, respectively. Then, under Assumption 5, for any Gaussian least squares problem instance such that the ridge regression solution is generalizable, and for any ridge parameter, there exists a choice of stepsize for SGD such that

provided the sample size of SGD satisfies

where

Note that the result in Theorem 5 holds for an arbitrary ridge parameter. This theorem provides a sufficient condition for SGD to provably perform no worse than the optimal ridge regression solution (i.e., ridge regression with an optimally tuned regularization parameter). We would also like to point out that the stepsize of SGD in Theorem 5 only depends on quantities that do not require full access to the data distribution and can be easily estimated from the training dataset.

Different from the one-hot case, here the required sample size for SGD depends on two important quantities. The first can be understood as the signal-to-noise ratio; the second characterizes the flatness of the eigenspectrum of $\mathbf{H}$ in its top subspace. Let us further explain why these dependencies appear in the sample inflation condition for SGD.

A large signal-to-noise ratio indicates that the hardness of the problem comes more from numerical optimization than from statistical learning. In particular, let us consider the special case $\sigma^2 = 0$, i.e., there is no noise in the least squares problem, and thus solving it is purely a numerical optimization issue. In this case, ridge regression with $\lambda = 0$ achieves zero population risk as long as the observed data span the whole parameter space, but constant-stepsize SGD in general suffers a non-zero risk after finitely many steps, and thus cannot be competitive with the risk of ridge regression, as predicted by Theorem 5. From a learning perspective, a constant or even small signal-to-noise ratio is more interesting.

To explain why the dependence on the spectral flatness is unavoidable, we can consider a 2-dimensional example in which the second eigenvalue of $\mathbf{H}$ is much smaller than the first. It is commonly known that, for this problem, ridge regression with $\lambda = 0$ can achieve the standard excess risk bound (Friedman et al., 2001). However, this problem is rather difficult for SGD, since it is hard to learn the second coordinate of $\mathbf{w}^*$ using gradient information (the gradient in the second coordinate is quite small). In fact, in order to accurately learn $\mathbf{w}^*$, SGD requires a number of iterations/samples roughly inversely proportional to the small second eigenvalue, which is consistent with our theory.
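The following simulation (ours; the eigenvalues, ground truth, noise level, stepsize, and sample size are hypothetical stand-ins, since the exact values are not shown above) illustrates the qualitative point: with a tiny second eigenvalue, OLS ($\lambda = 0$) learns both coordinates from a modest number of samples, while constant-stepsize SGD barely moves the second coordinate.

```python
import numpy as np

rng = np.random.default_rng(5)
lam = np.array([1.0, 1e-3])            # hypothetical spectrum with a tiny second eigenvalue
w_star = np.array([1.0, 1.0])
sigma, n = 0.1, 500

X = rng.normal(size=(n, 2)) * np.sqrt(lam)
y = X @ w_star + sigma * rng.normal(size=n)

# Ridge regression with lambda = 0 (i.e., OLS) learns both coordinates from n >> d samples.
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Constant-stepsize SGD with tail-averaging: the gradient signal in the second
# coordinate scales with lam[1], so n = 500 steps barely move that coordinate.
gamma, w, tail = 0.5, np.zeros(2), []
for t in range(n):
    w = w + gamma * (y[t] - X[t] @ w) * X[t]
    if t >= n // 2:
        tail.append(w.copy())
w_sgd = np.mean(tail, axis=0)

def excess(w):                         # L(w) - L(w*) = 1/2 ||w - w*||_H^2
    return 0.5 * (w - w_star) @ (lam * (w - w_star))

print("OLS:", excess(w_ols), "SGD:", excess(w_sgd))   # SGD is markedly worse here
```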

From Theorem 5, it can be observed that when the signal-to-noise ratio is nearly a constant and the eigenspectrum of $\mathbf{H}$ does not decay too fast, SGD provably generalizes no worse than ridge regression, provided it is given logarithmically more samples than ridge regression. More specifically, the following corollary gives an example class of problem instances in this regime. Under the same conditions as Theorem 5, let $n$ be the sample size of ridge regression. Consider problem instances with a constant signal-to-noise ratio and a polynomially decaying eigenspectrum; then SGD, with a tuned stepsize, provably generalizes no worse than any ridge regression solution in the generalizable regime if

We would like to emphasize that the above condition on the problem dimension is not necessary: we can arbitrarily increase the dimension and reset the tail of the eigenspectrum appropriately, and the results in Corollary 5 still hold. This is because the dimension itself does not play an important role in our analysis; rather, the eigenspectrum is what is essential to the generalization guarantees.

The next theorem shows that, in fact, for some instances, SGD can perform much better than ridge regression, as was the case for one-hot least squares problems.

[Best-case comparison, Gaussian data] There exists a Gaussian least squares problem instance, together with an SGD solution with a constant stepsize and a given sample size, such that for any ridge regression solution (i.e., for any regularization parameter $\lambda$) with sample size

it holds that,

Besides the instance-wise comparison, it is also interesting to see under which conditions SGD can provably outperform ridge regression, i.e., achieve a comparable or smaller excess risk using the same number of samples. The following theorem shows that this occurs when the signal-to-noise ratio is a constant and only a small fraction of $\mathbf{w}^*$ lives in the tail eigenspace of $\mathbf{H}$. [SGD outperforms ridge regression, Gaussian data] Let $n$ be the sample size of ridge regression. Then, if the signal-to-noise ratio is bounded and the mass of $\mathbf{w}^*$ in the tail eigenspace of $\mathbf{H}$ is sufficiently small,

for any ridge regression solution that is generalizable, there exists a choice of stepsize for SGD such that

provided the sample size of SGD satisfies

5.1 Experiments

We perform experiments on Gaussian least squares problems. We consider problem instances formed by combining different covariance matrices $\mathbf{H}$ with different true model parameter vectors $\mathbf{w}^*$. Figure 1 compares the required sample sizes of ridge regression and SGD that lead to the same population risk on these problem instances, where the hyperparameters (i.e., the stepsize and the regularization parameter) are fine-tuned to achieve the best performance. We have two key observations: (1) in terms of the worst problem instance for SGD, its sample size is only worse than that of ridge regression up to nearly constant factors (the curve is nearly linear); and (2) SGD can significantly outperform ridge regression when the true model mainly lives in the head eigenspace of $\mathbf{H}$. These empirical observations are consistent with our theoretical findings and again demonstrate the benefit of the implicit regularization of SGD.

Figure 1: Sample size comparison between SGD and ridge regression, where the stepsize and the regularization parameter are fine-tuned to achieve the best performance. We consider combinations of different covariance matrices and different ground-truth model vectors; the plots are averaged over independent runs.
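For completeness, the following is a much-simplified sketch (ours; the dimension, spectrum, ground truth, hyperparameter grids, and single-draw risk estimates are all illustrative, not the paper's configuration) of the Figure 1 methodology: for each ridge sample size, find the smallest SGD sample size whose best tuned excess risk matches the best tuned ridge excess risk.

```python
import numpy as np

rng = np.random.default_rng(6)
d, sigma = 100, 0.5
lam = 1.0 / np.arange(1, d + 1) ** 2
H = np.diag(lam)
w_star = 1.0 / np.sqrt(np.arange(1, d + 1))     # illustrative ground truth

def draw(n):
    X = rng.normal(size=(n, d)) * np.sqrt(lam)
    return X, X @ w_star + sigma * rng.normal(size=n)

def excess(w):
    return 0.5 * (w - w_star) @ H @ (w - w_star)

def best_ridge(n, lambdas=(1e-3, 1e-2, 1e-1, 1.0, 10.0)):
    X, y = draw(n)
    return min(excess(X.T @ np.linalg.solve(X @ X.T + l * np.eye(n), y)) for l in lambdas)

def best_sgd(n, gammas=(0.02, 0.05, 0.1, 0.2)):
    X, y = draw(n)
    risks = []
    for g in gammas:
        w, tail = np.zeros(d), np.zeros(d)
        for t in range(n):
            w = w + g * (y[t] - X[t] @ w) * X[t]
            if t >= n // 2:
                tail += w
        risks.append(excess(tail / (n - n // 2)))
    return min(risks)

for n_ridge in (100, 200, 400):
    target = best_ridge(n_ridge)
    n_sgd = n_ridge
    while best_sgd(n_sgd) > target and n_sgd < 16 * n_ridge:
        n_sgd *= 2
    print(f"ridge n = {n_ridge:4d}: tuned ridge risk = {target:.4f}, "
          f"SGD matches it with roughly n = {n_sgd} samples (search capped at 16x)")
```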

6 An Overview of the Proof

In this section, we sketch the proofs of the main theorems for Gaussian least squares problems. Recall that we aim to show that, provided a certain number of training samples, SGD is guaranteed to generalize no worse than ridge regression. Therefore, we compare the risk upper bound of SGD (Zou et al., 2021) with the risk lower bound of ridge regression (Tsigler and Bartlett, 2020).² In particular, we first provide the following informal lemma summarizing the aforementioned risk bounds for SGD and ridge regression.

²The lower bound for ridge regression in our paper is a tighter variant of the lower bound in Tsigler and Bartlett (2020), since we consider the Gaussian case and focus on the expected excess risk, whereas Tsigler and Bartlett (2020) studied the sub-Gaussian case and established a high-probability risk bound.

[Risk bounds of SGD and ridge regression, informal] Suppose Assumptions 3 and 5 hold and the stepsize satisfies the required condition; then SGD has the following risk upper bound for an arbitrary choice of the head/tail split,

(4)

Additionally, ridge regression has the following risk lower bound for a constant depending on the problem instance,

(5)

We first highlight some useful observations in Lemma 6.

  1. SGD has a condition on the stepsize, while ridge regression has no condition on the regularization parameter $\lambda$.

  2. Both the upper bound for SGD and the lower bound for ridge regression can be decomposed into two parts corresponding to the head and tail eigenspaces of $\mathbf{H}$. Furthermore, for the upper bound of SGD the decomposition is arbitrary (the split index can be chosen freely), while for the lower bound of the ridge estimator the decomposition is fixed.

  3. Regarding the stepsize and the sample size, jointly decreasing the stepsize and proportionally enlarging the sample size decreases one of the two error terms by the corresponding factor while the other remains unchanged.

Based on the above observations, we can now present the proof sketches for Theorems 5, 5, and 5. We first give the sketch for Theorem 5 and then prove Theorem 5, for ease of presentation. We would like to emphasize that the calculations in the proof sketches may not be the sharpest, since they are presented for ease of exposition. More precise and sharper calculations can be found in the Appendix.

Proof Sketch of Theorem 5.

In order to perform an instance-wise comparison, we need to take care of all possible $\mathbf{w}^*$. Therefore, by Observation 2, we can simply pick the split index in the SGD upper bound to match the fixed split of the ridge lower bound. Then it is clear that, setting the stepsize and sample size appropriately, we have

Then, by Observation 3, enlarging the sample size by an appropriate factor suffices to guarantee

On the other hand, according to Observation 1, there is an upper bound on the feasible stepsize of SGD. Therefore, the above claim only holds when the required stepsize is feasible.

When this is not the case, the required stepsize is no longer feasible and, instead, we use the largest possible stepsize. Besides, note that we assume the ridge regression solution is in the generalizable regime; then the corresponding constraint holds, since otherwise we would have

Then we again choose the split index and stepsize as above. Applying this choice of stepsize and sample size,

we get

(6)

Moreover, we can also get the following bound,

where in the second inequality we use the fact that

Therefore, by Observation 3 again, we can enlarge the sample size properly to ensure that the first term remains unchanged while the second is controlled. Then, combining this with (6), we get

which completes the proof.

Proof Sketch of Theorem 5.

Now we investigate the regime in which SGD generalizes no worse than ridge regression when provided with the same training sample size. For simplicity, in the proof we make a normalization assumption. First, note that, by the proof sketch of Theorem 5, we only need to deal with the remaining case.

Unlike the instance-wise comparison, which considers all possible $\mathbf{w}^*$, here we only consider the set of $\mathbf{w}^*$ on which SGD performs well. Specifically, as shown in the proof of Theorem 5, in the worst-case comparison (in terms of $\mathbf{w}^*$) we require SGD to be able to learn the leading coordinates of $\mathbf{w}^*$ in order to be competitive with ridge regression, while SGD with the same sample size can only be guaranteed to learn a smaller number of leading coordinates of $\mathbf{w}^*$. Therefore, in the instance-wise comparison we need to enlarge the sample size to guarantee the learning of these top coordinates of $\mathbf{w}^*$.

However, this is not required for good $\mathbf{w}^*$'s that have small components in those coordinates. In particular, under the assumption made in the theorem, choosing the split index accordingly, we have

where the inequality is due to the condition assumed in the theorem. Moreover, it is easy to see that, given the choices above, the desired bound holds. As a consequence, we get

Proof Sketch of Theorem 5.

We consider the best case of $\mathbf{w}^*$ for SGD, which has a nonzero entry only in the first coordinate. For example, consider a true model parameter vector supported on the first coordinate and a problem instance whose spectrum of $\mathbf{H}$ has a flat tail. Then, according to Lemma 6, we can set the stepsize appropriately and get

For ridge regression, according to Lemma 6 we have

Therefore, it is evident that ridge regression is guaranteed to be worse than SGD if its sample size is not substantially larger than that of SGD. This completes the proof.

7 Conclusions

We conduct an instance-based risk comparison between SGD and ridge regression for a broad class of least squares problems. We show that SGD is always no worse than ridge regression provided logarithmically more samples. On the other hand, there exist instances where even optimally-tuned ridge regression needs quadratically more samples to compete with SGD. This separation in terms of sample inflation between SGD and ridge regression suggests a provable benefit of implicit regularization over explicit regularization for least squares problems. In the future, we will explore the benefits of implicit regularization for learning other linear models and potentially nonlinear models.

Appendix A Proofs for One-hot Least Squares

A.1 Excess Risk Bound of SGD

In this part, we mainly follow the proof technique in Zou et al. (2021), which was developed to sharply characterize the excess risk of SGD (with tail-averaging) when the data distribution satisfies a suitable fourth-moment bound. However, such a condition does not hold in the one-hot case, so their results cannot be directly applied here.

Before presenting the detailed proofs, we first introduce some notation and definitions that will be used repeatedly in the subsequent analysis. Let $\mathbf{H}$ be the covariance of the data distribution. It is easy to verify that $\mathbf{H}$ is a diagonal matrix with eigenvalues $\lambda_1, \dots, \lambda_d$. Let $\mathbf{w}_t$ be the $t$-th iterate of SGD; we define $\boldsymbol{\eta}_t := \mathbf{w}_t - \mathbf{w}^*$ as the centered SGD iterate. Then we define $\boldsymbol{\eta}_t^{\mathrm{bias}}$ and $\boldsymbol{\eta}_t^{\mathrm{variance}}$ as the bias error and variance error, respectively, which are described by the following update rules:

$$\boldsymbol{\eta}_t^{\mathrm{bias}} = \big(\mathbf{I} - \gamma\,\mathbf{x}_t \mathbf{x}_t^\top\big)\,\boldsymbol{\eta}_{t-1}^{\mathrm{bias}}, \quad \boldsymbol{\eta}_0^{\mathrm{bias}} = \boldsymbol{\eta}_0; \qquad \boldsymbol{\eta}_t^{\mathrm{variance}} = \big(\mathbf{I} - \gamma\,\mathbf{x}_t \mathbf{x}_t^\top\big)\,\boldsymbol{\eta}_{t-1}^{\mathrm{variance}} + \gamma\,\epsilon_t\,\mathbf{x}_t, \quad \boldsymbol{\eta}_0^{\mathrm{variance}} = \mathbf{0}. \qquad (7)$$
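The following numerical sanity check (ours; the symbol names mirror the recursion (7) as reconstructed above, and the instance is illustrative) verifies that the centered SGD iterate decomposes exactly as the sum of the bias and variance iterates when both recursions are driven by the same one-hot samples and noise.

```python
import numpy as np

rng = np.random.default_rng(7)
d, n, gamma, sigma = 5, 200, 0.3, 0.2
lam = np.array([0.5, 0.25, 0.15, 0.07, 0.03])   # one-hot sampling probabilities
w_star = rng.normal(size=d)

eta = -w_star.copy()          # eta_0 = w_0 - w* with w_0 = 0
eta_bias = eta.copy()         # driven by the data only (no noise)
eta_var = np.zeros(d)         # driven by the noise only

for t in range(n):
    x = np.eye(d)[rng.choice(d, p=lam)]              # one-hot sample x_t
    eps = sigma * rng.normal()                       # model noise eps_t
    contract = lambda v: v - gamma * (x @ v) * x     # v -> (I - gamma x x^T) v
    eta = contract(eta) + gamma * eps * x            # full centered SGD recursion
    eta_bias = contract(eta_bias)                    # bias recursion in (7)
    eta_var = contract(eta_var) + gamma * eps * x    # variance recursion in (7)

print(np.allclose(eta, eta_bias + eta_var))      # True: the decomposition is exact path-wise
```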

Accordingly, we can further define the bias covariance and the variance covariance as $\mathbf{B}_t := \mathbb{E}\big[\boldsymbol{\eta}_t^{\mathrm{bias}} \otimes \boldsymbol{\eta}_t^{\mathrm{bias}}\big]$ and $\mathbf{C}_t := \mathbb{E}\big[\boldsymbol{\eta}_t^{\mathrm{variance}} \otimes \boldsymbol{\eta}_t^{\mathrm{variance}}\big]$.

Regarding these two covariance matrices, the following lemma characterizes upper bounds on the diagonal entries of $\mathbf{B}_t$ and $\mathbf{C}_t$. Under Assumption 3, if the stepsize satisfies the required condition, we have

Proof.

According to (A.1), we have

(8)

Note that $\mathbf{x}_t = \mathbf{e}_i$ with probability $\lambda_i$; then we have

Plugging the above equation into (A.1) gives

Then, looking only at the diagonal entries of both sides, we have

where in the first equation we use the fact that and the inequality follows from the fact that both and