# Minimum norm solutions do not always generalize well for over-parameterized problems

Stochastic gradient descent (SGD) is the de facto algorithm for training deep neural networks (DNNs). Despite its popularity, it still requires fine hyper-parameter tuning in order to achieve its best performance. This has led to the development of adaptive methods, which claim automatic hyper-parameter tuning. Recently, researchers have studied both algorithmic classes via carefully constructed toy problems: e.g., for over-parameterized linear regression, [1] shows that, while SGD always converges to the minimum-norm solution (similar to the case of the maximum margin solution in SVMs that guarantees good prediction error), adaptive methods show no such inclination, leading to worse generalization capabilities. Our aim is to study this conjecture further. We empirically show that the minimum norm solution is not necessarily the proper gauge of good generalization in simplified scenarios, and different models found by adaptive methods could outperform plain gradient methods. In practical DNN settings, we observe that adaptive methods often perform at least as well as SGD, without necessarily reducing the amount of tuning required.


## 1 Introduction

In theory, deep neural networks (DNNs) are hard to train [2]. Apart from sundry architecture configurations –such as network depth, layer width, and type of activation functions– there are key algorithmic hyper-parameters that need to be properly tuned, in order to obtain a model that generalizes well within a reasonable amount of time.

Among the hyper-parameters, the one of pivotal importance is the step size [3, 4]. To set the background, note that most algorithms in practice are gradient-descent based: given the current model estimate w_k and some training examples, we iteratively compute the gradient ∇f(w_k) of the objective f, and update the model by advancing along the negative direction of the gradient, weighted by the step size η; i.e.,

 w_{k+1} = w_k − η∇f(w_k).

This is the crux of all gradient-descent based algorithms, including the ubiquitous stochastic gradient descent (SGD) algorithm. Here, the step size could be set constant, or could change per iteration [5], usually based on a predefined learning rate schedule [6, 7, 8].
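To make the recursion concrete, here is a minimal numpy sketch of plain gradient descent on a least-squares objective; the function name, step-size choice, and data below are illustrative assumptions, not from the paper:

```python
import numpy as np

# Plain gradient descent on f(w) = 0.5 * ||Xw - y||_2^2 (hedged sketch).
def gradient_descent(X, y, eta, iters):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ w - y)   # ∇f(w) = X^T (Xw - y)
        w = w - eta * grad         # w_{k+1} = w_k - η ∇f(w_k)
    return w

# Small under-parameterized example (n > d) with noiseless labels.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
eta = 1.0 / np.linalg.norm(X, 2) ** 2   # safe step size: η < 2/λ_max(X^T X)
w_gd = gradient_descent(X, y, eta, 5000)
assert np.allclose(w_gd, w_true, atol=1e-6)
```

The step size is set from the largest singular value of X, following the convergence condition discussed later in Section 3.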

Beyond practical strategies and tricks that lead to faster convergence and better generalization for simple gradient-based algorithms [9, 10, 5], during the past decade we have witnessed a family of algorithms that argue for automatic hyper-parameter adaptation [11] during training (including the step size). The list includes AdaGrad [12], Adam [13], AdaDelta [14, 15], AdaMax [13], Nadam [16], just to name a few. These algorithms utilize current and past gradient information ∇f(w_j), for j ≤ k, to design preconditioning matrices D_k that better pinpoint the local curvature of the objective function. (In this work, our theory focuses on adaptive but non-momentum-based methods, such as AdaGrad.) A simpler description of the above algorithms is as follows:

 w_{k+1} = w_k − ηD_k∇f(w_k).

The main argument is that adaptation eliminates pre-setting a learning rate schedule, or diminishes the effect of initially bad step size choices, thus detaching the time-consuming part of step size tuning from the practitioner [17].

Recently though, it has been argued that simple gradient-based algorithms may perform better compared to adaptive ones, creating skepticism regarding the efficiency of the latter. More specifically, for the linear regression setting, [1] shows that, under specific assumptions, the adaptive methods converge to a different solution than the minimum norm one. The latter has received attention due to its efficiency, analogous to the maximum margin solution in classification [18]. This behavior is also demonstrated using DNNs, where simple gradient descent generalizes at least as well as the adaptive methods, while adaptive techniques require at least the same amount of tuning as the simple gradient descent methods.

In this paper, our aim is to further study this conjecture. The paper is separated into a theoretical part (Sections 2-4) and a practical part (Subsection 4.3 and Section 5). For our theory, we focus on simple linear regression (Section 2), and discuss the differences between under- and over-parameterization. Section 3 focuses on simple gradient descent, and establishes closed-form solutions for both settings. We study the AdaGrad algorithm in the same setting in Section 4, and discuss under which conditions gradient descent and AdaGrad perform similarly, and when their behaviors diverge.

Our findings can be summarized as follows:


• For under-parameterized linear regression, closed-form solutions indicate that simple gradient descent and AdaGrad converge to the same solution; thus, adaptive methods have the same generalization capabilities, which is somewhat expected in under-parameterized convex linear regression.

• In over-parameterized linear regression, simple gradient methods converge to the minimum norm solution. In contrast, AdaGrad converges to a different point: while the computed model predicts correctly within the training data set, it has different generalization behavior on unseen data, than the minimum norm solution. [1] shows that AdaGrad generalizes worse than gradient descent; in this work, we empirically show that an AdaGrad variant generalizes better than the minimum norm solution on a different counterexample. We conjecture that the superiority of simple or adaptive methods depends on the problem/data at hand, and the discussion “who is provably better” is inconclusive.

• We conduct neural network experiments using different datasets and network architectures. Overall, we observe similar behavior using either simple or adaptive methods. Our findings support the conclusions of [1] that adaptive methods still require fine parameter tuning. Generalization-wise, we observe that simple algorithms are not universally superior to adaptive ones.

## 2 Background on linear regression

Consider the linear regression setting:

 min_{w∈R^d} ½·‖Xw − y‖₂²

where X ∈ R^{n×d} is the feature matrix and y ∈ R^n are the observations. There are two different settings, depending on the number of samples n and the dimension d:


• Over-parameterized case: In this case, we have more parameters than the number of samples: d > n. In this case, assuming that X is in general position, XX⊤ is full rank.

• Under-parameterized case: Here, the number of samples is larger than the number of parameters: n > d. In this case, usually X⊤X is full rank.

The most studied case is that of n > d: the problem has the solution w⋆ = (X⊤X)⁻¹X⊤y, under a full-rankness assumption on X⊤X. In the case where the problem is over-parameterized (d > n), there is a solution of similar form that has received significant attention, despite the infinite cardinality of optimal solutions. This is the so-called minimum norm solution. The optimization instance to obtain the minimum norm solution is:

 min_{w∈R^d} ‖w‖₂²  subject to  y = Xw.

Let us denote its solution as w_mn. It can be found through the method of Lagrange multipliers. Define the Lagrange function: L(w, λ) = ‖w‖₂² + λ⊤(Xw − y). The optimality conditions w.r.t. the variables w and λ are:

 ∇_w L = 2w + X⊤λ = 0  ⇒  w = −X⊤λ/2,  and  ∇_λ L = Xw − y = 0  ⇒  λ = −2(XX⊤)⁻¹y.

Substituting λ back into the expression for w, we get the minimum norm solution: w_mn = X⊤(XX⊤)⁻¹y. Any other solution has equal or larger Euclidean norm than w_mn.
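The closed form above is easy to sanity-check numerically against numpy's Moore-Penrose pseudoinverse; the sizes and seed below are our own illustrative choices:

```python
import numpy as np

# Hedged check: X^T (X X^T)^{-1} y matches the pseudoinverse solution and
# interpolates the training data in the over-parameterized regime.
rng = np.random.default_rng(1)
n, d = 5, 20                     # over-parameterized: d > n
X = rng.standard_normal((n, d))  # in general position, so X X^T is invertible
y = rng.standard_normal(n)

w_mn = X.T @ np.linalg.solve(X @ X.T, y)   # Lagrangian closed form
w_pinv = np.linalg.pinv(X) @ y             # Moore-Penrose solution

assert np.allclose(w_mn, w_pinv)           # both give the minimum-norm solution
assert np.allclose(X @ w_mn, y)            # and fit the training data exactly
```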

Observe that the two solutions, w⋆ and w_mn, differ between the two cases: in the under-parameterized case, the matrix X⊤X is well-defined (full rank) and has an inverse, while in the over-parameterized case, it is the matrix XX⊤ that is full rank. Importantly, there are differences in how we obtain these solutions in an iterative fashion. We next show how both simple and adaptive gradient descent algorithms find w⋆ for well-determined systems. This does not hold for the over-parameterized case: there are infinitely many solutions, and the question of which one they select is central in the recent literature [1, 19, 20].

## 3 Closed-form expressions for gradient descent in linear regression

Studying iterative routines in simple tasks provides intuitions on how they might perform in more complex problems, such as neural networks. Next, we will distinguish our analysis into under- and over-parameterized settings for linear regression.

### 3.1 Under-parameterized linear regression.

Here, n > d and X⊤X is assumed to be full rank. Simple gradient descent with step size η satisfies: w_{k+1} = w_k − ηX⊤(Xw_k − y). Unfolding for K iterations, assuming w_0 = 0, we get:

 w_K = (Σ_{i=1}^{K} (−1)^{i−1}·C(K,i)·η^i·(X⊤X)^{i−1}) X⊤y.

The summation in the parentheses satisfies (see Section 7.1):

 Σ_{i=1}^{K} (−1)^{i−1}·C(K,i)·η^i·(X⊤X)^{i−1} = (−X⊤X)⁻¹·((I − ηX⊤X)^K − I).

Therefore, we get the closed-form solution: w_K = (−X⊤X)⁻¹·((I − ηX⊤X)^K − I)X⊤y. In order to prove that gradient descent converges to the solution w⋆, we need to prove that:

 (I − ηX⊤X)^K − I → −I  ⇒  (I − ηX⊤X)^K → 0, for K large enough.

This is equivalent to showing that ‖(I − ηX⊤X)^K‖₂ → 0. From optimization theory [21], we need η < 2/λ_max(X⊤X) for convergence, where λ_max(·) denotes the maximum eigenvalue of the argument. Then, I − ηX⊤X has spectral norm smaller than one, i.e., ‖I − ηX⊤X‖₂ < 1. Combining the above, we make use of the following theorem.

###### Theorem 1

[Behavior of powers of a square matrix [22, 23]] Let A be a d×d matrix, and let ρ(A) denote its spectral radius. Then, there exists a sequence ε_K → 0 such that: ‖A^K‖₂ ≤ (ρ(A) + ε_K)^K.

Using the above theorem, A = I − ηX⊤X has ρ(A) < 1. Further, for sufficiently large K, ε_K has a small enough value such that ρ(A) + ε_K < 1; i.e., after some K, the bound (ρ(A) + ε_K)^K will be less than one, converging to zero for increasing K. Letting K go to infinity concludes the proof, and leads to the left inverse solution: w_K → (X⊤X)⁻¹X⊤y, as K → ∞. This is identical to the closed-form solution of well-determined linear regression.
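The closed form can also be checked against the actual iterates after finitely many steps; below it is rearranged as w_K = (X⊤X)⁻¹(I − (I − ηX⊤X)^K)X⊤y, which is the same expression with the signs distributed. Sizes, seed, and K are our own choices:

```python
import numpy as np

# Hedged check: iterated gradient descent from w_0 = 0 matches the derived
# closed form after K steps, not only in the limit.
rng = np.random.default_rng(5)
n, d = 40, 4
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
eta = 1.0 / np.linalg.norm(X, 2) ** 2
K = 25

w = np.zeros(d)
for _ in range(K):
    w -= eta * (X.T @ (X @ w - y))

A = np.eye(d) - eta * (X.T @ X)
w_closed = np.linalg.solve(
    X.T @ X, (np.eye(d) - np.linalg.matrix_power(A, K)) @ X.T @ y
)
assert np.allclose(w, w_closed)
```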

### 3.2 Over-parameterized linear regression.

For completeness, we briefly provide the analysis for the over-parameterized setting, where d > n and XX⊤ is assumed to be full rank. By inspection, unfolding the gradient descent recursion (with w_0 = 0) gives:

 w_K = X⊤(Σ_{i=1}^{K} (−1)^{i−1}·C(K,i)·η^i·(XX⊤)^{i−1}) y.

Similarly, the summation can be simplified to:

 Σ_{i=1}^{K} (−1)^{i−1}·C(K,i)·η^i·(XX⊤)^{i−1} = (−XX⊤)⁻¹·((I − ηXX⊤)^K − I),

and, therefore:

 w_K = X⊤(−XX⊤)⁻¹·((I − ηXX⊤)^K − I) y.

Under a similar assumption on the spectral norm of I − ηXX⊤ and using Theorem 1, we obtain the right inverse solution: w_K → X⊤(XX⊤)⁻¹y = w_mn, as K → ∞. Bottom line: in both cases, gradient descent converges to the left and right inverse solutions, related to the Moore-Penrose pseudoinverse.
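A quick numerical check of this claim, with gradient descent initialized at zero (sizes and seed are ours):

```python
import numpy as np

# Hedged check: in the over-parameterized case, gradient descent started at
# zero stays in the row space of X and converges to the minimum-norm solution
# X^T (X X^T)^{-1} y.
rng = np.random.default_rng(2)
n, d = 8, 40
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

eta = 1.0 / np.linalg.norm(X, 2) ** 2
w = np.zeros(d)
for _ in range(20000):
    w -= eta * (X.T @ (X @ w - y))

w_mn = X.T @ np.linalg.solve(X @ X.T, y)
assert np.allclose(w, w_mn, atol=1e-6)
```

Note that the zero initialization matters: starting outside the row space of X leaves an unchanged component that gradient descent never removes.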

## 4 Adaptive gradient methods in linear regression

Let us now focus on adaptive methods. For simplicity, we study only non-accelerated adaptive gradient descent methods, like AdaGrad, following the analysis in [1]; momentum-based schemes are left for future work. While there exists considerable work analyzing the stochastic variants of adaptive methods [12, 13, 24, 25], we concentrate on non-stochastic variants, for simplicity and ease of comparison with gradient descent. In summary, we study the recursion w_{k+1} = w_k − ηD_k∇f(w_k). E.g., in the case of AdaGrad, we have:

 D_k = diag(1 / √(Σ_{j=k−J}^{k} ∇f(w_j) ⊙ ∇f(w_j) + ε)) ≻ 0,  for some ε > 0 and window size J ≤ k.

The main ideas apply for any positive definite preconditioner. The case where D_k = D, for a constant matrix D, is deferred to the appendix (Section 7.2). Here, we focus on the case where D_k varies per iteration.

### 4.1 Under-parameterized linear regression.

When D_k is varying (Section 7.3), we end up with the following proposition (folklore); the proof is in Section 7.4:

###### Proposition 1

Consider the under-parameterized case. Assume the recursion w_{k+1} = w_k − ηD_k∇f(w_k), for positive definite matrices D_k. Then, after K iterations, w_K satisfies:

 w_K = (−X⊤X)⁻¹·(∏_{i=K−1}^{0}(I − ηX⊤XD_i) − I) X⊤y.

Using Theorem 1, we can again infer that, for sufficiently large K and for sufficiently small η such that ‖I − ηX⊤XD_i‖₂ < 1 for all i, we have: ∏_{i=K−1}^{0}(I − ηX⊤XD_i) → 0. Thus, for sufficiently large K: w_K ≈ (X⊤X)⁻¹X⊤y, which is the same as the plain gradient descent solution. Thus, in this case, under proper assumptions (which might seem stricter than those for plain gradient descent), adaptive methods have the same generalization capabilities as gradient descent.
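As a sanity check of this conclusion, an AdaGrad-style diagonal preconditioner reaches the same least-squares solution as plain gradient descent in an under-parameterized instance. This is a sketch with our own step size and a full (non-windowed) accumulator, not the exact windowed D_k of (4):

```python
import numpy as np

# Hedged check: the varying-preconditioner recursion w_{k+1} = w_k - η D_k ∇f(w_k),
# with D_k = diag(1/sqrt(accumulated squared gradients + eps)), converges to the
# left-inverse (least-squares) solution when n > d.
rng = np.random.default_rng(6)
n, d = 50, 4
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w = np.zeros(d)
g2 = np.zeros(d)                 # accumulated squared gradients
eta, eps = 0.1, 1e-8
for _ in range(20000):
    grad = X.T @ (X @ w - y)
    g2 += grad * grad
    w -= eta * grad / np.sqrt(g2 + eps)   # D_k applied element-wise

w_star, *_ = np.linalg.lstsq(X, y, rcond=None)  # left-inverse solution
assert np.allclose(w, w_star, atol=1e-4)
```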

### 4.2 Over-parameterized linear regression.

Let us focus on the case where n < d. Finding a closed form for w_K, as in the under-parameterized case, seems much trickier to achieve, despite our best efforts. Here, we follow a different path than in the previous sections.

What is the predictive power of adaptive methods within the training set?

For the first question, we look for a way to express the predictions within the training dataset, i.e., ŷ_K = Xw_K, where w_K is found by the recursion w_{k+1} = w_k − ηD_k∇f(w_k) after K updates.

###### Proposition 2

Consider the over-parameterized case. Assume the recursion w_{k+1} = w_k − ηD_k∇f(w_k), for positive definite matrices D_k. Then, after K iterations, the prediction ŷ_K satisfies:

 ŷ_K = Xw_K = −(∏_{i=K−1}^{0}(I − ηXD_iX⊤) − I) y.

The proof can be found in Section 7.5. Using Theorem 1, we observe that, for sufficiently large K and for a sufficiently small step size η, ∏_{i=K−1}^{0}(I − ηXD_iX⊤) → 0. Thus, ŷ_K → y. This further implies that Xw_K → y, as K increases; i.e., adaptive methods fit the training data, and make the correct predictions within the training dataset.
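The same style of sketch illustrates Proposition 2's message in the over-parameterized case: the preconditioned recursion still drives the training residual to zero, even though the limit point need not be the minimum norm solution. Again, the step size and the full (non-windowed) accumulator are our own assumptions:

```python
import numpy as np

# Hedged check: an AdaGrad-style varying D_k still fits the training data
# (X w ≈ y) in the over-parameterized regime, as Proposition 2 predicts.
rng = np.random.default_rng(3)
n, d = 6, 30                      # over-parameterized: d > n
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w = np.zeros(d)
g2 = np.zeros(d)                  # accumulated squared gradients
eta, eps = 0.1, 1e-8
for _ in range(20000):
    grad = X.T @ (X @ w - y)
    g2 += grad * grad
    w -= eta * grad / np.sqrt(g2 + eps)

assert np.linalg.norm(X @ w - y) < 1e-3   # training residual vanishes
```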

What is the predictive power of adaptive methods on unseen data?

We start with the counterexample in [1], where adaptive methods –with D_k taking the form of (4)– fail to find a solution that generalizes, in contrast to gradient descent methods (under assumptions).

Let us briefly describe their setting: we take the dimension d to be of the order of the number of samples n, with d = 6n; empirically, the counterexample holds for various values of d, as long as d > n. For the responses, we consider two classes {+1, −1}. For i = 1, …, n, we sample y_i = +1 with probability p, and y_i = −1 with probability 1 − p, for some p > 1/2. Given y_i, for each i, we design the i-th row of X, denoted X_i, as:

 (X_i)_j = { y_i, j = 1;  1, j = 2, 3;  1, j = 4 + 5(i−1);  0, otherwise }   if y_i = 1,
 (X_i)_j = { y_i, j = 1;  1, j = 2, 3;  1, j = 4 + 5(i−1), …, 8 + 5(i−1);  0, otherwise }   if y_i = −1.  (1)

Given this structure for X, only the first feature is indicative of the predicted class: i.e., a model that always predicts correctly is w = (1, 0, …, 0); however, we note that this is not the only model that might lead to correct predictions. The 2nd and 3rd features are the same for all samples, and the remaining features are unique to each sample (the positions of the non-zeros in X_i are unique for each i).

Given this generative model and assuming d > n, [1] shows theoretically that AdaGrad, with D_k as in (4), only predicts the positive class correctly, while plain gradient descent-based schemes perform flawlessly (predicting both positive and negative classes correctly), as long as the number of positive examples in training exceeds that of the negative ones. This shows that simple gradient descent generalizes better than adaptive methods for some simple problem instances; this further implies that such behavior might transfer to more complex cases, such as neural networks.
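For reference, the generative model of (1) can be sketched in a few lines; the function name and the 0-indexed column arithmetic (the paper's indices in (1) are 1-based) are ours:

```python
import numpy as np

# Hedged sketch of the generative model from [1]: d = 6n features, labels
# y_i = +1 w.p. p (else -1), and rows of X built as in (1).
def make_counterexample(n, p, rng):
    d = 6 * n
    y = np.where(rng.random(n) < p, 1.0, -1.0)
    X = np.zeros((n, d))
    for i in range(n):
        X[i, 0] = y[i]            # only informative feature (j = 1 in (1))
        X[i, 1] = X[i, 2] = 1.0   # shared by every sample (j = 2, 3)
        if y[i] == 1:
            X[i, 3 + 5 * i] = 1.0                 # one unique feature
        else:
            X[i, 3 + 5 * i : 8 + 5 * i] = 1.0     # five unique features
    return X, y

rng = np.random.default_rng(0)
X, y = make_counterexample(10, 0.8, rng)
assert X.shape == (10, 60)
assert np.allclose(X[:, 0], y)   # first column encodes the label
```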

### 4.3 A counterexample for the counterexample

We alter the previous counterexample by slightly changing the problem setting: at first, we reduce the margin between the two classes; the case where we increase the margin is provided in the appendix. We empirically show that gradient-descent methods fail to generalize as well as adaptive methods –with a D_k slightly different than AdaGrad's.

In particular, for the responses, we consider two classes {+c, −c}, for some 0 < c < 1; i.e., we consider a smaller margin between the two classes. (One could instead consider classes in {±1}, but the rest of the problem settings would need to be weighted accordingly; we selected to weight the classes differently in order not to drift much from the counterexample in [1].) The scalar c can take different values, and we still get the same performance, as we show in the experiments below. The rest of the problem setting is the same. As above, only the first feature is indicative of the correct class.

Given this generative model, we construct samples {X, y}, and set d = 6n, for different values of n. We compare two simple algorithms: plain gradient descent, w_{k+1} = w_k − η∇f(w_k); and the preconditioned recursion w_{k+1} = w_k − ηD_k∇f(w_k), where η is set as above, and D_k follows the rule:

 D_k = diag(1 / (Σ_{j=k−J}^{k} ∇f(w_j) ⊙ ∇f(w_j) + ε)²) ≻ 0,  for some ε > 0 and window size J ≤ k.

Observe that D_k uses the element-wise product of gradients, squared. A variant of this preconditioner is found in [25]; however, our purpose is not to recommend a particular preconditioner, but to show that there exist D_k choices that lead to better performance than the minimum norm solution. We denote by w_adam, w_ada, and w_GD the estimates of Adam, the AdaGrad variant, and simple gradient descent, respectively.

The experiment obeys the following steps: we train both gradient and adaptive gradient methods on the same training set, and then test the models on newly drawn data. We define performance in terms of the classification error: for a new sample and given w_GD and w_ada, the only features that are non-zero in both the new sample and the estimates are the first 3 entries [1, pp. 5]. This is due to the fact that, for gradient descent and given the structure in (1), only these 3 features affect the performance of gradient descent (further experiments in Section 7.7 empirically verify this consistency). Thus, the decision rules for both algorithms are:

where each rule assigns the class nearest to the model's prediction. With this example, our aim is to show that adaptive methods lead to models that generalize better than gradient descent.

Table 1 summarizes the empirical findings. In order to cover a wider range of settings, we consider several problem sizes n and set d = 6n, as dictated by [1]. We generate X as above, where instances in the positive class are generated with probability p; the cases with other values of p are provided in appendix Section 7.7, and convey the same message as Table 1.

The simulation is completed as follows: for each setting, we generate 100 different instances of {X, y}, and for each instance we compute the solutions from gradient descent, the AdaGrad variant, and Adam (RMSProp is included in the Appendix), as well as the minimum norm solution w_mn. In the appendix, we repeat the above table with the AdaGrad variant that normalizes the final solution (Table 3) before calculating the distance w.r.t. the minimum norm solution: we observed that this step neither improved nor worsened the performance, compared to the unnormalized solution. This further indicates that there is an infinite collection of solutions –with different magnitudes– that lead to better performance than plain gradient descent; thus, our findings are not a pathological example where adaptive methods happen to work better.

We record the distance ‖w − w_mn‖₂, where w represents the corresponding solution obtained by each algorithm in the comparison list. For each instance, we further generate test data, and we evaluate the performance of all models on predicting the test responses.

The proposed AdaGrad variant described above falls under the broad class of adaptive algorithms with D_k ≻ 0. However, for the counterexample in [1, pp. 5], the AdaGrad variant neither satisfies the convergence guarantees of Lemma 3.1 there, nor does it converge to the minimum norm solution, as evidenced by its norm in Table 1. To buttress our claim that the AdaGrad variant in (2) converges to a solution different than that of minimum norm (which is the case for plain gradient descent), we provide the following proposition for a specific class of problems (not the problem proposed in the counterexample on p. 5 of [1]); the proof is provided in Appendix 7.6.

###### Proposition 3

Suppose X⊤y has no zero components. Define u = sign(X⊤y), and assume there exists a scalar c such that Xu = cy. Then, when initialized at 0, the AdaGrad variant in (2) converges to the unique solution w ∝ u.

This result, combined with our experiments, indicates that the minimum norm solution does not guarantee better generalization performance in over-parameterized settings, even in the case of linear regression. Thus, it is unclear why that should be the case for deep neural networks.

A detailed analysis about the class of counter-examples is available in Section 7.7.1.

## 5 Experiments

We empirically compare two classes of algorithms in deep neural network training:

• Plain gradient descent algorithms, including mini-batch stochastic gradient descent and accelerated stochastic gradient descent with constant momentum.

• Adaptive gradient descent algorithms, including AdaGrad (and the variant above), Adam, and RMSProp.

The details of the datasets and the DNN architectures used in our experiments are given in Table 2.

### 5.1 Hyperparameter tuning

Both for adaptive and non-adaptive methods, the step size and momentum parameters are key for favorable performance, as also concluded in [1]. Default values were used for the remaining parameters. The step size was tuned over an exponentially-spaced set, while the momentum parameter was tuned over a small set of values. We observed that step sizes and momentum values smaller/larger than these sets gave worse results. Yet, we note that a better step size could exist between the values of the exponentially-spaced set. The decay models were similar to the ones used in [1]: no decay and fixed decay. We used fixed decay in the over-parameterized cases, via the StepLR implementation in PyTorch. We experimented with both the decay rate and the decay step in order to ensure fair comparisons with the results in [1]. A complete set of hyperparameters tuned over for comparison can be found in Section 7.8 in the Appendix.

### 5.2 Results

Our main observation is that, both in under- and over-parameterized cases, adaptive and non-adaptive methods converge to solutions with similar testing accuracy: the superiority of simple or adaptive methods depends on the problem/data at hand. Further, as already pointed out in [1], adaptive methods often require similar parameter tuning. Most of the experiments use readily available code from GitHub repositories. Since increasing/decreasing the batch size affects convergence [26], all the experiments were simulated with identical batch sizes. Finally, our goal is to show performance results in the purest algorithmic setups: often, our tests do not achieve state-of-the-art performance.

Overall, despite not necessarily converging to the same solution as gradient descent, adaptive methods generalize as well as their non-adaptive counterparts. In the M1 and C1-UP settings, we compute standard deviations over all Monte Carlo instances, and plot them with the learning curves (shaded colors show the one-standard-deviation bands; best viewed in electronic form). For the cases of C{1-5}-OP we show single runs due to the lack of excessive computational resources.

##### MNIST dataset and the M1 architecture.

Each experiment for M1 is simulated over 50 epochs and 10 runs, for both under- and over-parameterized settings. Both MNIST architectures consisted of two convolutional layers (the second one with dropout [27]) followed by two fully connected layers. The primary difference between the M1-OP (K parameters) and M1-UP (K parameters) architectures was the number of channels in the convolutional layers and the number of nodes in the last fully connected hidden layer.

Figure 1, left two columns, reports the results over 10 Monte-Carlo realizations. Top row corresponds to the M1-UP case; bottom row to the M1-OP case. We plot both training errors and the accuracy results on unseen data. For the M1-UP case, despite the grid search, observe that AdaGrad (and its variant) do not perform as well as the rest of the algorithms. Nevertheless, adaptive methods (such as Adam and RMSProp) perform similarly to simple SGD variants, supporting our conjecture that each algorithm requires a different configuration, but still can converge to a good local point; also that adaptive methods require the same (if not more) tuning. For the M1-OP case, SGD momentum performs less favorably compared to plain SGD, and we conjecture that this is due to non-optimal tuning. In this case, all adaptive methods perform similarly to SGD.

##### CIFAR10 dataset and the C1 architecture.

For C1, C1-UP is trained for 10 runs, while C1-OP was trained for 1 run. The under-parameterized setting is tweaked on purpose to ensure that we have fewer parameters than examples (K parameters), and slightly deviates from [28]; our generalization guarantees are in conjunction with the attained test accuracy levels. Similarly, for the C1-OP case, we implement a ResNet [29] + dropout architecture ( million parameters); the code from the following GitHub repository was used for the experiments: https://github.com/kuangliu/pytorch-cifar. Adam and RMSProp achieve better performance than their non-adaptive counterparts in both the under-parameterized and over-parameterized settings.

Figure 1, right panel, follows the same pattern as the MNIST results; it reports the results over 10 Monte-Carlo realizations. Again, we observe that the AdaGrad methods do not perform as well as the rest of the algorithms. Nevertheless, adaptive methods (such as Adam and RMSProp) perform similarly to simple SGD variants.

##### CIFAR100 and other deep architectures (C{2-5}-OP).

In this experiment, we focus only on the over-parameterized case: DNNs are usually designed over-parameterized in practice, with an ever-growing number of layers and, eventually, a larger number of parameters [30]. Due to the depth and complexity of the networks, we only perform one run for each architecture. C2-OP corresponds to PreActResNet18 from [31], C3-OP corresponds to MobileNet from [32], C4-OP is MobileNetV2 from [33], and C5-OP is GoogLeNet from [34]. The results are depicted in Figure 2. We did not perform a fine grid search over the hyper-parameters, but selected the best choices among the parameters used for the MNIST/CIFAR10 experiments. The results show only slight superiority of non-adaptive methods, but overall support our claims: the superiority depends on the problem/data at hand; also, all algorithms require fine tuning to achieve their best performance. We note that a more comprehensive comparison requires multiple runs for each network, as other hyper-parameters (such as initialization) might play a significant role in closing the gap between different algorithms.

## 6 Conclusions and Future Work

We highlight that the small superiority of non-adaptive methods on some DNN simulations is not fully understood, and needs further investigation, beyond the simple linear regression model. A preliminary analysis of regularization for over-parameterized linear regression reveals that it can act as an equalizer over the set of adaptive and non-adaptive optimization methods, i.e. force all optimizers to converge to the same solution. However, more work is needed to analyze its effect on the overall generalization guarantees both theoretically and experimentally as compared to the non-regularized versions of these algorithms.

## 7 Supplementary Material

### 7.1 Unfolding gradient descent in the under-parameterized setting

Let us unfold this recursion, assuming that w_0 = 0:

 w_1 = w_0 − η∇f(w_0) = ηX⊤y
 w_2 = w_1 − η∇f(w_1) = 2ηX⊤y − η²(X⊤X)X⊤y
 ⋮
 w_5 = w_4 − η∇f(w_4) = 5ηX⊤y − 10η²(X⊤X)X⊤y + 10η³(X⊤X)²X⊤y − 5η⁴(X⊤X)³X⊤y + η⁵(X⊤X)⁴X⊤y
 ⋮

What we observe is that:


• The coefficients follow the Pascal triangle principle and can be easily expressed through binomial coefficients.

• The step size η appears with an increasing power, as does the term X⊤X.

• There are some constant terms, X⊤ and y.

The above lead to the following generic characterization of the gradient descent recursion:

 w_K = (Σ_{i=1}^{K} (−1)^{i−1}·C(K,i)·η^i·(X⊤X)^{i−1}) X⊤y

The expression in the parentheses satisfies:

 Σ_{i=1}^{K} (−1)^{i−1}·C(K,i)·η^i·(X⊤X)^{i−1}
   = Σ_{i=1}^{K} (−1)^i·(−1)^{−1}·C(K,i)·η^i·(X⊤X)⁻¹(X⊤X)^i
   = (−X⊤X)⁻¹ · Σ_{i=1}^{K} (−1)^i·C(K,i)·η^i·(X⊤X)^i
   = (−X⊤X)⁻¹ · Σ_{i=1}^{K} C(K,i)·(−ηX⊤X)^i
   = (−X⊤X)⁻¹ · (Σ_{i=0}^{K} C(K,i)·(−ηX⊤X)^i − C(K,0)·(−ηX⊤X)⁰)
   = (−X⊤X)⁻¹ · (Σ_{i=0}^{K} C(K,i)·(−ηX⊤X)^i − I)
   = (−X⊤X)⁻¹ · (Σ_{i=0}^{K} C(K,i)·I^{K−i}·(−ηX⊤X)^i − I)

Since I and −ηX⊤X commute, we can use the binomial theorem:

 Σ_{i=0}^{K} C(K,i)·I^{K−i}·(−ηX⊤X)^i = (I − ηX⊤X)^K.

Thus, we finally get:

 Σ_{i=1}^{K} (−1)^{i−1}·C(K,i)·η^i·(X⊤X)^{i−1} = (−X⊤X)⁻¹·((I − ηX⊤X)^K − I)

### 7.2 D_k = D is a constant diagonal matrix with D ≻ 0

Here, we simplify the selection of preconditioner in adaptive gradient methods. Our purpose is to characterize their performance, and check how an adaptive (=preconditioned) algorithm performs in both under- and over-parameterized settings.

#### 7.2.1 Under-parameterized linear regression.

 w_1 = w_0 − ηD∇f(w_0) = ηDX⊤y
 w_2 = w_1 − ηD∇f(w_1) = 2ηDX⊤y − η²D(X⊤XD)X⊤y
 ⋮
 w_5 = w_4 − ηD∇f(w_4) = 5ηDX⊤y − 10η²D(X⊤XD)X⊤y + 10η³D(X⊤XD)²X⊤y − 5η⁴D(X⊤XD)³X⊤y + η⁵D(X⊤XD)⁴X⊤y,
 ⋮

leading to the following closed form solution:

 w_K = D(Σ_{i=1}^{K} (−1)^{i−1}·C(K,i)·η^i·(X⊤XD)^{i−1}) X⊤y.

Once again, the question is: under which conditions on D does the above recursion converge to the left inverse solution?

For the special case of D being a positive definite constant matrix, observe that, for full rank X⊤X, the matrix X⊤XD is also full rank, and thus invertible. We can transform the above sum, using similar reasoning as before, into the following expression:

 Σ_{i=1}^{K} (−1)^{i−1}·C(K,i)·η^i·(X⊤XD)^{i−1}
   = (−X⊤XD)⁻¹ · (Σ_{i=0}^{K} C(K,i)·(−ηX⊤XD)^i − C(K,0)·(−ηX⊤XD)⁰)
   = (−X⊤XD)⁻¹ · (Σ_{i=0}^{K} C(K,i)·(−ηX⊤XD)^i − I)
   = (−X⊤XD)⁻¹ · ((I − ηX⊤XD)^K − I)

This further transforms our recursion into:

 w_K = D(−X⊤XD)⁻¹·((I − ηX⊤XD)^K − I) X⊤y
     = DD⁻¹(−X⊤X)⁻¹·((I − ηX⊤XD)^K − I) X⊤y
     = (−X⊤X)⁻¹·((I − ηX⊤XD)^K − I) X⊤y

Using Theorem 1, we can again show that, for sufficiently large K and for a sufficiently small step size η, (I − ηX⊤XD)^K → 0, and thus (I − ηX⊤XD)^K − I → −I. Thus, for sufficiently large K:

 w_K = (−X⊤X)⁻¹·(−I)X⊤y = (X⊤X)⁻¹X⊤y,

which is the left inverse solution, as in gradient descent.

#### 7.2.2 Over-parameterized linear regression.

For the over-parameterized linear regression, we obtain a different expression by using a different kind of variable grouping in the unfolding procedure. In particular, we need to take into consideration that now XX⊤ is full rank, and thus the matrix XDX⊤ is also full rank and invertible. Going back to the main preconditioned gradient descent recursion:

 w_1 = w_0 − ηD∇f(w_0) = ηDX⊤y
 w_2 = w_1 − ηD∇f(w_1) = 2ηDX⊤y − η²DX⊤(XDX⊤)y
 ⋮
 w_5 = w_4 − ηD∇f(w_4) = 5ηDX⊤y − 10η²DX⊤(XDX⊤)y + 10η³DX⊤(XDX⊤)²y − 5η⁴DX⊤(XDX⊤)³y + η⁵DX⊤(XDX⊤)⁴y,
 ⋮

leading to the following closed form solution:

 w_K = DX⊤(Σ_{i=1}^{K} (−1)^{i−1}·C(K,i)·η^i·(XDX⊤)^{i−1}) y.

The sum can be similarly simplified as:

 Σ_{i=1}^{K} (−1)^{i−1}·C(K,i)·η^i·(XDX⊤)^{i−1} = (−XDX⊤)⁻¹·((I − ηXDX⊤)^K − I)

This further transforms our recursion into:

 w_K = DX⊤(−XDX⊤)⁻¹·((I − ηXDX⊤)^K − I) y

Using Theorem 1, we can again show that, for sufficiently large K and for a sufficiently small step size η, (I − ηXDX⊤)^K → 0, and thus (I − ηXDX⊤)^K − I → −I. Thus, for sufficiently large K:

 w_K = DX⊤(−XDX⊤)⁻¹·(−I)y = DX⊤(XDX⊤)⁻¹y ≠ w_mn,

which is not the same as the minimum norm solution, except when D = cI for some constant c > 0. This proves that preconditioned algorithms might converge to different solutions, depending on the selection of the preconditioning matrix/matrices.
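This closed form can be verified numerically: preconditioned gradient descent from zero with a constant diagonal D ≻ 0 converges to DX⊤(XDX⊤)⁻¹y, which interpolates the training data but has norm at least that of w_mn. The problem sizes, seed, and D below are our own choices:

```python
import numpy as np

# Hedged check of the limit point of w_{k+1} = w_k - η D ∇f(w_k) from w_0 = 0.
rng = np.random.default_rng(4)
n, d = 5, 25                     # over-parameterized: d > n
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
D = np.diag(rng.uniform(0.5, 2.0, size=d))   # constant diagonal preconditioner

eta = 1.0 / np.linalg.norm(X @ D @ X.T, 2)
w = np.zeros(d)
for _ in range(20000):
    w -= eta * D @ (X.T @ (X @ w - y))

w_pred = D @ X.T @ np.linalg.solve(X @ D @ X.T, y)   # derived closed form
w_mn = X.T @ np.linalg.solve(X @ X.T, y)             # minimum norm solution

assert np.allclose(w, w_pred, atol=1e-6)
assert np.allclose(X @ w, y, atol=1e-6)              # still fits the data
assert np.linalg.norm(w) >= np.linalg.norm(w_mn) - 1e-9  # min-norm is smallest
```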

### 7.3 Unfolding adaptive gradient descent with varying Dk in the under-parameterized setting

Unfolding the recursion, when D_k is varying, we get:

 w_1 = w_0 − ηD_0∇f(w_0) = ηD_0X⊤y
 w_2 = w_1 − ηD_1∇f(w_1) = η(D_0 + D_1)X⊤y − η²(D_1X⊤XD_0)X⊤y
 w_3 = w_2 − ηD_2∇f(w_2) = η(D_0 + D_1 + D_2)X⊤y − η²(D_1X⊤XD_0 + D_2X⊤X(D_0 + D_1))X⊤y + η³D_2X⊤XD_1X⊤XD_0X⊤y
 w_4 = w_3 − ηD_3∇f(w_3) = η(