
# An optimal scheduled learning rate for a randomized Kaczmarz algorithm

We study how the learning rate affects the performance of a relaxed randomized Kaczmarz algorithm for solving A x ≈ b + ε, where A x =b is a consistent linear system and ε has independent mean zero random entries. We derive a scheduled learning rate which optimizes a bound on the expected error that is sharp in certain cases; in contrast to the exponential convergence of the standard randomized Kaczmarz algorithm, our optimized bound involves the reciprocal of the Lambert-W function of an exponential.


## 1. Introduction and main result

### 1.1. Introduction

Let A be an m×n matrix and let Ax = b be a consistent linear system of equations. Suppose that ~b is a corrupted version of b defined by

 (1) ~b = b + ε,

where ε has independent mean zero random entries. Given an initial vector x_0, we consider the relaxed Kaczmarz algorithm

 (2) x_{k+1} = x_k + α_k (~b_{i_k} − ⟨a_{i_k}, x_k⟩) a_{i_k} / ‖a_{i_k}‖²,

where α_k is the learning rate (or relaxation parameter), a_{i_k} is the i_k-th row of A, i_k is the row index used at iteration k, and ‖·‖ is the ℓ²-norm. When the rows are chosen randomly, (2) is an instance of stochastic gradient descent, see [18], whose performance in practice depends on the definition of the learning rate, see [5]. In this paper, we derive a scheduled learning rate for a randomized Kaczmarz algorithm which optimizes a bound on the expected error; our main result proves an associated convergence result, see Theorem 1.1 and Figure 1.
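To make the iteration (2) concrete, here is a minimal sketch (our illustration, not code from the paper): one relaxed Kaczmarz update, run with α_k = 1 on a small consistent system, with rows chosen randomly with probability proportional to ‖a_i‖².

```python
import numpy as np

def kaczmarz_step(xk, a, b_i, alpha):
    """One relaxed Kaczmarz update (2): move toward the hyperplane <a, x> = b_i.

    alpha = 1 projects exactly onto the hyperplane; alpha < 1 takes a partial step.
    """
    return xk + alpha * (b_i - a @ xk) / (a @ a) * a

# Tiny demo on a consistent 3x2 system; rows are chosen randomly with
# probability proportional to ||a_i||^2, the weighting used by the
# randomized Kaczmarz algorithm discussed below.
rng = np.random.default_rng(0)
A = np.array([[2.0, 1.0], [1.0, 3.0], [1.0, -1.0]])
x_true = np.array([1.0, -2.0])
b = A @ x_true

probs = np.sum(A * A, axis=1) / np.sum(A * A)
xk = np.zeros(2)
for _ in range(200):
    i = rng.choice(3, p=probs)
    xk = kaczmarz_step(xk, A[i], b[i], alpha=1.0)
print(xk)  # approaches x_true = [1, -2]
```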

### 1.2. Background

The Kaczmarz algorithm dates back to the 1937 paper by Kaczmarz [10], who considered the iteration (2) for the case α_k = 1. The algorithm was subsequently studied by many authors; in particular, in 1967, Whitney and Meany [29] established a convergence result for the relaxed Kaczmarz algorithm: if Ax = b is a consistent linear system and α_k = α for a fixed α ∈ (0, 2), then (2) converges to x, see [29]. In 1970, the Kaczmarz algorithm was rediscovered under the name Algebraic Reconstruction Technique (ART) by Gordon, Bender, and Herman [6], who were interested in applications to computational tomography (including applications to three-dimensional electron microscopy); such applications typically use the relaxed Kaczmarz algorithm with a relaxed (non-unit) learning rate, see [7]. Methods and heuristics for setting the learning rate α_k have been considered by several authors, see the 1981 book by Censor [3]; also see [2, 8, 9].

More recently, in 2009, Strohmer and Vershynin [26] established the first proof of a convergence rate for a Kaczmarz algorithm that applies to general matrices; in particular, given a consistent linear system Ax = b, they consider the iteration (2) with α_k = 1. Under the assumption that the row index i_k at iteration k is chosen randomly with probability proportional to ‖a_{i_k}‖², they prove that

 (3) E‖x_k − x‖² ≤ (1 − η)^k ‖x − x_0‖²,

where η := κ(A)^{−2}; here, κ(A) is a condition number for the matrix A defined by κ(A) := ‖A‖_F ‖A⁻¹‖, where ‖A‖_F is the Frobenius norm of A, and ‖A⁻¹‖ is the operator norm of the left inverse of A. We remark that the convergence rate in (3) is referred to as exponential convergence in [26], while in the field of numerical analysis (where it is typical to think about error on a logarithmic scale) it is referred to as linear convergence.

The result of [26] was subsequently extended by Needell [19], who considered the case of a noisy linear system: instead of having access to the right hand side b of the consistent linear system Ax = b, we are given b + γ, where the entries of γ satisfy |γ_i| ≤ δ‖a_i‖ but are otherwise arbitrary. Under these assumptions, [19] proves that the iteration (2) with α_k = 1 satisfies

 (4) E‖x_k − x‖² ≤ (1 − η)^k ‖x − x_0‖² + δ²/η,

that is, we converge until we reach some ball of radius δ/√η around the solution and then no more. Let A⁻¹ be the left inverse of A, and observe that

 (5) ‖x − A⁻¹(b + γ)‖ = ‖A⁻¹γ‖ ≤ ‖A⁻¹‖ ‖γ‖ ≤ δ ‖A⁻¹‖ ‖A‖_F = δ/√η.

Moreover, if γ is a scalar multiple of the left singular vector of A associated with the smallest singular value, and ‖γ‖ = δ‖A‖_F, then (5) holds with equality (such examples are easy to manufacture). Thus, (4) is optimal when γ is arbitrary. In this paper, we consider the case where ~b = b + ε, where ε has independent mean zero random entries: our main result shows that in this case, we break through the convergence horizon of (4) by using an optimized learning rate and many equations with independent noise, see Theorem 1.1 for a precise statement.

Modifications and extensions of the randomized Kaczmarz algorithm of [26] have been considered by many authors, see the references discussed in §4. In particular, [1] considers the convergence of a relaxed Kaczmarz algorithm, but their analysis focuses on the case of a consistent linear system. The randomized Kaczmarz algorithm of [26] can also be viewed as an instance of other machine learning methods: in particular, as an instance of coordinate descent, see [30], or of stochastic gradient descent, see [18]. Part of our motivation for studying scheduled learning rates for the randomized Kaczmarz algorithm is that it provides a model where we can start to develop a complete theoretical understanding that can be transferred to other problems. The learning rate is of essential importance in machine learning, in particular in deep learning:

“The learning rate is perhaps the most important hyperparameter…. the effective capacity of the model is highest when the learning rate is correct for the optimization problem”, [5, p. 429].

In this paper, we derive a scheduled learning rate (depending on two hyperparameters) for a randomized Kaczmarz algorithm, which optimizes a bound on the expected error, and prove an associated convergence result. In general, a learning rate is said to be scheduled if the method of changing the learning rate is determined a priori as a function of the iteration number and possibly hyperparameters (that is, the rate is non-adaptive). There are many general results about scheduled learning rates, see for example [23, 31]; the contribution of this paper is that we are able to derive a stronger convergence guarantee for the specific case of a randomized Kaczmarz algorithm consisting of the iteration (2). We remark that in practice adaptive learning rates such as Adadelta [32] or ADAM [11] are often used. We discuss the possibility of extending our analysis to adaptive learning rates and other potential extensions in §4.

### 1.3. Main result

Let A be an m×n matrix, let Ax = b be a consistent linear system of equations, and let a_i denote the i-th row of A. Suppose that ~b is defined by

 ~b = b + ε,

where ε_1, …, ε_m are independent random variables such that ε_i has mean 0 and variance σ²‖a_i‖². Given x_0, define

 (6) x_{k+1} = x_k + α_k (~b_{i_k} − ⟨a_{i_k}, x_k⟩) a_{i_k} / ‖a_{i_k}‖²,

where α_k denotes the learning rate parameter; assume that i_k is chosen from {1, …, m} ∖ {i_0, …, i_{k−1}} with probability proportional to ‖a_{i_k}‖². Let A_k denote the matrix formed by deleting the rows i_0, …, i_{k−1} from A.

###### Theorem 1.1.

In addition to the assumptions of §1.3, assume that

 (7) κ(A_k)^{−2} ≥ η, for k ≤ N.

Given η and σ, set β_0 = ‖x − x_0‖²/σ², and define α_k and β_k recursively by

 (8) α_k = ηβ_k / (ηβ_k + 1), and β_{k+1} = β_k (1 − ηα_k), for k = 0, 1, ….

If x_k is defined iteratively by (6), then

 (9) E‖x_k − x‖² ≤ σ²β_k ≤ f(k), for k ≤ N,

where

 (10) f(k) := σ² / (η W(e^{ηk+c})), for c := σ²/(η‖x − x_0‖²) − ln(η‖x − x_0‖²/σ²),

where W denotes the Lambert-W function (the inverse of the function x ↦ x eˣ).
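The schedule (8) and the closed-form bound (10) are easy to compute numerically. The sketch below (an illustration under assumed parameter values, not code from the paper) iterates the recurrence for α_k and β_k, evaluates f(k) by solving w + ln w = ηk + c for w = W(e^{ηk+c}) with Newton's method (avoiding overflow of the exponential), and checks that σ²β_k ≤ f(k), as in (9).

```python
import math

def lambert_w_of_exp(y, tol=1e-12):
    """Return W(e^y), computed by solving w + ln(w) = y with Newton's method.

    Working with y directly avoids overflowing e^y for large arguments.
    """
    w = y if y > 1.0 else math.exp(y)  # rough initial guess
    for _ in range(100):
        step = (w + math.log(w) - y) / (1.0 + 1.0 / w)
        w -= step
        if abs(step) < tol * max(1.0, w):
            break
    return w

def schedule(eta, sigma2, r0, num_steps):
    """Learning rates alpha_k and bound factors beta_k from the recurrence (8).

    eta ~ kappa(A)^{-2}, sigma2 = noise variance, r0 = ||x - x0||^2.
    """
    beta = r0 / sigma2  # beta_0 = ||x - x0||^2 / sigma^2
    alphas, betas = [], [beta]
    for _ in range(num_steps):
        alpha = eta * beta / (eta * beta + 1.0)
        beta = beta * (1.0 - eta * alpha)
        alphas.append(alpha)
        betas.append(beta)
    return alphas, betas

def f_bound(k, eta, sigma2, r0):
    """Closed-form bound (10): f(k) = sigma^2 / (eta * W(e^{eta k + c}))."""
    c = sigma2 / (eta * r0) - math.log(eta * r0 / sigma2)
    return sigma2 / (eta * lambert_w_of_exp(eta * k + c))

# Illustrative (assumed) parameter values; sigma^2 * beta_k <= f(k) should
# hold for every k, as in (9).
eta, sigma2, r0 = 0.01, 0.05**2, 100.0
alphas, betas = schedule(eta, sigma2, r0, 500)
ok = all(sigma2 * betas[k] <= f_bound(k, eta, sigma2, r0) * (1 + 1e-9)
         for k in range(501))
print(ok, alphas[0], alphas[-1])  # the schedule alpha_k decreases
```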

The proof of Theorem 1.1 is given in §2. In the following, we illustrate Theorem 1.1 with a mix of remarks, examples, and corollaries.

###### Remark 1.1 (Interpreting Theorem 1.1).

We summarize a few key observations that aid in interpreting Theorem 1.1:

1. Asymptotic behavior of f: in the limit as the noise goes to zero,

 (11) f(k) ∼ e^{−ηk} ‖x − x_0‖², as σ → 0,

and in the limit as the number of iterations goes to infinity,

 (12) f(k) ∼ σ²/(η²k), as k → ∞;

see Corollary 1.1 and 1.2, respectively, for more precise statements.

2. If A is an m×n random matrix with i.i.d. rows, then (7) can be replaced by a condition on the distribution of the rows which is independent of the iteration number k. In this case, the result (9) holds for k ≤ m, see Corollary 1.3.

3. We emphasize that i_k is chosen without replacement, see §1.3, so the maximum number of iterations N is at most the number of rows m. In practice, the algorithm can be run with restarts: after a complete pass over the data we use the final iterate to redefine x_0, and restart the algorithm (potentially using different hyperparameters to define the learning rate). In the context of machine learning, the statement of Theorem 1.1 applies to one epoch, see §4.

4. The error bound is sharp in some cases; in these cases the learning rate is optimal, see Corollary 1.4. We demonstrate this corollary numerically in Figure 2.

###### Remark 1.2 (Interpreting the plot of the relative error in Figure 2).

At first, the error decreases exponentially (linearly in the logarithmic scale of the figure) at the asymptotic rate (11), see Corollary 1.1. For large k, the error decreases like 1/k (like −ln k in the logarithmic scale of the figure) at the asymptotic rate (12), see Corollary 1.2.

###### Example 1.1 (Numerical illustration of Theorem 1.1).

Let A be an m×n matrix with s nonzero entries in each row. Assume that, for each row, these nonzero entries form an independent random vector drawn uniformly at random from the unit sphere S^{s−1}. Let x be an n-dimensional vector with independent standard normal entries, and set b = Ax. Let ~b = b + ε, where ε is an m-dimensional vector with independent mean 0, variance σ² normal entries. We run the relaxed Kaczmarz algorithm (2) using the learning rate (8) of Theorem 1.1. In particular, we set

 m = 2000, n = 100, s = 10, σ = .05, and x_0 = 0.

Using the estimates η = 1/n and ‖x − x_0‖² ≈ E‖x‖² = n, we define α_k by (8) (we justify this choice of η in Corollary 1.4). We plot the numerical relative error together with the bound on the expected relative error in Figure 2. Furthermore, to provide intuition about how the learning rate α_k varies with k, we plot α_k for various parameter values in Figure 2, keeping the remaining parameters fixed.
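A simulation in the spirit of Example 1.1 can be run as follows (our sketch, not the paper's code; the sparse-row construction and the estimates η = 1/n, β_0 = n/σ² are taken from the example). Since every row has unit norm, sampling without replacement with probability proportional to ‖a_i‖² reduces to a uniform random permutation of the rows.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, s, sigma = 2000, 100, 10, 0.05  # sizes and noise level from Example 1.1

# Each row has s nonzero entries; the nonzero part is uniform on the unit sphere.
A = np.zeros((m, n))
for i in range(m):
    support = rng.choice(n, size=s, replace=False)
    v = rng.standard_normal(s)
    A[i, support] = v / np.linalg.norm(v)

x = rng.standard_normal(n)                        # ground truth solution
b_noisy = A @ x + sigma * rng.standard_normal(m)  # ~b = b + eps

# Estimates from the example: eta = 1/n, beta_0 = E||x - x_0||^2 / sigma^2 = n / sigma^2.
eta, beta = 1.0 / n, n / sigma**2

# One epoch; all rows have unit norm, so weighted sampling without
# replacement is just a uniform random permutation.
xk = np.zeros(n)
for i in rng.permutation(m):
    alpha = eta * beta / (eta * beta + 1.0)  # scheduled learning rate (8)
    beta = beta * (1.0 - eta * alpha)
    a = A[i]
    xk = xk + alpha * (b_noisy[i] - a @ xk) * a  # ||a|| = 1 here

rel_err = np.linalg.norm(xk - x) / np.linalg.norm(x)
print(f"relative error after one epoch: {rel_err:.4f}")
```

The final relative error should fall well below the noise level σ, illustrating how the schedule breaks through the convergence horizon of (4).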

###### Example 1.2 (Continuous version of learning rate αk).

The learning rate α_k defined in (8) optimizes the error bound of Theorem 1.1, see §2.4. The scheduled learning rate depends on two parameters:

• β_0 = ‖x − x_0‖²/σ², and

• the condition number parameter η.

The result of Theorem 1.1 states that the function f defined in (10) is an upper bound for σ²β_k. From the proof of Theorem 1.1 it will be clear that this upper bound is a good approximation when η is small, see §2.5. In this case, it is illuminating to consider a continuous version of the optimal scheduled learning rate of Theorem 1.1. In particular, we define

 α(t) = ηf(t) / (ηf(t) + σ²),

where f is defined by (10). We plot the function α(t) for three different levels of noise σ, while keeping the other parameters fixed (and set by the values in Example 1.1), see Figure 3.

We also plot the function α(t) for three different values of the condition number parameter η, while keeping the other parameters fixed (and set by the values in Example 1.1), see Figure 3.

### 1.4. Corollaries of the main result

The following corollaries provide more intuition about the error bound function f, and describe cases when the error bound of Theorem 1.1 is sharp. First, we consider the case where the variance of the noise is small and the other parameters are fixed; in this case we recover the convergence rate of the standard randomized Kaczmarz algorithm with α_k = 1.

###### Corollary 1.1 (Limit as σ→0).

We have

 f(k) = e^{−ηk} ‖x − x_0‖² (1 + O(σ²/‖x − x_0‖²)), as σ → 0,

where the constant in the big-O notation depends on η and k.

The proof of Corollary 1.1 is given in §3.1. Next, we consider the convergence rate as the number of iterations goes to infinity and the other parameters are fixed.

###### Corollary 1.2 (Limit as k→∞).

We have

 (12) f(k) = (σ²/(η²k)) (1 + O(ln(k)/k)), as k → ∞,

where the constant in the big-O notation depends on σ, η, and ‖x − x_0‖².

The proof of Corollary 1.2 is given in §3.2. Informally speaking, in combination with Theorem 1.1, this corollary says that we should expect

 E‖x − x_k‖ ≲ (σ/η) (1/√k),

which agrees with the intuition that averaging k independent sources of noise should reduce the error in the ℓ²-norm by a factor of 1/√k.
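The asymptotic rate (12) can be checked numerically. The sketch below (ours, with assumed parameter values) evaluates f(k) via a Newton solver for W(e^y) and compares it with σ²/(η²k); the ratio should approach 1 as k grows.

```python
import math

def lambert_w_of_exp(y):
    # W(e^y) via Newton's method on w + ln(w) = y (no overflow for large y).
    w = y if y > 1.0 else math.exp(y)
    for _ in range(100):
        step = (w + math.log(w) - y) / (1.0 + 1.0 / w)
        w -= step
        if abs(step) < 1e-12 * max(1.0, w):
            break
    return w

eta, sigma2, r0 = 0.01, 0.05**2, 100.0  # illustrative (assumed) parameter values
c = sigma2 / (eta * r0) - math.log(eta * r0 / sigma2)

ratio = None
for k in (10**3, 10**5, 10**7):
    f_k = sigma2 / (eta * lambert_w_of_exp(eta * k + c))
    ratio = f_k / (sigma2 / (eta**2 * k))  # should approach 1 as k grows
    print(k, ratio)
```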

###### Corollary 1.3 (Matrices with i.i.d. rows).

Suppose that A is an m×n random matrix with i.i.d. rows, and let a denote an arbitrary row of A. If we run the algorithm (2) with the learning rate α_k defined by (8), and in place of the condition (7) assume that

 (13) E |⟨z, a/‖a‖⟩|² ≥ η‖z‖², ∀z ∈ ℝⁿ,

for some fixed value η > 0, then the result (9) of Theorem 1.1 holds for N = m.

The proof of Corollary 1.3 is given in §3.3. The condition of Corollary 1.3 holds if, for example, the rows of A are sampled uniformly at random from the unit sphere S^{n−1}. Indeed, in this case

 E |⟨z, a⟩|² ≥ (1/n) ‖z‖², ∀z ∈ ℝⁿ,

see [28, Lemma 3.2.3, §3.3.1]. The following corollary gives a condition under which the learning rate (8) is optimal. In particular, this corollary implies that the error bound and learning rate are optimal for the example of matrices whose rows are sampled uniformly at random from the unit sphere discussed above.
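The inequality above is easy to probe by Monte Carlo simulation; the following sketch (ours, with an assumed dimension and sample count) estimates E|⟨z, a⟩|² for rows drawn uniformly from the unit sphere and compares it with (1/n)‖z‖².

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 20, 200_000
z = rng.standard_normal(n)  # an arbitrary fixed test vector

# Rows sampled uniformly from the unit sphere S^{n-1}: normalize Gaussians.
a = rng.standard_normal((trials, n))
a /= np.linalg.norm(a, axis=1, keepdims=True)

estimate = np.mean((a @ z) ** 2)     # Monte Carlo estimate of E|<z, a>|^2
target = np.linalg.norm(z) ** 2 / n  # the value (1/n)||z||^2 from the text
print(estimate, target)
```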

###### Corollary 1.4 (Case when error bound is sharp and learning rate is optimal).

Assume that (20) in the proof of Theorem 1.1 holds with equality, that is,

 (14) E_{i_0,…,i_{k−1}} |⟨ (x_k − x)/‖x_k − x‖, a_{i_k}/‖a_{i_k}‖ ⟩|² = η.

Then,

 E‖x − x_k‖² = σ²β_k,

and the learning rate α_k defined by (8) is optimal in the sense that it minimizes the expected error over all possible choices of scheduled learning rates (learning rates that only depend on the iteration number and possibly hyperparameters). Moreover, if (13) holds with equality, then it follows that (14) holds, and hence the learning rate α_k is optimal.

The proof of Corollary 1.4 is given in §3.4. Informally speaking, the learning rate α_k defined by (8) is optimal whenever the bound of Theorem 1.1 is a good approximation of the expected error E‖x − x_k‖², and there is reason to expect this is the case in practice, see the related results for consistent systems [26, §3, Theorem 3] and [25, Theorem 1]. For additional discussion about Theorem 1.1 and its corollaries see §4.

## 2. Proof of Theorem 1.1

The proof of Theorem 1.1 is divided into five steps: in Step 1 (§2.1) we consider the relaxed Kaczmarz algorithm for the consistent linear system Ax = b. This step uses the same proof strategy as [26]; in Step 2 (§2.2), we consider the effect of additive random noise. In Step 3 (§2.3), we estimate conditional expectations. In Step 4 (§2.4), we optimize the learning rate with respect to an error bound. In Step 5 (§2.5), we show that this optimized learning rate is related to a differential equation, which can be used to establish the upper bound on the expected error.

### 2.1. Step 1: relaxed randomized Kaczmarz for consistent systems

We start by considering the consistent linear system Ax = b. Assume that x_k is given and let

 x_{k+1} := x_k + α_k (b_{i_k} − ⟨a_{i_k}, x_k⟩) a_{i_k} / ‖a_{i_k}‖²,

denote the iteration of the relaxed randomized Kaczmarz algorithm with learning rate α_k on the consistent linear system Ax = b, and let

 y_{k+1} := x_k + (b_{i_k} − ⟨a_{i_k}, x_k⟩) a_{i_k} / ‖a_{i_k}‖²,

be the projection of x_k on the affine hyperplane defined by the i_k-th equation. The points x_k, x_{k+1}, and y_{k+1} lie on the line through x_k and y_{k+1}, which is perpendicular to the affine hyperplane that contains x and y_{k+1}, see the illustration in Figure 4.

By the Pythagorean theorem, it follows that

 ‖x − x_{k+1}‖² = ‖x − x_k‖² − ‖y_{k+1} − x_k‖² + ‖y_{k+1} − x_{k+1}‖²,

and by the definition of x_{k+1} and y_{k+1} we have

 y_{k+1} − x_{k+1} = (1 − α_k)(y_{k+1} − x_k),

see Figure 4. Thus

 ‖x − x_{k+1}‖² = ‖x − x_k‖² − (2α_k − α_k²) ‖y_{k+1} − x_k‖².

Factoring out ‖x − x_k‖² from the right hand side gives

 ‖x − x_{k+1}‖² = (1 − (2α_k − α_k²) ‖y_{k+1} − x_k‖² / ‖x − x_k‖²) ‖x − x_k‖².

Since b_{i_k} = ⟨a_{i_k}, x⟩, it follows that

 ‖x_{k+1} − x‖² = (1 − (2α_k − α_k²) |⟨ (x_k − x)/‖x_k − x‖, a_{i_k}/‖a_{i_k}‖ ⟩|²) ‖x_k − x‖².

Taking the expectation conditional on i_0, …, i_{k−1} gives

 (15) E_{i_0,…,i_{k−1}} ‖x_{k+1} − x‖² = (1 − (2α_k − α_k²) η_k) ‖x_k − x‖²,

where

 η_k := E_{i_0,…,i_{k−1}} |⟨ (x_k − x)/‖x_k − x‖, a_{i_k}/‖a_{i_k}‖ ⟩|².

We delay discussion of estimating this conditional expectation until Step 3 (§2.3).

###### Remark 2.1 (Optimal learning rate for consistent linear systems).

Observe that

 (1 − η_k) ‖x_k − x‖² ≤ (1 − (2α_k − α_k²) η_k) ‖x_k − x‖²,

with equality only when α_k = 1. It follows that, for consistent linear systems of equations, the optimal way to define the learning rate is to set α_k = 1 for all k. In the following, we show that, under our noise model, defining α_k as a specific decreasing function of k is advantageous.
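The observation above can be seen numerically. The following sketch (ours, with an assumed Gaussian test problem) runs one pass of the iteration on a consistent system with α_k = 1 and with α_k = 1/2 over the same row order; the full step should yield the smaller error.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 500, 20
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)
b = A @ x  # consistent system: no noise

def run_epoch(alpha, order):
    # One pass of the relaxed Kaczmarz iteration with a fixed learning rate.
    xk = np.zeros(n)
    for i in order:
        a = A[i]
        xk = xk + alpha * (b[i] - a @ xk) / (a @ a) * a
    return np.linalg.norm(xk - x)

order = rng.permutation(m)  # same row order for both runs, for a fair comparison
err_full, err_half = run_epoch(1.0, order), run_epoch(0.5, order)
print(err_full, err_half)  # alpha = 1 contracts faster on noiseless data
```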

### 2.2. Step 2: relaxed randomized Kaczmarz for systems with noise

In this section, we redefine x_{k+1} and y_{k+1} for the case where the right hand side of the consistent linear system Ax = b is corrupted by additive random noise: ~b = b + ε. Let

 x_{k+1} := x_k + α_k (~b_{i_k} − ⟨a_{i_k}, x_k⟩) a_{i_k} / ‖a_{i_k}‖² = x_k + α_k (b_{i_k} + ε_{i_k} − ⟨a_{i_k}, x_k⟩) a_{i_k} / ‖a_{i_k}‖²,

be the iteration of the relaxed randomized Kaczmarz algorithm using ~b, and

 y_{k+1} := x_k + (b_{i_k} − ⟨a_{i_k}, x_k⟩) a_{i_k} / ‖a_{i_k}‖²,

be the projection of x_k on the affine hyperplane defined by the uncorrupted i_k-th equation. Note that both x_{k+1} and y_{k+1} differ from the previous section: x_{k+1} is corrupted by the noise term ε_{i_k}, and y_{k+1} is the projection of the previously corrupted iterate x_k onto the hyperplane defined by the uncorrupted equation. However, the following expansion still holds:

 (16) ‖x − x_{k+1}‖² = ‖x − x_k‖² − ‖y_{k+1} − x_k‖² + ‖y_{k+1} − x_{k+1}‖².

Indeed, x_{k+1} and y_{k+1} are still contained on the line through x_k and y_{k+1}, which is perpendicular to the affine hyperplane that contains x and y_{k+1}, see Figure 4. By the definition of x_{k+1} and y_{k+1} we have

 ‖y_{k+1} − x_{k+1}‖² = ‖ (1 − α_k)(y_{k+1} − x_k) − α_k (ε_{i_k}/‖a_{i_k}‖)(a_{i_k}/‖a_{i_k}‖) ‖².

Expanding the right hand side gives

 (17) ‖y_{k+1} − x_{k+1}‖² = (1 − α_k)² ‖y_{k+1} − x_k‖² + Z_k,

where

 Z_k := −2α_k(1 − α_k) (ε_{i_k}/‖a_{i_k}‖) ⟨ y_{k+1} − x_k, a_{i_k}/‖a_{i_k}‖ ⟩ + α_k² ε_{i_k}²/‖a_{i_k}‖².

By using the fact that

 ⟨ y_{k+1} − x_k, a_{i_k}/‖a_{i_k}‖ ⟩ = (b_{i_k} − ⟨a_{i_k}, x_k⟩) / ‖a_{i_k}‖,

we can rewrite Z_k as

 Z_k = −2α_k(1 − α_k) (ε_{i_k}/‖a_{i_k}‖) (b_{i_k} − ⟨a_{i_k}, x_k⟩)/‖a_{i_k}‖ + α_k² ε_{i_k}²/‖a_{i_k}‖².

Combining (16) and (17) gives

 ‖x − x_{k+1}‖² = ‖x − x_k‖² − (2α_k − α_k²) ‖y_{k+1} − x_k‖² + Z_k.

As in the analysis of the consistent linear system in Step 1 (§2.1) above, we factor out ‖x − x_k‖², use the fact that b_{i_k} = ⟨a_{i_k}, x⟩, and take the expectation conditional on i_0, …, i_{k−1} to conclude that

 (18) E_{i_0,…,i_{k−1}} ‖x_{k+1} − x‖² = (1 − (2α_k − α_k²) η_k) ‖x_k − x‖² + ζ_k,

where

 η_k := E_{i_0,…,i_{k−1}} |⟨ (x_k − x)/‖x_k − x‖, a_{i_k}/‖a_{i_k}‖ ⟩|², and ζ_k := E_{i_0,…,i_{k−1}} Z_k.

In the following section, we discuss the terms η_k and ζ_k.

### 2.3. Step 3: estimating conditional expectations

In this section, we discuss estimating the conditional expectations η_k and ζ_k. First, we discuss ζ_k, which has two terms (a linear term and a quadratic term with respect to ε_{i_k}). In particular, we have

 ζ_k = −2α_k(1 − α_k) E_{i_0,…,i_{k−1}} [ (ε_{i_k}/‖a_{i_k}‖) (b_{i_k} − ⟨a_{i_k}, x_k⟩)/‖a_{i_k}‖ ] + α_k² E_{i_0,…,i_{k−1}} [ ε_{i_k}²/‖a_{i_k}‖² ].

Recall that ε_1, …, ε_m are independent random variables such that ε_i has mean zero and variance σ²‖a_i‖². Since we assume that i_k is chosen from {1, …, m} ∖ {i_0, …, i_{k−1}} (that is, the indices are drawn without replacement) with probability proportional to ‖a_{i_k}‖², see §1.3, it follows that ε_{i_k} is independent from x_k. Hence

 (19) ζ_k = α_k² σ².

We remark that if i_k were chosen uniformly at random from {1, …, m} and we had previously selected equation i_k, say, during iteration j for some j < k, then the error ε_{i_k} in the i_k-th equation may depend on (or even be determined by) x_k; thus the assumption that rows are drawn without replacement is necessary for this term to vanish. We use the same estimate for η_k as in [26]. In particular, by [26, eq. 7] we have

 (20) η_k = E_{i_0,…,i_{k−1}} |⟨ (x_k − x)/‖x_k − x‖, a_{i_k}/‖a_{i_k}‖ ⟩|² ≥ κ(A_k)^{−2} ≥ η,

where the final inequality follows by assumption (7) in the statement of Theorem 1.1.

###### Remark 2.2 (Comparison to the case α_k = 1).

Note that when α_k = 1, the linear term

 −2α_k(1 − α_k) E_{i_0,…,i_{k−1}} (ε_{i_k}/‖a_{i_k}‖) (b_{i_k} − ⟨a_{i_k}, x_k⟩)/‖a_{i_k}‖ = 0

because 1 − α_k = 0, which geometrically is the result of an orthogonality relation, and which holds regardless of the structure of the noise. Here we consider the case α_k ≠ 1, and the linear term vanishes due to the assumption that ε_1, …, ε_m are independent mean zero random variables.

### 2.4. Step 4: optimal learning rate with respect to upper bound

In this section we derive the optimal learning rate with respect to an upper bound on the expected error. In Remark 2.1, we already considered the case σ = 0, and found that the optimal learning rate is to set α_k = 1 for all k regardless of the other parameters. Thus, in this section we assume that σ > 0. By (18), (19), and (20) we have

 (21) E_{i_0,…,i_{k−1}} ‖x_{k+1} − x‖² ≤ (1 − (2α_k − α_k²) η) ‖x_k − x‖² + α_k² σ².

Iterating this estimate and taking a full expectation gives

 E‖x_{k+1} − x‖² ≤ g(k, α) σ²,

for

 g(k, α) := ∏_{j=0}^{k} (1 − (2α_j − α_j²)η) ‖x_0 − x‖²/σ² + ∑_{j=0}^{k} α_j² ∏_{i=j+1}^{k} (1 − (2α_i − α_i²)η),

where α = (α_0, α_1, …) and we use the convention that empty products are equal to 1. This upper bound satisfies the recurrence relation

 g(k, α) = (1 − (2α_k − α_k²)η) g(k−1, α) + α_k²,

where g(k−1, α) does not depend on α_k. Setting the partial derivative of g(k, α) with respect to α_k equal to zero, and solving for α_k, gives

 (22) α_k = η g(k−1, α) / (η g(k−1, α) + 1).

Since ∂²g(k, α)/∂α_k² = 2(1 + η g(k−1, α)) > 0, the value of α_k defined by (22) does indeed minimize g(k, α) with respect to α_k. It is straightforward to verify that this argument can be iterated to conclude that the values of α_k that minimize the bound satisfy the recurrence relation:

 (23) β_0 = ‖x − x_0‖²/σ², α_k = ηβ_k/(ηβ_k + 1), and β_{k+1} = (1 − (2α_k − α_k²)η) β_k + α_k²,

for k = 0, 1, …, where β_k plays the role of g(k−1, α). Note that we can simplify (23) by observing that

 −(2α_k − α_k²) ηβ_k + α_k² = ( −(2ηβ_k(ηβ_k + 1) − η²β_k²) ηβ_k + η²β_k² ) / (ηβ_k + 1)² = −η²β_k²/(ηβ_k + 1) = −ηβ_k α_k.

In summary, we can compute the optimal learning rate with respect to the upper bound on the expected error as follows: if σ = 0, then α_k = 1 for all k. Otherwise, we define

 (24) β_0 = ‖x − x_0‖²/σ², α_k = ηβ_k/(ηβ_k + 1), and β_{k+1} = β_k(1 − ηα_k),

for k = 0, 1, …. In the following section we study the connection between this recursive formula and a differential equation.
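The optimality of the recursion can be probed numerically. The sketch below (ours, with assumed values of η and β_0) builds the schedule (24), evaluates the bound g(k, α) by its recurrence, and checks both that the final bound equals β_K and that perturbing any single α_j increases the bound.

```python
def g_bound(alphas, eta, beta0):
    # The upper bound g(k, alpha) on E||x_{k+1} - x||^2 / sigma^2, via its recurrence.
    g = beta0
    for a in alphas:
        g = (1.0 - (2.0 * a - a * a) * eta) * g + a * a
    return g

eta, beta0, K = 0.05, 400.0, 50  # illustrative (assumed) parameter values

# Build the schedule from the recurrence (24).
alphas, beta = [], beta0
for _ in range(K):
    a = eta * beta / (eta * beta + 1.0)
    beta *= 1.0 - eta * a
    alphas.append(a)

g_opt = g_bound(alphas, eta, beta0)

# Perturbing any single alpha_j in either direction should increase the bound.
perturbed = []
for j in range(K):
    for d in (-0.05, 0.05):
        trial = list(alphas)
        trial[j] += d
        perturbed.append(g_bound(trial, eta, beta0))
print(g_opt, min(perturbed))
```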

### 2.5. Step 5: relation to differential equation

In this section, we derive a closed form upper bound for β_k. It follows from (24) that

 (ηβ_{k+1} − ηβ_k)/η = −(ηβ_k)(ηβ_k)/(ηβ_k + 1).

Making the substitution u_k = ηβ_k gives the finite difference equation

 (u_{k+1} − u_k)/η = −u_k²/(u_k + 1),

which can be interpreted as one step of the forward Euler method (with step size η) for the ordinary differential equation

 (25) u̇ = −u²/(u + 1),

where u = u(t) and u_k ≈ u(ηk). It is straightforward to verify that the solution of this differential equation is

 (26) u(t) = 1/W(e^{t+c}),

where W is the Lambert-W function (the inverse of the function x ↦ x eˣ) and c is determined by the initial condition; in particular, if u(0) = u_0, then

 (27) c = 1/u_0 − ln(u_0).

We claim that u is a convex function when u_0 > 0. It suffices to check that ü ≥ 0. Direct calculation gives

 (28) ü = (u⁴ + 2u³)/(u + 1)³.

Observe that u cannot change sign because u̇ = 0 when u = 0. Thus, (28) is always nonnegative when u_0 > 0, as was to be shown. Since the forward Euler method is a lower bound for convex functions, it follows from (26) and (27) that

 β_k ≤ 1/(η W(e^{ηk+c})), for c := 1/(ηβ_0) − ln(ηβ_0).

Thus if we set

 (29) f(k) := σ²/(η W(e^{ηk+c})), for c := σ²/(η‖x − x_0‖²) − ln(η‖x − x_0‖²/σ²),

it follows that

 (30) E‖x − x_k‖² ≤ σ²β_k ≤ (σ²/η) u(ηk) = f(k).

This completes the proof of Theorem 1.1.
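The closed form (26)–(27) can be verified numerically. The sketch below (ours, with an assumed initial value u_0 = 5) checks the initial condition and that the derivative of (26) matches the right hand side of (25), using a centered finite difference.

```python
import math

def w_of_exp(y):
    # W(e^y): solve w + ln(w) = y by Newton's method.
    w = y if y > 1.0 else math.exp(y)
    for _ in range(100):
        step = (w + math.log(w) - y) / (1.0 + 1.0 / w)
        w -= step
        if abs(step) < 1e-14 * max(1.0, w):
            break
    return w

u0 = 5.0                     # illustrative (assumed) initial condition u(0) = u_0 > 0
c = 1.0 / u0 - math.log(u0)  # the constant (27)

def u(t):
    return 1.0 / w_of_exp(t + c)  # the claimed solution (26)

# Check the initial condition and that (26) satisfies the ODE (25),
# using a centered finite difference for the time derivative.
assert abs(u(0.0) - u0) < 1e-9
h = 1e-6
max_resid = 0.0
for t in (0.0, 1.0, 5.0, 20.0):
    du = (u(t + h) - u(t - h)) / (2.0 * h)
    rhs = -u(t) ** 2 / (u(t) + 1.0)
    max_resid = max(max_resid, abs(du - rhs))
print(max_resid)  # small residual: (26)-(27) solve (25)
```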

## 3. Proof of Corollaries

### 3.1. Proof of Corollary 1.1

By the definition of f and c, see (29), we have

 f(k) = (σ²/η) · 1/W( (σ²/(η‖x − x_0‖²)) e^{ηk} e^{σ²/(η‖x − x_0‖²)} ).

The Lambert-W function has the Taylor series

 W(x) = ∑_{n=1}^{∞} (−1)^{n−1} (n^{n−2}/(n−1)!) xⁿ, for |x| < 1/e,

see for example [21, eq. 4.13.5]; in particular, W(x) = x + O(x²) as x → 0. Thus,

 f(k) = (σ²/η) ( (σ²/(η‖x − x_0‖²)) e^{ηk} e^{σ²/(η‖x − x_0‖²)} )^{−1} (1 + O(σ²/‖x − x_0‖²)).

Canceling terms and using the fact that e^{−σ²/(η‖x − x_0‖²)} = 1 + O(σ²/‖x − x_0‖²) gives

 f(k) = e^{−ηk} ‖x − x_0‖² (1 + O(σ²/‖x − x_0‖²)), as σ → 0,

where the constant in the big-O notation depends on η and k, as was to be shown.

### 3.2. Proof of Corollary 1.2

By the definition of f and c, see (29), we have

 f(k) = (σ²/η) · 1/W( (σ²/(η‖x − x_0‖²)) e^{ηk} e^{σ²/(η‖x − x_0‖²)} ).

The Lambert-W function has the asymptotic expansion

 W(e^ξ) = ξ − ln(ξ) + O(ln(ξ)/ξ), as ξ → +∞,

see [21, eq. 4.13.10]. It follows that

 f(k) = (σ²/(η²k)) (1 + O(ln(k)/k)), as k → ∞,

where the constant in the big-O notation depends on σ, η, and ‖x − x_0‖², as was to be shown.