# Relaxed regularization for linear inverse problems

We consider regularized least-squares problems of the form min_x ½‖Ax − b‖₂² + ℛ(Lx). Recently, Zheng et al. (2019) proposed an algorithm called Sparse Relaxed Regularized Regression (SR3) that employs a splitting strategy by introducing an auxiliary variable y and solves min_{x,y} ½‖Ax − b‖₂² + κ/2‖Lx − y‖₂² + ℛ(y). By minimizing out the variable x we obtain an equivalent problem min_y ½‖F_κy − g_κ‖₂² + ℛ(y). In our work we view the SR3 method as a way to approximately solve the regularized problem. We analyze the conditioning of the relaxed problem in general and give an expression for the SVD of F_κ as a function of κ. Furthermore, we relate the Pareto curve of the original problem to that of the relaxed problem and quantify the error incurred by relaxation in terms of κ. Finally, we propose an efficient iterative method for solving the relaxed problem with inexact inner iterations. Numerical examples illustrate the approach.


## 1 Introduction

Inverse problems are problems where a certain quantity of interest has to be determined from indirect measurements. In medicine, well-known examples include MRI [Liang2000] , CT [Herman2009], and ultrasound imaging [Besson2019] where the objective is to obtain images of the interior of the human body. In the geosciences, inverse problems arise in seismic exploration and seismology [Virieux2009], where the interest lies in exploring the elastic properties of the different layers of our planet. Other examples include tomography [Arridge2009, Natterer2001, Bertero1998], radar imaging [Borden2001], remote sensing [Tamminen2012, Nolan2002], astrophysics [Starck2016], and more recently, machine learning [Goodfellow2016].

Inverse problems are challenging for a number of reasons. There may be limited data available, or the data may be corrupted by noise. The datasets are often very large, and the underlying model generally does not uniquely determine the quantity of interest. Therefore, inverse problems often have to be regularized, meaning that prior information has to be added. They can be posed in the following way:

$$\min_x\ \tfrac12\|Ax-b\|_2^2 + \mathcal{R}(Lx), \tag{1}$$

where A ∈ ℝ^{m×n} is the linear forward operator, ℛ is the regularization term and L ∈ ℝ^{p×n} the regularization operator. The latter two encode the prior information about the solution x. In our work, we focus on the penalized formulation ℛ(Lx) = λ‖Lx‖_p^p, or, equivalently, the constrained formulation ‖Lx‖_p ≤ τ. By equivalent we mean that for every λ there is a τ such that the solutions of the two problems coincide [Aravkin2012e]. A direct solution of the problem above is generally not possible, either because a closed-form solution does not exist, or because evaluating the direct solution is too computationally expensive. Therefore, we have to resort to iterative methods to solve the problem, with most algorithms being designed for specific choices of ℛ and L.
Traditionally, p = 2, called Tikhonov regularization, is a popular choice, because the objective function is differentiable and allows for a closed-form expression of the solution of Eq. 1 in terms of A, L and b. For this class of problems, Krylov-based algorithms have proven very effective [Calvetti2000, Calvetti2003, Hansen2010, Kilmer2001, Golub1997, Zwaan2017, Hochstenbach2010, Hochstenbach2015, Kilmer2007]. These methods generally exploit the fact that a closed-form solution exists by constructing a low-dimensional subspace from which an approximate solution is extracted.
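The Tikhonov closed form is easy to state concretely. The following minimal numpy sketch (arbitrary illustrative sizes and λ; the factor 2 on λ follows the convention ½‖Ax−b‖₂² + λ‖Lx‖₂², which varies between references) solves the normal equations and checks that the gradient vanishes at the minimizer:

```python
import numpy as np

np.random.seed(0)
m, n, lam = 12, 8, 0.1
A = np.random.randn(m, n)
b = np.random.randn(m)
L = np.eye(n)                     # identity regularization operator

# closed-form minimizer of 0.5*||Ax - b||^2 + lam*||Lx||^2:
# solve (A^T A + 2*lam*L^T L) x = A^T b
x = np.linalg.solve(A.T @ A + 2 * lam * L.T @ L, A.T @ b)

# the gradient of the objective vanishes at the minimizer
grad = A.T @ (A @ x - b) + 2 * lam * L.T @ (L @ x)
print(np.linalg.norm(grad))       # ~ 0
```

Krylov methods avoid forming the normal equations for large-scale A, but the small dense case illustrates the closed form.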

The choice p = 1 has gained popularity in recent years because it gives sparse solutions while still yielding a convex objective. Sparsity is important in a number of applications, like compressed sensing [Candes2006], seismic imaging [Herrmann2008], image restoration [Rudin1994], and tomography [Hansen2011]. However, the objective is no longer differentiable and the aforementioned Krylov methods do not apply. If L = I, a proximal gradient method (sometimes referred to as Iterative Soft Thresholding – ISTA) [Daubechies2004] can be applied, iteratively updating the solution via

$$x_{k+1} = \operatorname{prox}_{\alpha_k\lambda}\!\left(x_k - \alpha_k A^T(Ax_k - b)\right),$$

where the proximal operator is the soft-thresholding operator, which can be efficiently evaluated. Generally, ISTA achieves a sub-linear rate of convergence of O(1/k) (unless A has full column rank, in which case we have a linear rate of convergence). FISTA (Fast Iterative Soft Thresholding Algorithm) [Beck2009] is a faster version of ISTA that generally achieves a sub-linear rate of O(1/k²).
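For concreteness, a minimal ISTA sketch for L = I with a fixed step α = 1/‖A^TA‖₂ (problem sizes, the sparse test signal and λ below are illustrative, not from the paper):

```python
import numpy as np

def ista(A, b, lam, iters=500):
    """ISTA for min_x 0.5*||Ax - b||^2 + lam*||x||_1."""
    alpha = 1.0 / np.linalg.norm(A, 2) ** 2        # step size 1/||A^T A||_2
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        z = x - alpha * A.T @ (A @ x - b)          # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)  # soft threshold
    return x

np.random.seed(0)
A = np.random.randn(20, 50)
x_true = np.zeros(50); x_true[[3, 17, 41]] = [1.0, -2.0, 1.5]
b = A @ x_true
x = ista(A, b, lam=0.1)
```

With this step size the objective decreases monotonically, which is the behavior the convergence-rate statements above refer to.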

If L ≠ I, the proximal operator is no longer easy to evaluate in general and FISTA may no longer be attractive. An example of this class of problems is TV regularization, where L is a discretization of the gradient, which gives blocky solutions. A popular algorithm for this class of problems is the Alternating Direction Method of Multipliers (ADMM) [Boyd2011]. ADMM solves Eq. 1 by forming the augmented Lagrangian

$$\min_{x,y,z}\ \tfrac12\|Ax-b\|_2^2 + \lambda\|y\|_p^p + z^T(Lx - y) + \tfrac{\rho}{2}\|Lx-y\|_2^2,$$

and alternately minimizing over the variables x and y, and updating the Lagrange multiplier z. The strength of ADMM is that it can closely approximate the solution of any convex sparse optimization problem. However, convergence can be slow [Boyd2011].

If p < 1, the emphasis on sparsity of the solution is stronger than for the case p = 1. However, the objective function is no longer convex, which makes the problem more difficult to solve.

Recently, a unifying algorithm was proposed that allows the efficient approximation of the solution of any problem of the form Eq. 1, called Sparse Relaxed Regularized Regression (SR3) [Zheng2019]. This algorithm makes use of a splitting strategy by introducing an auxiliary variable y and yields:

$$\min_{x,y}\ \tfrac12\|Ax-b\|_2^2 + \tfrac{\kappa}{2}\|Lx-y\|_2^2 + \mathcal{R}(y), \tag{2}$$

By minimizing out x, we obtain a new system of the form:

$$\min_y\ \tfrac12\|F_\kappa y - g_\kappa\|_2^2 + \mathcal{R}(y), \tag{3}$$

where, with H_κ = A^TA + κL^TL,

$$F_\kappa = \begin{bmatrix} \sqrt{\kappa}\,(I - \kappa LH_\kappa^{-1}L^T) \\ \kappa\, AH_\kappa^{-1}L^T \end{bmatrix}, \qquad g_\kappa = \begin{bmatrix} \sqrt{\kappa}\,LH_\kappa^{-1}A^Tb \\ (I - AH_\kappa^{-1}A^T)\,b \end{bmatrix}.$$

If L is invertible the algorithm is in standard form, and for any other L the algorithm is in general form. This method has several advantages when applied to solving inverse problems, which we highlight in the examples below.
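The equivalence behind Eq. 3 can be checked numerically. The sketch below (arbitrary small sizes, with L a first-difference operator, and assuming the block forms of F_κ and g_κ stated above) verifies that ½‖F_κy − g_κ‖₂² equals the partially minimized relaxed objective for a random y:

```python
import numpy as np

np.random.seed(1)
m, n, kappa = 15, 10, 5.0
A = np.random.randn(m, n)
b = np.random.randn(m)
L = np.diff(np.eye(n), axis=0)            # (n-1) x n first-difference operator
p = L.shape[0]

H = A.T @ A + kappa * L.T @ L             # H_kappa
Hinv = np.linalg.inv(H)

# block forms of F_kappa and g_kappa
F = np.vstack([np.sqrt(kappa) * (np.eye(p) - kappa * L @ Hinv @ L.T),
               kappa * A @ Hinv @ L.T])
g = np.concatenate([np.sqrt(kappa) * L @ Hinv @ A.T @ b,
                    b - A @ Hinv @ A.T @ b])

# compare with min_x 0.5||Ax-b||^2 + kappa/2 ||Lx-y||^2 for a random y
y = np.random.randn(p)
x = Hinv @ (A.T @ b + kappa * L.T @ y)    # minimizer over x
lhs = 0.5 * np.linalg.norm(F @ y - g) ** 2
rhs = 0.5 * np.linalg.norm(A @ x - b) ** 2 + 0.5 * kappa * np.linalg.norm(L @ x - y) ** 2
print(abs(lhs - rhs))                      # agrees to machine precision
```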

### 1.1 Motivating examples

Below we show some typical examples encountered in various areas of science to which SR3 can be applied. The problems we tackle are of the form

$$\min_x\ \tfrac12\|Ax-b\|_2^2 \quad \text{s.t.} \quad \|Lx\|_p \le \tau. \tag{4}$$

The main tasks are to solve this problem for a given value of τ and to find an appropriate value of τ. The latter is achieved by picking the corner of the Pareto curve (sometimes called the L-curve) traced out by the solutions of Eq. 4 as τ varies. Comparing a proximal gradient method to SR3, we show the residual as a function of τ, the optimal reconstruction, and the convergence history. These examples show two favourable aspects of SR3 over the conventional proximal gradient method: i) SR3 converges (much) faster for any fixed value of τ, and ii) the corners of both Pareto curves coincide, allowing us to effectively use SR3 to estimate τ.

#### Spiky deconvolution (m = n, p = 1, L = I)

Consider a deconvolution problem where A is a Toeplitz matrix that convolves the input with a bandlimited function w:

$$a_{ij} = w(t_i - t_j),$$

where the t_i denote the grid points. The results are shown in figure 1.

#### Compressed sensing (m < n, p = 1, L = I)

Here, the goal is to recover a sparse signal from compressive samples. The forward operator A is a random matrix with i.i.d. normally distributed entries. The results are shown in figure 2.

#### Total variation (m = n, p = 1, L = D)

Consider again a deconvolution problem where A is a Toeplitz matrix that convolves the input with a bandlimited function w:

$$a_{ij} = w(t_i - t_j),$$

where the t_i denote the grid points. Here, L = D is a finite-difference discretization of the first-order derivative with certain boundary conditions. The results are shown in figure 3.

### 1.2 Contributions

In this paper we set out to further analyze the SR3 method proposed in [Zheng2019] and to explain in detail the observations made in the above examples. Our contributions are:

Conditioning of F_κ for general L.

In [Zheng2019] the particular case L = I is analyzed. Using the SVD of A, the singular values of F_κ were calculated, showing a relation between the condition numbers of F_κ and A that depends on κ. In short, the result shows that a small κ improves the conditioning of F_κ, while as κ → ∞ the condition numbers coincide, because the original system is recovered. In this paper we derive the SVD of F_κ for general L and show that the singular values and the condition number of F_κ are related to the generalized singular values of the pair (A, L). As a by-product, we show that SR3 implicitly makes a standard-form transformation [Elden1982] of Eq. 1.

Approximation of the Pareto curve.

We show that the Pareto curve corresponding to the relaxed problem always underestimates the Pareto curve of the original problem and that the error is of order κ⁻¹. A by-product of this result is a better understanding of the Pareto curve for general L and an intuitive explanation of the observation that the corners of the relaxed and original Pareto curves coincide.

Inexact solves.

The SR3 algorithm consists of two steps. The first step requires the solution of a regularized least-squares problem and the second step is the application of a proximal operator. For many important regularizers, the proximal operator is easy to evaluate and comes at little cost. However, for large-scale applications, solving the least-squares system completely at each step is infeasible due to computational limitations. Therefore, we investigate how only partially solving this system affects the convergence and the solution. We propose a warm-starting approach, where at every step the solution of the previous iteration is used as an initial guess.

### 1.3 Outline

In Section 2 we analyze the operator F_κ. We derive the SVD of F_κ and analyse the limiting cases κ → 0 and κ → ∞. Our main results are a characterization of the singular values of F_κ and a proof that SR3 implicitly applies a standard-form transformation. In Section 3 we relate the Pareto curve of SR3 to the Pareto curve of the original problem and derive an error bound in terms of κ. Next, Section 4 is concerned with the implementation of SR3. We propose two ingredients that make SR3 suitable for large-scale applications. In Section 5 we conduct our numerical experiments and verify the theoretical results from Section 2. Moreover, we numerically investigate the influence of κ on the convergence rate. Finally, in Section 6 we draw our conclusions.

## 2 Analysis of SR3

In this section we analyze some of the properties of the operator F_κ. We characterize the singular values of F_κ and the limits κ → 0 and κ → ∞ for general L.

### 2.1 The Generalized Singular Value Decomposition

The central tool in our analysis is the Generalized Singular Value Decomposition (GSVD) of the matrix pair (A, L). The definition of the GSVD depends on the sizes of the matrices relative to each other. We use the definitions for the cases m ≥ n with p ≤ n, and m < n with p ≥ n, where A ∈ ℝ^{m×n} and L ∈ ℝ^{p×n}, because these correspond to the examples we use in our experiments. In the appendix we give the definition of the GSVD for all cases and note that our analysis does not depend on the relative matrix sizes.

###### Definition 1 (Gsvd)

Let A ∈ ℝ^{m×n} and L ∈ ℝ^{p×n}. The Generalized Singular Value Decomposition (GSVD) of (A, L) is given by A = UΣX, L = VΓX, where

$$\Sigma = \begin{bmatrix} \Sigma_p & 0 \\ 0 & I_{n-p} \\ 0 & 0 \end{bmatrix}, \quad \Gamma = \begin{bmatrix} \Gamma_p & 0 \end{bmatrix} \quad \text{for } m \ge n,\ p \le n,$$

and

$$\Sigma = \begin{bmatrix} 0 & \Sigma_m \end{bmatrix}, \quad \Gamma = \begin{bmatrix} I_{n-m} & 0 \\ 0 & \Gamma_m \\ 0 & 0 \end{bmatrix} \quad \text{for } m < n,\ p \ge n.$$

The matrices Σ_r and Γ_r (where r = p or r = m) are diagonal matrices satisfying Σ^TΣ + Γ^TΓ = I, X is invertible, and U and V are orthonormal. Moreover, we have the following ordering of the singular values:

$$0 \le \gamma_r \le \ldots \le \gamma_1 \le 1, \qquad 0 \le \sigma_1 \le \ldots \le \sigma_r \le 1.$$

The factors of A and L in the GSVD share properties similar to those of the SVD. The numbers of nonzero entries of Σ and Γ equal the ranks of A and L, respectively. If r_A is the rank of A and r_L is the rank of L, then the last r_A columns of U, corresponding to the nonzero σ_i, form a basis for the range of A, and the first r_L columns of V, corresponding to the nonzero γ_i, form a basis for the range of L. The first n − r_A columns of X⁻¹, corresponding to the zero σ_i, form a basis for the nullspace of A, and the last n − r_L columns of X⁻¹, corresponding to the zero γ_i, form a basis for the nullspace of L.

### 2.2 The SVD of Fκ

In this section we derive the SVD of F_κ in terms of the GSVD of (A, L).

###### Theorem 1

Let F_κ = YΛZ^T be the SVD of F_κ and let A = UΣX, L = VΓX be the GSVD of (A, L). Then

$$Y = \begin{bmatrix}
\kappa^{-1/2}\,V\tilde\Sigma_{\kappa,I}^{1/2} & \kappa\,V\tilde\Sigma_{\kappa,I}^{-1/2}\Gamma(\Sigma^T\Sigma + \kappa\Gamma^T\Gamma)^{-1}\Sigma^T \\
\kappa\,U\Sigma(\Sigma^T\Sigma + \kappa\Gamma^T\Gamma)^{-1}\Gamma^T\tilde\Sigma_{\kappa,I}^{-1/2} & -\kappa^{-1/2}\,U\tilde\Sigma_{\kappa,m}^{1/2}
\end{bmatrix}, \qquad
\Lambda = \begin{bmatrix} \tilde\Sigma_\kappa^{1/2} \\ 0 \end{bmatrix}, \qquad Z = V,$$

where Σ̃_κ is the diagonal matrix with entries κσ_i²/(σ_i² + κγ_i²), and the square root denotes the entry-wise square root. If r_L < p the diagonal matrix Σ̃_κ has zeros on the diagonal. We denote by Σ̃_{κ,I} the matrix in which those zeros have been replaced by ones, and by Σ̃_{κ,m} the analogous m×m matrix.

###### Proof

Using the GSVD of (A, L) we can write F_κ = YΛV^T, and hence F_κF_κ^T = YΛΛ^TY^T. Given the fact that Y is orthonormal and Λ is a diagonal matrix, this expression is the SVD of F_κ and we obtain the expressions for Λ and Z. To obtain Y, we first partition it into blocks Y₁₁, Y₁₂, Y₂₁, Y₂₂. We have

$$F_\kappa F_\kappa^T = Y\Lambda\Lambda^TY^T \iff
\begin{bmatrix}
\kappa(I - \kappa LH_\kappa^{-1}L^T)^2 & \kappa\sqrt{\kappa}\,(I - \kappa LH_\kappa^{-1}L^T)LH_\kappa^{-1}A^T \\
\kappa\sqrt{\kappa}\,AH_\kappa^{-1}L^T(I - \kappa LH_\kappa^{-1}L^T) & \kappa^2\,AH_\kappa^{-1}L^TLH_\kappa^{-1}A^T
\end{bmatrix}
= \begin{bmatrix} Y_{11} & Y_{12} \\ Y_{21} & Y_{22} \end{bmatrix}
\begin{bmatrix} \tilde\Sigma_\kappa & 0 \\ 0 & 0 \end{bmatrix}
\begin{bmatrix} Y_{11}^T & Y_{21}^T \\ Y_{12}^T & Y_{22}^T \end{bmatrix}
= \begin{bmatrix} Y_{11}\tilde\Sigma_\kappa Y_{11}^T & Y_{11}\tilde\Sigma_\kappa Y_{21}^T \\ Y_{21}\tilde\Sigma_\kappa Y_{11}^T & Y_{21}\tilde\Sigma_\kappa Y_{21}^T \end{bmatrix}.$$

Plugging in the GSVD gives

$$F_\kappa F_\kappa^T = \begin{bmatrix}
\kappa^{-1}V\tilde\Sigma_\kappa^2V^T & \sqrt{\kappa}\,V\tilde\Sigma_\kappa\Gamma(\Sigma^T\Sigma + \kappa\Gamma^T\Gamma)^{-1}\Sigma^TU^T \\
\sqrt{\kappa}\,U\Sigma(\Sigma^T\Sigma + \kappa\Gamma^T\Gamma)^{-1}\Gamma^T\tilde\Sigma_\kappa V^T & \kappa^2\,U\Sigma(\Sigma^T\Sigma + \kappa\Gamma^T\Gamma)^{-1}\Gamma^T\Gamma(\Sigma^T\Sigma + \kappa\Gamma^T\Gamma)^{-1}\Sigma^TU^T
\end{bmatrix}.$$

Solving for Y₁₁ gives:

$$Y_{11} = \kappa^{-1/2}\,V\tilde\Sigma_{\kappa,I}^{1/2}.$$

Using this in the upper-right part gives:

$$Y_{21} = \kappa\,U\Sigma(\Sigma^T\Sigma + \kappa\Gamma^T\Gamma)^{-1}\Gamma^T\tilde\Sigma_{\kappa,I}^{-1/2}.$$

To solve for Y₁₂ and Y₂₂, we use the orthonormality of Y, i.e. YY^T = I. The upper-left part yields

$$Y_{12} = \kappa\,V\tilde\Sigma_{\kappa,I}^{-1/2}\Gamma(\Sigma^T\Sigma + \kappa\Gamma^T\Gamma)^{-1}\Sigma^T.$$

The upper-right part yields

$$Y_{22} = -\kappa^{-1/2}\,U\tilde\Sigma_{\kappa,m}^{1/2}.$$

Note that the singular values are ordered in ascending order. We have the following corollary.

###### Corollary 1

If m ≥ n and p ≤ n, the singular values of F_κ are given by

$$\psi_i(F_\kappa) = \sqrt{\frac{\sigma_{n-i+1}^2}{\sigma_{n-i+1}^2/\kappa + \gamma_{n-i+1}^2}}.$$

If m < n and p ≥ n, the singular values of F_κ are given by

$$\psi_i(F_\kappa) = \begin{cases} \sqrt{\kappa} & \text{if } i \le p - r_L, \\[4pt] \sqrt{\dfrac{\sigma_{p-i+1}^2}{\sigma_{p-i+1}^2/\kappa + \gamma_{p-i+1}^2}} & \text{if } i > p - r_L. \end{cases}$$

The question arises whether there is a direct relation between the singular values of A and L and the generalized singular values σ_i and γ_i. The answer is no, but we do have the following result from [Hansen1989]:

###### Theorem 2 ([Hansen1989, Thm. 2.4])

Let ψ_i(A) and ψ_i(L) denote the singular values of A and L respectively and let σ_i and γ_i denote the nonzero entries of the matrices Σ and Γ respectively. Then for all i

$$\left\|\begin{bmatrix} A \\ L \end{bmatrix}^\dagger\right\|_2^{-1} \le \frac{\psi_{r-i+1}(A)}{\sigma_i} \le \left\|\begin{bmatrix} A \\ L \end{bmatrix}\right\|_2, \qquad
\left\|\begin{bmatrix} A \\ L \end{bmatrix}^\dagger\right\|_2^{-1} \le \frac{\psi_i(L)}{\gamma_i} \le \left\|\begin{bmatrix} A \\ L \end{bmatrix}\right\|_2.$$

###### Remark 1

This result shows that, if the operator A has quickly decaying singular values, the σ_i will exhibit the same behavior, see also [Hansen1998, p. 24]. This is an important result because it shows how the ill-conditioning of A transfers over to the generalized singular values. Note that if L = I the ratios σ_i/γ_i are the singular values of A. Hence, if the operator A is severely ill-posed, this ill-posedness is inherited by the operator F_κ.

### 2.3 Limiting cases

If L = I, the limit κ → ∞ yields the original system. However, if L ≠ I it is not immediately clear what happens in the limit due to the presence of the operator L. In this section we derive this limit by using the GSVD of (A, L). Given the GSVD of (A, L), the matrix F_κ and the vector g_κ are given by

$$F_\kappa = \begin{bmatrix} \sqrt{\kappa}\,V\!\left(I_p - \kappa\Gamma(\Sigma^T\Sigma + \kappa\Gamma^T\Gamma)^{-1}\Gamma^T\right)V^T \\ \kappa\,U\Sigma(\Sigma^T\Sigma + \kappa\Gamma^T\Gamma)^{-1}\Gamma^TV^T \end{bmatrix},$$

and

$$g_\kappa = \begin{bmatrix} \sqrt{\kappa}\,V\Gamma(\Sigma^T\Sigma + \kappa\Gamma^T\Gamma)^{-1}\Sigma^TU^Tb \\ U\!\left(I_m - \Sigma(\Sigma^T\Sigma + \kappa\Gamma^T\Gamma)^{-1}\Sigma^T\right)U^Tb \end{bmatrix}.$$

As κ → ∞, the first blocks of F_κ and g_κ vanish (provided L has full rank), while the second blocks tend to UΣΓ†V^T and b, respectively. Hence, as κ → ∞, SR3 solves

$$\min_y\ \tfrac12\|U\Sigma\Gamma^\dagger V^Ty - b\|_2^2 + \mathcal{R}(y). \tag{5}$$

This expression is reminiscent of the standard-form transformation applied to the system

$$\min_x\ \tfrac12\|Ax-b\|_2^2 \ \ \text{s.t.}\ \ \|Lx\|_2^2 \le \tau \qquad \text{or} \qquad \min_x\ \tfrac12\|Ax-b\|_2^2 + \lambda\|Lx\|_2^2,$$

see [Elden1982, Hansen1998]. However, the standard-form transformation is not restricted to these cases, and applies to any regularizer. The standard-form transformation makes the substitution x = L_A^†y + x_N, where

$$y = \operatorname*{arg\,min}_y\ \tfrac12\|AL_A^\dagger y - b\|_2^2 + \mathcal{R}(y), \qquad L_A^\dagger = \left(I - \left(A(I - L^\dagger L)\right)^\dagger A\right)L^\dagger,$$

and

$$x_N = \left(A(I - L^\dagger L)\right)^\dagger b.$$

The operator L_A^† is called the A-weighted pseudo-inverse. The transformation splits the solution into two parts: one part in the range of L^†, namely L_A^†y, and one part in the nullspace of L, namely x_N. The operator L_A^† makes the two parts A-orthogonal so that the system decouples. Hence, the system is transformed to an equivalent system in the operator AL_A^†. If p ≥ n and L has full rank, we have L_A^† = L^† and x_N = 0. Hence, if L is invertible, the standard form is achieved by simply applying L⁻¹.
If p < n, or if L is rank deficient, the nullspace of L has to be accounted for. The component in the nullspace can be written in terms of the GSVD as

$$x_N = X^{-1}\begin{bmatrix} 0 & 0 \\ 0 & I_{\dim(\mathcal{N}(L))} \end{bmatrix}U^Tb. \tag{6}$$

The operator L_A^† can be written in terms of the GSVD as

$$L_A^\dagger = X^{-1}\Gamma^\dagger V^T,$$

and hence the transformed system becomes

$$\min_y\ \tfrac12\|U\Sigma\Gamma^\dagger V^Ty - b\|_2^2 + \mathcal{R}(y), \tag{7}$$

which is equivalent to (5). Note that UΣΓ†V^T is the SVD of AL_A^†, with the singular values ordered in ascending order, and hence the singular values of AL_A^† are the generalized singular values σ_i/γ_i.
The solution to the original system using SR3 is given by

$$x = H_\kappa^{-1}(A^Tb + \kappa L^Ty) = H_\kappa^{-1}A^Tb + \kappa H_\kappa^{-1}L^Ty =: x_1 + x_2.$$

We will now show that, as κ → ∞, the component x₁ corresponds to the part in the nullspace of L, x_N, and the component x₂ corresponds to the part in the range of L^†, L_A^†y.
Using the GSVD, we have

$$H_\kappa^{-1} = X^{-1}(\Sigma^T\Sigma + \kappa\Gamma^T\Gamma)^{-1}X^{-T}.$$

As κ → ∞ we have

$$(\Sigma^T\Sigma + \kappa\Gamma^T\Gamma)^{-1} \to \begin{bmatrix} 0 & 0 \\ 0 & I_{\dim(\mathcal{N}(L))} \end{bmatrix}.$$

Hence,

$$H_\kappa^{-1} \to X^{-1}\begin{bmatrix} 0 & 0 \\ 0 & I_{\dim(\mathcal{N}(L))} \end{bmatrix}X^{-T}. \tag{8}$$

Recall that the last columns of X⁻¹ form a basis for the nullspace of L, and hence the limit (8) projects onto the nullspace of L. Using the GSVD of (A, L) we see that

$$\lim_{\kappa\to\infty} x_1 := \lim_{\kappa\to\infty} H_\kappa^{-1}A^Tb = X^{-1}\begin{bmatrix} 0 & 0 \\ 0 & I_{\dim(\mathcal{N}(L))} \end{bmatrix}U^Tb,$$

which is equivalent to the nullspace component from (6).

We now show that x₂ corresponds to the part in the range of L_A^†. We have

$$x_2 := \kappa H_\kappa^{-1}L^Ty = \kappa X^{-1}(\Sigma^T\Sigma + \kappa\Gamma^T\Gamma)^{-1}\Gamma^TV^Ty.$$

The nonzero elements of the diagonal matrix κ(Σ^TΣ + κΓ^TΓ)⁻¹Γ^T are

$$\frac{\gamma_i}{\sigma_i^2/\kappa + \gamma_i^2} \quad \text{if } i \le r_L,$$

and as κ → ∞ these tend to

$$\frac{1}{\gamma_i} \quad \text{if } i \le r_L.$$

Hence, as κ → ∞,

$$\kappa(\Sigma^T\Sigma + \kappa\Gamma^T\Gamma)^{-1}\Gamma^T \to \Gamma^\dagger,$$

and thus

$$\kappa H_\kappa^{-1}L^T \to X^{-1}\Gamma^\dagger V^T = L_A^\dagger.$$

The limit for the component x₂ is now given by

$$\lim_{\kappa\to\infty} x_2 = X^{-1}\Gamma^\dagger V^Ty = L_A^\dagger y,$$

where y solves

$$\min_y\ \tfrac12\|U\Sigma\Gamma^\dagger V^Ty - b\|_2^2 + \mathcal{R}(y),$$

which is equivalent to (7).
In conclusion, the relation between the κ → ∞ limit of SR3 and the standard-form transformation is given through the following two steps:

1. A substitution x = L_A^†y + x_N and a transformed system min_y ½‖AL_A^†y − b‖₂² + ℛ(y). SR3 achieves this through the auxiliary variable y and the limit κ → ∞.

2. Obtaining the solution x from y. SR3 achieves this through x = H_κ⁻¹(A^Tb + κL^Ty).

The limit κ → 0 is much easier to derive. Recall that

$$x_\kappa = H_\kappa^{-1}(A^Tb + \kappa L^Ty).$$

As κ → 0 we have H_κ → A^TA and κL^Ty → 0. Hence x_κ → A†b, which is the unregularized minimum-norm solution.

### 2.4 Relation to the standard-form transformation

We have shown that as κ → ∞ SR3 implicitly applies a standard-form transformation and that as κ → 0 the system is unregularized. The question arises what happens for finite κ. To show what happens, we rewrite the singular values of F_κ as

 ψi(Fκ)= ⎷σ2r−i+1σ2r−i+1/κ+γ2r−i+1=  ⎷σ2r−i+1/γ2r−i+1σ2r−i+1/γ2r−i+1κ+1=   ⎷ψ2i(AL†A)ψ2i(AL†A)/κ+1.

This is equivalent to equation 9 in [Zheng2019], where it was shown that if L = I,

$$\psi_i(F_\kappa) = \sqrt{\frac{\psi_i^2(A)}{\psi_i^2(A)/\kappa + 1}}.$$

This shows that SR3 effectively operates on the transformed system in AL_A^†. This leads to the following theorem.

###### Theorem 3

SR3 is equivalent to first applying a standard-form transformation and then applying SR3 to the newly formed system, i.e.

$$\min_{z,y}\ \tfrac12\|AL_A^\dagger z - b\|_2^2 + \tfrac{\kappa}{2}\|z - y\|_2^2 + \mathcal{R}(y),$$

where the solution is given by

$$x = L_A^\dagger z + x_N.$$

Note that computing L_A^† is computationally expensive. The strength of SR3 lies in the fact that it implicitly applies the standard-form transformation without this computational burden.
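The singular value relation for L = I is easy to verify numerically. The sketch below (arbitrary sizes, using the stacked block form of F_κ from the introduction with L = I) compares the singular values of F_κ with the predicted ψ_i:

```python
import numpy as np

np.random.seed(3)
m, n, kappa = 30, 20, 10.0
A = np.random.randn(m, n)

H = A.T @ A + kappa * np.eye(n)           # H_kappa with L = I
Hinv = np.linalg.inv(H)
F = np.vstack([np.sqrt(kappa) * (np.eye(n) - kappa * Hinv),
               kappa * A @ Hinv])

# predicted: psi_i = sqrt(psi_i(A)^2 / (psi_i(A)^2/kappa + 1))
sigma = np.linalg.svd(A, compute_uv=False)
psi_pred = np.sqrt(sigma**2 / (sigma**2 / kappa + 1.0))
psi = np.linalg.svd(F, compute_uv=False)
print(np.max(np.abs(np.sort(psi) - np.sort(psi_pred))))   # ~ 0
```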

## 3 Approximating the value function

In this section we quantify the distance between the Pareto curve of the original problem and the Pareto curve of the relaxed problem in terms of κ. We first describe the value function of the problem and then present our theorem.

### 3.1 Value function

The value function of an optimization problem expresses the value of the objective at the solution as a function of the other parameters. Using the standard-form transformation, we can, without loss of generality, consider the standard-form value function:

$$\phi_0(\tau) = \min_y\ \|Ay - b\|_2 \quad \text{s.t.} \quad \|y\|_p \le \tau.$$

Following [VanDenBerg2008] we obtain the following (computable) upper and lower bounds for the value function:

$$b^Tr_\tau - \tau\|A^Tr_\tau\|_q \le \phi_0(\tau)^2 \le \|r_\tau\|_2^2,$$

where y_τ is a feasible point, r_τ = b − Ay_τ is the corresponding residual, and q is the conjugate exponent of p (1/p + 1/q = 1). Moreover, by [VanDenBerg2008, Cor. 2.2] the derivative of the value function is given by

$$\phi_0'(\tau) = -\|A^Tr_\tau\|_q/\|r_\tau\|_2.$$

To gain some insight in the behaviour of the value function, we consider φ₀ and φ₀′ at τ = 0 and at the smallest τ = τ* for which the constraint is inactive:

$$\phi_0(0) = \|b\|_2, \qquad \phi_0'(0) = -\|A^Tb\|_q/\|b\|_2,$$
$$\phi_0(\tau^*) = \|(I - AA^\dagger)b\|_2, \qquad \phi_0'(\tau^*) = 0.$$

This immediately suggests that φ₀ decreases linearly near τ = 0 (the zero solution) and flattens off near τ = τ* (the unconstrained minimizer). Since φ₀ is known to be convex, its second derivative is always positive and will gradually bend the curve from decreasing to flat. How fast this happens, and whether we can expect the typical L-shape, depends on how fast the curve decreases initially. We can bound the initial slope as follows. We let b = Ax and find

$$\|A^Tb\|_q = \|A^TAx\|_q \ge C_q\|A^TAx\|_2 \ge C_q\|A^\dagger\|_2^{-2}\|x\|_2,$$

where C_q is a constant that exists due to the equivalence of norms. Furthermore,

$$\|b\|_2 = \|Ax\|_2 \le \|A\|_2\|x\|_2.$$

From this we get

$$\phi_0'(0) \le -\frac{C_q}{\kappa_2(A)\|A^\dagger\|_2},$$

with κ₂(A) the condition number of A. We thus expect a steep slope for ill-conditioned problems, giving rise to the characteristic L-shape of the curve. While this behavior is well-established for p = 2, where it can be analysed using the SVD of A [Hansen1992], this analysis gives us new insight in the behavior of the Pareto curve for ill-posed problems for general p. An example for p = 1 is shown in figure 4.

### 3.2 Relaxed value function

We now present our theorem on the distance between the Pareto curve of the original problem and the Pareto curve of the relaxed problem.

###### Theorem 4

The distance between the Pareto curve of the original problem and the Pareto curve of the relaxed problem is given by

$$(\phi_\kappa(\tau))^2 - (\phi_0(\tau))^2 = -\kappa^{-1}\|A^T(b - Ay_\kappa)\|_2^2 + O(\kappa^{-2}),$$

where y_κ is the solution of the relaxed problem.

###### Proof

Let ε = κ⁻¹. The relaxed value function can be expressed as

$$\phi_\epsilon(\tau) = \min_y\ \|F_\epsilon y - g_\epsilon\|_2 \quad \text{s.t.} \quad \|y\|_p \le \tau.$$

For small ε we can expand F_ε and g_ε and get

$$F_\epsilon = \begin{pmatrix} A - \epsilon AA^TA + O(\epsilon^2) \\ \epsilon^{1/2}A^TA + O(\epsilon^{3/2}) \end{pmatrix}, \qquad g_\epsilon = \begin{pmatrix} b - \epsilon AA^Tb + O(\epsilon^2) \\ \epsilon^{1/2}A^Tb + O(\epsilon^{3/2}) \end{pmatrix}.$$

Introduce

$$f(\epsilon) = (\phi_\epsilon(\tau))^2 = \min_{x,y}\ \|Ax - b\|_2^2 + \epsilon^{-1}\|x - y\|_2^2 \quad \text{s.t.} \quad \|y\|_p \le \tau.$$

We have f(0) = (φ₀(τ))². Furthermore,

$$f'(\epsilon) = -\epsilon^{-2}\|x_\epsilon - y_\epsilon\|_2^2,$$

where x_ε and y_ε denote the optimal x and y. With this and the mean-value theorem we find, for some η ∈ (0, ε),

$$(\phi_\epsilon(\tau))^2 - (\phi_0(\tau))^2 = \epsilon f'(\eta) \approx -\epsilon^{-1}\|x_\epsilon - y_\epsilon\|_2^2. \tag{9}$$

We conclude that the relaxed value function underestimates the original one: φ_ε(τ) ≤ φ₀(τ). For small ε we get

$$f'(\epsilon) = -\|A^T(b - Ay_\epsilon)\|_2^2 + O(\epsilon).$$

Plugging this expression into Eq. 9 gives the desired result.

This explains why the error gets smaller for large τ; for an unconstrained problem we have A^T(b − Ay) = 0. An example is shown in figure 5.

## 4 Implementation

So far we have analyzed the properties of SR3 via the operator F_κ, which was obtained by minimizing out the variable x in Eq. 2. For the implementation of SR3 it is not necessary to form this operator explicitly, as was shown in [Zheng2019]. The authors propose the following iteration for solving the relaxed problem:

$$x \leftarrow (A^TA + \kappa L^TL)^{-1}(A^Tb + \kappa L^Ty), \tag{10}$$
$$y \leftarrow \operatorname{prox}_{\alpha\mathcal{R}}\!\left(y - \alpha\kappa(y - Lx)\right), \tag{11}$$

which for the particular choice α = 1/κ simplifies to

$$x \leftarrow (A^TA + \kappa L^TL)^{-1}(A^Tb + \kappa L^Ty), \tag{12}$$
$$y \leftarrow \operatorname{prox}_{\mathcal{R}/\kappa}(Lx). \tag{13}$$

The last equation shows that for the choice α = 1/κ there is a relation between the parameters λ and κ. More specifically, the thresholding level depends on λ/κ and hence we write λ(κ). Given the optimal λ for the original problem, the corresponding value for the relaxed problem follows. Note that if we use the constrained formulation, the dependence on the stepsize is lost because the proximal operator of the indicator function is a projection, and there is no relation between τ and κ.
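The two-step iteration (12)–(13) with an ℓ1 regularizer can be sketched in dense numpy as follows (all sizes and parameter values below are illustrative, not from the paper's experiments):

```python
import numpy as np

def sr3(A, L, b, kappa, lam, iters=100):
    """SR3 iteration with alpha = 1/kappa and R = lam*||.||_1:
    alternate a regularized least-squares solve for x and a prox step for y."""
    y = np.zeros(L.shape[0])
    H = A.T @ A + kappa * L.T @ L           # fixed system matrix; factor once in practice
    Atb = A.T @ b
    for _ in range(iters):
        x = np.linalg.solve(H, Atb + kappa * L.T @ y)
        z = L @ x
        y = np.sign(z) * np.maximum(np.abs(z) - lam / kappa, 0.0)  # prox_{(1/kappa)R}
    return x, y

np.random.seed(4)
A = np.random.randn(40, 25)
x_true = np.zeros(25); x_true[[2, 11, 19]] = [1.0, -1.0, 2.0]
b = A @ x_true + 0.01 * np.random.randn(40)
x, y = sr3(A, np.eye(25), b, kappa=1.0, lam=0.5)
```

Since H does not change across iterations, a single factorization can be reused; the next section replaces this direct solve with an iterative one.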
The computational bottleneck is in the first step, which is the solution to the large-scale linear system

$$(A^TA + \kappa L^TL)\,x_k = A^Tb + \kappa L^Ty_{k-1}. \tag{14}$$

To avoid explicitly forming A^TA and L^TL, we instead solve the equivalent least-squares problem

$$\min_x\ \left\|\begin{bmatrix} A \\ \sqrt{\kappa}\,L \end{bmatrix}x - \begin{bmatrix} b \\ \sqrt{\kappa}\,y_{k-1} \end{bmatrix}\right\|_2^2, \tag{15}$$

with LSQR. We will numerically investigate how only partially solving Eq. 15 affects the convergence of SR3. This has been investigated for ADMM in [Eckstein2017, Eckstein2018, MarquesAlves2020]. The convergence of FISTA with an inexact gradient has been analyzed in [Schmidt2011]. The key message is that the error has to decrease as the iterations proceed.
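A sketch of the warm-started, inexact x-update using scipy's LSQR (scipy ≥ 1.0, which accepts an initial guess `x0`) on the stacked system of Eq. 15; dense matrices and the iteration caps below are illustrative:

```python
import numpy as np
from scipy.sparse.linalg import lsqr

def sr3_inexact(A, L, b, kappa, lam, outer=50, inner=5):
    """SR3 with warm-started, truncated LSQR solves of Eq. 15 (R = lam*||.||_1)."""
    K = np.vstack([A, np.sqrt(kappa) * L])          # stacked operator [A; sqrt(kappa) L]
    x = np.zeros(A.shape[1])
    y = np.zeros(L.shape[0])
    for _ in range(outer):
        rhs = np.concatenate([b, np.sqrt(kappa) * y])
        x = lsqr(K, rhs, iter_lim=inner, x0=x)[0]   # few inner iterations, warm start
        z = L @ x
        y = np.sign(z) * np.maximum(np.abs(z) - lam / kappa, 0.0)
    return x, y

np.random.seed(5)
A = np.random.randn(30, 20)
b = A @ np.random.randn(20)
x, y = sr3_inexact(A, np.eye(20), b, kappa=1.0, lam=0.1)
```

Passing the previous iterate as `x0` is the warm-starting strategy discussed above; in a large-scale setting K would be applied matrix-free via a LinearOperator rather than stored densely.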

We propose two additional ingredients for the implementation of SR3: warm starts and inexact solves of the system (15).