Relaxed regularization for linear inverse problems

06/26/2020 ∙ by Nick Luiken, et al. ∙ Utrecht University

We consider regularized least-squares problems of the form min_x 1/2‖Ax - b‖_2^2 + ℛ(Lx). Recently, Zheng et al., 2019, proposed an algorithm called Sparse Relaxed Regularized Regression (SR3) that employs a splitting strategy by introducing an auxiliary variable y and solves min_{x,y} 1/2‖Ax - b‖_2^2 + κ/2‖Lx - y‖_2^2 + ℛ(y). By minimizing out the variable x we obtain an equivalent system min_y 1/2‖F_κ y - g_κ‖_2^2 + ℛ(y). In our work we view the SR3 method as a way to approximately solve the regularized problem. We analyze the conditioning of the relaxed problem in general and give an expression for the SVD of F_κ as a function of κ. Furthermore, we relate the Pareto curve of the original problem to that of the relaxed problem and we quantify the error incurred by relaxation in terms of κ. Finally, we propose an efficient iterative method for solving the relaxed problem with inexact inner iterations. Numerical examples illustrate the approach.


1 Introduction

Inverse problems are problems where a certain quantity of interest has to be determined from indirect measurements. In medicine, well-known examples include MRI [Liang2000] , CT [Herman2009], and ultrasound imaging [Besson2019] where the objective is to obtain images of the interior of the human body. In the geosciences, inverse problems arise in seismic exploration and seismology [Virieux2009], where the interest lies in exploring the elastic properties of the different layers of our planet. Other examples include tomography [Arridge2009, Natterer2001, Bertero1998], radar imaging [Borden2001], remote sensing [Tamminen2012, Nolan2002], astrophysics [Starck2016], and more recently, machine learning [Goodfellow2016].

Inverse problems are challenging for a number of reasons. There may be limited data available, or the data may be corrupted by noise. The datasets are generally very large, and the underlying model generally does not determine the quantity of interest uniquely or stably. Therefore, inverse problems often have to be regularized, meaning that prior information has to be added. They can be posed in the following way:

(1)

where A is the linear forward operator, ℛ is the regularization term and L is the regularization operator. The latter two encode the prior information about the unknown x. In our work, we focus on , or, equivalently, . By equivalent we mean that for every there is a such that the solutions of the two problems coincide [Aravkin2012e]. A direct solution to the problem above is generally not possible, either because a closed-form solution does not exist, or because evaluating the direct solution is too computationally expensive. Therefore, we have to resort to iterative methods, with most algorithms being designed for specific choices of and .
Traditionally, Tikhonov regularization, where ℛ is a squared 2-norm, is a popular choice, because the objective function is differentiable and admits a closed-form expression for the solution of Eq. 1 in terms of A, L and b. For this class of problems, Krylov-based algorithms have proven very effective [Calvetti2000, Calvetti2003, Hansen2010, Kilmer2001, Golub1997, Zwaan2017, Hochstenbach2010, Hochstenbach2015, Kilmer2007]. These methods generally exploit the fact that a closed-form solution exists by constructing a low-dimensional subspace from which an approximate solution is extracted.

The choice of a 1-norm penalty, ℛ = λ‖·‖_1, has gained popularity in recent years because it gives sparse solutions while still yielding a convex objective. Sparsity is important in a number of applications, like compressed sensing [Candes2006], seismic imaging [Herrmann2008], image restoration [Rudin1994], and tomography [Hansen2011]. However, the objective is no longer differentiable and the aforementioned Krylov methods do not apply. If L = I, a proximal gradient method (sometimes referred to as Iterative Soft Thresholding, ISTA) [Daubechies2004] can be applied, iteratively updating the solution via

where the proximal operator is the soft-thresholding operator, which can be efficiently evaluated. Generally, ISTA achieves a sublinear convergence rate of O(1/k) (unless A has full column rank, in which case the rate is linear). FISTA (Fast Iterative Soft Thresholding Algorithm) [Beck2009] is a faster variant of ISTA that generally achieves a sublinear rate of O(1/k^2).
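To make the iteration concrete, here is a minimal NumPy sketch of ISTA/FISTA for the case L = I; the step size 1/‖A‖_2^2 and the momentum rule follow the standard descriptions in [Daubechies2004, Beck2009], while the function and variable names are our own:

    import numpy as np

    def soft_threshold(z, t):
        # Proximal operator of t*||.||_1 (soft thresholding).
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def fista(A, b, lam, n_iter=500, accelerate=True):
        # (F)ISTA for min_x 0.5*||Ax - b||_2^2 + lam*||x||_1.
        step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/Lip, with Lip = ||A||_2^2
        x = np.zeros(A.shape[1])
        z, t = x.copy(), 1.0
        for _ in range(n_iter):
            x_new = soft_threshold(z - step * A.T @ (A @ z - b), step * lam)
            if accelerate:                        # FISTA momentum [Beck2009]
                t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t ** 2))
                z = x_new + ((t - 1.0) / t_new) * (x_new - x)
                t = t_new
            else:                                 # plain ISTA
                z = x_new
            x = x_new
        return x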

If L ≠ I, the proximal operator is no longer easy to evaluate in general and FISTA may no longer be attractive. An example of this class of problems is TV regularization, where L is a discretization of the gradient, which yields blocky solutions. A popular algorithm for this class of problems is the Alternating Direction Method of Multipliers, ADMM [Boyd2011]. ADMM solves Eq. 1 by forming the augmented Lagrangian

and alternately minimizing over the primal variables and updating the Lagrange multiplier. The strength of ADMM is that it can closely approximate the solution of any convex sparse optimization problem. However, convergence can be slow [Boyd2011].
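For illustration, a textbook-style ADMM sketch for the generalized problem min_x 1/2‖Ax - b‖_2^2 + λ‖Lx‖_1 is given below (following the scaled form in [Boyd2011]; the penalty parameter rho, the fixed iteration count, and the dense linear solve are illustrative choices, not the paper's implementation):

    import numpy as np

    def admm_generalized_lasso(A, b, L, lam, rho=1.0, n_iter=200):
        # ADMM for min_x 0.5*||Ax - b||_2^2 + lam*||Lx||_1,
        # split as min_{x,z} 0.5*||Ax - b||^2 + lam*||z||_1  s.t.  Lx = z.
        z = np.zeros(L.shape[0])
        u = np.zeros(L.shape[0])              # scaled dual variable
        AtA, Atb, LtL = A.T @ A, A.T @ b, L.T @ L
        for _ in range(n_iter):
            # x-update: minimize the augmented Lagrangian in x (a linear solve)
            x = np.linalg.solve(AtA + rho * LtL, Atb + rho * L.T @ (z - u))
            # z-update: proximal step (soft thresholding)
            Lx = L @ x
            z = np.sign(Lx + u) * np.maximum(np.abs(Lx + u) - lam / rho, 0.0)
            # dual update
            u = u + Lx - z
        return x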

If , the emphasis on sparsity of the solution is stronger than for the case . However, the objective function is no longer convex which makes it more difficult to solve.

Recently, a unifying algorithm was proposed that allows the efficient approximation of the solution of any problem of the form Eq. 1, called Sparse Relaxed Regularized Regression (SR3) [Zheng2019]. This algorithm makes use of a splitting strategy by introducing an auxiliary variable y and yields:

(2)

By minimizing out x, we obtain a new system of the form:

(3)

where and , . If L is invertible the algorithm is in standard form, and for any other L the algorithm is in general form. This method has several advantages when applied to solving inverse problems, which we highlight in the examples below.
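A minimal sketch of the resulting alternating scheme, specialized to ℛ = λ‖·‖_1 for concreteness: the x-update is an exact regularized least-squares solve and the y-update a proximal (soft-thresholding) step. This is our own illustrative implementation of the splitting, using dense solves for brevity; the update rules proposed in [Zheng2019] are recalled in Section 4.

    import numpy as np

    def sr3(A, b, L, lam, kappa, n_iter=100):
        # Alternating minimization of
        #   0.5*||Ax - b||^2 + 0.5*kappa*||Lx - y||^2 + lam*||y||_1
        # over x (linear least-squares solve) and y (soft thresholding).
        y = np.zeros(L.shape[0])
        AtA, Atb, LtL = A.T @ A, A.T @ b, L.T @ L
        H = AtA + kappa * LtL          # normal-equations matrix of the x-subproblem
        for _ in range(n_iter):
            x = np.linalg.solve(H, Atb + kappa * L.T @ y)
            Lx = L @ x
            y = np.sign(Lx) * np.maximum(np.abs(Lx) - lam / kappa, 0.0)
        return x, y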

1.1 Motivating examples

Below we show some typical examples encountered in various areas of science to which SR3 can be applied. The problems we tackle are of the form

(4)

The main tasks are to solve this for a given value of and to find an appropriate value of . The latter is achieved by picking the corner of the Pareto curve (sometimes called the L-curve), where solves Eq. 4. Comparing a proximal gradient method to SR3, we show the residual as a function of , the optimal reconstruction, and the convergence history. These examples show two favourable aspects of SR3 over the conventional proximal gradient method: i) SR3 converges (much) faster for any fixed value of , and ii) the corners of both Pareto curves coincide, allowing us to effectively use SR3 to estimate .
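A rough sketch of how such a Pareto (L-)curve can be traced numerically: sweep the regularization parameter, solve the problem for each value, and record the misfit and regularizer value; the corner is then located on the resulting curve. The solver below is a placeholder for any of the methods discussed above, and the parameter grid is an illustrative choice.

    import numpy as np

    def pareto_curve(solve, A, b, L, lambdas):
        # For each lambda, solve the regularized problem and record
        # (||Ax - b||_2, ||Lx||_1); the corner of this curve is used
        # to pick an appropriate lambda.
        points = []
        for lam in lambdas:
            x = solve(A, b, L, lam)   # e.g. a wrapper around one of the sketches above
            points.append((np.linalg.norm(A @ x - b), np.linalg.norm(L @ x, 1)))
        return np.array(points)

    # Example: lambdas = np.logspace(-4, 1, 30); the corner can then be
    # located, e.g., at the point of maximum curvature of the log-log curve.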

Spiky deconvolution (, , )

Consider a deconvolution problem where A is a Toeplitz matrix that convolves the input with a bandlimited function;

where and . We take , and . The results are shown in figure 1.
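A sketch of how such a test problem can be set up; the Gaussian kernel width, grid size, spike locations and noise level are our own illustrative choices, not the settings used for Figure 1:

    import numpy as np
    from scipy.linalg import toeplitz

    n = 256
    g = np.exp(-0.5 * (np.arange(n) / 3.0) ** 2)   # one-sided Gaussian taper
    col = g / (2 * g.sum() - g[0])                 # normalize the symmetric kernel
    A = toeplitz(col)                              # symmetric Toeplitz convolution matrix

    x_true = np.zeros(n)                           # spiky ground truth
    x_true[[30, 80, 90, 160, 220]] = [1.0, -0.7, 0.5, 1.2, -0.9]

    rng = np.random.default_rng(0)
    b = A @ x_true + 0.01 * rng.standard_normal(n)   # noisy data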

Figure 1: Spiky deconvolution example. The left figure shows the Pareto curve, the middle figure shows the solution, and the right figure shows the error as a function of the number of iterations. The grey line in the middle figure shows the minimum norm solution.

Compressed sensing (, , )

Here, the goal is to recover a sparse signal from compressive samples. The forward operator A is a random matrix with i.i.d. normally distributed entries. We take and . The results are shown in figure 2.
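A sketch of a corresponding synthetic setup; the dimensions, sparsity level and noise level are illustrative assumptions rather than the settings of Figure 2:

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, k = 50, 256, 5                           # few measurements, sparse signal
    A = rng.standard_normal((m, n)) / np.sqrt(m)   # i.i.d. Gaussian forward operator
    x_true = np.zeros(n)
    x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
    b = A @ x_true + 0.01 * rng.standard_normal(m)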

Figure 2: Compressed sensing. A signal is reconstructed from very few samples, which requires sparse regularization. The left figure shows the Pareto curve, the middle figure shows the solution, and the right figure shows the error as a function of the number of iterations. The grey line in the middle figure shows the minimum norm solution.

Total variation (, , )

Consider a deconvolution problem where A is a Toeplitz matrix that convolves the input with a bandlimited function;

where and . L is a finite-difference discretization of the first-order derivative with certain boundary conditions. We take , and . The results are shown in figure 3.
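A sketch of a first-order finite-difference operator L of this kind (here with the simplest choice of boundary handling, yielding an (n-1) × n matrix; the paper's exact boundary conditions are not specified above):

    import numpy as np

    def first_difference(n):
        # (n-1) x n forward-difference matrix: (Lx)_i = x_{i+1} - x_i.
        L = np.zeros((n - 1, n))
        idx = np.arange(n - 1)
        L[idx, idx] = -1.0
        L[idx, idx + 1] = 1.0
        return L

    L = first_difference(256)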

Figure 3: Total variation example. Here, the solution has a blocky structure. The left figure shows the Pareto curve, the middle figure shows the solution, and the right figure shows the error as a function of the number of iterations. The grey line in the middle figure shows the minimum norm solution.

1.2 Contributions

In this paper we set out to further analyze the SR3 method proposed in [Zheng2019] and to examine in detail the observations made in the examples above. Our contributions are:

Conditioning of for general .

In [Zheng2019] the particular case L = I is analyzed. Using the SVD of A, the singular values of F_κ were calculated, showing a relation, depending on κ, between the condition numbers of F_κ and A. In short, the result shows that a small κ improves the conditioning of F_κ, while as κ→∞ the condition numbers coincide, because the original system is obtained. In this paper we derive the SVD of F_κ for general L and show that the singular values and the condition number of F_κ are related to the generalized singular values of the pair (A, L). As a by-product, we show that SR3 implicitly makes a standard-form transformation [Elden1982] of Eq. 1.

Approximation of the Pareto-curve.

We show that the Pareto curve corresponding to the relaxed problem always underestimates the Pareto curve of the original problem and that the error is of order . A by-product of this result is a better understanding of the Pareto curve for general and an intuitive explanation of the observation that the corners of the relaxed and original Pareto curves coincide.

Inexact solves.

The SR3 algorithm consists of two steps. The first step requires the solution of a regularized least-squares problem and the second step is the application of the proximal operator. For many important regularizers, the proximal operator is easy to evaluate and comes at little cost. For large-scale applications, however, solving the least-squares system completely at each step is prohibitively expensive. Therefore, we investigate how only partially solving this system affects the convergence and the solution. We propose a warm-starting approach, where at every step the solution of the previous iteration is used as an initial guess.

1.3 Outline

In Section 2 we analyze the operator . We derive the SVD of and analyse the limiting cases and . Our main results are a characterization of the singular values of and showing that SR3 implicitly applies a standard-form transformation. In Section 3 we relate the Pareto curve of SR3 to the Pareto curve of the original problem and derive an error bound in terms of . Next, Section 4 is concerned with the implementation of SR3. We propose two ingredients that make SR3 suitable for large-scale applications. In Section 5 we conduct our numerical experiments and verify the theoretical results from Section 2. Moreover, we numerically investigate the influence of on the convergence rate. Finally, in Section 6 we draw our conclusions.

2 Analysis of SR3

In this section we analyze some of the properties of the operator . We will characterize the singular values of and the limits and for general .

2.1 The Generalized Singular Value Decomposition

The central tool in our analysis is the Generalized Singular Value Decomposition (GSVD) of . The definition of the GSVD depends on the sizes of the matrices and on their dimensions relative to each other. We use the definitions for the case that corresponds to the examples in our experiments. In the appendix we give the definition of the GSVD for all cases and note that our analysis is independent of the relative matrix sizes.

Definition 1 (Gsvd)

Let and . The Generalized Singular Value Decomposition (GSVD) of is given by , , where

and

The matrices and (where or ) are diagonal matrices satisfying , is invertible and and are orthonormal. Moreover, we have the following ordering of the singular values:

The decompositions of and in the GSVD share properties similar to those of the SVD. The numbers of nonzero entries of and equal the ranks of and , respectively. If is the rank of and is the rank of , then the last columns, corresponding to , of form a basis for the range of , and the first columns, corresponding to , of form a basis for the range of . The first columns, corresponding to , of form a basis for the nullspace of , and the last columns, corresponding to , of form a basis for the nullspace of .

2.2 The SVD of

In this section we derive the SVD of in terms of the GSVD of .

Theorem 1

Let be the SVD of . Let be the GSVD of . Then

where , if and if , and the square root denotes the entrywise square root. If the diagonal matrix will have zeros on the diagonal. We denote by the matrix in which the zeros have been replaced by ones.

Proof

Using the GSVD of we have and hence . Given the fact that is orthonormal and is a diagonal matrix the above expression is the SVD of and we obtain the expressions for and . To obtain , we first partition . We have

Plugging in the GSVD gives

Solving for gives:

Using this in the upper right part gives:

To solve for and , we use

The upper left part yields

The upper right part yields

Note that the singular values are ordered in ascending order. We have the following corollary.

Corollary 1

If and the singular values of are given by

If and the singular values of are given by

The question arises whether there is a direct relation between the singular values of and the . The answer is no; we do, however, have the following result from [Hansen1989]:

Theorem 2 ([Hansen1989, Thm. 2.4])

Let and denote the singular values of and respectively and let and denote the nonzero entries of the matrices and respectively. Then for all

Remark 1

This result shows that, if the operator has quickly decaying singular values, the will have the same behavior; see also [Hansen1998, p. 24]. This is an important result because it shows how the ill-conditioning of transfers over to . Note that if we have and the singular values of . Hence, if the operator is severely ill-posed, this ill-posedness is inherited by the operator .

2.3 Limiting cases

If the limit κ→∞ yields the original system. However, if it is not immediately clear what happens in this limit due to the presence of the operator L. In this section we derive this limit by using the GSVD of . Given the GSVD of , the matrix

and the vector

are given by

and

As κ→∞ we have

Hence, as κ→∞, SR3 solves

(5)

This expression is reminiscent of the standard-form transformation applied to the system

see [Elden1982, Hansen1998]. However, the standard-form transformation is not restricted to these cases and applies to any regularizer. The standard-form transformation makes a substitution such that , where

and

The operator is called the A-weighted pseudo-inverse. The transformation splits the solution into two parts: one part in the range of , , and one part in the nullspace of , . The operator makes the two parts -orthogonal so that the system decouples. Hence, the system is transformed to an equivalent system where . If is invertible and if and has full rank we have . Hence, if , the standard form is achieved by simply applying .
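For reference, a commonly used explicit form of this transformation, restated here as a sketch following [Elden1982, Hansen1998] (W denotes a matrix whose columns form a basis for the nullspace of L; consult the references for the precise conditions and notation):

    \bar{A} = A L_A^{\dagger}, \qquad \bar{b} = b - A x_0, \qquad
    L_A^{\dagger} = \bigl(I - W (A W)^{\dagger} A\bigr) L^{\dagger}, \qquad
    x_0 = W (A W)^{\dagger} b,

and the solution of the original problem is recovered as x = L_A^{\dagger} \bar{x} + x_0.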
If or if is rank deficient, the null space of has to be accounted for. The component in the nullspace can be written in terms of the GSVD as

(6)

The operator can be written in terms of the GSVD as

and hence the transformed system becomes

(7)

which is equivalent to (5). Note that this is the SVD of , with the singular values ordered in ascending order, and hence the singular values of are the generalized singular values, .
The solution to the original system using SR3 is given by

We will now show that the component corresponds to the part in the nullspace of , and the component corresponds to the part in the range of , , if .
Using the GSVD, we have

As κ→∞ we have

Hence,

(8)

Recall that the last columns of are a basis for the nullspace of and hence projects onto the nullspace of . Using the GSVD of we see that

which is equivalent to the nullspace component from (6).

We now show that corresponds to the part in the range of . We have

The elements of the diagonal matrix are

and as

Hence, as

and thus

The limit for the component is now given by

where solves

which is equivalent to (7).
In conclusion, the relation between the κ→∞ limit of SR3 and the standard-form transformation is given by the following two steps:

  1. A substitution and a transformed system . SR3 achieves this through and the auxiliary variable .

  2. Obtaining the solution . SR3 achieves this through .

The limit κ→0 is much easier to derive. Recall that

As κ→0 we have and . Hence , which is the unregularized minimum norm solution.

2.4 Relation to the standard-form transformation

We have shown that as κ→∞ SR3 implicitly applies a standard-form transformation and that as κ→0 the system is unregularized. The question arises what happens for finite κ. To show this, we rewrite the singular values of as

This is equivalent to equation 9 in [Zheng2019], where it was shown that if ,

This shows that SR3 is applied to the system . This leads to the following theorem.

Theorem 3

SR3 is equivalent to first applying a standard-form transformation and then applying SR3 to the newly formed system, i.e.

where the solution is given by

Note that computing is computationally expensive. The strength of SR3 lies in the fact that it implicitly applies the standard-form transformation without this computational burden.

3 Approximating the value function

In this section we quantify the distance between the Pareto curve of the original problem and the Pareto curve of the relaxed problem in terms of . We first describe the value function of the problem and then present our theorem.

3.1 Value function

The value function of an optimization problem expresses the value of the objective at the solution as a function of the other parameters. Using the standard-form transformation, we can, without loss of generality, consider the standard-form value function:

Following [VanDenBerg2008] we obtain the following (computable) upper and lower bounds for the value function

where is a feasible point, is the corresponding residual and . Moreover, by [VanDenBerg2008, Cor. 2.2] the derivative of the value function is given by

To gain some insight into the behaviour of the value function, we consider and at and :

This immediately suggests that decreases linearly near (the zero solution) and flattens off near (the unconstrained minimizer). Since is known to be convex, its second derivative is nonnegative and will gradually bend the curve from decreasing to flat. How fast this happens, and whether we can expect the typical L-shape, depends on how fast the curve decreases initially. We can bound as follows. We let and find

where is a constant that exists due to the equivalence of norms. Furthermore,

From this we get

with the condition number of . We thus expect a steep slope for ill-conditioned problems, giving rise to the characteristic L-shape of the curve. While this behavior is well-established for where it can be analysed using the SVD of [Hansen1992], this analysis gives us new insight into the behavior of the Pareto curve for ill-posed problems with general . An example for , is shown in figure 4.

Figure 4: Pareto curve for an ill-posed problem; the matrix is diagonal with elements for ; with . The tangent lines at and are shown in black.

3.2 Relaxed value function

We now present our theorem on the distance between the Pareto curve of the original problem and the Pareto curve of the relaxed problem.

Theorem 4

The distance between the Pareto curve of the original problem and the Pareto curve of the relaxed problem is given by

where is the solution of the relaxed problem.

Proof

Let . The relaxed value function can be expressed as

For we can expand and get

Introduce

We have . Furthermore

where and is the optimal . With this we find

(9)

We conclude that . For small we get

Plugging this expression into Eq. 9 gives the desired result.

This explains why the error gets smaller for large ; for an unconstrained problem we have . An example is shown in figure 5.

Figure 5: Pareto curve for an ill-posed problem; the matrix is diagonal with elements for ; with . The approximations for various values of are shown as well.

4 Implementation

So far we have analyzed the properties of SR3 via the operator , which was obtained by minimizing out the variable x in Eq. 2. For the implementation of SR3, it is not necessary to form this operator explicitly, as was shown in [Zheng2019]. The authors propose the following algorithm for solving the relaxed problem

(10)
(11)

which for the particular choice simplifies to

(12)
(13)

The last equation shows that for the choice there is a relation between the parameters and . More specifically, depends on and hence we write . Given the optimal , we have . Note that if we use the constrained formulation, the dependence on the stepsize is lost because the regularizer is an indicator function, whose proximal operator is a projection, and there is no relation between and .
The computational bottleneck is in the first step, which is the solution to the large-scale linear system

(14)

To avoid explicitly forming and , we instead solve the following minimization problem

(15)

with LSQR. We will numerically investigate how only partially solving Eq. 15 affects the convergence of SR3. This has been investigated for ADMM in [Eckstein2017, Eckstein2018, MarquesAlves2020]. The convergence of FISTA with an inexact gradient has been analyzed in [Schmidt2011]. The key message is that the error has to go down as the iterations increase.
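A sketch of such an inexact, warm-started x-update using SciPy's LSQR, assuming Eq. 15 is the stacked least-squares form min_x ‖[A; √κ L]x - [b; √κ y]‖_2^2: solving for the update relative to the previous iterate implements the warm start, and iter_lim together with the tolerances controls the inexactness (all parameter values shown are illustrative):

    import numpy as np
    from scipy.sparse.linalg import lsqr

    def sr3_x_update(A, b, L, y, kappa, x_prev, inner_iters=20, tol=1e-6):
        # Inexact solve of  min_x ||A x - b||^2 + kappa*||L x - y||^2,
        # written as a stacked least-squares problem and warm-started at x_prev
        # by solving for the update d = x - x_prev.
        A_stack = np.vstack([A, np.sqrt(kappa) * L])
        rhs = np.concatenate([b, np.sqrt(kappa) * y])
        residual = rhs - A_stack @ x_prev        # warm start: shift by previous iterate
        d = lsqr(A_stack, residual, atol=tol, btol=tol, iter_lim=inner_iters)[0]
        return x_prev + d

In practice A and L would be kept as sparse or matrix-free operators rather than stacked densely; the explicit stacking above is only for brevity.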

We propose two additional ingredients for the implementation of SR3: warm starts and inexact solves of the system (