Algorithms and software for projections onto intersections of convex and non-convex sets with applications to inverse problems

02/26/2019 ∙ by Bas Peters, et al. ∙ 0

We propose algorithms and software for computing projections onto the intersection of multiple convex and non-convex constraint sets. The software package, called SetIntersectionProjection, is intended for the regularization of inverse problems in physical parameter estimation and image processing. The primary design criterion is working with multiple sets, which allows us to solve inverse problems with multiple pieces of prior knowledge. Our algorithms outperform the well known Dykstra's algorithm when individual sets are not easy to project onto because we exploit similarities between constraint sets. Other design choices that make the software fast and practical to use, include recently developed automatic selection methods for auxiliary algorithm parameters, fine and coarse grained parallelism, and a multilevel acceleration scheme. We provide implementation details and examples that show how the software can be used to regularize inverse problems. Results show that we benefit from working with all available prior information and are not limited to one or two regularizers because of algorithmic, computational, or hyper-parameter selection issues.



There are no comments yet.


page 21

page 23

page 25

page 26

page 27

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.


We propose algorithms and software for computing projections onto the intersection of multiple convex and non-convex constraint sets. The software package, called SetIntersectionProjection, is intended for the regularization of inverse problems in physical parameter estimation and image processing. The primary design criterion is working with multiple sets, which allows us to solve inverse problems with multiple pieces of prior knowledge. Our algorithms outperform the well known Dykstra’s algorithm when individual sets are not easy to project onto because we exploit similarities between constraint sets. Other design choices that make the software fast and practical to use, include recently developed automatic selection methods for auxiliary algorithm parameters, fine and coarse grained parallelism, and a multilevel acceleration scheme. We provide implementation details and examples that show how the software can be used to regularize inverse problems. Results show that we benefit from working with all available prior information and are not limited to one or two regularizers because of algorithmic, computational, or hyper-parameter selection issues.


inverse problems, alternating direction method of multipliers, parallel computing, software, projection methods


We consider problems of the form


which is the projection of a vector

onto the intersection of convex and possibly non-convex sets . The projection in equation (1) is unique if all sets are closed and convex. The projection operation is a common tool used for solving constrained optimization problems of the form


Examples of algorithms that use projections include spectral projected gradient descent (SPG, Birgin et al., 1999), projected quasi-Newton (Schmidt et al., 2009), and projected Newton-type methods (Bertsekas, 1982; Schmidt et al., 2012). In the above optimization problem, the function is at least twice differentiable and may also be non-convex. Alternatively, proximal algorithms solve


which is equivalent to (2) and where is the indicator function of the set , which returns zero if is an element of the intersection and infinity otherwise. Because applications may benefit from using non-convex sets

, we also consider those sets in the numerical examples. While we do not provide convergence guarantees for this case, we will work with some useful/practical heuristics.

The main applications of interest in this work are inverse problems for the estimation of physical (model) parameters () from observed data (). Notable examples are geophysical imaging problems with seismic waves (full-waveform inversion, see, e.g., Tarantola, 1986; Pratt et al., 1998; Virieux and Operto, 2009) for acoustic velocity estimation and direct-current resistivity problems (DC-resistivity, see, e.g., Haber, 2014) to obtain electrical conductivity information. These problems have ‘expensive’ forward operators, i.e., evaluating the objective (data-misfit)

requires solutions of many partial-differential-equations (PDEs) if the PDE constraints are implicit in

, which corresponds to a reduced data-misfit (Haber et al., 2000). In our context, each set describes a different type of prior information on the model . Examples of prior knowledge as convex sets are bounds on parameter values, smoothness, matrix properties such as the nuclear norm, and whether or not the model is blocky with sharp edges (total-variation like constraints via the norm). Non-convex sets that we use in the numerical examples include the annulus (minimum and maximum norm), limited matrix rank, and vector cardinality.

Aside from the constrained minimization as in problem (2), we consider feasibility (also known as set-theoretic estimation) problem formulations (e.g., Youla and Webb, 1982; Trussell and Civanlar, 1984; Combettes, 1993, 1996). Feasibility only formulations accept any point in the intersection of sets that describe constraints on model parameter properties, and a data-fit constraint that ties the unknown model vector to the observed data via a forward operator . Examples of data-constraint sets are and . The upper and lower bounds are vectors and and is a scalar that depends on the noise level. The forward operators are linear and often computationally ‘cheap’ to apply. Examples include masks and blurring kernels. In case there is a good initial guess available, we can choose to solve a projection rather than feasibility problem by adding the squared distance term as follows:


To demonstrate the benefits of this constrained formulation, we recast joint denoising-deblurring-inpainting and image desaturation problems as (4). Especially when we have a few training examples from which we can learn constraint set parameters, the feasibility and projection approaches conveniently add many pieces of prior knowledge in the form of multiple constraint sets, but without any penalty or trade-off parameters. For instance, (Combettes and Pesquet, 2004) show that we can observe ‘good’ constraint sets, such as the average of the total variation of a few training images. We address increasing computational demand that comes with additional constraint sets with a reformulation of problem (4), such that we take into account similarity between sets, and split the problem up into simple parallel computations where possible.

Projected gradient and similar algorithms naturally split problem (2) into a projection and data-fitting part. In this setting, software for computing projections onto the intersection of sets can work together with codes for physical simulations that compute and , as we show in one of the numerical examples. See dolfin-adjoint (Farrell et al., 2013), Devito (Kukreja et al., 2016; Louboutin et al., 2018) in Python and WAVEFORM (Da Silva and Herrmann, 2017), jInv (Ruthotto et al., 2017), and JUDI (Witte et al., 2018) in Julia for examples of recent packages.

Compared to regularization via penalty functions (that are not an indicator function), constrained problem formulations (2 and 4) have several advantages when solving physical parameter estimation problems. Penalty methods


add prior knowledge through penalty functions with scalar weights to the data-misfit term . Alternatively, we can add penalties to the objective and work with a data constraint instead—i.e., we have


generally referred to as Basis Pursuit Denoise (Mallat and Zhang, 1992; Chen et al., 2001; van den Berg and Friedlander, 2009; Aravkin et al., 2014), Morozov/residual regularization (Ivanov et al., 2013), or Occam’s inversion (Constable et al., 1987). The scalar relates to the noise level in the data. For convex constraints/objectives/penalties, constrained, penalty and data-constrained problems are equivalent under certain conditions and for specific - pairs (Vasin, 1970; Gander, 1980; Golub and von Matt, 1991; van den Berg and Friedlander, 2009; Aravkin et al., 2016; Tibshirani, 2017), but differ in algorithmic implementation and in their ability to handle multiple pieces of prior information (). In that case, the simplicity of adding penalties is negated by the challenge of selecting multiple trade-off parameters (). For this, and for reasons we list below, we prefer constrained formulations that involve projections onto the intersection of constraint sets (problem 1). Constrained formulations

  • satisfy prior information at every iteration PDE-based inverse problems require model parameters that are in an interval for which the mesh (PDE discretization) is suitable, i.e., we have to use bound constraints. Projection-based algorithms satisfy all constraints at every iteration and give the user precise control of the model properties. This allows us to start solving a non-convex inverse problem with certain constraints, followed by a solution stage with ‘looser’ constraints. (Smithyman et al., 2015; Esser et al., 2016, 2018; Peters and Herrmann, 2017; Peters et al., 2018) apply this strategy to seismic full-waveform inversion to avoid local minimizers that correspond to geologically unrealistic models.

  • require a minimum number of manual tuning parameters for multiple constraints We want to avoid the time-consuming and possibly computationally costly procedure of manually tuning numerous nuisance parameters. Constraint sets have the advantage that their definitions are independent of all other constraint definitions. For penalty functions, the effect of the weights associated with each on the solutions of an inverse problem depends on all other and . For this reason, selecting multiple scalar weights to balance multiple penalty functions becomes increasingly difficult as we increase the number of penalties.

  • make direct use of prior knowledge We can observe model properties from training examples and use this information directly as constraints (Combettes and Pesquet, 2004, see also numerical examples in this work). Penalty and basis-pursuit type methods first need to translate this information into penalty functions and scalar weights.

Most classical and recently proposed methods to project onto an intersection of multiple (convex) sets, such as Dykstra’s algorithm and variants (Dykstra, 1983; Boyle and Dykstra, 1986; Censor, 2006; Bauschke and Koch, 2015; López and Raydan, 2016; Aragón Artacho and Campoy, 2018), (see also Appendix A), use projections onto each set separately, denoted as . The projection is a black box, and this may create difficulties if the projection onto one or more sets has no known closed-form solution. We then need another iterative algorithm to solve the sub-problems. This nesting of algorithms may lead to problems with the selection of appropriate stopping criteria for the sub-problem solver. In that case, we need two sets of stopping criteria: one for Dykstra’s algorithm itself and one for the iterative algorithm that computes the individual projections. Projections need to be sufficiently accurate such that Dykstra’s algorithm converges. At the same time, we do not want to waste computational resources by solving sub-problems more accurately than necessary. A second characteristic of the black-box projection algorithms is that they treat every set individually and do not attempt to exploit similarities between the sets. If we work with multiple constraint sets, some of the set definitions may include the same or similar linear operators in terms of sparsity (non zero) patterns.

Besides algorithms that are designed specifically to compute projections onto the intersection of multiple sets, there exist software packages capable of solving a range of generic optimization problems. However, many of the current software packages are not designed to compute projections onto intersections of multiple constraint sets where we usually do not know the projection onto each set in closed form. This happens, for instance, when the set definitions include linear operators that satisfy the relation for . A package such as Convex for Julia (Udell et al., 2014), an example of disciplined convex programming (DCP), does not handle non-convex sets and requires lots of memory even for large and sparse linear operators from problems on 2D grids. The high memory demands are a result of the packages that Convex can call as the back-end, for example, SCS (O’Donoghue et al., 2016) or ECOS (Domahidi et al., 2013). These solvers work with matrices that possess a structure similar to


This block-structured system becomes prohibitively large in case we work with multiple constraint sets that include a linear operator in their definitions. The software that comes closer to our implementation is Epsilon (Wytock et al., 2015), which is written in Python. Like our proposed algorithms, Epsilon also employs the alternating direction method of multipliers (ADMM), but reformulates optimization problems by emphasizing generalized proximal mappings as in equation (12, see below). Linear equality constraints then appear as indicator functions, which leads to different linear operators ending up in different sub-problems. Contrary, we work with a single ADMM sub-problem that includes all linear operators. The ProxImaL software (Heide et al., 2016) for Python is designed for linear inverse problems in imaging using ADMM with a similar problem reformulation. However, ProxImaL differs fundamentally since it applies regularization with a relatively small number of penalty functions. While in principle it should be possible to adapt that package to constrained problem formulations by replacing penalties with indicator functions, ProxImaL is in its current form not set up for that purpose. Finally there is StructuredOptimization (Antonello et al., 2018) in Julia. This package also targets inverse problems by smooth+non-smooth function formulations. Different from the goal of this work, StructuredOptimization focusses on problems with easy to compute generalized proximal mappings (12), i.e., penalty functions or constraints that are composed with linear operators that satisfy . Contrary, we focus on the situation where we have many constraints with operators () that make generalized proximal mappings (12) difficult to compute. Below, we list additional benefits of our approach compared to existing packages that can solve intersection projection problems.


Our aim is to design and implement parallel computational optimization algorithms for solving projection problems onto intersections of multiple constraint sets in the context of inverse problems. To arrive at this optimization framework, SetIntersectionProjection, we propose

  • an implementation that avoids nesting of algorithms and exploits similarities between constraint sets, unlike black-box alternating projection methods such as Dykstra’s algorithm. Taking similarities between sets into account allows us to work with many sets at a relatively small increase in computational cost.

  • algorithms that are based on a relaxed variant of the simultaneous direction method of multipliers (SDMM, Afonso et al., 2011; Combettes and Pesquet, 2011; Kitic et al., 2016). By merging SDMM with recently developed schemes for automatically adapting the augmented-Lagrangian penalty and relaxation parameters (Xu et al., 2017b, a), we achieve speedups when solving problem (1) compared to the straightforward application of operator splitting methods.

  • a software design specifically for set intersection projection problems. Our specializations enhance computational performance and include (i) a relatively simple multilevel strategy for ADMM-based algorithms that does part of the computations on coarser grids; (ii) solutions of banded linear systems in compressed diagonal format (CDS) with multi-threaded matrix-vector products (MVP). These MVPs are faster than general purpose storage formats like compressed sparse column storage (CSC) and support linear operators with spatially varying (blurring) kernels; (iii) more intuitive stopping criteria based on set feasibility.

  • to make our work available as a software package in Julia (Bezanson et al., 2017). Besides the algorithms, we also provide scripts for setting up the constraints, projectors and linear operators, as well as various examples. All presented timings, comparisons, and examples are reproducible.

  • an implementation that is suitable for small matrices (2D) up to larger tensors (3D models, at least

    ). Because we solve simple-to-compute sub-problems in closed form and independently in parallel, the proposed algorithms work with large models and many constraints. We achieve this because there is only a single inexact linear-system solve that does not become much more computationally expensive as we add more constraint sets.

To demonstrate the capabilities of our optimization framework and implementation, we provide examples how projections onto an intersection of multiple constraint sets can be used to solve linear image processing tasks such as denoising an deconvolution and more complicated inverse problems including nonlinear parameters estimation problems with PDEs.

Notation, assumptions, and definitions

Our goal is to estimate the model vector (e.g., discretized medium parameters such as the acoustic wave speed) , which in 2D corresponds to a vectorized (lexicographically ordered) matrix of size . Coordinate is the vertical direction and the horizontal direction. There are elements in a 2D model. Our work applies to 2D and 3D models but to keep the derivations simpler we limit the descriptions to 2D models discretized on a regular grid. We use the following discretization for the vertical derivative in our constraint definitions


where is the vertical grid size. We define the discretized vertical derivative for the 2D model as the Kronecker product of

and the identity matrix corresponding to the x-dimension:


The indicator function of a convex or non-convex set is defined as


We define the Euclidean projection onto a set as


This projection is unique if is a closed and convex set. If is a non-convex set, the projection may not be unique so the result is any vector in the set of minimizers of the projection problem. The proximal map of a function is defined as


so , where is a scalar. The case when includes a linear operator is of particular interest to us and we make it explicit with the definition


Even though is often available in closed-form solution, or cheap to compute (Combettes and Pesquet, 2011; Parikh and Boyd, 2014; Beck, 2017, Chapter 6 & 7), is usually not available in closed form if and more expensive to compute. Here, the symbol refers to (Hermitian) transpose. The proximal map for the indicator function is the projection:

with defined as in (10). The intersection of an arbitrary number of convex sets, , is also convex. We assume that all constraints are chosen consistently, such that the intersection of all selected constraint sets is nonempty:


This assumption is not restrictive in practice because apparently contradicting constraint sets often have a non-empty intersection. For example, -norm based total-variation constraints and smoothness promoting constraints have at least one model in their intersection: a homogeneous model has a total-variation equal to and maximal smoothness.

We use to indicate entries of the vector . Subscripts like refer to one of the sub-vectors that are part of .

The Euclidean inner product of two vectors is denoted as , and .

PARSDMM: Exploiting similarity between constraint sets

As we briefly mentioned in the introduction, we want to construct an algorithm to compute projections onto the intersection of multiple sets that (i) avoids nesting multiple algorithms if we do not know a projection onto one of the sets in closed-form; (ii) explicitly exploit similarities between the linear operators ; (iii) do most computational work in parallel. The first step to accomplish this is writing each constraint set in problem (1) as the indicator function of a ‘simple’ set () and a possibly non-orthogonal linear operator: . We formulate projection of onto the intersection of sets as


In this section, we introduce our main algorithm, Projection Adaptive Relaxed Simultaneous Method of Multipliers (PARSDMM). The name derives from the relation to existing works about adaptive relaxation and ADMM-variants for minimizing sums of functions. It is designed to solve inverse problems that call for multiple pieces of prior knowledge in the form of constraints. Each piece of prior knowledge corresponds to a single set. We focus on intersections up to about sets, which we found adequate to regularize inverse problems. To avoid technical issues with non-convexity, we, for now, assume all sets to be closed and convex.

We use ADMM as a starting point. ADMM is known to solve intersection projection (and feasibility) problems (Boyd et al., 2011; Pakazad et al., 2015; Bauschke and Koch, 2015; Jia et al., 2017; Tibshirani, 2017; Kundu et al., 2017). However, it remains a black-box algorithm and struggles with projections that do not have closed-form solutions. For completeness and to highlight the differences with the algorithm we propose below, we describe in Appendix A a black-box algorithm for the projection onto the intersection of sets based on ADMM.

The augmented Lagrangian

To start the derivation of PARSDMM, we introduce separate vectors for each of the constraint sets of problem (14) and we add linear equality constraints as follows:


The augmented Lagrangian (e.g., Nocedal and Wright, 2000, Chapter 17) of problem (15) is a basis for ADMM (see (19) below). To ensure that the -minimization of the augmented Lagrangian remains quadratic, we make this minimization problem independent of the distance term . This choice has the additional benefit of allowing for other functions that measure distance from . We remove the direct coupling of the distance term by (i) introducing the additional variables and constraints ; (ii) defining ; (iii) creating the function


where we use the symbol to indicate concatenated matrices and vectors, as well as functions that are the sum of multiple functions to simplify notation. The concatenated matrices and vectors read


The vectors are the Lagrangian multipliers that occur in the augmented Lagrangian for the projection problem. We always have for the Euclidean projection that uses the squared -distance . With these new definitions, problem (15) becomes


This formulation has the same form as problems that regular ADMM solves—i.e., . It follows that we can guarantee convergence under the same conditions as for ADMM. According to (Boyd et al., 2011; Eckstein and Yao, 2015), ADMM converges when and are proper and convex. The linear equality constraints involve matrices and and vectors , and .

To arrive at the main iterations of PARSDMM, we continue based on the augmented Lagrangian for (18), which reads


As we can see, this expression has a separable structure with respect to the Lagrangian multipliers , and the auxiliary vectors . Following the ADMM variants for multiple functions, as formulated by (Song et al., 2016; Kitic et al., 2016; Xu et al., 2017c), we use a different penalty parameter for each index . In this way, we make sure all linear equality constraints are satisfied sufficiently when running a limited number of iterations. Because the different matrices may have widely varying scalings and sizes, a fixed penalty for all could cause slow convergence of towards one of the constraint sets. To further accelerate the algorithm we also introduce a different relaxation parameter () for each index . After we derive the main steps of our proposed algorithm, we describe the automatic selection of the scalar parameters.

The iterations

With the above definitions, iteration counter , and inclusion of relaxation parameters, which we assume to be limited to the interval (see Xu et al., 2017b), the iterations can be written as

To arrive at our final algorithm, we rewrite these iterations in a more explicit form as

In this expression, we used the fact that is always the identity matrix of size for projection problems. Without over/under relaxation ( computation, Eckstein and Bertsekas, 1992; Iutzeler and Hendrickx, 2017; Xu et al., 2017b), these iterations are known as SALSA (Afonso et al., 2011) or the simultaneous direction method of multipliers (SDMM, Combettes and Pesquet, 2011; Kitic et al., 2016). The derivation in this section shows that ADMM/SDMM solve the projection onto an intersection of multiple closed and convex sets. However, the basic iterations from (The iterations) are not yet a practical and fast algorithm, because there are scalar parameters that need to be selected, no stopping conditions, and no specializations to constraints typically found in the imaging sciences. We address these issues in the following sections.

Computing the proximal maps

The proximal maps in the iterations (The iterations) become projections onto simple sets (e.g., bounds/ and norm-ball/cardinality/rank), which permit closed-form solutions that do not depend on the . When , (squared distance of to the reference vector ) the proximal map is also available in closed form:


where we understand the division in a point-wise sense. We thus avoided other convex optimization algorithms for computations of the proximal maps of interest.

Solving the linear system and automatic parameter selection

We can see from (The iterations) that the computation of involves the solution of a single system of normal equations that contains all linear operators. The system matrix equals


and is by construction always positive-definite because for all . The minimization over is therefore uniquely defined. As suggested by Xu et al. (2017a), we adapt the ’s every two iterations using the scheme we discuss below.

While we could use direct matrix factorizations of , we would need to refactorize every time we update any of the ’s. This makes computing too costly. Instead, we rely on warm-started iterative solvers with used as the initial guess for . There exist several alternatives including LSQR (Paige and Saunders, 1982) to solve the above linear system ( computation in The iterations) iteratively. We choose to apply the conjugate-gradient (CG) method to the normal equations for the following reasons:

  1. Contrary to LSQR, transforms that satisfy are free in CG because we explicitly form the sparse system matrix , which already includes the identity matrix.

  2. By limiting the relative difference between the and , where the latter corresponds to the identity matrix in (21), we ensure is sufficiently well conditioned so squaring the condition number does not become a numerical problem.

  3. For many transforms, the matrices are sparse and have at least partially overlapping sparsity patterns (discrete derivative matrices for one or more directions, orthogonal transforms). Multiplication with is therefore not much more expensive than multiplication with a single . However, LSQR requires matrix-vector products with all and at every iteration.

  4. Full reassembly of at iteration is not required. Every time we update any of the , we update by subtracting and adding the block corresponding to the updated . If the index that changed is indicated by , the system matrix for the next computation becomes


    For each update, forming the new system matrix involves a single addition of two sparse matrices (assuming all ’s are pre-computed).

To further save computation time, we solve the minimization with respect to inexactly. We select the stopping criterion for CG adaptively in terms of the relative residual of the normal equations—i.e., we stop CG if the relative residual drops below


Empirically, we found this choice robust and it also results in time savings for solving problem (18) compared to a fixed and accurate stopping criterion for the -minimization step. The stopping criterion for CG is relatively inexact during the first few iterations from (The iterations) and requests more accurate solutions later on, such that the conditions on inexact sub-problem solutions from (Eckstein and Bertsekas, 1992) will be satisfied eventually.

Just like standard ADMM, our algorithm may also need a large number of iterations (The iterations) for a fixed penalty parameter for all (e.g., Nishihara et al., 2015; Xu et al., 2017a). It is better to update and every couple of iterations to ensure we reach a good solution in a relatively small number of iterations. For this purpose, we use Xu et al. (2017a)’s automatic selection of and for ADMM. Numerical experiments by Xu et al. (2016) show that these updates also perform well on various non-convex problems. The updates themselves are based on a Barzilai-Borwein spectral step size (Barzilai and Borwein, 1988) for Douglas-Rachford (DR) splitting applied to the dual of and derive from equivalence between ADMM and DR (Eckstein and Bertsekas, 1992; Esser, 2009).

Exploiting parallelism

Given the grid size of 3D PDE-based parameter estimation problems and our desire to work with multiple constraint sets, we seek a parallel implementation that exploits multi-threading offered by programming languages such as Julia (Bezanson et al., 2017). Since the computational time for the x-minimization using the conjugate-gradient algorithm is dominated by the matrix-vector products (MVP) with , we concentrate our efforts there by using compressed diagonal storage (CDS), see, e.g., Saad (1989); Serón et al. (1990); Kotakemori et al. (2008). This format stores the non-zero bands of the matrix as a dense matrix, and we compute MVPs directly in this storage sytem. These MVPs are faster than the more general Compressed Sparse Column (CSC) format. CDS has the additional benifit that it can efficiently handle matrices generated by spatially varying (blurring, derivative) kernels. We can use CDS if all matrices have a banded sparsity-pattern. Using Julia’s multi-threading, we compute the MVPs with in parallel. In cases where the ’s do not have a banded structure we revert to computations in the standard CSC format.

Aside from matrix-vector products during the inner iterations, most calculation time in (The iterations) is used for , , , , and . To reduce these computation times, we compute these quantities in parallel. This is relatively straightforward to do because each problem is independent so that the operations for the constraints can be carried out by different Julia workers where each worker either uses Julia threads, multi-threaded BLAS (OpenBLAS, Wang et al., 2013)

, or multi-threaded Fourier-transforms

(FFTW library, Frigo and Johnson, 2005).

Stopping conditions

So far, we focussed on reducing the time for each iteration of (The iterations). However, fast solutions to the full problem are possible only if we know when to stop iterating. When working with a single constraint set, stopping criteria based on a combination of the primal and dual residual are adequate as long as both become sufficiently small (e.g., Boyd et al., 2011; Kitic et al., 2016; Xu et al., 2017a). However, the situation is more complicated in situations where we work with multiple constraint sets. In that case, the and contain a variety of vectors and linear operators that correspond to the different constraint sets. It then becomes more difficult to determine the relationship between the size of the residuals and the distance to each set. With other words, it becomes challenging to decide at what primal and dual residual to stop such that we are close to all constraint sets.

Instead, it may be more intuitive to look at the feasibility for each set separately. This holds if is in the intersection of the constraint sets but requires computationally costly projections onto each to verify, a situation we want to avoid in PARSDMM. Instead, we rely on the transform-domain set feasibility error


to which we have access at a relatively low cost since we already computed in the iterations from (The iterations). Our first stopping criterion thus corresponds to a normalized version of the objective Censor et al. (2005) uses when solving convex multiple set split-feasibility problems. We added this normalization in (24) to account for different types and sizes of the linear operators . The projections onto the constraint sets themselves, are relatively cheap to compute since they only include projections onto ‘simple’ sets based on norm-balls, bounds, cardinality, and matrix rank. By testing for transform-domain feasibility every few iterations only (typically ), we further reduce the computational costs for our stopping condition.

Satisfying constraints does not indicate whether or not is close to the projection onto the intersection of the different constraint sets or if it is just a feasible point, possibly ‘deep’ inside the intersection. If is indeed the projection of , then approaches a stationary point, assuming that converges to the projection. We make this property explicit by considering the maximum relative change of over the previous iterations: . The relative evolution of at the iteration thus becomes


By considering the history (we use in our numerical examples), our stopping criterion becomes more robust to oscillations in as a function of So we propose to stop PARSDMM if


During our numerical experiments, we select and , which balance sufficiently accurate solutions and short solution times. These are still two constants to be chosen by the user, but we argue that may relate better to our intuition on feasibility because it behaves like a distance to each set separately. The evolution term is found in many optimization algorithms and is especially informative for physical parameter estimation problems where practitioners often have a good intuition to which the physical forward operator is sensitive.

The PARSDMM algorithm

We summarize our discussions from the previous sections in the following Algorithms.

Algorithm PARSDMM

Algorithm 1 Projection Adaptive Relaxed Simultaneous Direction Method of Multipliers (PARSDMM) to compute the projection onto an intersection, including automatic selection of the penalty parameters and relaxation.

 //point to project
   //linear operators
   for  //norm/bound/cardinality/... projectors
   //prox for the squared distance from 
  select update-freqency for  and 
  optional: initial guess for  and 

    //pre-compute for all 
WHILE not converged
       //CG, stop when (23holds
      FOR  //compute in parallel
           stop if conditions (26) hold
          End if
      FOR  //update  if necessary
          End if

Algorithm adapt-rho-gamma


Algorithm 2 Adapt and according to (Xu et al., 2017b) with some modifications to save computational work. The constant is in the range as suggested by (Xu et al., 2017b). Quantities from the previous call to adapt-rho-gamma have the indication . Actual implementation computes and re-uses some of the inner products and norms.


set  and save for next call to adapt-rho-gamma 
save  for next call to adapt-rho-gamma

Multilevel PARSDMM

Inverse problems with PDE forward models typically need a fine grid for stable physical simulations. At the same time, we often use constraints to estimate ‘simple’ models—i.e. models that are smooth, have a low-rank, are sparse in some transform-domain, and that may not need many grid points for accurate representations of the image/model. This suggests we can reduce the total computational time of PARSDMM (Algorithm 1 Projection Adaptive Relaxed Simultaneous Direction Method of Multipliers (PARSDMM) to compute the projection onto an intersection, including automatic selection of the penalty parameters and relaxation.) by using a multilevel continuation strategy (see, e.g., Nash, 2000; Ascher and Haber, 2001; Nash, 2014; Parpas, 2017). The multilevel idea presented in this section applies to the projection onto the intersection of constraint sets only and not to the grids for solving PDEs. Our approach proceeds as follows: we start at a coarse grid and continue towards finer grids without cycling between coarse and fine grids. By using the solution at the coarse grid as the initial guess for the solution on the finer grid, the convergence guarantees are the same as for the single level version of our algorithm. This multilevel approach is similar to multilevel ADMM by Macdonald and Ruthotto (2018) for non-convex linear equality constrained problems. Numerical experiments in the following section show reduced solution times and better performance on non-convex problems.

Different from many applications of multilevel algorithms, is that we are not interested in approximating the primal variable , because ADMM-type iterations compute this quantity in the first step, see Algorithm 1 Projection Adaptive Relaxed Simultaneous Direction Method of Multipliers (PARSDMM) to compute the projection onto an intersection, including automatic selection of the penalty parameters and relaxation.. Instead, we need to concern ourselves with the initial guesses of auxilliary variables and Lagrangian multipliers for all

. After initialization of the coarsest grid with all zero vectors, we move to a finer grid by interpolating

and . Since the solution estimate always refers to an image (2D/3D) for our applications, we know that and relate to images as well and their dimensions depend on the corresponding . Therefore we interpolate and as images.

Example. When is a discrete derivative matrix, then the vectors and live on a grid that we know at every level of the multilevel scheme. If we have , where is the first-order finite-difference matrix as in (8), we know that and therefore and . We can thus reshape the associated vectors and as an image (in 2D) of size and interpolate it to the finer grid for the next level, working from coarse to fine. In 3D, we follow the same approach. We also need a coarse version of at each level: for . We simply obtain the coarse models by applying anti-alias filtering and subsampling of the original . In principle, any subsampling and interpolation technique may be used in this multilevel framework, because it just constructs initial guesses for the next levels.

We decide the number of levels () and the coarsening factor ahead of time. Together with the original grid, these determine the grid at all levels so we can set up the linear operators and proximal mappings at each level. This set-up phase is a one time cost since its result is reused every time we project a model onto the intersection of constraint sets. The additional computational costs of the multilevel scheme are the interpolation of all and to a finer grid, but this happens only once per level and not every ML-PARSDMM (Algorithm 3 Multilevel PARSDMM to compute the projection onto an intersection of sets.) iteration. So the computational overhead we incur from the interpolations is small compared to the speedups that Algorithm 3 Multilevel PARSDMM to compute the projection onto an intersection of sets. offers.


Algorithm 3 Multilevel PARSDMM to compute the projection onto an intersection of sets.

 //number of levels
     //grid info at each level 
         //model to project at every level 
   //linear operators at every level
  // norm/bound/cardinality/... projectors at each level:
  // proximal map for the squared distance from  at each level:

//start at coarsest grid
      //solve on current grid
       //interpolate to finer grid
           //interpolate to finer grid
           //interpolate to finer grid
  at original grid (level )

Software and numerical examples

The software corresponding to this paper is available at All algorithms, examples, and utilities to set up the projectors and linear operators are included. Our software is specialized to the specific and fixed problem structure (14) with the flexibility to work with multiple linear operators and projectors. Because of these design choices, the user only needs to provide the model to project, , and pairs of linear operators and projectors onto simple sets: . The software adds the identity matrix and the proximal map for the distance squared from . These are all computational components required to solve intersection projection problems as formulated in (16). No internal reformulation is required by our software.

To reap benefits from modern programming language design, including just-in-time compilation, multiple dispatch, and mixing distributed and multi-threaded computations, we wrote our software package in Julia. Our code uses parametric typing, which means that the same scripts can run in Float32 (single) and Float64 (double) precision. As expected, most components of our software run faster in Float32 with reduced memory consumption. The timings in the following examples use Float32.

We provide scripts that the set up the linear operators and projectors for regular grids in 2D and 3D. It is not necessary to use these scripts as the solver is agnostic to the specific construction of the projectors or linear operators. Table (1) displays the constraints we currently support. For example, when the user requests the script to set up minimum and maximum bounds on the discrete gradient in the -direction of the model, the script returns the discrete derivative matrix and a function that projects the input onto the bounds. The software currently supports the identity matrix, matrices representing the discrete gradient and the operators that we apply matrix-free: the discrete cosine/Fourier/wavelet/curvelet (Ying et al., 2005) transforms.

descriptions set
nuclear norm , with is the SVD.
cardinality , is a positive integer
rank , with is the SVD and
subspace constraints
Table 1: Overview of constraint sets that the software currently supports. A new constraint requires the projector onto the set (without linear operator) and a linear operator or equivalent matrix-vector product together with its adjoint. Vector entries are indexed as , and indicate singular vectors.

For the special case of orthogonal linear operators, we leave the linear operator inside the set definition because we know the projection onto in closed form. For example, if with discrete Fourier transform (DFT) matrix , the projection is known in closed form as , where denotes the complex-conjugate transpose and is the projection onto the