# Projective Splitting with Forward Steps: Asynchronous and Block-Iterative Operator Splitting

This work is concerned with the classical problem of finding a zero of a sum of maximal monotone operators. For the projective splitting framework recently proposed by Combettes and Eckstein, we show how to replace the fundamental subproblem calculation using a backward step with one based on two forward steps. The resulting algorithms have the same kind of coordination procedure and can be implemented in the same block-iterative and potentially distributed and asynchronous manner, but may perform backward steps on some operators and forward steps on others. Prior algorithms in the projective splitting family have used only backward steps. Forward steps can be used for any Lipschitz-continuous operators provided the stepsize is bounded by the inverse of the Lipschitz constant. If the Lipschitz constant is unknown, a simple backtracking linesearch procedure may be used. For affine operators, the stepsize can be chosen adaptively without knowledge of the Lipschitz constant and without any additional forward steps. We close the paper by empirically studying the performance of several kinds of splitting algorithms on the lasso problem.


## 1 Introduction

For a collection of real Hilbert spaces $\{\mathcal{H}_i\}_{i=0}^{n}$, consider the problem of finding $z \in \mathcal{H}_0$ such that

$$
0 \in \sum_{i=1}^{n} G_i^* T_i(G_i z), \tag{1}
$$

where $G_i : \mathcal{H}_0 \to \mathcal{H}_i$ are linear and bounded operators, $T_i : \mathcal{H}_i \to 2^{\mathcal{H}_i}$ are maximal monotone operators, and additionally there exists a subset $\mathcal{I}_F \subseteq \{1,\ldots,n\}$ such that for all $i \in \mathcal{I}_F$ the operator $T_i$ is Lipschitz continuous. An important instance of this problem is

$$
\min_{x \in \mathcal{H}_0} \sum_{i=1}^{n} f_i(G_i x), \tag{2}
$$

where every $f_i : \mathcal{H}_i \to (-\infty, +\infty]$ is closed, proper and convex, with some subset of the functions also being differentiable with Lipschitz-continuous gradients. Under appropriate constraint qualifications, (1) with $T_i = \partial f_i$ and (2) are equivalent. Problem (2) arises in a host of applications such as machine learning, signal and image processing, inverse problems, and computer vision; see [4, 9, 11] for some examples.

Operator splitting algorithms are now a common way to solve structured monotone inclusions such as (1). Until recently, there were three underlying classes of operator splitting algorithms: forward-backward [26], Douglas/Peaceman-Rachford [24], and forward-backward-forward [32]. In [13], Davis and Yin introduced a new operator splitting algorithm which does not reduce to any of these methods. Many algorithms for more complicated monotone inclusions and optimization problems involving many terms and constraints are in fact applications of one of these underlying techniques to a reduced monotone inclusion in an appropriately defined product space [6, 21, 12, 5, 10]. These four operator splitting techniques are, in turn, special cases of the Krasnoselskii-Mann (KM) iteration for finding a fixed point of a nonexpansive operator [23, 25].

A different, relatively recently proposed class of operator splitting algorithms is projective splitting: this class has a different convergence mechanism based on projection onto separating sets and does not in general reduce to the KM iteration. The root ideas underlying projective splitting can be found in [20, 30, 31], which dealt with monotone inclusions with a single operator. The algorithm of [17] significantly built on these ideas to address the case of two operators and was thus the original projective “splitting” method. This algorithm was generalized to more than two operators in [18]. The related algorithm in [1] introduced a technique for handling compositions of linear and monotone operators, and [8] proposed an extension to “block-iterative” and asynchronous operation; block-iterative operation means that only a subset of the operators making up the problem need to be considered at each iteration (this approach may be called “incremental” in the optimization literature). A restricted and simplified version of this framework appears in [16]. The asynchronous and block-iterative nature of projective splitting, together with its ability to handle composition with linear operators, gives it an unprecedented level of flexibility compared with prior classes of operator splitting methods, none of which can be readily implemented in an asynchronous or block-iterative manner. Further, in the projective splitting methods of [8, 16] the order in which operators can be processed is deterministic, variable, and highly flexible. It is not necessary that each operator be processed the same number of times either exactly or approximately; in fact, one operator may be processed much more often than another. The only constraint is that there is an upper bound on the number of iterations between the consecutive times that each operator is processed.

Projective splitting algorithms work by performing separate calculations on each individual operator to construct a separating hyperplane between the current iterate and the problem’s Kuhn-Tucker set (essentially the set of primal and dual solutions), and then projecting onto this hyperplane. In prior projective splitting algorithms, the only operation performed on the individual operators $T_i$ is a proximal (backward) step, which consists of evaluating the operator resolvents $(I + \rho T_i)^{-1}$ for some scalar $\rho > 0$. In this paper, we show how, for the Lipschitz continuous operators, the same kind of framework can also make use of forward steps on the individual operators, equivalent to applying $I - \rho T_i$. Typically, such “explicit” steps are computationally much easier than “implicit”, proximal steps. Our procedure requires two forward steps each time it evaluates an operator, and in this sense is reminiscent of Tseng’s forward-backward-forward method [32] and Korpelevich’s extragradient method [22]. Indeed, for the special case of only one operator, projective splitting with the new procedure reduces to the variant of the extragradient method in [20]. Each stepsize must be bounded by the inverse of the Lipschitz constant of $T_i$. However, a simple backtracking procedure can eliminate the need to estimate the Lipschitz constant, and other options are available for selecting the stepsize when $T_i$ is affine.

### 1.1 Intuition and contributions: basic idea

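We first provide some intuition into our fundamental idea of incorporating forward steps into projective splitting. For simplicity, consider (1) without the linear operators $G_i$, that is, we want to find $z$ such that $0 \in \sum_{i=1}^{n} T_i z$, where $T_1, \ldots, T_n : \mathcal{H}_0 \to 2^{\mathcal{H}_0}$ are maximal monotone operators on a single real Hilbert space $\mathcal{H}_0$. We formulate the Kuhn-Tucker solution set of this problem as

$$
\mathcal{S} = \Big\{ (z, w_1, \ldots, w_{n-1}) \;\Big|\; (\forall i \in \{1, \ldots, n-1\})\; w_i \in T_i z,\;\; -\sum_{i=1}^{n-1} w_i \in T_n z \Big\}. \tag{3}
$$

It is clear that $z^*$ solves $0 \in \sum_{i=1}^{n} T_i z^*$ if and only if there exist $w_1^*, \ldots, w_{n-1}^*$ such that $(z^*, w_1^*, \ldots, w_{n-1}^*) \in \mathcal{S}$. A separator-projector algorithm for finding a point in $\mathcal{S}$ will, at each iteration $k$, find a closed and convex set $H_k$ which separates $\mathcal{S}$ from the current point, meaning $\mathcal{S}$ is entirely in the set and the current point is not. One can then move closer to the solution set by projecting the current point onto the set $H_k$.

If we define $\mathcal{S}$ as in (3), then the separator formulation presented in [8] constructs the set $H_k$ through the function

$$
\begin{aligned}
\varphi_k(z, w_1, \ldots, w_{n-1})
&= \sum_{i=1}^{n-1} \langle z - x_i^k,\, y_i^k - w_i \rangle + \Big\langle z - x_n^k,\; y_n^k + \sum_{i=1}^{n-1} w_i \Big\rangle && (4) \\
&= \Big\langle z,\; \sum_{i=1}^{n} y_i^k \Big\rangle + \sum_{i=1}^{n-1} \langle x_i^k - x_n^k,\, w_i \rangle - \sum_{i=1}^{n} \langle x_i^k,\, y_i^k \rangle, && (5)
\end{aligned}
$$

for some $(x_i^k, y_i^k) \in \mathcal{H}_0^2$ such that $y_i^k \in T_i x_i^k$, $i = 1, \ldots, n$. From its expression in (5) it is clear that $\varphi_k$ is an affine function on $\mathcal{H}_0^n$. Furthermore, it may easily be verified that for any $p \in \mathcal{S}$, one has $\varphi_k(p) \le 0$, so that the separator set $H_k$ may be taken to be the halfspace $\{p \mid \varphi_k(p) \le 0\}$. The key idea of projective splitting is, given a current iterate $p^k = (z^k, w_1^k, \ldots, w_{n-1}^k)$, to pick $(x_i^k, y_i^k)$ so that $\varphi_k(p^k)$ is positive if $p^k \notin \mathcal{S}$. Then, since the solution set is entirely on the other side of the hyperplane $\{p \mid \varphi_k(p) = 0\}$, projecting the current point onto this hyperplane makes progress toward the solution. If it can be shown that this progress is sufficiently large, then it is possible to prove (weak) convergence.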
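As a quick sanity check of the separator construction, the following minimal sketch works on the real line with three affine monotone operators $T_i(x) = c_i x + b_i$, $c_i \ge 0$. These operators, the helper names `make_affine`, `phi_form4`, and `phi_form5`, and the random test points are all illustrative assumptions, not part of the paper. The sketch evaluates the separator in both forms (4) and (5) and checks that it is nonpositive at a Kuhn-Tucker point.

```python
import random

def make_affine(c, b):
    """Return T(x) = c*x + b on the real line; monotone whenever c >= 0."""
    return lambda x: c * x + b

def phi_form4(p, xs, ys):
    """Separator written as in (4); p = (z, w_1, ..., w_{n-1})."""
    z, ws = p[0], p[1:]
    n = len(xs)
    head = sum((z - xs[i]) * (ys[i] - ws[i]) for i in range(n - 1))
    return head + (z - xs[n - 1]) * (ys[n - 1] + sum(ws))

def phi_form5(p, xs, ys):
    """The same separator written in the expanded affine form (5)."""
    z, ws = p[0], p[1:]
    n = len(xs)
    return (z * sum(ys)
            + sum((xs[i] - xs[n - 1]) * ws[i] for i in range(n - 1))
            - sum(x * y for x, y in zip(xs, ys)))

# Hypothetical operators T_1, T_2, T_3 (slope, intercept).
coeffs = [(1.0, -2.0), (0.5, 1.0), (2.0, 0.5)]
ops = [make_affine(c, b) for c, b in coeffs]

# Solve sum_i T_i(z) = 0 in closed form and build a Kuhn-Tucker point.
z_star = -sum(b for _, b in coeffs) / sum(c for c, _ in coeffs)
p_star = (z_star,) + tuple(T(z_star) for T in ops[:-1])

# Arbitrary points (x_i, y_i) on the graphs of the operators.
rng = random.Random(0)
xs = [rng.uniform(-3.0, 3.0) for _ in ops]
ys = [T(x) for T, x in zip(ops, xs)]
p_rand = tuple(rng.uniform(-3.0, 3.0) for _ in ops)

gap = abs(phi_form4(p_rand, xs, ys) - phi_form5(p_rand, xs, ys))
value_at_solution = phi_form4(p_star, xs, ys)
print(gap, value_at_solution)
```

In this toy setting the value at the Kuhn-Tucker point equals $-\sum_i c_i (z^* - x_i)^2$, which is nonpositive because each operator is monotone.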

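Let the iterates of such an algorithm be $p^k = (z^k, w_1^k, \ldots, w_{n-1}^k)$. To simplify the subsequent analysis, define $w_n^k \triangleq -\sum_{i=1}^{n-1} w_i^k$ at each iteration $k$, whence it is immediate from (4) that $\varphi_k(p^k) = \sum_{i=1}^{n} \langle z^k - x_i^k, y_i^k - w_i^k \rangle$. To construct a function of the form (4) such that $\varphi_k(p^k) > 0$ whenever $p^k \notin \mathcal{S}$, it is sufficient to be able to perform the following calculation on each individual operator $T_i$: for $(z^k, w_i^k)$, find $(x_i^k, y_i^k)$ such that $y_i^k \in T_i x_i^k$ and $\langle z^k - x_i^k, y_i^k - w_i^k \rangle \ge 0$, with $\langle z^k - x_i^k, y_i^k - w_i^k \rangle > 0$ if $w_i^k \notin T_i z^k$. In earlier work on projective splitting [17, 18, 8, 1], the calculation of such a pair $(x_i^k, y_i^k)$ is accomplished by a proximal (implicit) step on the operator $T_i$: given a scalar $\rho > 0$, we find the unique pair $(x_i^k, y_i^k)$ such that $y_i^k \in T_i x_i^k$ and

$$
x_i^k + \rho y_i^k = z^k + \rho w_i^k \quad\Longrightarrow\quad z^k - x_i^k = \rho\,(y_i^k - w_i^k). \tag{6}
$$

We immediately conclude that

$$
\langle z^k - x_i^k,\, y_i^k - w_i^k \rangle = (1/\rho)\,\| z^k - x_i^k \|^2 \ge 0, \tag{7}
$$

and furthermore that $\langle z^k - x_i^k, y_i^k - w_i^k \rangle > 0$ unless $x_i^k = z^k$, which would in turn imply that $y_i^k = w_i^k$ and $w_i^k \in T_i z^k$. If we perform such a calculation for each $i = 1, \ldots, n$, we have constructed a separator of the form (4) which, in view of $\varphi_k(p^k) = \sum_{i=1}^{n} \langle z^k - x_i^k, y_i^k - w_i^k \rangle$, has $\varphi_k(p^k) > 0$ if $p^k \notin \mathcal{S}$. This basic calculation on $T_i$ is depicted in Figure 1(a) for $\mathcal{H}_0 = \mathbb{R}^1$: since $z^k - x_i^k = \rho\,(y_i^k - w_i^k)$, the line segment between $(z^k, w_i^k)$ and $(x_i^k, y_i^k)$ must have slope $-1/\rho$, meaning that $\langle z^k - x_i^k, w_i^k - y_i^k \rangle \le 0$ and thus that $\langle z^k - x_i^k, y_i^k - w_i^k \rangle \ge 0$. It also bears mentioning that the relation (7) plays (in generalized form) a key role in the convergence proof.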
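To make the backward-step calculation concrete, here is a minimal sketch on the real line, assuming the operator is the subdifferential of the absolute value (so its resolvent is soft-thresholding). The operator choice, the function names, and the numerical values are illustrative assumptions. The sketch solves $x + \rho y = z + \rho w$ with $y \in T x$ and then compares the two sides of identity (7).

```python
def soft_threshold(a: float, rho: float) -> float:
    """Resolvent (I + rho*T)^{-1} when T is the subdifferential of |x|."""
    if a > rho:
        return a - rho
    if a < -rho:
        return a + rho
    return 0.0

def backward_step(z: float, w: float, rho: float):
    """Find (x, y) with y in T(x) and x + rho*y = z + rho*w, as in (6)."""
    a = z + rho * w
    x = soft_threshold(a, rho)
    y = (a - x) / rho  # the element of T(x) selected by the resolvent
    return x, y

z, w, rho = 1.5, -0.25, 0.5
x, y = backward_step(z, w, rho)

lhs = (z - x) * (y - w)      # left-hand side of (7)
rhs = (z - x) ** 2 / rho     # right-hand side of (7)
print(x, y, lhs, rhs)
```

Here `x` is positive, so the only subgradient of $|\cdot|$ at `x` is 1, which is what the sketch recovers for `y`.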

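Consider now the case that $T_i$ is Lipschitz continuous with modulus $L_i \ge 0$ (and hence single valued) and defined throughout $\mathcal{H}_0$. We now introduce a technique to accomplish something similar to the preceding calculation through two forward steps instead of a single backward step. We begin by evaluating $T_i z^k$ and using this value in place of $y_i^k$ in the right-hand equation in (6), yielding

$$
z^k - x_i^k = \rho\,(T_i z^k - w_i^k) \quad\Longrightarrow\quad x_i^k = z^k - \rho\,(T_i z^k - w_i^k), \tag{8}
$$

and we use this value for $x_i^k$. This calculation is depicted by the lower left point in Figure 1(b). We then calculate $y_i^k = T_i x_i^k$, resulting in a pair $(x_i^k, y_i^k)$ on the graph of the operator; see the upper left point in Figure 1(b). For this choice of $(x_i^k, y_i^k)$, we next observe that

$$
\begin{aligned}
\langle z^k - x_i^k,\, y_i^k - w_i^k \rangle
&= \langle z^k - x_i^k,\, T_i z^k - w_i^k \rangle - \langle z^k - x_i^k,\, T_i z^k - y_i^k \rangle \\
&= \Big\langle z^k - x_i^k,\, \tfrac{1}{\rho}(z^k - x_i^k) \Big\rangle - \langle z^k - x_i^k,\, T_i z^k - T_i x_i^k \rangle && (9) \\
&\ge \tfrac{1}{\rho}\, \| z^k - x_i^k \|^2 - L_i\, \| z^k - x_i^k \|^2 && (10) \\
&= \Big( \tfrac{1}{\rho} - L_i \Big) \| z^k - x_i^k \|^2. && (11)
\end{aligned}
$$

Here, (9) follows because $T_i z^k - w_i^k = (1/\rho)(z^k - x_i^k)$ from (8) and because we let $y_i^k = T_i x_i^k$. The inequality (10) then follows from the Cauchy-Schwarz inequality and the hypothesized Lipschitz continuity of $T_i$. If we require that $\rho < 1/L_i$, then we have $1/\rho > L_i$ and (11) therefore establishes that $\langle z^k - x_i^k, y_i^k - w_i^k \rangle \ge 0$, with $\langle z^k - x_i^k, y_i^k - w_i^k \rangle > 0$ unless $x_i^k = z^k$, which would imply that $w_i^k = T_i z^k$. We thus obtain a conclusion very similar to (7) and the results immediately following from it, but using the constant $1/\rho - L_i > 0$ in place of the positive constant $1/\rho$.

For $\mathcal{H}_0 = \mathbb{R}^1$, this process is depicted in Figure 1(b). By construction, the line segment between $(z^k, T_i z^k)$ and $(x_i^k, w_i^k)$ has slope $1/\rho$, which is “steeper” than the graph of the operator, which can have slope at most $L_i$ by Lipschitz continuity. This guarantees that the line segment between $(z^k, w_i^k)$ and $(x_i^k, y_i^k)$ must have negative slope, which in $\mathbb{R}^1$ is equivalent to the claimed inner product property.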
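The two-forward-step construction can be sketched as follows, assuming a hypothetical affine operator $T(x) = Mx + b$ on $\mathbb{R}^2$ whose matrix has symmetric part equal to the identity (so $T$ is monotone) and satisfies $M^\top M = 5I$ (so its Lipschitz constant is $\sqrt{5}$). The matrix, vectors, stepsize, and function names are illustrative assumptions. The sketch performs the two forward steps and checks the lower bound (11).

```python
import math

# Hypothetical monotone, non-symmetric affine operator T(x) = M x + B.
M = ((1.0, 2.0), (-2.0, 1.0))
B = (0.5, -1.0)
L = math.sqrt(5.0)  # Lipschitz constant of T, since M^T M = 5 I

def T(x):
    return tuple(sum(M[r][c] * x[c] for c in range(2)) + B[r] for r in range(2))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def two_forward_steps(z, w, rho):
    """First forward step gives x as in (8); second gives y = T(x)."""
    Tz = T(z)
    x = tuple(zi - rho * (ti - wi) for zi, ti, wi in zip(z, Tz, w))
    y = T(x)
    return x, y

z, w = (1.0, -2.0), (0.3, 0.7)
rho = 0.4  # satisfies rho < 1/L, roughly 0.447
x, y = two_forward_steps(z, w, rho)

d = tuple(zi - xi for zi, xi in zip(z, x))
lhs = dot(d, tuple(yi - wi for yi, wi in zip(y, w)))
lower_bound = (1.0 / rho - L) * dot(d, d)   # right-hand side of (11)
print(lhs, lower_bound)
```

Only evaluations of `T` are needed; no linear system or resolvent is solved, which is the practical appeal of the forward-step variant.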

Using a backtracking line search, we will also be able to handle the situation in which the value of $L_i$ is unknown. If we choose any positive constant $\Delta > 0$, then by elementary algebra the inequalities $(1/\rho) - L_i \ge \Delta$ and $\rho \le 1/(L_i + \Delta)$ are equivalent. Therefore, if we select some positive $\rho \le 1/(L_i + \Delta)$, we have from (11) that

$$
\langle z^k - x_i^k,\, y_i^k - w_i^k \rangle \ge \Delta\, \| z^k - x_i^k \|^2, \tag{12}
$$

which implies the key properties we need for the convergence proofs. Therefore we may start with any $\rho = \rho^0 > 0$, and repeatedly halve $\rho$ until (12) holds; in Section 4.1 below, we bound the number of halving steps required. In general, each trial value of $\rho$ requires one application of the Lipschitz continuous operator $T_i$. However, for the case of affine operators $T_i$, we will show that it is possible to compute a stepsize such that (12) holds with a total of only two applications of the operator. By contrast, most backtracking procedures in optimization algorithms require evaluating the objective function at each new candidate point, which in turn usually requires an additional matrix multiply operation in the quadratic case [3].
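The halving procedure can be sketched as below. This is a simplified illustration under stated assumptions, not the linesearch analyzed later in the paper: the affine operator, the starting stepsize, the value of $\Delta$, the cap `max_halvings`, and the function name `backtrack` are all hypothetical. The operator value at $z$ is computed once and reused, so each trial stepsize costs one further operator evaluation.

```python
# Hypothetical monotone affine operator T(x) = M x + B on R^2.
M = ((1.0, 2.0), (-2.0, 1.0))
B = (0.5, -1.0)

def T(x):
    return tuple(sum(M[r][c] * x[c] for c in range(2)) + B[r] for r in range(2))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def backtrack(z, w, rho0, delta, max_halvings=60):
    """Halve rho until condition (12) holds; returns (rho, x, y, trials)."""
    Tz = T(z)  # evaluated once and shared by every trial stepsize
    rho = rho0
    for trial in range(1, max_halvings + 2):
        x = tuple(zi - rho * (ti - wi) for zi, ti, wi in zip(z, Tz, w))
        y = T(x)  # one operator evaluation per trial
        d = tuple(zi - xi for zi, xi in zip(z, x))
        if dot(d, tuple(yi - wi for yi, wi in zip(y, w))) >= delta * dot(d, d):
            return rho, x, y, trial
        rho *= 0.5
    raise RuntimeError("no acceptable stepsize found within max_halvings")

z, w = (1.0, -2.0), (0.3, 0.7)
rho, x, y, trials = backtrack(z, w, rho0=1.0, delta=0.5)
print(rho, trials)
```

The Lipschitz constant never appears in the code; the loop is guaranteed to stop once the trial stepsize drops to $1/(L_i + \Delta)$ or below, by the equivalence noted above.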

### 1.2 Summary of Contributions

The main thrust of the remainder of this paper is to incorporate the second, forward-step construction of $(x_i^k, y_i^k)$ above into an algorithm resembling those of [8, 16], allowing some operators to use backward steps, and others to use forward steps. Thus, projective splitting may become useful in a broad range of applications in which computing forward steps is preferable to computing or approximating proximal steps. The resulting algorithm inherits the asynchronous and block-iterative features of [8, 16]. It is worth mentioning that the stepsize constraints are unaffected by asynchrony: increasing the delays involved in communicating information between parts of the algorithm does not require smaller stepsizes. This contrasts with other asynchronous optimization and operator splitting algorithms [28, 27]. Another useful feature of the stepsizes is that they are allowed to vary across operators and iterations.

Like previous asynchronous projective splitting methods [16, 8], the asynchronous method developed here does not rely on randomization, nor is the algorithm formulated in terms of some fixed communication graph topology.

We will work with a slight restriction of problem (1), namely

$$
0 \in \sum_{i=1}^{n-1} G_i^* T_i(G_i z) + T_n(z). \tag{13}
$$

In terms of problem (1), we are simply requiring that $G_n$ be the identity operator and thus that $\mathcal{H}_n = \mathcal{H}_0$. This is not much of a restriction in practice, since one could redefine the last operator as $T_n \leftarrow G_n^* \circ T_n \circ G_n$, or one could simply append a new operator $T_n$ with $T_n(z) = \{0\}$ everywhere.

The principal reason for adopting a formulation involving the linear operators $G_i$ is that in many applications of (13) it may be relatively easy to compute the proximal step of $T_i$ but difficult to compute the proximal step of $G_i^* \circ T_i \circ G_i$. Our framework will include algorithms for (13) that may compute the proximal steps on $T_i$, forward steps when $T_i$ is Lipschitz continuous, and applications (“matrix multiplies”) of $G_i$ and $G_i^*$. An interesting feature of the forward steps in our method is that while the allowable stepsizes depend on the Lipschitz constants of the $T_i$ for $i \in \mathcal{I}_F$, they do not depend on the linear operator norms $\|G_i\|$, in contrast with primal-dual methods [6, 12, 33]. Furthermore, as mentioned, the stepsizes used for each operator can be chosen independently and may vary by iteration.

We also suggest a greedy heuristic for selecting operators in block-iterative splitting, based on a simple proxy. Augmenting this heuristic with a straightforward safeguard allows one to retain all of the convergence properties of the main algorithm. The heuristic is not specifically tied to the use of forward steps and also applies to the earlier algorithms in [8, 16]. The numerical experiments in Section 5 below attest to its usefulness.

## 2 Mathematical Preliminaries

### 2.1 Notation

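Summations of the form $\sum_{i=1}^{n-1} a_i$ for some collection $\{a_i\}$ will appear throughout this paper. To deal with the case $n = 1$, we use the standard convention that $\sum_{i=1}^{0} a_i = 0$. To ease the mathematical presentation, we use the following notation throughout the rest of the paper:

$$
\begin{aligned}
G_n : \mathcal{H}_n \to \mathcal{H}_n &\triangleq I \quad \text{(the identity operator)} \\
(\forall k \in \mathbb{N}) \quad w_n^k &\triangleq -\sum_{i=1}^{n-1} G_i^* w_i^k.
\end{aligned} \tag{14}
$$

Note that when $n = 1$, $w_1^k = 0$. We will use a boldface $\mathbf{w} = (w_1, \ldots, w_{n-1})$ for elements of $\mathcal{H}_1 \times \cdots \times \mathcal{H}_{n-1}$.

Throughout, we will simply write $\|\cdot\|_i = \|\cdot\|$ as the norm for $\mathcal{H}_i$ and let the subscript be inferred from the argument. In the same way, we will write $\langle \cdot, \cdot \rangle_i$ as $\langle \cdot, \cdot \rangle$ for the inner product of $\mathcal{H}_i$. For the collective primal-dual space defined in Section 2.3 we will use a special norm and inner product with its own subscript.

For any maximal monotone operator $A$ we will use the notation $\operatorname{prox}_{\rho A} = (I + \rho A)^{-1}$, for any scalar $\rho > 0$, to denote the proximal operator, also known as the backward or implicit step with respect to $A$. This means that

$$
x = \operatorname{prox}_{\rho A}(a) \quad\Longrightarrow\quad \exists\, y \in Ax : x + \rho y = a.
$$

The $x$ and $y$ satisfying this relation are unique. Furthermore, $\operatorname{prox}_{\rho A}$ is defined everywhere and $\operatorname{range}(\operatorname{prox}_{\rho A}) = \operatorname{dom}(A)$ [2, Prop. 23.2].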
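As a small illustration of the proximal relation, the sketch below assumes the maximal monotone operator is the normal cone of a closed interval `[lo, hi]` on the real line, whose resolvent is the projection onto the interval for every positive stepsize. The interval, the sample inputs, and the function names are illustrative assumptions. It recovers the pair `(x, y)` and checks both `y in A(x)` and `x + rho*y = a`.

```python
def prox_normal_cone(a: float, lo: float, hi: float) -> float:
    """Resolvent of the normal cone of [lo, hi]: projection onto the interval."""
    return min(max(a, lo), hi)

def in_normal_cone(y: float, x: float, lo: float, hi: float) -> bool:
    """Membership test for the normal cone of [lo, hi] at x (assumes lo < hi)."""
    if x == lo:
        return y <= 0.0
    if x == hi:
        return y >= 0.0
    return y == 0.0

lo, hi, rho = -1.0, 1.0, 0.5

# A point outside the interval: the prox lands on the boundary.
a = 2.5
x = prox_normal_cone(a, lo, hi)
y = (a - x) / rho  # the unique y with x + rho*y = a

# A point inside the interval: the prox leaves it unchanged and y = 0.
a_in = 0.25
x_in = prox_normal_cone(a_in, lo, hi)
y_in = (a_in - x_in) / rho
print(x, y, x_in, y_in)
```

Every output of `prox_normal_cone` lies in `[lo, hi]`, the domain of the operator, consistent with the range property quoted from [2, Prop. 23.2].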

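We use the standard “$\rightharpoonup$” notation to denote weak convergence, which is of course equivalent to ordinary convergence in finite-dimensional settings.

The following basic result will be used several times in our proofs: For any vectors $v_1, \ldots, v_n$,

$$
\Big\| \sum_{i=1}^{n} v_i \Big\|^2 \le n \sum_{i=1}^{n} \| v_i \|^2.
$$

###### Proof.

$$
\Big\| \sum_{i=1}^{n} v_i \Big\|^2 = n^2 \Big\| \frac{1}{n} \sum_{i=1}^{n} v_i \Big\|^2 \le n^2 \cdot \frac{1}{n} \sum_{i=1}^{n} \| v_i \|^2 = n \sum_{i=1}^{n} \| v_i \|^2,
$$

where the inequality follows from the convexity of the function $\|\cdot\|^2$. ∎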

### 2.2 A Generic Linear Separator-Projection Method

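Suppose that $\mathcal{H}$ is a real Hilbert space with inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ and norm $\|\cdot\|_{\mathcal{H}}$. A generic linear separator-projection method for finding a point in some closed and convex set $\mathcal{S} \subseteq \mathcal{H}$ is given in Algorithm 1.

The update step of Algorithm 1 is the $\beta_k$-relaxed projection of $p^k$ onto the halfspace $\{p : \varphi_k(p) \le 0\}$ using the norm $\|\cdot\|_{\mathcal{H}}$. In other words, if $\hat{p}^k$ is the projection onto this halfspace, then the update is $p^{k+1} = (1 - \beta_k)\, p^k + \beta_k\, \hat{p}^k$. Note that we define the gradient $\nabla \varphi_k$ with respect to the inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$, meaning we can write

$$
(\forall p, \tilde{p} \in \mathcal{H}) : \quad \varphi_k(p) = \langle \nabla \varphi_k,\, p - \tilde{p} \rangle_{\mathcal{H}} + \varphi_k(\tilde{p}).
$$

We will use the following well-known properties of algorithms fitting the template of Algorithm 1; see for example [7, 17]: Suppose $\mathcal{S}$ is closed and convex. Then for Algorithm 1,

1. The sequence $\{p^k\}$ is bounded.

2. $\| p^k - p^{k+1} \|_{\mathcal{H}} \to 0$;

3. If all weak limit points of $\{p^k\}$ are in $\mathcal{S}$, then $\{p^k\}$ converges weakly to some point in $\mathcal{S}$.

Note that we have not specified how to choose the affine function $\varphi_k$. For our specific application of the separator projector framework, we will do so in Section 3.2.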
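The relaxed-projection template can be illustrated with a toy sketch. Everything specific here is an assumption made for illustration: the target set is a small polyhedron in $\mathbb{R}^2$ described by three linear inequalities, the separator at each iteration is simply the most violated inequality (an affine function that is nonpositive on the set), and the relaxation parameter is held fixed at 1.2. This is not the separator used by the paper's algorithm, only an instance of the generic update $p^{k+1} = p^k - \beta_k \max\{0, \varphi_k(p^k)\}\, \nabla\varphi_k / \|\nabla\varphi_k\|^2$.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical target set S = {p in R^2 : a_j . p <= c_j for every j}.
CONSTRAINTS = [((1.0, 0.0), 1.0), ((0.0, 1.0), 1.0), ((-1.0, -1.0), -1.0)]

def most_violated(p):
    """Return (phi(p), grad phi) for the affine function a_j . p - c_j of largest value."""
    return max(((dot(a, p) - c, a) for a, c in CONSTRAINTS), key=lambda t: t[0])

def separator_projection(p, beta=1.2, iters=500):
    """Relaxed projections onto the halfspaces {phi_k <= 0}, with 0 < beta < 2."""
    for _ in range(iters):
        value, grad = most_violated(p)
        step = beta * max(0.0, value) / dot(grad, grad)
        p = tuple(pi - step * gi for pi, gi in zip(p, grad))
    return p

p_final = separator_projection((5.0, -4.0))
max_violation = max(dot(a, p_final) - c for a, c in CONSTRAINTS)
print(p_final, max_violation)
```

When the current point already lies in the halfspace, `max(0.0, value)` makes the step zero, so feasible points are left unchanged.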

### 2.3 Main Assumptions Regarding Problem (13)

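Let $\boldsymbol{\mathcal{H}} = \mathcal{H}_0 \times \mathcal{H}_1 \times \cdots \times \mathcal{H}_{n-1}$ and $\mathcal{H}_n = \mathcal{H}_0$. Define the extended solution set or Kuhn-Tucker set of (13) to be

$$
\mathcal{S} = \Big\{ (z, w_1, \ldots, w_{n-1}) \in \boldsymbol{\mathcal{H}} \;\Big|\; w_i \in T_i(G_i z),\; i = 1, \ldots, n-1,\;\; -\sum_{i=1}^{n-1} G_i^* w_i \in T_n(z) \Big\}. \tag{15}
$$

Clearly $z \in \mathcal{H}_0$ solves (13) if and only if there exists $\mathbf{w} \in \mathcal{H}_1 \times \cdots \times \mathcal{H}_{n-1}$ such that $(z, \mathbf{w}) \in \mathcal{S}$. Our main assumptions regarding (13) are as follows. Problem (13) conforms to the following: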

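1. $\mathcal{H}_0 = \mathcal{H}_n$ and $\mathcal{H}_1, \ldots, \mathcal{H}_{n-1}$ are real Hilbert spaces.

2. For $i = 1, \ldots, n$, the operators $T_i : \mathcal{H}_i \to 2^{\mathcal{H}_i}$ are monotone.

3. For all $i$ in some subset $\mathcal{I}_F \subseteq \{1, \ldots, n\}$, the operator $T_i$ is $L_i$-Lipschitz continuous (and thus single-valued) and $\operatorname{dom}(T_i) = \mathcal{H}_i$.

4. For $i \in \mathcal{I}_B \triangleq \{1, \ldots, n\} \setminus \mathcal{I}_F$, the operator $T_i$ is maximal and the map $\operatorname{prox}_{\rho T_i}$ can be computed to within the error tolerance specified below in Assumption 3.4 (however, these operators are not precluded from also being Lipschitz continuous).

5. Each $G_i : \mathcal{H}_0 \to \mathcal{H}_i$ for $i = 1, \ldots, n-1$ is linear and bounded.

6. The solution set $\mathcal{S}$ defined in (15) is nonempty.

Suppose Assumption 15 holds. The set $\mathcal{S}$ defined in (15) is closed and convex.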

###### Proof.

We first remark that for $i \in \mathcal{I}_F$ the operators $T_i$ are maximal by [2, Proposition 20.27], so $T_1, \ldots, T_n$ are all maximal monotone. The claimed result is then a special case of [5, Proposition 2.8(i)] with the following change of notation:

| Notation here | Notation in [5] |
|---|---|
| $T_n$ | $A$ (a maximal monotone operator) |
| $(x_1, \ldots, x_{n-1}) \mapsto T_1 x_1 \times \cdots \times T_{n-1} x_{n-1}$ | $B$ (a maximal monotone operator) |
| $z \mapsto (G_1 z, \ldots, G_{n-1} z)$ | $L$ (a bounded linear operator) |

## 3 Our Algorithm and Convergence

### 3.1 Algorithm Definition

Algorithm 2 is our asynchronous block-iterative projective splitting algorithm with forward steps for solving (13). It is essentially a special case of the weakly convergent algorithm of [8], except that we use the new forward step procedure to deal with the Lipschitz continuous operators $T_i$ for $i \in \mathcal{I}_F$, instead of exclusively using proximal steps. For our separating hyperplane we use a special case of the formulation of [8], which is slightly different from the one used in [16]. Our method can be reformulated to use the same hyperplane as [16]; however, this requires that it be computationally feasible to project on the subspace given by the equation $\sum_{i=1}^{n} G_i^* w_i = 0$.

The algorithm has the following parameters:

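• For each iteration $k \ge 1$, a subset $I_k \subseteq \{1, \ldots, n\}$.

• For each $k \ge 1$ and $i = 1, \ldots, n$, a positive scalar stepsize $\rho_i^k$.

• For each iteration $k \ge 1$ and $i = 1, \ldots, n$, a delayed iteration index $d(i, k) \in \{1, \ldots, k\}$ which allows the subproblem calculations to use outdated information.

• For each iteration $k \ge 1$, an overrelaxation parameter $\beta_k \in [\underline{\beta}, \overline{\beta}]$ for some constants $0 < \underline{\beta} \le \overline{\beta} < 2$.

• A scalar $\gamma > 0$ which controls the relative emphasis on the primal and dual variables in the projection update; see (16) in Section 3.2 for more details.

• Sequences of errors $\{e_i^k\}_{k \ge 1}$ for $i \in \mathcal{I}_B$, allowing us to model inexact computation of the proximal steps.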

There are many ways in which Algorithm 2 could be implemented in various parallel computing environments; a specific suggestion for asynchronous implementation of a closely related class of algorithms is developed in [16, Section 3]. One simple option is a centralized or “master-slave” implementation in which the subproblem calculations (the backward and forward steps) are implemented on a collection of “worker” processors, while the remainder of the algorithm, most notably the coordination process embodied by the projection update, is executed by a single coordinating processor. However, such a simple implementation risks the coordinating processor becoming a serial bottleneck as the number of worker processors grows or the memory required to store the algorithm's vectors becomes large, since the amount of work required to execute the coordination calculations is proportional to the total number of elements in those vectors. Fortunately, all but a constant amount of the work in the coordination calculations involves only sums, inner products, and matrix multiplies by $G_i$ and $G_i^*$. Summation and hence inner product operations can be efficiently distributed over multiple processors. Therefore, with some care exercised as to where one performs the matrix multiplications in cases in which the $G_i$ are nontrivial, the coordination calculations may be distributed over multiple processors so that the coordination process need not constitute a serial bottleneck.

In the form directly presented in Algorithm 2, the delay indices $d(i, k)$ may seem unmotivated; it might seem best to always select $d(i, k) = k$. However, these indices can play a critical role in modeling asynchronous parallel implementation. In the simple “master-slave” scheme described above, for example, the “master” might dispatch subproblems to worker processors, but not receive the results back immediately. In the meantime, other workers may report back results, which the master could incorporate into its projection calculations. In this context, $k$ counts the number of projection operations performed at the master, and $I_k$ is the set of subproblems whose solutions reached the master between iterations $k - 1$ and $k$. For each $i \in I_k$, $d(i, k)$ is the index of the iteration completed just before subproblem $i$ was last dispatched for solution. In more sophisticated parallel implementations, $I_k$ and $d(i, k)$ would have similar interpretations.

We now start our analysis of the weak convergence of the iterates of Algorithm 2 to a solution of problem (13). While the overall proof strategy is similar to [16], considerable innovation is required to incorporate the forward steps.

### 3.2 The Hyperplane

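In this section, we define the affine function our algorithm uses to construct a separating hyperplane. Let $p = (z, \mathbf{w}) = (z, w_1, \ldots, w_{n-1})$ be a generic point in $\boldsymbol{\mathcal{H}}$, the collective primal-dual space. For $\boldsymbol{\mathcal{H}}$, we adopt the following norm and inner product for some $\gamma > 0$:

$$
\begin{aligned}
\| (z, \mathbf{w}) \|_\gamma^2 &= \gamma \| z \|^2 + \sum_{i=1}^{n-1} \| w_i \|^2, \\
\big\langle (z^1, \mathbf{w}^1),\, (z^2, \mathbf{w}^2) \big\rangle_\gamma &= \gamma \langle z^1, z^2 \rangle + \sum_{i=1}^{n-1} \langle w_i^1, w_i^2 \rangle.
\end{aligned} \tag{16}
$$

Define the following function generalizing (4) at each iteration $k \ge 1$:

$$
\varphi_k(p) = \sum_{i=1}^{n-1} \langle G_i z - x_i^k,\, y_i^k - w_i \rangle + \Big\langle z - x_n^k,\; y_n^k + \sum_{i=1}^{n-1} G_i^* w_i \Big\rangle, \tag{17}
$$

where the $(x_i^k, y_i^k)$ are chosen so that $y_i^k \in T_i x_i^k$ for $i = 1, \ldots, n$ (recall that each inner product is for the corresponding Hilbert space $\mathcal{H}_i$). This function is a special case of the separator function used in [8]. The following lemma proves some basic properties of $\varphi_k$; similar results are in [1, 8, 16] in the case $\gamma = 1$.

Let $\varphi_k$ be defined as in (17). Then:

1. $\varphi_k$ is affine on $\boldsymbol{\mathcal{H}}$.

2. With respect to the inner product $\langle \cdot, \cdot \rangle_\gamma$ on $\boldsymbol{\mathcal{H}}$, the gradient of $\varphi_k$ is

$$
\nabla \varphi_k = \Big( \frac{1}{\gamma} \Big( \sum_{i=1}^{n-1} G_i^* y_i^k + y_n^k \Big),\; x_1^k - G_1 x_n^k,\; x_2^k - G_2 x_n^k,\; \ldots,\; x_{n-1}^k - G_{n-1} x_n^k \Big).
$$

3. Suppose Assumption 15 holds and that $y_i^k \in T_i x_i^k$ for $i = 1, \ldots, n$. Then $\varphi_k(p) \le 0$ for all $p \in \mathcal{S}$ defined in (15).

4. If Assumption 15 holds, $y_i^k \in T_i x_i^k$ for $i = 1, \ldots, n$, and $\nabla \varphi_k = 0$, then $(x_n^k, y_1^k, \ldots, y_{n-1}^k) \in \mathcal{S}$.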
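The properties listed above can be checked numerically on a tiny instance. The sketch below is an illustration under assumptions that are not taken from the paper: $n = 2$, $\mathcal{H}_0 = \mathbb{R}^2$, $\mathcal{H}_1 = \mathbb{R}$, $G_1 = (1, 1)$ as a row vector, $T_1 t = t$, $T_2 z = z - c$ with $c = (3, 3)$, and $\gamma = 2$. For this instance the inclusion has solution $z^* = (1, 1)$ with dual variable $w_1^* = 2$. The code evaluates the separator (17), forms the gradient from the formula in item 2, and checks the affine identity in the $\gamma$-inner product of (16) along with nonpositivity at the Kuhn-Tucker point.

```python
GAMMA = 2.0
G = (1.0, 1.0)   # hypothetical G_1 : R^2 -> R, stored as a row vector
C = (3.0, 3.0)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def G_apply(z):            # G_1 z
    return dot(G, z)

def G_adj(t):              # G_1^* t
    return (G[0] * t, G[1] * t)

def T1(t):                 # hypothetical operator on H_1 = R
    return t

def T2(z):                 # hypothetical operator on H_0 = R^2
    return (z[0] - C[0], z[1] - C[1])

def phi(p, x1, y1, x2, y2):
    """Separator (17) for n = 2; p = (z, w1)."""
    z, w1 = p
    gw = G_adj(w1)
    first = (G_apply(z) - x1) * (y1 - w1)
    second = dot((z[0] - x2[0], z[1] - x2[1]), (y2[0] + gw[0], y2[1] + gw[1]))
    return first + second

def grad_phi(x1, y1, x2, y2):
    """Gradient with respect to the gamma-inner product, as in item 2."""
    gy = G_adj(y1)
    return (((gy[0] + y2[0]) / GAMMA, (gy[1] + y2[1]) / GAMMA), x1 - G_apply(x2))

def inner_gamma(p, q):
    """Inner product (16) on the primal-dual space."""
    return GAMMA * dot(p[0], q[0]) + p[1] * q[1]

x1, x2 = 0.5, (2.0, -1.0)
y1, y2 = T1(x1), T2(x2)
g = grad_phi(x1, y1, x2, y2)

p, q = ((0.3, -1.2), 0.8), ((-2.0, 0.4), 1.5)
p_minus_q = ((p[0][0] - q[0][0], p[0][1] - q[0][1]), p[1] - q[1])
affine_gap = abs(phi(p, x1, y1, x2, y2) - phi(q, x1, y1, x2, y2)
                 - inner_gamma(g, p_minus_q))

p_star = ((1.0, 1.0), 2.0)   # Kuhn-Tucker point of this toy instance
value_at_solution = phi(p_star, x1, y1, x2, y2)
print(affine_gap, value_at_solution)
```

For these particular numbers the value at the Kuhn-Tucker point works out to $-7.25$, which equals $-(x_1 - G_1 z^*)^2 - \|z^* - x_2\|^2$, as monotonicity of the two operators predicts.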

###### Proof.

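To see that $\varphi_k$ is affine, rewrite (17) as

$$
\begin{aligned}
\varphi_k(z, \mathbf{w})
&= \sum_{i=1}^{n-1} \langle G_i z,\, y_i^k - w_i \rangle - \sum_{i=1}^{n-1} \langle x_i^k,\, y_i^k - w_i \rangle + \Big\langle z,\; y_n^k + \sum_{i=1}^{n-1} G_i^* w_i \Big\rangle - \Big\langle x_n^k,\; y_n^k + \sum_{i=1}^{n-1} G_i^* w_i \Big\rangle \\
&= \sum_{i=1}^{n-1} \langle z,\, G_i^* (y_i^k - w_i) \rangle + \sum_{i=1}^{n-1} \langle w_i,\, x_i^k \rangle - \sum_{i=1}^{n} \langle x_i^k,\, y_i^k \rangle + \Big\langle z,\; y_n^k + \sum_{i=1}^{n-1} G_i^* w_i \Big\rangle - \sum_{i=1}^{n-1} \langle w_i,\, G_i x_n^k \rangle \\
&= \Big\langle z,\; \sum_{i=1}^{n-1} G_i^* y_i^k + y_n^k \Big\rangle + \sum_{i=1}^{n-1} \langle w_i,\, x_i^k - G_i x_n^k \rangle - \sum_{i=1}^{n} \langle x_i^k,\, y_i^k \rangle.
\end{aligned} \tag{18}
$$

It is now clear that $\varphi_k$ is an affine function of $p = (z, \mathbf{w})$. Next, fix an arbitrary $\tilde{p} \in \boldsymbol{\mathcal{H}}$. Using the fact that $\varphi_k$ is affine, we may write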

 φk(p) = \innerp−~p∇φkγ+φk(~p)=⟨p,∇φk⟩γ+φk(~p)−⟨~p,∇φk⟩γ = γ⟨z,∇zφk⟩+n−1∑i=1⟨wi,∇wiφ