In this paper we consider the Douglas-Rachford (DR) method to solve the problem of finding a zero of the sum of two maximal monotone operators, i.e., solving:
where are two (possibly multivalued) maximal monotone operators from a Hilbert space into itself .
The DR method originated from  and was initially proposed to solve the discretization of stationary and non-stationary heat equations where the involved monotone operators are linear (namely, the discretization of second derivatives in different spatial directions, for example and ). The iteration uses resolvents ( is the identity map) and , and from the original paper [15, Eq. (7.4), (7.5)] one can extract the iteration
where is a given stepsize. This iteration scheme also makes sense for general maximal monotone operators as soon as is single-valued. It has been observed in  that the iteration can be rewritten for arbitrary maximally monotone operators by substituting and using the identity
Comparing (2) and (4), we see that (4) does not require to evaluate , which avoids assuming that is single-valued as in (2). Otherwise, is not uniquely defined. While in (2) converges to a solution of (1), in (4) is just an intermediate sequence converging to such that is a solution of (1). Therefore, (2) gives us a convenient form to study the DR method in the framework of fixed-point theory. Note that the iterations (2) and (4) are equivalent in the stationary case, but they are not equivalent in the non-stationary case, i.e., when the stepsize varies along the iterations; we will shed more light on this later in Section 2.2.
From a practical point of view, the performance of a DR scheme mainly depends on the following two aspects:
Good stepsize : It seems to be generally acknowledged that the choice of the stepsize is crucial for the algorithmic performance of the method but a general rule to choose the stepsize seems to be missing [2, 16]. So far, convergence theory of DR methods provides some theoretical guidance to select the parameter in a given range in order to guarantee convergence of the method. Such a choice is often globally fixed for all iterations, and does not take into account local structures of the underlying operators and . Moreover, the global convergence rate of the DR method is known to be under only monotonicity assumption, but often using an averaging sequence [13, 14, 24], where is the iteration counter. Several experiments have shown that DR methods have better practical rate than its theoretical bound  by using the last iterate (i.e. not an averaging sequence). In the special case of convex optimization problems, the Douglas-Rachford method is equivalent to the alternating direction methods of multipliers (ADMM) (see, e.g. the recent  for a short historical account) and there a several proposals for dynamic stepsizes for ADMM [25, 28, 37, 41] but we are not aware of a method that applies to DR in the case of monotone operators. The recent work  provides explicit choices for constant stepsizes in cases where the monotone operator posses further properties.
Proper metric: Since the DR method is not invariant as the Newton method, the choice of metric and preconditioning seems to be crucial to accelerate its performance. Some researchers have been studying this aspect from different views, see, e.g., [10, 8, 19, 20, 23, 34]. Clearly, the choice of a metric and a preconditioner also affects the choice of the stepsize.
Note that a metric choice often depends on the variant of methods, while the choice of stepsize depends on the problem structures such as the strongly monotonicity parameters and the Lipschitz constants . In general cases, where and are only monotone, we only have a general rule to select the parameter to obtain its sublinear convergence rate [14, 16, 24]. This stepsize depends on the mentioned global parameters only and does not adequately capture the local structure of and to adaptively update . For instance, a linesearch procedure to evaluate a local Lipschitz constant for computing stepsize in first-order methods can beat the optimal stepsize using global Lipschitz constant , or a Barzilai-Borwein stepsize in gradient descent methods essentially exploits local curvature of the objective function to obtain a good performance.
We prove the convergence of a new version of the non-stationary Douglas-Rachford method for the case where both operators are merely maximally monotone. Moreover, we propose a very simple adaptive step-size rule and demonstrate that this rule does improve convergence in practical situations. We also transfer our results to the case of ADMM and obtain a new adaptive rule that outperforms previously known adaptive ADMM methods and also does have a convergence guarantee. Our step-size rule is relatively simple and does not incur significantly computational effort rather than the norm of two vectors. Our stepsize rule has a theoretical convergence guarantee.
Paper organization: We begin with an analysis of the case of linear monotone operators in section 2, analyze the convergence of the non-stationary form of the iteration (2), i.e. the form where varies with , in section 3, and then propose adaptive stepsize rules in section 4. Section 5 extends the analysis to non-stationary ADMM. Finally, section 6 provides several numerical experiments for the DR scheme and ADMM using our new stepsize rule.
1.1 State of the art
There are several results on the convergence of the iteration (4). The seminal paper  showed that, for any positive stepsize , the iteration map in (4) is firmly nonexpansive, that the sequence weakly converges to a fixed point of the iteration map [29, Prop. 2] and that, weakly converges to a solution of the inclusion (1) as soon as , and are maximal monotone [29, Theorem 1]. In the case where is coercive and Lipschitz continuous, linear convergence was also shown in [29, Proposition 4]. These results have been extended in various ways. Let us attempt to summarize some key contributions on the DR method. Eckstein and Bertsekas in  showed that the DR scheme can be cast into a special case of the proximal point method . This allows the authors to exploit inexact computation from the proximal point method . They also presented a connection between the DR method and the alternating direction method of multipliers (ADMM). In  Svaiter proved a weak convergence of the DR method in Hilbert space without the assumption that is maximal monotone and the prove have been simplified in . Combettes  cast the DR method as special case of the averaging operator from a fixed-point framework. Applications of the DR method have been studied in . The convergence rate of the DR method was first studied in  for the strongly monotone case, while the sublinear rate was then proved in . A more intensive research on convergence rates of the DR methods can be found in [13, 14, 30, 31]. The DR method has been extended to accelerated variant in  but specifying for a special setting. In  the authors analyzed a non-stationary DR method derived from (4) in the framework of perturbations of non-expansive iterations and showed convergence for convergent stepsize sequences with summable errors.
The DR method together with its dual variant, ADMM, become extremely popular in recent years due to a wide range of applications in image processing, and machine learning[6, 26], which are unnecessary to recall them all here.
In terms of stepsize selection for DR schemes as well as for ADMM methods, it seems that there is very little work available from the literature. Some general rules for fixed stepsizes based on further properties of the operators such as strong monotonicity, Lipschitz continuity, and coercivity are given in [18, 30]
, and it is shown that the resulting linear rates are tight. Heuristic rules for fixed stepsizes motivated by quadratic problems are derived in. A self-adaptive stepsize for ADMM proposed in  seems to be one of the first work in this direction. The recent works [41, 42]
also proposed an adaptive update rule for stepsize in ADMM based on a spectral estimation. Some other papers rely on theoretical analysis to choose optimal stepsize such as, but it only works in the quadratic case. In , the authors proposed a nonincreasing adaptive rule for the penalty parameter in ADMM. Another update rule for ADMM can be found in . While ADMM is a dual variant of the DR scheme, we unfortunately have not seen any work that converts such an adaptive step-size from ADMM to the DR scheme where the more general case of monotone operators can be handled. In addition, the adaptive step-size for the DR scheme by itself seems to not exist in the literature.
1.2 A motivating linear example
While the Douglas-Rachford iteration (weakly) converges for all positive stepsizes , it seems to be folk wisdom, that there is a “sweet spot” for the stepsize which leads to fast convergence. We illustrate this effect with a simple linear example. We consider a linear equation
where are two matrices of the size with .
We choose symmetric positive definite matrices with and such that has full rank,
and thus the equation has zero as its unique solution.111The exact construction of and is and , where and are drawn from the standard Gaussian distribution in Matlab.
are drawn from the standard Gaussian distribution in Matlab.Since is single-valued, we directly use the iteration (2).
Note that the shift would allow to treat inhomogeneous equation . If is a solution of this equation, then one sees that iteration (2) applied to is equivalent to applying the iteration to but for the residual .
A not too small stepsize ( in this case) leads to good progress in the beginning, but slows down considerably in the end.
Large stepsizes (larger than in this case) are slower in the beginning and tend to produce non-monotone decrease of the error.
Somewhere in the middle, there is a stepsize which performs much better than the small and large stepsizes.
In this particular example the stepsize greatly outperforms the other stepsizes. On the right of Figure 1 we show the norm of the residual after a fixed number of iterations for varying stepsizes. One can see that there is indeed a sweet spot for the stepsizes around . Note that the value of is by no means universal and this sweet spot of varies with the problem size, with the ranks of and , and even for each particular instance of this linear example.
2 Analysis of the linear monotone inclusion
In order to develop an adaptive stepsize for our non-stationary DR method, we first consider the linear problem instance of (1). We consider the original DR scheme (2) instead of (4) since (2) generates the sequence which converges to a solution of (1), while the sequence computed by (4) does not converge to a solution and its limit does depend on the stepsize in general.
2.1 The construction of adaptive stepsize for single-valued operator
When both and are linear, the DR scheme (2) can be expressed as a fixed-point iteration scheme of the following mapping:
Recall that, by Remark 1.1, all of this section also applies not only to problem but also problem . The notion of a monotone operator has a natural equivalence for matrices, which is, however, not widely used. Hence, we recall that a matrix is called monotone, if, for all , it holds that . Note that any symmetric positive semidefinite (spd) matrix is monotone, but a monotone matrix is not necessarily spd. Examples of a monotone matrices that are not spd are
The first matrix is skew symmetric, i.e.,and any such matrix is monotone. Note that even if and are spd (as in our example in Section 1.2), the iteration map in (6) is not even symmetric. Consequently, the asymptotic convergence rate of the iteration scheme (2) is not governed by the norm of but by its spectral radius
, which is the largest magnitude of an eigenvalue of(cf. [22, Theorem 11.2.1]
). Moreover, the eigenvalues and eigenvectors ofare complex in general.
First, it is clear from the derivation of
that the eigenspace offor the eigenvalue exactly consists of the solutions of .
In the following, for any (the set of complex numbers) and , we denote by the ball of radius centered at . We estimate the eigenvalues of that are different from .
Let be monotone, and be defined by (6). Let be an eigenvalue of with corresponding eigenvector . Assume that and define by
Then, we have and
Note that for a real, linear, and monotone map , and a complex vector , it holds that and thus, . This shows that .
We can see from (6) that any pair of eigenvalue and eigenvector of fulfills
Now, if we denote , then this expression becomes
which, by rearranging, leads to
Hence, by monotonicity of , we can derive from the above relation that
This leads to
Denoting , the last expression reads as
Recalling the definition of in (7), we get
This is equivalent to
which, in turn, is equivalent to . Adding to both sides, it leads to , which shows the desired estimate. ∎
In general, the eigenvalues of depend on in a complicated way. For , we have and hence, all eigenvalues are equal to one. For growing , some eigenvalues move into the interior of the circle and for , it seems that all eigenvalues tend to converge to the boundary of such a circle, see Figure 2 for an illustration of eigenvalue distribution.
It appears that Lemma 2.1 is related to Proposition 4.10 of  and also to the fact that the iteration mapping is (in the general nonlinear case) known to be not only non-expansive, but firmly non-expansive (cf. [16, Lemma 1] and [16, Figure 1]). In general, firmly non-expansiveness allows over-relaxation of the method, and indeed, one can also easily see this in the linear case as well: If is an eigenvalue of , then it lies in (when it is not equal to one) and the corresponding eigenvalue of the relaxed iteration map
is and lies in . Therefore, for all eigenvalues different from one of the relaxed iteration
lie in a circle of radius centered at , and hence, the iteration is still non-expansive. It is know that relaxation can speed up convergence, but we will not investigate this in this paper.
Lemma 2.1 tells us a little more than that all eigenvalues of the iteration map lie in a circle centered at of radius . Especially, all eigenvalues except for have magnitude strictly smaller than one if for all corresponding eigenvectors . This implies that the iteration map is indeed asymptotically contracting outside the set of solutions of (1). This proves that the stationary iteration converges to a zero point of the map at a linear rate. Note that this does not imply the convergence in the non-stationary case.
To optimize the convergence speed, we aim at minimizing the spectral radius of , which is the magnitude of the largest eigenvalue of and there seems to be little hope to explicitly minimize this quantity.
Here is a heuristic argument based on Lemma 2.1, which we will use to derive an adaptive stepsize rule: Note that is increasing and hence, to minimize the upper bound on (more precisely: the distance of to ) we want to make from (7) as large as possible. This is achieved by minimizing the denominator of over which happens for
This gives and note that (which implies ). This motivates an adaptive choice for the stepsize as
in the Douglas-Rachford iteration scheme (2).
One can use the above derivation to deduce that is a good constant step-size. In fact, this is also the stepsize that gives the best linear rate derived in [29, Proposition 4], which is minimized when where is the Lipschitz constant of . However, this choice does not perform well in practice in our experiments.
Since little is known about the non-stationary Douglas-Rachford iteration in general (besides the result from  on convergent stepsizes with summable errors), we turn to an investigation of this method in Section 3. Before we do so, we generalize the heuristic stepsize to the case of multivalued .
2.2 The construction of adaptive stepsize for non-single-valued
In the linear case, the iteration (4) is given by the iteration matrix
i.e., the matrices and are similar and hence, have the same eigenvalues. Moreover, if is an eigenvector of with the eigenvalue , then is an eigenvector of for the same eigenvalue .
However, in the case of the iteration (4) we do not assume that is single-valued, and thus, the adaptive stepsize using the quotient cannot be used. However, again due to (3), we can rewrite this quotient without applying and get, with , that
Note that the two iteration schemes (2) and (4) are not equivalent in the non-stationary and non-linear case. Indeed, let us consider such that . By induction, we have . Substituting into (2), we obtain
From (3) we have
Substituting and into (10), we obtain
Updating by (9) would then give
In summary, we can write an alternative DR scheme for solving (1) as
This scheme essentially has the same per-iteration complexity as in the standard DR method since the computation of does not significantly increase the cost.
3 Convergence of the non-stationary DR method
In this section, we prove weak convergence of the new non-stationary scheme (11). We follow the approach by [38, 39] and restate the DR iteration as follows: Given such that and a sequence , at each iteration , we iterate
Note that, in the case of single-valued , this iteration reduces to
Below are some consequences which we will need in our analysis:
Before proving our convergence result, we state the following lemma.
Let , , and be three nonnegative sequences, and be a bounded sequence such that for :
If , then and are bounded.
If , then
If , then and
By the assumption that and, without loss of generality, we assume that the latter term is positive (which is fulfilled for large enough, because ). Thus, in both cases, we can show that
Recursively, we get
Under the assumption , we have for some and all . Then, we have . This shows that is bounded. Since , , and are all nonnegative, it implies that and are bounded. ∎
Theorem 3.1 (Convergence of non-stationary DR).
The proof of this theorem follows the proof of [39, Theorem 1]. First, we observe that, for any , we have
From this and (14) it follows that
We see from (16) that
and using Lemma 3.1 with , and , we can conclude that both sequences and are bounded.
The first line gives
Summing this inequality from to , we get
Now, since is bounded and it holds that
by our assumption, we can conclude that
i.e., by (15), we have
This expression shows that and are also bounded. Due to the boundedness of , we conclude the existence of weak convergence subsequences and such that
and by the above limits, we also have
Since is bounded and , this shows that the sequence is quasi-Fejer convergent to the extended solution set with respect to the distance . Thus, similar to the proof of [39, Theorem 1], we conclude that the whole sequence weakly converges to an element of . ∎
4 An adaptive step-size for DR methods
The step-size suggested by (8) or by (9) is derived from our analysis of a linear case and it does not guarantee the convergence in general. In this section, we suggest modifying this step-size so that we can prove the convergence of the DR scheme. We build our adaptive step-size based on two insights:
Theorem 3.1 ensures the convergence of the non-stationary DR-iteration as soon as the stepsize sequence is convergent with summable increments.
However, the sequences (18) and (19) are not guaranteed to converge (and numerical experiments indicate that, indeed, divergence may occur). Here is a way to adapt the sequence (18) to produce a suitable stepsize sequence in the single-valued case:
Choose safeguards , a summable “conservation sequence” with and start with .
Let be the projection onto a box . We construct as
The following lemma ensures that this will lead to a convergent sequence .
Let be a bounded sequence, i.e., , and such that and . Then, the sequence defined by and
is in and converges to some and it holds that .
Obviously, and since is a convex combination of and , one can easily see that obeys the same bounds as <