In this paper, we are interested in additive Schwarz methods for a general convex optimization problem
where is a reflexive Banach space, is a Fréchet differentiable convex function, and is a proper, convex, and lower semicontinuous function that is possibly nonsmooth. We additionally assume that is coercive, so that Eq. 1 admits a solution .
The importance of studying Schwarz methods arises from both theoretical and computational viewpoints. It is well-known that various iterative methods such as block relaxation methods, multigrid methods, and domain decomposition methods can be interpreted as Schwarz methods, also known as subspace correction methods. Studying Schwarz methods can yield a unified understanding of these methods; there have been several notable works on the analysis of domain decomposition and multigrid methods for linear problems in the framework of Schwarz methods [18, 33, 34, 35]. The convergence theory of Schwarz methods has been developed for several classes of nonlinear problems as well [2, 4, 23, 31]. From the computational viewpoint, Schwarz methods are prominent numerical solvers for large-scale problems because they can efficiently utilize massively parallel computer architectures. There has been extensive research on Schwarz methods as parallel solvers for large-scale scientific problems of the form Eq. 1, e.g., nonlinear elliptic problems [12, 31], variational inequalities [5, 29, 30], and mathematical imaging problems [11, 14, 25].
An important concern in the research of Schwarz methods is the acceleration of algorithms. One of the most elementary relevant results is optimizing the relaxation parameters of Richardson iterations related to the Schwarz alternating method [33, section C.3]; observing that the Schwarz alternating method for linear elliptic problems can be viewed as a preconditioned Richardson method, one can optimize the relaxation parameters of Richardson iterations to achieve faster convergence as in [33, Lemma C.5]. Moreover, if one replaces Richardson iterations by conjugate gradient iterations with the same preconditioner, an improved algorithm with a faster convergence rate can be obtained. Such an idea of acceleration applies not only to linear problems but also to nonlinear problems. There have been some recent works on the acceleration of domain decomposition methods for several kinds of nonlinear problems: nonlinear elliptic problems , variational inequalities , and mathematical imaging problems [15, 16, 19]. In particular, in the author’s previous work , an accelerated additive Schwarz method that can be applied to the general convex optimization Eq. 1 was considered. Noticing that additive Schwarz methods for Eq. 1 can be interpreted as gradient methods , acceleration schemes such as momentum [6, 20] and adaptive restarting that were originally derived for gradient methods in the field of mathematical optimization were adopted.
In this paper, we consider another acceleration strategy called backtracking from the field of mathematical optimization for applications to additive Schwarz methods. Backtracking was originally considered as a method of line search for step sizes that ensures the global convergence of a gradient method [1, 6]. In some recent works on accelerated gradient methods [10, 20, 28], it was shown both theoretically and numerically that certain backtracking strategies can accelerate the convergence of gradient methods. By allowing the step size to increase and decrease adaptively along the iterations, backtracking can find a nearly optimal value for the step size that results in large energy decay, so that fast convergence is achieved. This acceleration property of backtracking resembles the relaxation parameter optimization for Richardson iterations mentioned above. Hence, as in the case of Richardson iterations for linear problems, one may expect that the convergence rate of additive Schwarz methods for Eq. 1 can be improved if an appropriate backtracking strategy is adopted. Unfortunately, applying the existing backtracking strategies such as [10, 20, 28] to additive Schwarz methods is not straightforward. The existing backtracking strategies require the computation of the underlying distance function of the gradient method. For usual gradient methods, the underlying distance function is simply the -norm of the solution space so that such a requirement does not matter. However, the underlying nonlinear distance function of additive Schwarz methods has a rather complex structure in general (see Eq. 7); this aspect makes direct applications of the existing strategies to additive Schwarz methods cumbersome.
This paper proposes a novel backtracking strategy for additive Schwarz methods that does not rely on the computation of the underlying distance function. As shown in Algorithm 2, the proposed strategy requires only evaluations of the energy functional. Hence, it can be easily implemented for additive Schwarz methods for Eq. 1 with any choice of local solvers. Acceleration properties of the proposed backtracking strategy can be analyzed mathematically; we present explicit estimates for the convergence rate of the method in terms of some averaged quantity estimated along the iterations. The proposed backtracking strategy has another interesting feature: since it accelerates the additive Schwarz method in a completely different manner from the momentum acceleration introduced in, both the momentum acceleration and the proposed backtracking strategy can be applied simultaneously to form a further accelerated method; see Algorithm 3. We present numerical results for various convex optimization problems of the form Eq. 1 to verify our theoretical results and highlight the computational efficiency of the proposed accelerated methods.
This paper is organized as follows. A brief summary of the abstract convergence theory of additive Schwarz methods for convex optimization presented in  is given in Section 2. In Section 3, we present and analyze a novel backtracking strategy for additive Schwarz methods as an acceleration scheme. A fast additive Schwarz method that combines the ideas of the momentum acceleration  and the proposed backtracking strategy is proposed in Section 4. Numerical results for various convex optimization problems are presented in Section 5. We conclude the paper with remarks in Section 6.
2 Additive Schwarz methods
In this section, we briefly review the abstract framework for additive Schwarz methods for the convex optimization problem Eq. 1 presented in . In what follows, an index runs from to . Let be a reflexive Banach space and let be a bounded linear operator such that
and its adjoint is surjective. For the sake of describing local problems, we define and as functionals defined on , which are proper, convex, and lower semicontinuous with respect to their first arguments. Local problems have the following general form:
where and . If we set
in Eq. 2, then the minimization problem is reduced to
which is the case of exact local problems. Here denotes the Bregman distance
We note that other choices of and , i.e., cases of inexact local problems, include various existing numerical methods such as block coordinate descent methods  and constraint decomposition methods [11, 29]; see [23, section 6.4] for details.
Note that implies . In what follows, we fix and define a convex subset of by
Since is bounded, there exists a constant such that
In addition, we define
An important observation made in [23, Lemma 4.5] is that Algorithm 1 can be interpreted as a kind of a gradient method equipped with a nonlinear distance function . A rigorous statement is presented in the following.
Lemma 1 (generalized additive Schwarz lemma)
For and , we define
Then we have
where the functional is given by
A fruitful consequence of Lemma 1 is an abstract convergence theory of additive Schwarz methods for convex optimization  that directly generalizes the classical theory for linear problems [33, Chapter 2]. The following three conditions are considered in the convergence theory: stable decomposition, strengthened convexity, and local stability (cf. [33, Assumptions 2.2 to 2.4]).
[stable decomposition] There exists a constant such that for any bounded and convex subset of , the following holds: for any , there exists , , with , such that
where is a positive constant depending on .
[strengthened convexity] There exists a constant which satisfies the following: for any , , , and , we have
[local stability] There exists a constant which satisfies the following: for any , and , , we have
The stable decomposition assumption is compatible with various stable decomposition conditions presented in existing works, e.g., [3, 31, 33]. The strengthened convexity assumption trivially holds with due to the convexity of . However, a better value for independent of can be found by the usual coloring technique; see [23, section 5.1] for details. In the same spirit as , the local stability assumption gives a one-sided measure of approximation properties of the local solvers. It was shown in [23, section 4.1] that the above assumptions reduce to [33, Assumptions 2.2 to 2.4] if they are applied to linear elliptic problems. Under the above three assumptions, we have the following convergence theorem for Algorithm 1 [23, Theorem 4.7].
Meanwhile, the Łojasiewicz inequality holds in many applications [8, 36]; it says that the energy functional of Eq. 1 is sharp around the minimizer . We summarize this property in the sharpness assumption below; it is well-known that improved convergence results for first-order optimization methods can be obtained under this assumption [9, 27].
[sharpness] There exists a constant such that for any bounded and convex subset of satisfying , we have
for some .
3 Backtracking strategies
In gradient methods, backtracking strategies are usually adopted to find a suitable step size that ensures sufficient decrease of the energy. For problems of the form Eq. 1, backtracking strategies are necessary in particular to obtain the global convergence to a solution when the Lipschitz constant of is not known [1, 6]. Considering Algorithm 1, a sufficient decrease condition of the energy is satisfied whenever and (see [23, Lemma 4.6]), and the values of and in the strengthened convexity and local stability assumptions, respectively, can be obtained explicitly in many cases. Indeed, an estimate for independent of can be obtained by the coloring technique [23, section 5.1], and we have when we use the exact local solvers. Therefore, backtracking strategies are not essential for the purpose of ensuring the global convergence of additive Schwarz methods. For this reason, to the best of our knowledge, the existing works on additive Schwarz methods for convex optimization have not considered backtracking strategies.
Meanwhile, in several recent works on accelerated first-order methods for convex optimization [10, 20, 28], full backtracking strategies that allow for adaptive increasing and decreasing of the estimated step size along the iterations were considered. While classical one-sided backtracking strategies (see, e.g., ) are known to suffer from degradation of the convergence rate if an inaccurate estimate for the step size is computed, full backtracking strategies can be regarded as acceleration schemes in the sense that a gradient method equipped with full backtracking outperforms the method with the known Lipschitz constant [10, 28].
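To make the notion of full backtracking concrete, the following is a minimal sketch for a plain gradient method on a smooth function. It is not taken from [10, 20, 28]; the function names, the sufficient decrease test with factor 1/2, and the parameters `tau0`, `rho`, and `max_inner` are illustrative assumptions. The step size is tentatively enlarged at the start of every outer iteration and then shrunk until the sufficient decrease test holds, so it can move in both directions along the iterations.

```python
def gradient_descent_full_backtracking(f, grad, x0, tau0=1.0, rho=0.5,
                                       iters=100, max_inner=50):
    """Gradient descent with a two-sided (full) backtracking line search.

    f, grad : callables returning the energy and its gradient (list of floats)
    rho     : shrink factor in (0, 1); the step is first enlarged by 1 / rho
    """
    x, tau = list(x0), tau0
    history = [f(x)]
    for _ in range(iters):
        g = grad(x)
        g2 = sum(gi * gi for gi in g)
        if g2 == 0.0:          # stationary point reached
            break
        tau /= rho             # tentative increase of the step size
        for _ in range(max_inner):
            x_new = [xi - tau * gi for xi, gi in zip(x, g)]
            # sufficient decrease test (assumed form)
            if f(x_new) <= history[-1] - 0.5 * tau * g2:
                break
            tau *= rho         # decrease the step size and retry
        x = x_new
        history.append(f(x))
    return x, history


if __name__ == "__main__":
    # Hypothetical test problem: f(x) = 0.5 * (x_0^2 + 10 * x_1^2)
    f = lambda x: 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)
    grad = lambda x: [x[0], 10.0 * x[1]]
    x, hist = gradient_descent_full_backtracking(f, grad, [1.0, 1.0])
    print(hist[0], hist[-1])
```

A one-sided strategy would omit the line `tau /= rho`, so that an overly small step size found early on is never corrected.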
In this section, we deal with a backtracking strategy for additive Schwarz methods as an acceleration scheme. Existing full backtracking strategies [10, 20, 28] mentioned above cannot be applied directly to additive Schwarz methods because the evaluation of the nonlinear distance function is not straightforward due to its complicated definition (see Lemma 1). Instead, we propose a novel backtracking strategy for additive Schwarz methods, in which the computational cost of the backtracking procedure is insignificant compared to that of solving local problems. The abstract additive Schwarz method equipped with the proposed backtracking strategy is summarized in Algorithm 2.
The parameter in Algorithm 2 plays the role of an adjustment parameter for the grid search. As gets closer to , the grid for the line search of becomes sparser. Conversely, the greater is, the greater the value of that is found, at the expense of more computational cost for the backtracking process. The condition is not critical in the implementation of Algorithm 2 since can be obtained by the coloring technique.
for the backtracking process can be evaluated without solving the infimum in the definition Eq. 7 of . Moreover, the backtracking process is independent of the local problems Eq. 2. That is, the stop criterion Eq. 9 is universal for any choice of and .
The additional computational cost of Algorithm 2 compared to Algorithm 1 comes from the backtracking process. When we evaluate the stop criterion Eq. 9, the values of , , and are needed. Among them, and can be computed prior to the backtracking process since they require and only in their computations. Hence, the computational cost of an additional inner iteration of the backtracking process consists of the computation of only, which is clearly marginal. In conclusion, the most time-consuming part of each iteration of Algorithm 2 is to solve local problems on , i.e., to obtain , and the other part has relatively small computational cost. This highlights the computational efficiency of the backtracking process in Algorithm 2.
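The cost structure described above can be illustrated with a small sketch. Everything in it is an assumption made for illustration and is not the paper's Algorithm 2: the model problem is a quadratic energy with one-dimensional coordinate subspaces and exact local solves, the acceptance test is an assumed convexity-type inequality that compares the new energy with the old energy and the sum of the local energies, and the baseline step size `tau0 = 1 / n` is the value for which that inequality follows from convexity alone. The two quantities that depend only on the current iterate and the local corrections are computed once per outer iteration; each inner backtracking step costs a single energy evaluation.

```python
def energy(A, b, u):
    """Quadratic model energy E(u) = 0.5 * u^T A u - b^T u (assumed example)."""
    n = len(u)
    Au = [sum(A[i][j] * u[j] for j in range(n)) for i in range(n)]
    return 0.5 * sum(u[i] * Au[i] for i in range(n)) - sum(b[i] * u[i] for i in range(n))


def local_corrections(A, b, u):
    """Exact local solves on coordinate subspaces: w_k minimizes E(u + w_k e_k)."""
    n = len(u)
    return [(b[k] - sum(A[k][j] * u[j] for j in range(n))) / A[k][k] for k in range(n)]


def schwarz_with_backtracking(A, b, u0, rho=0.5, iters=30):
    """Additive (Jacobi-type) subspace correction with energy-only backtracking."""
    n = len(u0)
    tau0 = 1.0 / n                      # safe step size guaranteed by convexity
    u, tau = list(u0), tau0
    energies, taus = [energy(A, b, u)], []
    for _ in range(iters):
        w = local_corrections(A, b, u)  # dominant cost: the local problems
        e_old = energies[-1]
        # Sum of local energies; computed once, before backtracking starts.
        e_loc = 0.0
        for k in range(n):
            v = list(u)
            v[k] += w[k]
            e_loc += energy(A, b, v)
        tau /= rho                      # tentative increase of the step size
        while True:
            u_new = [u[k] + tau * w[k] for k in range(n)]
            e_new = energy(A, b, u_new)  # only new quantity per inner step
            # Assumed acceptance test; tau0 is always accepted.
            if e_new <= (1.0 - tau * n) * e_old + tau * e_loc or tau <= tau0:
                break
            tau = max(rho * tau, tau0)
        u = u_new
        energies.append(e_new)
        taus.append(tau)
    return u, energies, taus


def tridiagonal(n):
    """Hypothetical test matrix: tridiag(-1, 2, -1)."""
    return [[2.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0)
             for j in range(n)] for i in range(n)]
```

For instance, calling `schwarz_with_backtracking(tridiagonal(8), [1.0] * 8, [0.0] * 8)` returns the final iterate together with the energy history and the accepted step sizes, which never fall below `tau0` in this sketch.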
Next, we analyze the convergence behavior of Algorithm 2. First, we prove that the backtracking process in Algorithm 2 ends in finite steps and that the step size never becomes smaller than a particular value.
Since Lemma 1 implies that the stop criterion Eq. 9 is satisfied whenever , the backtracking process ends if becomes smaller than or equal to . Now, take any . If were less than , say for some , then in the previous inner iteration is , so that the backtracking process should have stopped there, which is a contradiction. Therefore, we have .
Lemma 2 says that Lemma 1 is a sufficient condition to ensure that is successfully determined by the backtracking process in each iteration of Algorithm 2. It is important to notice that is always greater than or equal to ; the step sizes of Algorithm 2 are larger than or equal to that of Algorithm 1. Meanwhile, similar to the plain additive Schwarz method, Algorithm 2 generates the sequence whose energy is monotonically decreasing. Hence, is contained in defined in Eq. 4.
Take any . By the stop criterion Eq. 9 for backtracking and the minimization property of , we get
which completes the proof.
Equation 11 is identical to the second half of [23, Lemma 4.6]. Nevertheless, it is revisited to highlight that some assumptions given in [23, Lemma 4.6] are not necessary for Lemma 5; for example, need not be less than or equal to as stated in [23, Lemma 4.6] but can be any positive real number.
where (i), (ii), and (iii) are because of Lemmas 5, 1, and 4, respectively. Starting from Eq. 13, we readily obtain the following convergence theorems for Algorithm 2 by proceeding in the same manner as in [23, Appendices A.3 and A.4].
Although Propositions 4 and 3 guarantee the convergence to the energy minimum and provide the order of convergence of Algorithm 2, they are not fully satisfactory results in the sense that they do not explain why Algorithm 2 achieves faster convergence than Algorithm 1. In order to explain the acceleration property of the backtracking process, one should obtain an estimate for the convergence rate of Algorithm 2 in terms of the step sizes along the iterations . We first state an elementary lemma that will be used in further analysis of Algorithm 2 (cf. [31, Lemma 3.2]).
Suppose that satisfy the inequality
where and . Then we have
It suffices to show that . We may assume that . By the mean value theorem, there exists a constant such that
Hence, we have , which yields the desired result.
We also need the following lemma that was presented in [23, Lemma A.2].
Let , , and . The minimum of the function , is given as follows:
Now, we present a convergence theorem for Algorithm 2 that reveals the dependency of the convergence rate on the step sizes determined by the backtracking process. More precisely, the following theorems show that the convergence rate of Algorithm 2 is dependent on the -averaged additive Schwarz condition number defined by
where and was defined in Eq. 8.
We take any and write . For , we write
so that . It follows that
where (i) is due to Lemma 5 and (ii) is due to the convexity of . If we set for , then and
By Lemma 6, it follows that
Summation of Eq. 19 over yields
which is the desired result.