Additive Schwarz Methods for Convex Optimization with Backtracking

This paper presents a novel backtracking strategy for additive Schwarz methods for general convex optimization problems, serving as an acceleration scheme. The proposed backtracking strategy is independent of local solvers, so it can be applied to any algorithm that fits the abstract framework of additive Schwarz methods. By allowing the step size to increase and decrease adaptively along the iterations, the strategy greatly improves the convergence rate of an algorithm, and this improved convergence rate is proven rigorously. In addition, combining the proposed backtracking strategy with a momentum acceleration technique, we propose a further accelerated additive Schwarz method. Numerical results for various convex optimization problems that support our theory are presented.


1 Introduction

In this paper, we are interested in additive Schwarz methods for a general convex optimization problem

(1)   \min_{u \in V} \left\{ E(u) := F(u) + G(u) \right\},

where V is a reflexive Banach space, F is a Fréchet differentiable convex function, and G is a proper, convex, and lower semicontinuous function that is possibly nonsmooth. We additionally assume that E is coercive, so that Eq. 1 admits a solution u^* \in V.

The importance of studying Schwarz methods arises from both theoretical and computational viewpoints. It is well known that various iterative methods such as block relaxation methods, multigrid methods, and domain decomposition methods can be interpreted as Schwarz methods, also known as subspace correction methods. Studying Schwarz methods can therefore yield a unified understanding of these methods; there have been several notable works on the analysis of domain decomposition and multigrid methods for linear problems in the framework of Schwarz methods [18, 33, 34, 35]. The convergence theory of Schwarz methods has been developed for several classes of nonlinear problems as well [2, 4, 23, 31]. From the computational viewpoint, Schwarz methods are prominent numerical solvers for large-scale problems because they can efficiently utilize massively parallel computer architectures. There has been plenty of research on Schwarz methods as parallel solvers for large-scale scientific problems of the form Eq. 1, e.g., nonlinear elliptic problems [12, 31], variational inequalities [5, 29, 30], and mathematical imaging problems [11, 14, 25].

An important concern in the research of Schwarz methods is the acceleration of algorithms. One of the most elementary relevant results is optimizing the relaxation parameters of the Richardson iterations underlying the Schwarz alternating method [33, section C.3]; observing that the Schwarz alternating method for linear elliptic problems can be viewed as a preconditioned Richardson method, one can optimize the relaxation parameters of the Richardson iterations to achieve faster convergence as in [33, Lemma C.5]. Moreover, if one replaces the Richardson iterations by conjugate gradient iterations with the same preconditioner, an improved algorithm with a faster convergence rate is obtained. This idea of acceleration applies not only to linear problems but also to nonlinear ones. There have been some recent works on the acceleration of domain decomposition methods for several kinds of nonlinear problems: nonlinear elliptic problems [12], variational inequalities [17], and mathematical imaging problems [15, 16, 19]. In particular, in the author’s previous work [22], an accelerated additive Schwarz method that can be applied to the general convex optimization problem Eq. 1 was considered. Noticing that additive Schwarz methods for Eq. 1 can be interpreted as gradient methods [23], acceleration schemes such as momentum [6, 20] and adaptive restarting [21], originally developed for gradient methods in the field of mathematical optimization, were adopted.

In this paper, we consider another acceleration strategy from the field of mathematical optimization, called backtracking, for application to additive Schwarz methods. Backtracking was originally considered as a line search method for step sizes that ensures the global convergence of a gradient method [1, 6]. In some recent works on accelerated gradient methods [10, 20, 28], it was shown both theoretically and numerically that certain backtracking strategies can accelerate the convergence of gradient methods. By allowing the step size to increase and decrease adaptively along the iterations, backtracking can find a nearly optimal step size that yields a large energy decay, so that fast convergence is achieved. This acceleration property of backtracking can be viewed as analogous to the relaxation parameter optimization for Richardson iterations mentioned above. Hence, as in the case of Richardson iterations for linear problems, one may expect that the convergence rate of additive Schwarz methods for Eq. 1 can be improved if an appropriate backtracking strategy is adopted. Unfortunately, applying existing backtracking strategies such as those in [10, 20, 28] to additive Schwarz methods is not straightforward. The existing strategies require the computation of the underlying distance function of the gradient method. For usual gradient methods, the underlying distance function is simply the norm of the solution space, so this requirement poses no difficulty. However, the underlying nonlinear distance function of additive Schwarz methods has a rather complex structure in general (see Eq. 7); this aspect makes direct application of the existing strategies to additive Schwarz methods cumbersome.

This paper proposes a novel backtracking strategy for additive Schwarz methods that does not rely on the computation of the underlying distance function. As shown in Algorithm 2, the proposed backtracking strategy requires evaluations of the energy functional only, not of the distance function. Hence, it can be easily implemented for additive Schwarz methods for Eq. 1 with any choice of local solvers. The acceleration properties of the proposed backtracking strategy can be analyzed mathematically; we present explicit estimates for the convergence rate of the method in terms of an averaged quantity estimated along the iterations. The proposed backtracking strategy has another interesting feature: since it accelerates the additive Schwarz method in a completely different manner from the momentum acceleration introduced in [22], the momentum acceleration and the proposed backtracking strategy can be applied simultaneously to form a further accelerated method; see Algorithm 3. We present numerical results for various convex optimization problems of the form Eq. 1 to verify our theoretical results and to highlight the computational efficiency of the proposed accelerated methods.

This paper is organized as follows. A brief summary of the abstract convergence theory of additive Schwarz methods for convex optimization presented in [23] is given in Section 2. In Section 3, we present and analyze a novel backtracking strategy for additive Schwarz methods as an acceleration scheme. A fast additive Schwarz method that combines the ideas of the momentum acceleration [22] and the proposed backtracking strategy is proposed in Section 4. Numerical results for various convex optimization problems are presented in Section 5. We conclude the paper with remarks in Section 6.

2 Additive Schwarz methods

In this section, we briefly review the abstract framework for additive Schwarz methods for the convex optimization problem Eq. 1 presented in [23]. In what follows, an index runs from to . Let be a reflexive Banach space and let be a bounded linear operator such that

and its adjoint is surjective. For the sake of describing local problems, we define and as functionals defined on , which are proper, convex, and lower semicontinuous with respect to their first arguments. Local problems have the following general form:

(2)

where and . If we set equationparentequation

(3a)
in Eq. 2, then the minimization problem is reduced to
(3b)

which is the case of exact local problems. Here D_F denotes the Bregman distance of F, i.e., D_F(u, v) = F(u) - F(v) - \langle F'(v), u - v \rangle.
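For instance, for the quadratic energy F(u) = \tfrac{1}{2} \| u \|^2 one has

D_F(u, v) = \tfrac{1}{2} \| u - v \|^2,

so in this case the Bregman distance reduces to the usual squared norm distance.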

We note that other choices of and , i.e., cases of inexact local problems, include various existing numerical methods such as block coordinate descent methods [7] and constraint decomposition methods [11, 29]; see [23, section 6.4] for details.
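As a simple illustration of such a space decomposition (the notation R_k^* and the block sizes d_k below are chosen here for exposition and are not taken from the paper), the block coordinate descent setting corresponds to splitting a Euclidean solution space into coordinate blocks,

V = \mathbb{R}^d = \bigoplus_{k=1}^{N} \mathbb{R}^{d_k}, \qquad R_k^* \colon \mathbb{R}^{d_k} \to \mathbb{R}^d \ \text{the extension-by-zero of the $k$-th block},

so that the k-th local problem updates only the coordinates of the k-th block while the remaining blocks are held fixed.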

The plain additive Schwarz method for Eq. 1 is presented in Algorithm 1. The constants appearing in Algorithm 1 will be given in the strengthened convexity and local stability assumptions below. Note that dom G denotes the effective domain of G, i.e., dom G = \{ u \in V : G(u) < +\infty \}.

  Choose , , and .
  for  do
     
  end for
Algorithm 1 Additive Schwarz method for Eq. 1
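To make the structure of Algorithm 1 concrete, the following Python sketch shows one possible way to organize the abstract iteration; the callables local_solves[k] and prolongs[k] and the fixed step size tau are placeholders for the problem-dependent ingredients of the framework and are not part of the paper.

import numpy as np

def additive_schwarz(u0, local_solves, prolongs, tau, n_iters):
    """Schematic additive Schwarz iteration (a simplified sketch of Algorithm 1).

    local_solves[k](u) is assumed to return an (approximate) minimizer of the
    k-th local problem at the current iterate u, and prolongs[k](w) maps the
    local solution back into the global space.  tau is a fixed step size.
    """
    u = np.copy(u0)
    for _ in range(n_iters):
        # Solve all local problems independently; this is the parallel part.
        corrections = [prolongs[k](local_solves[k](u))
                       for k in range(len(local_solves))]
        # Relaxed additive update: add the weighted sum of local corrections.
        u = u + tau * sum(corrections)
    return u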

Note that implies . In what follows, we fix and define a convex subset of by

(4)

Since is bounded, there exists a constant such that

(5)

In addition, we define

(6)

for .

An important observation made in [23, Lemma 4.5] is that Algorithm 1 can be interpreted as a kind of gradient method equipped with a nonlinear distance function [32]. A rigorous statement is presented in the following.

Lemma 1 (generalized additive Schwarz lemma)

For and , we define

where

Then we have

where the functional is given by

(7)

A fruitful consequence of Lemma 1 is an abstract convergence theory of additive Schwarz methods for convex optimization [23] that directly generalizes the classical theory for linear problems [33, Chapter 2]. The following three conditions are considered in the convergence theory: stable decomposition, strengthened convexity, and local stability (cf. [33, Assumptions 2.2 to 2.4]).

[stable decomposition] There exists a constant such that for any bounded and convex subset of , the following holds: for any , there exists , , with , such that

where is a positive constant depending on .

[strengthened convexity] There exists a constant which satisfies the following: for any , , , and , we have

[local stability] There exists a constant which satisfies the following: for any , and , , we have

The stable decomposition assumption is compatible with various stable decomposition conditions presented in existing works, e.g., [3, 31, 33]. The strengthened convexity assumption trivially holds, with a constant equal to the reciprocal of the number of local spaces, due to the convexity of the energy; however, a better value independent of the number of local spaces can be found by the usual coloring technique; see [23, section 5.1] for details. In the same spirit as [33], the local stability assumption gives a one-sided measure of the approximation properties of the local solvers. It was shown in [23, section 4.1] that the above assumptions reduce to [33, Assumptions 2.2 to 2.4] when they are applied to linear elliptic problems. Under the above three assumptions, we have the following convergence theorem for Algorithm 1 [23, Theorem 4.7].

Proposition 1

Suppose that the stable decomposition, strengthened convexity, and local stability assumptions hold. Then, in Algorithm 1, we have

where is the additive Schwarz condition number defined by

(8)

and was defined in Eq. 6.

Meanwhile, the Łojasiewicz inequality holds in many applications [8, 36]; it says that the energy functional of Eq. 1 is sharp around the minimizer u^*. We summarize this property in the following sharpness assumption; it is well known that improved convergence results for first-order optimization methods can be obtained under this assumption [9, 27].

[sharpness] There exists a constant such that for any bounded and convex subset of satisfying , we have

for some .
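As a simple illustrative special case (not an assumption made in the paper), a strongly convex energy is sharp with a quadratic exponent: if E is \mu-strongly convex with minimizer u^*, then

E(u) - E(u^*) \ge \frac{\mu}{2} \| u - u^* \|^2 \qquad \text{for all } u \in V,

so the sharpness condition holds on every bounded convex subset containing u^* with exponent two.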

We present an improved convergence result for Algorithm 1 compared to Proposition 1 under the additional sharpness assumption on the energy functional [23, Theorem 4.8].

Proposition 2

Suppose that the stable decomposition, strengthened convexity, local stability, and sharpness assumptions hold. Then, in Algorithm 1, we have

where was defined in Eq. 8.

Propositions 1 and 2 are direct consequences of Lemma 1 in the sense that they can be easily deduced by invoking the theory of gradient methods for convex optimization [23, section 2].

3 Backtracking strategies

In gradient methods, backtracking strategies are usually adopted to find a suitable step size that ensures a sufficient decrease of the energy. For problems of the form Eq. 1, backtracking strategies are necessary in particular to obtain global convergence to a solution when the Lipschitz constant of the gradient of the smooth part is not known [1, 6]. For Algorithm 1, a sufficient decrease condition of the energy is satisfied whenever the step size does not exceed its admissible bound (see [23, Lemma 4.6]), and the constants in the strengthened convexity and local stability assumptions can be obtained explicitly in many cases. Indeed, an estimate independent of the number of local spaces can be obtained by the coloring technique [23, section 5.1], and the local stability constant equals one when exact local solvers are used. Therefore, backtracking strategies are not essential for ensuring the global convergence of additive Schwarz methods. From this perspective, and to the best of our knowledge, applying backtracking strategies has not been considered in the existing works on additive Schwarz methods for convex optimization.

Meanwhile, in several recent works on accelerated first-order methods for convex optimization [10, 20, 28], full backtracking strategies that allow the estimated step size to increase as well as decrease adaptively along the iterations were considered. While classical one-sided backtracking strategies (see, e.g., [6]) are known to suffer from a degraded convergence rate when an inaccurate step size estimate is computed, full backtracking strategies can be regarded as acceleration schemes in the sense that a gradient method equipped with full backtracking can outperform the same method run with the known Lipschitz constant [10, 28].
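To make the mechanism concrete, the following Python sketch shows a generic (non-Schwarz) gradient step with such a full backtracking search; the quadratic upper-bound test used here is the standard sufficient decrease condition for a smooth term, and the function names and the factors increase/decrease are illustrative choices rather than the specific rules of [10, 20, 28].

import numpy as np

def gradient_step_full_backtracking(x, f, grad_f, L_prev, increase=2.0, decrease=2.0):
    """One gradient step with a full (increase-and-decrease) backtracking search.

    f is a smooth convex function and L_prev the Lipschitz estimate kept from
    the previous iteration.  The estimate is first optimistically reduced
    (i.e., the step 1/L is enlarged) and then increased until the standard
    quadratic upper-bound test holds.
    """
    L = L_prev / increase            # try a larger step than last time
    g = grad_f(x)
    fx = f(x)
    while True:
        x_new = x - g / L            # candidate gradient step of length 1/L
        diff = x_new - x
        # Sufficient decrease: f(x_new) <= f(x) + <g, diff> + (L/2)||diff||^2
        if f(x_new) <= fx + g @ diff + 0.5 * L * (diff @ diff):
            return x_new, L          # accept; reuse L as the next estimate
        L *= decrease                # step was too large: shrink and retry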

In this section, we deal with a backtracking strategy for additive Schwarz methods as an acceleration scheme. Existing full backtracking strategies [10, 20, 28] mentioned above cannot be applied directly to additive Schwarz methods because the evaluation of the nonlinear distance function is not straightforward due to its complicated definition (see Lemma 1). Instead, we propose a novel backtracking strategy for additive Schwarz methods, in which the computational cost of the backtracking procedure is insignificant compared to that of solving local problems. The abstract additive Schwarz method equipped with the proposed backtracking strategy is summarized in Algorithm 2.

  Choose , , , and .
  for  do
     
     
     repeat
        
        
        if  then
           
        end if
     until 
     
  end for
Algorithm 2 Additive Schwarz method for Eq. 1 with backtracking

The parameter in Algorithm 2 plays the role of an adjustment parameter for the grid search. The closer it is to , the sparser the grid for the line search of the step size becomes. On the contrary, the greater it is, the greater the step size that is found, at the cost of more computational work in the backtracking process. The condition is not critical in the implementation of Algorithm 2, since can be obtained by the coloring technique.

In contrast to the existing approaches [10, 20, 28], the backtracking scheme in Algorithm 2 depends only on the energy functional and not on the distance function. Hence, the stopping criterion

(9)

for the backtracking process can be evaluated without solving the infimum in the definition Eq. 7 of the distance function. Moreover, the backtracking process is independent of the local problems Eq. 2; that is, the stopping criterion Eq. 9 is universal for any choice of local solvers.

The additional computational cost of Algorithm 2 compared to Algorithm 1 comes from the backtracking process. Evaluating the stopping criterion Eq. 9 requires several quantities, most of which can be computed prior to the backtracking process since they depend only on the current iterate and the local solutions. Hence, the computational cost of an additional inner iteration of the backtracking process consists of a single evaluation of the energy functional, which is clearly marginal. In conclusion, the most time-consuming part of each iteration of Algorithm 2 is solving the local problems, and the remaining parts have relatively small computational cost. This highlights the computational efficiency of the backtracking process in Algorithm 2.
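The following Python sketch illustrates how one outer iteration with this energy-based backtracking could be organized; the helper schwarz_correction, the factor rho, the safeguard tau_min, and the predicate sufficient_decrease are illustrative placeholders standing in for the corresponding quantities of Algorithm 2 and Eq. 9, not a verbatim transcription of them.

def backtracking_iteration(u, E, schwarz_correction, tau_prev, rho, tau_min,
                           sufficient_decrease):
    """One outer iteration of an additive Schwarz step with energy-based backtracking.

    schwarz_correction(u) returns the sum of the prolonged local corrections at u;
    it is computed once per outer iteration, outside the backtracking loop.
    sufficient_decrease(E_old, E_trial, tau) is a stand-in for the stopping
    criterion Eq. 9; only energy values enter the test.
    """
    d = schwarz_correction(u)        # local solves: the expensive part
    E_old = E(u)                     # computed once and reused in every test
    tau = rho * tau_prev             # tentatively enlarge the previous step size
    while True:
        u_trial = u + tau * d
        E_trial = E(u_trial)         # the only work inside the inner loop
        if sufficient_decrease(E_old, E_trial, tau) or tau <= tau_min:
            return u_trial, tau      # accept; tau is reused at the next iteration
        tau /= rho                   # shrink the step size and try again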

Next, we analyze the convergence behavior of Algorithm 2. First, we prove that the backtracking process in Algorithm 2 terminates after finitely many steps and that the step size never becomes smaller than a particular value.

Lemma 2

Suppose that Lemma 1 holds. Then the backtracking process in Algorithm 2 terminates after finitely many steps, and we have

for , where was given in Lemma 1.

Proof

Since Lemma 1 implies that the stopping criterion Eq. 9 is satisfied whenever , the backtracking process ends once becomes smaller than or equal to . Now, take any . If were less than , say for some , then the step size in the previous inner iteration was , so the backtracking process would have stopped there, which is a contradiction. Therefore, we have .

Lemma 2 says that Lemma 1 is a sufficient condition to ensure that the step size is successfully determined by the backtracking process in each iteration of Algorithm 2. It is important to notice that it is always greater than or equal to ; that is, the step sizes of Algorithm 2 are larger than or equal to those of Algorithm 1. Meanwhile, similarly to the plain additive Schwarz method, Algorithm 2 generates a sequence of iterates whose energies are monotonically decreasing. Hence, the sequence is contained in the set defined in Eq. 4.

Lemma 3

Suppose that Lemma 1 holds. In Algorithm 2, the sequence of energy values is decreasing.

Proof

Take any . By the stopping criterion Eq. 9 for the backtracking process and the minimization property of , we get

which completes the proof.

Note that [23, Lemma 4.6] played a key role in the convergence analysis of Algorithm 1 presented in [23]. Relevant results for Algorithm 2 can be obtained in a similar manner.

Lemma 4

Suppose that Lemmas 1 and 1 hold. In Algorithm 2, we have

for , where the functional and the set were defined in Eqs. 6 and 7, respectively, for .

Proof

Take any such that

(10)

By Lemma 1 and Eq. 9, we get

Taking the infimum over all satisfying Eq. 10 yields the desired result.

Lemma 5

Suppose that Lemma 1 holds. Let . For any bounded and convex subset of , we have

(11)

for , where the functional was given in Eq. 7 and

In addition, the right-hand side of Eq. 11 is decreasing with respect to . More precisely, if , then we have

(12)

for .

Proof

Equation 11 is identical to the second half of [23, Lemma 4.6]. Nevertheless, it is revisited here to highlight that some assumptions made in [23, Lemma 4.6] are not necessary for Lemma 5; for example, the quantity need not be less than or equal to the bound stated in [23, Lemma 4.6] but can be any positive real number.

Now, we prove Eq. 12. Since , one can deduce from Eq. 6 that . Hence, by the definition of given in Lemma 1, we get . Meanwhile, the convexity of implies that

which completes the proof.

Recall that the sequence of step sizes generated by Algorithm 2 has a uniform lower bound by Lemma 2. Hence, for any , we get

(13)

where (i), (ii), and (iii) follow from Lemmas 5, 1, and 4, respectively. Starting from Eq. 13, we readily obtain the following convergence theorems for Algorithm 2 by proceeding in the same manner as in [23, Appendices A.3 and A.4].

Proposition 3

Suppose that the stable decomposition, strengthened convexity, and local stability assumptions hold. Then, in Algorithm 2, we have

where was defined in Eq. 8.

Proposition 4

Suppose that the stable decomposition, strengthened convexity, local stability, and sharpness assumptions hold. Then, in Algorithm 2, we have

where was defined in Eq. 8.

Although Propositions 3 and 4 guarantee convergence to the minimum energy and provide the order of convergence of Algorithm 2, they are not fully satisfactory in the sense that they cannot explain why Algorithm 2 achieves faster convergence than Algorithm 1. In order to explain the acceleration property of the backtracking process, one should obtain an estimate for the convergence rate of Algorithm 2 in terms of the step sizes chosen along the iterations [10]. We first state an elementary lemma that will be used in the further analysis of Algorithm 2 (cf. [31, Lemma 3.2]).

Lemma 6

Suppose that satisfy the inequality

where and . Then we have

Proof

It suffices to show that . We may assume that . By the mean value theorem, there exists a constant such that

Hence, we have , which yields the desired result.

We also need the following lemma that was presented in [23, Lemma A.2].

Lemma 7

Let , , and . The minimum of the function , is given as follows:

Now, we present a convergence theorem for Algorithm 2 that reveals the dependence of the convergence rate on the step sizes determined by the backtracking process. More precisely, the following theorems show that the convergence rate of Algorithm 2 depends on an averaged additive Schwarz condition number defined by

(14)

where and was defined in Eq. 8.

Theorem 3.1

Suppose that the stable decomposition, strengthened convexity, and local stability assumptions hold. In Algorithm 2, if , then

for , where , , and were given in Eqs. 14, 5, and 4, respectively.

Proof

We take any and write . For , we write

so that . It follows that

(15)

where (i) is due to Lemma 5 and (ii) is due to the convexity of . If we set for , then and

(16)

Substituting Eq. 16 into Eq. 15 yields

(17)

where the last inequality is due to the convexity of and Eq. 5. The definition Eq. 6 of implies that , so that . Hence, we have . Invoking Lemma 7, we get

(18)

where was defined in Eq. 8. Combining Eqs. 18 and 17 yields

By Lemma 6, it follows that

(19)

Summation of Eq. 19 over yields

or equivalently,

which is the desired result.

Theorem 3.2

Suppose that Propositions 1, 1, 1, and