Inexact Block Coordinate Descent Algorithms for Nonsmooth Nonconvex Optimization

05/10/2019 ∙ by Yang Yang, et al. ∙ University of Luxembourg Technische Universität Darmstadt The Chinese University of Hong Kong, Shenzhen 0

In this paper, we propose an inexact block coordinate descent algorithm for large-scale nonsmooth nonconvex optimization problems. At each iteration, a particular block variable is selected and updated by solving the original optimization problem with respect to that block variable inexactly. More precisely, a local approximation of the original optimization problem is solved. The proposed algorithm has several attractive features, namely, i) high flexibility, as the approximation function only needs to be strictly convex and it does not have to be a global upper bound of the original function; ii) fast convergence, as the approximation function can be designed to exploit the problem structure at hand and the stepsize is calculated by the line search; iii) low complexity, as the approximation subproblems are much easier to solve and the line search scheme is carried out over a properly constructed differentiable function; iv) guaranteed convergence to a stationary point, even when the objective function does not have a Lipschitz continuous gradient. Interestingly, when the approximation subproblem is solved by a descent algorithm, convergence to a stationary point is still guaranteed even if the approximation subproblem is solved inexactly by terminating the descent algorithm after a finite number of iterations. These features make the proposed algorithm suitable for large-scale problems where the dimension exceeds the memory and/or the processing capability of the existing hardware. These features are also illustrated by several applications in signal processing and machine learning, for instance, network anomaly detection and phase retrieval.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

In this paper, we consider the optimization problem

subject to (1)

where the function is proper, is smooth but not necessarily convex, is convex but not necessarily smooth, and the constraint set has a Cartesian product structure with being closed and convex for all . Such a formulation plays a fundamental role in signal processing and machine learning, and typically

models the estimation error or empirical loss while

is a regularization (penalty) function promoting in the solution a certain structure known a priori such as sparsity [1].

For a large-scale nonconvex optimization problem of the form (1), the block coordinate descent (BCD) algorithm has been recognized as an efficient and reliable numerical method. Its variable update is based on the so-called nonlinear best-response [2, 3, 4, 5, 6]: at each iteration of the BCD algorithm, one block variable, say , is updated by its best-response while the other block variables are fixed to their values of the preceding iteration

(2a)
(2b)

That is, the best-response is the optimal point that minimizes with respect to (w.r.t.) the variable .

The BCD algorithm has several notable advantages. First of all, the subproblem (2) (w.r.t. a block variable ) is much easier to solve than the original problem (1) (w.r.t. the whole set of variables ), and the best-response even have a closed-form expression in many applications, for example LASSO [7]. It is thus suitable for implementation on hardware with limited memory and/or computational capability. Secondly, as all block variables are updated sequentially, when a block variable is updated, the newest value of other block variables is always incorporated. These two attractive features can sometimes lead to even faster convergence than their parallel counterpart, namely, the Jacobi algorithm (also known as the parallel best-response algorithm) [2].

In cases where the subproblems (2) are still difficult to solve and/or (sufficient) convergence conditions (mostly on the convexity of and the uniqueness of , see [8, 3, 9] and the references therein) are not satisfied, several extensions have been proposed. Their central idea is to solve the optimization problem (2) inexactly. For example, in the block successive upper bound minimization (BSUM) algorithm [4], a global upper bound function of is minimized at each iteration. Common examples are proximal approximations [9] and, if is block Lipschitz continuous111That is, is Lipschitz continuous for all ., proximal-linear approximation [9, 10]. However, for the BSUM algorithm, a global upper bound function may not exist for some (and thus ).

The block Lipschitz continuity assumption is not needed if a stepsize is employed in the variable update. In practice, the stepsize can be determined by line search [11, 12]. Nevertheless, only a specific approximation of is considered, namely, quadratic approximation. Sometimes it may be desirable to use other approximations to better exploit the problem structure, for example, best-response approximation and partial linearization approximation when the nonconvex function has “partial” convexity (their precise descriptions are provided in Section III) . This is the central idea in recent (parallel) successive convex approximation (SCA) algorithms [13, 14, 15, 16, 17] and block successive convex approximation (BSCA) algorithms [4, 18, 19], which consist in solving a sequence of successively refined convex approximation subproblems. A new line search scheme to determine the stepsize is also proposed in [16, 17]: it is carried out over a properly constructed smooth function and its complexity is much lower than traditional schemes that directly operate on the original nonsmooth function [11, 13, 12]. For example, as we will see later in the applications studied in this paper, when

represents a quadratic loss function, the exact line search has a simple analytical expression.

Nevertheless, existing BSCA schemes also have their limitations: the BSCA algorithm proposed in [4] is not applicable when the objective function is nonsmooth, and the convergence of the BSCA algorithms proposed in [18, 19] is only established under the assumption that is Lipschitz continuous and the stepsizes are decreasing. Although it is shown in [18] that constant stepsizes can also be used, the choice of the constant stepsizes depends on the Lipschitz constant of that is not easy to obtain/estimate when the problem dimension is extremely large.

The standard SCA and BSCA algorithms [12, 4, 18, 16, 17] are based on the assumption that the approximation subproblem is solved perfectly at each iteration. Unless the approximation subproblems have a closed-form solution, this assumption can hardly be satisfied by iterative algorithms that exhibit an asymptotic convergence only as they must be terminated after a finite number of iterations in practice. It is shown in [15, 19] that convergence is still guaranteed if the approximation subproblems are solved more and more accurately. However, the solution accuracy is specified by an error bound which is difficult to verify in practice. A different approach is adopted in [20] where the optimization problem (2) is solved inexactly by running the gradient projection algorithm for a finite number of iterations. Nevertheless, its convergence is only established for the specific application in nonnegative matrix factorization.

In this paper, we propose a block successive convex approximation (BSCA) framework for the nonsmooth nonconvex problem (1) by extending the parallel update scheme in [16] to a block update scheme. The proposed BSCA algorithm consists in optimizing a sequence of successively refined approximation subproblems, and has several attractive features.

  1. The approximation function is a strictly convex approximation of the original function and it does not need to be a global upper bound of the original function;

  2. The stepsize is calculated by performing the (exact or successive) line search scheme along the coordinate of the block variable being updated and has low complexity as it is carried out over a properly constructed smooth function;

  3. If the approximation subproblem does not admit a closed-form solution and is solved iteratively by a descent algorithm, for example the (parallel) SCA algorithm proposed in [16], the descent algorithm can be terminated after a finite number of iterations;

  4. The BSCA algorithm has a guaranteed convergence to a stationary point, even when is not multiconvex and/or is not block Lipschitz continuous.

These features are distinctive from existing works from the following aspects:

  • Feature i) extends the BSUM algorithm [4] and BCD algorithm [9, (1.3b)] where the approximation function must be a global upper bound of the original function, [11, 12] and [18, 19] where the approximation functions must be quadratic and strongly convex, respectively;

  • Feature ii) extends [18, 19] where decreasing stepsizes are used, and [11, 13, 12] where the line search is over the original nonsmooth function and has a high complexity;

  • Feature iii) extends [19] where the approximation subproblems must be solved with increasing accuracy. We remark that this feature is inspired by [13], but we establish convergence under weaker assumptions;

  • Feature iv) extends [9, (1.3a)] where is multi-strongly-convex, [18, 19] where must be Lipschitz continuous, [9, (1.3c)] and [10] where must be block Lipschtiz continuous, and [13, 12] where line search over the original nonsmooth function is used.

These attractive features are illustrated by several applications in signal processing and machine learning, namely, network anomaly detection and phase retrieval.

The rest of the paper is structured as follows. In Sec. II, we give a brief review of the SCA framework proposed in [16]. In Sec. III, the BSCA framework together with the convergence analysis is formally presented. An inexact BSCA framework is proposed in Sec. IV. The attractive features of the proposed (exact and inexact) BSCA framework are illustrated through several applications in Sec. V. Finally some concluding remarks are drawn in Sec. VI.

Notation: We use , and

to denote a scalar, vector and matrix, respectively. We use

and to denote the -th element and the -th column of , respectively; is the -th element of where , and denotes all elements of except : . We denote and as the element-wise operation, i.e., and , respectively. Notation and denotes the Hadamard product between and , and the Kronecker product between and , respectively. The operator returns the element-wise projection of onto : . We denote as the vector that consists of the diagonal elements of and is a diagonal matrix whose diagonal vector is . We use to denote a vector with all elements equal to 1. The operator specifies the -norm of and it denotes the spectral norm when is not specified. denotes the soft-thresholding operator: .

Ii Review of the Successive Convex Approximation framework

In this section, we present a brief review of (a special case of) the SCA framework developed in [16] for problem (1). It consists of solving a sequence of successively refined approximation subproblems: given at iteration , the approximation function of w.r.t. is denoted as , and the approximation subproblem consists of minimizing the approximation function over the constraint set :

(3)

where satisfies several technical assumptions, most notably,

  • Convexity: The function is convex in for any given ;

  • Gradient Consistency: The gradient of and the gradient of are identical at , i.e., ;

The approximation subproblem (3) can readily be decomposed into independent subproblems that can be solved in parallel: and

Remark 1.

The approximation function in (3) is a special case of the general SCA framework developed in [16] because it is separable among the different block variables. More generally, , the approximation function of , only needs to be convex and differentiable with the same gradient as at , and it does not necessarily admit a separable structure.

Since is an optimal point of problem (3), we have

(4)

where , and is due to the optimality of , the convexity of in and the gradient consistency assumption, respectively. Therefore is a descent direction of the original objective function in (1) along which the function value can be further decreased compared with [16, Prop. 1]. This motivates us to refine and define as follows:

(5)

where is the stepsize that needs to be selected properly to yield a fast convergence.

It is natural to select a stepsize such that the function is minimized w.r.t. :

(6)

and this is the so-called exact line search (also known as the minimization rule). For nonsmooth optimization problems, the traditional exact line search usually suffers from a high complexity as the optimization problem (6) is nondifferentiable. It is shown in [16, Sec. III-A] that the stepsize obtained by performing the exact line search over the following differentiable function also yields a decrease in :

(7)

To see this, we remark that firstly, the objective function in (7) is an upper bound of the objective function in (6) which is tight at since is convex:

Secondly, the objective function in (7) has a negative slope at as its gradient is equal to in (4). Therefore, and .

Initialization: and (arbitrary but fixed).

Repeat the following steps until convergence:

  1. Compute by solving the following independent optimization problems in parallel:

  2. Compute by the exact line search (7) or the successive line search (8).

  3. Update : .

  4. and go to S1.

Algorithm 1 The parallel successive convex approximation algorithm for nonsmooth nonconvex optimization problem (1) (proposed in [16])

If the scalar differentiable optimization problem in (7) is still difficult to solve, the low-complexity successive line search (also known as the Armijo rule) can be used instead [16, Sec. III-A]: given scalars and , the stepsize is set to be , where is the smallest nonnegative integer satisfying

(8)

where is the descent defined in (4).

The above steps are summarized in Alg. 1. As a descent algorithm, it generates a monotonically decreasing sequence , and every limit point of is a stationary point of (1) (see [16, Thm. 2] for the proof).

Iii The Proposed Block Successive Convex Approximation Algorithms

From a theoretical perspective, Alg. 1 is fully parallelizable. In practice, however, it may not be fully parallelized when the problem dimension exceeds the hardware’s memory and/or processing capability. We could naively solve the independent subproblems in Step S1 of Alg. 1 sequentially, for example, in a cyclic order. Once all independent subproblems are solved, a joint line search is performed as in Step S2 of Alg. 1. However, when the approximation subproblem w.r.t. is being solved, the solutions of previous approximation subproblems w.r.t. are already available, but they are not exploited.

An alternative is to apply the BCD algorithm, where the variable is first divided into blocks and the block variables are updated sequentially. Suppose is being updated at iteration , the following optimization problem w.r.t. the block variable (rather than the full variable ) is solved while the other block variables are fixed:

(9)

Convergence to a stationary point of problem (1) is guaranteed if, for example, is unique [3]. However, the optimization problem in (9) may still not be easy to solve. One approach is to apply Alg. 1 to solve (9) iteratively, but the resulting algorithm will be of two layers: Alg. 1 keeps iterating in the inner layer until a given accuracy is reached and the block variable to be updated next is selected in the outer layer.

To reduce the stringent requirement on the processing capability of the hardware imposed by the parallel SCA algorithms and the complexity of the BCD algorithm, we design in this section a BSCA algorithm: when the block variable is selected at iteration , all elements of are updated in parallel by solving an approximation subproblem w.r.t. (rather than the whole variable as in Alg. 1) that is presumably much easier to optimize than the original problem (9):

(10)

Note that and defined in (10) is an approximation function of and at a given point , respectively. We assume that the approximation function satisfies the following technical conditions:

(A1) The function is strictly convex in for any given ;

(A2) The function is continuously differentiable in for any given and continuous in for any ;

(A3) The gradient of and the gradient of w.r.t. are identical at for any , i.e., ;

(A4) A solution exists for any .

Since the objective function in (10) is strictly convex, is unique. If , then is a stationary point of the optimization problem in (9) given fixed [16, Prop. 1]. We thus consider the case that , and this implies that

(11)

It follows from the strict convexity of and Assumption (A3) that

(12)

Combining (11) and (12), we readily see that is a descent direction of at along the coordinate of in the sense that:

(13)

Then is updated according to the following expression: and

(14)

In other words, only the block variable is updated while other block variables are equal to their value at the previous iteration. The stepsize in (14) can be determined along the coordinate of efficiently by the line search introduced in the previous section, namely, either the exact line search

(15)

or the successive line search if the nonconvex differentiable function in (15) is still difficult to optimize: given predefined constants and , the stepsize is set to , where is the smallest nonnegative integer satisfying the inequality:

(16)

where is the descent in (13). Note that the line search in (15)-(16) is performed along the coordinate of only.

At the next iteration , a new block variable is selected and updated222We assume for simplicity that a single block variable is updated at each iteration, but all results can be generalized to the more general case that a subset of all block variables are updated at each iteration.. We consider two commonly used rules to select the block variable, namely, the cyclic update rule and the random update rule. Note that both of them are well-known (see [18, 19]), but we give their definitions for the sake of reference in later developments.

  • Cyclic update rule: The block variables are updated in a cyclic order. That is, the block variable with index

    (17a)
    is selected.
  • Random update rule: The block variables are selected randomly according to

    (17b)

    and

    . Note that any block variable can be selected with a nonzero probability, and some examples are given in

    [19].

The proposed BSCA algorithm is summarized in Alg. 2, and its convergence properties are given in the following theorem.

Initialization: , (arbitrary but fixed).

Repeat the following steps until convergence:

  1. Select the block variable to be updated according to (17).

  2. Compute according to (10).

  3. Determine the stepsize by the exact line search (15) or the successive line search (16).

  4. Update according to (14).

  5. and go to S1.

Algorithm 2 The proposed block successive convex approximation algorithm
Theorem 2.

Every limit point of the sequence generated by the BSCA algorithm in Alg. 2 is a stationary point of (1) (with probability 1 for the random update).

Proof:

See Appendix A in the supplementary materials. ∎

If is a global upper bound of , we can simply use a constant unit stepsize , that is,

(18)

for the reason that the constant unit stepsize always yields a larger decrease than the successive line search and the convergence is guaranteed (see the discussion on Assumption (A6) in [16, Sec. III]). In this case, update (18) has the same form as BSUM [4].333it is not possible to treat BSUM as a special case of BSCA or vice versa, as they are derived from different rationales which lead to different convergence conditions. However, their convergence conditions and techniques are different and do not imply each other. As a matter of fact, stronger results may be obtained, see [21].

There are several commonly used choices of approximation function , for example, the linear approximation and the quadratic approximation. We refer to [16, Sec. III-B] and [22, Sec. II.2.1] for more details and just comment on the following important cases.

  • Quadratic approximation:

    (19)

    where is a positive scalar. If is Lipschitz continuous, would be a global upper bound of when is sufficiently large. In this case, the variable update reduces to the well-known proximal operator:

    and this is also known as the proximal linear approximation. If we incorporate a stepsize as in the proposed BSCA algorithm, is strictly convex as long as is positive and the convergence is thus guaranteed by Theorem 2 (even when is not Lipschitz continuous).

  • Best-response approximation: If is strictly convex in each element of the block variable , the “best-response” type approximation function is

    (20)

    and it is not a global upper bound of . Note that is not necessarily convex in and the best-response approximation is different from the above proximal linear approximation and thus cannot be obtained from existing algorithmic frameworks [9, 21, 10].

  • Best-response approximation: If is furthermore strictly convex in , an alternative “best-response” type approximation function is

    (21)

    The approximation function in (21) is a trivial upper bound of , and the BSCA algorithm (18) reduces to the BCD algorithm (9). Adopting the approximation in (21) usually leads to fewer iterations than (20), as (21) is a “better” approximation in the sense that it is on the basis of the block variable , while the approximation in (20) is on the basis of each element of , namely, for all . Nevertheless, the approximation function (20) may be easier to optimize than (21) as the component functions are separable and each component function is a scalar function. This reflects the universal tradeoff between the number of iterations and the complexity per iteration.

  • Partial linearization approximation: Consider the function where is smooth and convex and is smooth. We can adopt the “partial linearization” approximation where is linearized while is left unchanged:

    (22)

    where

    is a positive scalar. The quadratic regularization is incorporated to make the approximation function strictly convex. It can be verified by using the chain rule that

    The partial linearization approximation is expected to yield faster convergence than quadratic approximation because the convexity of function is preserved in (22).

  • Hybrid approximation: For the above composition function , we can also adopt a hybrid approximation by further approximating the partial linearization approximation (22) by the best-response approximation (20):

    (23)

    The hybrid approximation function (23) is separable among the elements of , while it is not necessarily the case for the partial linearization approximation (22). The separable structure is desirable when is also separable among the elements of (for example ), because the approximation subproblem (10) would further boil down to parallel scalar problems. We remark that the partial linearization approximation and the hybrid approximation are only foreseen by SCA framework and cannot be obtained from other existing algorithmic frameworks [9, 21, 10].

Remark 3.

The above approximation is on the basis of blocks and it may be different from block to block. For example, consider where is strictly convex in , nonconvex in , and convex in . Then we can adopt the best-response approximation for , the quadratic approximation for , and partial linearization (or hybrid) approximation for :

The most suitable approximation always depends on the application and the universal tradeoff between the number of iterations and the complexity per iteration. SCA offers sufficient flexibility to address this tradeoff.

The proposed BSCA algorithm described in Alg. 2 is complementary to the parallel SCA algorithm in Alg. 1. On the one hand, the update of the elements of a particular block variable in the BSCA algorithm is based on the same principle as in the parallel SCA algorithm, namely, to obtain the descent direction by minimizing a convex approximation function and to calculate the stepsize by the line search scheme. On the other hand, in contrast to the parallel update in the parallel SCA algorithm, the block variables are updated sequentially in the BSCA algorithm, and it poses a less demanding requirement on the memory/processing unit.

We draw some comments on the proposed BSCA algorithm.

On the connection to traditional BCD algorithms. The point in (14) is obtained by moving from the current point along a descent direction . On the one hand, is in general not the best-response employed in the traditional BCD algorithm (9). That is,

Therefore, the proposed algorithm is essentially an inexact BCD algorithm. On the other hand, Theorem 2 establishes that eventually there is no loss of optimality adopting inexact solutions as long as the approximate functions satisfy the assumptions (A1)-(A4).

On the flexibility. The assumptions (A1)-(A4) on the approximation function are quite general and they include many existing algorithms as a special case; see also Remark 3. Different from the BSUM algorithm, the proposed approximation function does not have to be a global upper bound of the original function, but a stepsize is needed to avoid aggressive update.

On the convergence speed. The mild assumptions on the approximation functions allow us to design an approximation function that exploits the original problem’s structure (such as the partial convexity in (20)-(21)) and this leads to faster convergence. The use of line search also attributes to a faster convergence than decreasing stepsizes used in literature, for example [15, 18].

On the complexity. The proposed BSCA exhibits low complexity for several reasons. Firstly, the exact line search consists of minimizing a differentiable function which in many applications even admits a closed-form solution. In the successive line search, the nonsmooth function only needs to be evaluated once at the point . Secondly, the problem size that can be handled by the BSCA algorithm is much larger. Thirdly, when the number of block variables is small, the BSCA algorithm is easier to implement than the parallel SCA algorithm as no coordination among processors is required.

On the convergence conditions. The BSCA algorithm converges under fairly weak assumptions. Compared with [4] and [15], the BSCA algorithm is applicable for nonsmooth nonconvex optimization problems, and it converges even when the gradient of is not Lipschitz continuous, respectively.

We will show by several applications in Sec. V the benefits of the above attractive features.

Iv The Proposed Inexact Block Successive Convex Approximation Algorithms

In the previous section, the approximation subproblem in (10) is assumed to be solved exactly, and this assumption is satisfied when has a closed-form expression. However, if does not have a closed-form expression for some choice of the approximation function, it must be found numerically by an iterative algorithm. In this case, Alg. 2 would consist of two layers: the outer layer with index follows the same procedure as Alg. 2, while the inner layer comprising the iterative algorithm for (10) is nested under S2 of Alg. 2. As most iterative algorithms exhibit asymptotic convergence only, in practice, they are terminated when we obtain a point, denoted as , that is “sufficiently accurate” in the sense that the so-called error bound is smaller than some given that decreases to zero as increases [15, 19]. Nevertheless, results on the error bound are mostly available when is strongly convex, see [23, Ch. 6]. In general, the error bound is very difficult to verify in practice.

In this section, we propose an inexact BSCA algorithm where (10) is solved inexactly, but we do not pose any quantitative requirement on its error bound. The central idea is hidden in (4): replacing by any point and repeating the same steps in (11)-(13), we see that is a descent direction of at if

Such a point can be obtained by running the standard parallel SCA algorithm (reviewed in Sec. II) to solve the approximation subproblem (10) in S2 of Alg. 2 for a finite number of iterations only. At iteration of the inner layer, we define (the superscript “i” stands for inner layer) as an approximation of the outer-layer approximation function at (with ), and solve the inner-layer approximation subproblem

(24)

where . This problem is presumably designed to be much easier to solve exactly than the outer-layer approximation subproblem in (10), for example, a closed-form solution exists; such an example will be given in Sec. V-C. We assume the function satisfies the following technical assumptions.

(B1) The function is convex in for any given and ;

(B2) The function is continuously differentiable in for any given and , and continuous in and for any ;

(B3) The gradient of and the gradient of w.r.t. are identical at for any , i.e., ;

(B4) A solution exists for any ;

(B5) For any bounded sequence , the sequence is also bounded.

Since is strictly convex in and its (global) minimum value is achieved at , it is either

(25)

or

(26)

On the one hand, if (25) is true, and the outer-layer approximation subproblem in (10) has been solved exactly [16, Prop. 1]. On the other hand, (26) implies that is a descent direction of the outer-layer approximation function at , i.e.,

(27)

Therefore we can update by

(28)

where is calculated by either the exact line search along the coordinate of over the outer-layer approximation function (rather than the original function ):