In this paper, we consider the optimization problem
where the function is proper, is smooth but not necessarily convex, is convex but not necessarily smooth, and the constraint set has a Cartesian product structure with being closed and convex for all . Such a formulation plays a fundamental role in signal processing and machine learning, and typically
models the estimation error or empirical loss whileis a regularization (penalty) function promoting in the solution a certain structure known a priori such as sparsity .
For a large-scale nonconvex optimization problem of the form (1), the block coordinate descent (BCD) algorithm has been recognized as an efficient and reliable numerical method. Its variable update is based on the so-called nonlinear best-response [2, 3, 4, 5, 6]: at each iteration of the BCD algorithm, one block variable, say , is updated by its best-response while the other block variables are fixed to their values of the preceding iteration
That is, the best-response is the optimal point that minimizes with respect to (w.r.t.) the variable .
The BCD algorithm has several notable advantages. First of all, the subproblem (2) (w.r.t. a block variable ) is much easier to solve than the original problem (1) (w.r.t. the whole set of variables ), and the best-response even have a closed-form expression in many applications, for example LASSO . It is thus suitable for implementation on hardware with limited memory and/or computational capability. Secondly, as all block variables are updated sequentially, when a block variable is updated, the newest value of other block variables is always incorporated. These two attractive features can sometimes lead to even faster convergence than their parallel counterpart, namely, the Jacobi algorithm (also known as the parallel best-response algorithm) .
In cases where the subproblems (2) are still difficult to solve and/or (sufficient) convergence conditions (mostly on the convexity of and the uniqueness of , see [8, 3, 9] and the references therein) are not satisfied, several extensions have been proposed. Their central idea is to solve the optimization problem (2) inexactly. For example, in the block successive upper bound minimization (BSUM) algorithm , a global upper bound function of is minimized at each iteration. Common examples are proximal approximations  and, if is block Lipschitz continuous111That is, is Lipschitz continuous for all ., proximal-linear approximation [9, 10]. However, for the BSUM algorithm, a global upper bound function may not exist for some (and thus ).
The block Lipschitz continuity assumption is not needed if a stepsize is employed in the variable update. In practice, the stepsize can be determined by line search [11, 12]. Nevertheless, only a specific approximation of is considered, namely, quadratic approximation. Sometimes it may be desirable to use other approximations to better exploit the problem structure, for example, best-response approximation and partial linearization approximation when the nonconvex function has “partial” convexity (their precise descriptions are provided in Section III) . This is the central idea in recent (parallel) successive convex approximation (SCA) algorithms [13, 14, 15, 16, 17] and block successive convex approximation (BSCA) algorithms [4, 18, 19], which consist in solving a sequence of successively refined convex approximation subproblems. A new line search scheme to determine the stepsize is also proposed in [16, 17]: it is carried out over a properly constructed smooth function and its complexity is much lower than traditional schemes that directly operate on the original nonsmooth function [11, 13, 12]. For example, as we will see later in the applications studied in this paper, when
represents a quadratic loss function, the exact line search has a simple analytical expression.
Nevertheless, existing BSCA schemes also have their limitations: the BSCA algorithm proposed in  is not applicable when the objective function is nonsmooth, and the convergence of the BSCA algorithms proposed in [18, 19] is only established under the assumption that is Lipschitz continuous and the stepsizes are decreasing. Although it is shown in  that constant stepsizes can also be used, the choice of the constant stepsizes depends on the Lipschitz constant of that is not easy to obtain/estimate when the problem dimension is extremely large.
The standard SCA and BSCA algorithms [12, 4, 18, 16, 17] are based on the assumption that the approximation subproblem is solved perfectly at each iteration. Unless the approximation subproblems have a closed-form solution, this assumption can hardly be satisfied by iterative algorithms that exhibit an asymptotic convergence only as they must be terminated after a finite number of iterations in practice. It is shown in [15, 19] that convergence is still guaranteed if the approximation subproblems are solved more and more accurately. However, the solution accuracy is specified by an error bound which is difficult to verify in practice. A different approach is adopted in  where the optimization problem (2) is solved inexactly by running the gradient projection algorithm for a finite number of iterations. Nevertheless, its convergence is only established for the specific application in nonnegative matrix factorization.
In this paper, we propose a block successive convex approximation (BSCA) framework for the nonsmooth nonconvex problem (1) by extending the parallel update scheme in  to a block update scheme. The proposed BSCA algorithm consists in optimizing a sequence of successively refined approximation subproblems, and has several attractive features.
The approximation function is a strictly convex approximation of the original function and it does not need to be a global upper bound of the original function;
The stepsize is calculated by performing the (exact or successive) line search scheme along the coordinate of the block variable being updated and has low complexity as it is carried out over a properly constructed smooth function;
If the approximation subproblem does not admit a closed-form solution and is solved iteratively by a descent algorithm, for example the (parallel) SCA algorithm proposed in , the descent algorithm can be terminated after a finite number of iterations;
The BSCA algorithm has a guaranteed convergence to a stationary point, even when is not multiconvex and/or is not block Lipschitz continuous.
These features are distinctive from existing works from the following aspects:
These attractive features are illustrated by several applications in signal processing and machine learning, namely, network anomaly detection and phase retrieval.
The rest of the paper is structured as follows. In Sec. II, we give a brief review of the SCA framework proposed in . In Sec. III, the BSCA framework together with the convergence analysis is formally presented. An inexact BSCA framework is proposed in Sec. IV. The attractive features of the proposed (exact and inexact) BSCA framework are illustrated through several applications in Sec. V. Finally some concluding remarks are drawn in Sec. VI.
Notation: We use , and
to denote a scalar, vector and matrix, respectively. We useand to denote the -th element and the -th column of , respectively; is the -th element of where , and denotes all elements of except : . We denote and as the element-wise operation, i.e., and , respectively. Notation and denotes the Hadamard product between and , and the Kronecker product between and , respectively. The operator returns the element-wise projection of onto : . We denote as the vector that consists of the diagonal elements of and is a diagonal matrix whose diagonal vector is . We use to denote a vector with all elements equal to 1. The operator specifies the -norm of and it denotes the spectral norm when is not specified. denotes the soft-thresholding operator: .
Ii Review of the Successive Convex Approximation framework
In this section, we present a brief review of (a special case of) the SCA framework developed in  for problem (1). It consists of solving a sequence of successively refined approximation subproblems: given at iteration , the approximation function of w.r.t. is denoted as , and the approximation subproblem consists of minimizing the approximation function over the constraint set :
where satisfies several technical assumptions, most notably,
Convexity: The function is convex in for any given ;
Gradient Consistency: The gradient of and the gradient of are identical at , i.e., ;
The approximation subproblem (3) can readily be decomposed into independent subproblems that can be solved in parallel: and
The approximation function in (3) is a special case of the general SCA framework developed in  because it is separable among the different block variables. More generally, , the approximation function of , only needs to be convex and differentiable with the same gradient as at , and it does not necessarily admit a separable structure.
Since is an optimal point of problem (3), we have
where , and is due to the optimality of , the convexity of in and the gradient consistency assumption, respectively. Therefore is a descent direction of the original objective function in (1) along which the function value can be further decreased compared with [16, Prop. 1]. This motivates us to refine and define as follows:
where is the stepsize that needs to be selected properly to yield a fast convergence.
It is natural to select a stepsize such that the function is minimized w.r.t. :
and this is the so-called exact line search (also known as the minimization rule). For nonsmooth optimization problems, the traditional exact line search usually suffers from a high complexity as the optimization problem (6) is nondifferentiable. It is shown in [16, Sec. III-A] that the stepsize obtained by performing the exact line search over the following differentiable function also yields a decrease in :
Initialization: and (arbitrary but fixed).
Repeat the following steps until convergence:
Compute by solving the following independent optimization problems in parallel:
Update : .
and go to S1.
If the scalar differentiable optimization problem in (7) is still difficult to solve, the low-complexity successive line search (also known as the Armijo rule) can be used instead [16, Sec. III-A]: given scalars and , the stepsize is set to be , where is the smallest nonnegative integer satisfying
where is the descent defined in (4).
Iii The Proposed Block Successive Convex Approximation Algorithms
From a theoretical perspective, Alg. 1 is fully parallelizable. In practice, however, it may not be fully parallelized when the problem dimension exceeds the hardware’s memory and/or processing capability. We could naively solve the independent subproblems in Step S1 of Alg. 1 sequentially, for example, in a cyclic order. Once all independent subproblems are solved, a joint line search is performed as in Step S2 of Alg. 1. However, when the approximation subproblem w.r.t. is being solved, the solutions of previous approximation subproblems w.r.t. are already available, but they are not exploited.
An alternative is to apply the BCD algorithm, where the variable is first divided into blocks and the block variables are updated sequentially. Suppose is being updated at iteration , the following optimization problem w.r.t. the block variable (rather than the full variable ) is solved while the other block variables are fixed:
Convergence to a stationary point of problem (1) is guaranteed if, for example, is unique . However, the optimization problem in (9) may still not be easy to solve. One approach is to apply Alg. 1 to solve (9) iteratively, but the resulting algorithm will be of two layers: Alg. 1 keeps iterating in the inner layer until a given accuracy is reached and the block variable to be updated next is selected in the outer layer.
To reduce the stringent requirement on the processing capability of the hardware imposed by the parallel SCA algorithms and the complexity of the BCD algorithm, we design in this section a BSCA algorithm: when the block variable is selected at iteration , all elements of are updated in parallel by solving an approximation subproblem w.r.t. (rather than the whole variable as in Alg. 1) that is presumably much easier to optimize than the original problem (9):
Note that and defined in (10) is an approximation function of and at a given point , respectively. We assume that the approximation function satisfies the following technical conditions:
(A1) The function is strictly convex in for any given ;
(A2) The function is continuously differentiable in for any given and continuous in for any ;
(A3) The gradient of and the gradient of w.r.t. are identical at for any , i.e., ;
(A4) A solution exists for any .
Since the objective function in (10) is strictly convex, is unique. If , then is a stationary point of the optimization problem in (9) given fixed [16, Prop. 1]. We thus consider the case that , and this implies that
It follows from the strict convexity of and Assumption (A3) that
Then is updated according to the following expression: and
In other words, only the block variable is updated while other block variables are equal to their value at the previous iteration. The stepsize in (14) can be determined along the coordinate of efficiently by the line search introduced in the previous section, namely, either the exact line search
or the successive line search if the nonconvex differentiable function in (15) is still difficult to optimize: given predefined constants and , the stepsize is set to , where is the smallest nonnegative integer satisfying the inequality:
At the next iteration , a new block variable is selected and updated222We assume for simplicity that a single block variable is updated at each iteration, but all results can be generalized to the more general case that a subset of all block variables are updated at each iteration.. We consider two commonly used rules to select the block variable, namely, the cyclic update rule and the random update rule. Note that both of them are well-known (see [18, 19]), but we give their definitions for the sake of reference in later developments.
Cyclic update rule: The block variables are updated in a cyclic order. That is, the block variable with index
(17a) is selected.
The proposed BSCA algorithm is summarized in Alg. 2, and its convergence properties are given in the following theorem.
See Appendix A in the supplementary materials. ∎
If is a global upper bound of , we can simply use a constant unit stepsize , that is,
for the reason that the constant unit stepsize always yields a larger decrease than the successive line search and the convergence is guaranteed (see the discussion on Assumption (A6) in [16, Sec. III]). In this case, update (18) has the same form as BSUM .333it is not possible to treat BSUM as a special case of BSCA or vice versa, as they are derived from different rationales which lead to different convergence conditions. However, their convergence conditions and techniques are different and do not imply each other. As a matter of fact, stronger results may be obtained, see .
There are several commonly used choices of approximation function , for example, the linear approximation and the quadratic approximation. We refer to [16, Sec. III-B] and [22, Sec. II.2.1] for more details and just comment on the following important cases.
where is a positive scalar. If is Lipschitz continuous, would be a global upper bound of when is sufficiently large. In this case, the variable update reduces to the well-known proximal operator:
and this is also known as the proximal linear approximation. If we incorporate a stepsize as in the proposed BSCA algorithm, is strictly convex as long as is positive and the convergence is thus guaranteed by Theorem 2 (even when is not Lipschitz continuous).
Best-response approximation: If is strictly convex in each element of the block variable , the “best-response” type approximation function is
and it is not a global upper bound of . Note that is not necessarily convex in and the best-response approximation is different from the above proximal linear approximation and thus cannot be obtained from existing algorithmic frameworks [9, 21, 10].
Best-response approximation: If is furthermore strictly convex in , an alternative “best-response” type approximation function is
The approximation function in (21) is a trivial upper bound of , and the BSCA algorithm (18) reduces to the BCD algorithm (9). Adopting the approximation in (21) usually leads to fewer iterations than (20), as (21) is a “better” approximation in the sense that it is on the basis of the block variable , while the approximation in (20) is on the basis of each element of , namely, for all . Nevertheless, the approximation function (20) may be easier to optimize than (21) as the component functions are separable and each component function is a scalar function. This reflects the universal tradeoff between the number of iterations and the complexity per iteration.
Partial linearization approximation: Consider the function where is smooth and convex and is smooth. We can adopt the “partial linearization” approximation where is linearized while is left unchanged:
is a positive scalar. The quadratic regularization is incorporated to make the approximation function strictly convex. It can be verified by using the chain rule that
The partial linearization approximation is expected to yield faster convergence than quadratic approximation because the convexity of function is preserved in (22).
Hybrid approximation: For the above composition function , we can also adopt a hybrid approximation by further approximating the partial linearization approximation (22) by the best-response approximation (20):
The hybrid approximation function (23) is separable among the elements of , while it is not necessarily the case for the partial linearization approximation (22). The separable structure is desirable when is also separable among the elements of (for example ), because the approximation subproblem (10) would further boil down to parallel scalar problems. We remark that the partial linearization approximation and the hybrid approximation are only foreseen by SCA framework and cannot be obtained from other existing algorithmic frameworks [9, 21, 10].
The above approximation is on the basis of blocks and it may be different from block to block. For example, consider where is strictly convex in , nonconvex in , and convex in . Then we can adopt the best-response approximation for , the quadratic approximation for , and partial linearization (or hybrid) approximation for :
The most suitable approximation always depends on the application and the universal tradeoff between the number of iterations and the complexity per iteration. SCA offers sufficient flexibility to address this tradeoff.
The proposed BSCA algorithm described in Alg. 2 is complementary to the parallel SCA algorithm in Alg. 1. On the one hand, the update of the elements of a particular block variable in the BSCA algorithm is based on the same principle as in the parallel SCA algorithm, namely, to obtain the descent direction by minimizing a convex approximation function and to calculate the stepsize by the line search scheme. On the other hand, in contrast to the parallel update in the parallel SCA algorithm, the block variables are updated sequentially in the BSCA algorithm, and it poses a less demanding requirement on the memory/processing unit.
We draw some comments on the proposed BSCA algorithm.
On the connection to traditional BCD algorithms. The point in (14) is obtained by moving from the current point along a descent direction . On the one hand, is in general not the best-response employed in the traditional BCD algorithm (9). That is,
Therefore, the proposed algorithm is essentially an inexact BCD algorithm. On the other hand, Theorem 2 establishes that eventually there is no loss of optimality adopting inexact solutions as long as the approximate functions satisfy the assumptions (A1)-(A4).
On the flexibility. The assumptions (A1)-(A4) on the approximation function are quite general and they include many existing algorithms as a special case; see also Remark 3. Different from the BSUM algorithm, the proposed approximation function does not have to be a global upper bound of the original function, but a stepsize is needed to avoid aggressive update.
On the convergence speed. The mild assumptions on the approximation functions allow us to design an approximation function that exploits the original problem’s structure (such as the partial convexity in (20)-(21)) and this leads to faster convergence. The use of line search also attributes to a faster convergence than decreasing stepsizes used in literature, for example [15, 18].
On the complexity. The proposed BSCA exhibits low complexity for several reasons. Firstly, the exact line search consists of minimizing a differentiable function which in many applications even admits a closed-form solution. In the successive line search, the nonsmooth function only needs to be evaluated once at the point . Secondly, the problem size that can be handled by the BSCA algorithm is much larger. Thirdly, when the number of block variables is small, the BSCA algorithm is easier to implement than the parallel SCA algorithm as no coordination among processors is required.
On the convergence conditions. The BSCA algorithm converges under fairly weak assumptions. Compared with  and , the BSCA algorithm is applicable for nonsmooth nonconvex optimization problems, and it converges even when the gradient of is not Lipschitz continuous, respectively.
We will show by several applications in Sec. V the benefits of the above attractive features.
Iv The Proposed Inexact Block Successive Convex Approximation Algorithms
In the previous section, the approximation subproblem in (10) is assumed to be solved exactly, and this assumption is satisfied when has a closed-form expression. However, if does not have a closed-form expression for some choice of the approximation function, it must be found numerically by an iterative algorithm. In this case, Alg. 2 would consist of two layers: the outer layer with index follows the same procedure as Alg. 2, while the inner layer comprising the iterative algorithm for (10) is nested under S2 of Alg. 2. As most iterative algorithms exhibit asymptotic convergence only, in practice, they are terminated when we obtain a point, denoted as , that is “sufficiently accurate” in the sense that the so-called error bound is smaller than some given that decreases to zero as increases [15, 19]. Nevertheless, results on the error bound are mostly available when is strongly convex, see [23, Ch. 6]. In general, the error bound is very difficult to verify in practice.
In this section, we propose an inexact BSCA algorithm where (10) is solved inexactly, but we do not pose any quantitative requirement on its error bound. The central idea is hidden in (4): replacing by any point and repeating the same steps in (11)-(13), we see that is a descent direction of at if
Such a point can be obtained by running the standard parallel SCA algorithm (reviewed in Sec. II) to solve the approximation subproblem (10) in S2 of Alg. 2 for a finite number of iterations only. At iteration of the inner layer, we define (the superscript “i” stands for inner layer) as an approximation of the outer-layer approximation function at (with ), and solve the inner-layer approximation subproblem
where . This problem is presumably designed to be much easier to solve exactly than the outer-layer approximation subproblem in (10), for example, a closed-form solution exists; such an example will be given in Sec. V-C. We assume the function satisfies the following technical assumptions.
(B1) The function is convex in for any given and ;
(B2) The function is continuously differentiable in for any given and , and continuous in and for any ;
(B3) The gradient of and the gradient of w.r.t. are identical at for any , i.e., ;
(B4) A solution exists for any ;
(B5) For any bounded sequence , the sequence is also bounded.
Since is strictly convex in and its (global) minimum value is achieved at , it is either
On the one hand, if (25) is true, and the outer-layer approximation subproblem in (10) has been solved exactly [16, Prop. 1]. On the other hand, (26) implies that is a descent direction of the outer-layer approximation function at , i.e.,
Therefore we can update by
where is calculated by either the exact line search along the coordinate of over the outer-layer approximation function (rather than the original function ):