 # A Block Decomposition Algorithm for Sparse Optimization

Sparse optimization is a central problem in machine learning and computer vision. However, this problem is inherently NP-hard and thus difficult to solve in general. Combinatorial search methods find the global optimal solution but are confined to small-sized problems, while coordinate descent methods are efficient but often suffer from poor local minima. This paper considers a new block decomposition algorithm that combines the effectiveness of combinatorial search methods and the efficiency of coordinate descent methods. Specifically, we consider a random strategy or/and a greedy strategy to select a subset of coordinates as the working set, and then perform a global combinatorial search over the working set based on the original objective function. We show that our method finds stronger stationary points than Amir Beck et al.'s coordinate-wise optimization method. In addition, we establish the global convergence and convergence rate of our block decomposition algorithm. Our experiments on solving sparse regularized and sparsity constrained least squares optimization problems demonstrate that our method achieves state-of-the-art performance in terms of accuracy.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

In this paper, we mainly focus on the following sparse optimization problem (‘’ means define):

 minx∈Rn F(x)≜f(x)+h(x) (1)

where is a smooth convex function with its gradient being -Lipschitz continuous, and is the following nonconvex sparse regularized/sparsity constrained function:

 {hregu(x)≜λ∥x∥0+IΩ(x)}       or       {hcons(x)≜IΨ(x), Ψ≜{x | ∥x∥0≤s}}. (2)

Here, , is an indicator function on the set with ,

is a function that counts the number of nonzero elements in a vector,

and are positive constants, and is a positive integer. Problem (1) captures a variety of applications of interest in both machine learning and computer vision (e.g., sparse coding candes2005decoding ; Donoho06 ; aharon2006img ; BaoJQS16 , sparse subspace clustering elhamifar2013sparse ).

This paper proposes a block decomposition algorithm using a proximal strategy and a combinatorial search strategy for solving the sparse optimization problem as in (1). We review existing methods in the literature and summarize the merit of our approach.

The Relaxed Approximation Method. One popular method to solve (1) is the convex or nonconvex relaxed approximation method. Many approaches such as norm, top- norm, Schatten norm, re-weighted norm, capped norm, and half quadratic function have been proposed for solving sparse optimization problems in the last decade. It is generally believed that nonconvex methods often achieve better accuracy than the convex counterparts. However, minimizing the approximate function does not necessarily lead to the minimization of the original function in Problem (1). Our method directly controls the sparsity of the solution and minimize the original problem.

The Greedy Pursuit Method. This method is often used to solve sparsity constrained optimization problems. It greedily selects at each step one coordinate of the variables which have some desirable benefits tropp2007signal ; dai2009subspace ; needell2010signal ; blumensath2008gradient ; needell2009cosamp . This method has a monotonically decreasing property and achieves optimality guarantees in some situations, but it is limited to solving problems with smooth objective functions (typically the square function). Furthermore, the solutions must be initialized to zero and may cause divergence when being incorporated to solve the bilinear matrix factorization problem BaoJQS16 . Our method is a greedy coordinate descent algorithm without forcing the initial solution to zero.

The Combinatorial Search Method. This method is typically concerned with NP-hard problems conforti2014integer

. A naive stradigy is an exhaustive search which systematically enumerates all possible candidates for the solution and picks the best candidate corresponding to the lowest objective value. The cutting plane method solves the convex linear programming relaxation and adds linear constraints to drive the solution towards binary variables, while the branch-and-cut method performs branches and applies cuts at the nodes of the tree having a lower bound that is worse than the current solution. Although in some cases these two methods converge without much effort, in the worse case they end up solving all

convex subproblems. Our approach leverages the effectiveness of combinatorial search methods.

The Proximal Gradient Method. Based on the current gradient , the proximal gradient method beck2013sparsity ; lu2014iterative ; jain2014iterative ; nguyen2014linear ; patrascu2015efficient ; PatrascuN15 ; li2016nonconvex iteratively performs a gradient update followed by a proximal operation: . Here the proximal operator can be evaluated analytically, and is the step size with being the Lipschitz constant. This method is closely related to (block) coordinate descent nesterov2012efficiency ; chang2008coordinate ; breheny2011coordinate ; de2016fast ; razaviyayn2013unified ; beck2013convergence ; hong2013iteration ; lu2015complexity ; xu2013block

in the literature. Due to its simplicity, many strategies (e.g., variance reduction

johnson2013accelerating ; xiao2014proximal ; chen2016accelerated , asynchronous parallelism liu2015asynchronous ; recht2011hogwild , and non-uniform sampling zhang2016accelerated ) have been proposed to accelerate proximal gradient method. However, existing works use a scalar step size and solve a first-order majorization/surrogate function via closed-form updates. Since Problem (1) is nonconvex, such a simple majorization function may not necessarily be a good approximation. Our method significantly outperforms proximal gradient method and inherits its computational advantages.

Contributions: The contributions of this paper are three-fold. (i) Algorithmically, we introduce a novel block decomposition method for sparse optimization (See Section 2). (ii) Theoretically, we establish the optimality hierarchy of our algorithm and show that it always finds stronger stationary points than existing methods (See Section 3). Furthermore, we prove the global convergence and convergence rate of our algorithm (See Section 4). (iii) Empirically, we have conducted experiments on some sparse optimization tasks to show the superiority of our method (See Section 5).

Notation: All vectors are column vectors and superscript denotes transpose. For any vector and any , we denote by the s-th component of . The Euclidean inner product between and is denoted by or . is a unit vector with a in the th entry and in all other entries. The number of possible combinations choosing items from without repetition is denoted by . For any containing unique integers selected from , we define .

## 2 Proposed Block Decomposition Algorithm

This section presents our block decomposition algorithm for solving (1). Our algorithm is an iterative procedure. In every iteration, the index set of variables is separated into two sets and , where is the working set. We fix the variables corresponding to , while minimizing a sub-problem on variables corresponding to . We use to denote the sub-vector of indexed by . The proposed method is summarized in Algorithm 1.

At first glance, Algorithm 1 might seem to be merely a block coordinate descent algorithm tseng2009coordinate applied to (1). However, it has some interesting properties that are worth commenting on.

Two Novel Strategies. (i) Instead of using majorization techniques for optimizing over the block of the variables, we consider minimizing the original objective function. Although the subproblem is NP-hard and admits no closed-form solution, we can use an exhaustive search to solve it exactly. (ii) We consider a proximal point strategy for the subproblem in (3). This is to guarantee sufficient descent condition for the optimization problem and global convergence of Algorithm 1 (refer to Theorem 2).

Solving the Subproblem Globally. The subproblem in (3) essentially contains unknown decision variables and can be solved exactly within sub-exponential time . Using the variational reformulation of pseudo-norm bienstock1996computational 111For all with , it always holds that , Problem (3) can be reformulated as a mixed-integer optimization problem and solved by some global optimization solvers such as ‘CPLEX’ or ‘Gurobi’. For simplicity, we consider a simple exhaustive search (a.k.a. generate and test method) to solve it. Specifically, for every coordinate of the -dimensional subproblem, it has two states, i.e., zero/nonzero. We systematically enumerate the full binary tree to obtain all possible candidate solutions and then pick the best one that leads to the lowest objective value as the optimal solution.

Finding the Working Set. We observe that it contains possible combinations of choice for the working set. One may use a cyclic strategy to alternatingly select all the choices of the working set. However, past results show that the coordinate gradient method results in faster convergence when the working set is chosen in an arbitrary order hsieh2008dual or in a greedy manner tseng2009coordinate ; hsieh2011fast . This inspires us to use a random strategy or a greedy strategy for finding the working set. We remark that the combination of the two strategies is preferred in practice.

We uniformly select one combination (which contains coordinates) from the whole working set of size . One good benefit of this strategy is that our algorithm is ensured to find a block- stationary point (discussed later) in expectation.

Generally speaking, we pick the top- coordinates that lead to the greatest descent when one variable is changed and the rest variables are fixed based on the current solution . We denote and . For , we solve a one-variable subproblem to compute the possible decrease for all of when changing from zero to nonzero:

 ∀i=1,...,|Z|, ci=minαF(xt+αei)−F(xt).

For , we compute the decrease for each coordinate of when changing from nonzero to exactly zero:

 ∀j=1,...,|¯Z|, dj=F(xt+αej)−F(xt), α=xtj.

We sort the vectors and in increasing order and then pick the top- coordinates as the working set.

## 3 Optimality Analysis

This section provides an optimality analysis of our method. In the sequel, we present some necessary optimal conditions for (1). Since the block- optimality condition is novel in this paper, it is necessary to clarify its relations with existing optimality conditions formally. We use , and to denote an arbitrary basic stationary point, an -stationary point, and a block- stationary point, respectively.

###### Definition 1.

(Basic Stationary Point) A solution is called a basic stationary point if the following holds. . Here, .

Remarks: The basic stationary point states that the solution achieves its global optimality when the support set is restricted. One good feature of the basic stationary condition is that the solution set is enumerable and its size is . This makes it possible to validate whether a solution is optimal for the original sparse optimization problem.

###### Definition 2.

(-Stationary Point) A solution is an -stationary point if it holds that: with .

Remarks: This is the well-known proximal thresholding operator beck2013sparsity . The term is a majorization function of and it always holds that for all and . Although it has a closed-form solution, this simple surrogate function may not be a good majorization/surrogate function for the non-convex problem.

###### Definition 3.

(Block- Stationary Point) A solution is a block- stationary point if it holds that:

 ¯x∈argminz∈Rn P(z;¯x,B)≜{F(z), s.t. z¯B=¯x¯B}, ∀|B|=k. (4)

Remarks: (i) The concept of block- stationary point is novel in this paper. (ii) The sub-problem is NP-hard, and it takes sub-exponential time . However, since is often very small, it can be tackled by some practical global optimization methods. (iii) Testing whether a solution is a block- stationary point deterministically requires solving subproblems, therefore leading to a total time complexity of . However, using a random strategy for finding the working set from combinations, we can test whether a solution is the block- stationary point in expectation within a time complexity of with the constant

being the number of iterations which is related to the confidence of the probability.

The following proposition states the relations between the three types of the stationary point.

###### Proposition 1.

Optimality Hierarchy between the Necessary Optimality Conditions. The following optimality hierarchy holds: with for , and for .

###### Proof.

We denote . is the operator that sets all but the largest (in magnitude) elements of to zero.

(1) First, we prove that an -stationary point is also a basic stationary point when . Using Definition 2, we have the following closed-form solution for :

 xi={Π(xi−∇if(x)/L),(xi−∇if(`x)/L)2>2λ/L;0,else.

This implies that there exists a support set such that , which is the optimal condition for a basic stationary point. Defining , we notice that , and . Second, we prove that an -stationary point is also a basic stationary point when . For an -stationary point, we have . This implies that , which is the optimal condition for a basic stationary point.

(2) First, we prove that a block-1 stationary point is also an -stationary point for . Assume that the convex objective function has coordinate-wise Lipschitz continuous gradient with constant . For all , it holds that nesterov2012efficiency : Any block-1 stationary point must satisfy the following relation: We have the following optimal condition for with : Since , the latter formulation implies the former one.

Second, we prove that a block-2 stationary point is also an -stationary point for . Given a vector , we consider the following optimization problem:

 {z∗B=argminzB ∥z−a∥22, s.t. ∥zB∥0+∥z¯B∥0≤s, ∀|B|=2}, (5)

which in fact contains 2-dimensional subproblems. It is not hard to validate that (5) achieves the optimal solution with . For any block-2 stationary point , we have . Applying this conclusion with , we have .

(3) It is not hard to notice that the subproblem for the block- stationary point is a subset of those of the block- stationary point when . Therefore, the block- stationary point implies the block- stationary point when .

(4) It is clear that any block- stationary point is also the optimal global solution.

Remarks: It is worthwhile to point out that the seminal works of beck2013sparsity ; Beck2016 also present a coordinate-wise optimality condition for sparse optimization. However, our block- condition is stronger since their optimality condition corresponds to in our optimality condition framework.

A Running Example. We consider the quadratic optimization problem where . The parameters for and are set to . The stationary point distribution of this example can be found in Table 1. This problem contains basic stationary points for , while it has basic stationary points for . As becomes large, the newly introduced type of local minimizer (i.e., block- stationary point) become more restricted in the sense that they have a smaller number of stationary points.

## 4 Convergence Analysis

This section provides some convergence analysis for Algorithm 1. We assume that the working set of size is selected randomly and uniformly (sample with replacement). Due to space limitations, some proofs are placed into the supplementary material.

###### Proposition 2.

Global Convergence. Letting be the sequence generated by Algorithm 1, we have the following results. (i) It holds that: . (ii) As , converges to the block- stationary point of (1) in expectation.

Remarks: Coordinate descent may cycle indefinitely if each minimization step contains multiple solutions Powell1973 . The introduction of the strongly convex parameter is necessary for our nonconvex problem since it guarantees sufficient decrease condition, which is essential for global convergence. Our algorithm is guaranteed to find the block- stationary point, but it is in expectation.

The following theorem establishes some convergence properties of our algorithm for sparse regularized optimization with .

###### Theorem 1.

Convergence Properties for Sparse Regularized Optimization. For , we have the following results:

(a) It holds that for all with , where . Whenever , we have and the objective value is decreased at least by . The solution changes at most times in expectation for finding a block- stationary point . Here and are respectively defined as:

 D≜kθδ22n,  ¯J≜F(x0)−F(¯x)D. (6)

(b) Assume that is generally convex. If the support set of does not changes for all , Algorithm 1 takes at most iterations in expectation to converge to a stationary point satisfying . Moreover, Algorithm 1 takes at most in expectation to converge to a stationary point satisfying . Here, is defined as:

 V1=max(4ν2θ,√2ν2(F(x0)−F(¯x))θ)/ϵ, % with ν≜2nρ√kθk. (7)

(c) Assume that is -strongly convex. If the support set of does not changes for all , Algorithm 1 takes at most iterations in expectation to converge to a stationary point satisfying . Moreover, Algorithm 1 takes at most in expectation to converge to a stationary point satisfying . Here, is defined as:

 V2=logα(ϵ/(F(x0)−F(¯x))), with α≜nθkσ/(1+nθkσ). (8)

Remarks: (i) When the support set is fixed, the optimization problem reduces to a convex problem. (ii) We derive a upper bound for the number of changes for the support set in (a), and a upper bound on the number of iterations (or ) performed after the support set is fixed in (b) (or (c)). Multiplying these two bounds, we can establish the actual number of iterations for Algorithm 1. However, this bound may not be tight enough.

In what follows, we establish a better convergence rate of our algorithm with .

###### Theorem 2.

Convergence Rate for Sparse Regularized Optimization. For , we have the following results:

(a) Assume that is generally convex, Algorithm 1 takes at most iterations in expectation to converge to a block- stationary point satisfying , where .

(b) Assume that is -strongly convex, Algorithm 1 takes at most iterations in expectation to converge to a block- stationary point satisfying , where .

Remarks: (i) Our proof of Theorem 2 is based on the results in Theorem 1 and a similar bounding technique as in lu2014iterative . (ii) If and the accuracy is sufficiently small such that , we have , leading to . Using the same assumption and strategy, we have . In this situation, the bounds in Theorem 2 are tighter than those in Theorem 1.

We prove the convergence rate of our algorithm for sparsity constrained optimization with .

###### Theorem 3.

Convergence Rate for Sparsity Constrained Optimization. Let be a -strongly convex function. We assume that is Lipschitz continuous such that for some positive constant . Denote . We have the following results:

Remarks: Our convergence rate results are similar to those of the gradient hard thresholding pursuit as in YuanLZ17

. The first term and the second term for our convergence rate are called parameter estimation error and statistical error, respectively. While their analysis relies on the conditions of restricted strong convexity or smoothness, our study only relies on the requirements of generally strong convexity or smoothness. Figure 2: Experimental results on sparsity constrained least squares problems on different data sets.

## 5 Experimental Validation

This section demonstrates the effectiveness of our algorithm on two sparse optimization tasks, namely the sparse regularized least squares problem and the sparsity constrained least squares problem. Given a design matrix and an observation vector , we solve the following optimization problem:

 minx12∥Ax−b∥22+λ∥x∥0   or   minx 12∥Ax−b∥22, s.t. ∥x∥0≤s, (9)

where and are given parameters.

Experimental Settings. We use DEC (RG) to denote our block decomposition method along with selecting coordinates using the Random strategy and coordinates using the Greedy strategy. Since the two strategies may select the same coordinate, the working set at most contains coordinates. We keep a record of the relative changes of the objective function values by . We let DEC run up to iterations and stop it at iteration if . We use the default value for DEC. All codes 222For the purpose of reproducibility, we provide our code in the supplementary material. were implemented in Matlab on an Intel 3.20GHz CPU with 8 GB RAM. We measure the quality of the solution by comparing the objective values for different methods. Note that although DEC uses the randomized strategy to find the working set, we can always measure the quality of the solution by computing the deterministic objective value.

Data Sets. Four types of data sets for are considered in our experiments. (i) ‘random-m-n’: We generate the design matrix as , where

is a function that returns a standard Gaussian random matrix of size

. To generate the sparse original signal , we select a support set of size

uniformly at random and set them to arbitrary number sampled from standard Gaussian distribution, the observation vector is generated via

with . (ii) ‘E2006-m-n’: We use the real-world data set ‘E2006’ . We uniformly select examples and dimensions from the original data set. (iii) ‘random-m-n-corrupted’: To verify the robustness of DEC

, we generate design matrices containing outliers by

. Here, is a noisy version of where of the entries of are corrupted uniformly by scaling the original values by 100 times 444Matlab script: I = randperm(m*p,round(0.02*m*p)); X(I) = X(I)*100.. We use the same strategy to generate as in ‘random-m-n’. Note that the Hessian matrix can be ill-conditioned. (iv) ‘E2006-m-n-corrupted’: We use a similar strategy to generate the corrupted real-world data as in ‘random-m-n-corrupted’.

Sparse Regularized Least Squares Problem. We compare DEC with four state-of-the-art methods: (i) Proximal Gradient Method (PGM) nesterov2013introductory , (ii) Accelerated Proximal Gradient Method (APGM) nesterov2013introductory ; beck2009fast , (iii) Random Coordinate Descent Method (RCDM) necoara2013random , and (iv) Greedy Coordinate Descent Method (GCDM) hsieh2011fast . All the initial solutions for the comparing methods are set to zero.

Several observations can be drawn from Figure 3. (i) PGM and APGM achieve similar performance and they get stuck into poor local minima. (ii) DEC is more effective than PGM. In addition, we find that as the parameter becomes larger, more higher accuracy is achieved. (iii) DEC (RG) converges quickly but it generally leads to worse solution quality than DEC (RG). Based on this observation, we consider a combined random and greedy strategies for finding the working set in our forthcoming experiments.

Sparsity Constrained Least Squares Problem. We compare DEC with seven state-of-the-art sparse optimization algorithms: (i) Regularized Orthogonal Matching Pursuit (ROMP) needell2010signal , (ii) Subspace Pursuit (SSP) dai2009subspace , (iii) Orthogonal Matching Pursuit (OMP) tropp2007signal , (iv) Gradient Pursuit (GP) blumensath2008gradient , (v) Compressive Sampling Matched Pursuit (CoSaMP)needell2009cosamp , (vi) Proximal Gradient Method (PGM) BaoJQS16 , and (vii) Quadratic Penalty Method (QPM) LuZ13 . We remark that ROMP, SSP, OMP, GP and CoSaMP are greedy algorithms and their support sets are selected iteratively. They are non-gradient type algorithms, it is hard to incorporate these methods into other gradient-type based optimization algorithms BaoJQS16 . We use the Matlab implementation in the ‘sparsify’ toolbox. Both PGM and QPM are based on iterative hard thresholding. Since the optimal solution is expected to be sparse, we initialize the solutions of PPA, QPM and DEC to and project them to feasible solutions. The initial solution of greedy pursuit methods are initialized to zero points implicitly. We vary on different data sets and show the average results of using 5 random initial points.

Several conclusions can be drawn from Figure 3. (i) The iterative hard thresholding-based methods PPA and QPM generally lead to the worst performance. (ii) ROMP and COSAMP are not stable and sometimes they present bad performance. (iii) DEC presents comparable performance to the greedy methods on ‘random-256-1024’, ‘E2006-5000-1024’, and ‘random-256-1024-corrupted’, but it significantly outperforms the greedy techniques on the other data sets.

Computational Efficiency. Figure 3 and Figure 3 show the convergence curve of different methods for sparse regularized optimization and sparsity constrained optimization, respectively. Generally speaking, DEC is effective and practical for large-scale sparse optimization. Although it takes longer time to converge than the comparing methods, the computational time is acceptable and it generally takes less than 30 seconds to converge in all our instances. This computation time pays off as DEC achieves significantly higher accuracy. The main bottleneck of computation is on solving the small-sized subproblems using sub-exponential time . The parameter in Algorithm 1 can be viewed as a tuning parameter to balance the efficacy and efficiency.

## 6 Conclusions

This paper presents an effective and practical method for solving sparse optimization problems. Our approach takes advantage of the effectiveness of the combinatorial search and the efficiency of coordinate descent. We provided rigorous optimality analysis and convergence analysis for the proposed algorithm. Our experiments show that our method achieves state-of-the-art performance. Our block decomposition algorithm can be extended to solve binary optimization problems. A full version of this paper can be found in yuanhybrid2017 .

## References

• (1) Michal Aharon, Michael Elad, and Alfred Bruckstein. K-svd: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311–4322, 2006.
• (2) Chenglong Bao, Hui Ji, Yuhui Quan, and Zuowei Shen. Dictionary learning for sparse coding: Algorithms and convergence analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 38(7):1356–1369, 2016.
• (3) Amir Beck and Yonina C Eldar. Sparsity constrained nonlinear optimization: Optimality conditions and algorithms. SIAM Journal on Optimization (SIOPT), 23(3):1480–1509, 2013.
• (4) Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences (SIIMS), 2(1):183–202, 2009.
• (5) Amir Beck and Luba Tetruashvili. On the convergence of block coordinate descent type methods. SIAM Journal on Optimization (SIOPT), 23(4):2037–2060, 2013.
• (6) Amir Beck and Yakov Vaisbourd.

The sparse principal component analysis problem: Optimality conditions and algorithms.

J. Optim. Theory Appl., 170(1):119–143, 2016.
• (7) Daniel Bienstock. Computational study of a family of mixed-integer quadratic programming problems. Mathematical Programming, 74(2):121–140, 1996.
• (8) Thomas Blumensath and Mike E Davies. Gradient pursuits. IEEE Transactions on Signal Processing, 56(6):2370–2382, 2008.
• (9) Patrick Breheny and Jian Huang.

Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection.

The Annals of Applied Statistics, 5(1):232, 2011.
• (10) Emmanuel J Candes and Terence Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.
• (11) Kai-Wei Chang, Cho-Jui Hsieh, and Chih-Jen Lin.

Coordinate descent method for large-scale l2-loss linear support vector machines.

Journal of Machine Learning Research (JMLR), 9(Jul):1369–1398, 2008.
• (12) Jinghui Chen and Quanquan Gu. Accelerated stochastic block coordinate gradient descent for sparsity constrained nonconvex optimization. In

Uncertainty in Artificial Intelligence (UAI)

, 2016.
• (13) Michele Conforti, Gérard Cornuéjols, and Giacomo Zambelli. Integer programming, volume 271. Springer, 2014.
• (14) Wei Dai and Olgica Milenkovic. Subspace pursuit for compressive sensing signal reconstruction. IEEE Transactions on Information Theory, 55(5):2230–2249, 2009.
• (15) Marianna De Santis, Stefano Lucidi, and Francesco Rinaldi. A fast active set block coordinate descent algorithm for ell_1-regularized least squares. SIAM Journal on Optimization (SIOPT), 26(1):781–809, 2016.
• (16) David L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
• (17) Ehsan Elhamifar and Rene Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 35(11):2765–2781, 2013.
• (18) Mingyi Hong, Xiangfeng Wang, Meisam Razaviyayn, and Zhi-Quan Luo. Iteration complexity analysis of block coordinate descent methods. Mathematical Programming, pages 1–30, 2013.
• (19) Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S Sathiya Keerthi, and Sellamanickam Sundararajan. A dual coordinate descent method for large-scale linear svm. In International Conference on Machine Learning (ICML), pages 408–415, 2008.
• (20) Cho-Jui Hsieh and Inderjit S Dhillon. Fast coordinate descent methods with variable selection for non-negative matrix factorization. In ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 1064–1072, 2011.
• (21) Prateek Jain, Ambuj Tewari, and Purushottam Kar. On iterative hard thresholding methods for high-dimensional m-estimation. In Neural Information Processing Systems (NIPS), pages 685–693, 2014.
• (22) Rie Johnson and Tong Zhang.

Accelerating stochastic gradient descent using predictive variance reduction.

In Advances in Neural Information Processing Systems (NIPS), pages 315–323, 2013.
• (23) Xingguo Li, Raman Arora, Han Liu, Jarvis Haupt, and Tuo Zhao. Nonconvex sparse learning via stochastic optimization with progressive variance reduction. arXiv Preprint, 2016.
• (24) Ji Liu, Stephen J Wright, Christopher Ré, Victor Bittorf, and Srikrishna Sridhar. An asynchronous parallel stochastic coordinate descent algorithm. Journal of Machine Learning Research (JMLR), 16(285-322):1–5, 2015.
• (25) Zhaosong Lu. Iterative hard thresholding methods for regularized convex cone programming. Mathematical Programming, 147(1-2):125–154, 2014.
• (26) Zhaosong Lu and Lin Xiao. On the complexity analysis of randomized block-coordinate descent methods. Mathematical Programming, 152(1-2):615–642, 2015.
• (27) Zhaosong Lu and Yong Zhang. Sparse approximation via penalty decomposition methods. SIAM Journal on Optimization (SIOPT), 23(4):2448–2478, 2013.
• (28) Ion Necoara. Random coordinate descent algorithms for multi-agent convex optimization over networks. IEEE Transactions on Automatic Control, 58(8):2001–2012, 2013.
• (29) Deanna Needell and Joel A Tropp. Cosamp: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3):301–321, 2009.
• (30) Deanna Needell and Roman Vershynin. Signal recovery from incomplete and inaccurate measurements via regularized orthogonal matching pursuit. IEEE Journal of Selected Topics in Signal Processing, 4(2):310–316, 2010.
• (31) Yu Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization (SIOPT), 22(2):341–362, 2012.
• (32) Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
• (33) Nam Nguyen, Deanna Needell, and Tina Woolf. Linear convergence of stochastic iterative greedy algorithms with sparse constraints. arXiv Preprint, 2014.
• (34) Andrei Patrascu and Ion Necoara. Efficient random coordinate descent algorithms for large-scale structured nonconvex optimization. Journal of Global Optimization, 61(1):19–46, 2015.
• (35) Andrei Patrascu and Ion Necoara. Random coordinate descent methods for regularized convex optimization. IEEE Transactions on Automatic Control, 60(7):1811–1824, 2015.
• (36) M. J. D. Powell. On search directions for minimization algorithms. Mathematical Programming, 4(1):193–201, 1973.
• (37) Meisam Razaviyayn, Mingyi Hong, and Zhi-Quan Luo. A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization (SIOPT), 23(2):1126–1153, 2013.
• (38) Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Neural Information Processing Systems (NIPS), pages 693–701, 2011.
• (39) Joel A Tropp and Anna C Gilbert. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 53(12):4655–4666, 2007.
• (40) Paul Tseng and Sangwoon Yun. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117(1):387–423, 2009.
• (41) Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization (SIOPT), 24(4):2057–2075, 2014.
• (42) Yangyang Xu and Wotao Yin.

A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion.

SIAM Journal on Imaging Sciences (SIIMS), 6(3):1758–1789, 2013.
• (43) Ganzhao Yuan, Li Shen, and Wei-Shi Zheng. A hybrid method of combinatorial search and coordinate descent for discrete optimization. arXiv, https://arxiv.org/abs/1706.06493, June 2017.
• (44) Xiao-Tong Yuan, Ping Li, and Tong Zhang. Gradient hard thresholding pursuit. Journal of Machine Learning Research (JMLR), 18:166:1–166:43, 2017.
• (45) Aston Zhang and Quanquan Gu. Accelerated stochastic block coordinate descent with optimal sampling. In ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 2035–2044, 2016.

## 7 Proof of Global Convergence

###### Proposition 2.

Global Convergence. Letting be the sequence generated by Algorithm 1, we have the following results. (i) It holds that: . (ii) As , converges to the block- stationary point of (1) in expectation.

###### Proof.

(i) Due to the optimality of , we have: for all . Letting , we obtain the sufficient decrease condition:

 F(xt+1)≤F(xt)−θ2∥xt+1−xt∥2 (10)

Taking the expectation of for the sufficient descent inequality, we have . Summing this inequality over , we have:

Using the fact that , we obtain:

 mini=1,...,tE[θ2∥xi+1−xi∥22]≤θ2t∑ti=0E[∥xi+1−xi∥22]≤F(x0)−F(¯x)t (11)

Therefore, we have .

(ii) We assume that the stationary point is not a block- stationary point. In expectation there exists a block of coordinates such that for some , where is defined in Definition 3. However, according to the fact that and subproblem (3) in Algorithm 1, we have . Hence, we have . This contradicts with the fact that as . We conclude that converges to the block- stationary point.

## 8 Proof of Convergence Rate for Sparse Regularized Optimization

The following lemma is useful in our proof.

###### Lemma 1.

Assume a nonnegative sequence satisfies for some constant . We have:

 ut≤max(2C,√Cu0)t (12)
###### Proof.

We denote . Solving this quadratic inequality, we have:

 ut+1≤−C2+C2√1+4utC (13)

We now show that , which can be obtained by mathematical induction. (i) When , we have . (ii) When , we assume that holds. We derive the following results: . Here, step uses ; step uses ; step uses .

###### Theorem 1.

Convergence Properties for Sparse Regularized Optimization. For , we have the following results:

(a) It holds that for all with , where . Whenever , we have