Efficient Relaxed Gradient Support Pursuit for Sparsity Constrained Non-convex Optimization

12/02/2019
by   Fanhua Shang, et al.
The University of Texas at Austin

Large-scale non-convex sparsity-constrained problems have recently gained extensive attention. Most existing deterministic optimization methods (e.g., GraSP) are not suitable for large-scale and high-dimensional problems, and thus stochastic optimization methods with hard thresholding (e.g., SVRGHT) become more attractive. Inspired by GraSP, this paper proposes a new general relaxed gradient support pursuit (RGraSP) framework, in which the sub-algorithm only requires to satisfy a slack descent condition. We also design two specific semi-stochastic gradient hard thresholding algorithms. In particular, our algorithms have much less hard thresholding operations than SVRGHT, and their average per-iteration cost is much lower (i.e., O(d) vs. O(d log(d)) for SVRGHT), which leads to faster convergence. Our experimental results on both synthetic and real-world datasets show that our algorithms are superior to the state-of-the-art gradient hard thresholding methods.



1 Introduction

Massive high-dimensional data are common nowadays and impose new challenges on algorithms for sparse learning. For high-dimensional data analysis, it is important to exploit the low intrinsic structure and dimensionality of the data, such as sparsity and low-rank structure, which is often attained by imposing structural assumptions on the parameters of the underlying model. In recent years, ℓ1-norm regularized models, such as Lasso [34] and ℓ1-norm regularized logistic regression [36, 21], have been proposed to pursue computational tractability by using the ℓ1-norm as a convex surrogate for the ℓ0-norm. In spite of their computational advantages, ℓ1-norm models have some limitations [6] and attain worse empirical performance than ℓ0-norm models [9]. Thus, it is both necessary and challenging to solve the ℓ0-norm constrained problem directly. In this paper, we focus on the following sparsity-constrained optimization problem:

    min_{x ∈ R^d} f(x) := (1/n) ∑_{i=1}^{n} f_i(x),  s.t. ‖x‖₀ ≤ s,     (1)

where f(x) is the sum of a finite number n of smooth convex component functions f_i(x), ‖x‖₀ ≤ s means that the number of nonzero entries in x is no more than s, and s is used to control the sparsity level of the model parameter. Problem (1) arises in machine learning, statistics and related areas, e.g., the sparsity-constrained linear regression problem:

    min_{x ∈ R^d} (1/(2n)) ‖y − Wx‖₂²,  s.t. ‖x‖₀ ≤ s,     (2)

where x ∈ R^d is the unknown model parameter, y ∈ R^n is the response vector, and W ∈ R^{n×d} is a design matrix.
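As a concrete illustration, the objective and full gradient of the least-squares problem (2) can be computed as follows. This is a minimal NumPy sketch; the function names are ours, not from the paper.

```python
import numpy as np

def lsq_objective(x, W, y):
    """f(x) = 1/(2n) * ||y - W x||_2^2 for the linear regression problem (2)."""
    n = W.shape[0]
    r = y - W @ x
    return 0.5 * (r @ r) / n

def lsq_full_gradient(x, W, y):
    """Full gradient: grad f(x) = (1/n) * W^T (W x - y)."""
    n = W.shape[0]
    return W.T @ (W @ x - y) / n
```

Note that a single evaluation of this full gradient costs O(nd), which is exactly the per-iteration expense that motivates the stochastic methods discussed below.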

Due to the non-convexity of the sparsity constraint, Problem (1) is NP-hard. To obtain an approximate solution to Problem (1), a large family of greedy algorithms [18, 20, 35, 10, 25, 3, 38, 33] has been proposed. Beyond these, there has been much progress on gradient hard thresholding methods, such as fast gradient hard thresholding pursuit (FG-HT) [37] and iterative hard thresholding (IHT) [5]. However, all of these algorithms are based on deterministic optimization, such as gradient descent. In each iteration, gradient descent algorithms require the computation of a full gradient over a very large number of component functions, which is expensive for large-scale problems (i.e., when n is large).

To address this issue, several stochastic optimization algorithms have been proposed. For example, Nguyen et al. [23] proposed a stochastic gradient hard thresholding algorithm (SG-HT), and Li et al. [14] proposed a stochastic variance reduced gradient hard thresholding algorithm (SVRGHT), which is based on the stochastic variance reduced gradient (SVRG, also called semi-stochastic gradient in [13]) method [12]. With the help of the variance reduction technique, SVRGHT converges stably and efficiently, and also attains better estimation accuracy than SG-HT [32]. Besides these methods, there are also several stochastic first- or second-order non-convex optimization algorithms, such as ASBCDHT [7], HSG-HT [41], FNHTP [8] and SL-BFGS [11]. However, most of the algorithms mentioned above need one hard thresholding operation in each inner iteration, which is time-consuming for high-dimensional data (i.e., O(d log d) per operation in general, although this can be improved with a max-heap strategy). Moreover, a hard thresholding operation in each inner iteration discards information from the current solution, which may require more gradient descent steps to reach the same accuracy. It should be emphasized that in this paper we do not consider coordinate-descent type algorithms [22, 4, 7]. As a result, a new algorithm that needs fewer hard thresholding operations yet retains fast stochastic updates is more attractive for large-scale and high-dimensional problems.

Inspired by the gradient support pursuit (GraSP) [3] and compressive sampling matching pursuit (CoSaMP) [20] algorithms, this paper proposes a new relaxed gradient support pursuit (RGraSP) framework to solve large-scale sparsity-constrained problems. In each iteration of our framework, we first find the most relevant support set, approximately minimize over the support set with a sub-algorithm that is only required to satisfy a certain descent condition, and then apply a hard thresholding operator to the updated parameter. For minimizing the objective function, we introduce two efficient semi-stochastic gradient algorithms into our RGraSP framework and propose a stochastic variance reduced gradient support pursuit (SVRGSP) algorithm and its fast version (SVRGSP+). Moreover, benefiting from significantly fewer hard thresholding operations than SVRGHT, the average per-iteration computational cost of our algorithms is much lower (i.e., O(d) for our algorithms vs. O(d log d) for the algorithms mentioned above, such as SVRGHT), which leads to faster convergence. Experimental results on synthetic and real-world datasets verify the superiority of our algorithms over the state-of-the-art methods.

Algorithms                         Per-iteration Complexity
FG-HT and IHT (deterministic)      O(nd + d log d)
SG-HT and SVRGHT (stochastic)      O(d log d)
SVRGSP and SVRGSP+ (Ours)          O(d)
Table 1: Comparison of the per-iteration complexities of some sparsity-constrained optimization methods.

2 Notations

Throughout this paper, we use W ∈ R^{n×d} to denote the design matrix, y ∈ R^n the response vector, and x ∈ R^d the model parameter. x* is the optimal solution of Problem (1), and s* is the optimal sparsity level, which satisfies s* ≤ s. For a parameter vector x, ‖x‖₀ is the number of nonzero entries in x, ‖x‖₂ is the ℓ2-norm, and ‖x‖_∞ is the ℓ∞-norm. supp(x) denotes the index set of nonzero entries of x, and supp(x, s) is the index set of the s largest entries of x in terms of magnitude. Ω^c denotes the complement set of Ω, x_Ω is a vector that equals x except for the coordinates in Ω^c, where it is zero, and |Ω| denotes the cardinality of Ω. In addition, ∇f(x) denotes the gradient of the objective function f at x, and I is an identity matrix.

3 Relaxed Gradient Support Pursuit

In this section, we propose an efficient relaxed gradient support pursuit framework for sparsity-constrained non-convex optimization problems. Moreover, we also present the details of two specific stochastic variance reduced gradient support pursuit algorithms (called SVRGSP and SVRGSP+).

3.1 Our Gradient Support Pursuit Framework

Most existing gradient support pursuit algorithms use deterministic optimization methods to minimize various sparsity-constrained problems (e.g., Problem (1)). However, there are several stochastic algorithms, such as SVRGHT [14], that can attain better performance [14, 32, 11]. Inspired by the well-known GraSP [3] and CoSaMP [20], this paper proposes a Relaxed Gradient Support Pursuit (RGraSP) framework to quickly find an approximate solution to Problem (1). As pointed out in [3], CoSaMP can be viewed as a special case of GraSP when the squared error is the cost function.

0:  s, f(·), and the number of iterations, T.
0:  x^T.
1:  for t = 0, 1, ..., T−1 do
2:     Compute local gradient: g = ∇f(x^t);
3:     Identify directions: Γ = supp(g, 2s);
4:     Merge supports: Ω = Γ ∪ supp(x^t);
5:     Minimize slackly: find b such that f(b) − f(b*) ≤ ε, where b* is an optimal solution to Problem (3), and set b_{Ω^c} = 0;
6:     Perform hard thresholding over b: x^{t+1} = H_s(b);
7:  end for
7:  Output: x^T.
Algorithm 1 : Relaxed Gradient Support Pursuit Framework

The main difference between GraSP [3] and our RGraSP framework is that the former needs the exact solution to the following problem:

    min_x f(x),  s.t. x_{Ω^c} = 0,

while the latter only requires a solver (e.g., Algorithm 2 below) for an approximate solution b to the above problem. RGraSP solves sparsity-constrained non-convex optimization problems by allowing users to pick a specially designed algorithm according to the properties of f(·). In other words, we can choose different solvers for the sub-problem in Step 5 of Algorithm 1, as long as the algorithm satisfies a certain descent condition. In this sense, both GraSP [3] and CoSaMP [20] can be viewed as special cases of RGraSP with error tolerance ε = 0. Our general RGraSP framework is outlined in Algorithm 1.

At each iteration of Algorithm 1, we first compute the gradient of f at the current estimate, i.e., g = ∇f(x^t). Then we choose the 2s coordinates of g that have the largest magnitude as the directions in which pursuing the minimization will be most effective, and denote their indices by Γ = supp(g, 2s), where s is the sparsity constant. Merging the support of the current estimate with the coordinates mentioned above, we obtain the combined support, a set of at most 3s indices, i.e., Ω = Γ ∪ supp(x^t). Over Ω, we compute an estimate b as the approximate solution to the sub-problem

    min_x f(x),  s.t. x_{Ω^c} = 0.     (3)

The parameter is then updated using the hard thresholding operator, which keeps the s largest terms of the intermediate estimate b. This step makes x^{t+1} the best s-term approximation of the estimate b. The hard thresholding operator is defined as follows:

    [H_s(b)]_i = b_i if i ∈ supp(b, s), and 0 otherwise.

Essentially, the hard thresholding operator keeps the s largest (in magnitude) entries and sets the other entries equal to zero.
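A minimal NumPy sketch of this operator (the function name `hard_threshold` is ours):

```python
import numpy as np

def hard_threshold(x, s):
    """H_s(x): keep the s largest-magnitude entries of x and zero out the rest."""
    out = np.zeros_like(x)
    if s <= 0:
        return out
    # argpartition selects the indices of the s largest |x_i| without a full sort
    idx = np.argpartition(np.abs(x), -s)[-s:]
    out[idx] = x[idx]
    return out
```

Selecting the top-s entries via partial partitioning avoids the full O(d log d) sort; a max-heap strategy, as mentioned above, achieves a similar saving.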

In Algorithm 1, we only require that the solver for the sub-problem (3) has a performance guarantee of the form

    f(b) − f(b*) ≤ ε,

where b* is an optimal solution to the sub-problem (3) and ε is an error tolerance, which implies that the solver proposed below must achieve a certain accuracy for our RGraSP framework. In fact, our solver can reach any given accuracy after sufficiently many iterations, as suggested by [1]. Although there are quite a number of solvers that we could use, such as the first-order solvers [3, 17, 39] and the second-order solvers [2, 8, 11], we observe that the semi-stochastic gradient solver outperforms the others in most cases, as in [15]. In the following, we present two efficient semi-stochastic gradient algorithms as the solver in Algorithm 1.
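To make the framework concrete, the outer loop of Algorithm 1 can be sketched as follows. This is our own illustrative NumPy code; `grad_full` and `solver` are hypothetical callables supplied by the user, where `solver(x, support)` returns an approximate minimizer of f restricted to the given support.

```python
import numpy as np

def rgrasp(grad_full, solver, x0, s, T):
    """Sketch of the RGraSP outer loop (Algorithm 1)."""
    x = x0.copy()
    for _ in range(T):
        g = grad_full(x)                                      # Step 2: local gradient
        Gamma = np.argpartition(np.abs(g), -2 * s)[-2 * s:]   # Step 3: 2s largest coords
        Omega = np.union1d(Gamma, np.flatnonzero(x))          # Step 4: merge supports
        b = solver(x, Omega)                                  # Step 5: slack minimization over Omega
        x = np.zeros_like(b)                                  # Step 6: hard thresholding H_s(b)
        keep = np.argpartition(np.abs(b), -s)[-s:]
        x[keep] = b[keep]
    return x
```

For instance, with f(x) = ½‖x − t‖₂² the restricted minimizer is simply t on the merged support, and a single RGraSP iteration recovers an s-sparse t exactly.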

Figure 1: Comparison of all the methods for solving sparsity-constrained linear regression problems on the synthetic datasets. In each plot, the vertical axis denotes the logarithm of the objective function value or estimation error, and the horizontal axis is the number of effective passes over the data.
0:  s, f(·), the number of inner iterations m, and the step-size η.
0:  x^0 = x^t, x̃ = x^t, g = ∇f(x̃) (computed in Algorithm 1);
1:  for k = 0, 1, ..., m−1 do
2:     Randomly pick i_k ∈ {1, 2, ..., n};
3:     v^k = ∇f_{i_k}(x^k) − ∇f_{i_k}(x̃) + g;
4:     x^{k+1} = x^k − η v^k;   // For the plain solver
       Option:  if k mod ⌊m/q⌋ = 0 then   // For the fast solver
                    x^{k+1} = H_s(x^k − η v^k);
                end if
5:  end for
5:  Output: b = x^m.
Algorithm 2 : Our Semi-Stochastic Gradient Solvers

3.2 Our Semi-Stochastic Gradient Solvers

In our RGraSP framework, we apply a number of semi-stochastic gradient iterations as the solver. Combining our semi-stochastic gradient solver (i.e., Algorithm 2) with the RGraSP framework, we propose a stochastic variance reduced gradient support pursuit (SVRGSP) algorithm and its fast variant (SVRGSP+) to solve Problem (1). The semi-stochastic gradient solver is outlined in Algorithm 2.

Figure 2: Comparison of all the algorithms for solving sparsity-constrained logistic regression problems on the two real-world datasets, (a) rcv1 and (b) real-sim. In each plot, the vertical axis is the logarithm of the objective value minus the minimum, and the horizontal axis denotes the number of effective passes over the data or the running time (seconds).

In each iteration of Algorithm 2, we first initialize g as the snapshot gradient for our semi-stochastic gradient update, which has been computed in Algorithm 1. Then we select a sample i_k uniformly at random from {1, 2, ..., n}. The semi-stochastic gradient defined in Step 3 of Algorithm 2 is updated based on this sample. Note that this gradient is called a semi-stochastic gradient because it combines a deterministic full gradient g with two stochastic gradients, as shown in Step 3 of Algorithm 2. Meanwhile, this semi-stochastic gradient reduces the variance introduced by random sampling and can thus accelerate convergence [12, 13]. Finally, x^{k+1} is updated using the semi-stochastic gradient with a constant step-size η, as shown in Step 4 of Algorithm 2. Therefore, our SVRGSP solver performs no hard thresholding operations within an epoch, while existing stochastic algorithms such as SVRGHT [14] perform one hard thresholding operation in each stochastic iteration, which naturally leads to significantly slower convergence.
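The plain inner loop of Algorithm 2 can be sketched as follows (illustrative code; `grad_i(i, x)` is a hypothetical callable computing the component gradient ∇f_i(x), and all names are our own). For simplicity the snapshot full gradient is recomputed here, whereas in the framework it is passed down from Algorithm 1.

```python
import numpy as np

def semi_stochastic_solver(grad_i, n, x_start, eta, m, rng):
    """Plain semi-stochastic gradient solver (Algorithm 2, no hard thresholding).

    x_tilde is the snapshot point and g the snapshot full gradient,
    both held fixed for the whole epoch, as in SVRG."""
    x_tilde = x_start.copy()
    g = sum(grad_i(i, x_tilde) for i in range(n)) / n  # snapshot full gradient
    x = x_start.copy()
    for _ in range(m):
        i = int(rng.integers(n))
        # semi-stochastic gradient: unbiased estimate with reduced variance
        v = grad_i(i, x) - grad_i(i, x_tilde) + g
        x = x - eta * v
    return x
```

Each inner step touches only one component gradient, so its cost is O(d), in contrast to the O(d log d) per step incurred when a hard thresholding operation is added.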

Moreover, we can also apply a few hard thresholding operations in each epoch to maintain the main coordinates and obtain faster convergence. In other words, Step 4 in Algorithm 2 has another option that yields a fast variant of our SVRGSP solver (i.e., SVRGSP+), which also performs significantly fewer hard thresholding operations than existing algorithms such as SVRGHT [14]. Therefore, the average per-iteration computational complexity of both SVRGSP and SVRGSP+ is much lower, i.e., O(d) for both SVRGSP and SVRGSP+ vs. O(d log d) for SVRGHT.

From the above analysis, we can see that our algorithms (i.e., SVRGSP and SVRGSP+) apply a hard thresholding operation only after a large number of stochastic gradient iterations, while existing stochastic algorithms (e.g., SVRGHT [14]) perform a hard thresholding operation in every iteration, which is very time-consuming for high-dimensional problems and thus leads to much slower convergence. Although frequent hard thresholding operations keep the model parameter x sparse, they discard more gradient information, which is undesirable for fast convergence. In particular, it is not difficult to prove that our algorithms (i.e., SVRGSP and SVRGSP+) also enjoy a fast linear convergence rate, as SVRGHT does. Please refer to the long version of this paper for the detailed convergence analysis.

4 Experimental Results

In this section, we apply the proposed algorithms (i.e., SVRGSP and SVRGSP+) to solve sparsity-constrained linear regression and sparsity-constrained logistic regression problems, and evaluate their empirical performance on synthetic and real-world datasets. (The source code of our two algorithms can be downloaded from the authors' webpage.) All the experiments were performed on a PC with an Intel i7-7700 CPU and 32GB RAM.

4.1 Baseline Methods

In all the experiments, we compare the proposed algorithms (i.e., SVRGSP and SVRGSP+) with the following four state-of-the-art sparsity-constrained algorithms:

  • Gradient Support Pursuit (GraSP) [3];

  • Fast Gradient descent with Hard Thresholding (FG-HT) [37];

  • Stochastic Gradient descent with Hard Thresholding (SG-HT) [23];

  • Stochastic Variance Reduced Gradient with Hard Thresholding (SVRGHT) [14].

Note that GraSP and FG-HT are two well-known deterministic optimization methods, while SG-HT and SVRGHT are two recently proposed stochastic optimization methods. The learning rates of all these methods (as well as their other parameters) need to be tuned. It should be noted that better results could be obtained by tuning the parameters of our two algorithms jointly.

4.2 Synthetic Data

In this part, we empirically investigate the performance of our algorithms for solving the sparsity-constrained linear model (2) on synthetic datasets. We first generate synthetic matrices W, each row of which is drawn independently from a d-dimensional Gaussian distribution with mean 0 and covariance matrix Σ. The response vector is generated from the model y = Wx* + ε, where x* is the s*-sparse coefficient vector, and the noise ε is drawn from a multivariate normal distribution N(0, σ²I). The nonzero entries of x* are sampled independently from a uniform distribution over a fixed interval. For the experiments, we construct two synthetic datasets of different sizes (n, d); the diagonal entries of the covariance matrix Σ are set to 1, and the off-diagonal entries are set to a common constant. The sparsity parameter s is set identically for all the algorithms.
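A generator following this recipe might look as follows. This is our own sketch: the interval for the nonzero entries, the correlation constant rho, and the noise level are illustrative placeholders, since the exact experimental values are not reproduced here.

```python
import numpy as np

def make_synthetic(n, d, s, rho=0.5, sigma=0.01, lo=-1.0, hi=1.0, seed=0):
    """Generate (W, y, x_star) for the sparsity-constrained linear model (2).

    Rows of W ~ N(0, Sigma) with unit diagonal and constant off-diagonal rho
    (assumed values); x_star is s-sparse with nonzeros uniform on [lo, hi]."""
    rng = np.random.default_rng(seed)
    Sigma = np.full((d, d), rho)
    np.fill_diagonal(Sigma, 1.0)
    W = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    x_star = np.zeros(d)
    support = rng.choice(d, size=s, replace=False)
    x_star[support] = rng.uniform(lo, hi, size=s)
    y = W @ x_star + sigma * rng.standard_normal(n)
    return W, y, x_star
```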

Figure 1 shows the computational performance (including the logarithm of the objective function value and the estimation error ‖x − x*‖₂) of all the algorithms on the synthetic datasets. The results show that all the stochastic variance reduction algorithms (i.e., SVRGHT and our SVRGSP and SVRGSP+ algorithms) perform better than the deterministic methods (i.e., GraSP and FG-HT) and the stochastic method SG-HT. We can also see that both SVRGSP and SVRGSP+ converge significantly faster than all the state-of-the-art methods in terms of both function value and estimation error in all the settings. Although SVRGHT was theoretically proved to have a linear convergence rate for sparsity-constrained linear regression problems, both our algorithms consistently outperform SVRGHT owing to fewer hard thresholding operations, which is consistent with our analysis. Moreover, SVRGSP+ converges slightly faster than SVRGSP in all the settings, which validates the value of a few hard thresholding operations in each epoch.

4.3 Real-World Data

We also conduct experiments on two large-scale real-world datasets, rcv1 and real-sim, which can be downloaded from the LIBSVM Data website (https://www.csie.ntu.edu.tw/~cjlin/libsvm/). They contain 20,242 samples with 47,236 dimensions and 72,309 samples with 20,958 dimensions, respectively, as shown in Table 2. We test all the methods on sparsity-constrained logistic regression and sparsity-constrained linear regression problems, using the same sparsity parameter s for all the algorithms and initializing them with x^0 = 0.

Datasets #Data #Features Sparsity
rcv1 20,242 47,236 0.16%
real-sim 72,309 20,958 0.024%
Table 2: Summary of the two large-scale sparse datasets.
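For reference, a common form of the logistic regression objective and its gradient, as assumed here (the paper's exact loss formula is not reproduced in the text above; labels are taken in {−1, +1}):

```python
import numpy as np

def logistic_objective(x, W, labels):
    """f(x) = (1/n) * sum_i log(1 + exp(-y_i * w_i^T x))."""
    z = labels * (W @ x)
    # log(1 + exp(-z)) computed stably via logaddexp(0, -z)
    return np.mean(np.logaddexp(0.0, -z))

def logistic_gradient(x, W, labels):
    """Gradient: -(1/n) * W^T (y * sigmoid(-y * W x))."""
    z = labels * (W @ x)
    s = 1.0 / (1.0 + np.exp(z))  # sigmoid(-z)
    return -(W.T @ (labels * s)) / W.shape[0]
```

The per-component gradients of this loss are what the semi-stochastic solver samples in Step 3 of Algorithm 2.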

Figure 2 shows the logarithm of the sparsity-constrained logistic regression function gap (i.e., the objective value minus the minimum) with respect to the number of effective passes and the running time on the rcv1 and real-sim datasets. Similar results for all the algorithms on the sparsity-constrained linear regression problem are shown in Figure 3. From all the experimental results, it is clear that the proposed algorithms (i.e., SVRGSP and SVRGSP+) outperform the other state-of-the-art sparsity-constrained optimization algorithms in terms of both the number of effective passes and running time; that is, our algorithms converge significantly faster than the other algorithms. In particular, SVRGSP+ is consistently faster than all the other algorithms, including SVRGSP.

Although the performance of GraSP [3] in terms of the number of effective passes is similar to that of FG-HT [37], GraSP is usually slower due to its higher per-iteration complexity. Both our algorithms are significantly faster than GraSP in terms of both the number of effective passes and running time, which verifies the effectiveness of our relaxed gradient support pursuit framework. Moreover, all the semi-stochastic gradient algorithms (including SVRGHT and our two algorithms) find better solutions than the first-order deterministic methods (i.e., GraSP and FG-HT) and the stochastic gradient method SG-HT, which demonstrates the effectiveness of the stochastic variance reduction technique. Both our algorithms are much faster than SVRGHT in all the settings, which suggests that our algorithms are well suited to real-world large-scale sparse learning problems.

Figure 3: Comparison of all the algorithms for solving sparsity-constrained linear regression problems on rcv1.

5 Conclusions and Future Work

In this paper, we proposed a relaxed gradient support pursuit (RGraSP) framework for solving various large-scale sparsity-constrained optimization problems. We also presented two efficient semi-stochastic gradient hard thresholding algorithms as solvers within our RGraSP framework. Our theoretical analysis in the long version of this paper shows that both algorithms have a fast linear convergence rate. In particular, our algorithms require far fewer hard thresholding operations than most existing algorithms, and their average per-iteration computational cost is much lower (i.e., O(d) for our algorithms vs. O(d log d) for algorithms such as SVRGHT), which leads to faster convergence. Experimental results on synthetic and real-world datasets verified our theoretical results and demonstrated the effectiveness and efficiency of our algorithms.

Unlike GraSP [3] and CoSaMP [20], RGraSP is a more general algorithmic framework, especially for minimizing complex cost functions whose exact sub-problem solutions cannot be obtained in polynomial time. In particular, our semi-stochastic gradient solver is well suited to asynchronous parallel and distributed implementation, similar to [24, 19]. Some recently proposed accelerated SVRG algorithms [27, 1, 40, 30, 26, 31] (e.g., Katyusha [1] and MiG [40]) and their asynchronous parallel and distributed variants can also be used to solve the sub-problem in our framework. Moreover, our algorithms and their convergence results can be extended to the non-smooth setting (e.g., non-smooth cost functions) by using the property of stable restricted linearization as in [3], and to low-rank matrix and tensor settings such as [28, 29, 16].

Acknowledgments

This work was supported by the State Key Program of National Natural Science of China (No. 61836009), the Foundation for Innovative Research Groups of the National Natural Science Foundation of China (No. 61621005), the Major Research Plan of the National Natural Science Foundation of China (Nos. 91438201 and 91438103), the Fund for Foreign Scholars in University Research and Teaching Programs (the 111 Project) (No. B07048), the National Natural Science Foundation of China (Nos. 61976164, 61876220, 61876221, U1701267, U1730109, and 61871310), the Program for Cheung Kong Scholars and Innovative Research Team in University (No. IRT_15R53), the Science Foundation of Xidian University (Nos. 10251180018 and 10251180019), the Fundamental Research Funds for the Central Universities (No. 20101195989), the National Science Basic Research Plan in Shaanxi Province of China (No. 2019JQ-657), and the Key Special Project of the China High-Resolution Earth Observation System Young Scholar Innovation Fund.

References

  • [1] Z. Allen-Zhu (2018) Katyusha: the first direct acceleration of stochastic gradient methods. J. Mach. Learn. Res. 18, pp. 1–51. Cited by: §3.1, §5.
  • [2] G. Andrew and J. Gao (2007) Scalable training of -regularized log-linear models. In ICML, pp. 33–40. Cited by: §3.1.
  • [3] S. Bahmani, B. Raj, and P. Boufounos (2013) Greedy sparsity-constrained optimization. J. Mach. Learn. Res. 14 (1), pp. 807–841. Cited by: §1, §1, §3.1, §3.1, §3.1, 1st item, §4.3, §5.
  • [4] A. Beck and Y. C. Eldar (2013) Sparsity constrained nonlinear optimization: optimality conditions and algorithms. SIAM J. Optim. 23 (3), pp. 1480–1509. Cited by: §1.
  • [5] T. Blumensath and M. Davies (2009) Iterative hard thresholding for compressed sensing. Appl. Comput. Harmon. Anal. 27 (3), pp. 265–274. Cited by: §1.
  • [6] E. Candes, M. Wakin, and S. Boyd (2008) Enhancing sparsity by reweighted minimization. J. Fourier Anal. Appl. 14 (5-6), pp. 877–905. Cited by: §1.
  • [7] J. Chen and Q. Gu (2016) Accelerated stochastic block coordinate gradient descent for sparsity constrained optimization. In UAI, Cited by: §1.
  • [8] J. Chen and Q. Gu (2017) Fast Newton hard thresholding pursuit for sparsity constrained nonconvex optimization. In KDD, pp. 757–766. Cited by: §1, §3.1.
  • [9] J. Fan and R. Li (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96 (456), pp. 1348–1360. Cited by: §1.
  • [10] S. Foucart (2011) Hard thresholding pursuit: an algorithm for compressive sensing. SIAM J. Numer. Anal. 49 (6), pp. 2543–2563. Cited by: §1.
  • [11] H. Gao and H. Huang (2018) Stochastic second-order method for large-scale nonconvex sparse learning models. In IJCAI, pp. 2128–2134. Cited by: §1, §3.1, §3.1.
  • [12] R. Johnson and T. Zhang (2013) Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pp. 315–323. Cited by: §1, §3.2.
  • [13] J. Konečnỳ and P. Richtárik (2017) Semi-stochastic gradient descent methods. Front. Appl. Math. Stat. 3. Cited by: §1, §3.2.
  • [14] X. Li, T. Zhao, R. Arora, H. Liu, and J. Haupt (2016) Stochastic variance reduced optimization for nonconvex sparse learning. In ICML, pp. 917–925. Cited by: §3.1, §3.2, §3.2, §3.2, 4th item.
  • [15] X. Liu, B. Wei, F. Shang, and H. Liu (2019) Loopless semi-stochastic gradient descent with less hard thresholding for sparse learning. In CIKM, pp. 881–890. Cited by: §3.1.
  • [16] Y. Liu, F. Shang, W. Fan, J. Cheng, and H. Cheng (2016) Generalized higher order orthogonal iteration for tensor learning and decomposition. IEEE Trans. Neural Netw. Learning Syst. 27 (12), pp. 2551–2563. Cited by: §5.
  • [17] Y. Liu, F. Shang, and L. Jiao (2019) Accelerated incremental gradient descent using momentum acceleration with scaling factor. In IJCAI, pp. 3045–3051. Cited by: §3.1.
  • [18] S. Mallat and Z. Zhang (1993) Matching pursuits with time-frequency dictionaries. IEEE Tran. Signal Proces. 41 (12), pp. 3397–3415. Cited by: §1.
  • [19] H. Mania, X. Pan, D. Papailiopoulos, B. Recht, K. Ramchandran, and M. I. Jordan (2017) Perturbed iterate analysis for asynchronous stochastic optimization. SIAM J. Optim. 27 (4), pp. 2202–2229. Cited by: §5.
  • [20] D. Needell and J. Tropp (2010) CoSaMP: iterative signal recovery from incomplete and inaccurate samples. Commun. ACM 53 (12), pp. 93–100. Cited by: §1, §1, §3.1, §3.1, §5.
  • [21] S. N. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu (2009) A unified framework for high-dimensional analysis of m-estimators with decomposable regularizers. In NIPS, pp. 1348–1356. Cited by: §1.
  • [22] Y. Nesterov (2012) Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22 (2), pp. 341–362. Cited by: §1.
  • [23] N. Nguyen, D. Needell, and T. Woolf (2014) Linear convergence of stochastic iterative greedy algorithms with sparse constraints. CoRR abs/1407.0088. Cited by: 3rd item.
  • [24] S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola (2015) On variance reduction in stochastic gradient descent and its asynchronous variants. In NIPS, pp. 2629–2637. Cited by: §5.
  • [25] S. Shalev-Shwartz, N. Srebro, and T. Zhang (2010) Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM J. Optim. 20 (6), pp. 2807–2832. Cited by: §1.
  • [26] F. Shang, Y. Liu, J. Cheng, K. W. Ng, and Y. Yoshida (2018) Guaranteed sufficient decrease for stochastic variance reduced gradient optimization. In AISTATS, pp. 1027–1036. Cited by: §5.
  • [27] F. Shang, Y. Liu, J. Cheng, and J. Zhuo (2017) Fast stochastic variance reduced gradient method with momentum acceleration for machine learning. CoRR arXiv:1703.07948. Cited by: §5.
  • [28] F. Shang, Y. Liu, and J. Cheng (2016) Scalable algorithms for tractable Schatten quasi-norm minimization. In AAAI, pp. 2016–2022. Cited by: §5.
  • [29] F. Shang, Y. Liu, and J. Cheng (2016) Tractable and scalable Schatten quasi-norm approximations for rank minimization. In AISTATS, pp. 620–629. Cited by: §5.
  • [30] F. Shang, Y. Liu, L. Jiao, K. Zhou, J. Cheng, Y. Ren, and Y. Jin (2018) ASVRG: accelerated proximal SVRG. In Proc. Mach. Learn. Res., pp. 1–16. Cited by: §5.
  • [31] F. Shang, K. Zhou, H. Liu, J. Cheng, I. Tsang, L. Zhang, D. Tao, and L. Jiao (2019) VR-SGD: a simple stochastic variance reduction method for machine learning. IEEE Trans. Knowl. Data Eng.. Cited by: §5.
  • [32] J. Shen and P. Li (2018) A tight bound of hard thresholding. J. Mach. Learn. Res. 18, pp. 1–42. Cited by: §1, §3.1.
  • [33] L. Tian, P. Xu, and Q. Gu (2016) Forward backward greedy algorithms for multi-task learning with faster rates. In UAI, pp. 735–744. Cited by: §1.
  • [34] R. Tibshirani (1996) Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B 58 (1), pp. 267–288. Cited by: §1.
  • [35] J. A. Tropp and A. C. Gilbert (2007) Signal recovery from random measurements via orthogonal matching pursuit. IEEE Tran. Inform. Theory 53 (12), pp. 4655–4666. Cited by: §1.
  • [36] S. A. van de Geer (2008) High-dimensional generalized linear models and the lasso. Ann. Statist. 36 (2), pp. 614–645. Cited by: §1.
  • [37] X. Yuan, P. Li, and T. Zhang (2014) Gradient hard thresholding pursuit for sparsity-constrained optimization. In ICML, pp. 127–135. Cited by: §1, 2nd item, §4.3.
  • [38] T. Zhang (2011) Adaptive forward-backward greedy algorithm for learning sparse representations. IEEE Tran. Inform. Theory 57 (7), pp. 4689–4708. Cited by: §1.
  • [39] K. Zhou, Q. Ding, F. Shang, J. Cheng, D. Li, and Z. Luo (2019) Direct acceleration of SAGA using sampled negative momentum. In AISTATS, pp. 1602–1610. Cited by: §3.1.
  • [40] K. Zhou, F. Shang, and J. Cheng (2018) A simple stochastic variance reduced algorithm with fast convergence rates. In ICML, pp. 5975–5984. Cited by: §5.
  • [41] P. Zhou, X. Yuan, and J. Feng (2018) Efficient stochastic gradient hard thresholding. In NIPS, pp. 1988–1997. Cited by: §1.