1 Introduction
Massive high-dimensional data are common nowadays and pose new challenges to algorithms for sparse learning. For high-dimensional data analysis, it is important to exploit the low intrinsic structure and dimensionality of the data, such as sparsity and low-rank structure, which is often achieved by imposing structural assumptions on the parameter of the underlying model. In recent years, $\ell_1$-norm regularized models, such as Lasso [34] and $\ell_1$-norm regularized logistic regression [36, 21], were proposed to attain computational tractability by using the $\ell_1$-norm as a convex surrogate for the $\ell_0$-norm. In spite of their computational advantages, $\ell_1$-norm models have some limitations [6] and attain worse empirical performance than $\ell_0$-norm models [9]. Thus, it is necessary and challenging to solve the $\ell_0$-norm constrained problem directly. In this paper, we focus on the following sparsity-constrained optimization problem:

$$\min_{x\in\mathbb{R}^{d}}\ f(x):=\frac{1}{n}\sum_{i=1}^{n}f_{i}(x),\quad \mathrm{s.t.}\ \|x\|_{0}\leq s,\qquad (1)$$

where $f(x)$ is a finite sum of smooth convex component functions $f_{i}(x)$, $\|x\|_{0}\leq s$ means that the number of nonzero entries in $x$ is no more than $s$, and $s$ is used to control the sparsity level of the model parameter. Problem (1) arises in machine learning, statistics and related areas, e.g., the sparsity-constrained linear regression problem:

$$\min_{x\in\mathbb{R}^{d}}\ \frac{1}{2n}\|y-Wx\|_{2}^{2},\quad \mathrm{s.t.}\ \|x\|_{0}\leq s,\qquad (2)$$

where $x\in\mathbb{R}^{d}$ is the unknown model parameter, $y\in\mathbb{R}^{n}$ is the response vector, and $W\in\mathbb{R}^{n\times d}$ is the design matrix.

Due to the nonconvexity of the sparsity constraint, Problem (1) is NP-hard. In order to obtain an approximate solution to Problem (1), a large family of greedy algorithms [18, 20, 35, 10, 25, 3, 38, 33] has been proposed. Beyond these, there has been much progress on gradient hard thresholding methods such as fast gradient hard thresholding pursuit (FGHT) [37] and iterative hard thresholding (IHT) [5]. However, all these algorithms are based on deterministic optimization such as gradient descent: in each iteration they require the computation of a full gradient over all the component functions, which is expensive for large-scale problems (i.e., when $n$ is very large).
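To make the per-iteration cost concrete, the following is a minimal NumPy sketch (our own illustration, not the authors' code) of the least-squares objective in (2) and its full gradient; the $1/(2n)$ scaling matches the formulation above and is our assumption.

```python
import numpy as np

def full_objective(W, y, x):
    # f(x) = (1/(2n)) * ||y - W x||_2^2 for the linear regression problem (2).
    n = W.shape[0]
    r = y - W @ x
    return 0.5 * (r @ r) / n

def full_gradient(W, y, x):
    # One full gradient touches every row of W, i.e. O(n*d) work per call; this is
    # the per-iteration bottleneck of deterministic methods such as IHT and FGHT.
    n = W.shape[0]
    return -(W.T @ (y - W @ x)) / n

# Toy usage: a single full-gradient evaluation over n = 1000 samples.
rng = np.random.default_rng(0)
W = rng.standard_normal((1000, 500))
x_true = np.zeros(500)
x_true[:10] = 1.0
y = W @ x_true
g = full_gradient(W, y, np.zeros(500))
```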
To address this issue, several stochastic optimization algorithms have been proposed. For example, Nguyen et al. [23] proposed a stochastic gradient hard thresholding algorithm (SGHT), and Li et al. [14] proposed a stochastic variance reduced gradient hard thresholding algorithm (SVRGHT), which is based on the stochastic variance reduced gradient (SVRG, also called semi-stochastic gradient in [13]) method [12]. With the help of the variance reduction technique, SVRGHT converges stably and efficiently, and also attains better estimation accuracy than SGHT [32].
Besides these methods, there are several other stochastic first- or second-order nonconvex optimization algorithms, such as ASBCDHT [7], HSGHT [41], FNHTP [8] and SLBFGS [11]. However, most of the algorithms mentioned above need one hard thresholding operation in each inner iteration, which is time-consuming for high-dimensional data (i.e., $O(d\log d)$ in general, although the time complexity can be improved using a max-heap strategy). On the other hand, a hard thresholding operation applied in each inner iteration discards information from the current solution, which may require more gradient descent steps to reach the same accuracy. It should be emphasized that in this paper we do not consider coordinate-descent type algorithms [22, 4, 7]. As a result, a new algorithm that needs fewer hard thresholding operations yet retains fast stochastic updates becomes more attractive for large-scale and high-dimensional problems.

Inspired by the gradient support pursuit (GraSP) [3] and compressive sampling matching pursuit (CoSaMP) [20] algorithms, this paper proposes a new relaxed gradient support pursuit (RGraSP) framework for solving large-scale sparsity-constrained problems. In each iteration of our framework, we first find the most relevant support set, minimize loosely over this support set using an algorithm that is only required to satisfy a certain descent condition, and then apply a hard thresholding operator to the updated parameter. For minimizing the objective function, we introduce two efficient semi-stochastic gradient algorithms into our RGraSP framework, yielding a stochastic variance reduced gradient support pursuit (SVRGSP) algorithm and its fast version (SVRGSP+). Moreover, benefiting from significantly fewer hard thresholding operations than SVRGHT, the average per-iteration computational cost of our algorithms is much lower (i.e., $O(d)$ for our algorithms vs. $O(d\log d)$ for algorithms such as SVRGHT), which leads to faster convergence. Experimental results on synthetic and real-world datasets verify the superiority of our algorithms over state-of-the-art methods.
Table 1: Comparison of per-iteration computational complexity.

Algorithms                    | Per-iteration Complexity
FGHT and IHT (deterministic)  | $O(nd + d\log d)$
SGHT and SVRGHT (stochastic)  | $O(d\log d)$
SVRGSP and SVRGSP+ (Ours)     | $O(d)$
2 Notations
Throughout this paper, we use $W\in\mathbb{R}^{n\times d}$ to denote the design matrix, $y\in\mathbb{R}^{n}$ to denote the response vector, and $x\in\mathbb{R}^{d}$ to denote the model parameter. $x^{*}$ is the optimal solution of Problem (1) and $s^{*}$ is the optimal sparsity level, which satisfies $s^{*}\leq s$. For the parameter $x$, $\|x\|_{0}$ is the number of nonzero entries in the vector $x$, $\|x\|_{1}$ is the $\ell_1$-norm, and $\|x\|_{2}$ is the $\ell_2$-norm. $\mathrm{supp}(x)$ denotes the index set of nonzero entries of $x$, and $\mathrm{supp}(x,s)$ is the index set of the $s$ largest entries of $x$ in terms of magnitude. $\Omega^{c}$ denotes the complement of a set $\Omega$, $x_{\Omega}$ is a vector that equals $x$ except for the coordinates in $\Omega^{c}$, where it is zero, and $|\Omega|$ denotes the cardinality of $\Omega$. In addition, $\nabla f(x)$ denotes the gradient of the objective function $f$ at $x$, and $I$ is an identity matrix.
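For concreteness, the set-valued notation above can be read as the following small NumPy helpers (a sketch added for illustration; the function names are ours):

```python
import numpy as np

def supp(x):
    # supp(x): index set of the nonzero entries of x.
    return set(np.flatnonzero(x))

def supp_top(x, s):
    # supp(x, s): index set of the s largest-magnitude entries of x (assumes 1 <= s <= len(x)).
    return set(np.argpartition(np.abs(x), -s)[-s:])

def restrict(x, omega):
    # x_Omega: a vector that equals x on the coordinates in omega and is zero elsewhere.
    out = np.zeros_like(x)
    idx = np.fromiter(omega, dtype=int)
    out[idx] = x[idx]
    return out
```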
3 Relaxed Gradient Support Pursuit
In this section, we propose an efficient relaxed gradient support pursuit framework for sparsity-constrained nonconvex optimization problems. Moreover, we also present the details of two specific stochastic variance reduced gradient support pursuit algorithms (called SVRGSP and SVRGSP+).
3.1 Our Gradient Support Pursuit Framework
Most existing gradient support pursuit algorithms use deterministic optimization methods to minimize various sparsity-constrained problems (e.g., Problem (1)). However, there are several stochastic algorithms, such as SVRGHT [14], which can be used to attain better performance [14, 32, 11]. Inspired by the well-known GraSP [3] and CoSaMP [20] algorithms, this paper proposes a Relaxed Gradient Support Pursuit (RGraSP) framework to quickly find an approximate solution to Problem (1). As pointed out in [3], CoSaMP can be viewed as a special case of GraSP when the squared error is the cost function.
The main difference between GraSP [3] and our RGraSP framework is that the former needs to yield the exact solution to the following problem:

$$\min_{x\in\mathbb{R}^{d}}\ f(x),\quad \mathrm{s.t.}\ x_{\Omega^{c}}=0,$$

where $\Omega$ is a given support set, while the latter only requires a solver (e.g., Algorithm 2 below) that returns an approximate solution $b$ to the above problem. RGraSP solves sparsity-constrained nonconvex optimization problems by allowing users to pick a specially designed algorithm according to the properties of $f$. In other words, we can choose different solvers for the subproblem in Step 5 of Algorithm 1, as long as the chosen algorithm satisfies a certain descent condition. In this sense, both GraSP [3] and CoSaMP [20] can be viewed as special cases of RGraSP, obtained when the subproblem is solved exactly (i.e., with error tolerance $\epsilon=0$). Our general RGraSP framework is outlined in Algorithm 1.
At each iteration of Algorithm 1, we first compute the gradient of $f$ at the current estimate, i.e., $g=\nabla f(x)$. Then we choose the $2s$ coordinates of $g$ that have the largest magnitude as the direction in which pursuing the minimization will be most effective, and denote their indices by $\mathcal{Z}=\mathrm{supp}(g,2s)$, where $s$ is the sparsity constant. Merging the support of the current estimate with the coordinates mentioned above, we obtain the combined support, a set $\Omega=\mathcal{Z}\cup\mathrm{supp}(x)$ of at most $3s$ indices. Over $\Omega$, we compute an estimate $b$ as the approximate solution to the subproblem

$$\min_{x\in\mathbb{R}^{d}}\ f(x),\quad \mathrm{s.t.}\ x_{\Omega^{c}}=0.\qquad (3)$$
The parameter is then updated using the hard thresholding operator, which keeps the $s$ largest entries of the intermediate estimate $b$; this step makes the new iterate the best $s$-term approximation of the estimate $b$. The hard thresholding operator is defined as follows:

$$[\mathcal{H}_{s}(b)]_{i}=\begin{cases}b_{i}, & i\in\mathrm{supp}(b,s),\\ 0, & \text{otherwise}.\end{cases}$$

Essentially, the hard thresholding operator keeps the $s$ largest (in magnitude) entries and sets the other entries equal to zero.
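A minimal implementation sketch of this operator (our own, not the paper's code) is given below; note that top-$s$ selection avoids a full sort.

```python
import numpy as np

def hard_threshold(b, s):
    # H_s(b): keep the s largest-magnitude entries of b and zero out the rest.
    # np.argpartition selects the top-s indices in O(d) expected time, whereas a
    # full sort would cost O(d log d).
    x = np.zeros_like(b)
    top = np.argpartition(np.abs(b), -s)[-s:]
    x[top] = b[top]
    return x
```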
In Algorithm 1, we only require that the solver for the subproblem (3) satisfies a performance guarantee of the form

$$f(b)-f(b^{*})\leq\epsilon,$$

where $b^{*}$ is an optimal solution to the subproblem (3) and $\epsilon\geq 0$ is an error tolerance; that is, the solver proposed below has to achieve a certain accuracy for our RGraSP framework. In fact, our solver can reach any given accuracy after sufficiently many iterations, as suggested by [1]. Although there are quite a number of solvers that we could use, such as the first-order solvers [3, 17, 39] and the second-order solvers [2, 8, 11], we observe that the semi-stochastic gradient solver outperforms the others in most cases, as in [15]. In the following, we present two efficient semi-stochastic gradient algorithms as the solver in Algorithm 1.
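To summarize the steps described above, here is a minimal sketch of the outer RGraSP loop. It is our illustration under the notation above, not the paper's Algorithm 1: the solver argument stands in for Algorithm 2, and the fixed iteration count and argument names are our assumptions.

```python
import numpy as np

def rgrasp(grad_f, solver, x0, s, n_iters=50):
    # Relaxed gradient support pursuit (sketch).
    #   grad_f(x)        -> full gradient of f at x
    #   solver(x, omega) -> approximate minimizer of f over the support omega
    #                       (only required to satisfy the descent condition in the text)
    #   s                -> target sparsity level
    x = x0.copy()
    for _ in range(n_iters):
        g = grad_f(x)
        z = np.argpartition(np.abs(g), -2 * s)[-2 * s:]   # top-2s gradient coordinates
        omega = np.union1d(z, np.flatnonzero(x))          # merged support, |omega| <= 3s
        b = solver(x, omega)                              # approximate solution of subproblem (3)
        x = np.zeros_like(b)                              # hard thresholding: keep s largest entries
        top = np.argpartition(np.abs(b), -s)[-s:]
        x[top] = b[top]
    return x
```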


3.2 Our Semi-Stochastic Gradient Solvers
In our RGraSP framework, we run a number of semi-stochastic gradient iterations as the solver. Combining our semi-stochastic gradient solver (i.e., Algorithm 2) with our RGraSP framework, we propose a stochastic variance reduced gradient support pursuit (SVRGSP) algorithm and its fast variant (SVRGSP+) to solve Problem (1). The semi-stochastic gradient solver is outlined in Algorithm 2.


In each iteration of Algorithm 2, we first initialize $g$ as the snapshot gradient in our semi-stochastic gradient update, which has already been computed in Algorithm 1. Then we select a sample index $i$ uniformly at random from $\{1,\ldots,n\}$. The semi-stochastic gradient defined in Step 3 of Algorithm 2 is then formed from this sample, i.e., $\nabla f_{i}(x)-\nabla f_{i}(\tilde{x})+g$, where $\tilde{x}$ denotes the snapshot point. Note that this gradient is called a semi-stochastic gradient because it combines a deterministic full gradient $g$ with two stochastic gradients, as shown in Step 3 of Algorithm 2. Meanwhile, the semi-stochastic gradient reduces the variance introduced by random sampling and thus accelerates convergence [12, 13]. Finally, $x$ is updated using the semi-stochastic gradient with a constant step-size $\eta$, as shown in Step 4 of Algorithm 2. Therefore, our SVRGSP solver performs no hard thresholding operations within a whole epoch, while existing stochastic algorithms such as SVRGHT [14] perform one hard thresholding operation in each stochastic iteration, which naturally leads to significantly slower convergence.

Moreover, we can also apply a small number of hard thresholding operations in each epoch to maintain the main coordinates and obtain faster convergence. In other words, Step 4 in Algorithm 2 has another option, which yields a fast variant of our SVRGSP solver (i.e., SVRGSP+) that still has significantly fewer hard thresholding operations than existing algorithms such as SVRGHT [14]. Therefore, the average per-iteration computational complexity of both SVRGSP and SVRGSP+ is much lower, i.e., $O(d)$ for both SVRGSP and SVRGSP+ vs. $O(d\log d)$ for SVRGHT.
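The inner solver can be sketched as follows. This is our own illustration, not the paper's Algorithm 2: in particular, restricting the update to the merged support omega and the ht_every thresholding schedule are assumptions we make so that the sketch matches the subproblem (3) and the SVRGSP+ option described above.

```python
import numpy as np

def semi_stochastic_solver(grad_fi, x_t, omega, n, eta, m, s=None, ht_every=None, seed=0):
    # SVRG-style inner solver (sketch) for the support-restricted subproblem (3).
    #   grad_fi(i, x) -> gradient of the i-th component f_i at x
    #   x_t           -> current outer iterate, used as the snapshot point
    #   omega         -> merged support set (array of coordinate indices)
    #   eta, m        -> constant step-size and number of inner iterations
    #   ht_every      -> if given, apply hard thresholding to s entries every
    #                    ht_every inner iterations (SVRGSP+ option); SVRGSP uses None.
    rng = np.random.default_rng(seed)
    mask = np.zeros(x_t.shape[0], dtype=bool)
    mask[np.asarray(omega, dtype=int)] = True
    # Snapshot (full) gradient g; in Algorithm 1 it is already available and would
    # simply be passed in, so recomputing it here is only for self-containedness.
    g = sum(grad_fi(i, x_t) for i in range(n)) / n
    x = x_t.copy()
    for j in range(1, m + 1):
        i = rng.integers(n)                               # sample a component uniformly at random
        v = grad_fi(i, x) - grad_fi(i, x_t) + g           # semi-stochastic gradient (Step 3)
        x = x - eta * np.where(mask, v, 0.0)              # constant step-size update on omega (Step 4)
        if ht_every is not None and s is not None and j % ht_every == 0:
            top = np.argpartition(np.abs(x), -s)[-s:]     # occasional hard thresholding (SVRGSP+)
            x_new = np.zeros_like(x)
            x_new[top] = x[top]
            x = x_new
    return x
```

Plugging this solver into the rgrasp sketch above (e.g., solver = lambda x, omega: semi_stochastic_solver(grad_fi, x, omega, n, eta, m)) gives the SVRGSP variant, and passing ht_every gives the SVRGSP+ variant.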
From the above analysis, we can see that our algorithms (i.e., SVRGSP and SVRGSP+) apply a hard thresholding operation only after a large number of stochastic gradient iterations, while existing stochastic algorithms (e.g., SVRGHT [14]) perform a hard thresholding operation in each iteration, which is very time-consuming for high-dimensional problems and thus leads to much slower convergence. Although frequent hard thresholding operations keep the model parameter $x$ sparse, they also discard more gradient information, which is not desirable for fast convergence. In our experiments, we fix the constant step-size $\eta$ and the number of inner iterations $m$ of Algorithm 2. In particular, it is not difficult to prove that our algorithms (i.e., SVRGSP and SVRGSP+) enjoy a fast linear convergence rate, as SVRGHT does. Please refer to the long version of this paper for the detailed convergence analysis.
4 Experimental Results
In this section, we apply the proposed algorithms (i.e., SVRGSP and SVRGSP+) to solve sparsity-constrained linear regression and sparsity-constrained logistic regression problems, and evaluate their empirical performance on synthetic and real-world datasets. (The source code of our two algorithms can be downloaded from the authors' webpage.) All the experiments were performed on a PC with an Intel i7-7700 CPU and 32GB RAM.
4.1 Baseline Methods
In all the experiments, we compare the proposed algorithms (i.e., SVRGSP and SVRGSP+) with the following four state-of-the-art sparsity-constrained algorithms:

Gradient Support Pursuit (GraSP) [3];

Fast Gradient Descent with Hard Thresholding (FGHT) [37];

Stochastic Gradient Descent with Hard Thresholding (SGHT) [23];

and Stochastic Variance Reduced Gradient with Hard Thresholding (SVRGHT) [14].
Note that GraSP and FGHT are two well-known deterministic optimization methods, while SGHT and SVRGHT are two recently proposed stochastic optimization methods. The learning rates of all these methods (as well as their other parameters) need to be tuned. Here, we simply fix the step-size $\eta$ and the number of inner iterations $m$ for our two algorithms; it should be noted that better results can be obtained by tuning these two parameters jointly.
4.2 Synthetic Data
In this part, we empirically investigate the performance of our algorithms for solving the sparsity-constrained linear model (2) on synthetic datasets. We first generate synthetic design matrices $W$, each row of which is drawn independently from a $d$-dimensional Gaussian distribution with mean $0$ and covariance matrix $\Sigma$. The response vector is generated from the model $y=Wx^{*}+\varepsilon$, where $x^{*}$ is the sparse coefficient vector, and the noise $\varepsilon$ is drawn from a multivariate normal distribution $\mathcal{N}(0,\sigma^{2}I)$. The nonzero entries in $x^{*}$ are sampled independently from a uniform distribution over a fixed interval. For the experiments, we construct two synthetic datasets that differ in sample size, dimension and sparsity level; in the second dataset, the diagonal entries of the covariance matrix $\Sigma$ are set to 1 and the other entries are set to a common positive constant, so that the features are correlated. The sparsity parameter $s$ is set to the same value for all the algorithms.
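As a concrete, hypothetical instance of this setup, the generator below follows the description in this subsection. All numerical constants (n, d, the number of nonzeros, the correlation rho, the noise level sigma, and the sampling interval) are placeholders of ours, since the paper's exact values are not reproduced here.

```python
import numpy as np

def make_synthetic(n=5000, d=1000, s_star=50, rho=0.5, sigma=0.1, seed=0):
    # Rows of W ~ N(0, Sigma) with unit diagonal and constant off-diagonal rho;
    # x* has s_star nonzeros drawn uniformly from an interval; y = W x* + noise.
    rng = np.random.default_rng(seed)
    Sigma = np.full((d, d), rho) + (1.0 - rho) * np.eye(d)
    W = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    x_star = np.zeros(d)
    nz = rng.choice(d, size=s_star, replace=False)
    x_star[nz] = rng.uniform(0.5, 1.0, size=s_star)
    y = W @ x_star + sigma * rng.standard_normal(n)
    return W, y, x_star
```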
Figure 1 shows the computational performance (including the logarithm of the objective function value and the estimation error $\|\hat{x}-x^{*}\|_{2}$) of all the algorithms on the synthetic datasets. The results show that all the stochastic variance reduction algorithms (i.e., SVRGHT and our SVRGSP and SVRGSP+ algorithms) perform better than the deterministic methods (i.e., GraSP and FGHT) and the stochastic method SGHT. We can also see that both SVRGSP and SVRGSP+ converge significantly faster than all the state-of-the-art methods in terms of both function value and estimation error in all the settings. Although SVRGHT has been theoretically proved to have a linear convergence rate for sparsity-constrained linear regression problems, both our algorithms consistently outperform SVRGHT owing to their fewer hard thresholding operations, which is consistent with our analysis. Moreover, SVRGSP+ converges slightly faster than SVRGSP in all the settings, which validates the benefit of performing a few hard thresholding operations in each epoch.
4.3 Real-World Data
We also conduct experiments on two large-scale real-world datasets, rcv1 and real-sim, which can be downloaded from the LIBSVM Data website (https://www.csie.ntu.edu.tw/~cjlin/libsvm/). They contain 20,242 samples with 47,236 features and 72,309 samples with 20,958 features, respectively, as shown in Table 2. We test all the methods for solving sparsity-constrained logistic regression (i.e., $f_{i}(x)=\log(1+\exp(-y_{i}w_{i}^{\top}x))$, where $w_{i}$ is the $i$-th row of $W$ and $y_{i}\in\{-1,+1\}$) and sparsity-constrained linear regression problems. The sparsity parameter $s$ is fixed, and all the algorithms are initialized with the same starting point.
Table 2: Summary of the two real-world datasets.

Datasets  | #Data  | #Features | Sparsity
rcv1      | 20,242 | 47,236    | 0.16%
real-sim  | 72,309 | 20,958    | 0.024%
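For reference, a sketch of the component loss and gradient used in the sparsity-constrained logistic regression experiments is given below, assuming labels in {-1, +1} and no additional regularization (our assumption). The per-sample gradient costs only O(d), which is exactly what the stochastic solver in Section 3.2 exploits.

```python
import numpy as np

def logistic_fi(i, x, W, y):
    # f_i(x) = log(1 + exp(-y_i * <w_i, x>)) for a label y_i in {-1, +1}.
    margin = y[i] * (W[i] @ x)
    return np.logaddexp(0.0, -margin)

def logistic_grad_fi(i, x, W, y):
    # Gradient of f_i(x): -y_i * w_i / (1 + exp(y_i * <w_i, x>)); O(d) per sample.
    margin = y[i] * (W[i] @ x)
    return (-y[i] / (1.0 + np.exp(margin))) * W[i]
```

Binding W and y (e.g., via functools.partial) gives the grad_fi callback expected by the semi-stochastic solver sketched in Section 3.2.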
Figure 2 shows the logarithm of the sparsity-constrained logistic regression function gap (i.e., the difference between the current objective value and the best value found) with respect to the number of effective passes and the running time on the rcv1 and real-sim datasets. Similar results for all the algorithms solving the sparsity-constrained linear regression problem are shown in Figure 3. From all the experimental results, it is clear that the proposed algorithms (i.e., SVRGSP and SVRGSP+) outperform the other state-of-the-art sparsity-constrained optimization algorithms in terms of both the number of effective passes and the running time; that is, our algorithms converge significantly faster than the other algorithms. In particular, SVRGSP+ is consistently faster than all the other algorithms, including SVRGSP.
Although GraSP [3] performs similarly to FGHT [37] in terms of the number of effective passes, GraSP is usually slower due to its higher per-iteration complexity. Both our algorithms are significantly faster than GraSP in terms of both the number of effective passes and the running time, which verifies the effectiveness of our relaxed gradient support pursuit framework. Moreover, all the semi-stochastic gradient algorithms (including SVRGHT and our two algorithms) find better solutions than the first-order deterministic methods (i.e., GraSP and FGHT) and the stochastic gradient method SGHT, which demonstrates the effectiveness of the stochastic variance reduction technique. Both our algorithms are much faster than SVRGHT in all the settings, which suggests that our algorithms are well suited to real-world large-scale sparse learning problems.
5 Conclusions and Future Work
In this paper, we proposed a relaxed gradient support pursuit (RGraSP) framework for solving various large-scale sparsity-constrained optimization problems. We also presented two efficient semi-stochastic gradient hard thresholding algorithms as solvers for our RGraSP framework. Our theoretical analysis in the long version of this paper shows that both our algorithms have a fast linear convergence rate. In particular, our algorithms require far fewer hard thresholding operations than most existing algorithms, and their average per-iteration computational cost is much lower (i.e., $O(d)$ for our algorithms vs. $O(d\log d)$ for algorithms such as SVRGHT), which leads to faster convergence. Experimental results on synthetic and real-world datasets verified our theoretical results and demonstrated the effectiveness and efficiency of our algorithms.
Unlike GraSP [3] and CoSaMP [20], RGraSP is a more general algorithmic framework, especially for minimizing complex cost functions whose exact solutions cannot be obtained in polynomial time. In particular, our proposed semi-stochastic gradient solver lends itself to asynchronous parallel and distributed implementations similar to [24, 19]. Some recently proposed accelerated SVRG algorithms [27, 1, 40, 30, 26, 31] (e.g., Katyusha [1] and MiG [40]) and their asynchronous parallel and distributed variants can also be used to solve the subproblem in our framework. Moreover, our algorithms and their convergence results can be extended to the non-smooth setting (e.g., non-smooth cost functions) by using the property of stable restricted linearization as in [3], as well as to low-rank matrix and tensor settings such as [28, 29, 16].

Acknowledgments
This work was supported by the State Key Program of the National Natural Science Foundation of China (No. 61836009), the Foundation for Innovative Research Groups of the National Natural Science Foundation of China (No. 61621005), the Major Research Plan of the National Natural Science Foundation of China (Nos. 91438201 and 91438103), the Fund for Foreign Scholars in University Research and Teaching Programs (the 111 Project) (No. B07048), the National Natural Science Foundation of China (Nos. 61976164, 61876220, 61876221, U1701267, U1730109, and 61871310), the Program for Cheung Kong Scholars and Innovative Research Team in University (No. IRT_15R53), the Science Foundation of Xidian University (Nos. 10251180018 and 10251180019), the Fundamental Research Funds for the Central Universities (No. 20101195989), the National Science Basic Research Plan in Shaanxi Province of China (No. 2019JQ657), and the Key Special Project of China High Resolution Earth Observation System (Young Scholar Innovation Fund).
References
[1] (2018) Katyusha: the first direct acceleration of stochastic gradient methods. J. Mach. Learn. Res. 18, pp. 1–51.
[2] (2007) Scalable training of $\ell_1$-regularized log-linear models. In ICML, pp. 33–40.
[3] (2013) Greedy sparsity-constrained optimization. J. Mach. Learn. Res. 14 (1), pp. 807–841.
[4] (2013) Sparsity constrained nonlinear optimization: optimality conditions and algorithms. SIAM J. Optim. 23 (3), pp. 1480–1509.
[5] (2009) Iterative hard thresholding for compressed sensing. Appl. Comput. Harmon. Anal. 27 (3), pp. 265–274.
[6] (2008) Enhancing sparsity by reweighted $\ell_1$ minimization. J. Fourier Anal. Appl. 14 (5-6), pp. 877–905.
[7] (2016) Accelerated stochastic block coordinate gradient descent for sparsity constrained optimization. In UAI.
[8] (2017) Fast Newton hard thresholding pursuit for sparsity constrained nonconvex optimization. In KDD, pp. 757–766.
[9] (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96 (456), pp. 1348–1360.
[10] (2011) Hard thresholding pursuit: an algorithm for compressive sensing. SIAM J. Numer. Anal. 49 (6), pp. 2543–2563.
[11] (2018) Stochastic second-order method for large-scale nonconvex sparse learning models. In IJCAI, pp. 2128–2134.
[12] (2013) Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pp. 315–323.
[13] (2017) Semi-stochastic gradient descent methods. Front. Appl. Math. Stat. 3.
[14] (2016) Stochastic variance reduced optimization for nonconvex sparse learning. In ICML, pp. 917–925.
[15] (2019) Loopless semi-stochastic gradient descent with less hard thresholding for sparse learning. In CIKM, pp. 881–890.
[16] (2016) Generalized higher order orthogonal iteration for tensor learning and decomposition. IEEE Trans. Neural Netw. Learning Syst. 27 (12), pp. 2551–2563.
[17] (2019) Accelerated incremental gradient descent using momentum acceleration with scaling factor. In IJCAI, pp. 3045–3051.
[18] (1993) Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Process. 41 (12), pp. 3397–3415.
[19] (2017) Perturbed iterate analysis for asynchronous stochastic optimization. SIAM J. Optim. 27 (4), pp. 2202–2229.
[20] (2010) CoSaMP: iterative signal recovery from incomplete and inaccurate samples. Commun. ACM 53 (12), pp. 93–100.
[21] (2009) A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In NIPS, pp. 1348–1356.
[22] (2012) Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22 (2), pp. 341–362.
[23] (2014) Linear convergence of stochastic iterative greedy algorithms with sparse constraints. CoRR abs/1407.0088.
[24] (2015) On variance reduction in stochastic gradient descent and its asynchronous variants. In NIPS, pp. 2629–2637.
[25] (2010) Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM J. Optim. 20 (6), pp. 2807–2832.
[26] (2018) Guaranteed sufficient decrease for stochastic variance reduced gradient optimization. In AISTATS, pp. 1027–1036.
[27] (2017) Fast stochastic variance reduced gradient method with momentum acceleration for machine learning. CoRR abs/1703.07948.
[28] (2016) Scalable algorithms for tractable Schatten quasi-norm minimization. In AAAI, pp. 2016–2022.
[29] (2016) Tractable and scalable Schatten quasi-norm approximations for rank minimization. In AISTATS, pp. 620–629.
[30] (2018) ASVRG: accelerated proximal SVRG. In Proc. Mach. Learn. Res., pp. 1–16.
[31] (2019) VR-SGD: a simple stochastic variance reduction method for machine learning. IEEE Trans. Knowl. Data Eng.
[32] (2018) A tight bound of hard thresholding. J. Mach. Learn. Res. 18, pp. 1–42.
[33] (2016) Forward backward greedy algorithms for multi-task learning with faster rates. In UAI, pp. 735–744.
[34] (1996) Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B 58 (1), pp. 267–288.
[35] (2007) Signal recovery from random measurements via orthogonal matching pursuit. IEEE Trans. Inform. Theory 53 (12), pp. 4655–4666.
[36] (2008) High-dimensional generalized linear models and the lasso. Ann. Statist. 36 (2), pp. 614–645.
[37] (2014) Gradient hard thresholding pursuit for sparsity-constrained optimization. In ICML, pp. 127–135.
[38] (2011) Adaptive forward-backward greedy algorithm for learning sparse representations. IEEE Trans. Inform. Theory 57 (7), pp. 4689–4708.
[39] (2019) Direct acceleration of SAGA using sampled negative momentum. In AISTATS, pp. 1602–1610.
[40] (2018) A simple stochastic variance reduced algorithm with fast convergence rates. In ICML, pp. 5975–5984.
[41] (2018) Efficient stochastic gradient hard thresholding. In NIPS, pp. 1988–1997.