Massive high-dimensional data are common nowadays and impose new challenges on algorithms for sparse learning. For high-dimensional data analysis, it is important to exploit the low intrinsic structure and dimensionality of the data, such as sparsity and low-rank structure, which is often attained by imposing structural assumptions on the parameter of the underlying model. In recent years, $\ell_1$-norm regularized models, such as Lasso and $\ell_1$-norm regularized logistic regression [36, 21], were proposed to pursue computational tractability by using the $\ell_1$-norm as a convex surrogate for the $\ell_0$-norm. In spite of their computational advantages, $\ell_1$-norm models have some limitations and often attain worse empirical performance than $\ell_0$-norm models. Thus, it is necessary, and challenging, to solve the $\ell_0$-norm constrained problem directly. In this paper, we focus on the following sparsity-constrained optimization problem:
$$\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x), \quad \text{s.t. } \|x\|_0 \le s, \qquad (1)$$
where $f(x)$ is the average of $n$ smooth convex component functions $f_i(x)$, $\|x\|_0 \le s$ means that the number of nonzero entries in $x$ is no more than $s$, and $s$ is used to control the sparsity level of the model parameter. An important instance of Problem (1) is the sparsity-constrained linear regression problem,
$$\min_{x \in \mathbb{R}^d} \frac{1}{2n}\|y - Wx\|_2^2, \quad \text{s.t. } \|x\|_0 \le s, \qquad (2)$$
where $x \in \mathbb{R}^d$ is the unknown model parameter, $y \in \mathbb{R}^n$ is the response vector, and $W \in \mathbb{R}^{n \times d}$ is a design matrix.
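As a concrete illustration of Problem (2), the following Python sketch evaluates the least-squares objective and checks the sparsity constraint. (The $\frac{1}{2n}$ scaling and the function names are our own choices for illustration; the paper's exact scaling may differ.)

```python
import numpy as np

def least_squares_obj(W, y, x):
    """f(x) = 1/(2n) * ||y - W x||_2^2, the sparsity-constrained
    linear regression objective of Problem (2) (scaling is an assumption)."""
    n = W.shape[0]
    r = y - W @ x
    return 0.5 / n * (r @ r)

def is_feasible(x, s):
    """Check the sparsity constraint ||x||_0 <= s."""
    return np.count_nonzero(x) <= s
```

The objective is smooth and convex; only the constraint set $\{x : \|x\|_0 \le s\}$ is non-convex, which is what makes the problem hard.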
Due to the non-convexity of the sparsity constraint, Problem (1) is NP-hard. In order to obtain an approximate solution to Problem (1), a large family of greedy algorithms [18, 20, 35, 10, 25, 3, 38, 33] has been proposed. Besides them, there has been much progress on gradient hard thresholding methods such as fast gradient hard thresholding pursuit (FG-HT) and iterative hard thresholding (IHT). However, all these algorithms are based on deterministic optimization such as gradient descent. In each iteration, a gradient descent algorithm requires the computation of a full gradient over all $n$ component functions, which is expensive for large-scale problems (i.e., when $n$ is very large).
To address this issue, several stochastic optimization algorithms have been proposed. For example, Nguyen et al. proposed a stochastic gradient hard thresholding algorithm (SG-HT), and Li et al. proposed a stochastic variance reduced gradient hard thresholding algorithm (SVRGHT), which is based on the stochastic variance reduced gradient (SVRG, also called semi-stochastic gradient) method
. With the help of the variance reduction technique, SVRGHT converges stably and efficiently, and also obtains better estimation accuracy than SG-HT. Besides these methods, there are several other stochastic first- or second-order nonconvex optimization algorithms, such as ASBCDHT, HSG-HT, FNHTP and SL-BFGS. However, most of the algorithms mentioned above need one hard thresholding operation in each inner iteration, which is time-consuming for high-dimensional data (a sort-based implementation takes $O(d\log d)$ time in general, and the time complexity can be improved using a max-heap strategy). On the other hand, a hard thresholding operation in each inner iteration discards part of the information in the current solution, which may require more gradient descent steps to reach the same accuracy. It should be emphasized that in this paper we do not consider coordinate-descent type algorithms [22, 4, 7]. As a result, a new algorithm that needs fewer hard thresholding operations yet retains fast stochastic updates becomes more attractive for large-scale and high-dimensional problems.
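The max-heap strategy mentioned above can be sketched as follows (a hypothetical helper, not code from the paper): instead of fully sorting all $d$ entries in $O(d\log d)$ time, a size-$s$ min-heap scans the vector once and finds the top-$s$ support in $O(d\log s)$ time.

```python
import heapq

def top_s_support(x, s):
    """Return the index set of the s largest-magnitude entries of x.

    A size-s min-heap keyed by |x_i| is maintained while scanning x once,
    giving O(d log s) time instead of the O(d log d) of a full sort."""
    heap = []  # entries (|x_i|, i); smallest magnitude sits at heap[0]
    for i, v in enumerate(x):
        if len(heap) < s:
            heapq.heappush(heap, (abs(v), i))
        elif abs(v) > heap[0][0]:
            heapq.heapreplace(heap, (abs(v), i))  # evict current smallest
    return {i for _, i in heap}
```

For example, `top_s_support([3, -5, 1, 4], 2)` returns `{1, 3}`, the indices of $-5$ and $4$.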
Inspired by the gradient support pursuit (GraSP) and compressive sampling matching pursuit (CoSaMP) algorithms, this paper proposes a new relaxed gradient support pursuit (RGraSP) framework to solve large-scale sparsity-constrained problems. In each iteration of our framework, we first find the most relevant support set, approximately minimize the objective over this support set using any algorithm that satisfies a certain descent condition, and then perform a hard thresholding operation on the updated parameter. For minimizing the objective, we introduce two efficient semi-stochastic gradient algorithms into our RGraSP framework and propose a stochastic variance reduced gradient support pursuit (SVRGSP) algorithm and its fast version (SVRGSP+). Moreover, benefiting from significantly fewer hard thresholding operations than SVRGHT, the average per-iteration computational cost of our algorithms is much lower than that of algorithms such as SVRGHT, which leads to faster convergence. Experimental results on synthetic and real-world datasets verify the superiority of our algorithms over the state-of-the-art methods.
- FG-HT and IHT (deterministic)
- SG-HT and SVRGHT (stochastic)
- SVRGSP and SVRGSP+ (ours)
Throughout this paper, we use $W$ to denote the design matrix, $y$ to denote the response vector, and $x$ to denote the model parameter. $\hat{x}$ is the optimal solution of Problem (1), and $s^*$ is the optimal sparsity level, which satisfies $s \ge s^*$. For a parameter vector $x$, $\|x\|_0$ is the number of nonzero entries in $x$, $\|x\|_2$ is the $\ell_2$-norm, and $\|x\|_\infty$ is the $\ell_\infty$-norm. $\mathrm{supp}(x)$ denotes the index set of nonzero entries of $x$, and $\mathrm{supp}(x, s)$ is the index set of the $s$ largest entries of $x$ in terms of magnitude. $\mathcal{I}^c$ denotes the complement of an index set $\mathcal{I}$, $x_{\mathcal{I}}$ is a vector that equals $x$ except on the coordinates in $\mathcal{I}^c$, where it is zero, and $|\mathcal{I}|$ denotes the cardinality of $\mathcal{I}$. In addition, $\nabla f(x)$ denotes the gradient of the objective function at $x$, and $I$ is an identity matrix.
3 Relaxed Gradient Support Pursuit
In this section, we propose an efficient relaxed gradient support pursuit framework for sparsity-constrained non-convex optimization problems. Moreover, we also present the details of two specific stochastic variance reduced gradient support pursuit algorithms (called SVRGSP and SVRGSP+).
3.1 Our Gradient Support Pursuit Framework
Most existing gradient support pursuit algorithms use deterministic optimization methods to minimize various sparsity-constrained problems (e.g., Problem (1)). However, there are several stochastic algorithms, such as SVRGHT, which can be used to attain better performance [14, 32, 11]. Inspired by the well-known GraSP and CoSaMP algorithms, this paper proposes a Relaxed Gradient Support Pursuit (RGraSP) framework to quickly find an approximate solution to Problem (1). As has been pointed out, CoSaMP can be viewed as a special case of GraSP when the squared error is the cost function.
The main difference between GraSP and our RGraSP framework is that the former needs to yield the exact solution to the following problem:
$$\min_{x} f(x), \quad \text{s.t. } \mathrm{supp}(x) \subseteq \mathcal{T}, \qquad (3)$$
while the latter only requires a solver (e.g., Algorithm 2 below) that returns an approximate solution $b$ to the above problem. RGraSP solves sparsity-constrained non-convex optimization problems while allowing users to pick a specially designed algorithm according to the properties of $f$. In other words, we can choose different solvers for the sub-problem in Step 5 of Algorithm 1, as long as the chosen algorithm satisfies a certain descent condition. In this sense, both GraSP and CoSaMP can be viewed as special cases of RGraSP in which the sub-problem is solved exactly. Our general RGraSP framework is outlined in Algorithm 1.
At each iteration of Algorithm 1, we first compute the gradient of $f$ at the current estimate, i.e., $g = \nabla f(x)$. Then we choose the $2s$ coordinates of $g$ with the largest magnitude as the direction in which pursuing the minimization will be most effective, and denote their indices by $\mathcal{Z} = \mathrm{supp}(g, 2s)$, where $s$ is the sparsity constant. Merging the support of the current estimate with the coordinates above, we obtain the combined support, a set of at most $3s$ indices, i.e., $\mathcal{T} = \mathrm{supp}(x) \cup \mathcal{Z}$. Over $\mathcal{T}$, we compute an estimate $b$ as the approximate solution to the sub-problem,
The parameter is then updated using the hard thresholding operator, which keeps the $s$ largest terms of the intermediate estimate $b$. This step makes the new iterate the best $s$-term approximation of the estimate $b$. The hard thresholding operator $\mathcal{H}_s(\cdot)$ is defined as follows: $[\mathcal{H}_s(b)]_i = b_i$ if $i \in \mathrm{supp}(b, s)$, and $[\mathcal{H}_s(b)]_i = 0$ otherwise. Essentially, the hard thresholding operator keeps the $s$ largest (in magnitude) entries and sets the other entries to zero.
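The hard thresholding operator described above can be sketched in a few lines of Python. Note that `numpy.argpartition` finds the top-$s$ indices by linear-time selection rather than a full sort:

```python
import numpy as np

def hard_threshold(b, s):
    """H_s(b): keep the s largest-magnitude entries of b, zero out the rest."""
    x = np.zeros_like(b)
    if s <= 0:
        return x
    # argpartition performs an O(d) selection of the s largest |b_i|
    idx = np.argpartition(np.abs(b), -s)[-s:]
    x[idx] = b[idx]
    return x
```

For instance, applying `hard_threshold` with $s=2$ to $(3, -5, 1, 4)$ keeps only $-5$ and $4$.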
Specifically, the solver only needs to return an estimate $b$ satisfying $f(b) - f(b^*) \le \varepsilon$, where $b^*$ is an optimal solution to the sub-problem (3), and $\varepsilon$ is an error tolerance. This implies that the solver proposed below has to achieve a certain accuracy for our RGraSP framework. In fact, our solver can reach any given accuracy after sufficiently many iterations. Although there are quite a number of solvers we could use, such as first-order solvers [3, 17, 39] and second-order solvers [2, 8, 11], we observe that the semi-stochastic gradient solver outperforms the others in most cases. In the following, we present two efficient semi-stochastic gradient algorithms as the solver in Algorithm 1.
3.2 Our Semi-Stochastic Gradient Solvers
In our RGraSP framework, we use a sequence of semi-stochastic gradient iterations as the solver. Combining this semi-stochastic gradient solver (i.e., Algorithm 2) with our RGraSP framework, we propose a stochastic variance reduced gradient support pursuit (SVRGSP) algorithm and its fast variant (SVRGSP+) to solve Problem (1). The semi-stochastic gradient solver is outlined in Algorithm 2.
In each iteration of Algorithm 2, we first initialize $g$ as the snapshot gradient for our semi-stochastic gradient update, which has been computed in Algorithm 1. Then we select a sample uniformly at random from $\{1, \dots, n\}$. The semi-stochastic gradient defined in Step 3 of Algorithm 2 is updated based on this sample. Note that the gradient is called semi-stochastic because it combines a deterministic full gradient $g$ with two stochastic gradients, as shown in Step 3 of Algorithm 2. Meanwhile, this semi-stochastic gradient reduces the variance introduced by random sampling and thus accelerates convergence [12, 13]. Finally, the iterate is updated using the semi-stochastic gradient with a constant step-size $\eta$, as shown in Step 4 of Algorithm 2. Therefore, our SVRGSP solver performs no hard thresholding operations within the whole epoch, while existing stochastic algorithms such as SVRGHT perform one hard thresholding operation in each stochastic iteration, which naturally leads to significantly slower convergence.
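The inner loop described above can be sketched as follows. This is a minimal illustration of the SVRG-style update, assuming component gradients are available as a callable; the function and parameter names are ours, not the paper's.

```python
import numpy as np

def svrg_epoch(grad_i, x_tilde, g_full, n, m, eta, rng):
    """One epoch of semi-stochastic gradient updates, in the spirit of Algorithm 2.

    grad_i(x, i): gradient of the i-th component function at x.
    g_full: full (snapshot) gradient at x_tilde, computed once per epoch.
    Note: no hard thresholding inside the inner loop."""
    x = x_tilde.copy()
    for _ in range(m):
        i = rng.integers(n)
        # semi-stochastic gradient: one full gradient plus two stochastic ones
        v = grad_i(x, i) - grad_i(x_tilde, i) + g_full
        x = x - eta * v
    return x
```

At the snapshot point the correction terms cancel, so the update direction is unbiased with a variance that shrinks as $x$ approaches $\tilde{x}$, which is what permits a constant step-size.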
Moreover, we can also apply a few hard thresholding operations in each epoch to maintain the main coordinates and obtain faster convergence. In other words, Step 4 in Algorithm 2 has another option, which yields a fast variant of our SVRGSP solver (i.e., SVRGSP+) that still performs significantly fewer hard thresholding operations than existing algorithms such as SVRGHT. Therefore, the average per-iteration computational complexity of both SVRGSP and SVRGSP+ is much lower than that of SVRGHT.
From the above analysis, we can see that our algorithms (i.e., SVRGSP and SVRGSP+) perform a hard thresholding operation only after a large number of stochastic gradient iterations, while existing stochastic algorithms (e.g., SVRGHT) perform one in each iteration, which is very time-consuming for high-dimensional problems and thus leads to a much slower convergence speed. Although frequent hard thresholding operations keep the model parameter sparse, they discard more gradient information, which is undesirable for fast convergence. In our experiments, the number of inner iterations per epoch is set to a small multiple of $n$, as is common for SVRG-type methods. In particular, it is not difficult to prove that our algorithms (i.e., SVRGSP and SVRGSP+) also enjoy a fast linear convergence rate, as SVRGHT does. Please refer to the long version of this paper for the detailed convergence analysis.
4 Experimental Results
In this section, we apply the proposed algorithms (i.e., SVRGSP and SVRGSP+) to solve sparsity-constrained linear regression and sparsity-constrained logistic regression problems, and evaluate their empirical performance on several synthetic and real-world datasets. The source code of our two algorithms can be downloaded from the authors' webpage. All the experiments were performed on a PC with an Intel i7-7700 CPU and 32GB RAM.
4.1 Baseline Methods
In all the experiments, we compare the proposed algorithms (i.e., SVRGSP and SVRGSP+) with the following four state-of-the-art sparsity-constrained algorithms:
Gradient Support Pursuit (GraSP);
Fast Gradient descent with Hard Thresholding (FG-HT);
Stochastic Gradient Hard Thresholding (SG-HT);
and Stochastic Variance Reduced Gradient with Hard Thresholding (SVRGHT).
Note that GraSP and FG-HT are two well-known deterministic optimization methods, while SG-HT and SVRGHT are two recently proposed stochastic optimization methods. The learning rates of all these methods (as well as other parameters) need to be tuned. For our two algorithms, we fix the step-size and the number of inner iterations across all experiments. It should be noted that better results could be obtained by tuning these two parameters jointly.
4.2 Synthetic Data
In this part, we empirically investigate the performance of our algorithms for solving the sparsity-constrained linear model (2) on several synthetic datasets. We first generate synthetic matrices $W$, each row of which is drawn independently from a $d$-dimensional Gaussian distribution with mean $0$ and covariance matrix $\Sigma$. The response vector is generated from the model $y = Wx^* + \varepsilon$, where $x^*$ is the $s^*$-sparse coefficient vector and the noise $\varepsilon$ is drawn from a multivariate normal distribution $\mathcal{N}(0, \sigma^2 I)$. The nonzero entries of $x^*$ are sampled independently from a uniform distribution over a fixed interval. For the experiments, we construct two synthetic datasets that differ in the sample size $n$, the dimensionality $d$, and the sparsity level $s^*$. The diagonal entries of the covariance matrix $\Sigma$ are set to 1, and the off-diagonal entries are set to a common constant. The sparsity parameter $s$ is set identically for all the algorithms.
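The data-generation procedure above can be sketched as follows. The uniform interval $[-1, 1]$ for the nonzero coefficients and the function signature are our assumptions for illustration; the paper's exact constants are not specified here.

```python
import numpy as np

def make_synthetic(n, d, s_star, sigma, rho, rng):
    """Synthetic data for sparsity-constrained linear regression (a sketch).

    Rows of W ~ N(0, Sigma), with Sigma_jj = 1 and off-diagonals rho;
    y = W x* + eps, eps ~ N(0, sigma^2 I), and x* is s*-sparse."""
    Sigma = np.full((d, d), rho)
    np.fill_diagonal(Sigma, 1.0)
    W = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    x_star = np.zeros(d)
    support = rng.choice(d, size=s_star, replace=False)
    x_star[support] = rng.uniform(-1.0, 1.0, size=s_star)  # interval is an assumption
    y = W @ x_star + sigma * rng.standard_normal(n)
    return W, y, x_star
```

The off-diagonal constant `rho` controls the correlation between features, which is what makes support recovery harder as it grows.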
Figure 1 shows the computational performance (including the logarithm of the objective function value and the estimation error $\|x - x^*\|_2$) of all the algorithms on the synthetic datasets. The results show that the stochastic variance reduction algorithms (i.e., SVRGHT and our SVRGSP and SVRGSP+ algorithms) perform better than the deterministic methods (i.e., GraSP and FG-HT) and the stochastic method SG-HT. We can also see that both SVRGSP and SVRGSP+ converge significantly faster than all the state-of-the-art methods in terms of both function value and estimation error in all the settings. Although SVRGHT has been theoretically proved to have a linear convergence rate for sparsity-constrained linear regression problems, both our algorithms consistently outperform SVRGHT due to fewer hard thresholding operations, which is consistent with our analysis. Moreover, SVRGSP+ converges slightly faster than SVRGSP in all the settings, which validates the benefit of a few hard thresholding operations in each epoch.
4.3 Real-World Data
We also conduct experiments on two large-scale real-world datasets: rcv1 and real-sim, which can be downloaded from the LIBSVM Data website (https://www.csie.ntu.edu.tw/~cjlin/libsvm/). They contain 20,242 samples with 47,236 dimensions and 72,309 samples with 20,958 dimensions, respectively, as shown in Table 2. We test all the methods for solving sparsity-constrained logistic regression and sparsity-constrained linear regression problems with a fixed sparsity parameter $s$, and all the algorithms are initialized with the zero vector.
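For reference, the sparsity-constrained logistic regression objective takes the standard finite-sum form below with labels $y_i \in \{-1, +1\}$; this is the usual formulation, and the paper's exact variant may add a bias term or different scaling.

```python
import numpy as np

def logistic_obj(W, y, x):
    """f(x) = (1/n) * sum_i log(1 + exp(-y_i * w_i^T x)), y_i in {-1, +1}.

    Each component f_i is smooth and convex, so the finite-sum structure
    required by Problem (1) holds."""
    z = y * (W @ x)
    # log(1 + exp(-z)) computed stably as logaddexp(0, -z)
    return np.mean(np.logaddexp(0.0, -z))
```

At $x = 0$ every margin is zero, so the objective equals $\log 2$ regardless of the data, a handy sanity check for implementations.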
Figure 2 shows the logarithm of the sparsity-constrained logistic regression function gap (i.e., $\log(f(x^k) - f(\hat{x}))$) with respect to the number of effective passes and the running time on the rcv1 and real-sim datasets. Similar results for all the algorithms solving the sparsity-constrained linear regression problem are shown in Figure 3. From all the experimental results, it is clear that the proposed algorithms (i.e., SVRGSP and SVRGSP+) outperform the other state-of-the-art sparsity-constrained optimization algorithms in terms of both the number of effective passes and running time, meaning that our algorithms converge significantly faster than the others. In particular, SVRGSP+ is consistently faster than all the other algorithms, including SVRGSP.
Although the performance of GraSP in terms of the number of effective passes is similar to that of FG-HT, GraSP is usually slower due to its higher per-iteration complexity. Both our algorithms are significantly faster than GraSP in terms of both the number of effective passes and running time, which verifies the effectiveness of our relaxed gradient support pursuit framework. Moreover, all the semi-stochastic gradient algorithms (including SVRGHT and our two algorithms) find better solutions than the first-order deterministic methods (i.e., GraSP and FG-HT) and the stochastic gradient method SG-HT, which demonstrates the effectiveness of the stochastic variance reduction technique. Both our algorithms are much faster than SVRGHT in all the settings, which suggests that our algorithms are well suited to real-world large-scale sparse learning problems.
5 Conclusions and Future Work
In this paper, we proposed a relaxed gradient support pursuit (RGraSP) framework for solving various large-scale sparsity-constrained optimization problems. We then presented two efficient semi-stochastic gradient hard thresholding algorithms as the solver of our RGraSP framework. Our theoretical analysis in the long version of this paper shows that both algorithms have a fast linear convergence rate. In particular, our algorithms require far fewer hard thresholding operations than most existing algorithms, and their average per-iteration computational cost is much lower than that of algorithms such as SVRGHT, which leads to faster convergence. Experimental results on synthetic and real-world datasets verified our theoretical results and demonstrated the effectiveness and efficiency of our algorithms.
Unlike GraSP and CoSaMP, RGraSP is a more general algorithmic framework, especially for minimizing complex cost functions whose exact solutions cannot be obtained in polynomial time. In particular, our proposed semi-stochastic gradient solver lends itself to asynchronous parallel and distributed implementations similar to [24, 19]. Some recently proposed accelerated SVRG algorithms [27, 1, 40, 30, 26, 31] (e.g., Katyusha and MiG) and their asynchronous parallel and distributed variants can also be used to solve the sub-problem in our framework. Moreover, our algorithms and their convergence results can be extended to the non-smooth setting (e.g., non-smooth cost functions) by using the property of stable restricted linearization, as well as to low-rank matrix and tensor settings such as [28, 29, 16].
This work was supported by the State Key Program of National Natural Science of China (No. 61836009), the Foundation for Innovative Research Groups of the National Natural Science Foundation of China (No. 61621005), the Major Research Plan of the National Natural Science Foundation of China (Nos. 91438201 and 91438103), the Fund for Foreign Scholars in University Research and Teaching Programs (the 111 Project) (No. B07048), the National Natural Science Foundation of China (Nos. 61976164, 61876220, 61876221, U1701267, U1730109, and 61871310), the Program for Cheung Kong Scholars and Innovative Research Team in University (No. IRT_15R53), the Science Foundation of Xidian University (Nos. 10251180018 and 10251180019), the Fundamental Research Funds for the Central Universities (No. 20101195989), the National Science Basic Research Plan in Shaanxi Province of China (No. 2019JQ-657), and the Key Special Project of China High Resolution Earth Observation System-Young Scholar Innovation Fund.
- (2018) Katyusha: the first direct acceleration of stochastic gradient methods. J. Mach. Learn. Res. 18, pp. 1–51.
- (2007) Scalable training of $\ell_1$-regularized log-linear models. In ICML, pp. 33–40.
- (2013) Greedy sparsity-constrained optimization. J. Mach. Learn. Res. 14 (1), pp. 807–841.
- (2013) Sparsity constrained nonlinear optimization: optimality conditions and algorithms. SIAM J. Optim. 23 (3), pp. 1480–1509.
- (2009) Iterative hard thresholding for compressed sensing. Appl. Comput. Harmon. Anal. 27 (3), pp. 265–274.
- (2008) Enhancing sparsity by reweighted $\ell_1$ minimization. J. Fourier Anal. Appl. 14 (5-6), pp. 877–905.
- (2016) Accelerated stochastic block coordinate gradient descent for sparsity constrained optimization. In UAI.
- (2017) Fast Newton hard thresholding pursuit for sparsity constrained nonconvex optimization. In KDD, pp. 757–766.
- (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96 (456), pp. 1348–1360.
- (2011) Hard thresholding pursuit: an algorithm for compressive sensing. SIAM J. Numer. Anal. 49 (6), pp. 2543–2563.
- (2018) Stochastic second-order method for large-scale nonconvex sparse learning models. In IJCAI, pp. 2128–2134.
- (2013) Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pp. 315–323.
- (2017) Semi-stochastic gradient descent methods. Front. Appl. Math. Stat. 3.
- (2016) Stochastic variance reduced optimization for nonconvex sparse learning. In ICML, pp. 917–925.
- (2019) Loopless semi-stochastic gradient descent with less hard thresholding for sparse learning. In CIKM, pp. 881–890.
- (2016) Generalized higher order orthogonal iteration for tensor learning and decomposition. IEEE Trans. Neural Netw. Learning Syst. 27 (12), pp. 2551–2563.
- (2019) Accelerated incremental gradient descent using momentum acceleration with scaling factor. In IJCAI, pp. 3045–3051.
- (1993) Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Process. 41 (12), pp. 3397–3415.
- (2017) Perturbed iterate analysis for asynchronous stochastic optimization. SIAM J. Optim. 27 (4), pp. 2202–2229.
- (2010) CoSaMP: iterative signal recovery from incomplete and inaccurate samples. Commun. ACM 53 (12), pp. 93–100.
- (2009) A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In NIPS, pp. 1348–1356.
- (2012) Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22 (2), pp. 341–362.
- (2014) Linear convergence of stochastic iterative greedy algorithms with sparse constraints. CoRR abs/1407.0088.
- (2015) On variance reduction in stochastic gradient descent and its asynchronous variants. In NIPS, pp. 2629–2637.
- (2010) Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM J. Optim. 20 (6), pp. 2807–2832.
- (2018) Guaranteed sufficient decrease for stochastic variance reduced gradient optimization. In AISTATS, pp. 1027–1036.
- (2017) Fast stochastic variance reduced gradient method with momentum acceleration for machine learning. CoRR abs/1703.07948.
- (2016) Scalable algorithms for tractable Schatten quasi-norm minimization. In AAAI, pp. 2016–2022.
- (2016) Tractable and scalable Schatten quasi-norm approximations for rank minimization. In AISTATS, pp. 620–629.
- (2018) ASVRG: accelerated proximal SVRG. In Proc. Mach. Learn. Res., pp. 1–16.
- (2019) VR-SGD: a simple stochastic variance reduction method for machine learning. IEEE Trans. Knowl. Data Eng.
- (2018) A tight bound of hard thresholding. J. Mach. Learn. Res. 18, pp. 1–42.
- (2016) Forward backward greedy algorithms for multi-task learning with faster rates. In UAI, pp. 735–744.
- (1996) Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B 58 (1), pp. 267–288.
- (2007) Signal recovery from random measurements via orthogonal matching pursuit. IEEE Trans. Inform. Theory 53 (12), pp. 4655–4666.
- (2008) High-dimensional generalized linear models and the lasso. Ann. Statist. 36 (2), pp. 614–645.
- (2014) Gradient hard thresholding pursuit for sparsity-constrained optimization. In ICML, pp. 127–135.
- (2011) Adaptive forward-backward greedy algorithm for learning sparse representations. IEEE Trans. Inform. Theory 57 (7), pp. 4689–4708.
- (2019) Direct acceleration of SAGA using sampled negative momentum. In AISTATS, pp. 1602–1610.
- (2018) A simple stochastic variance reduced algorithm with fast convergence rates. In ICML, pp. 5975–5984.
- (2018) Efficient stochastic gradient hard thresholding. In NIPS, pp. 1988–1997.