In real-world sequential prediction scenarios, the features (or attributes) of examples are typically high-dimensional, and constructing all features for each example may be expensive or impossible. One example of such a scenario arises in the medical diagnosis of a disease, where each attribute is the result of a medical test on a patient (Cesa-Bianchi et al., 2011). In this scenario, observing all features of each patient may be impossible, because conducting every medical test on each patient is undesirable due to the physical and mental burden it imposes.
In limited attribute observation settings (Ben-David and Dichterman, 1993; Cesa-Bianchi et al., 2011), learners are only allowed to observe a given number of attributes per example at training time. Hence learners must update their predictors based on actively chosen attributes, which may differ from example to example.
Several methods have been proposed to deal with this setting in linear regression problems. Cesa-Bianchi et al. (2011) have proposed a generalized stochastic gradient descent algorithm (Zinkevich, 2003; Duchi and Singer, 2009; Shalev-Shwartz et al., 2011) based on the ideas of picking observed attributes randomly and constructing a noisy version of all attributes from them. Hazan and Koren (2012) have proposed an algorithm combining a stochastic variant of the EG algorithm (Kivinen and Warmuth, 1997) with the idea of Cesa-Bianchi et al. (2011), which improves the dependence on the problem dimension of the convergence rate proven in (Cesa-Bianchi et al., 2011).
In these works, the limited attribute observation setting has been considered only at training time. However, it is natural to assume that the number of observable attributes at prediction time is the same as at training time. This assumption naturally requires the output predictors to be sparse.
Despite the importance of this sparsity requirement, a hardness result is known in this setting. Foster et al. (2016) have considered online (agnostic) sparse linear regression in the limited attribute observation setting. They have shown that no algorithm running in polynomial time per example can achieve any sub-linear regret unless NP ⊆ BPP. It has also been shown that this hardness result extends to the stochastic i.i.d. (non-agnostic) setting (Ito et al., 2017). These hardness results suggest that some additional assumptions are needed.
More recently, Kale and Karnin (Kale et al., 2017) have proposed an algorithm based on the Dantzig Selector (Candes et al., 2007), which runs in polynomial time per example and achieves sub-linear regret under the restricted isometry condition (Bühlmann and Van De Geer, 2011), which is well known in the sparse recovery literature. Particularly in non-agnostic settings, the proposed algorithm achieves a sub-linear sample complexity (up to extra log-factors), but the rate has a bad dependence on the problem dimension. Additionally, this algorithm requires a large memory cost, since it needs to store all the observed samples for the applications of the Dantzig Selector to the updated design matrices. Independently, Ito et al. (2017) have also proposed three algorithms with efficient run-time, based on regularized dual averaging (Xiao, 2010) combined with their proposed exploration-exploitation strategies, for non-agnostic settings under linear independence of features or compatibility (Bühlmann and Van De Geer, 2011). One of the three algorithms achieves, under linear independence of features, a sample complexity that is worse than the one in (Kale et al., 2017) but has a better dependence on the problem dimension. The other two algorithms also achieve a sub-linear sample complexity, but with an additional term that, though independent of the desired accuracy, has an unacceptable dependence on the problem dimension.
As mentioned above, several algorithms with efficient run-time exist that solve the sparse linear regression problem with limited attribute observations under suitable conditions. However, the convergence rates of these algorithms have a bad dependence on the problem dimension or on the desired accuracy. Whether more efficient algorithms exist is an important and interesting question.
In this paper, we focus on stochastic i.i.d. (non-agnostic) sparse linear regression in the limited attribute observation setting and propose new sample efficient algorithms for it. The main features of the proposed algorithms are summarized as follows:
Additionally, our algorithms possess run-time and memory efficiency, since the average run-time cost per example and the memory cost of the proposed algorithms are of the order of the number of observed attributes per example and of the problem dimension, respectively, which is better than or comparable to existing methods.
We list comparisons of our methods with several preceding methods in our setting in Table 1.
| Method | Assumption | Guarantee |
| --- | --- | --- |
| Dantzig (Kale et al., 2017) | | |
| RDA1 (Ito et al., 2017) | | |
| RDA2 (Ito et al., 2017) | | |
| RDA3 (Ito et al., 2017) | compatibility | Regret |
2 Notation and Problem Setting
In this section, we formally describe the problem to be considered in this paper and the assumptions for our theory.
We use the following notation in this paper.
‖·‖ denotes the Euclidean norm: ‖x‖ = (∑ᵢ xᵢ²)^{1/2}.
For a natural number m, [m] denotes the set {1, …, m}. If , is abbreviated as .
H_k(·) denotes the projection onto k-sparse vectors, i.e., H_k(x) ∈ argmin_{‖z‖₀ ≤ k} ‖z − x‖ for x ∈ ℝᵈ, where ‖z‖₀ denotes the number of non-zero elements of z.
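As a concrete illustration, here is a minimal NumPy sketch of this projection; the function name `hard_threshold` and the use of `argpartition` for top-k selection are our choices, not the paper's.

```python
import numpy as np

def hard_threshold(w, k):
    """Project w onto the set of k-sparse vectors by keeping the
    k entries of largest magnitude and zeroing out the rest."""
    if k >= w.size:
        return w.copy()
    out = np.zeros_like(w)
    # indices of the k largest entries in absolute value
    idx = np.argpartition(np.abs(w), -k)[-k:]
    out[idx] = w[idx]
    return out
```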
For x ∈ ℝᵈ and i ∈ [d], xᵢ denotes the i-th element of x. For S ⊆ [d], we use x_S to denote the restriction of x to S: (x_S)ᵢ = xᵢ for i ∈ S and (x_S)ᵢ = 0 for i ∉ S.
2.2 Problem definition
In this paper, we consider the following sparse linear regression model:
y = ⟨x, w*⟩ + ε,   (1)
where the true parameter w* is sparse and ε is a mean-zero sub-Gaussian random variable with parameter σ, independent of x. We denote by D the joint distribution of x and y.
To find the true parameter of model (1), we focus on the following optimization problem:
min_{w : ‖w‖₀ ≤ k'} L(w),  where L(w) = E_{(x,y)∼D}[(⟨x, w⟩ − y)² / 2],   (2)
the loss is the standard squared loss, and k' is some integer. We can easily see that the true parameter w* is an optimal solution of problem (2).
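The model and objective above can be simulated as follows; this is an illustrative sketch in which the sampler, the Gaussian noise level, and all function names are our own choices rather than the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_batch(w_star, n, noise_std=0.1):
    """Draw n i.i.d. examples from the linear model y = <x, w*> + eps
    with standard normal features and Gaussian (hence sub-Gaussian) noise."""
    d = w_star.size
    X = rng.standard_normal((n, d))
    y = X @ w_star + noise_std * rng.standard_normal(n)
    return X, y

def squared_loss(w, X, y):
    """Empirical average of the squared loss (1/2)(<x, w> - y)^2."""
    r = X @ w - y
    return 0.5 * np.mean(r ** 2)
```

The empirical loss is minimized near the sparse true parameter, consistent with problem (2).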
Limited attribute observation
We assume that, at both training and prediction time, only a small subset of the attributes, which we actively choose per example, can be observed rather than all attributes. In this paper, we aim to construct algorithms that solve problem (2) while observing only k attributes per example. Typically, the situation k ≪ d, where d is the problem dimension, is considered.
We make the following assumptions for our analysis.
Assumption 1 (Boundedness of data).
The data (x, y) are bounded with probability one.
Assumption 2 (Restricted smoothness of ).
The objective function satisfies the following restricted smoothness condition:
Assumption 3 (Restricted strong convexity of ).
The objective function satisfies the following restricted strong convexity condition:
3 Approach and Algorithm Description
In this section, we illustrate our main ideas and describe the proposed algorithms in detail.
3.1 Exploration algorithm
One of the difficulties in partial information settings is that the standard stochastic gradient is no longer available. In linear regression settings, the gradient we want to estimate is given by ∇L(w) = E[x(⟨x, w⟩ − y)] = E[xx^⊤]w − E[yx]. In general, we need to construct unbiased estimators of xx^⊤ and yx. A standard technique is to observe attributes of x at random and rescale them: if coordinate i is observed with probability pᵢ, setting the observed coordinates to xᵢ/pᵢ and the unobserved ones to zero yields an unbiased estimator x̂ of x. Then y x̂ is an unbiased estimator of yx. Similarly, an unbiased estimator of xx^⊤ is given by x̂ x̂^⊤ with adequate element-wise scaling. Note that the latter estimator in particular has a quite large variance, because the probability that the (i, j) entry of x̂ x̂^⊤ becomes non-zero is pᵢpⱼ for i ≠ j, which is very small when only a few attributes can be observed.
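A minimal sketch of the rescaling trick for estimating x unbiasedly from a few observed attributes; uniform sampling without replacement and all names are our illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_x(x, k):
    """Unbiased estimator of the full feature vector x built from k
    attributes sampled uniformly without replacement: each observed
    coordinate is rescaled by d/k, so the estimator's expectation is x."""
    d = x.size
    obs = rng.choice(d, size=k, replace=False)  # indices we pay to observe
    x_hat = np.zeros(d)
    x_hat[obs] = x[obs] * (d / k)
    return x_hat
```

Averaging many independent estimates recovers x, but a single estimate can be far from it, which is the variance problem discussed above.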
If the updated solution w is sparse, computing ⟨x, w⟩ requires observing only the attributes of x that correspond to the support of w, and there is no need to estimate xx^⊤, which has a potentially large variance. However, this idea does not apply to existing methods, because they do not ensure the sparsity of the updated solutions at training time and generate sparse output solutions only at prediction time by using the hard thresholding operator.
Iterative application of hard thresholding to the updated solutions at training time ensures their sparsity and enables an efficient construction of unbiased gradient estimators. Moreover, thanks to the sparsity of the updated solutions, we can fully utilize the restricted smoothness and restricted strong convexity of the objective (Assumptions 2 and 3) if the optimal solution of the objective is sufficiently sparse.
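The support-based gradient estimator just described can be sketched as follows. The helper name, the uniform sampling of extra coordinates, and the scaling scheme are our illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_grad_estimate(x, y, w, k_extra):
    """Unbiased estimate of the gradient x * (<x, w> - y) that observes
    only supp(w) plus k_extra uniformly sampled extra coordinates of x.
    Since w is sparse, the residual <x, w> - y is computed exactly from
    the observed support; off-support entries are importance-weighted."""
    d = x.size
    supp = np.flatnonzero(w)
    resid = x[supp] @ w[supp] - y          # exact: only supp(w) observed
    g = np.zeros(d)
    g[supp] = x[supp] * resid              # support entries are exact
    rest = np.setdiff1d(np.arange(d), supp)
    m = min(k_extra, rest.size)
    if m > 0:
        obs = rng.choice(rest, size=m, replace=False)
        g[obs] = x[obs] * resid * (rest.size / m)  # rescale for unbiasedness
    return g
```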
Now we present our proposed estimator. Motivated by the above discussion, we adopt the iterative usage of hard thresholding at training time. Thanks to the hard thresholding operator, which projects dense vectors to sparse ones, the updated solutions are guaranteed to be sparse with support size at most the number of observable attributes per example. Hence we can efficiently estimate the gradient with adequate scaling. As described above, computing ⟨x, w⟩ can be executed efficiently and only requires observing the attributes of x on the support of w. Thus, a naive algorithm based on this idea is as follows:
w_{t+1} = H(w_t − η_t ĝ_t) for t = 0, 1, 2, …, where H denotes the hard thresholding operator and ĝ_t the gradient estimator. Unfortunately, this algorithm has no theoretical guarantee due to the use of hard thresholding. Generally, stochastic gradient methods need to decrease the learning rate over time to reduce the noise caused by the randomness in the construction of the gradient estimators. Then a large number of stochastic gradients with small step sizes accumulate to produce proper updates of the solution. However, the hard thresholding operator erases the accumulated effect outside the support of the current solution at every update, and thus the convergence of the above algorithm is not ensured if a decreasing learning rate is used. To overcome this problem, we adopt the standard mini-batch strategy, which reduces the variance of the gradient estimator without decreasing the learning rate.
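A full-information simplification of this mini-batch hard-thresholding update (constant step size, fresh mini-batch per iteration) can be sketched as below; this is an illustrative sketch, not the paper's Algorithm 1, and all names and constants are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def hard_threshold(w, k):
    """Keep the k largest-magnitude entries of w, zero out the rest."""
    out = np.zeros_like(w)
    idx = np.argpartition(np.abs(w), -k)[-k:]
    out[idx] = w[idx]
    return out

def minibatch_iht(sample_batch, d, k, eta=0.1, batch=256, iters=200):
    """Mini-batch stochastic gradient descent with hard thresholding:
    w <- H_k(w - eta * g), where g averages per-example gradients of the
    squared loss over a fresh mini-batch, using a constant step size."""
    w = np.zeros(d)
    for _ in range(iters):
        X, y = sample_batch(batch)
        g = X.T @ (X @ w - y) / batch  # averaged gradient of (1/2)(<x,w>-y)^2
        w = hard_threshold(w - eta * g, k)
    return w
```

With a sufficiently large mini-batch, the variance of g is small enough that the constant step size does not prevent convergence, which is the point of the strategy described above.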
We provide the concrete procedure based on the above ideas in Algorithm 1. We sample a mini-batch of examples per update. For each example, the support of the current solution and some deterministically selected attributes are observed. To construct the unbiased gradient estimator, we average unbiased per-example estimators, each of which is the concatenation of block-wise unbiased gradient estimators. Note that a constant step size is adopted. We call Algorithm 1 Exploration, since each coordinate is treated equally in the construction of the gradient estimator.
3.2 Refinement of Algorithm 1 using exploitation and its adaptation
As we will state in Theorem 4.1 of Section 4, Exploration (Algorithm 1) achieves a linear convergence when an adequate learning rate, support size, and mini-batch sizes are chosen. Using this fact, we can show that Algorithm 1 identifies the optimal support in finitely many iterations with high probability. Once we find the optimal support, it is much more efficient to optimize the parameter on it rather than globally. We call this algorithm Exploitation and describe the details in Algorithm 2. Ideally, it is desirable to first run Exploration (Algorithm 1) and, once the optimal support is found, switch from Exploration to Exploitation (Algorithm 2). However, whether the optimal support has been found cannot be checked in practice, and the theoretical number of updates needed to find it depends on the smallest magnitude of the non-zero components of the optimal solution, which is unknown. Therefore, we need to construct an algorithm that combines Exploration and Exploitation and is adaptive to this unknown value. We give this adaptive algorithm in Algorithm 3, which alternately uses Exploration and Exploitation. We can show that Algorithm 3 achieves at least the same convergence rate as Exploration, and, thanks to the usage of Exploitation, its rate can be significantly boosted when the smallest magnitude of the non-zero components of the optimal solution is not too small. We call this algorithm Hybrid.
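The alternation just described can be sketched schematically as follows; here `explore` and `exploit` stand in for Algorithms 1 and 2, and the whole skeleton is our illustration rather than the paper's Algorithm 3.

```python
import numpy as np

def hybrid(explore, exploit, w0, rounds):
    """Alternate an Exploration stage (global, hard-thresholded updates)
    with an Exploitation stage (updates restricted to the support the
    exploration stage produced), in the spirit of the Hybrid scheme."""
    w = w0
    for _ in range(rounds):
        w = explore(w)               # global update; output is sparse
        support = np.flatnonzero(w)  # candidate optimal support
        w = exploit(w, support)      # refine only on these coordinates
    return w
```

The key design choice is that exploration and exploitation happen in separate stages rather than being mixed within a single update, so the low-variance exploitation steps are not drowned out by exploration noise.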
4 Convergence Analysis
In this section, we provide the convergence analysis of our proposed algorithms. We use the notation Õ(·) to hide extra log-factors and thus simplify the statements. Here, the log-factors involve a confidence parameter δ used in the statements.
4.1 Analysis of Algorithm 1
The following theorem implies that Algorithm 1 with sufficiently large mini-batch sizes achieves a linear convergence.
Theorem 4.1 (Exploration).
Let and . For Algorithm 1, if we adequately choose the learning rate, support size, and mini-batch sizes, then for any and there exists such that
From Theorem 4.1, we obtain the following corollary, which gives a sample complexity of the algorithm.
Corollary 4.2 (Exploration).
If we set and assume that , and are , Corollary 4.2 gives the sample complexity of .
The estimator is guaranteed to be asymptotically consistent: using the restricted strong convexity of the objective, it can easily be seen that the output converges to the true parameter as the number of observed examples grows, and its convergence rate is nearly the same as that of the objective gap.
4.2 Analysis of Algorithm 2
In general, Algorithm 2 does not ensure convergence. However, the following theorem shows that running Algorithm 2 with sufficiently large batch sizes will not increase the objective values too much. Moreover, if the support of the optimal solution is included in that of an initial point, then Algorithm 2 also achieves a linear convergence.
Theorem 4.3 (Exploitation).
Let , and . For Algorithm 2, if we adequately choose and , then for any and , there exists such that
4.3 Analysis of Algorithm 3
Combining Theorem 4.1 and Theorem 4.3, we obtain the following theorem and corollary. These imply that, using adequate numbers of inner loops and mini-batch sizes for Algorithm 1 and Algorithm 2 respectively, Algorithm 3 is guaranteed to achieve at least the same sample complexity as Algorithm 1. Furthermore, if the smallest magnitude of the non-zero components of the optimal solution is not too small, its sample complexity can be much reduced.
Theorem 4.4 (Hybrid).
We denote and . Let and . If we adequately choose , and , for any and , Algorithm 3 with , and adequate , and satisfies
Corollary 4.5 (Hybrid).
From Corollary 4.5, if the smallest magnitude of the non-zero components of the optimal solution is not too small, the sample complexity of Hybrid can be much better than that of Exploration alone. In particular, if we assume that the relevant problem parameters are of constant order, Algorithm 3 achieves a sample complexity that is asymptotically near the minimax optimal sample complexity of full information algorithms (Raskutti et al., 2011), even in partial information settings. In this case, the complexity is significantly smaller than that of Algorithm 1.
5 Relation to Existing Work
In this section, we describe the relation between our methods and the most relevant existing methods. The methods of (Cesa-Bianchi et al., 2011) and (Hazan and Koren, 2012) solve stochastic linear regression with limited attribute observation, but the limited information setting is assumed only at training time and not at prediction time, which differs from ours. Also, their theoretical sample complexities are worse than ours. The method of (Kale et al., 2017) solves sparse linear regression with limited information based on the Dantzig Selector. It has been shown that the method achieves sub-linear regret in both agnostic (online) and non-agnostic (stochastic) settings under an online variant of the restricted isometry condition. The convergence rate in non-agnostic cases is much worse than ours in terms of the dependence on the problem dimension, but the method is highly versatile, since it has theoretical guarantees also in agnostic settings, on which our work has not focused. The methods of (Ito et al., 2017) are based on regularized dual averaging with their exploration-exploitation strategies and achieve sample complexities, under linear independence of features or compatibility, that are worse than ours. Also, the rate of Algorithm 1 in (Ito et al., 2017) has a worse dependence on the dimension than ours. Additionally, the theoretical analysis of that method assumes linear independence of features, which is much stronger than the restricted isometry condition or our restricted smoothness and strong convexity conditions. The rates of Algorithms 2 and 3 in (Ito et al., 2017) have an additional term with a quite terrible dependence on the problem dimension, though it is independent of the desired accuracy. Their exploration-exploitation idea is different from ours. Roughly speaking, these methods observe attributes corresponding to the coordinates of the updated solution with large magnitude, together with attributes chosen uniformly at random.
This means that exploration and exploitation are combined in single updates. In contrast, our proposed Hybrid updates the predictor by alternating Exploration and Exploitation. This is a big difference: if their scheme is adopted, the variance of the gradient estimator on the coordinates of the updated solution with large magnitude becomes small, but this variance reduction effect is buried in the large noise from the other coordinates, which makes efficient exploitation impossible. In (Jain et al., 2014) and (Nguyen et al., 2017), (stochastic) gradient iterative hard thresholding methods for solving empirical risk minimization with sparsity constraints in full information settings have been proposed. Our Exploration algorithm can be regarded as a generalization of these methods to limited information settings.
6 Numerical Experiments
In this section, we provide numerical experiments that demonstrate the performance of the proposed algorithms on synthetic and real data.
We compare our proposed Exploration and Hybrid with the state-of-the-art Dantzig (Kale et al., 2017) and RDA (Algorithm 1 in (Ito et al., 2017)) in our limited attribute observation setting on a synthetic and a real dataset. Of the three algorithms proposed in (Ito et al., 2017) (Algorithms 1, 2 and 3), we did not implement the latter two, because their theoretical sample complexity makes no sense unless the problem dimension is quite small, due to the additional term in it. We randomly split each dataset into a training (90%) and a test (10%) set, trained each algorithm on the training set, and evaluated the mean squared error on the test set. We independently repeated the experiments 5 times and averaged the mean squared errors. For each algorithm, we appropriately tuned the hyper-parameters and selected the ones with the lowest mean squared error.
Here we compare the performances on synthetic data. We generated samples with dimension d. Each feature was generated from an i.i.d. standard normal distribution. The optimal predictor was constructed as follows: for , for and for the other . The optimal predictor has only a few non-zero components and is thus sparse. The output was generated as y = ⟨x, w*⟩ + ε, where the noise ε was generated from an i.i.d. standard normal distribution. We set the number of observed attributes per example to k. Figure 2
shows the averaged mean squared error as a function of the number of observed samples. The error bars depict two standard deviations of the measurements. Our proposed Hybrid and Exploration outperformed the other two methods. RDA initially performed well, but its convergence slowed down. Dantzig showed worse performance than all the other methods. Hybrid performed better than Exploration and showed rapid convergence.
Finally, we show the experimental results on the real dataset CT-slice, which is publicly available at https://archive.ics.uci.edu/ml/datasets/Relative+location+of+CT+slices+on+axial+axis. The CT-slice dataset consists of CT images described by numeric features. The target variable of each image denotes the relative location of the image on the axial axis. We set the number of observable attributes per example to k. In Figure 2, the mean squared error is depicted against the number of observed examples. The error bars show two standard deviations of the measurements. Again, our proposed methods surpass the existing methods. In particular, the convergence of Hybrid was significantly fast and stable. On this dataset, Dantzig showed good convergence, comparable to our Exploration. The convergence of RDA was quite slow and somewhat unstable.
We presented sample efficient algorithms for the stochastic sparse linear regression problem with limited attribute observation. We developed the Exploration algorithm based on an efficient construction of an unbiased gradient estimator, taking advantage of the iterative usage of hard thresholding in the updates of the predictors. We then refined Exploration by adaptively combining it with Exploitation and proposed the Hybrid algorithm. We have shown that Exploration and Hybrid achieve sample complexities with much better dependence on the problem dimension than those in existing work. In particular, if the smallest magnitude of the non-zero components of the optimal solution is not too small, the rate of Hybrid can be boosted to near the minimax optimal sample complexity of full information algorithms. In numerical experiments, our methods showed superior convergence behavior compared to preceding methods on synthetic and real data sets.
TS was partially supported by MEXT Kakenhi (25730013, 25120012, 26280009, 15H05707 and 18H03201), Japan Digital Design, and JST-CREST.
- Ben-David and Dichterman (1993) S. Ben-David and E. Dichterman. Learning with restricted focus of attention. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pages 287–296. ACM, 1993.
- Bühlmann and Van De Geer (2011) P. Bühlmann and S. Van De Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media, 2011.
- Candes et al. (2007) E. Candes, T. Tao, et al. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313–2351, 2007.
- Cesa-Bianchi et al. (2011) N. Cesa-Bianchi, S. Shalev-Shwartz, and O. Shamir. Efficient learning with partially observed attributes. Journal of Machine Learning Research, 12(Oct):2857–2878, 2011.
- Duchi and Singer (2009) J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10(Dec):2899–2934, 2009.
- Foster et al. (2016) D. Foster, S. Kale, and H. Karloff. Online sparse linear regression. In Conference on Learning Theory, pages 960–970, 2016.
- Hazan and Koren (2012) E. Hazan and T. Koren. Linear regression with limited observation. arXiv preprint arXiv:1206.4678, 2012.
- Ito et al. (2017) S. Ito, D. Hatano, H. Sumita, A. Yabe, T. Fukunaga, N. Kakimura, and K.-I. Kawarabayashi. Efficient sublinear-regret algorithms for online sparse linear regression with limited observation. In Advances in Neural Information Processing Systems, pages 4102–4111, 2017.
- Jain et al. (2014) P. Jain, A. Tewari, and P. Kar. On iterative hard thresholding methods for high-dimensional m-estimation. In Advances in Neural Information Processing Systems, pages 685–693, 2014.
- Kale et al. (2017) S. Kale, Z. Karnin, T. Liang, and D. Pál. Adaptive feature selection: Computationally efficient online sparse linear regression under RIP. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1780–1788, 2017.
- Kivinen and Warmuth (1997) J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.
- Nguyen et al. (2017) N. Nguyen, D. Needell, and T. Woolf. Linear convergence of stochastic iterative greedy algorithms with sparse constraints. IEEE Transactions on Information Theory, 63(11):6869–6895, 2017.
- Raskutti et al. (2011) G. Raskutti, M. J. Wainwright, and B. Yu. Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Transactions on Information Theory, 57(10):6976–6994, 2011.
- Shalev-Shwartz et al. (2011) S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for svm. Mathematical programming, 127(1):3–30, 2011.
- Xiao (2010) L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11(Oct):2543–2596, 2010.
- Zinkevich (2003) M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928–936, 2003.
Appendix A Analysis of Algorithm 1
A.1 Proof of Theorem 4.1
The proof is essentially a generalization of the one in (Jain et al., 2014) to stochastic and partial information settings.
Proposition A.1 (Exploration).
Suppose that , and . Then for any , Algorithm 1 satisfies
with probability , where and .
We denote , and as , and respectively. Also we define .
Since and are -sparse, restricted smoothness of implies
Using Cauchy-Schwarz inequality and Young’s inequality, we have
Here the second inequality follows from the fact .
Thus we get
Also we have
Here, the second inequality is due to the Cauchy-Schwarz inequality. The third inequality follows from the fact that and the definition of the hard thresholding operator. The third equality is by the fact that .
If is assumed, combining (A.1) with this fact yields