Sample Efficient Stochastic Gradient Iterative Hard Thresholding Method for Stochastic Sparse Linear Regression with Limited Attribute Observation

We develop new stochastic gradient methods for efficiently solving sparse linear regression in a partial attribute observation setting, where learners are only allowed to observe a fixed number of actively chosen attributes per example at training and prediction time. We show that the methods achieve essentially a sample complexity of O(1/ε) to attain an error of ε under a variant of the restricted eigenvalue condition, and that this rate has better dependency on the problem dimension than existing methods. In particular, if the smallest magnitude of the non-zero components of the optimal solution is not too small, the rate of our proposed Hybrid algorithm can be boosted to near the minimax optimal sample complexity of full information algorithms. The core ideas are (i) the efficient construction of an unbiased gradient estimator by iterative use of the hard thresholding operator, which yields an exploration algorithm; and (ii) an adaptive combination of this exploration algorithm with an exploitation algorithm, which quickly identifies the support of the optimum and then efficiently searches for the optimal parameter within it. Experimental results are presented to validate our theoretical findings and the superiority of our proposed methods.


1 Introduction

In real-world sequential prediction scenarios, the features (or attributes) of examples are typically high-dimensional, and obtaining all features for each example may be expensive or impossible. One example of such a scenario arises in the medical diagnosis of a disease, where each attribute is the result of a medical test on a patient (Cesa-Bianchi et al., 2011). In this scenario, observing all features for each patient may be impossible, because conducting every medical test on each patient is undesirable due to the physical and mental burden involved.

In limited attribute observation settings (Ben-David and Dichterman, 1993; Cesa-Bianchi et al., 2011) , learners are only allowed to observe a given number of attributes per example at training time. Hence learners need to update their predictor based on the actively chosen attributes which possibly differ from example to example.

Several methods have been proposed to deal with this setting in linear regression problems. Cesa-Bianchi et al. (Cesa-Bianchi et al., 2011) have proposed a generalized stochastic gradient descent algorithm (Zinkevich, 2003; Duchi and Singer, 2009; Shalev-Shwartz et al., 2011) based on the ideas of picking the observed attributes randomly and constructing a noisy version of all attributes from them. Hazan and Koren (Hazan and Koren, 2012) have proposed an algorithm combining a stochastic variant of the EG algorithm (Kivinen and Warmuth, 1997) with the idea in (Cesa-Bianchi et al., 2011), which improves the dependence of the convergence rate on the problem dimension compared with (Cesa-Bianchi et al., 2011).

In these works, the limited attribute observation setting has been considered only at training time. However, it is natural to assume that the number of observable attributes at prediction time is the same as that at training time. This assumption naturally requires the sparsity of the output predictors.

Despite the importance of this sparsity requirement on predictors, a hardness result is known in this setting. Foster et al. (Foster et al., 2016) have considered online (agnostic) sparse linear regression in the limited attribute observation setting. They have shown that no algorithm running in polynomial time per example can achieve any sub-linear regret unless $\mathrm{NP} \subseteq \mathrm{BPP}$. It has also been shown in (Ito et al., 2017) that this hardness result holds in stochastic i.i.d. (non-agnostic) settings. These hardness results suggest that some additional assumptions are needed.

More recently, Kale and Karnin (Kale et al., 2017) have proposed an algorithm based on Dantzig Selector (Candes et al., 2007), which runs in polynomial time per example and achieves sub-linear regret under the restricted isometry condition (Bühlmann and Van De Geer, 2011), a condition well known in the sparse recovery literature. In particular, in non-agnostic settings the proposed algorithm attains a sample complexity guarantee (here and in the following, $\tilde O$ hides extra log-factors), but the rate has bad dependency on the problem dimension. Additionally, this algorithm requires a large memory cost, since it needs to store all observed samples due to repeated applications of Dantzig Selector to the updated design matrices. Independently, Ito et al. (Ito et al., 2017) have also proposed three runtime-efficient algorithms based on regularized dual averaging (Xiao, 2010) combined with their proposed exploration-exploitation strategies, for non-agnostic settings under linear independence of features or compatibility (Bühlmann and Van De Geer, 2011). One of the three algorithms achieves, under linear independence of features, a sample complexity that is worse than the one in (Kale et al., 2017) but has better dependency on the problem dimension. The other two algorithms achieve a comparable sample complexity, but with an additional accuracy-independent term that has unacceptable dependency on the problem dimension.

As mentioned above, there exist several runtime-efficient algorithms that solve the sparse linear regression problem with limited attribute observation under suitable conditions. However, the convergence rates of these algorithms have bad dependency on the problem dimension or on the desired accuracy. Whether more efficient algorithms exist is an important and interesting question.

Main contribution

In this paper, we focus on stochastic i.i.d. (non-agnostic) sparse linear regression in the limited attribute observation setting and propose new sample efficient algorithms for it. The main features of the proposed algorithms are summarized as follows:

  • Our algorithms achieve a sample complexity of O(1/ε) with much better dependency on the problem dimension than the ones in existing work. In particular, if the smallest magnitude of the non-zero components of the optimal solution is not too small, the rate can be boosted to near the minimax optimal sample complexity of full information algorithms.

  • Additionally, our algorithms are run-time and memory efficient: the average run-time cost per example and the memory cost of the proposed algorithms are of the order of the number of observed attributes per example and of the problem dimension, respectively, which is better than or comparable to existing methods.

We list the comparisons of our methods with several preceding methods in our setting in Table 1.

| Method | Additional assumptions | Objective type |
|---|---|---|
| Dantzig (Kale et al., 2017) | restricted isometry condition | Regret |
| RDA1 (Ito et al., 2017) | linear independence of features | Regret |
| RDA2 (Ito et al., 2017) | linear independence of features | Regret |
| RDA3 (Ito et al., 2017) | compatibility | Regret |
| Exploration (this work) | restricted smoothness & restricted strong convexity | Expected risk |
| Hybrid (this work) | restricted smoothness & restricted strong convexity | Expected risk |

Table 1: Comparison of our methods with existing methods in our problem setting. Sample complexity means the necessary number of samples to attain an error $\epsilon$. "# of observed attrs per ex." means the minimum number of observed attributes per example at training time that an algorithm requires. The quantities appearing in the bounds are the number of observed attributes per example $k$, the size of the support of the optimal solution $s$, the problem dimension $d$, the desired accuracy $\epsilon$, and the smallest magnitude of the non-zero components of the optimal solution. We regard the smoothness and strong convexity parameters of the objectives derived from the additional assumptions, as well as the boundedness parameter of the input data distribution, as constants. $\tilde O$ hides extra log-factors to simplify the notation.
Note that the necessary number of observed attributes per example at prediction time is nearly the same as for the other algorithms in Table 1.

2 Notation and Problem Setting

In this section, we formally describe the problem to be considered in this paper and the assumptions for our theory.

2.1 Notation

We use the following notation in this paper.

  • $\|\cdot\|$ denotes the Euclidean norm: $\|x\| = \big(\sum_j x_j^2\big)^{1/2}$.

  • For a natural number $n$, $[n]$ denotes the set $\{1, 2, \dots, n\}$.

  • $H_k$ denotes the projection onto $k$-sparse vectors, i.e., $H_k(x) = \mathrm{argmin}_{\|y\|_0 \le k} \|y - x\|$ for $x \in \mathbb{R}^d$, where $\|x\|_0$ denotes the number of non-zero elements of $x$ (a code sketch of this operator is given right after this list).

  • For $j \in [d]$, $x_j$ denotes the $j$-th element of $x \in \mathbb{R}^d$. For $S \subseteq [d]$, we use $x_S$ to denote the restriction of $x$ to $S$: $(x_S)_j = x_j$ for $j \in S$ and $(x_S)_j = 0$ otherwise.
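To make the operator concrete, here is a minimal NumPy sketch of $H_k$ (the function name and the NumPy-based implementation are ours, for illustration only):

```python
import numpy as np

def hard_threshold(x: np.ndarray, k: int) -> np.ndarray:
    """H_k(x): keep the k largest-magnitude entries of x and zero out the rest."""
    if k >= x.size:
        return x.copy()
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]  # indices of the k largest |x_j|
    out[idx] = x[idx]
    return out
```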

2.2 Problem definition

In this paper, we consider the following sparse linear regression model:

$$ y = \langle x, \beta_* \rangle + \xi, \qquad x \in \mathbb{R}^d, \tag{1} $$

where $\beta_* \in \mathbb{R}^d$ is a sparse true parameter and $\xi$ is a mean-zero sub-Gaussian random variable with parameter $\sigma$, which is independent of $x$. We denote by $D$ the joint distribution of $x$ and $y$.

For finding the true parameter of model (1), we focus on the following optimization problem:

$$ \min_{\beta \in \mathbb{R}^d:\ \|\beta\|_0 \le s} \; L(\beta) := \mathbb{E}_{(x, y) \sim D}\big[\ell(\beta; x, y)\big], \tag{2} $$

where $\ell(\beta; x, y) = \frac{1}{2}\left(\langle x, \beta \rangle - y\right)^2$ is the standard squared loss and $s$ is some integer with $\|\beta_*\|_0 \le s$. We can easily see that the true parameter $\beta_*$ is an optimal solution of problem (2).
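As a concrete reference for this setup, the following sketch draws examples from model (1) and evaluates the squared loss and its gradient. The symbol names match the formulas above; the Gaussian design is purely illustrative (the analysis itself only assumes bounded inputs, cf. Assumption 1 below).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_example(beta_star: np.ndarray, sigma: float):
    """Draw one (x, y) pair from y = <x, beta_*> + xi with mean-zero noise xi."""
    d = beta_star.size
    x = rng.standard_normal(d)          # illustrative design; Assumption 1 requires bounded x
    xi = sigma * rng.standard_normal()  # sub-Gaussian noise with parameter sigma
    return x, float(x @ beta_star) + xi

def squared_loss(beta: np.ndarray, x: np.ndarray, y: float) -> float:
    """l(beta; x, y) = (1/2) (<x, beta> - y)^2."""
    return 0.5 * (float(x @ beta) - y) ** 2

def squared_loss_grad(beta: np.ndarray, x: np.ndarray, y: float) -> np.ndarray:
    """Full-information gradient (<x, beta> - y) x, which Section 3 estimates from partial observations."""
    return (float(x @ beta) - y) * x
```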

Limited attribute observation

We assume that, at both training and prediction time, only a small subset of the attributes, actively chosen per example, can be observed rather than all attributes. In this paper, we aim to construct algorithms that solve problem (2) while observing only $k$ attributes per example. Typically, the situation $k \ll d$ is considered.

2.3 Assumptions

We make the following assumptions for our analysis.

Assumption 1 (Boundedness of data).

For $(x, y) \sim D$, the input vector $x$ is bounded with probability one.

Assumption 2 (Restricted smoothness of $L$).

The objective function $L$ satisfies the following restricted smoothness condition at a sparsity level $\tilde s$: there exists $\rho^+ > 0$ such that
$$ L(\beta') \le L(\beta) + \langle \nabla L(\beta), \beta' - \beta \rangle + \frac{\rho^+}{2}\,\|\beta' - \beta\|^2 \quad \text{for all } \beta, \beta' \text{ with } \|\beta - \beta'\|_0 \le \tilde s. $$

Assumption 3 (Restricted strong convexity of $L$).

The objective function $L$ satisfies the following restricted strong convexity condition at the sparsity level $\tilde s$: there exists $\rho^- > 0$ such that
$$ L(\beta') \ge L(\beta) + \langle \nabla L(\beta), \beta' - \beta \rangle + \frac{\rho^-}{2}\,\|\beta' - \beta\|^2 \quad \text{for all } \beta, \beta' \text{ with } \|\beta - \beta'\|_0 \le \tilde s. $$

By the restricted strong convexity of $L$, we can easily see that the true parameter $\beta_*$ of model (1) is the unique optimal solution of optimization problem (2). We denote the condition number by $\kappa := \rho^+ / \rho^-$.

Remark.

In linear regression settings, Assumptions 2 and 3 amount to upper and lower restricted eigenvalue-type bounds on the population covariance matrix, respectively. Note that these conditions are stronger than the restricted eigenvalue condition, but weaker than the restricted isometry condition.
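Concretely, since the population objective in (2) is quadratic with Hessian $\Sigma := \mathbb{E}[xx^\top]$, Assumptions 2 and 3 (with the constants $\rho^+$, $\rho^-$ and sparsity level $\tilde s$ used above) read

$$ \rho^-\,\|v\|^2 \;\le\; v^\top \Sigma\, v \;\le\; \rho^+\,\|v\|^2 \qquad \text{for all } v \in \mathbb{R}^d \text{ with } \|v\|_0 \le \tilde s, $$

where the lower bound corresponds to Assumption 3 and the upper bound to Assumption 2.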

3 Approach and Algorithm Description

In this section, we illustrate our main ideas and describe the proposed algorithms in detail.

3.1 Exploration algorithm

One of the difficulties in partial information settings is that the standard stochastic gradient is no longer available. In linear regression settings, the gradient we want to estimate is given by

$$ \nabla \ell(\beta; x, y) = (\langle x, \beta \rangle - y)\, x . $$

In general, we need to construct unbiased estimators of $(xx^\top)\beta$ and $y x$. A standard technique is to use an estimator $\hat x$ of $x$, obtained by observing a few randomly chosen attributes and rescaling them by the inverse observation probabilities, so that $\mathbb{E}[\hat x] = x$. Then $y \hat x$ is an unbiased estimator of $y x$. Similarly, an unbiased estimator of $(xx^\top)\beta$ is given by $\hat x \hat x^\top \beta$ with adequate element-wise scaling. Note that the latter estimator in particular has a quite large variance, because an off-diagonal entry of $\hat x \hat x^\top$ is non-zero only when both of the corresponding attributes are observed, which happens with very small probability when the number of observed attributes is much smaller than $d$.
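As an illustration of this standard construction, the sketch below observes $k$ uniformly chosen attributes and rescales by the inverse observation probabilities. The uniform sampling scheme and the exact scaling factors are the usual textbook choices and are assumptions of this sketch, not necessarily the precise scheme of the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)

def partial_estimators(x: np.ndarray, k: int):
    """Observe k (>= 2) uniformly chosen attributes of x and return unbiased
    estimates of x and of the matrix x x^T via inverse-probability weighting."""
    d = x.size
    S = rng.choice(d, size=k, replace=False)   # actively chosen attributes
    x_obs = np.zeros(d)
    x_obs[S] = x[S]

    x_hat = (d / k) * x_obs                    # P(j observed) = k/d, so E[x_hat] = x

    M_hat = np.outer(x_obs, x_obs)
    M_hat *= d * (d - 1) / (k * (k - 1))       # P(i and j observed) = k(k-1)/(d(d-1)) for i != j
    M_hat[S, S] = (d / k) * x[S] ** 2          # diagonal entries only need P(i observed)
    return x_hat, M_hat
```

An off-diagonal entry of M_hat is non-zero only when both of its attributes happen to be observed, which is exactly the high-variance issue pointed out above.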

If the updated solution $\beta$ is sparse, computing $\langle x, \beta \rangle$ requires observing only the attributes of $x$ that correspond to the support of $\beta$, and there is no need to estimate $(xx^\top)\beta$, which has a potentially large variance. However, this idea does not apply to existing methods, because they do not ensure the sparsity of the updated solutions at training time and generate sparse output solutions only at prediction time by using the hard thresholding operator.

Iteratively applying hard thresholding to the updated solutions at training time ensures their sparsity and enables an efficient construction of unbiased gradient estimators. Moreover, we can fully utilize the restricted smoothness and restricted strong convexity of the objective (Assumptions 2 and 3), owing to the sparsity of the updated solutions, provided the optimal solution of the objective is sufficiently sparse.

Now we present our proposed estimator. Motivated by the above discussion, we adopt iterative use of hard thresholding at training time. Since the hard thresholding operator projects dense vectors onto sparse ones, the updated solutions are guaranteed to be sparse, with support size bounded by the number of observable attributes per example. Hence the term $\langle x, \beta_t \rangle$ can be computed exactly by observing only the attributes of $x$ on the support of $\beta_t$, and the gradient $(\langle x, \beta_t \rangle - y)x$ can be estimated efficiently as $(\langle x, \beta_t \rangle - y)\hat x$ with adequate scaling. Thus a naive algorithm based on this idea becomes

$$ \beta_{t+1} = H_k\big(\beta_t - \eta_t\, g_t\big) \qquad \text{for } t = 0, 1, 2, \dots, $$

where $g_t$ denotes the unbiased gradient estimator just described. Unfortunately, this algorithm has no theoretical guarantee, due to the use of hard thresholding. Generally, stochastic gradient methods need to decrease the learning rate as the iterations proceed (e.g., $\eta_t = O(1/t)$) in order to reduce the noise caused by the randomness in the construction of the gradient estimators; a large number of stochastic gradients with small step sizes then accumulate to produce proper updates of the solution. However, the hard thresholding operator clears this accumulated effect outside the support of the current solution at every update, and thus the convergence of the above algorithm is not ensured when a decreasing learning rate is used. To overcome this problem, we adopt the standard mini-batch strategy, which reduces the variance of the gradient estimator without decreasing the learning rate.
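The sketch below implements one such hard-thresholded update with a mini-batched estimator that reads only the attributes on the support of beta plus a few extra attributes per example. The uniform sampling of the extra attributes and the d/k scaling are illustrative choices of ours (Algorithm 1 below uses deterministically selected blocks instead), and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def hard_threshold(x, k):
    # H_k from Section 2.1: keep the k largest-magnitude entries
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

def grad_estimate(beta, examples, k_extra):
    """Unbiased estimate of E[(<x, beta> - y) x] that observes, per example,
    only the attributes on supp(beta) plus k_extra randomly chosen attributes."""
    d = beta.size
    support = np.flatnonzero(beta)
    g = np.zeros(d)
    for x, y in examples:
        residual = float(x[support] @ beta[support]) - y   # exact: needs supp(beta) only
        S = rng.choice(d, size=k_extra, replace=False)     # extra observed attributes
        x_hat = np.zeros(d)
        x_hat[S] = (d / k_extra) * x[S]                    # E[x_hat] = x
        g += residual * x_hat
    return g / len(examples)

def thresholded_step(beta, examples, k_sparse, k_extra, eta):
    """One update beta <- H_k(beta - eta * g_hat) of the scheme described above."""
    return hard_threshold(beta - eta * grad_estimate(beta, examples, k_extra), k_sparse)
```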

We provide the concrete procedure based on the above ideas in Algorithm 1. We sample a mini-batch of examples per update. For each example, we observe the attributes on the support of the current solution together with deterministically selected additional attributes. To construct the unbiased gradient estimator, we average per-example unbiased estimators, each of which is the concatenation of block-wise unbiased gradient estimators. Note that a constant step size is adopted. We call Algorithm 1 Exploration, since every coordinate is treated equally in the construction of the gradient estimator.

3.2 Refinement of Algorithm 1 using exploitation and its adaptation

As we will state in Theorem 4.1 of Section 4, Exploration (Algorithm 1) achieves linear convergence when an adequate learning rate, support size, and mini-batch sizes are chosen. Using this fact, we can show that Algorithm 1 identifies the optimal support within finitely many iterations with high probability. Once the optimal support has been found, it is much more efficient to optimize the parameter on it rather than globally. We call this algorithm Exploitation and describe it in detail in Algorithm 2. Ideally, it would be desirable to first run Exploration (Algorithm 1) and, once the optimal support is found, switch from Exploration to Exploitation (Algorithm 2). However, whether the optimal support has been found cannot be checked in practice, and the theoretical number of updates needed to find it depends on the smallest magnitude of the non-zero components of the optimal solution, which is unknown. Therefore, we need an algorithm that combines Exploration and Exploitation and is adaptive to this unknown value. We give this adaptive algorithm in Algorithm 3, which alternately runs Exploration and Exploitation. We can show that Algorithm 3 achieves at least the same convergence rate as Exploration and, thanks to the use of Exploitation, its rate can be boosted considerably when the smallest magnitude of the non-zero components of the optimal solution is not too small. We call this algorithm Hybrid.

  Set a partition of the coordinates into blocks, and fix the mini-batch sizes $n_t$ for $t \in [T]$.
  for $t = 1$ to $T$ do
     Set $S_t = \mathrm{supp}(\beta_{t-1})$.
     Sample $(x_{t,i}, y_{t,i}) \sim D$ for $i \in [n_t]$.
     Observe $y_{t,i}$, the attributes of $x_{t,i}$ on $S_t$, and the deterministically selected blocks of the remaining attributes, for $i \in [n_t]$.
     Compute the averaged unbiased gradient estimator $g_t$ from the observed attributes.
     Update $\beta_t = H_k(\beta_{t-1} - \eta\, g_t)$.
  end for
  return $\beta_T$.
Algorithm 1 Exploration($\beta_0$, $k$, $\eta$, $T$, $\{n_t\}$)
  Set $S = \mathrm{supp}(\beta_0)$.
  for $t = 1$ to $T$ do
     Sample $(x_{t,i}, y_{t,i}) \sim D$ for $i \in [n_t]$.
     Observe $y_{t,i}$ and the attributes of $x_{t,i}$ on $S$, for $i \in [n_t]$.
     Compute the averaged gradient estimator $g_t$ on $S$.
     Set the coordinates of $g_t$ outside $S$ to zero.
     Update $\beta_t = \beta_{t-1} - \eta\, g_t$.
  end for
  return $\beta_T$.
Algorithm 2 Exploitation($\beta_0$, $\eta$, $T$, $\{n_t\}$)
  for $r = 1$ to $R$ do
     Update $\beta$ by Exploration($\beta$, $k$, $\eta_1$, $T_1$, $\{n_{1,t}\}$).
     Update $\beta$ by Exploitation($\beta$, $\eta_2$, $T_2$, $\{n_{2,t}\}$).
  end for
  return $\beta$.
Algorithm 3 Hybrid($\beta_0$, $k$, $R$, $\eta_1$, $\eta_2$, $T_1$, $T_2$, $\{n_{1,t}\}$, $\{n_{2,t}\}$)
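To tie the three procedures together, here is a rough Python sketch of their overall structure. It reuses grad_estimate and hard_threshold from the sketch in Section 3.1, and all parameter names as well as the plain restricted-SGD form of the Exploitation step are our simplifications of Algorithms 1-3, not a faithful implementation.

```python
import numpy as np

def exploration(beta, k_sparse, k_extra, eta, T, batch_size, sampler):
    """Algorithm 1 (sketch): T hard-thresholded mini-batch updates."""
    for _ in range(T):
        batch = [sampler() for _ in range(batch_size)]
        g = grad_estimate(beta, batch, k_extra)
        beta = hard_threshold(beta - eta * g, k_sparse)
    return beta

def exploitation(beta, eta, T, batch_size, sampler):
    """Algorithm 2 (sketch): SGD restricted to the current support."""
    support = np.flatnonzero(beta)
    for _ in range(T):
        batch = [sampler() for _ in range(batch_size)]
        g = np.zeros_like(beta)
        for x, y in batch:
            residual = float(x[support] @ beta[support]) - y
            g[support] += residual * x[support]   # only supp(beta) attributes are observed
        beta = beta - eta * g / batch_size
    return beta

def hybrid(beta, k_sparse, k_extra, R, eta1, T1, n1, eta2, T2, n2, sampler):
    """Algorithm 3 (sketch): alternate Exploration and Exploitation rounds."""
    for _ in range(R):
        beta = exploration(beta, k_sparse, k_extra, eta1, T1, n1, sampler)
        beta = exploitation(beta, eta2, T2, n2, sampler)
    return beta
```

With sampler = lambda: sample_example(beta_star, sigma) from the sketch in Section 2.2, hybrid can be run end to end on synthetic data.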

4 Convergence Analysis

In this section, we provide the convergence analysis of our proposed algorithms. We use the notation $\tilde O$ to hide extra log-factors and simplify the statements; the log-factors involve a confidence parameter used in the statements.

4.1 Analysis of Algorithm 1

The following theorem implies that Algorithm 1 with sufficiently large mini-batch sizes achieves linear convergence.

Theorem 4.1 (Exploration).

For Algorithm 1, if the learning rate, the sparsity level of the hard thresholding, and the mini-batch sizes are chosen adequately, then with high probability the objective gap decreases at a linear rate.

The proof of Theorem 4.1 is found in Section A.1 of the supplementary material.
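For orientation only: a linear convergence guarantee of the kind asserted in Theorem 4.1 has the schematic shape

$$ \mathbb{E}\big[L(\beta_t) - L(\beta_*)\big] \;\le\; (1 - c)^{t}\,\big(L(\beta_0) - L(\beta_*)\big) \;+\; \varepsilon_{\mathrm{batch}}, \qquad c \in (0, 1), $$

where the contraction factor $c$ and the additive term $\varepsilon_{\mathrm{batch}}$, which shrinks as the mini-batch sizes grow, are placeholders rather than the actual constants of the theorem.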

From Theorem 4.1, we obtain the following corollary, which gives a sample complexity of the algorithm.

Corollary 4.2 (Exploration).

For Algorithm 1, under the settings of Theorem 4.1, the necessary number of observed samples to attain an objective gap of at most $\epsilon$ is of order $1/\epsilon$, up to problem-dependent and logarithmic factors.

The proof of Corollary 4.2 is given in Section A.2 of the supplementary material.

Remark.

If the restricted smoothness, restricted strong convexity, and boundedness parameters are regarded as constants, Corollary 4.2 gives a sample complexity of order $1/\epsilon$ up to dimension-dependent and logarithmic factors.

Remark.

Corollary 4.2 implies that in the full information setting, i.e., $k = d$, Algorithm 1 achieves a sample complexity near the minimax optimal sample complexity of full information algorithms (Raskutti et al., 2011), if the restricted smoothness, restricted strong convexity, and boundedness parameters are regarded as constants.

Remark.

The estimator is guaranteed to be asymptotically consistent: by the restricted strong convexity of the objective, it is easily seen that $\|\beta_T - \beta_*\|$ converges to zero as the number of observed samples grows, at nearly the same rate as the objective gap $L(\beta_T) - L(\beta_*)$.
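The step behind this remark is the standard restricted strong convexity argument: since the noise $\xi$ is mean-zero and independent of $x$, we have $\nabla L(\beta_*) = 0$, and applying Assumption 3 to the sparse difference $\beta_T - \beta_*$ gives

$$ \frac{\rho^-}{2}\,\|\beta_T - \beta_*\|^2 \;\le\; L(\beta_T) - L(\beta_*), $$

so the parameter error inherits the convergence rate of the objective gap.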

4.2 Analysis of Algorithm 2

In general, Algorithm 2 alone does not guarantee convergence. However, the following theorem shows that running Algorithm 2 with sufficiently large mini-batch sizes does not increase the objective value too much. Moreover, if the support of the optimal solution is contained in that of the initial point, then Algorithm 2 also achieves linear convergence.

Theorem 4.3 (Exploitation).

For Algorithm 2, if the learning rate and mini-batch sizes are chosen adequately, then with high probability the objective value does not increase by too much; moreover, if the support of the optimal solution is contained in the support of the initial point, the objective gap decreases at a linear rate.

The proof of Theorem 4.3 is found in Section B of the supplementary material.

4.3 Analysis of Algorithm 3

Combining Theorem 4.1 and Theorem 4.3, we obtain the following theorem and corollary. These imply that, with adequate numbers of inner iterations and mini-batch sizes for Algorithm 1 and Algorithm 2, Algorithm 3 is guaranteed to achieve at least the same sample complexity as Algorithm 1. Furthermore, if the smallest magnitude of the non-zero components of the optimal solution is not too small, its sample complexity can be reduced considerably.

Theorem 4.4 (Hybrid).

If the learning rates, the sparsity level, the numbers of inner iterations, and the mini-batch sizes of the Exploration and Exploitation phases are chosen adequately, then with high probability Algorithm 3 attains at least the same convergence guarantee as Exploration (Theorem 4.1); furthermore, if the smallest magnitude of the non-zero components of the optimal solution is not too small, the guarantee improves substantially once the optimal support has been identified.

The proof of Theorem 4.4 is found in Section C.1 of the supplementary material.

Corollary 4.5 (Hybrid).

Under the settings of Theorem 4.4, the necessary number of observed samples for Algorithm 3 to attain an objective gap of at most $\epsilon$ is at most that of Corollary 4.2, and it is much smaller when the smallest magnitude of the non-zero components of the optimal solution is not too small.

The proof of Corollary 4.5 is given in Section C.2 of the supplementary material.

Remark.

From Corollary 4.5, if the smallest magnitude of the non-zero components of the optimal solution is not too small, the sample complexity of Hybrid can be much better than that of Exploration alone. In particular, if the problem-dependent parameters are regarded as constants, Algorithm 3 achieves a sample complexity that is asymptotically near the minimax optimal sample complexity of full information algorithms, even in the partial information setting. In this case, the complexity is significantly smaller than that of Algorithm 1.

5 Relation to Existing Work

In this section, we describe the relation between our methods and the most relevant existing methods.

The methods of (Cesa-Bianchi et al., 2011) and (Hazan and Koren, 2012) solve stochastic linear regression with limited attribute observation, but the limited information setting is only assumed at training time and not at prediction time, which differs from ours. Their theoretical sample complexities are also worse than ours.

The method of (Kale et al., 2017) solves sparse linear regression with limited information based on Dantzig Selector. It has been shown that the method achieves sub-linear regret in both agnostic (online) and non-agnostic (stochastic) settings under an online variant of the restricted isometry condition. Its convergence rate in non-agnostic cases is much worse than ours in terms of the dependency on the problem dimension, but the method is highly versatile, since it also has theoretical guarantees in agnostic settings, which we do not address in this work.

The methods of (Ito et al., 2017) are based on regularized dual averaging with their exploration-exploitation strategies and achieve sample complexities, under linear independence of features or compatibility, that are worse than ours. Also, the rate of Algorithm 1 in (Ito et al., 2017) has worse dependency on the dimension than ours, and its theoretical analysis assumes linear independence of features, which is much stronger than the restricted isometry condition or our restricted smoothness and strong convexity conditions. The rates of Algorithms 2 and 3 in (Ito et al., 2017) contain an additional term that, although independent of the desired accuracy, has quite poor dependency on the dimension. Their exploration-exploitation idea also differs from ours. Roughly speaking, their methods observe the attributes corresponding to the coordinates of the updated solution with large magnitude, together with attributes chosen uniformly at random; that is, exploration and exploitation are combined within single updates. In contrast, our proposed Hybrid updates the predictor by alternating between Exploration and Exploitation. This is a significant difference: under their scheme, the variance of the gradient estimator on the coordinates with large magnitude becomes small, but this variance reduction is buried in the large noise coming from the other coordinates, which makes efficient exploitation impossible.

In (Jain et al., 2014) and (Nguyen et al., 2017), (stochastic) gradient iterative hard thresholding methods for solving empirical risk minimization with sparsity constraints in the full information setting have been proposed. Our Exploration algorithm can be regarded as a generalization of these methods to the limited information setting.

6 Numerical Experiments

In this section, we provide numerical experiments to demonstrate the performance of the proposed algorithms through synthetic data and real data.

We compare our proposed Exploration and Hybrid with the state-of-the-art Dantzig (Kale et al., 2017) and RDA (Algorithm 1 in (Ito et al., 2017); of the three algorithms proposed there, we did not implement the latter two, because their theoretical sample complexity makes no sense unless the problem dimension is quite small, due to the additional term in it) in our limited attribute observation setting, on a synthetic and a real dataset. We randomly split each dataset into training (90%) and test (10%) sets, trained each algorithm on the training set, and evaluated the mean squared error on the test set. We independently repeated the experiments 5 times and averaged the mean squared error. For each algorithm, we appropriately tuned the hyper-parameters and selected the ones with the lowest mean squared error.

Synthetic dataset

Here we compare the performances on synthetic data. We generated samples from model (1): each feature was drawn from an i.i.d. standard normal distribution, the optimal predictor was constructed with a small number of non-zero components, and the output was generated by adding i.i.d. standard normal noise. We fixed the number of observed attributes per example. Figure 1 shows the averaged mean squared error as a function of the number of observed samples; the error bars depict two standard deviations of the measurements. Our proposed Hybrid and Exploration outperformed the other two methods. RDA initially performed well, but its convergence slowed down. Dantzig showed worse performance than all the other methods. Hybrid performed better than Exploration and showed rapid convergence.
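The concrete sizes used in the experiment did not survive extraction, so the following data-generation sketch uses placeholder values (n, d, s, sigma, and the non-zero pattern of beta_star are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, s, sigma = 10_000, 1_000, 10, 1.0     # placeholder sizes, not the paper's values

X = rng.standard_normal((n, d))             # i.i.d. standard normal features
beta_star = np.zeros(d)
beta_star[:s] = 1.0                         # illustrative s-sparse optimal predictor
y = X @ beta_star + sigma * rng.standard_normal(n)

# 90% / 10% train-test split, as in the experimental protocol
perm = rng.permutation(n)
n_train = int(0.9 * n)
train_idx, test_idx = perm[:n_train], perm[n_train:]
```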

Real dataset

Finally, we show the experimental results on the real dataset CT-slice (publicly available at https://archive.ics.uci.edu/ml/datasets/Relative+location+of+CT+slices+on+axial+axis). The CT-slice dataset consists of CT images described by real-valued features, and the target variable of each image denotes the relative location of the image on the axial axis. We again fixed the number of observable attributes per example. In Figure 2, the mean squared error is depicted against the number of observed examples; the error bars show two standard deviations of the measurements. Again, our proposed methods surpass the existing methods. In particular, the convergence of Hybrid was significantly fast and stable. On this dataset, Dantzig converged well and was comparable to our Exploration, while the convergence of RDA was quite slow and somewhat unstable.

Figure 1: Comparison on synthetic data.
Figure 2: Comparison on CT-slice data.

7 Conclusion

We presented sample efficient algorithms for the stochastic sparse linear regression problem with limited attribute observation. We developed the Exploration algorithm based on an efficient construction of an unbiased gradient estimator that takes advantage of the iterative use of hard thresholding in the updates of the predictors. We then refined Exploration by adaptively combining it with Exploitation, yielding the Hybrid algorithm. We have shown that Exploration and Hybrid achieve a sample complexity of O(1/ε) with much better dependency on the problem dimension than existing work. In particular, if the smallest magnitude of the non-zero components of the optimal solution is not too small, the rate of Hybrid can be boosted to near the minimax optimal sample complexity of full information algorithms. In numerical experiments, our methods showed superior convergence behavior compared to preceding methods on synthetic and real datasets.

Acknowledgement

TS was partially supported by MEXT Kakenhi (25730013, 25120012, 26280009, 15H05707 and 18H03201), Japan Digital Design, and JST-CREST.

References

  • Ben-David and Dichterman (1993) S. Ben-David and E. Dichterman. Learning with restricted focus of attention. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pages 287–296. ACM, 1993.
  • Bühlmann and Van De Geer (2011) P. Bühlmann and S. Van De Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media, 2011.
  • Candes et al. (2007) E. Candes, T. Tao, et al. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313–2351, 2007.
  • Cesa-Bianchi et al. (2011) N. Cesa-Bianchi, S. Shalev-Shwartz, and O. Shamir. Efficient learning with partially observed attributes. Journal of Machine Learning Research, 12(Oct):2857–2878, 2011.
  • Duchi and Singer (2009) J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10(Dec):2899–2934, 2009.
  • Foster et al. (2016) D. Foster, S. Kale, and H. Karloff. Online sparse linear regression. In Conference on Learning Theory, pages 960–970, 2016.
  • Hazan and Koren (2012) E. Hazan and T. Koren. Linear regression with limited observation. arXiv preprint arXiv:1206.4678, 2012.
  • Ito et al. (2017) S. Ito, D. Hatano, H. Sumita, A. Yabe, T. Fukunaga, N. Kakimura, and K.-I. Kawarabayashi. Efficient sublinear-regret algorithms for online sparse linear regression with limited observation. In Advances in Neural Information Processing Systems, pages 4102–4111, 2017.
  • Jain et al. (2014) P. Jain, A. Tewari, and P. Kar. On iterative hard thresholding methods for high-dimensional M-estimation. In Advances in Neural Information Processing Systems, pages 685–693, 2014.
  • Kale et al. (2017) S. Kale, Z. Karnin, T. Liang, and D. Pál. Adaptive feature selection: Computationally efficient online sparse linear regression under RIP. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1780–1788, 2017.
  • Kivinen and Warmuth (1997) J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.
  • Nguyen et al. (2017) N. Nguyen, D. Needell, and T. Woolf. Linear convergence of stochastic iterative greedy algorithms with sparse constraints. IEEE Transactions on Information Theory, 63(11):6869–6895, 2017.
  • Raskutti et al. (2011) G. Raskutti, M. J. Wainwright, and B. Yu. Minimax rates of estimation for high-dimensional linear regression over $\ell_q$-balls. IEEE Transactions on Information Theory, 57(10):6976–6994, 2011.
  • Shalev-Shwartz et al. (2011) S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.
  • Xiao (2010) L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11(Oct):2543–2596, 2010.
  • Zinkevich (2003) M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928–936, 2003.

Appendix A Analysis of Algorithm 1

In this section, we give the complete proofs of Theorem 4.1 and Corollary 4.2.

A.1 Proof of Theorem 4.1

The proof is essentially a generalization of the one in (Jain et al., 2014) to the stochastic and partial information settings.

Proposition A.1 (Exploration).

Suppose that , and . Then for any , Algorithm 1 satisfies

with probability , where and .

Proof.

We denote , and as , and respectively. Also we define .
Since and are -sparse, restricted smoothness of implies

Using the Cauchy-Schwarz inequality and Young's inequality, we have

Here the second inequality follows from the fact .
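For reference, the two elementary inequalities invoked in this step are the Cauchy-Schwarz inequality followed by Young's inequality: for any vectors $a, b$ and any $c > 0$,

$$ \langle a, b \rangle \;\le\; \|a\|\,\|b\| \;\le\; \frac{1}{2c}\,\|a\|^2 + \frac{c}{2}\,\|b\|^2 . $$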

Thus we get

(3)

Also we have

Here, the second inequality is due to the Cauchy-Schwarz inequality, the third inequality follows from the stated fact and the definition of the hard thresholding operator, and the third equality uses the fact noted above.

Under this assumption, combining (A.1) with this fact yields