## 1 Introduction

Linear regression is a longstanding and effective method for learning a prediction model from data. In some scenarios, however, it cannot be used as-is: the standard algorithms assume access to all the attributes of each training example, whereas in some real-life scenarios the learner can access only a small number of attributes per training example.

Consider, for example, the problem of medical diagnosis, in which the learner wishes to determine whether a patient has some disease based on a series of medical tests. In order to build a linear model, the learner has to gather a set of volunteers, perform diagnostic tests on them, and use the test results as features. However, some of the volunteers may be reluctant to undergo a large number of tests, as medical tests may cause physical discomfort, and will prefer to undergo only a small number of them. At test time, however, patients are more likely to agree to undergo all the tests, in order to obtain a diagnosis for their illness.

Another example is the case where some cost is associated with each attribute, whether computational or financial. For example, the http://intelligence.towerdata.com web site allows users to buy marketing data about email addresses and pay per feature. The learner would like to minimize the total cost, which is not necessarily proportional to the number of examples.

This problem is known as budgeted learning [1] or learning with limited attribute observation (LAO) [2]. Formally, we use the local budget setting presented in [3]: for each training example (composed of a $d$-dimensional attribute vector $\mathbf{x}$ and a target value $y$), we have a budget of $k$ attributes, where $k < d$, and we are able to choose which attributes we wish to reveal. This is different from the missing data setting, in which the revealed attributes are given to us and we cannot choose which attributes to reveal, and from the feature selection setting, in which the output model includes only a subset of the attributes. In our setting, the goal is to find a good predictor despite the partial information at training time, where a good predictor is one that minimizes the expected discrepancy between the predicted value, $\hat{y}$, and the target value, $y$. This discrepancy is generally measured by some loss function, and we focus on the squared loss, i.e. $\ell(\hat{y}, y) = (\hat{y} - y)^2$. The expected discrepancy is called the risk. We consider learning with respect to linear predictors, parameterized by a vector $\mathbf{w}$. Given an unlabeled example $\mathbf{x}$ (a vector of attributes), the prediction is defined as $\hat{y} = \langle \mathbf{w}, \mathbf{x} \rangle$ (we ignore the bias term here, but this can be easily handled by adding a constant dummy attribute that will always be revealed). In particular, we focus on two standard types of linear prediction problems: those with bounded 2-norm, which give the ridge regression scenario, and those with bounded 1-norm, which give the lasso regression scenario.

Our basic approach is similar to the one proposed in [4, 3], which uses online gradient descent with stochastic gradient estimates. The general idea is to scan through the training set, calculate an unbiased gradient estimator from each example (using only a small number of attributes), and plug it into a stochastic gradient descent method, thus minimizing the loss over the training set.
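As an illustration of this approach, here is a minimal sketch (an assumption of ours for illustration, not the paper's Algorithm 1): online gradient descent on the squared loss, where each step reveals only $k+1$ uniformly sampled attributes of the current example and reweights them so that the gradient estimate stays unbiased.

```python
import numpy as np

def project_l2(w, B):
    """Project w onto the L2 ball of radius B."""
    norm = np.linalg.norm(w)
    return w if norm <= B else w * (B / norm)

def budgeted_sgd(X, y, k, B, eta, rng):
    """Online gradient descent where each gradient estimate
    reveals only k+1 attributes of the current example."""
    T, d = X.shape
    w = np.zeros(d)
    avg = np.zeros(d)
    for t in range(T):
        x_t, y_t = X[t], y[t]
        # Unbiased estimate of x_t from k uniformly sampled attributes:
        # each revealed coordinate is reweighted by 1 / (k * (1/d)) = d / k.
        idx = rng.integers(0, d, size=k)
        x_hat = np.zeros(d)
        for i in idx:
            x_hat[i] += x_t[i] * d / k
        # Unbiased estimate of <w, x_t> from one more revealed attribute.
        j = rng.integers(0, d)
        ip_hat = d * w[j] * x_t[j]
        g_hat = 2.0 * (ip_hat - y_t) * x_hat  # unbiased gradient estimate
        w = project_l2(w - eta * g_hat, B)
        avg += w
    return avg / T
```

Unbiasedness here is immediate: each revealed coordinate is divided by its sampling probability, and the two estimates are built from independent samples, so their product has the correct expectation.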

These works build the unbiased estimator using uniform sampling from the attributes of the example, eventually leading to ridge algorithms with expected excess risk bounds of $O\big(B^2\sqrt{d/(kT)}\big)$ after $T$ examples, compared with $O\big(B^2\sqrt{1/T}\big)$ for the online full-information algorithms that can view all the attributes [5], and to lasso algorithms with an additional logarithmic factor in both settings (see Table 1). Another interpretation of these results is that when viewing only $k$ out of $d$ attributes, the algorithms need $d/k$ times as many examples to obtain the same accuracy, thus examining the same total number of attributes. [4] also provides a lower bound for the ridge scenario, establishing that the ridge bound is not improvable in general. In this paper, despite these seemingly unimprovable results, we show that they can in fact be improved. We do this by developing a novel sampling scheme which samples the attributes in a data-dependent manner: we sample attributes with large second moments more often than others, and are thus able to gain a *data-dependent* improvement factor. In other words, our sampling methods take advantage of the geometry of the data distribution, and utilize it to extract more 'information' out of each sample. Under reasonable assumptions, our methods need to examine *fewer* attributes than the online full-information algorithms to reach the same accuracy, thus optimizing the principal goal in budgeted scenarios. To the best of our knowledge, ours are the first methods able to do so in the local budget setting.

We begin by assuming prior knowledge of the second moments of the data, namely $m_i = \mathbb{E}_D[x_i^2]$ for $i \in [d]$, where we use $\mathbb{E}_D$ to denote the expectation with respect to the data distribution. Our risk bounds, under the assumptions $\|\mathbf{x}\|_2 \le 1$ in the ridge scenario and $\|\mathbf{x}\|_\infty \le 1$ in the lasso scenario, are also summarized in Table 1. To clarify the notation, $\|\boldsymbol{m}\|_{1/2}$ is defined as $\big(\sum_{i=1}^d \sqrt{m_i}\big)^2$, and $\|\boldsymbol{m}\|_1$ is defined as $\sum_{i=1}^d m_i$.

**Table 1:** Summary of expected risk bounds.

| | New Bound | Old Bound | Online Full-Information Bound |
|---|---|---|---|
| Ridge Regression | | | |
| Lasso Regression | | | |

It can be easily shown that $\|\boldsymbol{m}\|_{1/2} \le d\,\|\boldsymbol{m}\|_1$ in the ridge scenario, and that the analogous inequality holds in the lasso scenario, which proves that our bounds are always as good as the previous bounds. In fact, equality holds only when all the moments are exactly the same; otherwise, the data-dependent quantity is strictly smaller, making our bounds better than the previous ones. This improvement factor is data-dependent and may be large, as $\|\boldsymbol{m}\|_{1/2}$ can be much smaller than $d\,\|\boldsymbol{m}\|_1$ when the moments decay at a sufficient rate. In fact, similar distributional assumptions are made in other successful algorithmic approaches such as AdaGrad (we further elaborate on the connection between our work and AdaGrad in Appendix A). When the attribute budget $k$ is large enough, our bounds also coincide with those of the online full-information scenario.
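The inequality driving this claim can be checked numerically. The sketch below assumes (consistent with the sampling probabilities derived in Section 3) that the relevant data-dependent quantity is $\|\boldsymbol{m}\|_{1/2} = \big(\sum_i \sqrt{m_i}\big)^2$, and verifies that it never exceeds $d\,\|\boldsymbol{m}\|_1$, with equality for uniform moments and a large gap for decaying moments.

```python
import numpy as np

def half_quasinorm(m):
    """||m||_{1/2} = (sum_i sqrt(m_i))^2 for a nonnegative moment vector m."""
    return np.sum(np.sqrt(m)) ** 2

d = 1000
uniform = np.full(d, 1.0 / d)                 # all second moments equal
decaying = 1.0 / np.arange(1, d + 1) ** 2     # polynomially decaying moments

for m in (uniform, decaying):
    # Cauchy-Schwarz: (sum_i sqrt(m_i))^2 <= d * sum_i m_i,
    # with equality if and only if all m_i are equal.
    assert half_quasinorm(m) <= d * m.sum() + 1e-9

print(half_quasinorm(uniform), d * uniform.sum())       # equal up to rounding
print(half_quasinorm(decaying) / (d * decaying.sum()))  # ratio well below 1
```

For the decaying moments $m_i = 1/i^2$, the ratio is roughly $\log^2 d$ over $d$, which is the kind of data-dependent gain discussed above.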

Of course, a practical limitation of our approach is that the second moments of the data may not be known in advance or easily computable in our attribute efficient setting. To address this, we split our algorithms into two phases: in the first phase, we use a simple yet effective scheme that estimates the second moments of the attributes. In the second phase, we use the same sampling scheme but with smoothed probabilities, to compensate for the stochastic nature of the estimation phase. We prove that this method is always as good as the previous algorithms (up to constant factors) and, given sufficiently many training examples, achieves the same bounds as our algorithms with prior knowledge of the second moments of the attributes (up to constant factors).

The rest of this paper is organized as follows: In section 2 we provide the necessary background. In section 3 we describe the existing state-of-the-art algorithms for attribute efficient ridge regression, and develop our sampling scheme for the case where we have prior knowledge of the second moments of the attributes. We also develop an estimation scheme for the case where we do not assume any prior knowledge of the second moments, and present two variants of the algorithm: one that assumes prior knowledge of $\|\boldsymbol{m}\|_{1/2}$ only, and one that does not assume any prior knowledge at all. These two variants have the same expected risk bounds (up to a constant factor), but differ in the number of training examples needed in the estimation phase. In section 4 we provide similar results, this time for attribute efficient lasso regression. When no prior knowledge of the second moments of the attributes is available, the lasso scenario is simpler than the ridge scenario, as prior knowledge of the analogous scalar quantity does not improve the results. In section 5 we show experimental results that support our theoretical claims, both on simulated and on well-known data sets. We finish with a summary in section 6, and a short discussion of the connection between the AdaGrad method and our sampling scheme in appendix A.

## 2 Preliminaries

### 2.1 Notation

Throughout this paper we use the following notation: we indicate scalars by a small letter, $a$, and vectors by a bold font, $\mathbf{a}$. We use $\boldsymbol{m}$ to indicate the vector for which $m_i = \mathbb{E}_D[x_i^2]$ for all $i$, and $\hat{\boldsymbol{m}}$ to indicate the vector for which $\hat{m}_i$ is our estimate of $m_i$. We denote the $i$-th vector of the standard basis by $\mathbf{e}_i$. All our vectors lie in $\mathbb{R}^d$, where $d$ is the dimension. We indicate the set of indices $\{1, \dots, d\}$ by $[d]$. We use $\|\mathbf{x}\|_p$ to indicate the $p$-norm of the vector, equal to $\big(\sum_{i=1}^d |x_i|^p\big)^{1/p}$. We apply this notation also for the case where $p < 1$, i.e. $\|\mathbf{x}\|_{1/2} = \big(\sum_{i=1}^d \sqrt{|x_i|}\big)^2$, even though this is not a proper norm, as the triangle inequality does not hold. We also use $\|\mathbf{x}\|_\infty$ to indicate the infinity norm, $\max_i |x_i|$. We use $\langle \mathbf{w}, \mathbf{x} \rangle$ to indicate the standard inner product, $\sum_{i=1}^d w_i x_i$. We denote the expectation with respect to the randomness of the algorithm (attribute sampling) by $\mathbb{E}_A$, the expectation with respect to the data distribution by $\mathbb{E}_D$, and the expectation with respect to both by $\mathbb{E}$. For the two-phased algorithms, we use $\mathbb{E}_j$, where $j \in \{1, 2\}$, to denote the expectation with respect to the data distribution and the randomness of the algorithm during the $j$-th phase.

### 2.2 Linear Regression

The general framework for regression assumes the learner has a training set $S = \{(\mathbf{x}_t, y_t)\}_{t=1}^{T}$, where each $\mathbf{x}_t \in \mathbb{R}^d$ is a data point, represented by a vector of attributes, and $y_t \in \mathbb{R}$ is the desired target value. The goal of the learner is to find a weight vector $\mathbf{w}$, such that $\langle \mathbf{w}, \mathbf{x} \rangle$ is a good estimator of $y$, in the sense that it minimizes some penalty function over the entire data set. We focus on the most popular choice for such a function, the squared loss: $(\langle \mathbf{w}, \mathbf{x} \rangle - y)^2$. We denote the loss induced by the example $(\mathbf{x}_t, y_t)$ as $\ell_t(\mathbf{w}) = (\langle \mathbf{w}, \mathbf{x}_t \rangle - y_t)^2$.

We follow the standard framework for statistical learning [6] and assume the training set was sampled i.i.d. from some joint distribution $D$ over pairs $(\mathbf{x}, y)$. The goal of the learner is to find a predictor that minimizes the risk, defined as the expected loss: $L_D(\mathbf{w}) = \mathbb{E}_D\big[(\langle \mathbf{w}, \mathbf{x} \rangle - y)^2\big]$. Since the distribution is unknown, the learner relies on the given training set, which is assumed to be sampled i.i.d. from $D$.

Finding this minimum may result in overfitting the data, therefore it is common to limit the size of the hypothesis class by adding a regularization constraint on the norm of $\mathbf{w}$, requiring it to be smaller than or equal to some value $B$. The first of the two main scenarios of regression is ridge regression, where we have the 2-norm constraint, and the hypothesis class is $\mathcal{W} = \{\mathbf{w} : \|\mathbf{w}\|_2 \le B\}$. If we assume $\|\mathbf{x}\|_2 \le 1$ with probability 1, then using the Cauchy–Schwarz inequality, we can assume without loss of generality that $|y| \le B$ with probability 1. The second is lasso regression, where we have the 1-norm constraint, and the hypothesis class is $\mathcal{W} = \{\mathbf{w} : \|\mathbf{w}\|_1 \le B\}$. Since we assume $\|\mathbf{x}\|_\infty \le 1$ with probability 1, using Hölder's inequality, we can assume without loss of generality that $|y| \le B$ with probability 1.
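The two constraint sets above are the $\ell_2$ and $\ell_1$ balls of radius $B$, and gradient methods over them need a Euclidean projection step. Here is a minimal sketch of both projections (standard constructions, not code from the paper; the $\ell_1$ case uses the usual sort-based soft-thresholding):

```python
import numpy as np

def project_l2_ball(w, B):
    """Euclidean projection onto {w : ||w||_2 <= B} (ridge constraint set)."""
    norm = np.linalg.norm(w)
    return w if norm <= B else w * (B / norm)

def project_l1_ball(w, B):
    """Euclidean projection onto {w : ||w||_1 <= B} (lasso constraint set),
    via soft-thresholding with the standard sort-based threshold."""
    if np.abs(w).sum() <= B:
        return w.copy()
    u = np.sort(np.abs(w))[::-1]          # magnitudes, descending
    css = np.cumsum(u)
    # Largest index rho with u[rho] * (rho + 1) > css[rho] - B.
    rho = np.nonzero(u * np.arange(1, len(w) + 1) > css - B)[0][-1]
    theta = (css[rho] - B) / (rho + 1.0)
    return np.sign(w) * np.maximum(np.abs(w) - theta, 0.0)
```

For example, projecting $(3, -1, 2)$ onto the unit $\ell_1$ ball soft-thresholds at $\theta = 2$, giving $(1, 0, 0)$.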

In the full-information regression scenario, the learner has access to all the attributes of each $\mathbf{x}_t$, whereas in the attribute efficient scenario, the learner can sample at most $k$ attributes out of $d$ from each vector $\mathbf{x}_t$.

### 2.3 Related Work

The scenario of learning with limited attribute access was first introduced by Ben-David & Dichterman [2], under the term "Learning with Restricted Focus of Attention". There are two popular types of constraints: the first, which we address in this paper, is the local budget constraint, where the learner has access to $k$ attributes per training example. The second is the global budget constraint, where the learner has a total budget of attributes and may spread it freely among all the training examples, as long as the total number of attributes seen does not exceed the budget. Clearly, any upper bound for the local budget setting also holds in the global budget setting with a budget of $kT$.

Cesa-Bianchi et al. [3] were the first to build an efficient linear algorithm for the local budget scenario, and asked whether there exists an efficient algorithm for the attribute efficient scenario that can reach a similar accuracy as the full-information scenario, while seeing $(d/k)\,T$ examples and being able to sample only $k$ attributes from each example. Such a result would imply that in the attribute efficient scenario, we can learn just as well as in the full-information scenario, after seeing the same number of attributes ($dT$ in both cases). Thus, we can trade off between the number of examples and the amount of information received on each example. They also proved a lower bound on the sample complexity of learning an $\epsilon$-accurate linear regressor.

Later on, Hazan et al. [4] showed that the answer is yes, up to global constants, for both the ridge and lasso scenarios. Their approach was based on the Online Gradient Descent method [7] for ridge regression, and on the EG algorithm [8] for the lasso scenario. In both cases, at each iteration, the learner uses an unbiased estimator of the gradient, and updates the current weight vector accordingly. The key idea is that by sampling just a few attributes using an appropriate scheme, the learner can still build an unbiased estimator of the gradient, even in the attribute efficient scenario, and in expectation perform a gradient step in the correct direction. They also complemented the ridge regression result by proving a corresponding lower bound on the sample complexity of learning an $\epsilon$-accurate ridge regressor.

## 3 Attribute Efficient Ridge Regression

In this section we present our algorithms for ridge regression, where the loss is the squared loss, $\ell_t(\mathbf{w}) = (\langle \mathbf{w}, \mathbf{x}_t \rangle - y_t)^2$, and the 2-norm is bounded, $\|\mathbf{w}\|_2 \le B$. The generic approach to the ridge attribute efficient scenario, which we call the General Attribute Efficient Ridge Regression (GAERR) algorithm and present in Algorithm 1, was first developed in [3, 4] and is based on the Online Gradient Descent (OGD) algorithm with gradient estimates.

The OGD algorithm goes over the training set, and for each example builds an unbiased estimator of the gradient. Afterwards, the algorithm updates the current weight vector, $\mathbf{w}_t$, by performing a step of size $\eta$ in the direction opposite to the gradient estimator. The result is projected onto the ball of radius $B$, yielding $\mathbf{w}_{t+1}$. At the end, the algorithm outputs the average of all the $\mathbf{w}_t$-s. The algorithm converges to the global minimum, as the minimization problem is convex in $\mathbf{w}$.

The gradient of the squared loss is $\nabla \ell_t(\mathbf{w}) = 2(\langle \mathbf{w}, \mathbf{x}_t \rangle - y_t)\,\mathbf{x}_t$, and the key idea of the GAERR algorithm is how to use the budgeted sampling to construct an unbiased estimator of the gradient. The GAERR algorithm does so by sampling $k$ attributes out of the $d$ attributes of the sample, where $k$ is the budget parameter (as in the AERR algorithm, we assume a budget of at least two attributes per training example): first, it samples $k$ attributes with probabilities $p_1, \dots, p_d$, and by weighting them correctly, builds an unbiased estimator $\hat{\mathbf{x}}_t$ of the data point $\mathbf{x}_t$. Then it samples one attribute with probabilities $q_1, \dots, q_d$, and by a simple calculation obtains an unbiased estimator of the inner product $\langle \mathbf{w}_t, \mathbf{x}_t \rangle$. Subtracting the label, $y_t$, yields the unbiased estimator $\hat{\phi}_t$ of $\langle \mathbf{w}_t, \mathbf{x}_t \rangle - y_t$. Finally, the algorithm multiplies the estimator of the inner product minus the label, $\hat{\phi}_t$, by the estimator of the data point, $\hat{\mathbf{x}}_t$, thus building an unbiased estimator of the gradient at that point, $\hat{\mathbf{g}}_t = 2\,\hat{\phi}_t\,\hat{\mathbf{x}}_t$.
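In code, the two estimators might look as follows. This is a sketch under our reading of the construction: the probability vectors `p` and `q` are left as free parameters, since unbiasedness holds for any strictly positive choice.

```python
import numpy as np

def estimate_x(x, p, k, rng):
    """Unbiased estimator of x: draw k indices i_1..i_k i.i.d. from p and
    return (1/k) * sum_s (x_{i_s} / p_{i_s}) e_{i_s}."""
    d = len(x)
    x_hat = np.zeros(d)
    for i in rng.choice(d, size=k, p=p):
        x_hat[i] += x[i] / (k * p[i])
    return x_hat

def estimate_inner_product(w, x, q, rng):
    """Unbiased estimator of <w, x>: draw one index j from q and
    return w_j * x_j / q_j."""
    j = rng.choice(len(x), p=q)
    return w[j] * x[j] / q[j]

# Unbiasedness can be verified by exact enumeration rather than sampling:
# E[x_hat] = sum_i p_i * (x_i / p_i) * e_i = x, and
# E[est]   = sum_j q_j * (w_j * x_j / q_j) = <w, x>.
```

Since the two estimators use independent samples, multiplying them (and subtracting the label from the inner-product estimate) yields an unbiased gradient estimate.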

The expected risk bound of the GAERR algorithm is presented in the next theorem which is a slightly more general version of Theorem 3.1 in [4].

###### Theorem 3.1.

Assume the distribution $D$ is such that $\|\mathbf{x}\|_2 \le 1$ and $|y| \le B$ with probability 1. Let $\bar{\mathbf{w}}$ be the output of GAERR when run with step size $\eta = B / (G\sqrt{T})$, and let $G^2 \ge \max_t \mathbb{E}_A \|\hat{\mathbf{g}}_t\|_2^2$. Then for any $\mathbf{w}^\star$ with $\|\mathbf{w}^\star\|_2 \le B$,

$$\mathbb{E}\big[L_D(\bar{\mathbf{w}})\big] \le L_D(\mathbf{w}^\star) + \frac{B\,G}{\sqrt{T}}.$$

The general idea of the proof is that $\hat{\mathbf{g}}_t$ is an unbiased estimator of the gradient, therefore we can use the standard analysis of the OGD algorithm. The full proof can be found in Appendix B.1.

The AERR algorithm is one variant of the GAERR algorithm. It was presented in [4] and uses uniform sampling to estimate $\mathbf{x}_t$; in our GAERR notation it uses $p_i = 1/d$ for all $i \in [d]$.

The authors prove (Lemma 3.3 in [4]) that for the AERR algorithm, $\mathbb{E}_A \|\hat{\mathbf{g}}_t\|_2^2 = O(B^2 d / k)$, which together with Theorem 3.1 and the appropriate step size yields an expected risk bound of $O\big(B^2\sqrt{d/(kT)}\big)$. They also prove that, up to constant factors, their algorithm is optimal, by showing a corresponding lower bound.

This, however, is not the end of the story. By analyzing the bound, we show that we can improve it in a data-dependent manner. Theorem 3.1 shows that the expected risk bound is proportional to $G$, therefore we wish to develop a sampling method that minimizes the expected squared 2-norm of the gradient estimator.

The gradient estimate consists of estimating the inner product $\langle \mathbf{w}_t, \mathbf{x}_t \rangle$ and estimating $\mathbf{x}_t$. To estimate $\mathbf{x}_t$, we use the following procedure: we sample $k$ indices, $i_1, \dots, i_k$, from $[d]$ with probabilities $p_1, \dots, p_d$, and use

$$\hat{\mathbf{x}}_t = \frac{1}{k} \sum_{s=1}^{k} \frac{x_{t, i_s}}{p_{i_s}}\, \mathbf{e}_{i_s} \qquad (1)$$

as an estimator for $\mathbf{x}_t$. The next lemma will assist in bounding its 2-norm.

###### Lemma 3.2.

For every distribution $\mathbf{p}$ with $p_i > 0$ for all $i$ and $\sum_{i=1}^d p_i = 1$, we have $\mathbb{E}_A \|\hat{\mathbf{x}}_t\|_2^2 = \frac{1}{k} \sum_{i=1}^d \frac{x_{t,i}^2}{p_i} + \big(1 - \frac{1}{k}\big) \|\mathbf{x}_t\|_2^2$.

The proof can be found in Appendix B.2.

Since

$$\mathbb{E}_D\!\left[\frac{1}{k} \sum_{i=1}^d \frac{x_{t,i}^2}{p_i}\right] = \frac{1}{k} \sum_{i=1}^d \frac{m_i}{p_i}, \qquad (2)$$

in order to minimize the expected squared 2-norm of the estimator, we need to solve the following optimization problem:

$$\min_{\mathbf{p}} \ \sum_{i=1}^d \frac{m_i}{p_i} \qquad \text{subject to} \quad \sum_{i=1}^d p_i = 1, \ \ p_i > 0.$$

This problem is equivalent to

$$\min_{\mathbf{p}} \ \sum_{i=1}^d \frac{m_i}{p_i} \qquad \text{subject to} \quad \sum_{i=1}^d p_i \le 1, \ \ p_i > 0, \qquad (3)$$

which can easily be solved using the Lagrange multipliers method to yield the solution

$$p_i = \frac{\sqrt{m_i}}{\sum_{j=1}^d \sqrt{m_j}}. \qquad (4)$$
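Spelling out the Lagrangian step behind this solution (a standard calculation, included here for completeness):

$$\mathcal{L}(\mathbf{p}, \lambda) = \sum_{i=1}^{d} \frac{m_i}{p_i} + \lambda \left( \sum_{i=1}^{d} p_i - 1 \right), \qquad \frac{\partial \mathcal{L}}{\partial p_i} = -\frac{m_i}{p_i^2} + \lambda = 0 \;\Longrightarrow\; p_i = \sqrt{m_i / \lambda}.$$

Enforcing $\sum_i p_i = 1$ gives $\sqrt{\lambda} = \sum_j \sqrt{m_j}$, and substituting back yields the optimal objective value $\sum_i m_i / p_i = \big(\sum_i \sqrt{m_i}\big)^2 = \|\boldsymbol{m}\|_{1/2}$, which is where the data-dependent quantity in our bounds comes from.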

We also use

$$\hat{\phi}_t = \frac{w_{t,j}\, x_{t,j}}{q_j} - y_t, \qquad (5)$$

where the index $j$ is sampled with probabilities $q_j = w_{t,j}^2 / \|\mathbf{w}_t\|_2^2$, as an estimator for the inner product minus the label. The next lemma will assist in bounding its 2-norm.

###### Lemma 3.3.

Using our sampling method we have $\mathbb{E}_A\big[\hat{\phi}_t^2\big] \le 4B^2$.

The proof can be found in Appendix B.3.

We could have followed a similar optimization strategy to find the optimal sampling distribution for estimating the inner product. This strategy would have yielded that the optimal probabilities are $q_j \propto |w_{t,j}|\sqrt{m_j}$. We, however, were not able to prove the superiority of this sampling method analytically, and it was left out of the algorithm analysis.

Altogether, we formulate a lemma that will bound the gradient estimate.

###### Lemma 3.4.

The GAERR algorithm generates gradient estimates $\hat{\mathbf{g}}_t$ that satisfy, for all $t$, $\mathbb{E}_A \|\hat{\mathbf{g}}_t\|_2^2 \le 4\, \mathbb{E}_A\big[\hat{\phi}_t^2\big]\, \mathbb{E}_A \|\hat{\mathbf{x}}_t\|_2^2$.

###### Proof.

Since $\hat{\phi}_t$ and $\hat{\mathbf{x}}_t$ are built from independent samples, $\mathbb{E}_A \|\hat{\mathbf{g}}_t\|_2^2 = \mathbb{E}_A\big[4\,\hat{\phi}_t^2\, \|\hat{\mathbf{x}}_t\|_2^2\big] = 4\, \mathbb{E}_A\big[\hat{\phi}_t^2\big]\, \mathbb{E}_A \|\hat{\mathbf{x}}_t\|_2^2$. ∎

### 3.1 Known Second Moment Scenario

If we assume we have prior knowledge of the second moment of each attribute, namely $m_i = \mathbb{E}_D[x_i^2]$ for all $i \in [d]$, we can use equation (4) to calculate the optimal values of the $p_i$-s. This is the idea behind our DDAERR (Data-Dependent Attribute Efficient Ridge Regression) algorithm.

The expected risk bound of our algorithm is formulated in the next theorem.

###### Theorem 3.5.

Assume the distribution $D$ is such that $\|\mathbf{x}\|_2 \le 1$ and $|y| \le B$ with probability 1, and that $m_i = \mathbb{E}_D[x_i^2]$ is known for every $i \in [d]$. Let $\bar{\mathbf{w}}$ be the output of DDAERR, when run with a suitably chosen step size $\eta$. Then for any $\mathbf{w}^\star$ with $\|\mathbf{w}^\star\|_2 \le B$,

$$\mathbb{E}\big[L_D(\bar{\mathbf{w}})\big] \le L_D(\mathbf{w}^\star) + O\!\left(B^2 \sqrt{\frac{\|\boldsymbol{m}\|_{1/2}}{k\,T}}\right).$$

###### Proof of Theorem 3.5.

Recalling that $\|\mathbf{x}\|_2 \le 1$ with probability 1, it is easy to see that $\|\boldsymbol{m}\|_{1/2} \le d\,\|\boldsymbol{m}\|_1 \le d$, therefore the DDAERR algorithm always performs at least as well as the AERR algorithm (if all the second moments are equal, it is easy to see that $p_i = 1/d$ for all $i$; in this case the DDAERR and AERR algorithms coincide). However, $\|\boldsymbol{m}\|_{1/2}$ may also be much smaller than $d$, in cases where the second moments vary between attributes or the vectors are sparse. In these cases, we may gain a significant improvement. For example, if we consider a polynomial attribute decay such as $m_i = 1/i^2$, we have $\|\boldsymbol{m}\|_{1/2} = \big(\sum_{i=1}^d 1/i\big)^2 = O(\log^2 d)$, which is significantly smaller than $d$.

### 3.2 Unknown Second Moment Scenario, Known $\|\boldsymbol{m}\|_{1/2}$

The solution presented in the previous section requires exact knowledge of $m_i$ for all $i \in [d]$. Such prior knowledge may not be available when the learner is faced with a new learning task. Thus, we turn to consider the case where the moments are initially unknown. We will still assume that the learner can guess or estimate the step size, which depends only on the scalar quantity $\|\boldsymbol{m}\|_{1/2}$. In the next section, we will consider the case where even that information is unknown.

The problem in this scenario is that without prior knowledge of the second moments of the attributes, the learner cannot calculate the optimal $p_i$-s via equation (4). To address this issue we split the learning into two phases: in the first phase, we run on the first $T_1$ training examples and estimate the second moments by sampling the attributes uniformly at random. In the second phase, we run on the remaining training examples and perform the regular DDAERR algorithm, with a slight modification: in the calculation of the $p_i$-s, we use an upper confidence interval instead of the second moments themselves, to compensate for the stochastic nature of the estimation phase. This approach is the basis for our Two-Phased DDAERR algorithm (Algorithm 2). Knowledge of $\|\boldsymbol{m}\|_{1/2}$ does not assist in the calculation of the $p_i$-s, but gives us the optimal step size, $\eta$. Note that in practice, one can actually run the AERR algorithm during the first phase, in order to obtain a better starting point for the second phase. We ignore this improvement in our analysis below, but incorporate it in the experiments presented in section 5.
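A minimal sketch of the two first-phase computations is given below. The confidence width and the smoothing rule are illustrative placeholders, not the constants used in Algorithm 2.

```python
import numpy as np

def estimate_moments(X1, k, rng):
    """Phase 1: estimate m_i = E[x_i^2] by revealing k uniformly
    sampled attributes from each of the first-phase examples."""
    T1, d = X1.shape
    sums = np.zeros(d)
    counts = np.zeros(d)
    for t in range(T1):
        for i in rng.choice(d, size=k, replace=False):
            sums[i] += X1[t, i] ** 2
            counts[i] += 1
    return sums / np.maximum(counts, 1)

def smoothed_probabilities(m_hat, conf_width):
    """Phase 2 sampling rule: p_i proportional to the square root of an
    upper confidence bound on m_i (conf_width is a placeholder)."""
    p = np.sqrt(m_hat + conf_width)
    return p / p.sum()
```

Using the upper confidence bound keeps every $p_i$ bounded away from zero, so an attribute whose moment was underestimated in the first phase is still sampled occasionally in the second.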

There are other variants of this type; the most apparent of them is to use the same samples that estimate the gradient to estimate the moments themselves. This method, even though it may be superior to ours in some cases, will not yield better results in worst-case scenarios, because we may never obtain accurate enough estimates for some of the attributes.

The expected risk bound of the algorithm is formulated in the following theorem.

###### Theorem 3.6.

Assume the distribution $D$ is such that $\|\mathbf{x}\|_2 \le 1$ and $|y| \le B$ with probability 1. Assume further that the value $\|\boldsymbol{m}\|_{1/2}$ is known. Let $\bar{\mathbf{w}}$ be the output of Two-Phased DDAERR when run with

Then for all $\delta \in (0, 1)$ and for any $\mathbf{w}^\star$ with $\|\mathbf{w}^\star\|_2 \le B$, with probability $1 - \delta$ over the first phase, we have

Also, with probability $1 - \delta$ over the first phase, we have

With probability $1 - \delta$ over the first phase, regardless of the value of $T_1$, the expected risk bound is at most $O\big(B^2\sqrt{d/(kT)}\big)$, which is the same bound as that of the AERR algorithm. This means that the Two-Phased DDAERR algorithm performs, with probability $1 - \delta$ over the first phase, as well as the AERR algorithm, up to a constant factor. Second, as $T_1$ increases, the expected risk bound tends to $O\big(B^2\sqrt{\|\boldsymbol{m}\|_{1/2}/(kT)}\big)$. Therefore, if $T_1$ is large enough, we achieve an improvement over the AERR algorithm. If $T_1 \to \infty$, the bound becomes the same as in the regular DDAERR algorithm, which assumes prior knowledge of the second moments of the attributes.

The conclusion is that even if we do not have prior knowledge of the second moments of the attributes, but can guess $\|\boldsymbol{m}\|_{1/2}$, we should still prefer our Two-Phased DDAERR algorithm over the AERR algorithm. In the next section, we analyze the case where even $\|\boldsymbol{m}\|_{1/2}$ is unknown.

#### 3.2.1 Proof of Theorem 3.6

The main goal of the proof is to bound the expected squared 2-norm of the gradient estimator from above. By Lemma 3.4, all that remains is to upper bound $\mathbb{E}_A \|\hat{\mathbf{x}}_t\|_2^2$. In the next lemma we show two different upper bounds on this quantity. The first states that with probability $1 - \delta$ over the first phase, the bound is, up to a constant factor, the same as in the AERR algorithm. The second bound decreases in $T_1$, and will help us analyze the convergence rate of the algorithm.

###### Lemma 3.7.

For all $\delta \in (0, 1)$ and $t$, with probability $1 - \delta$ over the first phase, we have

and with probability $1 - \delta$ over the first phase, we have

The proof can be found in Appendix B.4.

We will treat each bound separately, and later join the results into a single lemma. First, we prove that even if we do not have an estimate of $\|\boldsymbol{m}\|_{1/2}$, then with a proper choice of $\eta$, our Two-Phased DDAERR algorithm still performs, with probability $1 - \delta$ over the first phase, as well as the AERR algorithm, up to a constant factor.

###### Lemma 3.8.

Let $\bar{\mathbf{w}}$ be the output of Two-Phased DDAERR when run with a suitably chosen step size $\eta$. Then for all $\delta \in (0, 1)$ and for any $\mathbf{w}^\star$ with $\|\mathbf{w}^\star\|_2 \le B$, with probability $1 - \delta$ over the first phase, we have

The proof can be found in Appendix B.5.

Now, if we do have an estimate of $\|\boldsymbol{m}\|_{1/2}$, we can use it to calculate an appropriate step size and to bound the risk, as shown in the next lemma.

###### Lemma 3.9.

Assume we have a value $\hat{M}$ that satisfies $\hat{M} \ge \|\boldsymbol{m}\|_{1/2}$. Let $\bar{\mathbf{w}}$ be the output of Two-Phased DDAERR when run with the corresponding step size. Then for all $\delta \in (0, 1)$ and for any $\mathbf{w}^\star$ with $\|\mathbf{w}^\star\|_2 \le B$, with probability $1 - \delta$ over the first phase, we have

The proof can be found in Appendix B.6.

This lemma gives a non-trivial expected risk bound only if $\hat{M}$ is small enough, but when $T_1$ is small, this is not necessarily the case. Therefore, we would like to unite these two lemmas, to ensure that even in the worst case, we will not have a worse bound than that of the AERR algorithm.

###### Lemma 3.10.

Assume we know a value $\hat{M}$ that satisfies $\hat{M} \ge \|\boldsymbol{m}\|_{1/2}$. Let $\bar{\mathbf{w}}$ be the output of Two-Phased DDAERR when run with

Then for all $\delta \in (0, 1)$ and for any $\mathbf{w}^\star$ with $\|\mathbf{w}^\star\|_2 \le B$, with probability $1 - \delta$ over the first phase, we have

Also, with probability $1 - \delta$ over the first phase, we have

The proof can be found in Appendix B.7.

We could always naively bound $\|\boldsymbol{m}\|_{1/2}$ by $d$, but then, even if $T_1$ tends to infinity, the bound would not be better than the bound of the AERR algorithm. However, if we do have prior knowledge of the value $\|\boldsymbol{m}\|_{1/2}$, it is straightforward to use this lemma to prove Theorem 3.6.

### 3.3 Unknown Second Moment Scenario

In this section, we analyze the case in which we have no prior knowledge of the second moments of the attributes, nor of the value of $\|\boldsymbol{m}\|_{1/2}$. This scenario may occur if the learner is faced with a new learning task and knows nothing about the distribution of the attributes. The problem here, besides not being able to calculate the optimal $p_i$-s, is that we also cannot calculate the optimal step size, $\eta$.

Our solution to this scenario is to use again the Two-Phased DDAERR algorithm (Algorithm 2), and calculate an accurate enough estimate of $\|\boldsymbol{m}\|_{1/2}$ during the first phase.

###### Lemma 3.11.

We can estimate $\|\boldsymbol{m}\|_{1/2}$ by the estimator $\|\hat{\boldsymbol{m}}\|_{1/2}$, which is, with probability $1 - \delta$, an accurate enough estimate.

###### Proof.

Using this estimate we can prove our main theorem of this section.

###### Theorem 3.12.

Assume the distribution $D$ is such that $\|\mathbf{x}\|_2 \le 1$ and $|y| \le B$ with probability 1. Let $\bar{\mathbf{w}}$ be the output of Two-Phased DDAERR when run with

Then for all $\delta \in (0, 1)$ and for any $\mathbf{w}^\star$ with $\|\mathbf{w}^\star\|_2 \le B$, with probability $1 - \delta$ over the first phase, we have

Also, with probability $1 - \delta$ over the first phase, we have

If we examine the bound, we can see that with probability $1 - \delta$ over the first phase, regardless of the value of $T_1$, the expected risk bound is at most $O\big(B^2\sqrt{d/(kT)}\big)$, which is the same bound as that of the AERR algorithm. This means that the Two-Phased DDAERR algorithm performs, with high probability over the first phase, as well as the AERR algorithm, up to a constant factor. Second, as $T_1$ increases, the expected risk bound tends to $O\big(B^2\sqrt{\|\boldsymbol{m}\|_{1/2}/(kT)}\big)$. Therefore, if $T_1$ is large enough, we achieve an improvement over the AERR algorithm. If $T_1 \to \infty$, the bound becomes the same as in the regular DDAERR algorithm with prior knowledge of the second moments of the attributes.

The conclusion is that even if we have no prior knowledge of the second moments of the attributes nor of $\|\boldsymbol{m}\|_{1/2}$, we should still prefer our Two-Phased DDAERR algorithm over the AERR algorithm.

It is interesting to compare the known-$\|\boldsymbol{m}\|_{1/2}$ case with the unknown case. Even though the sampling probabilities are the same, in the unknown case we need more samples to reach the regime where our algorithm significantly improves on AERR. The reason is that the expected risk bound depends strongly on the choice of the step size $\eta$, and calculating its optimal value requires knowledge of $\|\boldsymbol{m}\|_{1/2}$, which is harder to estimate than the moments themselves, as the estimation errors of the individual attributes accumulate.

#### 3.3.1 Proof of Theorem 3.12

The proof is straightforward using Lemma 3.10. First, by denoting , and using
