1 Introduction
The rising interest in the field of Machine Learning (ML) has been strongly driven by the potential to generate economic value. Firms seeking revenue optimizations can gather abundant data at low cost, apply a set of inexpensive algorithmic tools, and produce high-accuracy predictors that can massively improve future decision making. The extent of the potential value that can be created by leveraging data for prediction is apparent in the multimillion-dollar competition bounties offered by companies like Netflix and the Heritage Health Foundation, but perhaps even more so in the aggressive hiring of many ML experts by companies like Google and Facebook.
Many of the theoretical results in ML aim to measure, at least implicitly, the economic efficiency of learning problems. For example, in certain settings we have a reasonably thorough understanding of sample complexity [1], which gives us the precise tradeoff between $n$, the quantity of data at our disposal, and the error or loss rate we want to achieve. Reducing error is always beneficial, of course, but must be weighed against the marginal cost of increasing $n$.
The measures of efficiency in ML have broadened in recent years, in particular because gathering data is typically orders of magnitude cheaper than labeling it. This has led to the emergence of the active learning paradigm [3, 4, 12, 17, 2]. Here, we imagine an interface between the learner and the label provider, where the learner may make label queries on data points in an online fashion. By sequentially choosing which data to label, the learner can greatly reduce the number of labels required to learn [12].
A problem that has received little attention in the learning theory literature is the monetary efficiency of learning when data have differing costs. Indeed, real-world prediction tasks often require obtaining examples held by self-interested, strategic agents; these agents must be incentivized to provide the data they hold, and they have heterogeneous costs for doing so.
In this vein, the present paper seeks to address the following question:
In a world where data is held by self-interested agents with heterogeneous costs for providing it, and in particular when these costs may be arbitrarily correlated with the underlying data, how can we design mechanisms that are incentive-compatible, have robust learning guarantees, and optimize the cost-efficiency tradeoffs inherent in the learning problem?
This question is relevant to many real-world scenarios involving financial and strategic considerations in data procurement. Here are two examples:

In the development of a certain drug, a pharmaceutical company wishes to train a disease classifier based on data obtained by hospitals and stored in patients’ medical records. These data are not public, yet the company can offer hospital patients financial incentives to contribute their private records. We note the potential for cost heterogeneity: the compensation required by patients may be correlated with the content of their medical data (e.g., if they have the disease).
Online retailers generally hope to know more about website visitors in order to better target products to customers. A retailer can offer to buy customers’ demographic and social data, say in the form of access to their Facebook profile. But again, customers’ willingness to sell may covary with their demographic data in an unknown way.
From sample complexity to budget efficiency
The classical problem in statistical learning theory is the following. We are given $n$ datapoints (examples) sampled from some distribution $\mathcal{D}$. Our goal is to select a hypothesis $h$ from a given class $\mathcal{H}$ which “performs well” on unseen data from $\mathcal{D}$. We can specify performance in terms of a loss function $\ell(h, z)$, and we write $R(h)$, known as the risk of $h$, for the expectation of $\ell(h, z)$ on a random draw $z$ from $\mathcal{D}$. The goal is to produce a hypothesis whose risk is not much more than that of $h^*$, the optimal member of $\mathcal{H}$. For example, in binary classification, each data point consists of a pair $z = (x, y)$ where $x$ encodes some “features” and $y$ is the label; a hypothesis is a function $h$ that predicts a label for a given set of features; and a typical loss function, the “0-1 loss”, is defined so that $\ell(h, (x, y)) = 1$ when $h(x) \neq y$ and $0$ otherwise.

Research in statistical learning theory attempts to characterize how well such tasks can be performed in terms of the resources available and the inherent difficulty of the problem. The resource is usually the quantity of data $n$. In binary classification, for instance, the difficulty or richness of the problem is captured by the “VC-dimension” $d$, and a famous result [19] is that there is an algorithm achieving the bound
$$R(\hat{h}) \;\le\; R(h^*) + \tilde{O}\!\left(\sqrt{\frac{d}{n}}\right) \qquad (1)$$
with very high probability over the sample of size $n$.

In the present work we consider an alternative scenario: the learner has a fixed budget $B$ and can use this budget to purchase examples. More precisely, on round $t$ of a sequence of $T$ rounds, agent $t$ arrives with data point $z_t$, sampled i.i.d. from some distribution $\mathcal{D}$, and a cost $c_t \in [0,1]$. This cost is known only to the agent and can depend arbitrarily on $z_t$. The learning mechanism may offer a (possibly randomized) menu of take-it-or-leave-it prices $\pi_t$, with a possibly different price $\pi_t(z)$ for each data point $z$. The arriving agent observes the price $\pi_t(z_t)$ offered for her data and accepts as long as $\pi_t(z_t) \ge c_t$, in which case the mechanism pays the agent $\pi_t(z_t)$ and learns $z_t$. (We discuss the interaction model further in Sections 2 and 8.) Our goal is to actively select prices to offer for different datapoints, subject to a budget $B$, in order to minimize the risk of our final output $\bar{h}$.
At a high level, our main result parallels the classical statistical learning guarantee in (1), but where the limited resource is the budget $B$ instead of the sample size $n$.
Main Result 1 (Informal).
For a large class of problems, there is an active data purchasing algorithm $\mathcal{A}$ that spends at most $B$ in expectation and outputs a hypothesis $\bar{h}$ satisfying
$$\mathbb{E}\big[R(\bar{h})\big] \;\le\; R(h^*) + \tilde{O}\!\left(\sqrt{\frac{K^{\mathcal{A}}}{B}}\right),$$
where $K^{\mathcal{A}}$ is an algorithm-dependent parameter of the (cost, data) sequence capturing the monetary difficulty of learning and the expectation is over the algorithm’s internal randomness.
This bound depends on the quantity $K^{\mathcal{A}}$, which captures the monetary difficulty of the problem at hand. (We also need as prior knowledge a rough estimate of $K^{\mathcal{A}}$.) This is in rough analogy with the VC-dimension in classical bounds such as Equation 1. Similarly, the key resource constraint is now the budget $B$ rather than the quantity of data $n$.

It is important to note that $K^{\mathcal{A}}$ depends on the choice of algorithm $\mathcal{A}$. However, our results also include simpler, algorithm-independent bounds. For instance, replace $K^{\mathcal{A}}$ by $\sqrt{\bar{c}}$, where $\bar{c}$ is the mean of the arriving costs, and Main Result 1 continues to hold (and the only prior knowledge required is a rough estimate of $\bar{c}$). But $K^{\mathcal{A}}$ can be significantly smaller than $\sqrt{\bar{c}}$ when there are particular correlations between the costs and the examples; indeed, we can have $K^{\mathcal{A}} \to 0$ even as $\bar{c}$ stays constant. This indicates a case in which the average cost of data is high, but due to beneficial correlations between costs and data, our mechanism can obtain all the data it needs for good learning very cheaply. We give a thorough discussion of $K^{\mathcal{A}}$ in Section 4.4.
Overview of Techniques
Our general idea for attacking this problem is to utilize online learning algorithms (OLAs) for regret minimization [6]. These algorithms output a hypothesis or prediction at each step $t$, and their performance is measured by the summed loss of these predictions over all the steps. The idea is that the hypotheses produced by the OLA at each step can be used both to determine the value of data during the procurement process and to generate a final prediction.
In Section 3, we lay out the tools we need for a pricing and learning mechanism to interact with OLAs. The first high-level problem is that, because of the budget constraint, our OLA will only see a small subset of the data sequence. We use the tool of importance weighting to give good regret-minimization guarantees even when we do not see the entire data sequence. The second problem is how to aggregate the hypotheses of the OLA and convert its regret guarantee into a risk guarantee for our statistical learning setting. This is achieved with the standard “online-to-batch” conversion [7].
Given the tools of Section 3, the key remaining challenge is to develop a pricing and learning strategy that achieves low regret. We address this question in Section 4. We formally define a model of online learning for regret minimization with purchased data, in which the mechanism must output a hypothesis at each time step and perform well in hindsight against the entire data sequence, but only has enough budget to purchase and observe a fraction of the arriving data. We defer until later our detailed analysis of this setting, derivation of a pricing strategy, and lower bounds. At this point, we present our pricing strategy and regret guarantees for this setting.
In Section 5, we give our main results: risk guarantees for a learner with budget $B$ and access to $T$ arriving agents. These bounds follow directly by combining the tools in Section 3 with the regret-minimization results of Section 4.
In Section 6, we develop a deeper understanding of the regret minimization setting. We derive our pricing strategy from an in-depth analysis of a more analytically tractable variant of the problem, the “at-cost” setting, where the mechanism is only required to pay the cost of the arriving data point rather than the price posted. For this setting, we are able to derive the optimal pricing strategy for minimizing the regret bound of our class of learning algorithms subject to an expected budget constraint.
We also complement our upper bounds by proving lower bounds for data-purchasing regret minimization. These show that our mechanisms for the easier at-cost setting have an order-optimal regret guarantee of $O\big(T K^{\mathcal{A}}/\sqrt{B}\big)$. There is a small gap to our mechanisms for the main regret minimization setting, in which our guarantee is on the order of $T\sqrt{K^{\mathcal{A}}/B}$ (recall that $K^{\mathcal{A}} \le 1$, so this is a weaker guarantee). The dependence approaches the classic $O(\sqrt{T})$ regret bound when $B$ is large (approaching $T$). When $B$ is small but still super-constant, we observe the perhaps counterintuitive fact that we can achieve $o(1)$ average regret per arrival while only observing an $o(1)$ fraction of the arriving data; in other words, we have “no data, no regret.”
Related Work
For “batch” settings in which all agents are offered a price simultaneously, pricing schemes for obtaining data have appeared in recent work, especially Roth and Schoenebeck [16], which considered the design of mechanisms for efficient estimation of a statistic. However, this work and others in related settings [14, 10, 8] consider offline solutions, e.g. drawing a posted price independently for all data points. We focus on an active approach in which the marginal value of individual examples is estimated according to the current learning progress and budget. A data-dependent approach to pricing data does appear in Horel et al. [13], but that paper focuses on a quite different learning setting: a model of regression with noisy samples, approached via budget-feasible mechanism design.
Another difference from the above papers is that we prove risk and regret bounds rather than trying to minimize, e.g., a variance bound, and we also consider a broader class of learning problems.
Other related work.
Other works, such as Ghosh et al. [11], Dekel et al. [9], and Meir et al. [15], focus on a setting in which agents may misreport their data (see also the peer-prediction literature). We suppose that agents may misreport their costs but not their data.
Many of the ideas in the present work draw from recent advances in using importance weighting for the active learning problem [4]. There is a wealth of theoretical research into active learning, including Beygelzimer et al. [5], Balcan et al. [2], Hanneke [12] and many others.
“Budgeted Learning” is a somewhat related area of machine learning, but there the budget is not monetary. The idea is that we do not see all of the features of the data points in our set, but rather have a “budget” of the number of features we may observe (for instance, we may choose any two of the three features height, weight, age).
2 Statistical Learning with Purchased Data
In this section, we formally define the problem setting. The body of the paper will then consist of a series of steps for deriving mechanisms for this setting with provable guarantees, which will finally appear in Section 5.
We consider a statistical learning problem described as follows. Our data points are objects $z \in \mathcal{Z}$. We are given a hypothesis class $\mathcal{H}$, which we will assume is parameterized by vectors in $\mathbb{R}^d$, but more broadly $\mathcal{H}$ can be any Hilbert space endowed with a norm $\|\cdot\|$; for convenience we will treat elements $h \in \mathcal{H}$ as vectors which can be added, scaled, etc. We are also given a loss function $\ell(h, z)$ that is convex in $h$. We assume throughout the paper that the loss function is 1-Lipschitz in $h$; that is, for any $z$ and any $h, h' \in \mathcal{H}$ we have $|\ell(h, z) - \ell(h', z)| \le \|h - h'\|$.

In many common scenarios, $\mathcal{Z}$ is the space of pairs from the cross product $\mathcal{X} \times \mathcal{Y}$, with $x \in \mathcal{X}$ the feature input and $y \in \mathcal{Y}$ the label, though in our setting $z$ can be a more generic object. For example, in the canonical problem of linear regression, we have $z = (x, y) \in \mathbb{R}^d \times \mathbb{R}$, the hypothesis class is vectors $h \in \mathbb{R}^d$, and the loss function is defined according to squared error: $\ell(h, (x, y)) = (y - \langle h, x \rangle)^2$.
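To make the linear-regression example concrete, the squared-error loss and its gradient can be sketched in a few lines of plain Python (a minimal illustration; the function names are ours, not the paper's):

```python
def dot(u, v):
    """Inner product <u, v> of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))

def squared_loss(h, z):
    """Squared-error loss l(h, (x, y)) = (y - <h, x>)^2."""
    x, y = z
    return (y - dot(h, x)) ** 2

def squared_loss_grad(h, z):
    """Gradient of the squared loss with respect to h: -2 (y - <h, x>) x."""
    x, y = z
    residual = y - dot(h, x)
    return [-2.0 * residual * xi for xi in x]
```

The gradient is what the online algorithms of Section 3 consume at each step.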
The data-purchasing statistical learning problem is parameterized by the data space $\mathcal{Z}$, hypothesis space $\mathcal{H}$, loss function $\ell$, number of arriving data points $T$, and expected budget constraint $B$. A problem instance consists of a distribution $\mathcal{D}$ on the set $\mathcal{Z}$ and a sequence of pairs $(z_1, c_1), \dots, (z_T, c_T)$, where each $z_t$ is a data point drawn i.i.d. according to $\mathcal{D}$ and each $c_t \in [0,1]$ is the private cost associated with that data point. The costs may be arbitrarily chosen, i.e. we consider a worst-case model of costs. (For instance, if costs and data are drawn together from a joint, correlated distribution, then this is a special case of our setting.)
In this problem, the task is to design a mechanism implementing the operations “post”, “receive”, and “predict” and interacting with the problem instance as follows.

For each time step $t = 1, \dots, T$:

The mechanism posts a pricing function $\pi_t : \mathcal{Z} \to \mathbb{R}_{\ge 0}$, where $\pi_t(z)$ is the price posted for data point $z$.

Agent $t$ arrives, possessing $(z_t, c_t)$.

If the posted price $\pi_t(z_t) \ge c_t$, then agent $t$ accepts the transaction: The mechanism pays $\pi_t(z_t)$ to the agent and receives $z_t$. If $\pi_t(z_t) < c_t$, agent $t$ rejects the transaction and the mechanism receives a null signal.


The mechanism outputs a prediction $\bar{h} \in \mathcal{H}$.
Note that the mechanism is given the parameters $\mathcal{Z}$, $\mathcal{H}$, $\ell$, $T$, and $B$, but the problem instance is completely unknown to the mechanism prior to the arrivals. The design problem of the mechanism is how to choose the pricing function to post at each time, how to update based on received data, and how to choose the final prediction. The risk or predictive error of a hypothesis $h$ is
$$R(h) = \mathbb{E}_{z \sim \mathcal{D}}\big[\ell(h, z)\big],$$
and the goal of the mechanism is to minimize the risk of its final hypothesis $\bar{h}$. The benchmark is the optimal hypothesis in the class, $h^* = \arg\min_{h \in \mathcal{H}} R(h)$.
The mechanism must guarantee that, for every input sequence $(z_1, c_1), \dots, (z_T, c_T)$, it spends at most $B$ in expectation over its own internal randomness.
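The interaction loop above can be sketched as follows; `mechanism` is an assumed interface (our names, not the paper's) exposing the three operations “post”, “receive”, and “predict”:

```python
def run_interaction(mechanism, arrivals):
    """Simulate the data-purchasing protocol: at each step t the mechanism's
    pricing function is evaluated at the arriving point z_t; agent t accepts
    iff the posted price is at least her private cost c_t."""
    spent = 0.0
    for t, (z, c) in enumerate(arrivals):
        price = mechanism.post_price(t, z)   # pi_t evaluated at z_t; may be randomized
        if price >= c:
            spent += price                   # mechanism pays the posted price...
            mechanism.receive(t, z)          # ...and observes z_t
        else:
            mechanism.receive(t, None)       # rejection: null signal only
    return mechanism.predict(), spent
```

A mechanism for this setting must additionally guarantee that `spent` is at most $B$ in expectation over its internal randomness.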
Agentmechanism interaction.
The model of agent arrival and posted prices contains several assumptions. First, agents cannot fabricate data; they can only report data they actually have to the mechanism. Second, agents are rational in that they accept a posted price when it is higher than their cost and reject otherwise. Third, we have an implementation of the mechanism that can obtain the agent’s cost when the transaction occurs.
We emphasize that the purpose of this paper is not focused on the implementation of such a setting, but instead on developing active learning and pricing techniques and guarantees. This is also intended as a simple and clean model in which to begin developing such techniques. However, we briefly note some possible implementations.
In the most straightforward one, the mechanism posts prices directly to the agent, who responds directly. This would be a weakly truthful implementation, as agents have no incentive to misreport costs after they choose to accept the transaction.
One strictly truthful implementation uses a trusted third party (TTP) that can facilitate the transactions (and guarantee the validity of the data if necessary). For example, we could imagine attempting to learn to classify a disease, and we could rely on a hospital to act as the broker allowing us to negotiate with patients for their data. Then the TTP/agent interaction could proceed as follows:

The learning mechanism submits the pricing function $\pi_t$ to the TTP;

Agent $t$ provides his data point $z_t$ and cost $c_t$ to the TTP;

The TTP determines whether $\pi_t(z_t) \ge c_t$ and, if so, instructs the learner to pay $\pi_t(z_t)$ to the agent and then provides the pair $(z_t, c_t)$ to the learner.
Other possibilities for strictly truthful implementation include using a bit of cryptography (see Section 8).
3 Tools for Converting Regret-Minimizing Algorithms
In this section we begin with the classic regret-minimization problem and a broad class of algorithms for this problem. We then show how to apply techniques that convert these algorithms into a form that will be useful for solving the statistical learning problem with purchased data. The only missing ingredient will then be a price-posting strategy, which will be presented in Section 4.
3.1 Recap of Classic Regret Minimization
In the classic regret-minimization problem, we have a hypothesis class $\mathcal{H}$ with the same assumptions as stated in Section 2. At each time $t$ the algorithm posts a hypothesis $h_t \in \mathcal{H}$. Nature (the adversary, the environment, etc.) selects a Lipschitz convex loss function $f_t : \mathcal{H} \to \mathbb{R}$. (This definition of “loss function” departs from our main setting, which involved $\ell(h, z)$; we will use this somewhat more general setup by choosing $f_t(h) = \ell(h, z_t)$ for the arriving data point $z_t$.) The algorithm observes $f_t$ and suffers loss $f_t(h_t)$.
The loss and regret of the algorithm on this particular input sequence are
$$\text{Loss} = \sum_{t=1}^{T} f_t(h_t), \qquad\qquad (2)$$
$$\text{Regret} = \sum_{t=1}^{T} f_t(h_t) \;-\; \min_{h \in \mathcal{H}} \sum_{t=1}^{T} f_t(h). \qquad (3)$$
The regret objective is what one typically studies in adversarial settings, where we want to discount the loss incurred by the algorithm by the loss suffered by the best possible $h^*$ chosen with knowledge of the sequence of $f_t$'s. As we often consider randomized algorithms, we will generally consider expected loss and regret, where the expectation is over any randomness in the algorithm, not over the (possibly randomized) input sequence of loss functions. An algorithm is said to guarantee a regret bound if the latter upper-bounds its expected regret for every sequence of loss functions $f_1, \dots, f_T$.
We utilize the broad class of Follow-the-Regularized-Leader (FTRL) online algorithms (Algorithm 1) [20, 18]. Special cases of FTRL include Online Gradient Descent, Multiplicative Weights, and others. Each FTRL algorithm is specified by a convex function known as a regularizer, which is usually strongly convex with respect to some norm. For example, Multiplicative Weights follows by using the negative entropy function as a regularizer, which is strongly convex with respect to the $\ell_1$ norm [6]. Online Gradient Descent follows by using the regularizer $\frac{1}{2}\|h\|_2^2$, which is strongly convex with respect to the $\ell_2$ norm. These special cases have efficient closed-form solutions to the update rule for computing $h_{t+1}$.
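For intuition, the Online Gradient Descent special case can be sketched in a few lines. This is a simplified, unconstrained sketch (it ignores projection onto a bounded hypothesis set; names are ours):

```python
def online_gradient_descent(loss_grads, eta, dim):
    """Online Gradient Descent: play h_1 = 0, then step against the
    gradient of each observed loss at the current hypothesis:
        h_{t+1} = h_t - eta * grad f_t(h_t).
    `loss_grads` is a list of gradient oracles, one per round.
    Returns the list of hypotheses h_1, ..., h_{T+1}."""
    h = [0.0] * dim
    hypotheses = [list(h)]
    for grad in loss_grads:
        g = grad(h)                                   # gradient of f_t at h_t
        h = [hi - eta * gi for hi, gi in zip(h, g)]   # closed-form FTRL update
        hypotheses.append(list(h))
    return hypotheses
```

With the squared-$\ell_2$ regularizer, this update coincides with the FTRL rule on the linearized losses.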
It is well known (and indeed follows as a special case of Lemma 3.1) that, under the assumptions of our setting, FTRL algorithms guarantee an expected regret bound of $O(\sqrt{T})$, and this is tight with respect to $T$.
3.2 The Importance-Weighting Technique for Less Data
As a starting point, suppose we wish to design an online learning algorithm that does not observe all of the arriving loss functions, but still performs well against the entire arrival sequence.
Because the arrival sequence may be adversarially chosen, a good algorithm should randomly choose to sample some of the arrivals. In this section, we abstract away the decision of how to randomly sample. (This will be the focus of Section 4.) We suppose that at each time $t$, after posting a hypothesis $h_t$, a probability $p_t$ is specified by some external means as a (possibly random) function of the preceding time steps. With probability $p_t$, we observe $f_t$; with probability $1 - p_t$, we observe nil.

Our goal is to modify the FTRL algorithm for this setting and obtain a modified regret guarantee. Notice crucially that the definitions of loss and regret in (2) and (3) are unchanged: We still suffer the loss $f_t(h_t)$ regardless of whether we observe $f_t$.
The key technique we use is importance weighting. The idea is that, if we only observe each of a sequence of values $v_1, \dots, v_T$, where $v_t$ is observed with probability $p_t$, then we can get an unbiased estimate of their sum by taking the sum of $v_t / p_t$ over those $v_t$ we do observe. To check this fact, let $Q_t$ be the indicator variable for the event that we observe $v_t$ and note that the expectation of our sum is $\mathbb{E}\big[\sum_t Q_t \frac{v_t}{p_t}\big] = \sum_t p_t \frac{v_t}{p_t} = \sum_t v_t$. This is called importance-weighting the observations (and is a specific instance of a more general machine learning technique). Furthermore, if each $v_t/p_t$ is bounded and observed independently, we can expect the estimate to be quite good via tail bounds.

The importance-weighted modification to an online learning algorithm is outlined in Algorithm 2. The importance-weighted regret guarantee we obtain is given in Lemma 3.1. It depends on the following key notation. Our analysis and algorithm require a given norm $\|\cdot\|$, and we recall the definition of the dual norm $\|x\|_* = \sup_{\|y\| \le 1} \langle x, y \rangle$.
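The unbiasedness argument can be checked with a tiny sketch: one function computes the importance-weighted sum from what was observed, and another computes its expectation term by term (names ours):

```python
def importance_weighted_sum(values, probs, observed):
    """Unbiased estimate of sum(values): each observed v_t is up-weighted
    by 1 / p_t; unobserved terms contribute zero."""
    return sum(v / p for v, p, seen in zip(values, probs, observed) if seen)

def estimator_expectation(values, probs):
    """Expectation of the estimator, term by term:
    E[ 1{observe t} * v_t / p_t ] = p_t * (v_t / p_t) = v_t."""
    return sum(p * (v / p) for v, p in zip(values, probs))
```

The expectation equals the true sum regardless of the choice of (nonzero) probabilities.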
Definition 3.1.
Given $h \in \mathcal{H}$ and a convex loss $f$, let $G(h, f) = \|\nabla f(h)\|_*$.
We can informally think of $G_t := G(h_t, f_t)$ both as the “difficulty” of arrival $t$ when the current hypothesis is $h_t$, and as the “value” of observing $f_t$. This interpretation is explored in Section 4 when we define the parameter $K^{\mathcal{A}}$.
Lemma 3.1.
Assume we implement Algorithm 2 with nonzero sampling probabilities $p_1, \dots, p_T$. Assume the underlying OLA is FTRL (Algorithm 1) with a regularizer that is strongly convex with respect to $\|\cdot\|$. Then the expected regret, with respect to the loss sequence $f_1, \dots, f_T$, is no more than
$$C\left(\frac{1}{\eta} + \eta\,\mathbb{E}\left[\sum_{t=1}^{T} \frac{G_t^2}{p_t}\right]\right),$$
where $C$ is a constant depending on the regularizer and $\|\cdot\|$, $\eta > 0$ is a parameter of the algorithm, and the expectation is over any randomness in the choices of $f_t$ and $p_t$.
We can recover the classic regret bound as follows: Take each $p_t = 1$, and note by the Lipschitz assumption that each $G_t \le 1$. Then by setting $\eta = 1/\sqrt{T}$, we get an expected regret bounded by $O(\sqrt{T})$.
3.3 The “Online-to-Batch” Conversion
So far so good: We can convert an online regret-minimization algorithm to use smaller amounts of data, and we postpone the question of how to price data until Section 4. We now address the statistical learning problem, which is how to generate accurate predictions based on the online learning process.

We address this with a standard tool known as the “online-to-batch conversion,” by which we may leverage an online learning algorithm in a “batch” setting. A sketch of this technique is as follows; further details can be found in, e.g., Shalev-Shwartz [18]. Given a batch of i.i.d. data points, feed them one by one into the no-regret algorithm. Because the algorithm has low regret, its hypotheses predict well on average against the sequence. But since each data point was drawn i.i.d., these hypotheses also on average predict well on a fresh i.i.d. draw from the distribution. Thus it suffices to take the mean of the hypotheses to obtain low risk.
Lemma 3.2 (Online-to-Batch [7]).
Suppose the sequence of convex loss functions $f_1, \dots, f_T$ is drawn i.i.d. from a distribution $\mathcal{D}$, and that an online learning algorithm with hypotheses $h_1, \dots, h_T$ achieves expected regret $\bar{R}_T$. Let $\bar{h} = \frac{1}{T}\sum_{t=1}^{T} h_t$ and $R(h) = \mathbb{E}_{f \sim \mathcal{D}}[f(h)]$. For $h^* = \arg\min_{h \in \mathcal{H}} R(h)$, we have
$$\mathbb{E}\big[R(\bar{h})\big] \;\le\; R(h^*) + \frac{\bar{R}_T}{T}.$$
We note that this conversion will continue to hold in the datapurchasing noregret setting we define next, since all that is required is that the algorithm output a hypothesis at each step and that there is a regret bound on these hypotheses.
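The conversion itself is just averaging; a minimal sketch with vector hypotheses represented as lists (names ours):

```python
def online_to_batch(hypotheses):
    """Return the mean of the hypotheses h_1, ..., h_T produced by the
    online algorithm. With convex losses, Jensen's inequality gives
    R(mean) <= mean of the R(h_t), so low average regret translates
    into low excess risk."""
    T = len(hypotheses)
    dim = len(hypotheses[0])
    return [sum(h[i] for h in hypotheses) / T for i in range(dim)]
```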
4 Regret Minimization with Purchased Data
In this section, we define the problem of regret minimization with purchased data. We will design mechanisms with good regret guarantees for this problem, which will translate via the aforementioned online-to-batch conversion (Lemma 3.2) into guarantees for our original problem of statistical prediction.
The essence of the data-purchasing no-regret learning setting is that an online algorithm (“mechanism”) is asked to perform well against a sequence of data, but by default, the mechanism does not have the ability to see the data. Rather, the mechanism may purchase the right to observe data points using a limited budget. The mechanism is still expected to have low regret compared to the optimal hypothesis in hindsight on the entire data sequence (even though it only observes a portion of the sequence).
4.1 Problem Definition
The data-purchasing regret minimization problem is parameterized by the hypothesis space $\mathcal{H}$, number of arriving data points $T$, and expected budget constraint $B$. A problem instance is a sequence of pairs $(f_1, c_1), \dots, (f_T, c_T)$, where each $f_t$ is a convex loss function and each $c_t \in [0,1]$ is the cost associated with that data point. We assume that the $f_t$ are 1-Lipschitz, and let $\mathcal{F}$ denote the set of such loss functions.
In this problem, we design a mechanism implementing the operations “post” and “receive” and interacting with the problem instance as follows.

For each time step $t = 1, \dots, T$:

The mechanism posts a hypothesis $h_t \in \mathcal{H}$ and a pricing function $\pi_t : \mathcal{F} \to \mathbb{R}_{\ge 0}$, where $\pi_t(f)$ is the price posted for loss function $f$.

Agent $t$ arrives, possessing $(f_t, c_t)$.

If the posted price $\pi_t(f_t) \ge c_t$, then agent $t$ accepts the transaction: The mechanism pays $\pi_t(f_t)$ to the agent and receives $f_t$. If $\pi_t(f_t) < c_t$, agent $t$ rejects the transaction and the mechanism receives a null signal.

Note the key differences from the statistical learning setting: We must post a hypothesis at each time step (and we do not output a final prediction), and data is not assumed to come from a distribution.
The goal of the mechanism is to minimize the loss, namely $\sum_{t=1}^{T} f_t(h_t)$. The definition of regret is also the same as in the classical setting (Equation 3). Note that we suffer the loss $f_t(h_t)$ at time $t$ regardless of whether we purchase $f_t$ or not. The mechanism must also guarantee that, for every problem instance $(f_1, c_1), \dots, (f_T, c_T)$, it spends at most $B$ in expectation over its own internal randomness.
4.2 The Importance-Weighting Framework
Recall that, in Section 3.2, we introduced the importance-weighting technique for online learning. This gave regret guarantees for a learning algorithm when each arrival $f_t$ is observed with some probability $p_t$.
Our general approach will be to develop a strategy for randomly drawing posted prices $\pi_t$. This will induce a probability $p_t$ of obtaining each arrival $f_t$.
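For a randomized posted price with a discrete distribution, the induced purchase probability is simply the mass placed on prices at or above the agent's cost; a small sketch (names ours):

```python
def purchase_probability(price_dist, cost):
    """p_t = Pr[posted price >= cost] for a randomized posted-price
    strategy given as a dict {price: probability-mass}."""
    return sum(mass for price, mass in price_dist.items() if price >= cost)
```

This is the quantity that feeds into the importance-weighted regret bound of Lemma 3.1.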
Therefore, the entire problem has been reduced to choosing a posted-price strategy at each time step. This posted-price strategy should attempt to minimize the regret bound while satisfying the expected budget constraint.
A brief sketch of the proof arguments is as follows. After we choose a posted-price strategy, each $p_t$ will be determined as a function of $f_t$ and $c_t$. ($p_t$ is just equal to the probability that our randomly drawn price exceeds the agent’s cost $c_t$.) Thus, we can apply Lemma 3.1, which stated that for these induced probabilities $p_t$, the expected regret of the learning algorithm is at most
$$C\left(\frac{1}{\eta} + \eta\,\mathbb{E}\left[\sum_{t=1}^{T} \frac{G_t^2}{p_t}\right]\right),$$
where $C$ is a constant and $\eta$ is a parameter of the learning algorithm to be chosen later.

After we choose and apply such a strategy, the general approach to proving our regret bounds is to find an a priori bound $Q$ such that $\mathbb{E}\big[\sum_t G_t^2/p_t\big] \le Q$. Then the regret bound becomes $C(1/\eta + \eta Q)$. If we know this upper bound in advance using some prior knowledge, then we can choose $\eta = 1/\sqrt{Q}$ as the parameter for our learning algorithms. This gives a regret guarantee of $O(\sqrt{Q})$.
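The choice of the learning-rate parameter follows from a one-line optimization. Assuming the bound takes the generic FTRL form $C(1/\eta + \eta Q)$, the AM-GM inequality gives

```latex
\min_{\eta > 0} \; C\left(\frac{1}{\eta} + \eta\, Q\right) \;=\; 2C\sqrt{Q},
\qquad \text{attained at } \eta = \frac{1}{\sqrt{Q}},
```

since $\frac{1}{\eta} + \eta Q \ge 2\sqrt{Q}$ for all $\eta > 0$, with equality exactly when $\eta = 1/\sqrt{Q}$.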
4.3 A First Step to Pricing: The “At-Cost” Variant
The bulk of our analysis of the no-regret data-purchasing problem actually focuses on a slightly easier variant of the setting: If the arriving agent accepts the transaction, then the mechanism only has to pay the cost $c_t$ rather than the posted price. We call this the “at-cost” variant of the problem. This setting turns out to be much more analytically tractable: We derive optimal regret bounds for our mechanisms and matching lower bounds. We then take the key approach and insights derived from this variant and apply them to produce a solution to the main no-regret data-purchasing problem. In order to keep the story moving forward, we summarize our results for the “at-cost” setting here and explore how they are obtained in Section 6.
In the at-cost setting, we are able to solve directly for the pricing strategy that minimizes the importance-weighted regret bound of Lemma 3.1. We first define one important quantity, then we state the strategy and result in Theorem 4.1.
Definition 4.1.
For a fixed input sequence $(f_1, c_1), \dots, (f_T, c_T)$, with $G$ as in Definition 3.1, and a mechanism outputting (possibly random) hypotheses $h_1, \dots, h_T$, define
$$K^{\mathcal{A}} \;=\; \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\left[\sqrt{c_t}\; G(h_t, f_t)\right],$$
where the expectation is over the randomness of the algorithm. Note that $K^{\mathcal{A}}$ lies in $[0,1]$ by our assumptions on bounded cost and Lipschitz loss.
Now we give the main result for the at-cost setting.
Theorem 4.1.
There is a mechanism for the “at-cost” problem of data purchasing for regret minimization that interfaces with FTRL and guarantees to meet the expected budget constraint, where for the parameter $K^{\mathcal{A}}$ (Definition 4.1):

The expected regret is bounded by $O\big(T K^{\mathcal{A}}/\sqrt{B}\big)$.

This is optimal in that no mechanism can improve beyond constant factors.

The pricing strategy is to choose a normalization parameter $\lambda > 0$ and draw the price for each $f$ randomly according to a distribution such that, for every cost level $c$,
$$\Pr\big[\pi_t(f) \ge c\big] \;=\; \min\left\{1,\; \frac{G(h_t, f)}{\lambda \sqrt{c}}\right\}.$$
The only prior knowledge required is an estimate of $K^{\mathcal{A}}$ up to a constant factor (used to set $\lambda$).
4.4 Interpreting the Quantity $K^{\mathcal{A}}$
Several of our bounds rely heavily on the quantity $K^{\mathcal{A}}$, which measures, in a sense, the “financial difficulty” of the problem. We now devote some discussion to understanding $K^{\mathcal{A}}$ by answering four questions.
(1) How to interpret $K^{\mathcal{A}}$?
$K^{\mathcal{A}}$ is an average, over time steps $t$, of $\mathbb{E}[\sqrt{c_t}\,G_t]$. Here, $G_t$ intuitively captures both the “difficulty” of the data and also the “value” or “benefit” of observing it. To explain the difficulty aspect: by examining the regret bound for FTRL learning algorithms (e.g. the importance-weighted regret bound of Lemma 3.1 with all $p_t = 1$), one observes that if each $G_t$ is small, then we have an excellent regret bound for our learning algorithm; the problem is “easy.” To explain the value aspect, one can for concreteness take the Online Gradient Descent algorithm: the larger the gradient, the larger the update at this step, and $G_t$ is exactly the norm of the gradient. And in general, the higher $G_t$, the more likely we are to purchase arrival $t$.
Thus, $K^{\mathcal{A}}$ captures the correlations between the value of the arriving data and the cost of that data. If either the mean of the costs or the average benefit of the data converges to $0$, then $K^{\mathcal{A}} \to 0$, and in these cases we can learn with high accuracy very cheaply, as may be expected. More interestingly, it is possible to have both high average costs and high average data values, and yet still have $K^{\mathcal{A}} \to 0$ due to beneficial correlations. In these cases we can learn much more cheaply than might be expected based on either the economic side or the learning side alone.
(2) When should we expect to have good prior knowledge of $K^{\mathcal{A}}$?
Although in general $K^{\mathcal{A}}$ will be domain-specific, there are several reasons for optimism. First, $K^{\mathcal{A}}$ compresses all information about the data and costs into a single scalar parameter (compare to the common mechanism-design assumption that the prior distribution of agents’ values is fully known). Second, we do not need very exact estimates of $K^{\mathcal{A}}$: for order-optimal regret bounds, we only need an estimate within a constant factor. Third, $K^{\mathcal{A}}$ is directly proportional to $\lambda$, the normalization constant in our pricing distribution: If we increase $\lambda$, the probability of obtaining a given data point only decreases, and vice versa. In fact, the best choice of $\lambda$ is the normalization constant so that we run out of budget precisely when the last arrival leaves. Thus, $\lambda$ (equivalently, $K^{\mathcal{A}}$) can be estimated and adjusted online by tracking the “burn rate” (spending per unit time) of the algorithm. In simulations, we have observed success with a simple approach of estimating $K^{\mathcal{A}}$ based on the average correlation so far along with the burn rate: if the current estimate is $\hat{K}$ and there are $T - t$ steps remaining with budget $B'$ remaining to spend, set $\lambda = \hat{K}(T - t)/B'$.
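One hedged way to implement such online adjustment is the following hypothetical sketch (the exact update used in the simulations may differ): compare spending per step so far against remaining budget per remaining step, and rescale the pricing normalization constant accordingly. Raising the constant lowers purchase probabilities and hence slows spending.

```python
def adjust_normalizer(norm, spent, budget, t, T):
    """Hypothetical burn-rate heuristic: rescale the pricing
    normalization constant so that the observed spend-per-step
    matches the remaining-budget-per-remaining-step."""
    steps_left = T - t
    budget_left = budget - spent
    if steps_left <= 0 or budget_left <= 0:
        return float("inf")            # out of budget: stop purchasing
    target_rate = budget_left / steps_left
    actual_rate = spent / max(t, 1)
    return norm * actual_rate / target_rate
```

If spending is on schedule the constant is unchanged; spending too fast scales it up proportionally.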
(3) What can we prove without prior knowledge of $K^{\mathcal{A}}$?
It turns out that, if we only have an estimate of the mean cost $\bar{c}$, respectively of the mean gradient norm $\bar{G}$, then this suffices for regret guarantees on the order of $T\sqrt{\bar{c}/B}$, respectively $T\bar{G}/\sqrt{B}$. This “graceful degradation” will continue to be true in the main setting. The idea is that we can follow the optimal form of the pricing strategy while choosing any normalization constant $\lambda$. It may no longer be optimal, but it will ensure that we satisfy the budget and give guarantees depending on the magnitude of $\lambda$. So all we need is an approximate estimate of some value larger than $K^{\mathcal{A}}$. Both $\sqrt{\bar{c}}$ and $\bar{G}$ are guaranteed to upper-bound $K^{\mathcal{A}}$, so both can be used to pick $\lambda$ while satisfying the budget.
To recap, knowledge of only a simple statistic such as the mean of the arriving costs suffices for good learning guarantees, with better knowledge translating to better guarantees.
(4) depends on the algorithm—what are the implications?
We first note that can be upper-bounded by, for instance, , where is the average of the arriving costs. So a bound containing does imply nontrivial algorithm-independent bounds. The purpose of is to capture cases where we can do significantly better than such bounds because the algorithm is a good fit for the problem. To see this, note that running the FTRL algorithm on the entire data sequence (with no budget constraint) gives a regret bound of . The worst case has each equal to , producing a regret bound. But in a case where the average is small and the algorithm enjoys a better regret bound, we may also hope that this improvement is reflected in .
However, one might hope for an algorithm-independent quantity that, in analogy with VC dimension, captures the “difficulty” of the purchasing and learning problem instance. This leads to the question:
(4a) Can we remove the algorithm-dependence of the bound? One might hope to achieve a bound depending on an algorithm-independent quantity that captures correlations between data and cost. A natural candidate is . In general, there are difficult cases where one cannot achieve a bound in terms of . However, in nicer scenarios we may expect to approximate . For instance, suppose where is a differentiable convex function whose gradient is Lipschitz — commonly used examples include the squared hinge loss and the log loss. Under this condition, where again we are using , we can show that
By the regret guarantee of our mechanism when run with a good algorithm, even initialized with very weak knowledge, this difference in losses per time step is , implying that . A deeper investigation of this phenomenon is a good candidate for future work.
4.5 Mechanisms and Results for Regret Minimization
In the previous section, we presented our results for the easier “at-cost” variant. We now apply the approach derived for that setting to the main regret minimization problem.
For this problem, unlike in the “at-cost” variant, we cannot in general solve for the form of the optimal pricing strategy. This is intuitively because, when we must pay the price we post, the optimal strategy depends on . But the algorithm cannot condition the purchasing decision directly on , as this is private information of the arriving agent.
We propose simply drawing posted prices according to the optimal strategy derived for the at-cost setting, namely,
(4) 
but with a different choice of normalization constant . We note that there is a pricing distribution that accomplishes this:
Observation 1.
For any and , there exists a pricing distribution on that satisfies Equation 4. Letting , the CDF is given by if , if , and if .
As in the known-costs case, our regret bounds depend upon the prior knowledge of the algorithm. It will turn out to be helpful to have prior knowledge about both and the following parameter, which can be interpreted as with all costs :
Theorem 4.2.
If Mechanism 3 is run with prior knowledge of and of (up to a constant factor), then it can choose and to satisfy the expected budget constraint and obtain a regret bound of
where (by setting ). Similarly, knowledge only of , respectively , respectively suffices for the regret bound with , respectively , respectively .
We can observe a quantifiable “price of strategic behavior” in the difference between the regret guarantees of Theorem 4.2 (this setting) and Theorem 4.1 (the “at-cost” setting):
Note that , and they approach equality as all costs approach the upper bound , but become very different as the average cost while the maximum cost remains fixed at .
Comparison to lower bound.
Our lower bound for the data-purchasing regret minimization problem is (this follows from the lower bound for the at-cost setting, Theorem 6.2). So the difference in bounds discussed above, a factor of versus , is the only gap between our upper and lower bounds for the general data-purchasing no-regret problem.
The most immediate open problem in this paper is to close this gap. Intuitively, the lower bound does not take advantage of “strategic behavior” in that a posted-price mechanism may often have to pay significantly more than the data actually costs, meaning that it obtains less data in the long run. Meanwhile, it may be possible to improve on our upper-bound strategy by drawing prices from a different distribution.
5 Results for Statistical Learning
In this section, we give the final mechanism, Mechanism 4, for the data-purchasing statistical learning problem. The idea is to simply run the regret-minimization Mechanism 3 on the arriving agents. At each stage, Mechanism 3 posts a hypothesis . We then aggregate these hypotheses by averaging to obtain our final predictor.
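The aggregation step is a standard online-to-batch conversion. A minimal sketch, with hypotheses represented as plain coordinate vectors (the function name is ours, not the paper's):

```python
def online_to_batch(hypotheses):
    """Average the hypotheses h_1, ..., h_T posted by the
    regret-minimization mechanism into a single final predictor."""
    T = len(hypotheses)
    d = len(hypotheses[0])
    return [sum(h[j] for h in hypotheses) / T for j in range(d)]
```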
Theorem 5.1.
Mechanism 4 guarantees spending at most in expectation and
where , assuming that and are known in advance up to a constant factor.
If one assumes approximate knowledge respectively of , of , or of , then the guarantee holds with respectively , , or .
Proof.
6 Deriving Pricing and the “atcost” Variant
In Section 4.3, we stated our results for the easier at-cost variant of the regret minimization with purchased data problem. This included the posted-price distribution that we use for our main results. In this section, we show how these results and this distribution are derived. The “at-cost” variant is formally defined in exactly the same way as the main setting, except that when and the transaction occurs, the mechanism only pays the cost rather than the posted price .
We first show how our posted-price strategy is derived as the optimal solution to the problem of minimizing regret subject to the budget constraint. The resulting upper bounds for the “at-cost” variant were given in Theorem 4.1. Then, we give some fundamental lower bounds on regret, showing that in general our upper bounds cannot be improved upon here. These lower bounds also hold for the main no-regret data-purchasing problem, where there is a small gap to the upper bounds.
6.1 Deriving an Optimal Pricing Strategy
We begin by asking what seems to be an even easier question. Suppose that for every pair that arrives, we could first “see” , then choose a probability with which to obtain and pay . What would be the optimal probability with which to take this data?
Lemma 6.1.
To minimize the regret bound of Lemma 3.1, the optimal choice of sampling probability is of the form , with normalization factor .
The proof follows by formulating the convex programming problem of minimizing the regret bound of Lemma 3.1 subject to an expected budget constraint. It also gives the form of the normalization constant , which depends on the input data sequence and the hypothesis sequence.
The key insight is now that we can actually achieve the sampling probabilities dictated by Lemma 6.1 using a randomized posted-price mechanism. Notice that these optimal sampling probabilities are decreasing in . In general, when drawing a price from some distribution, the probability that it exceeds will be decreasing in . So it only remains to find the posted-price distribution that induces exactly the sampling probabilities we want, for all simultaneously. That is, by randomly drawing posted prices according to our distribution, we choose to purchase with exactly the probability stated in Lemma 6.1, for any possible value of and without knowing .
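The mechanics can be sketched as follows: if the desired purchase probability is a decreasing function q of the cost, then posting a random price with CDF F(p) = 1 - q(p) yields Pr[price >= c] = q(c) for every cost c. A Python illustration via inverse-transform sampling, where q and its inverse are generic placeholders rather than the specific form of Lemma 6.1:

```python
import random

def posted_price(q_inv):
    """Draw a posted price so that Pr[price >= c] = q(c) for every c,
    where q is a decreasing purchase-probability function with inverse
    q_inv (an illustrative placeholder).  Since q is decreasing,
    price >= c holds exactly when u <= q(c), an event of probability q(c).
    """
    u = random.random()  # u ~ Uniform(0, 1)
    return q_inv(u)
```

For instance, with q(c) = 1 - c on [0, 1] we have q_inv(u) = 1 - u, and the empirical frequency of prices exceeding any cost c approaches 1 - c.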
Thus, our final mechanism for the at-cost variant is to simply apply Mechanism 3, but only pay the cost of the arrival rather than the price we posted. We set . Note that this choice of normalization constant differs from the main setting because we pay less on average in the at-cost setting; this leads to the difference in the regret bounds. Our main bound for the at-cost variant was given in Theorem 4.1. An open problem for this setting is whether one can obtain the same regret bounds without any prior knowledge at all about the arriving costs and data.
6.2 Lower Bounds for Regret Minimization
Here, we prove lower bounds analogous to the classic regret lower bound, which states that no algorithm can guarantee to do better than . These lower bounds will hold even in the “at-cost” setting, where they match our upper bounds. An open problem is to obtain a larger-order lower bound for the main setting where the mechanism pays its posted price. This would show a separation between the at-cost variant and the main problem.
First, we give what might be considered a “sample complexity” lower bound for no-regret learning: It specializes our setting to the case where all costs are equal to one (and this is known to the algorithm in advance), so the question is what regret is achievable by an algorithm that observes of the arrivals.
Theorem 6.1.
Suppose all costs . No algorithm for the at-cost online data-purchasing problem has regret better than ; that is, for every algorithm, there exists an input sequence on which its regret is .
Proof Idea: We will have two coins, with probabilities of coming up heads. We will take one of the coins and provide i.i.d. flips as the input sequence. The possible hypotheses for the algorithm are and the loss is zero if the hypothesis matches the flip and one otherwise. The cost of every data point will be one.
The idea is that an algorithm with regret much smaller than must usually predict heads if it is the heads-biased coin and usually predict tails if it is the tails-biased coin. Thus, it can be used to distinguish these cases. However, there is a lower bound of samples required to distinguish the coins, and the algorithm only has enough budget to gain information about of the samples. Setting gives the regret bound. ∎
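The coin-distinguishing intuition can be checked numerically. The sketch below is our own illustration, not part of the proof: it estimates how often a majority vote over n flips identifies which of two coins, with heads probabilities 1/2 ± eps, generated the data. Accuracy is near 1 when n is large relative to 1/eps², and near 1/2 otherwise.

```python
import random

def distinguish(eps, n, trials=2000, seed=0):
    """Empirical success rate of majority vote at telling apart two coins
    with heads probabilities 0.5 + eps and 0.5 - eps, given n flips."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        # Pick one of the two coins uniformly at random.
        bias = 0.5 + eps if rng.random() < 0.5 else 0.5 - eps
        heads = sum(rng.random() < bias for _ in range(n))
        # Guess the heads-biased coin iff heads are in the majority.
        guess = 0.5 + eps if 2 * heads > n else 0.5 - eps
        correct += guess == bias
    return correct / trials
```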
We next extend this idea to the case with heterogeneous costs. The idea is very simple: Begin with the problem from the label-complexity lower bound, and introduce “useless” data points and heterogeneous costs. The worst or “hardest” case for a given average cost is when cost is perfectly correlated with benefit, so all and only the “useful” data points are expensive.
Theorem 6.2.
No algorithm for the non-strategic online data-purchasing problem has expected regret better than ; that is, for every , for every algorithm, there exists a sequence with parameter on which its regret is . Similarly, for and , we have the lower bounds and .
7 Examples and Experiments
In this section, we give some examples of the performance of our mechanisms on data. We use a binary classification problem with feature vector and label . The dataset is described in Figure 3.
The hypothesis is a hyperplane classifier,
i.e. a vector where the example is classified as positive if and negative otherwise; the risk is therefore the error rate (the fraction of examples misclassified). For the implementation of the online gradient descent algorithm, we use a “convexified” loss function, the well-known hinge loss: where . In our simulations, we give each mechanism access to the exact same implementation of the Online Gradient Descent algorithm, including the same parameter chosen to be , where is the average norm of the data feature vectors. We train on a randomly chosen half of the dataset and test on the other half.
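For concreteness, a minimal version of the online (sub)gradient descent update on the hinge loss max(0, 1 - y⟨w, x⟩) might look like the following. This is a simplified sketch of the standard algorithm, not the paper's exact implementation: fixed step size, no projection, and names of our choosing.

```python
def ogd_hinge(data, eta):
    """Run online subgradient descent on the hinge loss over a stream of
    (x, y) pairs with y in {-1, +1}; returns the final weight vector and
    the number of online classification mistakes."""
    d = len(data[0][0])
    w = [0.0] * d
    mistakes = 0
    for x, y in data:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        if margin <= 0:
            mistakes += 1
        if margin < 1:  # hinge loss is active; its subgradient is -y * x
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w, mistakes
```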
The “baseline” mechanism has no budget cap and purchases every data point. The “naive” mechanism offers a maximum price of for every data point until out of budget. “Ours” is an implementation of Mechanism 4. We do not use any prior knowledge of the costs at all: We initialize and then adjust online by estimating from the data purchased so far. (For a symmetric comparison, we do not adjust accordingly; instead we leave it at the same value as used with the other mechanisms.) The examples are shown in Figure 4.
8 Discussion and Conclusion
8.1 AgentMechanism Interaction Model
Our model of interaction, while perhaps the simplest initial starting point, involves some subtleties that may be interesting to address in the future. A key property is that we need to obtain both an arriving agent’s data point and her cost . The reason is that the cost is used to importance-weight the data based on the probability of picking a price larger than that cost. (The cost report is also required by [16] for the same reason.) As discussed in Section 2, a naïve implementation of this model is incentive-compatible but not strictly so. Exploring implementations, such as the trusted-third-party approach mentioned, is an interesting direction. For instance, in a strictly truthful implementation, the arriving agent can cryptographically commit to a bid, e.g. by submitting a cryptographic hash of her cost. Then the prices are posted by the mechanism. If the agent accepts, she reveals her data and her cost, verifying that the cost hashes to her commitment. It is strictly truthful for the agent to commit to her true cost.
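The commit-then-reveal step could be realized as follows. This is an illustrative sketch using a salted SHA-256 hash; the paper does not prescribe a particular scheme, and the function names are ours.

```python
import hashlib
import secrets

def commit(cost):
    """Agent commits to a cost report before seeing the posted price.
    Hashing the cost together with a random nonce makes the commitment
    hiding (it reveals nothing about the cost) and binding (the agent
    cannot later claim a different cost)."""
    nonce = secrets.token_hex(16)
    digest = hashlib.sha256(f"{cost}:{nonce}".encode()).hexdigest()
    return digest, nonce

def verify(digest, cost, nonce):
    """Mechanism checks the revealed (cost, nonce) against the commitment."""
    return hashlib.sha256(f"{cost}:{nonce}".encode()).hexdigest() == digest
```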
This paper focused on the learning-theoretic aspects of the problem, but exploring the model further or proposing alternatives is also of interest for future work.
8.2 Conclusions and Directions
The contribution of this work was to propose an active scheme for learning from and pricing data as it arrives online, held by strategic agents. The active approach allows learning from past data and selectively pricing future data. Our mechanisms interface with existing no-regret algorithms in an essentially black-box fashion (although the proof depends on the specific class of algorithms). The analysis relies on showing that they have good guarantees in a model of no-regret learning with purchased data. This no-regret setting may be of interest for future work, either to achieve good guarantees with no foreknowledge at all other than the maximum cost, or to propose variants on the model.
The no-regret analysis means our mechanisms are robust to adversarial input. But in nicer settings, one might hope to improve on the guarantees. One direction is to assume that costs are drawn according to a known marginal distribution (although the correlation with the data is unknown). A combination of our approach and the posted-price distributions of Roth and Schoenebeck [16] may be fruitful here.
Broadly, the problem of purchasing data for learning has many potential models and directions for study. One motivating setting, closer to crowdsourcing, is an active problem where data points consist of pairs (example, label) and the mechanism can offer a price for anyone who obtains the label of a given example. In an online arrival scheme, such a mechanism could build on the importanceweighted active learning paradigm [4].
Acknowledgments
The authors thank Mike Ruberry for discussion and formulation of the problem. Thanks to the organizers and participants of the 2014 Indo-US Lectures Week in Machine Learning, Game Theory and Optimization, Bangalore.
We gratefully acknowledge the support of the National Science Foundation under awards CCF-1301976 and IIS-1421391. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors alone.
References
 Anthony and Bartlett [2009] Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.
 Balcan et al. [2006] Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. In Proceedings of the 23rd International Conference on Machine Learning (ICML'06), 2006.
 Balcan et al. [2010] Maria-Florina Balcan, Steve Hanneke, and Jennifer Wortman Vaughan. The true sample complexity of active learning. Machine Learning, 80(2-3):111–139, 2010.
 Beygelzimer et al. [2009] Alina Beygelzimer, Sanjoy Dasgupta, and John Langford. Importance weighted active learning. In Proceedings of the 26th International Conference on Machine Learning (ICML'09), 2009.
 Beygelzimer et al. [2010] Alina Beygelzimer, Daniel Hsu, John Langford, and Tong Zhang. Agnostic active learning without constraints. In Advances in Neural Information Processing Systems (NIPS'10), 2010.
 Cesa-Bianchi and Lugosi [2006] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
 Cesa-Bianchi et al. [2004] Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.
 Cummings et al. [2015] Rachel Cummings, Katrina Ligett, Aaron Roth, Zhiwei Steven Wu, and Juba Ziani. Accuracy for sale: Aggregating data with a variance constraint. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, pages 317–324. ACM, 2015.
 Dekel et al. [2008] Ofer Dekel, Felix Fischer, and Ariel D. Procaccia. Incentive compatible regression learning. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 884–893. Society for Industrial and Applied Mathematics, 2008.
 Ghosh and Roth [2011] Arpita Ghosh and Aaron Roth. Selling privacy at auction. In Proceedings of the 12th ACM Conference on Electronic Commerce (EC'11), 2011.
 Ghosh et al. [2014] Arpita Ghosh, Katrina Ligett, Aaron Roth, and Grant Schoenebeck. Buying private data without verification. In Proceedings of the Fifteenth ACM Conference on Economics and Computation (EC'14), pages 931–948. ACM, 2014.
 Hanneke [2009] Steve Hanneke. Theoretical Foundations of Active Learning. ProQuest, 2009.
 Horel et al. [2014] Thibaut Horel, Stratis Ioannidis, and Muthu Muthukrishnan. Budget feasible mechanisms for experimental design. In Latin American Theoretical Informatics (LATIN'14), 2014.
 Ligett and Roth [2012] Katrina Ligett and Aaron Roth. Take it or leave it: Running a survey when privacy comes at a cost. In The 8th Workshop on Internet and Network Economics (WINE'12), 2012.
 Meir et al. [2012] Reshef Meir, Ariel D. Procaccia, and Jeffrey S. Rosenschein. Algorithms for strategyproof classification. Artificial Intelligence, 186:123–156, 2012.
 Roth and Schoenebeck [2012] Aaron Roth and Grant Schoenebeck. Conducting truthful surveys, cheaply. In 13th ACM Conference on Electronic Commerce (EC'12), 2012.
 Settles [2011] Burr Settles. From theories to queries: Active learning in practice. In Active Learning and Experimental Design Workshop, pages 1–18, 2011.
 Shalev-Shwartz [2012] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2012.
 Vapnik [2000] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2000.
 Zinkevich [2003] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML'03), 2003.
Appendix
Appendix A Tools for Converting RegretMinimizing Algorithms
Lemma A.1 (Lemma 3.1).
Assume we implement Algorithm 2 with nonzero sampling probabilities . Assume the underlying OLA is FTRL (Algorithm 1) with regularizer that is strongly convex with respect to . Then the expected regret, with respect to the loss sequence , is no more than
where is a constant depending on and , is a parameter of the algorithm, and the expectation is over any randomness in the choices of and .
Proof.
Let . We wish to prove that
where is shorthand for and
As a prelude, note that in general these expectations could be quite tricky to deal with. We consider a fixed input sequence , but each random variable depends on the prior sequence of variables and outcomes. However, we will see that a nice feature of the importance-weighting technique of Algorithm 2 helps make this problem tractable.
Some preliminaries: Define the importance-weighted loss function at time to be the random variable
Let be the indicator random variable equal to 1 if we obtain , which occurs with probability , and equal to 0 otherwise. Then notice that for any hypothesis , (5)
To be clear, the expectation is over the random outcome of whether or not we obtain data point , conditioned on the value of ; and conditioned on the value of , by definition we obtain data point with probability and obtain the zero function otherwise.
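This unbiasedness is easy to verify numerically: purchasing with probability q and weighting the observed loss by 1/q recovers the true loss in expectation. A small sanity check, illustrative only and not part of the proof:

```python
import random

def iw_estimate(loss_value, q, trials=100000, seed=0):
    """Average the importance-weighted loss over many simulated rounds:
    with probability q we 'purchase' and record loss_value / q; otherwise
    we record the zero function.  The average should approach loss_value."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        if rng.random() < q:        # indicator Q = 1: point obtained
            total += loss_value / q
    return total / trials
```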
Now we proceed with the proof. For any method of choosing and any resulting outcomes of , Algorithm 2 reduces to running the Follow-the-Regularized-Leader algorithm on the sequence of convex loss functions . Thus, by the regret bound proof for FTRL (Lemma A.2), FTRL guarantees that for every fixed “reference hypothesis” :
where
(Recall that .) Now we will take the expectation of both sides, separating out the expectation over the choice of , over , and over :