In the age of automation, data is king. The statistics and machine learning algorithms that help curate our online content, diagnose our diseases, and drive our cars, among other things, are all fueled by data. Typically, this data is mined by happenstance: as we click around on the internet, seek medical treatment, or drive “smart” vehicles, we leave a trail of data. This data is recorded and used to make estimates and train machine learning algorithms. So long as representative data is readily abundant, this approach may be sufficient. But some data is sensitive and therefore inaccurate, rare, or lacking detail in observable data traces. In such cases, it is more expedient to buy the necessary data directly from the population.
Consider, for example, the problem a public health administration faces in trying to learn the average weight of a population, perhaps as an input to estimating the risk of heart disease. Weight is a sensitive personal characteristic, and people may be loath to disclose it. It is also variable over time, and so must be collected close to the time of the average weight estimate in order to be accurate. Thus, while other characteristics, like height, age, and gender, are fairly accurately recorded in, for example, driver’s license databases, weight is not. The public health administration may try surveying the public to get estimates of the average weight, but these surveys are likely to have low response rates and be biased towards healthier low-weight samples.
In this paper, we propose a mechanism for buying verifiable data from a population in order to estimate a statistic of interest, such as the expected value of some function of the underlying data. We assume each individual has a private cost, or disutility, for revealing his or her sensitive data to the analyst. Importantly, this cost may be correlated with the private data. For example, overweight or underweight individuals may have a higher cost of revealing their data than people of a healthy weight. Individuals wish to maximize their expected utility, which is the expected payment they receive for their data minus their expected cost. The analyst has a fixed budget for buying data. The analyst does not know the distribution of the data: properties of the distribution are what she is trying to learn from the data samples, so it is important that she use the data she collects to learn them rather than rely on an inaccurate prior distribution (for example, the analyst may have a prior on the weight distribution within a population from DMV records or previous surveys, but such a prior may be erroneous if people do not accurately report their weights). However, we do assume the analyst has a prior for the marginal distribution of costs, and that she estimates how much a survey may cost her as a function of said prior. [Footnote 1: This prior could come from similar past exercises. Alternatively, when no prior is known, the analyst can allocate a fraction of her budget to buying data for the sake of learning this distribution of costs. In this paper, we follow prior work (e.g., Roth and Schoenebeck) and assume that a prior distribution is known, instead of focusing on how one might learn the distribution of costs.]
The analyst would like to buy data subject to her budget, then use that data to obtain an unbiased estimator for the statistic of interest. To this end, the analyst posts a menu of probability-price pairs $(A, P)$. Each individual with cost $c$ selects a pair $(A, P)$ from the menu, at which point the analyst buys the data with probability $A$ at price $P$. The expected utility of the individual is thus $A \cdot (P - c)$. [Footnote 2: As we show, this menu-based formulation is fully general and captures arbitrary data-collection mechanisms.] To form an estimate based on this collected data, we assume the analyst uses inverse propensity scoring, pioneered by Horvitz and Thompson. This is the unique unbiased linear estimator; it works by upweighting the data from each individual by the inverse of his/her selection probability, $1/A$.
The Horvitz-Thompson estimator always generates an unbiased estimate of the statistic being measured, regardless of the price menu. However, the precision of the estimator, as measured by the variance or mean-squared error of the estimate, depends on the menu of probability-price pairs offered to each individual. For example, offering a high price would generate data samples with low bias (since many individuals would accept such an offer), but the budget would limit the number of samples. Offering low prices allows the mechanism to collect more samples, but these would be more heavily biased, requiring more aggressive correction, which introduces additional noise. The goal of the analyst is to strike a balance between these forces and post a menu that minimizes the variance of her estimate in the worst case over all possible joint distributions of the data and cost consistent with the cost prior. We note that this problem setting was first studied by Roth and Schoenebeck, who characterized an approximately optimal mechanism for moment estimation.
1.1 Summary of results and techniques
Our main contribution comes in the form of an exact solution for the optimal menu, as discussed in Section 3. As one would expect, if the budget is large, the optimal menu offers to buy, with probability $1$, all data at a cost equal to the maximum cost in the population. If the budget is small, the optimal menu buys data from an individual with probability inversely proportional to the square root of their cost. [Footnote 3: Of course, the individual is him/herself selecting the menu option and so the use of an active verb in this context is perhaps a bit misleading. What we mean here is that, given his/her incentives based on his/her private cost, the choice the individual selects is one that buys his/her data with probability inversely proportional to the square root of his/her cost.] Interestingly, in intermediate regimes, we show the optimal menu employs pooling: for all individuals with sufficiently low private cost, it buys their data with equal probability; for the remaining high-cost agents, it buys their data with probability inversely proportional to the square root of their costs. Revisiting the example of estimating the weight of a population of size , our scheme suggests the following solution. Imagine the costs are with probability , and the total budget of the analyst is . The analyst brings a scale to a public location and posts the following menu of pairs of allocation probability and price: . A simple calculation shows that individuals with cost or will pick the first menu option: stepping on the scale and having their weight recorded with probability , and receiving a payment of dollars. Individuals with cost will pick the second menu option; if they are selected to step on the scale, which happens with probability , the analyst records their weight scaled by a factor of . This scaling is precisely the upweighting from inverse propensity scoring. In expectation over the population, the analyst spends exactly her budget . The estimate is the average of the scaled weights.
We show how to extend our approach in multiple directions. First, our characterization of the optimal mechanism holds even when the quantity to be estimated is the expected value of a $d$-dimensional moment function of the data. Second, we extend our techniques beyond moment estimation to the common task of multi-dimensional linear regression. In this regression problem, an individual’s data includes both features (which are assumed to be insensitive or publicly available) and outcomes (which may be sensitive). The analyst’s goal is to estimate the linear regression coefficients that relate the outcomes to the features. We make the assumption that an individual’s cost is independent of her features, but may be arbitrarily correlated with her outcome. For example, the goal might be to regress a health outcome (such as severity of a disease) on demographic information. In this case, we might imagine that an agent incurs no cost for reporting his age, height or gender, but his cost might be highly correlated with his realized health outcome. In such a setting, we show that the asymptotically optimal allocation rule, given a fixed average budget per agent as the number of agents grows large, can be calculated efficiently and exhibits a pooling region as before. However, unlike for moment estimation, agents with intermediate costs can also be pooled together. We further show that our results extend to non-linear regression in Appendix A, under mild additional conditions on the regression function.
Our techniques rely on i) reducing the mechanism design problem to an optimization problem through the classical notion of virtual costs, then ii) reducing the problem of optimizing the worst-case variance to that of finding an equilibrium of a zero-sum game between the analyst and an adversary. The adversary’s goal is to pick a distribution of data, conditional on agents’ costs, that maximizes the variance of the analyst’s estimator. We then characterize such an equilibrium through the optimality conditions for convex optimization described in .
1.2 Related work
A growing amount of attention has been placed on understanding interactions between the strategic nature of data holders and the statistical inference and learning tasks that use data collected from these holders. The work on this topic can be roughly divided into two categories according to whether money is used for incentive alignment.
In the first category, individuals as data holders do not directly derive utility from the accuracy of the inference or learning outcome, but in some cases may incur a privacy cost if the outcome leaks their private information. The analyst uses monetary payments to incentivize agents to reveal their data. Our work falls into this category. Prior papers by Roth and Schoenebeck and Abernethy et al. are closest to our setting. Similarly to our work, both consider an analyst’s problem of purchasing data from individuals with private costs subject to a budget constraint, allow the cost to be correlated with the value of data, and assume that individuals cannot fabricate their data. Roth and Schoenebeck aim at obtaining an optimal unbiased estimator with minimum worst-case variance for the population mean, but their mechanism achieves optimality only approximately: instead of the actual worst-case variance, a bound on the worst-case variance is minimized. While our setting is identical to that of Roth and Schoenebeck, our work precisely minimizes worst-case variance (under a regularity assumption on the cost distribution), and our main contribution is to exhibit the structure of the optimal mechanism, as well as to extend our results to broader classes of statistical inference, namely moment estimation and linear regression. In particular, compared to Roth and Schoenebeck, our solution exhibits new structure in the form of a pooling region for low-cost agents; i.e., the optimal mechanism pools agents with the lowest costs together and treats them identically. Such structure does not arise in the mechanism of Roth and Schoenebeck under a regularity assumption on the cost distribution. Abernethy et al. consider general supervised learning. They do not seek to achieve a notion of optimality; instead, they take a learning-theoretic approach and design mechanisms to obtain learning guarantees (risk bounds).
Several papers consider data acquisition models with different objectives under the assumptions that (a) individuals do not fabricate their data, and (b) private costs and value of data are uncorrelated. For example, in the work of Cummings et al. , the analyst can decide the level of accuracy for data purchased from each individual, and wishes to guarantee a certain desired level of accuracy of the aggregated information while minimizing the total privacy cost incurred by the agents. Cai et al.  focus on incentivizing individuals to exert effort to obtain high-quality data for the purpose of linear regression. Another line of research in the first category examines the data acquisition problem under the lens of differential privacy [13, 10, 12, 20, 5]. The mechanism designer then uses payments to balance the trade-off between privacy and accuracy.
In the second category, individuals’ utilities directly depend on the inference or learning outcome (e.g., they want a regression line to be as close to their own data point as possible) and hence they have incentives to manipulate their reported data to influence the outcome. There often is no cost for reporting one’s data. The data analyst, without using monetary payments, attempts to design or identify inference or learning processes so that they are robust to potential data manipulations. Most papers in this category assume that independent variables (feature vectors) are unmanipulable public information and dependent variables are manipulable private information [7, 16, 17, 21], though some papers consider strategic manipulation of feature vectors [14, 8]. Such strategic data manipulations have been studied for estimation , classification [16, 17, 14], online classification , regression [22, 7], and clustering . Work in this category is closer to mechanism design without money in the sense that it focuses on incentive alignment in acquiring data (e.g., strategy-proof algorithms) but often does not evaluate the performance of the inference or learning, with a few notable exceptions [14, 8].
2 Model and Preliminaries
There is a population of $n$ agents. Each agent $i$ has a private pair $(z_i, c_i)$, where $z_i \in \mathcal{Z}$ is a data point and $c_i > 0$ is a cost. We think of $c_i$ as the disutility agent $i$ incurs by releasing her data $z_i$. The pair $(z_i, c_i)$ is drawn from a distribution $\mathcal{D}$, unknown to the mechanism designer. We denote with $\mathcal{F}$ the CDF of the marginal distribution of costs, supported on a set $\mathcal{C}$. [Footnote 4: Throughout the text we will use the CDF to refer to the distribution itself.] We assume that $\mathcal{F}$ and the support of the data points, $\mathcal{Z}$, are known. However, the joint distribution $\mathcal{D}$ of data and costs is unknown.
A survey mechanism is defined by an allocation rule $A : \mathcal{C} \to [0, 1]$ and a payment rule $P : \mathcal{C} \to \mathbb{R}$, and works as follows. Each agent $i$ arrives at the mechanism in sequence and reports a cost $\hat{c}_i$. The mechanism chooses to buy the agent’s data with probability $A(\hat{c}_i)$. If the mechanism buys the data, then it learns the value of $z_i$ (i.e., agents cannot misreport their data) and pays the agent $P(\hat{c}_i)$. Otherwise the data point is not learned and no payment is made.
We assume agents have quasi-linear utilities, so that the utility enjoyed by agent $i$ with true cost $c_i$ when reporting $\hat{c}_i$ is
$$u(\hat{c}_i; c_i) = A(\hat{c}_i) \cdot \left(P(\hat{c}_i) - c_i\right).$$
We will restrict attention to survey mechanisms that are truthful and individually rational.
Definition 1 (Truthful and Individually Rational - TIR).
A survey mechanism is truthful if for any cost $c$ it is in the agent’s best interest to report their true cost, i.e., for any report $\hat{c}$:
$$u(c; c) \ge u(\hat{c}; c).$$
It is individually rational if, for any cost $c$, $u(c; c) \ge 0$.
We assume that the mechanism is constrained in the amount of payment it can make to the agents. We will formally define this as an expected budget constraint for the survey mechanism.
Definition 2 (Expected Budget Constraint).
A mechanism respects a budget constraint $B$ if:
$$n \cdot \mathbb{E}_{c \sim \mathcal{F}}\left[P(c) \cdot A(c)\right] \le B.$$
The designer (or data analyst) wishes to use the survey mechanism to estimate some parameter $\theta$ of the marginal distribution of data points. [Footnote 5: We also extend our results to multi-dimensional parameters; see Section 4.] For example, it might be that $\mathcal{Z} \subseteq \mathbb{R}$ and $\theta$ is the mean of the distribution over data points in the population. To this end, the designer will apply an estimator to the collection $S$ of data points elicited by the survey mechanism. We will write $\hat{\theta}_S$ for the estimator used. Note that the value of the estimator depends on the sample $S$, but might also depend on the distribution of costs $\mathcal{F}$ and the survey mechanism. Due to the randomness inherent in the survey mechanism (both in the choice of data points sampled and the values of those samples), we think of $\hat{\theta}_S$ as a random variable, drawn from a distribution. We will focus exclusively on unbiased estimators.
Definition 3 (Unbiased Estimator).
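Given an allocation function $A$, an estimator $\hat{\theta}_S$ for $\theta$ is unbiased if for any instantiation of the true distribution $\mathcal{D}$ its expected value is equal to $\theta$:
$$\mathbb{E}\left[\hat{\theta}_S\right] = \theta.$$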
Given a fixed choice of estimator, the mechanism designer wants to construct the survey mechanism to minimize the variance (finite sample or asymptotic as the population grows) of that estimator. Since the designer does not know the distribution , we will work with the worst-case variance over all instantiations of that are consistent with the cost marginal .
Definition 4 (Worst-Case Variance).
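Given an allocation function $A$ and an instance of the true distribution $\mathcal{D}$, the variance of an estimator $\hat{\theta}_S$ is defined as:
$$\mathrm{Var}(\hat{\theta}; \mathcal{D}, A) = \mathbb{E}\left[\left(\hat{\theta}_S - \mathbb{E}[\hat{\theta}_S]\right)^2\right].$$
The worst-case variance of $\hat{\theta}_S$ is
$$\mathrm{Var}^*(\hat{\theta}; \mathcal{F}, A) = \sup_{\mathcal{D} \text{ consistent with } \mathcal{F}} \mathrm{Var}(\hat{\theta}; \mathcal{D}, A).$$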
We are now ready to formally define the mechanism design problem faced by the data analyst.
Definition 5 (Analyst’s Mechanism Design Problem).
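Given an estimator $\hat{\theta}_S$ and cost distribution $\mathcal{F}$, the goal of the designer is to design an allocation rule $A$ and payment rule $P$ so as to minimize worst-case variance subject to the truthfulness, individual rationality and budget constraints:
$$\inf_{A, P} \; \mathrm{Var}^*(\hat{\theta}; \mathcal{F}, A) \quad \text{s.t.} \quad n \cdot \mathbb{E}_{c \sim \mathcal{F}}\left[P(c) \cdot A(c)\right] \le B, \quad A(c) \in [0, 1] \text{ for all } c, \quad (A, P) \text{ truthful and individually rational}.$$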
Implementing Surveys as Posted Menus.
The formulation above describes surveys as direct-revelation mechanisms, where agents report costs. We note that an equivalent indirect implementation might be more natural: a posted menu survey offers each agent a menu of (price, probability) pairs $(P_1, A_1), \ldots, (P_k, A_k)$. If the agent chooses $(P_t, A_t)$ then their data is elicited with probability $A_t$, in which case they are paid $P_t$. Each agent can choose the item that maximizes their expected utility, i.e., $\arg\max_t A_t \cdot (P_t - c)$. By the well-known taxation principle, any survey mechanism can be implemented as a posted menu survey, and the number of menu items required is at most the size of the support of the cost distribution.
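The agent's side of a posted menu survey can be sketched in a few lines. This is only an illustration: the function name `best_menu_option`, the tuple order (probability, price), and the menu values in the usage note are hypothetical and are not the numbers of the paper's running example. The sketch also assumes that an agent who would get negative expected utility from every item simply declines.

```python
from typing import List, Optional, Tuple

# A menu item is (probability A that the data is bought, price P paid if bought).
MenuItem = Tuple[float, float]


def best_menu_option(cost: float, menu: List[MenuItem]) -> Optional[MenuItem]:
    """Return the item maximizing expected utility A * (P - cost).

    Returns None when every item yields negative expected utility
    (assumption: such an agent opts out of the survey).
    """
    scored = [(prob * (price - cost), (prob, price)) for prob, price in menu]
    utility, option = max(scored, key=lambda pair: pair[0])
    return option if utility >= 0 else None
```

With the hypothetical menu `[(1.0, 4.0), (0.25, 8.0)]`, an agent with cost 1 prefers the certain sale at price 4, an agent with cost 6 prefers the low-probability, high-price item, and an agent with cost 10 declines both.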
2.1 Reducing Mechanism Design to Optimization
We begin by reducing the mechanism design problem to a simpler full-information optimization problem where the designer knows the private cost of each player and can acquire their data by paying them exactly that cost. However, the designer is constrained to using monotone allocation rules, in which players with higher costs have weakly lower probability of being chosen.
Definition 6 (Analyst’s Optimization Problem).
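Given an estimator $\hat{\theta}_S$ and cost distribution $\mathcal{F}$, the optimization version of the designer’s problem is to find a non-increasing allocation rule $A$ that minimizes worst-case variance subject to the budget constraint, assuming agents are paid their cost:
$$\inf_{A} \; \mathrm{Var}^*(\hat{\theta}; \mathcal{F}, A) \quad \text{s.t.} \quad n \cdot \mathbb{E}_{c \sim \mathcal{F}}\left[c \cdot A(c)\right] \le B, \quad A(c) \in [0, 1] \text{ for all } c, \quad A \text{ non-increasing}.$$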
Definition 7 (Virtual Costs and Regular Distributions).
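If $\mathcal{F}$ is continuous and admits a density $f$ then define the virtual cost function as $\phi(c) = c + \frac{\mathcal{F}(c)}{f(c)}$. If $\mathcal{F}$ is discrete with support $\mathcal{C} = \{c_1, \ldots, c_{|\mathcal{C}|}\}$ and PDF $f$, then define the virtual cost function as: $\phi(c_t) = c_t + (c_t - c_{t-1}) \frac{\mathcal{F}(c_{t-1})}{f(c_t)}$, with $c_0 = 0$. We also denote with $\mathcal{F}_\phi$ the distribution of virtual costs; i.e., the distribution created by first drawing $c$ from $\mathcal{F}$ and then mapping it to $\phi(c)$. A distribution $\mathcal{F}$ is regular if the virtual cost function is increasing.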
When is twice-continuously differentiable, is regular if and only if for all . Importantly, in this case, the allocation rule of Roth and Schoenebeck  is monotone strictly decreasing in and does not exhibit a pooling region at low-cost as our solution does. The following is an analogue of Myerson’s  reduction of mechanism design to virtual welfare maximization, adapted to the survey design setting.
Lemma 1. If the distribution of costs is regular, then solving the Analyst’s Mechanism Design Problem reduces to solving the Analyst’s Optimization Problem for the distribution of virtual costs.
Proof. The proof is given in Appendix B.1. ∎
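As a small worked illustration of Definition 7, consider costs uniform on $[0, c_{\max}]$. The calculation below assumes the standard continuous virtual cost form $\phi(c) = c + \mathcal{F}(c)/f(c)$; the symbols $\phi$, $f$ and $c_{\max}$ are notation chosen here for the example.

$$\mathcal{F}(c) = \frac{c}{c_{\max}}, \qquad f(c) = \frac{1}{c_{\max}}, \qquad \phi(c) = c + \frac{c / c_{\max}}{1 / c_{\max}} = 2c.$$

Since $\phi(c) = 2c$ is increasing, the uniform distribution is regular, and under the reduction the analyst can treat each agent as if their cost were doubled and then pay exactly that (virtual) cost.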
2.2 Unbiased Estimation and Inverse Propensity Scoring
We now describe a class of estimators that we will focus on for the remainder of the paper. Note that simply calculating the quantity of interest, $\theta$, on the sampled data points can lead to bias, due to the potential correlation between costs and data. For instance, suppose that $\mathcal{Z} \subseteq \mathbb{R}$ and the goal is to estimate the mean of the distribution of $z$. A natural estimator is the average of the collected data: $\frac{1}{|S|} \sum_{i \in S} z_i$. However, if players with lower $z_i$ tend to have lower cost, and are therefore selected with higher probability by the analyst, then this estimator will consistently underestimate the true mean.
This problem can be addressed using inverse propensity scoring (IPS), pioneered by Horvitz and Thompson . The idea is to recover unbiasedness by weighting each data point by the inverse of the probability of observing it. This IPS approach can be applied to any parameter estimation problem where the parameter of interest is the expected value of an arbitrary moment function .
Definition 8 (Horvitz-Thompson Estimator).
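The Horvitz-Thompson estimator for the case when the parameter of interest is the expected value of a (moment) function $m : \mathcal{Z} \to \mathbb{R}$ is defined as:
$$\hat{\theta}_S = \frac{1}{n} \sum_{i \in S} \frac{m(z_i)}{A(c_i)},$$
where $S$ is the set of agents whose data was bought.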
The Horvitz-Thompson estimator is the unique unbiased estimator that is a linear function of the observations $m(z_i)$. It is therefore without loss of generality to focus on this estimator if one restricts to unbiased linear estimators. [Footnote 6: We note that we have assumed, for convenience, that $A(c) > 0$ for all $c$ in the expression of this estimator, for it to be unbiased and well-defined. It is easy to see from the expression for the variance given in Section 3 that the variance-minimizing allocation rule will indeed be non-zero for each cost.]
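The bias of the naive average and its correction by inverse propensity scoring can be seen in a small simulation. Everything in this sketch is an assumption made for illustration: two cost levels (1 and 4), binary data that is more likely to equal 1 for high-cost agents, and allocation probabilities 1 and 0.5. The true mean under these choices is 0.5.

```python
import random


def simulate(n: int = 100_000, seed: int = 0):
    """Compare the naive sample mean with the Horvitz-Thompson estimate."""
    rng = random.Random(seed)
    allocation = {1.0: 1.0, 4.0: 0.5}  # hypothetical A(c), proportional to 1/sqrt(c)
    prob_one = {1.0: 0.2, 4.0: 0.8}    # hypothetical Pr[z = 1 | c]; true mean is 0.5
    sampled = []
    ht_sum = 0.0
    for _ in range(n):
        c = rng.choice([1.0, 4.0])
        z = 1.0 if rng.random() < prob_one[c] else 0.0
        if rng.random() < allocation[c]:   # the data is bought with probability A(c)
            sampled.append(z)
            ht_sum += z / allocation[c]    # upweight by the inverse propensity
    naive = sum(sampled) / len(sampled)    # average over the bought data only
    ht = ht_sum / n                        # Horvitz-Thompson: normalize by n
    return naive, ht


naive_estimate, ht_estimate = simulate()
```

Under these assumptions the naive average concentrates near 0.4, because low-cost agents (who mostly have z = 0) are over-represented, while the Horvitz-Thompson estimate concentrates near the true mean of 0.5.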
IPS beyond moment estimation.
We defined the Horvitz-Thompson estimator with respect to moment estimation problems, . As it turns out, this approach to unbiased estimation extends even beyond the moment estimation problem to parameter estimation problems defined as the solution to a system of moment equations or parameters defined as the minima of a moment function . We defer this discussion to Section 5.
3 Estimating Moments of the Data Distribution
In this section we consider the case where the analyst’s goal is to estimate the mean of a given moment function of the distribution. That is, there is some function $m : \mathcal{Z} \to [0, 1]$ such that both $0$ and $1$ are in the support of random variable $m(z)$, and the goal of the analyst is to estimate $\theta = \mathbb{E}[m(z)]$. [Footnote 7: Observe that it is easy to deal with the more general case of $m : \mathcal{Z} \to [a, b]$ by a simple linear translation, i.e., estimate $\frac{m(z) - a}{b - a}$ instead, which is in $[0, 1]$, and then translate the estimator back to recover $\mathbb{E}[m(z)]$.] We assume that $\hat{\theta}_S$, the estimator being applied, is the Horvitz-Thompson estimator given in Definition 8.
For convenience we will assume that the cost distribution $\mathcal{F}$ has finite support, say $\mathcal{C} = \{c_1, \ldots, c_{|\mathcal{C}|}\}$ with $c_1 < \cdots < c_{|\mathcal{C}|}$. (We relax the finite support assumption in Section 3.2.) Write $\pi_t$ for the probability of cost $c_t$ in $\mathcal{F}$. Also, for a given allocation rule $A$, we will write $A_t = A(c_t)$ for convenience. That is, we can interpret an allocation rule as a vector of $|\mathcal{C}|$ values $(A_1, \ldots, A_{|\mathcal{C}|})$. For further convenience, we will write $q_t = \Pr[m(z) = 1 \mid c = c_t]$. This is the probability that the moment takes on its maximum value when the cost is $c_t$. Finally, we will assume that the distribution of costs is regular.
Our goal is to address the analyst’s mechanism design problem for this restricted setting. By Lemma 1 it suffices to solve the analyst’s optimization problem. We start by characterizing the worst-case variance for this setting.
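Lemma 2. The worst-case variance of the Horvitz-Thompson estimator of a moment $m$, given cost distribution $\mathcal{F}$ and allocation rule $A$, is:
$$\mathrm{Var}^*(\hat{\theta}; \mathcal{F}, A) = \frac{1}{n} \sup_{q \in [0, 1]^{|\mathcal{C}|}} \left\{ \sum_{t} \pi_t \frac{q_t}{A_t} - \left( \sum_{t} \pi_t q_t \right)^2 \right\}.$$
Proof. For any distribution $\mathcal{D}$, observe that the Horvitz-Thompson estimator can be written as the sum of $n$ i.i.d. random variables $X_i = \frac{1}{n} \cdot \frac{m(z_i) \, \mathbb{1}\{i \in S\}}{A(c_i)}$, each with a variance:
$$\mathrm{Var}(X_i) = \frac{1}{n^2} \left( \mathbb{E}\left[ \frac{m(z)^2}{A(c)} \right] - \mathbb{E}[m(z)]^2 \right).$$
Hence, the variance of the estimator is $\frac{1}{n} \left( \mathbb{E}\left[ \frac{m(z)^2}{A(c)} \right] - \mathbb{E}[m(z)]^2 \right)$. Observe that conditional on any value $c$, the worst-case distribution $\mathcal{D}$ will assign positive mass only to values $z$ such that $m(z) \in \{0, 1\}$. This is because any other conditional distribution can be altered by a mean-preserving spread, pushing all the mass on these values, while preserving the conditional mean $\mathbb{E}[m(z) \mid c]$. This would strictly increase the latter variance. Thus we can assume without loss of generality that $m(z) \in \{0, 1\}$, in which case $m(z)^2 = m(z)$ and $\mathbb{E}[m(z)^2 \mid c] = \mathbb{E}[m(z) \mid c]$. Recall that $q_t = \Pr[m(z) = 1 \mid c = c_t]$. Then we can simplify the variance as:
$$\mathrm{Var}(\hat{\theta}; \mathcal{D}, A) = \frac{1}{n} \left( \sum_{t} \pi_t \frac{q_t}{A_t} - \left( \sum_{t} \pi_t q_t \right)^2 \right).$$
The lemma follows since the worst-case variance is a supremum over all possible consistent distributions, hence equivalently a supremum over conditional probabilities $q \in [0, 1]^{|\mathcal{C}|}$. ∎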
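The supremum over conditional probabilities can be explored numerically for small supports. The sketch below assumes the variance takes the form $\frac{1}{n}\left(\sum_t \pi_t q_t / A_t - (\sum_t \pi_t q_t)^2\right)$, which is what one gets for a binary moment when each sampled term is divided by its allocation probability. The function names and the brute-force grid search are choices made here for illustration, not the paper's method.

```python
from itertools import product
from typing import Sequence


def ht_variance(pi: Sequence[float], q: Sequence[float],
                alloc: Sequence[float], n: int) -> float:
    """Variance of the Horvitz-Thompson estimate for a binary moment.

    pi[t]    : probability of cost level t
    q[t]     : Pr[moment = 1 | cost level t], chosen by the adversary
    alloc[t] : allocation probability A_t (must be positive)
    """
    second_moment = sum(p * qt / a for p, qt, a in zip(pi, q, alloc))
    mean = sum(p * qt for p, qt in zip(pi, q))
    return (second_moment - mean ** 2) / n


def worst_case_variance(pi: Sequence[float], alloc: Sequence[float],
                        n: int, grid: int = 21) -> float:
    """Approximate the supremum over q by exhaustive search on a grid in [0, 1]^k."""
    levels = [k / (grid - 1) for k in range(grid)]
    return max(ht_variance(pi, q, alloc, n)
               for q in product(levels, repeat=len(pi)))
```

With two equally likely cost levels and every agent sampled with probability 1, the search recovers the familiar worst case of 1/4 for a single binary observation. Halving the allocation probability of one cost level raises the worst-case value, which is the trade-off the analyst balances against the budget. The grid search is exponential in the number of cost levels, so it is only suitable for toy instances.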
Given the above characterization of the variance of the estimator, we can greatly simplify the analyst’s optimization problem for this setting. Indeed, it suffices to find the allocation rule that minimizes the worst-case variance (10), subject to being monotone non-increasing and satisfying the expected budget constraint.
3.1 Characterization of the Optimal Allocation Rule
We are now ready to solve the analyst’s optimization problem for moment estimation. In this and all following sections, we denote $\bar{B} = B/n$ for simplicity of notation, and refer to $\bar{B}$ as the “average budget per agent”. Note that different agents with different costs may be allocated different fractions of the total budget that in general do not coincide with $\bar{B}$. We remark that if $\bar{B}$ is larger than the expected cost of an agent, then it is feasible (and hence optimal) for the analyst to set the allocation rule to pick any type with probability $1$. We therefore assume without loss of generality that $\bar{B} < \mathbb{E}[c]$.
Our analysis is based on an equilibrium characterization, where we view the analyst choosing and the adversary choosing as playing a zero-sum game and solve for its equilibria. We first present the characterization and some qualitative implications and then present an outline of our proof. We defer the full details of the proof to Appendix B.2.
Theorem 1 (Optimal Allocation for Moment Estimation).
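The optimal allocation rule $A$ is determined by two constants $\bar{A}$ and $t^* \in \{0, \ldots, |\mathcal{C}|\}$ such that:
$$A_t = \begin{cases} \bar{A} & \text{if } t \le t^*, \\ \dfrac{\alpha}{\sqrt{c_t}} & \text{if } t > t^*, \end{cases}$$
with $\alpha$ uniquely determined such that the budget constraint is binding. [Footnote 8: The explicit form of this is $\alpha = \dfrac{\bar{B} - \bar{A} \sum_{t \le t^*} \pi_t c_t}{\sum_{t > t^*} \pi_t \sqrt{c_t}}$.] Moreover, the parameters $\bar{A}$ and $t^*$ can be computed in time $O(\log |\mathcal{C}|)$.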
The parameters and in Theorem 1 are explicitly derived in closed form in Appendix B.2. For instance, when , then and for all . When then and . In fact, it can be shown (see full proof) that in this latter case, the worst-case distribution is given by . In particular, in this restricted case, the approximation of Roth and Schoenebeck  is in fact optimal, and indeed our allocation rule is expressing the solution of Roth and Schoenebeck  as a posted menu for a discrete, regular distribution of costs. In every other case, and our solution differs from that of Roth and Schoenebeck , exhibiting a pooling region for low-cost agents. More generally, the computational part of Theorem 1 follows by performing binary search over the support of , which can be done in time.
We note that the optimal rule essentially allocates to each agent inversely proportionally to the square root of their cost, but may also “pool” the allocation probability for agents at the lower end of the cost distribution. See Figure 1 for examples of optimal solutions.
The proof of Theorem 1 appears in Appendix B.2. The main idea is to view the optimization problem as a zero-sum game between the analyst who designs the allocation rule , and an adversary who designs so as to maximize the variance of the estimate. We show how to compute an equilibrium of this zero-sum game via Lagrangian and KKT conditions, and then note that the obtained must in fact be an optimal allocation rule for worst-case variance.
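The inverse-square-root shape of the allocation is easy to compute in the regime with no pooling. The sketch below covers only that regime and rests on stated assumptions: the allocation has the form $A_t = \alpha / \sqrt{c_t}$ for every cost level, agents are paid their cost (in the reduction these would be virtual costs), and $\alpha$ is chosen so that the expected spend per agent equals the average budget. It does not compute the pooling threshold of Theorem 1 and does not check whether the no-pooling regime is actually the optimal one for the given inputs. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def inverse_sqrt_allocation(costs: Sequence[float], probs: Sequence[float],
                            avg_budget: float) -> List[float]:
    """Allocation A_t = alpha / sqrt(c_t) with a binding per-agent budget.

    Solves sum_t probs[t] * costs[t] * A_t = avg_budget for alpha.
    Raises ValueError if some A_t would exceed 1, i.e. when the budget is
    too large for this no-pooling form to be a valid probability.
    """
    alpha = avg_budget / sum(p * math.sqrt(c) for p, c in zip(probs, costs))
    alloc = [alpha / math.sqrt(c) for c in costs]
    if max(alloc) > 1.0:
        raise ValueError("budget too large for the no-pooling regime")
    return alloc


def expected_spend(costs: Sequence[float], probs: Sequence[float],
                   alloc: Sequence[float]) -> float:
    """Expected payment per agent when each agent is paid exactly their cost."""
    return sum(p * c * a for p, c, a in zip(probs, costs, alloc))
```

For example, with three equally likely cost levels 1, 4 and 9 and an average budget of 1 per agent, the scaling constant is 0.5, the allocation probabilities are 1/2, 1/4 and 1/6, and the expected spend is exactly the budget.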
The analysis above applied to a discrete cost distribution over a finite support of possible costs. We show how to extend this analysis to a continuous distribution below, noting that the continuous variant of the Optimization Problem for Moment Estimation can be derived by taking the limit over finer and finer discrete approximations of the cost distribution.
3.2 Continuous Costs for Moment Estimation
Definition 9 (Continuous Optimization Problem for Moment Estimation).
When costs are supported on , the analyst’s optimization problem for the moment estimation problem based on the Horvitz-Thompson estimator can be written as:
We can now establish the following continuous variant of Theorem 1, which describes the optimal survey mechanism for continuous cost distributions.
Theorem 2 (Continuous Limit of Optimal Allocation).
If the distribution of costs is atomless and supported in , then the optimal allocation rule is determined by two constants and such that:
with uniquely determined such that the budget constraint is binding. [Footnote 9: The explicit form of this is .] The quantities and are defined as follows: for any let
Then and (see Figure 2). [Footnote 10: We take the convention that if lies above the range of , then .]
See Appendix B.3. ∎
Let us give some intuition behind the form of the allocation rule described in Theorem 2. As in Theorem 1, the allocation rule will pool agents with low costs (i.e., less than some threshold ), then allocate to higher-cost agents inversely proportional to the square root of their costs. In the definition of and , note that is non-decreasing and is non-increasing, so is non-decreasing. We therefore have that , the boundary of the pooling region, increases with up to a maximum value of (at which point all agents are pooled).
Let’s restrict attention to the case where the mean of the distribution is at least as large as half of the maximum value of the support, i.e. . In this setting, we see that for all , so
(see Figure 2).
So the optimal allocation sets . Moreover, the allocation for the pooling region is . So the optimal mechanism takes the following intuitive form: first, assign each agent an allocation probability that would, in an alternate world where costs are capped at , precisely exhaust the budget. Since costs can actually be greater than , this flat allocation goes over-budget. So, for agents whose costs are greater than , we remove allocation probability so that (a) the budget becomes balanced, and (b) the remaining probability of allocation is inversely proportional to the square root of the costs.
4 Multi-dimensional Parameters for moment estimation
Section 3 focused on the case of estimating a single-dimensional parameter of the data distribution. In this section we note that our characterization of the optimal mechanism extends to multi-dimensional moment estimation as well. In multi-dimensional moment estimation, there is a function , and our goal is to estimate . Here is the dimension of the estimation problem, which we assume to be a fixed constant.
As before, we will estimate by applying an estimator to the data collected from a survey mechanism. To evaluate an estimator, we must extend our definition of variance to the -dimensional setting, as follows.
Definition 10 (Worst-Case Mean Squared Error - Risk).
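Given allocation function $A$ and distribution $\mathcal{D}$, the expected mean squared error (or risk) of an estimator $\hat{\theta}_S$ is
$$\mathcal{R}(\hat{\theta}; \mathcal{D}, A) = \mathbb{E}\left[ \left\| \hat{\theta}_S - \theta \right\|_2^2 \right]$$
and the worst-case risk of $\hat{\theta}_S$ is
$$\mathcal{R}^*(\hat{\theta}; \mathcal{F}, A) = \sup_{\mathcal{D} \text{ consistent with } \mathcal{F}} \mathcal{R}(\hat{\theta}; \mathcal{D}, A).$$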
When is unbiased, the risk has a natural interpretation: it is simply the sum of variances of each coordinate of , considered separately.
Claim 1 (Risk of Unbiased Estimators).
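The risk of any unbiased estimator is equal to the sum of variances of every coordinate:
$$\mathbb{E}\left[ \left\| \hat{\theta}_S - \theta \right\|_2^2 \right] = \sum_{j=1}^{d} \mathrm{Var}\left( \hat{\theta}_{S,j} \right).$$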
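The claim can be checked in one line, assuming the risk is the expected squared Euclidean distance between the estimate and the true parameter (the coordinate index $j$ and the notation $\hat{\theta}_{S,j}$ are introduced here for the illustration). Expanding the squared norm coordinate by coordinate and using unbiasedness, $\mathbb{E}[\hat{\theta}_{S,j}] = \theta_j$:

$$\mathbb{E}\left[ \left\| \hat{\theta}_S - \theta \right\|_2^2 \right] = \sum_{j=1}^{d} \mathbb{E}\left[ \left( \hat{\theta}_{S,j} - \theta_j \right)^2 \right] = \sum_{j=1}^{d} \mathbb{E}\left[ \left( \hat{\theta}_{S,j} - \mathbb{E}[\hat{\theta}_{S,j}] \right)^2 \right] = \sum_{j=1}^{d} \mathrm{Var}\left( \hat{\theta}_{S,j} \right).$$

No independence between coordinates is needed, since cross terms never appear in a squared Euclidean norm.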
As in the single-dimensional case, the analyst obtains an estimate through the Horvitz-Thompson estimator, which is defined as follows for parameters in $\mathbb{R}^d$. Also as in the single-dimensional case, the Horvitz-Thompson estimator is an unbiased estimator of $\theta$.
Definition 11 (Horvitz-Thompson Estimator for Multi-dimensional Moment Estimation).
The Horvitz-Thompson estimator for the case when the parameter of interest is the expected value of a vector of moments is defined as:
For our characterization of worst-case risk, we will assume that the moment function can take on the extreme points of the hypercube $[0, 1]^d$.
Assumption 1. The moment function $m$ is such that the induced distribution of $m(z)$ is supported on every extreme point of the hypercube.
Under Assumption 1, the worst-case risk of the Horvitz-Thompson estimator of moment is
See Appendix B.4. ∎
Lemma 3 implies that the optimal survey design problem in the -dimensional case is, in fact, identical to the problem considered in the single-dimensional case. We can conclude that Theorems 1 and 2, which characterized the optimal survey mechanisms for discrete and continuous single-parameter settings, respectively, also apply to the multi-dimensional setting without change.
5 Multi-dimensional Parameter Estimation via Linear Regression
In this section, we extend beyond moment estimation to a multi-dimensional linear regression task (we discuss the non-linear case in Appendix A). For this setting we will impose additional structure on the data held by each agent. The private information of each agent $i$ consists of a feature vector $x_i$, an outcome value $y_i$, and a residual value $\varepsilon_i$, which are i.i.d. among agents. Each agent also has a cost $c_i$. The data is generated in the following way: first, $x_i$ is drawn from an unknown distribution. Then, independently of $x_i$, the pair $(c_i, \varepsilon_i)$ is drawn from a joint distribution. The marginal distribution over costs is known to the designer, but the full joint distribution is not. Then $y_i$ is defined to be
$$y_i = \theta^\top x_i + \varepsilon_i,$$
where $\theta \in \Theta$ with $\Theta$ a compact subset of $\mathbb{R}^d$. We further require that $\theta$ is in the interior of $\Theta$. The marginal distribution over $\varepsilon_i$ is supported on some bounded range and has mean $0$. We remark that it may be the case, however, that $\mathbb{E}[\varepsilon_i \mid c_i] \neq 0$.
When a survey mechanism buys data from agent $i$, the pair $(x_i, y_i)$ is revealed. Crucially, the value of $\varepsilon_i$ is not revealed to the survey mechanism. The goal of the designer is to estimate the parameter vector $\theta$.
Note that the single-dimensional moment estimation problem from Section 3 is a special case of linear regression. Indeed, consider setting , for each , , and to be the constant . Then, when the survey mechanism purchases data from agent , it learns , and estimating is equivalent to estimating the expected value of .
More generally, one can interpret $x_i$ as a vector of publicly-verifiable information about agent $i$, which might influence a (possibly sensitive) outcome $y_i$. For example, $x_i$ might consist of demographic information, and $y_i$ might indicate the severity of a medical condition. The coefficient vector $\theta$ describes the average effect of each feature on the outcome, over the entire population. Under this interpretation, $\varepsilon_i$ is the residual agent-specific component of the outcome, beyond what can be accounted for by the agent’s features. We can interpret the independence of $x_i$ from $(c_i, \varepsilon_i)$ as meaning that each agent’s cost to reveal information is potentially correlated with their (private) residual data, but is independent of the agent’s features.
As in Section 3, the analyst wants to design a survey mechanism to buy data from the agents, obtain data from the set of elicited agents, then compute an estimate $\hat\theta$ of $\theta$. The expected average payment to each of the $n$ agents should not exceed the analyst’s budget. As in Section 2.1, we note that the problem of designing a survey mechanism in fact reduces to that of designing an allocation rule that minimizes the estimator’s variance and satisfies a budget constraint in which the prices are replaced by known virtual costs. To this end, the analyst designs an allocation rule and a pricing rule so as to minimize the $\sqrt{n}$-normalized worst-case asymptotic mean-squared error of $\hat\theta$ as the population size goes to infinity. Our mechanism will essentially be optimizing the coefficient in front of the leading term in the mean squared error, ignoring potential finite sample deviations that decay at a faster rate than $1/n$. Note that we will design allocation and pricing rules to be independent of the population size $n$; hence, the analyst can use the designed mechanism even if the exact population size is unknown.
5.1 Estimators for Regression
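Let $S$ be the set of data points elicited by a survey mechanism. The analyst’s estimate will then be the value $\hat\theta$ that minimizes the Horvitz-Thompson mean-squared error, i.e.,
$$\hat\theta = \arg\min_{\theta' \in \Theta} \frac{1}{n} \sum_{i \in S} \frac{(y_i - \theta'^\top x_i)^2}{A(c_i)}.$$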
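A minimal sketch of such an estimator is given below. It assumes the Horvitz-Thompson mean-squared error is the squared residual of each elicited point weighted by the inverse of its purchase probability, and it solves the unconstrained problem through the weighted normal equations, ignoring the restriction of the parameter to a compact set. The helper names `solve_linear` and `ht_weighted_least_squares` are hypothetical.

```python
def solve_linear(M, b):
    """Solve M t = b by Gaussian elimination with partial pivoting."""
    d = len(b)
    aug = [list(M[r]) + [b[r]] for r in range(d)]
    for col in range(d):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, d):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, d + 1):
                aug[r][k] -= factor * aug[col][k]
    # Back substitution on the upper-triangular system.
    t = [0.0] * d
    for r in reversed(range(d)):
        rest = sum(aug[r][k] * t[k] for k in range(r + 1, d))
        t[r] = (aug[r][d] - rest) / aug[r][r]
    return t


def ht_weighted_least_squares(xs, ys, alloc_probs):
    """Minimize sum_i (y_i - theta . x_i)^2 / A_i over theta (unconstrained).

    xs[i]          : feature vector of the i-th elicited data point
    ys[i]          : outcome of the i-th elicited data point
    alloc_probs[i] : probability with which that point was purchased
    """
    d = len(xs[0])
    M = [[0.0] * d for _ in range(d)]  # sum_i x_i x_i^T / A_i
    b = [0.0] * d                      # sum_i x_i y_i / A_i
    for x, y, p in zip(xs, ys, alloc_probs):
        w = 1.0 / p
        for r in range(d):
            b[r] += w * x[r] * y
            for k in range(d):
                M[r][k] += w * x[r] * x[k]
    return solve_linear(M, b)


# Noiseless example generated with theta = (2, -1).
theta_hat = ht_weighted_least_squares(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [2.0, -1.0, 1.0], [0.5, 0.25, 1.0]
)
```

With a single constant feature equal to 1, the same routine returns the inverse-probability-weighted average of the purchased outcomes.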
Further, we make the following assumptions on the distribution of data points:
Assumption 2 (Assumption on the distribution of features).
$\mathbb{E}[x_i x_i^\top]$ is finite and positive-definite, and hence invertible.
Finite expectation is a property one may expect real data such as age, height, weight, etc. to exhibit. The second part of the assumption is satisfied by common classes of distributions, such as multivariate normals. We first show that $\hat\theta$ is a consistent estimator of $\theta$.
Lemma 4.
Under Assumption 2, for any allocation rule that does not depend on $n$, $\hat\theta$ is a consistent estimator of $\theta$.
Proof of Lemma 4.
Let , and let for simplicity. The following holds:
First, we note that is the unique parameter that minimizes ; indeed, take any , we have that
As $x_i$ and $\varepsilon_i$ are independent and $\varepsilon_i$ has mean $0$, this simplifies to
where the last step follows from being positive-definite by Assumption 2.
By definition, is compact.
is continuous in , and so is its expectation.
is also bounded (lower-bounded by , and upper-bounded by either or ), implying that
is continuous and bounded. Hence, by the uniform law of large numbers, and recalling that the data points are i.i.d.,
Finally, noting that conditional on , and are independent, we have:
Therefore, all of the conditions of Theorem 2.1 of  are satisfied, which is enough to prove the result. ∎
Similarly to the moment estimation problem in Section 3, the goal of the analyst is to minimize the worst-case (over the distribution of data and the correlation between the $c_i$’s and the $\varepsilon_i$’s) asymptotic mean-squared error of the estimator $\hat\theta$. Here “asymptotic” means the worst-case error as $\hat\theta$ approaches the true parameter $\theta$. The following lemma characterizes the asymptotic covariance matrix of $\hat\theta$. (In fact, it fully characterizes the asymptotic distribution of $\hat\theta$.)
Lemma 5.
Under Assumption 2, for any allocation rule that does not depend on $n$, the asymptotic distribution of $\hat\theta$ is given by
where denotes convergence in distribution and where randomness in the expectations is taken on the costs , the set of elicited data points , the features of the data , and the noise .
Proof of Lemma 5.
For simplicity, let and note that the ’s are i.i.d. Let . First we remark that and . We then note the following:
is in the interior of .
is twice continuously differentiable for all .
. This follows directly from applying the multivariate central limit theorem, noting that
where the first step follows from conditional independence on of with , the second step from , and the last equality follows from the fact that and are independent and .
, applying the uniform law of large numbers as is i) continuous in , and ii) constant in , thus bounded coordinate-by-coordinate by that is independent of and has finite expectation .
is invertible as it is positive-definite.
Therefore the sufficient conditions i)-v) in Theorem 3.1 of  hold, proving that the asymptotic distribution is normal with mean and variance
To conclude the proof, we remark that by independence of with and ,
Lemma 5 implies that the worst-case asymptotic mean-squared error, under a budget constraint, is given by the worst-case trace of the variance matrix. That is,
where recall that is the marginal distribution over and the distribution over . Importantly, this can be rewritten as
Therefore, the analyst’s decision solely depends on the worst-case correlation between costs and noise , and not on the worst-case distribution . In turn, the analyst’s allocation is completely independent of and robust in .
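To see why the feature distribution separates from the allocation rule, consider the following sketch. It assumes the estimator minimizes the squared loss weighted by $1/A(c)$, that an agent with cost $c$ is elicited with probability $A(c)$ independently of $x$ and $\varepsilon$ given $c$, and that the asymptotic variance takes the standard sandwich form $H^{-1} V H^{-1}$ for an M-estimator. The symbols $H$ and $V$ are notation introduced here for illustration, and the constant factors from differentiating the squared loss cancel and are omitted.

```latex
\begin{align*}
H &= \mathbb{E}\!\left[\frac{\mathbf{1}\{i \in S\}}{A(c)}\, x x^\top\right]
   = \mathbb{E}\big[x x^\top\big], \\
V &= \mathbb{E}\!\left[\frac{\mathbf{1}\{i \in S\}}{A(c)^2}\, \varepsilon^2\, x x^\top\right]
   = \mathbb{E}\!\left[\frac{\varepsilon^2}{A(c)}\right] \mathbb{E}\big[x x^\top\big], \\
H^{-1} V H^{-1}
  &= \mathbb{E}\!\left[\frac{\varepsilon^2}{A(c)}\right] \mathbb{E}\big[x x^\top\big]^{-1}, \\
\operatorname{tr}\!\big(H^{-1} V H^{-1}\big)
  &= \mathbb{E}\!\left[\frac{\varepsilon^2}{A(c)}\right]
     \operatorname{tr}\!\Big(\mathbb{E}\big[x x^\top\big]^{-1}\Big).
\end{align*}
```

Under these assumptions the allocation rule enters only through the scalar factor $\mathbb{E}[\varepsilon^2 / A(c)]$, while the features contribute a separate multiplicative term that the analyst cannot influence.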
5.2 Characterizing the Optimal Allocation Rule for Regression
As in Section 3, we assume costs are drawn from a discrete set, say . We will then write for an allocation rule conditional on the cost being , and the probability of the cost of an agent being . We will assume that , meaning that it is not feasible to accept all data points, since otherwise it is trivially optimal to set for all .
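The feasibility condition above can be sketched as a simple check. This is an illustration under assumptions: the budget constraint is taken to be "expected virtual cost of the purchased data is at most the budget", in line with the reduction to virtual costs mentioned earlier in this section, and virtual costs are treated as given inputs rather than computed. The names `probs`, `virtual_costs`, `alloc`, and `budget` are hypothetical.

```python
def expected_spend(probs, virtual_costs, alloc):
    """Expected per-agent virtual cost of an allocation rule.

    probs[t]         : probability that an agent has the t-th cost
    virtual_costs[t] : virtual cost associated with the t-th cost (assumed given)
    alloc[t]         : probability of buying data from an agent with the t-th cost
    """
    return sum(p * v * a for p, v, a in zip(probs, virtual_costs, alloc))


def full_allocation_is_feasible(probs, virtual_costs, budget):
    """True if buying every data point with probability 1 fits the budget."""
    everyone = [1.0] * len(probs)
    return expected_spend(probs, virtual_costs, everyone) <= budget


# Two equally likely cost levels with hypothetical virtual costs 1 and 3.
probs = [0.5, 0.5]
virtual_costs = [1.0, 3.0]
budget = 1.5
trivial_case = full_allocation_is_feasible(probs, virtual_costs, budget)
partial_spend = expected_spend(probs, virtual_costs, [1.0, 0.25])
```

In this example buying from everyone costs 2.0 in expectation and exceeds the budget of 1.5, so the design problem is non-trivial, whereas buying from all low-cost agents and a quarter of the high-cost agents stays within budget.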
The following lemma describes the optimization problem faced by an analyst wanting to design an optimal survey mechanism. Recall that residual values lie in the interval .
Lemma 6 (Optimization Problem for Parameter Estimation).
The optimization program for the analyst is given by: