Information Gathering with Peers: Submodular Optimization with Peer-Prediction Constraints

11/17/2017 · Goran Radanovic et al. · EPFL, Max Planck Institute for Software Systems, ETH Zurich, Harvard University

We study the problem of optimal information gathering from multiple data providers who need to be incentivized to provide accurate information. This problem arises in many real-world applications that rely on crowdsourced data sets but where the process of obtaining data is costly. A notable example of such a scenario is crowd sensing. To this end, we formulate the problem of optimal information gathering as maximization of a submodular function under a budget constraint, where the budget represents the total expected payment to data providers. In contrast to existing approaches, we base our payments on incentives for accuracy and truthfulness, in particular on peer-prediction methods that score each of the selected data providers against its best peer, while ensuring that the minimum expected payment is above a given threshold. We first show that the problem at hand is hard to approximate within a constant factor that does not depend on the properties of the payment function. However, for given topological and analytical properties of the instance, we construct two greedy algorithms, respectively called PPCGreedy and PPCGreedyIter, and establish theoretical bounds on their performance w.r.t. the optimal solution. Finally, we evaluate our methods using a realistic crowd-sensing testbed.





The recent success of various machine learning techniques can partly be attributed to the existence of large sets of labeled data that can readily be used for training purposes. In the past decade, the predominant way of obtaining useful data has been through crowdsourcing approaches, where human subjects either directly label data or carry private devices that provide measurements of spatially distributed phenomena.

Figure 1: An example of crowdsourcing with incentives: Crowd-sensors (air-quality eggs) report air-quality measurements, and a data collector rewards them with monetary payments. An edge (black line) indicates that the measurements of two sensors are sufficiently correlated to verify accuracy.

One of the most important aspects of data is its accuracy, which can only be established if data providers (e.g., crowd participants) report accurate information. To incentivize accurate reporting, a data collector can offer payments that compensate data providers for their effort. In its simplest form, this type of data elicitation process can be modeled as a three-step protocol:

  • Data providers acquire accurate data, incurring a cost of effort;

  • Data providers report the acquired data to a data collector;

  • The data collector pays each data provider an amount that compensates for the cost of effort.

An example scenario is shown in Figure 1.

The problem, however, arises if the data providers are susceptible to moral hazard, that is, to deviating to heuristic reporting without obtaining the data in the first place. In fact, such behavior is expected from a rational participant who aims to maximize her utility, since heuristic reporting typically carries no cost of effort. To avoid this problem, the data collector can design payment functions that depend on the accuracy of the reported information, for example, by installing random spot-checks that validate some of the reports [Gao, Wright, and Leyton-Brown2016]. While this approach has often been used in standard micro-task crowdsourcing, it is often too costly in more complex elicitation settings. Consider, for example, the crowd-sensing scenario shown in Figure 1, where sensors measure a spatially distributed phenomenon that is highly localized. To apply a spot-checking procedure, the data collector would need mobile sensors that change their locations at each time-step. Furthermore, due to the localized nature of the measured phenomenon, the density of the spot-check sensor network would have to be relatively large.

Instead of evaluating data providers against trusted reports, prior work [Dasgupta and Ghosh2013, Jurca and Faltings2011, Radanovic and Faltings2016, Shnayder et al.2016, Witkowski et al.2017, Agarwal et al.2017] proposes peer-prediction mechanisms for incentivizing distributed information sources. Peer-prediction mechanisms reward data providers by measuring consistency among their reports; thus, if a data provider believes that others are honest, she is also incentivized to report truthfully. (Peer-prediction mechanisms are in general susceptible to collusion, but in many cases one can establish relatively strong incentive properties for a wide variety of reporting strategies [Kong and Schoenebeck2016]. We do not focus on collusion resistance, so we use standard peer-prediction mechanisms in our setting.) The most important condition when applying a peer-prediction mechanism is that a data provider and her peer have correlated private information. Furthermore, this correlation, when expressed through expected payments, should be greater than the cost of effort. The latter property can always be achieved by scaling, provided that the peer-prediction method in question provides strict incentives for truthfulness. The scaling approach, however, neglects the budget concerns that are important when collecting large data sets.

Overview of Our Approach

We therefore focus on the limited-budget concern in a distributed data collection process that uses peer-prediction incentives. There are two important aspects to this problem:

  • which data providers to select, given that we only have a limited budget to spend on incentives; only those data providers who receive incentives can be considered reliable;

  • how to ensure that all of the selected data providers have a proper peer; this constrains the selection problem to always include a proper peer of each data provider that is to be selected.

To quantify the usefulness of each data provider, we adopt a submodular utility function, which can, for example, measure the information gain of the data collector from obtaining the reports of the selected data providers. We require that each data provider can be scored against a peer report with the resulting expected payment greater than a given threshold. Furthermore, the total expected payment should be bounded by a budget, and a data provider should always be scored against the best peer among the selected data providers. Our main contributions are:

  • A formal model of information gathering with budget and peer-prediction constraints that is based on submodular maximization.

  • Showing that the studied optimization problem is hard to approximate within a constant factor independent of the properties of the applied payment function.

  • Novel algorithms for maximizing submodular functions with peer-prediction constraints that have provable guarantees for given topological and analytical properties of payments.

  • Experimental evaluation of the proposed algorithms on a crowd-sensing test-bed.

Notice that we do not focus on a particular peer-prediction mechanism, but rather we allow a wide range of possible mechanisms (those that are robust in terms of the number of peers and produce bounded expected payments); thus, we complement the prior work on peer-predictions by examining orthogonal aspects of elicitation without direct verification. We provide the proofs to our formal claims in the extended version of the paper [Radanovic et al.2017].

Problem Statement

We now formalize the problem addressed in this paper. We model data providers as nodes in a graph, whereas the underlying peer-prediction dependencies are modeled via edges whose weights are defined by the expected payments. The overall goal is to select a set of nodes that maximize a submodular utility function, while satisfying the constraint that the cost of the data collector (i.e., the total expected payment to nodes) is within a predefined budget. The following subsections provide more precise modeling details.

Set of Nodes and the Utility Function

We consider a set of nodes (e.g., a population of people or sensors deployed in a city) denoted by V, of size n = |V|. Hereafter, we denote a generic node by v. We associate with the nodes a function f : 2^V → R≥0 that quantifies their utility (e.g., informativeness). That is, given a set of selected nodes S ⊆ V, the utility achieved from this set is equal to f(S). Furthermore, for a given set S and a node v, we define the marginal utility of adding v to S as:

f(v | S) = f(S ∪ {v}) − f(S).

Here, the function f is assumed to be submodular and monotone. Submodularity is an intuitive notion of diminishing returns, stating that, for any sets A ⊆ B ⊆ V and any given node v ∉ B, it holds that f(v | A) ≥ f(v | B). Monotonicity requires that the function increases as we add more elements to a set; that is, for any sets A ⊆ B, it holds that f(A) ≤ f(B). These conditions are relatively general, and are satisfied by many realistic, as well as complex, utility functions for information gathering [Krause and Guestrin2011, Krause and Golovin2012, Singla and Krause2013, Singla et al.2014, Tschiatschek, Singla, and Krause2017]. W.l.o.g., we assume that f is normalized, i.e., f(∅) = 0.
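To make these definitions concrete, here is a minimal sketch of a coverage-style utility, a standard example of a monotone submodular function. The node names and covered items are made up for illustration; the brute-force property check is ours, not from the paper.

```python
# Hypothetical illustration: a coverage utility f(S) = |union of items covered
# by S| is monotone and submodular; we verify both properties by brute force.
from itertools import combinations

COVERS = {  # which items each node's report covers (made-up data)
    "a": {1, 2}, "b": {2, 3}, "c": {3, 4, 5}, "d": {1, 5},
}

def f(S):
    """Coverage utility: number of distinct items covered by S."""
    covered = set()
    for v in S:
        covered |= COVERS[v]
    return len(covered)

def marginal(v, S):
    """Marginal utility f(v | S) = f(S ∪ {v}) - f(S)."""
    return f(set(S) | {v}) - f(S)

nodes = set(COVERS)

def is_monotone_submodular():
    """Check f(A) <= f(B) and marginal(v, A) >= marginal(v, B)
    for every pair A ⊆ B and every v outside B."""
    subsets = [set(c) for r in range(len(nodes) + 1)
               for c in combinations(sorted(nodes), r)]
    for A in subsets:
        for B in subsets:
            if A <= B:
                if f(A) > f(B):
                    return False
                for v in nodes - B:
                    if marginal(v, A) < marginal(v, B):
                        return False
    return True
```

Any coverage function of this form passes the check, which is what licenses the greedy machinery used later in the paper.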

Peer-prediction Constraints (PPC)

The nodes in general exhibit dependencies with other nodes. We consider a particular form of constraints that is associated with information elicitation via peer-prediction mechanisms [Miller, Resnick, and Zeckhauser2005]. A canonical peer-prediction mechanism scores the information reported by a node v using the information of a peer node p. The mechanism is said to be proper if node v’s best response to accurate reporting of node p is to report accurately, where the quality of a response is measured in terms of node v’s expected payoff over possible (accurate) reports of node p. (Properness is here defined as Bayes-Nash incentive compatibility in the game-theoretic sense.) We denote node v’s expected payoff for accurate reporting by τ(v, p). To establish properness, one needs to ensure that peer p provides information statistically correlated with that of node v, so that the expected payoff is strictly greater than the cost of accurate reporting (which models, for example, participants’ effort exertion). Therefore, node v has only a limited number of peers, defined as nodes p that lead to an expected payoff τ(v, p) ≥ α, where α is a problem-specific threshold dependent on the cost of accurate reporting. We further require that the same holds for node v’s peers, i.e., τ(p, v) ≥ α, and assume that the mechanism provides bounded payments, so that τ(v, p) ≤ τ_max. Notice that as α increases, a node is expected to have a smaller number of peers, which makes the problem of selecting an optimal set of nodes more constrained. In Section ’Experimental Evaluation’, we confirm this observation by showing the drop in the obtained utility.

Example: Output Agreement (OA). Arguably the simplest peer-prediction method is the output agreement of von Ahn and Dabbish (2004), which gives a strictly positive payment only for matching reports. In our experiments, reported information can in general take real values. In that case, as explained by Waggoner and Chen (2014), the OA mechanism can be defined as:

τ(v, p) = 1 if d(x_v, x_p) ≤ δ, and 0 otherwise,

where d(x_v, x_p) is the Euclidean distance between the reported values of v and p, and δ is an agreement threshold. Note that more complex designs are also allowed by our framework, such as the one proposed in [Faltings, Li, and Jurca2014]. For more information on the properties of different minimal peer-prediction mechanisms and their relationships, we refer the reader to [Frongillo and Witkowski2016].
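The OA rule for real-valued reports can be sketched in a few lines. This is our own illustration; the function name and the default threshold `delta` are ours, not from the paper.

```python
def oa_payment(x_v, x_p, delta=1.0):
    """Output-agreement score for real-valued reports: pay 1 if the
    Euclidean distance between the two reports is within delta, else 0.
    delta is an illustrative agreement threshold (an assumption of ours)."""
    dist = sum((a - b) ** 2 for a, b in zip(x_v, x_p)) ** 0.5
    return 1.0 if dist <= delta else 0.0
```

Scaling this 0/1 score is what lets a designer lift the expected payment above the effort threshold discussed earlier.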

With this in mind, we can model the dependencies among nodes using an undirected graph G = (V, E), whose edges connect pairs of nodes that are valid peers of one another, i.e., E = {(v, p) : τ(v, p) ≥ α and τ(p, v) ≥ α}. We require that each node in a selected set S has a neighboring node that is also in S, which implies that we can properly evaluate the reported information of each selected node. We denote the set of nodes neighboring node v by N(v) = {p : (v, p) ∈ E}. W.l.o.g., we can assume that every node in V has at least one peer, i.e., N(v) ≠ ∅. Namely, nodes that do not have a peer cannot be incentivized to report accurately in our setting, so they bring no utility. Finally, let us denote by D the maximum number of peers that a node in graph G has, i.e., D = max_{v∈V} |N(v)|.

Cost of Incentivizing Accuracy

Given a selected set of nodes S, an information elicitation procedure needs to spend a certain amount of budget, hereafter denoted by c(S), to incentivize accurate reporting. To quantify the cost of accurate elicitation, one needs to specify a peer selection procedure for when a node has multiple peers. We take the approach of selecting the best peer, that is, the peer whose information is most correlated with that of the considered node according to the expected payoffs; this leads to the strongest incentives in terms of the separation between the expected payoffs for accurate and inaccurate reporting. With this choice of peer selection procedure, we can define the cost of selecting nodes S as the function c : F → R≥0:

c(S) = Σ_{v∈S} max_{p∈N(v)∩S} τ(v, p).

Here, F contains only sets S such that each node v ∈ S has a peer node p ∈ N(v) ∩ S, which makes c well defined.
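The best-peer cost can be computed directly from a table of expected payoffs. The following sketch uses our own notation and made-up payoff values; it is an illustration, not the authors' implementation.

```python
# Sketch (our notation): TAU[(u, v)] is the expected payoff for scoring u
# against peer v; the cost of a feasible set S pays each node its *best*
# (highest-payoff) peer within S. Payoff values below are made up.
TAU = {
    ("a", "b"): 0.6, ("b", "a"): 0.6,
    ("b", "c"): 0.9, ("c", "b"): 0.9,
    ("a", "c"): 0.5, ("c", "a"): 0.5,
}

def peers(v, S):
    """Peers of v available inside the selected set S."""
    return [u for u in S if u != v and (v, u) in TAU]

def cost(S):
    """c(S) = sum over v in S of the best-peer expected payoff within S.
    Assumes S is feasible: every node has at least one peer in S."""
    return sum(max(TAU[(v, u)] for u in peers(v, S)) for v in S)
```

Note the non-modularity visible even in this tiny instance: adding "c" to {"a", "b"} raises node "b"'s payment from 0.6 to 0.9, so marginal costs depend on what is already selected.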

Optimization Problem

Our goal is to select a feasible set S ∈ F that maximizes the utility f(S) for a given budget B. More precisely, the budget B denotes the total expected payment that a data collector is willing to provide for incentivizing accurate reporting. We therefore pose the following optimization problem:

max_{S∈F} f(S)  s.t.  c(S) ≤ B.   (4)

Ignoring the computational constraints, we denote the optimal solution to this problem as Opt.


Methodology

Instead of operating directly on optimization problem (4), we reformulate it so that the budget constraint is expressed through a cost function defined over all subsets of nodes, not just the feasible sets in F. We first show that the new optimization problem is equivalent to (4). Unfortunately, it is hard to approximate without any assumption on the structure of the cost function. We then relax it to an optimization problem that uses a modular approximation to the cost function, but operates with a reduced budget so as to satisfy the original budget constraint. This relaxation is the basis for the algorithms developed in the next section, and it is sound if the cost function of the original problem satisfies certain topological and analytical constraints. The following subsections explain our methodology in more detail.

Expansion of the Cost Function

We start by expanding the domain of the cost function to the power set 2^V, which will provide better insight into the computational complexity of the original problem. In particular, we consider the following expansion ĉ : 2^V → R≥0:

ĉ(S) = Σ_{v∈S₁} max_{p∈N(v)∩S} τ(v, p) + Σ_{v∈S∖S₁} min_{p∈N(v)} τ(v, p),   (5)

where S₁ is the set of all nodes in S who also have a peer in S, i.e., S₁ = {v ∈ S : N(v) ∩ S ≠ ∅}. In other words, cost function ĉ acts as if all the nodes in S who have a peer in S are rewarded as usual, while those that do not have a peer in S are rewarded with the expected payoff they would obtain when scored against their worst peer. Notice that ĉ(S) = c(S) for all S ∈ F, which makes the expansion sound. We denote by ĉ(v | S) = ĉ(S ∪ {v}) − ĉ(S) the marginal increase in cost for adding an element v to S. We establish the monotonicity of cost function ĉ with the following lemma; notice, however, that the cost function is not necessarily sub/super-modular.

Lemma 1.

Cost function ĉ defined by (5) is monotone. Furthermore, the marginal increase ĉ(v | S) is bounded for all v ∈ V and S ⊆ V.

Hardness Result

To prove the complexity of our initial problem, we adapt optimization problem (4) to use the extended cost function ĉ. In particular, we consider the optimization problem defined as:

max_{S∈F} f(S)  s.t.  ĉ(S) ≤ B.   (6)

Clearly, any feasible solution to problem (6) is also a feasible solution to the original problem due to the constraint S ∈ F, while the optimality alignment is ensured by having the same objective value (recall that ĉ = c on F). Now, to show the hardness of approximating Opt, we reduce the maximum clique problem to optimization problem (6) in a computationally efficient way, thus obtaining:

Theorem 1.

For any ε > 0, it is NP-hard to find a solution S to optimization problem (6) (and thus (4)) such that f(S) ≥ Opt / n^{1−ε}.


Consider an arbitrary undirected unweighted graph G′ = (V′, E′) for which we wish to compute the maximum clique. To reduce the maximum clique problem to (6): 1) define the utility as f(S) = |S|, which is clearly monotone and submodular; 2) define the payment function as τ(u, v) = 1 if (u, v) ∈ E′, and τ(u, v) = τ_max > B otherwise; 3) set the budget to B = |V′|; 4) and set the threshold to α = 1. Notice that such an arrangement induces a fully connected graph G. Furthermore, we defined deterministic payments, but one can use randomized payments with the same expectations instead. Points 2 and 4 ensure that any solution to optimization problem (6) is a clique in graph G′; otherwise, the budget constraint would be violated, since some node would be scored against a best peer whose payment alone exceeds B. Likewise, points 2 and 3 ensure that any clique is permitted as a potential solution w.r.t. the budget constraint. Finally, point 1 ensures that we search for a clique with the maximum number of vertices. Since the reduction is computationally efficient (polynomial in the graph size, i.e., |V′| and |E′|), optimization problem (4) is at least as hard as the maximum clique problem. Using the fact that the maximum clique problem is hard to approximate within factor n^{1−ε} [Hastad1999], we obtain the claim. ∎
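The mechanics of the reduction can be sanity-checked in code. The following is our own reconstruction under the best-peer cost convention, with a made-up four-node graph; payments of 1 go to adjacent pairs and a budget-busting payment to non-adjacent ones.

```python
# Sketch (our reconstruction): under the clique reduction, a node's best
# peer within S takes the *maximum* payment over S, so any non-adjacent
# pair inside S forces a payment of n + 1 > B and blows the budget.
def reduction_cost(S, adjacency, n):
    """Best-peer cost of S under the reduction's payment function:
    pay 1 for an adjacent peer, n + 1 for a non-adjacent one."""
    def pay(u, v):
        return 1.0 if v in adjacency[u] else float(n + 1)
    total = 0.0
    for u in S:
        others = [v for v in S if v != u]
        if not others:
            return float("inf")  # u has no peer at all in S
        total += max(pay(u, v) for v in others)
    return total

# Made-up instance: a triangle {1, 2, 3} plus an isolated node 4.
ADJ = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}, 4: set()}
```

With budget B = n = 4, the triangle costs 3 and fits, while any set containing a non-edge (e.g., {1, 4}) costs at least n + 1 and is rejected, so budget-feasible sets are exactly the cliques.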

Structural Properties of the Cost Function

To cope with the computational hardness of the problem at hand, we identify two structural properties of cost function ĉ (or equivalently, structural properties of payment function τ). The first one is related to the topological properties of graph G, and can be quantified by the maximum number of peers that a node in the graph can have. As explained earlier in this section, we denote this number by D.

The second property is similar to the notion of curvature of a submodular function [Iyer, Jegelka, and Bilmes2013], but is now defined over a cost function ĉ that is not necessarily sub/super-modular. In particular, we define the slope of cost function ĉ as:

γ = max_{v∈V, S⊆V∖{v}} ĉ(v | S) / ĉ(v | ∅).

The slope of cost function ĉ, as defined above, measures how much the marginal gains of ĉ change as we add more elements to an initially empty set of selected nodes. (That is, γ quantifies the maximum relative increase in ĉ for adding a node; see Lemma 1.) Intuitively, it measures the deviation of ĉ from modularity. A specific case of interest is γ = 1, which indicates that ĉ is modular and, thus, can be decomposed into a sum of costs each dependent on only one vertex, i.e., ĉ(S) = Σ_{v∈S} ĉ({v}). In the next subsection, we discuss how to utilize modular approximations of ĉ when ĉ itself is not modular. First, let us upper bound the slope using the fact that payments are bounded.
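For small instances the slope can be brute-forced directly from a marginal-cost oracle, which also makes the γ = 1 modular case easy to see. This sketch and its names are ours; the exponential enumeration is for illustration only.

```python
# Empirical slope (our notation): gamma = max over v and S of
# chat(v | S) / chat(v | empty set), measuring deviation from modularity.
from itertools import combinations

def empirical_slope(nodes, marginal_cost):
    """Brute-force the slope of a cost function given an oracle
    marginal_cost(v, S). Exponential in |nodes|; illustration only."""
    gamma = 1.0
    node_list = sorted(nodes)
    for r in range(len(node_list)):
        for c in combinations(node_list, r):
            S = set(c)
            for v in nodes - S:
                base = marginal_cost(v, set())
                if base > 0:
                    gamma = max(gamma, marginal_cost(v, S) / base)
    return gamma

# A modular cost (marginal independent of S) has slope exactly 1.
weights = {"a": 2.0, "b": 3.0, "c": 1.0}
modular_marginal = lambda v, S: weights[v]
```

For a genuinely non-modular cost, such as the best-peer cost earlier in this section, the same routine would return a γ strictly above 1.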

Lemma 2.

The slope of cost function ĉ is upper-bounded as γ ≤ ((D + 1)·τ_max − D·α)/α, since ĉ(v | ∅) ≥ α while adding a node can raise its own payment to at most τ_max and the payment of each of its at most D neighbors by at most τ_max − α.

Relaxed Optimization Problem

To make use of the structural constraints of cost function ĉ, let us consider a relaxed version of optimization problem (6) with budget constraints defined via a modular lower bound to ĉ, denoted by ĉ_mod. As we show in the next section, for such a relaxation one can develop a greedy approach with provable approximation guarantees on the quality of the obtained solution relative to Opt. More precisely, consider the modular function ĉ_mod defined via cost function ĉ:

ĉ_mod(S) = Σ_{v∈S} min_{p∈N(v)} τ(v, p).   (7)

Clearly, ĉ_mod lower bounds ĉ, as it charges the nodes in S the expected payoffs they would receive when scored against their worst peers (not necessarily in S). Now, we relax optimization problem (6) to:

max_{S∈F} f(S)  s.t.  ĉ_mod(S) ≤ B'.   (8)
In order to make the relaxation sound, any set selected in problem (8) should also be feasible in problem (6) (and thus (4)). We can ensure this by reducing the available budget, i.e., by making B' appropriately smaller than B. Using the slope of cost ĉ, we obtain that the following budget reduction satisfies our requirement.

Lemma 3.

Any feasible solution to optimization problem (8) is also a feasible solution to optimization problem (6) (and thus (4)) for B' = B/γ, where γ is the slope of cost function ĉ.

1 Input:
  • PPC graph G = (V, E);
  • Utility function f; budget B;
  • Cost function ĉ with slope γ; modular approximation ĉ_mod;
2 Output: selected set S;
3 Initialize: S ← ∅; B' ← B/γ; stop ← false; CoupleSuperSet ← ∅;
4 //Create CoupleSuperSet
5 foreach v ∈ V do
      6 foreach p ∈ N(v) do
            7 CoupleSuperSet ← CoupleSuperSet ∪ {{v, p}};
       end foreach
  end foreach
8 while not stop do
      9 X ← a candidate maximizing f(X | S)/ĉ_mod(X | S) among singletons {v} with v ∉ S and N(v) ∩ S ≠ ∅, and couples in CoupleSuperSet, subject to ĉ_mod(S ∪ X) ≤ B';
      10 if no such X exists then stop ← true;
      11 else S ← S ∪ X;
  end while
Algorithm 1 Algorithm PPCGreedy


We now present a new greedy algorithm for solving the optimization problem with peer-prediction constraints (PPC), called PPCGreedy (Algorithm 1). It is similar to standard greedy approaches for submodular maximization with budget constraints (e.g., [Nushi et al.2015]), but it additionally ensures that a tentative output at each iteration is an element of F. To do so, it initially constructs a set of couples that contains all the peer pairs, and at each iteration selects either a node that already has a peer in the selected set or a pair of nodes that are peers. The selection procedure makes the choice that maximizes the ratio between the utility gain and the cost increase, while not exceeding the given budget. If there are multiple choices that maximize this ratio, the selection procedure selects one of them arbitrarily, whereas if no choice fits the budget constraint, stop is set to true, which ends the search and outputs the current solution S.
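The selection rule just described can be sketched compactly. This is our own simplified Python variant of the idea (function and variable names are ours, and the demo instance is made up), not the authors' exact pseudocode.

```python
# Sketch of the PPCGreedy idea: greedily add either a single node that
# already has a peer in S, or a peer couple, maximizing marginal utility
# per marginal modular cost under the reduced budget B' = B / gamma.
def ppc_greedy(nodes, neighbors, f, c_mod, budget, gamma):
    reduced = budget / gamma  # budget reduction from Lemma 3
    S = set()
    couples = {frozenset((v, p)) for v in nodes for p in neighbors[v]}
    while True:
        singles = {frozenset((v,)) for v in nodes - S if neighbors[v] & S}
        best, best_ratio = None, -1.0
        for move in couples | singles:
            add = set(move) - S
            if not add:
                continue  # move contributes nothing new
            new_cost = c_mod(S | add)
            dcost = new_cost - c_mod(S)
            if new_cost > reduced or dcost <= 0:
                continue  # infeasible under the reduced budget
            ratio = (f(S | add) - f(S)) / dcost
            if ratio > best_ratio:
                best, best_ratio = set(move), ratio
        if best is None:
            return S  # no feasible move: stop
        S |= best

# Tiny made-up demo: three mutually peered nodes, unit modular costs.
demo_nodes = {"a", "b", "c"}
demo_neighbors = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
demo_f = lambda S: len(S)            # modular utility
demo_cmod = lambda S: 1.0 * len(S)   # unit worst-peer payments
picked = ppc_greedy(demo_nodes, demo_neighbors, demo_f, demo_cmod,
                    budget=2.0, gamma=1.0)
```

With budget 2 and unit costs, the sketch can afford exactly one peer couple, so the output always contains a mutually peered pair, illustrating how feasibility in F is maintained at every step.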


We now show the main property of our algorithm: its near-optimality when the cost function ĉ has a low slope γ, i.e., when the difference between ĉ and ĉ_mod is small. Notice that the parameters α and τ_max are controllable through the design of the peer-prediction method and the requirements on minimal expected payments, which implies that γ can be tuned. For all practical purposes, it is also reasonable to assume that B' ≥ 2·τ_max, which simply states that our algorithm is always able to initially select any pair of nodes.

    Theorem 2.

Let the maximal relative difference between the modular costs of two peer nodes be bounded by β, i.e., ĉ_mod({v}) ≤ β·ĉ_mod({p}) for all (v, p) ∈ E, and let κ denote the maximum fraction of the reduced budget spent on a single node. Then, the output S of Algorithm 1 has the following guarantees on the utility:

    Proof (Sketch).

The proof of the theorem is non-trivial, so we outline only its basic steps (see [Radanovic et al.2017] for more details). Using the fact that f is submodular, while Algorithm 1 is greedy in terms of the utility-to-cost ratio, we show that:

where S_t is the solution at time-step t, while Opt′ is the optimal solution to optimization problem (8) when the budget equals B. Now, following the proofs of related results for submodular maximization under budget constraints (e.g., [Sviridenko2004, Nushi et al.2015]), and adapting them to our setting, we obtain that:

As we argue in the full proof, f(Opt′) ≥ f(Opt), because Opt′ is obtained for the same budget as Opt, but with a cost function that lower bounds ĉ. Together with the above inequality, this implies the statement of the theorem. ∎

We see that the quality of the approximation depends on the structural properties of the cost function, including the slope γ, the maximum cost discrepancy between two peer nodes measured by β, and the maximum fraction of the budget assigned to a node, measured by κ. As γ approaches its maximum value, the approximation ratio goes to 0. This is consistent with the hardness result presented in Section ’Methodology’, which shows the necessity of imposing structural constraints. One can reach a similar conclusion by analyzing β as it goes to its maximal value.

To see this more clearly, we can express the results of the theorem in terms of the original optimization problem and the structural properties of payment function τ. Using the bound on the slope (Lemma 2) and the boundedness of payments, we obtain:

    Corollary 1.

Under the assumptions of Theorem 2, the output of Algorithm 1 has the following guarantees on the utility:


Therefore, whenever the maximum payment τ_max or the number of possible peers D grows large, the approximation factor becomes negligible. Notice that the number of possible peers depends on α, so we can alternatively say that for small values of α the quality of the obtained greedy solution is relatively low. In practice, however, we can often avoid these corner cases by adjusting the payment function, and thus α and τ_max.

    More Efficient Budget Expenditure

The PPCGreedy algorithm, as described by Algorithm 1, does not necessarily spend the full budget on incentivizing nodes. This is because we use a reduced budget B' = B/γ when running the main steps of the algorithm. One can achieve better budget efficiency by iteratively calling the PPCGreedy method, as shown in Algorithm 2, which we refer to as PPCGreedyIter. It is important to note that the sub-procedure takes into account the currently selected nodes when examining the feasibility of a solution and when evaluating the utility and cost functions. The budget reduction in the subroutine can, on the other hand, be done with the same (initial) slope γ. The procedure terminates when no new node is added, which is equivalent to the budget not changing between two consecutive iterations.

The utility function is always evaluated conditioned on the set of nodes selected in previous iterations, denoted in the algorithm by f(· | S). The same is true for the cost function ĉ and its modular approximation ĉ_mod. Due to the monotonicity of f, this means that the reached solution is always at least as good as the one obtained by PPCGreedy. Furthermore, the cost of the solution is within the budget constraints: this is because ĉ is monotone (Lemma 1), so the slope defined on the full instance upper bounds the one defined on any residual instance, which implies that the subroutine makes a proper budget reduction. Therefore, the results of Theorem 2 and Corollary 1 are preserved.
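The iterative wrapper reduces to a short loop. The sketch below is our own abstraction: `greedy_step` stands in for a PPCGreedy call conditioned on the already-selected nodes (a hypothetical API of ours), and the stub instance is made up.

```python
# Sketch of the PPCGreedyIter idea: repeatedly re-run the greedy step on
# the residual budget, conditioning on already-selected nodes, until an
# iteration adds nothing new. greedy_step(already, remaining_budget) is
# assumed to return a feasible superset of `already` within that budget.
def ppc_greedy_iter(greedy_step, cost, budget):
    selected = set()
    while True:
        extra = greedy_step(selected, budget - cost(selected))
        if not extra - selected:
            return selected  # no new node added: budget stopped moving
        selected |= extra

# Made-up stub: each call adds one unit-cost peer pair if it still fits.
POOL = [{"a", "b"}, {"c", "d"}, {"e", "f"}]
def stub_step(already, remaining_budget):
    for pair in POOL:
        if not pair <= already and len(pair - already) <= remaining_budget:
            return already | pair
    return already

unit_cost = lambda S: float(len(S))
result = ppc_greedy_iter(stub_step, unit_cost, budget=4.0)
```

With budget 4 the stub admits exactly two pairs before the residual budget runs out, mirroring how the wrapper keeps spending until no feasible addition remains.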

1 Output: selected set S;
2 Initialize: S ← ∅; S_prev ← null; B_rem ← B;
3 //Iteratively re-run PPCGreedy on the residual budget
4 while S ≠ S_prev do
      5 S_prev ← S;
      6 S' ← PPCGreedy(G, f(· | S), ĉ(· | S), ĉ_mod(· | S), B_rem);
      7 S ← S ∪ S';
      8 B_rem ← B − ĉ(S);
  end while
Algorithm 2 Algorithm PPCGreedyIter

    Experimental Evaluation

To evaluate our approach, we use the crowd-sensing test-bed of AS:17, constructed from real pollution measurements and user locations across an urban area. The pollutant concentrations in the city of Zurich were acquired with a NODE+ sensor. These measurements were used to fit a Gaussian variogram whose parameters indicate the distance within which two measurement locations are relevantly correlated. We use this distance to define a disk coverage function: for a set of points of interest, we count how many of them are within this distance of the set of selected points. More formally, given a set S of points that represent the locations of the selected sensors and a set W of points that represent locations for which we would like to obtain measurements, the objective function is defined as f(S) = Σ_{w∈W} 1{∃ s ∈ S : d(w, s) ≤ r}, where 1{·} is an indicator evaluating to 1 when its condition is satisfied and to 0 otherwise, d(w, s) measures the distance in meters between locations w and s, and r is the correlation range obtained from the variogram. The function f is a coverage function, which is monotone and submodular [Krause and Golovin2012].
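The disk-coverage objective is straightforward to implement. The coordinates and radius below are made up for illustration; only the counting logic reflects the objective described above.

```python
# Illustrative disk-coverage objective (made-up instance): count points of
# interest within radius r of at least one selected sensing location.
def disk_coverage(selected, points_of_interest, r):
    def dist(p, q):
        return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
    return sum(1 for w in points_of_interest
               if any(dist(w, s) <= r for s in selected))

POIS = [(0, 0), (10, 0), (0, 10), (50, 50)]
```

Because adding a sensor can only cover additional points, and points already covered contribute nothing extra, this objective is monotone and submodular, exactly the regime the algorithms above require.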

Points of interest are predefined. They were obtained using publicly available data (OpenStreetMap), from which we randomly selected locations in an area in the center of New York City. To identify the locations of available crowd-sensors, i.e., the ground set V, we use the population statistics of the test-bed, which give us the likelihood of a user appearing at one of the points. These statistics are inferred from a publicly accessible dataset (Strava) that contains the mobility patterns of cyclists over a multi-day period. We sample sensing locations from this likelihood and then perturb them by a small random offset.

As a peer-prediction scoring rule, we use the output agreement mechanism described in Section ’Problem Statement’. We approximate the expected score of this mechanism for two sensors in the truthful reporting regime by using the variogram of the test-bed: given the range parameter of the variogram, we estimate the expected payoff between two sensors as a function that decreases with the distance between them.

(a) Performance as we vary the available budget. (b) Performance as we vary the minimum payments.
Figure 2: Experimental results show how the utility changes as we increase the budget B or the minimum expected payment α; f(S) denotes the utility of the output of the different methods. Increasing B is beneficial as it allows more sensors to be selected. On the other hand, as we increase α, the sensors have fewer peers and need to be paid more. Random is run multiple times and we show the means and standard deviations.

Results. We test our approaches, PPCGreedy and PPCGreedyIter, against two baselines: (a) a random selection (denoted Random) that satisfies the peer-prediction constraints; (b) a greedy approach (denoted Greedy) that assumes it suffices to reward each sensor with the minimum required payment, without providing incentives for accurate reporting. Clearly, the latter baseline represents an optimistic approach whose performance upper bounds that of the proposed algorithms, while the former is likely to lower bound their performance. In all cases, the total expected payment is at most B.

We perform two different tests. In the first test, we vary the total available budget B while keeping the minimal expected payment α fixed. As we can see from Figure 2, as the total budget increases, all the methods perform better. However, the increase is more pronounced for the non-random algorithms. The performance of PPCGreedyIter is generally better than that of PPCGreedy, and this is due to the spent budget: as explained in the previous section, PPCGreedy is only the first step of PPCGreedyIter, which further iteratively runs PPCGreedy on the remaining budget.

In the second test, we vary the minimal expected payment α, with the budget B held fixed. The results are given in Figure 2. Except for Random, the general trend is that increasing the minimal payment leads to lower performance, which is not surprising given that the number of peers and the budget per sensor decrease in that case. Random in general performs much worse than the other techniques for all values of α. Moreover, notice that the discrepancy in performance between PPCGreedy, PPCGreedyIter, and Greedy increases with α. Initially, all the non-random algorithms find an optimum of the utility function.

    Related Work

Information elicitation. Standard incentives for quality are typically categorized into gold-standard techniques, such as proper scoring rules [Gneiting and Raftery2007] or prediction markets [Chen and Pennock2007], and peer-prediction techniques, such as the classical peer-prediction [Miller, Resnick, and Zeckhauser2005] or Bayesian truth serums [Prelec2004, Witkowski and Parkes2012, Radanovic and Faltings2013]. We focus in this paper on peer-prediction techniques due to the scalability of elicitation without verification for acquiring large amounts of highly distributed information. Recently, several peer-prediction mechanisms were proposed for various crowdsourcing scenarios, including micro-task crowdsourcing [Dasgupta and Ghosh2013], opinion polling [Jurca and Faltings2011], information markets [Baillon2017], peer grading [Shnayder et al.2016], and, most importantly for this work, crowdsensing [Radanovic, Faltings, and Jurca2016]. The proposed mechanisms for these domains follow the standard principles of the classical peer-prediction, e.g., incentivizing participants by comparing their reports and placing higher scores on a priori less likely matches. However, they also often extend the design of the original methods by making them more robust in terms of the required number of participants and the knowledge about them, the heterogeneity of users and tasks, or the susceptibility to collusive behaviors [Faltings and Radanovic2017]. We analyze orthogonal characteristics important for deploying such mechanisms in practice, i.e., budget and cost-acquisition constraints. Although prior work (e.g., [Liu and Chen2016]) does study meta-mechanisms that make peer-prediction mechanisms proper in terms of effort exertion, it is often based on scaling techniques, which either ignore budget limitations or the cost of effort.

    Submodular function maximization. On the technical side, the most important aspect is submodular function maximization. While there is a sizeable literature on this topic (e.g., [Krause and Guestrin2011, Krause and Golovin2012]), we mostly focus on the prior work that is closely related to the techniques used in this paper. Our basic objective is subset selection under budget (knapsack) constraints (e.g., [Sviridenko2004]), and we base our algorithmic techniques on a simple greedy approach [Nushi et al.2015]. Notice that we additionally have a graph-based constraint, which is similar in spirit to [Singla et al.2015], although we are solving a different optimization problem. Arguably, this paper is most closely related to submodular maximization with submodular budget constraints [Iyer and Bilmes2013]; contrary to that work, our budget constraints are not necessarily sub/super-modular. It is also worth mentioning the hardness results that relate to the ones obtained in this paper, such as the inapproximability of the maximum of a submodular, non-monotone, possibly negative profit function [Feige et al.2008].
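As background for the greedy techniques referenced above, here is a minimal Python sketch of the standard cost-benefit greedy rule for maximizing a monotone submodular function under a knapsack constraint. The function names and the toy coverage objective are illustrative, and the sketch omits the peer and minimum-payment constraints that distinguish PPCGreedy and PPCGreedyIter.

```python
def greedy_knapsack(items, f, cost, budget):
    """Cost-benefit greedy: repeatedly add the item with the best
    marginal-gain-to-marginal-cost ratio that still fits the budget."""
    S, spent = set(), 0.0
    while True:
        best, best_ratio, best_cost = None, 0.0, 0.0
        for x in items - S:
            c = cost(S, x)  # marginal cost of adding x to S
            if c <= 0 or spent + c > budget:
                continue
            ratio = (f(S | {x}) - f(S)) / c
            if ratio > best_ratio:
                best, best_ratio, best_cost = x, ratio, c
        if best is None:
            return S
        S.add(best)
        spent += best_cost

# Toy coverage objective: f(S) = number of distinct elements covered.
coverage = {1: {"a", "b"}, 2: {"b"}, 3: {"c"}}
f = lambda S: len(set().union(*(coverage[x] for x in S))) if S else 0
picked = greedy_knapsack({1, 2, 3}, f, lambda S, x: 1.0, budget=2.0)
```

With unit costs and budget 2, the greedy first picks item 1 (gain 2 per unit cost) and then item 3 (gain 1), covering all three elements.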


    In this paper, we have introduced an information elicitation model for data collection from distributed sources in which the incentive mechanism is based on peer-prediction. We have shown that optimal information gathering is computationally infeasible in the sense that even approximating the optimal solution is NP-hard. However, given structural constraints on peer-prediction incentives, we have proposed two greedy methods that achieve good performance relative to the optimum, and we have tested their performance empirically on a realistic crowdsensing testbed.

    Acknowledgments This work was supported in part by the Swiss National Science Foundation, program as part of the Opensense II project, ERC StG 307036, a SNSF Early Postdoc Mobility fellowship, and a Facebook Graduate fellowship.


    • [Baillon2017] Baillon, A. 2017. Bayesian markets to elicit private information. Proceedings of the National Academy of Sciences 114(30):7958–7962.
    • [Chen and Pennock2007] Chen, Y., and Pennock, D. M. 2007. A utility framework for bounded-loss market makers. In Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence.
    • [Dasgupta and Ghosh2013] Dasgupta, A., and Ghosh, A. 2013. Crowdsourced judgement elicitation with endogenous proficiency. In Proceedings of the 22nd ACM International World Wide Web Conference.
    • [Faltings and Radanovic2017] Faltings, B., and Radanovic, G. 2017. Game Theory for Data Science: Eliciting Truthful Information. Morgan & Claypool Publishers.
    • [Faltings, Li, and Jurca2014] Faltings, B.; Li, J. J.; and Jurca, R. 2014. Incentive Mechanisms for Community Sensing. IEEE Transaction on Computers 63(1):115–128.
    • [Feige et al.2008] Feige, U.; Immorlica, N.; Mirrokni, V.; and Nazerzadeh, H. 2008. A combinatorial allocation mechanism with penalties for banner advertising. In Proceedings of the 17th International Conference on World Wide Web.
    • [Frongillo and Witkowski2016] Frongillo, R., and Witkowski, J. 2016. A geometric method to construct minimal peer prediction mechanisms. In Proceedings of the 30th AAAI Conference on AI.
    • [Gao, Wright, and Leyton-Brown2016] Gao, A.; Wright, J. R.; and Leyton-Brown, K. 2016. Incentivizing evaluation via limited access to ground truth: Peer-prediction makes things worse. CoRR abs/1606.07042.
    • [Gneiting and Raftery2007] Gneiting, T., and Raftery, A. E. 2007. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102:359–378.
    • [Hastad1999] Hastad, J. 1999. Clique is hard to approximate within n^(1-ε). Acta Math. 182(1):105–142.
    • [Iyer and Bilmes2013] Iyer, R., and Bilmes, J. 2013. Submodular optimization with submodular cover and submodular knapsack constraints. In Proceedings of the 26th International Conference on Neural Information Processing Systems.
    • [Iyer, Jegelka, and Bilmes2013] Iyer, R.; Jegelka, S.; and Bilmes, J. 2013. Curvature and optimal algorithms for learning and minimizing submodular functions. In Proceedings of the 26th International Conference on Neural Information Processing Systems.
    • [Jurca and Faltings2011] Jurca, R., and Faltings, B. 2011. Incentives for answering hypothetical questions. In Workshop on Social Computing and User Generated Content.
    • [Kong and Schoenebeck2016] Kong, Y., and Schoenebeck, G. 2016. A framework for designing information elicitation mechanisms that reward truth-telling. CoRR abs/1603.07751.
    • [Krause and Golovin2012] Krause, A., and Golovin, D. 2012. Submodular function maximization. Tractability: Practical Approaches to Hard Problems 3:19.
    • [Krause and Guestrin2011] Krause, A., and Guestrin, C. 2011. Submodularity and its applications in optimized information gathering. ACM Transactions on Intelligent Systems and Technology 2(4):32.
    • [Liu and Chen2016] Liu, Y., and Chen, Y. 2016. Learning to incentivize: Eliciting effort via output agreement. In Proceedings of the 25th International Joint Conference on Artificial Intelligence.
    • [Miller, Resnick, and Zeckhauser2005] Miller, N.; Resnick, P.; and Zeckhauser, R. 2005. Eliciting informative feedback: The peer-prediction method. Management Science 51:1359–1373.
    • [Nushi et al.2015] Nushi, B.; Singla, A.; Gruenheid, A.; Zamanian, E.; Krause, A.; and Kossmann, D. 2015. Crowd access path optimization: Diversity matters. In AAAI Conference on Human Computation and Crowdsourcing.
    • [Prelec2004] Prelec, D. 2004. A bayesian truth serum for subjective data. Science 306(5695):462–466.
    • [Radanovic and Faltings2013] Radanovic, G., and Faltings, B. 2013. A robust bayesian truth serum for non-binary signals. In Proceedings of the 27th AAAI Conference on Artificial Intelligence.
    • [Radanovic et al.2017] Radanovic, G.; Singla, A.; Krause, A.; and Faltings, B. 2017. Information gathering with peers: Submodular optimization with peer-prediction constraints (extended version). CoRR abs/1711.06740.
    • [Radanovic, Faltings, and Jurca2016] Radanovic, G.; Faltings, B.; and Jurca, R. 2016. Incentives for effort in crowdsourcing using the peer truth serum. ACM Transactions on Intelligent Systems and Technology 7:48:1–48:28.
    • [Shnayder et al.2016] Shnayder, V.; Agarwal, A.; Frongillo, R.; and Parkes, D. C. 2016. Informed truthfulness in multi-task peer prediction. In Proceedings of the 2016 ACM Conference on Economics and Computation.
    • [Singla and Krause2013] Singla, A., and Krause, A. 2013. Incentives for privacy tradeoff in community sensing. In AAAI Conference on Human Computation and Crowdsourcing (HCOMP).
    • [Singla et al.2014] Singla, A.; Horvitz, E.; Kamar, E.; and White, R. W. 2014. Stochastic privacy. In Proc. Conference on Artificial Intelligence (AAAI).
    • [Singla et al.2015] Singla, A.; Horvitz, E.; Kohli, P.; White, R.; and Krause, A. 2015. Information gathering in networks via active exploration. In IJCAI.
    • [Singla2017] Singla, A. 2017. Learning and Incentives in Crowd-Powered Systems. Ph.D. Dissertation, ETH Zurich.
    • [Sviridenko2004] Sviridenko, M. 2004. A note on maximizing a submodular set function subject to a knapsack constraint. Oper. Res. Lett. 32(1):41–43.
    • [Tschiatschek, Singla, and Krause2017] Tschiatschek, S.; Singla, A.; and Krause, A. 2017. Selecting sequences of items via submodular maximization. In Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI’17).
    • [von Ahn and Dabbish2004] von Ahn, L., and Dabbish, L. 2004. Labeling images with a computer game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.
    • [Waggoner and Chen2014] Waggoner, B., and Chen, Y. 2014. Output agreement mechanisms and common knowledge. In Proceedings of the Second AAAI Conference on Human Computation and Crowdsourcing.
    • [Witkowski and Parkes2012] Witkowski, J., and Parkes, D. C. 2012. A robust bayesian truth serum for small populations. In Proceedings of the 26th AAAI Conference on Artificial Intelligence.
    • [Witkowski et al.2017] Witkowski, J.; Atanasov, P.; Ungar, L. H.; and Krause, A. 2017. Proper proxy scoring rules. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence.

    Appendix: Information Gathering with Peers

    Proof of Lemma 1

    Statement: Cost function defined by (5) is monotone. Furthermore: , for all .


    Denote by a set of nodes in who have peers in . Notice that . For a given set and , we have:

    Furthermore, from the proof we see that , thus proving the second part of the statement. ∎

    Proof of Theorem 1

    Statement: For any , it is NP-hard to find a solution to optimization problem (6) (and thus (4)) such that .


    We prove the statement by reducing the maximum clique problem to optimization (6). Consider an arbitrary undirected unweighted graph for which we wish to compute the maximum clique. To reduce the maximum clique problem to (6):

    1. define function as , which is clearly monotone and submodular;

    2. define payment function as: if , and otherwise;

    3. set budget to ;

    4. and set .

    Notice that such an arrangement induces a fully connected graph . Furthermore, we defined deterministic payment functions and , but one can use and instead. Points 2 and 4 ensure that any solution to optimization problem (6) is a clique in graph ; otherwise, the budget constraints would be violated in solving (6). Likewise, points 2 and 3 ensure that any clique is permitted as a potential solution w.r.t. the budget constraint. Finally, point 1 ensures that we search for a clique with the maximum number of vertices. Since the reduction is computationally efficient (polynomial in the graph size, i.e., and ), optimization (4) is at least as hard as the maximum clique problem. Using the fact that the maximum clique problem is hard to approximate within factor n^(1-ε) [Hastad1999], we obtain the claim. ∎
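The reduction can be sketched in Python as follows. This is a plausible instantiation with illustrative constants (payment 1 between adjacent vertices and 0 otherwise, minimum-payment threshold 1, budget n), since the exact constants of points 1–4 are elided above; the brute-force search is for illustration only — the point of the reduction is precisely that this problem is hard.

```python
import itertools

def max_clique_by_reduction(n, edges):
    """Feasible sets under the reduction are exactly cliques: every
    within-set pairing must meet the minimum-payment threshold, which
    only adjacent vertices do, while the budget (total payment, each
    vertex scored against its best peer) admits any clique. Maximizing
    f(S) = |S| over feasible sets then yields a maximum clique."""
    E = {frozenset(e) for e in edges}
    pay = lambda i, j: 1.0 if frozenset((i, j)) in E else 0.0
    p_min, budget = 1.0, float(n)

    def feasible(S):
        min_ok = all(pay(i, j) >= p_min
                     for i, j in itertools.combinations(S, 2))
        total = sum(max(pay(i, j) for j in S if j != i) for i in S)
        return min_ok and total <= budget

    for k in range(n, 1, -1):  # try the largest candidate sets first
        for S in itertools.combinations(range(n), k):
            if feasible(S):
                return set(S)
    return set()
```

For example, `max_clique_by_reduction(4, [(0, 1), (1, 2), (0, 2), (2, 3)])` returns the triangle `{0, 1, 2}`: the full vertex set fails the minimum-payment constraint because non-adjacent pairs pay 0.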

    Proof of Lemma 2

    Statement: The slope of cost function is upper-bounded by:


    Notice that the slope is maximized when there exists a node such that:

    • the expected payoff of when it’s scored against its worst peer is ,

    • the expected payoff of increases to whenever any other peer is used for scoring,

    • any of ’s peers (including ; the analysis changes slightly if is required to have the same payoff as when they are mutual peers, but the main result stays the same) achieves expected payoff when scored against , and achieves when scored against some other peer.

    This gives us:

    Proof of Lemma 3

    Statement: Any feasible solution to optimization problem is also a feasible solution to optimization problem (6) (and thus (4)) for , where is the slope of cost function .


    Since both problems require that , we only need to show that the budget constraint in is not violated when is selected. Let us enumerate the elements of , so that each element is assigned an index . We have:

    The (first) inequality is due to the slope of . Now, since , it follows that , which proves the statement. ∎

    Proof of Theorem 2

    Statement: Let the maximal relative difference between modular costs of two peer nodes be bounded by , i.e., , and let . Then, the output of Algorithm 1 has the following guarantees on the utility:


    Let denote an optimal solution to optimization problem when the budget is , and denote an optimal solution to optimization problem when the budget is . Clearly, . Furthermore, , where Opt is an optimal solution to optimization problem (and ), because cost function is lower-bounded by its modular approximation . Therefore, it suffices to lower-bound with , up to the approximation factor in the statement of the theorem.

    Let represent the current solution of the greedy algorithm at time step , and assume w.l.o.g. that is not . (This holds for at least due to the assumption on the boundedness of modular payments, i.e., .) Due to the monotonicity and submodularity of function and the fact that Algorithm 1 is greedy in terms of the ratio , we have:

    where is any peer of . Using the fact that Algorithm 1 is greedy in terms of ratio, we further obtain:

    Now, and give us:

    By rearranging, we get:

    and further since and :

    where we used an inductive argument. In other words:

    where, for notational convenience, we consider instead of . Now, because:

    for , we obtain: