Truth Serums for Massively Crowdsourced Evaluation Tasks

by Vijay Kamble, et al.
Stanford University

A major challenge in crowdsourcing evaluation tasks like labeling objects, grading assignments in online courses, etc., is that of eliciting truthful responses from agents in the absence of verifiability. In this paper, we propose new reward mechanisms for such settings that, unlike many previously studied mechanisms, impose minimal assumptions on the structure and knowledge of the underlying generating model, can account for heterogeneity in the agents' abilities, require no extraneous elicitation from them, and furthermore allow their beliefs to be (almost) arbitrary. These mechanisms have the simple and intuitive structure of an output agreement mechanism: an agent gets a reward if her evaluation matches that of her peer, but unlike the classic output agreement mechanism, this reward is not the same across evaluations, but is inversely proportional to an appropriately defined popularity index of each evaluation. The popularity indices are computed by leveraging the existence of a large number of similar tasks, which is a typical characteristic of these settings. Experiments performed on MTurk workers demonstrate higher efficacy (with a p-value of 0.02) of these mechanisms in inducing truthful behavior compared to the state of the art.




1 Introduction

Systems that leverage the wisdom of the crowd are ubiquitous today. Recommendation systems such as Yelp, where people provide ratings and reviews for various entities, are used by millions of people across the globe [Luc11]. Commercial crowdsourcing platforms such as Amazon Mechanical Turk, where workers perform microtasks in exchange for payments over the Internet, are employed for a variety of purposes, such as collecting labelled data to train machine learning algorithms [RYZ10]. In massive open online courses (MOOCs), students’ exams or assignments are often evaluated by means of “peer-grading”, where students grade each others’ work [PHC13].

A common feature of many of these applications is that they involve a large number of similar evaluation tasks, and every agent performs a subset of these tasks. For instance, a typical collection of tasks on Amazon Mechanical Turk comprises labeling a large set of images for some machine learning application. A standard peer-grading task in a MOOC involves grading a large number of submissions for each assignment. We call these tasks massively crowdsourced evaluation tasks, or MCETs.

A major challenge in MCETs is incentivizing the agents to report their evaluations truthfully, i.e., to not try to game the system for monetary gain. This is achieved by designing appropriate reward mechanisms, and there has been a considerable amount of prior work on designing such mechanisms for different settings. Unfortunately, most of these mechanisms have found limited success in practice. One critical drawback that we believe impedes their widespread use is that these mechanisms have a complex structure and description, which makes it difficult for a typical agent to understand the mechanism and account for it while choosing her behavior. Recently, in an effort to promote practical deployments of market designs, there has been a significant push towards designing simpler economic mechanisms in the mechanism design research community [Rub15, Li15]. Indeed, simplicity in mechanism design has been a theme of many recent workshops in the research community; for instance, the abstract of one such workshop [Sim15] on “Complexity and Simplicity in Economics” notes that “Ideal economic systems must still remain simple enough for human participants to understand…” Due to the significant presence of the human element, these considerations are all the more important in the case of crowdsourcing. In this paper, we attempt to address these issues in the context of MCETs by designing a class of simple reward mechanisms that incentivize truthful reporting.

The research on designing reward mechanisms for crowdsourced evaluation tasks falls largely into two categories, depending on whether or not one assumes the existence of so-called gold standard objects [LEHB10, CMBN11]. These are a small subset of objects for which the principal either knows the correct evaluations a priori or can verify them accurately. Incentive design is then facilitated by scoring the agents on their performance on these objects using proper scoring rules [LS09, SZP15, SZ15].

The present paper contributes to the second line of research, which makes no assumption about the existence of such gold standard objects and is more realistic in many applications of interest, where obtaining correct evaluations for a fraction of the objects is either impossible or too costly. Several works operate specifically in this domain and have designed clever incentive mechanisms, while making different assumptions on the behavior of the agents and on the knowledge the mechanism designer has about this behavior [MRZ05, Pre04, WP12, RF13].

Our mechanisms build upon the structure of output agreement mechanisms [VAD08, VAD04], which are simple, intuitive, and have been quite popular in practice, except that they suffer from a critical drawback: they do not incentivize truthful responses in general. In an output agreement mechanism, two agents answer the same question, and both are rewarded if their answers match. From the perspective of an agent, in the absence of any extraneous information, this almost incentivizes truthful reporting, since in many cases it is most likely that the other agent also has the same answer. But this is not the case when the agent believes that her answer is relatively unpopular and that a typical agent will have a different opinion. It is then tempting to report the answer that is more likely to be popular rather than correct. Moreover, there is an undesirable equilibrium in this game in which every person reports the same answer irrespective of their true evaluation, which guarantees each person the highest possible payoff rewarded by the mechanism. Our mechanisms overcome these drawbacks by giving proportionately higher rewards for answers that turn out to be relatively less popular, and lower rewards for answers that turn out to be more popular on average. These rewards are designed so that as soon as an agent sees the object and forms an evaluation, the conditional probabilities of the evaluations of another agent evaluating the same object change, relative to the overall popularity of the different evaluations, in such a way that it becomes more profitable for her to report her opinion truthfully. This is achieved by leveraging some fundamental properties of the generating model.
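The popularity-scaled reward described above can be sketched in a few lines. This is an illustrative toy of our own (the names `reward` and `popularity` and the constant `c` are not from the paper; the precise popularity index used by the actual mechanisms is defined later):

```python
def reward(report_i, report_j, popularity, c=1.0):
    """Pay only on agreement, scaled inversely by the answer's popularity."""
    if report_i != report_j:
        return 0.0
    return c / popularity[report_i]

# Two answers: 'a' is popular (index 0.9), 'b' is unpopular (index 0.1).
pop = {'a': 0.9, 'b': 0.1}
print(round(reward('a', 'a', pop), 2))  # 1.11 -- small reward for a popular match
print(reward('b', 'b', pop))            # 10.0 -- large reward for a rare match
print(reward('a', 'b', pop))            # 0.0  -- no match, no reward
```

Matching on an unpopular answer pays far more than matching on a popular one, which is exactly what removes the temptation to herd on the popular report.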

We consider a standard setting that assumes the existence of an underlying generating model that captures the inherent characteristics of each of these evaluation tasks, and the abilities/biases of the agents. The mechanisms we propose are ‘minimal’ in the sense that they do not solicit any extraneous information from the agents apart from their own individual evaluations. Further, they make minimal structural assumptions on the generating model and do not require knowledge of its details. Finally, motivated by practical concerns, truthfulness is incentivized in quite a strong sense, in that the agents are allowed to have (almost) arbitrary opinions or beliefs about the details of the generating model; e.g., one agent may grossly underestimate the abilities of the other agents and overestimate her own ability, while another agent may have no such opinions. In order to achieve these objectives, our mechanism assumes the existence of a large number of similar tasks, which is typical of MCETs.

An important distinction that naturally arises in the MCET setting is that between a homogeneous and a heterogeneous population of agents [RF15]. Homogeneity of the agents intuitively means that all agents are statistically similar in the way they answer any question; it implies, for instance, that the agents do not have any relative biases or differences in abilities. As we argue later, such an assumption is reasonable in the case of surveys, where an agent’s answer to a question can be seen as an independent sample of the distribution of the answers in the population. But it is inappropriate in subjective evaluation tasks like rating movies or grading answers, in which systematic biases may exist because of differences in preferences, effort or abilities. We propose mechanisms specifically tailored to each of these settings. In the heterogeneous case, it is known [RF15] that it is necessary to impose certain structural restrictions on the generating model in order to be able to design truthful mechanisms. With this in mind, we restrict ourselves to the setting of binary-choice evaluation tasks, and then propose a mechanism that is truthful under a mild regularity assumption that is naturally justifiable in several MCETs of interest. This assumption is substantially weaker than other assumptions that have appeared before in the literature for similar settings (e.g., [DG13]).

Finally, we conduct an experimental evaluation on Amazon Mechanical Turk to test how understandable or “simple” these mechanisms with an output agreement structure and popularity-scaled rewards are, and how successful they are at inducing optimal behavior. We compare one of our mechanisms against a mechanism proposed in [RF15], the current state of the art for the homogeneous population MCET setting, as a benchmark. The experiments reveal that our mechanism is more successful in inducing truthful behavior (with a p-value equal to 0.02).

The remainder of the paper is organized as follows. Section 2 presents a formal description of the model considered in the paper. Given the model, Section 3 places our work in the context of the existing literature. Section 4 and Section 5 contain the main results of the paper. Section 4 presents a mechanism to incentivize truthful reports without asking for additional information, assuming that the population is homogeneous. Section 5 then extends the results to a setting that does not make the homogeneity assumption. Our experimental results are presented in Section 6. The paper concludes with a discussion in Section 7.

2 Model

Consider a population denoted by the set , with agents labelled . Consider an evaluation task in which an agent in interacts with an object and forms an evaluation taking values in a finite set . Examples of object and evaluation pairs are: ratings of movies or businesses (e.g., Yelp, IMDB), labels of images (in crowdsourced labeling tasks), and grades of assignments (in peer-grading). An agent’s evaluation of an object is influenced by the unknown attributes of the object and the manner in which these attributes affect her evaluations, or in abstract terms, her tastes, abilities, etc. Note that the attributes of an object capture everything about the object that could affect its evaluation, and as such these attributes may or may not be measurable. For example, in the case where a mathematical solution is being evaluated on a peer-grading platform, its attributes could be elegance, handwriting, clarity of presentation, etc., taking values: “elegance: high, handwriting: poor, clarity of presentation: poor.” Denote the hidden attribute values of an object by the quantity , which we will simply call the type of the object, and assume that this type takes values in a finite universe .

Denote agent ’s evaluation for the object by . The manner in which an object’s attributes influence her evaluation is modeled by a conditional probability distribution over given different values of , i.e., for each and . For notational convenience, we will denote this distribution by and we will refer to it as the “filter” of person . Note that for each , the filter can be represented as a stochastic matrix of size (recall that a stochastic matrix is one in which all the entries are non-negative and all the rows sum to 1). We assume that the filters themselves are drawn independently for each agent , but from an identical distribution defined on a support for all , where is some subset of the set of all stochastic matrices of size .

In our setting there are similar objects, labeled , that are being evaluated. The type of object is denoted by and each is assumed to be drawn independently from a common probability distribution over . Let denote the set of persons that evaluate object and let be the set of objects that a person evaluates. If an agent evaluates object , let denote her evaluation for that object. We assume that since the objects are similar, the filters of any individual agent for evaluating the different objects are the same, i.e., .
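As an illustration, this generating model can be simulated directly. The type universe, prior, and filter below are toy values of our own choosing; `draw` samples a key of a finite distribution:

```python
import random

random.seed(0)

types = ['low', 'high']              # finite universe of object types
prior = {'low': 0.4, 'high': 0.6}    # common distribution over types

def draw(dist):
    """Sample a key of `dist` with probability proportional to its value."""
    r, acc = random.random(), 0.0
    for k, p in dist.items():
        acc += p
        if r < acc:
            return k
    return k                         # guard against rounding at the boundary

# A filter: one conditional distribution over evaluations per type.
filt = {'low':  {'0': 0.8, '1': 0.2},
        'high': {'0': 0.3, '1': 0.7}}

# N objects, each with a hidden type; conditional on that type, the
# evaluations of the agents assigned to it are independent draws.
n_objects, agents_per_object = 1000, 3
evaluations = []
for _ in range(n_objects):
    t = draw(prior)
    evaluations.append([draw(filt[t]) for _ in range(agents_per_object)])
```

With many such objects drawn independently from the same prior, empirical statistics of the reports concentrate around their population values, which is the property the mechanisms below exploit.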

Conditional on realizations of the filters for all , we make the following independence assumptions:

  1. The evaluations by different are conditionally independent given .

  2. The sets of random variables for the different objects are mutually independent across objects.

In particular, the second assumption implies that and are independent for any person that has evaluated objects and . (This assumption precludes the possibility of dependence induced by lack of knowledge of some hidden information about an agent; e.g., if a worker’s mood is bad on a particular day, there may be a bias in all her evaluations.) Note that the random variables for a single object need not be independent unless conditioned on .

The pair of the probability distribution over types and the distribution over filters is then said to comprise a generating model, denoted as . In particular, given the conditional independence assumptions above, they fully specify a joint distribution on the underlying types of the different objects, the filters of the different agents, and the evaluations of the different agents of these objects.

Our goal is to design a payment mechanism that truthfully elicits evaluations from the population. With MCETs in mind, we are specifically interested in the case where is large. The mechanism designer is not assumed to have any knowledge of or the filters of the different people in the population, or of . Further, we assume that every member of the population knows the structure of the underlying generating model: in particular, the existence of a single that generates the type of each object, the existence of some that generates the filters of every member, the conditional independence of the evaluations for every object given its type, and the independence of the evaluations across different objects. But the agents may not know, or may have different subjective beliefs about, the values of , about , and even their own filter. We present an example of this setting.

Example 2.1.

Peer-grading in MOOCs: Peer-grading, where students evaluate their peers and these evaluations are processed to assign grades to every student, has been proposed as a scalable solution to the problem of grading in MOOCs. An important component of any such scheme is the design of incentives so that students are truthful when they grade others. For example, say that the answer of any student to a fixed question has some true grade , or , which can be taken to be the type of the answer. Suppose that a priori there is a distribution over the grade of any answer that is common to all answers (to a fixed question). Each answer is then graded by a few students (and in turn each student grades a few answers), who, depending on some given rubric and their abilities, form an opinion as to what grade should be assigned to the answer. Similarly, there are thousands of such answers that are graded by other students. It is natural to assume that, conditional on the true grade of an answer, the evaluations of the different students who grade that answer are independent. It is also natural to assume that the grades given by the students to different answers are independent. One then wants to design a mechanism that incentivizes the students to report their true opinions about the answers that they have graded.

Let denote person ’s reported evaluation for object . Then we have the following definition of a payment mechanism.

Definition 2.1.

A payment (or scoring) mechanism is a set of functions , one for each person in the population, that map the reports to a real valued payment (or score).

We will work with the following notion of detail-free incentive compatibility.

Definition 2.2.

Consider a class of generating models. We say that a given payment mechanism is strictly detail-free Bayes-Nash incentive compatible with respect to the class if for each ,


for each , where the conditional expectation is with respect to the joint distribution on the evaluations of the population resulting from any specification of the generating model in class , and any in the support of . Here .

This definition implies that as long as an agent believes that the generating model is in , irrespective of whether or not she knows the generating model and her own filter, if everyone else is truthful, she gets a strictly higher payoff by being truthful. Thus truthful reporting is a strict equilibrium of the game induced by the mechanism if every agent believes that the generating model is in .

We will consider two classes of generating models, inspired by two types of applications encountered in practice. The difference between them lies in the manner in which different agents are assumed to form their evaluations of an object.

  • Homogeneous population: Consider a typical survey, e.g., suppose the government wishes to find out the chance that a visit to the DMV office in a particular location at a particular time of the day faces a waiting time of more than 1 hour. This is a number that can be thought of as an attribute of the DMV, and for simplicity, assume that it takes values in a finite set, say . The evaluation of any agent is just a value , with 1 denoting that she faced a wait time of more than 1 hour. In this case it is natural to assume that , i.e., each person’s evaluation is an independent sample of the hidden value . This means that does not depend on , and is the same for everyone. In such a case, we say that the population is homogeneous, i.e., conditioned on the type of the object, different agents form their evaluations in a statistically identical fashion. In this case, has its support on a single filter: the population filter . We will consider this case in Section 4.

  • Heterogeneous population: In most subjective evaluations, the manner in which agents form evaluations differs considerably due to differences in preferences, abilities, etc. So it is natural to assume that the filters vary across the population, i.e., has a support of size larger than . We will consider this case in Section 5. In general, it is impossible to design detail-free truthful mechanisms in this case unless some additional structural assumptions are made on the support of . We will propose a natural structural assumption for the case , i.e., the case where the evaluations are binary, and design a truthful mechanism under this assumption.
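The distinction between the two classes can be made concrete in code. The particular filters and the two-point support below are illustrative assumptions of ours, not values from the paper:

```python
import random

random.seed(1)

# Homogeneous population: a single filter (a row-stochastic matrix, one row
# per type) shared by every agent in the population.
population_filter = [[0.8, 0.2],   # P(evaluation | type 0)
                     [0.3, 0.7]]   # P(evaluation | type 1)
homogeneous = [population_filter] * 5

# Heterogeneous population: each agent's filter is an independent draw from
# a distribution over stochastic matrices -- a two-point support for brevity.
support = [population_filter,
           [[0.6, 0.4],
            [0.4, 0.6]]]           # a "noisier" agent
heterogeneous = [random.choice(support) for _ in range(5)]

# In either case every filter is row-stochastic: rows sum to 1.
assert all(abs(sum(row) - 1.0) < 1e-9 for f in heterogeneous for row in f)
```

In the homogeneous case the distribution over filters is degenerate (a single point), while in the heterogeneous case its support has more than one element, which is exactly the formal distinction drawn above.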

3 Related work

The theory of elicitation of private evaluations or predictions of events has a rich history. In the standard setting, an agent possesses some private information in the form of an evaluation of some object or some informed prediction about an event, and one would like to elicit this private information. There are two categories of these problems. In the first category, the ground truth, e.g., true quality or nature of the object or the knowledge of the realization of the event that one wants to predict, is available or will be available at a later stage. In this case, the standard technique is to score an agent’s reports against the ground truth, and proper scoring rules [GR07, Sav71, LS09] provide an elegant framework to do so. In the second category of problems, the ground truth is not known. In this case there is little to be done except to score these reports against the reports of other agents who have provided similar predictions about the same event. The situation is then inherently strategic, in which one hopes to sustain truthful reporting as an equilibrium of a game: assuming all the other agents provide their predictions truthfully, these predictions form an informative ensemble, and with a carefully designed rule that scores reports against this ensemble, one incentivizes any agent to also be truthful. The present work falls in this category.

In this category, the majority of the early literature has focused on the case where a single object is being evaluated. In a pioneering work, the peer-prediction method of [MRZ05] assumed that the population is homogeneous and that the mechanism designer knows the agents’ beliefs about the underlying generating model of evaluations. In this case, they demonstrated the use of proper scoring rules to design a truthful mechanism that utilizes the knowledge of these subjective beliefs. These mechanisms are minimal in the sense that they only require agents to report their evaluations. In another influential work, [Pre04] considered a homogeneous population and designed an oblivious mechanism, famously termed the Bayesian truth serum (BTS), that does not require knowledge of the underlying generating model, but requires that the number of agents is large and that they have a common prior, i.e., they have the same beliefs about the underlying generating model and this fact is common knowledge. This mechanism is not minimal: apart from reporting their evaluations, agents are also required to report their beliefs about the reports of others. [WP12] and [RF13] later used proper scoring rules to design similar mechanisms for the case where the population size is finite. These mechanisms are again not minimal, and in fact it is known (see [JF11], [RF13]) that no minimal mechanism that does not use knowledge of the prior beliefs can incentivize truthful reporting of evaluations.

In many crowdsourcing applications, one is interested in acquiring evaluations from a population for several similar objects. It is thus natural to explore the possibility of exploiting this statistical similarity to design better (e.g., minimal) mechanisms for jointly scoring these evaluation tasks. This is the context of the present work. Three major works that have considered this case are [WP13], [DG13] and, more recently, [RF15]. Both [WP13] and [DG13] considered only the case where the evaluations are binary; the former considered a homogeneous population while the latter considered a heterogeneous one, with both making specific assumptions on the generating model. [RF15], on the other hand, have considered both homogeneous and heterogeneous populations.

For a homogeneous population with multiple objects, [WP13] try to utilize the statistical independence of the objects to estimate the prior distribution of evaluations, and use that estimate to compute payments using a proper scoring rule. In spirit, our approach is similar in the sense that we use the law of large numbers to estimate some prior statistics and obtain incentive compatibility for a large population, but we do not restrict ourselves to the binary setting.

[RF15] have recently also designed a mechanism that is truthful in the general non-binary setting while requiring only a finite number of objects, again using proper scoring rules. In their mechanism, to compute the reward to an agent for evaluating a given object, a sample of evaluations of other agents for other objects, of a fixed size, needs to be collected, and an agent’s reward can be non-zero only if this sample is sufficiently rich, i.e., it has an adequate representation of all the possible evaluations. Although our mechanism needs the number of objects to be large, it has a much simpler structure.

For the case of a heterogeneous population, [RF15] show that typically one cannot guarantee truthfulness with minimal elicitation. Nevertheless, [DG13] have designed a truthful minimal mechanism for the case of binary evaluations under a specific generating model: it is assumed that and that for each agent, the probability of correctly guessing the true type of the object is at least 0.5 and does not depend on the type. Although we also consider binary evaluations, we allow to be arbitrary, and our regularity condition is considerably weaker.

In a parallel development, [SAFP16] elegantly extended the mechanism of [DG13] for the heterogeneous population setting to handle more than two evaluations. But their mechanism is truthful only if the joint distribution of the evaluations seen by two agents evaluating the same object satisfies a property that they refer to as being "categorical". This is a somewhat restrictive condition, which says that if an agent makes an evaluation , then the conditional probability that the other agent makes a certain other evaluation reduces relative to the prior probability of making that evaluation, for every other evaluation . If this condition is satisfied in our setting, then an "additive" mechanism that we suggest for the case of a heterogeneous population is trivially truthful for non-binary settings, almost by assumption. They also design a mechanism that is truthful in general, in particular without this restriction, but they require that the mechanism designer have access to certain information about the joint distribution of the evaluations for an object by two agents. In another parallel development [RFJ16], the authors consider the heterogeneous population setting and propose almost exactly the same mechanism as ours: it has an output agreement structure in which the rewards for matching on an evaluation are inversely proportional to an estimate of the prior probability of seeing that evaluation. They show that the mechanism is truthful for the general non-binary setting, but the condition under which this holds is the same as the condition for truthfulness, and it is not clear if it would reasonably hold in practical settings.

Contrary to the assumptions in these two works, our regularity condition is a precise condition on the generating model that can be mapped to a condition on the "behavior" of the agents. It implies both the condition in [RFJ16] and the condition in [SAFP16] in the binary setting, and further, it can be argued to hold naturally in most MCETs of interest.

4 Homogeneous population

In this section, we first consider the case where the population of agents is homogeneous. We will consider the following class of generating models , which we will call .

Definition 4.1.

is the class of all generating models that satisfy the following set of assumptions.

  1. The population is homogeneous, i.e., for any . This means that has support of size . denotes the common population filter.

  2. Define

    By the Cauchy-Schwarz inequality . Then there is some such that


Let us take a closer look at the second assumption. The Cauchy-Schwarz inequality has the following geometric interpretation. For any evaluation , define the vector

in the Euclidean space . Then the Cauchy-Schwarz inequality says that for any two evaluations and , the magnitude of the projection of the vector on the unit vector in the direction of is less than the magnitude of the vector itself (one can reverse the roles of and ), i.e.,

If we let denote the angle in radians between two non-zero vectors and , defined as

then the inequality is strict if and only if the angle between the vectors and is positive and their magnitudes are non-zero. In fact, under the condition that for all , which holds in our case, we can show that the second assumption is equivalent to the following assumption.

Assumption A: There is a and such that for any , the following holds:

  1. for each , and

  2. (note that since these are component-wise positive, we have ).

The first condition says that the probability of an agent forming any evaluation for an object is uniformly bounded away from zero for all generating models in the class. To get an intuition for the second condition, consider the case where the angle between and is zero. One can show that this happens only when there is a such that for each such that . But in this case, the evaluations and need not be distinguished at all, since they contain the same information about ; in particular, for each . Hence one can equivalently consider a generating model with a smaller number of possible evaluations.
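The geometric picture can be checked numerically. The sketch below uses our reading of the stripped definition above, namely that the vector associated with evaluation s has components sqrt(P(t)) f(s|t); the prior and filter are toy values:

```python
import math

# Toy prior over two types, and a filter with two possible evaluations.
P = [0.4, 0.6]
f = {'a': [0.8, 0.3],   # f(a | type)
     'b': [0.2, 0.7]}   # f(b | type)

def v(s):
    """Vector for evaluation s, with components sqrt(P(t)) * f(s|t)."""
    return [math.sqrt(P[t]) * f[s][t] for t in range(len(P))]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

lhs = dot(v('a'), v('b'))                                   # cross term
rhs = math.sqrt(dot(v('a'), v('a')) * dot(v('b'), v('b')))  # product of norms
print(lhs, rhs, lhs < rhs)  # strict: v('a') and v('b') are not parallel
```

Here the inequality is strict because the two filter rows are not proportional; if they were, the two evaluations would carry the same information about the type, which is exactly the degenerate case ruled out above.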

Proposition 1.

Assumption A and the second assumption in Definition 4.1 are equivalent.


To see that assumption A implies the second assumption, note that implies that:

Multiplying throughout by , we have:

Here, in the last inequality, we use the fact that

which follows from Jensen’s inequality. The reverse direction is less straightforward, and this is where we need to use the fact that for all . First of all,

implies that either or is non-zero. Say . Then dividing on both sides, we get:

where the last inequality holds since . In other words:


Since and , this implies both and , i.e., . Finally, we have . Note that , so that . ∎

We present our proposed mechanism for this case in Mechanism 1, denoted Hom-OA, where OA stands for output agreement. The mechanism has the structure of an output agreement mechanism, where a person is rewarded for evaluating an object only if her evaluation matches that of a chosen peer who has evaluated the same object. In our mechanism, this reward itself depends on how "popular" the matched evaluation is overall across all the objects, where the notion of popularity is defined in a particular manner. In the following theorem, we show that the mechanism is Bayes-Nash incentive compatible for a large enough N.

The observations of all the people for the different objects are solicited. Let these be denoted by , where every . Person ’s payment is computed as follows:
  • From each population , choose any two persons and different from , and for each possible evaluation , compute the quantity

    Then compute

  • For each evaluation , fix a payment defined as

    where is any positive constant.

  • For computing person ’s payment for evaluating object , choose another person who has evaluated the same object . If their reports match, i.e., if , then the person gets a reward of . If the reports do not match, then gets 0 payment for the evaluation of that object.

Mechanism 1 Hom-OA for a homogeneous population. Assumes that .
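A minimal sketch of Hom-OA, under our assumptions about the stripped details above: the popularity index of an evaluation is estimated as the fraction of objects on which two reference agents (distinct from the agent being paid) both report it, and the matching reward is the constant c divided by that index. Function and variable names are ours:

```python
from collections import defaultdict

def hom_oa_payment(i, reports, c=1.0):
    """Sketch of Hom-OA. `reports` maps object -> {agent: evaluation}."""
    objects = list(reports)

    # Step 1: popularity index of each evaluation, estimated from pairs of
    # agents other than i (our reading of the estimator described above).
    counts, n = defaultdict(float), 0
    for m in objects:
        others = [a for a in reports[m] if a != i]
        if len(others) < 2:
            continue
        j, k = others[0], others[1]
        n += 1
        if reports[m][j] == reports[m][k]:
            counts[reports[m][j]] += 1.0
    popularity = {s: counts[s] / n for s in counts}

    # Steps 2-3: output agreement with popularity-scaled rewards -- agent i
    # earns c / popularity(answer) on each object where she matches a peer.
    total = 0.0
    for m in objects:
        if i not in reports[m]:
            continue
        peer = next((a for a in reports[m] if a != i), None)
        if peer is None:
            continue
        s = reports[m][i]
        if s == reports[m][peer] and popularity.get(s, 0.0) > 0.0:
            total += c / popularity[s]
    return total

# Toy run: 'b' is half as popular as 'a', so matching on it pays twice as much.
reports = {'o1': {1: 'a', 2: 'a', 3: 'a'},
           'o2': {1: 'b', 2: 'b', 3: 'b'},
           'o3': {1: 'a', 2: 'a', 3: 'a'}}
print(hom_oa_payment(1, reports))  # 6.0 = 1/(2/3) + 1/(1/3) + 1/(2/3)
```

Note that agent i's own reports never enter the popularity estimate, which is the key to the incentive argument in the proof below: her report cannot move the reward schedule she faces.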
Theorem 2.

There exists a depending only on and such that if the number of objects is , Mechanism 1 is strictly detail-free Bayes-Nash incentive compatible with respect to the class .


First, note that the computation of the payments for the different is unaffected by the reports of person . Next, suppose that everyone but a person is truthful. Recalling the definition of , denote

In the proof of Proposition 1, we have seen that the second assumption in the definition of class implies that , and thus we have for all . Next, recall that

We will show first that

where there exists a function of , , that depends only on and , i.e., it is independent of the generating model, such that and , i.e., . To show this, we first have for any :

Here the second inequality follows from Hoeffding’s inequality, and the fifth from the Taylor series approximation of the function , where . The other inequalities result from the fact that . Thus we have

Taking , we have:


where and as a function of , it depends only on and , and further . Next we also have,

Here the second inequality results from the fact that on the event , . This is because only takes values in the set . The third inequality follows from Hoeffding’s inequality, and the fifth follows from the Taylor approximation of the function , where . Now choosing , we get:


where and as a function of , it depends only on and . Further . Hence defining we have where and it depends on and .
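The concentration step invoked above can also be checked numerically. The sketch below simulates N objects under an assumed generating model (prior[t] = P(T = t), cond[k][t] = P(X = k | T = t) — notation assumed here), rates each object with two independent raters, and returns the empirical agreement frequencies that the mechanism estimates; by Hoeffding's inequality their error around the population values decays like O(sqrt(log(1/δ)/N)).

```python
import random

def empirical_popularity(prior, cond, n_objects, rng):
    """Simulate n_objects objects, each rated by two raters drawn
    i.i.d. from the assumed generating model, and return the empirical
    agreement frequency of each evaluation, as used by Mechanism 1."""
    counts = {k: 0 for k in cond}
    types, evals = list(prior), list(cond)
    for _ in range(n_objects):
        # draw an object type, then two conditionally i.i.d. evaluations
        t = rng.choices(types, weights=[prior[s] for s in types])[0]
        w = [cond[k][t] for k in evals]
        x = rng.choices(evals, weights=w)[0]
        y = rng.choices(evals, weights=w)[0]
        if x == y:
            counts[x] += 1
    return {k: counts[k] / n_objects for k in counts}
```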

The expected reward of person for evaluating object , if she reports when her true evaluation is , is

Similarly, we have . Next, lying is strictly worse for person if , that is if

i.e., if

Further, for every generating model in class , we have

where . Thus truthtelling gives a strictly better payoff if . Since , which depends only on , and , there is an depending only on and such that for all , irrespective of the generating model in . ∎

Note how detail-freeness follows from the fact that the proof does not depend on the specific filter , but only on a universal property shared by any such filter, namely the Cauchy-Schwarz inequality.
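To make the role of Cauchy-Schwarz concrete, in the assumed notation $P(t)$, $p_k(t)=\Pr(X=k\mid T=t)$, $f_k=\sum_t P(t)\,p_k(t)^2$, and payment $\tau_k = c/\sqrt{f_k}$ (a reconstruction, since the displayed formulas did not survive extraction), the key chain is

```latex
\mathbb{E}\big[\text{reward} \mid \text{report } k',\ \text{observe } k\big]
  = \tau_{k'}\,\Pr(\text{peer reports } k' \mid X = k)
  = \frac{c}{\sqrt{f_{k'}}}\cdot
    \frac{\sum_t P(t)\, p_k(t)\, p_{k'}(t)}{\Pr(X = k)}
  \;\le\; \frac{c\,\sqrt{f_k}}{\Pr(X = k)},
```

since $\sum_t P(t)\,p_k(t)\,p_{k'}(t) \le \sqrt{f_k\,f_{k'}}$ by Cauchy-Schwarz, and the bound is attained at $k' = k$. The inequality is strict for $k' \ne k$ for every filter in the class, which is exactly the universal property the proof relies on.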

4.1 Remarks:

An alternative to the peer prediction method: In the case where the mechanism designer knows the underlying generating model and , the mechanism can compute the rewards for each evaluation directly, without having to estimate statistics from evaluations for multiple objects. In order to do so, for each evaluation , one defines

and defines payments for the different evaluations as (if for some , one can simply disallow that signal from being reported). In this case, our mechanism provides an alternative to the peer prediction method of [MRZ05], while retaining the simple structure of output agreement mechanisms and without using proper scoring rules.
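The following sketch computes these payments directly from a known generating model. As before, the payment form c/sqrt(f_k) and the names prior/cond are assumptions on our part, with f_k the probability that two independent raters agree on evaluation k.

```python
import math

def direct_payments(prior, cond, c=1.0):
    """Compute Hom-OA payments directly from a known generating model:
    prior[t] = P(T = t), cond[k][t] = P(X = k | T = t).  Here
    f_k = sum_t P(t) p_k(t)^2 is the probability that two independent
    raters agree on k, and the payment for k is taken to be
    c / sqrt(f_k) (assumed form).  Evaluations with f_k = 0 are simply
    disallowed, as in the remark above."""
    payments = {}
    for k, p_k in cond.items():
        f_k = sum(prior[t] * p_k[t] ** 2 for t in prior)
        if f_k > 0:
            payments[k] = c / math.sqrt(f_k)
    return payments
```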

Relaxing the requirement that : The assumption is needed to ensure that for any object, there are at least two persons other than any given person who have evaluated the object, which in turn ensures that the computation of the values is unaffected by the reports of . In practice, even if one computes the values by randomly selecting any two persons for each object, as long as is small and is large, these values will not be affected much by the reports of . In this case, one can drop the subscript and use the same values to compute everyone’s payment.

Truthfulness gives higher payoff than random sampling: Truthful reporting is not the only equilibrium of this mechanism. One class of equilibria is that in which each person in the population reports evaluations sampled independently from the same distribution, say , independent of their observations. But one can easily show that the truthful equilibrium gives a higher expected reward to each person than any equilibrium in this class. In this case, we have for all such that , where . Thus the expected payment of each agent for the evaluation of one object is , where , whereas the expected payment for the evaluation of one object in the truthful equilibrium is

where . The last inequality follows from Jensen’s inequality. In fact, this inequality is strict for the class , with a gap that is universally bounded away from zero. To see this, note that the inequality fails to be strict only when and are independent, i.e., when the population filter is such that the evaluations are independent of the type of the object. But if that is the case, one can verify that (to see this, observe that for any and , the angle between the two vectors and is zero, and hence the Cauchy-Schwarz inequality is not strict), thus violating our assumption. Thus for a large enough , the truthful equilibrium gives a strictly higher expected payoff.

Conjecture - Truthfulness gives higher payoff than any symmetric equilibrium: We conjecture that for a large enough , truthful reporting gives at least as much payoff to each individual as in any symmetric equilibrium, where each person in the population maps the observed evaluation for any object to a reported evaluation with some probability for each . In fact, we conjecture that under the conditions satisfied by , the payoff is strictly higher compared to all symmetric equilibria except the ones that result from relabelling the signals. Any symmetric equilibrium is equivalent to the truthful equilibrium in which the population filter is given by:

The expected payment of each agent for evaluation of one object in the truthful equilibrium is

where . Similarly, the expected payoff in any symmetric equilibrium is:

We conjecture that the following inequality holds in general, and strictly under the assumptions of the class :


We have not been able to prove or disprove it thus far. The inequality has the following interpretation. Consider the following definition.

Definition 4.2.

Consider two random variables and , taking values in a finite set , such that they are conditionally independent and identically distributed given some random variable taking values in a finite set . Then the agreement measure between and is defined as

Now if has distribution , and the conditional distributions of and given are denoted as , then the agreement measure is , and this is the expected payoff to an individual under the truthful equilibrium in our mechanism. The agreement measure has the following properties:

  1. . To see this, note that Jensen’s inequality implies that

In fact, equality holds only when and are independent.

  2. . To see this, note that Jensen’s inequality implies that

In fact, equality holds only when and are identical and they are distributed uniformly, i.e., and .
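The two properties above can be checked numerically under the reconstruction A(X, Y) = Σ_k sqrt(Pr(X = k, Y = k)) — an assumed form, chosen to match the expected truthful payoff c·Σ_k sqrt(f_k) with c = 1. The model parameters below are illustrative.

```python
import math

def agreement_measure(prior, cond):
    """Agreement measure A(X, Y) = sum_k sqrt(P(X = k, Y = k)) for two
    raters X, Y that are i.i.d. given the type T, with prior[t] = P(T = t)
    and cond[k][t] = P(X = k | T = t).  By the two Jensen arguments above,
    1 <= A <= sqrt(m), where m is the number of possible evaluations."""
    return sum(
        math.sqrt(sum(prior[t] * p_k[t] ** 2 for t in prior))
        for p_k in cond.values()
    )
```

The lower bound 1 is attained by an uninformative filter (evaluations independent of the type), and the upper bound sqrt(m) by identical, uniformly distributed evaluations.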

Now our conjecture is true if the agreement measure has the following property. Suppose and are two random variables such that 1) and are conditionally independent given , and and are conditionally independent given and 2) and have the same conditional distributions given and respectively. Then clearly