Competitive Statistical Estimation with Strategic Data Sources

04/29/2019 · by Tyler Westenbroek, et al. · University of Illinois at Urbana-Champaign · University of Washington

In recent years, data has played an increasingly important role in the economy as a good in its own right. In many settings, data aggregators cannot directly verify the quality of the data they purchase, nor the effort exerted by data sources when creating the data. Recent work has explored mechanisms to ensure that data sources share high-quality data with a single data aggregator, addressing the issue of moral hazard. Oftentimes, there is a unique, socially efficient solution. In this paper, we consider data markets where there is more than one data aggregator. Since data can be cheaply reproduced and transmitted once created, data sources may share the same data with more than one aggregator, leading to free-riding between data aggregators. This coupling can lead to non-uniqueness of equilibria and social inefficiency. We examine a particular class of mechanisms that has recently received attention in the literature, and we characterize all the generalized Nash equilibria of the resulting data market. We show that, in contrast to the single-aggregator case, there are either infinitely many generalized Nash equilibria or none at all. We also provide necessary and sufficient conditions for all equilibria to be socially inefficient. In our analysis, we identify the components of these mechanisms that give rise to these undesirable outcomes, showing the need for research into mechanisms for competitive settings with multiple data purchasers and sellers.




1 Introduction

Data plays an increasingly important role in the economy as a good in its own right. As an input to machine learning algorithms, data can not only create new products and innovations, but can also be used to redesign business strategies and processes. As the demand for data increases, we have seen the formation of data aggregators, who collate data for either use or resale. A fundamental information asymmetry arises between data aggregators and data sources: how can aggregators verify the quality of the data they purchase from data sources?

In particular, data sources often incur an effort cost to obtain high quality data. For example, devices require maintenance and upkeep to ensure accurate measurements, portable sensors need to use their limited energy resources to collect and transmit data, and human agents may need to be compensated to properly perform a desired task. As such, if a data aggregator wants a high quality data point, they must appropriately compensate the data source. Furthermore, this problem is complicated by the fact that the data aggregators can observe only the data received, not the effort exerted. As such, the payments must be calculated from the data sets alone, with no knowledge of the effort exerted or the noise levels of the data points. This problem has led to the design of a variety of mechanisms to ensure data sources provide quality data, which we will outline in more detail in Section 2.

The contribution of this paper is the study of the data market that forms when multiple data aggregators share the same pool of data sources. In particular, we note that data is non-rivalrous, in the sense that it can be cheaply copied and shared with multiple data aggregators. Since a data aggregator does not ‘consume’ the good after purchasing it, data sources will have an incentive to share the same data with as many aggregators as are willing to pay. We show that the non-rivalrous nature of data introduces a coupling between data buyers: when a data aggregator incentivizes a data source to produce high quality data, other data aggregators benefit. In particular, this coupling leads to undesirable properties of the equilibrium. In many single-aggregator formulations, equilibria are unique and there is no social inefficiency. In contrast, the multiple-aggregator case leads to a multiplicity of equilibria, and social inefficiencies across all equilibria.

The rest of this paper is organized as follows. In Section 2, we discuss the related literature and contextualize our contributions. In Section 3, we introduce our model for data sources, data aggregators, and their interactions in the data market. In Section 4, we characterize the generalized Nash equilibria in the data market, and identify necessary and sufficient conditions for social inefficiency. In Section 5, we extend the results to cases where data sources do not share their data with all data aggregators. Finally, we conclude with closing remarks in Section 6.

2 Related Literature

In recent years, there has been a quickly growing body of literature on models for data exchange and data markets. Broadly speaking, the existing literature can be broken down into two categories: models with a single data purchaser and a single data source, and models with a single data purchaser and multiple data sources.

In the first category, we find a class of models which study a single data purchaser and a single data source. These works focus on the game theoretic interactions and information states between the two agents. In particular, these works consider the strategies arising from direct signals, actions, and payments, rather than indirect coupling that can arise from multiple sources or purchasers. Some of these papers feature multiple data sources, but these are ultimately separable into a collection of single-source models, and, at their core, focus on the direct interactions between buyers and sellers of data. In [1], optimal mechanisms for a single data source to sell to a single buyer are developed using a signaling framework. The authors of [2] design a menu of prices for different data qualities, employing a screening framework. In [3], the authors consider a single aggregator and single source, and show how repeated interactions with noisy verification allow for mechanisms which elicit costly effort from a data source. A single data source charging data purchasers for queries about customer preferences is studied in [4].

In the second category, there is a class of models which study a single data purchaser with multiple data sources. These works focus on capturing how the data supplied by one data source affects another. In [5], the authors consider a single data aggregator and multiple data sources, and show how robustness of the sample median provides protection against strategic data sources. In [6], the authors consider a single data aggregator and multiple data sources in a setting with verifiable data, and allow the data and the cost of revealing data to be arbitrarily correlated.

There is also a new body of work in the single-aggregator, multiple-source case, using peer prediction mechanisms, first introduced in [7]. These techniques often use scoring techniques to evaluate the ‘goodness’ of received data, and often examine classification tasks. In [8, 9], the authors develop mechanisms for eliciting the truth in crowdsourcing applications, while  [10, 11, 12] consider theoretical extensions to strengthen the original results of [7], all in the context of a single aggregator. In [13], the authors consider a classification problem with a single aggregator and multiple data sources, which extends the classic peer prediction results by exploiting correlations between the queries and query responses.

A parallel literature considers similar ideas in the regression domain. These works design general payment mechanisms, by which a central data aggregator may incentivize data sources to exert the effort necessary to produce and report readings which are deemed to be of high quality, with respect to the estimation task the aggregator is performing. The roots of these approaches can be traced at least as far back as VCG mechanisms, a set of seminal results in mechanism design [14]. Indeed, numerous approaches for deciding payments based on the actions of other agents have been proposed [15]. Here, we again see attention given to crowdsourcing [16].

Several recent papers [17, 18, 19, 20, 21, 3] investigate new directions in this domain, considering cases where, without the ability to directly determine the effort exerted by data sources, data buyers must design incentive mechanisms based solely on the data available to them. In [17], whose approach we extend here, the authors develop a mechanism which a data aggregator can use to precisely set the level of effort a collection of data sources exert when producing data. A similar mechanism is explored in [18]. Extensions are considered wherein data sources form coalitions [19], or where aggregators assess the quality of readings using a trusted data source [20]. Meanwhile, [21] and [3] investigate dynamic settings where data sources are repeatedly queried.

Our work is closest in spirit to the literature studying regression problems with multiple data sources, with our key contribution being the presence of multiple data aggregators that are coupled in their costs and actions. To our knowledge, this is one of the first papers which considers multiple data aggregators and multiple data sources simultaneously. In particular, we simultaneously model coupling between data aggregators in their cost functions, coupling in the payments to the same pool of data sources, and coupling between data sources due to payments that depend on their peers’ data.

We suppose all data aggregators are trying to estimate the same function and share the same pool of data sources. Additionally, we assume each data aggregator has already chosen an estimator, and now must determine how to issue payments to have low estimation error with their exogenously fixed estimator. Our model builds heavily on the model introduced in [17], which featured a single data aggregator. Our contribution is an extension that models cases with multiple data aggregators. For consistency, we will refer to data purchasers as data aggregators, and data sellers as data sources.

Furthermore, the work in this paper is a significant extension of our prior work [22], where we considered strategic data sources with a specific exponential function mapping effort to query response quality. In the present work, we characterize equilibria and the price of anarchy for a much broader class of games between data buyers, where the data sources’ effort functions can be any non-negative, strictly decreasing, convex, and twice continuously differentiable function. The characterization we provide considers both bounded and unbounded feasible effort sets for the data sources.

3 Data Market Preliminaries

In this section, we outline the models for data sources, data aggregators, and the strategic interactions between them.

At a high level, each data aggregator collects data from data sources to construct an estimate of a given function. In exchange for this data, the data aggregator issues incentives to the data sources. The data aggregators have three terms in their cost function: 1) an estimation error term, which rewards the data aggregator for constructing a better estimate; 2) a competition term, which penalizes when other data aggregators have higher quality estimates; 3) a payment term, which is the cost incurred issuing incentives.

Each data source is able to produce a noisy sample of the desired function. The data sources can exert effort to reduce the variance of the data sample, and we assume the data sources are effort-averse, i.e., data sources prefer to exert less effort unless the aggregators provide them with incentives. As such, the data sources have two terms in their utility function: 1) an incentive term, which rewards payments received; 2) an effort term, which penalizes effort exerted.

The level of effort exerted and the variance of the data are not known by the data aggregator; this private information gives rise to moral hazard. One of the problems for the aggregator is the task of designing incentives which depend only on the information available to them. Another important nuance is that data is non-rivalrous; thus, when a data source produces a higher-quality data sample, all the aggregators which receive this data benefit.

In order to simplify the initial introduction of our model, we will first assume that each data source provides data to all the aggregators in the data market, and receives payment from all aggregators as well. In Section 5, we will outline how our results change when this assumption is removed.

3.1 Overview

More formally, let be the index set of strategic data sources, and let be the index set of strategic data aggregators. Each data aggregator desires to construct an estimate for a given function , where is a feature space. Practically, one may think of as a set of features the data aggregators are capable of observing, while the mapping encapsulates the relationship between the observable features and the outcome of interest.

Each data source is able to produce a noisy sample of at the fixed point . The point is common knowledge among all data sources and aggregators. The variance of is proportional to the effort exerted by data source to produce the reading. Each data source is characterized by an effort-to-variance function , where represents the set of feasible efforts that data source can exert. When data source exerts effort , they produce the data point:



is a random variable with mean and variance . The function is common knowledge among all data sources and aggregators. However, while the function is known, the effort exerted is private. This means that the actual variance of , namely , is also private information of . We will delve into the assumptions of the data source model in greater detail in Section 3.2.

Now, suppose a data aggregator is granted access to a data set . At this point, the data aggregator processes this data to construct an estimate for . In exchange for this data set, the data aggregator issues payment to data source for each . Here, denotes the data given to each member of . Note that the payment to from depends not only on the data supplied by , but rather depends on all data available to .

The data aggregator then incurs loss , which will depend on , the payments issued, as well as , the effort exerted by the data sources. We will formalize the data aggregator in greater detail in Section 3.3.

The interaction of the data market proceeds in three stages.

  1. Aggregators declare incentives: Each data aggregator commits to a payment contract . The payments will depend on the data shared with , as well as the common knowledge information and functions .

  2. Sources exert effort, realize and share data: In response to , each data source chooses an effort . Then, the random variable is realized according to (1). The data is shared with each data aggregator. Note that has control over only through . In other words, the data source chooses the quality of data they generate, but cannot arbitrarily manipulate the reported value of .

  3. Aggregators construct estimates, issue payments: Each data buyer constructs their estimate , issues payments to the data sources, and incurs loss .
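The three stages above can be sketched as a toy simulation. Everything here is illustrative: the effort-to-variance map `sigma2`, the payment rates, and the fixed effort levels are hypothetical stand-ins, not the contract structure developed in Section 3.4.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigma2(effort):
    # Hypothetical effort-to-variance map: more effort, less noise.
    return np.exp(-effort)

true_values = np.array([1.0, 2.0, 3.0])   # f(x_i) at each source's query point

# Stage 1: each aggregator commits to a payment rate per source (illustrative).
payment_rates = {"agg_A": 0.5, "agg_B": 0.3}

# Stage 2: sources choose efforts (fixed here for illustration) and realize
# noisy readings y_i with mean f(x_i) and variance sigma2(e_i), as in (1).
efforts = np.array([1.0, 2.0, 0.5])
readings = true_values + rng.normal(0.0, np.sqrt(sigma2(efforts)))

# Stage 3: each aggregator forms an estimate (here, simply the readings)
# and issues payments; data is non-rivalrous, so both buy the same readings.
for agg, rate in payment_rates.items():
    payments = rate * efforts             # stand-in for the contracts of (6)
    print(agg, "pays", payments.round(2), "for data", readings.round(2))
```

Note that both aggregators receive identical readings: the source incurs the effort cost once and sells the resulting data twice, which is the coupling the rest of the paper analyzes.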

For convenience, we include a table summarizing the notation throughout this paper in Table 1.

3.2 Strategic Data Sources

As mentioned previously, each data source has their own feature vector , and samples the function at this point. We may also refer to as a query throughout the text, and as the query response for data source . The data source is characterized by the effort-to-variance function . We assume so that each data source may exert no effort in producing her reading if she desires.

Assumption 1.

For each , the set is a closed, connected set and contains .

Assumption 1 means that we consider two cases:

  1. , i.e. the data source’s maximum allowed effort is unbounded.

  2. for some , i.e. the data source’s maximum allowed effort is bounded.

Imposing an upper bound on the amount of effort a data source can exert can be used to model constraints such as hardware limitations. As we shall see in Section 4, the imposition of such constraints can drastically affect equilibrium behavior in the data market.

Once the data source exerts effort , they produce the data point according to (1). Again, we note that the data source only controls the effort level . They can only indirectly control through , and cannot report arbitrary values as their data. We also impose the assumption that the noise in the data is independent across data sources.

Assumption 2.

For each , is a random variable with mean and variance . Furthermore, the random variables are independent.

Both and the function are common knowledge, but the effort and , the actual variance of , are private.

For convenience, we let be the joint effort set and let be the tuple of effort-to-variance functions. We make the following assumptions on the effort-to-variance mappings .

Assumption 3.

For each data source , the mapping , which is the square root of , is (i) strictly decreasing, (ii) convex, and (iii) twice continuously differentiable.

The assumptions correspond to the variance of the estimate generated by data source decreasing in the effort exerted, with decreasing marginal returns.
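Assumption 3 can be checked numerically for a candidate effort-to-variance map. The example below uses the exponential family from the authors' prior work [22] as an illustration: with sigma2(e) = exp(-e), the square root sigma(e) = exp(-e/2) should be strictly decreasing and convex in the effort.

```python
import numpy as np

def sigma(effort):
    # Square root of the illustrative variance map sigma2(e) = exp(-e).
    return np.exp(-effort / 2.0)

e = np.linspace(0.0, 5.0, 201)
vals = sigma(e)

first_diff = np.diff(vals)           # should be strictly negative (decreasing)
second_diff = np.diff(vals, n=2)     # should be nonnegative (convexity)

print("strictly decreasing:", bool(np.all(first_diff < 0)))   # True
print("convex:", bool(np.all(second_diff >= 0)))              # True
```

Any map passing these two checks (and being twice continuously differentiable) fits the assumption; the exponential form is just one convenient member of the class.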

Using the notation , we model each data source with the following utility function:


where the expectation is with respect to the randomness in , the data generated by the data sources upon exerting effort .[1] For simplicity and as a first-step analysis, we assume that the data sources only care about the payments received from the aggregators, and are indifferent to which aggregators they share their data with. An interesting and practical extension would be to consider the case where the data sources’ utility functions are aggregator-dependent. This could arise when data sources trust different aggregators differently, or over privacy concerns. Note that the form of (2) implies that the data sources are risk-neutral and effort-averse. Additionally, the form of (2) also implies the effort can be normalized to be comparable to the payments. We note that the timing of the game implies that data sources must commit to an effort level ex-ante.

Thus, in the second stage of the game, data source has knowledge of the payment contracts , and chooses effort to maximize their utility, defined by (2). However, since the utility of each data source depends on the effort exerted by the other data sources, the payments induce a game between the data sources. In Section 3.6, we will fully characterize this game for the particular class of incentives we introduce in Section 3.4.

3.3 Strategic Data Aggregators

The primary objective of each aggregator is to construct a low-variance estimate for the function . We adopt the following formal definition for an estimator.

Definition 1 (Estimator [17]).

Let be a family of functions . An estimator for takes as input a collection of examples and produces an estimated function .

As an example, may be the class of linear functions , in which case one may produce an estimated function of via linear regression.
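As a concrete sketch of Definition 1, the following fits an affine function to noisy examples by ordinary least squares; the data, the true coefficients, and the noise level are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative examples: noisy samples of f(x) = 2x + 1 on [0, 1].
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, size=x.shape)

# Ordinary least squares: solve for [a, b] in y ≈ a*x + b.
A = np.column_stack([x, np.ones_like(x)])
(a_hat, b_hat), *_ = np.linalg.lstsq(A, y, rcond=None)

def estimate(q):
    # The estimated function the estimator returns.
    return a_hat * q + b_hat

print(f"fitted: y ≈ {a_hat:.2f} x + {b_hat:.2f}")
```

In the paper's terms, the map from the examples (x, y) to the function `estimate` is the estimator; the quality of `estimate` degrades as the variance of the noise in y grows, which is the channel the payment contracts target.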

Each data aggregator constructs his estimate for from the class of functions , using the readings . We let denote the estimate that aggregator constructs based on the readings they receive.[2] In general, aggregators need not fit models of the same type—e.g., one data aggregator may choose to generate their estimate via linear regression, while another fits a polynomial of higher degree. Different estimator types across data aggregators may be used to encapsulate competitive advantages one has over another.

Each data aggregator’s estimator is given, fixed, and common knowledge among all agents. In other words, this means that, for each data aggregator, the process by which a data set is turned into an estimate is exogenous. We focus on the design of incentives once each buyer has chosen an estimator.

First, we introduce some restrictions on the class of estimators allowed. The following assumption is required for us to be able to consider the contribution of data source to reducing aggregator ’s estimation cost. Also, note that the functions will be non-negative by construction.

Assumption 4.

We assume the estimator for each is separable, in the following sense [17]. There exists a function such that for all queries , distributions over , and variances of the reported estimates at queries in the dataset :


Here, the expectation is taken across the randomness in , as well as across .

For brevity, we will also define the function as follows:


Let denote the index set of aggregators excluding and let be the payments of all aggregators excluding . Aggregator constructs payments so as to minimize:


As in (3), the expectation in (3.3) is taken with respect to and the randomness in the query responses . The distribution weighs the importance data aggregator places on accurately estimating for different query points .

The scalars parameterize the level of competition between aggregators and . When , aggregator is indifferent to the success of ’s estimation; interacts with entirely through the incentives issued to the data sources. We note that, even when for all and , we can still see degeneracies and social inefficiency arise, since data aggregators will still be coupled through the data sources.[3] This is a stylized formulation of how competition can affect different data aggregators, but we see interesting results arise even in this simple model. In the future, we hope to consider more extensive models of competition for data aggregators. The parameter denotes a conversion between dollar amounts allocated by the payment functions and the utility generated by the quality of the various estimates that are constructed. We make the assumption that aggregator has knowledge of what estimator every other data aggregator plans to use, as well as the weighting distributions.[4] This is a fairly strong assumption, given that competing data aggregators are unlikely to inform their competitors how they intend to process the data supplied by the sources. Our work isolates how coupling between aggregators through data sources affects the data market; an interesting avenue for future work is to consider extensions with different information sets, and to characterize the existence and severity of market inefficiencies in these various situations.

3.4 Structure of Payment Contracts

Throughout this paper, we will assume a particular form for the payment contracts the aggregators offer to the data sources. Similar to previous notation, we let . For a given and we assume that is of the form:


Here, and are nonnegative scalars. Also, denotes ’s data set excluding . Namely, is the data features for all sources excluding , and is the query responses to aggregator , excluding .

Note that these payments do not directly depend on the level of effort that any of the data sources exert, since the data aggregators do not have a means to directly observe these values. Rather, the payment to source from aggregator depends on ’s best estimate for excluding ’s data, namely . The payments depend only on the data reported to the aggregator, and can be calculated by the aggregator.

Similar payment contracts are common in the literature [17, 18, 20], in part because of their intuitive structure. The aggregator constructs an unbiased estimate of what data source should report, and this estimate is not influenced by the data of . This estimate is used to overcome the problem of moral hazard: all data sources are appropriately incentivized to reduce the variance of their reported data accordingly.

Given this payment structure, each data aggregator’s choice of payment contracts reduces to choosing parameters where and .

In the single-aggregator case (when ), it was shown in [17] that payments of the form in (6) induce a game between the data sources for which there is a unique dominant strategy equilibrium. That is, for each collection of parameters and , the data sources each exert a unique level of effort. The authors develop an algorithm by which the single aggregator may select these parameters such that (i) data sources are incentivized to exert any level of effort that the aggregator desires, and (ii) data sources are compensated at exactly the value of their effort, i.e. .
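Under one illustrative specialization, this parameter selection admits a closed form. The sketch below is not the algorithm of [17]; it assumes the exponential map sigma2(e) = exp(-e) from [22], a payment of the hypothetical form p(e) = c - k * sigma2(e), and an effort cost equal to the effort itself. The source then maximizes u(e) = c - k*exp(-e) - e, whose first-order condition k*exp(-e) = 1 pins down the induced effort, and c is set so the IR constraint binds.

```python
import numpy as np

def design_contract(target_effort):
    # FOC of the source's problem: k * exp(-e) = 1 at e = target_effort.
    k = np.exp(target_effort)
    # IR binds: the expected payment exactly covers the effort cost.
    c = 1.0 + target_effort
    return k, c

def induced_effort(k):
    # The source's best response to the contract p(e) = c - k * exp(-e).
    return np.log(k)

k, c = design_contract(2.0)
e_star = induced_effort(k)
expected_payment = c - k * np.exp(-e_star)

print("induced effort:", round(float(e_star), 3))                        # 2.0
print("payment minus effort cost:", round(abs(expected_payment - e_star), 3))  # 0.0
```

This captures properties (i) and (ii) in miniature: any target effort can be induced by scaling k, and the constant c transfers exactly the value of that effort back to the source.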

This paper’s contribution is the study of how pricing schemes of this form perform in the more general case where there is more than one data aggregator (when ), and data aggregators may compete with each other. The goal is to model multiple aggregators as strategic decision-makers in competition, and understand the data market where these agents interact. Thus, while prior work captured moral hazard, we extend this model to capture competition and the non-rivalrous nature of data.

3.5 Formulation of Aggregator Optimization Problem

As mentioned previously, the aggregators hope to minimize their costs, as given in (3.3). They do so by choosing the parameters . In this section, we will describe the aggregator’s optimization problem in more detail, and specify constraints that the parameter choice must satisfy.

The first constraint is individual rationality (IR). Individual rationality requires that each data source’s utility is non-negative ex-ante [23].[5] Alternatively, a data source’s utility may be compared to an outside option; for simplicity, we model the outside option as having zero utility. This ensures that rational data sources are willing to exert effort to produce the data. The second constraint is non-negative payments from each data aggregator. Given that there are multiple aggregators, we introduce a constraint that the payment each aggregator offers to each is non-negative ex-ante.[6] Negative payments could be handled via exchangeable utilities among the data aggregators or via a trusted third party to manage the allocations; however, in an effort to ensure clarity, we leave these scenarios aside.

We now introduce some notation for brevity; we let denote the expected value of the payment :



denotes the probability measure with mass one at and . Similar to previous conventions, we define:

Thus, the IR constraint for each data source is formalized:


Similarly, the non-negativity constraint for each data source and data aggregator is given by:


The third constraint is incentive compatibility (IC). Intuitively, IC states that when a data source is acting rationally and choosing actions to maximize their utility, they behave as the data aggregators intended. When there is a single aggregator, IC is typically enforced by the aggregator finding the effort that minimizes their cost, , and then designing such that .[7] For notational brevity, we will use as a function rather than a set-valued function throughout this paper; this is well-defined by Assumption 3.

In the competitive setting, IC for one aggregator is defined holding all other aggregators’ payments fixed. Each of the data aggregators makes their choice of payment subject to the fact that data source selects effort according to


Note that the payment each source receives depends on the efforts exerted by the other data sources. Thus, for each set of contracts offered by the aggregators, a game is induced between the data sources to determine how much effort they will exert. The aggregators compete by issuing incentives, which influences the equilibrium behavior of this game.

From the perspective of the data aggregators, the IC constraint states the desired effort level must be a dominant strategy for data source ; that is, is the utility-maximizing action for regardless of the actions taken by other sources . Formally, the following must hold for all :

With these constraints, we formulate a bilevel optimization problem for each aggregator. Consider a fixed aggregator . Given a fixed action profile for all other buyers , i.e. given , aggregator aims to solve:


where is defined in (3.3).

Note that this problem actually has optimization problems as constraints, making it a difficult bilevel program. However, we will reformulate the aggregator’s problem into a more manageable non-linear program in the sequel. This is possible, in part, due to the nice properties of the payment contract structure introduced in Section 3.4; this tractability motivates the use of payment contracts of that particular form. Next, we analyze the induced game between the data sources and simplify the aggregator’s optimization problem.

3.6 Induced Equilibrium Between Data Sources

To ensure a notion of incentive compatibility in equilibrium, we show there is a well-defined mapping from the parameters chosen by the aggregators to the equilibrium .

Definition 2.

For fixed payments , we say is an induced Nash equilibrium if for each data source :


If (11) holds for all rather than just at , then we say that is an induced dominant strategy equilibrium.

Suppose now that we have a set of payments of the form discussed in Section 3.4, characterized by parameters . Data source chooses effort according to:


for each choice of made by the other data sources. It is straightforward to verify that (12) is a concave maximization problem which admits a unique globally optimal solution. This follows from our assumption that is convex and decreasing, recalling that for each and observing that is a convex set. Moreover, note that the choice of this optimal effort is not affected by the choice of , since each of the terms enters (12) as a constant from the perspective of . Thus, each choice of contract parameters selected by the aggregators leads to an induced dominant strategy equilibrium for the data sources. In particular, note that the choice of


fully characterizes the level of effort that data source exerts in equilibrium. We reiterate that the constraints on the aggregator’s optimization problems will ensure the chosen contract parameters respect the IR and non-negativity constraints.

Next, we define to be the implicitly-defined map such that returns the solution to (12) for a given choice of . In the following section, we will use this mapping to simplify the optimization problem facing each of the aggregators.

Definition 3.

For a given data source , let:


When with , define where


On the other hand, when , define .

The above definition implies that is the minimum value of that the aggregators must offer data source to ensure they do not have an incentive to exert negative effort.[8] This situation could correspond to source obfuscating their data, for example. We have restricted to the non-negative orthant, so we will add constraints to ensure we are operating within the domain of our model. Similarly, if the aggregators increase past , source cannot further increase the level of effort they exert, and the mapping ceases to be meaningful. Thus, when reformulating each buyer’s optimization in the following section, we will additionally constrain for each .

The following lemma provides properties on the mapping which are needed to prove existence of equilibria for the game between aggregators in the first stage.

Lemma 1.

Fix a data source . Then the mapping is continuous and strictly increasing in for all values of .


Proof. The first-order optimality condition for the data source is given by:


By assumption is strictly decreasing and convex so that (16) has a unique solution for all . By definition, this solution is . Implicit differentiation of (16) then yields:

where we suppress the dependence of on . The right-hand side of the above equation is strictly positive by Assumption 3. Continuity follows directly by Assumption 3. ∎
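Lemma 1 can be illustrated numerically under the same hypothetical exponential map used earlier: with sigma2(e) = exp(-e) and unit effort cost, the first-order condition price * exp(-e) = 1 has the closed-form best response e = max(ln(price), 0), which is continuous and nondecreasing in the offered price, and simply saturates when the feasible effort set is bounded.

```python
import numpy as np

def best_response(price, e_max=np.inf):
    # Solve the FOC price * exp(-e) = 1 for e, clipped to the feasible set.
    e = np.log(price) if price > 1.0 else 0.0
    return min(e, e_max)

prices = np.linspace(1.0, 10.0, 50)
efforts = np.array([best_response(p) for p in prices])

# Consistent with Lemma 1: the induced effort is nondecreasing in the price,
# and an effort cap (bounded feasible set) saturates the response.
print("monotone:", bool(np.all(np.diff(efforts) >= 0)))       # True
print("capped response:", best_response(100.0, e_max=1.5))    # 1.5
```

The saturation in the capped case previews why bounded effort sets change equilibrium behavior in Section 4: past the cap, raising the price buys no additional quality.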

3.7 Reformulation of Buyers Optimization Problem

Finally, using our previous analysis and assumptions, we reformulate the optimization problem faced by each aggregator. This reformulation will simplify our analysis of equilibrium behavior in the data market, and lend economic interpretability to the results presented in Section 4.

Previously, in Assumption 4, we assumed that aggregator ’s estimator is separable. This allows us to write the loss function of aggregator as:


Recall that is fixed and common knowledge. Thus, we can replace each of the evaluations of the ’s with constants. Towards this end, for each and , we define:


Note that each , by definition of the . In addition, for each and , define:


Since we defined such that , we can write:

Similarly, the expected payment for any data source and data aggregator is given by:

Before proceeding, we provide an interpretation of the constants introduced above. The constant denotes the relevance of data sampled from the point when constructing aggregator ’s estimate, given the distribution of all of the data sources. The parameter corresponds to the level of demand that aggregator has for high-quality data from source , factoring in the benefit this data supplies to the competitors of . In other words, these parameters capture the effects of the non-rivalrous nature of data. The parameter denotes a measure of coupling that exists between the payment contracts and . In the case of a single aggregator (i.e. [17]), this coupling did not prove problematic. In contrast, when there are multiple aggregators, each aggregator has an incentive to try to exploit this coupling, as shall become clear in our ensuing analysis. This coupling will play a central role in determining the existence and efficiency of equilibrium behavior in the data market.

Collecting the various expressions we have introduced, aggregator ’s optimization problem can be re-written as:


Without loss of generality, we let , by normalizing the accordingly. Note that the constraint can be omitted, in light of the constraint , since each and .
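As a sanity check on the reformulation, the per-aggregator problem can be solved numerically in a toy instance. Everything below is an illustrative assumption layered on the structure of (20): the relevance weights `a`, the rivals' fixed offers `p_other`, the induced variance 1/√P (which follows when the sources' variance is σ²(e) = 1/(1 + e) and their first-order condition is solved exactly), and the identification of the aggregator's expected payment with its own linear terms:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical two-source instance of the reformulated problem (20).
a = np.array([8.0, 18.0])         # relevance of each source's data
p_other = np.array([0.5, 0.5])    # fixed offers from the competing aggregators
P_cap = np.array([100.0, 100.0])  # upper bound on the total offers

def cost(p):
    P = p + p_other                              # total incentive per source
    variance = 1.0 / np.sqrt(np.maximum(P, 1.0))  # induced estimation variance
    return np.sum(a * variance) + np.sum(p)       # loss + expected payment

res = minimize(
    cost,
    x0=np.array([1.0, 1.0]),
    method="SLSQP",
    bounds=[(0.0, Pc - po) for Pc, po in zip(P_cap, p_other)],
)
print(res.x)  # the aggregator's best-response offers to the two sources
```

In this smooth instance the best response satisfies (p + 0.5)^{3/2} = a/2 for each source, so the numerical solution can be verified against that stationarity condition.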

Notation Meaning Defined or First Used in Equation
index of data source
index set of data sources
index of aggregator
index set of aggregators
expected payment from aggregator to source (7)
linear term in ; used to adjust level of effort in equilibrium (6)
vector containing the parameters offered to source by the members of
constant term in ; used to ensure incentive compatibility in equilibrium (6)
vector containing the parameters offered to sources by the members of
sum of parameters offered to source across all members of (13)
minimum value of required to ensure source does not exert negative effort (15)
minimum value of at which data source exerts her maximum effort (14)
, the allowable range of
implicit map which returns the equilibrium value of as a function of
level of competition between (3.3)
relevance of data from in constructing aggregator ’s estimator (17)
aggregate demand for from (19)
sum of demand for data source across all members of (23)
coupling between and (36)
Table 1: Notation Reference Chart

4 Generalized Nash Equilibria in the Data Market

It is important to note that the constraints each aggregator faces in her optimization problem (20) depend on the actions taken by the rest of the aggregators in the data market. In particular, in order to ensure that the IR and IC constraints are maintained in equilibrium, we require an equilibrium concept which allows each aggregator’s admissible action space to depend on the choice of contract parameters selected by the other aggregators in the data market. Thus, we will employ the notion of a generalized Nash equilibrium [24], a natural extension of the typical notion of Nash equilibrium to this setting, to study competitive outcomes in the data market.

Let be aggregator ’s action space; that is, where and . Each aggregator solves a parametric nonlinear programming problem given by


where with a finite set indexing the constraint functions of aggregator . Note that, unlike in the classic definition of a Nash equilibrium, the admissible action space of aggregator depends on , the actions of .

We refer to the resulting collection of coupled optimization problems as a generalized Nash (GN) equilibrium problem. A GN equilibrium is defined as follows.

Definition 4.

A point is said to be a GN equilibrium for if for all , solves .

We now analyze the game between the aggregators utilizing the notion of a GN problem and GN equilibrium. We will characterize the existence and uniqueness of GN equilibria in two scenarios. In Section 4.1, we will consider the case where the effort spaces of data sources are unbounded, i.e. . In Section 4.2, we will characterize the case where each data source has an upper bound on the level of effort they can exert, i.e. . In Section 4.3, we will then address the social efficiency of the equilibria identified in Section 4.1. A similar analysis of the equilibria identified in Section 4.2 can be found in Appendix A.2.
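The coupling of feasible sets is what separates a GN equilibrium from an ordinary Nash equilibrium, and it is also what permits continua of equilibria. The following standard two-player example (the textbook instance from Facchinei and Kanzow's survey of generalized Nash equilibrium problems, unrelated to the data market itself) exhibits exactly this kind of one-dimensional family of equilibria:

```python
import numpy as np

# Player 1 minimizes (x1 - 1)^2 and player 2 minimizes (x2 - 1/2)^2, subject
# to the SHARED constraint x1 + x2 <= 1.  Because each player's feasible set
# depends on the other's action, every point (a, 1 - a) with a in [1/2, 1]
# is a generalized Nash equilibrium -- a continuum of solutions.

def best_response_1(x2):
    # min (x1 - 1)^2  s.t.  0 <= x1 <= 1 - x2
    return min(1.0, 1.0 - x2)

def best_response_2(x1):
    # min (x2 - 0.5)^2  s.t.  0 <= x2 <= 1 - x1
    return min(0.5, 1.0 - x1)

def is_gne(x1, x2, tol=1e-12):
    return (abs(best_response_1(x2) - x1) < tol
            and abs(best_response_2(x1) - x2) < tol)

for a in np.linspace(0.5, 1.0, 6):
    assert is_gne(a, 1.0 - a)
print("every (a, 1 - a) with a in [0.5, 1] is a generalized Nash equilibrium")
```

Each candidate is certified by checking that both players are simultaneously best-responding within their (mutually dependent) feasible sets, which is precisely Definition 4.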

Before proceeding to our main results, we provide a technical lemma that will play a central role in our ensuing analysis, and introduce some notation which will simplify the statement of our results. For compactness, for a given set of parameters we define . (Recall that is the sum of parameters, as defined in Equation (13).)

Lemma 2.

Suppose , where , is a GN equilibrium for the game defined by (20). Then for each :


In other words, the IR constraint is always binding in equilibrium, and the expected payment to data source is equal to the effort exerted in equilibrium:


Suppose that there is an equilibrium in which the IR constraint is not binding for some data source . Then, there must exist an aggregator whose non-negativity constraint corresponding to source is also not binding. Thus, this cannot be an equilibrium, as aggregator can unilaterally improve their payoff by decreasing without causing any of the constraints to be violated, contradicting the assertion that the given selection of parameters is an equilibrium. ∎

The result of Lemma 2 is a well-known result in contract design—that is, the individual rationality constraint always binds for the optimal contract [23]. As shall become clear in our analysis in the following sections, the equality (22) forms an implicit constraint that appears in each of the aggregators’ optimizations, which will be directly responsible for the degeneracy observed in the data market. Roughly speaking, while the parameters selected by the aggregators determine the level of effort that the data sources will exert, the parameters determine what portion of this effort each aggregator is expected to compensate.

For each , define:


which can be interpreted to be the total demand for high quality data from data source . Next, we define:

Note that Lemma 2 implies that if is a GN equilibrium in the game between the aggregators then will hold for each . Moreover, the non-negativity constraints in the game between the buyers will hold only if for each and .

4.1 Unbounded Effort Spaces

Let us first consider the case where there is no upper bound on the effort the data sources may exert, i.e. .

Theorem 1.

Consider the game , where each aggregator’s objective is to solve the optimization in (20). Suppose that for each , and . Further, suppose that , . Then, there is either no GN equilibrium or an infinite number of GN equilibria. Moreover, if is a GN equilibrium, then the following conditions hold:

  1. The infinite set of GN equilibria is given by:

    That is, the parameters selected by the aggregators are the same across each GN equilibrium, and all degeneracy lies in the equilibrium parameters which lie in the -dimensional convex polytope defined above.

  2. The effort exerted by each data source is the same in each GN equilibrium, and the efforts constitute a unique induced dominant-strategy equilibrium between the data sources. More precisely, each data source exerts effort in all GN equilibria.

Before going ahead with the proof of the theorem, we discuss its hypotheses and implications. The hypothesis that implies that there is enough demand for the data from source such that she does not have incentive to exert negative effort in equilibrium. Together, the aggregators will provide sufficient incentive so that accepts each of the contracts offered to her and truthfully reports her query-response. When we investigate the case where only provides readings to a subset of the aggregators in Section 5, only the relevant subset of aggregators must maintain this constraint. This condition places a restriction on what subsets of incentives from the aggregators each data source is willing to accept.

As we discovered in Section 3.6, the parameters selected by the aggregators uniquely determine how much effort the data sources exert in equilibrium. Intuitively, the fact that the parameters are constant across all GN equilibria means that, when GN equilibria do exist in the game between the aggregators, the aggregators have agreed to incentivize the data sources to each exert a particular level of effort. The proof of the theorem will shed some light on how this unique choice of parameters is selected when GN equilibria exist, and also demonstrate what ‘goes wrong’ in cases where the aggregators cannot agree on how much effort to incentivize the sources to exert. In the latter case, no GN equilibrium solution exists in the game between the aggregators. Further commentary on this point is provided after the proof of Theorem 2.

Meanwhile, for a fixed profile of parameters, the parameters determine how much of this effort each aggregator is responsible for compensating in expectation. Even when aggregators are able to agree on how much effort to incentivize from the data sources and select the unique GN equilibrium choice for , there is a non-uniqueness in the parameters in equilibrium. This implies that there is a fundamental ambiguity in who will fund the effort exerted by the data sources. In the extreme case, it is possible for one aggregator to pay for the entirety of the expected compensation offered to the data sources, while the other aggregators pay nothing in expectation.
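This degeneracy can be pictured directly: once the total expected compensation owed to a source is fixed by Lemma 2, any nonnegative division of that total among the aggregators is consistent with equilibrium. A small sketch, where the total C and the number of aggregators m are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative only: suppose the (fixed, equilibrium) linear parameters induce
# effort whose expected compensation totals C = 3.0, and m = 3 aggregators
# must jointly cover it.  Lemma 2 pins down only the SUM of the expected
# payments, so any nonnegative split lies in the equilibrium polytope.
C, m = 3.0, 3

for _ in range(4):
    w = rng.dirichlet(np.ones(m))   # a random point of the payment simplex
    payments = C * w                # each aggregator's expected payment
    assert np.all(payments >= 0.0) and np.isclose(payments.sum(), C)
    print(np.round(payments, 3))
```

The extreme points of this simplex, C times a standard basis vector, recover the case where a single aggregator funds everything while the others free-ride.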

Proof of Theorem 1.

By Lemma 2, we have that:


Plugging in this constraint, the cost function for aggregator can be expressed as:

By swapping the roles of and in the middle term above, aggregator ’s cost can be decomposed into a sum of costs, one for each data source. We define:

Then aggregator ’s optimization problem reduces to:


Note that the cost does not depend on , for any . We complete the argument by ignoring the constraints and showing that the constraints are satisfied for the set of equilibria we characterize.

Differentiating the cost with respect to and applying (16) and for all , we have that: