Recent years have witnessed the rise and development of deep learning. Many laboratories and companies, such as DeepMind, Facebook AI Research (FAIR) and the Stanford AI Lab (SAIL), focus on building neural network applications. Large-scale datasets are indispensable for these applications, so data acquisition underpins their success. Traditionally, teams may hire volunteers to collect data such as photos or voices, which is a very time-consuming and labour-intensive process.
Crowdsourcing, a term first coined by Howe, is a collaboration mode in which companies use an open call format to attract potential workers to finish a task at a lower price. Many companies, such as Mechanical Turk and gengo AI, are committed to crowdsourcing services. Consequently, more and more research teams turn to these platforms to acquire data. For example, ImageNet from SAIL was built with the help of Mechanical Turk.
In traditional crowdsourcing models, the requester has to pay not only the data providers but also the third-party crowdsourcing platforms. However, the data retrieved in this way may be redundant and repetitive, and the requester still has to pay for it. Therefore, it is not clear whether the requester can benefit from the paid crowdsourcing platforms. Moreover, the data retrieved in this way is also revealed to the platform, which may raise privacy issues.
Therefore, in this paper, we propose a novel crowdsourcing mechanism for data acquisition on social networks. The requester is the owner of the mechanism and can use it to collect data without any third-party platforms. The mechanism requires the requester to release the task information to her neighbours on the network. Under this mechanism, the participants will be incentivized to provide all their data and invite all their neighbours to do the task, because they will gain payoffs not only from their offered data but also from their diffusion contribution. By doing so, the task information can be disseminated through the whole social network without giving any payments to the workers in advance.
Different from other crowdsourcing mechanisms, our mechanism only distributes rewards to those who provide effective data and perform effective diffusion. That is, a worker gains no payoff if she does not contribute to the data acquisition task. Hence the mechanism eliminates redundant and irrelevant data and avoids unnecessary expenses.
In the crowdsourcing literature, many related mechanisms have been published. Franklin et al. focused on how to use crowdsourcing to process difficult queries. Chawla et al. proposed an optimal crowdsourcing contest for high-quality submissions. Zhou et al. studied a new measurement principle for work quality. Miller et al. devised a scoring system to evaluate elicited feedback. Radanovic et al. presented a general mechanism that rewards workers based on peer consistency. These works all differ from ours: they mainly focus on improving the quality of the work provided by the workers, and there is often a single ground truth in their settings which is unknown to the requester. Our setting does not seek the answer to a single ground truth; we aim to collect rich data. Moreover, their settings do not consider task propagation via workers, whereas we also incentivize the workers to propagate the task information to their neighbours to collect more data.
There is also some interesting literature on information diffusion in social networks. Narayanam and Narahari studied the target set selection problem, which involves discovering a small subset of influential workers in a given social network to maximize the diffusion quality, rather than incentivizing workers to diffuse. In terms of incentivizing people to disseminate task information, a team from MIT proposed an interesting solution under the DARPA Network Challenge. However, that solution only works for tree-structured graphs. Our mechanism builds on this idea and puts forward a modified payoff policy for diffusion contribution in single-source directed acyclic graphs. More importantly, the reward in the DARPA Network Challenge is predefined, while in our setting it varies according to the data offered by the others.
Our mechanism is closely related to the strategic diffusion mechanism proposed by Shen et al. However, their solution focuses on solving the problem of false-name attacks and does not consider data effectiveness. Also, their mechanism cannot guarantee that all the workers will propagate the task information to all their neighbours.
The contributions of our mechanism advance the state of the art in the following ways:
We model a crowdsourcing mechanism on social networks without relying on third-party platforms. Our mechanism incentivizes the workers to not only offer their data truthfully but also propagate the task information to their neighbours without paying them in advance. This guarantees that more effective data is collected.
We give a novel method, based on the Shapley value, to evaluate the effectiveness of the acquired data and distribute rewards to the workers without unnecessary expenses.
Our mechanism is also budget constrained and the payoffs are adjustable by the requester, which incentivizes the requesters to apply our mechanism in real applications.
2 The Model
Consider a data acquisition task $t$ that is executed on a social network. To simplify the presentation, we first model the network as a directed acyclic graph (DAG) $G = (V \cup \{r\}, E)$ with a single source $r$, a special node called the requester of task $t$; later on we will consider a general graph. Here $V$ denotes the set of workers and $E$ denotes the information flow between vertices. For any $i, j \in V \cup \{r\}$, if there is a directed edge $(i, j) \in E$, then $i$ can directly propagate the task information to $j$. Here, we say $j$ is $i$'s child and $i$ is $j$'s parent. Let $C_i$ be the set of $i$'s children, $P_i$ be the set of $i$'s parents and $N_i = C_i \cup P_i$ be the neighbour set of each $i$. If there is a directed path from $i$ to $j$, then we say $j$ is $i$'s successor and $i$ is $j$'s predecessor. For each $i$, let $\mathcal{S}_i$ be the set of all $i$'s successors, and $\mathcal{P}_i$ be the set of all $i$'s predecessors. Each worker $i$ has a depth $d_i$ representing the length of the shortest path from the requester $r$ to $i$. Figure 1 shows an example of such a social network, where the number in each node represents the worker's ID.
In the above network, requester $r$ wants to collect data for task $t$. Each worker $i \in V$ is a potential data owner and has a private dataset $D_i = \{x_1, x_2, \dots, x_{m_i}\}$ related to task $t$, where each $x_k$ represents an atomic data item (e.g. an image) and $m_i$ is the number of atomic data items owned by worker $i$. Let $\mathcal{D}$ be the space of all possible datasets. In our setting, we are not aiming for a single ground truth; instead, we try to collect a dataset as rich as possible.
Given the problem setting, it is evident that the requester can only collect data among her neighbours, with whom she can directly communicate. Traditionally, to collect as much of the required data as possible, the requester tends to do promotions with the help of paid third-party platforms (such as Mechanical Turk and gengo AI). However, the quality of the data collected cannot be guaranteed, and users may tend to give redundant data, which is costly but not useful for the requester.
In this paper, we propose a novel diffusion mechanism for crowdsourcing the data. The goal of the mechanism is to incentivize the workers on the social network to provide all the data they have and also to propagate the task information to all their neighbours. Different from other data collection platforms, our mechanism does not reward redundant data (i.e., duplicate data will not be paid for). Furthermore, a worker's total payoff depends not only on her provided data but also on her diffusion contribution.
For each worker $i \in V$, let $\theta_i = (D_i, C_i)$ be $i$'s type. Due to the information flow constraint, we do not need to consider $P_i$ in $i$'s strategy space. The type profile of all the workers is denoted as $\theta = (\theta_i, \theta_{-i})$, where $\theta_{-i}$ represents the type profile of all workers except $i$. Let $\Theta_i$ be $i$'s type space, and $\Theta = \Theta_1 \times \cdots \times \Theta_n$ is the type profile space of all the workers.
Our mechanism requires each worker participating in the mechanism to report her type. Let $\theta_i' = (D_i', C_i')$ be the type worker $i$ reports, where $D_i' \subseteq D_i$ is the data provided and $C_i' \subseteq C_i$ is the set of children $i$ has invited to do the task. Let $\theta_i' = \text{nil}$ if worker $i$ is not invited or refuses to participate in the mechanism.
Given a report profile $\theta'$ of all workers, let the generated network be $G(\theta') = (V(\theta') \cup \{r\}, E(\theta'))$, where $V(\theta') = \{i \in V : \theta_i' \ne \text{nil}\}$ and $E(\theta') = \{(i, j) \in E : j \in C_i'\}$ is $E$ reduced by the reported invitations.
A report profile $\theta'$ is feasible if for each worker $i$ with $\theta_i' \ne \text{nil}$, there exists at least one path from the requester $r$ to $i$ on the network $G(\theta')$.
Intuitively, feasibility means that a worker cannot join the mechanism if she is not invited/informed about the task, which holds naturally in practice.
Let $\mathcal{F}(\Theta)$ be the set of all feasible report profiles of $\Theta$. To simplify the description, the following discussion focuses only on feasible report profiles.
A crowdsourcing diffusion mechanism on the social network is defined by a payoff policy $p = \{p_i\}_{i \in V}$, where $p_i : \mathcal{F}(\Theta) \to \mathbb{R}$. Given a feasible report profile $\theta'$, $p_i(\theta')$ is the payoff of worker $i$ for her provided data and diffusion contribution.
To design a crowdsourcing diffusion mechanism, we hope that workers are incentivized to give all their data and invite all their neighbours to offer more data. This property is called incentive compatibility. An incentive compatible (truthful) diffusion mechanism guarantees that for each worker $i \in V$, reporting her true type $\theta_i = (D_i, C_i)$ is a dominant strategy.
A crowdsourcing diffusion mechanism $p$ is incentive compatible (IC) if $p_i(\theta_i, \theta_{-i}') \ge p_i(\theta_i', \theta_{-i}')$ for all $i \in V$, all $\theta_i' \in \Theta_i$ and all $\theta_{-i}'$ such that the resulting report profile is feasible; if there exists no path from $r$ to $i$ in $G(\theta')$, then $p_i(\theta') = 0$.
Under the crowdsourcing diffusion mechanism $p$, the requester's payment is the sum of the payments made to the workers. We say $p$ is budget constrained if this sum is bounded by a constant under all settings.
A crowdsourcing diffusion mechanism $p$ is budget constrained (BC) if for all $\theta' \in \mathcal{F}(\Theta)$, we have $\sum_{i \in V(\theta')} p_i(\theta') \le c$, where $c$ is a constant.
3 The Mechanism
In this section, we introduce our novel crowdsourcing diffusion mechanism (CDM). The payoff policy of the mechanism is composed of two parts: data contribution and diffusion contribution. The data contribution describes how the requester values workers' provided data, and the diffusion contribution describes how the requester values workers' diffusion on the social network. Finally, we give the total payoff policy combining both.
3.1 Data Contribution
Our method for evaluating data contribution is based on the Shapley value, a classic method for allocating payoffs in cooperative games. First we define $\sigma : 2^{\mathcal{D}} \to \mathbb{R}_{\ge 0}$ as the valuation function that evaluates the value of a dataset for the requester. The valuation function should be monotone increasing and bounded, i.e., for datasets $D_1$ and $D_2$, if $D_1 \subseteq D_2$, then $\sigma(D_1) \le \sigma(D_2)$. If we directly apply the Shapley value among all workers on the network, the data contribution for each worker $i$ is:
$$\phi_i(\theta') = \sum_{S \subseteq V(\theta') \setminus \{i\}} \frac{|S|!\,(|V(\theta')|-|S|-1)!}{|V(\theta')|!}\,\bigl[\sigma(D_S \cup D_i') - \sigma(D_S)\bigr]. \qquad (1)$$
Here $D_S = \bigcup_{j \in S} D_j'$ is the dataset offered by the workers in set $S$. Intuitively, the Shapley value calculates the average marginal valuation contribution of each worker. However, with this simple method, workers may not be willing to share the task information with their neighbours, since neighbours with similar data would compete with them and reduce their payoffs, which goes against what we want to achieve with the mechanism.
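To make the computation concrete, here is a minimal sketch of the direct Shapley evaluation of Equation (1); the function name and the toy coverage valuation (counting distinct items) are our own illustrations:

```python
from itertools import combinations
from math import factorial

def shapley_data_contribution(datasets, valuation):
    """Standard Shapley value of each worker's offered data (Equation (1)).

    datasets: dict mapping worker id -> set of atomic data items.
    valuation: monotone set function sigma on datasets.
    Enumerates all coalitions, so it is exponential in the number of
    workers; for illustration on small examples only.
    """
    workers = list(datasets)
    n = len(workers)
    phi = {}
    for i in workers:
        others = [w for w in workers if w != i]
        total = 0.0
        for size in range(n):
            for coalition in combinations(others, size):
                d_s = set()                      # data of the coalition S
                for w in coalition:
                    d_s |= datasets[w]
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                total += weight * (valuation(d_s | datasets[i]) - valuation(d_s))
        phi[i] = total
    return phi

# Toy valuation: the number of distinct items (monotone and bounded).
phi = shapley_data_contribution({1: {"a", "b"}, 2: {"b", "c"}, 3: {"b"}}, len)
```

With this valuation, each item's unit of value is split equally among its owners: workers 1 and 2 each receive $1 + \frac{1}{3}$, while worker 3, whose only item is owned by everyone, receives just $\frac{1}{3}$.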
A mechanism using Equation (1) as the evaluation of data contribution is not incentive compatible.
Consider the network in Figure 3 and suppose workers 1 and 2 own the same dataset. If both truthfully report their types, i.e., $\theta_1' = \theta_1$ and $\theta_2' = \theta_2$, then by Equation (1) they share the value of that dataset equally. However, if worker 1 chooses not to propagate the task information to worker 2, worker 2 cannot participate, and worker 1's data contribution becomes the full value of the dataset. Hence worker 1 strictly prefers not to diffuse the task.
To combat this diffusion issue with the Shapley value, we design a novel payoff sharing method called the layered Shapley value. Let $L_k$ be the set of all the workers with depth $k$: $L_k = \{i \in V(\theta') : d_i = k\}$. Let $L_{\le k} = \bigcup_{l \le k} L_l$ be all the workers in the first $k$ layers. Suppose there are $K$ layers in total on the network; then for each worker $i \in L_k$, the layered Shapley value is defined as follows:
$$\phi_i(\theta') = \sum_{S \subseteq L_k \setminus \{i\}} \frac{|S|!\,(|L_k|-|S|-1)!}{|L_k|!}\,\bigl[\sigma(D_{L_{\le k-1}} \cup D_S \cup D_i') - \sigma(D_{L_{\le k-1}} \cup D_S)\bigr]. \qquad (2)$$
Intuitively speaking, Equation (2) calculates the average marginal contribution of the workers in layer $k$ using the standard Shapley value, but assumes that all the workers in the prior layers have already joined the coalition. More specifically, for the first layer (i.e., the requester's neighbours), the standard Shapley value is applied to calculate their data contribution among the workers in the first layer only. Then for the workers in the second layer, we also apply the Shapley value to compute their data contribution, under the condition that all workers in the first layer are already in the coalition. The calculation for the second and later layers does not change the Shapley values of the first layer. This continues for all the other layers, and ensures that workers close to the requester have a higher priority to get rewards for their data contributions. More importantly, by using the layered Shapley value, we can still ensure the following key properties:
The sum of all workers' layered Shapley values equals the valuation of the whole dataset provided by the workers, i.e., $\sum_{i \in V(\theta')} \phi_i(\theta') = \sigma\bigl(\bigcup_{i \in V(\theta')} D_i'\bigr)$.
If $i$ and $j$ are two workers in the same layer $L_k$ who are equivalent in the sense that $\sigma(D_{L_{\le k-1}} \cup D_S \cup D_i') = \sigma(D_{L_{\le k-1}} \cup D_S \cup D_j')$ for all $S \subseteq L_k \setminus \{i, j\}$, then $\phi_i(\theta') = \phi_j(\theta')$.
If there is a worker $i \in L_k$ who has $\sigma(D_{L_{\le k-1}} \cup D_S \cup D_i') = \sigma(D_{L_{\le k-1}} \cup D_S)$ for all $S \subseteq L_k \setminus \{i\}$, which indicates that she does not provide any extra information, then $\phi_i(\theta') = 0$.
Therefore, we will not reward redundant data that has already been provided in the prior layers. Moreover, child workers cannot decrease the payoffs of their parents, so all the workers are incentivized to propagate the task information.
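A sketch of the layered variant (Equation (2)), again with illustrative names and a distinct-item-count valuation:

```python
from itertools import combinations
from math import factorial

def layered_shapley(layers, datasets, valuation):
    """Layered Shapley value (Equation (2)): the standard Shapley value is
    computed inside each layer, assuming the data of all previous layers
    has already joined the coalition.

    layers: lists of worker ids, ordered by depth from the requester.
    """
    phi = {}
    prior = set()               # data owned by layers closer to the requester
    for layer in layers:
        n = len(layer)
        for i in layer:
            others = [w for w in layer if w != i]
            total = 0.0
            for size in range(n):
                for coalition in combinations(others, size):
                    d_s = set(prior)
                    for w in coalition:
                        d_s |= datasets[w]
                    weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                    total += weight * (valuation(d_s | datasets[i]) - valuation(d_s))
            phi[i] = total
        for w in layer:
            prior |= datasets[w]
    return phi

# Workers 1, 2 are in layer 1; worker 3 is in layer 2.
phi = layered_shapley([[1, 2], [3]],
                      {1: {"a", "b"}, 2: {"b"}, 3: {"a", "c"}}, len)
```

Worker 3's copy of item "a" is already owned by layer 1, so only her new item "c" is rewarded, and the three values still sum to the valuation of the whole dataset.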
So far, we have quantified the data contribution by the layered Shapley value. One problem remains when applying it to a real-world application: how to choose the valuation function $\sigma$. Here we give a possible approach using information entropy. Information entropy was first proposed by Shannon and has since become a standard method to measure the amount of information in data [14, 12]. It is defined in terms of a distribution $q = (q_1, \dots, q_d)$ on some space with finite dimension $d$:
$$H(q) = -\sum_{j=1}^{d} q_j \log q_j.$$
To evaluate a dataset related to the data acquisition task $t$ by information entropy, we can assume the overall dataset required by the requester can be classified into $n$ independent target classes, denoted by $T_1, \dots, T_n$. For each class $T_k$, let $F_k$ be its feature space with a predefined finite dimension $d_k$. Then for a dataset $D$, every atomic data item $x \in D$ can be expressed as a feature vector $x = (f_1, \dots, f_n)$, where $f_k \in F_k$ is the specific feature in class $T_k$ for $x$. For example, if the task is to collect images of nature, let the two target classes be animals and plants, with feature spaces such as $F_1 = \{\text{dog}, \text{cat}, \dots\}$ and $F_2 = \{\text{tree}, \text{lawn}, \dots\}$. Suppose a dataset $D$ has two images $a_1$ and $a_2$, where $a_1$ is an image with a dog beside a tree while $a_2$ is an image with a cat lying on the lawn. Then $a_1 = (\text{dog}, \text{tree})$ and $a_2 = (\text{cat}, \text{lawn})$.
We also need to define a distribution function $q$, where $q(D) = (q_1(D), \dots, q_n(D))$ is the distribution vector of the dataset $D$. Each $q_k(D)$ is the empirical distribution over the feature space $F_k$ induced by the dataset $D$. In the example above, the distribution of the class animals is $q_1(D) = (\frac{1}{2}, \frac{1}{2}, 0, \dots)$ over $F_1$ and the distribution of the class plants is $q_2(D) = (\frac{1}{2}, \frac{1}{2}, 0, \dots)$ over $F_2$. Therefore, $q(D) = (q_1(D), q_2(D))$.
Now we can use information entropy to evaluate a dataset $D$ via the joint entropy over the $n$ independent target classes:
$$\sigma(D) = H(q(D)) = \sum_{k=1}^{n} H(q_k(D)) = -\sum_{k=1}^{n} \sum_{j=1}^{d_k} q_{k,j}(D)\,\log q_{k,j}(D). \qquad (3)$$
Given a dataset related to task , the valuation of the dataset by Equation (3) is bounded.
According to the definition of information entropy, we can bound the valuation of $D$ as:
$$\sigma(D) = \sum_{k=1}^{n} H(q_k(D)) \le \sum_{k=1}^{n} \log d_k,$$
since the entropy of a distribution over a space of dimension $d_k$ is at most $\log d_k$. Since the dimensions of the feature spaces of the task are predefined and finite, the valuation is bounded by a constant. ∎
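The entropy-based valuation can be sketched as follows (the function name is ours; we use base-2 logarithms, so values are in bits):

```python
from collections import Counter
from math import log2

def entropy_valuation(dataset, num_classes):
    """Entropy-based valuation (Equation (3)): each atomic data item is a
    feature vector over independent target classes; the dataset's value is
    the sum of the empirical entropies of the per-class feature
    distributions, bounded by sum_k log2(d_k)."""
    if not dataset:
        return 0.0
    n = len(dataset)
    total = 0.0
    for k in range(num_classes):
        counts = Counter(item[k] for item in dataset)   # feature frequencies
        total -= sum((c / n) * log2(c / n) for c in counts.values())
    return total

# The running example: a1 = (dog, tree), a2 = (cat, lawn).
value = entropy_valuation([("dog", "tree"), ("cat", "lawn")], num_classes=2)
```

For the running example, each of the two classes contributes one bit of entropy, so the dataset is valued at 2.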
Take the network in Figure 2 as an example. Workers 1, 2 and 3 are in layer 1; workers 4, 5 and 6 are in layer 2; workers 7 and 8 are in layer 3; worker 9 is in layer 4. The layered Shapley value of worker 1 is therefore computed by Equation (2) among workers 1, 2 and 3 only, and any data already covered by these workers earns the deeper layers nothing. This is consistent with the intuition of what effective data should be.
3.2 Diffusion Contribution
In traditional crowdsourcing mechanisms, only those who are aware of the task information can compete for the rewards, so the informed participants have no reason to invite their neighbours, who would only be extra competitors. Therefore, to incentivize workers to propagate the information, CDM gives them payoffs for their diffusion contribution. In other words, workers gain benefits by spreading the task information to their neighbours.
In our mechanism, the diffusion contribution of a worker $i$ for her successor $j$ is recursively computed as:
$$\Phi_i(j) = \delta \sum_{p \in P_i} \frac{\Phi_p(j)}{n_p(j)}, \qquad (4)$$
where $\Phi_r(j) = \beta\,\phi_j(\theta')$ and $n_p(j) = |\{c \in C_p : c \text{ has a path to } j\}|$.
Here, the parameters are interpreted as follows: $n_p(j)$ is the number of worker $p$'s children that have a path to $j$. For example, in Figure 2, among all the children of the requester, only those with a path to worker 7 are counted in $n_r(7)$. The factor $\delta \in (0, 1)$ is a discount factor and $\beta > 0$ is the proportion factor; both are predefined coefficients. Note that $\Phi_r(j) = \beta\,\phi_j(\theta')$ is a virtual payoff of the requester that simplifies the calculation.
Firstly, we have $\Phi_r(j) = \beta\,\phi_j(\theta')$ in all three cases of Figure 3. In the first case, the network is a chain, so the contribution of a worker is her parent's contribution multiplied by the discount factor $\delta$; the contributions along the chain are $\delta\beta\phi_j(\theta'), \delta^2\beta\phi_j(\theta'), \dots$. In the second case, the requester has two children who are connected to worker $j$, so the two workers have to share the discounted contribution from their parent: each receives $\delta\beta\phi_j(\theta')/2$. In the third case, the diffusion paths towards $j$ merge at one worker, whose contribution is therefore the sum of the discounted contributions from her parents. Thus, all the workers' contributions can be computed by Equation (4).
Finally, the total diffusion contribution of worker $i$ is defined as:
$$\Phi_i(\theta') = \sum_{j \in \mathcal{S}_i} \Phi_i(j).$$
The intuition behind the diffusion contribution of CDM is that if a worker's successor provides some effective data, then the worker is rewarded for her diffusion. Furthermore, from Equation (4), we can see that the diffusion contribution is evaluated along the paths layer by layer.
The requester can adjust the two factors $\delta$ and $\beta$ for different demands. A higher $\beta$ implies that the requester is willing to give more rewards for diffusion contribution, which also brings greater expenses. A lower $\delta$ means that the diffusion contribution decreases more rapidly with depth.
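The diffusion recursion can be sketched as follows, assuming each node splits its discounted share equally among the children that can still reach the target successor $j$ (function names and the diamond example are our own):

```python
def diffusion_contributions(children, parents, requester, j, phi_j, delta, beta):
    """Diffusion contribution (Equation (4)) of every predecessor of worker j.

    The requester's virtual payoff beta * phi_j flows down every path
    towards j: at each node it is discounted by delta per hop and split
    equally among the children that can still reach j.
    """
    def reaches(v):                       # directed path from v to j (DAG)
        return v == j or any(reaches(c) for c in children.get(v, []))

    memo = {requester: beta * phi_j}      # virtual payoff of the requester
    def contrib(i):
        if i not in memo:
            memo[i] = delta * sum(
                contrib(p) / sum(1 for c in children[p] if reaches(c))
                for p in parents.get(i, []))
        return memo[i]

    return {v: contrib(v) for v in parents if v != j and reaches(v)}

# Diamond network: r -> 1 -> 3 -> j and r -> 2 -> 3 -> j.
children = {"r": [1, 2], 1: [3], 2: [3], 3: ["j"], "j": []}
parents = {1: ["r"], 2: ["r"], 3: [1, 2], "j": [3]}
contrib = diffusion_contributions(children, parents, "r", "j",
                                  phi_j=1.0, delta=0.5, beta=0.4)
```

In the diamond, the two parallel paths each carry half of the discounted virtual payoff, so $\Phi_1(j) = \Phi_2(j) = \delta\beta\phi_j/2 = 0.1$, and they merge again at worker 3 with $\Phi_3(j) = \delta(\Phi_1(j) + \Phi_2(j)) = 0.1$.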
Given a data contribution $\phi_j(\theta')$ related to task $t$ from worker $j$, the total diffusion contribution distributed to all her predecessors is bounded.
According to the definition of diffusion contribution in Equation (4), each hop away from the requester multiplies the total share flowing towards $j$ by $\delta$ (the split among the $n_p(j)$ children preserves the sum), so the total contribution of $j$'s predecessors satisfies:
$$\sum_{i \in \mathcal{P}_j} \Phi_i(j) \le \beta\,\phi_j(\theta')\,(\delta + \delta^2 + \cdots) \le \frac{\delta\beta}{1-\delta}\,\phi_j(\theta').$$
Since $\phi_j(\theta')$ is bounded according to Equation (2) and the boundedness of $\sigma$, the total contribution of $j$'s predecessors is bounded by a constant. ∎
Take the network in Figure 2 as an example. Fix predefined factors $\delta$ and $\beta$. If worker 7 has a data contribution $\phi_7(\theta')$, then the diffusion contributions $\Phi_i(7)$ of all of worker 7's predecessors can be calculated by propagating the virtual payoff $\beta\,\phi_7(\theta')$ down the paths from the requester to worker 7 according to Equation (4).
3.3 Total Payoff
At last, we can get our total payoff policy:
$$p_i(\theta') = \phi_i(\theta') + \Phi_i(\theta'),$$
where the predefined factors $\delta$ and $\beta$ are chosen so that the payoff for data contribution is greater than that for diffusion contribution; otherwise, the workers may not want to offer their data. Another important observation is that $\phi_i(\theta')$ and $\Phi_i(\theta')$ are bounded since the valuation $\sigma$ is bounded. The detailed proof will be given in the next section.
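Putting the two parts together, a minimal sketch of the total payoff computation (the function name and the toy numbers are ours):

```python
def total_payoffs(data_contrib, diffusion_contrib):
    """Total payoff policy of CDM: each worker receives her layered Shapley
    value plus her accumulated diffusion contribution.  Both terms are
    bounded, so the requester's total expense is bounded as well."""
    workers = set(data_contrib) | set(diffusion_contrib)
    return {i: data_contrib.get(i, 0.0) + diffusion_contrib.get(i, 0.0)
            for i in workers}

# Toy numbers: three data contributions, two diffusion contributions.
payoffs = total_payoffs({1: 1.5, 2: 0.5, 3: 1.0}, {1: 0.1, 2: 0.1})
```

The requester can sum the resulting payoffs and check them against her budget bound before paying out.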
The total procedure of the mechanism is shown below.
In general, CDM is a centralized data acquisition mechanism. In the beginning, the requester does not know any workers except her neighbours, so she can only inform her neighbours about the task. Under CDM, the informed workers are incentivized to invite their neighbours to join the task and to provide all the data they own to the requester directly. In this way, the requester can collect data as rich as possible without any third-party platforms.
4 Properties of CDM
In this section, we will prove that our crowdsourcing diffusion mechanism is incentive compatible and budget constrained. The mechanism also helps the requester collect more effective data. With these properties, a requester is incentivized to apply our mechanism.
The data collected by the crowdsourcing diffusion mechanism is no less valuable than that collected by crowdsourcing among the requester's neighbours only.
Traditionally, the participants in a crowdsourcing mechanism are those with whom the requester can directly communicate (i.e., the requester is a platform and the participants are its registered users). These users can be viewed as the requester's children in CDM, denoted by $C_r$, which is a subset of all the workers on the social network. Since $\sigma$ is monotone, we have $\sigma\bigl(\bigcup_{i \in C_r} D_i'\bigr) \le \sigma\bigl(\bigcup_{i \in V(\theta')} D_i'\bigr)$. Therefore, the value of the data collected in CDM is always equal to or greater than that of traditional crowdsourcing mechanisms. ∎
The crowdsourcing diffusion mechanism is incentive compatible.
For each worker $i$, her private data is composed of three parts, $D_i = D_i^{\text{pre}} \cup D_i^{\text{same}} \cup D_i^{\text{suc}}$, where $D_i^{\text{pre}}$, $D_i^{\text{same}}$ and $D_i^{\text{suc}}$ respectively denote the data already offered by workers in the previous layers, the data that can only be offered by workers in the same layer as $i$, and the data that can also be offered by workers in the succeeding layers. We can discuss the three parts separately.
For $D_i^{\text{pre}}$, the worker receives zero payoff in our mechanism. She cannot enlarge this payoff by reporting a smaller dataset or by inviting fewer workers, since her report has no effect on the workers in previous layers.
For $D_i^{\text{same}}$, suppose that in $i$'s layer there are $m \ge 1$ workers (including $i$) who own this data. Then, by the symmetry of the Shapley value, if $i$ truthfully offers $D_i^{\text{same}}$, these $m$ workers share the payoff for this data equally, so worker $i$ receives a $\frac{1}{m}$ share of its marginal value. If she offers only a subset of it, her payoff can only decrease. Inviting fewer workers has no effect on her payoff for this part.
For $D_i^{\text{suc}}$, suppose worker $i$ is a predecessor of the first worker in the succeeding layers who also owns this data (otherwise, neither withholding the data nor inviting fewer neighbours earns her anything for this part). If she withholds $D_i^{\text{suc}}$, she transfers some of her data payoff into diffusion payoff: the payoff for her diffusion contribution is only a discounted fraction, at most $\frac{\delta\beta}{1-\delta}$, of the payoff she would receive by offering this part of the data herself. Hence, she prefers to offer the whole $D_i^{\text{suc}}$ by herself.
Therefore, for each worker $i \in V$, truthfully reporting her type is the dominant strategy. ∎
The crowdsourcing diffusion mechanism is budget constrained.
The total dataset retrieved by our crowdsourcing diffusion mechanism is $D = \bigcup_{i \in V(\theta')} D_i'$. By Lemma 3.2, $\sigma(D)$ is bounded. Since the requester's expense is the sum of the payoffs, we have:
$$\sum_{i \in V(\theta')} p_i(\theta') = \sum_{i} \phi_i(\theta') + \sum_{i} \Phi_i(\theta') \le \sigma(D) + \frac{\delta\beta}{1-\delta}\,\sigma(D).$$
Then, we can conclude that the expenses for a data acquisition task will not exceed $\bigl(1 + \frac{\delta\beta}{1-\delta}\bigr)\sigma(D)$, which is bounded. Moreover, the requester can control the budget by adjusting the factors $\delta$ and $\beta$. ∎
Finally, we show that our mechanism can work on arbitrary social networks rather than only DAGs. Since CDM is executed layer by layer, we can first run a breadth-first traversal on the network and then remove the edges between workers in the same layer. After this reduction, an arbitrary network is transformed into a DAG with all the properties preserved.
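The reduction can be sketched with a standard breadth-first traversal (the function name and the small example network are ours):

```python
from collections import deque

def reduce_to_layered_dag(neighbours, requester):
    """Reduction from an arbitrary network to a layered, single-source DAG:
    a breadth-first traversal from the requester assigns every reachable
    worker a depth, and only edges going from one layer to the next are
    kept, so edges inside a layer (and back edges) are dropped."""
    depth = {requester: 0}
    queue = deque([requester])
    while queue:
        v = queue.popleft()
        for u in neighbours.get(v, []):
            if u not in depth:
                depth[u] = depth[v] + 1
                queue.append(u)
    edges = {(v, u) for v in depth for u in neighbours.get(v, [])
             if u in depth and depth[u] == depth[v] + 1}
    return depth, edges

# A network with an intra-layer edge a - b that gets dropped.
neighbours = {"r": ["a", "b"], "a": ["r", "b", "c"],
              "b": ["r", "a", "c"], "c": ["a", "b"]}
depth, edges = reduce_to_layered_dag(neighbours, "r")
```

In the example, the edge between a and b (both at depth 1) is dropped, and the result is a single-source DAG layered by depth.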
In this paper, we have proposed a novel crowdsourcing mechanism via social networks. The mechanism is run by the task requester, and she does not need to pay in advance for promotion. The prominent contribution of our mechanism is that it incentivizes participants to propagate the task information to their neighbours and to involve more workers in the task. Besides that, all workers will also offer as much data as they have. One key to guaranteeing these properties is that workers close to the requester have a higher priority to win rewards than their children, according to the layered Shapley value.
Since our mechanism is based on social networks, it is very challenging to conduct experiments on existing crowdsourcing platforms such as Mechanical Turk. It will be a valuable future work for us to carefully design the experiments based on either existing crowdsourcing platforms or social networks.
Our work has several interesting directions for future investigation. First of all, the false-name attack is typical in crowdsourcing systems, so designing an advanced mechanism that is false-name proof is an important next step. Besides, future work can also account for the costs workers incur to offer data and to diffuse the task. Another scenario worth considering is one where workers' actions are affected by their neighbours'. Finally, a valuable direction is generalising our mechanism to other crowdsourcing tasks beyond data acquisition.
-  Chawla, S., Hartline, J.D., Sivan, B.: Optimal crowdsourcing contests. In: Proceedings of the Twenty-third Annual ACM-SIAM Symposium on Discrete Algorithms. pp. 856–868. SODA ’12, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA (2012), http://dl.acm.org/citation.cfm?id=2095116.2095185
-  Franklin, M.J., Kossmann, D., Kraska, T., Ramesh, S., Xin, R.: Crowddb: Answering queries with crowdsourcing. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. pp. 61–72. SIGMOD ’11, ACM, New York, NY, USA (2011). https://doi.org/10.1145/1989323.1989331, http://doi.acm.org/10.1145/1989323.1989331
-  Howe, J.: The rise of crowdsourcing. Wired magazine 14(6), 1–4 (2006)
-  LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
-  Miller, N., Resnick, P., Zeckhauser, R.: Eliciting informative feedback: The peer-prediction method. Management Science 51(9), 1359–1373 (2005). https://doi.org/10.1287/mnsc.1050.0379, https://doi.org/10.1287/mnsc.1050.0379
-  Narayanam, R., Narahari, Y.: A shapley value-based approach to discover influential nodes in social networks. IEEE Transactions on Automation Science and Engineering 8(1), 130–147 (2011)
-  Pickard, G., Rahwan, I., Pan, W., Cebrián, M., Crane, R., Madan, A., Pentland, A.: Time critical social mobilization: The DARPA network challenge winning strategy. Computing Research Repository abs/1008.3172 (2010)
-  Radanovic, G., Faltings, B., Jurca, R.: Incentives for effort in crowdsourcing using the peer truth serum. ACM TIST 7(4), 48:1–48:28 (2016). https://doi.org/10.1145/2856102, https://doi.org/10.1145/2856102
-  Rényi, A.: On measures of entropy and information. In: Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. The Regents of the University of California (1961)
-  Roth, A.E.: The Shapley value: essays in honor of Lloyd S. Shapley. Cambridge University Press (1988)
-  Rychtáriková, R., Korbel, J., Macháček, P., Císař, P., Urban, J., Štys, D.: Point information gain and multidimensional data analysis. Entropy 18(10), 372 (2016)
-  Shen, W., Feng, Y., Lopes, C.V.: Multi-winner contests for strategic diffusion in social networks. Computing Research Repository abs/1811.05624 (2018)
-  Zhang, X., Mei, C., Chen, D., Li, J.: Feature selection in mixed data: A method using a novel fuzzy rough set-based information entropy. Pattern Recognition 56, 1–15 (2016)
-  Zhou, D., Liu, Q., Platt, J.C., Meek, C., Shah, N.B.: Regularized minimax conditional entropy for crowdsourcing. Computing Research Repository abs/1503.07240 (2015)