Driven by the explosion of smart devices (e.g., smartphones and wearable devices), mobile crowdsensing (MCS) has emerged as a promising sensing paradigm for data collection. A typical MCS system outsources small sensing tasks to a large crowd of device users to take advantage of the sensing and computing power of mobile devices. Smart devices are often equipped with built-in sensors, including but not limited to GPS, camera, gyroscope, and accelerometer, and can thus accomplish various social and commercial tasks for different applications, such as traffic reporting, environment monitoring, social interactions, and e-business. In practice, mobile sensing participants face great uncertainties arising from the sensing environment and from their interactions with the MCS service provider and other participants. This paper aims to understand how participants can make optimal sensing decisions in the face of these uncertainties. We model the participants’ interactions using Markov decision processes (MDPs) and develop a multi-agent reinforcement learning (MARL) algorithm, IntelligentCrowd, that allows all MCS participants to learn optimal sensing policies simultaneously.
I-A Related Work
One critical issue in MCS is how to incentivize mobile users to participate in sensing programs, since performing sensing tasks consumes resources and incurs costs for the participants. Therefore, great efforts have been made to design incentive mechanisms [2, 3, 4] to enroll users for MCS. For example, in , an incentive mechanism was designed using a Stackelberg game framework. In , the authors considered the opportunistic nature of participants and proposed three online incentive mechanisms using a reverse auction, offering more flexibility in recruiting opportunistically encountered participants. In , a bargaining-based incentive mechanism was designed, and a distributed iterative algorithm was used to solve the bargaining problem between the sensing platform and smart device users.
Once mobile users are participating in MCS programs, another critical problem is how to assign tasks to participants while accounting for the diversity of sensing tasks. Existing studies have focused on task allocation from the perspective of the MCS service provider [5, 6, 7, 8]. For example, studies in  showed that the optimal task allocation problem is NP-hard, since sensing tasks are often associated with different locations and MCS participants operate under time constraints. Therefore, approximation algorithms were developed in  to find a satisfactory task allocation with a proven approximation ratio. In , a worker selection framework was proposed for multi-task MCS environments with time-sensitive tasks and delay-tolerant tasks. In , the authors formulated a bi-objective optimization problem to address a multi-task-oriented participant selection problem. The sensing task assignment problem has also been studied in mobile social networks , and an online task assignment algorithm was designed using a greedy strategy.
I-B Motivation and Contributions
In real-world applications, MCS participants face many uncertainties that affect their decisions. For example, due to stochastic sensing environments, participants adopting the same sensing strategy may obtain different sensed information. In addition, the economic return a participant receives from the service provider depends not only on her own efforts but also on other participants’ decisions. However, most of the aforementioned studies [2, 3, 4, 5, 6, 7, 8] were platform-centric and did not consider participants’ behaviors under uncertainties and fast-changing environments. Poor decisions neither give participants sufficient payoffs nor provide the service provider with high-quality crowdsourced information. This motivates us to study how to make sequential optimal sensing decisions from the participants’ perspective, especially under uncertainties.
Reinforcement learning (RL) is a family of machine learning algorithms that has recently been applied to many real-world control and decision problems . For instance, in , a deep RL algorithm is used for robotic motion planning in stochastic environments; in , the authors proposed a reinforcement learning algorithm that maximizes the arbitrage profit of a single battery; in , a single-agent RL algorithm is proposed for designing incentives for MCS participants. Yet RL has not been fully exploited in the multi-agent setting, which is appealing for many real-time problems such as MCS that inherently involve multiple participants.
In this paper, we model the MCS participants’ behaviors using multi-agent Markov decision processes (MDPs), in which participants make decisions on their sensing efforts under uncertainties of sensing quality and payoffs. We solve MDPs using a multi-agent reinforcement learning algorithm in an online fashion. The main contributions of this paper are summarized as follows:
User-centric MCS: We take the participants’ perspective and design an optimal strategy for determining the efforts that maximize participants’ payoffs under a given incentive mechanism;
Online learning and decision making: We design an online distributed MCS algorithm, namely IntelligentCrowd, which makes use of deep neural networks to learn the policy of MCS participation;
Performance evaluation: We validate the performance of our proposed algorithm under various kinds of stochastic environments and show that it helps MCS participants earn payoffs from a series of crowdsensing tasks.
II System Model and Problem Formulation
In this section, we first model the MCS system, the value of information, and the user’s effort of participation. We will then formulate the MCS problem, in which we aim to help MCS users make the participation decisions to maximize their accumulated payoffs in an uncertain sensing environment.
II-A MCS System
Consider an MCS system with a fixed set of mobile users and a service provider. There is two-way communication between the mobile users and the service provider. Each user is capable of sensing some data in a certain area at a certain time step (as shown in Fig. 1). We consider an MCS campaign over a finite, discrete time horizon, in which voluntary participants equipped with sensors perform sensing tasks for the service provider. At each time slot, the service provider first publicizes the sensing tasks together with a time-dependent total reward budget. All participating users then exert effort to collect information and are allocated a portion of this budget based on their contribution to the overall crowdsensed information, as discussed later in this section.
II-B User Model
We will now present the sequential user behavior in the participation of MCS along with its payoff from the service provider.
II-B1 Action of Participants
To participate in an MCS task, each user needs to select an effort level and send the sensed data to the service provider in each time slot. Performing sensing tasks incurs costs for users. At the same time, each user receives a reward from the service provider based on the quality of the sensed data. We also consider the collective effort profile and the collective reward profile of all users at each time, as well as the set of possible efforts for each agent.
Before we present the detailed models of the costs and rewards for users, we first introduce the measure of the value of information (VoI) with respect to the user’s action. We adopt the notion of quality of information (QoI), a real-valued scalar indicating the quality of a user’s sensed data at a given time. Similarly, we consider the set of observed QoI for all users and the set of possible QoI observations for each participant. Note that, due to the mobility of users and the variations of sensing tasks, the QoI is a stochastic process over time and is often unpredictable. The VoI of a user in a time slot is then based on that user’s contribution. We assume that after the campaign at each time slot, the past level of QoI is revealed to all participants. However, a user does not know the others’ actions, since everyone makes decisions independently.
II-B2 Payoff Function
Each participant’s payoff is determined by the reward from the service provider together with her sensing cost. The reward for a user is proportional to her share of the VoI at the current time step, while a participation cost is incurred based on the effort she makes. In sum, a participant’s payoff is her reward minus her participation cost.
For simplicity, in this work, we assume that the reward budget is announced by the service provider at each time slot, while the participation cost is known to each participant. Our framework can also accommodate the case in which both are stochastic.
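As a minimal sketch of the payoff described above, the following Python function computes every participant’s payoff in one time slot. Two modeling choices are assumptions made only for illustration and are not confirmed by the text: a user’s VoI is taken to be the product of effort and QoI, and the cost is taken to be linear in effort. The names `payoffs`, `budget`, and `unit_costs` are hypothetical.

```python
from typing import List, Sequence


def payoffs(efforts: Sequence[float], qois: Sequence[float],
            budget: float, unit_costs: Sequence[float]) -> List[float]:
    """Payoff of each participant: (share of total VoI) * budget - cost.

    Assumptions (illustrative only): VoI_i = effort_i * QoI_i and
    cost_i = unit_cost_i * effort_i.
    """
    vois = [a * q for a, q in zip(efforts, qois)]
    total_voi = sum(vois)
    result = []
    for effort, voi, unit_cost in zip(efforts, vois, unit_costs):
        # If nobody contributes any value, no reward is distributed.
        share = voi / total_voi if total_voi != 0 else 0.0
        result.append(share * budget - unit_cost * effort)
    return result


# Two users exert the same effort, but user 0 senses higher-quality data.
example = payoffs(efforts=[1.0, 1.0], qois=[3.0, 1.0],
                  budget=8.0, unit_costs=[1.0, 1.0])
```

In this example the first user holds three quarters of the total VoI and the second one quarter, so the same effort yields different payoffs. This is the coupling between participants that the rest of the paper has to deal with.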
II-C Payoff Maximization Problem
With the user’s payoff defined, we take a user-centric perspective in this paper. Our objective is to find a sequence of effort decisions that maximizes each user’s total expected discounted payoff over the entire MCS participation period, given a pre-defined discount factor.
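The discounted objective can be illustrated with a short sketch that evaluates one realized payoff trajectory; the expectation in the objective would be approximated by averaging this quantity over many trajectories. The function name and the sample numbers are hypothetical.

```python
from typing import Sequence


def discounted_payoff(per_step_payoffs: Sequence[float], gamma: float) -> float:
    """Sum of gamma**t * payoff_t over one trajectory (t starts at 0)."""
    total = 0.0
    # Accumulate backwards so each step needs one multiply and one add.
    for payoff in reversed(per_step_payoffs):
        total = payoff + gamma * total
    return total


# 1 + 0.5 * 2 + 0.25 * 4 = 3
example_return = discounted_payoff([1.0, 2.0, 4.0], gamma=0.5)
```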
If the user’s dynamical sensing environment is known or can be predicted, finding the optimal decisions is a sequential decision problem. Intuitively, we can find solutions via a model of the dynamical system, using either off-the-shelf offline optimization or predictive control to maximize the discounted payoff subject to the system’s dynamical constraints. Based on the available information and the participants’ interactions, we can cast each MCS participant’s effort level as the output of a parameterized policy function that maps past QoI observations to an effort level.
The policy takes as input a window of past QoI of fixed total length and is characterized by its own set of parameters.
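To make the policy mapping concrete, the sketch below builds a fixed-length window of past QoI and feeds it to a toy parameterized policy. The affine form, the zero padding for short histories, and the effort range [0, 1] are assumptions chosen for illustration; the paper itself learns the policy with neural networks.

```python
from typing import List, Sequence


def qoi_window(history: Sequence[float], window: int) -> List[float]:
    """Return the last `window` QoI values, left-padded with zeros."""
    recent = list(history[-window:]) if window > 0 else []
    return [0.0] * (window - len(recent)) + recent


def linear_policy(obs: Sequence[float], weights: Sequence[float],
                  bias: float) -> float:
    """Toy policy: affine map of the window, clipped to an effort in [0, 1]."""
    raw = sum(w * x for w, x in zip(weights, obs)) + bias
    return min(1.0, max(0.0, raw))


obs = qoi_window([0.5], window=3)              # history shorter than window
effort = linear_policy(obs, weights=[0.0, 0.0, 1.0], bias=0.25)
```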
However, we face two difficulties in applying the aforementioned approaches to find the optimal decisions. First, the participant may be situated in a highly stochastic environment. Both the high dimensionality of the state spaces of users’ effort levels and sensing environments and the limited forecast accuracy of future sensing environments restrict the performance of existing methods. Second, the interactions among MCS participants make it hard to model the system dynamics accurately. For instance, when one agent chooses a certain effort level, this choice not only affects the other agents’ current payoffs but may also influence all agents’ future decisions through the dynamics of the sensing environment. To tackle these difficulties in modeling and decision-making, we take a machine learning approach that automatically learns to choose effort levels for multiple participants in a stochastic QoI environment.
III Multi-Agent Reinforcement Learning
To solve the real-time payoff maximization problem for all agents, in this section we first describe the problem setup using MDPs, which were developed for discrete-time stochastic control processes. Reinforcement learning algorithms, e.g., Q-learning, can be applied to solve single-agent MDPs. To solve multi-agent MDPs in which the participants’ sensing environments are non-stationary and information is only partially observable (e.g., each participant does not know the others’ sensing efforts), we extend the deep reinforcement learning algorithm proposed by , and design our MARL algorithm, IntelligentCrowd. We then show in Section IV that, under various kinds of dynamical sensing environments, IntelligentCrowd is able to find sensing efforts for all participants simultaneously in an online setting.
III-A Multi-Agent MDPs
In standard MDP models, we are only interested in stochastic decision making based on past actions and system states. The decision on the effort levels of MCS participants is a natural extension of MDPs to the multi-agent case. Mathematically, we consider a set of actions and a set of sensed QoI levels for each participant. For the overall system, the state at each step is simply the collection of the participants’ QoI levels. Given a joint action of all MCS participants, we also define two functions of the states (QoI) and actions (sensing efforts): a) a state transition function that governs the sensing environment, and b) a reward (payoff) function for each agent.
Reinforcement learning algorithms aim to learn a policy automatically under the MDP framework. For single-agent MDPs with states, actions, and rewards defined (for simplicity of notation, we omit the subscripts of each variable to indicate the single-agent case), we can use standard Q-learning or train a Deep Q Network (DQN), which fits a function that maps a state-action pair to the accumulated reward. To learn such a state-action value function, we update the Q-table in Q-learning or the Q-network in DQN recursively so as to minimize the temporal-difference loss (3).
Note that we need to take the expectation of (3) when using batch-based algorithms. Once we obtain an accurate approximation of the value function, finding the optimal policy is essentially a network inference step: in each state, select the action with the largest estimated value.
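The recursive update and the greedy inference step can be sketched with tabular Q-learning. This is the textbook single-agent form rather than the paper’s implementation; the discrete effort labels, the learning rate `alpha`, and the discount `gamma` are hypothetical values chosen for the example.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions,
             alpha=0.5, gamma=0.9):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a').

    Returns the squared temporal-difference error for this sample.
    """
    target = reward + gamma * max(q[(next_state, a)] for a in actions)
    td_error = target - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return td_error ** 2


def greedy_action(q, state, actions):
    """Inference step: pick the action with the largest estimated value."""
    return max(actions, key=lambda a: q[(state, a)])


q_table = defaultdict(float)          # unseen (state, action) pairs start at 0
effort_levels = ["low", "high"]
loss = q_update(q_table, "s0", "high", reward=1.0,
                next_state="s1", actions=effort_levels)
best = greedy_action(q_table, "s0", effort_levels)
```

A DQN replaces the table with a neural network and minimizes the same squared error averaged over a batch of samples.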
Now we discuss how to extend the single-agent setting to the multi-agent MCS system. To make our algorithm practical for real-world settings with multiple MCS participants, we also make the following assumptions:
During training, each participant can observe both the collective action profile and the collective reward profile;
During testing or real implementation, each participant can only use its own share of information, such as its own QoI observations (the state is only partially observable);
We do not assume any particular communication algorithms between participants about their sensing strategies;
Each participant may face heterogeneous and stochastic QoI dynamics (as shown in Fig. 2).
One may try to extend Q-learning via (3) directly to the multi-agent setting by learning a separate Q-function for each agent from its available observations. However, the environment becomes non-stationary from the perspective of any individual MCS participant, since the effort level chosen by one participant affects the payoffs of the other participants. Such changes cannot be explained by each participant’s own state-action space alone. Thus the Q-learning family of methods [12, 14] cannot learn such non-stationary dependencies, since the Q-function is updated independently for each participant. Although it is possible to include all agents’ decisions as input to the Q-table or Q-network during training to help learn the Q-function, they cannot be included during testing because participants choose their strategies independently. Moreover, Q-learning is difficult to scale to continuous actions such as the effort level in our case.
III-B IntelligentCrowd Algorithm
To ease the learning difficulty of using only a single Q-network for each agent, we adopt the actor-critic model proposed in [15, 13]. In the multi-agent version of the actor-critic model, we use two neural networks for each MCS participant, an actor network and a critic network, which jointly learn the action policy. As the names suggest, the critic network is similar to the Q-network in DQN, while the actor network replaces the inference step in DQN and directly maps the state to the learned optimal effort level.
To train both networks together, we utilize the output of the critic. That is, the critic acts as a judge of the policy output by the actor. Using the feedback loss from the critic network, the actor network adjusts its weights to output better decisions on the effort level. Once trained, the actor can output actions directly and does not need information from the critic network. Thus, during training, we can feed the effort levels of the other agents to the critic to help it learn the interactions among different agents. Mathematically, the critic’s input includes the effort levels of all agents, while the actor’s input consists only of a window of the participant’s own historical observations. Since the critic fully observes every participant’s effort decision during training, it faces a stationary environment: the state transition function does not change when any individual policy is modified.
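The difference between what the actor and the critic see can be sketched as follows. The exact layout of the input vectors is an assumption for illustration: here the critic is given every agent’s observation window in addition to every agent’s effort, which is one common choice and is not stated in the text. The helper names are hypothetical.

```python
from typing import List, Sequence


def actor_input(own_window: Sequence[float]) -> List[float]:
    """The actor only sees the participant's own window of past observations."""
    return list(own_window)


def critic_input(all_windows: Sequence[Sequence[float]],
                 all_efforts: Sequence[float]) -> List[float]:
    """The critic (training only) sees every agent's window and effort."""
    flat_windows = [x for window in all_windows for x in window]
    return flat_windows + list(all_efforts)


windows = [[0.1, 0.2], [0.3, 0.4]]    # two agents, window length 2
efforts = [0.5, 0.9]                  # joint action of both agents
actor_vec = actor_input(windows[0])
critic_vec = critic_input(windows, efforts)
```

Because the joint effort vector is part of the critic’s input, a change in another agent’s policy shows up as a different input rather than as an unexplained change in the environment.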
Similar to the policy gradient update approach , we update the actor network along the gradient of the expected return with respect to the policy parameters, where the return is evaluated by the critic.
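For reference, the multi-agent actor-critic method of [13], which the actor-critic model here adopts, writes the actor update for agent $i$ in the form below. The symbols follow [13] and should be read as assumptions with respect to this paper’s own notation: $\mu_i$ is the deterministic actor of agent $i$ with parameters $\theta_i$, $o_i$ is its own observation, $x$ collects the information available to the critic, $a_1, \dots, a_N$ are the agents’ actions, $Q_i^{\mu}$ is the critic of agent $i$, and $\mathcal{D}$ is the replay buffer.

```latex
\nabla_{\theta_i} J(\mu_i)
  = \mathbb{E}_{x,\, a \sim \mathcal{D}}
    \Big[
      \nabla_{\theta_i} \mu_i(o_i)\,
      \nabla_{a_i} Q_i^{\mu}(x, a_1, \dots, a_N)
      \big|_{a_i = \mu_i(o_i)}
    \Big]
```

The critic’s gradient with respect to agent $i$’s own action tells the actor in which direction to move its output, and the chain rule carries that direction back to the actor’s parameters.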
We implement centralized training with decentralized execution. That is, during training, each participant is able to make use of the extra action information provided by the other participants, while during real implementation, each participant uses only its own observations.
We summarize our IntelligentCrowd algorithm in Algorithm 1, which is a user-centric algorithm for MCS participants to find the best sensing efforts in a multi-agent stochastic sensing environment. We also want to highlight that once IntelligentCrowd completes the centralized training of the critic networks, it can be executed in a distributed manner by each MCS participant. In the algorithm, the episode length corresponds to the MCS campaign horizon, and the actor-critic model is trained on fixed-length episode data. We also keep an experience replay buffer during training, which stores tuples from past traces of the MCS participants’ effort decisions and QoI evolution. We use superscripts to denote samples drawn from the replay buffer along with their next-step values.
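An experience replay buffer of the kind mentioned above can be sketched in a few lines. The tuple layout (state, joint efforts, rewards, next state), the fixed capacity, and uniform sampling are assumptions for illustration; the class and method names are hypothetical.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of past transitions, sampled uniformly."""

    def __init__(self, capacity: int, seed: int = 0):
        # A deque with maxlen silently drops the oldest item when full.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, efforts, rewards, next_state) -> None:
        self.buffer.append((state, efforts, rewards, next_state))

    def sample(self, batch_size: int) -> list:
        # Never ask for more samples than are stored.
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)

    def __len__(self) -> int:
        return len(self.buffer)


replay = ReplayBuffer(capacity=3)
for step in range(5):
    replay.add(state=step, efforts=[0.5, 0.5],
               rewards=[1.0, 1.0], next_state=step + 1)
batch = replay.sample(2)
```

Sampling at random from stored traces breaks the temporal correlation between consecutive time slots, which is the usual reason for keeping such a buffer.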
IV Performance Evaluations
To evaluate the performance of our algorithm, we validate IntelligentCrowd on a series of different sensing environments. We also analyze the impacts of different parameters in IntelligentCrowd.
IV-A Simulation Setup
IV-A1 Neural Networks Setup
We construct four groups of actor-critic networks to learn the effort-level sensing policies of four MCS participants. We use two-layer fully connected networks for both the actor and the critic networks. We vary the historical window size during training and testing; it determines how much historical QoI data is taken into consideration by both networks. Standard neural network modeling techniques, such as batch normalization and pass-through links, are adopted. We train these actor and critic networks until each agent’s episode payoff converges, and we keep a fixed number of steps per episode in all simulations.
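A forward pass through a two-layer fully connected actor can be sketched in plain Python. Everything beyond the two-layer, fully connected structure is an assumption made for the example: the hidden width of 8, the ReLU and sigmoid activations, the small uniform initialization, and the omission of batch normalization and pass-through links.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) with small random weights and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer, activation):
    """Fully connected layer: activation(W x + b), one output per weight row."""
    weights, biases = layer
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def actor_forward(obs, hidden_layer, output_layer):
    """Map a QoI window to a single effort level in (0, 1)."""
    hidden = dense(obs, hidden_layer, relu)
    return dense(hidden, output_layer, sigmoid)[0]


rng = random.Random(0)
window_size = 4                                  # hypothetical window length
hidden_layer = init_layer(window_size, 8, rng)
output_layer = init_layer(8, 1, rng)
effort = actor_forward([0.1, 0.2, 0.3, 0.4], hidden_layer, output_layer)
```

A critic would use the same structure with a wider input (observations plus all agents’ efforts) and a linear output instead of the sigmoid.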
IV-A2 QoI Dynamics
Due to the mobility of users and the time-varying nature of sensing capabilities, we consider three heterogeneous QoI dynamics as shown in Fig. 2:
Sine Dynamics: Under this setting, the MCS participant is faced with periodic sensing signals with fixed amplitude and frequency.
Linear Dynamics: The MCS participant receives periodic QoI whose strength varies linearly with time.
Markov Chain Dynamics: We simulate a finite-state Markov chain to represent the temporal evolution of the QoI.
We also make the system dynamics more challenging for IntelligentCrowd by setting different frequencies, amplitudes, or transition matrices for MCS participants that share the same type of dynamics. We also allow the sensed signal to be negative (e.g., representing wrong information), and each MCS participant must learn to avoid such fake information.
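The three kinds of QoI dynamics can be simulated with simple generators such as the ones below. All parameter values are hypothetical, and the linear case is interpreted here as a sawtooth that ramps up and resets each period, which is one reading of "periodic QoI of linear strength" rather than the paper’s exact definition.

```python
import math
import random


def sine_qoi(t, amplitude=1.0, period=20):
    """Periodic QoI with fixed amplitude and frequency (can be negative)."""
    return amplitude * math.sin(2.0 * math.pi * t / period)


def linear_qoi(t, slope=0.1, period=10):
    """Sawtooth QoI: grows linearly in time and resets every `period` steps."""
    return slope * (t % period)


def markov_qoi(levels, transition, steps, rng, start=0):
    """Trace of QoI levels from a finite-state Markov chain.

    transition[i][j] is the probability of moving from state i to state j.
    """
    state = start
    trace = []
    for _ in range(steps):
        trace.append(levels[state])
        state = rng.choices(range(len(levels)), weights=transition[state])[0]
    return trace


rng = random.Random(0)
# A chain that always flips between a negative and a positive QoI level.
flip_trace = markov_qoi(levels=[-1.0, 1.0],
                        transition=[[0.0, 1.0], [1.0, 0.0]],
                        steps=4, rng=rng)
```

Giving each participant its own amplitude, period, slope, or transition matrix reproduces the heterogeneous setting described above.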
IV-B Results and Analysis
In Fig. 4, we show the simulation results when all agents are under Sine dynamics of QoI. We plot the mean episode reward against the training episode, along with the reward variance over multiple runs. All participants exhibit similar learning behaviors across the different window lengths. During the initial training episodes, all of the agents lack knowledge of the QoI dynamics and of the decision patterns of the other agents, so the performance is poor and participants even receive negative payoffs, which implies that they do not obtain sufficient reward from the service provider to compensate for their costs. At around episodes 7-8, all participants start to learn a good effort strategy and obtain greater payoffs. These payoffs become stable as training goes on, which suggests that the neural network training for the actor and the critic is stable.
We also observe that as the window length increases, all of the agents learn strategies that achieve higher payoffs. This indicates that IntelligentCrowd performs better with the aid of more information from past observations. However, increasing the window length from 50 to 100 does not help much in deciding the sensing effort levels, while it takes more computational resources to train the actor-critic networks due to the larger input dimension.
Next, we evaluate our algorithm when MCS participants face different kinds of dynamics. In this setting, it is more challenging for MCS participants to choose effort levels wisely, since heterogeneous dynamics make user interactions more complicated. As shown in Fig. 5, all participants are able to select effort levels that yield positive payoffs by the end of training. Participants 1, 2, and 3, who are under the more “predictable” sine or linear dynamics, are able to make positive payoffs at around episodes 7-8, whereas agent 4, under Markov chain dynamics, only starts to make a positive payoff in later episodes. More interestingly, even though participants 1-3 are making high payoffs around episode 10, participant 4 then starts to make good effort decisions at each timestep, taking a portion of the total payoffs from the other 3 participants and making higher payoffs in later episodes.
In Table I, we summarize the average accumulated rewards for four MCS participants during an episode for four types of QoI dynamics after training has finished. The mixed dynamics is the same environment we simulate in Fig. 5. We also observe that when we set in IntelligentCrowd, MCS participants could get the highest crowdsensing payoffs.
V Discussion and Conclusion
In this paper, we take the MCS participants’ perspective and investigate the problem of determining sensing efforts to maximize the payoff of each individual participant. We first address the difficulties in modeling and decision-making that arise because participants face stochastic sensing environments and interact with one another in complex ways. We then develop an online algorithm, IntelligentCrowd, which leverages the power of deep reinforcement learning to efficiently find the best sensing decision for each participant in real time. We validate our IntelligentCrowd algorithm through simulations in various sensing environments.
In future work, we will consider the interactions between the service provider’s mechanism design and MCS participants’ decision making, and implement our ideas on real-world crowdsensing data. We are also interested in exploring the effects of the social interactions among MCS participants.
References
[1] R. K. Ganti, F. Ye, and H. Lei, “Mobile crowdsensing: current state and future challenges,” IEEE Communications Magazine, vol. 49, no. 11, 2011.
[2] D. Yang, G. Xue, X. Fang, and J. Tang, “Crowdsourcing to smartphones: incentive mechanism design for mobile phone sensing,” in Proceedings of the 18th Annual International Conference on Mobile Computing and Networking. ACM, 2012, pp. 173–184.
[3] X. Zhang, Z. Yang, Z. Zhou, H. Cai, L. Chen, and X. Li, “Free market of crowdsourcing: Incentive mechanism design for mobile sensing,” IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 12, pp. 3190–3200, 2014.
[4] Y. Zhan, Y. Xia, and J. Zhang, “Incentive mechanism in platform-centric mobile crowdsensing: A one-to-many bargaining approach,” Computer Networks, vol. 132, pp. 40–52, 2018.
[5] B. Guo, Y. Liu, W. Wu, Z. Yu, and Q. Han, “ActiveCrowd: A framework for optimized multitask allocation in mobile crowdsensing systems,” IEEE Transactions on Human-Machine Systems, vol. 47, no. 3, pp. 392–403, 2017.
[6] Y. Liu, B. Guo, Y. Wang, W. Wu, Z. Yu, and D. Zhang, “TaskMe: multi-task allocation in mobile crowd sensing,” in Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 2016, pp. 403–414.
[7] M. Xiao, J. Wu, L. Huang, Y. Wang, and C. Liu, “Multi-task assignment for crowdsensing in mobile social networks,” in Computer Communications (INFOCOM), 2015 IEEE Conference on. IEEE, 2015, pp. 2227–2235.
[8] S. He, D.-H. Shin, J. Zhang, and J. Chen, “Toward optimal allocation of location dependent tasks in crowdsensing,” in INFOCOM, 2014 Proceedings IEEE. IEEE, 2014, pp. 745–753.
[9] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.
[10] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International Conference on Machine Learning, 2015, pp. 1889–1897.
[11] H. Wang and B. Zhang, “Energy storage arbitrage in real-time markets via reinforcement learning,” arXiv preprint arXiv:1711.03127, 2017.
[12] L. Xiao, Y. Li, G. Han, H. Dai, and H. V. Poor, “A secure mobile crowdsensing game with deep reinforcement learning,” IEEE Transactions on Information Forensics and Security, 2017.
[13] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” in Advances in Neural Information Processing Systems, 2017, pp. 6379–6390.
[14] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
[15] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning, 2016, pp. 1928–1937.
[16] M. H. Cheung, F. Hou, and J. Huang, “Make a difference: Diversity-driven social mobile crowdsensing,” in INFOCOM 2017-IEEE Conference on Computer Communications, IEEE. IEEE, 2017, pp. 1–9.