I. Introduction
Driven by the explosion of smart devices (e.g., smartphones and wearable devices), mobile crowdsensing (MCS) [1] has emerged as a promising sensing paradigm for data collection. A typical MCS system outsources small sensing tasks to a large crowd of device users to take advantage of the sensing and computing power of mobile devices. Smart devices are often equipped with built-in sensors, including but not limited to GPS, camera, gyroscope, and accelerometer, and thus can accomplish various social and commercial tasks for different applications, such as traffic reporting, environment monitoring, social interactions, and e-business [1]. In practice, mobile-sensing participants face great uncertainties from the sensing environment and from their interactions with the MCS service provider and other participants. This paper aims to understand how participants can make optimal sensing decisions against such uncertainties. We model the participants' interactions using Markov decision processes (MDPs), and develop a multi-agent reinforcement learning (MARL) algorithm, IntelligentCrowd, for all MCS participants to learn optimal sensing policies simultaneously.
I-A Related Work
One critical issue in MCS is how to incentivize mobile users to participate in sensing programs, since performing sensing tasks consumes resources and incurs costs to the participants. Therefore, great efforts have been made to design incentive mechanisms [2, 3, 4] to enroll users for MCS. For example, in [2], an incentive mechanism was designed using a Stackelberg game framework. In [3], the authors considered the opportunistic nature of participants and proposed three online incentive mechanisms using a reverse auction, offering more flexibility in recruiting opportunistically encountered participants. In [4], a bargaining-based incentive mechanism was designed, and a distributed iterative algorithm was used to solve the bargaining problem between the sensing platform and smart device users.
Once mobile users are participating in MCS programs, another critical problem is how to assign tasks to participants considering sensing/task diversity. Existing studies have focused on task allocation from the MCS service provider's side [5, 6, 7, 8]. For example, the study in [8] showed that the optimal task allocation problem is NP-hard, since sensing tasks are often associated with different locations and MCS participants operate under time constraints. Therefore, approximation algorithms were developed in [8] to find a satisfactory task allocation with a proven approximation ratio. In [5], a worker selection framework was proposed for multi-task MCS environments with time-sensitive and delay-tolerant tasks. In [6], the authors formulated a bi-objective optimization problem to address a multi-task-oriented participant selection problem. The sensing task assignment problem has also been studied in mobile social networks [7], where an online task assignment algorithm was designed using a greedy strategy.
I-B Motivation and Contributions
In real-world applications, MCS participants face many uncertainties that affect their decisions. For example, due to stochastic sensing environments, participants taking the same sensing strategy may obtain different sensed information. In addition, a participant's economic return from the service provider depends not only on her own efforts but also on other participants' decisions. However, most of the aforementioned studies [2, 3, 4, 5, 6, 7, 8] were platform-centric and did not consider participants' behaviors under uncertainties and fast-changing environments. Poor decision-making neither gives participants sufficient payoffs nor provides the service provider with high-quality crowdsensed information. This motivates us to study how to make sequential optimal sensing decisions from the participants' perspective, especially under uncertainties.
Reinforcement learning (RL) is a family of machine learning algorithms that has recently been applied to many real-world control and decision problems [9]. For instance, in [10], a deep RL algorithm is used for robotic motion planning in a stochastic environment; in [11], the authors proposed an RL algorithm that maximizes the arbitrage value of a single battery; in [12], a single-agent RL algorithm is proposed for designing incentives for MCS participants. Yet RL has not been fully exploited in the multi-agent setting, which is appealing for many real-time problems such as MCS, whose nature involves multiple participants.
In this paper, we model the MCS participants' behaviors using multi-agent Markov decision processes (MDPs), in which participants make decisions on their sensing efforts under uncertainties of sensing quality and payoffs. We solve the MDPs using a multi-agent reinforcement learning algorithm in an online fashion. The main contributions of this paper are summarized as follows:

User-centric MCS: We take the participants' perspective to design an optimal strategy for determining the efforts that maximize participants' payoffs given an incentive mechanism;

Online learning and decision-making: We design an online distributed MCS algorithm, namely IntelligentCrowd, which makes use of deep neural networks to learn the policy of MCS participation;

Performance evaluation: We validate our proposed algorithm's performance under various kinds of stochastic environments, helping MCS participants earn payoffs from a series of crowdsensing tasks.
II. System Model and Problem Formulation
In this section, we first model the MCS system, the value of information, and the user's effort of participation. We then formulate the MCS problem, in which we aim to help MCS users make participation decisions that maximize their accumulated payoffs in an uncertain sensing environment.
II-A MCS System
Consider an MCS system with a fixed set of $N$ mobile users and a service provider. There are two-way communications between the mobile users and the service provider. Each user has the capability of sensing some data in a certain area at a certain time step (as shown in Fig. 1). We consider an MCS campaign over a finite and discrete time horizon $T$, during which the $N$ voluntary participants equipped with sensors perform sensing tasks for the service provider. At each time slot $t$, the service provider first publicizes the sensing tasks with a total reward budget $R^t$, which is time-dependent. All participating users exert efforts to collect information and are allocated a portion of $R^t$ based on their contribution to the overall crowdsensed information, which will be discussed later in this section.
II-B User Model
We now present the sequential user behavior in MCS participation, along with the payoff from the service provider.
II-B1 Action of Participants
To participate in an MCS task, each user $i$ needs to select an effort level $a_i^t$ and send the sensed data to the service provider in time slot $t$. Performing sensing tasks incurs costs to users. At the same time, each user $i$ also gets a reward $r_i^t$ assigned by the service provider based on how good the quality of the sensed data is. We adopt the notation $\mathbf{a}^t = (a_1^t, \dots, a_N^t)$ and $\mathbf{r}^t = (r_1^t, \dots, r_N^t)$ to denote the collective effort and reward profiles at each time slot, and we use $\mathcal{A}_i$ to denote the set of possible efforts for agent $i$.
Before we present the detailed models of the costs and rewards for users, we first introduce the measure of the value of information (VoI) with respect to the user's action. We adopt the notion of quality of information (QoI) $q_i^t$, a real-valued scalar indicating the quality of user $i$'s sensed data at time $t$. Similarly, we denote $\mathbf{q}^t = (q_1^t, \dots, q_N^t)$ as the set of observed QoI for all users, and $\mathcal{Q}_i$ as the set of possible QoI observations for participant $i$. Note that, due to the mobility of users and the variations of sensing tasks, $q_i^t$ is a stochastic process over time and often unpredictable. Then the VoI $v_i^t$ of user $i$ in time slot $t$ is based on the user's contribution. We assume that after the campaign at time $t$, the past level of QoI $\mathbf{q}^t$ is revealed to all participants. Yet user $i$ does not know others' actions, since everyone makes decisions independently.
II-B2 Payoff Function
Participant $i$'s payoff is determined by the reward from the service provider along with her sensing cost. The reward for user $i$ is proportional to her share of VoI at the current time step, while a participation cost is incurred based on the effort she makes, with unit cost $c_i$. In sum, participant $i$'s payoff function is
$$r_i^t = \frac{v_i^t}{\sum_{j=1}^{N} v_j^t}\, R^t - c_i\, a_i^t. \qquad (1)$$
For simplicity, in this work, we assume that $R^t$ is notified by the service provider at time $t$, while $c_i$ is known to each participant. Our framework can also take the stochastic case of $R^t$ and $c_i$ into account.
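As a concrete illustration of the payoff structure above, the following sketch computes one slot's payoffs. The specific VoI form $v_i^t = q_i^t a_i^t$ (sensed quality scaled by effort) and the exactly proportional budget split are assumptions made for this sketch, not the paper's exact specification:

```python
import numpy as np

def payoffs(efforts, qoi, budget, costs):
    """Per-slot payoffs in the spirit of Eq. (1): each participant
    receives a share of the budget R^t proportional to its VoI,
    minus a linear sensing cost c_i * a_i^t.
    (VoI is taken as q_i^t * a_i^t here -- an assumed form.)"""
    voi = qoi * efforts                       # assumed VoI: quality x effort
    shares = voi / voi.sum()                  # fraction of total VoI
    return budget * shares - costs * efforts  # reward minus sensing cost

# Four hypothetical participants in one time slot
efforts = np.array([0.5, 1.0, 0.2, 0.8])
qoi = np.array([0.9, 0.4, 1.0, 0.6])
costs = np.ones(4)
r = payoffs(efforts, qoi, budget=10.0, costs=costs)
```

Note that under this split the reward portions of all payoffs sum exactly to the budget $R^t$, which matches the fixed-budget allocation described in Section II-A.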
II-C Payoff Maximization Problem
With the user's payoff designed, in this paper, we take a user-centric perspective. Our objective is to find a sequential decision $\{a_i^t\}_{t=1}^{T}$ that maximizes user $i$'s total expected discounted payoff $\mathbb{E}\left[\sum_{t=1}^{T} \gamma^{t-1} r_i^t\right]$ during the MCS participation period, where $\gamma \in (0, 1]$ is a predefined discount factor.
If the user's dynamic sensing environment is known or can be predicted, finding $\{a_i^t\}_{t=1}^{T}$ is a sequential decision problem. Intuitively, we can find solutions via a model-based dynamical system, using either an off-the-shelf offline optimization method or predictive control that maximizes (1) subject to the system's dynamical constraints. Based on the available information and participants' interactions, for each MCS participant $i$ we can essentially cast the user's effort level as a policy function mapping from past QoI:
$$a_i^t = \pi_i\left(q_i^{t-1}, q_i^{t-2}, \dots, q_i^{t-W}; \theta_i\right), \qquad (2)$$
where $W$ is the total window length of past QoI taken into consideration, and $\theta_i$ denotes the policy function's parameters.
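Concretely, the policy input in (2) is a sliding window over the participant's own past QoI. A minimal sketch follows; zero-padding at the start of an episode, when fewer than $W$ observations exist, is our assumption:

```python
import numpy as np

def policy_input(qoi_history, t, window):
    """Build the policy input of Eq. (2): the last `window` QoI
    observations before time t, zero-padded when the episode is
    younger than the window (padding scheme is an assumption)."""
    start = max(0, t - window)
    obs = np.asarray(qoi_history[start:t], dtype=float)
    return np.concatenate([np.zeros(window - len(obs)), obs])

x = policy_input([0.2, 0.5, 0.9, 0.4, 0.7], t=5, window=3)  # last 3 QoI values
```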
However, we are faced with two difficulties in applying the aforementioned approaches to find $\{a_i^t\}_{t=1}^{T}$. First, the participant may be situated in a highly stochastic environment. The high dimensionality of the state spaces for both users' effort levels and sensing environments, together with the limited forecast accuracy of future sensing environments, restricts the performance of existing methods. Second, the interactions among MCS participants make it hard to model the system dynamics accurately. For instance, when one agent chooses a certain effort $a_i^t$, this choice not only affects other agents' current payoffs but could also impact all agents' future decisions through the dynamics of the sensing environment. To tackle these difficulties of modeling and decision-making, we take a machine learning approach, which automatically learns to choose effort levels for multiple participants in a stochastic QoI environment.
III. Multi-Agent Reinforcement Learning
To solve the real-time payoff maximization problem for all agents, in this section we first describe the problem setup using MDPs, which have been developed for discrete-time stochastic control processes. Reinforcement learning algorithms, e.g., Q-learning [9], can be applied to solve single-agent MDPs. Yet to solve multi-agent MDPs with a non-stationary sensing environment and only partially observable information (e.g., each participant does not know others' sensing efforts $\{a_j^t\}_{j \neq i}$), we extend the deep reinforcement learning algorithm proposed in [13] and design our MARL algorithm IntelligentCrowd. We then illustrate in Section IV that, under various kinds of dynamic sensing environments, IntelligentCrowd is able to simultaneously find sensing efforts for each participant in an online setting.
III-A Multi-Agent MDPs
In standard MDP models, we are only interested in stochastic decision-making using past actions and system states. The decision on the effort level of MCS participants is a natural extension of MDPs to the multi-agent case. Mathematically, we consider a set of actions $\mathcal{A}_1, \dots, \mathcal{A}_N$ and a set of sensed QoI levels $\mathcal{Q}_1, \dots, \mathcal{Q}_N$. For the overall system, each step's state is simply $\mathbf{q}^t$. Associated with the joint action $\mathbf{a}^t$ of all MCS participants, we also define two functions related to states (QoI) and actions (sensing effort): a) the sensing environment evolves via a state transition function $\mathcal{T}(\mathbf{q}^{t+1} \mid \mathbf{q}^t, \mathbf{a}^t)$, and b) each agent $i$ receives a reward (payoff) $r_i(\mathbf{q}^t, \mathbf{a}^t)$.
A reinforcement learning algorithm aims at automatically learning an optimal policy under the MDP framework. For single-agent MDPs with states, actions, and rewards defined (for simplicity of notation, we omit the subscript $i$ of each variable to indicate the single-agent case), we could use standard Q-learning or train a Deep Q-Network (DQN) [14], which fits a function $Q(s, a)$ that maps a state-action pair to the accumulated reward. To learn such an (action, state)-value function, we learn the Q-table in Q-learning or the Q-network in DQN via a recursive step that minimizes the following loss:
$$L(w) = \left(y^t - Q(s^t, a^t; w)\right)^2, \qquad (3a)$$
$$y^t = r^t + \gamma \max_{a'} Q(s^{t+1}, a'; w). \qquad (3b)$$
Note that we need to take the expectation of (3) when using batch-based algorithms. Once we obtain an accurate approximation of $Q(s, a)$, finding the optimal policy is essentially a network inference step:
$$\pi(s) = \arg\max_{a} Q(s, a). \qquad (4)$$
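For the single-agent case, the recursion in (3) and the inference step in (4) reduce to a few lines with a tabular Q-function. Discretized states/actions and the learning rate `alpha` are standard ingredients assumed here rather than spelled out in the text:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One Q-learning step: move Q(s, a) toward the TD target of (3b),
    which minimizes the squared loss (3a)."""
    target = r + gamma * Q[s_next].max()   # y = r + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (target - Q[s, a])  # gradient step on the TD error
    return Q

def greedy_policy(Q, s):
    """Inference step of (4): pick the action maximizing Q(s, .)."""
    return int(np.argmax(Q[s]))

Q = np.zeros((2, 2))                  # 2 states x 2 actions
q_update(Q, s=0, a=1, r=1.0, s_next=1)
```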
Now we discuss how to extend the single-agent setting to the multi-agent MCS system. To make our algorithm practical for real-world MCS with multiple participants, we make the following assumptions:

During training, each participant can observe both the collective action profile $\mathbf{a}^t$ and the collective reward profile $\mathbf{r}^t$;

During testing or real implementation, each participant can only use its own share of information, such as its observations of QoI (i.e., $\mathbf{q}^t$ is only partially observable);

We do not assume any particular communication algorithm between participants about their sensing strategies;

Each participant may face heterogeneous and stochastic QoI dynamics (as shown in Fig. 2).
One may directly try to extend Q-learning via (3) to the multi-agent setting by finding a separate Q-function $Q_i$ for each agent using its available observations. However, the environment becomes non-stationary from the perspective of any individual MCS participant, since the effort level chosen by one participant affects the payoffs of the other participants. Such change cannot be explained by each participant's own state-action space independently. Thus, the family of Q-learning methods [12, 14] is not able to learn such non-stationary dependencies, since each participant's Q-function is updated independently. Though it is possible to include all agents' decisions as input to the Q-table/network during the training process to help learn the Q-function, this information cannot be included during testing because of the independence assumption on participants' strategies. Moreover, Q-learning is difficult to scale up to continuous actions such as the effort level in our case.
III-B IntelligentCrowd Algorithm
To ease the learning difficulty of using only a single Q-network for each agent, we adopt the actor-critic model proposed in [15, 13]. In the multi-agent version of the actor-critic model, we use two neural networks for each MCS participant, an actor network and a critic network, which co-learn the action policy. As the names suggest, the critic network is similar to the Q-network in DQN, while the actor network replaces the inference step of DQN and directly learns the mapping from state to the optimal effort-level action.
To train both networks together, we utilize the output from the critic. That is, the critic acts as a judge of the policy output by the actor. Using the feedback loss from the critic network, the actor network adjusts its weights to output better effort-level decisions. Once trained, the actor can directly output actions and no longer needs information from the critic network. Thus, during training, we can add effort levels from other agents to help the critic learn the interactions among different agents. Mathematically, we denote the critic's input as $(\mathbf{q}^t, \mathbf{a}^t)$, while the actor's input $o_i^t$ considers the past $W$ steps of historical QoI observations. Since the critic has full observation of each participant's effort decisions during the training process, it faces a stationary environment, whose state transition function is not subject to change with any modification of the policies $\pi_i$.
Similar to the policy gradient update approach [15], we can update the actor network via the gradient of the policy's value return:
$$\nabla_{\theta_i} J(\pi_i) = \mathbb{E}\left[\nabla_{\theta_i} \pi_i(o_i^t)\, \nabla_{a_i} Q_i(\mathbf{q}^t, a_1^t, \dots, a_N^t)\big|_{a_i^t = \pi_i(o_i^t)}\right], \qquad (5a)$$
$$L(w_i) = \mathbb{E}\left[\left(r_i^t + \gamma\, Q_i(\mathbf{q}^{t+1}, a_1^{t+1}, \dots, a_N^{t+1}) - Q_i(\mathbf{q}^t, a_1^t, \dots, a_N^t)\right)^2\right], \qquad (5b)$$
where $w_i$ denotes the parameters of participant $i$'s critic network.
We implement centralized training with decentralized execution. That is, during the training process, each participant is able to make use of extra action information provided by other participants, while during the real implementation, each participant only makes use of its own observations.
We summarize our IntelligentCrowd algorithm in Algorithm 1, which is a user-centric algorithm for MCS participants to find the best sensing efforts in a multi-agent stochastic sensing environment. We also want to highlight that once IntelligentCrowd completes the training stage in a centralized manner for the critic networks, it can be implemented in a distributed manner for each MCS participant. In the algorithm, we take the episode length to be the MCS campaign time $T$. The actor-critic model is trained using fixed-length episode data. We also keep an experience replay buffer $\mathcal{D}$ during training, which stores the tuples from past traces of MCS participants' effort decisions and QoI evolution. We simply use superscripts $t$ and $t+1$ to denote samples from $\mathcal{D}$ along with their next-step values.
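Two bookkeeping pieces of this training scheme, the experience replay buffer and the centralized critic input, can be sketched as follows. The class/function names and the flat concatenation of observations and actions are our assumptions, not the paper's exact implementation:

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience replay D: tuples of (observations, efforts, payoffs,
    next observations) collected over past traces for all participants."""
    def __init__(self, capacity=10000):
        self.buf = deque(maxlen=capacity)  # old samples are evicted first

    def push(self, obs, actions, rewards, next_obs):
        self.buf.append((obs, actions, rewards, next_obs))

    def sample(self, batch_size):
        return random.sample(self.buf, min(batch_size, len(self.buf)))

def critic_input(all_obs, all_actions):
    """Centralized critic input during training: every participant's
    QoI observation window plus every participant's effort level.
    During execution, each actor sees only its own observation window."""
    flat_obs = [x for o in all_obs for x in o]
    return flat_obs + list(all_actions)

buf = ReplayBuffer()
obs = [[0.1, 0.2], [0.3, 0.4]]          # two agents, window of 2
acts, rews = [0.5, 0.7], [1.0, -0.2]
buf.push(obs, acts, rews, obs)
batch = buf.sample(1)
x = critic_input(obs, acts)             # length = 2 * 2 + 2 = 6
```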
IV. Performance Evaluations
To evaluate the performance of our algorithm, we validate IntelligentCrowd on a series of different sensing environments. We also analyze the impact of different parameters of IntelligentCrowd.
IV-A Simulation Setup
IV-A1 Neural Networks Setup
We construct four groups of actor-critic networks to learn the sensing policy over effort levels for four MCS participants. We use two-layer fully connected networks for both the actor and critic networks. We vary the historical window size $W$ during training and testing, where $W$ decides how much historical QoI data are taken into consideration by both networks. Standard neural network modeling techniques, such as batch normalization and pass-through links, are adopted. We train these actor and critic networks until convergence of each agent's episode payoff and keep a fixed number of steps in all simulations.
IV-A2 QoI Dynamics
Due to the mobility of users and the time-varying nature of sensing capabilities, we consider three heterogeneous QoI dynamics, as shown in Fig. 2:

Sine dynamics: Under this setting, the MCS participant faces periodic sensing signals with fixed amplitude and frequency.

Linear dynamics: The MCS participants receive periodic QoI whose strength grows linearly with time.

Markov chain dynamics: We simulate a finite-state-space Markov chain to represent the temporal evolution of QoI.
We make the system dynamics more challenging for IntelligentCrowd by setting different frequencies/amplitudes/transition matrices for MCS participants that share the same type of dynamics. We also allow the sensed signal to be negative (e.g., wrong information), so each MCS participant must learn to avoid such fake information.
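For reference, the three QoI dynamics described above can be simulated along the following lines. The amplitudes, periods, and the example transition matrix are illustrative choices, not the exact parameters used in the simulations:

```python
import numpy as np

def sine_qoi(T, amp=1.0, freq=0.05):
    """Periodic QoI with fixed amplitude and frequency; may go negative."""
    return amp * np.sin(2 * np.pi * freq * np.arange(T))

def linear_qoi(T, slope=0.05, period=20):
    """Periodic QoI whose strength grows linearly with time within a period."""
    return slope * (np.arange(T) % period)

def markov_qoi(T, states, P, seed=0):
    """Finite-state Markov chain over QoI levels with transition matrix P."""
    rng = np.random.default_rng(seed)
    s, out = 0, []
    for _ in range(T):
        out.append(states[s])
        s = rng.choice(len(states), p=P[s])  # sample the next state
    return np.array(out)

q1 = sine_qoi(100)
q2 = linear_qoi(100)
q3 = markov_qoi(100, states=[-0.5, 0.5, 1.0],
                P=[[0.6, 0.3, 0.1], [0.2, 0.6, 0.2], [0.1, 0.3, 0.6]])
```

Note that the sine and Markov dynamics both produce negative QoI values, exercising the participants' ability to avoid fake information.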
IV-B Results and Analysis
In Fig. 4, we show the simulation results when all agents are under the sine dynamics of QoI. We plot the mean episode reward with respect to the training episode, along with the reward variance over multiple runs. All participants exhibit similar learning behaviors with varying window length $W$. During the initial training episodes, all agents lack knowledge of the QoI dynamics and the decision patterns of other agents, so the performance is poor and some participants even get negative payoffs, which implies that MCS participants do not receive sufficient reward from the service provider to compensate for their costs. Around episodes 7-8, all participants start to learn a good strategy for their effort levels and obtain greater payoffs. These payoffs become stable as training goes on, which suggests that the neural network training of the actor and critic is stable. We also observe that as $W$ increases from 10 to 50, all agents learn strategies that achieve higher payoffs. This indicates that IntelligentCrowd performs better with the aid of more information from past observations. However, when increasing $W$ from 50 to 100, the extra historical information does not help much in the decision-making of sensing effort levels, yet it takes more computational resources to train the actor-critic networks due to the increasing dimension of the data input.
Next, we evaluate our algorithm when MCS participants face different kinds of dynamics. In this setting, it is more challenging for MCS participants to make wise choices of effort levels, since heterogeneous dynamics make user interactions more complicated. As shown in Fig. 5, all participants are able to select effort levels that yield positive payoffs by the end of training. Participants 1, 2, and 3, who are under the more "predictable" sine or linear dynamics, achieve positive payoffs around episodes 7-8, whereas agent 4, under the Markov dynamics, only starts to learn to make a positive payoff in later episodes. More interestingly, even though participants 1-3 are making high payoffs around episode 10, participant 4 then starts to make good effort decisions at each time step, winning a portion of the total payoff from the other three participants and achieving higher payoffs in later episodes.
In Table I, we summarize the average accumulated reward for the four MCS participants during an episode under four types of QoI dynamics after training has finished. The mixed dynamics is the same environment we simulate in Fig. 5. We also observe that when we set $W = 50$ in IntelligentCrowd, MCS participants achieve the highest crowdsensing payoffs.
Memory Length $W$    10       30       50       100
Sine                 28.25    25.88    33.52    28.07
Linear               46.41    48.36    50.01    48.79
Markov               35.89    38.88    44.73    42.31
Mixed                22.43    35.75    36.91    27.77
V. Discussion and Conclusion
In this paper, we take the MCS participants' perspective and investigate the problem of determining sensing efforts to maximize the payoff of each individual participant. We first address the difficulties in modeling and decision-making that arise because participants face stochastic sensing environments and complex interactions with other MCS participants. We then develop an online algorithm, IntelligentCrowd, which leverages the power of deep reinforcement learning to efficiently find the best sensing decision for each participant in real time. We validate our IntelligentCrowd algorithm via simulations in various sensing environments.
In future work, we will consider the interactions between the service provider’s mechanism design and MCS participants’ decision making, and implement our ideas on realworld crowdsensing data. We are also interested in exploring the effects of the social interactions among MCS participants.
References
 [1] R. K. Ganti, F. Ye, and H. Lei, “Mobile crowdsensing: current state and future challenges,” IEEE Communications Magazine, vol. 49, no. 11, 2011.
 [2] D. Yang, G. Xue, X. Fang, and J. Tang, “Crowdsourcing to smartphones: incentive mechanism design for mobile phone sensing,” in Proceedings of the 18th annual international conference on Mobile computing and networking. ACM, 2012, pp. 173–184.
 [3] X. Zhang, Z. Yang, Z. Zhou, H. Cai, L. Chen, and X. Li, “Free market of crowdsourcing: Incentive mechanism design for mobile sensing,” IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 12, pp. 3190–3200, 2014.
 [4] Y. Zhan, Y. Xia, and J. Zhang, “Incentive mechanism in platform-centric mobile crowdsensing: A one-to-many bargaining approach,” Computer Networks, vol. 132, pp. 40–52, 2018.
 [5] B. Guo, Y. Liu, W. Wu, Z. Yu, and Q. Han, “ActiveCrowd: A framework for optimized multitask allocation in mobile crowdsensing systems,” IEEE Transactions on Human-Machine Systems, vol. 47, no. 3, pp. 392–403, 2017.
 [6] Y. Liu, B. Guo, Y. Wang, W. Wu, Z. Yu, and D. Zhang, “TaskMe: Multi-task allocation in mobile crowd sensing,” in Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 2016, pp. 403–414.
 [7] M. Xiao, J. Wu, L. Huang, Y. Wang, and C. Liu, “Multi-task assignment for crowdsensing in mobile social networks,” in 2015 IEEE Conference on Computer Communications (INFOCOM). IEEE, 2015, pp. 2227–2235.
 [8] S. He, D.-H. Shin, J. Zhang, and J. Chen, “Toward optimal allocation of location dependent tasks in crowdsensing,” in 2014 Proceedings IEEE INFOCOM. IEEE, 2014, pp. 745–753.
 [9] R. S. Sutton, A. G. Barto et al., Reinforcement learning: An introduction. MIT press, 1998.
 [10] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International Conference on Machine Learning, 2015, pp. 1889–1897.
 [11] H. Wang and B. Zhang, “Energy storage arbitrage in realtime markets via reinforcement learning,” arXiv preprint arXiv:1711.03127, 2017.
 [12] L. Xiao, Y. Li, G. Han, H. Dai, and H. V. Poor, “A secure mobile crowdsensing game with deep reinforcement learning,” IEEE Transactions on Information Forensics and Security, 2017.
 [13] R. Lowe, Y. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” in Advances in Neural Information Processing Systems, 2017, pp. 6379–6390.
 [14] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
 [15] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International conference on machine learning, 2016, pp. 1928–1937.
 [16] M. H. Cheung, F. Hou, and J. Huang, “Make a difference: Diversity-driven social mobile crowdsensing,” in IEEE INFOCOM 2017 - IEEE Conference on Computer Communications. IEEE, 2017, pp. 1–9.