1. Introduction
With the prevalence of smart mobile devices, mobile crowdsensing (MCS) has become a novel sensing mechanism for addressing various urban tasks such as environment and traffic monitoring (Zhang et al., 2014). Traditional MCS mechanisms usually collect a large amount of data to cover almost all the cells (i.e., subareas) of the target area to ensure quality. This requires MCS organizers to recruit many participants (e.g., at least one from every cell for full coverage), leading to a relatively high cost. To reduce this cost while still ensuring a high level of quality, a new MCS paradigm, namely Sparse MCS, has recently been proposed (Wang et al., 2018c, 2016a). Sparse MCS collects data from only a few cells while intelligently inferring the data of the remaining cells with quality guarantees (i.e., the error of the inferred data is lower than a threshold). Hence, compared to traditional mechanisms, MCS organizers’ cost can be reduced since only a few participants need to be recruited, while the task quality is still ensured.
In Sparse MCS, one key issue affecting how much cost can be practically saved is cell selection, i.e., deciding from which cells the organizer collects sensed data from participants (Wang et al., 2016a). To show the importance of cell selection, Figure 1 (left part) gives an illustrative example of two different cell selection cases in a city, which is split into cells. In Case 1.1, all the selected cells are gathered in one corner of the city; in Case 1.2, the collected data is more widely distributed over the whole city. As the data of most sensing tasks has spatial correlations (i.e., nearby cells may have similar data), e.g., air quality (Zheng et al., 2013), the cell selection of Case 1.2 will yield a higher inference quality than Case 1.1. Moreover, an MCS campaign usually lasts for a long time (e.g., sensing once every hour), so that not only spatial correlations but also temporal correlations need to be carefully considered in cell selection. As shown in Figure 1 (right part), sensing the same cells in consecutive cycles (Case 2.1) may not be as efficient as sensing different cells (Case 2.2) with respect to inference quality. As the data of different MCS applications may involve diverse spatiotemporal correlations, determining the proper cell selection strategy is a nontrivial task.
Existing works on Sparse MCS mainly leverage Query-By-Committee (QBC) (Wang et al., 2015, 2018c) in cell selection. QBC first uses various inference algorithms to deduce the data of all the unsensed cells, and then chooses the cell where the data inferred by the various algorithms have the largest variance as the next cell for sensing. Briefly, the cell selection criterion of QBC is to choose the cell which is the most uncertain according to a committee of inference algorithms, i.e., the cell that is hard to infer. While QBC has shown its effectiveness in some scenarios (Wang et al., 2015, 2018c), it does not directly optimize the objective of Sparse MCS, i.e., minimizing the number of sensed cells under a quality guarantee. In fact, the existing works using QBC also recognize that its performance is still far from that of the optimal cell selection strategy (Wang et al., 2015). Note that the optimal cell selection strategy is impractical, as it needs to know the ground truth data of each cell in advance, which is impossible in reality (Wang et al., 2015). To reduce this performance gap from the optimal strategy, our research question here is: can we find a better cell selection strategy in Sparse MCS, which directly minimizes the number of selected cells under the inference quality guarantee?
To this end, in this paper, we design a new cell selection framework for Sparse MCS, called DRCell, with Deep Reinforcement learning. In recent years, deep reinforcement learning has shown its success in decision making problems in diverse areas such as robot control (Gu et al., 2017) and game playing (Silver et al., 2017; Mnih et al., 2015). In general, deep reinforcement learning can benefit a large set of decision making problems which can be abstracted as ‘an agent needs to decide the action under a certain state’. Our cell selection problem can be interpreted as ‘an MCS server (agent) needs to choose the next cell for sensing (action) considering the data already collected (state)’. In this regard, it is promising to apply deep reinforcement learning to the cell selection problem in Sparse MCS.
To effectively employ deep reinforcement learning in cell selection, we still face several issues.

The first issue is how to mathematically model the state, action, and reward, which are key concepts in reinforcement learning (Sutton and Barto, 2005). Briefly speaking, reinforcement learning attempts to learn a Q-function which takes the current state as input, and generates a reward score for each possible action as output. Then, we can take the action with the highest reward score as our decision. Only if we model state, action, and reward properly can we generate a cell selection policy that minimizes the number of cells selected under the quality requirement.

The second issue is how to learn the Q-function. Traditional Q-learning techniques in reinforcement learning work well in scenarios where the state and action spaces are small (i.e., the number of states and actions is limited). However, in Sparse MCS, the state space is actually quite large. For example, suppose there are 100 cells (subareas) in the target sensing area; then, even if we only consider the current cycle, the number of possible states grows up to $2^{100}$ (whether each cell is sensed by participants or not). To overcome the difficulty of the large state space, we propose to leverage deep learning along with reinforcement learning, i.e., deep reinforcement learning, to learn the Q-function for our cell selection problem.

The last issue is training data scarcity. Usually, deep reinforcement learning requires a lot of training data (i.e., known state, action, and reward) to learn the Q-function. In areas such as robot control or game playing, a robot or a computer can continuously run for data collection until the training performance is good. However, in MCS, we cannot have an unlimited amount of data for training. Then, how to address the training data scarcity issue, at least partially, should also be considered in our cell selection problem.
In summary, this work has the following contributions:
(1) To the best of our knowledge, this work is the first attempt to leverage deep reinforcement learning to address cell selection, a critical problem in Sparse MCS.
(2) We propose DRCell to select the best cell for obtaining the sensed data in Sparse MCS. More specifically, we model the state with one-hot encoding, and the reward following the inference quality requirement of Sparse MCS. Then, considering the spatiotemporal correlations hidden in the sensed data, we propose a recurrent deep neural network structure to learn the reward output from the inputs of state and action. Finally, to relieve the dependence on a large amount of training data, we propose a transfer learning algorithm between heterogeneous sensing tasks in the same target area, so that the decision function learned on one task can be efficiently transferred to another task with only a little training data.
(3) Experiments on real data of sensing tasks including temperature, humidity, and air quality monitoring have verified the effectiveness of DRCell. In particular, DRCell can outperform the state-of-the-art mechanism QBC by reducing up to 15% of cells while guaranteeing the same quality in Sparse MCS.
2. Related Work
2.1. Sparse Mobile Crowdsensing
MCS has been proposed to utilize widespread crowds to perform large-scale sensing tasks (Zhang et al., 2014; Ganti et al., 2011; Guo et al., 2015). In practice, to minimize sensing cost while ensuring data quality, some MCS tasks involve inference algorithms to fill in the missing data of unsensed cells, such as noise sensing (Rana et al., 2010), traffic monitoring (Zhu et al., 2013), and air quality sensing (Wang et al., 2015). It is worth noting that in such MCS tasks, compressive sensing (Candès and Recht, 2009; Donoho, 2006) has become the de facto choice of inference algorithm (Rana et al., 2010; Zhu et al., 2013; Wang et al., 2015; Xu et al., 2015; Wang et al., 2018c). Recently, by extracting the common research issues of such inference-based tasks, Wang et al. (Wang et al., 2016a) proposed a new MCS paradigm, called Sparse MCS. Besides the inference algorithm, Sparse MCS also abstracts other critical research issues such as cell selection and quality assessment. Later, a privacy protection mechanism was also added to Sparse MCS (Wang et al., 2016b). In this paper, we focus on the cell selection issue and aim to use deep reinforcement learning techniques to address it.
2.2. Deep Reinforcement Learning
Reinforcement Learning (RL) (Sutton and Barto, 2005) is concerned with how to map states to actions so as to maximize the cumulative reward. It uses rewards to guide the agent toward better sequential decisions, and has substantive and fruitful interactions with other engineering and scientific disciplines. Recently, many researchers have focused on combining deep learning with reinforcement learning to solve concrete problems in the sciences, business, and other areas. Mnih et al. (Mnih et al., 2013) propose the first deep reinforcement learning model (DQN) to deal successfully with high-dimensional sensory input, and apply it to play seven Atari 2600 games. More recently, Silver et al. (Silver et al., 2016) apply deep reinforcement learning and present AlphaGo, which was the first program to defeat world-class players in Go. Moreover, to deal with partially observable states, Hausknecht and Stone (Hausknecht and Stone, 2015) introduce a deep recurrent Q-network (DRQN), which uses a Long Short-Term Memory (LSTM) network, and apply it to play Atari 2600 games. Lample and Chaplot (Lample et al., 2016) even use DRQN to play FPS games.
While deep reinforcement learning has already been used in a variety of areas, such as object recognition (Ba et al., 2014), robot control (Levine et al., 2015), and communication protocols (Foerster et al., 2016), MCS researchers began to apply it only very recently. Xiao et al. (Xiao et al., 2017a) formulate the interactions between a server and vehicles as a vehicular crowdsensing game. They then propose Q-learning based strategies to help the server and vehicles make the optimal decisions for the dynamic game. Moreover, Xiao et al. (Xiao et al., 2017b) apply DQN to derive the optimal policy for the Stackelberg game between an MCS server and a number of smartphone users. As far as we know, this paper is the first attempt to use deep reinforcement learning for cell selection in Sparse MCS, so as to reduce MCS organizers’ data collection costs while still guaranteeing the data quality.
3. Problem Formulation
We first define several key concepts, and then mathematically formulate the cell selection problem in Sparse MCS. Finally, we use a running example to explain our problem in more detail.
Definition 1. Sensing Area. We suppose that the target sensing area can be split into a set of cells (e.g., grids (Zheng et al., 2013; Wang et al., 2018c)). The objective of a sensing task is to get a certain type of data (e.g., temperature, air quality) of all the cells in the target area.
Definition 2. Sensing Cycle. We suppose the sensing tasks can be split into equal-length cycles, and the cycle length is determined by the MCS organizers according to their requirements (Xiong et al., 2015; Wang et al., 2018c). For example, if an organizer wants to update the data of the target sensing area every hour, then the cycle length can be set to one hour.
Definition 3. Ground Truth Data Matrix. Suppose we have $m$ cells and $n$ cycles; then, for a certain sensing task, the ground truth data matrix is denoted $D_{m \times n}$, where $D[i, j]$ is the true data in cell $i$ at cycle $j$.
Definition 4. Cell Selection Matrix. In Sparse MCS, we only select some of the cells in each cycle for data collection, while inferring the data of the remaining cells. The cell selection matrix, denoted $S_{m \times n}$, marks the cell selection results. $S[i, j] = 1$ means that cell $i$ is selected at cycle $j$ for data collection; otherwise, $S[i, j] = 0$.
Definition 5. Inferred Data Matrix. In Sparse MCS, when an organizer decides not to collect any more data in the current cycle, the data of the unsensed cells will then be inferred. We denote the inferred data of the $k$-th cycle as $\hat{D}[:, k]$, and thus the inferred data of all the cycles as a matrix $\hat{D}_{m \times n}$. Note that in Sparse MCS, compressive sensing is the de facto choice of inference algorithm nowadays (Rana et al., 2010; Zhu et al., 2013; Wang et al., 2015; Xu et al., 2015; Wang et al., 2018c), and we also use it in this work.
Definition 6. $(\epsilon, p)$-quality (Wang et al., 2018c). In Sparse MCS, the quality guarantee is called $(\epsilon, p)$-quality, meaning that in $p \cdot 100\%$ of cycles, the inference error (e.g., mean absolute error) is not larger than $\epsilon$. Formally,
(1) $\left| \{ k \mid \mathrm{error}(D[:, k], \hat{D}[:, k]) \le \epsilon,\ 1 \le k \le n \} \right| \ge n \cdot p$
where $n$ is the total number of sensing cycles.
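As a minimal sketch of Definition 6, the check below counts the cycles whose inference error stays within the bound and compares that count with the required fraction. The function names and the list-of-lists layout (one list of per-cell values per cycle) are illustrative assumptions, and mean absolute error is used only because it is the example error metric named above.

```python
from typing import Sequence


def mean_absolute_error(truth: Sequence[float], inferred: Sequence[float]) -> float:
    """Mean absolute error over the cells of a single cycle."""
    return sum(abs(t - x) for t, x in zip(truth, inferred)) / len(truth)


def satisfies_quality(truth_cycles: Sequence[Sequence[float]],
                      inferred_cycles: Sequence[Sequence[float]],
                      eps: float, p: float) -> bool:
    """Return True if at least a fraction p of the cycles has error <= eps."""
    n = len(truth_cycles)
    good = sum(
        1 for t, x in zip(truth_cycles, inferred_cycles)
        if mean_absolute_error(t, x) <= eps
    )
    return good >= p * n


# Hypothetical data: four cycles, two cells each; only the last cycle is off.
truth = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
inferred = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 10.0]]
ok_075 = satisfies_quality(truth, inferred, eps=0.5, p=0.75)
ok_090 = satisfies_quality(truth, inferred, eps=0.5, p=0.9)
```

This check needs the ground truth, so it only applies offline (for example on training data); during a live task the probability has to be estimated instead.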
Note that in practice, since we do not know the ground truth data matrix $D$, we also cannot know with 100% confidence whether the inference error is smaller than $\epsilon$ in the current cycle. This is why we include $p$ in the quality requirement, as it is impossible to ensure that the error is less than $\epsilon$ in 100% of cycles. To ensure $(\epsilon, p)$-quality, a quality assessment method is needed in Sparse MCS to estimate the probability that the error is less than $\epsilon$ for the current cycle. If the estimated probability is larger than $p$, then the current cycle satisfies $(\epsilon, p)$-quality and no more data will be collected (we then move to the next sensing cycle). In Sparse MCS, a leave-one-out based Bayesian inference method is often leveraged for quality assessment (Wang et al., 2018c, 2015, 2016a), and we also use it in this work.
Problem [Cell Selection]: Given a Sparse MCS task with $m$ cells and $n$ cycles, using compressive sensing as the data inference method and leave-one-out based Bayesian inference as the quality assessment method, we aim to select a minimal subset of sensing cells during the whole sensing process (i.e., minimize the number of nonzero entries in the cell selection matrix $S$), while satisfying $(\epsilon, p)$-quality:
$\min \sum_{i=1}^{m} \sum_{j=1}^{n} S[i, j]$, subject to $(\epsilon, p)$-quality being satisfied.
We now use a running example to illustrate our problem in more detail, as shown in Figure 2. (1) Suppose we have five cells and the current cycle is the 5th cycle; (2) we select cell 3 for collecting data, and then assess whether the current cycle satisfies $(\epsilon, p)$-quality; (3) as we find that the quality requirement is not satisfied, we continue by collecting data from cell 5; (4) the quality requirement is now satisfied, so data collection is terminated for the current cycle, and the data of the unsensed cells is inferred. In this example, we see that after five cycles, there are 11 data submissions from participants in total. The objective of our cell selection problem is exactly to minimize the number of data submissions.
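The per-cycle procedure in the running example can be sketched as a simple loop. This is only an illustration: `choose_cell` and `quality_ok` are hypothetical callables standing in for the cell selection strategy and the quality assessment method, neither of which is implemented here.

```python
from typing import Callable, List, Set


def run_cycle(num_cells: int,
              choose_cell: Callable[[Set[int]], int],
              quality_ok: Callable[[Set[int]], bool]) -> List[int]:
    """Select cells one by one until the (assumed) quality check passes."""
    selected: List[int] = []
    while len(selected) < num_cells:
        cell = choose_cell(set(selected))
        if cell in selected:
            raise ValueError("a cell may be sensed at most once per cycle")
        selected.append(cell)
        if quality_ok(set(selected)):
            break  # quality satisfied: stop collecting, infer the rest
    return selected


# Mimic the running example: cell 3 first, then cell 5, after which the
# (hypothetical) quality check is satisfied.
order = iter([3, 5, 1])
picked = run_cycle(
    num_cells=5,
    choose_cell=lambda chosen: next(order),
    quality_ok=lambda chosen: {3, 5} <= chosen,
)
# picked == [3, 5], i.e., two data submissions in this cycle
```

Summing `len(picked)` over all cycles gives the number of data submissions that the cell selection problem seeks to minimize.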
4. Methodology
In this section, we propose a novel mechanism, called DRCell, to address the cell selection problem with deep reinforcement learning. First, we mathematically model the state, reward, and action used in DRCell. Then, with a simplified MCS task example (i.e., there are only a few cells in the target area), we explain how traditional reinforcement learning finds the most appropriate cell for sensing based on our state, reward, and action modeling. Afterward, we elaborate how deep learning can be combined with reinforcement learning (i.e., deep reinforcement learning) to handle more realistic cases of cell selection where the target area can include a large number of cells. Finally, we describe how transfer learning can help us generate a cell selection strategy with only a little training data under some specific conditions.
4.1. Modeling state, action, and reward
To apply deep reinforcement learning to cell selection, we need to model the key concepts of state, action, and reward. Figure 3 illustrates the relationship between the three key concepts in DRCell. Briefly speaking, in DRCell, based on the current data collection state, we need to learn a Q-function (elaborated in the next few subsections), which outputs a reward score for each possible action. The action in cell selection is choosing which cell to sense next, while the reward indicates how good a certain action is. If an action (i.e., a cell) gets a higher reward score, it may be a better choice. Next we formally model the three concepts.
(1) State represents the current data collection condition of the MCS task. In Sparse MCS, the cell selection matrix (Definition 4) naturally models the state, as it records both where and when we have collected data from the target sensing area during the whole task. In practice, we can keep just the recent $k$ cycles of the cell selection matrix as the state, denoted as $S = [s_{-k+1}, \ldots, s_{-1}, s_0]$, where $s_0$ represents the cell selection vector of the current cycle (1 means selected and 0 means not), $s_{-1}$ represents the last cycle, and so on. Figure 4 shows an example of how we encode the current data collection condition into the state model if the recent two cycles are considered. Note that we use $\mathcal{S}$ to denote the whole set of states. As an example, suppose that we consider the recent two cycles and there are five cells in total in the target area; then the number of possible states is $|\mathcal{S}| = 2^{2 \times 5} = 1024$.
(2) Action means all the possible decisions that we may make in cell selection. Suppose there are $m$ cells in total in the target sensing area; then our next selected cell can have $m$ choices, leading to the whole action set $\mathcal{A} = \{a_1, a_2, \ldots, a_m\}$. Note that while in practice we will not select one cell more than once in one cycle, to make the action set consistent under different states, we assume that the possible action set is always the complete set of all the cells under any state. More specifically, if some cells have already been selected in the current cycle, then the probability of choosing these cells is zero.
(3) Reward is used to indicate how good an action is. In each sensing cycle, we select actions one by one until the selected cells satisfy the quality requirement in the current cycle (i.e., inference error less than $\epsilon$). (Note: when running Sparse MCS, we have to set a probability $p$ in the quality requirement, i.e., $(\epsilon, p)$-quality, as we do not know the ground truth data of unsensed cells. However, in the training stage of the cell selection policy, we assume that we have obtained the data of all the cells in the target area for some time (e.g., 1 day), and thus we can directly compute the inference error. More details on the training stage are described in the evaluation section.) Satisfying this quality requirement is the goal of cell selection and should be reflected in the reward modeling. Hence, a positive reward, denoted by $R$, is given to an action (i.e., a cell) under a state if the quality requirement is satisfied in the current cycle after the action is taken. In addition, as selecting participants to collect data incurs cost, we also put a negative score $-c$ in the reward modeling of an action. Then, the reward can be written as $q \cdot R - c$, in which $q \in \{0, 1\}$ indicates whether the action makes the current cycle satisfy the inference quality requirement.
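The three concepts above can be sketched in a few lines. This is a sketch under stated assumptions rather than the exact encoding used by DRCell: cells are numbered from 1, the state is stored as a list of binary rows with the oldest cycle first and the current cycle last, and the names `encode_state`, `valid_actions`, and `reward` are illustrative.

```python
from typing import List, Sequence


def encode_state(recent_cycles: Sequence[Sequence[int]], num_cells: int) -> List[List[int]]:
    """Build a k x m binary state; each row marks the cells sensed in one cycle.

    recent_cycles lists the selected cell ids (1-based) per cycle,
    oldest cycle first and current cycle last.
    """
    state = []
    for cells in recent_cycles:
        row = [0] * num_cells
        for c in cells:
            row[c - 1] = 1
        state.append(row)
    return state


def valid_actions(state: List[List[int]]) -> List[int]:
    """Cells not yet selected in the current cycle (the last row)."""
    return [i + 1 for i, used in enumerate(state[-1]) if used == 0]


def reward(quality_satisfied: bool, bonus: float, cost: float) -> float:
    """Reward q * R - c, with q = 1 only if the quality requirement is met."""
    return (bonus if quality_satisfied else 0.0) - cost


# Five cells, two recent cycles: cells 1 and 4 sensed last cycle, cell 3 now.
state = encode_state([[1, 4], [3]], num_cells=5)
# state == [[1, 0, 0, 1, 0], [0, 0, 1, 0, 0]]
```

Keeping the full action set and masking already-selected cells, as `valid_actions` does, is one simple way to realize the "probability zero" convention described for actions.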
With the above modeling, we then need to learn the Q-function (see Figure 3), which outputs the reward score of every possible action under a certain state. In the next subsection, we first use a traditional reinforcement learning method, tabular Q-learning, to illustrate a simplified case where a small number of cells exist in the target sensing area.
4.2. Training Q-function with Tabular Q-Learning
In traditional reinforcement learning, a widely used strategy to obtain the Q-function is tabular Q-learning. In this method, the Q-function is represented by a Q-table, denoted as $Q$. Each element of the Q-table, $Q[S, A]$, represents the reward score of a certain action $A$ under a certain state $S$. The objective of learning the Q-function is then equivalent to filling in all the elements of the Q-table.
The tabular Q-learning algorithm is shown in Algorithm 1. Under the current state $S$, the algorithm selects the action $A$ which has the maximum value of $Q[S, A]$ (in fact, the best action is not always selected, as elaborated later). After the action has been conducted, i.e., the cell has been selected and its data has been collected, the current state $S$ changes to the next state $S'$. Note that if the current cycle satisfies the quality requirement (i.e., inference error less than $\epsilon$), then the next state shifts to a new cycle. For the selected action, we get the real reward $r$, considering whether the inference quality requirement of the current cycle is satisfied, and then update the Q-table according to the following equations:
(2) $Q[S, A] \leftarrow (1 - \alpha)\, Q[S, A] + \alpha \left( r + \gamma V(S') \right)$
(3) $V(S') = \max_{A' \in \mathcal{A}} Q[S', A']$
where $V(S')$ provides the highest expected reward score of the next state (i.e., the reward of the best action under the next state $S'$); $\gamma$ is the discount factor indicating the myopic view of Q-learning regarding the future reward; $\alpha$ is the learning rate.
Moreover, during the training stage, if we always select the action with the largest reward score in the Q-table under a certain state, the algorithm may get stuck in a local optimum. To address this issue, we need to explore during training, i.e., sometimes try actions other than the best one. We thus use the $\delta$-greedy algorithm for selection (the standard $\epsilon$-greedy scheme, written with $\delta$ here because $\epsilon$ denotes the error bound). More specifically, under a certain state, we select the best action according to the Q-table with probability $1 - \delta$ and randomly select one of the other actions with probability $\delta$. Following the existing literature, at the beginning of training we set a relatively large $\delta$ so that we explore more; then, as training proceeds, we gradually reduce $\delta$ until the Q-table has converged, at which point Algorithm 1 terminates.
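The exploration rule described above can be sketched as follows. The exponential decay schedule and its default values are assumptions made for illustration (the text only says the exploration probability starts large and is gradually reduced), and `delta_greedy` and `decayed_delta` are hypothetical names.

```python
import random
from typing import Dict, List


def delta_greedy(q_row: Dict[int, float], allowed: List[int],
                 delta: float, rng: random.Random) -> int:
    """Pick the best allowed action with probability 1 - delta,
    otherwise one of the other allowed actions uniformly at random."""
    best = max(allowed, key=lambda a: q_row.get(a, 0.0))
    others = [a for a in allowed if a != best]
    if others and rng.random() < delta:
        return rng.choice(others)  # explore
    return best                    # exploit


def decayed_delta(step: int, start: float = 1.0, end: float = 0.05,
                  decay: float = 0.99) -> float:
    """Assumed schedule: exponential decay from start down to a floor end."""
    return max(end, start * decay ** step)


rng = random.Random(0)
q_row = {1: 0.0, 3: 2.0, 5: 1.0}      # scores of the allowed actions
greedy_pick = delta_greedy(q_row, [1, 3, 5], delta=0.0, rng=rng)
explore_pick = delta_greedy(q_row, [1, 3, 5], delta=1.0, rng=rng)
```

With `delta=0.0` the rule is purely greedy; with `delta=1.0` it never returns the current best action, which matches the "one of the other actions" wording.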
Figure 5 illustrates an example of using tabular Q-learning to train the Q-function. For simplicity, we set the discount factor $\gamma$ to 1 and the learning rate $\alpha$ to 1. Here, we suppose that there are five cells in the target area, and we only consider the two most recent cycles: the last and the current one. Hence, the state has the dimension of $2 \times 5$, as shown by the states $S_0$, $S_1$, and $S_2$ in Figure 5. The value 1 means that the cell has been selected and 0 means not. First, we initialize the table: all the values in the Q-table are set to 0. When we first meet some state, e.g., $S_0$, the scores of all the actions in the Q-table under $S_0$ are 0. We then randomly select one action since all the values are equal. If we choose the action $a_3$ (select cell 3), the state turns to $S_1$. Then we update $Q[S_0, a_3]$ as the current reward score plus the maximum score of the next state (i.e., the future reward). The current reward is $-1$ since the current cycle cannot satisfy the quality requirement ($c = 1$ in the example). The maximum score for the state $S_1$ is 0 in the Q-table. Hence, we get $Q[S_0, a_3] = -1 + 0 = -1$. Similarly, under $S_1$, we choose $a_5$. If these selections satisfy the quality requirement, the current reward is $R - c = 4$ ($R$ is set to 5, i.e., the total number of cells). Also, the maximum possible reward of the next state $S_2$ is 0 in the current Q-table. Then we update $Q[S_1, a_5] = 4 + 0 = 4$. After some rounds, we have met $S_0$ many times, and the other actions may have turned out not to be good, so the Q-table changes accordingly (see Figure 5). This time, under $S_0$, we check the Q-table and find that $a_3$ has the largest value, so we choose $a_3$ and move to $S_1$. Then, we update $Q[S_0, a_3] = -1 + 4 = 3$, since the maximum reward score of the next state $S_1$ is 4. Therefore, the next time we meet $S_0$, we would probably choose the action $a_3$, since it has the largest reward score.
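The arithmetic of this example can be replayed with a small tabular update, assuming the standard Q-learning rule with learning rate 1 and discount factor 1, a positive reward of 5 (the number of cells), and a cost of 1. The state labels `"S0"`, `"S1"`, `"S2"` and the function name `q_update` are illustrative placeholders, not identifiers from the original algorithm listing.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, List, Tuple

QTable = DefaultDict[Tuple[Hashable, int], float]


def q_update(q: QTable, state: Hashable, action: int, reward: float,
             next_state: Hashable, actions: List[int],
             alpha: float = 1.0, gamma: float = 1.0) -> float:
    """One tabular Q-learning step: blend the old score with r + gamma * V(S')."""
    v_next = max(q[(next_state, a)] for a in actions)  # best score in S'
    q[(state, action)] = ((1 - alpha) * q[(state, action)]
                          + alpha * (reward + gamma * v_next))
    return q[(state, action)]


actions = [1, 2, 3, 4, 5]          # one action per cell
q: QTable = defaultdict(float)     # all Q-table entries start at 0
R, c = 5.0, 1.0                    # assumed bonus and per-selection cost

first = q_update(q, "S0", 3, -c, "S1", actions)       # quality not met yet
second = q_update(q, "S1", 5, R - c, "S2", actions)   # quality met
third = q_update(q, "S0", 3, -c, "S1", actions)       # S1 is now worth 4
# first == -1.0, second == 4.0, third == 3.0
```

The third call shows how the value of a later, successful selection propagates back to the earlier state-action pair.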
While tabular Q-learning can work well for an MCS task in a target area including a small number of cells, as shown in the above example, practical MCS tasks may involve a large number of cells. Suppose there are $m$ cells in the target area and we want to consider the recent $k$ cycles to model states; then the state space becomes extremely large, $|\mathcal{S}| = 2^{km}$, which is intractable in practice. To overcome this difficulty, in the next subsection we propose to leverage deep learning with reinforcement learning to train the decision function for cell selection in Sparse MCS.
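To make the growth concrete, the short computation below counts binary states for a given number of cells and recent cycles, assuming each cell in each cycle is independently either sensed or not. The helper name `num_states` is illustrative.

```python
def num_states(num_cells: int, num_cycles: int) -> int:
    """Number of binary cell selection patterns over the recent cycles."""
    return 2 ** (num_cells * num_cycles)


small = num_states(num_cells=5, num_cycles=2)      # the toy example: 1024
large = num_states(num_cells=100, num_cycles=1)    # 100 cells, one cycle
digits = len(str(large))                           # decimal digits of 2**100
```

Even for a single cycle with 100 cells, the count has 31 decimal digits, so a table with one row per state cannot be stored or filled.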
4.3. Training Q-function with Deep Recurrent Q-Network
To overcome the problem incurred by the extremely large state space in cell selection, we turn to the Deep Q-Network (DQN), which combines Q-learning with deep neural networks. The difference between DQN and tabular Q-learning is that a deep neural network replaces the Q-table to deal with the curse of dimensionality. In DQN, we do not need Q-table lookups, but calculate $Q(S, A)$ for each state-action pair. More specifically, the DQN takes the current state and action as input, and uses a deep neural network to obtain an estimated value of $Q(S, A)$, shown as
(4) $Q(S, A) = \mathbb{E}\left[ r + \gamma \max_{A'} Q(S', A') \mid S, A \right]$
For each selection, we use the neural network parameterized by $\theta$ to calculate the Q-function and select the state-action pair with the largest reward score, also called the Q-value. Note that the $\delta$-greedy algorithm is also used in DQN to balance exploration and exploitation.
To obtain an estimate of the Q-value which approximates the expected one in (4), our proposed DQN uses the experience replay technique. After one selection, we obtain the experience at the current time step $t$, denoted as $e_t = (S_t, A_t, r_t, S_{t+1})$, and the memory pool is $\mathcal{D} = \{e_1, \ldots, e_t\}$. Then, DQN randomly chooses part of the experiences to learn from and update the network parameters $\theta$. The goal is to calculate the best $\theta$ to obtain $Q(S, A; \theta) \approx Q(S, A)$. The stochastic gradient algorithm is applied with the learning rate $\alpha$, and the loss function is defined as follows,
(5) $L(\theta) = \mathbb{E}\left[ \left( r + \gamma \max_{A'} Q(S', A'; \theta) - Q(S, A; \theta) \right)^2 \right]$
Thus, treating the target term as fixed,
(6) $\nabla_\theta L(\theta) = -2\, \mathbb{E}\left[ \left( r + \gamma \max_{A'} Q(S', A'; \theta) - Q(S, A; \theta) \right) \nabla_\theta Q(S, A; \theta) \right]$
For each update, DQN randomly chooses part of the experiences from $\mathcal{D}$, then calculates $\nabla_\theta L(\theta)$ and updates the network parameters $\theta$. Moreover, to avoid oscillations (i.e., the Q-function changing too rapidly in training), we apply the fixed Q-targets technique. More specifically, we do not always use the latest network parameters $\theta$ to calculate the maximum possible reward of the next state (i.e., $\max_{A'} Q(S', A'; \theta)$), but instead use a separate copy $\theta^-$ that is set to $\theta$ only every few iterations, i.e.,
(7) $L(\theta) = \mathbb{E}\left[ \left( r + \gamma \max_{A'} Q(S', A'; \theta^-) - Q(S, A; \theta) \right)^2 \right]$
The DQN learning algorithm is summarized in Algorithm 2.
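The two stabilizing techniques mentioned above, experience replay and fixed Q-targets, can be sketched without committing to any particular neural network library. Everything named here is an assumption for illustration: `ReplayMemory`, `td_targets`, and `sync_target` are hypothetical helpers, `target_q` stands in for the target network (any callable returning one score per action), and parameters are represented as a plain dictionary.

```python
import copy
import random
from collections import deque
from typing import Callable, Deque, Dict, List, Sequence, Tuple

# One experience (state, action, reward, next_state); states are tuples here.
Experience = Tuple[tuple, int, float, tuple]


class ReplayMemory:
    """Fixed-capacity memory pool; the oldest experiences are dropped first."""

    def __init__(self, capacity: int) -> None:
        self.buffer: Deque[Experience] = deque(maxlen=capacity)

    def push(self, experience: Experience) -> None:
        self.buffer.append(experience)

    def sample(self, batch_size: int, rng: random.Random) -> List[Experience]:
        # Uniform sampling without replacement from the stored experiences.
        k = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), k)


def td_targets(batch: Sequence[Experience],
               target_q: Callable[[tuple], Sequence[float]],
               gamma: float) -> List[float]:
    """Targets r + gamma * max_a' Q(S', a') computed with the frozen network."""
    return [r + gamma * max(target_q(s_next)) for (_s, _a, r, s_next) in batch]


def sync_target(step: int, sync_every: int,
                online: Dict[str, list], target: Dict[str, list]) -> bool:
    """Copy the online parameters into the target copy every sync_every steps."""
    if step % sync_every != 0:
        return False
    target.clear()
    target.update(copy.deepcopy(online))
    return True
```

In a training loop one would push each new experience, sample a batch, compute the targets with the frozen parameters, take a gradient step on the online parameters, and call `sync_target` once per step.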
In DQN, the design of the network structure also impacts the effectiveness of the learned Q-function. One common way is to use dense layers to connect the input (state) and the output (a reward score vector over all possible actions). However, temporal correlations exist in our state $S$, and dense layers cannot capture the temporal pattern well. We thus propose to use LSTM (Long Short-Term Memory) layers rather than dense layers in DQN so as to capture the temporal patterns in our state; this architecture is also called a Deep Recurrent Q-Network (DRQN) (Hausknecht and Stone, 2015). More specifically, in DRQN, the Q-function can be defined as
(8) $Q(o_t, h_{t-1}; \theta)$
where $o_t$ represents the observation at time step $t$ (i.e., the cell selection vector at $t$), and $h_{t-1}$ is the extra input returned by the LSTM network from the previous time step $t-1$. In our cell selection problem, a state $S$ can be divided into $k$ time steps of observations, which can then be used as inputs of the DRQN for learning the Q-function.
4.4. Reducing Training Data by Transfer Learning
With deep reinforcement learning, we can obtain the Q-function that outputs reward scores for all the possible actions under a certain state, and then we can choose the cell that has the largest score in cell selection. However, the Q-function learning algorithm described in the previous sections may need a large amount of training data, which also incurs collection cost for MCS organizers, since an organizer needs to conduct a preliminary study on the target sensing area to collect the data from every cell for a short time. Then, can we reduce the amount of training data under certain circumstances?
In reality, many types of data have inter-data correlations, e.g., temperature and humidity (Wang et al., 2018c). Then, if there are multiple correlated sensing tasks in a target area, the cell selection strategy learned for one task can probably benefit another task. With this intuition, we present a transfer learning method for learning the Q-function of an MCS task (target task) with the help of the cell selection strategy learned from another correlated task (source task). We assume that the source task has adequate training data, while the target task has only a little training data. Inspired by the fine-tuning techniques widely used in image processing with deep neural networks, for training the Q-function of the target task, we initialize the parameters of its DRQN to the parameter values of the source task DRQN (learned from the adequate training data of the source task). Then, we use the limited amount of training data of the target task to continue the DRQN learning process (Algorithm 2). In this way, we can reduce the amount of training data required for obtaining a good cell selection strategy for the target task.
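The fine-tuning idea can be sketched independently of the network library: copy the source task's parameters, then continue training on the target task's small dataset. The dictionary-of-lists parameter format, the names `transfer_init` and `fine_tune`, and the `train_step` callable are all assumptions for illustration; in a real system `train_step` would be one iteration of the DRQN learning procedure.

```python
import copy
from typing import Callable, Dict, Iterable, List

Params = Dict[str, List[float]]


def transfer_init(source_params: Params) -> Params:
    """Start the target network from a deep copy of the source parameters,
    so that later updates do not modify the source model."""
    return copy.deepcopy(source_params)


def fine_tune(params: Params, target_batches: Iterable[object],
              train_step: Callable[[Params, object], Params]) -> Params:
    """Continue training on the (small) target-task data, batch by batch."""
    for batch in target_batches:
        params = train_step(params, batch)
    return params


# Toy stand-in for a training step: nudge every weight by the batch value.
def toy_step(params: Params, batch: float) -> Params:
    return {name: [w + batch for w in ws] for name, ws in params.items()}


source = {"lstm": [0.5, -0.25], "out": [1.0]}
target = fine_tune(transfer_init(source), [0.5, 0.25], toy_step)
# target == {"lstm": [1.25, 0.5], "out": [1.75]}; source is unchanged
```

The deep copy matters when the source model is still in use for its own task, since in-place updates would otherwise change both models.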
5. Evaluation
In this section, we conduct extensive experiments based on two real-life datasets, which include various types of sensed data in representative MCS applications, such as temperature, humidity, and air quality.
5.1. Datasets
As in the previous Sparse MCS literature (Wang et al., 2018c, 2015), we adopt two real-life datasets, SensorScope (Ingelrest et al., 2010) and UAir (Zheng et al., 2013), to evaluate the performance of our proposed cell selection algorithm DRCell. These two datasets contain various types of sensed data, including temperature, humidity, and air quality. The detailed statistics of the two datasets are listed in Table 1. Although these sensed data are collected from sensor networks or static stations, mobile devices can also be used to obtain them (as in (Devarakonda et al., 2013; Hasenfratz et al., 2012)). We can thus treat them as data sensed by smartphones and use these datasets in our experiments to show the effectiveness of DRCell.
Table 1. Statistics of the two datasets.
Datasets | SensorScope | UAir
City | Lausanne (Switzerland) | Beijing (China)
Data | temperature, humidity | PM2.5
Cell size (m × m) | 50*30 | 1000*1000
Number of cells | 57 | 36
Cycle length (h) | 0.5 | 1
Duration (days) | 7 | 11
Error metric | mean absolute error | classification error
Mean ± Std. | (temperature), (humidity) | (PM2.5)
SensorScope (Ingelrest et al., 2010): The SensorScope dataset contains various environment readings, including temperature and humidity. The sensed data are collected from the EPFL campus. We first divide the target area into 100 cells, each of size 50 m × 30 m (see Table 1), and find that 57 of the 100 cells are deployed with valid sensors. Hence, we use the sensed data at these 57 cells to evaluate our algorithms. We use the mean absolute error to measure the inference error.
UAir (Zheng et al., 2013): The UAir dataset includes air quality readings from Beijing. As in (Zheng et al., 2013), we split Beijing into cells, each of size 1000 m × 1000 m (see Table 1). Then, there are 36 cells with sensed air quality readings. With this dataset, we conduct the experiment of PM2.5 sensing, and try to infer the air quality index category of unsensed cells. There are six categories (Zheng et al., 2013): Good (0-50), Moderate (51-100), Unhealthy for Sensitive Groups (101-150), Unhealthy (151-200), Very Unhealthy (201-300), and Hazardous (>300). The inference error is measured by classification error.
5.2. Baseline Algorithms
We compare DRCell to two existing cell selection methods: QBC and RANDOM.
QBC: Based on research on active learning for matrix completion, Wang et al. (Wang et al., 2018c) present an intuitive method, called the Query-by-Committee based cell selection algorithm. QBC selects the salient cell determined by a "committee" to allocate the next task. More specifically, QBC uses several different data inference algorithms, such as compressive sensing and K-Nearest Neighbors, to infer the full sensing matrix. Then, it allocates the next task to the cell with the largest variance among the values inferred by the different algorithms.
RANDOM: In each sensing cycle, RANDOM randomly selects cells one by one until the selected cells ensure a satisfactory inference accuracy.
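The QBC selection rule can be sketched as picking the unsensed cell on which the committee disagrees most. This is a simplified illustration, not the baseline's actual implementation: the committee members are represented only by their inferred values, cells are indexed from 0, population variance is used as the disagreement measure, and `qbc_select` is a hypothetical name.

```python
from statistics import pvariance
from typing import Sequence


def qbc_select(committee_estimates: Sequence[Sequence[float]],
               unsensed: Sequence[int]) -> int:
    """Return the unsensed cell whose inferred values vary most across
    the committee; committee_estimates[a][i] is algorithm a's value for cell i."""
    return max(
        unsensed,
        key=lambda i: pvariance([est[i] for est in committee_estimates]),
    )


# Two hypothetical inference algorithms, three unsensed cells.
estimates = [
    [20.0, 21.0, 25.0],   # e.g., values inferred by one algorithm
    [20.0, 23.0, 25.5],   # e.g., values inferred by another algorithm
]
next_cell = qbc_select(estimates, unsensed=[0, 1, 2])
# next_cell == 1: the committee disagrees most on cell 1
```

Note that this criterion targets uncertainty only; it does not look at how many more cells will be needed to reach the quality requirement.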
5.3. Experiment Process
To train DRCell, we use the first 2-day data of each dataset to learn our Q-function, i.e., we suppose that the MCS organizers will conduct a 2-day preliminary study to collect data from all the cells of the target area. After the 2-day training stage, we enter the testing stage, in which we use the trained Q-function to obtain the reward score of every possible action under the current state, and then choose the action (i.e., cell) with the largest reward score. During the testing stage, we use the leave-one-out Bayesian inference method to ensure quality, as in previous Sparse MCS literature (Wang et al., 2018c). The objective is to select as few cells as possible under the quality guarantee; thus, we compare the number of cells selected by DRCell and the baseline methods to verify the effectiveness of DRCell.
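The testing-stage procedure described above amounts to a greedy loop within each sensing cycle. The sketch below makes that loop concrete under simplifying assumptions: `q_values` stands in for the trained Q-function, `quality_ok` stands in for the leave-one-out Bayesian quality check, and the state is reduced to the list of cells already selected in the current cycle. All of these names are hypothetical.

```python
from typing import Callable, List, Sequence

def select_cells_for_cycle(
    n_cells: int,
    q_values: Callable[[Sequence[int]], Sequence[float]],
    quality_ok: Callable[[Sequence[int]], bool],
) -> List[int]:
    """Greedily add the highest-scoring unselected cell until quality is met."""
    selected: List[int] = []
    while len(selected) < n_cells and not quality_ok(selected):
        scores = q_values(selected)  # one reward score per cell for this state
        candidates = [c for c in range(n_cells) if c not in selected]
        best = max(candidates, key=lambda c: scores[c])
        selected.append(best)
    return selected

# Toy stand-ins: fixed scores that ignore the state, and a quality check
# that passes once two cells have been sensed.
toy_scores = [0.1, 0.9, 0.5, 0.3]
result = select_cells_for_cycle(
    n_cells=4,
    q_values=lambda state: toy_scores,
    quality_ok=lambda state: len(state) >= 2,
)
print(result)
```

The number of cells returned per cycle is the quantity compared across DRCell, QBC, and RANDOM in the experiments.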
5.4. Experiment Results
We first evaluate performance using the temperature data in SensorScope and the PM2.5 data in UAir. The results are shown in Figure 6.
In the temperature scenario of SensorScope, for the predefined ()-quality, we set the error bound as and as or . This quality requirement means that the inference error is smaller than for around 90% or 95% of cycles. Figure 6 (leftmost part) shows the average number of selected cells per sensing cycle. DRCell always outperforms the two baseline methods. More specifically, when , DRCell selects and fewer cells than QBC and RANDOM, respectively. In general, DRCell only needs to select 12.84 out of 57 cells in each sensing cycle to keep the inference error below in of cycles. When we raise the quality requirement to , DRCell needs to select more cells to satisfy the higher requirement. In particular, DRCell selects 15.08 out of 57 cells under this quality requirement and achieves better performance by selecting and fewer cells than QBC and RANDOM, respectively. For the PM2.5 scenario in UAir, we set the error bound as and as or and make similar observations, as shown in Figure 6 (rightmost part). When is /, DRCell selects 9.0/12.5 out of 36 cells and selects /, and / fewer cells than QBC and RANDOM, respectively.
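The quality requirement used here (the inference error stays within an error bound in a given fraction of cycles) can be checked after the fact with a short function. This is a minimal sketch: the parameter names `error_bound` and `p` and the per-cycle errors are hypothetical, and during the actual testing stage the true errors are unknown, which is why the leave-one-out Bayesian inference method is used instead.

```python
def satisfies_quality(cycle_errors, error_bound, p):
    """True if at least a fraction p of cycles has error within error_bound."""
    within = sum(1 for e in cycle_errors if e <= error_bound)
    return within / len(cycle_errors) >= p

# Hypothetical per-cycle inference errors: 8 of the 10 cycles are within 0.25.
errors = [0.1, 0.2, 0.3, 0.1, 0.2, 0.1, 0.2, 0.1, 0.2, 0.4]
meets_90 = satisfies_quality(errors, error_bound=0.25, p=0.9)
meets_75 = satisfies_quality(errors, error_bound=0.25, p=0.75)
print(meets_90, meets_75)
```

Raising `p` from 0.9 to 0.95 tightens the requirement, which is consistent with DRCell needing more cells per cycle (15.08 instead of 12.84) under the stricter setting.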
We then conduct experiments on a multi-task MCS scenario, i.e., temperature-humidity monitoring in SensorScope, to verify the transfer learning performance. We conduct two-way experiments: temperature as the source task and humidity as the target task, and vice versa. More specifically, for the source task, we still suppose that we obtain 2-day data for training; for the target task, however, we suppose that we obtain only 10 cycles (i.e., 5 hours) of training data. Moreover, we add two comparison methods to verify the effectiveness of our transfer learning method: NOTRANSFER and SHORTTRAIN. NOTRANSFER directly applies the Q-function of the source task to the target task, and SHORTTRAIN trains the target task model only on the 10-cycle training data.
The quality requirement of temperature is ()-quality and that of humidity is -quality. Figure 7 shows the average number of selected cells. When temperature is the target task, TRANSFER achieves better performance, selecting , , and fewer cells than NOTRANSFER, SHORTTRAIN, and RANDOM, respectively. When humidity is the target task, TRANSFER similarly selects , , and fewer cells than NOTRANSFER, SHORTTRAIN, and RANDOM, respectively. Note that NOTRANSFER and SHORTTRAIN even perform worse than RANDOM in this case, which underscores the importance of having an adequate amount of training data for DRCell. By using transfer learning, we can significantly reduce the training data required for learning a good Q-function in DRCell, and thus further reduce the data collection costs of MCS organizers.
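The difference between the three training regimes compared above can be illustrated with a deliberately small example. This is a sketch under strong assumptions, not the DRCell transfer procedure: a three-weight linear model stands in for the Q-network, transfer is modeled as initializing from the source weights and fine-tuning on the short target data, and all weights and samples are made up for illustration.

```python
import numpy as np

def fit(X, y, w_init, steps=200, lr=0.1):
    """Gradient descent on mean squared error, starting from w_init."""
    w = np.array(w_init, dtype=float)
    for _ in range(steps):
        w -= lr * 2.0 * X.T @ (X @ w - y) / len(y)
    return w

def mse(X, y, w):
    """Mean squared error of the linear model w on (X, y)."""
    return float(np.mean((X @ w - y) ** 2))

# Hypothetical source model (trained on plenty of data) and a target task
# whose true weights differ from it in only one component.
w_source = np.array([1.0, 2.0, 3.0])
w_target_true = np.array([1.0, 2.0, 3.5])

# Short target training set: a single sample that only reveals the third weight.
X_short = np.array([[0.0, 0.0, 1.0]])
y_short = X_short @ w_target_true

# Held-out target data used for evaluation.
X_test = np.eye(3)
y_test = X_test @ w_target_true

errors = {
    "NOTRANSFER": mse(X_test, y_test, w_source),                        # reuse source as is
    "SHORTTRAIN": mse(X_test, y_test, fit(X_short, y_short, np.zeros(3))),  # scratch, short data
    "TRANSFER": mse(X_test, y_test, fit(X_short, y_short, w_source)),   # source init + fine-tune
}
print(errors)
```

In this toy setting the fine-tuned model keeps what the source task already learned and corrects only the part the short target data covers. The ordering of the two weaker variants is specific to this example and is not meant to reproduce the results in Figure 7.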
Finally, we report the computation time of DRCell. Our experiment platform is equipped with an Intel Xeon E5-2630 v4 CPU @ 2.20GHz and 32 GB RAM. We implement the DRCell training algorithm in TensorFlow (CPU version). In our experiment scenarios, training takes around 2–4 hours, which is acceptable in real-life deployments because training is an offline process.
6. Conclusion
In this paper, to improve the cell selection efficiency in Sparse MCS, we propose a novel Deep Reinforcement learning-based Cell selection mechanism, namely DRCell. We model the three key concepts in reinforcement learning, i.e., state, action, and reward, and then propose a deep recurrent Q-network with LSTM to learn the Q-function, which outputs the reward score for an arbitrary state-action pair. Then, under a given state, we choose the cell with the largest reward score as the next cell for sensing. Furthermore, we propose a transfer learning method to reduce the amount of training data required for learning the Q-function when multiple correlated MCS tasks are conducted in the same target area. Experiments on various real sensing datasets verify the effectiveness of DRCell in reducing data collection costs.
In future work, we will study how to conduct reinforcement learning-based cell selection in an online manner, so that a preliminary study stage for collecting training data is no longer needed. We will also consider the case where the data collection costs of different cells differ. Finally, we will consider extending our mechanism to multi-task allocation scenarios, in which heterogeneous tasks are conducted simultaneously (Wang et al., 2017a; Wang et al., 2018b), and to privacy-preserving scenarios, in which participant privacy protection mechanisms are applied (Wang et al., 2016b, 2017b, 2018a).
References
 Ba et al. (2014) Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. 2014. Multiple Object Recognition with Visual Attention. Computer Science (2014).
 Candès and Recht (2009) Emmanuel J Candès and Benjamin Recht. 2009. Exact matrix completion via convex optimization. Foundations of Computational mathematics 9, 6 (2009), 717.
 Devarakonda et al. (2013) Srinivas Devarakonda, Parveen Sevusu, Hongzhang Liu, Ruilin Liu, Liviu Iftode, and Badri Nath. 2013. Real-time air quality monitoring through mobile sensing in metropolitan areas. In Proceedings of the 2nd ACM SIGKDD international workshop on urban computing. ACM, 15.
 Donoho (2006) David L Donoho. 2006. Compressed sensing. IEEE Transactions on information theory 52, 4 (2006), 1289–1306.
 Foerster et al. (2016) Jakob N. Foerster, Yannis M. Assael, Nando de Freitas, and Shimon Whiteson. 2016. Learning to Communicate to Solve Riddles with Deep Distributed Recurrent Q-Networks. CoRR abs/1602.02672 (2016). arXiv:1602.02672 http://arxiv.org/abs/1602.02672
 Ganti et al. (2011) Raghu K Ganti, Fan Ye, and Hui Lei. 2011. Mobile crowdsensing: current state and future challenges. IEEE Communications Magazine 49, 11 (2011).
 Gu et al. (2017) Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. 2017. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 3389–3396.
 Guo et al. (2015) Bin Guo, Zhu Wang, Zhiwen Yu, Yu Wang, Neil Y Yen, Runhe Huang, and Xingshe Zhou. 2015. Mobile crowd sensing and computing: The review of an emerging human-powered sensing paradigm. ACM Computing Surveys (CSUR) 48, 1 (2015), 7.
 Hasenfratz et al. (2012) David Hasenfratz, Olga Saukh, Silvan Sturzenegger, and Lothar Thiele. 2012. Participatory air pollution monitoring using smartphones. Mobile Sensing 1 (2012), 1–5.
 Hausknecht and Stone (2015) Matthew J. Hausknecht and Peter Stone. 2015. Deep Recurrent Q-Learning for Partially Observable MDPs. CoRR abs/1507.06527 (2015). arXiv:1507.06527 http://arxiv.org/abs/1507.06527
 Ingelrest et al. (2010) Francois Ingelrest, Guillermo Barrenetxea, Gunnar Schaefer, Martin Vetterli, Olivier Couach, and Marc Parlange. 2010. SensorScope: Application-specific sensor network for environmental monitoring. ACM Transactions on Sensor Networks 6, 2 (2010), 1–32.

 Lample et al. (2016) Guillaume Lample and Devendra Singh Chaplot. 2016. Playing FPS Games with Deep Reinforcement Learning. In AAAI Conference on Artificial Intelligence.
 Levine et al. (2015) Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. 2015. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research 17, 1 (2015), 1334–1373.
 Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing Atari with Deep Reinforcement Learning. Computer Science (2013).
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529.
 Rana et al. (2010) Rajib Kumar Rana, Chun Tung Chou, Salil S Kanhere, Nirupama Bulusu, and Wen Hu. 2010. Ear-phone: an end-to-end participatory urban noise mapping system. In Proceedings of the 9th ACM/IEEE International Conference on Information Processing in Sensor Networks. ACM, 105–116.
 Silver et al. (2016) David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, and Marc Lanctot. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587 (2016), 484.
 Silver et al. (2017) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. 2017. Mastering the game of go without human knowledge. Nature 550, 7676 (2017), 354.
 Sutton and Barto (2005) R Sutton and A Barto. 2005. Reinforcement Learning: An Introduction. MIT Press. 90–127 pages.
 Wang et al. (2017a) Jiangtao Wang, Yasha Wang, Daqing Zhang, Feng Wang, Yuanduo He, and Liantao Ma. 2017a. PSAllocator: multi-task allocation for participatory sensing with sensing capability constraints. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing. ACM, 1139–1151.
 Wang et al. (2018b) Jiangtao Wang, Yasha Wang, Daqing Zhang, Feng Wang, Haoyi Xiong, Chao Chen, Qin Lv, and Zhaopeng Qiu. 2018b. Multi-Task Allocation in Mobile Crowd Sensing with Individual Task Quality Assurance. IEEE Transactions on Mobile Computing (2018).
 Wang et al. (2018a) Leye Wang, Gehua Qin, Dingqi Yang, Xiao Han, and Xiaojuan Ma. 2018a. Geographic Differential Privacy for Mobile Crowd Coverage Maximization. In AAAI.
 Wang et al. (2017b) Leye Wang, Dingqi Yang, Xiao Han, Tianben Wang, Daqing Zhang, and Xiaojuan Ma. 2017b. Location privacy-preserving task allocation for mobile crowdsensing with differential geo-obfuscation. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 627–636.
 Wang et al. (2015) Leye Wang, Daqing Zhang, Animesh Pathak, Chao Chen, Haoyi Xiong, Dingqi Yang, and Yasha Wang. 2015. CCS-TA: quality-guaranteed online task allocation in compressive crowdsensing. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 683–694.
 Wang et al. (2016a) Leye Wang, Daqing Zhang, Yasha Wang, Chao Chen, Xiao Han, and Abdallah M’hamed. 2016a. Sparse mobile crowdsensing: challenges and opportunities. IEEE Communications Magazine 54, 7 (2016), 161–167.
 Wang et al. (2016b) Leye Wang, Daqing Zhang, Dingqi Yang, Brian Y Lim, and Xiaojuan Ma. 2016b. Differential location privacy for sparse mobile crowdsensing. In Data Mining (ICDM), 2016 IEEE 16th International Conference on. IEEE, 1257–1262.
 Wang et al. (2018c) Leye Wang, Daqing Zhang, Dingqi Yang, Animesh Pathak, Chao Chen, Xiao Han, Haoyi Xiong, and Yasha Wang. 2018c. SPACE-TA: Cost-Effective Task Allocation Exploiting Intra-data and Inter-data Correlations in Sparse Crowdsensing. ACM Transactions on Intelligent Systems & Technology 9, 2 (2018), 1–28.
 Xiao et al. (2017a) Liang Xiao, Tianhua Chen, Caixia Xie, Huaiyu Dai, and Vincent Poor. 2017a. Mobile Crowdsensing Games in Vehicular Networks. IEEE Transactions on Vehicular Technology PP, 99 (2017), 1–1.
 Xiao et al. (2017b) Liang Xiao, Yanda Li, Guoan Han, Huaiyu Dai, and H. Vincent Poor. 2017b. A Secure Mobile Crowdsensing Game with Deep Reinforcement Learning. IEEE Transactions on Information Forensics & Security PP, 99 (2017), 1–1.
 Xiong et al. (2015) Haoyi Xiong, Daqing Zhang, Leye Wang, and Hakima Chaouchi. 2015. EMC3: Energy-efficient data transfer in mobile crowdsensing under full coverage constraint. IEEE Transactions on Mobile Computing 14, 7 (2015), 1355–1368.
 Xu et al. (2015) Liwen Xu, Xiaohong Hao, Nicholas D Lane, Xin Liu, and Thomas Moscibroda. 2015. More with less: Lowering user burden in mobile crowdsourcing through compressive sensing. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 659–670.
 Zhang et al. (2014) Daqing Zhang, Leye Wang, Haoyi Xiong, and Bin Guo. 2014. 4W1H in mobile crowd sensing. IEEE Communications Magazine 52, 8 (2014), 42–48.
 Zheng et al. (2013) Yu Zheng, Furui Liu, and Hsun Ping Hsieh. 2013. U-Air: When urban air quality inference meets big data. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1436–1444.
 Zhu et al. (2013) Yanmin Zhu, Zhi Li, Hongzi Zhu, Minglu Li, and Qian Zhang. 2013. A compressive sensing approach to urban traffic estimation with probe vehicles. IEEE Transactions on Mobile Computing 12, 11 (2013), 2289–2302.