Cell Selection with Deep Reinforcement Learning in Sparse Mobile Crowdsensing

by Leye Wang, et al.

Sparse Mobile CrowdSensing (MCS) is a novel MCS paradigm in which data inference is incorporated into the MCS process to reduce sensing costs while guaranteeing data quality. Since the sensed data from different cells (sub-areas) of the target sensing area will probably lead to diverse levels of inference data quality, cell selection (i.e., choosing which cells of the target area to collect sensed data from participants) is a critical issue that impacts the total amount of data that needs to be collected (i.e., data collection costs) for ensuring a certain level of quality. To address this issue, this paper proposes a Deep Reinforcement learning based Cell selection mechanism for Sparse MCS, called DR-Cell. First, we properly model the key concepts in reinforcement learning, including state, action, and reward, and then propose to use a deep recurrent Q-network to learn the Q-function that can help decide which cell is a better choice under a certain state during cell selection. Furthermore, we leverage transfer learning techniques to reduce the amount of data required for training the Q-function when multiple correlated MCS tasks need to be conducted in the same target area. Experiments on various real-life sensing datasets verify the effectiveness of DR-Cell over the state-of-the-art cell selection mechanisms in Sparse MCS: it reduces up to 15% of sensed cells with the same data inference quality guarantee.



1. Introduction

With the prevalence of smart mobile devices, mobile crowdsensing (MCS) has become a popular sensing mechanism for addressing various urban tasks such as environment and traffic monitoring (Zhang et al., 2014). Traditional MCS mechanisms usually collect a large amount of data to cover almost all the cells (i.e., subareas) of the target area to ensure quality. This requires MCS organizers to recruit many participants (e.g., at least one from every cell for full coverage), leading to a relatively high cost. To reduce such cost while still ensuring a high level of quality, a new MCS paradigm, namely Sparse MCS, was proposed recently (Wang et al., 2018c, 2016a). Sparse MCS collects data from only a few cells while intelligently inferring the data of the remaining cells with quality guarantees (i.e., the error of inferred data is lower than a threshold). Hence, compared to traditional mechanisms, the MCS organizers' cost can be reduced since only a few participants need to be recruited, while the task quality is still ensured.

In Sparse MCS, one key issue affecting how much cost can be practically saved is cell selection, i.e., which cells the organizer decides to collect sensed data from (Wang et al., 2016a). To show the importance of cell selection, Figure 1 (left part) gives an illustrative example of two different cell selection cases in a city, which is split into cells. In Case 1.1, all the selected cells are gathered in one corner of the city; in Case 1.2, the collected data is more widely distributed across the whole city. As the data of most sensing tasks has spatial correlations (i.e., nearby cells may have similar data), e.g., air quality (Zheng et al., 2013), the cell selection of Case 1.2 will lead to a higher quality of the inferred data than Case 1.1. Moreover, an MCS campaign usually lasts for a long time (e.g., sensing every hour), so that not only spatial correlations but also temporal correlations need to be carefully considered in cell selection. As shown in Figure 1 (right part), sensing the same cells in consecutive cycles (Case 2.1) may not be as efficient as sensing different cells (Case 2.2) in terms of inference quality. Since the data of different MCS applications may involve diverse spatio-temporal correlations, determining the proper cell selection strategy is a non-trivial task.

Figure 1. Different cell selection cases.

Existing works on Sparse MCS mainly leverage Query-By-Committee (QBC) (Wang et al., 2015, 2018c) in cell selection. QBC first uses various inference algorithms to deduce the data of all the unsensed cells, and then chooses the cell where the inferred data of the various algorithms has the largest variance as the next cell for sensing. Briefly, the cell selection criterion of QBC is to choose the cell that is the most uncertain according to a committee of inference algorithms, i.e., the hardest to infer. While QBC has shown its effectiveness in some scenarios (Wang et al., 2015, 2018c), it does not directly optimize the objective of Sparse MCS, i.e., minimizing the number of sensed cells under a quality guarantee. In fact, the existing works using QBC also acknowledge that its performance is still far from the optimal cell selection strategy (note that the optimal strategy is impractical, as it requires knowing the ground truth data of each cell in advance, which is impossible in reality (Wang et al., 2015)). To reduce this performance gap, our research question here is: can we find a better cell selection strategy in Sparse MCS, one that directly minimizes the number of selected cells under the inference quality guarantee?

To this end, in this paper, we design a new cell selection framework for Sparse MCS, called DR-Cell, based on deep reinforcement learning. In recent years, deep reinforcement learning has shown its success in decision making problems in diverse areas such as robot control (Gu et al., 2017) and game playing (Silver et al., 2017; Mnih et al., 2015). In general, deep reinforcement learning can benefit a large set of decision making problems that can be abstracted as 'an agent needs to decide the action under a certain state'. Our cell selection problem can be interpreted as 'an MCS server (agent) needs to choose the next cell for sensing (action) considering the data already collected (state)'. In this regard, it is promising to apply deep reinforcement learning to the cell selection problem in Sparse MCS.

To effectively employ deep reinforcement learning in cell selection, we still face several issues.

  1. The first issue is how to mathematically model the state, action, and reward, which are the key concepts in reinforcement learning (Sutton and Barto, 2005). Briefly speaking, reinforcement learning attempts to learn a Q-function which takes the current state as input and generates a reward score for each possible action as output. Then, we can take the action with the highest reward score as our decision. Only by modeling state, action, and reward properly can we generate a cell selection policy that minimizes the number of cells selected under the quality requirement.

  2. The second issue is how to learn the Q-function. Traditional Q-learning techniques in reinforcement learning work well in scenarios where the state and action spaces are small (i.e., the number of states and actions is limited). However, in Sparse MCS, the state space is actually quite large. For example, suppose there are 100 cells (subareas) in the target sensing area; then, even if we only consider the current cycle, the possible number of states grows up to 2^100 (each cell is either sensed by participants or not). To overcome the difficulty of the large state space, we hence propose to combine deep learning with reinforcement learning, i.e., deep reinforcement learning, to learn the Q-function for our cell selection problem.

  3. The last issue is training data scarcity. Usually, deep reinforcement learning requires a lot of training data (i.e., known states, actions, and rewards) to learn the Q-function. In areas such as robot control or game playing, a robot or a computer can continuously run for data collection until the training performance is good. However, in MCS, we cannot obtain an unlimited amount of data for training. Hence, how to address the training data scarcity issue, at least partially, should also be considered in our cell selection problem.

In summary, this work has the following contributions:

(1) To the best of our knowledge, this work is the first research attempt to leverage deep reinforcement learning to address a critical issue in Sparse MCS: cell selection.

(2) We propose DR-Cell to select the best cell for obtaining sensed data in Sparse MCS. More specifically, we model the state with one-hot encoding of the recent cell selections, the action as the choice of the next sensing cell, and the reward following the inference quality requirement of Sparse MCS. Then, considering the spatio-temporal correlations hidden in the sensed data, we propose a recurrent deep neural network structure to learn the reward output from the inputs of state and action. Finally, to relieve the dependence on a large amount of training data, we propose a transfer learning algorithm between heterogeneous sensing tasks in the same target area, so that the decision function learned on one task can be efficiently transferred to another task with only a little training data.

(3) Experiments on real data of sensing tasks including temperature, humidity and air quality monitoring have verified the effectiveness of DR-Cell. In particular, DR-Cell can outperform the state-of-the-art mechanism QBC by reducing up to 15% of cells while guaranteeing the same quality in Sparse MCS.

2. Related Work

2.1. Sparse Mobile Crowdsensing

MCS has been proposed to utilize widespread crowds to perform large-scale sensing tasks (Zhang et al., 2014; Ganti et al., 2011; Guo et al., 2015). In practice, to minimize sensing cost while ensuring data quality, some MCS tasks involve inference algorithms to fill in the missing data of unsensed cells, such as noise sensing (Rana et al., 2010), traffic monitoring (Zhu et al., 2013), and air quality sensing (Wang et al., 2015). It is worth noting that in such MCS tasks, compressive sensing (Candès and Recht, 2009; Donoho, 2006) has become the de facto choice of inference algorithm (Rana et al., 2010; Zhu et al., 2013; Wang et al., 2015; Xu et al., 2015; Wang et al., 2018c). Recently, by extracting the common research issues involved in such data-inference-based tasks, Wang et al. (Wang et al., 2016a) proposed a new MCS paradigm, called Sparse MCS. Besides the inference algorithm, Sparse MCS also abstracts other critical research issues such as cell selection and quality assessment. Later, a privacy protection mechanism was also added to Sparse MCS (Wang et al., 2016b). In this paper, we focus on the cell selection issue and aim to use deep reinforcement learning techniques to address it.

2.2. Deep Reinforcement Learning

Reinforcement Learning (RL) (Sutton and Barto, 2005) is concerned with how to map states to actions so as to maximize cumulative rewards. It utilizes rewards to guide the agent toward better sequential decisions, and has substantive and fruitful interactions with other engineering and scientific disciplines. Recently, many researchers have focused on combining deep learning with reinforcement learning to enhance RL for solving concrete problems in the sciences, business, and other areas. Mnih et al. (Mnih et al., 2013) propose the first deep reinforcement learning model (DQN), which successfully deals with high-dimensional sensory input, and apply it to play seven Atari 2600 games. More recently, Silver et al. (Silver et al., 2016) build on DQN and present AlphaGo, the first program to defeat world-class players in Go. Moreover, to deal with partially observable states, Hausknecht and Stone (Hausknecht and Stone, 2015) introduce the Deep Recurrent Q-Network (DRQN), which incorporates a Long Short-Term Memory (LSTM) network, and apply it to play Atari 2600 games. Lample and Chaplot (Lample et al., 2016) even use DRQN to play FPS games.

While deep reinforcement learning has already been used in a variety of areas, such as object recognition (Ba et al., 2014), robot control (Levine et al., 2015), and communication protocols (Foerster et al., 2016), MCS researchers have only very recently begun to apply it. Xiao et al. (Xiao et al., 2017a) formulate the interactions between a server and vehicles as a vehicular crowdsensing game. They then propose Q-learning based strategies to help the server and the vehicles make optimal decisions in the dynamic game. Moreover, Xiao et al. (Xiao et al., 2017b) apply DQN to derive the optimal policy for the Stackelberg game between an MCS server and a number of smartphone users. As far as we know, this paper is the first research attempt to use deep reinforcement learning in the cell selection of Sparse MCS, so as to reduce MCS organizers' data collection costs while still guaranteeing data quality.

3. Problem Formulation

We first define several key concepts, and then mathematically formulate the cell selection problem in Sparse MCS. Finally, we illustrate a running example to explain our problem in more details.

Definition 1. Sensing Area. We suppose that the target sensing area can be split into a set of m cells (e.g., grids (Zheng et al., 2013; Wang et al., 2018c)). The objective of a sensing task is to get a certain type of data (e.g., temperature, air quality) for all the cells in the target area.

Definition 2. Sensing Cycle. We suppose the sensing tasks can be split into equal-length cycles, and the cycle length is determined by the MCS organizers according to their requirements (Xiong et al., 2015; Wang et al., 2018c). For example, if an organizer wants to update the data of the target sensing area every one hour, then he can set the cycle length to one hour.

Definition 3. Ground Truth Data Matrix. Suppose we have m cells and n cycles; then, for a certain sensing task, the ground truth data matrix is denoted F ∈ R^{m×n}, where F[i, j] is the true data of cell i at cycle j.

Definition 4. Cell Selection Matrix. In Sparse MCS, we select only part of the cells in each cycle for data collection, while inferring the data for the remaining cells. The cell selection matrix, denoted S ∈ {0, 1}^{m×n}, marks the cell selection results: S[i, j] = 1 means that cell i is selected at cycle j for data collection; otherwise, S[i, j] = 0.

Definition 5. Inferred Data Matrix. In Sparse MCS, when an organizer decides not to collect any more data in the current cycle, the data of the unsensed cells is then inferred. We denote the inferred data of the j-th cycle as F̂[:, j], and thus the inferred data of all the cycles forms a matrix F̂. Note that in Sparse MCS, compressive sensing is nowadays the de facto choice of inference algorithm (Rana et al., 2010; Zhu et al., 2013; Wang et al., 2015; Xu et al., 2015; Wang et al., 2018c), and we also use it in this work.

Definition 6. (ε, p)-quality (Wang et al., 2018c). In Sparse MCS, the quality guarantee is called (ε, p)-quality, meaning that in at least a fraction p of the cycles, the inference error (e.g., mean absolute error) is not larger than ε. Formally,

|{j : error(F[:, j], F̂[:, j]) ≤ ε}| / n ≥ p

where n is the number of total sensing cycles.

Note that in practice, since we do not know the ground truth data matrix F, we also cannot know with 100% confidence whether the error is smaller than ε in the current cycle. This is why we include p in the quality requirement, as it is impossible to ensure that 100% of cycles have an error less than ε. To ensure (ε, p)-quality, a quality assessment method is needed in Sparse MCS to estimate the probability that the error is less than ε for the current cycle. If the estimated probability is larger than p, then the current cycle satisfies (ε, p)-quality and no more data will be collected (we then move to the next sensing cycle). In Sparse MCS, a leave-one-out based Bayesian inference method is often leveraged for quality assessment (Wang et al., 2018c, 2015, 2016a), and we also use it in this work.
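To make the (ε, p)-quality check concrete, the following minimal sketch (not part of the original paper; the per-cycle error list and function name are assumptions for illustration) tests whether a sequence of cycles satisfies the guarantee:

```python
def satisfies_quality(cycle_errors, eps, p):
    """Check (eps, p)-quality: in at least a fraction p of the cycles,
    the inference error must be no larger than eps."""
    good = sum(1 for e in cycle_errors if e <= eps)
    return good / len(cycle_errors) >= p
```

For example, with errors [0.1, 0.2, 0.5, 0.1], ε = 0.3 and p = 0.75 are satisfied (3 of 4 cycles pass), while errors [0.5, 0.5, 0.1, 0.1] are not.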

Problem [Cell Selection]: Given a Sparse MCS task with m cells and n cycles, using compressive sensing as the data inference method and leave-one-out based Bayesian inference as the quality assessment method, we aim to select a minimal subset of sensing cells during the whole sensing process, i.e., minimize the number of non-zero entries in the cell selection matrix S, while satisfying (ε, p)-quality.

Figure 2. Running example.

We now use a running example to illustrate our problem in more detail, as shown in Figure 2. (1) Suppose we have five cells and the current cycle is the 5th one; (2) we select cell 3 for collecting data, and then assess whether the current cycle can satisfy (ε, p)-quality; (3) as we find that the quality requirement is not satisfied, we continue collecting data from cell 5; (4) the quality requirement is now satisfied, so data collection is terminated for the current cycle, and the data of the unsensed cells is inferred. In this example, we see that after five cycles, there are 11 data submissions from participants in total. The objective of our cell selection problem is exactly to minimize the number of such data submissions.
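The per-cycle collect-then-assess loop of this running example can be sketched as follows; `select_next` and `assess_quality` are hypothetical stand-ins for the actual cell selection policy and the Bayesian quality assessment, not the paper's implementation:

```python
def run_cycle(cells, select_next, assess_quality):
    """One sensing cycle: keep selecting cells until the quality
    assessment passes; the unsensed cells would then be inferred.
    Returns the list of cells actually sensed this cycle."""
    selected = []
    while not assess_quality(selected):
        remaining = [c for c in cells if c not in selected]
        selected.append(select_next(remaining))
    return selected

# Toy stand-ins: pick cells in order; quality passes after 2 cells.
cells = [1, 2, 3, 4, 5]
picked = run_cycle(cells, select_next=lambda r: r[0],
                   assess_quality=lambda s: len(s) >= 2)
```

Summing `len(picked)` over all cycles gives the total number of data submissions that the cell selection problem aims to minimize.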

4. Methodology

Figure 3. State, action, reward in DR-CELL

In this section, we propose a novel mechanism, called DR-Cell, to address the cell selection problem with deep reinforcement learning. First, we mathematically model the state, reward, and action used in DR-Cell. Then, with a simplified MCS task example (i.e., one with only a few cells in the target area), we explain how traditional reinforcement learning works to find the most appropriate cell for sensing based on our state, reward, and action modeling. Afterward, we elaborate how deep learning can be combined with reinforcement learning (i.e., deep reinforcement learning) to handle more realistic cases of cell selection where the target area may include a large number of cells. Finally, we describe how transfer learning can help us generate a cell selection strategy with only a little training data under some specific conditions.

4.1. Modeling state, action, and reward

To apply deep reinforcement learning to cell selection, we need to model the key concepts of state, action, and reward. Figure 3 illustrates the relationship between these three key concepts in DR-Cell. Briefly speaking, in DR-Cell, based on the current data collection state, we need to learn a Q-function (to be elaborated in the next few subsections) that can output a reward score for each possible action. The action in cell selection is choosing which cell to sense next, while the reward indicates how good a certain action is. If an action (i.e., a cell) gets a higher reward score, it may be a better choice. Next, we formally model the three concepts.

Figure 4. An example of state model.

(1) State represents the current data collection condition of the MCS task. In Sparse MCS, the cell selection matrix (Definition 4) can naturally model the state well, as it records both where and when we have collected data from the target sensing area during the whole task. In practice, we can just keep the recent c cycles' cell selection matrix as the state, denoted as s = [v_t, v_{t-1}, ..., v_{t-c+1}], where v_t represents the cell selection vector of the current cycle (1 means selected and 0 means not), v_{t-1} represents the last cycle, and so on. Figure 4 shows an example of how we encode the current data collection condition into the state model when the recent two cycles are considered. As an example, suppose that we consider the recent two cycles and there are five cells in total in the target area; then the number of possible states is 2^{5×2} = 1024.
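As an illustration, the 0/1 state encoding over the recent cycles might be implemented as below; this is a sketch under the modeling above, and representing each cycle's selections as a set of cell indices is an assumption:

```python
def encode_state(selection_history, n_cells, c=2):
    """Encode the recent c cycles' cell selections as a flat 0/1 state
    vector: the current cycle first, then previous cycles, zero-padded
    when the history is shorter than c cycles."""
    state = []
    recent = selection_history[-c:][::-1]  # newest cycle first
    for cycle in recent:
        state += [1 if i in cycle else 0 for i in range(n_cells)]
    state += [0] * (c * n_cells - len(state))  # pad short histories
    return state

# 5 cells; last cycle selected cells {0, 3}, current cycle selected {2}.
s = encode_state([{0, 3}, {2}], n_cells=5, c=2)
```

With five cells and two cycles, the resulting vector has 10 binary entries, matching the 2^{10} possible states discussed above.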

(2) Action means all the possible decisions that we may make in cell selection. Suppose there are m cells in total in the target sensing area; then the next selected cell has m possible choices, leading to the whole action set A = {a_1, a_2, ..., a_m}. Note that while in practice we will not select one cell more than once in one cycle, to make the action set consistent under different states, we assume that the possible action set is always the complete set of all the cells under any state. More specifically, if some cells have already been selected in the current cycle, then the probability of choosing these cells is set to zero.
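Setting the selection probability of already-selected cells to zero can be realized by masking their scores before taking the argmax; the helper below is a hypothetical illustration, not the paper's code:

```python
def best_unselected_action(q_values, already_selected):
    """Pick the action with the highest score, excluding cells that were
    already selected in the current cycle (their scores are masked to
    negative infinity, so they can never win the argmax)."""
    masked = list(q_values)
    for i in already_selected:
        masked[i] = float("-inf")
    return max(range(len(masked)), key=lambda i: masked[i])
```

For instance, with scores [0.5, 0.9, 0.1] and cell 1 already selected, the mask makes cell 0 the best remaining choice.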

(3) Reward is used to indicate how good an action is. In each sensing cycle, we select actions one by one until the selected cells can satisfy the quality requirement in the current cycle, i.e., inference error less than ε. (When running Sparse MCS, we have to set a probability p in the quality requirement, i.e., (ε, p)-quality, as we do not know the ground truth data of the unsensed cells. However, in the training stage of the cell selection policy, we assume that we have obtained the data of all the cells in the target area for some time (e.g., 1 day), and thus we can directly compute the inference error. More details on the training stage are given in the evaluation section.) Satisfying this quality requirement is the goal of cell selection and should be reflected in the reward modeling. Hence, a positive reward, denoted by r_pos, is given to an action (i.e., a cell) under a state if the quality requirement is satisfied in the current cycle after the action is taken. In addition, as selecting participants to collect data incurs cost, we also put a negative score −c into the reward modeling of an action. The reward can thus be written as r = η · r_pos − c, in which η ∈ {0, 1} indicates whether the action makes the current cycle satisfy the inference quality requirement.
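The reward model described above combines a positive bonus (denoted r_pos here), granted only when the action makes the cycle's quality requirement hold, with a fixed collection cost (denoted c); the default values below (r_pos = 5, c = 1) are illustrative assumptions, not values fixed by the paper:

```python
def reward(quality_satisfied, r_pos=5.0, cost=1.0):
    """Reward of one cell selection: always pay the collection cost,
    and add the positive bonus only when the cycle's quality
    requirement is satisfied after this action."""
    eta = 1.0 if quality_satisfied else 0.0
    return eta * r_pos - cost
```

Under these assumed values, a selection that completes the cycle yields 4.0, while an intermediate selection yields -1.0.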

With the above modeling, we then need to learn the Q-function (see Figure 3) which can output the reward score of every possible action under a certain state. In the next subsection, we will first use a traditional reinforcement learning method, tabular Q-learning, to illustrate a simplified case where a small number of cells exist in the target sensing area.

0:  Input: Q-table Q(s, a), initialized to zeros
1:  while True do
2:     Update the current state s
3:     Check the Q-table, then select and perform the action a with the largest Q-value via the ϵ-greedy algorithm
4:     if the selected cells in the current cycle satisfy the quality requirement then
5:        // The cell selection in this cycle is complete; the next state is the initial state of the next cycle
6:        Update the next state s'
7:        Obtain the reward r for this action
8:     else
9:        // Continue to select cells in this cycle
10:       s' ← s with the a-th element of the current cycle's selection vector set to 1
11:       Update the next state s'
12:       Obtain the reward r for this action
13:    end if
14:    Update the Q-table via (2) and (3)
15: end while
Algorithm 1 Tabular Q-learning

4.2. Training Q-function with Tabular Q-Learning

In traditional reinforcement learning, a widely used strategy to obtain the Q-function is tabular Q-learning. In this method, the Q-function is represented by a Q-table, denoted as Q. Each element of the Q-table, Q[s, a], represents the reward score of a certain action a under a certain state s. The objective of learning the Q-function is then equivalent to filling in all the elements of the Q-table.

The tabular Q-learning algorithm is shown in Algorithm 1. Under the current state s, the algorithm selects the action a which has the maximum value in Q[s, ·] (in fact, the best action is not always selected, as will be elaborated later). After the action has been conducted, i.e., the cell has been selected and the data of that cell has been collected, the current state s changes to the next state s'. Note that if the current cycle satisfies the quality requirement (i.e., inference error less than ε), then the next state shifts to a new cycle. For the selected action, we obtain the real reward r considering whether the inference quality requirement of the current cycle is satisfied, and then update the Q-table according to the following equations:

(2)  Q_target = r + γ · max_{a'} Q[s', a']
(3)  Q[s, a] ← (1 − α) · Q[s, a] + α · Q_target

where max_{a'} Q[s', a'] provides the highest expected reward score of the next state s' (i.e., the reward of the best action under the next state s'); γ is the discount factor indicating how myopic Q-learning is regarding the future reward; and α is the learning rate.

Moreover, during the training stage, if under a certain state we always select the action with the largest reward score in the Q-table, the algorithm may get stuck in a local optimum. To address this issue, we need to explore during training, i.e., sometimes try actions other than the best one. We thus use the ϵ-greedy algorithm for selection. More specifically, under a certain state, we select the best action according to the Q-table with probability 1 − ϵ and randomly select one of the other actions with probability ϵ. Following the existing literature, at the beginning of training we set a relatively large ϵ so that we can explore more; then, as training proceeds, we gradually reduce ϵ until the Q-table converges, at which point Algorithm 1 terminates.
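A minimal sketch of the tabular update and the ϵ-greedy selection described above; states and actions here are generic hashable stand-ins, and the α and γ values are illustrative:

```python
import random
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q[s,a] <- Q[s,a] + alpha * (r + gamma * max_a' Q[s',a'] - Q[s,a])."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def epsilon_greedy(Q, s, actions, eps):
    """Explore a random action with probability eps; otherwise exploit
    the action with the largest Q-value under state s."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```

With α = γ = 1, as in the worked example of Figure 5, the update reduces to Q[s, a] = r + max_a' Q[s', a'].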

Figure 5. An illustrative example of tabular Q-learning.

Figure 5 illustrates an example of using tabular Q-learning to train the Q-function. For simplicity, we set both the discount factor γ and the learning rate α to 1. Here, we suppose that there are five cells in the target area, and we only consider the two recent cycles: the last one and the current one. Hence, each state has a dimension of 5 × 2, as shown in s1, s2, and s3. The value 1 means that the cell has been selected and 0 means not. First, we initialize the Q-table: all its values are set to 0. When we first meet some state, e.g., s1, the scores of all the actions under s1 are 0 (the first Q-table in Figure 5). We then randomly select one action since all the values are equal. If we choose the action a3 (select cell 3), the state turns to s2. Then we update Q[s1, a3] as the current reward plus the maximum score of the next state (i.e., the future reward). The current reward is −1 since the current cycle cannot satisfy the quality requirement (the cost c is set to 1 in the example). The maximum score for the state s2 is 0 in the Q-table. Hence, we get Q[s1, a3] = −1 + 0 = −1 (the second Q-table in Figure 5). Similarly, under s2, we choose a5. As these selections satisfy the quality requirement, the current reward is r_pos − c = 5 − 1 = 4 (r_pos is set to 5, i.e., the total number of cells). Also, the maximum possible reward of the next state is 0 in the current Q-table. Then we update Q[s2, a5] = 4 + 0 = 4 (the third Q-table in Figure 5). After some rounds, we meet s1 many times and may find that the other actions are not good, so the Q-table changes to the fourth Q-table in Figure 5. This time, under s1, we check the Q-table and find that a3 has the largest value, so we choose and perform a3. Then, we update Q[s1, a3] = −1 + 4 = 3, since the maximum reward score of the next state s2 is 4 (the fifth Q-table in Figure 5). Therefore, the next time we meet s1 again, we will probably choose the action a3, since it has the largest reward score.

While tabular Q-learning can work well for an MCS task whose target area includes a small number of cells, as shown in the above example, practical MCS tasks may involve a large number of cells. Suppose there are m cells in the target area and we want to consider the recent c cycles to model states; then the state space becomes extremely large, i.e., up to 2^{m×c} states, which is intractable in practice. To overcome this difficulty, in the next subsection we propose to combine deep learning with reinforcement learning to train the decision function for cell selection in Sparse MCS.

4.3. Training Q-function with Deep Recurrent Q-Network

To overcome the problem incurred by the extremely large state space in cell selection, we turn to the Deep Q-Network (DQN), which combines Q-learning with deep neural networks. The difference between DQN and tabular Q-learning is that a deep neural network replaces the Q-table to deal with the curse of dimensionality. In DQN, we do not need Q-table lookups; instead, we calculate Q(s, a) for each state-action pair. More specifically, the DQN takes the current state and action as input and uses a deep neural network to obtain an estimated value of the Q-function, shown as

(4)  Q(s, a; θ) ≈ Q*(s, a)
For each selection, we use the neural network parameterized by θ to calculate the Q-function and select the state-action pair with the largest reward score, also called the Q-value. Note that the ϵ-greedy algorithm is also used in DQN to balance exploration and exploitation.

To obtain an estimate of the Q-value that approximates the expected one in (4), our proposed DQN uses the experience replay technique. After one selection, we obtain the experience at the current time step t, denoted as e_t = (s_t, a_t, r_t, s_{t+1}), and store it in the memory pool D = {e_1, e_2, ..., e_t}. Then, DQN randomly chooses part of the experiences from D to learn from and updates the network parameters θ. The goal is to find the best θ to obtain Q(s, a; θ). The stochastic gradient algorithm is applied with the learning rate α, and the loss function is defined as follows:

(5)  y = r + γ · max_{a'} Q(s', a'; θ⁻)
(6)  L(θ) = E_{(s, a, r, s') ~ D} [(y − Q(s, a; θ))²]
(7)  ∇_θ L(θ) = E_{(s, a, r, s') ~ D} [(y − Q(s, a; θ)) · ∇_θ Q(s, a; θ)]

For each update, DQN randomly chooses part of the experiences from D, then calculates the loss and updates the network parameters θ. Moreover, to avoid oscillations (i.e., the Q-function changing too rapidly during training), we apply the fixed Q-targets technique. More specifically, we do not always use the latest network parameters to calculate the maximum possible reward of the next state (i.e., max_{a'} Q(s', a'; θ⁻) in (5)), but instead update the corresponding target parameters θ⁻ only every few iterations, i.e., θ⁻ ← θ.
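The fixed-target TD computation and replay sampling can be sketched as follows; `q_online` and `q_target` are stand-ins for the networks parameterized by θ and θ⁻ (here replaced by trivial functions purely so the sketch runs):

```python
import random

def td_targets(batch, q_target, gamma=0.9):
    """Fixed-target TD targets r + gamma * max_a' Q(s', a'; theta^-)
    for a minibatch of (s, a, r, s_next) experiences."""
    return [r + gamma * max(q_target(s_next)) for (_, _, r, s_next) in batch]

def mse_loss(q_online, batch, targets):
    """Mean squared TD error between the online network's Q(s, a; theta)
    and the fixed targets."""
    preds = [q_online(s)[a] for (s, a, _, _) in batch]
    return sum((y - q) ** 2 for y, q in zip(targets, preds)) / len(batch)

# Toy memory pool D and a stand-in network with 2 actions per state.
D = [("s0", 1, -1.0, "s1"), ("s1", 0, 4.0, "s2"), ("s2", 1, -1.0, "s3")]
batch = random.sample(D, 2)            # experience replay: random minibatch
q_fn = lambda s: [0.0, 0.0]            # stand-in for both theta and theta^-
loss = mse_loss(q_fn, batch, td_targets(batch, q_fn, gamma=1.0))
```

In a real implementation, `q_target` would be refreshed from `q_online` only every few iterations, as the fixed Q-targets technique prescribes.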
The DQN learning algorithm is summarized in Algorithm 2.

0:  Initialize the network parameters θ and θ⁻, and the memory pool D
1:  while True do
2:     Update the current state s
3:     Calculate the Q-values by the deep Q-network with the parameters θ via (4); select the action a with the ϵ-greedy algorithm
4:     if the selected cells in the current cycle satisfy the quality requirement then
5:        // The cell selection in this cycle is complete; the next state is the initial state of the next cycle
6:        Update the next state s'
7:        Obtain the reward r for this action
8:     else
9:        // Continue to select cells in this cycle
10:       Update the next state s'
11:       Obtain the reward r for this action
12:    end if
13:    Store the experience (s, a, r, s') into the memory pool D
14:    Randomly select some experiences from D
15:    Calculate the loss via (6) and update θ via (7)
16:    if the number of iterations is a multiple of REPLACE_ITER then
17:       θ⁻ ← θ
18:    end if
19: end while
Algorithm 2 Deep Recurrent Q-Network Learning

In DQN, the design of the network structure also impacts the effectiveness of the learned Q-function. One common way is to use dense layers to connect the input (state) and the output (a vector of reward scores for all possible actions). However, temporal correlations exist in our state s, and dense layers cannot capture such temporal patterns well. We thus propose to use LSTM (Long Short-Term Memory) layers rather than dense layers in the DQN so as to capture the temporal patterns in our state; the resulting network is also called a Deep Recurrent Q-Network (DRQN) (Hausknecht and Stone, 2015). More specifically, in DRQN, the Q-function can be defined as

Q(o_t, h_{t−1}, a; θ)

where o_t represents the observation at time step t (i.e., the cell selection vector at t), and h_{t−1} is the extra input returned by the LSTM network from the previous time step t − 1. In our cell selection problem, a state can be divided into c time steps of observations o_{t−c+1}, ..., o_t, which can then be used as inputs of the DRQN for learning the Q-function.
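Structurally, DRQN folds the per-cycle observations through a recurrent hidden state before producing Q-values. The sketch below replaces the real LSTM cell with a single tanh unit purely to show the data flow; the weights and the `q_head` output layer are hypothetical:

```python
import math

def recurrent_step(o_t, h_prev, w_o=0.5, w_h=0.5):
    """Toy stand-in for one recurrent step: a real DRQN uses an LSTM
    cell here; a single tanh unit suffices to show how h_{t-1}
    carries the history forward."""
    return math.tanh(w_o * sum(o_t) + w_h * h_prev)

def drqn_q_values(observations, q_head):
    """Fold the per-cycle observations o_1..o_t through the recurrent
    state, then map the final hidden state to one Q-value per action."""
    h = 0.0
    for o_t in observations:
        h = recurrent_step(o_t, h)
    return q_head(h)

# A state split into two per-cycle observation vectors; 3 actions.
obs = [[0, 1, 0, 0, 0], [1, 0, 0, 1, 0]]
q = drqn_q_values(obs, q_head=lambda h: [h, 2 * h, -h])
```

The key design point is that the hidden state, not a flat concatenation, summarizes the earlier cycles, which is what lets the network capture temporal correlations.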

4.4. Reducing Training Data by Transfer Learning

With deep reinforcement learning, we can obtain a Q-function that outputs reward scores for all the possible actions under a certain state, and we can then choose the cell with the largest score during cell selection. However, the Q-function learning algorithms described in the previous sections may need a large amount of training data, which also incurs collection cost for MCS organizers (an organizer needs to conduct a preliminary study on the target sensing area to collect data from every cell for a short time). Then, can we reduce the amount of training data under certain circumstances?

In reality, many types of data have inter-data correlations, e.g., temperature and humidity (Wang et al., 2018c). Hence, if there are multiple correlated sensing tasks in a target area, the cell selection strategy learned for one task can probably benefit another. With this intuition, we present a transfer learning method for learning the Q-function of an MCS task (the target task) with the help of the cell selection strategy learned from another correlated task (the source task). We assume that the source task has adequate training data, while the target task has only a small amount of training data. Inspired by the fine-tuning techniques widely used in image processing with deep neural networks, for training the Q-function of the target task we initialize the parameters of its DRQN to the parameter values of the source task's DRQN (learned from the source task's adequate training data). Then, we use the limited training data of the target task to continue the DRQN learning process (Algorithm 2). In this way, we can reduce the amount of training data required for obtaining a good cell selection strategy for the target task.
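The fine-tuning transfer described above amounts to two steps: copy the source-task parameters into the target-task network, then continue training on the target task's small dataset. In the sketch below a linear model and plain gradient descent on a toy quadratic loss stand in for the DRQN and Algorithm 2; all shapes and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
source_params = {"W": rng.standard_normal((4, 4)), "b": rng.standard_normal(4)}

# Step 1: initialise the target-task network from the source-task parameters.
target_params = {k: v.copy() for k, v in source_params.items()}

# Toy target-task training set (e.g. only a few cycles of data).
X = rng.standard_normal((10, 4))
y = rng.standard_normal((10, 4))

def mse(params):
    pred = X @ params["W"] + params["b"]
    return float(((pred - y) ** 2).mean())

loss_before = mse(target_params)

# Step 2: continue training on the limited target-task data.
for _ in range(50):
    pred = X @ target_params["W"] + target_params["b"]
    target_params["W"] -= 0.1 * X.T @ (pred - y) / len(X)
    target_params["b"] -= 0.1 * (pred - y).mean(axis=0)

loss_after = mse(target_params)
changed = not np.allclose(source_params["W"], target_params["W"])
```

The source-task parameters are left untouched, so the same warm start can be reused for several correlated target tasks.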

5. Evaluation

In this section, we conduct extensive experiments based on two real-life datasets, which include various types of sensed data in representative MCS applications, such as temperature, humidity, and air quality.

5.1. Datasets

Same as previous Sparse MCS literature (Wang et al., 2018c, 2015), we adopt two real-life datasets, Sensor-Scope (Ingelrest et al., 2010) and U-Air (Zheng et al., 2013), to evaluate the performance of our proposed cell selection algorithm DR-Cell. These two datasets contain various types of sensed data, including temperature, humidity, and air quality. The detailed statistics of the two datasets are listed in Table 1. Although these sensed data are collected from sensor networks or static stations, mobile devices can also obtain such data (as in (Devarakonda et al., 2013; Hasenfratz et al., 2012)). We therefore treat them as data sensed by smartphones and use these datasets in our experiments to show the effectiveness of DR-Cell.

                   Sensor-Scope             U-Air
City               Lausanne (Switzerland)   Beijing (China)
Data               temperature, humidity    PM2.5
Cell size (m)      50 × 30                  1000 × 1000
Number of cells    57                       36
Cycle length (h)   0.5                      1
Duration (days)    7                        11
Error metric       mean absolute error      classification error
Mean ± Std.        (temperature)            (PM2.5)
Table 1. Statistics of Two Evaluation Datasets

Sensor-Scope (Ingelrest et al., 2010): The Sensor-Scope dataset contains various environmental readings, including temperature and humidity. The sensed data are collected from the EPFL campus. We first divide the target area into 100 cells of 50 m × 30 m each, and find that 57 of the 100 cells are deployed with valid sensors. Hence, we use the sensed data at these 57 cells to evaluate our algorithms. We use the mean absolute error to measure the inference error.

U-Air (Zheng et al., 2013): The U-Air dataset includes air quality readings from Beijing. Same as (Zheng et al., 2013), we split Beijing into cells of 1 km × 1 km each; 36 of these cells have sensed air quality readings. With this dataset, we conduct the PM2.5 sensing experiment and try to infer the air quality index category of unsensed cells (six categories (Zheng et al., 2013): Good (0-50), Moderate (51-100), Unhealthy for Sensitive Groups (101-150), Unhealthy (151-200), Very Unhealthy (201-300), and Hazardous (>300)). The inference error is measured by the classification error.
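For concreteness, the six-category binning above can be written as a small helper; the function name and exact boundary handling are ours, not U-Air's.

```python
def aqi_category(aqi):
    """Map an air-quality index value to its category label."""
    bounds = [(50, "Good"), (100, "Moderate"),
              (150, "Unhealthy for Sensitive Groups"), (200, "Unhealthy"),
              (300, "Very Unhealthy")]
    for upper, label in bounds:
        if aqi <= upper:
            return label
    return "Hazardous"
```

The classification error then counts the share of unsensed cells whose inferred category differs from the ground-truth category.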

5.2. Baseline Algorithms

We compare DR-Cell to two existing cell selection methods: QBC and RANDOM.


QBC (Wang et al., 2018c): Based on research in active learning on matrix completion, Wang et al. present an intuitive method called the Query-By-Committee based cell selection algorithm. QBC selects the salient cell determined by a "committee" to allocate the next task. More specifically, QBC uses several different data inference algorithms, such as compressive sensing and K-Nearest Neighbors, to infer the full sensing matrix; it then allocates the next task to the cell with the largest variance among the values inferred by the different algorithms.
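A minimal sketch of the QBC idea with a two-member "committee": each member fills the missing entries of the sensing matrix differently (here, row means vs. column means, which are simple stand-ins for compressive sensing and KNN), and the next task goes to the unsensed cell where the members disagree most. The data and fillers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.random((5, 6))               # ground-truth cells x cycles matrix
mask = rng.random(M.shape) < 0.7     # True where a cell was actually sensed
mask[0, 0] = False                   # ensure at least one unsensed cell
obs = np.where(mask, M, np.nan)

def fill_row_mean(x):
    """Committee member 1: fill missing entries with each row's mean."""
    filled = x.copy()
    for i in range(x.shape[0]):
        row = x[i]
        fallback = np.nanmean(x)     # global mean in case a row is empty
        mean = np.nanmean(row) if not np.all(np.isnan(row)) else fallback
        filled[i, np.isnan(row)] = mean
    return filled

def fill_col_mean(x):
    """Committee member 2: fill missing entries with each column's mean."""
    return fill_row_mean(x.T).T

committee = np.stack([fill_row_mean(obs), fill_col_mean(obs)])
variance = committee.var(axis=0)     # per-entry disagreement of the committee
variance[mask] = -np.inf             # only unsensed cells are candidates
next_cell = np.unravel_index(int(np.argmax(variance)), variance.shape)
```

With real inference algorithms on the committee, high disagreement marks the cells whose values are hardest to infer, which is exactly where sensing next is most informative.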

RANDOM: In each sensing cycle, RANDOM will randomly select cells one by one until the selected cells can ensure a satisfying inference accuracy.
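The RANDOM baseline can be sketched as a loop over a quality-check callback; the callback is stubbed here, whereas in the paper it is the leave-one-out Bayesian inference test.

```python
import random

def random_select(cells, quality_ok, rng=random.Random(0)):
    """Randomly pick cells one by one until quality_ok(selected) is True."""
    remaining, selected = list(cells), []
    while remaining and not quality_ok(selected):
        cell = remaining.pop(rng.randrange(len(remaining)))
        selected.append(cell)
    return selected

# Stub: pretend quality is satisfied once 4 cells have been sensed.
picked = random_select(range(10), lambda sel: len(sel) >= 4)
```

Each drawn cell is removed from the candidate pool, so RANDOM never selects the same cell twice within a cycle.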

5.3. Experiment Process

To learn DR-Cell, we use the first two days of data in each dataset to train the Q-function, i.e., we suppose that the MCS organizers conduct a 2-day preliminary study to collect data from all the cells of the target area. After the 2-day training stage, we enter the testing stage, where we use the trained Q-function to obtain the reward of every possible action under the current state and then choose the action (i.e., cell) with the largest reward score. During the testing stage, we use the leave-one-out Bayesian inference method to ensure (ε, p)-quality, same as previous Sparse MCS literature (Wang et al., 2018c). The objective is to select as few cells as possible under the quality guarantee, so we compare the number of cells selected by DR-Cell and the baseline methods to verify the effectiveness of DR-Cell.
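A hedged sketch of a leave-one-out quality check in this spirit: each sensed cell is held out in turn and re-inferred from the others (a simple mean-based inference stands in for the paper's Bayesian method), and the fraction of held-out cells falling within the error bound estimates whether (ε, p)-quality is met.

```python
import numpy as np

def loo_quality_ok(sensed_values, epsilon, p):
    """Estimate whether (epsilon, p)-quality holds via leave-one-out checks."""
    vals = np.asarray(sensed_values, dtype=float)
    hits = 0
    for i in range(len(vals)):
        inferred = np.delete(vals, i).mean()  # stand-in for Bayesian inference
        hits += abs(inferred - vals[i]) <= epsilon
    return hits / len(vals) >= p

ok = loo_quality_ok([20.0, 20.1, 19.9, 20.0], epsilon=0.5, p=0.9)
bad = loo_quality_ok([10.0, 30.0, 20.0, 5.0], epsilon=0.5, p=0.9)
```

Tight readings pass a 0.5-degree bound (ok is truthy) while widely scattered readings fail it (bad is falsy), which is how the selection loop decides to stop or keep sensing.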

5.4. Experiment Results

Figure 6. Number of selected cells for temperature and PM2.5 sensing tasks.

We first evaluate the performance by using the temperature data in Sensor-Scope and the PM2.5 data in U-Air, respectively. The results are shown in Figure 6.

In the temperature scenario of Sensor-Scope, for the predefined (ε, p)-quality, we set the error bound ε and set p to 0.9 or 0.95; this quality requirement means that the inference error should be smaller than ε in around 90% or 95% of cycles. Figure 6 (leftmost part) shows the average number of selected cells per sensing cycle. DR-Cell always outperforms the two baseline methods. More specifically, when p = 0.9, DR-Cell selects fewer cells than both QBC and RANDOM: it only needs to select 12.84 out of 57 cells per sensing cycle to keep the inference error below ε in 90% of cycles. When we raise the quality requirement to p = 0.95, DR-Cell needs to select more cells to satisfy the higher requirement. In particular, DR-Cell selects 15.08 out of 57 cells under (ε, 0.95)-quality and still achieves better performance by selecting fewer cells than QBC and RANDOM. For the PM2.5 scenario in U-Air, we set the error bound ε and p likewise and obtain similar observations, shown in Figure 6 (rightmost part): when p is 0.9/0.95, DR-Cell selects 9.0/12.5 out of 36 cells, fewer than both QBC and RANDOM.

We then conduct experiments on the multi-task MCS scenario in Sensor-Scope, i.e., temperature-humidity monitoring, to verify the transfer learning performance. We conduct two-way experiments: temperature as the source task and humidity as the target task, and vice versa. More specifically, for the source task we still suppose that we obtain 2-day data for training; but for the target task we suppose that we obtain only 10 cycles (i.e., 5 hours) of training data. Moreover, we add two comparison methods to verify the effectiveness of our transfer learning method: NO-TRANSFER and SHORT-TRAIN. NO-TRANSFER directly applies the Q-function of the source task to the target task, and SHORT-TRAIN trains the target-task model only on the 10-cycle training data.

Figure 7. Number of selected cells for temperature and humidity sensing tasks (transfer learning).

The quality requirement is (ε, p)-quality for both temperature and humidity. Figure 7 shows the average numbers of selected cells. When temperature is the target task, TRANSFER achieves the best performance, selecting fewer cells than NO-TRANSFER, SHORT-TRAIN, and RANDOM. When humidity is the target task, TRANSFER similarly selects fewer cells than NO-TRANSFER, SHORT-TRAIN, and RANDOM. Note that NO-TRANSFER and SHORT-TRAIN even perform worse than RANDOM in this case, which emphasizes the importance of having an adequate amount of training data for DR-Cell. By using transfer learning, we can significantly reduce the training data required for learning a good Q-function in DR-Cell, thus further reducing the data collection costs of MCS organizers.

Finally, we report the computation time of DR-Cell. Our experiment platform is equipped with an Intel Xeon E5-2630 v4 CPU @ 2.20 GHz and 32 GB RAM. We implement the DR-Cell training algorithm in TensorFlow (CPU version). In our experiment scenarios, training takes around 2–4 hours, which is acceptable in real-life deployments since training is an offline process.

6. Conclusion

In this paper, to improve the cell selection efficiency in Sparse MCS, we propose a novel Deep Reinforcement learning based Cell selection mechanism, namely DR-Cell. We properly model the three key concepts in reinforcement learning, i.e., state, action, and reward, and then propose a deep recurrent Q-network with LSTM to learn the Q-function that can output the reward score given an arbitrary state-action pair. Then, under a certain state, we can choose the cell with the largest reward score as the next cell for sensing. Furthermore, we propose a transfer learning method to reduce the amount of training data required for learning the Q-function, if there are multiple correlated MCS tasks conducted in the same target area. Experiments on various real sensing datasets verify the effectiveness of DR-Cell in reducing the data collection costs.

In our future work, we will study how to conduct reinforcement learning based cell selection in an online manner, so that a preliminary study stage for collecting training data is no longer needed. Besides, we will also consider the case where the data collection costs of different cells are diverse. Finally, we will consider extending our mechanism to multi-task allocation scenarios where heterogeneous tasks are conducted simultaneously (Wang et al., 2017a; Wang et al., 2018b) and to privacy-preserving scenarios where participant privacy protection mechanisms are applied (Wang et al., 2016b, 2017b, 2018a).


  • Ba et al. (2014) Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. 2014. Multiple Object Recognition with Visual Attention. Computer Science (2014).
  • Candès and Recht (2009) Emmanuel J Candès and Benjamin Recht. 2009. Exact matrix completion via convex optimization. Foundations of Computational mathematics 9, 6 (2009), 717.
  • Devarakonda et al. (2013) Srinivas Devarakonda, Parveen Sevusu, Hongzhang Liu, Ruilin Liu, Liviu Iftode, and Badri Nath. 2013. Real-time air quality monitoring through mobile sensing in metropolitan areas. In Proceedings of the 2nd ACM SIGKDD international workshop on urban computing. ACM, 15.
  • Donoho (2006) David L Donoho. 2006. Compressed sensing. IEEE Transactions on information theory 52, 4 (2006), 1289–1306.
  • Foerster et al. (2016) Jakob N. Foerster, Yannis M. Assael, Nando de Freitas, and Shimon Whiteson. 2016. Learning to Communicate to Solve Riddles with Deep Distributed Recurrent Q-Networks. CoRR abs/1602.02672 (2016). arXiv:1602.02672 http://arxiv.org/abs/1602.02672
  • Ganti et al. (2011) Raghu K Ganti, Fan Ye, and Hui Lei. 2011. Mobile crowdsensing: current state and future challenges. IEEE Communications Magazine 49, 11 (2011).
  • Gu et al. (2017) Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. 2017. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 3389–3396.
  • Guo et al. (2015) Bin Guo, Zhu Wang, Zhiwen Yu, Yu Wang, Neil Y Yen, Runhe Huang, and Xingshe Zhou. 2015. Mobile crowd sensing and computing: The review of an emerging human-powered sensing paradigm. ACM Computing Surveys (CSUR) 48, 1 (2015), 7.
  • Hasenfratz et al. (2012) David Hasenfratz, Olga Saukh, Silvan Sturzenegger, and Lothar Thiele. 2012. Participatory air pollution monitoring using smartphones. Mobile Sensing 1 (2012), 1–5.
  • Hausknecht and Stone (2015) Matthew J. Hausknecht and Peter Stone. 2015. Deep Recurrent Q-Learning for Partially Observable MDPs. CoRR abs/1507.06527 (2015). arXiv:1507.06527 http://arxiv.org/abs/1507.06527
  • Ingelrest et al. (2010) Francois Ingelrest, Guillermo Barrenetxea, Gunnar Schaefer, Martin Vetterli, Olivier Couach, and Marc Parlange. 2010. SensorScope: Application-specific sensor network for environmental monitoring. ACM Transactions on Sensor Networks 6, 2 (2010), 1–32.
  • Lample and Chaplot (2016) Guillaume Lample and Devendra Singh Chaplot. 2016. Playing FPS Games with Deep Reinforcement Learning. In AAAI Conference on Artificial Intelligence.
  • Levine et al. (2015) Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. 2015. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research 17, 1 (2015), 1334–1373.
  • Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing Atari with Deep Reinforcement Learning. Computer Science (2013).
  • Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529.
  • Rana et al. (2010) Rajib Kumar Rana, Chun Tung Chou, Salil S Kanhere, Nirupama Bulusu, and Wen Hu. 2010. Ear-phone: an end-to-end participatory urban noise mapping system. In Proceedings of the 9th ACM/IEEE International Conference on Information Processing in Sensor Networks. ACM, 105–116.
  • Silver et al. (2016) David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, and Marc Lanctot. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587 (2016), 484.
  • Silver et al. (2017) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. 2017. Mastering the game of go without human knowledge. Nature 550, 7676 (2017), 354.
  • Sutton and Barto (2005) R Sutton and A Barto. 2005. Reinforcement Learning: An Introduction. MIT Press. 90–127 pages.
  • Wang et al. (2017a) Jiangtao Wang, Yasha Wang, Daqing Zhang, Feng Wang, Yuanduo He, and Liantao Ma. 2017a. PSAllocator: multi-task allocation for participatory sensing with sensing capability constraints. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing. ACM, 1139–1151.
  • Wang et al. (2018b) Jiangtao Wang, Yasha Wang, Daqing Zhang, Feng Wang, Haoyi Xiong, Chao Chen, Qin Lv, and Zhaopeng Qiu. 2018b. Multi-Task Allocation in Mobile Crowd Sensing with Individual Task Quality Assurance. IEEE Transactions on Mobile Computing (2018).
  • Wang et al. (2018a) Leye Wang, Gehua Qin, Dingqi Yang, Xiao Han, and Xiaojuan Ma. 2018a. Geographic Differential Privacy for Mobile Crowd Coverage Maximization. In AAAI.
  • Wang et al. (2017b) Leye Wang, Dingqi Yang, Xiao Han, Tianben Wang, Daqing Zhang, and Xiaojuan Ma. 2017b. Location privacy-preserving task allocation for mobile crowdsensing with differential geo-obfuscation. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 627–636.
  • Wang et al. (2015) Leye Wang, Daqing Zhang, Animesh Pathak, Chao Chen, Haoyi Xiong, Dingqi Yang, and Yasha Wang. 2015. CCS-TA: quality-guaranteed online task allocation in compressive crowdsensing. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 683–694.
  • Wang et al. (2016a) Leye Wang, Daqing Zhang, Yasha Wang, Chao Chen, Xiao Han, and Abdallah M’hamed. 2016a. Sparse mobile crowdsensing: challenges and opportunities. IEEE Communications Magazine 54, 7 (2016), 161–167.
  • Wang et al. (2016b) Leye Wang, Daqing Zhang, Dingqi Yang, Brian Y Lim, and Xiaojuan Ma. 2016b. Differential location privacy for sparse mobile crowdsensing. In Data Mining (ICDM), 2016 IEEE 16th International Conference on. IEEE, 1257–1262.
  • Wang et al. (2018c) Leye Wang, Daqing Zhang, Dingqi Yang, Animesh Pathak, Chao Chen, Xiao Han, Haoyi Xiong, and Yasha Wang. 2018c. SPACE-TA: Cost-Effective Task Allocation Exploiting Intradata and Interdata Correlations in Sparse Crowdsensing. ACM Transactions on Intelligent Systems and Technology 9, 2 (2018), 1–28.
  • Xiao et al. (2017a) Liang Xiao, Tianhua Chen, Caixia Xie, Huaiyu Dai, and Vincent Poor. 2017a. Mobile Crowdsensing Games in Vehicular Networks. IEEE Transactions on Vehicular Technology PP, 99 (2017), 1–1.
  • Xiao et al. (2017b) Liang Xiao, Yanda Li, Guoan Han, Huaiyu Dai, and H. Vincent Poor. 2017b. A Secure Mobile Crowdsensing Game with Deep Reinforcement Learning. IEEE Transactions on Information Forensics & Security PP, 99 (2017), 1–1.
  • Xiong et al. (2015) Haoyi Xiong, Daqing Zhang, Leye Wang, and Hakima Chaouchi. 2015. EMC 3: Energy-efficient data transfer in mobile crowdsensing under full coverage constraint. IEEE Transactions on Mobile Computing 14, 7 (2015), 1355–1368.
  • Xu et al. (2015) Liwen Xu, Xiaohong Hao, Nicholas D Lane, Xin Liu, and Thomas Moscibroda. 2015. More with less: Lowering user burden in mobile crowdsourcing through compressive sensing. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 659–670.
  • Zhang et al. (2014) Daqing Zhang, Leye Wang, Haoyi Xiong, and Bin Guo. 2014. 4W1H in mobile crowd sensing. IEEE Communications Magazine 52, 8 (2014), 42–48.
  • Zheng et al. (2013) Yu Zheng, Furui Liu, and Hsun Ping Hsieh. 2013. U-Air: when urban air quality inference meets big data. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1436–1444.
  • Zhu et al. (2013) Yanmin Zhu, Zhi Li, Hongzi Zhu, Minglu Li, and Qian Zhang. 2013. A compressive sensing approach to urban traffic estimation with probe vehicles. IEEE Transactions on Mobile Computing 12, 11 (2013), 2289–2302.