Deep Reinforcement Learning (DRL): Another Perspective for Unsupervised Wireless Localization

04/09/2020 ∙ by You Li, et al. ∙ University of Calgary, Wuhan University

Location is key to spatialize internet-of-things (IoT) data. However, it is challenging to use low-cost IoT devices for robust unsupervised localization (i.e., localization without training data that have known location labels). Thus, this paper proposes a deep reinforcement learning (DRL) based unsupervised wireless-localization method. The main contributions are as follows. (1) This paper proposes an approach to model a continuous wireless-localization process as a Markov decision process (MDP) and process it within a DRL framework. (2) To alleviate the challenge of obtaining rewards when using unlabeled data (e.g., daily-life crowdsourced data), this paper presents a reward-setting mechanism, which extracts robust landmark data from unlabeled wireless received signal strengths (RSS). (3) To ease requirements for model re-training when using DRL for localization, this paper uses RSS measurements together with agent location to construct DRL inputs. The proposed method was tested by using field testing data from multiple Bluetooth 5 smart ear tags in a pasture. Meanwhile, the experimental verification process reflected the advantages and challenges for using DRL in wireless localization.




I Introduction

The internet-of-things (IoT) technology has started to empower the future of numerous fields [1]. To spatialize IoT data, the time and location information of IoT devices are essential. Thus, localization is both an important application scenario and a development direction for IoT.

IoT localization methods have been widely researched. There are technologies including wireless [2], motion, and environmental [3] sensor based localization, as well as their integration [4]. During the recent decade, the development in IoT technologies and the emergence of geo-spatial big data have made it possible to implement wide-area mass-market localization by using crowdsourced data. However, the performance of such mass-market localization techniques may be degraded by various factors, such as the complexity of localization environment [5], the existence of device diversity [6], and the uncertainty in crowdsourced data [7]. Thus, it is still an open challenge to use low-cost IoT devices for robust localization.

I-A Deep-Learning-Based Localization

The development of deep-learning (DL) techniques has led to the emergence of new localization methods. Examples of such methods include localization using deep neural networks (DNN) [8], Gaussian processes (GP) [9], random forests [10], hidden Markov models (HMM) [11], support vector machines (SVM) [12], and fuzzy logic [13]. These DL techniques have also been used in other localization-related aspects. For example, DNNs have been used for localization parameter tuning [14], activity recognition [15], and localization uncertainty prediction [16].

DL algorithms have shown great potential in enhancing localization, especially in complex scenarios that are difficult to model, have parameters that are difficult to set, and have nonlinear and correlated measurements. However, most of the existing DL-based localization methods are supervised methods. That is, these methods require training data that have known location labels. The acquisition of location labels is commonly time-consuming and labor-intensive [17]. Meanwhile, the accuracy of location labels is degraded by factors such as device diversity [6], device motion and orientation [18], and database outage [19]. Thus, unsupervised localization methods are needed to reduce reliance on location labels.

I-B Unsupervised Localization

To realize unsupervised localization, researchers have proposed various methods, such as simultaneous localization and mapping (SLAM) [20] and crowdsourcing [21]. In such methods, the uncertainty in reference point (RP) location labels directly leads to errors in the generated localization databases. Inertial-sensor-based dead-reckoning (DR) can provide autonomous indoor/outdoor localization solutions [22]. However, it is challenging to obtain long-term accurate DR solutions with low-cost sensors due to the requirement for heading and position initialization, the misalignment angles between the human body and the device, and the existence of sensor errors [23]. Thus, constraints are needed to limit DR drifts. Vehicle-motion constraints (e.g., zero velocity updates, zero angular rate updates, and non-holonomic constraints) [4] are typically used to correct for DR errors. However, these motion constraints are relative constraints, which can only mitigate the accumulation of DR errors rather than eliminate them. DR solutions always drift when external updates, such as loop closures and global navigation satellite systems (GNSS) positions, are not available.

In many crowdsourcing applications, it is difficult to assure the reliability of RP locations in the database due to limitations of the physical environment. For example, there may be insufficient observations for wireless localization. If this is the case, it is important to evaluate the quality of localization data, so as to select the robust ones. From the big-data perspective, a small proportion of crowdsourced data, if robust, is enough for database training. The research in [24] presents a general framework for assessing sensor data quality. The evaluation framework involves the impact of indoor localization time, user motion, and sensor biases. Furthermore, the research in [7] enhances this framework and introduces stricter quality-assessment criteria.

Compared to these works, this research is carried out from another perspective. The widely studied deep reinforcement learning (DRL) technique is applied. DRL has been shown to have the following advantages [25] in other areas: (1) it can be used for unsupervised learning through an action-reward mechanism and (2) it can provide not only the estimated solution at the current moment, but also the long-term reward. Thus, it may bring benefits to the localization field.

I-C DRL in Navigation and Localization

DRL, which is the core artificial intelligence (AI) algorithm for AlphaGo, has attracted intensive attention. DRL can be regarded as a combination of DL and reinforcement learning (RL). The former provides learning mechanisms, while the latter sets goals for learning. In general, DRL involves agents that observe states and act in order to collect long-term rewards [26]. The DRL algorithm has experienced stages such as the deep Q-network (DQN), asynchronous advantage actor-critic (A3C), and unsupervised reinforcement and auxiliary learning (UNREAL) [27]. The research in [25] points out three components for a DRL solution: basis/core (e.g., the definition of states, actions, and reward function), basic units (e.g., the Q-network, action selection, replay memory, and target network), and state reformulation (i.e., the method for state-awareness data processing).

Abbreviation   Definition
AI             Artificial Intelligence
A3C            Asynchronous Advantage Actor-Critic
BT5            Bluetooth 5
CDF            Cumulative Distribution Function
DL             Deep Learning
DNN            Deep Neural Network
DQN            Deep Q-Network
DR             Dead-Reckoning
DRL            Deep Reinforcement Learning
GNSS           Global Navigation Satellite Systems
GP             Gaussian Processes
GW             Gateway
HMM            Hidden Markov Model
ID             IDentification
IoT            Internet of Things
LF             Localization Feature
MDP            Markov Decision Process
NLoS           Non-Line-of-Sight
RL             Reinforcement Learning
RMS            Root Mean Square
RP             Reference Point
RSS            Received Signal Strength
SLAM           Simultaneous Localization And Mapping
SVM            Support Vector Machine
UNREAL         UNsupervised REinforcement and Auxiliary Learning
TABLE I: List of abbreviations

Navigation is an important application scenario for DRL. The early-stage DRL algorithms were used for learning in video games. Such gaming applications require navigation actions in a virtual world. Another classic scenario for DRL research is maze navigation [28]. The research in [29] provides a deep investigation of the performance of DRL-based maze navigation. Furthermore, the research on DRL-based navigation has been extended from virtual to real-world environments. Researchers have utilized DRL for navigation by using data from various types of sensors, such as camera [30], lidar [31], 360-degree camera [32], Google street view [33], wireless sensors [34], and magnetic sensors [35]. Meanwhile, other data or techniques, such as topological maps [36], particles [37], cooperative agents [38], and social interactions [39], have been involved. The latest directions for DRL-based navigation include mapless navigation [31][32][40], navigation in new [40] and complex [28] environments, and navigation with varying targets [41].

I-D Problem Statement and Main Contributions

The DRL-based approaches have been verified to be effective in navigation. However, most of these methods are not suitable for localization. Although navigation and localization are not separated in many applications, they have different principles. Navigation and localization may use the same input (e.g., signals from wireless, image, and environmental sensors) but have different outputs. Navigation is the problem of finding the optimal path between the agent (i.e., an IoT end-device) and a target place; thus, its output is the moving action. In contrast, the output for localization is the agent location. The challenges for using DRL for localization include:

  • In a navigation application, the agent chooses an action from the DRL engine and moves. The action directly changes the state (i.e., the agent location). Thus, a navigation process can be modeled as a Markov decision process (MDP) and can be processed by DRL. However, localization is closer to a DL problem than to a DRL problem.

  • The existing DRL-based localization methods (e.g., [34]) require target points, which are necessary for setting rewards. To obtain such target points, supervised or semi-supervised data are needed.

  • The existing target-dependent navigation and localization methods suffer from another issue; that is, the trained model is tied to the target. When the target changes, re-training may be needed. This phenomenon also limits the use of DRL in localization.

  • Most of the existing works are based on vision data. Thus, it is necessary to investigate the use of DRL in wireless positioning, which is the most widely used technology in IoT localization.

Therefore, the main contributions of this paper can be stated as follows.

  • It is difficult to use DRL in traditional snapshot localization models because these models do not meet the MDP definition. Thus, this paper proposes a method to model a continuous wireless localization process as an MDP and process it within a DRL framework.

  • It is challenging to obtain DRL rewards when using only unsupervised data (e.g., daily-life crowdsourced data). To alleviate this issue, this paper presents a reward-setting mechanism for unsupervised wireless localization. Robust landmark data are extracted automatically from unlabeled wireless received signal strengths (RSS).

  • To ease requirements for model re-training when using DRL for localization, this paper uses RSS measurements together with the agent location to construct the input for DRL. Thus, it is not necessary to re-train DRL when the target changes.

II DRL-based Unsupervised Wireless Localization

This section describes the methodology of DRL-based unsupervised wireless localization. Specifically, this section comprises the problem description, the construction of the MDP model for wireless localization, the details of the DQN algorithm, and the mechanism for reward setting.

II-A Problem Description

The purpose of localization is to determine the agent position in a spatial coordinate system. The agent position can also be represented by gridding the space and determining the identification (ID) of the grid in which the agent is located. To determine the agent location, surrounding localization features (LFs) such as RSS are measured. The fingerprinting method is commonly used for localization through two steps, training and prediction. At the training step, [location, LF] fingerprints at multiple RPs are used to generate a database. At the prediction step, the likelihood value between the real-time measured LF vector and the reference LF vector at each RP in the database is computed. The RPs with the LFs that are closest to the measured one are selected to compute the agent location [42]. From this perspective, localization is a DL problem, which inputs LF measurements and outputs the RP ID.
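
The prediction step of fingerprinting can be sketched as follows. This is a minimal illustration, assuming a weighted k-nearest-neighbor match with Euclidean distance in LF space; the function name `knn_fingerprint`, the toy database, and the choice of `k` are hypothetical and are not taken from [42].

```python
import numpy as np

def knn_fingerprint(db_locations, db_lfs, measured_lf, k=3):
    """Estimate a location from the k RPs whose reference LFs are closest.

    db_locations: (N, 2) RP coordinates in meters.
    db_lfs:       (N, M) reference LF (e.g., RSS) vectors, one row per RP.
    measured_lf:  (M,) real-time LF measurement.
    """
    db_locations = np.asarray(db_locations, dtype=float)
    db_lfs = np.asarray(db_lfs, dtype=float)
    # Euclidean distance in LF space between the measurement and every RP.
    lf_dist = np.linalg.norm(db_lfs - np.asarray(measured_lf, dtype=float), axis=1)
    nearest = np.argsort(lf_dist)[:k]
    # Inverse-distance weights; the small constant avoids division by zero.
    weights = 1.0 / (lf_dist[nearest] + 1e-6)
    return weights @ db_locations[nearest] / weights.sum()

# Toy database (assumed values): four RPs, RSS from three GWs.
rp_xy = [[0, 0], [5, 0], [0, 5], [5, 5]]
rp_rss = [[-50, -70, -70], [-60, -55, -75], [-60, -75, -55], [-70, -60, -60]]
estimate = knn_fingerprint(rp_xy, rp_rss, [-50, -70, -70], k=1)
```

With `k=1` the estimate is simply the RP whose reference RSS vector matches the measurement best; larger `k` averages over several candidate RPs.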

The fingerprinting method provides a snapshot localization solution. Its advantage is that a location output can be obtained once a real-time LF measurement is input. For dynamic localization applications, a common approach is to further input the snapshot localization solution into a localization filter (e.g., an extended Kalman filter or particle filter) to generate a more robust solution by fusing the previous location solutions. In the filter, the snapshot localization solutions are position updates, while sensor-based DR data or pseudo motion constraints (e.g., the constant-velocity assumption) are used to construct the system motion model.


This research changes the wireless localization process by introducing the previous location solutions. Accordingly, wireless localization becomes a continuous localization problem. The localization task at time $t$ is the process that inputs the agent location at time $t-1$ plus the LF measurement at time $t$, and outputs the agent location at time $t$. After the localization computation at time $t$, the agent may remain static or move in one of the eight directions in Figure 1. Afterwards, the localization computation at time $t+1$ starts. In this case, the localization computation at each time step only depends on the location from the previous time step and the LF measurement at this time step; meanwhile, the action at each step directly changes the location state. Thus, this process can be modeled as an MDP.

Fig. 1: Architecture for DRL-based Wireless Localization. Red triangles indicate wireless gateway locations
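
The state transition described above can be sketched in a few lines. This is only an illustration: the grid dimensions are those reported for the field test in Section III-A, while the axis convention, the handling of moves at the grid boundary, and the names `ACTIONS`, `step`, and `make_state` are assumptions.

```python
N_ROWS, N_COLS = 16, 28  # grid dimensions reported for the field test

# (d_row, d_col) per action; the row index is assumed to grow toward the north.
ACTIONS = {
    "stay": (0, 0),
    "north": (1, 0), "south": (-1, 0), "east": (0, 1), "west": (0, -1),
    "northeast": (1, 1), "northwest": (1, -1),
    "southeast": (-1, 1), "southwest": (-1, -1),
}

def step(location, action):
    """Return the next grid cell; moves that would leave the grid are ignored."""
    row, col = location
    d_row, d_col = ACTIONS[action]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < N_ROWS and 0 <= new_col < N_COLS:
        return (new_row, new_col)
    return (row, col)

def make_state(previous_location, rss):
    """State = agent location from the previous step plus the current RSS."""
    return tuple(previous_location) + tuple(rss)

next_cell = step((3, 4), "northeast")
state = make_state(next_cell, [-62.0, -71.5])
```

Because the next location depends only on the current location and the chosen action, the sequence of locations satisfies the Markov property that the MDP formulation requires.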

II-B MDP Model

An MDP is a discrete-time stochastic control process. Its current state depends only on the latest previous state, rather than on earlier ones. In contrast to the Markov chain and HMM, the MDP involves actions, which directly influence states. An MDP is comprised of four components: states $\mathcal{S}$, actions $\mathcal{A}$, a reward function $R$, and transition probabilities $P(s_{t+1} \mid s_t, a_t)$ of moving from $s_t$ to $s_{t+1}$ given $a_t$, where $s_t$ and $a_t$ are the state and action at time step $t$, respectively. The goal for an MDP is to determine the policy $\pi$ that maximizes the expected accumulated rewards $\mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k}\right]$, where $r_t$ is the immediate reward at time step $t$ and $\gamma$ is the discount factor [43]. Figure 1 shows a schematic diagram for DRL-based wireless localization, including the state, action, and reward definitions. The details of the components in the figure are described in this subsection and Subsection II-C. To design an MDP for wireless localization, the following three components are defined.

States: the state $s_t$ represents an abstraction of the environment in which the agent makes action decisions at time $t$. The state consists of the agent location and the RSS measurement.

Actions: the agent makes decisions to take actions based on the state $s_t$. In this research, the action space consists of nine actions: staying at the same grid, and moving by one grid toward the north, south, west, east, northwest, northeast, southwest, or southeast.

Reward function: a positive reward is given when the agent has made a correct action. Theoretically, the geographical distance between the agent and the target point can be used for setting rewards [34]. This mechanism is effective when using supervised or semi-supervised data; however, it cannot be used to process unsupervised data. To alleviate this issue, a reward-setting mechanism is presented. The principle of this mechanism is to extract landmark points that have robust location labels and RSS features. When the agent has moved to a landmark point and the measured RSS has a feature similar to the known RSS feature at this landmark point, a positive reward is set.

A challenge for this mechanism is that it is difficult to know either the location or the RSS feature at a landmark point in advance. To alleviate this issue, the locations of wireless gateways (GWs, also known as access points or anchors) and the near-field condition are introduced. Specifically, the near-field condition is activated when it is detected that the agent has moved to a location that is close enough to a GW. Then, the distance between the predicted agent location and the location of this GW is used to set the reward. The method for detecting the near-field condition is described as follows.

One of the most widely used approaches for detecting the near-field condition is RSS ranging. The wireless signal path-loss model is widely used to convert an RSS to an agent-GW distance $d$ by

$$ d = 10^{\frac{A - \mathrm{RSS}}{10 n}} \quad (1) $$

where $n$ and $A$ are the path-loss-model parameters (the path-loss exponent and the RSS at a reference distance of 1 m, respectively). Although such parameters can be trained in advance [44], there are various factors (e.g., device diversity and orientation diversity [6]) that may cause variations in these parameters. This phenomenon degrades RSS-based ranging and localization accuracy. Thus, it is challenging to detect the near-field condition through RSS ranging.
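
The sensitivity of RSS ranging to the path-loss parameters can be illustrated numerically. The sketch below assumes the common log-distance form RSS = a - 10 n log10(d), with `n` the path-loss exponent and `a` the RSS at 1 m; the default values are the empirical values used for the multilateration comparison in Section III-C, and the function name and the 6 dB offset are hypothetical.

```python
def rss_to_distance(rss, n=2.0, a=-50.0):
    """Invert the log-distance model rss = a - 10 * n * log10(d); d in meters."""
    return 10.0 ** ((a - rss) / (10.0 * n))

# With the assumed parameters, an RSS of -70 corresponds to 10 m.
d_nominal = rss_to_distance(-70.0)
# If the reference RSS of a device is actually 6 dB lower (e.g., because of
# device or orientation diversity), the same reading maps to about 5 m.
d_shifted = rss_to_distance(-70.0, a=-56.0)
```

A parameter shift of a few dB therefore changes the range estimate by a factor of about two, which is why a fixed path-loss model is a fragile way to detect the near-field condition.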

To alleviate this issue, the following phenomenon is used: environmental and motion factors commonly weaken an RSS measurement rather than strengthen it. Therefore, a weak RSS measurement does not ensure a long distance; in contrast, a strong RSS can indicate a short distance. Accordingly, the near-field condition can be identified as: when the measured RSS from a GW is stronger than a threshold $\beta_{\mathrm{RSS}}$, the agent should be located near this GW. Then, for the $i$-th GW that satisfies the near-field condition, the reward can be set as

$$ r_t = \begin{cases} r^{+}, & d_{i,t} \le \beta_d \\ r^{-}, & d_{i,t} > \beta_d \end{cases} \quad (2) $$

where $r^{+} > 0$ and $r^{-} < 0$ denote the positive and negative reward values; $d_{i,t}$ is the distance between the location of the agent at time $t$ and that of the $i$-th GW; $\beta_d$ is the threshold for the distance between the predicted agent location and the location of the selected GW. The case $d_{i,t} > \beta_d$ indicates that the agent is wrongly located at a point that is far from the landmark point; thus, a negative reward is set.
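
The reward-setting mechanism can be sketched as below. All numeric values (the RSS threshold, the distance threshold, and the reward magnitudes) are placeholders, since they are not given here; selecting the GW with the strongest RSS and returning a neutral reward of zero when the near-field condition is not activated are likewise assumptions.

```python
import math

def near_field_reward(agent_xy, gw_xy, gw_rss, rss_threshold=-60.0,
                      dist_threshold=10.0, r_pos=1.0, r_neg=-1.0):
    """Reward from the near-field condition; all default values are assumed.

    agent_xy: predicted agent location (x, y) in meters.
    gw_xy:    list of GW locations (x, y) in meters.
    gw_rss:   RSS measured from each GW, in the same order as gw_xy.
    """
    # Select the GW with the strongest RSS (assumed selection rule).
    strongest = max(range(len(gw_rss)), key=lambda i: gw_rss[i])
    if gw_rss[strongest] <= rss_threshold:
        return 0.0  # near-field condition not activated (assumed neutral)
    d = math.dist(agent_xy, gw_xy[strongest])
    return r_pos if d <= dist_threshold else r_neg

gws = [(0.0, 0.0), (30.0, 0.0)]
reward_near = near_field_reward((3.0, 4.0), gws, [-55.0, -80.0])
reward_far = near_field_reward((30.0, 0.0), gws, [-55.0, -80.0])
reward_none = near_field_reward((3.0, 4.0), gws, [-75.0, -80.0])
```

Only the GW locations are needed as prior knowledge; no location label for the agent enters the reward, which is what keeps the scheme unsupervised.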

The states, actions, and reward function are further used for training of the DQN, which is described in the next subsection.

II-C Deep Q-Network for Wireless Localization

A core of DQN is Q-learning. The principle of Q-learning is to determine the function

$$ Q(\phi_t, a_t) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, \phi_t, a_t\right] \quad (3) $$

which can be used to compute the expected accumulated rewards for taking an action $a_t$ when there is a given input $\phi_t$, where $Q$ is the action-value function that maps the input to output decisions; $\phi_t$ is the state reformulation; $r_t$ is the immediate reward; and $\gamma$ is the discount factor. Once the function $Q$ is obtained, it becomes possible to construct a policy that maximizes the rewards by

$$ \pi(\phi_t) = \arg\max_{a} Q(\phi_t, a) \quad (4) $$

For applications (e.g., navigation in a simple grid maze) that have a simple state, matrix-based equations may be used to compute $Q$. For the task in this research, it is challenging to model the Q-learning process explicitly. Thus, a DNN is used to approximate $Q$. The DQN architecture in [25] is used. The DQN algorithm is shown in Table II.

Algorithm 1: The DQN algorithm (modified from [25])
Initialize replay memory $D$, Q-network $Q$ with parameters $\theta$, and target network $\hat{Q}$ with parameters $\theta^{-}$;
For time step $t$ in 1 to $T$:
     Observe observable state $s_t$ and set reward $r_t$;
     Generate state reformulation $\phi_t$;
     Stack the experience tuple into $D$;
     Compute available action set $\mathcal{A}_t$;
     With exploration probability $\epsilon$:
          Select a random action $a_t$ in $\mathcal{A}_t$;
     Otherwise:
          Select $a_t = \arg\max_{a \in \mathcal{A}_t} Q(\phi_t, a; \theta)$;
     Move agent by action $a_t$;
     Sample a minibatch of experience tuples from $D$;
     Compute target value $y_j$ through (6);
     Compute loss $L(\theta)$ through (5);
     Train Q-network $Q$ through SGD;
     Decrease exploration probability $\epsilon$;
     If $t$ modulo $G$ == 0:
          Update target network $\hat{Q}$ with $\theta^{-} \leftarrow \theta$;
End For loop
TABLE II: DQN Algorithm

During each mapping from the input to the output decision, the Q-network generates a result that consists of the current state $\phi_t$, the current action $a_t$, the instant reward $r_t$, and the next state $\phi_{t+1}$. Such a result is then stored into the replay memory $D$. The target network $\hat{Q}$ with parameters $\theta^{-}$ is copied from the Q-network every $G$ steps. At each step, a minibatch is sampled randomly from the replay memory and combined with the target network to compute the loss and train the Q-network.
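
The replay memory and the periodic target-network update can be sketched as follows. This is a minimal illustration rather than the implementation in [25]; the class and function names are hypothetical, and the default capacity and update period are the values listed in Table III.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity store of (phi, action, reward, phi_next) tuples."""

    def __init__(self, capacity=10000):
        # A bounded deque drops the oldest tuple once the capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, phi, action, reward, phi_next):
        self.buffer.append((phi, action, reward, phi_next))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement breaks temporal correlation.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

def should_update_target(t, period=100):
    """True when the target network should be copied from the Q-network."""
    return t % period == 0

memory = ReplayMemory(capacity=3)
for t in range(5):
    memory.push(phi=t, action=0, reward=0.0, phi_next=t + 1)
batch = memory.sample(2, rng=random.Random(0))
```

Keeping the target network fixed between updates means the regression targets change only every `period` steps, which is the usual motivation for this component of DQN.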

The replay memory $D$, which has a capacity of $N_D$, is created at the initialization step. Afterwards, each newly generated experience tuple is stacked into $D$. The Q-network is trained once the number of stored experience tuples reaches the replay start size $N_s$. For training, a minibatch of $N_b$ tuples is sampled randomly from $D$. Meanwhile, for each time step in training, the epsilon-greedy policy is used to select actions. The epsilon-greedy policy balances reward maximization based on already-known knowledge (i.e., exploitation) against new knowledge that is obtained by trying new actions (i.e., exploration). The exploration rate $\epsilon$ is decreased linearly from the initial value $\epsilon_{\mathrm{init}}$ to the final value $\epsilon_{\mathrm{final}}$ during training. For each experience tuple $(\phi_j, a_j, r_j, \phi_{j+1})$ within the sampled minibatch, the target network is used to compute the loss as

$$ L(\theta) = \mathbb{E}\left[\left(y_j - Q(\phi_j, a_j; \theta)\right)^{2}\right] \quad (5) $$

where the sign $\mathbb{E}[\cdot]$ represents the computation of the expectation value; $y_j$ is the target value, which can be calculated as

$$ y_j = r_j + \gamma \max_{a'} \hat{Q}(\phi_{j+1}, a'; \theta^{-}) \quad (6) $$

Once the loss value is computed, the stochastic gradient descent (SGD) method [45] is applied to train the Q-network. During the training process, the batch-normalization approach [46] is applied to accelerate training.
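
The exploration schedule and the loss computation can be illustrated with NumPy. This is a sketch under assumptions: the loss is taken to be the mean squared difference between the target value and the predicted Q-value of the action that was taken, the target is the standard DQN target computed with the target network, and the default values come from Table III. All function names are hypothetical.

```python
import numpy as np

def linear_epsilon(step, total_steps, eps_start=1.0, eps_end=0.05):
    """Exploration rate decreased linearly from eps_start to eps_end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return eps_start + frac * (eps_end - eps_start)

def epsilon_greedy(q_values, epsilon, rng):
    """Random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def dqn_targets(rewards, q_next_target, gamma=0.9):
    """y_j = r_j + gamma * max over a' of the target-network Q-values."""
    return np.asarray(rewards, dtype=float) + gamma * np.max(q_next_target, axis=1)

def dqn_loss(q_pred, actions, targets):
    """Mean squared error between y_j and Q(phi_j, a_j) over the minibatch."""
    q_taken = np.asarray(q_pred)[np.arange(len(actions)), actions]
    return float(np.mean((targets - q_taken) ** 2))

# Toy minibatch with two tuples and two actions (assumed values).
targets = dqn_targets([1.0, 0.0], np.array([[0.5, 2.0], [1.0, 0.0]]))
loss = dqn_loss(np.array([[2.8, 0.0], [0.0, 0.9]]), [0, 1], targets)
greedy = epsilon_greedy([0.1, 0.7, 0.2], 0.0, np.random.default_rng(0))
```

In a full implementation the loss would be minimized with respect to the Q-network parameters by a gradient-based optimizer; only the forward computation is shown here.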

III Experimental Verification

III-A Test Description

Field tests were carried out in a smart pasture in Inner Mongolia, China. The test area was an open field that had a size of 120 m by 70 m. The test scenario was similar to that in [6]. Figure 2 (a) shows the test environment and devices. A total of 48 Bluetooth 5 (BT5) based devices (i.e., smart ear tags) were utilized as transmitters, while 20 GWs were used as receivers. Both the devices and GWs were equipped with Texas Instruments CC2640R2F BT5 chips [47]. Each device was equipped with a microstrip patch antenna with a gain of 0 dBi, while each GW was equipped with a vertically polarized omni-directional antenna with a gain of 5 dBi. The GWs were deployed evenly over the space in 4 rows and 5 columns. The distances between adjacent GWs were approximately 30 m in the east direction and 24 m in the north direction.

Fig. 2: Test field and devices (a) and locations of grids and GWs (b)

The devices were placed at 950 static points on the ground, each for 5 minutes. The data rate for RSS measurements was 0.17 Hz. The data collection process was conducted through a supervised procedure; that is, each data sample had a reference location label. The location labels were used only for localization performance evaluation, not for localization computation. For this research, the location information in the collected data was evenly gridded into 448 grids (i.e., 16 rows and 28 columns, with each grid having a size of 5 m by 5 m); that is, each location was replaced by that of the nearest grid. Figure 2 (b) shows the locations of grids and GWs. Figure 3 illustrates the GW IDs and the RSS distribution heatmaps for the 20 GWs. The signal coverage range for all GWs reached over 50 m. Thus, the RSS measurements at all the grid points have data from over four GWs. Meanwhile, the RSS measurements from all GWs vary over space. These facts ensure the feasibility of using RSS measurements for localization.
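
The gridding step can be sketched as below. The grid dimensions and cell size are those stated above; the coordinate origin, the row-major grid ID, and the function name `snap_to_grid` are assumptions made for illustration.

```python
import numpy as np

GRID_SIZE = 5.0          # meters
N_ROWS, N_COLS = 16, 28  # 16 * 28 = 448 grids

def snap_to_grid(east, north, origin=(0.0, 0.0)):
    """Return (row, col, grid_id) of the grid point nearest to a location.

    Grid points are assumed to lie at multiples of GRID_SIZE from `origin`,
    and grid IDs are assumed to be assigned in row-major order.
    """
    col = int(np.floor((east - origin[0]) / GRID_SIZE + 0.5))
    row = int(np.floor((north - origin[1]) / GRID_SIZE + 0.5))
    # Locations outside the gridded area are assigned to the border grids.
    col = min(max(col, 0), N_COLS - 1)
    row = min(max(row, 0), N_ROWS - 1)
    return row, col, row * N_COLS + col

cell = snap_to_grid(12.4, 7.6)
```

Snapping to the nearest grid introduces a quantization error of at most half a grid diagonal, about 3.5 m for 5 m grids, which bounds the resolution of any grid-ID-based solution.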

In the test, approximately 2,000 RSS samples from each GW were collected at each grid point. Such gridded data were further used to generate dynamic localization data through random sampling. A total of 10,000 dynamic trajectories, each with a length of 300 steps, were generated. Accordingly, there were 3,000,000 actions in the generated training data. To generate each trajectory, a grid was randomly selected as the initial point. Then, the agent moved by one grid at each step by randomly selecting one of the nine actions in Subsection II-B. When the agent arrived at a grid, a set of RSS values was selected randomly from the 2,000 RSS samples from each GW and used as the RSS measurement at this step. Furthermore, to mitigate the effect of device diversity and orientation diversity, the RSS from GW 8 was selected as the datum to compute differential RSS [6]. Meanwhile, the orientation-compensation model in [6] was used to correct the RSS measurements.
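
The trajectory-generation procedure can be sketched as follows. This is an illustration under assumptions: the layout of the RSS pool array, the clipping of moves at the grid boundary, the zero-based index of the datum GW, and the synthetic pool in the usage lines are all hypothetical, and the orientation compensation of [6] is omitted.

```python
import numpy as np

N_ROWS, N_COLS, N_GW = 16, 28, 20
MOVES = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1),
         (1, 1), (1, -1), (-1, 1), (-1, -1)]  # stay plus eight directions

def generate_trajectory(rss_pool, n_steps, rng, datum_gw=7):
    """Random walk over the grid with RSS drawn from per-grid sample pools.

    rss_pool: array (N_ROWS, N_COLS, n_samples, N_GW) of collected RSS.
    datum_gw: zero-based index of the datum GW (GW 8, assuming one-based IDs).
    Returns the visited cells and the differential RSS at every step.
    """
    row, col = int(rng.integers(N_ROWS)), int(rng.integers(N_COLS))
    cells, diff_rss = [], []
    for _ in range(n_steps):
        d_row, d_col = MOVES[rng.integers(len(MOVES))]
        # Keep the agent inside the grid (assumed boundary handling).
        row = int(np.clip(row + d_row, 0, N_ROWS - 1))
        col = int(np.clip(col + d_col, 0, N_COLS - 1))
        # Draw one stored sample independently for each GW.
        idx = rng.integers(rss_pool.shape[2], size=N_GW)
        rss = rss_pool[row, col, idx, np.arange(N_GW)]
        cells.append((row, col))
        # Differential RSS relative to the datum GW; the datum entry is dropped.
        diff_rss.append(np.delete(rss - rss[datum_gw], datum_gw))
    return cells, np.array(diff_rss)

rng = np.random.default_rng(0)
pool = rng.normal(-70.0, 5.0, size=(N_ROWS, N_COLS, 50, N_GW))  # synthetic
cells, diff_rss = generate_trajectory(pool, 300, rng)
```

Subtracting the datum GW removes any offset that is common to all GWs for a given device, which is the motivation for differential RSS given in [6].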

Fig. 3: GW IDs and distribution heatmaps of RSS measurements

III-B DRL Training

The generated localization data were used to train the DRL model. The related parameters are listed in Table III.

Parameter                                       Value
Scenario parameters
Number of grid columns                          28
Number of grid rows                             16
Size of grids                                   5 m by 5 m
Number of actions                               9
Number of landmark points (i.e., GWs)           20
Algorithm parameters
Replay memory size                              10000
Replay start size                               2500
Minibatch size                                  200
Number of samples per target network update     100
Discount factor                                 0.9
DNN learning rate                               0.001
Initial exploration rate                        1.0
Final exploration rate                          0.05
TABLE III: Values of parameters in DRL

The data processing environment was Python 3.6 with the TensorFlow library [48]. A DNN with two hidden layers, each of which had 200 neurons, was used in the DQN. Running on a MacBook Pro with a 2.5 GHz Intel Core i7 processor and 16 GB of 1600 MHz DDR3 memory, the training took around 55 hours. Figure 4 shows the normalized loss function value over the training time period. It indicates that convergence needed around 1,000,000 samples.

Fig. 4: Convergence trend of loss value over training time period
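
For orientation, the size of such a Q-network can be illustrated with a plain NumPy forward pass. This is not the TensorFlow model used in the tests: the input dimension (a two-element grid location plus 19 differential RSS values), the ReLU activation, the weight initialization, and the function names are all assumptions.

```python
import numpy as np

def init_q_network(n_inputs, n_hidden=200, n_actions=9, seed=0):
    """Weights and biases for two hidden layers and a linear output layer."""
    rng = np.random.default_rng(seed)
    sizes = [n_inputs, n_hidden, n_hidden, n_actions]
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def q_values(params, phi):
    """Map a state reformulation phi to one Q-value per action."""
    h = np.asarray(phi, dtype=float)
    for i, (w, b) in enumerate(params):
        h = h @ w + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)  # ReLU on hidden layers (assumed)
    return h

params = init_q_network(n_inputs=21)  # 2 location values + 19 differential RSS
n_params = sum(w.size + b.size for w, b in params)
q = q_values(params, np.zeros(21))
```

Under these assumptions the network has fewer than 50,000 parameters, so the long training time reported above is driven mainly by the number of training samples rather than by the model size.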

III-C DRL Localization

The trained model was used for localization. The method in Subsection III-A was used to generate 100 test trajectories, each with a length of 300 steps. Figure 5 illustrates the localization solutions of four example trajectories. On a relatively large spatial scale (e.g., the 100 m level), the localization solution had a trend similar to that of the reference trajectories. This outcome indicates the potential of using DRL for wireless localization in the long term. On the other hand, on the 10 m level spatial scale, the action output from the DRL may deviate from the actual agent movement. This phenomenon may be caused by factors such as RSS fluctuations.

Fig. 5: Example of DRL-based wireless localization solutions

For comparison, the localization solutions from two other methods were computed. One method was a DNN [8] that uses supervised data and the other was multilateration [49] with unsupervised data. The former method provided a reference for the achievable localization accuracy with the test data, while the latter indicated the localization accuracy when unsupervised data were used. Both comparison approaches used the same training and testing data as the DRL-based method. However, only the supervised DNN method used the known location labels in the database-training step. The first comparison method was implemented by using a DNN with two hidden layers, each of which had 200 neurons. The second comparison method was applied by setting the path-loss model parameters for all GWs to empirical values (path-loss exponent $n$ = 2 and reference RSS $A$ = -50). Figure 6 shows the cumulative distribution function (CDF) curves of localization errors from 100 test trajectories. Figure 7 shows the location error statistics, including the mean, root mean square (RMS), and the 80 % and 95 % quantile values.

Fig. 6: CDF curves of location errors
Fig. 7: Statistics of location errors
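
The statistics reported in Figures 6 and 7 can be computed from per-sample location errors as sketched below. The function names and the toy coordinates are hypothetical; the error is assumed to be the horizontal Euclidean distance between the estimated and reference locations.

```python
import numpy as np

def error_statistics(estimated_xy, reference_xy):
    """Mean, RMS, and 80 % / 95 % quantiles of horizontal location errors."""
    diff = np.asarray(estimated_xy, dtype=float) - np.asarray(reference_xy, dtype=float)
    err = np.linalg.norm(diff, axis=1)
    return {
        "mean": float(err.mean()),
        "rms": float(np.sqrt(np.mean(err ** 2))),
        "q80": float(np.quantile(err, 0.80)),
        "q95": float(np.quantile(err, 0.95)),
    }

def empirical_cdf(errors):
    """Sorted errors and the cumulative probability at each of them."""
    x = np.sort(np.asarray(errors, dtype=float))
    p = np.arange(1, len(x) + 1) / len(x)
    return x, p

stats = error_statistics([[3.0, 4.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
cdf_x, cdf_p = empirical_cdf([3.0, 1.0, 2.0])
```

The RMS weights large errors more heavily than the mean does, so a gap between the two indicates a heavy tail in the error distribution.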

Figures 6 and 7 indicate that

  • The location errors from the DRL-based method had RMS and 95 % quantile values of 12.2 m and 24.7 m, respectively. These values were respectively 37.1 % and 36.8 % smaller than those from the unsupervised multilateration method (RMS 19.4 m and 95 % quantile 39.1 m). This outcome indicates a positive effect of using the DRL-based method to train a DQN with unlabeled data and using the obtained model for localization.

  • On the other hand, the RMS and 95 % quantile values of the DRL-based localization errors were respectively 90.6 % and 104.1 % higher than those from the supervised DNN method (RMS 6.4 m and 95 % quantile 12.1 m). This result indicates that the localization performance of the unsupervised DRL-based method was still significantly lower than that of the supervised DNN method. The DRL-based localization method may be further enhanced by approaches such as improving the MDP modeling (e.g., the reward-setting mechanism), improving the DRL framework, and introducing geometrical localization models.
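
The percentages in the two items above are relative differences between the reported error values, and the result depends on which method is taken as the baseline. The short check below uses only the reported numbers; the helper name is hypothetical.

```python
def relative_difference(value, baseline):
    """Difference of `value` from `baseline`, in percent of `baseline`."""
    return 100.0 * (value - baseline) / baseline

drl = {"rms": 12.2, "q95": 24.7}
multilateration = {"rms": 19.4, "q95": 39.1}
dnn = {"rms": 6.4, "q95": 12.1}

# DRL relative to multilateration (negative means smaller errors).
rms_vs_multi = relative_difference(drl["rms"], multilateration["rms"])
q95_vs_multi = relative_difference(drl["q95"], multilateration["q95"])
# The same RMS gap expressed relative to the DRL value instead.
multi_vs_drl_rms = relative_difference(multilateration["rms"], drl["rms"])
# DRL relative to the supervised DNN.
rms_vs_dnn = relative_difference(drl["rms"], dnn["rms"])
q95_vs_dnn = relative_difference(drl["q95"], dnn["q95"])
```

The 7.2 m RMS gap between DRL and multilateration is about 37.1 % of the multilateration RMS, but about 59.0 % of the DRL RMS; the 95 % quantile gap is about 36.8 % of the multilateration value.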

Moreover, the following experience and insights were obtained from the tests.

  • The DRL algorithm is data-driven and thus can be implemented without an a priori motion model. An advantage of this characteristic is that such a self-supervised method is suitable for complex environments that are difficult to model and parameterize. On the other hand, such data-driven methods require a large amount of data and a heavy computational load (e.g., tens of hours of training for even a small scenario). To accelerate computation, the use of DRL-based localization may need support from future AI hardware and chips. Meanwhile, the DRL method is highly dependent on the quality of data. Although the DRL method itself has a well-developed exploration mechanism that may mitigate the issue of over-training, this issue is difficult to eliminate. One method for further alleviating this issue is to integrate DRL with geometrical localization approaches and motion models.

  • The data in this research were randomly sampled from in-field IoT data. Thus, the data were closer to real-world situations than the simulated data in the majority of existing works on DRL-based navigation and localization. However, the data in this research still cannot fully reflect the performance of the algorithms in real-world IoT localization scenarios. One main reason is that real IoT localization data may be degraded by more environmental (e.g., multipath and obstruction), motion (e.g., motion diversity), and data (e.g., data loss and database outage) factors. Future work will use real IoT big data for evaluating AI-based localization methods.

  • The DRL algorithm itself is being enhanced due to its research and use in numerous fields. However, as with many other AI algorithms, a DRL module is effectively a black box for most users. It is difficult to understand and adjust the internal algorithms explicitly. This factor is a potential obstacle to the study of DRL-based localization.

IV Conclusions

This paper presents an unsupervised wireless localization method that uses the DRL framework. Through processing field-testing data from 48 BT5 smart ear tags in a pasture, which had a size of 120 m by 70 m and 20 GWs, the proposed method provided location errors that had RMS and 95 % quantile values of 12.2 m and 24.7 m, which were respectively 37.1 % and 36.8 % lower than those from an unsupervised multilateration method. This outcome indicates a positive effect and the potential of using the DRL-based method for wireless localization. On the other hand, the RMS and 95 % quantile values of the location errors from the proposed method were respectively 90.6 % and 104.1 % higher than those from the supervised DNN method. This phenomenon indicates the possibility and necessity of improving the DRL-based localization algorithm in the future.

Meanwhile, the experimental verification process reflected several pros and cons of using DRL for localization. Its advantages include the capability to involve previous localization data and long-term rewards, the possibility of implementing localization without geometrical modeling and parameterization of the environment, and the convenience of using state-of-the-art DRL platforms and algorithms. The challenges include the dependency on a large amount of data, the heavy computational load, and the black-box issue. The DRL-based localization method may be further enhanced by approaches such as improving the MDP modeling (e.g., the reward-setting mechanism), improving the DRL framework, and introducing geometrical localization models.

V Acknowledgements

The authors would like to thank Dr. Zhe He and Dr. Yuqi Li for designing the IoT system and devices, and Daming Zhang and Stan Chan for conducting tests and data pre-processing.

References


  • [1] F. Gu, X. Hu, M. Ramezani, et al., “Indoor Localization Improved by Spatial Context-A Survey”, ACM Comput Surv, vol. 52, no. 3, pp. 1-35, Jul. 2019.
  • [2] W. Jiang, C. Xu, L. Pei, W. Yu, “Multidimensional scaling-based TDOA localization scheme using an auxiliary line”, IEEE Signal processing letters, vol. 23, no. 4, pp. 546-50, Mar. 2016.
  • [3] B. Zhou, A. Liu, V. Lau, “Performance Limits of Visible Light-Based Positioning Using Received Signal Strength Under NLOS Propagation”, IEEE T Wirel Commun, 10.1109/TWC.2019.2934689, Aug. 2019.
  • [4] Y. Li, Y. Zhuang, P. Zhang, et al., “An Improved Inertial/Wifi/Magnetic Fusion Structure For Indoor Navigation”, Information Fusion, vol. 34, no. 1, pp. 101-119, Mar. 2017.
  • [5] B. Zhou, Q. Chen, P. Xiao, “Error Propagation Analysis of the Received Signal Strength-based Simultaneous Localization and Tracking in Wireless Sensor Networks”, IEEE T Inform Theory, vol. 63, no. 6, pp. 3983-4007, Jun. 2017.
  • [6] Y. Li, Z. He, Y. Li, et al., “Enhanced Wireless Localization Based on Orientation-Compensation Model and Differential Received Signal Strength,” IEEE Sens J, vol. 19, no. 11, pp. 4201-4210, Jun. 2019.
  • [7] Y. Li, Z. He, Z. Gao, et al., “Toward Robust Crowdsourcing-Based Localization: A Fingerprinting Accuracy Indicator Enhanced Wireless/Magnetic/Inertial Integration Approach,” IEEE Internet Thing J, vol. 6, no. 2, pp. 3585-3600, Apr. 2019.
  • [8] W. Zhang, K. Liu, W. Zhang, et al., “Deep Neural Networks for wireless localization in indoor and outdoor environments”, Neurocomputing, vol. 194, no. 1, pp. 279-287, Jun. 2016.
  • [9] B. Ferris, D. Hähnel and D. Fox. “Gaussian Processes for Signal Strength-Based Location Estimation”, in Robot Sci Syst, Philadelphia, USA, 16-19, Aug. 2006.
  • [10] X. Guo, N. Ansari, L. Li and H. Li, “Indoor Localization by Fusing a Group of Fingerprints Based on Random Forests,” IEEE Internet Thing J, doi: 10.1109/JIOT.2018.2810601, Feb. 2018.
  • [11] S. Sun, Y. Li, W. Rowe, et al., “Practical evaluation of a crowdsourcing indoor localization system using hidden Markov models,” IEEE Sens J, vol. 19, no. 20, pp. 9332-40, Jun. 2019.
  • [12] R. Timoteo, L. Silva, D. Cunha, G. Cavalcanti, “An approach using support vector regression for mobile location in cellular networks”, Comput Netw, vol. 95, no. 1, pp. 51-61, Feb. 2016.
  • [13] F. Orujov, R. Maskeliūnas, R. Damaševičius, W. Wei, Y. Li, “Smartphone based intelligent indoor positioning using fuzzy logic”, Future Gener Comp Sy, vol. 89, no. 1, pp. 335-348, Dec. 2018.
  • [14] K. Chiang, A. Noureldin and N. El-Sheimy, “A new weight updating method for INS/GPS integration architectures based on neural networks”, Meas Sci Technol, vol. 15, no. 10, pp. 2053-2061, 2004.
  • [15] F. Gu, K. Khoshelham, S. Valaee, et al., “Locomotion activity recognition using stacked denoising autoencoders”, IEEE Internet Thing J, vol. 5, no. 3, pp. 2085-93, Jun. 2018.
  • [16] Y. Li, Z. Gao, Z. He, et al., “Wireless Fingerprinting Uncertainty Prediction Based on Machine Learning,” Sensors, vol. 19, no. 2, Jan. 2019.
  • [17] P. Bolliger, “Redpin-adaptive, zero-configuration indoor localization through user collaboration”, in Proc ACM Int Worksh Mobil Entity Loc Track GPS-less Env, pp. 55-60, Sept. 2008.
  • [18] Y. Chen, R. Chen, L. Pei, T. Kroger, H. Kuusniemi, J. Liu, W. Chen, “Knowledge-based error detection and correction method of a Multi-sensor Multi-network positioning platform for pedestrian indoor navigation,” in IEEE/ION Pos Loc Nav Symp, Indian Wells, USA, 4-6 May 2010.
  • [19] A. Solin, M. Kok, N. Wahlström, et al., “Modeling and Interpolation of the Ambient Magnetic Field by Gaussian Processes,” IEEE T Robotics, vol. 34, no. 4, pp. 1112-1127, Aug. 2018.
  • [20] L. Bruno and P. Robertson, “WiSLAM: Improving FootSLAM with WiFi,” in Int Conf Ind Pos Ind Nav, Guimaraes, Portugal, 21-23 Sept. 2011.
  • [21] B. Zhou, Q. Li, Q. Mao, et al., “ALIMC: Activity Landmark-based Indoor Mapping via Crowdsourcing”, IEEE T Intell Transp, vol. 16, no. 5, pp. 2774-2785, May 2015.
  • [22] Y. Li, Z. Gao, Z. He, et al., “Multi-Sensor Multi-Floor 3D Localization With Robust Floor Detection,” IEEE Access, vol. 6, no. 1, pp. 76689-99, Dec. 2018.
  • [23] Y. Li, J. Georgy, X. Niu, et al., “Autonomous Calibration of MEMS Gyros in Consumer Portable Devices,” IEEE Sens J, vol. 15, no. 7, pp. 4062-72, Jul. 2015.
  • [24] P. Zhang, R. Chen, Y. Li, et al., “A Localization Database Establishment Method based on Crowdsourcing Inertial Sensor Data and Quality Assessment Criteria,” IEEE Internet Thing J, vol. 1, pp. 2327-4662, Mar. 2018.
  • [25] X. Hu, S. Liu, R. Chen, et al., “A Deep Reinforcement Learning-Based Framework for Dynamic Resource Allocation in Multibeam Satellite Systems,” IEEE Commun Lett, vol. 22, no. 8, pp. 1612-1615, Aug. 2018.
  • [26] J. Wang, J. Hu, G. Min, et al., “Computation Offloading in Multi-Access Edge Computing Using a Deep Sequential Model Based on Reinforcement Learning,” IEEE Commun Magaz, vol. 57, no. 5, pp. 64-69, May 2019.
  • [27] M. Jaderberg, V. Mnih, W. Czarnecki, et al., “Reinforcement learning with unsupervised auxiliary tasks,” in Int Conf Learning Representations, Toulon, France, 24-26, Apr. 2017.
  • [28] P. Mirowski, R. Pascanu, F. Viola, et al., “Learning to Navigate in Complex Environments,” in Int Conf Learning Representations, Toulon, France, 24-26, Apr. 2017.
  • [29] V. Dhiman, S. Banerjee, B. Griffin, et al., “A Critical Investigation of Deep Reinforcement Learning for Navigation”, ArXiv, arXiv:1802.02274v2, Jan. 2019.
  • [30] J. Zhang, J. Springenberg, J. Boedecker, W. Burgard, “Deep reinforcement learning with successor features for navigation across similar environments,” IEEE/RSJ Int Conf Intell Robot Syst, Vancouver, BC, Canada, pp. 2371-2378, 24-28 Sept. 2017.
  • [31] L. Tai, G. Paolo and M. Liu, “Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation,” IEEE/RSJ Int Conf Intell Robot Syst, Vancouver, BC, Canada, pp. 2371-2378, 24-28 Sept. 2017.
  • [32] J. Bruce, N. Sünderhauf, P. Mirowski, R. Hadsell, M. Milford, “One-Shot Reinforcement Learning for Robot Navigation with Interactive Replay”, in Conf Neural Inf Process Syst, Long Beach, CA, USA, 4-9 Dec. 2017.
  • [33] P. Mirowski, M. Grimes, M. Malinowski, et al., “Learning to Navigate in Cities Without a Map”, ArXiv, arXiv:1804.00168, Mar. 2018.
  • [34] M. Mohammadi, A. Al-Fuqaha, M. Guizani and J. Oh, “Semisupervised Deep Reinforcement Learning in Support of IoT and Smart City Services,” IEEE Internet Thing J, vol. 5, no. 2, pp. 624-635, Apr. 2018.
  • [35] E. Bejar and A. Moran, “Deep reinforcement learning based neuro-control for a two-dimensional magnetic positioning system,” Int Conf Control Autom Robot, Auckland, pp. 268-273, 20-23 Apr. 2018.
  • [36] Y. Kato, K. Kato and K. Morioka, “Autonomous robot navigation system with learning based on deep Q-network and topological maps,” in IEEE/SICE Int Sym Syst Integ, Taipei, 2017, pp. 1040-1046, 11-14 Dec. 2017.
  • [37] Z. Zhao, T. Braun and Z. Li, “A Particle Filter-based Reinforcement Learning Approach for Reliable Wireless Indoor Positioning,” IEEE J Sel Area Commun, doi: 10.1109/JSAC.2019.2933886, 2019.
  • [38] B. Peng, G. Seco-Granados, E. Steinmetz, et al., “Decentralized Scheduling for Cooperative Localization With Deep Reinforcement Learning,” IEEE T Veh Technol, vol. 68, no. 5, pp. 4295-4305, May 2019.
  • [39] Y. Chen, M. Everett, M. Liu, J. How, “Socially Aware Motion Planning with Deep Reinforcement Learning”, ArXiv, arXiv:1703.08862, May 2018.
  • [40] O. Zhelo, J. Zhang, L. Tai, et al., “Curiosity-driven Exploration for Mapless Navigation with Deep Reinforcement Learning”, ArXiv, arXiv:1804.00456, May 2018.
  • [41] Y. Zhu, R. Mottaghi, E. Kolve, et al., “Target-driven visual navigation in indoor scenes using deep reinforcement learning,” in IEEE Int Conf Robot Autom, Singapore, 2017, pp. 3357-3364, 29 May-3 Jun. 2017.
  • [42] A. Haeberlen, E. Flannery, A. Ladd, et al., “Practical robust localization over large-scale 802.11 wireless networks,” in Proc Int Conf Mobil Comput Netw, pp. 70-84, Philadelphia, PA, 26 Sept. - 01 Oct. 2004.
  • [43] S. Liu, X. Hu, Y. Wang, et al., “Deep Reinforcement Learning based Beam Hopping Algorithm in Multibeam Satellite Systems”, IET Commun, in press, 2019.
  • [44] Y. Li, K. Yan, Z. He, et al., “Cost-Effective Localization Using RSS from Single Wireless Access Point,” IEEE T Instrum Meas, doi: 10.1109/TIM.2019.2922752, Jun. 2019.
  • [45] SGD, “Gradient Descent Optimizer,”, Retrieved 01 Sept. 2019.
  • [46] S. Ioffe, C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, ArXiv, arXiv:1502.03167, Mar. 2015.
  • [47] Texas Instruments, “CC2640R2F SimpleLink Bluetooth low energy Wireless MCU,”, Retrieved 01 Sept. 2019.
  • [48] TensorFlow, “Get Started with TensorFlow,”, Retrieved 01 Sept. 2019.
  • [49] Y. Zhuang, J. Yang, Y. Li, et al., “Smartphone-Based Indoor Localization with Bluetooth Low Energy Beacons,” Sensors, vol. 16, no. 5, pp. 1-20, 2016.