Semi-supervised Deep Reinforcement Learning in Support of IoT and Smart City Services

by Mehdi Mohammadi, et al.
Western Michigan University

Smart services are an important element of the smart cities and the Internet of Things (IoT) ecosystems where the intelligence behind the services is obtained and improved through the sensory data. Providing a large amount of training data is not always feasible; therefore, we need to consider alternative ways that incorporate unlabeled data as well. In recent years, Deep reinforcement learning (DRL) has gained great success in several application domains. It is an applicable method for IoT and smart city scenarios where auto-generated data can be partially labeled by users' feedback for training purposes. In this paper, we propose a semi-supervised deep reinforcement learning model that fits smart city applications as it consumes both labeled and unlabeled data to improve the performance and accuracy of the learning agent. The model utilizes Variational Autoencoders (VAE) as the inference engine for generalizing optimal policies. To the best of our knowledge, the proposed model is the first investigation that extends deep reinforcement learning to the semi-supervised paradigm. As a case study of smart city applications, we focus on smart buildings and apply the proposed model to the problem of indoor localization based on BLE signal strength. Indoor localization is the main component of smart city services since people spend significant time in indoor environments. Our model learns the best action policies that lead to a close estimation of the target locations with an improvement of 23 received rewards compared to the supervised DRL model.



I Introduction

The rapid development of Internet of Things (IoT) technologies has motivated researchers and developers to think about new kinds of smart services that extract knowledge from IoT-generated data. The scarcity of labeled data is a main issue in developing such solutions, especially for IoT applications where a large number of sensors generate data for which class labels cannot be obtained. Smart cities, as a prominent application area of the IoT, should provide a range of high-quality smart services to meet citizens' needs [1]. Smart buildings are one of the main building blocks of smart cities, since people nowadays spend over 87% of their daily lives indoors [2] for work, shopping, education, etc. Therefore, a smart environment that provides services to meet the needs of its inhabitants is a valuable asset for organizations. Such services facilitate the development of smart cities. Location-aware services in indoor environments play a significant role in this era. Examples of applications of such services are smart home management [3], delivering cultural content in museums [4], location-based authentication and access control [5], location-aware marketing and advertisement [6][7], and wayfinding and navigation in smart campuses [8]. Moreover, locating users in indoor environments is very important for smart buildings because it serves as the link that enables users to interact with other IoT services [9].

Deep learning is a powerful machine learning approach that provides function approximation, classification, and prediction capabilities. Reinforcement learning is another class of machine learning approaches for optimal control and decision-making processes, where a software agent learns an optimal policy of actions over the set of states in an environment. In applications where the number of states is very large, a deep learning model can be used to approximate the action values (i.e., how good an action is in a given state). Systems that combine deep and reinforcement learning are in their initial phases but have already produced competitive results in some application areas (e.g., video games). Moreover, learning approaches with little or no supervision are expected to gain more momentum in the future [10], mimicking the natural learning processes of humans and animals.

IoT applications can benefit from the decision process for learning purposes. For example, in the case of location-aware services, location estimation can be seen as a decision process in which a software agent determines the exact or closest point to a specific target. In this regard, reinforcement learning [11] can be exploited to formulate and solve the problem. In a reinforcement learning solution, a software agent interacts with the environment and changes the state of the environment by performing some actions. Depending on the performed action, the environment sends a reward to the agent. The agent tries to maximize its rewards over time by choosing those actions that result in higher rewards. Deep reinforcement learning, a recent variation of reinforcement learning, was demonstrated by Google to achieve high performance in Atari games [12] and is a suitable candidate for the learning process in IoT applications.
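The agent-environment interaction described above can be illustrated with a toy example. The following is a minimal sketch, not the setup used in this paper: `LineWorld`, `run_episode`, and `oracle_policy` are hypothetical names, the environment is a one-dimensional line of positions, and the reward scheme (+1 for moving closer to the target, -1 otherwise) is an assumption chosen for illustration.

```python
import random


class LineWorld:
    """Hypothetical 1-D environment: positions 0..size-1 with a fixed target."""

    def __init__(self, size=10, target=7, start=0):
        self.size = size
        self.target = target
        self.position = start

    def step(self, action):
        """Apply an action (-1 = left, +1 = right) and return (state, reward)."""
        before = abs(self.position - self.target)
        self.position = min(self.size - 1, max(0, self.position + action))
        after = abs(self.position - self.target)
        # Reward moving closer to the target; penalize anything else.
        reward = 1.0 if after < before else -1.0
        return self.position, reward


def run_episode(env, policy, max_steps=20):
    """Run the agent-environment loop and return (final_state, total_reward)."""
    state, total = env.position, 0.0
    for _ in range(max_steps):
        action = policy(state)
        state, reward = env.step(action)
        total += reward
        if state == env.target:
            break
    return state, total


def oracle_policy(state, target=7):
    """Hand-coded policy that always steps toward the (known) target."""
    return 1 if state < target else -1


rng = random.Random(0)
print(run_episode(LineWorld(), oracle_policy))                  # (7, 7.0)
print(run_episode(LineWorld(), lambda s: rng.choice([-1, 1])))  # depends on the seed
```

The oracle policy is only a reference point: a learning agent does not know the target and has to discover a comparable policy from the rewards it receives.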

In this research, we propose a semi-supervised deep reinforcement learning model to benefit from the large number of unlabeled data that are generated in IoT applications.

The points that motivate us for this study are:

  • In the world of IoT, where sensors generate large volumes of data that cannot be labeled manually for training purposes, semi-supervised approaches are valuable. Moreover, building a deep reinforcement learning framework that works in a semi-supervised manner can serve many IoT applications.

  • Games demonstrated significant improvements using DRL [12]. IoT applications also can be seen as a game where the goal is to estimate the correct classification of a given input and hence can benefit from a DRL approach.

  • The learning process for the scale of smart cities requires many efforts including data gathering, analysis and classification. The strength of deep learning models stems from the latest advancements in computational and data storage capabilities. Such models can be utilized to develop scalable and efficient learning solutions for smart cities from crowd-sensed big data.

  • Smart city applications can be trained in a lab and deployed in a real environment without losing performance. For example, a self-driving car needs to learn how to perform in a variety of conditions (e.g., approaching pedestrians, handling traffic signs, etc.) which can be learned in a few test drives. But it is impossible to account for all the scenarios that might happen in a given city.

Also, our study is motivated by specific observations regarding the localization problem including:

  • While WiFi fingerprinting has been studied widely in the past decade for indoor positioning and the accuracy is in the range of 10 m, BLE is in its infancy for indoor localization and has yielded more fine-grained results [13].

  • There are many practical applications that need an efficient mechanism for positioning in small scale environments such as robotic soccer games to locate the position of the ball, or navigating robots in a building. Our proposed approach can be extended and used in such scenarios aiming to enhance micro-localization accuracy in support of smart environments [9].

The contributions of this paper are as follows:

  • We propose a semi-supervised deep reinforcement learning framework based on deep generative models and reinforcement learning that combines the strengths of deep neural networks and statistical modeling of data density in a reinforcement learning paradigm. To the best of our knowledge, this work is the first attempt to address semi-supervised learning through deep reinforcement learning.

  • We leverage both labeled and unlabeled data in our model. Since unlabeled data are more prevalent, this is a key feature for IoT applications, where sensors generate large volumes of data that cannot be labeled easily. Therefore, our approach alleviates the need for large amounts of labeled data. In addition, the performance of deep reinforcement learning is enhanced by using the proposed semi-supervised approach.

The idea of extending reinforcement learning algorithms to semi-supervised reinforcement learning has not been studied well so far. There are some suggestions that explain the possibility of semi-supervised reinforcement learning by having unlabeled episodes in which the agent does not receive its rewards from the environment [14] [15]. But there is no implementation of such extensions so far. Our proposed semi-supervised deep reinforcement learning, however, follows a different approach where we incorporate a variational autoencoder [16] in our framework as the semi-supervised module to infer the classification of unlabeled data and incorporate this information along with the labeled data to optimize its discriminating boundaries.

To apply the proposed model on a smart city scenario, we chose to perform the experiments on smart buildings which play a significant role in smart cities. Our experimental results assert the efficiency of the proposed semi-supervised DRL model compared to the supervised DRL model. Specifically, the results have been improved by 23% for a small number of training epochs. Also, considering the average performance of both models in terms of received rewards, the semi-supervised model outperforms the supervised model by obtaining twice as many rewards.

The rest of this paper is organized as follows. Section II organizes the recent related works into two parts: one that reviews attempts that utilize DRL models and the other that deals with the indoor localization as a case study. Section III presents related background and then introduces the details of the proposed approach. Section IV presents a use-case study in which the proposed model is used for indoor localization systems using iBeacon signals. Experimental results are presented in Section V followed by concluding remarks in Section VI.

II Related Work

In the following sub-sections, we first review some recent research efforts that utilize deep reinforcement learning. Then, we review the latest research efforts that address the indoor localization problem from a machine learning perspective.

II-A Deep Reinforcement Learning

Deep reinforcement learning has been proposed in recent years [12] and is gaining attention to be applied in various application domains. In the following paragraphs, we review some of the latest research efforts that utilize DRL in different application areas.

Nemati et al. [17] utilized a deep reinforcement learning algorithm to learn actionable policies for administering an optimal dose of medicines like heparin for individuals. They used a sample dataset of dosage trials and their outcomes from a set of electronic medical records. In their model, they used a discriminative Hidden Markov Model (HMM) for state estimation and a Q-network with two layers of neurons. The medication dosage agent tries to learn the optimal policy by maximizing its total reward, which is the overall fraction of time when patients are in their therapeutic activated Partial Thromboplastin Time (aPTT) range.

Deep reinforcement learning has also been applied for vehicle image classification as reported in [18]. In that work, the authors propose a Convolutional Neural Network (CNN) model combined with a reinforcement learning module to guide where to look in the image for the key parts of a car. The information entropy of the classification probability of a focused image that is produced by their CNN model is considered as the reward for the reinforcement learning agent to learn to identify the next visual attention area in the image. The work in [19] also reports on object localization in images by focusing attention on candidate regions using a deep reinforcement learning approach. In [20], a visual navigation application is presented that uses a variation of deep reinforcement learning by which robots can navigate in a space toward a visual target.

Li et al. [21] also developed a deep reinforcement learning approach for traffic signal timing aiming to have a better signal timing plan. In their model which consists of a four-layer stacked autoencoder neural network to estimate the Q-function, they use the queuing lengths of eight incoming lanes to an intersection as the state of the system at each time. They also define two actions: stay in the current traffic lane, or change the lane to allow other traffic to go through the intersection. The absolute value of the difference between the length of opposite lanes serves as the reward function. Their results show that their proposed model outperformed other conventional reinforcement learning approaches.

Resource management is another task that can use DRL as its underlying mechanism. In their report, Mao et al. [22] formulate the problem of job scheduling with multiple resource demands as a deep reinforcement learning process. In their approach, the objective is to minimize the average job slowdown. The reward function was defined based on the reciprocal duration of the job in order to guide the agent toward the objective.

Another application in which deep reinforcement learning played a key role is natural language understanding for text-based games [23][24]. For example, Narasimhan et al. [23] used Long Short-Term Memory (LSTM) networks to train the agent with useful representations of text descriptions and a Deep Q-Network (DQN) to approximate Q-functions. Other disciplines like energy management have also incorporated DRL to improve energy utilization [25].


II-B Review of Indoor Localization

To the best of our knowledge, there are no prior research efforts that utilized DRL for localization. In the following paragraphs, we review the different machine learning approaches that were utilized in the recent research literature to provide indoor localization services.

Among the approaches for deploying location-based services, Received Signal Strength (RSS) fingerprinting is one of the most promising. However, some challenges need to be considered in deploying such an approach, including fingerprint annotation and device diversity [26]. The use of fingerprint-based approaches to identify an indoor position has been studied well in the past decade. Researchers have studied different machine learning approaches in this context, including SVM, KNN, Bayesian-based filtering, transfer learning, and neural networks.


It has been shown that for coarse-grained positioning applications based on Bluetooth Low Energy (BLE) RSS fingerprinting, estimating whether or not a device is inside a room yields fairly reliable results [28].

The authors in [29] report their experimental indoor positioning results based on BLE RSS readings. They study the accuracy of three methods including Least Square Estimation (LSE), Three-border positioning, and Centroid positioning. In their testing area, which is a square meters classroom with four BLE stations at the corners, the LSE algorithm shows more accurate positions compared to the two other methods. However, the overall accuracy of these three algorithms is satisfactory.

Museums are good environments for using BLE to provide location-awareness since usually the building and its contents do not allow changes due to preservation policies. The authors in [4] developed a system to make interactive cultural displays in a museum with BLE beacons combined with an image recognition wearable device. The wearable device performs localization by receiving BLE signals from the beacons to identify the room in which it is located. It also identifies artworks by an image processing service. The combination of the closest beacon identifier and the artwork identifier are fed to a processing center to retrieve the appropriate cultural content.

In [30], the authors present a system called DeepFi that utilizes a deep learning method over fingerprinting data to locate indoor positions based on channel state information (CSI). Like many other fingerprinting approaches, their system consists of offline training and online localization phases. In the offline training phase, they exploit deep learning to train all the weights as fingerprints based on the previously stored CSI. Their evaluations in living room and laboratory settings show that the use of deep learning improves localization accuracy by 20%. However, their CSI-based approach is limited to WiFi networks, and not all available Network Interface Cards (NICs) on the market support obtaining measurements from the different network channels.

In [31], deep learning joined with semi-supervised learning as well as extreme learning machine are applied to unlabeled data to study the performance of the feature extraction and classification phases of indoor localization. In their study, the deep learning network and semi-supervised learning generate high-level abstract features and more accurate classification, while the extreme learning machine can speed up the learning process. Their test setting is a 10 × 15 square meters area. Their results show that deep learning can improve the accuracy of fingerprinting by at least 1.3% for the same training dataset compared to a shallow learning method. Also, increasing unlabeled data has a positive effect on the accuracy compared to shallow feature methods. Compared to other deep learning methods including stacked autoencoder, deep belief network, and multi-layer extreme learning machine, their approach improves the accuracy by at least 10%.

In another study [27], the authors propose a WiFi localization approach using deep neural networks (DNN). In their system, a four-layer deep learning model is used to extract features from WiFi RSS data. The authors use a Stacked Denoising Autoencoder and backpropagation for the training steps. In the online positioning phase, the estimated position based on the DNN is refined by an HMM component. Their experiments show that the number of hidden layers and neurons has a direct effect on the localization accuracy. Increasing the number of layers leads to better results, but at some point, when the network is made deeper, the results start degrading. Their results show that the model achieves the best accuracy when using three hidden layers with 200 neurons each.

Ding et al. [32] also used an artificial neural network (ANN) for WiFi fingerprinting localization. They proposed a localization approach that uses ANNs in conjunction with a clustering method based on affinity propagation. With affinity propagation clustering, the training of the ANN model became faster and the memory overhead was lowered. They also reported improved positioning accuracy compared to other baseline methods.

In [33], a deep belief network (DBN) is used for a localization approach that is based on fingerprinting of ultra-wideband signals in an indoor environment. Parameters of the channel impulse response are used to obtain a dataset of fingerprints. Compared to other methods, the authors demonstrated that a DBN can improve the localization accuracy.

The work in [34] also reports using a deep learning model in conjunction with a regression model to automatically learn discriminative features from the received wireless signal. The authors also use a softmax regression algorithm to perform device-free localization and activity recognition. They report that their proposed method can improve the localization accuracy by 10% compared to other methods.

Semi-supervised algorithms have also been widely applied to the localization problem to utilize the unlabeled data for the prediction of an unknown location. For example, in [35] the authors presented a semi-supervised algorithm based on the manifold assumption to obtain tagged fingerprints out of unlabeled data using a small amount of labeled data. They map the high-dimensional space of fingerprints into a two-dimensional space and achieved an average error of 2 meters.

Our work in this paper presents several significant differences compared to the aforementioned approaches. First, related research studies in deep reinforcement learning do not exploit the statistical information of unlabeled data, while our proposed DRL approach is extended to be semi-supervised and utilizes both labeled and unlabeled data. Second, these approaches provide an application-dependent solution, while our work is a general framework that can work for a variety of IoT applications. Third, for localization systems, all aforementioned deep learning solutions rely on WiFi fingerprinting, while the context of BLE fingerprinting has not been studied in conjunction with deep learning or reinforcement learning approaches.

III Background and Proposed Approach

In the following sub-sections, we first describe the fundamentals of variational autoencoders. Then we describe our proposed semi-supervised DRL model by adopting a variational autoencoder in a deep reinforcement learning model. We develop the theoretical foundation of our method based on [16] and [12].

III-A Semi-Supervised Learning Using VAE

Semi-supervised learning methods aim to improve the generalization of supervised learning tasks using unlabeled data [36]. They usually use a small set of annotated data along with a larger number of unlabeled data to train the model. In a semi-supervised setting, we have two datasets; one is labeled and the other is unlabeled. The labeled dataset is denoted by $X_l = \{x_1, \dots, x_l\}$, for which labels $Y_l = \{y_1, \dots, y_l\}$ are provided. The other set is $X_u = \{x_{l+1}, \dots, x_{l+u}\}$ with unknown labels. Semi-supervised algorithms are built on at least one of the following three assumptions [37]. The smoothness assumption states that if two points $x_1$ and $x_2$ are close to each other, then their corresponding labels $y_1$ and $y_2$ are very likely to be close to each other. The cluster assumption concerns data that form discrete clusters: it states that if two points are in the same cluster, they are more likely to have the same class label. The manifold assumption points out that high-dimensional data can be mapped to a lower-dimensional space (i.e., the principle of parsimony) such that the supervised algorithm still approximates the true class of a data point.

For the semi-supervised part of our proposed model, we adopt the deep generative model based on variational autoencoders (VAE) [16]. This model has been used for semi-supervised tasks such as the recognition of handwritten digits, house number classification and motion prediction [38] with impressive results. Figure 1 shows the structure of a typical VAE model. For each data point $x_i$ there is a vector of corresponding latent variables denoted by $z_i$. The distribution of labeled data is represented by $\tilde{p}_l(x, y)$, while unlabeled data are represented by $\tilde{p}_u(x)$.

Fig. 1: The high-level concept of a variational autoencoder adopted for deep reinforcement learning.

The latent feature discriminative model (M1) is created based on:

$$p(z) = \mathcal{N}(z \mid 0, I); \quad p_\theta(x \mid z) = f(x; z, \theta),$$

in which $\mathcal{N}(z \mid 0, I)$ is Gaussian distributed with mean vector $0$ and variances presented in an identity matrix $I$. The function $f(x; z, \theta)$ is a nonlinear likelihood function with parameter $\theta$ for latent variable $z$ based on a deep neural network.

The generative semi-supervised model for generating data using a latent class variable $y$, in addition to a latent variable $z$, is (M2):

$$p(y) = \mathrm{Cat}(y \mid \pi); \quad p(z) = \mathcal{N}(z \mid 0, I); \quad p_\theta(x \mid y, z) = f(x; y, z, \theta),$$

where $\mathrm{Cat}(y \mid \pi)$ represents a categorical distribution or in general a multinomial distribution with a vector of probabilities $\pi$ whose elements sum up to 1. In the dataset, if no label is available, the unknown labels $y$ are considered as latent variables in addition to $z$.

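The models have two lower bound objectives. To describe the model objectives, a fixed-form distribution $q_\phi(z \mid x)$ is introduced with parameter $\phi$ that helps us to estimate the posterior distribution $p(z \mid x)$. For all latent variables in the models, an inference deep neural network is introduced to generate a distribution of the form $q_\phi(\cdot)$. For M1, a Gaussian inference network is used for latent variable $z$:

$$q_\phi(z \mid x) = \mathcal{N}\big(z \mid \mu_\phi(x), \mathrm{diag}(\sigma_\phi^2(x))\big),$$

in which $\mu_\phi(x)$ is the vector of means, $\sigma_\phi(x)$ is the vector of standard deviations, and $\mathrm{diag}$ creates a diagonal matrix. For M2, an inference network is used for latent variables $z$ and $y$ using Gaussian and multinomial distributions, respectively:

$$q_\phi(z \mid y, x) = \mathcal{N}\big(z \mid \mu_\phi(y, x), \mathrm{diag}(\sigma_\phi^2(x))\big); \quad q_\phi(y \mid x) = \mathrm{Cat}\big(y \mid \pi_\phi(x)\big),$$

where $\pi_\phi(x)$ is a vector of probabilities.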

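The lower bound for M1 is:

$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - KL\big[q_\phi(z \mid x) \,\|\, p(z)\big] = -\mathcal{J}(x),$$

in which $KL[q_\phi(z \mid x) \,\|\, p(z)]$ is the Kullback-Leibler divergence function between the encoding and prior distribution and can be obtained as

$$KL\big[q_\phi(z \mid x) \,\|\, p(z)\big] = \mathbb{E}_{q_\phi(z \mid x)}\big[\log q_\phi(z \mid x) - \log p(z)\big].$$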
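When the encoding distribution is a diagonal Gaussian and the prior is a standard normal, as in the M1 model, the Kullback-Leibler term has a well-known closed form, $\frac{1}{2}\sum_j (\sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2)$. The sketch below is an illustration rather than the implementation used in this work; the function names are hypothetical, and the Monte Carlo estimate is included only to check the closed form against the expectation-based definition.

```python
import numpy as np


def gaussian_kl_to_standard_normal(mu, sigma):
    """Closed-form KL[N(mu, diag(sigma^2)) || N(0, I)]."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    return float(0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma)))


def monte_carlo_kl(mu, sigma, n_samples=200_000, seed=0):
    """Estimate E_q[log q(z) - log p(z)] by sampling z from q."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    z = mu + sigma * rng.standard_normal((n_samples, mu.size))
    log_2pi = np.log(2.0 * np.pi)
    # Log-density of the diagonal Gaussian q and of the standard normal prior p.
    log_q = -0.5 * np.sum(((z - mu) / sigma) ** 2 + 2.0 * np.log(sigma) + log_2pi, axis=1)
    log_p = -0.5 * np.sum(z**2 + log_2pi, axis=1)
    return float(np.mean(log_q - log_p))


print(gaussian_kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))  # 0.0: q equals the prior
print(gaussian_kl_to_standard_normal([0.5, -1.0], [0.8, 1.5]))
print(monte_carlo_kl([0.5, -1.0], [0.8, 1.5]))                 # close to the line above
```

The closed form is what makes this term cheap to evaluate at every minibatch, since no sampling is needed for it.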


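For the model M2, two cases should be considered. The first one deals with labeled data:

$$\log p_\theta(x, y) \ge \mathbb{E}_{q_\phi(z \mid y, x)}\big[\log p_\theta(x \mid y, z) + \log p(y) + \log p(z) - \log q_\phi(z \mid y, x)\big] = -\mathcal{L}(x, y).$$

When dealing with unlabeled data, $y$ is treated as a latent variable and the resulting lower bound is:

$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(y, z \mid x)}\big[\log p_\theta(x \mid y, z) + \log p(y) + \log p(z) - \log q_\phi(y, z \mid x)\big] = \sum_y q_\phi(y \mid x)\big(-\mathcal{L}(x, y)\big) + \mathcal{H}\big(q_\phi(y \mid x)\big) = -\mathcal{U}(x).$$

Then the whole dataset has its bound of marginal likelihood as:

$$\mathcal{J} = \sum_{(x, y) \sim \tilde{p}_l} \mathcal{L}(x, y) + \sum_{x \sim \tilde{p}_u} \mathcal{U}(x).$$

By adding a classification loss to the above function, the optimized objective function becomes:

$$\mathcal{J}^\alpha = \mathcal{J} + \alpha \cdot \mathbb{E}_{\tilde{p}_l(x, y)}\big[-\log q_\phi(y \mid x)\big],$$

where $\alpha$ adjusts the contributions of the generative and discriminative models in the learning process. During the training process for both models M1 and M2, the stochastic gradient of $\mathcal{J}$ is computed at each minibatch to be used for updating the generative parameters $\theta$ and the variational parameters $\phi$.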
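To make the structure of this objective concrete, the following sketch assembles it from precomputed pieces, following the decomposition in [16]: for an unlabeled point, the bound is the expectation of the labeled bound over the predicted class probabilities minus the entropy of those probabilities, and the final objective adds a classification loss weighted by $\alpha$. The function names and the toy numbers are hypothetical, and the per-sample bounds are taken as given inputs rather than computed by a network.

```python
import numpy as np


def unlabeled_bound(labeled_bounds_per_class, class_probs):
    """U(x) = sum_y q(y|x) * L(x, y) - H(q(y|x)) for one unlabeled sample."""
    bounds = np.asarray(labeled_bounds_per_class, dtype=float)
    q = np.asarray(class_probs, dtype=float)
    entropy = -np.sum(q * np.log(q + 1e-12))  # small constant avoids log(0)
    return float(np.sum(q * bounds) - entropy)


def total_objective(labeled_bounds, unlabeled_bounds, true_class_probs, alpha):
    """J^alpha = sum L(x, y) + sum U(x) + alpha * mean(-log q(y|x)) on labeled data.

    true_class_probs holds q(y|x) evaluated at the true label of each labeled sample.
    """
    classification_loss = -np.mean(np.log(np.asarray(true_class_probs, dtype=float)))
    generative_term = np.sum(labeled_bounds) + np.sum(unlabeled_bounds)
    return float(generative_term + alpha * classification_loss)


# Toy values: two classes, one unlabeled sample, two labeled samples.
u = unlabeled_bound([2.0, 4.0], [0.5, 0.5])
print(u)
print(total_objective([1.0, 2.0], [u], [0.5, 0.25], alpha=2.0))
```

A larger `alpha` pushes training toward the discriminative (classification) term, while a smaller one lets the generative bounds dominate.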

III-B Semi-Supervised Deep Reinforcement Learning

To adopt a deep reinforcement learning approach, we need to define the following elements for a Markov Decision Process (MDP). The goal of the MDP in a reinforcement learning problem is to maximize the earned rewards.

Environment: The environment is the territory that the learning agent interacts with.

Agent: The agent observes the environment, receives sensory data and performs a valid action. It then receives a reward for its action. Through training, the agent learns to maximize its rewards.

States: The finite set of states that the environment can assume. Each action of the agent puts the environment in a new state.

Actions: The finite set of available actions that the agent can perform, causing a transition from state $s_t$ at time $t$ to state $s_{t+1}$ at time $t+1$.

Reward function: This function is the immediate feedback for performing an action. The reward function can be defined such that it reflects the closeness of the current state to the true class label; i.e., . Depending on the problem, different distance measurements can be applied. The point is that we need to devise larger positive rewards for more compelling results and negative rewards for distracting ones.
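One possible shape for such a reward function is sketched below. It is an assumption for illustration only and not the reward used in the experiments: the name `distance_reward`, the Euclidean distance measure, the tolerance, and the reward magnitudes are all hypothetical choices.

```python
import math


def distance_reward(position, target, previous_position, tolerance=0.5):
    """Hypothetical reward: bonus at the target, +1 if closer, -1 otherwise."""
    dist_now = math.dist(position, target)
    dist_before = math.dist(previous_position, target)
    if dist_now <= tolerance:
        return 10.0                      # large positive reward for reaching the target
    return 1.0 if dist_now < dist_before else -1.0


print(distance_reward((1, 1), (3, 3), (0, 0)))    # 1.0: moved closer
print(distance_reward((0, 0), (3, 3), (1, 1)))    # -1.0: moved away
print(distance_reward((3, 3.2), (3, 3), (2, 3)))  # 10.0: within tolerance of the target
```

Other distance measures (for example, Manhattan distance on a grid) can be swapped in without changing the structure.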

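State transition distribution: $P_a(s, s')$ is the probability that action $a$ in state $s$ at time $t$ will lead to state $s'$ at time $t+1$: $P_a(s, s') = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$.

Having these components, the main problem is to find a policy $\pi$ (where $a_t = \pi(s_t)$) that maximizes the rewards: $\sum_{t=0}^{\infty} \gamma^t r_t$, in which $r_t$ is the reward received at time-step $t$ and $\gamma$ is a discount factor with $0 \le \gamma \le 1$.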
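The effect of the discount factor can be seen with a short calculation. This is a minimal sketch with a hypothetical helper name, `discounted_return`, applied to a finite reward sequence instead of an infinite one.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**t * r_t over a finite reward sequence, computed backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
print(discounted_return([0.0, 0.0, 10.0], gamma=0.9))  # roughly 8.1: a late reward counts less
```

With `gamma` close to 0 the agent is short-sighted, and with `gamma` close to 1 future rewards weigh almost as much as immediate ones.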

In the deep Q-Network approach, we need a deep neural network that approximates the optimal action-value function (Q) [39]:

$$Q^*(s, a) = \max_\pi \mathbb{E}\big[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \mid s_t = s, a_t = a, \pi\big].$$

This function finds the maximum sum of rewards $r_t$ discounted by $\gamma$ at each time-step $t$, achievable by a behavior policy $\pi$, after making an observation ($s$) and taking an action ($a$). We can convert this equation to a simpler approximation function using the Bellman equation. For a sequence of states $s'$ and for all possible actions $a'$, if the optimal value $Q^*(s', a')$ is known, then we can obtain the optimal strategy by selecting the action $a'$ that maximizes the expected value of $r + \gamma Q^*(s', a')$:

$$Q^*(s, a) = \mathbb{E}_{s'}\big[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\big].$$

To estimate the optimal action-value function, we use a non-linear function approximator (i.e., a neural network with weights $\theta$) such that $Q(s, a; \theta) \approx Q^*(s, a)$. The network can be trained by minimizing the loss functions $L_i(\theta_i)$, which are updated at each time-step.
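Before turning to the neural approximation, it may help to see the Bellman relation applied directly on a problem small enough for a table. The sketch below is an illustration under stated assumptions: a hypothetical deterministic chain of four states in which reaching the last state gives a reward of 1, solved by repeatedly applying the backup $Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')$. A deep Q-network replaces the table with a parametric function when the state space is too large to enumerate.

```python
import numpy as np

# Hypothetical deterministic chain: states 0..3, state 3 is terminal.
# Actions: 0 = left, 1 = right. Entering state 3 yields reward 1, otherwise 0.
N_STATES, N_ACTIONS, GAMMA = 4, 2, 0.9


def transition(state, action):
    """Return (next_state, reward) for the toy chain."""
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def q_value_iteration(n_sweeps=50):
    """Apply the Bellman optimality backup until the table stops changing."""
    q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(n_sweeps):
        for s in range(N_STATES - 1):        # the terminal state keeps Q = 0
            for a in range(N_ACTIONS):
                s_next, r = transition(s, a)
                q[s, a] = r + GAMMA * np.max(q[s_next])
    return q


q_star = q_value_iteration()
greedy_policy = np.argmax(q_star[:-1], axis=1)
print(q_star)
print(greedy_policy)  # [1 1 1]: always move right, toward the rewarding state
```

The greedy policy read off the converged table is the optimal strategy described above: in each state, pick the action with the largest action value.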

We perform experience replay, so we keep track of the agent's experiences at each time-step $t$ in a replay dataset $D$. This dataset of recently experienced transitions, along with the experience replay mechanism, is critical for the integration of reinforcement learning and deep neural networks [39].
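A replay dataset of this kind is commonly implemented as a fixed-size buffer that discards the oldest transitions and supports uniform sampling. The following is a minimal sketch under those assumptions; the class name `ReplayBuffer`, its capacity, and the extra `done` flag marking terminal states are illustrative choices, not details taken from this work.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of recent transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest entries are dropped when full
        self.rng = random.Random(seed)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Draw a minibatch uniformly, without replacement."""
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buf = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    buf.store(state=i, action=0, reward=1.0, next_state=i + 1, done=False)
print(len(buf))       # 3: only the most recent transitions are kept
print(buf.sample(2))  # two transitions drawn uniformly from the buffer
```

Sampling uniformly from stored transitions breaks the correlation between consecutive experiences, which is the property the update step relies on.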

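Q-learning updates are applied on samples from the training data that are uniformly drawn from the experience replay storage $D$. The Q-learning update in iteration $i$ uses the following loss function:

$$L_i(\theta_i) = \mathbb{E}_{(s, a, r, s') \sim U(D)}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\big)^2\Big],$$

in which $\theta_i$ represents the network parameters in iteration $i$, and the previous network parameters $\theta_{i-1}$ are used to compute the target ($r + \gamma \max_{a'} Q(s', a'; \theta_{i-1})$). The gradient of the loss function is computed with respect to the weights of the network:

$$\nabla_{\theta_i} L_i(\theta_i) = -2\,\mathbb{E}_{(s, a, r, s') \sim U(D)}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\big)\, \nabla_{\theta_i} Q(s, a; \theta_i)\Big].$$

Fig. 2: The proposed model. (a) The DRL agent considers x values as the next state of the environment and y values as a mechanism to compute the reward. For unlabeled data, only the x values are incorporated into the model. (b) A general deep neural network to be used for supervised DRL.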
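The Q-learning loss can be illustrated numerically on a small minibatch. The sketch below is an assumption-laden illustration, not the training code of this work: the Q-values are given as plain arrays instead of network outputs, the function names `td_targets` and `dqn_loss` are hypothetical, and terminal transitions use the reward alone as the target.

```python
import numpy as np


def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """Targets r + gamma * max_a' Q(s', a'), or just r for terminal transitions."""
    rewards = np.asarray(rewards, dtype=float)
    next_q = np.asarray(next_q_values, dtype=float)
    not_done = ~np.asarray(dones, dtype=bool)
    return rewards + gamma * np.max(next_q, axis=1) * not_done


def dqn_loss(q_values, actions, targets):
    """Mean squared difference between targets and Q(s, a) of the taken actions."""
    q_values = np.asarray(q_values, dtype=float)
    chosen = q_values[np.arange(len(actions)), actions]
    return float(np.mean((targets - chosen) ** 2))


# Minibatch of two transitions; the second one ends in a terminal state.
targets = td_targets(
    rewards=[1.0, 0.0],
    next_q_values=[[0.5, 2.0], [1.0, 3.0]],  # would come from the previous parameters
    dones=[False, True],
    gamma=0.5,
)
print(targets)                                                 # [2. 0.]
print(dqn_loss([[1.0, 0.0], [0.0, 1.0]], [0, 1], targets))     # 1.0
```

In a full implementation the targets are treated as constants during differentiation, so only the Q-values of the taken actions contribute to the gradient.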

The semi-supervised DRL algorithm is then described in Algorithm 1 to learn from both labeled and unlabeled data.

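1:  Input: A dataset of labeled and unlabeled data $\{(X_l, Y_l), X_u\}$
2:  Initialize the model parameters $\theta$, $\phi$, environment, state space, and replay memory $D$
3:  for episode $\leftarrow 1$ to $M$ do
4:     for each sample $(x, y)$ or $x$ in dataset do
5:         $s \leftarrow$ make observation of sample $x$
6:        for $t \leftarrow 1$ to $T$ do
7:            Take an action $a_t$ using $\epsilon$-greedy strategy
8:            Perform action $a_t$ to change the current state $s_t$ to the next state $s_{t+1}$
9:           if sample is unlabeled then
10:              Infer the label $y$ based on (4): $q_\phi(y \mid x)$ and get approximate reward $r_t$
11:           else
12:              Observe reward $r_t$ that corresponds to label $y$
13:            Store transition ($s_t$, $a_t$, $r_t$, $s_{t+1}$) in $D$
14:            Take a random minibatch of transitions ($s_k$, $a_k$, $r_k$, $s_{k+1}$) from $D$
15:           if $s_{k+1}$ is a terminal state then
16:               target $\hat{y}_k = r_k$
17:           else
18:               target $\hat{y}_k = r_k + \gamma \max_{a'} Q(s_{k+1}, a'; \theta)$
19:           Apply gradient descent on $(\hat{y}_k - Q(s_k, a_k; \theta))^2$ based on (13)
20:        end for
21:     end for
22:  end for
Algorithm 1 Semi-Supervised DRL Algorithm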
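The two steps that distinguish Algorithm 1 from a standard DQN loop are the action selection and the way the reward is obtained for labeled versus unlabeled samples. The sketch below illustrates only those two steps under explicit assumptions: the function names are hypothetical, the class probabilities stand in for the output of the inference network $q_\phi(y \mid x)$, the label is inferred by taking the most probable class (sampling from the distribution is another option), and `reward_fn` is a placeholder for whatever reward the application defines.

```python
import numpy as np


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))


def sample_reward(state, label, class_probs, reward_fn):
    """Return (reward, label); the label is inferred when the sample is unlabeled."""
    if label is None:
        # Stand-in for inferring y from q_phi(y|x): take the most probable class.
        label = int(np.argmax(class_probs))
    return reward_fn(state, label), label


rng = np.random.default_rng(0)
closeness = lambda state, label: -abs(state - label)  # hypothetical reward

print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0, rng=rng))   # 1: purely greedy
print(sample_reward(2, None, [0.1, 0.2, 0.7], closeness))       # (0, 2): inferred label
print(sample_reward(2, 0, [0.1, 0.2, 0.7], closeness))          # (-2, 0): given label
```

The remaining steps (storing transitions, sampling a minibatch, computing targets, and the gradient step) follow the usual DQN procedure.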

Figure 2 shows the high-level model that uses the deep reinforcement learning technique in conjunction with a generative semi-supervised model instead of a DNN (cf. Figure 2-b) to handle unlabeled observations. The VAE is extended to have an additional hidden layer and an output to generate the actions.

As with other learning processes, the training process for this algorithm is performed offline while policy prediction is performed online. Hence, the algorithm can handle problems with high-dimensional and high-volume data using high performance computing facilities (e.g., cloud servers) to generate the model for online policy prediction. This ability stems from the integration of deep neural networks with reinforcement learning to generate approximation functions for high-dimensional datasets. This integrated model outperforms the traditional methods of reinforcement learning.

IV Use Case: Indoor Localization

Several use cases can be envisaged for the proposed approach in a smart city context. For example, this approach can be used for home energy management in conjunction with the Non-Intrusive Load Monitoring (NILM) method and smart meters. In such systems, a small set of labeled data provides individual appliances' usage and their on and off times. A semi-supervised deep reinforcement learning model can be trained over this small-scale training dataset as well as the stream of unlabeled data with the objective of optimizing energy usage by controlling when to switch appliances on and off.

It can also be used in the context of Intelligent Transportation Systems (ITS) by smart vehicles for navigation in a city context. In such applications, a combination of several factors can be used for the reward function such as closeness to the destination, shortest path, speed, speed variability, etc. The vehicle needs to be trained on several test drives then it uses the large set of unlabeled data to accurately navigate through the city.

Due to the importance of indoor localization and ease of implementation, we showcase the proposed method on the localization problem in the context of smart campus, which is part of a larger smart city context. Despite the fact that indoor localization has been studied extensively in recent years, still it is an open problem bringing several challenges that need to be tackled.

Indoor positioning systems have been proposed with different technologies such as vision, visible light communications (VLC), infrared, ultrasound, WiFi, RFID, and BLE [40]. One determining factor for organizations choosing a technology is the cost of the underlying technologies and devices. Among the aforementioned technologies, BLE is a low-cost solution that has attracted attention for academic and commercial applications [9]. Combining BLE and iBeacon technologies to design an indoor location-aware system brings many advantages to buildings that are not equipped with wireless networks. Since iBeacon devices have a small form factor, they can be deployed quickly and easily without changing or even tapping into the building’s electrical and communications infrastructure [40].

In recent years, deep learning has been shown to perform favorably compared to other machine learning approaches. One main challenge for deep learning is the need to collect a large volume of labeled data (a.k.a. the calibration procedure). In contrast, scanning a large-scale area like a city or a campus to collect unlabeled data is typically fairly straightforward. Therefore, to benefit from the enormous volume of unlabeled data, we apply the semi-supervised deep reinforcement learning approach to investigate the benefits of unlabeled data in practical scenarios.

Compared to many related works that have performed their studies in a simulated environment, a small area, or an isolated testbed, we conducted our experiments in an academic library, a large and busy operational environment that thousands of visitors pass through every day. The experiment is therefore valuable for the IoT and AI communities. In addition, there are no similar attempts that address the positioning problem through the reinforcement learning approach.

In this case study, we utilize a grid of iBeacons to implement a location-aware service offering in a campus setting. In our work, we use the iBeacons’ Received Signal Strength Indicator (RSSI) as the raw source of input data for a deep reinforcement learning model to identify indoor locations.

RSSI is usually represented by a negative number between 0 and -100, and in localization systems it can be used as an indication of the distance separating the transmitter from the receiver (i.e., ranging). In addition to the separating distance, RSSI is affected by other factors such as the movement of people and objects in the signal path and the temperature and humidity of the environment. The distance estimation from a given point to an iBeacon can be derived as follows:


$$RSSI = -10\, n \log_{10}(d) + A$$

where $n$ is the signal propagation constant, $d$ is the distance in meters and $A$ is the offset RSSI reading at 1 meter from the transmitter.
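
As an illustration of ranging from RSSI, the sketch below inverts the standard log-distance path loss model, RSSI = A - 10 n log10(d), to estimate distance. This is a minimal example under assumptions: the function name `rssi_to_distance` and the default values (an offset of -59 at 1 meter and a propagation constant of 2.0) are hypothetical and are not taken from the paper's deployment.

```python
def rssi_to_distance(rssi, rssi_at_1m=-59.0, n=2.0):
    """Estimate distance in meters from an RSSI reading.

    Inverts RSSI = rssi_at_1m - 10 * n * log10(d).
    rssi_at_1m and n are illustrative defaults, not calibrated values.
    """
    return 10 ** ((rssi_at_1m - rssi) / (10.0 * n))

# A reading equal to the 1-meter offset maps to a distance of 1 meter.
one_meter = rssi_to_distance(-59.0)
# A reading 20 units weaker maps to 10 meters when n = 2.
ten_meters = rssi_to_distance(-79.0)
```

In practice both parameters would need to be calibrated per environment, since the fluctuations described above make a single reading an unreliable range estimate.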

Due to fluctuations of the received signal strength, many research studies that utilize RSSI fingerprinting perform a preprocessing step to extract more representative features. These preprocessing approaches include averaging multiple RSSI values for the same location, using a Gaussian distribution model to filter outliers, and using PCA to reduce the effect of noise in addition to offering new features. In our work, we performed a categorization preprocessing in which an RSSI category represents a range of RSSI values. We explain the exact procedure in Section V-B.


IV-A Description of the Environment

The environment is represented as a set of positions that are labeled by row and column numbers. Each position is also associated with the set of RSSI values from the set of deployed iBeacons. The agent observes the environment by receiving RSSI values at each time step. Our design requires the agent to take action based on the three most recent RSSI observations.

The agent can choose one of the eight allowed actions to move in different directions. In turn, the agent obtains a positive or negative reward according to its proximity to the target position. The goal of the agent is to approximate the position of the device that has received the RSSI values from the environment by moving in different directions.
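
The requirement that the agent acts on the three most recent RSSI observations can be sketched with a fixed-length buffer. This is an illustrative sketch only: the class name `ObservationWindow` and the choice to concatenate the buffered vectors (oldest first) into one network input are assumptions, since the exact input layout is not specified here. The beacon count of 13 matches the deployment described in the dataset section.

```python
from collections import deque

NUM_BEACONS = 13  # one RSSI value per deployed iBeacon
HISTORY = 3       # the agent acts on the three most recent observations

class ObservationWindow:
    """Keeps the most recent RSSI vectors and exposes them as one flat input."""

    def __init__(self):
        # A bounded deque silently discards the oldest vector when full.
        self._window = deque(maxlen=HISTORY)

    def push(self, rssi_vector):
        if len(rssi_vector) != NUM_BEACONS:
            raise ValueError("expected one RSSI value per beacon")
        self._window.append(tuple(rssi_vector))

    def ready(self):
        """True once enough observations have been collected to act."""
        return len(self._window) == HISTORY

    def as_input(self):
        """Concatenate the buffered vectors, oldest first (assumed layout)."""
        return [value for vector in self._window for value in vector]
```

For example, after pushing four vectors the window holds only the last three, so `as_input()` returns 39 values.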

| Action # | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| Move to | West | East | North | South | NW | NE | SW | SE |

TABLE I: List of actions to perform positioning

To adopt a deep reinforcement learning approach, we need to define the following elements for the MDP.

Environment: the active environment is a floor on which a particular position should be identified based on a vector of iBeacon RSSI values. The environment is divided into a grid of same-size cells as shown in Figure 3.

Fig. 3: Illustration of a typical indoor environment for deep reinforcement learning.

Agent: The positioning algorithm itself is represented as an agent. The agent interacts with the environment over time.

States: The state of the agent is represented as a tuple of these observations:

  1. a vector of RSSI values,

  2. current location (identified by row and column numbers), and

  3. distance to the target (for labeled data).

Actions: The action is to move to one of the neighboring cells in the direction of North, East, West, or South, or in an intermediate direction such as North West (NW). The first action chooses a random state in the grid. Table I shows the list of allowed actions.
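
The action set in Table I can be sketched as a lookup from action number to a grid displacement. The sketch below is illustrative and rests on assumptions that are not stated in the paper: row indices are taken to grow southward and column indices eastward, and moves that would leave the grid are clipped to the boundary. The names `ACTIONS` and `step` are hypothetical.

```python
# Action number -> (row delta, column delta), following the order in Table I.
# Assumption: rows grow southward and columns grow eastward.
ACTIONS = {
    0: (0, -1),   # West
    1: (0, 1),    # East
    2: (-1, 0),   # North
    3: (1, 0),    # South
    4: (-1, -1),  # NW
    5: (-1, 1),   # NE
    6: (1, -1),   # SW
    7: (1, 1),    # SE
}

def step(position, action, n_rows, n_cols):
    """Move to a neighboring cell, clipping at the grid boundary (assumed)."""
    row, col = position
    d_row, d_col = ACTIONS[action]
    new_row = min(max(row + d_row, 0), n_rows - 1)
    new_col = min(max(col + d_col, 0), n_cols - 1)
    return new_row, new_col
```

With an 18 by 20 grid, `step((5, 5), 7, 18, 20)` moves the agent one cell to the south-east, while a north-west move from the corner cell `(0, 0)` leaves it in place.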

Reward function: the reward function is the reciprocal of the distance error. The reward function has a positive value if the distance to the target point is less than a threshold (). Otherwise, the agent receives a negative reward. Whenever the agent is close to the target, it gains more rewards. On the other hand, if the agent wanders away from the target and its distance is larger than a threshold (), it gains a negative reward. The reward function is represented as follows:

in which is the observed location and is the target location.
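
A reward of this shape can be sketched as follows. Only the positive branch (the reciprocal of the distance error inside the threshold) is taken from the description above; the form of the negative branch (the negated distance), the default threshold of 3.0, and the small `eps` guard against division by zero are assumptions made for illustration.

```python
import math

def reward(observed, target, threshold=3.0, eps=1e-6):
    """Reward for an observed (x, y) location relative to the target location."""
    distance = math.hypot(observed[0] - target[0], observed[1] - target[1])
    if distance < threshold:
        # Reciprocal of the distance error: closer positions earn more reward.
        # eps avoids a division by zero when the agent is exactly on target.
        return 1.0 / max(distance, eps)
    # Assumed penalty: the negated distance, so wandering farther costs more.
    return -distance
```

For instance, `reward((0, 0), (0, 2))` gives 0.5, whereas `reward((0, 0), (3, 4))` gives -5.0 because the distance of 5 exceeds the threshold.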

V Experimental Results

Here we describe our evaluation on a real-world dataset. Our experiments were carried out on the first floor of the Western Michigan University (WMU) Waldo Library. Figure 4 shows the overall layout of the deployment site. In our work, we use the iBeacon RSSI values as the raw source of input data to identify indoor locations. Smartphones are used to sense the iBeacons’ signals and to compute the current position of the user with respect to the set of known iBeacons. Our model utilizes the semi-supervised deep reinforcement learning algorithm to learn from the historical patterns of RSSI values and their corresponding estimated positions to improve its policy when identifying a position based on previously unseen RSSI values.

Fig. 4: Experimental setup with iBeacons.

V-A Dataset

Our dataset is gathered from a real-world deployment of a grid of iBeacons in a campus library area of 200 ft × 180 ft. We mounted 13 iBeacons on the ceiling of the first floor of Waldo Library at Western Michigan University. The floor contains many pillars that might deteriorate the iBeacon signals, so we arranged the iBeacons such that each location is covered by several iBeacons. Each iBeacon is separated by a distance of 30-40 ft from adjacent iBeacons. To capture the signal strength indicator of these iBeacons, we divided the area into small zones by mapping a grid with cells of size 10 × 10 ft. We also developed a dedicated mobile app to capture training data. For that purpose, we stood on each cell and captured all the iBeacons’ received signals. We also manually assigned the location (i.e., label of the cell) to the captured signals. We stored at least three instances of RSSIs for each cell to have a more reliable measurement and consequently to reduce the effect of noisy data. Overall, we collected 820 labeled data points for training, 600 data points for testing, and 5200 unlabeled data points for semi-supervised learning.
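
The grid mapping described above can be sketched as a conversion between floor coordinates and cell labels. This is an illustrative sketch: the orientation (columns along the 200 ft side, rows along the 180 ft side), the origin at one corner of the area, and the names `cell_of` and `cell_center` are assumptions.

```python
CELL_FT = 10                      # cell side length in feet
AREA_W_FT, AREA_H_FT = 200, 180   # mapped area in feet

# Assumed orientation: columns along the 200 ft side, rows along the 180 ft side.
N_COLS = AREA_W_FT // CELL_FT
N_ROWS = AREA_H_FT // CELL_FT

def cell_of(x_ft, y_ft):
    """Return the (row, column) label of the cell containing a point."""
    if not (0 <= x_ft < AREA_W_FT and 0 <= y_ft < AREA_H_FT):
        raise ValueError("point lies outside the mapped area")
    return int(y_ft // CELL_FT), int(x_ft // CELL_FT)

def cell_center(row, col):
    """Return the (x, y) coordinates in feet of the center of a cell."""
    return (col + 0.5) * CELL_FT, (row + 0.5) * CELL_FT
```

If the whole rectangle were covered, this layout would give 18 rows by 20 columns; the number of cells actually surveyed may be smaller.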

V-B Preprocessing

Our initial experiments with the raw RSSI values for supervised deep learning showed that the relationships between the features are not truly revealed by deep learning models. We therefore enriched the original features with two additional sets of features, giving three feature sets:

  • Raw: The original features that come from the direct RSSI readings.

  • S1: The set of features that represent the mutual differences of iBeacon RSSI values; i.e., for every pair of distinct beacons $i$ and $j$, the value $r_i - r_j$, representing the difference between the RSSI value of beacon $i$ and beacon $j$.

  • S2: The other set of features designed to represent the categorical values of RSSIs in a Boolean membership mode such that for each beacon we define several categories by a specific interval (e.g., 10) and then represent each RSSI value with the category to which it belongs.
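
The two derived feature sets can be sketched as follows. This is a minimal illustration under assumptions: S1 is computed over unordered beacon pairs, and S2 uses equal-width ranges over an assumed RSSI span of -100 to 0; the exact range boundaries used in the experiments are not given here. The function names are hypothetical.

```python
from itertools import combinations

def s1_features(rssi):
    """Pairwise differences r_i - r_j for every unordered beacon pair (i < j)."""
    return [a - b for a, b in combinations(rssi, 2)]

def s2_features(rssi, interval=10, low=-100, high=0):
    """Boolean range membership: one indicator per range, per beacon."""
    n_bins = (high - low) // interval
    features = []
    for r in rssi:
        clipped = min(max(r, low), high)  # keep out-of-range readings in the edge bins
        index = min(int((clipped - low) // interval), n_bins - 1)
        indicator = [0] * n_bins
        indicator[index] = 1
        features.extend(indicator)
    return features

def enrich(rssi, interval=10):
    """Raw readings followed by their S2 range-membership features."""
    return list(rssi) + s2_features(rssi, interval)
```

Each beacon contributes exactly one active indicator in S2, so a smaller interval yields a longer but still sparse vector.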

Table II shows the average accuracy of the different feature sets over ten replications. These features are added to the raw features. As can be seen from the table, adding feature set S1 to the raw features has a minor effect on the average accuracy. On the other hand, adding feature set S2 increases the average accuracy, especially for finer-grained positioning. Also, the combination of S1 and S2 is not as good as using only S2, since S1 lowers the accuracy. This observation indicates that enriching a feature set with pairwise differences of RSSI values (S1) has a minor negative effect on the accuracy of the model, since those features are not strong discriminative factors.

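| Interval | Feature set | Accuracy (1m) | Accuracy (3m) | Accuracy (6m) | Accuracy (9m) |
|---|---|---|---|---|---|
| - | raw | 0.17 | 0.47 | 0.74 | 0.95 |
| 10 | raw_s1 | 0.18 | 0.49 | 0.75 | 0.95 |
| 10 | raw_s2 | 0.26 | 0.55 | 0.75 | 0.97 |
| 10 | raw_s1_s2 | 0.24 | 0.52 | 0.74 | 0.96 |
| 5 | raw_s2 | 0.30 | 0.57 | 0.76 | 0.97 |

TABLE II: Accuracy of different feature sets in a deep neural network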
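
Table II does not spell out how accuracy at 1m, 3m, 6m, and 9m is computed. One plausible reading, assumed here, is the fraction of test points whose predicted position falls within the given distance of the true position. The sketch below implements that reading; the function name `accuracy_within` and the toy coordinates are hypothetical and are not data from the experiments.

```python
import math

def accuracy_within(predicted, actual, radius_m):
    """Fraction of points whose prediction lies within radius_m of the truth."""
    hits = sum(
        1
        for p, a in zip(predicted, actual)
        if math.hypot(p[0] - a[0], p[1] - a[1]) <= radius_m
    )
    return hits / len(actual)

# Toy positions in meters, for illustration only.
predicted = [(0.0, 0.0)] * 4
actual = [(0.0, 0.5), (0.0, 2.0), (0.0, 5.0), (0.0, 8.0)]
scores = [accuracy_within(predicted, actual, r) for r in (1, 3, 6, 9)]
```

Under this reading the accuracy can only grow with the radius, which matches the monotone rows of the table.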

The table also demonstrates that using S2 features with the RSSI categorical interval set to 5 leads to even better results. Therefore, based on these results we use the combination of raw features and S2. Using this preprocessing, each data point is represented as a vector of 13 RSSI values plus 156 range-membership features (i.e., 12 ranges for each of the 13 beacons), resulting in a total of 169 features. Each label is a (row, column) pair pointing to a specific location.

Fig. 5: Obtaining rewards and distances in six episodes with supervised and semi-supervised DRL models.

V-C Evaluation

To implement our proposed semi-supervised DRL model, we adopted the deep reinforcement learning algorithm, in which we incorporated a variational autoencoder to generate more rewarding policies and consequently increase the accuracy of the localization process. The deep neural networks are implemented on Google TensorFlow [41] using the Keras package [42].

To evaluate the performance of the proposed semi-supervised DRL model, we performed two sets of experiments: one in which the DRL framework uses a fully-connected deep neural network for supervised learning; and the other in which the DRL framework uses a stacked variational autoencoder for semi-supervised learning.

Figure 5 shows the performance of the DRL in terms of the received rewards as well as the distance to the true target for both supervised and semi-supervised models in six episodes (cf. labels 1-6 in the figure). The plots show that the agent in the semi-supervised model learns to achieve higher rewards and smaller distances to the target than the agent in the supervised model.

Table III shows that the semi-supervised model gets closer to the target points than the supervised model alone. It also indicates that the agent reaches or approaches the target faster within the same number of epochs. The differences of distances in this table emphasize that the semi-supervised model generates policies that improve the average convergence speed of the localization system by a factor of at least 4.

Average distance to the target points (meter):

| Model | As the agent starts | End of epochs | Difference |
|---|---|---|---|
| Supervised | 9.4 | 7.4 | 2 |
| Semi-supervised | 12.8 | 4.3 | 8.5 |

TABLE III: The average speed of convergence to destination points
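
The factor of at least 4 quoted for the convergence speed follows from the ratio of the two distance reductions in Table III:

```latex
\frac{12.8 - 4.3}{9.4 - 7.4} = \frac{8.5}{2.0} = 4.25
```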

In Figures 6 and 7, the comparison of the semi-supervised model with the supervised model over different numbers of epochs shows the efficacy of the semi-supervised approach in handling the localization problem. The results in Figure 6 show that the semi-supervised model reaches a higher reward faster than the supervised model while keeping its reward trend stable. From this figure, it can be seen that the semi-supervised model gains at least 67% more rewards than the supervised model. In addition, the semi-supervised model achieves about twice the rewards of the supervised model. This result can be translated to the original measurement, namely the effect of the models on the accuracy of the localization, as depicted in Figure 7. Figure 7 shows the average distance to the target points for different numbers of epochs. Here the semi-supervised model achieves a 6% to 23% improvement in localization. This result indicates that the unlabeled data helps the VAE to better identify the discriminative boundaries and consequently improves the accuracy of the semi-supervised model.

Fig. 6: The average rewards that are obtained by DRL over different epoch counts using supervised model versus semi-supervised model.
Fig. 7: The average distance to the target over different epoch counts using supervised model versus semi-supervised model.

VI Conclusion

We proposed a semi-supervised deep reinforcement learning framework as a learning mechanism in support of smart IoT services. The proposed model uses a small set of labeled data along with a larger set of unlabeled ones. The current work is the first attempt that extends the semi-supervised reinforcement learning approach using deep reinforcement learning. The proposed model consists of a deep variational autoencoder network that learns the best policies for taking optimal actions by the agent.

As a use case, we experimented with the proposed model in an indoor localization system. Our experimental results illustrate that the proposed semi-supervised deep reinforcement learning model is able to generalize the positioning policy for configurations where the environment data is a mix of labeled and unlabeled data and achieve better results compared to using a set of only labeled data in a supervised model. The results show an improvement of 23% on the localization accuracy in the proposed semi-supervised deep reinforcement learning model. Also, in terms of gaining rewards, the semi-supervised model outperforms the supervised model by receiving at least 67% more rewards.

This study shows that IoT applications in general, and smart city applications in particular, where context-awareness is a valuable asset, can benefit immensely from unlabeled data to improve the performance and accuracy of their learning agents. Furthermore, semi-supervised deep reinforcement learning is a good solution for many IoT applications since it requires little supervision, relying on reward feedback as it learns the best policy for choosing among alternative actions.


The authors would like to thank Western Michigan University Libraries for providing the experimental testbed and space needed to conduct this research.


  • [1] A. Al-Fuqaha, M. Guizani, M. Mohammadi, M. Aledhari, and M. Ayyash, “Internet of Things: A survey on enabling technologies, protocols, and applications,” IEEE Communications Surveys & Tutorials, vol. 17, no. 4, pp. 2347–2376, 2015.
  • [2] N. E. Klepeis, W. C. Nelson, W. R. Ott, J. P. Robinson, A. M. Tsang, P. Switzer, J. V. Behar, S. C. Hern, W. H. Engelmann et al., “The national human activity pattern survey (NHAPS): a resource for assessing exposure to environmental pollutants,” Journal of exposure analysis and environmental epidemiology, vol. 11, no. 3, pp. 231–252, 2001.
  • [3] L. Mainetti, V. Mighali, and L. Patrono, “A location-aware architecture for heterogeneous building automation systems,” in 2015 IFIP/IEEE International Symposium on Integrated Network Management (IM).   IEEE, 2015, pp. 1065–1070.
  • [4] S. Alletto, R. Cucchiara, G. Del Fiore, L. Mainetti, V. Mighali, L. Patrono, and G. Serra, “An indoor location-aware system for an iot-based smart museum,” IEEE Internet of Things Journal, vol. 3, no. 2, pp. 244–253, 2016.
  • [5] M. V. Moreno, J. L. Hernández, and A. F. Skarmeta, “A new location-aware authorization mechanism for indoor environments,” in Advanced Information Networking and Applications Workshops (WAINA), 2014 28th International Conference on.   IEEE, 2014, pp. 791–796.
  • [6] G. Sunkada, “System for and method of location aware marketing,” Feb. 9 2012, US Patent App. 12/851,968.
  • [7] P. Dickinson, G. Cielniak, O. Szymanezyk, and M. Mannion, “Indoor positioning of shoppers using a network of bluetooth low energy beacons,” in Indoor Positioning and Indoor Navigation (IPIN), 2016 International Conference on.   IEEE, 2016, pp. 1–8.
  • [8] J. Torres-Sospedra, J. Avariento, D. Rambla, R. Montoliu, S. Casteleyn, M. Benedito-Bordonau, M. Gould, and J. Huerta, “Enhancing integrated indoor/outdoor mobility in a smart campus,” International Journal of Geographical Information Science, vol. 29, no. 11, pp. 1955–1968, 2015.
  • [9] F. Zafari, I. Papapanagiotou, and K. Christidis, “Microlocation for internet-of-things-equipped smart buildings,” IEEE Internet of Things Journal, vol. 3, no. 1, pp. 96–112, 2016.
  • [10] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
  • [11] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.   MIT press Cambridge, 1998, vol. 1, no. 1.
  • [12] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
  • [13] R. Faragher and R. Harle, “Location fingerprinting with bluetooth low energy beacons,” IEEE Journal on Selected Areas in Communications, vol. 33, no. 11, pp. 2418–2428, 2015.
  • [14] P. Christiano. (2016) Semi-supervised reinforcement learning. [Online]. Available:
  • [15] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané, “Concrete problems in ai safety,” arXiv preprint arXiv:1606.06565, 2016.
  • [16] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling, “Semi-supervised learning with deep generative models,” in Advances in Neural Information Processing Systems, 2014, pp. 3581–3589.
  • [17] S. Nemati, M. M. Ghassemi, and G. D. Clifford, “Optimal medication dosing from suboptimal clinical examples: A deep reinforcement learning approach,” in Engineering in Medicine and Biology Society (EMBC), 2016 IEEE 38th Annual International Conference of the.   IEEE, 2016, pp. 2978–2981.
  • [18] D. Zhao, Y. Chen, and L. Lv, “Deep reinforcement learning with visual attention for vehicle classification,” IEEE Transactions on Cognitive and Developmental Systems.
  • [19] J. C. Caicedo and S. Lazebnik, “Active object localization with deep reinforcement learning,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2488–2496.
  • [20] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi, “Target-driven visual navigation in indoor scenes using deep reinforcement learning,” arXiv preprint arXiv:1609.05143, 2016.
  • [21] L. Li, Y. Lv, and F.-Y. Wang, “Traffic signal timing via deep reinforcement learning,” IEEE/CAA Journal of Automatica Sinica, vol. 3, no. 3, pp. 247–254, 2016.
  • [22] H. Mao, M. Alizadeh, I. Menache, and S. Kandula, “Resource management with deep reinforcement learning,” in Proceedings of the 15th ACM Workshop on Hot Topics in Networks.   ACM, 2016, pp. 50–56.
  • [23] K. Narasimhan, T. Kulkarni, and R. Barzilay, “Language understanding for text-based games using deep reinforcement learning,” arXiv preprint arXiv:1506.08941, 2015.
  • [24] J. He, J. Chen, X. He, J. Gao, L. Li, L. Deng, and M. Ostendorf, “Deep reinforcement learning with a natural language action space,” arXiv preprint arXiv:1511.04636, 2015.
  • [25] V. François-Lavet, “Deep reinforcement learning solutions for energy microgrids management,” in Proceedings of the European Workshop on Reinforcement Learning, 2016.
  • [26] B. Wang, Q. Chen, L. T. Yang, and H.-C. Chao, “Indoor smartphone localization via fingerprint crowdsourcing: challenges and approaches,” IEEE Wireless Communications, vol. 23, no. 3, pp. 82–89, 2016.
  • [27] W. Zhang, K. Liu, W. Zhang, Y. Zhang, and J. Gu, “Deep neural networks for wireless localization in indoor and outdoor environments,” Neurocomputing, vol. 194, pp. 279–287, 2016.
  • [28] S. Kajioka, T. Mori, T. Uchiya, I. Takumi, and H. Matsuo, “Experiment of indoor position presumption based on rssi of bluetooth le beacon,” in 2014 IEEE 3rd Global Conference on Consumer Electronics (GCCE).   IEEE, 2014, pp. 337–339.
  • [29] Y. Wang, X. Yang, Y. Zhao, Y. Liu, and L. Cuthbert, “Bluetooth positioning using rssi and triangulation methods,” in 2013 IEEE 10th Consumer Communications and Networking Conference (CCNC).   IEEE, 2013, pp. 837–842.
  • [30] X. Wang, L. Gao, S. Mao, and S. Pandey, “Deepfi: Deep learning for indoor fingerprinting using channel state information,” in 2015 IEEE Wireless Communications and Networking Conference (WCNC).   IEEE, 2015, pp. 1666–1671.
  • [31] Y. Gu, Y. Chen, J. Liu, and X. Jiang, “Semi-supervised deep extreme learning machine for wi-fi based localization,” Neurocomputing, vol. 166, pp. 282–293, 2015.
  • [32] G. Ding, Z. Tan, J. Zhang, and L. Zhang, “Fingerprinting localization based on affinity propagation clustering and artificial neural networks,” in 2013 IEEE Wireless Communications and Networking Conference (WCNC).   IEEE, 2013, pp. 2317–2322.
  • [33] J. Luo and H. Gao, “Deep belief networks for fingerprinting indoor localization using ultrawideband technology,” International Journal of Distributed Sensor Networks, vol. 2016, p. 18, 2016.
  • [34] X. Zhang, J. Wang, Q. Gao, X. Ma, and H. Wang, “Device-free wireless localization and activity recognition with deep learning,” in 2016 IEEE International Conference on Pervasive Computing and Communication Workshops (PerCom Workshops).   IEEE, 2016, pp. 1–5.
  • [35] T. Pulkkinen, T. Roos, and P. Myllymäki, “Semi-supervised learning for wlan positioning,” in International Conference on Artificial Neural Networks.   Springer, 2011, pp. 355–362.
  • [36] J. Weston, F. Ratle, H. Mobahi, and R. Collobert, “Deep learning via semi-supervised embedding,” in Neural Networks: Tricks of the Trade.   Springer, 2012, pp. 639–655.
  • [37] O. Chapelle, B. Schölkopf, and A. Zien, “Semi-supervised learning,” 2006.
  • [38] J. Walker, C. Doersch, A. Gupta, and M. Hebert, “An uncertain future: Forecasting from static images using variational autoencoders,” in European Conference on Computer Vision.   Springer, 2016, pp. 835–851.
  • [39] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
  • [40] L. Mainetti, L. Patrono, and I. Sergi, “A survey on indoor positioning systems,” in Software, Telecommunications and Computer Networks (SoftCOM), 2014 22nd International Conference on.   IEEE, 2014, pp. 111–120.
  • [41] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
  • [42] F. Chollet. (2015) Keras: Deep learning library for theano and tensorflow. [Online]. Available: