I Introduction
The rapid development of Internet of Things (IoT) technologies has motivated researchers and developers to think about new kinds of smart services that extract knowledge from IoT-generated data. The scarcity of labeled data is a main issue in developing such solutions, especially for IoT applications in which a large number of sensors generate data for which no class labels can be obtained. Smart cities, as a prominent application area of the IoT, should provide a range of high-quality smart services to meet citizens' needs [1]. Smart buildings are one of the main building blocks of smart cities, as people nowadays spend over 87% of their daily lives indoors [2] for work, shopping, education, etc. Therefore, a smart environment that provides services to meet the needs of its inhabitants is a valuable asset for organizations. Such services facilitate the development of smart cities. Location-aware services in indoor environments play a significant role in this era. Examples of applications of such services are smart home management [3], delivering cultural content in museums [4], location-based authentication and access control [5], location-aware marketing and advertisement [6][7], and wayfinding and navigation in smart campuses [8]. Moreover, locating users in indoor environments is very important for smart buildings because it serves as the link that enables the users to interact with other IoT services [9].
Deep learning is a powerful machine learning approach that provides function approximation, classification, and prediction capabilities. Reinforcement learning is another class of machine learning approaches for optimal control and decision-making processes, in which a software agent learns an optimal policy of actions over the set of states in an environment. In applications where the number of states is very large, a deep learning model can be used to approximate the action values (i.e., how good an action is in a given state). Systems that combine deep learning and reinforcement learning are in their initial phases but have already produced competitive results in some application areas (e.g., video games). Moreover, learning approaches with little or no supervision are expected to gain more momentum in the future [10], mimicking the natural learning processes of humans and animals.
IoT applications can benefit from the decision process for learning purposes. For example, in the case of location-aware services, location estimation can be seen as a decision process in which a software agent determines the exact or closest point to a specific target. In this regard, reinforcement learning [11] can be exploited to formulate and solve the problem. In a reinforcement learning solution, a software agent interacts with the environment and changes the state of the environment by performing some actions. Depending on the performed action, the environment sends a reward to the agent. The agent tries to maximize its rewards over time by choosing those actions that result in higher rewards. A new variation of reinforcement learning, deep reinforcement learning (DRL), was recently demonstrated by Google to achieve high accuracy in Atari games [12] and is a suitable candidate for the learning process in IoT applications.
In this research, we propose a semi-supervised deep reinforcement learning model to benefit from the large amount of unlabeled data generated in IoT applications.
The points that motivate us for this study are:

In the world of IoT, where sensors generate large amounts of data that cannot be labeled manually for training purposes, semi-supervised approaches are valuable. Moreover, building a deep reinforcement learning framework that works in a semi-supervised manner can serve many IoT applications.

Games demonstrated significant improvements using DRL [12]. IoT applications also can be seen as a game where the goal is to estimate the correct classification of a given input and hence can benefit from a DRL approach.

The learning process at the scale of smart cities requires considerable effort, including data gathering, analysis and classification. The strength of deep learning models stems from the latest advancements in computational and data storage capabilities. Such models can be utilized to develop scalable and efficient learning solutions for smart cities from crowd-sensed big data.

Smart city applications can be trained in a lab and deployed in a real environment without losing performance. For example, a self-driving car needs to learn how to perform in a variety of conditions (e.g., approaching pedestrians, handling traffic signs, etc.) which can be learned in a few test drives. But it is impossible to account for all the scenarios that might happen in a given city.
Also, our study is motivated by specific observations regarding the localization problem including:

While WiFi fingerprinting has been studied widely in the past decade for indoor positioning and the accuracy is in the range of 10 m, BLE is in its infancy for indoor localization and has yielded more fine-grained results [13].

There are many practical applications that need an efficient mechanism for positioning in small-scale environments, such as robotic soccer games to locate the position of the ball, or navigating robots in a building. Our proposed approach can be extended and used in such scenarios, aiming to enhance micro-localization accuracy in support of smart environments [9].
The contributions of this paper are as follows:

We propose a semi-supervised deep reinforcement learning framework based on deep generative models and reinforcement learning that combines the strengths of deep neural networks and statistical modeling of data density in a reinforcement learning paradigm. To the best of our knowledge, this work is the first attempt to address semi-supervised learning through deep reinforcement learning.

We leverage both labeled and unlabeled data in our model. Since unlabeled data are more prevalent, this is a key feature for IoT applications, where sensors generate large volumes of data that cannot be labeled easily. Therefore, our approach helps to alleviate the need for a large amount of labeled data. In addition, the performance of deep reinforcement learning is enhanced by using the proposed semi-supervised approach.
The idea of extending reinforcement learning algorithms to semi-supervised reinforcement learning has not been studied well so far. There are some suggestions that explain the possibility of semi-supervised reinforcement learning by having unlabeled episodes in which the agent does not receive its rewards from the environment [14] [15]. But there is no implementation of such extensions so far. Our proposed semi-supervised deep reinforcement learning, however, follows a different approach in which we incorporate a variational autoencoder [16] in our framework as the semi-supervised module to infer the classification of unlabeled data and use this information along with the labeled data to optimize its discriminating boundaries.
To apply the proposed model to a smart city scenario, we chose to perform the experiments on smart buildings, which play a significant role in smart cities. Our experimental results confirm the efficiency of the proposed semi-supervised DRL model compared to the supervised DRL model. Specifically, the results improved by 23% for a small number of training epochs. Also, considering the average performance of both models in terms of received rewards, the semi-supervised model outperforms the supervised model by obtaining twice as many rewards.
The rest of this paper is organized as follows. Section II organizes the recent related works into two parts: one that reviews attempts that utilize DRL models and the other that deals with indoor localization as a case study. Section III presents related background and then introduces the details of the proposed approach. Section IV presents a use case study in which the proposed model is used for indoor localization systems using iBeacon signals. Experimental results are presented in Section V, followed by concluding remarks in Section VI.
II Related Work
In the following subsections, we first review some recent research efforts that utilize deep reinforcement learning. Then, we review the latest research efforts that address the indoor localization problem from a machine learning perspective.
II-A Deep Reinforcement Learning
Deep reinforcement learning has been proposed in recent years [12] and is gaining attention to be applied in various application domains. In the following paragraphs, we review some of the latest research efforts that utilize DRL in different application areas.
Nemati et al. [17] utilized a deep reinforcement learning algorithm to learn actionable policies for administering an optimal dose of medicines like heparin for individuals. They used a sample dataset of dosage trials and their outcomes from a set of electronic medical records. In their model, they used a discriminative Hidden Markov Model (HMM) for state estimation and a Q-network with two layers of neurons. The medication dosage agent tries to learn the optimal policy by maximizing its total reward, which is the overall fraction of time when patients are in their therapeutic activated Partial Thromboplastin Time (aPTT) range.
Deep reinforcement learning has also been applied to vehicle image classification, as reported in [18]. In that work, the authors propose a Convolutional Neural Network (CNN) model combined with a reinforcement learning module to guide where to look in the image for the key parts of a car. The information entropy of the classification probability of a focused image that is produced by their CNN model is considered as the reward for the reinforcement learning agent to learn to identify the next visual attention area in the image. The work in [19] also reports on object localization in images by focusing attention on candidate regions using a deep reinforcement learning approach. In [20], a visual navigation application is presented that uses a variation of deep reinforcement learning by which robots can navigate in a space toward a visual target.
Li et al. [21] also developed a deep reinforcement learning approach for traffic signal timing, aiming to have a better signal timing plan. In their model, which consists of a four-layer stacked autoencoder neural network to estimate the Q-function, they use the queuing lengths of eight incoming lanes to an intersection as the state of the system at each time. They also define two actions: stay in the current traffic lane, or change the lane to allow other traffic to go through the intersection. The absolute value of the difference between the length of opposite lanes serves as the reward function. Their results show that their proposed model outperformed other conventional reinforcement learning approaches.
Resource management is another task that can use DRL as its underlying mechanism. In their report, Mao et al. [22] formulate the problem of job scheduling with multiple resource demands as a deep reinforcement learning process. In their approach, the objective is to minimize the average job slowdown. The reward function was defined based on the reciprocal duration of the job in order to guide the agent toward the objective.
Another application in which deep reinforcement learning played a key role is natural language understanding for text-based games [23][24]. For example, Narasimhan et al. [23] used Long Short-Term Memory (LSTM) networks to train the agent with useful representations of text descriptions and a Deep Q-Network (DQN) to approximate Q-functions. Other disciplines like energy management have also incorporated DRL to improve energy utilization [25].
II-B Review of Indoor Localization
To the best of our knowledge, there are no prior research efforts that utilized DRL for localization. In the following paragraphs, we review the different machine learning approaches that were utilized in the recent research literature to provide indoor localization services.
Among the approaches for deploying location-based services, Received Signal Strength (RSS) fingerprinting is one of the most promising. However, there are some challenges that need to be considered in the deployment of such an approach, including fingerprint annotation and device diversity [26]. The use of fingerprint-based approaches to identify an indoor position has been studied well in the past decade. Researchers have studied different machine learning approaches in this context, including SVM, KNN, Bayesian-based filtering, transfer learning, and neural networks [27].
It has been shown that for coarse-grained positioning applications based on Bluetooth Low Energy (BLE) RSS fingerprinting, estimating whether a device is inside a room or not yields reliable results [28].
The authors in [29] report their experimental indoor positioning results based on BLE RSS readings. They study the accuracy of three methods including Least Square Estimation (LSE), Three-border positioning, and Centroid positioning. In their testing area, which is a square meters classroom with four BLE stations at the corners, the LSE algorithm shows more accurate positions compared to the two other methods. However, the overall accuracy of these three algorithms is satisfactory.
Museums are good environments for using BLE to provide location awareness since usually the building and its contents do not allow changes due to preservation policies. The authors in [4] developed a system to make interactive cultural displays in a museum with BLE beacons combined with an image recognition wearable device. The wearable device performs localization by receiving BLE signals from the beacons to identify the room in which it is located. It also identifies artworks by an image processing service. The combination of the closest beacon identifier and the artwork identifier is fed to a processing center to retrieve the appropriate cultural content.
In [30], the authors present a system called DeepFi that utilizes a deep learning method over fingerprinting data to locate indoor positions based on channel state information (CSI). Like many other fingerprinting approaches, their system consists of offline training and online localization phases. In the offline training phase, they exploit deep learning to train all the weights as fingerprints based on the previously stored CSI. Their evaluations in living room and laboratory settings show that the use of deep learning improves localization accuracy by 20%. However, their CSI-based approach is limited to WiFi networks, and not all Network Interface Cards (NICs) available in the market support obtaining measurements from the different network channels.
In [31], deep learning joined with semi-supervised learning as well as extreme learning machine are applied to unlabeled data to study the performance of the feature extraction and classification phases of indoor localization. In their study, the deep learning network and semi-supervised learning generate high-level abstract features and more accurate classification, while the extreme learning machine can speed up the learning process. Their test setting is a 10 × 15 square meters area. Their results show that deep learning can improve the accuracy of fingerprinting by at least 1.3% for the same training dataset compared to a shallow learning method. Also, increasing unlabeled data has a positive effect on the accuracy compared to shallow feature methods. Compared to other deep learning methods including stacked autoencoder, deep belief network, and multi-layer extreme learning machine, their approach improves the accuracy by at least 10%.
In another study [27], the authors propose a WiFi localization approach using deep neural networks (DNN). In their system, a four-layer deep learning model is used to extract features from WiFi RSS data. The authors use a Stacked Denoising Autoencoder and backpropagation for the training steps. In the online positioning phase, the position estimated by the DNN is refined by an HMM component. Their experiments show that the number of hidden layers and neurons has a direct effect on the localization accuracy. Increasing the number of layers leads to better results, but at some point when the network is made deeper, the results start degrading. Their results show that the model achieves the best accuracy when using three hidden layers with 200 neurons in each layer.
Ding et al. [32] also used an artificial neural network (ANN) for WiFi fingerprinting localization. They proposed a localization approach that uses ANNs in conjunction with a clustering method based on affinity propagation. With affinity propagation clustering, the training of the ANN model is faster and the memory overhead is lower. They also reported improved positioning accuracy compared to other baseline methods.
In [33], a deep belief network (DBN) is used for a localization approach that is based on fingerprinting of ultra-wideband signals in an indoor environment. Parameters of the channel impulse response are used to get a dataset of fingerprints. Compared to other methods, the authors demonstrated that DBN can improve the localization accuracy.
The work in [34] also reports using a deep learning model in conjunction with a regression model to automatically learn discriminative features from the received wireless signal. The authors also use a softmax regression algorithm to perform device-free localization and activity recognition. They report that their proposed method can improve the localization accuracy by 10% compared to other methods.
Semi-supervised algorithms have also been widely applied to the localization problem to utilize the unlabeled data for the prediction of an unknown location. For example, in [35] the authors presented a semi-supervised algorithm based on the manifold assumption to obtain tagged fingerprints out of unlabeled data using a small amount of labeled data. They map the high-dimensional space of fingerprints into a two-dimensional space and achieve an average error of 2 meters.
Our work in this paper presents several significant differences compared to the aforementioned approaches. First, related research studies in deep reinforcement learning do not exploit the statistical information of unlabeled data, while our proposed DRL approach is extended to be semi-supervised and utilizes both labeled and unlabeled data. Second, these approaches provide an application-dependent solution, while our work is a general framework that can work for a variety of IoT applications. Third, for localization systems, all aforementioned deep learning solutions rely on WiFi fingerprinting, while the context of BLE fingerprinting has not been studied in conjunction with deep learning or reinforcement learning approaches.
III Background and Proposed Approach
In the following subsections, we first describe the fundamentals of variational autoencoders. Then we describe our proposed semi-supervised DRL model by adopting a variational autoencoder in a deep reinforcement learning model. We develop the theoretical foundation of our method based on [16] and [12].
III-A Semi-Supervised Learning Using VAE
Semi-supervised learning methods aim to improve the generalization of supervised learning tasks using unlabeled data [36]. They usually use a small set of annotated data along with a larger amount of unlabeled data to train the model. In a semi-supervised setting, we have two datasets; one is labeled and the other is unlabeled. The labeled dataset is denoted by $X_l$, for which labels $Y_l$ are provided. The other set is $X_u$, with unknown labels. Semi-supervised algorithms are built based on at least one of the following three assumptions [37]. The smoothness assumption states that if two points $x_1$ and $x_2$ are close to each other, then their corresponding labels are very likely to be close to each other. The cluster assumption implies how to identify discrete clusters. It states that if two points are in the same cluster, it is more probable that they have the same class label. The manifold assumption points out that high-dimensional data can be mapped to a lower-dimensional space (i.e., the principle of parsimony) such that the supervised algorithm still approximates the true class of a data point.
For the semi-supervised part of our proposed model, we adopt the deep generative model based on variational autoencoders (VAE) [16]. This model has been used for semi-supervised tasks such as the recognition of handwritten digits, house number classification and motion prediction [38] with impressive results. Figure 1 shows the structure of a typical VAE model. For each data point $x_i$ there is a vector of corresponding latent variables denoted by $z_i$. The distribution of labeled data is represented by $\tilde{p}_l(x, y)$, while unlabeled data are represented by $\tilde{p}_u(x)$.
The latent feature discriminative model (M1) is created based on:
$$p(z) = \mathcal{N}(z \mid 0, I); \quad p_\theta(x \mid z) = f(x; z, \theta) \tag{1}$$
in which $p(z)$ is Gaussian distributed with mean vector $0$ and variances presented in an identity matrix $I$. The function $f(x; z, \theta)$ is a nonlinear likelihood function with parameter $\theta$ for latent variable $z$ based on a deep neural network.
The generative semi-supervised model for generating data using a latent class variable $y$, in addition to a latent variable $z$, is (M2):
$$p(y) = \mathrm{Cat}(y \mid \pi); \quad p(z) = \mathcal{N}(z \mid 0, I); \quad p_\theta(x \mid y, z) = f(x; y, z, \theta) \tag{2}$$
where $\mathrm{Cat}(y \mid \pi)$ represents a categorical distribution or in general a multinomial distribution with a vector of probabilities $\pi$ whose elements sum up to 1. In the dataset, if no label is available, the unknown labels $y$ are considered as latent variables in addition to $z$.
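The models have two lower bound objectives. To describe the model objectives, a fixed-form distribution $q_\phi(z \mid x)$ is introduced with parameter $\phi$ that helps us to estimate the posterior distribution $p(z \mid x)$. For all latent variables in the models, an inference deep neural network is introduced to generate a distribution of the form $q_\phi(\cdot)$. For M1, a Gaussian inference network is used for latent variable $z$:
$$q_\phi(z \mid x) = \mathcal{N}(z \mid \mu_\phi(x), \mathrm{diag}(\sigma_\phi^2(x))) \tag{3}$$
in which $\mu_\phi(x)$ is the vector of means, $\sigma_\phi(x)$ is the vector of standard deviations, and $\mathrm{diag}$ creates a diagonal matrix. For M2, an inference network is used for latent variables $z$ and $y$ using Gaussian and multinomial distributions, respectively:
$$q_\phi(z \mid y, x) = \mathcal{N}(z \mid \mu_\phi(y, x), \mathrm{diag}(\sigma_\phi^2(x))); \quad q_\phi(y \mid x) = \mathrm{Cat}(y \mid \pi_\phi(x)) \tag{4}$$
where $\pi_\phi(x)$ is a vector of probabilities.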
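As a rough illustration of what the two inference networks produce, the sketch below draws a latent vector from a diagonal Gaussian given a mean vector and a vector of standard deviations, and turns raw class scores into a vector of probabilities. Sampling as "mean plus standard deviation times unit noise" is the reparameterization commonly used with VAEs; the text does not spell it out, so treat it as an assumption. The function names `sample_latent` and `softmax` are hypothetical, and the neural networks that would output the means, deviations and scores are omitted.

```python
import math
import random


def sample_latent(mu, sigma, rng=random):
    """Draw z from a diagonal Gaussian as z = mu + sigma * eps, eps ~ N(0, I).

    mu and sigma are plain lists standing in for the outputs of an
    inference network (assumption: the network itself is not shown here).
    """
    if len(mu) != len(sigma):
        raise ValueError("mu and sigma must have the same length")
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def softmax(scores):
    """Map unnormalized class scores to probabilities that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

With all standard deviations set to zero, `sample_latent` returns the mean vector unchanged, which is a quick sanity check on the sampling step.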
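The lower bound for M1 is:
$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - \mathrm{KL}[q_\phi(z \mid x) \,\|\, p_\theta(z)] = -\mathcal{J}(x) \tag{5}$$
in which $\mathrm{KL}[\cdot \,\|\, \cdot]$ is the Kullback-Leibler divergence function between the encoding and prior distributions and can be obtained as $\mathrm{KL}[q_\phi(z \mid x) \,\|\, p_\theta(z)] = \mathbb{E}_{q_\phi(z \mid x)}[\log q_\phi(z \mid x) - \log p_\theta(z)]$.
For the model M2, two cases should be considered. The first one deals with labeled data:
$$\log p_\theta(x, y) \ge \mathbb{E}_{q_\phi(z \mid x, y)}[\log p_\theta(x \mid y, z) + \log p_\theta(y) + \log p(z) - \log q_\phi(z \mid x, y)] = -\mathcal{L}(x, y) \tag{6}$$
When dealing with unlabeled data, $y$ is treated as a latent variable and the resulting lower bound is:
$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(y, z \mid x)}[\log p_\theta(x \mid y, z) + \log p_\theta(y) + \log p(z) - \log q_\phi(y, z \mid x)] = \sum_y q_\phi(y \mid x)(-\mathcal{L}(x, y)) + \mathcal{H}(q_\phi(y \mid x)) = -\mathcal{U}(x) \tag{7}$$
Then the whole dataset has its bound of marginal likelihood as:
$$\mathcal{J} = \sum_{(x, y) \sim \tilde{p}_l} \mathcal{L}(x, y) + \sum_{x \sim \tilde{p}_u} \mathcal{U}(x) \tag{8}$$
By adding a classification loss to the above function, the optimized objective function becomes:
$$\mathcal{J}^\alpha = \mathcal{J} + \alpha \cdot \mathbb{E}_{\tilde{p}_l(x, y)}[-\log q_\phi(y \mid x)] \tag{9}$$
where $\alpha$ adjusts the contributions of the generative and discriminative models in the learning process. During the training process for both models M1 and M2, the stochastic gradient of $\mathcal{J}$ is computed at each minibatch to be used for updating the generative parameters $\theta$ and the variational parameters $\phi$.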
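To make two of the ingredients above concrete, the following sketch computes the Kullback-Leibler term for the common special case of a diagonal Gaussian encoder and a standard normal prior, where it has the closed form 0.5 * sum(mu^2 + sigma^2 - 1 - log sigma^2), and shows how a classification loss weighted by a scalar is added to a generative bound. This is a minimal sketch under those assumptions; the names `kl_to_standard_normal`, `classification_loss` and `combined_objective` are hypothetical, and the generative bound is passed in as a number rather than computed from a trained model.

```python
import math


def kl_to_standard_normal(mu, sigma):
    """KL divergence from N(mu, diag(sigma^2)) to N(0, I), in closed form."""
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))


def classification_loss(class_probs, label):
    """Negative log-probability assigned to the true label of a labeled sample."""
    return -math.log(class_probs[label])


def combined_objective(generative_bound, class_probs, label, alpha):
    """Generative bound plus alpha times the classification loss.

    generative_bound is assumed to be computed elsewhere; alpha trades off
    the generative and discriminative contributions.
    """
    return generative_bound + alpha * classification_loss(class_probs, label)
```

When the encoder output already equals the prior (zero means, unit deviations), the KL term is zero, and setting `alpha` to zero recovers the purely generative bound.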
III-B Semi-Supervised Deep Reinforcement Learning
To adopt a deep reinforcement learning approach, we need to define the following elements for a Markov Decision Process (MDP). The goal of the MDP in a reinforcement learning problem is to maximize the earned rewards.
Environment: The environment is the territory that the learning agent interacts with.
Agent: The agent observes the environment, receives sensory data and performs a valid action. It then receives a reward for its action. Through training, the agent learns to maximize its rewards.
States: The finite set of states that the environment can assume. Each action of the agent puts the environment in a new state.
Actions: The finite set of available actions that the agent can perform, causing a transition from state $s_t$ at time $t$ to state $s_{t+1}$ at time $t+1$.
Reward function: This function is the immediate feedback for performing an action. The reward function can be defined such that it reflects the closeness of the current state to the true class label; i.e., . Depending on the problem, different distance measurements can be applied. The point is that we need to devise larger positive rewards for more compelling results and negative rewards for distracting ones.
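State transition distribution: $P_a(s, s')$ is the probability that action $a$ in state $s$ at time $t$ will lead to state $s'$ at time $t+1$: $P_a(s, s') = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$.
Having these components, the main problem is to find a policy $\pi$ (where $\pi(s) = a$) that maximizes the rewards: $\sum_{t=0}^{\infty} \gamma^t r_t$, in which $\gamma$ is a discount factor, $0 \le \gamma \le 1$.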
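The quantity that the policy tries to maximize can be illustrated with a few lines of code. The sketch below computes the discounted sum of a finite reward sequence; the function name `discounted_return` and the restriction to a finite episode are assumptions made for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma ** t."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma is expected to lie in [0, 1]")
    total = 0.0
    # Work backwards so each step applies the discount exactly once.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

A discount factor close to 0 makes the agent short-sighted, while a value close to 1 weights distant rewards almost as much as immediate ones.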
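In the deep Q-network approach, we need a deep neural network that approximates the optimal action-value function (Q) [39]:
$$Q^*(s, a) = \max_\pi \mathbb{E}[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \mid s_t = s, a_t = a, \pi] \tag{10}$$
This function finds the maximum sum of rewards $r_t$ discounted by $\gamma$ at each timestep $t$, achievable by a behavior policy $\pi = P(a \mid s)$, after making an observation ($s$) and taking an action ($a$). We can convert this equation to a simpler approximation function using the Bellman equation. For a sequence of states $s'$ and for all possible actions $a'$, if the optimal value $Q^*(s', a')$ is known, then we can obtain the optimal strategy by selecting the action $a'$ that maximizes the expected value of $r + \gamma Q^*(s', a')$:
$$Q^*(s, a) = \mathbb{E}_{s'}[r + \gamma \max_{a'} Q^*(s', a') \mid s, a] \tag{11}$$
To estimate the optimal action-value function, we use a nonlinear function approximator (i.e., a neural network with weights $\theta$) such that $Q(s, a; \theta) \approx Q^*(s, a)$. The network can be trained by minimizing the loss function $L_i(\theta_i)$ that is updated at each timestep.
We perform experience replay, so we keep track of the agent's experiences $e_t = (s_t, a_t, r_t, s_{t+1})$ at each timestep $t$ in a replay dataset $D_t = \{e_1, \dots, e_t\}$. This dataset of recently experienced transitions, along with the experience replay mechanism, is critical for the integration of reinforcement learning and deep neural networks [39].
Q-learning updates are applied on samples from the training data $(s, a, r, s') \sim U(D)$ that are uniformly drawn from the experience replay storage $D$. The Q-learning update in iteration $i$ uses the following loss function:
$$L_i(\theta_i) = \mathbb{E}_{(s, a, r, s') \sim U(D)}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i)\right)^2\right] \tag{12}$$
in which $\theta_i$ represents the network parameters in iteration $i$, and the previous network parameters $\theta_i^-$ are used to compute the target ($r + \gamma \max_{a'} Q(s', a'; \theta_i^-)$). The gradient of the loss function is computed with respect to the weights of the network:
$$\nabla_{\theta_i} L(\theta_i) = \mathbb{E}_{s, a, r, s'}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i)\right) \nabla_{\theta_i} Q(s, a; \theta_i)\right] \tag{13}$$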
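The replay mechanism and the target used in the Q-learning loss can be sketched without any deep learning library. In the example below, a bounded buffer stores transitions and returns uniformly drawn samples, and two small functions compute the target value and the squared error for one transition. The class and function names are hypothetical, the `done` flag for terminal transitions is an added assumption, and the Q-values are passed in as plain numbers where a real system would query the current and previous networks.

```python
import random
from collections import deque


class ReplayBuffer:
    """Bounded store of transitions; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.items = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.items.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        """Uniformly draw up to batch_size distinct transitions."""
        pool = list(self.items)
        return rng.sample(pool, min(batch_size, len(pool)))


def td_target(reward, next_q_values, gamma, done):
    """Reward plus the discounted best next-state value (0 after a terminal step)."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def squared_td_error(q_value, target):
    """Per-sample loss: squared gap between the target and the current estimate."""
    return (target - q_value) ** 2
```

Averaging `squared_td_error` over a sampled minibatch gives an estimate of the loss; computing `next_q_values` with older, frozen parameters is what keeps the target from shifting at every update.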
The semi-supervised DRL algorithm is then described in Algorithm 1 to learn from both labeled and unlabeled data.
Figure 2 shows the high-level model that uses the deep reinforcement learning technique in conjunction with a generative semi-supervised model instead of a DNN (cf. Figure 2b) to handle unlabeled observations. The VAE is extended to have an additional hidden layer and an output to generate the actions.
As with other learning processes, the training process for this algorithm is performed offline while policy prediction is performed online. Hence, the algorithm can handle problems with high-dimensional and high-volume data using high-performance computing facilities (e.g., cloud servers) to generate the model for online policy prediction. This ability stems from the integration of deep neural networks with reinforcement learning to generate approximation functions for high-dimensional datasets. This integrated model outperforms the traditional methods of reinforcement learning.
IV Use Case: Indoor Localization
Several use cases of the proposed approach can be envisaged in a smart city context. For example, this approach can be used for home energy management in conjunction with the Non-Intrusive Load Monitoring (NILM) method and smart meters. In such systems, a small set of labeled data provides individual appliances' usages and their on and off times. A semi-supervised deep reinforcement learning model can be trained over this small-scale training dataset as well as the stream of unlabeled data with the objective of optimizing energy usage by controlling when to switch appliances on and off.
It can also be used in the context of Intelligent Transportation Systems (ITS) by smart vehicles for navigation in a city context. In such applications, a combination of several factors can be used for the reward function, such as closeness to the destination, shortest path, speed, speed variability, etc. The vehicle needs to be trained on several test drives; then it uses the large set of unlabeled data to accurately navigate through the city.
Due to the importance of indoor localization and ease of implementation, we showcase the proposed method on the localization problem in the context of a smart campus, which is part of a larger smart city context. Although indoor localization has been studied extensively in recent years, it is still an open problem with several challenges that need to be tackled.
Indoor positioning systems have been proposed with different technologies such as vision, visible light communications (VLC), infrared, ultrasound, WiFi, RFID, and BLE [40]. One determining factor for organizations in choosing a technology is the cost of the underlying technologies and devices. Among the aforementioned technologies, BLE is a low-cost solution that has attracted attention for academic and commercial applications [9]. A combination of BLE and iBeacon technologies to design an indoor location-aware system brings many advantages to buildings that are not equipped with wireless networks. Since iBeacon devices have a small form factor, they can be deployed quickly and easily without changing or even tapping into the building's electrical and communications infrastructure [40].
In recent years, deep learning has been shown to perform favorably compared to other machine learning approaches. One main challenge for deep learning is the need to collect a large volume of labeled data (a.k.a. the calibration procedure). Typically, scanning a large-scale area like a city or a campus to collect unlabeled data is fairly straightforward. Therefore, to benefit from the enormous volume of unlabeled data, we apply the semi-supervised deep reinforcement learning approach to investigate the benefits of unlabeled data in practical scenarios.
Compared to many related works that have performed their studies in a simulated environment, a small area, or in an isolated testbed, we conducted our experiments in an academic library that is a large and busy operational environment where thousands of visitors commute every day. So it is a valuable experiment that can be beneficial for the IoT and AI communities. In addition, there are no similar attempts that address the positioning problem through the reinforcement learning approach.
In this case study, we utilize a grid of iBeacons to implement a location-aware service offering in a campus setting. In our work, we use the iBeacons' Received Signal Strength Indicator (RSSI) as the raw source of input data for a deep reinforcement learning model to identify indoor locations.
RSSI is usually represented by a negative number between 0 and −100, and in localization systems it can be used as an indication of the distance separating the transmitter from the receiver (i.e., ranging). In addition to the separating distance, RSSI is affected by other factors such as the movement of people and objects amidst the signals and the temperature and humidity of the environment. The distance estimation from a given point to an iBeacon can be derived as follows:
$$\mathrm{RSSI} = -10\, n \log_{10}(d) + A \tag{14}$$
where $n$ is the signal propagation constant, $d$ is the distance in meters and $A$ is the offset RSSI reading at 1 meter from the transmitter.
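A relation of this kind is usually written as a log-distance path-loss model, which can be inverted to estimate distance from a reading. The sketch below assumes the form RSSI = A - 10 n log10(d), with A taken to be the (negative) RSSI measured at 1 meter; this sign convention, the function names, and the sample values for n and A are assumptions for illustration and are not taken from the experiments.

```python
import math


def rssi_at_distance(distance_m, n, rssi_at_1m):
    """Predicted RSSI at a given distance under the log-distance model."""
    return rssi_at_1m - 10.0 * n * math.log10(distance_m)


def distance_from_rssi(rssi, n, rssi_at_1m):
    """Invert the model: distance in meters implied by an RSSI reading."""
    return 10.0 ** ((rssi_at_1m - rssi) / (10.0 * n))


# Hypothetical calibration values, chosen only for demonstration.
example_distance = distance_from_rssi(rssi=-71.0, n=2.0, rssi_at_1m=-59.0)
```

Because the distance depends exponentially on the reading, small RSSI fluctuations translate into large distance errors at long range, which is one reason fingerprinting approaches preprocess the raw values.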
Due to fluctuations of the received signal strength, many research studies that utilize RSSI fingerprinting perform a preprocessing step to extract more representative features. Some of these preprocessing approaches include averaging multiple RSSI values for the same location, using a Gaussian distribution model to filter outliers, and using PCA to reduce the effect of noise in addition to offering new features. In our work, we performed a categorization preprocessing in which an RSSI category represents a range of RSSI values. We explain the exact procedure in Section V-B.
IV-A Description of the Environment
The environment is represented as a set of positions labeled by row and column numbers. Each position is also associated with the set of RSSI values received from the deployed iBeacons. The agent observes the environment by receiving RSSI values at each time step. Our design requires the agent to take an action based on the three most recent RSSI observations.
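One way to feed the three most recent RSSI observations to the agent is a fixed-length sliding window, sketched below. The class name, the placeholder fill value of −100, and the choice to flatten the window into a single list are assumptions made for illustration only.

```python
from collections import deque


class ObservationWindow:
    """Keeps the k most recent RSSI vectors and exposes them as one flat input."""

    def __init__(self, num_beacons, k=3, fill_value=-100.0):
        self.num_beacons = num_beacons
        self.k = k
        # Pre-fill with a weak-signal placeholder (assumption) so the input
        # has a fixed length before k real readings have arrived.
        self.buffer = deque(
            ([fill_value] * num_beacons for _ in range(k)), maxlen=k
        )

    def push(self, rssi_vector):
        """Add the newest reading; the oldest one is dropped automatically."""
        if len(rssi_vector) != self.num_beacons:
            raise ValueError("expected one RSSI value per beacon")
        self.buffer.append(list(rssi_vector))

    def as_input(self):
        """Flatten the window, oldest reading first and newest last."""
        return [value for reading in self.buffer for value in reading]
```

For a deployment with 13 beacons and k = 3, `as_input()` would always return 39 values, regardless of how many readings have been pushed so far.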
The agent can choose one of eight allowed actions to move in different directions. In turn, it obtains a positive or negative reward according to its proximity to the target point. The goal of the agent is to approximate the position of the device that has received the RSSI values from the environment by moving in different directions.
Table I: List of allowed actions.
Action #   0      1      2       3       4    5    6    7
Move to    West   East   North   South   NW   NE   SW   SE
To adopt a deep reinforcement learning approach, we need to define the following elements for the MDP.
Environment: The active environment is a floor on which a particular position should be identified based on a vector of iBeacon RSSI values. The environment is divided into a grid of same-size cells as shown in Figure 3.
Agent: The positioning algorithm itself is represented as an agent. The agent interacts with the environment over time.
States: The state of the agent is represented as a tuple of these observations:

a vector of RSSI values,

current location (identified by row and column numbers), and

distance to the target (for labeled data).
Actions: The action is to move to one of the neighboring cells in the direction of North, East, West, or South, or in an in-between direction such as North West (NW). The first action chooses a random state in the grid. Table I shows the list of allowed actions.
Reward function: The reward function is the reciprocal of the distance error. It has a positive value if the distance to the target point is less than a threshold ($\delta$), and the closer the agent is to the target, the larger the reward. Otherwise, if the agent wanders away from the target and its distance exceeds the threshold ($\delta$), it receives a negative reward. The reward function is represented as follows:
$$r = \begin{cases} \dfrac{1}{\lVert s_{o} - s_{t} \rVert} & \text{if } \lVert s_{o} - s_{t} \rVert < \delta \\ \text{a negative value} & \text{otherwise} \end{cases}$$
in which $s_{o}$ is the observed location and $s_{t}$ is the target location.
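The MDP elements listed above can be sketched as a small grid environment. This is an illustration under stated assumptions rather than the implementation used in this work: the class name, the row/column sign convention, clamping moves at the grid border, measuring distance in cell units, the lower bound on the distance (to avoid dividing by zero at the target), and using the negative distance as the penalty are all assumptions. Only the action numbering, the random starting cell, and the reciprocal reward inside the threshold come from the text.

```python
import math
import random

# Action index -> (row delta, column delta), following the action table.
# North is taken to decrease the row index (assumption).
ACTIONS = {
    0: (0, -1),   # West
    1: (0, 1),    # East
    2: (-1, 0),   # North
    3: (1, 0),    # South
    4: (-1, -1),  # NW
    5: (-1, 1),   # NE
    6: (1, -1),   # SW
    7: (1, 1),    # SE
}


class GridLocalizationEnv:
    """Toy grid world in which the agent moves between cells toward a target cell."""

    def __init__(self, rows, cols, threshold=2.0, min_distance=0.5, seed=None):
        self.rows, self.cols = rows, cols
        self.threshold = threshold        # distance threshold for positive reward
        self.min_distance = min_distance  # keeps 1/d finite at the target (assumption)
        self.rng = random.Random(seed)
        self.position = None
        self.target = None

    def reset(self, target):
        """Set the target cell and start from a random cell of the grid."""
        self.target = target
        self.position = (self.rng.randrange(self.rows), self.rng.randrange(self.cols))
        return self.position

    def distance(self):
        """Euclidean distance to the target, in cell units (assumption)."""
        return math.hypot(self.position[0] - self.target[0],
                          self.position[1] - self.target[1])

    def reward(self):
        d = self.distance()
        if d < self.threshold:
            return 1.0 / max(d, self.min_distance)  # reciprocal of the distance error
        return -d  # assumed penalty; the text only states that it is negative

    def step(self, action):
        """Apply one of the eight moves, clamped to the grid (assumption)."""
        d_row, d_col = ACTIONS[action]
        row = min(max(self.position[0] + d_row, 0), self.rows - 1)
        col = min(max(self.position[1] + d_col, 0), self.cols - 1)
        self.position = (row, col)
        return self.position, self.reward()
```

Because the reward grows as the distance shrinks and turns negative beyond the threshold, a policy that maximizes the return is pushed toward the target cell, which is the behavior described above.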
V Experimental Results
Here we describe our evaluation on a real-world dataset. Our experiments were carried out on the first floor of the Western Michigan University (WMU) Waldo Library. Figure 4 shows the overall layout of the deployment site. In our work, we use the iBeacon RSSI values as the raw input data to identify indoor locations. Smartphones are also utilized to sense the iBeacons’ signals and to compute the current position of the user with respect to the set of known iBeacons. Our model utilizes the semi-supervised deep reinforcement learning algorithm to learn from the historical patterns of RSSI values and their corresponding estimated positions, improving its policy for identifying a position based on previously unseen RSSI values.
V-A Dataset
Our dataset was gathered from a real-world deployment of a grid of iBeacons in a campus library area of 200 ft. × 180 ft. We mounted 13 iBeacons on the ceiling of the first floor of Waldo Library at Western Michigan University, which contains many pillars that might degrade the iBeacons’ signals. We therefore arranged the iBeacons such that each point is covered by signals from several iBeacons. Each iBeacon is separated by a distance of 30–40 ft. from adjacent iBeacons. To capture the signal strength indicator of these iBeacons, we divided the area into small zones by mapping a grid with cells of size 10 ft. × 10 ft. We also developed a dedicated mobile app to capture training data. For that purpose, we stood on each cell and captured all the iBeacons’ received signals. We also manually assigned the location (i.e., the label of the cell) to the captured signals. We stored at least three instances of RSSIs for each cell to obtain a more reliable measurement and consequently reduce the effect of noisy data. Overall, we collected 820 labeled data points for training, 600 data points for testing, and 5200 unlabeled data points for semi-supervised learning.
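The grid labeling described above can be illustrated with a short sketch that maps a position in feet to a (row, column) cell. The axis orientation (200 ft. along the columns, 180 ft. along the rows), the origin at one corner of the area, and the function name are assumptions. The actual floor plan need not cover every cell, so the cell count below is an upper bound.

```python
AREA_WIDTH_FT = 200.0   # extent along the column axis (assumed orientation)
AREA_HEIGHT_FT = 180.0  # extent along the row axis (assumed orientation)
CELL_SIZE_FT = 10.0

NUM_COLS = int(AREA_WIDTH_FT // CELL_SIZE_FT)
NUM_ROWS = int(AREA_HEIGHT_FT // CELL_SIZE_FT)


def cell_of(x_ft, y_ft):
    """Return the (row, column) label of the cell containing the point (x_ft, y_ft)."""
    if not (0 <= x_ft <= AREA_WIDTH_FT and 0 <= y_ft <= AREA_HEIGHT_FT):
        raise ValueError("point lies outside the mapped area")
    # Points exactly on the far border are assigned to the last cell.
    col = min(int(x_ft // CELL_SIZE_FT), NUM_COLS - 1)
    row = min(int(y_ft // CELL_SIZE_FT), NUM_ROWS - 1)
    return row, col
```

Under these assumptions the grid has 18 rows and 20 columns, i.e., at most 360 candidate cells.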
V-B Preprocessing
Our initial experiments with the raw RSSI values for supervised deep learning showed that the relationships between the features are not truly revealed by deep learning models. We therefore enriched the original features with two additional sets of features, which gives three feature sets:

Raw: The original features that come from the direct RSSI readings.

S1: The set of features that represent the mutual differences of iBeacon RSSI values; i.e., $r_i - r_j$ for $i \neq j$, representing the difference between the RSSI value of beacon $i$ and beacon $j$.

S2: The other set of features designed to represent the categorical values of RSSIs in a Boolean membership mode such that for each beacon we define several categories by a specific interval (e.g., 10) and then represent each RSSI value with the category to which it belongs.
Table II shows the average accuracy of the different feature sets over ten replications. These features are added to the raw features. As can be seen from the table, adding feature set S1 to the raw features has only a minor effect on the average accuracy. On the other hand, adding feature set S2 increases the average accuracy, especially for finer-grained positioning. The combination of S1 and S2 is not as good as using only S2, since S1 lowers the accuracy. This observation indicates that enriching a feature set with pairwise differences of RSSI values (S1) has a minor negative effect on the accuracy of the model, since those features are not strong discriminative factors.
Table II: Average accuracy of the different feature sets over ten replications.
Interval   Feature set   Accuracy
                         1 m     3 m     6 m     9 m
-          raw           0.17    0.47    0.74    0.95
10         raw_s1        0.18    0.49    0.75    0.95
10         raw_s2        0.26    0.55    0.75    0.97
10         raw_s1_s2     0.24    0.52    0.74    0.96
5          raw_s2        0.30    0.57    0.76    0.97
The table also demonstrates that using S2 features with the RSSI categorical interval set to 5 leads to even better results. Therefore, based on these results, we use the combination of raw features and S2. Using this preprocessing, each data point is represented as a vector of 13 RSSI values plus 156 range membership features (i.e., 12 ranges for each of the 13 beacons), resulting in a total of 169 features. Each data point is labeled with a (row, column) pair pointing to a specific location.
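The feature sets described above can be sketched as follows. The text does not specify how the 12 ranges are placed, so the upper bound of −40, the clamping of out-of-range readings into the first or last range, and the use of unordered beacon pairs for S1 are assumptions, as are all function names.

```python
from itertools import combinations


def pairwise_differences(rssi):
    """S1: difference between the RSSI values of every pair of beacons."""
    return [a - b for a, b in combinations(rssi, 2)]


def range_membership(rssi, interval=5, num_ranges=12, upper=-40):
    """S2: Boolean (one-hot) membership of each RSSI value in a fixed-width range.

    Range 0 holds values in (upper - interval, upper]; each following range
    covers the next weaker band. Values outside the covered span are clamped
    into the first or last range (assumption).
    """
    features = []
    for value in rssi:
        index = int((upper - value) // interval)
        index = min(max(index, 0), num_ranges - 1)
        one_hot = [0] * num_ranges
        one_hot[index] = 1
        features.extend(one_hot)
    return features


def build_features(rssi):
    """Raw RSSI values followed by the S2 membership features."""
    return list(rssi) + range_membership(rssi)
```

For 13 beacons, `build_features` returns 13 + 13 × 12 = 169 values, matching the feature count stated above, while `pairwise_differences` would add 78 values under the unordered-pair assumption.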
V-C Evaluation
To implement our proposed semi-supervised DRL model, we adopted the deep reinforcement learning algorithm, in which we incorporated a variational autoencoder to generate more rewarding policies and consequently increase the accuracy of the localization process. The deep neural networks are implemented on Google TensorFlow [41] using the Keras package [42].
To evaluate the performance of the proposed semi-supervised DRL model, we performed two sets of experiments: one in which the DRL framework uses a fully-connected deep neural network for supervised learning, and the other in which the DRL framework uses a stacked variational autoencoder for semi-supervised learning.
Figure 5 shows the performance of the DRL in terms of the received rewards as well as the distance to the true target for both the supervised and semi-supervised models in six episodes (cf. labels 1–6 in the figure). The plots show that the agent in the semi-supervised model learns to achieve higher rewards and smaller distances to the target compared to the supervised model.
Table III shows that the semi-supervised model gets closer to the target points than the supervised model alone. It also reaches, or gets close to, the target in fewer steps within the same number of epochs. The differences in distance in this table indicate that the semi-supervised model generates policies that improve the average convergence speed of the localization system by a factor of at least 4.
Table III: Average distance to the target points (meters).
Model             As the agent starts   End of epochs   Difference
Supervised        9.4                   7.4             2
Semi-supervised   12.8                  4.3             8.5
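The factor of about 4 can be checked directly from the Table III values. The short computation below is only an arithmetic illustration; it compares the reduction in average distance achieved by each model over the same number of epochs.

```python
# Average distance to the target points in meters, taken from Table III.
start = {"supervised": 9.4, "semi-supervised": 12.8}
end = {"supervised": 7.4, "semi-supervised": 4.3}

# Reduction in distance per model, rounded to the table's precision.
reduction = {model: round(start[model] - end[model], 1) for model in start}

# How much more distance the semi-supervised agent closes in the same epochs.
ratio = reduction["semi-supervised"] / reduction["supervised"]
```

Here `reduction` reproduces the "difference" column (2 m versus 8.5 m) and `ratio` evaluates to 4.25, consistent with the stated factor of at least 4.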
In Figures 6 and 7, the comparison of the semi-supervised model with the supervised model over different numbers of epochs shows the efficacy of the semi-supervised approach in handling the localization problem. The results in Figure 6 show that the semi-supervised model reaches a higher reward faster than the supervised model while keeping its reward trend stable. From this figure, it can be seen that the semi-supervised model gains at least 67% more rewards compared to the supervised model. In addition, the semi-supervised model achieves about twice the rewards of the supervised model. This result can be translated to the original measurement, where we want to know the effect of the models on the accuracy of the localization, as depicted in Figure 7. Figure 7 shows the average distance to the target points for different numbers of epochs. Here the semi-supervised model achieves a 6% to 23% improvement in localization. This result indicates that the unlabeled data helps the VAE to better identify the discriminative boundaries and consequently improves the accuracy of the semi-supervised model.
VI Conclusion
We proposed a semi-supervised deep reinforcement learning framework as a learning mechanism in support of smart IoT services. The proposed model uses a small set of labeled data along with a larger set of unlabeled data. The current work is the first attempt to extend the semi-supervised reinforcement learning approach using deep reinforcement learning. The proposed model consists of a deep variational autoencoder network that learns the best policies for the agent to take optimal actions.
As a use case, we experimented with the proposed model in an indoor localization system. Our experimental results illustrate that the proposed semi-supervised deep reinforcement learning model is able to generalize the positioning policy for configurations where the environment data is a mix of labeled and unlabeled data, and that it achieves better results than a supervised model trained on labeled data only. The results show an improvement of up to 23% in localization accuracy with the proposed semi-supervised deep reinforcement learning model. Also, in terms of gaining rewards, the semi-supervised model outperforms the supervised model by receiving at least 67% more rewards.
This study shows that IoT applications in general, and smart city applications in particular, where context-awareness is a valuable asset, can benefit immensely from unlabeled data to improve the performance and accuracy of their learning agents. Furthermore, semi-supervised deep reinforcement learning is a good solution for many IoT applications since it requires little supervision, relying on reward feedback as it learns the best policy to choose among alternative actions.
Acknowledgment
The authors would like to thank Western Michigan University Libraries for providing the experimental testbed and space needed to conduct this research.
References
 [1] A. Al-Fuqaha, M. Guizani, M. Mohammadi, M. Aledhari, and M. Ayyash, “Internet of Things: A survey on enabling technologies, protocols, and applications,” IEEE Communications Surveys & Tutorials, vol. 17, no. 4, pp. 2347–2376, 2015.
 [2] N. E. Klepeis, W. C. Nelson, W. R. Ott, J. P. Robinson, A. M. Tsang, P. Switzer, J. V. Behar, S. C. Hern, W. H. Engelmann et al., “The national human activity pattern survey (NHAPS): a resource for assessing exposure to environmental pollutants,” Journal of exposure analysis and environmental epidemiology, vol. 11, no. 3, pp. 231–252, 2001.
 [3] L. Mainetti, V. Mighali, and L. Patrono, “A location-aware architecture for heterogeneous building automation systems,” in 2015 IFIP/IEEE International Symposium on Integrated Network Management (IM). IEEE, 2015, pp. 1065–1070.
 [4] S. Alletto, R. Cucchiara, G. Del Fiore, L. Mainetti, V. Mighali, L. Patrono, and G. Serra, “An indoor location-aware system for an IoT-based smart museum,” IEEE Internet of Things Journal, vol. 3, no. 2, pp. 244–253, 2016.
 [5] M. V. Moreno, J. L. Hernández, and A. F. Skarmeta, “A new location-aware authorization mechanism for indoor environments,” in Advanced Information Networking and Applications Workshops (WAINA), 2014 28th International Conference on. IEEE, 2014, pp. 791–796.
 [6] G. Sunkada, “System for and method of location aware marketing,” Feb. 9 2012, US Patent App. 12/851,968.
 [7] P. Dickinson, G. Cielniak, O. Szymanezyk, and M. Mannion, “Indoor positioning of shoppers using a network of bluetooth low energy beacons,” in Indoor Positioning and Indoor Navigation (IPIN), 2016 International Conference on. IEEE, 2016, pp. 1–8.
 [8] J. Torres-Sospedra, J. Avariento, D. Rambla, R. Montoliu, S. Casteleyn, M. Benedito-Bordonau, M. Gould, and J. Huerta, “Enhancing integrated indoor/outdoor mobility in a smart campus,” International Journal of Geographical Information Science, vol. 29, no. 11, pp. 1955–1968, 2015.
 [9] F. Zafari, I. Papapanagiotou, and K. Christidis, “Microlocation for internet-of-things-equipped smart buildings,” IEEE Internet of Things Journal, vol. 3, no. 1, pp. 96–112, 2016.
 [10] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
 [11] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press Cambridge, 1998, vol. 1, no. 1.
 [12] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
 [13] R. Faragher and R. Harle, “Location fingerprinting with bluetooth low energy beacons,” IEEE Journal on Selected Areas in Communications, vol. 33, no. 11, pp. 2418–2428, 2015.
 [14] P. Christiano. (2016) Semi-supervised reinforcement learning. [Online]. Available: https://medium.com/aicontrol/semisupervisedreinforcementlearningcf7d5375197f
 [15] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané, “Concrete problems in ai safety,” arXiv preprint arXiv:1606.06565, 2016.
 [16] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling, “Semi-supervised learning with deep generative models,” in Advances in Neural Information Processing Systems, 2014, pp. 3581–3589.
 [17] S. Nemati, M. M. Ghassemi, and G. D. Clifford, “Optimal medication dosing from suboptimal clinical examples: A deep reinforcement learning approach,” in Engineering in Medicine and Biology Society (EMBC), 2016 IEEE 38th Annual International Conference of the. IEEE, 2016, pp. 2978–2981.
 [18] D. Zhao, Y. Chen, and L. Lv, “Deep reinforcement learning with visual attention for vehicle classification,” IEEE Transactions on Cognitive and Developmental Systems.
 [19] J. C. Caicedo and S. Lazebnik, “Active object localization with deep reinforcement learning,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2488–2496.
 [20] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi, “Target-driven visual navigation in indoor scenes using deep reinforcement learning,” arXiv preprint arXiv:1609.05143, 2016.
 [21] L. Li, Y. Lv, and F.-Y. Wang, “Traffic signal timing via deep reinforcement learning,” IEEE/CAA Journal of Automatica Sinica, vol. 3, no. 3, pp. 247–254, 2016.
 [22] H. Mao, M. Alizadeh, I. Menache, and S. Kandula, “Resource management with deep reinforcement learning,” in Proceedings of the 15th ACM Workshop on Hot Topics in Networks. ACM, 2016, pp. 50–56.
 [23] K. Narasimhan, T. Kulkarni, and R. Barzilay, “Language understanding for text-based games using deep reinforcement learning,” arXiv preprint arXiv:1506.08941, 2015.
 [24] J. He, J. Chen, X. He, J. Gao, L. Li, L. Deng, and M. Ostendorf, “Deep reinforcement learning with a natural language action space,” arXiv preprint arXiv:1511.04636, 2015.
 [25] V. François-Lavet, “Deep reinforcement learning solutions for energy microgrids management,” in Proceedings of the European Workshop on Reinforcement Learning, 2016.
 [26] B. Wang, Q. Chen, L. T. Yang, and H.-C. Chao, “Indoor smartphone localization via fingerprint crowdsourcing: challenges and approaches,” IEEE Wireless Communications, vol. 23, no. 3, pp. 82–89, 2016.
 [27] W. Zhang, K. Liu, W. Zhang, Y. Zhang, and J. Gu, “Deep neural networks for wireless localization in indoor and outdoor environments,” Neurocomputing, vol. 194, pp. 279–287, 2016.
 [28] S. Kajioka, T. Mori, T. Uchiya, I. Takumi, and H. Matsuo, “Experiment of indoor position presumption based on rssi of bluetooth le beacon,” in 2014 IEEE 3rd Global Conference on Consumer Electronics (GCCE). IEEE, 2014, pp. 337–339.
 [29] Y. Wang, X. Yang, Y. Zhao, Y. Liu, and L. Cuthbert, “Bluetooth positioning using rssi and triangulation methods,” in 2013 IEEE 10th Consumer Communications and Networking Conference (CCNC). IEEE, 2013, pp. 837–842.
 [30] X. Wang, L. Gao, S. Mao, and S. Pandey, “Deepfi: Deep learning for indoor fingerprinting using channel state information,” in 2015 IEEE Wireless Communications and Networking Conference (WCNC). IEEE, 2015, pp. 1666–1671.
 [31] Y. Gu, Y. Chen, J. Liu, and X. Jiang, “Semi-supervised deep extreme learning machine for wifi based localization,” Neurocomputing, vol. 166, pp. 282–293, 2015.
 [32] G. Ding, Z. Tan, J. Zhang, and L. Zhang, “Fingerprinting localization based on affinity propagation clustering and artificial neural networks,” in 2013 IEEE Wireless Communications and Networking Conference (WCNC). IEEE, 2013, pp. 2317–2322.
 [33] J. Luo and H. Gao, “Deep belief networks for fingerprinting indoor localization using ultra-wideband technology,” International Journal of Distributed Sensor Networks, vol. 2016, p. 18, 2016.
 [34] X. Zhang, J. Wang, Q. Gao, X. Ma, and H. Wang, “Device-free wireless localization and activity recognition with deep learning,” in 2016 IEEE International Conference on Pervasive Computing and Communication Workshops (PerCom Workshops). IEEE, 2016, pp. 1–5.
 [35] T. Pulkkinen, T. Roos, and P. Myllymäki, “Semi-supervised learning for wlan positioning,” in International Conference on Artificial Neural Networks. Springer, 2011, pp. 355–362.
 [36] J. Weston, F. Ratle, H. Mobahi, and R. Collobert, “Deep learning via semi-supervised embedding,” in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 639–655.
 [37] O. Chapelle, B. Schölkopf, and A. Zien, “Semi-supervised learning,” 2006.
 [38] J. Walker, C. Doersch, A. Gupta, and M. Hebert, “An uncertain future: Forecasting from static images using variational autoencoders,” in European Conference on Computer Vision. Springer, 2016, pp. 835–851.
 [39] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
 [40] L. Mainetti, L. Patrono, and I. Sergi, “A survey on indoor positioning systems,” in Software, Telecommunications and Computer Networks (SoftCOM), 2014 22nd International Conference on. IEEE, 2014, pp. 111–120.
 [41] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
 [42] F. Chollet. (2015) Keras: Deep learning library for theano and tensorflow. [Online]. Available: https://keras.io/