Learning the Localization Function: Machine Learning Approach to Fingerprinting Localization

03/21/2018 ∙ by Linchen Xiao, et al. ∙ 0

As a data-driven approach, Fingerprinting Localization Solutions (FPSs) enjoy huge popularity due to their good performance and minimal requirement for environment information. This paper addresses applications of artificial intelligence to two problems in Received Signal Strength Indicator (RSSI) based FPS: first, the cumbersome construction of the training database and, second, the extrapolation of a fingerprinting algorithm to similar buildings with slight environmental changes. After a concise overview of deep learning design techniques, two techniques widely used in deep learning, namely data augmentation and transfer learning, are exploited for these issues. We train a multi-layer neural network that learns the mapping from the observations to the locations. A data augmentation method is proposed that increases the size of the training database based on the structure of RSSI measurements and hence effectively reduces the amount of training data that must be collected. It is then shown experimentally how a model trained for a particular building can be transferred to a similar one by fine-tuning with a significantly smaller number of training samples. The paper implicitly discusses new guidelines to consider when deep learning designs are employed in a new application context.


1 Introduction

Precise location of things in indoor environments is essential information for future wireless networks and services. Significant research has been conducted in recent years on indoor localization. Among the plethora of indoor localization algorithms, RF-based approaches are particularly interesting given their technological accessibility. However, propagation in indoor environments follows complex models that must account for various sources of attenuation and deflection. Given the dynamics of indoor environments, the propagation models should also be constantly updated as the propagation space changes. Therefore, algorithms that rely on explicit propagation models for localization require significant environmental awareness and continuous manual updates.

Fingerprinting-based methods, on the other hand, are not model-dependent. They are data-driven approaches that work on the assumption that certain RF features can identify a location uniquely and stably. The algorithm collects these features at different locations and constructs a fingerprint for each point. Fingerprints can be collected for only a finite number of points in the space, and this collection is the most time-consuming part of the algorithm. The pairs of fingerprints and locations are organized into a training database. Localization then boils down to finding the location corresponding to a new observation. Fingerprinting algorithms can usually be built on top of available infrastructure such as WiFi networks, and their model independence and minimal infrastructure requirements make them a very attractive choice for fast deployment of indoor localization. Their drawback is that the training database must be updated, and sometimes built anew, when the environment changes, because the fingerprints change with it. This is all the more troubling given the time-consuming nature of data collection. The same problem arises for two similar environments, for example two floors of a single building with a very similar structure. Intuitively, similar environments share many structural features that might facilitate the fingerprinting process. In this paper, the goal is to look at fingerprinting localization algorithms from a machine learning point of view and to show how these issues can be addressed using modern learning architectures. Not only does the use of these learning architectures improve upon the classic algorithms in terms of accuracy, but techniques like transfer learning and data augmentation can also be borrowed from machine learning to accelerate the creation of a new training database.

There is an abundance of work on Radio Frequency (RF) based indoor localization [1, 2, 3, 4, 5] and particularly on fingerprinting solutions [6, 7]. Machine learning approaches have received attention in [8], a neural network-based approach in [9], the k-means algorithm in [10] and prediction-based training methods in [11]. Many works address the theoretical issues around fingerprinting methods, for example the effect of the number of Access Points (APs) on the localization performance [12]. A general theoretical framework for fingerprinting algorithms is presented in [13, 14, 15, 16, 17, 18], where the interaction of many design parameters is discussed, such as radio propagation parameters, the training grid, the number of measurements and interference. The issues around database construction and scalability have been discussed in [19]. Some research has already applied deep learning to indoor localization problems. In [20], the authors propose a fingerprint construction using neural networks based on measured Channel State Information. In [21], RSSI values are used to solve the floor classification problem. In [22], the authors introduce a neural network that maps the fingerprints to the locations directly.

1.1 Summary of Contributions

In this paper, the main motivations are to address some challenges in fingerprinting localization solutions using modern artificial intelligence such as Deep Neural Networks (DNNs) and to examine whether new insights are required for designing DNNs when they are used in a new application context. We first consider fingerprinting algorithms as a data-driven approach from the perspective of data science. The organization of the database is discussed from the point of view of existing techniques. Deep architectures like autoencoders are successful in extracting useful features from image datasets, and it is investigated whether they can be as effective for RSSI-based databases. Next, assuming that the localization algorithm is nothing but a function that maps certain observations to a location, a DNN, that is a multi-layer neural network, is trained to learn this function. Furthermore, it is shown that two techniques associated with DNNs, namely transfer learning and data augmentation, can be used to address the problems of extrapolation across similar environments and faster construction of the training database.

The rest of the paper is structured as follows. Section 2 presents a general overview of fingerprinting algorithms, particularly from a data analysis point of view. Section 3 discusses fingerprinting as a classification problem. In Section 4, deep learning architectures are proposed for fingerprinting, and general design guidelines are discussed further in Sections 5 and 6. Section 7 verifies the effectiveness of the design through experimental and simulation-based performance evaluations.

2 Architecture of Fingerprinting Algorithms

The localization space is given by a set $\mathcal{A}$. The goal of localization is to infer the position of a target node placed at $\mathbf{p}$ in the region $\mathcal{A}$ based on the observations $\mathbf{x}$, usually a vector in Euclidean space. Therefore the task of localization can be understood as the problem of learning the mapping from the observation $\mathbf{x}$ to the location $\mathbf{p}$. Fingerprinting algorithms are data-driven approaches in which the localization task uses data gathered from the environment as the basis for location inference. In a fingerprinting algorithm, a set of true location-observation pairs is known, and the localization function, which maps new observations to a position, is learned from this set. The central task of fingerprinting localization is to learn the localization function, denoted by $f$. The building blocks of fingerprinting algorithms are well known, and we briefly review the essentials. The algorithm can be roughly decomposed into raw signal feature collection, fingerprint creation, pattern matching and post-processing [23]. As explained below, these phases correspond to different phases of a supervised learning algorithm.

Figure 1: Phases of fingerprinting algorithm [23]

2.1 Raw Feature Collection

Like any data-driven learning algorithm, the fingerprinting algorithm requires the collection of data. In this phase, the algorithm selects a grid of points, denoted by $\mathcal{G}$, in the localization space $\mathcal{A}$. Choices of the grid include square, hexagonal and random grids. At each point in this grid, also called a training point, observations are made that pertain to the location information. For RF-based localization, a signal feature is measured, usually multiple times to compensate for transient effects of the propagation environment such as fading and shadowing. The feature is chosen such that it contains adequate information about the location. A sufficient condition for the signal feature to distinguish training points is discussed in [16, 17]. It states simply that the probabilistic descriptions of the feature at different locations should be sufficiently distinct, as measured by the Kullback-Leibler (KL) divergence. In this work, RSSI values from multiple anchors are used as the signal feature. The measurements are labeled with the locations of the training points and therefore form a database of labeled data.

2.2 Fingerprint Creation and Pre-Processing

Fingerprint creation can be seen as a kind of pre-processing for data that is essentially heterogeneous. In data analysis applications, it is in general essential to pre-process the data for better presentation, compression and cleaning. There are various guidelines for preparing data for further analysis [24, 25]. Our data preparation follows the tidy data paradigm developed in [25], which is closely based on Codd’s relational algebra [26]. Tidy data provides a standard way to prepare and clean data. The idea is to organize the data such that each column corresponds to a variable and each row corresponds to a measurement, which in this case is just a training point. An example is shown in Table 1.

Location | Anchor ID | Measurement 1 | Measurement 2
Table 1: Fingerprinting Database

In RF-based fingerprinting solutions, given the volatile nature of wireless environments, measurements from different anchors differ in number and scaling. At a given location, one might not obtain the same number of measurements from every anchor because of differing signal strengths and packet loss. Therefore the number of visible access points and the corresponding measurements differ from location to location. In this work, the training set consists of one row per training location. Each row contains the position of the training point, the anchor IDs and the corresponding measurements, so that, in light of the above, rows may have different lengths. It will be discussed later how data augmentation techniques can be employed to increase the size of the training data and to construct rows of equal length. In the end, one constructs from the raw observations a training dataset of the form $\{(\mathbf{p}_i, \mathbf{x}_i)\}$, where $\mathbf{x}_i$ is called the fingerprint of the point $\mathbf{p}_i$.

For RF-based fingerprinting, a simple and popular procedure [7] for building a homogeneous database out of heterogeneous data is to compute the average of the RSSI values obtained from each access point. The averaged values are arranged into a vector, and hence each row of the final database consists of the training location and the vector of averaged RSSI values. This vector is the fingerprint of the point. There are other ways to create the fingerprint, including, to name a few, RSSI quantiles or fitting multivariate Gaussian distributions to the RSSI values. In this work, the averaging method is used as the benchmark for comparison with other fingerprinting algorithms. In general, once the data is prepared for further analysis, the fingerprint of a point $\mathbf{p}$ is implicitly understood as the rows corresponding to $\mathbf{p}$ in the database.
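The averaging procedure can be illustrated with a minimal sketch. The anchor names, the RSSI readings and the placeholder value `MIN_RSSI` for an anchor that is not heard are all hypothetical and chosen only for illustration.

```python
from statistics import mean

# Hypothetical raw measurements at one training point:
# anchor ID -> list of RSSI readings in dBm (lists may differ in length).
raw = {
    "AP1": [-48.0, -50.0, -52.0],
    "AP2": [-71.0, -69.0],
    "AP3": [],  # anchor not heard at this point
}

MIN_RSSI = -100.0  # assumed placeholder for an anchor without measurements


def average_fingerprint(measurements, anchors, default=MIN_RSSI):
    """Return one averaged RSSI value per anchor, in a fixed anchor order."""
    return [
        mean(measurements[a]) if measurements.get(a) else default
        for a in anchors
    ]


fingerprint = average_fingerprint(raw, ["AP1", "AP2", "AP3"])
print(fingerprint)  # [-50.0, -70.0, -100.0]
```

Fixing the anchor order keeps every fingerprint in the database the same length, which is what makes the averaged database homogeneous.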

2.3 Pattern Matching and Post-processing

Once the training dataset is built, a function $f$ should be learned that maps new observations to positions in the localization space $\mathcal{A}$. In conventional fingerprinting algorithms, the function of pattern matching is to capture the similarity between the fingerprints of training points and the fingerprint of a test point. Namely, the goal is to find the most similar pairs of test point and training point in the fingerprint space and then use the location information of the training points to estimate the position of the test point in the location space.

One of the most well-known algorithms for pattern matching employs the Euclidean Distance (ED) to measure the similarity between fingerprints. For two fingerprints $\mathbf{x}, \mathbf{x}' \in \mathbb{R}^m$ it is given as

$$ d(\mathbf{x}, \mathbf{x}') = \|\mathbf{x} - \mathbf{x}'\|_2 = \sqrt{\sum_{k=1}^{m} (x_k - x'_k)^2}. \tag{1} $$

Given a new test point with fingerprint $\mathbf{x}$, the Euclidean distance is calculated between $\mathbf{x}$ and the fingerprint $\mathbf{x}_i$ of every training point $\mathbf{p}_i$ in the training grid $\mathcal{G}$. The training point with the smallest Euclidean distance from the test point is the best candidate:

$$ \hat{\mathbf{p}} = \arg\min_{\mathbf{p}_i \in \mathcal{G}} d(\mathbf{x}_i, \mathbf{x}). \tag{2} $$

In general, the function $d$ can be any kernel function. The process is completed with post-processing methods such as $k$-nearest neighbors, the last step of traditional fingerprinting algorithms. In the previous step, the $k$ closest fingerprints in the training set are chosen, and the final location is obtained as a linear combination of the corresponding locations. The number of neighbors and the weights of the final linear combination are parameters of the learning algorithm and are chosen during the matching process.
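Pattern matching followed by nearest-neighbor post-processing can be sketched as follows. The three-point database and the uniform weights of the final combination are assumptions made for illustration; other weightings are equally possible.

```python
import math

# Hypothetical training database: (location, averaged RSSI fingerprint).
training = [
    ((0.0, 0.0), [-40.0, -70.0]),
    ((4.0, 0.0), [-55.0, -60.0]),
    ((8.0, 0.0), [-70.0, -40.0]),
]


def knn_locate(observation, database, k=2):
    """Average the locations of the k training points with the closest fingerprints."""
    ranked = sorted(database, key=lambda row: math.dist(row[1], observation))
    nearest = [loc for loc, _ in ranked[:k]]
    dims = len(nearest[0])
    return tuple(sum(loc[d] for loc in nearest) / k for d in range(dims))


best = knn_locate([-42.0, -69.0], training, k=1)      # single best candidate
estimate = knn_locate([-50.0, -64.0], training, k=2)  # uniform average of two
print(best, estimate)  # (0.0, 0.0) (2.0, 0.0)
```

With `k=1` the function returns the training point with the smallest Euclidean distance in the fingerprint space; larger `k` adds the post-processing step.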

Indeed, this approach is nothing but a $k$-nearest neighbor classifier. The problem of learning the localization function $f$ is a supervised learning problem. Depending on the particular problem at hand, it can be a classification or a regression problem. A regression-based localization function aims at giving the estimated location, while a classification-based localization function identifies the room or the area in which the target node is placed. In this paper, we consider both the classification and the regression-based approach and address the design challenges of learning algorithms in this context.

2.4 System Model

As mentioned above, RSSI-based fingerprinting algorithms are considered in this work. At each point, a number of RSSI measurements is taken from visible APs. The number of visible APs and measured RSSIs might differ from point to point.

3 Localization as Classification

Figure 2: Partition of the localization space based on Euclidean distance of RSSIs

A challenging feature of RSSI-based fingerprinting is that the mapping from locations to their corresponding measured RSSIs is not uniformly continuous. In other words, for points extremely close to an AP, a small change in distance leads to an arbitrarily large difference between measurements. Moreover, the mapping is far from being isometric: proximity of RSSI values does not necessarily correspond to geometric proximity. To see this, consider a simple localization problem on a line where an AP is placed at the origin. Training points are placed at positions $x_i$ for $i = 1, \dots, N$, and the RSSI values are the measured received power at each point. If the Euclidean distance is used as the pattern matching metric for RSSI-based fingerprinting, the localization space is partitioned into intervals, each containing one training point. Each interval contains all the points whose fingerprints are closest in Euclidean distance to the fingerprint of the training point in that interval. However, the intervals are not symmetric around the training points as one might expect. This can be seen in Fig. 2. Although in this particular problem the situation can be avoided by using the inverse RSSI values, this is not useful in a general indoor environment with multiple anchors and a complex propagation structure. Therefore, in general, Euclidean distance based fingerprinting leads to an uneven partition of the space around the training points. In other words, the closest fingerprint does not translate into the closest training point. One way to circumvent the problem is to use additional data that are not as precise as the training points but contain general proximity information. That is, these new measurements are labeled by the closest training point in the geometric sense instead of by their precise location. In this sense, localization is considered as a classification problem.
The training grid divides the localization space geometrically into different regions, and the goal of the localization algorithm is to determine the geometric region corresponding to a test point by looking at its RSSI measurements. Each class therefore corresponds to one of these regions. This formulation makes it possible to utilize opportunistic measurements obtained by crowd-sourcing approaches. Among classification techniques, Support Vector Machines (SVMs) are used here as the classification algorithm.
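The uneven partition in the line example can be reproduced numerically. The sketch below assumes a log-distance path-loss model with hypothetical parameters `P0` and `N_EXP` and hypothetical training positions; the propagation model is an assumption of this sketch and is not specified in the text.

```python
import math

P0, N_EXP = -40.0, 2.0  # assumed power at 1 m (dBm) and path-loss exponent


def rssi(d):
    """Log-distance path-loss model (an assumption of this sketch)."""
    return P0 - 10.0 * N_EXP * math.log10(d)


train = [1.0, 2.0, 3.0, 4.0]  # hypothetical training positions on the line (m)


def nearest_in_rssi(d):
    """Training point whose RSSI is closest to the RSSI observed at distance d."""
    return min(train, key=lambda t: abs(rssi(t) - rssi(d)))


# A test point at 1.45 m is geometrically closer to 1 m, yet matches 2 m in RSSI.
print(nearest_in_rssi(1.45))  # 2.0

# Under this model the RSSI decision boundaries are geometric means of
# neighboring training positions, not their midpoints.
boundaries = [math.sqrt(a * b) for a, b in zip(train, train[1:])]
```

Under this model the interval assigned to the training point at 2 m extends from about 1.41 m to about 2.45 m, so it reaches further toward the AP than away from it.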

3.1 Support Vector Machine Algorithm

SVM [27] is a binary classifier introduced by Vapnik and Chervonenkis in 1963. SVM aims at finding a linear classifier, i.e., a hyperplane, which maximizes the margin between two classes. It also has extensions to non-linear classifiers and non-separable data.

(a) Linear Support Vector Machine
(b) Non-linear SVM classifier with soft margin
Figure 3: SVM for proximity based fingerprinting

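First, a linearly separable training dataset is given, consisting of data points $\mathbf{x}_i \in \mathbb{R}^d$, $i = 1, \dots, n$, with labels $y_i \in \{-1, +1\}$. The idea of SVM is to use a hyperplane to separate the two classes so that each class lies on one side of the hyperplane:

$$ y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1, \quad i = 1, \dots, n, \tag{3} $$

for some $\mathbf{w} \in \mathbb{R}^d$ and $b \in \mathbb{R}$. The optimal hyperplane is the one with the maximum margin between the two classes. It can be seen that the following optimization problem provides a solution for $\mathbf{w}$ and $b$ [28]:

$$ \min_{\mathbf{w}, b} \ \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1, \quad i = 1, \dots, n. \tag{4} $$

One can equally solve the dual problem by considering the Lagrangian, yielding the following problem:

$$ \max_{\boldsymbol{\alpha}} \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j \quad \text{s.t.} \quad \alpha_i \ge 0, \quad \sum_{i=1}^{n} \alpha_i y_i = 0. \tag{5} $$

Note that the dimension of the search space scales with the size $n$ of the training set for the dual problem but with the dimension $d$ of the training points for the primal problem. Therefore, although both problems are quadratic optimization problems and can be easily solved by quadratic programming algorithms, the choice of which problem to solve depends on the number of training points and their dimension. After solving the dual problem, $\mathbf{w}$ can be obtained as $\mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i$. The support vectors are the points $\mathbf{x}_i$ with $\alpha_i > 0$; they satisfy $y_i(\mathbf{w}^T \mathbf{x}_i + b) = 1$, and they solely determine $\mathbf{w}$ and $b$. For a support vector $\mathbf{x}_s$, $b$ is obtained as $b = y_s - \mathbf{w}^T \mathbf{x}_s$. The support vectors are shown in Fig. 3(a).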

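In order to deal with non-separable data, a penalty term is added which tries to minimize the number of points inside the margin. If the margin violation of each point is denoted by $\xi_i \ge 0$, the goal is to minimize the $\ell_0$-norm of $\boldsymbol{\xi} = (\xi_1, \dots, \xi_n)$ for this purpose. To have a convex formulation, the $\ell_1$-norm is used instead, which is well known to promote sparsity. Therefore the following optimization problem is solved:

$$ \min_{\mathbf{w}, b, \boldsymbol{\xi}} \ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n, $$

where $C > 0$ is a parameter which should be correctly chosen. The equivalent dual problem is given by:

$$ \max_{\boldsymbol{\alpha}} \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0. \tag{6} $$

Note that the search space of the primal problem has dimension $n + d + 1$, while the dimension of the dual problem is the same for the separable and the non-separable case, namely $n$, which makes the dual more efficient to solve in the non-separable case.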

As another advantage of the dual problem, it can be easily extended to a non-linear SVM classifier by using the kernel trick. In this case, the inner product $\mathbf{x}_i^T \mathbf{x}_j$ in (6) is replaced by a kernel function $K(\mathbf{x}_i, \mathbf{x}_j)$. The kernel function represents an inner product of the transformations of these vectors in a feature space, usually of higher dimension and possibly infinite-dimensional. This transformation is only done implicitly, and in many cases the transformation function is not explicitly known. Since the dual problem only depends on the inner product, it is sufficient to know the inner product in the feature space as a function of the training points, and kernel functions provide this information. After transformation into a higher-dimensional space, data that is not linearly separable in the ambient space might become linearly separable in the feature space. An example is given in Fig. 3(b), where linearly non-separable data is classified using the kernel trick. Some examples of kernel functions are the polynomial kernel, the Radial Basis Function (RBF) kernel and the hyperbolic tangent kernel. The kernel functions contain design parameters too. For instance, the RBF kernel is given by $K(\mathbf{x}, \mathbf{x}') = \exp(-\gamma \|\mathbf{x} - \mathbf{x}'\|^2)$, where $\gamma > 0$ is a design parameter. SVMs require tuning only a few parameters, namely the constant $C$ in (6) and the parameters of the kernel function. In its most general form, the SVM classifier is constructed by solving the following problem:

$$ \max_{\boldsymbol{\alpha}} \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0. \tag{7} $$

Note that since we are working in the feature space, $\mathbf{w}$ is really $\sum_{i=1}^{n} \alpha_i y_i \phi(\mathbf{x}_i)$, where $\phi$ is the feature mapping. However, the mapping is only implicitly known, and hence $\mathbf{w}$ cannot be computed explicitly, although the classifier itself can be constructed. First, for a support vector $\mathbf{x}_s$ with $0 < \alpha_s < C$, $b$ is obtained as

$$ b = y_s - \sum_{i=1}^{n} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}_s). \tag{8} $$

Consequently, the classifier for a new observation $\mathbf{x}$ is constructed as

$$ h(\mathbf{x}) = \operatorname{sign}\Big( \sum_{i=1}^{n} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b \Big). \tag{9} $$
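Once a dual solution is available, evaluating the kernel classifier only requires the support vectors, their labels and multipliers, and the offset. The sketch below evaluates such a decision rule with an RBF kernel. The support vectors, multipliers and offset are hand-picked hypothetical values, not the output of a solver, and the default `gamma` is likewise an assumption.

```python
import math


def rbf_kernel(u, v, gamma=0.5):
    """K(u, v) = exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def decision(x, support_vectors, labels, alphas, b, kernel=rbf_kernel):
    """Sign of sum_i alpha_i * y_i * K(x_i, x) + b."""
    score = sum(
        a * y * kernel(sv, x)
        for sv, y, a in zip(support_vectors, labels, alphas)
    )
    return 1 if score + b >= 0 else -1


# Hypothetical dual solution with two support vectors (values chosen by hand;
# they satisfy the constraint sum_i alpha_i * y_i = 0).
svs = [[-50.0, -70.0], [-70.0, -50.0]]
ys = [+1, -1]
alphas = [1.0, 1.0]

print(decision([-52.0, -69.0], svs, ys, alphas, b=0.0))  # 1
print(decision([-69.0, -52.0], svs, ys, alphas, b=0.0))  # -1
```

The feature mapping never appears in the code: only kernel evaluations between the new observation and the support vectors are needed.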

3.2 SVM in Fingerprinting Algorithm

In the context of the fingerprinting algorithm, the fingerprint of each training point is a sequence of an equal number of measurements from the different anchors. Therefore, for $M$ anchors and $K$ measurements per anchor, the fingerprint of a training point is a vector of dimension $MK$. To apply SVM in the fingerprinting algorithm, the training points are chosen first. The Voronoi diagram corresponding to the grid divides the location space into regions; each region is seen as a class, with its training point as the label. The aim is to assign the test point to one of these regions using SVM.

Note that without proximity data, only a single observation is available for a given class. In that case, SVM would be exactly the same as Euclidean distance based localization. It is also possible to obtain multiple observations using data augmentation, which will be discussed later. Casting the localization problem as classification makes it possible to utilize so-called imprecise measurements. For these, it is enough to know the region in which the observations are collected, i.e., the label of the measurements. Unlike for training points on the grid, no precise location information is needed. Interestingly, this improves the localization performance. Therefore, for each training point $\mathbf{p}_i$, there are precise measurements and proximity measurements corresponding to its Voronoi region. One needs to solve a multi-class classification problem with the number of classes equal to the cardinality $|\mathcal{G}|$ of the training grid. Therefore multiple binary SVM classifiers have to be trained [29], and the final decision is made according to a one vs. one or one vs. rest strategy.

For $N$ classes, in the one vs. one approach, $N(N-1)/2$ binary classifiers are constructed and the class with the highest number of decisions is chosen. The steps are as follows.

  1. Consider all fingerprints of two training points $\mathbf{p}_i$ and $\mathbf{p}_j$ in the training grid. A fingerprint is labeled with $+1$ if it corresponds to $\mathbf{p}_i$ and with $-1$ if it corresponds to $\mathbf{p}_j$.

  2. Solve the optimization problem (7) to find $\boldsymbol{\alpha}$ and $b$ and the following classifier for a fingerprint $\mathbf{x}$:

    $$ h_{ij}(\mathbf{x}) = \operatorname{sign}\Big( \sum_{k} \alpha_k y_k K(\mathbf{x}_k, \mathbf{x}) + b \Big) \tag{10} $$
  3. For a fingerprint $\mathbf{x}$, the output of the above classifier is $h_{ij}(\mathbf{x})$, which is equal to either $+1$ or $-1$.

  4. Repeat the above steps for each pair of training points.

  5. The final decision is given by the class that collects the most votes, $\hat{\imath} = \arg\max_{i} |\{ j \neq i : h_{ij}(\mathbf{x}) = +1 \}|$, with the convention $h_{ji}(\mathbf{x}) = -h_{ij}(\mathbf{x})$.

In the one vs. rest approach, only $N$ classifiers are trained, and for each classifier one class is tested against all other classes merged into a single one. The final output is also based on a majority decision. Similar to the traditional fingerprinting approach, post-processing like k-nearest neighbor (kNN) can also be used in the SVM approach. Instead of choosing the best class, multiple top classes can be chosen. The final location is then the average of the locations of the corresponding training points.
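The one vs. one voting scheme can be sketched independently of how the binary classifiers are obtained. In the sketch below the pairwise classifiers are simple midpoint thresholds on a one-dimensional fingerprint, used as hypothetical stand-ins for trained SVMs; the class names and centre values are likewise illustrative.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(x, classes, pairwise):
    """Majority vote over all class pairs.

    pairwise[(i, j)] is a binary classifier returning +1 for class i
    and -1 for class j.
    """
    votes = Counter()
    for i, j in combinations(classes, 2):
        votes[i if pairwise[(i, j)](x) == 1 else j] += 1
    return votes.most_common(1)[0][0]


# Hypothetical stand-in classifiers on a 1-D fingerprint: each pair is split
# at the midpoint between two class centres (these are not trained SVMs).
centres = {"A": -40.0, "B": -55.0, "C": -70.0}
pairwise = {
    (i, j): (lambda x, i=i, j=j:
             1 if abs(x - centres[i]) <= abs(x - centres[j]) else -1)
    for i, j in combinations(centres, 2)
}

print(len(pairwise))                                       # 3
print(one_vs_one_predict(-57.0, list(centres), pairwise))  # B
```

For three classes the sketch builds three pairwise classifiers, matching the count $N(N-1)/2$. Ties in the vote are broken here by first occurrence, which is a choice of this sketch.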

4 Deep Learning of Localization Function and Fingerprint Construction

Figure 4: An artificial neuron

Deep learning architectures have emerged in recent years as the prime candidate for complex learning problems, with excellent performance. In the span of a few years, these architectures outperformed conventional machine learning algorithms in tasks such as pattern recognition. Neural networks consist of multiple units called neurons connected to each other, where each neuron computes a function, mostly non-linear, of its input value. The input to each neuron is a linear combination of the outputs of some other neurons. The basic building block is an artificial neuron, Fig. 4. The neuron computes the function $y = \sigma(\mathbf{w}^T \mathbf{x} + b)$, where the function $\sigma$ is called the activation function. The main question concerns the capability of this architecture to learn different functions. A disappointing answer is that a single neuron cannot learn a function like XOR [30]. The picture is more promising in the multi-layer case. A neural network with a single hidden layer is capable of approximating any function in the set of continuous functions over the $n$-dimensional cube $I_n = [0, 1]^n$, denoted by $C(I_n)$ [31, 32]. More precisely, the classic result states that for a class of activation functions $\sigma$, for any function $f \in C(I_n)$ and any $\epsilon > 0$, there is a finite sum of the form $G(\mathbf{x}) = \sum_{j=1}^{N} \alpha_j \sigma(\mathbf{w}_j^T \mathbf{x} + b_j)$ such that $|G(\mathbf{x}) - f(\mathbf{x})| < \epsilon$ for all $\mathbf{x} \in I_n$. This class of activation functions contains the continuous discriminatory functions. The result has recently been extended to a class of unbounded activation functions, yielding the approximation of functions in the sense of pointwise convergence [33].
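The gap between a single neuron and a network with one hidden layer can be made concrete with XOR. The sketch below uses a threshold activation and hand-picked weights, both of which are assumptions for illustration; nothing is trained.

```python
def step(z):
    """Threshold activation: 1 if z > 0, else 0."""
    return 1 if z > 0 else 0


def neuron(inputs, weights, bias, activation=step):
    """Single artificial neuron: activation(w . x + b)."""
    return activation(sum(w * x for w, x in zip(weights, inputs)) + bias)


def xor_net(x1, x2):
    """Two hidden neurons (OR and NAND) followed by an AND output neuron."""
    h_or = neuron([x1, x2], [1.0, 1.0], -0.5)
    h_nand = neuron([x1, x2], [-1.0, -1.0], 1.5)
    return neuron([h_or, h_nand], [1.0, 1.0], -1.5)


print([xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# [0, 1, 1, 0]
```

Each hidden neuron on its own is a linear threshold unit, and only their combination in the output neuron produces the XOR truth table.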

Although promising, it is very challenging to find the correct number of neurons and the weights for a specific function. Usually the parameters are found based on a dataset containing input-output samples of the desired function. The process of learning the weights needed to perform a certain task is called the training phase. Training is usually done through a procedure called backpropagation, in which the weights are adjusted iteratively to minimize the output error on a training set. The main challenge in training neural networks is that the output error as a function of the weights contains many local minima and saddle points, making it extremely difficult to find the global optimum. Although theoretically a single hidden layer suffices for approximation, it has recently been shown that networks with multiple hidden layers can perform tasks that would require an exponentially larger number of neurons in a shallow architecture [34]. Deep learning refers to architectures in which the neurons are organized in many consecutive layers. The training of these architectures was the main obstacle to their development, and it has been overcome in recent years. Deep learning is now used for supervised and unsupervised learning tasks as well as for dimensionality reduction and feature extraction.

In this section, deep neural networks are used for two main tasks: first, to extract essential features of RSSI measurements and, second, to perform the regression task for the indoor localization problem. The goal of the regression analysis is to learn the localization function $f$, approximated by a neural network that maps the fingerprint of a point to the estimated location of this point. The training set therefore consists of the pairs of training points and their fingerprints. Compared to the kNN approach, this approach combines pattern matching and post-processing and provides the location in one shot. Moreover, the received RSSI values are not averaged but used directly as the input to the algorithm, which may prevent the information loss incurred by averaging. In this work, we focus on design issues including the influence of different hyperparameters, the avoidance of overfitting and the training algorithms. The hyperparameters in deep learning include the number of layers, the number of neurons in each layer, the choice of non-linearity, the learning rate, etc. The choice of hyperparameters is an important problem in deep learning, and currently there is almost no unified theory for choosing them. However, there are many guidelines derived from extensive experimental research, such as in [35]. For simplicity, not all hyperparameters are discussed in this paper; the focus is on those with seemingly the most important effect.

4.1 Fingerprint Construction using Autoencoders

Figure 5: Autoencoders

Autoencoders, also known as auto-associators, were historically construed as a model for memory. Mainly an unsupervised learning architecture, an autoencoder aims at finding a representation of the data in a space of lower dimension. An autoencoder, Fig. 5, is a multi-layer neural network that attempts to reconstruct its own input by mapping it through hidden layers, ultimately finding a low-dimensional representation of the data. In that sense, autoencoders are used for feature extraction and dimensionality reduction. Interestingly, an autoencoder with merely linear layers performs nothing but Principal Component Analysis (PCA) for dimensionality reduction [36]. Although proposed decades ago, autoencoders emerged again when they were used for layer-wise pre-training of deep neural networks [37]. Variants such as stacked autoencoders and denoising autoencoders were proposed [38, 39], where the latter are trained on corrupted inputs and can therefore also be used as a denoising tool. Typically, the encoder gradually maps the input to a lower-dimensional space through multiple hidden layers of decreasing size. The decoder symmetrically consists of layers of increasing size. Training of autoencoders follows a procedure similar to that of conventional neural networks. The low-dimensional representation of the data is obtained by applying the encoder to test data, and it is then fed into other classification algorithms.

In a similar fashion, autoencoders can be used for feature extraction from RSSI values. As discussed before, the RSSI values from each AP are generally averaged, and therefore all measurements are stored in a vector of dimension equal to the number of APs. The question is whether this naive approach is sufficient for either good compression or effective feature extraction. The measured RSSI values are used as the input to the autoencoder, and their dimension is reduced through an encoder with a single hidden layer. Three main issues should be considered here: first, data preparation; second, the choice of hyperparameters such as the dimension of the hidden layer, i.e., the number of hidden-layer neurons; and finally, the training of the hidden-layer weights. These steps are common to neural networks and will be discussed extensively in later sections.

4.2 Fingerprinting-Location Regression using Deep Learning

Figure 6: Neural network configuration

With complete and precise knowledge of the propagation environment as well as of the transmitter properties, it is possible to find the received powers at different locations using elegant but rather complex derivations. However, it is difficult to have up-to-date and complete information in a complex indoor environment. Moreover, localization requires solving an inverse problem which maps received powers to the location. This is even more difficult considering the model complexity. An alternative approach is to approximate this function, i.e., the localization function $f$, using many samples.

In this work, deep neural networks are trained to approximate the localization function $f$, Fig. 6. The input is the raw RSSI values obtained directly from the different APs. The output is the estimated coordinates of the test point. The size of the output layer is therefore fixed to either two or three, depending on the dimension of the localization space. The input size, however, cannot be fixed a priori. The number of RSSI measurements from an anchor can differ from point to point, particularly when their acquisition depends on the correct reception of a packet. Therefore the measurements can have a different size at each point, while the input size of the neural network must be fixed. This problem is addressed in the next section, where a solution is proposed for adjusting the number of measurements to the input size.

5 Designing Deep Neural Networks

Deep neural networks contain a huge number of parameters and hyperparameters, each of which is chosen differently from one problem to another. Although neural network design still lacks a unified theory for parameter selection, many guidelines and insights have been obtained over the years to address common problems arising in applications. In what follows, we review and employ some of these insights in the context of indoor localization.

5.1 Resampling and Normalization

RSSI measurements are used as the input to the neural networks. There are two main problems with directly using the measurements as the input. The first problem is a practical one. The number of measurements from different APs differs at each training point. This is a recurring problem when the acquisition of RSSI values depends on the correct reception of the packet. When the Signal-to-Noise Ratio (SNR) of the received packet is too low for correct reception, one receives only a few packets, or even none, from which to obtain RSSI values. In this case, receiving more packets requires further reception attempts and increases latency. To address this problem, we propose a resampling method based on bagging [40]. Bagging aggregates the output of multiple learning models, each one fit to a different dataset randomly sampled from the original one. The idea of constructing multiple randomly sampled versions of a given dataset was introduced for the precision analysis of models, which includes techniques like bootstrapping [41, 42]. Aggregating multiple models can lead to a better model built out of many simpler ones, while random sampling leads to lower variance by diminishing the effect of accidental regularities.

As will be shown later, multiple model aggregation is embedded into deep neural networks by proper regularization. In that way, a deep neural network can be seen as an ensemble of learners working in parallel, combined with an aggregation at the end. In this work, however, the random sampling is applied specifically to the measurements of each AP. Suppose that, at a given training point, a certain number of measurements are available from a given anchor. A fixed number of samples are randomly taken with replacement from these measurements and put in the resampled fingerprinting database. This process is repeated for each training point and each anchor. When no measurement is available from an anchor, a default value is used which corresponds to the minimum RSSI value. This procedure leads to a database containing an equal number of measurements from each anchor.
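The resampling step can be illustrated with a minimal sketch. The function name `resample_fingerprint`, its parameters and the example readings are hypothetical and chosen only for illustration; they are not taken from the authors' implementation.

```python
import random

def resample_fingerprint(measurements, num_samples, default_rssi, rng):
    """Build a fixed-length input vector from per-anchor RSSI lists.

    measurements: list of lists, one list of RSSI readings per anchor
    num_samples:  number of values to keep per anchor (assumed fixed)
    default_rssi: value used when an anchor has no reading at all
    """
    vector = []
    for readings in measurements:
        if readings:
            # Draw with replacement, so short lists can still fill all slots.
            vector.extend(rng.choice(readings) for _ in range(num_samples))
        else:
            # No packet received from this anchor: fall back to the default.
            vector.extend([default_rssi] * num_samples)
    return vector

rng = random.Random(0)
point = [[-60.0, -62.0, -61.0], [-75.0], []]  # three anchors, uneven counts
x = resample_fingerprint(point, num_samples=5, default_rssi=-100.0, rng=rng)
print(len(x))  # 3 anchors times 5 samples
```

Whatever the number of raw readings per anchor, the resulting vector has the same length, which is what a fixed-size input layer requires.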

The second problem concerns the scaling of the data. RSSI measurements, measured in Watt, are positive real numbers and can be very large close to the anchors. These large values lead to the vanishing gradient problem for non-linearities like the sigmoid function. In general, without proper normalization of the error, the weights or the input, these quantities start from values of different orders of magnitude and the back-propagation algorithm might not even converge. One solution is to normalize all RSSI values between zero and one. Suppose that $r_{\min}$ and $r_{\max}$ are respectively the minimum and maximum measured RSSI values. Then a measurement $r$ is normalized using $(r - r_{\min}) / (r_{\max} - r_{\min})$. The idea of normalization also appears in image applications. Images might have different contrasts, which creates variations not essential to the task. Global contrast normalization (GCN) is used to avoid varying contrasts by subtracting the mean and rescaling to get equal standard deviation per pixel [43, Chapter 12]. Note that a similar normalization cannot be used here, since the variance and the mean value contain exactly the information vital for localization.
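A minimal sketch of scaling RSSI values to the interval between zero and one is given below. The helper name `min_max_normalize` and the sample values are assumptions made for illustration.

```python
def min_max_normalize(values, r_min, r_max):
    """Map values linearly so that r_min becomes 0 and r_max becomes 1."""
    span = r_max - r_min
    if span == 0:
        raise ValueError("r_min and r_max must differ")
    return [(v - r_min) / span for v in values]

rssi = [2.0, 5.0, 8.0]  # hypothetical measurements
normalized = min_max_normalize(rssi, min(rssi), max(rssi))
print(normalized)
```

In practice the minimum and maximum would be taken over the training database and reused unchanged for test points, so that both are scaled consistently.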

5.2 Activation Function

For a long time, the logistic sigmoid function has been the common choice of activation function. The problem is that for inputs of large absolute value, its gradient is very small, and therefore the gradient hardly affects the weights in back-propagation. This is called the vanishing gradient problem. Recently, the favorite choice of non-linearity has been the Rectified Linear Unit (ReLU) function, defined as $\max(0, x)$. It does not suffer from saturation for positive inputs and converges faster than the sigmoid to an acceptable minimum [44, 45, 46].
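The saturation effect can be checked numerically. The sketch below compares the derivative of the logistic sigmoid with that of ReLU at a few positive inputs; the function names are chosen here for illustration.

```python
import math

def sigmoid_grad(x):
    """Derivative of the logistic sigmoid, s(x) * (1 - s(x))."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of max(0, x); the value at x == 0 is set to 0 by convention."""
    return 1.0 if x > 0 else 0.0

for x in (0.0, 5.0, 20.0):
    print(x, sigmoid_grad(x), relu_grad(x))
```

The sigmoid derivative is at most 0.25 and shrinks rapidly as the input grows, whereas the ReLU derivative stays at one for every positive input.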

5.3 Number of Hidden Layers and Neurons

There is currently no systematic method to choose the number of hidden layers and neurons. However, the common belief is that a higher number of hidden layers makes it possible to approximate more complex functions [35]. There are some theoretical works supporting this claim [47, 34]. These works show that certain classes of functions, or operations, can be implemented using deep networks, while a shallow network would require an exponentially higher number of neurons to do the same. Utilizing more hidden layers therefore increases the expressiveness, and hence one can approximate more complex functions. However, the difficulty of training increases as the number of parameters increases. In general, the complexity of the model, in this case the neural network, should match the complexity of the data. A priori information about the complexity of the operation might hint at the choice of model. For instance, binary classification of a linearly separable dataset can be done using a single-layer perceptron without any hidden layer. However, information about the complexity is not enough. If the training data is not large enough to capture the complexity of the underlying structure, choosing a complex model would lead to overfitting all the same. The same rules of thumb apply to the choice of the number of neurons. In general, more training data permits a larger number of neurons. As a rule of thumb, the number of neurons in each hidden layer should not significantly exceed the input size, to avoid overfitting, and should not be significantly smaller than the output size, to avoid underfitting.

5.4 Weight Initialization

As will be discussed in the next section, the training of neural networks consists of minimizing the error function as a function of all weights. This function contains many local optima and saddle points, and it is notoriously difficult to find a good local optimum, let alone the global optimum. Since the training is done using gradient descent-based updates of the weights, the initial value of the weights can significantly affect the success of training algorithms. A breakthrough in deep learning research came with the idea of layer-wise pre-training of the network weights [48]. The idea is to pre-train the weights of the neural network in order to put them at a good starting point in the error space, and then to fine-tune the whole network following the forward and backward propagation update procedure. Recent works have shown that a straightforward initialization of weights suffices for satisfactory training [49, 50] if it is chosen with attention to the choice of non-linearities and the possible input values [51]. In this work, the weight matrix of each layer is a random matrix with i.i.d. Gaussian entries whose distribution depends on the layer size, i.e., the number of neurons in the layer [50].

5.5 Gradient-based training of weights

Neural networks aim at approximating functions using consecutive applications of linear transformations and non-linearities. An important design choice in this context concerns the linear transformations, simply called weights here. Once the weights are initialized, they should be updated to minimize the error function for the training set, defined as a function of the weights. If an explicit characterization of the approximation error as a function of the weights were at hand, one could use global optimization techniques to minimize the error and find the optimal weights. However, in many learning applications the desired function is unknown. The function is characterized only by the input-output instances provided by the training set. Therefore one can only measure how well a neural network is capable of reproducing similar input-output instances. An error function for the network is specified for that purpose. The error function is minimized for the instances in the training set using an iterative procedure. The weights are updated in each iteration to reduce the error, mostly using gradient-based update rules. This means that at each iteration, the gradient of the error function with respect to all weights is calculated and used to update the weights.

Since the error function should be minimized for all the training examples, one might be tempted to minimize the error function for one training instance at each iteration and to repeat this multiple times over the whole training set. This is called Stochastic Gradient Descent (SGD). However, since each SGD update depends on one single training example, there is no guarantee that its gradient aligns with the direction that minimizes the error for all samples. Consequently, the gradient updates behave in an oscillatory fashion, which leads to a significant increase in the training time. Another approach is to use Batch Gradient Descent (BGD), where the error function is the sum of the errors of all training instances. There are some problems with this approach. First, when the number of training instances is large, the training takes a very long time. The second problem appears particularly when BGD is used for training autoencoders. The mean squared error with BGD is equivalent, up to a scaling factor, to the average error. The error is minimized if the autoencoder learns to produce the empirical average of the data, which is insufficient as a representation of the whole data.

To avoid the gradient oscillation of SGD and to improve on BGD, the training set is divided into so-called mini-batches, each one containing a fixed number of training points. At each step, the error is minimized for a single mini-batch and the weights are updated in proportion to their contribution to the mini-batch error. The process is called Mini-batch Gradient Descent (MBGD). Note that SGD corresponds to a mini-batch size of one and BGD to a mini-batch size equal to the whole training dataset. Typical mini-batch sizes are 32, 64 and 128 [52]. MBGD has two main advantages: the distribution of a mini-batch of data fits the distribution of the whole data better than a single example does, and it converges faster than BGD. The trade-off among the above algorithms is between accuracy and the time required for one iteration, which is discussed in detail in [52].
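How a training set is split into mini-batches can be sketched as follows. The generator `mini_batches` is a hypothetical helper; shuffling once per epoch is a common convention assumed here, not something stated in the text.

```python
import random

def mini_batches(samples, batch_size, rng):
    """Yield shuffled mini-batches; the last one may be smaller."""
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [samples[i] for i in indices[start:start + batch_size]]

data = list(range(10))  # stand-in for ten training points
batches = list(mini_batches(data, batch_size=4, rng=random.Random(0)))
print([len(b) for b in batches])
```

Setting `batch_size=1` reproduces the SGD case and `batch_size=len(data)` the BGD case.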

In all the above approaches, the training is repeated over the training set multiple times; each pass is called an epoch. At each iteration, the information about the effect of each weight on the total error is encapsulated in the partial derivatives of the error function with respect to the weights. These derivatives determine the gradient direction. The common method is to update the weights by descending along the negative gradient direction, i.e., by subtracting the gradient, scaled by a positive factor, from the weights. This factor is called the learning rate and it is an important design choice.

However, this approach, conventionally used in neural networks for many years, faces huge difficulties. The main difficulty is related to the particular shape of the error function. The function is a high-dimensional non-convex function and contains many local minima as well as many saddle points. A training algorithm can easily get stuck in a local minimum or saddle point if the parameters are not chosen properly. The first parameter is the initialization of the weights, which constitutes the initial value for the gradient-based algorithms. An initialization sufficiently close to the global minimum or to a good local minimum plays a vital role in the success of the training algorithm.

In [53], a term called momentum, inspired by physics, was introduced to account for the memory of previous gradient updates and to alleviate the oscillation problem. In this case, the update depends not only on the current mini-batch but also on the last mini-batch. A momentum parameter needs to be chosen to control how much the last update is taken into account. Typical values of this parameter are 0.9 or 0.99. At each iteration, the weights are changed by an update term which is determined by

(11)

The update is given by .
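A minimal sketch of a momentum update is shown below, using one common formulation in which a velocity term accumulates past gradients. The sign convention, the function name `momentum_step` and the toy error function are assumptions made for illustration and may differ from the exact form used in [53].

```python
def momentum_step(w, v, grad, lr=0.1, momentum=0.9):
    """One update: v <- momentum * v - lr * grad, then w <- w + v."""
    v_new = [momentum * vi - lr * gi for vi, gi in zip(v, grad)]
    w_new = [wi + vi for wi, vi in zip(w, v_new)]
    return w_new, v_new

# Toy error E(w) = 0.5 * w^2, whose gradient is simply w.
w, v = [1.0], [0.0]
for _ in range(200):
    w, v = momentum_step(w, v, grad=w)
print(w[0])
```

With `momentum=0` the rule reduces to plain gradient descent; larger values give more weight to the previous update.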

The learning rate and the momentum parameter should be chosen carefully. The learning rate controls the step size of gradient descent. A very large learning rate can cause the updates to oscillate around the minimum or even diverge. On the other hand, a very small learning rate makes the updates very slow and possibly causes them to get stuck in a local minimum. There are many methods for better tuning of the learning rate, and we discuss some of them here. Although the same learning rate can be adopted for updating all weights, AdaGrad [54] was designed to enable different weights to have different learning rates. A uniform initial learning rate is chosen for all weights at the beginning. At each iteration, the learning rate of a weight is divided by the square root of the sum of squares of its gradients over the previous iterations. Therefore the weight updates decay in time, inversely proportional to the magnitude of the gradients of previous steps. A problem of AdaGrad is that the decaying factor is cumulative and hence monotonically increasing in time. Therefore the learning rates are monotonically decreasing and eventually all go down to zero. RMSProp [55] was introduced in 2012 to prevent such effects. Instead of simply adding up the squares of all the gradients, the new decaying factor for the learning rate is obtained by multiplying the decaying factor of previous steps by a decay rate before the contribution of the new squared gradient is added. Therefore the effect of recent iterations is more significant than that of early iterations. The decay rate is normally chosen to be 0.9. Adam [56] was designed in 2014 to combine the benefits of momentum and adaptive learning rates. Adam is easy to implement and computationally efficient. Similar to momentum and RMSProp, a momentum parameter $\beta_1$ and a decaying parameter $\beta_2$ need to be chosen. The authors suggest $\beta_1 = 0.9$ and $\beta_2 = 0.999$ as the default values. In the first few iterations, a bias correction has to be applied to compensate for the fact that the momentum and the decaying factors are initialized at zero. Adam has proven to be a very powerful variant of gradient descent because the algorithm is not very sensitive to the initial learning rate. In this work, Adam is chosen as our default optimization method.
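The Adam update with bias correction can be sketched for a single scalar weight as follows. The defaults follow the values suggested in [56]; the function name `adam_step`, the variable names and the small constant `eps` for numerical stability are assumptions of this sketch.

```python
import math

def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar weight; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad         # momentum (first moment)
    v = beta2 * v + (1.0 - beta2) * grad * grad  # decaying squared gradient
    m_hat = m / (1.0 - beta1 ** t)               # bias correction for zero init
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
w1, m1, v1 = adam_step(w, m, v, grad=w, t=1)
print(w1)
```

Thanks to the bias correction, the very first step has a magnitude close to the learning rate regardless of the gradient scale, which is one reason the method is comparatively insensitive to the initial learning rate.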

5.6 Regularization

A central problem in machine learning is overfitting. The typical sign of overfitting is that the learning algorithm performs very well on the training data but badly on the test data. This happens because the learning algorithm utilizes a model which is more complex than necessary and therefore learns features of the particular training set which are non-essential to the task. Often this is due to the high number of free parameters in the model. The problem is particularly grave for neural networks, since even a simple multilayer neural network contains far more parameters than the input dimension. Regularization is an important technique in statistics, used particularly in the context of inverse problems to add more constraints on the desired solution. By limiting the variation of free parameters, it can be used to prevent overfitting.

5.6.1 Early Stopping

The gradient descent algorithms make sure that the total training error is decreasing with the number of iterations. But a small training error is not equivalent to a small test error. Too many iterations of gradient descent may even damage the performance of the system instead of improving it. Early stopping is an inexpensive scheme to prevent the neural network from being overtrained. The idea is to stop the training when the validation error is not decreasing significantly with iterations.

When implementing early stopping, a parameter called patience has to be chosen manually; it is the number of further update iterations to run after a minimum has been observed. Denote the patience by $p$. As training proceeds and a new minimum of the validation error is observed at the $i$-th iteration, the corresponding weights are saved in memory as the best candidate model. The updates still continue until iteration $i + p$. If an even lower validation error is observed during these $p$ iterations, the saved model is replaced and another $p$ iterations are carried out. The procedure stops when no better model is found during $p$ iterations. The patience is set to avoid stopping immediately when the error merely oscillates. The use of early stopping makes the number of training iterations a hyperparameter that can be easily optimized.
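The patience rule can be sketched on a pre-recorded sequence of validation errors. The function `early_stopping_index` and the error values are hypothetical; in real training the errors would be computed on the fly and the best weights stored alongside.

```python
def early_stopping_index(val_errors, patience):
    """Return (best_iteration, stop_iteration) for a list of validation errors."""
    best_err = float("inf")
    best_it = -1
    for it, err in enumerate(val_errors):
        if err < best_err:
            best_err, best_it = err, it  # new best candidate model
        elif it - best_it >= patience:
            return best_it, it           # no improvement for `patience` iterations
    return best_it, len(val_errors) - 1

errors = [1.0, 0.8, 0.85, 0.7, 0.72, 0.75, 0.74, 0.9]
best, stopped = early_stopping_index(errors, patience=3)
print(best, stopped)
```

The small rise at iteration 2 does not stop training because a lower error appears within the patience window; training stops only after three iterations without improvement over the error at iteration 3.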

5.6.2 $L_2$ and $L_1$ Regularization

In neural networks, one way to restrict the free choice of all parameters is to introduce constraints on them. This approach is based on using either $L_2$ or $L_1$ regularization. In this case, a penalty term proportional to the squared $L_2$ norm, or to the $L_1$ norm, of the weights is added to the error function of the neural network to be minimized. The penalty term restricts the norm of the weights to be small for $L_2$ and promotes sparsity of the weights for $L_1$. A regularization parameter controls the degree of penalty. The optimization problem for $L_2$ regularization is changed to:

(12)

with the corresponding gradient:

(13)

Thus the update of gradient descent step is given by:

(14)

It can be seen that $L_2$ regularization shrinks the weight vector by a constant factor at each step before the usual gradient update. The optimization problem for $L_1$ regularization is given by:

(15)

with the corresponding gradient update:

(16)

where the sign function is applied element-wise to the weights. The update of one step for $L_1$ regularization is:

(17)
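As a worked illustration of the two penalties described in this subsection, a common textbook formulation is sketched below. The notation is an assumption made here for illustration: $E(w)$ denotes the unregularized error, $\lambda$ the penalty parameter and $\eta$ the learning rate, and the factor $1/2$ in the $L_2$ penalty is a convention that may differ from the one used by the authors.

```latex
\begin{align*}
\tilde{E}_{2}(w) &= E(w) + \frac{\lambda}{2}\,\lVert w \rVert_2^2,
& \nabla \tilde{E}_{2}(w) &= \nabla E(w) + \lambda w,
& w &\leftarrow (1 - \eta\lambda)\,w - \eta\,\nabla E(w), \\
\tilde{E}_{1}(w) &= E(w) + \lambda\,\lVert w \rVert_1,
& \nabla \tilde{E}_{1}(w) &= \nabla E(w) + \lambda\,\operatorname{sign}(w),
& w &\leftarrow w - \eta\lambda\,\operatorname{sign}(w) - \eta\,\nabla E(w).
\end{align*}
```

In this form the $L_2$ update multiplies the weights by the constant factor $(1 - \eta\lambda)$ before the usual gradient step, whereas the $L_1$ update subtracts a constant amount $\eta\lambda$ from the magnitude of each weight, which drives small weights towards zero.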

5.6.3 Dropout

Another regularization technique is called dropout, which prevents overfitting by randomly dropping some hidden units during the weight updates of each iteration [57]. In this way, the intermediate representations of the input do not depend on only a few neurons, and the network tends to learn a distributed representation of the data. The idea of dropout was inspired by the benefit of ensemble learning in machine learning algorithms. Since training and then combining multiple neural networks is rarely feasible in practice, the dropout algorithm attempts to create virtual neural networks inside the main one by dropping some neurons in each iteration.

Dropout is a technique applied during the training stage. For each update iteration, some of the neurons are removed randomly, along with all their incoming and outgoing connections, to get a thinned network. This is done by randomly deciding, before each iteration, whether each neuron will be present. Each node is present in the training phase with probability $p$, with the default value equal to 0.5. Only the weights of the surviving neurons are updated in the respective iteration. At test time, a single neural network should be used to represent the combination of all thinned models. The authors in [57] suggested using a scaled-down version of the trained weights. During training, only a fraction of approximately $p$ of all neurons is used. Simply using the weights of all neurons in the final combination scales up the expected output by $1/p$. Thus, every weight needs to be multiplied by a scaling factor $p$ when assembling all the thinned models. Dropout is very attractive due to its simplicity and strong regularization effect. Gradient descent variants such as Adam and momentum, other regularization methods and early stopping are also compatible with dropout.
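The training-time masking and the test-time scaling can be sketched on a list of unit outputs. The function names and the parameter `keep_prob` are hypothetical; scaling a unit's output at test time is equivalent to scaling its outgoing weights, which is the variant described in the text.

```python
import random

def dropout_train(activations, keep_prob, rng):
    """Training pass: each unit is kept with probability keep_prob, else zeroed."""
    return [a if rng.random() < keep_prob else 0.0 for a in activations]

def dropout_test(activations, keep_prob):
    """Test pass: all units are present, scaled down by keep_prob."""
    return [a * keep_prob for a in activations]

rng = random.Random(0)
h = [1.0] * 10000  # stand-in for hidden-layer outputs
kept_fraction = sum(dropout_train(h, 0.5, rng)) / len(h)
print(kept_fraction, dropout_test([2.0, 4.0], 0.5))
```

On average the thinned layer passes on a fraction `keep_prob` of its total activation, which is why the full network is scaled by the same factor at test time.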

6 Transfer Learning and Data Augmentation

Another main problem in fingerprinting algorithms is the necessity of regular updates of the fingerprinting database. Due to constant changes in indoor environments, the propagation environment changes with time and so do the fingerprints. It is necessary to regularly update the fingerprints or the model used for localization.

Euclidean distance-based methods are instance-based learning methods, which means that the training data is stored and, every time a test point needs to be localized, the whole database is used for localization. If the training data is out of date due to environment changes, the performance of fingerprinting algorithms degrades, and the solution is usually to abandon all the old data and collect new measurements for all points. The need for regular updates of the database creates a burden for localization systems and is in general very time-consuming. The problem remains for SVM-based solutions.

In most practical situations, the general structure of the building remains intact. Intuitively, if a learning algorithm is capable of implicitly learning the building structure, then the model only needs to be fine-tuned with slight changes to the model. Such a model can also be used as a pre-trained model for another building with a similar structure. If buildings are similar in their indoor structure, then the model trained for one of them can be fine-tuned for another one with little effort.

In deep learning research this is called transfer learning. Deep learning models have been shown to be capable of transferring what they have learned to different tasks. In image classification tasks, it has been shown that one can use a pre-trained model, keeping the weights and the network size, for a different dataset. Deep learning models like AlexNet, VGG and ResNet can be used as a basis for new classification tasks [58, 59]. The success of transfer learning is explained, as in [59], by the feature extraction capabilities of deep neural networks. The first layer of deep neural networks in image classification tasks tends to extract features that resemble either Gabor filters or color blobs. The feature extraction parts of these neural networks can still be used for other image datasets with a different classification task at hand, with only additional fine-tuning.

In this paper, transfer learning is used to facilitate fingerprinting in similar environments or to update the database when the environment has changed only slightly. The main assumption is that fingerprints can be used to learn an implicit representation of the building structure and that this can be done using deep neural networks. In the next section, we evaluate this idea by taking a model pre-trained for a building and updating it with a small amount of new data using standard gradient-based methods, instead of training the network from the beginning. By doing so, one can not only make full use of the outdated data but also accelerate the whole training process.

6.1 Data Augmentation

In image recognition tasks, the output of classifiers should not change if the pictures are slightly transformed, for example by small rotations. Accordingly, a given training database can be enlarged by adding these transformations to the training set. In that way the learning algorithm is encouraged to learn those features essential to the classification task. The method is called data augmentation and can be used to alleviate the effect of overfitting and to compensate for the lack of sufficient training samples.

In fingerprinting algorithms, due to multi-path effects on RSSI values, it is important to collect multiple measurements at the same point, which in turn increases the collection time of training samples. However, if the RSSI values are obtained over time spans longer than the coherence time of the channel, they can be seen as independent observations. In this case, the fingerprinting algorithm should not be sensitive to the order of the fingerprints, and therefore a permuted version of a fingerprint vector can also be used as corresponding to the same location.

In this paper, the training database is augmented using permuted versions of the RSSI values. Consider an RSSI matrix of size $N \times M$, where $N$ is the number of access points and $M$ is the number of measurements per access point. By permuting each row independently, a new matrix of the same size is obtained. The permutation can be done multiple times for a dataset to further augment the data. As we will see in the next section, this technique leads to performance improvements for deep learning based fingerprinting localization.
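The row-wise permutation can be sketched as follows, with one row per access point. The function names `permute_rows` and `augment` and the example values are hypothetical and chosen only for illustration.

```python
import random

def permute_rows(rssi_matrix, rng):
    """Return a copy in which every row (one AP) is shuffled independently."""
    return [rng.sample(row, len(row)) for row in rssi_matrix]

def augment(rssi_matrix, copies, rng):
    """Original matrix plus `copies` independently permuted versions."""
    return [rssi_matrix] + [permute_rows(rssi_matrix, rng) for _ in range(copies)]

rng = random.Random(0)
fingerprint = [[-60, -61, -62], [-70, -71, -72]]  # 2 APs, 3 measurements each
augmented = augment(fingerprint, copies=4, rng=rng)
print(len(augmented))
```

Every augmented matrix carries the same location label as the original, since only the order of the measurements within each AP changes.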

7 Numerical and Experimental Analysis

In this section, the learning algorithms discussed above are implemented to solve the indoor localization problem. The implementation details are discussed and the final algorithms are evaluated.

7.1 Fingerprinting database

7.1.1 Simulation Data

The propagation model used for simulating data in this work is presented in [60]. The authors propose a multi-wall path loss model by analyzing the effect of the number of walls on experimental data. The model can be written as:

(18)

where the constant term equals 40.22 dB for a center frequency of 2.45 GHz. The remaining quantities are the path loss exponent, the distance between transmitter and receiver, the number of walls, a constant for the multi-wall loss model and the wall attenuation factor. Fixed values of these parameters are chosen to generate the simulation data in this work. Apart from that, a random term drawn from an exponential distribution has to be subtracted to model multipath propagation. Therefore the final model of the RSSI values can be expressed as:

(19)

where the transmit power is set to 20 dBm for all APs. The test environment is a room with four APs, one located at each corner. Five measurements per AP are obtained for localization. The training grid is a square lattice with a spacing of 1 m. In total there are 200 training points, and 5 measurements are taken at each point. 1000 test points are drawn randomly in the room, and their corresponding fingerprints are created in the same way as for the training points.

7.1.2 Telecommunication Networks Group Data

The Telecommunication Networks Group (TKN) data [61] were collected in the TKN building in Berlin. In their work, fingerprints are constructed in different scenarios. The data in the scenario of ”Small size office environment” is used for simulation in this work. The size of the whole area is approximately m.
In total there are 116 APs, deployed in the building used for testing and in the neighboring buildings. The dataset consists of 41 training points and 20 test points. The training and test points are shown as red dots in Figure 7.

(a) Training grid of TKN data
(b) Testing grid of TKN data
Figure 7: Training and testing grids of TKN data [23]

The TKN data can be obtained by sending a request to the cloud services as described in [62]. The raw measurements for one data point include many measurements from different APs. However, the number of measurements is not the same for all APs in the TKN data, due to undetected signals. For instance, one AP may have 10 measurements while another has just 2.

For our implementation, we use random sampling to address this problem. When one training sample is to be constructed, the RSSI value for a certain AP is sampled randomly from all available measurements of that AP at that point. At the end, 410 training samples and 20 test samples are constructed for the fingerprinting solutions.

7.1.3 UJIIndoorLoc dataset

In order to compare the different proposed algorithms, the UJIIndoorLoc dataset is also used [63]. It contains WiFi measurements used during the EvAAL competition at IPIN 2015 [64]. The dataset contains 19937 training samples and 1111 test samples. Each sample consists of 529 features, where the first 520 features are RSSI values from 520 access points, ranging from -104 dBm to 0 dBm. The positive value 100 is used to indicate that a signal was not detected. Features 521 to 529 correspond to latitude, longitude, floor, building ID, space ID, relative position, user ID, phone ID and time stamp. The data in Table 2 are obtained from 3 buildings with 4, 4 and 5 floors respectively.

Building/Floor ID   Training samples   Test samples
B0F0                1059               78
B0F1                1356               208
B0F2                1443               165
B0F3                1391               85
B1F0                1368               30
B1F1                1484               143
B1F2                1396               87
B1F3                948                47
B2F0                1942               24
B2F1                2162               111
B2F2                1577               54
B2F3                2709               40
B2F4                1102               39
Total               19937              1111
Table 2: Distribution of data

In this paper, we only select data from the floors of building 0. Undetected signals are denoted by -110 dBm instead of 100, which indicates very weak signals (-104 dBm is the smallest measured RSSI value in the dataset). The features are then scaled independently to have zero mean and unit variance. The absolute positions are converted to relative positions by subtracting the smallest latitude and longitude in the dataset. The room size (98.7 m × 110.5 m, 104.2 m × 118.4 m, 104.2 m × 119.1 m and 104.2 m × 119.1 m for the four floors) can be obtained from the difference between the maximum and minimum of latitude and longitude. Moreover, some access points are undetected at all points of a certain floor. Those features are removed in order to speed up the training phase.

7.2 Fingerprinting Algorithm Design

In this work, we train a neural network as a regressor to learn the localization function. The input layer corresponds to the RSSI measurements, with the number of neurons equal to the number of measurements. The output layer gives the coordinates of a point in two-dimensional space. The input dimension is at least as large as the number of APs in the environment. However, in this work multiple measurements per AP are included in the input. As discussed above, random sampling is used when the number of measurements from an AP is not sufficient to fill the input. It will be discussed below how this technique can be used for data augmentation. If no measurement is available from an AP, the input values are set to the smallest possible RSSI value in the WiFi standard. The input layer size is 20, that is, five measurements per AP, for the simulation data, 116 for the TKN dataset and 520 for the UJIIndoorLoc dataset.

Raw RSS values as input
Fully Connected layer with 500 neurons
Dropout layer with 50% rate
Fully Connected layer with 500 neurons
Dropout layer with 50% rate
Fully Connected layer with 500 neurons
Dropout layer with 50% rate
Location coordinates (2 dimensional)
Table 3: Neural network configuration

The deep learning architecture for learning the localization function consists of three fully connected hidden layers with 500 neurons each. All hidden layers are equipped with the ReLU non-linearity. The output layer is a linear layer. After each hidden layer we deploy a dropout layer with a dropping rate of 50 percent. The weights are initialized using the random procedure described above. The neural network is trained using the Adam algorithm with learning rate 0.001, momentum parameter 0.9 and mini-batch size 100. Moreover, a norm penalty on the weights is also used, with the penalty parameter set to 0.03. The details can be found in Table 3. The regression network is benchmarked against Euclidean distance-based fingerprinting and SVM-based methods.

7.3 Autoencoder design for Feature Extraction

The test for using autoencoder as feature extraction is done on the simulation data. First an autoencoder is trained properly and then the encoder is used to transform the original inputs into another feature space. The Euclidean distance-based method and SVM are then used for the transformed data.

7.3.1 System Architecture

A single-layer autoencoder with a hidden layer of size 5 is adopted in the experiment. BGD is used as the update algorithm with learning rate 1 and batch size 50. The input is a vector of size 20, that is, 5 measurements for each of the 4 APs. After applying the encoder to the original data, each sample is transformed into a 5-dimensional vector. The goal is to find a better representation of the data or, in other words, an efficient fingerprint construction.

Euclidean distance-based method and SVM are then used for the 5-D vectors. is chosen equal to for the number of nearest neighbors for both cases. SVM with RBF kernel is adopted with and the penalty parameter .

7.3.2 Simulation Data Performance

The performance of autoencoder feature extraction is compared with the case where the fingerprints are constructed by simply averaging the RSSI values of each AP. The Euclidean distance-based (ED-based) method and SVM are used for pattern matching in both cases.
The performance metric is the localization error, defined as the Euclidean distance between the estimated position and the ground-truth position.
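Both the averaging baseline and the error metric are simple enough to write down directly. The sketch below assumes that the 20-dimensional input stores the 5 measurements of each AP contiguously, and it uses the population variance for the summary. Neither detail is stated in the text, so both are assumptions, as are the function names.

```python
import math
import statistics

def mean_fingerprint(raw, n_aps=4, n_meas=5):
    # Assumed layout: [AP0 m0..m4, AP1 m0..m4, ...]; returns one mean RSSI per AP.
    return [statistics.fmean(raw[a * n_meas:(a + 1) * n_meas]) for a in range(n_aps)]

def localization_errors(estimates, ground_truth):
    # Euclidean distance between each estimated and true position.
    return [math.dist(e, g) for e, g in zip(estimates, ground_truth)]

def summarize(errors):
    # The four statistics reported in the result tables.
    return {
        "mean": statistics.fmean(errors),
        "variance": statistics.pvariance(errors),  # population variance (assumption)
        "min": min(errors),
        "max": max(errors),
    }

errors = localization_errors([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)])
print(summarize(errors))
```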

The results are shown in Table 4. Using an autoencoder to extract features and then applying the ED-based method or SVM does not perform better than applying these methods directly to the original data. There are two possible explanations. First, the autoencoder might fail to extract all the features that are relevant for localization. Second, the autoencoder might extract more information than the localization task needs. In the fingerprinting context, the second explanation seems more plausible. Note that simple averaging of RSSI values performs better even though average values are in general insufficient to represent a dataset. This surprising observation indicates that, in some applications, not all features are relevant for the task at hand. As mentioned above, batch methods for training autoencoders tend to learn the average value of the data, which is unsuitable for most pattern recognition tasks. This is not the case for localization, and surprisingly these methods might provide better features for fingerprinting. In any case, indoor localization does not seem to rely on very complex features of the dataset, and therefore no sophisticated feature extraction method is required.

| Simulation Data | ED | SVM | Autoencoder+ED | Autoencoder+SVM |
|---|---|---|---|---|
| Mean error [m] | 2.44 | 2.37 | 2.66 | 2.47 |
| Error variance [m²] | 2.27 | 2.19 | 2.39 | 2.49 |
| Min. error [m] | 0.04 | 0.03 | 0.11 | 0.12 |
| Max. error [m] | 10.63 | 8.33 | 7.90 | 8.76 |

Table 4: Summary results for simulation data when feature extraction is applied

7.4 Performance of Regression Networks in Indoor Localization

7.4.1 Simulation Data Performance

Figure 8: Box plots of localization errors of three algorithms for simulation data

| Simulation Data | Euclidean Distance | Support Vector Machine | Neural Network |
|---|---|---|---|
| Mean error [m] | 2.44 | 2.37 | 2.35 |
| Error variance [m²] | 2.27 | 2.19 | 1.75 |
| Min. error [m] | 0.04 | 0.03 | 0.06 |
| Max. error [m] | 10.63 | 8.33 | 6.73 |

Table 5: Summary results of three algorithms for simulation data

As shown in Table 5, the neural network approach has the lowest mean error, at 2.35 meters. Moreover, Figure 8 shows that the neural network approach produces fewer outliers than the Euclidean distance-based or SVM approach. This is also supported by the error variance, which is lower for the neural network than for the other algorithms. For the simulation data, the neural network approach therefore gives not only a better localization result but also a more stable performance.

7.4.2 UJIndoorLoc Data Performance

| Floor (Building 0) | Metric | Euclidean Distance | Support Vector Machine | Neural Network |
|---|---|---|---|---|
| Floor 0 | Mean error [m] | 10.06 | 8.48 | 7.62 |
| Floor 0 | Error variance [m²] | 68.30 | 63.35 | 43.00 |
| Floor 0 | Min. error [m] | 0.45 | 0.25 | 0.63 |
| Floor 0 | Max. error [m] | 40.15 | 53.62 | 34.02 |
| Floor 1 | Mean error [m] | 9.93 | 8.81 | 8.08 |
| Floor 1 | Error variance [m²] | 195.33 | 119.73 | 93.94 |
| Floor 1 | Min. error [m] | 0.25 | 0.09 | 0.40 |
| Floor 1 | Max. error [m] | 118.64 | 81.56 | 90.00 |
| Floor 2 | Mean error [m] | 9.50 | 9.42 | 7.42 |
| Floor 2 | Error variance [m²] | 180.24 | 175.47 | 28.43 |
| Floor 2 | Min. error [m] | 0.07 | 0.38 | 0.37 |
| Floor 2 | Max. error [m] | 105.58 | 86.69 | 25.21 |
| Floor 3 | Mean error [m] | 9.40 | 7.72 | 7.27 |
| Floor 3 | Error variance [m²] | 89.87 | 52.98 | 29.36 |
| Floor 3 | Min. error [m] | 0.82 | 0.22 | 0.76 |
| Floor 3 | Max. error [m] | 63.35 | 39.16 | 26.41 |

Table 6: Summary results of three algorithms for UJIndoorLoc Data

For simplicity, the evaluation is limited to building 0 of the UJIndoorLoc data. Since no separate validation data are available, the training data are randomly divided into training and validation sets in a 70%/30% ratio. Table 6 compares the performance of the three algorithms on the UJIndoorLoc data. The results are similar to those for the simulation data: the neural network approach has the smallest mean localization error on every floor. Although it does not always have the lowest minimum and maximum errors, its error variance is the smallest, which indicates more stable localization. Note that in this case there are 520 APs, compared to 4 APs in the simulation data. This suggests that neural networks scale well when there are many features.
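A random 70%/30% split of this kind can be sketched as follows. The function name and the fixed seed are illustrative assumptions. The paper does not say how the random division was implemented.

```python
import random

def split_train_validation(samples, validation_fraction=0.3, seed=0):
    # Shuffle indices reproducibly, then cut off the validation share.
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    n_val = round(len(samples) * validation_fraction)
    validation = [samples[i] for i in indices[:n_val]]
    training = [samples[i] for i in indices[n_val:]]
    return training, validation

training, validation = split_train_validation(list(range(10)))
print(len(training), len(validation))  # 7 3
```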

7.4.3 TKN Data Performance

| TKN Data | Euclidean Distance | Support Vector Machine | Neural Network |
|---|---|---|---|
| Mean error [m] | 4.90 | 5.05 | 3.54 |
| Error variance [m²] | 9.02 | 11.97 | 5.97 |
| Min. error [m] | 0.00 | 1.22 | 0.59 |
| Max. error [m] | 9.97 | 14.62 | 9.63 |

Table 7: Summary results of three algorithms for TKN data

The TKN data contain many unimportant features (APs in other buildings), which is a challenge for localization algorithms. The results in Table 7 show the ability of deep learning to deal with irrelevant features. The neural network gives much better localization accuracy than the Euclidean distance-based and SVM methods. It also gives a smaller error variance, as on the previous two datasets.

7.5 Performance of Data Augmentation

Data augmentation is tested on both the simulation data and the UJIndoorLoc data. The data are augmented 1, 5 and 10 times to compare the influence of different levels of augmentation. One-time augmentation means that the resulting dataset is twice as large as the original dataset. Note that there are 4 APs in the simulation data and 520 APs in the UJIndoorLoc data. The test on the UJIndoorLoc data is done only for building 0. As in the previous evaluation, the performance metrics are the mean error, the error variance, and the minimum and maximum error.

7.5.1 Simulation Data Performance

The simulation data contain 1000 training points and 1000 test points. After 1, 5 and 10 times augmentation there are 2000, 6000 and 11000 training points, respectively. The localization performance is still evaluated on the 1000 test points.
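The bookkeeping of augmentation levels can be illustrated with a small sketch. It assumes one plausible reading of "permutation": for scans recorded at the same location, the RSSI values of each AP are shuffled independently across scans, which yields new fingerprints with unchanged per-AP value sets. This reading, the function name, and the example values are assumptions, and the method proposed earlier in the paper may differ in its details.

```python
import random

def augment_location(scans, times, seed=0):
    """Return the original scans plus `times` permuted copies.

    scans -- list of RSSI vectors recorded at ONE location (one value per AP)
    times -- augmentation level; the result has (times + 1) * len(scans) rows
    """
    rng = random.Random(seed)
    n_aps = len(scans[0])
    augmented = [list(s) for s in scans]
    for _ in range(times):
        columns = []
        for ap in range(n_aps):
            column = [s[ap] for s in scans]
            rng.shuffle(column)  # permute this AP's readings across scans
            columns.append(column)
        # Re-assemble rows from the independently shuffled columns.
        augmented.extend(list(row) for row in zip(*columns))
    return augmented

scans = [[-60, -70], [-62, -71], [-61, -69]]
print(len(augment_location(scans, times=5)))  # 18

# With the definition above, k-times augmentation of 1000 points gives:
print([1000 * (k + 1) for k in (1, 5, 10)])  # [2000, 6000, 11000]
```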

| Simulation Data | Original | 1 time permutation | 5 times permutation | 10 times permutation |
|---|---|---|---|---|
| Mean error [m] | 2.35 | 2.33 | 2.23 | 2.21 |
| Error variance [m²] | 1.75 | 1.73 | 1.59 | 1.56 |
| Min. error [m] | 0.06 | 0.04 | 0.05 | 0.02 |
| Max. error [m] | 6.73 | 8.01 | 8.07 | 6.93 |

Table 8: Summary results for simulation data when data augmentation is applied

Comparing the mean localization errors in Table 8 shows the improvement that data augmentation brings to fingerprinting-based indoor localization. The localization error decreases as the augmentation level increases. With 5 times permutation of the fingerprints, the mean localization error decreases by about 5% (from 2.35 m to 2.23 m). However, with 10 times permutation the error (2.21 m) is almost the same as with 5 times permutation. It seems that beyond a certain level of data augmentation the performance saturates.
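The size of the improvement can be checked directly from the mean errors in Table 8. The short calculation below uses only the table values. The helper name and the dictionary keys are illustrative.

```python
def relative_reduction(before, after):
    # Fractional decrease of `after` relative to `before`.
    return (before - after) / before

# Mean localization errors in meters, taken from Table 8.
mean_error = {"original": 2.35, "1x": 2.33, "5x": 2.23, "10x": 2.21}

for level in ("1x", "5x", "10x"):
    reduction = relative_reduction(mean_error["original"], mean_error[level])
    print(level, f"{100 * reduction:.1f}%")
```

Going from 5 to 10 times augmentation lowers the mean error by only a further 0.02 m, which is the saturation effect noted above.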

7.5.2 UJIndoorLoc Data Performance

For the experimental data, data augmentation is tested only on floor 0 of building 0 of the UJIndoorLoc data, which consists of 1059 training samples and 78 test samples. After data augmentation, 30% of the whole data are randomly picked as validation data.
The results in Table 9 verify the effectiveness of data augmentation for experimental data. 10 times permutation of the fingerprints decreases the mean localization error by about 1 meter and also reduces the error variance. As with the simulation data, the improvement becomes smaller once the augmentation reaches a certain level.

| Floor 0 of Building 0 | Original | 1 time permutation | 5 times permutation | 10 times permutation |
|---|---|---|---|---|
| Mean error [m] | 7.62 | 6.99 | 6.87 | 6.69 |
| Error variance [m²] | 43.00 | 41.48 | 40.85 | 38.72 |
| Min. error [m] | 0.63 | 0.20 | 0.37 | 0.45 |
| Max. error [m] | 34.02 | 34.16 | 31.02 | 30.17 |

Table 9: Summary results for UJIndoorLoc data when data augmentation is applied

The performance improvement from data augmentation is almost free. When enough memory and computation power are available, choosing a large number of permutations is a good choice. Although the training time may increase because of the additional training samples, this computation is done offline. The localization of a test point still requires only a single forward propagation, so the approach can be used in real applications.

7.6 Performance of Transfer Learning

To test the performance of transfer learning, data from two different propagation models are needed. Two different floors of one building in the UJIndoorLoc data can be regarded as two different propagation models. As can be seen from Figure 9, they share the same structure with similar anchor placement and can thus be used to test transfer learning.

Figure 9: Training grids of floor 0 and 1 of building 0 in UJIndoorLoc data. The training grids follow the same track, indicating a similar inner structure of the two floors.

First, the neural network is trained on the whole dataset of floor 0 to obtain the pre-trained model. The model is then fine-tuned with only 30% of the training data from floor 1. Finally, the model is tested on the test data of floor 1.

Figure 10: Box plots of localization errors for transfer learning

The test results shown in Table 10 cover three situations. First, the neural network is trained directly on 30% of the training data from floor 1 and tested with the test samples from floor 1. This shows whether the smaller training set is sufficient on its own. Second, the neural network is trained on the whole data from floor 0 and tested with the test samples from floor 1. This experiment examines the performance of naive transfer learning. Third, the model obtained from floor 0 is fine-tuned with 30% of the training data from floor 1 and tested with the test samples from floor 1. After fine-tuning, the mean error (8.70 m) is comparable to the one obtained with the whole training data. The result shows the potential of transfer learning in fingerprinting localization. Comparing the first and third situations shows that even when there are not enough data, a pre-trained model helps to reach a reasonable performance. The second and third situations can be seen as a model update: the old model (floor 0) can be updated with a small amount of new data (floor 1) and then works well in the new environment.
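The three situations can be mimicked on a toy problem. The sketch below is not the paper's network or data. It fits a one-dimensional linear model by full-batch gradient descent, with a made-up "floor 0" relation for pre-training and a slightly different "floor 1" relation for fine-tuning. All names, data and hyperparameters are hypothetical and chosen only to show the mechanics of the comparison.

```python
def mse(params, data):
    # Mean squared error of the linear model y = w * x + b.
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(params, data, steps, lr=0.1):
    # Full-batch gradient descent starting from the given parameters.
    w, b = params
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return (w, b)

# Hypothetical "floor 0": plenty of data from y = 2.0 x + 1.0.
source = [(i / 10, 2.0 * (i / 10) + 1.0) for i in range(11)]
# Hypothetical "floor 1": similar relation y = 2.2 x + 1.5, few training samples.
target_train = [(0.0, 1.5), (0.5, 2.6), (1.0, 3.7)]
target_test = [(i / 10, 2.2 * (i / 10) + 1.5) for i in range(11)]

scratch = train((0.0, 0.0), target_train, steps=5)    # without transfer learning
pretrained = train((0.0, 0.0), source, steps=500)     # before fine-tuning
fine_tuned = train(pretrained, target_train, steps=5)  # after fine-tuning

for name, model in [("scratch", scratch), ("pretrained", pretrained), ("fine-tuned", fine_tuned)]:
    print(name, mse(model, target_test))
```

With the same small training budget on the target data, starting from the pre-trained parameters ends much closer to the target relation than starting from zero, which is the effect the fine-tuning experiment relies on.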

| Floor 0, 1 of Building 0 | Without transfer learning | Before fine-tuning | After fine-tuning |
|---|---|---|---|
| Mean error [m] | 16.76 | 11.90 | 8.70 |
| Error variance [m²] | 77.75 | 54.75 | 73.24 |
| Min. error [m] | 0.67 | 0.21 | 0.15 |
| Max. error [m] | 55.45 | 44.53 | 69.93 |

Table 10: Summary results for transfer learning

8 Conclusion

The main motivation behind this work is to show how artificial intelligence can be used in indoor localization to accomplish tasks that cannot be done efficiently by existing methods. As expected, a complex neural network architecture is capable of approximating well the localization function that maps the measured RSSI values to locations. Moreover, we proposed using data augmentation and transfer learning to alleviate the problem of data collection, which is essential in fingerprinting approaches. The structure of a particular environment is reflected in the trained neural network. Therefore, a neural network can provide reasonable performance when it is used in another building with the same anchor placement. This can significantly simplify the way in which fingerprinting algorithms are trained.

9 Acknowledgment

The authors are grateful to Filip Lemic for helping with the TKN dataset.

References

  • [1] C. Medina, J. Segura, and . De la Torre, “Ultrasound Indoor Positioning System Based on a Low-Power Wireless Sensor Network Providing Sub-Centimeter Accuracy,” Sensors, vol. 13, no. 3, pp. 3501–3526, Mar. 2013.
  • [2] E. Brassart, C. Pegard, and M. Mouaddib, “Localization using infrared beacons,” Robotica, vol. 18, no. 02, pp. 153–161, Mar. 2000.
  • [3] Erol-Kantarci et al., “A Survey of Architectures and Localization Techniques for Underwater Acoustic Sensor Networks,” IEEE Communications Surveys Tutorials, vol. 13, no. 3, pp. 487–502, 2011.
  • [4] I. Amundson and X. D. Koutsoukos, “A Survey on Localization for Mobile Wireless Sensor Networks,” in Mobile Entity Localization and Tracking in GPS-less Environnments, ser. Lecture Notes in Computer Science, R. Fuller and X. D. Koutsoukos, Eds.   Springer Berlin Heidelberg, 2009, no. 5801, pp. 235–254.
  • [5] F. Seco et al., “A survey of mathematical methods for indoor localization,” in Intelligent Signal Processing, 2009, pp. 9 –14.
  • [6] D. Milioris et al., “Low-dimensional signal-strength fingerprint-based positioning in wireless lans,” Ad Hoc Networks, pp. 100 – 114, 2014.
  • [7] V. Honkavirta et al., “A Comparative Survey of WLAN Location Fingerprinting Methods,” in WPNC 2009.   IEEE, 2009, pp. 243–251.
  • [8] D. Milioris et al., “Low-Dimensional Signal-Strength Fingerprint-based Positioning in Wireless LANs,” Ad Hoc Networks, 2011.
  • [9] C. Laoudias et al.

    , “Localization using radial basis function networks and signal strength fingerprints in wlan,” in

    Global Telecommunications Conference, 2009. GLOBECOM 2009. IEEE, 2009, pp. 1–6.
  • [10]

    S. Bai and T. Wu, “Analysis of k-means algorithm on fingerprint based indoor localization system,” in

    Microwave, Antenna, Propagation and EMC Technologies for Wireless Communications, 2013.
  • [11] C. Steiner and A. Wittneben, “Efficient training phase for ultrawideband-based location fingerprinting systems,” Signal Processing, IEEE Transactions on, vol. 59, no. 12, pp. 6021–6032, 2011.
  • [12] J. Machaj et al., “Impact of the number of access points in indoor fingerprinting localization,” in Radioelektronika, 2010, pp. 1–4.
  • [13] K. Kaemarungsi et al., “Modeling of indoor positioning systems based on location fingerprinting,” in INFOCOM, 2004, pp. 1012–1022.
  • [14] K. Kaemarungsi, “Efficient design of indoor positioning systems based on location fingerprinting,” in Wireless Networks, Communications and Mobile Computing, vol. 1, 2005, pp. 181–186.
  • [15] Y. Wen, X. Tian, X. Wang, and S. Lu, “Fundamental limits of RSS fingerprinting based indoor localization,” in 2015 IEEE Conference on Computer Communications (INFOCOM), Apr. 2015, pp. 2479–2487.
  • [16] A. Behboodi, F. Lemic, and A. Wolisz, “Hypothesis Testing Based Model for Fingerprinting Localization Algorithms,” in 2017 IEEE 85th Vehicular Technology Conference (VTC-Spring’17), 2017.
  • [17] A. Behboodi et al., “A Mathematical Model for Fingerprinting-based Localization Algorithms,” arXiv preprint, 2016, arXiv: 1610.07636. [Online]. Available: arxiv:1610.07636
  • [18] A. Behboodi, F. Lemic, A. Wolisz, and R. Mathar, “Interference effect on the performance of fingerprinting localization,” in International Conference on Indoor Positioning and Indoor Navigation (IPIN 2017), September 2017.
  • [19] G. Ding et al., “Overview of received signal strength based fingerprinting localization in indoor wireless lan environments,” in Microwave, Antenna, Propagation and EMC Technologies for Wireless Communications, 2013.
  • [20] X. Wang, L. Gao, S. Mao, and S. Pandey, “Csi-based fingerprinting for indoor localization: A deep learning approach,” IEEE Transactions on Vehicular Technology, vol. 66, no. 1, pp. 763–776, 2017.
  • [21] M. Nowicki and J. Wietrzykowski, “Low-effort place recognition with wifi fingerprints using deep learning,” arXiv preprint arXiv:1611.02049, 2016.
  • [22] L. Xiao, A. Behboodi, and R. Mathar, “A deep learning approach to fingerprinting indoor localization,” in International Telecommunication Networks and Applications conference (ITNAC), November 2017.
  • [23] F. Lemic, A. Behboodi, V. Handziski, and A. Wolisz, “Experimental decomposition of the performance of fingerprinting-based localization algorithms,” in Indoor Positioning and Indoor Navigation (IPIN), 2014 International Conference on.   IEEE, 2014, pp. 355–364.
  • [24] S. Kandel, A. Paepcke, J. Hellerstein, and J. Heer, “Wrangler: Interactive visual specification of data transformation scripts,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.   ACM, 2011, pp. 3363–3372.
  • [25] H. Wickham and others, “Tidy data,” Journal of Statistical Software, vol. 59, no. 10, pp. 1–23, 2014.
  • [26] E. F. Codd, The Relational Model for Database Management: Version 2.   Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1990.
  • [27] C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning, vol. 20, no. 3, pp. 273–297, 1995.
  • [28] I. Steinwart and A. Christmann, Support vector machines, 1st ed., ser. Information science and statistics.   New York: Springer, 2008.
  • [29] C.-W. Hsu and C.-J. Lin, “A comparison of methods for multiclass support vector machines,” IEEE transactions on Neural Networks, vol. 13, no. 2, pp. 415–425, 2002.
  • [30] M. L. Minsky and S. A. Papert, Perceptrons: an introduction to computational geometry, 2nd ed.   Cambridge/Mass.: The MIT Press, 1972.
  • [31] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, Dec. 1989.
  • [32] K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural Networks, vol. 4, no. 2, pp. 251–257, 1991.
  • [33] S. Sonoda and N. Murata, “Neural network with unbounded activation functions is universal approximator,” Applied and Computational Harmonic Analysis, vol. 43, no. 2, pp. 233–268, Sep. 2017.
  • [34] M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. Sohl-Dickstein, “On the Expressive Power of Deep Neural Networks,” in PMLR, Jul. 2017, pp. 2847–2854.
  • [35] Y. Bengio, “Practical recommendations for gradient-based training of deep architectures,” in Neural networks: Tricks of the trade.   Springer, 2012, pp. 437–478.
  • [36] R. Linsker, “Self-organization in a perceptual network,” IEEE Computer, vol. 21, no. 3, pp. 105–117, Mar. 1988.
  • [37] Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle et al., “Greedy layer-wise training of deep networks,” Advances in neural information processing systems, vol. 19, p. 153, 2007.
  • [38] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proceedings of the 25th international conference on Machine learning.   ACM, 2008, pp. 1096–1103.
  • [39] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” Journal of Machine Learning Research, vol. 11, no. Dec, pp. 3371–3408, 2010.
  • [40] Leo Breiman, “Bagging predictors,” Machine Learning, vol. 26, pp. 123–140, 1996.
  • [41] ——, “Bias, Variance, and Arcing Classifiers,” Apr. 1996.
  • [42] Z.-H. Zhou, Ensemble methods: foundations and algorithms, ser. Chapman & Hall/CRC machine learning & pattern recognition series.   Boca Raton, FL: Taylor & Francis, 2012.
  • [43] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning, ser. Adaptive computation and machine learning series.   Cambridge, MA: MIT Press, 2017.
  • [44] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, “What is the best multi-stage architecture for object recognition?” in

    2009 IEEE 12th International Conference on Computer Vision

    , Sep. 2009, pp. 2146–2153.
  • [45] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks.” in Aistats, vol. 15, no. 106, 2011, p. 275.
  • [46]

    A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in

    Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [47] Y. Bengio, “Learning Deep Architectures for AI,” Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
  • [48] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural computation, vol. 18, no. 7, pp. 1527–1554, 2006.
  • [49] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks.” in Aistats, vol. 9, 2010, pp. 249–256.
  • [50] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034.
  • [51] D. Mishkin and J. Matas, “All you need is a good init,” in ICLR, 2016.
  • [52] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, “Efficient backprop,” in Neural networks: Tricks of the trade.   Springer, 2012, pp. 9–48.
  • [53] B. T. Polyak, “Some methods of speeding up the convergence of iteration methods,” USSR Computational Mathematics and Mathematical Physics, vol. 4, no. 5, pp. 1–17, 1964.
  • [54]

    J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,”

    Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011.
  • [55] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural networks for machine learning, vol. 4, no. 2, pp. 26–31, 2012.
  • [56] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [57] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
  • [58] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “CNN Features off-the-shelf: an Astounding Baseline for Recognition,” arXiv:1403.6382 [cs], Mar. 2014.
  • [59] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How Transferable Are Features in Deep Neural Networks?” in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, ser. NIPS’14.   Cambridge, MA, USA: MIT Press, 2014, pp. 3320–3328.
  • [60] A. Borrelli et al., “Channel models for ieee 802.11 b indoor system design,” in ICC’04, vol. 6.   IEEE, 2004, pp. 3701–3705.
  • [61] F. Lemic, J. Büsch, M. Chwalisz, V. Handziski, and A. Wolisz, “Demo abstract: Testbed infrastructure for benchmarking rf-based indoor localization solutions under controlled interference,” in Proc. of 11th European Conference on Wireless Sensor Networks (EWSN’14), 2014, pp. 1–5.
  • [62] F. Lemic, “Data management services for evaluation of rf-based indoor localization filip lemic and vlado handziski.”
  • [63] J. Torres-Sospedra, R. Montoliu, A. Martínez-Usó, J. P. Avariento, T. J. Arnau, M. Benedito-Bordonau, and J. Huerta, “Ujiindoorloc: A new multi-building and multi-floor database for wlan fingerprint-based indoor localization problems,” in Indoor Positioning and Indoor Navigation (IPIN), 2014 International Conference on.   IEEE, 2014, pp. 261–270.
  • [64] A. Moreira, M. J. Nicolau, F. Meneses, and A. Costa, “Wi-fi fingerprinting in the real world-rtls um at the evaal competition,” in Indoor Positioning and Indoor Navigation (IPIN), 2015 International Conference on.   IEEE, 2015, pp. 1–10.