Deep Smartphone Sensors-WiFi Fusion for Indoor Positioning and Tracking

11/21/2020 ∙ by Leonid Antsfeld, et al.

We address the indoor localization problem, where the goal is to predict user's trajectory from the data collected by their smartphone, using inertial sensors such as accelerometer, gyroscope and magnetometer, as well as other environment and network sensors such as barometer and WiFi. Our system implements a deep learning based pedestrian dead reckoning (deep PDR) model that provides a high-rate estimation of the relative position of the user. Using Kalman Filter, we correct the PDR's drift using WiFi that provides a prediction of the user's absolute position each time a WiFi scan is received. Finally, we adjust Kalman Filter results with a map-free projection method that takes into account the physical constraints of the environment (corridors, doors, etc.) and projects the prediction on the possible walkable paths. We test our pipeline on IPIN'19 Indoor Localization challenge dataset and demonstrate that it improves the winner's results by 20% using the challenge evaluation protocol.



I Introduction

Ubiquitous location-based services have recently attracted a great deal of attention. They require a reliable positioning and tracking technology for mobile devices that works outdoors as well as indoors [39]. While navigation satellite systems such as GPS already provide reliable positioning outdoors, a corresponding solution is yet to be found for indoor environments, where GPS signals cannot penetrate and provide sufficient accuracy.

Indoor location-based services [11] bring important social and commercial values, by enabling many applications including human localization and tracking, personalized advertisement, living assistance, etc. The ubiquity of smart-phones and the availability of different wireless infrastructure, such as WiFi and Bluetooth, make them an attractive platform for such positioning systems.

Numerous techniques for smartphone-based indoor positioning have been developed, yet no single solution can guarantee a reliable and universal service [34] on its own. Most techniques exhibit their strengths and weaknesses under different conditions. In combination, they can complement each other and improve not only the accuracy but also the reliability of the service.

Nowadays a typical smartphone contains a dozen different sensors, and their number keeps growing. There are several types of sensors in a smartphone. Network sensors, such as WiFi and Bluetooth, may be leveraged to estimate the absolute position of a user. WiFi positioning using received signal strength (RSS) fingerprinting [22] has been considered the most popular indoor positioning solution. RSS values from several access points (APs) can be easily gathered by common smartphones under existing WiFi infrastructure. However, severe RSS fluctuations often render inaccurate positioning results. Motion and position sensors (a.k.a. the Inertial Measurement Unit, IMU), such as the accelerometer, gyroscope and magnetometer, can help estimate user displacement relative to a known starting point. This approach is known as pedestrian dead reckoning (PDR) [12], where PDR determines the user's location by adding the currently estimated displacement to the previously estimated location. The displacement is estimated by combining step detection and step length estimation with user heading estimation from accelerometer, gyroscope and magnetometer data streams. PDR can achieve accurate positioning over short distances but is subject to drift over long distances. Environment sensors such as the barometer, for example, may be useful in determining the floor inside a building.
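The PDR update described above can be sketched in a few lines; the step length and heading values below are illustrative, not taken from any particular dataset:

```python
import math

def pdr_update(position, heading_rad, step_length_m):
    """One pedestrian-dead-reckoning update: advance the previous
    position by one detected step along the estimated heading."""
    x, y = position
    return (x + step_length_m * math.cos(heading_rad),
            y + step_length_m * math.sin(heading_rad))

# Walk three 0.7 m steps at heading 0 rad starting from the origin.
pos = (0.0, 0.0)
for _ in range(3):
    pos = pdr_update(pos, heading_rad=0.0, step_length_m=0.7)
print(pos)
```

In practice each term (step detection, step length, heading) carries error, and those errors accumulate with every update, which is the drift problem discussed above.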

Both WiFi and PDR have serious limitations though (high variation of WiFi signals and the drift of PDR)  [22, 34], so an auxiliary tool for indoor localization has been proposed. Namely, landmarks can be easily identified based on specific sensor patterns in the environment [4], and then be exploited to correct WiFi and PDR predictions.

In particular, human motion recognition [21, 36] from smartphone sensors may be used to improve indoor positioning. User motion states, like staying still, walking or taking stairs can be treated as indoor landmarks [1] to reset location estimation by PDR. Therefore, recent works tend to fuse WiFi positioning, PDR and landmarks to enhance the indoor positioning accuracy [7].

I-A Our proposal

In this paper, we propose a new sensor fusion framework for accurate indoor positioning and tracking using the smartphone's inertial sensors, WiFi measurements and landmarks. Our framework integrates new components which distinguish it from state-of-the-art approaches; they are as follows:

  1. Deep PDR. Inspired by the use of deep learning for user activity detection [7, 21], we apply a deep learning approach to PDR. We pre-process and reshape the sensor data streams and use convolutional (CNN) and recurrent (RNN) networks to extract underlying hidden correlations between different sensors and modalities and learn a model of the user's local displacement.

    This allows the model to cope with sensor noise and replaces manual feature extraction, which is frequently subject to data noise and requires sophisticated thresholding, including tuning to different pedestrian profiles depending on gender, age, height, etc.


    While this approach gives a better estimate of the user's relative displacement, the sensor measurements are always noisy, so we use WiFi-based predictions and observed landmarks to obtain the user's absolute position.

  2. Landmarks and pseudo labels.

    CNN/RNN models require a large annotated dataset for training, while genuine ground truth annotations are sparse and available for a limited number of landmarks. On the other hand, raw sensor data are massively generated at a high rate. We therefore annotate sensor data with pseudo labels and generate a large annotated set for training CNN/RNNs. It is based on the simpler tasks of user walking and landmark detection and an interpolation of the user's behaviour between the landmarks.

  3. Semi-supervised VAE for WiFi. A radiomap (fingerprint database) is constructed from the WiFi data provided in the training and validation sets. The recorded data provide a WiFi scan reading approximately every 4 seconds, however without the exact position where the scan was taken. Using the provided inertial sensor data, we can infer the approximate position where each WiFi fingerprint was taken and build a radiomap with this information.

The preliminary version of our framework participated in the off-site smartphone-based positioning track of the competition organized at the IPIN 2019 conference [27] and was ranked 2nd. The full framework presented in this paper improves our own results by 25%. Moreover, as the evaluations show, it reduces the localization error obtained by the IPIN'19 challenge winner by 20%.

The rest of the paper is organized as follows. Section 2 reviews related work in WiFi-based positioning, PDR-based positioning and deep learning from sensor data. Section 3 presents the full architecture for indoor positioning and tracking, then describes the main components in detail, paying particular attention to deep PDR modeling, landmark recognition and pseudo labels for training CNN/RNNs. Section 4 presents the evaluation setting of the IPIN'19 indoor localization challenge and reports evaluation results and ablation studies. Finally, Section 5 concludes the paper.

II Related Work

Several recent surveys give an exhaustive picture of different aspects of research in the mobile and wireless networking domains, including indoor positioning for smartphones [6, 12, 22, 40]. In this section, we briefly present works relevant to our architecture for indoor positioning and tracking, in particular WiFi and PDR-based positioning, activity recognition and map-free matching.

II-A WiFi based positioning

The most popular technique for smartphone-based indoor localization today is WiFi fingerprinting [6, 11, 22, 17]. A location is represented by a WiFi fingerprint, which lists visible access points and their respective received signal strength (RSS). Positioning is performed by matching the WiFi fingerprint measured on the mobile device against a database of reference fingerprints collected beforehand during a calibration phase. The location associated with the closest match is returned as the position estimate.
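The matching step can be sketched as a nearest neighbor search in signal space; the reference radiomap below and its RSS values are hypothetical, and a missing access point is filled in with a floor value:

```python
import math

# Hypothetical reference radiomap: location -> RSS (dBm) per access point.
radiomap = {
    (0.0, 0.0): {"AP1": -40, "AP2": -70, "AP3": -90},
    (5.0, 0.0): {"AP1": -60, "AP2": -50, "AP3": -80},
    (5.0, 5.0): {"AP1": -80, "AP2": -55, "AP3": -45},
}

def match_fingerprint(scan, radiomap, missing_rss=-100):
    """Return the reference location whose stored fingerprint is closest
    (Euclidean distance in signal space) to the measured scan."""
    def dist(ref):
        aps = set(scan) | set(ref)
        return math.sqrt(sum(
            (scan.get(ap, missing_rss) - ref.get(ap, missing_rss)) ** 2
            for ap in aps))
    return min(radiomap, key=lambda loc: dist(radiomap[loc]))

print(match_fingerprint({"AP1": -62, "AP2": -48, "AP3": -78}, radiomap))
```

Real systems refine this with weighted k-nearest neighbors or probabilistic models, but the closest-match principle is the same.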

Over the past decade, most research effort focused either on improving the matching of measurements to reference data or on generalizing training data into signal strength models. Beyond simple nearest neighbor matching [11], Ferris et al. [8] proposed to model signal strength across an entire building using Gaussian processes, which allows extrapolating to areas with no reference data. In contrast, [13] considered a continuous building-wide WiFi model unnecessary and proposed a sparser representation by mapping fingerprints to a graph-based reference database.

To be accurate, the fingerprints should be densely recorded and annotated with exact coordinates. Classical methods suffer from hand-crafted algorithms, subject to heavy, complex calibration and parameter tuning.

The main challenge is the gap between the massive generation of non-annotated sensor data and their modest annotation, which allows deploying only simple machine learning algorithms. To enable deployment of modern deep learning techniques, the problem of sizable annotations is usually addressed by crowd-sourcing, pseudo-labeling or semi-supervised learning able to combine unlabeled and labeled sensor data. Some efforts have been proposed for WiFi-based localization [5, 24].

In [38], Y. Yuan et al. introduced an efficient fingerprint training method using semi-supervised learning, reporting an 80% time cost reduction while guaranteeing the localization accuracy. Another method was recently proposed in [3], where faster radio-map construction is achieved by allowing larger distances between successive fingerprints and by using adaptive path loss model interpolation to estimate the locations of fingerprints.

A semi-supervised method for localization of a moving smartphone robot was proposed in [37]. First, they obtain pseudo labels for the unlabeled data using Laplacian Embedded Regression Least Squares. During the learning phase, two decoupled balancing parameters are individually weighted to labeled and pseudo-labeled data. Semi-supervised learning with generative models based on the Variational Auto-Encoder (VAE) has been applied to WiFi-based localization in [5]; we deploy it in our positioning and tracking architecture.

II-B Pedestrian dead reckoning

The PDR-based localization technique utilizes the inertial sensors available on modern smartphones, in particular the accelerometer, gyroscope and magnetometer [12]. Like all inertial methods, it can give an accurate position only over a short period of time and requires regular corrections of the user's position to avoid error accumulation. PDR is typically composed of step detection, step length estimation and heading determination. The user's position is estimated recursively by accumulating vectors that represent the movement of the user at each detected step.

All PDR components are subject to heavy parameter tuning [21]. Step length depends on the user's characteristics such as height or age, and even for the same user it may vary according to the activity the user is performing, e.g., walking slowly vs. walking fast. Step detection algorithms, such as peak detection, flat-zone detection and zero-crossing detection, are not free of heavy parameter tuning either [32]. The accuracy of these techniques depends on thresholds being appropriately set, which may be conditioned by the user's characteristics but also by the quality and particularities of the inertial sensors [20].

With respect to heading estimation, the heading angle offset, which is the angle between the direction of the smartphone and the direction of the user, usually does not remain constant during navigation. The assumption that the angle remains constant holds when pedestrians hold smartphones in front of the body, but if the phone pose is arbitrary, the heading offset cannot be guaranteed to be constant.

Displacement and direction of motion are then estimated for individual steps. To this end, recent research relies on machine learning techniques, and activity recognition [21] has been extended from distinguishing between the user moving and standing still to further estimating walking speed, climbing stairs, taking an elevator, etc. [35, 41].

II-C Deep learning from sensor data

A new generation of systems for indoor localization confirms a transition from traditional signal processing approaches to machine learning solutions, including deep learning [35, 22].

Most mobile devices can only produce unlabeled position data; therefore unsupervised and semi-supervised learning become essential. Mohammadi et al. [24] address this problem by leveraging deep reinforcement learning and variational auto-encoders (VAE). In particular, their framework envisions a virtual agent in indoor environments, which can constantly receive state information during training, including signal strength indicators, the current agent location, and the real (labeled data) and inferred (via a VAE) distance to the target.

Deep learning for recognition of human activities has been approached by using both ambient sensing methods and wearable sensing methods [41].

Activity recognition using sensor data is a multivariate time-series classification problem, which extracts discriminative features from sensor data to recognize activities with a classifier [21]. Time-series data have a strong one-dimensional structure, in which temporally nearby variables are highly correlated [34, 41]. Traditional methods rely on extracting complex hand-crafted features, which requires laborious human intervention and limits the identification of pedestrian activities.

In [41], a deep learning-based method for indoor activity recognition was proposed, using a combination of data from multiple smartphone built-in sensors. A new convolutional neural network (CNN) was designed for the one-dimensional sensor data to learn the proper features automatically.

II-D Map-aided navigation

The idea of using indoor space geometry to reduce position and heading errors in autonomous positioning systems has been extensively exploited in the last several years. In the case of indoor navigation, building floor plans represent constraints that restrict movement, as people cannot walk through walls and floor changes can occur only via staircases or elevators. The goal of map-aided navigation is to exploit prior information contained in maps to improve positioning accuracy [6, 26]. There are currently three approaches to map-aided navigation indoors [33], all of which can be implemented on smartphones: probabilistic map matching based on particle filtering using wall constraints, topological map matching based on a link-node representation of the building plan, and reduction of heading error by comparison with the building's cardinal heading. The purpose of these algorithms is to improve positioning and heading by adjusting the estimated path to the building plan [25].

Fig. 1: Sensor data for landmarks detection.

III System design

We illustrate our approach using an example of a user's route (see Figure 1) from the IPIN'19 localization challenge dataset and the associated data from accelerometer, gyroscope, magnetometer and barometer sensors, as well as speed and stride estimations. The route spans 10 points; it starts by switching the smartphone on at point 0 and letting the calibration terminate. The user then walks through points 1, 2, 3, 4 to point 5. Once at point 5, she turns back (creating point 6) and walks through points 7, 8, 9 to get back to the starting point. The figure plots the sensor data streams along the timeline, from point 0 to point 10.

Sensor data and landmarks are used to generate the pseudo labels for deep PDR learning. The main elements of the sensor data annotation are the following:

  • Walking vs standing still. A small amount of sensor data is sufficient to train an accurate classifier to distinguish between these two activities [34]. In Figure 1, the accelerometer data between points 0 and 1 (after the calibration phase) and between points 5 and 6 clearly suggest the user is standing still.

  • Landmarks. Points 1 to 5 and 7 to 10 of the route are landmarks; they refer to direction changes. Crucial for training indoor localization systems, they are commonly annotated with ground truth positions. Figure 1 suggests that orientation changes (orientation vectors can be estimated from accelerometer, gyroscope and magnetometer data either separately or via a smartphone application) are highly correlated with landmarks. By coupling orientation data with other sensor data and landmark ground truth, a simple Random Forest can be trained to obtain an accurate landmark predictor.


  • Pressure. The user's route in Figure 1 stays on the same floor. In general, barometer data allow easy recognition of floor changes [41].

  • Speed and Stride Estimations. Obtained from accelerometer and gyroscope data, these are important for generating pseudo labels and annotations; their values are inferred from PDR and averaged over each route segment.
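The pressure-based floor recognition mentioned above can be sketched using the standard barometric gradient of roughly 0.12 hPa per metre of ascent near sea level; the floor height and pressure readings below are illustrative:

```python
def floor_change(p_ref_hpa, p_now_hpa, floor_height_m=3.0, hpa_per_m=0.12):
    """Estimate the number of floors moved from a barometric pressure change.
    Pressure drops by roughly 0.12 hPa per metre of ascent near sea level."""
    ascent_m = (p_ref_hpa - p_now_hpa) / hpa_per_m
    return round(ascent_m / floor_height_m)

# A 0.72 hPa drop corresponds to about 6 m of ascent, i.e. two 3 m floors.
print(floor_change(1013.25, 1012.53))
```

In practice the reference pressure drifts with weather, so real systems look at short-term pressure deltas rather than absolute values.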

Our assumption of a steady user walk between two landmarks is inspired by the indoor localization datasets created for the IPIN'18 and IPIN'19 challenges [30, 27]. In a more general case with multiple open spaces and erratic walking, it can lead to over- or under-segmentation of a trajectory and high noise in the generated pseudo labels.

III-A Deep Learning from sensor data

CNNs are state-of-the-art models in image recognition tasks, where nearby pixels typically have strong relationships with each other, forming visual patterns. While CNNs are used for computer vision tasks, we believe their convolutional layers are able to capture relationships in motion signals and identify correlations between sensors once the input is shaped as an image. In multi-modal approaches, where many sensors are used to capture a movement, grasping correlations among sensors may help to better interpret the data. Thus a CNN can exploit the local dependency characteristics inherent in time-series sensor data and the translation-invariant nature of movement.

To enable convolution on smartphone sensor data, we frame them as an image. We first down-sample all raw sensor data to 50 Hz, a frequency sufficient to characterize any user displacement [41]. We then implement two modes of converting sensor data: using raw data or recurrence plots.

In the raw data mode, we concatenate all sensor data into one multi-dimensional stream and run a sliding window over the stream. The window width determines the width of each data point given as CNN input and represents the time interval considered. If the window width is set to one second and the data are sampled at 50 Hz, each data point will be 50 columns wide. It is a rule of thumb in the community that a one-second interval is adequate to characterize human activities and, therefore, should be sufficient to learn a meaningful model of the user's movement.

The window height depends on the number of sensors taken into account. For the accelerometer, gyroscope and magnetometer we generate four rows each: one per axis and one for the magnitude calculated from the three axial values. Figure 2 illustrates this process. For each window considered, a total of 12 features are thus extracted per timestamp and framed as an image.
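A minimal sketch of this framing, using synthetic sine waves in place of real accelerometer, gyroscope and magnetometer streams (the stride value is illustrative):

```python
import numpy as np

rate_hz, window_s = 50, 1.0
n_cols = int(rate_hz * window_s)        # 50 samples per 1 s window
t = np.arange(1000) / rate_hz

# Synthetic stand-ins for 3-axis accelerometer, gyroscope, magnetometer.
def triaxial(freq):
    return np.stack([np.sin(2 * np.pi * freq * t + ph) for ph in (0, 1, 2)])

rows = []
for sensor in (triaxial(2.0), triaxial(0.5), triaxial(0.1)):
    magnitude = np.linalg.norm(sensor, axis=0)
    rows.append(np.vstack([sensor, magnitude]))  # 3 axes + magnitude = 4 rows
frames = np.concatenate(rows)                    # 12 feature rows in total

# Slide a 1 s window (stride 25 samples = 0.5 s overlap) over the stream
# to produce a batch of 12x50 image-like inputs.
stride = 25
windows = np.stack([frames[:, i:i + n_cols]
                    for i in range(0, frames.shape[1] - n_cols + 1, stride)])
print(windows.shape)
```

Each slice of shape (12, 50) is one "image" fed to the CNN.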

We build the local displacement model with one regression branch and one classification branch (see Figure 2). The regression branch predicts the user's local displacement (Δx, Δy), while the classification branch predicts the user activity. The activity classification is trained with the standard cross-entropy loss; the regression branch is trained by minimizing the loss over a set of N 2D points, defined as follows:

L_reg = (1/N) ∑_{i=1}^{N} ‖p_i − p̂_i‖² , (1)

where p_i is a ground truth point and p̂_i is a prediction.

We train the deep PDR model using the Adam optimizer [19] with a fixed learning rate and weight decay. First, the raw sensor data are extracted from a set of annotated logfiles containing recorded values from all sensors and all tracks; a subset of these tracks is reserved for validation. The total training and validation losses are calculated as a weighted sum of the regression loss L_reg in (1) and the cross entropy L_CE for the activity the user is performing (standing still vs. walking):

L = L_reg + λ · L_CE , (2)

where λ is a trade-off between the two terms. In our experiments, we set λ to 1. The training process stops when the validation loss has not improved for 50 epochs.
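A NumPy sketch of the combined regression-plus-cross-entropy loss described above, with the trade-off weight set to 1 as in our experiments; the sample displacements and logits are illustrative:

```python
import numpy as np

def combined_loss(disp_pred, disp_true, act_logits, act_true, lam=1.0):
    """Weighted sum of the displacement regression loss (mean squared
    Euclidean error over 2D points) and the activity cross entropy."""
    l_reg = np.mean(np.sum((disp_true - disp_pred) ** 2, axis=1))
    # Numerically stable softmax cross entropy over the activity classes.
    z = act_logits - act_logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    l_ce = -np.mean(log_probs[np.arange(len(act_true)), act_true])
    return l_reg + lam * l_ce

disp_pred = np.array([[0.1, 0.0], [0.6, 0.1]])   # predicted (dx, dy)
disp_true = np.array([[0.0, 0.0], [0.5, 0.0]])   # pseudo-label (dx, dy)
act_logits = np.array([[3.0, -1.0], [-2.0, 2.0]])  # standing vs walking
act_true = np.array([0, 1])
loss = combined_loss(disp_pred, disp_true, act_logits, act_true)
print(loss)
```

In training, the same quantity is computed per mini-batch and back-propagated through the shared CNN/RNN trunk.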

Fig. 2: Framing raw sensor data as input to a CNN trained to predict the local displacement (Δx, Δy) and the user activity.

III-B Recurrence Plots

Recurrence-based analysis [10] utilizes a fundamental characteristic that any system eventually returns close to its earlier states as time passes. In the case of real-world time series, systems often repeat earlier behavior, even though they might at times be interrupted by regime shifts and dynamical transitions. Recurrence plots encode the pairwise recurrences of time series values and thus create a visual representation of system dynamics, solely from the measured time series.

Consider x = (x_1, …, x_T), a d-dimensional time series of length T. The system is said to recur when a state vector x_i at time i is close to a state vector x_j at a different time j, i.e., x_i ≈ x_j. Here, the notion of x_i being close to x_j depends on (i) the choice of a norm, such as the Euclidean norm or the maximum norm, and (ii) the choice of a distance threshold, which unambiguously defines all states farther apart as 'not close', and vice versa. We can thus encode all possible pairs of recurrences in the recurrence matrix R, where

R_ij = f(ε − ‖x_i − x_j‖), (3)

‖·‖ is a norm, ε is a chosen distance threshold and f is a normalization function.

As working with the entire time series is not practical, we consider a window of width w. The resulting matrix R of size w × w comprises solely values between 0 and 1, where values close to 1 denote pairs of points where the sensor data recur, while values close to 0 denote non-recurring pairs of points. R is symmetric only if the chosen norm is symmetric.
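A sketch of computing such a soft recurrence matrix for one window, using the Euclidean norm and a logistic function as the normalization f (the threshold and sharpness values are illustrative):

```python
import numpy as np

def recurrence_matrix(window, eps=0.5, sharpness=10.0):
    """Soft recurrence matrix for a (w, d) window of d-dimensional states:
    R[i, j] is close to 1 when states i and j are within eps of each other
    (Euclidean norm) and close to 0 otherwise, via a logistic squashing."""
    dists = np.linalg.norm(window[:, None, :] - window[None, :, :], axis=-1)
    return 1.0 / (1.0 + np.exp(-sharpness * (eps - dists)))

# A 50-sample window of a 12-dimensional periodic signal recurs each period.
t = np.arange(50)
window = np.stack([np.sin(2 * np.pi * t / 25 + k) for k in range(12)], axis=1)
R = recurrence_matrix(window)
print(R.shape)  # one w-by-w plot per window
```

Because the signal has period 25 samples, the plot shows bright diagonals offset by one period, which is exactly the texture the CNN can pick up.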

A recurrence plot (RP) is obtained by visualizing the recurrence matrix [23]. Based on the simple estimation given by Eq. (3), this powerful visual representation can capture differences in dynamical behaviour. Figure 3 shows 10 sequential recurrence plots for a sliding window over a d-dimensional time series, where d = 12 is the dimension of the data stream composed of accelerometer, gyroscope and magnetometer data.

Fig. 3: 10 sequential recurrence plots from the accelerometer, gyroscope and magnetometer data streams.
Fig. 4: General overview of the architecture. Dashed boxes and lines represent data used to train the Deep PDR model.

III-C System architecture

Our system for indoor positioning and tracking is composed of four main components:

  • A deep PDR model that provides a high-rate update of the user’s relative displacement.

  • A WiFi fingerprinting component that provides a prediction of the absolute user’s position each time a WiFi scan is received, which occurs approximately every 4 seconds.

  • A Kalman filter to fuse the different rate predictions from the deep PDR and WiFi components. The filter provides an estimate of the user’s position, without taking into account physical restrictions imposed by the environment.

  • A map-free projection algorithm that projects the prediction from the Kalman filter onto the paths that are possible given the physical constraints of the environment (corridors, doors, etc.). In this way, the final prediction is adjusted to a feasible route, avoiding the crossing of regions of the environment that are impossible for a user on foot.

Figure 4 shows the proposed architecture; the following sections describe the main components in detail.

III-D Deep PDR

Any PDR-based system monitors the user's behavior by gathering relevant data from IMU sensors. The sensor data stream is processed into handcrafted features: relevant and discriminative characteristics, such as the number of steps, step length, orientation, etc., are extracted from the raw data. Finally, the user's relative position is estimated using a theoretical dynamic model of the movement.

Classical PDR techniques need to infer the user's speed through step detection, typically from accelerometer data, and an approximation of the user's step length. The error of PDR estimations is usually caused by both heading and step length errors. The user's stride need not be constant and depends, among other factors, on the physical characteristics of the user.

Instead, we propose a deep learning approach to PDR. To learn the local displacement model from IMU sensor data, we make a simplifying assumption motivated by an analysis of the IPIN'19 challenge dataset. Indeed, user tracks in the dataset correspond to routes carried out inside administrative buildings composed mostly of long corridors, and the landmarks provided with the data refer to the user's direction changes. We therefore assume that there are no changes in orientation between two consecutive landmarks, so the user moves in a straight line between these points. However, the user's speed is unknown and can vary due to an obstacle, such as a door or other people, or as a consequence of a user decision. We determine the speed by obtaining the number of steps from the accelerometer data and adjusting the speed based on the distance between each pair of consecutive landmarks and their corresponding timestamps. In this way, the speed is not considered constant between two landmarks, but varies depending on the data provided by the accelerometer.
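The idea of spreading the straight-line displacement between two landmarks over the detected steps (so the implied speed follows the accelerometer-derived cadence rather than being assumed constant) can be sketched as follows; the landmark coordinates and step times are made up:

```python
def interpolate_positions(lm_a, lm_b, step_times):
    """Spread the straight-line displacement between two landmarks over
    the detected steps: each step advances the user by an equal share of
    the segment, so instantaneous speed follows the step cadence.
    lm_a, lm_b: (t, x, y) of consecutive landmarks; step_times: sorted."""
    (ta, xa, ya), (tb, xb, yb) = lm_a, lm_b
    n = len(step_times)
    positions = []
    for k, t in enumerate(step_times, start=1):
        frac = k / n                       # equal displacement per step
        positions.append((t, xa + frac * (xb - xa), ya + frac * (yb - ya)))
    return positions

# Steps bunch up early, so the user covers ground faster at the start.
labels = interpolate_positions((0.0, 0.0, 0.0), (10.0, 8.0, 0.0),
                               [1.0, 2.0, 3.0, 6.0])
print(labels)
```

The timestamped positions produced this way serve as pseudo labels for the displacement regressor.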

CNN as Deep PDR model

Our CNN consists of 3 convolution layers and 2 max-pooling layers, followed by fully connected layers. Two dropout layers interleave the convolution layers to improve the overall accuracy of the CNN.

The CNN takes as input the image-framed raw sensor data or RPs and passes them through the convolution layers. The convolution kernel sizes in these layers vary as a function of the input image size. A pooling layer is placed after the activation of each convolution layer; it condenses the convolution output by reducing the number of rows and columns of the image-like input. In our implementation, a max-pooling layer with a two-by-two filter (stride two) keeps the maximum value of each two-by-two subsection. At the final stage of the CNN, fully connected layers compute the output: a softmax for the activity classification and a regression head for the displacements.

RNN as Deep PDR model

We explore the performance of a deep PDR model obtained by replacing the CNN with a Recurrent Neural Network (RNN) [31]. RNNs are specialized in processing sequences of values and capturing long-distance inter-dependencies in the input stream. They pass information between time steps, which allows them to remember information about previous values in the sequence. When dealing with time series of IMU sensor data, recurrent networks are capable of identifying temporal patterns and producing accurate predictions. In each step, the internal state of the RNN, a sort of 'memory' of previous time steps, is combined with the current input to produce an output. This way, the last output for a given sequence is based on information obtained from all previous values in the sequence.

The vanilla RNN architecture suffers from severe issues, such as the vanishing and exploding gradient problems, which make optimization a complex challenge. Long Short-Term Memory (LSTM) networks [14] were designed to avoid these problems while efficiently learning long-range dependencies. If fed in a bidirectional fashion, using the data both from start to end and from end to start, LSTMs can achieve better results, since they recognize patterns in both directions. In our experiments, we use bidirectional LSTMs and assess their capacity to learn the user's relative displacement model from a series of raw sensor data.

III-E Landmarks and pseudo labels

Landmarks play an important role in indoor positioning and tracking; they refer to direction changes and constrained passages such as doors and elevators. Landmarks can often be identified by analysing sensor data. Once identified, they allow obtaining pseudo labels and thus generating a richer training set, which is critical for training an accurate deep PDR model.

Indeed, genuine ground truth annotations are sparse and mostly available for a limited number of landmarks. On the other hand, raw sensor data are massively generated at a high rate. We therefore develop a method to annotate sensor data with pseudo labels and generate a large annotated set for training a deep PDR model. It is based on the simpler tasks of user activity and landmark detection and an assumption about how users behave between the landmarks. To generate the pseudo labels, we make the simplifying assumption that the user moves along a straight line between any two landmarks.

Such an assumption holds in a major part of indoor environments, where any user trajectory can be represented by a sequence of segments and the error is limited to the choices in multi-door passages, the width of the corridor, etc. Once landmarks are identified, pseudo labels are obtained by interpolating the user's position between two landmarks, under the assumption that all paths between landmarks are straight trajectories with no turns.

We run a sliding window over the IMU data stream obtained from accelerometer, gyroscope and magnetometer data and associate every image-framed input with the corresponding change in user’s position. Temporal and multi-modal correlations present in sensor data are learned using a deep PDR model. We train a network to predict the relative displacement using image-shaped input with associated ground truth or pseudo labels from the training set.

Using this approach, the inertial sensor readings are used to predict the user's relative displacements. The challenge logfiles provide the orientation of the device. Since the user trajectories were recorded while holding the phone in front of the user's chest, the provided yaw angle corresponds to the user's heading. All data contained in the training and validation sets are used to train the deep learning model responsible for predicting the user's trajectory from the inertial sensor data, thus replacing the classic PDR method.

Fig. 5: Deep PDR network. (a) CNN based (b) RNN based.

III-F WiFi: VAE based predictions

The deep PDR model predicts the user's relative displacements and is prone to drift accumulation. To reduce the drift, we add absolute position estimation to the system, using the available low-frequency WiFi RSS data to build WiFi-based positioning [22].

While it is relatively easy to collect unlabeled WiFi data by crowdsourcing, it is significantly more expensive and tedious to annotate the data with an exact location. WiFi data is massively collected (every 4 seconds), but only a small part is annotated with coordinates.

Semi-supervised learning is a paradigm where both labeled and unlabeled data are used to build accurate prediction models. The semi-supervised setting is well suited to WiFi data collection, where one or more equipped devices can combine low-cost collection of non-annotated WiFi data with a limited annotation effort. Several semi-supervised methods [5, 9, 29, 37] have shown their efficiency in reducing the annotation needed for accurate WiFi-based localization.

We follow [5] in applying recent advances in deep and semi-supervised learning to WiFi-based positioning. We implement a method based on the Variational Auto-Encoder (VAE) [18] introduced in Section 2.1, which significantly reduces the need for labeled data: it can combine a small amount of labeled data with a large unlabeled dataset to build an accurate predictor for the localization component. We adapt the standard VAE encoder-decoder architecture, where the encoder maps the RSS data into latent variables and also plays the additional role of a regressor on the available labeled data. The VAE decoder plays a regularization role on both labeled and unlabeled data.
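A minimal numpy sketch of how such a combined objective could be assembled, assuming mean-squared reconstruction and regression terms and a weighting hyper-parameter `alpha` (both are assumptions for illustration, not details taken from [5]):

```python
import numpy as np

def vae_losses(x, x_rec, mu, logvar, y_true=None, y_pred=None, alpha=1.0):
    """Semi-supervised VAE objective (sketch): reconstruction + KL terms
    on all RSS vectors, plus a position-regression term on the labeled
    subset only. `alpha` weights the supervised term."""
    rec = np.mean((x - x_rec) ** 2)                        # decoder regularizer
    kl = -0.5 * np.mean(1 + logvar - mu**2 - np.exp(logvar))
    sup = 0.0
    if y_true is not None:                                 # labeled data only
        sup = np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)
    return rec + kl + alpha * sup

# Unlabeled batch: only reconstruction + KL contribute
x = np.zeros((4, 8)); mu = np.zeros((4, 2)); lv = np.zeros((4, 2))
print(vae_losses(x, x, mu, lv))  # 0.0
```

In training, labeled and unlabeled batches would share the reconstruction and KL terms, while only labeled batches activate the supervised term.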

The method is semi-supervised and able to train a prediction model from a small set of annotated WiFi observations (10-15% of the WiFi data used by the VAE) complemented with a massive set of non-annotated WiFi observations.

The deep PDR model predicts the user's relative displacements at a high rate, while the WiFi-based absolute predictions arrive at a low rate; the two predictions are fused by a Kalman Filter.

III-G Kalman Filter for fusion

Existing data fusion frameworks mainly include the particle filter and the Kalman filter [4, 6, 28]. The particle filter may achieve reasonable accuracy by deploying a large number of particles, but at a large computational cost, which makes it unsuitable for resource-limited smartphones.

Kalman filter-based approaches are computationally lightweight [15]. However, an explicit measurement equation connecting the user's position with RSS measurements is unavailable due to complex indoor radio propagation, rendering the measurement noise statistics unavailable as well. Previous Kalman filter-based fusion approaches set the related measurement noise covariance matrix manually and empirically. As a result, the fusion process cannot adapt to the uncertainty of WiFi positioning results, which degrades positioning accuracy.

We follow [4] in adopting the Kalman Filter as a sensor fusion framework for combining low-rate WiFi and high-rate PDR predictions. The sensor fusion problem is formulated in a linear perspective, enabling the whole system to run on a smartphone.
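A minimal sketch of this fusion scheme, with PDR displacements as the control input and WiFi positions as measurements; the noise covariances Q and R below are illustrative constants, not the values a deployed system would use:

```python
import numpy as np

class FusionKF:
    """Linear Kalman Filter fusing high-rate PDR displacements with
    low-rate WiFi absolute positions (sketch with assumed Q and R)."""

    def __init__(self, p0, q=0.05, r=4.0):
        self.x = np.array(p0, dtype=float)   # 2D position state
        self.P = np.eye(2)                   # state covariance
        self.Q = q * np.eye(2)               # PDR process noise
        self.R = r * np.eye(2)               # WiFi measurement noise

    def predict(self, dxy):                  # deep PDR step (high rate)
        self.x += dxy
        self.P += self.Q

    def update(self, z):                     # WiFi scan arrives (low rate)
        K = self.P @ np.linalg.inv(self.P + self.R)   # Kalman gain
        self.x += K @ (np.asarray(z) - self.x)
        self.P = (np.eye(2) - K) @ self.P

kf = FusionKF([0.0, 0.0])
for _ in range(8):                 # many PDR steps, drifting east
    kf.predict([0.5, 0.0])
kf.update([3.0, 0.0])              # WiFi pulls the estimate back
print(kf.x)
```

Because the WiFi measurement noise R is larger than the accumulated PDR covariance, the update moves the estimate only partway toward the WiFi fix.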

III-H Map-free projection

The Kalman Filter does not take into account physical constraints imposed by the floor layout: a user cannot go through walls, may change floors only at stairs, etc. Therefore, we perform an additional step of adjusting the Kalman Filter output by projecting it on the walkable paths only. Even though the floor map was not provided explicitly, we could implicitly reconstruct the underlying map by extracting 'walkable paths' between landmarks, and, as the final step, project the Kalman filter prediction to the closest path.

We introduce an additional component which turns out to be critical in regression-based localization. Conventional regression methods often ignore the structure of the output variables and therefore face the problem of predictions falling outside the target space. Indeed, when testing our system on the IPIN'19 dataset, a number of predictions fail to fit the indoor building space. We therefore implement a structured regression method [37] which guarantees that predictions fit the feasibility space.

A naive solution assumes access to an accurate location map; any location prediction is then tested for being inside the feasibility space, and a correction is applied if the test fails. To make our system more generic and map-independent, we do not assume any map and rely only on the training set for the possible corrections.

The method, which turns out to be robust in the semi-supervised setting, is based on weighted neighbourhood projection. For each location prediction, we consider its nearest neighbours in the available annotated set. The projection is given by the weighted sum of these neighbours, where the weights are the inverses of the distances between the prediction and the corresponding neighbours. This projection belongs to the convex hull defined by the neighbours.
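A minimal sketch of this projection, assuming Euclidean distances and a neighbourhood size of 3 chosen purely for illustration:

```python
import numpy as np

def project(pred, anchors, k=3, eps=1e-9):
    """Map-free projection: replace a raw prediction by the inverse-
    distance weighted mean of its k nearest annotated locations, which
    lies in the convex hull of those neighbours."""
    d = np.linalg.norm(anchors - pred, axis=1)
    idx = np.argsort(d)[:k]
    w = 1.0 / (d[idx] + eps)                 # inverse-distance weights
    return (w[:, None] * anchors[idx]).sum(axis=0) / w.sum()

# Annotated locations; the last one is in a distant part of the map
anchors = np.array([[0., 0.], [1., 0.], [0., 1.], [10., 10.]])
print(project(np.array([0.4, 0.4]), anchors))
```

Note that with k=3 the distant anchor is excluded, illustrating why small neighbourhoods limit the projection error.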

The map-free projection works well when all neighbours are topologically close and the convex hull they define is part of the feasibility space. However, if the neighbours are topologically distant (for example, located in different buildings), the error caused by the projection can increase. To minimize this risk, we consider a rather small number of neighbours.

IV Evaluation

IV-A IPIN localization challenges

The Indoor Positioning and Indoor Navigation (IPIN) conference holds an annual competition that provides a rigorous evaluation methodology in order to fairly compare different technologies both in online (real-time, on-site) and offline (post-processing, off-site) settings [30].

The main goal of the off-site smartphone-based positioning track of the IPIN competition is to recreate a path traversed by a person holding a conventional modern smartphone (Samsung A5, 2017), based on the readings from the smartphone's sensors. Sensor data was recorded and stored in a logfile using the "GetSensorData" Android application [16]. The application records all the raw data available from the smartphone sensors, such as WiFi/BLE RSS, GPS location, acceleration, gyroscope, magnetic field, orientation, pressure, light and sound intensity, etc. The logfiles are divided into training, validation and evaluation sets. The organizers supplied a set of landmarks, consisting of the user's positions at given timestamps, for the training and validation logfiles. The training set consists of 50 logfiles corresponding to 15 different trajectories, of length ~5 mins each, that were traversed multiple times in both directions (see Figure 6). The validation set contains 10 logfiles associated with 10 different trajectories, of length ~10 mins each (see Figure 7). The main difference between training and validation logfiles is that in the training logfiles, all significant turns have been recorded (i.e. annotated) with a landmark, while in the validation set a trajectory between two consecutive landmarks is not necessarily a straight line and may include turns, u-turns, stops and other challenging movements.

The evaluation logfile contains only recordings of the sensor data, for ~20 mins, without any landmark information. The goal of the competition is to recreate the path of the user based on this sensor data, providing user position estimations every 0.5 seconds. The final results are benchmarked by the organizers on landmarks unknown to the competitors, and the 75th percentile of the error distribution is used to determine the winner. The competition data is publicly available.
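The challenge metric described above can be computed as follows (a sketch of the per-point Euclidean error and its 75th percentile):

```python
import numpy as np

def challenge_score(pred, gt):
    """IPIN-style score: 75th percentile of per-point Euclidean errors
    between predicted and ground-truth positions sampled every 0.5 s."""
    err = np.linalg.norm(pred - gt, axis=1)
    return np.percentile(err, 75)

gt = np.zeros((4, 2))
pred = np.array([[1., 0.], [2., 0.], [3., 0.], [4., 0.]])
print(challenge_score(pred, gt))  # 3.25
```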


Fig. 6: Example of a training path with annotated landmarks.
Fig. 7: Example of a validation path with partially annotated landmarks.

IV-B Evaluation results

IPIN'19 Indoor Localization Challenge    MAE (m.)  50% Err  75% Err*  90% Err
Winner                                   2.0       1.5      2.27      5.1
2nd place (*)                            1.7       1.3      2.36      3.9
3rd place                                2.1       1.8      2.54

Our pipeline
PLs  RPs  Model  WiFi  PRJ               MAE (m.)  50% Err  75% Err*  90% Err
          RNN                            1.79      1.33     2.44      4.50
          RNN                            1.53      1.29     1.92      3.31
          RNN                            2.10      1.56     2.85      4.49
          RNN                            1.74      1.47     2.19      3.32
          RNN                            1.64      1.28     1.99      3.45
          CNN                            1.98      1.42     2.46      4.51
          CNN                            1.54      1.16     1.99      3.21
          CNN                            1.97      1.38     2.32      5.01
          CNN                            2.22      1.89     2.83      4.11
          CNN                            1.58      1.05     1.80      3.70
TABLE I: The best results of the IPIN'19 challenge, and the MAE, 50%, 75% and 90% errors for our system, ablating pseudo labels (PLs), recurrence plots (RPs), WiFi and map-free projections (PRJ).

We validate the effectiveness of our system by ablating different components, measuring the corresponding localization errors and comparing them to the challenge's best results. In all experiments, we evaluate the 75th percentile of the error distribution, used by the IPIN'19 challenge organizers during the competition. In addition, we report the standard mean absolute error (MAE) and the 50th and 90th percentiles.

Table I first reports three top results of the challenge. Then it presents results when using CNN and RNN as deep PDR models. We feed the network with raw sensor data streams or recurrence plots and ablate pseudo labels, WiFi and map-free projections.

The best 75th-percentile error of 1.80 m is obtained with the CNN deep PDR model, complemented with pseudo labels and RPs. In the IPIN'19 challenge, the winner reported a 2.27 m error (our contribution, with a 2.36 m error, took the 2nd place). In other words, our improved architecture reduces the winner's error by 20%. This improvement comes from adding RPs, fine-tuning the full pipeline and hyper-parameter optimization.

Beyond the deep PDR, we also ablate the WiFi and map-free projection components of our pipeline. As Table I shows, both components play an important role: removing either the WiFi predictions or the map-free projection leads to a substantial performance drop, for both CNN and RNN models.

Fig. 8: Snapshot of the video comparing different scenarios.

Visual comparison.

Beyond the numerical evaluation, we generated a video to visually compare the behaviour of different configurations of our network on the test tracks (the video is provided in the supplementary material). Figure 8 offers a snapshot of the video. For the test track, it shows the standard and deep PDRs, the WiFi-based predictions, the KF fusion of PDR and WiFi predictions, the final predictions after map-free projection, as well as the ground truth.

Most remarkable is the difference between the standard and deep PDRs, with the latter showing a much smaller accumulated drift. The WiFi global position predictions and Kalman Filter fusion then correct the errors of the deep PDR. Finally, the map-free projections fix some infeasible predictions and project them back into the feasible navigation space.

IV-C Discussion and Future Work

The most important lessons learned from the evaluation are the following:

  1. Deep PDR represents a strong alternative to the standard PDR, outperforming it in all configurations and reducing the accumulated drift.

  2. Using recurrence plots is preferable to a direct, naive conversion of sensor data into a 2D image-shaped representation.

  3. Magnetometer data turns out to be valuable information for indoor positioning and tracking; in combination with accelerometer and gyroscope data, it contributes to reducing the localization error, while removing magnetic field data leads to a performance drop.

  4. Our attempt to benefit from the sequential nature of sensor data by deploying a more complex LSTM as the deep PDR model was only partially successful. While the results outperform the previous year's competition winner, they were slightly worse than those obtained with the CNN, despite intensive hyper-parameter optimization. It would be interesting to apply recent state-of-the-art techniques, such as attention mechanisms, to take advantage of the sequential nature of the sensor data.
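The recurrence-plot conversion mentioned in point 2 can be sketched in its simplest thresholded form (the threshold value is an illustrative choice):

```python
import numpy as np

def recurrence_plot(x, eps=0.1):
    """Binary recurrence plot of a 1D signal: R[i, j] = 1 when samples
    i and j are closer than eps, turning a time series into a 2D image."""
    d = np.abs(x[:, None] - x[None, :])
    return (d < eps).astype(np.uint8)

t = np.linspace(0, 4 * np.pi, 64)
rp = recurrence_plot(np.sin(t))
print(rp.shape)  # (64, 64)
```

Stacking one such plot per sensor channel yields the image-shaped input fed to the CNN.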

Earlier we mentioned several contributions enabling our system to achieve state-of-the-art performance on the IPIN'19 dataset. Another important factor is the assumption of a corridor-based navigation space; it simplifies landmark detection and the generation of pseudo labels and, therefore, the training of accurate deep PDR models. In contrast, indoor positioning and tracking in multiple open spaces with erratic user navigation represents a more serious challenge.

Relaxing this simplifying assumption represents the most intriguing direction of future work. One promising direction may come from our WiFi component, which avoids pseudo labels; instead, it deploys semi-supervised learning to project both labeled and unlabeled RSS WiFi data into the VAE latent space and to make absolute position predictions.

V Conclusion

We propose a novel architecture for indoor localization of a user based on data collected by a smartphone. We build a reliable prediction of the user's trajectory using inertial sensors such as the accelerometer, gyroscope and magnetometer, as well as the barometer and WiFi scanner. Our main innovation is a deep learning based pedestrian dead reckoning (PDR) model that provides a high-rate estimation of the user's local displacement. We describe the full system and its components, including landmark detection, relative and absolute position estimation from sensor data, prediction fusion and map-free projection. We show how to shape sensor data to train CNN/RNN architectures. We evaluate our system on the IPIN'19 indoor localization challenge dataset and obtain a localization error 20% lower than the challenge winner's.


  • [1] H. Abdelnasser, R. Mohamed, A. Elgohary, M. F. Alzantot, H. Wang, S. Sen, R. R. Choudhury, and M. Youssef (2016) SemanticSLAM: using environment landmarks for unsupervised indoor localization. IEEE Trans. Mob. Comput. 15 (7), pp. 1770–1782.
  • [2] Y. Bengio, P. Simard, and P. Frasconi (1994) Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5 (2), pp. 157–166.
  • [3] J. Bi, Y. Wang, Z. Li, S. Xu, J. Zhou, M. Sun, and M. Si (2019) Fast radio map construction by using adaptive path loss model interpolation in large-scale building. Sensors 19 (3), pp. 712.
  • [4] Z. Chen, H. Zou, H. Jiang, Q. Zhu, Y. C. Soh, and L. Xie (2015) Fusion of WiFi, smartphone sensors and landmarks using the Kalman filter for indoor localization. Sensors 15 (1), pp. 715–732.
  • [5] B. Chidlovskii and L. Antsfeld (2019) Semi-supervised variational autoencoder for WiFi indoor localization. In Intern. Conf. Indoor Positioning and Indoor Navigation (IPIN), pp. 1–8.
  • [6] P. Davidson and R. Piché (2017) A survey of selected indoor positioning methods for smartphones. IEEE Communications Surveys & Tutorials 19 (2), pp. 1347–1370.
  • [7] Z. Deng, G. Wang, D. Qin, Z. Na, Y. Cui, and J. Chen (2016) Continuous indoor positioning fusing WiFi, smartphone sensors and landmarks. Sensors 16.
  • [8] B. Ferris, D. Fox, and N. D. Lawrence (2007) WiFi-SLAM using Gaussian process latent variable models. In Proc. 20th Intern. Joint Conference on Artificial Intelligence (IJCAI), M. M. Veloso (Ed.), pp. 2480–2485.
  • [9] N. Ghourchian, M. Allegue-Martínez, and D. Precup (2017) Real-time indoor localization in smart homes using semi-supervised learning. In Proc. Thirty-First AAAI Conference on Artificial Intelligence, pp. 4670–4677.
  • [10] B. Goswami (2019) A brief introduction to nonlinear time series analysis and recurrence plots. Vibration, pp. 332–368.
  • [11] B. Gressmann, H. S. Klimek, and V. Turau (2010) Towards ubiquitous indoor location based services and indoor navigation. In 7th Workshop on Positioning Navigation and Communication, WPNC, pp. 107–112.
  • [12] R. Harle (2013) A survey of indoor inertial positioning systems for pedestrians. IEEE Communications Surveys and Tutorials 15 (3), pp. 1281–1293.
  • [13] S. Hilsenbeck, D. Bobkov, G. Schroth, R. Huitl, and E. Steinbach (2014) Graph-based data fusion of pedometer and WiFi measurements for mobile indoor positioning. In ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 147–158.
  • [14] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
  • [15] S. Hosseinyalamdary (2018) Deep Kalman filter: simultaneous multi-sensor integration and modelling; a GNSS/IMU case study. Sensors 18 (5), pp. 1316.
  • [16] A. Jiménez, F. Seco, and J. Torres-Sospedra (2019) Tools for smartphone multi-sensor data registration and GT mapping for positioning applications.
  • [17] A. Khalajmehrabadi, N. Gatsis, and D. Akopian (2017) Modern WLAN fingerprinting indoor positioning methods and deployment challenges. IEEE Communications Surveys & Tutorials 19 (3), pp. 1974–2002.
  • [18] D. P. Kingma and M. Welling (2019) An introduction to variational autoencoders. Now Foundations and Trends.
  • [19] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [20] J. H. Lee, B. Shin, C. Kim, J. Kim, S. Lee, and T. Lee (2013) Real time adaptive step length estimation for smartphone user. In Intern. Conf. Control, Automation and Systems, pp. 382–385.
  • [21] W. S. Lima, E. Souto, K. El-Khatib, R. Jalali, and J. Gama (2019) Human activity recognition using inertial sensors in a smartphone: an overview. Sensors 19 (14), pp. 3213.
  • [22] Y. Ma, G. Zhou, and S. Wang (2019) WiFi sensing with channel state information: a survey. ACM Comput. Surv. 52 (3), pp. 46:1–46:36.
  • [23] N. Marwan (2008) A historical review of recurrence plots. The European Physical Journal Special Topics 164, pp. 3–12.
  • [24] M. Mohammadi, A. Al-Fuqaha, M. Guizani, and J. Oh (2018) Semi-supervised deep reinforcement learning in support of IoT and smart city services. IEEE Internet of Things Journal 5 (2).
  • [25] K. Nguyen-Huu, K. Lee, and S. Lee (2017) An indoor positioning system using pedestrian dead reckoning with WiFi and map-matching aided. In Intern. Conf. Indoor Positioning and Indoor Navigation (IPIN), pp. 1–8.
  • [26] A. Perttula, H. Leppäkoski, M. Kirkko-Jaakkola, P. Davidson, J. Collin, and J. Takala (2014) Distributed indoor positioning system with inertial measurements and map matching. IEEE Trans. Instrumentation and Measurement 63 (11), pp. 2682–2695.
  • [27] F. Potortì, S. Park, A. Crivello, F. Palumbo, M. Girolami, P. Barsocchi, S. Lee, J. Torres-Sospedra, A. R. Jimenez, A. Pérez-Navarro, G. M. Mendoza-Silva, F. Seco, M. Ortiz, J. Perul, V. Renaudin, H. Kang, S. Park, J. H. Lee, C. G. Park, J. Ha, J. Han, C. Park, K. Kim, Y. Lee, S. Gye, K. Lee, E. Kim, J. Choi, Y. -S. Choi, S. Talwar, S. Y. Cho, B. Ben-Moshe, A. Scherbakov, L. Antsfeld, E. Sansano-Sansano, B. Chidlovskii, N. Kronenwett, S. Prophet, Y. Landau, R. Marbel, L. Zheng, A. Peng, Z. Lin, B. Wu, C. Ma, S. Poslad, D. R. Selviah, W. Wu, Z. Ma, W. Zhang, D. Wei, H. Yuan, J. -B. Jiang, S. -Y. Huang, J. -W. Liu, K. -W. Su, J. -S. Leu, K. Nishiguchi, W. Bousselham, H. Uchiyama, D. Thomas, A. Shimada, R. -I. Taniguchi, V. Cortés, T. Lungenstrass, I. Ashraf, C. Lee, M. U. Ali, Y. Im, G. Kim, J. Eom, S. Hur, Y. Park, M. Opiela, A. Moreira, M. J. Nicolau, C. Pendão, I. Silva, F. Meneses, A. Costa, J. Trogh, D. Plets, Y. -R. Chien, T. -Y. Chang, S. -H. Fang, and Y. Tsao (2020) The IPIN 2019 indoor localisation competition - description and results. IEEE Access, pp. 1–1.
  • [28] A. Poulose, J. Kim, and D. Han (2019) A sensor fusion framework for indoor localization using smartphone sensors and Wi-Fi RSSI measurements. Applied Sciences 9, pp. 4379.
  • [29] T. Pulkkinen, T. Roos, and P. Myllymäki (2011) Semi-supervised learning for WLAN positioning. In Artificial Neural Networks and Machine Learning (ICANN), pp. 355–362.
  • [30] V. Renaudin, M. Ortiz, J. Perul, J. Torres-Sospedra, A. Jimenez, A. Pérez-Navarro, G. Mendoza-Silva, F. Seco, Y. Landau, R. Marbe, B. Ben-Moshe, X. Zheng, F. Ye, J. Kuang, Y. Li, X. Niu, V. Landa, S. Hacohen, N. Shv, and Y. Park (2019) Evaluating indoor positioning systems in a shopping mall: the lessons learned from the IPIN 2018 competition. IEEE Access, pp. 1–1.
  • [31] D. E. Rumelhart, G. E. Hinton, and R. J. Williams (1986) Learning representations by back-propagating errors. Nature 323 (6088), pp. 533.
  • [32] S. H. Shin, C. G. Park, J. W. Kim, H. S. Hong, and J. M. Lee (2007) Adaptive step length estimation algorithm using low-cost MEMS inertial sensors. In IEEE Sensors Applications Symposium, pp. 1–5.
  • [33] H. Tran, S. Pandey, and N. Bulusu (2017) Online map matching for passive indoor positioning systems. In Proc. 15th Annual Intern. Conference on Mobile Systems, Applications, and Services (MobiSys), pp. 175.
  • [34] J. Wang, Y. Chen, S. Hao, X. Peng, and L. Hu (2019) Deep learning for sensor-based activity recognition: a survey. Pattern Recognit. Lett. 119, pp. 3–11.
  • [35] Q. Wang, L. Ye, H. Luo, A. Men, F. Zhao, and Y. Huang (2019) Pedestrian stride-length estimation based on LSTM and denoising autoencoders. Sensors 19 (4).
  • [36] X. Wang, Z. Yu, and S. Mao (2020) Indoor localization using smartphone magnetic and light sensors: a deep LSTM approach. MONET 25 (2), pp. 819–832.
  • [37] J. Yoo and K. H. Johansson (2017) Semi-supervised learning for mobile robot localization using wireless signal strengths. In Intern. Conf. Indoor Positioning and Indoor Navigation (IPIN), pp. 1–8.
  • [38] Y. Yuan, L. Pei, C. Xu, Q. Liu, and T. Gu (2014) Efficient WiFi fingerprint training using semi-supervised learning. In Ubiquitous Positioning Indoor Navigation and Location Based Service, pp. 148–155.
  • [39] F. Zafari, A. Gkelias, and K. Leung (2017) A survey of indoor localization systems and technologies. CoRR abs/1709.01015.
  • [40] C. Zhang, P. Patras, and H. Haddadi (2019) Deep learning in mobile and wireless networking: a survey. IEEE Communications Surveys and Tutorials (3), pp. 2224–2287.
  • [41] B. Zhou, J. Yang, and Q. Li (2019) Smartphone-based activity recognition for indoor localization using a convolutional neural network. Sensors 19, pp. 621.