Localizing smartphone users is an essential technique for many location-based services such as navigation and advertising. Although GPS can provide relatively accurate positions, it does not function well in indoor environments. Thus we need other options to recognize user location. Since the WiFi signal strength is related to the distance between the hotspots and the devices, WiFi fingerprint data labeled with actual user coordinates allow us to map WiFi fingerprints to user locations via supervised learning approaches.
Much research focuses on the use of WiFi received signal strength indicator (RSSI) values. Among the solutions presented in the literature, traditional neural networks are historically among the most widespread. Their limitations are that they can be regarded as deterministic functions and that their loss functions for regression problems are usually based on Euclidean distance (for instance, the mean squared error). Conventional neural networks work well in many cases, but when the dataset contains too much noisy information, they are not powerful enough to learn the useful information from it. In our case, the user's current position is normally related to only a small number of WiFi access points, and the rest of the RSSI values in the input vector are in fact not useful. However, in the modelling process, we need to feed all the RSSI values to the model. As a result, the unrelated information in the input data leads to poor performance when training the neural network.
Other candidate solutions are deep probabilistic models, such as Mixture Density Networks (MDNs), Bayesian Neural Networks (BNNs) and Variational Autoencoder (VAE)-based models. These methods are based on probability theory and Bayesian statistics, and introduce uncertainty into the models to prevent overfitting. However, from the viewpoint of the Information Bottleneck theory, these models do not consider which information in the dataset is useful for the learning task. Thus, the aforementioned deep probabilistic models can address our problem to some extent, but they may not be the optimal solutions.
In this work, we propose a novel model to compute accurate user locations from the related WiFi fingerprints. We treat this as a supervised regression problem: the WiFi RSSI values are the input and the actual user location (latitude and longitude) is the output. However, there are some difficulties in achieving this goal. First, to provide good network connectivity, modern buildings are normally equipped with abundant WiFi access points (WAPs). Therefore, when we use the RSSI values as the modeling input, the input is usually very high dimensional. Meanwhile, due to signal-fading and multi-path effects, the RSSI values can be very noisy. These two properties result in severe overfitting when we use conventional neural network-based models.
For this reason, in contrast with the existing methods, we propose a Variational Information Bottleneck model based on the Information Bottleneck method and Variational Inference. This model consists of two sub-models: an encoder, and a predictor that is used to predict the target values. According to the Information Bottleneck theory, the encoder in our model finds a good latent representation of the input data for the related learning task, so that the nuisance information in the original input is taken out. Afterwards, the predictor uses the latent representation, instead of the original input, to predict the target values. Our model is an end-to-end deep learning model that scales to large datasets, which makes it easy to train.
The remainder of the paper is organized as follows. Section II surveys the related research work. Section III introduces the proposed model. Section IV presents the validation experiments and their results and gives a detailed discussion. The conclusions and possible future work are given in Section V.
II Related Work
In previous research, both conventional machine learning and deep learning methods have been widely explored for WiFi fingerprint-based user location recognition. Many previous works treat this problem as a classification or clustering task, i.e., identifying the buildings and/or floors. Some researchers used conventional machine learning methods, for instance, Decision Trees, K-nearest neighbors, Naive Bayes, Neural Networks, K-means and the affinity clustering algorithm. In addition, since RSSI values are sometimes high dimensional, some researchers used deep learning techniques such as Autoencoders to reduce the input dimension before proceeding to the main learning tasks.
For learning accurate user positions, i.e., calculating the real coordinates of the users, Gaussian Processes (GPs) are one of the options. But GPs are extremely computationally expensive on large datasets because they need to compute the covariances between all pairs of data points. To circumvent this issue, one can resort to deep learning approaches. Several previous works used Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). However, since deterministic neural networks can cause severe overfitting, these methods calculate the user coordinates indirectly. In our study, we find that some deep probabilistic models can be better solutions. For example, Mixture Density Networks use a mixture of Gaussian distributions at the output layer to compute the final output and use the negative log-likelihood as the loss function. The disadvantage of this method is that, as a maximum likelihood estimation (MLE) method, it fails to consider the prior of the model parameters, so it is prone to overfitting. Though Bayesian Neural Networks (BNNs) are maximum a posteriori (MAP) methods, they do not filter out the noisy information in the input data, so their performance is not as good as expected either. In our previous research, we took advantage of Variational Autoencoders (VAEs) to develop VAE-based semi-supervised learning models. However, these models all neglect the effects of the nuisance information in the dataset. The nuisance information is redundant for the learning tasks and damages the modeling performance. To circumvent the effect of the nuisance information, one can resort to the Deep Variational Information Bottleneck (DVIB) model. DVIB is a model based on Variational Inference and the Information Bottleneck method. It aims at learning a good latent representation of the input data for the downstream learning tasks.
In this work, we apply the Information Bottleneck method to the WiFi fingerprint-based location recognition problem in order to reduce the damage that nuisance information does to the modeling performance. Inspired by VAEs, $\beta$-VAEs and DVIB, we devise a Variational Information Bottleneck model to translate the user's WiFi RSSI values into the actual user location. This model is solved via Monte Carlo sampling and Variational Inference.
In our model, the input is the WiFi RSSI values $x$ and the output is the user's coordinates $y$. To make the model more robust to noise, we use a set of probability distributions such as $p(z \mid x)$ and $p(y \mid z)$ to describe the relationship between the variables, instead of deterministic functions as in conventional neural networks. Furthermore, for our model to be theoretically well founded, we first need to make some assumptions:
Assumption 1: assume that there exists a latent variable $z$ with its own distribution, and that $x$, $z$ and $y$ belong to the same information Markov chain: $x \rightarrow z \rightarrow y$.
Assumption 2: assume that $x$ alone is sufficient to learn $z$, which leads to $p(z \mid x, y) = p(z \mid x)$.
Assumption 3: assume that $z$ alone is sufficient to learn $y$, which leads to $p(y \mid x, z) = p(y \mid z)$.
We make the above assumptions based on the idea that the values of both the WiFi RSSIs and the GPS coordinates are related to the user's real physical position. Hence either the WiFi RSSI values or the GPS coordinates contain sufficient information about the real physical position of the user (which we represent with the latent variable $z$). This suggests that we can use $x$ to compute $z$ (encoding step) and then use $z$ to compute $y$ (predicting step). The above assumptions facilitate the derivation of our model.
In a maximum a posteriori (MAP) modeling setting, the parameters of the model are related not only to the dataset but also to the prior of the parameters:
$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}, \tag{1}$$
where $D$ is the dataset, $\theta$ denotes the model parameters, $p(\theta \mid D)$ is the posterior, $p(D \mid \theta)$ is the likelihood and $p(\theta)$ is the prior. Applying such a setting to our problem, the prior of the latent representation, $p(z)$, and the posterior $p(z \mid x)$ can both be represented by Gaussian distributions. Using Variational Inference, $p(z \mid x)$ can be calculated via a neural network.
In Variational Autoencoders, one assumes that there is a latent variable $z$ which can be used to reconstruct the original input $x$. Hence the information Markov chain of VAEs is $x \rightarrow z \rightarrow \hat{x}$, where $\hat{x}$ is the reconstructed input. Accordingly, the loss function can be written as:
$$\mathcal{L}_{\mathrm{VAE}}(\phi, \theta) = -\mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] + D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big), \tag{2}$$
where $D_{KL}$ represents the Kullback–Leibler (KL) divergence, which measures the closeness between the posterior and the prior, $\phi$ denotes the parameters of the encoder network, $\theta$ denotes the parameters of the decoder network, and $p(z)$ is an uninformative prior of $z$, for which we can use a standard Normal distribution.
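When the posterior is a diagonal Gaussian and the prior is a standard Normal, the KL term has a well-known closed form. The following is a minimal sketch of that formula; the function name and the log-variance parameterization are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over the latent axis."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

# A posterior identical to the prior has zero divergence.
kl_same = kl_to_standard_normal(np.zeros(5), np.zeros(5))
# Shifting a single latent mean by 1 (unit variance kept) costs 0.5 nats.
kl_shifted = kl_to_standard_normal(np.array([1.0, 0.0, 0.0, 0.0, 0.0]), np.zeros(5))
```

Working with the log-variance keeps the variance positive without any explicit constraint, which is a common (but not the only) design choice.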
Furthermore, according to the Information Bottleneck principle, let $x$ be the input, $y$ be the learning target and $z$ be the representation; then we have the following optimization objective:
$$\max \; I(z; y) \quad \text{s.t.} \quad I(x; z) \le I_c, \tag{3}$$
where $I(\cdot\,;\cdot)$ is the mutual information and $I_c$ is the information constraint.
Or equivalently, if we apply the Karush-Kuhn-Tucker (KKT) conditions to Eq. (3), we obtain the following Lagrangian:
$$\mathcal{L} = I(z; y) - \beta \big(I(x; z) - I_c\big), \tag{4}$$
where $\beta$ is a Lagrangian multiplier.
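Since our learning task is supervised, as opposed to VAEs and $\beta$-VAEs, we have the information Markov chain $x \rightarrow z \rightarrow y$. In contrast to $\beta$-VAEs, based on Eq. (3) and the assumptions we have made, we know that the latent variable $z$ can be represented by $x$ alone ($p(z \mid x, y) = p(z \mid x)$) and the output $y$ can depend on $z$ alone ($p(y \mid x, z) = p(y \mid z)$). For this reason, we can replace the term $p_\theta(x \mid z)$ in the VAE-type loss with $p_\theta(y \mid z)$. As a result, we have the following optimization objective for our model:
$$\min_{\phi, \theta} \; \mathbb{E}_{(x, y) \sim D}\Big[-\mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(y \mid z)\big] + \beta\, D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)\Big],$$
where $D$ is the dataset, $\theta$ denotes the parameters of the predictor network, and $\beta$ is a positive constant with a small value.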
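The step from the mutual-information form to a trainable loss is not spelled out in this section. A hedged sketch of one way to see it, using the standard variational bounds from the DVIB literature, is given below. The data density $p(x, y)$ and the label entropy $H(y)$ are notation introduced only for this sketch, and $\beta \ge 0$ is assumed.

```latex
\begin{aligned}
I(z; y) &\ge \mathbb{E}_{p(x,y)}\,\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(y \mid z)\big] + H(y), \\
I(x; z) &\le \mathbb{E}_{p(x)}\big[D_{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)\big], \\
I(z; y) - \beta\, I(x; z) &\ge \mathbb{E}_{p(x,y)}\Big[\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(y \mid z)\big]
  - \beta\, D_{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)\Big] + H(y).
\end{aligned}
```

Because $H(y)$ does not depend on the network parameters, maximizing this lower bound amounts to minimizing the negated bracketed term, i.e. a prediction log-likelihood term plus a $\beta$-weighted KL term.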
III-C Model Solver
To solve the optimization objective of our model, we need to adopt some special techniques. First, to sample from $q_\phi(z \mid x)$, we can use the reparameterization trick proposed for VAEs, in which the random variable $z$ is decomposed into a combination of the mean $\mu$ and the standard deviation $\sigma$:
$$z = \mu + \sigma \odot \epsilon,$$
where $\mu$ and $\sigma$ are calculated via the neural networks, and $\epsilon$ is sampled from a standard diagonal Normal distribution.
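As an illustration of the reparameterization step, the sketch below draws latent samples from given encoder outputs. The function name, the log-variance parameterization and the example values are assumptions for this sketch only.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) and sigma = exp(0.5 * log_var)."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.exp(0.5 * np.asarray(log_var, dtype=float))
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

rng = np.random.default_rng(0)
# 20000 draws from a 2-D Gaussian with means (1, -2) and standard deviations (0.5, 2).
mu = np.tile([1.0, -2.0], (20000, 1))
log_var = np.tile(np.log([0.25, 4.0]), (20000, 1))
samples = reparameterize(mu, log_var, rng)
```

Because the randomness enters only through `eps`, the sample is a deterministic, differentiable function of the encoder outputs, which is what allows gradients to flow back to the encoder in an automatic-differentiation framework.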
Afterwards we need to calculate the term $\mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(y \mid z)\big]$. This term cannot be computed directly, but we can estimate it with the Monte Carlo method.
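If we adopt Monte Carlo sampling, then the optimization objective of our model becomes:
$$\mathcal{L}(\phi, \theta) \approx \frac{1}{N} \sum_{n=1}^{N} \Big[-\log p_\theta\big(y_n \mid f_\phi(x_n, \epsilon_n)\big) + \beta\, D_{KL}\big(q_\phi(z \mid x_n) \,\|\, p(z)\big)\Big],$$
where $N$ denotes the total number of instances and $f_\phi$ is the same deterministic neural network used in the encoder to calculate the parameters of the distribution $q_\phi(z \mid x)$:
$$q_\phi(z \mid x) = \mathcal{N}\big(z \mid f_\phi^{\mu}(x), f_\phi^{\sigma}(x)\big).$$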
Last but not least, $\beta$ is a hyperparameter used to balance the encoding term and the predicting term, so it needs to be chosen carefully.
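To make the roles of the two terms concrete, here is a minimal sketch of a single-sample Monte Carlo estimate of such a loss over a batch. It assumes a unit-variance Gaussian likelihood for the predictor, so that the negative log-likelihood reduces to half the squared error up to a constant; the paper does not state the likelihood form, and the function name and arguments are hypothetical.

```python
import numpy as np

def vib_loss(y_true, y_pred, mu, log_var, beta):
    """Batch average of [prediction term + beta * encoding (KL) term].

    y_pred is the predictor output for one sampled z per instance.
    Assumption: unit-variance Gaussian likelihood, so -log p(y|z) equals
    0.5 * squared error up to an additive constant.
    """
    nll = 0.5 * np.sum((y_true - y_pred) ** 2, axis=-1)
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)
    return np.mean(nll + beta * kl)

y_true = np.array([[1.0, 2.0]])
y_pred = np.array([[0.0, 2.0]])                 # squared error 1 -> prediction term 0.5
mu = np.array([[1.0, 0.0, 0.0, 0.0, 0.0]])      # KL term 0.5
log_var = np.zeros((1, 5))
loss = vib_loss(y_true, y_pred, mu, log_var, beta=0.1)  # 0.5 + 0.1 * 0.5
```

A larger `beta` weights the KL term more heavily and pushes the encoder output toward the prior; a smaller `beta` lets the prediction term dominate.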
III-D Computing Output
In VAEs and $\beta$-VAEs, one first obtains new samples from an uninformative standard Gaussian and then uses them as the input of the decoder. In contrast, since our model is supervised, once the model is trained we feed a sample from the conditional distribution, i.e., $z \sim q_\phi(z \mid x)$, to the predictor network to compute the final output, which is the same as in the training procedure.
The overall algorithm is summarized in Algorithm 1.
IV Experimental Results
IV-A Dataset Description
For validation, we use the UJIindoor dataset, whose input dimension is 520, where each dimension represents a WAP. The RSSI values range from -104 dBm to 0 dBm when the WAPs are detected; otherwise the RSSI values are set to 100. Each RSSI vector corresponds to a pair of latitude and longitude values as the geo-location label. In our experiments, we use scaled GPS coordinate values for computational convenience. The total number of instances is about 20,000. For Experiment 1 and Experiment 2, we use a fixed portion of the dataset for training and the rest as the test dataset. In Experiment 3, the amount of training data varies.
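The text does not specify how the inputs and coordinates are scaled. One plausible preprocessing sketch is shown below; it assumes the UJIIndoorLoc convention of marking undetected WAPs with the value 100, and the floor value of -105, the min-max scaling and all function names are illustrative assumptions rather than the authors' procedure.

```python
import numpy as np

NOT_DETECTED = 100     # dataset sentinel for an undetected WAP (assumed convention)
RSSI_FLOOR = -105.0    # assumed placeholder just below the weakest detectable RSSI

def preprocess_rssi(rssi):
    """Map undetected WAPs to the floor value, then scale RSSI to [0, 1]."""
    rssi = np.asarray(rssi, dtype=float)
    rssi = np.where(rssi == NOT_DETECTED, RSSI_FLOOR, rssi)
    return (rssi - RSSI_FLOOR) / (0.0 - RSSI_FLOOR)

def fit_minmax(coords):
    """Per-column minimum and maximum, computed on the training coordinates."""
    coords = np.asarray(coords, dtype=float)
    return coords.min(axis=0), coords.max(axis=0)

def scale_coords(coords, lo, hi):
    """Scale coordinates to [0, 1] per column."""
    return (np.asarray(coords, dtype=float) - lo) / (hi - lo)

def unscale_coords(scaled, lo, hi):
    """Invert scale_coords so that errors can be reported in original units."""
    return np.asarray(scaled, dtype=float) * (hi - lo) + lo

features = preprocess_rssi([100, 0, -52.5])
coords = np.array([[10.0, 200.0], [20.0, 400.0], [15.0, 300.0]])
lo, hi = fit_minmax(coords)
scaled = scale_coords(coords, lo, hi)
```

Fitting the minimum and maximum on the training split only, and reusing them for the test split, avoids leaking test information into the scaling.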
IV-B Model Structure
| Network | Layer | Settings | Activation |
|---|---|---|---|
| Encoder | hidden layer | neuron number: 512; latent dimension: 5 | ReLU |
| Predictor | hidden layer | neuron number: 512; dropout rate: 0.3 | ReLU |
| Predictor | hidden layer | neuron number: 512; dropout rate: 0.3 | ReLU |
| Predictor | hidden layer | neuron number: 512; dropout rate: 0.3 | ReLU |

Optimizer: Adam; learning rate: 1e-3
Table I shows the implementation details of our model. The encoder neural network consists of one hidden layer, and the dimension of the latent codes is set to 5. In practice, we find that a latent dimension of 5 is in line with the Minimum Description Length principle for our task. The predictor is composed of three hidden layers. Each hidden layer has 512 units. In particular, to improve generalization on test data, we can increase the model uncertainty. Hence we apply the Dropout technique to the hidden layers of the predictor. The optimizer for the model is Adam and the learning rate is $10^{-3}$.
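The architecture in Table I can be sketched as a plain forward pass. The layer widths, latent dimension and dropout rate follow the table; the input dimension of 520 (the WAP count of UJIIndoorLoc), the two-dimensional output, the He-style initialization, the separate mean and log-variance heads, and disabling dropout at prediction time are assumptions of this sketch. Training, which needs an automatic-differentiation framework, is omitted.

```python
import numpy as np

INPUT_DIM, HIDDEN, LATENT, OUTPUT_DIM = 520, 512, 5, 2  # INPUT_DIM, OUTPUT_DIM assumed

def dense(rng, n_in, n_out):
    """He-style random weights and zero biases for one fully connected layer."""
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in), np.zeros(n_out)

def init_params(rng):
    return {
        "enc_hidden": dense(rng, INPUT_DIM, HIDDEN),
        "enc_mu": dense(rng, HIDDEN, LATENT),
        "enc_log_var": dense(rng, HIDDEN, LATENT),
        "pred_hidden": [dense(rng, LATENT, HIDDEN),
                        dense(rng, HIDDEN, HIDDEN),
                        dense(rng, HIDDEN, HIDDEN)],
        "pred_out": dense(rng, HIDDEN, OUTPUT_DIM),
    }

def relu(a):
    return np.maximum(a, 0.0)

def forward(params, x, rng, train=False, dropout_rate=0.3):
    """Encode x, sample z with the reparameterization trick, then predict y."""
    w, b = params["enc_hidden"]
    h = relu(x @ w + b)
    w, b = params["enc_mu"]
    mu = h @ w + b
    w, b = params["enc_log_var"]
    log_var = h @ w + b
    z = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
    h = z
    for w, b in params["pred_hidden"]:
        h = relu(h @ w + b)
        if train:  # inverted dropout on the predictor's hidden layers
            keep = rng.random(h.shape) >= dropout_rate
            h = h * keep / (1.0 - dropout_rate)
    w, b = params["pred_out"]
    return h @ w + b, mu, log_var

rng = np.random.default_rng(0)
params = init_params(rng)
x_batch = rng.random((4, INPUT_DIM))        # stand-in for four scaled RSSI vectors
y_hat, mu, log_var = forward(params, x_batch, rng)
```

The function returns `mu` and `log_var` alongside the prediction because both are needed to evaluate the KL term of the loss during training.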
IV-C Experiment 1
In the loss function of the proposed model, the constant $\beta$ is related to the constraint of the optimization and balances the encoding error term $D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$ and the prediction error term $-\log p_\theta(y \mid z)$. A larger $\beta$ value means the model tends to be more compressive for the input and less expressive for the output, and vice versa. Therefore, different $\beta$ values can result in different modeling results.
To find the optimal $\beta$, we test a range of $\beta$ values for our model. The results are shown in Fig. 2. We hereafter fix $\beta$ to the value that gives the best performance there and use it for the proposed model in all following experiments.
Fig. 3 shows the ground truth and the test modeling result of our model. It can be seen that the proposed model can calculate the user location coordinates accurately using the relevant WiFi fingerprints. In addition, Fig. 4 demonstrates how the latent distribution is related to the building IDs and floor IDs.
IV-D Experiment 2
To show the advantages of our method, we run other methods proposed in the literature on the UJIindoor dataset. K-NN is used as the baseline model. The MDN-2 model refers to the Mixture Density Network model with 2 Gaussian distributions at the output layer. Similarly, the MDN-5 model is an MDN model with 5 Gaussian distributions at the output layer. The Semi-VAE model is a semi-supervised variational autoencoder (VAE) model, which will be explained later. The overall results are shown in Table II. We use the root mean squared error (RMSE) as the evaluation metric.
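For two-dimensional coordinates, RMSE can be defined in more than one way. The sketch below takes the root of the mean squared Euclidean distance between predicted and true positions, which is an assumed convention; the text does not say which variant is used.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root of the mean squared Euclidean distance between coordinate pairs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sq_dist = np.sum((y_true - y_pred) ** 2, axis=-1)
    return np.sqrt(np.mean(sq_dist))

# One prediction is off by a 3-4-5 triangle, the other is exact.
error = rmse([[0.0, 0.0], [0.0, 0.0]], [[3.0, 4.0], [0.0, 0.0]])  # sqrt(12.5)
```

If the model is trained on scaled coordinates, the predictions should be mapped back to the original units before computing the metric so that results are comparable across methods.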
From the results, we can see that the proposed model has the best modeling performance. Also in practice we find that compared to our model, the Gaussian Process model suffers from heavy computation load and the MDN models are not very stable during the learning process.
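For reference, a k-nearest-neighbor regression baseline of the kind mentioned above can be sketched in a few lines. The number of neighbors, the Euclidean distance on the RSSI vectors and the unweighted averaging are assumptions here, since the baseline's settings are not given.

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k=3):
    """Predict coordinates as the mean label of the k nearest training fingerprints."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)
    # Squared Euclidean distances, shape (n_query, n_train).
    d2 = ((x_query[:, None, :] - x_train[None, :, :]) ** 2).sum(axis=-1)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

x_train = np.array([[0.0], [1.0], [10.0]])
y_train = np.array([[0.0, 0.0], [2.0, 2.0], [10.0, 10.0]])
pred = knn_predict(x_train, y_train, np.array([[0.4]]), k=2)  # averages the first two labels
```

This brute-force version builds the full distance matrix in memory, so for a dataset with hundreds of input dimensions and many instances the queries would need to be processed in chunks.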
IV-E Experiment 3
Based on our previous assumptions, we can also formulate an alternative semi-supervised learning approach, the semi-VAE model. The learning procedure can be described briefly as follows. If we first learn a VAE model in an unsupervised manner, we obtain $q_\phi(z \mid x)$ and $p_\theta(x \mid z)$. After that, we carry out a supervised learning procedure by sampling $z$ from $q_\phi(z \mid x)$ to compute $y$. In particular, the semi-VAE model uses both the labeled and unlabeled data for unsupervised learning and then uses the labeled data for supervised learning, whereas our proposed model only uses the labeled data for supervised learning.
To compare with the semi-supervised learning approach, the semi-VAE model, we run our model and the other models on different portions of labeled data. As shown in Fig. 5, once the labeled portion of the total training data exceeds a certain threshold, our method surprisingly has the best performance among all the methods.
Why can the proposed method outperform other deep learning methods? First, our problem can be regarded as a regression problem in which the input (RSSI vectors) is relatively high dimensional and the target (GPS coordinates) is rather low dimensional. Consequently, the input contains redundant information for the learning task. If we use a conventional neural network to solve this problem directly, the results are not satisfying. Mixture Density Networks and Bayesian Neural Networks handle this problem by introducing uncertainty into the models. The difference is that MDNs are MLE methods while BNNs are MAP methods. Surprisingly, the BNN performs worse than the MDNs on our tasks because the uncertainty of BNNs does not depend on the input data. Variational Autoencoders were originally designed as generative approaches to obtain new sample data. For our problem, we can use a VAE to learn the latent representation of the input data first. This model can then be trivially extended to a semi-supervised model by using the pre-learned representation to obtain the final output. However, in our study, we find that applying the Information Bottleneck method to this problem is a better option than the semi-VAE model. This is because, with the Information Bottleneck method, we can treat the original task as a constrained optimization problem: the optimization objective is the learning task and the constraint is on the data representation. That is to say, the Variational Information Bottleneck model directly finds the optimal representation for the learning task, whereas the semi-VAE model finds the representation that reconstructs the original inputs.
Translating WiFi fingerprints into real user locations via neural networks is a tricky problem. In this work, we combined the Information Bottleneck theory with Variational Inference to propose a novel deep learning model for WiFi fingerprint-based user location recognition. The proposed model consists of two neural networks, an encoder and a predictor. According to the Information Bottleneck theory, the encoder neural network finds an optimal representation of the input data and mitigates the negative effect of the nuisance information on the learning task. The predictor neural network uses the latent representation to compute the final output. The main advantages of the proposed model are that it scales to large datasets, is computationally stable and is robust to noisy information. To evaluate our model, we ran it and other previous models on a real-world WiFi fingerprint dataset; the final results verify the effectiveness of our method and show its advantages over the existing approaches. For future research, we plan to explore other methods in information theory and Variational Inference to improve the performance of our models or to develop other applications.
The authors would like to thank the China Scholarship Council for the financial support.
- (2016) Deep variational information bottleneck. arXiv preprint arXiv:1612.00410.
- (1994) Mixture density networks.
- (2015) Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424.
- (2015) A comparative study on machine learning algorithms for indoor positioning. In 2015 International Symposium on Innovations in Intelligent SysTems and Applications (INISTA), pp. 1–8.
- (2018) Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599.
- (2016) Clustering benefits in mobile-centric WiFi positioning in multi-floor buildings. In 2016 International Conference on Localization and GNSS (ICL-GNSS), pp. 1–6.
- (2013) Improving deep neural networks for LVCSR using rectified linear units and dropout. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8609–8613.
- (2007) WiFi-SLAM using Gaussian process latent variable models. In IJCAI, Vol. 7, pp. 2480–2485.
- (2006) Gaussian processes for signal strength-based location estimation. In Proceedings of Robotics: Science and Systems.
- (2017) Beta-VAE: learning basic visual concepts with a constrained variational framework. ICLR 2 (5), pp. 6.
- (1994) Autoencoders, minimum description length and Helmholtz free energy. In Advances in Neural Information Processing Systems, pp. 3–10.
- (2019) Recurrent neural networks for accurate RSSI indoor localization. arXiv preprint arXiv:1903.11703.
- (2018) CNN based indoor localization using RSS time-series. In 2018 IEEE Symposium on Computers and Communications (ISCC), pp. 01044–01049.
- (2018) A scalable deep neural network architecture for multi-building and multi-floor indoor localization based on Wi-Fi fingerprinting. Big Data Analytics 3 (1), pp. 4.
- (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
- (2017) Low-effort place recognition with WiFi fingerprints using deep learning. In International Conference Automation, pp. 575–584.
- (2019) Supervised and semi-supervised deep probabilistic models for indoor positioning problems. arXiv, pp. arXiv–1911.
- (2014) Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082.
- (2019) A novel convolutional neural network based indoor localization framework with WiFi fingerprinting. IEEE Access 7, pp. 110698–110709.
- (2000) The information bottleneck method. arXiv preprint physics/0004057.
- (2014) UJIIndoorLoc: a new multi-building and multi-floor database for WLAN fingerprint-based indoor localization problems. In 2014 International Conference on Indoor Positioning and Indoor Navigation (IPIN), pp. 261–270.
- (2015) Gaussian process assisted fingerprinting localization. IEEE Internet of Things Journal 3 (5), pp. 683–690.
- (2019) A WiFi positioning algorithm based on deep learning. In 2019 7th International Conference on Information, Communication and Networks (ICICN), pp. 99–104.