CDM: Compound dissimilarity measure and an application to fingerprinting-based positioning

05/16/2018 ∙ by Caifa Zhou, et al. ∙ ETH Zurich

A non-vector-based dissimilarity measure is proposed by combining vector-based distance metrics and set operations. The proposed compound dissimilarity measure (CDM) quantifies the similarity of collections of attribute/feature pairs where not all attributes are present in all collections. This is a typical challenge in the context of, e.g., fingerprinting-based positioning (FbP). Compared to vector-based distance metrics (e.g., Minkowski), the merits of the proposed CDM are that i) the data do not need to be converted to vectors of equal dimension, ii) shared and unshared attributes can be weighted differently within the assessment, and iii) additional degrees of freedom within the measure allow its properties to be adapted to application needs in a data-driven way. We indicate the validity of the proposed CDM by demonstrating the improvements of the positioning performance of fingerprinting-based WLAN indoor positioning using four different datasets, three of them publicly available. When processing these datasets using CDM instead of conventional distance metrics, the accuracy of identifying buildings and floors improves by about 5%, the positioning errors in terms of root mean squared error (RMSE) are reduced by a factor of two, and the percentage of position solutions with less than 2 m error improves by over 10%.




I Introduction

The core idea of this paper comes from analyzing a particular challenge occurring during the online step of a fingerprinting-based indoor positioning system (e.g., using the received signal strength (RSS) from WLAN access points (APs) as the features) that uses nearness in the fingerprint space as the principle for localization (e.g., kNN). Each measured fingerprint consists of a collection of actually observed attributes (e.g., identifications of APs and corresponding signal strengths). Fingerprints measured at different locations or at different times may contain different numbers of measurable APs, e.g., due to changed availability of APs or changed signal reception conditions. In such cases it is not clear how the similarity/dissimilarity between such fingerprints should be assessed. In particular, it is not clear how to handle the similarity/dissimilarity between a fingerprint measured online (i.e. when the user position is to be determined) and the fingerprints collected in the reference fingerprint map (RFM), a collection of fingerprints with labeled ground locations representing the functional relationship between location and fingerprint.

In a general context this belongs to the class of missing data problems [1], mostly addressed in the fields of data analysis, data mining and machine learning [2]. A comprehensive review of the missing data problem is beyond the scope of this paper. Instead we focus on a concrete proposal to handle this problem within the context of positioning. In previous publications the authors either formulated the online measurements into vectors of equal length [3, 4, 5] or used only the measurability of the individual attributes as binary features [6]. The former scheme requires filling in values for missing attributes and ignoring newly measured ones. In this way it is easy to apply vector-based distance metrics for computing the dissimilarity, but there are two disadvantages. One is the limited flexibility in dealing with missing or newly available data; the other is the cost in time and computational resources: in most cases the number of all APs contained in the RFM is much larger than the number of APs in an individual measured fingerprint. Therefore the vectorized data of uniform dimension, which need to be composed and handled, typically have many more elements than the individual measurements. The approach mapping the measured APs into a set of binary features, in contrast, is computationally efficient for assessing dissimilarity, but it does not take the actual similarity of the measured values into account and thus does not support exploiting the potential for accurate positioning.

After analyzing how this case is handled in previous publications, we explore the possibility of estimating the dissimilarity between measurements with partially missing observations of the attributes without formulating them into vectors of equal length. To this end we propose a non-vector-based dissimilarity measure (Section III) which is a compound of a typical distance metric (e.g., Minkowski) and set operations. In addition, we explore the applicability of the proposed compound dissimilarity measure (CDM) by applying it to four datasets used for fingerprinting-based positioning (FbP); the results demonstrate the benefits of the proposed dissimilarity measure (Section IV).

II Related work on distance metrics

The concept of distance metrics used for measuring the nearness between the online measured features and the ones stored in the RFM is key to the realization of FbP algorithms. The Euclidean distance is one of the most prevalent metrics in different research fields and communities [7]. However, there is a variety of alternative distance metrics which may be more suitable for certain applications. In [7], Cha reported over 40 distance metrics or measures and analyzed their capability of measuring the difference between probability density functions (PDFs). Minaev et al. [8] followed Cha's research, applied these metrics to the FbP algorithm kNN using synthetic RSS from WLAN APs as the fingerprints, and found that the Lorentzian distance performs best among them. In [9], Torres-Sospedra et al. surveyed and analyzed the performance of different distance metrics by applying them to a fingerprinting-based WLAN indoor positioning system which covers multiple buildings and multiple floors (i.e. the UJIIndoorLoc dataset). In this paper, we propose the concept of CDM, join it with the 8 distance metrics (see TABLE I) performing best according to [8], and apply CDM to FbP using four different datasets.

III Compound dissimilarity measure

Suppose that there are several collections of measurements which need to be compared but differ with respect to the number and type of data included. For instance, the measurements in a WLAN-based indoor positioning system consist of the RSSs from all available APs at individual locations. However, only APs associated with RSS values exceeding the measuring sensitivity of the used WiFi device are observable at the individual locations, thus the APs measured at different points in space and time will differ. Each measurement consists of an attribute, e.g., the media access control (MAC) address of the respective AP, and an RSS value. The fingerprint at a particular location is the collection of measurements actually made at that location. As stated above, we propose an approach herein to estimate the similarity or dissimilarity between such fingerprints without reformulating all the measurements with different attributes into vectors of equal length as in several other publications [4, 10, 11, 3].

Given measurements denoted as $O_i$, each measurement consists of a set of paired attributes and measured values (e.g., the RSS), i.e. $O_i = \{(a, v_a^i) \mid a \in A_i\}$, where $A_i$ is the set of the attributes of the $i$-th measurement. The initial idea of measuring the dissimilarity between $O_i$ and $O_j$ is to split them into three parts (Fig. 1), namely the attributes shared by both measurements and the attributes contained in only one of them, and to compute and weight the dissimilarity of the shared and unshared attributes differently:

$$d_{\mathrm{CDM}}(O_i, O_j) = d\left(v^i_{A_i \cap A_j},\, v^j_{A_i \cap A_j}\right) + \lambda \left[ d\left(v^i_{A_i \setminus A_j},\, \boldsymbol{\epsilon}\right) + d\left(v^j_{A_j \setminus A_i},\, \boldsymbol{\epsilon}\right) \right] \quad (1)$$

where $d(\cdot,\cdot)$ is a chosen distance metric, and $v^i_A$ collects the measured values $v^i_a$ of $O_i$ associated with the attributes $a \in A$. $\boldsymbol{\epsilon}$ is a vector of matching length filled with a predefined value $\epsilon$ indicating a missing measured attribute, and the regularization value $\lambda$ is introduced to regulate or balance the contribution to the dissimilarity from the mutually unshared attributes. Herein we regulate the contribution of both groups of unshared attributes equally because there is no prior assumption that can be used to determine which of them should have more influence on the dissimilarity. In a specific application (e.g., fingerprinting-based positioning), it might be reasonable to weight these two terms differently. The CDM offers additional degrees of freedom owing to the contribution of its hyperparameters. Their values can be determined in a data-driven approach according to the specific application.

Fig. 1: The scheme of calculating the dissimilarity measure into three parts.
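The three-part split described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: fingerprints are assumed to be dictionaries mapping an attribute (e.g., a MAC address) to an RSS value, the city block distance stands in for the chosen metric, and the names `cdm`, `lam` and `missing` are hypothetical.

```python
def city_block(x, y):
    """City block distance between two equally long sequences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def cdm(fp_i, fp_j, metric=city_block, lam=1.0, missing=-110.0):
    """Basic compound dissimilarity between two fingerprints (sketch).

    fp_i, fp_j: dicts {attribute: measured value}
    lam:        regularization value weighting the unshared attributes
    missing:    value standing in for a missing attribute (assumed -110 dBm)
    """
    shared = sorted(fp_i.keys() & fp_j.keys())
    only_i = sorted(fp_i.keys() - fp_j.keys())
    only_j = sorted(fp_j.keys() - fp_i.keys())

    # Part 1: attributes observed in both fingerprints.
    d_shared = metric([fp_i[a] for a in shared], [fp_j[a] for a in shared])
    # Parts 2 and 3: attributes observed in only one fingerprint,
    # compared against the missing-value indicator.
    d_only_i = metric([fp_i[a] for a in only_i], [missing] * len(only_i))
    d_only_j = metric([fp_j[a] for a in only_j], [missing] * len(only_j))
    return d_shared + lam * (d_only_i + d_only_j)


fp_a = {"ap1": -60.0, "ap2": -75.0}
fp_b = {"ap2": -70.0, "ap3": -90.0}
print(cdm(fp_a, fp_b, lam=0.5))  # 5 + 0.5 * (50 + 20) = 40.0
```

Neither fingerprint is padded to the full list of APs; only the attributes that actually occur in one of the two dictionaries are touched.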

The basic CDM formulated in (1) weights the contributions from the shared and unshared attributes differently. We additionally propose two variants of this measure which also take the actual numbers of these attributes into account. The application examples will later indicate that these are useful extensions. One is obtained by dividing the CDM in (1) by the total number of attributes, thus yielding the average dissimilarity of the attributes:

$$d_{\mathrm{ACDM}}(O_i, O_j) = \frac{d_{\mathrm{CDM}}(O_i, O_j)}{|A_i \cup A_j|} \quad (2)$$

where $A_i$ and $A_j$ are the attribute sets of the two measurements $O_i$ and $O_j$, and $|\cdot|$ denotes the cardinality of a set (the same symbol is also used in this contribution to indicate the absolute value of a scalar).

We call this measure an averagely weighted compound dissimilarity measure (ACDM). The second extension is obtained by weighting the terms in (1) relatively according to the number of shared and unshared attributes, i.e.


where and are calculated by


In (4), a constant is introduced to avoid division by zero in case there are no shared attributes at all. The relatively weighted compound dissimilarity measure (RCDM) introduced in (3) yields a large dissimilarity value in such a case. Compared to the widely used vector-based distance metrics, the CDMs have three advantages:

  • The measurements do not have to be rearranged into vectors of equal length.

  • CDMs can be used to balance the contributions to the dissimilarity from the shared and mutually unshared attributes.

  • CDMs have hyperparameters and are capable of adapting to different data.

Subsequently, we compare the three proposed CDMs by applying them to FbP using different values of the hyperparameters and joining them with selected distance metrics (see Section IV).
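As an illustration of the averaging variant (ACDM), the following sketch divides the basic compound dissimilarity by the number of attributes involved. It assumes that the total number of attributes is the size of the union of both attribute sets, that fingerprints are dictionaries, and that the city block distance is the chosen metric; the function and parameter names are hypothetical.

```python
def city_block(x, y):
    """City block distance between two equally long sequences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def acdm(fp_i, fp_j, metric=city_block, lam=1.0, missing=-110.0):
    """Averagely weighted compound dissimilarity (sketch, see assumptions)."""
    shared = sorted(fp_i.keys() & fp_j.keys())
    only_i = sorted(fp_i.keys() - fp_j.keys())
    only_j = sorted(fp_j.keys() - fp_i.keys())

    d_shared = metric([fp_i[a] for a in shared], [fp_j[a] for a in shared])
    d_unshared = (
        metric([fp_i[a] for a in only_i], [missing] * len(only_i))
        + metric([fp_j[a] for a in only_j], [missing] * len(only_j))
    )

    total = len(shared) + len(only_i) + len(only_j)  # size of the union
    if total == 0:
        # Two empty fingerprints: treated as identical (an assumption).
        return 0.0
    return (d_shared + lam * d_unshared) / total


fp_a = {"ap1": -60.0, "ap2": -75.0}
fp_b = {"ap2": -70.0, "ap3": -90.0}
print(acdm(fp_a, fp_b, lam=0.5))  # 40 / 3 attributes
```

Dividing by the attribute count makes the values comparable between fingerprint pairs with few and with many observed APs.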

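Metric        Equation (note 1)
Hamming       $d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} \mathbb{1}(x_i \neq y_i)$ (note 2)
Jaccard       as given in [8] (note 3)
Wave Hedges   $d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} \frac{|x_i - y_i|}{\max(x_i, y_i)}$
City block    $d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} |x_i - y_i|$ (note 4)
Note 1: The formulas and nomenclature are from [7], except the formula of the Jaccard distance. In all these equations, $x_i$ and $y_i$ are the $i$-th elements of the vectors $\mathbf{x}$ and $\mathbf{y}$, respectively, and $n$ is the dimension of $\mathbf{x}$ and $\mathbf{y}$.
Note 2: $\mathbb{1}(\cdot)$ is an indicator function; it yields 1 if and only if the condition is fulfilled.
Note 3: This formula is taken from [8]; it involves the indicator of a missing measured attribute. In case of CDM, Hamming and Jaccard distances are equivalent.
Note 4: In [12], city block is named Manhattan distance.
TABLE I: Distance metrics used herein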
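For illustration, four of the metrics used herein can be written as plain Python functions operating on two equally long sequences. The Lorentzian, city block and Minkowski forms follow the standard definitions in Cha's survey [7]; the Hamming distance counts differing elements. The function names are hypothetical, and the exponent `p` of the Minkowski distance is a free parameter here, not a value taken from the paper.

```python
import math


def lorentzian(x, y):
    """Lorentzian distance: sum of ln(1 + |x_i - y_i|)."""
    return sum(math.log(1.0 + abs(a - b)) for a, b in zip(x, y))


def hamming(x, y):
    """Hamming distance: number of positions where the elements differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


def city_block(x, y):
    """City block (Manhattan) distance: sum of |x_i - y_i|."""
    return sum(abs(a - b) for a, b in zip(x, y))


def minkowski(x, y, p=2):
    """Minkowski distance of order p (p=2 gives the Euclidean distance)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


x = [-60.0, -75.0, -110.0]
y = [-60.0, -70.0, -90.0]
for metric in (lorentzian, hamming, city_block, minkowski):
    print(metric.__name__, metric(x, y))
```

Any of these functions can be passed as the inner metric of a compound measure, since they only require two sequences of equal length.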

IV An application of CDMs to FbP

In this section, we first describe the fundamentals of FbP, the widely used positioning algorithm kNN, the chosen evaluation metrics, and the four datasets used for practical application and assessment. We then evaluate the performance of the three CDMs in terms of positioning results, taking into account only very few different values of the regularization value $\lambda$. Then the cross validation (CV) method is applied to search for particularly suitable values of $\lambda$ for a chosen distance metric and dataset. Finally, we compare the positioning performance of the approach using the CDM to that of kNN without CDM.

IV-A The baseline algorithm, performance criteria and data sets

IV-A1 Fingerprinting-based positioning

The fingerprints measured for the purpose of FbP are characterized by missing attributes because the coverage of an AP is restricted by the transmitting power, free space loss, signal attenuation and the sensitivity of the receiver. The coverage is higher for higher transmission power, higher sensitivity and lower attenuation. One benefit of using CDM instead of vector-based distance metrics is that it avoids the need for conversion of the measurements into vectors of equal length. Further experimental analysis in the following sections shows that using CDM can also improve the positioning accuracy and stability.

Generally, an FbP system (e.g., using the signal strength from WLAN APs as the fingerprints for indoor positioning) consists of two stages: an offline fingerprinting stage and an online positioning stage. During the offline stage, an RFM representing the relationship between the measurements (e.g., RSS) and locations is collected by carrying out a site survey (either by a professional surveyor or by crowd-sourcing). During the online positioning stage, an observation measured by the user is matched to the RFM using an FbP algorithm to estimate the user's location.

Herein we use kNN, one of the most widely used FbP algorithms, as the baseline positioning method for evaluation and comparison. UJIIndoorLoc (see Section IV-A3) is a dataset including multiple buildings and multiple floors. We use the hierarchical kNN according to [13] for processing this dataset. For the other datasets we use kNN as follows:

  • Computing the dissimilarity measure between the user's measurement and the ones stored in the RFM;

  • Finding the k nearest reference points in the feature space, i.e. the k reference points with the smallest dissimilarity;

  • Taking the average coordinates of these reference points as the user's location.

More details about FbP and kNN can be found, e.g., in [11, 3].
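The three kNN steps listed above can be sketched as follows. This is a minimal illustration, assuming the RFM is a list of (fingerprint, coordinates) pairs and that any dissimilarity function taking two fingerprints can be plugged in; `toy_dissimilarity` is only a placeholder for such a function, and all names are hypothetical.

```python
def toy_dissimilarity(fp_i, fp_j, missing=-110.0):
    """Placeholder dissimilarity: city block over all attributes, with
    absent attributes replaced by an assumed missing-value indicator."""
    attrs = fp_i.keys() | fp_j.keys()
    return sum(abs(fp_i.get(a, missing) - fp_j.get(a, missing)) for a in attrs)


def knn_position(query, rfm, dissimilarity, k=3):
    """Estimate a location with kNN.

    query: dict {attribute: measured value}
    rfm:   list of (fingerprint dict, coordinate tuple) reference points
    """
    # Step 1: dissimilarity between the query and every reference fingerprint.
    ranked = sorted(rfm, key=lambda ref: dissimilarity(query, ref[0]))
    # Step 2: keep the k reference points with the smallest dissimilarity.
    nearest = ranked[:k]
    # Step 3: average their coordinates.
    dims = len(nearest[0][1])
    return tuple(
        sum(ref[1][d] for ref in nearest) / len(nearest) for d in range(dims)
    )


rfm = [
    ({"ap1": -50.0, "ap2": -80.0}, (0.0, 0.0)),
    ({"ap1": -55.0, "ap2": -75.0}, (2.0, 0.0)),
    ({"ap2": -60.0, "ap3": -70.0}, (10.0, 4.0)),
]
query = {"ap1": -52.0, "ap2": -78.0}
print(knn_position(query, rfm, toy_dissimilarity, k=2))  # (1.0, 0.0)
```

Because the dissimilarity is passed in as a function, the same positioning routine can be run with a vector-based metric or with a compound measure for comparison.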

IV-A2 Evaluation of positioning performance

The Euclidean distance between the estimated location and the ground truth location is used as the basic evaluation of the positioning error. In addition, we also use statistical values (e.g., mean or standard deviation), the root mean squared error (RMSE), and the empirical cumulative distribution function (ECDF) of the error distance for further performance evaluation.
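The evaluation quantities named above can be computed as in the following sketch. It assumes locations are coordinate tuples in metres; the function names and the sample values are illustrative only.

```python
import math


def error_distances(estimates, truths):
    """Euclidean distance between each estimate and its ground truth."""
    return [math.dist(e, t) for e, t in zip(estimates, truths)]


def rmse(errors):
    """Root mean squared error of a list of error distances."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def ecdf(errors):
    """Sorted errors and, for each, the fraction of errors not exceeding it."""
    xs = sorted(errors)
    return xs, [(i + 1) / len(xs) for i in range(len(xs))]


def fraction_below(errors, threshold):
    """Share of position solutions with an error below the threshold."""
    return sum(1 for e in errors if e < threshold) / len(errors)


estimates = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
truths = [(0.0, 0.0), (0.0, 0.0), (1.0, 2.0)]
errors = error_distances(estimates, truths)
print(errors, rmse(errors), fraction_below(errors, 2.0))
```

A single large error dominates the RMSE (here the 5 m error), while the ECDF and the share of errors below a threshold describe the whole error distribution.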

The implementation of the proposed CDMs and their relevant functions are in Python and partially based on the scikit-learn package [14].

IV-A3 Available positioning datasets

We use four different datasets (both the RFMs and validation datasets) from fingerprinting-based WLAN indoor positioning systems for evaluating and comparing the performance of the proposed CDMs. The datasets Alcala2017, Tampere and UJIIndoorLoc are available online [15] and more details about them can be found in [9, 13]. Another dataset HIL is described in [11]. These four datasets represent different FbP scenarios with respect to area of indoor region, number of available APs, method of fingerprint collection, and device heterogeneity. The summarized characteristics of the datasets are given in TABLE II.

Dataset        Buildings  Floors  APs  Training samples  Validation samples (see note)
Alcala2017     1          1       152  670               0
HIL            1          1       490  1525              509
Tampere        1          4       309  1478              0
UJIIndoorLoc   3          4–5     520  3818              1110
Note: In case no validation samples are provided, we randomly split the training samples into two datasets: 75% of them are used for training and the remaining ones are used for validation.
TABLE II: The characteristics of the datasets (Tampere: herein only data from one building; UJIIndoorLoc: cleaning procedure applied, see Appendix)

When applying the proposed CDMs to FbP, we use a fixed value for indicating the missing attributes. In HIL, it is set to -110 dBm; for the other three datasets it is set according to [13].

IV-B Evaluation and comparison of different CDMs

Herein we propose three versions of CDM. We want to briefly investigate whether FbP using one of them is less sensitive with respect to the dataset and the distance metric included than FbP using the others. Since the results may also depend on the regularization value $\lambda$, we use three fixed values of $\lambda$ for the analysis and compare the results.

Fig. 2: An example of RMSE of the three CDMs for one fixed regularization value. Panels: (a) Alcala2017, (b) HIL, (c) Tampere, (d) UJIIndoorLoc.

Fig. 2 shows the RMSE of the positioning results using all three CDMs with one of the fixed values of the regularization value $\lambda$. Similar results are obtained using the other two values of $\lambda$. From Fig. 2, we can conclude that the RCDM, i.e. the CDM relatively weighted by the shared and mutually unshared attributes, performs more stably than the other two CDMs, in the sense that the RMSE obtained on all four datasets using RCDM compounded with all eight distance metrics has the smallest deviation. Therefore, we focus on the application of RCDM for the remainder of this paper.

IV-C Tuning the regularization value

In order to find suitable values of the regularization value $\lambda$ we carry out cross validation (CV) [16], a widely used method for model selection, for various distance metrics and the four datasets. CV makes full use of the training dataset by randomly splitting it into several folds and iteratively using one of them as the temporary test samples and the remaining ones as the temporary training dataset. Herein we use 10-fold CV and search for a suitable value of $\lambda$ over a fixed range with an interval of 0.1. In this paper, we only illustrate that a useful value of $\lambda$ can be found using CV, without specifically looking for the most appropriate search space of $\lambda$. The related investigation into search space and optimal value is left for future work. In the remainder of this section, for the sake of clarity, we mainly show the results of HIL and UJIIndoorLoc using kNN as the FbP algorithm with RCDM (compounded with Lorentzian and Minkowski). The results for the other datasets and distance metrics are similar to those presented herein.
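The search described above can be sketched as a plain grid search with k-fold CV. This is an illustration under assumptions, not the authors' code: `locate` is a hypothetical callback that positions one query fingerprint against a training set for a given regularization value, the candidate grid is a made-up example (the range used in the paper is not reproduced here), and the toy data only serve to make the sketch executable. It requires at least as many samples as folds.

```python
import math
import random


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Randomly split sample indices into n_folds disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[f::n_folds] for f in range(n_folds)]


def cv_rmse(samples, locate, lam, n_folds=10, seed=0):
    """Mean RMSE over the folds for one regularization value.

    samples: list of (fingerprint, true location) pairs
    locate:  callback (query fingerprint, training samples, lam) -> location
    """
    fold_rmse = []
    for fold in k_fold_indices(len(samples), n_folds, seed):
        held_out = set(fold)
        train = [s for i, s in enumerate(samples) if i not in held_out]
        sq_err = [
            math.dist(locate(samples[i][0], train, lam), samples[i][1]) ** 2
            for i in fold
        ]
        fold_rmse.append(math.sqrt(sum(sq_err) / len(sq_err)))
    return sum(fold_rmse) / len(fold_rmse)


def select_lambda(samples, locate, candidates, n_folds=10):
    """Return the candidate with the smallest mean CV RMSE and all scores."""
    scores = {lam: cv_rmse(samples, locate, lam, n_folds) for lam in candidates}
    return min(scores, key=scores.get), scores


# Toy setup: every sample lies at the origin and the fake positioning
# error grows with the distance of lam from 1.0.
toy_samples = [({"ap": -50.0}, (0.0, 0.0)) for _ in range(20)]
toy_locate = lambda query, train, lam: (abs(lam - 1.0), 0.0)
grid = [round(0.1 * step, 1) for step in range(1, 31)]  # hypothetical grid
best_lam, cv_scores = select_lambda(toy_samples, toy_locate, grid)
print(best_lam)
```

For a multi-building, multi-floor dataset the same loop could score each candidate by the mean success rate instead and pick the maximum.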

Fig. 3: CV box-plots of HIL (RCDM). Panels: (a) Lorentzian, (b) Minkowski.
Fig. 4: CV box-plots of UJIIndoorLoc (RCDM). Panels: (a) Lorentzian, (b) Minkowski.

As shown in Fig. 3 and Fig. 4, there is a value of the regularization value $\lambda$ resulting in the minimum RMSE or the maximum success rate for a chosen distance metric and dataset. For the datasets Alcala2017, HIL, and Tampere, we select the $\lambda$ which achieves the minimum average RMSE over the 10-fold CV as the suitable value for a chosen distance metric and dataset. We only plot part of the cross validation results to preserve clarity. For UJIIndoorLoc, we use the $\lambda$ which obtains the maximum average success rate (Fig. 4), defined as the percentage of samples for which both the building and the floor are correctly identified, since for FbP covering multiple buildings and multiple floors the success rate is a better indicator of the positioning performance than the RMSE [12]. One reason is that wrongly identifying either the building or the floor introduces large positioning errors, so that the RMSE is no longer a good indicator of positioning performance. From the CV results shown in Fig. 3 and Fig. 4, the suitable values of $\lambda$ are 2.7 (Lorentzian) and 3.0 (Minkowski) for HIL, and 0.5 (Lorentzian) and 0.2 (Minkowski) for UJIIndoorLoc.
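The success rate and the two selection rules can be expressed compactly. The sketch below assumes that predictions and ground truth are given as (building, floor) tuples and that CV scores are stored in a dictionary keyed by the regularization value; the numbers in the example are made up and are not results from the paper.

```python
def success_rate(predicted, truth):
    """Percentage of samples with BOTH building and floor correct."""
    hits = sum(1 for p, t in zip(predicted, truth) if p == t)
    return 100.0 * hits / len(truth)


def pick_regularization(cv_scores, maximize):
    """Pick the key with the best mean CV score.

    maximize=True  for success rate (multi-building, multi-floor data)
    maximize=False for RMSE (single-building, single-floor data)
    """
    chooser = max if maximize else min
    return chooser(cv_scores, key=cv_scores.get)


truth = [(0, 1), (0, 2), (1, 0), (2, 3)]
predicted = [(0, 1), (0, 3), (1, 0), (1, 3)]
print(success_rate(predicted, truth))  # 50.0

# Made-up mean CV scores for three candidate values.
mean_rmse = {0.5: 3.4, 1.0: 2.9, 1.5: 3.1}
mean_success = {0.5: 96.2, 1.0: 95.8, 1.5: 94.9}
print(pick_regularization(mean_rmse, maximize=False))    # 1.0
print(pick_regularization(mean_success, maximize=True))  # 0.5
```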

IV-D Comparison of positioning performance

We compare the RMSE of the positioning results obtained using RCDM (with the regularization value $\lambda$ found by CV) to those attained using vector-based distance metrics (Fig. 5). As shown in Fig. 5 and Fig. 6(b), the proposed RCDM outperforms the original distance metrics in almost all cases on all four datasets; the exceptions are compounding with Hamming and Jaccard distances on Alcala2017 (see Fig. 5(a)) and with Wave Hedges and city block distances on HIL (see Fig. 5(b)). In addition, the RMSE is reduced by more than a factor of two compared to that obtained without RCDM, and the deviation of the RMSE across all eight distance metrics is much smaller than for the original metrics. In Fig. 6, the success rate using RCDM is higher than that using the original distance metrics, with an improvement of up to 13% (Fig. 6(a)).

Metrics                                    Lor    Ham    Jac    WH     Can    Cla    CB     Min
Building accuracy (%)                w/o   99.76  96.65  99.67  99.92  99.92  99.92  99.92  99.84
                                     w     99.92  99.92  99.92  99.92  99.92  99.92  99.92  99.92
Success rate (%)                     w/o   94.12  80.98  85.55  92.9   93.47  92.73  92.57  91.1
                                     w     96.33  93.71  93.71  97.22  96.73  96.24  97.47  96.98
Median of error distance (m)         w/o   1.78   2.04   1.73   2.76   2.48   2.65   2.89   3.92
                                     w     1.44   2.07   2.07   1.48   1.53   1.64   1.38   1.47
80-percentile of error distance (m)  w/o   9.35   14.72  14.76  11.06  10.65  10.65  11.19  13.76
                                     w     8.36   11.58  11.58  8.12   8.34   8.79   7.96   8.11
TABLE III: Positioning results of UJIIndoorLoc (w: with RCDM, w/o: without RCDM)
Fig. 5: Comparison of RMSE (RCDM). Panels: (a) Alcala, (b) HIL, (c) Tampere. In the figure, w and w/o denote the results with RCDM and without RCDM, respectively.

According to the comparison of the empirical cumulative distribution function (ECDF) of all eight original distance metrics and RCDM of dataset HIL (Fig. 7), we can conclude that the cumulative positioning accuracy using RCDM is higher than that of using original distance metrics. In addition, the maximum positioning error distance using RCDM is much smaller than that without using RCDM. The FbP algorithms achieving a small maximum positioning error distance make it easier to constrain the upper bound of the positioning error.

Fig. 6: Comparison of success rate and RMSE of UJIIndoorLoc.
Fig. 7: Comparison of ECDF between different distance metrics (HIL). Panels: (a) Without RCDM, (b) With RCDM.

In TABLE III, we present the positioning results of applying the proposed approach to UJIIndoorLoc. The building accuracy, defined as the percentage of samples for which the building is correctly identified, suggests that some validation samples are not placed in the correct building by the positioning algorithm: the building accuracy of RCDM stays the same regardless of the distance metric it is compounded with, i.e. the building accuracy is saturated. This saturation might be caused by the cleaning of the dataset (see Appendix). The success rate improves by about 5% on average using the RCDM, and the percentage of correctly identifying both the building and floor exceeds 97%. In addition, the median and 80-percentile positioning error distances decrease clearly when using RCDM, except when compounding with Hamming and Jaccard distances.
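The averages quoted above can be checked directly against the success-rate rows of TABLE III. The short computation below uses only the values printed in the table; the variable names are illustrative.

```python
metrics = ["Lor", "Ham", "Jac", "WH", "Can", "Cla", "CB", "Min"]
success_without = [94.12, 80.98, 85.55, 92.9, 93.47, 92.73, 92.57, 91.1]
success_with = [96.33, 93.71, 93.71, 97.22, 96.73, 96.24, 97.47, 96.98]

# Gain in success rate (percentage points) per distance metric.
gains = [w - wo for w, wo in zip(success_with, success_without)]
mean_gain = sum(gains) / len(gains)
largest = max(range(len(gains)), key=gains.__getitem__)

print(f"mean gain: {mean_gain:.2f} points")
print(f"largest gain: {gains[largest]:.2f} points ({metrics[largest]})")
print(f"best success rate with RCDM: {max(success_with):.2f}%")
```

The mean gain is about 5.6 percentage points, the largest gain of about 12.7 points occurs for the Hamming distance, and the best success rate with RCDM is 97.47% (city block).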

V Conclusion

We propose a non-vector-based dissimilarity measure, named compound dissimilarity measure (CDM), by combining a typical distance metric with set operations for the purpose of measuring the dissimilarity between measurements despite the possibility of missing attributes. The proposed CDM is flexible because it includes hyperparameters, which can be tuned according to the data and the needs of the application. We apply the proposed CDM to four datasets collected in fingerprinting-based WLAN indoor positioning systems, and the positioning performance supports its validity. Both the accuracy of identifying buildings and floors and the accuracy of the estimated locations improve clearly, by over 5% and 10%, respectively. Although the CDM is proposed herein starting from the idea of handling missing data in fingerprinting-based positioning, it is applicable to other missing data problems as well (e.g., searching the correspondences of point clouds according to sparsely described local features).


Acknowledgment

The China Scholarship Council (CSC) financially supports the first author's doctoral research.


Appendix

We clean the UJIIndoorLoc dataset with respect to two aspects: i) the invalid samples, and ii) the replicas in the training dataset. An invalid sample is a measurement in which all APs are filled with missing values. Replicas are two or more measurements taken at the same location by the same user using the same device within a short time range (e.g., less than 5 minutes).

  • Invalid samples: We find that 76 out of 20013 samples in the training dataset are invalid measurements by checking whether the RSS of all APs of a measurement is indicated by a missing value (e.g., 100 used in UJIIndoorLoc). We thus delete them from the training dataset.

  • Replicas: These replicas are highly correlated and might cause the cross validation using the training dataset to fail, because it is easy to get over-optimistic results when a dataset containing replicas is used for cross validation [16]. This makes the parameter found by cross validation not applicable to another test dataset. We find that there are many replicas in the training dataset (only 3818 out of 19937 reference measurements do not have a replica). We thus randomly sample one of those replicas as the reference fingerprint for the training dataset in our experimental analysis. (Herein we use only one of the replicas as the reference fingerprint; however, grouping or averaging those replicas into one reference fingerprint would also be useful.)
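The two cleaning steps can be sketched as follows. This is one possible reading of the procedure, not the authors' code: each sample is assumed to be a dictionary with the hypothetical fields `rss`, `location`, `user`, `device` and `timestamp` (in seconds), the missing-value indicator is 100 as in UJIIndoorLoc, and measurements by the same user and device at the same location are treated as replicas of each other while consecutive timestamps are less than five minutes apart.

```python
import random
from collections import defaultdict

MISSING = 100      # missing-value indicator used in UJIIndoorLoc
MAX_GAP_S = 300    # "short time range", assumed to be 5 minutes


def drop_invalid(samples):
    """Remove samples in which every AP is marked as missing."""
    return [s for s in samples if any(v != MISSING for v in s["rss"])]


def drop_replicas(samples, seed=0):
    """Keep one randomly chosen sample per group of replicas."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for s in samples:
        groups[(s["location"], s["user"], s["device"])].append(s)

    kept = []
    for group in groups.values():
        group.sort(key=lambda s: s["timestamp"])
        burst = [group[0]]
        for s in group[1:]:
            if s["timestamp"] - burst[-1]["timestamp"] < MAX_GAP_S:
                burst.append(s)                 # still the same burst
            else:
                kept.append(rng.choice(burst))  # close the burst
                burst = [s]
        kept.append(rng.choice(burst))
    return kept


def make(rss, loc, user, t):
    return {"rss": rss, "location": loc, "user": user, "device": 1, "timestamp": t}


raw = [
    make([100, 100], (0, 0), 1, 0),     # invalid: all APs missing
    make([-60, 100], (0, 0), 1, 10),    # replica burst ...
    make([-61, 100], (0, 0), 1, 20),    # ... of the same point
    make([-62, 100], (0, 0), 1, 1000),  # same point, much later
    make([-70, -80], (1, 1), 2, 15),
]
cleaned = drop_replicas(drop_invalid(raw))
print(len(cleaned))  # 3
```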


References

  • [1] B. Efron, "Missing data, imputation, and the bootstrap," Journal of the American Statistical Association, vol. 89, no. 426, pp. 463–475, 1994.
  • [2] R. J. Little, Missing Data/Imputation. American Cancer Society, 2015, pp. 1–5.
  • [3] P. Bahl and V. N. Padmanabhan, "RADAR: An in-building RF based user location and tracking system," in Proceedings IEEE INFOCOM 2000. Conference on Computer Communications. Nineteenth Annual Joint Conference of the IEEE Computer and Communications Societies (Cat. No.00CH37064), vol. 2, pp. 775–784, 2000.
  • [4] S. He and S.-H. G. Chan, "Wi-Fi fingerprint-based indoor positioning: Recent advances and comparisons," IEEE Communications Surveys & Tutorials, vol. 18, no. 1, pp. 466–490, 2016.
  • [5] S. He, W. Lin, and S.-H. G. Chan, "Indoor localization and automatic fingerprint update with altered AP signals," IEEE Transactions on Mobile Computing, vol. 16, no. 7, pp. 1897–1910, July 2017.
  • [6] J. Machaj, P. Brida, and R. Piché, "Rank based fingerprinting algorithm for indoor positioning," in 2011 International Conference on Indoor Positioning and Indoor Navigation, Sept 2011, pp. 1–6.
  • [7] S.-H. Cha, "Comprehensive survey on distance/similarity measures between probability density functions," International Journal of Mathematical Models and Methods in Applied Sciences, vol. 1, no. 4, pp. 300–307, 2007.
  • [8] G. Minaev, A. Visa, and R. Piché, "Comprehensive survey of similarity measures for ranked based location fingerprinting algorithm," in 2017 International Conference on Indoor Positioning and Indoor Navigation (IPIN), Sept 2017, pp. 1–4.
  • [9] J. Torres-Sospedra, R. Montoliu, S. Trilles, Ó. Belmonte, and J. Huerta, "Comprehensive analysis of distance and similarity measures for Wi-Fi fingerprinting indoor positioning systems," Expert Systems with Applications, vol. 42, no. 23, pp. 9263–9278, 2015.
  • [10] R. Mautz and S. Tilch, "Survey of optical indoor positioning systems," in Indoor Positioning and Indoor Navigation (IPIN), 2011 International Conference on. IEEE, 2011, pp. 1–7.
  • [11] C. Zhou and A. Wieser, "Jaccard analysis and LASSO-based feature selection for location fingerprinting with limited computational complexity." Cham: Springer International Publishing, 2018, pp. 71–87.
  • [12] N. Marques, F. Meneses, and A. Moreira, "Combining similarity functions and majority rules for multi-building, multi-floor, WiFi positioning," in 2012 International Conference on Indoor Positioning and Indoor Navigation (IPIN), Nov 2012, pp. 1–9.
  • [13] J. Torres-Sospedra, R. Montoliu, A. Martínez-Usó, J. P. Avariento, T. J. Arnau, M. Benedito-Bordonau, and J. Huerta, "UJIIndoorLoc: A new multi-building and multi-floor database for WLAN fingerprint-based indoor localization problems," in 2014 International Conference on Indoor Positioning and Indoor Navigation (IPIN), Oct 2014, pp. 261–270.
  • [14] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
  • [15] E. Sansano, R. Montoliu, O. Belmonte, and J. Torres-Sospedra, "UJI indoor positioning and navigation repository," 2016.
  • [16] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2006.