Zero-Shot Multi-View Indoor Localization via Graph Location Networks

08/06/2020 ∙ by Meng-Jiun Chiou, et al. ∙ National University of Singapore

Indoor localization is a fundamental problem in location-based applications. Current approaches to this problem typically rely on Radio Frequency technology, which requires not only supporting infrastructure but also human effort to measure and calibrate the signal. Moreover, existing methods require data collection for all locations, which hinders their large-scale deployment. In this paper, we propose a novel neural network-based architecture, Graph Location Networks (GLN), to perform infrastructure-free, multi-view image-based indoor localization. GLN makes location predictions based on robust location representations extracted from images through message-passing networks. Furthermore, we introduce a novel zero-shot indoor localization setting and tackle it by extending the proposed GLN to a dedicated zero-shot version, which exploits a novel mechanism, Map2Vec, to train location-aware embeddings and make predictions on novel unseen locations. Our extensive experiments show that the proposed approach outperforms state-of-the-art methods in the standard setting, and achieves promising accuracy even in the zero-shot setting where data for half of the locations is not available. The source code and datasets are publicly available at https://github.com/coldmanck/zero-shot-indoor-localization-release.


1. Introduction

Figure 1. An illustration of a multi-view image-based indoor localization system. Current location of a user is predicted with the query images (photos of the user’s surroundings).

Indoor localization seeks to localize a user or a device in an indoor environment. Accurate indoor localization systems could enable various applications, e.g., guiding users in an underground parking lot to find a space, or in a large airport to get to the right boarding gate on time (Snoonian, 2003); however, it remains an open challenge. While the Global Positioning System (GPS) has been widely adopted to precisely localize a device in an outdoor environment with a 1- to 5-meter localization accuracy (Kárník and Streit, 2016), it cannot simply be applied indoors since the GPS signal is significantly weakened after passing through roofs and walls. Researchers have explored other techniques such as WiFi (Vasisht et al., 2016; Shu et al., 2015), radio frequency identification (RFID) (Holm, 2009), optics (Liu et al., 2010), acoustics (Kim and Choi, 2008) and magnetism (Shu et al., 2015). However, most of the existing approaches require additional infrastructure, such as WiFi access points, RF transmitters, or specially-designed optic/acoustic receivers. Moreover, manual periodic re-calibration is indispensable for RF-based methods since the signals are prone to fluctuation, which in turn makes such systems harder to maintain.

Purely image-based approaches (Liang et al., 2013; Gao et al., 2016; Ravi et al., 2006) are proposed to alleviate part of the deployment costs by utilizing only indoor images. However, most of the existing methods still require special devices to collect data (Liang et al., 2013) and utilize fully supervised models to make predictions on each location (Gao et al., 2016; Ravi et al., 2006). These limitations cause their approaches to be not only time-consuming but also labor-intensive when deployed in large-scale indoor environments. In this context, an interesting and fundamental question arises: is it possible to infer the location of the user while data is collected for only several locations? To answer this question, we will consider zero-shot learning where models recognize novel locations by transferring knowledge learned from seen to unseen classes.

Multi-view images are photos of different views at indoor locations and comprise rich location contexts. To leverage this information, a multi-view image- and geomagnetism-based localization strategy is proposed in (Liu et al., 2017) to transform the problem of indoor localization into a graph retrieval problem. However, they treat different camera views as of equal importance, which is usually not true especially when some views are similar and others are more representative. For example, for two neighboring locations (within one meter) at the same corridor, the views that parallel the corridor are extremely similar while the perpendicular views may consist of more identifiable objects (e.g., different doors and windows). Treating them equally undermines the representativeness of their graph features during retrieval. Moreover, the geomagnetic signals which they rely on are unstable, hard to collect and prone to change with time.

Recently, deep learning (LeCun et al., 2015) has achieved remarkable success in various areas, including but not limited to image-level understanding (Simonyan and Zisserman, 2015; He et al., 2016), object-level detection (Ren et al., 2015; Redmon et al., 2016), human-level estimation (Toshev and Szegedy, 2014; Ruan et al., 2019), text classification (Bahdanau et al., 2015; Joulin et al., 2017) and audio understanding (van den Oord et al., 2016; Yin et al., 2019). To overcome the aforementioned problems, we exploit the strong representation power of neural networks and propose an infrastructure free, neural network-based architecture Graph Location Networks (GLN) to perform multi-view indoor localization. Given photos of different views, GLN computes robust node representations by aggregating and updating features from neighboring nodes, in which identifiable features permeate the whole graph. A location prediction is made by feeding the representations to a single fully-connected layer. Our proposed approach requires neither any infrastructure nor special devices but only a camera phone to collect the photo database. In addition to evaluating on the publicly available multi-view indoor dataset (i.e., ICUBE (Liu et al., 2017)), we provide a benchmark dataset WCP that has been collected in a shopping center. We show that our approach outperforms the baseline and existing methods in terms of localization accuracy by a large margin.

Furthermore, to motivate research on reducing data collection labor costs, we introduce a novel task named zero-shot indoor localization, in which half of the locations are masked during training while the system is still required to predict the precise user location. We propose a three-step framework to tackle this task and demonstrate its effectiveness by extending our GLN to a dedicated zero-shot version. Specifically, to transfer knowledge from the seen locations to the unseen ones, we propose the Map2Vec mechanism, which trains location-aware embeddings for both seen and unseen classes by incorporating their geometric context in the floor plan. These embeddings are then leveraged to train a compatibility function that maps image-class pairs to scalar scores. Finally, a prediction is made by picking the class that maximizes the score function for the query image. We demonstrate that, trained through the proposed framework, our model not only surpasses the baseline by a large margin but also achieves promising localization accuracy, e.g., 56.3% 5-meter accuracy on the ICUBE dataset, even though the query locations are never seen during training. To the best of our knowledge, our work is the first exploration of zero-shot recognition for indoor localization.

The key contributions of our work are summarized as follows: (a) We propose a novel neural network-based architecture, Graph Location Networks, which performs effective, infrastructure-free multi-view indoor localization. (b) We introduce zero-shot indoor localization and propose a training framework to tackle it. We demonstrate its effectiveness by extending our proposed architecture to a dedicated zero-shot version. (c) We contribute an additional multi-view image-based indoor localization dataset. Our extensive experiments show that the proposed approach significantly outperforms state-of-the-art methods in the fully supervised setting and achieves competitive localization accuracy in the zero-shot setting.

2. Related Work

2.1. Indoor Localization

Indoor localization has been a popular topic ever since outdoor localization was mostly tackled (Kárník and Streit, 2016). Most of the previous efforts rely on RF technology, which requires additional transmitters/receivers to estimate the location (Vasisht et al., 2016; Holm, 2009; Liu et al., 2010; Kim and Choi, 2008; Shu et al., 2015) or special devices to collect data (Chung et al., 2011), making large-scale deployment costly and prohibitive. More recently and related to our work, a multi-view image- and geomagnetism-based method has been proposed to formulate indoor localization as a graph retrieval problem (Liu et al., 2017); however, it does not consider the difference between views and thus fails to capture a robust representation. While purely image-based techniques do not require additional facilities, they either need special devices to collect data in advance (Liang et al., 2013) or require a user to take photos with specific reference objects (Gao et al., 2016). Therefore, neither is an ideal way to achieve a pervasive indoor positioning system. In addition, all existing image-based approaches must collect data for every location of interest, resulting in costly deployment for large-scale indoor environments. In our work, to implement a truly infrastructure-free indoor localization system, we adopt a purely image-based approach which does not require any special device, only a camera phone, to construct an image database. Furthermore, we introduce zero-shot indoor localization to reduce data collection labor costs. Note that while our method is closely related to outdoor place recognition (Arandjelovic et al., 2016; Lowry et al., 2015; Sünderhauf et al., 2015) and could incorporate corresponding techniques (e.g., the NetVLAD layer (Arandjelovic et al., 2016), contrastive loss (Bell and Bala, 2015) or the 2D-3D hybrid method (Sarlin et al., 2018)) to improve the performance, we focus on the graph-based location network with the zero-shot setting in this work and leave these as possible extensions for future work.

2.2. Graph-based Methods

2.2.1. Graph analysis

Graphs have garnered a lot of attention from researchers because they are well suited to representing data in various real-life applications (Zhou et al., 2018), including protein-protein interaction (Fout et al., 2017), social relationship networks (Hamilton et al., 2017), natural science (Sanchez-Gonzalez et al., 2018; Battaglia et al., 2016), and knowledge graphs (Hamaguchi et al., 2017). The typical problems that graph analysis deals with include node classification, link prediction and clustering. Graph Neural Networks (GNNs) (Kipf and Welling, 2017; Velickovic et al., 2018; Battaglia et al., 2018) have become the de facto standard for processing graph-based data for their ability to work on large-scale graphs by borrowing the ideas of weight-sharing and local connections from Convolutional Neural Networks (CNNs) (Zhou et al., 2018).

2.2.2. Graph embedding

Nodes in a graph can be represented as feature vectors by incorporating the information of the graph topology and initial node feature (Goyal and Ferrara, 2018). In our work, we leverage GNNs in two scenarios: (a) to perform message passing on a locally-connected location graph for a more robust representation, and (b) to train location embeddings to encode position information for all locations to perform zero-shot recognition.

2.3. Zero-Shot Learning

Unlike traditional supervised learning, zero-shot learning aims to recognize the instance classes that have never been seen by the model during training (Xian et al., 2019). There has been an increasing interest in zero-shot learning and its applications (Lampert et al., 2014; Romera-Paredes and Torr, 2015; Wang et al., 2019; Xian et al., 2017) since it is not unusual that data is only available for some classes. To transfer knowledge to unseen classes, a compatibility function is learned to relate semantic attributes to features (Akata et al., 2015; Sumbul et al., 2018). Specifically, in our work, we learn a compatibility function which maps image features to the location embeddings, and the predicted location is chosen as the one maximizing the compatibility score. Following the definition in (Xian et al., 2019), we perform generalized zero-shot learning since our search space contains both training and test classes during testing.

3. Methodology

In this section, we first formulate the indoor localization problem under the fully supervised setting, and then introduce our proposed method, Graph Location Networks (GLN), which serves as the backbone under both settings. We then demonstrate how to extend our approach to a dedicated zero-shot version to perform indoor positioning on locations of unseen classes.

3.1. Problem Formulation

We formulate the image-based indoor localization problem as follows. Given that $x \in \mathcal{X}$ denotes a set of images and $\mathcal{X}$ the space of all sets of images, we are to predict the location $y \in \mathcal{Y}$ for $x$, where $\mathcal{Y}$ is the set of all locations. The goal is to learn a function $f: \mathcal{X} \rightarrow \mathcal{Y}$ that maps the input $x$ to the target class $y$. In our setting, $x$ comprises images of four different directions, i.e., images of the front, behind, right and left at a location.

3.2. Standard Graph Location Networks

Figure 2. The architecture of our Graph Location Networks (GLN) based indoor localization system. First, features of the front, behind, right and left views extracted through CNNs are taken as input by a multi-view quadrilateral graph. An attentional message-passing algorithm is performed on the graph to extract robust location representation, which is then passed into a fully-connected layer followed by a softmax function to make prediction.

The main idea of our proposed approach is that different views of a location possess distinct information that can be used to form a holistic representation. To take advantage of this, we formulate a locally-connected graph in which features are refined during message passing, finally producing a robust location representation for classification. We define Graph Location Networks (GLN) as an indoor localization approach comprising three major modules: a feature extraction module, a message passing module that exploits the aforementioned graph, and a location prediction module that makes the final location prediction. We explain each module in detail in the following subsections. Figure 2 shows an overview of GLN.

3.2.1. Feature Extraction Module.

Given a set of images $x$ of the front, behind, right and left views at a specific location, we utilize a Convolutional Neural Network (CNN) based backbone that takes $x$ as input to extract high-dimensional features $h \in \mathbb{R}^{4 \times d}$, where $d$ is the feature dimension. The choice of the backbone network and the hyperparameters of the model are given in Section 4.1.

3.2.2. Message Passing Module.

We define a quadrilateral graph for the four views by $G = (V, E)$ with $V = \{v_1, v_2, v_3, v_4\}$ and $E = \{e_{12}, e_{23}, e_{34}, e_{41}\}$, where $e_{ij}$ denotes an undirected edge between nodes $v_i$ and $v_j$. Node $v_i$ represents a specific direction and its hidden state $h_i^{(0)}$ is initialized with the extracted feature of that direction. To obtain a robust location representation, our system has to effectively exploit and combine neighboring features. Graph Neural Networks (GNNs) have been shown to be able to aggregate information from neighboring nodes and update a node's hidden state accordingly (Scarselli et al., 2009; Kipf and Welling, 2017). We employ GNNs to pass messages within the graph and refine the hidden states of the nodes. Let $h_i^{(l)}$ and $h_i^{(l+1)}$ be the hidden states of node $v_i$ at layers $l$ and $l+1$; the updating procedure of the hidden state of node $v_i$ is defined as follows:

$$h_i^{(l+1)} = \sigma\Big(\sum_{j \in \mathcal{N}_i} \frac{1}{c_{ij}} W^{(l)} h_j^{(l)}\Big) \qquad (1)$$

where $\mathcal{N}_i$ denotes the set of neighboring nodes of node $v_i$, $\sigma$ is a nonlinear activation function, $c_{ij}$ is a normalization constant and $W^{(l)}$ represents a shared weight matrix for node-wise feature transformation at layer $l$.
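As a minimal sketch of the update in Eq. (1), the snippet below runs one round of message passing on the four-view quadrilateral graph in plain Python. The node order (front, right, behind, left), the toy 2-dimensional features, the identity weight matrix, the use of the neighbor count as the normalization constant, and the helper names `NEIGHBORS`, `matvec` and `gcn_layer` are all illustrative assumptions rather than the authors' implementation.

```python
# Four views as nodes 0..3 (assumed order: front, right, behind, left).
# Each view is linked to its two adjacent views, forming a 4-cycle.
NEIGHBORS = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}


def matvec(W, h):
    """Multiply matrix W (a list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def relu(v):
    """Element-wise ReLU nonlinearity."""
    return [max(0.0, x) for x in v]


def gcn_layer(hidden, W):
    """One message-passing step: every node averages the transformed
    hidden states of its neighbors, then applies the nonlinearity."""
    updated = []
    for i in range(len(hidden)):
        msgs = [matvec(W, hidden[j]) for j in NEIGHBORS[i]]
        c = len(msgs)  # assumed normalization constant: neighbor count
        agg = [sum(m[k] for m in msgs) / c for k in range(len(msgs[0]))]
        updated.append(relu(agg))
    return updated


# Toy view features and an identity weight matrix (assumptions).
h0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
h1 = gcn_layer(h0, W)
```

After one step, each view's state is a mixture of its two adjacent views, which is how a distinctive feature in one view can spread to the others.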

However, the neighboring nodes $v_j$ (i.e., $j \in \mathcal{N}_i$, the neighbors of $v_i$) should not all have an equal effect on node $v_i$, e.g., some neighbors may share more overlapping scenes than others. The attention mechanism (Vaswani et al., 2017; Velickovic et al., 2018) has been demonstrated to be effective at capturing relational representations. We introduce a graph self-attention mechanism to assign different weights to each neighbor according to its importance to node $v_i$. Specifically, we update the hidden state $h_i^{(l)}$ at layer $l$ with the weight $\alpha_{ij}$, which is defined as follows:

$$\alpha_{ij} = \frac{\exp\big(a(W^{(l)} h_i^{(l)} \,\|\, W^{(l)} h_j^{(l)})\big)}{\sum_{k \in \mathcal{N}_i} \exp\big(a(W^{(l)} h_i^{(l)} \,\|\, W^{(l)} h_k^{(l)})\big)} \qquad (2)$$

where $\|$ denotes the concatenation operation, and $a$ is a shared attention mechanism that computes the importance of node $v_j$'s feature to node $v_i$.
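The attention weights of Eq. (2) can be illustrated with a small softmax over per-neighbor importance scores. This is a sketch under stated assumptions: the scoring function is modeled as a single linear layer over the concatenated features (Section 4.1 mentions a single FC layer), and the toy features, the parameter vector `a`, and the names `score` and `attention_weights` are hypothetical.

```python
import math


def score(a, wh_i, wh_j):
    """Hypothetical attention mechanism: one linear layer applied to the
    concatenation of two already-transformed node features."""
    concat = wh_i + wh_j  # list concatenation
    return sum(w * x for w, x in zip(a, concat))


def attention_weights(scores):
    """Softmax over the raw importance scores of one node's neighbors."""
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {j: math.exp(s - m) for j, s in scores.items()}
    total = sum(exps.values())
    return {j: e / total for j, e in exps.items()}


# Toy transformed features for node 0 and its neighbors 1 and 3.
wh = {0: [1.0, 0.0], 1: [0.0, 2.0], 3: [0.0, 0.0]}
a = [0.5, 0.5, 0.5, 0.5]  # toy attention parameters
raw = {j: score(a, wh[0], wh[j]) for j in (1, 3)}
alpha = attention_weights(raw)
```

The weights sum to one over the neighbors, so they can stand in for a fixed normalization constant when aggregating messages.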

3.2.3. Location Prediction Module.

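After $L$ layers of message passing, the final hidden states $h_i^{(L)}$ are fused to form a single robust representation as follows:

$$z = h_1^{(L)} \,\|\, h_2^{(L)} \,\|\, h_3^{(L)} \,\|\, h_4^{(L)} \qquad (3)$$

$z$ is then passed into a single fully-connected layer with weight matrix $U$, mapping the concatenated feature vector into the location space, followed by a Softmax function to generate a probability distribution over all classes: $p = \mathrm{softmax}(U z)$. We adopt the Softmax loss as the objective function as follows:

$$\mathcal{L} = -\sum_{i} \log \frac{\exp(U_{y_i} z_i)}{\sum_{j} \exp(U_j z_i)} \qquad (4)$$

where $z_i$ is the fused representation for location $i$, $U_j$ is the $j$-th row of $U$ and $y_i$ denotes the ground truth label for location $i$.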
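The prediction step (concatenate the four final hidden states, apply one fully-connected layer, take a softmax, and score it with the negative log-likelihood) can be sketched as follows. The 1-dimensional toy states, the three-class weight matrix `U`, the omission of a bias term, and all function names are assumptions made for illustration only.

```python
import math


def fuse(hidden_states):
    """Concatenate the final hidden states of the four view nodes."""
    return [x for h in hidden_states for x in h]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(U, z):
    """Single fully-connected layer (one row of U per class), then softmax."""
    logits = [sum(u * x for u, x in zip(row, z)) for row in U]
    return softmax(logits)


def softmax_loss(U, z, label):
    """Negative log-probability assigned to the ground-truth location."""
    return -math.log(predict_proba(U, z)[label])


hidden = [[1.0], [0.0], [0.0], [1.0]]  # toy 1-d states for the 4 views
z = fuse(hidden)
U = [[1.0, 0.0, 0.0, 1.0],  # toy weights for 3 candidate locations
     [0.0, 1.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]
probs = predict_proba(U, z)
loss = softmax_loss(U, z, label=0)
```

Summing `softmax_loss` over the training samples gives a loss of the form described above.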

3.3. Zero-Shot Graph Location Networks

Figure 3. The key steps for training a zero-shot indoor localization model. (a) Train Map2Vec location embeddings for a given map (floor plan). (b) Learn a compatibility function with only the seen classes (circles without multiplication sign). (c) Perform zero-shot prediction by assigning an input to the location that maximizes the compatibility function. Dotted lines mean that the edge information is not available during testing. The "GLNs" block can be replaced with any indoor localization model.

In this section, we describe the proposed learning framework that enables indoor localization models to perform zero-shot prediction. We use our proposed GLN as the backbone model.

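In the zero-shot setting, $\mathcal{Y}$ is divided into two disjoint sets: $\mathcal{Y}^s$ denotes the set of seen classes, and $\mathcal{Y}^u$ represents the set of unseen classes, where $\mathcal{Y}^s \cup \mathcal{Y}^u = \mathcal{Y}$ and $\mathcal{Y}^s \cap \mathcal{Y}^u = \emptyset$. Note that we assign $\mathcal{Y}^s$ and $\mathcal{Y}^u$ alternately (every other location) on the map. Refer to Figure 3 for an illustration.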

For zero-shot indoor localization, the goal is to enable the system to recognize photos of unseen classes $\mathcal{Y}^u$ through training only on photos of seen classes $\mathcal{Y}^s$. Traditional supervised learning methods cannot train a model to recognize unseen classes without having seen them before. Instead, we leverage the information that is available to both groups (i.e., floor plans) to bridge them together. There are three key steps to perform zero-shot indoor localization: (a) training the Map2Vec location embeddings, (b) learning a compatibility function with the embeddings and GLN, and (c) performing zero-shot recognition.

3.3.1. Map2Vec Location Embedding.

To overcome the aforementioned problem, we propose the Map2Vec mechanism to learn location-aware graph embeddings that correlate the seen and unseen classes. Figure 3(a) shows an illustration of this procedure. For a given map (floor plan) with $|\mathcal{Y}|$ locations, we define a graph $G_M = (V_M, E_M)$ where each vertex represents a location and each edge is a path between locations. Similar to GLN in the standard setting, we adopt Graph Neural Networks as in Eq. 1 to train graph structure-aware node embeddings, and initialize the hidden state $h_i^{(0)}$ using the coordinate $p_i$ of location $i$. After $L$ layers of message passing, we extract the final hidden state as the location embedding for class $i$: $\phi(i) = h_i^{(L)}$.
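To make the idea concrete, the sketch below builds a map graph for a hypothetical five-location corridor, initializes every node with its coordinate, and runs one mean-aggregation step. It is not the authors' Map2Vec training procedure: the floor plan, the self-loops, the identity weights (standing in for learned ones) and the helper names `build_neighbors` and `propagate` are all assumptions.

```python
# Hypothetical floor plan: five locations along a corridor, 1 m apart.
COORDS = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 0.0),
          3: (3.0, 0.0), 4: (4.0, 0.0)}
PATHS = [(0, 1), (1, 2), (2, 3), (3, 4)]  # walkable edges between locations


def build_neighbors(nodes, edges):
    """Adjacency lists for an undirected map graph."""
    nbrs = {n: [n] for n in nodes}  # self-loops: an assumption of this sketch
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    return nbrs


def propagate(hidden, nbrs):
    """One mean-aggregation step (identity weights, no nonlinearity)."""
    out = {}
    for n, ns in nbrs.items():
        dim = len(hidden[n])
        out[n] = tuple(sum(hidden[m][k] for m in ns) / len(ns)
                       for k in range(dim))
    return out


neighbors = build_neighbors(COORDS, PATHS)
# Hidden states start as the raw coordinates of each location.
embeddings = propagate(dict(COORDS), neighbors)
```

Because only the map is needed, every location receives an embedding, including those for which no photos are available, and each embedding reflects where the location sits relative to its neighbors.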

3.3.2. Compatibility Function.

To take advantage of the learned location embeddings, we aim to transfer indoor localization knowledge from seen to unseen classes. To carry out this knowledge transfer, we utilize a compatibility function, which is a mapping from an image-class pair to a scalar score for the specific class. Figure 3(b) shows an illustration of learning the compatibility function. Since only the samples from seen classes are used for learning the compatibility function, it should be in a class-agnostic form. We follow (Sumbul et al., 2018) and define the compatibility function $F$ in a bilinear form as follows:

$$F(x, c) = \theta(x)^\top W \phi(c) \qquad (5)$$

where $\theta(x)$ is the image representation of an image $x$ from seen classes, $\phi(c)$ is the location embedding of a seen class $c$ and $W$ is the weight matrix that we are actually learning. In this context, $W$ is in fact our GLN, which takes in the $d$-dimensional image representation and outputs logits whose dimension matches that of the location embeddings. Similar to standard indoor localization, we adopt the cross-entropy loss as the objective function.
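A bilinear compatibility score of the form in Eq. (5) reduces to a projection of the image representation followed by a dot product with the class embedding. The sketch below uses a plain weight matrix in place of the GLN; the dimensions, values and the name `compatibility` are hypothetical.

```python
def compatibility(theta_x, W, phi_c):
    """Bilinear score theta(x)^T W phi(c) for one image-class pair."""
    # Project the image representation into the embedding space.
    projected = [sum(theta_x[i] * W[i][j] for i in range(len(theta_x)))
                 for j in range(len(phi_c))]
    # Dot product with the location embedding gives a scalar score.
    return sum(p * e for p, e in zip(projected, phi_c))


theta_x = [1.0, 2.0, 0.0]                  # toy 3-d image representation
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy 3x2 learnable weights
phi_a = [1.0, 0.0]                         # toy embedding of location "a"
phi_b = [0.0, 1.0]                         # toy embedding of location "b"
score_a = compatibility(theta_x, W, phi_a)
score_b = compatibility(theta_x, W, phi_b)
```

Since the class enters only through its embedding, the same learned weights can score a class that never appeared during training, which is what makes the form class-agnostic.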

3.3.3. Zero-Shot Recognition.

Once the compatibility function is learned, we can utilize it to make predictions on unseen classes. Refer to Figure 3(c) for an illustration. Zero-shot indoor localization is achieved by assigning the query image $x$ a location class $\hat{c}$ that maximizes $F$:

$$\hat{c} = \arg\max_{c \in \mathcal{Y}} F(x, c) \qquad (6)$$

Unlike (Sumbul et al., 2018), we predict over all possible locations (both seen and unseen) instead of only over the unseen classes, to simulate real-world scenarios.
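The prediction rule can be sketched as an argmax of the bilinear score over a candidate set that contains both seen and unseen locations. The embeddings, weights, query vector and the name `predict_location` below are toy assumptions.

```python
def predict_location(theta_x, W, embeddings):
    """Return the class whose embedding maximizes the bilinear score."""
    def score(phi):
        proj = [sum(t * W[i][j] for i, t in enumerate(theta_x))
                for j in range(len(phi))]
        return sum(p * e for p, e in zip(proj, phi))
    return max(embeddings, key=lambda c: score(embeddings[c]))


# Hypothetical embeddings for seen ("s*") and unseen ("u*") locations.
embeddings = {"s1": [1.0, 0.0], "u1": [0.6, 0.8], "s2": [0.0, 1.0]}
W = [[1.0, 0.0], [0.0, 1.0]]  # toy weights (identity)
query = [0.5, 0.9]            # toy image representation of the query
predicted = predict_location(query, W, embeddings)
```

Keeping the seen classes in the candidate set makes the task harder than searching only among unseen classes, since the model can also be drawn toward locations it was trained on.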

4. Experiments

In this section, we conduct extensive experiments to evaluate the proposed method. Towards this aim, we first explain the implementation details, evaluation datasets and metrics. We then compare our GLN based indoor localization systems with existing models under both standard and zero-shot settings.

4.1. Implementation Details

We implement our model with the PyTorch (Paszke et al., 2017) framework and train on a single NVIDIA Titan X. To extract image representations, we adopt ResNet-152 (He et al., 2016) and utilize off-the-shelf weights from the torchvision package of PyTorch. We employ data augmentation that randomly flips, rotates by 10 degrees and resizes the images, followed by randomly cropping a patch. The output of the CNN is a 2,048-dimensional feature for each image ($d = 2{,}048$). For both our standard GLN and zero-shot GLN, we adopt Graph Convolutional Networks (Kipf and Welling, 2017) as the backbone of the message-passing process and the attention mechanism of Graph Attention Networks (GATs) (Velickovic et al., 2018), and we utilize the implementation provided by PyTorch Geometric (Fey and Lenssen, 2019). An undirected edge is implemented with two directed edges of opposite directions in the experiments. We observe that one layer of message propagation ($L = 1$) empirically gives the best performance. We utilize the ReLU nonlinear activation for the original GLN and adopt LeakyReLU for the attentional GLN at each layer, while both are followed by a batch normalization layer and a dropout layer to stabilize training. The attention mechanism $a$ is implemented with a single FC layer. We train the model in an end-to-end manner with the Adam optimizer, using exponential decay rates of 0.9 and 0.999 for the first- and second-moment estimates, respectively.
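The statement that each undirected edge is implemented as two directed edges can be sketched in a few lines. The two-row source/target layout mirrors the COO `edge_index` convention used by PyTorch Geometric, but the snippet uses plain lists; the node order (front, right, behind, left) and the name `to_directed` are assumptions.

```python
def to_directed(undirected_edges):
    """Expand each undirected edge (u, v) into (u, v) and (v, u)."""
    directed = []
    for u, v in undirected_edges:
        directed.append((u, v))
        directed.append((v, u))
    return directed


# Quadrilateral view graph over nodes 0..3 (assumed node order).
QUAD_EDGES = [(0, 1), (1, 2), (2, 3), (3, 0)]
directed = to_directed(QUAD_EDGES)

# COO layout: one list of source nodes and one list of target nodes.
edge_index = [[u for u, _ in directed], [v for _, v in directed]]
```

With four undirected edges the graph ends up with eight directed ones, so every node both sends to and receives from its two neighbors.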

4.2. Evaluation Datasets and Metrics

Figure 4. An illustration of the WCP dataset, where the red vertices represent locations and the black edges denote the adjacency of vertices. Note that the locations are not drawn to scale and are for illustrative purposes only.

We evaluate our proposed method on two datasets: ICUBE (Liu et al., 2017) that is publicly available and WCP that is collected by ourselves.

4.2.1. ICUBE dataset

The ICUBE dataset contains 2,896 photos of 214 locations in an academic building. For standard indoor localization, to perform a fair comparison, we closely follow the original paper (Liu et al., 2017) and divide the dataset into a training set of 1,712 images and a test set of 1,184 images. In the zero-shot setting, we set aside 1,368 images of 102 locations as seen classes, of which 1,092 are used for training and the other 276 for validation while training the compatibility function. The remaining 1,528 images of 112 locations are set as unseen classes to be used in zero-shot recognition.

4.2.2. WCP dataset

The WCP dataset consists of 3,280 photos of 394 locations in a shopping center. We assign 2,624 images for training and the other 656 images for testing in the standard indoor localization experiment. In the zero-shot setting, 1,696 images of 204 locations are assigned as seen classes, of which 1,360 and 336 are used for training and validating the compatibility function, respectively. The other 1,584 images of 190 locations are unseen classes. Overall, WCP is more difficult than ICUBE due to its higher number of classes and more complicated scenes such as shops and restaurants. Both datasets are collected at 1-meter distance intervals and have a corresponding map whose vertices are locations and whose edges denote adjacency. Figure 4 shows an illustration of the WCP dataset.

4.2.3. Evaluation Metrics

We report one-meter-level accuracy and the Cumulative Distribution Function of the localization error (CDF@k) at distance k. For zero-shot indoor localization, to perform a more detailed evaluation of the models' strengths and weaknesses, we utilize multiple metrics including CDF@k, Recall@k, which checks whether the ground truth is present among the top-k predictions ordered by confidence score, and Median Error Distance (MED), which is the error distance at the 50th percentile of predictions. Note that the distance unit is one meter for CDF and MED.
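The three metrics can be sketched as below. This is an illustrative reading, not the authors' evaluation code: CDF@k is taken as the share of predictions with error at most k meters, and the median uses Python's `statistics.median`, which averages the two middle values for even-length inputs (one possible interpolation choice). The toy errors and rankings are made up.

```python
import statistics


def cdf_at_k(errors, k):
    """Percentage of predictions whose error distance is at most k meters."""
    return 100.0 * sum(e <= k for e in errors) / len(errors)


def recall_at_k(ranked_predictions, truths, k):
    """Percentage of queries whose true location is in the top-k list."""
    hits = sum(t in ranked[:k]
               for ranked, t in zip(ranked_predictions, truths))
    return 100.0 * hits / len(truths)


def median_error(errors):
    """Error distance at the 50th percentile of predictions."""
    return statistics.median(errors)


errors = [0.0, 1.0, 2.0, 6.0]  # toy error distances in meters
ranked = [["a", "b"], ["c", "a"], ["b", "c"], ["a", "c"]]  # best first
truths = ["a", "a", "a", "b"]
```

Recall@k rewards only exact hits in the ranked list, while CDF@k and MED also give credit to predictions that land near the true location on the map.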

4.3. Quantitative Results

4.3.1. Standard Indoor Localization

Figure 5. The cumulative distribution function (CDF) curves of the localization error of the previous and our approaches in standard indoor localization setting on ICUBE dataset.
Dataset   Method                          Meter-level Accuracy
ICUBE     Pedes (Li et al., 2012)         58.30%
ICUBE     Magicol (Shu et al., 2015)      69.20%
ICUBE     Matching (Ravi et al., 2006)    75.00%
ICUBE     MVG (Liu et al., 2017)          82.50%
ICUBE     GLN-STA                         93.92%
ICUBE     GLN-STA-ATT                     90.88%
MALL-1†   Sextant (Gao et al., 2016)      47%
MALL-2‡   GeoImage (Liang et al., 2013)   53%
WCP       GLN-STA                         79.88%
WCP       GLN-STA-ATT                     79.88%

Table 1. Performance comparison with state-of-the-art models on ICUBE, WCP and the respective MALL datasets. Results of previous approaches on ICUBE are taken from (Liu et al., 2017), while results on distinct MALL datasets are taken from their respective papers. †MALL-1 consists of 108 locations and 686 images. ‡MALL-2 contains 20,000 images (locations).
Columns, left to right: Dataset, Method, Recall@k for k = 1, 2, 3, 5, 10, then CDF@k for k = 1, 2, 3, 5, 10, then MED.

Dataset  Method          R@1    R@2    R@3    R@5    R@10   CDF@1  CDF@2  CDF@3  CDF@5  CDF@10 MED
ICUBE    Baseline-coord  0.00   0.01   0.02   0.03   0.03   3.53   3.73   5.96   11.65  23.95  23.00
ICUBE    GLN-ZS          8.12   14.40  22.78  30.89  46.60  19.90  33.77  45.81  56.28  74.87  3.76
ICUBE    GLN-ZS-ATT      8.38   14.92  23.30  32.20  45.81  18.59  34.55  43.71  55.24  73.04  4.09
WCP      Baseline-coord  0.00   0.00   0.00   0.00   0.00   1.01   1.01   2.78   3.79   8.84   27.00
WCP      GLN-ZS          2.02   6.06   7.83   12.37  24.75  8.84   13.38  17.42  22.98  50.25  9.97
WCP      GLN-ZS-ATT      2.02   4.55   8.33   13.64  24.50  9.09   13.38  19.70  25.00  51.52  9.93

Table 2. Results of zero-shot indoor localization in terms of Recall@k, CDF@k and Median Error Distance (MED) on the ICUBE and WCP datasets. Note that recall and CDF values are in % (the higher the better), while median error distances are in meters (the lower the better). MED results are estimated with linear interpolation.

To compare with existing indoor localization methods, we choose not only those that are purely based on images but also those based on signals. Pedes (Li et al., 2012) is a pedestrian dead reckoning localization method using inertial sensors. Magicol (Shu et al., 2015) incorporates geomagnetic field and WiFi signal to perform indoor positioning. Matching (Ravi et al., 2006) performs image comparison by scoring with multiple off-the-shelf algorithms. MVG (Liu et al., 2017) is a multi-view localization method via graph retrieval based on images and geomagnetism. Note that Magicol and MVG are not purely image-based methods. GLN-STA is our original GLN and GLN-STA-ATT is the GLN with self attention mechanism. The upper part of Table 1 shows the meter-level accuracy compared to existing image-based methods, where our GLN variants surpass the others with significantly higher within one-meter accuracy on the ICUBE dataset and improve the state-of-the-art by 13.8%.
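Reading the 13.8% figure as the relative gain of GLN-STA (93.92%) over MVG (82.50%) is an interpretation on our part, since the absolute gap between the two Table 1 entries is 11.42 percentage points. Under that reading, the arithmetic is:

```latex
\frac{93.92 - 82.50}{82.50} = \frac{11.42}{82.50} \approx 0.138 = 13.8\%
```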

We observe that the attention mechanism does not help on either dataset. Note that the scale of our quadrilateral graph is very different from that of common graph datasets (e.g., citation networks (Sen et al., 2008), WebKB graphs (8)), which have thousands of nodes and edges. Moreover, it was observed in (Mostafa and Nassar, 2020; Zhang et al., 2018) that the common instantiation of the attention mechanism on GNNs (i.e., GATs) does not necessarily bring a performance boost over standard GCNs on distinct graph datasets. Therefore, more investigation into how to instantiate and apply the graph attention mechanism in our architecture is needed, and we leave it as future work.

Figure 5 shows the full localization error curve (CDF@k), where our GLN variants perform consistently better than previous approaches, regardless of whether those approaches require infrastructure.

We also list additional results of previous approaches that cannot be reproduced on our datasets due to their infrastructure requirements. Note that since they were evaluated on distinct shopping center datasets, the results may not be directly comparable and serve for reference purposes only. Sextant (Gao et al., 2016) leverages image matching algorithms to identify and match pre-selected reference objects. GeoImage (Liang et al., 2013) performs image matching against a geo-referenced 3D image dataset. Their localization accuracy on the respective shopping mall datasets and that of our GLN variants on the WCP dataset are shown in the lower part of Table 1. Our GLN variants achieve significantly better localization performance than the previous methods without any infrastructure requirement.

4.3.2. Zero-Shot Indoor Localization

To simulate the real-world case, while we only perform localization on data of the unseen classes $\mathcal{Y}^u$, we still make predictions over all possible locations. To demonstrate that our proposed approach helps in zero-shot localization, we implement a baseline method, Baseline-coord, that utilizes the raw coordinates instead of the Map2Vec location embeddings to train the compatibility function. Baseline-coord uses the same standard GLN as the backbone architecture.

The experimental results of zero-shot indoor localization on the ICUBE and WCP datasets are shown in Table 2. GLN-ZS is the original GLN and GLN-ZS-ATT is the attentional GLN, both in the zero-shot setting. On both datasets, GLN-ZS and GLN-ZS-ATT outperform the baseline approach by a large margin. Specifically, on the ICUBE dataset, GLN-ZS significantly outperforms Baseline-coord by achieving 56.3% 5-meter accuracy (CDF@5) and a median error of 3.76 meters, which we consider promising since none of the test locations are seen during training.

Similar to the observation in the experiments of the standard setting, GLN-ZS-ATT has similar performance to GLN-ZS on both ICUBE and WCP. In addition to the possible reasons discussed in the previous section, this may also result from the relatively monotonous scenes in the ICUBE dataset, where the attention mechanism does not help much in distinguishing views. In contrast, GLN-ZS-ATT shows slight performance improvements over GLN-ZS in terms of CDF@k and MED on WCP.

Overall, compared to the experiments on ICUBE, both variants of GLN perform worse on WCP across the metrics, presumably due to the higher variance of scenes and the larger number of classes.

Figure 6. The cumulative distribution function (CDF) curves of the localization error of the zero-shot indoor localization experiments on ICUBE (left) and WCP (right) datasets.
Figure 7. Qualitative results of zero-shot indoor localization on ICUBE (the top row) and WCP (the bottom row) dataset. The first two columns show examples of successful localization cases by utilizing the adjacency of seen classes to unseen classes, where the red, blue and green circles represent three adjacent locations. The last column shows examples of unsuccessful localization cases where our system is misled, especially when there are more query photos lacking distinguishable features.

Figure 6 shows the full CDF@k curves of the zero-shot GLN variants and Baseline-coord, where our models perform consistently better. For example, on the ICUBE dataset, GLN-ZS achieves CDF@k values 3.1 to 5.6 times higher than the baseline, demonstrating the benefit of the Map2Vec embedding. In addition, as mentioned above, GLN-ZS-ATT shows more consistent improvements over GLN-ZS, especially on the harder WCP dataset.

4.4. Qualitative Results for Zero-Shot GLN

To better identify the strengths and weaknesses of our proposed zero-shot approach, we perform a qualitative analysis of GLN-ZS in the zero-shot indoor localization setting. The left two columns of Figure 7 show examples of successful localization, where the correct prediction is made by inferring from the Map2Vec location embeddings that the location of the query images (e.g., location 114 in Fig. 7(a)) lies between two neighboring seen locations (e.g., locations 113 and 115). While some of the views parallel to the corridor are extremely similar (e.g., the second view of Fig. 7(d) and (e)), our model is able to extract a robust representation by passing identifiable features from neighboring views (e.g., the first view of Fig. 7(d) and (e)) to infer the correct location. However, our system can still be misled, especially when more of the query photos lack distinguishable features. The last column (Fig. 7(c) and (f)) shows such unsuccessful localization cases. For instance, in Fig. 7(c) the first query photo consists mostly of a white wall and the second photo contains merely the surface of a cabinet.

5. Conclusion

In this paper, we first propose a novel neural network based architecture, namely Graph Location Networks (GLN), to perform multi-view indoor localization. GLN takes in photos of different views and makes location predictions based on robust location representations obtained with the message-passing mechanism. To reduce the prohibitive labor cost of deployment in large-scale indoor environments, we introduce a novel task named zero-shot indoor localization and propose an effective learning framework that adapts GLN to a dedicated zero-shot version to make predictions on unseen locations. We evaluate our proposed approach not only on the publicly available ICUBE dataset but also on our own benchmark dataset WCP, which we make publicly available to facilitate research on multi-view indoor localization systems. Experimental results show that our proposed method achieves state-of-the-art results in the standard setting and promising accuracy in the zero-shot setting.

Acknowledgements.
This research is partly supported by the Natural Science Foundation of Zhejiang Province, China (No. LQ19F020001), the National Natural Science Foundation of China (No. 61902348, U1609215, 61976188, 61672460), and Singapore’s Ministry of Education (MOE) Academic Research Fund Tier 1, grant number T1 251RES1713.
