
Graph Attention Network for Camera Relocalization on Dynamic Scenes

by   Mohamed Amine Ouali, et al.

We devise a graph attention network-based approach for learning a scene triangle mesh representation in order to estimate the camera pose of an image in a dynamic environment. Previous approaches built a scene-dependent model that explicitly or implicitly embeds the structure of the scene. They use convolutional neural networks or decision trees to establish 2D/3D-3D correspondences. Such a mapping overfits the target scene and does not generalize well to dynamic changes in the environment. Our work introduces a novel approach to solve the camera relocalization problem by using the available triangle mesh. Our 3D-3D matching framework consists of three blocks: (1) a graph neural network to compute the embedding of mesh vertices, (2) a convolutional neural network to compute the embedding of grid cells defined on the RGB-D image, and (3) a neural network model to establish the correspondence between the two embeddings. These three components are trained end-to-end. To predict the final pose, we run the RANSAC algorithm to generate camera pose hypotheses, and we refine the prediction using the point-cloud representation. Our approach significantly improves the camera pose accuracy of the state-of-the-art method from 0.358 to 0.506 on the RIO10 benchmark for dynamic indoor camera relocalization.





I Introduction

The problem of relocalization aims to estimate the 6-degree-of-freedom camera pose (translation and rotation) in a known 3D environment. Several pioneering approaches have been proposed to solve this problem, and recent advancements have reached outstanding accuracy on static scenes [4]. The majority of these approaches are limited to static scenes, mainly because the model is built to embed the 3D structure of the scene, so any change in the environment breaks the dependency between the model and the scene. However, the real world is highly dynamic: illumination changes and geometric changes are very frequent. This makes it pressing to solve the more challenging problem of camera relocalization in dynamic scenes. Wald et al. [33] showed that the state-of-the-art method for static scenes performs very poorly when applied to dynamic scenes, with a drastic drop in accuracy.

A recent work [12] proposed the NeuralRouting approach to address the challenge of dynamic scenes. This approach inherits the decision tree framework of the state-of-the-art approaches for static scenes. The main idea is to build a binary decision tree that embeds physical knowledge of the environment. The tree is used to route points sampled from the query frame to the leaves, where 3D coordinates are estimated. Dong et al. [12] add an outlier-aware mechanism that rejects points corresponding to dynamic regions. In addition, they propose an explicit hierarchical partition of the scene to associate the tree leaves with meaningful physical regions. These techniques increase the model accuracy on dynamic scenes. However, the NeuralRouting approach still has some limitations. First, there is the problem of point sampling: because the points are selected randomly, we could easily end up with many points that belong to dynamic objects in the scene. Second, points belonging to dynamic regions can sometimes help identify the camera position, especially in configurations where static regions lack texture information and the corresponding object has not been moved. Finally, models that embed physical 3D knowledge of the scene perform well on static scenes, whereas on dynamic scenes this knowledge is only helpful for static, unmoved objects. Therefore, a scene-independent model is a much better approach for targeting dynamic scenes.

Fig. 1: Illustration of our hierarchical grid system for the query image. We show the first three levels. At each level, we split the current grid cells into subcells. A vertex is routed through its corresponding grid cells of the image until reaching the final level, where we predict the offset of the vertex.

To address the aforementioned observations, this paper presents a new approach (GATs-Loc) to solve the camera relocalization problem by leveraging the mesh representation of the scene and its semantic segmentation. First, we build a graph attention network model (GATs) to calculate the embedding of the triangle mesh vertices, using their positions, normal vectors, colors, and semantic labels as input features, and the sides of the triangular faces as edges. Second, we define a hierarchical grid configuration on the image as described by Figure 1, and we design a CNN model to calculate the embeddings of the image cells. Third, we apply a single-layer neural network at each hierarchical level to match the embeddings of the vertices with the embeddings of the grid cells. If a vertex projection falls into a grid cell of the image, that grid cell is responsible for detecting the vertex. For the last hierarchical level, we predict the offset of the vertices in the grid cell. After establishing the correspondences between the RGB-D image pixels and the vertices of the 3D mesh, we apply the RANSAC algorithm with the Kabsch algorithm to predict the camera pose hypothesis. Finally, we refine the results with the Iterative Closest Point (ICP) algorithm using the mesh representation.

Contributions. (i) We built a new scene-independent trainable network to establish 3D-3D correspondence between the mesh representation and the query image. (ii) Instead of relying on randomly sampled points or blind keypoints identification, we design a hierarchical grid system to establish 3D-3D correspondence. (iii) We evaluate GATs-Loc against the recently created benchmark RIO10 for camera relocalization in a dynamic indoor environment. Our experiments show that GATs-Loc outperforms the state-of-the-art approach in terms of camera accuracy.

II Related work

Early work used a simple but efficient approach to solve the camera relocalization problem. It relies on a database of images with known camera positions. In the prediction stage, the query image is compared to the images in the database to find the most similar one. Then, the new camera location is interpolated [5, 20, 2]. Hierarchical localization [22] uses local features to establish correspondences between the retrieved and query images. Then, it calculates the camera pose of the query image using the ground-truth pose of the retrieved image and the respective correspondences. Using compact descriptors and advanced indexing techniques, these approaches can scale to larger scenes [20, 26, 31]. Early image retrieval-based localization approaches used hand-crafted methods to calculate global and local image descriptors, but recent methods use CNNs to calculate these features. However, each component of these methods is trained separately, and matching the local features is difficult, especially in changing environments, because they are blindly calculated [28]. These methods also struggle if only a virtual 3D mesh representation is available. In such a case, the approach needs an efficient process to generate the reference images of the dataset from the 3D mesh representation. Thus, the remaining question is how to generate such reference images while keeping sufficient information for the pose prediction and without ending up with many redundant images.

A more direct approach to solve the camera relocalization problem is direct pose regression, formally named scene position regression [1, 19, 21]. The idea is to use a machine learning model trained with images and ground-truth positions to derive the camera pose directly. This method only stores the model parameters instead of several images of the scene, so its time complexity depends only on the model complexity. Furthermore, these approaches generalize well to new poses by creating an explicit or implicit internal representation of the 3D scene. CNNs are the main models used for camera pose regression. This category of methods shows an increase in performance when given a sequence of consecutive frames [11, 24]. The main problem with such methods is the lack of precision when only one image is provided.

The aforementioned methods do not exploit the available prior knowledge about 3D geometry. The perspective transformation exactly defines the relationship between the 3D position of the vertices, the image pixels, and the camera pose. To take advantage of such prior knowledge, structure-based or correspondence-based methods try to find the correspondence between the 2D/3D pixels in the image and 3D coordinates in the scene by regressing the 3D coordinates, and then use these correspondences to estimate the pose. Such methods can be split into two steps: first, point selection and scene coordinate regression; second, camera pose estimation.

(a) Global architecture.
(b) Scoring.
(c) Filter and scoring.
(d) Filter and offset.
Fig. 2: The proposed GATs-Loc framework.

For the first step, the authors in [7] used a CNN to learn the physical structure of the scene and regress the 3D coordinates from the query image. Other methods [8, 15, 16] constructed a regression decision tree to establish the correspondence. First, they select points from the query frame using random sampling or keypoint identification. Second, they calculate the feature vectors of these points, which are used to build the tree and route the points to leaf nodes. Finally, the 3D coordinates of the points are estimated using a probability distribution identified in the training step. After establishing the point correspondences, the PnP (Perspective-n-Point) algorithm [14] for RGB images (2D-to-3D correspondence) or the Kabsch algorithm [18] for RGB-D images (3D-to-3D correspondence) is used to estimate the camera pose (the second step). The estimate is made robust using the RANSAC algorithm [14], an iterative method that estimates the parameters of a mathematical model from observed data that contains outliers. RANSAC is essential because some correspondences are wrongly identified.

Many variants of these approaches have since been proposed. For example, [6] uses a CNN for 3D coordinate regression with a differentiable version of the RANSAC algorithm to enable end-to-end learning. Cavallari et al. [9] speed up the training of the regression tree on new scenes by only training the leaf nodes using a probability density. Finally, Dong et al. [12] proposed to adapt the tree-based method to dynamic scenes by introducing an outlier-aware decision tree with neural routing through the space partition of the scene, and achieved the best result on the RIO10 benchmark for camera relocalization in changing indoor environments.

III The Proposed GATs-Loc Approach

Several camera relocalization methods are based on learning an implicit or explicit representation of the physical environment to regress the 3D coordinates. Such approaches are adequate for static scenes. However, a scene-independent model that does not encode the scene representation should be more robust to changes in the dynamic environment. This idea is partially introduced by the regression trees, where only the leaf nodes embed scene information to regress the 3D coordinates, and fully implemented by the image retrieval approach. We define our model as a scene-independent model. The general idea is to find the correspondence between the RGB-D image points and the vertices of the triangular mesh, and then use this correspondence to derive the camera’s position.

Figure 2 presents an overview of our approach. First, we hierarchically split the image into grids and define a single CNN model to calculate their embeddings (Section III-A). Second, we define a graph attention network to find embeddings for the triangle mesh vertices (Section III-B). Third, we match the vertices with their corresponding grid regions (Section III-C), and calculate their relative offset in the cell (Section III-D). We calculate the exact 3D-3D correspondence to infer the camera pose using these offsets.

It is worth noting that, in our approach, we eliminate the step of random points sampling or keypoints identification used in other methods. Our method is similar to the image retrieval techniques, but it is more general as it uses the 3D mesh representation. Therefore, a lack of images for new poses does not affect the model and we do not rely on image interpolation to predict the camera pose.

III-A Image grid cells embedding

The first input to our model is the RGB-D image. We concatenate the depth map (representing the distance of each point from the camera) with the RGB image to obtain a four-channel 2D tensor used as an input to the CNN model. The image is then split into hierarchical regions as shown by Figure 1. The hierarchical partition is similar to a tree representation. The root node of the tree represents the entire image. At the second level, we split the image along its height to produce two equal cells (the height is approximately equal to two times the width). The same process is then repeated for each subregion by splitting it into four parts. At the final level, each region corresponds to a small block of pixels in the image.
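The hierarchical partition described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `cell_path`, the six-level depth, the 2x2 form of the four-way split, and the 512x256 image size used in the usage note are all assumptions made for the example.

```python
def cell_path(row, col, height, width, num_levels=6):
    """Return the (grid_row, grid_col) cell containing pixel (row, col) at each level.

    Level 1 is the whole image, level 2 splits the image into a top and a
    bottom half, and every later level splits each cell into 2 x 2 subcells
    (assumed form of the four-way split).
    """
    path = []
    n_rows, n_cols = 1, 1  # level 1: a single cell covering the image
    for level in range(1, num_levels + 1):
        if level == 2:
            n_rows *= 2  # split along the height only
        elif level > 2:
            n_rows *= 2  # split each cell into four parts
            n_cols *= 2
        path.append((row * n_rows // height, col * n_cols // width))
    return path
```

For a hypothetical 512x256 image, `cell_path(511, 255, 512, 256)` walks the bottom-right pixel through grids of 1x1, 2x1, 4x2, 8x4, 16x8, and 32x16 cells.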

Fig. 3: The global CNN architecture.

We use the Darknet53 CNN model to calculate an embedding for each region in the image. Darknet53 is defined for YOLOv3 [23], and it uses a succession of CNN blocks. Each time a CNN block is applied, it reduces the height and the width of the image by half and doubles the channel size. We use Darknet53 as the backbone for a larger model to calculate all image region embeddings in parallel.

Figure 3 illustrates the global CNN architecture. Our model uses a set of elementary building blocks. First, we have the “ConvBN” building block consisting of a convolution layer followed by a batch normalization layer and a LeakyReLU. The “DarknetBlock” building block consists of two successive “ConvBN” blocks with a residual connection. For the global architecture, we first apply 32 convolution filters to the RGB-D image. Second, we apply successive “DarknetSet” blocks. Each time a “DarknetSet” block is applied, we use a “ConvBN” layer to halve the width and height and double the channel size. Then, we apply “DarknetBlock” layers. In the second stage, we select the final result and the previous six intermediate results. We apply a “PredictionBlock” to each of them to produce seven tensors, each describing the image cells at one hierarchical level.

III-B Triangle mesh vertices embedding

We use a graph neural network to calculate the embedding of the vertices of the triangle mesh. Such a representation (i.e., a triangle mesh) can be created from image sequences using the ground-truth camera poses. A triangle mesh is a set of vertices and faces that define a 3D shape. The vertices are defined by their positions, and each face defines edges between three vertices. A 3D mesh is defined in a Euclidean 3D space, and a straightforward way to process such a data structure with neural networks is voxelization. Voxelization discretizes the continuous 3D data to represent them in a 3D tensor. Depending on the discretization precision, the size of the grid can be very large, which can make the computation prohibitively expensive. To overcome this problem, we can keep the original sparse representation and use graph neural networks [29] to process it. Graph neural networks are used mainly for non-Euclidean data. However, they can also be useful for sparse Euclidean data, such as a triangle mesh.

Fig. 4: The global GNN architecture.

Our Graph Neural Network (GNN) is composed of seven layers. The number of layers determines how many hops of indirect neighbor vertices are considered when calculating a vertex embedding. Our GNN model uses a single type of layer, the graph attention (GAT) layer [32], defined by Eq. (1). GAT uses the attention mechanism to calculate the weights needed to aggregate the neighboring vertex embeddings, because some vertices could be more important than others. Eq. (1.a) applies a linear transformation to the vertex embeddings of the previous layer using a learnable weight matrix. Eq. (1.b) calculates the attention scalar for all vertex pairs, including the vertex itself (self-attention). In Eq. (1.c), we apply a softmax function to normalize the attention scores over the neighbors of each node. Finally, in Eq. (1.d), we aggregate the linearly transformed embeddings using the attention scores and apply the nonlinear activation function LeakyReLU.
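The equations themselves did not survive in this copy of the text. For reference, the four steps described above match the standard graph attention layer of [32], which can be written as below. The symbols are assumptions chosen for this sketch ($h_i^{(l)}$ for the embedding of vertex $i$ at layer $l$, $W^{(l)}$ and $a^{(l)}$ for the learnable parameters, $\mathcal{N}(i)$ for the neighbors of $i$ including $i$ itself), and the LeakyReLU inside (1.b) follows [32] rather than the description above.

```latex
\begin{align}
z_i^{(l)} &= W^{(l)} h_i^{(l)} \tag{1.a}\\
e_{ij}^{(l)} &= \mathrm{LeakyReLU}\!\left(a^{(l)\top}\left[z_i^{(l)} \,\Vert\, z_j^{(l)}\right]\right) \tag{1.b}\\
\alpha_{ij}^{(l)} &= \frac{\exp\!\left(e_{ij}^{(l)}\right)}{\sum_{k \in \mathcal{N}(i)} \exp\!\left(e_{ik}^{(l)}\right)} \tag{1.c}\\
h_i^{(l+1)} &= \mathrm{LeakyReLU}\!\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(l)} z_j^{(l)}\right) \tag{1.d}
\end{align}
```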

To improve the model capacity, we also leverage the multi-head attention mechanism. Eq. (2) defines the multi-head attention, in which we concatenate multiple embeddings calculated using different attention scores. Each attention head has its own pair-wise attention scores, and the linear transformation of the embedding is calculated using a different learnable weight matrix for each attention head. The multi-head attention mechanism helps create various embeddings, each focusing on a particular aspect of the 3D mesh representation that matches its specific hierarchical region. The output of the graph neural network is a descriptor vector for each vertex in the triangle mesh. The size of the descriptor vector is 259 units, divided into seven subvectors, each of which is used with its corresponding hierarchical grid of the image.
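A minimal NumPy sketch of a multi-head graph attention layer, following the steps the text attributes to Eqs. (1) and (2), is given here for illustration. It is a dense, unoptimized version of the standard layer from [32], not the authors' code. The function names, the LeakyReLU slope of 0.2, and the dense boolean adjacency matrix are assumptions.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_head(h, adj, W, a):
    """One attention head.

    h:   (N, F_in) vertex features
    adj: (N, N) boolean adjacency matrix that includes self-loops
    W:   (F_in, F_out) learnable weight matrix
    a:   (2 * F_out,) learnable attention vector
    """
    z = h @ W                                    # linear transformation (1.a)
    f_out = z.shape[1]
    # a^T [z_i || z_j] splits into a source term plus a destination term.
    e = leaky_relu((z @ a[:f_out])[:, None] + (z @ a[f_out:])[None, :])  # (1.b)
    e = np.where(adj, e, -np.inf)                # keep only mesh neighbors
    e = e - e.max(axis=1, keepdims=True)         # numerical stability
    alpha = np.exp(e)
    alpha /= alpha.sum(axis=1, keepdims=True)    # softmax over neighbors (1.c)
    return leaky_relu(alpha @ z)                 # aggregation and activation

def multi_head_gat(h, adj, Ws, As):
    """Concatenate the outputs of several heads, one (W, a) pair per head."""
    heads = [gat_head(h, adj, W, a) for W, a in zip(Ws, As)]
    return np.concatenate(heads, axis=1)
```

With two heads of output size 8, a mesh of N vertices with 12 input features yields an (N, 16) embedding; a vertex whose only neighbor is itself simply receives its own transformed feature.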


Figure 4 illustrates the global architecture of our graph neural network (GNN) model. The input of the proposed GNN model is a set of vertices with edges defined by the triangle mesh. Each vertex has a 12-dimensional feature vector representing the position, the normal vector, the color of the vertex, and the semantic segmentation, which distinguishes between static and dynamic objects in the scene. First, we apply a single neural network layer to each feature vector to calculate a 16-dimensional embedding vector. Then, we apply seven “GATNorm” blocks, each consisting of a normalization of the feature vector followed by a graph attention layer. The number of attention heads and the dimension of the resulting vector differ from one block to another. Each attention head focuses on a set of neighboring vertices. The final layer of our model is a single neural network layer applied to each vertex descriptor to calculate an embedding vector of size 259.

III-C Vertex-region matching

After calculating the embeddings, we use a third neural network model to match the vertices with the image regions. At the first level, the aim is to identify the vertices that belong to the image. To this end, we calculate the absolute difference between the image embedding and the vertex embedding. Then, we use a single-layer neural network to predict whether a vertex belongs to the image or not. Only the vertices that belong to the image are kept. At the next level, we assign the remaining vertices to regions of the image using distance similarity. Explicitly, we calculate the pairwise Euclidean distance between the region embeddings and the vertex embeddings. Then, we match each vertex to the region with the smallest distance. Next, we use a one-layer neural network to calculate the matching confidence, which decides whether to keep or discard the vertex.

Eq. (4) calculates the confidence of the assignment and is applied at all levels. The confidence network is essential to discard vertices that are misassigned to regions. Eq. (3) is used at all levels except the first one to establish correspondences. This process is repeated at all hierarchical splits of the image. Finally, the remaining vertices are assigned to regions defined by the finest grid of the image. The incremental hierarchical assignment (assigning the vertices to a region and then to its subregions) is used to reduce the computational complexity. The idea is to check only the subregions of the region previously assigned to the vertex. For example, if a vertex is assigned to a given region of the image at one level, then only the subregions of that region are considered in the assignment at the next level.



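The quantities involved in Eqs. (3) and (4) are the embedding of each image region at a given level, the embedding of each vertex at that level, the learnable weights at that level, the region assigned to the vertex at that level, and the probability of the vertex's assignment to that region.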
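The routing of a single vertex through the hierarchy can be sketched as follows. Since Eqs. (3) and (4) are not recoverable from this copy, the sketch only follows the prose: nearest-region assignment by Euclidean distance among the subregions of the previous region, then a one-layer confidence score on the absolute embedding difference. The sigmoid activation, the 0.5 threshold, and all names (`route_vertex`, `children`, `weights`, `biases`) are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def route_vertex(vertex_embs, region_embs, children, weights, biases, threshold=0.5):
    """Greedily route one vertex through the hierarchy.

    vertex_embs[l]: (D,) embedding of the vertex used at level l
    region_embs[l]: (R_l, D) embeddings of the regions at level l
    children[l][r]: indices (at level l + 1) of the subregions of region r
    weights[l], biases[l]: parameters of the one-layer confidence network
    Returns the region index chosen at each level, or None if discarded.
    """
    path = []
    current = 0  # level 0 has a single region: the whole image
    for level in range(len(region_embs)):
        if level > 0:
            # Only the subregions of the previously assigned region compete.
            candidates = children[level - 1][current]
            dists = [np.linalg.norm(region_embs[level][c] - vertex_embs[level])
                     for c in candidates]
            current = candidates[int(np.argmin(dists))]
        diff = np.abs(region_embs[level][current] - vertex_embs[level])
        confidence = sigmoid(weights[level] @ diff + biases[level])
        if confidence < threshold:
            return None  # vertex discarded by the confidence network
        path.append(current)
    return path
```

Restricting the candidates to the children of the current region is what keeps the cost per vertex proportional to the number of levels rather than to the number of cells in the finest grid.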

III-D Relative offset

The previous step assigns the vertices to the regions obtained by hierarchically splitting the image. Each final region consists of a block of pixels. Therefore, we still need to calculate the exact location of each vertex within its region. Here, we do not use a similarity-based method; instead, we adopt an offset estimation method in order to reduce the computational complexity. Eq. (5) represents the offset formula. We use a single-layer perceptron to estimate the offset for each remaining vertex. First, we concatenate the embedding of the vertex and the embedding of its corresponding last region. Then, we apply the neural network to calculate the offset coordinates.
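To show how a predicted offset yields one half of a 3D-3D correspondence, the following sketch converts a cell index and an offset into a pixel, then back-projects that pixel with its depth into camera space. The offset convention (a pair in [0, 1] relative to the top-left corner of a square cell), the pinhole intrinsics `fx, fy, cx, cy`, and both function names are assumptions for illustration; the text does not specify them.

```python
import numpy as np

def offset_to_pixel(cell_row, cell_col, offset, cell_size):
    """Convert a cell index plus an offset (dy, dx) in [0, 1] to pixel (v, u)."""
    dy, dx = offset
    return (cell_row + dy) * cell_size, (cell_col + dx) * cell_size

def back_project(v, u, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with metric depth to camera space."""
    return np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth])
```

For example, cell (2, 3) with 16-pixel cells and offset (0.5, 0.25) maps to pixel row 40 and column 52; pairing the back-projected point with the mesh vertex position gives the 3D-3D correspondence used for pose estimation.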


III-E Loss functions

To train the matching model end-to-end, we use four loss functions. Eq. (6.a) defines the similarity loss function. This function aims to make the vertex embedding similar to the corresponding region embedding. A trivial approach minimizes the Euclidean distance between the vertex embedding and its corresponding cell embedding. A better approach, however, is to use a loss function equivalent to the triplet loss: we ensure that the Euclidean distance between the vertex embedding and its corresponding region embedding is smaller than the distance between the vertex embedding and the embeddings of the other regions at the same level. The loss decreases when the distance to the correct region is smaller than the distance to another region, and increases when it is greater. The margin between the two distances depends on the level and is chosen empirically.



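The quantities involved in Eq. (6.a) are the set of vertices considered at a given level, an indicator that equals 1 if the vertex is assigned to its correct region and 0 otherwise, the set of regions at that level, the embedding of the correct region of the vertex at that level, the embedding of the vertex at that level, and the embedding of each region at that level.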

To reduce the time complexity, at each level we only consider the subregions of the region assigned at the previous level. For example, consider a region at one level with four subregions at the next level. If a vertex corresponds to one of these subregions, then only the losses from the three other subregions are considered in the loss function of that vertex. To compute the offset loss defined by Eq. (6.b), we use the known camera poses to calculate the ground-truth offset. Then, we apply the mean square error to calculate the loss. The offset loss is calculated for the set of vertices that survived the routing process and are assigned to the correct final cell. In Eq. (6.b), an indicator denotes whether a vertex is assigned to its correct region at level 6. Eq. (6.c) defines the confidence loss. To calculate it, we consider the vertices kept by the previous layers. Then, we use the cross-entropy loss function, which compares the predicted probability of assigning the vertex to the considered region at each level with the ground-truth probability, which equals 1 if the considered region is the correct region and 0 otherwise. Finally, Eq. (6.d) minimizes the norm of the embeddings of the vertices and the regions. This loss is a regularization term that also helps to avoid gradient explosion. The global loss function in Eq. (6) is defined as a weighted sum of the other loss functions, with weights fixed in our experiments.
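The triplet-style similarity loss, restricted to sibling subregions as described above, can be sketched for a single vertex at a single level as follows. The exact form of Eq. (6.a) is not recoverable from this copy, so the hinge form `max(0, d_pos - d_neg + margin)` is an assumption (it is the common triplet loss), as are the function and parameter names.

```python
import numpy as np

def similarity_loss(vertex_emb, region_embs, correct, siblings, margin):
    """Hinge triplet loss of one vertex against the siblings of its correct region.

    vertex_emb:  (D,) embedding of the vertex at this level
    region_embs: (R, D) embeddings of the regions at this level
    correct:     index of the region that truly contains the vertex
    siblings:    indices of the subregions sharing the same parent region
    margin:      level-dependent margin between the two distances
    """
    d_pos = np.linalg.norm(vertex_emb - region_embs[correct])
    loss = 0.0
    for r in siblings:
        if r == correct:
            continue
        d_neg = np.linalg.norm(vertex_emb - region_embs[r])
        # Zero once the correct region is closer than region r by the margin.
        loss += max(0.0, d_pos - d_neg + margin)
    return loss
```

With four siblings, each vertex contributes at most three terms per level, which matches the example of the three other subregions in the text.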

After establishing the 3D-3D correspondence, we leverage the RANSAC [14] algorithm with the Kabsch [18] algorithm to predict the camera pose hypothesis, and we use the iterative closest point algorithm [3] to refine the camera relocalization.

IV Experimentation

In this section, we present the results of our approach on RIO10 [33], a recently introduced benchmark for camera relocalization in dynamic indoor environments. The RIO10 dataset consists of 10 scenes. Each scene presents a real indoor environment in several configurations showing the geometric and visual changes encountered in a dynamic environment. The number of configurations varies from one scene to another. For each scene, the first and second configurations are used for training and validating the model, respectively. The remaining configurations are used to test the model. The ground-truth poses are unknown for the test configurations, and the results are obtained by submitting the predictions to the benchmark's online platform.

IV-A Implementation details

We used different configurations to train and test our GATs-Loc model. The common transformations are the resizing of the image to a fixed resolution, and the normalization of the coordinates of the triangle mesh by centering the model at the world origin and scaling all coordinates by the same scalar so that they fall within a fixed range. The same normalization parameters are used in the validation and testing processes.
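The mesh normalization described above can be sketched as follows. The target range [-1, 1] and the use of the bounding-box center as the point moved to the origin are assumptions, since the exact values are not given in this copy of the text; the function names are hypothetical.

```python
import numpy as np

def fit_normalization(vertices):
    """Compute the center and the single scale factor from the training mesh.

    vertices: (N, 3) array of mesh vertex positions.
    """
    center = (vertices.min(axis=0) + vertices.max(axis=0)) / 2.0
    scale = np.abs(vertices - center).max()  # one scalar for all three axes
    return center, scale

def apply_normalization(vertices, center, scale):
    """Apply stored parameters; reused unchanged for validation and testing."""
    return (vertices - center) / scale
```

Using one scalar for all axes preserves the shape of the mesh, so distances between vertices remain comparable after normalization.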

We trained the correspondence model in three stages. In the first stage, we used the training configurations of all scenes for better generalization of the correspondence model. We also applied data augmentation to the image: blurring, Gaussian noise, and different illumination and contrast transformations. For the mesh, we applied random rotations within a bounded angular range about each axis, with one range for one of the axes and another for the remaining two, which favors a rotation-invariant model. In the second stage, we fine-tune the first-stage model and remove the data augmentation for the triangle mesh. In the third stage, we refine the obtained model using one scene configuration at a time. We assessed the refinement process using the validation configuration; the number of epochs needed to refine the model depended on the target scene.

We used several techniques to validate and test our model. The first is to calculate the embeddings of the triangle mesh vertices once and store them. Since the vertex embeddings do not depend on the image frame being evaluated, the graph neural network model can be removed at inference by caching the embeddings, which reduces the time complexity. The second technique is multi-routing through the hierarchical levels. Due to potential errors in assigning the vertices to the correct region of the image, we allow several regions to be matched to a single vertex in the test phase. For our configuration, we allow one assignment at the second hierarchical level, so the vertex is only matched with the top or the bottom half of the image. Then, we keep the best 3, 3, 3, and 4 matching image regions for the third, fourth, fifth, and sixth levels, respectively. Concretely, we allow each vertex to be assigned to three grid cells at the third, fourth, and fifth levels, and to four grid cells at the sixth hierarchical level. The considered grid cells are selected based on the minimum embedding distance. Finally, we get four potential 3D correspondences for each vertex that survived the routing process. All assignments are used in the camera pose prediction. Our experiments show that this trick helps correct many point correspondences, especially for points near a split boundary.
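The multi-routing idea can be sketched as a small beam search over the region hierarchy: at each level, the children of all kept regions compete, and the closest few survive. This is an illustrative sketch only; the confidence filtering is omitted for brevity, and the names `multi_route`, `children`, and `keep` are assumptions.

```python
import numpy as np

def multi_route(vertex_embs, region_embs, children, keep):
    """Keep the keep[l] closest candidate regions for one vertex at each level l.

    vertex_embs[l]: (D,) embedding of the vertex used at level l
    region_embs[l]: (R_l, D) array of region embeddings at level l
    children[l][r]: indices (at level l + 1) of the subregions of region r
    keep[l]:        number of regions retained at level l (keep[0] is unused)
    """
    kept = [0]  # the root region covers the whole image
    for level in range(1, len(region_embs)):
        # Candidates are the subregions of every region kept so far.
        candidates = [c for r in kept for c in children[level - 1][r]]
        dists = np.linalg.norm(
            region_embs[level][candidates] - vertex_embs[level], axis=1)
        order = np.argsort(dists)[:keep[level]]
        kept = [candidates[i] for i in order]
    return kept
```

Keeping more than one region per level lets a vertex near a split boundary recover from an early wrong assignment, at the cost of evaluating more candidates at the next level.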

Input: 3D-3D correspondences from the global matching model.
Output: camera transformation matrix T*.

Let H be an empty pose hypothesis table.
while H contains fewer than 1024 hypotheses do
   Select 3 point pairs (A, B, C) from the 3D-3D correspondence set.
   if farEnough(A, B, C) and rigidTransformation(A, B, C) then
      T = Kabsch(A, B, C).
      Append T to the hypothesis table H.
   end if
end while
Calculate a score for each camera pose hypothesis in H using the vertices of static objects.
T* = Select the best camera pose hypothesis using the previously calculated scores.
Refine the camera pose T* using ICP with static points.
return T*.
Algorithm 1 Custom RANSAC algorithm for pose estimation.

Algorithm 1 illustrates the steps needed to predict the camera pose from the 3D-3D correspondences. We start by generating 1024 camera pose hypotheses that satisfy two conditions. The “farEnough” condition ensures that the selected vertex pairs are far from each other. The “rigidTransformation” condition ensures that the distances between the points in the world space and in the camera space are similar enough. After generating the camera pose hypotheses, we rank them and select the hypothesis with the best score. To calculate the score, we take the 3D points identified in the camera space and multiply them by the inverse camera hypothesis matrix to find their coordinates in the world space. Then, we calculate the mean square error between the calculated world coordinates and those identified in the matching process. Only points corresponding to static objects are used to calculate the mean square error. The score is the inverse of the mean square error.
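A compact NumPy sketch of Algorithm 1 without the ICP refinement step is given below. It is an illustration under stated assumptions, not the authors' code: the thresholds `min_dist` and `rigid_tol`, the `max_tries` cap, and the function names are hypothetical, and each hypothesis is represented directly as a camera-to-world pair (R, t) rather than as a matrix to invert.

```python
import numpy as np

def kabsch(cam_pts, world_pts):
    """Least-squares rigid transform (R, t) such that world ~ R @ cam + t."""
    cam_c, world_c = cam_pts.mean(axis=0), world_pts.mean(axis=0)
    H = (cam_pts - cam_c).T @ (world_pts - world_c)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, world_c - R @ cam_c

def estimate_pose(cam_pts, world_pts, static_mask=None, n_hyp=1024,
                  min_dist=0.3, rigid_tol=0.05, max_tries=100_000, seed=0):
    """Generate pose hypotheses from point triplets and return the best-scoring one."""
    rng = np.random.default_rng(seed)
    if static_mask is None:
        static_mask = np.ones(len(cam_pts), dtype=bool)
    best, best_score, n_found = None, -np.inf, 0
    for _ in range(max_tries):
        if n_found >= n_hyp:
            break
        idx = rng.choice(len(cam_pts), size=3, replace=False)
        a, b = cam_pts[idx], world_pts[idx]
        # Pairwise distances between the three points in each space.
        da = np.linalg.norm(a - np.roll(a, 1, axis=0), axis=1)
        db = np.linalg.norm(b - np.roll(b, 1, axis=0), axis=1)
        if min(da.min(), db.min()) < min_dist:   # farEnough
            continue
        if np.abs(da - db).max() > rigid_tol:    # rigidTransformation
            continue
        R, t = kabsch(a, b)
        n_found += 1
        # Score: inverse mean square error over static points only.
        residual = cam_pts[static_mask] @ R.T + t - world_pts[static_mask]
        score = 1.0 / (np.mean(np.sum(residual ** 2, axis=1)) + 1e-12)
        if score > best_score:
            best, best_score = (R, t), score
    return best
```

The two pre-checks are cheap compared with the Kabsch solve, so rejecting degenerate or non-rigid triplets early avoids spending hypotheses on correspondences that cannot come from a single rigid motion. The function returns `None` when no triplet passes both checks.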

IV-B Correspondence model results

Fig. 5: The rates of the correctly predicted vertices belonging to the image along with the rates of their assignment for seq01_02.

In this experiment, we evaluate the hierarchical correspondence model. Figure 5 illustrates the results on the validation configuration of the first scene (seq01_02). The figure shows mean results over the images of this configuration, so the results can differ from one image to another. To put the results in context, note that only a subset of the scene's vertices corresponds to any given image, so the accuracy of a random selection of vertices equals the ratio of the mean number of vertices per image to the total number of vertices in the scene. As shown in the figure, the first level correctly identifies a share of the vertices belonging to the image, and this share keeps increasing at each level until the last level. The orange color in Figure 5 represents the rate of the vertices that were correctly assigned to their respective grid cells. This rate keeps increasing until level four and then decreases. This pattern is expected: as we split the image further in the hierarchical levels, the embeddings of neighboring cells become more similar, which results in wrong correspondences. Such deviations have little to no effect on the final result because they are corrected by the multi-region assignment presented in Section IV-A.

IV-C Main results

Method                      | Score | DCRE(0.05) | DCRE(0.15) | Pose(0.05m,5°) | Outlier(0.5) | NaN
GATs-Loc (RGB-D)            | 1.401 | 0.533      | 0.579      | 0.506          | 0.132        | 0.196
NeuralRouting [12] (RGB-D)  | 1.441 | 0.538      | 0.615      | 0.358          | 0.097        | 0.227
Grove v2 [10] (RGB-D)       | 1.162 | 0.416      | 0.488      | 0.274          | 0.254        | 0.162
Grove [9] (RGB-D)           | 1.240 | 0.342      | 0.392      | 0.230          | 0.102        | 0.452
KAPTURE [17] (RGB)          | 1.338 | 0.447      | 0.612      | 0.176          | 0.109        | 0.000
D2-Net [13] (RGB)           | 1.247 | 0.392      | 0.521      | 0.155          | 0.144        | 0.014
HF-Net (trained) [25] (RGB) | 0.789 | 0.192      | 0.300      | 0.073          | 0.403        | 0.000
Active Search [27] (RGB)    | 1.166 | 0.185      | 0.250      | 0.070          | 0.019        | 0.690
HF-Net [25] (RGB)           | 0.373 | 0.064      | 0.103      | 0.018          | 0.690        | 0.000
NetVLAD [2] (RGB)           | 0.507 | 0.008      | 0.136      | 0.000          | 0.501        | 0.006
DenseVLAD [31] (RGB)        | 0.575 | 0.007      | 0.137      | 0.000          | 0.431        | 0.000

TABLE I: Results on the RIO10 benchmark. Bold and blue mark the best and second-best value of each metric, respectively.

Table I reports the quantitative results of our framework and other state-of-the-art methods for camera relocalization on the RIO10 test set. In indoor camera relocalization, the camera pose accuracy Pose(0.05m,5°) is regarded as the more common and direct measure for assessing results [12, 30]. This metric is the rate of test frames whose translation and rotation errors are within 5 cm and 5 degrees, respectively. On this metric, our result (0.506) surpasses the state-of-the-art approach (0.358) by about 15 percentage points. As shown in Table I, DCRE(0.05) and Pose(0.05m,5°) are correlated: test frames counted by Pose(0.05m,5°) are also counted by DCRE(0.05), because a predicted camera pose close to the ground truth yields a similar visual perception and hence a high DCRE(0.05) rate. Regarding DCRE(0.05), our framework ranks second, behind NeuralRouting [12]. GATs-Loc obtains similar values for DCRE(0.05) and Pose(0.05m,5°), in contrast to NeuralRouting, whose DCRE(0.05) is much better than its Pose(0.05m,5°) accuracy.
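The size of the improvement on Pose(0.05m,5°) can be read either as an absolute or as a relative gain. The short computation below uses only the two values from Table I; the variable names are illustrative.

```python
# Pose(0.05m,5°) values taken from Table I.
ours = 0.506           # GATs-Loc
previous_best = 0.358  # NeuralRouting [12]

absolute_gain = ours - previous_best            # difference of the two rates
relative_gain = absolute_gain / previous_best   # gain relative to the previous best

print(f"absolute gain: {absolute_gain * 100:.1f} percentage points")
print(f"relative gain: {relative_gain * 100:.1f}%")
```

The absolute gain is the figure of roughly 15 quoted in the text; the relative gain over the previous best is considerably larger.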

The table also covers four more metrics. The NaN metric is the fraction of test frames for which the model could not find a camera pose, which is about 20% of the test frames for our model (0.196). Missing predictions result from insufficient 3D-3D correspondences; in other words, not enough vertices survived the routing process. Outlier(0.5) and DCRE(0.15) are both derived from the DCRE: DCRE(0.15) is the fraction of test frames with a DCRE within 0.15, and Outlier(0.5) is the fraction of test frames with a DCRE greater than 0.5. Our approach is less competitive than the state of the art on Outlier(0.5) and DCRE(0.15). The Score metric (first column of the table) also depends on the DCRE, as it equals 1 + DCRE(0.05) - Outlier(0.5). Our framework is second to NeuralRouting on the Score metric because Score does not depend on Pose(0.05m,5°). We show qualitative results of the matching model in the appendix.
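The Score column of Table I is consistent with the relation Score = 1 + DCRE(0.05) - Outlier(0.5). The relation is inferred here from the table values rather than quoted from the benchmark definition, so the sketch below should be read as a consistency check. The function name and the tolerance (which absorbs rounding of the three-decimal inputs) are choices made for this illustration.

```python
# (reported Score, DCRE(0.05), Outlier(0.5)) copied from Table I.
TABLE_I = {
    "GATs-Loc": (1.401, 0.533, 0.132),
    "NeuralRouting": (1.441, 0.538, 0.097),
    "Grove v2": (1.162, 0.416, 0.254),
    "Grove": (1.240, 0.342, 0.102),
    "KAPTURE": (1.338, 0.447, 0.109),
    "D2-Net": (1.247, 0.392, 0.144),
    "HF-Net (trained)": (0.789, 0.192, 0.403),
    "Active Search": (1.166, 0.185, 0.019),
    "HF-Net": (0.373, 0.064, 0.690),
    "NetVLAD": (0.507, 0.008, 0.501),
    "DenseVLAD": (0.575, 0.007, 0.431),
}

def score_from_dcre(dcre_005, outlier_05):
    """Assumed relation: reward accurate frames, penalise outliers."""
    return 1.0 + dcre_005 - outlier_05

# Largest gap between the reported and the recomputed Score over all rows.
max_gap = max(abs(score_from_dcre(d, o) - s) for s, d, o in TABLE_I.values())
all_consistent = max_gap <= 0.0015  # inputs are rounded to three decimals
```

Since neither term involves Pose(0.05m,5°), a method can lead on pose accuracy and still rank second on Score.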

V Ablation study

To demonstrate the influence of various design choices, we performed an ablation study on the RIO10 validation dataset. We use the validation dataset because the ground-truth labels of the test dataset are not publicly available; in addition, the online evaluation platform imposes a two-week interval between submissions to prevent parameter tuning on the test dataset. The results of our experiments are shown in Table II.

GCNs instead of GATs. This experiment shows the importance of using graph attention layers rather than a simple graph convolution network. Specifically, it demonstrates the importance of the multi-head attention mechanism used by the graph neural network layers. A graph convolution layer treats a vertex and its neighbors in the same manner when aggregating their features. However, not all vertices should be treated alike in the mesh representation, because some vertices are more critical than others. For example, vertices on the corners or borders of an object are more important than those located in a flat area. Hence, the attention mechanism is critical when computing the vertex embeddings: it performs a weighted aggregation based on a weight computed for each adjacent vertex.
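The difference between the two aggregation schemes can be illustrated with a small, dependency-free sketch. It is a simplification made for this illustration only: it uses a single attention head, omits the learned linear projection that graph attention layers normally apply to the features, and replaces the normalized graph convolution with a plain mean. The function names and the toy attention vector are hypothetical.

```python
import math

def mean_aggregate(features, neighbors, i):
    """Convolution-style aggregation: vertex i and its neighbours weigh the same."""
    idx = [i] + neighbors[i]
    dim = len(features[i])
    return [sum(features[j][d] for j in idx) / len(idx) for d in range(dim)]

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def attention_aggregate(features, neighbors, i, attn_vector):
    """Attention-style aggregation: one softmax-normalised weight per neighbour."""
    idx = [i] + neighbors[i]
    # Unnormalised score from the concatenation [h_i || h_j].
    logits = [leaky_relu(sum(w * x for w, x in zip(attn_vector, features[i] + features[j])))
              for j in idx]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    alpha = [e / total for e in exps]
    dim = len(features[i])
    out = [sum(alpha[k] * features[j][d] for k, j in enumerate(idx)) for d in range(dim)]
    return out, alpha

# Toy mesh: vertex 0 with two neighbours; the second feature of vertex 2 is large,
# standing in for a "corner-like" vertex that should dominate the aggregation.
features = [[0.0, 0.0], [1.0, 0.0], [0.0, 4.0]]
neighbors = {0: [1, 2]}
uniform = mean_aggregate(features, neighbors, 0)
weighted, alpha = attention_aggregate(features, neighbors, 0, [0.0, 0.0, 0.0, 1.0])
```

With the mean, every neighbour contributes one third of the result; with attention, the weights in `alpha` concentrate on the neighbour that the scoring vector favours.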

Three graph attention layers instead of seven. A three-layer graph neural network aggregates the representations of vertices that are at most two vertices away from the current vertex. By increasing the number of layers, we take into account vertices that are further away from the current vertex when computing its embedding. The experiment shows a drop in performance when we decrease the number of layers. Further experiments show that the model becomes incapable of associating the correct vertices with the image at the first hierarchical level. When only nearby vertices are considered, the computed embedding becomes a local representation that is harder to match with the global representation of the image.

GCNs instead of GATs                 28.12%
Three GAT layers instead of seven    27.39%
GNN without vertex semantics         39.20%
GNN without vertex normals           50.18%
GNN without vertex colors            19.66%
GNN without vertex positions         25.32%
GATs-Loc                             53.93%
TABLE II: Ablation study for the RIO10 validation dataset.

Removing the vertex semantics/normals/colors/positions. In this experiment, we study the effect of reducing the information fed to our graph neural network model. Each time, we remove a single input and compare the results on the RIO10 validation dataset. The most important information appears to be the colors. The network relies heavily on colors to match the image and the mesh vertices: this information exists in both the image and the mesh, so it is easy for the model to associate them. The results also show that the vertex positions contribute significantly to the final results. Conceptually, the vertex position is important because it helps identify object shapes, which can then be matched to similar shapes identified visually in the image. The network also seems to rely on semantic information. In computer vision, CNNs have proven their ability to identify the classes of objects present in a scene. Therefore, our CNN model can identify the classes of static objects in the images and match them with the semantic information available in the mesh representation. As for the vertex normals, we notice only a small drop in Pose(0.05m,5°) accuracy compared to the other inputs. When normalizing the data, we centered the model on the origin of the 3D world; consequently, the model could deduce the normal information for some vertices.
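The relative importance of each design choice can be read off Table II by subtracting each ablated result from the full model. The sketch below does only that arithmetic on the values in the table; the dictionary keys are shortened labels chosen for this illustration.

```python
# Accuracy values (percent) copied from Table II, RIO10 validation dataset.
FULL_MODEL = 53.93
ABLATIONS = {
    "GCNs instead of GATs": 28.12,
    "three GAT layers instead of seven": 27.39,
    "without vertex semantics": 39.20,
    "without vertex normals": 50.18,
    "without vertex colors": 19.66,
    "without vertex positions": 25.32,
}

# Drop in percentage points when a component is removed or replaced.
drops = {name: round(FULL_MODEL - value, 2) for name, value in ABLATIONS.items()}

# Largest drop first: the components the model depends on most.
ranking = sorted(drops, key=drops.get, reverse=True)
for name in ranking:
    print(f"{name:36s} -{drops[name]:.2f} points")
```

Ordered this way, removing colors costs the most and removing normals the least, which matches the discussion above.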

VI Conclusion

We have designed a triangle mesh-based camera relocalization framework for dynamic indoor environments, in which we leverage the learning capability of graph neural networks. We also proposed a hierarchical structure of the image to identify the 3D-3D correspondences between the test frame and the vertices of the triangle mesh, without randomly sampled points or blind keypoint identification. Finally, we demonstrated the performance of our method by evaluating it on the RIO10 benchmark. Our framework offers a novel way of solving the camera relocalization problem, and several ideas could improve its overall performance. For instance, in our ongoing work, we plan to extend our approach to RGB-only test images, which require more common and less expensive hardware than RGB-D images. We are also investigating a better representation of the semantics, based on the probability that an object changes its location in the scene, which could significantly improve the performance of the model. We also plan to use Vision Transformers (ViT) instead of a CNN to compute the image region embeddings. Finally, we intend to build a hierarchical graph neural network model in which we reduce the number of vertices after each layer.


  • [1] D. Acharya, K. Khoshelham, and S. Winter (2019) BIM-posenet: indoor camera localisation using a 3d indoor model and deep learning from synthetic images. ISPRS Journal of Photogrammetry and Remote Sensing 150, pp. 245–258. Cited by: §II.
  • [2] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic (2016) NetVLAD: cnn architecture for weakly supervised place recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5297–5307. Cited by: §II, TABLE I.
  • [3] K. S. Arun, T. S. Huang, and S. D. Blostein (1987) Least-squares fitting of two 3-d point sets. IEEE Transactions on pattern analysis and machine intelligence (5), pp. 698–700. Cited by: §III-E.
  • [4] X. Bai, M. Huang, N. R. Prasad, and A. D. Mihovska (2019) A survey of image-based indoor localization using deep learning. In 2019 22nd International Symposium on Wireless Personal Multimedia Communications (WPMC), pp. 1–6. Cited by: §I.
  • [5] V. Balntas, S. Li, and V. Prisacariu (2018) Relocnet: continuous metric learning relocalisation using neural nets. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 751–767. Cited by: §II.
  • [6] E. Brachmann, A. Krull, S. Nowozin, J. Shotton, F. Michel, S. Gumhold, and C. Rother (2017) Dsac-differentiable ransac for camera localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6684–6692. Cited by: §II.
  • [7] E. Brachmann and C. Rother (2018) Learning less is more-6d camera localization via 3d surface regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4654–4662. Cited by: §II.
  • [8] T. Cavallari, L. Bertinetto, J. Mukhoti, P. Torr, and S. Golodetz (2019) Let’s take this online: adapting scene coordinate regression network predictions for online rgb-d camera relocalisation. In 2019 International Conference on 3D Vision (3DV), pp. 564–573. Cited by: §II.
  • [9] T. Cavallari, S. Golodetz, N. A. Lord, J. Valentin, L. Di Stefano, and P. H. Torr (2017) On-the-fly adaptation of regression forests for online camera relocalisation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4457–4466. Cited by: §II, TABLE I.
  • [10] T. Cavallari, S. Golodetz, N. A. Lord, J. Valentin, V. A. Prisacariu, L. Di Stefano, and P. H. Torr (2019) Real-time rgb-d camera pose estimation in novel scenes using a relocalisation cascade. IEEE transactions on pattern analysis and machine intelligence 42 (10), pp. 2465–2477. Cited by: TABLE I.
  • [11] R. Clark, S. Wang, A. Markham, N. Trigoni, and H. Wen (2017) Vidloc: a deep spatio-temporal model for 6-dof video-clip relocalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6856–6864. Cited by: §II.
  • [12] S. Dong, Q. Fan, H. Wang, J. Shi, L. Yi, T. Funkhouser, B. Chen, and L. J. Guibas (2021) Robust neural routing through space partitions for camera relocalization in dynamic indoor environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8544–8554. Cited by: §I, §II, §IV-C, TABLE I.
  • [13] M. Dusmanu, I. Rocco, T. Pajdla, M. Pollefeys, J. Sivic, A. Torii, and T. Sattler (2019) D2-Net: A Trainable CNN for Joint Detection and Description of Local Features. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Cited by: TABLE I.
  • [14] M. A. Fischler and R. C. Bolles (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24 (6), pp. 381–395. Cited by: §II, §III-E.
  • [15] S. Golodetz, T. Cavallari, N. A. Lord, V. A. Prisacariu, D. W. Murray, and P. H. Torr (2018) Collaborative large-scale dense 3d reconstruction with online inter-agent pose optimisation. IEEE transactions on visualization and computer graphics 24 (11), pp. 2895–2905. Cited by: §II.
  • [16] A. Guzman-Rivera, P. Kohli, B. Glocker, J. Shotton, T. Sharp, A. Fitzgibbon, and S. Izadi (2014) Multi-output learning for camera relocalization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1114–1121. Cited by: §II.
  • [17] M. Humenberger, Y. Cabon, N. Guerin, J. Morat, J. Revaud, P. Rerole, N. Pion, C. de Souza, V. Leroy, and G. Csurka (2020) Robust image retrieval-based visual localization using kapture. arXiv preprint arXiv:2007.13867. Cited by: TABLE I.
  • [18] W. Kabsch (1976) A solution for the best rotation to relate two sets of vectors. Acta Crystallographica Section A: Crystal Physics, Diffraction, Theoretical and General Crystallography 32 (5), pp. 922–923. Cited by: §II, §III-E.
  • [19] A. Kendall and R. Cipolla (2017-07) Geometric loss functions for camera pose regression with deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §II.
  • [20] H. J. Kim, E. Dunn, and J. Frahm (2017) Learned contextual feature reweighting for image geo-localization. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3251–3260. Cited by: §II.
  • [21] I. Melekhov, J. Ylioinas, J. Kannala, and E. Rahtu (2017) Image-based localization using hourglass networks. In Proceedings of the IEEE international conference on computer vision workshops, pp. 879–886. Cited by: §II.
  • [22] S. Middelberg, T. Sattler, O. Untzelmann, and L. Kobbelt (2014) Scalable 6-dof localization on mobile devices. In European conference on computer vision, pp. 268–283. Cited by: §II.
  • [23] J. Redmon and A. Farhadi (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §III-A.
  • [24] S. Saeedi, E. D. Carvalho, W. Li, D. Tzoumanikas, S. Leutenegger, P. H. Kelly, and A. J. Davison (2019) Characterizing visual localization and mapping datasets. In 2019 International Conference on Robotics and Automation (ICRA), pp. 6699–6705. Cited by: §II.
  • [25] P. Sarlin, C. Cadena, R. Siegwart, and M. Dymczyk (2019) From coarse to fine: robust hierarchical localization at large scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12716–12725. Cited by: TABLE I.
  • [26] T. Sattler, M. Havlena, K. Schindler, and M. Pollefeys (2016) Large-scale location recognition and the geometric burstiness problem. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1582–1590. Cited by: §II.
  • [27] T. Sattler, B. Leibe, and L. Kobbelt (2016) Efficient & effective prioritized matching for large-scale image-based localization. IEEE transactions on pattern analysis and machine intelligence 39 (9), pp. 1744–1756. Cited by: TABLE I.
  • [28] T. Sattler, W. Maddern, C. Toft, A. Torii, L. Hammarstrand, E. Stenborg, D. Safari, M. Okutomi, M. Pollefeys, J. Sivic, et al. (2018) Benchmarking 6dof outdoor visual localization in changing conditions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8601–8610. Cited by: §II.
  • [29] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2008) The graph neural network model. IEEE transactions on neural networks 20 (1), pp. 61–80. Cited by: §III-B.
  • [30] S. Tang, C. Tang, R. Huang, S. Zhu, and P. Tan (2021) Learning camera localization via dense scene matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1831–1841. Cited by: §IV-C.
  • [31] A. Torii, R. Arandjelovic, J. Sivic, M. Okutomi, and T. Pajdla (2015) 24/7 place recognition by view synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1808–1817. Cited by: §II, TABLE I.
  • [32] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2017) Graph attention networks. 6th International Conference on Learning Representations. Cited by: §III-B.
  • [33] J. Wald, T. Sattler, S. Golodetz, T. Cavallari, and F. Tombari (2020) Beyond controlled environments: 3d camera re-localization in changing indoor scenes. In European Conference on Computer Vision, pp. 467–487. Cited by: §I, §IV.

Matching model results

(a) Level 1
(b) Level 2
(c) Level 4
(d) Level 6
(e) Test frame
Fig. 6: An example showing the vertices that survived the routing process (in black) through level 1 (Figure a), level 2 (Figure b), level 4 (Figure c), and level 6 (Figure d) for a specific test frame, shown in Figure e. This example illustrates illumination and geometric changes between the triangle mesh and the test frame: in the test frame (Figure e), the door of the washing machine is closed, the blue basket has been moved, and the illumination differs.
Fig. 7: We show the vertices that survived the routing process (in black) until level 6 for 3 test frames.