I Introduction
The problem of relocalization aims to estimate the 6-degree-of-freedom camera pose (translation and rotation) in a known 3D environment. Several pioneering approaches have been proposed to solve this problem, and recent advancements have reached outstanding accuracy on static scenes [4]. The majority of these approaches are limited to static scenes, mainly because the model is built to embed the 3D structure of the scene, so any change in the environment affects the dependency between the model and the scene. However, the real world is very dynamic: visual illumination and geometric changes are frequent. This creates an urge to solve the more challenging problem of camera relocalization in dynamic scenes. Wald et al. [33] showed that the astonishing results scored by state-of-the-art methods on static scenes turn out to be very poor when applied to dynamic scenes, with accuracy dropping drastically. A recent work [12] proposed the NeuralRouting approach to address the challenge of dynamic scenes. This approach inherits the decision-tree framework of the state-of-the-art approaches for static scenes. The main idea is to build a binary decision tree that embeds physical knowledge of the environment. The tree routes points sampled from the query frame to the leaves, where 3D coordinates are estimated. Dong et al. [12]
add an outlier-aware mechanism to their approach, which rejects points that correspond to dynamic regions. Besides, they proposed an explicit hierarchical partition of the scene to associate the tree leaves with meaningful physical regions. These techniques increase the model accuracy on dynamic scenes. However, the NeuralRouting approach still suffers from some limits. First, there is the problem of point sampling: given the random selection of the points, we could easily end up with multiple points that belong to dynamic objects in the scene. Second, points belonging to dynamic regions could sometimes help identify the camera position, especially in configurations where we lack texture information for static regions and the corresponding object has not been moved from its place. Finally, models that embed the physical 3D knowledge of the scene perform well on static scenes, whereas for dynamic scenes this knowledge is only helpful for static, unmoved objects. Therefore, a scene-independent model is a better approach for targeting dynamic scenes.
To address the aforementioned observations, this paper presents a new approach (GATsLoc) to solve the camera relocalization problem by leveraging the mesh representation of the scene and its semantic segmentation. First, we build a graph attention network (GAT) model to calculate the embedding of the triangle mesh vertices, using their positions, normal vectors, colors, and semantic labels as input features, and the triangular polygons that form the faces as edges. Second, we define a hierarchical grid configuration on the image, as described by Figure
1, and we design a CNN model to calculate the image cell embeddings. Third, we apply a single-layer neural network at each hierarchical level to match the embedding of the vertices with the embedding of the grid cells: if a vertex projection falls into a grid cell of the image, that grid cell is responsible for detecting the vertex. For the last hierarchical level, we predict the offset of the vertices within the grid cell. After establishing the correspondences between the RGBD image pixels and the vertices of the 3D mesh, we apply the RANSAC algorithm with the Kabsch algorithm to generate camera pose hypotheses. Finally, we refine the result with the Iterative Closest Point (ICP) algorithm using the mesh representation.

Contributions. (i) We built a new scene-independent trainable network to establish 3D-3D correspondences between the mesh representation and the query image. (ii) Instead of relying on randomly sampled points or blind keypoint identification, we design a hierarchical grid system to establish the 3D-3D correspondences. (iii) We evaluate GATsLoc against the recently created RIO10 benchmark for camera relocalization in dynamic indoor environments. Our experiments show that GATsLoc outperforms the state-of-the-art approach in terms of camera pose accuracy.
II Related work
Early work used a simplistic but efficient approach to solve the camera relocalization problem. It consists of a database of images with known camera poses. In the prediction stage, the query image is compared to the images in the database to find the most similar one, and the new camera pose is interpolated [5, 20, 2]. Hierarchical localization [22] uses local features to establish correspondences between the retrieved and query images, then calculates the query camera pose using the ground-truth pose of the retrieved image and the respective correspondences. Early image retrieval-based localization approaches used handcrafted methods to calculate global and local image descriptors, while recent methods use CNNs to calculate these features. Using compact descriptors and advanced indexing techniques, these approaches can scale to larger scenes [20, 26, 31]. However, each component of these methods is trained separately, and matching the local features is difficult, especially in changing environments, because they are computed blindly [28]. These methods also struggle if only a virtual 3D mesh representation is available: in such a case, the approach needs an efficient process to generate the reference images of the dataset from the 3D mesh representation. Thus, the remaining question is how to generate such reference images while keeping sufficient information for pose prediction and without ending up with many redundant images.

A more direct approach to solve the camera relocalization problem is direct pose regression, formally named scene position regression [1, 19, 21]
. The idea is to use a machine learning model trained with images and ground-truth poses to derive the camera pose directly. This method only stores the model parameters instead of several images of the scene, so the time complexity depends only on the model complexity. Furthermore, these approaches generalize to new poses by creating an explicit or implicit internal representation of the 3D scene. CNNs are the main models used for camera pose regression. This category of methods shows an increase in performance when given a sequence of consecutive frames
[11, 24]. The main problem with such methods is the lack of precision when only a single image is available.

The aforementioned methods do not exploit the available prior knowledge about 3D geometry. The perspective transformation exactly defines the relationship between the 3D positions of the vertices, the image pixels, and the camera pose. To take advantage of such prior knowledge, structure-based or correspondence-based methods find correspondences between the 2D/3D pixels in the image and the 3D coordinates in the scene by regressing the 3D coordinates, and use these correspondences to estimate the pose. Such methods can be split into two steps: first, point selection and scene coordinate regression; second, camera pose estimation.
For the first step, the authors in [7] used a CNN to learn the physical structure of the scene and regress the 3D coordinates from the query image. Other methods [8, 15, 16]
constructed a regression decision tree to establish the correspondences. First, they select points from the query frame using random sampling or keypoint identification. Second, they calculate feature vectors for these points, which are used to build the tree and route the points to leaf nodes. Finally, the 3D coordinates of the points are estimated using a probability distribution identified during training. After establishing the point correspondences, the PnP (Perspective-n-Point) algorithm
[14] for RGB images (2D-to-3D correspondences) or the Kabsch algorithm [18] for RGBD images (3D-to-3D correspondences) is used to estimate the camera pose (the second step). The result is made robust using the RANSAC algorithm [14], an iterative method that estimates the parameters of a mathematical model from observed data containing outliers. RANSAC is essential because some correspondence points are wrongly identified.

Many variants of these approaches have since been proposed. For example, [6] uses a CNN for 3D coordinate regression with a differentiable version of the RANSAC algorithm to enable end-to-end learning. Cavallari et al. [9]
speed up the training of the regression tree on new scenes by training only the leaf nodes using a probability density. Finally,
Dong et al. [12] proposed to adapt the tree-based method to dynamic scenes by introducing an outlier-aware decision tree with neural routing, through the space partition of the scene, and achieved the best result on the RIO10 benchmark for camera relocalization in changing indoor environments.

III The Proposed GATsLoc Approach
Several camera relocalization methods are based on learning an implicit or explicit representation of the physical environment to regress the 3D coordinates. Such approaches are adequate for static scenes. However, a scene-independent model that does not encode the scene representation should be more robust to changes in a dynamic environment. This idea is partially introduced by the regression trees, where only the leaf nodes embed scene information to regress the 3D coordinates, and fully implemented by the image retrieval approaches. We define our model as a scene-independent model. The general idea is to find the correspondence between the RGBD image points and the vertices of the triangular mesh, and then use this correspondence to derive the camera pose.
Figure 2 presents an overview of our approach. First, we hierarchically split the image into grids and define a single CNN model to calculate their embeddings (Section III-A). Second, we define a graph attention network to find embeddings for the triangle mesh vertices (Section III-B). Third, we match the vertices with their corresponding grid regions (Section III-C) and calculate their relative offsets within the cells (Section III-D). We use these offsets to compute the exact 3D-3D correspondences and infer the camera pose.
It is worth noting that our approach eliminates the step of random point sampling or keypoint identification used in other methods. Our method is similar to image retrieval techniques, but it is more general as it uses the 3D mesh representation. Therefore, a lack of images for new poses does not affect the model, and we do not rely on image interpolation to predict the camera pose.
III-A Image grid cells embedding
The first input to our model is the RGBD image. We concatenate the depth map (representing the distance of each point from the camera) with the RGB image to obtain a four-channel 2D tensor used as input to the CNN model. The image is then split into hierarchical regions, as shown by Figure 1. The hierarchical partition is similar to a tree representation: the root node represents the entire image. At the second level, we split the image along its height to produce two equal cells (the height is approximately equal to two times the width). The same process is then repeated for each sub-region by splitting it into four parts. At the end, we obtain regions corresponding to small blocks of pixels in the image.

We use the Darknet-53 CNN model to calculate an embedding for each region in the image. Darknet-53 was defined for YOLOv3 [23], and it uses a succession of CNN blocks. Each time a CNN block is applied, it reduces the height and the width of the feature map by half and doubles the channel size. We use Darknet-53 as the backbone of a larger model to calculate all image region embeddings in parallel.
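As a sketch of this hierarchical splitting (the seven-level depth follows the text; the 512x256 input resolution, with height roughly twice the width, is a hypothetical example, and the function name is ours):

```python
def grid_cells(height, width, levels=7):
    """Return, per level, the list of cell rectangles (top, left, h, w)."""
    cells = [[(0, 0, height, width)]]          # level 1: the whole image
    # level 2: split along the height into two equal cells
    cells.append([(0, 0, height // 2, width),
                  (height // 2, 0, height // 2, width)])
    for _ in range(levels - 2):                # levels 3..7: 2x2 split of each cell
        nxt = []
        for (t, l, h, w) in cells[-1]:
            hh, ww = h // 2, w // 2
            nxt += [(t, l, hh, ww), (t, l + ww, hh, ww),
                    (t + hh, l, hh, ww), (t + hh, l + ww, hh, ww)]
        cells.append(nxt)
    return cells

levels = grid_cells(512, 256)
counts = [len(c) for c in levels]  # cell count per hierarchical level
```

With these assumptions, the per-level cell counts grow as 1, 2, 8, 32, 128, 512, 2048, which matches the "split each region into four parts" rule described above.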
Figure 3 illustrates the global CNN architecture. Our model uses a set of elementary building blocks. First, we have the "ConvBN" building block, consisting of a convolution layer followed by a batch normalization layer and a LeakyReLU. The "DarknetBlock" building block consists of two "ConvBN" blocks with a residual connection. For the global architecture, we first apply 32 convolution filters to the RGBD image. Second, we apply successive "DarknetSet" blocks. Each time a "DarknetSet" block is applied, we use a "ConvBN" layer to divide the width and height by half and double the channel size, then apply a stack of "DarknetBlock" layers. In the second stage, we select the final result and the previous six intermediate results, and we apply a "PredictionBlock" to each tensor to produce seven tensors describing the image cells at each level.

III-B Triangle mesh vertices embedding
We use a graph neural network to calculate the embedding of the vertices of the triangle mesh. Such a representation (i.e., a triangle mesh) can be created from image sequences using the ground-truth camera poses. A triangle mesh is a set of vertices and faces that define a 3D shape: the vertices are defined by their positions, and each face defines edges between three vertices. A 3D mesh lives in a Euclidean 3D space, and a trivial method to process such a data structure with neural networks is voxelization, which discretizes the continuous 3D data to represent it in a 3D tensor. Depending on the discretization precision, the size of the grid can be very large, which can yield unrealistically high computation costs. To overcome such problems, we keep the original sparse representation and use graph neural networks [29] to process it. Graph neural networks are used essentially for non-Euclidean data; however, they can also be useful for sparse Euclidean data, such as triangle meshes.
Our Graph Neural Network (GNN) is composed of seven layers; the number of layers determines how many indirect neighbor vertices are considered when calculating a vertex embedding. Our GNN model uses one type of layer, the GAT [32] (Graph Attention) layer, defined by Eq. (1). GAT uses the attention mechanism to calculate the weights needed to aggregate the neighboring vertices' embeddings, because some vertices can be more important than others. Eq. (1.a) applies a linear transformation to the previous layer's vertex embeddings using a learnable weight matrix. Eq. (1.b) calculates the attention scalar for all vertex pairs, including the vertex itself (self-attention). In Eq. (1.c), we apply a softmax function to normalize the attention scores over the node's neighbors. Finally, in Eq. (1) we aggregate the linear transformations using the attention scores and apply the nonlinear activation function LeakyReLU.
To improve the model capacity, we also leverage the multi-head attention mechanism. Eq. (2) defines the multi-head attention, in which we concatenate multiple embeddings calculated using different attention scores: $\alpha_{ij}^{k}$ represents the pairwise attention score for the $k$-th attention head. The linear transformations of the embeddings are calculated using different learnable weight matrices, one for each attention head. The multi-head attention mechanism helps create various embeddings, each focusing on a particular aspect of the 3D mesh representation that matches its specific hierarchical region. The output of the graph neural network is a descriptor vector for each vertex in the triangle mesh. The descriptor vector of size 259 is divided into seven sub-vectors, each of which is used with its corresponding hierarchical grid of the image.
$$e_{i}^{\prime} \;=\; \big\Vert_{k=1}^{K} \, \sigma\!\Big( \textstyle\sum_{j \in \mathcal{N}(i) \cup \{i\}} \alpha_{ij}^{k} \, W^{k} e_{j} \Big) \qquad \text{(2)}$$
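A minimal numpy sketch of a graph attention layer and the multi-head concatenation described above, assuming a dense adjacency matrix and toy dimensions (all function and variable names are ours, not from the paper's implementation):

```python
import numpy as np

def gat_layer(H, adj, W, a, leaky=0.2):
    """One graph-attention layer: linear transform, pairwise attention
    logits with LeakyReLU, softmax over neighbours (self included),
    weighted aggregation, final LeakyReLU activation."""
    Z = H @ W                                    # (n, d') linear transform
    n = Z.shape[0]
    logits = np.full((n, n), -np.inf)            # -inf masks non-neighbours
    for i in range(n):
        for j in range(n):
            if adj[i, j] or i == j:              # self-attention included
                s = a @ np.concatenate([Z[i], Z[j]])
                logits[i, j] = s if s > 0 else leaky * s
    alpha = np.exp(logits - logits.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)    # per-node softmax
    out = alpha @ Z                              # attention-weighted sum
    return np.where(out > 0, out, leaky * out)   # LeakyReLU

def multi_head(H, adj, Ws, As):
    """Multi-head attention: concatenate the outputs of each head."""
    return np.concatenate(
        [gat_layer(H, adj, W, a) for W, a in zip(Ws, As)], axis=1)
```

With two heads of output size 4, a 3-vertex graph yields an 8-dimensional embedding per vertex, mirroring the concatenation of Eq. (2).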
Figure 4 illustrates the global architecture of our graph neural network (GNN) model. The input of the proposed GNN model is a set of vertices with edges defined by the triangle mesh. Each vertex has a 12-dimensional feature vector representing the position, the normal vector, and the color of the vertex, plus the semantic segmentation, which distinguishes static from dynamic objects in the scene. First, we apply a single neural network layer to each feature vector to calculate a 16-dimensional embedding vector. Then we apply seven "GATNorm" blocks, each consisting of a normalization of the feature vectors followed by a graph attention layer. The number of attention heads and the dimension of the resulting vector differ from one block to another; each attention head focuses on a set of neighboring vertices. The final layer of our model is a single neural network layer applied to each vertex descriptor to calculate an embedding vector of size 259.
III-C Vertex-region matching
After calculating the embeddings, we use a third neural network model to establish the matching between the vertices and the image zones. At the first level, the aim is to identify the vertices that belong to the image. To do so, we calculate the absolute difference between the image embedding and the vertex embedding, and then use a single-layer neural network to predict whether a vertex belongs to the image or not. Only the vertices that belong to the image are kept. At the next level, we assign the remaining vertices to regions of the image using distance similarity: explicitly, we calculate the pairwise Euclidean distances between the regions and the vertices, then match each vertex to the region with the smallest distance. Next, we use a one-layer neural network to calculate the matching confidence, used to keep or discard the vertices.
Eq. (4) calculates the confidence of the assignment and is applied at all levels. The confidence network is essential to discard vertices that are misassigned to regions. Eq. (3) is used at all levels, except the first one, to establish the correspondences. This process is repeated for all hierarchical splits of the image, and the remaining vertices are finally assigned to the regions defined by the finest grid of the image. The incremental hierarchical assignment (assigning a vertex to a region and then to one of its sub-regions) is used to reduce the computational complexity: we only check the sub-regions of the region previously assigned to the vertex. For example, if a vertex is assigned to some region at one level, only the sub-regions of that region are considered in the assignment at the next level.
$$r_{v}^{l} \;=\; \arg\min_{r \,\in\, \mathcal{C}(r_{v}^{l-1})} \big\lVert e_{r}^{l} - e_{v}^{l} \big\rVert_{2} \qquad \text{(3)}$$

$$p_{v,r}^{l} \;=\; \sigma\!\big( W^{l} \big| e_{r}^{l} - e_{v}^{l} \big| + b^{l} \big) \qquad \text{(4)}$$

where:
$e_{r}^{l}$: embedding of image region $r$ at level $l$.
$e_{v}^{l}$: embedding of vertex $v$ at level $l$.
$W^{l}$, $b^{l}$: learnable weights at level $l$.
$r_{v}^{l}$: region assigned to vertex $v$ at level $l$ ($\mathcal{C}(r_{v}^{l-1})$ denotes the sub-regions of the region assigned at level $l-1$).
$p_{v,r}^{l}$: probability of assignment of vertex $v$ to region $r$.
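The per-level matching and confidence steps can be sketched as follows (a toy numpy version; the restriction of candidates to the sub-regions of the previously assigned region is omitted, and all names and shapes are hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def match_level(vert_emb, region_embs, W, b):
    """Assign each vertex to the closest region by Euclidean distance,
    then score the assignment with a single-layer confidence network
    applied to the absolute embedding difference."""
    # pairwise distances: (n_vertices, n_regions)
    d = np.linalg.norm(vert_emb[:, None, :] - region_embs[None, :, :], axis=2)
    assign = d.argmin(axis=1)                       # closest region per vertex
    diff = np.abs(vert_emb - region_embs[assign])   # |e_v - e_r|
    conf = sigmoid(diff @ W + b)                    # assignment confidence
    return assign, conf
```

In the full pipeline, vertices whose confidence falls below a threshold would be discarded before the next hierarchical level.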
III-D Relative offset
The previous step assigns the vertices to the regions obtained by splitting the image at the finest level, where each region consists of a small block of pixels. Therefore, we still need to calculate the exact location of each vertex within its region. Here, we do not use a similarity-based method; instead, we adopt an offset estimation method in order to reduce the computational complexity. Eq. (5) represents the offset formula. We use a single-layer perceptron to estimate the offset of each remaining vertex: we concatenate the embedding of the vertex and its corresponding final region, then apply the neural network to calculate the offset coordinates.
$$(\Delta x, \Delta y) \;=\; W_{o} \,\big[\, e_{v}^{6} \,;\, e_{r}^{6} \,\big] + b_{o} \qquad \text{(5)}$$
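A minimal sketch of this offset head, assuming the single-layer perceptron acts on the concatenated vertex and final-cell embeddings (names and shapes are illustrative):

```python
import numpy as np

def predict_offset(vert_emb, cell_emb, Wo, bo):
    """Single-layer offset head: concatenate the two embeddings and
    apply one affine layer to produce the (dx, dy) offset in the cell."""
    return np.concatenate([vert_emb, cell_emb]) @ Wo + bo
```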
III-E Loss functions
To train the matching model end-to-end, we use four loss functions. Eq. (6.a) defines the similarity loss function, which aims to make a vertex embedding similar to its corresponding region embedding. A trivial approach would minimize the Euclidean distance between the vertex embedding and its corresponding cell embedding. However, a better approach is a loss function equivalent to the triplet loss: we ensure that the Euclidean distance between the vertex embedding and its corresponding region embedding is smaller than the distance between the vertex and the other regions at the same level. The hinge term decreases when the former distance is smaller than the latter and increases otherwise. The margin between the two distances depends on the level and is chosen empirically.

$$\mathcal{L}_{sim} \;=\; \sum_{l} \sum_{v \in \mathcal{V}^{l}} \delta_{v}^{l} \sum_{\substack{r \in \mathcal{R}^{l} \\ r \neq r_{v}^{*}}} \max\!\Big( 0,\; \big\lVert e_{v}^{l} - e_{r_{v}^{*}}^{l} \big\rVert - \big\lVert e_{v}^{l} - e_{r}^{l} \big\rVert + m_{l} \Big) \qquad \text{(6.a)}$$
$\mathcal{V}^{l}$: vertices set considered for level $l$.
$\delta_{v}^{l}$: $1$ if the vertex is assigned to its correct region, else $0$.
$\mathcal{R}^{l}$: regions set for level $l$.
$e_{r_{v}^{*}}^{l}$: embedding of vertex $v$'s correct region at level $l$.
$e_{v}^{l}$: embedding of vertex $v$ at level $l$.
$e_{r}^{l}$: embedding of region $r$ at level $l$.
To reduce the time complexity, at each level we only consider the sub-regions of the region assigned at the previous level. For example, consider a region with four sub-regions at the next level: if a vertex corresponds to the first sub-region, only the losses from the three other sub-regions are considered in that vertex's loss. To compute the offset loss defined by Eq. (6.b), we use the known camera poses to calculate the ground-truth offsets and apply the mean squared error. The offset loss is calculated only for the set of vertices that survived the routing process and are assigned to the correct final cell. Eq. (6.c) defines the confidence loss: considering the vertices kept by the previous levels, we use the cross-entropy between the predicted assignment probability and the ground-truth probability, which equals 1 if the considered region is the correct one and 0 otherwise. Finally, Eq. (6.d) minimizes the norm of the embeddings of the vertices and the regions; this regularizer term also helps avoid gradient explosion. The global loss function in Eq. (6) is defined as a weighted sum of the other loss functions, with weights set empirically in our experiments.
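The triplet-style similarity term can be sketched for a single vertex as follows (a simplified numpy version with an explicit margin argument; the per-level margins and the indicator weighting described above are omitted):

```python
import numpy as np

def similarity_loss(v, pos, negs, margin):
    """Hinge/triplet-style term: the distance from the vertex embedding
    to its correct region `pos` should be smaller than the distance to
    every sibling region in `negs` by at least `margin`."""
    d_pos = np.linalg.norm(v - pos)
    return sum(max(0.0, d_pos - np.linalg.norm(v - n) + margin)
               for n in negs)
```

When the correct region is already closer than every sibling by more than the margin, the term is zero and the embedding is left unchanged.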
IV Experimentation
In this section, we present the results of our approach on RIO10 [33], a recently introduced benchmark for camera relocalization in dynamic indoor environments. The RIO10 dataset consists of 10 scenes. Each scene presents a real indoor environment with different configurations showing several geometric and visual changes encountered in a dynamic environment. The number of configurations varies from one scene to another. The first and second configurations of each scene are used for training and validating the model, respectively. The remaining configurations are used to test the model. The ground-truth poses are unknown for the test configurations, and the results are obtained by submitting the predictions to the online platform: http://vmnavab26.in.tum.de/RIO10/.
IV-A Implementation details
We used different configurations to train and test our GATsLoc model. The common transformations are the resizing of the input images and the normalization of the coordinates of the triangle mesh, by centering the model on the world origin and scaling all coordinates by the same scalar so that they fall within a fixed range. The same normalization parameters are used in the validation and testing processes.
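A sketch of the described mesh normalization, assuming the target range is [-1, 1] (the exact range is not stated above, and the function name is ours):

```python
import numpy as np

def normalize_mesh(verts):
    """Center the vertex cloud on the world origin, then scale every
    coordinate by a single scalar so all values fall in [-1, 1]."""
    center = verts.mean(axis=0)
    centered = verts - center
    scale = np.abs(centered).max()       # one scalar for all axes
    return centered / scale, center, scale
```

Returning the center and scale makes it possible to apply the identical normalization to the validation and test meshes, as the text requires.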
We trained the correspondence model in three stages. In the first stage, we used the training configurations of all scenes for a better generalization of the correspondence model. We also applied data augmentation techniques to the images: blurring, Gaussian noise, and different illumination and contrast transformations. For the mesh, we applied random rotations around each axis, which favors a rotation-invariant model. In the second stage, we perform transfer learning from the first-stage model and remove the data augmentation process for the triangular mesh. In the third stage, we refine the obtained model using one scene configuration at a time. We assessed the refinement process using the validation configurations, and it took only a few epochs to refine the model to the target scene.

We used several techniques to validate and test our model. The first is to calculate the embedding of the triangle mesh vertices once and retain it: since the vertex embeddings do not depend on the image frame being evaluated, the graph neural network can be removed at test time by memorizing the embeddings, reducing the time complexity. The second technique is multi-routing through the hierarchical levels. Due to the potential error in assigning vertices to the correct region of the image, we allow multiple region matches for a single vertex in the test phase. For our configuration, we allow one assignment at the second hierarchical level, so a vertex is only matched with the top or the bottom half of the image. Then we keep the best three, three, three, and four image region matches for the third, fourth, fifth, and sixth levels, respectively. Concretely, we allow each vertex to be assigned to three grid cells in the third, fourth, and fifth levels, and to four grid cells in the sixth hierarchical level. The considered grid cells are selected based on the minimum embedding distance. Finally, we get four potential 3D-3D correspondences for each vertex that survived the routing process, and all assignments are used in the camera pose prediction. Our experimentation shows that this trick can help correct many point correspondences, especially for points near a split boundary.
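The multi-routing trick can be sketched as a per-level top-k selection over candidate cells (the per-level counts follow the text; the names are ours):

```python
import numpy as np

# Candidate counts per hierarchical level, as described in the text:
# 1 cell at level 2, the best 3 cells at levels 3-5, and 4 at level 6.
K_PER_LEVEL = {2: 1, 3: 3, 4: 3, 5: 3, 6: 4}

def multi_route(dists, level):
    """Indices of the k closest candidate cells (smallest embedding
    distance) kept for a vertex at the given hierarchical level."""
    k = K_PER_LEVEL[level]
    return np.argsort(dists)[:k]
```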
Algorithm 1 illustrates the steps needed to predict the camera pose from the 3D-3D correspondences. We start by generating 1024 camera pose hypotheses that satisfy two conditions. The "farEnough" condition ensures that the selected vertex pairs are far from each other. The "rigidTransformation" condition ensures that the distances between the points in world space and in camera space are similar enough. After generating the camera hypotheses, we rank them and select the best-scoring hypothesis. To calculate the score, we take the 3D points identified in camera space and multiply them by the inverse camera hypothesis matrix to find their coordinates in world space. Then, we calculate the mean squared error between the calculated world coordinates and those identified in the matching process; only points corresponding to static objects are used. In the end, the score is the inverse of the mean squared error.
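Algorithm 1 itself is not reproduced here, but the Kabsch step and the inverse-MSE scoring it relies on can be sketched in numpy as follows (the function names and the 1e-12 stabilizer are our additions):

```python
import numpy as np

def kabsch(P, Q):
    """Optimal rotation R and translation t mapping points P (camera
    space, Nx3) onto Q (world space, Nx3) in the least-squares sense."""
    Pc, Qc = P - P.mean(axis=0), Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)          # 3x3 covariance SVD
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = Q.mean(axis=0) - R @ P.mean(axis=0)
    return R, t

def hypothesis_score(R, t, P_cam, Q_world):
    """Inverse mean-squared error between the transformed camera-space
    points and their matched world coordinates (higher is better)."""
    err = ((P_cam @ R.T + t - Q_world) ** 2).mean()
    return 1.0 / (err + 1e-12)
```

In the full algorithm, the score would be computed only over points on static objects, and the best of the 1024 hypotheses would be kept and refined with ICP.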
IV-B Correspondence model results
In this experiment, we evaluate the hierarchical correspondence model. Figure 5 illustrates the results on the validation configuration of the first scene (seq01_02). The figure shows the mean results over the validation images; the results can therefore differ from one image to another. To put the results in perspective, note that only a small fraction of the scene's vertices corresponds to each image, so a random selection of vertices would have a very low accuracy. As shown in the figure, the first level correctly identifies most of the vertices belonging to the image, and this percentage keeps increasing at each level. The orange color in Figure 5 represents the rate of vertices that are correctly assigned to their respective grid cells. This rate keeps increasing until level four and then decreases. This pattern is understandable: as we split the image further in the hierarchical levels, the neighboring cells' embeddings become more similar, which results in wrong correspondences. Such deviations in identifying the correspondences have small to no effect, because they are corrected by the multi-label assignment presented in Section IV-A.
IV-C Main results
Method  Score  DCRE(0.05)  DCRE(0.15)  Pose(0.05m,5°)  Outlier(0.5)  NaN 

GATsLoc (RGBD)  1.401  0.533  0.579  0.506  0.132  0.196 
NeuralRouting [12] (RGBD)  1.441  0.538  0.615  0.358  0.097  0.227 
Grove v2 [10] (RGBD)  1.162  0.416  0.488  0.274  0.254  0.162 
Grove [9] (RGBD)  1.240  0.342  0.392  0.230  0.102  0.452 
KAPTURE [17](RGB)  1.338  0.447  0.612  0.176  0.109  0.000 
D2Net [13](RGB)  1.247  0.392  0.521  0.155  0.144  0.014 
HFNet (trained) [25](RGB)  0.789  0.192  0.300  0.073  0.403  0.000 
Active Search [27](RGB)  1.166  0.185  0.250  0.070  0.019  0.690 
HFNet [25](RGB)  0.373  0.064  0.103  0.018  0.690  0.000 
NetVLAD [2](RGB)  0.507  0.008  0.136  0.000  0.501  0.006 
DenseVLAD [31] (RGB)  0.575  0.007  0.137  0.000  0.431  0.000 

Table I illustrates the quantitative results of our framework and other state-of-the-art methods for camera relocalization on the RIO10 test set. In indoor camera relocalization, the camera pose accuracy Pose(0.05m,5°) is regarded as the most common and direct measurement for assessing the results [12, 30]. This metric represents the rate of test frames whose rotation and translation errors are within 5 degrees and 5cm, respectively. For the Pose(0.05m,5°) metric, our result (0.506) surpasses the state-of-the-art approach (0.358) by about 15%. As shown in Table I, there is a correlation between DCRE(0.05) and Pose(0.05m,5°): both metrics are proportional, and the test frames counted by the Pose(0.05m,5°) metric are also counted by DCRE(0.05), since similar predicted and ground-truth camera poses always yield similar visual perception (a small DCRE error). Regarding the DCRE(0.05) metric, the results show that our framework is the second best, behind NeuralRouting. GATsLoc has similar results for DCRE(0.05) and Pose(0.05m,5°), in contrast to the NeuralRouting approach, where the DCRE(0.05) is much better than the Pose(0.05m,5°).
The table also covers four more metrics. The NaN metric corresponds to the percentage of test frames where the model could not produce a camera pose, which is about 19% of the test frames for our model. The missing camera pose predictions result from insufficient 3D-3D correspondences, i.e., not enough vertices survived the routing process. The Outlier(0.5) and DCRE(0.15) metrics are related to the DCRE metric: DCRE(0.15) corresponds to the percentage of test frames with a DCRE value within 0.15, while Outlier(0.5) corresponds to the rate of test frames with a DCRE greater than 0.5. Our approach shows less competitive results for the Outlier(0.5) and DCRE(0.15) metrics compared to the state of the art. The Score metric (first column of the table) also depends on the DCRE metric, as it equals DCRE(0.05) + (1 - Outlier(0.5)). Our framework is second to NeuralRouting on the Score metric, because this metric does not depend on the Pose(0.05m,5°) metric. We show the qualitative results of the matching model in the appendix.
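The Score column is consistent with the following relation, which we infer from the table values (e.g., 0.533 + (1 - 0.132) = 1.401 for GATsLoc):

```python
def rio10_score(dcre_005, outlier_05):
    """RIO10 Score as implied by Table I: DCRE(0.05) + (1 - Outlier(0.5))."""
    return dcre_005 + (1.0 - outlier_05)

# GATsLoc row from Table I
assert abs(rio10_score(0.533, 0.132) - 1.401) < 1e-6
```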
V Ablation study
To demonstrate the influence of various design choices, we performed an ablation study on the RIO10 validation dataset. (We use the validation dataset because the ground-truth labels of the test dataset are not publicly available; in addition, the online evaluation platform imposes a two-week interval between submissions to prevent parameter tuning on the test dataset.) The results of our experiments are shown in Table II.
GCNs instead of GATs. This experiment shows the importance of using graph attention layers in comparison to a simple graph convolution network. Specifically, it demonstrates the importance of the multi-head attention mechanism used by the graph neural network layers. In fact, a graph convolution layer treats the vertex and its neighbors in the same manner when aggregating their results. However, not all vertices should be treated similarly in the mesh representation, because some vertices are more critical than others: for example, vertices representing the corners or borders of an object are more important than those located in a flat area. Hence, the attention mechanism is critical when calculating the embeddings of the vertices, as it performs a weighted aggregation based on the score calculated for each adjacent vertex.
Three graph attention layers instead of seven. A three-layer graph neural network aggregates the representations of vertices that are at most two vertices away from the current vertex. By increasing the number of layers, we consider vertices that are further away when calculating the embedding. The experiment shows a drop in performance when we decrease the number of layers. Further experiments show that the model becomes incapable of associating the correct vertices with the image at the first hierarchical level: by considering less distant vertices, the calculated embedding becomes a local representation that is harder to match with the global representation of the image.
TABLE II: Ablation study on the RIO10 validation dataset.

Configuration                          Pose(0.05m, 5°)
GCNs instead of GATs                   28.12%
Three GAT layers instead of seven      27.39%
GNN without vertex semantics           39.20%
GNN without vertex normals             50.18%
GNN without vertex colors              19.66%
GNN without vertex positions           25.32%
GATsLoc                                53.93%
Removing the vertex semantics/normals/colors/positions. In this experiment, we study the effect of reducing the information fed to our graph neural network model. Each time, we remove a single input and compare the results on the RIO10 validation dataset. The most important information appears to be the colors: the network relies heavily on color to match the image with the mesh vertices, since this information exists in both modalities and makes the association straightforward for the model. The results also show that the positions of the vertices are very significant for the final accuracy. Intuitively, the vertex position helps identify object shapes, which can then be matched to the similar shapes identified visually in the image. The network also relies on semantic information: CNNs have proven their ability to identify the classes of objects present in a scene, so our CNN model can recognize the class of static objects in the images and match them with the semantic labels available in the mesh representation. As for the vertex normals, we notice only a small drop in Pose(0.05m, 5°) accuracy compared to the other inputs. Since the data normalization centers the model on the origin of the 3D world, the model can partly deduce the normal information for some vertices.
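The ablation protocol amounts to zeroing one input group while keeping the feature dimensions fixed, so the network architecture is unchanged. A minimal sketch of how the per-vertex inputs could be assembled and ablated (the function name and the [0, 1] color scaling are our illustrative assumptions):

```python
def vertex_features(position, normal, color, sem_id, num_classes, drop=None):
    # One-hot encode the semantic class of the vertex.
    semantic = [1.0 if i == sem_id else 0.0 for i in range(num_classes)]
    groups = {
        "position": list(position),
        "normal": list(normal),
        "colors": [c / 255.0 for c in color],  # scale RGB to [0, 1]
        "semantic": semantic,
    }
    if drop is not None:
        # Ablation: zero out one input group while preserving its size.
        groups[drop] = [0.0] * len(groups[drop])
    return (groups["position"] + groups["normal"]
            + groups["colors"] + groups["semantic"])
```

Running the model with, e.g., `drop="colors"` then isolates how much the final Pose(0.05m, 5°) accuracy depends on that single input.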
VI Conclusion
We have designed a triangle-mesh-based camera relocalization framework for dynamic indoor environments that leverages the learning capability of graph neural networks. We also proposed a hierarchical structuring of the image to identify the 3D-3D correspondences between the test frame and the vertices of the triangle mesh, without random point sampling or blind keypoint detection. Finally, we demonstrated the performance of our method by validating it on the RIO10 benchmark. Our framework offers a novel way of solving the camera relocalization problem, and several directions could improve its overall performance. In our ongoing work, we plan to extend our approach to RGB-only test images, which rely on more common and less expensive hardware than RGB-D cameras. We are also investigating a better representation of the semantics, based on the probability that an object changes its location in the scene, which could significantly improve the performance of the model. We further plan to take advantage of Vision Transformers (ViT) instead of CNNs to compute image-region embeddings. Finally, we intend to build a hierarchical graph neural network model in which the number of vertices is reduced after each applied layer.
References

[1] (2019) BIM-PoseNet: indoor camera localisation using a 3d indoor model and deep learning from synthetic images. ISPRS Journal of Photogrammetry and Remote Sensing 150, pp. 245–258. Cited by: §II.
[2] (2016) NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5297–5307. Cited by: §II, TABLE I.
[3] (1987) Least-squares fitting of two 3-d point sets. IEEE Transactions on Pattern Analysis and Machine Intelligence (5), pp. 698–700. Cited by: §III-E.
[4] (2019) A survey of image-based indoor localization using deep learning. In 2019 22nd International Symposium on Wireless Personal Multimedia Communications (WPMC), pp. 1–6. Cited by: §I.
[5] (2018) RelocNet: continuous metric learning relocalisation using neural nets. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 751–767. Cited by: §II.
[6] (2017) DSAC - differentiable RANSAC for camera localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6684–6692. Cited by: §II.
[7] (2018) Learning less is more - 6d camera localization via 3d surface regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4654–4662. Cited by: §II.
[8] (2019) Let's take this online: adapting scene coordinate regression network predictions for online RGB-D camera relocalisation. In 2019 International Conference on 3D Vision (3DV), pp. 564–573. Cited by: §II.
[9] (2017) On-the-fly adaptation of regression forests for online camera relocalisation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4457–4466. Cited by: §II, TABLE I.
[10] (2019) Real-time RGB-D camera pose estimation in novel scenes using a relocalisation cascade. IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (10), pp. 2465–2477. Cited by: TABLE I.
[11] (2017) VidLoc: a deep spatio-temporal model for 6-dof video-clip relocalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6856–6864. Cited by: §II.
[12] (2021) Robust neural routing through space partitions for camera relocalization in dynamic indoor environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8544–8554. Cited by: §I, §II, §IV-C, TABLE I.
[13] (2019) D2-Net: a trainable CNN for joint detection and description of local features. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Cited by: TABLE I.
[14] (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24 (6), pp. 381–395. Cited by: §II, §III-E.
[15] (2018) Collaborative large-scale dense 3d reconstruction with online inter-agent pose optimisation. IEEE Transactions on Visualization and Computer Graphics 24 (11), pp. 2895–2905. Cited by: §II.
[16] (2014) Multi-output learning for camera relocalization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1114–1121. Cited by: §II.
[17] (2020) Robust image retrieval-based visual localization using Kapture. arXiv preprint arXiv:2007.13867. Cited by: TABLE I.
[18] (1976) A solution for the best rotation to relate two sets of vectors. Acta Crystallographica Section A: Crystal Physics, Diffraction, Theoretical and General Crystallography 32 (5), pp. 922–923. Cited by: §II, §III-E.
[19] (2017) Geometric loss functions for camera pose regression with deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §II.
[20] (2017) Learned contextual feature reweighting for image geo-localization. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3251–3260. Cited by: §II.
[21] (2017) Image-based localization using hourglass networks. In Proceedings of the IEEE international conference on computer vision workshops, pp. 879–886. Cited by: §II.
[22] (2014) Scalable 6-dof localization on mobile devices. In European conference on computer vision, pp. 268–283. Cited by: §II.
[23] (2018) YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §III-A.
[24] (2019) Characterizing visual localization and mapping datasets. In 2019 International Conference on Robotics and Automation (ICRA), pp. 6699–6705. Cited by: §II.
[25] (2019) From coarse to fine: robust hierarchical localization at large scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12716–12725. Cited by: TABLE I.
[26] (2016) Large-scale location recognition and the geometric burstiness problem. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1582–1590. Cited by: §II.
[27] (2016) Efficient & effective prioritized matching for large-scale image-based localization. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (9), pp. 1744–1756. Cited by: TABLE I.
[28] (2018) Benchmarking 6-dof outdoor visual localization in changing conditions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8601–8610. Cited by: §II.
[29] (2008) The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80. Cited by: §III-B.
[30] (2021) Learning camera localization via dense scene matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1831–1841. Cited by: §IV-C.
[31] (2015) 24/7 place recognition by view synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1808–1817. Cited by: §II, TABLE I.
[32] (2017) Graph attention networks. In 6th International Conference on Learning Representations. Cited by: §III-B.
[33] (2020) Beyond controlled environments: 3d camera relocalization in changing indoor scenes. In European Conference on Computer Vision, pp. 467–487. Cited by: §I, §IV.