Hierarchical Scene Coordinate Classification and Regression for Visual Localization
Visual localization is pivotal to many applications in computer vision and robotics. To address single-image RGB localization, state-of-the-art feature-based methods solve the task by matching local descriptors between a query image and a pre-built 3D model. Recently, deep neural networks have been exploited to directly learn the mapping between raw pixels and 3D coordinates in the scene, and thus the matching is implicitly performed by the forward pass through the network. In this work, we present a new hierarchical joint classification-regression network to predict pixel scene coordinates in a coarse-to-fine manner from a single RGB image. The network consists of a series of output layers, each conditioned on the outputs of the previous ones, where the final output layer regresses the coordinates and the others produce coarse location labels. Our experiments show that the proposed method outperforms the vanilla scene coordinate regression network and is more scalable to large environments. With data augmentation, it achieves state-of-the-art single-image RGB localization performance on three benchmark datasets.
Visual localization aims at estimating the precise six degree-of-freedom (6-DoF) camera pose with respect to a known environment. It is a fundamental component that enables many intelligent autonomous systems and applications in computer vision and robotics, e.g., augmented reality, autonomous driving, or camera-based indoor localization for personal assistants. Commonly used visual localization methods rely on matching local visual descriptors. Correspondences are typically established between 2D interest points in the query and 3D points in the pre-built structure-from-motion model with nearest neighbor searches, and the 6-DoF camera pose of the query can then be computed from the correspondences.
Instead of explicitly establishing 2D-3D correspondences, recent works have proposed to leverage the capability of CNNs to directly regress dense 3D coordinates from an image [2, 3]. In this way, correspondences between 2D points in the image and 3D points in the scene can be obtained without explicit matching, and thus no feature detectors and descriptors are needed. In addition, no 3D maps are required at test time. Therefore, these CNN-based methods have the potential for deployment on mobile devices. Moreover, experimental results have shown that these methods achieve better localization performance on small-scale benchmark datasets compared to the state-of-the-art feature-based methods [3, 4].
Scene coordinate regression networks are typically designed to have a limited receptive field, i.e., only a small local image patch is considered for each scene coordinate prediction. This allows the network to generalize well from limited training data, since local patch appearance is more stable w.r.t. viewpoint change. On the other hand, a limited receptive field size can lead to ambiguous patterns in the scene, especially in large-scale environments. Due to these ambiguities, it is harder for the network to accurately model the regression problem, resulting in inferior performance at test time. Using larger receptive field sizes, up to the full image, to regress the coordinates can mitigate the issues caused by ambiguities. This, however, has been shown to be prone to overfitting the larger input patterns in the case of limited training data, even if data augmentation alleviates this problem to some extent.
In contrast, in this work we overcome the ambiguities due to small receptive fields by conditioning on discrete location labels for the pixels. The labels are obtained by a coarse quantization of the 3D space. The pixel location labels are obtained using a classification network, which can more easily deal with the location ambiguity since it is trained using the cross-entropy classification loss which permits a multi-modal prediction in 3D space. Our model allows for several classification layers, using progressively finer location labels, obtained through a hierarchical clustering of the ground-truth 3D point-cloud data. Our hierarchical coarse-to-fine architecture is implemented using conditioning layers that are related to the FiLM architecture, resulting in a compact model.
We validate our approach with an ablation study in which we compare it to a baseline that uses the same backbone architecture as our model but lacks the hierarchical coarse-to-fine joint classification-regression structure. We present results on three datasets used in previous work: 7-Scenes, 12-Scenes, and Cambridge Landmarks. Our approach shows consistently better performance and achieves state-of-the-art results when trained with data augmentation. Moreover, by compiling the 7-Scenes dataset into a single large virtual scene, we show that our approach scales more robustly to larger environments.
In summary, the contributions of this paper are as follows:
We introduce a new joint classification-regression architecture for scene coordinate prediction.
We show that the joint coarse-to-fine hierarchy improves performance and scalability over a vanilla scene coordinate regression network.
We show that our method achieves state-of-the-art results for single-image RGB localization on three benchmark datasets when trained with data augmentation.
Visual localization aims at predicting the 6-DoF camera pose for a given query image. To obtain a precise 6-DoF camera pose, visual localization methods are typically structure-based, i.e., they rely on 2D-3D correspondences between 2D image positions and 3D scene coordinates. With the established 2D-3D correspondences, a RANSAC optimization scheme is responsible for producing the final pose estimate. The correspondences are typically obtained by matching local features such as SIFT, and many matching and filtering techniques have been proposed, which enable efficient and robust city-scale localization [1, 12, 13].
Image retrieval can also be used for visual localization. The pose of the query image can be directly approximated by that of the most similar retrieved database image. Since compact image-level descriptors are used for matching, image retrieval methods can scale to very large environments. The retrieval methods can be combined with structure-based methods [15, 16, 17, 18] or relative pose estimation [19, 20] to predict precise poses. Typically, the retrieval step helps restrict the search space and leads to faster and more accurate localization.
In recent years, learning-based localization approaches have been explored. One popular direction is to replace the entire localization pipeline with a single neural network. PoseNet and its variants [21, 22, 23, 24] directly regress the camera pose from a query image. However, Sattler et al. have pointed out that direct pose regression is more closely related to pose approximation via image retrieval than to accurate pose estimation via 3D structure. Therefore, these approaches are still outperformed by structure-based methods. By fusing estimated pose information from the previous frame, [26, 27] achieve significantly better performance. However, they require sequences of images rather than single images.
Instead of learning the entire pipeline, scene coordinate regression based methods learn the first stage of the pipeline in the structure-based approaches. Namely, either a random forest [7, 28, 29, 30, 31, 32, 33, 34, 35] or a neural network [2, 3, 4, 5, 32, 36] is trained to directly predict 3D scene coordinates for the pixels, and thus the 2D-3D correspondences are established. These methods do not explicitly rely on feature detection, description and matching, and are able to provide correspondences densely. They are more accurate than feature-based methods at small and medium scale, but do not scale well to larger scenes. In order to generalize well to novel viewpoints, these methods typically rely only on local image patches to produce the scene coordinate predictions. However, this makes training difficult, especially when the scale of the scene is large. In this work, we introduce an element-wise conditioning layer to modulate the intermediate feature maps of the regression network using coarse location information. We show that this leads to better localization performance and makes scene coordinate regression more scalable.
A joint classification-regression forest has previously been trained to predict scene coordinates. However, the forest predicts scene IDs and coordinates directly rather than performing joint classification-regression in a hierarchical coarse-to-fine manner. Our approach is also related to the work of Rogez et al. [37, 38], which proposed a classification-regression approach for human pose estimation from single images. Similar to our experience, they found that an initial classification layer followed by a class-conditioned regression layer breaks down the complexity of an originally complicated regression problem. Our work differs from theirs in the use of our FiLM-like conditioning layers and of a classification hierarchy.
In this section we describe our hierarchical coarse-to-fine classification-regression approach for scene coordinate prediction. Figure 1 gives a schematic overview of our approach. Note that we address single-image RGB camera localization in this work, rather than using RGB-D information [7, 28, 29, 31, 34, 35] or a sequence of images [26, 27] during inference.
Hierarchical location labels.
We hierarchically partition the ground-truth 3D point-cloud data with k-means. In this way, in addition to the ground-truth 3D scene coordinates, each pixel in a training image is also associated with a number of labels, from coarse to fine, obtained at different levels of the clustering tree. For each level, except the root, our network has a corresponding classification output layer which predicts for each pixel these discrete location labels.
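The label generation described above can be illustrated with a minimal sketch. It assumes a two-level tree, plain Lloyd's k-means with random initialization, and points stored as 3-tuples; the function names (`kmeans`, `hierarchical_labels`) and the toy point cloud are hypothetical and not taken from the paper.

```python
import random

def _nearest(p, centers):
    """Index of the center closest to point p (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on 3D points; returns (centers, labels)."""
    centers = random.Random(seed).sample(points, k)
    for _ in range(iters):
        labels = [_nearest(p, centers) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, [_nearest(p, centers) for p in points]

def hierarchical_labels(points, branching):
    """Two-level clustering tree: per-point (coarse, fine) labels plus leaf centers."""
    _, coarse = kmeans(points, branching)
    fine = [0] * len(points)
    leaf_centers = {}
    for c in range(branching):
        idx = [i for i, l in enumerate(coarse) if l == c]
        if not idx:
            continue
        # Re-cluster only the points that fell into coarse region c.
        centers, sub = kmeans([points[i] for i in idx], min(branching, len(idx)))
        for i, l in zip(idx, sub):
            fine[i] = l
            leaf_centers[(c, l)] = centers[l]
    return coarse, fine, leaf_centers

# Toy point cloud: two well-separated groups of 3D points.
points = [(0.1 * i, 0.0, 0.0) for i in range(5)] + \
         [(10.0 + 0.1 * i, 10.0, 10.0) for i in range(5)]
coarse, fine, leaf_centers = hierarchical_labels(points, branching=2)
```

Each pixel whose ground-truth 3D point is `points[i]` would then carry the label pair `(coarse[i], fine[i])` as classification targets for the two levels.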
Our network has a final output layer that regresses the continuous 3D scene coordinates for the pixels, generating putative 2D-3D matches. Instead of directly regressing the absolute coordinates, we regress the positions relative to the cluster centers at the finest level, which accelerates convergence of network training. Conditioned on the preceding label map, the output layers have decreasing receptive field sizes. Note that for the conditioning information, we use ground-truth label maps during training and predicted label maps at test time. Coarser location labels are obtained with larger receptive fields that are more robust to ambiguities, and the final regression layer has a relatively limited receptive field size and is more stable to viewpoint changes. Note that at test time the effective receptive field of the regression layer is also large, since it is influenced by the predicted location labels.
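The relative parameterization of the regression target can be sketched as follows. This is only an illustration under the assumption that the target is the plain difference between a pixel's scene coordinate and the center of its finest-level cluster; the helper names are hypothetical.

```python
def encode_offset(coord, center):
    """Regression target: scene coordinate relative to its finest-level cluster center."""
    return tuple(c - m for c, m in zip(coord, center))

def decode_coord(offset, center):
    """Recover the absolute scene coordinate from a predicted offset."""
    return tuple(o + m for o, m in zip(offset, center))

# At training time the center comes from the ground-truth label,
# at test time from the label predicted by the finest classification layer.
center = (1.0, 2.0, 0.0)
target = encode_offset((1.5, 2.0, -0.5), center)
restored = decode_coord(target, center)
```

Because the offsets are bounded by the extent of a leaf cluster, the regression output covers a much smaller range than absolute coordinates would.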
Conditioning layers. To make use of the discrete location labels predicted by the network at coarser levels, these predictions should be fed back to the finer levels. Inspired by the Feature-wise Linear Modulation (FiLM) conditioning method, we introduce several conditioning layers just before each of the output layers. A conditioning parameter generator takes the predicted label map as input and outputs a set of scaling and shifting parameters $\gamma$ and $\beta$, and these parameters are fed into the conditioning layer to apply a linear transformation to the input feature map. Unlike FiLM layers, however, which perform channel-wise modulation, our conditioning layer performs linear modulation in a per-element manner, i.e., element-wise multiplication and addition as shown in Figure 2 (right). Therefore, instead of vectors, the output parameters $\gamma$ and $\beta$ from a generator are feature maps of the same dimensions as the input feature map to the corresponding conditioning layer. More formally, given the input feature map $x$ and the scaling and shifting parameters $\gamma$ and $\beta$, the linear modulation can be written as:
$$f(x, \gamma, \beta) = \gamma \odot x + \beta,$$
where $\odot$ denotes the Hadamard product. In addition, the generators consist of only $1 \times 1$ convolutional layers so that each pixel is conditioned only on its own label. As a non-linearity, we add an ELU activation after the feature modulation.
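A minimal sketch of such a conditioning layer is given below, using nested Python lists in place of tensors. It assumes a single-layer generator for each of the scaling and shifting maps and a one-hot encoded label map as input; the actual generator depth and channel counts are not specified here, and all names are hypothetical.

```python
import math

def elu(v, alpha=1.0):
    """ELU non-linearity applied to a scalar."""
    return v if v > 0 else alpha * (math.exp(v) - 1.0)

def conv1x1(label_onehot, weight, bias):
    """1x1 convolution: the same linear map applied independently at every pixel.
    label_onehot: [K][H][W], weight: [C][K], bias: [C] -> output [C][H][W]."""
    K, H, W = len(label_onehot), len(label_onehot[0]), len(label_onehot[0][0])
    return [[[bias[c] + sum(weight[c][k] * label_onehot[k][h][w] for k in range(K))
              for w in range(W)] for h in range(H)] for c in range(len(weight))]

def condition(x, scale, shift):
    """Element-wise modulation followed by ELU: ELU(scale * x + shift)."""
    return [[[elu(s * v + t) for v, s, t in zip(x_row, s_row, t_row)]
             for x_row, s_row, t_row in zip(x_ch, s_ch, t_ch)]
            for x_ch, s_ch, t_ch in zip(x, scale, shift)]

# One channel, a 1x2 feature map, two possible labels (pixel 0 -> label 0, pixel 1 -> label 1).
x = [[[1.0, -2.0]]]
onehot = [[[1.0, 0.0]], [[0.0, 1.0]]]
scale = conv1x1(onehot, weight=[[2.0, 0.5]], bias=[0.0])  # generator for the scaling map
shift = conv1x1(onehot, weight=[[0.0, 1.0]], bias=[0.0])  # generator for the shifting map
out = condition(x, scale, shift)
```

Since the generator only uses 1x1 kernels, the modulation applied at a pixel depends on that pixel's label alone, which is the property the text above relies on.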
Network architecture. Our overall network architecture is shown in Figure 2 (left). In our experiments we use a 3-level hierarchy for all the datasets, i.e., our network has two classification output layers and one regression output layer. As mentioned above, the space of the scene is partitioned into regions which are hierarchically organized in a tree structure. The first classification branch outputs the labels of the coarse regions, and the second one predicts the labels of the finer subregions. We use strided convolution, upconvolution and dilated convolution for the two classification branches to enlarge the receptive field while preserving the output resolution. All the layers after the conditioning layers have a kernel size of $1 \times 1$ such that the conditioning information of a pixel affects only itself. More details on the architecture are provided in the supplementary material.
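To make the effect of stride and dilation on the receptive field concrete, the standard recurrence for a stack of convolutions can be sketched as below. The layer stacks used here are hypothetical examples and are not the architecture of the paper; upconvolution layers are not modeled.

```python
def receptive_field(layers):
    """Receptive field (in input pixels, along one axis) of a stack of convolutions.
    layers: list of (kernel, stride, dilation) tuples, in forward order."""
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf

plain = receptive_field([(3, 1, 1)] * 3)                        # three 3x3 convs
strided = receptive_field([(3, 1, 1), (3, 2, 1), (3, 1, 1)])    # middle conv has stride 2
dilated = receptive_field([(3, 1, 1), (3, 1, 1), (3, 1, 2)])    # last conv has dilation 2
with_1x1 = receptive_field([(3, 1, 1)] * 3 + [(1, 1, 1)] * 2)   # 1x1 layers appended
```

The last case shows why layers with kernel size 1 after a conditioning layer leave the receptive field unchanged: they add no spatial context.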
Loss function. Our network predicts location labels and regresses scene coordinates at the same time. Therefore, we need both a regression loss and a classification loss during training. For the regression task, we minimize the Euclidean distance between the predicted scene coordinates $\hat{y}$ and the ground-truth scene coordinates $y$,
$$\mathcal{L}_r = \sum_i \lVert y_i - \hat{y}_i \rVert_2,$$
where $i$ ranges over the pixels in the image. For the classification task, we use the cross-entropy loss at both levels, i.e.
$$\mathcal{L}_c^j = -\sum_i \left(l_i^j\right)^\top \log \hat{l}_i^j,$$
where $l_i^j$ denotes the one-hot coding of the ground-truth label of pixel $i$ at level $j$, and $\hat{l}_i^j$ denotes the vector of predicted label probabilities for the same label, and the log is applied element-wise. The final loss function is given by
$$\mathcal{L} = w_1 \mathcal{L}_c^1 + w_2 \mathcal{L}_c^2 + w_3 \mathcal{L}_r,$$
where $w_1$, $w_2$, $w_3$ are weights for the loss terms. Details on the weights and training procedure are provided in the supplementary material.
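The combined objective, a weighted sum of the two cross-entropy terms and the Euclidean regression term, can be sketched in a few lines. The unit weights used here are placeholders, since the actual values are given only in the supplementary material, and labels are passed as class indices rather than one-hot vectors for brevity.

```python
import math

def regression_loss(pred, gt):
    """Sum over pixels of the Euclidean distance between predicted and true coordinates."""
    return sum(math.sqrt(sum((p - g) ** 2 for p, g in zip(pi, gi)))
               for pi, gi in zip(pred, gt))

def classification_loss(probs, labels):
    """Cross-entropy summed over pixels: -log of the probability given to the true label."""
    return -sum(math.log(p[l]) for p, l in zip(probs, labels))

def total_loss(pred, gt, probs1, labels1, probs2, labels2, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the two classification terms and the regression term."""
    w1, w2, w3 = weights  # placeholder weights, not the values used in the paper
    return (w1 * classification_loss(probs1, labels1)
            + w2 * classification_loss(probs2, labels2)
            + w3 * regression_loss(pred, gt))

# A single pixel: coordinates off by a 3-4-5 triangle, both classifiers undecided.
loss = total_loss(pred=[(0.0, 0.0, 0.0)], gt=[(3.0, 4.0, 0.0)],
                  probs1=[[0.5, 0.5]], labels1=[0],
                  probs2=[[0.5, 0.5]], labels2=[1])
```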
In this section, we present our experimental setup and evaluation results on three standard visual localization datasets.
|Cambridge Landmarks (median error: cm, °)||DSAC++||Baseline||Ours||12-Scenes (accuracy at 5 cm, 5°)||DSAC++||Baseline||Ours|
|Great Court||40, 0.2||157, 0.8||35, 0.2||Manolis||96.4%||96.6%||100%|
|K. College||18, 0.3||18, 0.3||16, 0.3||Floor 5a||83.7%||97.2%||98.8%|
|Old Hospital||20, 0.3||20, 0.3||17, 0.3||Floor 5b||95%||93.6%||97.3%|
|Shop Facade||6, 0.3||6, 0.3||6, 0.3||Complete||96.4%||97.9%||99.1%|
|St M. Church||13, 0.4||15, 0.5||10, 0.3|
|Average||19, 0.3||43, 0.4||17, 0.3|
We use three standard benchmark datasets for visual localization. The 7-Scenes dataset is a widely used RGB-D dataset that contains seven indoor scenes. Sequences of RGB-D images of the scenes are recorded by a KinectV1. Ground-truth poses and dense 3D models are also provided. 12-Scenes is another indoor RGB-D dataset. It is composed of twelve rooms captured with a Structure.io depth sensor and an iPad color camera, and ground-truth poses are provided along with the RGB-D images. The recorded environments are significantly larger compared to 7-Scenes. Cambridge Landmarks is an outdoor RGB visual localization dataset. It consists of RGB images of six scenes captured using a Google LG Nexus 5 smartphone. Ground-truth poses and sparse 3D reconstructions generated with structure from motion are provided. In addition to these three datasets, we synthesize a larger scene based on 7-Scenes by placing all seven scenes into a single coordinate system. This large integrated 7-Scenes dataset is denoted by i7-Scenes.
For Cambridge Landmarks, we report the median pose accuracy as in previous works. Following [3, 4], we do not include the Street scene, since the 3D reconstruction of this scene has rather poor quality that hampers performance. For 7-Scenes, 12-Scenes and i7-Scenes, we report the percentage of test images with error below 5 cm and 5°, which gives more information about the localization performance.
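The accuracy measure can be sketched as follows. The sketch assumes that each pose is given as a 3x3 rotation matrix together with the camera position in meters, that the rotation error is the angle of the relative rotation, and that both thresholds must be met simultaneously; the helper names are hypothetical.

```python
import math

def rotation_error_deg(R_est, R_gt):
    """Angle of the relative rotation R_est^T R_gt, in degrees."""
    # trace(R_est^T R_gt) equals the sum of element-wise products.
    trace = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for numerical safety
    return math.degrees(math.acos(cos_angle))

def translation_error(c_est, c_gt):
    """Euclidean distance between estimated and true camera positions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c_est, c_gt)))

def accuracy(estimates, ground_truth, max_trans=0.05, max_rot_deg=5.0):
    """Fraction of images whose pose error is below both thresholds."""
    hits = sum(1 for (R_e, c_e), (R_g, c_g) in zip(estimates, ground_truth)
               if translation_error(c_e, c_g) < max_trans
               and rotation_error_deg(R_e, R_g) < max_rot_deg)
    return hits / len(estimates)

def rot_z(deg):
    """Rotation about the z-axis, used only to build toy poses."""
    a = math.radians(deg)
    return [[math.cos(a), -math.sin(a), 0.0], [math.sin(a), math.cos(a), 0.0], [0.0, 0.0, 1.0]]

gt = [(rot_z(0.0), (0.0, 0.0, 0.0)), (rot_z(0.0), (0.0, 0.0, 0.0))]
est = [(rot_z(3.0), (0.02, 0.0, 0.0)), (rot_z(10.0), (0.02, 0.0, 0.0))]
acc = accuracy(est, gt)
```

For the median errors reported on Cambridge Landmarks, the same two error functions would be evaluated per image and summarized with a median instead of a threshold count.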
To generate the ground-truth coarse location labels, we run hierarchical k-means clustering on dense point cloud models. For all the scenes, unless stated otherwise, the branching factor is set to 25. For i7-Scenes, we simply use the labels generated for 7-Scenes. For our network, we use the same VGG-style architecture as DSAC++ as the base regression network, except that we use ELU activations instead of ReLU. The output neurons of the regression layer, the first classification layer and the second classification layer have receptive field sizes of 73×73, 185×185, and 409×409, respectively. To show the advantage of the proposed method, we create a baseline network without the classification branches, the conditioning layers and the generators (i.e., with the base regression network only), and train it with the Euclidean loss term only. Unless specified otherwise, we perform affine data augmentation with additive brightness changes during training. For pose estimation, we follow previous work and use the same PnP-RANSAC algorithm with the same hyperparameter settings. Further details about the architecture, training and other settings can be found in the supplementary material.
To validate our model, we compare it in Table 1 with the baseline model, as well as DSAC++, which currently has the state-of-the-art results for single-image RGB localization on all three datasets. We report the numbers given in the original DSAC++ paper. A more complete comparison to other recent methods (with worse results) is provided in the supplementary material. In general, methods that exploit extra depth information or sequences of images provide better localization performance. However, they are not directly comparable to our method, and thus we do not compare to those methods in this work. Overall, our approach yields excellent results. Compared to the baseline, our approach provides consistently better localization performance on all the scenes across the three datasets. During training, we also observed consistently lower regression training error. This indicates that the additional information from the conditioning layers makes the training of the regression branch easier, thus leading to better performance at test time. Although trained with data augmentation, the baseline method is still outperformed on 7-Scenes and Cambridge Landmarks by DSAC++, which is trained without data augmentation. This is not surprising, given that the two additional training steps proposed in DSAC++ (optimizing the reprojection error and end-to-end training) boost the accuracy (from 72.4% to 76.1% on 7-Scenes). In fact, despite some minor implementation differences, the baseline without data augmentation is supposed to be identical to DSAC++ trained without the two additional training steps. Below, in Section 4.3 and Table 3, we provide additional results to assess the impact of data augmentation in more detail. As we can see, without data augmentation, the baseline method provides similar performance on 7-Scenes (72.6%). Remarkably, with data augmentation, our approach can already outperform DSAC++ on all three datasets.
Potentially our approach could also benefit from the two additional training steps, but we leave this for future work. Since the additional branches and layers do not add significant overhead, our approach has nearly the same run-time as DSAC++ at test time.
|i7-Scenes||Baseline||DSAC++ ||DSAC++ ||Ours||Ours-512||Ours-512+|
The individual scenes from the 7-Scenes dataset all have very limited physical extent. To go beyond such small environments we use our i7-Scenes dataset, which integrates all the scenes from 7-Scenes in a single coordinate system. In Table 2 we compare our method to the baseline as well as DSAC++ in this setting. We use the publicly available implementation to produce the numbers for DSAC++. Here, one of the two DSAC++ entries denotes a network trained with only the first step of the full DSAC++ 3-step training procedure (optimizing the Euclidean error). We see that the localization performance of the baseline decreases dramatically when it is trained on all the scenes together compared to being trained and tested on each of the scenes individually, cf. Table 1. Our method is much more robust to this increase in the environment size and significantly outperforms the baseline, underlining the importance of the proposed approach when the environment is large and potentially contains more ambiguities.
DSAC++ also significantly outperforms the baseline on i7-Scenes. According to the results, the two additional DSAC++ training steps are even more helpful for improving the localization performance when the environment is large, as the fully trained DSAC++ exceeds the accuracy of its first-step-only variant by 15.7%. By inspecting the results, it seems that the two additional training steps proposed in DSAC++ make the already good scene coordinate predictions even better, while deteriorating the predictions for local image patches which are difficult. This makes it easier for the RANSAC algorithm to filter out the outliers and find the correct predictions to fit the model, thus leading to better performance. However, this does not change the fact that the receptive field of the network is limited. Thus, the hardest cases remain hard, since there are not enough good correspondences due to the ambiguities. Our method still has overall better performance than DSAC++, and clearly outperforms it on the hardest scenes. For example, in Figure 3, we show that the Heads scene is hard since it is difficult to distinguish its test frames from the training frames of the Office scene. Stairs is also hard since it already contains many repetitive patterns. Without global image context, it is extremely difficult for a network to produce enough precise scene coordinate predictions, so the baseline and DSAC++ can easily fail in these cases, while our method is able to provide reasonable performance. As mentioned above, our method could also benefit from the two additional training steps. This could be addressed in future work.
In the last two columns of Table 2 we consider the effect of reducing the number of channels in the last two convolutional layers before the regression output layer from 4,096 to 512. The model size decreases significantly, but the regression branch becomes less discriminative and harder to train. This leads to significantly decreased accuracy in column “Ours-512”. We find that by adding more conditioning layers at an earlier stage of the network and using fewer shared layers between the regression and classification branches, denoted “Ours-512+”, the model size can be reduced without much loss in accuracy.
Data augmentation. We perform affine transformations to the images with additive brightness changes as data augmentation during training. In general, this improves the generalization capability of the network and makes it more robust to lighting and viewpoint changes. However, this could also increase the ambiguity level of the training data and make the network training more difficult. Moreover, since the ground truth coordinates and poses are not perfect, data augmentation may also increase the error at test time in some rare cases. Our results on 7-Scenes without using data augmentation are given in Table 3. Our method still has better results than the baseline without data augmentation. According to the numbers in Table 1, our method benefits more from data augmentation (+4.2%) than the baseline (+2.1%).
Hierarchy. For all the previous experiments, we use a network with two levels of classification layers for predicting gradually finer location labels. Here we experiment with only one classification branch. We use either the first (Ours-C1) or the second (Ours-C2) branch to predict the finest labels. Note that the receptive field size differs between levels, and typically we want to use more global context. The results in Table 4 show that the hierarchical architecture achieves overall better results than the single-branch alternatives. We did not explore more hierarchical levels in this work, although it might be helpful for localization in larger environments.
Directly using global information. In Table 5 we show that direct regression of scene coordinates with a large receptive field is also problematic. We use the same baseline network (without classification branches), but now with dilated convolution such that the receptive field size of the output neurons is much larger (409×409). Although the training loss decreases, the network does not generalize well at test time, which results in inferior localization performance.
Partition granularity. We show that the granularity of the clustering matters in Table 6. We experiment on different branching factors such that the numbers of leaf clusters differ. Although the accuracy varies with the number of leaf clusters, for all tested branching factors we observe overall better results than obtained with the baseline.
We have proposed a novel hierarchical coarse-to-fine classification-regression approach to scene coordinate prediction. Coarser location labels are predicted with classification networks on larger image patches trained via cross-entropy, which allows the model to deal with ambiguous local appearances during both training and inference. Accurate regression of the continuous coordinates is facilitated by conditioning on the coarse labels. Our experiments on the 7-Scenes, 12-Scenes and Cambridge Landmarks datasets show that the hierarchical classification-regression approach leads to more accurate camera re-localization predictions than the previous regression-only approach, achieving state-of-the-art results for single-image RGB localization when trained with data augmentation. Moreover, our experiments with our “integrated 7-Scenes” dataset (which combines the 7-Scenes dataset into a single large space) show that our approach scales more robustly to larger environments compared to previous scene coordinate regression based approaches.
Despite our encouraging results, our approach has some limitations. First, it requires high-quality depth maps or a dense 3D scene reconstruction, which might be difficult to obtain. To address this we want to explore the use of self-supervised reprojection losses [3, 4]. Second, we train our model in a “teacher forced” manner, i.e., during training we use the ground-truth location labels as input to the conditioning layers. At test time, however, we use the predicted location labels as input instead. Note that during training the model has never seen predicted labels and the error patterns therein. Scheduled sampling techniques used to train language generation models might also prove useful in our context to remedy this exposure bias. Lastly, for robust localization in challenging large-scale environments with significant appearance changes between training and test images and global ambiguities, purely scene-specific training of the network might not be sufficient. One could resort to trainable descriptors [43, 44] and modern image retrieval techniques in future work.
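As an illustration of how scheduled sampling might look in this setting, the sketch below mixes ground-truth and predicted labels per pixel, with the probability of using the ground truth decaying linearly over training. This was not part of the experiments; the per-pixel mixing, the linear schedule and all names are assumptions made for the example.

```python
import random

def gt_probability(step, total_steps, floor=0.0):
    """Linearly decay the chance of feeding the ground-truth label (hypothetical schedule)."""
    return max(floor, 1.0 - step / total_steps)

def mix_labels(gt_labels, pred_labels, p_gt, rng):
    """Per pixel, feed the ground-truth label with probability p_gt, else the prediction."""
    return [g if rng.random() < p_gt else p for g, p in zip(gt_labels, pred_labels)]

rng = random.Random(0)
gt_labels = [3, 3, 7, 7]
pred_labels = [3, 5, 7, 1]
early = mix_labels(gt_labels, pred_labels, gt_probability(0, 100), rng)    # pure teacher forcing
late = mix_labels(gt_labels, pred_labels, gt_probability(100, 100), rng)   # predictions only
```

The mixed label map would then replace the ground-truth label map as input to the conditioning parameter generators during training.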
Camera relocalization by computing pairwise relative poses using convolutional neural network. In ICCV Workshops, 2017.
Scheduled sampling for sequence prediction with recurrent neural networks. In NeurIPS, 2015.