Image-based Localization using Hourglass Networks

03/23/2017 ∙ by Iaroslav Melekhov, et al. ∙ Tampereen teknillinen yliopisto

In this paper, we propose an encoder-decoder convolutional neural network (CNN) architecture for estimating camera pose (orientation and location) from a single RGB image. The architecture has an hourglass shape consisting of a chain of convolution and up-convolution layers followed by a regression part. The up-convolution layers are introduced to preserve the fine-grained information of the input image. Following common practice, we train our model in an end-to-end manner utilizing transfer learning from large-scale classification data. The experiments demonstrate the performance of the approach on data exhibiting different lighting conditions, reflections, and motion blur. The results indicate a clear improvement over the previous state-of-the-art, even when compared to methods that utilize a sequence of test frames instead of a single frame.




1 Introduction

Image-based localization, or camera relocalization, refers to the problem of estimating camera pose (orientation and position) from visual data. It plays a key role in many computer vision applications, such as simultaneous localization and mapping (SLAM), structure from motion (SfM), autonomous robot navigation, and augmented and mixed reality. Many relocalization methods have been proposed in the literature. However, many of these approaches are based on finding matches between local features extracted from an input image (usually by applying local image descriptors such as SIFT, ORB, or SURF [18, 23, 2]) and features corresponding to 3D points in a model of the scene. In spite of their popularity, feature-based methods are not able to find matching points accurately in all scenarios. In particular, extremely large viewpoint changes, occlusions, repetitive structures and textureless scenes often produce too many outliers in the matching process. The typical remedy for a high outlier ratio is to apply RANSAC, which unfortunately increases time and computational costs.

The increased computational power of graphics processing units (GPUs) and the availability of large-scale training datasets have made Convolutional Neural Networks (CNNs) the dominant paradigm in various computer vision problems, such as image retrieval [1, 8], object recognition, semantic segmentation, and image classification [17, 10]. For image-based localization, CNNs were considered for the first time by Kendall et al. [15]. Their method, named PoseNet, casts camera relocalization as a regression problem, where the 6-DoF camera pose is directly predicted from a monocular image by leveraging transfer learning from large-scale classification data. Although PoseNet overcomes many limitations of the feature-based approaches, its localization performance still lags behind traditional approaches in typical cases where local features perform well.

Looking for ways to further improve the accuracy of image-based localization using CNN-based architectures, we adopt recent advances from work on image restoration [19], semantic segmentation [22] and human pose estimation [20]. Inspired by these ideas, we propose to add more context to the regression process to better collect the overall information, from coarse structures to fine-grained object details, available in the input image. We argue that this kind of mechanism is suitable for getting an accurate camera pose estimate using CNNs. In detail, we propose a network architecture that consists of a bottom part (the encoder), which encodes the overall context, and a latter part (the decoder), which recovers the fine-grained visual information by up-convolving the output feature map of the encoder, gradually increasing its size towards the original resolution of the input image. Such a symmetric "encoder-decoder" network structure is also known as an hourglass architecture [20].

The contributions of this paper can be summarized as follows:

  • We complement a deep convolutional network by adding a chain of up-convolutional layers with shortcut connections and apply it to the image-based localization problem.

  • The proposed network significantly outperforms the current state-of-the-art methods proposed in the literature for estimating camera pose.

The remainder of this paper is organized as follows. Section 2 discusses related work. In Section 3 we provide the details of the proposed CNN architecture. Section 4 presents the experimental methodology and results on a standard evaluation dataset. We conclude with a summary and ideas for future work.

The source code and trained models will be publicly available upon publication.

Figure 1: Overview of our proposed architecture. It takes an RGB-image as input and predicts the camera pose. The overall network consists of three components, namely encoder, decoder and regressor. The encoder is fully convolutional up until a certain spatial resolution. The decoder then gradually increases the resolution of the feature map which is eventually fed to the regressor that is composed of three fully connected layers.

2 Related Work

Image-based localization can be solved by casting it as a place recognition problem. In this approach, image retrieval techniques are often applied to find similar views of the scene in a database of images for which the camera position is known. The method then estimates an approximate camera pose using the information in the retrieved images. As noted in [30], these methods suffer in situations where there are no strong constraints on the camera motion, because the set of key-frames is often very sparse.

Perhaps a more traditional approach to image-based localization is based on finding correspondences between a query image and a 3D scene model reconstructed using SfM. Given a query image and a 3D model, an essential part of this approach is matching points from 2D to 3D. The main limitation of this approach is that the 3D model may grow too large or too complex if the scene itself is complicated, as in large-scale urban environments. In such scenarios, the ratio of outliers in the matching process often grows too high, which in turn increases the run-time of RANSAC. There are methods to handle this situation, such as prioritizing matching regions in 2D to 3D and/or 3D to 2D and using co-visibility of the query and the model [24].

Applying machine learning techniques has proven very effective in image-based indoor localization. Shotton et al. [25] proposed a method to estimate scene coordinates from an RGB-D input using decision forests. Compared to traditional algorithms based on matching point correspondences, their method removes the need for the traditional pipeline of feature extraction, feature description, and matching. Valentin et al. [30] further improved the method by exploiting uncertainty in the model, moving from sole point estimates to also predicting their uncertainties for more robust continuous pose optimization. Both of these methods are designed for cameras that have an RGB-D sensor.

Very recently, applying deep learning techniques has resulted in remarkable performance improvements in many computer vision problems [1, 19, 22]. Partly motivated by studies applying CNNs and regression [27, 32, 28], Kendall et al. [15] proposed an architecture that directly regresses the camera pose from an input RGB image. More recent CNN-based approaches include those of Clark et al. [4] and Walch et al. [31]. Both of these follow [15] and adopt the same CNN architecture, pre-trained on large-scale image classification data, for extracting features from the input images to be localized. In detail, Walch et al. [31] consider these features as an input sequence to a block of four LSTM units operating along four directions (up, down, left, and right) independently. On top of that, there is a regression part consisting of fully-connected layers for predicting the camera pose. In turn, Clark et al. [4] applied LSTMs to predict camera translation only, but using short videos as input. Their method is a bidirectional recurrent neural network (RNN), which captures dependencies between adjacent image frames, yielding refined accuracy of the global pose. Both architectures improve the accuracy of the 6-DoF camera pose, outperforming PoseNet [15].


Compared to non-CNN based approaches, our method belongs to the very recent initiative of models that do not require any online 3D models in camera pose estimation. In contrast to [25, 30], our method is solely based on monocular RGB images and no depth information is required. Compared to PoseNet [15], our method aims at better utilization of context and provides improvement in pose estimation accuracy. In comparison to [31], our method is more accurate in indoor locations. Finally, our method does not rely on video inputs, but still outperforms the CNN-model presented in [4] for video-clip relocalization.

Figure 2: Visual representation of the categories of 7-Scenes dataset. From left to right: Chess, Fire, Heads, Office, Pumpkin, Red Kitchen and Stairs.

3 Method

Following [15, 31], our goal is to estimate camera pose directly from an RGB image. We propose a CNN architecture that predicts a 7-dimensional camera pose vector $\mathbf{p} = [\mathbf{q}, \mathbf{t}]$ consisting of an orientation component $\mathbf{q} = [q_1, q_2, q_3, q_4]$ represented by quaternions and a translation component $\mathbf{t} = [t_1, t_2, t_3]$.

Leaving the architectural details aside, the overall network structure is illustrated in Fig. 1. The network consists of three components, namely encoder, decoder and regressor. The encoder is fully convolutional, acting as a feature extractor. The decoder consists of up-convolutional layers stacked to recover the fine-grained details of the input from the encoder outputs. Finally, the decoder is followed by the regressor that estimates the camera pose.

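To train our hourglass-shaped CNN model, we apply the following objective function [15]:

$$\mathcal{L} = \left\| \mathbf{t} - \hat{\mathbf{t}} \right\|_2 + \beta \left\| \mathbf{q} - \frac{\hat{\mathbf{q}}}{\left\| \hat{\mathbf{q}} \right\|} \right\|_2 \qquad (1)$$

where $(\mathbf{t}, \mathbf{q})$ and $(\hat{\mathbf{t}}, \hat{\mathbf{q}})$ are ground truth and estimated translation-orientation pairs, respectively. $\beta$ is a scale factor, tunable by grid search, that keeps the orientation and translation error terms nearly equal in magnitude. The quaternion-based orientation vector $\hat{\mathbf{q}}$ is normalized to unit length at test time. Detailed information about the other hyperparameters used in training is provided in Section 4.2.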
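As an illustration of the objective described above, the following is a minimal sketch, assuming the PoseNet-style form: a Euclidean translation error plus a scaled Euclidean distance between the ground truth quaternion and the unit-normalized estimate. The function names and the example values are hypothetical and are not taken from the released implementation.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a sequence of floats."""
    return math.sqrt(sum(x * x for x in vec))


def pose_loss(t_true, q_true, t_pred, q_pred, beta):
    """Translation error plus beta-weighted orientation error.

    Assumed form: ||t - t_hat|| + beta * ||q - q_hat / ||q_hat|| ||.
    """
    q_scale = l2_norm(q_pred)
    q_unit = [x / q_scale for x in q_pred]  # unit-length estimate
    t_err = l2_norm([a - b for a, b in zip(t_true, t_pred)])
    q_err = l2_norm([a - b for a, b in zip(q_true, q_unit)])
    return t_err + beta * q_err


# Hypothetical example: the translation is off by 5 units, while the
# estimated quaternion is correct up to scale.
example_loss = pose_loss([0, 0, 0], [1, 0, 0, 0], [3, 4, 0], [2, 0, 0, 0], beta=10.0)
print(example_loss)
```

A larger scale factor makes orientation mistakes more expensive relative to translation mistakes, which is why the factor needs tuning per scene.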


3.1 CNN Architecture

Training convolutional neural networks from scratch for the image-based localization task is impractical due to the lack of training data. Following [15], we leverage a pre-trained large-scale classification network. Specifically, to find a balance between the number of network parameters and accuracy, we adopt the ResNet34 [10] architecture, which performs well among classification approaches [3], as our base network. We remove the last fully-connected layer from the original ResNet34 model but keep the convolutional and pooling layers intact. The resulting architecture is considered as the encoder part of the whole pipeline.

Instead of connecting the encoder to the regression part directly, we propose to add some extra layers between them. In detail, we add three up-convolutional layers and one convolutional layer. The main idea of using up-convolutional layers is to restore essential fine-grained visual information of the input image lost in the encoder part of the network. Up-convolutional layers have been widely applied in image restoration [19], structure from motion [29] and semantic segmentation [11, 21]. The proposed architecture is presented in Fig. 3.

Finally, there is a regressor module on top of the decoder. The regressor consists of three fully connected layers, namely a localization layer, an orientation layer and a translation layer. In contrast to the regressor originally proposed in [15], we slightly modified its architecture by appending batch-normalization after each fully connected layer.

Inspired by the visualization of the steps of downsampling and upsampling of the feature maps flowing through encoder-decoder part and by [20]’s work, we call our CNN architecture Hourglass-Pose.

3.1.1 Hourglass-Pose

As explained, the encoder part of our architecture is a slightly modified ResNet34 model. It differs from the original one presented in [10] in that the final softmax layer and the last average pooling layer have been removed. As a result the spatial resolution of the encoder feature map is


To better preserve finer details of the input image for the localization task, we added skip (shortcut) connections from each of the four residual blocks of the encoder to the corresponding up-convolution and the final convolution layers of the decoder. The last part of the decoder, namely the final convolutional module (a chain of convolutional, batch-normalization [12] and ReLU layers), does not alter the spatial resolution of the feature map, but is used to decrease the number of channels. In our preliminary experiments, we also experimented with a Spatial Pyramid Pooling (SPP) layer [9] instead of the convolutional module. An SPP layer consists of a set of pooling layers (pyramid levels) producing a fixed-size feature map regardless of the size of the input image. However, the camera pose estimates were not improved, and we omitted SPP in favor of the simpler convolutional module. The encoder-decoder module is followed by a regressor which predicts the camera orientation q and translation t. The detailed network configuration is shown in Table 1.
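To make the hourglass shape concrete, the sketch below traces the spatial size of the feature map through the encoder and the decoder. It assumes the standard ResNet34 layer settings (a 7x7 stride-2 convolution with padding 3, a 3x3 stride-2 max pooling, and residual stages with strides 1, 2, 2, 2), up-convolutions that double the resolution, a 3x3 final convolution, and a hypothetical 224-pixel square input crop. The helper names are illustrative only.

```python
def conv_out(size, kernel, stride, padding):
    """Output size of a convolution or pooling layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_resolution(crop=224):
    """Return the spatial size after each stage (crop size is an assumption)."""
    sizes = {}
    s = conv_out(crop, kernel=7, stride=2, padding=3)
    sizes["conv"] = s
    s = conv_out(s, kernel=3, stride=2, padding=1)
    sizes["pool"] = s
    # Encoder: four residual stages, the last three halve the resolution.
    for i, stride in enumerate((1, 2, 2, 2), start=1):
        s = conv_out(s, kernel=3, stride=stride, padding=1)
        sizes[f"resblock{i}"] = s
    # Decoder: three up-convolutions, each assumed to double the resolution.
    for i in range(1, 4):
        s = s * 2
        sizes[f"upconv{i}"] = s
    # Final convolution keeps the resolution (assumed 3x3, stride 1, padding 1).
    sizes["final_conv"] = conv_out(s, kernel=3, stride=1, padding=1)
    return sizes


for name, size in trace_resolution().items():
    print(f"{name}: {size} x {size}")
```

Under these assumptions each up-convolution output matches the resolution of one earlier residual stage, which is what allows the skip connections to combine the two feature maps.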

In order to investigate the benefits of using skip connections more thoroughly, we experimented with different aggregation strategies of the encoder and the decoder feature maps. In contrast to Hourglass-Pose where the outputs of corresponding layers are concatenated (See Fig. 3), we evaluated the whole pipeline by also calculating an element-wise sum of the feature maps connected via skip connections. We refer to the corresponding architecture as HourglassSum-Pose. Schematic illustration of a decoder-regressor part of this structure is presented in Fig. 4.
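The difference between the two aggregation strategies can be sketched on toy feature maps stored as nested lists indexed by channel, row and column. This is only an illustration of concatenation versus element-wise summation; the function names are hypothetical and the real models operate on tensors.

```python
def shape(fmap):
    """Return (channels, height, width) of a nested-list feature map."""
    return len(fmap), len(fmap[0]), len(fmap[0][0])


def concat_skip(decoder_map, encoder_map):
    """Concatenate along the channel axis (Hourglass-Pose style)."""
    if shape(decoder_map)[1:] != shape(encoder_map)[1:]:
        raise ValueError("spatial sizes must match")
    return decoder_map + encoder_map


def sum_skip(decoder_map, encoder_map):
    """Add element-wise (HourglassSum-Pose style); channel counts must match."""
    if shape(decoder_map) != shape(encoder_map):
        raise ValueError("shapes must match")
    return [
        [[d + e for d, e in zip(d_row, e_row)] for d_row, e_row in zip(d_ch, e_ch)]
        for d_ch, e_ch in zip(decoder_map, encoder_map)
    ]


decoder_map = [[[1, 2], [3, 4]]]      # 1 channel, 2 x 2
encoder_map = [[[10, 20], [30, 40]]]  # 1 channel, 2 x 2
print(shape(concat_skip(decoder_map, encoder_map)))
print(sum_skip(decoder_map, encoder_map))
```

Concatenation grows the channel count, so the following layer needs more input channels, whereas summation keeps the channel count unchanged but requires both maps to have the same number of channels.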

Figure 3: An illustration of the proposed architecture, referred to as Hourglass-Pose, for predicting camera pose. The encoder part is a modified version of ResNet34 [10], where we removed the last fully-connected and average pooling layers from the original ResNet34 architecture and kept only the convolutional layers. The decoder consists of a set of stacked up-convolutional layers gradually increasing the spatial resolution of the feature maps. We further added one convolutional layer for dimensionality reduction. Skip connections connect each block of the encoder to the corresponding parts of the decoder, allowing the decoder to re-utilize features from the earlier layers of the network. Finally, camera pose is estimated by the regressor as explained in Section 3.

Figure 4: The structure of the decoder and the regressor of the HourglassSum-Pose architecture for estimating camera pose. The output of the decoder is connected to the regressor consisting of a set of FC-layers to predict the orientation q and the translation t, respectively. The number of connections of each FC-layer is given in parentheses.
Hourglass-Pose

Module      Layer       Configuration
Encoder     Conv        conv (3 / s2)
            Pool        max (s2)
            ResBlock1   (1 / s1)
            ResBlock2   (1 / s1)
            ResBlock3   (1 / s1)
            ResBlock4   (1 / s1)
Decoder     UpConv1     upconv (1 / s2)
            UpConv2     upconv (1 / s2)
            UpConv3     upconv (1 / s2)
            Conv        conv, 32, (1 / s1)
Regressor   FC          fully-connected, 2048
            FC          fully-connected, 4
            FC          fully-connected, 3

Table 1: Details of our Hourglass-Pose architecture for estimating camera pose. Note that each convolutional/up-convolutional layer in the encoder-decoder part corresponds to a 'Conv-ReLU-BatchNorm' sequence. In the ResBlocks the resolution is downsampled with stride-2 convolutions. Up-convolution is implemented by first upsampling the signal by zero padding and then applying a normal convolution.
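The up-convolution described in the note to Table 1 (upsampling by zero insertion followed by a normal convolution) can be sketched in one dimension. The kernel below is a hypothetical fixed interpolation kernel chosen for readability; in the network the kernel weights are learned, and the exact padding convention is an assumption.

```python
def upsample_with_zeros(signal, stride=2):
    """Insert (stride - 1) zeros after every sample."""
    out = []
    for value in signal:
        out.append(value)
        out.extend([0.0] * (stride - 1))
    return out


def conv1d_same(signal, kernel):
    """Sliding-window weighted sum with zero borders (output length = input length).

    Uses the cross-correlation form that deep-learning libraries call convolution.
    """
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd")
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def upconv1d(signal, kernel, stride=2):
    """Up-convolution: zero-insertion upsampling, then an ordinary convolution."""
    return conv1d_same(upsample_with_zeros(signal, stride), kernel)


print(upconv1d([1.0, 3.0], [0.5, 1.0, 0.5]))
```

With this particular kernel the inserted zeros are replaced by the average of their neighbours, so the operation behaves like linear interpolation; a learned kernel can instead produce whatever upsampling is most useful for the task.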

3.2 Evaluation Dataset

To evaluate our method and compare it with the state-of-the-art approaches, we use the Microsoft 7-Scenes dataset, which contains RGB-D images of 7 different indoor locations [26]. The dataset has been widely used for camera relocalization [6, 15, 31, 4]. The images of the scenes were recorded with the camera of a Kinect device and divided into train and evaluation parts. The ground truth camera poses were obtained by applying the KinectFusion algorithm [13], which produces smooth camera trajectories. Sample images covering all scenes of the dataset are shown in Fig. 2. They represent indoor views of the 7 scenes exhibiting different lighting conditions, textureless objects (the two statues in 'Heads'), repeated structures (the 'Stairs' scene), changes in viewpoint and motion blur. All of these factors make camera pose estimation an extremely challenging problem.

4 Experiments

In this section we empirically demonstrate the effectiveness of the proposed approach on the 7-Scenes evaluation dataset and compare it to other state-of-the-art CNN-based methods. As in [15], we report the median error of camera orientation and translation in our evaluations.
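The reported metrics can be sketched as follows. The translation error is taken as the Euclidean distance between positions, and the orientation error as the rotation angle between two quaternions, computed as twice the arccosine of the absolute normalized dot product. This angular formula is a common convention and an assumption here, since the exact metric is defined by the evaluation protocol of [15]; the function names are hypothetical.

```python
import math
from statistics import median


def translation_error(t_true, t_pred):
    """Euclidean distance between two positions."""
    return math.dist(t_true, t_pred)


def orientation_error_deg(q_true, q_pred):
    """Rotation angle in degrees between two quaternions (any non-zero scale)."""
    dot = sum(a * b for a, b in zip(q_true, q_pred))
    norm = math.sqrt(sum(a * a for a in q_true)) * math.sqrt(sum(b * b for b in q_pred))
    cos_half = min(1.0, abs(dot) / norm)  # abs() treats q and -q as the same rotation
    return math.degrees(2.0 * math.acos(cos_half))


def median_errors(poses_true, poses_pred):
    """Median translation and orientation error over (t, q) pose pairs."""
    t_errs = [translation_error(t, tp) for (t, _), (tp, _) in zip(poses_true, poses_pred)]
    q_errs = [orientation_error_deg(q, qp) for (_, q), (_, qp) in zip(poses_true, poses_pred)]
    return median(t_errs), median(q_errs)
```

The median is less sensitive than the mean to a few grossly wrong pose estimates, which are common in scenes with repetitive structures.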

4.1 Other state-of-the-art approaches

In this work we consider four recently proposed camera relocalization systems based on CNNs.

PoseNet [15] is based on the GoogLeNet [27] architecture. It processes RGB images and is modified so that all three softmax and fully connected layers are removed from the original model and replaced by regressors in the training phase. In the testing phase the two regressors of the lower layers are removed and the prediction is based solely on the regressor on top of the whole network.

Bayesian PoseNet. Kendall et al. [14] propose a Bayesian convolutional neural network to estimate uncertainty in the global camera pose, which leads to improved localization accuracy. The Bayesian convolutional neural network is based on the PoseNet architecture, adding dropout after the fully connected layers in the pose regressor and after one of the inception layers (layer 9) of the GoogLeNet architecture.

LSTM-Pose [31] is otherwise similar to PoseNet, but applies LSTM networks to the output features of the final fully connected layer. In detail, it uses the pre-trained GoogLeNet architecture as a feature extractor followed by four LSTM units applied in the up, down, left and right directions. The outputs of the LSTM units are then concatenated and fed to a regression module consisting of two fully connected layers to predict the camera pose.

VidLoc [4] is a CNN-based system based on short video clips. As in PoseNet and LSTM-Pose, VidLoc incorporates similarly modified pre-trained GoogLeNet model for feature extraction. The output of this module is passed to bidirectional LSTM units predicting the poses for each frame in the sequence by exploiting contextual information in past and future frames.

Scene         Frames (Train)   Frames (Test)
Chess         4000             2000
Fire          2000             2000
Heads         1000             1000
Office        6000             4000
Pumpkin       4000             2000
Red Kitchen   7000             5000
Stairs        2000             1000

Compared methods: PoseNet (ICCV'15) [15], Bayesian PoseNet [14], LSTM-Pose [31], VidLoc [4] (orientation N/A), Hourglass-Pose, HourglassSum-Pose.

Table 2: Performance comparison of two architectures (Hourglass-Pose and HourglassSum-Pose) and state-of-the-art methods on the 7-Scenes evaluation dataset. Numbers are median translation and orientation errors for the entire test subset of each scene. Both models significantly outperform PoseNet [15] and LSTM-Pose [31] in terms of localization. This is a crucial observation, emphasizing the importance of re-utilizing feature maps through direct (skip) connections between the encoder and decoder modules for the image-based relocalization task. A comparison of the Hourglass-Pose and HourglassSum-Pose architectures reveals that element-wise summation is more beneficial than feature concatenation, providing a more accurate camera pose. Remarkably, the proposed models perform even better than the VidLoc [4] approach, which uses a sequence of test frames to estimate camera pose.

4.2 Training Setup

We trained our models for each scene of 7-Scenes dataset according to the data splits provided by [26].

For all of our methods, we initialize the encoder with the weights of ResNet34 [10] pre-trained on ImageNet. The weights of the decoder and the regressor are initialized according to [7]. The initial learning rate is kept for the first 50 epochs. Then, we continue for 40 epochs with a second rate and subsequently decrease it for the last 30 epochs.
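The three-phase schedule (50, 40 and 30 epochs) amounts to a piecewise-constant learning rate. The sketch below shows one way to express it; the three rate values are placeholders chosen for illustration and are not the values used in the experiments.

```python
def learning_rate(epoch, rates=(1e-3, 1e-4, 1e-5), durations=(50, 40, 30)):
    """Piecewise-constant schedule; `rates` are placeholder values."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    boundary = 0
    for rate, duration in zip(rates, durations):
        boundary += duration
        if epoch < boundary:
            return rate
    raise ValueError("epoch beyond the end of the schedule")


# Epochs are counted from zero, so the phase changes happen at epochs 50 and 90.
print([learning_rate(e) for e in (0, 49, 50, 89, 90, 119)])
```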

As a preprocessing step, all images of the evaluation dataset are rescaled so that the smaller side of the image is always 256 pixels. We calculate mean and standard deviation of pixel intensities separately for each scene and use them to normalize intensity value of every pixel in the input image.

We trained our models using random crops and performed the evaluation using central crops at test time. All experiments were conducted on two NVIDIA Titan X GPUs with data parallelism using Torch7 [5]. We minimize the loss function (1) over the training part of each scene of the evaluation dataset using Adam [16]. The scale factor in (1) is varied over a range of values. Training mini-batches are randomly shuffled at the beginning of each training epoch. We further used weight decay and dropout. These parameter values were kept fixed during our experiments.
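The preprocessing steps of this section (rescaling so that the smaller side is 256 pixels, per-scene intensity normalization, random crops for training and central crops for testing) can be sketched with plain arithmetic. The 640x480 input size and the 224-pixel crop used in the example are hypothetical values for illustration, and the helper names are not from the released code.

```python
import random


def resized_shape(width, height, short_side=256):
    """Image size after rescaling so that the smaller side equals short_side."""
    scale = short_side / min(width, height)
    return round(width * scale), round(height * scale)


def central_crop_box(width, height, crop):
    """(left, top, right, bottom) of a centered square crop, used at test time."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop


def random_crop_box(width, height, crop, rng=random):
    """(left, top, right, bottom) of a random square crop, used in training."""
    left = rng.randint(0, width - crop)
    top = rng.randint(0, height - crop)
    return left, top, left + crop, top + crop


def normalize(pixel, mean, std):
    """Normalize one intensity with the per-scene mean and standard deviation."""
    return (pixel - mean) / std


w, h = resized_shape(640, 480)  # hypothetical input size
print((w, h), central_crop_box(w, h, crop=224))
```

Random crops act as a light form of data augmentation during training, while the deterministic central crop makes the test-time evaluation repeatable.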

4.3 Results

To compare the Hourglass-Pose and HourglassSum-Pose architectures with other state-of-the-art methods, we follow the evaluation protocol presented in [15]. Specifically, we report the median error of camera pose estimates for all scenes of the 7-Scenes dataset. As in [14, 31, 4], we also provide the average median orientation and translation error.

Table 2 shows the performance of our approaches along with the other state-of-the-art methods. The values for other methods are taken from [15], [14], [31], and [4]. Several conclusions can be drawn from the results. First, our architectures clearly outperform the other state-of-the-art CNN-based approaches. In terms of average error, HourglassSum-Pose improves the accuracy of the camera position by 52.27% and orientation by 8.47% with respect to PoseNet. Furthermore, HourglassSum-Pose achieves better orientation accuracy than LSTM-Pose [31] in all scenes of the evaluation dataset. Both of our architectures are even competitive with VidLoc [4], which is based on a sequence of frames: our methods improve the average position error by 1 cm and 2 cm. The results in Table 2 confirm that it is beneficial to utilize an hourglass architecture for image-based localization.
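The percentage improvements quoted above are relative error reductions with respect to a baseline. A minimal sketch of that arithmetic follows; the numbers in the example are hypothetical and are not entries of Table 2.

```python
def relative_improvement(baseline_error, new_error):
    """Error reduction relative to the baseline, in percent."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical example: halving an average error corresponds to a 50% improvement.
print(relative_improvement(0.5, 0.25))
```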

For a more detailed comparison, we plot a family of cumulative histogram curves for all scenes of the evaluation dataset in Fig. 5. We note that both hourglass architectures outperform the PoseNet method on translation accuracy by a factor of 1.5 to 2.3 in all test scenes. Besides that, HourglassSum-Pose substantially improves orientation accuracy. The only exceptions are the 'Office' and 'Red Kitchen' scenes, where the performance of HourglassSum-Pose is on par with PoseNet.

Figure 6 shows histograms of localization accuracy for both orientation (left) and position (right) for two entire test scenes of the evaluation dataset. It is interesting to see that more than 60% of the camera pose estimates produced by HourglassSum-Pose are within 20 cm in the 'Chess' scene, while for PoseNet this fraction is 5%. Remarkably, HourglassSum-Pose is able to improve accuracy even for such an ambiguous and challenging scene as 'Stairs', which exhibits many repetitive structures (see Fig. 6(b)). The presented results verify that an hourglass neural architecture is an efficient and promising approach for image-based localization.
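Statements such as "more than 60% of the estimates are within 20 cm" correspond to one point on a normalized cumulative histogram of errors. The following sketch computes such fractions for a list of errors; the error values and thresholds in the example are hypothetical.

```python
def fraction_within(errors, threshold):
    """Fraction of errors that do not exceed the threshold."""
    return sum(1 for e in errors if e <= threshold) / len(errors)


def cumulative_histogram(errors, thresholds):
    """Normalized cumulative histogram evaluated at the given thresholds."""
    return [fraction_within(errors, t) for t in thresholds]


# Hypothetical translation errors in metres.
errors_m = [0.05, 0.10, 0.30, 0.50]
print(cumulative_histogram(errors_m, thresholds=[0.1, 0.2, 0.4, 1.0]))
```

A curve that rises faster indicates a method whose errors are concentrated at small values, which is how the comparison between the hourglass models and PoseNet is read from the plots.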

Figure 5: Localization performance of the proposed hourglass-based network architectures (Hourglass-Pose and HourglassSum-Pose) presented as normalized cumulative histograms of errors for all scenes of the 7-Scenes dataset. Panels: (a) Chess, (b) Fire, (c) Heads, (d) Office, (e) Pumpkin, (f) Red Kitchen, (g) Stairs. One important conclusion is that both architectures significantly improve the accuracy of the estimated camera location, clearly outperforming the state-of-the-art method (PoseNet). HourglassSum-Pose achieves better orientation performance than Hourglass-Pose in 5 cases.

Figure 6: Histograms of orientation (left) and translation (right) errors of two approaches (PoseNet and HourglassSum-Pose) for two entire scenes ('Chess' and 'Stairs') of the evaluation dataset. Panels: (a) Chess, (b) Stairs. It is clearly seen that the hourglass-architecture-based method performs consistently better than PoseNet.

5 Conclusion

In this paper, we have presented an end-to-end trainable CNN-based approach for image-based localization. One of the key aspects of this work is applying an encoder-decoder (hourglass) architecture, consisting of a chain of convolutional and up-convolutional layers, to estimating the 6-DoF camera pose. Furthermore, we propose to use direct connections that forward feature maps from the early residual layers of the model to the later up-convolutional layers, which improves the accuracy. We studied two hourglass models and showed that they significantly outperform other state-of-the-art CNN-based approaches to image-based localization.