1 Introduction
Image-based localization, or camera relocalization, refers to the problem of estimating camera pose (orientation and position) from visual data. It plays a key role in many computer vision applications, such as simultaneous localization and mapping (SLAM), structure from motion (SfM), autonomous robot navigation, and augmented and mixed reality. Numerous relocalization methods have been proposed in the literature. However, many of these approaches are based on finding matches between local features extracted from an input image (usually with local image descriptors such as SIFT, ORB, or SURF [18, 23, 2]) and features corresponding to 3D points in a model of the scene. Despite their popularity, feature-based methods are not able to find matching points accurately in all scenarios. In particular, extremely large viewpoint changes, occlusions, repetitive structures, and textureless scenes often produce simply too many outliers in the matching process. To cope with many outliers, the typical remedy is to apply RANSAC, which unfortunately increases running time and computational cost.
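The cost of coping with outliers can be made concrete with the standard RANSAC iteration bound: to draw at least one all-inlier minimal sample with confidence p, roughly N = log(1 - p) / log(1 - w^s) iterations are needed, where w is the inlier ratio and s is the minimal sample size. The sketch below (the inlier ratios are hypothetical) shows how quickly N grows as matching quality degrades:

```python
import math

def ransac_iterations(inlier_ratio: float, sample_size: int,
                      confidence: float = 0.99) -> int:
    """Number of RANSAC iterations needed to draw at least one
    all-inlier minimal sample with the given confidence."""
    # Probability that a single random minimal sample is outlier-free.
    p_good_sample = inlier_ratio ** sample_size
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good_sample))

# Minimal PnP sample size is 3 points (P3P); the inlier ratios are hypothetical.
for w in (0.8, 0.5, 0.2):
    print(w, ransac_iterations(w, sample_size=3))
```

As the inlier ratio drops from 0.8 to 0.2, the required iteration count grows by roughly two orders of magnitude, which is exactly the runtime penalty the text refers to.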
The increased computational power of graphics processing units (GPUs) and the availability of large-scale training datasets have made Convolutional Neural Networks (CNNs) the dominant paradigm in various computer vision problems, such as image retrieval [1, 8], object recognition, semantic segmentation, and image classification [17, 10]. For image-based localization, CNNs were considered for the first time by Kendall et al. [15]. Their method, named PoseNet, casts camera relocalization as a regression problem, where the 6-DoF camera pose is directly predicted from a monocular image by leveraging transfer learning from large-scale classification data. Although PoseNet overcomes many limitations of the feature-based approaches, its localization performance still lags behind traditional approaches in typical cases where local features perform well.
Looking for possible ways to further improve the accuracy of image-based localization using CNN-based architectures, we adopt some recent advances from work on image restoration [19], semantic segmentation [22], and human pose estimation [20]. Inspired by these ideas, we propose to add more context to the regression process to better capture the overall information available in the input image, from coarse structures to fine-grained object details. We argue that this kind of mechanism is well suited to obtaining an accurate camera pose estimate using CNNs. In detail, we propose a network architecture consisting of a bottom part (the encoder) that encodes the overall context and a latter part (the decoder) that recovers fine-grained visual information by up-convolving the output feature map of the encoder, gradually increasing its size towards the original resolution of the input image. Such a symmetric "encoder-decoder" network structure is also known as an hourglass architecture [20].
The contributions of this paper can be summarized as follows:

We complement a deep convolutional network with a chain of up-convolutional layers with shortcut connections and apply it to the image-based localization problem.

The proposed network significantly outperforms the current state-of-the-art methods for estimating camera pose.
The remainder of this paper is organized as follows. Section 2 discusses related work. In Section 3 we provide the details of the proposed CNN architecture. Section 4 presents the experimental methodology and results on a standard evaluation dataset. We conclude with a summary and ideas for future work.
The source code and trained models will be publicly available upon publication.
2 Related Work
Image-based localization can be solved by casting it as a place recognition problem. In this approach, image retrieval techniques are often applied to find similar views of the scene in a database of images for which the camera pose is known. The method then estimates an approximate camera pose using the information in the retrieved images. As noted in [30], these methods suffer in situations where there are no strong constraints on the camera motion, because the set of keyframes is often very sparse.
Perhaps a more traditional approach to image-based localization is based on finding correspondences between a query image and a 3D scene model reconstructed using SfM. Given a query image and a 3D model, an essential part of this approach is matching points from 2D to 3D. The main limitation of this approach is that the 3D model may eventually grow too large or too complex if the scene itself is complicated, as in large-scale urban environments. In such scenarios, the ratio of outliers in the matching process often grows too high, which in turn increases the runtime of RANSAC. There are methods to handle this situation, such as prioritizing matching regions in 2D-to-3D and/or 3D-to-2D matching and exploiting co-visibility of the query and the model [24].
Applying machine learning techniques has proven very effective in image-based indoor localization. Shotton et al. [25] proposed a method to estimate scene coordinates from an RGB-D input using decision forests. Compared to traditional algorithms based on matching point correspondences, their method removes the need for the traditional pipeline of feature extraction, feature description, and matching. Valentin et al. [30] further improved the method by exploiting uncertainty in the model, moving from sole point estimates to predicting their uncertainties as well, for more robust continuous pose optimization. Both of these methods are designed for cameras that have an RGB-D sensor.
Very recently, applying deep learning techniques has resulted in remarkable performance improvements in many computer vision problems
[1, 19, 22]. Partly motivated by studies applying CNNs to regression [27, 32, 28], Kendall et al. [15] proposed an architecture that directly regresses the camera pose from an input RGB image. More recent CNN-based approaches include those of Clark et al. [4] and Walch et al. [31]. Both follow [15] and similarly adopt the same CNN architecture, pretrained first on large-scale image classification data, for extracting features from the input images to be localized. In detail, Walch et al. [31] feed these features as an input sequence to a block of four LSTM units operating along four directions (up, down, left, and right) independently, followed by a regression part with fully-connected layers for predicting the camera pose. In turn, Clark et al. [4] applied LSTMs to predict camera translation only, but using short videos as input. Their method is a bidirectional recurrent neural network (RNN), which captures dependencies between adjacent image frames, yielding refined accuracy of the global pose. Both architectures improve the accuracy of 6-DoF camera pose estimation, outperforming PoseNet [15].
Compared to non-CNN-based approaches, our method belongs to the very recent family of models that do not require any online 3D models for camera pose estimation. In contrast to [25, 30], our method is solely based on monocular RGB images, and no depth information is required. Compared to PoseNet [15], our method makes better use of context and improves pose estimation accuracy. In comparison to [31], our method is more accurate in indoor locations. Finally, our method does not rely on video inputs, but still outperforms the CNN model presented in [4] for video-clip relocalization.
3 Method
In this work, our goal is to estimate the camera pose directly from an RGB image. We propose a CNN architecture that predicts a 7-dimensional camera pose vector consisting of an orientation component, represented by a quaternion, and a 3-dimensional translation component.
Hiding the architectural details, the overall network structure is illustrated in Fig. 1. The network consists of three components, namely encoder, decoder, and regressor. The encoder is fully convolutional and acts as a feature extractor. The decoder consists of stacked up-convolutional layers that recover the fine-grained details of the input from the encoder outputs. Finally, the decoder is followed by the regressor that estimates the camera pose.
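As a rough illustration of the hourglass shape (the exact layer configuration is given later in Table 1), one can trace how the spatial resolution shrinks through the encoder and is restored by the decoder. The sketch below is a simplification under assumed values: 224-pixel input crops, five stride-2 encoder stages (as in a ResNet-style backbone), and three up-convolutions:

```python
def encoder_decoder_resolutions(input_size: int,
                                encoder_stride2_stages: int,
                                upconv_stages: int):
    """Trace spatial resolution through a symmetric hourglass network:
    each encoder stage halves the feature map, each up-convolution doubles it."""
    sizes = [input_size]
    for _ in range(encoder_stride2_stages):
        sizes.append(sizes[-1] // 2)  # stride-2 convolution or pooling
    for _ in range(upconv_stages):
        sizes.append(sizes[-1] * 2)   # stride-2 up-convolution
    return sizes

# Hypothetical configuration: 224-pixel crops, five stride-2 encoder stages,
# three up-convolutions in the decoder.
print(encoder_decoder_resolutions(224, 5, 3))
```

Each stride-2 stage halves the resolution and each up-convolution doubles it, producing the symmetric "hourglass" profile that gives the architecture its name.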
To train our hourglass-shaped CNN model, we apply the following objective function [15]:
$\mathcal{L} = \lVert \hat{\mathbf{t}} - \mathbf{t} \rVert_2 + \beta \lVert \hat{\mathbf{q}} - \mathbf{q} \rVert_2$ (1)
where $(\mathbf{t}, \mathbf{q})$ and $(\hat{\mathbf{t}}, \hat{\mathbf{q}})$ are the ground-truth and estimated translation-orientation pairs, respectively, and $\beta$ is a scale factor, tuned by grid search, that keeps the estimated orientation and translation errors nearly equal. The quaternion-based orientation vector $\hat{\mathbf{q}}$ is normalized to unit length at test time. We provide detailed information about the other hyperparameters used in training in Section 4.
3.1 CNN Architecture
Training convolutional neural networks from scratch for the image-based localization task is impractical due to the lack of training data. Following [15], we leverage a pretrained large-scale classification network. Specifically, to balance the number of network parameters against accuracy, we adopt the ResNet34 architecture [10], which performs well among classification approaches [3], as our base network. We remove the last fully-connected layer from the original ResNet34 model but keep the convolutional and pooling layers intact. The resulting architecture constitutes the encoder part of the whole pipeline.
Instead of connecting the encoder to the regression part directly, we propose to add extra layers between them. In detail, we add three up-convolutional layers and one convolutional layer. The main idea of using up-convolutional layers is to restore essential fine-grained visual information of the input image that is lost in the encoder part of the network. Up-convolutional layers have been widely applied in image restoration [19], structure from motion [29], and semantic segmentation [11, 21]. The proposed architecture is presented in Fig. 3.
Finally, there is a regressor module on top of the decoder. The regressor consists of three fully-connected layers, namely the localization, orientation, and translation layers. In contrast to the regressor originally proposed in [15], we slightly modified its architecture by appending batch normalization after each fully-connected layer.
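A minimal sketch of the training objective, assuming the PoseNet-style loss L = ||t_pred - t||_2 + beta * ||q_pred - q||_2 referred to in Eq. (1); the poses and the value of beta below are purely hypothetical:

```python
import math

def l2(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def normalize(q):
    """Scale a quaternion to unit length (done for the prediction at test time)."""
    n = math.sqrt(sum(x * x for x in q))
    return [x / n for x in q]

def pose_loss(t_pred, q_pred, t_gt, q_gt, beta):
    """Weighted sum of translation and orientation errors."""
    return l2(t_pred, t_gt) + beta * l2(q_pred, normalize(q_gt))

# Hypothetical poses: identity rotation, a small translation offset.
t_gt, q_gt = [0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]
t_pred, q_pred = [0.1, 0.0, 0.0], [0.99, 0.01, 0.0, 0.0]
print(pose_loss(t_pred, q_pred, t_gt, q_gt, beta=500.0))
```

The scale factor beta balances the two terms, since orientation residuals are numerically much smaller than translation residuals measured in meters.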
Inspired by the visualization of the downsampling and upsampling of the feature maps flowing through the encoder-decoder part, and by the work of Newell et al. [20], we call our CNN architecture Hourglass-Pose.
3.1.1 Hourglass-Pose
As explained, the encoder part of our architecture is the slightly modified ResNet34 model. It differs from the original one presented in [10] in that the final softmax layer and the last average pooling layer have been removed. As a result, the spatial resolution of the encoder output feature map is substantially reduced relative to the input.
To better preserve finer details of the input image for the localization task, we added skip (shortcut) connections from each of the four residual blocks of the encoder to the corresponding up-convolution and final convolution layers of the decoder. The last part of the decoder, namely the final convolutional module (a chain of convolutional, batch-normalization [12], and ReLU layers), does not alter the spatial resolution of the feature map, but is used to decrease the number of channels. In our preliminary experiments, we also tried a Spatial Pyramid Pooling (SPP) layer [9] instead of the convolutional module. In particular, an SPP layer consists of a set of pooling layers (pyramid levels) producing a fixed-size feature map regardless of the size of the input image. However, the camera pose estimates were not improved, and we omitted SPP in favor of the simpler convolutional module. The encoder-decoder module is followed by a regressor which predicts the camera orientation q and translation t. The detailed network configuration is shown in Table 1.
In order to investigate the benefits of skip connections more thoroughly, we experimented with different strategies for aggregating the encoder and decoder feature maps. In contrast to Hourglass-Pose, where the outputs of corresponding layers are concatenated (see Fig. 3), we also evaluated the whole pipeline with an element-wise sum of the feature maps connected via skip connections. We refer to the corresponding architecture as Hourglass-Sum-Pose. A schematic illustration of the decoder-regressor part of this structure is presented in Fig. 4.
Module     Layers      Output Size   Hourglass-Pose
Encoder    Conv        -             (3 / s2)
           Pool        -             max (s2)
           ResBlock1   -             (1 / s1)
           ResBlock2   -             (1 / s1)
           ResBlock3   -             (1 / s1)
           ResBlock4   -             (1 / s1)
Decoder    UpConv1     -             upconv (1 / s2)
           UpConv2     -             upconv (1 / s2)
           UpConv3     -             upconv (1 / s2)
           Conv        -             32, (1 / s1)
Regressor  FC          2048          fully-connected
           FC          4             fully-connected
           FC          3             fully-connected
Table 1: Details of our Hourglass-Pose architecture for estimating camera pose. Note that each convolutional/up-convolutional layer in the encoder-decoder part corresponds to a 'Conv-ReLU-BatchNorm' sequence. In ResBlocks, the resolution is downsampled with stride-2 convolutions. Up-convolution is implemented by first upsampling the signal by zero padding and then applying a normal convolution.
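The up-convolution described in the table caption (zero-padding upsampling followed by a normal convolution) can be sketched in one dimension; the averaging kernel is hypothetical and chosen only to make the interpolation effect visible:

```python
def zero_upsample(signal, factor=2):
    """Interleave zeros so the signal grows by `factor` (stride-2 upsampling)."""
    out = []
    for x in signal:
        out.append(x)
        out.extend([0.0] * (factor - 1))
    return out

def conv1d(signal, kernel):
    """'Same'-size 1-D convolution with zero border padding."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + signal + [0.0] * pad
    return [sum(padded[i + j] * kernel[j] for j in range(len(kernel)))
            for i in range(len(signal))]

def upconv1d(signal, kernel):
    """Up-convolution: zero-interleaved upsampling, then a normal convolution."""
    return conv1d(zero_upsample(signal), kernel)

# A simple averaging kernel fills the inserted zeros by interpolation.
print(upconv1d([1.0, 2.0, 3.0], kernel=[0.5, 1.0, 0.5]))
```

The learned convolution kernel plays the role of the averaging kernel here: it fills the inserted zeros with values inferred from the neighboring activations, doubling the spatial resolution.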
3.2 Evaluation Dataset
To evaluate our method and compare with state-of-the-art approaches, we utilize the Microsoft 7-Scenes dataset containing RGB-D images of 7 different indoor locations [26]. The dataset has been widely used for camera relocalization [6, 15, 31, 4]. The images of the scenes were recorded with the camera of a Kinect device at 640 x 480 resolution and divided into training and evaluation parts accordingly. The ground-truth camera poses were obtained by applying the KinectFusion algorithm [13], producing smooth camera trajectories. Sample images covering all scenes of the dataset are illustrated in Fig. 2. They represent indoor views of the 7 scenes exhibiting different lighting conditions, textureless surfaces (the two statues in 'Heads'), repetitive structures (the 'Stairs' scene), changes in viewpoint, and motion blur. All of these factors make camera pose estimation an extremely challenging problem.
4 Experiments
In the following section we empirically demonstrate the effectiveness of the proposed approach on the 7-Scenes evaluation dataset and compare it to other state-of-the-art CNN-based methods. As in [15], we report the median error of camera orientation and translation in our evaluations.
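The reported metrics can be sketched as follows: the translation error is the Euclidean distance between estimated and ground-truth positions, and the orientation error is the angle between unit quaternions, theta = 2 * arccos(|<q1, q2>|); the per-frame errors below are hypothetical:

```python
import math
from statistics import median

def translation_error(t1, t2):
    """Euclidean distance between camera positions (meters)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t1, t2)))

def orientation_error(q1, q2):
    """Angle (degrees) between two unit quaternions."""
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    return math.degrees(2.0 * math.acos(min(1.0, dot)))

# Hypothetical per-frame errors for one scene; the median is reported, as in [15].
t_errors = [translation_error((0, 0, 0), (0.1, 0, 0)),
            translation_error((0, 0, 0), (0, 0.3, 0)),
            translation_error((0, 0, 0), (0, 0, 0.2))]
print(median(t_errors))
half = math.sqrt(0.5)
print(orientation_error((1, 0, 0, 0), (half, half, 0, 0)))
```

The median is preferred over the mean here because a handful of grossly mislocalized frames would otherwise dominate the statistic.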
4.1 Other state-of-the-art approaches
In this work we consider four recently proposed CNN-based 6-DoF camera relocalization systems.
PoseNet [15] is based on the GoogLeNet [27] architecture. It processes RGB images, and the original model is modified so that all three softmax and fully-connected layers are removed and replaced by regressors in the training phase. In the testing phase, the two regressors of the lower layers are removed and the prediction is made solely by the regressor on top of the whole network.
Bayesian PoseNet Kendall and Cipolla [14] proposed a Bayesian convolutional neural network to estimate the uncertainty in the global camera pose, which leads to improved localization accuracy. The Bayesian network is based on the PoseNet architecture, adding dropout after the fully-connected layers in the pose regressor and after one of the inception modules (layer 9) of the GoogLeNet architecture.
LSTM-Pose [31] is otherwise similar to PoseNet, but applies LSTM networks to the output features of the final fully-connected layer. In detail, it utilizes the pretrained GoogLeNet architecture as a feature extractor, followed by four LSTM units applied in the up, down, left, and right directions. The outputs of the LSTM units are then concatenated and fed to a regression module consisting of two fully-connected layers to predict the camera pose.
VidLoc [4] is a CNN-based system operating on short video clips. As in PoseNet and LSTM-Pose, VidLoc incorporates a similarly modified pretrained GoogLeNet model for feature extraction. The output of this module is passed to bidirectional LSTM units predicting the poses for each frame in the sequence by exploiting contextual information in past and future frames.
Table 2: Median localization errors (translation in m, orientation in degrees) on the 7-Scenes dataset for PoseNet (ICCV'15) [15], Bayesian PoseNet [14], LSTM-Pose [31], VidLoc [4] (orientation not reported), and our Hourglass-Pose and Hourglass-Sum-Pose, together with the spatial extent of each scene and the number of training/test frames: Chess (4000/2000), Fire (2000/2000), Heads (1000/1000), Office (6000/4000), Pumpkin (4000/2000), Red Kitchen (7000/5000), Stairs (2000/1000), plus the average over all scenes.
4.2 Training Setup
We trained our models for each scene of the 7-Scenes dataset according to the data splits provided by [26].
For all of our methods, we initialize the encoder part with the weights of ResNet34 [10] pretrained on ImageNet. The weights of the decoder and the regressor are initialized according to [7]. The initial learning rate is kept for the first 50 epochs; we then continue with a lower rate for 40 epochs and decrease it once more for the last 30 epochs.
As a preprocessing step, all images of the evaluation dataset are rescaled so that the smaller side of the image is always 256 pixels. We calculate the mean and standard deviation of pixel intensities separately for each scene and use them to normalize the intensity value of every pixel in the input image.
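This per-scene normalization step can be sketched as follows (the tiny "scene" of flattened grayscale images is hypothetical; a real implementation would operate on full RGB frames):

```python
from statistics import mean, pstdev

def scene_statistics(images):
    """Mean and standard deviation of pixel intensities over a whole scene."""
    pixels = [p for img in images for p in img]
    return mean(pixels), pstdev(pixels)

def normalize_image(img, mu, sigma):
    """Zero-mean, unit-variance normalization with scene-level statistics."""
    return [(p - mu) / sigma for p in img]

# Hypothetical tiny 'scene' of two flattened grayscale images.
scene = [[0.0, 0.5, 1.0], [0.25, 0.5, 0.75]]
mu, sigma = scene_statistics(scene)
normalized = [normalize_image(img, mu, sigma) for img in scene]
print(mu, sigma)
```

Computing the statistics once per scene, rather than per image, keeps the normalization consistent between the training and test splits of that scene.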
We trained our models using random crops and performed the evaluation using central crops at test time. All experiments were conducted on two NVIDIA Titan X GPUs with data parallelism using Torch7 [5]. We minimize the loss function (1) over the training part of each scene of the evaluation dataset using Adam [16]. The scale factor in (1) is chosen per scene by grid search. Training mini-batches are randomly shuffled at the beginning of each training epoch. We further applied weight decay and dropout with a fixed mini-batch size. These parameter values were kept fixed during our experiments.
4.3 Results
To compare the Hourglass-Pose and Hourglass-Sum-Pose architectures with other state-of-the-art methods, we follow the evaluation protocol presented in [15]. Specifically, we report the median error of the camera pose estimates for all scenes of the 7-Scenes dataset. As in [14, 31, 4], we also provide the average median orientation and translation errors.
Table 2 shows the performance of our approaches along with the other state-of-the-art methods. The values for the other methods are taken from [15], [14], [31], and [4]. Several conclusions can be drawn from the results. First, our architectures clearly outperform the other state-of-the-art CNN-based approaches. Overall, Hourglass-Sum-Pose improves the average position accuracy by 52.27% and the orientation accuracy by 8.47% with respect to PoseNet. Furthermore, Hourglass-Sum-Pose achieves better orientation accuracy than LSTM-Pose [31] in all scenes of the evaluation dataset. Both of our architectures are even competitive with VidLoc [4], which is based on a sequence of frames; our methods improve the average position error by 1 cm and 2 cm, respectively. The results in Table 2 confirm that it is beneficial to utilize an hourglass architecture for image-based localization.
For a more detailed comparison, we plot a family of cumulative histogram curves for all scenes of the evaluation dataset, illustrated in Fig. 5. We note that both hourglass architectures outperform PoseNet on translation accuracy by a factor of 1.5 to 2.3 in all test scenes. Besides that, Hourglass-Sum-Pose substantially improves orientation accuracy. The only exceptions are the 'Office' and 'Red Kitchen' scenes, where the performance of Hourglass-Sum-Pose is on par with PoseNet.
Figure 6 shows histograms of localization accuracy for both orientation (left) and position (right) for two entire test scenes of the evaluation dataset. It is interesting to see that more than 60% of the camera pose estimates produced by Hourglass-Sum-Pose are within 20 cm in the 'Chess' scene, while for PoseNet this fraction is only 5%. Remarkably, Hourglass-Sum-Pose is able to improve accuracy even for such an ambiguous and challenging scene as 'Stairs', which exhibits many repetitive structures (see Fig. 5(b)). The presented results verify that an hourglass neural architecture is an efficient and promising approach for image-based localization.
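The cumulative curves of Fig. 5 and the histograms of Fig. 6 summarize, for each error threshold, the fraction of test frames localized within that threshold; a sketch with hypothetical per-frame errors:

```python
def fraction_within(errors, threshold):
    """Fraction of test frames whose error does not exceed the threshold."""
    return sum(e <= threshold for e in errors) / len(errors)

def cumulative_curve(errors, thresholds):
    """One point per threshold, i.e. an empirical CDF of localization error."""
    return [(t, fraction_within(errors, t)) for t in thresholds]

# Hypothetical per-frame translation errors in meters.
errors = [0.05, 0.12, 0.18, 0.25, 0.40]
print(cumulative_curve(errors, thresholds=[0.1, 0.2, 0.3]))
```

Plotting such curves, rather than a single median, reveals how the methods compare across the whole error range, which is exactly what Fig. 5 conveys.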
5 Conclusion
In this paper, we have presented an end-to-end trainable CNN-based approach for image-based localization. One of the key aspects of this work is applying an encoder-decoder (hourglass) architecture consisting of a chain of convolutional and up-convolutional layers for estimating the 6-DoF camera pose. Furthermore, we propose to use direct connections forwarding feature maps from early residual layers of the model to the later up-convolutional layers, improving the accuracy. We studied two hourglass models and showed that they significantly outperform other state-of-the-art CNN-based image-based localization approaches.
References
 [1] A. Babenko, A. Slesarev, A. Chigorin, and V. S. Lempitsky. Neural codes for image retrieval. In Computer Vision - ECCV, 2014.
 [2] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features (SURF). Comput. Vis. Image Underst., 110(3):346–359, 2008.
 [3] A. Canziani, A. Paszke, and E. Culurciello. An analysis of deep neural network models for practical applications. CoRR, abs/1605.07678, 2016.
 [4] R. Clark, S. Wang, A. Markham, N. Trigoni, and H. Wen. VidLoc: 6DoF videoclip relocalization. CoRR, abs/1702.06521v1, 2017.
 [5] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.
 [6] B. Glocker, S. Izadi, J. Shotton, and A. Criminisi. Real-time RGB-D camera relocalization. In International Symposium on Mixed and Augmented Reality (ISMAR). IEEE, 2013.
 [7] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proc.AISTATS, 2010.
 [8] A. Gordo, J. Almazán, J. Revaud, and D. Larlus. Deep image retrieval: Learning global representations for image search. CoRR, abs/1604.01325, 2016.
 [9] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. CoRR, abs/1406.4729, 2014.
 [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [11] S. Hong, H. Noh, and B. Han. Decoupled deep neural network for semisupervised semantic segmentation. In NIPS, 2015.
 [12] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
 [13] S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davison, and A. Fitzgibbon. KinectFusion: Real-time 3D reconstruction and interaction using a moving depth camera. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, UIST '11, 2011.
 [14] A. Kendall and R. Cipolla. Modelling uncertainty in deep learning for camera relocalization. In Proceedings of the International Conference on Robotics and Automation (ICRA), 2016.
 [15] A. Kendall, M. Grimes, and R. Cipolla. Convolutional networks for real-time 6-DOF camera relocalization. CoRR, abs/1505.07427, 2015.
 [16] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
 [17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in NIPS, 2012.
 [18] D. G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60(2):91–110, Nov. 2004.
 [19] X. Mao, C. Shen, and Y. Yang. Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In Advances in NIPS, 2016.
 [20] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. CoRR, abs/1603.06937, 2016.
 [21] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In ICCV, 2015.
 [22] P. O. Pinheiro, T. Lin, R. Collobert, and P. Dollár. Learning to refine object segments. In ECCV, 2016.
 [23] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: An efficient alternative to SIFT or SURF. In Proc. ICCV, 2011.
 [24] T. Sattler, B. Leibe, and L. Kobbelt. Efficient and effective prioritized matching for large-scale image-based localization. IEEE TPAMI, 2016.
 [25] J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon. Scene coordinate regression forests for camera relocalization in RGB-D images. In Proc. CVPR, pages 2930–2937, 2013.

 [26] J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon. Scene coordinate regression forests for camera relocalization in RGB-D images. In Proc. Computer Vision and Pattern Recognition (CVPR). IEEE, 2013.
 [27] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
 [28] D. Tomè, C. Russell, and L. Agapito. Lifting from the deep: Convolutional 3d pose estimation from a single image. CoRR, abs/1701.00295, 2017.
 [29] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. DeMoN: Depth and motion network for learning monocular stereo. In Proc. Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.
 [30] J. Valentin, M. Niebner, J. Shotton, and P. Torr. Exploiting uncertainty in regression forests for accurate camera relocalization. In Proc. CVPR, 2015.
 [31] F. Walch, C. Hazirbas, L. Leal-Taixé, T. Sattler, S. Hilsenbeck, and D. Cremers. Image-based localization with spatial LSTMs. CoRR, abs/1611.07890, 2016.
 [32] X. Xi, Y. Luo, F. Li, P. Wang, and H. Qiao. A fast and compact saliency score regression network based on fully convolutional network. CoRR, abs/1702.00615, 2017.