In Convolutional Neural Networks (CNNs) and other Neural Network (NN) based architectures, a ‘loss’ function quantifies the error between the ground truths and each of the NN’s predictions. This scalar quantity is used during backpropagation, essentially ‘informing’ the NN how to adjust its trainable parameters. Naturally, the design of this loss function greatly affects the training process, yet simple metrics such as mean squared error (MSE) are often used in place of more intuitive, task-specific loss functions. In this work, we explore the design and subsequent impact of an NN’s loss function in the context of a monocular, RGB-only image localization task.
The problem of image localization (that is, extracting the position and rotation of a camera, herein referred to collectively as the ‘pose’, directly from an image) has been approached using a variety of traditional and deep learning based techniques in recent years. The problem remains highly relevant as it lies at the heart of numerous technologies in Computer Vision (CV) and robotics, such as geo-tagging, augmented reality and robotic navigation.
More colloquially, the problem can be understood as trying to find out where you are, and where you are looking, by considering only the information present in an RGB image.
CNN-based approaches to image localization, such as PoseNet, have found success in recent years due to the availability of large datasets and powerful training hardware, but the performance gap between these systems and the more accurate SIFT feature-based pipelines remains large. For example, the SIFT-based Active Search algorithm shows that significant improvements are needed before CNN techniques can be considered competitive when localizing images.
However, CNN-based approaches do possess a number of characteristics which qualify them to handle this task well. Namely, CNNs are robust to changes in illumination and occlusion, they can operate in close to real time ( frames per second), and they can be trained from labelled data (which can easily be gathered via Structure from Motion (SfM) for any arbitrary scene [13, 14]). CNN-based systems also tend to excel in textureless environments where SIFT-based methods would typically fail. They have also been shown to operate well using purely RGB image data, making them well suited to localizing small, cheap robotic devices such as drones and unmanned ground vehicles. The major concern of this work is to extend existing pipelines whilst ensuring that the benefits provided by CNNs are preserved.
A key observation when considering existing CNN approaches is that position and rotation are treated separately in the loss function. Altering either a camera’s position or its rotation affects the image produced, and hence the error in the regressed position and the error in the regressed rotation cannot be decoupled: each affects the other. In order to optimize a CNN for regressing a camera’s pose accurately, a loss term should be used which combines both distinct quantities in an intuitive fashion.
This publication thus offers the following key contributions:
- The formulation of a loss term which considers the error in both the regressed position and rotation (Section 3).
- Comparison of a CNN trained with and without this loss term on common RGB image localization datasets (Section 5).
- An indoor image localization dataset (the Gemini dataset) with over pose-labelled images per scene (Section 4.1).
2 Related work
This work builds chiefly on the PoseNet architecture (a camera pose regression network). PoseNet was one of the first CNNs to regress the 6 degrees of freedom in a camera’s pose. The network is pretrained on object detection datasets in order to maximize the quality of feature extraction, which occurs in the first stage of the network. It only requires a single RGB image as input, unlike other networks [17, 11], and operates in real time.
Notably, PoseNet is able to localize traditionally difficult-to-localize images, specifically those with large textureless areas (where SIFT-based methods fail). PoseNet’s end-to-end nature and relatively simple ‘one-step’ training process make it well suited to modification, which in this work takes the form of changing its loss function.
PoseNet has had its loss function augmented in prior works. It has been demonstrated that changing a pose regression network’s loss function is sufficient to improve performance. The network was similarly ‘upgraded’ using LSTMs to correlate features at the CNN’s output. Additional improvements to the network were made using a Bayesian CNN implementation to estimate re-localization accuracy.
More complex CNN approaches do exist [9, 8, 10]. For example, one such pipeline uses a CNN to regress the relative poses between a query image and a set of images which are similar to it. These relative pose estimates are coalesced in a fusion algorithm which produces an estimate for the camera pose of the query image.
Depth data has also been incorporated into the inputs of pose regression networks (to improve performance by leveraging multi-modal input information). These RGB-D input pipelines are commonplace in the image localization literature, and typically boast higher localization accuracy at the cost of requiring additional sensors, data and computation.
A variety of non-CNN solutions exist, with one of the more notable being the Active Search algorithm, which uses SIFT features to inform a matching process. SIFT descriptors are calculated over the query image and are directly compared to a known 3D model’s SIFT features. SIFT and other non-CNN learned descriptors have been used to achieve high localization accuracy, but these descriptors tend to be susceptible to changes in the environment, and they often necessitate systems with large amounts of memory and computational power (compared to CNNs).
The primary focus of this work is quantifying the impact of the loss function when training a pose regression CNN. Hence, we do not draw direct comparisons between the proposed model and significantly different pipelines — such as SIFT-based feature matching algorithms or PoseNet variations with highly modified architectures. Moreover, for the purpose of maximizing the number of available benchmark datasets, we consider pose regressors which handle purely RGB query images. In this way, this work deals specifically with CNN solutions to the monocular, RGB-only image localization task.
3 Formulating the proposed loss term
When trying to accurately regress one’s pose based on visual data alone, the errors in the two terms which define pose (position and rotation) need to be minimized. If these error terms were entirely minimized, the camera would be in the correct location and would be ‘looking’ in the correct direction.
Formally, pose regression networks, such as the default PoseNet, are trained to regress an estimate of a camera’s true pose. They do this by calculating a loss at every training iteration, formulated as the MSE between the predicted position and the true position, plus the MSE between the predicted rotation and the true rotation. Note that rotations are encoded as quaternions, since the space of rotations is continuous, and results can be easily normalized to the unit sphere in order to ensure valid rotations. Two hyperparameters control the balance between positional and rotational error, as illustrated in Equation (1). In practice, RGB-only pose regression networks reach a maximum localization accuracy when minimizing these error terms independently.
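As a rough illustration of the default objective described above, the following sketch computes a weighted sum of the positional and rotational MSE terms. The names `alpha` and `beta`, the helper functions, and the (w, x, y, z) quaternion ordering are assumptions made for this example; they are not taken from the PoseNet implementation.

```python
import math


def normalize_quaternion(q):
    """Scale a 4-vector to unit length so that it encodes a valid rotation."""
    norm = math.sqrt(sum(c * c for c in q))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero quaternion")
    return [c / norm for c in q]


def mse(pred, true):
    """Mean squared error between two equal-length vectors."""
    if len(pred) != len(true):
        raise ValueError("vectors must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)


def default_pose_loss(x_pred, q_pred, x_true, q_true, alpha=1.0, beta=1.0):
    """Weighted sum of positional MSE and rotational (quaternion) MSE.

    alpha and beta are hypothetical names for the two balancing
    hyperparameters mentioned in the text.
    """
    return alpha * mse(x_pred, x_true) + beta * mse(q_pred, q_true)
```

In this sketch the position and rotation terms never interact: an error in one cannot influence the gradient of the other, which is the decoupling that the remainder of this section sets out to address.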
Rather than considering position and rotation as two separate quantities, we consider them together as a line in 3D space: the line travels in a direction defined by the rotation, and must pass through the point defined by the position. We then introduce a ‘line-of-sight’ term which constrains our predictions to lie on this line. The line-of-sight term considers the cosine similarity between the direction of the pose and the direction of the difference vector, as per Equation (2) and Figure 2. This term is zero only when the predicted position lies on the line defined by the ground truth pose, hence constraining the pose regression objective further. In the context of image localization, this encourages the predicted poses to lie on the line-of-sight defined in the ground truth image.
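One possible reading of the line-of-sight term is sketched below. Several choices here are assumptions rather than details given in the text: the quaternion ordering (w, x, y, z), the camera's forward axis (+z), the use of the ground truth rotation to define the viewing direction, and the form `1 - |cos|`, which was chosen only because it is zero exactly when the predicted position lies on the ground truth line.

```python
import math


def _cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def rotate_by_quaternion(q, v):
    """Rotate 3-vector v by the unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    u = (x, y, z)
    t = tuple(2.0 * c for c in _cross(u, v))
    ut = _cross(u, t)
    return tuple(v[i] + w * t[i] + ut[i] for i in range(3))


def line_of_sight_term(x_pred, x_true, q_true, forward=(0.0, 0.0, 1.0), eps=1e-12):
    """Return 1 - |cosine similarity| between the viewing direction and
    the difference vector (x_pred - x_true).

    The result is 0 when x_pred lies on the line through x_true along the
    viewing direction, and 1 when the difference is perpendicular to it.
    """
    direction = rotate_by_quaternion(q_true, forward)
    diff = tuple(p - t for p, t in zip(x_pred, x_true))
    diff_norm = math.sqrt(sum(d * d for d in diff))
    dir_norm = math.sqrt(sum(d * d for d in direction))
    if diff_norm < eps or dir_norm < eps:
        # A zero difference vector means the prediction is already on the line.
        return 0.0
    cos_sim = sum(a * b for a, b in zip(direction, diff)) / (diff_norm * dir_norm)
    return 1.0 - abs(cos_sim)
```

For example, with an identity rotation the viewing direction is the +z axis, so a prediction displaced along z incurs no penalty from this term, whereas a prediction displaced along x incurs the maximum penalty of 1.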
We modify the default loss function presented in Equation (1) by adding a weighted contribution of the line-of-sight loss term, producing the proposed loss function in Equation (3). In practice, the value of is chosen to roughly reflect the scale of the scene being considered, and is found via a hyperparameter grid search. Note that the line-of-sight term can contribute to the loss through multiplication, higher order terms, etc. but it was determined that weighted addition produced the best performing networks.
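The weighted addition and the grid search described above could be sketched as follows. The weight names (`alpha`, `beta`, `gamma`) and the `evaluate` callback are hypothetical; in a real experiment `evaluate` would train a model with the candidate weight and return a validation error, which is not shown here.

```python
def proposed_loss(position_mse, rotation_mse, line_of_sight, alpha, beta, gamma):
    """Default loss plus a weighted line-of-sight contribution.

    All three inputs are precomputed scalar terms; alpha, beta and gamma
    are assumed names for their respective weights.
    """
    return alpha * position_mse + beta * rotation_mse + gamma * line_of_sight


def grid_search(candidates, evaluate):
    """Return the candidate value with the lowest score under `evaluate`.

    `evaluate` is a hypothetical callback mapping a candidate weight to a
    validation error (lower is better).
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    return min(candidates, key=evaluate)
```

A call such as `grid_search([0.1, 1.0, 10.0], evaluate)` would then select whichever of the three candidate weights gives the lowest validation error.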
In short, the final loss function used to train the proposed model (Equation (3)) is the result of an exploration in the space of possible loss terms, and the term’s design was informed by task-specific observations and experimentation.
Our experiments are naturally centred around testing the performance of the proposed model (defined in Section 3). This performance is defined with respect to the following criteria:
- Accuracy: the system should be able to regress a camera’s pose with a level of positional and rotational accuracy that is competitive with similar classes of algorithms. Accuracy is reported using per-scene and average median positional and rotational error (see Section 5.1).
- Time performance: evaluation should occur in real time ( frames per second), such that the system is suitable for hardware-limited real-time applications, or for platforms with RGB-only image sensors such as mobile phones (see Section 5.3).
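The accuracy criterion above relies on median positional and rotational error. A minimal sketch of how these could be computed is given below, assuming positions are 3-vectors in metres and rotations are (w, x, y, z) quaternions; the angular error formula `2 * acos(|<q1, q2>|)` is a common convention and is an assumption here, not necessarily the one used to produce the reported results.

```python
import math
from statistics import median


def positional_error(x_pred, x_true):
    """Euclidean distance between predicted and true positions."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(x_pred, x_true)))


def rotational_error_deg(q_pred, q_true):
    """Angle in degrees between two rotations given as quaternions."""
    norm_p = math.sqrt(sum(c * c for c in q_pred))
    norm_t = math.sqrt(sum(c * c for c in q_true))
    dot = abs(sum(p * t for p, t in zip(q_pred, q_true))) / (norm_p * norm_t)
    dot = min(1.0, dot)  # guard against rounding slightly above 1
    return math.degrees(2.0 * math.acos(dot))


def median_errors(predictions, ground_truths):
    """Median positional and rotational error over a test set.

    Each element of both lists is a (position, quaternion) pair.
    """
    pos = [positional_error(p[0], g[0]) for p, g in zip(predictions, ground_truths)]
    rot = [rotational_error_deg(p[1], g[1]) for p, g in zip(predictions, ground_truths)]
    return median(pos), median(rot)
```

The median is less sensitive than the mean to a small number of badly localized frames, which is one reason it is a common summary statistic for this task.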
We compare our proposed model against the default PoseNet and other PoseNet variants.
The following datasets are used to benchmark model performance. Each scene’s recommended train and test split (see Table 1) is used throughout the following experiments.
|Scene|(metres)|# Train|# Test|
|St Mary’s Church||1487|530|
Figure panels (Cambridge Landmarks scenes): Great Court, Kings College, Old Hospital, Shop Facade, St Mary’s Church, Street.
7Scenes. Indoor locations in a domestic office context. The dataset features large training and testing sets (in the thousands). The camera paths move continuously while gathering images in distinct sequences. Images include motion blur, featureless spaces and specular reflections (see Figure 8), making this a challenging dataset, and one that has been used extensively in the image localization literature. The ground truth poses are gathered with KinectFusion, and the RGB-D frames each have resolutions of px.
Cambridge Landmarks [4, 2, 4]. Outdoor locations in and around Cambridge, United Kingdom. The larger spatial extent and restricted dataset size make this a challenging dataset from which to learn to regress pose: methods akin to the one presented in this work typically only deliver positional accuracy on the scale of metres. However, the dataset does provide a common point of comparison, and also includes large expanses of textureless surfaces. Ground truth poses are generated by an SfM process, so some comparison can be drawn between this dataset and the one created in this work.
University. Indoor scenes in a university context. Ground truth poses are gathered using odometry estimates and “manually generated location constraints in a pose-graph optimization framework”. The dataset, similarly to 7Scenes, includes challenging frames with high degrees of perceptual aliasing, where multiple frames (with different poses) give rise to similar images. Although the scenes are registered to a common coordinate system in the University dataset and thus a network could be trained on the full dataset, the models created in this work are trained and tested scene-wise for the purpose of consistency.
Gemini (this dataset has been made available at https://github.com/anon-datasets/gemini). Indoor scenes in a university lab context. This dataset was created for the purpose of studying the effect of texture and colour on pose regression networks: both scenes survey the same environment, with one scene including decor (posters, screen-savers, paintings etc.) and the other deliberately excluding visually rich, textured, and colourful decor. As such, the two scenes are labelled Decor and Plain. A photogrammetry pipeline (COLMAP) was used to generate the ground truth poses. Images were captured in separate video sequences using a FujiFilm X-T20 with a 23mm prime autofocus lens (in order to ensure a fixed calibration matrix between sequences). Visualizations of the Decor scene are provided in Figure 7.
Figure 7 panels: (a) Top-down view; (b) Isometric view.
4.2 Architecture and training
As stated, we primarily experiment with the PoseNet architecture (using TensorFlow). For the purpose of brevity we redirect the reader to the original publication, as here we only describe crucial elements of the network’s design and operation.
The PoseNet architecture is itself based on the GoogLeNet architecture, a deep network which performs classification and detection. PoseNet takes GoogLeNet’s early feature-extracting layers, and replaces the final three softmax classifiers with affine regressors. The network is pretrained using large classification datasets such as Places.
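To make the idea of an affine regressor concrete, the sketch below maps a feature vector to seven outputs: three for position and four for a quaternion, which is then normalized. The function name, the plain-list representation of the weights, and the output ordering are assumptions for illustration; the actual layers are defined in the network's TensorFlow implementation, which is not reproduced here.

```python
import math


def affine_pose_head(features, weights, bias):
    """Apply y = W f + b and split the result into a pose.

    weights: 7 rows, each with len(features) entries (hypothetical layout).
    bias:    7 entries.
    Returns (position, unit_quaternion).
    """
    if len(weights) != 7 or len(bias) != 7:
        raise ValueError("a pose head needs exactly 7 outputs")
    outputs = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    position = outputs[:3]
    quaternion = outputs[3:]
    norm = math.sqrt(sum(c * c for c in quaternion))
    if norm == 0.0:
        raise ValueError("regressed quaternion has zero length")
    return position, [c / norm for c in quaternion]
```

Unlike a softmax classifier, the head has no squashing non-linearity on its outputs, so it can regress unbounded positions directly.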
Strictly, the default loss function used is not exactly as defined in Equation (1). Instead, PoseNet uses the predictions from all three affine regressors (hence there are three predictions for each quantity). We label each affine regressor’s hyperparameters and predictions using a subscript, as per Equation (4). All three affine regressors’ predictions are used in the loss function, but each has a different hyperparameter weighting.
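The summation over the three affine regressors might look like the following sketch, in which every regressor contributes its own weighted positional and rotational terms. The tuple layout and the weight values used in the usage note are placeholders chosen for illustration only; they are not the weightings used in this work.

```python
def multi_regressor_loss(regressor_terms):
    """Sum the weighted loss contributions of several affine regressors.

    regressor_terms: iterable of (position_mse, rotation_mse, alpha_i, beta_i)
    tuples, one per regressor. alpha_i and beta_i are the per-regressor
    weights (hypothetical names).
    """
    total = 0.0
    for position_mse, rotation_mse, alpha_i, beta_i in regressor_terms:
        total += alpha_i * position_mse + beta_i * rotation_mse
    return total
```

For instance, `multi_regressor_loss([(1.0, 2.0, 0.5, 1.0), (2.0, 1.0, 0.5, 1.0), (1.0, 1.0, 1.0, 2.0)])` evaluates to 7.5 with these placeholder weights.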
In order to demonstrate the consistency and generalization of the proposed network, we train against all scenes in all datasets using the same experimental setup. For each scene we train PoseNet using the default loss (Equation (4)) and the proposed loss (Equation (3)) with the contribution from all three affine regressors. Each model is trained per-scene over iterations with a batch size of on a Tesla K40c, which takes hours to complete.
We compare our proposed model to PoseNet and one of its variants, Bayesian PoseNet, in Table 2. This is to show the proposed model’s performance when compared to other variants of PoseNet with modified loss functions. We then provide results specifically comparing the default PoseNet to our proposed model in Table 3. A discussion of our system’s performance regarding the criteria outlined in Section 4 follows.
|Scene||PoseNet ||PoseNet ||model|
Average calculated using only the scenes: King’s College, Old Hospital, Shop Facade & St Mary’s Church as full dataset performance is not available for all pipelines.
It is observed that the proposed model outperforms the default version of PoseNet in approximately half the 7Scenes scenes, particularly the Stairs scene. In the Stairs scene, repetitious structures (staircases) make localization harder, yet the proposed model is robust to such challenges. The network is outperformed in other scenes, namely the outdoor datasets with large spatial extents, but in general performance is improved for the indoor datasets 7Scenes, University and Gemini.
A set of cumulative histograms for six of the evaluated scenes is provided in Table 4, where we compare the distribution of the positional errors and rotational errors. Median values (provided in Table 2 and Table 3) are plotted for reference.
The proposed model’s errors are strictly less than the default PoseNet’s throughout the majority of the Chess and Coffee Room distributions. However, the default PoseNet outperforms our proposed model with respect to rotational accuracy in the - range in the Coffee Room scene.
Note the weaker performance of the proposed model on the King’s College scene, where the positional error distributions for the two networks are nearly aligned. Moreover, the default PoseNet more accurately regresses rotation in this outdoor scene. See Section 5.2 and Section 6 for further discussion.
The robustness of our system to challenging test frames — that is, images with motion blur, repeated structures or demonstrating perceptual aliasing  — can be determined via the cumulative histograms in Table 4. For the purpose of visualization, some difficult testing images from the 7Scenes dataset are displayed in Figure 8.
The hardest frames in the test set by definition produce the greatest errors. Consider the positional error for the Meeting scene: our proposed model reaches a value of on the y-axis before the default PoseNet does, meaning that the hardest frames in the test set have their position regressed more accurately. This analysis extends to each of the cumulative histograms in Table 4, thus confirming our proposed loss function’s robustness to difficult test scenarios, as the frames of greatest error consistently have less than or comparable errors when compared to the default PoseNet.
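The cumulative histograms discussed above can be computed directly from per-frame errors. The sketch below is an assumed, minimal version: `cumulative_fraction` gives the y-axis value (fraction of test frames) at each error threshold, and `error_at_fraction` gives the error at which a curve reaches a given height. Both function names are hypothetical.

```python
import math
from bisect import bisect_right


def cumulative_fraction(errors, thresholds):
    """Fraction of frames whose error is <= each threshold."""
    if not errors:
        raise ValueError("errors must not be empty")
    ordered = sorted(errors)
    n = len(ordered)
    return [bisect_right(ordered, t) / n for t in thresholds]


def error_at_fraction(errors, fraction):
    """Smallest observed error e such that at least `fraction` of the
    frames have error <= e (fraction in (0, 1])."""
    if not errors:
        raise ValueError("errors must not be empty")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    ordered = sorted(errors)
    k = max(1, math.ceil(fraction * len(ordered)))
    return ordered[k - 1]
```

Under this reading, comparing `error_at_fraction(errors, 1.0)` for two models compares their worst-case frames: the model whose curve reaches the top of the y-axis at a smaller error handles its hardest frames better.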
Figure 8 panels: (a) Motion blur; (b) Repeated structures; (c) Textureless & specular surfaces.
Moreover, the proposed model significantly exceeds the default PoseNet’s performance throughout the Gemini dataset. The performance gap in the Plain scene suggests that our model is more robust to textureless spaces than the default PoseNet.
Training time. The training duration of our implementation is, by design, very similar to that of the default PoseNet, and is highly competitive with the other systems analyzed in Table 2. This is due to the relatively low computational cost of introducing a simple line-of-sight loss term into the network’s overall loss function. The average training time for default PoseNet and for our augmented PoseNet over the University dataset is and respectively (HH:MM:SS), where both tests were run on the same hardware.
Testing time. The network operation during the test time is naturally not affected by the loss function augmentation. The time performance when testing is similar to that of the default PoseNet and in general is competitive amongst camera localization pipelines (especially feature based matching techniques). We observe a total elapsed time of seconds when evaluating the entire Coffee Room scene testing set, whereas it takes seconds using the default PoseNet. In other words, both systems take ms to complete a single inference on our hardware.
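The conversion from a total elapsed time to per-frame latency and throughput, as used in the paragraph above, is simple arithmetic. The sketch below shows it alongside a hypothetical timing helper; `run_inference` stands in for a forward pass of whichever model is being measured and is not part of any real API.

```python
import time


def per_frame_stats(total_seconds, num_frames):
    """Return (milliseconds per frame, frames per second)."""
    if total_seconds <= 0 or num_frames <= 0:
        raise ValueError("elapsed time and frame count must be positive")
    ms_per_frame = (1000.0 * total_seconds) / num_frames
    fps = num_frames / total_seconds
    return ms_per_frame, fps


def time_inference(run_inference, frames):
    """Wall-clock time taken to call `run_inference` on every frame.

    `run_inference` is a hypothetical callable standing in for a single
    forward pass of the network.
    """
    start = time.perf_counter()
    for frame in frames:
        run_inference(frame)
    return time.perf_counter() - start
```

As a made-up example, a test set of 2000 frames evaluated in 10 seconds corresponds to 5 ms per frame, or 200 frames per second.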
Memory cost. Memory cost in general for CNNs is low — only the weights for the trained layers and the input image need to be loaded into memory. When compared to feature matching techniques, which need to store feature vectors for all instances in the test set, or SIFT-based matching methods with large memory and computational overheads, CNN approaches are in general quite desirable — especially in resource constrained environments. Both the proposed model and the default PoseNet take MiB and MiB to train and test respectively (as reported by nvidia-smi). For interest, the network weights for the proposed model’s TensorFlow implementation total only MB.
6 Discussion and future work
Experimental results indicate that the proposed loss term has a positive impact on accuracy and robustness (to textureless spaces and so forth), whilst maintaining speed and memory usage.
The network is outperformed by the SIFT-based image localization algorithm ‘Active Search’ , indicating that there is still some work required until the gap between SIFT-based algorithms and CNNs is closed (in the context of RGB-only image localization). However, SIFT localization operates on a much longer timescale, and can be highly computationally expensive depending on the dataset and pipeline being used .
Ultimately, the loss function described in this work illustrates that intuitive loss terms, designed with respect to a specific task (in this case image localization) can positively impact the performance of deep networks.
Possible avenues for future work include extending this loss function design methodology to other CV tasks in order to achieve higher performance, or considering RGB-D pipelines. The effect that such loss terms have on the convergence rate and upper performance limit of NNs could also be investigated.
In summary, the effect of adding a line-of-sight loss term to an existing pose regression network is investigated. The performance of the proposed model is compared to other similar models across common image localization benchmarks and the newly introduced Gemini dataset. Improvements to performance in the image localization task are observed, without any drastic increase in evaluation time or training time. In particular, the median positional accuracy is, on average, increased for indoor datasets when compared to a version of the model without the suggested loss term.
This work suggests that mean squared error between the ground truth and the regressed predictions, although often used as a measure of loss for many Neural Networks, can be improved upon. Specifically, loss functions designed with the network’s task in mind may yield better performing models. For pose regression networks, the distinct yet coupled nature of positional and rotational quantities needs to be considered when designing a network’s loss function.
References
[1] Learning less is more - 6d camera localization via 3d surface regression. Conference on Computer Vision and Pattern Recognition (CVPR). arXiv: 1711.10228.
[2] (2015) Modelling uncertainty in deep learning for camera relocalization. International Conference on Robotics and Automation (ICRA). arXiv: 1509.05909.
[3] (2017-04) Geometric Loss Functions for Camera Pose Regression with Deep Learning. Conference on Computer Vision and Pattern Recognition (CVPR). arXiv: 1704.00390.
[4] (2015-05) PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization. International Conference on Computer Vision (ICCV). arXiv: 1505.07427.
[5] (2017) Camera relocalization by computing pairwise relative poses using convolutional neural network. International Conference on Computer Vision (ICCV).
[6] (2018) Full-frame scene coordinate regression for image-based localization. Robotics: Science and Systems.
[7] (2016) Random forests versus neural networks - what’s best for camera relocalization?. International Conference on Robotics and Automation (ICRA). arXiv: 1609.05797.
[8] (2017) Relative camera pose estimation using convolutional neural networks. Advanced Concepts for Intelligent Vision Systems (ACIVS). arXiv: 1702.01381.
[9] (2017) Image-based localization using hourglass networks. International Conference on Computer Vision Workshops (ICCVW). arXiv: 1703.07971.
[10] (2017) SPP-net: deep absolute pose regression with synthetic views. British Machine Vision Conference (BMVC). arXiv: 1712.03452.
[11] (2018) VLocNet++: deep multitask learning for semantic visual localization and odometry. Robotics and Automation Letters (RAL) 3.
[12] (2017-09) Efficient and effective prioritized matching for large-scale image-based localization. Transactions on Pattern Analysis and Machine Intelligence (PAMI) 39 (09), pp. 1744–1756.
[13] (2016-06) Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4104–4113.
[14] (2016) Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV).
[15] (2013) Scene coordinate regression forests for camera relocalization in rgb-d images. Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2930–2937.
[16] (2014) Going deeper with convolutions. Conference on Computer Vision and Pattern Recognition (CVPR). arXiv: 1409.4842.
[17] (2018) Deep auxiliary learning for visual localization and odometry. International Conference on Robotics and Automation (ICRA). arXiv: 1803.03642.
[18] (2016-11) Image-based localization using LSTMs for structured feature correlation. International Conference on Computer Vision (ICCV). arXiv: 1611.07890.
[19] (2013-06) Towards linear-time incremental structure from motion. In International Conference on 3D Vision (3DV), pp. 127–134.
[20] (2010) The impact of perceptual aliasing on exploration and learning in a dynamic decision making task. In Proceedings of the Annual Meeting of the Cognitive Science Society.
[21] (2014) . In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 487–495.