We present a method to localize a monocular image stream in a previously mapped environment. In this context, we describe a light-weight system that operates on a single camera image and localizes that camera in a map. Mapping is done once, using a system that has access to depth perception (we use stereo). Subsequently, a second robot - say a drone with a single forward-facing camera - is able to obtain a full 6 Degree of Freedom (DoF) pose along the route taken by the mapping robot.
The philosophy behind this approach is that depth sensing can be demanding - computation, power and expense-wise, and is not always available. Mapping is done once, with a robot that is equipped with depth sensing. After this, robots equipped with monocular sensing are able to localize themselves in this environment, as long as they don’t stray too far from the mapped route. This opens up the possibility of small, cheap and low-cost robots, wheeled or aerial, equipped with a single camera, operating in a surveillance or delivery role in a pre-mapped environment.
Our system for monocular localization is hybrid in more ways than one. We use both Deep Learning and regular geometric techniques that pre-date the revolution in the former and combine topological and metric mapping approaches.
We present experiments, both simulated and real, where a mapping robot is first driven in a loop through an environment and subsequently, a monocular camera is able to localize itself to a full 6 DoF pose estimate along the same route. The mapping robot uses a stereo camera and runs a state-of-the-art open-source visual SLAM
algorithm to generate a map of the environment as it traverses the loop. This map, comprising of a sequence of camera poses and 3D points in the environment is first converted to a lighter topological map, with topological nodes at periodic intervals along the path taken by the robot. Each topo-node comprises of an image, 2D feature points (along with their visual descriptors), and their corresponding 3D locations. The mapping robot traverses the same route multiple times to build up a database of images for each topo-node. We train a deep learning based-classifier to classify images from the same route as belonging to one of the topo-nodes. In this work, we take a look at 3 different classifiers: an end-to-end trained classification CNN, a K-nearest neighbour classifier (KNN) using CNN features and a KNN using a NetVLAD feature descriptor. This allows a robot with a monocular camera to use a single image to determine the closest topo-node. 2D features are then matched between the new image and the features in the database for this node (whose 3D positions are known), followed by Perspective-n-Point (PnP) projection  to determine the pose of the robot to be localized. We also use a patch-normalization pre-processing step for making the feature matching invariant to lighting change. Thus, we use a hybrid of conventional geometric techniques and recent advances in Deep Learning, to localize a camera in a previously mapped environment. The system is illustrated in Figure 1.
Ii Related Work
Ii-a Visual SLAM - SfM and VO
The goal of Visual SLAM, or Visual Simultaneous Localisation and Mapping, is to use a moving camera to map an environment. The camera needs to simultaneously determine the 3D geometry of the structure in the scene and estimate its own moving position in this incrementally mapped structure. The structure estimation and camera trajectory estimation sub-problems are sometimes studied independently, depending on the relative importance of the two, and called Structure from Motion (SfM) and Visual Odometry (VO) respectively.
A typical SfM formulation  determines the 3D structure in the scene from a set of (not necessarily sequential) images. Image features are matched over multiple viewpoints and camera poses and the 3D geometry are jointly optimized over all the cameras. An additional texture mapping step ensures visual realism.
In contrast, VO emphasizes the accuracy of camera trajectory estimation that often needs to be real-time for robotics applications. Therefore, in case of VO, the 3D structural mapping is generally limited to the estimation of 3D point locations which are backprojected into the scene from their 2D image counterparts (using stereo or parallax from camera motion in the monocular case), and the optimization of the sequence of camera poses and the 3D point estimates are done incrementally in real-time. A comprehensive tutorial for VO can be found in .
ORB-SLAM  is a modern real-time open-source visual SLAM algorithm. As the name suggests, ORB features are extracted from images. These 2D feature points are then backprojected into 3D. The pose of the camera with respect to these points is determined by Perspective-n-Point (PnP) projection that optimizes the pose based on minimizing the reprojection error between the 2D projection of 3D points and their tracked counterparts obtained from optical flow across images. 3D points and pose information are stored for key-frames and a backend optimization procedure  operating in a separate thread is used to jointly optimize all 3D point and pose estimates when loop closure is detected. ORB-SLAM can be used with both monocular and stereo sensors. The former is less accurate and can only map the environment up to an arbitrary scale.
Ii-B Topo-metric Localization
Topological maps are a discretization of a continuous metric environment into nodes along paths frequented by users of the map. The topology of these nodes is described by edges between them that encode their geometric or physical arrangement. Localization in a topological map involves a determination of the node in the map the robot is currently at. In contrast, metric localization involves determining its precise metric location in the map. Topological representations are more compact, yet less precise compared to metric maps. A hybrid approach involves having a local metric map for each topological node and is a good compromise between the two.  is an early example of combining topological and metric maps. They discretize the environment into topological nodes, with local metric maps at each topo-node. A planar LIDAR is used to determine straight-line structures in the environment and topological nodes are automatically constructed at corridor junctions using corner detection. [8, 9] describe topo-metric localization using a camera. In , a vehicle equipped with GPS and a camera is used to make a topological map with an image associated with each topological node. During localization, image matching is used to determine the location of the vehicle closest to the nearest topological node.  use an omni-directional camera for localization in a topo-metric map, where the bearing of visual landmarks across 3 locations are used to triangulate the position of the camera. Like [8, 9], we also use a hybrid topo-metric map, but in contrast to them, we describe full 6 DoF pose estimation of a single image from a perspective camera.  is perhaps the closest conceptually, to our work in that they also use visual place recognition to localize to a topological node, followed by 6 DoF pose estimation by matching to a local point cloud. However, they use a stereo camera throughout, and their pose estimation is obtained by 3D-to-3D matching, whereas we use 3D to 2D matching and are able to localize using a monocular sensor.
Modern visual SLAM approaches like ORB-SLAM  described earlier implicitly use a topo-metric representation of the environment. Each keyframe is stored as a node in the graph and edges between nodes describe the SE(3) pose transformation between successive keyframes. We use ORB-SLAM, with stereo images, to generate our topometric map. The nodes in our map are much sparser and more distant compared to the keyframes in the ORB-SLAM map. And instead of the ORB features to match images, we treat the image matching to each node as a classification problem and train a Deep CNN to do this.
Ii-C Deep Learning for monocular 6 DoF pose estimation
With the recent success of Deep Learning, there have been efforts [11, 12, 13] at treating pose estimation from a single image as an end-to-end learning problem. The recent Posenet  and its extension 
(with a geometric loss function) use Deep Learning to regress 6 DoF global pose of the camera from a single image. Training poses are provided from an SfM pipeline. Undeep VO also uses Deep Learning to train separate networks for VO and for depth estimation from a single image. They use a stereo camera to train their networks with pose and photometric losses between the left and right images, and during test time, their networks are able to take a monocular image stream from the same environment and give depth maps and a VO output. However, unlike us, their VO pose output is a relative pose between the frames, and not an absolute global pose with respect to a prior map and therefore cannot be used for global localization.
The following sections describe our system in more detail. We use a stereo camera and a visual SLAM algorithm to generate the topo-metric map. Localization is done in a coarse-to-fine Deep-Geometric approach that first localizes the image to a topo-node and then uses geometry to determine its fine-grained 6 DoF pose.
Iii-a Constructing the Topo-metric map
We use a stereo camera (stereo ZED ) and the open-source visual SLAM algorithm, ORB-SLAM2  to extract the camera trajectory, i.e. a sequence of 6 DoF poses of the moving stereo pair. This trajectory is then discretized to a set of topological nodes in our topo-metric map. We found that SURF features facilitate much higher quality matching, so we elected to use that instead of ORB features. Each node comprises of a single template node image, its global SE(3) pose and a point cloud comprising of 2D SURF features 
and their 3D positions. The global SE(3) pose of each node is also stored in the map. Nodes are created based on heuristically determined translation and rotation parameters. These can be 2-20m apart, depending on the size of the environment. A new node is created when the following condition holds:
Where is the Euclidean distance the camera has moved and is the angular distance in radians. is a constant that determines how much rotations contribute the creation of new nodes. is the threshold distance.
The images in the vicinity of each topo-node are used to train a classifier that can classify RGB images to topo-nodes. We consider 3 different classifiers for this task. The first one is a more naive approach, we use a VGG-11 network 
, pretrained on ImageNet, and replace its output layer with a fully-connected layer with the same number of outputs as nodes in our topo-metric map. We then train the network to directly estimate the nearest topological node for each image in our mapping sequences. We refer to this setup asCNN-FC.
Our second classifier utilizes a pretrained VGG-16 network as a feature extractor. For all images in our mapping sequence, we extract a feature vector using the pretrained network and use it to construct a K-nearest neighbour classifier. The network is not retrained. We refer to this setup asCNN-NN.
The last classifier we consider is a VGG-16 network, with the final layers replaced with a NetVLAD layer . Like the last classifier, we build a K-nearest neighbour classifier using the feature vector computed by the NetVLAD architecture. We use pretrained weights for the network, which have been trained on the Pittsburgh 30K dataset111Code and weights taken from https://github.com/Nanne/pytorch-NetVlad. This architecture has been shown to be able to deal with challenging situations such as lighting changes and large time differences. We refer to this setup as VLAD-NN.
Iii-B Deep-Geometric Localization
Our localizer is a hybrid between newer Deep Learning techniques and geometric computer vision methods. We use a classifier (as described previously) to localize a single camera image to one of the topological nodes in the map. SURF keypoints in the test camera image are matched to keypoints from the template image of the topo-node using a fast nearest neighbour-based matcher . Next, the geometric Perspective-n-Point (PnP) algorithm is used to determine its SE(3) pose with respect to the topo-node. The PnP algorithm takes in the set of 3D points in the reference frame of the topo-node, and their 2D projections in the current camera image plane (for the matched SURF features), and outputs the pose of the camera in that reference frame. We utilize the Ransac-ed version of PnP from OpenCV, that applies the non-linear Levenberg Marquardt  optimization algorithm to find the best pose of the camera (the intrinsics of the camera are assumed known) given matching feature points. Once the pose of the camera is known with respect to the topo-node, and given the global pose of the topo-node found during the mapping process, the global 6 DoF pose of the camera image is easily obtained using a sequential concatenation of SE(3) poses.
The PnP solver is not bounded, so occasionally when keypoint matching fails, the pose will have jumped by a large distance. These large jumps are easily detected by checking if the distance travelled between two frames is not larger than a threshold (set between 1-5m, dependent on the scale of the environment). When such a jump is found, we copy the previous pose from the trajectory, ignoring the incorrect localization found by the PnP solver. In the future, this can be replaced by a more sophisticated mechanism that uses the current velocity of the camera.
Iii-C Patch normalization for Robustness to Lighting Change
We found that the SURF feature matching is more robust across lighting change if we patch-normalized  the images as a pre-processing step. Instead of finding normalizing parameters across the entire image, we do this in 17x17 windows according to the following formula:
Where: is the normalized image, is the centre of the patch, is the mean of the current patch,
is the standard deviation of this patch andis a small value to prevent division by zero. Figure 2 shows the effect of patch-normalization across time in the Corridor dataset. This is from the same topo-node in the map, but the images are taken during the day (when outdoor lighting streams in from large windows) and during the night (when the windows are dark, but indoor lights have come on). This pre-processing step is important for images in areas with significant lighting changes, as it results in more robust feature matching.
During preliminary experiments, we found that applying patch normalization before computing the SURF keypoints improved the localization accuracy by on non-simulated data.
Iv-a Simulated Environments
In an effort to make our results more reproducible, we first verify our method using the open-source autonomous driving simulator, CARLA . It allows us to simulate camera trajectories and collect photo-realistic outdoor RGB images and dense depth maps, along with accurate ground truth camera poses. We record sequences in this environment by selecting a path and moving a camera along that path. The approximate size of this environment is 333m x 81m. We consider 2 kinds of changes in the environment, creating different experiments. In the first experiment, we apply an offset to the trajectory, so the camera is moved further to the right. We split the data into mapping and testing sequences. This split is chosen such that the sequences are in different lanes on the road. We show the results of this experiment in table I.
To test how well our methods can deal with changes in the environment, we generate sequences with different weather conditions. We use the same mapping sequence as our previous experiment, but we attempt to localize in a cloudy and rainy environment. We show the visual difference between the mapping and localizing sequences in figure 4. We show the performance decreases across our proposed method in table II we show the topological accuracy (the classification accuracy of our coarse localizers) in figure 9.
For all our experiments with this dataset, we use and . In figure 3, we show the trajectory followed along with some example images from the simulator.
|Posenet||CNN - FC||CNN - NN||VLAD - NN|
|CARLA||3.114 m||1.842 m||1.822 m||1.845 m|
|Corridor||0.529 m||0.0458 m||0.0451 m||0.0446 m|
|Lab||0.131 m||0.0207 m||0.0202 m||0.0202 m|
|CNN - FC||CNN - NN||VLAD - NN|
|Baseline - CARLA||1.845 m||1.842 m||1.822 m|
|CARLA cloudy||2.526 m||2.702 m||2.371 m|
|CARLA rain||2.982 m||2.817 m||2.812 m|
Iv-B Corridor & Lab environments
We developed this technique at the Ford Palo Alto research centre, and hence, the labs and corridors of our office were the natural environment where we collected real data and tested our algorithm. We required multiple traversals of the same loop across time to allow accumulation of enough images per node and training of the CNN and hence, did not use the most common publicly available dataset KITTI, which contains relatively few loops. We also attempted to use the Oxford Robotcar Dataset , as it contains several loops, but due to issues such as camera exposure problems, we could not get accurate positioning information from ORB-SLAM. Our environment comprises large floor-to-ceiling windows and indoor ceiling lights. The brightly natural-lit windows present large, saturated regions in the image during daytime and dark areas at night. Some indoor lights go off during the day, while more come on during the evening and night. The heavily polished floors present confounding reflections in large regions of the images in our Corridor dataset. The data comprises of image sequences captured from the stereo Zed camera  mounted on a trolley with a laptop doing multiple loops around the Corridor that occupies an area of 25m x 18m, and the smaller Lab, that is of size 5m x 4m. Our algorithm was tested during day and night and lights-on and lights-off conditions. In our experiment, we map the environment with footage taken during daytime and localize on data captured in the evening. Example images can be found in figure 6. As our ground truth for these sequences, we use ORB-SLAM to map out the area with a stereo camera. While our experiments only use the left camera of the stereo sensor at test time, we use both sensors when mapping the environment and getting the ‘true’ localizations of each camera frame. Results are presented in Table I, the accuracy of the classifier is found in figure 9. For our experiments in these environments, we use and .
Iv-C Selecting Number of Nodes
An important parameter of our method is setting the threshold distance at which a new node is placed (see Eq. 1). This parameter effectively determines the number of nodes that are created. In turn, this also determines the memory footprint of our map: more nodes, means more data to be stored. During our experiments, we found that this parameter is not very sensitive w.r.t. the error metrics and works for most reasonable values. To confirm this idea, we test this by starting from a very low threshold distance, resulting in many nodes, and systematically increasing this threshold. The resulting error metrics, along with the memory footprint are plotted in figure 10.
The rotation and translation errors decrease as more nodes are being added. Eventually, adding more nodes gives diminishing returns, and the error stops decreasing while the memory footprint rises. We believe that at this point the accuracy is limited by the accuracy of the fine localizer: the SURF matching and PnP solving procedure.
Iv-D Discussion and Future Work
As a baseline, we apply a pure deep learning method, Posenet , to this problem. Posenet reported that they were able to outperform SIFT-based localizers. We train the network with the same ground truth position and orientation labels as our methods. In the case of the Lab and Corridor environments, these are the positions as found by ORBSLAM. We compare the quantitative results with our method in table I and see that it performs significantly worse. A qualitative example can be found in figure 11, using one of the corridor sequences. Posenet shows some strong deviations from the ground truth path, much more frequently than our method.
The runtime of our proposed methods vary. Our fastest method is the CNN-FC
, as it is an end-to-end trained neural network running on a GPU. Our other methods are slower due to the complexity of our naive K-nearest neighbour classifier increasing with the number of training examples. A GPU implementation could make this much faster. A detailed account of the time spent on each subtask can be found in table III. This was measured on a machine with an Intel i7-4770K CPU and an NVIDIA GeForce GTX 1660.
Though the nearest neighbour based methods are more computationally expensive, they do not require any retraining of the CNN. Though it has been shown  that the VLAD descriptors are more robust to changes, in our experiments the performance of VLAD-NN is comparable to CNN-NN.
Our method is not without limitations. As shown in our experiment where we localized in an area in different weather conditions, our fine localizer fails. This can be attributed to a flawed keypoint descriptor and replacing it with one that is more robust to these changes would improve localization.
We have shown that we can determine the 6 DoF global pose of a camera in a topo-metric map using a combination of Deep Learning and geometric methods. Our map at the moment is made by the mapping stereo camera travelling in a loop in one direction only. To make this more applicable to real-world applications, we will need to include traversals in both directions. The number of outputs in the Topo-CNN will double, one matching the node for each direction of travel. Two sets of SURF features and their 3D coordinates will also need to be stored per topo-node. Junction nodes with more than two directions of travel might also need to be included in the topo-metric map for more complicated environments.
At the moment, we use 2D image points and their SURF features to match a new image to a template image for a topo-node. The long-term maintenance of these 2D feature points is more difficult compared to say larger, more permanent 3D structures in the scene. Indoor environments are Manhattan-esque, with lots of vertical lines, and using vertical lines or surfels occupying a larger area for feature matching and PnP could improve map maintenance over longer periods of time. Another interesting research question is how new features detected from monocular traversals at test time supplant old features that degrade over time.
Our matching is light-weight and uses only a single image at the moment. Every new image at test time is matched independently to the template features from the topo-node. Tracking of local features across test images to determine a local trajectory using VO could be used to bolster the single image PnP. A local optimization / bundle adjustment procedure using poses and points across images should improve pose estimation, at the cost of increased computational complexity.
These are all interesting directions of exploration that will be avenues of future research.
We propose the Deep-Geometric Localizer: a novel, monocular localization method utilizing the strengths of both deep learning and geometric computer vision. We show that it is possible to use a deep learning classifier to coarsely localize a camera inside a known environment and refine that position using keypoint matching and a PNP solver. Moreover, our method can localize in an environment up to absolute scale with a single camera, after mapping the area with a stereo camera. We tested our method in both simulated and real environments. We show that our proposed method outperforms Posenet, a state of the art CNN-based localization method that regresses 6 DoF pose from a single image. This suggests that there is much to gain by combining the fields of geometric computer vision and Deep Learning.
-  R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, “Orb-slam: a versatile and accurate monocular slam system,” IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
-  R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “NetVLAD: CNN architecture for weakly supervised place recognition,” IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, p. 17.
-  V. Lepetit, F. Moreno-Noguer, and P. Fua, “Epnp: Efficient perspective-n-point camera pose estimation,” International Journal of Computer Vision, vol. 81, no. 2, pp. 155–166, 2009.
-  S. Agarwal, N. Snavely, I. Simon, S. M. Seitz, and R. Szeliski, “Building rome in a day,” in Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009, pp. 72–79.
-  F. Fraundorfer and D. Scaramuzza, “Visual odometry: Part i: The first 30 years and fundamentals,” IEEE Robotics and Automation Magazine, vol. 18, no. 4, pp. 80–92, 2011.
-  R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard, “g 2 o: A general framework for graph optimization,” in 2011 IEEE International Conference on Robotics and Automation. IEEE, 2011, pp. 3607–3613.
-  N. Tomatis, I. Nourbakhsh, and R. Siegwart, “Hybrid simultaneous localization and map building: a natural integration of topological and metric,” Robotics and Autonomous systems, vol. 44, no. 1, pp. 3–14, 2003.
-  A. C. Murillo, J. J. Guerrero, and C. Sagues, “Surf features for efficient robot localization with omnidirectional images,” in Proceedings 2007 IEEE International Conference on Robotics and Automation. IEEE, 2007, pp. 3901–3907.
-  H. Badino, D. Huber, and T. Kanade, “Visual topometric localization,” in 2011 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2011, pp. 794–799.
-  F. Dayoub, T. Morris, B. Upcroft, and P. Corke, “Vision-only autonomous navigation using topometric maps,” in Intelligent robots and systems (IROS), 2013 IEEE/RSJ international conference on. IEEE, 2013, pp. 1923–1929.
-  A. Kendall, M. Grimes, and R. Cipolla, “Posenet: A convolutional network for real-time 6-dof camera relocalization,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 2938–2946.
-  A. Kendall, R. Cipolla, et al., “Geometric loss functions for camera pose regression with deep learning,” in Proc. CVPR, vol. 3, 2017, p. 8.
-  R. Li, S. Wang, Z. Long, and D. Gu, “Undeepvo: Monocular visual odometry through unsupervised deep learning,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 7286–7291.
-  “Stereo zed camera,” https://www.stereolabs.com/.
-  H. Bay, T. Tuytelaars, and L. Van Gool, “Surf: Speeded up robust features,” in European conference on computer vision. Springer, 2006, pp. 404–417.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  M. Muja and D. G. Lowe, “Fast approximate nearest neighbors with automatic algorithm configuration.”
-  K. Levenberg, “A method for the solution of certain non-linear problems in least squares,” Quarterly of applied mathematics, vol. 2, no. 2, pp. 164–168, 1944.
-  A. M. Zhang and L. Kleeman, “Robust appearance based visual route following in large scale outdoor environments,” in Proceedings of the Australasian Conference on Robotics and Automation, 2007.
-  A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “CARLA: An open urban driving simulator,” in Proceedings of the 1st Annual Conference on Robot Learning, 2017, pp. 1–16.
-  W. Maddern, G. Pascoe, C. Linegar, and P. Newman, “1 Year, 1000km: The Oxford RobotCar Dataset,” The International Journal of Robotics Research (IJRR), vol. 36, no. 1, pp. 3–15, 2017. [Online]. Available: http://dx.doi.org/10.1177/0278364916679498
V. Garcia, E. Debreuve, and M. Barlaud, “Fast k nearest neighbor search using
2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, June 2008, pp. 1–6.