As computer-vision-enabled robotic systems are increasingly deployed in the real world, simplicity, efficiency, verifiability and robustness of computer vision algorithms become highly important aspects of their design. An example of a desired solution satisfying the aforementioned requirements would likely include training a single network which predicts a structured, semantically meaningful output, the correctness of which can be verified at test time and from which all the tasks necessary for navigation and interaction with the environment can be performed.
In this work we propose such a structured representation, from which the tasks of semantic segmentation, recognition and localisation can be performed efficiently and accurately. Our proposed solution is inspired by works on globally unique instance segmentation [BudvytisSC18] and scene coordinate regression [shottonscenecoords13, Brachmann2017DSACD, BrachmannR18]. It includes the following key steps. First, a dataset of densely sampled images of the environment, ideally a video, is created. It is labelled with globally unique instance labels [BudvytisSC18], and a corresponding 3D point cloud is obtained by running a structure-from-motion algorithm [mappilaryopensfm] on the collected images. Second, a CNN is trained to simultaneously predict globally unique instance labels and local coordinates of the corresponding objects. Finally, at test time, scene coordinates are formed by combining object center coordinates with local coordinates for 6-DoF camera pose estimation, formulated as a solution to a perspective-n-point (PnP) problem [lepetitepnp, kneipupnp14]. See Figure 1 for an example output of our method.
We evaluate our approach on real world and artificial autonomous driving datasets. Our method predicts more than and of pixels within of their ground truth location for the CamVid-360 [BudvytisSC18] and SceneCity Medium [BudvytisSC18] datasets, spanning approximately and in driving length. We obtain 22 cm and 20 cm median distance errors as well as and median angular errors on the estimated camera poses for the same datasets. Our method outperforms competing deep learning based localisation methods based on either direct 6-DoF pose prediction [PoseNetKendallGC15, KendallC17] or pose estimation from scene coordinates [BrachmannR18] on all datasets. When tested on the highly challenging scenarios of using a different camera (Google StreetView images) or re-localising in scenes with missing buildings [BudvytisSC18], our method demonstrates higher robustness than the alternative approaches. Our contributions include: (i) a novel formulation of scene coordinate regression as two separate tasks of object instance recognition and local coordinate regression, and a demonstration that our proposed solution allows us to predict accurate 3D geometry of static objects and estimate the 6-DoF camera pose on (ii) maps several orders of magnitude larger than previously attempted [shottonscenecoords13, Brachmann2017DSACD, BrachmannR18, liangularscr18], as well as on (iii) lightweight, approximate 3D maps built from 3D primitives.
2 Related Work
In this section we provide a discussion of various related work on localisation.
Matching based localisation. Traditionally, large scale localisation problems are formulated as correspondence problems in the image domain or in the 3D point cloud domain. The first group of approaches works by identifying the most similar looking image in a database, primarily in two ways: by employing either (i) a pipeline of keypoint detection and matching [Lowe2004, LIFTYi16, GoogleLandmarks] or (ii) a fast-to-compare image level encoding [NetVladArandjelovic16, DenseVlad]. In order to obtain a 6-DoF pose estimate, they are augmented with an additional step of establishing feature matches for one or more neighbouring images and solving a perspective-n-point problem [kneipupnp14] inside a RANSAC [RANSACFischler] solver. The second group of approaches obtains local 3D geometry using 3D sensors such as structured light [Izadi11kinectfusion, Scharstein2003] or time-of-flight [rwolcott2014a] cameras, or RGB based structure from motion [torii], and matches it to a pre-built 3D model of the environment. While both types of approaches can provide high accuracy pose estimates at large scale, they are limited by the large storage requirements of feature indices or 3D point clouds and by relatively slow correspondence estimation procedures.
Direct location prediction. The need for test-time storage and correspondence estimation is addressed by works which attempt to directly predict either a coarse location [PlaNetWeyand2016] or a full 6-DoF camera pose [PoseNetKendallGC15, KendallC17]. An estimate of the location or a precise camera pose is obtained by simply training a deep network with the corresponding objective. The coarse methods [PlaNetWeyand2016] still require additional local feature matching if a 6-DoF pose estimate is needed, and hence are not very efficient at test time. In contrast, methods which directly predict the camera pose are test-time efficient, as only a single pass through a network is required. However, they are prone to over-fitting to the training images (e.g. a network may learn to predict a location based on the presence of a parked car in the image) and are not robust to changes in the environment, as shown in Section 5 and discussed in detail in [torstenlimitations].
Localisation via scene coordinate prediction. Test-time robustness is increased by approaches which perform localisation via scene coordinate regression [shottonscenecoords13, Brachmann2017DSACD, BrachmannR18, liangularscr18]. Such works train a per-pixel 3D scene coordinate regressor, either using a CNN [Brachmann2017DSACD] or another method (e.g. a Random Forest [shottonscenecoords13]), and solve a perspective-n-point problem to obtain the camera pose estimate. Early works focus on learning outlier masks [shottonscenecoords13] in order to remove unreliable candidates for pose estimation, and propose a differentiable pose estimation [Brachmann2017DSACD] to be compatible with fully end-to-end training schemes, at the expense of a more complex learning task. In contrast, [BrachmannR18] proposes to simplify the trainable components by making scene coordinate regression the only trainable part of the algorithm. We further simplify their method by replacing the differentiable pose estimation algorithm with a classical one [kneipupnp14] and simply relying on our network's ability to predict accurate 3D coordinates. We also reformulate scene coordinate regression as a task of joint globally unique instance segmentation and local object coordinate prediction (see Section 3), which allows us to obtain accurate pose estimates on maps orders of magnitude larger than in [shottonscenecoords13, Brachmann2017DSACD, BrachmannR18, liangularscr18], despite using training data consisting only of videos traversing the environments of interest along a simple trajectory once.
Semantic localisation. Semantic information is often incorporated into localisation frameworks in one of two ways. Approaches of the first type perform keypoint filtering [OB17] or feature reweighting [kim2017crn, SemanticVisLoc] of dynamic or difficult objects. Approaches of the second type attempt an explicit fitting of 3D models of individual rooms [satkinbmvc2012], buildings [indooroutdooreccv16] or detailed maps [SanFranLandmark, SanFranAlignment]. The former methods often increase the accuracy of the underlying localisation algorithms but do not directly address their robustness under changes in the environment. The latter methods are often slow at test time and are more suitable for data collection. A recent work [BudvytisSC18] attempts to predict a rich representation of per-pixel globally unique instance labels and shows that it is sufficient for localisation under severe changes in the environment. Our work augments this representation with local coordinate prediction, which allows us to obtain 6-DoF pose estimates as opposed to performing image retrieval, to introduce robustness to camera translations unseen at training time, and to avoid the computationally expensive step of explicit rotational alignment of label images.
3 Method

Our proposed localisation framework consists of three key steps: data collection, training a CNN to predict globally unique instance coordinates, and pose estimation. More details are provided below and in Figure 2.
Data collection. First, a densely sampled collection of panoramic images of the environment is obtained. Second, a subset of images (e.g. every 30th frame) is hand labelled with both class labels (e.g. sky, road, pedestrian) and globally unique instance labels of buildings (note that instances of other static objects such as trees or road signs could also be used, as demonstrated in [BudvytisSC18]), and a label propagation algorithm [Budvytis2017ICCV] is used to label the remaining images. Finally, camera pose estimates and the corresponding semantic 3D point cloud are obtained using OpenSfM [mappilaryopensfm], an open source structure-from-motion (SfM) library; default parameter settings are used unless stated otherwise. The point cloud is projected into each training image using the ground truth camera pose in order to produce a 3-channel image containing the (x, y, z) coordinates of the projected 3D points. When multiple 3D points project to the same pixel, the closest one with the same instance label as the source pixel is chosen.
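The label-aware projection step above (the closest point with the same instance label as the pixel wins) can be sketched as follows. This is a minimal numpy sketch: a pinhole camera model is assumed for readability, whereas the paper works with panoramic images, and `render_coordinate_image` is a hypothetical helper name, not part of the described pipeline.

```python
import numpy as np

def render_coordinate_image(points, labels, pixel_labels, R, t, K, h, w):
    """Project a labelled 3D point cloud into an (x, y, z) coordinate image.

    When several points land on the same pixel, the closest point whose
    instance label matches the pixel's label is kept (label-aware z-buffer).
    Assumes a pinhole camera; the paper itself uses panoramic images.
    """
    coord_img = np.full((h, w, 3), np.nan)      # per-pixel scene coordinates
    depth_buf = np.full((h, w), np.inf)

    cam = (R @ points.T + t.reshape(3, 1)).T    # world -> camera frame
    valid = cam[:, 2] > 0                       # keep points in front of camera
    uv = (K @ cam[valid].T).T
    uv = (uv[:, :2] / uv[:, 2:3]).astype(int)   # perspective division
    depths = cam[valid, 2]
    pts, lbls = points[valid], labels[valid]

    for (u, v), z, p, l in zip(uv, depths, pts, lbls):
        if 0 <= u < w and 0 <= v < h:
            if l == pixel_labels[v, u] and z < depth_buf[v, u]:
                depth_buf[v, u] = z
                coord_img[v, u] = p             # store world coordinate
    return coord_img
```

Pixels for which no label-consistent point exists remain NaN and can be masked out of the training loss.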
Training. During the second step a CNN is trained to jointly predict (i) panoptic labels, consisting of class labels (e.g. 10 class labels such as road, sky, people as in [BudvytisSC18]) and instance labels of buildings, as well as (ii) local PCA-whitened coordinates of building instances for each pixel. The unwhitening transformation for a point $i$ with label $l$ is performed as follows: $y_i = W_l^{-1} x_i + \mu_l$, where $x_i$ and $y_i$ denote correspondingly the local whitened coordinates and the scene coordinates of point $i$. $\mu_l$ is the mean coordinate of all points of label $l$ in the training data. $W_l$ is the whitening matrix for label $l$. We apply a standard cross entropy loss for both class and instance labels as in [BudvytisSC18] and a Euclidean distance loss for fitting the whitened local instance coordinates $x_i$. Note that for each pixel a $(3+N)$-dimensional vector is predicted, where 3 corresponds to a 3-dimensional coordinate and $N$ corresponds to the total number of panoptic labels. See Section 4 for more details.
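The per-instance whitening and unwhitening transforms can be sketched as below. This is a minimal numpy sketch assuming a ZCA-style symmetric whitening matrix (the exact PCA-whitening variant is not specified in the text), and the function names are hypothetical.

```python
import numpy as np

def fit_whitening(points):
    """Fit a per-instance whitening transform to that instance's 3D points."""
    mu = points.mean(axis=0)                      # per-instance mean coordinate
    cov = np.cov((points - mu).T)
    eigval, eigvec = np.linalg.eigh(cov)
    # Symmetric (ZCA) whitening matrix: cov^(-1/2); eps guards tiny eigenvalues.
    W = eigvec @ np.diag(1.0 / np.sqrt(eigval + 1e-8)) @ eigvec.T
    return mu, W

def whiten(points, mu, W):
    """Scene coordinates -> local whitened coordinates."""
    return (points - mu) @ W.T

def unwhiten(local, mu, W):
    """Local whitened coordinates -> scene coordinates (the paper's unwhitening)."""
    return local @ np.linalg.inv(W).T + mu
```

At test time, `unwhiten` combines the predicted local coordinates with the stored per-instance mean and whitening matrix to recover scene coordinates for the PnP solver.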
Pose estimation. The camera pose is estimated from the predicted scene coordinates using the EPnP [lepetitepnp] perspective-n-point solver inside a RANSAC [RANSACFischler] loop. Other standard solutions [kneipupnp14] to PnP can be used as well. RANSAC is run for 1000 iterations, with points beyond a fixed distance threshold considered outliers. Since we aim to recover 3D coordinates as accurately as possible, we do not employ the differentiable pose estimation algorithms used in [Brachmann2017DSACD, BrachmannR18]. The reprojection loss component employed in such algorithms reduces the accuracy of the predicted 3D geometry in favour of more accurate camera pose estimation. This is demonstrated in Figure 5, where scene coordinate prediction accuracy is lower for the loss L3-Rec-Repr, which combines reconstruction and reprojection losses, than for L2-Rec, which directly minimises the Euclidean distance between predicted and ground truth scene coordinates.
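The RANSAC loop around a minimal pose solver can be sketched as follows. This is a minimal numpy sketch with a pluggable minimal solver (the paper uses EPnP [lepetitepnp]) and an angular inlier test on the unit sphere; the angular threshold and all function names are assumptions for illustration, not the paper's exact choices.

```python
import numpy as np

def angular_inliers(R, t, scene_pts, rays, thresh_deg):
    """Mark points whose spherical reprojection agrees with the pixel ray."""
    cam = (R @ (scene_pts - t).T).T                    # world -> camera frame
    cam = cam / np.linalg.norm(cam, axis=1, keepdims=True)
    cos = np.clip((cam * rays).sum(axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(cos)) < thresh_deg

def ransac_pose(scene_pts, rays, minimal_solver, iters=1000, thresh_deg=1.0, rng=None):
    """RANSAC loop: sample minimal sets, solve PnP, keep the pose with most inliers."""
    if rng is None:
        rng = np.random.default_rng(0)
    best, best_inl = None, -1
    for _ in range(iters):
        idx = rng.choice(len(scene_pts), size=4, replace=False)  # minimal set for EPnP
        pose = minimal_solver(scene_pts[idx], rays[idx])
        if pose is None:                                # degenerate sample
            continue
        R, t = pose
        n = angular_inliers(R, t, scene_pts, rays, thresh_deg).sum()
        if n > best_inl:
            best, best_inl = (R, t), n
    return best, best_inl
```

A final refinement of the best pose on all inliers (as is standard for RANSAC pipelines) is omitted here for brevity.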
4 Experiment Setup
Below we describe the details of the datasets, network architecture and evaluation protocol.
CamVid-360 dataset. CamVid-360 [BudvytisSC18] is a dataset of panoramic videos captured by cycling along the original path of CamVid [Brostow2009]. The CamVid-360 training set consists of 7835 images sampled at 30 fps, covering sequences 016E5 and 001TP of the original CamVid [Brostow2009] dataset. The query set contains both the test sequence used in [BudvytisSC18] (318 images sampled at 1 fps; note that unlike [BudvytisSC18] we do not use images from sequence 006R0, as this sequence is not covered in the training data) as well as a new additional test sequence obtained by downloading Google StreetView panoramic images along the tracks of the original dataset. We estimate ground truth poses for the testing images by minimising the reprojection errors of SIFT [Lowe2004] feature matches from the 80 closest images in the training dataset via robust EPnP [lepetitepnp].
SceneCity dataset. The SceneCity [BudvytisSC18] dataset contains images rendered from two artificial cities. See Figures 3(b,c) and 8 for example images and maps. The first city, referred to as Small SceneCity, is borrowed from [ZhangRFS16]. It contains 102 buildings and 156 road segments. The second city, referred to as Large SceneCity, contains 827 buildings and 966 road segments in total. The training database consists of 1146 and 6774 images sampled uniformly from each city respectively. 3D point clouds are obtained from Blender directly. Our algorithms are evaluated on two variants of Small SceneCity. For the first variant, 300 camera poses are sampled uniformly from the original track of [ZhangRFS16]. For the second variant, the same camera poses are used but a random subset of buildings is removed from the Small SceneCity map. The query set for Large SceneCity consists of 1000 samples near the centers of road segments, as explained in [BudvytisSC18].
Figure 4: Graphs (a) and (b) plot the percentage of predicted scene coordinates within a fixed distance from the ground truth location as the number of training epochs (horizontal axis) increases. Methods which attempt to directly predict 3D coordinates converge significantly slower; while they may eventually reach similar accuracy if the number of epochs was increased significantly, such a solution would be impractical. Graphs (c) and (d) plot the Euclidean distance in meters between the predicted and ground truth camera locations and the angular error in degrees, respectively (vertical axes). The predictions are sorted along the horizontal axis from the smallest error on the left to the largest on the right.
Training details. Our network consists of a ResNet-50 [HeZRS15ResNet] backbone followed by bilinear upsampling with skip layers, analogous to FCN [long2015fully]. The ResNet-50 implementation and initialization weights provided by the PyTorch [paszke2017automatic] repository are used. Strided convolution layers are replaced with dilated convolutions in order to reduce the down-sampling inside the network. The models are trained for 3000 epochs unless stated otherwise, using the Adam optimizer [kingma2014adam]. The initial learning rate is set to 2e-4 and a polynomial decay [liu2015parsenet, chen2018deeplab] is applied to it at each update step. Weight decay with factor 5e-4 is applied to all kernel weights and 2D-Dropout [tompson2015efficient] is used on top of the final convolutional layer. We train all networks on four GeForce GTX 1080 Ti GPUs.
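The polynomial learning-rate schedule can be sketched as follows. This is a minimal sketch assuming the common form of the "poly" policy from the cited works [liu2015parsenet, chen2018deeplab] with exponent 0.9; the exact exponent is not preserved in the text.

```python
def poly_lr(base_lr, step, max_steps, power=0.9):
    """Polynomial learning-rate decay: base_lr * (1 - step/max_steps) ** power.

    power=0.9 is the value commonly used with this policy; treated here as
    an assumption since the text does not state it.
    """
    return base_lr * (1.0 - step / max_steps) ** power
```

The schedule starts at the initial learning rate and decays smoothly to zero by the final update step.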
Evaluation protocol. In this work, we provide quantitative and qualitative evaluation of the accuracy of both the predicted scene coordinates and the estimated 6-DoF camera poses. Scene coordinate regression is evaluated by measuring the percentage of points falling within fixed distance thresholds of the corresponding ground truth targets. We also provide the average distance from the ground truth coordinates for all points residing within 3m of their targets; points further away than 3m are excluded, as they correspond to outliers which can obscure the true accuracy of the evaluated algorithms. The accuracy of 6-DoF camera pose estimation is evaluated by measuring the median and 95th-percentile distance and angular errors between predicted and target cameras. See Figures 5 and 7 for examples of results.
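The evaluation metrics above can be sketched as follows. This is a minimal numpy sketch; the distance thresholds are hypothetical placeholders, since the exact values are not preserved in the text (only the 3m outlier cutoff is).

```python
import numpy as np

def coord_accuracy(pred, gt, thresholds=(0.1, 0.5, 3.0)):
    """Fraction of predicted scene coordinates within each distance of ground
    truth, plus the mean error over points closer than 3 m (outliers excluded).
    The threshold values here are illustrative assumptions."""
    d = np.linalg.norm(pred - gt, axis=1)
    within = {t: float((d < t).mean()) for t in thresholds}
    inlier_mean = float(d[d < 3.0].mean()) if (d < 3.0).any() else float('nan')
    return within, inlier_mean

def pose_errors(dists, angs):
    """Median and 95th-percentile distance / angular errors over all queries."""
    return (np.median(dists), np.percentile(dists, 95),
            np.median(angs), np.percentile(angs, 95))
```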
5 Results

Three types of experiments are performed in order to evaluate our proposed framework for joint re-localisation and scene understanding. In the first two sets of experiments we evaluate the quality of scene coordinate prediction and localisation respectively. In the final set of experiments we explore the feasibility of performing localisation using highly compact and fast-to-query maps made of cuboids approximating buildings.
Scene coordinate regression. Firstly, we compare five CNNs trained with different losses and evaluate their performance on the task of scene coordinate regression on the subsequence 16E5-P2 of the CamVid-360 [BudvytisSC18] dataset. The first loss, L1-Repr, minimizes a reprojection error of points on a spherical image plane: $L_1 = \sum_{p \in \Omega} \left\| \frac{R(y_p - t)}{\left\| R(y_p - t) \right\|} - v_p \right\|$. Here $R$ and $t$ are the ground truth camera rotation and translation, $\Omega$ is the set of all pixels in an image, $y_p$ are the predicted scene coordinates for pixel $p$ and $v_p$ is the unit vector pointing to the projection of pixel $p$ on the spherical image. It is a straightforward adaptation of the reprojection loss used in [BrachmannR18] from planar to spherical images, and is equivalent to the loss proposed in [liangularscr18]. The second loss, L2-Rec, directly minimises the Euclidean distance between predicted and ground truth scene coordinates for all pixels for which ground truth coordinates are available. Depending on the choices of learning rates and relative loss weighting parameters, the first two stages of the approach of [BrachmannR18] can be viewed as a mixture of both aforementioned losses. We approximate this work by the loss L3-Rec-Repr, a weighted combination of L2-Rec and L1-Repr with an empirically set relative weight. Note that we do not use the third stage of differentiable pose prediction and use a classical method [lepetitepnp] instead. Also note that the authors of [BrachmannR18] report only a small advantage from this stage, at the cost of a lack of convergence on the Street scene of the Cambridge Landmarks [PoseNetKendallGC15] dataset. The final two losses considered are L4-Rec-Lab and L5-LRec-Lab. The former combines a standard cross entropy loss used for semantic segmentation with the reconstruction loss L2-Rec, whereas the latter combines a cross entropy loss with a reconstruction loss in the local whitened coordinate space. We set the relative weighting between the cross entropy and reconstruction losses empirically in both cases.
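The first two losses can be sketched as follows. This is a minimal numpy sketch of the loss values on a single image (the actual training losses operate on network output tensors), and the function names are hypothetical.

```python
import numpy as np

def reprojection_loss_sphere(pred_scene, rays, R, t):
    """L1-Repr: distance between unit pixel rays and the reprojection of the
    predicted scene coordinates onto the unit sphere of the camera."""
    cam = (R @ (pred_scene - t).T).T                  # world -> camera frame
    cam = cam / np.linalg.norm(cam, axis=1, keepdims=True)
    return np.linalg.norm(cam - rays, axis=1).mean()

def reconstruction_loss(pred_scene, gt_scene):
    """L2-Rec: Euclidean distance between predicted and ground truth scene
    coordinates, averaged over pixels with available ground truth."""
    return np.linalg.norm(pred_scene - gt_scene, axis=1).mean()
```

Note that the reprojection loss is minimised by any prediction lying on the correct ray, regardless of depth, which is consistent with it failing to recover accurate geometry on its own.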
As shown in Figures 3(a), 4(a) and 5, the reprojection loss (L1-Repr) alone does not enable a CNN to recover accurate geometry; instead, it predicts a point cloud of an approximately spherical shape. In contrast, methods which directly predict scene coordinates (L2, L3, L4) lead to high accuracies, with more than of points residing within of their ground truth location. Our method, L5-LRec-Lab, outperforms the alternatives by more than . This is due to faster convergence at training time, caused by the simpler optimization task resulting from the separation of object center and local coordinate prediction (see Figure 4(b)). The difference in performance between the aforementioned methods becomes even larger when bigger maps such as the full CamVid-360 or the artificial cities are considered, as shown in Figures 3(b), 6 and 7.
Localisation. As with scene coordinate regression, we first evaluate the localisation accuracy of various methods on the sequence 16E5-P2 of the CamVid-360 [BudvytisSC18] dataset in detail. We then follow with experiments on the full length sequences of CamVid-360 and the two artificial cities. It can be seen in Figures 4(c) and (d) that the CNN trained using the L1-Repr loss shows poor performance in estimating the 3D location, but a relatively low angular error. L2-Rec and L3-Rec-Repr perform poorly at both tasks if pixels not belonging to building instances are not masked out at test time. Note that we use ground truth masks in order to evaluate the upper bound of the performance of both methods. Directly predicting semantic scene coordinates (loss L4-Rec-Lab) produces performance similar to the masked versions of L2 and L3, as its localisation accuracy is limited by the accuracy of the predicted 3D coordinates. Hence it is not surprising that our proposed method based on predicting local object coordinates (L5-LRec-Lab) significantly outperforms all the alternative methods on both angular error and camera location distance error. Similar trends are observed in the larger experiments, as reported quantitatively in Figure 7 and qualitatively in Figures 6 and 8. Also note that while PoseNet [KendallC17] (a standard setup with geometric reprojection error and a ResNet-50 encoder; note that in order to use PoseNet [KendallC17] on equirectangular images, an explicit rotation augmentation of the camera pose needs to be performed for each crop, and we limit crops to a horizontal shift only, which corresponds to a rotation of the camera around its vertical axis), adapted to panoramic images, shows seemingly competitive performance on the Small and Large SceneCity data, its performance drops on the CamVid-360 dataset. This can be explained by PoseNet's [KendallC17] sensitivity to overfitting to the training images, as they are obtained from a single video as opposed to a diverse set of images. This is also supported by a significant drop in accuracy on Small SceneCity images with missing buildings.
Localisation in approximate maps. In the final set of experiments we explore the alternative of predicting a 6-DoF pose from simplified, approximate 3D maps. As expected, on the CamVid-360 dataset, scene coordinate and camera pose prediction is less accurate than that of the network trained using the precise 3D map, as shown in Figures 6 and 7. However, on the Small SceneCity data this network outperforms L5-LRec-Lab (Ours), which can be explained by cuboids being a good approximation of the buildings in artificial cities. Moreover, higher localisation performance is obtained than with the competing approaches of PoseNet [KendallC17] and L3-3D-Repr in all experiments. This is a highly encouraging result which in the future may alleviate the need for the computationally expensive step of building 3D point clouds of cities.
6 Conclusion

In this work we presented a novel approach to large scale joint semantic re-localisation and scene understanding via globally unique instance coordinate regression. To the best of our knowledge, this is the first work to demonstrate that the scene coordinate regression framework can be used for large scale localisation. We achieve this by separating the task of scene coordinate prediction into object instance segmentation and local coordinate prediction. This significantly speeds up the convergence of scene coordinate regression networks and allows us to achieve high scene coordinate prediction and localisation accuracies.