Large Scale Joint Semantic Re-Localisation and Scene Understanding via Globally Unique Instance Coordinate Regression

by   Ignas Budvytis, et al.

In this work we present a novel approach to joint semantic localisation and scene understanding. Our work is motivated by the need for localisation algorithms which not only predict 6-DoF camera pose but also simultaneously recognise surrounding objects and estimate 3D geometry. Such capabilities are crucial for computer vision guided systems which interact with the environment: autonomous driving, augmented reality and robotics. In particular, we propose a two step procedure. During the first step we train a convolutional neural network to jointly predict per-pixel globally unique instance labels and corresponding local coordinates for each instance of a static object (e.g. a building). During the second step we obtain scene coordinates by combining object center coordinates and local coordinates and use them to perform 6-DoF camera pose estimation. We evaluate our approach on real world (CamVid-360) and artificial (SceneCity) autonomous driving datasets. We obtain smaller mean distance and angular errors than state-of-the-art 6-DoF pose estimation algorithms based on direct pose regression and pose estimation from scene coordinates on all datasets. Our contributions include: (i) a novel formulation of scene coordinate regression as two separate tasks of object instance recognition and local coordinate regression and a demonstration that our proposed solution allows us to predict accurate 3D geometry of static objects and to estimate the 6-DoF camera pose on (ii) maps larger by several orders of magnitude than previously attempted by scene coordinate regression methods, as well as on (iii) lightweight, approximate 3D maps built from 3D primitives such as building-aligned cuboids.




1 Introduction

Figure 1: Three triplets of images on the top of this figure illustrate a typical result of our framework for joint semantic re-localisation and scene understanding via globally unique instance coordinate prediction. The top left image of each triplet corresponds to a query image provided as an input to our network. The output of the network, consisting of per-pixel 3D coordinates and corresponding globally unique instance labels, is shown on the right together with ground truth (green) and estimated (red) camera poses. The image at the bottom left shows the closest image in the database with ground truth building instance label images overlaid. The bottom part of the figure illustrates the cumulative 3D point cloud built from predicted scene coordinates as well as ground truth (green) and estimated (red) camera poses for sequences 16E5-P2, 16E5-P3 and 01TP. See Section 5 for more quantitative and qualitative results. Zoom in for a better view.

As computer vision enabled robotic systems are increasingly deployed in the real world, simplicity, efficiency, verifiability and robustness of computer vision algorithms become highly important aspects of their design. An example of a desired solution satisfying the aforementioned requirements would likely include training a single network which would predict a structured, semantically meaningful output, the correctness of which could be verified at test time and from which all the necessary tasks for navigation and interaction with environment could be performed.

In this work we propose such a structured representation from which tasks of semantic segmentation, recognition and localisation can be performed efficiently and accurately. Our proposed solution is inspired by works on globally unique instance segmentation [BudvytisSC18] and scene coordinate regression [shottonscenecoords13, Brachmann2017DSACD, BrachmannR18]. It includes the following key steps. First, a dataset of densely sampled images, ideally a video, of the environment is created. It is labelled with globally unique instance labels [BudvytisSC18] and a corresponding 3D point cloud is obtained by running a structure-from-motion algorithm [mappilaryopensfm] on the collected images. Second, a CNN is trained to simultaneously predict globally unique instance labels and local coordinates of corresponding objects. Finally, at test time scene coordinates are formed by combining object center coordinates with local coordinates and are used for 6-DoF camera pose estimation, which is formulated as a perspective-n-point (PnP) problem [lepetitepnp, kneipupnp14]. See Figure 1 for an example output of our method.

We evaluate our approach on real world and artificial autonomous driving datasets. Our method predicts more than and of pixels within of ground truth location for CamVid-360 [BudvytisSC18] and SceneCity Medium [BudvytisSC18] datasets spanning approximately and in driving length. We obtain 22 cm and 20 cm median distance error as well as and median angular errors on estimated camera poses for the same datasets. Our method outperforms competing deep learning based localisation methods based on either direct 6-DoF pose prediction [PoseNetKendallGC15, KendallC17] or pose estimation from scene coordinates [BrachmannR18] on all datasets. When tested on highly challenging scenarios of using a different camera (Google StreetView images) or re-localising in scenes with missing buildings [BudvytisSC18], our method demonstrates higher robustness than alternative approaches. Our contributions include: (i) a novel formulation of scene coordinate regression as two separate tasks of object instance recognition and local coordinate regression and a demonstration that our proposed solution allows us to predict accurate 3D geometry of static objects and to estimate the 6-DoF camera pose on (ii) maps larger by several orders of magnitude than previously attempted [shottonscenecoords13, Brachmann2017DSACD, BrachmannR18, liangularscr18], as well as on (iii) lightweight, approximate 3D maps built from 3D primitives.

The rest of this work is divided as follows. Section 2 discusses relevant work in localisation. Section 3 provides details of our proposed localisation approach. Sections 4 and 5 describe the experiment setup and corresponding results.

2 Related Work

Figure 2: This figure illustrates the three key steps of our method. First, a densely sampled dataset of images is collected and annotated with both class and globally unique instance labels. From these images a 3D point cloud is built using a state-of-the-art library for structure from motion, OpenSFM [mappilaryopensfm]. Second, a CNN is trained to predict panoptic labels as well as local PCA whitened coordinates for each object instance. During the final step predicted local coordinates are unwhitened to obtain corresponding scene coordinates. EPnP [lepetitepnp] with RANSAC is used for camera pose estimation.

In this section we provide a discussion of various related work on localisation.

Matching based localisation. Traditionally, large scale localisation problems are formulated as correspondence problems in the image domain or in the 3D point cloud domain. The first group of approaches work by identifying the most similar looking image in a database primarily in two ways: by employing either (i) a pipeline of keypoint detection and matching [Lowe2004, LIFTYi16, GoogleLandmarks] or (ii) fast-to-compare image level encoding [NetVladArandjelovic16, DenseVlad]. In order to obtain a 6-DoF pose estimation they are augmented with an additional step of establishing feature matches for one or more neighbour images and solving a perspective-n-point problem [kneipupnp14] inside a RANSAC [RANSACFischler] solver. The second group of approaches obtain local 3D geometry by using 3D sensors such as structured light [Izadi11kinectfusion, Scharstein2003], time-of-flight [rwolcott2014a] cameras as well as RGB based structure from motion [torii] and match it to a pre-built 3D model of the environment. While both types of works have a potential for providing high accuracy pose estimates at large scale, they are limited by large storage requirements of feature indices or 3D point clouds and relatively slow correspondence estimation procedures.

Direct location prediction. The need for test-time storage and correspondence estimation is addressed by the works which attempt to directly predict either a coarse [PlaNetWeyand2016] location or a full 6-DoF camera pose [PoseNetKendallGC15, KendallC17]. An estimation of location or a precise camera pose is obtained by simply training a deep network with a corresponding objective. The coarse methods [PlaNetWeyand2016] still require performing additional local feature matching if a 6-DoF pose estimate is needed and hence are not very efficient at test-time. In contrast, methods which directly predict camera pose demonstrate test-time efficiency as only a single pass through a network is required. However they are prone to over-fitting to the training images (e.g. a network may learn to predict a location based on the presence of a parked car in the image) and are not robust to changes in the environment as shown in Section 5 and discussed in detail in [torstenlimitations].

Localisation via scene coordinate prediction. Test time robustness is increased by approaches which perform localisation via scene coordinate regression [shottonscenecoords13, Brachmann2017DSACD, BrachmannR18, liangularscr18]. Such works often train a per-pixel 3D scene coordinate regressor, using either a CNN [Brachmann2017DSACD] or another method (e.g. a random forest), and solve a perspective-n-point problem to obtain the estimate of the camera pose. Early works focus on learning outlier masks [shottonscenecoords13] in order to remove unreliable candidates for pose estimation and propose a differentiable pose estimation [Brachmann2017DSACD] to be compatible with fully end-to-end training schemes at the expense of a more complex learning task. In contrast, [BrachmannR18] proposes to simplify the trainable components by making scene coordinate regression the only trainable part of the algorithm. We further simplify their method by replacing the differentiable pose estimation algorithm with a classical one [kneipupnp14] and simply relying on our network's ability to accurately predict 3D coordinates. We also reformulate scene coordinate regression as a task of joint globally unique instance segmentation and prediction of local object coordinates (see Section 3), which allows us to obtain accurate pose estimates on maps orders of magnitude larger than in [shottonscenecoords13, Brachmann2017DSACD, BrachmannR18, liangularscr18], despite using training data consisting only of videos which traverse the environments of interest once along a simple trajectory.

Semantic localisation. Semantic information is often incorporated into localisation frameworks in one of two ways. Approaches of the first type perform keypoint filtering [OB17] or feature reweighting [kim2017crn, SemanticVisLoc] of dynamic or difficult objects. Approaches of the second type attempt an explicit fitting of 3D models of individual rooms [satkinbmvc2012], or buildings [indooroutdooreccv16] or of detailed maps [SanFranLandmark, SanFranAlignment]. The former methods often increase the accuracy of underlying localisation algorithms but do not directly address their robustness under changes in the environment. The latter methods are often slow at test time and are more suitable for data collection. A recent work [BudvytisSC18] attempts to predict a rich representation of per-pixel globally unique instance labels and shows that it is enough to perform localisation from it under severe changes in the environment. Our work augments this representation with local coordinate prediction, which allows us to obtain 6-DoF pose estimates as opposed to performing image retrieval, to introduce robustness to unseen translation of the camera poses at test time, and to avoid a computationally expensive step of explicit rotational alignment of label images.

3 Method

Figure 3: Part (a) illustrates scene coordinates predicted by CNNs trained on CamVid-360 16E5-P2 sequence with one of five different losses (see Section 5 for definitions) for a single image. Two different views of the same point cloud are shown on top and bottom rows. Red pixels in the images on the bottom row indicate pixels for which predicted 3D location is further than 0.5m (a) and 1m (b,c) away from their true location. Parts (b) and (c) similarly illustrate scene coordinate predictions on Small and Large SceneCity datasets respectively. All networks except ones trained with L1-Repr loss learn a good approximation of scene geometry for CamVid-360 16E5-P2 sequence. However when map size increases (e.g. Large SceneCity dataset) the 3D reconstruction accuracy significantly decreases for methods which attempt to predict 3D values directly (e.g. L3-Rec-Repr). Zoom in for a better view.

Our proposed localisation framework consists of three key steps: data collection, training of a CNN to predict globally unique instance coordinates and pose estimation. More details are provided below and in Figure 2.

Data collection. First, a densely sampled collection of panoramic images of the environment is obtained. Second, a subset of images (e.g. every 30 frames) is hand labelled with both class labels (e.g. sky, road, pedestrian) and globally unique instance labels of buildings, and a label propagation algorithm [Budvytis2017ICCV] is used to label the rest of the images. (Note that instances of other static objects such as trees or road signs could also be used as demonstrated in [BudvytisSC18].) Finally, camera pose estimates and a corresponding semantic 3D point cloud are obtained using OpenSFM [mappilaryopensfm], an open source structure from motion (SfM) library. Default parameter settings are used unless stated otherwise. The point cloud is projected for each training image using the ground truth camera pose in order to produce a 3 channel image containing (x,y,z) coordinates of projected 3D points. When multiple 3D points are projected to the same pixel, the closest one with the same instance label as the source pixel is chosen.
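The projection rule in the last two sentences can be sketched as an instance-aware depth buffer. This is a minimal illustration and not the authors' implementation: the function name `render_coordinate_image`, the equirectangular axis convention (longitude from x and z, latitude from y) and the world-to-camera pose convention `cam = R @ point + t` are all assumptions made for the example.

```python
import numpy as np

def render_coordinate_image(points, point_labels, R, t, label_image):
    """Project labelled 3D points into an equirectangular image.

    points: (M, 3) world coordinates, point_labels: (M,) instance ids,
    R (3, 3) and t (3,) map world to camera (assumed convention),
    label_image: (H, W) instance id per pixel.
    Returns an (H, W, 3) image of world coordinates, NaN where no point lands.
    """
    H, W = label_image.shape
    cam = points @ R.T + t
    depth = np.linalg.norm(cam, axis=1)
    # Assumed spherical parameterisation of the panoramic image.
    lon = np.arctan2(cam[:, 0], cam[:, 2])
    lat = np.arcsin(np.clip(cam[:, 1] / np.maximum(depth, 1e-12), -1.0, 1.0))
    u = ((lon / (2.0 * np.pi) + 0.5) * W).astype(int) % W
    v = np.clip(((lat / np.pi + 0.5) * H).astype(int), 0, H - 1)

    coords = np.full((H, W, 3), np.nan)
    best_depth = np.full((H, W), np.inf)
    for i in range(len(points)):
        row, col = v[i], u[i]
        # Keep only points whose label agrees with the pixel label ...
        if point_labels[i] != label_image[row, col]:
            continue
        # ... and among those keep the one closest to the camera.
        if depth[i] < best_depth[row, col]:
            best_depth[row, col] = depth[i]
            coords[row, col] = points[i]
    return coords
```

Requiring label agreement before the depth test means that a nearer point belonging to a different instance (for example a spurious SfM point in front of a building) does not overwrite the coordinate of the labelled object.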

Training. During the second step a CNN is trained to jointly predict (i) panoptic labels consisting of class labels (e.g. 10 class labels such as road, sky, people as in [BudvytisSC18]) and instance labels of buildings as well as (ii) local PCA whitened coordinates of building instances for each pixel. The unwhitening transformation for point $p$ with label $l$ is performed as follows: $s_p = W_l^{-1} c_p + \mu_l$, where $c_p$ and $s_p$ denote the local whitened coordinates and the scene coordinates of point $p$ respectively, $\mu_l$ is the mean coordinate of all points of label $l$ in the training data and $W_l$ is the whitening matrix for label $l$. We apply standard cross entropy loss for both class and instance labels as in [BudvytisSC18] and a Euclidean distance loss for fitting whitened local instance coordinates $c_p$. Note that for each pixel a $(3+N)$-dimensional vector is predicted, where 3 corresponds to a 3-dimensional coordinate and $N$ corresponds to the total number of panoptic labels. See Section 4 for more details.
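To make the whitening step concrete, the following is a minimal NumPy sketch of per-instance PCA whitening, its inverse, and the decoding of one per-pixel network output into a scene coordinate. It is an illustration under stated assumptions rather than the authors' code: the helper names, the convention that whitening is `c = W (s - mean)`, and the layout of the output vector (three coordinates followed by the label scores) are all hypothetical.

```python
import numpy as np

def fit_whitening(points, eps=1e-8):
    """Fit the mean and PCA whitening matrix W for one instance's (K, 3) points."""
    mean = points.mean(axis=0)
    cov = np.cov((points - mean).T)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Rotate onto the principal axes, then scale each axis to unit variance.
    W = np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return mean, W

def whiten(points, mean, W):
    """Scene coordinates -> local whitened coordinates (assumed c = W (s - mean))."""
    return (points - mean) @ W.T

def unwhiten(local, mean, W):
    """Local whitened coordinates -> scene coordinates (s = W^-1 c + mean)."""
    return local @ np.linalg.inv(W).T + mean

def decode_pixel(output, params):
    """Decode one per-pixel output vector.

    output: assumed layout [3 local coordinates, N label scores].
    params: dict mapping an instance label to its (mean, W).
    Returns (label, scene_coordinate or None for labels without parameters).
    """
    local, scores = output[:3], output[3:]
    label = int(np.argmax(scores))
    if label not in params:
        return label, None
    mean, W = params[label]
    return label, unwhiten(local[None, :], mean, W)[0]
```

Under this reading the network only has to regress values of roughly unit scale for every instance, while the large offsets between buildings are carried by the predicted label through its stored mean.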

Pose estimation. Camera pose is estimated from the predicted scene coordinates using the EPnP [lepetitepnp] perspective-n-point solver inside a RANSAC [RANSACFischler] loop. Other standard PnP solvers [kneipupnp14] can be used as well. RANSAC is run for 1000 iterations with points within considered as an outlier threshold. Since we aim to recover 3D coordinates as accurately as possible, we do not employ the differentiable pose estimation algorithms used in [Brachmann2017DSACD, BrachmannR18]. The reprojection loss component employed in such algorithms reduces the accuracy of predicted 3D geometry in favour of a more accurate camera pose estimation. This is demonstrated in Figure 5 where scene coordinate prediction accuracy is lower for the loss L3-Rec-Repr ( of points lie within of ground truth), which combines reconstruction and reprojection losses, than for L2-Rec ( of points lie within of ground truth), which directly aims at minimising the Euclidean distance between predicted and ground truth scene coordinates.
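The structure of pose estimation from predicted scene coordinates inside a RANSAC loop can be sketched as follows. This is not the EPnP solver used in the paper: a simple linear (DLT-style) solver on unit bearing vectors stands in for it so that the example stays self-contained. The function names, the angular inlier test and the threshold value `thresh_rad` are assumptions made for the illustration, as is the pose convention `bearing ~ normalize(R @ s + t)`.

```python
import numpy as np

def solve_pose_dlt(scene, bearings):
    """Linear pose from >= 6 scene points (K, 3) and unit bearings (K, 3)."""
    rows = []
    for s, d in zip(scene, bearings):
        X = np.append(s, 1.0)
        skew = np.array([[0.0, -d[2], d[1]],
                         [d[2], 0.0, -d[0]],
                         [-d[1], d[0], 0.0]])
        # d x (P X) = 0 is linear in the 12 entries of P = [R | t].
        rows.append(skew @ np.kron(np.eye(3), X))
    _, _, Vt = np.linalg.svd(np.vstack(rows))
    P = Vt[-1].reshape(3, 4)
    if np.linalg.det(P[:, :3]) < 0:   # resolve the sign ambiguity
        P = -P
    U, S, Vt3 = np.linalg.svd(P[:, :3])
    return U @ Vt3, P[:, 3] / S.mean()  # nearest rotation, rescaled translation

def angular_errors(R, t, scene, bearings):
    """Angle (radians) between predicted and observed bearing of each point."""
    pred = scene @ R.T + t
    pred /= np.linalg.norm(pred, axis=1, keepdims=True)
    return np.arccos(np.clip(np.sum(pred * bearings, axis=1), -1.0, 1.0))

def ransac_pose(scene, bearings, iters=1000, thresh_rad=0.01, seed=0):
    """RANSAC over 6-point samples; thresh_rad is a hypothetical threshold."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(scene), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(scene), size=6, replace=False)
        R, t = solve_pose_dlt(scene[idx], bearings[idx])
        inliers = angular_errors(R, t, scene, bearings) < thresh_rad
        if inliers.sum() > best.sum():
            best = inliers
    if best.sum() < 6:
        raise ValueError("no pose hypothesis with enough inliers")
    # Final refit on all inliers of the best hypothesis.
    R, t = solve_pose_dlt(scene[best], bearings[best])
    return R, t, best
```

Working with bearing vectors rather than pinhole pixel coordinates is a choice made here to suit panoramic images; for planar images a standard PnP implementation would take its place.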

4 Experiment Setup

Below we describe the details of the datasets, network architecture and evaluation protocol.

CamVid-360 dataset. CamVid-360 [BudvytisSC18] is a dataset of panoramic videos captured by cycling along the original path of CamVid [Brostow2009]. The CamVid-360 training set consists of 7835 images sampled at 30 fps, at resolution which cover sequences 016E5, 001TP of the original CamVid [Brostow2009] dataset. The query set contains both the test sequence used in [BudvytisSC18] (318 images sampled at 1 fps) and a new additional test sequence obtained by downloading Google StreetView panoramic images along the tracks of the original dataset. (Note that unlike [BudvytisSC18] we do not use images from sequence 006R0 as this sequence is not covered in training data.) We estimate ground truth poses for testing images by minimising the reprojection errors of SIFT [Lowe2004] feature matches from the 80 closest images in the training dataset via robust EPnP [lepetitepnp].

SceneCity dataset. The SceneCity [BudvytisSC18] dataset contains images rendered from two artificial cities. See Figures 3(b,c) and 8 for example images and maps. The first city, referred to as Small SceneCity, is borrowed from [ZhangRFS16]. It contains 102 buildings and 156 road segments. The second city, referred to as Large SceneCity, contains 827 buildings and 966 road segments in total. The training database consists of 1146 and 6774 images sampled uniformly from each city respectively. 3D point clouds are obtained from Blender directly. Our algorithms are evaluated on two variants of Small SceneCity. For the first variant 300 camera poses are sampled uniformly from the original track of [ZhangRFS16]. For the second variant the same camera poses are used but random of buildings are removed from the Small SceneCity map. The query set for the Large SceneCity consists of 1000 samples near the center of road segments as explained in [BudvytisSC18].

Figure 4: Graph (a) plots the percentage of points (vertical axis) for which predicted scene coordinates reside within a given distance (horizontal axis) from a corresponding ground truth location. Our method (L5-LRec-Lab) outperforms alternative approaches by predicting significantly more points with a small Euclidean distance error. For distances larger than , our method is less accurate due to error introduced by mis-predictions of globally unique instance labels. Graph (b) plots the evolution of the percentage of points (vertical axis) which reside within from the ground truth location as the number of training epochs (horizontal axis) increases. Methods which attempt to directly predict 3D coordinates converge significantly slower. While they may eventually reach similar accuracy if the number of epochs were increased significantly, such a solution would be impractical. Graphs (c) and (d) plot the Euclidean distance in meters (vertical axis) between predicted camera location and its ground truth location and the angular error in degrees respectively. The predictions are sorted (horizontal axis) from the smallest on the left side to the largest on the right side.
Figure 5: This figure provides a quantitative evaluation of scene coordinate prediction and localisation performance for five different losses on CamVid-360 sequence 16E5-P2 and its StreetView counterpart. For the task of scene coordinate prediction percentages of pixels within , , and are reported together with average distance for pixels which are within 3m (column M). For the task of localisation median as well as 95th percentile angular error (A) and camera location distance from ground truth value (D) are reported. For methods which do not explicitly predict instance labels (L1-L3), a result which is obtained by masking out pixels which do not belong to building instances (see column GT Mask) is reported for a fair evaluation.
Figure 6: The top left image of this figure displays the ground truth semantic point cloud as well as database (yellow) and query trajectories (green) for original CamVid-360 16E5-P3 sequence and our collected sequence from Google StreetView images (black). Two groups of three columns at the top show cumulative predicted 3D point-clouds (random sample of 1% of total points) with accompanying ground truth camera poses (green) and predicted camera poses (red) for three different methods. Camera poses predicted by PoseNet [KendallC17] are marked in blue. Similar results are provided for sequence 16E5-P2 on the bottom part of the figure. Zoom in for a better view. Also see supplementary material.

Training details. Our network consists of a ResNet-50 [HeZRS15ResNet] backbone followed by bilinear upsampling with skip layers analogously to FCN [long2015fully]. The ResNet-50 implementation and initialization weights provided by the PyTorch repository are used. Strided convolution layers are replaced with dilated convolution in order to reduce the down-sampling inside the network. The models are trained for 3000 epochs unless stated otherwise, using a batch size of (for images of resolution ) and the Adam optimizer [kingma2014adam]. The initial learning rate is set to 2e-4 and polynomial decrease [liu2015parsenet, chen2018deeplab] is applied by multiplying the initial learning rate with at each update step. Weight decay with factor 5e-4 is applied to all kernel weights and 2D-Dropout [tompson2015efficient] with rate is used on top of the final convolutional layer. We train all networks on four GeForce GTX 1080 Ti GPUs.
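A polynomial learning rate decrease of the kind referred to above can be sketched as below. The initial rate 2e-4 is taken from the text; the exponent is not recoverable from it, so `power=0.9` is an assumption (a value commonly used with this schedule), and the function name is hypothetical.

```python
def poly_learning_rate(step, total_steps, base_lr=2e-4, power=0.9):
    """Polynomially decayed learning rate at a given update step.

    base_lr follows the text; power=0.9 is an assumed exponent.
    """
    # Clamp so that steps beyond the schedule do not yield a negative base.
    remaining = max(0.0, 1.0 - step / total_steps)
    return base_lr * remaining ** power


if __name__ == "__main__":
    for step in (0, 2500, 5000, 7500, 10000):
        print(step, poly_learning_rate(step, total_steps=10000))
```

The rate starts at the initial value and falls smoothly to zero at the final update step.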

Evaluation protocol. In this work, we provide quantitative and qualitative evaluation of the accuracy of both predicted scene coordinates and estimated 6-DoF camera poses. Scene coordinate regression is evaluated by measuring the percentage of points falling within , and of corresponding ground truth targets. We also provide the average distance from ground truth coordinates for all points residing within of their targets. Points further away than 3m are excluded as they correspond to outliers which can obscure the true accuracy of the algorithms evaluated. The accuracy of 6-DoF camera pose estimation is evaluated by measuring median and 95th percentile distance and angular errors between predicted and target cameras. See Figures 5 and 7 for examples of results.
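The metrics described in this protocol can be computed along the following lines. This is a sketch with hypothetical function names: the distance thresholds are left as a required argument because their values are not recoverable from the text above, the 3m cap follows the text, and the rotation error is measured as the geodesic angle between rotation matrices, which is one common choice and an assumption here.

```python
import numpy as np

def scene_coordinate_metrics(pred, gt, thresholds, cap=3.0):
    """Percentage of points within each threshold and mean error below the cap.

    pred, gt: (K, 3) predicted and ground truth coordinates in metres.
    """
    dist = np.linalg.norm(np.asarray(pred, float) - np.asarray(gt, float), axis=1)
    within = {th: 100.0 * float(np.mean(dist <= th)) for th in thresholds}
    mean_capped = float(dist[dist <= cap].mean())  # points beyond the cap are excluded
    return within, mean_capped

def pose_errors(R_pred, c_pred, R_gt, c_gt):
    """Camera centre distance and rotation angle (degrees) between two poses."""
    distance = float(np.linalg.norm(np.asarray(c_pred, float) - np.asarray(c_gt, float)))
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    angle = float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
    return distance, angle

def summarise(errors):
    """Median and 95th percentile of a list of per-image errors."""
    return float(np.median(errors)), float(np.percentile(errors, 95))
```

For example, `scene_coordinate_metrics(pred, gt, thresholds=(0.25, 0.5, 1.0))` would report three percentages and the capped mean, where the three threshold values are placeholders chosen only for the illustration.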

5 Experiments

Three types of experiments are performed in order to evaluate our proposed framework for joint re-localisation and scene understanding. In the first two sets of experiments we evaluate the quality of the scene coordinate prediction and localisation respectively. In the final set of experiments we explore the feasibility of performing localisation by using highly compact and fast-to-query maps which are made of cuboids approximating buildings.

Figure 7: This figure provides a quantitative evaluation of our approach on large datasets of CamVid-360 and SceneCity. Similar metrics are used as in Figure 5. A method based on joint reconstruction and reprojection losses, L3-Rec-Repr, is not able to accurately fit large maps. Our method demonstrates superior performance with more than of points predicted within on SceneCity Large. Relatively poorer reconstruction quality in the SceneCity Small dataset (compared to SceneCity Large) can be explained by a higher density of tall buildings as the accuracy of the 3D points drops significantly for the tops of buildings. Our method L5-LRec-Lab (Ours) outperforms both PoseNet [KendallC17] and L3-Rec-Repr in all experiments with the exception of the angular distance error on the Small SceneCity dataset due to the effect of reprojection loss component in . However a version of our method which uses approximate 3D maps outperforms . This can be explained by cuboids being a good approximation of artificial buildings. Also note that numbers in brackets correspond to 80th percentile distance and angular errors for the CamVid-360 StreetView dataset.

Scene coordinate regression. Firstly, we compare five CNNs trained with different losses and evaluate their performance on the task of scene coordinate regression on the subsequence 16E5-P2 of the CamVid-360 [BudvytisSC18] dataset. The first loss L1-Repr minimizes a reprojection error of points on a spherical image plane: . Here and are ground truth camera rotation and translation matrices, - a set of all pixels in an image, - predicted scene coordinates for a pixel and is a vector pointing to pixel projection on a spherical image. It is a straightforward adaptation of the reprojection loss used in [BrachmannR18] from planar images to spherical images. It is also equivalent to a loss proposed in [liangularscr18]. The second loss, L2-Rec, directly minimises the Euclidean distance between predicted scene coordinates and ground truth scene coordinates for all pixels for which ground truth coordinates are available - set . Depending on choices of learning rates and relative loss weighting parameters the first two stages of the approach of [BrachmannR18] can be viewed as a mixture of both aforementioned losses. We approximate this work by loss , where is set empirically to . Note that we do not use the third stage of differentiable pose prediction and use a classical method [lepetitepnp] instead. Also note that the authors of [BrachmannR18] report only a small advantage of using this stage at the cost of a lack of convergence on the Street scene of the Cambridge Landmarks [PoseNetKendallGC15] dataset. The final two losses considered are L4-Rec-Lab and L5-LRec-Lab. The former combines a standard cross entropy loss used for semantic segmentation and a reconstruction loss L2, whereas the latter combines a cross entropy loss with a reconstruction loss in local whitened coordinate space. We empirically set relative weighting between cross entropy loss and reconstruction losses to and respectively.
As shown in Figures 3(a), 4(a) and 5, reprojection loss (L1-Repr) alone does not enable a CNN to recover accurate geometry; the network instead predicts a point cloud of an approximately spherical shape. In contrast, using methods which directly predict scene coordinates (L2, L3, L4) leads to high accuracies with more than of points residing within of their ground truth location. Our method L5-LRec-Lab outperforms the alternatives by more than . This is due to faster convergence at train time, which is caused by a simpler optimization task resulting from the separation of object center and local coordinate prediction (see Figure 4(b)). The difference in performance between the aforementioned methods becomes even bigger when larger maps such as full CamVid-360 or artificial cities are considered, as shown in Figures 3(b), 6 and 7.
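The behaviour of the reprojection and reconstruction losses can be illustrated with a small NumPy sketch. The exact loss formulas are not reproduced here, so the forms below are assumptions: the reprojection term is taken as the mean distance between the unit direction of the transformed predicted coordinate and the pixel's bearing vector, the reconstruction term as the mean Euclidean distance over pixels with ground truth, and the combined term as their weighted sum with a hypothetical `weight`. A training implementation would use an automatic differentiation framework instead of NumPy.

```python
import numpy as np

def reprojection_loss(pred_scene, bearings, R, t):
    """Assumed spherical reprojection error: direction mismatch per pixel."""
    cam = pred_scene @ R.T + t
    unit = cam / np.linalg.norm(cam, axis=1, keepdims=True)
    return float(np.mean(np.linalg.norm(unit - bearings, axis=1)))

def reconstruction_loss(pred_scene, gt_scene, valid):
    """Mean Euclidean distance over pixels with ground truth coordinates."""
    diff = pred_scene[valid] - gt_scene[valid]
    return float(np.mean(np.linalg.norm(diff, axis=1)))

def combined_loss(pred_scene, gt_scene, valid, bearings, R, t, weight):
    """Weighted mixture of both terms; weight is a hypothetical parameter."""
    return (reconstruction_loss(pred_scene, gt_scene, valid)
            + weight * reprojection_loss(pred_scene, bearings, R, t))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.uniform(1.0, 5.0, size=(100, 3))
    bearings = gt / np.linalg.norm(gt, axis=1, keepdims=True)
    R, t = np.eye(3), np.zeros(3)
    valid = np.ones(len(gt), dtype=bool)
    wrong_depth = 2.0 * gt  # every point pushed to twice its true depth
    print(reprojection_loss(wrong_depth, bearings, R, t))   # stays at zero
    print(reconstruction_loss(wrong_depth, gt, valid))      # clearly non-zero
```

Under these assumed forms a prediction with the wrong depth along every viewing ray still has zero reprojection error, which is consistent with the observation that a reprojection loss alone does not constrain the recovered geometry.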

Localisation. As with scene coordinate regression, we first evaluate the localisation accuracy of various methods on the sequence 16E5-P2 of the CamVid-360 [BudvytisSC18] dataset in detail. We then follow with experiments on full length sequences of CamVid-360 and two artificial cities. It can be seen in Figures 4(c) and (d) that a CNN trained using the L1-Repr loss shows poor performance in estimating 3D location, but a relatively low angular error. L2-Rec and L3-Rec-Repr perform poorly at both tasks if pixels not belonging to building instances are not masked out at test time. Note that we use ground truth masks in order to evaluate the upper bound of the performance of both methods. Directly predicting semantic scene coordinates (loss L4-Rec-Lab) produces performance similar to masked versions of L2 and L3 as its localisation accuracy is limited by the accuracy of the 3D coordinates predicted. Hence it is not surprising that our proposed method based on predicting local object coordinates (L5-LRec-Lab) significantly outperforms all the alternative methods in both angular error and camera location distance error. Similar trends are observed on large experiments as reported quantitatively in Figure 7 and qualitatively in Figures 6 and 8. Also note that while PoseNet [KendallC17] (a standard setup with geometric reprojection error and ResNet-50 encoder), adapted to panoramic images, shows a seemingly competitive performance on Small and Large SceneCity data, its performance drops on the CamVid-360 dataset. (Note that in order to use PoseNet [KendallC17] on equirectangular images an explicit rotation augmentation of the camera pose needs to be performed for each crop. We limited crops to a horizontal shift only, which corresponds to the rotation of the camera around its axis.) This drop can be explained by the sensitivity of PoseNet [KendallC17] to overfitting to the training images, as they are obtained from a single video as opposed to a diverse set of images. This is also supported by a significant drop in accuracy on Small SceneCity images with missing buildings.

Figure 8: The top row of this figure shows artificial city maps using a top-view orthographic projection. Each city view has a region zoomed in and visualised from a different angle and a corresponding 3D point cloud obtained by accumulating 3D points predicted from test images. Examples of missing buildings are marked in blue rectangles. Three images at the bottom left illustrate the view seen by a camera marked in red dot. They show camera poses of ground truth database (yellow), ground truth query (green), L3-Rec-Repr (black), PoseNet [KendallC17] (blue) and L5-LRec-Lab (Ours) (red). Zoom in for a better view. Also see supplementary material.

Localisation in approximate maps. In the final set of experiments we explore the alternative of predicting a 6-DoF pose from simplified approximate 3D maps. As expected, on the CamVid-360 dataset, scene coordinate and camera pose prediction is of lower accuracy than that of the network trained using the precise 3D map, as shown in Figures 6 and 7. However on the Small SceneCity data, this network outperforms L5-LRec-Lab (Ours). This can be explained by cuboids being a good approximation for buildings in artificial cities. Moreover, higher localisation performance is obtained than with the competing approaches of PoseNet [KendallC17] and L3-Rec-Repr in all experiments. This is a highly encouraging result which in the future may alleviate the need for a computationally expensive step of building 3D point clouds of cities.

6 Conclusions

In this work we presented a novel approach for large scale joint semantic re-localisation and scene understanding via globally unique instance coordinate regression. To the best of our knowledge this is the first work to demonstrate that the scene coordinate regression framework can be used for large scale localisation. We achieve this by separating the task of scene coordinate prediction into object instance segmentation and local coordinate prediction. This significantly speeds up the convergence of scene coordinate regression networks and allows us to achieve high scene coordinate prediction and localisation accuracies.