LookUP: Vision-Only Real-Time Precise Underground Localisation for Autonomous Mining Vehicles

March 20, 2019 · Fan Zeng et al. · Queensland University of Technology (QUT)

A key capability for autonomous underground mining vehicles is real-time accurate localisation. While significant progress has been made, currently deployed systems have several limitations, ranging from dependence on costly additional infrastructure to failure of both visual and range sensor-based techniques in highly aliased or visually challenging environments. In our previous work, we presented a lightweight coarse vision-based localisation system that could map and then localise to within a few metres in an underground mining environment. However, this level of precision is insufficient for providing a cheaper, more reliable vision-based automation alternative to current range sensor-based systems. Here we present a new precision localisation system dubbed "LookUP", which learns a neural-network-based pixel sampling strategy for estimating homographies based on ceiling-facing cameras without requiring any manual labelling. The new system runs in real time on limited computational resources and is demonstrated on two different underground mine sites, achieving 5 frames per second and a much improved average localisation error of 1.2 metres.


I Introduction

Real-time high-accuracy localisation for autonomous vehicles in underground mine sites is challenging due to a lack of GPS, severe lighting changes, dust and environment ambiguity. As the mining industry seeks to become more efficient, companies are looking for more economical technology that will enable less lucrative secondary mining resources to be feasibly mined. One consequence of this for navigating autonomous mine vehicles is that infrastructure-based techniques are less feasible, while range-based sensors are often expensive and have been reported to struggle in geometrically aliased environments such as long uniform tunnels. Low-cost vision-based localisation technologies are among the most promising alternatives for overcoming these limitations.

Among vision-based localisation methods, the state-of-the-art general-purpose SLAM (Simultaneous Localisation and Mapping) algorithm ORB-SLAM [1] has been shown to perform unsatisfactorily in underground mine site environments [2]. Our previous work [3, 2] on coarse localisation based on whole-image matching has demonstrated localisation accuracy outperforming a state-of-the-art deep learning approach [4], with a mean localisation error of a few metres [2]. Because it localises to the nearest node in the database, its accuracy is limited to the resolution of the node separation in the map. The goal of the research presented in this paper is to build on the previously presented coarse localisation system to enable a higher degree of precision, with the eventual aim of enabling reliable, vision-only autonomous control of underground mining vehicles.

Fig. 1: (a) The proposed system consists of a coarse localisation stage using a forward-facing camera (pink arrow) and a refinement stage, LookUP, using an upward-facing one (orange arrow). (b)(c) Examples of “optical flow” between a query image (top) and a nearby reference image (bottom) when (b) a regular grid is used, and (c) an FCN (Fully Convolutional Network) is used with LookUP. Note the significant reduction in the number of sample points from (b) to (c). Inlier optical flow vectors are coloured green. (d)(f) Sample point quality heat maps generated by the FCN in LookUP for the original images in (e)(g), respectively.

Developing localisation for underground autonomous vehicles presents some challenges and opportunities. Sensing and hardware capabilities are limited as sensors must be toughened, severely restricting the use of recent hardware and limiting the deployment of computationally intensive algorithms including full size deep learning architectures. There are also limitations in the practical amount of training data that can be obtained from a site. Naive deployment of full 6DOF SLAM systems (e.g. [5]) is not necessary as there are a range of constraints that can be applied: the pitch and roll variations of the vehicle (therefore the camera) relative to the tunnel can be assumed to be limited, as is the variation in the height of the ceiling. Even allowing for the occasional three-dimensional structures such as wind pipes, the ceiling of mine tunnels is mostly planar. This offers an opportunity to significantly reduce computation, since theoretically as few as four point-correspondences are required for planar homography estimation. Furthermore, the ceiling-facing camera is less affected by dust and lighting from other vehicles.

The paper makes the following contributions:

  • A new vision-only localisation system, designed for underground mine environments, which takes coarse localisation results and refines them through rapid quasi-planar surface homography estimation.

  • An efficient neural-network-based sample point selector that generates quality heat maps of candidate points for effective pixel-correspondence calculations, and an associated off-line training process that does not require manual dataset labelling.

  • Demonstration of new levels of vision-only localisation accuracy in two new challenging underground mine site datasets.

The paper proceeds as follows. Section II reviews previous work on robust localisation algorithms and various saliency generation methods used as preprocessing filters for image matchers. Section III provides a detailed description of the proposed localisation system. Section IV describes experimental settings including the datasets and our method to build the evaluation benchmark, with the results presented in Section V followed by the conclusion in Section VI.

II Literature Review

II-A LIDAR-Based Localisation Methods

Laser scanners (LIDARs) can provide metric position estimates when there are ample features across the scanned angle span, but laser-scanner-based localisation systems [6, 7, 8, 9] can easily get lost in long tunnels, which are ubiquitous in underground mines, as the scanned point clouds appear confusingly similar along the tunnel. This problem is uncommon in environments such as typical rooms and warehouses because the shape of the enclosing walls provides salient variations across the scanned angle span. In a long tunnel, a LIDAR essentially becomes one dimensional: it only knows its distance to the walls but has no idea how far it has travelled along them. Moreover, in the areas of the mine with more features, such as draw points, there can be objects like metal meshes that confuse localisation methods based on 2D laser scanners, because the returns from the mesh may also form occupied space that could be misinterpreted as a wall to align scans with. Thus, no matter how accurately a LIDAR can perform locally, it is not well suited to our application. Due to these limitations of LIDAR-based methods, we choose to exploit vision-based place recognition methods for localisation, for enhanced global robustness.

II-B Vision-Based Methods

Traditional feature-based place recognition algorithms such as FAB-MAP [10, 11] work poorly in mine-tunnel environments due to severe visual aliasing. SeqSLAM [12] and many other SLAM frameworks [13, 14, 15] are less sensitive but require external sources like GPS or wheel odometry to provide metric information. As demonstrated by our previous results, the coarse localisation unit - Semi-Supervised SLAM - is able to produce better maps than ORB-SLAM [1] with 2.5 times smaller localisation error. Nevertheless, a higher localisation accuracy is desirable to better assist the automated control of vehicle pose during various activities such as digging, dumping and driving.

Given the range of uncertainty of the coarse localisation results and the sparse density of reference images sampled across the mine, the translation between reference and query images can be quite significant compared to the captured range, even when a wide Field Of View (FOV) camera is used, because the walls and ceiling of the tunnel are usually a short distance away from the camera. As a result, the matched point pair, if it exists, can be a large distance apart (Fig. 2) under limited frame-rate constraints. Although we still refer to this translation vector as “optical flow”, traditional optical flow algorithms [16, 17] typically assume small displacements [18] and are not suitable for our application. I2-S2 [19] has been proposed to extract homographies between query and candidate reference images for pixels at predefined image locations. Different saliency generators [20, 3] have been proposed for sample point or patch filtering; however, they are based on pre-determined metrics of pixel intensities and do not adapt automatically to a different context.

Fig. 2: (a) The displacement (“optical flow”) between matched images, shown in (b)(c), can be large. Blue optical flow vectors were rejected by the Pixel Correspondence Matcher; the rest were RANSAC filtered, with inliers and outliers shown in green and gray, respectively.

II-C Deep Convolutional Networks

Deep convolutional networks [21] have been proven to be successful in place recognition [4, 22], image classification and semantic segmentation [23, 24]. However, there is no direct metric information output from these methods. Deep learning based methods [25, 26, 27] have also been used to analyse large optical flows, among which FCN-based pixel labelling [28, 29] is suitable for our application of sample point selection, and an FCN similar to [28] is used in this paper. In the next section, our precise localisation unit “LookUP” will be described.

III Approach

The precise localisation unit takes in a query image and a coarse localisation result, and has access to a database of images with known camera poses. This database can be collected with a single camera or an array of cameras during the surveying process accompanying the construction of a mine. The associated poses can be obtained via surveying tools and recorded alongside the image frames. Based on the coarse localisation result, relevant reference images in the database are cross-examined with the query image. Since the ceilings of mine tunnels provide quasi-planar surfaces that allow homography calculation based on only a handful of points, the camera used in the precise localisation unit looks up towards the ceiling; in addition, the pose estimation requires a “look up” in the database to find the reference camera pose, hence the name “LookUP”.
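To make this interface concrete, the following minimal sketch illustrates one way such a reference database and the “look up” step could be organised; the data layout, field names and function are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed data layout, not the authors' code) of the reference
# "look up": given the timestamp of the coarse-localisation match from the
# forward-facing camera, fetch the ceiling-camera reference frame whose capture
# time is closest, together with its surveyed pose.
from dataclasses import dataclass
from typing import List, Tuple
import bisect

@dataclass
class CeilingReference:
    stamp: float                    # capture time [s]
    pose_xy_yaw: Tuple[float, float, float]  # surveyed (x, y, yaw) in the map
    image_path: str                 # reference ceiling image
    heatmap_path: str               # pre-computed FCN sample-quality heat map

def lookup_reference(db: List[CeilingReference], coarse_stamp: float) -> CeilingReference:
    """Return the ceiling reference taken closest in time to the coarse match.

    The database is assumed to be sorted by capture time."""
    stamps = [r.stamp for r in db]
    i = bisect.bisect_left(stamps, coarse_stamp)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(db)]
    return min((db[j] for j in candidates), key=lambda r: abs(r.stamp - coarse_stamp))
```

At query time, the timestamp of the coarse match would be passed to a function like this to retrieve the ceiling reference frame, its heat map and its surveyed pose.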

Fig. 3: Schematic diagram of the underground localisation system showing the query images (black), the coarse localisation unit (pink) and the precise localisation unit LookUP (orange).

III-A Pixel Correspondence Matcher

The Pixel Correspondence Matcher (Fig. 3) is used to find the most likely corresponding pixel in a query image for a selected pixel in the reference image. It takes a square reference patch (of the patch size listed in Table I) centred at the selected reference pixel, and generates a search neighbourhood in the query image: a square of the search-range size centred at the same pixel coordinates as the selected reference pixel. It then compares the reference patch to a set of candidate patches centred at every pixel in the search neighbourhood. The candidate pixel with the best match in terms of Sum of Absolute Differences (SAD) score is reported. The process is visualised as colour-coded “optical flows” in Fig. 4.
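As an illustration of this matching step, the sketch below implements a brute-force SAD template search over the search neighbourhood; the function name and the default patch and search sizes (taken from the Mine A values in Table I) are assumptions rather than the authors' code.

```python
# A minimal sketch (not the authors' implementation) of the SAD-based Pixel
# Correspondence Matcher: for a selected reference pixel, compare the patch
# around it against every candidate patch inside a square search neighbourhood
# of the query image and return the displacement ("optical flow") of the best
# candidate. Assumes the selected pixel lies at least patch/2 from the borders.
import numpy as np

def match_pixel(ref, qry, u, v, patch=40, search=40):
    """ref, qry: greyscale images as 2D uint8 arrays; (u, v) = (col, row) of the
    selected reference pixel. Returns ((du, dv), best SAD score)."""
    h = patch // 2
    ref_patch = ref[v - h:v + h, u - h:u + h].astype(np.int32)
    best_score, best_flow = np.inf, (0, 0)
    r = search // 2
    for dv in range(-r, r + 1):
        for du in range(-r, r + 1):
            qv, qu = v + dv, u + du
            cand = qry[qv - h:qv + h, qu - h:qu + h].astype(np.int32)
            if cand.shape != ref_patch.shape:
                continue                        # candidate falls outside the query image
            score = np.abs(cand - ref_patch).sum()   # Sum of Absolute Differences
            if score < best_score:
                best_score, best_flow = score, (du, dv)
    return best_flow, best_score
```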

Fig. 4: Two examples of “optical flow” between a query image (top row) and a set of reference images (bottom row), featuring “optical flow” outliers caused by (a) 3D objects on the ceiling, and (b) uneven rock surfaces.

The processing time for each pair of reference and query images is proportional to the number of sampled pixels for which the pixel correspondence is to be found. Under the small pitch and roll assumption, the inlier vectors output by a RANSAC [30] filter are most likely similar in direction and magnitude (Fig. 1(b)). If we could identify such inliers in advance and only find pixel correspondences for them, the computation time could be reduced. As for the outliers, since they are excluded from the homography calculation anyway, it would be better if they were not sampled in the first place. In this paper we use an off-the-shelf neural network (a VGG16-based FCN) to produce sample point quality scores.

III-B Sample Point Selector

As can be seen from Fig. 4, the outlier sample points that produce inconsistent optical flow vectors are most likely on 3D objects (Fig. 4(a)) or uneven rock surfaces (Fig. 4(b)) on the ceiling. However, rather than defining rigid rules for classification, such as “avoid long wires, pipes and strong lights”, more general and adaptive qualification criteria are desirable. This is because certain objects may provide high-quality sample points for template matching in some situations but not in others: the semantics of the features affect their quality as sample points, involving contextual information many pixels away from them. Support Vector Machines [31] are usually effective binary classifiers but they are limited to local information around the sample point. A neural network architecture that incorporates more holistic information is preferred.

Although feature-based methods are susceptible to visual aliasing in our underground localisation application, they may work well for sample point quality generation because visual aliasing is not a problem for this task. We implemented an FCN similar to the one described in [29]. The query image is fed into the convolutional layers of a VGG16 [32] network pre-trained on the ImageNet dataset [33], the output of which goes into a convolution and three up-sampling layers, with skip connections to layer 3 and layer 4 of the original VGG16. The output of the network is a heat map of sample point quality (Figs. 1(d) and 1(f)), according to which the sample points are selected (Fig. 3). The training dataset for the FCN is generated by applying the Pixel Correspondence Matcher in LookUP to a training image set, processing points densely sampled on a regular grid, as shown in Figs. 1(b) and 4. RANSAC is used to classify the sampled points into inliers (colour coded green) and outliers (colour coded gray), according to the optical flow vector obtained at each sample point. The FCN is then trained using this labelled data, with a loss function proportional to the total number of misclassified sample points in the training images. The output of the sample point selector is a heat map of quality for all candidate pixels. After the FCN is trained, all the reference images are processed with it and the corresponding sample quality heat maps are generated alongside the reference image database. The training and classification processes are completed off-line; therefore they neither take up on-line run time nor require a GPU in the localisation system. At run time, it is up to the Pixel Correspondence Matcher to decide how this heat map should be used.
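For concreteness, a minimal sketch of such a VGG16-based FCN in TensorFlow/Keras is given below, following the FCN-8s-style skip-connection pattern of [28, 29]; the exact layer shapes, names and input size are assumptions, since the paper does not specify them.

```python
# A minimal sketch (illustrative, assumed architecture details) of an FCN with a
# VGG16 backbone pre-trained on ImageNet, one scoring convolution and three
# up-sampling stages with skip connections to the pool3 and pool4 feature maps.
# The "inlier" channel of the softmax output serves as the sample-quality heat map.
import tensorflow as tf

def build_sample_point_fcn(input_shape=(480, 640, 3), n_classes=2):
    vgg = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                      input_shape=input_shape)
    pool3 = vgg.get_layer("block3_pool").output   # skip connection, 1/8 resolution
    pool4 = vgg.get_layer("block4_pool").output   # skip connection, 1/16 resolution
    pool5 = vgg.get_layer("block5_pool").output   # 1/32 resolution

    x = tf.keras.layers.Conv2D(n_classes, 1)(pool5)                       # score layer
    x = tf.keras.layers.Conv2DTranspose(n_classes, 4, strides=2, padding="same")(x)
    x = tf.keras.layers.Add()([x, tf.keras.layers.Conv2D(n_classes, 1)(pool4)])
    x = tf.keras.layers.Conv2DTranspose(n_classes, 4, strides=2, padding="same")(x)
    x = tf.keras.layers.Add()([x, tf.keras.layers.Conv2D(n_classes, 1)(pool3)])
    x = tf.keras.layers.Conv2DTranspose(n_classes, 16, strides=8, padding="same")(x)
    heatmap = tf.keras.layers.Softmax(axis=-1)(x)  # per-pixel class probabilities
    return tf.keras.Model(vgg.input, heatmap)
```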

III-C Homography Estimator

The set of “optical flow” vectors calculated by the Pixel Correspondence Matcher from all selected sample points is used to compute a homography matrix that relates the poses of the query and reference images. Although the decomposition of the homography matrix into a relative pose admits multiple solutions, it is not hard to identify the one that makes physical sense by choosing the solution that gives the smallest pitch and roll. Before the homography is computed, an optional RANSAC filtering step is applied if the number of sampled points is greater than 10.
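A minimal sketch of this step using OpenCV is shown below; the RANSAC reprojection threshold, the handling of the camera intrinsics K and the pitch/roll extraction convention are assumptions made for illustration, not the authors' exact implementation.

```python
# A minimal sketch of the homography estimator: build point correspondences from
# the sampled "optical flow" vectors, apply RANSAC when enough points are
# available, decompose the homography with the camera intrinsics K, and keep the
# solution with the smallest pitch and roll.
import cv2
import numpy as np

def estimate_relative_pose(ref_pts, flows, K, ransac_min_pts=10):
    ref_pts = np.asarray(ref_pts, dtype=np.float32)          # (N, 2) sample points
    qry_pts = ref_pts + np.asarray(flows, dtype=np.float32)  # displaced by the flow
    method = cv2.RANSAC if len(ref_pts) > ransac_min_pts else 0
    H, mask = cv2.findHomography(ref_pts, qry_pts, method, 3.0)
    _, rotations, translations, normals = cv2.decomposeHomographyMat(H, K)

    def pitch_roll(R):
        # pitch/roll magnitude under a ZYX Euler convention
        pitch = np.arcsin(-R[2, 0])
        roll = np.arctan2(R[2, 1], R[2, 2])
        return abs(pitch) + abs(roll)

    best = min(range(len(rotations)), key=lambda i: pitch_roll(rotations[i]))
    return rotations[best], translations[best], mask
```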

III-D Determination of Scaling Constant

The above homography estimation process can be done with multiple reference images (each column in Fig. 4(a)(b)). If there is more than one reference image for which good matches are found, it is possible to estimate the constant that converts distances from pixel to metric space, using the assumption that the scaling constant should be similar for both homography relations. If only one reference image is used for faster processing, it is also possible to use a pre-determined constant for this conversion under the assumption of small variations in the ceiling height.
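The sketch below spells this idea out for the two-reference case: assuming a common scale, the query position expressed relative to each reference must coincide, which fixes the scale. This is our reconstruction of the idea stated above, not the authors' exact formulation.

```python
# A minimal sketch, assuming a roughly constant ceiling height so the same
# pixel-to-metre scale s applies to both homography translations:
#   q = p1 + s * t1 = p2 + s * t2   =>   s = |p1 - p2| / |t2 - t1|
import numpy as np

def pixel_to_metre_scale(p1, p2, t1, t2):
    """p1, p2: metric (x, y) positions of two reference cameras.
    t1, t2: in-plane pixel translations of the query relative to each reference.
    The two references must be at distinct positions (|t2 - t1| > 0)."""
    p1, p2, t1, t2 = map(np.asarray, (p1, p2, t1, t2))
    return np.linalg.norm(p1 - p2) / np.linalg.norm(t2 - t1)
```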

III-E Integration with the Coarse Localisation Unit

Currently, the interface between LookUP and Semi-Supervised SLAM is simply the time stamp of the database image that is considered a match. LookUP will fetch the reference images from the ceiling-facing camera that were taken most closely in time to the matched database image from the forward-facing camera. A refined location is estimated by LookUP using this reference image, and the system then decides whether this refinement should be applied. Two filters are applied: the refinement is deemed unreliable if 1) the percentage of inliers after RANSAC filtering is lower than a threshold (the minimum inlier percentage in Table I), or 2) the x or y translation from the reference pose extracted from the homography is larger than a threshold (the maximum displacement in Table I). These cases can occur if the coarse localisation result is incorrect, or if the relative displacement is larger than the search range. The system simply falls back to the coarse localisation result when LookUP is not confident. Apart from the above interface, the coarse and fine localisation units are highly independent and can be optimised separately. Next we describe the experiments we have done to evaluate the performance of the LookUP system.
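A minimal sketch of these two acceptance tests, using the threshold values from Table I (60% minimum inlier percentage, 2 metre maximum displacement), is given below; the function and variable names are illustrative.

```python
# A minimal sketch of the two acceptance filters described above. If either test
# fails, the system falls back to the coarse localisation result.
def accept_refinement(inlier_ratio, tx_m, ty_m,
                      min_inlier_ratio=0.6, max_disp_m=2.0):
    """Return True if the LookUP refinement should replace the coarse estimate."""
    if inlier_ratio < min_inlier_ratio:                    # filter 1: too few RANSAC inliers
        return False
    if abs(tx_m) > max_disp_m or abs(ty_m) > max_disp_m:   # filter 2: implausibly large jump
        return False
    return True
```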

IV Experiments

In order to evaluate the precise localisation system, coarse localisation needs to be performed first. Based on a map in which the reference poses are defined, the coarse localisation system, Semi-Supervised SLAM [2], takes in images with known locations and constructs an internal database according to their associated locations, grouping images taken at adjacent places into the same node and saving them in a database. When sequences of query images arrive, it compares the query images to the reference images in the database and generates a confusion matrix corresponding to the sequence of query images. Using the confusion matrix, LookUP is then run to output metric location results in the map.
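For illustration only, the sketch below shows one way a confusion matrix could be turned into a matched reference node per query; the function, the interpretation of the matrix entries as similarities, and the thresholding are assumptions rather than the authors' interface.

```python
# A minimal sketch: each row of the confusion matrix holds the match score of one
# query image against every reference node; the best-matching node is the row
# argmax. (If the matrix stores image differences instead of similarities, the
# argmax becomes an argmin and the threshold test is reversed.)
import numpy as np

def best_matches(confusion, reference_stamps, min_score=0.0):
    """confusion: (n_queries, n_references) score matrix.
    reference_stamps: capture time of each reference node.
    Returns the time stamp of the best-match reference for each query,
    or None where no reference scores above min_score."""
    best = np.argmax(confusion, axis=1)
    return [reference_stamps[j] if confusion[i, j] > min_score else None
            for i, j in enumerate(best)]
```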

To evaluate the refinement achieved by LookUP, the localisation results corresponding to the confusion matrix were also generated by disabling LookUP and directly outputting the reference poses corresponding to the time stamp of the matched reference image. The frames for which Semi-Supervised SLAM generated localisation errors that were greater than 10 metres, for which a refinement is hardly possible, were excluded from the evaluation. Next we describe the real-world datasets collected to do such evaluation and how the maps, reference poses and benchmark localisation results were obtained.

IV-A Datasets

In order to evaluate the localisation accuracy, a different localisation system that can generate benchmark localisation results that are at least locally accurate must be applicable to the datasets. If the datasets contain many draw points and junctions but few long stretches of tunnels, algorithms based on laser scan matching can be used for benchmarking. Based on such criteria, the following datasets were collected.

  1. Mine A dataset: This dataset includes nine traverses of a heavy vehicle in two connected tunnels of an underground mine (Fig. 5(a)). Four of the traverses are used to build the map and the reference image database; the other five are used as localisation queries. This is the same dataset used in [2].

  2. Mine B dataset: The majority of the optical flows between images in the Mine A dataset are along the travelling direction of the vehicle. LookUP, on the other hand, does not constrain the optical flow search to one direction. To study the generality of LookUP, a second dataset was collected in a different mine, featuring four traverses of a light vehicle in a mine tunnel (Fig. 5(b)). Traverse Middle (M): the light vehicle was driven along the centre of the tunnel. Traverses Left (L) and Right (R): the light vehicle was driven close to the left and right walls, respectively. Traverse Zigzag (Z): the light vehicle was driven deliberately in a zigzag motion. Traverse M was used to build the SLAM map; Traverses L, M and R were used to build the reference image database; Traverse Z was used as the localisation query. In this way, the query images in this dataset can have optical flows in various directions w.r.t. the references.

    Altogether the two datasets contain 276,063 data frames over 5,117 seconds of 50 kilometres of traverses (average vehicle speed 35 km/h). These datasets are particularly challenging due to the abundance of heavily aliased patterns at multiple scales.

IV-B Map Building and Reference Poses

  1. Mine A: The coarse localisation results were taken directly from [2]. However, the metric locations from [2] were based on an external Radio Telemetry System that is not accurate enough for evaluating the precise localisation system. A more precise occupancy grid map was required to generate the reference and benchmark poses. An attempt to build such a map using Hector-Mapping [34] was unsuccessful since this dataset contains a few sections of long tunnels and metal meshes. Therefore, a different approach was used to build the map. First, four separate maps, one for each reference traverse, were built using Cartographer [35]; then the four maps were manually aligned to form a large map, shown as the black occupancy grid in Fig. 5(a). The manual assembly was necessary because the four traverses used for map building were not collected continuously in time and space. The reference poses were then obtained by running AMCL [36] on the stitched map, subscribing to the “ROS tf frames” [37] published by Cartographer.

  2. Mine B: The map of the mine tunnel was successfully built using Hector-Mapping, shown as the blue occupancy grid in Fig. 5(b). The camera poses of the reference images were obtained by running AMCL on this map, subscribing to the “ROS tf frames” published by Hector-Mapping. Unlike the Mine A dataset, no coarse localisation results were available, so AMCL on the same map was also used to generate the locations associated with the images used by Semi-Supervised SLAM.

    It should be clarified that it is not necessary to obtain reference poses in this way. We obtained the reference poses using the laser scan data with an occupancy grid map simply because this dataset was collected after the construction of the mine and we did not have surveying capabilities.

(a) SLAM Map of Mine A, built by Cartographer [35].
(b) SLAM Map of Mine B. Blue: Hector-Mapping [34]; Black: Cartographer [35].
Fig. 5: Maps built by the SLAM algorithms in [34, 35]. Note our system does not depend on these algorithms.

IV-C Localisation Benchmark

To calculate the localisation errors for evaluating different system settings, AMCL was run on the query traverses to produce the benchmark poses. During the AMCL runs, the poses of the vehicle and the laser scan results were visualised together with the maps (Fig. 5). Except for the beginning of each traverse, when AMCL is “initializing likelihood field model with probabilities”, and a few times in the tunnel sections of the maps (Fig. 5) where there are no draw points or junctions, the laser scans align well with the map. Since the reference poses are built with the same maps in the same way, although the maps and the AMCL poses may not be globally accurate, the AMCL poses can be considered locally reliable enough to be used for the local refinements presented in this paper, which are essentially relative pose transformations indifferent to absolute global coordinates.

Additionally, the global accuracy of the whole system was cross-verified with an independent algorithm on the Mine B dataset. The state-of-the-art SLAM algorithm Cartographer [35], not otherwise used for Mine B, was chosen to build a second map (black occupancy grid in Fig. 5(b)). The two SLAM algorithms work on different principles: AMCL uses particle filters and Cartographer uses iterative pose graph optimisation. Proper loop closure was achieved by both algorithms, which is non-trivial for such datasets. As shown in Fig. 5(b), the difference between the two maps is within 5 metres, indicating the accuracy of the AMCL poses in a more global sense.

IV-D Comparison of FCN with Regular Grid

The FCN was implemented with Tensorflow [38] in Python. It was trained with Stochastic Gradient Descent (SGD) with a batch size of 8 (the maximum that fits into an NVIDIA GeForce GTX 1080 GPU) and a dropout rate of 50%. The Adam optimiser and Softmax activation were used to generate the sample quality heat map. LookUP iteratively selects the best sample point (the one with the highest heat map value) and applies a fixed reduction factor (Table I) to the heat map values in a square neighbourhood around it. It continues to pick the next best sample point until the required number of sample points is reached. The FCN-based sample point selector was evaluated on the Mine A dataset in comparison with a regular grid sampling method. The regular grid contains 24 sample points (at the cost of more computation), whereas only the top 12 points from the FCN-based sample point classifier were processed. All other parameters were kept the same. Selected frames of query images from query traverse 0 were used to train the FCN. After that, the FCN generated sample point quality maps for all reference images in the database, which does not include any image the FCN was trained on. The FCN for the Mine B dataset was trained on sub-sampled query frames and classified sample points for the reference images.
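The greedy selection loop described above can be sketched as follows, using the Mine A values from Table I (reduction factor 0.5, 10-pixel neighbourhood, 12 sample points); the function is illustrative rather than the authors' implementation.

```python
# A minimal sketch of the greedy sample-point selection: repeatedly take the
# highest-scoring pixel in the heat map, then multiply the scores in a square
# neighbourhood around it by the reduction factor so that later picks are spread
# across the image rather than clustered on one bright region.
import numpy as np

def select_sample_points(heatmap, n_points=12, neighbourhood=10, reduction=0.5):
    scores = heatmap.astype(np.float64).copy()
    h = neighbourhood // 2
    points = []
    for _ in range(n_points):
        v, u = np.unravel_index(np.argmax(scores), scores.shape)  # best remaining pixel
        points.append((u, v))                                     # (col, row)
        v0, v1 = max(0, v - h), min(scores.shape[0], v + h + 1)
        u0, u1 = max(0, u - h), min(scores.shape[1], u + h + 1)
        scores[v0:v1, u0:u1] *= reduction   # damp the neighbourhood of the chosen point
    return points
```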

IV-E Parameters

The parameters in Table I were used to obtain the results in the next section.

Value | Unit | Description
40 | pixels | Search range, Mine A
70 | pixels | Search range, Mine B
40 | pixels | Patch size, Mine A
60 | pixels | Patch size, Mine B
0.5 | – | Factor multiplied to heat map values within the neighbourhood of the currently selected sample point
10 | pixels | Neighbourhood size, Mine A
20 | pixels | Neighbourhood size, Mine B
60 | % | Min. inlier percentage
2 | metres | Max. displacement threshold
TABLE I: PARAMETER LIST

V Results

V-A Evaluation of the FCN

The performance of the FCN in generating high-quality sample points was evaluated on test sets of images different from the training sets. The classification accuracy of the best sample point selected by the FCN was compared with that of a random point generator (representing the percentage of good sample points in the ground truth). The percentage of correct classifications on the Mine A test sets was 74%, compared to 62% from a random sampler; on the Mine B test sets it was 41%, compared to 11%.

V-B Localisation Results of LookUP

As shown in Fig. 6(a), LookUP can successfully extract optical flow in various directions, and its ability to refine the coarse localisation results is not limited to the travel direction (Figs. 6(b)-6(d)).

Fig. 6: (a) Optical flow between the query image (top row) and various reference images (bottom row) for the frame in (d). (b-d) Localisation results of three sample frames from the Mine B dataset, showing refinements in different directions.

V-C Effectiveness of Sample Point Classifier

The mean localisation errors obtained for each traverse with a) Semi-Supervised SLAM without refinement, b) LookUP with the FCN, and c) LookUP with the regular grid sample point selector are shown in Fig. 7. The localisation refinements computed by LookUP with the regular grid lead to consistent but small error reductions, while LookUP with the FCN sample point selector consistently leads to significant error reductions (as much as 27% for traverse 3). This is because the indiscriminately sampled points on a regular grid result in false positive matches and therefore inaccurate optical flows for the Pixel Correspondence Matcher. Note that the mean errors reported for the coarse localisation method in Fig. 7 are significantly lower than the 9.44 metres reported in [2]. There are two major reasons: firstly, as mentioned previously, for all traverses and all methods, frames for which Semi-Supervised SLAM produced errors greater than 10 metres were excluded from the evaluation; secondly, the map and set of benchmarks used in [2] were generated by an external Radio Telemetry System, which was less accurate than AMCL.

Fig. 7: Mean localisation error for each query traverse under different system settings.

V-D Computation Time

To study the computation time performance of the LookUP unit, the coarse localisation result was obtained first by running the Semi-Supervised SLAM unit on all the query images, and the confusion matrix was saved before the timer was started. The following processes are all included in the computation time: for each query image, LookUP reads the pre-computed confusion matrix, searches for the best-match coarse reference image from the forward-facing camera for that query, and “looks up” the corresponding ceiling images with the closest time stamps. The homography result is then calculated and saved to a file. Subsequent filtering, analyses and plotting are not timed. On an Intel i7-7700K 4.20 GHz CPU, LookUP with the FCN took 15 minutes to generate all results for Traverse 0 of Mine A, an average of 5 frames per second (fps), which is acceptable for real-time operation in our application. Note that multiple reference images may be processed for each query input; the frame rate for processing each reference-query pair is 22 fps.

VI Conclusion

In this paper, we designed and characterised “LookUP”, a refinement unit for our localisation system for vehicles in underground mine tunnel environments. It works by finding homographies based on matched pixels between query and reference images of the mine ceiling. The accuracy of LookUP is enhanced by generating pixel correspondences only at high-quality sample points proposed by an FCN. Selectively processing high-quality sample points also significantly increased the frame rate, to 5 fps. This result was obtained using code that is yet to be optimised, and the system could potentially run even faster if a GPU is available. The proposed system provides a viable framework for industrial applications in underground mines.

References

  • [1] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, “ORB-SLAM: a Versatile and Accurate Monocular SLAM System,” IEEE Trans. Robot., vol. 31, no. 5, pp. 1147–1163, 2015.
  • [2] A. Jacobson, F. Zeng, D. Smith, N. Boswell, T. Peynot, and M. Milford, “Semi-supervised slam: Leveraging low-cost sensors on underground autonomous vehicles for position tracking,” in IEEE Int. Conf. Intell. Robot. Syst. (IROS), 2018.
  • [3] F. Zeng, A. Jacobson, D. Smith, N. Boswell, T. Peynot, and M. Milford, “Enhancing Underground Visual Place Recognition with Shannon Entropy Saliency,” in Australas. Conf. Robot. Autom. (ACRA), Sydney, Australia, 2017.
  • [4] N. Sünderhauf, S. Shirazi, F. Dayoub, B. Upcroft, and M. Milford, “On the performance of convnet features for place recognition,” in Intell. Robot. Syst. (IROS), 2015.
  • [5] P. Newman, G. Sibley, M. Smith, M. Cummins, A. Harrison, C. Mei, I. Posner, R. Shade, D. Schroeter, L. Murphy, W. Churchill, D. Cole, and I. Reid, “Navigating, Recognizing and Describing Urban Spaces With Vision and Lasers,” Int. J. Rob. Res., 2009.
  • [6] D. M. Cole and P. M. Newman, “Using laser range data for 3D SLAM in outdoor environments,” in IEEE Int. Conf. Robot. Autom. (ICRA), 2006.
  • [7] M. Magnusson, H. Andreasson, A. Nüchter, and A. J. Lilienthal, “Appearance-based place recognition from 3D laser data using the normal distributions transform,” in IEEE Int. Conf. Robot. Autom. (ICRA), 2009.
  • [8] C. Sprunk, G. D. Tipaldi, A. Cherubini, and W. Burgard, “Lidar-based teach-and-repeat of mobile robot trajectories,” in IEEE Int. Conf. Intell. Robot. Syst. (IROS), 2013.
  • [9] M. Bosse and J. Roberts, “Histogram Matching and Global Initialization for Laser-only SLAM in Large Unstructured Environments,” in IEEE Int. Conf. Robot. Autom. (ICRA), 2007.
  • [10] M. Cummins and P. Newman, “FAB-MAP: Probabilistic localization and mapping in the space of appearance,” Int. J. Rob. Res., vol. 27, no. 6, pp. 647–665, 2008.
  • [11] ——, “Highly scalable appearance-only SLAM - FAB-MAP 2.0,” in Robot. Sci. Syst., Seattle, United States, 2009.
  • [12] M. J. Milford and G. F. Wyeth, “SeqSLAM: Visual route-based navigation for sunny summer days and stormy winter nights,” in IEEE Int. Conf. Robot. Autom. (ICRA), 2012.
  • [13] M. J. Milford, G. Wyeth, and D. Prasser, “RatSLAM: A Hippocampal Model for Simultaneous Localization and Mapping,” in IEEE Int. Conf. Robot. Autom. (ICRA), 2004.
  • [14] W. Maddern, M. Milford, and G. Wyeth, “CAT-SLAM: probabilistic localisation and mapping using a continuous appearance-based trajectory,” Int. J. Rob. Res., vol. 31, no. 4, pp. 429–451, 2012.
  • [15] A. J. Glover, W. P. Maddern, M. J. Milford, and G. F. Wyeth, “FAB-MAP + RatSLAM: Appearance-based SLAM for Multiple Times of Day,” in IEEE Int. Conf. Robot. Autom. (ICRA), Anchorage, United States, 2010.
  • [16] B. D. Lucas, T. Kanade, and Others, “An iterative image registration technique with an application to stereo vision,” 7th Int. Jt. Conf. Artif. Intell., 1981.
  • [17] J. Shi and C. Tomasi, “Good features to track,” Cornell University, Tech. Rep., 1993.
  • [18] P. Fua and V. Lepetit, “Monocular model-based 3d tracking of rigid objects,” Comput. Graph. Vis, vol. 1, no. 1, pp. 1–89, 2005.
  • [19] F. Zeng, A. Jacobson, D. Smith, N. Boswell, T. Peynot, and M. Milford, “I2-S2: Intra-Image-SeqSLAM for more accurate vision-based localisation in underground mines,” in Australas. Conf. Robot. Autom. (ACRA), Canterbury, New Zealand, 2018.
  • [20] M. Milford, E. Vig, W. Scheirer, and D. Cox, “Vision-based simultaneous localization and mapping in changing outdoor environments,” J. F. Robot., vol. 31, no. 5, pp. 814–836, 2014.
  • [21] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
  • [22] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “NetVLAD: CNN architecture for weakly supervised place recognition,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2016.
  • [23] R. Girshick, “Fast r-cnn,” in IEEE Int. Conf. Comput. Vis., 2015.
  • [24] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Adv. Neural Inf. Process. Syst., 2015.
  • [25] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox, “Flownet: Learning optical flow with convolutional networks,” in IEEE Int. Conf. Comput. Vis., 2015.
  • [26] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid, “DeepFlow: Large displacement optical flow with deep matching,” in IEEE Int. Conf. Comput. Vis., 2013.
  • [27] J. Wulff, L. Sevilla-Lara, and M. J. Black, “Optical Flow in Mostly Rigid Scenes,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2017.
  • [28] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2015.
  • [29] H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in IEEE Int. Conf. Comput. Vis., 2015.
  • [30] M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Commun. ACM, vol. 24, no. 6, pp. 381–395, 1981.
  • [31] J. A. K. Suykens and J. Vandewalle, “Least squares support vector machine classifiers,” Neural Process. Lett., vol. 9, no. 3, pp. 293–300, 1999.
  • [32] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [33] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2009.
  • [34] S. Kohlbrecher, J. Meyer, O. von Stryk, and U. Klingauf, “A Flexible and Scalable SLAM System with Full 3D Motion Estimation,” in IEEE Int. Symp. Safety, Secur. Rescue Robot., 2011.
  • [35] W. Hess, D. Kohler, H. Rapp, and D. Andor, “Real-Time Loop Closure in 2D LIDAR SLAM,” in 2016 IEEE Int. Conf. Robot. Autom., 2016, pp. 1271–1278.
  • [36] F. Dellaert, D. Fox, W. Burgard, and S. Thrun, “Monte carlo localization for mobile robots,” in IEEE Int. Conf. Robot. Autom. (ICRA), 1999.
  • [37] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng, “ROS: an open-source Robot Operating System,” in ICRA Work. open source Softw., vol. 3, no. 3.2.   Kobe, Japan, 2009, p. 5.
  • [38] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., “Tensorflow: A system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.