In this work, we attempt to address the problem of performing metric localization in a known environment under extreme changes in visual scale. Our localization approach is based on the identification of objects in the environment, and their use as landmarks. By “objects” we here mean physical entities which are distinct from their surroundings and have some consistent physical properties of structure and appearance.
Many robotic applications involve repeated traversals of a known environment over time. In such applications, it is usually beneficial to first construct a map of the environment, which a robot can then use to navigate in subsequent missions. Surveying the environment from a very high altitude allows complete geographic coverage to be obtained with shorter, and thus more efficient, survey paths. At the same time, a robot that uses this high-altitude map to localize may have mission parameters requiring it to operate at a much lower altitude.
One such scenario is that of performing visual surveys of benthic environments, such as coral reefs, as in Johnson-Roberson et al. A fast-moving surface vehicle may be used to rapidly map a large area of a reef. This map may then be used by a slower-moving, but more maneuverable, autonomous underwater vehicle (AUV), such as the Aqua robot, to navigate the reef while capturing imagery very close to the sea floor. Another relevant scenario is that of a robot performing loop closure over long distances as part of Simultaneous Localization And Mapping (SLAM). Loop closure, the recognition of a previously-viewed location when viewing it a second time, is key to accurate SLAM, and the overall accuracy of SLAM techniques could be considerably improved if loop closure could be conducted across major changes in scale and perspective.
In scenarios such as these, a robot must cope with a potentially very large change in visual scale between the two perspectives. In some settings, such as benthic environments, other factors also intrude, such as colour shifts due to the optical properties of water and image noise due to particulate matter suspended in the water. Identifying scenes across such large changes in scale is very challenging for modern visual localization techniques: even the most scale-robust techniques, such as the Scale-Invariant Feature Transform (SIFT), can only localize reliably under relatively small scale factors.
We hypothesize that the hierarchical features computed by the intermediate layers of a Convolutional Neural Network (CNN)  may prove robust to changes in scale, due to their high degree of abstraction. We propose a technique for performing metric localization across significant changes in scale by identifying and describing non-semantic objects in a way that allows them to be associated between scenes. We show that these associations can be used to guide the matching of SIFT features between images in a way that improves the robustness of matching to scale changes, allowing accurate localization under visual scale factors of 3 and greater. The proposed system does not require any environment-specific training, and in principle can be deployed out-of-the-box in arbitrary environments. The objects used by our system are defined functionally, in terms of their utility as scale-invariant landmarks, and are not limited to semantically-meaningful object categories.
We specifically consider the problem of localizing between pairs of images known to contain the same scene at different visual scale. A solution to this problem is an essential component of a system that can perform full global localization across large scale changes, and in certain cases - such as the low-vs-high-altitude case described above - could suffice on its own for global localization. We demonstrate the approach both on standard localization benchmarks and on a novel dataset of image pairs from urban scenes exhibiting major scale changes.
II Related Work
Visual localization refers to the problem of determining a robot's pose using images from one or more cameras, with reference to a map or set of previously-seen images. This may be done with some prior on the robot's position, or with no such prior, which is called global localization. Visual odometry is a form of non-global localization, while global localization is closely related to loop closure; both are important components of SLAM, and there is a large body of literature exploring both problems. Prominent early work includes Leonard et al., MacKenzie et al., and Fox et al.
Many traditional visual approaches to these problems, and particularly to global localization, have been based on the recognition of whole-image descriptors of particular scenes, such as GIST features. Successful instances include SeqSLAM, which uses a heavily downsampled version of the input image as a descriptor, and LSD-SLAM, which performs direct image alignment for loop closure, as well as Hansen et al., Cadena et al., Liu et al., and Naseer et al. Because whole-image descriptors encode the geometric relationships of features in the 2D image plane, images of the same scene from different perspectives can have very different descriptors, making such methods very sensitive to changes in perspective and scale.
Another common approach is to discretize point-feature descriptors and build bag-of-words histograms of the input images. FAB-MAP, ORB-SLAM, and the system of Ho et al. perform variants of this for loop closure, starting from SURF, ORB, and SIFT features, respectively. While suitable for place-recognition tasks, such approaches alone are not appropriate for global localization, because spatial information about the visual words is not contained in the histogram. Hence, state-of-the-art SLAM systems such as ORB-SLAM and LSD-SLAM rely on visual odometry for pose estimation. Their visual odometry techniques are limited in robustness to changes in scale, perspective, and appearance, and so rely on successive estimations from closely-spaced frames.
Other global localization approaches attempt to recognize particular landmarks in an image, and use those to produce a metric estimate of the robot's pose. SLAM++ of Salas-Moreno et al. performs SLAM by recognizing landmarks from a database of 3D object models. Linegar et al. and Li et al. both train a bank of support vector machines (SVMs) to detect specific landmarks in a known environment, one SVM per landmark. More recently, Bowman et al. made use of a Deformable Parts Model (DPM) to detect objects for use as loop-closure landmarks in their SLAM system. All of these approaches require a pre-existing database of either object types or specific objects to operate. These databases can be costly to construct, and these systems will fail in environments where too few landmarks from the database are present.
Some work has explored the use of CNNs for localization. PoseNet is a CNN that learns a mapping from images in an environment to metric camera poses, but it can only operate on the environment on which it was trained. In Sünderhauf et al., the intermediate activations of a CNN trained for image classification were used as whole-image descriptors for place recognition, a non-metric form of global localization. In a similar fashion, Vysotska et al. use whole-image descriptors from a CNN in a SeqSLAM-like framework. Subsequent work of Sünderhauf et al. refined this approach by using the same descriptor for object proposals within an image instead of the whole image. Cascianelli et al. and Panphattarasap et al. both expand on this technique. These works consider only place recognition, however, and do not attempt the more challenging problem of full global localization (which necessitates returning a pose estimate). Schmidt et al. and Simo-Serra et al. have both explored the idea of learning point-feature descriptors with a CNN, which could replace classical point features in a bag-of-words model.
When exploring robustness to perspective change, all of these works consider positional variations of at most a few meters, while the scenes exhibit within-image scale variations of tens or hundreds of meters, and the reference or training datasets consist of images taken over traversals of environments ranging from hundreds to thousands of meters. As a result, little significant change in scale exists between map images and query images in these experiments. To the best of our knowledge, ours is the first work to attempt to combine deep object-like features and point features into a single, unified representation of landmarks. This synthesis provides superior metric localization to either technique in isolation, particularly under significant (a factor of 3 and greater) changes in scale.
III Proposed System
The first stage of our metric localization pipeline consists of detecting objects in a pair of images, computing convolutional descriptors for them, and matching these descriptors between images. Our approach here closely follows that used for image retrieval by Sünderhauf et al.; we differ in using Selective Search (SS), as proposed by Uijlings et al., to propose object regions, and in our use of a more recent CNN architecture.
To extract objects from an image, Selective Search object proposals are first generated and then filtered to remove those whose bounding boxes are less than 200 pixels in size or have an aspect ratio greater than 3 or less than 1/3. The image region defined by each surviving SS bounding box is then extracted from the image, rescaled to a fixed size via bilinear interpolation, and passed through a CNN. We use a ResNet-50 architecture trained on the ImageNet image-classification dataset, as described in He et al. Experiments were run using six different layers of the network as feature descriptors, and with inputs to the network at four different resolutions. The network layers and resolutions are listed in Table I.
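As a concrete sketch, the proposal-filtering step can be written as follows. This is a minimal Python version; we interpret "200 pixels in size" as bounding-box area, and the function name and `(x, y, w, h)` box format are our own choices, not the paper's:

```python
def filter_proposals(boxes, min_area=200, max_aspect=3.0):
    """Filter (x, y, w, h) proposal boxes: drop boxes whose area is below
    min_area pixels or whose aspect ratio falls outside
    [1/max_aspect, max_aspect]."""
    kept = []
    for (x, y, w, h) in boxes:
        if w <= 0 or h <= 0:
            continue
        area = w * h
        aspect = w / h
        if area >= min_area and (1.0 / max_aspect) <= aspect <= max_aspect:
            kept.append((x, y, w, h))
    return kept
```

Each surviving box would then be cropped, resized, and fed to the network.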
Having extracted objects and their descriptors from a pair of images, we perform brute-force matching of the objects between the images. Following Sünderhauf et al., we take the match of each object descriptor $\mathbf{a}$ in the first image to be the descriptor $\mathbf{b}$ in the second image with the smallest cosine distance from $\mathbf{a}$, defined as $d_{\cos}(\mathbf{a}, \mathbf{b}) = 1 - \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|}$. Matches are validated by cross-checking: a match is only considered valid if $\mathbf{b}$ is the most similar object to $\mathbf{a}$ in the second image and $\mathbf{a}$ is the most similar object to $\mathbf{b}$ in the first.
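A minimal sketch of this mutual-nearest-neighbour matching step, using the standard cosine-distance formula (the descriptor-matrix layout and function names are our own notation):

```python
import numpy as np

def cosine_distances(A, B):
    """A: (m, d), B: (n, d) descriptor matrices; returns (m, n) matrix of
    cosine distances 1 - (a . b) / (|a||b|)."""
    An = A / np.linalg.norm(A, axis=1, keepdims=True)
    Bn = B / np.linalg.norm(B, axis=1, keepdims=True)
    return 1.0 - An @ Bn.T

def mutual_matches(A, B):
    """Brute-force matching with cross-checking: (i, j) is a match iff j is
    the nearest descriptor in B to A[i] AND i is the nearest in A to B[j]."""
    D = cosine_distances(A, B)
    best_b = D.argmin(axis=1)   # best match in B for each row of A
    best_a = D.argmin(axis=0)   # best match in A for each column of B
    return [(i, j) for i, j in enumerate(best_b) if best_a[j] == i]
```

The cross-check discards any object whose preferred match in the other image prefers a different object.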
Once object matches are found, we extract SIFT features from both images, using 3 octave layers, an initial Gaussian smoothing, an edge threshold of 10, and a contrast threshold of 0.04. For each pair of matched objects, we match the SIFT features that lie inside the corresponding bounding boxes to one another. SIFT features are matched via their Euclidean distance, and cross-checking is again used to filter out bad matches. By limiting the search for SIFT matches to matched object regions, we hypothesize that the scope for error in SIFT matching will be significantly reduced, and thus that the accuracy of the resulting metric pose estimates will be increased. As baselines against which to compare our results, experiments were also run using SIFT alone, with no objects, and objects alone, without SIFT features - this last is essentially a naïve application of the place-recognition system of Sünderhauf et al. to metric localization. In these baseline experiments, SIFT matching was performed in the same way, but the search for matches was conducted over all SIFT features in both images. When object proposals alone were used, they were matched in the same manner described above, and their bounding box centers were used as match points.
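The region-restricted matching can be sketched as follows. This is a simplified version assuming axis-aligned `(x, y, w, h)` boxes and raw keypoint/descriptor arrays rather than OpenCV structures; all names are illustrative:

```python
import numpy as np

def in_box(pts, box):
    """Boolean mask of points (n, 2) lying inside an (x, y, w, h) box."""
    x, y, w, h = box
    return (pts[:, 0] >= x) & (pts[:, 0] < x + w) & \
           (pts[:, 1] >= y) & (pts[:, 1] < y + h)

def match_in_regions(kp1, desc1, kp2, desc2, box_pairs):
    """Match point features only within matched object regions.
    kp*: (n, 2) keypoint locations; desc*: (n, d) descriptors;
    box_pairs: list of (box in image 1, matched box in image 2)."""
    matches = []
    for box1, box2 in box_pairs:
        idx1 = np.where(in_box(kp1, box1))[0]
        idx2 = np.where(in_box(kp2, box2))[0]
        if len(idx1) == 0 or len(idx2) == 0:
            continue
        # Euclidean distances with cross-checking, as in the full-image case
        D = np.linalg.norm(desc1[idx1, None, :] - desc2[None, idx2, :], axis=2)
        best2 = D.argmin(axis=1)
        best1 = D.argmin(axis=0)
        matches += [(idx1[i], idx2[j]) for i, j in enumerate(best2)
                    if best1[j] == i]
    return matches
```

The SIFT-only baseline corresponds to calling this with a single box pair covering both full images.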
The resulting set of match points is used to produce a metric pose estimate. Depending on the experiment, we compute either a homography $H$ or an essential matrix $E$. In either case, the calculation of $H$ or $E$ from point correspondences is done via a RANSAC algorithm with an inlier threshold of 6, measured in pixel units.
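The RANSAC estimation step can be sketched in plain NumPy for the homography case. This is a minimal 4-point DLT with a fixed iteration count and a one-directional reprojection error; a production system would use a library implementation such as OpenCV's `findHomography`:

```python
import numpy as np

def dlt_homography(src, dst):
    """Direct Linear Transform: fit H mapping src -> dst from >= 4 points."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A, float))
    H = Vt[-1].reshape(3, 3)    # null-space vector of the design matrix
    return H / H[2, 2]

def apply_h(H, pts):
    """Apply homography H to (n, 2) points via homogeneous coordinates."""
    p = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return p[:, :2] / p[:, 2:3]

def ransac_homography(src, dst, thresh=6.0, iters=500, seed=0):
    """RANSAC estimate of H with an inlier threshold in pixels."""
    rng = np.random.default_rng(seed)
    best_H, best_inliers = None, 0
    n = len(src)
    for _ in range(iters):
        idx = rng.choice(n, 4, replace=False)
        H = dlt_homography(src[idx], dst[idx])
        err = np.linalg.norm(apply_h(H, src) - dst, axis=1)
        inliers = int((err < thresh).sum())
        if inliers > best_inliers:
            best_inliers, best_H = inliers, H
    return best_H
```

The essential-matrix case follows the same sample-fit-count loop with a 5-point (or 8-point) solver in place of the DLT.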
IV KITTI Experiments
IV-A Experimental Setup
To evaluate the robustness of our proposed method to changes in scale, we conducted experiments on the KITTI Odometry benchmark dataset. This dataset consists of data sequences from a variety of sensors, including colour stereo imagery captured at a 15 Hz frame rate, taken from a sensor rig mounted on a car as it drives along twenty-two distinct routes in the daytime. Eleven of these sequences contain precise ground-truth poses for each camera frame; these trajectories were used to evaluate the proposed method.
Our evaluation consisted of first subsampling each sequence by taking every fifth frame, both to make the size of the overall dataset more manageable and to increase the scale change present between adjacent frames. A set of image pairs was generated for each subsampled sequence by taking each frame and pairing it with each of the 10 subsequent frames, at frame separations $n = 1, \ldots, 10$. Each successive value of $n$ gave an image pair with a greater degree of visual scale change, as shown in Fig. 7.
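The pair-generation procedure can be sketched as follows (the function name and the `(frame_i, frame_j, gap)` return format are illustrative, not from the paper):

```python
def make_eval_pairs(num_frames, subsample=5, max_gap=10):
    """Subsample a sequence by taking every `subsample`-th frame, then pair
    each kept frame with each of its next `max_gap` kept frames
    (frame separation n = 1..max_gap)."""
    kept = list(range(0, num_frames, subsample))
    pairs = []
    for i, f in enumerate(kept):
        for n in range(1, max_gap + 1):
            if i + n < len(kept):
                pairs.append((f, kept[i + n], n))
    return pairs
```

Larger `n` values correspond to larger camera baselines and hence larger visual scale changes.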
We finally filtered out any frame pairs whose gaze directions differed by more than a fixed angular threshold in any axis, in order to consider only pairs that actually view the same scene (in practice, only the yaw differs significantly in KITTI). In total, 40,748 image pairs were used in our evaluation. For each image pair, the images from the left colour camera (designated camera 2 in KITTI) were used for localization. An example set of images is shown in Fig. 7.
To estimate a transform between an image pair, a set of point matches was produced between the two images according to each of the three methods we compare, as described in section III. In each case, these point matches were used to estimate an essential matrix $E$, from which a pose estimate $T$ was derived via the standard method of applying SVD and cheirality checking. $T$ describes the transform between the two frames. To assess the quality of the estimate, we used two error metrics. The first was the relative positional error $e_{\mathrm{pos}}$, as defined in Eq. 1:

$$e_{\mathrm{pos}} = \frac{1}{2} \left\| \frac{\mathbf{t}}{\|\mathbf{t}\|} - \frac{\hat{\mathbf{t}}}{\|\hat{\mathbf{t}}\|} \right\| \qquad (1)$$
where $\mathbf{t}$ is the ground-truth translation between the two frames and $\hat{\mathbf{t}}$ is the estimated translation. We normalize both translation vectors to unit length to remove any correlation with the magnitude of the true translation. Values of $e_{\mathrm{pos}}$ range from 0 to 1.
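Under one reading of this metric (both translations normalized to unit length, with the error equal to half the distance between the resulting unit vectors, so that it ranges from 0 to 1), a sketch is:

```python
import numpy as np

def positional_error(t_true, t_est):
    """Relative positional error between unit-normalized translation
    directions: 0 for identical directions, 1 for opposite directions."""
    u = np.asarray(t_true, float)
    v = np.asarray(t_est, float)
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    return 0.5 * np.linalg.norm(u - v)
```

Normalizing to unit vectors is natural here, since translation from a monocular essential matrix is recoverable only up to scale.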
The second was the rotational error, defined as $e_{\mathrm{rot}} = 1 - |\mathbf{q} \cdot \hat{\mathbf{q}}|$, where $\mathbf{q}$ and $\hat{\mathbf{q}}$ are quaternions representing the ground-truth and estimated gaze directions, respectively. For some image pairs, no pose could be estimated, due to insufficient or inconsistent point matches. We refer to this as localization failure, and in these failure cases we substitute a value of 1, the maximum possible error under each metric, for both metrics.
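Assuming the quaternion distance $1 - |\mathbf{q} \cdot \hat{\mathbf{q}}|$ from Huynh's survey of rotation metrics (cited in the references), which is 0 for identical rotations (including the $\mathbf{q}$ vs. $-\mathbf{q}$ double cover) and at most 1, a sketch is:

```python
import numpy as np

def rotational_error(q_true, q_est):
    """Quaternion rotation distance 1 - |q1 . q2|: 0 for identical rotations,
    at most 1. The absolute value handles the q / -q double cover."""
    q1 = np.asarray(q_true, float)
    q2 = np.asarray(q_est, float)
    q1 = q1 / np.linalg.norm(q1)
    q2 = q2 / np.linalg.norm(q2)
    return 1.0 - abs(np.dot(q1, q2))
```

This bounded range makes the value 1 a natural substitute on localization failures.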
A preliminary evaluation was carried out over the space of CNN input resolutions and output layers by running them on the first 1000 image pairs from the first subsampled sequence (sequence 00). We found that using the highest of the four input resolutions and the res5c feature layer as output gave both the highest accuracy and the lowest localization failure rate. This configuration was used for all object-landmark experiments on KITTI that we describe below.
All metrics were plotted against the ground-truth translational distance between the frames in each image pair. To make these plots readable, we grouped image pairs by their frame separation $n$ and plotted the mean error of each group against its mean ground-truth distance, in Fig. 8 and Fig. 9 (one figure per error metric). A logarithmic curve was fitted to each, as we expected performance to initially worsen rapidly with distance and then level off. We also display the failure rate of each group versus the group's mean distance in Fig. 10.
The overall performance of each method across all pairs is provided in Table II. This table shows that our proposed method improves on SIFT under each metric: a small improvement of 6% in positional error, and more significant improvements of 43% in rotational error and 58% in failure rate, overall. Meanwhile, the objects-only method performs significantly worse than both our method and SIFT on all metrics and at all pair distances.
Fig. 9 shows that the improvement of our method over SIFT is negligible at the smallest frame separations, but grows significantly and consistently with the distance between frames. In Fig. 8, meanwhile, we see that the improvement grows at first, and is greater than 0.05 for most of the intermediate gaps, but shrinks again at the largest gaps. Fig. 10 shows similar behaviour in the localization failure rate: it is lowest for all methods at the largest gaps.
From visual inspection of these extreme image pairs, this behaviour at high $n$ appears to be caused by sections where the vehicle drives down a long, straight road for some distance. In these cases, the visual scale of objects visible near the end of the road shows little change even over the largest gaps, making localization relatively easy. Unlike more winding roads, such long, straight sections do not have any high-$n$ pairs removed due to the images lying on either side of a bend in the road, meaning that the high-$n$ groups contain disproportionately many pairs from these straight sections.
V Montreal Image Pair Experiments
V-A Experimental Setup
To test the effectiveness of the proposed system in a real-world scenario, a set of 31 image pairs was taken across eleven scenes around the McGill University campus in Montreal. Scenes were chosen to contain a roughly-centred object of approximately uniform depth, so that a uniform change in image scale could be achieved by taking images at various distances from the object. This ensures that successful matches must be made under a single change in scale, and makes the relationship between the images amenable to description by a homography. The image pairs exhibit changes in scale ranging from factors of about 1.5 to about 7, with the exception of one pair showing a scale change of about 15 in a prominent foreground object. All images were taken using the rear-facing camera of a Samsung Galaxy S3 phone, and were downsampled to a fixed resolution via bilinear interpolation for all experiments. Each image pair was hand-annotated with a set of 10 point correspondences, distributed approximately evenly over the nearer image in each pair. We have made this dataset publicly available at http://www.cim.mcgill.ca/~mrl/montreal_scale_pairs/.
The proposed system was used to compute point matches between each image pair, and from these point matches a homography $H$ was computed as described in section III. $H$ was then used to calculate the total symmetric transfer error (STE) for the image pair over the ground-truth points:

$$\mathrm{STE} = \sum_i \left( \left\| \mathbf{x}_i - H^{-1}\mathbf{x}'_i \right\|^2 + \left\| \mathbf{x}'_i - H\mathbf{x}_i \right\|^2 \right)$$

where $\mathbf{x}_i$ and $\mathbf{x}'_i$ are corresponding ground-truth points in the two images.
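A sketch of the STE computation, under the standard Hartley-and-Zisserman definition of symmetric transfer error (squared transfer error summed in both directions; the homogeneous-coordinate helper is our own):

```python
import numpy as np

def apply_h(H, pts):
    """Apply homography H to (n, 2) points via homogeneous coordinates."""
    p = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return p[:, :2] / p[:, 2:3]

def symmetric_transfer_error(H, pts1, pts2):
    """Total STE of homography H over ground-truth correspondences
    (pts1 in one image, pts2 in the other): squared residuals of mapping
    pts1 forward plus squared residuals of mapping pts2 backward."""
    fwd = np.linalg.norm(apply_h(H, pts1) - pts2, axis=1) ** 2
    bwd = np.linalg.norm(apply_h(np.linalg.inv(H), pts2) - pts1, axis=1) ** 2
    return float(np.sum(fwd + bwd))
```

A perfect homography on noise-free correspondences gives an STE of zero.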
Whenever no $H$ could be found for an image pair by some method, its error on that pair was set to the maximum STE we observed for any attempted method. The plain STE ranges over many orders of magnitude on this dataset, so we present the results using the logarithmic STE, which makes them easier to interpret.
The same set of configurations was run over this dataset as in the KITTI experiments: our system at six network layers and four input resolutions, plus SIFT alone and objects alone, for comparison. However, the results from objects alone were substantially worse at all configurations than those of either SIFT or the proposed method, similar to what we observed in section IV. For the sake of brevity, we omit the objects-only results from the discussion and figures below.
Table III shows the performance of each feature layer and each input resolution over the whole Montreal dataset, alongside the results from using SIFT features alone. As this table shows, the total error using just SIFT features is significantly greater than that of the best-performing input resolution for each feature layer. Also, the average errors of the intermediate layers res2c, res3d, and res4f are all very comparable. It is interesting to note that this experiment favours more intermediate layers, while the KITTI experiments favoured the highest resolution and the second-deepest layer of the network. This may arise from the difference in the native resolution of the images: KITTI's image resolutions vary from sequence to sequence, but are all close to a common, relatively high resolution.
Fig. 11 shows the error of each of the three best-performing configurations, as well as the SIFT-only approach, on each image pair in the dataset, plotted against the median scale change over all pairs of ground-truth matches in each image. The scale change between two matches $i$ and $j$ is defined as the ratio of inter-point distances, $s_{ij} = \|\mathbf{x}_i - \mathbf{x}_j\| / \|\mathbf{x}'_i - \mathbf{x}'_j\|$. The lines of best fit for each method further emphasize the improvement of our system over SIFT features at all scale factors up to 6. The best-fit lines for all of the top-three configurations of our system overlap almost perfectly, although there is a fair degree of variance in their performances on individual examples.
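A sketch of the per-pair scale-change computation as we have reconstructed it (ratio of inter-point distances between the nearer and farther image, taken over all pairs of ground-truth matches and summarized by the median; the function name is our own):

```python
import numpy as np
from itertools import combinations

def median_scale_change(pts_near, pts_far):
    """Median, over all pairs (i, j) of ground-truth matches, of the ratio
    of inter-point distances ||x_i - x_j|| / ||x'_i - x'_j|| between the
    nearer image (pts_near) and the farther image (pts_far)."""
    ratios = []
    for i, j in combinations(range(len(pts_near)), 2):
        d_near = np.linalg.norm(pts_near[i] - pts_near[j])
        d_far = np.linalg.norm(pts_far[i] - pts_far[j])
        if d_far > 0:
            ratios.append(d_near / d_far)
    return float(np.median(ratios))
```

For a purely fronto-parallel scene, every ratio equals the true scale factor, so the median is robust to a few mis-annotated points.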
The use of homographies to relate the image pairs allows us to visually inspect the quality of each estimated $H$, by using $H$ to map all pixels in the farther image to their estimated locations in the nearer image. Visual inspection of these mappings for the 31 image pairs confirms that configurations with lower logarithmic STEs tend to produce more correct-looking mappings, although all configurations of our system with low mean logarithmic STE produce comparable mappings for most pairs, and on some pairs, higher-error configurations such as res4f with lower-resolution inputs produce a subjectively better mapping than the lowest-error configuration. Fig. 5 and Fig. 18 display some example homography mappings.
One strength of our proposed system is that it requires no domain-specific training, making use only of a pre-trained CNN. However, as future work we wish to explore the possibility of training a CNN with the specific objective of producing a scale- and perspective-invariant object descriptor, as doing so may result in more accurate matching of objects. We also wish to explore the possibility that including matches from multiple layers of the network in the localization process could improve the system’s accuracy.
The most natural extension of this work, however, is to extend it to the full global-localization problem, where the system must localize within a large map or database of images with no prior on the position, and must moreover do so across major scale changes. Depending on the scenario, this may require combining our localization method with a similarly scale-robust place-recognition system.
We have shown that by combining deep learning with classical methods, we can perform accurate localization across major changes in scale. Our system uses a pre-trained deep network to describe arbitrary objects and correctly match them between images for use as navigation landmarks. Restricting SIFT feature matching to matched object regions substantially improves the robustness of SIFT matching both to image noise and to changes in scale. Despite much prior work on place recognition and localization using both classical methods and deep learning, our result sets a new benchmark for metric localization performance across significant scale changes.
CNN - Convolutional Neural Network
SLAM - Simultaneous Localization And Mapping
SIFT - Scale-Invariant Feature Transform
-  M. Johnson-Roberson, O. Pizarro, S. B. Williams, and I. Mahon, “Generation and visualization of large-scale three-dimensional reconstructions from underwater robotic surveys,” Journal of Field Robotics, vol. 27, no. 1, pp. 21–51, 2010.
-  J. Sattar, G. Dudek, O. Chiu, I. Rekleitis, P. Giguère, A. Mills, N. Plamondon, C. Prahacs, Y. Girdhar, M. Nahon, and J.-P. Lobos, “Enabling autonomous capabilities in underwater robotics,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS, Nice, France, September 2008.
-  I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. [Online]. Available: http://www.deeplearningbook.org
-  G. Dudek and M. Jenkin, Computational principles of mobile robotics. Cambridge university press, 2010.
-  J. J. Leonard and H. F. Durrant-Whyte, “Mobile robot localization by tracking geometric beacons,” IEEE Transactions on robotics and Automation, vol. 7, no. 3, pp. 376–382, 1991.
-  P. MacKenzie and G. Dudek, “Precise positioning using model-based maps,” in Proceedings of the 1994 IEEE International Conference on Robotics and Automation. IEEE, 1994, pp. 1615–1621.
-  D. Fox, W. Burgard, and S. Thrun, “Markov localization for mobile robots in dynamic environments,” Journal of Artificial Intelligence Research, vol. 11, pp. 391–427, 1999.
-  A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” Int. J. Comput. Vision, vol. 42, no. 3, pp. 145–175, May 2001. [Online]. Available: http://dx.doi.org/10.1023/A:1011139631724
-  M. Milford and G. Wyeth, “SeqSLAM: Visual route-based navigation for sunny summer days and stormy winter nights,” in IEEE International Conference on Robotics and Automation (ICRA 2012), N. Papanikolopoulos, Ed. River Centre, Saint Paul, Minnesota: IEEE, 2012, pp. 1643–1649. [Online]. Available: http://eprints.qut.edu.au/51538/
-  J. Engel, T. Schöps, and D. Cremers, “Lsd-slam: Large-scale direct monocular slam,” in European Conference on Computer Vision. Springer, 2014, pp. 834–849.
-  P. Hansen and B. Browning, “Visual place recognition using hmm sequence matching,” in 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, Sept 2014, pp. 4549–4555.
-  C. Cadena, D. Galvez-López, J. D. Tardos, and J. Neira, “Robust place recognition with stereo sequences,” IEEE Transactions on Robotics, vol. 28, no. 4, pp. 871–885, Aug 2012.
-  Y. Liu and H. Zhang, “Performance evaluation of whole-image descriptors in visual loop closure detection,” in 2013 IEEE International Conference on Information and Automation (ICIA), Aug 2013, pp. 716–722.
-  T. Naseer, L. Spinello, W. Burgard, and C. Stachniss, “Robust visual robot localization across seasons using network flows,” in AAAI Conference on Artificial Intelligence, 2014. [Online]. Available: http://www.aaai.org/ocs/index.php/AAAI/AAAI14/paper/view/8483
-  M. Cummins and P. Newman, “Invited Applications Paper FAB-MAP: Appearance-Based Place Recognition and Mapping using a Learned Visual Vocabulary Model,” in 27th Intl Conf. on Machine Learning (ICML 2010), 2010.
-  R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, “ORB-SLAM: A versatile and accurate monocular SLAM system,” CoRR, vol. abs/1502.00956, 2015.
-  K. L. Ho and P. Newman, “Detecting loop closure with scene sequences,” Int. J. Comput. Vision, vol. 74, no. 3, pp. 261–286, Sept. 2007. [Online]. Available: http://dx.doi.org/10.1007/s11263-006-0020-1
-  H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Speeded-up robust features (surf),” Comput. Vis. Image Underst., vol. 110, no. 3, pp. 346–359, June 2008. [Online]. Available: http://dx.doi.org/10.1016/j.cviu.2007.09.014
-  E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “Orb: An efficient alternative to sift or surf,” in Proceedings of the 2011 International Conference on Computer Vision, ser. ICCV ’11. Washington, DC, USA: IEEE Computer Society, 2011, pp. 2564–2571. [Online]. Available: http://dx.doi.org/10.1109/ICCV.2011.6126544
-  D. G. Lowe, “Object recognition from local scale-invariant features,” in Proceedings of the International Conference on Computer Vision-Volume 2 - Volume 2, ser. ICCV ’99. Washington, DC, USA: IEEE Computer Society, 1999, pp. 1150–. [Online]. Available: http://dl.acm.org/citation.cfm?id=850924.851523
-  R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and A. J. Davison, “SLAM++: Simultaneous localisation and mapping at the level of objects,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 1352–1359.
-  C. Linegar, W. Churchill, and P. Newman, “Made to measure: Bespoke landmarks for 24-hour, all-weather localisation with a camera,” in 2016 IEEE International Conference on Robotics and Automation (ICRA), May 2016, pp. 787–794.
-  J. Li, R. M. Eustice, and M. Johnson-Roberson, “High-level visual features for underwater place recognition.”
-  S. L. Bowman, N. Atanasov, K. Daniilidis, and G. J. Pappas, “Probabilistic data association for semantic slam,” in 2017 IEEE International Conference on Robotics and Automation (ICRA), May 2017, pp. 1722–1729.
-  P. Felzenszwalb, D. McAllester, and D. Ramanan, “A discriminatively trained, multiscale, deformable part model,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008, pp. 1–8.
-  A. Kendall and R. Cipolla, “Modelling uncertainty in deep learning for camera relocalization,” Proceedings of the International Conference on Robotics and Automation (ICRA), 2016.
-  N. Sünderhauf, F. Dayoub, S. Shirazi, B. Upcroft, and M. Milford, “On the performance of convnet features for place recognition,” CoRR, vol. abs/1501.04158, 2015. [Online]. Available: http://arxiv.org/abs/1501.04158
-  O. Vysotska and C. Stachniss, “Lazy data association for image sequences matching under substantial appearance changes,” IEEE Robotics and Automation Letters, vol. 1, no. 1, pp. 213–220, 2016.
-  N. Sünderhauf, S. Shirazi, A. Jacobson, F. Dayoub, E. Pepperell, B. Upcroft, and M. Milford, “Place recognition with convnet landmarks: Viewpoint-robust, condition-robust, training-free,” in Proceedings of Robotics: Science and Systems (RSS), 2015.
-  S. Cascianelli, G. Costante, E. Bellocchio, P. Valigi, M. L. Fravolini, and T. A. Ciarfuglia, “Robust visual semi-semantic loop closure detection by a covisibility graph and cnn features,” Robotics and Autonomous Systems, vol. 92, pp. 53–65, 2017.
-  P. Panphattarasap and A. Calway, “Visual place recognition using landmark distribution descriptors,” in Asian Conference on Computer Vision. Springer, 2016, pp. 487–502.
-  T. Schmidt, R. Newcombe, and D. Fox, “Self-supervised visual descriptor learning for dense correspondence,” IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 420–427, 2017.
-  E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer, “Discriminative learning of deep convolutional feature point descriptors,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 118–126.
-  J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders, “Selective search for object recognition,” International Journal of Computer Vision, vol. 104, no. 2, pp. 154–171, 2013. [Online]. Available: https://ivi.fnwi.uva.nl/isis/publications/2013/UijlingsIJCV2013
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv preprint arXiv:1512.03385, 2015.
-  R. I. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed. Cambridge University Press, ISBN: 0521540518, 2004.
-  A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
-  D. Q. Huynh, “Metrics for 3d rotations: Comparison and analysis,” Journal of Mathematical Imaging and Vision, vol. 35, no. 2, pp. 155–164, 2009.