Robust Visual Teach and Repeat for UGVs Using 3D Semantic Maps

09/21/2021 ∙ by Mohammad Mahdavian, et al. ∙ Simon Fraser University

In this paper, we propose a Visual Teach and Repeat (VTR) algorithm using semantic landmarks extracted from environmental objects for ground robots with fixed mount monocular cameras. The proposed algorithm is robust to changes in the starting pose of the camera/robot, where a pose is defined as the planar position plus the orientation around the vertical axis. VTR consists of a teach phase in which a robot moves in a prescribed path, and a repeat phase in which the robot tries to repeat the same path starting from the same or a different pose. Most available VTR algorithms are pose dependent and cannot perform well in the repeat phase when starting from an initial pose far from that of the teach phase. To achieve more robust pose independency, during the teach phase, we collect the camera poses and the 3D point clouds of the environment using ORB-SLAM. We also detect objects in the environment using YOLOv3. We then combine the two outputs to build a 3D semantic map of the environment consisting of the 3D position of the objects and the robot path. In the repeat phase, we relocalize the robot based on the detected objects and the stored semantic map. The robot is then able to move toward the teach path, and repeat it in both forward and backward directions. The results show that our algorithm is highly robust with respect to pose variations as well as environmental alterations. Our code and data are available at the following Github page:







I Introduction

Visual Teach and Repeat (VTR) is one of the most important tasks in robotic navigation. VTR has practical applications for repetitive tasks such as surveillance and transportation, especially for robots that are not equipped with GPS sensors, or in indoor areas where GPS sensors have low accuracy. VTR is thus an alternative for navigating an Unmanned Ground or Aerial Vehicle (UGV or UAV), as it only requires a normal monocular camera.

VTR consists of a teach phase and a repeat phase. In the teach phase, the robot is driven by a user, or a path planner, and captures images along and around the path. During the repeat phase, the robot tries to repeat the same path starting from arbitrary locations and orientations, using only images captured by the camera. There are two key challenges for VTR algorithms. First, depending on the starting pose, the onboard camera may not be looking from the same location and angle as during the teach phase. In extreme cases such as transportation robots working in mines, the robot may need to move back to the starting point, so the camera position and orientation both change dramatically. Second, given that onboard memory is usually limited and robots cannot store all the raw images of the teach phase, VTR algorithms should abstract the raw data, both to reduce storage and to relocalize in an online fashion.

There are two common ways to process and abstract raw images for VTR: creating 2D or 3D SLAM maps, or extracting local features such as SIFT [1], SURF [2], BRIEF [3] or ORB [4] for each frame. Creating accurate SLAM maps requires advanced sensors such as LiDAR, RGB-D, or stereo cameras, which limits the application of the VTR algorithms. We target UGVs that are only equipped with monocular cameras. The second method of using local image features is problematic when the camera pose changes dramatically between the teach and repeat phases. Previous work using local features alone reported failures when viewing angles differ by more than 30 degrees [5, 6, 7].

Fig. 1: VTR paths starting from different locations and orientations.

Contributions: To the best of our knowledge, we are the first to use 3D semantic maps of the environmental objects to achieve more robust VTR algorithms. During the teach phase, we combine both SLAM and local image features to create 3D semantic maps of the environment. The maps contain both object locations and semantic labels, together with the 3D camera poses on the teach path. In the repeat phase, our relocalization algorithm uses recognized objects as reference landmarks to align the camera poses to those in the teach map. Therefore the relocalization algorithm is robust with respect to starting camera pose changes, including planar position differences and viewing angle variations. In addition, object movements between teach and repeat phases can be tolerated to a certain extent. Our results show that the repeat phase can start from a large range of initial poses in the environment, far away from the teach phase initial position with large viewing angle differences. We also demonstrate the ability to repeat the teach path in the backward direction, which no previous VTR work has achieved because of the significant pose changes involved.

II Related Works

VTR has been investigated in a few studies for UAVs [8, 9] and UGVs [10], most of which are based on local image features. As one of the earliest studies, Furgale and Barfoot [11] built a manifold map of overlapping submaps as the robot was piloted along a route. The map was then used for localization as the robot repeated the route autonomously. Barfoot later developed a method called Multi-Experience Localization (MEL) using local image features [12, 13, 14, 15, 16, 17]. Such methods boost performance by bridging between local image features found in multiple repeats. In one of his follow-up works [17], deep learning was utilized to improve the accuracy of the MEL method. All these methods are sensitive to the initial viewing angle. The robot may lose track of the path, as the locations of local image features change drastically with the viewpoint.

Swedish and Raskar investigated VTR for routes that consist of discrete directions using deep learning without querying image frames or features [10]. They utilized a classifier trained on each path to find the correct route toward the end. Their method targets discrete routes and cannot be applied to most VTR scenarios.

Most recently, Camara et al. trained deep learning models to solve VTR with high end-point accuracy [18]. They used a CNN (Convolutional Neural Network) model as a feature extractor to build a feature database during the teach phase. In the repeat phase, the features extracted by the CNN model were compared with the feature database for visual place recognition. A horizontal offset estimator was also used to find the direction that the robot needs to move toward. However, the algorithm is still not robust to viewpoint changes, and may lose track of the path when starting from a location far away from the actual teach path.

In order to alleviate the pose-dependence problem, Ghasemi Toudeshki et al. proposed to utilize environmental objects [19], which can be robust semantic features of the environment that are viewpoint independent. Their algorithm memorized the semantic objects found during the teach phase for a UAV. Later a Seq-SLAM-like relocalization module used these objects to find the correct path toward the end-point. The work used objects as 2D features, which resulted in inferior repeat accuracy compared to ours that uses environmental objects as 3D features.

Our method shares the same motivation to use environmental objects as semantic features, but for the VTR problem of UGVs. Moreover, the 3D object locations are used to increase the accuracy of the algorithm, in addition to decreasing the sensitivity to initial robot poses. We were particularly inspired by [5], in which Li et al. relocalized camera poses from 6D poses of semantic objects found in the environment. They showed that semantic objects can make relocalization more robust to significant viewing angle changes, which local image features are sensitive to and thus tend to fail. Our work simplifies their relocalization algorithm by using 3D positions of the semantic objects instead. Our algorithm is fast and practical for VTR applications, and enables robust relocalization for even backward repeats not possible before.

Fig. 2: An example of semantic map created in teach phase containing semantic object labels and positions as well as camera key frames. Red dots represent the upper middle point of the corresponding objects.

III Preliminaries

III-A Overview of Proposed Method

We propose a VTR algorithm that is robust to large viewpoint changes, e.g., when the robot starts from a position far away from the teach path. For this purpose, we employ the well-known ORB-SLAM [6] to obtain camera poses from local image features. We also utilize a CNN-based model, You Only Look Once (YOLOv3) [20], to recognize objects in the environment. We combine their outputs to build a 3D semantic map, as shown in Fig. 2, that contains semantic object labels, 3D object positions, along with the robot trajectory. We now give a brief overview of ORB-SLAM and YOLOv3.

III-B ORB-SLAM

ORB-SLAM [6] has been widely adopted to reconstruct the camera trajectory and generate a sparse 3D reconstruction of the environment for monocular, stereo, and RGB-D cameras in real time. It is also able to perform map reuse, loop closure, and relocalization with small viewing angle changes. In this work, we use ORB-SLAM for a monocular camera. The SLAM module is initialized after receiving multiple image frames once the camera starts moving. The bundle adjustment algorithm takes in the motion of extracted 2D local ORB features and produces a map containing camera poses and 3D positions of these ORB features. Fig. 3 shows an example of the 2D local ORB features detected during a robot motion. The origin of the generated map is the starting point where the ORB-SLAM module was initialized. However, the generated map is ambiguous in its scale: even ORB-SLAM maps of the same area may have different scales in different runs. We will discuss in Section V-A how to handle this scale ambiguity in our relocalization algorithm.

III-C YOLOv3

To detect objects in each image frame, we employ a CNN-based model, YOLOv3 [20]. YOLOv3 was chosen for its high accuracy as well as its fast inference speed, which is necessary since several modules need to run in parallel. Other CNN models such as Faster-RCNN [23] perform at a lower speed in comparison. The YOLOv3 model applies a single neural network to the full image to extract image features first. Then it divides the image into several regions and predicts the location of the bounding box, the semantic label, and the confidence score for each object. To achieve high accuracy, the network was trained on the 24 most common objects in indoor areas, such as TV-Monitor, Sofa, Chair, Umbrella, Clock, Bottle, etc., from the COCO 2017 dataset. Also, the Darknet-ROS [21] package was utilized to publish objects detected by YOLOv3 to ROS. Fig. 3 shows the objects detected by YOLOv3 during a robot motion.

Fig. 3: Local features detected by ORB-SLAM and bounding boxes detected by YOLOv3. The three image features closest to the upper middle point of the bounding boxes, pointed by the orange arrows, are used for calculating the 3D positions of the corresponding objects.

IV Teach Phase

In the teach phase of VTR algorithms, the robot moves either with manual control from a user, or by a path planning algorithm. We call the trajectory of this motion the teach path, which the robot needs to memorize in order to repeat it later. To this end, we wish to create a 3D semantic map, referred to as the teach map hereafter, that contains camera key frames, semantic object labels, and object positions in 3D. Camera key frames can be obtained from ORB-SLAM directly, as described in Section III-B. Simultaneously, a YOLOv3 model is utilized to detect and classify objects in the streaming images to provide us the semantic object labels.

In order to obtain 3D object locations, we combine the 3D positions of ORB features from ORB-SLAM and the estimated object bounding boxes from YOLOv3. More specifically, we estimate the 3D object position from the 3D positions of the ORB features inside the corresponding object’s bounding box. Empirically, the upper middle area of each object is the most visible region of the object in image frames. Therefore, we take the average 3D positions of the three closest ORB feature points to the upper-middle point of the object bounding box as the object location in the semantic map. In Fig. 3, we use orange arrows to point out the chosen ORB feature points for estimating 3D object locations.
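The position estimate described above can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function name, the array layout, and the `upper_fraction` parameter (which places the "upper-middle point" a quarter of the box height below the top edge) are assumptions made here for the example.

```python
import numpy as np

def estimate_object_position(bbox, keypoints_px, points_3d, k=3, upper_fraction=0.25):
    """Average the 3D positions of the k ORB features nearest to the
    upper-middle point of a bounding box.

    bbox: (x_min, y_min, x_max, y_max) in pixels, image y grows downward.
    keypoints_px: (N, 2) pixel coordinates of tracked ORB features.
    points_3d: (N, 3) map points matching keypoints_px row by row.
    Returns a (3,) array, or None if fewer than k features lie inside the box.
    """
    x_min, y_min, x_max, y_max = bbox
    keypoints_px = np.asarray(keypoints_px, dtype=float)
    points_3d = np.asarray(points_3d, dtype=float)

    # Keep only features that fall inside the bounding box.
    inside = ((keypoints_px[:, 0] >= x_min) & (keypoints_px[:, 0] <= x_max) &
              (keypoints_px[:, 1] >= y_min) & (keypoints_px[:, 1] <= y_max))
    if np.count_nonzero(inside) < k:
        return None

    # Assumed definition of the upper-middle point (hypothetical parameter).
    anchor = np.array([(x_min + x_max) / 2.0,
                       y_min + upper_fraction * (y_max - y_min)])

    # Pick the k features closest to the anchor in the image plane.
    dist = np.linalg.norm(keypoints_px[inside] - anchor, axis=1)
    nearest = np.argsort(dist)[:k]
    return points_3d[inside][nearest].mean(axis=0)
```

Returning `None` when too few features are available lets the caller skip the detection for that frame and wait for a better view.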

Object locations estimated from different image frames are usually not identical numerically, even for the same object, due to noise and errors in camera movement and image feature detection. We therefore only add a semantic landmark and its object location into the semantic map when its estimated 3D position is beyond a threshold away from any previously added objects. We also ignore bounding boxes that are only partially observable inside the image frames to avoid large estimation errors.
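The two filtering rules in this paragraph can be illustrated as follows. This is a minimal sketch under stated assumptions: a box is treated as partially observable when it touches the image border within a small `margin`, and the map is a plain list of `(label, position)` tuples; both choices, and the function names, are hypothetical rather than taken from the paper.

```python
import numpy as np

def is_fully_visible(bbox, image_size, margin=2):
    """Treat a box that touches the image border as partially observable."""
    x_min, y_min, x_max, y_max = bbox
    width, height = image_size
    return (x_min > margin and y_min > margin and
            x_max < width - margin and y_max < height - margin)

def add_landmark(semantic_map, label, position, min_separation):
    """Append (label, position) unless it lies within min_separation of any
    landmark already in the map. Returns True if the landmark was added."""
    position = np.asarray(position, dtype=float)
    for _, existing in semantic_map:
        if np.linalg.norm(position - existing) < min_separation:
            return False  # likely a re-detection of a known object
    semantic_map.append((label, position))
    return True
```

Note that `min_separation` is expressed in map units, which for monocular ORB-SLAM are not metric, so the threshold would need tuning per run or relative to the map extent.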

By the end of the teach phase, we have obtained a semantic teach map containing the 3D positions of environmental objects and their semantic labels, together with a camera pose trajectory indexed by keyframe. Fig. 2 shows a sample semantic map together with the origin of the map for our testing scene used in the experiments.

V Repeat Phase

We aim to develop a repeat algorithm that is highly robust to large variations of the robot starting pose, such as faraway starting locations or viewing angles opposite to the ones of the teach phase. Tolerance to reasonable environment changes, such as relocation of a subset of objects, is desirable as well. The algorithm should also be able to repeat the teach path with reasonable accuracy, relative to the size of the robot.

The key to such a robust VTR algorithm is to accurately relocalize the robot pose with respect to the teach map coordinate frame. For this purpose, we developed an optimization-based relocalization algorithm that finds the best matching pair of objects between the teach and repeat maps to relocalize the robot robustly and accurately. After relocalization, the robot can simply move toward the closest point on the teach path and repeat it to the end or move back to the starting point.

V-A Relocalization

To relocalize the robot in the repeat phase, environmental objects are used as reference landmarks to transform the robot pose in the repeat map to its counterpart in the teach map. We note that the estimated 3D positions of the objects contain noise and errors from various sources, including estimation errors from ORB-SLAM, occasional object relocation in the environment, and noise caused by the movement of the robot. Therefore, we robustly estimate the relative transformation between the repeat map and the teach map using the best matching pair of objects in the environment. Furthermore, we assume there are multiple unique objects in the environment. We only consider unique objects to reduce ambiguity caused by repetitive objects, and leave it as future work to utilize repetitive objects for relocalization.

In general, there are 6 Degrees of Freedom (DoFs) for a given rigid body. However, our camera is mounted with a fixed base on a UGV that can only move in the horizontal ground plane and rotate around the vertical axis. Therefore, there are only 3 DoFs left for the robot/camera. In addition, ORB-SLAM outputs are ambiguous in scale. Thus there are in total four unknowns ($s$, $t_x$, $t_y$, $\theta$) in the relative transformation between two semantic maps, where $s$ is the scale factor, ($t_x$, $t_y$) is the planar translation, and $\theta$ is the rotation around the vertical axis. We thus require the positions of two matching environmental objects to solve for the four unknowns. The scale factor can be easily calculated as the ratio of the distances between the pair of objects in the two maps. We now detail how to solve for the relative transformation between the teach and repeat maps from intermediate coordinate frames estimated from the two matching objects.
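A short derivation may help show why two matched objects are enough. It is a sketch under the assumption, made here for illustration, that the scale maps repeat-map coordinates to teach-map coordinates; the symbols $\mathbf{p}_t^k$, $\mathbf{p}_r^k$ (planar positions of matched object $k$ in the teach and repeat maps), $R(\theta)$, and $\Delta$ are introduced only for this example.

```latex
\begin{align*}
\mathbf{p}_t^k &= s\,R(\theta)\,\mathbf{p}_r^k + \mathbf{t}, \quad k \in \{i, j\},
\qquad
R(\theta) = \begin{bmatrix}\cos\theta & -\sin\theta\\ \sin\theta & \cos\theta\end{bmatrix},
\quad \mathbf{t} = (t_x, t_y)^\top \\
% Subtracting the two equations eliminates the translation
\Delta_t &= s\,R(\theta)\,\Delta_r,
\qquad \Delta_t = \mathbf{p}_t^j - \mathbf{p}_t^i,
\quad \Delta_r = \mathbf{p}_r^j - \mathbf{p}_r^i \\
s &= \frac{\lVert \Delta_t \rVert}{\lVert \Delta_r \rVert} \\
\theta &= \operatorname{atan2}(\Delta_{t,y}, \Delta_{t,x}) - \operatorname{atan2}(\Delta_{r,y}, \Delta_{r,x}) \\
\mathbf{t} &= \mathbf{p}_t^i - s\,R(\theta)\,\mathbf{p}_r^i
\end{align*}
```

Each matched object contributes two scalar equations, so two objects yield exactly four equations for the four unknowns; any further matched objects can then be used to check how well a candidate pair explains the rest of the map.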

Fig. 4: Relocalization between the teach and repeat phases calculates the scale factor and the relative transformation from two matching objects.

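Fig. 4 illustrates the coordinate frames involved in relocalization, together with some environmental objects. $F_t$ and $F_r$ are the two 2D coordinate frames automatically set by ORB-SLAM from the robot starting pose in the teach and repeat phases, respectively. They define the map origin and the starting orientation. We first ignore repetitive objects in both maps, such as the chairs in Fig. 4. Then two of the objects that appear in both maps with the same labels are chosen randomly, such as the monitor and the clock. Hereafter we denote individual objects as $o_i$ and $o_j$, and the corresponding object pair as $(o_i, o_j)$. Next, a new 2D coordinate frame, $F_t^{ij}$ or $F_r^{ij}$, is defined from the chosen objects for both maps, with its origin at $o_i$. The Y-axis is the vector pointing from one object to the other in the horizontal plane, e.g., the vector from the monitor to the clock. The X-axis is the cross product of the Y-axis and the up direction. We denote a 3-DoF transformation matrix $T$ with one rotational DoF and two translational DoFs as in (1); then all objects can be transformed from a map coordinate frame, either $F_t$ or $F_r$, to its corresponding object coordinate frame, either $F_t^{ij}$ or $F_r^{ij}$, as in (2) and (3).

$$T(\theta, t_x, t_y) = \begin{bmatrix} \cos\theta & -\sin\theta & t_x \\ \sin\theta & \cos\theta & t_y \\ 0 & 0 & 1 \end{bmatrix} \tag{1}$$

$$T_t^{ij} = T(\theta_t^{ij}, x_t^{i}, y_t^{i})^{-1} \tag{2}$$

$$T_r^{ij} = T(\theta_r^{ij}, x_r^{i}, y_r^{i})^{-1} \tag{3}$$

where the subscripts $t$ and $r$ denote the various quantities in either the teach or repeat map, $(x^{i}, y^{i})$ is the planar position of $o_i$ in the map frame, and $\theta^{ij}$ is the orientation of the object frame in the map frame. We denote all objects in a map as $O_t$ or $O_r$. Then using (4) and (5) we can transform the planar position $\mathbf{p}^k = (x^k, y^k)$ of an object $o_k$ in $O_t$ or in $O_r$ from its original map coordinates to the corresponding object coordinates $\tilde{\mathbf{p}}^k$ as follows:

$$\begin{bmatrix} \tilde{\mathbf{p}}_t^{k} \\ 1 \end{bmatrix} = T_t^{ij} \begin{bmatrix} \mathbf{p}_t^{k} \\ 1 \end{bmatrix} \tag{4}$$

$$\begin{bmatrix} \tilde{\mathbf{p}}_r^{k} \\ 1 \end{bmatrix} = T_r^{ij} \begin{bmatrix} \mathbf{p}_r^{k} \\ 1 \end{bmatrix} \tag{5}$$

Then the aggregated error between positions of all unique objects that appear in both maps can be calculated:

$$E^{ij} = \sum_{o_k \in O_t \cap O_r} \left\lVert \tilde{\mathbf{p}}_t^{k} - s^{ij}\, \tilde{\mathbf{p}}_r^{k} \right\rVert \tag{6}$$

$s^{ij}$ is a scale factor introduced by ORB-SLAM, which can be estimated by:

$$s^{ij} = \frac{d_t^{ij}}{d_r^{ij}} \tag{7}$$

where $d_t^{ij}$ and $d_r^{ij}$ are the Euclidean distances between $o_i$ and $o_j$ in the teach and repeat maps, respectively.

The right-hand side of (6) would be zero for perfectly accurate maps without any object relocation. But in real-world applications it is nonzero, and we aim to find the best matching pair of objects, denoted as $(o_{i^*}, o_{j^*})$, that minimizes this aggregated error $E^{ij}$. That is, we search for the object indices that minimize (6) as follows:

$$(i^*, j^*) = \arg\min_{i \neq j} E^{ij} \tag{8}$$

In practice, to make our relocalization algorithm more robust to estimation errors and outliers caused by object movement, we only use the top matching objects in calculating $E^{ij}$. Specifically, the positional errors between the two maps for all objects are first sorted in ascending order. Only the top half of the objects, i.e., those with the smallest errors, are used in (6) and in minimizing (8).

After finding the best matching pair of objects between the two maps, the optimal scale ratio ($s^*$) is calculated by (7). The optimal 3-DoF transformation matrix $T^*$ between the teach and repeat map is calculated from the best matching pair of objects ($o_{i^*}, o_{j^*}$) as follows:

$$\theta^* = \theta_t^{i^*j^*} - \theta_r^{i^*j^*}, \qquad \mathbf{t}^* = \mathbf{p}_t^{i^*} - s^* R(\theta^*)\, \mathbf{p}_r^{i^*}, \qquad T^* = T(\theta^*, t_x^*, t_y^*) \tag{9}$$

where $R(\theta)$ is the upper-left $2 \times 2$ rotation block of (1). Therefore, three of the unknowns ($t_x$, $t_y$ and $\theta$) are solved for and embedded in the above matrix $T^*$. It is now straightforward to transform the robot position $\mathbf{p}_r$ and heading $\psi_r$ in the repeat map to the teach map pose as follows:

$$\begin{bmatrix} \mathbf{p}_t \\ 1 \end{bmatrix} = T^* \begin{bmatrix} s^*\, \mathbf{p}_r \\ 1 \end{bmatrix}, \qquad \psi_t = \psi_r + \theta^* \tag{10}$$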
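The pair search in (6) to (10) can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names and the dictionary-based map layout are hypothetical, the scale is assumed to map repeat-map coordinates to teach-map coordinates, and the trimmed error excludes the candidate pair itself (whose error is zero by construction) before keeping the better half, which is one possible reading of the "top half" rule.

```python
import itertools
import numpy as np

def pair_frame(p_i, p_j):
    """3x3 matrix taking map coordinates into the object frame whose origin
    is p_i and whose Y-axis points from p_i to p_j."""
    u = (p_j - p_i) / np.linalg.norm(p_j - p_i)
    # Columns are the frame axes in map coordinates: X = Y x up, Y = u.
    R = np.array([[u[1], u[0]],
                  [-u[0], u[1]]])
    T = np.eye(3)
    T[:2, :2] = R.T
    T[:2, 2] = -R.T @ p_i
    return T

def in_frame(T, p):
    """Apply a 3x3 homogeneous transform to a planar point."""
    return (T @ np.append(p, 1.0))[:2]

def relocalize(teach, repeat):
    """teach, repeat: dict mapping a unique label to a planar position.
    Returns (scale, A) where A @ [x_r, y_r, 1] gives teach-map coordinates."""
    teach = {k: np.asarray(v, dtype=float) for k, v in teach.items()}
    repeat = {k: np.asarray(v, dtype=float) for k, v in repeat.items()}
    common = sorted(set(teach) & set(repeat))
    best = None
    for a, b in itertools.combinations(common, 2):
        d_t = np.linalg.norm(teach[b] - teach[a])
        d_r = np.linalg.norm(repeat[b] - repeat[a])
        if d_t == 0.0 or d_r == 0.0:
            continue
        s = d_t / d_r  # scale estimate from the candidate pair
        T_t = pair_frame(teach[a], teach[b])
        T_r = pair_frame(repeat[a], repeat[b])
        errors = sorted(
            np.linalg.norm(in_frame(T_t, teach[k]) - s * in_frame(T_r, repeat[k]))
            for k in common if k not in (a, b))
        kept = errors[:(len(errors) + 1) // 2]  # best-matching half only
        total = float(sum(kept))
        if best is None or total < best[0]:
            A = np.linalg.inv(T_t) @ np.diag([s, s, 1.0]) @ T_r
            best = (total, s, A)
    if best is None:
        raise ValueError("need at least two matched unique objects")
    return best[1], best[2]

def to_teach_pose(A, x, y, heading):
    """Map a repeat-map pose (x, y, heading) into the teach map."""
    px, py, _ = A @ np.array([x, y, 1.0])
    return px, py, heading + np.arctan2(A[1, 0], A[0, 0])
```

Because only the better half of the remaining objects contributes to the score, a relocated object inflates the error of pairs that include it while leaving pairs of undisturbed objects almost unaffected.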


V-B Forward Motion

After transforming the initial robot pose from the repeat map to the teach map, the next step is to move toward the teach path and repeat it. The forward motion involves moving toward the closest point on the teach path to the robot, and then following the list of keyframes captured in the teach phase toward the end. Therefore, the closest point on the teach path from the current pose of the robot is considered as the first goal point of the robot. Then the next goal points are chosen at an arbitrary distance from the current point, within a range of one meter. The robot first rotates toward the next goal point and then moves from the current position toward it, repeating this all the way to the endpoint of the teach path.
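The rotate-then-move step toward each goal point can be sketched as below. It is a minimal illustration that assumes the robot pose is already expressed in the teach map as `(x, y, heading)`; the function name and the convention of returning a signed turn angle followed by a straight-line distance are choices made for this example.

```python
import math

def rotate_then_drive(pose, goal):
    """Return (turn, distance) for reaching goal from pose.

    pose: (x, y, heading) with heading in radians.
    goal: (x, y) of the next goal point.
    turn is the signed rotation to face the goal, wrapped to [-pi, pi];
    distance is the straight-line distance to cover after turning.
    """
    x, y, heading = pose
    dx, dy = goal[0] - x, goal[1] - y
    bearing = math.atan2(dy, dx)
    # Wrap the heading error so the robot takes the shorter rotation.
    turn = math.atan2(math.sin(bearing - heading), math.cos(bearing - heading))
    return turn, math.hypot(dx, dy)
```

For example, `rotate_then_drive((0.0, 0.0, 0.0), (0.0, 1.0))` asks for a quarter turn to the left followed by one map unit of forward motion.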

Fig. 5: The teach map and the repeat map before and after relocalization. The chair and the suitcase were matched, and helped relocalize the robot pose in repeat onto the teach map.

V-C Backward Motion

The robustness of our system is manifested by its ability to repeat the teach path in the backward direction. Our VTR algorithm relocalizes the robot from the repeat map to the teach map using detected semantic landmarks, independent of the viewpoint. Then instead of choosing the next points toward the end-point on the teach path, the robot can choose points directed to the starting location on the teach map. Therefore, the backward motion simply follows the list of keyframes captured in the teach phase from the closest point on the teach path toward the starting keyframe.
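Since forward and backward repeats differ only in the direction in which the keyframe list is traversed, goal selection for both can be sketched with one function. The sketch below is an assumption-laden illustration: `goal_sequence`, the `step` spacing parameter, and the use of planar keyframe positions are hypothetical, and `step` is expressed in teach-map units after any rescaling.

```python
import numpy as np

def goal_sequence(keyframes, robot_xy, step=0.5, backward=False):
    """Return indices of goal keyframes, starting at the keyframe closest to
    the robot and ending at the last keyframe (or the first, if backward).

    keyframes: (N, 2) planar teach-path positions in teach order.
    step: minimum spacing between consecutive goals along the path.
    """
    keyframes = np.asarray(keyframes, dtype=float)
    robot_xy = np.asarray(robot_xy, dtype=float)
    start = int(np.argmin(np.linalg.norm(keyframes - robot_xy, axis=1)))
    last = 0 if backward else len(keyframes) - 1
    direction = -1 if backward else 1

    goals = [start]
    for idx in range(start + direction, last + direction, direction):
        far_enough = np.linalg.norm(keyframes[idx] - keyframes[goals[-1]]) >= step
        # Always include the final keyframe so the path is completed.
        if far_enough or idx == last:
            goals.append(idx)
    return goals
```

Calling it with `backward=True` walks the same keyframe list toward index 0, which corresponds to returning to the starting keyframe.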

VI Experiments

To evaluate the performance and accuracy of our algorithm and system, several experiments were carried out in a lab at Simon Fraser University. The lab was equipped with a Vicon motion capture system, which was only used for evaluation purposes and not used in our VTR method. The test platform was a Turtlebot2 [22] robot equipped with a ZED2 camera capturing only monocular data. The computer hardware was an Intel Core i9-9980HK CPU and an RTX 2080 GPU, running ROS Kinetic on Ubuntu 16.04.

The robot was placed inside the experimental area in which a number of objects were randomly placed. Then the robot was controlled manually by a user along a teach path, while ORB-SLAM, YOLOv3, and the object position detection modules were running simultaneously. The VTR system also built the semantic map and memorized the teach path in an online fashion. An example semantic map built in the teach phase can be seen in Fig. 2.

During the repeat phase, the robot was placed in various locations in the lab. Then we manually moved the robot for a short period while mapping modules were running to create a new semantic map for the repeat phase. Once a subset of at least three objects in the lab were seen, the relocalization module kicked in to transform the current robot pose to the teach map using the observed objects in both maps.

VI-A Forward Repeat

Having relocalized the robot with respect to the 3D semantic map built in the teach phase, the VTR system then activated the motion planning module to move the robot toward the closest point on the teach path and followed its keyframes toward the end. In Fig. 5 we show the object locations and robot paths: on the teach map; on the repeat map before relocalization; and on the repeat map after relocalization. In this specific test, the chair and the suitcase were chosen to be the best matching object pair. The repeat path taken by the robot converged to the end-point with high accuracy, by traversing some of the camera keyframes of the teach path.

Fig. 6: The teach and fourteen repeat motions captured by our Vicon motion capture system for forward repeat tests. Triangle arrows indicate the initial poses of the robot, for which we chose to cover a large range of starting positions and orientations.

Fig. 7: The teach and seven repeat motions captured by our Vicon motion capture system for backward repeat tests. Triangle arrows indicate the initial poses of the robot, which can vary significantly, such as a viewing angle difference close to 180°.

Table I: Statistics of the start-point distances and end-point distances between the teach path and the repeat paths. The average and standard deviation in meters are reported for both forward and backward repeats. Rows: start-point distances (forward motion), end-point distances (forward motion), start-point distances (backward motion), end-point distances (backward motion).

We have performed in total 14 forward repeat tests, all of which were captured by the Vicon motion-capture system and are shown in Fig. 6. The thick blue lines indicate the teach path, and all other colored lines visualize the repeat paths. The starting points for repeat motions were strategically chosen to cover a large range of locations and viewing angles inside our lab. In most cases, the viewing angles were significantly different from the one in the teach phase. The locations of the environmental objects utilized in the VTR algorithm are also shown in the figure. In all tests, the robot was first moved slightly to initialize the ORB-SLAM and semantic map for relocalization. Then the robot was able to repeat the teach path with reasonable accuracy, independent of the initial robot location and viewing angle.

We report key statistics of the 14 repeat tests in Table I. The distances between the start point in the teach path and the ones in the repeat paths were calculated, and the average and standard deviation in meters are given in the table. The end-point distances and statistics were also computed. As we can see, for forward repeats, the end points were on average less than the robot diameter (0.34 m) away from the teach path end point, despite the large range and variance of the start points in repeats.
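The start-point and end-point statistics described above amount to a mean and standard deviation over Euclidean distances. A minimal sketch follows; the function name is hypothetical, the use of the population standard deviation is an assumption (the text does not say which estimator was used), and the coordinates in the usage note are placeholders, not measurements from the experiments.

```python
import numpy as np

def distance_stats(reference_xy, points_xy):
    """Mean and (population) standard deviation of the distances from
    each point in points_xy to a single reference point."""
    diffs = np.asarray(points_xy, dtype=float) - np.asarray(reference_xy, dtype=float)
    distances = np.linalg.norm(diffs, axis=1)
    return float(distances.mean()), float(distances.std())
```

For instance, with the teach end point at the origin and two placeholder repeat end points at `(3, 4)` and `(0, 1)`, the distances are 5 and 1, so the function returns a mean of 3 and a standard deviation of 2.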

We note that direct quantitative comparisons with previous works in the VTR literature are challenging, due to different focuses of different algorithms, and the large variance of testing environments. Figures in [18] and [19] did show that these algorithms failed to follow the teach path in cases with large viewing angle differences between the teach and repeat initial poses, or from a starting point far away from the one in the teach phase.

VI-B Backward Repeat

We evaluate the VTR algorithm in the backward direction with the same experimental setup as in the forward repeats. The only difference is that the robot would move toward the starting point instead of the end point on the teach path. Fig. 7 shows the seven backward repeat motions tested in the lab. Again, we report the average and standard deviation of start-point and end-point distances with respect to the teach path in Table I.

The results indicate reasonable accuracy of our algorithm in backward repeats, although the errors are usually larger than in forward repeats. This is partly due to the specific environment setup and the characteristics of ORB-SLAM. In some cases when the robot moved toward a large open area with objects far away in the field of view, ORB-SLAM had difficulties in sensing the robot motions properly. This caused the robot to move slightly further in a few cases, as can be seen in Fig. 7. In most cases, however, the path was followed with reasonable accuracy with respect to the size of the robot, especially considering the significant viewing angle differences, sometimes close to 180°, between the teach path and the backward repeats.

Fig. 8: The teach and three repeat motions captured by our Vicon motion capture system, after relocation of the chair. The old location of the chair can be seen in the previous figures Fig. 6 and Fig. 7.

VI-C Robustness Towards Environmental Changes

We evaluate the robustness of the proposed algorithm to occasional environmental changes. We randomly moved one of the objects, i.e., the chair, that was a semantic landmark in the teach map, to a new place more than 2.5 meters away. Three more forward repeat tests were performed in the changed setting. Fig. 8 shows the teach and repeat paths as well as the new positions of all landmarks. The repeat motions were as accurate as before. Our VTR is robust to environmental changes as long as the number of relocated objects is less than half of all the semantic objects. This algorithm feature holds for backward repeats as well.

VII Conclusions

We have proposed a novel VTR algorithm using 3D semantic maps. The algorithm is robust to large changes of the robot starting pose in the repeat phase. The key component is to build 3D semantic maps of the environment, containing semantic labels and positions of the objects in the environment as well as camera poses, during the teach phase. In the repeat phase, the algorithm relocalizes the robot in the new map based on the found objects in the environment. We tested our algorithm in both forward and backward modes starting from various locations inside the lab. Our algorithm demonstrated robustness toward significant starting pose variations, as well as tolerance to environmental changes.

There are multiple areas for future research. The accuracy of the semantic maps could be improved by a more sophisticated 3D position estimation algorithm. We would like to be able to utilize repetitive objects in the environment in the future. We also wish to boost the end-point accuracy for applications that require return trips and loops.


  • [1] D. G. Lowe, Object recognition from local scale-invariant features, Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 2, pp. 1150-1157, 1999.
  • [2] H. Bay, T. Tuytelaars and L. Van Gool, Surf: Speeded up robust features, European Conference on Computer Vision, Springer, 2006, pp. 404-417.
  • [3] M. Calonder, V. Lepetit, C. Strecha and P. Fua, Brief: Binary robust independent elementary features, European Conference on Computer Vision, Springer, 2010, pp. 778-792.
  • [4] E. Rublee, V. Rabaud, K. Konolige and G. Bradski, ORB: An efficient alternative to SIFT or SURF, 2011 International conference on computer vision, IEEE, 2011, pp. 2564-2571.
  • [5] J. Li, D. Meger and G. Dudek, Semantic mapping for view-invariant relocalization, 2019 International Conference on Robotics and Automation (ICRA), IEEE, 2019, pp. 7108-7115.
  • [6] R. Mur-Artal and J. D. Tardós, Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras, IEEE transactions on robotics, IEEE, 2017, pp. 1255-1262.
  • [7] G. Yu and J. M. Morel, ASIFT: An algorithm for fully affine invariant comparison, Image Processing On Line, vol. 1, 2011, pp 11-38.
  • [8] A. Pfunder, A. P. Schoellig and T. D. Barfoot, A proof-of-concept demonstration of visual teach and repeat on a quadcopter using an altitude sensor and a monocular camera, 2014 Canadian Conference on Computer and Robot Vision, IEEE, 2014, pp. 238-245.
  • [9] T. Nguyen, G. K. Mann, R. G. Gosine and A. Vardy, Appearance-based visual-teach-and-repeat navigation technique for micro aerial vehicle, Journal of Intelligent & Robotic Systems, Springer, vol. 84, no. 1, pp. 217-240, 2016.
  • [10] T. Swedish and R. Raskar, Deep visual teach and repeat on path networks, Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 1533-1542, 2018.
  • [11] P. Furgale and T. D. Barfoot, Visual teach and repeat for long-range rover autonomy, Journal of Field Robotics, Wiley Online Library, vol. 27, no. 5, pp. 534-560, 2010.
  • [12] M. Paton, K. MacTavish, M. Warren and T. D. Barfoot, Bridging the appearance gap: Multi-experience localization for long-term visual teach and repeat, 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, pp. 1918-1925, 2016.
  • [13] C. J. Ostafew, A. P. Schoellig and T. D. Barfoot, Visual teach and repeat, repeat, repeat: Iterative learning control to improve mobile robot path tracking in challenging outdoor environments, 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, pp. 176-181, 2013.
  • [14] L. Clement, J. Kelly and T. D. Barfoot, Robust monocular visual teach and repeat aided by local ground planarity and color-constant imagery, Journal of field robotics, Wiley Online Library, vol. 34, no. 1, pp. 74-97, 2017.
  • [15] M. Warren, M. Greeff, B. Patel, J. Collier, A. Schoellig and T. D. Barfoot, There’s no place like home: Visual teach and repeat for emergency return of multirotor uavs during gps failure, IEEE Robotics and automation letters, IEEE, vol. 4, no. 1, pp. 161-168, 2018.
  • [16] M. Gridseth and T. D. Barfoot, Towards direct localization for visual teach and repeat, 2019 16th Conference on Computer and Robot Vision (CRV), IEEE, pp. 97-104, 2019.
  • [17] M. Gridseth and T. D. Barfoot, DeepMEL: Compiling Visual Multi-Experience Localization into a Deep Neural Network, 2020 IEEE International Conference on Robotics and Automation (ICRA), IEEE, pp. 1674-1681, 2020.
  • [18] L. G. Camara, T. Pivoňka, M. Jílek, C. Gäbert, K. Košnar and L. Přeučil , Accurate and robust teach and repeat navigation by visual place recognition: A cnn approach, 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, pp. 6018-6024, 2020.
  • [19] A. G. Toudeshki, F. Shamshirdar and R. Vaughan, Robust uav visual teach and repeat using only sparse semantic object features, 2018 15th Conference on Computer and Robot Vision (CRV), IEEE, pp. 182-189, 2018.
  • [20] J. Redmon and A. Farhadi, Yolov3: An incremental improvement, arXiv preprint arXiv:1804.02767, 2018.
  • [21] M. Bjelonic, YOLO ROS: Real-Time Object Detection for ROS, 2016-2018.
  • [22] D. Singh, E. Trivedi, Y. Sharma and V. Niranjan, TurtleBot: Design and Hardware Component Selection, 2018 International Conference on Computing, Power and Communication Technologies (GUCON), IEEE, pp. 805-809, 2018.
  • [23] S. Ren, K. He, R. Girshick and J. Sun, Faster r-cnn: Towards real-time object detection with region proposal networks, Advances in neural information processing systems, vol. 28, pp. 91-99, 2015.