Legged robots have great potential as universal mobility platforms for many real-world applications, such as last-mile delivery or search-and-rescue. However, visual navigation can be considerably more challenging for legged robots than for wheeled robots because legged robot navigation data is scarce. For comparison, prior work uses over 1,000 hours of data for self-driving cars, and several kilometers of trajectories and 32 years of simulated experience for indoor navigation of a mobile robot. It is hard to collect this much data on legged robots because they are difficult to control with high accuracy and cannot operate continuously for long periods due to hardware limitations. This lack of data makes it difficult to deploy state-of-the-art deep learning algorithms, such as imitation learning or reinforcement learning.
We choose an imitation learning approach, which obtains a navigation policy by mimicking expert demonstrations, because of its data efficiency and the safety of its data collection. However, for problems as complex as visual navigation, the amount of data required exceeds what is available for legged robots. In this work, our key insight is to mitigate the data collection issue by building a learning system that can learn navigation from heterogeneous experts, i.e., expert demonstrators that have different perspectives and potentially different dynamics. Our assumption is that these agents have better navigation capabilities than legged robots and are more readily available, thus alleviating the data collection bottleneck. Specifically, in this work we focus on learning visual navigation from human agents.
Learning visual navigation from heterogeneous agents poses new challenges. One of the main issues is the perspective shift between different agents’ vision, because a robot may have different camera positions and states from other robots or humans. Directly transferring policies learned from human-perspective demonstrations can result in domain shift problems. Additionally, in some cases the demonstrations contain only raw state sequences without action labels. We thus need an effective planning module that finds the optimal actions based solely on raw images, without any additional information about the robot and its surroundings.
In this work, we propose a novel imitation learning framework that trains a visual navigation policy for a legged robot from human demonstrations. A human expert provides navigation demonstrations as videos recorded by multiple body-mounted cameras. We extract relevant state features from the temporally aligned multi-perspective videos by training a feature disentanglement network (FDN), which disentangles state-related features from perspective-related features. FDN achieves this disentanglement by training with our proposed cycle-loss, which requires that composing disentangled features generates images that correspond to those features. We consider two approaches for labeling demonstrations with robot-compatible actions: an efficient human labeling GUI or a learned inverse dynamics model. We then take a model-based imitation learning approach to train a visual navigation policy in the learned latent feature space.
We demonstrate our proposed framework in both simulated and real environments. Our framework can train effective navigation policies to guide a robot from the current position to the goal position described by a target image. We also validate the feature disentanglement network by comparing the prediction of the perspective-changed images to the ground truth. In addition, we analyze the performance of the proposed framework by conducting an ablation study and comparing with baseline algorithms for feature learning and imitation learning.
II Related Work
II-A Robot Visual Navigation
Robot visual navigation is a fundamental task for mobile robots, such as legged robots. Traditional approaches to this problem generally use simultaneous localization and mapping (SLAM) to first construct a map of the environment and then plan paths [2, 15, 13]. However, these approaches require a robot to navigate and gradually map the environment. Although these methods may work in standard navigation settings, they may not work in our case, where the robot has to learn from human demonstrations that have different perspectives and dynamics. Other recent approaches use learning to enable visual navigation through either imitation learning or reinforcement learning. Imitation learning learns a policy from labeled expert trajectories, for example by imitating a target-driven navigation policy or through conditional imitation learning. As mentioned previously, imitation learning requires large quantities of labeled data, which is not practical for legged robots. Reinforcement learning based approaches learn to navigate an environment given a reward function, either learned from demonstrations [37, 9] or defined manually by a human expert. Most existing work on visual navigation with reinforcement learning is done in simulation [7, 36], though a few are done on real robots [22, 10]. These approaches are limited in the legged robot regime because they require actual trial-and-error on real robots, which can be both time-intensive and dangerous, as collisions can easily damage both the robot and the environment.
II-B Learning from Demonstrations
Learning from demonstrations, or imitation learning [21, 18, 26], is a simple but effective approach to learning robot policies from labeled data. The data can come either from on-robot demonstrations, such as imitating an autonomous driving policy, or from human observation, such as third-person imitation learning, learning by translating context, and using a time-contrastive network (TCN) to learn a reward function. Although learning with on-robot data is effective, it is very labor-intensive to collect large-scale datasets for many robots, and some robots may require special training to operate. Learning from human demonstrations in different contexts (perspectives) naturally mimics the way humans learn to behave, as children learn to perform locomotion and many control tasks by watching others (experts) perform the task. However, the perspective shift between a human and a robot is non-trivial. In this work, we propose a novel feature extraction framework to solve this problem. Our work is also related to model-based reinforcement learning and model-based imitation learning. Our imitation learning framework is similar to that of the universal planning network (UPN), but differs in that we perform the model learning and planning in our learned feature space rather than in the raw pixel space. Imitation learning of visual navigation from human video has been explored in , where they propose to train an inverse dynamics model to learn an action mapping function from robot dynamics to human video. While their work focuses on learning subroutines for navigation, our work focuses on learning a perspective-invariant feature space that is suitable for path planning and model-based imitation learning on the demonstration data. Our work could be combined with their contributions to improve the performance of visual navigation.
In this work, we consider the problem of learning a visual navigation policy on a legged robot from human demonstrations. In our scenario, a human expert mounts cameras on the body and walks in the training environment. Each demonstration yields a sequence of images with the perspective index (superscript) and time index (subscript). We assume that images with the same time index are captured at the same human state (a three-dimensional state space comprising the 2D position and the orientation).
The robot’s observation space at time is defined by an image from the robot’s perspective . The action space consists of five discrete actions : going forward, going backward, turning left, turning right and staying in place. Each action only provides high-level control over the robot while low-level motor torques on legs are computed by a traditional Raibert controller . The goal navigation task is specified by a goal image from the human’s perspective. Therefore, the policy maps the robot’s observation and the specified goal to the action .
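The interface described above can be sketched in a few lines. This is only an illustration: the names `Action`, `Policy`, and `stay_policy` are hypothetical and do not come from the original implementation, and the image type is a placeholder.

```python
from enum import Enum
from typing import Callable, Sequence


class Action(Enum):
    """The five high-level discrete actions listed in the text."""
    FORWARD = 0
    BACKWARD = 1
    TURN_LEFT = 2
    TURN_RIGHT = 3
    STAY = 4


# Placeholder image type (assumption): a 2D grid of pixel intensities.
Image = Sequence[Sequence[float]]

# A policy maps (robot observation, human-perspective goal image) to an action.
Policy = Callable[[Image, Image], Action]


def stay_policy(observation: Image, goal_image: Image) -> Action:
    """Trivial placeholder policy that always stays in place."""
    return Action.STAY
```

A learned policy would replace `stay_policy` while keeping the same signature; the low-level leg control is handled separately by the walking controller.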
This section introduces a zero-shot imitation learning framework for visual navigation of a legged robot. We first introduce our feature disentanglement network (FDN) that extracts perspective-invariant features from a set of temporally-aligned video demonstrations. Then we present the imitation learning algorithm that trains a navigation policy in the learned feature space defined by FDN. Figure 2 gives an overview of the framework.
IV-A Feature Disentanglement Network
We design a feature disentanglement network (FDN, Figure 3) to perform feature disentanglement from visual inputs. More specifically, the FDN tries to separate state information from perspective information, which is necessary for imitation learning between heterogeneous agents. The network is composed of two parts: the state feature extractor with parameters , which extracts state-only information from the visual inputs; and the perspective feature extractor with parameters , which extracts perspective-only information from the visual input. For simplicity, we drop the parameters of the functions unless necessary.
Denote the entire human demonstration dataset as where is the total length and is the total number of perspectives. For a given image input , both networks extract one part of information from the visual input:
where and are the corresponding state features and perspective features, respectively. For training FDN, we also learn an image reconstructor with parameters that takes and as inputs and reconstructs an image corresponding to the same state specified by and the same perspective specified by :
where the subscript denotes the reconstructed image. For any two images that correspond to different state features and different perspective features, we define the cycle-loss function for training the feature extractor as:
Assuming access to temporally aligned images from multiple perspectives, the feature extractor will learn to extract state related information only in and learn to extract perspective information only in . The total loss function for training FDN can be summarized by the following equation:
We train FDN by randomly sampling two images from the multi-perspective data. We use the CycleGAN encoder as the backbone of the feature extractor and flatten the last-layer output into a feature vector. The decoder, or image generator, is inherited from the CycleGAN decoder. More specifically, the encoder has four convolutional layers followed by four residual layers, with instance normalization after each convolutional layer. The decoder has two deconvolutional layers followed by one convolutional layer and an upsampling layer. The Swish activation function is used throughout the network where necessary.
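For readers who want a concrete picture of the cycle-loss idea described in this section, one plausible way to write it is sketched below. The notation is an assumption introduced here for illustration ($F$ for the state feature extractor, $P$ for the perspective feature extractor, $R$ for the image reconstructor, $I_i^{p}$ for the image at time $i$ from perspective $p$), and the choice of norm is left open; it should not be read as the exact formula used by the authors.

```latex
% Assumed notation (illustrative only):
%   I_i^p   : image captured at time i from perspective p
%   F, P, R : state extractor, perspective extractor, image reconstructor
% Take two images I_i^p and I_j^q with i != j and p != q.
\begin{align}
  \hat{I}_i^{q} &= R\big(F(I_i^{p}),\, P(I_j^{q})\big), \qquad
  \hat{I}_j^{p} = R\big(F(I_j^{q}),\, P(I_i^{p})\big), \\
  \mathcal{L}_{\text{cycle}} &=
    \big\| \hat{I}_i^{q} - I_i^{q} \big\| + \big\| \hat{I}_j^{p} - I_j^{p} \big\| .
\end{align}
```

The targets $I_i^{q}$ and $I_j^{p}$ exist because the demonstrations are temporally aligned across perspectives: swapping the perspective feature while keeping the state feature should reproduce the image taken at the same time step by the other camera.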
IV-B Imitation Learning from Human Demonstrations
Inspired by Universal Planning Network [UPN], we train the model-based imitation learning network (Figure 4) in the latent feature space . We process the given human demonstration data into a sequence of features by applying the trained FDN to the data. We also label the robot-compatible actions, either by training an inverse dynamics model or by using a GUI that we developed to label actions manually. The inverse dynamics model (IDM) takes two temporally consecutive images processed by the FDN state feature extractor and predicts the robot action that completes the transition. To obtain one episode of robot data for training the IDM, we start the robot at a random location and let it walk in the environment until it collides or the number of steps exceeds 30. We collect multiple episodes of such robot random-walk data.
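The random-walk data collection for the inverse dynamics model can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_reset` and `toy_step` stand in for a real simulator or robot interface, actions are sampled uniformly, and the colliding transition is kept in the episode. None of these details are specified in the text.

```python
import random

ACTIONS = ["forward", "backward", "turn_left", "turn_right", "stay"]
MAX_STEPS = 30  # episode length cap taken from the text


def collect_random_walk_episode(reset_fn, step_fn, rng, max_steps=MAX_STEPS):
    """Roll out random actions until a collision or until max_steps is reached.

    Returns a list of (observation, action, next_observation) transitions,
    which is the supervision an inverse dynamics model needs.
    """
    obs = reset_fn(rng)
    transitions = []
    for _ in range(max_steps):
        action = rng.choice(ACTIONS)
        next_obs, collided = step_fn(obs, action)
        transitions.append((obs, action, next_obs))
        if collided:
            break
        obs = next_obs
    return transitions


# Hypothetical toy environment: a 1D corridor with walls beyond 0 and 5.
def toy_reset(rng):
    return rng.randint(1, 4)


def toy_step(pos, action):
    delta = {"forward": 1, "backward": -1}.get(action, 0)
    new_pos = pos + delta
    collided = new_pos < 0 or new_pos > 5
    return new_pos, collided


episode = collect_random_walk_episode(toy_reset, toy_step, random.Random(0))
```

In the actual setting the observations would be camera images encoded by the FDN state feature extractor rather than integer positions.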
We define a model that takes as inputs the current observation’s feature encoding , and a randomly initialized action sequence , where is the prediction horizon of the model, and predicts future states’ feature representation . Then we update the action sequence by performing gradient descent on the following plan loss:
which minimizes the difference between the predicted final future state feature and the given goal state feature . Here we use the superscript to explicitly point out that is from the robot’s perspective while the superscript means is from the human’s perspective. We use the Huber loss  to measure the difference between the predicted feature and goal feature. Then given the human demonstrator’s expert action sequence , we optimize the model parameters so as to imitate the expert behavior:
The loss function above can be a cross-entropy loss when the action space is discrete. Once we train the model , Eq. (5) implicitly defines the policy . At each time step, the policy replans the entire action sequence and executes only the first action, similar to model predictive control (MPC). When training the imitation learning model, the prediction horizon can vary because it depends on the number of expert steps between the start and goal states. A mask is therefore applied to Eq. (6) so that only the corresponding action sequence is imitated. This is similar to the way UPN trains the policy. More details can be found in .
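The planning step described above (gradient descent on an action sequence to minimize a Huber loss between the predicted final feature and the goal feature, followed by MPC-style execution of the first action) can be illustrated with a toy example. The sketch below is not the authors' model: it assumes a hand-written additive latent dynamics `z_next = z + a` with continuous 2D actions in place of the learned network, so the gradient can be written analytically.

```python
import random


def huber(residual, delta=1.0):
    """Huber loss of a scalar residual."""
    r = abs(residual)
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)


def huber_grad(residual, delta=1.0):
    """Derivative of the Huber loss with respect to the residual."""
    if abs(residual) <= delta:
        return residual
    return delta if residual > 0 else -delta


def rollout(z0, actions):
    """Toy latent dynamics (assumption): each action is added to the state."""
    z = list(z0)
    for a in actions:
        z = [zi + ai for zi, ai in zip(z, a)]
    return z


def plan_loss(z0, z_goal, actions):
    z_final = rollout(z0, actions)
    return sum(huber(zf - zg) for zf, zg in zip(z_final, z_goal))


def plan(z0, z_goal, horizon=5, steps=200, lr=0.1, seed=0):
    """Gradient descent on a randomly initialized action sequence."""
    rng = random.Random(seed)
    actions = [[rng.uniform(-0.1, 0.1) for _ in z0] for _ in range(horizon)]
    for _ in range(steps):
        z_final = rollout(z0, actions)
        # With additive dynamics, d z_final / d a_t is the identity for all t.
        grad = [huber_grad(zf - zg) for zf, zg in zip(z_final, z_goal)]
        for a in actions:
            for k in range(len(a)):
                a[k] -= lr * grad[k]
    return actions


planned = plan([0.0, 0.0], [3.0, -2.0])
first_action = planned[0]  # MPC-style: execute only this, then replan
```

With a learned dynamics network the gradient would come from automatic differentiation instead of the closed form used here, and the continuous plan would be compared against discrete expert actions through the imitation loss.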
We design our experiments to investigate the following questions. 1) Is the proposed feature disentanglement network able to disentangle features? 2) Can the proposed model-based imitation learning find an effective action plan in the learned feature space? 3) How good is the performance of the proposed framework compared to baselines?
V-A Environment Setup and Data Collection
We select the Laikago from Unitree as the real-world robotic platform to evaluate the proposed framework. For simulation, we develop two environments using PyBullet (Figure 5): NavWorld and OfficeWorld. The latter has more complex textures and a larger space than the former. We also place a simulated Laikago in both simulation environments. The robot is cm tall and we mount the camera cm above the body (see Figure 1, left). The frames are down-sampled to pixels. The discrete actions are defined as running a walking controller with a constant target linear and angular velocity for second. As a result, the robot can move about m, which is nearly one-third of its body length, and can turn left or right by about 30 degrees in the NavWorld environment and 90 degrees in the OfficeWorld environment. We choose a small room of size for training and testing on the real robot. For simplicity, we assume kinematic movements.
In our experiments, each human demonstration is a first-person-view navigation video. In the real-world case, we collect temporally aligned multi-perspective data by mounting three GoPro cameras on the person (Figure 1, right) and having the person navigate certain paths in the environment. The videos are downsampled to pixels to match the dimensions of the robot’s camera. This gives us multiple video clips with the same state sequence but different perspectives, from which we extract perspective-invariant features. In total, we collected 25 demonstration trajectories, each of length 20 steps, in the real-world environment for minutes. In simulation, we automatically generate demonstrations using a path-planning algorithm on randomly sampled start and goal locations. We collect 500 demonstration trajectories, each of length 20 steps. To improve data efficiency, we augment both the simulated and the real data by replaying the videos in reversed time order. We also add stay-in-place demonstration sequences by repeating randomly sampled observations for 20 steps.
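The two augmentations mentioned above, time reversal and synthetic stay-in-place sequences, can be sketched as below. The function names are hypothetical, frames are represented by arbitrary Python objects, and the sketch covers only the observation sequences; how action labels are adjusted for reversed clips is not specified in the text and is left out.

```python
import random


def reverse_demo(frames):
    """Replay a demonstration in reversed time order."""
    return list(reversed(frames))


def stay_in_place_demo(frames, rng, length=20):
    """Repeat one randomly sampled observation for `length` steps."""
    frame = rng.choice(frames)
    return [frame] * length


def augment(demos, rng, num_stay=1, length=20):
    """Return the original demos plus reversed and stay-in-place variants."""
    augmented = [list(d) for d in demos]
    augmented += [reverse_demo(d) for d in demos]
    all_frames = [f for d in demos for f in d]
    augmented += [stay_in_place_demo(all_frames, rng, length) for _ in range(num_stay)]
    return augmented


# Toy usage: one 20-step demonstration whose "frames" are just integers.
demos = [list(range(20))]
augmented = augment(demos, random.Random(0))
```

In practice each frame would be an image (or its FDN feature), and `num_stay` would be chosen relative to the size of the demonstration set.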
In our framework, we need to obtain robot-compatible action labels for human demonstrations, since humans and robots have different dynamics. In simulation, we train an inverse dynamics model that takes two consecutive images processed by FDN and predicts the robot action that completes the transition. We then use the trained inverse dynamics model to label the expert demonstrations. In the real-world experiment, since the robot trajectory data, especially the actions, contain significant noise, we develop a GUI that allows us to label human actions manually within a short amount of time. Note that this work’s focus is not on action labeling. In addition, the manually labeled action is only a rough estimate of where the robot is going, and it may still contain noise: for example, when the robot is staying in place, it may still move around slightly due to drifting.
We train the FDN and the imitation learning model both in the simulated environments and on the real robot. Additionally, we train the inverse dynamics model in the simulation for automatic human action labeling. For all experiments, we use the Adam optimizer  with a learning rate of 0.0035 and batch size of 32, and we set the feature dimension . We evaluate the success (reaching the goal) rate of our experiments in simulation by comparing the robot’s state (location and orientation) to the goal state, and the success rate on real robot by human visual evaluation.
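The simulated success criterion (comparing the robot's final location and orientation to the goal state) can be sketched as follows. The tolerances `pos_tol` and `yaw_tol` are hypothetical values chosen for illustration; the text does not state the thresholds that were used.

```python
import math


def angle_diff(a, b):
    """Absolute difference between two angles in radians, wrapped to [0, pi]."""
    d = (a - b + math.pi) % (2 * math.pi) - math.pi
    return abs(d)


def reached_goal(state, goal, pos_tol=0.5, yaw_tol=math.radians(30)):
    """state and goal are (x, y, yaw); tolerances are illustrative assumptions."""
    x, y, yaw = state
    gx, gy, gyaw = goal
    close = math.hypot(x - gx, y - gy) <= pos_tol
    aligned = angle_diff(yaw, gyaw) <= yaw_tol
    return close and aligned


def success_rate(final_states, goals, **tolerances):
    """Fraction of episodes whose final state satisfies reached_goal."""
    hits = sum(reached_goal(s, g, **tolerances) for s, g in zip(final_states, goals))
    return hits / len(goals)


rate = success_rate(
    [(0.0, 0.0, 0.0), (5.0, 5.0, 0.0)],
    [(0.1, 0.0, 0.1), (0.0, 0.0, 0.0)],
)
```

On the real robot the same quantity is judged visually by a human, so no state comparison of this kind is available there.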
V-C Validation of Feature Disentanglement Network
Figure 6 presents the results of image generation by composing state and perspective features using FDN. As the generated images illustrate, the network can compose state and perspective features to generate an image that corresponds to the input state and perspective features. In particular, the perspectives differ in the vertical position of the camera in simulation and in both the vertical and horizontal location of the camera in the real robot data. The results show that FDN is able to learn such perspective information during training.
V-D Simulation Results
|Env / # of Human Demos||100||300||500|
First, we validate our framework in the simulated environment. Our framework achieves zero-shot robot visual navigation from human demonstrations with a success rate ranging from % to % depending on the task difficulty (see Table I for more details). We observe that the robot is able to find the goal position, specified from the human’s perspective, even when it is out of sight, because the robot builds a model of the environment through model-based imitation learning from human demonstrations.
We also evaluate the effect of task difficulty (by varying the minimum number of steps between the start and goal locations) and of the number of human demonstrations on the success rate. For the latter, we fix the task difficulty at 10 steps. Table I shows that as the distance between the start and goal locations grows, finding the correct path to the goal becomes harder. In the larger OfficeWorld environment, we observe a similar decrease in success rate with increasing task difficulty. Table II shows that the success rate increases with more human demonstrations.
V-E Hardware Results
In addition, we conduct experiments on the real legged robot, Laikago. We test the robot on three sets of target-driven navigation tasks. We consider three targets and start locations to evaluate the robustness and consistency of the policy. The distance from the target location to the start location of the robot is around two meters. For each pair of start and goal locations, we run three trials and evaluate the success rate over them. On these testing tasks, we obtain a success rate of around 60%. We show some successful target-driven visual navigation trajectories in Figure 7. In our experience, the robot is more accurate when the goal image is visually salient, such as a brown chair. On the other hand, it struggles to find an object whose color is similar to the background, e.g., a white desk in front of a white wall.
V-F1 Comparison with Different Loss Functions
|NavWorld, 4 Steps||54.55%||52.63%||8.77%|
|NavWorld, 10 Steps||21.81%||20.72%||2.70%|
|OfficeWorld, 10 Steps||26.67%||27.30%||25.08%|
To investigate whether our proposed cycle-loss is suitable for training feature disentanglement, we compare it with baseline loss functions. Specifically, we experiment with several combinations of the proposed cycle-loss and the triplet loss. In the first scenario, we train the feature extractor with our cycle-loss only. In the second, we combine the cycle-loss with the triplet loss. Given three images , where and share their states and share their perspective, the triplet loss minimizes the state feature difference for the same state seen from different perspectives and maximizes the state feature difference for different states seen from the same perspective. Here is the enforced margin typically used in the triplet loss, and the loss is usually clipped to zero when it is negative. In the third scenario, we use the triplet loss only to learn the state feature representation. The results are presented in Table III. The triplet loss alone yields consistently worse results than our proposed cycle-loss. Combining the cycle-loss with the triplet loss improves the performance slightly. The triplet loss’s poor performance may result from its sensitivity to the data sampling process, whereas training with our proposed cycle-loss is more stable and is not sensitive to data sampling. In addition, the triplet loss learns only the state feature, while our network learns both state and perspective features, and our decoder helps verify that the learned features have the correct correspondence.
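For reference, a margin-based triplet loss of the kind described above can be sketched in a few lines. The Euclidean distance and the default margin of 1.0 are assumptions made for this illustration; the text does not specify the distance function or the margin value.

```python
import math


def l2(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on state features.

    anchor   : state feature of a reference image
    positive : state feature of the same state from a different perspective
    negative : state feature of a different state from the same perspective
    The loss is clipped to zero when the negative is already farther from the
    anchor than the positive by at least `margin`.
    """
    return max(0.0, l2(anchor, positive) - l2(anchor, negative) + margin)


well_separated = triplet_loss([0.0, 0.0], [0.0, 0.0], [3.0, 4.0])
violated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 0.0])
```

Because the loss depends on which anchor, positive, and negative are sampled, its training signal can vary considerably with the sampling strategy, which is consistent with the sensitivity discussed above.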
V-F2 Comparison with Baselines
We also compare the proposed framework with a baseline algorithm, Universal Planning Networks (UPN). In particular, we test UPN in two scenarios: without and with perspective changes between the learner and the demonstrator. The former serves as the upper bound of our method’s performance. We call the first method UPN and the second UPN-PerspChange. In the first, we train and test UPN under the same perspective. In the second, UPN is trained on multi-perspective data that excludes the testing perspective.
We perform a comparative study on NavWorld with a start-goal distance of 4 steps and on OfficeWorld with a start-goal distance of 10 steps. The results are presented in Table IV. The success rate of our method approaches that of UPN, the theoretical upper bound. This indicates that our FDN effectively handles the perspective shift between robot and human data. When there is a perspective change, UPN-PerspChange, trained on data from some perspectives, cannot generalize to an unseen perspective, and its results are worse than those of our method. This indicates that the perspective shift is nontrivial and that direct transfer does not work.
We also observe that UPN-PerspChange works better in the OfficeWorld environment than in NavWorld. This is because the turning angle is 90 degrees in OfficeWorld but 30 degrees in NavWorld. Therefore, to reach a state in OfficeWorld, most of the actions are either moving forward or backward. Even though the textures in OfficeWorld are more complicated, the task difficulty for the same number of steps between the start and goal locations is lower.
VI Conclusion and Remarks
We propose a novel imitation learning framework for learning a visual navigation policy on a legged robot from human demonstrations. The main challenge is to interpret the expert demonstrations from different perspectives than the robot’s. To this end, we develop a feature disentanglement network (FDN) that extracts perspective-invariant features and a model-based imitation learning algorithm that trains a policy in the learned feature space. We demonstrate that the proposed framework can find effective navigation policies in two simulated worlds and one real environment. We further validate the framework by conducting ablation and comparative studies.
The bottleneck for deploying the current framework to real-world scenarios is the manual action labeling process of human demonstrations. However, automated action labeling is not straightforward at the required high accuracy (%). One possible approach is to collect a small amount of the robot’s navigation data to build an inverse dynamics model that takes in two consecutive images and predicts the robot’s action. In our experience, this approach works in simulation but not on the real robot because a legged robot’s gait blurs the camera images. In addition, the robot’s discrete actions are often not well-matched with real human demonstrations. In the future, we want to investigate more stable gaits with continuous control commands.
Although we tested the framework for learning a legged robot policy from human demonstrations, the framework is designed to support general imitation learning between any heterogeneous agents. In the future, we hope to build a general system that can learn navigation policies for data-expensive robots, such as legged robots or aerial vehicles, from easy-to-operate robots, such as mobile robots or autonomous cars. If we can fully exploit large navigation datasets, such as Google Street View, there is great potential to significantly improve performance on real robots.
-  (2010) Google street view: capturing the world at street level. Computer 43 (6), pp. 32–38. Cited by: §VI.
-  (2008) Visual navigation for mobile robots: a survey. Journal of intelligent and robotic systems 53 (3), pp. 263–296. Cited by: §II-A.
-  (2019) Learning navigation behaviors end-to-end with autorl. IEEE Robotics and Automation Letters 4 (2), pp. 2007–2014. Cited by: §I.
-  (2018) End-to-end driving via conditional imitation learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–9. Cited by: §II-A.
-  PyBullet, a Python module for physics simulation for games, robotics and machine learning. GitHub repository. Cited by: §V-A.
-  (1959) A note on two problems in connexion with graphs. Numerische mathematik 1 (1), pp. 269–271. Cited by: §V-A.
-  (2019) Scene memory transformer for embodied agents in long-horizon tasks. In , pp. 538–547. Cited by: §II-A.
-  (2019) Long-range indoor navigation with PRM-RL. CoRR abs/1902.09458. Cited by: §I.
-  (2019) From language to goals: inverse reinforcement learning for vision-based instruction following. arXiv preprint arXiv:1902.07742. Cited by: §II-A.
-  (2017) Cognitive mapping and planning for visual navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2616–2625. Cited by: §II-A.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §IV-A.
-  (1992) Robust estimation of a location parameter. In Breakthroughs in statistics, pp. 492–518. Cited by: §IV-B.
-  (2013) Dense visual slam for rgb-d cameras. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2100–2106. Cited by: §II-A.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §V-B.
-  (2010) View-based maps. The International Journal of Robotics Research 29 (8), pp. 941–957. Cited by: §II-A.
-  (2019) Learning navigation subroutines by watching videos. arXiv preprint arXiv:1905.12612. Cited by: §II-B.
-  (2015) DeepMPC: learning deep latent features for model predictive control.. In Robotics: Science and Systems, Cited by: §IV-B.
-  (2018) Imitation from observation: learning to imitate behaviors from raw video via context translation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1118–1125. Cited by: §II-B.
-  (1999) Born to learn: what infants learn from watching us. The role of early experience in infant development, pp. 145–164. Cited by: §II-B.
-  (2019) Semantic predictive control for explainable and efficient policy learning. In 2019 International Conference on Robotics and Automation (ICRA), pp. 3203–3209. Cited by: §II-B.
-  (2018) Agile autonomous driving using end-to-end deep imitation learning. In Robotics: science and systems, Cited by: §II-B.
-  (2018) Zero-shot visual imitation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2050–2053. Cited by: §II-A.
-  (1986) Legged robots that balance. MIT press. Cited by: §III.
-  (2017) Swish: a self-gated activation function. arXiv preprint arXiv:1710.05941 7. Cited by: §IV-A.
-  (2018) Deep imitative models for flexible inference, planning, and control. arXiv preprint arXiv:1810.06544. Cited by: §II-B.
-  A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635. Cited by: §II-B.
-  Time-contrastive networks: self-supervised learning from video. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1134–1141. Cited by: §II-B, §V-F1.
-  (2018) Universal planning networks. arXiv preprint arXiv:1804.00645. Cited by: §II-B, §IV-B, §V-F2.
-  (2017) Third-person imitation learning. arXiv preprint arXiv:1703.01703. Cited by: §II-B.
-  (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022. Cited by: §IV-A.
-  (2019)(Website) External Links: Cited by: §V-A.
-  (2019) Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6629–6638. Cited by: §II-A.
-  (2018) Bdd100k: a diverse driving video database with scalable annotation tooling. arXiv preprint arXiv:1805.04687. Cited by: §I.
-  Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232. Cited by: §IV-A.
-  (2017) Visual semantic planning using deep successor representations. In Proceedings of the IEEE International Conference on Computer Vision, pp. 483–492. Cited by: §II-A.
-  (2017) Target-driven visual navigation in indoor scenes using deep reinforcement learning. In 2017 IEEE international conference on robotics and automation (ICRA), pp. 3357–3364. Cited by: §II-A.
-  (2008) Maximum entropy inverse reinforcement learning. Cited by: §II-A.