Classical mobile robots are designed to be adaptive to different navigation environments by in-situ adjustment of the underlying navigation system, such as by sensor calibration [xiao2017uav] or by parameter tuning [xiao2020appld]. However, without adjustment from expert knowledge, the untuned system may repeat the same mistakes even though it has navigated in the same environment multiple times.
Recent success in using machine learning for mobile robot navigation indicates the potential of improvement of navigational performance based on past experience navigating in the same environment[kahn2018self]. When facing different navigation environments, however, learning methods cannot generalize well to unseen scenarios; they must re-learn to navigate in the new environments. More importantly, the learned system is prone to the so called catastrophic forgetting phenomenon, which causes the robot to forget what was learned in previous environments [french1999catastrophic].
This paper introduces a lifelong navigation framework that addresses the aforementioned challenges: Instead of learning from scratch, the navigation policy is initialized through an untuned static classical navigation algorithm, whose navigation performance does not improve with increasing navigation experience. The robot is able to identify its suboptimal actions and learn from them. The navigational performance then improves in a self-supervised manner. When facing different navigation environments, the navigation policy is able to learn to adapt to new environments, while not forgetting how to navigate in previous ones. The lifelong learning framework is implemented entirely onboard a physical robot with limited memory and computation, and allows the robot to navigate in three different environments (Figure 1). The three main contributions of the paper are:
A self-improvement strategy that complements an initial static planner to dynamically increase navigation performance with more experience, deployed in conjunction with the initial planner to minimize learning overhead;
A continual learning scheme that allows learning to navigate in new environments while not forgetting previous ones; and
An implementation of the lifelong navigation framework entirely onboard a physical robot platform with limited memory and computation.
2 Related Work
This section first reviews how classical and learning-based methods improve navigation performance and adapt to different navigation environments, and then briefly discusses recent successes in the continual learning community.
Classical navigation systems [quinlan1993elastic, fox1997dynamic, kavraki1996probabilistic, kuffner2000rrt] are designed to be applicable to a wide variety of environments, but they are usually static, operating under a fixed set of pre-specified hyper-parameters and therefore lacking the ability to improve with experience and adapt to a specific environment. Parameter Tuning is the current practice to address the aforementioned problems [zheng2017ros], which requires human experts’ intuition, experience, and trial-and-error. To reduce the reliance on expert knowledge, xiao2020appld proposed to improve navigation for a given environment (defined as “context”) by tuning parameters through Behavior Cloning (BC) from teleoperated demonstration. In most cases, tuning requires human knowledge, either in the form of direct tuning or of navigation demonstrations. Furthermore, once tuned, the static navigation system lacks the ability to further improve with more experience or adapt to new environments. In contrast, the proposed lifelong navigation framework does not require human knowledge and can dynamically improve with more navigation experience when facing new environments.
Data-driven machine learning techniques have also been widely applied to navigation problems [pfeiffer2018reinforced, tai2017virtual, zeng2019navigation, wang2018learning, xie2018learning, chiang2019learning, gao2017intention, zhu2017target, wang2019dual]. As for physical robot navigation, learning approaches typically either imitate a conventional planner [pfeiffer2017perception]
or learn from trial-and-error using reinforcement learning[kahn2018self]. While these learning methods enable improvement in a specific environment with increased navigation experience, if the agent were to be placed in multiple environments in a sequential fashion, which is common in real-world navigation, learning methods may not generalize well and can easily forget the past knowledge. From a lifelong learning perspective, wyeth2011towards studied continuous mapping for navigation and wang2019reinforced improved generalization in vision-language navigation. Both methods focus on the problem of where to navigate instead of how to navigate. To the best of the authors’ knowledge, no existing work has tackled lifelong/continual learning of navigation behaviors across different navigation environments. We conjecture that unconstrained offboard computation resources, e.g., memory, power, and time, allow learning from an extensive body of training data pre-collected in different environments. However, for onboard resource-constrained robot platforms without pre-collected supervised data, learning how to navigate in new environments while not forgetting previous ones requires further research, which is the focus of this work.
Lifelong/Continual Supervised Learning
Lifelong or continual learning studies the problem of learning in an ongoing fashion. One of the earliest attempts at lifelong learning originates from the robotics community [thrun1995lifelong]. ring1994continual
provides the earliest introduction to continual learning in reinforcement learning problems. Recently, much progress has been made for continual learning with neural networks. There are mainly three categories of approaches: 1) use regularization to prevent the learned weights from deviating too much from the old weights[kirkpatrick2017overcoming, zenke2017continual]; 2) train a generative model to recover old data for joint optimization [shin2017continual]; and 3) adopt a dynamic network architecture for learning more tasks [rusu2016progressive, yoon2017lifelong]. Among the above approaches, the first approach applies to a fixed capacity network, which is often much more computationally efficient than training a generative model or adopting a dynamic network architecture. This computational efficiency is essential for learning onboard resource-constrained mobile robot platforms. Specifically, when a few past data points can be saved, Gradient Episodic Memory (GEM) [lopez2017gradient] can be very efficient and powerful (See Section 3). All these methods demonstrate success on continual image classification problems, but there remains much room for studying continual learning in other applications like in robotics, especially when supervised labels from human experts are not available a priori.
In this section, we first present the notation we use in this work and the problem setup. We then introduce an underlying lifelong learning algorithm we use for our lifelong navigation framework. We define “lifelong navigation” in the sense of learning navigation with increasing experience, across environments, rather than in the literal sense, i.e. navigation over extended periods of time.
Notation and Problem Setup
The high-level objective of lifelong navigation can be summarized as learning to navigate in a sequence of environments .111Informally, we consider an environment to be a contiguous space in which a fixed navigation policy without any adaptation (e.g. parameter-tuning or re-learning) will achieve similar performance. It is generally characterized by the width of its passageways and the density of its obstacles. In each of those environments, the robot aims at navigating from one fixed start point of the environment to another fixed goal point. We assume a fixed global planner (e.g. Dijkstra’s algorithm [dijkstra1959note], A* [Hart1968] or D* [ferguson2006using]) generates a path connecting start and goal, and while navigating, the robot needs to produce motion commands which follow this global path, observe its kinodynamic constraints, and avoid obstacles. Whenever the agent advances to , it no longer has access to . Within the environment , the agent, at each time step , computes a motion command , where is the agent’s state, is a policy parameterized by . After executing , the agent advances to and the process continues. During the learning phase, the agent will have a limited onboard memory size , i.e. maximally pairs of
can be stored at any moment. Once the agent has seen allenvironments, its navigation performance will be evaluated on the same environments.
Gradient Episodic Memory
The key challenge of continual learning is so-called catastrophic forgetting, which means the agent forgets what it learned previously when adapting to a new environment. The phenomenon is especially prominent when feature-rich parametric models, i.e. neural networks, are used as the underlying learning module. To address catastrophic forgetting, we use Gradient Episodic Memory (GEM)[lopez2017gradient] within our lifelong navigation paradigm described in Section 3. GEM is designed to solve the catastrophic forgetting problem by constraining the gradient of new updates to be only along directions that improve performance on previous tasks. More specifically, if the agent has already seen environments up to , GEM assumes the agent keeps a small memory buffer that stores a few data points from each of the previous environment , and optimizes the following objective
is the loss function, i.e. in the regression case,. The above optimization problem is in general hard to solve. The key observation of GEM is that it is not necessary to store as long as at each optimization step the loss on previous tasks is not increasing. To approximately evaluate the loss increase on previous tasks, GEM further assumes the policy is locally linear, as is the case with small optimization steps, so that whether the loss on previous tasks will increase can be determined by the angle between the gradient on the current environment and the gradient from previous tasks. Specifically, the optimization problem becomes
In practice, to solve Equation (2
), GEM uses stochastic gradient descent with a modified gradient. In particular, denoteand . Then, GEM attempts to find the update direction :
The above optimization is already in a nice quadratic form, but the decision variable has the same dimension as , which can be millions for deep architectures. Fortunately, its dual problem is only associated with variables and can be efficiently solved by standard quadratic programming solvers. Formally, the dual problem of (3) is
where is a matrix having as its columns. As a result, the final and the update rule is , where is the learning rate. GEM has no requirement for dynamically expanding the parameter size and often requires very few exemplar data points from past experience to maintain the learned behavior. Therefore, GEM is particularly suitable for robot navigation tasks since mobile robots often have very limited onboard memory resource.
4 Lifelong Navigation
Intuitively, catastrophic forgetting is caused by the dilemma of learning new things and preserving old knowledge. In real-world navigation, it is unlikely that the agent needs to keep learning new things every second. The agent can often easily navigate in many places with classical approaches and only find it hard to navigate in particular scenarios (e.g. the green and red path segments in Figure 1). As a result, we focus the agent on learning from those difficult instances to reduce computation overhead and minimally influence its past knowledge. We would like the agent to be able to identify those “difficult” scenarios from its experience, i.e. in what particular state does the agent generate sub-optimal motion commands. In addition, since we do not assume access to an expert to provide the ground truth motion commands for those difficult scenarios, the agent must keep sampling motion commands until it finds alternative better actions.
When it eventually passes a difficult scenario, it should identify the cause of its success, i.e. a particular motion command that helped it navigate through the difficulty around . Note that the motion command is still generated by the agent itself, but might originate from a different state near (see Figure 2). Hence, by identifying the right and learning from it, the agent can improve its navigational behavior around . Finally, the learning from should minimally affect what the agent has learned in the past. To summarize, we list the key components of our lifelong navigation:
An initial sampling-based navigation planner and a learnable policy , parameterized by . The two policies will be used in conjunction for lifelong navigation.
A discriminator that given a state , evaluates how good an action is, to navigate the environment, i.e. larger indicates is a better action at .
A short-term memory buffer , acting as a streaming buffer, that stores the past -step trajectory the agent has seen, i.e. .
A per-environment buffer that stores the selected training data from environment . is updated using data from .
A long-term memory buffer that stores the self-collected demonstrations for difficult scenarios in previous environments.
An algorithm that given the past experience and a sub-optimal behavior ,222Usually we pick the midpoint of the past -step trajectory, i.e. finds that causes the agent to successfully pass .
A continual learning algorithm that updates the agent’s parameterized policy given the current environment buffer and the long-term memory .
Methods and that update the training buffer and long-term memory buffer to ensure the total memory resource is constrained.
With the above components, the agent will be manually333It may be possible for the agent to automatically detect environment shift, but we leave that for future work. placed at the fixed start locations of a sequence of different environments, in each of which it navigates to the fixed goal, keeps identifying sub-optimal behaviors, and improves upon them, while preserving its past knowledge. The pipeline of lifelong navigation is summarized in Algorithm 1.
In Algorithm 1, the long-term and short-term memory buffer and , and initial learnable policy are initialized in line 2. For each environment (line 3), the buffer that is used to store the selected training data from is initialized in line 4. While navigating (line 5), when the agent progresses to a state (line 6), it computes motion predictions from both and (line 7). In line 8, it executes the action with higher score computed from the discriminator
. In practice, we use the recovery behavior heuristics extracted from the DWA planner[zheng2017ros] as as it naturally provides a good evaluation of navigational behavior (see Section 5 for details). In line 9, the agent saves the current step to the short-term memory . Note that the size-constrained will automatically remove the earliest step when its size goes beyond . In line 10-14, the agent looks at the midpoint of the short-term memory. If , i.e. is sub-optimal at , it then searches within the nearby experience to find the state-action pair that most likely causes the agent to pass using . Intuitively, if we search within state-action pairs around that have action scores above the threshold , and find the state-action pair that has the most similar state to , then learning from could potentially help the agent navigate from . Formally, selects according to
where measures how similar is to . The underlying assumption is that taking the same action in similar states should result in similar action scores. In other words, that similarity score indicates how confident we are that can actually help the agent pass , provided that . In our implementation, the state consists of raw sensor readings, such as lidar, and a local goal position, so the negative Euclidean distance is a reasonable measure of similarity. Note that although it is possible that within , no state-action pair has sufficiently similar state to , learning from is not detrimental to . In line 13, attempts to add to the training buffer . If is full, will compare all the similarity scores and remove the least similar one. In line 16, the agent updates its knowledge with the current , using the continual learning algorithm , while preserving what it has learned before by considering . In particular, to continually update , we adopt the Gradient Episodic Memory (GEM) as . Specifically, we would like to solve
Here , is the standard behavior cloning objective. The optimization in Equation (5) will then be solved with stochastic gradient descent using the same method shown in Equation (3). Finally, to accommodate the online memory budget, shrinks by removing entries with the lowest similarity score (line 17), appends the training experience from the current environment (line 18), and moves on to the next navigation environment.
The proposed lifelong navigation framework is tested in simulated and physical experiments. We hypothesize that through lifelong navigation (1) navigation performance can improve as the robot gathers more experience in an environment, and (2) navigation in new environments can be learned while not forgetting previous ones.
Clearpath Jackal, a four-wheeled differential-drive unmanned ground vehicle, is used for both simulated and physical experiments. The robot is equipped with a laser scanner to perceive surrounding obstacles and runs the basic Robot Operating System (ROS) move_base navigation stack. While the global planner is based on Dijkstra’s algorithm [dijkstra1959note], the local planner uses DWA [fox1997dynamic]. As mentioned in Section 3, we fix the global planner and improve the local DWA planner using the lifelong navigation framework. We include the local goal provided by the global planner as part of the state, along with LiDAR input. Based on the local DWA planner as an initial policy (), lifelong navigation framework learns to complement this and improve navigation performance in multiple environments. As a sampling-based local planner, when DWA cannot find feasible motion, it starts recovery behavior, including rotation in place or backing up [zheng2017ros]. The lifelong navigation framework first aims at eliminating those sub-optimal behaviors by proposing alternative motions for a given navigation environment. Second, it allows the robot to adapt to new environments while still remembering previous ones.
The learning problem is formulated as finding a policy that maps from the current state , which includes LiDAR input (720- and 2095-dimensional for simulated and physical experiments, respectively) and local goal (, 1m away on the global path), to the action , linear velocity and angular velocity . In our implementation, the discriminator prioritizes actions proposed by DWA, if they result in a steady forward motion (). This choice of implementation is due to the fact that DWA with default parameters is a relatively conservative algorithm, which only produces steady forward motion when it thinks the motion is absolutely safe. The discriminator can be implemented differently given a different initial policy, e.g. a random exploration policy. If DWA fails to find feasible actions, i.e. the robot starts to execute recovery behaviors, the learned policy takes over. To assure safety, if the action from the learned policy is likely to cause a collision based on a forward simulation model, we revert back to DWA’s sub-optimal action.
In consideration of limited onboard resources, we implement a streaming buffer as a regular queue to store 300 online streaming data points sequentially, and a separate training buffer as a priority queue to save valuable training data in the current environment. The size of is also constrained to 300. For every sub-optimal action generated by DWA at a certain state, i.e. the robot performs recovery behavior, we compute a similarity score between this state and all states in the streaming buffer with a successful action (L2-norm of the LiDAR reading difference). We update all data points of lower similarity score in the training buffer with those from the streaming buffer (). The onboard data memory overhead to implement the lifelong navigation framework is less than a total of 600 data points.
5.1 Simulated Experiments
The effect of lifelong navigation is first studied with extensive simulated trials. The simulated Jackal has a SICK LMS111 laser scanner onboard providing 270 720-dimensional laser scan. The three simulated navigation environments are shown in Figure 3, where the robot navigates from a fixed start from one side of the environment to a fixed goal on the other side. In the simulated experiments, the robot is placed at its start location in each environment, and the corresponding learned policy to be tested is selected manually.
We use the initial policy to execute three trials in each of the three navigation environments to collect self-supervised training data. (DWA) can eventually navigate through, given sufficient time to recover, re-sample, and re-plan. To test in-environment learning improvement, training data of the given environment is divided into five segments and incrementally presented to the learner. For example, the first training buffer for learning is constructed from the first one-fifth of the robot’s experience, the second from the first two-fifths, etc. The last training buffer is from the entire experience. Note all these training buffers contain only 300 data points, who have the highest similarities to those states where the initial policy performs sub-optimally, given the presented navigation experience. Therefore, six incrementally learned navigation policies are produced, starting from the initial policy to the policy learned with a training buffer that has seen all three training trials. Then the robot starts to learn the next navigation environment. For cross-environment learning, we use Sequential Training and the proposed Lifelong Navigation. Sequential Training first uses the training buffer of environment 1. Starting with the final policy trained from environment 1, it sequentially trains on the training buffer of environment 2, before moving on to the training buffer of environment 3. In Lifelong Navigation, while training environment 2, it only uses 150 data points from training buffer 2 with the highest similarity score, and still keeps a memory of 150 data points with the highest similarity score from training buffer 1 to assure new gradient updates won’t increase the loss of environment 1. While training on environment 3, 100 data points with the highest similarity score for each environment are used, to avoid forgetting environment 1 and 2.
We implement a small neural network of three hidden layers with 64 hidden neurons each, to compute linear and angular velocity based on the LiDAR input and local goal. After training, we evaluate the navigation performance in terms of traversal time in the current and previous environments. Each policy is executed three times, resulting in a total of 198 evaluation trials. We report the mean and standard deviation of the performance in Figure4. If the robot is not able to reach the goal, e.g. getting stuck, a penalty time of 100s is given. Within each environment, Lifelong Navigation is able to decrease traversal time with increasing learning experience. Across environments, Lifelong Navigation learns new environments with increasing data while avoids catastrophic forgetting of previous ones. Sequential Training can improve navigation performance of a given environment when being presented the data from that particular environment, in a faster way than Lifelong Navigation. However, navigation performance in previous environments deteriorates with increasing experience of the current environment. The catastrophic forgetting is apparent in the diverging red line (Sequential Training) from the green line (Lifelong Navigation). Note that the learned policies are used in conjunction with the initial policy.
5.2 Physical Experiments
Our lifelong navigation method is also implemented on a physical Jackal robot. In the physical experiments (Figure 1), we use the same setup, while all computation is done onboard the robot using an Intel Core i5-4570TE CPU. The physical Jackal has a Velodyne LiDAR, whose 3D point cloud data is converted to 360 2095-dimensional laser scan. We design the physical test environments such that the goal of one environment smoothly transitions to the start of the next. Therefore the robot can traverse through all three environments in one shot. For training, we manually label the current environment for the learner. During deployment, the robot uses only one for each traverse.
In the physical experiments, we compare our lifelong navigation framework with three baselines: DWA, Sequential Training, and Individual Models. Sequential Training and Lifelong Navigation are conducted in the same way as in the simulated experiments. In addition, Individual Models start with the final model of the last environment and then train a separate model using data from the current environment. These three individual models are applied to their corresponding environments separately. The Individual Models baseline does not use lifelong navigation and requires different models for each environment. It is expected to represent an upper-bound of navigation performance with additional resources (900 data points and three separate models). The individual model for environment 3 is also the final Sequential Training model. It is then applied to the first two environments after being sequentially trained on all three environments to see if catastrophic forgetting happens through Sequential Training on the physical robot. Training every model takes less than two minutes on the robot’s onboard CPU.
|Env. 2||Mean||28.053.48||49.41 13.94||19.981.07||23.410.66|
We deploy the trained models to navigate in the three environments. For each method, the robot navigates each environment five times, resulting in a total of 60 physical trials. Table 1 reports the mean execution time for each environment with standard deviation and average number of recovery behaviors (Rec.) and collisions (Col.). DWA exhibits the most recovery behaviors, because whenever the robot fails to sample a feasible motion, it starts recovery behavior. One collision happens in one of the environment 2 and 3 trials. Applying the model sequentially learned on environment 1, 2, and 3 causes catastrophic forgetting of the first two environments. It leads to longer execution time and higher standard deviation, with frequent recovery behaviors and collisions. Note that the Sequential Training model used for environment 3 is identical to the one used for Individual Models. Lifelong Navigation can successfully avoid catastrophic forgetting: it achieves a similar time to the Individual Models approach in environment 3, while, surprisingly, it outperforms Individual Models in environment 1 and 2 in terms of average time. One possible explanation is that the Lifelong Navigation model has good backward transfer ability after gathering more diverse experience. Utilizing extra training data and models specifically trained for each environment, the Individual Models approach is more stable (lowest standard deviation). (Anonymized video of representative trials of the four methods: https://tinyurl.com/corl20lifelongnav.)
In this paper, we propose and implement the first self-supervised lifelong learning framework for mobile robot navigation. Building upon an initial static sampling-based model predictive control policy, which does not improve with increasing navigation experience, the robot is able to self-identify sub-optimal actions, search for similar scenarios where good actions are performed, learn from those data, and improve navigation in a continual manner. Furthermore, in a multi-environment setting, the lifelong navigation framework is able to adapt to new environments, while not forgetting previous ones. Extensive simulated trials are performed to test the lifelong navigation’s in-environment and cross-environment learning capability. The entire lifelong navigation framework is also implemented and tested on limited computational resources onboard a physical robot and operates in real time without requiring any off board computation. One interesting future direction is to extend the current simple heuristic-based discriminator to a more general formulation, e.g. in terms of a value function. Then the discriminator will not depend on heuristics specific to the initial policy (DWA in our case). Another interesting direction is to investigate better methods to prioritize experience to update the online buffers with limited budget.