The general-purpose robot assistant of the future assists humans with daily tasks to reduce labor overhead, e.g. as a housekeeping or indoor service robot. An essential part of the human-robot interaction involves language guided navigation, i.e. enabling the robot to execute instructions given by a human to reach a target location. This requires the robot to interpret the instructions expressed in natural language, to ground them in (usually visual) observations made by the robot and to move accordingly.
In contrast, robot navigation in general has been an active area of research for decades, with well established methods for path planning based on occupancy maps, that have proven their value in a myriad of real world applications. The question then naturally arises how this can be leveraged in the context of language guided navigation.
Recent works on language guided navigation often include a pre-exploration phase (e.g. [reinforced]), during which the robot explores a new environment. This phase is introduced to tackle the domain gap and finetune the visual encoder in a self-supervised setting, and has been shown to have a significant impact on performance. Such a pre-exploration phase seems reasonable for most practical applications. Here we argue that, when pre-exploration is indeed possible, the robot should take full advantage of it. Instead of only finetuning its encoders, it can extract a full map of its environment. Having a map of its environment at its disposal enables the robot to plan globally and to deploy traditional path planning algorithms rather than relying purely on learning-based methods.
In most application scenarios, the layout of the environment is mostly stationary, and we usually need to carry out multiple navigation tasks in the same scene. For example, we may give orders to a housekeeping robot that always works in the same indoor space, such as a home, or a hotel service robot may need to deliver items to customers inside the building. In such scenarios, we are able to prescan the environments, obtain detailed semantic navigation maps, and rely on those for the actual navigation, in combination with obstacle avoidance to cope with dynamic objects. These maps may include not only the geometry of the environment, as in a traditional occupancy map, but also its semantics, such as the location of the objects and the classification of the room types, which are useful to ground the language instructions.
At runtime, instead of taking RGBD sensor information as input, as in traditional visual language navigation (VLN) approaches, we use the current robot position in the semantic navigation map as the main observation of our navigator.
We see two main benefits in using the semantic navigation map. First, the map provides global information about the environment. This prevents the agent from being stuck in local decisions. Second, we can easily apply an explicit path planning strategy to this modality. Some previous works [levit2007interpretation, vogel2010learning] also introduced maps for language guided navigation, but they simply treated the maps as additional image inputs, not leveraging their full potential.
In this paper, we propose a novel Map-Language Navigator (MLN), which carries out navigation tasks based on the semantic layouts of environments and natural language instructions. MLN uses a modular design that consists of three main components. In the first module, we adopt deterministic algorithms to analyze and discretize the semantic navigation map of a target environment so as to propose path candidates accordingly. The proposed path candidates greatly reduce the dimension of the solution space, making our method more efficient. Second, we introduce a semantic feature extraction module to encode the environments along a path into a feature sequence in order to reflect what the robot has observed and how the robot moves along the path. MLN perceives the environments in an egocentric perspective for a set of key points along a path candidate. A novel feature extraction scheme is introduced to obtain permutation invariant features at the object and room level. We also take the low-level features of the path, e.g., moving directions and locations, into account. Third, we leverage attention mechanisms to fuse path representations and instructions and to score the paths. The final prediction is determined according to these scores.
We evaluate our method and compare it with other VLN methods. Comprehensive experiments show that the proposed method can leverage the maps well and can produce effective results. Our method outperforms the state of the art in most aspects, especially in cases that have long navigation distances. Compared with other methods, our approach also has higher training efficiency.
II Related Work
In this section, we briefly review the related prior work on vision language navigation and map navigation.
Vision-and-Language Navigation. In the VLN [anderson2018vision] task, an agent needs to navigate to a goal location in a photo-realistic virtual environment following a given natural language instruction. In this process, the agent takes an egocentric RGB-D image as observation and, together with the language instruction, has to decide its action at each time step. Research on the VLN task has made significant progress in the past few years. Attention mechanisms across different modalities are widely used to learn an alignment between vision and language and thereby boost performance on this task [vlnce, reinforced, huang2019multi, landi2021multimodal, ma2019self, ma2019regretful, landi2019embodied, hong2020sub]. Beyond new modelling architectures, improvements also come from new learning approaches. For instance, [anderson2018vision] first applies imitation learning (IL) in VLN to force the agent to mimic the expert’s behaviour during navigation; [wang2018look] combines model-free and model-based reinforcement learning (RL), which allows the agent to do both mapping and future planning at the same time; [reinforced] further uses an ensemble of IL and RL to learn better cross-modal grounding and generalizability; [krantz2021waypoint] applies the DDPPO [ddppo] algorithm to directly predict a waypoint that the agent needs to navigate to.
Unlike the above learning-based approaches, our proposed planning algorithm has access to the global semantic map of the environment and directly plans the full path to be followed given the natural language instruction. This makes it possible to consider multiple paths, which is vital in real-world applications.
Map Language Navigation. Navigation based on a map has been well studied in robotics [meyer2003map], while using language as guidance is relatively less explored. [levit2007interpretation] assume the navigation instruction is decomposed into navigational information units (NIU) and combine the representation of these units to form a path through the map. [vogel2010learning] relax the requirement of access to NIU, and apply reinforcement learning to learn the whole pair of text and path in the map. Both of these works focus on the HCRC Map Task Corpus [HCRC], which only provides simple generated comic-like maps. In contrast, in our dataset we construct maps based on photorealistic environments, allowing the trained model to be applied directly to real-world data.
Maps in Visual Navigation. It has been shown earlier that maps can provide semantic cues of the environment for navigation [chen2019learning]. Many works in visual navigation have included map information in their models [gupta2017unifying, seymour2021maast, chaplot2020learning, chen2019learning]. However, most of these works concentrate on building the map. They simply use a convolutional neural network or an attention mechanism to extract the relevant information from the map and then fuse it with information from other modalities, such as a depth map or RGB image. We argue that such methods might not be able to fully utilize the useful information in the map. Instead, we propose a planning algorithm which our experiments show to be a better and more efficient way to use the map in a navigation task.
Given a start point in an environment, our novel Map-Language Navigator (MLN) aims to provide a route to a target location according to a natural language instruction and the semantic map of the environment. This problem is formulated as:
where is the proposed navigator.
To this end, we divide the proposed pipeline into three parts, as demonstrated in Figure 2. Specifically, in the first part (§ IV-A), we propose candidate paths according to the environment. Assuming that paths guided by the instructions do not take detours, we apply a deterministic shortest path algorithm to find a set of candidates, from which the final answer is selected. In the second part (§ IV-B), MLN extracts feature representations for each candidate path. We discretize each path and hence embed local observations to a feature sequence. Unlike traditional methods which encode RGB inputs, MLN extracts features from egocentric views of local 2D semantic maps along the path. Low-level features, such as moving directions and locations, are also included in the feature embedding. Based on these path feature representations, a language driven discriminator evaluates each candidate path and scores it in the last part (§ IV-C) of our model. We select the path that best matches the instruction as the final answer according to the score of the path.
Semantic Map Generation. We reuse the VLN-CE [vlnce] dataset to generate semantic maps. The VLN-CE dataset ports the Room-to-Room (R2R) dataset [anderson2018vision] to continuous environments. VLN-CE [vlnce] is derived from the Matterport [Matterport3D] dataset, which contains annotated point clouds that indicate the room type and the category of objects in the environment. There are forty object types and thirty room types in the dataset. Using these annotated data, we render top-view layouts of each scene and obtain its and , which represent the semantic maps of objects and rooms, respectively. Each kind of map uses a separate channel for each object category or room type. These semantic maps are noisy, due to the sparsity of the point clouds. Hence, we apply a median filter with a kernel size of to denoise the maps. We also build an obstacle map that indicates the navigable space for robots in each scene. Figure 2 (a) shows an example of . takes the size of the robot, the walls and all obstacles in a scene into account.
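The denoising step can be illustrated with a minimal NumPy sketch. It assumes a channel-first array of shape (C, H, W) with one channel per object category or room type; the function name `median_denoise` and the default `kernel=3` are placeholders chosen for illustration, not values taken from the paper.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def median_denoise(semantic_map, kernel=3):
    """Apply a kernel x kernel median filter to every channel of a (C, H, W) map.

    `kernel` must be odd; 3 is only a placeholder value for this sketch.
    """
    pad = kernel // 2
    # Replicate border pixels so the output keeps the input resolution.
    padded = np.pad(semantic_map, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    # Shape (C, H, W, kernel, kernel): one window per output pixel.
    windows = sliding_window_view(padded, (kernel, kernel), axis=(1, 2))
    return np.median(windows, axis=(-2, -1)).astype(semantic_map.dtype)
```

On binary per-class channels, this removes isolated pixels while leaving solid regions intact.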
Note that this map generation method can only deal with single-floor scenes. Thus we create a split of the original dataset, which contains scenes that meet the condition to train and evaluate our method.
Exploiting a semantic map in a navigation task is nontrivial. Naively integrating maps into existing approaches does not result in a substantial gain in performance and even leads to failures in some cases, as demonstrated in our experiments (§ V-C). Therefore, we propose a novel way of map-language navigation as shown in the following sections.
IV-A Deterministic Path Candidate Proposal
We first sample a candidate point set from a navigable area. We regard these points as possible end points of the target path according to the instruction . Specifically, end points are sampled from a grid with a spacing of meters. We discard end points outside the navigable space on the maps. Given a start point , we run Dijkstra’s algorithm [dijkstra1959note] on a graph where each node is a pixel on the obstacle map . There exists an edge between two adjacent pixel nodes if and only if they are reachable from each other in any of the eight directions. The algorithm provides shortest paths between and each point . These paths are the candidates for the final results.
The distance between adjacent points controls the number of sample points. Too many points will cause redundant calculations, while too few points will reduce accuracy. We found that has a good balance between computational complexity and performance.
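The candidate proposal step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the obstacle map is a boolean array `free` (True for navigable pixels), expresses the grid spacing in pixels through the hypothetical parameter `spacing_px`, and does not forbid diagonal moves that cut obstacle corners.

```python
import heapq
import math

import numpy as np

# 8-connected neighbourhood: (row offset, column offset, step cost in pixels).
NEIGHBOURS = [(dr, dc, math.hypot(dr, dc))
              for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]


def dijkstra_grid(free, start):
    """Return shortest distances and predecessors from `start` over free pixels."""
    h, w = free.shape
    dist = np.full((h, w), np.inf)
    prev = {}
    dist[start] = 0.0
    heap = [(0.0, start)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if d > dist[r, c]:
            continue  # stale heap entry, a shorter route was already found
        for dr, dc, cost in NEIGHBOURS:
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and free[nr, nc]:
                nd = d + cost
                if nd < dist[nr, nc]:
                    dist[nr, nc] = nd
                    prev[(nr, nc)] = (r, c)
                    heapq.heappush(heap, (nd, (nr, nc)))
    return dist, prev


def backtrack(prev, start, goal):
    """Follow predecessor links from `goal` back to `start`."""
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


def propose_candidates(free, start, spacing_px):
    """Map every reachable, navigable grid point to its shortest path from `start`."""
    dist, prev = dijkstra_grid(free, start)
    h, w = free.shape
    candidates = {}
    for r in range(0, h, spacing_px):
        for c in range(0, w, spacing_px):
            if free[r, c] and np.isfinite(dist[r, c]) and (r, c) != start:
                candidates[(r, c)] = backtrack(prev, start, (r, c))
    return candidates
```

A single Dijkstra run from the start point yields the shortest paths to all end points at once, so the cost of the proposal step grows with the map size rather than with the number of sampled end points.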
IV-B Path Feature Encoding
Next, we propose a path feature embedding scheme that is applied to each path candidate in the semantic map . Ideally, the path feature represents the environment context along the path and will be aligned with the language representation to verify whether this path matches a target instruction. MLN discretizes each path and encodes local context into each keypoint along the path. The features of all keypoints thus form a feature sequence in temporal order.
Specifically, for each keypoint’s feature encoding, we introduce a room compass and an object compass to perceive the local environment as illustrated in Figure 3. Let () denote a room compass, where is the number of discretized angles. We divide the 360-degree circle into 12 sectors. Each of these elements in corresponds to one of the sectors. A chunk records the type of the closest room in that direction whose room type differs from that of the keypoint. We align the current direction of the agent with the first chunk and arrange these chunks in clockwise order to obtain . We deploy a single-layer fully-connected network with an activation function to project the room compass of a keypoint into a feature vector:
The object compass contains objects near the agent within meters. Each object is represented in a relative coordinate system by a normalized coordinate of inside a local region centered at the agent. The origin of the relative coordinate system is aligned with the agent, and its X-axis is aligned with the agent’s current direction. The object compass can be treated as a 2D point cloud in a 2D egocentric view. For each point in the cloud, we embed the object’s class with a 50-dimensional GloVe word embedding [pennington2014glove] as an additional point feature . Then a PointNet [qi2017pointnet] network is applied to encode the object compass into a permutation invariant feature representation, so that the order of the objects in does not matter. This process is formulated as:
where is the feature of local object context in a keypoint. Since some object classes, such as “ceiling” and “other”, cannot provide much information in the navigation task, we exclude them from the object compass.
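The two compasses can be sketched in NumPy under explicit assumptions: positions are metric (x, y) coordinates with y pointing up, `heading` is in radians measured counter-clockwise from the +x axis, sector 0 is centred on the heading, and rooms are given as labelled sample cells. All function names are hypothetical, and `pooled_feature` is only a stand-in for PointNet (one shared linear layer with ReLU followed by max-pooling) that shows why the result does not depend on object order.

```python
import numpy as np


def to_egocentric(points_xy, agent_xy, heading):
    """Express world points in a frame centred on the agent, X along its heading."""
    d = np.asarray(points_xy, dtype=float) - np.asarray(agent_xy, dtype=float)
    c, s = np.cos(heading), np.sin(heading)
    return np.stack([c * d[:, 0] + s * d[:, 1], -s * d[:, 0] + c * d[:, 1]], axis=1)


def room_compass(cell_xy, cell_room, agent_xy, heading, agent_room, n_sectors=12):
    """Per sector, label of the nearest cell in a different room (-1 if none)."""
    ego = to_egocentric(cell_xy, agent_xy, heading)
    dist = np.hypot(ego[:, 0], ego[:, 1])
    # Clockwise angle from the heading, in [0, 2*pi).
    clockwise = (-np.arctan2(ego[:, 1], ego[:, 0])) % (2 * np.pi)
    width = 2 * np.pi / n_sectors
    # Assumption: sector 0 is centred on the heading direction.
    sector = (((clockwise + width / 2) % (2 * np.pi)) // width).astype(int) % n_sectors
    compass = np.full(n_sectors, -1, dtype=int)
    best = np.full(n_sectors, np.inf)
    for k, d, room in zip(sector, dist, cell_room):
        if room != agent_room and d < best[k]:
            best[k], compass[k] = d, room
    return compass


def object_compass(obj_xy, obj_feat, agent_xy, heading, radius):
    """Rows of [normalized egocentric x, y, class feature] for objects within radius."""
    ego = to_egocentric(obj_xy, agent_xy, heading)
    keep = np.hypot(ego[:, 0], ego[:, 1]) <= radius
    return np.concatenate([ego[keep] / radius, np.asarray(obj_feat)[keep]], axis=1)


def pooled_feature(points, weight):
    """Shared linear layer + ReLU per point, then max-pool over all points."""
    return np.maximum(points @ weight, 0.0).max(axis=0)
```

Because the maximum is taken over the point axis, shuffling the rows of the object compass leaves the pooled feature unchanged.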
We also incorporate low-level information, e.g. agent pose, into the feature sequence. An agent pose feature consists of location and agent direction at a keypoint:
where is the concatenation operator, is the positional encoding function [mildenhall2020nerf], and is a single-layer fully-connected network with ReLU activation.
By gathering features of all keypoints, we can obtain the path feature , where is the length of the discretized path.
For the path discretization, we sample keypoints along the path at equal distances. Each keypoint maintains a distance of 2 meters from adjacent ones. A step size of 2 meters proved effective in our experiments: it is short enough not to miss turning points and long enough to avoid redundant computation.
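Equidistant keypoint sampling can be sketched as below. The sketch assumes the path is a polyline of metric (x, y) vertices without repeated points and that the 2-meter spacing is measured along the path (arc length). The name `sample_keypoints` is hypothetical, and the final vertex is only returned when the path length is a multiple of the step.

```python
import numpy as np


def sample_keypoints(path_xy, step=2.0):
    """Return points spaced `step` meters apart along a polyline, from its start."""
    path = np.asarray(path_xy, dtype=float)
    seg = np.hypot(*np.diff(path, axis=0).T)           # length of each segment
    arc = np.concatenate([[0.0], np.cumsum(seg)])      # arc length at each vertex
    targets = np.arange(0.0, arc[-1] + 1e-9, step)     # 0, step, 2*step, ...
    x = np.interp(targets, arc, path[:, 0])
    y = np.interp(targets, arc, path[:, 1])
    return np.stack([x, y], axis=1)
```

For example, an L-shaped path of 4 m followed by 3 m yields keypoints at arc lengths 0, 2, 4 and 6 m, so the corner at 4 m is preserved.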
IV-C Language Driven Discriminator
We deploy a transformer-based model [vaswani2017attention] as the Language Driven Discriminator for scoring the paths. The natural language instruction is first tokenized and embedded by GloVe [pennington2014glove] embeddings . We project the embeddings of words to feature vectors using a single-layer FC with ReLU activation.
where is the language representation sequence.
has encoder layers as shown in Figure 4. It takes a concatenated sequence of and as inputs and predicts a score for the corresponding path. A learnable special token embedding is put at the front of the input sequence to aggregate context information. The Language Driven Discriminator is formulated as
where is the prediction score projected from the token embedding. We pick the path that has the highest score as the predicted path during inference.
During training, we set the ground truth score of each candidate path with a combination of two metrics, which are normalized Dynamic-Time Warping (nDTW) [ndtw] and the Euclidean distance. Specifically, nDTW, as a path shape similarity metric, indicates how well the candidate path and the ground truth path match. The Euclidean distance is calculated between the end point candidate and the ground truth end point . The final GT score is the linear combination of them, formulated as:
where is the Euclidean distance between two end points, and is a coefficient to balance these two components. We apply a Masked Language Model (MLM) [devlin2018bert] to the input embedding of the natural language navigation instruction to enhance the model’s context-awareness during training. The total training loss is formulated as:
where is the MLM loss described in [devlin2018bert], and is the mean square error between and .
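The path-shape term of the ground truth score can be illustrated with a short nDTW sketch. It uses the standard definition exp(-DTW(R, Q) / (|R| · d_th)) with R the reference path; the threshold `d_th=3.0` meters is an assumed default for this sketch, and the paper's linear combination with the end-point distance is deliberately not reproduced here, since its exact form and coefficient are not given in the text above.

```python
import numpy as np


def dtw(pred, ref):
    """Dynamic time warping cost between two point sequences (Euclidean ground cost)."""
    pred, ref = np.asarray(pred, dtype=float), np.asarray(ref, dtype=float)
    n, m = len(pred), len(ref)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(pred[i - 1] - ref[j - 1])
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m]


def ndtw(pred, ref, d_th=3.0):
    """Normalized DTW in (0, 1]; 1 means the candidate matches the reference exactly."""
    return float(np.exp(-dtw(pred, ref) / (len(ref) * d_th)))
```

A candidate identical to the reference scores 1, and the score decays smoothly as the two paths diverge, which makes it usable as a regression target.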
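|Input||Model||TL (seen)||NE (seen)||nDTW (seen)||OS (seen)||SR (seen)||SPL (seen)||TL (unseen)||NE (unseen)||nDTW (unseen)||OS (unseen)||SR (unseen)||SPL (unseen)|
|RGBD + Map||Seq2Seq-DA-mln*||9.11||8.31||0.50||0.39||28.8||0.28||6.67||7.81||0.48||0.23||17.6||0.17|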
* marks models trained only on the MLN split (single-floor, large environments); the others are validation results of official checkpoints on the MLN validation split.
V-A Implementation Details
Our method contains two modules, the path planner and the path scorer. The path planner is a CPU-bound module that extracts candidate paths and context compasses (room and object) in parallel. The results are batched and sent to a 12-head 6-layer transformer model for scoring. Path features are extracted by additional projection layers and PointNet as described above. These feature extractors and the transformer are combined and jointly optimized by the AdamW [loshchilov2018adamw] optimizer with a learning rate of 1e-4. The model is trained on a single NVIDIA P100 GPU for 10 hours.
V-B Evaluation Metrics
We evaluate the predicted paths in the Habitat simulator [savva2019habitat] using the standard evaluation metrics employed in vision-language navigation (VLN) tasks [anderson2018evaluation, anderson2018vision, magalhaes2019effective]: trajectory length (TL), navigation error - average distance to goal in meters (NE), normalized dynamic-time warping (nDTW), oracle success (OS), success rate (SR) and success weighted by the normalized inverse of the path length (SPL). We choose nDTW and SR as our primary metrics when discussing the results. These two metrics cover two important aspects of the navigation task: (a) shape similarity of the predicted paths to the ground truth trajectories and (b) accuracy of stopping at the correct region.
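The goal-based metrics above can be computed as in the following sketch. The episode dictionary keys and the 3-meter `success_radius` are assumptions made for illustration; SPL follows the usual definition, the mean over episodes of S · l / max(p, l), where l is the shortest-path length and p the length of the path actually taken.

```python
import numpy as np


def navigation_metrics(episodes, success_radius=3.0):
    """Aggregate TL, NE, OS, SR and SPL over episodes (all distances in meters).

    Each episode is a dict with the hypothetical keys `final_dist` (distance to
    goal at stop), `min_dist` (closest distance to goal along the path),
    `path_len` (length travelled) and `shortest_len` (shortest start-goal path).
    """
    final = np.array([e["final_dist"] for e in episodes], dtype=float)
    closest = np.array([e["min_dist"] for e in episodes], dtype=float)
    taken = np.array([e["path_len"] for e in episodes], dtype=float)
    shortest = np.array([e["shortest_len"] for e in episodes], dtype=float)
    success = (final <= success_radius).astype(float)
    return {
        "TL": float(taken.mean()),
        "NE": float(final.mean()),
        "OS": float((closest <= success_radius).mean()),
        "SR": float(success.mean()),
        "SPL": float((success * shortest / np.maximum(taken, shortest)).mean()),
    }
```

Oracle success counts an episode as solved if the agent passed within the radius at any point, whereas SR requires it to stop there, which is why OS is never lower than SR.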
V-C Main Results
Besides models fully trained under MLN settings, i.e. on a subset of the data consisting of single floor environments, we also include models trained in the traditional VLN settings but evaluated on the same split as our MLN. These experiments aim to reveal the performance of models with different technical routes under the general language-guided navigation task.
We compare our proposed planning model with two types of released pretrained methods: (a) Imitation learning: the sequence-to-sequence model (Seq2Seq-DA-full) [vlnce] and the cross attention model (CMA-DA-PM-Aug-full) [vlnce], both of which leverage the powerful DAgger [dagger] strategy; and (b) Reinforcement learning: the waypoint prediction model (Waypoint-full) [krantz2021waypoint], the state-of-the-art method, which shifts the training objective from action prediction to local waypoint selection and is trained using the resource-consuming DDPPO [ddppo]. For a fair comparison, we also include lightweight models trained only on the MLN split, named Seq2Seq-DA-mln and CMA-DA-mln. Egocentric map embeddings of our modified models are extracted by a convolutional neural network (CNN) following conventional solutions [chaplot2020object]. Since including semantic maps together with an RGBD sensor in a CMA model has too many possible design variations, and the conventional CMA’s performance does not surpass Seq2Seq by a significant margin, we omit this case from the RGBD+MAP experiment in Table I.
As illustrated in Table I, our proposed method outperforms others by a large margin on the val_seen split. For the two primary evaluation metrics nDTW and Success Rate (SR), our proposed method surpasses the RGBD based SOTA model by and , respectively. Compared to imitation learning that is strictly trained with the MLN setting, the proposed planning based method improves the success rate by on val_seen split and on val_unseen. As we can observe from line 4 in Table I, without a well-pretrained feature extractor, it is very difficult to learn meaningful information by fully relying on imitation learning, evidenced by the limited improvement after including the semantic maps.
We find that our model is less successful on unseen data than on seen data when evaluated by success rate. There might be two reasons for this. First, MLN is trained only on a subset of VLN-CE. The lack of training data may hurt its generalization when finding accurate end points in unseen environments, although we observe that the routes largely head in the right direction. This is supported by the other primary metric, nDTW, on which MLN performs comparably to SOTA methods. Second, our method is trained from scratch without using pretrained networks. As we can observe from Table I, models that use an RGBD sensor with a feature extractor pretrained on large scale vision data show good visual adaptation in unseen environments. In the MLN setting, because the map has accurate semantics and is sensitive to blurring and rotation operations, large scale pretraining on maps is challenging and requires further study.
V-D Ablation Study
To verify the effectiveness of each component of our proposed algorithm, we conduct an ablation study evaluated on the MLN dataset and present the results in Table II.
We first investigate if the natural language instruction is really needed in this map language navigation task. In this experiment, we zero out language tokens from all instructions except for the [CLS] token while training. The model is evaluated under the same setting. As is shown in the first and second row of Table II, removing the natural language instruction causes a significant drop in almost all evaluation metrics for both seen and unseen environments, which indicates the importance of language instructions as a guidance signal in this task.
Agent pose information describes the agent’s state at a given point in the environment. This information is crucial when dealing with a large environment, as we can see from the third line of Table II: on the val_seen environments, models trained without agent pose information suffer a drop of in success rate (SR), while for the smaller unseen environments, the performance remains stable.
To encode the environment context information, we have proposed two compasses: a room compass and an object compass. The results in the last three rows of Table II show the importance of each compass. Removing either or both compasses from the model results in a significant decrease in almost all evaluation metrics.
|Model||nDTW (val_seen)||SR (val_seen)||nDTW (val_unseen)||SR (val_unseen)|
|- agent pose||0.53||31.0||0.58||24.8|
|- room compass||0.55||35.8||0.57||21.2|
|- object compass||0.52||34.6||0.54||19.0|
|- full compass||0.52||27.6||0.52||20.0|
V-E Long-Distance Path Planning
In a language-guided navigation task, error accumulation is an inherent issue of sequence prediction models [ma2019regretful]. To mitigate this issue, the SOTA method WPN [krantz2021waypoint] shifts the step-wise action prediction to a local path planning problem. This modification reduces the required sequence length and smooths the prediction action space. Still, WPN [krantz2021waypoint] lacks global path planning. Our solution shifts the sequence prediction problem to a sequence scoring problem, which removes the influence of error accumulation by design. In this experiment, we compare the success rate (SR) of WPN with that of our method. We use the officially released WPN pretrained on the full VLN-CE dataset. We group the predicted results by the ground truth geodesic distance from start point to goal. Each group spans a range of 2 meters. As shown in Fig. 5, on the seen split the action sequence prediction method WPN fails to predict long-distance paths, especially for data with a distance longer than 14 meters, whereas our approach excels at long-distance planning. In the unseen environment (Fig. 5), although our model is trained with much less data, it still achieves comparable performance for long-distance cases.
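The grouping used in this analysis can be sketched in a few lines. The function name `success_by_distance` and its inputs (per-episode geodesic start-to-goal distance and a 0/1 success flag) are assumptions for illustration; the 2-meter bin width follows the text.

```python
import numpy as np


def success_by_distance(geodesic_dist, success, bin_width=2.0):
    """Success rate per distance bin, keyed by the bin's (lower, upper) bounds in meters."""
    dist = np.asarray(geodesic_dist, dtype=float)
    succ = np.asarray(success, dtype=float)
    bins = np.floor(dist / bin_width).astype(int)
    result = {}
    for b in np.unique(bins):
        lower = float(b) * bin_width
        result[(lower, lower + bin_width)] = float(succ[bins == b].mean())
    return result
```

Only bins that contain at least one episode appear in the result, so sparsely populated long-distance bins should be read together with their episode counts.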
V-F Training Efficiency
The previous sections show that our method achieves comparable or even superior performance on the task. In addition, it trains efficiently.
Due to the nature of reinforcement learning, previous RL based methods [krantz2021waypoint] require massive computational resources: convergence takes 5 days with tens of GPUs [krantz2021waypoint]. As for imitation learning, training from scratch without pretrained models not only causes a significant drop in performance, but also requires data collection and storage that take a lot of time in practice. These usually increase the training time by a factor of 5 or more. Sequence generation is naturally more time-consuming than a simple regression problem. We modularize the sequence generation problem and recast it as a regression problem. Our overall training time is less than 9 hours on a single 16GB GPU.
VI Limitation and Future Work
So far, we can only deal with stationary environments. Dynamic objects need to be dealt with separately to avoid collision with them. Some obstacle avoidance mechanism may need to be introduced down the road. In our setting, we assume the instruction corresponds to a shortest path between two points. If an instruction requires a detour, the path candidates may not be the best fit.
To further improve our method, we consider the following directions for future work: (a) Large-scale pretraining has been highly successful in the computer vision [dosovitskiy2020image] and natural language processing [devlin2018bert] communities, showing powerful generalization. Developing pretraining strategies for semantic navigation maps might increase the adaptation ability of the map-language navigator in more diverse environments. (b) Our current auto-generated dataset only contains complete single-floor environments. We propose to design and construct a larger and semantically meaningful dataset covering multi-floor environments. (c) Designing path pruning strategies conditioned on natural language instructions is a promising direction. It has the potential to be combined with neural-symbolic models [mao2019neuro].
In addition, our algorithm is able to provide multiple candidate paths based on a ranking score, which is a good feature for human-robot interaction. When the agent feels confused about vague instructions, our method provides opportunities to include additional constraints from a human. In other words, it would be interesting to study how the agent could ask the human for further explanation and clarification. Such an approach would benefit research topics such as embodied question answering and embodied dialog.
We propose a novel approach to the map-language guided navigation task. The strategy of dividing the navigation problem into three subproblems provides a new and successful perspective for solving the problem. The deterministic path proposal scheme leverages the global information of the maps well and reformulates the original task into a new one with a reduced solution space, which largely avoids accumulating errors and falling into local solutions. Our path feature embedding enables the navigator to perceive the environments from semantic navigation maps, and it makes better use of the map information than methods that naively use convolutional layers for feature extraction. With the designed score metric, the Language Driven Discriminator can learn the alignment between modalities from the supervision. The discriminator is able to distinguish paths that match natural language instructions.
Extensive experiments demonstrate the superiority of our method, especially in long-distance navigation and training efficiency. Although there is room for improvement, this propose-and-discriminate scheme is a promising direction.
This project is supported by KULeuven C1 project Macchina.