PyTorch code for the ACL 2020 paper: "BabyWalk: Going Farther in Vision-and-Language Navigation by Taking Baby Steps"
Learning to follow instructions is of fundamental importance to autonomous agents for vision-and-language navigation (VLN). In this paper, we study how an agent can navigate long paths when learning from a corpus that consists of shorter ones. We show that existing state-of-the-art agents do not generalize well. To this end, we propose BabyWalk, a new VLN agent that learns to navigate by decomposing long instructions into shorter ones (BabySteps) and completing them sequentially. A specially designed memory buffer is used by the agent to turn its past experiences into contexts for future steps. The learning process is composed of two phases. In the first phase, the agent uses imitation learning from demonstration to accomplish BabySteps. In the second phase, the agent uses curriculum-based reinforcement learning to maximize rewards on navigation tasks with increasingly longer instructions. We create two new benchmark datasets (of long navigation tasks) and use them in conjunction with existing ones to examine BabyWalk's generalization ability. Empirical results show that BabyWalk achieves state-of-the-art results on several metrics and, in particular, is able to follow long instructions better. The code and the datasets are released on our project page https://github.com/Sha-Lab/babywalk.
Autonomous agents such as household robots need to interact with the physical world in multiple modalities. As an example, in vision-and-language navigation (VLN) (anderson2018vision), the agent moves around in a photo-realistic simulated environment (Matterport3D) by following a sequence of natural language instructions. To infer its whereabouts so as to decide its moves, the agent infuses its visual perception, its trajectory and the instructions (fried2018speaker; anderson2018vision; wang2019reinforced; ma2019self; ma2019regretful).
Arguably, the ability to understand and follow the instructions is one of the most crucial skills for VLN agents to acquire. jain2019stay shows that VLN agents trained on the originally proposed dataset Room2Room (r2r hereafter) do not follow the instructions, despite having achieved high success rates of reaching the navigation goals. They proposed two remedies: a new dataset Room4Room (or r4r) that doubles the path lengths in r2r, and a new evaluation metric, Coverage weighted by Length Score (CLS), that measures more closely whether the ground-truth paths are followed. They showed that optimizing the fidelity of following instructions leads to agents with desirable behavior. Moreover, the long paths in r4r are informative in identifying agents that score higher on such fidelity measures.
In this paper, we investigate another crucial aspect of following the instructions: can a VLN agent generalize to following longer instructions by learning from shorter ones? This aspect has important implications for real-world applications, as collecting annotated long sequences of instructions and training on them can be costly. Thus, it is highly desirable to have this generalization ability. After all, it seems that humans can achieve this effortlessly. (Anecdotally, we do not have to learn from long navigation experiences. Instead, we extrapolate from our experiences of learning to navigate in shorter distances or smaller spaces, perhaps a skill we learn when we were babies or kids.)
To this end, we have created several datasets of longer navigation tasks, inspired by r4r (jain2019stay). We train VLN agents on r4r and use the agents to navigate in Room6Room (i.e., r6r) and Room8Room (i.e., r8r). We contrast their performance with that of agents trained on those datasets directly (“in-domain”). The results are shown in Fig. 1.
Our findings are that the agents trained on r4r (denoted by the purple and the pink solid lines) perform significantly worse than the in-domain agents (denoted by the light blue dashed line). Interestingly, when such out-of-domain agents are applied to the dataset r2r with shorter navigation tasks, they also perform significantly worse than the corresponding in-domain agent, despite r4r containing many navigation paths from r2r. Note that the agent trained to optimize the aforementioned fidelity measure (rcm(fidelity)) performs better than the agent trained to reach the goal only (rcm(goal)), supporting the claim by jain2019stay that following instructions is a more meaningful objective than merely goal-reaching. Yet, the fidelity measure itself is not enough to enable the agent to transfer well to longer navigation tasks.
To address these deficiencies, we propose a new approach for VLN. The agent follows a long navigation instruction by decomposing the instruction into shorter ones (“micro-instructions”, i.e., Baby-Steps), each of which corresponds to an intermediate goal/task to be executed sequentially. To this end, the approach has three components. (a) A memory buffer summarizes the agent’s experiences so that the agent can use them to provide the context for executing the next Baby-Step. (b) The agent first learns from human experts in “bite-size” pieces. Instead of trying to imitate the ground-truth paths as a whole, the agent is given pairs of a Baby-Step and the corresponding human expert path so that it can learn policies of actions from shorter instructions. (c) In the second stage of learning, the agent refines the policies by curriculum-based reinforcement learning, where the agent is given increasingly longer navigation tasks to achieve. In particular, this curriculum design reflects our desideratum that an agent optimized on shorter tasks should generalize well to slightly longer tasks and then to much longer ones.
While we do not claim that our approach faithfully simulates human learning of navigation, the design is loosely inspired by it. We name our approach BabyWalk and refer to the intermediate navigation goals in (b) as Baby-Steps. Fig. 1 shows that BabyWalk (the red solid line) significantly outperforms other approaches and despite being out-of-domain, it even exceeds the performance of in-domain agents on r6r and r8r.
The effectiveness of BabyWalk also leads to an interesting twist. As mentioned before, one of the most important observations by jain2019stay is that the original VLN dataset r2r fails to reveal the difference between optimizing goal-reaching (thus ignoring the instructions) and optimizing the fidelity (thus adhering to the instructions). Yet, leaving details to section 5, we have also shown that applying BabyWalk to r2r can lead to equally strong performance on generalizing from shorter instructions (i.e., r2r) to longer ones.
In summary, in this paper, we have demonstrated empirically that the current VLN agents are ineffective in generalizing from learning on shorter navigation tasks to longer ones. We propose a new approach in addressing this important problem. We validate the approach with extensive benchmarks, including ablation studies to identify the effectiveness of various components in our approach.
Recent works (anderson2018vision; thomason2019vision; jain2019stay; chen2019touchdown; nguyen2019help) extend the early works of instruction based navigation (chen2011learning; kim2013adapting; mei2016listen) to photo-realistic simulated environments. For instance, anderson2018vision proposed to learn a multi-modal Sequence-to-Sequence agent (Seq2Seq) by imitating expert demonstration. fried2018speaker developed a method that augments the paired instruction and demonstration data using a learned speaker model, to teach the navigation agent to better understand instructions. wang2019reinforced further applies reinforcement learning (RL) and self-imitation learning to improve navigation agents. ma2019self; ma2019regretful designed models that track the execution progress for a sequence of instructions using soft-attention.
Different from them, we focus on transferring an agent’s performance on shorter tasks to longer ones. This leads to designs and learning schemes that improve generalization across datasets. We use a memory buffer to prevent mistakes in the distant past from exerting strong influence on the present. In the imitation learning stage, we solve fine-grained subtasks (Baby-Steps) instead of asking the agent to learn the navigation trajectory as a whole. We then use curriculum-based reinforcement learning by asking the agent to follow increasingly longer instructions.
Since it was proposed in (bengio2009curriculum), curriculum learning has been successfully used in a range of tasks: training robots for goal reaching (florensa2017reverse), visual question answering (mao2019neuro), and image generation (karras2017progressive). To the best of our knowledge, this work is the first to apply the idea to learning in VLN.
In the VLN task, the agent receives a natural language instruction $X$ composed of a sequence of sentences. We model the agent with a Markov Decision Process (MDP), which is defined as a tuple of a state space $\mathcal{S}$, an action space $\mathcal{A}$, an initial state $s_1$, stationary transition dynamics $\rho: \mathcal{S} \times \mathcal{A} \to \mathcal{S}$, a reward function $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, and the discount factor for weighting future rewards. The agent acts according to a policy $\pi$. The state and action spaces are defined the same as in fried2018speaker (cf. § 4.4 for details).
For each $X$, the sequence of the pairs $(s, a)$ is called a trajectory $Y = \{s_1, a_1, \ldots, s_{|Y|}, a_{|Y|}\}$, where $|\cdot|$ denotes the length of the sequence or the size of a set. We use $\hat{a}$ to denote an action taken by the agent according to its policy. Hence, $\hat{Y}$ denotes the agent’s trajectory, while $Y$ (or $a$) denotes the human expert’s trajectory (or action). The agent is given training examples of $(X, Y)$ to optimize its policy to maximize its expected rewards.
In our work, we introduce the following additional notation. We segment a (long) instruction $X$ into multiple shorter sequences of sentences $\{x_m, m = 1, 2, \ldots, M\}$, to which we refer as Baby-Steps. Each $x_m$ is interpreted as a micro-instruction that corresponds to a trajectory $\hat{y}_m$ by the agent and is aligned with a part of the human expert’s trajectory, denoted as $y_m$. While the alignment is not available in existing datasets for VLN, we describe how to obtain it in a later section (§ 4.3). Throughout the paper, we use the phrases “following the $m$th micro-instruction”, “executing the Baby-Step $x_m$”, and “completing the $m$th subtask” interchangeably.
We use $t \in [1, |Y|]$ to denote the (discrete) time steps at which the agent takes actions. Additionally, when the agent follows $x_m$, for convenience, we sometimes use $t_m \in [1, |\hat{y}_m|]$ to index the time steps, instead of the “global time” $t = t_m + \sum_{i=1}^{m-1} |\hat{y}_i|$.
We describe in detail the 3 key elements in the design of our navigation agent: (i) a memory buffer for storing and recalling past experiences to provide contexts for the current navigation instruction (§ 4.1); (ii) an imitation-learning stage of navigating with short instructions to accomplish a single Baby-Step (§ 4.2.1); (iii) a curriculum-based reinforcement learning phase where the agent learns with increasingly longer instructions (i.e. multiple Baby-Steps) (§ 4.2.2). We describe new benchmarks created for learning and evaluation and key implementation details in § 4.3 and § 4.4 (with more details in the Suppl. Material).
The basic operating model of our navigation agent BabyWalk is to follow a “micro-instruction” $x_m$ (i.e., a short sequence of instructions, to which we also refer as a Baby-Step), conditioning on the context $\hat{z}_m$, and to output a trajectory $\hat{y}_m$. A schematic diagram is shown in Fig. 2. What particularly differs from previous approaches is the introduction of a novel memory module. We assume the Baby-Steps are given at training and inference time; § 4.3 explains how to obtain them if they are not given a priori (readers can move directly to that section and return to this part afterwards). The left of Fig. 3 gives an example of those micro-instructions.
The context is a summary of the past experiences of the agent, namely the previous micro-instructions and trajectories:
$$\hat{z}_m = g\big(f_{\text{summary}}(x_1, \ldots, x_{m-1}),\; f_{\text{summary}}(\hat{y}_1, \ldots, \hat{y}_{m-1})\big), \qquad (1)$$
where the function $g$ is implemented with a multi-layer perceptron. The summary function $f_{\text{summary}}$ is explained below.
To map variable-length sequences (such as the trajectory and the instructions) to a single vector, we can use various mechanisms such as an LSTM. We report an ablation study on this in § 5.3. In the following, we describe the “forgetting” one, which weighs the most recent experiences more heavily and performs the best empirically:
$$f_{\text{summary}}(x_1, \ldots, x_{m-1}) = \sum_{i=1}^{m-1} \alpha_i \cdot u(x_i), \qquad f_{\text{summary}}(\hat{y}_1, \ldots, \hat{y}_{m-1}) = \sum_{i=1}^{m-1} \alpha_i \cdot v(\hat{y}_i), \qquad (2)$$
where the weights are normalized to 1 and decrease with how far $i$ is from $m$,
$$\alpha_i \propto \exp\big(-\gamma \cdot \omega(m - 1 - i)\big). \qquad (3)$$
Here $\gamma$ is a hyper-parameter (we set it to $1/2$) and $\omega(\cdot)$ is a monotonically nondecreasing function, for which we simply choose the identity function.
Note that we summarize over representations of “micro-instructions” ($u(x_m)$) and experiences of executing those micro-instructions ($v(\hat{y}_m)$). The two encoders $u(\cdot)$ and $v(\cdot)$ are described in § 4.4. They are essentially summaries of “low-level” details, i.e., representations of a sequence of words, or of a sequence of states and actions. While existing work often directly summarizes all the low-level details, we have found that the current form of “hierarchical” summarizing (i.e., first summarizing each Baby-Step, then summarizing all previous Baby-Steps) performs better.
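The forgetting summary can be illustrated with a small sketch. It assumes that the weight of the i-th past Baby-Step is proportional to exp(-gamma * distance), where distance counts how many Baby-Steps separate it from the most recent one and the monotone function is the identity. The function names and the default `gamma=0.5` are illustrative choices, not taken from the released code, and embeddings are plain Python lists rather than encoder outputs.

```python
import math

def forgetting_weights(num_past, gamma=0.5):
    """Normalized weights over past Baby-Steps 1..num_past (assumed form).

    The most recent Baby-Step (i = num_past) has distance 0 and therefore
    the largest weight; older ones decay exponentially with distance.
    """
    raw = [math.exp(-gamma * (num_past - i)) for i in range(1, num_past + 1)]
    total = sum(raw)
    return [r / total for r in raw]

def summarize(embeddings, gamma=0.5):
    """Weighted sum of per-Baby-Step embedding vectors (lists of floats)."""
    if not embeddings:
        return []
    weights = forgetting_weights(len(embeddings), gamma)
    dim = len(embeddings[0])
    return [sum(w * e[d] for w, e in zip(weights, embeddings)) for d in range(dim)]
```

With `gamma=0` every weight is equal, so the summary reduces to a plain average of the memory buffer; larger values of `gamma` concentrate the summary on the latest Baby-Steps.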
The agent takes actions, conditioning on the context $\hat{z}_m$ and the current instruction $x_m$:
$$\hat{a}_t \sim \pi\big(\cdot \mid s_t, \hat{a}_{t-1};\; u(x_m), \hat{z}_m\big), \qquad (4)$$
where the policy is implemented with an LSTM with the same cross-modal attention between visual states and language as in (fried2018speaker).
The agent learns in two phases. In the first one, imitation learning is used where the agent learns to execute Baby-Steps accurately. In the second one, the agent learns to execute successively longer tasks from a designed curriculum.
Baby-Steps are shorter navigation tasks. With the $m$th instruction $x_m$, the agent is asked to follow the instruction so that its trajectory matches the human expert’s $y_m$. To assist the learning, the context $z_m$ is computed from the human expert trajectory up to the $m$th Baby-Step (i.e., in eq. (1), the $\hat{y}$s are replaced with $y$s). We maximize the objective
$$\ell = \sum_{m=1}^{M} \sum_{t_m=1}^{|y_m|} \log \pi\big(a_{t_m} \mid s_{t_m}, a_{t_m - 1};\; u(x_m), z_m\big).$$
We emphasize here that each Baby-Step is treated independently of the others in this learning regime. Each time a Baby-Step is to be executed, we “preset” the agent in the human expert’s context and the last visited state. We follow existing literature (anderson2018vision; fried2018speaker) and use student-forcing based imitation learning, which uses the agent’s predicted action instead of the expert action for the trajectory rollout.
We want the agent to be able to execute multiple consecutive Baby-Steps and optimize its performance on following longer navigation instructions (instead of the cross-entropy losses from the imitation learning). However, there is a discrepancy between our goal of training the agent to cope with the uncertainty in a long instruction and the imitation learning agent’s ability to accomplish shorter tasks given the human-annotated history. Thus it is challenging to directly optimize the agent with a typical RL learning procedure, even though the imitation learning might have provided a good initialization for the policy; see our ablation study in § 5.3.
Inspired by the curriculum learning strategy (bengio2009curriculum), we design an incremental learning process in which the agent is presented with a curriculum of increasingly longer navigation tasks. Fig. 3 illustrates this idea with two “lectures”. Given a long navigation instruction with $M$ Baby-Steps, for the $k$th lecture, the agent is given all of the human expert’s trajectory up to but not including the $(M-k+1)$th Baby-Step, as well as the history context $z_{M-k+1}$. The agent is then asked to execute the $k$ micro-instructions from $x_{M-k+1}$ to $x_M$ using reinforcement learning to produce a trajectory that optimizes a task-related metric, for instance the fidelity metric measuring how faithfully the agent follows the instructions.
As we increase $k$ from 1 to $M$, the agent faces the challenge of navigating longer and longer tasks with reinforcement learning. However, the agent only needs to improve its skills from its prior exposure to shorter ones. Our ablation studies show this is indeed a highly effective strategy.
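The lecture schedule can be sketched as a simple split of the Baby-Step sequence. The sketch assumes that lecture k hands the agent the expert history for the first M - k Baby-Steps and asks it to execute the last k with reinforcement learning; the function names are hypothetical, and the cap of 4 lectures mirrors the setting mentioned in the implementation details.

```python
def lecture_split(baby_steps, k):
    """Split M Baby-Steps for lecture k (assumed indexing).

    Returns (prefix, suffix): the prefix is replayed from the expert to
    build the history context, the suffix is executed by the agent.
    """
    m = len(baby_steps)
    if not 1 <= k <= m:
        raise ValueError("lecture index k must be in 1..M")
    return baby_steps[: m - k], baby_steps[m - k:]

def curriculum(baby_steps, max_lectures=4):
    """Yield (k, prefix, suffix) for lectures of increasing length."""
    for k in range(1, min(max_lectures, len(baby_steps)) + 1):
        prefix, suffix = lecture_split(baby_steps, k)
        yield k, prefix, suffix
```

For an instruction with three Baby-Steps, the first lecture trains only on the final Baby-Step, and the third lecture covers the whole instruction with an empty expert prefix.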
| | r2r | r4r | r6r | r8r |
|---|---|---|---|---|
| Train seen instr. | 14,039 | 233,532 | 89,632 | 94,731 |
| Val unseen instr. | 2,349 | 45,234 | 35,777 | 43,273 |
| Avg instr. length | 29.4 | 58.4 | 91.2 | 121.6 |
| Avg # Baby-Steps | 1.8 | 3.6 | 5.6 | 7.4 |
To the best of our knowledge, this is the first work studying how well VLN agents generalize to long navigation tasks. To this end, we create the following datasets in the same style as in (jain2019stay).
We concatenate the trajectories in the training split as well as the validation unseen split of the Room2Room dataset 3 times and 4 times respectively, thus extending the lengths of navigation tasks to 6 rooms and 8 rooms. To be joined, the end of the former trajectory must be within 0.5 meters of the beginning of the latter trajectory. Table 1 and Fig. 4 contrast the different datasets in the number of instructions, the average length (in words) of instructions, and how the distributions vary.
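The joining rule can be sketched as follows. This is only an illustration under stated assumptions: paths are hypothetical lists of 3-D coordinate tuples, the distance is Euclidean, and the 0.5 meter bound is treated as inclusive. The released datasets are built on the simulator's navigation graph, which this sketch does not model.

```python
import math

def can_join(former, later, max_gap=0.5):
    """True if the end of `former` is within `max_gap` meters of the start of `later`."""
    return math.dist(former[-1], later[0]) <= max_gap

def concatenate(paths, max_gap=0.5):
    """Join paths end to end, or return None if any junction is too far apart."""
    joined = list(paths[0])
    for nxt in paths[1:]:
        if not can_join(joined, nxt, max_gap):
            return None
        joined.extend(nxt)
    return joined
```

Passing three or four compatible paths to `concatenate` corresponds to building one 6-room or 8-room navigation task.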
In the following, we describe key information for research reproducibility, while the complete details are in the Suppl. Material.
We follow fried2018speaker to set up the states as the visual features (i.e., ResNet-152 features he2016deep) from the agent-centric panoramic views in 12 headings × 3 elevations with 30 degree intervals. Likewise, we use the same panoramic action space.
Our learning approach requires an agent to follow micro-instructions (i.e., the Baby-Steps). Existing datasets (anderson2018vision; jain2019stay; chen2019touchdown) do not provide fine-grained segmentations of long instructions. Therefore, we use a template matching approach to aggregate consecutive sentences into Baby-Steps. First, we extract the noun phrases using POS tagging. Then, we employ heuristic rules to chunk a long instruction into shorter segments according to punctuation and landmark phrases (i.e., words for concrete objects). We document the details in the Suppl. Material.
Without extra annotation, we propose a method to approximately chunk original expert trajectories into sub-trajectories that align with the Baby-Steps. This is important for imitation learning at the micro-instruction level (§ 4.2.1). Specifically, we learn a multi-label visual landmark classifier to identify concrete objects from the states along expert trajectories by using the landmark phrases extracted from their instructions as weak supervision. For each trajectory-instruction pair, we then extract the visual landmarks of every state as well as the landmark phrases in Baby-Step instructions. Next, we perform a dynamic programming procedure to segment the expert trajectories by aligning the visual landmarks and landmark phrases, using the confidence scores of the multi-label visual landmark classifier to form the scoring function.
The encoder $u(\cdot)$ for the (micro)instructions is an LSTM. The encoder for the trajectory contains two separate Bi-LSTMs, one for the states $s_t$ and the other for the actions $a_t$. The outputs of the two Bi-LSTMs are then concatenated to form the embedding function $v(\cdot)$. The details of the neural network architectures (i.e., configurations as well as an illustrative figure), optimization hyper-parameters, etc. are included in the Suppl. Material.
In the second phase of learning, BabyWalk uses RL to learn a policy that maximizes the fidelity-oriented rewards (CLS) proposed by jain2019stay. We use policy gradient as the optimizer (sutton2000policy). Meanwhile, we set the maximum number of lectures in curriculum RL to be 4, which is studied in Section 5.3.
[Table 2. Columns: Setting; r4r → r4r (in-domain); r4r → r2r, r4r → r6r, r4r → r8r, and Average (generalization to other datasets).]
We describe the experimental setup (§ 5.1), followed by the main results in § 5.2, where we show that the proposed BabyWalk agent attains competitive results on the in-domain dataset and also generalizes to out-of-domain datasets with varying lengths of navigation tasks. We report results from various ablation studies in § 5.3. While we primarily focus on the Room4Room dataset, we re-analyze the original Room2Room dataset in § 5.4 and were surprised to find that agents trained on it can generalize.
We adopt the following metrics: Success Rate (sr) that measures the average rate of the agent stopping within a specified distance near the goal location (anderson2018vision), Coverage weighted by Length Score (cls) (jain2019stay) that measures the fidelity of the agent’s path to the reference, weighted by the length score, and the newly proposed Success rate weighted normalized Dynamic Time Warping (sdtw) that measures in more fine-grained details, the spatio-temporal similarity of the paths by the agent and the human expert, weighted by the success rate (magalhaes2019effective). Both cls and sdtw measure explicitly the agent’s ability to follow instructions and in particular, it was shown that sdtw corresponds to human preferences the most. We report results in other metrics in the Suppl. Material.
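To make the sdtw metric concrete, here is a minimal sketch following the definition in magalhaes2019effective: the normalized DTW score is exp(-DTW(R, Q) / (|R| * d_th)) and sdtw multiplies it by the binary success indicator. The sketch makes simplifying assumptions that are not part of the benchmark: paths are lists of coordinate tuples, distances are Euclidean rather than shortest-path distances on the navigation graph, and the success threshold `threshold=3.0` meters is the commonly used value.

```python
import math

def dtw(reference, query):
    """Dynamic time warping cost between two paths of coordinate tuples."""
    n, m = len(reference), len(query)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(reference[i - 1], query[j - 1])
            # extend the cheapest of the three admissible warping moves
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def sdtw(reference, query, threshold=3.0):
    """Success-weighted normalized DTW for a single episode."""
    success = math.dist(reference[-1], query[-1]) <= threshold
    ndtw = math.exp(-dtw(reference, query) / (len(reference) * threshold))
    return ndtw if success else 0.0
```

A path identical to the reference scores 1, a path that stops far from the goal scores 0, and a successful path that deviates along the way falls in between.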
Whenever possible, for all agents we compare to, we either re-run, reimplement, or adapt publicly available code from the corresponding authors, following their provided instructions, to ensure a fair comparison. We also “sanity check” by ensuring that the results from our implementation and adaptation are comparable to those reported in the literature.
We compare our BabyWalk to the following: (1) the seq2seq agent (anderson2018vision), being adapted to the panoramic state and action space used in this work; (2) the Speaker Follower (sf) agent (fried2018speaker); (3) the Reinforced Cross-Modal Agent (rcm) (wang2019reinforced) that refines the sf agent using reinforcement learning with either goal-oriented reward (rcm(goal)) or fidelity-oriented reward (rcm(fidelity)); (4) the Regretful Agent (regretful) (ma2019regretful) that uses a progress monitor that records visited path and a regret module that performs backtracking; (5) the Frontier Aware Search with Backtracking agent (fast) (ke2019tactical) that incorporates global and local knowledge to compare partial trajectories in different lengths.
The last 3 agents are reported to have state-of-the-art results on the benchmark datasets. Except for the seq2seq agent, all other agents depend on an additional pre-training stage with data augmentation (fried2018speaker), which improves performance across the board. Thus, we train two BabyWalk agents: one with and the other without the data augmentation.
This is the standard evaluation scenario where a trained agent is assessed on the unseen split from the same dataset as the training data. The leftmost columns in Table 2 report the results where the training data is from r4r. The BabyWalk agents outperform all other agents when evaluated on cls and sdtw.
When evaluated on sr, fast performs the best and the BabyWalk agents do not stand out. This is expected: agents which are trained to reach goal do not necessarily lead to better instruction-following. Note that rcm(fidelity) performs well in path-following.
While our primary goal is to train agents to generalize well to longer navigation tasks, we are also curious how the agents perform on shorter navigation tasks. The right columns in Table 2 report the comparison. The BabyWalk agents outperform all other agents in all metrics except sr. In particular, on sdtw, the generalization to r6r and r8r is especially encouraging, with scores almost twice those of the second-best agent fast. Moreover, recalling Fig. 1, BabyWalk’s generalization to r6r and r8r attains even better performance than the rcm agents that are trained in-domain.
Fig. 5 provides additional evidence of the success of BabyWalk, where we contrast its performance with that of other agents on following instructions of different lengths across all datasets. Clearly, the BabyWalk agent improves very noticeably on longer instructions.
Fig. 6 contrasts visually several agents in executing two (long) navigation tasks. BabyWalk’s trajectories are similar to what human experts provide, while other agents’ are not.
[Table 3. Columns: Setting; r4r → r4r; r4r → others. Rows compare choices of the summary function, including the forgetting summary of eqs. (2,3).]
Table 3 illustrates the importance of having a memory buffer to summarize the agent’s past experiences. Without the memory (null), generalization to longer tasks is significantly worse. Using an LSTM to summarize is worse than using forgetting to summarize (eqs. (2,3)). Meanwhile, ablating $\gamma$ of the forgetting mechanism shows that $\gamma = 0.5$ is the best value in our hyperparameter search. Note that when $\gamma = 0$, this mechanism degenerates to taking the average of the memory buffer, which leads to inferior results.
Table 4 establishes the value of curriculum-based reinforcement learning (CRL). While imitation learning (il) provides a good warm-up for sr, significant improvements on the other two metrics come from the subsequent RL (il+rl). Furthermore, CRL (with 4 “lectures”) provides clear improvements over direct RL on the entire instruction (i.e., learning to execute all Baby-Steps at once). Each lecture improves over the previous one, especially in terms of the sdtw metric.
[Table 4. Columns: Setting; r4r → r4r; r4r → others. Rows include il+crl with varying lecture #.]
Our experimental study has been focusing on using r4r as the training dataset as it was established that as opposed to r2r, r4r distinguishes well an agent who just learns to reach the goal from an agent who learns to follow instructions.
Given the encouraging results of generalizing to longer tasks, a natural question to ask is: how well can an agent trained on r2r generalize?
Results in Table 5 are interesting. As shown in the top panel, the difference in the averaged performance of generalizing to r6r and r8r is not significant. The agent trained on r4r has a small win on r6r, presumably because r4r is closer to r6r than r2r is. But for the even longer tasks in r8r, the win is similar.
In the bottom panel, however, it seems that r2r → r4r is stronger (incurring less loss in performance when compared to the in-domain setting r4r → r4r) than the reverse direction (i.e., comparing r4r → r2r to the in-domain r2r → r2r). This might have been caused by the noisier segmentation of long instructions into Baby-Steps in r4r. (While r4r is composed of two navigation paths in r2r, the segmentation algorithm is not aware of the “natural” boundaries between the two paths.)
There are a few future directions to pursue. First, despite the significant improvement, the gap between short and long tasks is still large and needs to be further reduced. Secondly, richer and more complicated variations between the learning setting and the real physical world need to be tackled. For instance, developing agents that are robust to variations in both visual appearance and instruction descriptions is an important next step.
We appreciate the feedback from the reviewers. This work is partially supported by NSF Awards IIS-1513966/1632803/1833137, CCF-1139148, DARPA Award#: FA8750-18-2-0117, DARPA-D3M - Award UCB-00009528, Google Research Awards, gifts from Facebook and Netflix, and ARO# W911NF-12-1-0241 and W911NF-15-1-0484.
In this section, we describe the details of how Baby-Steps are identified in the annotated natural language instructions and how expert trajectory data are segmented to align with Baby-Step instructions.
We identify the navigable Baby-Steps from the natural language instructions of r2r, r4r, r6r and r8r, based on the following 6 steps:
1. Split sentences and chunk phrases. We split the instructions by periods. For each sentence, we perform POS tagging using the SpaCy (spacy2) package to locate and chunk all plausible noun phrases and verb phrases.
2. Curate noun phrases. We curate noun phrases by removing the stop words (e.g., the, for, from) and isolated punctuation among them, and by lemmatizing each word. The purpose is to collect a concentrated set of semantic noun phrases that contain potential visual objects.
3. Identify “landmark words”. Next, given the set of candidate visual object words, we filter out a blacklist of words that either do not correspond to any visual counterpart or are mis-classified by the SpaCy package. The word blacklist includes:
end, 18 inch, head, inside, forward, position, ground, home, face, walk, feet, way, walking, bit, veer, ’ve, next, stop, towards, right, direction, thing, facing, side, turn, middle, one, out, piece, left, destination, straight, enter, wait, don’t, stand, back, round
We use the remaining noun phrases as the “landmark words” of the sentences. Note that this step identifies the “landmark words” for the later procedure which aligns Baby-Steps and expert trajectories.
4. Identify verb phrases. Similarly, we use a verb blacklist to filter out verbs that require no navigational actions of the agent. The blacklist includes: make, turn, face, facing, veer.
5. Merge non-actionable sentences. We merge a sentence without landmarks and verbs into the next sentence, as it is likely not actionable.
6. Merge stop sentences. There are sentences that only describe the stop condition of a navigation action, which include verb-noun compositions indicating the stop condition. We detect the sentences starting with wait, stop, there, remain, you will see as sentences that only describe the stop condition and merge them into the previous sentence. Similarly, we detect sentences starting with with, facing and merge them into the next sentence.
After applying the above 6 heuristic rules to the language instruction, we obtain chunks of sentences that describe the navigable Baby-Steps of the whole task (i.e., a sequence of navigational sub-goals).
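The sentence-merging part of these rules can be sketched without any NLP dependency. The sketch below covers only the period split and the final merging rule (stop sentences go to the previous chunk, sentences starting with "with" or "facing" go to the next one); the POS tagging, landmark filtering, and verb filtering steps are omitted. All function names are hypothetical.

```python
MERGE_PREV = ("wait", "stop", "there", "remain", "you will see")
MERGE_NEXT = ("with", "facing")

def split_sentences(instruction):
    """Split an instruction by periods and drop empty pieces."""
    return [s.strip() for s in instruction.split(".") if s.strip()]

def starts_with_phrase(sentence, phrases):
    """True if the sentence begins with one of the phrases (whole words only)."""
    words = sentence.lower().replace(",", " ").split()
    return any(words[: len(p.split())] == p.split() for p in phrases)

def merge_stop_sentences(sentences):
    """Group sentences into chunks following the stop-sentence rule."""
    chunks, carry = [], []
    for sent in sentences:
        merge_next = starts_with_phrase(sent, MERGE_NEXT)
        merge_prev = starts_with_phrase(sent, MERGE_PREV)
        text = ". ".join(carry + [sent])  # prepend any sentence waiting for a successor
        carry = []
        if merge_next:
            carry = [text]
        elif merge_prev and chunks:
            chunks[-1] = chunks[-1] + ". " + text
        else:
            chunks.append(text)
    if carry:  # a trailing "with"/"facing" sentence has no successor
        chunks.append(carry[0])
    return chunks
```

For example, "Walk past the sofa. Stop at the door. Facing the window. Go to the kitchen. Wait there." yields two chunks, one ending at the door and one covering the rest. Matching on whole words avoids treating "without" or "therefore" as trigger words.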
In the previous section, we describe the algorithm for identifying Baby-Step instructions from the original natural language instructions of the dataset. Now we are going to describe the procedure of aligning Baby-Steps with the expert trajectories, which segments the expert trajectories according to the Baby-Steps to create the training data for the learning pipeline of our BabyWalk agent. Note that during the training, our BabyWalk does not rely on the existence of ground-truth alignments between the (micro)instructions and Baby-Steps trajectories.
The main idea here is to: 1) perform visual landmark classification to produce confidence scores of landmarks for each visual state along expert trajectories; 2) use the predicted landmark scores and the “landmark words” in Baby-Steps to guide the alignment between the expert trajectory and Baby-Steps. To achieve this, we train a visual landmark classifier with weak supervision — trajectory-wise existence of landmark objects. Next, based on the predicted landmark confidence scores, we use dynamic programming (DP) to chunk the expert trajectory into segments and assign the segments to the Baby-Steps.
Given the pairs of aligned instructions and trajectories $(x, y)$ from the original dataset, we train a landmark classifier to detect the landmarks mentioned in the instructions. We formulate this as a multi-label classification problem that asks a classifier $f_{\text{LDMK}}(s_t; O)$ to predict all the landmarks $O_x$ of the instruction $x$ given the corresponding trajectory $y$. Here, $O$ denotes the set of all possible landmarks in the entire dataset, and $O_x$ denotes the landmarks of a specific instruction $x$. Concretely, we first train a convolutional neural network (CNN) on the visual state features $s_t$ to independently predict the existence of landmarks at every time step, and then aggregate the predictions across all time steps into trajectory-wise logits $\psi$ via max-pooling over all states of the trajectory:

$$\psi = \max \{ f_{\text{LDMK}}(s_t; O) \mid t = 1, \dots, |y| \}$$

Here, $f_{\text{LDMK}}$ denotes the independent state-wise landmark classifier, and $\psi$ is the vector of logits before normalization for computing the landmark probability. For $f_{\text{LDMK}}$, we input the panoramic visual feature (i.e., the ResNet-152 feature) into a two-layer CNN (kernel size 3, hidden dimension 128, ReLU non-linearity) to produce feature activations with spatial extent, followed by global average pooling over the spatial dimensions and a multi-layer perceptron (two layers, hidden dimension 512, ReLU non-linearity) that outputs the state-wise logits for all visual landmarks. We then max-pool the state-wise logits along the trajectory and compute the loss as a trajectory-wise binary cross-entropy between the ground-truth landmark labels (of existence) and the prediction.
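To make the weak supervision concrete, the following is a minimal pure-Python sketch of the pooling and loss steps only, not the released implementation. It assumes the state-wise logits have already been produced by the CNN and MLP; the function names and the toy numbers are hypothetical.

```python
import math

def trajectory_logits(state_logits):
    """Max-pool state-wise logits over time: one logit per landmark."""
    num_landmarks = len(state_logits[0])
    return [max(step[k] for step in state_logits) for k in range(num_landmarks)]

def binary_cross_entropy_with_logits(logits, labels):
    """Mean binary cross-entropy over landmarks, computed from raw logits."""
    total = 0.0
    for z, y in zip(logits, labels):
        # Stable form of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))].
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

# Toy trajectory: 3 states, 4 landmark classes (made-up values).
state_logits = [
    [2.0, -1.0, -3.0, 0.5],
    [0.1, -2.0, -1.5, 4.0],
    [-1.0, -0.5, -2.5, 1.0],
]
# Weak labels: which landmarks are mentioned in the paired instruction.
labels = [1.0, 0.0, 0.0, 1.0]

pooled = trajectory_logits(state_logits)
loss = binary_cross_entropy_with_logits(pooled, labels)
```

Because the label only says that a landmark appears somewhere along the trajectory, max-pooling lets a single confident state account for it, which is what makes the per-state scores usable for alignment afterwards.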
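Now, suppose we have a sequence of Baby-Step instructions $X = \{x_m, m = 1, \dots, M\}$ and its expert trajectory $Y = \{s_t, t = 1, \dots, |Y|\}$. Using the state-wise landmark classifier $f_{\text{LDMK}}$, we can compute the averaged landmark score of a single state $s_t$ for the landmarks $O_{x_m}$ that exist in the sub-task instruction $x_m$:

$$\Psi(t, m) = \frac{\mathbb{1}[o_m \in O]^{\top} f_{\text{LDMK}}(s_t; O)}{|O_{x_m}|}$$

Here, $\mathbb{1}[o_m \in O]$ represents the one-hot encoding of the landmarks that exist in the Baby-Step $x_m$, and $|O_{x_m}|$ is the total number of existing landmarks. We then apply dynamic programming (DP) to solve the trajectory segmentation specified by the following Bellman equation (in recursive form):

$$\Phi(t, m) = \begin{cases} \Psi(t, m), & \text{if } t = 1 \\ \Psi(t, m) + \max_{i \in \{1, \dots, t-1\}} \Phi(i, m-1), & \text{otherwise} \end{cases}$$

Here, $\Phi(t, m)$ represents the maximum potential of choosing the state $s_t$ as the end point of the Baby-Step instruction $x_m$. Solving this DP leads to a set of correspondingly segmented trajectories $Y = \{y_m, m = 1, \dots, M\}$, with $y_m$ being the $m$-th Baby-Step sub-trajectory.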
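The segmentation can be sketched in a few lines of pure Python. This is an illustrative sketch under stated assumptions, not the released code: `psi[t][m]` is taken to be the precomputed averaged landmark score of state `t` for Baby-Step `m` (0-indexed), the first Baby-Step may end at any state, and the last Baby-Step is assumed to end at the final state. The function name `align_baby_steps` is hypothetical.

```python
def align_baby_steps(psi):
    """Segment a trajectory into M consecutive chunks, one per Baby-Step.

    psi[t][m] is the averaged landmark score of state t for Baby-Step m.
    Returns a list of inclusive (start, end) state indices.
    """
    T, M = len(psi), len(psi[0])
    assert T >= M, "need at least one state per Baby-Step"
    NEG = float("-inf")
    phi = [[NEG] * M for _ in range(T)]  # best potential with step m ending at t
    back = [[-1] * M for _ in range(T)]  # end state chosen for step m - 1

    for t in range(T):
        phi[t][0] = psi[t][0]
    for m in range(1, M):
        best, best_i = NEG, -1           # running max of phi[i][m - 1] over i < t
        for t in range(T):
            if best > NEG:
                phi[t][m] = psi[t][m] + best
                back[t][m] = best_i
            if phi[t][m - 1] > best:
                best, best_i = phi[t][m - 1], t

    # Assumption of this sketch: the last Baby-Step ends at the final state.
    ends, t = [], T - 1
    for m in range(M - 1, -1, -1):
        ends.append(t)
        t = back[t][m]
    ends.reverse()
    starts = [0] + [e + 1 for e in ends[:-1]]
    return list(zip(starts, ends))

# Toy scores for 5 states and 2 Baby-Steps (made-up values).
psi = [
    [0.1, 0.0],
    [0.2, 0.1],
    [0.9, 0.1],
    [0.2, 0.3],
    [0.1, 0.8],
]
segments = align_baby_steps(psi)
```

Keeping a running maximum over earlier end points avoids the inner search over `i`, so the table is filled in time proportional to the number of states times the number of Baby-Steps.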
Figure 7 gives an overview of the unrolled version of our full navigation agent.
We set up the states as the stacked visual features of agent-centric panoramic views in 12 headings × 3 elevations at 30-degree intervals. The visual feature of each view is the concatenation of a ResNet-152 feature vector of size 2048 and an orientation feature vector of size 128 (the 4-dimensional orientation feature is tiled 32 times). We use the same kind of single-view visual feature, of size 2176, as our action embeddings.
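The dimensions above can be checked with a small sketch. The layout of the 4-dimensional orientation feature (sine and cosine of heading and elevation) and the elevation values of -30, 0 and +30 degrees are assumptions of this example, as are all function names; the text only fixes the sizes.

```python
import math

NUM_HEADINGS, NUM_ELEVATIONS = 12, 3
RESNET_DIM, TILE = 2048, 32

def orientation_feature(heading, elevation, tile=TILE):
    """Tile a 4-d orientation encoding (assumed sin/cos layout) `tile` times."""
    base = [math.sin(heading), math.cos(heading),
            math.sin(elevation), math.cos(elevation)]
    return base * tile

def view_feature(resnet_vec, heading, elevation):
    """Concatenate the image feature with the tiled orientation feature."""
    assert len(resnet_vec) == RESNET_DIM
    return list(resnet_vec) + orientation_feature(heading, elevation)

def panorama(resnet_vecs):
    """Stack 36 view features: 12 headings x 3 elevations, 30 degrees apart."""
    step = math.radians(30)
    views = []
    for e in range(NUM_ELEVATIONS):
        for h in range(NUM_HEADINGS):
            idx = e * NUM_HEADINGS + h
            # Assumption: the three elevations are -30, 0 and +30 degrees.
            views.append(view_feature(resnet_vecs[idx], h * step, (e - 1) * step))
    return views

# Zero vectors stand in for real ResNet-152 features.
dummy = [[0.0] * RESNET_DIM for _ in range(NUM_HEADINGS * NUM_ELEVATIONS)]
pano = panorama(dummy)
```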
The instruction encoder is a unidirectional LSTM with hidden size 512 and a word embedding layer of size 300 (initialized with GloVe embeddings pennington2014glove). We use the same encoder for both the past experienced instructions and the currently executing instruction. The trajectory encoder contains two separate bidirectional LSTMs (Bi-LSTMs), both with hidden size 512. The first Bi-LSTM outputs a hidden state for each time step. We then attend the hidden state to the panoramic view to get a state feature of size 2176 for each time step. The second Bi-LSTM encodes the state features. We use the trajectory encoder only for encoding the past experienced trajectories.
The BabyWalk policy network consists of one LSTM with two attention layers and an action predictor. First, we attend the hidden state to the panoramic view to get a state feature of size 2176. The state feature is concatenated with the previous action embedding and used to update the hidden state with an LSTM of hidden size 512. The updated hidden state is then attended to the context variables (the output of the instruction encoder). For the action predictor module, we concatenate the output of the text attention layer with the summarized past context to get an action prediction variable. We then pass the action prediction variable through a 2-layer MLP and take its dot product with the navigable action embeddings to obtain the probability of the next action.
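The final scoring step of the action predictor can be illustrated as follows. This is a toy sketch: the query vector stands in for the MLP output, the embeddings are 3-dimensional rather than 2176-dimensional, and normalizing the dot products with a softmax is an assumption of this example, since the text only says the scores are turned into a probability.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def action_probabilities(query, action_embeddings):
    """Score each navigable action by its dot product with the query vector."""
    logits = [sum(q * a for q, a in zip(query, emb)) for emb in action_embeddings]
    return softmax(logits)

# Hypothetical MLP output and three navigable action embeddings.
query = [1.0, 0.0, -1.0]
actions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
probs = action_probabilities(query, actions)
```

Scoring by dot product lets the number of candidate actions vary from state to state, since each navigable direction simply contributes one more embedding to compare against.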
At inference time, the BabyWalk policy only requires running the heuristic Baby-Step identification on the test-time instruction. No oracle Baby-Step trajectory is needed, since the BabyWalk agent rolls out each Baby-Step by itself.
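A schematic of this roll-out loop is given below. It is a sketch under simplifying assumptions: `policy` is a hypothetical callable that returns the next state or `None` to stop, states are plain integers, and the list of Baby-Steps is assumed to come from the heuristic identification step. The per-Baby-Step limit of 10 steps follows the training setup.

```python
MAX_STEPS_PER_BABY_STEP = 10  # per-Baby-Step step limit from the training setup

def rollout(baby_steps, policy, start_state):
    """Execute Baby-Steps one by one; `policy` returns the next state or None."""
    state = start_state
    history = []                # (instruction, sub-trajectory) pairs seen so far
    trajectory = [start_state]
    for instruction in baby_steps:
        sub_trajectory = [state]
        for _ in range(MAX_STEPS_PER_BABY_STEP):
            next_state = policy(state, instruction, history)
            if next_state is None:      # the agent decides this Baby-Step is done
                break
            state = next_state
            sub_trajectory.append(state)
        # The finished Baby-Step becomes context for the following ones.
        history.append((instruction, sub_trajectory))
        trajectory.extend(sub_trajectory[1:])
    return trajectory, history

# Toy stand-ins: each "instruction" is a target integer to walk up to.
def toy_policy(state, instruction, history):
    return state + 1 if state < instruction else None

trajectory, history = rollout([2, 5], toy_policy, start_state=0)
```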
As mentioned in the main text, we learn the policy by optimizing the fidelity-oriented reward jain2019stay. We now give the complete details of this reward function. Suppose the total number of roll-out steps is $T$; the reward function then has the following form:
Here, $\hat{Y}$ represents the concatenation of the Baby-Step trajectories produced by the navigation agent (we denote the concatenation operation by $\oplus$).
For each Baby-Step task, we set the maximum number of steps to 10 and truncate the corresponding Baby-Step instruction to a length of 100. During both imitation learning and curriculum reinforcement learning, we fix the learning rate to 1e-4. In imitation learning, the mini-batch size is set to 100. In curriculum learning, we reduce the mini-batch size as the curriculum advances to reduce memory consumption: for the 1st, 2nd, 3rd and 4th curriculum, the mini-batch size is set to 50, 32, 20, and 20, respectively. During learning, we pre-train our BabyWalk model for 50000 iterations with imitation learning as a warm-up stage. Next, in each lecture (up to 4) of the reinforcement learning (RL), we train the BabyWalk agent for an additional 10000 iterations and select the best-performing model in terms of sdtw to start the next lecture. For each instruction executed during RL, we sample 8 navigation episodes before performing any back-propagation. For each learning stage, we use a separate Adam optimizer to optimize all the parameters. We also use L2 weight decay as the regularizer, with its coefficient set to 0.0005. In reinforcement learning, the discount factor is set to 0.95.
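For reference, the hyper-parameters listed above can be collected in one place. The sketch below is only an illustration: the class and field names are hypothetical and do not correspond to options of the released code; only the values come from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    max_steps_per_baby_step: int = 10
    max_instruction_length: int = 100
    learning_rate: float = 1e-4
    weight_decay: float = 5e-4           # L2 regularization coefficient
    il_batch_size: int = 100
    il_iterations: int = 50_000          # imitation-learning warm-up
    rl_iterations_per_lecture: int = 10_000
    rl_batch_sizes: tuple = (50, 32, 20, 20)  # one entry per curriculum lecture
    rl_episodes_per_instruction: int = 8
    discount: float = 0.95

def schedule(cfg):
    """List (stage name, iterations, mini-batch size) in training order."""
    stages = [("il", cfg.il_iterations, cfg.il_batch_size)]
    for i, batch_size in enumerate(cfg.rl_batch_sizes, start=1):
        stages.append((f"crl-lecture-{i}", cfg.rl_iterations_per_lecture, batch_size))
    return stages

stages = schedule(TrainConfig())
```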
In this section, we describe a comprehensive set of evaluation metrics and then show the transfer results of models trained on each dataset, with all metrics. We provide an additional analysis of the effectiveness of template-based Baby-Step identification. Finally, we present additional qualitative results.
We adopt the following set of metrics:
Path Length (pl) is the length of the agent’s navigation path.
Navigation Error (ne) measures the distance between the goal location and final location of the agent’s path.
Success Rate (sr) measures the average rate of the agent stopping within a specified distance of the goal location anderson2018vision.
Success weighted by Path Length (spl) anderson2018vision measures the success rate weighted by the inverse trajectory length, to penalize very long successful trajectories.
Coverage weighted by Length Score (cls) jain2019stay measures the fidelity of the agent’s path to the reference, weighted by the length score.
Normalized Dynamic Time Warping (ndtw), a more recently proposed metric, measures in finer detail the spatio-temporal similarity between the agent’s path and the human expert’s path magalhaes2019effective.
Success rate weighted normalized Dynamic Time Warping (sdtw) further weights this spatio-temporal similarity by the success rate magalhaes2019effective. cls, ndtw and sdtw explicitly measure the agent’s ability to follow instructions; in particular, sdtw was shown to correspond most closely to human preferences.
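To make the definitions concrete, the following sketch computes most of these metrics for a single episode. It makes simplifying assumptions that differ from the actual evaluation: positions are 2D points with Euclidean distance instead of shortest-path distances on the navigation graph, the straight line from start to goal stands in for the shortest path in spl, the success threshold is a parameter set to 3.0 here, and cls is omitted. All function names are hypothetical.

```python
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def path_length(path):
    return sum(dist(p, q) for p, q in zip(path, path[1:]))

def dtw(pred, ref):
    """Classic dynamic time warping cost between two point sequences."""
    INF = float("inf")
    n, m = len(pred), len(ref)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = dist(pred[i - 1], ref[j - 1]) + step
    return cost[n][m]

def evaluate(pred, ref, threshold=3.0):
    """Per-episode pl, ne, sr, spl, ndtw and sdtw for one predicted path."""
    ne = dist(pred[-1], ref[-1])
    success = 1.0 if ne <= threshold else 0.0
    pl = path_length(pred)
    shortest = dist(ref[0], ref[-1])  # assumption: straight line as shortest path
    spl = success * shortest / max(pl, shortest) if shortest > 0 else success
    # ndtw normalizes the DTW cost by reference length and success threshold.
    ndtw = math.exp(-dtw(pred, ref) / (len(ref) * threshold))
    return {"pl": pl, "ne": ne, "sr": success, "spl": spl,
            "ndtw": ndtw, "sdtw": success * ndtw}

reference = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
perfect = evaluate(reference, reference)
detour = evaluate([(0.0, 0.0), (0.0, 10.0)], reference)
```

Dataset-level numbers are then averages of these per-episode values, which is why a path that reaches the goal by a shortcut can score well on sr and spl while scoring poorly on ndtw.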
|Data Splits||r2r Validation Unseen|
As mentioned in the main text, we compare our re-implementation with the originally reported results of the baseline methods on the r2r dataset in Table 6. We found that the results are mostly very similar, indicating that our re-implementation is reliable.
We present the curriculum learning results with all evaluation metrics in Table 7.
|il+ crl w/ lecture #|
We present an additional analysis comparing different Baby-Step identification methods. We compare our template-based Baby-Step identification with a simple method that treats each sentence as a Baby-Step (referred to as sentence-wise), both using the complete BabyWalk model with the same training routine. The results are shown in Table 8. Generally speaking, the template-based Baby-Step identification provides better performance.
As mentioned in the main text, we display all the in-domain results of navigation agents trained on r2r, r4r, r6r, and r8r, respectively. The complete results for all metrics are included in Table 9. We note that our BabyWalk agent consistently outperforms the baseline methods on each dataset. It is worth noting that on the r4r, r6r and r8r datasets, rcm(goal) achieves better results in spl. This is due to the aforementioned fact that it often takes shortcuts to reach the goal directly, with a significantly shorter trajectory. As a consequence, its success rate weighted by inverse path length is high.
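(a) r2r trained model

| Datasets | Metrics | seq2seq | sf | rcm(goal) | rcm(fidelity) | regretful | fast | BabyWalk | BabyWalk |
|---|---|---|---|---|---|---|---|---|---|
| | pl | 28.6 | 28.9 | 13.2 | 14.1 | 15.5 | 29.7 | 19.5 | 17.9 |
| | ne | 9.1 | 9.0 | 9.2 | 9.3 | 8.4 | 9.1 | 8.9 | 8.9 |
| | sr | 18.3 | 16.7 | 14.7 | 15.2 | 19.2 | 13.3 | 22.5 | 21.4 |
| | spl | 7.9 | 7.4 | 8.9 | 8.9 | 10.1 | 7.7 | 12.6 | 11.9 |
| | cls | 29.8 | 30.0 | 42.5 | 41.2 | 46.4 | 41.8 | 50.3 | 51.0 |
| | ndtw | 25.1 | 25.3 | 33.3 | 32.4 | 31.6 | 33.5 | 38.9 | 40.3 |
| | sdtw | 7.1 | 6.7 | 7.3 | 7.2 | 9.8 | 7.2 | 14.5 | 13.8 |
| | pl | 39.4 | 41.4 | 14.2 | 15.7 | 15.9 | 32.0 | 29.1 | 25.9 |
| | ne | 9.6 | 9.8 | 9.7 | 9.8 | 8.8 | 9.0 | 10.1 | 9.8 |
| | sr | 20.7 | 17.9 | 22.4 | 22.7 | 24.2 | 26.0 | 21.4 | 21.7 |
| | spl | 11.0 | 9.1 | 17.7 | 18.3 | 16.6 | 16.5 | 7.9 | 8.8 |
| | cls | 25.9 | 26.2 | 37.1 | 36.4 | 40.9 | 37.7 | 48.4 | 49.0 |
| | ndtw | 20.5 | 20.8 | 26.6 | 26.1 | 16.2 | 21.9 | 30.8 | 32.6 |
| | sdtw | 7.7 | 7.2 | 8.2 | 8.4 | 6.8 | 8.5 | 11.2 | 11.2 |
| | pl | 52.3 | 52.2 | 15.3 | 16.9 | 16.6 | 34.9 | 38.3 | 34.0 |
| | ne | 10.5 | 10.5 | 11.0 | 11.1 | 10.0 | 10.6 | 11.1 | 10.5 |
| | sr | 16.9 | 13.8 | 12.4 | 12.6 | 16.3 | 11.1 | 19.6 | 20.7 |
| | spl | 6.1 | 5.6 | 7.4 | 7.5 | 7.7 | 6.2 | 6.9 | 7.8 |
| | cls | 22.5 | 24.1 | 32.4 | 30.9 | 35.3 | 33.7 | 48.1 | 48.7 |
| | ndtw | 17.1 | 18.2 | 23.9 | 23.3 | 8.1 | 14.5 | 26.7 | 29.1 |
| | sdtw | 4.1 | 3.8 | 4.3 | 4.3 | 2.4 | 2.4 | 9.4 | 9.8 |
| Average | pl | 40.1 | 40.8 | 14.2 | 15.6 | 16.0 | 32.2 | 29.0 | 25.9 |
| | ne | 9.7 | 9.8 | 10.0 | 10.1 | 9.1 | 9.6 | 10.0 | 9.7 |
| | sr | 18.6 | 16.1 | 16.5 | 16.8 | 19.9 | 16.8 | 21.2 | 21.3 |
| | spl | 8.3 | 7.4 | 11.3 | 11.6 | 11.5 | 10.1 | 9.1 | 9.5 |
| | cls | 26.1 | 26.8 | 37.3 | 36.2 | 40.9 | 37.7 | 48.9 | 49.6 |
| | ndtw | 20.9 | 21.4 | 27.9 | 27.3 | 18.6 | 23.3 | 32.1 | 34.0 |
| | sdtw | 6.3 | 5.9 | 6.6 | 6.6 | 6.3 | 6.0 | 11.7 | 11.6 |

(b) r4r trained model

| Datasets | Metrics | seq2seq | sf | rcm(goal) | rcm(fidelity) | regretful | fast | BabyWalk | BabyWalk |
|---|---|---|---|---|---|---|---|---|---|
| | pl | 16.2 | 17.4 | 10.2 | 17.7 | 20.0 | 26.5 | 12.1 | 9.6 |
| | ne | 7.8 | 7.3 | 7.1 | 6.7 | 7.5 | 7.2 | 6.6 | 6.6 |
| | sr | 16.3 | 22.5 | 25.9 | 29.1 | 22.8 | 25.1 | 35.2 | 34.1 |
| | spl | 9.9 | 14.1 | 22.5 | 18.2 | 14.0 | 16.3 | 28.3 | 30.2 |
| | cls | 27.1 | 29.5 | 44.2 | 34.3 | 32.6 | 33.9 | 48.5 | 50.4 |
| | ndtw | 29.3 | 31.8 | 41.1 | 33.5 | 28.5 | 27.9 | 46.5 | 50.0 |
| | sdtw | 10.6 | 14.8 | 20.2 | 18.3 | 13.4 | 14.2 | 27.2 | 27.8 |
| | pl | 40.8 | 38.5 | 12.8 | 33.0 | 19.9 | 26.6 | 37.0 | 28.7 |
| | ne | 9.9 | 9.5 | 9.2 | 9.3 | 9.5 | 8.9 | 8.8 | 9.2 |
| | sr | 14.4 | 15.5 | 19.3 | 20.5 | 18.0 | 22.1 | 26.4 | 25.5 |
| | spl | 6.8 | 8.4 | 15.2 | 8.5 | 10.6 | 13.7 | 8.1 | 9.2 |
| | cls | 17.7 | 20.4 | 31.8 | 38.3 | 31.7 | 31.5 | 44.9 | 47.2 |
| | ndtw | 16.4 | 18.3 | 23.5 | 23.7 | 23.5 | 23.0 | 30.1 | 32.7 |
| | sdtw | 4.6 | 5.2 | 7.3 | 7.9 | 7.5 | 7.7 | 13.1 | 13.6 |
| | pl | 56.4 | 50.8 | 13.9 | 38.7 | 20.7 | 28.2 | 50.0 | 39.9 |
| | ne | 10.1 | 9.5 | 9.5 | 9.9 | 9.5 | 9.1 | 9.3 | 10.1 |
| | sr | 20.7 | 21.6 | 22.8 | 20.9 | 18.7 | 27.7 | 26.3 | 23.1 |
| | spl | 10.4 | 11.8 | 16.9 | 9.0 | 9.2 | 13.7 | 7.2 | 7.4 |
| | cls | 15.0 | 17.2 | 27.6 | 34.6 | 29.3 | 29.6 | 44.7 | 46.0 |
| | ndtw | 13.4 | 15.1 | 19.5 | 21.7 | 19.0 | 17.7 | 27.1 | 28.2 |
| | sdtw | 4.7 | 5.0 | 5.1 | 6.1 | 5.6 | 6.9 | 11.5 | 11.1 |
| Average | pl | 37.8 | 35.6 | 12.3 | 29.8 | 20.2 | 27.1 | 33.0 | 26.1 |
| | ne | 9.3 | 8.8 | 8.6 | 8.6 | 8.8 | 8.4 | 8.2 | 8.6 |
| | sr | 17.1 | 19.9 | 22.7 | 23.5 | 19.8 | 25.0 | 29.3 | 27.6 |
| | spl | 9.0 | 11.4 | 18.2 | 11.9 | 11.3 | 14.6 | 14.5 | 15.6 |
| | cls | 19.9 | 22.4 | 34.5 | 35.7 | 31.2 | 31.7 | 46.0 | 47.9 |
| | ndtw | 19.7 | 21.7 | 28.0 | 26.3 | 23.7 | 22.9 | 34.6 | 37.0 |
| | sdtw | 6.6 | 8.3 | 10.9 | 10.8 | 8.8 | 9.6 | 17.3 | 17.5 |

(c) r6r trained model

| Datasets | Metrics | seq2seq | sf | rcm(goal) | rcm(fidelity) | BabyWalk | BabyWalk |
|---|---|---|---|---|---|---|---|
| | pl | 14.5 | 19.4 | 8.1 | 15.5 | 9.4 | 9.2 |
| | ne | 7.7 | 7.1 | 7.6 | 7.5 | 6.8 | 6.8 |
| | sr | 19.3 | 21.9 | 19.6 | 22.6 | 31.3 | 30.6 |
| | spl | 13.3 | 11.6 | 17.2 | 14.1 | 28.3 | 27.8 |
| | cls | 32.1 | 26.2 | 43.2 | 34.3 | 49.9 | 50.0 |
| | ndtw | 31.9 | 30.8 | 39.7 | 32.4 | 49.5 | 49.4 |
| | sdtw | 13.1 | 13.3 | 15.3 | 14.3 | 25.9 | 25.4 |
| | pl | 25.2 | 33.0 | 11.6 | 25.7 | 18.1 | 17.7 |
| | ne | 8.7 | 8.6 | 8.5 | 8.4 | 8.4 | 8.2 |
| | sr | 24.2 | 22.4 | 23.6 | 25.4 | 24.3 | 24.3 |
| | spl | 13.7 | 9.3 | 17.5 | 10.6 | 12.8 | 12.9 |
| | cls | 25.8 | 21.4 | 35.8 | 34.8 | 48.6 | 48.6 |
| | ndtw | 22.9 | 20.6 | 29.8 | 26.5 | 39.0 | 39.4 |
| | sdtw | 9.3 | 7.5 | 10.8 | 11.1 | 15.1 | 15.1 |
| | pl | 43.0 | 52.8 | 14.2 | 29.9 | 38.3 | 36.8 |
| | ne | 9.9 | 9.9 | 9.6 | 9.7 | 10.2 | 10.0 |
| | sr | 20.1 | 20.3 | 20.3 | 22.4 | 20.8 | 21.0 |
| | spl | 11.2 | 9.4 | 14.9 | 8.1 | 6.6 | 6.8 |
| | cls | 20.6 | 18.3 | 27.7 | 38.9 | 45.9 | 46.3 |
| | ndtw | 16.3 | 15.2 | 21.9 | 22.2 | 28.4 | 29.3 |
| | sdtw | 5.6 | 5.0 | 6.4 | 6.8 | 9.6 | 9.9 |
| Average | pl | 27.6 | 35.1 | 11.3 | 23.7 | 21.9 | 21.2 |
| | ne | 8.8 | 8.5 | 8.6 | 8.5 | 8.5 | 8.3 |
| | sr | 21.2 | 21.5 | 21.2 | 23.5 | 25.5 | 25.3 |
| | spl | 12.7 | 10.1 | 16.5 | 10.9 | 15.9 | 15.8 |
| | cls | 26.2 | 22.0 | 35.6 | 36.0 | 48.1 | 48.3 |
| | ndtw | 23.7 | 22.2 | 30.5 | 27.0 | 39.0 | 39.4 |
| | sdtw | 9.3 | 8.6 | 10.8 | 10.7 | 16.9 | 16.8 |

(d) r8r trained model

| Datasets | Metrics | seq2seq | sf | rcm(goal) | rcm(fidelity) | BabyWalk | BabyWalk |
|---|---|---|---|---|---|---|---|
| | pl | 13.7 | 19.3 | 7.8 | 17.8 | 9.1 | 9.8 |
| | ne | 7.6 | 7.3 | 8.0 | 8.2 | 6.8 | 6.7 |
| | sr | 18.7 | 23.4 | 14.8 | 19.2 | 30.0 | 32.1 |
| | spl | 13.3 | 12.9 | 12.9 | 10.6 | 27.0 | 28.2 |
| | cls | 32.7 | 26.6 | 37.9 | 28.9 | 49.5 | 49.3 |
| | ndtw | 32.4 | 29.9 | 34.9 | 25.9 | 48.9 | 48.9 |
| | sdtw | 12.7 | 14.5 | 11.1 | 10.5 | 24.6 | 26.2 |
| | pl | 23.1 | 31.7 | 11.1 | 32.5 | 17.4 | 19.0 |
| | ne | 8.7 | 8.8 | 8.7 | 9.2 | 8.2 | 8.5 |
| | sr | 23.6 | 21.8 | 23.2 | 21.7 | 24.4 | 24.4 |
| | spl | 15.1 | 10.5 | 18.2 | 7.4 | 12.6 | 12.5 |
| | cls | 24.9 | 20.8 | 32.3 | 29.4 | 48.1 | 48.5 |
| | ndtw | 22.3 | 19.7 | 26.4 | 20.6 | 39.1 | 38.5 |
| | sdtw | 8.8 | 7.7 | 9.3 | 8.4 | 14.9 | 15.2 |
| | pl | 30.9 | 42.2 | 11.9 | 39.9 | 26.6 | 29.2 |
| | ne | 9.7 | 9.9 | 9.9 | 10.1 | 9.0 | 9.3 |
| | sr | 15.4 | 14.7 | 14.8 | 20.0 | 22.9 | 22.9 |
| | spl | 8.6 | 6.7 | 11.6 | 5.3 | 8.4 | 7.9 |
| | cls | 22.2 | 18.5 | 29.1 | 33.5 | 46.9 | 46.6 |
| | ndtw | 18.5 | 15.9 | 22.5 | 20.1 | 33.3 | 31.8 |
| | sdtw | 5.5 | 4.7 | 6.0 | 7.8 | 12.1 | 11.8 |
| Average | pl | 22.6 | 31.1 | 10.3 | 30.1 | 17.7 | 19.3 |
| | ne | 8.7 | 8.7 | 8.9 | 9.2 | 8.0 | 8.2 |
| | sr | 19.2 | 20.0 | 17.6 | 20.3 | 25.8 | 26.5 |
| | spl | 12.3 | 10.0 | 14.2 | 7.8 | 16.0 | 16.2 |
| | cls | 26.6 | 22.0 | 33.1 | 30.6 | 48.2 | 48.1 |
| | ndtw | 24.4 | 21.8 | 27.9 | 22.2 | 40.4 | 39.7 |
| | sdtw | 9.0 | 9.0 | 8.8 | 8.9 | 17.2 | 17.7 |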
For completeness, we also include all the transfer results of navigation agents trained on r2r, r4r, r6r, and r8r, respectively. The complete results for all metrics are included in Table 10. According to this table, models trained on r8r achieve the best overall transfer performance. This could be because the r8r-trained model only needs to interpolate to shorter instructions, rather than extrapolate to longer ones, which is intuitively an easier direction.
We present more qualitative results of various VLN agents in Fig. 8. BabyWalk appears to produce trajectories that align better with the human expert trajectories.