Imagine finding yourself in a large conference hall, with an assistant giving you instructions on how to reach the room for your talk. You are likely to hear something like: turn right at the end of the corridor, head upstairs and reach the third floor: your room is immediately on the left. Succeeding in the task of finding your target location is rather nontrivial because of the length of the instruction and its sequential nature: the flow of actions must be coordinated with a series of visual examinations – like recognizing the end of the corridor or the floor number. Furthermore, navigation complexity dramatically increases if the environment is unknown, and no prior knowledge, such as a map, is available.
Vision-and-Language Navigation (VLN) [anderson2018vision] is a challenging task that requires an embodied agent to reach a target location by navigating unseen environments, with a natural language instruction as its only clue. As in the previous example, the agent must address different sub-tasks to succeed. First, a fine-grained comprehension of the given instruction is needed. Then, the agent must be able to map parts of the description onto its visual perception. For example, walking past the piano requires finding and focusing on the piano, rather than considering other objects in the scene. Finally, the agent needs to understand when the navigation has been completed and send a stop signal.
VLN was first proposed by Anderson et al. [anderson2018vision], with the aim of connecting the research efforts on vision-and-language understanding [vinyals2015show, xu2015show, Anderson_2018, VQA, balanced_vqa_v2, visdial, visdial_rl] with the rising area of embodied AI [das2018embodied, das2018neural, xia2018gibson, anderson2018evaluation]. This is particularly challenging, as embodied agents must deal with a series of issues that do not belong to traditional vision and language tasks [anderson2018evaluation], like contextual decision-making and planning. Recent works on VLN [fried2018speaker, ma2019self, ma2019regretful, tan2019learning] equip the agent with a simplified action space in which it “only needs to make high-level decisions as to which navigable direction to go next” [fried2018speaker]. In this scenario, the agent does not need to infer the sequence of actions to progress in the environment (e.g., turn right 30 degrees, then move forward), but it exploits a navigation graph to teleport itself to an adjacent location. The adoption of this high-level action space allowed for a significant boost in success rates, while partly depriving the task of its embodied nature and leaving space for little more than pure visual and language understanding. We claim that this type of approach is inconvenient, as it strongly relies on prior knowledge about the environment. Depending on information such as the position and the availability of navigable directions, it reduces the task to pure graph navigation. Moreover, it ignores the active role of the agent, as it only perceives the surrounding scene and selects the next target viewpoint from a limited set. We claim instead that the agent should be the principal component of embodied VLN [anderson2018evaluation]. Consequently, the output space should match the low-level set of movements that the agent can perform.
In this paper, we propose a novel architecture for embodied VLN which employs dynamic convolutional filters [li2017tracking] to identify the next target direction, without getting any information about the navigable viewpoints from the simulator. Convolutional filters are produced via an attention mechanism which follows the given instruction, and are in turn used to attend to the relevant directions of the scene towards which the agent should move. We then rely on a policy network to predict the sequence of low-level actions.
Dynamic convolutional filters, proposed by Li et al. [li2017tracking], were first conceived to identify and track people by a natural language specification. They were then successfully employed in other computer vision tasks, such as actor and action video segmentation from a sentence [gavrilyuk2018actor]. Nonetheless, these works considered mainly short descriptions, while we deal with complex sentences and long-term dependencies. We generate dynamic filters according to the given instruction, to extract only the relevant information from the visual context. In this way, the same observation can lead to different feature maps, depending on the part of the instruction that the agent must complete (Fig. 1).
The proposed method is competitive with prior work that performs high-level navigation exploiting information about the reachable viewpoints (i.e., the navigation graph). Additionally, our approach is fully compliant with recent recommendations for embodied navigation [anderson2018evaluation]. When compared with models that are compliant with the VLN setup, we surpass the current state of the art by a significant margin.
To sum up, our contributions are as follows:
We propose a new encoder-decoder architecture for embodied VLN, which for the first time employs dynamic convolutional filters to attend to relevant regions of the visual scene and control the actions of the agent.
We show, through extensive experimental evaluations, that in a mutable environment with shifting goals dynamic convolutional filters can provide better performance than traditional convolutional filters. Results show that our proposed architecture surpasses the state of the art on the embodied VLN task.
As a complementary contribution, we categorize previous work on VLN based on its level of abstraction and generalizability. We distinguish a group of works that strongly rely on the simulating platform and on the navigation graph, which we call high-level actions models. A second group, named low-level actions models, includes methods that are more agnostic to the underlying implementation and that directly predict agent actions. With this categorization, we hope to encourage further research to consider low-level and high-level action spaces as distinct fields of application when dealing with VLN.
We propose an encoder-decoder architecture for vision-and-language navigation. Our work employs dynamic convolutional filters conditioned on the current instruction to extract the relevant piece of information from the visual perception, which is in turn used to feed a policy network that controls the actions performed by the agent. The output of our model is a probability distribution over a low-level action space, which comprises the following actions: turn 30° left, turn 30° right, raise elevation, lower elevation, go ahead, <end episode>. The output probability distribution at a given step, $p(a_t)$, depends on a natural language instruction $w$, the current visual observation $I_t$, and on the policy hidden state at time step $t-1$. Our architecture is depicted in Fig. 2 and detailed next.
To represent the two inputs of the architecture, i.e., the instruction and the visual input at time $t$, we devise an instruction encoder and a visual encoder. The instruction encoder provides a representation of the navigation instructions that is employed to guide the whole navigation episode. On the other hand, the visual encoding module operates at every single step, building a representation of the current observation which depends on the agent position.
Instruction Encoding. The given natural language instruction is split into single words via tokenization, and stop words are filtered out to obtain a shorter description. Differently from previous works that train word embeddings from scratch, we rely on word embeddings obtained from a large corpus of documents. Besides providing semantic information which could not be learned purely from VLN instructions, this also lets us handle words that are not present in the training set (see Sec. 3.2 for a discussion). Given an instruction with length $N$, we denote its embedding sequence as $L = (l_1, l_2, \dots, l_N)$, where $l_i$ indicates the embedding of the $i$-th word. Then, we adopt a Long Short-Term Memory (LSTM) network to provide a timewise contextual representation of the instruction:
$$X = (x_1, x_2, \dots, x_N) = \mathrm{LSTM}(L),$$
where each $x_i$ denotes the hidden state of the LSTM at time $i$, thus leading to a final representation $X$ with shape $(N, d)$, where $d$ is the size of the LSTM hidden state.
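The preprocessing described above can be illustrated with a short sketch. It is not the original implementation: the stop-word list, the toy embedding table (standing in for pre-trained vectors), and the zero-vector fallback for unknown words are all assumptions made for illustration, and the LSTM that consumes the embeddings is omitted.

```python
import re

# Hypothetical stop-word list and toy embedding table, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "at"}
EMBEDDINGS = {
    "walk": [0.1, 0.3],
    "past": [0.0, -0.2],
    "piano": [0.7, 0.5],
    "stop": [-0.4, 0.2],
}
UNK = [0.0, 0.0]  # assumed fallback for words missing from the toy table


def tokenize(instruction):
    """Lowercase the instruction and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", instruction.lower())


def encode_instruction(instruction):
    """Return (tokens, embeddings) after stop-word filtering."""
    tokens = [t for t in tokenize(instruction) if t not in STOP_WORDS]
    vectors = [EMBEDDINGS.get(t, UNK) for t in tokens]
    return tokens, vectors


tokens, vectors = encode_instruction("Walk past the piano and stop at the door.")
```

The resulting sequence of vectors plays the role of the embedding sequence that is fed to the instruction LSTM.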
Visual Features Encoding. As visual input, we employ the panoramic 360° view of the agent, and discretize the resulting equirectangular image in a $12 \times 3$ grid, consisting of three different elevation levels and twelve headings with a 30° shift from each other. Each location of the grid is then encoded via the 2048-dimensional features extracted from a ResNet-152 [he2016deep] pre-trained on ImageNet [deng2009imagenet]. We also append to each cell vector a set of coordinates relative to the current agent heading and elevation:
$$coord_t = (\sin\phi_t, \cos\phi_t, \sin\theta_t),$$
where $\phi_t$ and $\theta_t$ are the heading and elevation angles w.r.t. the agent position. By adding $coord_t$ to the image feature map, we encode information related to concepts such as right, left, above, below into the agent observation.
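As an illustration of how the grid and the appended coordinates fit together, the following NumPy sketch builds a 3 × 12 feature grid and appends three coordinates to each cell. The elevation levels of -30°, 0° and +30°, the sine/cosine form of the coordinates, and the function name `append_coordinates` are assumptions of this sketch, and zero-valued features stand in for the ResNet-152 output.

```python
import numpy as np

N_ELEV, N_HEAD, FEAT_DIM = 3, 12, 2048  # 3 elevations x 12 headings, 30 degrees apart


def append_coordinates(features, agent_heading=0.0, agent_elevation=0.0):
    """Append relative-orientation coordinates to each cell of the grid.

    features: array of shape (N_ELEV, N_HEAD, FEAT_DIM).
    Angles are in radians. The (sin heading, cos heading, sin elevation)
    encoding and the elevation levels are assumptions of this sketch.
    """
    headings = np.deg2rad(np.arange(N_HEAD) * 30.0) - agent_heading
    elevations = np.deg2rad(np.array([-30.0, 0.0, 30.0])) - agent_elevation
    # phi[i, j] is the heading of column j, theta[i, j] the elevation of row i.
    phi, theta = np.meshgrid(headings, elevations)
    coords = np.stack([np.sin(phi), np.cos(phi), np.sin(theta)], axis=-1)
    return np.concatenate([features, coords], axis=-1)


grid = append_coordinates(np.zeros((N_ELEV, N_HEAD, FEAT_DIM)))
```

Each cell now carries 2048 visual features plus 3 coordinates, so a cell straight ahead at eye level ends with the values (0, 1, 0).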
Given the instruction embedding $X$ for the whole episode, we use an attention mechanism to select the next part of the sentence that the agent has to fulfill. We denote this encoded piece of instruction as $s_t$. We detail our attentive module in the next section.
Dynamic Convolutional Filters. Dynamic filters are different from the traditional, fixed filters typically used in CNNs, as they depend on an input rather than being purely learnable parameters. In our case, we can think about them as specialized feature extractors reflecting the semantics of the natural language specification. For example, starting from an instruction like “head towards the red chair”, our model can learn specific filters to focus on concepts such as red and chair. In this way, our model can rely on a large ensemble of specialized kernels and apply only the most suitable ones, depending on the current goal. This approach is more efficient and flexible than learning a fixed set of filters for all the navigation steps. We use the representation of the current piece of instruction $s_t$ to generate multiple dynamic convolutional kernels, according to the following equation:
$$f_t = \ell_2\left[\tanh(W_f s_t + b_f)\right],$$
where $\ell_2[\cdot]$ indicates L2 normalization, and $f_t$ is a tensor of filters reshaped to have the same number of channels as the image feature map. We then perform the dynamic convolution over the image features $I_t$, thus obtaining a response map $R_t$ for the current timestep as follows:
$$R_t = f_t * I_t.$$
As the aforementioned operation is equivalent to a dot product, we can conceive the dynamic convolution as a specialized form of dot-product attention, in which the image feature map $I_t$ acts as key and the filters in $f_t$ act as time-varying queries. Following this interpretation, we rescale $R_t$ by $1/\sqrt{d_f}$, where $d_f$ is the dynamic filter size [vaswani2017attention], to keep dot products smaller in magnitude.
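The dot-product view of the dynamic convolution can be made concrete with a small NumPy sketch. It is a simplified illustration under stated assumptions, not the original implementation: the filters are 1×1 kernels, normalization is applied per filter, the `tanh` non-linearity and the names `dynamic_filters` and `dynamic_convolution` are hypothetical choices, and the sizes are toy values.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit L2 norm."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def dynamic_filters(s_t, W_f, b_f, n_filters, channels):
    """Generate `n_filters` kernels with `channels` weights each from s_t."""
    f = np.tanh(W_f @ s_t + b_f)            # (n_filters * channels,)
    f = f.reshape(n_filters, channels)      # one row per dynamic filter
    return l2_normalize(f, axis=-1)         # per-filter normalization (assumed)


def dynamic_convolution(image_feats, filters):
    """Apply 1x1 dynamic filters as scaled dot products.

    image_feats: (H, W, C); filters: (n_filters, C).
    Returns a response map of shape (H, W, n_filters).
    """
    d_f = filters.shape[-1]
    return image_feats @ filters.T / np.sqrt(d_f)


rng = np.random.default_rng(0)
d, C, n_filters = 16, 8, 4                  # toy sizes, not the paper's
s_t = rng.normal(size=d)                    # current piece of instruction
W_f = rng.normal(size=(n_filters * C, d))
b_f = np.zeros(n_filters * C)
image_feats = rng.normal(size=(3, 12, C))   # 3 x 12 grid of visual features

filters = dynamic_filters(s_t, W_f, b_f, n_filters, C)
response = dynamic_convolution(image_feats, filters)
```

Because the filters are recomputed from `s_t` at every step, the same `image_feats` yields a different `response` whenever the attended part of the instruction changes.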
Action Selection. We use the dynamically generated response maps as input for the policy network. We implement it with an LSTM whose hidden state $h_t$ at time step $t$ is employed to obtain the action scores. Formally,
$$h_t = \mathrm{LSTM}([\hat{R}_t, a_{t-1}], h_{t-1}), \qquad p(a_t) = \mathrm{softmax}(W_a h_t + b_a),$$
where $[\cdot, \cdot]$ indicates concatenation, $a_{t-1}$ is the one-hot encoding of the action performed at the previous timestep, and $\hat{R}_t$ is the flattened tensor obtained from $R_t$. To select the next action $a_t$, we sample from a multinomial distribution parametrized with the output probability distribution during training, and select $a_t = \arg\max p(a_t)$ at test time. In line with previous work, we find that sampling during the training phase encourages exploration and improves overall performance.
Note that, as previously stated, we do not employ a high-level action space, where the agent selects the next viewpoint in the image feature map, but instead make the agent responsible for learning the sequence of low-level actions needed to perform the navigation. The agent can additionally send a specific stop signal when it considers the goal reached, as suggested by recent standardization attempts [anderson2018evaluation].
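The difference between training-time sampling and test-time greedy selection over the low-level action space can be sketched as follows. The action identifiers and the `select_action` helper are hypothetical names introduced for illustration, and the probabilities are made-up values rather than model outputs.

```python
import random

# Low-level action space described in the text (identifiers are illustrative).
ACTIONS = [
    "turn_left_30", "turn_right_30", "raise_elevation",
    "lower_elevation", "go_ahead", "end_episode",
]


def select_action(probs, training, rng=random):
    """Sample an action index during training, take the argmax at test time."""
    if training:
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return max(range(len(probs)), key=probs.__getitem__)


probs = [0.05, 0.10, 0.05, 0.05, 0.70, 0.05]  # made-up policy output
greedy = ACTIONS[select_action(probs, training=False)]
sampled = ACTIONS[select_action(probs, training=True, rng=random.Random(0))]
```

Greedy selection always returns the most probable action, while sampling occasionally picks less likely ones, which is what encourages exploration during training.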
2.3 Encoder-Decoder Attention
The navigation instructions are very complex, as they involve not only different actions but also temporal dependencies between them. Moreover, their high average length represents an additional challenge for traditional embedding methods. For these reasons, we enrich our architecture with a mechanism to attend to different locations of the sentence representation as the navigation moves towards the goal. In line with previous work on VLN [anderson2018vision, fried2018speaker], we employ an attention mechanism to identify the most relevant parts of the navigation instruction. We employ the hidden state of our policy LSTM to get the information about our progress in the navigation episode and extract a time-varying query $q_t$. We then project our sentence embedding $X$ into a lower dimensional space to obtain key vectors $K$, and perform a scaled dot-product attention [vaswani2017attention] among them:
$$q_t = h_{t-1} W_q + b_q, \qquad K = X W_K + b_K, \qquad \alpha_t = \frac{q_t K^\top}{\sqrt{d_{att}}},$$
where $d_{att}$ is the size of the query and key vectors. After a softmax layer, we obtain the current instruction embedding $s_t$ by matrix multiplication between the initial sentence embedding and the softmax scores:
$$s_t = \mathrm{softmax}(\alpha_t)\, X.$$
At each timestep of the navigation process, $s_t$ is obtained by attending the instruction embedding at different locations. The same vector is in turn used to obtain a time-varying query for attending spatial locations in the visual input.
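A minimal NumPy sketch of this attention step is given below. The purely linear projections, the parameter names, and the toy dimensions are assumptions made for illustration; the original module may include further details (such as non-linearities) that are not shown here.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def attend_instruction(h_prev, X, W_q, b_q, W_k, b_k):
    """Return (s_t, weights) for policy state h_prev and instruction X of shape (N, d)."""
    q_t = W_q @ h_prev + b_q                    # query, shape (d_att,)
    K = X @ W_k.T + b_k                         # keys, shape (N, d_att)
    scores = K @ q_t / np.sqrt(K.shape[-1])     # scaled dot products, shape (N,)
    weights = softmax(scores)                   # attention over the N words
    s_t = weights @ X                           # attended instruction, shape (d,)
    return s_t, weights


rng = np.random.default_rng(0)
N, d, d_att = 5, 8, 4                           # toy sizes, not the paper's
X = rng.normal(size=(N, d))                     # encoded instruction
h_prev = rng.normal(size=d)                     # policy hidden state
W_q, b_q = rng.normal(size=(d_att, d)), np.zeros(d_att)
W_k, b_k = rng.normal(size=(d_att, d)), np.zeros(d_att)

s_t, weights = attend_instruction(h_prev, X, W_q, b_q, W_k, b_k)
```

Since the query depends on the policy hidden state, the attention weights shift along the sentence as the episode progresses.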
Our training sample consists of a batch of navigation instructions and the corresponding ground truth paths coming from the R2R (Room-to-Room) dataset [anderson2018vision] (described in section 3). The path denotes a list of discretized viewpoints that the agent has to traverse to progress towards the goal. The agent spawns in the first viewpoint, and its goal is to reach the last viewpoint in the ground truth list. At each step, the simulator is responsible for providing the next ground truth action in the low-level action space that enables the agent to progress. Specifically, the ground truth action is computed by comparing the coordinates of the next target node in the navigation graph with the agent position and orientation. At each time step $t$, we minimize the following objective function:
$$L = -\sum_t y_t \log p(a_t),$$
where $p(a_t)$ is the output of our network, and $y_t$ is the ground truth low-level action provided by the simulator at time step $t$. We train our network with a batch size of 128 and use the Adam optimizer [kingma2015adam] with a learning rate of
. We adopt early stopping to terminate the training if the mean success rate does not improve for 10 epochs.
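The training objective and the stopping criterion can be sketched in a few lines of plain Python. This is an illustrative sketch under assumptions: the loss is written as a standard cross-entropy summed over the steps of one episode, and the `EarlyStopping` helper is a hypothetical name for the patience-based rule described above.

```python
import math


def cross_entropy(step_probs, gt_actions):
    """Sum over time steps of -log p(ground-truth action)."""
    return -sum(math.log(p[a]) for p, a in zip(step_probs, gt_actions))


class EarlyStopping:
    """Stop when the success rate has not improved for `patience` epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, success_rate):
        """Record one epoch and return True when training should stop."""
        if success_rate > self.best:
            self.best = success_rate
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Two-step toy episode with two actions per step (made-up probabilities).
loss = cross_entropy([[0.5, 0.5], [0.25, 0.75]], gt_actions=[0, 1])

stopper = EarlyStopping(patience=2)
decisions = [stopper.step(sr) for sr in (0.30, 0.30, 0.20)]
```

With a patience of 2, the third epoch without improvement over the best success rate triggers the stop.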
3.1 Experimental Settings
For our experiments, we employ the R2R (Room-to-Room) dataset [anderson2018vision]. This challenging benchmark builds upon the Matterport3D dataset of spaces [Matterport3D] and contains 7,189 different navigation paths in 90 different scenes. For each route, the dataset provides 3 natural language instructions, for a total of 21,567 instructions with an average length of 29 words. The R2R dataset is split into 4 partitions: training, validation on seen environments, validation on unseen scenes, and test on unseen environments.
We adopt the same evaluation metrics employed by previous work on the R2R dataset: navigation error (NE), oracle success rate (OSR), success rate (SR), and success rate weighted by path length (SPL). NE is the mean distance in meters between the final position and the goal. SR is the fraction of episodes terminated within no more than 3 meters from the goal position. OSR is the success rate that the agent would have achieved if it had received an oracle stop signal at the closest point to the goal along its navigation. SPL is the success rate weighted by normalized inverse path length, and penalizes overlong navigations.
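For concreteness, the four metrics can be computed as in the sketch below. The episode dictionary keys and the `evaluate` function are hypothetical names chosen for illustration, the episodes are made-up values, and the SPL term follows the usual definition (success times shortest-path length divided by the larger of the traveled and shortest lengths), which we assume matches the one used here.

```python
def evaluate(episodes, threshold=3.0):
    """Compute NE, SR, OSR and SPL over a list of episodes.

    Each episode is a dict with hypothetical keys:
      final_dist   distance (m) from the stop position to the goal
      oracle_dist  smallest distance (m) to the goal along the path
      path_length  length (m) of the path taken by the agent
      shortest     length (m) of the shortest path from start to goal
    """
    n = len(episodes)
    ne = sum(e["final_dist"] for e in episodes) / n
    sr = sum(e["final_dist"] <= threshold for e in episodes) / n
    osr = sum(e["oracle_dist"] <= threshold for e in episodes) / n
    spl = sum(
        (e["final_dist"] <= threshold)
        * e["shortest"] / max(e["path_length"], e["shortest"])
        for e in episodes
    ) / n
    return {"NE": ne, "SR": sr, "OSR": osr, "SPL": spl}


episodes = [
    {"final_dist": 1.0, "oracle_dist": 0.5, "path_length": 12.0, "shortest": 10.0},
    {"final_dist": 5.0, "oracle_dist": 2.0, "path_length": 8.0, "shortest": 8.0},
]
metrics = evaluate(episodes)
```

In this toy example the second episode passes within 3 meters of the goal but stops too far away, so it counts towards OSR but not towards SR or SPL.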
Implementation Details. For each LSTM, we set the hidden size to 512. Word embeddings are obtained with GloVe [pennington2014glove]. In our visual encoder, we apply a bottleneck layer to reduce the dimension of the image feature map to 512. We generate dynamic filters with 512 channels using a linear layer with dropout [srivastava2014dropout] (). In our attention module, andto the policy hidden state before feeding it to the linear layer.
3.2 Ablation Study
|Method||Val-seen NE (m)||Val-seen SR||Val-seen OSR||Val-seen SPL||Val-unseen NE (m)||Val-unseen SR||Val-unseen OSR||Val-unseen SPL|
|Baseline w/ traditional convolution [anderson2018vision]||6.01||38.6||52.9||-||7.81||21.8||28.4||-|
|Ours w/o encoder-decoder attention||5.86||41.3||51.2||36.3||7.72||22.0||29.3||19.3|
|Ours w/o pre-trained embedding||5.62||42.0||54.0||36.3||7.32||25.8||33.3||22.1|
|Ours w/ dynamic filters||4.68||53.1||66.1||46.0||6.65||31.6||43.6||26.8|
In our ablation study, we test the influence of our implementation choices on VLN. As a first step, we discuss the impact of dynamic convolution by comparing our model with a similar seq2seq architecture that employs fixed convolutional filters. We then detail the importance of using an attention mechanism to extract the current piece of instruction to be fulfilled. Finally, we compare the results obtained using a pre-trained word embedding instead of learning the word representation from scratch. Results are reported in Table 1.
Static Filters Vs. Dynamic Convolution. As results show, dynamic convolutional filters surpass traditional fixed filters for VLN. This is because they can easily adapt to new instructions and reflect the variability of the task. When compared to a baseline model that employs traditional convolution [anderson2018vision], our method performs 14.5 and 9.8 points better, in terms of success rate, on the val-seen and val-unseen splits respectively.
Fixed Instruction Representation Vs. Attention. The navigation instructions are very complex and rich. When removing the attention module from our architecture, we keep the last hidden state as instruction representation for the whole episode. Even with this limitation, dynamic filters achieve better results than static convolution, as the success rate is higher for both of the validation splits. However, our attention module further increases the success rate by 11.8 and 9.6 points.
Word Embedding from Scratch Vs. Pre-trained Embedding. Learning a meaningful word embedding is nontrivial and requires a large corpus of natural language descriptions. For this reason, we adopt a pre-trained word embedding to encode single words in our instructions. We then run the same model while trying to learn the word embedding from scratch. We find that a pre-trained word embedding significantly eases VLN. Our model with GloVe [pennington2014glove] obtains 11.1 and 5.8 points more on the val-seen and val-unseen splits respectively, in terms of success rate.
3.3 Multi-headed Dynamic Convolution
In this experiment, we test the impact of using a different number of dynamically generated filters. We test our architecture when using 1, 2, 4, 8, and 16 dynamic filters. We find that the best setup corresponds to the use of 4 different convolutional filters. Results in Fig. 3 show that the success rate and the SPL increase linearly with the number of dynamic kernels for a small number of filters, reaching a maximum at 4. The metrics then decrease when adding new parameters to the network. This suggests that a low number of dynamic filters can represent a wide variety of natural language specifications. However, as the number of dynamic filters increases, the representation provided by the convolution becomes less efficient.
3.4 Comparison with the State-of-the-art
|Low-level Actions Methods||NE||SR||OSR||SPL||NE||SR||OSR||SPL||NE||SR||OSR||SPL|
|Ours w/ data augmentation||3.96||0.58||0.73||0.51||6.52||0.34||0.43||0.29||6.55||0.35||0.45||0.31|
|High-level Actions Methods||NE||SR||OSR||SPL||NE||SR||OSR||SPL||NE||SR||OSR||SPL|
Finally, we compare our architecture with the state-of-the-art methods for VLN. Results are reported in Table 2. We distinguish two main categories of models, depending on their output space: the first, to which our approach belongs, predicts the next atomic action (e.g., turn right, go ahead). We call architectures in this category low-level actions methods. The second, instead, searches in the visual space to match the current instruction with the most suitable navigable viewpoint. In these models, atomic actions are not considered, as the agent displacements are done with a teleport system, using the next viewpoint identifier as target destination. Hence, we refer to these works as high-level actions methods. While the latter achieve better results, they make strong assumptions on the underlying simulating platform and on the navigation graph. Our method, exploiting dynamic convolutional filters and predicting atomic actions, outperforms comparable architectures and achieves state-of-the-art results for low-level actions VLN. Our final implementation takes advantage of the synthetic data provided by Fried et al. [fried2018speaker] and overcomes comparable methods [anderson2018vision, wang2018look] by and success rate points on the R2R test set. Additionally, we note that our method is competitive with some high-level actions models, especially in terms of SPL. When considering the test set, we notice in fact that our model outperforms Speaker-Follower [fried2018speaker] by , while performing only worse than [ma2019self].
Low-level Action Space or High-level Navigation Space? While previous work on VLN never considered this important difference, we claim that it is imperative to categorize navigation architectures depending on their output space. In our opinion, ignoring this aspect would lead to inappropriate comparisons and wrong conclusions. Considering the results in Table 2, we separate the two classes of work and highlight the best results for each category. Please note that the random baseline was initially provided by [anderson2018vision] and belongs to low-level actions architectures (a random high-level actions agent was never provided by previous work). We immediately notice that, with this new categorization, intra-class results have less variance and are much more aligned to each other. We believe that future work on VLN should consider this new taxonomy in order to provide meaningful and fair comparisons.
3.5 Qualitative Results
Fig. 4 shows two navigation episodes from the R2R validation set. We display the predicted action in a green box on the bottom-right corner of each image. Both examples are successful.
|Instruction: From bathroom, enter bedroom and walk straight across down two steps, wait at loungers.|
|Instruction: Walk past the fireplace and to the left. Stop in the entryway of the kitchen.|
In this paper, we propose dynamic convolution for embodied Vision-and-Language Navigation. Instead of relying on a high-level action space, where the agent is teleported from one viewpoint to the other, we predict a series of actions in an agent-friendly action space. Based on this substantial difference, we propose a new categorization based on the model output space. We then separate previous VLN architectures into low-level actions and high-level actions methods. We claim that comparisons made considering this new taxonomy are fairer and more reasonable than previous analyses. Our method with dynamic convolutional filters achieves state-of-the-art results for the low-level actions category, and it is competitive with high-level actions architectures that rely on much more information and have a higher level of abstraction during the navigation episode. We hope this work encourages further research on low-level VLN, and in general we consider this a step towards the use of more realistic action spaces for this task. While our experiments show promising results in this setting, much work remains to inspect the possible connections between low-level and high-level Vision-and-Language Navigation.
Acknowledgements: This work was partially supported by the Fondazione Cassa di Risparmio di Modena project “AI for Digital Humanities” (Prot. n. 505.18.8b del 18/10/2018 - Pratica Sime n. 2018.0390). We also want to thank the anonymous reviewers for their insightful remarks and their constructive criticism.