Vision-and-Language Navigation (VLN) tasks define a comprehensive problem: an embodied agent is placed at a location in a photo-realistic house and must navigate to a specified target by following natural language instructions. VLN has attracted growing research interest because it involves multi-modal data. One of the biggest challenges in this task is getting the agent to act appropriately in an unseen environment. This in turn requires the agent to learn human behaviours for understanding and exploring a scene, rather than memorising it.
Current VLN models [2, 4, 7, 9, 10] rely heavily on behavioural cloning (BC), which treats expert behaviours as strong supervision signals. This enables agents to perform better in seen scenarios, but they struggle in unseen environments because of error accumulation. As stated in , teacher-forcing models suffer from distribution shift because they greedily imitate the demonstrated expert actions.
Some other works [16, 19] instead combine reinforcement learning (RL) with supervised learning to overcome the error accumulation caused by hard behavioural cloning. However, reward engineering in RL has its own problems: a reward function designed for one environment or task may not generalise well to other scenarios, and in many practical, complicated tasks it is hard to define a concrete reward function such as a game score. Moreover, a hand-crafted reward targets a particular functionality and therefore inevitably fails to account for the system dynamics comprehensively. Designing a reward function thus requires careful manual tuning, and environment-oriented reward design generalises poorly, which may harm model performance at inference time.
In this paper, we propose a Soft Expert Reward Learning (SERL) model to address the above issues. Our method consists of two orthogonal parts: the Soft Expert Distillation (SED) module, which portrays the expert data distribution by distilling knowledge from a random projection space, and the Self Perceiving (SP) module, which encourages the agent to reach the goal as soon as possible. For the SED module, the intuition is that a higher reward should be assigned to an agent that takes an action “close” to that of its expert. To measure this similarity continuously, we adopt a density function that reflects it in a soft manner rather than applying behaviour cloning directly. The density function measures the similarity between the observation-action pairs of the expert and those of the agent in a randomly projected space, thereby transforming expert behaviour into a soft reward signal for the reinforcement learning branch. For the SP module, our model first predicts its schedule towards the target location and then uses the predicted schedule information as an additional reward. As a result, the agent can perceive its current schedule and use it to push itself towards the goal.
The two newly designed reward modules are complementary: the SED reward encourages the agent to behave like an expert, while the softened behaviour-imitation process makes it more robust; the SP module pushes the agent towards the final destination by introducing the current schedule information as another intrinsic reward signal. In summary, this paper makes the following three main contributions.
We propose a Soft Expert Distillation (SED) formulation, which is very simple yet offers a highly effective reward signal for obtaining expressive navigational ability. The SED reward encourages the agent to have a better alignment with its expert in a soft manner.
We introduce a Self Perceiving (SP) reward, complementary to the SED reward, which helps the agent use its current schedule information to reach the destination as soon as possible.
We show that our instantiated model, termed SERL, outperforms current state-of-the-art competing methods on both the validation unseen and test unseen sets of the VLN Room-to-Room dataset.
2 Related Work
2.1 Vision-and-Language Navigation
To achieve promising performance on the Vision-and-Language Navigation (VLN) task, numerous methods have been proposed, as listed in Table 1. Many existing works adopt supervised learning and behaviour cloning based methods. The Seq2seq model is the most basic baseline; it uses an LSTM-based sequence-to-sequence architecture with an attention mechanism to predict the next action. The Speaker-Follower model designs a language model (“speaker”) to learn the relationship between visual and language information, as well as a policy network (“follower”) to take actions based on multi-modal inputs. It uses the “speaker” to synthesise new instructions for data augmentation and to help the policy network select routes. The FAST model is claimed to balance local and global signals while exploring an unobserved environment: it lets the agent act greedily but allows it to backtrack when global signals indicate that this is necessary. The self-monitoring model is a visual-language co-grounding framework that better fuses the instructions and visual inputs. Building upon the self-monitoring model, the Regretful Agent provides a strategy for the agent to retrieve and re-choose paths based on monitored progress.
The Reinforced Cross-modal Matching (RCM) approach uses reinforcement learning, together with imitation learning, to enforce cross-modal matching both locally and globally. The RCM model proposes an extrinsic reward that measures the reduction in distance to the target location after each action, as well as an intrinsic cross-modal matching reward between trajectories and instructions. Most recently, environment dropout has been introduced, which drops features channel-wise to address inconsistency across feature maps while combining behaviour cloning and reinforcement learning.
|Reinforced Cross-Modal ||✓||✓||✓|
|Regretful Agent ||✓|
However, these approaches require either exact imitation of the expert demonstrations or careful reward design. Behaviour cloning techniques lead to error accumulation, which can result in catastrophic failure when the agent explores unknown environments. Moreover, reward engineering requires careful manual tuning. This motivates us to propose the SERL model, which learns reward functions directly from the expert distribution.
2.2 Reward Learning
Reward engineering is commonly used to design reward functions for reinforcement learning algorithms. In conventional reinforcement learning tasks, such as playing Atari games, rewards are shaped individually by each game simulator. However, reward engineering has an obvious drawback: the reward functions are designed for specific environments and are therefore not generic. Several methods have been proposed to address this problem. The Inverse Reinforcement Learning (IRL) framework extracts reward functions from expert behaviours by updating both the reward function and the policy network. Random Expert Distillation (RED) proposes an expert policy support estimation method to distil rewards from given expert trajectories. Generative Adversarial Imitation Learning (GAIL) is another recently proposed model, which bypasses the reward function and learns expert behaviour directly with generative adversarial networks.
Compared with the IRL and GAIL models, our proposed Soft Expert Distillation module learns the distribution of expert demonstration data directly, by comparing the output similarity between a randomised network and a distillation network rather than relying on iterative model updating or generative adversarial networks. The RED model defines states and actions in relatively small spaces for the MuJoCo environment and its driving task, whereas we design our SED module for fundamentally different state and action spaces for navigation in photo-realistic Matterport3D environments. We are the first to introduce a soft expert reward learning framework into the Vision-and-Language Navigation task.
3 Soft Expert Reward Learning Model
3.1 Overview and Problem Definition
The Vision-and-Language Navigation task requires an agent placed in an unknown photo-realistic house to understand multi-modal data comprehensively, so that it can navigate to the specified location. The multi-modal data includes natural images and natural language instructions. More specifically, after the agent is spawned, its observation at each time step consists of 36 images of panoramic views. The navigable views are given as well, up to a maximum number of navigable viewpoints, together with a “stay” action. A natural language instruction consisting of a sequence of words is also given. Based on the visual and language information, an action is selected at each time step, and eventually a trajectory is formed. The objective of the VLN task is to find the optimal action at each step so as to reach the target location quickly while keeping the trajectory as short as possible. Since the Vision-and-Language Navigation task is a sequential decision problem, it can be modelled as a Markov Decision Process (MDP), described by a four-element tuple consisting of the state set, the action set, the environment dynamics and the reward function.
In this paper we introduce a Soft Expert Reward Learning model that distils the reward function directly from expert demonstrations and softens the behaviour cloning process to alleviate the drawbacks of error accumulation. The structure of our model is illustrated in Figure 1. We follow a standard Encoder-Decoder paradigm. The encoder acts as a multi-modal feature extractor that obtains features from both the visual images and the language instructions. The decoder is an LSTM (long short-term memory) network with an attention mechanism that predicts actions and is connected to two branches: the supervised learning branch helps the agent imitate the expert demonstration and perceive its current schedule towards the target location, while the reinforcement learning branch optimises the output action probability distribution from the reinforcement learning perspective. The key difference between our proposed SERL model and previous models is that we propose two novel intrinsic reward signals: the Soft Expert Distillation reward encourages the agent to align with expert actions, but in a soft fashion, and the Self Perceiving reward motivates the agent to reach the goal as fast as possible using predicted schedule information. In the following sections, we first introduce the Encoder-Decoder structure and then the two reward functions.
3.2 Encoder-Decoder Structure
An Encoder-Decoder structure (as shown in Figure 2) is adopted as the main structure of our method. Natural images and natural language instructions are fed into an encoder to extract the corresponding feature maps. Following [9, 16], we extract ResNet features of the navigable views and concatenate them with the orientation to form the visual features. We then use a Bi-Directional Long Short-Term Memory (Bi-LSTM) network to extract the language features. The multi-modal features are subsequently fed into a decoder, which outputs the probability vector of the next action.
On the encoder side, after the ResNet features of the different views have been pre-extracted, the feature map of each navigable view is concatenated with an orientation tag to form the visual feature:
where is a concatenation function.
For the language perspective, after each word of the instruction is tokenised into a vector, the token vectors are fed into a Bi-LSTM network to extract the language features . As Eqn. 2, formally we have
where is the corresponding i-th encoded word tokenised by Bi-LSTM.
On the decoder side, once the visual and language features are formed, they are fed, along with the last cross-modal hidden state, into soft attention layers to obtain the attentive visual and language features. Following prior work, environment dropout is applied to the visual features before the soft attention layer, giving feature-wise dropout that is consistent across views. Formally,
Together with the previously navigated view, the last cross-modal hidden state and the cell state, the attentive visual and language features are fed into an LSTM layer to form the cross-modal hidden state and cell state at the current step. This step is critical for the model to fuse the visual and language signals when choosing an action.
The action probability distribution for the next step is calculated as:
where represents a dropout function. The dot product is used hereafter for matrix multiplication operation.
The decoder is connected to two branches: supervised learning branch and reinforcement learning branch. These two branches optimise the outputted action probability distribution from two different learning paradigms. In this case, the total loss function is:
In the supervised learning branch, the cross-entropy loss between the predicted action logits and the one-hot vector of expert actions is calculated to force the agent to mimic its teacher’s behaviours. This loss is termed the behaviour cloning loss. Following prior work, in addition to the behaviour cloning loss, we adopt another loss for predicting the current schedule towards the goal. This loss, named the schedule loss, works as an additional supervisory signal. Formally, the loss function for the supervised learning branch is:
where the behaviour cloning loss is given in detail by:
where the two terms are the predicted action logits and the one-hot vector of expert actions at each step, respectively.
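The behaviour cloning loss described above can be illustrated with a minimal sketch. It assumes that the per-step cross-entropy terms are summed over the trajectory and that expert actions are given as indices rather than one-hot vectors; the function names are hypothetical and not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def behaviour_cloning_loss(logits_per_step, expert_actions):
    """Cross-entropy between predicted action logits and expert actions.

    logits_per_step: one list of logits per time step.
    expert_actions: index of the expert action at each step
                    (equivalent to a one-hot vector).
    Assumption: the per-step losses are summed over the trajectory.
    """
    loss = 0.0
    for logits, action in zip(logits_per_step, expert_actions):
        probs = softmax(logits)
        loss += -math.log(probs[action])
    return loss


# Two candidate actions with equal logits: the loss is log(2) per step.
example_loss = behaviour_cloning_loss([[0.0, 0.0]], [0])
```

With uniform logits the agent is maximally uncertain, so the loss equals the log of the number of candidate actions; it falls towards zero as the logit of the expert action grows.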
To calculate the schedule loss, the model predicts the distance improvement ratio at each step as its current schedule information. The L2 distance between the predicted schedule and the true schedule is then used as the loss function. Formally,
where the predicted schedule will be described in detail in the subsequent section and is compared against the corresponding true schedule value.
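As a rough illustration of the schedule loss, the sketch below compares predicted and true schedule values step by step. It assumes that the L2 distance is applied as a sum of squared differences over the trajectory; the text does not say whether the distance is squared or averaged, and `schedule_loss` is a hypothetical name.

```python
def schedule_loss(predicted, target):
    """Squared L2 distance between predicted and true schedule values.

    predicted, target: one scalar schedule value per time step.
    Assumption: squared differences are summed over the trajectory.
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


# The prediction is off by 0.5 at the first step only, so the loss is 0.25.
example_loss = schedule_loss([0.5, 1.0], [0.0, 1.0])
```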
As shown in the reinforcement learning branch of Figure 1, we adopt the actor-critic algorithm as our reinforcement learning method. For this branch, the training loss can be formally represented as:
where the critic provides the value function, and the discounted reward at each time step is formulated as:
in which a discount factor is applied to future rewards. The reward is made up of three parts: an extrinsic reward and two complementary, newly proposed reward functions, namely the Soft Expert Distillation (SED) reward and the Self Perceiving (SP) reward. The total reward function can thus be formalised as:
where (1) the SED reward is an automatically learnt reward function obtained by aligning the agent’s behaviours with the provided expert demonstrations; (2) the SP reward is derived from the predicted schedule and encourages the agent to reach the goal as soon as possible; (3) the extrinsic reward is positive if the agent stops within three metres of the target or reduces its distance to the goal, and negative otherwise. Two trade-off factors weight the SED reward and the SP reward respectively. The details of each proposed reward function are given in the following sections.
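The reward composition and discounting described above can be sketched as follows. This is a minimal illustration under stated assumptions: the extrinsic reward is taken to be unweighted, the two trade-off factors default to the value of 0.1 reported in the experimental setup, and the discount factor in the example is arbitrary because its value is not given in this section. All function names are hypothetical.

```python
def total_reward(r_extrinsic, r_sed, r_sp, w_sed=0.1, w_sp=0.1):
    """Combine the extrinsic reward with the two weighted intrinsic rewards.

    Assumption: the extrinsic reward is unweighted, while the SED and SP
    rewards are scaled by their trade-off factors.
    """
    return r_extrinsic + w_sed * r_sed + w_sp * r_sp


def discounted_returns(rewards, gamma):
    """Discounted reward for every time step, computed backwards in time."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Three steps with reward 1 each and an illustrative discount factor of 0.5.
example_returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
```

In an actor-critic setup, the difference between these discounted rewards and the critic's value estimates is typically used as the advantage that scales the policy gradient.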
3.3 Soft Expert Distillation
Inspired by Random Expert Distillation (RED), we propose to learn the reward function from the given expert demonstrations in the Vision-and-Language Navigation task. We train a neural network to predict the output of a randomly initialised but frozen network in order to distil the expert knowledge. The structure of the Soft Expert Distillation networks is shown in Figure 3. The key intuition is as follows: given a certain amount of randomly projected data, the representation learner is required to fit the structure of these data points in the random projection space so as to achieve a similar projected distribution. The learnt function is expected to predict relatively better where more expert data lies, and in this way a strong density function is formed. Through distillation, it models the likelihood that the agent performs an action similar to that of its expert in a given situation. A higher prediction distance, which in turn results in a lower SED reward, is assigned to unexpected observation-action pairs that differ from the given expert demonstrations. Thus, a higher reward is assigned to an agent that takes an action similar to that of its expert. This density-function view gives us another way of learning directly from expert demonstrations, distinct from IRL and GAIL.
Precisely, a given expert demonstration data point is first fed into a randomly initialised neural network with fixed weights; at the same time, it is fed into a distillation network with a different structure but the same output dimension. The data is thus projected into a new space by a representation learner with trainable parameters. We emphasise that the function capacity of the distillation network is lower than that of the randomised network, which helps prevent overfitting. Since we adopt the L2 distance as our loss, we formulate the subsequent step as a prediction task and define the loss function as:
Empirically, both the randomised network and the distillation network are implemented as multi-layer perceptrons. The randomised network plays the role of a random data mapping function that projects points into a randomly projected space. This loss thereby offers a simple yet powerful supervisory signal for the distillation network to learn semantic-rich feature representations from the given expert data processed by the random projection function.
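The distillation step can be illustrated with a small, dependency-free sketch. It is a simplification under stated assumptions and not the paper's implementation: the frozen random network has a single hidden layer with a leaky ReLU, the lower-capacity distillation network is reduced to a linear map, the data is random toy data rather than observation-action pairs, and training uses plain full-batch gradient descent on the mean squared L2 distance. All names and sizes are hypothetical.

```python
import random


def leaky_relu(v, slope=0.01):
    return [x if x > 0 else slope * x for x in v]


def linear(weights, v):
    # Each row holds len(v) weights followed by one bias term.
    return [sum(w * x for w, x in zip(row, v)) + row[-1] for row in weights]


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols + 1)] for _ in range(rows)]


def make_target(rng, in_dim, hidden, out_dim):
    """Randomly initialised network whose weights are never updated."""
    w1 = random_matrix(rng, hidden, in_dim)
    w2 = random_matrix(rng, out_dim, hidden)
    return lambda v: linear(w2, leaky_relu(linear(w1, v)))


def mean_loss(target, pred_w, data):
    """Mean squared L2 distance between predictor and frozen target outputs."""
    total = 0.0
    for v in data:
        total += sum((p - t) ** 2 for p, t in zip(linear(pred_w, v), target(v)))
    return total / len(data)


def train_predictor(target, data, out_dim, lr=0.05, epochs=200):
    """Fit a linear predictor to the frozen target on expert data only."""
    in_dim = len(data[0])
    w = [[0.0] * (in_dim + 1) for _ in range(out_dim)]
    for _ in range(epochs):
        grad = [[0.0] * (in_dim + 1) for _ in range(out_dim)]
        for v in data:
            err = [p - t for p, t in zip(linear(w, v), target(v))]
            for k, e in enumerate(err):
                for j, x in enumerate(v):
                    grad[k][j] += 2 * e * x
                grad[k][in_dim] += 2 * e  # bias gradient
        for k in range(out_dim):
            for j in range(in_dim + 1):
                w[k][j] -= lr * grad[k][j] / len(data)
    return w


rng = random.Random(0)
expert_data = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(20)]
target = make_target(rng, in_dim=3, hidden=8, out_dim=2)
untrained = [[0.0] * 4 for _ in range(2)]
predictor = train_predictor(target, expert_data, out_dim=2)
loss_before = mean_loss(target, untrained, expert_data)
loss_after = mean_loss(target, predictor, expert_data)
```

After training, the prediction distance on the expert data is lower than before, which is the property the reward relies on: inputs resembling the expert data should yield a small distance.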
In order to distil the expert behaviour distribution, each data point consists of the expert’s visual observation, language instruction and action. Formally:
The SED module preserves semantic-rich information about the distribution of expert demonstrations for the representation learner. The module is therefore a suitable density function for measuring the similarity between an agent’s behaviour and the expert demonstrations. Unlike the behaviour cloning process, it is formed in a soft manner. The SED intrinsic reward function is formally presented as:
The L2 distance between the outputs of the two networks serves as the prediction distance. Intuitively, if this distance is less than the threshold, the agent’s current behaviour is similar to the expert distribution and a positive reward is awarded; otherwise, a negative reward is returned. Behaviour cloning based models encourage the agent to copy expert demonstrations exactly, whereas our proposed Soft Expert Distillation module learns the demonstrated behaviour in a soft manner by depicting the distribution of expert behaviours. In this way, the agent retains the expert knowledge without suffering from the error accumulation problem, which increases the robustness of the model across various VLN environments.
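The thresholding rule described above can be sketched as a small function. The reward magnitudes of +1 and -1 and the threshold value in the example are assumptions made for illustration only; the actual values are not recoverable from this section, and `sed_reward` is a hypothetical name.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equally sized output vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sed_reward(distill_out, random_out, threshold, positive=1.0, negative=-1.0):
    """Positive reward when the prediction distance is below the threshold.

    distill_out: output of the distillation network for the agent's
                 observation-action pair.
    random_out:  output of the frozen random network for the same pair.
    Assumption: the reward takes one of two fixed values.
    """
    distance = l2_distance(distill_out, random_out)
    return positive if distance < threshold else negative


# The distance here is 5.0, so a threshold of 6.0 yields the positive reward.
example_reward = sed_reward([3.0, 4.0], [0.0, 0.0], threshold=6.0)
```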
3.4 Self Perceiving Reward
Perceiving the schedule information towards the goal is crucial for the agent to complete the VLN task. A self perceiving module is designed to predict the distance improvement ratio at each step as the agent’s current schedule information. To utilise this information more fully, we go one step further and use it as another intrinsic reward, the self perceiving reward. Formally, the self perceiving reward is calculated as:
where the language attention is taken over the words of the instruction sentence and the product is the element-wise Hadamard product. Intuitively, the Self Perceiving reward indicates the predicted schedule information towards the destination. The larger the distance improvement ratio achieved by the current action, the higher the reward that ought to be assigned. Moreover, this reward conveys more information about the change in distance than raw distances do. The more self perceiving reward the agent collects, the closer it believes it is to the target location.
Following previous works [2, 4, 7, 9, 10, 16, 19], we evaluate our model on the Room-to-Room (R2R) dataset for the VLN task. Furthermore, we test our method on the VLN test server (the VLN leaderboard address is https://evalai.cloudcv.org/web/challenges/challenge-page/97/leaderboard/270) to validate the proposed Soft Expert Reward Learning model. An ablation study is further conducted to examine the contribution of each individual component of the model. The experimental results show the effectiveness of the proposed model.
4.1 Experimental Setup
Evaluation Metrics. A variety of metrics are currently used to evaluate VLN models. We adopt the following: Navigation Error (NE) measures the shortest-path distance between the stopping position and the goal; Success Rate (SR) is the fraction of episodes in which the agent stops within three metres of the target; Oracle Success Rate (OSR) is the success rate obtained if the agent were to stop at the point on its trajectory closest to the goal; and Success rate weighted by Path Length (SPL) weights the SR by path length.
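For concreteness, SR and SPL can be computed as in the sketch below. SPL follows the standard definition from the cited evaluation paper: each successful episode contributes the ratio of the shortest-path length to the larger of the agent's path length and the shortest-path length. The episode tuples and function names are hypothetical, and path lengths are assumed to be positive.

```python
def success_rate(final_distances, threshold=3.0):
    """Fraction of episodes that end within `threshold` metres of the goal."""
    successes = [d < threshold for d in final_distances]
    return sum(successes) / len(successes)


def spl(episodes, threshold=3.0):
    """Success rate weighted by Path Length.

    episodes: tuples of (final_distance_to_goal, shortest_path_length,
              agent_path_length), all in metres.
    """
    total = 0.0
    for final_distance, shortest, taken in episodes:
        success = 1.0 if final_distance < threshold else 0.0
        total += success * shortest / max(taken, shortest)
    return total / len(episodes)


# One efficient success, one failure, and one success via a path twice as long.
episodes = [(2.0, 10.0, 10.0), (5.0, 10.0, 12.0), (1.0, 10.0, 20.0)]
example_sr = success_rate([e[0] for e in episodes])
example_spl = spl(episodes)
```

The third episode counts fully towards SR but only half towards SPL, which is why strategies that explore many paths can raise SR while lowering SPL.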
We utilise the ResNet-152 model pre-trained on ImageNet to extract CNN features as visual inputs. Empirically, we set the equal to 128, and set both reward trade-off factors to 0.1. In the Soft Expert Distillation networks, the randomised network is made up of two hidden linear layers with 512 and 256 neurons respectively, while the distillation network has one hidden linear layer with 256 neurons. Between every two linear layers, both the randomised network and the distillation network use leaky ReLU as the activation function. To prevent overfitting, we stop training early according to the performance on the validation set. The Soft Expert Distillation module is not trained jointly with the rest of the model. This decoupling prevents performance instability during training and increases the robustness of the model.
4.2 Overall Performance
|Val Seen||Val Unseen||Test Unseen|
In this section, we conduct evaluation experiments on three sets (validation seen, validation unseen and test unseen), shown in Table 2, to compare our proposed Soft Expert Reward Learning model with other models. The comparison is split into two groups: models trained on non-augmented data and models trained on augmented data. Of the twelve indicators across the validation and test sets, we achieve the best result on ten in the non-augmented group and on nine in the augmented group, which demonstrates the effectiveness of the SERL model. More specifically, for the non-augmented group on the validation unseen set, our SERL model reduces the navigation error by 7% and increases the success rate by 4% and SPL by 2%. Our method also obtains remarkable results on the test unseen set. Similarly, for the augmented group on the validation unseen set, our model is the best performer: SERL reduces the navigation error by 5% and reaches a success rate of 0.56. Compared to the second-best model, it also increases the oracle success rate by 10% and reaches an SPL of 0.48. On the test unseen set, our SERL model performs better than, or comparably to, the other competing methods in Table 2. Compared to the second-best model, it increases the oracle success rate by 3% and SPL by 4%. The FAST model applies a beam-search style strategy, so it is expected to produce a better success rate (SR), but at the cost of a relatively worse SPL.
4.3 Ablation Study
4.3.1 Ablation Study of Different Components Performance
This section examines the contribution of each component of the SERL model. Different components are added to the baseline model, and the ablation results are reported in Table 3. The results are shown on the validation seen and unseen sets, and all models are trained with the same data augmentation strategy. In the first column, SED denotes our proposed soft expert distillation module, SP the self perceiving module and BS the beam search setting. The second column marks which components are enabled in each variant. Row #1 shows the performance of the environment dropout method as implemented by us. Compared with row #1 on the validation unseen set, and excluding the beam search setting, the model with the SED module alone (#2) achieves an SR that is 6% higher and increases SPL from 0.45 to 0.48; the model with the SP module alone (#3) improves the success rate from 0.49 to 0.53 and SPL from 0.45 to 0.46. This is because the SED module encourages the agent to align better with expert trajectories, but in a soft way, while the SP module pushes the agent to find the target location as fast as possible. The full SERL model (#4) combines the advantages of the individual modules and achieves a success rate of 0.56 and an SPL of 0.48, outperforming the other variants.
|Val Seen||Val Unseen|
Additionally, beam search is another popular Vision-and-Language Navigation setting, in which the agent is given the chance to choose the trajectory with the highest success rate. This setting further boosts the success rate of our SERL model (#5) to 0.77 on the validation seen set and 0.71 on the validation unseen set. Moreover, the SERL model obtains a success rate of 0.70 on the test unseen set with beam search.
4.3.2 Sensitivity Test
This section presents the performances of SERL model with different and weights to trade-off the proposed individual intrinsic reward. Figure 4 shows the sensitivity test results, which is evaluated in SR and SPL on validation unseen set. It is clear that SERL generally performs stably w.r.t. the use of different and weights. This demonstrates the general stability of our SERL method by setting different hyper-parameters. In general, is recommended for SERL to achieve effective visual and language navigation performance.
Figure 5 shows the actions taken by our baseline agent and the proposed SERL agent, respectively. The attention maps over the instruction at each step are also illustrated in the figure. In the left column of the figure, the agent is trained solely by behaviour cloning and acts correctly for the first three steps. At the fourth step, however, it takes a wrong action, which leads to navigation failure over the next three steps. This is because subtle errors accumulate at each step when the agent merely copies expert demonstrations during training. In contrast, our SERL model attends to the instruction more accurately and does not encounter the error accumulation problem in this case.
In this paper, we propose a Soft Expert Reward Learning (SERL) model to address the error accumulation of behaviour cloning and the reward engineering issues of reinforcement learning in the VLN task. The experimental results show that our SERL model generally outperforms current state-of-the-art methods on both the validation unseen and test unseen sets of the VLN Room-to-Room dataset. The ablation study shows that our proposed Soft Expert Distillation (SED) module and Self Perceiving (SP) module are complementary to each other. Moreover, the visualisation experiments further verify that the SERL model can overcome the error accumulation problem. In the future, we will investigate more reward learning methods for the VLN task.
- (2018) On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757.
- (2018) Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments. pp. 3674–3683.
- (2016) OpenAI Gym. arXiv preprint arXiv:1606.01540.
- (2018) Speaker-follower models for vision-and-language navigation. In Advances in Neural Information Processing Systems, pp. 3314–3325.
- (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- (2016) Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565–4573.
- (2019) Tactical rewind: self-correction via backtracking in vision-and-language navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6741–6749.
- (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
- (2019) Self-monitoring navigation agent via auxiliary progress estimation. arXiv preprint arXiv:1901.03035.
- The regretful agent: heuristic-aided navigation through progress estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6732–6740.
- Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937.
- (2013) Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
- (2000) Algorithms for inverse reinforcement learning. In ICML, Vol. 1, pp. 663–670.
- (2019) SQIL: imitation learning via regularized behavioral cloning. arXiv preprint arXiv:1905.11108.
- (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- (2019) Learning to navigate unseen environments: back translation with environmental dropout. arXiv preprint arXiv:1904.04195.
- (2012) MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033.
- (2019) Random expert distillation: imitation learning via expert policy support estimation. arXiv preprint arXiv:1905.06750.
- (2019) Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6629–6638.
- (2019) EvalAI: towards better evaluation systems for AI agents. arXiv:1902.03570.