Aerial Vision-and-Dialog Navigation

by   Yue Fan, et al.
University of California Santa Cruz

The ability to converse with humans and follow commands in natural language is crucial for intelligent unmanned aerial vehicles (a.k.a. drones). It can relieve people's burden of holding a controller all the time, allow multitasking, and make drone control more accessible for people with disabilities or with their hands occupied. To this end, we introduce Aerial Vision-and-Dialog Navigation (AVDN), to navigate a drone via natural language conversation. We build a drone simulator with a continuous photorealistic environment and collect a new AVDN dataset of over 3k recorded navigation trajectories with asynchronous human-human dialogs between commanders and followers. The commander provides initial navigation instruction and further guidance by request, while the follower navigates the drone in the simulator and asks questions when needed. During data collection, followers' attention on the drone's visual observation is also recorded. Based on the AVDN dataset, we study the tasks of aerial navigation from (full) dialog history and propose an effective Human Attention Aided (HAA) baseline model, which learns to predict both navigation waypoints and human attention. Dataset and code will be released.





1 Introduction

Figure 1: An example of Aerial Vision-and-Dialog Navigation (AVDN). The user instructs the agent to fly to a destination. During the navigation, the agent can ask questions while showing the images of past visual observations and relative trajectory. The user will talk back at a convenient time to provide further guidance to the agent without having to monitor the agent all the time.

Drones have been widely adopted for many applications in our daily life, from personal entertainment to professional use. Compared with ground robots, they have the advantages of mobility and of observing large areas. However, controlling an aerial robot is more complex because an extra degree of freedom, altitude, is involved. People often need to hold a controller all the time to fly a drone. It is therefore essential to create a hands-free control experience for drone users and to develop an intelligent drone that can complete tasks simply by talking to humans. This can lower the barrier of drone control for users with disabilities and for users whose hands are occupied by activities such as taking photos or writing.

Therefore, this work introduces Aerial Vision-and-Dialog Navigation (AVDN), aiming at developing an intelligent drone that can converse with its user to fly to the expected destination. As shown in Figure 1, the user (commander) provides instructions, and the aerial agent (follower) follows the instruction and asks questions when needed. Through free-form dialog, potential ambiguities in the instruction can be gradually resolved when the commander provides further instructions by request. The past visual trajectories are also provided along with the question, which frees the commander from monitoring the drone all the time and minimizes the burden of drone control.

To implement and evaluate the AVDN task, we build a photorealistic simulator with continuous state and action space to simulate a drone flying with its onboard camera pointing straight downward. Then we collect an AVDN dataset of 3,064 aerial navigation trajectories with human-human dialogs, where crowd-sourcing workers play the commander role and drone experts play the follower role, as illustrated in Figure 1. Moreover, we also collect the attention of human followers over the aerial views for a better understanding of where humans ground navigation instructions.

Based on our AVDN dataset, we introduce two challenging navigation tasks, Aerial Navigation from Dialog History (ANDH) and Aerial Navigation from Full Dialog History (ANDH-Full). Both tasks focus on predicting navigation actions that can lead the agent to the destination area, whereas the difference is that ANDH-Full presents the agent with full dialog and requires it to reach the final destination Kim et al. (2021), while ANDH evaluates the agent’s completion of the sub-trajectory within a dialog round given the previous dialog information Thomason et al. (2020).

The proposed tasks open new challenges of sequential action prediction in a large continuous space and natural language grounding on photorealistic aerial scenes. We propose a sequence-to-sequence Human Attention Aided (HAA) model for both tasks. The HAA model predicts waypoints to reduce the complexity of the search space and learns to stop at the desired location. More importantly, it is jointly trained to predict human attention from the input dialog and visual observations, and thereby learns where to look during inference. Experiments on our AVDN dataset show that multitask learning is beneficial and that human attention prediction improves navigation performance. The main contributions of our work are summarized as follows:

  • We propose a new dataset and a simulator for aerial vision-and-dialog navigation. The dataset includes over 3K aerial navigation trajectories with human-human dialogs.

  • We introduce ANDH and ANDH-Full tasks to evaluate the agent’s ability to understand natural language dialog, reason about aerial scenes, and navigate to the target location in a photorealistic aerial environment.

  • We build a Human Attention Aided (HAA) model as the baseline for the ANDH and ANDH-Full tasks. Besides predicting the waypoint navigation actions, HAA also learns to predict human attention along the navigation trajectory. Experiments on our AVDN dataset validate the effectiveness of our HAA model.

2 Related work

Vision-and-Language Navigation

Vision-and-Language Navigation (VLN) is an emerging multi-modal task that studies the problem of using both language instructions and visual observation to predict navigation actions. We compare some of the works with our AVDN dataset in Table 1. Early VLN datasets such as Anderson et al. (2018); Ku et al. (2020) start with the indoor house environments in the Matterport3D simulator Chang et al. (2017), where the visual scenes are connected on a navigation graph. To simulate continuous state change as in the real world, Krantz et al. (2020) built a 3D continuous environment by reconstructing the scene based on topological connections, where the agent uses continuous actions during the navigation. Some other VLN studies focus on language instructions. Nguyen et al. (2019); Nguyen and Daumé III (2019); Thomason et al. (2020) created datasets where the agent can interact with the user by sending fixed signals or having dialogs. There are also works on synthetic indoor environments, such as Shridhar et al. (2020); Padmakumar et al. (2021), which use an interactive simulation environment with synthetic views named ALFRED, where the agent needs to follow language instructions or dialogs to finish household tasks. Besides the indoor environment, some VLN datasets work on the more complex outdoor environment, such as the Touchdown dataset Chen et al. (2019) and the modified LANI dataset Misra et al. (2018). Blukis et al. (2019) is similar to our work in that both use drones. However, the synthetic environment they used is far from realistic scenes, and they ignored the control of the drone’s altitude, so the navigation is oversimplified and has a large gap to real-world navigation in terms of both language and vision. Our work combines the advantages of previous works, continuous environments and dialog instructions, to better approximate the real-world scenario. More importantly, our AVDN dataset is the first photorealistic outdoor aerial VLN dataset to the best of our knowledge.

Dataset Env Photorealistic Dialog
R2R, RxR indoor
VNLA, HANNA indoor
VLN-CE indoor
CVDN indoor
TEACH indoor
TouchDown street-view
modified LANI aerial
AVDN (ours) aerial
Table 1: Example Vision-and-Language Navigation Datasets. R2R Anderson et al. (2018), RxR Ku et al. (2020), VNLA Nguyen et al. (2019), VLN-CE Krantz et al. (2020), CVDN Thomason et al. (2020), TEACH Padmakumar et al. (2021), TouchDown Chen et al. (2019), modified LANI Blukis et al. (2019).

Aerial Navigation

Although using both vision and language for aerial navigation is a relatively new topic, vision-based aerial navigation for drones is already an active topic in the field. Some inspiring works Loquercio et al. (2018); Giusti et al. (2015); Smolyanskiy et al. (2017); Fan et al. ; Bozcan and Kayacan (2020); Majdik et al. (2017); Kang et al. (2019) used pre-collected real-world drone data to tackle aerial vision navigation problems. However, due to the difficulty of collecting data and the risk of crashes, some other works applied simulation for aerial navigation. These aerial navigation simulators mostly use scenes with synthetic views Chen et al. (2018); Shah et al. (2017); Chen et al. (2020), where rich ground truths are provided without the need for annotation. Nevertheless, the visual gap between the synthetic views and the views in the real world could cause trouble in learning. As for our AVDN dataset, the simulator we build uses satellite images for simulating top-down visual observation of the drone, which avoids the shortcoming of having only 2D scenes and adopts the strength of the satellite images where the rich labels are already available. As a result, we balance the trade-off between the high cost of drone data collection and the benefit of photorealistic data.

3 Dataset

The AVDN dataset includes dialogs, navigation trajectory, and drone’s visual observation with human attention. Some examples of the dialog and drone’s visual observation with human attention are shown in Figure 2. With the help of a newly proposed simulator, we record the AVDN trajectories created by two groups of humans interacting with each other, playing either the commander role or the follower role. Our AVDN dataset is the first aerial navigation dataset based on dialogs to the best of our knowledge.

Figure 2: Example of dialog and corresponding drone’s visual observation overlaid with human attention (red circles) from our AVDN dataset. More examples are shown in the Appendix.

3.1 Simulator

We build a simulator to simulate the drone with a top-down view area, as in Figure 3(a). Our simulation environment is a continuous space, so the simulated drone can move continuously to any point within the environment. The drone’s visual observations are square images generated for the drone’s view area by cropping from high-resolution satellite images in the xView dataset Lam et al. (2018), an open-source large-scale satellite image object detection dataset, as in Figure 3(b). Although the satellite images lack dynamics and stereoscopy, we argue that by using satellite images, our simulator is capable of providing visual features as rich as those in the real world; some examples are shown in Figure 3(c). Therefore, we believe that the cropped satellite images are reasonable and cheap substitutes for the images taken by a real drone’s onboard camera. We also design an interface for our simulator, where the simulated drone can be controlled with a keyboard and the drone’s visual observation is displayed in real time with a digital compass, as shown in Figure 4(a). During the control, the users can also provide their attention over the displayed images on the interface by clicking the region they attend to. Last but not least, our simulator is capable of generating navigation overviews, showing the starting positions, destination areas, current view area and past trajectory if they exist, as in Figure 4(b) and 4(c).

(a) Simulated drone.
(b) Cropped satellite images.
(c) Real drone visual observation.
Figure 3: (a) shows how simulated drone’s visual observation is generated from satellite images in our simulator. We compare the simulated drone’s visual observation from satellite images, (b), with images from a drone’s onboard camera at about 200m above ground, (c).

3.2 Dataset Collection

We collect our dataset with the help of Amazon Mechanical Turk (AMT) workers and drone experts, where AMT workers play the commander role to provide instructions, and drone experts play the follower role to control a simulated drone and carry out the instruction. We pay the workers with wages no less than $15/h, and the data collection lasts for 90 days. We adopt an asynchronous data collection method, where the followers and commanders work in turns rather than simultaneously. This not only lowers the cost of data collection but also simulates how aerial vision-and-dialog navigation would work in practice.


One or multiple rounds of data collection could be involved in collecting an AVDN trajectory with dialog. Before the first round of data collection, we sample objects in the xView dataset Lam et al. (2018) as the destination areas and pair them with randomly selected initial view areas of the drone that are about 40 meters wide and always within 1.5km distance. With this prepared navigation information, we generate navigation overviews using our simulator, as shown in Figure 4(c).

In the first round of data collection, the navigation overviews are presented to AMT workers for creating the initial instructions. We instruct the AMT workers to write instructions as if they are talking to a drone pilot based on the marked satellite images. Next, we let human drone experts control the simulated drone through our simulator interface following the instructions and ask questions if they cannot find the destination area. During the simulation for the $i$-th navigation trajectory, a series of view areas $u^i_t$ is recorded, where each $u^i_t$ has three properties: the center coordinate $c^i_t$, the width $w^i_t$, and the direction $d^i_t$. One or more keyboard controls might be needed to go from $u^i_t$ to $u^i_{t+1}$, but the distance between the centers of $u^i_t$ and $u^i_{t+1}$ is … When the experts stop the current navigation session, they can either enter questions into a chatbox, claim the destination with a template sentence, or reject the instruction for bad quality.

In the second and following rounds of data collection (if needed), we gather the history dialog from the previous round and let AMT workers continue the dialog by providing further instructions based on the unfinished navigation data. Our experts will again navigate the drone and ask questions when necessary. We iterate the process until the destination is successfully reached and claimed by the expert as in Figure (b)b.

Success Condition

The AVDN trajectory is successful when the object area is clearly and completely observed. We define the successful finish of the AVDN trajectory by both checking the center of the follower’s last view area $u^i_{T}$ and computing the Intersection over Union (IoU) of $u^i_{T}$ and the destination area $Des^i$. If the center of $u^i_{T}$ is inside $Des^i$ and the IoU of $u^i_{T}$ and $Des^i$ is larger than 0.4 when the follower claims the destination, the navigation trajectory is regarded as successfully finished. Otherwise, the navigation continues with more data collection rounds.
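The success condition can be sketched as a small check. This is a minimal illustration, not the released implementation: it assumes both the view area and the destination area are axis-aligned rectangles (the recorded view areas also carry a direction, which this sketch ignores), and the names `Box`, `iou`, and `is_successful` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned rectangle in map coordinates (assumption: no rotation)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)

    @property
    def center(self) -> tuple:
        return ((self.x0 + self.x1) / 2.0, (self.y0 + self.y1) / 2.0)

    def contains(self, point) -> bool:
        x, y = point
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    iw = min(a.x1, b.x1) - max(a.x0, b.x0)
    ih = min(a.y1, b.y1) - max(a.y0, b.y0)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (a.area + b.area - inter)


def is_successful(last_view: Box, destination: Box, iou_threshold: float = 0.4) -> bool:
    """Center of the last view area inside the destination AND IoU above the threshold."""
    return destination.contains(last_view.center) and iou(last_view, destination) > iou_threshold
```

Both conditions matter: the center test rejects views that merely overlap the destination at an edge, while the IoU test rejects views that are far too wide or too narrow relative to the destination.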

(a) simulator interface (b) Successful trajectory
(c) Navigation overview
Figure 4: (a) shows the interface from our simulator shown to the drone experts during the data collection. (b) is a marked satellite image showing a navigation trajectory that successfully reaches the destination area. The white bounding box is the current view area of the follower. The purple line is the recorded ground truth view area center trajectory. The purple bounding box is the destination area. (c) is a navigation overview generated at the beginning of the data collection.

3.3 Data Analysis

Our AVDN dataset includes 3,064 aerial navigation trajectories, each with multi-round natural language dialog. The most frequent words are shown in Figure 5(a). The recorded AVDN trajectory path length has an average of 287m, and its distribution is shown in Figure 5(b). There are 2 rounds of dialog on average per full trajectory, and based on the data collection rounds, the trajectories and dialogs can be further separated into 6,269 sub-trajectories and corresponding dialog rounds.

We split our dataset into training, seen-validation, unseen-validation, and unseen-testing sets, where seen and unseen sets are pre-separated by making sure the area locations of the visual scenes are over 100 apart from each other. We show some statistical analysis across the dataset splits in Table 2. The visual scenes in our dataset come from the xView dataset Lam et al. (2018), which covers both urban and rural scenes. The average covered area of the satellite images is 1.2.

Rather than providing a target hint in the beginning as in Thomason et al. (2020), the destination must be inferred from the human instructions given by the commander. For example, the commander may give a detailed description of the destination initially or write a rough instruction first and then describe the destination later in the dialog. We also find that there are two ways of describing the directions for navigation: egocentric direction description, such as “turn right”, and allocentric direction description, such as “turn south”. By filtering and categorizing words related to directions, we find dialog rounds that use egocentric direction descriptions, dialog rounds that include allocentric direction descriptions, and dialog rounds that mix both kinds of direction descriptions, which makes the instruction complex. This opens a new challenge for developing a language understanding module that can ground both the egocentric and allocentric descriptions to navigation actions.
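The keyword filtering described above can be approximated as follows. This is only a sketch: the two word lists are assumptions chosen for illustration (the source does not list the keywords it used), and a plain keyword match will misread uses such as “that’s right”.

```python
import re

# Hypothetical keyword lists; the actual lists used for the analysis are not given.
EGOCENTRIC = {"left", "right", "forward", "ahead", "backward", "behind"}
ALLOCENTRIC = {"north", "south", "east", "west",
               "northeast", "northwest", "southeast", "southwest"}


def direction_types(utterance: str) -> set:
    """Return which kinds of direction description a dialog round contains."""
    tokens = set(re.findall(r"[a-z]+", utterance.lower()))
    kinds = set()
    if tokens & EGOCENTRIC:
        kinds.add("egocentric")
    if tokens & ALLOCENTRIC:
        kinds.add("allocentric")
    return kinds


def summarize(rounds: list) -> dict:
    """Fraction of dialog rounds that are egocentric, allocentric, or mixed."""
    counts = {"egocentric": 0, "allocentric": 0, "mixed": 0}
    for text in rounds:
        kinds = direction_types(text)
        for kind in kinds:
            counts[kind] += 1
        if len(kinds) == 2:
            counts["mixed"] += 1
    total = len(rounds)
    return {k: (v / total if total else 0.0) for k, v in counts.items()}
```

A mixed round is counted in all three buckets here, so the egocentric and allocentric fractions are not mutually exclusive.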

Split        #areas  Destination area-dim  #dialogs  #words per dialog  #sub-paths  Sub-path length
Training     350     126m                  2221      90                 4591        145m
Seen-val     197     120m                  197       79                 370         148m
Unseen-val   30      131m                  214       83                 411         144m
Unseen-test  65      117m                  432       91                 897         142m
Table 2: Dataset statistics. #areas refers to the number of non-overlapped satellite images used. Destination area-dim is the average dimension of the sampled destination areas. #dialogs is the number of dialogs, and #words per dialog is the average number of words in each dialog. #sub-paths is the number of sub-paths where each sub-path corresponds to one round of dialog. Sub-path length is the average sub-path length.
(a) Frequent words
(b) Path length distribution
Figure 5: (a) displays the frequent words that appear in the dialogs and (b) shows the path length distribution of our AVDN dataset.

4 Task

Following indoor dialog navigation Thomason et al. (2020); Kim et al. (2021), we introduce an Aerial Navigation from Dialog History (ANDH) task and an Aerial Navigation from Full Dialog History (ANDH-Full) task based on our AVDN dataset and simulator.

4.1 Aerial Navigation from Dialog History

The goal of the task is to let the agent predict aerial navigation actions that lead to the destination area following the instructions in the dialog history. Specifically, one round of dialog, $D_{i,j}$, is input to the agent at the start of sub-trajectory $j$ of trajectory $i$, where $j \in \{1, \dots, M_i\}$ and $M_i$ is the number of sub-trajectories in trajectory $i$. At each time step $t$, an image of the view area $u^i_t$ is provided, where $T^{s}_{i,j} \le t \le T^{e}_{i,j}$. The goal is to navigate to the destination, $u^i_{T^{e}_{i,j}}$, corresponding to dialog round $D_{i,j}$, where $T^{e}_{i,j}$ is the time step when sub-trajectory $j$ ends. The agent outputs actions, and the drone’s view area moves accordingly. The predicted view area sequence, $\langle \hat{u}^{i,j}_1, \dots, \hat{u}^{i,j}_{N_{i,j}} \rangle$, resulting from the predicted waypoint actions is recorded and evaluated against the ground truth view area sequence $\langle u^i_{T^{s}_{i,j}}, \dots, u^i_{T^{e}_{i,j}} \rangle$, where $N_{i,j}$ is the total number of predicted waypoint actions for sub-trajectory $j$ of trajectory $i$ and $T^{s}_{i,j}$ is the time step in trajectory $i$ when sub-trajectory $j$ starts.

4.2 Aerial Navigation from Full Dialog History

Compared with the ANDH task, the major difference of the ANDH-Full task is that it adopts the complete dialog history as input. At the beginning of each dialog round, we add a prompt of the drone’s direction, e.g., “facing east”, by checking the ground truth. It helps to clarify the following language instructions, especially when egocentric direction descriptions exist. With the full dialog and visual observation, the agent needs to predict the full navigation trajectory from the starting view area to the destination area. ANDH-Full provides complete supervision for agents on a navigation trajectory with a more precise destination description and includes longer utterances and more complex vision grounding challenges. Similar to the ANDH task, the predicted actions are used to control the agent’s movement, and the evaluation target is the sequence of view areas recorded after every action together with the navigation trajectory generated from it.

4.3 Evaluation

Since the agent in both tasks, ANDH and ANDH-Full, needs to generate predicted view area sequences, the evaluation for both tasks is the same. In the evaluation, the center points of every view area are connected to form the navigation trajectory, and the last view area is used to determine whether the predicted navigation successfully leads to the destination area. The predicted navigation is successful if the IoU between the predicted final view area and the destination area is greater than 0.4. We apply several metrics for evaluation.

Success Rate (SR): the number of the predicted trajectory being regarded as successful, i.e., the final view area of the predicted trajectory satisfies the IoU requirement, over the number of total trajectories predicted.

Success weighted by inverse Path Length (SPL) Anderson et al. (2018): measuring the Success Rate weighted by the total length of the navigation trajectory.

Goal Progress (GP) Thomason et al. (2020): evaluating the distance of the progress made towards the destination area. It is computed as the Euclidean distance of the sub-trajectory, deducted by the remaining distance from the center of the predicted final view area to the center of the destination area.

Average Trajectory Deviation (ATD): the average distance from the predicted view area centers to the ground truth trajectory. It measures the predicted path fidelity weighted by length. We derive the ground truth trajectory by connecting the center positions of the ground truth view area sequence with line segments , where and is a line segment start from point and ends at point and returns the center area of the area. The ATD for predicted view area sequence is

Where is the distance between point and line segment and is the length of line segment .
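To make the metrics concrete, the sketch below computes them on lists of 2D center points. Several choices here are assumptions rather than the paper’s definitions: SPL follows the standard form from Anderson et al. (2018) with a caller-supplied reference length, Goal Progress takes the “distance of the sub-trajectory” to be the straight-line distance from start to destination, and `mean_deviation` is a plain unweighted average of point-to-trajectory distances, which is a simplified reading of ATD that omits any length weighting. All function names are hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the line segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:  # degenerate segment: a and b coincide
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def success_rate(successes):
    """Fraction of trajectories whose final view area met the IoU requirement."""
    return sum(successes) / len(successes)


def spl(successes, predicted_lengths, reference_lengths):
    """Success weighted by inverse path length (standard form, assumed here)."""
    terms = []
    for s, pred, ref in zip(successes, predicted_lengths, reference_lengths):
        denom = max(pred, ref)
        terms.append(s * ref / denom if denom > 0 else 0.0)
    return sum(terms) / len(terms)


def goal_progress(start, final, goal):
    """Start-to-goal distance minus remaining final-to-goal distance (assumed reading)."""
    return math.dist(start, goal) - math.dist(final, goal)


def mean_deviation(predicted_centers, gt_centers):
    """Unweighted mean distance from predicted centers to the ground-truth polyline."""
    segments = list(zip(gt_centers, gt_centers[1:]))
    total = sum(min(point_segment_distance(p, a, b) for a, b in segments)
                for p in predicted_centers)
    return total / len(predicted_centers)
```

Goal Progress can be negative under this reading, when the agent ends farther from the destination than it started.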

Figure 6: Our Human Attention Aided (HAA) model. The output of the model interacts with our simulator to generate the input for the next time step.

5 Model

We propose a Human Attention Aided (HAA) model for the ANDH and ANDH-Full tasks, as shown in Figure 6. It takes multi-modal information as input and generates multi-modal predictions, including human attention prediction and navigation prediction.

Multi-modal Encoding

The input has three modalities: the drone’s direction, image, and language. The drone’s directions and images from the drone’s visual observation are input to the model at every time step. A fully connected direction encoder module followed by a direction LSTM is used to understand the continuous forward direction inputs, and an xView-pretrained Darknet-53 Redmon and Farhadi (2018) followed by a vision LSTM is used for extracting image features. As for language inputs, special language tokens such as [INS] and [QUE] are added in front of each instruction and question in the dialog before it is input to the model at the start of a prediction series. Next, our model uses a BERT encoder Devlin et al. (2018) to get the language features of the input dialog history. Language features are first used for the attention module ahead of the vision LSTM, and then the hidden states from both the direction LSTM and the vision LSTM are concatenated and also attended by the language features.

Navigation Prediction and Waypoint Control

The navigation outputs from our model include waypoint actions and navigation progress. The waypoint action has three dimensions corresponding to coordinates in a 3D space. The navigation progress prediction Xiang et al. (2019) generates a one-dimensional navigation progress indicator for deciding when to stop. If the predicted navigation progress is less than a threshold, the simulated drone executes the predicted waypoint action. The drone’s view area center in the next time step is the first two dimensions of the predicted waypoint, and the width of the view area is determined by the altitude of the predicted waypoint. The waypoint action also determines the drone’s direction, which is kept towards the direction of movement.
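The control rule in this paragraph can be sketched as a single update step. The sketch rests on assumptions that are not in the text: the stop threshold value, a pinhole camera with a fixed field of view to map altitude to view width, and a heading measured in degrees counter-clockwise from the +x axis. `ViewArea`, `step`, `stop_threshold`, and `fov_deg` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class ViewArea:
    cx: float           # center x of the view area
    cy: float           # center y of the view area
    width: float        # side length of the square view area
    heading_deg: float  # assumed convention: degrees CCW from the +x axis


def step(view: ViewArea, waypoint, progress: float,
         stop_threshold: float = 0.9, fov_deg: float = 90.0):
    """Apply one predicted waypoint; returns (new_view, stopped)."""
    # Stop once predicted progress reaches the threshold (threshold value is assumed).
    if progress >= stop_threshold:
        return view, True

    x, y, altitude = waypoint
    # Assumed altitude-to-width mapping for a camera pointing straight down.
    width = 2.0 * altitude * math.tan(math.radians(fov_deg) / 2.0)

    # Keep the drone facing its direction of movement.
    dx, dy = x - view.cx, y - view.cy
    if dx == 0 and dy == 0:
        heading = view.heading_deg
    else:
        heading = math.degrees(math.atan2(dy, dx)) % 360.0
    return ViewArea(x, y, width, heading), False
```

Tying the heading to the movement direction means the model never has to predict a rotation separately; it falls out of consecutive waypoints.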

Human Attention Prediction

A human attention decoder is proposed to predict the human attention mask using the output of the attention module that deals with the image features. We build the decoder based on He et al. (2019), where the input to the decoder is decoded to an intermediate representation through a fully connected layer and then linearly interpolated to a mask with the same shape as the input image. The greater a value in the mask, the more likely the human follower attends to the corresponding pixel.

Training

We first train our HAA model on the ANDH task and then fine-tune it on the ANDH-Full task, because the ANDH task is relatively easier and has a larger dataset available. For each task, we conduct the training alternately in teacher-forcing Williams and Zipser (1989) and student-forcing modes, where the main difference is whether the model interacts with the simulator using ground truth actions or the predicted actions. Our model is trained with a sum of losses from both navigation prediction and human attention prediction. First, the predicted waypoint action and predicted navigation progress are trained with Mean Square Error (MSE) loss, supervised by ground truth values computed from the recorded trajectory in our dataset.

where the is computing the rotation change as a result of the waypoint action. Second, for human attention prediction training, we apply the modified Normalized Scanpath Score loss (NSS) He et al. (2019). Given a predicted human attention map and a ground-truth human attention mask ,

Since human attention may not exist in certain view areas, the human attention loss is only computed for view areas with recorded human attention.

Seen Validation
                ANDH                        ANDH-Full
Model           SPL    SR     GP     ATD    SPL    SR     GP     ATD
Rule-based      4.0    4.1    5.3    0.80   0.0    0.0    0.80   0.80
Vision-only     3.8    3.8    -1.0   0.46   0.5    0.5    2.4    0.42
Language-only   6.7    7.3    21.0   0.48   0.5    0.5    22.1   0.56
W/o attention   7.8    8.4    25.0   0.47   1.0    1.0    28.2   0.44
HAA (Ours)      9.9    10.5   32.7   0.45   2.5    2.5    49.5   0.43

Unseen Validation
                ANDH                        ANDH-Full
Model           SPL    SR     GP     ATD    SPL    SR     GP     ATD
Rule-based      5.6    5.8    1.6    0.80   0.5    0.5    5.1    0.81
Vision-only     7.0    7.5    0.5    0.47   2.3    2.3    0.7    0.43
Language-only   10.1   10.7   17.8   0.52   2.3    2.3    19.2   0.55
W/o attention   10.6   11.2   27.8   0.47   0.9    0.9    31.1   0.40
HAA (Ours)      13.1   13.9   32.8   0.46   5.6    5.6    51.3   0.44

Unseen Testing
                ANDH                        ANDH-Full
Model           SPL    SR     GP     ATD    SPL    SR     GP     ATD
Rule-based      3.4    3.6    3.3    0.80   0.0    0.0    3.3    0.82
Vision-only     3.3    3.7    -0.1   0.52   0.2    0.2    -1.5   0.48
Language-only   9.0    9.4    19.0   0.51   1.4    1.4    34.9   0.53
W/o attention   9.2    9.6    26.3   0.48   0.9    0.9    38.8   0.39
HAA (Ours)      9.7    10.3   33.3   0.44   1.2    1.2    53.9   0.41
Table 3: Result comparison on both ANDH and ANDH-Full tasks.

6 Experiments

We compare our HAA model and several baselines on the ANDH and ANDH-Full tasks.

Baselines

We first design a rule-based non-learning baseline model that uses keywords in the dialog history to generate navigation actions. It adopts the same control strategy as our model but only moves in a fixed direction based on detected keywords and stops at a random time step. Then we introduce two uni-modal baseline models, which take either vision or language as input. Both uni-modal baseline models utilize the same modality encoders and LSTM structure as our model and have the drone’s direction as input at each time step. Last but not least, we also build an ablation model with the same structure but without human attention prediction training.

Results on ANDH and ANDH-Full

The models’ weights are selected based on unseen SR, and we list the results of the models on the ANDH and ANDH-Full tasks in Table 3. Based on the results, our HAA model outperforms the baseline models in all metrics for the ANDH task. As for the ANDH-Full task, our model performs the best in SPL, SR and GP, showing a better ability to find the destination area, but its path fidelity is slightly lower due to the longer and more complex trajectories.

Additionally, we notice that the language-only uni-modal model achieves much better performance than the vision-only uni-modal model, which indicates that the language instructions play an important role in guiding the navigation. Also, compared with uni-modal baseline models, the w/o attention model has more significant performance improvements in SPL and SR on the seen validation set than on the unseen validation set. This shows that our model structure is effective for leveraging the vision information in the multi-modal learning task because the seen and unseen sets are split based on the vision data of our AVDN dataset.

Figure 7: Ablation study between short vs. long trajectories on the ANDH task in terms of the number of successful sub-trajectories. Human attention prediction significantly improves the navigation performance, especially in long trajectories.

Impact of Human Attention Prediction

We first evaluate the impact of human attention prediction training in our HAA model by testing the models’ performance on the ANDH task on subsets of our AVDN unseen validation set, where sub-trajectories are separated into four subsets based on the ground truth sub-trajectory length. Among the subsets, the longer the trajectory, the greater the challenge, because the control sequence required to reach the destination area is longer. In Figure 7 we compare the number of successful sub-trajectories in different subsets between our HAA model and the w/o attention model. With human attention prediction training, our HAA model achieves significant performance improvements for the subsets of longer trajectories. This leads to the conclusion that human attention prediction training benefits navigation prediction, especially for long trajectories.

Besides improving the task performance, human attention prediction also benefits the interpretability of the model by generating visualizable attention predictions paired with navigation predictions. We evaluate the human attention prediction result using the Normalized Scanpath Saliency (NSS) score, which measures the normalized saliency prediction at the ground truth human attention. Our HAA model receives NSS scores of 0.71, 0.52 and 0.56, respectively in seen validation, unseen validation, and test set, indicating the human attention prediction is effective.
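For reference, the NSS evaluation score described above can be computed as in the sketch below. It implements the standard NSS definition (z-normalize the predicted map, then average it at the attended pixels) on nested lists. It is not the modified NSS loss used for training, and the choice of population standard deviation is an assumption, since implementations differ on this point.

```python
import math


def nss(saliency, fixations):
    """Normalized Scanpath Saliency.

    saliency:  2D list of predicted attention values.
    fixations: 2D list of the same shape, truthy where a human attended.
    """
    values = [v for row in saliency for v in row]
    mean = sum(values) / len(values)
    # Population standard deviation (assumption; some implementations use n - 1).
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0.0:
        return 0.0  # a constant map carries no information about attention

    scores = [
        (saliency[r][c] - mean) / std
        for r, row in enumerate(fixations)
        for c, attended in enumerate(row)
        if attended
    ]
    if not scores:
        raise ValueError("no attended pixels in the ground-truth mask")
    return sum(scores) / len(scores)
```

A score of 0 corresponds to chance level, and positive scores mean the prediction is above its own average at the locations humans attended.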

7 Conclusion

In this work, we introduce a dataset for Aerial Vision-and-Language Navigation with over 3k human-human free-form dialogs. We build a continuous-space drone simulator with photorealistic scenes based on satellite images, which makes the vision domain in our dataset complex and close to reality. Based on our dataset, we propose two challenging navigation tasks, Aerial Navigation from Dialog History and Aerial Navigation from Full Dialog History. We design a Human Attention Aided model for both tasks and demonstrate the potential of human attention data by showing that the model’s navigation prediction performance benefits from human attention prediction training. Our work opens possibilities for further studies to develop stronger models on AVDN that focus not only on navigation prediction but also on question generation. Furthermore, based on our results, future work may further investigate using human attention prediction training to help solve VLN problems.

Broader Impact

This work proposes a dataset, a simulator, tasks, and models for Aerial Vision-and-Language Navigation. Since satellite images are needed to simulate the drone’s observation, risks of privacy leakage may exist. By using the open-source satellite dataset xView Lam et al. (2018), we mitigate these risks while still being able to develop a simulator for training our model. We also recognize the potential ethical problems during the dataset collection, where human annotators are involved. We therefore used the Amazon Mechanical Turk (AMT) website to find workers willing to participate in the project. With AMT, our data collection is constrained by legal terms, and the data collection protocol is under AMT’s approval. The agreement signed by both requesters and workers on AMT also ensures a transparent and fair data annotation process and that privacy is well protected.


Appendix A HAA Model and Experiment Details

There are around 150M parameters in our HAA model.

a.1 Language Feature Encoding

Our model uses a BERT encoder Devlin et al. (2018) with pretrained weights open-sourced on Hugging Face Wolf et al. (2020) to extract language features of the input dialog history. For the ANDH task, we extract two types of language features, where the input is either all the dialog rounds from dialog round 1 to the dialog round corresponding to the target sub-trajectory, or only the dialog round for the target sub-trajectory. The language features generated from all previous dialog rounds are used to attend to the image features extracted by DarkNet-53 and input to the vision LSTM cell, whereas the language features of the current dialog round, which carry more related information, are used to attend to the concatenated vision and direction LSTM hidden states. In the ANDH-Full task, since the input is the full dialog history, the same language features are extracted and input to both attention modules.

a.2 Attention Module

The attention modules used in our HAA model share the same structure. They generate soft attention based on the dot-product attention mechanism. The inputs are context features and attention features. There is a fully connected layer before the output. The context features attended by the attention features are concatenated with the attention features to become the input of the fully connected layer, and the output of this layer is the attention module’s output, which has the same shape as the attention features.
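The structure described above can be sketched in plain Python as follows. This is an illustrative sketch rather than our implementation: the function name `soft_dot_attention`, the single-vector query, the absence of an activation after the fully connected layer, and the explicit `weight` and `bias` arguments are all assumptions made for the example.

```python
import math

def soft_dot_attention(query, context, weight, bias):
    """Soft dot-product attention followed by a fully connected layer.

    query: length-d "attention features".
    context: list of N length-d "context features".
    weight: d x 2d matrix and bias: length-d vector of the final layer
        (hypothetical parameters for this sketch).
    Returns (output, alphas); output has the same shape as query.
    """
    scores = [sum(q * c for q, c in zip(query, ctx)) for ctx in context]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # numerically stable softmax
    total = sum(exps)
    alphas = [e / total for e in exps]
    d = len(query)
    # Weighted sum of the context features (the "attended" context).
    attended = [sum(a * ctx[i] for a, ctx in zip(alphas, context)) for i in range(d)]
    fused = attended + list(query)  # concatenation, length 2d
    output = [sum(w * x for w, x in zip(row, fused)) + b for row, b in zip(weight, bias)]
    return output, alphas

query = [1.0, 0.0]
context = [[1.0, 0.0], [0.0, 1.0]]
weight = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # passes the attended context through
bias = [0.0, 0.0]
output, alphas = soft_dot_attention(query, context, weight, bias)
print(output, alphas)
```

The output length equals the length of the attention features because the fully connected layer maps the 2d-dimensional concatenation back to d dimensions.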

a.3 Waypoint Control with Rotation Following Strategy

The predicted waypoint actions have three dimensions based on the strategy of waypoint control with rotation following. At each time step, the waypoint action is derived by first choosing a point in the current view area and then combining it with an extra dimension for altitude control. When the waypoint action is performed, the drone will arrive at the position of the waypoint with the chosen altitude. Our waypoint control also adopts a rotation following strategy, meaning that the agent will always keep its head pointing in the forward direction. Our control method avoids having repeated identical actions in the action sequence and maintains the maximum continuity in control. This rotation following strategy also avoids adding redundant control dimensions.
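A minimal sketch of how one such three-dimensional action could be applied is given below. The coordinate conventions are assumptions made for illustration only: the waypoint is expressed as fractions of the view width relative to the view center, the heading is measured counterclockwise from the world +x axis, and the name `apply_waypoint` is hypothetical.

```python
import math

def apply_waypoint(x, y, heading, view_width, waypoint):
    """Move to a waypoint chosen in the current view and face the travel direction.

    heading: radians, counterclockwise from the world +x axis.
    waypoint: (right, forward, altitude), where right and forward are
        fractions of the view width in [-0.5, 0.5] relative to the view center.
    Returns the new (x, y, altitude, heading).
    """
    right, forward, altitude = waypoint
    # Rotate the view-frame offset into the world frame.
    dx = view_width * (forward * math.cos(heading) + right * math.sin(heading))
    dy = view_width * (forward * math.sin(heading) - right * math.cos(heading))
    # Rotation following: the new heading is the direction of travel.
    if dx != 0.0 or dy != 0.0:
        heading = math.atan2(dy, dx)
    return x + dx, y + dy, altitude, heading

# Fly to the top edge of the view, then to the right edge of the view.
print(apply_waypoint(0.0, 0.0, 0.0, 100.0, (0.0, 0.5, 80.0)))
print(apply_waypoint(0.0, 0.0, 0.0, 100.0, (0.5, 0.0, 80.0)))
```

Because the heading is derived from the displacement, no separate rotation dimension is needed in the action.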

a.4 Navigation Progress Prediction

As for the navigation progress prediction, we adopt the idea of L2Stop Xiang et al. (2019) and create a navigation progress predictor to help decide when to stop, which overcomes the problem that the model would fail to stop at the desired position. The navigation progress predictor is trained with the supervision of the IoU score between the current view area and the destination area. When the IoU is larger than 0, it indicates that the destination area is seen in the current view area, and the larger the IoU, the closer the current view area is to the destination area. During inference, the predicted navigation stops when the generated navigation progress indicator is less than 0.25.
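The IoU supervision signal can be illustrated with a short sketch. For simplicity it assumes axis-aligned rectangles given as `(x_min, y_min, x_max, y_max)`; view areas in the simulator may be rotated, which this example does not handle, and the function name `iou` is chosen for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    inter_w = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    inter_h = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # the destination area is not visible in the view area
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

view_area = (0.0, 0.0, 2.0, 2.0)
destination_area = (1.0, 1.0, 3.0, 3.0)
print(iou(view_area, destination_area))  # partial overlap
print(iou(view_area, (5.0, 5.0, 6.0, 6.0)))  # no overlap
```

An IoU of 0 means the destination is out of view, and an IoU of 1 means the view area coincides with the destination area.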

a.5 Training Details

We train our HAA model on one Nvidia RTX A5000 graphics card. We first train our model on the ANDH task for approximately 100k iterations, which takes about 48 hours, with a batch size of 4 and a learning rate of 1e-5. Then, the model weights with the best performance on the seen validation set are selected to be fine-tuned for the ANDH-Full task. Since the ANDH-Full task uses the full dialog history as input, which requires more GPU RAM, we use a batch size of 1 and a learning rate of 2.5e-6 and train the model for 200k more iterations, which takes about 70 hours.

Appendix B Simulator Details

We design a simulator to simulate a drone flying with its onboard camera facing straight downward. The simulator uses satellite images for the drone’s visual observation, where the observation is a square image patch cropped from the satellite image based on the drone’s view area. Since satellite images have boundaries and are not adjacent to each other, we prevent the drone’s view area from moving out of the boundary by automatically invalidating any action that would lead to an out-of-boundary view area. Additionally, for simplicity, we assume perfect control of the drone’s movement, and therefore the drone’s current view area is determined by the drone’s previous position and the navigation action.
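The boundary handling can be sketched as below. This is an illustration under stated assumptions, not the simulator code: the view area is modeled as a square given by its center, width, and heading in image coordinates, and the names `view_corners`, `inside_image`, and `step` are hypothetical.

```python
import math

def view_corners(cx, cy, width, heading):
    """Corners of a square view area centered at (cx, cy), rotated by heading (radians)."""
    half = width / 2.0
    cos_h, sin_h = math.cos(heading), math.sin(heading)
    corners = []
    for sx, sy in ((-1, -1), (1, -1), (1, 1), (-1, 1)):
        ox, oy = sx * half, sy * half
        corners.append((cx + ox * cos_h - oy * sin_h, cy + ox * sin_h + oy * cos_h))
    return corners

def inside_image(corners, image_width, image_height):
    """True if every corner lies within the satellite image."""
    return all(0 <= px <= image_width and 0 <= py <= image_height for px, py in corners)

def step(state, proposed, image_width, image_height):
    """Accept the proposed (cx, cy, width, heading) only if its view stays in bounds."""
    cx, cy, width, heading = proposed
    if inside_image(view_corners(cx, cy, width, heading), image_width, image_height):
        return proposed
    return state  # invalid action: the drone keeps its previous state

state = (50.0, 50.0, 20.0, 0.0)
print(step(state, (12.0, 50.0, 20.0, 0.0), 100.0, 100.0))          # accepted
print(step(state, (12.0, 50.0, 20.0, math.pi / 4), 100.0, 100.0))  # rejected
```

Checking all four corners matters because a rotated view can leave the image even when its center and axis-aligned extent would not.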

During the dataset collection, the follower controls the simulated drone through the simulator interface with a keyboard. We define 8 keys for the control with a total of four degrees of freedom (DoFs): 2 DoFs for horizontal movement, 1 DoF for altitude control, and 1 DoF for rotation control. Although our simulator environment is continuous, the control through the interface is discrete for an easier control experience. Every time a key is pressed, the simulated drone moves along the corresponding DoF for a fixed distance, and the higher the simulated drone flies, the farther it moves with one key press. Before the follower presses the ESC key to stop the control, they can also generate human attention data by left-clicking on the attended image region shown on the interface. After every left-click, a circle with a radius of 1/10 of the current view area width becomes an attended region and is displayed on the interface. A right-click on a circle removes that region from the attention record.
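The click-based attention record can be sketched as follows. The class name `AttentionRecord`, the use of a shared coordinate frame for clicks and circles, and the rule that a right-click removes the most recently added circle containing the click are assumptions made for this example.

```python
import math

class AttentionRecord:
    """Stores attended regions as circles (x, y, radius)."""

    def __init__(self):
        self.circles = []

    def left_click(self, x, y, view_width):
        # The radius is 1/10 of the current view area width.
        self.circles.append((x, y, view_width / 10.0))

    def right_click(self, x, y):
        # Assumption: remove the most recent circle that contains the click.
        for i in range(len(self.circles) - 1, -1, -1):
            cx, cy, radius = self.circles[i]
            if math.hypot(x - cx, y - cy) <= radius:
                del self.circles[i]
                return True
        return False

record = AttentionRecord()
record.left_click(100.0, 100.0, view_width=200.0)  # radius 20
record.left_click(300.0, 300.0, view_width=400.0)  # radius 40
removed = record.right_click(110.0, 100.0)         # inside the first circle
print(removed, record.circles)
```

Since the radius scales with the view width, a click made at a higher altitude marks a larger ground region.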

Appendix C Dataset Details and Examples

We provide some details about our dataset with related examples. Each example includes a dialog, sample visual observations from the drone with human attention, and navigation overviews.

Figure 8: Example of a trajectory with one dialog round.
Figure 9: Example of a trajectory that includes auto-instruction about altitude adjustment. There is only one dialog round.
Figure 10: Example of a trajectory that includes auto-instruction rejecting follower’s destination claim. There are two dialog rounds.
Figure 11: Example of a trajectory with three dialog rounds. There is an incorrect instruction in the second dialog round, where the destination should be described as the second nearest brown building rather than the nearest one. For this case, since the instruction is clear and can be followed by the follower, we treat it as an inevitable and acceptable type of instruction with mistakes and keep it in our dataset.

c.1 Human Attention

We record the attention from the follower through our simulator interface while the follower is controlling the simulated drone. In each navigation trajectory collected, the attended areas are stored in a list whose order is ignored, meaning that attended areas recorded earlier or later during the navigation are retrieved together when using the human attention data. In this way, the human attention data becomes more accurate, since an area that the follower failed to mark in the current view area is likely to be marked in future time steps. Also, because a previously attended area is kept in later view areas, less effort is needed to annotate the attended areas. We find that, on average, 1/7 of the area in the recorded view areas is attended to.

c.2 Dialog Structure

The dialogs contained in our AVDN dataset have varying numbers of rounds. Since the dialog rounds are split based on the data collection rounds, each dialog round contains only one instruction written by the commander. Figure 8 shows an example of a simple dialog with only one dialog round. However, when the follower cannot follow the initial instruction to find the destination area, questions will be brought up, and therefore more dialog rounds will be introduced. Every dialog round starts with an instruction from the human commander and could include one or more utterances from the follower, depending on whether auto-instructions exist. We provide details about auto-instructions in the next sub-section. Also, when followers are writing questions, we enable them to define shortcut keys for frequently used general questions such as “could you further explain it?”, “where should I go?”, etc. To avoid templated dialogs, followers are forbidden from using only the shortcuts for a question and need to incorporate their own language.

c.3 Auto-instructions

When the follower claims that the destination is reached, our simulator checks the navigation result automatically using the success condition described in Section 3.2. Then, auto-instructions are generated based on whether the destination area is reached successfully. Specifically, when the success condition is met, an auto-instruction of “Yes, you have find it!!!” is added to the end of the dialog. If the destination is in the center of the view area, but the view area is either too large or too small and therefore fails the success condition, the simulator provides an auto-instruction asking the follower to adjust the drone’s altitude and verify again whether the success condition is met, as shown in Figure 9. Otherwise, as in Figure 10, the auto-instruction lets the follower know that the destination area is not reached and allows the follower to ask more questions in the current dialog round.

(a) Words from commander utterances
(b) Words from follower utterances
Figure 12: Counts of top 50 most frequently used words in commander and follower utterances.

c.4 Dialog Quality

To ensure that the dialogs in our dataset are of good quality, we take measures during the data collection process and conduct an extra examination of the dialog data after the data collection.

During the data collection, online workers from Amazon Mechanical Turk (AMT) play the commander role and provide instructions in the dialog. Compared with the followers, whom we hired to work on-site under our in-person supervision, these workers have a higher chance of producing low-quality or incorrect language instructions. We develop several strategies to deal with these undesired instructions. First, if the follower, guided by the instruction, navigates the drone in a direction that differs by more than 90 degrees from the ground truth direction of the destination area, our simulator automatically labels the instruction as incorrect. Those labeled instructions are discarded and collected again. Second, since the follower needs to read and understand the instructions, they have the chance to report instructions as low-quality or incomprehensible and skip them. Finally, among the remaining instructions that are not flagged as low-quality or incorrect, some may still be inaccurate or incorrect due to human mistakes from the AMT workers, such as in Figure 11. By manually checking the dialogs and navigation trajectories in randomly selected subsets of our AVDN dataset, we spot only 5 instructions with potential mistakes in 50 dialogs. In those cases, because the follower successfully followed the instruction, we keep those instructions unchanged even if they didn’t help guide the follower to find the destination area. In the real world, the user in AVDN could also make mistakes, so this mistake tolerance strategy makes our dataset even closer to real scenarios.
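The 90-degree check in the first strategy can be sketched as follows. This is an illustrative sketch, not the simulator code: the function name `is_instruction_incorrect`, the use of the start and end positions of the follower's movement to define the navigated direction, and the handling of zero-length movement are assumptions.

```python
import math

def is_instruction_incorrect(start, end, destination, max_angle_deg=90.0):
    """Flag an instruction if the follower's direction deviates too far from the goal.

    start, end: (x, y) positions of the drone before and after the movement.
    destination: (x, y) center of the destination area.
    """
    move = (end[0] - start[0], end[1] - start[1])
    goal = (destination[0] - start[0], destination[1] - start[1])
    norm = math.hypot(*move) * math.hypot(*goal)
    if norm == 0:
        return False  # assumption: no movement or zero goal distance is not flagged
    cos_angle = (move[0] * goal[0] + move[1] * goal[1]) / norm
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding error
    return math.degrees(math.acos(cos_angle)) > max_angle_deg

print(is_instruction_incorrect((0, 0), (1, 0), (-1, 0)))  # opposite direction
print(is_instruction_incorrect((0, 0), (1, 0), (1, 1)))   # 45 degrees off
```

An equivalent test is whether the dot product of the two direction vectors is negative, which avoids the inverse cosine when only the 90-degree threshold is needed.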

We further examine the dialog quality after the data collection by analyzing the dialogs. The average number of utterances (human-written instructions and questions) in a dialog is 3.1, with a minimum of 1 and a maximum of 7; the minimum is 1 because each dialog includes at least one instruction written by a human. The average numbers of words written by the commander and the follower are 45 and 19, respectively, and there are about 15 words from auto-instructions. Also, in Figure 12, we show the distribution of the top 30 most frequent words in the commander’s and follower’s utterances. The results show a smooth variation across nouns, verbs, adjectives, and prepositions, indicating that our dataset’s utterances have rich content and good variety. Last but not least, we manually checked the dialogs in all validation and test sets by visualizing the corresponding navigation trajectory and the dialog, and we observed no major issues.

Appendix D Interface for workers in dataset collection

We rely on Amazon Mechanical Turk (AMT) workers and human drone experts during the collection of our Aerial Vision-and-Dialog Navigation (AVDN) dataset, where the AMT workers play the commander role, providing instructions, and the drone experts play the follower role, asking questions and controlling the drone. In this section, we present the interfaces for both groups of workers with all the information they receive in the data collection procedure.

d.1 Interfaces for commanders

Figure 13: Interface for AMT workers (commanders) in first round of data collection.
Figure 14: Interface for AMT workers (commanders) in following rounds of data collection.

There are two interfaces for commanders (AMT workers), depending on the data collection round. Each interface shows one trajectory at a time and contains all the information needed for the commander to create the instruction. Detailed, step-by-step instructions for what needs to be done as a commander are given at the beginning of the interface. The AMT workers need to write sentences in the Answer according to the provided information.

In the first round of data collection, the commander needs to write the initial instruction based on an overview of the AVDN trajectory. As shown in Fig. 13, the satellite image shows the trajectory overview marked with a predefined starting position (the red point with an arrow showing the drone’s direction at the starting position) and a destination area (purple bounding box).

In the data collection rounds after the first round, the commander is required to give follow-up instructions, i.e., answers, to the questions from the follower. The user interface for the second and following rounds is shown in Fig. 14. Besides all the information shown to the commander in the first round, the commander is also provided with the previous dialog, past trajectories (broken purple line), and the view area corresponding to the most recent time step (named the current view area and marked with a white bounding box).

d.2 Interface for followers

Figure 15: Interface for human drone experts (follower). The upper window shows the simulated drone’s visual observation and the lower window shows the previous dialog.

The follower uses an interface to interact with our simulator. In our simulator, they receive instructions from the commander and control the simulated drone. The keyboard is used to simulate the drone controller, with eight keys representing four channels in the controller: keys w and s control forward and backward movement, keys a and d control left and right movement, keys q and e control clockwise and anti-clockwise rotation, and keys 1 and 2 control altitude change. After the follower finishes the control, they can either claim the destination is reached or ask questions for more instruction. As in Fig. 15, the interface consists of an image window showing the simulated drone’s visual observation and a text window for displaying the previous dialogs and inputting questions from the follower. There is a compass on the top left of the image window, showing the orientation of the simulated drone. The red cross in the image window shows the center of the view, helping the follower control the drone to right above the destination area, and the red corners in the window show the area of 0.4 IoU with the view area. The follower is instructed to make the destination area larger than the area indicated by the red corners in order to finish a successful navigation.