Drones have been widely adopted for many applications in daily life, from personal entertainment to professional use. Compared with ground robots, they offer greater mobility and the ability to observe large areas. However, controlling an aerial robot is also more complex, because an extra degree of freedom, altitude, is involved, and users typically need to hold a controller the entire time they fly a drone. It is therefore essential to create a hands-free control experience for drone users and to develop an intelligent drone that can complete tasks simply through conversation with a human. Such a drone lowers the barrier of control for users with disabilities and for users whose hands are occupied by activities such as taking photos or writing.
Therefore, this work introduces Aerial Vision-and-Dialog Navigation (AVDN), aiming at developing an intelligent drone that can converse with its user to fly to the expected destination. As shown in Figure 1, the user (commander) provides instructions, and the aerial agent (follower) follows the instruction and asks questions when needed. Through free-form dialog, potential ambiguities in the instruction can be gradually resolved when the commander provides further instructions by request. The past visual trajectories are also provided along with the question, which frees the commander from monitoring the drone all the time and minimizes the burden of drone control.
To implement and evaluate the AVDN task, we build a photorealistic simulator with continuous state and action spaces to simulate a drone flying with its onboard camera pointing straight downward. We then collect an AVDN dataset of 3,064 aerial navigation trajectories with human-human dialogs, where crowdsourcing workers play the commander role and drone experts play the follower role, as illustrated in Figure 1. Moreover, we collect the attention of human followers over the aerial views for a better understanding of where humans ground navigation instructions.
Based on our AVDN dataset, we introduce two challenging navigation tasks, Aerial Navigation from Dialog History (ANDH) and Aerial Navigation from Full Dialog History (ANDH-Full). Both tasks focus on predicting navigation actions that can lead the agent to the destination area, whereas the difference is that ANDH-Full presents the agent with full dialog and requires it to reach the final destination Kim et al. (2021), while ANDH evaluates the agent’s completion of the sub-trajectory within a dialog round given the previous dialog information Thomason et al. (2020).
The proposed tasks open new challenges of sequential action prediction in a large continuous space and natural language grounding on photorealistic aerial scenes. We propose a sequence-to-sequence Human Attention Aided (HAA) model for both tasks. The HAA model predicts waypoints to reduce the complexity of the search space and learns to stop at the desired location. More importantly, it is jointly trained to predict human attention from the input dialog and visual observations, and it learns where to look during inference. Experiments on our AVDN dataset show that multitask learning is beneficial and that human attention prediction improves navigation performance. The main contributions of our work are summarized as follows:
We propose a new dataset and a simulator for aerial vision-and-dialog navigation. The dataset includes over 3K aerial navigation trajectories with human-human dialogs.
We introduce ANDH and ANDH-Full tasks to evaluate the agent’s ability to understand natural language dialog, reason about aerial scenes, and navigate to the target location in a photorealistic aerial environment.
We build a Human Attention Aided (HAA) model as the baseline for the ANDH and ANDH-Full tasks. Besides predicting the waypoint navigation actions, HAA also learns to predict human attention along the navigation trajectory. Experiments on our AVDN dataset validate the effectiveness of our HAA model.
2 Related work
Vision-and-Language Navigation (VLN) is an emerging multi-modal task that studies the problem of using both language instructions and visual observations to predict navigation actions. We compare some of these works with our AVDN dataset in Table 1. Early VLN datasets such as Anderson et al. (2018); Ku et al. (2020) started with indoor house environments in the Matterport3D simulator Chang et al. (2017), where the visual scenes are connected on a navigation graph. To simulate continuous state changes as in the real world, Krantz et al. (2020) built a 3D continuous environment by reconstructing scenes based on topological connections, where the agent uses continuous actions during navigation. Other VLN studies focus on language instructions: Nguyen et al. (2019); Nguyen and Daumé III (2019); Thomason et al. (2020) created datasets where the agent can interact with the user by sending fixed signals or having dialogs. There are also works on synthetic indoor environments, such as Shridhar et al. (2020); Padmakumar et al. (2021), which use an interactive simulation environment with synthetic views named ALFRED, where the agent needs to follow language instructions or dialogs to finish household tasks. Besides indoor environments, some VLN datasets target more complex outdoor environments, such as the Touchdown dataset Chen et al. (2019) and the modified LANI dataset Misra et al. (2018). Blukis et al. (2019) is similar to ours in that both use drones. However, their synthetic environment is far from realistic scenes, and they ignore the control of the drone's altitude, which oversimplifies the navigation and leaves a large gap to real-world navigation in both the language and vision aspects. Our work combines the advantages of previous works: it features continuous environments and dialog instructions to better approximate real-world scenarios.
More importantly, our AVDN dataset is the first photorealistic outdoor aerial VLN dataset to the best of our knowledge.
Although using both vision and language for aerial navigation is a relatively new topic, vision-based aerial navigation for drones is already an active area in the field. Some inspiring works Loquercio et al. (2018); Giusti et al. (2015); Smolyanskiy et al. (2017); Fan et al. (2020); Bozcan and Kayacan (2020); Majdik et al. (2017); Kang et al. (2019) used pre-collected real-world drone data to tackle aerial vision navigation problems. However, due to the difficulty of collecting data and the risk of crashes, other works rely on simulation for aerial navigation. These simulators mostly use scenes with synthetic views Chen et al. (2018); Shah et al. (2017); Chen et al. (2020), where rich ground truths are provided without the need for annotation. Nevertheless, the visual gap between synthetic views and real-world views can hinder learning. Our simulator instead uses satellite images to simulate the drone's top-down visual observations, which avoids the shortcoming of synthetic scenes and leverages satellite imagery, for which rich labels are already available. As a result, we balance the trade-off between the high cost of drone data collection and the benefit of photorealistic data.
3 AVDN Dataset
The AVDN dataset includes dialogs, navigation trajectories, and the drone's visual observations with human attention. Examples of the dialogs and of the drone's visual observations with human attention are shown in Figure 2. With the help of a newly proposed simulator, we record the AVDN trajectories created by two groups of humans interacting with each other, playing either the commander role or the follower role. To the best of our knowledge, our AVDN dataset is the first aerial navigation dataset based on dialogs.
3.1 Simulator
We build a simulator to simulate a drone with a top-down view area, as in Figure (a). Our simulation environment is a continuous space, so the simulated drone can move continuously to any point within the environment. The drone's visual observations are square images corresponding to the drone's view area, cropped from high-resolution satellite images in the xView dataset Lam et al. (2018), an open-source large-scale satellite image object detection dataset, as in Figure (b). Although satellite images lack dynamics and stereoscopy, we argue that by using them our simulator can provide visual features as rich as those in the real world; some examples are shown in Figure (c). Therefore, we believe that the cropped satellite images are reasonable and cheap substitutes for images taken by a real drone's onboard camera. We also design an interface for our simulator, where the simulated drone can be controlled with a keyboard and the drone's visual observation is displayed in real time together with a digital compass, as shown in Figure (a). During control, users can also provide their attention over the displayed images by clicking the regions they attend to. Last but not least, our simulator can generate navigation overviews showing the starting position, destination area, current view area, and past trajectory if they exist, as in Figures (b) and (c).
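The view-area cropping described above can be sketched as follows. This is a minimal, axis-aligned illustration that ignores the drone's rotation; the function name, pixel-coordinate convention, and zero-padding at image borders are our own assumptions rather than the simulator's actual implementation.

```python
import numpy as np

def crop_view_area(sat_image, center_xy, width_px):
    """Crop a square observation around the simulated drone's view area.

    Axis-aligned sketch: a rotated view (following the drone's direction)
    would additionally require rotating the image about the center first.
    """
    cx, cy = center_xy
    half = width_px // 2
    x0, x1 = cx - half, cx + half
    y0, y1 = cy - half, cy + half
    h, w = sat_image.shape[:2]
    # Clamp to the satellite image bounds and zero-pad outside the map.
    crop = np.zeros((2 * half, 2 * half) + sat_image.shape[2:],
                    dtype=sat_image.dtype)
    sx0, sx1 = max(x0, 0), min(x1, w)
    sy0, sy1 = max(y0, 0), min(y1, h)
    crop[sy0 - y0:sy1 - y0, sx0 - x0:sx1 - x0] = sat_image[sy0:sy1, sx0:sx1]
    return crop
```

In the real simulator the crop width would be tied to the drone's altitude, so flying higher yields a wider (more zoomed-out) observation.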
3.2 Dataset Collection
We collect our dataset with the help of Amazon Mechanical Turk (AMT) workers and drone experts, where AMT workers play the commander role to provide instructions, and drone experts play the follower role to control a simulated drone and carry out the instructions. We pay the workers wages of no less than $15/h, and data collection lasted 90 days. We adopt an asynchronous data collection method, where the followers and commanders work in turns rather than simultaneously. This not only lowers the cost of data collection but also simulates how aerial vision-and-dialog navigation would work in practice.
Collecting one AVDN trajectory with dialog can involve one or multiple rounds of data collection. Before the first round, we sample objects in the xView dataset Lam et al. (2018) as the destination areas and pair them with randomly selected initial drone view areas, which are about 40 meters wide and always within a 1.5 km distance. With this prepared navigation information, we generate navigation overviews using our simulator, as shown in Figure (c).
In the first round of data collection, the navigation overviews are presented to AMT workers to create the initial instructions. We instruct the AMT workers to write instructions as if they were talking to a drone pilot, based on the marked satellite images. Next, human drone experts control the simulated drone through our simulator interface following the instructions, and they ask questions if they cannot find the destination area. During the simulation of the $i$-th navigation trajectory, a series of view areas $v^i_t$ is recorded, where each $v^i_t$ has three properties: the center coordinate $c^i_t$, the width $w^i_t$, and the direction $d^i_t$. One or more keyboard controls might be needed to go from $v^i_t$ to $v^i_{t+1}$, but the distance between the centers of $v^i_t$ and $v^i_{t+1}$ is fixed. When the experts stop the current navigation session, they can either enter questions into a chat box, claim the destination with a template sentence, or reject the instruction for bad quality.
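The per-step records described above can be represented with a simple record structure; the class and field names below are hypothetical, chosen only to illustrate what the simulator logs for each view area along a trajectory.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ViewArea:
    """One recorded view area (hypothetical field names)."""
    center: Tuple[float, float]  # (x, y) center coordinate
    width: float                 # side length of the square view
    direction: float             # heading in degrees

@dataclass
class Trajectory:
    """A sequence of view areas recorded during one navigation session."""
    view_areas: List[ViewArea] = field(default_factory=list)

    def record(self, center, width, direction):
        self.view_areas.append(ViewArea(center, width, direction))
```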
In the second and following rounds of data collection (if needed), we gather the dialog history from the previous round and let AMT workers continue the dialog by providing further instructions based on the unfinished navigation data. Our experts again navigate the drone and ask questions when necessary. We iterate this process until the destination is successfully reached and claimed by the expert, as in Figure (b).
An AVDN trajectory is successful when the object area is clearly and completely observed. We define the successful finish of an AVDN trajectory by both checking the center of the follower's last view area $v^i_T$ and computing the Intersection over Union (IoU) of $v^i_T$ and the destination area $A^i$. If the center of $v^i_T$ is inside $A^i$ and the IoU of $v^i_T$ and $A^i$ is larger than 0.4 when the follower claims the destination, the navigation trajectory is regarded as successfully finished. Otherwise, the navigation continues with further data collection rounds.
3.3 Data Analysis
Our AVDN dataset includes 3,064 aerial navigation trajectories, each with a multi-round natural language dialog. The most frequent words are shown in Figure (a). The recorded AVDN trajectories have an average path length of 287, and the distribution is shown in Figure (b). There are 2 rounds of dialog on average per full trajectory, and based on the data collection rounds, the trajectories and dialogs can be further divided into 6,269 sub-trajectories with corresponding dialog rounds.
We split our dataset into training, seen-validation, unseen-validation, and unseen-testing sets, where the seen and unseen sets are pre-separated by ensuring that the locations of their visual scenes are over 100 apart. We show statistics across the dataset splits in Table 2. The visual scenes in our dataset come from the xView dataset Lam et al. (2018), which covers both urban and rural scenes. The average area covered by the satellite images is 1.2.
Rather than a target hint being provided at the beginning as in Thomason et al. (2020), the destination must be inferred from the human instructions given by the commander. For example, the commander may give a detailed description of the destination initially, or write a rough instruction first and describe the destination later in the dialog. We also find two ways of describing directions for navigation: egocentric direction descriptions, such as "turn right", and allocentric direction descriptions, such as "turn south". By filtering and categorizing words related to directions, we find that some dialog rounds use egocentric direction descriptions, others include allocentric direction descriptions, and some mix both styles, making the instructions complex. This opens a new challenge of developing a language understanding module that can ground both egocentric and allocentric descriptions to navigation actions.
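The keyword-based categorization above can be sketched roughly as follows; the keyword lists here are illustrative choices, not the paper's actual filtering vocabulary.

```python
# Hypothetical keyword lists for the two direction styles.
EGOCENTRIC = {"left", "right", "forward", "backward", "ahead", "behind"}
ALLOCENTRIC = {"north", "south", "east", "west", "northeast",
               "northwest", "southeast", "southwest"}

def direction_style(instruction: str) -> str:
    """Classify an instruction as egocentric, allocentric, mixed, or none."""
    words = set(instruction.lower().replace(",", " ").split())
    ego = bool(words & EGOCENTRIC)
    allo = bool(words & ALLOCENTRIC)
    if ego and allo:
        return "mixed"
    if ego:
        return "egocentric"
    if allo:
        return "allocentric"
    return "none"
```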
4 Tasks
Following indoor dialog navigation Thomason et al. (2020); Kim et al. (2021), we introduce the Aerial Navigation from Dialog History (ANDH) task and the Aerial Navigation from Full Dialog History (ANDH-Full) task based on our AVDN dataset and simulator.
4.1 Aerial Navigation from Dialog History
The goal of the task is for the agent to predict aerial navigation actions that lead to the destination area by following the instructions in the dialog history. Specifically, one round of dialog, $d^i_j$, is input to the agent at the start of the sub-trajectory $\tau^i_j$, where $j \in \{1, \dots, n^i\}$ and $n^i$ is the number of sub-trajectories in trajectory $i$. At each time step $t$, an image of the view area $v^i_t$ is provided. The goal is to navigate to the destination area $A^i_j$ corresponding to dialog round $j$ before time step $T^i_j$, when sub-trajectory $\tau^i_j$ ends. The agent outputs actions, and the drone's view area moves accordingly. As a result, the predicted view-area sequence $\hat{v}^i_{t_j}, \dots, \hat{v}^i_{t_j + m^i_j}$, produced by the predicted waypoint actions, is recorded and evaluated against the ground-truth view-area sequence $v^i_{t_j}, \dots, v^i_{T^i_j}$, where $m^i_j$ is the number of predicted waypoint actions for sub-trajectory $j$ of trajectory $i$ and $t_j$ is the time step in trajectory $i$ when sub-trajectory $j$ starts.
4.2 Aerial Navigation from Full Dialog History
Compared with the ANDH task, the major difference of the ANDH-Full task is that it adopts the complete dialog history as input. At the beginning of each dialog round, we add a prompt describing the drone's direction, e.g., "facing east", derived from the ground truth. This helps clarify the subsequent language instructions, especially when egocentric direction descriptions exist. With the full dialog and visual observations, the agent needs to predict the full navigation trajectory from the starting view area to the destination area. ANDH-Full provides complete supervision on a navigation trajectory with a more precise destination description, and it involves longer utterances and more complex vision grounding challenges. As in the ANDH task, the predicted actions control the agent's movement, and the evaluation target is the sequence of view areas recorded after every action, together with the navigation trajectory generated from it.
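Generating such a direction prompt from the drone's ground-truth heading can be sketched as follows, assuming 0 degrees = north with angles increasing clockwise (the paper does not specify its convention) and an eight-way compass.

```python
def direction_prompt(heading_deg: float) -> str:
    """Map a heading angle to a coarse compass prompt such as 'facing east'."""
    names = ["north", "northeast", "east", "southeast",
             "south", "southwest", "west", "northwest"]
    # Shift by half a sector (22.5 deg) so each name covers a 45-deg wedge.
    idx = int(((heading_deg % 360) + 22.5) // 45) % 8
    return f"facing {names[idx]}"
```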
Since the agent in both ANDH and ANDH-Full needs to generate a predicted view-area sequence, the evaluation for both tasks is the same. In the evaluation, the center points of the view areas are connected to form the navigation trajectory, and the last view area determines whether the predicted navigation successfully reaches the destination area. The predicted navigation is successful if the IoU between the predicted final view area and the destination area is greater than 0.4. We apply several metrics for evaluation.
Success Rate (SR): the number of predicted trajectories regarded as successful, i.e., whose final view area satisfies the IoU requirement, divided by the total number of predicted trajectories.
Success weighted by inverse Path Length (SPL) Anderson et al. (2018): the Success Rate weighted by the ratio of the ground-truth path length to the predicted path length, penalizing unnecessarily long trajectories.
Goal Progress (GP) Thomason et al. (2020): the progress made toward the destination area, computed as the Euclidean distance from the start of the sub-trajectory to the center of the destination area, minus the remaining distance from the center of the predicted final view area to the center of the destination area.
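The SR, SPL, and GP metrics above can be sketched as follows; the exact normalization used by the benchmark's evaluation code may differ slightly, so treat this as an illustration of the definitions.

```python
import math

def path_length(points):
    # Total length of a polyline given as a list of (x, y) centers.
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))

def success_rate(successes):
    # SR: fraction of predicted trajectories whose final view area
    # passes the IoU > 0.4 test.
    return sum(successes) / len(successes)

def spl(successes, pred_paths, gt_lengths):
    # SPL: success weighted by ground-truth path length over the
    # (at least as long) predicted path length.
    total = 0.0
    for ok, pred, gt_len in zip(successes, pred_paths, gt_lengths):
        if ok:
            total += gt_len / max(path_length(pred), gt_len)
    return total / len(successes)

def goal_progress(start, final, dest):
    # GP: reduction in Euclidean distance to the destination center.
    return math.dist(start, dest) - math.dist(final, dest)
```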
Average Trajectory Deviation (ATD): the average distance from the predicted view-area centers to the ground-truth trajectory, measuring the fidelity of the predicted path. We form the ground-truth trajectory by connecting the centers of the ground-truth view areas with line segments $l_k = \mathrm{Seg}(c(v_k), c(v_{k+1}))$, where $\mathrm{Seg}(p, q)$ is the line segment from point $p$ to point $q$ and $c(\cdot)$ returns the center of a view area. The ATD for a predicted view-area sequence $\hat{v}_1, \dots, \hat{v}_{\hat{T}}$ is
$$\mathrm{ATD} = \frac{1}{\hat{T}} \sum_{t=1}^{\hat{T}} \min_{k} D\big(c(\hat{v}_t),\, l_k\big),$$
where $D(p, l)$ is the distance between point $p$ and line segment $l$.
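A minimal implementation of the ATD computation: each predicted center is matched to its closest ground-truth segment, and the resulting deviations are averaged.

```python
import math

def point_segment_distance(p, a, b):
    # Distance from point p to the line segment from a to b.
    ax, ay = a; bx, by = b; px, py = p
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0.0:
        return math.dist(p, a)
    # Project p onto the segment and clamp to its endpoints.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len2))
    return math.dist(p, (ax + t * dx, ay + t * dy))

def atd(pred_centers, gt_centers):
    # Average deviation of predicted view-area centers from the
    # ground-truth trajectory polyline.
    segments = list(zip(gt_centers, gt_centers[1:]))
    total = sum(min(point_segment_distance(p, a, b) for a, b in segments)
                for p in pred_centers)
    return total / len(pred_centers)
```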
5 Model
We propose a Human Attention Aided (HAA) model for the ANDH and ANDH-Full tasks, as shown in Figure 6. It takes multi-modal information as input and generates multi-modal predictions, including human attention prediction and navigation prediction.
The input has three modalities: the drone's direction, images, and language. The drone's direction and the images from its visual observation are input to the model at every time step. A fully connected direction encoder followed by a direction LSTM processes the continuous forward-direction inputs, and an xView-pretrained Darknet-53 (https://github.com/ultralytics/xview-yolov3) Redmon and Farhadi (2018) followed by a vision LSTM extracts image features. As for language inputs, special tokens such as [INS] and [QUE] are added in front of each instruction and question in the dialog before it is input to the model at the start of a prediction series. Our model then uses a BERT encoder Devlin et al. (2018) to obtain language features of the input dialog history. The language features are first used in the attention module ahead of the vision LSTM; then the hidden states from the direction LSTM and vision LSTM are concatenated and also attended by the language features.
Navigation Prediction and Waypoint Control The navigation outputs of our model include waypoint actions and navigation progress. A waypoint action has three dimensions corresponding to coordinates in 3D space. The navigation progress prediction Xiang et al. (2019) generates a scalar navigation progress indicator for deciding when to stop: if the predicted navigation progress is less than a threshold, the simulated drone executes the predicted waypoint action. The drone's view-area center at the next time step is given by the first two dimensions of the predicted waypoint, and the width of the view area is determined by the altitude given by the third dimension. The waypoint action also determines the drone's direction, which is kept aligned with the direction of movement.
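One control step of this scheme can be sketched as below; the stop threshold, the 90-degree field of view, and the state representation are illustrative assumptions, not values from the paper.

```python
import math

STOP_THRESHOLD = 0.9  # hypothetical progress threshold for stopping

def altitude_to_width(altitude, fov_deg=90.0):
    # View width grows with altitude; the 90-degree FOV is assumed.
    return 2.0 * altitude * math.tan(math.radians(fov_deg / 2.0))

def apply_waypoint(state, waypoint, progress):
    """One control step: stop if predicted progress passes the threshold;
    otherwise move the view area to the waypoint's (x, y), resize it from
    the waypoint altitude, and turn the drone toward its motion direction."""
    if progress >= STOP_THRESHOLD:
        return state, True  # stop here
    x, y, alt = waypoint
    cx, cy = state["center"]
    heading = math.degrees(math.atan2(x - cx, y - cy)) % 360  # 0 = north
    new_state = {"center": (x, y),
                 "width": altitude_to_width(alt),
                 "heading": heading}
    return new_state, False
```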
Human Attention Prediction A human attention decoder is proposed to predict the human attention mask using the output of the attention module that processes the image features. We build the decoder based on He et al. (2019): the input to the decoder is decoded to a low-resolution representation through a fully connected layer and then linearly interpolated to a mask with the same shape as the input image. Greater values in the mask indicate pixels that the human follower is more likely to attend to.
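A sketch of this decoding step in NumPy, assuming a hypothetical fully connected weight matrix and an 8x8 coarse grid; nearest-neighbor upsampling (via np.kron) stands in here for the linear interpolation used in the model.

```python
import numpy as np

def decode_attention(features, weights, grid=8, out_size=224):
    """Decode image features to a coarse attention grid, then upsample.

    `weights` plays the role of the fully connected layer; grid size and
    output resolution are illustrative assumptions.
    """
    coarse = (features @ weights).reshape(grid, grid)  # fully connected layer
    scale = out_size // grid
    # Nearest-neighbor upsample to the input-image resolution.
    mask = np.kron(coarse, np.ones((scale, scale)))
    return mask  # higher values = pixels the follower likely attends to
```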
Training We first train our HAA model on the ANDH task and then fine-tune it on the ANDH-Full task, because the ANDH task is relatively easier and provides a larger number of training samples. For each task, we alternate between teacher-forcing Williams and Zipser (1989) and student-forcing modes, where the main difference is whether the model interacts with the simulator using ground-truth actions or predicted actions. Our model is trained with the sum of the losses from navigation prediction and human attention prediction. First, the predicted waypoint action and predicted navigation progress are trained with Mean Squared Error (MSE) losses, supervised by the ground-truth waypoints and progress computed from the recorded trajectories in our dataset.
The rotation change resulting from each waypoint action is likewise computed from the ground truth. Second, for human attention prediction training, we apply a modified Normalized Scanpath Saliency (NSS) loss He et al. (2019). Given a predicted human attention map $P$ and a ground-truth human attention mask $Q$, the loss is the negative NSS score, $-\frac{1}{\sum_i Q_i}\sum_i \bar{P}_i Q_i$, where $\bar{P}$ is $P$ normalized to zero mean and unit standard deviation.
Since human attention may not exist in certain view areas, the human attention loss is only computed for view areas with recorded human attention.
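The negative-NSS loss, including the skip for view areas without recorded attention, can be sketched as follows; the exact modification relative to He et al. (2019) may differ.

```python
import numpy as np

def nss_loss(pred_map, gt_mask, eps=1e-8):
    """Negative Normalized Scanpath Saliency: standardize the predicted map,
    then average it over ground-truth attended pixels. Lower is better."""
    p = (pred_map - pred_map.mean()) / (pred_map.std() + eps)
    attended = gt_mask > 0
    if not attended.any():
        return 0.0  # skip view areas without recorded human attention
    return -float(p[attended].mean())
```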
[Table 3: results on the Seen Validation, Unseen Validation, and Unseen Testing splits]
6 Experiments
We compare our HAA model with several baselines on the ANDH and ANDH-Full tasks.
Baselines We first design a rule-based, non-learning baseline that uses keywords in the dialog history to generate navigation actions. It adopts the same control strategy as our model but only moves in a fixed direction based on detected keywords and stops at a random time step. We then introduce two uni-modal baseline models that take as input either vision or language only. Both uni-modal baselines use the same modality encoders and LSTM structure as our model and receive the drone's direction as input at each time step. Last but not least, we also build an ablation model with the same structure but without human attention prediction training.
Results on ANDH and ANDH-Full The models' weights are selected based on unseen SR, and we list the results on the ANDH and ANDH-Full tasks in Table 3. Our HAA model outperforms the baseline models in all metrics on the ANDH task. On the ANDH-Full task, our model performs best in SPL, SR, and GP, showing a better ability to find the destination area, though its path fidelity degrades slightly due to the longer and more complex trajectories.
Additionally, we notice that the language-only uni-modal model achieves much better performance than the vision-only uni-modal model, which indicates that the language instructions play an important role in guiding the navigation. Also, compared with uni-modal baseline models, the w/o attention model has more significant performance improvements in SPL and SR on the seen validation set than on the unseen validation set. This shows that our model structure is effective for leveraging the vision information in the multi-modal learning task because the seen and unseen sets are split based on the vision data of our AVDN dataset.
Impact of Human Attention Prediction We first evaluate the impact of human attention prediction training in our HAA model by testing the models' performance on the ANDH task on subsets of our AVDN unseen validation set, where sub-trajectories are separated into four subsets by their ground-truth length. The longer the trajectory, the greater the challenge, because a longer control sequence is required to reach the destination area. In Figure 7, we compare the number of successful sub-trajectories in each subset between our HAA model and the w/o attention model. With human attention prediction training, our HAA model achieves significant performance improvements on the subsets with longer trajectories, leading to the conclusion that human attention prediction training benefits navigation prediction, especially for long trajectories.
Besides improving task performance, human attention prediction also benefits the interpretability of the model by generating visualizable attention predictions paired with navigation predictions. We evaluate the human attention prediction using the Normalized Scanpath Saliency (NSS) score, which measures the normalized saliency prediction at the ground-truth human attention. Our HAA model receives NSS scores of 0.71, 0.52, and 0.56 on the seen validation, unseen validation, and test sets, respectively, indicating that the human attention prediction is effective.
7 Conclusion
In this work, we introduce a dataset for Aerial Vision-and-Dialog Navigation with over 3K human-human free-form dialogs. We build a continuous-space drone simulator with photorealistic scenes based on satellite images, which makes the vision domain of our dataset complex and close to reality. Based on the dataset, we propose two challenging navigation tasks, Aerial Navigation from Dialog History and Aerial Navigation from Full Dialog History. We design a Human Attention Aided model for both tasks and demonstrate the potential of human attention data by showing that the model's navigation performance benefits from human attention prediction training. Our work opens possibilities for further studies that develop stronger models on AVDN, focusing not only on navigation prediction but also on question generation. Furthermore, based on our results, future work may investigate using human attention prediction training to help solve other VLN problems.
Ethics Statement
This work proposes a dataset, a simulator, tasks, and models for Aerial Vision-and-Language Navigation. Since satellite images are needed to simulate the drone's observations, privacy risks may exist. By using the open-source satellite dataset xView Lam et al. (2018), we mitigate these risks while still being able to develop a simulator for training our model. We also recognize potential ethical concerns during dataset collection, where human annotators are involved. We therefore used the Amazon Mechanical Turk (AMT) platform to find workers willing to participate in the project. With AMT, our data collection is constrained by legal terms, and the data collection protocol is subject to AMT's approval. The agreement signed by both requesters and workers on AMT also ensures a transparent and fair data annotation process in which privacy is well protected.
References
- Anderson et al. (2018) Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. 2018. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3674–3683.
- Blukis et al. (2019) Valts Blukis, Yannick Terme, Eyvind Niklasson, Ross A Knepper, and Yoav Artzi. 2019. Learning to map natural language instructions to physical quadcopter control using simulated flight. arXiv preprint arXiv:1910.09664.
- Bozcan and Kayacan (2020) Ilker Bozcan and Erdal Kayacan. 2020. AU-AIR: A multi-modal unmanned aerial vehicle dataset for low altitude traffic surveillance. arXiv preprint arXiv:2001.11737.
- Chang et al. (2017) Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. 2017. Matterport3d: Learning from rgb-d data in indoor environments. arXiv preprint arXiv:1709.06158.
- Chen et al. (2019) Howard Chen, Alane Suhr, Dipendra Misra, Noah Snavely, and Yoav Artzi. 2019. Touchdown: Natural language navigation and spatial reasoning in visual street environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12538–12547.
- Chen et al. (2020) Lyujie Chen, Feng Liu, Yan Zhao, Wufan Wang, Xiaming Yuan, and Jihong Zhu. 2020. Valid: A comprehensive virtual aerial image dataset. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 2009–2016. IEEE.
- Chen et al. (2018) Lyujie Chen, Wufan Wang, and Jihong Zhu. 2018. Learning transferable uav for forest visual perception. arXiv preprint arXiv:1806.03626.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Fan et al. (2020) Yue Fan, Shilei Chu, Wei Zhang, Ran Song, and Yibin Li. 2020. Learn by observation: Imitation learning for drone patrolling from videos of a human navigator. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5209–5216. IEEE.
- Giusti et al. (2015) Alessandro Giusti, Jérôme Guzzi, Dan C Cireşan, Fang-Lin He, Juan P Rodríguez, Flavio Fontana, Matthias Faessler, Christian Forster, Jürgen Schmidhuber, Gianni Di Caro, et al. 2015. A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics and Automation Letters, 1(2):661–667.
- He et al. (2019) Sen He, Hamed R Tavakoli, Ali Borji, Yang Mi, and Nicolas Pugeault. 2019. Understanding and visualizing deep visual saliency models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10206–10215.
- Kang et al. (2019) Katie Kang, Suneel Belkhale, Gregory Kahn, Pieter Abbeel, and Sergey Levine. 2019. Generalization through simulation: Integrating simulated and real data into deep reinforcement learning for vision-based autonomous flight. arXiv preprint arXiv:1902.03701.
- Kim et al. (2021) Hyounghun Kim, Jialu Li, and Mohit Bansal. 2021. NDH-Full: Learning and evaluating navigational agents on full-length dialogue. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6432–6442.
- Krantz et al. (2020) Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. 2020. Beyond the nav-graph: Vision-and-language navigation in continuous environments. In European Conference on Computer Vision, pages 104–120. Springer.
- Ku et al. (2020) Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge. 2020. Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding. arXiv preprint arXiv:2010.07954.
- Lam et al. (2018) Darius Lam, Richard Kuzma, Kevin McGee, Samuel Dooley, Michael Laielli, Matthew Klaric, Yaroslav Bulatov, and Brendan McCord. 2018. xview: Objects in context in overhead imagery. arXiv preprint arXiv:1802.07856.
- Loquercio et al. (2018) Antonio Loquercio, Ana I Maqueda, Carlos R Del-Blanco, and Davide Scaramuzza. 2018. Dronet: Learning to fly by driving. IEEE Robotics and Automation Letters, 3(2):1088–1095.
- Majdik et al. (2017) András L Majdik, Charles Till, and Davide Scaramuzza. 2017. The zurich urban micro aerial vehicle dataset. The International Journal of Robotics Research, 36(3):269–273.
- Misra et al. (2018) Dipendra Misra, Andrew Bennett, Valts Blukis, Eyvind Niklasson, Max Shatkhin, and Yoav Artzi. 2018. Mapping instructions to actions in 3d environments with visual goal prediction. arXiv preprint arXiv:1809.00786.
- Nguyen and Daumé III (2019) Khanh Nguyen and Hal Daumé III. 2019. Help, anna! visual navigation with natural multimodal assistance via retrospective curiosity-encouraging imitation learning. arXiv preprint arXiv:1909.01871.
- Nguyen et al. (2019) Khanh Nguyen, Debadeepta Dey, Chris Brockett, and Bill Dolan. 2019. Vision-based navigation with language-based assistance via imitation learning with indirect intervention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12527–12537.
- Padmakumar et al. (2021) Aishwarya Padmakumar, Jesse Thomason, Ayush Shrivastava, Patrick Lange, Anjali Narayan-Chen, Spandana Gella, Robinson Piramuthu, Gokhan Tur, and Dilek Hakkani-Tur. 2021. Teach: Task-driven embodied agents that chat. arXiv preprint arXiv:2110.00534.
- Redmon and Farhadi (2018) Joseph Redmon and Ali Farhadi. 2018. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767.
- Shah et al. (2017) Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. 2017. Airsim: High-fidelity visual and physical simulation for autonomous vehicles. arXiv preprint arXiv:1705.05065.
- Shridhar et al. (2020) Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. 2020. Alfworld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768.
- Smolyanskiy et al. (2017) Nikolai Smolyanskiy, Alexey Kamenev, Jeffrey Smith, and Stan Birchfield. 2017. Toward low-flying autonomous MAV trail navigation using deep neural networks for environmental awareness. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4241–4247.
- Thomason et al. (2020) Jesse Thomason, Michael Murray, Maya Cakmak, and Luke Zettlemoyer. 2020. Vision-and-dialog navigation. In Conference on Robot Learning, pages 394–406. PMLR.
- Williams and Zipser (1989) Ronald J Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280.
- Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
- Xiang et al. (2019) Jiannan Xiang, Xin Wang, and William Yang Wang. 2019. Not all actions are equal: Learning to stop in language-grounded urban navigation. In ViGIL@ NeurIPS.
Appendix A HAA Model and Experiment Details
There are around 150M parameters in our HAA model.
A.1 Language Feature Encoding
We extract language features from the input dialog history. For the ANDH task, we extract two types of language features: the input is either all the dialog rounds, from round 1 up to the round corresponding to the target sub-trajectory, or only the dialog round for the target sub-trajectory. The language feature generated from all previous dialog rounds is used to attend to the image feature extracted by DarkNet-53, which is then input to the vision LSTM cell, whereas the language feature of the current dialog round, which carries more closely related information, is used to attend to the concatenated vision and direction LSTM hidden states. For the ANDH-Full task, since the input is the full dialog history, the same language feature is extracted and input to both attention modules.
A.2 Attention Module
The attention modules used in our HAA model share the same structure. They generate soft attention based on the dot-product attention mechanism. The inputs are context features and attention features, and a fully connected layer produces the output. The context features, attended by the attention features, are concatenated with the attention features to form the input of the fully connected layer; the output of this layer is the attention module's output, which has the same shape as the attention features.
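The module described above can be sketched in a few lines. This is a minimal NumPy illustration, not the released implementation; the function and variable names (`soft_dot_attention`, `w`, `b`) are ours, and we assume a single attention-feature vector for simplicity.

```python
import numpy as np

def soft_dot_attention(context, query, w, b):
    """Soft dot-product attention (sketch).

    context: (n, d) context features to be attended
    query:   (d,)   attention feature
    w, b:    final fully connected layer mapping the
             concatenated (2*d,) vector back to shape (d,)
    """
    # Dot-product scores between the attention feature and each context feature.
    scores = context @ query                   # (n,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                       # soft attention weights
    attended = alpha @ context                 # (d,) attended context
    # Concatenate attended context with the attention feature, then project
    # so the output has the same shape as the attention feature.
    fused = np.concatenate([attended, query])  # (2*d,)
    return w @ fused + b                       # (d,)
```

The output shape matching the attention features is what allows the same module to be reused at both attention sites in the model.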
A.3 Waypoint Control with Rotation Following Strategy
The predicted waypoint actions have three dimensions, based on the strategy of waypoint control with rotation following. At each time step, the waypoint action is derived by first choosing a point in the current view area and then combining it with an extra dimension for altitude control. When the waypoint action is performed, the drone arrives at the position of the waypoint at the chosen altitude. Our waypoint control also adopts a rotation following strategy, meaning that the agent always keeps its heading pointing in the forward direction of travel. This control method avoids repeated identical actions in the action sequence, maintains maximum continuity in control, and avoids adding redundant control dimensions.
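The pose update implied by this strategy can be sketched as follows. This is our own minimal illustration under the stated assumptions (2D ground coordinates plus altitude; heading measured in radians); the names `apply_waypoint`, `pos`, and `action` are hypothetical.

```python
import math

def apply_waypoint(pos, action):
    """Waypoint control with rotation following (sketch).

    pos:    (x, y, altitude) current drone position
    action: (wx, wy, new_altitude) -- a point chosen in the current
            view area plus one extra dimension for altitude control
    Returns the new position and the new heading: the drone keeps
    pointing toward its direction of travel ("rotation following").
    """
    x, y, _ = pos
    wx, wy, new_alt = action
    heading = math.atan2(wy - y, wx - x)  # face the forward direction
    return (wx, wy, new_alt), heading
```

Because the heading is derived from the movement itself, no separate rotation dimension is needed in the action space.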
A.4 Navigation Progress Prediction
For navigation progress prediction, we adopt the idea of L2Stop Xiang et al. (2019) and create a navigation progress predictor that helps decide when to stop, overcoming the problem that the model would otherwise fail to stop at the desired position. The progress predictor is trained with the supervision of the IoU score between the current view area and the destination area: an IoU larger than 0 indicates that the destination area is visible in the current view, and the larger the IoU, the closer the view area is to the destination area. During inference, navigation stops when the predicted navigation progress indicator is less than 0.25.
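For concreteness, the supervision signal and the stopping rule stated above can be sketched as follows; this is our own illustration (the box format `(x1, y1, x2, y2)` and function names are assumptions), with the stop condition taken verbatim from the rule above (stop when the indicator falls below 0.25).

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union else 0.0

def should_stop(progress_indicator, threshold=0.25):
    # At inference time, navigation stops once the predicted
    # progress indicator drops below the threshold.
    return progress_indicator < threshold
```

During training, `iou(current_view, destination_area)` would serve as the regression target for the progress predictor.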
A.5 Training Details
We train our HAA model on one Nvidia RTX A5000 graphics card. We first train the model for approximately 100k iterations on the ANDH task, which takes about 48 hours, with a batch size of 4 and a learning rate of 1e-5. Then the model weights with the best performance on the seen validation set are selected and fine-tuned for the ANDH-Full task. Since the ANDH-Full task uses the full dialog history as input and therefore needs more GPU RAM, we use a batch size of 1 and a learning rate of 2.5e-6 and train the model for 200k more iterations, which takes about 70 hours.
Appendix B Simulator Details
We design a simulator to simulate a drone flying with its onboard camera facing straight downward. The simulator uses satellite images for the drone's visual observation, where each observation is a square image patch cropped from the satellite image based on the drone's view area. Since satellite images have boundaries that are not adjacent to each other, we prevent the drone's view area from moving out of the boundary by automatically invalidating any action that would lead to an out-of-boundary view area. Additionally, for simplicity, we assume perfect control of the drone's movement; therefore, the drone's current view area is determined by the drone's previous position and the navigation action.
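The boundary rule and the perfect-control assumption together make the state transition very simple, as in this sketch (the `(cx, cy, half_size)` view representation and function names are our assumptions, not the simulator's API):

```python
def is_valid_view(cx, cy, half_size, img_w, img_h):
    """True if the square view area stays inside the satellite image."""
    return (cx - half_size >= 0 and cy - half_size >= 0 and
            cx + half_size <= img_w and cy + half_size <= img_h)

def step(view, proposed_view, img_w, img_h):
    """Under perfect control, the next view is fully determined by the
    action; actions leading out of the boundary are simply invalidated,
    leaving the drone's view area unchanged."""
    cx, cy, half = proposed_view
    return proposed_view if is_valid_view(cx, cy, half, img_w, img_h) else view
```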
During dataset collection, the follower controls the simulated drone through the simulator interface with the keyboard. We define 8 keys for control over a total of four degrees of freedom (DoFs): 2 DoFs for horizontal movement, 1 DoF for altitude control, and 1 DoF for rotation control. Although our simulator environment is continuous, control through the interface is discrete for an easier control experience. Every time a key is pressed, the simulated drone moves along the corresponding DoF for a fixed distance, and the higher the simulated drone flies, the farther it moves with one key press. Before pressing the ESC key to stop the control, the follower can also generate human attention data by left-clicking with the mouse on the attended image region shown on the interface. After every left-click, a circle with a radius of 1/10 of the current view area width becomes the attended region and is displayed on the interface; a right-click on the circle removes this region from the attention record.
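The key-to-DoF mapping can be sketched as below. The specific step sizes are our assumptions (the text only states that the step grows with altitude), and the key assignments follow the follower-interface description in Appendix D.

```python
# 8 keys covering 4 DoFs: forward/backward, left/right, yaw, altitude.
KEY_TO_DOF = {
    "w": ("forward",  +1), "s": ("forward",  -1),
    "a": ("lateral",  -1), "d": ("lateral",  +1),
    "q": ("yaw",      +1), "e": ("yaw",      -1),
    "1": ("altitude", +1), "2": ("altitude", -1),
}

def key_step(key, altitude, move_ratio=0.1, yaw_step_deg=15.0):
    """One discrete control step; translation distance scales with
    altitude, so the higher the drone flies, the farther one press moves it.
    (move_ratio and yaw_step_deg are illustrative values.)"""
    dof, sign = KEY_TO_DOF[key]
    if dof == "yaw":
        return dof, sign * yaw_step_deg
    return dof, sign * move_ratio * altitude
```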
Appendix C Dataset Details and Examples
We provide some details about our dataset with related examples. Each example includes a dialog, a sample of the drone's visual observation with human attention, and a navigation overview.
C.1 Human Attention
We record attention from the follower through our simulator interface while the follower is controlling the simulated drone. For each collected navigation trajectory, the attended areas are stored in a list whose order is ignored, meaning that areas recorded earlier or later during the navigation are retrieved together when the human attention data is used. This makes the human attention data more accurate, since an area the follower missed in the current view is likely to be included at a future time step. Also, because previously attended areas are kept in later view areas, less effort is needed to annotate the attended areas. We find that on average 1/7 of the area is attended to in the recorded view areas.
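Querying this pooled, order-ignored attention record can be sketched as follows; the record format `(cx, cy, view_width)` per click and the function name are our assumptions, while the radius of 1/10 of the view width comes from the interface description above.

```python
def is_attended(point, clicks):
    """True if a pixel lies inside any recorded attention circle.

    point:  (px, py) query pixel in satellite-image coordinates
    clicks: pooled list of (cx, cy, view_width) over the whole
            trajectory; order is ignored by design.
    """
    px, py = point
    for cx, cy, view_width in clicks:
        r = view_width / 10.0  # radius is 1/10 of the view width at click time
        if (px - cx) ** 2 + (py - cy) ** 2 <= r ** 2:
            return True
    return False
```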
C.2 Dialog Structure
The dialogs in our AVDN dataset have varying numbers of rounds. Since the dialog rounds are split based on the data collection rounds, each dialog round contains only one instruction written by the commander. Figure 8 shows an example of a simple dialog with only one dialog round. However, when the follower cannot follow the initial instruction to find the destination area, questions are brought up, and therefore more dialog rounds are introduced. Every dialog round starts with an instruction from the human commander and can include one or more utterances from the follower, depending on whether auto-instructions exist; we provide details about auto-instructions in the next sub-section. Also, when followers write questions, we let them define shortcut keys for frequently used general questions such as "could you further explain it?" or "where should I go?". To avoid templated dialogs, followers are forbidden from using only the shortcut for a question and must incorporate their own language.
C.3 Auto-Instructions
When the follower claims that the destination is reached, our simulator automatically checks the navigation result using the success condition described in Section 3.2. Then auto-instructions are generated based on whether the destination area is reached successfully. Specifically, when the success condition is met, the auto-instruction "Yes, you have find it!!!" is appended to the dialog as its end; if the destination is in the center of the view area but the view area is either too large or too small, failing the success condition, the simulator provides auto-instructions asking the follower to adjust the drone's altitude and then verifies again whether the success conditions are met, as shown in Figure 9. Otherwise, as in Figure 10, the auto-instruction lets the follower know that the destination area is not reached and allows the follower to ask more questions in the current dialog round.
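The branching just described can be sketched as below. Only the success message is quoted verbatim from the dataset; the other message strings and the boolean-flag interface are our paraphrase and assumptions.

```python
def auto_instruction(success, centered, view_too_large, view_too_small):
    """Generate an auto-instruction after the follower claims arrival
    (sketch of the branching; non-success messages are paraphrased)."""
    if success:
        return "Yes, you have find it!!!"  # verbatim success message
    if centered and view_too_large:
        # Destination centered but view too large: fly lower, then re-check.
        return "Lower the drone's altitude and check again."
    if centered and view_too_small:
        # Destination centered but view too small: fly higher, then re-check.
        return "Raise the drone's altitude and check again."
    # Otherwise the follower is told to keep asking in the current round.
    return "The destination area is not reached; you may ask more questions."
```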
C.4 Dialog Quality
To ensure that the dialogs in our dataset are of good quality, we make efforts during the data collection process and conduct an extra examination of the dialog data afterward.
During data collection, online workers from Amazon Mechanical Turk (AMT) play the commander role and provide the instructions in the dialog. Compared with the followers, whom we hired to work on-site and supervised in person, AMT workers have a higher chance of producing low-quality or incorrect language instructions. We develop several strategies to deal with such undesired instructions. First, if the follower, guided by the instruction, flies the drone in a direction that differs by more than 90 degrees from the ground-truth direction of the destination area, our simulator automatically labels the instruction as incorrect; labeled instructions are discarded and collected again. Second, since the followers need to read and understand the instructions, they can report instructions as low-quality or incomprehensible and skip them. Finally, among the remaining instructions not spotted as low-quality or incorrect, it is still possible that some are inaccurate due to human mistakes by the AMT workers, as in Figure 11. By manually checking the dialogs and navigation trajectories in randomly selected subsets of our AVDN dataset, we spot only 5 instructions with potential mistakes in 50 dialogs. In those cases, because the follower successfully followed the instruction, we keep the instructions unchanged even if they did not help guide the follower to find the destination area. In the real world, the user in AVDN could also make mistakes, so this mistake-tolerance strategy makes our dataset even closer to real scenarios.
We further examine the dialog quality after data collection by analyzing the dialogs. The average number of utterances (human-written instructions and questions) per dialog is 3.1, with a minimum of 1 and a maximum of 7, since each dialog includes at least one instruction written by a human. The average numbers of words written by the commander and the follower are 45 and 19, respectively, and there are about 15 words from auto-instructions. Also, in Figure 12, we show the distribution of the top 30 most frequent words in the commander's and follower's utterances. The results show a smooth variation across nouns, verbs, adjectives, and prepositions, indicating that the utterances in our dataset have rich content and good variety. Last but not least, we manually checked the dialogs in all validation and test sets by visualizing the corresponding navigation trajectories and dialogs, and we observed no major issues.
Appendix D Interfaces for Workers in Dataset Collection
We use help from Amazon Mechanical Turk (AMT) workers and human drone experts during the collection of our Aerial Vision-and-Dialog Navigation (AVDN) dataset, where the AMT workers play the commander role, providing instructions, and the drone experts play the follower role, asking questions and controlling the drone. In this section, we present the interfaces for both groups of workers with all the information they receive during the data collection procedure.
D.1 Interfaces for Commanders
There are two interfaces for commanders (AMT workers), depending on the data collection round. Each interface presents one trajectory at a time and contains all the information the commander needs to create the instruction. Detailed, step-by-step instructions for what needs to be done as a commander are given at the beginning of the interface. The AMT workers then write sentences in the Answer field according to the provided information.
In the first round of data collection, the commander writes the initial instruction based on an overview of the AVDN trajectory. As shown in Fig. 13, the satellite image shows the trajectory overview marked with a predefined starting position (the red point, with an arrow showing the drone's direction at the starting position) and a destination area (purple bounding box).
In data collection rounds after the first, the commander is required to give follow-up instructions, i.e., answers to the questions from the follower. The user interface for the second and following rounds is shown in Fig. 14. Besides all the information shown in the first round, the commander is also provided with the previous dialog, the past trajectory (broken purple line), and the view area corresponding to the most recent time step (the current view area, marked with a white bounding box).
D.2 Interface for Followers
The follower uses an interface to interact with our simulator, receiving instructions from the commander and controlling the simulated drone. The keyboard simulates the drone controller, with eight keys representing the controller's four channels: keys w and s control forward and backward movement, keys a and d control left and right movement, keys q and e control clockwise and anti-clockwise rotation, and keys 1 and 2 control altitude change. After finishing the control, the follower can either claim that the destination is reached or ask questions for more instructions. As shown in Fig. 15, the interface consists of an image window showing the simulated drone's visual observation and a text window for displaying the previous dialog and inputting the follower's questions. A compass at the top left of the image window shows the orientation of the simulated drone. The red cross in the image window marks the center of the view, helping the follower position the drone right above the destination area, and the red corners in the window indicate the area that has an IoU of 0.4 with the view area. The follower is instructed to make the destination area larger than the area indicated by the red corners in order to complete a successful navigation.
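Under our reading of the red-corner marker as a centered square whose IoU with the square view equals 0.4, its side length follows from simple geometry: for a square of side s centered inside a view of side V (with s ≤ V), the intersection is s² and the union is V², so IoU = s²/V². This sketch and its names are our own illustration.

```python
import math

def corner_marker_side(view_side, iou_target=0.4):
    """Side length of the centered square whose IoU with the square
    view area equals iou_target (the region the red corners outline,
    under our interpretation). IoU = s^2 / V^2  =>  s = V * sqrt(IoU)."""
    return view_side * math.sqrt(iou_target)
```

For example, a view of side 100 yields a marker square of side about 63.2, so the destination area must fill roughly 40% of the view for the success condition.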