
Communicative Learning with Natural Gestures for Embodied Navigation Agents with Human-in-the-Scene

by   Qi Wu, et al.

Human-robot collaboration is an essential research topic in artificial intelligence (AI): it enables researchers to devise cognitive AI systems and affords an intuitive means for users to interact with robots. Notably, communication plays a central role. To date, prior studies in embodied agent navigation have demonstrated only that human languages facilitate communication, through instructions given in natural language. Nevertheless, a plethora of other forms of communication remains unexplored. In fact, human communication originated in gestures and is often delivered through multimodal cues, e.g., "go there" with a pointing gesture. To bridge this gap and fill in the missing dimension of communication in embodied agent navigation, we propose investigating the effects of using gestures as the communicative interface instead of verbal cues. Specifically, we develop a VR-based 3D simulation environment, named Ges-THOR, based on the AI2-THOR platform. In this virtual environment, a human player is placed in the same virtual scene and shepherds the artificial agent using only gestures. The agent is tasked with solving the navigation problem guided by natural gestures with unknown semantics; we do not use any predefined gestures, owing to the diversity and versatile nature of human gestures. We argue that learning the semantics of natural gestures and learning the navigation task are mutually beneficial: learn to communicate, and communicate to learn. In a series of experiments, we demonstrate that human gesture cues, even without predefined semantics, improve object-goal navigation for an embodied agent, outperforming various state-of-the-art methods.




I Introduction

Human-human communication takes place in various forms, of which gestures play a crucial role [41]. Gestures include movements of the body, head, or hands; they can facilitate the understanding of speech or serve as emblems that deliver messages in place of speech [40, 42]. They can significantly improve communication efficacy for information conveyance [8].

Similarly, human-robot communication can also occur through multimodal cues [79]. Although robots and autonomous systems are designed to collaborate with humans who supervise, instruct, or evaluate them on specific tasks, most existing communication interfaces assume that humans communicate with an artificial agent only through natural language, either verbally or in text. In stark contrast, the origin of human communication is primarily rooted in nonverbal forms [66], e.g., gestures. Therefore, providing assistive or collaborative AI systems with nonverbal means of communication would open up new research avenues to investigate the efficacy of alternative communication forms. Unlike natural language, which suffers from intermittent conveyance and demands continuous attention, nonverbal cues like gestures are immediate and intuitive, and hence less vulnerable to interruptions. In particular, when the environment is noisy or the agent is listening to someone else, a user might refer to a location with a deictic gesture, e.g., pointing with a finger, instead of describing it in a long sentence.

Fig. 1: Natural gestures can succinctly deliver complex semantic meaning in a physical space. A human user can instruct a robot or a virtual agent to complete a navigation task by simply referring to a target location or object using gestures. The agent should infer the user intent from the gestures.

To illustrate the significance of communicating with gestures, take Fig. 1 as an example. A human intends to instruct the robot to navigate to a target location or object in the scene. Previous works in embodied visual navigation with language-based human interactions may require a lengthy text message, such as "go to the second brown chair next to the big white table in the living room." In contrast, gestures express the same message in a much simpler and more natural way, e.g., "go there," "clean here," or "bring it to me." Such multimodal messages can only be correctly interpreted in a given physical space where a human and an agent are situated together. The meaning of a human message must be inferred from the agent's joint understanding of the given scene together with the semantics of natural human gestures [63, 62].

Inspired by the above observation, we bring nonverbal communication cues [41, 25, 39] into the embodied agent navigation task, the most straightforward task for an embodied AI that interacts with environments and other agents. Despite the progress reported for embodied agents on the Vision-Language Navigation (VLN) task [16, 14, 33, 21, 3, 68, 17, 26, 78], we reflect on prior art and pose the following questions: Instead of using natural language, can we replace the language grounding with gestures in a similar setting? Can we improve navigation performance by incorporating gestures? Can the learning agent acquire the underlying semantics of gestures, even when they are not predefined?

Specifically, we aim to use gestures to communicate with an embodied agent as it navigates in a virtual environment. To provide gesture-based instructions for a navigation task, the agent needs a photorealistic simulation environment, and a human player needs to be situated in the same scene to establish joint attention [65]. To support such a co-existing environment, we build our virtual environment, Ges-THOR, with Oculus, Kinect, and Leap Motion, based on the existing AI2-THOR framework [47].

Although human gestures have been used as a communicative interface between humans and robots in robotics [35, 53, 24, 55, 13], prior literature typically predefines the vocabulary of admissible gestures and their definite meanings (e.g., an "ok" sign means approval). In contrast, human gestures are diverse, and their meanings are non-rigid and context-dependent [38]. One needs a flexible system to address the variability and versatility of nearly unlimited naturalistic human gestures without a predefined set of recognizable gestures. Without defining any gestures or their meanings ourselves, we have collected demonstrations from a group of volunteers who have diverse gesture preferences for the same message.

In our proposed framework, the agent must therefore solve two tasks: multimodal target inference and navigation. Inferring the meanings of human gestures and finding a path to the target location are the agent's two major goals, and they mutually help each other, i.e., learn to communicate and communicate to learn. Experiments reveal that our gesture-incorporating model outperforms a vision-only baseline for navigation, as well as models addressing similar environments and tasks with different methods [76].

This paper makes four contributions: (i) we introduce human gestures as a new communicative interface for embodied AI learning; (ii) we develop a simulation framework, Ges-THOR, that supports multimodal interactions with human users; (iii) we demonstrate that the embodied agent's navigation performance significantly improves after incorporating human gestures; and (iv) we further demonstrate that the agent can learn the underlying meanings and intents of human gestures without predefined associations.

II Related Work

Language grounding

Language grounding is crucial for both parties in a communication to understand each other. Natural language, the most common modality for human-human and human-robot communication, can realize the grounding in various ways. For communication with robots, language can be interpreted from instructional commands into actions [49, 50]. For static images or texts, it can be either visually grounded [4, 54] or text-based [70] Q&A. In our work, language grounding is replaced by "gesture grounding": we provide gestures as the new communicative interface. The agent is tasked to learn by grounding human gestures into a series of actions and to identify target objects.

Vision-language navigation

Image captioning with large datasets [67] and Visual Question Answering (VQA) [4, 48] have made significant progress in vision and language understanding, which enables visually-grounded language navigation agents to be trained. Many tasks following the VLN framework [16, 14, 22, 21, 3, 68, 17, 26, 64, 34, 78, 61] have been addressed and solved using end-to-end learning models, in 2D worlds [77, 19, 10], 3D worlds [37, 14], or even photorealistic environments [3, 17, 26, 73, 78]. Some works have also explored acoustic cues in navigation, though these are not mainly concerned with speech [27, 15]. Our work builds on the existing VLN framework but extends it by incorporating gestures as a new modality for communication.

Simulated environments

To help research in embodied AI learning, various simulated environments have emerged for the community's benefit. These 3D environments are created from either synthetic scenes [9, 71, 47, 73, 60, 56] or real photographs [57, 72, 58, 12]; some of them use game engines to enable physical interactions [44, 7, 47, 74, 73]. In this paper, we choose AI2-THOR, which uses Unity as the physics engine, and build our environment on it. Existing works using AI2-THOR for visual navigation tasks [80, 76] require either a visualization of the target or its context. In this paper, we propose a gesture-based method that eliminates the need for acquiring such additional target information.

RL for navigation

Instead of using traditional path-planning approaches [43] to compute a route to the goal location, the embodied AI community has recently focused more on end-to-end learning for training navigation policies, especially with Reinforcement Learning (RL). Compared with other machine learning methods, such as supervised learning [51], RL benefits from simple reward definitions and easy implementation. As a result, RL has become the core of the learning framework [58, 72, 3, 80, 33, 76]. In this paper, we choose Proximal Policy Optimization (PPO) [59] as the RL model.

Human AI interaction

Human-AI Interaction (HAI) has been intensively investigated in AI, Human-Computer Interaction, and robotics [31, 1, 45]. For embodied navigation agents, the emergence of simulated environments enables users to communicate with the agent interactively. Most existing frameworks achieve this goal using dialogues [11, 21, 32, 77, 52, 64]. However, as discussed, natural language is not the only cue for multimodal communication, and current collaborative frameworks have not yet fully explored the rich spectrum of communicative interfaces for embodied agent navigation. In this paper, we propose gestures as the communicative interface between human users and the artificial agent.

Meanwhile, there is a large body of work on human gestures as a communicative device, directed either at humans or at robots [18, 28]. Most of these approaches are based on a predefined gesture set with fixed meanings, or focus on gesture type classification [23], pose estimation [30, 5], or both [75, 29]. In contrast, we let users use any natural gestures and demonstrate that the agent can directly learn the semantics and underlying intents of these gestures.

III Ges-THOR: A Simulation Framework for Human-Agent Interaction via Gestures

We build an interactive learning framework in Unity based on the iTHOR environment from AI2-THOR for gesture-boosted embodied AI research, namely Ges-THOR (Gesture-based iTHOR environment).

III-A Simulation and Learning Framework

There are many existing physics-based simulation frameworks for photorealistic indoor navigation tasks [72, 73, 58, 47, 57, 74]. We choose AI2-THOR specifically to build our learning environment because it provides a diversity of rooms and interactive features. It has been widely used for different visual navigation tasks [80, 76]. In addition, the game engine Unity provides the ability to deploy across platforms and integrate third-party resources, compatible with the sensory devices we use for this learning environment. We also use AllenAct [69] as the codebase for our modular framework.

Fig. 2: Overview of the learning framework.

(a) Scenes and the (b) learning agent are built in Unity. The agent can perform four actions: move forward, turn left, turn right, and stop. (c) It receives several sensory inputs, including RGBD images, target labels, and human gestures. Unity contains (d) an external communicator that communicates with the (e) learning model in PyTorch. The learning model receives states and rewards from the communicator and sends back chosen actions.

III-B Human Gesture Sensing

The following setup immerses human players into the virtual environment while allowing the system to capture human gestures:


We use the Oculus Rift, Kinect Sensor v2, and Leap Motion Controller (hereinafter Oculus, Kinect, and Leap Motion) together for gesture sensing via pose estimation. Oculus gives the player a first-person view of the virtual environment, so the player sees the virtual scene and knows where the target object is. Kinect tracks overall body movements but cannot capture fine in-hand motions, so Leap Motion is added to detect hand movements.

Fig. 3: Device arrangement. The human player wears a (c) Oculus headset with (b) Leap Motion. (a) Kinect is placed 1.5m from the human player and 1.5m above the ground. The screen displays (d) a humanoid model that mirrors the body and hand movements.

Device Arrangement

Fig. 3 illustrates the device arrangement. During data acquisition, the human player is asked to wear the Oculus headset, face the Kinect sensor, and move the hands in front of Leap Motion at a distance between 30cm and 60cm. In Unity, a humanoid character (see Fig. 3d) mirrors the player's movements in real time, including body posture and hand motions.

Fig. 4: Examples of referencing and intervention gestures. The first four columns show referencing gestures, while the last column shows an intervention gesture. Human players perform different gesture styles while pointing at various target objects in the scene. The top row shows the body movements captured by Kinect, and the bottom row shows the hand configurations recorded by Leap Motion.

Data collection

Ideally, learning would take place in real time, where a human player continuously observes the agent's behavior and interacts with it so that the agent can respond to the feedback immediately. Unfortunately, this is infeasible because the entire training process may take hundreds of thousands of episodes. Therefore, we opt for pre-recorded gestures to simulate real-time interactions between the human player and the agent as closely and efficiently as possible. There are two types of instructional gestures that humans can use: one is for referencing, and the other is for intervention. To record referencing gestures, a volunteer is given the target object in a scene and asked to communicate with the agent to indicate the direction with a gesture. We do not ask participants to use any specific gestures such as pointing with a finger, but encourage them to use any gestures as if they were talking to another person. The gesture sequence, as well as the environmental information, is recorded as one episode in the dataset. We have over 230,000 unique episodes for training and 2,500 episodes for validation and testing. For intervention gestures, the player shows gestures in a rejective manner, used to warn the agent when it is moving away from the target. We recorded ten different intervention gestures. Kinect Body and Leap Hands duplicate the player's movements and save the recorded motions as animation clips in Unity. See Fig. 4 for examples of collected gestures.

III-C Sensory Modalities

Multimodal perception is essential for artificial systems. We provide several sensory inputs in our environment to build a multimodal learning framework; see Fig. 2. The observational space consists of the following inputs:


Unity’s built-in camera component provides a 2D view of the virtual space. It is attached at the embodied agent’s eye level, 1.5m from the ground, with a 90-degree field of view, and provides real-time RGB images in the first-person view, with each pixel containing RGB values scaled from 0 to 1.


The depth image is extracted from the depth buffer of Unity’s camera view. Each pixel value is a floating-point number between 0 and 1, representing the distance from the agent’s viewpoint to the rendered surface, linearly interpolated from 0m to 10m.
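The linear depth mapping described above can be sketched in a few lines (a minimal illustration; the 10m clamping range is from the text, the function name is ours):

```python
def normalize_depth(distance_m, max_range_m=10.0):
    """Map a metric distance to a depth-buffer pixel value in [0, 1].

    Distances are linearly interpolated over [0, max_range_m] and
    clamped, mirroring the depth sensor described above.
    """
    return max(0.0, min(distance_m, max_range_m)) / max_range_m
```

For example, a surface 2.5m away maps to a pixel value of 0.25, and anything beyond 10m saturates at 1.0.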


Unity checks for collisions dynamically in the learning environment. Whenever the agent triggers a collision, the environment reports the event and prevents the agent from penetrating the object meshes. Note that in our agent design, the agent can slide along the surface of an object it collides with. This “sliding” mechanic has been noted in recent work [6] and may hinder sim-to-real transfer. We mitigate this issue by adding reward penalties for such behaviors.


As previously mentioned, we use Oculus, Kinect, and Leap Motion to capture human gestures. Each gesture motion is saved as a sequence of vectors with 100 steps and 95 features consisting of body and hand poses. Note that for referencing gestures, we select motions from the corresponding episode. For intervention gestures, we randomly sample one from saved recordings and use it only when the agent faces away from the target. The raw gesture inputs are encoded and piped into our learning model.
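The gesture input format above (a 100-step sequence of 95 pose features, flattened before encoding) can be made concrete with a small sketch (assuming NumPy; the function name is ours):

```python
import numpy as np

def flatten_gesture(gesture_seq):
    """Flatten a recorded gesture clip into a single feature vector.

    gesture_seq: array of shape (100, 95) -- 100 time steps, each with
    95 body/hand pose features (muscle values of the humanoid model).
    Returns a vector of length 9500 ready for the gesture encoder.
    """
    seq = np.asarray(gesture_seq, dtype=np.float32)
    assert seq.shape == (100, 95), "expected a 100-step, 95-feature clip"
    return seq.reshape(-1)
```
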

IV Learning to Navigate with Gestures

In this section, we describe our end-to-end gesture learning model using Deep Reinforcement Learning (DRL). We start by introducing the formulation of the DRL model we use, followed by the other components of the entire architecture.

IV-A Problem Formulation

We take the ObjectGoal task [2], in which the agent must navigate to an object of a specific category, as our experimental testbed. The details of the task and the agent embodiment are explained below:

Agent Embodiment

The learning agent is represented by a robot character with a capsule bound. A rigid body component is attached so that the agent can detect collisions with environmental objects. It has four available actions: turn left, turn right, move forward, and stop. Each turning action rotates the agent by a fixed angle, and each forward action results in a forward displacement of 0.25m.
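The discrete action space above can be sketched as a planar pose update (the 0.25m step is from the text; the turn angle is not specified there, so turn_deg is a configurable placeholder, and the heading convention is our assumption):

```python
import math

def step_pose(x, z, heading_deg, action, turn_deg=45.0, step_m=0.25):
    """Apply one discrete navigation action to the agent's planar pose.

    Heading 0 points along +z and increases clockwise (an assumed
    Unity-style convention). "stop" leaves the pose unchanged; episode
    termination is handled elsewhere.
    """
    if action == "turn left":
        heading_deg -= turn_deg
    elif action == "turn right":
        heading_deg += turn_deg
    elif action == "move forward":
        rad = math.radians(heading_deg)
        x += step_m * math.sin(rad)   # lateral component
        z += step_m * math.cos(rad)   # forward component
    return x, z, heading_deg % 360.0
```
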

Task Definition

The agent is initialized at a random location, and an object is selected randomly as the target; we ensure that the agent can reach the target. Note that there can be more than one instance of the target object type in the same environment. To complete the task, the agent must navigate to a target object instance with a stopping distance of at most 1.5m. The agent then needs to issue a termination (i.e., stop) action in the proximity of the goal, and the object must also be within the agent’s field of view in order to succeed. An episode terminates when the above success criteria are met or when the maximum allowed time step (100 in our setup) is reached. We allow the agent to issue multiple stops in an episode but measure success rates under different numbers of maximum stops (1-3). We allow an unlimited number of stops in training; the agent must explore and learn after issuing incorrect stops in earlier episodes.
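The success criteria above can be collected into a small predicate (a sketch; the names are ours, the 1.5m threshold is from the text):

```python
def episode_success(stopped, dist_to_target_m, target_in_view,
                    max_dist_m=1.5):
    """ObjectGoal success test as described above.

    The agent succeeds only if it issues a stop action within
    max_dist_m of a target instance while the object is inside
    its field of view.
    """
    return stopped and dist_to_target_m <= max_dist_m and target_in_view
```
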

IV-B Policy Learning with PPO

We formulate our visual gesture navigation with DRL, specifically PPO. The learning process can be viewed as a Markov Decision Process (MDP). At each time step t, the agent perceives a state s_t (i.e., a combination of the sensory inputs), receives a reward r_t from the environment, and chooses an action a_t according to the current policy π_θ:

a_t ∼ π_θ(a_t | s_t),

where θ represents the parameters of the function approximator of the policy π. We implement PPO with a time horizon of 128 steps, a batch size of 128, 4 epochs per iteration of gradient descent, and a buffer size of 1280 for each policy update. We use Adam [46] as the optimizer, with a learning rate of 0.0003 and a discount factor of 0.99.
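At the heart of PPO is the clipped surrogate objective, which limits how far each update can move the policy. A single-sample sketch (the clip ratio ε = 0.2 is PPO's common default, not stated in the text):

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    """Single-sample PPO clipped surrogate objective (to be maximized).

    ratio = pi_theta(a|s) / pi_theta_old(a|s), computed in log space
    for numerical stability, then clipped to [1 - eps, 1 + eps]. The
    pessimistic min() prevents overly large policy updates.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)
```

In practice this objective is averaged over a batch (128 samples here, drawn from the 1280-step buffer) and maximized with Adam.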

The agent receives a positive reward if it completes the navigation successfully. Since we encourage the agent to reach the target object in the minimal number of steps, the agent receives a small time penalty at each step. We further add a collision penalty for each collision detected, which mitigates the aforementioned “sliding” behavior. If the agent stops at an ineligible location, an additional penalty is applied.
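The reward scheme above can be sketched as a per-step function. The exact magnitudes are not given in the text, so all four constants below are illustrative placeholders:

```python
def step_reward(success, collided, bad_stop,
                r_success=10.0, time_penalty=0.01,
                collision_penalty=0.1, stop_penalty=0.5):
    """Shaped per-step reward mirroring the scheme described above.

    All magnitudes are hypothetical placeholders; only the structure
    (success bonus, time/collision/bad-stop penalties) is from the text.
    """
    reward = -time_penalty           # small cost on every step
    if collided:
        reward -= collision_penalty  # discourage "sliding" along objects
    if bad_stop:
        reward -= stop_penalty       # stopping at an ineligible location
    if success:
        reward += r_success          # reaching the target and stopping
    return reward
```
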

IV-C Model Overview

We equip the embodied agent with different sensory modalities, each of which feeds into part of the input network of the RL model. Below, we introduce the components of the architecture.

Visual network

The backbone of the visual network is a ResNet-18 [36] pre-trained on ImageNet. It takes RGB and depth images as inputs. The weights of all layers except the last fully connected layer are frozen during training.

Fig. 5: Model overview. Our model fuses perceptions from different sensory modalities; the actor-critic model samples an action at each step according to the updated policy and sends it back to the environment.

Gesture network

The raw input of the gesture network is a sequence with 100 time steps and 95 features. Each feature represents a muscle value from the Unity humanoid model, which can be regarded as the coordinates of tracked body and hand joints. The gesture input is flattened and encoded into a vector. In addition, we provide the target object category from a selected set and pass it through an embedding layer. This is equivalent to the speech or text instructions from a user in prior work on interactive embodied agent learning. Since our focus in this paper is gesture, we simplify this part of the input to a categorical variable (e.g., a single word in a fixed vocabulary). Note that this vector alone does not specify the target object location: there can be multiple instances of the same object category in the scene, and the agent needs to infer which instance the human player is referring to. This vector is concatenated with the encoded gesture inputs and visual features by an observation combiner, followed by a memory unit using a Gated Recurrent Unit (GRU) [20]. Fig. 5 illustrates the entire architecture.
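The observation combiner's fusion step can be sketched with NumPy (a sketch only: the text specifies the 38 object categories, but the feature dimensions and the randomly initialized embedding table here are illustrative placeholders):

```python
import numpy as np

def combine_observations(visual_feat, gesture_feat, target_id,
                         num_categories=38, embed_dim=16, rng=None):
    """Fuse the three input streams into one observation vector.

    visual_feat and gesture_feat stand in for the encoder outputs; the
    target category is looked up in an illustrative embedding table.
    The combined vector would feed the GRU memory unit next.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    embedding_table = rng.standard_normal((num_categories, embed_dim))
    target_embed = embedding_table[target_id]
    return np.concatenate([visual_feat, gesture_feat, target_embed])
```

For example, with a 512-d visual feature and a 128-d gesture encoding, the combined observation is a 656-d vector.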

V Evaluation

Scene Types Methods Success Rate (%) Success weighted by Path Length (%)
Train Validation Test Train Validation Test
Kitchen Baseline
Intervention 44.9 31.5 40.3 27.0 20.0 24.0
Living Room Baseline
Intervention 13.0
Bedroom Baseline
Referencing 43.5 28.3
Intervention 22.4 20.4 13.8 11.9
Bathroom Baseline
Intervention 40.5 32.2 35.0 29.1 21.0 23.0
Average Baseline
Intervention 35.2 22.9 23.8 15.0 26.3 16.1
Scene Prior [76]
TABLE I: Evaluation results for the train/validation/test split. SR and SPL at the first stop are reported. We compare models trained with referencing gestures and intervention gestures against a baseline model. Scene Prior results are reported from [76]; this method uses additional scene prior knowledge but not gestures.
Scene Types Methods Success Rate (%) Success weighted by Path Length (%)
1 Stop 2 Stops 3 Stops ∞ 1 Stop 2 Stops 3 Stops ∞
Kitchen Baseline
Intervention 40.3 55.1 62.8 89.0 24.0 32.7 37.2 52.8
Living Room Baseline
Intervention 16.9 21.7 58.0 12.6 30.8
Bedroom Baseline
Intervention 20.4 27.9 33.5 51.2 11.9 16.3 19.5 29.7
Bathroom Baseline
Intervention 35.0 44.6 51.7 76.5 23.0 29.6 33.9 48.3
Average Baseline
Intervention 26.3 36.1 42.4 68.7 16.1 22.1 25.8 40.4
TABLE II: Evaluation results for test scenes with different numbers of allowed stops (∞ denotes infinite allowed stops). SR and SPL are presented. We compare models trained with referencing gestures and intervention gestures against a baseline model.

We evaluate our methods in the Ges-THOR environment. AI2-THOR provides 120 scenes covering four room types: kitchen, living room, bedroom, and bathroom. Each room has its own unique appearance and arrangement. For each scene type, we randomly split the 30 scenes into 20 training rooms, 5 validation rooms, and 5 testing rooms.

There are 38 object categories available for all scenes. Since there is almost no overlap of objects for different scene types, we train and evaluate separately for each scene type. We evaluate each scene for 250 episodes and report the average results for each scene type.

Evaluation Metrics

We use two metrics to evaluate the different methods:

  • SR: for the i-th episode, success can be marked by a binary indicator S_i. The success rate is the ratio of successful episodes over N completed episodes:

    SR = (1/N) Σ_{i=1}^{N} S_i.

  • SPL: this metric is proposed by Anderson et al. [2]. It measures the efficiency of the navigation. SPL is calculated as follows:

    SPL = (1/N) Σ_{i=1}^{N} S_i · l_i / max(p_i, l_i),

    where l_i is the shortest path distance from the agent’s starting position to the goal in episode i, and p_i is the actual path length taken by the agent.
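The two metrics above are straightforward to compute from per-episode logs (a sketch; the function names are ours):

```python
def success_rate(successes):
    """SR: fraction of successful episodes (successes are 0/1 flags)."""
    return sum(successes) / len(successes)

def spl(successes, shortest, taken):
    """SPL (Anderson et al.): success weighted by path-length efficiency.

    shortest[i] is the shortest-path distance to the goal in episode i,
    taken[i] the path length the agent actually traversed; the ratio
    l_i / max(p_i, l_i) discounts successes achieved via long detours.
    """
    n = len(successes)
    return sum(s * (l / max(p, l))
               for s, l, p in zip(successes, shortest, taken)) / n
```

For instance, an agent that succeeds in two of three episodes, once optimally and once with a path twice the shortest length, has SR = 2/3 but SPL = 0.5.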

We evaluate agent performance under three methods: (1) Baseline: the agent only has the visual (i.e., RGB and depth images) and object category information. (2) Referencing Gesture: in addition to (1), the agent receives referencing gesture inputs. (3) Intervention Gesture: in addition to (1), the agent receives rejective gesture inputs whenever the angle between its forward direction and the direction to the target exceeds 90 degrees.

In our comparative setting, the baseline model does not use any gestures. While one may expect the baseline to always underperform, that holds only if the agent has learned to infer the semantics of human gestures and to incorporate those signals during navigation, which is precisely the focus of our evaluation. Again, this is non-trivial because we do not predefine the meaning of any gesture. Similarly, intervention gestures are strong directive feedback from the human user, but we must still evaluate how well the agent can infer their meaning and act on them during navigation.

Navigation Performance

Table I shows the performance of the different methods evaluated at the first stop, and Table II shows performance on test scenes evaluated at different numbers of allowed stops. Both results confirm that adding gestures significantly improves the navigation success rate, as well as the efficiency, over the baseline model. Table I places a hard constraint of 1 on the number of stops to match state-of-the-art benchmarks [2]. Of note, models trained with intervention gestures outperform models trained with referencing gestures in both SR and SPL, demonstrating that intervention gestures are a more effective kind of gesture for communicating with the agent. Table II reports results on test scenes with different numbers of allowed stops. Both SR and SPL increase with the number of allowed stops, and the improvement from gestures is more evident at lower numbers of allowed stops.

Qualitative Results

To visualize the effectiveness of our methods, we show qualitative results in Figs. 6 and 7. Fig. 6 compares our referencing gesture model against the baseline model with visualized trajectories for different scenes and targets. In all scenes, the referencing gesture model enables the agent to navigate to the target more intelligently, while the baseline model often struggles to find the target and stop, or takes a longer path. Fig. 7 demonstrates how our intervention gesture model significantly improves navigation. In this example, when the agent faces away from the target, it is instructed with intervention gestures and rotates in place until the target is in its field of view before making any forward movements. This indicates that the agent is able to understand and react to intervention gestures, resulting in much better navigation performance.

Fig. 6: Qualitative results with visualizations of trajectories for baseline (left) and referencing gesture (right) models. Our agent can efficiently navigate to the target with the help of gestures.
Fig. 7: Qualitative results for the intervention gesture model. When the agent faces away from the target, it receives intervention gestures and first rotates until it faces the target before making any forward movements, thus increasing navigation success rate and efficiency.

VI Conclusion

In this paper, we propose a new framework for embodied visual navigation in which human users give instructions to an autonomous agent using gestures. Such agents and gesture-based interfaces will be valuable for collaborative robots and virtual agents. We have built a VR-based interactive learning environment, Ges-THOR, based on AI2-THOR, and designed an end-to-end deep reinforcement learning model for the navigation task. Our experiments show that the agent is able to interpret human gesture instructions and improve its visual navigation. We also conclude that interactive activities during agent task execution can improve performance. While the main setting and experimental design of our study have been used in prior works, to the best of our knowledge, ours is the first paper to incorporate human gestures in embodied agent learning and to show that the agent can learn the semantics of gestures without supervision. We will make our simulation environment and recorded gesture dataset publicly available for future research on human-AI interaction via gestures. Future directions include adding more objects, tasks, gestures, and multiple agents to the scene, e.g., navigating to an object and bringing it back in response to gestures. We also plan to allow agents to make gestures to the human player so that both parties can communicate with gestures, which will in turn help humans use even more diverse gestures to communicate with agents.



  • [1] S. Amershi, D. Weld, M. Vorvoreanu, A. Fourney, B. Nushi, P. Collisson, J. Suh, S. Iqbal, P. N. Bennett, K. Inkpen, et al. (2019) Guidelines for human-ai interaction. In CHI, Cited by: §II.
  • [2] P. Anderson, A. Chang, D. S. Chaplot, A. Dosovitskiy, S. Gupta, V. Koltun, J. Kosecka, J. Malik, R. Mottaghi, M. Savva, et al. (2018) On evaluation of embodied navigation agents. arXiv:1807.06757. Cited by: §IV-A, 2nd item, §V.
  • [3] P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. Reid, S. Gould, and A. van den Hengel (2018) Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments. In CVPR, Cited by: §I, §II, §II.
  • [4] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh (2015) Vqa: visual question answering. In CVPR, Cited by: §II, §II.
  • [5] S. Baek, K. I. Kim, and T. Kim (2019) Pushing the envelope for rgb-based dense 3d hand pose estimation via neural rendering. In CVPR, Cited by: §II.
  • [6] D. Batra, A. Gokaslan, A. Kembhavi, O. Maksymets, R. Mottaghi, M. Savva, A. Toshev, and E. Wijmans (2020) ObjectNav revisited: on evaluation of embodied agents navigating to objects. arXiv:2006.13171. Cited by: §III-C.
  • [7] C. Beattie, J. Z. Leibo, D. Teplyashin, T. Ward, M. Wainwright, H. Küttler, A. Lefrancq, S. Green, V. Valdés, A. Sadik, et al. (2016) Deepmind lab. arXiv:1612.03801. Cited by: §II.
  • [8] P. Bremner and U. Leonards (2016) Iconic gestures for robot avatars, recognition and integration with speech. Frontiers in Psychology 7, pp. 183. Cited by: §I.
  • [9] S. Brodeur, E. Perez, A. Anand, F. Golemo, L. Celotti, F. Strub, J. Rouat, H. Larochelle, and A. Courville (2017) Home: a household multimodal environment. arXiv:1711.11017. Cited by: §II.
  • [10] T. Cao, J. Wang, Y. Zhang, and S. Manivasagam (2020) BabyAI++: towards grounded-language learning beyond memorization. arXiv:2004.07200. Cited by: §II.
  • [11] J. Y. Chai, Q. Gao, L. She, S. Yang, S. Saba-Sadiya, and G. Xu (2018) Language to action: towards interactive task learning with physical agents.. In IJCAI, Cited by: §II.
  • [12] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang (2017) Matterport3d: learning from rgb-d data in indoor environments. arXiv:1709.06158. Cited by: §II.
  • [13] J. Chang, J. Xiao, J. Chai, and Z. Zhou (2019) An improved faster r-cnn algorithm for gesture recognition in human-robot interaction. In Chinese Automation Congress, Cited by: §I.
  • [14] D. S. Chaplot, K. M. Sathyendra, R. K. Pasumarthi, D. Rajagopal, and R. Salakhutdinov (2017) Gated-attention architectures for task-oriented language grounding. arXiv:1706.07230. Cited by: §I, §II.
  • [15] C. Chen, U. Jain, C. Schissler, S. V. A. Gari, Z. Al-Halah, V. K. Ithapu, P. Robinson, and K. Grauman (2020) Soundspaces: audio-visual navigation in 3d environments. In ECCV, pp. 17–36. Cited by: §II.
  • [16] D. L. Chen and R. J. Mooney (2011) Learning to interpret natural language navigation instructions from observations. In AAAI, Cited by: §I, §II.
  • [17] H. Chen, A. Suhr, D. Misra, N. Snavely, and Y. Artzi (2019) Touchdown: natural language navigation and spatial reasoning in visual street environments. In CVPR, Cited by: §I, §II.
  • [18] S. Chen, H. Ma, C. Yang, and M. Fu (2015) Hand gesture based robot control system using leap motion. In ICIRA, Cited by: §II.
  • [19] M. Chevalier-Boisvert, D. Bahdanau, S. Lahlou, L. Willems, C. Saharia, T. H. Nguyen, and Y. Bengio (2018) Babyai: a platform to study the sample efficiency of grounded language learning. In ICLR, Cited by: §II.
  • [20] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv:1406.1078. Cited by: §IV-C.
  • [21] A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra (2018) Embodied question answering. In CVPR Workshops, Cited by: §I, §II, §II.
  • [22] A. Das, G. Gkioxari, S. Lee, D. Parikh, and D. Batra (2018) Neural modular control for embodied question answering. arXiv:1810.11181. Cited by: §II.
  • [23] Q. De Smedt, H. Wannous, and J. Vandeborre (2016) Skeleton-based dynamic hand gesture recognition. In CVPR, Cited by: §II.
  • [24] B. S. Ertuğrul, C. Gurpinar, H. Kivrak, and H. Kose (2013) Gesture recognition for humanoid assisted interactive sign language tutoring. In SIU, Cited by: §I.
  • [25] L. Fan, S. Qiu, Z. Zheng, T. Gao, S. Zhu, and Y. Zhu (2021) Learning triadic belief dynamics in nonverbal communication from videos. In CVPR, Cited by: §I.
  • [26] D. Fried, R. Hu, V. Cirik, A. Rohrbach, J. Andreas, L. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein, and T. Darrell (2018) Speaker-follower models for vision-and-language navigation. In NeurIPS, Cited by: §I, §II.
  • [27] C. Gan, Y. Zhang, J. Wu, B. Gong, and J. B. Tenenbaum (2020) Look, listen, and act: towards audio-visual embodied navigation. In ICRA, Cited by: §II.
  • [28] Q. Gao, J. Liu, Z. Ju, Y. Li, T. Zhang, and L. Zhang (2017) Static hand gesture recognition with parallel cnns for space human-robot interaction. In ICIRA, Cited by: §II.
  • [29] G. Garcia-Hernando, S. Yuan, S. Baek, and T. Kim (2018) First-person hand action benchmark with rgb-d videos and 3d hand pose annotations. In CVPR, Cited by: §II.
  • [30] L. Ge, Y. Cai, J. Weng, and J. Yuan (2018) Hand pointnet: 3d hand pose estimation using point sets. In CVPR, Cited by: §II.
  • [31] M. A. Goodrich and A. C. Schultz (2008) Human-robot interaction: a survey. Now Publishers Inc. Cited by: §II.
  • [32] D. Gordon, A. Kembhavi, M. Rastegari, J. Redmon, D. Fox, and A. Farhadi (2018) Iqa: visual question answering in interactive environments. In CVPR, Cited by: §II.
  • [33] S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik (2017) Cognitive mapping and planning for visual navigation. In CVPR, Cited by: §I, §II.
  • [34] W. Hao, C. Li, X. Li, L. Carin, and J. Gao (2020) Towards learning a generic agent for vision-and-language navigation via pre-training. In CVPR, Cited by: §II.
  • [35] M. Hasanuzzaman, V. Ampornaramveth, T. Zhang, M. Bhuiyan, Y. Shirai, and H. Ueno (2004) Real-time vision-based gesture recognition for human robot interaction. In ROBIO, Cited by: §I.
  • [36] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §IV-C.
  • [37] K. M. Hermann, F. Hill, S. Green, F. Wang, R. Faulkner, H. Soyer, D. Szepesvari, W. M. Czarnecki, M. Jaderberg, D. Teplyashin, et al. (2017) Grounded language learning in a simulated 3d world. arXiv:1706.06551. Cited by: §II.
  • [38] K. Jiang, S. Stacy, C. Wei, A. Chan, F. Rossano, Y. Zhu, and T. Gao (2021) Individual vs. joint perception: a pragmatic model of pointing as communicative smithian helping. In CogSci, Cited by: §I.
  • [39] H. Joo, T. Simon, M. Cikara, and Y. Sheikh (2019) Towards social artificial intelligence: nonverbal social signal prediction in a triadic interaction. In CVPR, pp. 10873–10883. Cited by: §I.
  • [40] J. Joo, E. P. Bucy, and C. Seidel (2019) Automated coding of televised leader displays: detecting nonverbal political behavior with computer vision and deep learning. International Journal of Communication. Cited by: §I.
  • [41] J. Joo, F. F. Steen, and M. Turner (2017) Red hen lab: dataset and tools for multimodal human communication research. KI-Künstliche Intelligenz 31 (4), pp. 357–361. Cited by: §I, §I.
  • [42] Z. Kang, C. Indudhara, K. Mahorker, E. P. Bucy, and J. Joo (2020) Understanding political communication styles in televised debates via body movements. In ECCV, pp. 788–793. Cited by: §I.
  • [43] L. E. Kavraki, P. Svestka, J. Latombe, and M. H. Overmars (1996) Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Transactions on Robotics and Automation 12 (4), pp. 566–580. Cited by: §II.
  • [44] M. Kempka, M. Wydmuch, G. Runc, J. Toczek, and W. Jaśkowski (2016) Vizdoom: a doom-based ai research platform for visual reinforcement learning. In CIG, Cited by: §II.
  • [45] S. Kiesler, A. Powers, S. R. Fussell, and C. Torrey (2008) Anthropomorphic interactions with a robot and robot–like agent. Social Cognition 26 (2), pp. 169–181. Cited by: §II.
  • [46] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv:1412.6980. Cited by: §IV-B.
  • [47] E. Kolve, R. Mottaghi, W. Han, E. VanderBilt, L. Weihs, A. Herrasti, D. Gordon, Y. Zhu, A. Gupta, and A. Farhadi (2017) Ai2-thor: an interactive 3d environment for visual ai. arXiv:1712.05474. Cited by: §I, §II, §III-A.
  • [48] K. Marino, M. Rastegari, A. Farhadi, and R. Mottaghi (2019) OK-vqa: a visual question answering benchmark requiring external knowledge. In CVPR, Cited by: §II.
  • [49] C. Matuszek, E. Herbst, L. Zettlemoyer, and D. Fox (2013) Learning to parse natural language commands to a robot control system. In Experimental Robotics, Cited by: §II.
  • [50] D. Misra, A. Bennett, V. Blukis, E. Niklasson, M. Shatkhin, and Y. Artzi (2018) Mapping instructions to actions in 3d environments with visual goal prediction. arXiv:1809.00786. Cited by: §II.
  • [51] A. Mousavian, A. Toshev, M. Fišer, J. Košecká, A. Wahid, and J. Davidson (2019) Visual representations for semantic target driven navigation. In ICRA, Cited by: §II.
  • [52] K. Nguyen and H. Daumé III (2019) Help, anna! visual navigation with natural multimodal assistance via retrospective curiosity-encouraging imitation learning. arXiv:1909.01871. Cited by: §II.
  • [53] K. Nickel and R. Stiefelhagen (2007) Visual recognition of pointing gestures for human–robot interaction. Image and Vision Computing 25 (12), pp. 1875–1884. Cited by: §I.
  • [54] Y. Niu, H. Zhang, M. Zhang, J. Zhang, Z. Lu, and J. Wen (2019) Recursive visual attention in visual dialog. In CVPR, Cited by: §II.
  • [55] C. Nuzzi, S. Pasinetti, M. Lancini, F. Docchio, and G. Sansoni (2019) Deep learning-based hand gesture recognition for collaborative robots. IEEE Instrumentation & Measurement Magazine 22 (2), pp. 44–51. Cited by: §I.
  • [56] X. Puig, K. Ra, M. Boben, J. Li, T. Wang, S. Fidler, and A. Torralba (2018) Virtualhome: simulating household activities via programs. In CVPR, pp. 8494–8502. Cited by: §II.
  • [57] M. Savva, A. X. Chang, A. Dosovitskiy, T. Funkhouser, and V. Koltun (2017) MINOS: multimodal indoor simulator for navigation in complex environments. arXiv:1712.03931. Cited by: §II, §III-A.
  • [58] M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, et al. (2019) Habitat: a platform for embodied ai research. In CVPR, Cited by: §II, §II, §III-A.
  • [59] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv:1707.06347. Cited by: §II.
  • [60] M. Shridhar, J. Thomason, D. Gordon, Y. Bisk, W. Han, R. Mottaghi, L. Zettlemoyer, and D. Fox (2020) Alfred: a benchmark for interpreting grounded instructions for everyday tasks. In CVPR, pp. 10740–10749. Cited by: §II.
  • [61] M. Shridhar, X. Yuan, M. Côté, Y. Bisk, A. Trischler, and M. Hausknecht (2020) ALFWorld: aligning text and embodied environments for interactive learning. arXiv:2010.03768. Cited by: §II.
  • [62] F. F. Steen, A. Hougaard, J. Joo, I. Olza, C. P. Cánovas, et al. (2018) Toward an infrastructure for data-driven multimodal communication research. Linguistics Vanguard 4 (1). Cited by: §I.
  • [63] F. Steen and M. B. Turner (2013) Multimodal construction grammar. In Language and the Creative Mind, M. Borkent, B. Dancygier, and J. Hinnell (Eds.), CSLI Publications, Stanford, CA. Cited by: §I.
  • [64] J. Thomason, M. Murray, M. Cakmak, and L. Zettlemoyer (2020) Vision-and-dialog navigation. In CoRL, Cited by: §II, §II.
  • [65] M. Tomasello and M. J. Farrar (1986) Joint attention and early language. Child development, pp. 1454–1463. Cited by: §I.
  • [66] M. Tomasello (2010) Origins of human communication. MIT press. Cited by: §I.
  • [67] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2016) Show and tell: lessons learned from the 2015 mscoco image captioning challenge. TPAMI 39 (4), pp. 652–663. Cited by: §II.
  • [68] X. Wang, Q. Huang, A. Celikyilmaz, J. Gao, D. Shen, Y. Wang, W. Y. Wang, and L. Zhang (2019) Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In CVPR, Cited by: §I, §II.
  • [69] L. Weihs, J. Salvador, K. Kotar, U. Jain, K. Zeng, R. Mottaghi, and A. Kembhavi (2020) AllenAct: a framework for embodied ai research. arXiv:2008.12760. Cited by: §III-A.
  • [70] J. Weston, A. Bordes, S. Chopra, A. M. Rush, B. van Merriënboer, A. Joulin, and T. Mikolov (2015) Towards ai-complete question answering: a set of prerequisite toy tasks. arXiv:1502.05698. Cited by: §II.
  • [71] Y. Wu, Y. Wu, G. Gkioxari, and Y. Tian (2018) Building generalizable agents with a realistic and rich 3d environment. arXiv:1801.02209. Cited by: §II.
  • [72] F. Xia, A. R. Zamir, Z. He, A. Sax, J. Malik, and S. Savarese (2018) Gibson env: real-world perception for embodied agents. In CVPR, Cited by: §II, §II, §III-A.
  • [73] X. Xie, H. Liu, Z. Zhang, Y. Qiu, F. Gao, S. Qi, Y. Zhu, and S. Zhu (2019) VRGym: a virtual testbed for physical and interactive ai. In Proceedings of the ACM Turing Celebration Conference, Cited by: §II, §II, §III-A.
  • [74] C. Yan, D. Misra, A. Bennnett, A. Walsman, Y. Bisk, and Y. Artzi (2018) Chalet: cornell house agent learning environment. arXiv:1801.07357. Cited by: §II, §III-A.
  • [75] S. Yang, J. Liu, S. Lu, M. H. Er, and A. C. Kot (2020) Collaborative learning of gesture recognition and 3d hand pose estimation with multi-order feature analysis. In ECCV, Cited by: §II.
  • [76] W. Yang, X. Wang, A. Farhadi, A. Gupta, and R. Mottaghi (2019) Visual semantic navigation using scene priors. In ICLR, Cited by: §I, §II, §II, §III-A, TABLE I.
  • [77] H. Yu, H. Zhang, and W. Xu (2018) Interactive grounded language acquisition and generalization in a 2d world. arXiv:1802.01433. Cited by: §II, §II.
  • [78] F. Zhu, Y. Zhu, X. Chang, and X. Liang (2020) Vision-language navigation with self-supervised auxiliary reasoning tasks. In CVPR, Cited by: §I, §II.
  • [79] Y. Zhu, T. Gao, L. Fan, S. Huang, M. Edmonds, H. Liu, F. Gao, C. Zhang, S. Qi, Y. N. Wu, J. Tenenbaum, and S. Zhu (2020) Dark, beyond deep: a paradigm shift to cognitive ai with humanlike common sense. Engineering 6 (3), pp. 310–345. Cited by: §I.
  • [80] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi (2017) Target-driven visual navigation in indoor scenes using deep reinforcement learning. In ICRA, Cited by: §II, §II, §III-A.