RERERE: Remote Embodied Referring Expressions in Real indoor Environments

by Yuankai Qi, et al.
The University of Adelaide

One of the long-term challenges of robotics is to enable humans to communicate with robots about the world. It is essential if they are to collaborate. Humans are visual animals, and we communicate primarily through language, so human-robot communication is inevitably at least partly a vision-and-language problem. This has motivated both Referring Expression datasets, and Vision and Language Navigation datasets. These partition the problem into that of identifying an object of interest, or navigating to another location. Many of the most appealing uses of robots, however, require communication about remote objects and thus do not reflect the dichotomy in the datasets. We thus propose the first Remote Embodied Referring Expression dataset of natural language references to remote objects in real images. Success requires navigating through a previously unseen environment to select an object identified through general natural language. This represents a complex challenge, but one that closely reflects one of the core visual problems in robotics. A Navigator-Pointer model which provides a strong baseline on the task is also proposed.






1 Introduction

You can ask a 10-year-old child to bring you a cushion even in an unfamiliar environment. There’s a good chance that they will succeed. The probability that a robot will succeed at the same task is significantly lower. The robot faces multiple challenges in completing the task, but one of the most fundamental is that of knowing where to look. Children have a wealth of knowledge about their environment that they can apply to such tasks, including the fact that cushions generally inhabit couches, that couches inhabit lounge rooms, and that lounge rooms are often connected to the rest of a building through hallways. Robots typically lack this type of information.

Figure 1: Remote Embodied Referring Expression – RERERE task. We focus on following a natural language based navigation instruction (in blue) and detecting the object described by a referring expression (in red). The target object (in red bounding boxes) might be visible from multiple viewpoints and camera angles. Blue discs are nearby approachable viewpoints.

The task we propose is intended to encourage the development of vision and language models that embody the information required to carry out natural language instructions that refer to previously unseen objects in real environments. It represents an extension of existing Referring Expression, Visual Grounding, and Embodied Question Answering datasets (see [6, 8, 9, 13, 29] for example) to the case where it cannot be assumed that the object in question is visible in the current image. This represents a significant development because solving the problem requires navigating a real environment to a position where the object is likely to be visible, identifying the object in question if it is there, and adapting if it is not. This is far more challenging than selecting a candidate from amongst the set of objects visible in a provided image. We refer to this novel problem as one of interpreting Remote Embodied Referring Expressions in Real indoor Environments (RERERE).

Although vision-and-language tasks motivated by robotics have attracted significant attention previously [5, 8, 15, 26, 30], it is the recent success of vision-and-language navigation [1, 7, 20, 25] and visual grounding [6, 9, 29] that motivates the new RERERE task and dataset. An example from the RERERE dataset is given in Figure 1. In carrying out the task, an agent is placed at a starting point and given a natural language query that refers to an object. The query represents an action a robot might usefully carry out, and the location where it should occur. The response is expected to take the form of an object bounding box encompassing the subject of the query. This is in contrast to [1], where the query represents detailed navigational guidance, and [5], where the query represents a natural language question with a natural language answer. One example asks that the robot {‘go to the stairs on level one’, ’bring me the bottom picture that is next to the top of the stairs.’}, for instance. This request requires the agent not only to navigate to ‘the stairs on level one’, but also to localise the ‘bottom picture that is next to the top of the stairs’ in the area. The task poses several challenges including visual navigation, natural language understanding, object detection and high-level reasoning.

The dataset is built upon the Matterport3D Simulator [1] which provides panoramic image sets for multiple locations around a building, and a (virtual) means of navigating between them. To enable the agent to receive object-level information about the environment we have extended the simulator to incorporate object annotations including labels and bounding boxes. As the task is intended to be image-based, we have enabled the bounding boxes to be projected onto the images so as to automatically accommodate different viewpoints and angles. This particularly enables answers provided as image-based bounding boxes to be assessed equivalently irrespective of the location of the agent. The dataset comprises 10,567 panoramas within 90 buildings containing 3103 objects, and 1.6k open vocabulary, crowd-sourced expressions with an average length of 19 words.

The natural language commands comprise two parts. The first indicates the location of the object. In contrast to [1], wherein the instructions tend to provide detailed trajectory guidance from start to end, RERERE provides only a description of the final destination and the object-centred action to be carried out. This is a more challenging and realistic setting. The second part of the language expression is a description that can distinguish the target object from other surrounding objects, for example, ‘the bigger green pillow on the bed’ and ‘the laptop with an Elsa sticker on it’. As illustrated in Figure 1, the associated task requires an agent to understand natural-language guidance, navigate to the right location, and detect the object (by outputting a bounding box) in a previously unseen environment. The final success rate is measured by the IoU (Intersection over Union) between the predicted bounding box and the ground truth, or by whether the model can select the target object from a given list of candidates.

In this paper, we also investigate the difficulty of the RERERE task by proposing a baseline Navigator-Pointer model composed of a language-guided visual navigation model and a referring expression comprehension model. A human evaluation is also performed to show the gap between the baseline model and human performance.

In summary, our main contributions are:

  1. We introduce a new vision-and-language task that requires an agent to identify a remote object in a real indoor environment on the basis of a natural language action description.

  2. We extend the Matterport3D Simulator [1] to include object annotation and present the RERERE dataset, the first benchmark for Remote Embodied Referring Expression in a real 3D environment.

  3. We propose a Navigator-Pointer model for the RERERE dataset, establishing the first baselines under several evaluation settings.

The simulator, RERERE dataset and models will be publicly available upon the acceptance of this submission.

2 Related Work

Referring Expressions

The referring expression task aims to localise an object in an image given a natural language description. Recent works cast this task as looking for the object that can generate its paired expressions [12, 18, 32] or as jointly embedding the image and expression for matching estimation [4, 11, 17, 31]. Yu et al. [32] propose to compute the appearance difference of same-category objects to enhance the visual features for expression generation. Instead of treating expressions as a unit, [31] learns a language attention model to decompose expressions into appearance, location, and object relationship components.

In contrast to existing referring expression tasks, RERERE introduces three new challenges: i) the robot agent must first navigate to the target room on the basis of a high-level command, such as ‘go to the second bedroom on level two’. ii) the agent must select a panoramic image set to search from the set of panorama locations within the target room. iii) if the target cannot be localised (due to view-angle dependent appearance changes, or absence) the agent must reconsider and try again.

Vision-and-Language Navigation

In the vision-and-language navigation (VLN) task, a robot agent is expected to plot a path to the goal location given natural language navigation instructions (‘turn right … go ahead … turn left …’). A simulator also provides a representation of the visual environment at each step of the path. Most existing simulators fall roughly into two categories: synthetic-photo simulators [2, 15, 27] and realistic-photo simulators [1, 23, 28]. These simulators have been employed in a range of VLN methods [7, 20, 25]. Wang et al. [25], for example, propose a look-ahead module to take into account the future agent state and reward when making action decisions, and in [7] a speaker model is designed to synthesise new instructions for data augmentation and to implement pragmatic reasoning. The work in [24] designs a cross-modal reasoning network to learn the trajectory history, the attention over the textual instruction, and the local visual attention in environment images. Although the proposed RERERE task also requires an agent to navigate to a goal location, it differs from existing VLN tasks in two important aspects: i) the final goal in RERERE is to localise the target object specified in a referring expression, which greatly simplifies performance evaluation as no assessment of the path taken is necessary; and ii) the navigation instructions are semantic-level commands such as might more naturally be given to a human, such as ‘go to the first bedroom on level two’ rather than ‘go to the top of the stairs then turn left and walk down the hall to the second doorway’.

Embodied Vision-and-Language

Embodied question answering (EQA) [5] requires an agent to answer a question about an object or a room, such as ‘What colour is the car?’ and ‘What room is the OBJ in?’. The subject of the question may be invisible from the initial location of the agent, so navigation may be required. The House3D EQA dataset [5] is based on synthetic images and includes nine kinds of predefined question template. Gordon et al. [8] introduce an interactive version of the EQA task, where the agent may need to interact with the environment and its objects to answer questions correctly. Our Remote Embodied Referring Expression task also falls within the area of embodied vision-and-language but, unlike previous works that only output a simple answer or a series of actions, we ask the agent to put a bounding box around a target object. This is a more challenging but more realistic setting because if we want a robot to manipulate an object in an environment, we need its precise location and little more.

3 Object-level Matterport3D Simulator

3.1 Matterport3D Simulator

The Matterport3D Simulator [1] is a large-scale interactive environment constructed from the Matterport3D dataset [3] which includes 10,800 panoramic RGB-D images of 90 real-world building-scale indoor environments. The viewpoints are distributed throughout the entire walk-able floors with an average distance of 2.25m between two adjacent viewpoints. By default, the simulator captures 36 RGB images from 3 pitch angles and 12 yaw angles using cameras at the approximate height of a standing person.

In the simulator, an embodied agent is able to virtually ‘move’ throughout each building by iteratively selecting adjacent nodes from the graph of panorama locations and adjusting the camera pose at each viewpoint. The simulator takes three parameters $(v, \psi, \theta)$ and returns the rendered colour image $I$, where $v$ is the 3D position of a viewpoint of the Matterport3D dataset, $\theta$ is the camera elevation, and $\psi$ denotes the camera heading.

3.2 Adding Object-level Annotations

Figure 2: We add object bounding box annotations to the Matterport simulator. The size and aspect ratio of the bounding box of the same object may change after we move to another viewpoint or change the camera angle. For example, the bounding box (in blue) of the coffee table is bigger after we move closer to it. The aspect ratio of the bounding box (in yellow) of the sofa changes given a new viewpoint. The bounding box (in red) of the TV disappears in the lower-left image because it is occluded by the armchair.

Bounding boxes are needed in the proposed task, either to assess the agent’s ability to localise the object referred to by a natural language expression or to serve as object hypotheses. We exploit the bounding boxes and labels provided by the Matterport3D dataset.

The main challenge in adding the object annotation, especially the object bounding boxes, is that the coordinates and visibility of 2D bounding boxes vary as the viewpoint and camera angle change. To address these issues, we calculate the bounding box overlap and object depth, i.e., if a bounding box is fully covered by another one and lies behind it (according to the depth information), we treat it as occluded and remove it from the visible object list at the current viewpoint. Specifically, for each building the Matterport3D dataset [3] provides all the objects appearing in it with a centre point position $c$, three axis directions $d_1, d_2, d_3$, and three radii $r_1, r_2, r_3$, one for each axis direction. To correctly render objects in the simulator, we first calculate the eight vertices using $c$, $d_i$, and $r_i$. Then these vertices are projected into the camera space using the camera pose provided by Matterport3D at a certain viewpoint. Finally, the visibility in the current camera view and the occlusion relationships among objects are computed using OpenGL APIs.
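As a rough illustration of the vertex computation described above, the following sketch builds the eight corners of an oriented box from a centre, three axis directions and three radii, and then projects them with a simple pinhole camera. The function names, the pinhole model and the focal-length parameter are assumptions made for illustration only; the simulator itself relies on the Matterport3D camera poses and OpenGL.

```python
from itertools import product

def box_vertices(centre, axes, radii):
    """Return the 8 corners of an oriented box: centre +/- r_k * axis_k."""
    corners = []
    for signs in product((-1.0, 1.0), repeat=3):
        corner = list(centre)
        for sign, axis, radius in zip(signs, axes, radii):
            for i in range(3):
                corner[i] += sign * radius * axis[i]
        corners.append(tuple(corner))
    return corners

def project_to_image(points, focal, width, height):
    """Pinhole projection of camera-space points (camera looks along +z).

    Hypothetical stand-in for the real camera model; points behind the
    camera are skipped.
    """
    pixels = []
    for x, y, z in points:
        if z <= 0:
            continue
        pixels.append((focal * x / z + width / 2, focal * y / z + height / 2))
    return pixels

def enclosing_box(pixels):
    """Axis-aligned 2D box (x, y, w, h) around the projected corners."""
    xs = [p[0] for p in pixels]
    ys = [p[1] for p in pixels]
    return (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))

# Usage: a unit-radius box 4 m in front of the camera.
corners = box_vertices((0, 0, 4), [(1, 0, 0), (0, 1, 0), (0, 0, 1)], (1, 1, 1))
pixels = project_to_image(corners, focal=100, width=640, height=480)
box_2d = enclosing_box(pixels)
```

Because the 2D box is recomputed from the 3D corners at every pose, its size and aspect ratio change with the viewpoint, which is the behaviour illustrated in Figure 2.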

Figure 2 shows an example of projected bounding boxes. Multiple object bounding boxes appear in each image, and their size and aspect ratio change when the camera angle changes. On average, there are 5 object bounding boxes in every single view. Considering that on average each navigation graph (in a single building) contains 117 viewpoints and each viewpoint has 36 camera angles, there are more than 20k bounding boxes provided in a single building (117 × 36 × 5 ≈ 21k).

4 The RERERE Dataset

We now describe the Remote Embodied Referring Expression task and dataset, including the data collection policy and analysis of the expressions we collected. We finally provide several evaluation metrics which consider both navigation and detection accuracy.

4.1 Task

As shown in Figure 1, our RERERE task requires an artificial agent to correctly localise the target object specified by a high-level natural language instruction. Since the target object is in a different room from the starting point, the proposed task can be decomposed into two parts: high-level language navigation and referring expression grounding. The high-level language navigation task requires the agent to navigate to the correct room (footnote 1: please note that the navigation instructions provided by our task are quite different from the detailed guidance in [1]; our instructions are more natural and concise, and thus more challenging), while the referring expression grounding task asks the agent to output the bounding box of the target object.

Formally, at the beginning of each episode the agent is given as input a high-level natural language instruction $X = \langle w^N_1, \dots, w^N_L, w^R_1, \dots, w^R_M \rangle$, where the superscript $N$ denotes that the word belongs to the ‘Navigation’ part of the instruction (such as ‘Go to the bedroom with the striped wallpaper’) and the superscript $R$ denotes that the word is from the ‘Referring expression’ part (such as ‘Bring me the pillow that is laying next to the ottoman’). $L$ is the length of the navigation instruction and $M$ is the length of the referring expression. Each $w_i$ is a single word token.

The agent observes an initial RGB image $o_0$, determined by the agent’s initial pose $p_0 = \langle v_0, \psi_0, \theta_0 \rangle$, a tuple of 3D position, heading and elevation. To find the target object, the agent must execute a sequence of actions $\langle a_0, a_1, \dots, a_T \rangle$, with each action $a_t$ leading to a new pose $p_{t+1} = \langle v_{t+1}, \psi_{t+1}, \theta_{t+1} \rangle$ and generating a new image observation $o_{t+1}$. The agent can then choose to stop and output the bounding box of the target object, or continue to move and scan. If the agent chooses to ‘stop and detect’, it is required to output the bounding box $b = \langle b_x, b_y, b_w, b_h \rangle$ of its selected image region, where $b_x$ and $b_y$ denote the coordinates of the top-left point of the bounding box, and $b_w$ and $b_h$ denote the width and height of the bounding box, respectively. The box $b$ can be selected either from a set of predefined bounding box candidates or from object detection results. The episode ends after the agent outputs the target bounding box.

The performance of a model is measured in two ways. The navigation performance is measured by five kinds of metrics, including the length of the trajectory and the distance between the stop location and the goal location. The performance of referring expression grounding is measured by whether the agent selected the right object candidate (if candidate bounding boxes are given) or by the IoU with the ground truth. Refer to Section 4.4 for evaluation details.

4.2 Data Collection

Our goal is to collect high-level daily commands that humans may assign to a home robot in the future, such as ‘Go to the bedroom and bring me a pillow’ or ‘Open the left window in the kitchen’. Such commands are usually very concise and are therefore challenging for a robot to understand. We reuse the trajectories of the existing R2R [1] navigation dataset for high-level command annotation, so that one can fairly compare our more natural and concise navigation instructions to the detailed ones provided in R2R, given the same trajectories.

Figure 3: The length distribution of the collected annotations for describing object and trajectory, respectively.

Our embodied referring expression annotations are collected on Amazon Mechanical Turk (AMT). To this end, we develop an interactive 3D WebGL environment, which first shows a path animation and then randomly highlights one object at the goal location, for which workers provide a command to find or operate on it. There is no restriction on the style of the command as long as it can lead the robot to the target object. The workers can look around at the goal location to learn the environment, which helps them give feasible commands. For each target object, we collect three referring expressions. We also ask the annotators to label which part of the instruction is about navigation (such as ‘the largest bedroom at level 2’) and which part is related to the referring expression (such as ‘the left window’). A benefit of doing this is that one can evaluate only the ‘navigation’ ability of the model by providing the navigation instruction alone, or evaluate only the ‘referring expression comprehension’ by providing the object-related instruction together with the ground truth viewpoint. We encourage researchers to combine them and evaluate a single task, but we provide baselines for all three cases in this paper, namely ‘Navigation-Only’, ‘Grounding-Only’ and ‘Navigation-Grounding’. More details can be found in Section 6.

The full collection interface (see the supplementary material) is the result of several rounds of experimentation. We use only US-based AMT workers, screened according to their performance on previous tasks. Over 456 workers took part in the data collection, contributing around 1,324 hours of annotation time. One example of the collected data can be found in Figure 1. More examples are in the supplementary material.

Figure 4: Distribution of referring expressions based on their first four words. Expressions are read from the centre outwards. Arc lengths are proportional to the number of instructions containing each word.

4.3 RERERE Dataset Analysis

At the time of submission, version 0.9 of the dataset contains 11,371 instructions in total (footnote 2: we aim to collect more than 20K annotations in the full version). The average length of the currently collected instructions is 19 words, of which on average 10 words come from the navigation instruction part. Compared to the detailed navigation instructions provided in R2R [1], which have an average length of 29 words, our navigation commands are more concise and natural and, in other words, more challenging. On the referring expression side, the previous largest dataset, RefCOCOg [32], based on COCO [16], contains an average of 8 words per expression, while ours contains 9. Figure 3 displays the length distribution of the collected annotations for describing object and trajectory, respectively. The instruction vocabulary contains 1.6k words. The distribution of referring expressions based on their first four words is depicted in Figure 4, which shows the diversity of the expressions. We also compute the number of objects mentioned in each referring expression; the distribution is presented in Figure 5. It shows that 30% of expressions mention 3 or more objects, 40% mention 2 objects, and the remaining 30% mention 1 object.

Figure 5: Distribution of the number of objects mentioned in referring expressions.

There are 3103 objects in the dataset, falling into 417 categories, far more than the 80 categories in RefCOCO [32].

Data Splits

We follow the same train/val/test split strategy as the R2R [1] and Matterport3D [3] datasets. In the current version, the training set consists of 59 scenes and 5181 referring expressions over 1605 objects. The validation set, including the seen and unseen splits, contains 63 scenes, 938 objects and 3842 referring expressions in total, of which 10 scenes and 2174 referring expressions over 422 objects are reserved for the val unseen split. For the test set, we collect 2348 referring expressions involving 560 objects randomly scattered across 16 scenes. All the test data are unseen during training and validation. The ground truth for the test set will not be released; we will host an evaluation server where agent trajectories and detected bounding boxes can be uploaded for scoring.

4.4 Evaluation Metrics

We measure the performance of a robot agent in the RERERE task in terms of both visual navigation and referring expression grounding performance.

Figure 6: The main architecture of our proposed Navigator–Pointer Model.

Regarding the navigation performance, we adopt five metrics from the VLN task [1], namely: trajectory length in meters (Len.), navigation error in meters (Err.), success rate (Succ.), oracle success rate (OSucc.), and action steps (Steps). The navigation error is the Euclidean distance between the agent’s final location and the goal location. We consider a navigation successful if its navigation error is no larger than 3 meters. The oracle error is defined as the shortest distance to the goal location reached during one navigation. Similarly, if the oracle error is no larger than 3 meters, we count it as an oracle success. Finally, the success rate and the oracle success rate are defined as the ratio of the number of successful and oracle-successful navigations, respectively, to the total number of navigations in the dataset.
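The navigation metrics defined above can be sketched in a few lines. This is a minimal illustration that follows the definitions in the text (Euclidean distances and a 3-meter threshold); the function names and the representation of a trajectory as a list of 3D positions are assumptions, and the action-step count is omitted because it depends on simulator actions that are not modelled here.

```python
import math

SUCCESS_RADIUS_M = 3.0  # success threshold stated in the text

def navigation_metrics(trajectory, goal):
    """trajectory: ordered list of (x, y, z) positions visited by the agent."""
    length = sum(math.dist(p, q) for p, q in zip(trajectory, trajectory[1:]))
    error = math.dist(trajectory[-1], goal)
    # Oracle error: closest approach to the goal at any point of the episode.
    oracle_error = min(math.dist(p, goal) for p in trajectory)
    return {
        "length": length,
        "error": error,
        "success": error <= SUCCESS_RADIUS_M,
        "oracle_success": oracle_error <= SUCCESS_RADIUS_M,
    }

def success_rates(results):
    """Fraction of episodes that are successes / oracle successes."""
    n = len(results)
    return (sum(r["success"] for r in results) / n,
            sum(r["oracle_success"] for r in results) / n)

# Usage: the first agent passes through the goal but overshoots it by 6 m.
overshoot = navigation_metrics([(0, 0, 0), (4, 0, 0), (10, 0, 0)], goal=(4, 0, 0))
close = navigation_metrics([(0, 0, 0), (1, 0, 0)], goal=(2, 0, 0))
rates = success_rates([overshoot, close])
```

The overshooting episode counts as an oracle success but not as a success, which is why the oracle success rate is always at least as high as the success rate.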

For referring expression grounding, we use the success rate and the overlap ratio as metrics. Every object in the simulator has a distinct identifier within its scene, so if grounding is performed over target candidates provided by the simulator, success can be verified by checking whether the identifiers match. If object proposals are obtained from outside the simulator, such as from an object detector, their identifiers will be inconsistent with those in the simulator. In such cases, we propose to measure the performance via the IoU (Intersection over Union), $\mathrm{IoU} = |b \cap g| / |b \cup g|$, where $b$ is the grounding result and $g$ is the ground truth. Following the common practice in object detection, we treat it as a success if $\mathrm{IoU} \geq 0.5$.
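For reference, the IoU between two boxes in the (x, y, width, height) format used by the task, with (x, y) the top-left corner, can be computed as in the sketch below. The function names are illustrative, and the default threshold of 0.5 follows the common object detection convention.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def grounding_success(predicted, ground_truth, threshold=0.5):
    """True if the predicted box overlaps the ground truth enough."""
    return iou(predicted, ground_truth) >= threshold
```

For example, two 2×2 boxes offset by one pixel in each direction share an area of 1 out of a union of 7, so they would not count as a match at this threshold.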

5 A Navigator – Pointer Model

As the vast majority of the collected human instructions are composed of a navigation command and an object referring expression, we design a baseline model having two counterpart components: a Navigator and a Pointer.

The whole framework of this Navigator-Pointer model is shown in Figure 6. Given an instruction which includes a navigation part and an object referring expression part, the Navigator model first encodes the navigation instruction with an LSTM. The action predictor is another LSTM, which takes the previous action and the current observation as input and then predicts the next action by jointly considering the attended navigation instruction features. After the model predicts a ‘STOP’ action, the 36 images captured from different camera angles at the stopped viewpoint are passed to the Pointer model.

For each image, assuming the candidate bounding boxes are given (or provided by an object proposal model such as RPN [21]), the Pointer model first computes features for each candidate object by concatenating the global image CNN features, the regional CNN features and a bounding box spatial feature. Cosine similarity is then calculated between each candidate object feature and the natural language referring expression feature, which is obtained from an LSTM encoder. The candidate with the highest similarity score among all the candidates from all 36 images is output as the final result. To train the Pointer model, we also introduce negative examples and employ a Margin Ranking Loss. More details are discussed below.

5.1 Navigator Model

We adopt the sequence-to-sequence LSTM architecture with an attention mechanism [1] as the navigation baseline. As shown in Figure 6, the navigator is equipped with an encoder-decoder architecture.

The encoder maps each word $x_i$ of the navigation-related high-level instruction to a hidden state using an LSTM [10], $h_i = \mathrm{LSTM}_{enc}(x_i, h_{i-1})$, and maps the current image observation $o_t$ to a feature $f_t$ using the pretrained ResNet-152. We use $\bar{h} = \{h_1, \dots, h_L\}$ as the instruction context. Then $f_t$ and the last hidden state $h_L$ are concatenated as the joint representation for the input instruction and image observation, denoted as $q_t = [f_t; h_L]$. Based on $q_t$, a decoder LSTM is learned for each action, which is denoted as $h'_t = \mathrm{LSTM}_{dec}(q_t, h'_{t-1})$. To predict a distribution over actions at step $t$, the global general alignment function in [19] is adopted to identify the most relevant parts of the navigation instruction. This is achieved by first computing an instruction context $c_t = f(h'_t, \bar{h})$ and then computing an attentional hidden state $\tilde{h}_t = \tanh(W_c [c_t; h'_t])$. The action distribution is finally calculated as $a_t = \mathrm{softmax}(\tilde{h}_t)$.

The predicted action $a_t$ is then sent back to the simulator to obtain the new observation $o_{t+1}$. If $a_t$ is a ‘STOP’ signal, the agent stops navigating and the simulator returns 36 images captured from different camera angles at the stopped viewpoint (footnote 3: our simulator can also return captured images at any viewpoint, even before the ‘stop’; whether to use the images returned at each viewpoint depends on the model strategy. Our baseline model adopts the simplest strategy, namely navigate-stop-detect. Another strategy would be to perform detection at each viewpoint and, if nothing is found there, continue to move, scan and detect until the target is found. We leave this as future work). All these images are sent to the Pointer model.

5.2 Pointer Model

We formulate the pointer model as a joint-embedding architecture, inspired by its success in referring expression comprehension [33]. The purpose of this embedding model is to encode the visual information from the target object and the semantic information from the referring expression into a joint embedding space, in which vectors that are visually or semantically related lie closer together. Given the referring expression, the pointer model points to its closest object in the embedding space.

Specifically, we use an LSTM to encode the input sentence and use the pre-trained ResNet-152 to encode the object appearance and its context information (the whole image). Two multi-layer perceptrons and two L2 normalisation layers then follow, one for each of the two paths, the object and the expression. Each perceptron module is composed of two fully connected layers with ReLUs, aiming to transform the object and the expression into a common space. The cosine similarity of the two normalised representations is calculated as their similarity score $S(o, r)$. To train the model, we use a Margin Ranking Loss over three sample pairs:

$$L = \sum_i \left[ \max\left(0, \Delta + S(o_i, r_j) - S(o_i, r_i)\right) + \max\left(0, \Delta + S(o_k, r_i) - S(o_i, r_i)\right) \right]$$

where $o_k$ and $r_j$ are negative matches randomly selected from other objects and expressions for the same viewpoint environment, and $\Delta$ is the margin. Because multiple images among the returned 36 include the target object, all of these images are used as training examples to train the model.
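A margin ranking loss over one matching pair and two negative pairs can be sketched as follows. This is an illustrative scalar version under stated assumptions: the function and argument names are hypothetical, the two hinge terms are weighted equally, and the default margin of 0.1 is a placeholder rather than the value used for the baseline.

```python
def margin_ranking_loss(s_pos, s_neg_expr, s_neg_obj, margin=0.1):
    """Hinge-style ranking loss for one training triple.

    s_pos      -- similarity of the target object and its own expression
    s_neg_expr -- similarity of the target object and a wrong expression
    s_neg_obj  -- similarity of a wrong object and the target expression
    margin     -- placeholder value (assumption, not taken from the paper)
    """
    loss_expr = max(0.0, margin + s_neg_expr - s_pos)
    loss_obj = max(0.0, margin + s_neg_obj - s_pos)
    return loss_expr + loss_obj

# Usage: no penalty when the positive pair beats both negatives by the margin.
well_separated = margin_ranking_loss(s_pos=0.9, s_neg_expr=0.2, s_neg_obj=0.3)
violating = margin_ranking_loss(s_pos=0.5, s_neg_expr=0.2, s_neg_obj=0.6)
```

The loss is zero once the matching pair scores higher than both negatives by at least the margin, so training effort concentrates on pairs that are still confusable.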

At test time, we calculate the similarity score between each candidate object and the natural language referring expression. The candidate with the highest score among all the candidates from all 36 images is output as the final result.
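The test-time selection amounts to an argmax of cosine similarity over every candidate in every view. The sketch below assumes the object and expression embeddings have already been computed; the function names and the nested-list layout of candidates are illustrative assumptions.

```python
import math

def l2_normalise(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine_similarity(a, b):
    a, b = l2_normalise(a), l2_normalise(b)
    return sum(x * y for x, y in zip(a, b))

def point(expression_embedding, candidates_per_view):
    """Pick the best candidate across all views.

    candidates_per_view: one list per camera view, each containing
    (object_id, embedding) pairs. Returns (score, view_index, object_id),
    or None if there are no candidates at all.
    """
    best = None
    for view_index, candidates in enumerate(candidates_per_view):
        for object_id, embedding in candidates:
            score = cosine_similarity(expression_embedding, embedding)
            if best is None or score > best[0]:
                best = (score, view_index, object_id)
    return best

# Usage with toy 2-D embeddings and two views.
views = [
    [("chair", [0.0, 1.0]), ("pillow", [1.0, 0.1])],
    [("lamp", [-1.0, 0.0])],
]
score, view_index, object_id = point([1.0, 0.0], views)
```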

6 Experiments

In this section, we first present the implementation details of our navigator–pointer model and its training. Then, we provide extensive experimental results and analysis.

6.1 Implementation Details

For the navigator, we set the simulator image resolution to 640×480 with a vertical field of view of 60 degrees. The number of hidden units in each LSTM is set to 512. The size of the input word embedding is 256, and the size of the action embedding is 32. A discretised heading of 30 degrees and elevation of 30 degrees are adopted in the simulator. For the pointer, we set the word embedding size to 512 and the hidden layer size to 512. Dropout with probability 0.25 is employed on both the visual and the language embedding paths. For both the navigator and the pointer models, words that occur fewer than five times in the training set are filtered out.
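The vocabulary filtering step can be illustrated with a short sketch. Mapping rare words to a shared unknown token, lowercasing, and whitespace tokenisation are assumptions made here for illustration; the text only states that words occurring fewer than five times in the training set are filtered out.

```python
from collections import Counter

def build_vocab(sentences, min_count=5, unk_token="<UNK>"):
    """Keep words seen at least min_count times; index 0 is the unknown token."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    vocab = {unk_token: 0}
    for word, count in sorted(counts.items()):
        if count >= min_count:
            vocab[word] = len(vocab)
    return vocab

def encode(sentence, vocab, unk_token="<UNK>"):
    """Map a sentence to token ids, sending filtered words to the unknown id."""
    return [vocab.get(w, vocab[unk_token]) for w in sentence.lower().split()]

# Usage: 'bring' and 'ottoman' occur once and are therefore filtered.
training = ["go to the bedroom"] * 5 + ["bring the ottoman"]
vocab = build_vocab(training)
ids = encode("bring the pillow", vocab)
```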

Training. As the collected annotations can be divided into path-related and object-related parts, we train the navigator model and the pointer model separately. It is worth noting that we adopt the teacher-forcing and student-forcing approaches as in [1] for navigator training. With the teacher-forcing strategy, the training is supervised by the ground-truth target action from the shortest-path trajectory. However, this leads to a mismatch between the input distributions seen in training and testing [22]. To address this problem, the student-forcing strategy is also explored, where the next action is drawn from the agent’s output probability over the action space. Student-forcing is equivalent to online DAGGER [22]. The Adam optimiser with weight decay [14] is utilised to train the models.
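The difference between the two training strategies lies in which action is executed to produce the next training input. The sketch below illustrates only that choice; the function name, the dictionary representation of the action distribution, and the action labels are assumptions for illustration.

```python
import random

def next_training_action(action_probs, ground_truth_action, mode, rng=random):
    """Choose the action that is executed during training.

    action_probs        -- dict mapping each action to its predicted probability
    ground_truth_action -- next action on the shortest path to the goal
    mode                -- "teacher" or "student"
    rng                 -- source of randomness (module or random.Random instance)
    """
    if mode == "teacher":
        # Teacher-forcing: always follow the ground-truth trajectory.
        return ground_truth_action
    if mode == "student":
        # Student-forcing: sample from the agent's own output distribution,
        # so the agent also visits states off the shortest path.
        actions = list(action_probs)
        weights = [action_probs[a] for a in actions]
        return rng.choices(actions, weights=weights, k=1)[0]
    raise ValueError(f"unknown mode: {mode}")

# Usage with a fixed seed for reproducibility.
probs = {"forward": 0.7, "left": 0.2, "stop": 0.1}
teacher_action = next_training_action(probs, "left", "teacher")
student_action = next_training_action(probs, "left", "student", rng=random.Random(0))
```

In both modes the loss would still be computed against the ground-truth action; only the action used to advance the simulator differs.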

6.2 Results

Experimental settings. We adopt three experimental settings: Navigation-Only, Grounding-Only, and Navigation-Grounding, to evaluate the performance of each component and their combination. We also provide human performance in these three settings.

Navigation-Only. Using the collected high-level navigation instructions, we report the Navigation-Only results in Table 1, with Random, Shortest, and Human performance as baselines. As in [1], the Random agent selects a random heading at each step and completes a total of 5 successful forward actions (if the forward action is unavailable, the agent selects right). The Shortest agent always follows the shortest path to the goal.

The results in Table 1 show that:

  • This is a very challenging task in two respects. (1) Humans achieve a high success rate of 86.23%, while the baseline models only achieve a success rate of around 12%, even though the baselines integrate recent advances in CNNs and RNNs. (2) The task is also challenging for humans: compared with the best case, namely the ground truth given by Shortest, humans take roughly three times as many steps and twice the trajectory length to reach a success rate of only 86.23%.

  • The test split is harder than the val seen and val unseen splits: the best performance achieved by Random, T-forcing, or S-forcing on the test split is about 5 percentage points lower than on the val splits.

  • The newly collected high-level instructions are harder than the detailed instructions of the R2R dataset. Because we collected high-level instructions for the same paths used in the R2R dataset, comparing against the agent’s performance on R2R, shown in Table 2, is fair. Note that the agent is retrained on R2R using the same settings.
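The comparison between human and shortest-path effort in the first point can be checked with simple arithmetic. The sketch below uses the Human row and the Shortest row from the last group of Table 1; pairing these two rows is an assumption, made because the Human row appears in that group.

```python
# Values copied from the last group of rows in Table 1.
human = {"length_m": 21.88, "steps": 38.3}
shortest = {"length_m": 9.56, "steps": 11.56}

# How many times more steps and metres humans use than the shortest path.
step_ratio = human["steps"] / shortest["steps"]
length_ratio = human["length_m"] / shortest["length_m"]

print(f"steps ratio:  {step_ratio:.2f}")
print(f"length ratio: {length_ratio:.2f}")
```

The ratios come out a little above 3 for steps and a little above 2 for trajectory length.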

Agent      Len. (m)  Err. (m)  Succ. (%)  OSucc. (%)  Steps
Val Seen
Shortest   10.31     0         100        100         11.32
Random     9.55      9.68      12.32      17.13       15.75
T-forcing  10.65     10.01     11.48      21.18       16.10
S-forcing  11.63     7.37      24.47      36.37       30.93
Val UnSeen
Shortest   9.56      0         100        100         12.13
Random     9.66      9.66      16.21      22.42       15.87
T-forcing  10.76     9.82      11.76      20.92       17.93
S-forcing  9.47      9.05      10.52      16.14       36.14
Test
Shortest   9.56      0         100        100         11.56
Random     9.41      9.53      11.35      15.55       16.29
S-forcing  7.99      8.82      12.25      17.67       19.35
Human      21.88     1.77      86.23      89.71       38.3
Table 1: Navigation-only evaluation using the collected high-level navigation instructions.

Agent      Len. (m)  Err. (m)  Succ. (%)  OSucc. (%)  Steps
Val Seen
Shortest   10.28     0         100        100         10.87
Random     9.57      9.87      12.61      16.64       16.01
T-forcing  9.44      8.75      18.15      28.07       13.97
S-forcing  9.62      7.28      27.23      36.97       18.46
Val UnSeen
Shortest   9.75      0         100        100         11.94
Random     9.98      9.71      13.49      19.19       15.8
T-forcing  8.92      9.6       14.91      19.19       13.27
S-forcing  7.99      8.52      17         23.14       19.14
Test
Shortest   9.52      0         100        100         11.35
Random     9.27      9.37      13.09      17.67       16.22
S-forcing  7.83      7.81      17.38      25.32       19.2
Human      11.9      1.61      86.4       90.2        17.52
Table 2: Navigation-only evaluation using the detailed instructions from the R2R dataset [1].

Grounding-Only. Table 3 presents the performance of the Pointer baseline and of humans (for the human evaluation, we randomly select one third of the test split, 1000 referring expressions). The baseline performs far below humans, with a gap of about 70 percentage points. This indicates that grounding in an embodied environment is very challenging for the aforementioned reasons, such as the large appearance differences among objects of the same category caused by different view angles. Here we use the ground-truth candidates; results using FastRCNN [21] as the candidate provider can be found in the supplementary material.

Model    Val Seen  Val UnSeen  Test
Pointer  22.41     21.25       29.5
Human    -         -           100
Table 3: Success rate (%) of the Grounding-Only task.

Navigation-Grounding. In the Navigation-Grounding setting, we first perform the VLN subtask using the Navigator baseline described in Section 5.1. After it stops or reaches the maximum number of steps, we run the Pointer model described in Section 5.2.

We present the experimental results in Table 4, together with human performance on the test split. The results show that the RERERE task is extremely challenging: the baseline success rate drops below 1% on the test split, and human performance also decreases by about 25 percentage points, from 100% in Table 3 to 74.95% in Table 4. We believe that this large drop is mainly due to incorrect navigation: if the navigator stops in the wrong room, even a perfect pointer cannot find the target object. These results demonstrate that the proposed task is extremely challenging and that solutions may need to consider navigation and referring expression grounding jointly.
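The two-stage procedure of this setting can be sketched as a small control loop. This is a schematic illustration only: `step` and `point` are hypothetical callables standing in for the Navigator and the Pointer, and the convention that `step` returns `None` to signal a stop is an assumption.

```python
from typing import Callable, Optional, TypeVar

S = TypeVar("S")  # agent state, e.g. a viewpoint
T = TypeVar("T")  # pointer output, e.g. a selected object


def navigate_then_ground(start: S,
                         step: Callable[[S], Optional[S]],
                         point: Callable[[S], T],
                         max_steps: int) -> T:
    """Run the navigator until it stops or hits max_steps, then the pointer."""
    state = start
    for _ in range(max_steps):
        nxt = step(state)
        if nxt is None:  # the navigator chose to stop
            break
        state = nxt
    # The pointer only sees the final location, so a navigation error
    # cannot be recovered at this stage.
    return point(state)
```

Because the pointer is applied only at the final location, its success is conditional on the navigator having stopped where the target object is visible.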

Model              Val Seen  Val UnSeen  Test
Navigator-Pointer  1.68      0.57        0.88
Human              -         -           74.95
Table 4: Success rate (%) of the Navigation-Grounding task.

7 Conclusion

Enabling human-robot collaboration is a long-term goal. In this paper, we take a step towards this goal by proposing the Remote Embodied Referring Expressions in Real indoor Environments (RERERE) task and dataset. RERERE is the first to evaluate the capability of an agent to follow high-level natural language instructions to navigate and select a target object in previously unseen buildings rendered from real images. We investigate several baselines and a Navigator-Pointer agent model, whose results consistently demonstrate the need for further research in this field.

We reach three main conclusions: First, RERERE is interesting because existing vision and language methods can be easily plugged in. Second, the challenge of understanding and executing high-level expressions is significant. Finally, the combination of instruction navigation and referring expression grounding is a challenging task because there is still a large gap between the proposed model and human performance.

8 Acknowledgments

We thank Sam Bahrami for his great help in the building of the RERERE dataset.


1 Typical Samples of The RERERE Task

In Figure 1, we present several typical samples of the proposed RERERE task. They illustrate the diversity in object category, goal region, path instruction, and target object referring expression.

Figure 1: Several typical samples of the collected dataset, which involve various object categories, goal regions, path instructions, and object referring expressions. Blue font denotes high-level navigation instructions and red font denotes target object referring expressions.

2 Additional Analysis of The Dataset

We calculate the proportion of each type of region in which the target object is located, namely the goal region, and show the results in Figure 2. The bathroom accounts for the largest proportion in all of the train/val seen/val unseen/test splits. The remaining regions account for different proportions in the four splits. For example, the bedroom is the second most common region type in the train/val seen/test splits, while the hallway ranks second in the val unseen split.

In addition, we compute the most frequent object categories in the train/val seen/val unseen/test splits. As shown in Figure 3, pictures and chairs occupy the top two positions, which is consistent with most everyday living environments.

Figure 2: Distribution of goal region in each split.
Figure 3: The top 10 object categories in each split.

3 Additional Evaluation Results

It is common practice to use object detection to obtain target object proposals. Here we provide evaluation results using proposals produced by MaskRCNN, which is one of the state-of-the-art object detection methods. We use its best configuration ‘X-101-32x8d-FPN’ in the PyTorch framework, which achieves an AP score of 42.2 on the COCO test-dev dataset.

We present results for the Grounding-Only setting in Table 1. A test is considered a success if the IoU between the selected proposal and the ground-truth bounding box reaches a fixed threshold. The results in Table 1 show that the baseline only achieves a success rate of 2.86% on the test split, which falls far behind the 29.5% obtained using all the available objects in the simulator as target proposals. Similar results are observed on the val seen and val unseen splits. To help understand these results, we calculate the upper bound of the success rate when using detection proposals for referring expression grounding, where a test is considered a success as long as at least one proposal meets the IoU threshold. The upper bound is shown in Table 2. The current detection results reach at most 43.62% on the test split, which indicates a large gap compared to human performance (100%). We also report results in the Navigation-Grounding setting in Table 3. The success rate further drops to 1.58% on the test split, which is caused by the additional navigation errors.
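The success criterion and the upper bound described above can be sketched as follows. This is a minimal illustration under stated assumptions: boxes are axis-aligned `(x1, y1, x2, y2)` tuples, the comparison uses `>=`, and the threshold is left as a parameter because its value is not given in the text. All function names are hypothetical.

```python
from typing import Iterable, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def is_success(selected: Box, gt: Box, threshold: float) -> bool:
    """Success: the proposal chosen by the pointer overlaps the ground truth."""
    return iou(selected, gt) >= threshold


def upper_bound_hit(proposals: Iterable[Box], gt: Box, threshold: float) -> bool:
    """Upper bound: success if ANY proposal overlaps the ground truth enough."""
    return any(iou(p, gt) >= threshold for p in proposals)
```

The upper bound measures the quality of the detector alone: if no proposal covers the target, no grounding model can succeed on that sample.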

Model    Val Seen  Val UnSeen  Test
Pointer  2.14      1.71        2.86
Table 1: Success rate (%) of the Grounding-Only task using object detection results as proposals.

Model    Val Seen  Val UnSeen  Test
Pointer  42.54     41.11       43.62
Table 2: Upper bound of success rate (%) of the Grounding-Only task using object detection results as proposals.

Model    Val Seen  Val UnSeen  Test
Pointer  1.12      1.32        1.58
Table 3: Success rate (%) of the Navigation-Grounding task using object detection results as proposals.

4 Data Collecting Tool

To collect data for the RERERE task, we developed a WebGL-based data collection tool, shown in Figure 4. To assist the workers, the web page provides reference information that is updated in real time according to the location of the agent, including the current level out of the total number of levels, the current region, and the number of regions in the building that have the same function. At the goal location, in addition to highlighting the target object with a red 3D rectangle, we also provide the label of the target object and the number of objects that fall in the same category as the target object. Text and video instructions, shown in Figure 5, help workers understand how to make high-quality annotations.

Figure 4: Collecting interface with real-time updated reference information.
Figure 5: Text and video instructions for workers in the collecting interface