One of the longstanding goals of Embodied AI is to build agents that interact with their surrounding world and perform tasks. Recently, navigation and instruction following tasks have gained popularity [anderson18, Anderson2018VisionandLanguageNI, Batra2020ObjectNavRO] in the Embodied AI community. These tasks are the building blocks of interactive embodied agents, and over the past few years, we have observed remarkable progress in the development of models and algorithms. However, a typical assumption for these tasks is that the environment is static; namely, the agent can move within the environment but cannot interact with objects or modify their state. The ability to interact with and change its environment is crucial for any artificial embodied agent and cannot be studied in static environments. There is a general trend towards interactive tasks [sapien, ALFRED20, xiali2020relmogen]. These tasks focus on specific aspects of interaction such as object manipulation, long-horizon planning, and understanding the pre-conditions and post-conditions of actions. In this paper, we address a more comprehensive task in a visually rich environment that can subsume each of these skills.
We address an instantiation of the rearrangement problem, an interactive task, recently introduced by Batra et al. [batra2020rearrangement]. The goal of the rearrangement task is to reach a goal room configuration from an initial room configuration through interaction. In our instantiation, an agent must recover a scene configuration after we have randomly moved, or changed the state of, several objects (see Fig. 1).
This problem has two stages: walkthrough and unshuffle. During the walkthrough stage, the agent may explore the scene and, through egocentric perception, record information regarding the goal configuration. We then remove the agent from the room and move some objects to other locations or change their state (e.g., opening a closed microwave). In the unshuffle stage, the agent must interact with objects in the room to recover the goal configuration observed in the walkthrough stage.
Rearrangement poses several challenges such as inferring the visual differences between the initial and goal configurations, inferring the objects’ state, learning the post-conditions and pre-conditions of actions, maintaining a persistent and compact memory representation during the walkthrough stage, and successful navigation. To establish baseline performance for our task, we evaluate an actor-critic model akin to the state-of-the-art models used for long-horizon tasks such as navigation. We train our baselines using decentralized distributed proximal policy optimization (DD-PPO) [Wijmans2020DDPPOLN, schulman2017proximal], a reward-based RL algorithm, as well as with DAgger [dagger], a behavioral cloning method. During the walkthrough stage, the agent uses a non-parametric mapping module to memorize its observations along with any visible objects and their positions. In the unshuffle stage the agent compares images that it observes against what it has observed in its map and may use this information to inform which objects it should move or open. As a proof-of-concept we also run experiments with a model that includes a semantic mapping component adapted from the Active Neural SLAM model [chaplot2020learning].
To facilitate research in this challenging direction, we compiled the Room Rearrangement (RoomR) dataset. RoomR is built upon AI2-THOR [ai2thor], a virtual interactive environment that enables interacting with objects and changing their state. The RoomR dataset includes 6,000 rearrangement tasks that involve changing the pose and state of multiple objects within an episode. The level of the difficulty of each episode varies depending on the differences between the initial and the goal object configurations. We have used 120 rooms and more than 70 unique object categories to create the dataset.
We consider two variations of the room rearrangement task. In the first setting, which we call the 1-Phase task, the agent completes the walkthrough and unshuffle stages in parallel so that it is given aligned images from the walkthrough and unshuffle configurations at every step. In the second setting, the 2-Phase task, the agent must complete the walkthrough and unshuffle stages sequentially; this 2-Phase variant is more challenging as it requires the agent to reason over longer time spans. Highlighting the difficulty of rearrangement, our evaluations show that our strong baselines struggle even in the easier 1-Phase task. Rearrangement poses a new set of challenges for the embodied-AI community. Our code and dataset are publicly available. A supplementary video (https://youtu.be/1APxaOC9U-A) provides a description of the task and some qualitative results.
2 Related Work
Embodied AI tasks. In recent years, we have witnessed a surge of interest in learning-based Embodied AI tasks. Various tasks have been proposed in this domain: navigation towards objects [Batra2020ObjectNavRO, Yang2018VisualSN, Wortsman2019LearningTL, chaplot2020object] or towards a specific point [anderson18, Savva_2019_ICCV, Wijmans2020DDPPOLN, ramakrishnan2020occant], scene exploration [chen2018learning, chaplot2020learning], embodied question answering [gordon18, embodiedqa], task completion [Zhu2017VisualSP], instruction following [Anderson2018VisionandLanguageNI, ALFRED20], object manipulation [corl2018surreal, pmlr-v100-yu20a], multi-agent coordination [twobodyproblem, cordialsync], and many others. Rearrangement can be considered as a broader task that encompasses skills learned through these tasks.
Rearrangement. Rearrangement Planning is an established field in robotics research where the goal is to reach a goal state from an initial state [BenShahar1996PracticalPP, stilman2007manipulation, king2016rearrangement, kron16, yuan2018rearrangement, labbe2020monte]. While these methods have shown impressive performance, they consider complete observability of the state from perfect visual perception [cosgun2011push, king2016rearrangement], a planar surface as the environment [krontiris2015dealing, song2019multi], a static robot [dogar2011framework, krontiris2014rearranging], same environment for evaluation of generalization [scholz2010combining, King-2015-5955], or a limited set of object categories or limited variation within the categories [amazon, gualtieri2018pick]. Some works address some of these issues, such as generalization to new objects or imperfect perception [zakka2020form2fit, berscheid2020self]. In this paper, we take a step further and relax these assumptions by considering raw visual input instead of perfect perception, a visually and geometrically complex scene as the configuration space, separate scenes for training and evaluation, a variety of objects, and object state changes.
Task and motion planning. Our work can be considered as an instance of joint task and motion planning [kaelbling2011hierarchical, srivastava2014combined, plaku2010sampling, garrett2018ffrob, dantam2016incremental] since solving the rearrangement task requires low-level motion planning to plan a sequence of actions and high-level task planning to recover the goal state from the initial state of the scene. However, the focus of these works is primarily on the planning problem rather than perception.
3 The Room Rearrangement Task
Our goal is to rearrange an initial configuration of a room into a goal configuration. So that our agent does not have to reason about soft-body physics, we restrict our attention to piece-wise rigid objects. Suppose a room contains $n$ piece-wise rigid objects. We define the state for object $i$ as $s_i = (p_i, o_i, c_i, b_i)$ where
$p_i$ records the pose of the object,
$o_i \in [0, 1]$, if the object can be opened, specifies the openness of the object ($o_i = 0.5$ means a door is half open), and if the object cannot be opened (e.g., a mug) then $o_i$ takes a null value,
$c_i \in \mathbb{R}^{8 \times 3}$ records the coordinates in $\mathbb{R}^3$ of the 8 corners of the 3D bounding box for object $i$, and
$b_i \in \{0, 1\}$ records if the $i$th object is “broken” (1 if broken, otherwise 0).
While this definition of an object’s state is constrained (objects can be more than just “broken” and “unbroken”), it matches well the capabilities of our target embodied environment (AI2-THOR) and can be easily enriched as embodied environments become increasingly realistic. We now let $\mathcal{S}$ be the set of all possible poses for a single object and $\mathcal{S}^n$ the set of all possible joint object poses. The agent’s goal is to convert an initial configuration $s^0 \in \mathcal{S}^n$ to a goal $s^* \in \mathcal{S}^n$.
Our task has two stages: (1) walkthrough and (2) unshuffle. During the walkthrough stage, the agent is placed into a room with goal state $s^*$, and it should collect as much information as needed for that particular state of the room in a maximum number of actions (for us, 250). The agent is removed from the room after the walkthrough stage. We then select a random subset of the objects and change their state. The state change for an object $i$ may be a change in its pose $p_i$ or its openness $o_i$. This state $s^0$ will be the initial state that the agent observes at the beginning of the unshuffle stage. The agent’s goal is to convert $s^0$ to $s^*$ via a sequence of actions.
To quantify an agent’s performance, we introduce four metrics below. Recall from the above that an agent begins an unshuffle episode with the room in state $s^0$ and has the goal of rearranging the room to end in state $s^*$. Suppose that at the end of an unshuffle episode, the agent has reconfigured the room so that it lies in state $s$. In practice, we cannot expect that the agent will place objects in exactly the same positions as in $s^*$. We instead choose a collection of thresholds which determine if two object poses are, approximately, equal. When two poses $s^1_i$ and $s^2_i$ are approximately equal we write $s^1_i \approx s^2_i$. Otherwise we write $s^1_i \not\approx s^2_i$.
Let $s^1_i, s^2_i \in \mathcal{S}$ be two possible poses for object $i$. As it makes little intuitive sense to compare the poses of broken objects, we will always assert that poses of broken objects are unequal. Thus if $b^1_i = 1$ or $b^2_i = 1$ we define $s^1_i \not\approx s^2_i$. Now let’s assume that neither $b^1_i = 1$ nor $b^2_i = 1$. If object $i$ is pickupable, let $\mathrm{IOU}(s^1_i, s^2_i)$ be the intersection over union between the 3D bounding boxes $c^1_i, c^2_i$. We then say that $s^1_i \approx s^2_i$ if, and only if, $\mathrm{IOU}(s^1_i, s^2_i)$ is at least the chosen IOU threshold. If object $i$ is openable but not pickupable, we say that $s^1_i \approx s^2_i$ if, and only if, the difference in openness $|o^1_i - o^2_i|$ is at most the chosen openness threshold. The use of the IOU above means that object poses can be approximately equal even when their orientations are completely different. While this can be easily made more stringent, our rearrangement task is already quite challenging. Note also that the metrics below do not consider the case where there are multiple identical objects in a scene (as this does not occur in our dataset). We now describe our metrics.
Success (Success) – This is the most unforgiving of our metrics and equals 1 if all object poses in $s$ and $s^*$ are approximately equal, otherwise it equals 0.
% Fixed (Strict) (%FixedStrict) – The above Success metric does not give any credit to an agent if it manages to rearrange some, but not all, objects within a room. To this end, let $M_{\text{start}} = \{i : s^0_i \not\approx s^*_i\}$ be the set of misplaced objects at the start of the unshuffle stage and let $M_{\text{end}} = \{i : s_i \not\approx s^*_i\}$ be the set of misplaced objects at the end of the episode. We then let %FixedStrict equal 0 if $M_{\text{end}} \not\subseteq M_{\text{start}}$ (the agent has moved an object that should not have been moved) and, otherwise, let %FixedStrict equal $1 - |M_{\text{end}}| / |M_{\text{start}}|$ (the proportion of objects that were misplaced initially but ended in the correct pose).
% Energy Remaining (%E) – Missing from all of the above metrics is the ability to give partial credit if, for example, the agent moves an object across a room and towards the goal pose, but fails to place it so that it has a sufficiently high IOU with the goal. To allow for partial credit, we define an energy function $D$ that monotonically decreases to 0 as two poses get closer together (see Appendix E for full details) and which equals zero if two poses are approximately equal. The %E metric is then defined as the amount of energy remaining at the end of the unshuffle episode divided by the total energy at the start of the unshuffle episode, $\big(\sum_{i=1}^{n} D(s_i, s^*_i)\big) / \big(\sum_{i=1}^{n} D(s^0_i, s^*_i)\big)$.
# Changed (#Changed) – To give additional insight into our agent’s behavior we also include the #Changed metric. This metric is simply the number of objects whose pose has been changed by the agent during the unshuffle stage. Note that larger or smaller values of this metric are not necessarily “better” (both moving no objects and moving many objects randomly are poor strategies).
The above metrics are then averaged across episodes when reporting results.
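The approximate-equality test and the four metrics above can be sketched for a single episode as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `ObjectState` container, the axis-aligned simplification of the 3D IOU, the default threshold values, the handling of objects that are neither pickupable nor openable, and the choice to pass the summed energies in as plain numbers (the energy function is defined in Appendix E and not reproduced here) are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, Optional, Sequence, Tuple

# Axis-aligned stand-in for a 3D bounding box: (min corner, max corner).
Box = Tuple[Tuple[float, float, float], Tuple[float, float, float]]


@dataclass(frozen=True)
class ObjectState:
    """Hypothetical container for the parts of an object's state used below."""
    box: Box
    openness: Optional[float]  # None if the object cannot be opened
    broken: bool = False
    pickupable: bool = True


def iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[0][axis], b[0][axis])
        hi = min(a[1][axis], b[1][axis])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    vol_a = math.prod(a[1][k] - a[0][k] for k in range(3))
    vol_b = math.prod(b[1][k] - b[0][k] for k in range(3))
    return inter / (vol_a + vol_b - inter)


def approx_equal(s1: ObjectState, s2: ObjectState,
                 iou_threshold: float = 0.5,
                 openness_threshold: float = 0.2) -> bool:
    """Approximate pose equality; both threshold defaults are assumed values."""
    if s1.broken or s2.broken:
        return False  # poses of broken objects are always unequal
    if s1.pickupable:
        return iou_3d(s1.box, s2.box) >= iou_threshold
    if s1.openness is not None and s2.openness is not None:
        return abs(s1.openness - s2.openness) <= openness_threshold
    return True  # assumption: other objects are never misplaced


def episode_metrics(start: Sequence[ObjectState], end: Sequence[ObjectState],
                    goal: Sequence[ObjectState],
                    energy_start: float, energy_end: float) -> Dict[str, float]:
    """Success, %FixedStrict, %E and #Changed for one unshuffle episode."""
    n = len(goal)
    misplaced_start = {i for i in range(n) if not approx_equal(start[i], goal[i])}
    misplaced_end = {i for i in range(n) if not approx_equal(end[i], goal[i])}
    if not misplaced_start or energy_start <= 0:
        raise ValueError("episode must start with at least one misplaced object")
    if misplaced_end <= misplaced_start:
        fixed_strict = 1.0 - len(misplaced_end) / len(misplaced_start)
    else:
        fixed_strict = 0.0  # an object that was correct initially is now misplaced
    return {
        "success": float(not misplaced_end),
        "fixed_strict": fixed_strict,
        "energy_remaining": energy_end / energy_start,
        # Assumption: an object counts as changed if its recorded state differs.
        "num_changed": float(sum(start[i] != end[i] for i in range(n))),
    }
```

Averaging the returned dictionaries over episodes would give the reported numbers. Note that `energy_remaining` can exceed 1 when the agent leaves the room further from the goal than it started.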
4 The RoomR Dataset
The Room Rearrangement (RoomR) dataset utilizes 120 rooms in AI2-THOR [ai2thor] and contains 6,000 unique rearrangements (50 rearrangements per training, validation, and testing room). Each datapoint consists of an initial room state $s^0$, the agent’s starting position, and the goal state $s^*$.
4.1 Generating Rearrangements
The automatic generation of the dataset enables us to scale up the number of rearrangements easily. We generate each room rearrangement using the procedure that follows.
Place agent. We randomize the agent’s position on the floor. The position is restricted to lie on a grid, where each cell is of size $0.25\,\text{m} \times 0.25\,\text{m}$. The agent’s rotation is then randomly chosen amongst $\{0^\circ, 90^\circ, 180^\circ, 270^\circ\}$. The agent’s starting pose is the same for both $s^0$ and $s^*$.
Shuffle background objects. To obtain different configurations of objects for each task in the dataset, we randomly shuffle each movable object, ensuring background objects do not always appear in the same position. Shuffled objects are never hidden inside other receptacles (e.g., fridges, cabinets), which reduces the task’s complexity.
Sample objects. We now randomly sample a set of openable but non-pickupable objects and a set of pickupable objects. Both the objects and the number of objects in each set are chosen randomly.
Goal ($s^*$) setup. We open the openable objects sampled in the last step to some randomly chosen degree of openness and move the sampled pickupable objects to arbitrary locations within the room. The room’s current state is now $s^*$, the start state for the walkthrough stage.
Initial ($s^0$) setup. We randomize the sampled openable objects’ openness and shuffle the position of each of the sampled pickupable objects once more. We are now in $s^0$, the start state for the unshuffle stage.
In the above process, we ensure that no broken objects are in $s^0$ or $s^*$. While we provide a fixed number of datapoints per room, this process can be used to sample a practically unbounded number of rearrangements.
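The generation procedure can be sketched as a toy sampler. This is only an illustration of the order of the steps, not the dataset generation code: the function `generate_rearrangement`, the grid size, the rotation set, the count limits `max_openable` and `max_pickupable`, and the dictionary layout are all hypothetical, and the background-object shuffle and the broken-object check are omitted.

```python
import random
from typing import Dict, List


def generate_rearrangement(openable: List[str], pickupable: List[str],
                           locations: List[str], rng: random.Random,
                           max_openable: int = 1, max_pickupable: int = 5) -> Dict:
    """Toy version of the procedure; every argument and default is an assumption."""
    # Place agent: a random grid cell and one of four headings (assumed values).
    agent_pose = {
        "cell": (rng.randrange(10), rng.randrange(10)),
        "rotation": rng.choice([0, 90, 180, 270]),
    }

    # Sample objects: which objects differ between the goal and initial states.
    n_open = rng.randint(0, min(max_openable, len(openable)))
    n_pick = rng.randint(1, min(max_pickupable, len(pickupable)))
    chosen_open = rng.sample(openable, n_open)
    chosen_pick = rng.sample(pickupable, n_pick)

    def sample_state() -> Dict[str, Dict]:
        # Random openness for openable objects, random location for the others.
        state = {name: {"openness": rng.random()} for name in chosen_open}
        state.update({name: {"location": rng.choice(locations)}
                      for name in chosen_pick})
        return state

    goal = sample_state()     # goal setup: start state of the walkthrough stage
    initial = sample_state()  # initial setup: start state of the unshuffle stage
    return {"agent_pose": agent_pose, "goal": goal, "initial": initial}
```

Because the two states are drawn independently, this toy version can leave an object in the same place in both states; a real generator would presumably reject or resample such cases.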
4.2 Dataset Properties
Rooms. There are 120 rooms across the categories of kitchen, living room, bathroom, and bedroom (30 rooms for each category). We designate 20 rooms for training, 5 rooms for validation, and 5 rooms for testing, across each room category. Of the 6000 unique rearrangements in our dataset, 4000 are designated for training, 1000 are set in validation rooms, and 1000 are set in test rooms. For each such split, there are 50 rearrangements per room.
Objects. There are 118 object categories (listed in Appendix F), among which 62 are pickupable (e.g., cup) and 10 are openable and non-pickupable (e.g., fridge). The set of object categories that appear in the validation and testing rooms is a subset of the object categories that appear during training. Thus, if a plant appears in a validation or testing room, then a plant is also present in one of the training rooms. While all object categories are seen during training, the physical appearance of object instances is often unique across training, validation, and testing rooms. AI2-THOR provides annotations indicating whether an object is pickupable, openable, movable, or static.
Across the dataset, there are 1895 pickupable object instances and 1262 openable non-pickupable object instances (an average of 15.7 and 10.5, respectively, per room). Fig. 2 shows the distance distribution (horizontal and vertical) of objects between their initial and goal positions. It illustrates the complexity of the problem, where the agent must travel relatively far to recover the goal configuration. Fig. 3 shows the distribution of these object groups and their sizes within every room. Note that pickupable objects (e.g., apple, fork) tend to be relatively small and hard to find, compared to openable non-pickupable objects (e.g., cabinets, drawers). Further, across room categories, the number of openable non-pickupable objects varies considerably.
In our experiments, Sec. 6, we consider two RoomR task variants: 1-Phase and 2-Phase. In the 1-Phase task, the agent completes the unshuffle and walkthrough stages simultaneously in lock step. The model we employ for this 1-Phase task is a simplification of the model used when performing the 2-Phase task (in which both stages must be completed sequentially and so longer-term memory is required). For space, we only describe the 2-Phase model below; see our codebase for all architectural details.
Our network architecture, see Fig. 4, follows the same basic structure as is commonly employed within Embodied AI tasks [Savva_2019_ICCV, Wijmans2020DDPPOLN, robothor, WeihsKembhaviEtAl2019, cordialsync]: a combination of a convolutional neural network to process input egocentric images, a collection of embedding layers to encode discrete inputs, and an RNN to enable the agent to reason through time. In addition to this baseline architecture, we would like our agent to have two capabilities relevant to the rearrangement task, namely the abilities to, during the unshuffle stage, (a) explicitly compare images seen during the walkthrough stage against those seen during the unshuffle stage, and (b) reference an implicit representation of the walkthrough stage. We now describe the details of our architecture and how we enable these additional capabilities.
Our agents are of the actor-critic [Mnih2016AsynchronousMF] variety and thus, at each timestep $t$, given observations $\omega_t$ (e.g., an egocentric RGB image) and a summary $h_{t-1}$ of the agent’s history, we require that an agent produces a policy $\pi_\theta(\cdot \mid \omega_t, h_{t-1})$ (a distribution over the agent’s actions) and a value $v_\theta(\omega_t, h_{t-1})$ (an estimate of future rewards). Here we let $\theta$ be a catch-all parameter representing all of the trainable parameters in our network. As we wish for our agent to have characteristically different behavior in the walkthrough and unshuffle stages, we have two separate policies $\pi^{\text{walk}}_\theta$ and $\pi^{\text{unshuffle}}_\theta$ (and similarly for $v_\theta$).
To encode input $224 \times 224 \times 3$ RGB egocentric images, we use a ResNet18 [resnet] model (pretrained on ImageNet) with frozen weights and with the final average pooling and classification layers removed. This ResNet18 model transforms input images into $7 \times 7 \times 512$ tensors. For our RNN, we leverage a 1-layer LSTM [HochreiterNC1997] with 512 hidden units. To produce the policies $\pi^{\text{walk}}_\theta$ and $\pi^{\text{unshuffle}}_\theta$ we use two linear layers, each applied to the output from the LSTM, and each followed by a softmax nonlinearity. Similarly, to produce the two values $v^{\text{walk}}_\theta$ and $v^{\text{unshuffle}}_\theta$ we use two distinct linear layers applied to the output of the LSTM with no additional nonlinearity. We now describe how we give agents the abilities (a) and (b) above.
Mapping and image comparison. Our model includes a non-parametric mapping module. The module saves the RGB images seen by the agent during the walkthrough stage, along with the agent’s pose. During the unshuffle stage, the agent (i) queries the metric map for all poses visited during the walkthrough stage, (ii) chooses the pose closest to the agent’s current pose, and then (iii) retrieves the image saved by the walkthrough agent at that pose. Using an attention mechanism, the agent can then compare this retrieved image against its current observation to decide which objects to target.
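The three retrieval steps of the non-parametric map can be sketched as a nearest-pose lookup. This is a minimal sketch, not the model's actual mapping module: the `WalkthroughMap` class, the `(x, z, rotation)` pose layout, and the distance used to decide which pose is "closest" (Euclidean position distance plus a weighted heading difference) are assumptions, since the text does not specify them.

```python
import math
from typing import Any, List, Tuple

# Assumed pose layout: (x in metres, z in metres, heading in degrees).
Pose = Tuple[float, float, float]


class WalkthroughMap:
    """Stores (pose, image) pairs from the walkthrough stage."""

    def __init__(self) -> None:
        self._entries: List[Tuple[Pose, Any]] = []

    def add(self, pose: Pose, image: Any) -> None:
        """Called at every walkthrough step to save the current observation."""
        self._entries.append((pose, image))

    def closest(self, query: Pose, rotation_weight: float = 0.01) -> Tuple[Pose, Any]:
        """Return the saved (pose, image) pair nearest to the query pose."""
        if not self._entries:
            raise LookupError("no walkthrough observations were stored")

        def distance(pose: Pose) -> float:
            d_rot = abs(pose[2] - query[2]) % 360
            d_rot = min(d_rot, 360 - d_rot)  # shortest angular difference
            d_pos = math.hypot(pose[0] - query[0], pose[1] - query[1])
            return d_pos + rotation_weight * d_rot

        return min(self._entries, key=lambda entry: distance(entry[0]))
```

During the unshuffle stage the retrieved image would then be compared against the current observation by the attention mechanism, which is not shown here.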
Implicit representations of the walkthrough stage. In addition to explicitly storing the images seen during the walkthrough stage, we also wish to enable our agent to produce an implicit representation of its experiences during the walkthrough stage. To this end, at every timestep $t$ during the walkthrough stage we pass $h_t$, the output of the 1-layer LSTM described above, to a 1-layer GRU with 512 hidden units to produce the walkthrough encoding $w_t$. During the unshuffle stage this walkthrough encoding is no longer updated and is simply taken as the encoding from the last walkthrough step. The walkthrough encoding is passed as an input to the LSTM in a recurrent fashion.
This section provides the results for several baseline approaches that achieve state-of-the-art performance on other embodied tasks (e.g., navigation). The room rearrangement task and the RoomR dataset are very challenging. To make the problem more manageable, we make simplifying assumptions in choosing the action space and the sensor modalities. Sec. 6.1 and Sec. 6.2 explain the details of the action space and sensor modalities, respectively. We show that even with these simplifications, the baseline models struggle.
6.1 Action Space
AI2-THOR offers a wide variety of means by which agents may interact with their environment ranging from “low-level” (e.g., applying forces to individual objects) to “high-level” (e.g., open an object of the given type) interactions. Prior work [twobodyproblem, WeihsKembhaviEtAl2019, cordialsync, gordon18, ALFRED20] has primarily used higher-level actions to abstract away some details that would otherwise distract from the problem of interest. We follow this prior work and define our agent’s action space as $\mathcal{A}$, where taking action:
MoveX for $X \in \{$Ahead, Left, Right, Back$\}$ results in the agent moving 0.25m in the direction specified by X in the agent’s coordinate frame (unless this would result in the agent colliding with an object).
RotateX for $X \in \{$Right, Left$\}$ results in the agent rotating $90^\circ$ clockwise if X = Right and $90^\circ$ counter-clockwise if X = Left.
LookX for $X \in \{$Up, Down$\}$ results in the agent raising/lowering its camera angle by $30^\circ$.
PickupX for X among the 62 pickupable object types results in the agent picking up a visible object of type X if: (a) the agent is not already holding an object, (b) the agent is close enough to the object (within 1.5m), and (c) picking up the object would not result in it colliding with objects in front of the agent. If there are multiple objects of type X then the closest is chosen.
Stand or Crouch results in the agent raising or lowering the agent’s camera to one of two fixed heights allowing it to, e.g., see objects under tables.
OpenX for X among the 10 openable object types that are not pickupable: if an object of type X whose openness is different from the openness in the goal state is visible and within 1.5m of the agent, this object’s openness is changed to its value in the goal state.
PlaceObject results in the agent dropping its held object. If the held object’s goal state is visible and within 1.5m of the agent, it is placed into that goal state. Otherwise, a heuristic is used to place the object on a nearby surface.
Done results in the walkthrough or unshuffle stage immediately terminating.
In total, there are $4 + 2 + 2 + 62 + 2 + 10 + 1 + 1 = 84$ possible actions. Some of the above actions have been designed to be fairly abstract or “high-level”; e.g., the PlaceObject action abstracts away all object manipulation complexities. As we discuss in Appendix C, we have also implemented “lower-level” actions. Still, we stress that, even with these more abstract actions, the planning and visual reasoning required in RoomR already makes the task very challenging.
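As a sanity check on the size of the action space, the list of actions can be enumerated programmatically. This is a sketch only: apart from `MoveAhead` and `PlaceObject`, which the text names, the action names below (`RotateRight`, `LookUp`, `PickupX`, `Stand`, `Crouch`, `OpenX`, `Done`) are assumptions about the naming scheme, and the object type names are placeholders rather than the real AI2-THOR categories.

```python
from typing import List, Sequence


def build_action_space(pickupable_types: Sequence[str],
                       openable_types: Sequence[str]) -> List[str]:
    """Enumerate the discrete actions described above (names are illustrative)."""
    actions = [f"Move{d}" for d in ("Ahead", "Left", "Right", "Back")]
    actions += ["RotateRight", "RotateLeft"]   # 90 degree turns
    actions += ["LookUp", "LookDown"]          # 30 degree camera tilts
    actions += [f"Pickup{t}" for t in pickupable_types]
    actions += ["Stand", "Crouch"]             # two fixed camera heights
    actions += [f"Open{t}" for t in openable_types]
    actions += ["PlaceObject", "Done"]
    return actions


# Placeholder type names standing in for the 62 pickupable and 10 openable types.
PICKUPABLE = [f"PickupableType{i}" for i in range(62)]
OPENABLE = [f"OpenableType{i}" for i in range(10)]
ACTIONS = build_action_space(PICKUPABLE, OPENABLE)
```

Under these assumptions the list has 4 + 2 + 2 + 62 + 2 + 10 + 1 + 1 = 84 entries, so a policy head would output a distribution over 84 discrete actions.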
6.2 RoomR Variants
We will now detail the 1-Phase and, more difficult, 2-Phase variants of our RoomR task. These variants are, in part, defined by the sensors available to the agent. We begin by listing all sensors (note that only a subset of these will be available to any given agent in the below variants):
RGB – An egocentric $224 \times 224 \times 3$ RGB image corresponding to the agent’s current viewpoint ($90^\circ$ FOV). In the 1-Phase task this corresponds to the RGB image from the unshuffle stage.
WalkthroughRGB – This sensor is only available in the 1-Phase task and is identical to RGB except it shows the egocentric image as though the agent was in the walkthrough stage, i.e., as though all objects were in their goal positions. It is this sensor that makes it possible, during the 1-Phase task, for the agent to perform pixel-to-pixel comparisons between the environment as it should be in the walkthrough stage and as it is during the unshuffle stage.
AgentPosition – The agent’s position relative to its starting location (this is equivalent to the assumption of perfect egomotion estimation).
InWalkthrough – Only relevant during the 2-Phase task, this sensor returns “true” if the agent is currently in the walkthrough stage and otherwise returns “false”.
1-Phase Task – In this variant, the agent takes actions within the walkthrough and unshuffle stages simultaneously in lock step. That is, if the agent takes a MoveAhead action, the agent moves ahead in both stages simultaneously; as the agent begins in the same starting position in both stages, the agent’s position will always be the same in both stages. As only navigational actions are allowed during walkthrough, non-navigational actions are not executed by the agent in the walkthrough stage. During the unshuffle stage, the agent has access to the RGB, WalkthroughRGB, and AgentPosition sensors to complete its task.
2-Phase Task – In this task, the agent must complete both the walkthrough and unshuffle stages sequentially. In this task the agent has access to the RGB, AgentPosition, and InWalkthrough sensors.
Table 1: Baseline results. Each cell lists values for the train / val / test splits; Success and %FixedStrict are percentages.
| Model | Success | %FixedStrict | %E | #Changed |
| 1-Phase (Simple, IL) | 2.2 / 1.3 / 1.8 | 7.3 / 4.7 / 4.8 | 1.17 / 1.10 / 1.08 | 1.1 / 0.7 / 0.6 |
| 1-Phase (Simple, PPO) | 1.8 / 2.1 / 0.7 | 6.7 / 6.7 / 4.6 | 0.95 / 0.96 / 0.99 | 0.3 / 0.4 / 0.4 |
| 1-Phase (RN18, IL) | 8.2 / 1.7 / 2.8 | 17.9 / 5.0 / 6.3 | 0.93 / 1.14 / 1.11 | 1.3 / 0.9 / 0.9 |
| 1-Phase (RN18, PPO) | 1.4 / 1.5 / 1.1 | 6.6 / 6.0 / 5.3 | 0.94 / 0.96 / 0.98 | 0.3 / 0.3 / 0.3 |
| 1-Phase (RN18+ANM, IL) | 4.8 / 5.2 / 3.2 | 12.8 / 11.1 / 8.9 | 1.05 / 1.05 / 1.04 | 1.3 / 1.0 / 1.0 |
| 2-Phase (RN18, PPO+IL) | 1.6 / 0.5 / 0.2 | 4.2 / 1.2 / 0.7 | 1.10 / 1.15 / 1.12 | 0.6 / 0.4 / 0.4 |
| 2-Phase (RN18+ANM, PPO+IL) | 2.3 / 0.6 / 0.3 | 7.3 / 1.6 / 1.4 | 1.09 / 1.15 / 1.10 | 0.9 / 0.5 / 0.4 |
6.3 Training Pipeline
As our experimental results show, we found training models to complete the RoomR task using purely reward-based reinforcement learning methods to be extremely challenging. The difficulty remains even when using dense, shaped rewards. Thus, we have chosen to adopt a hybrid training strategy where we use the DD-PPO [Wijmans2020DDPPOLN, schulman2017proximal] algorithm, a reward-based RL method, to train our agent when it is within the walkthrough stage, and an imitation learning (IL) approach, where we minimize a cross-entropy loss between the agent’s policy and expert actions, when it is in the unshuffle stage. As it has been successfully employed in training agents in other embodied tasks (e.g., [GuptaTolaniEtAl2020]), for our IL training, we employ DAgger [dagger]. In DAgger, we begin training by forcing our agent to always take an expert’s action with probability 1 and anneal this probability to 0 over the first 1Mn steps for the 1-Phase task and 5Mn steps for the 2-Phase task. Tacitly assumed in the above is that we have access to an expert policy which can be efficiently evaluated at every state reached by our agent. Even with access to the full environment state, hand-designing an optimal, efficiently computable, expert is extremely difficult: simple considerations show that planning the agent’s route is at least as difficult as the traveling salesman problem. Therefore, we do not attempt to design an optimal expert and, instead, use a greedy heuristic expert with some backtracking and error detection capabilities. See Appendix B for more details. This expert is not perfect but, as seen in Tab. 1, can restore all but a small fraction of objects to their rightful places. For additional training details, see Appendix A.
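The expert-forcing schedule used with DAgger can be sketched as follows. This is a minimal sketch: the text only states that the probability is annealed from 1 to 0 over a fixed number of steps, so the linear shape of the schedule and the helper names `expert_forcing_probability` and `choose_action` are assumptions.

```python
import random
from typing import TypeVar

Action = TypeVar("Action")


def expert_forcing_probability(step: int, anneal_steps: int) -> float:
    """Probability of executing the expert's action at a given training step.

    Assumes a linear decay from 1 at step 0 to 0 at `anneal_steps`.
    """
    if anneal_steps <= 0:
        raise ValueError("anneal_steps must be positive")
    return max(0.0, 1.0 - step / anneal_steps)


def choose_action(step: int, anneal_steps: int, expert_action: Action,
                  policy_action: Action, rng: random.Random) -> Action:
    """Execute the expert's action with the annealed probability, else the agent's."""
    if rng.random() < expert_forcing_probability(step, anneal_steps):
        return expert_action
    return policy_action
```

With `anneal_steps=1_000_000` (the 1-Phase setting) the agent always follows the expert at step 0, follows it half of the time at step 500,000, and acts on its own policy from step 1,000,000 onwards. In either case the cross-entropy loss is still computed against the expert's action.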
Recall from Sec. 4 that our dataset contains a training set of size 4000 and validation/testing sets of 1000 instances each. We report results on each of these splits but, for efficiency, include only the first 15 rearrangement instances per room in the training set (leaving 1200 instances).
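The split sizes quoted above follow directly from the room counts in Sec. 4.2. The short computation below only restates that arithmetic; the constant and variable names are chosen for this illustration.

```python
ROOM_CATEGORIES = 4                      # kitchen, living room, bathroom, bedroom
ROOMS_PER_CATEGORY = {"train": 20, "val": 5, "test": 5}
REARRANGEMENTS_PER_ROOM = 50
TRAIN_EVAL_INSTANCES_PER_ROOM = 15       # subset of training tasks used for evaluation

# Number of rooms and rearrangements in each split.
rooms = {split: ROOM_CATEGORIES * n for split, n in ROOMS_PER_CATEGORY.items()}
split_sizes = {split: n * REARRANGEMENTS_PER_ROOM for split, n in rooms.items()}

total_rooms = sum(rooms.values())
total_rearrangements = sum(split_sizes.values())
train_eval_subset = rooms["train"] * TRAIN_EVAL_INSTANCES_PER_ROOM
```

This gives 80 / 20 / 20 rooms and 4000 / 1000 / 1000 rearrangements for the training, validation, and test splits, and 80 × 15 = 1200 training instances used for evaluation.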
Baselines. We evaluate the following baseline models:
1-Phase (RN18, IL) – An agent trained using pure imitation learning in the 1-Phase task. Recall that 1-Phase task models use a simplification of the model from Sec. 5; see our code for more details.
1-Phase (RN18, PPO) – As above but trained with PPO.
1-Phase (Simple, IL) – As 1-Phase (RN18, IL) but we replace the ResNet18 CNN backbone and attention module with 3 CNN blocks; this CNN is commonly used in embodied navigation baselines [Savva_2019_ICCV].
1-Phase (Simple, PPO) – As 1-Phase (Simple, IL) but trained with PPO rather than IL.
2-Phase (RN18, PPO+IL) – An agent trained in the 2-Phase task using the model from Sec. 5. PPO and IL are used in the walkthrough and unshuffle stages, respectively.
1-Phase (RN18+ANM, IL) – We pretrain a variant of the “Active Neural SLAM” (ANM) [chaplot2020learning] architecture to perform semantic mapping within AI2-THOR using our set of 72 object categories. We then freeze this mapping network and train our “1-Phase (RN18, IL)” model extended to allow for comparing between the maps created in the unshuffle and walkthrough stages. See Appendix D for more details.
2-Phase (RN18+ANM, PPO+IL) – Similar to the above but with the semantic mapping model integrated into the “2-Phase (RN18, PPO+IL)” baseline.
Analysis. We record rolling metrics during training in Fig. 5. After training, we evaluate our models on our three dataset splits and record the average metric values in Tab. 1. From the results, we see several clear trends.
Unshuffling objects is hard – Even when evaluated on the seen training rearrangements in the easier 1-Phase task, the success of our best model is only 8.2%.
Reward-based RL struggles to train – Fig. 5 shows that PPO-based models quickly appear to become trapped in local optima. Tab. 1 shows that the PPO agents move relatively few objects but, when they do move objects, they generally place them correctly even in test scenes.
Pretrained CNN backbones can improve performance – We hypothesized that using a pretrained CNN backbone would substantially improve generalization performance given the relatively little object variety (compared with ImageNet) in our dataset. We see compelling evidence of this when comparing the performance of the “1-Phase (Simple, IL)” and “1-Phase (RN18, IL)” baselines (Success and %FixedStrict improvements across all splits). The results were more mixed for the PPO-trained baselines.
The 2-Phase task is much more difficult than the 1-Phase task – Comparing the performance of the “2-Phase (RN18, PPO+IL)” and “1-Phase (RN18, IL)” baselines, it is clear that the 2-Phase task is much more difficult than the 1-Phase task. If the agent managed to explore exhaustively during the walkthrough stage then the two tasks would be effectively identical. This suggests that the observed gap is primarily driven by learning dynamics and the walkthrough agent’s failure to explore exhaustively. Note that, as we select the best val. set model, Tab. 1 may give the impression that the 2-Phase baseline failed to train at all: this is not the case as we can see, in Fig. 5, that the “2-Phase (RN18, PPO+IL)” baseline trains to almost the same training-set performance as the 1-Phase IL baselines.
Semantic mapping appears to substantially improve performance – Our preliminary results suggest that semantic mapping can have a substantial impact on improving the generalization performance of rearrangement models: the “+ANM” baselines outperform their counterparts in almost all metrics, especially so on the validation and test sets. These results are preliminary as we have not carefully balanced parameter counts to ensure fair comparisons.
See Fig. 6 for success and failure examples.
Our proposed Room Rearrangement task poses a rich set of challenges, including navigation, planning, and reasoning about object poses and states. To facilitate learning for rearrangement, we propose the RoomR dataset that provides a challenging testbed in visually rich interactive environments. We show that modern deep RL methodologies obtain (test-set) performance only marginally above chance. Given the low performance of existing methods we suspect that future high-performance models will require novel architectures enabling comparative mapping (to record object positions during the walkthrough stage and compare these positions against those observed in the unshuffle stage), visual reasoning about object positions, and physics to be able to manipulate objects to their goal locations. Moreover, we require new reinforcement learning methodologies to allow the walkthrough and unshuffle stages to be trained jointly with minimum mutual interference. Given these challenges, we hope the proposed task opens up new avenues of research in the domain of Embodied AI.
Appendix A Implementation details
We train our agents using the AllenAct Embodied AI framework [allenact] for 75 million steps. We run our experiments on g4dn.12xlarge Amazon EC2 instances, each of which has 4 NVIDIA T4 GPUs and 48 CPU cores. See Table 2 for an accounting of our training hyperparameters (e.g. learning rate, loss weights). During training we obtain an FPS of 125 when training models with expert supervision and an FPS of when training purely with PPO. Thus, training for 75 million steps requires approximately 4.3 days when using expert supervision and 1.8 days without.
The reward structures for our agents differ in the walkthrough and unshuffle stages. Rather than provide explicit details here, as these are better read directly from code, we give some intuition about these reward structures.
Unshuffle stage rewards. For the unshuffle stage the reward is quite simple. Suppose that before the agent takes an action the scene is in state $s$ and, after the agent takes a step, the scene is in state $s'$. The agent’s reward is then equal to the change in energy of the scene (with respect to the goal pose $s^*$), i.e. $r = D(s, s^*) - D(s', s^*)$. Thus if the energy has decreased ($D(s', s^*) < D(s, s^*)$), so that the scene is closer to the goal state than it was before, then the agent gets a positive reward. Otherwise, the agent may receive a negative reward. At the end of an unshuffle episode the agent receives a penalty equal to the negation of the remaining energy.
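The unshuffle reward described above can be sketched in a few lines. This is only an illustration, assuming the scene energy at each step is already available as a number; the function names and the example energies are hypothetical and are not taken from our released code.

```python
from typing import List


def step_reward(energy_before: float, energy_after: float) -> float:
    """Per-step reward: the decrease in scene energy caused by the action."""
    return energy_before - energy_after


def terminal_penalty(energy_remaining: float) -> float:
    """End-of-episode penalty: the negation of the remaining energy."""
    return -energy_remaining


def episode_return(energies: List[float]) -> float:
    """Undiscounted return for a sequence of scene energies.

    energies[0] is the energy before the first action and energies[-1]
    is the energy when the episode ends.
    """
    rewards = [step_reward(b, a) for b, a in zip(energies, energies[1:])]
    return sum(rewards) + terminal_penalty(energies[-1])


# Hypothetical energies, made up for illustration: the third action
# moves the scene further from the goal and is penalized.
example_energies = [2.0, 2.0, 1.5, 1.75, 0.5]
example_return = episode_return(example_energies)
```

In this sketch the per-step rewards telescope, so the undiscounted return equals the initial energy minus twice the final energy (here 2.0 − 2 × 0.5 = 1.0).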
Walkthrough stage rewards. In the walkthrough stage we would like the agent to see as many of the objects in the scene as possible so that, during the unshuffle stage, the agent can compare the object poses seen against their goal positions. To this end, after every step in the walkthrough stage, the agent receives a reward if it observes objects that it has never seen previously in the episode. At the end of the episode we provide the agent with a reward based on the proportion of objects the agent has seen among all objects in the scene. We found this reward helpful to encourage the agent to be as exhaustive as possible.
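A minimal sketch of this exploration reward follows. The reward magnitudes, the choice to scale the per-step bonus by the number of newly seen objects, and all names are assumptions made for illustration; the exact values are better read from the code.

```python
from typing import List, Set

NEW_OBJECT_REWARD = 0.25      # hypothetical bonus per newly seen object
COVERAGE_REWARD_SCALE = 1.0   # hypothetical weight on the final coverage


def walkthrough_rewards(
    visible_per_step: List[Set[str]], all_objects: Set[str]
) -> List[float]:
    """Return one reward per step of a walkthrough episode."""
    if not visible_per_step:
        return []
    seen: Set[str] = set()
    rewards: List[float] = []
    for visible in visible_per_step:
        # Only objects never seen earlier in the episode earn a bonus.
        newly_seen = (visible & all_objects) - seen
        rewards.append(NEW_OBJECT_REWARD * len(newly_seen))
        seen |= newly_seen
    # End-of-episode reward based on the proportion of objects seen.
    coverage = len(seen) / len(all_objects) if all_objects else 0.0
    rewards[-1] += COVERAGE_REWARD_SCALE * coverage
    return rewards


# Hypothetical episode: two of four objects are ever observed.
objects = {"Mug", "Book", "Laptop", "Vase"}
frames = [{"Mug"}, {"Mug", "Book"}, set(), {"Book"}]
example_rewards = walkthrough_rewards(frames, objects)
```

Re-observing the book in the last frame earns nothing; only the terminal coverage term (2 of 4 objects) contributes at that step.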
|Discount factor ($\gamma$)|
|GAE parameter ($\lambda$)|
|Value loss coefficient|
|Entropy loss coefficient|
|Clip parameter ($\epsilon$) [schulman2017proximal]|
|Decay on||, 75e6)|
|PPO-only – Training|
|# Processes to sample steps||(5 per GPU)|
|IL and IL+PPO – Training|
|# Processes to sample steps||(5 per GPU)|
|Common – Training|
|Rollouts per minibatch|
|Gradient clip norm|
corresponds to linear interpolation between and within training steps.
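The linear-interpolation schedule mentioned in the table note can be sketched as below. The function name, its argument names, the start and end values in the example, and the clamping outside the training range are all assumptions; only the 75e6-step horizon comes from the table.

```python
def linear_schedule(start: float, end: float, total_steps: int, step: int) -> float:
    """Linearly interpolate from `start` to `end` over `total_steps` steps.

    Steps outside [0, total_steps] are clamped (an assumption of this sketch).
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * fraction


# Hypothetical start/end values; halfway through 75e6 steps the
# interpolated value is halfway between them.
TOTAL_STEPS = 75_000_000
halfway_value = linear_schedule(1.0, 0.0, TOTAL_STEPS, TOTAL_STEPS // 2)
```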
Appendix B Heuristic Expert
The sole purpose of our expert is to produce expert actions for our learning agents to imitate. As such, it is allowed to “cheat” by using extensive ground-truth state information, including the scene layout and the poses of all objects in the current and goal states. As it does not have to reason from visual input, the heuristic expert’s performance cannot be fairly compared against the other agents. At a high level, our expert operates by looping through (1) selecting the closest object that is not in its goal pose, (2) navigating to this object via shortest paths computed on the scene layout, (3) picking up the object, (4) navigating to the closest position from which the object can be placed in its goal pose, and (5) placing the object. As AI2-THOR is physics based, it is possible for the above steps to fail (e.g. an object falls in the way of the expert as it navigates). Because of this, the expert has backtracking capabilities that allow it to give up on placing an object temporarily in the hope that, in placing other objects, it will remove the obstruction.
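The loop above can be illustrated with a toy sketch. It makes several simplifying assumptions that do not hold for the real expert: objects are points in the plane, straight-line distance stands in for shortest paths on the scene layout, navigation is instantaneous, and failures are simulated by a hypothetical `can_place` callback. All names are illustrative.

```python
import math
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]


def greedy_expert_order(
    agent: Point,
    current: Dict[str, Point],
    goal: Dict[str, Point],
    can_place: Callable[[str, int], bool] = lambda name, attempt: True,
    max_attempts: int = 3,
) -> List[str]:
    """Return the order in which objects are successfully placed."""
    attempts = {name: 0 for name in current}
    placed: List[str] = []
    deferred: List[str] = []
    pending = [n for n in current if current[n] != goal[n]]
    while pending or deferred:
        if not pending:
            # Backtracking: retry objects that were given up on earlier.
            pending, deferred = deferred, []
        # (1) Select the closest object that is not in its goal pose.
        name = min(pending, key=lambda n: math.dist(agent, current[n]))
        pending.remove(name)
        # (2)-(3) Navigate to the object and pick it up.
        agent = current[name]
        attempts[name] += 1
        if can_place(name, attempts[name]):
            # (4)-(5) Navigate to the goal location and place the object.
            agent = goal[name]
            placed.append(name)
        elif attempts[name] < max_attempts:
            # Give up temporarily and handle other objects first.
            deferred.append(name)
    return placed


# Hypothetical scene: the vase is already in its goal pose.
start = {"Mug": (1.0, 0.0), "Book": (5.0, 0.0), "Vase": (2.0, 0.0)}
target = {"Mug": (4.0, 0.0), "Book": (5.0, 1.0), "Vase": (2.0, 0.0)}
order = greedy_expert_order((0.0, 0.0), start, target)
# Simulate an obstruction that blocks the first attempt to place the mug.
blocked_once = lambda name, attempt: not (name == "Mug" and attempt == 1)
order_with_retry = greedy_expert_order((0.0, 0.0), start, target, blocked_once)
```

In the second call the mug is deferred after its first failed placement and is placed only after the book, which mirrors the temporary give-up behaviour described above.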
Appendix C Lower-level actions
As discussed in Sec. 6.1, in our experiments we use a “high-level” action space in line with prior work. We suspect (and hope) that within the next few years the rearrangement task will be solved using these high-level actions, enabling us to move to low-level actions, which are more easily implementable on existing robotic hardware. In preparation for this eventuality, we have implemented a number of lower-level actions. Rather than describe these actions individually, we will describe them in contrast to their higher-level counterparts.
Continuous navigation. In our experiments the agent moves in increments of 0.25 meters, rotates in 90° increments, and changes its camera angle by a fixed increment at a time. We have implemented fully continuous motion so that the agent can rotate and move by arbitrary angles and distances, respectively.
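The difference between the discrete and continuous navigation actions can be illustrated with a simple planar pose update. This sketch assumes a pose of (x, z, yaw), with yaw in degrees, 0° facing the +z axis and positive rotation turning toward +x; the function names and this convention are assumptions for illustration and ignore collisions.

```python
import math
from typing import Tuple

Pose = Tuple[float, float, float]  # (x, z, yaw in degrees)


def rotate(pose: Pose, degrees: float) -> Pose:
    """Rotate in place by an arbitrary angle."""
    x, z, yaw = pose
    return (x, z, (yaw + degrees) % 360.0)


def move_ahead(pose: Pose, distance: float) -> Pose:
    """Move an arbitrary distance along the current heading."""
    x, z, yaw = pose
    heading = math.radians(yaw)
    return (x + distance * math.sin(heading), z + distance * math.cos(heading), yaw)


origin: Pose = (0.0, 0.0, 0.0)
# High-level style: fixed 0.25 m steps and 90-degree turns.
discrete_pose = move_ahead(rotate(origin, 90.0), 0.25)
# Low-level style: any angle and any distance.
continuous_pose = move_ahead(rotate(origin, 37.5), 0.4)
```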
Object manipulation. Our high-level actions include a PlaceObject action that abstracts away the subtleties of moving a held object to a goal location. In our low-level actions we now allow the agent to move a held object through space (within some distance of the agent), possibly colliding with other objects. The agent must then explicitly drop the object into the goal location.
Opening and picking up objects. When an agent opens an object using one of the 10 high-level open actions, it is not required to specify the target openness or where, in space, the object to open resides. Fig. 7 shows how objects are targeted with our lower-level actions. For our low-level open action the agent must specify the coordinates (in pixel space) of the object, as well as the amount by which the object is opened. Similarly, our low-level Pickup action requires specifying the object by coordinates rather than by the object’s type.
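One way to picture coordinate-based targeting is as a lookup into a per-pixel object mask. The sketch below is hypothetical: the `LowLevelOpen` record, the openness range of [0, 1], and the mask-based lookup are assumptions made for illustration and do not describe the simulator's actual interface.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LowLevelOpen:
    """Hypothetical low-level open action targeted by pixel coordinates."""
    x: int            # pixel column of the target object
    y: int            # pixel row of the target object
    openness: float   # assumed range: 0.0 (closed) to 1.0 (fully open)

    def __post_init__(self) -> None:
        if not 0.0 <= self.openness <= 1.0:
            raise ValueError("openness must lie in [0, 1]")


def target_at_pixel(
    instance_mask: List[List[Optional[str]]], x: int, y: int
) -> Optional[str]:
    """Return the object occupying pixel (x, y), or None if there is none."""
    if 0 <= y < len(instance_mask) and 0 <= x < len(instance_mask[y]):
        return instance_mask[y][x]
    return None


# Hypothetical 2x2 image in which each pixel is labeled with an object.
mask = [[None, "Fridge"], ["Fridge", "Mug"]]
action = LowLevelOpen(x=1, y=0, openness=0.5)
targeted = target_at_pixel(mask, action.x, action.y)
```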
Appendix D Semantic Mapping
As discussed in the main paper, we include two baselines that incorporate the “Active Neural SLAM” module of Chaplot et al. (2020) [chaplot2020learning], which we have adapted (by increasing the number of output channels in the map) to perform semantic mapping.
We pretrain the ANM module so that, given an image from AI2-THOR, it returns a tensor corresponding to an estimate of the semantic map in a 2 m × 2 m region directly in front of the agent (3 channels are used to predict free space; the other 72 are used to predict the probability that one of our 72 rearrangement objects occupies a given map location).
After pretraining this module we freeze its weights and incorporate it into our baseline model (recall Sec. 5). In particular, we remove the nonparametric map from our baseline and replace it with the ANM. During the walkthrough stage the agent constructs the semantic map and saves it. During the unshuffle stage, the agent indexes into the walkthrough map to retrieve the estimate of the egocentric semantic map for the agent’s current position. It compares this walkthrough map estimate against its current map estimate through the use of an attention mechanism: the two estimates are concatenated, embedded via a CNN, and then attention is computed spatially to downsample the embeddings to a single 512-dimensional vector. This embedding is then concatenated to the input to the 1-layer LSTM (recall Sec. 5) along with the usual visual and discrete embeddings.
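The comparison step can be sketched in plain Python on a tiny example. This is not the model's implementation: a per-cell linear map with a ReLU stands in for the CNN, the attention scores come from a single hypothetical weight vector, and the embedding has 3 dimensions rather than 512. All names and weights are assumptions.

```python
import math
from typing import List

Vector = List[float]
Grid = List[List[Vector]]  # height x width x channels


def concat_maps(walkthrough: Grid, current: Grid) -> Grid:
    """Concatenate the two map estimates along the channel dimension."""
    return [[w + c for w, c in zip(w_row, c_row)]
            for w_row, c_row in zip(walkthrough, current)]


def matvec(weights: List[Vector], x: Vector) -> Vector:
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def attention_pool(grid: Grid, embed_w: List[Vector], score_w: Vector) -> Vector:
    """Embed every map cell, then pool spatially with softmax attention."""
    # Per-cell embedding (a stand-in for the CNN) followed by a ReLU.
    cells = [[max(0.0, v) for v in matvec(embed_w, cell)]
             for row in grid for cell in row]
    # One attention score per spatial location, normalized with a softmax.
    scores = [sum(s * e for s, e in zip(score_w, emb)) for emb in cells]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum over locations gives a single embedding vector.
    return [sum(wt * emb[d] for wt, emb in zip(weights, cells))
            for d in range(len(embed_w))]


# Hypothetical 2x2 single-channel maps: an object moved one cell to the right.
walkthrough_map = [[[1.0], [0.0]], [[0.0], [0.0]]]
current_map = [[[0.0], [1.0]], [[0.0], [0.0]]]
embed_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform_scores = [0.0, 0.0, 0.0]  # zero scores reduce attention to mean pooling
pooled = attention_pool(concat_maps(walkthrough_map, current_map),
                        embed_weights, uniform_scores)
```

With zero score weights the softmax is uniform and the pooling reduces to a spatial mean; learned score weights would instead emphasize locations where the two maps disagree.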
Appendix E Computing the energy between two poses
In our discussion of the “% Energy Remaining” metric (recall Sec. 3.2) we deferred the definition of the energy function; we define it now. Let be two possible poses for an object. Then,
If or we let .
Otherwise, if the object is openable but not pickupable, we let if and otherwise , otherwise
Otherwise, if the object is pickupable, we have two cases. Suppose that . Then we let . Otherwise, we let where be the minimum distance between a point in and a point in .
Note that decreases monotonically as poses come closer together.
Appendix F Object types
The list of all object types is provided in Tab. 3.
|Object Type [A-L]||Openable||Pickupable|
|Object Type [M-Z]||Openable||Pickupable|