CHALET: Cornell House Agent Learning Environment

by   Claudia Yan, et al.
CUNY Law School

We present CHALET, a 3D house simulator with support for navigation and manipulation. CHALET includes 58 rooms and 10 house configurations, and allows users to easily create new house and room layouts. CHALET supports a range of common household activities, including moving objects, toggling appliances, and placing objects inside closeable containers. The environment and the available actions are designed to create a challenging domain for training and evaluating autonomous agents, including on tasks that combine language, vision, and planning in a dynamic environment.






I Introduction

Training autonomous agents poses challenges that go beyond the common use of annotated data in supervised learning. The large set of states an agent may observe and the importance of agent behavior in identifying states for learning require interactive training environments, where the agent observes the outcome of its behavior and receives feedback. While physical environments easily satisfy these requirements, they are costly, difficult to replicate, hard to scale, and require complex robotic agents. These challenges are further exacerbated by the increased focus on neural network policies for agent behavior, which require significant amounts of training data [15, 17]. Recently, these challenges have been addressed with simulated environments [15, 13, 2, 22, 12, 4, 11, 20, 18, 1, 5]. In this report, we introduce the Cornell House Agent Learning EnvironmenT (CHALET), an interactive house environment. CHALET supports navigation and manipulation of both objects and the environment. It is implemented using the Unity game development engine, and can be deployed on various platforms, including web environments for crowdsourcing.

Fig. 1: Screenshots of various rooms from CHALET. Each house includes 4-7 rooms of various kinds, including bathrooms, bedrooms, and kitchens.

II Environment

CHALET includes 58 rooms organized into 10 houses. Figure 1 shows a sample of the rooms. The environment contains 150 object types (e.g., fridge, sofa, plate). Of these, 71 types can be manipulated: 60 can be picked up and placed (e.g., plates and towels), 6 can be opened and closed (e.g., dishwashers and cabinets), and 5 can change their state (e.g., opening or closing a faucet). Object types are combined with different textures to generate 330 different objects. On average, each room includes 30 objects. Rooms often contain multiple objects of the same kind. For example, kitchens contain many plates and glasses, and bathrooms contain multiple towels. Objects that can be opened and closed are container objects, and can contain other objects. For example, opening a dishwasher exposes a set of racks, and pulling a rack out gives the agent access to the objects on that rack. The agent can also put an object on the rack, close the dishwasher, and open it later to retrieve the object. Figure 2 shows example object manipulations. The environment supports simple physics, including collision detection and gravity.
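The container behavior described above can be illustrated with a minimal Python sketch. It is not CHALET code: the `Container` class and its method names are hypothetical, racks are not modeled, and the only property it assumes is the one stated in the text, namely that the contents of a container are reachable only while it is open.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    """Hypothetical model of a closeable container such as a dishwasher."""
    name: str
    is_open: bool = False
    contents: list = field(default_factory=list)

    def open(self) -> None:
        self.is_open = True

    def close(self) -> None:
        self.is_open = False

    def put(self, item: str) -> None:
        # Objects can only be placed inside while the container is open.
        if not self.is_open:
            raise RuntimeError(f"{self.name} is closed")
        self.contents.append(item)

    def take(self, item: str) -> str:
        # Contents are only reachable while the container is open.
        if not self.is_open:
            raise RuntimeError(f"{self.name} is closed")
        self.contents.remove(item)
        return item

    def visible_contents(self) -> list:
        # A closed container hides everything inside it.
        return list(self.contents) if self.is_open else []


# Put a plate in the dishwasher, close it, and retrieve the plate later.
dishwasher = Container("dishwasher")
dishwasher.open()
dishwasher.put("plate")
dishwasher.close()
hidden = dishwasher.visible_contents()  # empty while the door is closed
dishwasher.open()
retrieved = dishwasher.take("plate")
```

The closed state is what makes containers interesting for learning: an object that was placed earlier is no longer observable, so the agent has to remember or rediscover it.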

| Action | Description |
|---|---|
| move-forward | Change the agent location in the direction of its current orientation |
| move-back | Change the agent location in the direction opposite to its current orientation |
| strafe-right | Change the agent location in the direction to the right of its current orientation |
| strafe-left | Change the agent location in the direction to the left of its current orientation |
| look-left | Change the agent orientation to the left |
| look-right | Change the agent orientation to the right |
| look-up | Change the agent orientation up (when engaged with a container, move the container towards closed) |
| look-down | Change the agent orientation down (when engaged with a container, move the container towards open) |
| interact | Engage the container at the current orientation, pick up the object at the current orientation, drop the object currently held, or toggle the state of the object at the current orientation (e.g., toggle TV power) |

TABLE I: The actions available to the agent in CHALET.

Fig. 2: Sampled observations from three sequences of object manipulation, from top to bottom: loading the dishwasher, placing a food item in the freezer, and placing a pie in the oven.

The agent in CHALET observes the environment from a first-person perspective. At any given time, the agent observes only what is in front of it. The agent position is parameterized by two coordinates for its location and two coordinates for the orientation of its view. Changes to the agent location are made relative to its orientation (i.e., in a first-person coordinate system). Whether the agent looks up or down does not influence location changes. All agent actions are continuous, but can be discretized to a pre-defined precision by specifying the quantity of change for each step. Table I describes the agent actions.
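The pose parameterization and the discretized actions can be sketched as follows. This is an illustration under stated assumptions rather than the simulator's implementation: the names `AgentPose` and `step`, the step sizes, the axis convention (yaw 0 faces +z, yaw 90 faces +x), and the pitch limits are all hypothetical. Container handling for look-up, look-down, and interact is omitted.

```python
import math
from dataclasses import dataclass, replace

MOVE_STEP = 0.5    # assumed distance covered by one movement action
TURN_STEP = 15.0   # assumed rotation (degrees) of one look action

# Angle (degrees, clockwise) between the heading and each movement direction.
MOVE_OFFSETS = {
    "move-forward": 0.0,
    "strafe-right": 90.0,
    "move-back": 180.0,
    "strafe-left": 270.0,
}


@dataclass(frozen=True)
class AgentPose:
    x: float = 0.0      # first location coordinate
    z: float = 0.0      # second location coordinate
    yaw: float = 0.0    # horizontal view angle in degrees; 0 faces +z, 90 faces +x
    pitch: float = 0.0  # vertical view angle in degrees; positive looks up


def step(pose: AgentPose, action: str) -> AgentPose:
    """Return the pose after one discretized action (container handling omitted)."""
    if action in MOVE_OFFSETS:
        # Movement depends on yaw only, so looking up or down never changes it.
        heading = math.radians(pose.yaw + MOVE_OFFSETS[action])
        return replace(
            pose,
            x=pose.x + MOVE_STEP * math.sin(heading),
            z=pose.z + MOVE_STEP * math.cos(heading),
        )
    if action == "look-left":
        return replace(pose, yaw=(pose.yaw - TURN_STEP) % 360.0)
    if action == "look-right":
        return replace(pose, yaw=(pose.yaw + TURN_STEP) % 360.0)
    if action == "look-up":
        return replace(pose, pitch=min(pose.pitch + TURN_STEP, 90.0))
    if action == "look-down":
        return replace(pose, pitch=max(pose.pitch - TURN_STEP, -90.0))
    raise ValueError(f"unsupported action: {action}")


# Look up, then move forward and strafe right: the pitch does not affect motion.
pose = AgentPose()
for action in ("look-up", "move-forward", "strafe-right"):
    pose = step(pose, action)
```

Choosing smaller values for `MOVE_STEP` and `TURN_STEP` approximates continuous control more closely at the cost of longer action sequences.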

CHALET provides a rich testbed for language, exploration, planning, and modeling challenges. We design rooms to often include many objects of the same type. Instructions or questions that refer to a specific object must then use spatial relations and object properties to identify the exact instance. For example, to pick up a specific towel in a bathroom, the agent is likely to be given an instruction such as "pick up the yellow towel left of the sink". In contrast, in an environment with a single object of each type, it would have been sufficient to ask to "pick up the towel". The ability to open and close containers also creates several interesting challenges. Given an instruction such as "put the glass from the cupboard on the table", it is insufficient for an agent to simply align the word "glass" to an observed object. Instead, it must resolve the noun phrase "the cupboard" and the relation "from" to understand that it must look for a glass in a specific location. If multiple cupboards are available, the agent must also explore the different cupboards to find the one containing a glass. This requires both deciding on an exploration policy and planning a complex sequence of actions. Finally, the first-person perspective requires models that support access to previous observations or a representation of them (i.e., memory) to overcome the challenge of partial observability.
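To make the ambiguity concrete, the following Python sketch resolves the towel example by filtering candidates on type, color, and a "left of" relation. It is a toy illustration, not part of CHALET: the `SceneObject` record, the `resolve` function, and the use of a single horizontal coordinate for "left of" are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SceneObject:
    kind: str
    color: str
    x: float  # assumed horizontal position in the agent's view; smaller is further left


def resolve(objects, kind: str, color: Optional[str] = None,
            left_of: Optional[str] = None):
    """Return the objects that match a type, an optional color, and an optional relation."""
    candidates = [o for o in objects if o.kind == kind]
    if color is not None:
        candidates = [o for o in candidates if o.color == color]
    if left_of is not None:
        anchors = [o for o in objects if o.kind == left_of]
        # Keep candidates that lie to the left of at least one anchor object.
        candidates = [c for c in candidates if any(c.x < a.x for a in anchors)]
    return candidates


bathroom = [
    SceneObject("towel", "white", 0.2),
    SceneObject("towel", "yellow", 0.3),
    SceneObject("towel", "yellow", 0.9),
    SceneObject("sink", "white", 0.6),
]

by_type = resolve(bathroom, "towel")                                  # three towels
by_color = resolve(bathroom, "towel", color="yellow")                 # two towels
by_relation = resolve(bathroom, "towel", color="yellow", left_of="sink")  # one towel
```

In this toy room, neither the type nor the color alone singles out an object; only the full description with the spatial relation does.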

III Implementation Details

CHALET is implemented in Unity 3D, a professional game development engine. (The community version of Unity, which was used to develop CHALET, is publicly available for education purposes.) The environment logic is written in the C# scripting language, which supports high-level object-oriented programming constructs and is tightly integrated with the Unity engine. Using Unity provides several advantages. CHALET can be easily compiled for different platforms, including Linux, MacOS, Windows, Android, iOS, and WebGL. Unity also provides a built-in physics engine, and supports integration with augmented- and virtual-reality devices. Extending CHALET with new objects from the Unity Asset Store is trivial.

CHALET supports three modes of operation:

  • standalone: actions are provided using keyboard and mouse input. The generated trajectory is saved to a file. This mode is used for crowdsourcing using a WebGL build.

  • simulator: actions are read from a saved file and executed in sequence. This mode allows replaying previously recorded trajectories, for example those collected during crowdsourcing.

  • client: a separate process provides actions, and the framework returns the agent observations and information about the environment as required. Communication is done over sockets. This mode enables interaction with machine learning frameworks.
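The request-reply pattern of the client mode can be sketched in Python as follows. The wire format used here (one action name per line, answered by one JSON object per line) is purely an assumption for illustration, since the text does not specify CHALET's protocol. `SimulatorClient` and `fake_simulator` are hypothetical names, and the fake simulator only echoes the action instead of returning real observations. Only the action names come from Table I.

```python
import json
import socket
import threading

ACTIONS = {
    "move-forward", "move-back", "strafe-right", "strafe-left",
    "look-left", "look-right", "look-up", "look-down", "interact",
}


class SimulatorClient:
    """Hypothetical client: sends one action per line, reads one JSON reply per line."""

    def __init__(self, sock: socket.socket) -> None:
        self._sock = sock
        self._reader = sock.makefile("r", encoding="utf-8")
        self._writer = sock.makefile("w", encoding="utf-8")

    def act(self, action: str) -> dict:
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action}")
        self._writer.write(action + "\n")
        self._writer.flush()
        return json.loads(self._reader.readline())

    def close(self) -> None:
        self._writer.close()
        self._reader.close()
        self._sock.close()


def fake_simulator(sock: socket.socket) -> None:
    """Stand-in for the simulator process; echoes each action with a step counter."""
    with sock, sock.makefile("r", encoding="utf-8") as reader, \
            sock.makefile("w", encoding="utf-8") as writer:
        for step, line in enumerate(reader, start=1):
            reply = {"step": step, "executed": line.strip()}
            writer.write(json.dumps(reply) + "\n")
            writer.flush()


# A connected socket pair replaces a real network connection in this sketch.
client_sock, server_sock = socket.socketpair()
threading.Thread(target=fake_simulator, args=(server_sock,), daemon=True).start()

client = SimulatorClient(client_sock)
replies = [client.act(a) for a in ("move-forward", "look-left", "interact")]
client.close()
```

Keeping the agent in a separate process means the learning code only needs a socket, which is why this mode fits machine learning frameworks written in other languages than the simulator.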

The framework provides a simple API to compute reward and feedback signals, as required for learning, and provides information about the environment, including the position and state of objects. CHALET also supports programmatic generation of rich scenarios by adding, removing, and re-placing objects at runtime without the use of Unity or re-loading the simulator. To enable this, each room is annotated with a set of surface locations where items may be placed. Placing an object requires specifying its type and orientation, and the target surface and coordinates.
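A placement request of the kind described above could be represented as in the following Python sketch. The `Surface` and `Placement` records, the surface-local coordinates `u` and `v`, and the `validate` function are hypothetical and are not the CHALET API; the sketch only shows the four pieces of information the text lists (type, orientation, target surface, coordinates) and a bounds check against an annotated surface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Surface:
    """Hypothetical annotated surface with a rectangular extent."""
    name: str
    width: float
    depth: float


@dataclass(frozen=True)
class Placement:
    """Hypothetical placement request: type, orientation, surface, coordinates."""
    object_type: str
    orientation_deg: float
    surface: str
    u: float  # assumed coordinate along the surface width
    v: float  # assumed coordinate along the surface depth


def validate(placement: Placement, surfaces: dict) -> None:
    """Raise ValueError if the target surface is unknown or the point is off the surface."""
    surface = surfaces.get(placement.surface)
    if surface is None:
        raise ValueError(f"unknown surface: {placement.surface}")
    inside = 0.0 <= placement.u <= surface.width and 0.0 <= placement.v <= surface.depth
    if not inside:
        raise ValueError(f"coordinates fall outside {surface.name}")


surfaces = {"kitchen_counter": Surface("kitchen_counter", width=2.0, depth=0.6)}
request = Placement("plate", orientation_deg=90.0,
                    surface="kitchen_counter", u=1.2, v=0.3)
validate(request, surfaces)  # passes silently: the point lies on the counter
```

Checking requests against the annotated surfaces before applying them keeps generated scenarios physically plausible, for example by rejecting a plate that would float next to a counter.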

IV Related Environments

| Software | Type | Number of Environments | Navigation | Manipulation |
|---|---|---|---|---|
| MINOS [18] | Simulated | 45K houses + variations | Yes | No |
| House3D [20] | Simulated | Thousand houses | Yes | No |
| AI2-Thor [22, 7] | Simulated | 120 rooms | Yes | Yes |
| Matterport3D [1] | Real Images | 90 houses | Yes | No |
| HoME [5] | Simulated | 45000 houses | Yes | Yes |
| DeepMind Lab [2] | Simulated | Few (procedural) | Yes | No |
| CHALET | Simulated | 58 rooms and 10 default houses | Yes | Yes |

TABLE II: Comparison of CHALET with other 3D house simulators.

Table II compares CHALET to existing simulators. Savva et al. [18], Wu et al. [20], and Beattie et al. [2] provide observations similar to those of CHALET in navigation-only environments. In contrast, CHALET emphasizes manipulation of both objects and the environment to support complex tasks. Anderson et al. [1] use real images with a discrete state space for navigation. While CHALET uses 3D rendered environments, it provides a continuous environment with a variety of actions. The environments most closely related to ours are HoME [5] and AI2-Thor [7], both of which provide 3D rendered houses with object manipulation. Unlike HoME, which only supports moving objects, CHALET enables toggling the state of objects and changing the environment by modifying containers. In contrast to Thor, CHALET supports moving between rooms in complete houses, while the current version of Thor supports a single room.

There is also significant work on using simulators for other domains. Atari [15], OpenAI Gym [4], Project Malmo [11], Minecraft [16], Gazebo [21], and ViZDoom [12] are commonly used for testing reinforcement learning algorithms. Simulators have also been used to evaluate natural language instruction following [14, 3, 13, 9, 8] and question answering [7, 6, 10, 19]. The manipulation features and partial observability challenges of CHALET provide a more realistic testbed for studying language, including for instruction following, visual reasoning, and question answering.