Cornell House Agent Learning Environment
We present CHALET, a 3D house simulator with support for navigation and manipulation. CHALET includes 58 rooms and 10 house configurations, and makes it easy to create new house and room layouts. CHALET supports a range of common household activities, including moving objects, toggling appliances, and placing objects inside closeable containers. The environment and the available actions are designed to create a challenging domain for training and evaluating autonomous agents, including on tasks that combine language, vision, and planning in a dynamic environment.
Training autonomous agents poses challenges that go beyond the common use of annotated data in supervised learning. The large set of states an agent may observe and the importance of agent behavior in identifying states for learning require interactive training environments, where the agent observes the outcome of its behavior and receives feedback. While physical environments easily satisfy these requirements, they are costly, difficult to replicate, hard to scale, and require complex robotic agents. These challenges are further exacerbated by the increased focus on neural network policies for agent behavior, which require significant amounts of training data [15, 17]. Recently, these challenges have been addressed with simulated environments [15, 13, 2, 22, 12, 4, 11, 20, 18, 1, 5]. In this report, we introduce the Cornell House Agent Learning EnvironmenT (CHALET), an interactive house environment. CHALET supports navigation and manipulation of both objects and the environment. It is implemented using the Unity game development engine, and can be deployed on various platforms, including web environments for crowdsourcing.
CHALET includes 58 rooms organized into 10 houses. Figure 1 shows a sample of the rooms. The environment contains 150 object types (e.g., fridge, sofa, plate). 71 types of objects can be manipulated: 60 picked and placed (e.g., plates and towels), 6 opened and closed (e.g., dishwashers and cabinets), and 5 change their state (e.g., opening or closing a faucet). Object types are used with different textures to generate 330 different objects. On average, each room includes 30 objects. Rooms often contain multiple objects of the same kind. For example, kitchens contain many plates and glasses, and bathrooms contain multiple towels. Objects that can be opened and closed are container objects, and can contain other objects. For example, opening a dishwasher exposes a set of racks, and pulling a rack out allows the agent access to the objects on that rack. The agent can also put an object on the rack, close the dishwasher, and open it later to retrieve the object. Figure 2 shows example object manipulations. The environment supports simple physics, including collision-detection and gravity.
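The container behavior described above can be illustrated with a minimal Python sketch. This is not CHALET's implementation (the environment logic is written in C#); the `Container` class and its method names are hypothetical and only model the rule that contents are reachable while the container is open.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Container:
    """Hypothetical stand-in for a closeable object such as a dishwasher."""
    name: str
    is_open: bool = False
    contents: List[str] = field(default_factory=list)

    def open(self) -> None:
        self.is_open = True

    def close(self) -> None:
        self.is_open = False

    def put(self, item: str) -> None:
        # Objects can only be placed inside an open container.
        if not self.is_open:
            raise RuntimeError(f"{self.name} is closed")
        self.contents.append(item)

    def take(self, item: str) -> str:
        # Objects can only be retrieved from an open container.
        if not self.is_open:
            raise RuntimeError(f"{self.name} is closed")
        self.contents.remove(item)
        return item

    def visible_items(self) -> List[str]:
        # A closed container hides everything stored inside it.
        return list(self.contents) if self.is_open else []
```

For example, after `put("plate")` and `close()`, `visible_items()` returns an empty list until the container is opened again, which mirrors the put, close, reopen, retrieve sequence described for the dishwasher.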
|Action||Description|
|move-forward||Change the agent location in the direction of its current orientation|
|move-back||Change the agent location in the direction opposite to its current orientation|
|strafe-right||Change the agent location to the right, perpendicular to its current orientation|
|strafe-left||Change the agent location to the left, perpendicular to its current orientation|
|look-left||Change the agent orientation to the left|
|look-right||Change the agent orientation to the right|
|look-up||Change the agent orientation up (when engaged with a container, move the container towards closed)|
|look-down||Change the agent orientation down (when engaged with a container, move the container towards open)|
|interact||Engage the container at the current orientation, pick the object at the current orientation, drop the object currently held, toggle state of object at current orientation (e.g., toggle TV power)|
The agent in CHALET observes the environment from a first-person perspective. At any given time, the agent observes only what is in front of it. The agent position is parameterized by two coordinates for its location and two coordinates for the orientation of its view. Changing the agent location is done in the direction of its orientation (i.e., first-person coordinate system). Whether the agent looks up or down does not influence location changes. All agent actions are continuous, but can be discretized to pre-defined precision by specifying the quantities of change for each step. Table I describes the agent actions.
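The pose parameterization and discretized actions can be sketched as follows. This is an illustrative sketch under stated assumptions, not CHALET's code: the `AgentPose` fields, the axis conventions (yaw 0 faces +z, positive yaw turns right), and the default step sizes `move_step` and `turn_step` are all hypothetical. The `interact` action is omitted because it depends on the scene.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AgentPose:
    x: float = 0.0      # location, first ground-plane coordinate
    z: float = 0.0      # location, second ground-plane coordinate
    yaw: float = 0.0    # horizontal view angle in degrees (0 faces +z)
    pitch: float = 0.0  # vertical view angle in degrees (positive looks up)


def step(pose, action, move_step=0.5, turn_step=15.0):
    """Apply one discretized action and return the new pose."""
    yaw = math.radians(pose.yaw)
    # Movement directions depend only on yaw, so looking up or down
    # never changes where the agent moves.
    forward = (math.sin(yaw), math.cos(yaw))
    right = (math.cos(yaw), -math.sin(yaw))
    moves = {
        "move-forward": (forward, 1.0),
        "move-back": (forward, -1.0),
        "strafe-right": (right, 1.0),
        "strafe-left": (right, -1.0),
    }
    if action in moves:
        (dx, dz), sign = moves[action]
        return replace(pose,
                       x=pose.x + sign * move_step * dx,
                       z=pose.z + sign * move_step * dz)
    if action == "look-left":
        return replace(pose, yaw=(pose.yaw - turn_step) % 360.0)
    if action == "look-right":
        return replace(pose, yaw=(pose.yaw + turn_step) % 360.0)
    if action == "look-up":
        return replace(pose, pitch=min(90.0, pose.pitch + turn_step))
    if action == "look-down":
        return replace(pose, pitch=max(-90.0, pose.pitch - turn_step))
    raise ValueError(f"unsupported action: {action}")
```

Smaller values of `move_step` and `turn_step` approximate the continuous actions more closely, at the cost of longer action sequences.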
CHALET provides a rich testbed for language, exploration, planning, and modeling challenges. We design rooms to often include many objects of the same type. Instructions or questions that refer to a specific object must then use spatial relations and object properties to identify the exact instance. For example, to pick up a specific towel in a bathroom, the agent is likely to be given an instruction such as "pick up the yellow towel left of the sink". In contrast, in an environment with a single object of each type, it would have been sufficient to ask to "pick up the towel". The ability to open and close containers also creates several interesting challenges. Given an instruction such as "put the glass from the cupboard on the table", it is insufficient for an agent to simply align the word "glass" to an observed object. Instead, it must resolve the noun phrase "the cupboard" and the relation "from" to understand that it must look for a glass in a specific location. If multiple cupboards are available, the agent must also explore the different cupboards to find the one containing a glass. This requires both deciding on an exploration policy and planning a complex sequence of actions. Finally, the agent perspective requires models that support access to previous observations or a representation of them (i.e., memory) to overcome the partial observability challenge.
CHALET is implemented in Unity 3D (https://unity3d.com/), a professional game development engine. The community version of Unity, which was used to develop CHALET, is publicly available for educational purposes. The environment logic is written in the C# scripting language, which supports high-level object-oriented programming constructs and is tightly integrated with the Unity engine. Using Unity provides several advantages. CHALET can be easily compiled for different platforms, including Linux, MacOS, Windows, Android, iOS, and WebGL. Unity also provides a built-in physics engine, and supports integration with augmented- and virtual-reality devices. Extending CHALET with new objects from the Unity Asset Store (https://www.assetstore.unity3d.com/en/) is trivial.
CHALET supports three modes of operation:
standalone: actions are provided using keyboard and mouse input. The generated trajectory is saved to a file. This mode is used for crowdsourcing using a WebGL build.
simulator: actions are read from a saved file and executed in sequence. This mode allows replaying previously recorded trajectories, for example those collected during crowdsourcing.
In the third mode, a separate process provides actions, and the framework returns the agent observations and information about the environment as required. Communication is done over sockets. This mode enables interaction with machine learning frameworks.
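The socket-based mode can be pictured as a request and reply loop between two processes. The sketch below is only an illustration of that pattern: the newline-delimited JSON messages, the field names, and the `fake_simulator` function are assumptions made for this example and do not describe CHALET's actual wire protocol. A local socket pair and a thread stand in for the separate simulator process so the example runs on its own.

```python
import json
import socket
import threading


def fake_simulator(conn):
    """Hypothetical simulator side: reply to each action with one message."""
    with conn, conn.makefile("r") as reader, conn.makefile("w") as writer:
        for step, line in enumerate(reader, start=1):
            action = json.loads(line)["action"]
            # A real simulator would return observations and object states.
            reply = {"step": step, "last_action": action}
            writer.write(json.dumps(reply) + "\n")
            writer.flush()


def run_agent(conn, actions):
    """Agent side: send one action per step and wait for the reply."""
    replies = []
    with conn, conn.makefile("r") as reader, conn.makefile("w") as writer:
        for action in actions:
            writer.write(json.dumps({"action": action}) + "\n")
            writer.flush()
            replies.append(json.loads(reader.readline()))
    return replies


def demo():
    # A connected socket pair replaces a real network connection here.
    agent_sock, sim_sock = socket.socketpair()
    worker = threading.Thread(target=fake_simulator, args=(sim_sock,))
    worker.start()
    replies = run_agent(agent_sock, ["move-forward", "look-left", "interact"])
    worker.join()
    return replies
```

Keeping the exchange strictly one action, one reply makes each step synchronous, which is the usual shape of an environment loop in a learning framework.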
The framework provides a simple API to compute reward and feedback signals, as required for learning, and to provide information about the environment, including the position and state of objects. CHALET also supports programmatic generation of rich scenarios by adding, removing, and re-placing objects during runtime without the use of Unity or re-loading the simulator. To enable this, each room is annotated with a set of surface locations where items may be placed. Placing an object requires specifying its type and orientation, and the target surface and coordinates.
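A placement request with the four ingredients listed above (object type, orientation, target surface, coordinates) might be represented as in the following sketch. The `Surface` and `Placement` classes, the rectangular surface model, and `validate_placement` are hypothetical names introduced for illustration; they are not part of the CHALET API.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Surface:
    """Hypothetical annotated surface, modeled as a rectangle."""
    name: str
    width: float
    depth: float


@dataclass(frozen=True)
class Placement:
    object_type: str               # e.g. "plate"
    orientation_deg: float         # rotation of the object on the surface
    surface: str                   # name of the target surface
    coords: Tuple[float, float]    # position in surface-local coordinates


def validate_placement(p: Placement, surfaces: Dict[str, Surface]) -> Placement:
    """Check that the target exists and the coordinates lie on it."""
    if p.surface not in surfaces:
        raise ValueError(f"unknown surface: {p.surface}")
    s = surfaces[p.surface]
    u, v = p.coords
    if not (0.0 <= u <= s.width and 0.0 <= v <= s.depth):
        raise ValueError(f"coordinates {p.coords} fall outside {s.name}")
    # Normalize the orientation into [0, 360).
    return Placement(p.object_type, p.orientation_deg % 360.0, p.surface, p.coords)
```

Validating against the annotated surfaces before issuing a request keeps generated scenarios from placing objects in mid-air or on surfaces that do not exist in the room.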
|Software||Type||Number of Environments||Navigation||Manipulation|
|MINOS||Simulated||45K houses + variations||Yes||No|
|House3D||Simulated||Thousand houses||Yes||No|
|AI2-Thor [22, 7]||Simulated||120 rooms||Yes||Yes|
|Matterport3D||Real Images||90 houses||Yes||No|
|HoME||Simulated||45000 houses||Yes||Yes|
|DeepMind Lab||Simulated||Few (procedural)||Yes||No|
|CHALET||Simulated||58 rooms and 10 default houses||Yes||Yes|
Table II compares CHALET to existing simulators. Savva et al., Wu et al., and Beattie et al. provide similar observations to CHALET in navigation-only environments. In contrast, CHALET emphasizes manipulation of both objects and the environment to support complex tasks. Anderson et al. use real images with a discrete state space for navigation. Although CHALET uses 3D rendered environments rather than real images, it provides a continuous environment with a variety of actions. The environments most closely related to ours are HoME and AI2-Thor, both of which provide 3D rendered houses with object manipulation. Unlike HoME, which only supports moving objects, CHALET enables toggling the state of objects and changing the environment by modifying containers. In contrast to AI2-Thor, CHALET supports moving between rooms in complete houses, while the current version of AI2-Thor supports a single room.
Simulated environments are commonly used for testing reinforcement learning algorithms. Simulators have also been used to evaluate natural language instruction following [14, 3, 13, 9, 8] and question answering [7, 6, 10, 19]. The manipulation features and partial observability challenges of CHALET provide a more realistic testbed for studying language, including for instruction following, visual reasoning, and question answering.
The Malmo platform for artificial intelligence experimentation. In International Joint Conferences on Artificial Intelligence, pages 4246–4247, 2016b.
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2017.
Environment-driven lexicon induction for high-level instructions. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2015. doi: 10.3115/v1/P15-1096.