Physical construction -- the ability to compose objects, subject to physical dynamics, in order to serve some function -- is fundamental to human intelligence. Here we introduce a suite of challenging physical construction tasks inspired by how children play with blocks, such as matching a target configuration, stacking and attaching blocks to connect objects together, and creating shelter-like structures over target objects. We then examine how a range of modern deep reinforcement learning agents fare on these challenges, and introduce several new approaches which provide superior performance. Our results show that agents which use structured representations (e.g., objects and scene graphs) and structured policies (e.g., object-centric actions) outperform those which use less structured representations, and generalize better beyond their training when asked to reason about larger scenes. Agents which use model-based planning via Monte-Carlo Tree Search also outperform strictly model-free agents in our most challenging construction problems. We conclude that approaches which combine structured representations and reasoning with powerful learning are a key path toward agents that possess rich intuitive physics, scene understanding, and planning.
Humans are a “construction species”—we build forts out of couch cushions as children, pyramids in our deserts, and space stations that orbit hundreds of kilometers above our heads. What do artificial intelligence (AI) agents need to do these sorts of things? This question frames the high-level purpose of this paper: to explore a range of tasks more complex than those typically studied in AI, and to develop approaches for learning to solve them.
Physical construction involves composing multiple elements under physical dynamics and constraints to achieve rich functional objectives. We introduce a suite of simulated physical construction tasks (Fig. 1), similar in spirit to how children play with toy blocks, which involve stacking and attaching together multiple blocks in configurations that satisfy functional objectives. For example, one task requires stacking blocks around obstacles to connect target locations to the ground. Another task requires building shelters which cover up target blocks and keep them dry in the rain. These tasks are representative of real-world construction challenges: they emphasize problem-solving and functionality rather than simply replicating a given target configuration, reflecting the way human construction involves forethought and purpose.
Real-world physical construction assumes many forms and degrees of complexity, but a few basic skills are typically involved: spatial reasoning (e.g. concepts like “empty” vs “occupied”), relational reasoning (e.g. concepts like “next to” or “on top of”), knowledge of physics (e.g., predicting physical interactions among objects), and allocation of materials and resources to different parts of the structure. Our simulated task environment (Fig. 1) is designed to exercise these skills, while still being simple enough to allow careful experimental control and tractable agent training.
While classic AI studied physical reasoning extensively (Chen, 1990; Pfalzgraf, 1997), construction has not been well-explored using modern learning-based approaches. We draw on a number of techniques from modern AI, combining and extending them in novel ways to make them more applicable and effective for construction. Our family of deep reinforcement learning (RL) agents can support: (1) vector, sequence, image, and graph-structured representations of scenes; (2) continuous and discrete actions, in absolute or object-centric coordinates; (3) model-free learning via deep Q-learning (Mnih et al., 2015) or actor-critic methods (Heess et al., 2015; Munos et al., 2016); and (4) planning via Monte-Carlo Tree Search (MCTS) (Coulom, 2006). We find that graph-structured representations and reasoning, object-centric policies, and model-based planning are crucial for solving our most difficult tasks. Our results demonstrate the value of integrating rich structure and powerful learning approaches as a key path toward complex construction behavior.
Physical reasoning has been of longstanding interest in AI. Early work explored physical concepts with an emphasis on descriptions that generalize across diverse settings (Winston, 1970). Geometric logical reasoning was a major topic in symbolic logic research (Chou, 1987; Arnon, 1988), leading to geometric theorem-provers (Bouma et al., 1995), rule-based geometric constraint solvers for computer-aided design (Aldefeld, 1988; Schreck et al., 2012), and logic-based optimization for open-ended objectives in robotics (Toussaint, 2015). Classic work focused largely on rules and structured representations rather than learning, in part because the sample complexity of learning was prohibitive for the computers of the time.
Modern advances in learning-based approaches have opened new avenues for using vector and convolutional representations for physical reasoning (Wu et al., 2015, 2016, 2017; Mottaghi et al., 2016; Fragkiadaki et al., 2016; Finn et al., 2016; Agrawal et al., 2016; Lerer et al., 2016; Li et al., 2016; Groth et al., 2018; Bhattacharyya et al., 2018; Ebert et al., 2018). A common limitation, however, is that due to their relatively unstructured representations of space and objects, these approaches tend not to scale up to complex scenes, or generalize to scenes with different numbers of objects, etc.
Several recent studies have explored learning construction, including learning to stack blocks by placing them at predicted stable points (Li et al., 2017), learning to attach blocks together to stabilize an unstable stack (Hamrick et al., 2018), learning basic block-stacking by predicting shortest paths between current and goal states via a transition model (Zhang et al., 2018), and learning object representations and coarse-grained physics models for stacking blocks (Janner et al., 2019). Though promising, in these works the physical structures the agents construct are either very simple, or provided explicitly as an input rather than being designed by the agent itself. A key open challenge, which this paper begins to address, is how to learn to design and build complex structures to satisfy rich functional objectives.
A main direction we explore is object-centric representations of the scene and of the agent’s actions (Diuk et al., 2008; Scholz et al., 2014), implemented with graph neural networks (Scarselli et al., 2009; Bronstein et al., 2017; Gilmer et al., 2017; Battaglia et al., 2018). Within the domain of physical reasoning, graph neural networks have been used as forward models for predicting future states and images (Battaglia et al., 2016; Chang et al., 2017; Watters et al., 2017; van Steenkiste et al., 2018), and can allow efficient learning and rich generalization. These models have also begun to be incorporated into model-free and model-based RL, in domains such as combinatorial optimization, motor control, and game playing (Dai et al., 2017; Kool & Welling, 2018; Wang et al., 2018; Sanchez-Gonzalez et al., 2018; Zambaldi et al., 2019).

Our simulated task environment is a continuous, procedurally-generated 2D world implemented in Unity (Juliani et al., 2018) with the Box2D physics engine (Catto, 2013). Each episode contains unmoveable obstacles, target objects, and a floor, plus movable rectangular blocks which can be picked up and placed.
On each step of an episode, the agent chooses an available block (from below the floor), and places it in the scene (above the floor) by specifying its position. In all but one task (Covering Hard—see below), there is an unlimited supply of blocks of each size, so the same block can be picked up and placed multiple times. The agent may also attach objects together by assigning the property of “stickiness” to the block it is placing. Sticky objects form unbreakable, nearly rigid bonds with objects they contact. In all but one task (Connecting) the agent pays a cost to make a block sticky. After the agent places a block, the environment runs physics forward until all blocks come to rest.
An episode terminates when: (1) a movable block makes contact with an obstacle, either because it is placed in an overlapping location, or because they collide under physical dynamics; (2) a maximum number of actions is exceeded; or (3) the task-specific termination criterion is achieved (described below). The episode always yields zero reward when a movable block makes contact with an obstacle.
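To make the episode logic concrete, the sketch below summarizes these rules in Python. The environment interface (`env.reset`, `env.step`, the `done_info` keys) and the agent's `act` method are illustrative placeholders, not the actual API of our environment.

```python
# Minimal sketch of the episode loop and termination rules described above.
# All function names and dictionary keys are illustrative placeholders.

def run_episode(env, agent, max_steps):
    """Run one construction episode and return its accumulated reward."""
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):                     # rule (2): bounded number of actions
        action = agent.act(obs)                    # choose block, placement, stickiness
        obs, reward, done_info = env.step(action)  # physics settles before returning
        if done_info["block_hit_obstacle"]:        # rule (1): contact with an obstacle
            return 0.0                             # such episodes yield zero reward
        total_reward += reward
        if done_info["task_solved"]:               # rule (3): task-specific criterion
            break
    return total_reward
```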
Silhouette task (Fig. 1a). The agent must place blocks to overlap with target blocks in the scene, while avoiding randomly positioned obstacles. The agent receives a positive reward for each placed block which sufficiently overlaps a target block of the same size, and pays a penalty for each block it sets as sticky. The task-specific termination criterion is achieved when there is sufficient overlap with all targets.
This is similar to the task in Janner et al. (2019), and challenges agents to reason about physical support of complex arrangements of objects and to select, position, and attach sequences of objects accordingly. However, by fully specifying the target configuration, Silhouette does not require the agent to design a structure to satisfy a functional objective, which is an important component of our other tasks.
Connecting task (Fig. 1b). The agent must stack blocks to connect the floor to three different target locations, avoiding randomly positioned obstacles arranged in layers. The agent receives a positive reward for each target whose center is touched by at least one block, and pays no penalty for blocks set as sticky. The task-specific termination criterion is achieved when all targets are connected to the floor.
By not fully specifying the target configuration, the Connecting task requires the agent to design a structure with a basic function—connecting targets to the floor—rather than simply implementing it as in the Silhouette task. A wider variety of structures could achieve success in Connecting than in Silhouette, and the solution space is much larger because the task is tailored so that solutions usually require many more blocks.
Covering task (Fig. 1c). The agent must build a shelter that covers all obstacles from above, without touching them. The reward is proportional to the summed length of the top surfaces of the obstacles which are sheltered by blocks placed by the agent, minus a penalty for each block set as sticky. The task-specific termination criterion is achieved when a sufficiently large fraction of the summed obstacle surfaces is covered. The layers of obstacles are well separated vertically so that the agent can build structures between them.
The Covering task requires richer reasoning about function than the previous tasks: the purpose of the final construction is to provide shelter to a separate object in the scene. The task is also demanding because the obstacles may be elevated far from the floor, and the cost of stickiness essentially prohibits its use.
Covering Hard task (Fig. 1d). As in Covering, the agent must build a shelter, but the task is modified to encourage longer-term planning: there is a finite supply of movable blocks, the distribution of obstacles is denser, and the cost of stickiness is lower than in Covering. It thus incorporates key challenges of the Silhouette task (reasoning about which blocks to make sticky), the Connecting task (reasoning about precise block layouts), and the Covering task (reasoning about arch-like structures). The limited number of blocks necessitates foresight in planning (e.g., reserving long blocks to cover long obstacles). The reward function and termination criterion are the same as in Covering.
With our suite of construction tasks, we can now tackle the question we posed at the top of the Introduction: what would an agent need to perform complex construction behaviors? We expect agents which have explicit structured representations to perform better, due to their capacity for relational reasoning, compositionality, and combinatorial generalization. We implement seven construction agents which vary in the degree of structure in their observation types, internal representations, learning algorithms, and action specifications, as summarized in Table 1 and Fig. 2.
Agent | Observation | Encoder | Policy | Planning | Learning alg. | Action space |
---|---|---|---|---|---|---|
RNN-RS0 | Object | RNN | MLP/vector | - | RS0 | Continuous |
CNN-RS0 | Image | CNN | MLP/vector | - | RS0 | Continuous |
GN-RS0 | Object | - | GN/graph | - | RS0 | Continuous |
GN-DQN | Object | - | GN/graph | - | DQN | Discrete |
GN-DQN-MCTS | Object | - | GN/graph | MCTS | DQN | Discrete |
CNN-GN-DQN | Seg. image | Per-object CNN | GN/graph | - | DQN | Discrete |
CNN-GN-DQN-MCTS | Seg. image | Per-object CNN | GN/graph | MCTS | DQN | Discrete |
Each construction task (Sec. 3) provides object state and/or image observations. Both types are important for construction agents to be able to handle: we ultimately want agents that can use symbolic inputs, e.g., the representations in computer-aided design programs, as well as raw sensory inputs, e.g., photographs of a construction site.
Object state: These observations contain a set of feature vectors that communicate the objects’ positions, orientations, sizes, and types (e.g., obstacle, movable, sticky). Contact information between objects is also provided, as well as the order in which objects were placed in the scene (see Supplemental Sec. C).
Image: Observed images are RGB renderings of the scene, with the x and y coordinates appended as two extra channels.
Segmented images: The RGB scene image is combined with a segmentation mask for each object, thus comprising a set of segmented images (similar to Janner et al., 2019).
We use two types of internal representations for computing policies from inputs: fixed-length vectors and directed graphs with attributes.
CNN encoder: The convolutional neural network (CNN) embeds an input image as a vector representation.

RNN encoder: Object state input vectors are processed sequentially with a recurrent neural network (RNN)—a gated recurrent unit (GRU) (Cho et al., 2014)—in the order they were placed in the scene, and the final hidden state vector is used as the embedding.

Graph encoder: To convert a set of state input vectors into a graph, we create a node for each input object, and add edges either between all nodes or between a subset of them that depends on their type and on whether they are in contact (see Supplemental Sec. C.2).

Per-object CNN encoder: To generate a graph-based representation from images, we first split the input image into segments and generate new images containing only single objects. Each of these is passed to a CNN, and the output vectors are used as nodes in a graph, with edges added as above.
MLP policy: Given a vector representation, we obtain a policy using a multi-layer perceptron (MLP), which outputs actions or Q-values depending on the learning algorithm.
GN policy: Given a graph-based representation from a graph encoder or a per-object CNN, we apply a stack of three graph networks (GN) (Battaglia et al., 2018) arranged in series, where the second net performs some number of recurrent steps, consistent with the “encode-process-decode” architecture described in Battaglia et al. (2018). Unless otherwise noted, we used three recurrent steps.
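To make the encode-process-decode wiring concrete, here is a minimal, self-contained numpy sketch of a recurrent GN stack that decodes one value per edge. It is an illustration only: plain linear layers stand in for the MLPs and GRUs used by our agents, globals are omitted for brevity, and all sizes are arbitrary.

```python
# Minimal numpy sketch of an "encode-process-decode" recurrent GN policy that
# outputs one value per edge (e.g., a Q-value). Plain linear layers stand in
# for the MLPs/GRUs used by the actual agents; globals are omitted for brevity.
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    w = rng.normal(scale=0.1, size=(in_dim, out_dim))
    return lambda x: x @ w

def gn_step(nodes, edges, senders, receivers, node_fn, edge_fn):
    """One message-passing step: update edges from their endpoint nodes, then
    update nodes from the sum of their incoming edge messages."""
    edge_in = np.concatenate([edges, nodes[senders], nodes[receivers]], axis=-1)
    new_edges = np.tanh(edge_fn(edge_in))
    agg = np.zeros((nodes.shape[0], new_edges.shape[-1]))
    np.add.at(agg, receivers, new_edges)
    new_nodes = np.tanh(node_fn(np.concatenate([nodes, agg], axis=-1)))
    return new_nodes, new_edges

# Toy graph: 4 objects, fully connected without self-edges.
num_nodes, dim = 4, 8
senders, receivers = map(np.array, zip(
    *[(i, j) for i in range(num_nodes) for j in range(num_nodes) if i != j]))
nodes = rng.normal(size=(num_nodes, dim))
edges = rng.normal(size=(len(senders), dim))

enc_node, enc_edge = linear(dim, dim), linear(dim, dim)            # encode
core_node, core_edge = linear(2 * dim, dim), linear(3 * dim, dim)  # process
dec_edge = linear(dim, 1)                                          # decode

nodes, edges = np.tanh(enc_node(nodes)), np.tanh(enc_edge(edges))
for _ in range(3):  # three recurrent steps, as in the main text
    nodes, edges = gn_step(nodes, edges, senders, receivers, core_node, core_edge)
edge_values = dec_edge(edges)
print(edge_values.shape)  # (12, 1): one output per edge
```

The same structure supports the GN-RS0 agent by decoding a global output vector instead of per-edge values.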
In typical RL and control settings that involve placing objects, the agent takes absolute actions in the frame of reference of the observation (e.g. Silver et al., 2016, 2018; Zhang et al., 2018; Ganin et al., 2018; Janner et al., 2019). We implement this approach in our “absolute action” agents, where, for example, the agent might choose to “place block D at position (x, y)”. However, learning absolute actions scales poorly as the size of the environment grows, because the agent must effectively re-learn its construction policy at every location.
To support learning compositional behaviors which are more invariant to the location in the scene (e.g. stacking one block on top of another), we develop an object-centric alternative to absolute actions which we term relative actions. With relative actions, the agent takes actions in a reference frame relative to one of the objects in the scene. This is a natural way of expressing actions, and is similar to how humans are thought to choose actions in some behavioral domains (Ballard et al., 1997; Botvinick & Plaut, 2004).
The different types of actions are shown at the bottom of Fig. 2, with details in Supplemental Sec. B.
Continuous absolute actions are 4-tuples comprising a horizontal cursor which chooses a block from the available blocks at the bottom of the scene (“snapping” to the closest one), a pair of coordinates which determines its placement in the scene, and a final value whose sign indicates stickiness (see Sec. 3).
Continuous relative actions are 5-tuples: the block-selection cursor and stickiness value are as before, an additional cursor chooses a reference block (again by snapping to the closest one), and a final offset determines where to place the object horizontally relative to the reference object, with the vertical position adjusted automatically.
Discrete absolute actions are 4-tuples comprising an index over the available objects, a pair of discrete indices giving the placement location in a grid-like 2D discretization of space, and a binary flag indicating stickiness.
Absolute actions and continuous relative actions are easily implemented by any agent that outputs a single fixed-length continuous vector, such as that output by an MLP or the global output feature of a GN.
Discrete relative actions are triplets comprising an edge in the input graph between the to-be-placed block and a selected reference block, an index over finely discretized horizontal offsets at which to place the chosen block relative to the reference block’s top surface, and a stickiness flag as before.
Discrete relative actions are straightforward to implement with a graph-structured internal representation: if the nodes represent objects, then the edges can represent pairwise functions over the objects, such as “place block D on top of block B” (see Fig. 3).
The internal vector and graph representations are used to produce actions either by an explicit policy or a Q-function.
RS0 learning algorithm: For continuous action outputs, we use an actor-critic learning algorithm that combines retrace with stochastic value gradients (denoted RS0) (Munos et al., 2016; Heess et al., 2015; Riedmiller et al., 2018).
DQN learning algorithm: For discrete action outputs, we use Q-learning implemented as a deep Q network (DQN) from Mnih et al. (2015), with Q-values on the edges, similar to Hamrick et al. (2018). See Sec. 4.4 and Fig. 3.
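The sketch below illustrates greedy (and ε-greedy) action selection when Q-values live on the edges of the graph, with the action factored as (edge, offset, stickiness). The array layout and the masking of invalid edges are assumptions for illustration; in our agents, invalid actions are instead handled by the environment (see Supplemental Sec. B and D).

```python
# Sketch of (epsilon-)greedy action selection over per-edge Q-values, with the
# action factored as (edge, offset, sticky). The array layout and the masking
# of invalid edges are illustrative assumptions.
import numpy as np

def select_action(edge_q, valid_edge_mask, epsilon, rng):
    """edge_q: [num_edges, num_offsets, 2] Q-values (last axis: sticky or not).
    valid_edge_mask: [num_edges] bools, True where the edge goes from an
    available block to an object already in the scene."""
    num_edges, num_offsets, num_sticky = edge_q.shape
    if rng.random() < epsilon:                       # exploratory random action
        edge = rng.choice(np.flatnonzero(valid_edge_mask))
        return edge, rng.integers(num_offsets), rng.integers(num_sticky)
    masked = np.where(valid_edge_mask[:, None, None], edge_q, -np.inf)
    return np.unravel_index(np.argmax(masked), masked.shape)  # (edge, offset, sticky)

rng = np.random.default_rng(0)
q_values = rng.normal(size=(12, 15, 2))              # e.g. 12 edges, 15 offsets
valid = np.array([True] * 6 + [False] * 6)
print(select_action(q_values, valid, epsilon=0.1, rng=rng))
```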
MCTS: Because the DQN agent outputs discrete actions, it is straightforward to combine it with standard planning techniques like Monte-Carlo Tree Search (Coulom, 2006; Silver et al., 2016) (see Fig. 3). We use the base DQN agent as a prior for MCTS, and use MCTS with various budgets (either only at test time, only during training, or both), thereby modifying the distribution of experience fed to the learner. As a baseline, we also perform MCTS without the model-free policy prior. In all results reported in the main text, we use the environment simulator as our model; we also explored using learned models with mixed success (see Supplemental Sec. E.3).
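Below is a highly simplified sketch of how a learned Q-function can act as a prior inside MCTS when the environment simulator is used as the model. The tree statistics, the way the prior enters the node value, and the `simulator`/`q_fn`/`actions_fn` interfaces are illustrative assumptions; the exact formulation we use is described in Supplemental Sec. E.

```python
# Highly simplified sketch of MCTS guided by a learned Q-function prior, with
# the environment simulator used as the model. The node-value and exploration
# formulas here are illustrative, not the exact ones used in the paper.
import math
from collections import defaultdict

def mcts(root_state, simulator, q_fn, actions_fn, budget, c_uct=1.0, max_depth=10):
    """States and actions must be hashable. simulator(s, a) -> (s', r, done);
    q_fn(s, a) -> scalar prior; actions_fn(s) -> list of candidate actions."""
    N = defaultdict(int)      # visit counts per (state, action)
    W = defaultdict(float)    # summed Monte-Carlo returns per (state, action)

    def value(s, a):
        return (q_fn(s, a) + W[(s, a)]) / (1 + N[(s, a)])  # Q prior seeds the estimate

    def uct(s, a, n_parent):
        bonus = c_uct * math.sqrt(math.log(n_parent + 1) / (N[(s, a)] + 1))
        return value(s, a) + bonus

    for _ in range(budget):
        s, trajectory = root_state, []
        for _ in range(max_depth):                  # selection / expansion / rollout
            actions = actions_fn(s)
            if not actions:
                break
            n_parent = sum(N[(s, a)] for a in actions)
            a = max(actions, key=lambda a: uct(s, a, n_parent))
            s_next, reward, done = simulator(s, a)  # environment used as the model
            trajectory.append((s, a, reward))
            s = s_next
            if done:
                break
        ret = 0.0
        for s_i, a_i, r_i in reversed(trajectory):  # back up undiscounted returns-to-go
            ret += r_i
            N[(s_i, a_i)] += 1
            W[(s_i, a_i)] += ret

    return max(actions_fn(root_state), key=lambda a: N[(root_state, a)])
```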
We ran experiments to evaluate the effectiveness of different agent architectures (see Table 1) on our construction tasks. We focused on quantifying the effect of structured actions (Sec. 5.1), the effect of planning both during training and at decision time (Sec. 5.2), and zero-shot generalization performance on larger and more complex scenes (Sec. 5.3). In all experiments, we report results for 10 randomly initialized agents (termed “seeds”) which were trained until convergence. Each seed is evaluated on 10,000 scenes, and in all figures we report median performance across seeds, with error bars indicating worst and best seed performance.
For efficient training, we found it was important to apply a curriculum which progressively increases the complexity of the task across training episodes. In Silhouette, the curriculum increases the number of targets. In Connecting, it increases the elevation of the targets. In the Covering tasks, it increases the elevation of the obstacles. Details are available in Supplemental Sec. A.2. In our analysis, we evaluated each seed on scenes generated either uniformly at random across all difficulty levels, or only at the hardest difficulty level for each task.
We find that agents which use relative actions consistently outperform those which use absolute actions. Across tasks, almost every relative action agent converges at a similar or higher median performance level (see Fig. 4a), and the best relative agents achieve up to 1.7 times more reward than the best absolute agents when averaging across all curriculum levels. When considering only the most advanced level, the differences are larger with factors of up to 2.4 (Fig. 4b).
Fig. 4c shows examples of the best absolute agents’ constructions (at episode termination) in the most advanced level. These outcomes are qualitatively worse than the best relative agents’ (Fig. 4d). The absolute agents do not anticipate the long term consequences of their actions as well, sometimes failing to make blocks sticky when necessary, or failing to place required objects at the base of a structure, as in Fig. 4c’s Silhouette example. They also fall into poor local minima, building stacks of blocks on the sides of the scene which fail to reach or cover objects in the center, as in Fig. 4c’s Connecting and Covering examples.
By contrast, the best relative agents (which, across all tasks, were GN-DQN) construct more economical solutions (e.g., Fig. 4d, Connecting) and discover richer strategies, such as building arches (Fig. 4d, Covering). The GN-DQN agent’s superior performance suggests that structured representations and relative, object-centric actions are powerful tools for construction. Our qualitative results suggest that these tools provide invariance to dimensions such as spatial location, which can be seen in cases where the GN-DQN agent re-uses local block arrangements at different heights and locations, such as the T structures in Fig. 4g.
Most agents achieve similar levels of performance on Covering Hard: GN-RS0 has the best median performance, while GN-DQN has the best overall seed. However, inspecting the qualitative results (Fig. 4) shows that even the best relative agent does not perform very strongly. Though Covering Hard involves placing fewer blocks than the other tasks because of the limited supply, reasoning about the sequence of blocks to use, which to make sticky, etc., is indeed a challenge, which we address in the next section with our planning agent.
Interestingly, the GN-RS0 and GN-DQN agents have markedly different performance despite both using the same structured GN policy. There are a number of subtle differences, but notably, the object-centric information contained in the graph of the GN-RS0 agent must be pooled and passed through the global attribute to produce actions, while the GN-DQN agent directly outputs actions via the graph’s edges. This may allow its policy to match the actual structure of the problem more closely than the GN-RS0 agent’s.
The CNN-RS0 agent’s performance is generally poorer than the GN-based agents’, but the observation formats are also different: the CNN agent must learn to encode images, and it does not receive distinct, parsed objects. To better control for this, we train a GN-based agent from pixels, labelled CNN-GN-DQN, described in Sec. 4. The CNN-GN-DQN agent achieves better performance than the CNN-RS0 agent (see SM Fig. C.2). This suggests that parsing images into objects is valuable, and should be investigated further in future work.
Generally, complex construction should require longer-term planning, rather than simply reactive decision-making. Given a limited set of blocks, for example, it may be crucial to reserve certain blocks for roles they uniquely satisfy in the future. We thus augment our GN-DQN agent with a planning mechanism based on MCTS (see Sec. 4.5) and evaluate its performance in several conditions, varying the search budget at training and testing time independently (a search budget of 0 corresponds to no planning).
Our results (Fig. 5) show that planning is generally helpful, especially in Connecting and Covering Hard. In Connecting, planning with a train budget of 10 and a test budget of 100 improves the agent’s median reward from 2.17 to 2.72 on the hardest scenes, or from 72.5% to 90.6% of the optimal reward of 3. In Covering Hard, planning with a train and test budget of 10 improves the agent’s median reward from 3.60 to 4.61. Qualitatively, the planning agent appears to be close to ceiling (Fig. 5h). Note that a pure-planning agent (Fig. 5a-d, gray dashed line) with a budget of 1000 still performs poorly compared to learned policies, underscoring the difficulty of the combinatorially large search space in construction. In Supplemental Sec. E, we discuss the trade-offs of planning during training, testing, or both.
One of the most striking features of human construction is how we innovate new things. We next ask: how do our agents generalize to conditions beyond those on which they were trained? In Silhouette, our agents only experience 1-8 targets during training, so we test them on 9 and 16 targets. In Connecting, agents always experience targets at the same elevation within a scene during training, so we test them on targets appearing at multiple different levels within the same scene (in one condition) and, in another, on targets all at a higher elevation than experienced during training.
We find that the GN-DQN and especially GN-DQN-MCTS agents with relative actions generalize substantially better than others. In Silhouette, the GN-DQN-* agents cover nearly twice as many targets as seen during training, while the other agents’ performance plateaus or falls off dramatically (Fig. 6a). In Connecting with targets at multiple different levels, the GN-DQN and GN-DQN-MCTS agents’ performance drops only slightly, while the other agents’ performance drops to near 0 (Fig. 6b). With increased numbers of obstacle layers in Connecting, both agents’ performance drops moderately but remains much better than that of the less structured agents (Fig. 6c). Fig. 6d-f show the qualitative generalization behavior of the GN-DQN-MCTS agent. Overall, these generalization results provide evidence that structured agents are more robust to scenarios which are more complex than those in their training distribution. This is likely a consequence of their ability to recognize structural similarity and re-use learned strategies.
Recurrent GNs support iterative relational reasoning by propagating information across the scene graph. We vary the number of recurrent steps in our GN-DQN agent to understand how this measure of its relational reasoning capacity affects its task performance.
We find that increasing the number of propagation steps from 1 to 3 to 5 generally improves performance, to a point, across all tasks: in Silhouette, the median rewards were 3.75, 4.04 and 4.06; in Connecting, 2.49, 2.84, and 2.81; in Covering, 3.41, 3.96, and 4.01; and in Covering Hard, 2.62, 3.03, and 3.02, respectively.
We introduced a suite of representative physical construction challenges, and a family of RL agents to solve them. Our results suggest that graph-structured representations, model-based planning guided by model-free policies, and object-relative actions are valuable ingredients for achieving strong performance and effective generalization. We believe this work is the first to demonstrate agents that can learn rich construction behaviors in complex settings with large numbers of objects (up to 40-50 in some cases), and can satisfy challenging functional objectives that go beyond simply matching a pre-specified goal configuration.
Given the power of object-centric policies, future work should seek to integrate methods for detecting and segmenting objects from computer vision with learned relational reasoning. Regarding planning, this work only scratches the surface, and future efforts should explore learned models and more sophisticated search strategies, perhaps using policy improvement (Silver et al., 2018) and gradient-based optimization via differentiable world models (Sanchez-Gonzalez et al., 2018). Finally, procedurally generating problem instances that require complex construction solutions is challenging, and adversarial or other learned approaches may be promising future directions.

Our work is only a first step toward agents which can construct complex, functional structures. However, we expect that approaches which combine rich structure and powerful learning will be key to making fast, durable progress.
We would like to thank Yujia Li, Hanjun Dai, Matt Botvinick, Andrea Tacchetti, Tobias Pfaff, Cédric Hauteville, Thomas Kipf, Andrew Bolt, and Piotr Trochim for helpful input and feedback on this work.
Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., and Vandergheynst, P. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.

van Steenkiste, S., Chang, M., Greff, K., and Schmidhuber, J. Relational neural expectation maximization: unsupervised discovery of objects and their interactions. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018), 2018.

For each task, the agent could use either an image-based observation, an object-based observation, or a combination of both in the form of segmented-image observations.
Object state observations are a list of vectors (one for each block), where each vector of size 15 contains information about the corresponding block’s position, orientation, size (width, height), linear and angular velocities, whether it is sticky or not, and one-hot information about its type (available block, placed block, target, or obstacle). The list is ordered by the order in which objects appeared in the scene, but this information is discarded for the graph-based agents. Information about which objects are in contact is also provided and is used when constructing the input for the graph-based networks (see Sec. C).
Image observations start as RGB renders of the scene and are re-scaled down to 64 × 64 by averaging patches, with the color channels normalized. The x and y coordinates are also supplied for each point in the image, normalized to a fixed interval. The re-scaling procedure helps preserve spatial information at a sub-pixel level, in the form of color fading at the boundaries between objects and the background.
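A minimal numpy sketch of this preprocessing is shown below; the render resolution and the exact normalization ranges are assumptions for illustration.

```python
# Sketch of the image preprocessing described above: downscale an RGB render
# by averaging patches and append coordinate channels. The render resolution
# and the normalization ranges are assumptions for illustration.
import numpy as np

def preprocess(render, out_size=64):
    """render: [H, W, 3] uint8 RGB image, with H and W divisible by out_size."""
    h, w, _ = render.shape
    img = render.astype(np.float32) / 255.0                    # normalize colors
    # Average non-overlapping patches; sub-pixel detail survives as color fading.
    img = img.reshape(out_size, h // out_size, out_size, w // out_size, 3)
    img = img.mean(axis=(1, 3))
    # Append x and y coordinate channels (here normalized to [-1, 1]).
    ys, xs = np.meshgrid(np.linspace(-1, 1, out_size),
                         np.linspace(-1, 1, out_size), indexing="ij")
    return np.concatenate([img, xs[..., None], ys[..., None]], axis=-1)

print(preprocess(np.zeros((256, 256, 3), dtype=np.uint8)).shape)  # (64, 64, 5)
```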
Segmented image observations are a list of images, one for each block. They are obtained using a segmentation of the render that maps each pixel to zero or more blocks that may be present at that pixel. Using this segmentation, we build a binary mask for each block, re-scale it down to 64 × 64 by averaging patches, and multiply it with the unsegmented RGB render to obtain per-block renders. We also add the mask as an additional alpha channel to the masked RGB image, as well as coordinate channels.
The full rendered scene spans a region of size 16 × 16 (meters, unless otherwise indicated).
At the beginning of the episode, the agent has access to 7 available blocks: three small, three medium and one large block (corresponding to respective widths of 0.7, 2.1 and 3.5, all with height 0.7).
The physics simulation is run for 20 seconds after the agent places each block to make sure that the scene is at an equilibrium position before the score is evaluated, and before the agent can place the next block.
Silhouette: Each scene is comprised of 1 to 8 targets and 0 to 6 obstacles, arranged in up to 6 layers, with a curriculum over the maximum number of targets, the maximum number of obstacles, and the number of layers (see Fig. H.1). Levels are generated by (1) tessellating the scene into layers of blocks of the same sizes as the available blocks, with a small separation of 0.35; (2) sequentially finding the set of target candidates (blocks in the tessellation that sit directly on top of the floor or of an existing target block) and sampling targets from this set, up to the required number of targets; and (3) sampling obstacles using a similar procedure. Both obstacles and targets that are further from the floor are sampled with higher probability, to favor the generation of harder-to-construct towers and inverted pyramids. The average number of targets is 4.5 on the training distribution, and the number of targets goes up to 8 for the hardest levels. These numbers set an upper bound on the total reward that can be obtained. However, the average reward for an optimal agent is lower than this bound due to the cost of glue (silhouettes generated using this procedure are not guaranteed to be stable, so the best possible solution may require glue).
Connecting: There are at most three vertical layers of obstacles above the floor and a layer of three targets above the highest obstacles. Each layer consists of up to three obstacles, whose lengths are uniformly and independently sampled. The layers of obstacles are separated by enough vertical distance that one block can be placed between any two layers of obstacles. The curriculum is comprised of scenes with fewer obstacle layers, while the number of targets is unchanged (see Fig. H.4 for examples). Since glue is unpenalized, the maximum reward available to the agent is exactly 3.
Covering: There are at most three vertical layers of obstacles above the floor at any location, and up to 2 obstacles in each layer, with lengths uniformly and independently sampled. As in Connecting, these layers are well separated so that one block can be placed between any two layers of obstacles. The curriculum is comprised of scenes with obstacles only in the two lower layers (see Fig. H.2). The total length available to cover is 5.25 on the training distribution and 7.88 for the hardest levels. This provides a tight upper bound on the maximal reward, which the agent can be expected to approach.
Covering Hard: There are at most two vertical layers of obstacles above the floor at any location, and up to 2 obstacles in each layer, with lengths uniformly and independently sampled. The curriculum is comprised of scenes with only one layer of obstacles (see Fig. H.3 for examples). The layers of obstacles are closer to each other than they were in Connecting or Covering. The maximum length that can be covered is 4.2 on the training distribution and 6.3 on the hardest levels, but this only gives a weak upper bound on the possible reward because of the cost of glue and the limited supply of blocks.
Curriculum complexity: Curricula were designed to increase in complexity while preserving instances of scenes from previous points in training, to avoid catastrophic forgetting. This allows us to make a distinction, for any task and curriculum level, between Hardest Scenes (scene types that are introduced for the first time at the present level) and All Scenes (the training distribution, including the hardest scenes at the current level and scenes from lower levels). Additional details about the conditions for advancing through the curricula are given in Sec. D for the DQN agents and Sec. F for the RS0 agents.
Continuous absolute actions are 4-tuples consisting of a horizontal block-selection cursor, a placement position, and a stickiness value. The cursor is compared to the horizontal coordinates of each of the available blocks and the closest block is chosen. A new block identical to it is then spawned with its center at the specified placement position. The resulting object is sticky if and only if the continuous stickiness value is positive.
Discrete absolute actions are 4-tuples consisting of an index within the set of available blocks (deciding which block will be placed in the scene), a pair of discrete indices giving the block’s placement in a height-by-width discretization of the scene, and a discrete variable indicating whether the placed object should be made sticky or not. We experimented with several discretization resolutions, with the best-performing resolution differing between Silhouette and the other tasks.
Continuous relative actions are 5-tuples. The block-selection cursor and stickiness value have identical meanings to the absolute case, and select the block to be placed. An additional cursor selects the reference object whose center is closest to it. The horizontal coordinate of the placed block is then an offset from the reference object’s center that depends on the two objects’ widths and on a small lateral margin so that the objects are not touching laterally. If the reference is a target object, the vertical coordinate of the placed block is chosen so that it vertically overlaps the target (up to a small offset so that the objects are not perfectly flush). If the reference is a solid object, the placed block is positioned just above it, with its vertical center offset from the reference’s center by half the sum of the two objects’ heights plus a small gap. If the agent chooses an invalid edge (where the selected block is not an available block, or the reference is not a block in the scene), the episode is terminated with a reward of zero. We use the same small offset value throughout.
Discrete relative actions are triplets consisting of an edge in the structured observation graph between the chosen new block and the selected reference block, an index over a fine discretization of horizontal offsets at which to place the chosen block relative to the reference block, and a stickiness flag as before. If the chosen block is not an available block, or the reference block is not a block already in the scene, then the episode is terminated with a reward of 0.
For the offsets, we use a uniform grid of bins over a horizontal range determined by the widths of the placed and reference blocks, and pick the indexed value in this grid as the relative position. The vertical coordinate of the placed block is computed as before. We also experimented with additionally predicting the relative vertical offset, and with varying the number of discrete offsets (see Sec. D for details).
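As an illustration of the discrete relative placement, the sketch below maps an offset index to a block position. The lateral range (spanning the combined half-widths of the two blocks) and the small vertical gap `eps` are assumptions; the widths and heights in the usage example are the block sizes from Sec. A.1.

```python
# Illustrative sketch mapping a discrete offset index to a block placement
# relative to a reference block. The lateral range and the gap `eps` are
# assumptions; block sizes in the example are taken from Sec. A.1.
import numpy as np

def relative_placement(ref_x, ref_y, ref_w, ref_h, blk_w, blk_h,
                       offset_index, num_offsets=15, eps=0.05):
    """Return the (x, y) center of the placed block."""
    half_span = (ref_w + blk_w) / 2.0                 # assumed lateral range
    offsets = np.linspace(-half_span, half_span, num_offsets)
    x = ref_x + offsets[offset_index]
    y = ref_y + (ref_h + blk_h) / 2.0 + eps           # just above the reference block
    return x, y

# Place a small block (0.7 wide) on a medium block (2.1 wide), centered on top.
print(relative_placement(0.0, 1.05, 2.1, 0.7, 0.7, 0.7, offset_index=7))
```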
MLP: The pre-processor of the MLP model consists of concatenating the list of blocks as given by the environment (blocks, available blocks, obstacles, targets), padded with zero blocks up to the total maximum number of objects in each task (with a one-hot indicator of padding), and normalizing it with a LayerNorm layer into a fixed set of 100 features. This fixed-size vector is then processed by the core MLP, consisting of four hidden layers of ReLU units and an output layer matching the required output size. We found this MLP model to have equal or worse performance than the RNN agent, and thus did not report results on it in the main text; however, Fig. C.1 includes results for the MLP agent across the tasks.

RNN: The RNN model pre-processor uses a GRU (hidden size of 256) to sequentially process the objects in the scene (including padding objects up to a maximum number, as described for the MLP). The output of the GRU after processing the last object is then used as input for the core MLP (identical in size to the one described for the MLP model). In some generalization settings, where the total number of objects increased drastically, we found better generalization performance by clipping/ignoring some of the objects in the input than by allowing the network to process a longer sequence of objects than used at training time.
CNN: The CNN model pre-processor passes the 64 × 64 input image through a 4-layer convolutional network (output channels = [16, 32, 32, 32]), followed by a ReLU activation, a linear layer on the flattened outputs into an embedding of size 256, and another ReLU activation. Each layer is comprised of a 2D convolution (size=3, stride=1, padding=“same”) and a max-pooling layer (size=3, stride=2, padding=“same”). The vector embedding of the image is then processed by an MLP core (identical in size to the one described for the MLP model, except that it uses 3 layers instead of 4).
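For concreteness, here is a sketch of this CNN pre-processor written in PyTorch (the paper does not tie itself to a particular framework). The 5 input channels (RGB plus two coordinate channels) and the integer paddings approximating “same” padding are assumptions.

```python
# Sketch of the CNN pre-processor in PyTorch for concreteness (the framework,
# the 5 input channels, and the integer "same"-like paddings are assumptions).
import torch
from torch import nn

class CNNPreprocessor(nn.Module):
    def __init__(self, in_channels=5, embedding_size=256):
        super().__init__()
        layers, prev = [], in_channels
        for out in (16, 32, 32, 32):                       # 4 conv + pool layers
            layers += [nn.Conv2d(prev, out, kernel_size=3, stride=1, padding=1),
                       nn.MaxPool2d(kernel_size=3, stride=2, padding=1)]
            prev = out
        self.conv = nn.Sequential(*layers)
        # A 64x64 input is halved four times -> 4x4 spatial map with 32 channels.
        self.linear = nn.Linear(32 * 4 * 4, embedding_size)

    def forward(self, x):
        h = torch.relu(self.conv(x))
        return torch.relu(self.linear(h.flatten(start_dim=1)))

print(CNNPreprocessor()(torch.zeros(1, 5, 64, 64)).shape)  # torch.Size([1, 256])
```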
CNN-RN: This model uses a higher-resolution convolutional feature map, with residual connections to increase depth and ease training. Each residual block with N channels consists of an N-channel convolution (size=3, stride=1, padding=“same”) and a max pool (size=3, stride=2, padding=“same”). This is followed by an N-channel convolution (size=3, stride=1, padding=“same”), a ReLU, and another N-channel convolution (size=3, stride=2, padding=“same”), the output of which is added to the max pool output to give the block output. We apply 3 such blocks with N=[16, 32, 8]. This gives a vector of length 8 at every spatial location, to which we apply the standard Relation Net architecture (Santoro et al., 2017): we concatenate each pair of these vectors and feed the resulting length-16 vectors into a 2-layer MLP with ReLU activations (64 and 128 units), before applying an average pool over all pair representations. This 128-length vector is passed through a linear layer to produce the final embedding of size 256. We found the CNN-RN model to have equal or worse performance than the vanilla CNN agent, and thus did not report results on it in the main text; however, Fig. C.1 includes results for the CNN-RN agent across the tasks.

Graph pre-processing: We use the list of objects or segmentation masks to construct the graphs that are input to the RS0-GN and DQN-GN agents, discarding only the information about the order in which objects appeared in the scene.
For the RS0 agent, we then construct a sparse graph from this set of nodes by connecting (1) available objects to all other objects in the scene; (2) targets and obstacles to all blocks in the scene; and (3) blocks that are in contact. The DQN agent takes a fully-connected graph as input, but we also experimented with feeding it the sparse representation (see Sec. D.3 for details).
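The sketch below implements the three sparse-connectivity rules above. The object records, the contact list, and the use of directed edges are illustrative assumptions.

```python
# Sketch of the sparse graph connectivity described above. Object records, the
# contact list, and the use of directed edges are illustrative assumptions.
from itertools import product

def sparse_edges(objects, contacts):
    """objects: list of dicts with a "type" in
    {"available", "block", "target", "obstacle"}; contacts: set of index pairs.
    Returns a set of (sender, receiver) edges."""
    edges = set()
    indices = range(len(objects))
    for i, j in product(indices, indices):
        if i == j:
            continue
        ti, tj = objects[i]["type"], objects[j]["type"]
        if ti == "available":                              # (1) available -> all others
            edges.add((i, j))
        if ti in ("target", "obstacle") and tj in ("available", "block"):
            edges.add((i, j))                              # (2) targets/obstacles -> blocks
        if ti == tj == "block" and ((i, j) in contacts or (j, i) in contacts):
            edges.add((i, j))                              # (3) blocks in contact
    return edges

objs = [{"type": "available"}, {"type": "block"},
        {"type": "block"}, {"type": "target"}]
print(sorted(sparse_edges(objs, contacts={(1, 2)})))
```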
GN architecture: We use the encode-process-decode architecture described by Battaglia et al. (2018), comprised of an independent graph encoder, a recurrent graph core (with separate MLPs as node, edge, and global functions, each followed by a GRU), and finally, as a decoder, either a graph network (for the RS0 agent) or an independent graph network (for the DQN agent). Given a graph observation, the encoder embeds it into a latent graph, the core is applied recurrently to this latent graph together with its own hidden state, and the decoder maps the core’s final output to the required output graph. We use two hidden layers with ReLU non-linearities within all of our graph networks. For the discrete agent, the Q-values are finally decoded from the edges of the output graph, similarly to the approach of Dai et al. (2017).
For the RS0 agent, we find that having more than one recurrent step in the recurrent graph core did not improve performance, so we use a single recurrent step and disable the GRU (which is not needed without recurrence).
Segmented images pre-processing: In the case of the Segmented images observations, each of the nodes in the graph contains an image, which we process independently using a pre-processor similar to that of the CNN model, but smaller (three layers with [8, 16, 8] output channels, followed by two activated linear layers with sizes [64, 32]). This produces a graph with 32 embedded features for each node.
We compare pixel-based approaches with object-based approaches in Fig. C.2, emphasizing that the graph-network agents that take segmented images as input fare closer to their object-based graph counterparts than to raw CNN agents, making their usage an exciting avenue for future work.
We implement a DQN agent with a structured graph input and graph output (roughly similar to Dai et al. (2017)), but where the Q-function is defined on the edges of the graph. This agent takes a fully-connected graph as input. The actions are decoded from the edges (resp. the global features) of the network’s output in the case of the discrete relative (resp. absolute) agent. The learner pulls experience from a replay buffer containing graphs, with a fixed replay ratio of 4. The curriculum over scene difficulty follows a fixed, short schedule of learner steps. The main difference with respect to a vanilla DQN is the way we perform ε-exploration, which we explain in more detail below.
We use a distributed setup with up to 128 actors (for the largest MCTS budgets) and 1 learner. Our setup is synchronized to keep the replay ratio constant, i.e. the learner only replays each transition a maximum number of times, and conversely actors may wait for the learner to be done processing transitions. This results in an algorithm which has similar learning dynamics to a non-distributed one.
The majority of the discrete agent’s actions are invalid, either because (1) they do not correspond to an edge going from an available block to an already placed object; (2) the resulting configuration would have overlapping objects; or (3) the resulting scene would be unstable. As a consequence, standard ε-exploration strongly reduces the length of an episode (longer episodes are exponentially suppressed), effectively performing more exploration at the beginning of an episode than at its end. To counteract this effect, we use an adaptive ε-schedule, where the probability of taking a random action at the t-th step of an episode depends on t and on an empirical estimate of an episode’s typical length; we use the same schedule parameters throughout the paper. The final performance is mostly unchanged, but we observe that this makes learning faster and helps with model training (see Sec. E.3).
The results reported elsewhere in this text for the discrete agent were all obtained with three recurrent steps in the graph core (see Sec. C.2) and a fully-connected input graph, but we experimented with varying the number of recurrences and changing the graph connectivity. In Fig. D.1 we show that performance improves with the number of recurrences, but that training also becomes more unstable, as demonstrated by the wider shaded area around the curve. Empirically, three recurrences provide the best compromise between performance and stability.
Those results were all obtained with a fully-connected graph, whose number of edges is therefore equal to the number of objects squared. However, many of those edges do not correspond to valid actions or to directly actionable connections, and we experimented with removing them from the graph, using the same sparse graph used by the RS0 agent and described in Sec. C (note that this graph typically has about 4 times fewer edges than the fully-connected one). We observe that this reduces the reasoning capacity of the discrete agent and therefore decreases performance. Augmenting the number of recurrences can partially correct this effect (the best seed with a sparse graph and more recurrent steps can reach the same level of performance as a seed with the fully-connected graph), but this then comes at the detriment of training stability.
Our discrete relative agents must choose a block to place, an object to use as a reference, and an offset relative to that reference. Thus far, that offset is only horizontal, since a small vertical offset above the reference block is almost always sufficient. However, what happens if we allow the agent to choose the vertical offset as well? We observe that this multiplies the size of the action space by the number of discretization points (in our case, 15), therefore making learning harder. On the other hand, for seeds that manage to start learning, the final performance is equivalent to that of the agent which only predicts the relative horizontal position (see Fig. D.2), despite an action space much larger than that of a typical discrete agent, as shown in Table D.1.
The architecture of the GN-DQN agent naturally represents discrete quantities (i.e., choosing blocks out of a fixed set), but using a discrete horizontal offset loses precision compared to outputting a continuous value. In order to probe the effect of this approximation, we varied the number of discrete locations that the agent is allowed to choose as the second dimension of the action (Fig. D.3). We observe that a finer discretization of space allows for slightly better final performance on some problems, but also implies slower and more unstable learning. Empirically, the 15 discretization steps used in this paper offer the best compromise. An interesting avenue for further research would be to create an agent that can produce continuous actions attached to a particular edge or vertex of the input graph.
| | Silhouette | Connecting | Covering | Covering Hard |
|---|---|---|---|---|
| absolute | | | | |
| relative | | | | |
| relative (with vertical offset) | | | | |
Other parameters: Throughout the paper, and unless otherwise specified, we use the Adam optimizer with a fixed learning rate for the discrete agent (one value for the model-free agent and another for the model-based agent). We use a batch size of 16 and a replay ratio of 4. We perform a linear curriculum over problem difficulty over a short, fixed number of learner steps. We run all model-free agents for a fixed number of learner steps (corresponding to a much larger number of actor steps); model-based agents are run for longer. Every experiment is run with 10 different seeds.
The efficiency of Monte-Carlo Tree Search (MCTS) (Coulom, 2006) planning in RL has recently been highlighted by Guo et al. (2014) and Silver et al. (2016; 2017; 2018). Here we combine our DQN agent with MCTS, in the spirit of Sutton (1991) and Azizzadenesheli et al. (2018). We define a state in the tree by the sequence of actions that led to it: given an episode starting from some initial configuration and a sequence of actions, the state is identified with that action sequence (we do not try to merge states that would correspond to the same observation when reached via different action sequences). Each node in the tree has an estimated value which combines a prior term derived from the learned Q-function with the standard Monte-Carlo returns of the rollouts performed after visiting that node (note that the resulting Monte-Carlo tree has variable connectivity). Including the prior term is essential to obtain learning with MCTS, even when using a large budget (see Fig. E.1). We interpret this as being due to the large number of actions stemming from each node and to the fact that many of these actions are actually invalid.
We then perform MCTS exploration by picking the action that maximizes the node value plus an exploration bonus; for standard MCTS with UCT exploration (Kocsis & Szepesvári, 2006), this bonus grows with the parent node’s visit count and shrinks with the action’s own visit count. Recalling that the action can be decomposed into an edge index and the remaining dimensions (relative placement, use of glue or not, ...), we instead first pick the edge as the maximizer of a UCT-style criterion, and then pick the remaining action dimensions as the maximizers of the corresponding criterion restricted to that edge. We find this approach to yield slightly better results (see Fig. E.1), and to offer better invariance to changes in the second dimension of the action (e.g., when introducing two-dimensional relative placement or changing the number of discretization steps). We did not find the value of the UCT exploration constant to have a strong influence on our results.
We then use a transition model to deduce the observation, reward, and discount obtained when transitioning from one state to the next. For the results presented in the main part of this work, we focused on using a perfect transition model, obtained by reseeding the environment every time with the initial state of an episode and reapplying the same sequence of actions. While this is impractical for the large MCTS budgets used in some other works, it provides an upper bound on the performance that can be obtained with a learnt model and allows us to analyze hyper-parameters separately from model quality. Also, as we will show next, it is possible to obtain significant gains even when performing the MCTS expansion only at test time.
We incorporate planning into our relative discrete agent in two ways. In the first variation, we only perform MCTS at test time, using an independently trained Q-network as the prior in our MCTS expansion. We observe that this improves the results on almost all problems except Covering. In particular, in Connecting, the fraction of the hardest scenes where the agent does not reach all three targets is decreased by more than a factor of three (from 55% down to 16%).
In the second variation, we also perform MCTS at training time: the actor generates trajectories using MCTS expansions with its current Q-function, and the resulting trajectories are then fed to the learner (which does not do any Monte-Carlo sampling). We observe that this second approach yields slightly more stable learning and higher performance on Covering Hard, the task that requires the most reasoning (see the last panel of Fig. 5). On the other problems, however, it yields similar or even decreased performance.
An interesting point to note is that, when training with a perfect simulator, the transfer into the Q-function is very imperfect, as demonstrated by the low value of the leftmost point on the darker curve of Fig. 5. As it turns out, the agent relies on the model to select the best action out of the few candidates proposed by the Q-function. This may explain why performance does not necessarily increase when testing with a larger budget, as the Q-function does not in this case provide a good prior for a deeper exploration of the MCTS tree. This is, in essence, similar to the hypothesis put forward by Azizzadenesheli et al. (2018).
Finally, we extend the previous model-based results by performing the MCTS expansion with a learnt model rather than a perfect simulator of the environment. Using a learnt object-based model was recently put forward as an efficient method for planning in RL (Pascanu et al., 2017; Hamrick et al., 2017) and for predicting physical dynamics (Sanchez-Gonzalez et al., 2018; Janner et al., 2019). Note, however, that none of these approaches have attempted to use MCTS with a graph network-based model.
The model is an operator that takes as input a graph observation and an action, and outputs a new graph observation alongside a reward and discount for the transition. Given a sequence of observations, actions, rewards, and discounts belonging to a single episode, we train this model with an unrolled loss: the model is applied recursively to its own predicted observations for several steps, and a single-step loss penalizes, at each step, the discrepancy between the predicted and true observation, reward, and discount.
In practice we varied the number of unrolls between 1 and 4. The model training is slower with a larger number of unrolls, but it yields more consistent unrolls when used within the MCTS expansion (ideally, the number of unrolls should probably match the typical depth of a MCTS unroll). The model architecture is similar to the one of the main Q-network described in Sec. C.
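A generic sketch of such an unrolled loss is shown below. The model interface, the per-step loss, and the toy scalar "observations" in the usage example are illustrative assumptions.

```python
# Generic sketch of the unrolled model-training loss described above: the model
# is fed its own predictions for several steps and a per-step loss compares the
# predicted and true observation, reward, and discount. The model interface and
# the toy scalar "observations" below are illustrative assumptions.

def unrolled_loss(model, episode, num_unrolls, step_loss):
    """episode: list of (obs, action, reward, discount, next_obs) tuples.
    model(obs, action) -> (pred_next_obs, pred_reward, pred_discount)."""
    total = 0.0
    for t in range(len(episode)):
        pred_obs = episode[t][0]                     # start from the true observation
        for k in range(num_unrolls):
            if t + k >= len(episode):
                break
            _, action, reward, discount, next_obs = episode[t + k]
            pred_obs, pred_r, pred_d = model(pred_obs, action)  # feed predictions back in
            total += step_loss(pred_obs, next_obs, pred_r, reward, pred_d, discount)
    return total

# Toy usage with scalar observations and a dummy model.
episode = [(0.0, 1, 1.0, 0.9, 1.0), (1.0, 1, 1.0, 0.9, 2.0)]
dummy_model = lambda obs, a: (obs + a, 1.0, 0.9)
sq = lambda o, ot, r, rt, d, dt: (o - ot) ** 2 + (r - rt) ** 2 + (d - dt) ** 2
print(unrolled_loss(dummy_model, episode, num_unrolls=2, step_loss=sq))  # 0.0
```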
Model pre-training: At first, we experiment with using a pretrained, learnt model to then perform Q-learning with MCTS. The setup is therefore as follows:
(1) Train an agent model-free, or with a perfect environment simulator.
(2) Train a model on trajectories generated by this agent.
(3) Train a second agent with the model learnt in step (2).
We observe in Fig. E.2 that this allows us to obtain improved performance at the beginning of training, matching the results obtained with a perfect environment simulator. However, on longer timescales, performance plateaus and ends slightly worse than that of a model-free agent. We interpret this as being due to the rigidity of the model on longer timescales: it is not able to generalize well enough to the data distribution that would be required to obtain larger rewards.
Model learnt online: Finally, we try to learn a model online: the agent is trained with a model which is learnt at the same time on trajectories generated by the agent. As shown in Fig. E.3, this slightly outperforms the model-free agent on short timescales in two of the problems (Silhouette and Covering Hard), while the noise introduced by the model is again prohibitive in Covering. On longer timescales, the imperfections of the model make the agent trained with a learnt model converge to the same rewards as the one trained without a model, rather than to those obtained with a perfect model.
We believe that, both in this case and when pre-training the model, understanding how to train the model so that it generalizes better and yields sharper predictions is an important area of future research, and we see the positive results described here at the beginning of training as a strong motivation to pursue work in this direction.
We use a Retraced Stochastic Value Gradients (RS0; Heess et al. (2015); Munos et al. (2016); Riedmiller et al. (2018)) off-policy agent, with a shared observation pre-processor and independent actor and critic core models (MLP or GN). The critic is conditioned on the actions by concatenating them with the input to the model core (either the MLP input features or the graph globals). The actor learns a Gaussian distribution over each of the actions by outputting parameters using a linear policy head, conditioned on the last layer of the MLP or the output globals of the GN. We use a discount of 0.98 and calculate the retrace loss on sequences of length 5.
While the Gaussian policy noise is often sufficient as a source of exploration, due to the highly multi-modal nature of placing objects we injected additional fixed exploration by sampling a continuous action uniformly over the action range with a small probability, and sampling from the Gaussian otherwise. This probability was set to one value for tasks with shorter episodes (Silhouette and Covering Hard) and to a different value otherwise.
Due to the slower training of RS0 compared to DQN, and the large variance in learning time across the different configurations, we use a dynamic curriculum, only allowing agents to progress through the curriculum once they had achieved a certain performance in the current level.
The criterion for progressing through the curriculum is to obtain at least 50% of the maximum reward in at least 50% of the episodes (Silhouette, Covering Hard) or at least 25% of the maximum reward in at least 25% of the episodes (Connecting, Covering). The threshold values were selected to ensure that the great majority of seeds would reach the maximum level of the curriculum during the allocated experiment time.
To prevent agents from progressing through the curriculum by solving only a particular type of scene, this criterion is applied independently over groups of episodes, partitioned based on unique combinations of: number of targets, maximum target height, number of obstacles, and maximum obstacle height, using statistics from the last 200 episodes in each group.
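The sketch below illustrates this progression criterion; the class name, the group keys, and the data layout are illustrative, and the 50%/50% thresholds correspond to the Silhouette and Covering Hard settings above (25%/25% would be used for Connecting and Covering).

```python
# Sketch of the dynamic curriculum criterion described above: advance only when
# every group of comparable episodes meets the reward threshold over its last
# 200 episodes. Class name, group keys, and data layout are illustrative.
from collections import defaultdict, deque

class CurriculumGate:
    def __init__(self, reward_fraction=0.5, episode_fraction=0.5, window=200):
        self.reward_fraction = reward_fraction
        self.episode_fraction = episode_fraction
        self.history = defaultdict(lambda: deque(maxlen=window))

    def record(self, group_key, reward, max_reward):
        """group_key: e.g. (num_targets, max_target_height, num_obstacles, ...)."""
        self.history[group_key].append(reward >= self.reward_fraction * max_reward)

    def should_advance(self):
        # The success fraction must be met in every group, not just on average.
        return bool(self.history) and all(
            sum(hist) >= self.episode_fraction * len(hist)
            for hist in self.history.values())

gate = CurriculumGate()
for _ in range(200):
    gate.record(group_key=(2, 1.0, 0, 0.0), reward=4.0, max_reward=6.0)
print(gate.should_advance())  # True: >=50% of recent episodes got >=50% of max reward
```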
We run every experiment with 10 independent seeds, each of them with 8 actors, 1 learner, and 1 FIFO replay buffer (with a fixed capacity measured in sequences). Additionally, an evaluation actor with exploration disabled (which does not feed data into the replay) is used to generate data for evaluating the dynamic curriculum criterion and for monitoring overall performance on the task at maximum difficulty.
Due to the off-policy character of the algorithm, we did not set any synchronization between the actors generating the data and the learner obtaining batches of data from the replay. As a consequence, the relative number of actor steps per second and learner steps per second can vary drastically across the different architectures, depending on the relative speed differences between the forward pass (actors) and the backward pass (learner) of the models. We therefore used a wall-time criterion for terminating our experiments, stopping all experiments after one week of training, or once performance started decreasing.
Task | Best absolute agent | Best non-GN relative agent | Best relative agent | Best model-based relative agent |
---|---|---|---|---|
Silhouette | GN-RS0 | RNN-RS0 | GN-DQN | GN-DQN with MCTS at test time |
Gen. 16 Blocks | GN-RS0 | CNN-RS0 | GN-DQN | GN-DQN with MCTS at test time |
Connecting | RNN-RS0 | RNN-RS0 | GN-DQN | GN-DQN with MCTS at test time |
Gen. Diff. Locs. | RNN-RS0 | RNN-RS0 | GN-DQN | GN-DQN with MCTS at test time |
Gen. 4 Layers | RNN-RS0 | RNN-RS0 | GN-DQN | GN-DQN with MCTS at test time |
Covering | GN-RS0 | CNN-RS0 | GN-DQN | GN-DQN with MCTS at test time |
Covering Hard | GN-RS0 | RNN-RS0 | GN-DQN | GN-DQN with MCTS at train and test time |