
Photo-Realistic Blocksworld Dataset

by Masataro Asai, et al.

In this report, we introduce an artificial dataset generator for the Photo-realistic Blocksworld domain. Blocksworld is one of the oldest high-level task planning domains; it is well defined but contains sufficient complexity, e.g., conflicting subgoals and decomposability into subproblems. We aim to make this dataset a benchmark for Neural-Symbolic integrated systems and to accelerate research in this area. The key advantage of such systems is the ability to obtain a symbolic model from real-world input and to run a fast, systematic, complete algorithm for symbolic reasoning, without any supervision or reward signal from the environment.



1 Introduction

Blocksworld is one of the earliest symbolic AI planning domains and has traditionally been expressed in the standardized PDDL language [McDermott 2000]. With the advent of modern machine learning systems that are capable of automatically deriving a symbolic, propositional representation of the environment from raw inputs [Asai & Fukunaga 2018, Kurutach et al. 2018], it has become realistic to directly solve visualized versions of classical planning problems like Blocksworld.

To accelerate research in this area, we publish the Photo-realistic Blocksworld Dataset Generator, a system that renders realistic Blocksworld images and puts the results in a compact NumPy archive.

Figure 1: (Left): A simplified illustration of Blocksworld domain. (Right): An example photo-realistic rendering of a Blocksworld state in this dataset.

In this document, we first describe the Blocksworld domain, then proceed to the details of the dataset and the extensions added to the environment. Finally, we show the results of some baseline experiments in which an existing Neural-Symbolic planning system, Latplan [Asai & Fukunaga 2018], is applied to this dataset.

2 The Blocksworld Domain

Blocksworld is a domain that was proposed in SHRDLU [Winograd 1971], an interactive natural language understanding program that takes a command given as a natural language text string, invokes a planner to achieve the task, and then displays the result. It is still used as a testbed for evaluating various computer vision and dialog systems [Perera et al. 2018, Bisk et al. 2017, She et al. 2014, Johnson et al. 2009]. Most notably, SHRDLU heavily inspired CLEVR, a standard benchmark dataset for Visual Question Answering [Johnson et al. 2017].

The Blocksworld domain typically consists of several stacks of wooden blocks placed on a table, and the task is to reorder and stack the blocks in a specified configuration. The blocks are named (typically with a letter of the alphabet) and are supposed to be identifiable, but in the real world, the color could replace the ID of each block.

In order to move a block $x$, the block should be clear of anything on top of it, i.e., it should be on top of a stack, so that the arm can directly grasp a single block without hitting another object. This restriction allows the arm to grasp only a single block at a time. While this condition could be expressed as a quantified first-order formula using the predicate on, e.g., $\forall y\, \lnot \mathrm{on}(y, x)$, it is typically modeled with an auxiliary predicate clear with an equivalent meaning.

There are a couple of variations to this domain such as the one which explicitly involves the concept of a robotic arm. In this variation, moving a block is decomposed into two actions, e.g., pick-up and put-down, where pick-up action requires a handempty nullary predicate to be true. The handempty predicate indicates that the arm is currently holding nothing, and can grasp a new block. These predicates could be trivially extended to a multi-armed scenario where pick-up and handempty could take an additional ?arm argument, but we do not add such an extension in our dataset.

When modeled as a pure STRIPS domain, where disjunctive preconditions (or’s in the preconditions) are prohibited, the actions can be further divided into pick-up/put-down actions and unstack/stack actions: in pick-up/put-down the arm grasps/releases a block from/onto the table, while in unstack/stack the arm grasps/releases a block from/onto another block. An example PDDL representation of such a model is presented in Fig. 2.

The domain has further variations such as those containing a size constraint (a larger block cannot be put on a smaller block) or other non-stackable shapes (spheres or pyramids) [Gupta & Nau 1992], but we do not consider them in this dataset.

    (define (domain blocks)
      (:requirements :strips)
      (:predicates (on ?x ?y) (ontable ?x) (clear ?x) (handempty) (holding ?x))

      (:action pick-up
        :parameters (?x)
        :precondition (and (clear ?x) (ontable ?x) (handempty))
        :effect (and (not (ontable ?x)) (not (clear ?x)) (not (handempty)) (holding ?x)))

      (:action put-down
        :parameters (?x)
        :precondition (holding ?x)
        :effect (and (not (holding ?x)) (clear ?x) (handempty) (ontable ?x)))

      (:action stack
        :parameters (?x ?y)
        :precondition (and (holding ?x) (clear ?y))
        :effect (and (not (holding ?x)) (not (clear ?y)) (clear ?x) (handempty) (on ?x ?y)))

      (:action unstack
        :parameters (?x ?y)
        :precondition (and (on ?x ?y) (clear ?x) (handempty))
        :effect (and (holding ?x) (clear ?y) (not (clear ?x)) (not (handempty)) (not (on ?x ?y)))))

Figure 2: 4-op Blocksworld in PDDL.

The problem is known for exhibiting subgoal interactions caused by delete effects, which introduced the well-known Sussman’s anomaly [Sussman 1973, Sacerdoti 1975, Waldinger 1975, McDermott & Charniak 1985, Norvig 1992]. The anomaly states that achieving some subgoal requires temporarily destroying an already achieved subgoal (and restoring it later). This characteristic makes the problem difficult, as the agent needs to consider how to achieve the subgoals in the correct order (subgoal ordering). Later, planning problems containing such interactions were shown to be PSPACE-complete, while delete-free planning was shown to be only NP-complete [Bäckström & Nebel 1995].
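The 4-operator model of Fig. 2 and the subgoal interaction described above can be traced with a minimal Python sketch. The `Action` tuple and the `apply` helper are names introduced here for illustration only; the facts mirror the predicates of Fig. 2, and the instance is the usual three-block form of Sussman's anomaly.

```python
from typing import FrozenSet, NamedTuple, Tuple

Fact = Tuple[str, ...]


class Action(NamedTuple):
    name: str
    pre: FrozenSet[Fact]      # facts that must hold before the action
    add: FrozenSet[Fact]      # facts made true
    delete: FrozenSet[Fact]   # facts made false (the delete effects)


def pick_up(x: str) -> Action:
    return Action(f"pick-up {x}",
                  frozenset({("clear", x), ("ontable", x), ("handempty",)}),
                  frozenset({("holding", x)}),
                  frozenset({("ontable", x), ("clear", x), ("handempty",)}))


def put_down(x: str) -> Action:
    return Action(f"put-down {x}",
                  frozenset({("holding", x)}),
                  frozenset({("clear", x), ("handempty",), ("ontable", x)}),
                  frozenset({("holding", x)}))


def stack(x: str, y: str) -> Action:
    return Action(f"stack {x} {y}",
                  frozenset({("holding", x), ("clear", y)}),
                  frozenset({("clear", x), ("handempty",), ("on", x, y)}),
                  frozenset({("holding", x), ("clear", y)}))


def unstack(x: str, y: str) -> Action:
    return Action(f"unstack {x} {y}",
                  frozenset({("on", x, y), ("clear", x), ("handempty",)}),
                  frozenset({("holding", x), ("clear", y)}),
                  frozenset({("clear", x), ("handempty",), ("on", x, y)}))


def apply(state: FrozenSet[Fact], action: Action) -> FrozenSet[Fact]:
    """STRIPS semantics: check preconditions, remove deletes, then add adds."""
    if not action.pre <= state:
        raise ValueError(f"{action.name} is not applicable")
    return (state - action.delete) | action.add


# Sussman's anomaly: C sits on A; A and B are on the table.
initial = frozenset({("on", "C", "A"), ("ontable", "A"), ("ontable", "B"),
                     ("clear", "C"), ("clear", "B"), ("handempty",)})
goal = frozenset({("on", "A", "B"), ("on", "B", "C")})

# Achieving on(B, C) first buries A: C is no longer clear afterwards.
naive = apply(apply(initial, pick_up("B")), stack("B", "C"))

# A correct plan clears A first, then builds the tower bottom-up.
plan = [unstack("C", "A"), put_down("C"), pick_up("B"),
        stack("B", "C"), pick_up("A"), stack("A", "B")]
state = initial
for action in plan:
    state = apply(state, action)

print(goal <= state)
```

In the `naive` state the subgoal on(B, C) already holds, yet A can only be freed by unstacking B again, which is exactly the destroy-and-restore behavior the anomaly describes.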

Solving the Blocksworld problem suboptimally, i.e., without requiring the fewest number of moves that solves the problem, is tractable. Indeed, we can solve the problem in linear time with the following algorithm: first, put all blocks down onto the table, then construct the desired stacks. This procedure finishes in a number of steps linear in the number of blocks. In contrast, the decision version of the optimization problem, i.e., “Given an integer $k$, is there any path that achieves the goal state under $k$ steps?”, is shown to be NP-hard [Gupta & Nau 1992].
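The two-phase procedure can be sketched as follows. This is an illustration under simplifying assumptions: a state is a list of stacks listed bottom to top, a move is an abstract `(block, destination)` pair rather than the four operators of Fig. 2, and the table is assumed to have room for every block. The function names are hypothetical.

```python
from typing import Dict, List, Tuple

Stacks = List[List[str]]          # each stack lists its blocks bottom to top
Move = Tuple[str, str]            # (block, destination); destination is "table" or a block


def unstack_then_build(initial: Stacks, goal: Stacks) -> List[Move]:
    """Suboptimal planner: flatten everything, then build the goal stacks."""
    plan: List[Move] = []
    # Phase 1: move every block that is not already on the table, top first.
    for stack in initial:
        for block in reversed(stack[1:]):
            plan.append((block, "table"))
    # Phase 2: build each goal stack bottom-up; its bottom block is on the table.
    for stack in goal:
        for below, above in zip(stack, stack[1:]):
            plan.append((above, below))
    return plan


def execute(initial: Stacks, plan: List[Move]) -> Dict[str, str]:
    """Simulate the plan, checking that moved blocks and targets are clear."""
    on: Dict[str, str] = {}
    for stack in initial:
        for below, above in zip(["table"] + stack, stack):
            on[above] = below
    for block, dest in plan:
        covered = set(on.values())
        assert block not in covered, f"{block} is not clear"
        assert dest == "table" or dest not in covered, f"{dest} is not clear"
        on[block] = dest
    return on


initial = [["A", "C"], ["B"]]     # C on A; B alone
goal = [["C", "B", "A"]]          # A on B on C
plan = unstack_then_build(initial, goal)
final = execute(initial, plan)
print(plan, final)
```

Each block is moved at most twice (once to the table, once to its goal position), so the plan length is bounded by twice the number of blocks.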

More recently, an in-depth analysis of suboptimal / approximation algorithms and of the problem generation method was performed on the domain [Slaney & Thiébaux 2001], showing that the domain is still able to convey valuable lessons when analyzing the performance of modern planners.

Figure 3: A demonstration of the SHRDLU natural language understanding computer program. The image was taken from Winograd’s original paper [Winograd 1971].

3 The Dataset

The code base for the dataset generator is a fork of the CLEVR dataset generator, from which the rendering code was reused, with modifications to the logic for placing objects and enumerating the valid transitions. It renders the constructed scenes in the Blender 3D rendering engine, which can produce a photorealistic image by ray tracing (Fig. 1, right).

In the environment, there are several cylinders or cubes of various colors and sizes and two surface materials (Metal/Rubber) stacked on the floor, just like in the usual STRIPS Blocksworld domain. The original environment used in SHRDLU contains shapes other than cubes, such as pyramids, which are not stackable. While we do not use those unstackable shapes in the generator by default, they can be supported with a few modifications to the code.

Unlike the original Blocksworld, three actions can be performed in this environment: move a block onto another stack or the floor, and polish/unpolish a block, i.e., change the surface of a block from Metal to Rubber or vice versa. All actions are applicable only when the block is on top of a stack or on the floor. The polish/unpolish actions allow changes in the non-coordinate features of the object vectors, which adds complexity.

Figure 4: An example Blocksworld transition. Each state is perturbed by jitter in the light positions, object locations, and object rotations, and by ray-tracing noise. Objects have different sizes, colors, shapes and surface materials. Regions corresponding to each object in the environment are extracted according to the bounding box information included in the dataset generator output, but would ideally be extracted automatically by object recognition methods such as YOLO [Redmon et al. 2016]. Other objects may intrude into the extracted regions, as can be seen in the extraction of the green cube (top-left), which also contains the bottom edge of the smaller cube.

The dataset generator takes the number of blocks and the maximum number of stacks that are allowed in the environment. No two blocks are allowed to be of the same color, and no two blocks are allowed to have the same shape and size.

The generator produces a 300x200 RGB image and a state description which contains the bounding boxes (bbox) of the objects. Extracting these bboxes is an object recognition task we do not address in this paper, and ideally, should be performed by a system like YOLO [Redmon et al. 2016].

The generator comes with a postprocessing program, which extracts the image patches in the bboxes and resizes them to 32x32 RGB. It stores the image patches as well as the bbox vectors in a single, compressed NumPy archive (.npz), which is easily loaded from a Python program. The agent that performs a Machine Learning task on this environment is allowed to use the original rendering as well as this segmented dataset.
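As a rough illustration of how such an archive could be consumed, the sketch below writes a small synthetic `.npz` to memory and reads it back. The array keys `patches` and `bboxes`, the array layouts, and the four-number bbox encoding are assumptions made for this example; the keys and shapes of the actual archive may differ.

```python
import io

import numpy as np

num_states, num_objects = 4, 3
rng = np.random.default_rng(0)

# Synthetic stand-ins: one 32x32 RGB patch and one 4-number bbox per object.
patches = rng.integers(0, 256, size=(num_states, num_objects, 32, 32, 3), dtype=np.uint8)
bboxes = rng.integers(0, 200, size=(num_states, num_objects, 4), dtype=np.int32)

# Write a compressed archive (here to memory instead of a file on disk).
buffer = io.BytesIO()
np.savez_compressed(buffer, patches=patches, bboxes=bboxes)
buffer.seek(0)

# Load it back; the keys are the hypothetical names chosen above.
with np.load(buffer) as archive:
    loaded_patches = archive["patches"]
    loaded_bboxes = archive["bboxes"]

# One feature vector per object: flattened pixels scaled to [0, 1], then the bbox.
pixels = loaded_patches.reshape(num_states, num_objects, -1).astype(np.float32) / 255.0
features = np.concatenate([pixels, loaded_bboxes.astype(np.float32)], axis=-1)
print(features.shape)
```

Concatenating the flattened patch with its bbox gives one fixed-length vector per object, which is one simple way to present a state as a set of object feature vectors.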

The generator enumerates all possible states/transitions (480/2592 for 3 blocks and 3 stacks; 80640/518400 for 5 blocks and 3 stacks). Since the rendering could take time, the script provides support for a distributed cluster environment with a batch scheduling system.
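These counts can be reproduced with a short calculation. The state model below is an assumption inferred from the numbers, not taken from the generator's code: stack positions are distinguishable and may be empty, every block has one of two materials, and each top block can either move to any other stack position or toggle its material.

```python
from math import factorial
from typing import Iterator, Tuple


def compositions(total: int, parts: int) -> Iterator[Tuple[int, ...]]:
    """Yield every tuple of `parts` non-negative integers that sums to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def count_states_and_transitions(num_blocks: int, num_stacks: int) -> Tuple[int, int]:
    states = 0
    transitions = 0
    # For fixed stack heights: any ordering of the blocks, any material assignment.
    configs_per_heights = factorial(num_blocks) * 2 ** num_blocks
    for heights in compositions(num_blocks, num_stacks):
        nonempty = sum(1 for h in heights if h > 0)
        states += configs_per_heights
        # Each top block has (num_stacks - 1) moves plus 1 material toggle.
        transitions += configs_per_heights * nonempty * num_stacks
    return states, transitions


for blocks in (3, 4, 5):
    print(blocks, count_states_and_transitions(blocks, 3))
```

Under these assumptions the function returns 480/2592 for 3 blocks and 80640/518400 for 5 blocks, matching the figures above.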

4 Baseline Performance Experiment

To demonstrate the baseline performance of an existing Neural-Symbolic planning system, we modified the Latplan [Asai & Fukunaga 2018] image-based planner, a system that operates on a discrete symbolic latent space of real-valued inputs and runs Dijkstra’s/A* search using a state-of-the-art symbolic classical planning solver. We modified Latplan to take a set of object feature vectors as input rather than images.

The Latplan system learns a binary latent space of an arbitrary raw input (e.g., images) with a Gumbel-Softmax variational autoencoder (State AutoEncoder network, SAE), learns a discrete state space from the transition examples, and runs a symbolic, systematic search algorithm such as Dijkstra or A* search, which guarantees the optimality of the solution. Unlike RL-based planning systems, the search agent does not contain any learning component. The discrete plan in the latent space is mapped back to a raw image visualization of the plan execution, which requires the reconstruction capability of the (V)AE. A similar system replacing the Gumbel-Softmax VAE with Causal InfoGAN was later proposed [Kurutach et al. 2018].

Once the network has learned the representation, the planner is guaranteed to find a solution whenever one is reachable in the state space, because the search algorithm being used (e.g., Dijkstra) is a complete, systematic, symbolic search algorithm. If the network cannot learn the representation, the system cannot solve the problem and/or return a human-comprehensible visualization.
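The completeness argument can be illustrated with a minimal Dijkstra search over an explicit set of latent transitions. This is a sketch, not Latplan's implementation: latent states are modeled as tuples of bits, every transition has unit cost, and the function name `dijkstra` and the toy transition list are introduced here for illustration.

```python
import heapq
from itertools import count
from typing import Dict, List, Optional, Sequence, Tuple

State = Tuple[int, ...]


def dijkstra(transitions: Sequence[Tuple[State, State]],
             init: State, goal: State) -> Optional[List[State]]:
    """Return a shortest state sequence from init to goal, or None if unreachable."""
    graph: Dict[State, List[State]] = {}
    for src, dst in transitions:
        graph.setdefault(src, []).append(dst)

    tie = count()  # unique tie-breaker so heap entries never compare states
    frontier = [(0, next(tie), init, None)]
    parent: Dict[State, Optional[State]] = {}
    while frontier:
        cost, _, state, prev = heapq.heappop(frontier)
        if state in parent:
            continue  # already expanded with a cost that is no larger
        parent[state] = prev
        if state == goal:
            path = []
            node: Optional[State] = state
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in graph.get(state, []):
            if nxt not in parent:
                heapq.heappush(frontier, (cost + 1, next(tie), nxt, state))
    return None  # the whole reachable space was exhausted


toy_transitions = [((0, 0), (0, 1)), ((0, 1), (1, 1)),
                   ((0, 0), (1, 0)), ((1, 0), (1, 1)), ((1, 1), (0, 0))]
found = dijkstra(toy_transitions, (0, 0), (1, 1))
missing = dijkstra(toy_transitions, (0, 0), (9, 9))
print(found, missing)
```

The search either returns a shortest path or exhausts every reachable state and returns `None`, which is the sense in which the result depends only on whether the encoded init and goal states are connected in the learned state space.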

We generated a dataset for a 4-blocks, 3-stacks environment, whose search space consists of 5760 states and 34560 transitions. We provided 2500 randomly selected states for training the autoencoder.

We tested the generated representation with the AMA$_1$ PDDL generator [Asai & Fukunaga 2018] and the Fast Downward [Helmert 2004] classical planner. AMA$_1$ is an oracular method that takes the entire set of raw state transitions, encodes each pair with the SAE, then instantiates each encoded pair into a grounded action schema. It models the ground truth of the transition rules and is thus useful for verifying the state representation. Planning fails when the SAE fails to encode a given init/goal image into a propositional state that exactly matches one of the search nodes. While there are several learning-based AMA methods that approximate AMA$_1$ (e.g., AMA$_2$ [Asai & Fukunaga 2018] and Action Learner [Amado et al. 2018b, Amado et al. 2018a]), there is information loss between the learned action model and the original search space.
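The idea of instantiating each encoded transition as one ground action can be sketched as below. The encoding shown here is an assumption made for illustration (proposition names `z0`, `z1`, ..., the whole predecessor state as the precondition, and only the changed bits as effects); the exact PDDL emitted by the oracular generator may differ.

```python
from typing import Sequence


def ground_action(index: int, pre: Sequence[int], suc: Sequence[int]) -> str:
    """Turn one encoded (predecessor, successor) bit-vector pair into a ground PDDL action."""
    # Precondition: every bit of the predecessor state, positive or negated.
    precondition = " ".join(
        f"(z{i})" if bit else f"(not (z{i}))" for i, bit in enumerate(pre))
    # Effect: only the bits whose value changes across the transition.
    effects = " ".join(
        f"(z{i})" if after else f"(not (z{i}))"
        for i, (before, after) in enumerate(zip(pre, suc)) if before != after)
    return (f"(:action a{index}\n"
            f"  :parameters ()\n"
            f"  :precondition (and {precondition})\n"
            f"  :effect (and {effects}))")


example = ground_action(0, (1, 0, 1), (1, 1, 0))
print(example)
```

Because every observed transition becomes its own action, the resulting model grows linearly with the number of transitions, which is why such a generator is practical only for verification on small state spaces.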

We invoked Fast Downward with a blind heuristic in order to remove the effect of the heuristic. This is because the AMA generator produces a huge PDDL model containing all transitions, which results in an excessive runtime for initializing any sophisticated heuristic. The scalability issue caused by using a blind heuristic is not a concern, since the focus of this evaluation is on the feasibility of the representation.

We solved 30 planning instances generated by taking a random initial state and choosing a goal state by a 3-, 7-, or 14-step random walk (10 instances each). The resulting plans were inspected manually and checked for correctness. While Latplan returned plans for all 30 instances, the plans were correct in only 14 instances due to errors in the reconstruction (details in Table 1). The training appears to be more difficult than in the domains tested in [Asai & Fukunaga 2018] due to the wider variety of disturbances in the environment. This could be improved by analyzing and addressing the various deficiencies of the latent representation learned by the network, and by tuning the hyperparameters for better accuracy.
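The instance generation scheme can be sketched as follows, assuming the state space is available as a successor table. The function name, the `successors` dictionary, and the toy ring graph are hypothetical and stand in for the enumerated Blocksworld transitions.

```python
import random
from typing import Dict, Hashable, List, Tuple


def random_walk_instance(successors: Dict[Hashable, List[Hashable]],
                         steps: int,
                         rng: random.Random) -> Tuple[Hashable, Hashable]:
    """Pick a random initial state and walk `steps` random transitions to a goal."""
    init = rng.choice(sorted(successors))
    state = init
    for _ in range(steps):
        state = rng.choice(successors[state])
    return init, state


# Toy state space: a ring of six states with moves in both directions.
ring = {i: [(i + 1) % 6, (i - 1) % 6] for i in range(6)}
rng = random.Random(0)
instances = [random_walk_instance(ring, 3, rng) for _ in range(10)]
print(instances)
```

Since a random walk may revisit states, the number of steps is only an upper bound on the length of the optimal plan between the sampled initial and goal states.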

| Random walk steps used for generating the problem instances | Number of solved instances (out of 10 instances each) |
|---|---|
| 3 | 7 |
| 7 | 5 |
| 14 | 2 |

Table 1: The number of instances solved by Latplan using a VAE.

Figure 5: An example of a problem instance. (Left) The initial state. (Right) The goal state. The planner should unpolish a green cube and move the blocks to the appropriate goal positions, while also following the environment constraint that a block can be moved or polished only when it is on top of a stack or on the floor.

Figure 6: An example of a successful plan execution returned by Latplan. Latplan found an optimal solution because of the underlying optimal search algorithm (Dijkstra).

5 Related Work

The crucial difference between this environment and the Arcade Learning Environment (ALE) [Bellemare et al. 2013] is twofold. First, the action labels are not readily available: in ALE, the agent has total knowledge of the possible combinations of key/lever inputs (up, down, left, right, fire), while in our dataset the state transition pairs are not labeled with actions. The agent is required to find the set of actions by itself, possibly using an additional learning mechanism such as the AMA system [Asai & Fukunaga 2018].

Second, in ALE, the scoring criteria / reinforcement signal is defined. In contrast, classical planning problems like Blocksworld do not contain such signals except for the path length (also called an intrinsic reward in the RL field). A possible extension of the baseline planning system presented above is to make it learn the goal conditions rather than requiring the single goal state as the input.

There is another image dataset [Bisk et al. 2016] for a 3D environment that consists of blocks, but that environment does not contain the combinatorial aspect that is present in our dataset. Specifically, it does not contain subgoal conflicts, as all blocks are initially located on the table.

6 Discussion and Conclusion

We introduced a generator for the Photo-Realistic Blocksworld dataset and specified the environment that is depicted in it. In the future, we aim to increase the variety of classical planning domains that are expressed in a visual format, in a spirit similar to the Arcade Learning Environment [Bellemare et al. 2013], or perhaps by borrowing some data from it.


  • [Amado et al. 2018a] Amado, L., Aires, J. P., Pereira, R. F., Magnaguagno, M. C., Granada, R., & Meneguzzi, F. 2018a. LSTM-Based Goal Recognition in Latent Space. arXiv preprint arXiv:1808.05249.
  • [Amado et al. 2018b] Amado, L., Pereira, R. F., Aires, J., Magnaguagno, M., Granada, R., & Meneguzzi, F. 2018b. Goal Recognition in Latent Space.
  • [Asai & Fukunaga 2018] Asai, M. & Fukunaga, A. 2018. Classical Planning in Deep Latent Space: Bridging the Subsymbolic-Symbolic Boundary. In Proc. of AAAI Conference on Artificial Intelligence.
  • [Bäckström & Nebel 1995] Bäckström, C. & Nebel, B. 1995. Complexity Results for SAS+ Planning. Computational Intelligence, 11(4), 625–655.
  • [Bellemare et al. 2013] Bellemare, M. G., Naddaf, Y., Veness, J., & Bowling, M. 2013. The Arcade Learning Environment: An Evaluation Platform for General Agents. Journal of Artificial Intelligence Research, 47, 253–279.
  • [Bisk et al. 2016] Bisk, Y., Marcu, D., & Wong, W. 2016. Towards a Dataset for Human Computer Communication via Grounded Language Acquisition. In AAAI Workshop: Symbiotic Cognitive Systems.
  • [Bisk et al. 2017] Bisk, Y., Shih, K. J., Choi, Y., & Marcu, D. 2017. Learning interpretable spatial operations in a rich 3d blocks world. arXiv preprint arXiv:1712.03463.
  • [Gupta & Nau 1992] Gupta, N. & Nau, D. S. 1992. On the Complexity of Blocks-World Planning. Artificial Intelligence, 56(2), 223–254.
  • [Helmert 2004] Helmert, M. 2004. A Planning Heuristic Based on Causal Graph Analysis. In Proc. of the International Conference on Automated Planning and Scheduling (ICAPS), 161–170.
  • [Johnson et al. 2017] Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Zitnick, C. L., & Girshick, R. 2017. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. In CVPR.
  • [Johnson et al. 2009] Johnson, M., Jonker, C., Van Riemsdijk, B., Feltovich, P. J., & Bradshaw, J. M. 2009. Joint Activity Testbed: Blocks World for Teams (BW4T). In International Workshop on Engineering Societies in the Agents World, 254–256. Springer.
  • [Kurutach et al. 2018] Kurutach, T., Tamar, A., Yang, G., Russell, S., & Abbeel, P. 2018. Learning Plannable Representations with Causal InfoGAN. In Proceedings of ICML / IJCAI / AAMAS 2018 Workshop on Planning and Learning (PAL-18).
  • [McDermott & Charniak 1985] McDermott, D. & Charniak, E. 1985. Introduction to Artificial Intelligence. Reading: Addison-Wesley.
  • [McDermott 2000] McDermott, D. V. 2000. The 1998 AI Planning Systems Competition. AI Magazine, 21(2), 35–55.
  • [Norvig 1992] Norvig, P. 1992. Paradigms of Artificial Intelligence Programming: Case Studies in Common LISP. Morgan Kaufmann.
  • [Perera et al. 2018] Perera, I., Allen, J., Teng, C. M., & Galescu, L. 2018. Building and learning structures in a situated blocks world through deep language understanding. In Proceedings of the First International Workshop on Spatial Language Understanding, 12–20.
  • [Redmon et al. 2016] Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. 2016. You Only Look Once: Unified, Real-Time Object Detection. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 779–788.
  • [Sacerdoti 1975] Sacerdoti, E. D. 1975. The Nonlinear Nature of Plans. Stanford Research Institute, Menlo Park, CA.
  • [She et al. 2014] She, L., Yang, S., Cheng, Y., Jia, Y., Chai, J., & Xi, N. 2014. Back to the blocks world: Learning new actions through situated human-robot dialogue. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), 89–97.
  • [Slaney & Thiébaux 2001] Slaney, J. & Thiébaux, S. 2001. Blocks World Revisited. Artificial Intelligence, 125(1-2), 119–153.
  • [Sussman 1973] Sussman, G. J. 1973. A Computational Model of Skill Acquisition.
  • [Waldinger 1975] Waldinger, R. 1975. Achieving Several Goals Simultaneously. Stanford Research Institute, Menlo Park, CA.
  • [Winograd 1971] Winograd, T. 1971. Procedures as a Representation for Data in a Computer Program for Understanding Natural Language. Massachusetts Institute of Technology, Cambridge, Project MAC.