1 Introduction
Image generation has been extensively studied in machine learning and computer vision. A vast number of papers have explored generating images through low-dimensional latent representations (Goodfellow et al., 2014; Arjovsky et al., 2017; Li et al., 2017; Kingma and Welling, 2013; van den Oord et al., 2017; Oord et al., 2016). However, it is challenging to learn disentangled representations that allow us to control the generation after learning the generative models (Higgins et al., 2017; Kim and Mnih, 2018; Locatello et al., 2018; Chen et al., 2016). In this paper, we explore generating constructive solid geometry images through programs expressed in a context-free grammar. We can consider these programs as an alternative low-dimensional representation of the image. The model that extracts the programs can be seen as an encoder, and the renderer that reconstructs the image is the decoder. Parsing an image of geometric shapes into programs enables us to manipulate only the desired components of the image while reconstructing the rest.
There are two types of image renderers that construct images from programs: differentiable and non-differentiable renderers. While creating differentiable renderers is a fast-developing field (e.g. Li et al., 2018; Liu et al., 2019), they are still not as well developed as the non-differentiable ones. On the other hand, learning a generative process by combining modern machine learning models (e.g. neural networks) with non-differentiable renderers is challenging because we cannot obtain the gradient with respect to the input.
The goal of our paper is to learn to parse an image into a program that can be described by a context-free grammar, such as a constructive solid geometry (CSG) image (Hubbard, 1990). Sharma et al. (2018) studied the same problem using supervised learning, where images paired with their programs are the input. However, we want to tackle a more general image parsing problem where the ground-truth program is not available for training. In this paper, we aim to solve it with only target images as input, without any other form of program supervision.
The following three components are key to learning to parse an image into a program with minimal supervision:

We adopt a non-differentiable renderer for this problem. The lack of direct gradient information from the non-differentiable renderer implies that we can only train our model based on the reward of the final reconstructed image. We use REINFORCE (Williams, 1992) as our main building block.

We introduce a Tree LSTM to impose a structure on our output space and reduce the search space to only valid programs. A naive parameterization of program generation uses a simple RNN, which is not guaranteed to always generate grammatically correct programs. However, the renderer cannot parse invalid programs, and the reconstructed images of invalid programs are defined to receive low rewards. Therefore, it is necessary to limit the search space to avoid the sparse reward problem.

We propose a better entropy estimator for the standard entropy regularization to encourage exploration of the search space. Specifically, we propose a stepwise entropy estimator and show that it has lower variance than the naive estimator. Instead of using simple Monte Carlo sampling, we adopt a sampling-without-replacement scheme to improve the optimization efficiency significantly.
2 Related Work
Our work is related to program synthesis, vision-as-inverse-graphics, and image generation with stroke-based rendering (SBR) systems.
Program synthesis has been of growing interest to researchers in machine learning (Balog et al., 2017; Shin et al., 2018; Devlin et al., 2017; Zohar and Wolf, 2018). Research on converting images to programs is more directly related to our work (Sharma et al., 2018; Ellis et al., 2019; Tian et al., 2019; Ellis et al., 2018; Liu et al., 2018). The setup of our work closely follows Sharma et al. (2018), and Ellis et al. (2019) is also an extension of it. Both works rely on supervised pretraining before using REINFORCE to fine-tune the models, while our goal is to train the grammar generation model with only the target images as inputs, without supervision from the corresponding programs of the training images. Tian et al. (2019) incorporated a differentiable renderer into the learning pipeline, while we treat our renderer as an external procedure independent of the learning process. With a non-differentiable renderer, we cannot directly propagate the gradient from the reward to the grammar generation model. This distinction leads to very different algorithm designs. Liu et al. (2018) focus on the symmetry and repetition aspects of a scene. Their algorithm takes in a scene of 3D geometric objects stacked or lined up and parses it into programs in terms of loops and rotations. Although related, this work has a very different setting from ours. Ellis et al. (2018) used a neural network to extract the shapes from hand-drawn sketches, formulated the grammatical rules as constraints, and obtained the final program by solving a constraint satisfaction problem. This process can be computationally expensive compared to a neural network's inference at test time.
Research in vision-as-inverse-graphics concerns parsing a scene into a collection of shapes or 3D primitives with descriptions that imitate the original scene (Tulsiani et al., 2017; Romaszko et al., 2017; Wu et al., 2017). Wu et al. (2017) de-render a scene into a collection of objects, such as a tree and a girl, with colors and locations as parameters. Yao et al. (2018) differ from the previous work in that they further manipulate the de-rendered objects, for example, changing a car's color from blue to red and putting it back into the scene. These works do not deal with the interactions among the parsed objects through operations, which distinguishes them from our setting.
Stroke-based rendering creates an image in a way that is natural to humans. Examples include recreating paintings by imitating a painter's brush strokes (Huang et al., 2019) and drawing sketches of objects (Ha and Eck, 2017). SPIRAL (Ganin et al., 2018) is an adversarially trained deep reinforcement learning agent that is able to recreate MuJoCo scenes, MNIST digits, and Omniglot characters. Stroke-based rendering behaves in an additive way. The action space usually consists of a line with various continuous parameters such as width and length. A grammar structure is unnecessary for the rendering process because each action is additive. This contrasts with our problem, in which the action space includes primitive shapes, operations, and grammar tokens. Overall, stroke-based rendering is less expressive and less precise when it comes to generating structured objects.
3 Problem Definition
The input of the model is an image constructed from basic shapes such as a square, circle, or triangle, each with a designated size and location (see Figure 6). The output of the model is a program that reconstructs the input, expressed in a context-free grammar (CFG).
In this paper, we use constructive solid geometry (CSG) (Hubbard, 1990) to form an image. Some sample images can be found in Figure 8. The CFG for CSG includes the binary shape operations of plus, minus, and intersection. The context-free grammar rules are as follows:
(1) $S \to E$
(2) $E \to E\,E\,T \mid P$
(3) $T \to + \mid - \mid *$
(4) $P \to \mathrm{SHAPE}_1 \mid \mathrm{SHAPE}_2 \mid \cdots \mid \mathrm{SHAPE}_n$
$S$, $T$, and $P$ are non-terminal tokens for the start, operations, and shapes. The others are called terminal tokens, such as $+$ (union), $*$ (intersection), $-$ (subtraction), and a shape token such as $c(x, y, r)$, which stands for a circle with radius $r$ at $(x, y)$. Each line above is called a production rule, or just a rule for simplicity. Please refer to Figure 7 for some examples of CSG images with their corresponding programs.
The setting of the problem is adopted from Sharma et al. (2018), who parse grammatical programs from images through supervised training. However, supervised training requires images paired with known programs as input. Manually constructing programs for training is not only time-consuming but also restricts the target images to be exact renderings of the programs. We are interested in unsupervised training because images usually do not come with their corresponding programs.
4 Proposed Algorithm
Our model consists of a CNN encoder for reading the canvas, an embedding layer for the action tokens, and an RNN for generating the grammatical program sequences (see Figure 1 for a demonstration). The model is trained with entropy-regularized REINFORCE (Williams, 1992). Let $H(S)$ be the entropy of the program sequence (we will define this later), $f(s)$ a reward function based on a sampled sequence $s$, and $\theta$ the parameters of our model. With $\alpha$ denoting the entropy coefficient, the objective is optimized as follows:
(5) $\Delta\theta \propto \mathbb{E}_{s \sim P_\theta}\big[f(s)\,\nabla_\theta \log P_\theta(s)\big] + \alpha\,\nabla_\theta H(S)$
The output program is converted to an image by a non-differentiable renderer. The image is compared to the target image and receives a reward, which is our definition of $f(s)$. We adopt the Chamfer distance as part of the reward function, as in Sharma et al. (2018). The Chamfer distance calculates the average matching distance to the nearest feature. Let $x \in X$ and $y \in Y$ be pixels in each image respectively. The Chamfer distance is described formally as follows:
(6) $Ch(X, Y) = \frac{1}{2|X|} \sum_{x \in X} \min_{y \in Y} \|x - y\|_2 + \frac{1}{2|Y|} \sum_{y \in Y} \min_{x \in X} \|x - y\|_2$
The Chamfer distance is scaled by $\rho$, the length of the image diagonal, so that it lies between 0 and 1. In this problem setting, the resulting similarity $1 - Ch(X, Y)/\rho$ mostly falls between 0.9 and 1. To magnify the reward difference among actions, we raise it to the power of 20. We add a second component to the reward function, based on pixel differences, in order to differentiate shapes with similar sizes and locations. The final reward function can be described as follows:
(7) 
Here $X$ represents the original image and $Y$ the generated image. When a generated image receives a very low reward from the function, its exact reward value provides little insight into its quality and depends largely on its target image. We therefore chose 0.3 as the minimum reward in order to simplify the reward function's behavior when the generated images are of poor quality. A similar reward clipping idea was proposed for DQN (Mnih et al., 2013), where it was used to unify the reward ranges across different games to $[-1, 1]$.
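The Chamfer part of the reward can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: images are nested lists of 0/1 values, the distance is computed on all foreground pixels, an empty canvas is treated as maximally distant, and the pixel-difference component is omitted. The function names are hypothetical; the exponent 20 and the floor of 0.3 follow the values mentioned in the text.

```python
import math

def foreground(img):
    """Return the (row, col) coordinates of all non-zero pixels."""
    return [(r, c) for r, row in enumerate(img) for c, v in enumerate(row) if v]

def chamfer_distance(img_a, img_b):
    """Symmetric average nearest-neighbour distance between foreground pixels."""
    pa, pb = foreground(img_a), foreground(img_b)
    if not pa or not pb:
        # Assumption: an empty canvas is as far away as the image diagonal.
        return math.hypot(len(img_a), len(img_a[0]))

    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return 0.5 * (one_way(pa, pb) + one_way(pb, pa))

def chamfer_reward(target, generated, gamma=20, min_reward=0.3):
    """Scale by the diagonal, sharpen with an exponent, and clip from below."""
    rho = math.hypot(len(target), len(target[0]))  # length of the image diagonal
    similarity = 1.0 - chamfer_distance(target, generated) / rho
    return max(min_reward, similarity ** gamma)
```

The brute-force nearest-neighbour search is quadratic in the number of foreground pixels, which is acceptable for a small canvas but would normally be replaced by a distance transform.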
Our setup closely follows Sharma et al. (2018). However, naively extending Sharma et al. (2018) by optimizing reinforcement learning objectives does not learn to generate grammatically correct programs. We introduce two crucial designs to improve the learning. First, we adopt a grammar-encoded tree LSTM to ensure a valid output sequence, together with an image stack that provides intermediate images. Second, we propose a lower-variance entropy estimator to improve both the estimation and the optimization.
4.1 Grammar Encoded Tree LSTM
There are three main types of production rules in grammatical program generation: shape selection ($P$), operation selection ($T$), and grammar selection ($E$). Grammar selection in this problem setting includes $E \to EET$ and $E \to P$, as in the grammar definition, and decides whether the program will expand. Let the set of shapes be $\mathcal{P}$, the set of operations be $\mathcal{T}$, and the set of non-terminal outcomes be $\mathcal{E}$ (e.g. $EET$ in (2)). A naive parameterization is to let the candidate set of the LSTM output be $\mathcal{P} \cup \mathcal{T} \cup \{\$\}$, where $\$$ is the end token, and to treat the model as a standard language model that generates the program (Sharma et al., 2018). Such a model does not explicitly encode the grammar structure and is expected to capture it implicitly during the learning process. The drawback is that the generated program is not guaranteed to be grammatically correct. For example, the model may generate a sequence in which an operation precedes its two operand shapes, which is an invalid program.
Instead of blindly sampling terminal tokens, it is easy to sample grammatically correct programs from the production rules of a well-defined grammar. Therefore, we propose a grammar (tree) LSTM that takes the grammar production rules into account during generation; it is guaranteed to always generate grammatically correct programs and significantly reduces the search space during training. Grammar-encoded models have been used in language tasks (Tai et al., 2015; Wang et al., 2018). They used multiple RNNs for different non-terminal tokens to ensure that the output words follow the grammar rules. Such models are not popular because there are a significant number of exceptions to the rules in natural languages. In contrast, a CFG is a well-defined grammar, making it an ideal candidate to be modeled by a grammar-encoded tree LSTM.
The proposed model can be understood as an RNN with a masking mechanism that maintains a stack to rule out invalid outputs. We increase the size of the output space from $|\mathcal{P}| + |\mathcal{T}| + 1$ in the previous approach (e.g. Sharma et al., 2018) to $|\mathcal{P}| + |\mathcal{T}| + |\mathcal{E}| + 1$ by including the non-terminal tokens. During generation, we maintain a stack to trace the current production rule. Based on the current non-terminal token, we use the masking mechanism to weed out the invalid output candidates. For example, if we are handling the non-terminal $T$, we mask the invalid outputs to reduce the candidate size from $|\mathcal{P}| + |\mathcal{T}| + |\mathcal{E}| + 1$ to $|\mathcal{T}|$ only. For more details, please refer to Tai et al. (2015).
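One common way to realize such a mask is to add zeros to the valid logits and large negative numbers to the invalid ones before the softmax. The sketch below illustrates this on a toy vocabulary. The token names, the tiny vocabulary, and the constant `NEG_INF` are assumptions made for the example, and the logits are made-up numbers rather than model outputs.

```python
import math

NEG_INF = -1e9  # large negative number added to invalid entries

def masked_softmax(logits, valid):
    """Add 0 to valid logits and NEG_INF to invalid ones, then apply softmax."""
    masked = [x + (0.0 if ok else NEG_INF) for x, ok in zip(logits, valid)]
    top = max(masked)                              # subtract max for stability
    exps = [math.exp(x - top) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical output vocabulary: two grammar choices, three operations, two shapes.
VOCAB = ["EET", "P", "+", "-", "*", "shape_0", "shape_1"]

# For each non-terminal on top of the grammar stack, which outputs are allowed.
VALID = {
    "E": [tok in ("EET", "P") for tok in VOCAB],
    "T": [tok in ("+", "-", "*") for tok in VOCAB],
    "P": [tok.startswith("shape_") for tok in VOCAB],
}

# When an operation non-terminal is being expanded, only operations keep mass.
probs = masked_softmax([0.3, -1.2, 2.0, 0.1, 0.5, 1.7, -0.4], VALID["T"])
```

Because the mask is additive, the same network output can be reused for every non-terminal; only the mask vector changes from step to step.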
To boost the performance, we further propose an image stack that maintains the intermediate images during generation and feeds them into the model to provide additional information. An example is given below; each step can be traced through Figure 1.

A grammar stack with a start token $S$ and an end token $\$$, as well as an empty image stack, is initialized.

In the first iteration, the token $S$ is popped. Following (1), all options are masked except $E$, the only possible output. The $E$ token is added to the stack.

In the second iteration, or any iteration where the token $E$ is popped, for all examples, all softmax outputs are masked except the entries representing $EET$ and $P$ according to (2). If $EET$ is sampled, the $T$, $E$, and $E$ tokens are added to the stack separately in that exact order to expand the program further. If $P$ is sampled, it is added to the stack and the program cannot expand further.

When a shape token is sampled, it is not added to the grammar stack because shape tokens do not contribute to the program structure. Instead, the image of the shape is pushed onto the image stack.

When an operation token is sampled, it is also not added to the grammar stack. Instead, we pop the top two images, apply the operation to them, and push the resulting image back onto the image stack.

When all the added tokens have been popped, the end token is popped in the last iteration. We then finish the sampling as in standard RNN language models.
Before a program is finished, there can be multiple images in the image stack that have yet to be assembled by operations. We treat them as observations of the process and feed them to the LSTM as part of the input to better infer the future direction in the search space.
In this process, the model produces a sequence of tokens, including grammatical, shape, and operation tokens. We keep only the terminal tokens as the final output program and discard the rest. The programs are guaranteed to be grammatically correct.
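The stack procedure described above can be traced with a small simulation. The sketch below is only illustrative: a uniform random choice stands in for the LSTM policy, shapes are hypothetical pixel sets on a tiny canvas, the symbols `S`, `E`, `T`, `P` and the end marker `$` are assumed names for the grammar tokens, and the `max_shapes` cap is an added safeguard to keep programs short.

```python
import random

# Hypothetical shapes, each a set of (row, col) pixels on a tiny canvas.
SHAPES = {
    "sq": frozenset((r, c) for r in range(3) for c in range(3)),
    "bar": frozenset((1, c) for c in range(5)),
    "dot": frozenset({(1, 1)}),
}
OPS = {"+": frozenset.union, "*": frozenset.intersection, "-": frozenset.difference}
# Valid outputs for each non-terminal; everything else is masked out.
RULES = {"S": ["E"], "E": ["EET", "P"], "T": list(OPS), "P": list(SHAPES)}

def sample_program(rng, max_shapes=3):
    grammar_stack = ["$", "S"]          # end token at the bottom, start on top
    image_stack, program = [], []
    while True:
        token = grammar_stack.pop()
        if token == "$":                # end token popped: sampling is finished
            break
        options = RULES[token]
        if token == "E":
            # Every pending E becomes at least one shape; cap the total count.
            pending = sum(t in ("E", "P") for t in grammar_stack) + 1
            emitted = sum(t in SHAPES for t in program)
            if pending + emitted >= max_shapes:
                options = ["P"]
        choice = rng.choice(options)    # stand-in for sampling from the policy
        if choice == "EET":
            grammar_stack.extend(["T", "E", "E"])   # pushed in the order T, E, E
        elif choice in ("E", "P"):
            grammar_stack.append(choice)
        elif choice in SHAPES:          # terminal shape: render and push image
            program.append(choice)
            image_stack.append(SHAPES[choice])
        else:                           # terminal operation: combine two images
            program.append(choice)
            right, left = image_stack.pop(), image_stack.pop()
            image_stack.append(OPS[choice](left, right))
    return program, image_stack[0]

program, image = sample_program(random.Random(0))
```

Only terminal tokens are appended to `program`, so the returned sequence is a postfix expression that is valid by construction, and `image_stack` holds exactly one image when the end token is reached.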
In practice, we implement the masking mechanism by adding a vector to the output before passing it into the softmax layer to get the probabilities. The vector contains 0 for valid outputs and large negative numbers for invalid ones. This ensures that invalid options have almost zero probability of being sampled. The input of the RNN cell includes the encoded target image and intermediate images from the image stack, the embedded token popped from the grammar stack, and the hidden state from the RNN's last iteration. Following previous works in NLP, we call the model Tree LSTM. The exact algorithm is in Algorithm 1.
4.2 Exploration with Entropy Regularization
Entropy regularization in reinforcement learning is a standard practice for encouraging exploration. We argue that a careful design of entropy estimation with lower variance can enhance the exploration effect.
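Let $S$ be the random variable of possible programs. The entropy is denoted as $H(S)$ (here we overload $S$ and $E$ as used in (1) and (2) to follow the convention). The number of possible outcomes of $S$ can be exponentially large. Therefore, we usually estimate the entropy via
(8) $\hat{H} = -\frac{1}{K} \sum_{k=1}^{K} \log P(s^k)$
with finite samples $\{s^k\}_{k=1}^{K}$. Without further assumptions, we are not able to improve $\hat{H}$. However, we can decompose the program $S$ into $X_1, \dots, X_n$, where each $X_j$ is the random variable for the token at position $j$ in the program. Under autoregressive models (e.g. RNNs), we can further access the conditional probabilities. Therefore, we propose a decomposed entropy estimator $\hat{H}_D$ as
(9) $\hat{H}_D = \frac{1}{K} \sum_{k=1}^{K} \sum_{j=1}^{n} H(X_j \mid X_1 = x_1^k, \dots, X_{j-1} = x_{j-1}^k)$
where $s^k = (x_1^k, \dots, x_n^k)$, and $H(X_j \mid X_1 = x_1^k, \dots, X_{j-1} = x_{j-1}^k)$ is the conditional entropy.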
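Lemma 4.1.
The proposed decomposed entropy estimator $\hat{H}_D$ is unbiased with lower variance than the naive estimator $\hat{H}$, that is, $\mathbb{E}[\hat{H}_D] = H(S)$ and $\mathrm{Var}(\hat{H}_D) \le \mathrm{Var}(\hat{H})$.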
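The claim of the lemma can be checked numerically on a toy autoregressive model. The sketch below is an illustration under assumptions, not part of the paper's experiments: the model has two binary tokens with hand-picked probabilities, and the names `naive_estimate` and `stepwise_estimate` are hypothetical. The naive estimator averages the negative sequence log-probability, while the stepwise estimator sums the conditional entropies along each sampled prefix.

```python
import math
import random
import statistics

# Toy autoregressive model: P(x1) and P(x2 | x1), chosen by hand for the example.
P_FIRST = [0.5, 0.5]
P_SECOND = {0: [0.5, 0.5], 1: [0.9, 0.1]}

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)

def sample_sequence(rng):
    x1 = rng.choices([0, 1], weights=P_FIRST)[0]
    x2 = rng.choices([0, 1], weights=P_SECOND[x1])[0]
    return x1, x2

def naive_estimate(samples):
    """Average negative log-probability of whole sequences."""
    return -statistics.fmean(
        [math.log(P_FIRST[x1] * P_SECOND[x1][x2]) for x1, x2 in samples]
    )

def stepwise_estimate(samples):
    """Average of summed conditional entropies along each sampled prefix."""
    return statistics.fmean(
        [entropy(P_FIRST) + entropy(P_SECOND[x1]) for x1, _ in samples]
    )

def compare(num_repeats=2000, k=10, seed=0):
    rng = random.Random(seed)
    naive, stepwise = [], []
    for _ in range(num_repeats):
        samples = [sample_sequence(rng) for _ in range(k)]
        naive.append(naive_estimate(samples))
        stepwise.append(stepwise_estimate(samples))
    return naive, stepwise

# Exact entropy via the chain rule H(X1) + E[H(X2 | X1)], used as reference.
EXACT = entropy(P_FIRST) + sum(p * entropy(P_SECOND[x]) for x, p in enumerate(P_FIRST))
```

Calling `compare()` and taking `statistics.variance` of the two returned lists shows both estimators centered on `EXACT`, with the stepwise one noticeably less spread out, because the randomness of the last token is integrated out analytically.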
4.3 Effective Entropy Optimization
We have established our REINFORCE objective with entropy regularization, and the next step is to optimize it efficiently. In this section, we show that sampling without replacement is more data-efficient than sampling with replacement. We demonstrate our point with an experiment in a synthetic setting (Figure 2). In this experiment, we initialize a distribution over 100 variables, three of which have significantly higher probability than the others (as in the second image in Figure 2). The loss function is the entropy estimated from the sampled variables. For sampling with replacement, the estimate is the average of $-\log p_i$ over the sampled variables. In both cases, $p_i$ represents the probability of the $i$th sampled variable and $n$ is the total number of variables drawn. The estimate for sampling without replacement additionally involves the renormalized probability of each variable after removing the previously sampled variables, together with an importance weight for each term.
In this experiment, we chose $n = 20$ for sampling without replacement and $n = 40$ for sampling with replacement. The increase in entropy when sampling 20 variables without replacement is more rapid than when sampling 40 variables with replacement under the same setting. At the end of the 700 iterations, the distribution obtained by sampling without replacement is visibly more uniform than the distribution obtained by sampling with replacement.
Implementing sampling without replacement on a tree structure can be challenging. We do not have the resources to instantiate every possible path and perform sampling without replacement bottom-up. Instead, we adopt a form of stochastic beam search that combines top-down sampling without replacement with the Gumbel trick, which is equivalent to sampling without replacement bottom-up (Kool et al., 2019b). Please refer to Algorithm 2 for the full algorithm, including the sampling-without-replacement and tree LSTM components, and to Appendix A for more details.
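The Gumbel trick mentioned above is easiest to see in the flat case, where all outcomes can be enumerated. The sketch below perturbs each log-probability with independent Gumbel noise and keeps the k largest, which yields k distinct outcomes distributed as a sample without replacement. It is not the top-down stochastic beam search of Algorithm 2, and the function name `gumbel_top_k` is chosen only for this example.

```python
import math
import random

def gumbel_top_k(log_probs, k, rng):
    """Draw k distinct indices by perturbing log-probabilities with Gumbel noise."""
    keys = []
    for index, log_p in enumerate(log_probs):
        u = rng.random()
        while u == 0.0:                 # avoid log(0); random() can return 0.0
            u = rng.random()
        gumbel = -math.log(-math.log(u))
        keys.append((log_p + gumbel, index))
    keys.sort(reverse=True)             # largest perturbed value first
    return [index for _, index in keys[:k]]

rng = random.Random(0)
log_probs = [math.log(p) for p in (0.7, 0.2, 0.1)]
draw = gumbel_top_k(log_probs, 2, rng)  # two distinct indices, in sampled order
```

With `k=1` this reduces to the Gumbel-max trick for drawing a single categorical sample. In the tree-structured setting, enumerating all leaves is infeasible, which is why a top-down variant is needed.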
5 Experiments
In this section, we investigate how much each of the design features affects the learning process. We show that the entropy estimator we proposed (9) has smaller variance than the sequence log-probability method (8). We also train the model with supervision and compare its results to the unsupervised method. In the last experiment, we compare the data efficiency of sampling with and without replacement.
5.1 Reward Comparison of Design Features
Table 1: Dataset sizes by program length.
Type | Length 5 | Length 7 | Length 9
Training set size | 3600 | 4800 | 12000
Testing set size | 586 | 789 | 4630
We used a synthetic dataset to test our algorithm and compare the effects of each feature. The algorithm chooses from 27 shape actions, 3 operation actions, and 2 grammar actions to create an image on a 64 by 64 canvas. The 27 shape actions have their size, position on the canvas, and type of geometry (circle, triangle, or square) encoded in the selection. For evaluation purposes, our synthetic dataset includes the ground-truth program for each image. We separated our dataset by program length to study the effect of image complexity on the learning process. Program length also determines the number of shapes in the images: programs of length 5 have 3 shapes and 2 operation actions, programs of length 7 have 4 shapes and 3 operation actions, and so on. For each program length, we generated all possible combinations of shape actions and operation actions in text and then filtered out the duplicates and empty images. Images are considered duplicates if no more than 120 pixels differ between the two, and are considered empty if there are no more than 120 pixels on the canvas. For dataset sizes, please refer to Table 1.
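The filtering step can be sketched as follows. This is one possible reading of the rule, assuming binary images stored as nested lists and treating the 120-pixel threshold as inclusive in both tests. The helper names, including `rectangle`, are hypothetical and exist only to build example canvases.

```python
def pixel_count(img):
    """Number of foreground pixels on the canvas."""
    return sum(1 for row in img for v in row if v)

def pixel_difference(img_a, img_b):
    """Number of positions where exactly one of the two images is filled."""
    return sum(
        1
        for row_a, row_b in zip(img_a, img_b)
        for va, vb in zip(row_a, row_b)
        if bool(va) != bool(vb)
    )

def filter_images(images, threshold=120):
    """Drop near-empty images and near-duplicates of images already kept."""
    kept = []
    for img in images:
        if pixel_count(img) <= threshold:
            continue                    # treated as an empty image
        if any(pixel_difference(img, other) <= threshold for other in kept):
            continue                    # treated as a duplicate
        kept.append(img)
    return kept

def rectangle(top, left, height, width, size=64):
    """Hypothetical helper: a filled rectangle on a size-by-size canvas."""
    return [
        [1 if top <= r < top + height and left <= c < left + width else 0
         for c in range(size)]
        for r in range(size)
    ]
```

The pairwise comparison is quadratic in the number of candidate images, so a real pipeline over all shape and operation combinations would likely need hashing or bucketing to stay tractable.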
For this dataset, we sampled 19 distinct programs for each target image to approximate the objective function. The coefficient of the negative entropy term is 0.05 and the learning rate is 0.01.
We compared each feature's effect on the learning process on all three datasets of increasing difficulty. When we train the model via sampling with replacement, the training process is unable to escape a discrete local optimum, as seen in Figure 3 (yellow). The starting reward is also lower compared to sampling without replacement because the final reward is the maximum reward over all distinct programs sampled. When we remove the entropy term from the objective function, the reward still improves on the length 5 dataset, but it barely improves as the program length increases. Without the tree structure, the reward stays around 0.3, the lowest possible reward, in Figure 3 (green) because the model is unable to generate a valid program to render. All of these design features are crucial to learning increasingly complex images. We measure our converged algorithm's performance on the test sets of the three datasets with the Chamfer and IoU metrics (Table 2 (Left)). Figure 7 provides some qualitative examples for the algorithms in Figure 3.
The Chamfer metric is defined as $1 - Ch(X, Y)/\rho$, where $Ch$ is the Chamfer distance, $\rho$ is the number of pixels on the diagonal of an image, and $X$ and $Y$ are the target and generated images, as defined in the first term of Equation 7. The value is between 0 and 1. If the generated image is a perfect match with the target, it receives 1 in this metric. The IoU metric is defined as the intersection between the two images over the union: $|X \cap Y| / |X \cup Y|$.
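For reference, the IoU metric can be computed on binary images as in the short sketch below. The representation as nested lists and the convention that two empty images score 1.0 are assumptions made for the example.

```python
def iou(target, generated):
    """Intersection over union of the foreground pixels of two binary images."""
    intersection = union = 0
    for row_t, row_g in zip(target, generated):
        for t, g in zip(row_t, row_g):
            intersection += bool(t) and bool(g)
            union += bool(t) or bool(g)
    # Assumption: two empty canvases are treated as a perfect match.
    return intersection / union if union else 1.0
```

Unlike the Chamfer metric, IoU ignores how far apart mismatched pixels are, so the two metrics can rank near-miss reconstructions differently.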
5.2 Reward Comparison with Supervised Learning Method
To study the difference between supervised and unsupervised learning, we compared the training and testing results of a supervised learning method that uses the same neural network model. The input at each step is the concatenation of the embedded ground-truth program and the encoded final and intermediate images. We used the same synthetic dataset and the same Chamfer similarity metric as in Table 2 to measure the quality of the generated data. The testing results of the supervised method worsen with the increasing complexity (program length) of the test set, while the training results are almost perfect across all three datasets. Meanwhile, the unsupervised method receives consistently high scores. This shows that the supervised training method does not generalize well to unseen data in comparison to the unsupervised learning method (Table 2). Our conjecture is that supervised learning optimizes a loss function in the program space while unsupervised learning optimizes a reward function in the image space, and two programs that are very close in the program space may result in very different images. We leave a more rigorous explanation as future work.

5.3 Reward Comparison with Supervised Pretraining Method
In this experiment, we pretrained the supervised model described in Section 5.2 on a third of the synthetic training dataset until convergence. We then take the model and train it further with REINFORCE, without the tree structure or sampling without replacement, on the full training sets. We report the reward curve throughout the training process in Figure 4. In this figure, there is a sharp drop in reward during training on all three datasets. Our explanation is that while the supervised pretrained model provides a grammatical structure for the output programs, it is unable to retain the structure after updates, resulting in the drop in reward.
5.4 Variance Study of Entropy Estimation
In this study (Figure 5), we demonstrate that the decomposed entropy estimator (Equation 9) achieves a lower variance than the naive estimator (Equation 8), as discussed in Section 4.2.
We take a single model saved at epoch 40 during training on each of the length 5, 7, and 9 datasets and estimate the entropy with the weighted-sum stepwise method (Equation 9) and the sequence log-probability method (Equation 8). We also consider two sampling schemes: sampling without replacement and sampling with replacement. We combine both entropy estimation methods with the two sampling schemes, creating four instances for comparison. The x-axis of the plot shows the number of distinct programs in the sampling-without-replacement method. For sampling with replacement, the number of distinct programs is replaced with the number of repetitions. We obtain a single estimate of the entropy by taking the mean after running the sampling-with-replacement scheme the same number of times as the number of distinct programs. We further repeat the estimation 100 times to obtain the mean and variance. The means of the sampling-with-replacement method act as a baseline for the means of the sampling-without-replacement method, while we compare the variances of the two entropy estimation methods.
We experimented with the number of distinct programs ranging from 2 to 80. In all three datasets, the stepwise entropy estimator (green) shows significantly smaller variance. However, we notice that longer programs, or more complex images, require many more distinct programs to reduce the variance in estimation. This makes sense because the search space increases exponentially with program length. There also exists bias in the estimation using the sampling-without-replacement method. In all three cases, the bias vanishes once the number of distinct programs exceeds 10. The bias is greater in datasets with longer program lengths.
6 Discussion
In this paper, we proposed an entropy-regularized, REINFORCE-based algorithm with a grammar-encoded tree LSTM that leverages grammatical structure to parse a CSG image into a context-free grammar program. It is the first paper to successfully parse a CSG image with a non-differentiable renderer into a CFG program without program supervision. Our ingredients include a tree LSTM that guarantees the correctness of the output programs, an unbiased and low-variance entropy estimator of a program sequence, and sampling without replacement to improve data and optimization efficiency. Our experiments have demonstrated the importance of each of our design features quantitatively and qualitatively.
References
 Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
 Balog et al. (2017) Matej Balog, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow. Deepcoder: Learning to write programs. In International Conference on Learning Representations (ICLR), 2017.
 Chen et al. (2016) Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, 2016.
 Cover and Thomas (2012) Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.
 Devlin et al. (2017) Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdelrahman Mohamed, and Pushmeet Kohli. Robustfill: Neural program learning under noisy i/o. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 990–998. JMLR.org, 2017.
 Ellis et al. (2018) Kevin Ellis, Daniel Ritchie, Armando Solar-Lezama, and Josh Tenenbaum. Learning to infer graphics programs from hand-drawn images. In Advances in neural information processing systems, pages 6059–6068, 2018.
 Ellis et al. (2019) Kevin Ellis, Maxwell Nye, Yewen Pu, Felix Sosa, Josh Tenenbaum, and Armando Solar-Lezama. Write, execute, assess: Program synthesis with a repl, 2019.
 Ganin et al. (2018) Yaroslav Ganin, Tejas Kulkarni, Igor Babuschkin, SM Eslami, and Oriol Vinyals. Synthesizing programs for images using reinforced adversarial learning. arXiv preprint arXiv:1804.01118, 2018.
 Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 Ha and Eck (2017) David Ha and Douglas Eck. A Neural Representation of Sketch Drawings. ArXiv e-prints, April 2017.
 Higgins et al. (2017) Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. ICLR, 2017.
 Huang et al. (2019) Zhewei Huang, Wen Heng, and Shuchang Zhou. Learning to paint with model-based deep reinforcement learning. arXiv preprint arXiv:1903.04411, 2019.
 Hubbard (1990) Philip M Hubbard. Constructive solid geometry for triangulated polyhedra. 1990.
 Kim and Mnih (2018) Hyunjik Kim and Andriy Mnih. Disentangling by factorising. arXiv preprint arXiv:1802.05983, 2018.
 Kingma and Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 Kool et al. (2019a) Wouter Kool, Herke van Hoof, and Max Welling. Buy 4 reinforce samples, get a baseline for free! 2019.
 Kool et al. (2019b) Wouter Kool, Herke van Hoof, and Max Welling. Stochastic beams and where to find them: The Gumbel-top-k trick for sampling sequences without replacement. In International Conference on Machine Learning, pages 3499–3508, 2019.

 Li et al. (2017) Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos. MMD GAN: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, pages 2203–2213, 2017.
 Li et al. (2018) Tzu-Mao Li, Miika Aittala, Frédo Durand, and Jaakko Lehtinen. Differentiable monte carlo ray tracing through edge sampling. In SIGGRAPH Asia 2018 Technical Papers, page 222. ACM, 2018.
 Liu et al. (2018) Yunchao Liu, Zheng Wu, Daniel Ritchie, William T Freeman, Joshua B Tenenbaum, and Jiajun Wu. Learning to describe scenes with programs. 2018.
 Liu et al. (2019) Hsueh-Ti Derek Liu, Michael Tao, Chun-Liang Li, Derek Nowrouzezahrai, and Alec Jacobson. Beyond pixel norm-balls: Parametric adversaries using an analytically differentiable renderer. 2019.
 Locatello et al. (2018) Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. arXiv preprint arXiv:1811.12359, 2018.
 Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 Oord et al. (2016) Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
 Romaszko et al. (2017) Lukasz Romaszko, Christopher KI Williams, Pol Moreno, and Pushmeet Kohli. Vision-as-inverse-graphics: Obtaining a rich 3d explanation of a scene from a single image. In Proceedings of the IEEE International Conference on Computer Vision, pages 851–859, 2017.

 Sharma et al. (2018) Gopal Sharma, Rishabh Goyal, Difan Liu, Evangelos Kalogerakis, and Subhransu Maji. CSGNet: Neural shape parser for constructive solid geometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5515–5523, 2018.
 Shin et al. (2018) Richard Shin, Illia Polosukhin, and Dawn Song. Improving neural program synthesis with inferred execution traces. In Advances in Neural Information Processing Systems, pages 8917–8926, 2018.

Tai et al. (2015) Kai Sheng Tai, Richard Socher, and Christopher D Manning. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1556–1566, 2015.
Tian et al. (2019) Yonglong Tian, Andrew Luo, Xingyuan Sun, Kevin Ellis, William T Freeman, Joshua B Tenenbaum, and Jiajun Wu. Learning to infer and execute 3D shape programs. arXiv preprint arXiv:1901.02875, 2019.
 Tulsiani et al. (2017) Shubham Tulsiani, Hao Su, Leonidas J Guibas, Alexei A Efros, and Jitendra Malik. Learning shape abstractions by assembling volumetric primitives. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2635–2643, 2017.
 van den Oord et al. (2017) Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In Advances in Neural Information Processing Systems, pages 6306–6315, 2017.
Wang et al. (2018) Xinyi Wang, Hieu Pham, Pengcheng Yin, and Graham Neubig. A tree-based decoder for neural machine translation. arXiv preprint arXiv:1808.09374, 2018.
Williams (1992) Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
Wu et al. (2017) Jiajun Wu, Joshua B Tenenbaum, and Pushmeet Kohli. Neural scene de-rendering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 699–707, 2017.
Yao et al. (2018) Shunyu Yao, Tzu Ming Hsu, Jun-Yan Zhu, Jiajun Wu, Antonio Torralba, Bill Freeman, and Josh Tenenbaum. 3D-aware scene manipulation via inverse graphics. In Advances in Neural Information Processing Systems, pages 1887–1898, 2018.
 Zohar and Wolf (2018) Amit Zohar and Lior Wolf. Automatic program synthesis of long programs with a learned garbage collector. In Advances in Neural Information Processing Systems, pages 2094–2103, 2018.
Appendix A Sampling Without Replacement
This section describes how we achieve sampling without replacement with the help of stochastic beam search [Kool et al., 2019b].
At each step of generation, the algorithm chooses the top k branches to expand based on the stochastic score $G_{\phi}$ of each branch at time step $t$. The score $G_{\phi}$ is sampled from $\mathrm{Gumbel}(\phi)$, where $\phi$ is the log probability of the partial sequence at time step $t$, conditioned on its parent's score being the maximum. Please refer to 3 for details of the branching process.
The sampling without replacement algorithm requires correct scaling of the objective functions to ensure unbiasedness. The scaling term is $p_\theta(s^i) / q_{\theta,\kappa}(s^i)$. Here $p_\theta(s^i)$ represents the probability of the sequence $s^i$ and $S$ represents the set of all sampled sequences $s^i$ for $i = 1, \dots, k$. The denominator is $q_{\theta,\kappa}(s^i) = P(G_{s^i} > \kappa) = 1 - \exp(-\exp(\phi_i - \kappa))$, where $\phi_i = \log p_\theta(s^i)$ and $\kappa$ is the $(k+1)$th largest score among all the possible branches. It acts as a threshold for branching selection. In the implementation, we need to keep an extra beam, thus $k+1$ beams in total, to accurately estimate $\kappa$ and so ensure the unbiasedness of the estimator without normalization. Additional normalization terms $W(S)$ and $W^i(S)$ are employed to reduce variance, but they increase the bias of the final estimate. The exact objective is as follows [Kool et al., 2019a]:
(10) 
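The baseline term is defined as $B(S) = \frac{1}{W(S)} \sum_{s^i \in S} \frac{p_\theta(s^i)}{q_{\theta,\kappa}(s^i)} f(s^i)$, the importance-weighted average of the rewards $f(s^i)$ of the sampled sequences [Kool et al., 2019a]. Incorporating a baseline into the REINFORCE objective is standard practice for reducing the variance of the estimate.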
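For reference, the normalized estimator with a built-in baseline from Kool et al. [2019a] can be written as below. This is a sketch in that paper's notation, not a confirmed reproduction of Eq. (10): the reward symbol $f$, the sequence symbol $s^i$, and the exact form of $W(S)$ and $W^i(S)$ are assumptions taken from the cited work.

```latex
\nabla_\theta \, \mathbb{E}_{s \sim p_\theta}\!\left[ f(s) \right]
  \approx \sum_{s^i \in S} \frac{1}{W^i(S)} \,
  \frac{p_\theta(s^i)}{q_{\theta,\kappa}(s^i)} \,
  \nabla_\theta \log p_\theta(s^i) \,
  \bigl( f(s^i) - B(S) \bigr),
\qquad
W(S) = \sum_{s^i \in S} \frac{p_\theta(s^i)}{q_{\theta,\kappa}(s^i)},
\qquad
W^i(S) = W(S) - \frac{p_\theta(s^i)}{q_{\theta,\kappa}(s^i)} + p_\theta(s^i).
```

Dropping the $1/W^i(S)$ factor and the baseline recovers the unnormalized estimator, which is the unbiased one.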
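Entropy estimation uses a scaling scheme similar to that of the REINFORCE objective: the conditional entropy of the next step given each sampled prefix $s^i_{1:t}$ is weighted by $p_\theta(s^i_{1:t}) / q_{\theta,\kappa}(s^i_{1:t})$ and divided by the normalization term $W_t(S) = \sum_{s^i \in S} p_\theta(s^i_{1:t}) / q_{\theta,\kappa}(s^i_{1:t})$, where $s^i_{1:t}$ denotes the first $t$ elements of the sequence $s^i$. The estimator is unbiased excluding the $1/W_t(S)$ term. The normalization term $W_t(S)$ reduces the variance of the estimator.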
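Both the REINFORCE objective and the entropy estimate rescale each sampled sequence by the ratio of its probability to its inclusion probability. The following is a minimal sketch of that weight, assuming the Gumbel inclusion probability $1 - \exp(-\exp(\phi - \kappa))$ described by Kool et al. [2019b]; the function names are hypothetical and not taken from the authors' code.

```python
import numpy as np

def inclusion_prob(phi, kappa):
    """P(G > kappa) for G ~ Gumbel(phi), i.e. 1 - exp(-exp(phi - kappa)).

    expm1 keeps precision when exp(phi - kappa) is tiny.
    """
    return -np.expm1(-np.exp(phi - kappa))

def importance_weights(phi, kappa, normalize=False):
    """Weights p / q for sampled sequences with log-probabilities phi.

    With normalize=True the weights are divided by their sum, which lowers
    variance at the cost of some bias.
    """
    w = np.exp(phi) / inclusion_prob(phi, kappa)
    return w / w.sum() if normalize else w

# Illustrative numbers only: three sampled sequences and a threshold kappa.
phi = np.log(np.array([0.5, 0.3, 0.1]))
weights = importance_weights(phi, kappa=-1.5)
normalized = importance_weights(phi, kappa=-1.5, normalize=True)
```

Because the inclusion probability is at most 1, each unnormalized weight is at least as large as the sequence probability itself.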
The input of the function Sampling_without_Replacement at time step $t$ is a matrix with one row per beam that we maintain and one column per action in the action space. The entry $\phi_{i,j}$ of the matrix is equal to the log probability of the partial sequence that consists of the partial sequence of beam $i$ followed by action $j$. Row $i$ of the matrix therefore represents the log probabilities of the partial sequence of beam $i$ expanded by one more step with every potential action.
For each beam $i$, we sample a Gumbel random variable $G_{\phi_{i,j}}$ with location $\phi_{i,j}$ for each element $\phi_{i,j}$ of row $i$ of the input matrix. We then adjust each Gumbel random variable by conditioning on its parent's stochastic score being the largest; the resulting value $\tilde{G}_{\phi_{i,j}}$ is the stochastic score of the corresponding potential expansion:

$$\tilde{G}_{\phi_{i,j}} = -\log\left(\exp(-G_{i,t-1}) - \exp(-Z_i) + \exp(-G_{\phi_{i,j}})\right)$$

Here $Z_i = \max_j G_{\phi_{i,j}}$ is the largest value in the vector of sampled Gumbel variables of beam $i$, and $G_{i,t-1}$ is the stochastic score of the parent of all possible outcomes of the $i$-th beam at step $t$. Conditioning on the parent's stochastic score being the largest in this top-down sampling scheme makes sure that each leaf's stochastic score is sampled independently, which is equivalent to sampling the sequences bottom-up [Kool et al., 2019b]. Once we have aggregated the stochastic scores of all potential expansions of the beams, we select the top $k+1$ expansions. We maintain one more beam than we intend to expand because we need the $(k+1)$th largest stochastic score as the threshold when estimating the entropy and the REINFORCE objective.
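The expansion step described above can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the authors' Sampling_without_Replacement function: the names `condition_on_max` and `expand_beams` are hypothetical, the example numbers are made up, and the conditioning uses the numerically stable rewriting of $-\log(\exp(-T) - \exp(-Z) + \exp(-G))$ given by Kool et al. [2019b].

```python
import numpy as np

def condition_on_max(gumbels, parent_score):
    """Shift Gumbel samples so that their maximum equals parent_score.

    Stable form of -log(exp(-T) - exp(-Z) + exp(-G)),
    with T = parent_score and Z = max(G).
    """
    z = gumbels.max()
    with np.errstate(divide="ignore"):  # log1p(-1) = -inf for the argmax entry
        v = parent_score - gumbels + np.log1p(-np.exp(gumbels - z))
    return parent_score - np.maximum(v, 0.0) - np.log1p(np.exp(-np.abs(v)))

def expand_beams(log_p, parent_scores, k, rng):
    """One step of stochastic beam search that keeps k + 1 expansions.

    log_p[i, j]: log-probability of beam i's partial sequence extended by action j.
    parent_scores[i]: stochastic score already assigned to beam i.
    Returns beam indices, action indices and scores of the top k + 1 expansions,
    sorted by decreasing score; the last score serves as the threshold kappa.
    """
    gumbels = log_p + rng.gumbel(size=log_p.shape)  # G ~ Gumbel(phi)
    scores = np.stack(
        [condition_on_max(g, t) for g, t in zip(gumbels, parent_scores)]
    )
    top = np.argsort(-scores, axis=None)[: k + 1]
    beam_idx, action_idx = np.unravel_index(top, scores.shape)
    return beam_idx, action_idx, scores[beam_idx, action_idx]

# Two beams, three actions; all numbers are illustrative.
rng = np.random.default_rng(0)
log_p = np.log(np.array([[0.30, 0.18, 0.12], [0.04, 0.28, 0.08]]))
parent_scores = np.array([-0.2, -1.0])
beams, actions, scores = expand_beams(log_p, parent_scores, k=2, rng=rng)
kappa = scores[-1]  # (k+1)th largest stochastic score
```

By construction, no child score exceeds the score of its parent beam, and the best child of each beam inherits the parent's score exactly.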
Appendix B Proof of Stepwise Entropy Estimation's Unbiasedness
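The entropy of a sequence can be decomposed into the sum of the conditional entropies at each step, conditioned on the previous values. This is also called the chain rule for entropy. Let $X_1, X_2, \dots, X_n$ be drawn from $P(X_1, X_2, \dots, X_n)$ [Cover and Thomas, 2012]:

$$H(X_1, X_2, \dots, X_n) = \sum_{j=1}^{n} H(X_j \mid X_{j-1}, \dots, X_1) \tag{11}$$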
If we sum up the empirical entropy at each step after the softmax output, we obtain an unbiased estimator of the entropy. Let $S$ be the set of sequences that we sampled, where each sampled sequence $s^i$ consists of $(x^i_1, x^i_2, \dots, x^i_n)$.

In order to incorporate the stepwise estimation of the entropy into the beam search, we use a reweighting scheme similar to that of the REINFORCE objective. The difference is that the REINFORCE objective is reweighted after obtaining the full sequence, because we only receive the reward at the end, whereas here we reweight the entropy at each step. We denote each time step by $t$ and each sequence by $s^i$; the set of sequences selected at time step $t$ is $S_t$, the complete set of all possible sequences of length $t$ is $T_t$, and $S_t \subseteq T_t$. We take the expectation of the estimator over the $G$ scores. As discussed before, at each step each potential beam receives a stochastic score $G$. The beams associated with the top $k$ stochastic scores are chosen to be expanded further and $\kappa$ is the $(k+1)$th largest $G$. $\kappa$ can also be seen as a threshold in the branching selection process, and $q_{\theta,\kappa}(s^i) = P(G_{s^i} > \kappa) = 1 - \exp(-\exp(\phi_i - \kappa))$, where $\phi_i$ is the log probability of $s^i$. For details on the numerically stable implementation of $q_{\theta,\kappa}$, please refer to [Kool et al., 2019b].
For the proof of $\mathbb{E}_{G}\left[\mathbb{1}\{s^i \in S\} / q_{\theta,\kappa}(s^i)\right] = 1$, please refer to [Kool et al., 2019b], Appendix D.
Appendix C Proof of Lower Variance of the Stepwise Entropy Estimator
We continue to use the notation from above. We want to compare the variance of two entropy estimators, the sequence-level estimator and the stepwise entropy estimator, and show that the stepwise estimator has lower variance.
Proof.
We abuse to be and to be to simplify the notations.
∎
The fifth equation holds from the fact that . The result still stands after applying reweighting for the beam search.
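The two claims of Appendices B and C can be checked on a small example by exact enumeration. The sketch below is an illustration only: the two-step binary model and its probabilities are made up, and the sequence-level estimator is assumed to be the negative log probability of the full sampled sequence. It shows what happens for this one toy model and does not replace the proof.

```python
import itertools
import math

# Toy autoregressive model over two binary steps (illustrative numbers).
p1 = [0.7, 0.3]                        # p(x1)
p2 = {0: [0.9, 0.1], 1: [0.4, 0.6]}    # p(x2 | x1)

def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def joint(a, b):
    """Probability of the full sequence (a, b)."""
    return p1[a] * p2[a][b]

def moments(estimator):
    """Exact mean and variance of estimator(x1, x2) under the model."""
    pairs = list(itertools.product(range(2), repeat=2))
    mean = sum(joint(a, b) * estimator(a, b) for a, b in pairs)
    second = sum(joint(a, b) * estimator(a, b) ** 2 for a, b in pairs)
    return mean, second - mean ** 2

def sequence_level(a, b):
    """Single-sample estimate from the full sequence: -log p(x1, x2)."""
    return -math.log(joint(a, b))

def stepwise(a, b):
    """Sum of per-step softmax entropies along the sampled prefix."""
    return entropy(p1) + entropy(p2[a])

true_entropy = entropy([joint(a, b) for a in range(2) for b in range(2)])
mean_seq, var_seq = moments(sequence_level)
mean_step, var_step = moments(stepwise)
```

Both estimators have the true entropy as their mean, and in this toy model the stepwise variance is the smaller of the two, since the stepwise estimate no longer depends on the last sampled symbol.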
Appendix D Shape Encoding Demonstration
In Figure 6, we show the code name on top of the image that it represents. The letters c, s, and t represent circle, square, and triangle, respectively. The first two numbers give the position of the shape on the canvas and the last number gives its size.
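As a small illustration of this encoding, the sketch below decodes a shape code into its parts. The textual format `c(32,32,28)` is an assumption made for the example, since the exact spelling of the code names in Figure 6 is not given here; the names `Shape` and `parse_shape_code` are hypothetical.

```python
import re
from typing import NamedTuple

# Letter-to-shape mapping as stated in the text.
SHAPES = {"c": "circle", "s": "square", "t": "triangle"}

class Shape(NamedTuple):
    kind: str   # circle, square or triangle
    x: int      # first position number
    y: int      # second position number
    size: int   # last number

def parse_shape_code(code: str) -> Shape:
    """Decode a code such as 'c(32,32,28)' (assumed format) into a Shape."""
    match = re.fullmatch(r"\s*([cst])\((\d+),\s*(\d+),\s*(\d+)\)\s*", code)
    if match is None:
        raise ValueError(f"unrecognized shape code: {code!r}")
    letter, x, y, size = match.groups()
    return Shape(SHAPES[letter], int(x), int(y), int(size))

example = parse_shape_code("c(32,32,28)")
```

Under the assumed format, `example` describes a circle placed at position (32, 32) with size 28.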