Unsupervised Program Synthesis for Images using Tree-Structured LSTM

01/27/2020 ∙ by Chenghui Zhou, et al. ∙ Carnegie Mellon University

Program synthesis has recently emerged as a promising approach to the image parsing task. However, most prior works have relied on supervised learning methods, which require ground truth programs for each training image. We present an unsupervised learning algorithm that can parse constructive solid geometry (CSG) images into context-free grammar programs with a non-differentiable renderer. We propose a grammar-encoded tree LSTM that constrains the search space by leveraging the structure of the context-free grammar, handles the non-differentiable renderer via REINFORCE, and encourages exploration by regularizing the objective with an entropy term. Instead of using simple Monte Carlo sampling, we propose a lower-variance entropy estimator with sampling without replacement for effective exploration. We demonstrate the effectiveness of the proposed algorithm on a synthetic 2D CSG dataset, where it outperforms baseline models by a large margin.




1 Introduction

Image generation has been an extensively studied topic in machine learning and computer vision. A vast number of papers have explored generating images through low-dimensional latent representations (Goodfellow et al., 2014; Arjovsky et al., 2017; Li et al., 2017; Kingma and Welling, 2013; van den Oord et al., 2017; Oord et al., 2016). However, it is challenging to learn disentangled representations that allow us to control the generation after learning the generative models (Higgins et al., 2017; Kim and Mnih, 2018; Locatello et al., 2018; Chen et al., 2016). In this paper, we explore generating constructive solid geometry images through programs in the form of a context-free grammar. We can consider these programs as an alternative low-dimensional representation of the image. The model for extracting the programs can be seen as an encoder, and the renderer that reconstructs the image as the decoder. Parsing an image of geometric shapes into programs enables us to manipulate only the desired components of the image while reconstructing the rest of the image.

There are two types of image renderers that construct images from programs: differentiable and non-differentiable renderers. While creating differentiable renderers is a fast-developing field (e.g. (Li et al., 2018; Liu et al., 2019)), they are still not as well developed as the non-differentiable ones. On the other hand, learning a generative process by combining modern machine learning models (e.g. neural networks) with non-differentiable renderers is challenging because we are not able to obtain the gradient with respect to the input.

The goal of our paper is to learn to parse an image, such as a constructive solid geometry (CSG) image (Hubbard, 1990), into a program that can be described by a context-free grammar. (Sharma et al., 2018) studied the same problem using supervised learning, where the images and the programs are paired up as the input. However, we want to tackle a more general image parsing problem in which the ground truth program is not available for training. In this paper, we aim to solve it with only the target images as input, without any other form of program supervision.

Here are the three key components to successfully learn to parse an image to programs with minimum data supervision:

  • We adopt a non-differentiable renderer for this problem. The lack of direct gradient information from the non-differentiable renderer implies that we can only train our model based on the final reconstructed image’s reward. We use REINFORCE (Williams, 1992) as our main building component.

  • We introduce a Tree LSTM to impose a structure on our output space and reduce the search space to only valid programs. A naive parameterization of program generation is a simple RNN, which is not guaranteed to always generate grammatically correct programs. However, the renderer cannot parse invalid programs, and the reconstructed images of invalid programs are defined to receive low rewards. Therefore, it is necessary to limit the search space to avoid the sparse reward problem.

  • We propose a better entropy estimator for the standard entropy regularization to encourage exploration of the search space. We propose a stepwise entropy estimator and show that it has lower variance than the naive estimator. Instead of using simple Monte Carlo, we adopt a sampling without replacement scheme to improve the optimization efficiency significantly.

2 Related Work

Our work is related to program synthesis, vision-as-inverse-graphics, as well as generating images with stroke-based rendering (SBR) systems.

Program synthesis has been of growing interest to researchers in machine learning (Balog et al., 2017; Shin et al., 2018; Devlin et al., 2017; Zohar and Wolf, 2018). Research on converting images to programs is more directly related to our work (Sharma et al., 2018; Ellis et al., 2019; Tian et al., 2019; Ellis et al., 2018; Liu et al., 2018). The setup of our work closely follows (Sharma et al., 2018), and (Ellis et al., 2019) is also an extension of it. Both works rely on supervised pretraining before using REINFORCE to fine-tune the models, while our goal is to train the grammar generation model with only the target images as inputs, without the supervision of the corresponding programs of the training images. (Tian et al., 2019) incorporated a differentiable renderer into the learning pipeline, while we treat our renderer as an external procedure independent of the learning process. With a non-differentiable renderer, we cannot directly propagate the gradient from the reward to the grammar generation model. This distinction leads to very different algorithm designs. (Liu et al., 2018) focuses on the symmetry and repetition aspects of a scene. The algorithm takes in a scene of 3D geometric objects stacked or lined up and parses it into programs in terms of loops and rotations. Although related, this work has a very different setting from ours. (Ellis et al., 2018) used a neural network to extract the shapes from hand-drawn sketches, formulated the grammatical rules as constraints, and obtained the final program by solving a constraint satisfaction problem. This process can be computationally expensive compared to a neural network's performance at test time.

Research in vision-as-inverse-graphics concerns parsing a scene into a collection of shapes or 3D primitives with descriptions that imitate the original scene (Tulsiani et al., 2017; Romaszko et al., 2017; Wu et al., 2017). (Wu et al., 2017) de-renders a scene into a collection of objects, such as a tree and a girl, with colors and locations as parameters. (Yao et al., 2018) differs from the last work in that it further manipulates the de-rendered objects, for example, changing a car's color from blue to red and putting it back into the scene. These works do not deal with the interactions among the parsed objects through operations, which distinguishes them from our setting.

Stroke-based rendering creates an image in a way natural to humans. Examples include recreating paintings by imitating a painter's brush strokes (Huang et al., 2019) and drawing sketches of objects (Ha and Eck, 2017). SPIRAL by (Ganin et al., 2018) is an adversarially trained deep reinforcement learning agent. The agent is able to recreate MuJoCo scenes, MNIST digits, and Omniglot characters. Stroke-based rendering behaves in an additive way. The action space usually consists of a line with various continuous parameters such as width and length. A grammar structure is unnecessary to the rendering process because each action is additive. This contrasts with our problem, in which the action space includes primitive shapes, operations, and grammar tokens. Overall, stroke-based rendering is less expressive and less precise when it comes to generating structured objects.

3 Problem Definition

The input of the model is an image constructed from basic shapes such as a circle, a triangle, or a square, each with a designated size and location (see Figure 6). The output of the model is a program that reconstructs the input, in the format of a context-free grammar (CFG).

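In this paper, we use constructive solid geometry (CSG) (Hubbard, 1990) to form an image. Some sample images can be found in Figure 8. The CFG for CSG includes the binary shape operations of union, subtraction, and intersection. The context-free grammar rules are as follows:

$$
\begin{aligned}
S &\to E && (1)\\
E &\to E\,E\,T \mid P && (2)\\
T &\to + \mid - \mid * && (3)\\
P &\to \mathrm{SHAPE}_1 \mid \mathrm{SHAPE}_2 \mid \cdots \mid \mathrm{SHAPE}_n && (4)
\end{aligned}
$$

$S$, $T$, and $P$ are non-terminal tokens for the start, operations, and shapes, and $E$ is the non-terminal token that expands the program. The others are called terminal tokens, such as $+$ (union), $*$ (intersection), $-$ (subtraction), and $c(x, y, r)$, which stands for a circle with radius $r$ located at $(x, y)$. Each line above is called a production rule, or just a rule for simplicity. Please refer to Figure 7 for some examples of CSG images with their corresponding programs.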
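To make the postfix structure of these programs concrete, the following minimal sketch evaluates a program with a stack of pixel sets. It is only an illustration: the helper names `circle` and `render`, the representation of shapes as sets of pixel coordinates, and the 64 by 64 default canvas are assumptions made for this example and are not the renderer used in the paper.

```python
from typing import List, Set, Tuple, Union

Pixel = Tuple[int, int]
Token = Union[str, Set[Pixel]]  # an operation symbol or a rasterized shape


def circle(cx: int, cy: int, r: int, size: int = 64) -> Set[Pixel]:
    """Return the pixels of a filled circle centered at (cx, cy) with radius r."""
    return {(x, y) for x in range(size) for y in range(size)
            if (x - cx) ** 2 + (y - cy) ** 2 <= r * r}


def render(program: List[Token]) -> Set[Pixel]:
    """Evaluate a postfix CSG program; raise ValueError if it is not valid."""
    stack: List[Set[Pixel]] = []
    for tok in program:
        if isinstance(tok, str):  # operation token: combine the top two images
            if len(stack) < 2:
                raise ValueError("an operation needs two operands")
            right, left = stack.pop(), stack.pop()
            if tok == "+":
                stack.append(left | right)   # union
            elif tok == "*":
                stack.append(left & right)   # intersection
            elif tok == "-":
                stack.append(left - right)   # subtraction
            else:
                raise ValueError(f"unknown operation {tok!r}")
        else:                     # shape token: push its image
            stack.append(tok)
    if len(stack) != 1:
        raise ValueError("the program does not reduce to a single image")
    return stack[0]
```

For example, `render([circle(20, 20, 5), circle(24, 20, 5), "-"])` yields the first circle with the overlap removed, while `render([circle(20, 20, 5), "+"])` raises an error because the operation has only one operand.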

The setting of the problem is adopted from (Sharma et al., 2018). They are able to parse grammatical programs from images through supervised training. However, supervised training requires images with known programs as paired input. Manually constructing programs for training is not only time-consuming but also restricts the target images to be exact translations of the programs. We are interested in unsupervised training because images usually do not come with their corresponding programs.

4 Proposed Algorithm

Our model consists of a CNN encoder for reading the canvas, an embedding layer for the action tokens, and an RNN for generating the grammatical program sequences (see Figure 1 for a demonstration). The model is trained with entropy-regularized REINFORCE (Williams, 1992). Let $H(s)$ be the entropy of the sequence $s$ (defined in Section 4.2), $f(s)$ be a reward function based on the sequence, and $\theta$ be the parameters of our model. The objective is optimized as follows:

$$\Delta\theta \propto \mathbb{E}_{s \sim p_\theta}\left[f(s)\,\nabla_\theta \log p_\theta(s)\right] + \alpha\,\nabla_\theta H(s) \qquad (5)$$

where $\alpha$ is the coefficient of the entropy term.

The output program is converted to an image by a non-differentiable renderer. The image is compared to the target image and receives a reward, which is our definition of $f(s)$. We adopt the Chamfer distance as part of the reward function, as in (Sharma et al., 2018). The Chamfer distance calculates the average matching distance to the nearest feature. Let $x \in X$ and $y \in Y$ be pixels in each image respectively. The Chamfer distance is described formally as follows:

$$Ch(X, Y) = \frac{1}{2|X|}\sum_{x \in X}\min_{y \in Y}\lVert x - y\rVert_2 + \frac{1}{2|Y|}\sum_{y \in Y}\min_{x \in X}\lVert x - y\rVert_2 \qquad (6)$$

The Chamfer distance is scaled by the length of the image diagonal so that it lies between 0 and 1. Under the setting of this problem, the resulting reward mostly falls between 0.9 and 1. In order to magnify the reward differences among actions, we raise it to the power of 20. We add a second component to the reward function, based on pixel differences, in order to differentiate shapes with similar sizes and locations. The final reward function can be described as follows:


with $X$ representing the original image and $Y$ representing the generated image. When a generated image receives a very low reward from the function, its exact reward value provides little insight into its quality and depends largely on its target image. We therefore clip the reward at a minimum value (0.3 in our experiments) in order to simplify the reward function's behavior when the generated images are of poor quality. A similar reward clipping idea was used in DQN (Mnih et al., 2013) to unify the reward ranges across different games to $[-1, 1]$.
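The Chamfer part of the reward can be sketched in a few lines. The block below is a hedged illustration rather than the paper's implementation: it assumes images are given as non-empty sets of pixel coordinates, that the two directions of the Chamfer distance are each weighted by one half, and that the canvas is 64 by 64. The pixel-difference component of the reward is not included, and the names `chamfer` and `chamfer_reward` are hypothetical.

```python
import math
from typing import Set, Tuple

Pixel = Tuple[int, int]


def chamfer(xs: Set[Pixel], ys: Set[Pixel]) -> float:
    """Symmetric average distance from each pixel to the nearest pixel of the other image."""
    if not xs or not ys:
        raise ValueError("both images must contain at least one pixel")

    def one_way(a: Set[Pixel], b: Set[Pixel]) -> float:
        # Average, over pixels in a, of the distance to the closest pixel in b.
        return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)

    return 0.5 * one_way(xs, ys) + 0.5 * one_way(ys, xs)


def chamfer_reward(xs: Set[Pixel], ys: Set[Pixel], size: int = 64,
                   power: int = 20, floor: float = 0.3) -> float:
    """Scale by the image diagonal, exponentiate, and clip at a minimum reward."""
    rho = math.hypot(size, size)             # length of the image diagonal
    score = (1.0 - chamfer(xs, ys) / rho) ** power
    return max(floor, score)                 # assumed clipping at the minimum reward
```

The brute-force nearest-pixel search is quadratic in the number of pixels, which is acceptable for a small canvas but would normally be replaced by a distance transform.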

Our setup closely follows (Sharma et al., 2018). However, naively extending (Sharma et al., 2018) by optimizing reinforcement learning objectives does not learn to generate correct grammatical programs. We introduce two crucial designs to improve the learning. First, we adopt a grammar encoded tree LSTM to ensure a valid output sequence with an image stack to provide intermediate images. Second, we propose a lower-variance entropy estimator to improve both the estimation and optimization.

4.1 Grammar Encoded Tree LSTM

There are three main types of production rules in the grammatical program generation: shape selection ($P$), operation selection ($T$), and grammar selection ($E$). Grammar selection in this problem setting covers the expansions in rules (1) and (2) of the grammar definition and decides whether the program will expand. The outputs therefore fall into three sets: the shapes, the operations, and the non-terminal outcomes (e.g. the right-hand sides in (2)). A naive parameterization is to let the candidate set of the LSTM output consist of the shapes, the operations, and an end token, and to treat the model as a standard language model that generates the program (Sharma et al., 2018). Such a model does not explicitly encode the grammar structure and is expected to capture it implicitly during the learning process. The drawback is that the generated program is not guaranteed to be grammatically correct. For example, the model can emit an operation before any shape has been generated, which is an invalid program.

Instead of blindly sampling terminal tokens, it is easy to sample grammatically correct programs by following the production rules of a well-defined grammar. Therefore, we propose a grammar (tree) LSTM that takes the production rules into account during generation. It is guaranteed to always generate grammatically correct programs and significantly reduces the search space during training. Grammar-encoded models have been used in language tasks, such as (Tai et al., 2015; Wang et al., 2018). They used multiple RNNs for different non-terminal tokens to ensure that the output words follow the grammar rules. Such models are not popular because there is a significant number of exceptions to the rules in natural languages. In contrast, a CFG is a well-defined grammar, making it an ideal candidate to be modeled by a grammar-encoded tree LSTM.

The proposed model can be understood simply as an RNN with a masking mechanism that maintains a stack to rule out invalid outputs. We enlarge the output space of the previous approach (e.g. (Sharma et al., 2018)), which covers the shapes, the operations, and the end token, by including the non-terminal tokens. During generation, we maintain a stack to trace the current production rule. Based on the current non-terminal token, we use the masking mechanism to weed out the invalid output candidates. For example, if we are handling the non-terminal $T$, we mask the invalid outputs so that only the operations remain as candidates. For more details, please refer to (Tai et al., 2015).

Figure 1: This is an example of the grammar encoded tree LSTM at work. The top layer of canvases demonstrates the image stack and the bottom layer demonstrates the grammar stack. The blue, orange, yellow and green colored LSTM cells generate grammatical symbols according to the CFG rules (1), (2), (3) and (4) respectively. In implementation, we can constrain the output space by adding a mask to the output of the LSTM so that the invalid options have close to zero probability of being sampled.

To boost the performance, we further propose an image stack to maintain the intermediate images during the generation and feed them into the model to provide additional information. An example for better understanding is shown below and you can trace each step through Figure 1.

  • A grammar stack with a start token $S$ and an end token, as well as an empty image stack, is initialized.

  • In the first iteration, the token $S$ is popped out. Following (1), all other options are masked except $E$, the only possible output. The $E$ token is added to the stack.

  • In the second iteration, or any iteration where the token $E$ is popped, all softmax outputs are masked except the entries representing $EET$ and $P$, according to (2). If $EET$ is sampled, the $T$, $E$ and $E$ tokens are added to the stack separately, in that exact order, to expand the program further. If $P$ is sampled, it is added to the stack and the program cannot expand further.

  • If $T$ is popped out of the stack, the output space for that iteration is limited to the operations (3). Similarly, if $P$ is popped out, the output space is limited to the geometric shapes (4).

  • When a shape token is sampled, it is not added to the grammar stack because it does not contribute to the program structure. Instead, the image of the shape is pushed onto the image stack.

  • When an operation token is sampled, it is also not added to the grammar stack. Instead, we pop the top two images, apply the operation to them, and push the resulting image back onto the image stack.

  • When the stack has popped out all the added tokens, the end token is popped out in the last iteration. We then finish the sampling as in standard RNN language models.

Before finishing generating a program, there can be multiple images present yet to be assembled by operations in the image stack. We consider them to be the observations in the process and utilize LSTM with those observations as a part of the input to better infer our future direction in the search space.

In this process, the model will produce a sequence of tokens, including grammatical, shape and operation tokens. We only keep all the terminal tokens as the final output program and discard the rest. The programs are ensured to be grammatically correct.
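The stack-driven generation walked through above can be sketched as follows. This is a simplified stand-in under stated assumptions: a uniform random choice replaces the LSTM policy, the image stack is omitted, the shape tokens are placeholder strings, and the cap `max_shapes` on program size is introduced only so that the example terminates. The names `sample_program` and `is_valid` are hypothetical.

```python
import random
from typing import List, Optional

SHAPES = [f"shape{i}" for i in range(27)]  # placeholder names for the shape tokens
OPS = ["+", "-", "*"]


def sample_program(max_shapes: int = 4,
                   rng: Optional[random.Random] = None) -> List[str]:
    """Sample a grammatically valid postfix program by expanding a grammar stack."""
    rng = rng or random.Random()
    stack = ["$", "S"]        # end token at the bottom, start token on top
    program: List[str] = []   # only terminal tokens are kept
    leaves = 1                # number of shapes the program is committed to
    while True:
        top = stack.pop()
        if top == "$":        # end token popped: generation is finished
            return program
        if top == "S":        # rule (1): E is the only valid output
            stack.append("E")
        elif top == "E":      # rule (2): expand to E E T or stop at P
            if leaves < max_shapes and rng.random() < 0.5:
                leaves += 1
                stack.extend(["T", "E", "E"])  # pushed in this order so E pops first
            else:
                stack.append("P")
        elif top == "T":      # rule (3): only operations are valid
            program.append(rng.choice(OPS))
        elif top == "P":      # rule (4): only shapes are valid
            program.append(rng.choice(SHAPES))


def is_valid(program: List[str]) -> bool:
    """Check that a token list is a well-formed postfix expression."""
    depth = 0
    for tok in program:
        if tok in OPS:
            depth -= 1        # an operation consumes two images and produces one
            if depth < 1:
                return False
        else:
            depth += 1
    return depth == 1
```

Because every choice is restricted to the right-hand sides of the rule for the popped non-terminal, every sampled program passes `is_valid`, which is the property the masking mechanism is designed to guarantee.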

In practice, we implement the masking mechanism by adding a vector to the output before passing it into the softmax layer to get the probabilities. The vector contains zeros for valid outputs and large negative numbers for invalid ones. This makes sure that invalid options have almost zero probability of being sampled. The input of the RNN cell includes the encoded target image and intermediate images from the image stack, the embedded token popped from the grammar stack, and the hidden state from the RNN's last iteration. Following previous works in NLP, we call the model Tree LSTM. The exact algorithm is in Algorithm 1.
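A minimal sketch of this additive mask is given below. The constant `NEG_INF` and the function name `masked_softmax` are assumptions chosen for the example; any sufficiently large negative number has the same effect.

```python
import math
from typing import List, Sequence

NEG_INF = -1e9  # assumed value for the "large negative number" added to invalid outputs


def masked_softmax(logits: Sequence[float], valid: Sequence[bool]) -> List[float]:
    """Softmax over logits after adding 0 to valid entries and NEG_INF to invalid ones."""
    if not any(valid):
        raise ValueError("at least one output must be valid")
    masked = [l + (0.0 if ok else NEG_INF) for l, ok in zip(logits, valid)]
    m = max(masked)                           # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in masked]  # invalid entries underflow to 0.0
    total = sum(exps)
    return [e / total for e in exps]
```

With `valid` set to select only the operation entries, for instance, the returned distribution places all of its mass on the operations, so sampling from it can never produce a token that violates the current production rule.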

Input:   Grammar Stack , Image Stack , Target Image , Sample Set

function TreeLSTM()
    for   do
        Estimate entropy at this node :
        Update the log probabilities of partial sequences
    end for
    return
end function
Algorithm 1 Tree LSTM Model

4.2 Exploration with Entropy Regularization

Entropy regularization in reinforcement learning is a standard practice for encouraging exploration. We argue that a careful design of entropy estimation with lower variance can enhance the exploration effect.


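Let $S$ be the random variable of possible programs. The entropy is denoted as $H(S) = \mathbb{E}_S[-\log P(S)]$ (here we overload symbols used in (1) and (2) to follow convention). The number of possible outcomes of $S$ can be exponentially large. Therefore, we usually estimate the entropy via

$$\hat{H} = -\frac{1}{K}\sum_{i=1}^{K}\log P(s_i) \qquad (8)$$

with finite samples $\{s_i\}_{i=1}^{K}$. Without further assumptions, we are not able to improve $\hat{H}$. However, we can decompose the program $S$ into $S = X_1 X_2 \cdots X_n$, where each $X_j$ is the random variable for the token at position $j$ in the program. Under autoregressive models (e.g. RNN), we can further access the conditional probabilities. Therefore, we propose a decomposed entropy estimator

$$\hat{H}_D = \frac{1}{K}\sum_{i=1}^{K}\sum_{j=1}^{n} H\left(X_j \mid X_1 = x_1^i, \ldots, X_{j-1} = x_{j-1}^i\right) \qquad (9)$$

where $s_i = x_1^i x_2^i \cdots x_n^i$, and $H(X_j \mid X_1 = x_1^i, \ldots, X_{j-1} = x_{j-1}^i)$ is the conditional entropy.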

Lemma 4.1.

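The proposed decomposed entropy estimator $\hat{H}_D$ is unbiased and has lower variance than the naive estimator $\hat{H}$, that is, $\mathbb{E}[\hat{H}_D] = H(S)$ and $\mathrm{Var}(\hat{H}_D) \le \mathrm{Var}(\hat{H})$.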

The proof is simple and follows (Cover and Thomas, 2012); we leave it to Appendices B and C.
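The variance gap can be checked numerically on a toy autoregressive model. The sketch below is an assumption-laden illustration: the two-token model and its probabilities (`P_FIRST`, `P_SECOND`) are invented for the example, and the function names are hypothetical. It compares the sequence log probability estimator with the stepwise estimator that sums conditional entropies along each sampled prefix.

```python
import math
import random
from statistics import mean, pvariance
from typing import List, Tuple

P_FIRST = [0.7, 0.2, 0.1]                          # toy p(x1)
P_SECOND = [[0.5, 0.5], [0.9, 0.1], [0.6, 0.4]]    # toy p(x2 | x1), one row per x1


def entropy(p: List[float]) -> float:
    return -sum(q * math.log(q) for q in p if q > 0)


def exact_entropy() -> float:
    """Chain rule: H(X1) + sum over x1 of p(x1) H(X2 | x1)."""
    return entropy(P_FIRST) + sum(p * entropy(row) for p, row in zip(P_FIRST, P_SECOND))


def sample(rng: random.Random) -> Tuple[int, int]:
    x1 = rng.choices(range(3), weights=P_FIRST)[0]
    x2 = rng.choices(range(2), weights=P_SECOND[x1])[0]
    return x1, x2


def naive_estimate(samples: List[Tuple[int, int]]) -> float:
    """Average negative log probability of the full sequences."""
    return mean([-math.log(P_FIRST[x1] * P_SECOND[x1][x2]) for x1, x2 in samples])


def decomposed_estimate(samples: List[Tuple[int, int]]) -> float:
    """Average of the stepwise conditional entropies along each sampled prefix."""
    return mean([entropy(P_FIRST) + entropy(P_SECOND[x1]) for x1, _ in samples])


def compare(trials: int = 2000, k: int = 5, seed: int = 0):
    """Return (variance, mean) of each estimator over repeated batches of k samples."""
    rng = random.Random(seed)
    naive, decomposed = [], []
    for _ in range(trials):
        batch = [sample(rng) for _ in range(k)]
        naive.append(naive_estimate(batch))
        decomposed.append(decomposed_estimate(batch))
    return (pvariance(naive), mean(naive)), (pvariance(decomposed), mean(decomposed))
```

In this toy model the stepwise estimate depends on the sample only through the first token, which is why its spread is much smaller than that of the full-sequence log probability.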

Figure 2: The left most image demonstrates the entropy value increases over 700 iterations by sampling 20 distinct samples with and without replacement as well as sampling 40 samples with replacement. The second image shows the initial distribution. The third and fourth images show the ending distributions after we maximize the entropy by sampling with and without replacement.

4.3 Effective Entropy Optimization

We have established our REINFORCE objective with entropy regularization, and the next step is to optimize it in an efficient way. In this section, we show that sampling without replacement is more data-efficient than sampling with replacement. We demonstrate our point with an experiment in a synthetic setting (Figure 2). In this experiment, we initialize a distribution over 100 variables, three of which have significantly higher probability than the others (as in the second image in Figure 2). The loss function is the estimated entropy from the sampled variables. For sampling with replacement, the estimate is the average negative log probability of the sampled variables. For sampling without replacement, each sampled variable's term is additionally multiplied by an importance weight, which is computed from the probability of the variable and its re-normalized probability after removing the previously sampled variables.

In this experiment, we drew 20 variables for sampling without replacement and 40 variables for sampling with replacement. The entropy increases more rapidly when sampling 20 variables without replacement than when sampling 40 variables with replacement under the same setting. At the end of the 700 iterations, the distribution obtained by sampling without replacement is visibly more uniform than the distribution obtained by sampling with replacement.

Implementing sampling without replacement on a tree structure in this case can be challenging. We do not have the resources to instantiate every possible path and perform sampling without replacement bottom-up. Instead, we adopt a form of stochastic beam search by combining top-down sampling without replacement with Gumbel trick that is equivalent to sampling without replacement bottom-up (Kool et al., 2019b). Please refer to Algorithm 2 for the full algorithm including the sampling without replacement and tree LSTM components and Appendix A for more details.
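For a flat categorical distribution, the Gumbel trick mentioned above reduces to perturbing each log probability with independent Gumbel noise and keeping the largest values. The sketch below illustrates only this flat case, not the tree-structured search; the function names are hypothetical and the clamping constant is an assumption that guards against taking the logarithm of zero.

```python
import math
import random
from typing import List, Sequence


def gumbel(rng: random.Random) -> float:
    """Draw a standard Gumbel sample, clamping the uniform away from 0 and 1."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_top_k(log_probs: Sequence[float], k: int, rng: random.Random) -> List[int]:
    """Sample k distinct indices without replacement via perturb-and-sort."""
    if k > len(log_probs):
        raise ValueError("k cannot exceed the number of categories")
    perturbed = [(lp + gumbel(rng), i) for i, lp in enumerate(log_probs)]
    perturbed.sort(reverse=True)            # largest perturbed log probability first
    return [i for _, i in perturbed[:k]]
```

Taking the top k perturbed values returns k distinct indices in one pass, which is what makes the approach attractive compared with repeatedly sampling and re-normalizing.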

Figure 3: From left to right, we have reward per batch for programs of length 5, 7 and 9. It demonstrates the performance of our algorithm and controlled comparison in performance with removing one component at a time.

Input:   Target Image , Number of beams
Initialize:   Grammar stack , Image stack , Beam set

for  do
     (See (Kool et al., 2019b))
      (Entropy estimation by (9) and Appendix A)
     if  then  else 
end for
Algorithm 2 Sampling w/o Replacement Tree LSTM

5 Experiments

In the experiment section, we investigate how much each of the design features affects the learning process. We show that the entropy estimator we proposed earlier (9) has smaller variance than the sequence log probability method (8). We also train the model with supervision and report its results in comparison to the unsupervised method. For the last experiment, we compare the data efficiency of sampling with and without replacement.

5.1 Reward Comparison of Design Features

Type Length 5 Length 7 Length 9
Training set size 3600 4800 12000
Testing set size 586 789 4630
Table 1: Dataset statistics of different program lengths.

We used a synthetic dataset to test our algorithm and compare the effects of each feature. The algorithm chooses from 27 shape actions, 3 operation actions and 2 grammar actions to create an image on a 64 by 64 canvas. The 27 shape actions have their size, their position in the canvas, and the type of geometry (circle, triangle or square) encoded in the selection. For evaluation purposes, our synthetic dataset has the ground truth program for each of the images. We separated our dataset by program length to study the effect of image complexity on the learning process. Program length also implies the number of shapes in the images: programs of length 5 have 3 shapes and 2 operation actions, programs of length 7 have 4 shapes and 3 operation actions, and so on. For each program length, we generated all possible combinations of shape actions and operation actions in text and then filtered out the duplicates and empty images. Two images are considered duplicates if they differ in no more than 120 pixels, and an image is considered empty if there are no more than 120 pixels on the canvas. For dataset size information, please refer to Table 1.
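The filtering rule described above can be sketched as follows. This is an illustrative reading of the rule, not the dataset script used in the paper: images are assumed to be sets of filled pixel coordinates, the difference between two images is taken to be the size of their symmetric difference, and the name `filter_images` is hypothetical.

```python
from typing import List, Set, Tuple

Pixel = Tuple[int, int]


def filter_images(images: List[Set[Pixel]], threshold: int = 120) -> List[Set[Pixel]]:
    """Drop near-empty images and images that nearly duplicate an earlier kept one."""
    kept: List[Set[Pixel]] = []
    for img in images:
        if len(img) <= threshold:            # too few pixels: treated as empty
            continue
        # Symmetric difference counts the pixels on which the two images disagree.
        if any(len(img ^ other) <= threshold for other in kept):
            continue                         # near-duplicate of an image already kept
        kept.append(img)
    return kept
```

The pairwise comparison is quadratic in the number of images, so a real pipeline over all combinations of shapes and operations would likely need hashing or bucketing first.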

For this dataset, we sampled 19 distinct programs for each target image to approximate the objective functions. The coefficient for negative entropy is 0.05 and the learning rate is 0.01.

We compared each feature's effect on the learning process on all three datasets of increasing difficulty. When we train the model via sampling with replacement, the training process is not able to get out of a discrete local optimum, as seen in Figure 3 (yellow). The starting reward is also lower compared to sampling without replacement because the final reward is the maximum reward of all distinct programs sampled. When we take out the entropy term in the objective function, the reward still improves on the length 5 dataset, but it barely improves as the program length increases. Without the tree structure, the reward stays around 0.3, which is the lowest possible reward (Figure 3, green), because the model is unable to generate a valid program to render. All these design features are crucial to learning increasingly complex images. We measure our converged algorithm's performance on the test sets of the three datasets with the Chamfer and IoU metrics (Table 2 (Left)). Figure 7 provides some qualitative examples of the algorithms in Figure 3.

The Chamfer metric is defined as $(1 - Ch(X, Y)/\rho)^{20}$, where $\rho$ is the number of pixels on the diagonal of an image and $X$ and $Y$ are the target and generated images, as defined in the first term of Equation 7. The value is between 0 and 1. If the generated image is a perfect match with the target, it receives 1 in this metric. The IoU metric is defined as the intersection between the two images over their union: $\mathrm{IoU}(X, Y) = |X \cap Y| / |X \cup Y|$.
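The IoU metric is straightforward to compute once images are represented as sets of filled pixels. In the sketch below, that representation, the function name `iou`, and the convention of returning 1.0 for two empty images are assumptions made for the example.

```python
from typing import Set, Tuple

Pixel = Tuple[int, int]


def iou(xs: Set[Pixel], ys: Set[Pixel]) -> float:
    """Intersection over union of two images given as sets of filled pixels."""
    union = xs | ys
    if not union:
        return 1.0  # assumed convention: two empty images are a perfect match
    return len(xs & ys) / len(union)
```

A perfect reconstruction scores 1, and two images with no pixels in common score 0.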

5.2 Reward Comparison with Supervised Learning Method

In order to study the difference between the supervised and unsupervised learning methods, we compared the training and testing results of a supervised learning method with the same neural network model. The input at each step is the concatenation of the embedded ground truth program and the encoded final and intermediate images. We used the same synthetic dataset and the same Chamfer similarity metric as in Table 2 to measure the quality of the generated data. The testing results of the supervised method worsen with the increasing complexity (program length) of the test set, while the training results are almost perfect across all three datasets. Meanwhile, the unsupervised method receives consistently high scores. This shows that the supervised training method does not generalize as well to unseen data as the unsupervised learning method (Table 2). Our conjecture is that, because supervised learning optimizes a loss function in the program space while unsupervised learning optimizes a reward function in the image space, two programs that are very close in the program space may result in very different images. We leave a more rigorous explanation as future work.

(Left)
Metric | Length 5 | Length 7 | Length 9
Chamfer | 0.985 | 0.960 | 0.969
IoU | 0.996 | 0.964 | 0.969

(Right)
Chamfer | Length 5 | Length 7 | Length 9
Training | 0.997 | 0.996 | 0.994
Testing | 0.987 | 0.906 | 0.833

Table 2: (Left) The performance of the converged model on the test set measured with Chamfer distance to the power of 20 and IoU. (Right) Supervised training results.

5.3 Reward Comparison with Supervised Pretraining Method

In this experiment we pretrained the supervised model described in Section 5.2 on a third of the synthetic training dataset till convergence. We take the model and further train it with REINFORCE without the tree structure or sampling without replacement on the full training sets. We report the reward curve throughout the training process as in Figure 4. In this figure, there is a sharp drop in reward during the training process with all three datasets. Our explanation is that while the supervised pretrained model provides a grammatical structure on the output programs, it is not able to retain the structure after updates resulting in the drop in reward.

Figure 4: REINFORCE training with supervised pretraining model.

5.4 Variance Study of Entropy Estimation

In this study (Figure 5), we demonstrate that the stepwise entropy estimator (Equation 9) achieves a lower variance than the sequence log probability estimator (Equation 8), as discussed in Section 4.2.

We take a single model saved at epoch 40 during training on the length 5, 7, and 9 datasets and estimate the entropy with the weighted sum stepwise method (Equation 9) and the sequence log probability method (Equation 8). We also consider two sampling schemes: sampling without replacement and sampling with replacement. We combine both entropy estimation methods with the two sampling schemes, creating four instances for comparison. The x-axis of the plot shows the number of distinct programs in the sampling without replacement method. For sampling with replacement, the number of distinct programs is replaced with the number of repetitions. We obtain a single estimate of the entropy by taking the mean after running the sampling with replacement scheme the same number of times as the number of distinct programs. We further repeat the estimation 100 times to obtain the mean and variance. The means of the sampling with replacement method act as a baseline for the means of the sampling without replacement method, while we compare the variances of the two entropy estimation methods.

We experimented with the number of distinct programs ranging from 2 to 80. In all three datasets, the stepwise entropy estimator (green) shows significantly smaller variance. However, we notice that longer programs, or more complex images, require many more distinct programs to reduce the variance of the estimate. This makes sense because the search space grows exponentially with program length. There also exists bias in the estimation using the sampling without replacement method. In all three cases, the bias resolves once the number of distinct programs exceeds 10. The bias is greater in datasets with longer program length.

Figure 5: Compare the estimation of the entropy following the Equation A as a weighted sum of stepwise entropy vs. taking the average of the log probability of the sequence. The beam size goes from 2 to 80. From left to right, we demonstrate the result on datasets of three program lengths.

6 Discussion

In this paper, we proposed an entropy-regularized REINFORCE-based algorithm with a grammar encoded tree LSTM that leverages grammatical structure to parse a CSG image into a context-free grammar program. It is the first paper to successfully parse a CSG image into a CFG program with a non-differentiable renderer and without program supervision. Our ingredients include a tree LSTM that guarantees the correctness of the output programs, an unbiased and low-variance entropy estimator of a program sequence, and sampling without replacement to improve data and optimization efficiency. Our experiments have demonstrated the importance of each of our design features quantitatively and qualitatively.


Appendix A Sampling Without Replacement

This section describes how we achieve sampling without replacement with the help of stochastic beam search [Kool et al., 2019b].

At each step of generation, the algorithm chooses the top $k$ branches to expand based on their scores at time step $t$. Each score is sampled from a Gumbel distribution whose location is the log probability of the partial sequence at time step $t$, conditioned on its parent's score being the maximum. Please refer to Algorithm 3 for details of the branching process.

The sampling without replacement algorithm requires correct scaling of the objective functions to ensure unbiasedness. The scaling term is $p_\theta(s)/q_{\theta,\kappa}(s)$. $p_\theta(s)$ represents the probability of the sequence $s$, and $S$ represents the set of all sampled sequences. $q_{\theta,\kappa}(s) = 1 - \exp(-\exp(\phi_s - \kappa))$, where $\phi_s$ is the log probability of $s$ and $\kappa$ is the $(k+1)$-th largest score among all the possible branches. It acts as a threshold for branching selection. During implementation, we need to keep an extra beam, thus $k+1$ beams in total, to accurately estimate $\kappa$ in order to ensure the unbiasedness of the estimator without normalization. Additional normalization terms are employed to reduce variance, but they increase bias in the final estimation. The exact objective is as follows [Kool et al., 2019a]:


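The baseline term is defined as $B(S) = \frac{1}{W(S)} \sum_{s^i \in S} \frac{p_\theta(s^i)}{q_{\theta,\kappa}(s^i)} f(s^i)$, that is, the normalized, importance-weighted average of the rewards $f(s^i)$ of the sampled sequences, where $W(S)$ is the sum of the scaling terms $p_\theta(s^i)/q_{\theta,\kappa}(s^i)$ over $S$. Incorporating a baseline into the REINFORCE objective is standard practice for reducing the variance of the estimate.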
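The scaling, normalization, and baseline terms can be sketched in a few lines of Python. This is a minimal illustration that assumes the normalized estimator of [Kool et al., 2019a], with $q_{\theta,\kappa}(s) = 1 - \exp(-\exp(\log p_\theta(s) - \kappa))$, $W(S) = \sum_i p_\theta(s^i)/q_{\theta,\kappa}(s^i)$ and $W_i(S) = W(S) - p_\theta(s^i)/q_{\theta,\kappa}(s^i) + p_\theta(s^i)$. The function names `inclusion_prob` and `reinforce_weights` and the list-based interface are hypothetical and are not taken from the original implementation.

```python
import math


def inclusion_prob(log_p, kappa):
    """q = P(Gumbel(log_p) > kappa) = 1 - exp(-exp(log_p - kappa))."""
    return -math.expm1(-math.exp(log_p - kappa))


def reinforce_weights(log_ps, rewards, kappa):
    """Return the baseline B(S) and, for each sampled sequence, the
    coefficient that multiplies grad log p(s_i) in a surrogate loss.

    log_ps  : log probabilities of the k sampled sequences
    rewards : reward of each sampled sequence
    kappa   : (k+1)-th largest stochastic score (the threshold)
    """
    ps = [math.exp(lp) for lp in log_ps]
    # Scaling terms p / q for every sampled sequence.
    ratios = [p / inclusion_prob(lp, kappa) for p, lp in zip(ps, log_ps)]
    w_total = sum(ratios)  # W(S)
    baseline = sum(r * f for r, f in zip(ratios, rewards)) / w_total  # B(S)
    coeffs = []
    for p, r, f in zip(ps, ratios, rewards):
        w_i = w_total - r + p  # W_i(S)
        # grad p / q = (p / q) * grad log p, hence the factor r below.
        coeffs.append(r / w_i * (f - baseline))
    return baseline, coeffs
```

In an automatic-differentiation framework, the coefficients would be treated as constants and multiplied with the log probabilities of the sampled sequences to form the surrogate loss.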

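Entropy estimation uses a scaling scheme similar to that of the REINFORCE objective:

$$H(X_1, \dots, X_n) \approx \sum_{t=1}^{n} \frac{1}{W(S_{t-1})} \sum_{s^i \in S_{t-1}} \frac{p_\theta(s^i_{1:t-1})}{q_{\theta,\kappa}(s^i_{1:t-1})} H(X_t \mid X_{1:t-1} = s^i_{1:t-1}),$$

where $W(S_{t-1}) = \sum_{s^i \in S_{t-1}} p_\theta(s^i_{1:t-1}) / q_{\theta,\kappa}(s^i_{1:t-1})$, $S_{t-1}$ is the set of partial sequences kept before step $t$, and $s^i_{1:t}$ denotes the first $t$ elements of the sequence $s^i$. The estimator is unbiased excluding the $1/W(S_{t-1})$ term. The normalization term reduces the variance of the estimator.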

The input of the function Sampling_without_Replacement at time step $t$ is a matrix of log probabilities. Each row of the matrix corresponds to one of the beams that we maintain, and the number of columns equals the size of the action space. The entry in row $i$ and column $j$ is the log probability of the partial sequence obtained by extending the $i$-th beam with the $j$-th action. Row $i$ of the matrix therefore represents the log probabilities of the $i$-th partial sequence expanded one more step with every potential action.

For each beam, we sample a Gumbel random variable located at each element of the corresponding row vector. We then adjust these Gumbel random variables by conditioning on their parent's stochastic score being the largest. The resulting values are the stochastic scores of the potential expansions:

$$\tilde{G}_{i,j} = -\log\left(\exp(-G_i) - \exp(-Z_i) + \exp(-G_{i,j})\right).$$

Here $G_{i,j}$ is the unconditioned Gumbel sample for the $j$-th expansion of the $i$-th beam, $Z_i = \max_j G_{i,j}$ is the largest value in that vector, and $G_i$ is the stochastic score of the parent of all the possible outcomes of the $i$-th beam at step $t$. Conditioning on the parent's stochastic score being the largest in this top-down sampling scheme ensures that each leaf's stochastic score is distributed independently, which is equivalent to sampling the sequences bottom-up [Kool et al., 2019b]. Once we have aggregated the stochastic scores of all potential expansions of all beams, we select the top $k+1$ expansions. Note that we maintain one more beam than we intend to expand because we need the $(k+1)$-th largest stochastic score as the threshold when estimating the entropy and the REINFORCE objective.

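Input: Log probabilities of sequences up to time step $t$ (one row per beam), beam set $S_{t-1}$ with parent scores $G_i$, number of beams $k$

function Sampling_without_Replacement(log probabilities, $S_{t-1}$, $k$)
    Initialize an empty list of candidate scores
    for each beam $i$ in $S_{t-1}$ do
        Sample $G_{i,j}$ from a Gumbel distribution located at the log probability of expansion $j$, for every action $j$
        $Z_i \leftarrow \max_j G_{i,j}$
        $\tilde{G}_{i,j} \leftarrow -\log(\exp(-G_i) - \exp(-Z_i) + \exp(-G_{i,j}))$
        Aggregate the values in the vector $\tilde{G}_{i,\cdot}$ into the candidate list
    end for
    Choose the top $k+1$ values in the candidate list
    $S_t \leftarrow$ the expansions with the top $k$ values
    $\kappa \leftarrow$ the $(k+1)$-th largest value
    return $S_t$, $\kappa$
end function
Algorithm 3: Sampling without replacement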
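The expansion step of Algorithm 3 can be sketched in Python as follows. This is a minimal illustration that assumes the conditional Gumbel update of stochastic beam search [Kool et al., 2019b], written in a numerically stable form. The function names, the tuple layout `(score, beam, action)`, and the choice to return the threshold together with the selected expansions are assumptions made for this sketch rather than details of the original implementation.

```python
import math
import random


def sample_gumbel(loc, rng):
    """Draw one Gumbel(loc) sample by inverse-transform sampling."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return loc - math.log(-math.log(u))


def conditioned_scores(row, parent_score, rng):
    """Sample one Gumbel per expansion (located at its log probability) and
    condition the samples on their maximum being the parent's score."""
    raw = [sample_gumbel(phi, rng) for phi in row]
    z = max(raw)
    scores = []
    for g in raw:
        if g == z:
            # The largest child inherits the parent's score exactly.
            scores.append(parent_score)
            continue
        # Stable form of -log(exp(-parent) - exp(-z) + exp(-g)).
        v = parent_score - g + math.log(-math.expm1(g - z))
        scores.append(parent_score - max(0.0, v) - math.log1p(math.exp(-abs(v))))
    return scores


def expand_beams(log_probs, parent_scores, k, rng):
    """Return the top-k (score, beam, action) expansions and the threshold
    kappa, i.e. the (k+1)-th largest stochastic score."""
    candidates = []
    for i, row in enumerate(log_probs):
        for j, score in enumerate(conditioned_scores(row, parent_scores[i], rng)):
            candidates.append((score, i, j))
    candidates.sort(reverse=True)
    kappa = candidates[k][0] if len(candidates) > k else -math.inf
    return candidates[:k], kappa
```

For example, `expand_beams(log_probs, parent_scores, 2, random.Random(0))` keeps two expansions and reports the third-largest score as the threshold.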

Appendix B Proof of Stepwise Entropy Estimation's Unbiasedness

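The entropy of a sequence can be decomposed into the sum of the conditional entropies at each step, each conditioned on the previous values. This is also called the chain rule for entropy. Let $X_1, X_2, \dots, X_n$ be drawn from $p(x_1, x_2, \dots, x_n)$ [Cover and Thomas, 2012]:

$$H(X_1, X_2, \dots, X_n) = \sum_{t=1}^{n} H(X_t \mid X_{t-1}, \dots, X_1).$$

If we sum up the empirical entropy at each step after the softmax output, we obtain an unbiased estimator of the entropy. Let $S$ be the set of sequences that we sampled, where each sampled sequence $s^i$ consists of $x^i_1, x^i_2, \dots, x^i_n$:

$$\mathbb{E}\left[\frac{1}{|S|} \sum_{s^i \in S} \sum_{t=1}^{n} H(X_t \mid X_{1:t-1} = x^i_{1:t-1})\right] = \sum_{t=1}^{n} \mathbb{E}_{x_{1:t-1}}\left[H(X_t \mid X_{1:t-1} = x_{1:t-1})\right] = \sum_{t=1}^{n} H(X_t \mid X_{t-1}, \dots, X_1) = H(X_1, \dots, X_n).$$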
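The chain rule and the unbiasedness of the stepwise sum can be checked exactly on a small example. The sketch below is illustrative only: the two-step distribution `P_X1` and `P_X2_GIVEN_X1` is a hypothetical toy model, not one used in the paper, and it covers plain sampling from the model rather than beam search.

```python
import math

# Hypothetical toy autoregressive model: p(x1) and p(x2 | x1).
P_X1 = [0.9, 0.1]
P_X2_GIVEN_X1 = [
    [1.0, 0.0, 0.0, 0.0],      # x1 = 0: the second step is deterministic
    [0.25, 0.25, 0.25, 0.25],  # x1 = 1: the second step is uniform
]


def entropy(dist):
    """Shannon entropy in nats, ignoring zero-probability outcomes."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def joint_entropy():
    """H(X1, X2) computed directly from the joint distribution."""
    joint = [p1 * p2 for p1, row in zip(P_X1, P_X2_GIVEN_X1) for p2 in row]
    return entropy(joint)


def expected_stepwise_entropy():
    """Expectation over x1 of H(X1) + H(X2 | X1 = x1)."""
    h1 = entropy(P_X1)
    return sum(p1 * (h1 + entropy(row)) for p1, row in zip(P_X1, P_X2_GIVEN_X1))
```

Both functions return the same value up to floating-point error, which is the chain rule applied to this toy model.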

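To incorporate the stepwise estimation of the entropy into the beam search, we use a reweighting scheme similar to that of the REINFORCE objective. The difference is that the REINFORCE objective is reweighted after obtaining the full sequence, because we only receive the reward at the end, whereas here we reweight the entropy at each step. We denote each time step by $t$ and each sequence by $s^i$. The set of sequences selected at time step $t$ is $S_t$, the complete set of all possible sequences of length $t$ is $T_t$, and $S_t \subseteq T_t$. We take the expectation of the estimator over the scores. As discussed before, at each step, each potential beam receives a stochastic score $G_{s^i}$. The beams associated with the top-$k$ stochastic scores are chosen to be expanded further, and $\kappa$ is the $(k+1)$-th largest $G_{s^i}$. $\kappa$ can also be seen as a threshold in the branch selection process, and $q_{\theta,\kappa}(s^i) = P(G_{s^i} > \kappa) = 1 - \exp(-\exp(\log p_\theta(s^i) - \kappa))$. For details on the numerically stable implementation of $q_{\theta,\kappa}(s^i)$, please refer to [Kool et al., 2019b].

$$\mathbb{E}_G\left[\sum_{t=1}^{n} \sum_{s^i \in S_{t-1}} \frac{p_\theta(s^i)}{q_{\theta,\kappa}(s^i)} H(X_t \mid X_{1:t-1} = s^i)\right] = \sum_{t=1}^{n} \sum_{s^i \in T_{t-1}} p_\theta(s^i)\, H(X_t \mid X_{1:t-1} = s^i)\, \mathbb{E}_G\left[\frac{\mathbb{1}\{s^i \in S_{t-1}\}}{q_{\theta,\kappa}(s^i)}\right] = \sum_{t=1}^{n} H(X_t \mid X_{t-1}, \dots, X_1) = H(X_1, \dots, X_n).$$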

For the proof of $\mathbb{E}_G\left[\mathbb{1}\{s^i \in S\} / q_{\theta,\kappa}(s^i)\right] = 1$ for a sampled set $S$, please refer to [Kool et al., 2019b], Appendix D.

Appendix C Proof of Lower Variance of the Stepwise Entropy Estimator

We continue to use the notation from above. We want to compare the variance of two entropy estimators, the estimator based on the log probability of the whole sequence and the stepwise entropy estimator, and show that the second estimator has lower variance.


We abuse to be and to be to simplify the notations.

The fifth equation holds from the fact that . The result still stands after applying reweighting for the beam search.
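The difference between the two estimators can be illustrated by exact enumeration on a small example. The sketch below computes the mean and variance of a single-sample estimate for both estimators on a hypothetical two-step model (`P_X1`, `P_X2_GIVEN_X1`) chosen for this illustration. It shows the behavior on this one toy distribution under plain sampling and is not a substitute for the proof.

```python
import math

# Hypothetical toy autoregressive model: p(x1) and p(x2 | x1).
P_X1 = [0.9, 0.1]
P_X2_GIVEN_X1 = [
    [1.0],                     # x1 = 0: the second step is deterministic
    [0.25, 0.25, 0.25, 0.25],  # x1 = 1: the second step is uniform
]


def entropy(dist):
    """Shannon entropy in nats, ignoring zero-probability outcomes."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def moments(values_and_probs):
    """Exact mean and variance of a discrete random variable."""
    mean = sum(p * v for v, p in values_and_probs)
    var = sum(p * (v - mean) ** 2 for v, p in values_and_probs)
    return mean, var


def estimator_moments():
    """Return ((mean, var) of -log p(x), (mean, var) of the stepwise sum)."""
    h1 = entropy(P_X1)
    naive, stepwise = [], []
    for p1, row in zip(P_X1, P_X2_GIVEN_X1):
        for p2 in row:
            prob = p1 * p2
            # Naive estimate: negative log probability of the full sequence.
            naive.append((-math.log(prob), prob))
            # Stepwise estimate: sum of the conditional entropies on the path.
            stepwise.append((h1 + entropy(row), prob))
    return moments(naive), moments(stepwise)
```

On this toy model both estimators have the same mean, the entropy of the joint distribution, while the stepwise estimator has the smaller variance.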

Appendix D Shape Encoding Demonstration

In Figure 6, we show the code name on top of the image that it represents. The letters c, s, and t represent circle, square, and triangle, respectively. The first two numbers give the position of the shape on the canvas and the last number gives its size.
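A shape encoding of this kind can be decoded with a few lines of Python. The sketch below assumes a textual format such as `c(32,32,28)`, that is, a letter followed by three comma-separated integers in parentheses. This exact format, the `Shape` class, and the function `parse_shape` are assumptions made for illustration, since the text only specifies the meaning of the letter and the three numbers.

```python
import re
from dataclasses import dataclass

SHAPE_NAMES = {"c": "circle", "s": "square", "t": "triangle"}
# Assumed format: letter, then "(x,y,size)" with non-negative integers.
_PATTERN = re.compile(r"^([cst])\((\d+),(\d+),(\d+)\)$")


@dataclass(frozen=True)
class Shape:
    kind: str  # "circle", "square" or "triangle"
    x: int     # horizontal position on the canvas
    y: int     # vertical position on the canvas
    size: int  # size of the shape


def parse_shape(code: str) -> Shape:
    """Parse an encoding such as 'c(32,32,28)' into a Shape."""
    match = _PATTERN.match(code.replace(" ", ""))
    if match is None:
        raise ValueError(f"unrecognized shape encoding: {code!r}")
    letter, x, y, size = match.groups()
    return Shape(SHAPE_NAMES[letter], int(x), int(y), int(size))
```

Under these assumptions, `parse_shape("c(32,32,28)")` describes a circle at position (32, 32) with size 28.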

Figure 6: Each shape encoding is on top of the image it represents.

Appendix E Example Outputs and Programs

In Figure 7, we compare the performance of each algorithm qualitatively with one example for each image complexity level. Figure 8 shows some example outputs from the sampling without replacement scheme.

Figure 7: We show a target image from each dataset with its correct program below it. To the right of the target programs are the outputs of our algorithm and of three variants, each missing one design feature. The reward is shown on top of each output image.
Figure 8: Example outputs of our algorithm. Each row represents one example occupying five columns. The leftmost image is the target image and the four images to its right are the outputs of four beams. The corresponding reward is shown on top of each image. The final output is the image with the highest reward among the beam outputs, which is highlighted with a red frame.