1 Introduction
There has been much interest recently in generative models, following the introduction of both variational autoencoders (VAEs) (Kingma & Welling, 2014) and generative adversarial networks (GANs) (Goodfellow et al., 2014). These models have successfully been applied to a range of tasks, including image generation (Radford et al., 2015), image completion (Iizuka et al., 2017), texture synthesis (Jetchev et al., 2017; Xian et al., 2018), sketch generation (Ha & Eck, 2017), and music generation (Dieleman et al., 2018).

Despite their successes, generative models still have difficulty capturing global structure. For example, consider the image completion task in Figure 1. The original image (left) is of a building, for which the global structure is a 2D repeating pattern of windows. Given a partial image (middle left), the goal is to predict the completion of the image. As can be seen, a state-of-the-art image completion algorithm has trouble reconstructing the original image (right) (Iizuka et al., 2017). (The baseline model performs particularly poorly since our dataset is small; we show on synthetic data that our approach is significantly better even when a large amount of data is available.) Real-world data often contains such global structure, including repetitions, reflectional or rotational symmetry, or even more complex patterns.
In the past few years, program synthesis (Solar-Lezama et al., 2006) has emerged as a promising approach to capturing patterns in data (Ellis et al., 2015, 2018; Valkov et al., 2018). The idea is that simple programs can capture global structure that evades state-of-the-art deep neural networks. A key benefit of using program synthesis is that we can design the space of programs to capture different kinds of structure—e.g., repeating patterns (Ellis et al., 2018), symmetries, or spatial structure (Deng et al., 2018)—depending on the application domain. The challenge is that, for the most part, existing approaches have synthesized programs that operate directly over raw data. Since programs have difficulty operating over perceptual data, existing approaches have largely been limited to very simple data—e.g., detecting 2D repeating patterns of simple shapes (Ellis et al., 2018).

We propose to address these shortcomings by synthesizing programs that represent the underlying structure of high-dimensional data. In particular, we decompose programs into two parts: (i) a sketch that represents the skeletal structure of the program (Solar-Lezama et al., 2006), with holes that are left unimplemented, and (ii) components that can be used to fill these holes. We consider perceptual components—i.e., holes in the sketch are filled with raw perceptual data. For example, a short program of this form represents the structure in the original image in Figure 1 (left): the sketch is a nested for-loop, and the component is a subimage taken from the given partial image. The draw function then renders the given subimage at the given position. We call a sketch whose holes are filled with perceptual components a neurosymbolic program.
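To make the decomposition concrete, the following is a minimal sketch (in Python) of how such a neurosymbolic program might be represented and executed. The parameter names and the `draw`/`execute` helpers are our own illustrative choices, not the paper's actual implementation.

```python
import numpy as np

def draw(canvas, subimage, i, j):
    """Render a d x d subimage at grid cell (i, j) of the canvas."""
    d = subimage.shape[0]
    canvas[i * d:(i + 1) * d, j * d:(j + 1) * d] = subimage

def execute(sketch, component, canvas):
    """Run the sketch's nested for-loop, filling its hole with the component.
    The sketch here is six loop parameters (row/col start, end, stride);
    this parameterization is an assumption for illustration."""
    a, b, m, a2, b2, m2 = sketch
    for i in range(a, b + 1, m):
        for j in range(a2, b2 + 1, m2):
            draw(canvas, component, i, j)
    return canvas

window = np.ones((4, 4))     # perceptual component: a raw subimage
canvas = np.zeros((24, 24))  # a 6 x 6 grid of 4 x 4 cells
# render the "window" in every row, every other column
execute((0, 5, 1, 0, 5, 2), window, canvas)
```

The sketch alone (the loop bounds and strides) is purely symbolic; the component is perceptual, which is what lets the program describe repeating structure in real images.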
Building on these ideas, we propose an approach called program-synthesis-guided generative models (PSGM) that combines neurosymbolic programs representing global structure with state-of-the-art deep generative models. By incorporating programmatic structure, PSGM substantially improves the quality of these state-of-the-art models. As can be seen, the completion produced using PSGM (middle right of Figure 1) substantially outperforms the baseline.
We show that PSGM can be used both for generation from scratch and for image completion. The generation pipeline is shown in Figure 2. At a high level, PSGM for generation operates in two phases:

First, it generates a program that represents the global structure in the image to be generated. In particular, it generates a program $P = (s, c)$ representing the latent global structure in the image (left in Figure 2), where $s$ is a sketch and $c$ is a perceptual component.

Second, our algorithm executes $P$ to obtain a structure rendering $\tilde{x}$ representing the program as an image (middle of Figure 2). Then, our algorithm uses a deep generative model to complete $\tilde{x}$ into a full image (right of Figure 2). The structure in $\tilde{x}$ helps guide the deep generative model towards images that preserve the global structure.
The image-completion pipeline (see Figure 3) is similar.
Training these models end-to-end is challenging, since a priori, ground truth global structure is unavailable. Furthermore, representative global structure is very sparse, so approaches such as reinforcement learning do not scale. Instead, we leverage domain-specific program synthesis algorithms to produce examples of programs that represent the global structure of the training data. In particular, we propose a synthesis algorithm tailored to the image domain, which extracts programs with nested for-loops that can represent multiple 2D repeating patterns in images. Then, we use these example programs as supervised training data.
Our programs can capture rich spatial structure in the training data. For example, in Figure 2, the program structure encodes a repeating structure of 0's and 2's on the whole image, and a separate repeating structure of 3's on the right-hand side of the image. Furthermore, in Figure 1, the generated image captures the idea that the repeating pattern of windows does not extend to the bottom portion of the image.
Contributions.
We propose an architecture of generative models that incorporates programmatic structure, as well as an algorithm for training these models (Section 2). Our learning algorithm depends on a domain-specific program synthesis algorithm for extracting global structure from the training data; we propose such an algorithm for the image domain (Section 3). Finally, we evaluate our approach on synthetic data and on a real-world dataset of building facades (Tyleček & Šára, 2013), both on the task of generation from scratch and on generation from a partial image. We show that our approach substantially outperforms several state-of-the-art deep generative models (Section 4).
Figure 2 (panels, left to right: for loop from sampled program; structure rendering; completed image). Our image generation pipeline consists of the following steps: (i) our generative model samples a latent vector $z$, and samples a program $P$ (left); (ii) our model executes $P$ to obtain a rendering $\tilde{x}$ of the program structure (middle); (iii) our model samples a completion of $\tilde{x}$ into a full image (right).

Figure 3 (panels: partial image; synthesized program; extrapolated program; structure rendering; completion (ours); original image).
Related work.
There has been growing interest in applying program synthesis to machine learning, for purposes of interpretability
(Wang & Rudin, 2015; Verma et al., 2018), safety (Bastani et al., 2018), and lifelong learning (Valkov et al., 2018). Most relevantly, there has been interest in using programs to capture structure that deep learning models have difficulty representing
(Lake et al., 2015; Ellis et al., 2015, 2018; Pu et al., 2018). For instance, Ellis et al. (2015) propose an unsupervised learning algorithm for capturing repeating patterns in simple line drawings; however, not only are their domains simple, but they can only handle a very small amount of noise. Similarly, Ellis et al. (2018) capture 2D repeating patterns of simple circles and polygons; however, rather than synthesizing programs with perceptual components, they learn a simple mapping from images to symbols as a preprocessing step. The closest work we are aware of is Valkov et al. (2018), which synthesizes programs with neural components (i.e., components implemented as neural networks); however, their application is to lifelong learning, not generation, and to learning with supervision (labels) rather than to unsupervised learning of structure.

Additionally, there has been work extending neural module networks (Andreas et al., 2016) to generative models (Deng et al., 2018). These algorithms essentially learn a collection of neural components that can be composed together based on hierarchical structure. However, they require that the structure be available (albeit in natural language form) both for training the model and for generating new images.
Finally, there has been work incorporating spatial structure into generative models for generating textures (Jetchev et al., 2017); however, their work only handles a single infinite repeating 2D pattern. In contrast, we can capture a rich variety of spatial patterns parameterized by a space of programs. For example, the image in Figure 1 generated by our technique contains different repeating patterns in different parts of the image.
2 Generative Models with Latent Structure
We describe our proposed architecture for generative models that incorporate programmatic structure. For most of this section, we focus on generation; we discuss how we adapt these techniques to image completion at the end. We illustrate our generation pipeline in Figure 2.
Let $p_\theta(x)$ be a distribution over a space $\mathcal{X}$ with unknown parameters $\theta$ that we want to estimate. We study the setting where $x \in \mathcal{X}$ is generated based on some latent structure, which consists of a program sketch $s$ and a perceptual component $c$, and where the structure is in turn generated conditioned on a latent vector $z$—i.e.,

$$z \sim p(z), \qquad (s, c) \sim p_\theta(s, c \mid z), \qquad x \sim p_\theta(x \mid s, c).$$

Figure 2 shows an example of a sampled program $P = (s, c)$ (left), and an example of the sampled completion $x$ (right). To sample a completion, our model executes $P$ to obtain a structure rendering $\tilde{x}$ (middle), and then samples a completion based on $\tilde{x}$—i.e., $x \sim p_\theta(x \mid \tilde{x})$.
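The three-stage generative process can be sketched as ancestral sampling. The distributions below are toy stand-ins (hand-rolled loop parameters and a fixed component) rather than the learned networks, so everything here beyond the structure of the process is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z(dim=8):
    # prior p(z): a standard normal latent vector
    return rng.standard_normal(dim)

def sample_program(z):
    # stand-in for p_theta(s, c | z): in the paper this is a learned
    # decoder over sketches/components; here we derive loop parameters
    # directly from z and use a constant component
    a = int(abs(z[0])) % 2
    sketch = (a, 5, 2, a, 5, 2)   # (row start, end, stride, col start, end, stride)
    component = np.ones((2, 2))   # perceptual component (a subimage)
    return sketch, component

def sample_image(sketch, component, grid=6, cell=2):
    # stand-in for p_theta(x | s, c): execute the program on a blank
    # canvas, then "complete" it (here: just add pixel noise)
    canvas = np.zeros((grid * cell, grid * cell))
    a, b, m, a2, b2, m2 = sketch
    for i in range(a, b + 1, m):
        for j in range(a2, b2 + 1, m2):
            canvas[i*cell:(i+1)*cell, j*cell:(j+1)*cell] = component
    return canvas + 0.01 * rng.standard_normal((grid * cell, grid * cell))

z = sample_z()
sketch, component = sample_program(z)
x = sample_image(sketch, component)
```

The key design point is the middle stage: the latent vector never generates pixels directly; it generates a program, and only the executed program conditions the pixel-level model.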
We now describe our algorithm for learning the parameters $\theta$ of $p_\theta$, followed by a description of our choices of architecture for $p_\theta(s, c \mid z)$ and $p_\theta(x \mid s, c)$.
Learning algorithm.
Given training data $x_1, \dots, x_n$, where $x_i \in \mathcal{X}$, the maximum likelihood estimate is

$$\theta^* = \arg\max_\theta \sum_{i=1}^n \log p_\theta(x_i).$$
Since $\theta^*$ is intractable to optimize, we use an approach based on the variational autoencoder (VAE). In particular, we use a variational distribution

$$q_\phi(s, c, z \mid x) = q_\phi(s, c \mid x)\, q_\phi(z \mid s, c),$$

which has parameters $\phi$. Then, we optimize $\theta$ while simultaneously optimizing $\phi$. Using $q_\phi$, the evidence lower bound on the log-likelihood is
$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(s, c \mid x)}\!\left[\, \mathbb{E}_{q_\phi(z \mid s, c)}[\log p_\theta(s, c \mid z)] - D_{\mathrm{KL}}(q_\phi(z \mid s, c) \,\|\, p(z)) + \log p_\theta(x \mid s, c) \,\right] + H(q_\phi(s, c \mid x)), \tag{1}$$

where $D_{\mathrm{KL}}$ is the KL divergence and $H$ is information entropy. Thus, we can approximate $\theta^*$ by optimizing the lower bound (1) instead of the log-likelihood. However, (1) remains intractable since we are integrating over all program sketches $s$ and perceptual components $c$. Using sampling to estimate these integrals would be very computationally expensive. Instead, we propose an approach that uses a single point estimate of $s$ and $c$ for each $x$, which we describe below.
Synthesizing structure.
For a given $x$, we use program synthesis to infer a single likely choice $s^*(x)$ and $c^*(x)$ of the latent structure. The program synthesis algorithm must be tailored to a specific domain; we propose an algorithm for inferring for-loop structure in images in Section 3. Then, we use these point estimates in place of the integrals over $s$ and $c$—i.e., we assume that

$$q_\phi(s, c \mid x) = \delta(s - s^*(x))\, \delta(c - c^*(x)),$$

where $\delta$ is the Dirac delta function. Plugging into (1) gives
$$\log p_\theta(x) \gtrsim \mathbb{E}_{q_\phi(z \mid s^*, c^*)}[\log p_\theta(s^*, c^* \mid z)] - D_{\mathrm{KL}}(q_\phi(z \mid s^*, c^*) \,\|\, p(z)) + \log p_\theta(x \mid s^*, c^*), \tag{2}$$

where we have dropped the degenerate terms arising from the delta functions (which are constant with respect to the parameters $\theta$ and $\phi$). As a consequence, (1) decomposes into two parts that can be straightforwardly optimized—i.e.,

$$\mathcal{L}(\theta, \phi) = \mathcal{L}_{\text{struct}}(\theta, \phi) + \mathcal{L}_{\text{render}}(\theta),$$

where we can optimize the structure term $\mathcal{L}_{\text{struct}}$ (the first two terms of (2)) and the rendering term $\mathcal{L}_{\text{render}} = \log p_\theta(x \mid s^*, c^*)$ independently:
Latent structure VAE.
Note that the first part of the objective is exactly the objective of a VAE, where $q_\phi(z \mid s, c)$ is the encoder and $p_\theta(s, c \mid z)$ is the decoder—i.e., learning the distribution over latent structure is equivalent to learning the parameters of a VAE. The architecture of this VAE depends on the representation of $s$ and $c$. In the case of for-loop structure in images, we use a sequence-to-sequence VAE.
Generating data with structure.
The second part of the objective corresponds to learning the probability distribution $p_\theta(x \mid s, c)$ (conditioned on the latent structure $s$ and $c$)—e.g., we can estimate this distribution using another VAE. As before, the architecture of this VAE depends on the representation of $s$ and $c$. Rather than directly predicting $x$ based on $s$ and $c$, we can leverage the program structure more directly by first executing the program $P = (s, c)$ to obtain its output $\tilde{x}$, which we call a structure rendering. In particular, $\tilde{x}$ is a more direct representation of the global structure represented by $P$, so it is often more suitable to use as input to a neural network. The middle of Figure 2 shows an example of a structure rendering for the program on the left. Then, we can train a model $p_\theta(x \mid \tilde{x})$.

In the case of images, we use a VAE with convolutional layers for the encoder and transpose convolutional layers for the decoder. Furthermore, instead of estimating the entire distribution $p_\theta(x \mid \tilde{x})$, we also consider two non-probabilistic approaches that directly predict $x$ from $\tilde{x}$, which is an image completion problem. We can solve this problem using GLCIC, a state-of-the-art image completion model (Iizuka et al., 2017). We can also use CycleGAN (Zhu et al., 2017), which solves the more general problem of mapping a training set of structure renderings $\{\tilde{x}_i\}$ to a training set of completed images $\{x_i\}$. (Pix2Pix (Isola et al., 2017) may seem more appropriate since it takes training pairs $(\tilde{x}_i, x_i)$, but CycleGAN outperformed it.)
Image completion.
In image completion, we are given a set of training pairs $(x^{\text{part}}_i, x_i)$, and the goal is to learn a model that predicts the complete image $x$ given a partial image $x^{\text{part}}$. Compared to generation, our likelihood is now conditioned on $x^{\text{part}}$—i.e., $p_\theta(x \mid x^{\text{part}})$. Now, we describe how we modify each of our two models to incorporate this extra information.
First, the programmatic structure is no longer fully latent, since we can observe partial programmatic structure in $x^{\text{part}}$. In particular, we can leverage our program synthesis algorithm to help perform completion. We first synthesize programs $P^{\text{part}}$ and $P$ representing the global structure in $x^{\text{part}}$ and $x$, respectively. Then, we can train a model that predicts $P$ given $P^{\text{part}}$—i.e., it extrapolates $P^{\text{part}}$ to a program $\hat{P}$ representing the structure of the whole image. Thus, unlike generation, where we sample a program, we use the extrapolated program $\hat{P}$.
The second model for the most part remains the same, except that when we execute $\hat{P}$ to obtain a structure rendering, we render onto the partial image $x^{\text{part}}$ instead of onto a blank image to obtain the final rendering $\tilde{x}$. Then, we complete the structure rendering $\tilde{x}$ into a prediction of the full image as before (i.e., using a VAE, GLCIC, or CycleGAN).
Our image completion pipeline is shown in Figure 3, including the given partial image $x^{\text{part}}$ (top left), the program $P^{\text{part}}$ synthesized from the partial image (top middle), the extrapolated program $\hat{P}$ (top right), the structure rendering $\tilde{x}$ (bottom left), and the predicted full image (bottom middle).
3 Synthesizing Programmatic Structure
Image representation.
Since the images we work with are very high dimensional, for tractability, we assume that each image $x$ is divided into a grid containing $N$ rows and $M$ columns, where each grid cell has size $d \times d$ pixels (where $d$ is a hyperparameter). For example, this grid structure is apparent in Figure 3 (top right); the grid dimensions differ for the facades and synthetic datasets. For $i \in [N]$ and $j \in [M]$, we let $x_{ij}$ denote the $d \times d$ subimage at position $(i, j)$ in the grid.

Program grammar.
Given this structure, we consider programs that draw 2D repeating patterns of subimages on the grid. More precisely, we consider programs $P$ that are length-$k$ lists of pairs $(s, c)$ consisting of a sketch $s$ and a perceptual component $c$; here, $k$ is a hyperparameter. (So far, we have assumed that a program is a single pair $(s, c)$, but the generalization to a list of pairs is straightforward.) A sketch has the form

  for $i = a$; $i \le b$; $i \mathrel{+}= m$ do
    for $j = a'$; $j \le b'$; $j \mathrel{+}= m'$ do
      draw(??, $i$, $j$)
    end for
  end for

where $(a, b, m, a', b', m')$ are undetermined parameters that must satisfy $1 \le a \le b \le N$ and $1 \le a' \le b' \le M$, and where ?? is a hole to be filled by a perceptual component, which is a $d \times d$ subimage $c$. (For colored images, we have $c \in \mathbb{R}^{d \times d \times 3}$.) Then, upon executing the $(i, j)$ iteration of the for-loop, the program renders subimage $c$ at position $(i, j)$ in the grid. Figure 3 (top middle) shows an example of a sketch whose hole is filled with a subimage, and Figure 3 (bottom left) shows the image rendered upon executing it. Figure 2 shows another such example.
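Under this grammar, the set of grid cells a sketch touches is fully determined by its six parameters. A small helper makes this explicit (the parameter layout and 0-indexed cells are our own conventions for illustration):

```python
def cover(sketch, n_rows, n_cols):
    """Grid cells (i, j) rendered by a sketch with parameters
    (a, b, m, a2, b2, m2): rows a, a+m, ... up to b, and columns
    a2, a2+m2, ... up to b2, clipped to the grid."""
    a, b, m, a2, b2, m2 = sketch
    return {(i, j)
            for i in range(a, min(b, n_rows - 1) + 1, m)
            for j in range(a2, min(b2, n_cols - 1) + 1, m2)}

# rows 0, 2, 4 crossed with columns 1, 2, 3 on a 5 x 5 grid
cells = cover((0, 4, 2, 1, 3, 1), n_rows=5, n_cols=5)
```

This cover set is the only property of a sketch the synthesis algorithm needs, which is what makes the search below tractable.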
Program synthesis problem.
Given a training image $x$, our program synthesis algorithm outputs the parameters of each sketch $s_t$ in the program (for $t \in [k]$), along with a perceptual component $c_t$ to fill the hole in sketch $s_t$. Together, these parameters define a program $P = ((s_1, c_1), \dots, (s_k, c_k))$.
The goal is to synthesize a program that faithfully represents the global structure in $x$. We capture this structure using a boolean tensor $B^x \in \{0, 1\}^{N \times M \times N \times M}$, where

$$B^x_{iji'j'} = 1 \iff d(x_{ij}, x_{i'j'}) \le \epsilon,$$

where $\epsilon$ is a hyperparameter, and $d$ is a distance metric on the space of subimages. In our implementation, we use a weighted sum of the earth mover's distance between the color histograms of $x_{ij}$ and $x_{i'j'}$, and the number of SIFT correspondences between $x_{ij}$ and $x_{i'j'}$.
Additionally, we associate a boolean tensor $B^P$ with a given program $P$. First, for a sketch $s$ with parameters $(a, b, m, a', b', m')$, we define

$$\mathrm{cover}(s) = \{(i, j) \mid i \in \{a, a + m, a + 2m, \dots\} \cap [a, b],\; j \in \{a', a' + m', a' + 2m', \dots\} \cap [a', b']\},$$

i.e., the set of grid cells where sketch $s$ renders a subimage upon execution. Then, we have

$$B^{(s,c)}_{iji'j'} = 1 \iff (i, j) \in \mathrm{cover}(s) \text{ and } (i', j') \in \mathrm{cover}(s),$$

i.e., $B^{(s,c)}_{iji'j'}$ indicates whether the sketch renders a subimage at both of the grid cells $(i, j)$ and $(i', j')$. Then,

$$B^P = \bigvee_{(s, c) \in P} B^{(s,c)},$$

where the disjunction of boolean tensors is defined elementwise. Intuitively, $B^P$ identifies the set of pairs of grid cells $(i, j)$ and $(i', j')$ that are equal in the image rendered upon executing each pair $(s, c)$ in $P$. (Note that the covers of different sketches in $P$ can overlap; we find that ignoring this overlap does not significantly impact our results.)
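Both tensors can be computed directly from their definitions. In this sketch we substitute exact pixel equality for the paper's perceptual distance (color histograms plus SIFT correspondences), so the threshold test becomes a plain comparison; the function names are ours.

```python
import numpy as np

def image_tensor(x, d):
    """B_x[i, j, i2, j2] = 1 iff subimages (i, j) and (i2, j2) match.
    Stand-in distance: exact pixel equality instead of a perceptual metric."""
    n, m = x.shape[0] // d, x.shape[1] // d
    sub = lambda i, j: x[i*d:(i+1)*d, j*d:(j+1)*d]
    B = np.zeros((n, m, n, m), dtype=bool)
    for i in range(n):
        for j in range(m):
            for i2 in range(n):
                for j2 in range(m):
                    B[i, j, i2, j2] = np.array_equal(sub(i, j), sub(i2, j2))
    return B

def program_tensor(program, n, m):
    """B_P: disjunction over sketches of 'both cells lie in the cover'."""
    B = np.zeros((n, m, n, m), dtype=bool)
    for (a, b, st, a2, b2, st2), _component in program:
        cov = [(i, j) for i in range(a, b + 1, st)
                      for j in range(a2, b2 + 1, st2)]
        for (i, j) in cov:
            for (i2, j2) in cov:
                B[i, j, i2, j2] = True
    return B

# a 2 x 2 grid of identical 2 x 2 cells: all pairs of cells match
x = np.tile(np.array([[1, 0], [0, 1]]), (2, 2))
Bx = image_tensor(x, d=2)
# a single sketch covering the whole grid reproduces that structure
Bp = program_tensor([((0, 1, 1, 0, 1, 1), None)], n=2, m=2)
```

When the program's tensor agrees with the image's tensor, the program explains all of the image's pairwise repetition, which is exactly what the objective below rewards.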
Finally, our program synthesis algorithm aims to solve the following optimization problem:

$$P^* = \arg\max_P \sum_{i, j, i', j'} \left[ (B^P \wedge B^x)_{iji'j'} - \lambda\, (\neg B^P \wedge B^x)_{iji'j'} \right], \tag{3}$$

where $\wedge$ and $\neg$ are applied elementwise, and $\lambda$ is a hyperparameter. In other words, the first term of the objective of (3) counts the number of true positives (i.e., entries where $B^P = B^x = 1$), the second term counts the number of false negatives (i.e., entries where $B^P = 0$ but $B^x = 1$), and the objective computes their weighted sum. Thus, the objective of (3) measures how well $B^P$ represents the global structure of $x$.
For tractability, we restrict the search space in (3) to programs whose perceptual components are subimages of the training image—i.e., each $c_t = x_{ij}$ for some $(i, j) \in [N] \times [M]$. In other words, rather than searching over all possible subimages $c \in \mathbb{R}^{d \times d}$, we only search over the subimages that actually occur in the training image $x$. This may lead to a slightly suboptimal solution, for example, in cases where the optimal subimage to be rendered is in fact an interpolation between two similar but distinct subimages in the training image. However, we found that in practice this simplifying assumption still produced viable results.
Program synthesis algorithm.
Exactly optimizing (3) is in general an NP-complete problem. Thus, our program synthesis algorithm uses a partially greedy heuristic. In particular, we initialize the program to $P = \varnothing$. Then, on each iteration, we enumerate all pairs $(s, c)$, where $c$ ranges over the set of all subimages $x_{ij}$ for $(i, j) \in [N] \times [M]$, and determine the pair that most increases the objective in (3). Finally, we add $(s, c)$ to $P$. We show the full algorithm in Algorithm 1. We have the following straightforward guarantee:

Theorem. If the false-negative term in (3) is ignored, then Algorithm 1 is a $(1 - 1/e)$-approximation algorithm for (3).

Proof. In this case, optimizing (3) is equivalent to set cover, where the items are the tuples $(i, j, i', j')$ such that $B^x_{iji'j'} = 1$, and the sets are those induced by $\mathrm{cover}(s)$. The theorem follows from (Hochbaum, 1997). ∎
In general, (3) is not submodular, but we find that the greedy heuristic still works well in practice.
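The greedy loop can be sketched end-to-end. The enumeration ranges, the scoring function (true positives minus $\lambda$ times false negatives, our reading of (3)), and the omission of a separate component search are simplifications of Algorithm 1, not a faithful reimplementation.

```python
import numpy as np

def greedy_synthesize(Bx, k=2, lam=0.5):
    """Partially greedy heuristic: repeatedly add the sketch whose cover
    most increases  |B_P & B_x| - lam * |B_x & ~B_P|."""
    n, m = Bx.shape[0], Bx.shape[1]

    def tensor_of(cov):
        # boolean tensor marking all pairs of cells inside one cover
        B = np.zeros_like(Bx)
        for (i, j) in cov:
            for (i2, j2) in cov:
                B[i, j, i2, j2] = True
        return B

    def score(Bp):
        return (Bp & Bx).sum() - lam * (Bx & ~Bp).sum()

    def all_sketches():
        # enumerate (row start, end, stride) x (col start, end, stride)
        for a in range(n):
            for b in range(a, n):
                for st in range(1, n + 1):
                    for a2 in range(m):
                        for b2 in range(a2, m):
                            for st2 in range(1, m + 1):
                                yield (a, b, st, a2, b2, st2)

    def cover_of(s):
        return [(i, j) for i in range(s[0], s[1] + 1, s[2])
                       for j in range(s[3], s[4] + 1, s[5])]

    program, Bp = [], np.zeros_like(Bx)
    for _ in range(k):
        best, best_gain = None, 0.0
        for s in all_sketches():
            gain = score(Bp | tensor_of(cover_of(s))) - score(Bp)
            if gain > best_gain:
                best, best_gain = s, gain
        if best is None:
            break
        program.append(best)
        Bp |= tensor_of(cover_of(best))
    return program

# toy image tensor: every pair of cells in a 2 x 2 grid matches
Bx = np.ones((2, 2, 2, 2), dtype=bool)
prog = greedy_synthesize(Bx, k=1)
```

On this toy input, the only sketch covering all four cells is the stride-1 loop over the whole grid, so the greedy step recovers it in one iteration.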
4 Experiments
We perform two experiments—one for generation from scratch and one for image completion. We find substantial improvement in both tasks. Details on neural network architectures are in Appendix A, and additional examples for image completion are in Appendix B.
4.1 Datasets
Synthetic dataset.
We developed a synthetic dataset based on MNIST. Each image consists of a square grid of fixed-size cells. Each grid cell is either filled with a colored MNIST digit or a solid color background. The program structure is a 2D repeating pattern of an MNIST digit; to add natural noise, each iteration of the for-loop in a sketch renders a different MNIST digit, but with the same MNIST label and color. Additionally, we chose the program structure to contain correlations characteristic of real-world images—e.g., correlations between different parts of the program, correlations between the program and the background, and noise in renderings of the same component. Examples are shown in Figure 4. We give details of how we constructed this dataset in Appendix A. This dataset contains 10,000 training and 500 test images.
Facades dataset.
Our second dataset consists of 1855 images (1755 training, 100 testing) of building facades. (We chose a large training set since our dataset is so small.) These images were all scaled to a fixed size, and were divided into a grid of cells each of size 17 or 18 pixels. These images contain repeating patterns of objects such as windows and doors.
4.2 Generation from Scratch
Experimental setup.
We evaluate our approach PSGM for generation from scratch on the synthetic dataset—the facades dataset was too small to produce meaningful results. As described in Section 2, we use Algorithm 1 to synthesize a program $P_i$ representing each training image $x_i$. Then, we train the encoder and the decoder of the latent structure VAE on the training set $\{P_i\}$.
For the second stage of PSGM (i.e., completing the structure rendering $\tilde{x}$ into an image $x$), we use a variational encoder-decoder (VED), where the encoder maps a structure rendering $\tilde{x}$ to a latent vector, and the decoder maps this latent vector to a complete image $x$. We train the VED using the VAE training loss, except that we minimize the reconstruction distance between the output for a structure rendering $\tilde{x}$ and the original image $x$. Additionally, we trained a CycleGAN model to map structure renderings to complete images, by giving the CycleGAN unaligned pairs of $\tilde{x}_i$ and $x_i$ as training data. We compare our VED model to a VAE (Kingma & Welling, 2014), and our CycleGAN model to SpatialGAN (Bergmann et al., 2017).
Results.
Table 1: Generation from scratch on the synthetic dataset.

Model               | Score
--------------------|---------
PSGM (CycleGAN)     | 85.51
BL (SpatialGAN)     | 258.68
PSGM (VED)          | 59414.7
BL (VAE)            | 60368.4
PSGM (VED Stage 1)  | 32.0
PSGM (VED Stage 2)  | 59382.6
Figure 4 (rows): Original Images; PSGM (CycleGAN); Baseline (SpatialGAN); PSGM (VED); Baseline (VAE).

Figure 5 (panels): Original Image (Synthetic); Original Image (Facades); PSGM (GLCIC, Synthetic); PSGM (GLCIC, Facades); Baseline (GLCIC, Synthetic); Baseline (GLCIC, Facades).
We measure performance for PSGM with the VED and the baseline VAE using the variational lower bound on the negative log-likelihood (NLL) (Zhao et al., 2017) on a held-out test set. For our approach, we use the lower bound (2) (technically, the NLL is lower bounded by the loss of the variational encoder-decoder), which is the sum of the NLLs of the first and second stages; we report these NLLs separately as well. Figure 4 shows examples of generated images. For PSGM and SpatialGAN, we use Fréchet inception distance (Heusel et al., 2017). Table 1 shows these metrics for both our approach and the baselines.
Discussion.
The models based on our approach quantitatively improve over the respective baselines. The examples of images generated using our approach with VED completion appear to contain more structure than those generated using the baseline VAE. Similarly, the images generated using our approach with CycleGAN clearly capture more complex structure than the unbounded 2D repeating texture patterns captured by SpatialGAN.
4.3 Image Completion
Experimental setup.
Second, we evaluated our approach PSGM for image completion, on both our synthetic dataset and the facades dataset. For this task, we compare three image completion models: GLCIC (Iizuka et al., 2017), CycleGAN (Zhu et al., 2017), and the VED architecture described in Section 4.2. GLCIC is a state-of-the-art image completion model. CycleGAN is a generic image-to-image transformer. It uses unpaired training data, but we found that for our task, it outperforms approaches such as Pix2Pix (Isola et al., 2017) that take paired training data. For each model, we trained two versions:

Our approach (PSGM): As described in Section 2 (for image completion), given a partial image $x^{\text{part}}$, we use Algorithm 1 to synthesize a program $P^{\text{part}}$. We extrapolate $P^{\text{part}}$ to $\hat{P}$, and execute $\hat{P}$ to obtain a structure rendering $\tilde{x}$. Finally, we train the image completion model (GLCIC, CycleGAN, or VED) to complete $\tilde{x}$ to the original image $x$.

Baseline: Given a partial image $x^{\text{part}}$, we train the image completion model (GLCIC, CycleGAN, or VED) to directly complete $x^{\text{part}}$ to the original image $x$.
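The PSGM completion pipeline amounts to: synthesize on the observed region, extrapolate the loop bounds, and render onto the partial image. A toy sketch of those steps follows; the extrapolation rule (extend bounds to the full grid while keeping offsets and strides) and the NaN encoding of unobserved pixels are our simplifications, since the paper learns the extrapolation model.

```python
import numpy as np

def extrapolate(sketch_part, n_rows, n_cols):
    """Stand-in for the learned extrapolation model: extend the loop
    bounds of a program synthesized on the partial image to the full
    grid, keeping the start offsets and strides."""
    a, b, m, a2, b2, m2 = sketch_part
    return (a, n_rows - 1, m, a2, n_cols - 1, m2)

def render_onto(partial, sketch, component, d):
    """Execute the extrapolated sketch onto the partial image
    (unobserved pixels are NaN) instead of a blank canvas."""
    out = partial.copy()
    a, b, m, a2, b2, m2 = sketch
    for i in range(a, b + 1, m):
        for j in range(a2, b2 + 1, m2):
            out[i*d:(i+1)*d, j*d:(j+1)*d] = component
    return out

d = 2
partial = np.full((8, 8), np.nan)   # 4 x 4 grid of 2 x 2 cells
partial[:4] = 0.0                   # top half observed (background)
partial[0:2, 0:2] = 1.0             # one "window" in the observed part
s_part = (0, 1, 2, 0, 1, 2)         # loop found on the partial image
s_full = extrapolate(s_part, n_rows=4, n_cols=4)
rendering = render_onto(partial, s_full, np.ones((2, 2)), d)
```

The completion model then only has to fill the remaining NaN region, with the extrapolated windows already in place as global context.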
Table 2: Image completion results.

Model    | Synthetic PSGM | Synthetic BL | Facades PSGM | Facades BL
---------|----------------|--------------|--------------|-----------
GLCIC    | 106.8          | 163.66       | 141.8        | 195.9
CycleGAN | 91.8           | 218.7        | 124.4        | 251.4
VED      | 44570.4        | 52442.9      | 8755.4       | 8636.3
Results.
As in Section 4.2, we measure performance using Fréchet inception distance for GLCIC and CycleGAN, and negative log-likelihood (NLL) for the VED, reported on a held-out test set. We show these results in Table 2. We show examples of completed images using GLCIC in Figure 5. We show additional examples of completed images, including those completed using CycleGAN and VED, in Appendix B.
Discussion.
Our approach PSGM outperforms the baseline in every case except the VED on the facades dataset. We believe this last result occurs because both VEDs failed to learn any meaningful structure (see Figure 7 in Appendix B).
A key reason why the baselines perform so poorly on the facades dataset is that the dataset is very small. Nevertheless, even on the synthetic dataset (which is fairly large), PSGM substantially outperforms the baselines. Finally, generative models such as GLCIC are known to perform poorly far away from the edges of the provided partial image (Iizuka et al., 2017). A benefit of our approach is that it provides the global context for a deep-learning-based image completion model such as GLCIC to perform local completion.
5 Conclusion
We have proposed a new approach to generation that incorporates programmatic structure into state-of-the-art deep learning models. In our experiments, we have demonstrated the promise of our approach to improve generation of high-dimensional data with global structure that current state-of-the-art deep generative models have difficulty capturing.
References

 Andreas et al. (2016) Andreas, J., Rohrbach, M., Darrell, T., and Klein, D. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 39–48, 2016.
 Bastani et al. (2018) Bastani, O., Pu, Y., and Solar-Lezama, A. Verifiable reinforcement learning via policy extraction. In NIPS, 2018.
 Bergmann et al. (2017) Bergmann, U., Jetchev, N., and Vollgraf, R. Learning texture manifolds with the periodic spatial GAN. CoRR, abs/1705.06566, 2017. URL http://arxiv.org/abs/1705.06566.
 Deng et al. (2018) Deng, Z., Chen, J., Fu, Y., and Mori, G. Probabilistic neural programmed networks for scene generation. In Advances in Neural Information Processing Systems, pp. 4032–4042, 2018.
 Dieleman et al. (2018) Dieleman, S., van den Oord, A., and Simonyan, K. The challenge of realistic music generation: modelling raw audio at scale. In Advances in Neural Information Processing Systems, pp. 8000–8010, 2018.
 Ellis et al. (2015) Ellis, K., Solar-Lezama, A., and Tenenbaum, J. Unsupervised learning by program synthesis. In Advances in Neural Information Processing Systems, pp. 973–981, 2015.
 Ellis et al. (2018) Ellis, K., Ritchie, D., Solar-Lezama, A., and Tenenbaum, J. Learning to infer graphics programs from hand-drawn images. In Advances in Neural Information Processing Systems, pp. 6060–6069, 2018.
 Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
 Ha & Eck (2017) Ha, D. and Eck, D. A neural representation of sketch drawings. arXiv preprint arXiv:1704.03477, 2017.
 Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637, 2017.
 Hochbaum (1997) Hochbaum, D. S. Approximating covering and packing problems: set cover, vertex cover, independent set, and related problems. In Approximation Algorithms for NP-Hard Problems, pp. 94–143, 1997.
 Iizuka et al. (2017) Iizuka, S., Simo-Serra, E., and Ishikawa, H. Globally and locally consistent image completion. ACM Transactions on Graphics, 2017.

 Isola et al. (2017) Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A. Image-to-image translation with conditional adversarial networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5967–5976. IEEE, 2017.
 Jetchev et al. (2017) Jetchev, N., Bergmann, U., and Vollgraf, R. Texture synthesis with spatial generative adversarial networks. 2017.
 Kingma & Welling (2014) Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In ICLR, 2014.
 Lake et al. (2015) Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
 Pu et al. (2018) Pu, Y., Miranda, Z., Solar-Lezama, A., and Kaelbling, L. Selecting representative examples for program synthesis. In International Conference on Machine Learning, pp. 4158–4167, 2018.
 Radford et al. (2015) Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2015.
 Solar-Lezama et al. (2006) Solar-Lezama, A., Tancau, L., Bodik, R., Seshia, S., and Saraswat, V. Combinatorial sketching for finite programs. In ASPLOS, volume 41, pp. 404–415. ACM, 2006.
 Tyleček & Šára (2013) Tyleček, R. and Šára, R. Spatial pattern templates for recognition of objects with regular structure. In Proc. GCPR, Saarbrucken, Germany, 2013.
 Valkov et al. (2018) Valkov, L., Chaudhari, D., Srivastava, A., Sutton, C., and Chaudhuri, S. Houdini: Lifelong learning as program synthesis. In Advances in Neural Information Processing Systems, pp. 8701–8712, 2018.
 Verma et al. (2018) Verma, A., Murali, V., Singh, R., Kohli, P., and Chaudhuri, S. Programmatically interpretable reinforcement learning. In ICML, 2018.
 Wang & Rudin (2015) Wang, F. and Rudin, C. Falling rule lists. In Artificial Intelligence and Statistics, pp. 1013–1022, 2015.
 Xian et al. (2018) Xian, W., Sangkloy, P., Agrawal, V., Raj, A., Lu, J., Fang, C., Yu, F., and Hays, J. TextureGAN: Controlling deep image synthesis with texture patches. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8456–8465, 2018.
 Zhao et al. (2017) Zhao, S., Song, J., and Ermon, S. Towards deeper understanding of variational autoencoding models. arXiv preprint arXiv:1702.08658, 2017.
 Zhu et al. (2017) Zhu, J.-Y., Park, T., and Efros, A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
Appendix A Experimental Details
A.1 Synthetic Dataset
To sample a random image, we start with a grid of fixed-size cells. We randomly sample a program $P = ((s_1, c_1), \dots, (s_k, c_k))$, where each perceptual component $c_t$ is a randomly selected MNIST image (downscaled to our grid cell size and colorized). To create correlations between different parts of $P$, we sample $(s_t, c_t)$ depending on $(s_{t-1}, c_{t-1})$. First, to sample each component $c_t$, we first sample latent properties of $c_t$ (i.e., its MNIST label and its color). Second, we sample the parameters of $s_t$ conditional on these properties. To each of the 25 possible latent properties, we associate a discrete distribution over latent properties for later elements in the sequence, as well as a mean and standard deviation for each of the parameters of the corresponding sketch $s_t$.

We then render $P$ by executing each $(s_t, c_t)$ in sequence. However, when executing $(s_t, c_t)$, on each iteration of the for-loop, instead of rendering the subimage $c_t$ at each position in the grid, we randomly sample another MNIST image $c'$ with the same label as $c_t$, recolor $c'$ to be the same color as $c_t$, and render $c'$. By doing so, we introduce noise into the programmatic structure.
A.2 Generation from Scratch
PSGM architecture.
For the first stage of PSGM (i.e., generating the program $P$), we use a 3-layer LSTM encoder and a feedforward decoder. The LSTM operates over sequences of 13-dimensional vectors, of which 6 dimensions represent the structure of the for-loop being generated, and 7 dimensions are an encoding of the image to be rendered. The image compression was performed via a convolutional architecture with 2 convolutional layers for encoding and 3 deconvolutional layers for decoding.
For the second stage of PSGM (i.e., completing the structure rendering $\tilde{x}$ into an image $x$), we use a VED; the encoder is a CNN with 4 layers, and the decoder is a transpose CNN with 6 layers. The CycleGAN model has a discriminator with 3 convolutional layers and a generator that uses transfer learning by employing the pretrained ResNet architecture.
Baseline architecture.
The architecture of the baseline is a vanilla VAE with the same architecture as the VED we used for the second stage of PSGM, except that the input to the encoder is the original training image $x$ instead of the structure rendering $\tilde{x}$. The baselines with CycleGAN/GLCIC also use the same architectures as PSGM with CycleGAN/GLCIC. The SpatialGAN was trained with 5 layers each in the generator and discriminator, and 60-dimensional global and 3-dimensional periodic latent vectors.
A.3 Image Completion
PSGM architecture.
For the first stage of PSGM for completion (extrapolation of the program from a partial image to a full image), we use a feedforward network with three layers. For the second stage of completion via VED, we use a convolutional/deconvolutional architecture: the encoder is a CNN with 4 layers, and the decoder is a transpose CNN with 6 layers. As in generation, the CycleGAN model has a discriminator with 3 convolutional layers and a generator that uses transfer learning by employing the pretrained ResNet architecture.
Baseline architecture.
For the baseline VED architecture, we used a similar architecture to the PSGM completion step (4 convolutional and 6 deconvolutional layers). The only difference was the input, which was a partial image rather than an image rendered with structure. The CycleGAN architecture was similar to that used in PSGM (although it mapped partial images to full images, rather than partial images with structure to full images).
Appendix B Additional Results
In Figure 6, we show examples of how our image completion pipeline is applied to the facades dataset, and in Figure 7, we show examples of how our image completion pipeline is applied to our synthetic dataset.
Figure 6 (rows): Original Image; Partial Image; Structure Rendering; PSGM (GLCIC); Baseline (GLCIC); PSGM (CycleGAN); Baseline (CycleGAN); PSGM (VED); Baseline (VED).
Figure 7 (rows): Original Image; Partial Image; Structure Rendering (Partial); Structure Rendering (Extrapolated); PSGM (GLCIC); Baseline (GLCIC); PSGM (CycleGAN); Baseline (CycleGAN); PSGM (VED); Baseline (VED).