Learning Neurosymbolic Generative Models via Program Synthesis

01/24/2019 ∙ by Halley Young, et al. ∙ 10

Significant strides have been made toward designing better generative models in recent years. Despite this progress, however, state-of-the-art approaches are still largely unable to capture complex global structure in data. For example, images of buildings typically contain spatial patterns such as windows repeating at regular intervals; state-of-the-art generative methods can't easily reproduce these structures. We propose to address this problem by incorporating programs representing global structure into the generative model---e.g., a 2D for-loop may represent a configuration of windows. Furthermore, we propose a framework for learning these models by leveraging program synthesis to generate training data. On both synthetic and real-world data, we demonstrate that our approach is substantially better than the state-of-the-art at both generating and completing images that contain global structure.




1 Introduction

There has been much interest recently in generative models, following the introduction of both variational autoencoders (VAEs) (Kingma & Welling, 2014) and generative adversarial networks (GANs) (Goodfellow et al., 2014). These models have successfully been applied to a range of tasks, including image generation (Radford et al., 2015), image completion (IIzuka et al., 2017), texture synthesis (Jetchev et al., 2017; Xian et al., 2018), sketch generation (Ha & Eck, 2017), and music generation (Dieleman et al., 2018).

Despite their successes, generative models still have difficulty capturing global structure. For example, consider the image completion task in Figure 1. The original image (left) is of a building, for which the global structure is a 2D repeating pattern of windows. Given a partial image (middle left), the goal is to predict the completion of the image. As can be seen, a state-of-the-art image completion algorithm has trouble reconstructing the original image (right) (IIzuka et al., 2017). (Footnote 1: The baseline model performs particularly poorly since our dataset is small. We show on synthetic data that our approach is significantly better even when a large amount of data is available.) Real-world data often contains such global structure, including repetitions, reflectional or rotational symmetry, or even more complex patterns.

In the past few years, program synthesis (Solar-Lezama et al., 2006) has emerged as a promising approach to capturing patterns in data (Ellis et al., 2015, 2018; Valkov et al., 2018). The idea is that simple programs can capture global structure that evades state-of-the-art deep neural networks. A key benefit of using program synthesis is that we can design the space of programs to capture different kinds of structure (e.g., repeating patterns (Ellis et al., 2018), symmetries, or spatial structure (Deng et al., 2018)) depending on the application domain. The challenge is that for the most part, existing approaches have synthesized programs that operate directly over raw data. Since programs have difficulty operating over perceptual data, existing approaches have largely been limited to very simple data, e.g., detecting 2D repeating patterns of simple shapes (Ellis et al., 2018).

We propose to address these shortcomings by synthesizing programs that represent the underlying structure of high-dimensional data. In particular, we decompose programs into two parts: (i) a sketch that represents the skeletal structure of the program (Solar-Lezama et al., 2006), with holes that are left unimplemented, and (ii) components that can be used to fill these holes. We consider perceptual components, i.e., holes in the sketch are filled with raw perceptual data. For example, the program

represents the structure in the original image in Figure 1 (left). The black text is the sketch, and the component is a sub-image taken from the given partial image. Then, the draw function renders the given sub-image at the given position. We call a sketch whose holes are filled with perceptual components a neurosymbolic program.

Building on these ideas, we propose an approach called program-synthesis (guided) generative models (PS-GM) that combines neurosymbolic programs representing global structure with state-of-the-art deep generative models. By incorporating programmatic structure, PS-GM substantially improves the quality of these state-of-the-art models. As can be seen, the completion produced using PS-GM (middle right of Figure 1) substantially outperforms the baseline.

We show that PS-GM can be used for both generation from scratch and for image completion. The generation pipeline is shown in Figure 2. At a high level, PS-GM for generation operates in two phases:

  • First, it generates a program that represents the global structure in the image to be generated. In particular, it generates a program $P = (s, c)$ representing the latent global structure in the image (left in Figure 2), where $s$ is a sketch and $c$ is a perceptual component.

  • Second, our algorithm executes $P$ to obtain a structure rendering $x_{\text{struct}}$ representing the program as an image (middle of Figure 2). Then, our algorithm uses a deep generative model to complete $x_{\text{struct}}$ into a full image (right of Figure 2). The structure in $x_{\text{struct}}$ helps guide the deep generative model towards images that preserve the global structure.

The image-completion pipeline (see Figure 3) is similar.

Training these models end-to-end is challenging, since a priori, ground truth global structure is unavailable. Furthermore, representative global structure is very sparse, so approaches such as reinforcement learning do not scale. Instead, we leverage domain-specific program synthesis algorithms to produce examples of programs that represent global structure of the training data. In particular, we propose a synthesis algorithm tailored to the image domain, which extracts programs with nested for-loops that can represent multiple 2D repeating patterns in images. Then, we use these example programs as supervised training data.

Our programs can capture rich spatial structure in the training data. For example, in Figure 2, the program structure encodes a repeating structure of 0’s and 2’s on the whole image, and a separate repeating structure of 3’s on the right-hand side of the image. Furthermore, in Figure 1, the generated image captures the idea that the repeating pattern of windows does not extend to the bottom portion of the image.


We propose an architecture of generative models that incorporates programmatic structure, as well as an algorithm for training these models (Section 2). Our learning algorithm depends on a domain-specific program synthesis algorithm for extracting global structure from the training data; we propose such an algorithm for the image domain (Section 3). Finally, we evaluate our approach on synthetic data and on a real-world dataset of building facades (Tyleček & Šára, 2013), both on the task of generation from scratch and on generation from a partial image. We show that our approach substantially outperforms several state-of-the-art deep generative models (Section 4).

Figure 2: Our image generation pipeline consists of the following steps: (i) Our generative model samples a latent vector $z$, and samples a program $P = (s, c)$ (left). (ii) Our model executes $P$ to obtain a rendering $x_{\text{struct}}$ of the program structure (middle). (iii) Our model samples a completion of $x_{\text{struct}}$ into a full image (right). [Panels: for-loop from sampled program; structure rendering; completed image.]

Figure 3: Our image completion pipeline consists of the following steps: (i) Given a partial image $x_{\text{part}}$ (top left), our program synthesis algorithm (Section 3) synthesizes a program $P_{\text{part}}$ representing the structure in the partial image (top middle). (ii) Our model extrapolates $P_{\text{part}}$ to a program $\hat{P}_{\text{full}}$ representing the structure of the whole image. (iii) Our model executes $\hat{P}_{\text{full}}$ to obtain a rendering $x_{\text{struct+part}}$ of the program structure (bottom left). (iv) Our model completes $x_{\text{struct+part}}$ into an image $\hat{x}$ (bottom middle), which resembles the original image $x^*$ (bottom right). [Panels: partial image; synthesized program; extrapolated program; structure rendering; completion (ours); original image.]

Related work.

There has been growing interest in applying program synthesis to machine learning, for purposes of interpretability (Wang & Rudin, 2015; Verma et al., 2018), safety (Bastani et al., 2018), and lifelong learning (Valkov et al., 2018). Most relevantly, there has been interest in using programs to capture structure that deep learning models have difficulty representing (Lake et al., 2015; Ellis et al., 2015, 2018; Pu et al., 2018). For instance, Ellis et al. (2015) propose an unsupervised learning algorithm for capturing repeating patterns in simple line drawings; however, not only are their domains simple, but they can only handle a very small amount of noise. Similarly, Ellis et al. (2018) capture 2D repeating patterns of simple circles and polygons; however, rather than synthesizing programs with perceptual components, they learn a simple mapping from images to symbols as a preprocessing step. The closest work we are aware of is Valkov et al. (2018), which synthesizes programs with neural components (i.e., components implemented as neural networks); however, their application is to lifelong learning, not generation, and to learning with supervision (labels) rather than to unsupervised learning of structure.

Additionally, there has been work extending neural module networks (Andreas et al., 2016) to generative models (Deng et al., 2018). These algorithms essentially learn a collection of neural components that can be composed together based on hierarchical structure. However, they require that the structure be available (albeit in natural language form) both for training the model and for generating new images.

Finally, there has been work incorporating spatial structure into generative models for generating textures (Jetchev et al., 2017); however, their work only handles a single infinite repeating 2D pattern. In contrast, we can capture a rich variety of spatial patterns parameterized by a space of programs. For example, the image in Figure 1 generated by our technique contains different repeating patterns in different parts of the image.

2 Generative Models with Latent Structure

We describe our proposed architecture for generative models that incorporate programmatic structure. For most of this section, we focus on generation; we discuss how we adapt these techniques to image completion at the end. We illustrate our generation pipeline in Figure 2.

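Let $p_{\theta,\phi}(x)$ be a distribution over a space $\mathcal{X}$ with unknown parameters $\theta, \phi$ that we want to estimate. We study the setting where $x$ is generated based on some latent structure, which consists of a program sketch $s \in \mathcal{S}$ and a perceptual component $c \in \mathcal{C}$, and where the structure is in turn generated conditioned on a latent vector $z \in \mathcal{Z}$, i.e.,

$$p_{\theta,\phi}(x) = \int\!\!\int\!\!\int p_\theta(x \mid s, c)\, p_\phi(s, c \mid z)\, p(z)\, dz\, ds\, dc.$$

Figure 2 shows an example of a sampled program $P = (s, c)$ (left), and an example of the sampled completion (right). To sample a completion, our model executes $P$ to obtain a structure rendering $x_{\text{struct}}$ (middle), and then samples a completion based on $x_{\text{struct}}$, i.e., $p_\theta(x \mid s, c) = p_\theta(x \mid x_{\text{struct}})$.

We now describe our algorithm for learning the parameters $\theta, \phi$ of $p_{\theta,\phi}$, followed by a description of our choices of architecture for $p_\phi(s, c \mid z)$ and $p_\theta(x \mid s, c)$.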

Learning algorithm.

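Given training data $\{x^{(i)}\}_{i=1}^{n} \subseteq \mathcal{X}$, where $x^{(i)} \sim p_{\theta,\phi}(x)$, the maximum likelihood estimate is

$$\theta^*, \phi^* = \arg\max_{\theta,\phi} \sum_{i=1}^{n} \log p_{\theta,\phi}(x^{(i)}).$$

Since $\log p_{\theta,\phi}(x)$ is intractable to optimize, we use an approach based on the variational autoencoder (VAE). In particular, we use a variational distribution

$$q_{\tilde\phi}(s, c, z \mid x) = q_{\tilde\phi}(z \mid s, c)\, q(s, c \mid x),$$

which has parameters $\tilde\phi$. Then, we optimize $\tilde\phi$ while simultaneously optimizing $\theta, \phi$. Using $q_{\tilde\phi}(s, c, z \mid x)$, the evidence lower bound on the log-likelihood is

$$\log p_{\theta,\phi}(x) \ge \mathbb{E}_{q(s,c \mid x)}\big[\log p_\theta(x \mid s, c)\big] + \mathbb{E}_{q(s,c \mid x),\, q_{\tilde\phi}(z \mid s,c)}\big[\log p_\phi(s, c \mid z)\big] - \mathbb{E}_{q(s,c \mid x)}\big[D_{\mathrm{KL}}\big(q_{\tilde\phi}(z \mid s, c) \,\|\, p(z)\big)\big] + H\big(q(s, c \mid x)\big), \tag{1}$$

where $D_{\mathrm{KL}}$ is the KL divergence and $H$ is information entropy. Thus, we can approximate $\theta^*, \phi^*$ by optimizing the lower bound (1) instead of $\log p_{\theta,\phi}(x)$. However, (1) remains intractable since we are integrating over all program sketches $s \in \mathcal{S}$ and perceptual components $c \in \mathcal{C}$. Using sampling to estimate these integrals would be very computationally expensive. Instead, we propose an approach that uses a single point estimate $s_x \in \mathcal{S}$ and $c_x \in \mathcal{C}$ for each $x \in \mathcal{X}$, which we describe below.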

Synthesizing structure.

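For a given $x \in \mathcal{X}$, we use program synthesis to infer a single likely choice $s_x \in \mathcal{S}$ and $c_x \in \mathcal{C}$ of the latent structure. The program synthesis algorithm must be tailored to a specific domain; we propose an algorithm for inferring for-loop structure in images in Section 3. Then, we use these point estimates in place of the integrals over $s$ and $c$, i.e., we assume that

$$q(s, c \mid x) = \delta(s - s_x)\, \delta(c - c_x),$$

where $\delta$ is the Dirac delta function. Plugging into (1) gives

$$\log p_{\theta,\phi}(x) \ge \log p_\theta(x \mid s_x, c_x) + \mathbb{E}_{q_{\tilde\phi}(z \mid s_x, c_x)}\big[\log p_\phi(s_x, c_x \mid z)\big] - D_{\mathrm{KL}}\big(q_{\tilde\phi}(z \mid s_x, c_x) \,\|\, p(z)\big), \tag{2}$$

where we have dropped the degenerate terms $\log \delta(s - s_x)$ and $\log \delta(c - c_x)$ (which are constant with respect to the parameters $\theta, \phi, \tilde\phi$). As a consequence, (1) decomposes into two parts that can be straightforwardly optimized, i.e.,

$$\mathcal{L}(\theta; x) = \log p_\theta(x \mid s_x, c_x), \qquad \mathcal{L}(\phi, \tilde\phi; x) = \mathbb{E}_{q_{\tilde\phi}(z \mid s_x, c_x)}\big[\log p_\phi(s_x, c_x \mid z)\big] - D_{\mathrm{KL}}\big(q_{\tilde\phi}(z \mid s_x, c_x) \,\|\, p(z)\big),$$

where we can optimize $\theta$ and $(\phi, \tilde\phi)$ independently:

$$\theta^* = \arg\max_{\theta} \sum_{i=1}^{n} \mathcal{L}(\theta; x^{(i)}), \qquad \phi^*, \tilde\phi^* = \arg\max_{\phi, \tilde\phi} \sum_{i=1}^{n} \mathcal{L}(\phi, \tilde\phi; x^{(i)}).$$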

Latent structure VAE.

Note that $\mathcal{L}(\phi, \tilde\phi; x)$ is exactly equal to the objective of a VAE, where $q_{\tilde\phi}(z \mid s, c)$ is the encoder and $p_\phi(s, c \mid z)$ is the decoder, i.e., learning the distribution over latent structure is equivalent to learning the parameters of a VAE. The architecture of this VAE depends on the representation of $s$ and $c$. In the case of for-loop structure in images, we use a sequence-to-sequence VAE.
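To make the shape of this VAE objective concrete, the following is a minimal numerical sketch. It assumes a diagonal-Gaussian encoder and a standard normal prior over the latent vector (the usual VAE choices; the text does not state them), and the function names `gaussian_kl` and `structure_vae_objective` are hypothetical.

```python
import math

def gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def structure_vae_objective(log_lik, mu, log_var):
    """Single-sample estimate of the latent-structure bound:
    decoder log-likelihood of the program minus the encoder's KL term.

    log_lik is assumed to be log p(s_x, c_x | z) for one z drawn from
    the encoder; mu and log_var are the encoder's outputs."""
    return log_lik - gaussian_kl(mu, log_var)

# An encoder that outputs the prior itself pays no KL penalty.
kl_at_prior = gaussian_kl([0.0, 0.0], [0.0, 0.0])
# Shifting each mean by 1 costs 0.5 per dimension.
kl_shifted = gaussian_kl([1.0, 1.0], [0.0, 0.0])
bound = structure_vae_objective(-3.0, [1.0, 1.0], [0.0, 0.0])
print(kl_at_prior, kl_shifted, bound)
```

In practice both terms would be computed by the sequence-to-sequence encoder and decoder and maximized with stochastic gradients; the sketch only shows how the two terms combine.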

Generating data with structure.

The term $\mathcal{L}(\theta; x)$ corresponds to learning a probability distribution (conditioned on the latent structure $s$ and $c$); e.g., we can estimate this distribution using another VAE. As before, the architecture of this VAE depends on the representation of $s$ and $c$. Rather than directly predicting $x$ based on $s$ and $c$, we can leverage the program structure more directly by first executing the program $P = (s, c)$ to obtain its output $x_{\text{struct}}$, which we call a structure rendering. In particular, $x_{\text{struct}}$ is a more direct representation of the global structure represented by $P$, so it is often more suitable to use as input to a neural network. The middle of Figure 2 shows an example of a structure rendering for the program on the left. Then, we can train a model $p_\theta(x \mid s, c) = p_\theta(x \mid x_{\text{struct}})$.

In the case of images, we use a VAE with convolutional layers for the encoder and transpose convolutional layers for the decoder. Furthermore, instead of estimating the entire distribution $p_\theta(x \mid s, c)$, we also consider two non-probabilistic approaches that directly predict $x$ from $x_{\text{struct}}$, which is an image completion problem. We can solve this problem using GLCIC, a state-of-the-art image completion model (IIzuka et al., 2017). We can also use CycleGAN (Zhu et al., 2017), which solves the more general problem of mapping a training set of structure renderings $\{x_{\text{struct}}\}$ to a training set of completed images $\{x\}$. (Footnote 2: Pix2Pix (Isola et al., 2017) may seem more appropriate since it takes training pairs $(x_{\text{struct}}, x)$, but CycleGAN outperformed it.)

Image completion.

In image completion, we are given a set of training pairs $(x_{\text{part}}, x^*)$, and the goal is to learn a model that predicts the complete image $x^*$ given a partial image $x_{\text{part}}$. Compared to generation, our likelihood is now conditioned on $x_{\text{part}}$, i.e., $p_{\theta,\phi}(x \mid x_{\text{part}})$. Now, we describe how we modify each of our two models $p_\theta(x \mid s, c)$ and $p_\phi(s, c \mid z)$ to incorporate this extra information.

First, the programmatic structure is no longer fully latent, since we can observe partial programmatic structure in $x_{\text{part}}$. In particular, we can leverage our program synthesis algorithm to help perform completion. We first synthesize programs $P_{\text{part}}$ and $P_{\text{full}}$ representing the global structure in $x_{\text{part}}$ and $x^*$, respectively. Then, we can train a model that predicts $P_{\text{full}}$ given $P_{\text{part}}$, i.e., it extrapolates $P_{\text{part}}$ to a program $\hat{P}_{\text{full}}$ representing the structure of the whole image. Thus, unlike generation, where we sample a program $P = (s, c)$, we use the extrapolated program $\hat{P}_{\text{full}}$.

The second model for the most part remains the same, except that when we execute $\hat{P}_{\text{full}}$ to obtain a structure rendering, we render onto the partial image $x_{\text{part}}$ instead of onto a blank image to obtain the final rendering $x_{\text{struct+part}}$. Then, we complete the structure rendering $x_{\text{struct+part}}$ into a prediction $\hat{x}$ of the full image as before (i.e., using a VAE, GLCIC, or CycleGAN).

Our image completion pipeline is shown in Figure 3, including the given partial image $x_{\text{part}}$ (top left), the program $P_{\text{part}}$ synthesized from the partial image (top middle), the extrapolated program $\hat{P}_{\text{full}}$ (top right), the structure rendering $x_{\text{struct+part}}$ (bottom left), and the predicted full image $\hat{x}$ (bottom middle).
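The difference between rendering onto a blank image (generation) and onto the partial image (completion) can be illustrated with a toy grid. This is only a sketch under simplifying assumptions: grid cells hold single characters instead of sub-images, the list of cells drawn by the extrapolated program is made up, and the helper `draw_cells` is hypothetical. Whether program cells should overwrite already observed cells is not specified in the text; here they simply overwrite.

```python
def draw_cells(canvas, cells, component):
    """Return a copy of `canvas` with `component` drawn at each (row, col) in `cells`."""
    out = [row[:] for row in canvas]
    for t, u in cells:
        out[t][u] = component
    return out

# Toy 3x3 grid of cells; None marks an empty (unobserved) cell.
blank = [[None] * 3 for _ in range(3)]
partial = [["a", "b", "c"], [None] * 3, [None] * 3]  # only the top row is observed

# Cells (0-indexed) that a hypothetical extrapolated program draws a window "W" into.
program_cells = [(1, 0), (1, 2), (2, 0), (2, 2)]

x_struct = draw_cells(blank, program_cells, "W")         # generation: blank canvas
x_struct_part = draw_cells(partial, program_cells, "W")  # completion: partial image as canvas

for row in x_struct_part:
    print(row)
```

The completion model then only has to fill the remaining empty cells, with both the observed pixels and the extrapolated structure available as context.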

3 Synthesizing Programmatic Structure

Image representation.

Since the images we work with are very high dimensional, for tractability, we assume that each image $x \in \mathbb{R}^{NM \times NM}$ is divided into a grid containing $N$ rows and $N$ columns, where each grid cell has size $M \times M$ pixels (where $M$ is a hyperparameter). For example, this grid structure is apparent in Figure 3 (top right), where , and , for the facade and synthetic datasets respectively. For $t, u \in [N] = \{1, \dots, N\}$, we let $x_{tu} \in \mathbb{R}^{M \times M}$ denote the sub-image at position $(t, u)$ in the $N \times N$ grid.
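As an illustration of this grid representation, the sketch below splits a square image into equally sized square cells. It is a minimal example, not the authors' code: the image is a nested list of pixel values, the function name `split_into_grid` is hypothetical, and indices are 0-based here whereas the text counts grid positions from 1.

```python
def split_into_grid(image, n, m):
    """Split an (n*m) x (n*m) image into an n x n grid of m x m cells.

    Returns cells such that cells[t][u] is the sub-image at grid row t and
    grid column u (0-indexed), itself a list of m rows of m pixels."""
    assert len(image) == n * m and all(len(row) == n * m for row in image)
    return [[[row[u * m:(u + 1) * m] for row in image[t * m:(t + 1) * m]]
             for u in range(n)]
            for t in range(n)]

# Toy 4x4 "image" whose pixel values are 0..15 in row-major order.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
cells = split_into_grid(image, n=2, m=2)

print(cells[0][1])  # top-right cell
print(cells[1][0])  # bottom-left cell
```

For images whose side length is not an exact multiple of the cell size, cells of slightly different sizes would be needed; the sketch does not handle that case.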

Program grammar.

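Given this structure, we consider programs that draw 2D repeating patterns of sub-images on the grid. More precisely, we consider programs

$$P = ((s_1, c_1), \dots, (s_k, c_k))$$

that are length-$k$ lists of pairs $(s_i, c_i)$ consisting of a sketch $s_i \in \mathcal{S}$ and a perceptual component $c_i \in \mathcal{C}$; here, $k$ is a hyperparameter. (Footnote 3: So far, we have assumed that a program is a single pair $P = (s, c)$, but the generalization to a list of pairs is straightforward.) A sketch $s \in \mathcal{S}$ has form

    for (i, j) ∈ {1, ..., n} × {1, ..., n'} do
        draw(a·i + b, a'·j + b', ??)
    end for

where $n, a, b, n', a', b'$ are undetermined parameters that must satisfy $a \cdot n + b \le N$ and $a' \cdot n' + b' \le N$, and where ?? is a hole to be filled by a perceptual component, which is an $M \times M$ sub-image $c \in \mathbb{R}^{M \times M}$. (Footnote 4: For colored images, we have $c \in \mathbb{R}^{M \times M \times 3}$.) Then, upon executing the $(i, j)$ iteration of the for-loop, the program renders sub-image $c$ at position $(t, u) = (a \cdot i + b, a' \cdot j + b')$ in the $N \times N$ grid. Figure 3 (top middle) shows an example of a sketch $s$ where its hole is filled with a sub-image $c$, and Figure 3 (bottom left) shows the image rendered upon executing $P = (s, c)$. Figure 2 shows another such example.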
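The execution of such a program can be sketched in a few lines. The code below is an illustrative interpreter, not the authors' implementation: it assumes each sketch is given by six integers (loop bounds, strides, and offsets for the two grid axes), uses 1-indexed grid positions, and stands in single characters for sub-images. The names `cover` and `execute` are hypothetical.

```python
def cover(n, a, b, n2, a2, b2):
    """Grid cells visited by the 2D for-loop: (a*i + b, a2*j + b2)
    for i in 1..n and j in 1..n2 (1-indexed positions)."""
    return [(a * i + b, a2 * j + b2)
            for i in range(1, n + 1) for j in range(1, n2 + 1)]

def execute(program, grid_size, blank="."):
    """Render a program, given as a list of (sketch_params, component) pairs,
    onto a grid_size x grid_size grid. Later pairs overwrite earlier ones."""
    grid = [[blank] * grid_size for _ in range(grid_size)]
    for params, component in program:
        for t, u in cover(*params):
            if not (1 <= t <= grid_size and 1 <= u <= grid_size):
                raise ValueError("sketch draws outside the grid")
            grid[t - 1][u - 1] = component
    return grid

# One sketch: rows 3 and 5 (stride 2, offset 1), columns 2..4 (stride 1, offset 1).
program = [((2, 2, 1, 3, 1, 1), "W")]
grid = execute(program, grid_size=5)
for row in grid:
    print("".join(row))
```

With real images, `component` would be an image patch pasted into the pixel block belonging to cell `(t, u)`, and the resulting grid would be the structure rendering.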

Program synthesis problem.

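Given a training image $x$, our program synthesis algorithm outputs the parameters $n_i, a_i, b_i, n'_i, a'_i, b'_i$ of each sketch $s_i$ in the program (for $i \in [k]$), along with a perceptual component $c_i$ to fill the hole in sketch $s_i$. Together, these parameters define a program $P = ((s_1, c_1), \dots, (s_k, c_k))$.

The goal is to synthesize a program that faithfully represents the global structure in $x$. We capture this structure using a boolean tensor $B_x \in \{0, 1\}^{N \times N \times N \times N}$, where

$$B_{x,t,u,t',u'} = \begin{cases} 1 & \text{if } d(x_{tu}, x_{t'u'}) \le \epsilon \\ 0 & \text{otherwise,} \end{cases}$$

where $\epsilon$ is a hyperparameter, and $d$ is a distance metric on the space of sub-images. In our implementation, we use a weighted sum of the earthmover's distance between the color histograms of $x_{tu}$ and $x_{t'u'}$, and the number of SIFT correspondences between $x_{tu}$ and $x_{t'u'}$.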
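The construction of this boolean similarity tensor can be sketched as follows. Note the substitution: instead of the earthmover's and SIFT-based distance described above, the example uses a plain mean absolute pixel difference as a stand-in, purely to keep the code self-contained. The names `mean_abs_diff` and `similarity_tensor` are hypothetical, and indices are 0-based.

```python
from itertools import product

def mean_abs_diff(p, q):
    """Stand-in distance between two equally sized sub-images (nested lists)."""
    flat_p = [v for row in p for v in row]
    flat_q = [v for row in q for v in row]
    return sum(abs(a - b) for a, b in zip(flat_p, flat_q)) / len(flat_p)

def similarity_tensor(cells, eps, dist=mean_abs_diff):
    """Map each index tuple (t, u, t2, u2) to True iff the sub-images at
    grid cells (t, u) and (t2, u2) are within distance eps of each other."""
    idx = range(len(cells))
    return {(t, u, t2, u2): dist(cells[t][u], cells[t2][u2]) <= eps
            for t, u, t2, u2 in product(idx, repeat=4)}

# 2x2 grid of 1x1 sub-images: bright cells on the diagonal, dark cells elsewhere.
cells = [[[[9]], [[0]]],
         [[[0]], [[9]]]]
B_x = similarity_tensor(cells, eps=1)

print(B_x[(0, 0, 1, 1)])  # the two bright cells match
print(B_x[(0, 0, 0, 1)])  # a bright and a dark cell do not
```

Any other distance with the same signature can be passed via `dist`, which is where a histogram-based or keypoint-based measure would plug in.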

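Additionally, we associate a boolean tensor $B_P$ with a given program $P$. First, for a sketch $s$ with parameters $n, a, b, n', a', b'$, we define

$$\text{cover}(s) = \{(a \cdot i + b,\; a' \cdot j + b') \mid i \in [n],\, j \in [n']\},$$

i.e., the set of grid cells where sketch $s$ renders a sub-image upon execution. Then, we have

$$B_{s,t,u,t',u'} = \begin{cases} 1 & \text{if } (t, u), (t', u') \in \text{cover}(s) \\ 0 & \text{otherwise,} \end{cases}$$

i.e., $B_{s,t,u,t',u'}$ indicates whether the sketch $s$ renders a sub-image at both of the grid cells $(t, u)$ and $(t', u')$. Then,

$$B_P = B_{s_1} \vee \dots \vee B_{s_k},$$

where the disjunction of boolean tensors is defined elementwise. Intuitively, $B_P$ identifies the set of pairs of grid cells $(t, u)$ and $(t', u')$ that are equal in the image rendered upon executing each pair $(s_i, c_i)$ in $P$. (Footnote 5: Note that the covers of different sketches in $P$ can overlap; we find that ignoring this overlap does not significantly impact our results.)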

Finally, our program synthesis algorithm aims to solve the following optimization problem:

$$P^* = \arg\max_{P} \ell(P; x), \qquad \ell(P; x) = \|B_x \wedge B_P\|_1 + \lambda\, \|\neg B_x \wedge \neg B_P\|_1, \tag{3}$$

where $\wedge$ and $\neg$ are applied elementwise, and $\lambda$ is a hyperparameter. In other words, the objective of (3) counts the number of true positives (i.e., entries where $B_P = B_x = 1$) and the number of true negatives (i.e., entries where $B_P = B_x = 0$), and computes their weighted sum. Thus, the objective of (3) measures how well $P$ represents the global structure of $x$.
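A small worked example may help to see how this objective behaves. The sketch below follows one reading of the objective: entries where the image tensor and the program tensor are both true count 1, and entries where both are false count `lam`. The toy image (a 3x3 grid in which only cells (2,2) and (2,3) look alike), the parameter values, and all function names are assumptions made for illustration.

```python
from itertools import product

def cover(n, a, b, n2, a2, b2):
    """Set of 1-indexed grid cells drawn by a sketch with these parameters."""
    return {(a * i + b, a2 * j + b2)
            for i in range(1, n + 1) for j in range(1, n2 + 1)}

def program_tensor(sketches, grid_size):
    """True for a pair of cells iff some sketch draws at both of them."""
    covers = [cover(*s) for s in sketches]
    cells = list(product(range(1, grid_size + 1), repeat=2))
    return {(p, q): any(p in c and q in c for c in covers)
            for p in cells for q in cells}

def objective(B_x, B_P, lam):
    """Agreeing positives plus lam times agreeing negatives."""
    both_true = sum(1 for key in B_x if B_x[key] and B_P[key])
    both_false = sum(1 for key in B_x if not B_x[key] and not B_P[key])
    return both_true + lam * both_false

N = 3
cells = list(product(range(1, N + 1), repeat=2))
pattern = {(2, 2), (2, 3)}  # the only two cells that look alike
B_x = {(p, q): p == q or (p in pattern and q in pattern)
       for p in cells for q in cells}

sketch = (1, 1, 1, 2, 1, 1)  # draws at row 2, columns 2 and 3
score_with = objective(B_x, program_tensor([sketch], N), lam=0.5)
score_empty = objective(B_x, program_tensor([], N), lam=0.5)
print(score_with, score_empty)  # 39.0 35.0
```

The sketch that covers exactly the repeated cells gains four agreeing positives without losing any agreeing negatives, so it scores higher than the empty program.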

For tractability, we restrict the search space in (3) to programs of the form

$$P = ((s_1, x_{t_1 u_1}), \dots, (s_k, x_{t_k u_k})), \qquad t_i, u_i \in [N].$$

In other words, rather than searching over all possible sub-images $c \in \mathbb{R}^{M \times M}$, we only search over the sub-images that actually occur in the training image $x$. This may lead to a slightly sub-optimal solution, for example, in cases where the optimal sub-image to be rendered is in fact an interpolation between two similar but distinct sub-images in the training image. However, we found that in practice this simplifying assumption still produced viable results.

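Algorithm 1: Synthesizes a program $\hat{P}$ representing the global structure of a given image $x$.

  Input: image $x$
  $\hat{\mathcal{C}} \leftarrow \{x_{tu} \mid t, u \in [N]\}$
  $\hat{P} \leftarrow \varnothing$
  for $i \in \{1, \dots, k\}$ do
    $(\hat{s}, \hat{c}) \leftarrow \arg\max_{(s, c) \in \mathcal{S} \times \hat{\mathcal{C}}} \ell(\hat{P} \cup \{(s, c)\}; x)$
    $\hat{P} \leftarrow \hat{P} \cup \{(\hat{s}, \hat{c})\}$
  end for
  Output: $\hat{P}$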

Program synthesis algorithm.

Exactly optimizing (3) is in general NP-hard. Thus, our program synthesis algorithm uses a partially greedy heuristic. In particular, we initialize the program to $\hat{P} = \varnothing$. Then, on each iteration, we enumerate all pairs $(s, c) \in \mathcal{S} \times \hat{\mathcal{C}}$ and determine the pair $(\hat{s}, \hat{c})$ that most increases the objective in (3), where $\hat{\mathcal{C}}$ is the set of all sub-images $x_{tu}$ for $t, u \in [N]$. Finally, we add $(\hat{s}, \hat{c})$ to $\hat{P}$. We show the full algorithm in Algorithm 1. We have the following straightforward guarantee:

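Theorem 3.1.

If $\lambda = 0$, then $\ell(\hat{P}; x) \ge (1 - e^{-1})\, \ell(P^*; x)$, where $\hat{P}$ is returned by Algorithm 1 and $P^*$ solves (3).

Proof. If $\lambda = 0$, then optimizing $\ell(P; x)$ is equivalent to set cover, where the items are tuples $\{(t, u, t', u') \in [N]^4 \mid B_{x,t,u,t',u'} = 1\}$ and the sets are $\{(t, u, t', u') \in [N]^4 \mid B_{s,t,u,t',u'} = 1\}$ for $s \in \mathcal{S}$. The theorem follows from (Hochbaum, 1997). ∎

In general, (3) is not submodular, but we find that the greedy heuristic still works well in practice.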
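The greedy procedure can be sketched end to end on a tiny grid. This is an illustrative toy, not the authors' implementation. It assumes the objective rewards agreeing positives with 1 and agreeing negatives with `lam` (our reading of (3)), takes image similarity as a user-supplied predicate `similar` rather than a thresholded distance, allows zero offsets, and enumerates sketches exhaustively, which is only feasible for very small grids. Because the score here depends only on which cells a sketch covers, the sketch search ignores the perceptual component; choosing it (e.g., the sub-image at the first covered cell) would be a separate step. All names are hypothetical.

```python
from itertools import product

def cover(sketch):
    """1-indexed grid cells drawn by sketch = (n, a, b, n2, a2, b2)."""
    n, a, b, n2, a2, b2 = sketch
    return frozenset((a * i + b, a2 * j + b2)
                     for i in range(1, n + 1) for j in range(1, n2 + 1))

def all_sketches(grid_size):
    """Enumerate sketches whose cover stays inside the grid."""
    counts = range(1, grid_size + 1)
    offsets = range(0, grid_size)
    for n, a, n2, a2 in product(counts, repeat=4):
        for b, b2 in product(offsets, repeat=2):
            if a * n + b <= grid_size and a2 * n2 + b2 <= grid_size:
                yield (n, a, b, n2, a2, b2)

def objective(similar, covers, cells, lam):
    """Score a set of covers against the image similarity relation."""
    score = 0.0
    for p, q in product(cells, repeat=2):
        in_image = similar(p, q)
        in_program = any(p in c and q in c for c in covers)
        if in_image and in_program:
            score += 1.0
        elif not in_image and not in_program:
            score += lam
    return score

def greedy_synthesize(similar, grid_size, k, lam):
    """Pick k sketches, each time adding the one that maximizes the score."""
    cells = list(product(range(1, grid_size + 1), repeat=2))
    candidates = {cover(s): s for s in all_sketches(grid_size)}  # dedupe by cover
    chosen = []
    for _ in range(k):
        current = [cover(s) for s in chosen]
        best = max(candidates,
                   key=lambda c: objective(similar, current + [c], cells, lam))
        chosen.append(candidates[best])
    return chosen

# Toy image: in a 3x3 grid, only cells (2,2) and (2,3) look alike.
pattern = {(2, 2), (2, 3)}
def similar(p, q):
    return p == q or (p in pattern and q in pattern)

program = greedy_synthesize(similar, grid_size=3, k=1, lam=0.5)
print(sorted(cover(program[0])))
```

On this toy input the single selected sketch covers exactly the two repeated cells. With larger `k`, ties between equally scoring candidates are broken by enumeration order, so the exact parameters returned may vary with the enumeration.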

4 Experiments

We perform two experiments—one for generation from scratch and one for image completion. We find substantial improvement in both tasks. Details on neural network architectures are in Appendix A, and additional examples for image completion are in Appendix B.

4.1 Datasets

Synthetic dataset.

We developed a synthetic dataset based on MNIST. Each image consists of a grid, where each grid cell is pixels. Each grid cell is either filled with a colored MNIST digit or a solid color background. The program structure is a 2D repeating pattern of an MNIST digit; to add natural noise, each iteration of the for-loop in a sketch renders a different MNIST digit, but with the same MNIST label and color. Additionally, we chose the program structure to contain correlations characteristic of real-world images, e.g., correlations between different parts of the program, correlations between the program and the background, and noise in renderings of the same component. Examples are shown in Figure 4. We give details of how we constructed this dataset in Appendix A. This dataset contains 10,000 training and 500 test images.

Facades dataset.

Our second dataset consists of 1855 images (1755 training, 100 testing) of building facades. (Footnote 6: We chose a large training set since our dataset is so small.) These images were all scaled to a size of pixels, and were divided into a grid of cells each of size 17 or 18 pixels. These images contain repeating patterns of objects such as windows and doors.

4.2 Generation from Scratch

Experimental setup.

We evaluate our approach PS-GM for generation from scratch on the synthetic dataset; the facades dataset was too small to produce meaningful results. As described in Section 2, we use Algorithm 1 to synthesize a program $P_x$ representing each training image $x$. Then, we train the encoder $q_{\tilde\phi}(z \mid s, c)$ and the decoder $p_\phi(s, c \mid z)$ on the resulting training set of programs.

For the second stage of PS-GM (i.e., completing the structure rendering $x_{\text{struct}}$ into an image $x$), we use a variational encoder-decoder (VED)

$$p_\theta(x \mid x_{\text{struct}}) = \int p_\theta(x \mid w)\, q_{\tilde\theta}(w \mid x_{\text{struct}})\, dw,$$

where $q_{\tilde\theta}(w \mid x_{\text{struct}})$ encodes a structure rendering $x_{\text{struct}}$ into a latent vector $w$, and $p_\theta(x \mid w)$ decodes $w$ into a complete image $x$. We train $p_\theta$ and $q_{\tilde\theta}$ using the VAE training loss, except that the reconstruction target is the original image $x$ rather than the input structure rendering $x_{\text{struct}}$. Additionally, we trained a CycleGAN model to map structure renderings to complete images, by giving the CycleGAN unaligned pairs of $x_{\text{struct}}$ and $x$ as training data. We compare our VED model to a VAE (Kingma & Welling, 2014), and our CycleGAN model to SpatialGAN (Bergmann et al., 2017).


| Model | Metric | Score |
|---|---|---|
| PS-GM (CycleGAN) | FID | 85.51 |
| BL (SpatialGAN) | FID | 258.68 |
| PS-GM (VED) | NLL | 59414.7 |
| BL (VAE) | NLL | 60368.4 |
| PS-GM (VED Stage 1) | NLL | 32.0 |
| PS-GM (VED Stage 2) | NLL | 59382.6 |

Table 1: Performance of our approach PS-GM versus the baseline (BL) for generation from scratch. We report Fréchet inception distance (FID) for GAN-based models, and negative log-likelihood (NLL) for the VAE-based models.

Figure 4: Examples of synthetic images generated using our approach, PS-GM (with VED and CycleGAN), and the baseline (a VAE and a SpatialGAN). Images in different rows are unrelated since the task is generation from scratch. [Rows: Original Images; PS-GM (CycleGAN); Baseline (SpatialGAN); Baseline (VAE).]

Figure 5: Examples of images generated using our approach (PS-GM) and the baseline, using GLCIC for image completion. [Rows: Original Image; PS-GM (GLCIC); Baseline (GLCIC). Columns: Synthetic; Facades.]

We measure performance for PS-GM with the VED and the baseline VAE using the variational lower bound on the negative log-likelihood (NLL) (Zhao et al., 2017) on a held-out test set. For our approach, we use the lower bound (2), which is the sum of the NLLs of the first and second stages; we report these NLLs separately as well. (Footnote 7: Technically, $\log p_\theta(x \mid s_x, c_x)$ is lower bounded by the loss of the variational encoder-decoder.) Figure 4 shows examples of generated images. For PS-GM with CycleGAN and for SpatialGAN, we use Fréchet inception distance (Heusel et al., 2017). Table 1 shows these metrics for both our approach and the baseline.


The models based on our approach quantitatively improve over the respective baselines. The examples of images generated using our approach with VED completion appear to contain more structure than those generated using the baseline VAE. Similarly, the images generated using our approach with CycleGAN clearly capture more complex structure than the unbounded 2D repeating texture patterns captured by SpatialGAN.

4.3 Image Completion

Experimental setup.

Second, we evaluated our approach PS-GM for image completion, on both our synthetic and the facades dataset. For this task, we compare using three image completion models: GLCIC (IIzuka et al., 2017), CycleGAN (Zhu et al., 2017), and the VED architecture described in Section 4.2. GLCIC is a state-of-the-art image completion model. CycleGAN is a generic image-to-image transformer. It uses unpaired training data, but we found that for our task, it outperforms approaches such as Pix2Pix (Isola et al., 2017) that take paired training data. For each model, we trained two versions:

  • Our approach (PS-GM): As described in Section 2 (for image completion), given a partial image $x_{\text{part}}$, we use Algorithm 1 to synthesize a program $P_{\text{part}}$. We extrapolate $P_{\text{part}}$ to $\hat{P}_{\text{full}}$, and execute $\hat{P}_{\text{full}}$ to obtain a structure rendering $x_{\text{struct+part}}$. Finally, we train the image completion model (GLCIC, CycleGAN, or VED) to complete $x_{\text{struct+part}}$ to the original image $x^*$.

  • Baseline: Given a partial image $x_{\text{part}}$, we train the image completion model (GLCIC, CycleGAN, or VED) to directly complete $x_{\text{part}}$ to the original image $x^*$.

| Model | Synthetic (PS-GM) | Synthetic (BL) | Facades (PS-GM) | Facades (BL) |
|---|---|---|---|---|
| GLCIC | 106.8 | 163.66 | 141.8 | 195.9 |
| CycleGAN | 91.8 | 218.7 | 124.4 | 251.4 |
| VED | 44570.4 | 52442.9 | 8755.4 | 8636.3 |

Table 2: Performance of our approach PS-GM versus the baseline (BL) for image completion. We report Fréchet inception distance for GAN-based models, and negative log-likelihood (NLL) for the VED.


As in Section 4.2, we measure performance using Fréchet inception distance for GLCIC and CycleGAN, and negative log-likelihood (NLL) for the VED, all reported on a held-out test set. We show these results in Table 2. We show examples of images completed using GLCIC in Figure 5, and additional examples, including those completed using CycleGAN and VED, in Appendix B.

Our approach PS-GM outperforms the baseline in every case except the VED on the facades dataset. We believe this exception arises because both VEDs failed to learn any meaningful structure (see Figure 7 in Appendix B).

A key reason why the baselines perform so poorly on the facades dataset is that the dataset is very small. Nevertheless, even on the synthetic dataset (which is fairly large), PS-GM substantially outperforms the baselines. Finally, generative models such as GLCIC are known to perform poorly far away from the edges of the provided partial image (IIzuka et al., 2017). A benefit of our approach is that it provides the global context for a deep-learning based image completion model such as GLCIC to perform local completion.

5 Conclusion

We have proposed a new approach to generation that incorporates programmatic structure into state-of-the-art deep learning models. In our experiments, we have demonstrated the promise of our approach to improve generation of high-dimensional data with global structure that current state-of-the-art deep generative models have difficulty capturing.


References

  • Andreas et al. (2016) Andreas, J., Rohrbach, M., Darrell, T., and Klein, D. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 39–48, 2016.
  • Bastani et al. (2018) Bastani, O., Pu, Y., and Solar-Lezama, A. Verifiable reinforcement learning via policy extraction. In NIPS, 2018.
  • Bergmann et al. (2017) Bergmann, U., Jetchev, N., and Vollgraf, R. Learning texture manifolds with the periodic spatial GAN. CoRR, abs/1705.06566, 2017. URL http://arxiv.org/abs/1705.06566.
  • Deng et al. (2018) Deng, Z., Chen, J., Fu, Y., and Mori, G. Probabilistic neural programmed networks for scene generation. In Advances in Neural Information Processing Systems, pp. 4032–4042, 2018.
  • Dieleman et al. (2018) Dieleman, S., van den Oord, A., and Simonyan, K. The challenge of realistic music generation: modelling raw audio at scale. In Advances in Neural Information Processing Systems, pp. 8000–8010, 2018.
  • Ellis et al. (2015) Ellis, K., Solar-Lezama, A., and Tenenbaum, J. Unsupervised learning by program synthesis. In Advances in neural information processing systems, pp. 973–981, 2015.
  • Ellis et al. (2018) Ellis, K., Ritchie, D., Solar-Lezama, A., and Tenenbaum, J. Learning to infer graphics programs from hand-drawn images. In Advances in Neural Information Processing Systems, pp. 6060–6069, 2018.
  • Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
  • Ha & Eck (2017) Ha, D. and Eck, D. A neural representation of sketch drawings. arXiv preprint arXiv:1704.03477, 2017.
  • Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637, 2017.
  • Hochbaum (1997) Hochbaum, D. S. Approximating covering and packing problems: set cover, vertex cover, independent set, and related problems. Approximation Algorithms for NP-Hard Problems, pp. 94–143, 1997.
  • IIzuka et al. (2017) IIzuka, S., Simo-Serra, E., and Ishikawa, H. Globally and locally consistent image completion. In ACM Trans. Graph., 2017.
  • Isola et al. (2017) Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A. Image-to-image translation with conditional adversarial networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5967–5976. IEEE, 2017.
  • Jetchev et al. (2017) Jetchev, N., Bergmann, U., and Vollgraf, R. Texture synthesis with spatial generative adversarial networks. 2017.
  • Kingma & Welling (2014) Kingma, D. P. and Welling, M. Auto-encoding variational bayes. In ICLR, 2014.
  • Lake et al. (2015) Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
  • Pu et al. (2018) Pu, Y., Miranda, Z., Solar-Lezama, A., and Kaelbling, L. Selecting representative examples for program synthesis. In International Conference on Machine Learning, pp. 4158–4167, 2018.
  • Radford et al. (2015) Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2015.
  • Solar-Lezama et al. (2006) Solar-Lezama, A., Tancau, L., Bodik, R., Seshia, S., and Saraswat, V. Combinatorial sketching for finite programs. In ASPLOS, volume 41, pp. 404–415. ACM, 2006.
  • Tyleček & Šára (2013) Tyleček, R. and Šára, R. Spatial pattern templates for recognition of objects with regular structure. In Proc. GCPR, Saarbrucken, Germany, 2013.
  • Valkov et al. (2018) Valkov, L., Chaudhari, D., Srivastava, A., Sutton, C., and Chaudhuri, S. Houdini: Lifelong learning as program synthesis. In Advances in Neural Information Processing Systems, pp. 8701–8712, 2018.
  • Verma et al. (2018) Verma, A., Murali, V., Singh, R., Kohli, P., and Chaudhuri, S. Programmatically interpretable reinforcement learning. In ICML, 2018.
  • Wang & Rudin (2015) Wang, F. and Rudin, C. Falling rule lists. In Artificial Intelligence and Statistics, pp. 1013–1022, 2015.
  • Xian et al. (2018) Xian, W., Sangkloy, P., Agrawal, V., Raj, A., Lu, J., Fang, C., Yu, F., and Hays, J. Texturegan: Controlling deep image synthesis with texture patches. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8456–8465, 2018.
  • Zhao et al. (2017) Zhao, S., Song, J., and Ermon, S. Towards deeper understanding of variational autoencoding models. arXiv preprint arXiv:1702.08658, 2017.
  • Zhu et al. (2017) Zhu, J.-Y., Park, T., and Efros, A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.

Appendix A Experimental Details

A.1 Synthetic Dataset

To sample a random image, we start with a grid, where each grid cell is pixels. We randomly sample a program (for ), where each perceptual component is a randomly selected MNIST image (downscaled to our grid cell size and colorized). To create correlations between different parts of , we sample depending on . To sample each component , we first sample latent properties of (i.e., its MNIST label and its color ). Second, we sample the parameters of conditional on these properties. To each of the 25 possible latent properties of , we associate a discrete distribution over latent properties for later elements in the sequence, as well as a mean and standard deviation for each of the parameters of the corresponding sketch.
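The correlated sampling described above can be sketched as a chain over latent properties. This is a minimal illustration under stated assumptions, not the authors' code: the table contents are random placeholders, the number of parameters per component (`NUM_PARAMS = 6`) is assumed, and each component's properties are assumed to depend only on the immediately preceding component.

```python
import random

NUM_PROPERTIES = 25  # from the text: 25 possible latent properties
NUM_PARAMS = 6       # assumption: six numeric parameters per component


def make_tables(rng):
    """Build hypothetical per-property tables: a distribution over the next
    component's property, and a (mean, std) pair for each parameter."""
    transitions, param_stats = [], []
    for _ in range(NUM_PROPERTIES):
        weights = [rng.random() for _ in range(NUM_PROPERTIES)]
        total = sum(weights)
        transitions.append([w / total for w in weights])
        param_stats.append(
            [(rng.uniform(1.0, 5.0), rng.uniform(0.1, 1.0)) for _ in range(NUM_PARAMS)]
        )
    return transitions, param_stats


def sample_program(length, transitions, param_stats, rng):
    """Sample a sequence of (property, parameters) pairs.

    Parameters are drawn from Gaussians conditioned on the component's own
    property; the next property is drawn conditioned on the current one.
    """
    program = []
    prop = rng.randrange(NUM_PROPERTIES)
    for _ in range(length):
        params = [rng.gauss(mu, sigma) for mu, sigma in param_stats[prop]]
        program.append((prop, params))
        prop = rng.choices(range(NUM_PROPERTIES), weights=transitions[prop])[0]
    return program
```

Because every component's parameters and its successor's properties are both tied to the same latent property, components that appear together in a sampled program are statistically related rather than independent.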


We then render by executing each in sequence. However, when executing , on each iteration of the for-loop, instead of rendering the sub-image at each position in the grid, we randomly sample another MNIST image with the same label as , recolor to be the same color as , and render . By doing so, we introduce noise into the programmatic structure.
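The noisy rendering step can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `ForLoop` fields (iteration count, stride, and offset along each axis), the zero-based loop indices, and the use of string identifiers in place of actual MNIST images.

```python
import random
from dataclasses import dataclass


@dataclass
class ForLoop:
    n: int       # iterations along rows
    a: int       # row stride
    b: int       # row offset
    n2: int      # iterations along columns
    a2: int      # column stride
    b2: int      # column offset
    label: int   # digit label shared by every rendered sub-image
    color: str   # color shared by every rendered sub-image


def positions(loop):
    """Grid cells visited by the 2D for-loop, in execution order."""
    return [(loop.a * i + loop.b, loop.a2 * j + loop.b2)
            for i in range(loop.n) for j in range(loop.n2)]


def render(program, grid_size, exemplars, rng):
    """Execute each loop in sequence on an empty grid.

    On every iteration a fresh exemplar with the loop's label is sampled
    and recolored, so repeated cells share label and color but not pixels.
    """
    grid = [[None] * grid_size for _ in range(grid_size)]
    for loop in program:
        for row, col in positions(loop):
            if 0 <= row < grid_size and 0 <= col < grid_size:
                exemplar = rng.choice(exemplars[loop.label])
                grid[row][col] = (exemplar, loop.color)
    return grid
```

For example, `render([ForLoop(2, 2, 0, 3, 1, 1, 3, "red")], 5, {3: ["3_a", "3_b", "3_c"]}, random.Random(0))` fills six cells, all red and all labeled 3, while the specific exemplar may vary from cell to cell.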

A.2 Generation from Scratch

PS-GM architecture.

For the first stage of PS-GM (i.e., generating the program ), we use a 3-layer LSTM encoder and a feedforward decoder . The LSTM operates on sequences of 13-dimensional vectors, of which 6 dimensions represent the structure of the for-loop being generated, and 7 dimensions are an encoding of the image to be rendered. The image compression was performed via a convolutional architecture with 2 convolutional layers for encoding and 3 deconvolutional layers for decoding.
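As a rough illustration of the 13-dimensional sequence elements, the sketch below packs six for-loop structure values and a seven-value image code into one vector and unpacks them again. The ordering (structure first, image code second) and the helper names `encode_step` and `decode_step` are assumptions for illustration; the text specifies only the 6 + 7 split.

```python
STRUCTURE_DIMS = 6   # from the text: for-loop structure
IMAGE_CODE_DIMS = 7  # from the text: compressed encoding of the sub-image


def encode_step(loop_params, image_code):
    """Concatenate loop structure and image code into one 13-dim vector."""
    if len(loop_params) != STRUCTURE_DIMS:
        raise ValueError("expected 6 structure values")
    if len(image_code) != IMAGE_CODE_DIMS:
        raise ValueError("expected 7 image-code values")
    return [float(x) for x in loop_params] + [float(x) for x in image_code]


def decode_step(vector):
    """Split a 13-dim vector back into (loop structure, image code)."""
    if len(vector) != STRUCTURE_DIMS + IMAGE_CODE_DIMS:
        raise ValueError("expected a 13-dimensional vector")
    return vector[:STRUCTURE_DIMS], vector[STRUCTURE_DIMS:]
```

A program with several for-loops then becomes a sequence of such vectors, one per loop, which is the form a sequence model can consume.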

For the second stage of PS-GM (i.e., completing the structure rendering into an image ), we use a VED; the encoder is a CNN with 4 layers, and the decoder is a transpose CNN with 6 layers. The CycleGAN model has a discriminator with 3 convolutional layers and a generator which uses transfer learning by employing the pre-trained ResNet architecture.
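To see how a 4-layer encoder and a 6-layer transposed-convolution decoder can map an image back to its original resolution, the sketch below traces spatial sizes through both stacks using the standard output-size formulas for convolutions and transposed convolutions (dilation 1). The kernel sizes, strides, padding, and the 64-pixel input are assumptions chosen for illustration; the text gives only the layer counts.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, stride, padding, output_padding=0):
    """Spatial output size of a transposed convolution (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


# Hypothetical (kernel, stride, padding) settings per layer.
ENCODER = [(4, 2, 1)] * 4                    # 4 layers, each halves the size
DECODER = [(4, 2, 1)] * 4 + [(3, 1, 1)] * 2  # 4 upsampling + 2 size-preserving


def trace(size, layers, layer_fn):
    """Return the spatial size before the first layer and after each layer."""
    sizes = [size]
    for kernel, stride, padding in layers:
        size = layer_fn(size, kernel, stride, padding)
        sizes.append(size)
    return sizes


encoder_sizes = trace(64, ENCODER, conv_out)                  # [64, 32, 16, 8, 4]
decoder_sizes = trace(encoder_sizes[-1], DECODER, deconv_out)  # [4, 8, 16, 32, 64, 64, 64]
```

Under these assumed settings, the two extra decoder layers refine the output at full resolution without changing its size, which is one plausible way to reconcile the asymmetric layer counts.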

Baseline architecture.

The baseline is a vanilla VAE with the same architecture as the VED we used for the second stage of PS-GM, except that the input to the encoder is the original training image instead of the structure rendering . The baselines with CycleGAN also use the same architecture as PS-GM with CycleGAN/GLCIC. The Spatial GAN was trained with 5 layers each in the generator and the discriminator, and with 60-dimensional global and 3-dimensional periodic latent vectors.

A.3 Image Completion

PS-GM architecture.

For the first stage of PS-GM for completion (extrapolation of the program from a partial image to a full image), we use a feedforward network with three layers. For the second stage of completion via VAE, we use a convolutional/deconvolutional architecture. The encoder is a CNN with 4 layers, and the decoder is a transpose CNN with 6 layers. As was the case in generation, the CycleGAN model has a discriminator with 3 convolutional layers and a generator which uses transfer learning by employing the pre-trained ResNet architecture.

Baseline architecture.

For the baseline VAE architecture, we used a similar architecture to the PS-GM completion step (4 convolutional and 6 deconvolutional layers). The only difference was the input, which was a partial image rather than an image rendered with structure. The CycleGAN architecture was similar to that used in PS-GM (although it mapped partial images to full images rather than partial images with structure to full images).

Appendix B Additional Results

In Figure 6, we show examples of how our image completion pipeline is applied to the facades dataset, and in Figure 7, we show examples of how our image completion pipeline is applied to our synthetic dataset.

Figure 6: Examples of our image completion pipeline on the facades dataset. Panels: Original Image; Partial Image; Structure Rendering; Baseline (GLCIC); PS-GM (CycleGAN); Baseline (CycleGAN); Baseline (VED).
Figure 7: Examples of our image completion pipeline on our synthetic dataset. Panels: Original Image; Partial Image; Structure Rendering (Partial); Structure Rendering (Extrapolated); Baseline (GLCIC); PS-GM (CycleGAN); Baseline (CycleGAN); Baseline (VED).