
CanvasVAE: Learning to Generate Vector Graphic Documents

by Kota Yamaguchi, et al.

Vector graphic documents present visual elements in a resolution-free, compact format and are often seen in creative applications. In this work, we attempt to learn a generative model of vector graphic documents. We define vector graphic documents by a multi-modal set of attributes associated with a canvas and a sequence of visual elements such as shapes, images, or texts, and train variational auto-encoders to learn the representation of the documents. We collect a new dataset of design templates from an online service that features complete document structure including occluded elements. In experiments, we show that our model, named CanvasVAE, constitutes a strong baseline for generative modeling of vector graphic documents.





1 Introduction

In creative workflows, designers work on visual presentation via vector graphic formats. 2D vector graphics represent images in a compact descriptive structure; instead of a spatial array of pixels, graphic documents describe a canvas and an arrangement of visual elements such as shapes or texts in a specific format like SVG or PDF. Vector graphics are crucial in creative production for their resolution-free representation, human interpretability, and editability. Because of their importance in creative applications, there has been a long and still active history of research on tracing vector graphic representations from a raster image [28, 14, 2, 20, 26].

In this work, we study a generative model of vector graphic documents. While raster-based generative models show tremendous progress in synthesizing high-quality images [10, 24, 12], there have been relatively few studies on vector graphic documents [33, 3, 17]. Although both raster and vector graphics deal with images, vector graphics do not have canvas pixels and cannot take advantage of the current mainstream approach of convolutional neural networks without rasterization, which is typically not differentiable [19]. Learning a generative model of vector graphics therefore poses unique challenges in 1) how to represent the complex data structure of vector graphic formats in a unified manner, 2) how to formulate the learning problem, and 3) how to evaluate the quality of documents.

We address the task of generative learning of vector graphics using a variational auto-encoder (VAE) [13], where we define documents by a multi-modal combination of canvas attributes and a sequence of element attributes. Unlike conditional layout inference [33, 15, 17], we consider unconditional document generation including both a canvas and a variable number of elements. Our architecture, named CanvasVAE, learns an encoder that projects a given graphic document into a latent code, and a decoder that reconstructs the given document from the latent code. We adopt a Transformer-based network [5] in both the encoder and the decoder to process the variable-length sequence of elements in a document. The learned decoder can take a randomly sampled latent code to generate a new vector graphic document. For our study, we collect a large-scale dataset of design templates that offers complete document structure and content information. In evaluation, we propose to combine normalized metrics for all attributes to measure the overall quality of reconstruction and generation. We compare several variants of the CanvasVAE architecture and show that a Transformer-based model constitutes a strong baseline for the vector graphic generation task.

We summarize our contributions in the following.

  1. We propose the CanvasVAE architecture for the task of unconditional generative learning of vector graphic documents, where we model documents by a structured, multi-modal set of canvas and element attributes.

  2. We build the Crello dataset, which consists of a large number of design templates and features complete vector information including occluded elements.

  3. We empirically show that our Transformer-based variant of CanvasVAE achieves strong performance in both document reconstruction and generation.

2 Related work

Generative layout modeling

There have been several attempts at conditional layout modeling where the goal is to generate bounding box arrangements given certain inputs. LayoutVAE [11] learns a two-stage autoregressive VAE that takes a label set and generates bounding boxes for each label, for scene image representation. For design applications, Zheng et al. [33] report a generative model for magazine layout conditioned on a set of elements and meta-data, where raster adversarial networks generate layout maps. Lee et al. [15] propose a three-step approach to predict a layout given an initial set of elements that accepts partial relation annotation. Li et al. [16, 17] learn a model that refines the geometry of the given elements, such that the refined layout looks realistic to a discriminator built on a differentiable wire-frame rasterizer. Tan et al. [30] propose text-to-scene generation that explicitly considers a layout and attributes. Wang et al. [31] consider a reinforcement learning approach to select appropriate elements for the given document. For the UI layout domain, Manandhar et al. [22] propose to learn UI layout representation by metric learning and a raster decoder. Li et al. [18] recently report an attempt at multi-modal representation learning of UI layout.

In contrast to conditional layout generation, we tackle the task of unconditional document generation including layout and other attributes. Gupta et al. recently report an autoregressive model for generating layouts [7], but it does not learn a latent representation and instead relies on beam search. READ [25] is the only pilot study similar to our unconditional scenario, although their recursive model only considers labeled bounding boxes without content attributes. Arroyo et al. [1] very recently report a layout generation model. Unlike approaches that rely on rasterization [33, 16], our model works entirely on symbolic vector data, which allows us to easily process data in a resolution-free manner.

Vector graphic generation

Although our main focus is document-level generation, there have been several important works in stroke- or path-level vector graphic modeling that aim at learning to generate resolution-free shapes. Sketch RNN is a pioneering work on learning drawing strokes using recurrent networks [8]. SPIRAL is a reinforced adversarial learning approach to vectorize a given raster image [6]. Lopes et al. learn an autoregressive VAE to generate vector font strokes [21]. Song et al. report a generative model of Bézier curves for sketch strokes [29]. Carlier et al. propose the DeepSVG architecture, which consists of a hierarchical auto-encoder that learns a representation for a set of paths [3]. We draw much inspiration from DeepSVG, especially in the design of our oneshot decoding architecture.

| Dataset | Attribute of | Name | Type | Size | Dim | Description |
|---|---|---|---|---|---|---|
| Crello | Canvas | Length | Categorical | 50 | 1 | Length of elements up to 50 |
| | | Group | Categorical | 7 | 1 | Broad design group, such as social media posts or blog headers |
| | | Format | Categorical | 68 | 1 | Detailed design format, such as Instagram post or postcard |
| | | Width | Categorical | 42 | 1 | Canvas pixel width available in |
| | | Height | Categorical | 47 | 1 | Canvas pixel height available in |
| | | Category | Categorical | 24 | 1 | Topic category of the design, such as holiday celebration |
| | Element | Type | Categorical | 6 | 1 | Element type, such as vector shape, image, or text placeholder |
| | | Position | Categorical | 64 | 2 | Left and top position each quantized to 64 bins |
| | | Size | Categorical | 64 | 2 | Width and height each quantized to 64 bins |
| | | Opacity | Categorical | 8 | 1 | Opacity quantized to 8 bins |
| | | Color | Categorical | 16 | 3 | RGB color each quantized to 16 bins, only relevant for solid fill and texts |
| | | Image | Numerical | 1 | 256 | Pre-trained image feature, only relevant for shapes and images |
| RICO | Canvas | Length | Categorical | 50 | 1 | Length of elements up to 50 |
| | Element | Component | Categorical | 27 | 1 | Element type, such as text, image, icon, etc. |
| | | Position | Categorical | 64 | 2 | Left and top position each quantized to 64 bins |
| | | Size | Categorical | 64 | 2 | Width and height each quantized to 64 bins |
| | | Icon | Categorical | 59 | 1 | Icon type, such as arrow, close, home |
| | | Button | Categorical | 25 | 1 | Text on button, such as login or back |
| | | Clickable | Categorical | 2 | 1 | Binary flag indicating if the element is clickable |

Table 1: Attribute descriptions for vector graphic data

3 Vector graphic representation

3.1 Document structure

In this work, we define a vector graphic document to be a single-page canvas and an associated sequence of visual elements such as texts, shapes, or raster images. We represent a document $X = (X_c, X_e)$ by a set of canvas attributes $X_c = \{x_k \mid k \in C\}$ and a sequence of elements $X_e = (E^1, \dots, E^T)$, where $E^t = \{e^t_k \mid k \in E\}$ is a set of element attributes. We denote the sets of canvas and element attribute indices by $C$ and $E$, respectively. Canvas attributes represent global document properties, such as canvas size or document category. Element attributes indicate element-specific configuration, such as position, size, type of the element, opacity, color, or a texture image if the element represents a raster image. In addition, we explicitly include the element length $T$ in the canvas attributes $X_c$. We represent elements by a sequence, where the order reflects the depth at which elements appear on the canvas. The actual attribute definition depends on the datasets we describe in the next section.
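As a rough illustration of this structure, the following Python sketch stores a document as a dictionary of canvas attributes plus an ordered list of elements. The class and method names (`Document`, `Element`, `canvas_with_length`) are hypothetical and are not taken from the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Element:
    # Element-specific attributes, e.g. type, position, size, opacity.
    attributes: Dict[str, Any]


@dataclass
class Document:
    # Global canvas attributes, e.g. canvas size or document category.
    canvas: Dict[str, Any]
    # Ordered sequence; the order encodes the depth of the elements.
    elements: List[Element] = field(default_factory=list)

    def canvas_with_length(self) -> Dict[str, Any]:
        """Return the canvas attributes with the element length included."""
        return {**self.canvas, "length": len(self.elements)}


doc = Document(
    canvas={"category": "holiday"},
    elements=[Element({"type": "image"}), Element({"type": "text"})],
)
print(doc.canvas_with_length())
```

Keeping the length alongside the other canvas attributes mirrors the description above, where the element length is treated as one more global property of the document.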

3.2 Datasets

Crello dataset

The Crello dataset consists of design templates we obtained from an online design service. The dataset contains designs for various display formats, such as social media posts, banner ads, blog headers, or printed posters, all in a vector format. In dataset construction, we first downloaded design templates and associated resources (e.g., linked images) from the service. After the initial data acquisition, we inspected the data structure and identified useful vector graphic information in each template. Next, we eliminated malformed templates or those having more than 50 elements, and finally obtained 23,182 templates. We randomly partition the dataset into 18,714 / 2,316 / 2,331 examples for train, validation, and test splits.

In the Crello dataset, each document has canvas attributes {length, group, format, width, height, category} and element attributes {type, position, size, opacity, color, image}. Table 1 summarizes the details of each attribute. Image and color attributes are exclusive; we extract color for text placeholders and solid backgrounds, and we extract image features for shapes and image elements in the document. Except for image features, we quantize numeric attributes, such as element position, size, or colors, to one-hot representations, because 1) discretization implicitly enforces element alignment, and 2) attributes often do not follow a normal distribution suitable for regression. The image feature enables content-aware document modeling and is also useful for visualization purposes.
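The quantization step can be sketched as follows. This is a minimal example that assumes uniform bins over a known value range (for instance, a position expressed relative to the canvas size); the text does not state how bin boundaries are chosen, and the helper names `quantize` and `one_hot` are illustrative.

```python
from typing import List


def quantize(value: float, low: float, high: float, bins: int) -> int:
    """Map a value in [low, high] to a bin index in [0, bins - 1].

    Uniform binning is an assumption made for this sketch.
    """
    if high <= low:
        raise ValueError("high must be greater than low")
    ratio = (value - low) / (high - low)
    index = int(ratio * bins)
    # Clamp so that value == high (or out-of-range input) stays in range.
    return min(max(index, 0), bins - 1)


def one_hot(index: int, bins: int) -> List[float]:
    """Encode a bin index as a one-hot vector of length `bins`."""
    return [1.0 if i == index else 0.0 for i in range(bins)]


# Example: a left position at 50% of the canvas width, with 64 bins.
left_bin = quantize(0.5, 0.0, 1.0, 64)
left_vector = one_hot(left_bin, 64)
```

With categorical bins, elements that share a bin index share exactly the same coordinate after decoding, which is one way to read the remark that discretization implicitly enforces alignment.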

We obtain image features using a raster-based convolutional VAE that we pre-train on all the image and shape elements in the Crello dataset. We do not use an ImageNet pre-trained model here, because ImageNet contains neither alpha channels nor vector shapes. For pre-training of the VAE, we rasterize all the image and shape elements onto a pixel canvas with resizing and save them in RGBA raster format. From the rasterized images, we learn a VAE consisting of a MobileNetV2-based encoder [27] and a 6-layer convolutional decoder. After pre-training the convolutional VAE, we obtain a 256-dimensional latent representation using the learned image encoder for all the image and shape elements in the Crello dataset.

In contrast to existing layout datasets that mainly consider a set of labeled bounding boxes [4, 33, 34], our Crello dataset offers complete vector graphic structure including the appearance of occluded elements. This enables us to learn a generative model that considers the appearance and attributes of graphic elements in addition to the layout structure. Also, the Crello dataset contains canvases in various aspect ratios. This poses a unique challenge, because we have to handle variable-sized documents, which raster-based models do not handle well.

RICO dataset

The RICO dataset offers a large number of user interface designs for mobile applications with manually annotated elements [4]. We use the RICO dataset to evaluate the generalization ability of our CanvasVAE. All the UI screenshots from RICO have a fixed resolution, and there is no document-wise label. We set the canvas attributes to only have the element length, {length}, and for each element, we model {component, position, size, icon, button, clickable}. Most of the pre-processing follows the Crello dataset; we quantize numeric attributes to one-hot representations. Table 1 summarizes the attributes we consider in this work.

Figure 1: CanvasVAE architecture.

4 CanvasVAE

Our goal is to learn a generative model of vector graphic documents. We aim at learning a VAE that consists of a probabilistic encoder and a decoder using neural networks.

VAE basics

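Let us denote a vector graphic instance by $X$ and a latent code by $z$. A VAE learns a generative model $p_\theta(X, z) = p_\theta(X \mid z)\, p_\theta(z)$ and an approximate posterior $q_\phi(z \mid X)$, using variational lower bounds [13]:

$$\log p_\theta(X) \geq \mathbb{E}_{q_\phi(z \mid X)}\left[\log p_\theta(X \mid z)\right] - \mathrm{KL}\left(q_\phi(z \mid X) \,\|\, p_\theta(z)\right), \tag{1}$$

where $\phi$ is the parameters of the inference model and $\theta$ is the parameters of the generative model.

We model the approximate variational posterior $q_\phi(z \mid X)$ by a Gaussian distribution with a diagonal covariance:

$$q_\phi(z \mid X) = \mathcal{N}\left(z;\ \mu_\phi(X),\ \mathrm{diag}\left(\sigma_\phi(X)^2\right)\right), \tag{2}$$

where $\mu_\phi(X)$ and $\sigma_\phi(X)$ are the encoder outputs that we model by a neural network. We set the prior over the latent code to be a unit multivariate Gaussian $p_\theta(z) = \mathcal{N}(0, I)$. We also model the data likelihood $p_\theta(X \mid z)$ using a neural network. Fig 1 illustrates our CanvasVAE encoder and decoder architecture.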
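For a diagonal Gaussian posterior and a unit Gaussian prior, the KL term has a well-known closed form, and sampling is usually done with the reparameterization trick of [13]. The sketch below shows both in plain Python; the function names are illustrative, and a real model would compute these on tensors inside a deep learning framework.

```python
import math
import random
from typing import List, Sequence


def sample_latent(mu: Sequence[float], sigma: Sequence[float],
                  rng: random.Random) -> List[float]:
    """Reparameterized sample: z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def kl_to_unit_gaussian(mu: Sequence[float], sigma: Sequence[float]) -> float:
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(
        m * m + s * s - 1.0 - 2.0 * math.log(s) for m, s in zip(mu, sigma)
    )


rng = random.Random(0)
z = sample_latent([0.0, 1.0], [1.0, 0.5], rng)
kl = kl_to_unit_gaussian([0.0, 1.0], [1.0, 0.5])
```

The KL value is zero exactly when the posterior equals the prior, which is why weighting this term more strongly pulls encoded documents toward the region that random sampling explores.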


Our encoder takes a vector graphic input and predicts the parameters $\mu$ and $\sigma$ of the approximate posterior. We describe the encoder in the following.

The encoder first projects each canvas attribute using a feed-forward layer to a common dimensionality and sums the results to make the hidden side input to the Transformer block. Similarly, the encoder projects each element attribute using a feed-forward layer to the same dimensionality and sums the results together with the positional embedding to make the hidden input to the Transformer block for each step $t$. The positional embedding provides information on the absolute position within the element sequence [5]. We learn the positional embedding during training. The Transformer block is a variant of the Transformer model that adds a side input between the self-attention and the feed-forward layer, which is similar to the decoder block of DeepSVG [3]. We stack multiple Transformer blocks to transform the input embedding and to produce a temporally pooled internal representation. The last feed-forward layers of the encoder produce $\mu$ and $\sigma$ from this representation.


Our decoder takes a sampled latent code $z$ and produces a reconstruction of the document from $z$. We describe our decoder in the following.

Each attribute has its own last feed-forward network that produces the output for that attribute. Our decoder uses the same Transformer block as the encoder. The decoder has a positional embedding to feed the absolute position information for sequence reconstruction.

At generation, we apply stochastic sampling to the latent code $z$ and obtain a maximum likelihood estimate from the decoder heads for categorical attributes, or a regression output for numerical attributes. To generate a sequence from the latent code $z$, we first have to decide the number of elements in the document. We predict the length from $z$, feed the masking information to the Transformer block to exclude out-of-sequence elements in self-attention, and drop extra elements at the final reconstruction.
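The length handling at generation time can be sketched as below. This is a simplified illustration with hypothetical helper names (`argmax`, `attention_mask`, `drop_extra`); it assumes the predicted length is already available as an integer and that the decoder emits a fixed maximum number of element slots.

```python
from typing import List, Sequence


def argmax(scores: Sequence[float]) -> int:
    """Maximum likelihood choice for a categorical head."""
    return max(range(len(scores)), key=lambda i: scores[i])


def attention_mask(length: int, max_length: int) -> List[bool]:
    """True for in-sequence slots, False for slots beyond the length."""
    return [t < length for t in range(max_length)]


def drop_extra(decoded: Sequence[str], length: int) -> List[str]:
    """Keep only the first `length` decoded elements."""
    return list(decoded[:length])


max_length = 5
predicted_length = 3  # assumed to come from the length head
mask = attention_mask(predicted_length, max_length)
slots = ["image", "shape", "text", "text", "image"]  # one decoded type per slot
elements = drop_extra(slots, predicted_length)
```

Here `mask` would be passed to self-attention so that the two trailing slots are ignored, and `elements` is what remains in the final document.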

Loss function

We derive the loss function for our CanvasVAE from the variational lower bounds (Eq 1). For each document in our dataset, the loss (Eq 11) combines a loss term for every canvas and element attribute with regularization terms, and hyper-parameters weight the regularization terms. We use cross entropy for categorical attributes and mean squared error for numeric attributes. At training time, we use teacher-forcing; we discard the predicted length and force the ground truth length in the decoder.
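A minimal sketch of how such a per-document loss could be assembled is given below. It only includes the attribute terms and a weighted KL term; any further regularization term is left out, and the names `kl_weight` and `document_loss` are assumptions made for illustration.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], target: int) -> float:
    """Cross entropy of a predicted categorical distribution."""
    return -math.log(probs[target])


def mean_squared_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error for a numeric attribute such as an image feature."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def document_loss(attribute_losses: Sequence[float], kl: float,
                  kl_weight: float) -> float:
    """Sum of attribute losses plus a weighted KL regularizer."""
    return sum(attribute_losses) + kl_weight * kl


type_loss = cross_entropy([0.5, 0.25, 0.25], target=0)
image_loss = mean_squared_error([1.0, 2.0], [1.0, 4.0])
total = document_loss([type_loss, image_loss], kl=0.5, kl_weight=2.0)
```

Raising `kl_weight` trades reconstruction accuracy for a latent space closer to the prior, which is the trade-off examined in the experiments.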

5 Experiments

We evaluate our CanvasVAE in reconstruction and generation scenarios. In the reconstruction scenario, we evaluate the overall capability of our encoder-decoder model to reproduce the given input. In the generation scenario, we evaluate the decoder capability in terms of the quality of randomly generated documents.

5.1 Evaluation metrics

5.1.1 Reconstruction metrics

We have to be able to measure the similarity between two documents to evaluate reconstruction quality. Unlike raster images, there is no standard metric to measure the distance between vector graphic documents. Our loss function (Eq 11) is also not appropriate due to teacher-forcing of sequence length. Considering the multi-modal nature of vector graphic formats, we argue that an ideal metric should be able to evaluate the quality of all the modalities in the document at a uniform scale, and that the metric can handle variable length structure. We choose the following two metrics to evaluate the document similarity.

Structural similarity

For two documents, we measure the structural similarity by the mean of normalized scores over all attributes (Eq 12), where each attribute has its own scoring function. We exclude length from the canvas attributes because element scores take length into account. For canvas attributes, we adopt accuracy as the scoring function since there are only categorical attributes in our datasets.

For categorical element attributes, a scoring function must be able to evaluate variable-length sequences of elements. We use the BLEU score [23], a precision-based metric often used in machine translation. BLEU penalizes a shorter prediction by the brevity term $\min\left(1, \exp\left(1 - T / \hat{T}\right)\right)$, where $T$ is the reference element length and $\hat{T}$ is the predicted element length. We use unigram BLEU for evaluation, because vector graphic elements do not exhibit strong ordinal constraints and elements can be swapped as long as they do not overlap. For the image feature, which is the only numerical element attribute in Crello, we use the cosine similarity in [0, 1] scale between the average-pooled features over the sequence, multiplied by the brevity term of the BLEU score. Note that our structural similarity is not symmetric because BLEU is not symmetric.
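The unigram BLEU used for categorical element attributes can be sketched in a few lines. This follows the standard definition in [23] (clipped unigram precision times the brevity penalty) for a single reference; how the authors handle edge cases such as empty predictions is not stated, so returning 0 there is an assumption.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def brevity_penalty(pred_len: int, ref_len: int) -> float:
    """Standard BLEU brevity penalty: 1 unless the prediction is shorter."""
    if pred_len == 0:
        return 0.0  # assumption: an empty prediction scores zero
    if pred_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / pred_len)


def unigram_bleu(predicted: Sequence[Hashable],
                 reference: Sequence[Hashable]) -> float:
    """Clipped unigram precision multiplied by the brevity penalty."""
    if not predicted:
        return 0.0
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    clipped = sum(min(n, ref_counts[tok]) for tok, n in pred_counts.items())
    precision = clipped / len(predicted)
    return brevity_penalty(len(predicted), len(reference)) * precision


same_set = unigram_bleu(["text", "image"], ["image", "text"])
too_short = unigram_bleu(["text"], ["text", "image"])
too_long = unigram_bleu(["text", "image"], ["text"])
```

The three calls illustrate the properties discussed above: swapping elements does not change the score, a shorter prediction is penalized by the brevity term, and exchanging prediction and reference changes the result, so the score is not symmetric.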

In Crello, the presence of image and color attributes depends on the element type, and these attributes can be empty. We exclude an attribute from the calculation of Eq 12 if it is empty in either of the two documents.

We evaluate reconstruction performance by the average structural similarity score over the document set (Eq 13).

Figure 2: Performance curves in terms of structural similarity vs. generation score and mIoU vs. generation score over the regularization weight in validation splits. Top-right models show better performance in both reconstruction and generation.

Layout mean IoU

We also include evaluation by mean intersection over union (mIoU) on labeled bounding boxes [22] to analyze layout quality. We use the type attribute in the Crello dataset and the component attribute in the RICO dataset as the primary label for elements. To compute mIoU, we draw bounding boxes on a canvas in the given element order, compute the IoU for each label, then average over labels. Since we quantize the position and size of each element to 64 bins, we draw bounding boxes on a 64 × 64 grid canvas. Similar to Eq 13, we obtain the final score by the dataset average.

The mIoU metric ignores attributes other than element position, element size, and element label. Content attributes such as image or color have no effect on the mIoU metric. Although Crello dataset has variable-sized canvas, we ignore the aspect ratio of the canvas and only evaluate on the relative position and size of elements.
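One possible reading of this procedure is sketched below: boxes are painted onto a label grid in sequence order so that later elements overwrite earlier ones, and the IoU is computed per label and averaged. The overwrite behaviour, the set of labels averaged over, and the score for two empty canvases are assumptions of this sketch rather than details given in the text.

```python
from typing import List, Optional, Tuple

# (label, left, top, width, height), all in grid cells.
Box = Tuple[str, int, int, int, int]


def paint(elements: List[Box], grid: int) -> List[List[Optional[str]]]:
    """Draw boxes in sequence order; later elements overwrite earlier ones."""
    canvas: List[List[Optional[str]]] = [[None] * grid for _ in range(grid)]
    for label, left, top, width, height in elements:
        for y in range(max(top, 0), min(top + height, grid)):
            for x in range(max(left, 0), min(left + width, grid)):
                canvas[y][x] = label
    return canvas


def mean_iou(reference: List[Box], predicted: List[Box], grid: int = 64) -> float:
    """Average per-label IoU between two painted label grids."""
    ref, pred = paint(reference, grid), paint(predicted, grid)
    labels = {cell for row in ref + pred for cell in row if cell is not None}
    if not labels:
        return 1.0  # assumption: two empty canvases count as identical
    scores = []
    for label in labels:
        inter = union = 0
        for ref_row, pred_row in zip(ref, pred):
            for r, p in zip(ref_row, pred_row):
                in_ref, in_pred = r == label, p == label
                inter += in_ref and in_pred
                union += in_ref or in_pred
        scores.append(inter / union)
    return sum(scores) / len(scores)


full = [("text", 0, 0, 2, 2)]
half = [("text", 0, 0, 2, 1)]
score = mean_iou(full, half, grid=4)
```

Because only label, position, and size enter the grid, content attributes such as color or image features leave this score unchanged.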

5.1.2 Generation metric

Similar to reconstruction, there is no standard approach to evaluate the similarity between sets of vector graphic documents. It is also not appropriate to use a raster metric like the FID score [9], because our documents cannot be rasterized to a fixed resolution and are not natural images. We instead define the following metric to evaluate the distributional similarity of vector graphic documents.

For real and generated document sets, we first obtain descriptive statistics for each attribute using an aggregation function, then compute the similarity between the two sets using a scoring function for each attribute. For categorical attributes, we use a histogram as the aggregation function and normalized histogram intersection as the scoring function. For numerical attributes, we use average pooling as the aggregation function and cosine similarity as the scoring function.
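The aggregation and scoring functions named above can be sketched as follows. Normalizing each histogram to sum to one before taking the intersection is an assumption of this sketch, and the function names are illustrative.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence


def histogram_intersection(real: Sequence[Hashable],
                           generated: Sequence[Hashable]) -> float:
    """Intersection of two histograms, each normalized to sum to one."""
    if not real or not generated:
        return 0.0
    real_hist, gen_hist = Counter(real), Counter(generated)
    return sum(
        min(real_hist[k] / len(real), gen_hist[k] / len(generated))
        for k in real_hist.keys() | gen_hist.keys()
    )


def average_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of a non-empty set of feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; defined as 0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


# Categorical attribute: element types pooled over each document set.
type_score = histogram_intersection(list("aabb"), list("abbb"))

# Numerical attribute: average-pooled image features of each set.
real_mean = average_pool([[1.0, 0.0], [3.0, 2.0]])
image_score = cosine_similarity(real_mean, [4.0, 2.0])
```

Both scores compare set-level statistics rather than individual documents, so a generated set can score well without reproducing any particular real document.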

Figure 3: Crello reconstruction comparison. For each item, the left picture shows a visualization of the design template with colored text placeholders and textured elements, and the right color map illustrates element types. The legend of types is as follows: green = vector shape, magenta = image, purple = text placeholder, and yellow = solid fill.
Figure 4: RICO reconstruction comparison. Color indicates the component label of each element. Color is from the original RICO schema [4].
Figure 5: Stochastic sampling and reconstruction examples.
Figure 6: Interpolation example.
Figure 7: Randomly generated Crello documents.
Figure 8: Randomly generated RICO documents.

5.2 Baselines

We include the following variants of our CanvasVAE as baselines. Since there is no reported work that is directly comparable to CanvasVAE, we carefully pick comparable building blocks from existing work.

AutoReg LSTM  We replace the Transformer blocks and temporal pooling in our model (Eq 5, 8) with an LSTM. The side input to the Transformer blocks is treated as the initial hidden states. Also, we introduce an autoregressive inference procedure instead of our one-shot decoding using positional embedding; in the decoder, we predict the element at step $t$ given the elements up to step $t - 1$. The initial input is a special beginning-of-sequence embedding, which we learn during training. Our autoregressive LSTM baseline has a decoder similar to LayoutVAE [11], although we neither sample stochastically at each step nor use conditional inputs, and we add an encoder for the latent code.

AutoReg Transformer  Similar to Autoregressive LSTM, we introduce the autoregressive inference but use Transformer blocks. The decoding process is similar to LayoutTransformer [7], but we add an encoder for the latent code. We also explicitly predict sequence length instead of an end-of-sequence flag to terminate the prediction.

Oneshot LSTM  We use positional embedding but replace Transformer blocks with an LSTM. We use a bidirectional LSTM for this one-shot model because positional embedding allows both past and future information for prediction.

Oneshot Transformer  Our CanvasVAE model described in Sec 4. We also compare the number of Transformer blocks at 1 and 4 for ablation study.

5.3 Quantitative evaluation

| Dataset | Model | Structural similarity | mIoU | Generation score |
|---|---|---|---|---|
| Crello | AutoReg LSTM | 79.75 | 33.52 | 86.44 |
| | AutoReg Trans x1 | 85.47 | 33.86 | 87.48 |
| | AutoReg Trans x4 | 84.65 | 35.36 | 86.60 |
| | Oneshot LSTM | 84.95 | 40.50 | 86.85 |
| | Oneshot Trans x1 | 88.67 | 47.02 | 87.57 |
| | Oneshot Trans x4 | 87.75 | 45.50 | 88.15 |
| RICO | AutoReg LSTM | 87.73 | 42.51 | 93.74 |
| | AutoReg Trans x1 | 94.96 | 51.13 | 94.40 |
| | AutoReg Trans x4 | 92.06 | 48.74 | 95.11 |
| | Oneshot LSTM | 91.01 | 51.93 | 92.05 |
| | Oneshot Trans x1 | 94.35 | 60.42 | 93.90 |
| | Oneshot Trans x4 | 94.45 | 59.47 | 95.14 |

Table 2: Test performance (%).

For each baseline, we report the test performance of the best validation model that we find by a grid search over the regularization weight. For other hyper-parameters, we empirically set the size of the latent code to 512 for Crello and 256 for RICO in all baselines. We train all baseline models using the Adam optimizer with a fixed learning rate for 500 epochs on both datasets. For generation evaluation, we sample latent codes from a zero-mean unit normal distribution until the generated set has the same size as the test split.

Table 2 summarizes the test evaluation metrics of the baseline models. In Crello, the oneshot Transformer x1 configuration has the best reconstruction in structure (88.67%) and layout mIoU (47.02%), while oneshot Transformer x4 has the best generation score (88.15%). In RICO, autoregressive Transformer x1 has the best structure reconstruction (94.96%), oneshot Transformer x1 has the best mIoU (60.42%), and oneshot Transformer x4 has the best generation score (95.14%).

Performance trade-offs

We note that the choice of the regularization weight has a strong impact on the evaluation metrics, which explains the varying test results in Table 2. We plot in Fig 2 the validation performance as we vary this weight in Crello and in RICO. The plots clearly show a trade-off between reconstruction and generation. A smaller KL divergence indicates a smoother latent space, which in turn indicates better generation quality from random sampling of the latent code but hurts the reconstruction performance. From the plots, oneshot Transformer x4 seems to perform consistently well in both datasets except for the reconstruction evaluation in RICO, where most baselines saturate the reconstruction score. We suspect the saturation is due to an over-fitting tendency in the RICO dataset, as RICO does not contain high-dimensional attributes like image features. Given the performance characteristics, we conjecture that oneshot Transformer performs the best.

Autoregressive vs. oneshot

Table 2 and Fig 2 suggest that oneshot models consistently perform better than their autoregressive counterparts in both datasets. This trend makes sense, because autoregressive models cannot consider the layout placement of future elements at the current step. In contrast, oneshot models can better consider the spatial relationship between elements at inference time.

5.4 Qualitative evaluation


We compare reconstruction quality of Crello and RICO testing examples in Fig 3 and Fig 4. Here, we reconstruct the input deterministically by the mean latent code in Eq 2. Since Crello dataset has rich content attributes, we present a document visualization that fills in image and shape elements using a nearest neighbor retrieval by image features (Sec 3.2), and a color map of element type bounding boxes. For RICO, we show a color map of component bounding boxes.

We observe that, while all baselines reasonably reconstruct documents when there are a relatively small number of elements, oneshot models tend to reconstruct better as a document becomes more complex (Second row of Fig 3).

We also show in Fig 5 how sampling the latent code results in variation in the reconstruction quality. Depending on the input, sampling sometimes leads to a different layout arrangement or canvas size.


One characteristic of VAEs is the smoothness of the latent space. We show in Fig 6 an example of interpolating latent codes between two documents in the Crello dataset. The result shows a gradual transition between two documents that differ in many aspects, such as the number of elements or canvas size. The resulting visualization is discontinuous in that categorical attributes or retrieved images change at specific points in between, but we can still observe some sense of continuity in design.
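Interpolation between two latent codes can be sketched as follows. The example assumes simple linear interpolation, which the text does not specify; each intermediate code would then be passed to the decoder to produce one document of the transition.

```python
from typing import List, Sequence


def interpolate(z_start: Sequence[float], z_end: Sequence[float],
                steps: int) -> List[List[float]]:
    """Linearly interpolate between two latent codes, endpoints included."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    codes = []
    for i in range(steps):
        alpha = i / (steps - 1)
        codes.append([(1.0 - alpha) * a + alpha * b
                      for a, b in zip(z_start, z_end)])
    return codes


path = interpolate([0.0, 2.0], [1.0, 0.0], steps=3)
```

Even though the latent path is continuous, decoding applies a maximum likelihood choice to categorical heads, which is consistent with attributes switching at specific points along the path.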


Fig 7 shows randomly generated Crello design documents from the oneshot Transformer x4 configuration. Our CanvasVAE generates documents in diverse layouts and aspect ratios. Generated documents are not realistic in that the quality is not sufficient for immediate use in real-world creative applications, but they already seem useful for inspirational purposes. Also, we emphasize that these generated documents are easily editable thanks to the vector graphic format.

We also show in Fig 8 randomly generated UIs from the model trained on the RICO dataset. Although generated layouts sometimes include overlapping elements, they show diverse layout arrangements with semantically meaningful structure such as toolbar or list components.

6 Conclusion

We present CanvasVAE, an unconditional generative model of vector graphic documents. Our model learns an encoder and a decoder that operate on vector graphic documents consisting of canvas and element attributes including layout geometry. With our newly built Crello dataset and the RICO dataset, we demonstrate that CanvasVAE successfully learns to reconstruct and generate vector graphic documents. Our results constitute a strong baseline for generative modeling of vector graphic documents.

In the future, we are interested in further extending CanvasVAE by generating text content and font styling, integrating pre-training of image features in an end-to-end model, and adding a learning objective that is aware of appearance, for example, by introducing a differentiable rasterizer [19] to compute a raster reconstruction loss. We also wish to look at whether feedback-style inference [32] allows partial inputs such as user-specified constraints [15], which are commonly seen in application scenarios.


  • [1] D. M. Arroyo, J. Postels, and F. Tombari (2021) Variational transformer networks for layout generation. In CVPR, pp. 13642–13652. Cited by: §2.
  • [2] M. Bessmeltsev and J. Solomon (2019) Vectorization of line drawings via polyvector fields. ACM Transactions on Graphics (TOG) 38 (1), pp. 1–12. Cited by: §1.
  • [3] A. Carlier, M. Danelljan, A. Alahi, and R. Timofte (2020) DeepSVG: a hierarchical generative network for vector graphics animation. arXiv preprint arXiv:2007.11301. Cited by: §1, §2, §4.
  • [4] B. Deka, Z. Huang, C. Franzen, J. Hibschman, D. Afergan, Y. Li, J. Nichols, and R. Kumar (2017) RICO: a mobile app dataset for building data-driven design applications. In ACM Symposium on User Interface Software and Technology, pp. 845–854. Cited by: §3.2, §3.2, Figure 4.
  • [5] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §4.
  • [6] Y. Ganin, T. Kulkarni, I. Babuschkin, S. A. Eslami, and O. Vinyals (2018) Synthesizing programs for images using reinforced adversarial learning. In ICML, pp. 1666–1675. Cited by: §2.
  • [7] K. Gupta, A. Achille, J. Lazarow, L. Davis, V. Mahadevan, and A. Shrivastava (2020) Layout generation and completion with self-attention. arXiv preprint arXiv:2006.14615. Cited by: §2, §5.2.
  • [8] D. Ha and D. Eck (2018) A neural representation of sketch drawings. In ICLR, Cited by: §2.
  • [9] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) GANs trained by a two time-scale update rule converge to a local nash equilibrium. arXiv preprint arXiv:1706.08500. Cited by: §5.1.2.
  • [10] J. Johnson, A. Gupta, and L. Fei-Fei (2018) Image generation from scene graphs. In CVPR, pp. 1219–1228. Cited by: §1.
  • [11] A. A. Jyothi, T. Durand, J. He, L. Sigal, and G. Mori (2019) LayoutVAE: stochastic scene layout generation from a label set. In CVPR, pp. 9895–9904. Cited by: §2, §5.2.
  • [12] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila (2020) Analyzing and improving the image quality of StyleGAN. In CVPR, Cited by: §1.
  • [13] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §1, §4.
  • [14] J. Kopf and D. Lischinski (2011) Depixelizing pixel art. In ACM SIGGRAPH 2011, pp. 1–8. Cited by: §1.
  • [15] H. Lee, W. Yang, L. Jiang, M. Le, I. Essa, H. Gong, and M. Yang (2020) Neural design network: graphic layout generation with constraints. ECCV. Cited by: §1, §2, §6.
  • [16] J. Li, J. Yang, A. Hertzmann, J. Zhang, and T. Xu (2019) LayoutGAN: generating graphic layouts with wireframe discriminators. arXiv preprint arXiv:1901.06767. Cited by: §2, §2.
  • [17] J. Li, J. Yang, J. Zhang, C. Liu, C. Wang, and T. Xu (2020) Attribute-conditioned layout gan for automatic graphic design. IEEE Transactions on Visualization and Computer Graphics. Cited by: §1, §1, §2.
  • [18] T. J. Li, L. Popowski, T. M. Mitchell, and B. A. Myers (2021) Screen2Vec: semantic embedding of gui screens and gui components. arXiv preprint arXiv:2101.11103. Cited by: §2.
  • [19] T. Li, M. Lukáč, M. Gharbi, and J. Ragan-Kelley (2020) Differentiable vector graphics rasterization for editing and learning. ACM Transactions on Graphics (TOG) 39 (6), pp. 1–15. Cited by: §1, §6.
  • [20] Y. Liu and Z. Wu (2019) Learning to describe scenes with programs. In ICLR, Cited by: §1.
  • [21] R. G. Lopes, D. Ha, D. Eck, and J. Shlens (2019) A learned representation for scalable vector graphics. In CVPR, pp. 7930–7939. Cited by: §2.
  • [22] D. Manandhar, D. Ruta, and J. Collomosse (2020) Learning structural similarity of user interface layouts using graph networks. In ECCV, pp. 730–746. Cited by: §2, §5.1.1.
  • [23] K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In ACL, pp. 311–318. Cited by: §5.1.1.
  • [24] T. Park, M. Liu, T. Wang, and J. Zhu (2019) Semantic image synthesis with spatially-adaptive normalization. In CVPR, Cited by: §1.
  • [25] A. G. Patil, O. Ben-Eliezer, O. Perel, and H. Averbuch-Elor (2020) READ: recursive autoencoders for document layout generation. In CVPR Workshops, pp. 544–545. Cited by: §2.
  • [26] P. Reddy, M. Gharbi, M. Lukac, and N. J. Mitra (2021) Im2Vec: synthesizing vector graphics without vector supervision. arXiv preprint arXiv:2102.02798. Cited by: §1.
  • [27] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In CVPR, pp. 4510–4520. Cited by: §3.2.
  • [28] P. Selinger (2003) Potrace: a polygon-based tracing algorithm. Cited by: §1.
  • [29] Y. Song (2020) Béziersketch: a generative model for scalable vector sketches. ECCV. Cited by: §2.
  • [30] F. Tan, S. Feng, and V. Ordonez (2019) Text2scene: generating compositional scenes from textual descriptions. In CVPR, pp. 6710–6719. Cited by: §2.
  • [31] G. Wang, Z. Qin, J. Yan, and L. Jiang (2020) Learning to select elements for graphic design. In ICMR, pp. 91–99. Cited by: §2.
  • [32] T. Wang, K. Yamaguchi, and V. Ordonez (2018) Feedback-prop: convolutional neural network inference under partial evidence. In CVPR, pp. 898–907. Cited by: §6.
  • [33] X. Zheng, X. Qiao, Y. Cao, and R. W. Lau (2019) Content-aware generative modeling of graphic design layouts. ACM Transactions on Graphics (TOG) 38 (4), pp. 1–15. Cited by: §1, §1, §2, §2, §3.2.
  • [34] X. Zhong, J. Tang, and A. J. Yepes (2019) PubLayNet: largest dataset ever for document layout analysis. In ICDAR, pp. 1015–1022. Cited by: §3.2.