A Learned Representation for Scalable Vector Graphics

04/04/2019 ∙ by Raphael Gontijo Lopes, et al. ∙ Google

Dramatic advances in generative models have resulted in near photographic quality for artificially rendered faces, animals and other objects in the natural world. In spite of such advances, a higher level understanding of vision and imagery does not arise from exhaustively modeling an object, but instead from identifying higher-level attributes that best summarize the aspects of an object. In this work we attempt to model the drawing process of fonts by building sequential generative models of vector graphics. This model has the benefit of providing a scale-invariant representation for imagery whose latent representation may be systematically manipulated and exploited to perform style propagation. We demonstrate these results on a large dataset of fonts and highlight how such a model captures the statistical dependencies and richness of this dataset. We envision that our model can find use as a tool for graphic designers to facilitate font design.




1 Introduction

Figure 6: Learning fonts in a native command space. Unlike pixels, scalable vector graphics (SVG) [11] are scale-invariant representations whose parameterizations may be systematically adjusted to convey different styles. All vector images are samples from a generative model of the SVG specification. Panels: Learned Vector Graphics Representation; Pixel Counterpart; Conveying Different Styles.

The last few years have witnessed dramatic advances in generative models of images that produce near photographic quality imagery of human faces, animals, and natural objects [4, 25, 27]. These models provide an exhaustive characterization of natural image statistics [51] and represent a significant advance in this domain. However, these advances in image synthesis ignore an important facet of how humans interpret raw visual information [47], namely that humans seem to exploit structured representations of visual concepts [33, 21]. Structured representations may be readily employed to aid generalization and efficient learning by identifying higher level primitives for conveying visual information [32] or provide building blocks for creative exploration [21, 20]. This may be best seen in human drawing, where techniques such as gesture drawing [44] emphasize parsimony for capturing higher level semantics and actions with minimal graphical content [53].

Our goal is to train a drawing model by presenting it with a large set of example images [16, 13]. To succeed, the model needs both to learn the underlying structure in those images and to generate drawings based on the learned representation [2]. In computer vision this is referred to as an “inverse graphics” problem [38, 31, 22, 41]. In our case the output representation is not pixels but rather a sequence of discrete instructions for rendering a drawing on a graphics engine. This poses dual challenges in terms of learning discrete representations for latent variables [56, 23, 37] and performing optimization through a non-differentiable graphics engine (but see [36, 31]). Previous approaches focused on program synthesis [32, 10] or employed reinforcement and adversarial learning [13]. We instead focus on a subset of this domain where we think we can make progress and improve the generality of the approach.

Font generation represents a 30-year-old problem posited as a constrained but diverse domain for understanding higher level perception and creativity [21]. Early research attempted to heuristically systematize the creation of fonts for expressing the identity of characters (e.g. a, 2) as well as stylistic elements constituting the “spirit” of a font [20]. Although such work provides great inspiration, the results were limited by a reliance on heuristics and a lack of a learned, structured representation [46]. Subsequent work for learning representations for fonts focused on models with simple parameterizations [34], template matching [54], example-based hints [62], or more recently, learning manifolds for detailed geometric annotations [5].

We instead focus the problem on generating fonts specified with Scalable Vector Graphics (SVG) – a common file format for fonts, human drawings, designs and illustrations [11]. SVGs are a compact, scale-invariant representation that may be rendered on most web browsers. SVGs specify an illustration as a sequence of higher-level commands paired with numerical arguments (Figure 6, top). We take inspiration from the literature on generative models of images in rasterized pixel space [15, 55]. Such models provide powerful auto-regressive formulations for discrete, sequential data [15, 55] and may be applied to rasterized renderings of drawings [16]. We extend these approaches to the generation of sequences of SVG commands for the inference of individual font characters. The goal of this work is to build a tool to learn a representation for font characters and style that may be extended to other artistic domains [7, 49, 16], or exploited as an intelligent assistant for font creation [6]. In this work, our main contributions are as follows:

  • Build a generative model for scalable vector graphics (SVG) images and apply this to a large-scale dataset of 14 M font characters.

  • Demonstrate that the generative model provides a perceptually smooth latent representation for font styles that captures a large amount of diversity and is consistent across individual characters.

  • Exploit the latent representation from the model to infer complete SVG fontsets from a single (or multiple) characters of a font.

  • Identify semantically meaningful directions in the latent representation to globally manipulate font style.

2 Related Work

2.1 Generative models of images

Generative models of images have generally followed two distinct directions. Generative adversarial networks [14] have demonstrated impressive advances [45, 14] over the last few years resulting in models that generate high resolution imagery nearly indistinguishable from real photographs [25, 4].

A second direction has pursued building probabilistic models largely focused on invertible representations [8, 27]. Such models are highly tractable and do not suffer from training instabilities largely attributable to saddle-point optimization [14]. Additionally, such models afford a true probabilistic model in which the quality of the model may be measured with well-characterized objectives such as log-likelihood.

2.2 Autoregressive generative models

One method for vastly improving the quality of generative models with unsupervised objectives is to break the problem of joint prediction into a conditional, sequential prediction task. Each step of the conditional prediction task may be expressed with a sequential model (e.g. [19]) trained in an autoregressive fashion. Such models are often trained with a teacher-forcing training strategy, but more sophisticated methods may be employed [1].

Autoregressive models have demonstrated great success in speech synthesis [42] and unsupervised learning tasks [56] across multiple domains. Variants of autoregressive models paired with more sophisticated density modeling [3] have been employed for sequentially generating handwriting [15].

2.3 Modeling higher level languages

The task of learning an algorithm from examples has been widely studied. Lines of work vary from directly modeling computation [24] to learning a hierarchical composition of given computational primitives [12]. Of particular relevance are efforts that learn to infer graphics programs from the visual features they render, often using constructs like variable bindings, loops, or simple conditionals [10].

The most comparable methods to this work yield impressive results on unsupervised induction of programs usable by a given graphics engine [13]. As their setup is non-differentiable, they use the REINFORCE [59] algorithm to perform adversarial training [14]. This method achieves impressive results despite not relying on labelled paired data. However, it tends to draw over previously-generated drawings, especially later in the generation process. While this could be suitable for modelling the generation of a 32x32 rasterized image, SVGs require a certain level of precision in order to scale with few perceptible issues.

2.4 Learning representations for fonts

Previous work has focused on enabling propagation of style between classes by identifying the class and style from high level features of a single character [46, 21, 20] or by finding correspondences between these features of different characters [54]. These features are often simplifications, such as character skeletons, which diminish the flexibility of the methods. Other work directly tackles style manipulation of generated full characters [34], but uses a simple parametric model that allows the user to only adjust parameters like its weight or width.

The most relevant works are those that attempt to learn a manifold of font style. Some unpublished work has explored how probabilistic methods may model pixel-based representations of font style [35]. The model learned semantically-meaningful latent spaces which can manipulate rasterized font images. More directly comparable, recent work learned energy models to capture the relationship between discretized points along the outlines of each character in order to address font generation and extrapolation [5]. This method yields impressive results for extrapolating between very few examples, but is limited by the need for all characters of a certain class to be composed of equivalent shapes. Additionally, the model discretely approximates points on a character’s outline, which may lead to visual artifacts at larger spatial scales.

3 Methods

Figure 7: Model architecture. Visual similarity between SVGs is learned by a class-conditioned, convolutional variational autoencoder (VAE) [30, 16] on a rendered representation (blue). The class label and learned representation are provided as input to a model that decodes SVG commands (purple). The SVG decoder consists of stacked LSTMs [19] followed by a Mixture Density Network (MDN) [15, 3]. See text for details. Panels: Image Autoencoder; SVG Decoder.

3.1 Data

We compiled a font dataset composed of 14 M examples across 62 characters (i.e. 0-9, a-z, A-Z), which we term SVG-Fonts. The dataset consists of fonts in a common font format (SFD; see https://fontforge.github.io), excluding examples where the unicode ID does not match the targeted 62 character set specified above. In spite of the filtering, label noise exists across the roughly 220 K fonts examined.

We employed a one-to-one mapping from SFD to SVG file format using a subset of 4 SVG commands: moveTo, lineTo, cubicBezier and EOS. SVG commands were normalized by starting from the top-most command and ordering in a clockwise direction. In preliminary experiments, we found it advantageous to specify command arguments using relative positioning information. See Appendix for more details of the dataset collection and normalization.

The final dataset consists of a sequence of commands specified in tuples. Each item in the sequence consists of a discrete selection of an SVG command paired with a set of normalized, floating-point numbers specifying command arguments. We restricted the dataset to only 4 SVG command types and examples with fewer than 50 commands to aid learning, but these restrictions may be relaxed to represent the complete SVG language. In comparison, note that [13] restricted inference to 20 actions for generating images. Finally, we partitioned the dataset into training and test sets. (Footnote: We plan to open-source tools to reproduce the construction of a subset of the dataset, as well as code to train the proposed model upon acceptance to a peer-reviewed conference.)
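To make the tuple format concrete, the following is a minimal sketch of how a path with absolute coordinates might be turned into (one-hot command, relative arguments) tuples. The function names, the zero-padding of arguments to a fixed width of six, and the omission of the argument rescaling step are assumptions made for illustration and are not taken from the released pipeline.

```python
# Hypothetical encoding sketch; names and the padding scheme are assumptions.
COMMANDS = ["moveTo", "lineTo", "cubicBezier", "EOS"]
MAX_ARGS = 6  # a cubic Bezier carries three (x, y) points


def one_hot(index, size):
    """Return a one-hot list of the given size."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def encode_path(path):
    """Encode [(command_name, [(x, y), ...]), ...] given in absolute coordinates.

    Every point of a command is expressed relative to the pen position
    at the start of that command, then zero-padded to MAX_ARGS values.
    An EOS tuple is appended at the end.
    """
    pen = (0.0, 0.0)
    encoded = []
    for name, points in path:
        args = []
        for x, y in points:
            args.extend([x - pen[0], y - pen[1]])
        if points:
            pen = points[-1]  # the last point becomes the new pen position
        args += [0.0] * (MAX_ARGS - len(args))
        encoded.append((one_hot(COMMANDS.index(name), len(COMMANDS)), args))
    eos = one_hot(COMMANDS.index("EOS"), len(COMMANDS))
    encoded.append((eos, [0.0] * MAX_ARGS))
    return encoded
```

With relative arguments, translating a glyph changes only the first moveTo, which is one plausible reason relative positioning eased learning.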

3.2 Network Architecture

The proposed model consists of a variational autoencoder (VAE) [30, 16] and an autoregressive SVG decoder implemented in Tensor2Tensor [57]. Figure 7 provides a diagram of the architecture; please see the Appendix for details. Briefly, the VAE consists of a convolutional encoder and decoder paired with instance normalization conditioned on the label (e.g. a, 2, etc.) [9, 43]. The VAE is trained as a class-conditioned autoencoder resulting in a latent code $z$ that is largely class-independent [28]. In preliminary experiments, we found that 32 dimensions provides a reasonable balance between expressivity and tractability. Note that the latent code consists of $\mu$ and $\sigma$, the mean and standard deviation of a multivariate Gaussian that may be sampled at test time.

The SVG decoder consists of 4 stacked LSTMs [19] trained with dropout [52, 61, 50]. The final layer is a Mixture Density Network (MDN) [3, 15] that may be stochastically sampled at test time. The LSTM receives as input the previous sampled MDN output, concatenated with the discrete class label and the latent style representation $z$. The SVG decoder’s loss is composed of a softmax cross-entropy loss over one-hot SVG commands plus the MDN loss applied to the real-valued arguments.
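As a rough illustration of the two-part decoder loss, the sketch below adds a softmax cross-entropy term over the command logits to a mixture negative log-likelihood over each real-valued argument. It assumes an independent one-dimensional Gaussian mixture per argument with strictly positive mixture weights; the actual parameterization in the model may differ, and all function names are hypothetical.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Cross-entropy between softmax(logits) and a one-hot target."""
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_norm - logits[target_index]


def gaussian_log_pdf(x, mean, std):
    """Log density of a scalar Gaussian."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(std)
            - 0.5 * ((x - mean) / std) ** 2)


def mdn_nll(x, weights, means, stds):
    """Negative log-likelihood of scalar x under a 1-D Gaussian mixture.

    Assumption: all mixture weights are strictly positive and sum to one.
    """
    terms = [math.log(w) + gaussian_log_pdf(x, m, s)
             for w, m, s in zip(weights, means, stds)]
    top = max(terms)  # log-sum-exp for numerical stability
    return -(top + math.log(sum(math.exp(t - top) for t in terms)))


def decoder_step_loss(command_logits, target_command, target_args, mixtures):
    """Command cross-entropy plus the summed mixture NLL of each argument.

    mixtures holds one (weights, means, stds) triple per argument.
    """
    loss = softmax_cross_entropy(command_logits, target_command)
    for x, (weights, means, stds) in zip(target_args, mixtures):
        loss += mdn_nll(x, weights, means, stds)
    return loss
```

Summing the two terms with equal weight is itself an assumption; the text states only that the loss is composed of both.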

In principle, the model may be trained end-to-end, but we found it simpler to train the two parts of the model separately. The VAE is trained on pixel renderings of the fonts with the Adam optimizer [26] for multiple epochs. We employ a high value of $\beta$ [18], and tune the number of free bits using cross-validation [29]. After convergence, the weights of the VAE are frozen, and the SVG decoder is trained to output the SVG commands from the latent representation $z$ using teacher forcing [60].
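Teacher forcing can be illustrated with a short sketch: at every step the decoder is fed the ground-truth previous command rather than its own sample, so the inputs are simply the targets shifted by one position. The start token and the function name below are hypothetical.

```python
def teacher_forcing_pairs(sequence, start_token):
    """Pair each target step with the ground-truth step that precedes it.

    The first input is a start token (an assumption for illustration);
    every later input is the previous ground-truth element.
    """
    inputs = [start_token] + sequence[:-1]
    return list(zip(inputs, sequence))
```

At test time no ground truth is available, so the decoder instead consumes its own sampled output, which is where the compounding errors discussed later can arise.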

Note that both the VAE and MDN are probabilistic models that may be sampled many times during evaluation. The results shown here are the selected best out of 10 samples. Please see Appendix for modeling, training and evaluation details.

4 Results

Figure 8: Selected examples of generated fonts. Examples generated by sampling a random latent representation $z$ and running the SVG decoder by conditioning on $z$ and all class labels. Each font character is selected as the best of 10 samples. See Figures 17 and 18 in Appendix for additional examples.

We compiled a font dataset consisting of 14 M examples. Individual font characters were normalized and converted into SVG format for training and evaluation. We trained a VAE and an SVG decoder over multiple epochs of the data and evaluated the results on a hold-out test split. Figures 6 and 8 show selected results from the trained model; please see the Appendix (Figures 17 and 18) for more exhaustive samples highlighting successes and failures. What follows is an analysis of the representational ability of the model to learn and generate SVG specified fonts.

4.1 Learning a smooth, latent representation of font style

Figure 9: Learning a smooth, latent representation of font style. UMAP visualization [39] of the learned latent space across 1 M examples (left). Purple boxes (A, B) provide a detail view of select regions. Blue lines (1-9) indicate linear interpolations in the full latent space between two characters of the dataset. Points along these linear interpolations are rendered as SVG images. Number in upper-right corner indicates number of strokes in SVG rendering. Best viewed in digital color. Panels: Learned Latent Space; Detail View; Linear Interpolation Between Two Characters.

We first ask whether the proposed model may learn a latent representation for font style that is perceptually smooth and interpretable. To address this question, we visualize the 32-dimensional font-style $z$ for 1 M examples from the training set and reduce the dimensionality to 2 using UMAP [39]. We discretize this 2D space and visualize the pixel-based decoding of the mean $z$ within each grid location (Figure 9). The purple boxes show two separate locations of this manifold, where we note the smooth transitions of the characters: (A) represents a non-italics region, while (B) represents an italics one. Further, local directions within these regions also reveal visual semantics: within (A), from left-to-right we note a change in the amount of serif, while top-to-bottom highlights a change in boldness.
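The grid visualization can be sketched as follows, assuming that the mean is taken over the latent style vectors whose 2D projections fall in the same grid cell. The projection itself (UMAP) and the pixel decoding are outside this sketch, and the function name and cell size are hypothetical.

```python
import math
from collections import defaultdict


def grid_means(points_2d, latents, cell_size):
    """Average the latent vectors whose 2-D projections share a grid cell.

    points_2d: list of (x, y) projections, one per example.
    latents:   list of latent vectors aligned with points_2d.
    Returns a dict mapping (column, row) cell indices to the mean latent.
    """
    cells = defaultdict(list)
    for (x, y), z in zip(points_2d, latents):
        key = (math.floor(x / cell_size), math.floor(y / cell_size))
        cells[key].append(z)
    return {key: [sum(col) / len(zs) for col in zip(*zs)]
            for key, zs in cells.items()}
```

Each averaged vector would then be passed to the image decoder to render one tile of the map.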

Next, we examine whether this smooth space translates into perceptually meaningful decodings of SVGs. We visualize linear interpolations between $z$ for pairs of SVGs from the dataset (Figure 9, 1-6). Note the smooth transition between the decoded SVGs, despite the fact that each SVG decoding is composed of a multitude of different commands. For instance, note that in the top row, each SVG is composed of 15-30 commands even though the perceptual representation appears quite smooth.
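A linear interpolation between two latent style vectors can be sketched in a few lines. The function name and the choice of evenly spaced steps are assumptions; each returned vector would be decoded into an SVG by the decoder.

```python
def interpolate(z_start, z_end, steps):
    """Return `steps` evenly spaced vectors from z_start to z_end inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    path = []
    for i in range(steps):
        t = i / (steps - 1)  # interpolation weight in [0, 1]
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path
```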

4.2 Exploiting the latent representation for style propagation

Figure 10: Exploiting the latent representation for style propagation. A single character may provide sufficient information for reconstructing the rest of a font set. The latent representation $z$ for a font is computed from a single character (purple box) and SVG images are generated for other characters from $z$.
Figure 11: Conditioning on increasing numbers of characters improves style propagation. Top: Layout follows Figure 10. The average latent representation $z$ for a font is computed from a set of characters (purple boxes) and SVG images are generated for other characters from $z$. Note that increasing the number of characters (purple boxes) improves the consistency and quality of the style propagation. Bottom: For all generated characters, we calculate a corresponding $z$ and measure the variance of $z$ over all generated characters within a font. Lower variance in $z$ indicates a more visually consistent font style. Each dot corresponds to the observed variance in $z$ when conditioned on a smaller or larger number of characters. Note that most fonts show higher consistency (i.e. lower variance) when conditioned on more characters.

Because the VAE is conditioned on the class label, we expect that the latent representation would only encode the font style with minimal class information [28]. We wish to exploit this model structure to perform style propagation across fonts. In particular, we ask whether a single character from a font set is sufficient to infer the rest of the font set in a visually plausible manner [46, 20].

To perform this task, we calculate the latent representation $z$ for a single character and condition the SVG decoder on $z$ as well as the label for all other font characters (i.e. 0-9, a-z, A-Z). Figure 10 shows the results of this experiment. For each row, $z$ is calculated from the character in the red box. The other characters in that row are generated from the SVG decoder conditioned on $z$.

We observe a perceptually-similar style consistently within each row [46, 20]. Note that there was no requirement during training that the same point in latent space would correspond to a perceptually similar character across labels – that is, the consistency across class labels was learned in an unsupervised manner [46, 20]. Thus, a single value of $z$ seems to correspond to a perceptually-similar set of characters that resembles a plausible fontset.

Additionally, we observe a large amount of style variety across rows (i.e. different $z$) in Figure 10. The variety indicates that the latent space is able to learn and capture a large diversity of styles observed in the training set as observed in Figure 9.

Finally, we also note that for a given column the decoded glyph does indeed belong to the class that was supplied to the SVG decoder. These results indicate that $z$ encodes style information consistently across different character labels, and that the proposed model largely disentangles class label from style.

A natural extension to this experiment is to ask if we could systematically improve the quality of style propagation by employing more than a single character. We address this question by calculating the latent representation for multiple characters and employing the average for style propagation to a new set of characters (Figure 11). We observe a systematic improvement in both style consistency and quality of individual icon outputs as one conditions on increasing numbers of characters.

To quantify this improvement in style consistency, we render the generated characters and calculate the associated style $z$ for each character. If the method of style propagation were perfectly self-consistent, we would expect that the $z$ across all generated characters would be identical. If, however, the style propagation were not consistent, the inferred $z$ would vary across each of the generated characters. To calculate the observed improvement, we measure the variance of $z$ across all generated characters for each of 19 fonts explored with this technique when conditioned on a smaller or larger number of characters (Figure 11, bottom). Indeed, we observe that conditioning on more characters generally decreases the variance of the generated styles, indicating that this procedure improves style consistency. Taken together, we suspect that these results on style propagation suggest a potential direction for providing iterative feedback to humans for synthesizing new fonts (see Discussion).
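The averaging and consistency measurement can be sketched as below. The text does not specify how the variance of a multi-dimensional latent vector is reduced to a single number; averaging the per-dimension population variance is an assumption made here for illustration, as are the function names.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def style_variance(vectors):
    """Per-dimension population variance, averaged over dimensions.

    Assumption: this scalar reduction is illustrative, not the paper's.
    A value of zero means every character was assigned the same style.
    """
    center = mean_vector(vectors)
    n = len(vectors)
    per_dim = [sum((v[d] - center[d]) ** 2 for v in vectors) / n
               for d in range(len(center))]
    return sum(per_dim) / len(per_dim)
```

In this sketch, `mean_vector` would be applied to the latent vectors of the conditioning characters, and `style_variance` to the latent vectors inferred from the rendered outputs.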

4.3 Building style analogies with the learned representation

Figure 12: Building style analogies with the learned representation. Semantically meaningful directions may be identified for globally altering font attributes. Top row: Bold (blue) and non-bold (red) regions of latent space (left) provide a vector direction that may be added to arbitrary points in latent space (A) for decreasing or increasing the strength of the attribute. Middle and bottom rows: Same for italics (B) and condensed (C).

Given that the latent style is perceptually smooth and aligned across class labels, we next ask if we may find semantically meaningful directions in this latent space. In particular, we ask whether these semantically meaningful directions may permit global manipulations of font style.

Inspired by the work on word vectors [40], we ask whether one may identify analogies for organizing the space of font styles (Figure 12, top). To address this question, we select positive and negative examples for semantic concepts of organizing fonts (e.g. bold, italics, condensed) and identify regions in latent space corresponding to the presence or absence of this concept (Figure 12, left, blue and red, respectively). We compute the averages $z_{\text{red}}$ and $z_{\text{blue}}$, and define the concept direction $c = z_{\text{blue}} - z_{\text{red}}$.

We test if these directions are meaningful by taking an example font style $z^*$ from the dataset (Figure 12, right, yellow), and adding (or subtracting) the concept vector $c$ scaled by some parameter $\alpha$. Finally, we compute the SVG decodings for $z^* + \alpha c$ across a range of $\alpha$.
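The analogy construction amounts to simple vector arithmetic, sketched below. The concept direction is the mean latent vector of the positive examples minus that of the negative examples, and a style is shifted by a scaled copy of that direction. Function and argument names are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def concept_direction(positive, negative):
    """Mean of the positive examples minus mean of the negative examples."""
    pos = mean_vector(positive)
    neg = mean_vector(negative)
    return [p - q for p, q in zip(pos, neg)]


def shift_style(z, direction, alpha):
    """Move a style vector along a concept direction by a factor alpha.

    Positive alpha strengthens the attribute, negative alpha weakens it.
    """
    return [v + alpha * d for v, d in zip(z, direction)]
```

Decoding `shift_style(z, c, alpha)` for a range of `alpha` values would give one row of the kind shown in Figure 12.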

Figure 12 (right) shows the resulting fonts. Note that across the three properties examined, we observe a smooth interpolation in the direction of the concept modeled (e.g.: first row v becomes increasingly bold from left to right). We take these results to indicate that one may interpret semantically meaningful directions in the latent space. Additionally, these results indicate that one may find directions in the latent space to globally manipulate font style.

4.4 Quantifying the quality of the learned representations

Figure 15: Quantifying the quality of the learned representations. (a) Top: Negative log-likelihood for training and test datasets over epochs. Bottom: Negative log-likelihood for selected, individual classes within the dataset. (b, left) Test negative log-likelihood of all characters with label 8 (top) and 7 (bottom) as a function of the number of SVG commands. (b, right) Examples from 8 and 7 with few commands and high (red) or low loss (orange). Examples with many commands and high (blue) or low loss (green). Panels: (a) Assessing Generalization; (b) Effect of Sequence Length on Decoding Quality.

Almost all of the results presented have been assessed qualitatively. This is largely due to the fact that the quality of the results is assessed based on human judgements of aesthetics. In this section, we attempt to provide some quantitative assessment of the quality of the proposed model.

Figure 15a (top) shows the training dynamics of the model as measured by the overall training objective. Over the course of training epochs, we find that the model does indeed improve in terms of likelihood and plateaus in performance. Furthermore, the resulting model does not overfit on the training set in any significant manner as measured by the log-likelihood.

Figure 15a (bottom) shows the mean negative log likelihoods for the examples in each class of the dataset. There is a small but systematic spread in average likelihood across classes. This is consistent with our qualitative results, where certain classes would consistently yield lower quality SVG decodings than others (e.g. 7 in Figure 10).

We can characterize the situations where the model performs best, and some possible causes for its improved performance. Figure 15b shows the negative log likelihoods of examples from the test set of a given class, as a function of their sequence lengths. With longer sequences, the variance of log likelihoods increases. For the best performing class (8, top) the loss values also trend downwards, whereas for the worst performing (7, bottom), the trend stays relatively level. This means that the model had a harder time reliably learning characters, especially those with longer sequence lengths.

Finally, in order to see what makes a given character hard or easy to learn, we examine test examples that achieved high and low loss, at different sequence lengths. Figure 15b (right) reveals that for any class, characters with high loss are generally highly stylized, regardless of their sequence lengths (red, blue), whereas easier to learn characters are more commonly found styles (yellow, green).

4.5 Limitations of working with a learned, stochastic, sequential representation

Figure 16: Limitations of proposed sequential, stochastic generative model. Left: Low-likelihood samples may result in errors difficult to correct. Color indicates ordering of sequential sample (blue to red). Right: Regions of latent space with high variance result in noisy SVG decodings. Latent representation color coded based on variance: light (dark) green indicates low (high) variance. Visualization of rendered and SVG decoded samples (purple, blue). Panels: Common issues; Quantifying Model Confidence.

Given the systematic variability in the model performance across class label and sequence length, we next examine how specific features of the modeling choices may lead to these failures. An exhaustive set of examples of model failures is highlighted in Figure 18 of the Appendix. We discuss below two common modes of failure due to the sequential, stochastic nature of the generative model.

At each stochastic sampling step, the proposed model may select a low likelihood decision. Figure 16a highlights how a mistake in drawing 3 early on leads to a series of errors that the model was not able to correct. Likewise, Figure 16a shows disconnected start and end points in drawing 6 caused by the accumulation of errors over time steps. Both of these errors may be remedied through better training schedules that attempt to teach forms of error correction [1], but please see Discussion.

A second systematic limitation is reflected in the uncertainty captured within the model. Namely, the proposed architecture contains some notion of confidence in its own predictions as measured by the variance in the VAE latent representation. We visualize the confidence by color-coding the UMAP representation for the latent style $z$ by $\sigma$ (Figure 16b, right). Lighter green colors indicate high model confidence reflected by lower VAE variance. Areas of high confidence show sharp outputs and decode higher quality SVGs (Figure 16b, purple). Conversely, areas with lower confidence correspond to areas with higher label noise or more stylized characters. These regions of latent space decode lower quality SVGs (Figure 16b, yellow). Addressing these systematic limitations is a modeling challenge for building next generation generative models for vector graphics (see Discussion).

5 Discussion

In this work we presented a generative model for vector graphics. This model has the benefit of providing a scale-invariant representation for imagery whose latent representation may be systematically manipulated and exploited to perform style propagation. We demonstrate these results on a large dataset of fonts and highlight the limitations of a sequential, stochastic model for capturing the statistical dependencies and richness of this dataset. Even in its present form, the current model may be employed as an assistive agent for helping humans design fonts in a more time-efficient manner [6, 46]. For example, a human may design a small set of characters and employ style propagation to synthesize the remaining set of characters (Figure 10, 11).

An immediate question is how to build better-performing models for vector graphics. Immediate opportunities include new attention-based architectures [58] or potentially some form of adversarial training [14]. Improving the model training to afford the opportunity for error correction may provide further gains [1].

A second direction is to employ this model architecture on other SVG vector graphics datasets. Examples include icon datasets [7] or human drawings [49, 16]. Such datasets reveal additional challenges above and beyond the fonts explored in this work, as the SVG drawings encompass a larger amount of diversity and contain larger numbers of strokes in a sequence. Additionally, employing color, brush stroke and other illustration tools as predicted features offers new and interesting directions for increasing the expressivity of the learned models.


We’d like to thank the following people: Diederik Kingma, Benjamin Caine, Trevor Gale, Sam Greydanus, Keren Gu, and Colin Raffel for discussions and feedback; Monica Dinculescu and Shan Carter for insights from designers’ perspective; Ryan Sepassi for technical help using Tensor2Tensor; Jason Schwarz for technical help using UMAP; Joshua Morton for infrastructure help; Yaroslav Bulatov, Yi Zhang, and Vincent Vanhoucke for help with the dataset; and the Google Brain and the AI Residency teams.


Appendix A

A.1 Dataset details

We collected font characters in a common format (SFD), retaining only characters whose unicode id corresponded to the classes 0-9, a-z, A-Z. Filtering by unicode id is imperfect because many icons intentionally declare an id such that equivalent characters can be rendered in that font style (e.g.: 七 sometimes declares the unicode id normally reserved for 7).

We then convert the SFD icons into SVG. The SVG format can be composed of many elements (square, circle, etc). The most expressive of these is the path element whose main attribute is a sequence of commands, each requiring a varying number of arguments (lineTo: 1 argument, cubicBezierCurve: 3 arguments, etc.). An SVG can contain multiple elements but we found that SFD fonts can be modelled with a single path and a subset of its commands (moveTo, lineTo, cubicBezierCurve, EOS). This motivates our method to model SVGs as a single sequence of commands.

In order to aid learning, we filter out characters with over 50 commands. We also found it crucial to use relative positioning in the arguments of each command. Additionally, we re-scale the arguments of all icons to ensure that most real-values in the dataset will lie in similar ranges. This process preserves size differences between icons. Finally, we standardize the command ordering within a path such that each shape begins and ends at its top-most point and the curves always start by going clockwise. We found that setting this prior was important to remove any ambiguity regarding where the SVG decoder should start drawing from and which direction (information which the image encoder would not be able to provide).
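The ordering convention described above can be sketched for a simplified contour made only of on-curve points. The sketch uses the shoelace formula to detect orientation, reverses counter-clockwise contours, and rotates the point list to start at the top-most point. It assumes a y-axis that points up (as in font units), ignores Bezier control points, and breaks ties between equally high points by preferring the left-most one; all of these are illustrative assumptions.

```python
def signed_area(points):
    """Shoelace formula; positive for counter-clockwise when y points up."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:] + points[:1]):
        total += x0 * y1 - x1 * y0
    return total / 2.0


def normalize_contour(points):
    """Return the contour clockwise and starting at its top-most point.

    Assumptions: closed polygon of on-curve points, y axis pointing up,
    ties on height broken by the smallest x coordinate.
    """
    if signed_area(points) > 0:  # counter-clockwise, so reverse it
        points = points[::-1]
    start = max(range(len(points)),
                key=lambda i: (points[i][1], -points[i][0]))
    return points[start:] + points[:start]
```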

When rendering the SVGs as 64x64 pixel images, we pick a render region that captures tails and descenders that go under the character’s baseline. Because relative size differences between fonts are still present, this process is also imperfect: zoom out too little and many characters will still go outside the rendered image, zoom out too much and most characters will be too small for important style differences to be salient.

Lastly, we convert the SVG path into a vector format suitable for training a neural network model: each character is represented by a sequence of commands, each consisting of a tuple with: 1) a one-hot encoding of command type (moveTo, lineTo, etc.) and 2) a normalized representation of the command’s arguments (e.g.: x, y positions). Note that for this dataset we only use 4 commands (including EOS), but this representation can easily be extended to any SVG icon that uses any or all of the commands in the SVG path language.
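One possible way to build such a per-command vector is sketched below. The zero-padding of arguments to a fixed width of six values, the command ordering, and the function names are assumptions for illustration; the text does not specify the exact layout.

```python
import numpy as np

# The four command types named in the text; the ordering is an assumption.
COMMANDS = ["moveTo", "lineTo", "cubicBezierCurve", "EOS"]
# Assumed fixed argument width: a cubic Bezier needs three (x, y) points.
MAX_ARGS = 6


def encode_command(name, args):
    """Concatenate a one-hot command type with zero-padded arguments."""
    one_hot = np.zeros(len(COMMANDS), dtype=np.float32)
    one_hot[COMMANDS.index(name)] = 1.0
    padded = np.zeros(MAX_ARGS, dtype=np.float32)
    if args:
        padded[: len(args)] = args
    return np.concatenate([one_hot, padded])


def encode_path(commands):
    """Encode a list of (name, args) commands and append an EOS row."""
    rows = [encode_command(name, args) for name, args in commands]
    rows.append(encode_command("EOS", []))
    return np.stack(rows)


encoded = encode_path([("moveTo", [0.1, 0.2]), ("lineTo", [0.3, 0.0])])
```

Each row then has a fixed width (here 4 + 6 = 10), so a whole character becomes a matrix with one row per command.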

a.2 Details of network architecture

Our model is composed of two separate substructures: a convolutional variational autoencoder and an auto-regressive SVG decoder. The model was implemented and trained with Tensor2Tensor [57].

The image encoder is composed of a sequence of blocks, each composed of a convolutional layer, conditional instance normalization (CIN) [9, 43], and a ReLU activation. Its output is a z representation of the input image. At training time, z is sampled using the reparameterization trick [30, 48]. At test time, we simply use its mean, z = μ. The image decoder is an approximate mirror image of the encoder, with transposed convolutions in place of the convolutions. All convolutional-type layers have SAME padding. CIN layers were conditioned on the icon’s class.
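The difference between training-time sampling and test-time decoding can be illustrated with a short sketch. It assumes the encoder produces a mean and a log-variance for a diagonal Gaussian posterior, which is the usual VAE setup but is not spelled out in the text; the function name is hypothetical.

```python
import numpy as np


def sample_latent(mu, log_var, training, rng):
    """Reparameterized sample at training time, posterior mean at test time."""
    if not training:
        return mu
    # Draw noise independently of the parameters so that gradients can
    # flow through mu and log_var in a real autodiff framework.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
log_var = np.array([0.0, 0.0])
z_train = sample_latent(mu, log_var, training=True, rng=rng)
z_test = sample_latent(mu, log_var, training=False, rng=rng)
```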


Operations        Kernel, Stride    Output Dim
Conv-CIN-ReLU     5, 1              64x64x32
Conv-CIN-ReLU     5, 1              32x32x32
Conv-CIN-ReLU     5, 1              32x32x64
Conv-CIN-ReLU     5, 2              16x16x64
Conv-CIN-ReLU     4, 2              8x8x64
Conv-CIN-ReLU     4, 2              4x4x64
Flatten-Dense     -                 64
Table 1: Architecture of convolutional image encoder containing parameters.

Operations        Kernel, Stride    Output Dim
Dense-Reshape     -                 4x4x64
ConvT-CIN-ReLU    4, 2              8x8x64
ConvT-CIN-ReLU    4, 2              16x16x64
ConvT-CIN-ReLU    5, 1              16x16x64
ConvT-CIN-ReLU    5, 2              32x32x64
ConvT-CIN-ReLU    5, 1              32x32x32
ConvT-CIN-ReLU    5, 2              64x64x32
ConvT-CIN-ReLU    5, 1              64x64x32
Conv-Sigmoid      5, 1              64x64x1
Table 2: Architecture of convolutional image decoder containing parameters.
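As a sanity check on the spatial dimensions in Table 2, the effect of SAME padding can be traced by hand or with a few lines of code. The sketch below uses the standard SAME-padding rules (output size ceil(in / stride) for a convolution, in * stride for a transposed convolution); the helper names are hypothetical.

```python
import math


def conv_same_out(size, stride):
    """Spatial output size of a SAME-padded convolution."""
    return math.ceil(size / stride)


def conv_transpose_same_out(size, stride):
    """Spatial output size of a SAME-padded transposed convolution."""
    return size * stride


# Strides of the seven ConvT-CIN-ReLU rows of Table 2, in order.
DECODER_STRIDES = [2, 2, 1, 2, 1, 2, 1]


def trace_decoder(start=4):
    """Spatial size after the reshape and after each transposed convolution."""
    sizes = [start]
    for stride in DECODER_STRIDES:
        sizes.append(conv_transpose_same_out(sizes[-1], stride))
    # Final Conv-Sigmoid layer has stride 1, so the size is unchanged.
    sizes.append(conv_same_out(sizes[-1], 1))
    return sizes


decoder_sizes = trace_decoder()
```

The kernel size does not affect the output size under SAME padding, which is why only the strides appear in the trace.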

The SVG decoder consists of 4 stacked LSTM cells with hidden dimension , trained with feed-forward dropout [52], as well as recurrent dropout [61, 50] at keep-probability. The decoder’s topmost layer consists of a Mixture Density Network (MDN) [3, 15]. The LSTM’s hidden state is initialized by conditioning on z. At each time-step, the LSTM receives as input the previous time-step’s sampled MDN output, the character’s class and the z representation. The total number of parameters is .
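To make the "sampled MDN output" concrete, the sketch below draws one real-valued argument from a one-dimensional Gaussian mixture. The parameterization (mixture logits, means, and log standard deviations per component) is a common MDN convention and an assumption here, as is the function name.

```python
import numpy as np


def sample_mdn(logits, means, log_stds, rng):
    """Draw one value from a 1-D Gaussian mixture.

    logits, means, log_stds: arrays of shape (num_components,).
    """
    # Softmax over mixture logits, shifted for numerical stability.
    shifted = logits - logits.max()
    probs = np.exp(shifted)
    probs /= probs.sum()
    # Pick a component, then sample from its Gaussian.
    k = rng.choice(len(probs), p=probs)
    return means[k] + np.exp(log_stds[k]) * rng.standard_normal()


rng = np.random.default_rng(0)
value = sample_mdn(
    logits=np.array([100.0, 0.0]),
    means=np.array([3.0, -7.0]),
    log_stds=np.array([-50.0, -50.0]),
    rng=rng,
)
```

In an auto-regressive decoder, a value sampled this way would be fed back as part of the next time-step's input.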

a.3 Training details

The optimization objective of the image VAE is the log-likelihood reconstruction loss and the KL loss applied to z with KL-beta . We use Kingma et al.’s [29] free-bits trick ( per dimension of z). The model is trained with batch size .
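A minimal sketch of such an objective follows. It assumes a Bernoulli (binary cross-entropy) reconstruction term, a diagonal Gaussian posterior with a standard normal prior, and the free-bits variant that clamps the batch-averaged KL of each latent dimension from below; `beta` and `free_bits` are placeholders, and the values in the usage lines are hypothetical rather than the settings used in the paper.

```python
import numpy as np


def vae_loss(x, x_hat, mu, log_var, beta, free_bits):
    """Reconstruction loss plus beta-weighted KL with free bits.

    x, x_hat: (batch, pixels) arrays with values in [0, 1].
    mu, log_var: (batch, latent_dim) posterior parameters.
    """
    eps = 1e-7
    x_hat = np.clip(x_hat, eps, 1.0 - eps)
    # Negative Bernoulli log-likelihood, summed over pixels.
    recon = -np.sum(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat), axis=1)
    # KL(q(z|x) || N(0, I)) per example and per latent dimension.
    kl_per_dim = 0.5 * (mu ** 2 + np.exp(log_var) - log_var - 1.0)
    # Free bits: each dimension contributes at least `free_bits` nats.
    kl = np.sum(np.maximum(kl_per_dim.mean(axis=0), free_bits))
    return recon.mean() + beta * kl


# Hypothetical values, chosen only to exercise the function.
x = np.array([[1.0, 0.0, 1.0, 0.0]])
loss = vae_loss(x, x, mu=np.zeros((1, 3)), log_var=np.zeros((1, 3)),
                beta=2.0, free_bits=0.5)
```

With free bits, a latent dimension whose KL is already below the threshold receives no gradient from the KL term, which discourages the posterior from collapsing onto the prior.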

The SVG decoder’s loss is composed of a softmax cross-entropy loss between the one-hot command-type encoding, added to the MDN loss applied to the real-valued arguments. We found it useful to scale the softmax cross-entropy loss by when training with this mixed loss. We trained this model with teacher forcing and used batch size .
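The mixed loss for a single time-step can be sketched as below. The one-dimensional Gaussian mixture per argument and the names `svg_step_loss` and `ce_weight` are assumptions for illustration; `ce_weight` stands in for the scaling factor mentioned above, and the value used in the usage lines is hypothetical.

```python
import numpy as np


def softmax_cross_entropy(logits, one_hot):
    """Cross-entropy between a one-hot target and softmax(logits)."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -np.sum(one_hot * log_probs)


def mdn_nll(target, logits, means, log_stds):
    """Negative log-likelihood of a scalar under a 1-D Gaussian mixture."""
    shifted = logits - logits.max()
    log_pi = shifted - np.log(np.exp(shifted).sum())
    log_norm = (-0.5 * np.log(2.0 * np.pi) - log_stds
                - 0.5 * ((target - means) / np.exp(log_stds)) ** 2)
    joint = log_pi + log_norm
    # Log-sum-exp over components, shifted for numerical stability.
    m = joint.max()
    return -(m + np.log(np.exp(joint - m).sum()))


def svg_step_loss(cmd_logits, cmd_one_hot, arg_targets, mdn_params, ce_weight):
    """Scaled command-type cross-entropy plus MDN loss over the arguments."""
    ce = softmax_cross_entropy(cmd_logits, cmd_one_hot)
    nll = sum(mdn_nll(t, *params) for t, params in zip(arg_targets, mdn_params))
    return ce_weight * ce + nll


step_loss = svg_step_loss(
    cmd_logits=np.zeros(4),
    cmd_one_hot=np.array([0.0, 1.0, 0.0, 0.0]),
    arg_targets=[0.5],
    mdn_params=[(np.zeros(1), np.array([0.5]), np.zeros(1))],
    ce_weight=2.0,  # hypothetical scale
)
```

Under teacher forcing, the targets at each step come from the ground-truth sequence rather than from the model's own previous samples.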

All models are initialized with the method proposed by He et al. (2015) [17] and trained with the Adam optimizer with [26].

a.4 Visualization details

The dimensionality reduction algorithm used for visualizing the latent space is UMAP [39]. We fit the activations of M examples from the dataset into 2 UMAP components, using the cosine similarity metric, with a minimum distance of , and nearest neighbors. After fitting, we discretize the 2D space into discrete buckets and decode the average z in each grid cell with the image decoder.
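The bucketing step after the 2D fit can be sketched as follows. The sketch takes an already computed 2D embedding as input, so it does not depend on any particular UMAP implementation; the function name, the uniform grid over the embedding's bounding box, and the small example values are assumptions.

```python
import numpy as np


def average_latents_on_grid(embedding, latents, buckets):
    """Average latent vectors that fall into the same 2D grid cell.

    embedding: (n, 2) array of 2D coordinates.
    latents: (n, d) array of latent vectors, row-aligned with embedding.
    Returns a dict mapping (i, j) cell indices to the mean latent vector.
    """
    lo = embedding.min(axis=0)
    span = embedding.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid division by zero on a degenerate axis
    idx = np.floor((embedding - lo) / span * buckets).astype(int)
    # Points on the upper boundary belong to the last bucket.
    idx = np.clip(idx, 0, buckets - 1)
    cells = {}
    for (i, j), z in zip(idx, latents):
        cells.setdefault((int(i), int(j)), []).append(z)
    return {cell: np.mean(zs, axis=0) for cell, zs in cells.items()}


embedding = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0]])
latents = np.array([[1.0, 1.0], [3.0, 3.0], [10.0, 10.0]])
grid = average_latents_on_grid(embedding, latents, buckets=2)
```

Each averaged vector would then be passed through the image decoder to render one tile of the visualization; empty cells simply have no entry.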

a.5 Samples from generated font sets

Figures 6, 9, 10, 12 and 16 contained selected examples to highlight the successes and failures of the model as well as demonstrate various results. To provide the reader with a more complete understanding of the model performance, below we provide additional samples from the model highlighting successes (Figure 17) and failures (Figure 18). As before, results shown are selected best out of 10 samples.

Figure 17: More examples of randomly generated fonts. Details follow Figure 8.
Figure 18: Examples of poorly generated fonts. Details follow Figure 8. Highly stylized characters clustered in a high-variance region of the latent space z. Samples from this region generate poor quality SVG fonts. See Section 4.5 for details.