Few-shot Compositional Font Generation with Dual Memory

05/21/2020 ∙ by Junbum Cha, et al.

Generating a new font library is a very labor-intensive and time-consuming job for glyph-rich scripts. Despite the remarkable success of existing font generation methods, they have significant drawbacks: they either require a large number of reference images to generate a new font set, or fail to capture detailed styles from only a few samples. In this paper, we focus on compositional scripts, a widely used family of writing systems in which each glyph can be decomposed into several components. By utilizing the compositionality of compositional scripts, we propose a novel font generation framework, named Dual Memory-augmented Font Generation Network (DM-Font), which enables us to generate a high-quality font library with only a few samples. We employ memory components and global-context awareness in the generator to take advantage of the compositionality. In experiments on Korean-handwriting fonts and Thai-printing fonts, we observe that our method generates samples of significantly better quality with more faithful stylization than state-of-the-art generation methods, both quantitatively and qualitatively. Source code is available at https://github.com/clovaai/dmfont.




Code Repositories


Official PyTorch implementation of DM-Font (ECCV 2020)


1 Introduction

Advances in web technology have led people to consume a massive amount of text on the web, which makes designing new font styles, e.g., personalized handwriting, increasingly important. However, traditional font design relies on expert designers manually drawing each glyph, so creating a font library is extremely expensive and labor-intensive for glyph-rich scripts such as Chinese (more than 50,000 glyphs), Korean (11,172 glyphs), or Thai (11,088 glyphs) [koreantextbook].

Recently, end-to-end font generation methods [zi2zi, jiang2017dcfont, jiang2019_aaai_scfont, lyu2017_icdar_aegg, chang2018_bmvc_hgan, chang2018_wacv_densecyclegan] have been proposed to build a font set without human experts. These methods solve image-to-image translation tasks between various font styles based on generative adversarial networks (GANs) [gan]. While they have shown remarkable achievements, they still require a large number of samples [jiang2017dcfont, jiang2019_aaai_scfont] to generate a new font set. Moreover, they require additional training to create a new glyph set, i.e., they need to finetune the pretrained model on the given new glyph subset. Thus, these finetune-based methods are rarely practical when collecting the target glyphs is extremely expensive, e.g., for human handwriting.

Several recent studies attempt to generate a font set from only a few samples, without additional training on a large number of glyphs [azadi2018mcgan, sun2018_ijcai_savae, zhang2018_cvpr_emd, gao2019agisnet, srivatsan2019_emnlp_deepfactorization]. Despite their successful few-shot generation performance on training styles, existing few-shot font generation methods often fail to generate a high-quality font library from few-shot samples of an unseen style, as illustrated in Figure 1. We solve this problem using the inherent characteristics of glyphs, in contrast to most previous works, which handle the problem in an end-to-end data-driven manner without any human prior. A few researchers have considered characteristics of glyphs to improve font generation methods [sun2018_ijcai_savae, jiang2019_aaai_scfont], but their approaches either still require a large number of samples [jiang2019_aaai_scfont] or are designed only for memory efficiency [sun2018_ijcai_savae].

In this paper, we focus on a widely used family of scripts, called compositional scripts, whose glyphs are composed of a combination of sub-glyphs or components. For example, the Korean script has 11,172 valid glyphs with only 68 components. In principle, one could build a full font library by designing only 68 sub-glyphs and combining them by a pre-defined rule. However, this rule-based method has a significant limitation: a sub-glyph changes its shape and position depending on the combination, as shown in Figure 2. Hence, even if a user has a complete set of sub-glyphs, generating a full font set is impossible without knowing how the components combine. Due to this limitation, fonts for compositional scripts have been manually designed glyph by glyph despite their compositionality [koreantextbook].

Our framework for few-shot font generation, Dual Memory-augmented Font Generation Network (DM-Font), utilizes the compositionality supervision in a weakly-supervised manner, i.e., no component-wise bounding box or mask is required but only component labels, resulting in more efficient and effective generation. We employ a dual memory structure (persistent memory and dynamic memory) to efficiently capture the global glyph structure and the local component-wise styles, respectively. This strategy enables us to generate a new high-quality font library with only a few samples for both Korean and Thai. In the experiments, the generated Korean and Thai fonts show quantitatively better visual quality in various metrics and are qualitatively preferred in the user study.

2 Related Works

2.1 Few-shot image-to-image translation

Image-to-image (I2I) translation [isola2017_cvpr_pix2pix, zhu2017_iccv_cyclegan, stargan, karras2019stylegan, starganv2] aims to learn the mapping between different domains. This mapping preserves the content of the source domain while changing the style to that of the target domain. Mainstream I2I translation methods assume an abundance of target training samples, which is impractical. To deal with more realistic scenarios where the target samples are scarce, few-shot I2I translation works appeared recently [liu2019funit]. These methods can be directly applied to the font generation task by treating it as a translation between the reference font and the target font. We compare our method with FUNIT [liu2019funit].

As an independent line of research, style transfer methods [gatys2016neuralstyle, wct, adain, deepphotostyle, photowct, wct2] have been proposed to transfer the style of an unseen reference while preserving the original content. Unlike I2I translation methods, style transfer methods cannot be directly applied to font generation tasks, because they usually define style as a set of textures and colors. In font generation tasks, however, the style of a font is usually defined as a discriminative local property of the font. Hence, we do not consider style transfer methods as baselines.

2.2 Automatic font generation

The automatic font generation task is an I2I translation between different font domains, i.e., styles. We categorize automatic font generation methods into two classes, many-shot and few-shot methods, according to how they generate a new font set. Many-shot methods [zi2zi, jiang2017dcfont, lyu2017_icdar_aegg, chang2018_bmvc_hgan, chang2018_wacv_densecyclegan, jiang2019_aaai_scfont] directly finetune the model on the target font set with a large number of samples. This is impractical in many real-world scenarios where collecting new glyphs is costly, e.g., handwriting.

In contrast, few-shot font generation methods [zhang2018_cvpr_emd, srivatsan2019_emnlp_deepfactorization, azadi2018mcgan, gao2019agisnet, sun2018_ijcai_savae] do not require additional finetuning or a large number of reference images. However, the existing few-shot methods have significant drawbacks. For example, some methods generate a whole font set in a single forward pass [azadi2018mcgan, srivatsan2019_emnlp_deepfactorization]. Hence, they require a huge model capacity and can be applied only to scripts with a few glyphs, e.g., the Latin alphabet, not to glyph-rich scripts. On the other hand, EMD [zhang2018_cvpr_emd] and AGIS-Net [gao2019agisnet] can be applied to any general script, but they show worse synthesis quality on unseen style fonts, as observed in our experimental results. SA-VAE [sun2018_ijcai_savae], a Chinese-specific method, keeps the model size small by compressing one-hot character-wise embeddings based on the compositionality of the Chinese script. Compared with SA-VAE, our method handles features component-wise rather than character-wise. This brings large advantages not only in reducing the feature dimension but also in performance, as shown in our experimental results.

Figure 2: Examples of compositionality of Korean script. Even if we choose the same sub-glyph, e.g., “ㄱ”, its shape and position vary depending on the combination, as shown in red boxes.

3 Preliminary: Complete Compositional Scripts

A compositional script is a widely used kind of glyph-rich script in which each glyph can be decomposed into several components, as shown in Fig. 2. These scripts account for 24 of the top 30 popular scripts, including Chinese, Hindi, Arabic, Korean, and Thai. A compositional script is either complete or not: each glyph in a complete compositional script can be decomposed into a fixed number of sub-glyphs. For example, every Korean glyph can be decomposed into three sub-glyphs (see Fig. 2). Similarly, a Thai character has four components. Furthermore, complete compositional scripts have a specific sub-glyph set for each component type. For example, the Korean alphabet has three component types with 19, 21, and 28 sub-glyphs, respectively (68 in total). By combining them, the Korean script has 19 × 21 × 28 = 11,172 valid characters. Note that the minimum number of glyphs needed to cover the entire sub-glyph set is 28, the size of the largest component type. Similarly, the Thai script can represent 11,088 characters, and only a small subset of characters is required to cover all sub-glyphs.
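The Korean counts quoted in this paper (68 components, 11,172 glyphs) can be checked with a few lines of Python. This is a minimal sketch: the per-type sizes (19 initial consonants, 21 vowels, 28 finals including the empty final) follow the standard Unicode Hangul layout and are an assumption of the sketch, and the names `COMPONENT_SIZES`, `count_valid_glyphs`, and `min_reference_glyphs` are hypothetical.

```python
# Assumed per-type sizes (standard Unicode Hangul layout): 19 initial
# consonants, 21 vowels, and 28 finals (27 final consonants plus "no final").
COMPONENT_SIZES = {"initial": 19, "medial": 21, "final": 28}


def count_valid_glyphs(sizes):
    """Number of glyphs formed by choosing one sub-glyph per component type."""
    total = 1
    for n in sizes.values():
        total *= n
    return total


def min_reference_glyphs(sizes):
    """Lower bound on the glyphs needed to show every sub-glyph at least once.

    Each glyph shows exactly one sub-glyph per component type, so the largest
    component type determines the bound.
    """
    return max(sizes.values())


num_components = sum(COMPONENT_SIZES.values())
num_glyphs = count_valid_glyphs(COMPONENT_SIZES)
num_refs = min_reference_glyphs(COMPONENT_SIZES)
print(num_components, num_glyphs, num_refs)
```

The same two functions apply to any complete compositional script once its per-type sub-glyph counts are known.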

Some compositional scripts are not complete. For example, each Chinese character can be decomposed into a varying number of sub-glyphs. Although we mainly validate our method on Korean and Thai scripts, our method can be easily extended to other compositional scripts.

(a) Architecture overview.
(b) Encoding phase detail.
(c) Decoding phase detail.
Figure 3: DM-Font overview. (a) The model encodes the reference style glyphs and stores the component-wise features into the memory (detailed in (b)). The decoder generates images with the component-wise features (detailed in (c)). (b) The encoder extracts the component-wise features and stores them into the dynamic memory using the component label and the style label. (c) The memory addressor loads the component features by the character label and feeds them to the decoder.

4 Dual Memory-augmented Font Generation Network

In this section, we introduce a novel architecture, Dual Memory-augmented Font Generation Network (DM-Font), which utilizes the compositionality of a script through an augmented dual memory structure. DM-Font disentangles global composition information and local styles, and writes them into persistent and dynamic memory, respectively. This makes it possible to build a high-quality full glyph library from only a few reference glyphs for both Korean and Thai.

4.1 Architecture overview

We illustrate the architecture overview of DM-Font in Figure 3(a). The generation process consists of encoding and decoding stages. In the encoding stage, the reference style glyphs are encoded into component features and stored in the dynamic memory. After the encoding, the decoder fetches the component features and generates the target glyph according to the target character label.

The encoder disassembles a source glyph into several component features using the pre-defined decomposition function. We adopt a multi-head structure, with one head per component type. The encoded component-wise features are written into the dynamic memory, as shown in Figure 3(b).

We employ two memory modules. Persistent memory (PM) is a component-wise learned embedding that represents the intrinsic shape of each component and the global information of the script, such as the compositionality, while dynamic memory (DM) stores the encoded component features of the given reference glyphs. Hence, PM captures the global information of sub-glyphs independent of each font style, while the encoded features in DM capture the unique local styles of each font. Note that DM simply stores and retrieves the encoded features, whereas PM is a learned embedding trained from the data. Therefore, DM is adaptive to the reference input style samples, while PM is fixed after training. We provide a detailed analysis of each memory in the experiments.

The memory addressor provides the access addresses of both the dynamic and the persistent memory based on the given character label, as shown in Figure 3(b) and Figure 3(c). We use a pre-defined decomposition function that maps a character label to its component-wise addresses, i.e., the labels of its components. For example, the function decomposes the Korean character “한” into {“ㅎ”, “ㅏ”, “ㄴ”}. The function maps the input character to Unicode and decomposes it by a simple rule. More details of the decomposition function are given in Appendix A.

The component-wise encoded features of a reference glyph, which has a character label and a style label, are stored into DM during the encoding stage. In our scenario, the encoder is a multi-head encoder, and the character label is decomposed into its sub-glyph labels by the decomposition function. Hence, the feature stored in DM at the address given by a sub-glyph label and the style label is the output of the encoder head corresponding to that component type.

In the decoding stage, the decoder generates a target glyph with the target character and the reference style using the component-wise features stored in the dynamic memory and the persistent memory (equation (1)): for each component of the target character, the feature read from DM and the embedding read from PM are concatenated and fed to the decoder.
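The read and write pattern of the two memories can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: `DynamicMemory`, `PersistentMemory`, `toy_encode`, and `decoder_input` are hypothetical names, the encoder is replaced by a constant-valued stand-in, the persistent embeddings are random rather than learned, and addresses are assumed to be (component type, sub-glyph) pairs.

```python
import random


class DynamicMemory:
    """Stores encoded features keyed by (component label, style label)."""

    def __init__(self):
        self._store = {}

    def write(self, component, style, feature):
        self._store[(component, style)] = list(feature)

    def read(self, component, style):
        return self._store[(component, style)]


class PersistentMemory:
    """One embedding per component; learned in a real model, random here."""

    def __init__(self, components, dim, seed=0):
        rng = random.Random(seed)
        self._table = {
            c: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for c in components
        }

    def read(self, component):
        return self._table[component]


def toy_encode(components, style_value, dim):
    """Hypothetical stand-in for the multi-head encoder: one feature per component."""
    return {c: [style_value] * dim for c in components}


def decoder_input(target_components, style, dm, pm):
    """Concatenate the DM feature and the PM embedding of each target component."""
    return [dm.read(c, style) + pm.read(c) for c in target_components]


DIM = 4
han = [("initial", "ㅎ"), ("medial", "ㅏ"), ("final", "ㄴ")]  # reference glyph
guk = [("initial", "ㄱ"), ("medial", "ㅜ"), ("final", "ㄱ")]  # reference glyph
hak = [("initial", "ㅎ"), ("medial", "ㅏ"), ("final", "ㄱ")]  # unseen target

dm = DynamicMemory()
pm = PersistentMemory(sorted(set(han + guk)), DIM)
for glyph in (han, guk):
    for comp, feat in toy_encode(glyph, 0.5, DIM).items():
        dm.write(comp, "my_handwriting", feat)

features = decoder_input(hak, "my_handwriting", dm, pm)
```

The target glyph never appears among the references. Its three components are assembled from features that were written while encoding two other glyphs, which is the property the dual memory is meant to exploit.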

For better generation quality, we also employ a discriminator and a component classifier. For the discriminator, we use a multitask discriminator [mescheder2018training, liu2019funit] with the font condition and the character condition. The multitask discriminator has an independent branch for each target class, and each branch performs binary classification. To handle the two types of conditions, we use two multitask discriminators with a shared backbone, one for character classes and the other for font classes. We further use a component classifier to ensure that the model fully utilizes the compositionality. The component classifier provides additional supervision to the generator, which stabilizes the training.

Moreover, we introduce global-context awareness and local-style preservation to the generator, which we call the compositional generator. Specifically, self-attention blocks [cao2019gcnet, zhang2019sagan] are used in the encoder to facilitate relational reasoning between components, and an hourglass block [newell2016hourglass, Lin_2017_CVPR_FPN] is attached to the decoder to make it aware of the global context while preserving locality. In the experiment section, we analyze the impact of the architectural improvements on the final performance. We provide the architecture and the implementation details in Appendix A.

DM-Font learns the compositionality in the weakly-supervised manner; it does not require any exact component location, e.g., component-wise bounding boxes, but only component labels are required. Hence, DM-Font is not restricted to the font generation only, but can be applied to any generation task with compositionality, e.g., attribute conditioned generation tasks. Extending DM-Font to attribute labeled datasets, e.g., CelebA [celeba], will be an interesting topic.

4.2 Learning

We train DM-Font on font sets in which each sample consists of a target glyph image, its character label, and its font label. During training, we assume that different font labels represent different styles, i.e., the font label is used as the style label in equation (1). Also, for efficiency, we encode into the DM only the core component subset needed to compose the target glyph instead of the full component set. For example, the Korean script has a full component set of size 68, but only 3 components are required to construct a single character.

We use an adversarial loss to let the model generate plausible images. Here, the generator produces an image from the given image and the target label by equation (1), and the discriminator is conditioned on the target label. We employ two types of discriminator: the font discriminator is conditioned on the source font index, and the character discriminator classifies the given character.

The L1 loss adds supervision from the ground-truth target.

We also use a feature matching loss to improve the stability of the training. The feature matching loss is constructed from the outputs of the intermediate layers of the discriminator.

Lastly, to let the model fully utilize the compositionality, we train the model with an additional component-classification loss. For a given input, we extract the component-wise features using the encoder and train them with a cross-entropy loss (CE) against the component labels obtained by applying the decomposition function to the character label of the input.

The final objective function for optimizing the generator, the discriminator, and the component classifier combines the losses above, with control parameters that weight the importance of each loss function. We use the same control parameters for all experiments.
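The losses described in this section can be summarized in the following hedged form. The notation is assumed for illustration and is not taken from the source: $x$ is the input glyph with character label $y_c$, $y$ is the target label, $x_y$ is the ground-truth glyph for $y$, $G$ is the generator, $D_y$ is the discriminator branch for label $y$ with $L$ feature layers $D^{(l)}$, $E_i$ is the $i$-th encoder head, $C$ is the component classifier, $f_d$ is the decomposition function, and the $\lambda$ terms are the control parameters. The log-form adversarial term is one common choice; a hinge variant would fit the description equally well.

```latex
\begin{align}
\mathcal{L}_{adv} &= \mathbb{E}_{x_y, y}\left[\log D_{y}(x_y)\right]
  + \mathbb{E}_{x, y}\left[\log\left(1 - D_{y}(G(x, y))\right)\right], \\
\mathcal{L}_{l1} &= \mathbb{E}_{x, x_y, y}\left[\lVert x_y - G(x, y) \rVert_{1}\right], \\
\mathcal{L}_{feat} &= \mathbb{E}_{x, x_y, y}\left[\frac{1}{L}\sum_{l=1}^{L}
  \lVert D^{(l)}(x_y) - D^{(l)}(G(x, y)) \rVert_{1}\right], \\
\mathcal{L}_{cls} &= \mathbb{E}_{x, y_c}\left[\sum_{u_i \in f_d(y_c)}
  \mathrm{CE}\left(C(E_i(x)),\, u_i\right)\right], \\
\min_{G, C}\,\max_{D}\;\; & \mathcal{L}_{adv}^{font} + \mathcal{L}_{adv}^{char}
  + \lambda_{l1}\mathcal{L}_{l1} + \lambda_{feat}\mathcal{L}_{feat}
  + \lambda_{cls}\mathcal{L}_{cls}.
\end{align}
```

The two adversarial terms in the last line correspond to the font discriminator and the character discriminator.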

5 Experiments

5.1 Datasets

5.1.1 Korean-handwriting dataset.

Due to its diversity and data sparsity, generating a handwritten font from only a few samples is challenging. We validate the models using Korean-handwriting fonts refined by expert designers (footnote 1: we collect public fonts from http://uhbeefont.com/). Each font library contains widely used Korean glyphs. We train the models on a subset of the fonts and characters, and validate them on the remaining split. We evaluate the models separately on seen and unseen characters to measure generalizability to unseen characters. A few characters are used as references.

5.1.2 Thai-printing dataset.

Compared with Korean letters, Thai letters have a more complex structure: Thai characters are composed of four sub-glyphs, while Korean characters have three components. We evaluate the models on Thai-printing fonts (footnote 2: https://github.com/jeffmcneill/thai-font-collection). The train-evaluation split strategy is the same as in the Korean-handwriting experiments, and a few samples are used for the few-shot generation.

5.1.3 Korean-unrefined dataset.

We also gather an unrefined Korean handwriting dataset from non-experts, each of whom wrote a set of characters. Unlike the Korean-handwriting dataset, this dataset is extremely diverse and not refined by expert designers. We use the Korean-unrefined dataset to validate the models trained on the Korean-handwriting dataset, i.e., the Korean-unrefined dataset is not visible during training, and only a few samples are visible for the evaluation. As with the Korean-handwriting dataset, a few samples are used for the generation.

5.2 Comparison methods and evaluation metrics

5.2.1 Comparison methods.

We compare our model with state-of-the-art few-shot font generation methods, including EMD [zhang2018_cvpr_emd], AGIS-Net [gao2019agisnet], and FUNIT [liu2019funit]. We exclude methods that are Chinese-specific [sun2018_ijcai_savae] or not applicable to glyph-rich scripts [srivatsan2019_emnlp_deepfactorization]. We slightly modify FUNIT, originally designed for unsupervised translation, by changing its reconstruction loss to an L1 loss with ground truths and by conditioning the discriminator on both contents and styles.

                           Pixel-level       Content-aware              Style-aware
Method                     SSIM     MS-SSIM  Acc (%)  PD       mFID     Acc (%)  PD       mFID
Evaluation on the seen character set during training
EMD [zhang2018_cvpr_emd]   0.691    0.361    80.4     0.084    138.2    5.1      0.089    134.4
FUNIT [liu2019funit]       0.686    0.369    94.5     0.030    42.9     5.1      0.087    146.7
AGIS-Net [gao2019agisnet]  0.694    0.399    98.7     0.018    23.9     8.2      0.088    141.1
DM-Font (ours)             0.704    0.457    98.1     0.018    22.1     64.1     0.038    34.6
Evaluation on the unseen character set during training
EMD [zhang2018_cvpr_emd]   0.696    0.362    76.4     0.095    155.3    5.2      0.089    139.6
FUNIT [liu2019funit]       0.690    0.372    93.3     0.034    48.4     5.6      0.087    149.5
AGIS-Net [gao2019agisnet]  0.699    0.398    98.3     0.019    25.9     7.5      0.089    146.1
DM-Font (ours)             0.707    0.455    98.5     0.018    20.8     62.6     0.039    40.5
Table 1: Quantitative evaluation on the Korean-handwriting dataset. We evaluate the methods on the seen and unseen character sets. Higher is better, except perceptual distance (PD) and mFID.
                           Pixel-level       Content-aware              Style-aware
Method                     SSIM     MS-SSIM  Acc (%)  PD       mFID     Acc (%)  PD       mFID
Evaluation on the seen character set during training
EMD [zhang2018_cvpr_emd]   0.773    0.640    86.3     0.115    215.4    3.2      0.087    172.0
FUNIT [liu2019funit]       0.712    0.449    45.8     0.566    1133.8   4.6      0.084    167.9
AGIS-Net [gao2019agisnet]  0.758    0.624    87.2     0.091    165.2    15.5     0.074    145.2
DM-Font (ours)             0.776    0.697    87.0     0.103    198.7    50.3     0.037    69.4
Evaluation on the unseen character set during training
EMD [zhang2018_cvpr_emd]   0.770    0.636    85.0     0.123    231.0    3.4      0.087    171.6
FUNIT [liu2019funit]       0.708    0.442    45.0     0.574    1149.8   4.7      0.084    166.9
AGIS-Net [gao2019agisnet]  0.755    0.618    85.4     0.103    188.4    15.8     0.074    145.1
DM-Font (ours)             0.773    0.693    87.2     0.101    195.9    50.6     0.037    69.6
Table 2: Quantitative evaluation on the Thai-printing dataset. We evaluate the methods on the seen and unseen character sets. Higher is better, except perceptual distance (PD) and mFID.

5.2.2 Evaluation metrics.

Assessing a generative model is difficult because of its intractability. Several quantitative evaluation metrics [johnson2016_eccv_perceptual, heusel2017_nips_ttur_fid, zhang2018_cvpr_lpips, ferjad2020ganeval] attempt to measure the performance of a trained generative model under different assumptions, but which evaluation method is best for generative models is still controversial. In this paper, we consider three levels of evaluation: pixel-level, perceptual-level, and human-level.

Pixel-level evaluation metrics assess the pixel structural similarity between the ground truth image and the generated image. We employ the structural similarity index (SSIM) and multi-scale structural similarity index (MS-SSIM).

However, pixel-level metrics often disagree with human perception. Thus, we also evaluate the models with perceptual-level evaluation metrics. We train four ResNet-50 [he2016_cvpr_resnet] models on the Korean-handwriting dataset and the Thai-printing dataset to classify the style and character labels. Unlike in the generation task, all fonts and characters are used for training. More detailed classifier training settings are given in Appendix B. We call a metric content-aware if it is computed using the content classifier, and style-aware is defined similarly. Note that the classifiers are independent of the font generation models and are used only for the evaluation. We report the top-1 accuracy, perceptual distance (PD) [johnson2016_eccv_perceptual, zhang2018_cvpr_lpips], and mean FID (mFID) [liu2019funit] using the classifiers. PD is computed as the distance between the features of the generated glyph and the GT glyph, and mFID is a conditional FID [heusel2017_nips_ttur_fid] obtained by averaging the FID over the target classes.
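The classifier-based metrics can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the feature distance is taken to be Euclidean (the text does not say which distance is used), and the per-class FID values are treated as already computed, since FID itself needs feature statistics that are out of scope here.

```python
import math


def top1_accuracy(predicted_labels, target_labels):
    """Percentage of generated glyphs that the evaluation classifier labels correctly."""
    correct = sum(1 for p, t in zip(predicted_labels, target_labels) if p == t)
    return 100.0 * correct / len(target_labels)


def perceptual_distance(feat_generated, feat_gt):
    """Distance between classifier features of a generated glyph and its GT glyph.

    Euclidean distance is an assumption of this sketch.
    """
    return math.sqrt(sum((g - t) ** 2 for g, t in zip(feat_generated, feat_gt)))


def mean_over_classes(score_by_class):
    """Average a per-class score, as mFID does with per-class FID values."""
    return sum(score_by_class.values()) / len(score_by_class)


acc = top1_accuracy([1, 2, 3, 4], [1, 2, 0, 4])
pd = perceptual_distance([0.0, 0.0], [3.0, 4.0])
mfid = mean_over_classes({"font_a": 10.0, "font_b": 30.0})
```

Using the content classifier for the labels and features gives the content-aware variants, and using the style classifier gives the style-aware ones.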

Finally, we conduct a user study on the Korean-unrefined dataset as a human-level evaluation. We ask users about three types of preference: content preference, style preference, and overall preference considering both content and style. The questionnaire consists of 90 questions, 30 for each preference. Each question shows 40 glyphs, consisting of 32 glyphs generated by four models and 8 GT glyphs. The order of choices is shuffled for anonymity. We collect a total of 3,420 responses from 38 native Korean speakers. More details of the user study are provided in Appendix B.

(a) Seen character set during training.
(b) Unseen character set during training.
Figure 4: Qualitative comparison on the Korean-handwriting dataset. Visualization of generated samples with seen and unseen characters. We show insets of baseline results (green box), ours (blue box) and ground truth (red box). Ours successfully transfers the detailed style of the target style, while baselines fail to generate glyphs with the detailed reference style.
(a) Seen character set during training.
(b) Unseen character set during training.
Figure 5: Qualitative comparison on the Thai-printing dataset. Visualization of generated samples with seen and unseen characters. We show insets of baseline results (green box), ours (blue box) and ground truth (red box). Overall, ours faithfully transfers the target style, while the other methods often fail even to preserve contents in unseen character sets.

5.3 Main results

5.3.1 Quantitative evaluation.

The main results on the Korean-handwriting and Thai-printing datasets are reported in Table 1 and Table 2, respectively. We also report the evaluation results on the Korean-unrefined dataset in Appendix C. We follow the dataset split introduced in Section 5.1. In the experiments, DM-Font outperforms the comparison methods in most evaluation metrics, especially on style-aware benchmarks. Baseline methods show slightly worse content-aware performance on unseen characters than on seen characters, e.g., AGIS-Net shows worse content-aware accuracy (98.7 → 98.3), PD (0.018 → 0.019), and mFID (23.9 → 25.9) in Table 1. In contrast, DM-Font consistently generalizes better to characters unobserved during training on both datasets. Because our model interprets a glyph at the component level, it easily extrapolates to unseen characters from the learned component-wise features stored in the memory modules.

Our method shows significant improvements in style-aware metrics. DM-Font achieves 62.6% and 50.6% style-aware accuracy on the Korean unseen and Thai unseen character sets, respectively, while the other methods reach at most 7.5% and 15.8% (Tables 1 and 2). Likewise, the model shows large improvements in perceptual distance and mFID as well as in accuracy. In a later section, we provide a more detailed analysis showing that the baseline methods overfit to the training styles and fail to generalize to unseen styles.

5.3.2 Qualitative comparison.

We also provide visual comparisons in Figure 4 and Figure 5, which contain various challenging fonts including thin, thick, and curvy fonts. Our method generates glyphs with consistently better visual quality than the baseline methods. EMD [zhang2018_cvpr_emd] often erases thin fonts unintentionally, which causes low content scores compared with the other baseline methods. FUNIT [liu2019funit] and AGIS-Net [gao2019agisnet] accurately generate the content of glyphs and capture global styles well, including overall thickness and font size. However, the detailed styles of the components in their results look different from the ground truths. Moreover, some generated glyphs for unseen Thai styles lose the original content (see the difference between green boxes and red boxes in Figure 4 and Figure 5 for more details). Compared with the baselines, our method generates the most plausible images in terms of both global font styles and detailed component styles. These results show that our model preserves the details of the components using the dual memory and reuses them to generate a new glyph.

5.3.3 User study.

We conduct a user study to further evaluate the methods in terms of human preference using the Korean-unrefined dataset. Example generated glyphs are illustrated in Figure 6. Users are asked to choose the most preferred generated samples in terms of content preservation, faithfulness to the reference style, and personal preference. The results are shown in Table 3 and agree with Table 1: AGIS-Net and our method are comparable in the content evaluation, and our method is dominant in the style preference.

                         EMD [zhang2018_cvpr_emd]  FUNIT [liu2019funit]  AGIS-Net [gao2019agisnet]  DM-Font (ours)
Best content preserving  1.33%                     9.17%                 48.67%                     40.83%
Best stylization         1.71%                     8.14%                 17.44%                     72.71%
Most preferred           1.23%                     9.74%                 16.40%                     72.63%
Table 3: User study results on the Korean-unrefined dataset. Each number is the percentage of responses that preferred the corresponding model's output.
Figure 6: Samples for the user study. The Korean-unrefined dataset is used.

5.4 More analysis

5.4.1 Ablation study.

                   Content  Style  Hmean
Baseline           96.6     6.5    12.2
Dynamic memory     99.8     32.0   48.5
Persistent memory  97.6     46.2   62.8
Compositional      98.3     63.3   77.0
(a) Impact of the memory modules.
                                           Content  Style  Hmean
Full                                       98.3     63.3   77.0
Full without L1 loss                       97.3     53.8   69.3
Full without feature matching loss         97.8     51.3   67.3
Full without component-classification loss 3.1      16.0   5.2
(b) Impact of the objective functions.
Table 4: Ablation studies on the Korean-handwriting dataset.

Each content and style score is an average of the seen and unseen accuracies. Hmean denotes the harmonic mean of content and style scores.
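The Hmean column can be reproduced from the content and style columns. The sketch below uses a hypothetical helper name, `harmonic_mean`, and checks it against the Baseline and Compositional rows of Table 4(a).

```python
def harmonic_mean(content_acc, style_acc):
    """Harmonic mean of two accuracies; it is dominated by the smaller one."""
    if content_acc + style_acc == 0:
        return 0.0
    return 2.0 * content_acc * style_acc / (content_acc + style_acc)


baseline = harmonic_mean(96.6, 6.5)   # "Baseline" row of Table 4(a)
full = harmonic_mean(98.3, 63.3)      # "Compositional" row of Table 4(a)
print(round(baseline, 1), round(full, 1))
```

Because the harmonic mean is pulled toward the smaller value, a model with high content accuracy but poor style accuracy still receives a low Hmean, as the Baseline row shows.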

We investigate the impact of our design choices through ablation studies. Table 4(a) shows that the overall performance improves as the proposed components (dynamic memory, persistent memory, and the compositional generator) are added. We report the full table in Appendix C.

Here, the baseline method is similar to FUNIT, whose content and style accuracies are reported in Table 1. Like previous methods, the baseline fails to generalize to unseen styles. We observe that dynamic memory and persistent memory dramatically improve the style scores while preserving the content scores. Finally, our architectural improvements bring the best performance.

We also explore the influence of each objective on performance. As shown in Table 4(b), removing the L1 loss or the feature matching loss slightly degrades performance. The component-classification loss, which enforces compositionality in the model, is the most important factor for successful training.

Figure 7: Nearest neighbor analysis. We report the generated images by each model (output) with the given unseen reference style (GT) and the ground truth samples whose label is predicted by the style classifier (NN). Red boxed samples denote training samples. We can conclude that the baseline methods are overfitted to the training style while ours easily generalizes to unseen style.

5.4.2 Style overfitting of baselines.

We analyze the generated glyphs using our style classifier to investigate the style overfitting of the baseline methods. Figure 7 shows the predicted classes for each model output. We observe that the baseline methods often generate samples similar to the training samples. In contrast, our model avoids style overfitting by learning the compositionality of glyphs and directly reusing the components of the inputs. Consequently, as supported by the previous quantitative and qualitative evaluations, our model is more robust to out-of-distribution font generation than the existing methods. We provide more analysis of the overfitting of the comparison methods in Appendix C.

5.4.3 Component-wise style mixing.

In Figure 8, we demonstrate that our model can interpolate styles component-wise. This supports the claim that our model fully utilizes the compositionality to generate a glyph.

Figure 8: Component-wise style mixing. We interpolate only one component (marked by blue boxes) between two glyphs (the first column and the last column). The interpolated sub-glyphs are marked by green boxes. Our model successfully interpolates two sub-glyphs, while preserving other local styles.
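Component-wise mixing can be pictured as interpolating the stored feature of a single component while leaving the others untouched. The sketch below assumes linear interpolation in feature space, which the text does not specify, and uses hypothetical names (`lerp`, `mix_component`) and toy feature vectors.

```python
def lerp(a, b, t):
    """Linear interpolation between two equal-length feature vectors."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def mix_component(features_a, features_b, component, t):
    """Interpolate one component between two styles; keep the rest from style A."""
    mixed = {name: list(feat) for name, feat in features_a.items()}
    mixed[component] = lerp(features_a[component], features_b[component], t)
    return mixed


# Toy component features for the same character rendered in two styles.
style_a = {"initial": [0.0, 2.0], "medial": [1.0, 1.0], "final": [5.0, 5.0]}
style_b = {"initial": [1.0, 4.0], "medial": [9.0, 9.0], "final": [7.0, 7.0]}

halfway = mix_component(style_a, style_b, "initial", 0.5)
```

Sweeping `t` from 0 to 1 yields a sequence of feature sets that moves one component from the first style to the second while the remaining components keep the first style.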

6 Conclusions

Previous few-shot font generation methods often fail to generalize to unseen styles. In this paper, we propose a novel few-shot font generation framework for compositional scripts, named Dual Memory-augmented Font Generation Network (DM-Font). Our method effectively incorporates prior knowledge of the compositional script into the framework via two external memories: the dynamic memory and the persistent memory. DM-Font utilizes the compositionality supervision in a weakly-supervised manner, i.e., neither component-wise bounding boxes nor masks are used during training. The experimental results show that the existing methods fail at stylization on unseen fonts, while DM-Font consistently outperforms the existing few-shot font generation methods on Korean and Thai letters. Extensive empirical evidence supports that our framework lets the model fully utilize the compositionality, so that the model can produce high-quality samples from only a few references.


We thank Clova OCR and Clova AI Research team for discussion and advice, especially Song Park for the internal review.


Appendix 0.A Network Architecture Details

0.a.1 Memory addressors

Input: A character label $y_c$
Output: Component labels $u^c_1$, $u^c_2$, and $u^c_3$
Data: The number of components $N_i$ for each $i$-th component type.
unicode = ToUnicode($y_c$)
// 0xAC00 is the initial Korean Unicode
code = unicode - 0xAC00
$u^c_3$ = code mod $N_3$
$u^c_2$ = (code div $N_3$) mod $N_2$
$u^c_1$ = code div ($N_3 \times N_2$)
Algorithm 1 Unicode-based Korean letter decomposition function

The memory addressor converts a character label $y_c$ to the set of component labels $u^c_i$ by the pre-defined decomposition function $f_d: y_c \mapsto \{u^c_i \mid i = 1, \dots, M_c\}$, where $u^c_i$ is the label of the $i$-th component of $y_c$ and $M_c$ is the number of sub-glyphs for $y_c$. In this paper, we employ Unicode-based decomposition functions specific to each language. We describe the decomposition function for the Korean script as an example in Algorithm 1. The function disassembles a character into component labels by a pre-defined rule. In contrast, each Thai character consists of several Unicode code points, each of which corresponds to one component. Therefore, each code point constituting the letter is a label itself, and the Thai decomposition function only needs to determine the component type of each code point.
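The arithmetic of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `decompose_korean` and the constant names are hypothetical, and the component counts (19 initial consonants, 21 medial vowels, 28 final consonants including "no final") are those of the precomposed Hangul syllable block in Unicode.

```python
from typing import Tuple

HANGUL_BASE = 0xAC00  # code point of the first precomposed Hangul syllable
# Component counts per type; 19 * 21 * 28 = 11,172 syllables in total.
N_INITIAL, N_MEDIAL, N_FINAL = 19, 21, 28


def decompose_korean(char: str) -> Tuple[int, int, int]:
    """Return the (initial, medial, final) component labels of one syllable.

    A final label of 0 means the syllable has no final consonant.
    """
    code = ord(char) - HANGUL_BASE
    if not 0 <= code < N_INITIAL * N_MEDIAL * N_FINAL:
        raise ValueError(f"{char!r} is not a precomposed Hangul syllable")
    final = code % N_FINAL
    medial = (code // N_FINAL) % N_MEDIAL
    initial = code // (N_FINAL * N_MEDIAL)
    return initial, medial, final
```

For example, `decompose_korean("한")` yields `(18, 0, 4)`, and the first syllable `"가"` (U+AC00) yields `(0, 0, 0)`.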

0.a.2 Network architecture

Figure 0.A.1: The encoder holds multiple heads according to the number of component types. We denote the spatial size of each block in the figure.

The proposed architecture has two important properties: global-context awareness and local-style preservation. Global-context awareness gives the network the ability to reason about relations between components, which helps it disassemble source glyphs into sub-glyphs and assemble them into the target glyph. Local-style preservation means that the local style of the source glyph is reflected in the target.

For global-context awareness, the encoder adopts the global-context block (GCBlock) [cao2019gcnet] and the self-attention block (SABlock) [zhang2019sagan, vaswani2017_NIPS_Transformer_Attention], and the decoder employs the hourglass block (HGBlock) [newell2016hourglass, Lin_2017_CVPR_FPN]. These blocks extend the receptive field globally and facilitate relational reasoning between components while preserving locality. For local-style preservation, the network handles multi-level features based on the dual memory framework. The architecture is illustrated in Figure 0.A.1.

The generator consists of five modules: convolution block (ConvBlock), residual block (ResBlock), self-attention block, global-context block, and hourglass block. Our SABlock is adopted from the Transformer [vaswani2017_NIPS_Transformer_Attention] instead of SAGAN [zhang2019sagan], i.e., the block consists of multi-head self-attention and a position-wise feed-forward layer. We also use the two-dimensional relative positional encoding from [bello2019_ICCV_AttentionAugmentedConv]. The hourglass block consists of multiple convolution blocks, each followed by a downsampling or upsampling operation. Through the hourglass structure, the spatial size of the feature map is reduced and then restored to the original size, which extends the receptive field globally while preserving locality. The channel size starts at 32 and doubles as blocks are added, up to 256 for the encoder and 512 for the decoder.

We employ a simple structure for the discriminator. Several residual blocks follow the first convolution block. Like the generator, the channel size starts at 32 and doubles as blocks are added, up to 1024. The output feature map of the last residual block is spatially squeezed and fed to two linear discriminators, one for fonts and one for characters. Each is a multi-task discriminator that performs binary classification for each target class. Therefore, the font discriminator produces one binary output per target font, and the character discriminator produces one binary output per target character.
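The channel-width rule shared by the generator and the discriminator (start at 32, double per block, stop at a cap) can be written as a short helper. This is only a sketch: the function name `channel_schedule` is hypothetical, and the block counts in the usage note are assumptions chosen for illustration, since the text does not state how many blocks each network has.

```python
from typing import List


def channel_schedule(num_blocks: int, start: int = 32, cap: int = 256) -> List[int]:
    """Channel width of each block: begin at `start`, double per block, clip at `cap`."""
    return [min(start * 2 ** i, cap) for i in range(num_blocks)]
```

With an assumed five blocks and the encoder cap of 256, `channel_schedule(5)` gives `[32, 64, 128, 256, 256]`; with an assumed six blocks and the discriminator cap, `channel_schedule(6, cap=1024)` gives `[32, 64, 128, 256, 512, 1024]`.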

Since the persistent memory (PM) is independent of local styles, we set the size of the PM to be the same as that of the high-level features, i.e., the final output of the encoder. The learned embedding is refined via three convolution blocks, added to the high-level features from the dynamic memory (DM), and then fed to the decoder. The component classifier comprises two residual blocks and one linear layer, and identifies the class of the high-level component features from the DM.

Appendix 0.B Experimental Setting Details

0.b.1 DM-Font implementation details

We use Adam [kingma2015adam] with a learning rate of 0.0002 for the generator and 0.0008 for the discriminator, following the two time-scale update rule [heusel2017_nips_ttur_fid]. The component classifier uses the same learning rate as the generator. The discriminator adopts spectral normalization [miyato2018spectral] for regularization. We train the model with the hinge GAN loss [zhang2019sagan, miyato2018spectral, brock2018biggan, liu2019funit, lim2017geometric] for 200K iterations. We employ an exponential moving average of the generator [karras2018pggan, yazici2018emagan, liu2019funit, karras2019stylegan]. For the Thai-printing dataset, we use a learning rate of 0.00005 for the generator and 0.0001 for the discriminator with 250K training iterations, while the other settings are the same as in the Korean experiments.
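For readers unfamiliar with the two training ingredients named above, the following framework-free sketch shows the standard hinge GAN objective and one exponential-moving-average step. It is not taken from the released code: the function names are hypothetical, the inputs are plain lists of discriminator logits or parameters, and the decay value 0.999 is an assumption, as the text does not state it.

```python
from typing import List, Sequence


def hinge_d_loss(real_logits: Sequence[float], fake_logits: Sequence[float]) -> float:
    """Discriminator loss: mean(max(0, 1 - D(real))) + mean(max(0, 1 + D(fake)))."""
    real_term = sum(max(0.0, 1.0 - r) for r in real_logits) / len(real_logits)
    fake_term = sum(max(0.0, 1.0 + f) for f in fake_logits) / len(fake_logits)
    return real_term + fake_term


def hinge_g_loss(fake_logits: Sequence[float]) -> float:
    """Generator loss: -mean(D(fake))."""
    return -sum(fake_logits) / len(fake_logits)


def ema_update(ema_params: Sequence[float], params: Sequence[float],
               decay: float = 0.999) -> List[float]:
    """One EMA step: ema <- decay * ema + (1 - decay) * current parameters."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```

The discriminator is penalized only when real logits fall below +1 or fake logits rise above -1, while the averaged generator weights change slowly and are the ones typically used at inference time.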

0.b.2 Evaluation classifier implementation details

Two different ResNet-50 [he2016_cvpr_resnet] models are separately trained as the content and the style classifiers for each of the Korean and Thai scripts. The classifiers are optimized using the Adam optimizer [kingma2015adam] for 20 epochs. We expect that more recent Adam variants, e.g., RAdam [radam] or AdamP [heo2020adamp], would further improve the classifier performance. The content classifier is supervised to predict the correct character, while the style classifier is trained to predict the font label. We randomly use 85% of the data points as the training data and the remaining data points for validation. Unlike the DM-Font training, this strategy shows all characters and fonts to the classifiers. In our experiments, every classifier achieves at least 96% validation accuracy: 97.9% Korean content accuracy, 96.0% Korean style accuracy, 99.6% Thai content accuracy, and 99.99% Thai style accuracy. Note that all classifiers are used only for the evaluation and not for the DM-Font training.

0.b.3 User study details

We select 30 different styles (fonts) from the Korean-unrefined dataset for the user study. We randomly choose 8 characters for each style and generate the characters with four different methods: EMD [zhang2018_cvpr_emd], AGIS-Net [gao2019agisnet], FUNIT [liu2019funit], and DM-Font (ours). We also provide the ground-truth glyphs of the selected characters to the users for comparison. Users choose the best method under 3 different criteria (content / style / preference). For each question, we randomly shuffle the methods to keep them anonymous. In total, we collected 3,420 responses from 38 native Korean speakers, i.e., 90 responses per participant (30 styles × 3 criteria).

Appendix 0.C Additional Results

0.c.1 Reference set sensitivity

         Pixel-level        Content-aware            Style-aware
         SSIM   MS-SSIM     Acc     PD     mFID      Acc     PD     mFID
Run 1    0.704  0.457       98.1%   0.018  22.1      64.1%   0.038  34.6
Run 2    0.702  0.452       98.8%   0.016  19.9      64.2%   0.038  37.2
Run 3    0.701  0.456       98.0%   0.018  23.4      66.0%   0.037  35.2
Run 4    0.702  0.451       97.8%   0.019  22.9      65.0%   0.038  36.7
Run 5    0.701  0.453       98.2%   0.018  22.9      64.8%   0.038  36.4
Run 6    0.703  0.460       97.2%   0.020  24.8      67.8%   0.036  34.0
Run 7    0.700  0.447       98.3%   0.018  21.9      64.8%   0.037  36.6
Run 8    0.701  0.451       98.2%   0.018  22.2      65.8%   0.037  35.4
Avg.     0.702  0.453       98.1%   0.018  22.5      65.3%   0.037  35.8
Std.     0.001  0.004       0.4%    0.001  1.4       1.2%    0.001  1.1
Table 0.C.1: Reference sample sensitivity. Eight different runs of DM-Font with different reference samples in the Korean-handwriting dataset.

In all experiments, we select the few-shot samples randomly while satisfying the compositionality. Here, we examine the sensitivity of the proposed method to the reference sample selection. Table 0.C.1 shows the Korean-handwriting generation results of eight different runs with different sample selections. The standard deviations are small on every metric, which supports that DM-Font is robust to the reference sample selection.
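One way to draw a random reference set that "satisfies the compositionality", i.e., in which every component label appears at least once, is sketched below for Korean. This is an illustrative assumption and not the authors' sampling procedure: the function names are hypothetical, and the greedy rule (keep a randomly ordered syllable only if it contributes an unseen component) is just one simple option.

```python
import random
from typing import List, Set, Tuple

HANGUL_BASE = 0xAC00
N_INITIAL, N_MEDIAL, N_FINAL = 19, 21, 28  # 68 component labels in total


def components(code: int) -> Set[Tuple[str, int]]:
    """Typed component labels of the syllable at offset `code` from U+AC00."""
    return {
        ("initial", code // (N_FINAL * N_MEDIAL)),
        ("medial", (code // N_FINAL) % N_MEDIAL),
        ("final", code % N_FINAL),
    }


def sample_reference_set(seed: int = 0) -> List[str]:
    """Visit syllables in random order, keeping those that add a new component."""
    rng = random.Random(seed)
    codes = list(range(N_INITIAL * N_MEDIAL * N_FINAL))
    rng.shuffle(codes)
    covered: Set[Tuple[str, int]] = set()
    chosen: List[str] = []
    for code in codes:
        new = components(code) - covered
        if new:  # this syllable covers at least one unseen component
            chosen.append(chr(HANGUL_BASE + code))
            covered |= new
    return chosen
```

Changing `seed` gives a different reference set that still covers all 68 component labels, which is the kind of variation the eight runs in Table 0.C.1 measure.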

                             Pixel-level        Content-aware     Style-aware
                             SSIM   MS-SSIM     PD     mFID       PD     mFID
EMD [zhang2018_cvpr_emd]     0.716  0.340       0.106  99.2       0.079  93.3
FUNIT [liu2019funit]         0.711  0.311       0.080  87.0       0.066  79.4
AGIS-Net [gao2019agisnet]    0.708  0.334       0.052  67.2       0.089  134.5
DM-Font (ours)               0.726  0.387       0.048  46.2       0.046  31.5
Table 0.C.2: Quantitative evaluation on the Korean-unrefined dataset. Higher is better, except perceptual distance (PD) and mFID.

0.c.2 Results on the Korean-unrefined dataset

Table 0.C.2 shows the quantitative evaluation results on the Korean-unrefined dataset used for the user study. We use the classifiers trained on the Korean-handwriting dataset for the evaluation. Because these classifiers cannot measure accuracies on the unrefined fonts, we report only the perceptual distance and mFID for the content-aware and style-aware metrics. On all evaluation metrics, DM-Font consistently outperforms the comparison methods, as on the other datasets. Example visual samples are shown in Figure 0.C.1.

0.c.3 Ablation study

Table 0.C.3 shows the full ablation study results, including all evaluation metrics. Consistent with the observations in the main manuscript, all metrics behave similarly to the averaged accuracies; our proposed components and objective functions significantly improve the generation quality.

            Pixel-level        Content-aware           Style-aware
            SSIM   MS-SSIM     Acc(%)  PD     mFID     Acc(%)  PD     mFID
Evaluation on the seen character set during training
Baseline    0.689  0.373       96.7    0.026  33.6     6.5     0.084  132.7
+ DM        0.702  0.424       99.7    0.015  19.5     31.8    0.060  77.6
+ PM        0.704  0.435       97.7    0.020  26.9     46.6    0.049  57.1
+ Comp.     0.704  0.457       98.1    0.018  22.1     64.1    0.038  34.6
Evaluation on the unseen character set during training
Baseline    0.693  0.375       96.6    0.027  34.3     6.5     0.084  134.8
+ DM        0.705  0.423       99.8    0.015  19.5     32.3    0.060  81.0
+ PM        0.707  0.432       97.6    0.022  28.9     45.9    0.050  61.4
+ Comp.     0.707  0.455       98.5    0.018  20.8     62.6    0.039  40.5
(a) Impact of components. DM, PM, and Comp. denote dynamic memory, persistent memory, and compositional generator, respectively.

            Pixel-level        Content-aware           Style-aware
            SSIM   MS-SSIM     Acc(%)  PD     mFID     Acc(%)  PD     mFID
Evaluation on the seen character set during training
Full        0.704  0.457       98.1    0.018  22.1     64.1    0.038  34.6
Full        0.695  0.407       97.0    0.022  27.9     53.4    0.046  48.3
Full        0.699  0.427       97.8    0.020  23.8     51.4    0.047  51.4
Full        0.634  0.223       3.0     0.488  965.3    16.2    0.082  118.9
Evaluation on the unseen character set during training
Full        0.707  0.455       98.5    0.018  20.8     62.6    0.039  40.5
Full        0.697  0.401       97.5    0.023  26.8     54.3    0.046  52.3
Full        0.701  0.423       97.8    0.020  24.1     51.2    0.048  56.0
Full        0.636  0.220       3.2     0.486  960.7    15.9    0.082  123.7
(b) Impact of objective functions.
Table 0.C.3: Ablation studies on the Korean-handwriting dataset. Higher is better, except perceptual distance (PD) and mFID.
Figure 0.C.1: Samples for the user study. The Korean-unrefined dataset is used.

0.c.4 Failure cases

(a) content failure (b) style failure (c) content and style
Figure 0.C.2: Failure cases. Examples of generated samples by DM-Font with incorrect content or insufficient stylization.

We illustrate the three failure types of our method in Figure 0.C.2. First, DM-Font can fail to generate glyphs with the correct content due to the high complexity of the glyph. For example, some samples lose their content (see the first to third columns of Figure 0.C.2 (a)). In practice, developing a content failure detector and a user-guided font correction system could be a solution. Another widely observed failure case is caused by the multi-modality of components, i.e., a component can have multiple styles. Since our scenario assumes that a model observes only one sample for each component, the problem is often ill-posed. A similar ill-posedness also occurs in the colorization problem and is usually addressed by a human-guided algorithm [zhang2017real]. Similarly, a user-guided font correction algorithm would be an interesting direction for future research.

Finally, we report the cases caused by errors in the ground-truth samples. Note that the samples in Figure 0.C.2 are generated using the Korean-unrefined dataset, which can include inherent errors. When the reference glyphs are damaged, as in the two rightmost samples in the figure, it is difficult to disentangle style and content from the reference set. Due to the strong compositionality regularization imposed by the proposed dual memory architecture, our model tries to use the memorized local styles while ignoring the damaged reference style.

0.c.5 Examples of various component shape in generated glyphs

Figure 0.C.3: Varying shape of single component. Six visual examples show the variety of a component (with red boxes) across different characters. The results show that DM-Font generates samples with various component shapes by the compositionality.

We provide more examples of samples generated by DM-Font with the same component in Figure 0.C.3. The figure shows that the shape of each component varies with the sub-glyph composition, as described in Figure 2 of the main manuscript. Note that every component is observed only a few times (usually once) in the reference set. These observations support that our model does not simply copy the reference components, but properly extracts local styles and combines them with the global composition information and the intrinsic shape stored in the persistent memory. To sum up, we conclude that DM-Font disentangles local style and global composition information well, and generates a high-quality font library with only a few references.

0.c.6 Generalization gap between seen and unseen fonts

            Seen fonts              Unseen fonts             Gap
            Acc(%)  PD     mFID     Acc(%)  PD     mFID      Acc(%)  PD     mFID
Evaluation on the Korean-handwriting dataset
EMD         74.0    0.032  31.9     5.2     0.089  139.6     68.9    0.057  107.8
FUNIT       98.6    0.015  8.3      5.6     0.087  149.5     93.0    0.072  141.2
AGIS-Net    95.8    0.018  13.4     7.5     0.089  146.1     88.3    0.071  132.7
DM-Font     82.1    0.026  16.9     62.6    0.039  40.5      19.5    0.013  23.6
Evaluation on the Thai-printing dataset
EMD         99.5    0.001  1.0      3.4     0.087  171.6     96.1    0.086  170.6
FUNIT       97.0    0.004  5.0      4.7     0.084  166.9     92.4    0.080  161.9
AGIS-Net    84.6    0.016  28.6     15.8    0.074  145.1     68.8    0.058  116.5
DM-Font     90.2    0.009  13.5     50.6    0.037  69.6      39.6    0.029  56.1
Table 0.C.4: Style generalization gap on the Korean-handwriting and Thai-printing datasets. We compute the differences of style-aware scores between seen and unseen font sets. The evaluation uses the unseen character set. A smaller gap indicates better generalization.

We provide additional benchmarking results on the seen fonts in Table 0.C.4. Note that Tables 1 and 2 in the main manuscript are measured on the unseen fonts only. Put simply, "seen fonts" can be interpreted as the training performance and "unseen fonts" as the validation performance. The comparison methods, EMD [zhang2018_cvpr_emd], FUNIT [liu2019funit], and AGIS-Net [gao2019agisnet], show remarkably good results on the training data (seen fonts) but fail to carry this performance over to the validation set (unseen fonts). We also report the generalization gap between the seen and unseen fonts in Table 0.C.4. The results show that the comparison methods suffer from the style memorization issue, which we discussed in the nearest neighbor analysis, and do not generalize to unseen font styles. In contrast, our method shows a significantly smaller generalization gap than the others.
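The "Gap" columns of Table 0.C.4 are absolute differences between the seen-font and unseen-font scores. A minimal sketch of that computation follows; the function name and dictionary keys are hypothetical, and the inputs are the rounded DM-Font values for the Korean-handwriting dataset copied from the table.

```python
from typing import Dict


def generalization_gap(seen: Dict[str, float], unseen: Dict[str, float]) -> Dict[str, float]:
    """Absolute per-metric difference between seen-font and unseen-font scores."""
    return {name: round(abs(seen[name] - unseen[name]), 3) for name in seen}


# Style-aware scores of DM-Font on the Korean-handwriting dataset (Table 0.C.4).
dm_font_seen = {"acc": 82.1, "pd": 0.026, "mfid": 16.9}
dm_font_unseen = {"acc": 62.6, "pd": 0.039, "mfid": 40.5}
dm_font_gap = generalization_gap(dm_font_seen, dm_font_unseen)
```

For this row the result matches the reported gap of 19.5 / 0.013 / 23.6. Some other rows differ from the table in the last digit when recomputed this way (e.g., 74.0 - 5.2 = 68.8 versus the reported 68.9 for EMD), presumably because the reported gaps were computed before rounding.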