Desktop publishing is widely used in the printing and publishing industry, and numerous digital fonts are available for various purposes. However, typographic fonts that contain Japanese and Chinese characters are far fewer than those that contain Latin characters. Consequently, the range of publication and advertisement designs in such languages is narrow. One reason for the limited font diversity in East Asian languages is that designing typographic fonts is a highly labor-intensive and time-consuming task. It requires the craftsmanship of a professional designer because of the complexity of the glyph shapes and the large number of characters used. Several thousand characters are required for daily use, and tens of thousands of characters are required to cover the entire language, in contrast to the 26 letters of the Latin alphabet.
Therefore, automatic generation of glyph images from only a few references is required, especially for languages having a large number of characters. Such automatic generation would reduce the workload of font designers and enable them to create more diverse and unique fonts. It may also help non-professional users in building their original font library.
This study addresses the problem of generating Japanese typographic fonts, which have a large number of characters, from only a few style reference glyphs as input (Fig. 1). The generated glyphs are expected to have coherent styles with reference glyphs in terms of the shape of skeletons, contour of serifs, and thickness of lines.
Various attempts have been made thus far to design fonts easily. Previous studies on font generation focused on the modeling of outlines [1, 2, 3]. These earlier methods often required human intervention, and the obtained results were insufficient or not coherent among the characters constituting a single font set.
With the rise of deep learning, several end-to-end font generation methods have been proposed over the past few years [4, 5, 6, 7, 8]. However, these methods require a large number of style reference glyphs (on the order of hundreds), which may be arduous to obtain in some cases.
Recently, some studies have been conducted to generate fonts from a few reference glyphs without requiring a large number of references. These studies can be classified into two types based on their trends: texture style transfer [9, 10, 11] and glyph shape transfer [12, 13, 14]. The latter is generally a more challenging task than the former as glyph shape features (such as outlines and skeletons) are not easy to separate into content and style. Although these methods successfully generate unseen style fonts with only a few reference glyphs, their results have room for improvement, especially when the number of style reference glyphs is extremely limited.
In this study, we focus on the common structure used in recent approaches [10, 11, 12, 13, 14]: content encoder + style encoder + decoder(s). We propose a simple but effective framework to improve the performance of the existing font generation networks: applying metric learning to their style feature encoders. This method forces the outputs of the style encoder to be closer to each other in the feature space when the input glyphs have the same style and farther apart when they have different styles (Fig. 2). Thus, this method allows the style feature encoders to extract only the style features while reducing the impact of the differences in the content features.
We select AGIS-Net [10] and EMD [12] as our baselines and backbone networks. AGIS-Net is currently one of the best-performing models in the field of few-shot font generation. However, it has the drawback of poor quality in generating black-and-white and shape-distinctive fonts. EMD is an outstanding model for generating binary and shape-distinctive fonts. However, it often fails to extract novel style features and produces collapsed glyphs. We improve them by utilizing deep metric learning (DML) and discuss the effectiveness of our proposed framework. In summary, our contributions are as follows:
- We introduce a simple DML method to existing font generation systems.
- We show that metric learning contributes to extracting better style feature embeddings and producing better results, especially when the number of style reference glyphs is considerably limited.
II Related Works
II-A Font Generation
Previous studies on automatic font generation focused on the explicit modeling of glyph outlines. Suveeranont and Igarashi [1] automatically extracted a skeleton from the outline of a reference glyph and generated a consistent font set utilizing a weighted blend of template fonts. Campbell and Kautz [2] built a manifold of standard Latin fonts, on which new fonts can be generated by interpolating existing fonts. Zhou et al. [3] designed glyph modeling for Chinese characters by utilizing radicals. Lian et al. [15] proposed a model to generate Chinese handwriting fonts with neural networks to learn styles.
With the rise of deep learning, end-to-end font generation methods have been proposed recently. Upchurch et al. [4] proposed a method to synthesize Latin fonts in a supervised way with variational autoencoders (VAEs) [16]. The zi2zi [5] model utilizes generative adversarial networks (GANs) [17] to generate Chinese typographic fonts. Lyu et al. [6] also used GANs to generate Chinese calligraphic images. Jiang et al. proposed a system to synthesize Chinese handwriting fonts from 775 samples [7] and later introduced a more powerful solution [8]. However, these methods require hundreds or more style reference glyphs, which may be arduous to obtain in some cases, such as for handwriting fonts.
Several studies have recently been conducted to synthesize a font library from a few reference glyphs. Azadi et al. proposed MC-GAN [9], which generates all 26 letters of the Latin alphabet with color and texture from five samples. However, it is structurally difficult to create fonts for languages with a large number of characters through this method. SA-VAE proposed by Sun et al. [13] is a Chinese-specific model leveraging domain knowledge, such as structures and radicals. Zhang et al. proposed EMD [12], which synthesizes Chinese black-and-white and shape-distinctive fonts. Moreover, this method is applicable to other language fonts or domains. AGIS-Net proposed by Gao et al. [10] generates colored and textured Latin and Chinese fonts. This method was developed based on EMD with adversarial training schemes. The authors also attempted to generate black-and-white fonts with promising results, but there is room for improvement. DM-Font proposed by Cha et al. [14] utilizes dual memory banks for languages whose characters can be decomposed into a fixed number of sub-glyphs, such as Korean and Thai. However, this system requires additional knowledge about the composition of each glyph, which may be expensive to obtain. Furthermore, its applicability to scripts that are not fully compositional, such as Chinese, has not yet been verified.
II-B Deep Metric Learning
DML is a fundamental technique widely used in various areas, such as facial recognition, anomaly detection, and image retrieval. It involves forming a feature space in which the distance between embeddings corresponds to the similarity of inputs.
Previous approaches extracted pairs (contrastive loss by Hadsell et al. [18]) or triplets (triplet loss by Hoffer et al. [19]) from a mini-batch to move the embeddings closer to or further away from each other, focusing on their relationships. These methods usually depend on the ability to pick a set of informative samples from a mini-batch. As mini-batches do not necessarily reflect the distribution of the entire training data, the optimization targets are not constant, and training tends to be unstable.
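The pair-based and triplet-based objectives mentioned above can be illustrated with a minimal sketch. It uses the commonly seen margin-based forms; the exact formulations in the cited papers may differ, and the function names and the default `margin` value are assumptions made for illustration only.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(a, b, same_class, margin=1.0):
    """Pair loss: pull same-class pairs together, push other pairs beyond `margin`."""
    d = euclidean(a, b)
    if same_class:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet loss: the positive should be closer to the anchor than the negative by `margin`."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)
```

Both losses are zero for pairs or triplets that already satisfy the margin, which is why the choice of informative samples within a mini-batch matters for these methods.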
Ranjan et al. proposed the $L_2$-constrained softmax loss [20]. They fed an $L_2$-normalized feature vector into an additional fully connected layer, whose output dimension was the number of classes. Subsequently, they applied a softmax function and computed the cross-entropy loss. Qian et al. [21] theoretically proved that minimizing a special case of the normalized softmax loss is equivalent to optimizing a smoothed triplet loss.
In this section, we introduce DML to the common component of state-of-the-art font generation frameworks: a style feature encoder. With DML, the style encoders are expected to output embeddings that are close to each other for input glyphs having the same style (small intra-class variance), and embeddings that are far from each other for inputs with different styles (large inter-class variance). Hence, the style feature encoders should be able to extract only style features while reducing the impact of the variances in the content features. Our framework is a simple but powerful approach to improve the performance of the existing schemes.
We describe the details of the loss function and the backbone frameworks below.
III-A Deep Metric Learning Loss
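We selected the $L_2$-constrained softmax loss proposed by Ranjan et al. [20] as the DML method because of its robustness and learning stability. Let $x_i$ be the $L_2$-normalized style feature embedding for the $i$-th sample, and let $y_i$ be the font class label corresponding to $x_i$. $W \in \mathbb{R}^{d \times C}$ denotes the weights of the additional fully connected layer, where $d$ is the dimension of the feature embedding and $C$ is the number of font classes in training, and $W_j$ is the $j$-th column of $W$. $b \in \mathbb{R}^{C}$ represents the biases of the fully connected layer. $T$ is the softmax temperature that amplifies the difference among classes. Consequently, for a mini-batch of $N$ samples, we can express the loss function as follows:

$$\mathcal{L}_{\mathrm{DML}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\left( \left( W_{y_i}^{\top} x_i + b_{y_i} \right) / T \right)}{\sum_{j=1}^{C} \exp\left( \left( W_j^{\top} x_i + b_j \right) / T \right)}$$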
We computed the above loss function for the style feature embeddings from the models described below.
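As a minimal sketch of the loss described in this subsection, the following plain-Python function normalizes each style feature, applies a fully connected layer, divides the logits by the temperature, and averages the softmax cross-entropy over the batch. Where exactly the temperature enters (here, dividing the whole logit) is an assumption, as are all function and argument names; it is not taken from the authors' implementation.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]


def l2_softmax_loss(features, labels, weights, biases, temperature):
    """Mean cross-entropy of temperature-scaled logits on L2-normalized features.

    features: list of raw d-dimensional style features
    labels:   list of integer font-class indices
    weights:  list of C weight vectors (each d-dimensional)
    biases:   list of C floats
    """
    total = 0.0
    for feat, label in zip(features, labels):
        x = l2_normalize(feat)
        logits = [
            (sum(w_k * x_k for w_k, x_k in zip(w, x)) + b) / temperature
            for w, b in zip(weights, biases)
        ]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_sum - logits[label]
    return total / len(features)
```

With a small temperature, a feature that is well aligned with its class weight vector yields a loss close to zero, whereas identical logits for all $C$ classes yield $\log C$.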
III-B Backbone Models
AGIS-Net [10] is a generic model that synthesizes colored and textured fonts. This model follows typical GAN schemes; it has a generator and discriminators. The generator (Fig. (a)) has two CNN-based encoders: a “content encoder,” which encodes what character to synthesize from a single content reference glyph, and a “style encoder,” which extracts style features from a subset of the style reference glyphs. The font of the content reference glyphs is constant for all the training and inference processes. For each iteration, glyphs randomly selected from the entire style reference set are fed into the style encoder. The decoders originally have two branches: a “shape decoder” and a “texture decoder.” As we generated black-and-white fonts in our experiments, calculation of the loss function for the intermediate black-and-white images in the shape decoder is not required. Therefore, we eliminated the shape decoder and employed only the texture decoder.
For the discriminators, the original model has a shape discriminator, a texture discriminator, and a local discriminator. The shape and texture discriminators attempt to distinguish whether the outputs of the shape and texture decoders, respectively, are real. As the shape decoder is eliminated, we cannot use the shape discriminator in our training. However, the generated glyph images are not degraded because the texture discriminator functions as a discriminator for both shapes and textures. Regarding the local discriminator, Gao et al. proposed to cut patches randomly from glyph images and feed them to the local discriminator. Furthermore, to obtain better texture details, they manually blurred some real samples with a Gaussian filter and regarded them as fake samples during training.
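The patch sampling and blurring used for the local discriminator can be sketched as follows. This is an illustrative plain-Python version on 2-D lists of pixel values; the patch size, the Gaussian `sigma` and `radius`, the edge handling, and all function names are assumptions rather than the settings of the original implementation.

```python
import math
import random


def random_patch(image, size, rng):
    """Crop a random size x size patch from a 2-D list of pixel values."""
    height, width = len(image), len(image[0])
    top = rng.randrange(height - size + 1)
    left = rng.randrange(width - size + 1)
    return [row[left:left + size] for row in image[top:top + size]]


def gaussian_kernel(sigma, radius):
    """Normalized 1-D Gaussian kernel with 2 * radius + 1 taps."""
    taps = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in range(-radius, radius + 1)]
    total = sum(taps)
    return [t / total for t in taps]


def blur_rows(image, kernel):
    """Convolve every row with the kernel, replicating edge pixels."""
    radius = len(kernel) // 2
    blurred = []
    for row in image:
        n = len(row)
        blurred.append([
            sum(kernel[k + radius] * row[min(max(x + k, 0), n - 1)]
                for k in range(-radius, radius + 1))
            for x in range(n)
        ])
    return blurred


def gaussian_blur(image, sigma=1.0, radius=2):
    """Separable Gaussian blur: filter rows, transpose, filter again, transpose back."""
    kernel = gaussian_kernel(sigma, radius)
    horizontal = blur_rows(image, kernel)
    transposed = [list(col) for col in zip(*horizontal)]
    vertical = blur_rows(transposed, kernel)
    return [list(col) for col in zip(*vertical)]


def local_discriminator_samples(real_image, patch_size, rng):
    """Return (patch, label) pairs: a sharp real patch (1) and its blurred copy (0)."""
    patch = random_patch(real_image, patch_size, rng)
    return [(patch, 1), (gaussian_blur(patch), 0)]
```

Treating the blurred copy of a real patch as a fake sample penalizes the generator for producing soft edges, which is the stated purpose of this step.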
In AGIS-Net, four types of loss functions were used, in addition to our metric learning loss: $L_1$ loss, adversarial loss, contextual loss, and local texture refinement loss. The $L_1$ loss is the pixel-wise distance between the generated images and the ground-truth images, which is defined as
The adversarial loss is defined as
where denotes the texture discriminator. The loss of the texture discriminator is defined as
where is a real glyph image. The contextual loss  was used to measure the similarity between two images, without spatial alignment. Assume that is the similarity between two feature maps and , is the -th layer feature of VGG19 , and is the number of used layers. The contextual loss is defined as
The local texture refinement loss is defined as
where and are the patches from the generated images and style reference images, respectively, and indicates the blurred patches of .
We used these loss functions as described in the original paper, except for the shape-decoder-related losses. Thus, the loss function used to optimize the generator is expressed as
where denotes the weights used for balancing the losses. The final objective used to optimize the generator and the discriminators is
The training process has two phases: pre-training and fine-tuning. In the pre-training phase, we performed supervised and adversarial learning using several fonts. In the subsequent fine-tuning phase, we performed few-shot training for a single unseen font that we intended to generate. Note that only style reference glyph images are available in this stage. If the glyph to be generated is not in the few-shot reference set, we cannot view its ground-truth image. In that case, the weights of the pair-wise losses, namely the $L_1$ loss and the contextual loss, are set to zero because these losses require ground-truth images. Furthermore, we did not use the metric learning loss when fine-tuning the model.
EMD [12] is also a generic model to synthesize glyphs, but it does not employ the GAN architecture. The generator architecture of EMD (Fig. (b)) is similar to that of AGIS-Net, but it utilizes a bilinear model as the feature mixer. The input format is also different: EMD concatenates all the images of the content/style reference set and feeds them to each encoder. The training process is one-stage supervised learning, in which we used the entire training dataset. Then, in the inference stage, we generated glyphs from content reference glyphs within the training dataset and style reference glyphs having novel styles.
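A bilinear feature mixer of the kind mentioned above combines a style vector and a content vector through a learned weight tensor, so that each output component is linear in the style features when the content is fixed, and vice versa. The sketch below shows the generic operation only; the tensor layout and the function name are hypothetical, and the dimensions used by EMD itself are not reproduced here.

```python
def bilinear_mix(style, content, weight):
    """Return f with f[k] = sum_i sum_j style[i] * weight[k][i][j] * content[j].

    style:   list of length d_s
    content: list of length d_c
    weight:  nested list of shape (d_out, d_s, d_c)
    """
    return [
        sum(
            s_i * w_ij * c_j
            for s_i, w_row in zip(style, w_k)
            for w_ij, c_j in zip(w_row, content)
        )
        for w_k in weight
    ]
```

For example, with `style = [1, 2]`, `content = [3, 4]`, and an identity slice `[[1, 0], [0, 1]]` as `weight[0]`, the first output component is the dot product of the two vectors.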
The loss function for the supervised training is expressed as follows:
where is the number of black pixels in , denotes the training dataset, and is the mean value of the black pixels in the image .
The final objective used to optimize the generator is
We constructed a Japanese typographic font dataset consisting of 368 fonts and 2965 glyphs to evaluate the effectiveness of our proposed framework. All the images are grayscale and have dimensions of 64 × 64 pixels. We randomly selected 338 fonts for the pre-training of AGIS-Net and the training of EMD. Among these 338 fonts, we used 165 glyphs for the validation set and 2800 glyphs for training. Furthermore, we used the remaining 30 fonts, which were not used in training, for the fine-tuning of AGIS-Net and the inference of AGIS-Net/EMD. For each of these unseen fonts, only a few style reference glyphs are available when synthesizing the font.
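The split described above can be sketched with integer indices standing in for fonts and glyphs. The counts come from the text; the random selection of validation glyphs, the seed, and the function name are assumptions, since the text only states that the training fonts were chosen at random.

```python
import random

NUM_FONTS = 368
NUM_GLYPHS = 2965
NUM_TRAIN_FONTS = 338
NUM_VAL_GLYPHS = 165


def split_dataset(seed=0):
    """Split font and glyph indices into training and held-out subsets."""
    rng = random.Random(seed)

    fonts = list(range(NUM_FONTS))
    rng.shuffle(fonts)
    train_fonts = fonts[:NUM_TRAIN_FONTS]
    unseen_fonts = fonts[NUM_TRAIN_FONTS:]

    glyphs = list(range(NUM_GLYPHS))
    rng.shuffle(glyphs)
    val_glyphs = glyphs[:NUM_VAL_GLYPHS]
    train_glyphs = glyphs[NUM_VAL_GLYPHS:]

    return train_fonts, unseen_fonts, train_glyphs, val_glyphs
```

This yields 338 training fonts, 30 unseen fonts, 2800 training glyphs, and 165 validation glyphs, matching the numbers reported above.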
IV-B Implementation Details
The weight of the metric learning loss was set to 1.0. The temperature of the softmax was set to 0.5 for the pre-training of AGIS-Net and 0.1 for the training of EMD. The other hyperparameters and the settings of the convolutional layers are the same as the original settings.
We performed the pre-training of AGIS-Net for 20 epochs. The learning rate in the first 10 epochs was set to 0.0002 and was reduced linearly in the remaining 10 epochs. Subsequently, we performed fine-tuning for 200 epochs at a learning rate of 0.00002. The number of glyphs fed to the style encoder simultaneously was set to 4. Note that the dimensionalities of the content/style feature embeddings are 512.
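The pre-training learning-rate schedule can be written as a small function of the epoch index. The text states only that the rate is reduced linearly over the last 10 epochs; decaying toward zero, reached just after the final epoch, is an assumption of this sketch, as is the function name.

```python
def pretrain_lr(epoch, base_lr=0.0002, constant_epochs=10, decay_epochs=10):
    """Learning rate for a 0-indexed epoch: constant first, then linear decay."""
    if epoch < constant_epochs:
        return base_lr
    # Number of epochs left, counting the current one; zero after training ends.
    remaining = constant_epochs + decay_epochs - epoch
    return base_lr * max(remaining, 0) / decay_epochs
```

Under this assumption the rate stays at 0.0002 for epochs 0 to 10, falls to half of that at epoch 15, and would be zero at epoch 20.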
Furthermore, we performed the training of EMD for 10 epochs at a learning rate of 0.0002. The dimensionalities of the feature embeddings (Fig. (b)) are all 512.
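| Method | $L_1$ | PSNR | SSIM | FID | $L_1$ | PSNR | SSIM | FID |
|---|---|---|---|---|---|---|---|---|
| AGIS-Net [10] + DML (ours) | 0.274 | 10.44 | 0.566 | 115.41 | 0.219 | 11.84 | 0.656 | 69.76 |
| EMD [12] + DML (ours) | 0.187 | 13.58 | 0.705 | 60.91 | 0.179 | 13.96 | 0.726 | 54.59 |
| AGIS-Net [10] + DML (ours) | 0.202 | 12.56 | 0.686 | 53.72 | 0.193 | 13.03 | 0.705 | 46.86 |
| EMD [12] + DML (ours) | 0.174 | 14.07 | 0.730 | 57.41 | 0.170 | 14.25 | 0.738 | 53.84 |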
We synthesized Japanese typographic fonts from style reference glyphs with AGIS-Net and EMD. The generated glyphs are shown in Fig. 5. The results demonstrate that our method produces glyphs with clearer contours and less noise than the baseline methods. The baseline methods generated poor results, especially when the number of style reference glyphs was extremely small, whereas the models enhanced with the proposed method produced more accurate results.
Moreover, we conducted quantitative evaluations of the proposed method. To evaluate the performance, we adopted four commonly used metrics: $L_1$ loss, peak signal-to-noise ratio (PSNR), structural similarity index (SSIM) [24], and Fréchet inception distance (FID) [25]. Smaller values of $L_1$ loss and FID indicate better performance, whereas larger values of PSNR and SSIM indicate better performance. We calculated these metrics for each of the 30 fonts and then averaged the values, as summarized in Table I. The methods enhanced with our framework outperformed the baseline methods across almost all the metrics and settings.
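Two of the four metrics are simple enough to sketch directly. The functions below compute the mean absolute error and PSNR for images given as 2-D lists; the pixel range of [0, 1] (via `max_value`) and the function names are assumptions, and SSIM and FID are omitted because they require substantially more machinery.

```python
import math


def l1_loss(pred, target):
    """Mean absolute pixel difference between two images (2-D lists)."""
    diffs = [abs(p - t) for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(diffs) / len(diffs)


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in decibels; infinite for identical images."""
    squared = [(p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    mse = sum(squared) / len(squared)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because PSNR is a logarithmic function of the mean squared error, a seemingly small gain in decibels can correspond to a noticeable reduction in pixel error.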
Our framework with DML is more effective when the number of style reference glyphs is small, which leads to the following inferences. Introducing DML allows our model to extract only the font style features, without being affected by the differences in contents. Therefore, our method requires fewer style reference glyphs for capturing the style characteristics of novel fonts.
IV-D Effects of Deep Metric Learning
| AGIS-Net [10] + DML (ours) | 0.7822 | 0.8756 | 0.8628 |
We investigated the distribution of style feature embeddings using AGIS-Net to verify whether the style encoder can output better style feature embeddings with our framework. We pre-trained AGIS-Net with and without DML by using the settings mentioned in Section IV-B. Then, we extracted style feature embeddings from 30 style reference glyphs for each of the 30 unseen fonts with the pre-trained style encoder. As the AGIS-Net style encoder architecture requires multiple images to be input simultaneously, we replicated each single image the required number of times and concatenated the copies.
With the obtained 900 style feature embeddings, we evaluated the following two metrics: Recall@$K$ [26] and normalized mutual information (NMI) [27]. Recall@$K$, a standard measure in image retrieval, denotes the proportion of query vectors for which an embedding vector of the same class appears among the $K$ nearest neighbors under the Euclidean metric. For NMI, we clustered the style feature embeddings into 30 classes with k-means++ [28] 100 times, calculated the NMI between each clustering result and the true classes, and reported the best value. The results are listed in Table II, where R@$K$ indicates Recall@$K$. All the metrics indicate that our framework with DML helps in improving feature separation.
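The two embedding-quality measures can be sketched in plain Python as follows. Recall@K is computed by brute-force nearest-neighbor search, and NMI is normalized by the arithmetic mean of the two entropies; this normalization, the handling of the degenerate single-cluster case, and the function names are assumptions of the sketch. The k-means++ clustering step is not included.

```python
import math
from collections import Counter


def recall_at_k(embeddings, labels, k):
    """Fraction of queries with a same-class item among their k nearest neighbors."""
    hits = 0
    for i, query in enumerate(embeddings):
        neighbors = sorted(
            (math.dist(query, other), labels[j])
            for j, other in enumerate(embeddings)
            if j != i  # a query never retrieves itself
        )
        if any(label == labels[i] for _, label in neighbors[:k]):
            hits += 1
    return hits / len(embeddings)


def nmi(labels_a, labels_b):
    """Normalized mutual information between two labelings of the same items."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mutual = sum(
        (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    entropy_a = -sum((c / n) * math.log(c / n) for c in count_a.values())
    entropy_b = -sum((c / n) * math.log(c / n) for c in count_b.values())
    mean_entropy = (entropy_a + entropy_b) / 2.0
    # Both labelings consist of a single cluster: treat them as identical.
    return mutual / mean_entropy if mean_entropy > 0 else 1.0
```

NMI is invariant to a relabeling of the clusters, so it can compare k-means cluster indices directly against the true font classes.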
Moreover, we visualized these style feature embeddings with t-distributed stochastic neighbor embedding (t-SNE) [29] (Fig. 6). With the use of DML, different styles lie far away from each other, whereas similar styles are located close together but with less overlap. This suggests that the style encoder can extract features by focusing on the differences in style, even in cases where the differences in style are subtler than the differences in content. Such improved feature separation leads to an improvement in the results of typographic font generation.
In this study, we proposed a simple and effective framework to improve the performance of existing few-shot font generation methods, focusing on Japanese typographic fonts. The existing methods are known to produce poor results when the number of style reference glyphs is extremely limited. We introduced DML through the $L_2$-constrained softmax loss to the style encoder and demonstrated a clear improvement in the results. We used AGIS-Net and EMD as the baseline methods in this study. However, our framework is general and is easily applicable to other methods and tasks. Furthermore, experimental results indicated that our framework helps style encoders focus on style characteristics, without being biased by content properties. Consequently, promising results can be obtained when the style reference glyphs are considerably limited.
- [1] R. Suveeranont and T. Igarashi, “Example-based automatic font generation,” in Proceedings of International Symposium on Smart Graphics. Springer, 2010, pp. 127–138.
- [2] N. D. F. Campbell and J. Kautz, “Learning a manifold of fonts,” ACM Trans. Graph., vol. 33, no. 4, Jul. 2014.
- [3] B. Zhou, W. Wang, and Z. Chen, “Easy generation of personal Chinese handwritten fonts,” in Proceedings of IEEE International Conference on Multimedia and Expo, 2011, pp. 1–6.
- [4] P. Upchurch, N. Snavely, and K. Bala, “From A to Z: Supervised transfer of style and content using deep neural network generators,” arXiv preprint arXiv:1603.02003, 2016.
- [5] “zi2zi: Master Chinese calligraphy with conditional adversarial networks.” [Online]. Available: https://kaonashi-tyc.github.io/2017/04/06/zi2zi.html
- [6] P. Lyu, X. Bai, C. Yao, Z. Zhu, T. Huang, and W. Liu, “Auto-encoder guided GAN for Chinese calligraphy synthesis,” in Proceedings of 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 01, 2017, pp. 1095–1100.
- [7] Y. Jiang, Z. Lian, Y. Tang, and J. Xiao, “DCFont: An end-to-end deep Chinese font generation system,” in SIGGRAPH Asia 2017 Technical Briefs, Nov. 2017, pp. 1–4.
- [8] Y. Jiang, Z. Lian, Y. Tang, and J. Xiao, “SCFont: Structure-guided Chinese font generation via deep stacked networks,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 4015–4022.
- [9] S. Azadi, M. Fisher, V. G. Kim, Z. Wang, E. Shechtman, and T. Darrell, “Multi-content GAN for few-shot font style transfer,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
- [10] Y. Gao, Y. Guo, Z. Lian, Y. Tang, and J. Xiao, “Artistic glyph image synthesis via one-stage few-shot learning,” ACM Trans. Graph., vol. 38, no. 6, Nov. 2019.
- [11] S. Yang, J. Liu, W. Wang, and Z. Guo, “TET-GAN: Text effects transfer via stylization and destylization,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 1238–1245.
- [12] Y. Zhang, Y. Zhang, and W. Cai, “Separating style and content for generalized style transfer,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8447–8455.
- [13] D. Sun, T. Ren, C. Li, H. Su, and J. Zhu, “Learning to write stylized Chinese characters by reading a handful of examples,” in Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18. International Joint Conferences on Artificial Intelligence Organization, Jul. 2018, pp. 920–927.
- [14] J. Cha, S. Chun, G. Lee, B. Lee, S. Kim, and H. Lee, “Few-shot compositional font generation with dual memory,” arXiv preprint arXiv:2005.10510, 2020.
- [15] Z. Lian, B. Zhao, and J. Xiao, “Automatic generation of large-scale handwriting fonts via style learning,” in SIGGRAPH Asia 2016 Technical Briefs. New York, NY, USA: Association for Computing Machinery, 2016.
- [16] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” arXiv preprint arXiv:1312.6114, 2013.
- [17] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proceedings of Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
- [18] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, 2006, pp. 1735–1742.
- [19] E. Hoffer and N. Ailon, “Deep metric learning using triplet network,” in Proceedings of International Workshop on Similarity-Based Pattern Recognition. Springer, 2015, pp. 84–92.
- [20] R. Ranjan, C. D. Castillo, and R. Chellappa, “L2-constrained softmax loss for discriminative face verification,” arXiv preprint arXiv:1703.09507, 2017.
- [21] Q. Qian, L. Shang, B. Sun, J. Hu, H. Li, and R. Jin, “SoftTriple loss: Deep metric learning without triplet sampling,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6450–6458.
- [22] R. Mechrez, I. Talmi, and L. Zelnik-Manor, “The contextual loss for image transformation with non-aligned data,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 768–783.
- [23] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
- [24] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, 2004.
- [25] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “GANs trained by a two time-scale update rule converge to a local Nash equilibrium,” in Proceedings of Advances in Neural Information Processing Systems, 2017, pp. 6626–6637.
- [26] H. Jegou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1, pp. 117–128, 2010.
- [27] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.
- [28] D. Arthur and S. Vassilvitskii, “K-means++: The advantages of careful seeding,” in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, ser. SODA ’07. USA: Society for Industrial and Applied Mathematics, 2007, pp. 1027–1035.
- [29] L. v. d. Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.