[TOG(SIGGRAPH Asia) 2019] Artistic Glyph Image Synthesis via One-Stage Few-Shot Learning
Automatic generation of artistic glyph images is a challenging task that attracts considerable research interest. Previous methods are either specifically designed for shape synthesis or focused on texture transfer. In this paper, we propose a novel model, AGIS-Net, to transfer both shape and texture styles in one stage with only a few stylized samples. To achieve this goal, we first disentangle the representations for content and style by using two encoders, ensuring multi-content and multi-style generation. Then we utilize two collaboratively working decoders to generate the glyph shape image and its texture image simultaneously. In addition, we introduce a local texture refinement loss to further improve the quality of the synthesized textures. In this manner, our one-stage model is much more efficient and effective than other multi-stage stacked methods. We also propose a large-scale dataset with Chinese glyph images in various shape and texture styles, rendered from 35 professional-designed artistic fonts with 7,326 characters and 2,460 synthetic artistic fonts with 639 characters, to validate the effectiveness and extendability of our method. Extensive experiments on both English and Chinese artistic glyph image datasets demonstrate the superiority of our model in generating high-quality stylized glyph images against other state-of-the-art methods.
Artistic font design is a time-consuming and labor-intensive process due to the complex combination of lines, serif details, colors and textures. Moreover, maintaining a coherent style among all characters of a font is also difficult. In some language systems (e.g., Chinese, whose official character set GB18030 contains 27,533 characters), it is almost impossible to manually design all characters. Automatically generating glyph images for characters could be helpful for the design of artistic fonts. This paper focuses on the task of automatically synthesizing novel glyph images for characters in an artistic font using just a few samples created by a font designer (see Figure 1(a)). Our method (AGIS-Net) can synthesize glyph images that are more coherent and realistic in terms of style and structure than those of state-of-the-art approaches (e.g., MC-GAN [Azadi et al., 2018] and TET-GAN [Yang et al., 2019], see Figure 1(b)).
Up to now, a number of works have been reported for glyph image/glyph generation. They can be roughly classified into two groups: glyph shape synthesis and texture transfer. The first group mainly concentrates on the generation of geometric outlines. Campbell and Kautz built a font manifold based on finding accurate correspondences between Latin glyphs' outlines, an approach that typically fails when the number and/or complexity of glyphs greatly increases. Lian et al. attempted to extract strokes from given Chinese glyphs and learn to write corresponding strokes for other characters in the same style, but this approach is unsuited to many other scenarios such as Latin glyph synthesis. For the second group, optimization-based methods [Yang et al., 2017; Men et al., 2018] have shown promising results. Based on the shape outline, patches extracted from the given glyph image with texture effects are rearranged to appropriate positions on the target glyph image. However, these methods require the outline shape image as a reference to establish accurate correspondences, and they rely on time-consuming online iterative optimization.
To effectively solve the above-mentioned problems, this paper proposes a novel model, AGIS-Net (Artistic Glyph Image Synthesis Network; source code and dataset are available at https://hologerry.github.io/AGIS-Net/), which is capable of generating stylized glyph images by training on a small number of reference samples. To the best of our knowledge, our work is the first to transfer both shape and texture styles to arbitrarily large numbers of characters and generate high-quality synthesis results. Compared to the work most relevant to ours [Azadi et al., 2018], the proposed model has a wider application scenario: it is suitable not only for English but also for Chinese and any other writing system. Furthermore, unlike MC-GAN [Azadi et al., 2018], the proposed AGIS-Net is a one-stage model, which means that the generator directly outputs stylized glyph images with user-specified contents and styles.
More specifically, we use two encoders to extract content and style features separately, and apply two parallel decoders to recover glyph shape style and texture style. Namely, there exist two branches in our generator. The first branch acts as guidance for the second branch, and the two branches run simultaneously. Since the two branches share the same content and style features from the encoders, fewer parameters are needed in the model. Furthermore, we use the contextual loss and introduce the local texture refinement loss to further improve image quality. The contextual loss [Mechrez et al., 2018], similar to the perceptual loss, computes the gap between the ground truth and outputs at the feature level and is capable of improving the realism of the whole image. The local texture refinement loss aims at synthesizing high-quality local patches. It attempts to refine the details of the image by adversarial training. To verify the effectiveness and extendibility of our method, we also build a new Chinese artistic glyph image dataset which consists of 1,571,940 glyph images, containing 2,460 synthetic artistic font styles. Additionally, we collect 256,410 glyph images rendered from 35 professional-designed fonts, each of which has 7,326 characters. Qualitative and quantitative experiments conducted on both English and Chinese glyph image datasets demonstrate that our proposed method markedly outperforms existing approaches, synthesizing realistic and high-quality stylized glyph images.
In summary, our key contributions are listed as follows:
We propose a simple yet effective model, AGIS-Net, exploiting two parallel encoder-decoder branches, to transfer artistic font style with respect to both shape style and texture style within a single stage.
We introduce a novel and computationally efficient loss function called the local texture refinement loss, which is helpful to improve the quality of synthesis results in few-shot style transfer tasks.
We construct a new Chinese glyph image dataset, which consists of more than 1.8 million images covering 2,460 synthetic artistic font styles and 35 artist-designed font styles.
Extensive experiments clearly verify the effectiveness and extendibility of our proposed method on few-shot learning, and demonstrate its superiority to other existing methods on artistic font style transfer.
Campbell and Kautz built a font manifold and generated new fonts by interpolation in a high dimensional space. Lian et al. proposed a system to automatically generate large-scale Chinese handwriting fonts by learning styles of stroke shape and layout separately. Balashova et al. developed a stroke-based geometric model for glyph synthesis, embedding fonts on a manifold using purely geometric features. [Baluja, 2016] is one of the earliest works to use deep neural networks to generate English glyph images. Upchurch et al. considered glyph image synthesis as an image analogy task and proposed a modified VAE to separate image style from content. Lyu et al. proposed to apply an image-to-image translation model to learn mappings from glyph images in the standard font style to those with desired styles. Jiang et al. designed a deep stacked neural network with the guidance of glyph structure information to synthesize high-quality Chinese fonts.
Style transfer aims at migrating a given image's style to another image while preserving the latter one's content. Gatys et al. pioneered a style transfer scheme based on Convolutional Neural Networks (CNNs), getting quite appealing results. Johnson et al. extended the work [Gatys et al., 2015] by introducing the perceptual loss for training and using feed-forward networks for image transformation. More recently, Huang and Belongie presented an effective approach that for the first time enables arbitrary style transfer in real-time. Li et al. embedded a pair of patch-based feature transforms, whitening and coloring, to an image reconstruction network to synthesize styled images with high visual quality. Gu et al. took advantage of reshuffling deep features to achieve arbitrary style transfer while preserving local style patterns and preventing artifacts.
There also exist some works focusing on texture transfer of glyph images. For example, Yang et al.  explored the task of generating special text effects for typography by proposing an optimization-based model. Based on the traditional texture transfer technique, Men et al.  proposed to adopt structure information to effectively guide the synthesis process. More recently, a deep learning based method was reported [Yang et al., 2019] to accelerate the transfer process while maintaining the image quality.
In contrast to the above-mentioned works, many researchers intend to explicitly disentangle the content and style of images. For instance, Tenenbaum et al.  presented a general bilinear model to solve two-factor tasks, providing sufficiently expressive representations of factor interactions. Wang et al.  proposed to factorize the image generation process and use two GANs for surface normal map and image generation, respectively. Gonzalez-Garcia et al.  introduced the concept of cross-domain disentanglement and separated the internal representation into a shared part and two exclusive parts. Kazemi et al.  proposed a method to train GANs to learn disentangled style and content representations of the data. Motivated by these works, we also disentangle the content and style of glyph images to make our model capable of handling the tasks of precisely transferring both shape and texture styles, by using a new network architecture and several novel loss functions.
Generative Adversarial Networks (GANs) were originally proposed in [Goodfellow et al., 2014] by introducing the adversarial process to generative models. Since then, many works [Radford et al., 2015; Salimans et al., 2016] have been proposed to improve the performance of GANs. For instance, Conditional GANs [Chen et al., 2016; Odena et al., 2017] attempt to use labels or images to control the generation and have been applied in many application scenarios such as image-to-image transformation.
In [Isola et al., 2017], Isola et al. proposed a unified image-to-image framework, Pix2Pix, based on conditional GANs. Then, Zhu et al. [2017b] proposed BicycleGAN that can model multimodal distribution and output diverse images. Mao et al.  presented a mode seeking regularization term to address the mode collapse problem for general GAN models. However, paired data are typically hard to obtain in many tasks. To solve the problem, several unpaired methods [Zhu et al., 2017a; Lee et al., 2018] have been reported whose key idea is to use the cycle consistency loss in training. Also, Iizuka et al.  proposed to use global and local discriminators to solve the image completion problem. More recently, there exist some works, such as [Clouâtre and Demers, 2019] and [Liu et al., 2019], which explore the few-shot learning based image generation task.
The work most relevant to our model is MC-GAN [Azadi et al., 2018] which regards glyph image synthesis as an image-to-image translation task, and aims at transforming a content image to the stylized glyph image. MC-GAN connects two networks in series to get a two-stage model to achieve the goals of shape style transfer and texture style transfer, respectively.
As mentioned above, our goal is to synthesize stylized glyph images with conditional contents and styles. Similar to other GANs, we have a generator and several discriminators in our model. To obtain better performance for this task, we specifically design the network architecture and loss function, which are discussed in detail in Sections 3.1 and 3.2.
We formulate the generation process as a mapping from a content reference image and a small set of style reference images, which all have the same style but different contents, to an output glyph image that has the content of the former and the style of the latter. The content image is a binary glyph image in a standard font style (e.g., Code New Roman for English or an average font style for Chinese) containing little style information. We use a set of stylized images rather than just one as our style reference because each stylized glyph image is composed of its content and style, and thus we have to find a way to disentangle the style from the image. Given a set of stylized glyph images instead of one, our model is able to extract the feature they have in common, namely the style information, and ignore the content. Assume that the few-shot reference set contains a fixed number of stylized images. During each forward propagation, we randomly select a smaller subset of images from this reference set and concatenate them channel-wise to form our style reference input. The reasons for doing this are twofold: 1) The model can have many different combinations of images as input, making it more robust. In contrast, if we simultaneously feed all these reference samples, the style input will always be the same during training. Thus, the model could easily fall into a local optimum, losing generalization ability. 2) For other writing systems (e.g., Chinese), feeding all these samples at the same time would dramatically increase the model size.
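The sampling step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `sample_style_input`, the nested-list image layout (height x width x channels), and the parameter names are assumptions made for the example.

```python
import random

def sample_style_input(reference_set, subset_size, rng=random):
    """Pick `subset_size` distinct reference images and stack their channels.

    Each image is a list of rows, each row a list of pixels, and each pixel a
    list of channel values (height x width x channels). Hypothetical layout.
    """
    if not 0 < subset_size <= len(reference_set):
        raise ValueError("subset_size must be between 1 and the reference set size")
    # A different random subset is drawn on every call (every forward pass).
    chosen = rng.sample(reference_set, subset_size)
    height, width = len(chosen[0]), len(chosen[0][0])
    # Channel-wise concatenation: pixel (r, c) of the result holds the channels
    # of every chosen image at (r, c), one image after another.
    return [
        [[value for image in chosen for value in image[r][c]] for c in range(width)]
        for r in range(height)
    ]
```

With a reference set of five RGB images and a subset size of four, each call returns an image with 12 channels whose content depends on which four references were drawn, so the style input varies between iterations while the input width of the style encoder stays fixed.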
For stable convergence, we pre-train our model before implementing few-shot learning. After that, we can fine-tune our pre-trained model to any specific artistic style as we want. In this paper, we use the Chinese glyph image dataset created by us and the English glyph image dataset proposed in [Azadi et al., 2018] for pre-training.
As shown in Figure 2, our model consists of a Generator and three discriminators: a Shape Discriminator, a Texture Discriminator and a Local Discriminator.
The two encoders separately extract the content and style features of the input. They have almost the same structure, consisting of several convolution layers, and differ only in the number of input channels. The two decoders behave differently: one generates the glyph shape and the other the texture. Inspired by Pix2Pix [Isola et al., 2017], we use skip connections in the two encoder-decoder branches, as shown in Figure 3, so that the two branches can work together. The shape decoder consists of several up-convolution layers. The input of each layer is a concatenation of the features of the previous layer and those of the corresponding layers in the two encoders. The shape decoder then outputs a gray-scale shape image. The texture decoder is similar, but the input of each layer is additionally concatenated with the features of the corresponding shape decoder layer. At the end of the texture decoder, there is one more convolution layer fed with a concatenation of the previous features and the gray-scale shape image, so that all information from the shape decoder can be shared with the texture decoder.
The purpose of this special skip-connection architecture is to exploit features at all scales, since they are all important. For example, higher-level features contain more abstract style information while lower-level features contain more specific style information, and we intend to exploit them as much as possible so that the model learns effective and sufficient information. We use two separate decoders in our model mainly because, compared to color and texture, the shape is much more variable in glyph images. With this network architecture, our model can pay more attention to the shape style.
For adversarial training, there are three discriminators in our model. The first two, the Shape Discriminator and the Texture Discriminator, are PatchGAN-like [Isola et al., 2017] discriminators. The outputs of the shape decoder and the texture decoder are fed into these two discriminators, respectively, as fake samples. The third one, the Local Discriminator, is used for refining local textures. Details of all discriminators will be described in Section 3.2.
The objective function of our model consists of four terms: adversarial loss, L1 loss, contextual loss and local texture refinement loss.
Adversarial loss: Similar to most GANs, we impose a standard adversarial game to train the generator against the Shape Discriminator and the Texture Discriminator. As mentioned above, the Shape Discriminator is trained on the gray-scale image and the Texture Discriminator on the texture image. The adversarial loss combines these two terms, each scaled by its own balancing weight. The fake samples are the machine-generated gray-scale and texture images, and the real samples are a real glyph image with texture effects and its gray-scale version. If we train the model for a character that is in the few-shot reference set, the real sample is the ground truth image of that character; otherwise, it is randomly selected from the style reference input set.
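One standard way to write such a two-term adversarial objective is sketched below. The symbol names are assumptions chosen for illustration and are not necessarily the paper's own notation: $G$ is the generator, $D_{sh}$ and $D_{tex}$ are the shape and texture discriminators, $x_c$ is the content reference image, $X_s$ the style reference input, $\hat{y}_{gray}$ and $\hat{y}$ are the generated gray-scale and texture images, $y$ is a real textured glyph image, $y_{gray}$ its gray-scale version, and the $\lambda$ terms are balancing weights. Unconditional discriminators are assumed for simplicity.

```latex
\begin{aligned}
\hat{y}_{gray},\ \hat{y} &= G(x_c, X_s), \\
\mathcal{L}_{adv} &= \mathcal{L}_{adv}^{sh} + \mathcal{L}_{adv}^{tex}, \\
\mathcal{L}_{adv}^{sh} &= \lambda_{adv}^{sh} \left( \mathbb{E}\left[\log D_{sh}(y_{gray})\right]
  + \mathbb{E}\left[\log\left(1 - D_{sh}(\hat{y}_{gray})\right)\right] \right), \\
\mathcal{L}_{adv}^{tex} &= \lambda_{adv}^{tex} \left( \mathbb{E}\left[\log D_{tex}(y)\right]
  + \mathbb{E}\left[\log\left(1 - D_{tex}(\hat{y})\right)\right] \right).
\end{aligned}
```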
L1 loss: To stabilize our training, we use an L1 loss in our objective function. The loss function also has two terms, one for gray-scale images and one for texture images, each being the weighted L1 distance between the generated image and its ground truth image. It is worth explaining that during training we explore all glyphs in the font library, which means we synthesize glyph images for all characters. For example, if we train on the English dataset, we generate glyph images for all 26 capital letters, although most of them might not belong to the few-shot reference set. When implementing few-shot learning (i.e., fine-tuning on a specific stylized font), if the glyph is in the few-shot reference set, we have ground truth images and the weights are non-zero. Otherwise, if ground truth images are unavailable, these two weights are set to 0. Although unseen characters (i.e., those outside the few-shot reference set) cannot contribute to the L1 loss, they are still useful for the adversarial training, helping us get more satisfactory results. The real samples for the Shape Discriminator and the Texture Discriminator are chosen in two different manners depending on whether the characters belong to the few-shot reference set or not. In the pre-training stage, as all glyphs have corresponding ground truth images, the weights of the L1 loss are never zero.
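The effect of zeroing the weights for unseen characters can be illustrated with a small sketch. The function names, the nested-list image representation, and the use of `None` to mark a missing ground truth are assumptions made for this example, not details taken from the paper.

```python
def flatten(image):
    """Flatten a nested-list image of any depth into a flat list of floats."""
    if isinstance(image, (int, float)):
        return [float(image)]
    return [value for part in image for value in flatten(part)]

def masked_l1_loss(generated, ground_truth, weight=1.0):
    """Weighted mean absolute error over glyphs that have a ground truth.

    `ground_truth[i]` is None for unseen characters, which is equivalent to
    setting their L1 weight to 0: they are skipped here but can still be
    passed to the discriminators for adversarial training.
    """
    losses = []
    for fake, real in zip(generated, ground_truth):
        if real is None:
            continue
        fake_px, real_px = flatten(fake), flatten(real)
        losses.append(sum(abs(a - b) for a, b in zip(fake_px, real_px)) / len(fake_px))
    return weight * sum(losses) / len(losses) if losses else 0.0
```

The same function would be applied once to the gray-scale outputs and once to the texture outputs, each with its own weight.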
Contextual loss: The contextual loss was recently proposed in [Mechrez et al., 2018]. It is a new and effective way to measure the similarity between two images, requiring no spatial alignment. Since spatial alignment is required for the L1 loss, if the synthesized image is not exactly spatially aligned to the ground truth image (e.g., because of a small displacement or rotation), the L1 loss will be high even though the synthesis result is often visually acceptable. The contextual loss leads the model to pay more attention to style features at a high level, not just differences in pixel values. Therefore, we regard the contextual loss as a complement to the L1 loss.
Here we briefly explain the contextual loss. The key idea of the contextual loss is to treat an image as a collection of features, and to measure the similarity between two images based on the similarity of their feature collections, while ignoring the spatial alignment of features. Given an image and its target image, we gather their feature maps (e.g., VGG19 [Simonyan and Zisserman, 2014] features) into two collections. For each feature in one collection, we find the feature most similar to it in the other, calculate the distance between them, and convert it to a similarity. Then we average these best-match similarities and apply the negative logarithm to get the loss value.
To obtain the similarity between a pair of features, we first calculate the cosine distance between them, then normalize the distances, convert the distances to similarities by exponentiation, and finally normalize the similarities.
This computation involves two hyper-parameters, which we fix to the values used in the original paper [Mechrez et al., 2018].
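The steps above can be sketched in plain Python. This is an illustrative reading of the description, not the authors' code: the function names are hypothetical, the defaults `h=0.5` and `eps=1e-5` are assumed values for the two hyper-parameters, features are assumed to be non-zero vectors, and the direction of the best-match search (for each target feature, the best-matching generated feature) is also an assumption.

```python
import math

def cosine_distance(u, v):
    """Cosine distance between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def contextual_loss(features_x, features_y, h=0.5, eps=1e-5):
    """Contextual loss between generated features and target features."""
    # Row i holds the distances from generated feature x_i to every target y_j.
    dist = [[cosine_distance(x, y) for y in features_y] for x in features_x]
    similarity = []
    for row in dist:
        row_min = min(row)
        # Normalise distances by the smallest one in the row, then turn them
        # into similarities by exponentiation.
        weights = [math.exp((1.0 - d / (row_min + eps)) / h) for d in row]
        total = sum(weights)
        # Normalise the similarities so that each row sums to 1.
        similarity.append([w / total for w in weights])
    # For each target feature, keep its best match among the generated ones.
    best = [
        max(similarity[i][j] for i in range(len(features_x)))
        for j in range(len(features_y))
    ]
    # Average the best-match similarities and take the negative logarithm.
    return -math.log(sum(best) / len(best))
```

Because only feature similarities enter the computation, permuting the order of the features in either collection leaves the loss unchanged, which is what makes it insensitive to spatial misalignment.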
In our model, we apply the contextual loss to both the gray-scale image and the texture image. For each of them, the loss is computed on features extracted from several layers of VGG19 and accumulated over the layers used, and the two resulting terms are combined with their own weights.
Since we apply the contextual loss to the output and its ground truth, these two weights are zero for the adversarial training of unseen characters, as are the two weights in the L1 loss.
Local texture refinement loss: With such a small number of glyph images used in few-shot learning, the positive and negative samples are highly unbalanced. Therefore, we propose the local texture refinement loss to address this problem.
As shown in Figure 4, we randomly cut patches from images and feed them to the Local Discriminator instead of using the whole image. In this manner, training samples become relatively sufficient and balanced. Moreover, to get better texture details, we manually blur some positive samples with a Gaussian filter and regard them as negative samples when training. Through this blurring operation, we build a bridge between real and fuzzy samples, so that the Local Discriminator forces the Generator to synthesize more realistic images with fewer artifacts and less noise. The above-mentioned operations involve only a small amount of computation, and the Local Discriminator is also computationally efficient since the patches it processes are small. Although the idea of using image patches is similar to PatchGAN [Isola et al., 2017], our motivation and implementation details are quite different: 1) We use patches to alleviate the problem of insufficient training samples, while PatchGAN aims to penalize structure at the scale of patches with a score indicating whether a patch is real or fake; 2) The way to build a bridge between real and fuzzy samples is novel and effective for this task.
The patches are used for training the third discriminator, the Local Discriminator, whose negative samples consist of generated patches and blurred patches. Its loss function is therefore an adversarial objective over patches, scaled by a balancing weight, in which patches from the style reference input images serve as real samples, while patches from the generated images and blurred versions of the real patches serve as fake samples.
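How the training samples for the Local Discriminator could be assembled is sketched below. This is a simplified illustration under stated assumptions: images are single-channel nested lists, the function names are hypothetical, and a 3-tap binomial kernel with edge replication stands in for the Gaussian filter, whose actual size and strength are not given here.

```python
import random

GAUSS_1D = (0.25, 0.5, 0.25)  # small binomial approximation of a Gaussian

def random_patch(image, size, rng=random):
    """Cut a random size x size patch from a 2-D image (list of rows)."""
    top = rng.randrange(len(image) - size + 1)
    left = rng.randrange(len(image[0]) - size + 1)
    return [row[left:left + size] for row in image[top:top + size]]

def blur(patch):
    """Blur a 2-D patch with a separable 3-tap kernel, replicating the edges."""
    def smooth(line):
        last = len(line) - 1
        return [
            sum(w * line[min(max(i + k - 1, 0), last)] for k, w in enumerate(GAUSS_1D))
            for i in range(len(line))
        ]
    rows = [smooth(row) for row in patch]            # horizontal pass
    cols = [smooth(list(col)) for col in zip(*rows)]  # vertical pass
    return [list(row) for row in zip(*cols)]          # transpose back to rows

def local_discriminator_samples(real_image, generated_image, size, rng=random):
    """Return (patch, label) pairs: real = 1, generated or blurred real = 0."""
    real = random_patch(real_image, size, rng)
    fake = random_patch(generated_image, size, rng)
    return [(real, 1), (fake, 0), (blur(real), 0)]
```

Each call yields one positive and two negative patches, so even a handful of reference glyphs can supply many distinct training samples by varying the crop position.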
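Finally, the proposed model is trained by playing a minimax game in which the generator minimizes, and the three discriminators maximize, the combination of the four losses described above.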
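A plausible written form of the local texture refinement loss and of the overall objective is sketched below. All symbols are assumed for illustration rather than taken from the paper: $D_{local}$ is the Local Discriminator, $p_{real}$, $p_{fake}$ and $p_{blur}$ are patches from the style reference images, patches from the generated images, and blurred real patches, $\lambda_{local}$ is the balancing weight, and $\mathcal{L}_{adv}$, $\mathcal{L}_{1}$ and $\mathcal{L}_{CX}$ denote the adversarial, L1 and contextual losses.

```latex
\begin{aligned}
\mathcal{L}_{local} &= \lambda_{local} \left( \mathbb{E}\left[\log D_{local}(p_{real})\right]
  + \mathbb{E}\left[\log\left(1 - D_{local}(p_{fake})\right)\right]
  + \mathbb{E}\left[\log\left(1 - D_{local}(p_{blur})\right)\right] \right), \\
G^{*} &= \arg\min_{G} \max_{D_{sh},\, D_{tex},\, D_{local}}
  \ \mathcal{L}_{adv} + \mathcal{L}_{1} + \mathcal{L}_{CX} + \mathcal{L}_{local}.
\end{aligned}
```

The L1 and contextual terms do not depend on any discriminator, so in this form they only affect the generator's side of the game.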
As mentioned in Section 3, for a specific character, the input content reference image is identical for different artistic styles. As shown in Figure 5(a), we use the content input in the same font style (i.e., Code New Roman) as MC-GAN [Azadi et al., 2018] for the English dataset; for Chinese, as shown in Figure 5(b), we adopt the commonly used average font style [Jiang et al., 2019; Guo et al., 2018]. The principle of choosing the font style of content input is that the shape style it contains should be as common as possible.
We use the English glyph image dataset, proposed by [Azadi et al., 2018], which contains 32,046 synthetic artistic fonts, each with 26 glyphs as shown in Figure 6(a), to pre-train our model. The same test set as MC-GAN is used to fine-tune our model in few-shot learning, which contains 35 professional-designed English fonts with special text effects. Note that the English glyph image dataset only contains 26 capital letters.
Moreover, we also build a new publicly-available Chinese glyph image dataset for our experiments to verify our model’s extendibility. For the pre-training set, as shown in Figure 6(b), we first render glyph images for 639 representative Chinese characters in 246 normal Chinese font styles. Then, to convert them into glyph images with textures, we apply gradient colors and various stripe textures on the original binary images. Specifically, 10 different kinds of gradient colors or stripe textures are applied to each font style. Overall, the dataset contains 1,571,940 different artistic glyph images. For few-shot learning, we select 35 artist-designed fonts with textures as the test set and each font consists of 7,326 Chinese characters.
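The dataset sizes quoted above follow directly from the stated counts, as this short check shows. The variable names are hypothetical; the numbers are the ones given in the text.

```python
# Pre-training set: every base font is rendered with every effect.
characters_per_style = 639
base_font_styles = 246
effects_per_font = 10  # gradient colors or stripe textures

synthetic_styles = base_font_styles * effects_per_font
pretraining_images = synthetic_styles * characters_per_style

# Few-shot test set: artist-designed fonts with a larger character set.
designed_fonts = 35
characters_per_designed_font = 7326
designed_images = designed_fonts * characters_per_designed_font

print(synthetic_styles, pretraining_images, designed_images)
```

The 2,460 synthetic styles and 1,571,940 pre-training images match the figures in the text, and the 35 designed fonts contribute a further 256,410 images.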
In our experiments, the generator of the proposed AGIS-Net uses six convolution layers [Krizhevsky et al., 2012] and six up-convolution (transposed convolution) layers [Dumoulin and Visin, 2016], each equipped with Instance Normalization [Ulyanov et al., 2016] and ReLU [Nair and Hinton, 2010]. We follow the structure of the discriminators in Pix2Pix [Isola et al., 2017] to design our three discriminators, which output score maps instead of a single value. All images have the same fixed size, except for the patches fed into the Local Discriminator, which are smaller. The weights in the loss function are kept unchanged during the two training stages.
In pre-training, we train for 20 epochs on the English glyph image dataset and 10 on the Chinese dataset. The batch size is 100 for both. When implementing few-shot learning (i.e., fine-tuning on a specific artistic font style), for English glyph images in each font style, we use a batch size of 26 with 3,000 training epochs and validate the model every 50 epochs. We explore the 26 capital letters in the English dataset, most of which do not have ground truth images during few-shot learning. For Chinese glyph images in each font style, we use a batch size of 100 with 500 training epochs, exploring 500 characters in the dataset; similarly, only a few of them have ground truth images.
We compare our method with four recently proposed image-to-image translation approaches. The first two are leading general-purpose image-to-image translation models, and the last two are state-of-the-art artistic font style transfer models. We directly use the source codes and default settings provided by the authors for these methods which are briefly described as follows.
BicycleGAN [Zhu et al., 2017b]: BicycleGAN learns a multi-modal mapping between two image domains, and can output diverse images from one input image.
MS-Pix2Pix [Mao et al., 2019]: Mode Seeking GAN (MSGAN) utilizes a novel mode seeking regularization term to address the mode collapse issue for cGANs. We compare our method with the Pix2Pix-based MSGAN, which we call MS-Pix2Pix.
MC-GAN [Azadi et al., 2018]: MC-GAN uses a stacked conditional GAN to transfer both the shape and texture styles of glyph images, solving this challenging task for the first time and getting impressive results. However, MC-GAN can only handle 26 English capital letters and is hard to scale up.
TET-GAN [Yang et al., 2019]: TET-GAN consists of a stylization subnetwork and a destylization subnetwork. It learns to disentangle and recombine the content and style features of text effects images, through processes of style transfer and removal. TET-GAN uses the shape outline as a guidance to transfer the texture style only. For fair comparison, we directly apply their method on our datasets, meaning that the model is required to transfer the glyph shape style as well.
For quantitative evaluation, we adopt four commonly-used metrics in many image generation tasks: Inception Score (IS) [Salimans et al., 2016], Fréchet Inception Distance (FID) [Heusel et al., 2017], structural similarity (SSIM) index and pixel-level accuracy (pix-acc). Specifically, IS is used to measure the realism and diversity of generated images. FID is employed to measure the distance between two distributions of synthesized glyph images and ground truth images, while SSIM aims to measure the structural similarity between them. Since FID and IS can not directly reflect the quality of synthesized character images, we also use the pixel-level accuracy (pix-acc) to evaluate performance. Higher values of IS, SSIM and pix-acc are better, whereas for FID, the lower the better.
As mentioned above, we pre-train all models on the datasets with glyph images in synthetic artistic font styles. Figure 7 shows some results generated by our method, BicycleGAN [Zhu et al., 2017b] and MS-Pix2Pix [Mao et al., 2019], respectively, during pre-training. Since we shuffle the input data after every epoch, the content and style may not be the same for all models in the same iteration. As we can see, although the ground truth images of all explored characters are available during training, BicycleGAN and MS-Pix2Pix can still hardly learn the style information and often synthesize poor-quality glyph images. BicycleGAN [Zhu et al., 2017b] can capture the content information and output roughly correct glyph shapes for the English dataset, but it cannot process the style information well. The more recent work, MS-Pix2Pix [Mao et al., 2019], cannot even handle the content information well, and sometimes outputs glyph images with incorrect contents. What is worse, both methods are prone to mode collapse, and cannot even converge on the more challenging Chinese dataset. It is clear that our model's learning ability on this specific task is significantly better than that of the general-purpose image-to-image translation models (e.g., BicycleGAN and MS-Pix2Pix). Therefore, results of these two models on the more challenging few-shot learning task are not shown here; they can be found in the supplementary material.
Our goal is to generate novel glyph images for characters in an artistic font using just a few input samples. The few-shot size and the style input size can clearly affect the synthesis performance. Therefore, we conduct extensive experiments to find an optimal setting. Here we examine the effects of two quantities: the size of the few-shot reference set and the size of the style input set. All the few-shot reference sets for each artistic font style are randomly selected in this experiment.
In Figure 8 and Table 1, we compare the performance of our method under 6 pairs of settings on the English dataset. From both the qualitative and quantitative results, we can see that with more training samples available, the quality of the results synthesized by our method improves, with clearer contours and smaller fuzzy regions. Even when given very few samples, the synthesis results are already visually pleasing. For the style input size, the difference is less obvious, but if the size is too small, the results become less satisfactory. Interestingly, the Inception Scores of our method in all cases are higher than that of the ground truth, which would imply that the ground truth is the least realistic and diverse. This is obviously meaningless, so the notions of realism and diversity are not suitable for this glyph image synthesis task, and among the four metrics, IS has little reference value here. It is also necessary to explain why the relative differences in SSIM and pix-acc are small. Our task focuses on generating glyph images in which there are many white pixels outside the glyph, and thus a small change in those values may correspond to a dramatically large change in visual appearance (see Figure 8 and Table 1, comparing few-shot size 3 with style input size 2 against few-shot size 5 with style input size 4). Since the number of model parameters increases with larger few-shot and style input sizes, we fix a single pair of few-shot size and style input size in the later experiments on the English glyph image dataset, to balance image quality and model size.
Similar experiments are also conducted on the Chinese dataset. We compare the performance of our method with several few-shot sizes (including 30, 60 and 100) and several style input sizes (including 8). Due to the page limit, here we only show some synthesis results for a single style input size in Figure 9, which demonstrate the effectiveness of the proposed method in synthesizing Chinese glyph images. As for English, we also provide quantitative results on the Chinese dataset in Table 2, where some of the compared settings yield results of almost the same quality. Although the quality of the results can be improved with more training samples, glyph images synthesized by our method with few-shot size 30 are already good enough for practical use. Similarly, in the following experiments conducted on the Chinese dataset, the size of the few-shot reference set and the style input size are fixed.
In this section, we analyze the influences of different content inputs and different style few-shot reference sets.
As mentioned before, the content input is used to specify which character the model synthesizes. Along with the Code New Roman font used by MC-GAN, we select three extra fonts: Courier, Noteworthy and Marker Felt. From the right part of Figure 10, we can see that the differences among glyph images of the same character synthesized using content inputs in different font styles are generally insignificant. This is mainly because the content encoder is trained to extract content information while ignoring style information during both the pre-training and fine-tuning procedures.
Apart from the same style few-shot reference set as MC-GAN, we randomly select two extra reference sets. As we can see, all results shown in Figure 11 have consistent color and texture. For some characters in the first font, such as ’A’, ’H’, and ’Z’, the shape styles of the results are inconsistent, which means that different reference sets lead to slightly different synthesized glyph shapes. However, for the second font, in which most characters share the same shape style, there is no marked difference between the results obtained with different reference sets. This experiment shows that our style encoder can successfully extract the common features of the given style few-shot reference set, especially for fonts in which all characters share the same shape style.
In this section, we perform experiments to verify the effectiveness of each key component in our model. In Figure 12, we demonstrate the effects of the contextual loss, the local texture refinement loss and the skip connections. Stylized glyph images in the fifth row are generated by the model without skip connections. Comparing the images in the second row, synthesized by our full AGIS-Net, with those in the fifth row, we can see that the glyph shape style can be handled well without skip connections, but not the texture style, such as the colors for both fonts and the white lines for the second font. We therefore conclude that the skip connections play an important role in capturing and recovering texture style information. Comparing the synthesis results in the second and third rows shows how the local texture refinement loss performs. Taking ’J’ and ’V’ in the first font and ’C’, ’D’, ’O’ and ’U’ in the second font as examples, texture details become poor and much noise appears when the local texture refinement loss is not adopted. The effectiveness of the contextual loss in this specific task can be verified by comparing the synthesis results in the third and fourth rows. We can see that the contextual loss helps to maintain more shape outline style information, for example in ’A’, ’D’, ’N’ and ’Z’ in the first font and ’H’, ’N’ and ’S’ in the second font.
Furthermore, we also provide quantitative results (see Table 3) evaluated on all 35 English artist-designed fonts, which reflect the influence of each component on the whole model. The values of these metrics clearly demonstrate that the novel skip connections in the generator, which make the two encoder-decoder branches work collaboratively, play the most important role in this task. Secondly, the local texture refinement loss also makes a strong contribution to generating high-quality local texture details and noise-free synthesis results. Last but not least, the contextual loss, which does not need spatial alignment, helps the model produce better shape outlines. The losses computed on the validation set during fine-tuning, shown in Figure 13, lead to the same conclusion about the effects of these key components.
In this section, we compare the performance of our model with other existing methods. Currently, the work most relevant to our method is MC-GAN [Azadi et al., 2018], which also provides the state-of-the-art performance for stylized glyph image synthesis. More recently, Yang et al. [2019] proposed TET-GAN, which shows promising results on the task of text effects transfer. However, it requires the corresponding binary glyph image as input reference, unlike our task where the shape of the glyph is also synthesized. For a fair comparison, we choose the same few-shot learning data as the default setting of the original MC-GAN [Azadi et al., 2018] and fix the font style of the input content images to the standard Code New Roman in all three models. We then generate the glyph images of all 26 characters in the above-mentioned 35 English artist-designed fonts using the different methods and compare their performance both qualitatively and quantitatively.
As shown in Figure 14, our method clearly outperforms all other existing approaches in both glyph shape and texture style transfer. Moreover, our method successfully decouples content and style, producing stylized glyph images based on the given contents and styles. As mentioned before, TET-GAN [Yang et al., 2019] requires binary stylized glyph images as input reference, and thus it performs poorly under our experimental settings. MC-GAN [Azadi et al., 2018] synthesizes reasonable results whose general shape and texture styles are successfully transferred from the input glyph images. However, the details of its images are not well synthesized: they contain edge noise and incomplete shape outlines.
Although visual appearance reflects the quality of synthesis results more intuitively in an image generation task, quantitative evaluation metrics give a higher-level indication of performance on the whole dataset. Table 4 shows the quantitative comparison of our method and the other two approaches evaluated on the above-mentioned 35 English artist-designed fonts. In addition to the four evaluation measures, we also conduct a user study. Specifically, for each artistic font style we randomly select 5 characters which are not contained in the style few-shot reference set. Then, for each character, a participant is asked to choose, among the 3 glyph images synthesized by the 3 methods, the one that has the best quality and the style most similar to that of the reference set. In total, 60 participants finished all questions of this user study. Statistical results are shown in the 6th column of Table 4. We also list the number of parameters for all models in Table 5. We can see that our model performs much better while requiring fewer parameters than other state-of-the-art methods. We also observe that some general-purpose models, such as BicycleGAN [Zhu et al., 2017b] and MS-Pix2Pix [Mao et al., 2019], cannot perform well even on the pre-training datasets despite having fewer parameters, since they are fundamentally unsuited to this specific task.
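The size of the user study and the way a preference rate can be tallied are sketched below. This is only an illustration under stated assumptions: it assumes that the study covers all 35 fonts with 5 characters each, the method labels `method_a`, `method_b` and `method_c` are placeholders, and `toy_votes` is made-up data that does not reproduce any value in Table 4.

```python
from collections import Counter

NUM_FONTS = 35          # artist-designed English fonts (from the text)
CHARS_PER_FONT = 5      # characters sampled per font style (from the text)
NUM_PARTICIPANTS = 60   # participants who finished the study (from the text)


def preference_rates(votes, methods):
    """Return the share of votes received by each method."""
    if not votes:
        raise ValueError("votes must not be empty")
    counts = Counter(votes)
    return {m: counts[m] / len(votes) for m in methods}


# Study size, assuming every participant answers one question per sampled character.
questions_per_participant = NUM_FONTS * CHARS_PER_FONT
total_votes = questions_per_participant * NUM_PARTICIPANTS

# Placeholder method names and toy votes, for illustration only.
methods = ["method_a", "method_b", "method_c"]
toy_votes = ["method_a", "method_a", "method_b", "method_a", "method_c"]
rates = preference_rates(toy_votes, methods)

print(questions_per_participant, total_votes)
for name in methods:
    print(f"{name}: {rates[name]:.2f}")
```

Under these assumptions each participant answers 175 questions, giving 10,500 votes in total. The reported user-study column would then correspond to each method's share of those votes.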
As mentioned before, our method is suitable for the artistic glyph image synthesis task in any writing system. Both the style input and the content input are flexibly controllable. In this section, we conduct experiments on the Chinese glyph image dataset to verify the extendibility of our AGIS-Net.
In Figure 15 and Table 6, we show some experimental results of different methods trained and tested on the Chinese dataset. Due to the fact that MC-GAN [Azadi et al., 2018] can only handle 26 Latin capital letters, here we just compare our method with TET-GAN [Yang et al., 2019]. We can see that glyph images synthesized by our method not only precisely inherit the corresponding font’s overall and detailed styles but also clearly represent the correct contents of characters. On the contrary, neither the shape style nor texture style can be satisfactorily transferred by TET-GAN, indicating that our method performs significantly better than TET-GAN in this specific task.
To verify the generalization ability of our model, we conduct the following two experiments.
We conduct this experiment to show that besides the glyphs explored in training, our model also has the ability of generating high-quality glyph images for unexplored characters, which are unseen during both pre-training and few-shot learning procedures. For the English dataset, all lowercase letters are unexplored. As we can see from Figure 16, for the lowercase English characters (the first part) which are similar to their uppercase counterparts, our model can generate satisfactory results. For most lowercase letters (the second part) which are quite different from their uppercase versions, our model can also generate reasonable results. For some characters (the third part), our model only works well for some styles. As shown in the last part of Figure 16, our method fails when handling two lowercase letters: ’a’ and ’m’. For ’a’, which is quite similar to ’o’ in this content input font style, our model tends to synthesize results like ’o’ for all styles. For ’m’, incorrect synthesis results are obtained due to its unique shape structure with 3 vertical lines that is quite different compared to other characters.
A similar experiment is also conducted on the Chinese dataset. As shown in Figure 17, the styles of the glyphs in the first two rows come from the pre-training set, where glyph images of 500 characters in different styles are used to train our model, and the styles of the glyphs in the last two rows come from the fine-tuning set, where only 30 stylized glyph images are available for training. We then input glyph images of some content reference characters that are unseen in both the pre-training and fine-tuning procedures, and obtain the corresponding synthesized glyph images. Surprisingly, the quality of the synthesis results for these unexplored characters is comparable to that of the explored ones (see Figure 17).
The idea of cross-language evaluation is similar to the former experiment. We pre-train our model on the Chinese glyph image dataset and fine-tune the model on specific Chinese artistic font styles. Then, we feed the learnt model with Japanese and Korean glyph images from the Omniglot dataset [Lake et al., 2015] as content input. As we can see from Figure 18, the synthesis results of our model for Japanese and Korean characters are also impressive.
Experimental results demonstrate that our model has powerful generalization ability, and is potentially suitable to handle many other relevant tasks (e.g., painting synthesis). These results also verify that the proposed model is capable of disentangling the content and style for glyph images.
Our model sometimes fails to generate satisfactory synthesis results. As shown in Figure 19, the quality of the generated Chinese glyph images in the first three columns is poor due to color inconsistency, probably because the model falls into a local minimum during training. This could be fixed by tuning the hyper-parameters. Our model also does not perform well for the other two English fonts, as their shape styles are so unique that there exists a huge style gap between them and the pre-training data.
The task of shape style transfer is much tougher than texture style transfer. To see how the shape decoder performs, we remove the texture decoder and from our model. Here we compare our model with a recently proposed method, EMD [Zhang et al., 2018], which also aims at separating content and style. We select several fonts from the test set of the English pre-training dataset for the few-shot learning task. As shown in Figure 20, our model clearly outperforms EMD on all these fonts, indicating the effectiveness and superiority of our model for shape style transfer. However, there also exist some failure cases when using our model for glyph shape style transfer. For example, our synthesis results (e.g., ’A’, ’M’ and ’Z’) shown in the fifth row of Figure 20 look obviously different from their corresponding ground truth shapes. In fact, neither the proposed model nor any other existing approach can satisfactorily transfer the styles of some fonts, in which most characters have their own unique shape style and/or local details.
In this paper, we proposed a novel one-stage few-shot learning model for artistic glyph image synthesis. The proposed AGIS-Net only needs a small number of training samples as input, and then high-quality glyph images in the same artistic style as training data can be synthesized for any other characters. We also built a new large-scale Chinese glyph image dataset for performance evaluation. Experiments on two publicly available datasets demonstrate that our model is capable of generating high-quality bitmap images of characters while maintaining content information and style consistency. It should be pointed out that those bitmap images cannot be directly used to create a font consisting of vector images of characters which are perfectly scalable. How to automatically synthesize an artistic font that contains large numbers of glyph vector images using just a few samples is an interesting and challenging direction for future research.
Image-to-image translation with conditional adversarial networks. In CVPR.
Perceptual losses for real-time style transfer and super-resolution. In ECCV.