Unlike English, whose alphabet consists of only 26 letters, Chinese has a very large number of characters. According to official statistics, there are 91251 Chinese characters in total, of which 3500 are frequently used. With so many characters, manually designing a new font is time-consuming and error-prone. It is therefore important to design a model that automatically generates Chinese characters with a specified font style and explicit semantic information.
Previous studies have made significant progress on character generation. Many approaches [22, 9, 10, 29] focus on analyzing and extracting stroke features, while other methods [28, 18, 25, 1] treat font transfer as an image-to-image task. However, three challenges remain. First, unlike typical image generation tasks, Chinese character synthesis is very sensitive to changes in content: any slight change may alter the meaning of the character. Second, large topological variations make translation between different fonts more challenging. Finally, because there are so many font types, it is impractical to learn a separate mapping function for every pair of fonts. Implementing multiple font translations within one framework is therefore particularly important.
These challenges motivate us to propose an effective method that automatically generates more accurate and stable characters. We formulate the process of transferring a standard font (e.g. SimSun) to other fonts as character stylization. Conversely, character de-stylization is defined as mapping various fonts to a standard font (e.g. SimSun). Unlike character recognition, character de-stylization is based on image-to-image translation and offers a new way to normalize various fonts to a standard font. We believe it will facilitate text digitization and subsequent editing and sharing.
Specifically, as illustrated in Fig. 1, our model takes two images of different fonts and contents as input and decouples each of them into a style representation and a content representation. We then exchange these two sets of variables and generate characters that match the specified style and content. We incorporate character stylization and de-stylization in a universal framework, which allows the two tasks to be trained simultaneously. Moreover, we introduce a font consistency module (FCM) to encode the various styles into their respective Gaussian distributions. On the one hand, it constrains the training dataset globally, encouraging the style variables of the same font to follow the same distribution and those of different fonts to follow different distributions. On the other hand, in the testing phase, the style variable can be obtained either by encoding a reference image with the style encoder or by sampling. In addition, compared with the standard font, the topology of some calligraphy or handwritten fonts changes significantly. As a result, characters generated during de-stylization often have missing or confused strokes. To alleviate this problem, we pre-train a model called the content prior module (CPM) to provide additional constraints for the content encoder. CPM encourages the encoder to be optimized in a direction that makes the standard font easier to synthesize.
The main contributions can be summarized as follows.
We propose a novel unified framework called FontGAN for modeling Chinese character stylization and de-stylization together. Unlike character recognition, character de-stylization provides a new way to normalize multiple fonts into one standard font, which will facilitate text digitization and subsequent editing and sharing.
We decouple the image into a style representation and a content representation. A content prior module (CPM) is designed to improve the stability and accuracy of character de-stylization. In addition, we propose a font consistency module (FCM) to encode the same font into the same Gaussian distribution, which allows us to obtain the specified font style by sampling during testing.
Our model can handle one-to-many, many-to-one and many-to-many character generation tasks. Extensive experiments are performed to demonstrate the superior performance of our method over the state-of-the-art method.
2 Related work
2.1 Generative Adversarial Network
Generative adversarial network (GAN) [3] generally consists of two parts: a generator and a discriminator. With adversarial training, the discriminator forces the generator to capture the distribution of real data. In recent years, GAN has achieved impressive results in various applications such as image synthesis, image super-resolution and domain adaptation. Based on GAN, conditional GAN (CGAN) was proposed to provide additional guidance information, which encourages GAN to generate samples in a desired direction. Many excellent CGAN-based structures are conditioned on discrete labels [15, 8], images [16, 26, 27], texts [17, 24, 11] and so on. Pix2pix [5] is a classical CGAN-based framework for image-to-image translation, which applies paired data, skip-connections and PatchGAN to achieve high-quality results. Pix2pix has demonstrated competitive performance in a variety of image translation tasks, including sketch-to-photo, semantic segmentation and colorization. In our work, we design a CGAN-based structure and treat character stylization and de-stylization as an image-to-image translation task.
2.2 Disentangled Representation Learning
One of the goals of disentangled representation learning is to disentangle the underlying factors of data variation, and there is a great deal of literature in this field. InfoGAN learns meaningful representations by maximizing the mutual information between the latent codes and the generated images. Tran et al. [20] propose DR-GAN and explicitly disentangle the identity representation for pose-invariant face recognition and face synthesis. DRIT achieves unpaired diverse image-to-image translation by decomposing the input image into a domain-invariant content space and a domain-specific attribute space. Benefiting from disentangled representation learning, a character can be disentangled into a content-related code and a font-related distribution, which helps the model learn the different components better.
2.3 Chinese Character Synthesis
Many works on Chinese character synthesis [22, 9, 10, 29, 14] rely on stroke extraction. Xu et al. [22] synthesize handwriting-style images by analyzing stroke shapes and character topology. Zong et al. [29] present StrokeBank, a dictionary that maps components in a standard font to a particular handwriting. Lian et al. [10] aim to provide an effective system for generating a personal handwriting font library. In their work, stroke attributes are learned by artificial neural networks (ANNs), and the system needs only a small number of samples carefully written by a person. In addition, some methods [28, 18, 12, 25, 1] treat font synthesis as an image-to-image translation task. Zi2zi is a GAN-based model for one-to-many font generation that achieves state-of-the-art performance. It applies the pix2pix structure and category embedding to generate the desired font conditioned on a discrete label, and a classification module is added to the discriminator to calculate the category loss.
3.1 Model Overview
Notation: The model operates on a source font image and a target font image, each paired with a reference image used for the pixel-wise loss. The proposed subnets are the content encoder, style encoder, stylization decoder, de-stylization decoder, target font discriminator and source font discriminator.
The proposed FontGAN follows a typical encoder-decoder architecture for character synthesis. In the framework, we introduce two different types of encoder, i.e., content encoder and style encoder, as shown in Fig. 1. The source font and target font share the content encoder and style encoder. This is because the content is style-irrelevant, and the style encoder has the ability to learn the representation of various fonts.
Specifically, during the training phase, the input images are first encoded into latent variables: the content encoder and the style encoder produce a content variable and a style variable for the source-domain image, and likewise a content variable and a style variable for the target-domain image.
The content variables and style variables are then exchanged and combined, and the combined variables are finally fed into the decoders to generate images. One generated image has the content of the source-domain input and the font of the target-domain input. Similarly, the other generated image has the content of the target-domain input and the font of the source-domain input.
After generation, the synthetic images are distinguished by discriminator. This mechanism mainly has two advantages: 1) It incorporates character stylization and de-stylization in one framework; 2) Content representation and style representation are explicitly disentangled.
During the testing phase, there are two ways to generate the desired image. In the first, the style encoder encodes the input image into a style variable, which is then concatenated with the content variable to produce the final result. In the second, the style variable is sampled from the Gaussian distribution, as described in detail in the following subsection.
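The encode, exchange and decode flow described in this subsection can be sketched as follows. This is a minimal illustration in plain Python, not the actual networks: the function `swap_and_decode`, the toy encoders and decoders, and the use of list concatenation to combine the variables are all hypothetical stand-ins.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Net = Callable[[Vector], Vector]


def swap_and_decode(
    x_source: Vector,
    x_target: Vector,
    content_encoder: Net,
    style_encoder: Net,
    stylization_decoder: Net,
    destylization_decoder: Net,
) -> Tuple[Vector, Vector]:
    """Encode both inputs, exchange their style variables, and decode."""
    # The source and target images share the same two encoders.
    c_src, s_src = content_encoder(x_source), style_encoder(x_source)
    c_tgt, s_tgt = content_encoder(x_target), style_encoder(x_target)
    # Stylization: source content combined with target style.
    stylized = stylization_decoder(c_src + s_tgt)
    # De-stylization: target content combined with source style.
    destylized = destylization_decoder(c_tgt + s_src)
    return stylized, destylized


# Toy stand-ins (assumption): the first half of a vector plays the role of
# "content" and the second half plays the role of "style".
def toy_content(x: Vector) -> Vector:
    return x[: len(x) // 2]


def toy_style(x: Vector) -> Vector:
    return x[len(x) // 2:]


def toy_decoder(z: Vector) -> Vector:
    return list(z)


stylized, destylized = swap_and_decode(
    [1.0, 2.0, 10.0, 10.0],
    [3.0, 4.0, 20.0, 20.0],
    toy_content, toy_style, toy_decoder, toy_decoder,
)
```

In the toy run, `stylized` carries the content half of the first input and the style half of the second, and `destylized` the reverse.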
3.2 Network Architecture
Our model mainly consists of two parts: the generator and the discriminator (as shown in Fig. 1). Furthermore, generator contains four subnets: content encoder, style encoder, stylization decoder and de-stylization decoder. In addition, we also design two discriminators for target font and source font. Next, we will introduce these networks in detail.
3.2.1 Encoder Network
To separate the content variable and the style variable, we propose two networks: a content encoder and a style encoder. These networks have a similar structure except for the last module. Specifically, each encoder has 5 down-sampling modules that down-sample the input image with stride 2. Each down-sampling module is composed of a convolutional layer and a residual block, where the convolutional layer implements the down-sampling operation. Unlike the content encoder, the style encoder has a fully connected layer after the final down-sampling module. This fully connected layer outputs the two 128-dim parameter vectors of the Gaussian used for sampling.
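The effect of five stride-2 modules on the spatial size can be checked with simple arithmetic. The sketch below is illustrative only: the helper `downsampled_size` is hypothetical, the input resolutions are example values rather than the resolution used in our experiments, and it assumes padding that rounds the output size up.

```python
def downsampled_size(input_size: int, num_modules: int = 5, stride: int = 2) -> int:
    """Spatial size after a stack of strided down-sampling modules."""
    size = input_size
    for _ in range(num_modules):
        # Assumption: "same"-style padding, so each module yields ceil(size / stride).
        size = (size + stride - 1) // stride
    return size


# Example resolutions (hypothetical): each module halves the side length.
size_from_64 = downsampled_size(64)
size_from_256 = downsampled_size(256)
```

Each module halves the side length, so five modules reduce it by a factor of 32.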
3.2.2 Decoder Network
The stylization decoder and the de-stylization decoder map the combined variables to the final images. The two decoders are structurally identical. Specifically, each decoder is composed of 4 up-sampling modules, each of which consists of a deconvolutional layer with stride 2 and a residual block. Moreover, skip-connections [19] are applied to refine the generated results.
3.2.3 Discriminator Network
In our approach, we feed the discriminator with two types of input pairs: one for positive examples and one for negative examples. The two discriminators are similar in structure. In practice, the discriminator follows the design of pix2pix [5] and consists of a series of down-sampling modules. It down-samples the input image to a lower spatial resolution and then classifies whether each patch is real or fake.
3.3 Loss functions
3.3.1 Adversarial Loss
In order to generate realistic and clear images, we apply adversarial loss to train our model. The generator tries to generate the desired images by minimizing the following objective, while the discriminator aims to distinguish the real image and the synthetic image by maximizing the objective. The objective of the adversarial framework can be written as:
Here, the inputs to the discriminator are concatenated at the channel level, and the style variable used for generation is obtained by sampling the Gaussian distribution corresponding to the target font.
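As a rough illustration of an adversarial objective of this kind, the sketch below evaluates the standard GAN log-likelihood form on discriminator outputs. This form is an assumption, since the exact objective is not reproduced in this text; the function name `adversarial_objective` and the example probabilities are hypothetical.

```python
import math
from typing import Sequence


def adversarial_objective(
    d_real: Sequence[float], d_fake: Sequence[float], eps: float = 1e-12
) -> float:
    """Mean log D(real) plus mean log(1 - D(fake)).

    The discriminator tries to maximize this value and the generator tries
    to minimize it. Inputs are probabilities in [0, 1] that the
    discriminator assigns to "real".
    """
    real_term = sum(math.log(p + eps) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that separates real from fake perfectly scores close to 0.
perfect = adversarial_objective([1.0, 1.0], [0.0, 0.0])
# A discriminator that cannot tell them apart scores 2 * log(0.5).
confused = adversarial_objective([0.5, 0.5], [0.5, 0.5])
```

The objective is highest when the discriminator is always right, which is why the generator's goal is to push it down.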
3.3.2 Pixel-wise Loss
Unlike typical image generation tasks, Chinese character synthesis has stricter requirements for the integrity and accuracy of the generated content. Any change may leave the character with no practical meaning or alter its semantics. Therefore, we impose a pixel-wise loss to constrain the generated results.
We use the L2 loss for image synthesis in our approach because it is more sensitive to pixel-level changes than the L1 loss, which makes it easier to guarantee the integrity of the generated characters.
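The difference in sensitivity between the two losses can be seen on a toy example. This is a minimal sketch, assuming mean-reduced losses over flattened pixel values; the helper names and the pixel values are hypothetical.

```python
from typing import Sequence


def l1_pixel_loss(generated: Sequence[float], reference: Sequence[float]) -> float:
    """Mean absolute pixel difference."""
    if len(generated) != len(reference):
        raise ValueError("images must have the same number of pixels")
    return sum(abs(g - r) for g, r in zip(generated, reference)) / len(generated)


def l2_pixel_loss(generated: Sequence[float], reference: Sequence[float]) -> float:
    """Mean squared pixel difference."""
    if len(generated) != len(reference):
        raise ValueError("images must have the same number of pixels")
    return sum((g - r) ** 2 for g, r in zip(generated, reference)) / len(generated)


reference = [0.0, 0.0, 0.0, 0.0]
# One badly wrong pixel (for example a missing stroke fragment) ...
concentrated = [1.0, 0.0, 0.0, 0.0]
# ... versus the same total error spread thinly over all pixels.
spread = [0.25, 0.25, 0.25, 0.25]

l1_concentrated = l1_pixel_loss(concentrated, reference)
l1_spread = l1_pixel_loss(spread, reference)
l2_concentrated = l2_pixel_loss(concentrated, reference)
l2_spread = l2_pixel_loss(spread, reference)
```

Both cases have the same L1 loss, but the L2 loss is four times larger when the error is concentrated in one pixel, which is the kind of local defect that changes a character's meaning.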
3.3.3 Content Consistent Loss
During the training phase, the content encoder maps the input image into a shared content space, and the content of the generated image should be consistent with that of the original image. Therefore, we formulate the content consistent loss as:
where the compared content variables are the outputs of the content encoder for the corresponding input images.
3.3.4 Category-guided KL Loss
We introduce the font consistency module (FCM) to decompose the style variables from the character. Inspired by the original VAE [6], the optimization objective for our style encoder is to maximize the variational lower bound on the data log-likelihood, which can be written as:
where the bound depends on the variational parameters and the generative parameters, on the style variables, and on the prior distribution over the style variables.
In practice, we assume that the prior over the style variable is an isotropic multivariate Gaussian parameterized by y, a vector filled with the font label. In our experiments, one label value represents the source font SimSun and the other values represent the target fonts, up to the number of fonts. On the one hand, this ensures that the style variables of the same font are encoded into the same Gaussian distribution. On the other hand, in the testing phase, the style variables can be obtained by sampling as well as by directly encoding a reference image.
Specifically, we can optimize Eq. (10) by minimizing the following KL-loss:
where the two terms denote the source font KL-loss and the target font KL-loss, respectively. The Gaussian parameters are the outputs of the style encoder, and the style variable is sampled from the resulting distribution by combining these parameters with standard Gaussian noise. The loss is computed over all dimensions of the style variable.
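A KL-loss of this kind and the accompanying sampling step can be sketched as follows. This is a hedged illustration rather than the exact formulation: it assumes that the style encoder outputs a mean and a log-variance, and that the prior is a unit-variance Gaussian whose mean is the font label repeated in every dimension. The function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def kl_to_label_prior(mu: Sequence[float], log_var: Sequence[float], label: float) -> float:
    """KL( N(mu, diag(exp(log_var))) || N(label * 1, I) ), summed over dimensions."""
    return 0.5 * sum(
        math.exp(lv) + (m - label) ** 2 - 1.0 - lv
        for m, lv in zip(mu, log_var)
    )


def sample_style(mu: Sequence[float], log_var: Sequence[float], rng: random.Random) -> List[float]:
    """Reparameterized sample: mu + sigma * eps with eps drawn from N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


# An encoding that already matches the prior of font label 3 has zero loss.
kl_matched = kl_to_label_prior([3.0] * 4, [0.0] * 4, label=3.0)
# The same encoding is penalized under the prior of a different font (label 0).
kl_mismatched = kl_to_label_prior([3.0] * 4, [0.0] * 4, label=0.0)

style = sample_style([3.0] * 4, [0.0] * 4, random.Random(0))
```

Under these assumptions, minimizing the loss pulls the encodings of each font towards that font's own Gaussian, so a style variable can later be drawn from the prior without a reference image.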
3.3.5 Font Label Preserving Loss
To further disentangle the style variables of different fonts, we apply the label preserving loss, which constrains the style encoder to encode different fonts into their respective distributions more accurately. We compute this loss function as follows:
where the loss compares the predicted output with the font label.
3.3.6 Latent Regression Loss
3.3.7 Content Prior Loss
For the de-stylization process, the generated results are often unsatisfactory because of the large difference in topology between the target font and the standard font. Therefore, we propose a content prior module (CPM) (as shown in Fig. 2), which provides a content prior for our framework. First, we pre-train a simple model that is similar to our final structure. SimSun and SimKai are treated as the source font and the target font, since they are the most common fonts and their layout is very simple. Using the loss functions mentioned above, we obtain a trained content encoder. Second, when we train our final model, this pre-trained encoder guides the content encoder of the final model to encode the content of the target font into the content prior space, where the content variables are more likely to generate satisfactory SimSun characters. Finally, the loss function can be written as:
To sum up, the full objective of our model can be expressed as:
where the weighting coefficients control the relative importance of the above objectives.
In this section, we first introduce the implementation details, datasets, baseline methods and evaluation metrics. Then, extensive qualitative and quantitative experiments are conducted to validate the effectiveness of our approach. In addition, we conduct a series of experiments including transfer between arbitrary fonts and inference with novel fonts. Finally, we perform ablation experiments to demonstrate the necessity of each component of our model.
4.1 Experiment Setup
4.1.1 Implementation Details
We train the network for 100 epochs using the Adam optimizer with a batch size of 32 on a TITAN XP GPU. The learning rate is initialized to 0.0002 and is halved every 20 epochs. In our experiments, the weighting parameters in Eq. (16) are kept fixed. Moreover, all images are resized to a common resolution and normalized to a fixed range.
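The learning-rate schedule stated above can be written as a one-line step decay. This is a minimal sketch; the function name `learning_rate` and the zero-based epoch indexing are assumptions.

```python
def learning_rate(epoch: int, initial_lr: float = 0.0002, halve_every: int = 20) -> float:
    """Step decay: start at initial_lr and halve it every `halve_every` epochs.

    Assumption: epochs are counted from 0, so epochs 0-19 use the initial rate.
    """
    return initial_lr * 0.5 ** (epoch // halve_every)


# Learning rate at the start of each 20-epoch stage of a 100-epoch run.
schedule = [learning_rate(epoch) for epoch in range(0, 100, 20)]
```

Over 100 epochs the rate is halved four times, so the final stage trains at one sixteenth of the initial rate.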
For the content prior module, we collect SimSun and SimKai to train this module, with each font consisting of 6000 characters. For FontGAN, because there are no existing public datasets for Chinese character synthesis, we build a dataset of 50 fonts, each containing 2000 Chinese characters. None of the characters in the test set appear in the training set.
4.1.3 Baseline Methods
Four previous methods are adopted as baselines: Rewrite, pix2pix, zi2zi and CGAN. Rewrite combines convolutional layers and max-pooling layers to implement translation between two fonts. Pix2pix is a commonly used model in the field of image translation. Zi2zi is the state-of-the-art method in Chinese font translation, which applies the pix2pix structure and category embedding to generate the target font. Conditional GAN (CGAN) takes the source font image as input and uses a one-hot vector as the condition to generate characters. In our experiments, CGAN uses the FontGAN network as its main framework.
4.1.4 Evaluation Metrics
We use Multi-Scale Structural Similarity (MS-SSIM) [21], Local Distortion (LD) [23], L1 loss and OCR accuracy as the evaluation metrics. MS-SSIM is a widely used image evaluation metric that measures the similarity between two images. LD evaluates local distortion via dense SIFT flow. L1 loss measures pixel-level deviations. OCR accuracy is used to quantify the de-stylization results. For MS-SSIM and OCR accuracy, a higher value means better performance, while for LD and L1 loss a lower value is better.
4.2 Qualitative Evaluation
4.2.1 Stylization in Simple Cases
Fig. 3 illustrates the stylization results in simple cases (e.g. the printed fonts). Since these fonts are standard and there are few joined-up characters, most of the methods can successfully transfer SimSun font into target fonts. It is worth noting that our method learns the style of the font more accurately.
4.2.2 Stylization with Large Topology Change
Fig. 4 shows the stylization results for some challenging fonts, such as calligraphy and handwritten characters. Since the topology of these characters differs greatly from that of the standard font, font translation becomes much more difficult. Moreover, the large number of joined-up characters and non-standard writings means that the model needs to balance style retention and content consistency. Experimental results show that our method can effectively alleviate this challenge. For example, as shown in the first column of Fig. 4, Rewrite and pix2pix perform poorly, while CGAN, zi2zi and our method generate accurate results, with ours being closer to the target character. We further show details of the generated images in Fig. 6 (a), where the marked strokes indicate that our model produces better results. Fig. 6 (b) demonstrates that our method can alleviate the difficulty of generating complex characters.
With regard to character de-stylization, since Rewrite performs very poorly, we exclude it from the following experiments. As shown in Fig. 5, pix2pix yields results that have the SimSun style, but the contents of the characters are completely confused. The results obtained by CGAN and zi2zi are barely acceptable, as they still suffer from missing or offset strokes. Our method can accurately generate the desired characters.
In summary, our approach outperforms the state-of-the-art method in both character stylization and de-stylization, especially in more complicated situations such as fonts with large topology variations. This is because our model accurately decouples the input image into content variables and style variables and further constrains these two types of variables through CPM and FCM.
4.2.4 Transfer of Arbitrary Fonts
We achieve many-to-many character translation with one framework, for two reasons: (1) FontGAN decomposes the character into a style representation and a content representation; (2) it maps the content of all fonts to the same space. Specifically, we first obtain the content variable using the content encoder, and then feed it along with the style variable to the generator to synthesize the character in the specified style. Fig. 7 shows that, in addition to mapping the standard font to other target fonts, our approach also enables mutual translation between target fonts.
4.3 Quantitative Evaluation
We select 10 fonts, each containing 100 characters, as the test dataset. Table 1 shows the stylization results: compared with zi2zi, our model improves MS-SSIM, LD and L1 loss, which indicates that our method outperforms the baseline in structural similarity, local smoothness and pixel-level similarity. As shown in Table 2, our method also achieves the best performance, outperforming zi2zi by a large margin in MS-SSIM, LD and L1 loss. OCR accuracy also demonstrates the superiority of our model, especially in terms of content consistency.
4.4 Inference with New Fonts
We have further considered extending our model to novel fonts. For character stylization, it takes only a few minutes to fine-tune the trained model with limited data to achieve satisfactory results (see Fig. 8 (a)). With regard to character de-stylization, our model can directly map the input characters into SimSun font without training (see Fig. 8 (b)).
4.5 Ablation Experiments
We also perform a series of ablation experiments to validate the effectiveness of FCM and CPM. Moreover, style/content verification experiments are also introduced.
4.5.1 Effectiveness of FCM.
FCM can further improve the quality of the generated images, but the improvement is not obvious. As shown in Fig. 9(a), the model trained with FCM is more advantageous in terms of style maintenance of character details. Moreover, we introduce FCM to obtain the specified style variables by sampling the Gaussian distribution. The final result is then generated by combining style variables and content variables (as shown in Fig. 10).
4.5.2 Effectiveness of CPM.
CPM is mainly used to alleviate the problem of stroke deficiency or confusion during character de-stylization. As illustrated in Fig. 9(b), we first train our model without CPM and then evaluate it on the test dataset. The model generally yields satisfactory results, but in some cases the strokes are incomplete. In contrast, introducing CPM significantly improves the generation quality.
4.6 Validation Experiments
It is worth noting that we manually construct some wrong characters, or characters that never existed (e.g. the first column of Fig. 11 (a) or the first row of Fig. 11 (b)). Experiments on these characters provide strong evidence for the effectiveness of the style variables and the content encoder.
4.6.1 Effectiveness of Style Variables
We select 2 characters to provide content variables, and then apply 10 style variables to each of these 2 characters. The generated results are shown in Fig. 11 (a). The style variables work on new and different content and yield compelling results, which demonstrates that they are not affected by the content of the characters.
4.6.2 Effectiveness of Content Encoder
We adopt 2 fonts as the reference style variables. The wrong characters are first fed into content encoder to get the content variables, which are then combined with the reference style variables to generate the final images. As shown in Fig. 11(b), the generated contents are highly consistent with the original contents. The content encoder is able to accurately and completely map the contents of the original character to the latent space, which is not influenced by the stroke layout of the character.
4.7 Limitations and Discussion
Although our method is able to yield desirable results in most cases, there are also some failures. For example, when conducting de-stylization inference with new fonts, if the style of the new font is very complicated, the generated result is often unsatisfactory (Fig. 12). We think this is because these characters and standard characters vary greatly in topology, which affects the performance of the content encoder.
In this paper, we propose a unified framework for Chinese character stylization and de-stylization. We formulate the de-stylization process as a many-to-one image translation task. Unlike typical character recognition, our method starts from the perspective of image translation and directly maps various fonts into a standard font (e.g. SimSun), which will facilitate text digitization and subsequent editing and sharing. In addition, we decouple the character into a style representation and a content representation, which is more conducive to learning the corresponding feature representations in the deep space. Furthermore, a font consistency module (FCM) and a content prior module (CPM) are proposed. FCM helps our model learn font style more accurately and allows the style representation to be obtained through Gaussian sampling without a reference image. CPM improves the stability and accuracy of character de-stylization and alleviates the problem of stroke deficiency. Both qualitative and quantitative results demonstrate the effectiveness of our approach.
[1] Generating handwritten Chinese characters using CycleGAN. 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 199–207.
[2] (2016) InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180.
[3] (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
[4] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[5] Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134.
[6] (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[7] (2018) Diverse image-to-image translation via disentangled representations. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 35–51.
[8] (2019) Global and local consistent wavelet-domain age synthesis. IEEE Transactions on Information Forensics and Security.
[9] (2012) Automatic shape morphing for Chinese characters. In SIGGRAPH Asia 2012 Technical Briefs, pp. 2.
[10] (2016) Automatic generation of large-scale handwriting fonts via style learning. In SIGGRAPH Asia 2016 Technical Briefs, pp. 12.
[11] (2018) Semantic image synthesis via conditional cycle-generative adversarial networks. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 988–993.
[12] (2017) Auto-encoder guided GAN for Chinese calligraphy synthesis. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Vol. 1, pp. 1095–1100.
[13] (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
[14] (2017) Automatic generation of typographic font from a small font subset. arXiv preprint arXiv:1701.05703.
[15] Conditional image synthesis with auxiliary classifier GANs. Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2642–2651.
[16] (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
[17] (2016) Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396.
[18] (2017) https://github.com/kaonashi-tyc/rewrite.
[19] (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
[20] (2017) Disentangled representation learning GAN for pose-invariant face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1415–1424.
[21] (2003) Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, Vol. 2, pp. 1398–1402.
[22] (2009) Automatic generation of Chinese calligraphic writings with style imitation. IEEE Intelligent Systems 24 (2), pp. 44–53.
[23] (2018) Multiview rectification of folded documents. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (2), pp. 505–511.
[24] (2017) StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5907–5915.
[25] (2018) Separating style and content for generalized style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8447–8455.
[26] (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232.
[27] (2017) Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pp. 465–476.
[28] (2017) https://github.com/kaonashi-tyc/zi2zi.
[29] (2014) StrokeBank: automating personalized Chinese handwriting generation. In Twenty-Sixth IAAI Conference.