PyTorch Implementation of TCN
The mood of a text and the intention of the writer can be reflected in the typeface. However, in designing a typeface, it is difficult to keep the style of various characters consistent, especially for languages with lots of morphological variations such as Chinese. In this paper, we propose a Typeface Completion Network (TCN) which takes a subset of characters as an input, and automatically completes the entire set of characters in the same style as the input characters. Unlike existing models proposed for style transfer, TCN embeds a character image into two separate vectors representing typeface and content. Combined with a reconstruction loss from the latent space, and with other various losses, TCN overcomes the inherent difficulty in designing a typeface. Also, compared to previous style transfer models, TCN generates high quality characters of the same typeface with a much smaller number of model parameters. We validate our proposed model on the Chinese and English character datasets, and the CelebA dataset, on which TCN outperforms recently proposed state-of-the-art models for style transfer. The source code of our model is available at https://github.com/yongqyu/TCN.
A typeface is a set of one or more fonts, each consisting of glyphs that share common design features (https://en.wikipedia.org/wiki/Typeface). An effective typeface not only allows writers to better express their opinions, but also helps convey the emotions and moods of their text. However, only a small number of typefaces is available to choose from because designing typography is difficult. All characters of a typeface should share the same style without compromising readability. As a result, it takes much effort to make a typeface for languages with a large character set such as Chinese, which contains more than twenty thousand characters.
To deal with this difficulty, we aim to build a model that takes a minimal subset of character images as an input, and generates all the remaining characters in the same typeface of input characters, which is illustrated in Figure 1.
In the field of computer vision, the typeface completion task has not been studied much. Generating character images in the same typeface can be seen as a style transfer problem. Existing style transfer tasks usually extract a style feature from a desired image and combine this new style feature with the content features of the target images, which are kept unchanged. In the typeface completion task, after extracting a style feature from a desired image, we keep that style feature and change the content features of the target images instead. For the typeface completion task, we use the terms style feature and typeface feature interchangeably. However, as existing single style transfer models learn only a single style transfer [Gatys, Ecker, and Bethge2016, Huang and Belongie2017, Li et al.2017, Li et al.2018], we would need to train a separate style transfer model for each character in the set. Training and storing all of these single models for typeface completion is computationally infeasible. The recent work of Choi et al. [Choi et al.2017] addresses this inefficiency but fails to produce high quality character images for typeface completion.
In typeface completion, a large number of classes, such as Chinese characters, must be considered. Style transfer models designed for a small number of classes fail to generalize in the typeface completion task due to the large number of classes in character sets. In the existing model, the input channel becomes $3 + n$, where 3 refers to the RGB channels for a general image and $n$ refers to the number of labels. In our task, however, using one-hot encoded labels makes a model inefficient when $n$ increases.
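To make the inefficiency concrete, the following sketch counts the input channels and first-layer weights when each label occupies its own spatially replicated channel. The function names, the 64 output filters, and the 7x7 kernel are illustrative assumptions, not values taken from the paper.

```python
def onehot_input_channels(image_channels, num_labels):
    """Image channels plus one spatially replicated channel per one-hot label."""
    return image_channels + num_labels


def first_conv_weights(in_channels, out_channels=64, kernel=7):
    """Number of weights in the first convolution (bias ignored).

    out_channels and kernel are assumed values chosen for illustration.
    """
    return in_channels * out_channels * kernel * kernel


# A general RGB image with a handful of style labels.
few_labels = onehot_input_channels(image_channels=3, num_labels=5)

# A gray-scale character image with one label per character (1,000 characters).
many_labels = onehot_input_channels(image_channels=1, num_labels=1000)

# Ratio of first-layer weights between the two settings.
growth = first_conv_weights(many_labels) / first_conv_weights(few_labels)
```

Under these assumptions the first layer alone grows by roughly two orders of magnitude, and almost all of its weights are attached to label channels that are zero everywhere.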
To overcome the weaknesses of existing style transfer models and to deal with the large number of classes, we propose a Typeface Completion Network (TCN) that generates all characters in a character set from a single model. TCN represents the styles and contents of characters as latent vectors, and uses various losses to produce more accurate images of characters. We show that TCN outperforms the baseline models on Chinese and English datasets in terms of a task-specific quantitative metric and image quality.
Generative Adversarial Networks (GAN) [Goodfellow et al.2014] have been highlighted as one of the hottest research topics in computer vision. A GAN generates images using an adversarial loss with a deep convolutional architecture [Radford, Metz, and Chintala2015, Goodfellow2017]. GANs have gained popularity and resulted in a variety of follow-up studies [Mirza and Osindero2014, Perarnau et al.2016, Arjovsky, Chintala, and Bottou2017, Ledig et al.2017, Chen and Koltun2017].
The style transfer task, one of the image-to-image translation tasks, involves changing the style of an image while retaining its content. Since most existing style transfer models have a fixed pair of input and target styles, they cannot receive or generate styles in various domains using a single model. However, the models of [Mirza and Osindero2014, Choi et al.2017] can take a target style label as an input and generate an image of the desired style using a single model. This reduces the number of parameters in a task that performs transfers into various style domains.
The task of completing a character set from a subset of its characters can also be treated as a style transfer task. Our model changes the content of an input to the content of the target while maintaining the style of the input. This is the same as the existing style transfer setting with the roles of style and content reversed. In addition, since our task requires various content domains, our model is based on a multi-domain transfer model.
With the development of deep learning, many models now focus on character style transfer tasks [Upchurch, Snavely, and Bala2016, Baluja2016]. In particular, there has been research on the style transfer task using GAN in the character image domain [Lyu et al.2017, Chang and Gu2017, Azadi et al.2018, Bhunia et al.2018].
However, the difference between the existing style transfer models for character images and our model is that our model changes the content of the input character, not the style. This changes the number of target labels. For languages with thousands of characters such as Chinese, both the existing single-domain style transfer models and the multi-domain style transfer models that use a one-hot vector as a domain label suffer from parameter inefficiency. To solve this problem, we express the domain label as a latent vector and propose new losses accordingly.
The typeface completion task involves completing the remaining characters of a character set of a single typeface, using one of the characters, $x$. TCN receives a triplet $(x, c_x, c_y)$ as input, where the character label $c_x$ corresponds to $x$ and $c_y$ is the label of the target character. Using the triplet, our model generates a character image $\hat{y}$, corresponding to the character of $c_y$ with the typeface of $x$. Since TCN generates one character at a time, the above generating process is repeated once for each remaining character. The goal of this task is to obtain a model parameter $\theta^*$ that minimizes the difference between $\hat{y}$ and the ground-truth image $y$ while generating the whole character set. The overall formula of the task is as follows.
$$\theta^* = \arg\min_{\theta} \sum_{c_y} d(\hat{y}, y)$$
where $d$ is the distance between the images. We used the L1 metric to measure the distance between the images. From the above optimal $\theta^*$, we can obtain the $\hat{y}$ that is similar to $y$.
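As a small illustration of the objective, the sketch below computes an L1 distance between two images and sums it over all target characters. Images are plain nested lists of pixel intensities, and averaging over pixels is an assumption made here for readability; the function names are hypothetical.

```python
def l1_distance(img_a, img_b):
    """Mean absolute pixel difference between two equally sized images."""
    total, count = 0.0, 0
    for row_a, row_b in zip(img_a, img_b):
        for a, b in zip(row_a, row_b):
            total += abs(a - b)
            count += 1
    return total / count


def completion_objective(generated, targets):
    """Sum of L1 distances over every target character label.

    generated and targets map a character label to an image.
    """
    return sum(l1_distance(generated[label], targets[label]) for label in targets)


# Toy 2x2 "images" for two target characters.
targets = {"B": [[0, 1], [1, 0]], "C": [[1, 1], [0, 0]]}
generated = {"B": [[0, 0], [1, 1]], "C": [[1, 1], [0, 0]]}
objective = completion_objective(generated, targets)
```

Training would search for the model parameters that make this quantity as small as possible across the whole character set.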
Figure 2 shows the training process of TCN. TCN consists of typeface and content encoders, a discriminator, and a generator. The encoders extract desired feature vectors from an image. The two encoders each return a latent vector that captures different information. The generator, along with the two vectors above, receives character labels corresponding to the input and target. Through this process, the generator makes a target character image that has the same style as the input. The discriminator receives the generated image and determines if it is real. In this process, the discriminator also returns the probability that the image corresponds to a certain typeface and character.
In our model, the encoder is divided into a content encoder and a typeface encoder. The content encoder extracts the symbolic representation of a character from an image, and the typeface encoder extracts the typographical representation from an image. Each encoder consists of a ResNet, and a 1-layer fully connected (FC) classifier, and returns the output of the ResNet and classifier, respectively.
The encoders receive the image from the main training and return the latent vectors containing the typeface and the content feature, respectively. Each expression is as follows:
where of and represent the ResNets of the encoders, respectively, so each is a latent vector.
Since the two encoders have the same structure, we don’t know what information each encoder extracts unless we guide them. Therefore, we pretrain the encoders so that each encoder extracts disjoint information.
For pretraining, we perform the classification task to distinguish the typeface and character of an image. The output of the classifier creates a cross entropy(CE) loss, encouraging the output of each ResNet to contain corresponding feature. As shown in Figure 2 (a), the losses from the typeface () and content () encoders are defined as
where of and represent the classifier of encoders, respectively, so each and is a classification result of the typeface and content of , respectively, and and are the typeface and content labels of , respectively.
Although the classification accuracy of each encoder is higher than 90%, redundancy may exist between the two features to some extent because both typeface and content features are generated from the same character image. To solve this problem, we applied a triplet loss [Schroff, Kalenichenko, and Philbin2015] as follows.
where and are ResNet results of the typeface and content, respectively. the typeface is the same, the content is different from . has same content as and has different typeface.
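A minimal sketch of a triplet loss in the FaceNet form, using squared Euclidean distances on plain lists, may help here. The margin value and the function names are assumptions. One reading of the text is that, for the typeface encoder, the positive sample shares the typeface of the anchor and the negative sample shares only its content, with the roles swapped for the content encoder.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on (anchor-positive distance) - (anchor-negative distance) + margin.

    The margin of 1.0 is an assumed value, not one reported in the paper.
    """
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


# The positive is already much closer than the negative: no penalty.
satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 0.0])

# The negative is closer than the positive: the loss is positive.
violated = triplet_loss([0.0, 0.0], [2.0, 0.0], [1.0, 0.0])
```

The loss is zero once the positive is closer to the anchor than the negative by at least the margin, which pushes each encoder to ignore the attribute it should not encode.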
Lastly, we use a reconstruction loss, as in an auto-encoder. The reconstruction loss ensures that no feature is lost during exclusive extraction. The decoder used in the reconstruction task has the same structure as our generator.
In addition to the outputs of the encoder, the generator receives input and target character labels. The generator consists of two submodules: the feature combination submodule that combines the four inputs, and the image generation submodule that generates the image using the combined inputs.
Before generating an image, we combine four inputs. The input of the generator includes the typeface/content feature vectors of the input image (), the character label of the input (), and the target (). Ideally, the typeface encoder should extract only the typeface feature, but due to the structural nature of CNN, the typeface encoder also extracts the content feature. Accordingly, even if we extract the same typeface information from other content, we will obtain a different result. By the combination of the inputs, we want to make the feature vectors the same as the feature vectors obtained from the target image. We thus made the typeface transfer task into an auto-encoder task.
Input character labels are inserted for the multi-domain task. In the multi-domain style transfer task with various domains of input, it is helpful for the model to know the input label with the target label rather than just the target label.
We define the combination function by the following equation:
We define as a 1x1 convolutional network to establish a correlation between each channel. The 1x1 convolutional network better captures correlations than concatenating vectors and requires fewer parameters than a FC layer.
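The following sketch shows what a 1x1 convolution computes: at every spatial position, each output channel is a weighted sum of the input channels. It also compares weight counts against a fully connected layer over the flattened tensor. Feature maps are nested lists indexed as [channel][row][column], and all names and sizes are illustrative assumptions.

```python
def conv1x1(feature_map, weights):
    """Apply a 1x1 convolution without bias.

    feature_map: [C_in][H][W], weights: [C_out][C_in] -> result: [C_out][H][W].
    """
    c_in = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    output = []
    for w_row in weights:
        channel = [
            [sum(w_row[k] * feature_map[k][i][j] for k in range(c_in))
             for j in range(width)]
            for i in range(height)
        ]
        output.append(channel)
    return output


def conv1x1_weights(c_in, c_out):
    """Weights of a 1x1 convolution: shared across all spatial positions."""
    return c_in * c_out


def fc_weights(c_in, c_out, height, width):
    """Weights of a fully connected layer between the flattened tensors."""
    return (c_in * height * width) * (c_out * height * width)


# Two input channels of size 1x2 mixed into one output channel.
mixed = conv1x1([[[1, 2]], [[3, 4]]], [[1, 1]])
```

Because the same small weight matrix is reused at every position, the layer learns cross-channel correlations at a fraction of the cost of a fully connected layer.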
Next, the generator creates the image after receiving the result of the feature combination. The image generation model is composed of deconvolutional layers. We define the image generation function by the following equation:
where can be seen as a vanilla GAN that takes a latent vector and generates an image.
We concatenate the two functions of the generator and define it as , and it can be expressed as follows:
We do not distinguish between and in the future, but we use .
The discriminator takes an image and determines whether the image is a real image or a fake image generated by the generator. This is used as a loss so that the image created by the generator will appear real enough to fool the discriminator.
The discriminator consists of ResNet, as in the encoders. The difference between the discriminator and the encoder is the classification part. For the encoder, there is a separate ResNet for each typeface/content classifier to distinguish the typeface and content. On the other hand, the discriminator uses one ResNet and three classifiers. One returns the T/F probability of whether the image is real, just like the discriminator of the basic GAN. The others determine which typeface and content the input has, as in [Mirza and Osindero2014, Odena2016, Odena, Olah, and Shlens2016, Perarnau et al.2016]. Our discriminator did not use a separate ResNet for each classifier and thus uses fewer parameters and normalizes losses for the three tasks. Another difference is that our discriminator does not return the output of ResNet because it is not necessary.
We define the discriminator as and express the results that correspond to the real image and fake image. The denotes to be associated with a fake image by the generator.
Identity loss is similar to the loss in the auto-encoder in that it helps an output to be equal to an input (Fig. 2 (b)). The generator uses the character label of an input image as both the input label and the target label. This loss prevents possible information loss during feature compression.
Structural SIMilarity index (SSIM) is used to measure the structural similarity between two images. We use SSIM as an evaluation metric for the performance, and we also use it as a loss (Fig.2 (c) red arrow). Using SSIM as a loss was proposed in [Zhao et al.2017, Snell et al.2017]. In our experiments, we applied l1-loss along with the SSIM index.
Adversarial losses that help outputs to look real and deceive the discriminator are the ones that are normally expected from GAN. As mentioned above, there is also a typeface/content loss between the true typeface/content label and the output that the discriminator returns. We show the losses at the end of Figure 2 (c).
is 1 if the input image is real and 0 if otherwise. In Figure 2 (c), will be 0 because the discriminator receives a fake image generated by the generator.
Reconstruction loss, proposed by CycleGAN, is a loss between the original image and the reconstruction image that is translated back to the original image from the typeface-changed image (Fig. 2 (d)). To calculate the reconstruction loss, we use the following loss function:
where is an image that is transferred twice from and has the same typeface and content as .
Reconstruction loss was proposed for a pixel by pixel comparison between images, but we also apply perceptual loss to this concept. A perceptual loss was first proposed by [Johnson, Alahi, and Fei-Fei2016] in the style transfer field. This loss compares high-dimensional semantic information in the feature vector space. Since a character image is an image composed of strokes rather than pixel units, it is appropriate to apply the perceptual loss as follows:
where and is the output of the and , respectively, using .
Reconstruct perceptual loss is the difference between the outputs of the encoder. We compare the input image and the image translated twice, not once. The typeface of the input image and that of the image translated once are the same. However, applying perceptual loss to the two images is not effective because these two images have different content features. Hence, we compare the input image and the twice-translated image with the same typeface/content as the input image.
The final loss of the generator is as follows:
where , and are hyper-parameters that control the importance of each loss.
The learning method of the discriminator is similar to that of the existing GAN. The discriminator receives two types of input: one is a real image and the other is a fake image generated by the generator. For real images, the model computes the classification loss using T/F, typeface, and content output. For fake images, only the classification loss of the T/F output is calculated because the discriminator does not need to take a loss for poor images that the generator makes.
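The asymmetry between real and fake images can be sketched as follows. The discriminator output is modeled as a dictionary with a real/fake probability and two class-probability lists; this layout, the function names, and the unweighted sum of the terms are assumptions made for illustration.

```python
import math

EPS = 1e-12  # keeps the logarithms finite


def binary_cross_entropy(p_real, is_real):
    """Loss for the T/F output, given the predicted probability of 'real'."""
    p = min(max(p_real, EPS), 1.0 - EPS)
    return -math.log(p) if is_real else -math.log(1.0 - p)


def cross_entropy(probs, label):
    """Loss for a class-probability list and the index of the true class."""
    return -math.log(max(probs[label], EPS))


def discriminator_loss(real_out, fake_out, typeface_label, content_label):
    """Real images use all three heads; fake images use only the T/F head."""
    loss = binary_cross_entropy(real_out["real"], True)
    loss += cross_entropy(real_out["typeface"], typeface_label)
    loss += cross_entropy(real_out["content"], content_label)
    loss += binary_cross_entropy(fake_out["real"], False)
    return loss


real_out = {"real": 1.0, "typeface": [1.0, 0.0], "content": [0.0, 1.0]}
fake_out = {"real": 0.0, "typeface": [0.5, 0.5], "content": [0.5, 0.5]}
perfect = discriminator_loss(real_out, fake_out, typeface_label=0, content_label=1)
```

Note that the typeface and content predictions for the fake image never enter the sum, so a poorly generated image cannot distort the two classifiers.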
In training, as shown in Figure 2, we induced the losses through several steps, but in the test, we carry out one step, with only encoders and generator, not using discriminator. The typeface completion task in the test is expressed as follows:
By repeating this equation times according to , we can complete one typeface consisting of characters.
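The test-time procedure can be sketched as a loop over target labels. The encoders and the generator are passed in as plain callables, so every name and signature here is hypothetical; skipping the input character itself is also an assumption.

```python
def complete_typeface(image, input_label, all_labels,
                      typeface_encoder, content_encoder, generator):
    """Generate one image per remaining character from a single input image."""
    # The features are extracted once and reused for every target character.
    typeface_vec = typeface_encoder(image)
    content_vec = content_encoder(image)

    completed = {input_label: image}
    for target_label in all_labels:
        if target_label == input_label:
            continue  # the input character is already available
        completed[target_label] = generator(
            typeface_vec, content_vec, input_label, target_label)
    return completed


# Stub components standing in for the trained networks.
result = complete_typeface(
    image="A-image",
    input_label="A",
    all_labels=["A", "B", "C"],
    typeface_encoder=lambda img: "serif",
    content_encoder=lambda img: img,
    generator=lambda t, c, src, dst: (t, dst),
)
```

The discriminator does not appear anywhere in this loop, which matches the statement that only the encoders and the generator are used at test time.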
Since there are more than 50K characters in Chinese, we chose the top 1,000 most used characters (http://www.qqxiuzi.cn/zh/xiandaihanyu-changyongzi.php). Chinese images were collected from TrueType format (TTF) and OpenType format (OTF) files obtained from the Web (https://chinesefontdesign.com, http://www.sozi.cn). A total of 150 files were manually selected. Not all files contain all of the 1,000 characters, so we are left with a dataset of 137,839 character images.
We also build an English dataset for comparison. We used a total of 907 typographies and 26 uppercase characters. As a result of using the same selection process as the Chinese dataset, we obtained 23,583 images in total. The detailed composition is shown in Table 1.
Unlike general images, a dataset of character images can be used to evaluate the output using an objective metric because character data has all the input-target pairs. We used the Structural Similarity (SSIM) index to objectively evaluate the performance on the character data. SSIM is a metric that measures the quality of images using structural information, and is defined by the following equation:
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$
where $\mu_x$ is the average value of $x$, which denotes the brightness of the image. $\sigma_x^2$ is the variance of $x$, which denotes the contrast of the image. $\sigma_{xy}$ is the covariance of $x$ and $y$, which denotes the correlation of the two images. $c_1$ and $c_2$ are small constants that prevent the denominator from being zero. The closer the score is to 1, the more similar the image is to the original image.
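For intuition, the sketch below evaluates the standard SSIM formula once over whole images given as flat lists of pixel values. Practical implementations usually average the index over local windows, so this single-window version is a simplification. The constants follow the common convention of 0.01 and 0.03 times the dynamic range, squared, which is an assumption about the setup rather than something stated in the paper.

```python
def ssim_global(x, y, dynamic_range=1.0):
    """SSIM computed over the whole image (one window) for flat pixel lists."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n

    # Small stabilizing constants (conventional choice, assumed here).
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2

    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


stroke = [0.0, 1.0, 0.0, 1.0]
inverted = [1.0, 0.0, 1.0, 0.0]
same_score = ssim_global(stroke, stroke)
inverted_score = ssim_global(stroke, inverted)
```

An image compared with itself scores 1, while the inverted pattern has a negative covariance with the original and therefore scores below zero.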
We performed a style transition experiment on the CelebA[Liu et al.2015] dataset to measure the performance of TCN. We used 202,599 images and resized them all to 128x128, as was done with the other dataset. We used three features: black, blond, brown hair colors. The data composition and other settings are the same as those of the baseline [Choi et al.2017].
We selected the learning strategy and hyper-parameters of the models for the experiment. The source code, implemented with PyTorch [Paszke et al.2017], is also available at https://github.com/yongqyu/TCN. The triplet loss of the encoder is conditionally added to the total loss when the accuracy of each classifier exceeds 0.8.
The encoder, discriminator, and generator are all trained using the Adam optimizer, with a learning rate of 0.0002, beta1 = 0.5, and beta2 = 0.999. The learning rate gradually decreases to zero as the number of epochs increases. The image size is set to 128x128. The loss-weight hyper-parameters are set to 1, except for one that is set to 10.
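One way to realize a learning rate that falls to zero is a linear schedule, sketched below. The text does not state the shape of the decay, so the linear form, the optional constant phase before decay starts, and the function name are all assumptions.

```python
def linear_decay_lr(epoch, total_epochs, base_lr=0.0002, decay_start=0):
    """Learning rate held at base_lr until decay_start, then decayed linearly to 0."""
    if not 0 <= decay_start < total_epochs:
        raise ValueError("decay_start must lie in [0, total_epochs)")
    if epoch < decay_start:
        return base_lr
    remaining = total_epochs - epoch
    span = total_epochs - decay_start
    return base_lr * max(0.0, remaining / span)


# Learning rate at the start, the midpoint, and the end of a 100-epoch run.
schedule = [linear_decay_lr(e, total_epochs=100) for e in (0, 50, 100)]
```

In a training loop, the returned value would be written to the optimizer before each epoch; the Adam settings themselves stay as reported above.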
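Table 2: Comparison of the baselines and TCN.
| | CycleGAN | MUNIT | StarGAN | TCN |
|---|---|---|---|---|
| Multi-Domain Available | X | X | O | O |
| Rep. of Domain Index | None | Real-valued Distributed | One-hot | Real-valued Distributed |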
In recent years, CycleGAN[Zhu et al.2017] has obtained outstanding performance in the image-to-image translation task. CycleGAN was the first to use cycle consistency which makes the image that was once converted back to the original domain equal to the original image, as in Equation 13. We also use vector-wise cycle consistency and pixel-wise cycle consistency at the image level.
MUNIT[Huang et al.2018] extracts the typeface and content features as a form of a latent vector using each encoder. By switching these vectors, an image with the desired features can be obtained. MUNIT uses a latent vector, and the reconstruction loss that uses the latent vector is similar to our reconstruct perceptual loss. However, there is a difference in the concept of reconstruction: we translate twice so that the reconstructed image is the same as the original, but MUNIT translates only once.
StarGAN[Choi et al.2017] passes an image with the desired domain label to the generator, like cGAN. To this end, the discriminator returns the true likelihood of the image, along with the domain to which the image corresponds. As a result, StarGAN can generate all characters in one model, like our model. However, unlike our model, StarGAN uses a one-hot vector to represent a content vector. The comparison of the above baselines and TCN is summarized in Table 2.
The single-domain transfer models cannot generate the entire character set using one model. For a fair comparison with the single-domain transfer model, we used two experimental conditions. First, we used sample pairs of characters. In English, Y-G and Q-G pairs were selected to represent the most different and similar pairs, respectively. In Chinese, index number 598-268, and 598-370 pairs were selected. After the sample pair experiments, we compared the performance of our model with that of a multi-domain model on translating all pairs. In these two experimental conditions, we performed the following subtasks: Typeface Completion and Character Reconstruction.
In the style transfer task, our model takes one character image and learns to complete the rest of the character set while maintaining its typeface. We used a single character image as an input for a fair comparison with the other models. We trained our model on Chinese and English character sets, which we mentioned above.
The style transfer experiment allows us to evaluate the performance of the two encoders in extracting the disjoint features. If the typeface encoder extracts the content feature and the typeface feature, the generated image will have the same content of the typeface input. This also applies to content encoders. In training, the model processes every content of a character set. And in the test, every content of the character set can be generated using the extracted typeface feature, even if the input typeface is new to the typeface encoder. Since the character image set has the target pair and the input, we can objectively evaluate the result based on its score.
Reconstruction is the process of regenerating an input image using the typeface and content features extracted from the input image. By the reconstruction, we can check if there is any feature missing when the encoder extracts features. It is also possible to check whether the decoder can effectively combine the two types of features. However, in this task, it is not possible to verify whether each of the features is disjointed or overlapped.
As our model is not limited to character images, we experimented with facial images used for existing style transfer models. In the facial image experiment, the only difference is that there are no content labels in the model, and the rest is the same unless otherwise stated. We take a face image and perform a style transfer experiment that changes the style feature label. We also conducted weighted style transfer experiments on weighted feature labels. We compare the performance of the models on converting an input image to a target image.
The reconstruction performance and style transfer performance of single-domain style transfer models (CycleGAN, MUNIT) vary (Table 3) due to insufficient information of the features. When extracting features from character images, style and content features are duplicated or lost, not being disjoint, which is demonstrated by the style transfer results of these models. The output of the single-domain models is dependent on the input image, so the models achieve high performance in the reconstruction task where the target is the input. On the other hand, in the style transition task, the result appears to be a simple combination of inputs rather than a style transfer.
StarGAN obtained good performance on the general images of the style transfer task, but not on our character dataset (Figure 3 (c)). In StarGAN, the content domain label is a sparse one-hot vector with channel size of the number of characters. It is difficult to simply concatenate this with a gray-scale one-channel image tensor. When calculating a loss on one-hot vectors, the cosine similarity of any two vectors is either one or zero. Therefore, we can only determine whether two labels match. To address this issue, we use a latent vector as a domain label, which has continuous values and thus yields graded similarity scores between vectors. These scores express how similar or different two labels are, which can help the classifier to learn.
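The contrast between the two label representations can be checked directly. In this sketch the dense vectors are made-up values standing in for learned latent labels, and the helper names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def one_hot(index, size):
    """One-hot label vector with a single 1.0 at the given index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# One-hot labels: identical labels give 1, any two different labels give 0.
same_label = cosine_similarity(one_hot(0, 4), one_hot(0, 4))
different_label = cosine_similarity(one_hot(0, 4), one_hot(1, 4))

# Dense latent labels (illustrative values) give a graded score in between.
graded = cosine_similarity([1.0, 0.2], [0.9, 0.3])
```

With one-hot labels every pair of distinct characters looks equally unrelated, whereas dense vectors can place visually similar characters close together.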
Another difference between TCN and the other models is the use of input labels. Adding input labels to the model results in output images that are more similar to the real images. Due to the differences described above, our model outperformed StarGAN by 14% on average, as shown in Table 4. As shown in Figure 4, the generated images keep the typeface of the input image. This holds even for typefaces that were unseen during training: TCN still generates an image similar to the target.
Our model can also be applied to general images. For this experiment, we removed the encoder for style information that has no labels. As shown in Figure 5, the model produces fairly plausible results. This suggests that our model can be used for various applications, and we will carry out further studies accordingly. Our model performs well on the overall features of a style, but it is weaker in areas such as serif and material.
In this paper, we proposed Typeface Completion Network (TCN) which generates an entire set of characters given only a subset of characters while maintaining the typeface of the input characters. TCN utilizes the typeface and content encoders to effectively leverage the information of numerous classes. As a result, TCN learns multi-domain style transfer using a single model, and produces more accurate outputs than existing baseline models. As illustrated in the qualitative analysis, we found that TCN successfully completes the character sets given a small number of characters, which could reduce the costs of designing a new typeface. We also tested TCN on the CelebA dataset to demonstrate its applicability. In future work, we plan to leverage recently proposed state-of-the-art GAN models which have shown to produce sharper images.
International Conference on Machine Learning.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 11, 13.
Image style transfer using convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on.
Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition.