Fonts come in a wide variety of styles, and they can come with different weights, serifs, decorations, and more. Thus, choosing a font to use in a medium, such as book covers, advertisements, documents, and web pages, is a deliberate process by a designer. For example, when designing a book cover, the title design (i.e., the font style and color for printing the book title) plays an important role [14, 25]. An appropriate title design will depend on visual features (i.e., the appearance of the background design), as well as semantic features of the book (such as the book content, genre, and title texts). As shown in Fig. 2 (a), given a background image for a book cover, typographic experts determine an appropriate title design that fits the background image. In this example, yellow will be avoided for title design to keep the visual contrast from the background; a font style that gives a hard and solid impression may be avoided by considering the impression from the cute rabbit appearance. These decisions about fonts are determined by the image context.
This paper aims to generate an appropriate text image for a given context image in order to understand their correlation. Specifically, as shown in Fig. 2 (b), we attempt to generate a title image that fits the book cover image by using a neural network model trained on 104,295 actual book cover samples. If the model can generate a title design similar to the original, the model has captured the correlation between the two, which in turn shows that the correlation exists.
Achieving this purpose is meaningful for two reasons. First, its achievement shows the existence of the correlation between design elements through objective analysis with a large amount of evidence. It is often difficult to catch the correlation because the visual and typographic design is performed subjectively with a huge diversity. Thus, the correlation is not only very nonlinear but also very weak. If we prove the correlation between the title design and the book cover image through our analysis, it will be an important step to understand the “theory” behind the typographic designs. Secondly, we can realize a design support system that can suggest an appropriate title design from a given background image. This system will be a great help to non-specialists in typographic design.
In order to learn how to generate a title from a book cover image and text information, we propose an end-to-end neural network that takes the book cover image, a target location mask, and a desired book title as input and outputs stylized text. As shown in Fig. 2, the proposed network uses a combination of a Text Encoder, Context Encoder, Skeleton Generator, Text Generator, Perception Network, and Discriminator to generate the text. The Text Encoder and Context Encoder encode the desired text and the given book cover context. The Text Generator uses skeleton-guided text generation [29] to generate the text, and the Perception Network and adversarial Discriminator refine the results.
The main contributions of this paper are as follows:
We propose an end-to-end system of generating book title text based on the cover of a book. As far as we know, this paper presents the first attempt to generate text based on the context information of book covers.
A novel neural network is proposed, which includes a skeleton-based multi-task and multi-input encoder-decoder, a perception network, and an adversarial network.
Through qualitative and quantitative results, we demonstrate that the proposed method can effectively generate text that suits the given context information.
2 Related Work
Recently, font style transfer using neural networks has become a growing field. In general, there are three approaches toward neural font style transfer: GAN-based methods, encoder-decoder methods, and neural style transfer (NST)-based methods. A GAN-based method for font style transfer uses a conditional GAN (cGAN). For example, Azadi et al. [2] used a stacked cGAN to generate isolated characters with a consistent style learned from a few examples. A similar approach was taken for Chinese characters in a radical extraction-based GAN with a multi-level discriminator [10] and with a multitask GAN [28]. Few-shot font style transfer with encoder-decoder networks has also been performed [33]. Wu et al. [29] used a multitask encoder-decoder to generate stylized text using text skeletons. Wang et al. [26] used an encoder-decoder network to identify text decorations for style transfer. In Lyu et al. [18], an autoencoder-guided GAN was used to generate isolated Chinese characters with a given style. There have also been a few attempts at using NST to perform font style transfer between text [1, 7, 21, 24].
As an alternative to text-to-text neural font style transfer, there have been attempts to transfer styles from arbitrary images and patterns. For example, Atarsaikhan et al. [1] proposed using a distance-based loss function to transfer patterns to localized text regions. There are also a few works that use context information to generate text. Zhao et al. [32] predicted fonts for web pages using a combination of attributes, HTML tags, and images of the web pages. This is similar to the proposed method in that the context of the text is used to generate the text. Yang et al. [30] stylized synthetic text to become realistic scene text. The difference between the proposed method and these methods is that we infer the font style from the contents of the desired medium and use it in an end-to-end model to generate text.
3 Automatic Title Image Generation
The purpose of this study is to generate an image of the title text with a suitable font and color for a book cover image. Fig. 2 shows the overall architecture of the proposed method. The network consists of six modules: Text Encoder, Context Encoder, Text Generator, Skeleton Generator, Perception Network, and Discriminator. The Text Encoder and Context Encoder extract text and style features from the input text and the cover image, respectively. The Text Generator generates the stylized text suitable for the book cover input, and the Skeleton Generator creates a text skeleton to guide the Text Generator. The Perception Network and Discriminator help refine the output of the Text Generator.
3.1 Text Encoder
The Text Encoder module extracts character shape features from an image of the input text. These features are used by the Text Generator and the Skeleton Generator to perform their respective tasks. As shown in Fig. 3, the Text Encoder input is an image of the target text rendered on a background with a fixed pixel value.
The encoder consists of 4 convolutional blocks with residual connections [9]. The first block consists of two layers of 3×3 convolutions with stride 1. The subsequent blocks contain a 3×3 […] Leaky ReLU [19] is used as the activation function for each layer, with the negative slope set to 0.2. There are skip-connections between the second and third convolutional blocks and the convolutional blocks of the Skeleton Generator, which is described later.
3.2 Context Encoder
The Context Encoder module extracts the style features from the cover image and the location mask image. Examples of the covers and the location mask are shown in Fig. 3. The cover image provides information about the book cover (such as color, objects, and layout) for the Context Encoder to predict the style of the font, and the location mask provides the target location information. It should be noted that the cover image has the text inpainted, i.e., removed. Thus, the style is inferred solely based on the cover and not on textual cues.
The Context Encoder input is constructed of the cover image and the location mask concatenated in the channel dimension. Also, the Context Encoder structure is the same as the Text Encoder, except that the input is only 2 channels (as opposed to 3 channels, RGB, for the Text Encoder). As the input of the generators, the output of the Context Encoder is concatenated in the channel dimension with the output of the Text Encoder.
3.3 Skeleton Generator
In order to improve the legibility of the generated text, a skeleton of the input text is generated and used to guide the generated text. This idea is called the Skeleton-guided Learning Mechanism [29]. This module generates a skeleton map, which is the character skeleton information of the generated title image. As described later, the generated skeleton map is merged into the Text Generator. By doing so, the output of the Text Generator is guided by the skeleton to generate a more robust shape.
This module is an upsampling CNN that uses four blocks of convolutional layers. The first block contains two standard convolutional layers, and the three subsequent blocks have one transposed convolutional layer followed by two standard convolutional layers. The transposed convolutional layers have 3×3 transposed convolutions at stride 2, and the standard convolutional layers have 3×3 convolutions at stride 1. All the layers use Batch Norm and LeakyReLU. In addition, skip-connections from the Text Encoder are attached after the first layer of the second and third convolutional blocks.
To train the Skeleton Generator, a skeleton loss is used to train the combined network. The loss is computed over all pixels between the true skeleton map and the output of the Skeleton Generator, and it is designed based on Dice loss [20].
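The exact formula is not reproduced in the text above, so the following is only a minimal sketch of one common Dice-style loss that matches the description (a sum over all pixels of the true skeleton map and the Skeleton Generator output). The function name `dice_loss`, the non-squared denominator, and the stabilizing constant `eps` are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def dice_loss(true_map, pred_map, eps=1e-8):
    """Dice-style loss: 1 - 2 * overlap / (sum of both maps).

    Assumption: both maps hold values in [0, 1]; eps avoids division by zero.
    """
    t = np.asarray(true_map, dtype=np.float64).ravel()
    p = np.asarray(pred_map, dtype=np.float64).ravel()
    overlap = 2.0 * np.sum(t * p)
    total = np.sum(t) + np.sum(p)
    return 1.0 - overlap / (total + eps)


# Toy 4x4 skeleton: a single horizontal stroke.
true_skeleton = np.zeros((4, 4))
true_skeleton[1, :] = 1.0

loss_perfect = dice_loss(true_skeleton, true_skeleton.copy())  # close to 0
loss_empty = dice_loss(true_skeleton, np.zeros((4, 4)))        # no overlap
```

A loss of this shape depends on the overlap relative to the stroke area rather than on the total pixel count, which is useful here because skeleton pixels are a small fraction of the image.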
3.4 Text Generator
The Text Generator module takes the features extracted by the Text Encoder and Context Encoder and outputs an image of the stylized title. This output is the desired result of the proposed method. It is an image of the desired text with the style inferred from the book cover image.
The Text Generator has the same structure as the Skeleton Generator, except with no skip-connections and with an additional convolutional block. The additional convolutional block combines the features from the Text Generator and the Skeleton Generator. As described previously, the Skeleton Generator generates a skeleton of the desired stylized text. To incorporate the generated skeleton into the generated text, the output of the Skeleton Generator is concatenated with the output of the fourth block of the Text Generator. The merged output is further processed through a convolutional layer with a stride of 1, Batch Normalization, and Leaky ReLU activation. The output of the Text Generator has a tanh activation function.
The Text Generator is trained using a reconstruction loss, which is the Mean Absolute Error between the generated output and the ground truth title text.
While the reconstruction loss guides the generated text to be similar to the original text, it is only one part of the total loss. Thus, the output of the Text Generator is not strictly the same as the original text.
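As a small illustration of the reconstruction loss described above, the sketch below computes the Mean Absolute Error between two images. The function name `reconstruction_loss` and the toy image shapes are hypothetical choices for this example.

```python
import numpy as np


def reconstruction_loss(generated, ground_truth):
    """Mean Absolute Error over all pixels and channels."""
    g = np.asarray(generated, dtype=np.float64)
    t = np.asarray(ground_truth, dtype=np.float64)
    return float(np.mean(np.abs(g - t)))


# Toy images in the [-1, 1] range (3 channels, 4x4 pixels).
truth = np.zeros((3, 4, 4))
generated = np.full((3, 4, 4), 0.5)

example_loss = reconstruction_loss(generated, truth)
```

Here every pixel differs by 0.5, so the averaged absolute error is 0.5 as well.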
3.5 Perception Network
To refine the results of the Text Generator, we use a Perception Network. Specifically, the Perception Network is used to increase the perceptual quality of the generated images [13, 29]. To do this, the output of the Text Generator is provided to a Perception Network, and two loss functions are added to the training. These loss functions are the Content Perceptual loss and the Style Perceptual loss. The Perception Network is a VGG19 [22] that was pre-trained on ImageNet [5].
Each loss function compares differences in the content and style of the features extracted from the Perception Network when it is provided with the generated title and the ground truth title images. The Content Perceptual loss minimizes the distance between the features extracted from the generated title images and from the original images.
These features are the feature maps of a set of layers of the Perception Network, computed for the generated image and for the ground truth image. In this case, the set consists of the relu1_1, relu2_1, relu3_1, relu4_1, and relu5_1 layers of VGG19. The Style Perceptual loss compares the texture and local features extracted by the Perception Network.
It does so through the Gram Matrix of each feature map, which has one element for every pair of channels of that feature map.
Each element is computed from the corresponding pair of channels by accumulating the products of their values over all spatial positions. By minimizing the distance between the Gram Matrices, a style consistency is enforced. In other words, the local features of both images should be similar.
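To make the two perceptual losses more concrete, the sketch below computes a Gram Matrix and both losses on pre-extracted feature maps. It assumes the feature maps are already available as arrays of shape (channels, height, width); the random arrays stand in for VGG19 activations, and the L1 distance and the normalization by channels × height × width are assumptions, since the exact formulas are not reproduced above.

```python
import numpy as np


def gram_matrix(feature_map):
    """Channel-by-channel Gram Matrix of a (C, H, W) feature map."""
    fmap = np.asarray(feature_map, dtype=np.float64)
    c, h, w = fmap.shape
    flat = fmap.reshape(c, h * w)
    # Normalization by C*H*W is an assumption for this sketch.
    return flat @ flat.T / (c * h * w)


def content_loss(feats_generated, feats_true):
    """Sum over layers of the mean absolute feature difference."""
    return sum(float(np.mean(np.abs(g - t)))
               for g, t in zip(feats_generated, feats_true))


def style_loss(feats_generated, feats_true):
    """Sum over layers of the mean absolute Gram Matrix difference."""
    return sum(float(np.mean(np.abs(gram_matrix(g) - gram_matrix(t))))
               for g, t in zip(feats_generated, feats_true))


# Stand-ins for activations from two layers of a perception network.
rng = np.random.default_rng(0)
features = [rng.standard_normal((8, 16, 16)),
            rng.standard_normal((16, 8, 8))]
# Flipping the rows moves content around but keeps channel statistics.
flipped = [f[:, ::-1, :] for f in features]

content_value = content_loss(features, flipped)
style_value = style_loss(features, flipped)
```

The flipped example shows why the two losses are complementary: the content term is large because features no longer align spatially, while the style term stays near zero because the Gram Matrix discards spatial position.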
3.6 Discriminator
In addition to the Perception Network, we also propose the use of an adversarial loss to ensure that the generated results are realistic. To use the adversarial loss, we introduce a Discriminator to the network architecture. The Discriminator distinguishes whether the input is a real title image or a fake image. In this case, we use the true title image as the real image and the Text Generator’s output as the fake image.
The Discriminator in the proposed method follows the structure of the DCGAN [17]. The Discriminator input goes through 5 down-sampling 5×5 convolutional layers with stride 2 and finally a fully-connected layer. The output is the probability that the input is a true title image. Except for the output layer, the LeakyReLU function is used. At the output layer, a sigmoid function is adopted. An adversarial loss is used to optimize the entire generation model and the Discriminator.
In this loss, the Discriminator is given the style condition and scores the true title image against the image that the whole generation module produces from the input title text.
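The following is a minimal sketch of a standard GAN objective applied to Discriminator probabilities, which is one plausible reading of the adversarial loss described above. The function names, the non-saturating generator term, and the clipping constant are assumptions; conditioning on the style image is omitted because only the Discriminator outputs are needed to evaluate the loss.

```python
import numpy as np


def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Binary cross-entropy: real images should score 1, fakes 0."""
    d_real = np.clip(d_real, eps, 1.0 - eps)
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return float(-np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake)))


def generator_adversarial_loss(d_fake, eps=1e-12):
    """Non-saturating generator term: fakes should score 1."""
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return float(-np.mean(np.log(d_fake)))


# Sigmoid outputs of a hypothetical Discriminator for a batch of 2.
undecided = discriminator_loss(np.array([0.5, 0.5]), np.array([0.5, 0.5]))
confident = discriminator_loss(np.array([0.9, 0.95]), np.array([0.1, 0.05]))
generator_term = generator_adversarial_loss(np.array([0.5, 0.5]))
```

In alternating training, the Discriminator step would minimize the first function and the generation modules would minimize the second as one weighted term of their total loss.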
3.7 Title image generation model
As we have explained, the proposed network can generate title text suitable for a style. As a whole, the Text Encoder receives an input text image and the Context Encoder receives a style image, while the Skeleton Generator outputs a skeleton map and the Text Generator outputs a title image. This process is shown in Fig. 2. The training alternates adversarially between the Discriminator and the rest of the modules in an end-to-end manner. The Text Generator, Skeleton Generator, Text Encoder, and Context Encoder are trained using a total loss through the Text Generator.
The total loss is a weighted sum of the reconstruction, skeleton, Content Perceptual, Style Perceptual, and adversarial losses, with one weight for each of the losses.
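As a sketch of how the five losses could be combined, the snippet below forms a weighted sum from named loss values. All numbers, including the weights, are hypothetical placeholders chosen for illustration; they are not the values used in the paper.

```python
def total_loss(losses, weights):
    """Weighted sum of named loss values; every loss needs a weight."""
    missing = set(losses) - set(weights)
    if missing:
        raise KeyError(f"missing weights for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in losses.items())


# Hypothetical loss values from one training step.
step_losses = {
    "reconstruction": 0.04,
    "skeleton": 0.30,
    "content": 1.20,
    "style": 0.50,
    "adversarial": 0.70,
}
# Hypothetical weights; the adversarial weight is kept small.
step_weights = {
    "reconstruction": 1.0,
    "skeleton": 1.0,
    "content": 1.0,
    "style": 1.0,
    "adversarial": 0.01,
}

combined = total_loss(step_losses, step_weights)
```

Keeping the weights in a dictionary keyed by loss name makes it easy to drop a term for an ablation run by setting its weight to zero.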
4 Experimental setup
4.1 Dataset
As shown in Fig. 3, training the network requires a combination of the full book cover image without text, a target location mask, a target plain input text, the original font, and the skeleton of the original words. Thus, we use a combination of the following datasets and pre-processing to construct the ground truth.
We used the Book Cover Dataset [12]. This dataset consists of 207,572 book cover images (https://github.com/uchidalab/book-dataset). The size of the book cover image varies depending on the book cover, but for this experiment, we resize the images to 256×256 pixels in RGB.
To ensure that the generated style is only inferred by the book cover and not any existing text, we remove all of the text from the book covers before using them. To remove the text, we first use Character Region Awareness for Text Detection (CRAFT) [4] to detect the text, then cut out the regions defined by the detected bounding-boxes after dilating them with a 3×3 kernel. CRAFT is an effective text detection method that uses a U-Net-like structure to predict character proposals and uses the affinity between the detected characters to generate word-level bounding-box proposals. Then, Telea’s inpainting [23] is used to fill the removed text regions with a plausible background. The result is images of book covers without the text.
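The mask preparation step described above can be sketched as follows. The code builds a binary mask from bounding-boxes and dilates it with a 3×3 kernel using NumPy only; the box format (x0, y0, x1, y1) and the function names are assumptions, and the boxes would come from a text detector such as CRAFT. The inpainting itself is not implemented here; OpenCV's `cv2.inpaint` with the `cv2.INPAINT_TELEA` flag is one available implementation of Telea's method that could consume such a mask.

```python
import numpy as np


def boxes_to_mask(shape, boxes):
    """Mark every pixel inside any (x0, y0, x1, y1) box as text."""
    mask = np.zeros(shape, dtype=bool)
    for x0, y0, x1, y1 in boxes:
        mask[y0:y1, x0:x1] = True
    return mask


def dilate3x3(mask):
    """Binary dilation with a 3x3 square kernel (one-pixel growth)."""
    h, w = mask.shape
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    for dy in range(3):
        for dx in range(3):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


# One hypothetical detected box on an 8x8 image.
text_mask = boxes_to_mask((8, 8), [(3, 3, 5, 5)])
removal_mask = dilate3x3(text_mask)
```

Growing the mask by one pixel helps cover anti-aliased character edges that a tight box would leave behind.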
For the title text, CRAFT is also used. The text regions found by CRAFT are recognized using the scene text recognition method proposed by Baek et al. [3]. Using the text recognition method, we can extract the text regions and compare them to the ground truth. Once the title text is found, we need to extract the text without the background, as shown in Fig. 7. To perform this extraction, we generate a text mask (Fig. 7b) using a character stroke separation network. The character stroke separation network is based on pix2pix [11], and it is trained to generate the text mask from the cropped detected text region of the title. By applying the text mask to the detected text, we can extract the title text without a background. The background is replaced with a uniform background with pixel values (127.5, 127.5, 127.5). Since the inputs are normalized to [-1, 1], the background represents a “0” value. Moreover, a target location mask (Fig. 3b) is generated by masking the bounding-box of the title text. The plain text input is generated using the same text but in the font “Arial.” Finally, the skeleton of the text is obtained using the skeletonization method of Zhang et al. [31].
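The background replacement and normalization described above can be checked with a few lines of NumPy. The function names and the toy crop are hypothetical; the sketch assumes 8-bit style pixel values in [0, 255] and a boolean stroke mask.

```python
import numpy as np


def extract_title(crop_rgb, text_mask, background=127.5):
    """Keep stroke pixels and replace everything else with mid-gray."""
    crop = np.asarray(crop_rgb, dtype=np.float64)
    mask = np.asarray(text_mask, dtype=bool)[..., None]  # broadcast over RGB
    return np.where(mask, crop, background)


def normalize(image):
    """Map pixel values from [0, 255] to [-1, 1]."""
    return image / 127.5 - 1.0


# Toy 2x2 white crop in which only the top-left pixel is a stroke.
crop = np.full((2, 2, 3), 255.0)
stroke_mask = np.array([[True, False], [False, False]])

title = normalize(extract_title(crop, stroke_mask))
```

After normalization the mid-gray background is exactly zero, so it contributes nothing to the first convolution apart from the bias.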
4.2 Implementation details
In this experiment, to train the proposed model on high-quality data pairs, we used only the images for which the character recognition results [3] of the region detected by CRAFT (Fig. 7a) and of the image created by the character stroke separation network (Fig. 7c) match. As a result, the proposed model was trained on 195,922 title text images and the corresponding 104,925 book cover images. In addition, 3,702 title text images and the 1,000 accompanying book cover images were used for the evaluation. The training was done end-to-end using a batch size of 8 for 300,000 iterations. We used Adam [15] as the optimizer, specifying its learning rate and two moment coefficients, and set the five weighting factors of the losses so that the scale of each loss value is similar. For the adversarial loss, we set a smaller weight so that the Discriminator does not have too much influence on the generation module’s training and the module can generate natural and clear images [16].
4.3 Evaluation metrics
For quantitative evaluation of the generated title images, we use standard metrics and non-standard metrics. For the standard metrics, Mean Absolute Error (MAE), Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity (SSIM) [27] are used. We introduce two non-standard metrics for our title generation task, Font-vector Mean Square Error (Font MSE) and Color Mean Square Error (Color MSE). Font MSE evaluates the MSE between font style vectors of the original (i.e., ground-truth) and the generated title images. The font style vector is a six-dimensional vector of the likelihood of six font styles: serif, sans, hybrid, script, historical, and fancy. The font style vector is estimated by a CNN trained with text images generated by SynthText [8] with 1811 different fonts. Color MSE evaluates the MSE between three-dimensional RGB color vectors of the original and the generated title images. The color vector is given as the average color of the stroke detected by the character stroke separation network. It should be noted that the three standard metrics, MAE, PSNR, and SSIM, can only be used when the target text is the same as the original text. In contrast, Font MSE and Color MSE can be used even when the generated text is different, because they measure qualities that are common to the generated text and the ground truth text.
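The sketch below shows how MAE, PSNR, and the two vector-based metrics could be computed. It assumes images scaled to [0, 1] (so the PSNR peak is 1.0) and takes the font style and color vectors as given; the CNN that estimates the font style vector and the stroke detection are outside the scope of this example, and SSIM is omitted. All names and sample values are hypothetical.

```python
import numpy as np


def mae(a, b):
    """Mean Absolute Error between two images."""
    return float(np.mean(np.abs(np.asarray(a) - np.asarray(b))))


def psnr(a, b, peak=1.0):
    """Peak Signal-to-Noise Ratio in dB for images with the given peak."""
    mse = float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))
    if mse == 0.0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)


def vector_mse(u, v):
    """MSE between two vectors, used for both Font MSE and Color MSE."""
    u = np.asarray(u, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    return float(np.mean((u - v) ** 2))


truth_img = np.zeros((4, 4))
generated_img = np.full((4, 4), 0.1)

# Likelihoods for: serif, sans, hybrid, script, historical, fancy.
font_true = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
font_generated = [0.7, 0.3, 0.0, 0.0, 0.0, 0.0]

# Average stroke colors as RGB in [0, 1].
color_true = [1.0, 0.0, 0.0]
color_generated = [0.8, 0.1, 0.1]

font_mse = vector_mse(font_true, font_generated)
color_mse = vector_mse(color_true, color_generated)
```

Because the two vector metrics compare style descriptors rather than pixels, they remain meaningful when the rendered words differ, which is the property the text above relies on.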
5 Experimental results
5.1 Qualitative evaluation
This section discusses the relationship between the quality of the generated images and the book cover images by showing various samples generated by the network and the corresponding original book cover images. Fig. 8 shows samples where the network successfully generates a title image close to the ground truth, that is, with smaller Font MSE and Color MSE. From the figure, we can see that the strokes, ornaments, and the size of the text are reproduced. In particular, the first example shows that a serif font can be reproduced even though the input text is always given as a sans-serif image.
Among the results in Fig. 8, the “Windows” example clearly demonstrates that the proposed method can predict the font and style of the text given the book cover and the target location mask. This is due to the book cover being a recognizable template from the “For Dummies” series in which other books with similar templates exist in the training data. The figure demonstrates that the proposed method effectively infers the font style from the book cover image based on context clues alone.
Fig. 9 shows examples where the generator could not successfully generate a title image close to the ground truth. The first, second, and third images in Fig. 9 show examples of poor coloring in particular. The fourth and fifth images show examples where the proposed method could not predict the font shape. For the “SUGAR” and “BEN” images, there are no clues that the color should be red. In the “DEADLY” book cover image, one would expect light text on the dark background. However, the original book cover used dark text on a dark background. For the “Beetles” and “Furrow” examples, the fonts are highly stylized and difficult to predict.
Fig. 10 shows several additional results including successful and failure cases. Even in this difficult estimation task from a weak context, the proposed method gives a reasonable style for the title image. The serif for “Make” and the thick style for “GUIDEBOOK” are promising. We also observe that peculiar styles, such as very decorated fonts and vivid colors, are often difficult to recover from the weak context.
Finally, in Fig. 11, we show results of using the proposed method, but with text that is different from the ground truth. This figure demonstrates that we can produce any text in the predicted style.
5.2 Ablation study
To measure the importance of the components of the proposed method, quantitative and qualitative ablation studies are performed. The network of the proposed method consists of six modules and their associated loss functions. Therefore, we measure the effects of the components. All experiments are performed with the same settings and the same training data.
The following evaluations are performed:
Proposed: The evaluation with the entire proposed model.
Baseline: Training is performed only using the Text Encoder and Text Generator with the reconstruction loss.
w/o Context Encoder: The proposed method but without the Context Encoder. The results are expected to be poor because there is no information about the style to learn from.
w/o Skeleton Generator: The proposed method but with no guidance from the Skeleton Generator and without the use of the skeleton loss.
w/o Discriminator: The proposed method but without the Discriminator and the adversarial loss.
w/o Perception Network: The proposed method but without the Perception Network and the associated Content Perceptual and Style Perceptual losses.
Table 1: Quantitative results of the ablation study.
|Method|MAE|PSNR|SSIM|Font MSE|Color MSE|
|---|---|---|---|---|---|
|w/o Context Encoder|0.036|20.79|0.872|0.126|0.093|
|w/o Perception Network|0.061|19.39|0.874|0.085|0.058|
|w/o Skeleton Generator|0.035|21.09|0.874|0.112|0.080|
The quantitative results of the ablation study are shown in Table 1. The results show that Proposed has the best results in all evaluation methods except one, Color MSE. For Color MSE, w/o Perception Network performed slightly better. This indicates that the color of text produced by the proposed method without the Perception Network was more similar to the ground truth. However, as shown in Fig. 12, the Perception Network is required to produce reasonable results. In the figure, the colors are brighter without the Perception Network, but there is also a significant amount of additional noise. This is reflected in the results for the other evaluation measures in Table 1.
Fig. 12 also shows how important each module is to the proposed method. As seen in Table 1, the Font MSE and Color MSE are much larger for w/o Context Encoder than for the proposed method. This is natural because no style information is provided to the network: there are no hints such as color, object, or texture. Thus, as shown in Fig. 12, w/o Context Encoder only generates a basic font with no color information. This also shows that the book cover image information is important in generating the title text. A similar trend can be seen with w/o Discriminator and w/o Skeleton Generator. The results show that the Discriminator improves the quality of the font and that the Skeleton Generator ensures the structure of the text is robust.
6 Conclusion
In this study, we proposed a method of generating the design of text based on context information, such as the location and surrounding image. Specifically, we automatically generated book titles for given book covers using a neural network. The generation of the title was achieved by extracting the features of the book cover and the input text with two separate encoders and using a generator with skeletal information. In addition, an adversarial loss and a perception network are used during training to refine the results. As a result, we succeeded in incorporating the implicit universality of the design of the book cover into the generation of the title text. We obtained excellent results quantitatively and qualitatively, and the ablation study confirmed the effectiveness of the proposed method. The code can be found at https://github.com/Taylister/FontFits. In the future, we will pursue the incorporation of the text back onto the book cover.
Acknowledgments
This work was in part supported by MEXT-Japan (Grant No. J17H06100 and Grant No. J21K17808).
References
- [1] (2020) Guided neural style transfer for shape stylization. PLOS ONE 15 (6), pp. e0233489.
- [2] (2018) Multi-content GAN for few-shot font style transfer. In CVPR.
- [3] (2019) What is wrong with scene text recognition model comparisons? Dataset and model analysis. In ICCV.
- [4] (2019) Character region awareness for text detection. In CVPR.
- [5] (2009) ImageNet: a large-scale hierarchical image database. In CVPR.
- [6] (2016) A neural algorithm of artistic style. J. Vis. 16 (12), pp. 326.
- [7] (2019) Selective style transfer for text. In ICDAR.
- [8] (2016) Synthetic data for text localisation in natural images. In CVPR.
- [9] (2015) Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.
- [10] (2020) RD-GAN: few/zero-shot Chinese character style transfer via radical decomposition and rendering. In ECCV.
- [11] (2017) Image-to-image translation with conditional adversarial networks. In CVPR.
- [12] (2016) Judging a book by its cover. arXiv preprint arXiv:1610.09204.
- [13] Perceptual losses for real-time style transfer and super-resolution. In ECCV.
- [14] (2007) Typography and graphic design: from antiquity to the present. Flammarion.
- [15] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [16] Photo-realistic single image super-resolution using a generative adversarial network. In CVPR.
- [17] (2018) Unsupervised representation learning of image-based plant disease with deep convolutional generative adversarial networks. In CCC.
- [18] (2017) Auto-encoder guided GAN for Chinese calligraphy synthesis. In ICDAR.
- [19] (2013) Rectifier nonlinearities improve neural network acoustic models. In ICML.
- [20] V-net: fully convolutional neural networks for volumetric medical image segmentation. In 3DV.
- [21] (2019) Font style transfer using neural style transfer and unsupervised cross-domain transfer. In ACCV Workshops.
- [22] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- [23] An image inpainting technique based on the fast marching method. J. Graphics Tools 9 (1), pp. 23–34.
- [24] (2020) Network of steel: neural font style transfer from heavy metal to corporate logos. In ICPRAM.
- [25] (1998) The new typography: a handbook for modern designers. Vol. 8, University of California Press.
- [26] (2019) Typography with decor: intelligent text style transfer. In CVPR.
- [27] (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13 (4), pp. 600–612.
- [28] (2020) Multitask adversarial learning for Chinese font style transfer. In IJCNN.
- [29] (2019) Editing text in the wild. In ACM ICM.
- [30] (2020) A learning-based text synthesis engine for scene text detection. In BMVC.
- [31] (1984) A fast parallel algorithm for thinning digital patterns. Commun. the ACM 27 (3), pp. 236–239.
- [32] (2018) Modeling fonts in context: font prediction on web designs. Computer Graphics Forum 37 (7), pp. 385–395.
- [33] Few-shot text style transfer via deep feature similarity. IEEE Trans. Image Process. 29, pp. 6932–6946.