Style transfer is the task of combining the style of one image with the content of another. Although the content of an image can be defined by its objects and general scenery, the style of an image is not well defined. Style can be understood as the brush strokes of a painting, the color distribution, certain dominant forms and shapes, or a combination of all of these. Previous style transfer works have focused on transferring painting styles, where the transferred features encode the brush strokes, the cubist patterns or the color palette, achieving fascinating results [7, 13, 6].
However, text characters are very particular objects for which the common understanding of content and style cannot be adopted. Instead we define the style of the text as the shape, color and background of the characters and the content as the transcription of the text. In this work, we devise two architectures that learn the features that encode a certain text style and are able to transfer them to other text instances while preserving their content. We thus recast the ideas of style transfer, previously applied mostly to paintings, to the text domain.
Specifically, our method is able to automatically change the style of text regions in natural scene images, generating realistic images with the same textual content but with different text styles. In machine printed text images, we are able to train models that simulate a change of the text font. In handwritten text, we are able to transfer the writing style of one writer to another.
Possible applications of text style transfer include text stylization for augmented reality systems and text font normalization, which recasts the visual appearance of any text instance to a canonical style to ease the different steps of a text recognition pipeline.
In particular, we will demonstrate that such an approach is useful as a data augmentation technique (see Figure 1), in order to deal with the problem of annotated data scarcity. One can transfer the styles found in a particular dataset to much bigger available annotated datasets, in order to train better text detectors or recognizers.
The contributions of this work are as follows:
- We demonstrate that an existing style transfer pipeline can be used for text style transfer.
- We extend prior works to selective style transfer for text, devising two architectures, two-stage and end-to-end, that stylize only the text regions in an image.
- We use text style transfer as a data augmentation technique for the scene text detection task, and provide results that show a boost in detection performance.
The rest of the paper is organized as follows: in the related work section, we review the style transfer and text stylization literature. In the methodology section, we first define the style transfer task and then present the proposed two-stage and end-to-end architectures for selective text style transfer. In the results section, we describe the models we have trained in different text domains, show qualitative results and discuss their performance. Finally, in the data augmentation section, we show how text style transfer can be used to boost the performance of a text detector.
II Related Work
Gatys et al. [7] propose a method that uses a pretrained VGG network [22] for style transfer by gradually transforming a randomly generated image into a stylized image. The stylized image is obtained by backpropagating to the pixel values, which is computationally demanding. Johnson et al. [13] recast the problem as an image transformation task, where a single, fixed, learnt painting style is applied to an arbitrary image. A CNN is trained to alter a corpus of content images to match the style of a painting, eventually allowing images to be stylized in real time. Simultaneously, Ulyanov et al. [23] introduced the idea of Instance Normalization, a modified version of Batch Normalization, to obtain computationally less demanding models. A drawback of these works [13, 23] is that an independent model has to be trained for each source style. Dumoulin et al. [6] overcome this by proposing a single CNN that can learn to transfer different source styles (up to 32 in their experiments), which also allows images with combined styles to be generated.
An appealing direction in the style transfer literature is to apply the style according to a semantic segmentation. Li et al. [17] use a Markov Random Field over given semantic maps to decide which patch of the image to stylize. Luan et al. [20] extend this idea to doodles with manual annotation. A final extension comes from Zhao et al. [26], who use two models: one for generating soft masks and the other for style transfer. All these works, however, apply the style transfer to the whole image. Conceptually, the closest to our work is that of Gatys et al. [8], who propose a method to control perceptual factors such as color, luminance, and spatial location and scale. In particular, spatial control refers to applying the style only to masked areas, which can be sky areas, balls, houses, trees, etc. Their method relies on guided Gram matrices, in which masks restrict the computation of the Gram matrices over CNN features to specific image regions.
Although style transfer has not been applied to text, other works have targeted the task of changing text style with different approaches. Liu et al. [19] proposed a pipeline to transform scene text into machine printed text within a scene text recognition model. Abe et al. [1] proposed a Generative Adversarial Network model to create new machine printed text fonts. Aksan et al. [2] propose a generative model to disentangle the content and style of handwritten text represented as temporally ordered strokes, and apply it to handwriting synthesis and style transfer. More related to our work, Bhunia et al. [16] focus on font-to-font translation in images of printed documents using a GAN architecture. Also, Azadi et al. [3] propose a conditional GAN to style machine printed text into more complex scene text fonts, learning each character style independently.
Our work, to the best of our knowledge, is the first one exploring the performance of style transfer models on the text style transfer task as well as having an end-to-end model that can perform spatial control without any external information.
III Methodology
We first explain the details of the model on which we base our implementation. Afterwards, we explain the details of our architectures, a two-stage and an end-to-end model that apply style transformations only to the text areas of the image.
III-A Style Transfer
We use the model proposed by Dumoulin et al. [6] as our baseline. The style transfer task is usually defined as finding an image $p$, produced by an encoder-decoder image transformation network, whose content is similar to a source content image $c$ but whose style is similar to a source style image $s$. A key point of style transfer is the definition of both content and style. In this formulation, two images are considered to have similar content if their high-level features, extracted by a trained classifier, are close in Euclidean distance. On the other hand, two images are similar in style if the Gram matrices of their low-level features, as extracted by a trained classifier, are close under the Frobenius norm.
More formally, let $T$ be the transformer network and $\phi_l$ the output of the $l$-th layer of a CNN $\phi$ pretrained on ImageNet. In our case $\phi$ is the VGG-16. The training process is as follows: we initially forward the content image $c$ through the transformer network to obtain the stylized image $p = T(c)$. All images $c$, $s$, $p$ are then forwarded through $\phi$, and features are extracted for them: $\phi_j(c)$ for the content image from the content layer $j$, $\phi_i(s)$ for the style image from the style layer $i$, and $\phi_i(p)$, $\phi_j(p)$ for the stylized image from both layers. The content loss $\mathcal{L}_c$ is defined as the mean squared error between $\phi_j(p)$ and $\phi_j(c)$. The style loss $\mathcal{L}_s$ is computed as the mean squared error between the corresponding Gram matrices $G(\phi_i(p))$ and $G(\phi_i(s))$ of the features $\phi_i(p)$ and $\phi_i(s)$. The final loss that directs the model training is a weighted average of the content and the style losses, with weights $\lambda_c$ and $\lambda_s$. In summary:
$$\mathcal{L}_c = \mathrm{MSE}\big(\phi_j(p), \phi_j(c)\big), \qquad \mathcal{L}_s = \mathrm{MSE}\big(G(\phi_i(p)), G(\phi_i(s))\big), \qquad \mathcal{L} = \lambda_c \mathcal{L}_c + \lambda_s \mathcal{L}_s$$
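The content and style losses described above can be illustrated with a minimal, dependency-free sketch. It assumes feature maps are given as lists of channels (each a flat list of activations) rather than real VGG-16 outputs; the function names, the Gram normalization by the number of activations, and the weights `w_content` and `w_style` are assumptions of this sketch, not values from the paper.

```python
def mse(a, b):
    """Mean squared error between two equally sized flat lists."""
    assert len(a) == len(b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def flatten(rows):
    """Concatenate a list of lists into one flat list."""
    return [v for row in rows for v in row]


def gram_matrix(feature_map):
    """Gram matrix of a feature map given as C channels of N activations.

    Entry (i, j) is the inner product of channels i and j divided by N
    (the normalization is an assumption of this sketch).
    """
    n = len(feature_map[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / n for cj in feature_map]
            for ci in feature_map]


def content_loss(feat_stylized, feat_content):
    """MSE between high-level features of the stylized and content images."""
    return mse(flatten(feat_stylized), flatten(feat_content))


def style_loss(feat_stylized, feat_style):
    """MSE between Gram matrices of the stylized and style image features."""
    return mse(flatten(gram_matrix(feat_stylized)),
               flatten(gram_matrix(feat_style)))


def total_loss(content_pair, style_pairs, w_content=1.0, w_style=1.0):
    """Weighted combination of the content loss and the style losses.

    content_pair: (stylized, content) features at the content layer.
    style_pairs: list of (stylized, style) features, one per style layer.
    """
    lc = content_loss(*content_pair)
    ls = sum(style_loss(p, s) for p, s in style_pairs)
    return w_content * lc + w_style * ls
```

In a real training loop these quantities would be computed on tensors with automatic differentiation; the sketch only makes the arithmetic of the two losses explicit.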
III-B Selective Style Transfer for Text
Selective style transfer refers to automatically detecting the relevant areas in the image (in our case, areas where text is present) and restricting the application of style to the detected areas only, leaving the rest of the image unchanged. Accordingly, we design and describe two models that can perform selective style transfer for text.
III-B1 Two-stage Architecture
The two-stage architecture combines the baseline style transfer network with TextFCN [4, 5], which is a text detector that infers the probability of each pixel belonging to a textual area. To transfer a text style to an input image, we first stylize the whole image. Then, we compute the pixel-level heatmap of the original image with TextFCN. To generate the final image, where the style is transferred only to textual areas, we blend the original image and the stylized image, weighted by the TextFCN heatmap (see Figure 2). This procedure yields realistic images and ensures that non-textual areas are kept unchanged.
More formally, given a stylized image $p$ and a pre-trained TextFCN $F$, we process the content image $c$ with $F$ to obtain the per-pixel text probability map $m = F(c)$. Then we keep only the textual areas of the stylized image by taking its Hadamard product with $m$, and do the same with $c$ and $1 - m$ to get the content of the non-textual areas. We sum the results to get the final image $\hat{p}$:
$$\hat{p} = m \odot p + (1 - m) \odot c$$
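The blending step can be written as a short sketch. It assumes images are nested lists of shape H x W x 3 with float values and that the text probability map is an H x W nested list with values in [0, 1]; the function name `blend` is hypothetical, and a real pipeline would obtain the map from the text detector rather than by hand.

```python
def blend(content, stylized, text_prob):
    """Per-pixel blend: text_prob * stylized + (1 - text_prob) * content.

    content, stylized: H x W x 3 nested lists of floats.
    text_prob: H x W nested list of text probabilities in [0, 1].
    """
    return [
        [
            [m * s + (1.0 - m) * c for c, s in zip(c_px, s_px)]
            for c_px, s_px, m in zip(c_row, s_row, m_row)
        ]
        for c_row, s_row, m_row in zip(content, stylized, text_prob)
    ]


# A 1 x 3 toy image: the first pixel is text, the second is background,
# and the third is an uncertain pixel that receives half of each image.
content = [[[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]]
stylized = [[[1.0, 0.5, 0.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]]
text_prob = [[1.0, 0.0, 0.5]]
final = blend(content, stylized, text_prob)
```

Because the map is soft, pixels with intermediate probabilities are mixed rather than switched, which avoids hard seams around the text.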
III-B2 End-to-end Architecture
In the style transfer literature, pre-computed masks of image regions have been extensively used to apply different styles in different regions [8, 20, 17, 26]. However, these masks are used either to constrain the output of the image transformation network, as we do with the two-stage architecture, or for gradient masking, and none of the existing works is capable of learning the masks along with the style information. To this end, and with the additional purpose of reducing the computational complexity, we create a novel end-to-end architecture that is capable of performing selective style transfer on text without needing any text detector. Our model is inspired by the distillation strategy from [11]. The basic idea of distillation is to pass the learnt information of various networks, which are able to solve different tasks, into a single model. In our case, we combine the image style transformation network with the text detector. We take the pretrained image style transformation network and the ground truth annotations for the text to train a randomly initialized image transformation network with a mean squared error loss (see Figure 3).
More formally, let $T$ be the pretrained image style transformation network, $m$ the ground truth masks of the text regions, and $T'$ the same network as $T$ but randomly initialized. We first obtain the stylized image $p$ by forwarding the content image $c$ through $T$. We then obtain $p'$ as the output of $T'$ after feeding it the content image $c$. To get the ground truth for content and for style, $y_c$ and $y_s$ respectively, we apply the Hadamard product:
$$y_s = m \odot p, \qquad y_c = (1 - m) \odot c$$
We use mean squared error as our loss to train the selective style transfer network $T'$. There are two key points in our loss calculation. First, we need to apply the mask for text regions, $m$, and for content regions, $1 - m$, to $p'$ to make sure our model learns to differentiate between text and background. Second, since text regions in the image are significantly smaller than the background, we weight the loss contributions of text and background pixels with two parameters, $\alpha$ and $\beta$. The loss is defined as:
$$\mathcal{L} = \alpha \, \mathrm{MSE}\big(m \odot p', y_s\big) + \beta \, \mathrm{MSE}\big((1 - m) \odot p', y_c\big)$$
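The masked, weighted loss can be sketched as follows. For brevity the sketch treats images as flat lists of pixel intensities and the mask as a flat list of 0/1 values; the function name `selective_loss` and the example weights 0.75 and 0.25 are assumptions chosen for illustration, not the values used in the experiments.

```python
def mse(a, b):
    """Mean squared error between two equally sized flat lists."""
    assert len(a) == len(b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def selective_loss(student_out, teacher_out, content, mask, alpha, beta):
    """Distillation loss for selective style transfer.

    student_out: output of the network being trained.
    teacher_out: output of the pretrained style transfer network.
    content: the original content image.
    mask: 1 for text pixels, 0 for background pixels.
    alpha, beta: weights of the text and background terms.
    """
    # Targets: stylized pixels inside text, original pixels elsewhere.
    text_target = [m * t for m, t in zip(mask, teacher_out)]
    bg_target = [(1 - m) * c for m, c in zip(mask, content)]
    # The same masks are applied to the student output.
    text_pred = [m * q for m, q in zip(mask, student_out)]
    bg_pred = [(1 - m) * q for m, q in zip(mask, student_out)]
    return alpha * mse(text_pred, text_target) + beta * mse(bg_pred, bg_target)


content = [0.2, 0.4, 0.6, 0.8]
teacher = [1.0, 1.0, 1.0, 1.0]
mask = [1, 0, 0, 0]

# A student that stylizes only the text pixel incurs no loss.
perfect = selective_loss([1.0, 0.4, 0.6, 0.8], teacher, content, mask, 0.75, 0.25)
# A student that simply copies the input is penalized on the text pixel.
copy = selective_loss(content, teacher, content, mask, 0.75, 0.25)
```

Giving the text term a larger weight than the background term is one way to keep the few text pixels from being dominated by the much larger background area.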
IV Results
To explore the capabilities of text style transfer in various text domains, we train three models: a model to transfer scene text styles, a model to transfer machine printed text font styles, and a model to transfer handwritten styles. Figure 4 shows some of the source styles used. In this section, we explain the implementation and training details for each model, and give qualitative results.
IV-A Scene Text
We trained our baseline style transfer model for the two-stage architecture with 34 scene text styles from the COCO-Text dataset [24], using cropped word images as source styles and ImageNet [12] as the training dataset. Then, we trained our end-to-end model for selective text style transfer using the COCO-Text [24] dataset (only the images containing legible text). We trained the end-to-end model in the distillation fashion in two phases. The first phase, with an initial setting of the loss weights, produced images where the style was correctly transferred to the text, but which were quite blurry and dark. In the second phase, we continued training with equal weights for textual and non-textual areas to balance the style and content losses.
Figure 5 shows results of the scene text model applied to COCO-Text scene images, using both the two-stage architecture and the end-to-end architecture to obtain the selective text style transfer results. The performance is appealing: the source styles are transferred with high fidelity in both character shapes and colors to a wide diversity of scene texts. The text content is preserved quite well in most images, and the result is illegible only in some cases where the task is very complex due to the original text size or a tangled style. Both the weighted blending with TextFCN and the end-to-end architecture produce realistic images that are very useful as data augmentation (see Section V). These realistic results also have huge potential in artistic applications, such as graphic design. Graphic designers could quickly visualize how a given text style fits an existing scene text. Moreover, the model can perform style transfer by averaging several styles, which makes it possible to create new styles.
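The text does not spell out how styles are averaged. In the baseline of Dumoulin et al. [6], each style is stored as per-channel scale and shift parameters of the instance normalization layers, so one plausible reading is that averaging styles means averaging those parameters. The following minimal sketch illustrates that idea on a single channel; the function names `instance_norm` and `mix_styles` and the example values are hypothetical and are not taken from the released code.

```python
import math


def instance_norm(channel, gamma, beta, eps=1e-5):
    """Normalize one channel of one image, then apply a style-specific
    scale (gamma) and shift (beta)."""
    mean = sum(channel) / len(channel)
    var = sum((v - mean) ** 2 for v in channel) / len(channel)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in channel]


def mix_styles(params, weights):
    """Weighted average of per-style (gamma, beta) pairs.

    params: list of (gamma, beta) tuples, one per style.
    weights: non-negative mixing weights, normalized internally.
    """
    total = sum(weights)
    gamma = sum(w * g for w, (g, _) in zip(weights, params)) / total
    beta = sum(w * b for w, (_, b) in zip(weights, params)) / total
    return gamma, beta


# Average two hypothetical styles with equal weight and apply the result.
gamma, beta = mix_styles([(1.0, 0.0), (3.0, 2.0)], [0.5, 0.5])
stylized_channel = instance_norm([1.0, 3.0], gamma, beta)
```

Under this reading, creating a new style requires no retraining: only the mixing weights change, while the convolutional weights shared by all styles stay fixed.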
IV-B Machine Printed Text
We train our baseline model with 8 machine printed source text fonts (Arial, Times, Lobster, Corsiva, Caveat, Pacifico, Consolas and Syncopate) and ImageNet [12] training images. Figure 6 shows results of the model applied to machine text. It successfully transfers the main features of the source font style, such as line width, text orientation, and main font character style. However, it fails to transfer the specific styles of some characters, and the final output is influenced by the initial image.
IV-C Handwritten Text
The baseline model is trained with 8 styles from different writers, using images from the IAM dataset [21] as source styles and the ImageNet [12] dataset as training images. Figure 6 shows results of the model applied to IAM dataset images. The model correctly transfers the main features of the text, such as the tight characters and the thick stroke of the style in the first column, and the elongated and italic style of the writer in the second column. However, it fails to transfer more fine-grained characteristics of the source writer style, and some words of the resulting text are blurry.
IV-D Cross Domain
In this section, we go one step further and test the capability of our text style transfer models to transfer style to images of other text domains.
IV-D1 Machine Text to Scene Text
Transferring scene text styles to machine printed text has a huge potential in augmented reality scenarios and as a data augmentation technique, generating synthetic images with a given text style but different text content. Figure 7 shows results of styling machine printed text with the scene text model. The model successfully transfers scene text style to machine printed text with high fidelity.
Figure 9 shows some content augmentation results, where Arial text has been stylized with the destination scene text font, and the stylized text has been inserted in the image manually.
IV-D2 Handwritten to Scene Text
Figure 7 shows results of styling handwritten text with the scene text model. This task is quite complex, since scene text tends to be thick and detached while handwritten text is formed by tangled thin strokes. However, the scene text style transfer model successfully transfers some style features to it while keeping it legible, and the results are a combination of the transferred scene text style and the original handwritten style.
IV-D3 Scene Text to Machine Text
A model capable of converting any scene text to machine printed text would be a useful tool to improve scene text understanding pipelines. The machine printed text model allows us to do so, as shown in Figure 10. It correctly transfers the machine text source style if the scene text is simple, but it fails when the scene text has a complex font, is too small or is rotated. Note that the artifacts in those images are due to noisy high responses of the text detector.
IV-D4 Handwritten to Machine Text
Converting handwritten text to machine printed text could be very useful in a handwritten text understanding pipeline. Our machine text model transfers style features from machine fonts to handwritten text, but it breaks the content, resulting in illegible images, as shown in Figure 10.
IV-D5 Machine Text to Handwritten
Converting machine text to handwritten text can be a great tool to generate synthetic data to train handwritten text understanding models. However, our handwritten text model fails to transfer the font styles to machine text, as shown in Figure 8. It only manages to copy some general handwritten style features to machine text fonts that are closer to handwritten styles, like Caveat.
V Data Augmentation
In this section we include experiments that demonstrate the usefulness of the proposed selective scene text style transfer as a data augmentation tool to improve the performance of text detectors. The reason for using style transfer as a data augmentation technique is twofold. According to the recent seminal work by Geirhos et al. [9], CNNs are strongly biased towards recognising textures rather than shapes. To overcome this bias, they used neural style transfer as an augmentation technique, which not only reduced the bias of the CNN but also improved results on ImageNet classification. Thus, data augmentation using selective style transfer is expected to help overcome similar biases of the text recognizer. A second rationale is that some datasets do not offer the number of images necessary to efficiently train large neural networks. Style transfer in such cases can artificially increase the size of the datasets by combining different styles and contents. This is especially true for ICDAR 2013 and ICDAR 2015.
Moreover, the proposed data augmentation technique has a clear benefit compared with other methods to generate synthetic data [10, 25]. The generated images contain text with a different visual style, but the text appears in the same place as in the original images, so its position in the image is realistic, and its content is preserved, so the transcription matches the scene semantics.
- ICDAR 2013 [15]: the dataset contains training and test images that capture focused text on sign boards, posters, etc.
- ICDAR 2015 [14]: the dataset contains training and test images with incidental scene text, which means text that appears in the scene without the user focusing on it.
For these experiments, we trained an additional scene text style transfer model using different styles from the ICDAR 2015 training dataset, and used the two-stage architecture to perform selective text style transfer. We augment the ICDAR 2013, ICDAR 2015 and COCO-Text datasets using the two-stage model with 1 or 4 random additional styles per image. Figure 1 shows some of the resulting ICDAR 2015 augmentations. We train the EAST text detector [27] on the augmented and regular datasets. Stylizing COCO-Text with ICDAR 2015 training styles yields a COCO-Text dataset closer to the target testing data, which is the ICDAR 2015 testing set. To evaluate the trained models, we use the Robust Reading Competition framework (ICDAR 2015 Challenge 4: Incidental Scene Text Localization task and ICDAR 2013 Challenge 2: Focused Scene Text). Results in Table I show that text style transfer is a useful data augmentation technique: in the setups with a non-augmented baseline, the F-score improves by roughly 2 to 4 points using just 1 or 4 augmentations per image.
The boost in performance confirms that neural style transfer offers an advantageous and practical data augmentation technique that works both within a dataset and in cross-dataset, domain adaptation scenarios.
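The augmentation procedure can be summarized in a short sketch. It assumes a dataset is a list of (image, annotations) pairs and that a `stylize(image, style_id)` callable wraps the selective style transfer model; both the callable and the choice of sampling styles without replacement are assumptions of this sketch. The key property is that annotations are reused unchanged, because stylization keeps the text in place.

```python
import random


def augment_dataset(samples, style_ids, stylize, styles_per_image, seed=0):
    """Return the original samples plus stylized copies of each image.

    samples: list of (image, annotations) pairs.
    style_ids: list of available style identifiers.
    stylize: callable (image, style_id) -> stylized image (hypothetical).
    styles_per_image: number of random styles applied to each image.
    """
    rng = random.Random(seed)
    augmented = []
    for image, annotations in samples:
        # Keep the original image.
        augmented.append((image, annotations))
        # Add one stylized copy per sampled style; the text boxes and
        # transcriptions are unchanged, so annotations are reused as-is.
        for style_id in rng.sample(style_ids, styles_per_image):
            augmented.append((stylize(image, style_id), annotations))
    return augmented


# Toy usage with strings standing in for images and a fake stylizer.
toy_samples = [("img0", ["box0"]), ("img1", ["box1"])]
toy_stylize = lambda image, style_id: f"{image}@style{style_id}"
toy_augmented = augment_dataset(toy_samples, list(range(8)), toy_stylize, 4)
```

With 4 styles per image the training set becomes five times larger without any additional labeling effort.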
Table I. F-score of the EAST text detector trained with and without text style transfer augmentation.
| Training dataset | Testing dataset | Augmentations | F-score |
|---|---|---|---|
| ICDAR 13 | ICDAR 13 | - | 70.97 |
| ICDAR 13 | ICDAR 13 | 1 style per image | 74.55 |
| ICDAR 13 | ICDAR 13 | 4 styles per image | 75.29 |
| ICDAR 13+15 | ICDAR 15 | - | 80.83* |
| ICDAR 15 | ICDAR 15 | - | 78.74 |
| ICDAR 15 | ICDAR 15 | 1 style per image | 80.60 |
| ICDAR 15 | ICDAR 15 | 4 styles per image | 81.83 |
| COCO-Text | ICDAR 15 | 1 style per image | 69.66 |
| COCO-Text | ICDAR 15 | 4 styles per image | 70.71 |
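The gains implied by the table can be checked with simple arithmetic. The snippet below only restates the F-scores reported in the table for the two setups that have a non-augmented baseline; the variable names are illustrative.

```python
# F-scores without augmentation, keyed by training/testing dataset.
baseline = {"ICDAR 13": 70.97, "ICDAR 15": 78.74}

# F-scores with augmentation, keyed by (dataset, styles per image).
augmented = {
    ("ICDAR 13", 1): 74.55,
    ("ICDAR 13", 4): 75.29,
    ("ICDAR 15", 1): 80.60,
    ("ICDAR 15", 4): 81.83,
}

# Absolute gain in F-score points over the matching baseline.
gains = {
    key: round(score - baseline[key[0]], 2)
    for key, score in augmented.items()
}

for (dataset, n_styles), gain in gains.items():
    print(f"{dataset}, {n_styles} style(s) per image: +{gain:.2f}")
```

The gains range from 1.86 to 4.32 points, and in both datasets using 4 styles per image helps more than using 1.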
We have shown that a style transfer model is able to learn text styles, such as character shapes, line style and colors, and to transfer them to an input text while preserving the original characters. We have explored the performance of text style transfer in three text modalities (scene text, machine printed text and handwritten text) and in cross-modal scenarios, and have shown the usefulness of text style transfer as a data augmentation technique to train scene text detectors. Cross-modal experiments show the potential of this pipeline in virtual reality scenarios to style an arbitrary text with a given scene style, and the realism of the generated images suggests that the pipeline could be useful in artistic applications, such as graphic design. We open the field for further research in different directions, such as data augmentation for scene text detection or recognition, or handwritten writer identification.
The proposed two-stage and end-to-end selective style transfer architectures automatically produce realistic images where only the text has been styled. Furthermore, the end-to-end selective style transfer pipeline can be applied to other style transfer tasks besides text. We provide PyTorch style transfer code, based on Google's TensorFlow Magenta implementation, including the end-to-end selective style transfer implementation, and trained text style transfer models for both frameworks.
This work was supported by projects TIN2017-89779-P, Marie-Curie (712949 TECNIOspring PLUS), aBSINTHE (Fundacion BBVA 2017), Doctorats Industrials (AGAUR), the CERCA Programme / Generalitat de Catalunya, NVIDIA Corporation and a UAB PhD scholarship.
- [1] Kotaro Abe, Brian Kenji Iwana, Viktor Gösta Holmér, and Seiichi Uchida. Font Creation Using Class Discriminative Deep Convolutional Generative Adversarial Networks. ACPR, 2017.
- [2] Emre Aksan, Fabrizio Pece, and Otmar Hilliges. DeepWriting: Making Digital Ink Editable via Deep Generative Modeling. Conference on Human Factors in Computing Systems, 2018.
- [3] Samaneh Azadi, Matthew Fisher, Vladimir Kim, Zhaowen Wang, Eli Shechtman, and Trevor Darrell. Multi-Content GAN for Few-Shot Font Style Transfer. CVPR, 2018.
- [4] Dena Bazazian, Raúl Gómez, Anguelos Nicolaou, Lluís Gómez, Dimosthenis Karatzas, and Andrew D. Bagdanov. FAST: Facilitated and Accurate Scene Text Proposals through FCN Guided Pruning. Pattern Recognit. Lett., 2017.
- [5] Dena Bazazian, Raul Gomez, Anguelos Nicolaou, Lluis Gomez, Dimosthenis Karatzas, and Andrew D. Bagdanov. Improving Text Proposals for Scene Images with Fully Convolutional Networks. ICPR DLPR Work., 2017.
- [6] Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A Learned Representation for Artistic Style. ICLR, 2016.
- [7] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. A Neural Algorithm of Artistic Style. arXiv, 2015.
- [8] Leon A Gatys, Alexander S Ecker, Matthias Bethge, Aaron Hertzmann, and Eli Shechtman. Controlling Perceptual Factors in Neural Style Transfer. CVPR, 2017.
- [9] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv, 2018.
- [10] Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. Synthetic Data for Text Localisation in Natural Images. CVPR, 2016.
- [11] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the Knowledge in a Neural Network. arXiv, 2015.
- [12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. CVPR, 2009.
- [13] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. ECCV, 2016.
- [14] Dimosthenis Karatzas, Lluis Gomez-Bigorda, Anguelos Nicolaou, Suman Ghosh, Andrew Bagdanov, Masakazu Iwamura, Jiri Matas, Lukas Neumann, Vijay Ramaseshan Chandrasekhar, Shijian Lu, Faisal Shafait, Seiichi Uchida, and Ernest Valveny. ICDAR 2015 Competition on Robust Reading. ICDAR, 2015.
- [15] Dimosthenis Karatzas, Faisal Shafait, Seiichi Uchida, Masakazu Iwamura, Lluis Gomez I Bigorda, Sergi Robles Mestre, Joan Mas, David Fernandez Mota, Jon Almazàn Almazàn, and Lluis Pere De Las Heras. ICDAR 2013 Robust Reading Competition. ICDAR, 2013.
- [16] Ankan Kumar Bhunia, Ayan Kumar Bhunia, Prithaj Banerjee, Aishik Konwer, Abir Bhowmick, Partha Pratim Roy, and Umapada Pal. Word Level Font-to-Font Image Translation using Convolutional Recurrent Generative Adversarial Networks. ICPR, 2018.
- [17] Chuan Li and Michael Wand. Combining Markov Random Fields and Convolutional Neural Networks for Image Synthesis. CVPR, 2016.
- [18] Tsung Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick. Microsoft COCO: Common Objects in Context. Lect. Notes Comput. Sci., 2014.
- [19] Yang Liu, Zhaowen Wang, Hailin Jin, and Ian Wassell. Synthetically Supervised Feature Learning for Scene Text Recognition. ECCV, 2018.
- [20] Fujun Luan, Sylvain Paris, Eli Shechtman, and Kavita Bala. Deep Photo Style Transfer. CVPR, 2017.
- [21] U.-V. Marti and H. Bunke. The IAM-database: an English sentence database for offline handwriting recognition. Int. J. Doc. Anal. Recognit., 2002.
- [22] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR, 2015.
- [23] Dmitry Ulyanov, Vadim Lebedev, Andrea Vedaldi, and Victor Lempitsky. Texture Networks: Feed-forward Synthesis of Textures and Stylized Images. ICML, 2016.
- [24] Andreas Veit, Tomas Matera, Lukas Neumann, Jiri Matas, and Serge Belongie. COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images. arXiv, 2016.
- [25] Fangneng Zhan, Shijian Lu, and Chuhui Xue. Verisimilar Image Synthesis for Accurate Detection and Recognition of Texts in Scenes. ECCV, 2018.
- [26] Huihuang Zhao, Paul L Rosin, and Yu-Kun Lai. Automatic Semantic Style Transfer using Deep Convolutional Neural Networks and Soft Masks. arXiv, 2017.
- [27] Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang. EAST: An Efficient and Accurate Scene Text Detector. CVPR, 2017.