Text recognition plays an important role in the field of computer vision. With the advent of deep learning, text recognition methods have made great progress [1, 2, 3, 4]. However, they cannot achieve satisfactory performance when training data is insufficient, which causes over-fitting. Since collecting and labeling real text images is time-consuming, methods to synthesize text images were proposed to alleviate the deficit in training data. The method put forward by Jaderberg et al. [1, 5] is based on a font catalogue: coloring and projective distortion are applied to word images synthesized by font, border and shadow rendering, and the processed images are then added to background scene images together with some noise. Gupta et al. proposed to first apply semantic segmentation to the scene image; the processed word images are then pasted onto a contiguous region of it, which guarantees that a word will not appear across objects at different distances. Building on this, Zhan et al. presented a method that realizes semantically coherent synthesis: by leveraging the semantic annotations of objects and image regions created in prior semantic segmentation research, semantic coherence between the text and the background is achieved while synthesizing text images. These methods are effective, but usually need complicated preceding or follow-up steps such as collecting background images, coloring the words and adding noise to improve robustness, which requires more manual engineering.
In this paper, we propose a method based on Generative Adversarial Networks (GANs) which can generate an unlimited number of realistic text sequence images without any extra pre/post-processing. The inspiration comes from the procedure of drawing pictures: when painting something in the real world, we generally sketch its contours first, and then color the draft with pigments to finish the drawing. Following this procedure, we treat the generation task from a new perspective as an image-to-image translation problem, and utilize conditional adversarial networks to yield text sequence images from semantic ones. This work is based on a modified conditional GAN model named pix2pix, which aims to translate semantic images into realistic ones. Several evaluation metrics are used to assess our method and confirm its effectiveness. There are two main contributions in this work:
1. Unlike previous approaches, our method needs no extra preceding or follow-up steps for generation. Moreover, an unlimited number of images can be produced without any redundant operations.
2. The data generated by our method achieves satisfactory performance on various evaluation metrics, and the code and dataset will be made publicly available soon.
Generative Adversarial Networks (GANs) are models which learn a mapping from a random noise vector $z$ to an output image $y$, $G: z \rightarrow y$. By contrast, the subsequently proposed conditional Generative Adversarial Networks (cGANs) learn a mapping from an observed image $x$ and a random noise vector $z$ to an output image $y$, $G: \{x, z\} \rightarrow y$. We introduce the details of our method in the following parts.
II-A Network Architectures
First let us recall the architecture of the pix2pix network. It uses modules of the form Convolution-BatchNorm-ReLU in both the generator and the discriminator. Inspired by U-Net, skip connections are added to an encoder-decoder network, which serves as the generator. The architectures of the generator and the discriminator of the pix2pix model are shown in Fig. 2(a) and Fig. 2(b), respectively. The objective can be expressed as

$G^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G, D) + \lambda \mathcal{L}_{L1}(G),$

where $\mathcal{L}_{cGAN}(G, D)$ represents the objective of conditional GANs and $\mathcal{L}_{L1}(G)$ represents the L1 distance between the generated image and the ground truth. The parameter $\lambda$ is set to 100 in the original work. We make some adaptations to the network in order to make it more appropriate for text images. Each component is listed in the following part, and ablation experiments confirming their effectiveness are described in the next chapter. The pipelines of pix2pix and our work are shown in Fig. 1.
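As a concrete sketch of this objective (a minimal illustration under our own assumptions, not the authors' released code), the generator's side of the loss can be written in PyTorch as the adversarial term plus the $\lambda$-weighted L1 term:

```python
import torch
import torch.nn as nn

# Sketch of the pix2pix generator objective: adversarial loss plus
# lambda-weighted L1 reconstruction loss (lambda = 100 as in the paper).
bce = nn.BCEWithLogitsLoss()   # adversarial criterion (a common choice)
l1 = nn.L1Loss()
LAMBDA = 100.0

def generator_loss(disc_logits_fake, fake_img, real_img):
    # The generator tries to make the discriminator label fakes as real.
    adv = bce(disc_logits_fake, torch.ones_like(disc_logits_fake))
    # Pixel-wise L1 distance between generated image and ground truth.
    rec = l1(fake_img, real_img)
    return adv + LAMBDA * rec
```

Only the generator's loss is shown; the discriminator is trained with the usual real/fake objective of the min-max game.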
II-A1 Cascaded Generators
Inspired by StackGAN, which decomposes the text-to-image generative process into two stages, where the first GAN aims to generate images at a small size and the second one tries to improve the resolution, we cascade two generators so that each generator has its own focus. The first generator is designed to generate the text area and its surroundings, defined as the foreground area obtained through a dilation operation on the masks of the text sequence images. The second generator supplements the background area to produce realistic scene images. The architectures of the two generators are the same, but they are optimized separately. As each generator does not need to generate too many areas, it can concentrate on its own work. Restricted by our hardware, only two generators are utilized, though more generators are possible. The procedure is depicted in Fig. 2(c).
| Cascaded Network | Residual Blocks | PReLU | Inception score | FID score | ICDAR2013 | ICDAR2015 | IIIT 5K |
|---|---|---|---|---|---|---|---|

| Dataset | Image number | Inception score | FID score | ICDAR2013 | ICDAR2015 | IIIT 5K |
|---|---|---|---|---|---|---|
| Gupta et al. | 8M | 4.115 ± 0.202 | 66.16 | 78.3 | 46.7 | 75.5 |
| Jaderberg et al. [1, 5] | 8M | 2.747 ± 0.170 | 51.68 | 78.1 | 52.1 | 77.6 |
| Gupta et al. (colored) | 8M | 4.848 ± 0.323 | 72.98 | 78.7 | 48.5 | 74.9 |
II-A2 Residual Blocks
Residual Networks (ResNets) solve the degradation problem in training deeper networks by employing residual blocks. The mapping of a residual block can be expressed as

$y = \mathcal{F}(x) + x,$

which lets the layers fit a residual mapping $\mathcal{F}(x)$ rather than the desired underlying mapping $\mathcal{H}(x) = \mathcal{F}(x) + x$, mitigating the vanishing gradient problem. So that our model can be optimized better, residual blocks are added to the generator. As shown in Fig. 3(c), after each convolutional layer of the encoder except the last one, a residual block with two convolutional layers, depicted in Fig. 2(d), is added.
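A residual block of the kind described can be implemented as follows (a sketch; the kernel size and normalization are assumptions):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block with two convolutional layers; the identity shortcut
    means the layers fit the residual F(x) = H(x) - x."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # y = F(x) + x
        return torch.relu(self.body(x) + x)
```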
II-A3 Activation Function

We replace the leaky ReLU activations with PReLU, whose slope parameter is learnable. The parameter is set to 0.25 at first, and it is updated automatically during training. In contrast, the parameter of leaky ReLU must be set by hand, and searching for the best-fitting value takes a lot of time without satisfactory results. Because only a few parameters are added to the network, the computation and the risk of over-fitting do not increase much.
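The difference can be illustrated in PyTorch (a small standalone example, not the paper's network):

```python
import torch
import torch.nn as nn

prelu = nn.PReLU(init=0.25)               # slope is a learnable parameter
leaky = nn.LeakyReLU(negative_slope=0.2)  # slope is a fixed hyper-parameter

x = torch.tensor([-1.0, 2.0])
out = prelu(x)  # negative inputs scaled by the current slope (0.25 here)
```

During training, the PReLU slope receives gradients like any other weight, so no manual search over slope values is needed.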
II-B Synthesizing Semantic Images
Gupta et al. proposed a method to synthesize text sequence images through morphological operations, which inspired our way of synthesizing semantic images. Taking this approach as a basis, we first acquire suitable text samples from the Newsgroup20 dataset as words, lines and paragraphs. Each text sample is then rendered with a randomly selected font and transformed randomly. Finally, the text is blended into a black background image using Poisson image editing.
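The rendering step can be sketched with Pillow (a toy version: the default font and a simple rotation stand in for random font selection, the full random transformations, and Poisson blending):

```python
import random
from PIL import Image, ImageDraw, ImageFont

def render_semantic(text, size=(256, 64)):
    img = Image.new("RGB", size, (0, 0, 0))   # black background
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()           # stand-in for a random font
    draw.text((10, 20), text, fill=(255, 255, 255), font=font)
    # Small random rotation as a stand-in for the random transformation.
    return img.rotate(random.uniform(-5, 5), fillcolor=(0, 0, 0))

sample = render_semantic("hello")
```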
In the following part, we describe the implementation details of our method. We also utilize several evaluation metrics and run a number of ablations to analyze the effectiveness of the proposed components.
In the training stage, we collect data from the ICDAR 2013 training dataset, which contains 229 images, and the KAIST scene text database, which contains 1,498 images. It is worth noting that no testing dataset is involved in the training process. There are in total 6,715 word images after we discard those that contain only punctuation and those whose height is greater than their width, and we relabel them for better adaptability to our model. During training, the optimizer of the network is Adam, the batch size is set to 64, and the learning rate is 0.0002. All images are resized to a fixed resolution, and the model is implemented in PyTorch [22]. All experiments are carried out on a standard PC with an Intel i7-8700 CPU and a single NVIDIA TITAN Xp GPU.
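The optimizer settings above translate directly to PyTorch (the model here is a placeholder, and the Adam betas are an assumption; the text specifies only the optimizer, batch size and learning rate):

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the generator/discriminator
optimizer = torch.optim.Adam(model.parameters(),
                             lr=2e-4,             # learning rate 0.0002
                             betas=(0.5, 0.999))  # common GAN choice (assumed)
BATCH_SIZE = 64
```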
III-B Evaluation Metrics

III-B1 Inception score

Calculating the Inception score of the generated dataset is one way to evaluate its quality. Images that contain meaningful objects should have a conditional label distribution $p(y|x)$ with low entropy, while the marginal $p(y)$ should have high entropy, since we expect the model to generate varied images.
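Concretely, the Inception score is $\exp(\mathbb{E}_x\,\mathrm{KL}(p(y|x)\,\|\,p(y)))$. A small NumPy sketch, taking the class-probability matrix as given rather than computing it with an Inception network:

```python
import numpy as np

def inception_score(p_yx, eps=1e-12):
    """p_yx: rows are images, columns are class probabilities p(y|x)."""
    p_y = p_yx.mean(axis=0, keepdims=True)       # marginal p(y)
    kl = p_yx * (np.log(p_yx + eps) - np.log(p_y + eps))
    return float(np.exp(kl.sum(axis=1).mean()))  # exp(E_x KL(p(y|x) || p(y)))

# Confident, varied predictions give a high score; uniform ones give 1.
print(inception_score(np.eye(4)))              # ~4.0
print(inception_score(np.full((4, 4), 0.25)))  # ~1.0
```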
III-B2 FID score

The Fréchet Inception Distance (FID) score is also an indicator of the performance of a GAN model, because it measures the similarity between two datasets: a lower FID score means the two datasets are more similar to each other. As we expect the distribution of the generated images to be close to that of real ones, FID scores between the generated data and the ICDAR 2013 testing dataset are used to evaluate our model.
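FID compares the mean and covariance of Inception features of the two datasets: $\|\mu_1-\mu_2\|^2 + \mathrm{Tr}(\Sigma_1+\Sigma_2-2(\Sigma_1\Sigma_2)^{1/2})$. A sketch over pre-extracted feature matrices (the feature extraction itself is omitted):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats1, feats2):
    """feats1, feats2: rows are samples, columns are feature dimensions."""
    mu1, mu2 = feats1.mean(axis=0), feats2.mean(axis=0)
    s1 = np.cov(feats1, rowvar=False)
    s2 = np.cov(feats2, rowvar=False)
    covmean = sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):  # discard tiny imaginary numerical residue
        covmean = covmean.real
    return float(((mu1 - mu2) ** 2).sum() + np.trace(s1 + s2 - 2 * covmean))
```

Identical datasets give a (numerically) zero score; the further apart the two feature distributions, the larger the score.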
III-B3 Recognition Task

Our original intention is to generate training data for recognition models, so a higher accuracy of a model trained on the data indicates better data quality. Therefore, an end-to-end recognition network named CRNN (Convolutional Recurrent Neural Network) is applied to test our model. The text sequence images and their contents are fed into the training stage. For fair comparison, we generate 8M images and convert them to grayscale to match the data from Jaderberg et al. [1, 5]. In addition, we also test the model using colored images. In the training process, the batch size is set to 64 and the learning rate is 0.00005 with the SGD optimizer. The network is trained for 3 epochs, which takes about 6 hours on our hardware. The trained models are evaluated on public benchmarks including ICDAR 2013, ICDAR 2015 and IIIT 5K.
III-C Ablation Experiments

First, we evaluate the effectiveness of the proposed components by comparing them with our baseline model, which is extended directly from the pix2pix framework. The comparison results are listed in Table I. From the table, we can observe that each component improves performance over the baseline model. Integrating all of them yields a further improvement on every evaluation metric, which demonstrates that the proposed components are effective for the generation task.
Second, we compare with other methods, including Jaderberg et al. [1, 5] and Gupta et al. The numbers of images generated by each method are the same. Samples from each dataset are shown in Fig. 3. In particular, we include the training data of our GAN model in the comparisons. The results are shown in Table II. Naturally, the training images of our GAN model achieve the best FID score, because they are sampled from the same distribution as the ICDAR 2013 testing dataset. However, they cannot achieve good performance in the recognition task, because severe over-fitting occurs with only about 6k training images. On the Inception score, Jaderberg et al. reach the best result because their images contain less background information; the colored images are not as good as the gray ones because a colored background contains more content. On the FID score, our method ranks first among the three synthesis methods. In the recognition task, we achieve the highest accuracy on ICDAR 2013 and IIIT 5K. On ICDAR 2015, our data does not achieve the best accuracy; we consider this is because ICDAR 2015 contains plenty of images whose text is blurred or distorted, while our generated images are too clean. In addition, the colored images are also evaluated, but there is no obvious improvement; we argue that the CRNN network is not sensitive to the color mode of its inputs. Finally, it is worth noting that each recognition model is trained with only 8M images, while our method is able to generate an unlimited number of images without any extra processing.
IV Conclusion and Future Work
We have proposed a method to generate realistic text sequence images for training recognition models. The method is able to produce an unlimited number of high-quality images, exceeding general morphology-based methods. As more complicated networks can be used to synthesize high-resolution images, our future goal is to design an end-to-end system that, given a font catalogue and a lexicon, can detect and recognize text in high-resolution images.
-  M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, “Synthetic data and artificial neural networks for natural scene text recognition,” arXiv preprint arXiv:1406.2227, 2014.
-  C.-Y. Lee and S. Osindero, “Recursive recurrent nets with attention modeling for OCR in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2231–2239.
-  B. Shi, X. Bai, and C. Yao, “An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 11, pp. 2298–2304, 2017.
-  F. Bai, Z. Cheng, Y. Niu, S. Pu, and S. Zhou, “Edit probability for scene text recognition,” arXiv preprint arXiv:1805.03384, 2018.
-  M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, “Reading text in the wild with convolutional neural networks,” arXiv preprint arXiv:1412.1842, 2014.
-  A. Gupta, A. Vedaldi, and A. Zisserman, “Synthetic data for text localisation in natural images,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016.
-  F. Zhan, S. Lu, and C. Xue, “Verisimilar image synthesis for accurate detection and recognition of texts in scenes,” in European Conference on Computer Vision. Springer, 2018, pp. 257–273.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
-  M. Mirza and S. Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014.
-  P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017, pp. 5967–5976.
-  S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning, 2015, pp. 448–456.
-  O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.
-  G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
-  H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, “Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5907–5915.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
-  K. Lang and T. Mitchell, “Newsgroup 20 dataset,” 1999.
-  P. Pérez, M. Gangnet, and A. Blake, “Poisson image editing,” ACM Transactions on graphics (TOG), vol. 22, no. 3, pp. 313–318, 2003.
-  D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, and L. P. De Las Heras, “ICDAR 2013 robust reading competition,” in Document Analysis and Recognition (ICDAR), 2013 12th International Conference on. IEEE, 2013, pp. 1484–1493.
-  J. Jung, S. Lee, M. S. Cho, and J. H. Kim, “Touch tt: Scene text extractor using touchscreen interface,” ETRI Journal, vol. 33, no. 1, pp. 78–88, 2011.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  A. Paszke, S. Gross, S. Chintala, and G. Chanan, “Pytorch,” 2017.
-  T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” in Advances in Neural Information Processing Systems, 2016, pp. 2234–2242.
-  M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” in Advances in Neural Information Processing Systems, 2017, pp. 6626–6637.
-  C. Yao, J. Wu, X. Zhou, C. Zhang, S. Zhou, Z. Cao, and Q. Yin, “Incidental scene text understanding: Recent progresses on ICDAR 2015 robust reading competition challenge 4,” arXiv preprint arXiv:1511.09207, 2015.
-  A. Mishra, K. Alahari, and C. V. Jawahar, “Scene text recognition using higher order language priors,” in BMVC, 2012.