Scene Text Synthesis for Efficient and Effective Deep Network Training

A large amount of annotated training images is critical for training accurate and robust deep network models, but collecting such images is often time-consuming and costly. Image synthesis alleviates this constraint by generating annotated training images automatically, and it has attracted increasing interest in recent deep learning research. We develop an innovative image synthesis technique that composes annotated training images by realistically embedding foreground objects of interest (OOI) into background images. The proposed technique consists of two key components that in principle boost the usefulness of the synthesized images in deep network training. The first is context-aware semantic coherence, which ensures that the OOI are placed around semantically coherent regions within the background image. The second is harmonious appearance adaptation, which ensures that the embedded OOI are agreeable to the surrounding background in terms of both geometry alignment and appearance realism. The proposed technique has been evaluated over two related but very different computer vision challenges, namely, scene text detection and scene text recognition. Experiments over a number of public datasets demonstrate the effectiveness of our proposed image synthesis technique: using our synthesized images in deep network training achieves similar or even better scene text detection and scene text recognition performance as compared with using real images.


1 Introduction

Effective and efficient collection of large amounts of annotated training images is critical to the research and development of deep neural networks (DNNs) in different computer vision problems. Manual annotation takes time and usually has poor scalability and transferability when dealing with image data that are collected under different conditions, environments, or domains.

Fig. 1: The overview of the proposed image synthesis system: With background images and foreground texts as inputs, the proposed system identifies suitable text embedding regions and places the foreground texts into the background images with realistic geometry and appearance.

Different approaches have been attempted to relieve the image annotation constraint. The first resorts to the widely adopted image augmentation that creates new training images through photometric and geometric alteration of annotated training images. The second leverages semi-supervised and unsupervised learning that automatically collects or generates new training images through bootstrapping [41], Generative Adversarial Networks (GAN) [38], etc. The third is through image composition that creates new images by placing foreground objects of interest (OOI) into certain background images. The challenge is to compose images that are as realistic as possible so as to make them useful in DNN training.

We designed an image synthesizer that composes scene text images by embedding texts into natural images realistically [52]. The image synthesizer achieves promising performance when the synthesized images are employed to train scene text detection and recognition models. On the other hand, it involves complicated hand-crafted operations and requires box annotations within existing scene text detection and recognition datasets when determining the appearance of the embedded texts. In addition, its background images must come from the training images of existing semantic segmentation datasets, as it leverages region semantics to achieve semantically coherent synthesis. The image synthesis technique proposed in this paper improves our earlier work [52] in several aspects. First, it introduces semantic segmentation [5] and defines a semantic score with which any image can be used as the background image. Second, it introduces contour segmentation [1], and the detected image contours directly lead to better geometry alignment of the embedded texts. Third, it determines the text appearance by using a GAN together with feature-level and semantic-level losses, and hence removes both the complicated hand-crafted operations and the heavy reliance on existing scene text annotations. Experiments show that the newly synthesized images are more realistic and perform clearly better than those of [52] when applied to train scene text detection and recognition models.

The rest of this paper is organized as follows. Section 2 briefly presents related work. The proposed technique is then described in detail in Section 3. Experimental results are presented and discussed in Section 4. Concluding remarks are finally drawn in Section 5.

2 Related Work

2.1 Image Synthesis

Realistic image composition by inserting foreground objects into a background image has been studied extensively. The target is to achieve composition realism by adjusting object geometry and object appearance automatically and adaptively with respect to the surrounding background. With the advances of deep neural network research, image synthesis has been investigated as an image augmentation approach for cases where only a limited amount of annotated images is available. For example, [16, 11, 52] study the synthesis of scene text images for training better scene text detection and recognition models, and [9] uses synthetic chair images for training better optical flow networks. In recent years, a number of GAN-based techniques have been reported which synthesize new images by generation from random noise [30, 6, 54], appearance transfer [25, 59] or image composition [53].

Our proposed technique adopts the image composition approach for the synthesis of verisimilar scene text images. Beyond training GANs for realistic text appearance, it is capable of identifying suitable text embedding locations within the background image according to the semantic coherence. In addition, it exploits local contexts and is capable of aligning the foreground texts with the contextual structures within the background image realistically.

2.2 Scene Text Detection and Recognition

Automatic detection and recognition of various texts in scene images has attracted increasing research interest in recent years due to many relevant applications in practice [20, 35]. Different detection techniques have been proposed, from the earlier ones using hand-crafted features [36, 29, 27] to the recent ones using DNNs [57, 18, 50, 43, 45]. Different detection approaches have also been explored, including character-based detection [15, 42, 19, 13], word-based detection [18, 24, 26, 12, 58] and the recent line-based detection [48, 55]. Meanwhile, different scene text recognition techniques have been developed, from the earlier ones recognizing characters directly [16, 49, 32, 2, 10, 17] to the recent ones recognizing words or text lines using recurrent neural networks (RNN) [33, 39, 40, 34, 51] as well as various attention models [23, 7].

Similar to other detection and recognition tasks, training accurate and robust scene text detection and recognition models requires a huge amount of annotated training images that cover the large variation and diversity of texts in scene images. On the other hand, most existing datasets such as ICDAR2015 [20] and Total-Text [8] contain only a few hundred or a few thousand training images due to the challenge of image collection and annotation. The shortage of training images has become one major factor that impedes the advance of scene text detection and recognition research. The proposed image synthesis technique addresses this challenge by composing a theoretically unlimited number of realistic scene text images, as detailed in the following section.

3 Scene Text Image Synthesis

Synthesizing realistic images that are useful in DNN training has been a very challenging research problem. The proposed scene text image synthesis technique addresses this challenge by two innovative designs, namely, a region detection module that searches for regions within a background image that are suitable for text placement and a text embedding module that determines the geometry and appearance of the foreground text for harmonious foreground-background composition, more details to be presented in the ensuing two subsections.

Fig. 2: Given a background image, two sets of candidate text embedding regions are first detected through contour-based segmentation and semantics-based segmentation, respectively. The semantic score of each extracted region is then determined, and a list of background regions that are suitable for text embedding is further identified according to their semantic scores. Foreground texts (randomly picked from some existing text corpus) are finally placed into the identified background regions with realistic geometry and appearance. The two images in the last column show the synthesized images without using contour segmentation and semantic segmentation, respectively.

3.1 Text Embedding Region Detection

Given a background image, automatic scene text image synthesis first needs to locate suitable regions for the embedding of the foreground texts. We determine the suitable text embedding regions based on observations of where texts are located in scenes. In particular, scene texts often appear around homogeneous regions such as doors and signboards, largely for better legibility and to attract human attention. In addition, scene texts usually appear over semantically sensible objects or regions such as walls and vehicles instead of non-sensible objects such as trees and animals.

Based on the above observations, we determine the suitable text embedding regions by computing a semantic score map for each background image. The semantic score is estimated by fusing two types of image segmentation. The first type is based on the variation of image gradients, which typically divides the background image into a number of regions along the boundary of homogeneous regions. As a result, the generated regions usually do not have large variations and gradient changes, as shown in “Contour Segmentation” in Fig. 2 (produced by [1]). The second type is based on object semantics, which typically divides the background image into a number of semantically sensible regions along object boundaries. Since the segmentation targets complete and semantically sensible objects, the generated regions often contain very dynamic gradient changes, such as the tower shown in “Semantic Segmentation” in Fig. 2 (produced by [5]). In addition, each segmented region has a specific semantics such as cars, humans, buildings, etc.

The semantic score map is estimated by fusing the two types of segmented image regions. For each gradient-based region, the percentages of its pixels of different semantics are first determined according to the semantically segmented regions and the corresponding semantics, e.g. 75% of the pixels are building (overlapping with the semantically segmented building region) and 25% are sky. The semantic score of the region (and its semantics) is then determined by the top percentage among the semantic regions (and the corresponding semantics), as shown in “Semantic Score” in Fig. 2. By pre-defining a set of object semantics that are suitable for text embedding, such as signboards and cars, the proposed technique can thus determine a list of semantically sensible regions (within the background image), each of which comes with a semantic score telling how suitable it is for text embedding, as shown in “Detected Regions” in Fig. 2. Note that ultra-small regions and regions with a very low semantic score are filtered out in the “Detected Regions”.
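The fusion step can be illustrated with a minimal sketch, assuming both segmentations are available as integer label maps of the same size. The function name, the class ids, and the thresholds `min_pixels` and `min_score` are hypothetical choices for illustration and are not taken from the implemented system.

```python
import numpy as np

def semantic_scores(contour_labels, semantic_labels, suitable_classes,
                    min_pixels=4, min_score=0.5):
    """Score each contour-based region by its dominant semantic class.

    Returns {region_id: (dominant_class, score)} for the regions that are
    large enough, score high enough, and have a suitable dominant class.
    The thresholds are illustrative assumptions.
    """
    results = {}
    for region_id in np.unique(contour_labels):
        # Semantic class of every pixel that falls inside this contour region.
        classes = semantic_labels[contour_labels == region_id]
        if classes.size < min_pixels:
            continue  # filter out ultra-small regions
        values, counts = np.unique(classes, return_counts=True)
        top = int(np.argmax(counts))
        score = counts[top] / classes.size  # top percentage of one semantic class
        label = int(values[top])
        if label in suitable_classes and score >= min_score:
            results[int(region_id)] = (label, float(score))
    return results

# Toy example: class 2 = building, 5 = tree, 7 = sky (hypothetical ids).
contour = np.array([[0, 0, 1, 1]] * 4)
semantic = np.array([[2, 2, 5, 5],
                     [2, 2, 5, 5],
                     [2, 2, 5, 5],
                     [7, 7, 5, 7]])
detected = semantic_scores(contour, semantic, suitable_classes={2})
print(detected)  # region 0 is 75% building; region 1 is dominated by tree and is dropped
```

In this toy case region 0 reproduces the 75% building and 25% sky example given above, while region 1 is rejected because its dominant class is not in the pre-defined set of suitable semantics.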

3.2 Scene Text Embedding

Once suitable text placement regions are determined, foreground texts need to be embedded into the background images as realistically as possible. We address this challenge from two aspects, namely, context-aware geometry alignment and GAN-based appearance adaptation, more details to be presented in the ensuing two subsections.

Fig. 3: Context-aware geometry alignment: Given an identified scene text embedding region, a random homography ‘H’ is applied, which produces a transformed region. Fronto-parallel texts are then embedded in the transformed region. The restored region with texts in perspective view is finally derived by applying the inverse homography ‘InvH’ to the transformed region with the embedded texts.
Fig. 4: GAN-based appearance adaptation: Given a foreground text sample ‘Pasadena’ for synthesis, a binary text mask is first determined with realistic geometry as described in Section 3.2.1. The mask is then concatenated with the identified background region and fed to the GAN generator to generate a composed image. A masked composed image is further determined, and its appearance is adapted according to a set of real scene text images by using the GAN discriminator. The two images in the last column show the synthesis with and without using our proposed technique, respectively. For clarity, the feature-level adaptation is not included.

3.2.1 Context-Aware Geometry Alignment

Natural spatial alignment of the foreground texts with the background image is critical to the realism of the synthesized scene text images as well as their usefulness in DNN training. We design a context-aware geometry alignment technique based on the observation that scene texts often align well with the borders of contextual objects, e.g. the borders of signboards and license plates. Leveraging the (contour-based) segmented image regions described in Section 3.1, we determine the contextual object borders by the boundaries of the segmented regions, which are often perfectly aligned or even overlapped with the contextual object borders.

In addition, texts within scene images often have perspective instead of fronto-parallel views. To synthesize scene text images with realistic perspectives, we first transform an identified text embedding region by a homography H, which produces a transformed region as shown in Fig. 3. The foreground text in fronto-parallel view is then placed into the transformed region in perfect alignment with the region boundary, which produces a transformed region with embedded text as illustrated in Fig. 3. Finally, this region is transformed back to the original dimension by using the inverse of the homography, i.e. InvH, as shown in Fig. 3. Instead of deriving the homography from complicated depth maps as in [11], we determine its parameters randomly to ensure sufficient diversity of the generated perspectives. Our study shows that using a random homography has little effect on the synthesized scene text images when they are applied in training scene text detection and recognition models, more to be discussed in the ensuing Experiments.
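The transform, embed, and restore sequence can be sketched on corner points alone. This is a minimal illustration under stated assumptions: the way the random homography is sampled (small jitter on the affine and perspective terms), the jitter magnitudes, and the helper names are hypothetical, and real text rendering is replaced by an axis-aligned text box.

```python
import numpy as np

def random_homography(rng, affine_jitter=0.1, perspective_jitter=1e-3):
    """Sample a mild random homography (the sampling scheme is an assumption)."""
    H = np.eye(3)
    H[:2, :2] += rng.uniform(-affine_jitter, affine_jitter, size=(2, 2))
    H[2, :2] = rng.uniform(-perspective_jitter, perspective_jitter, size=2)
    return H

def apply_homography(H, points):
    """Map an (N, 2) array of points through a 3x3 homography."""
    homogeneous = np.hstack([points, np.ones((len(points), 1))])
    mapped = homogeneous @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

rng = np.random.default_rng(0)
H = random_homography(rng)
H_inv = np.linalg.inv(H)  # plays the role of InvH in Fig. 3

# Corners of an identified text embedding region (pixel coordinates).
region = np.array([[0, 0], [100, 0], [100, 40], [0, 40]], dtype=float)
warped_region = apply_homography(H, region)

# Place a fronto-parallel (axis-aligned) text box in the transformed frame.
center = warped_region.mean(axis=0)
half = (warped_region.max(axis=0) - warped_region.min(axis=0)) / 4
text_box = center + half * np.array([[-1, -1], [1, -1], [1, 1], [-1, 1]])

# Restore to the original frame: the text box becomes a perspective quadrilateral.
text_in_image = apply_homography(H_inv, text_box)
```

Mapping the transformed region back through `H_inv` recovers the original corners, while the text box, which is rectangular only in the transformed frame, appears with a perspective distortion in the original image.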

3.2.2 GAN-based Appearance Learning

Besides spatial alignment, texts in scenes usually appear with harmonious colors, brightness and styles with respect to the surrounding background. To make the synthesized images useful in deep network model training, the image synthesis needs to learn to determine the appearance of the foreground texts adaptively and automatically according to the contexts of the detected text embedding regions within the background image. We design a GAN-based appearance learning technique that combines feature-level adaptation and semantic consistency [14] for determining realistic appearance of the foreground texts as illustrated in Fig. 4.

Given the foreground texts to be synthesized, a binary text mask $m$ (text pixels being 1 and non-text pixels being 0) is first generated with a random text font as illustrated in Fig. 4. The text mask is then concatenated with an identified text embedding region $R$ (as described in Section 3.1) to form the input $x$, which is fed to the GAN generator to produce a translated image $G(x)$ via image-to-image translation as shown in Fig. 4. With the binary mask $m$, a masked image $G_m(x)$ can be further derived from $G(x)$ which keeps the image background unchanged:

$$G_m(x) = m \odot G(x) + (1 - m) \odot R \qquad (1)$$

where $\odot$ denotes element-wise multiplication.
The use of the mask in Eq. 1 aims to minimize the change of the image background which greatly helps to synthesize more realistic scene text images.
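The masking step of Eq. 1 can be sketched as follows, assuming the generator output and the background region are arrays of the same shape and the mask is a single-channel binary array. The function name and the toy arrays are hypothetical.

```python
import numpy as np

def masked_composition(generated, background, mask):
    """Keep generator output on text pixels and the original background elsewhere.

    generated, background: float arrays of shape (H, W, C)
    mask: binary array of shape (H, W), 1 on text pixels and 0 elsewhere
    """
    mask = mask[..., None].astype(generated.dtype)  # broadcast over channels
    return mask * generated + (1.0 - mask) * background

# Toy example: the generator changed every pixel, but only text pixels are kept.
background = np.zeros((2, 2, 3))
generated = np.ones((2, 2, 3))
mask = np.array([[1, 0],
                 [0, 1]])
composed = masked_composition(generated, background, mask)
```

Because non-text pixels are copied from the background, any artifacts that the generator introduces outside the text strokes do not reach the synthesized image.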

Layers Out Size Configurations
TABLE I: The structure of the translation network.

A discriminator is trained to distinguish the generated images from a set of real scene text images. Beyond the pixel-level discrimination as employed in most GAN discriminators, we introduce feature-level adaptation and semantic consistency for more realistic image-to-image translation. In particular, the feature-level adaptation aims to minimize the difference between the deep network features extracted from the synthesized images and those extracted from the real scene text images. The rationale is that the synthesized scene text images should have similar deep network features as real scene text images so as to be effective in deep model training. The feature-level loss (Eq. 2) is defined over these features: a feature extractor is applied to both the real and the synthesized scene text images, and a feature-level discriminator aims to distinguish the features of the two.

We also introduce a semantic consistency loss to avoid the flipping of text annotations. The idea is to pre-train a scene text recognizer and use it to recognize the translated scene text images. Since the semantics of the embedded foreground texts are known, the semantic consistency loss (Eq. 3) is formulated by comparing the recognizer's output on the translated images with the embedded texts. It should be noted that the parameters of the recognizer are fixed during the training process. The semantic consistency loss is used to update the parameters of the generator only.

In our implemented system, we adopt the Wasserstein GAN [4] for image appearance adaptation. The training of the discriminator is guided by two losses: the feature-level loss defined in Eq. 2 and the pixel-level loss inherited from the original Wasserstein GAN discriminator. Similarly, the training of the generator is guided by two losses: the semantic consistency loss defined in Eq. 3 and the pixel-level loss inherited from the original Wasserstein GAN generator. In addition, the reference real scene text images are derived by combining the training images of several existing scene text datasets including ICDAR2013 [21], ICDAR2015 [20], SVT [44], IIIT5K [28] and CUTE80 [31].
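For reference, the pixel-level terms inherited from the Wasserstein GAN [4] take the standard critic and generator form shown in the sketch below. How the additional feature-level and semantic consistency terms are weighted against them is not specified here, so the additive combination and the `weight` parameters are assumptions, and the extra loss values are passed in as plain numbers.

```python
import numpy as np

def wgan_critic_loss(real_scores, fake_scores):
    """Standard WGAN critic loss: push real scores up and fake scores down."""
    return float(np.mean(fake_scores) - np.mean(real_scores))

def wgan_generator_loss(fake_scores):
    """Standard WGAN generator loss: push the critic's fake scores up."""
    return float(-np.mean(fake_scores))

def discriminator_objective(real_scores, fake_scores, feature_loss, weight=1.0):
    # Pixel-level WGAN term plus the feature-level term (weighting is assumed).
    return wgan_critic_loss(real_scores, fake_scores) + weight * feature_loss

def generator_objective(fake_scores, semantic_loss, weight=1.0):
    # Pixel-level WGAN term plus the semantic consistency term (weighting is assumed).
    return wgan_generator_loss(fake_scores) + weight * semantic_loss

real = np.array([1.0, 3.0])  # critic scores on real scene text images
fake = np.array([0.0, 2.0])  # critic scores on synthesized images
d_loss = discriminator_objective(real, fake, feature_loss=0.5)
g_loss = generator_objective(fake, semantic_loss=0.25)
```

In this toy case the critic already scores real images higher than synthesized ones, so its pixel-level term is negative.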

| Methods | Training data | R | P | F |
|---|---|---|---|---|
| Jaderberg et al. [18] | IC13+SVT | 68.0 | 86.7 | 76.2 |
| Zhang et al. [56] | IC13+IC15+MT | 78.0 | 88.0 | 83.0 |
| Tian et al. [43] | IC13+Private | 83.0 | 93.0 | 88.0 |
| Yao et al. [48] | IC13+IC15+MT | 80.2 | 88.9 | 84.3 |
| Gupta et al. [11] | Gupta 800K | 76.4 | 93.8 | 84.2 |
| EAST | Ours 400K | 81.5 | 91.3 | 86.1 |

TABLE II: Scene text detection recall (R), precision (P) and f-score (F) on the ICDAR2013 dataset. “EAST” denotes the adapted EAST model as described in Section 4.2.1. “IC13”, “IC15”, “SVT” and “MT” denote the original training images within the ICDAR2013, ICDAR2015, SVT and MSRA-TD500 datasets as described in Section 4.1.

4 Experiments

The proposed image synthesis technique has been evaluated on the scene text detection and scene text recognition tasks over several public datasets.

4.1 Datasets

The proposed image synthesis technique has been evaluated over the following public datasets:

ICDAR 2013 [21] dataset contains 229 images for training and 233 images for testing with word-level annotations. For the recognition task, there are 848 word images for training recognition models and 1095 word images for evaluation.

ICDAR 2015 [20] is a dataset that consists of 1,670 images (17,548 annotated incidental scene text regions) acquired using the Google Glass. Incidental scene text refers to text that appears in the scene without the user taking any action to rectify the position and quality of the text regions.

MSRA-TD500 [47] dataset consists of 300 natural images for training and 200 images for testing with diverse visual contents and resolutions, which are taken from cluttered indoor and outdoor scenes using a pocket camera.

IIIT5K [28] dataset consists of 2000 training images and 3000 test images that are cropped from scene texts and born-digital images. For each image, there are two lexicons: one with 50 words and the other with 1000 words.

SVT [44] dataset consists of 249 street view images from which 647 word images are cropped. Each word image has a 50-word lexicon.

CUTE [31] has 288 curved word images cropped from the CUTE dataset, which was originally collected for scene text detection research.

| Training Data | R | P | F |
|---|---|---|---|
| 1K Synth (Random) | 69.13 | 64.82 | 66.91 |
| 1K Synth (RD) | 71.64 | 66.46 | 68.95 |
| 1K Synth (TE) | 69.76 | 65.71 | 67.67 |
| 1K Synth (RD+TE) | 72.32 | 67.57 | 69.86 |
| 1K Gupta [11] | 68.68 | 67.50 | 68.09 |
| 1K Zhan [52] | 70.96 | 66.87 | 68.85 |

TABLE III: Scene text detection performance on the ICDAR2013 dataset by using the EAST model as described in Section 4.2.1, where “Synth”, “Gupta” and “Zhan” denote the training images that were synthesized by our proposed method, Gupta’s [11] and Zhan’s [52], respectively. “1K” denotes the number of synthesized images used, “Random” denotes embedding texts with random locations and colors, “RD” and “TE” denote the proposed region detection and text embedding techniques.

| Methods | Training data | ICDAR2013 (None) | ICDAR2015 (None) | IIIT5K (50) | IIIT5K (1K) | IIIT5K (None) | SVT (50) | SVT (None) | CUTE (None) |
|---|---|---|---|---|---|---|---|---|---|
| Yao [49] | - | - | - | 80.2 | 69.3 | - | 75.9 | - | - |
| Almazán [2] | - | - | - | 91.2 | 82.1 | - | 89.2 | - | - |
| Gordo [10] | - | - | - | 93.3 | 86.6 | - | 91.8 | - | - |
| Jaderberg [17] | Jaderberg 8M | 81.8 | - | 95.5 | 89.6 | - | 93.2 | 71.7 | - |
| Jaderberg [18] | Jaderberg 8M | 90.8 | - | 97.1 | 92.7 | - | 95.4 | 80.7 | - |
| Shi [33] | Jaderberg 8M | 89.6 | - | 97.8 | 95.0 | 81.2 | 97.5 | 82.7 | - |
| Shi [34] | Jaderberg 8M | 88.6 | - | 96.2 | 93.8 | 81.9 | 95.5 | 81.9 | 59.2 |
| Yang [46] | Private | - | - | 97.8 | 96.1 | - | 95.2 | - | 69.3 |
| Lee [23] | Jaderberg 8M | 90.0 | - | 96.8 | 94.4 | 78.4 | 96.3 | 80.7 | - |
| ASTR | Jaderberg 5M [19] | 86.6 | 64.1 | 96.8 | 93.2 | 81.0 | 96.1 | 81.5 | 55.8 |
| ASTR | Gupta 5M [11] | 87.0 | 66.6 | 97.6 | 94.8 | 81.3 | 95.2 | 80.1 | 55.9 |
| ASTR | Zhan 5M [52] | 87.7 | 67.4 | 97.9 | 95.4 | 82.1 | 96.9 | 82.2 | 56.8 |
| ASTR | Ours 5M | 89.4 | 68.1 | 98.7 | 96.3 | 84.1 | 97.2 | 82.9 | 58.6 |

TABLE IV: Scene text recognition performance over the datasets ICDAR2013, ICDAR2015, IIIT5K, SVT and CUTE, where “50” and “1K” in the column headers denote the lexicon size and “None” means no lexicon used. ASTR denotes the recognition model as described in Section 4.3.1.

4.2 Scene Text Detection

4.2.1 Implementation

For the scene text detection task, we adopt an adapted EAST model [58] to train all text detectors. EAST is a fully convolutional network (FCN) which can directly localize texts of arbitrary orientations at word or text-line level. It allows end-to-end training and optimization without unnecessary intermediate components and steps, and achieves superior detection accuracy and efficiency as compared with state-of-the-art methods. It uses the PVANET [22] as the backbone in its original implementation. We instead use the ResNet-152 in our implementation as PVANET is not publicly available.

The proposed image synthesis technique is evaluated by using the ICDAR2013 dataset, which has been widely used in scene text detection studies for years. Two experiments were performed to demonstrate the effectiveness of our synthesized images in training deep detection networks. In the first experiment, we employ 400K images synthesized by our proposed method to train an EAST model and compare the trained model with the state-of-the-art as shown in Table 2. The purpose is to show that our synthesized images can perform similarly to or even better than real images when applied to train deep detection models. In the second experiment, we carry out an ablation study that uses 1K synthesized images to evaluate the performance of our proposed region detection and text embedding techniques. This experiment also compares our image synthesis technique with two state-of-the-art image synthesis techniques as shown in Table 3. Note that we use 5000 images that contain no scene text as the background images and select the foreground texts from publicly available corpora. The number of embedded words or text lines is limited to a maximum of 15 for each background image.

4.2.2 Result Analysis

Table 2 shows experimental results when 400K synthesized images are used to train the adapted EAST model. As Table 2 shows, the trained EAST model achieves similar performance as compared with state-of-the-art models that use either real images or synthesized images in training. Though a much larger amount of synthesized images is used as compared with those using real images, the proposed image synthesis technique generates new images automatically, which removes the complicated image collection and selection process required by real images. More importantly, the proposed image synthesis technique produces object annotations automatically, which removes the time-consuming object annotation process. The use of a larger amount of synthesized images introduces a longer training time, but this is far more manageable than the manual collection, selection, and annotation of a large amount of real images. Note that Gupta et al. [11] also used synthesized images in training, but they used 800K synthesized images and the achieved f-score (84.2) is clearly lower than ours (86.1) obtained with 400K synthesized images.

Table 3 shows ablation study results as well as comparisons with other image synthesis techniques. For fair comparisons, the adapted EAST and 1K synthesized images (by different synthesis methods) are used in detection model training consistently. As Table 3 shows, random placement of source texts into background images (no control of embedding locations and text appearance) is capable of producing useful training images (with a detection f-score of 66.91%). Including either our proposed region detection (RD) or text embedding (TE) improves the detection f-score by up to 2 percentage points, and including both improves the detection f-score by up to 3 percentage points. In addition, we can see that our synthesized images perform clearly better than the synthesized images in [11] and [52], with an f-score improvement of up to 2 percentage points. The better performance is largely due to the semantic coherence, geometry alignment, and realistic appearance within our proposed image synthesis technique as illustrated in Fig. 5.
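The f-scores reported in Table 3 are consistent with the harmonic mean of recall and precision, which the short check below reproduces for two rows of the table. The helper name is hypothetical.

```python
def f_score(recall, precision):
    """Harmonic mean of recall and precision (both given in percent)."""
    return 2 * recall * precision / (recall + precision)

# (recall, precision) pairs copied from Table 3.
rows = {
    "1K Synth (Random)": (69.13, 64.82),
    "1K Synth (RD+TE)": (72.32, 67.57),
}
scores = {name: round(f_score(r, p), 2) for name, (r, p) in rows.items()}
print(scores)  # {'1K Synth (Random)': 66.91, '1K Synth (RD+TE)': 69.86}

# Gain of the full method over random placement, in percentage points.
gain = scores["1K Synth (RD+TE)"] - scores["1K Synth (Random)"]
```

The gain of the full method over random placement is therefore about 2.95 percentage points, which matches the "up to 3" figure quoted above.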

Fig. 5: A number of sample images by our proposed synthesis technique that show how the proposed semantic coherence, geometry alignment and realistic text appearance work together for automatic and verisimilar scene text synthesis.

4.3 Scene Text Recognition

4.3.1 Implementation

For the scene text recognition task, we adopt an attention-based scene text recognition model (ASTR) which is a sequence-to-sequence learning method [37, 3]. The ASTR consists of an encoder and a decoder, where the encoder extracts a sequential representation from the input image and the decoder recurrently generates a sequence conditioned on the sequential representation. The text recognition model covers 68 characters including 10 digits, 26 lower-case letters and 32 ASCII punctuation marks. In evaluations, only digits and letters are counted and the rest is directly discarded. If a lexicon is provided, the lexicon word that has the minimum edit distance with the predicted word is selected. In addition, evaluations are based on the correctly recognized words (CRW) which can be determined based on the ground truth transcription.
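The evaluation protocol described above can be sketched as follows. This is a minimal illustration rather than the evaluation code used in the experiments: the function names are hypothetical, and lower-casing the predictions before comparison is an assumption that follows from the lower-case character set.

```python
import string

# 10 digits + 26 lower-case letters + 32 ASCII punctuation marks = 68 characters.
CHARSET = string.digits + string.ascii_lowercase + string.punctuation
ALNUM = set(string.digits + string.ascii_lowercase)

def normalize(word):
    """Keep digits and letters only; everything else is discarded."""
    return "".join(c for c in word.lower() if c in ALNUM)

def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def lexicon_correct(prediction, lexicon):
    """Pick the lexicon word with the minimum edit distance to the prediction."""
    return min(lexicon, key=lambda word: edit_distance(prediction, word))

def crw_accuracy(predictions, truths, lexicon=None):
    """Fraction of correctly recognized words (CRW)."""
    correct = 0
    for pred, truth in zip(predictions, truths):
        pred = normalize(pred)
        if lexicon:
            pred = lexicon_correct(pred, [normalize(w) for w in lexicon])
        correct += pred == normalize(truth)
    return correct / len(truths)

predictions = ["Pasadena!", "hotel"]
truths = ["pasadena", "motel"]
print(crw_accuracy(predictions, truths))                       # 0.5
print(crw_accuracy(predictions, truths, ["pasadena", "motel"]))  # 1.0
```

The second call shows why lexicon-based results in Table 4 are higher than the lexicon-free ones: a prediction that is one character off is snapped to the nearest lexicon word.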

The proposed image synthesis technique is evaluated over five public datasets including ICDAR2013, ICDAR2015, IIIT5K, SVT and CUTE as shown in Table 4. Two sets of benchmarking evaluations were performed. The first compares our proposed image synthesis technique with a list of state-of-the-art scene text recognition techniques that use different amounts of synthesized images (most used a much larger amount as shown in Table 4) as well as different recognition models as proposed in the respective research. The purpose is to demonstrate how our synthesized images perform as compared with state-of-the-art recognition methods. The second compares our proposed image synthesis technique with a number of state-of-the-art image synthesis techniques, where the same amount of synthesized images (5M) and the same recognition model ASTR are used for all model training consistently. It provides a more direct comparison by using the same recognition model and the same amount of synthesized images.

4.3.2 Result Analysis

Table 4 shows the scene text recognition results. As Table 4 shows, our ASTR model (ASTR [Ours 5M]) achieves state-of-the-art scene text recognition accuracy when 5M images synthesized by our proposed method are used in training. Though the accuracy of ASTR [Ours 5M] is not always the highest, it performs better than other state-of-the-art techniques for most evaluated datasets under different scenarios with or without using lexicons. The slightly lower accuracy on some datasets such as ICDAR2013 and SVT (using the 50-word lexicon) is largely due to the competing methods using a larger amount of training images, e.g. 8M in [33], or a constrained-output recognizer [18]. In addition, the ASTR is a common text recognition model without any specific design, whereas the state-of-the-art recognition models usually use the latest networks together with their proposed tricks. It should be noted that the dataset CUTE contains a large amount of curved texts that cannot be recognized properly by most state-of-the-art methods (whereas the methods in [34, 46] were specially designed to recognize curved/irregular texts).

Table 4 also shows the comparison between our proposed image synthesis method and three state-of-the-art image synthesis methods as reported in [19, 11, 52]. For fair comparison, the same number of synthesized images (5 million) was taken from our proposed method and from each of the three state-of-the-art methods, and the same ASTR model is used consistently. The trained recognition models are labelled “ASTR [Ours 5M]”, “ASTR [Jaderberg 5M]”, “ASTR [Gupta 5M]” and “ASTR [Zhan 5M]”, respectively. As Table 4 shows, the ASTR trained by using our synthesized images outperforms the ASTR trained by using the “Jaderberg 5M”, “Gupta 5M” and “Zhan 5M” images consistently across all 5 evaluated datasets. The outstanding recognition performance of our “ASTR [Ours 5M]” is largely due to the semantic coherence, geometry alignment and realistic text appearance in our proposed image synthesis method.

5 Conclusions

This paper presents an image synthesis technique that aims to train accurate and robust scene text detection and recognition models. The proposed technique achieves verisimilar scene text image synthesis by innovative designs in semantic coherence, geometry alignment and realistic text appearance. Experiments over 6 public benchmarking datasets show that our synthesized images are capable of training state-of-the-art scene text detection and recognition models.

A possible extension to this research is to further improve the appearance realism of the embedded texts. We currently employ GANs to learn the appearance of the embedded texts with a set of real images as references. For training more powerful detection/recognition models, it could help to include the detection/recognition models in the overall framework and to employ the detection and recognition results as feedback for learning optimal GAN parameters. We will study this issue in our future work.

6 Acknowledgement

This work is funded by the Ministry of Education, Singapore, under the project “A semi-supervised learning approach for accurate and robust detection of texts in scenes” (RG128/17 (S)).


  • [1] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk. Slic superpixels compared to state-of-the-art superpixel methods. In TPAMI, 2012.
  • [2] J. Almazán, A. Gordo, A. Fornés, and E. Valveny. Word spotting and recognition with embedded attributes. In TPAMI, (12):2552–2566, 2014.
  • [3] O. Alsharif and J. Pineau. End-to-end text recognition with hybrid hmm maxout models. In ICLR, 2014.
  • [4] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In PMLR, 2017.
  • [5] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv:1606.00915, 2016.
  • [6] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2016.
  • [7] Z. Cheng, F. Bai, Y. Xu, G. Zheng, S. Pu, and S. Zhou. Focusing attention: Towards accurate text recognition in natural images. In ICCV, pages 5076–5084, 2017.
  • [8] C. K. Chng and C. S. Chan. Total-text: A comprehensive dataset for scene text detection and recognition. In ICDAR, pages 935–942, 2017.
  • [9] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Smagt, D. Cremers, and T. Brox. Flownet: Learning optical flow with convolutional networks. Proc. ICCV, 2015.
  • [10] A. Gordo. Supervised mid-level features for word image representation. In CVPR, 2015.
  • [11] A. Gupta, A. Vedaldi, and A. Zisserman. Synthetic data for text localisation in natural images. In CVPR, 2016.
  • [12] P. He, W. Huang, T. He, Q. Zhu, Y. Qiao, and X. Li. Single shot text detector with regional attention. arXiv:1709.00138, 2017.
  • [13] T. He, W. Huang, Y. Qiao, and J. Yao. Text-attentional convolutional neural network for scene text detection. In TIP, (6):2529–2541, 2016.
  • [14] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, A. A. Efros, and T. Darrell. Cycada: Cycle-consistent adversarial domain adaptation. In ICML, 2018.
  • [15] W. Huang, Y. Qiao, and X. Tang. Robust scene text detection with convolution neural network induced mser trees. In ECCV, pages 497–511, 2014.
  • [16] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. arXiv preprint arXiv:1406.2227, 2014.
  • [17] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Deep structured output learning for unconstrained text recognition. In ICLR, 2015.
  • [18] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Reading text in the wild with convolutional neural networks. In IJCV, (1):1–20, 2016.
  • [19] M. Jaderberg, A. Vedaldi, and A. Zisserman. Deep features for text spotting. In ECCV, pages 512–528, 2014.
  • [20] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, and F. Shafait. Icdar 2015 competition on robust reading. In ICDAR, pages 1156–1160, 2015.
  • [21] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, L. P. de las Heras, and et al. Icdar 2013 robust reading competition. In Proc. ICDAR, pages 1484–1493, 2013.
  • [22] K. Kim, S. Hong, B. Roh, Y. Cheon, and M. Park. Pvanet: Deep but lightweight neural networks for real-time object detection. arXiv:1608.08021, 2016.
  • [23] C.-Y. Lee and S. Osindero. Recursive recurrent nets with attention modeling for ocr in the wild. In CVPR, pages 2231–2239, 2016.
  • [24] M. Liao, B. Shi, X. Bai, X. Wang, and W. Liu. Textboxes: A fast text detector with a single deep neural network. In AAAI, pages 4161–4167, 2017.
  • [25] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In NIPS, 2017.
  • [26] Y. Liu and L. Jin. Deep matching prior network: Toward tighter multi-oriented text detection. In CVPR, 2017.
  • [27] S. Lu, T. Chen, S. Tian, J.-H. Lim, and C.-L. Tan. Scene text extraction based on edges and support vector regression. In IJDAR, (2):125–135, 2015.
  • [28] A. Mishra, K. Alahari, and C. Jawahar. Scene text recognition using higher order language priors. In BMVC, 2012.
  • [29] L. Neumann and J. Matas. Real-time scene text localization and recognition. In CVPR, pages 3538–3545, 2012.
  • [30] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434, 2015.
  • [31] A. Risnumawan, P. Shivakumara, C. S. Chan, and C. L. Tan. A robust arbitrary text detection system for natural scene images. In Expert Syst. Appl., 41(18):8027–8048, 2014.
  • [32] J. A. Rodríguez-Serrano, A. Gordo, and F. Perronnin. Label embedding: A frugal baseline for text recognition. In IJCV, 2015.
  • [33] B. Shi, X. Bai, and C. Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. In TPAMI, 39(11):2298–2304, 2017.
  • [34] B. Shi, X. Wang, P. Lyu, C. Yao, and X. Bai. Robust scene text recognition with automatic rectification. In CVPR, 2016.
  • [35] B. Shi, C. Yao, M. Liao, M. Yang, P. Xu, L. Cui, S. Belongie, S. Lu, and X. Bai. Icdar2017 competition on reading chinese text in the wild (rctw-17). In ICDAR, 01:1429–1434, 2017.
  • [36] P. Shivakumara, R. P. Sreedhar, T. Q. Phan, S. Lu, and C. L. Tan. Multioriented video scene text detection through bayesian classification and boundary growing. IEEE Transactions on Circuits and systems for Video Technology, (8):1227–1235, 2012.
  • [37] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv:1312.6034, 2013.
  • [38] L. Sixt, B. Wild, and T. Landgraf. Rendergan: Generating realistic labeled data. arXiv:1611.01331, 2017.
  • [39] B. Su and S. Lu. Accurate scene text recognition based on recurrent neural network. In ACCV, 2014.
  • [40] B. Su and S. Lu. Accurate recognition of words in scenes without character segmentation using recurrent neural network. In PR, 2017.
  • [41] S. Tian and S. Lu. Wetext: Scene text detection under weak supervision. In ICCV, pages 1492–1550, 2017.
  • [42] S. Tian, Y. Pan, C. Huang, S. Lu, K. Yu, and C. L. Tan. Text flow: A unified text detection system in natural scene images. In ICCV, pages 4651–4659, 2015.
  • [43] Z. Tian, W. Huang, T. He, P. He, and Y. Qiao. Detecting text in natural image with connectionist text proposal network. In ECCV, pages 56–72, 2016.
  • [44] K. Wang, B. Babenko, and S. Belongie. End-to-end scene text recognition. In ICCV, 2011.
  • [45] C. Xue, S. Lu, and F. Zhan. Accurate scene text detection through border semantics awareness and bootstrapping. In ECCV, pages 370–387, 2018.
  • [46] X. Yang, D. He, Z. Zhou, D. Kifer, and C. L. Giles. Learning to read irregular text with attention mechanisms. In IJCAI, pages 3280–3286, 2017.
  • [47] C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu. Detecting texts of arbitrary orientations in natural images. In CVPR, pages 1083–1090, 2012.
  • [48] C. Yao, X. Bai, N. Sang, X. Zhou, S. Zhou, and Z. Cao. Scene text detection via holistic, multi-channel prediction. arXiv:1606.09002, 2016.
  • [49] C. Yao, X. Bai, B. Shi, and W. Liu. Strokelets: A learned multi-scale representation for scene text recognition. In CVPR, 2014.
  • [50] X. C. Yin, W. Y. Pei, J. Zhang, and H. W. Hao. Multiorientation scene text detection with adaptive clustering. In TPAMI, (9):1930–1937, 2015.
  • [51] F. Zhan and S. Lu. Esir: End-to-end scene text recognition via iterative image rectification. arXiv:1812.05824, 2018.
  • [52] F. Zhan, S. Lu, and C. Xue. Verisimilar image synthesis for accurate detection and recognition of texts in scenes. In ECCV, pages 257–273, 2018.
  • [53] F. Zhan, H. Zhu, and S. Lu. Spatial fusion gan for image synthesis. arXiv:1812.05840, 2018.
  • [54] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
  • [55] Z. Zhang, W. Shen, C. Yao, and X. Bai. Symmetry-based text line detection in natural scenes. In CVPR, pages 2558–2567, 2015.
  • [56] Z. Zhang, C. Zhang, W. Shen, C. Yao, W. Liu, and X. Bai. Multi-oriented text detection with fully convolutional networks. In CVPR, 2015.
  • [57] Z. Zhang, C. Zhang, W. Shen, C. Yao, W. Liu, and X. Bai. Multi-oriented text detection with fully convolutional networks. In CVPR, pages 4159–4167, 2016.
  • [58] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, and J. Liang. East: An efficient and accurate scene text detector. In CVPR, 2017.
  • [59] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.