UnrealText: Synthesizing Realistic Scene Text Images from the Unreal World

03/24/2020 · Shangbang Long et al. · Carnegie Mellon University

Synthetic data has been a critical tool for training scene text detection and recognition models. On the one hand, synthetic word images have proven to be a successful substitute for real images in training scene text recognizers. On the other hand, however, scene text detectors still heavily rely on a large amount of manually annotated real-world images, which are expensive to obtain. In this paper, we introduce UnrealText, an efficient image synthesis method that renders realistic images via a 3D graphics engine. The 3D engine provides realistic appearance by rendering scene and text as a whole, and allows for better text region proposals with access to precise scene information, e.g. normals and even object meshes. Comprehensive experiments verify its effectiveness on both scene text detection and recognition. We also generate a multilingual version for future research into multilingual scene text detection and recognition. The code and the generated datasets are released at: https://github.com/Jyouhou/UnrealText/ .




1 Introduction

With the resurgence of neural networks, the past few years have witnessed significant progress in the field of scene text detection and recognition. However, these models are data-thirsty, and it is expensive and sometimes difficult, if not impossible, to collect enough data. Moreover, the various applications, from traffic sign reading in autonomous vehicles to instant translation, require a large amount of data specifically for each domain, further escalating this issue. Therefore, synthetic data and synthesis algorithms are important for scene text tasks. Furthermore, synthetic data can provide detailed annotations, such as character-level or even pixel-level ground truths that are rare for real images due to high cost.

Currently, there exist several synthesis algorithms [45, 10, 6, 49] that have proven beneficial. In particular, in scene text recognition, training on synthetic data [10, 6] alone has become a widely accepted standard practice. Researchers who train on both synthetic and real data report only marginal improvements [15, 19] on most datasets; mixing in real data improves performance only on a few difficult cases not yet well covered by existing synthetic datasets, such as severely blurred or curved text. This is reasonable, since cropped text images have much simpler backgrounds, and synthetic data enjoys advantages in vocabulary size and in the diversity of backgrounds, fonts, and lighting conditions, as well as thousands of times more data samples.

On the contrary, however, scene text detection is still heavily dependent on real-world data. Synthetic data [6, 49] plays a less significant role, and only brings marginal improvements. Existing synthesizers for scene text detection follow the same paradigm. First, they analyze background images, e.g. by performing semantic segmentation and depth estimation with off-the-shelf models. Then, potential locations for text embedding are extracted from the segmented regions. Finally, text images (foregrounds) are blended into the background images, with perspective transformation inferred from the estimated depth. However, the analysis of background images with off-the-shelf models can be rough and imprecise. These errors propagate to the text proposal modules and result in text being embedded at unsuitable locations. Moreover, the text embedding process is ignorant of overall image conditions such as illumination and occlusions in the scene. These two factors make text instances stand out from their backgrounds, leading to a gap between synthetic and real images.

In this paper, we propose a synthesis engine that synthesizes scene text images from a 3D virtual world. The proposed engine is based on the well-known Unreal Engine 4 (UE4), and is therefore named UnrealText. Specifically, text instances are treated as planar polygon meshes with text foregrounds loaded as textures. These meshes are placed at suitable positions in the 3D world, and rendered together with the scene as a whole.

As shown in Fig. 1, the proposed synthesis engine, by its very nature, enjoys the following advantages over previous methods: (1) Text and scenes are rendered together, achieving realistic visual effects, e.g. illumination, occlusion, and perspective transformation. (2) The method has access to precise scene information, e.g. normals, depth, and object meshes, and can therefore generate better text region proposals. Both aspects are crucial for training detectors.

To further exploit the potential of UnrealText, we design three key components: (1) A view finding algorithm that explores the virtual scenes and generates camera viewpoints to obtain more diverse and natural backgrounds. (2) An environment randomization module that changes the lighting conditions regularly, to simulate real-world variations. (3) A mesh-based text region generation method that finds suitable positions for text by probing the 3D meshes.

The contributions of this paper are summarized as follows: (1) We propose a brand-new scene text image synthesis engine, termed UnrealText, that renders images from a 3D world, which is entirely different from previous approaches that embed text on 2D background images. The proposed engine achieves realistic rendering effects and high scalability. (2) With the proposed techniques, the synthesis engine improves the performance of detectors and recognizers significantly. (3) We also generate a large-scale multilingual scene text dataset that will aid further research.

2 Related Work

2.1 Synthetic Images

The synthesis of photo-realistic datasets has been a popular topic, since they provide detailed ground-truth annotations at multiple granularities and cost less than manual annotation. In scene text detection and recognition, the use of synthetic datasets has become a standard practice. For scene text recognition, where images contain only one word, synthetic images are rendered through several steps [45, 10], including font rendering, coloring, homography transformation, and background blending. Later, GANs [5] were incorporated to maintain style consistency for implanted text [50], but only for single-word images. As a result of this progress, synthetic data alone is enough to train state-of-the-art recognizers.

To train scene text detectors, SynthText [6] proposes to generate synthetic data by printing text on background images. It first analyzes images with off-the-shelf models, and searches for suitable text regions among semantically consistent regions. Text is implanted with perspective transformation based on estimated depth. To maintain semantic coherency, VISD [49] proposes to use semantic segmentation to filter out unreasonable surfaces such as human faces. It also adopts an adaptive coloring scheme to fit the text into the artistic style of the backgrounds. However, without considering the scene as a whole, these methods fail to render text instances in a photo-realistic way, and the text instances stand out too much from their backgrounds. So far, the training of detectors still relies heavily on real images.

Although GANs and other learning-based methods have also shown great potential in generating realistic images [47, 16, 12], the generation of scene text images still requires a large amount of manually labeled data [50]. Furthermore, such data is sometimes not easy to collect, especially for cases such as low-resource languages.

More recently, synthesizing images with 3D graphics engines has become popular in several fields, including human pose estimation, scene understanding/segmentation [27, 23, 32, 34, 36], and object detection [28, 41, 8]. However, these methods either consider simplistic cases, e.g. rendering 3D objects on top of static background images [28, 42] and randomly arranging scenes filled with objects [27, 23, 34, 8], or passively use off-the-shelf 3D scenes without further changing them [32]. In contrast to these works, our proposed synthesis engine actively and regularly interacts with the 3D scenes to generate realistic and diverse scene text images.

2.2 Scene Text Detection and Recognition

Scene text detection and recognition, possibly the most human-centric computer vision task, has been a popular research topic for many years [48, 20]. In scene text detection, there are mainly two branches of methodology: top-down methods that inherit the idea of region proposal networks from general object detectors and detect text instances as rotated rectangles and polygons [18, 52, 11, 51, 46], and bottom-up approaches that predict local segments and local geometric attributes, then compose them into individual text instances [37, 21, 2, 39]. Despite significant improvements on individual datasets, the most widely used benchmark datasets are usually very small, with only around to images in test sets, and are therefore prone to over-fitting. The generalization ability across different domains remains an open question and has not yet been studied. The reason lies in the very limited amount of real data and the limited effectiveness of existing synthetic data. Therefore, one important motivation of our synthesis engine is to serve as a stepping stone towards general scene text detection.

Most scene text recognition models consist of a CNN-based image feature extractor and an attentional LSTM [9] or transformer [43] based encoder-decoder that predicts the textual content [4, 38, 15, 22]. Since the encoder-decoder module is in essence a language model, scene text recognizers demand training data with a large vocabulary, which is extremely difficult to satisfy with real-world data. Besides, scene text recognizers work on image crops that have simple backgrounds, which are easy to synthesize. Therefore, synthetic data is necessary for scene text recognizers, and synthetic data alone is usually enough to achieve state-of-the-art performance. Moreover, since the recognition modules require a large amount of data, synthetic data is also necessary for training end-to-end text spotting systems [17, 7, 29].

3 Scene Text in 3D Virtual World

3.1 Overview

In this section, we give a detailed introduction to our scene text image synthesis engine, UnrealText, which is developed upon UE4 and the UnrealCV plugin [30]. The synthesis engine: (1) produces photo-realistic images, (2) is efficient, taking only about - second to render and generate a new scene text image, and (3) is general, being compatible with off-the-shelf 3D scene models. As shown in Fig. 2, the pipeline mainly consists of a Viewfinder module (section 3.2), an Environment Randomization module (section 3.3), a Text Region Generation module (section 3.4), and a Text Rendering module (section 3.5).

Firstly, the viewfinder module explores around the 3D scene with the camera, generating camera viewpoints. Then, the environment lighting is randomly adjusted. Next, the text regions are proposed based on 2D scene information and refined with 3D mesh information in the graphics engine. After that, text foregrounds are generated with randomly sampled fonts, colors, and text content, and are loaded as planar meshes. Finally, we retrieve the RGB image and corresponding text locations as well as text content to make the synthetic dataset.

Figure 2: The pipeline of the proposed synthesis method. The arrows indicate the order. For simplicity, we only show one text region. From left to right: scene overview, diverse viewpoints, various lighting conditions (light color, intensity, shadows, etc.), text region generation and text rendering.

3.2 Viewfinder

The aim of the viewfinder module is to automatically determine a set of camera locations and rotations from the whole space of 3D scenes that are reasonable and non-trivial, getting rid of unsuitable viewpoints such as from inside object meshes (e.g. Fig.  3 bottom right).

Learning-based methods such as navigation and exploration algorithms may require extra training data and are not guaranteed to generalize to different 3D scenes. Therefore, we turn to rule-based methods and design a physically-constrained 3D random walk (Fig.  3 first row) equipped with auxiliary camera anchors.

3.2.1 Physically-Constrained 3D Random Walk

Starting from a valid location, the physically-constrained 3D random walk aims to find the next valid and non-trivial location. A location is invalid if, for example, it lies inside an object mesh or far beyond the scene boundary. A non-trivial location should not be too close to the current one; otherwise, the new viewpoint will be similar to the current one. The proposed 3D random walk uses physically-constrained ray-casting [35] to inspect the physical environment and determine valid, non-trivial locations.

In each step, we first randomly change the pitch and yaw values of the camera rotation, making the camera point in a new direction. Then, we cast a ray from the camera location along the viewing direction. The ray stops when it hits any object mesh or reaches a fixed maximum length. By design, the path from the current location to the stopping position is free of any barrier, i.e. not inside any object mesh. Therefore, points along this ray path are all valid. Finally, we randomly sample one point between the -th and -th of this path, and set it as the new camera location, which is non-trivial. The proposed random walk algorithm can generate diverse camera viewpoints.
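As a sketch, one step of the constrained random walk might look like the following Python. Here `cast_ray` is a hypothetical stand-in for the engine's ray cast (in UE4 this would be a line trace issued through UnrealCV), and the sampling fractions `lo`/`hi` are illustrative placeholders for the paper's values, which did not survive extraction:

```python
import math
import random

MAX_RAY_LENGTH = 1000.0  # assumed maximum ray length, in engine units

def cast_ray(origin, direction):
    """Hypothetical stand-in for the engine's ray cast: returns the
    distance to the first mesh hit, capped at MAX_RAY_LENGTH. A real
    implementation would query UE4; here free space is assumed."""
    return MAX_RAY_LENGTH

def random_walk_step(location, lo=0.3, hi=0.9):
    """One step of the physically-constrained 3D random walk.
    Returns a new camera location and (pitch, yaw) rotation."""
    # 1. Randomly perturb pitch and yaw so the camera points somewhere new.
    pitch = random.uniform(-30.0, 30.0)
    yaw = random.uniform(0.0, 360.0)
    p, y = math.radians(pitch), math.radians(yaw)
    # Unit direction vector derived from pitch/yaw.
    direction = (math.cos(p) * math.cos(y),
                 math.cos(p) * math.sin(y),
                 math.sin(p))
    # 2. Cast a ray; every point along the barrier-free path is valid.
    dist = cast_ray(location, direction)
    # 3. Sample a non-trivial point on a middle fraction of the path.
    t = random.uniform(lo, hi) * dist
    new_location = tuple(c + t * d for c, d in zip(location, direction))
    return new_location, (pitch, yaw)
```

Since the direction vector is unit length, the sampled point always lies between the `lo` and `hi` fractions of the free path, so it is guaranteed to be both valid and some distance away from the current location.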

3.2.2 Auxiliary Camera Anchors

The proposed random walk algorithm, however, is inefficient in terms of exploration. Therefore, we manually select a set of camera anchors across the 3D scenes as starting points. After every steps, we reset the camera location to a randomly sampled camera anchor. We set - and . Note that the selection of camera anchors requires little care; we only need to ensure coverage of the space. It takes around to seconds for each scene, which is trivial and not a bottleneck for scalability. The manual but efficient selection of camera anchors complements the proposed random walk algorithm, which generates diverse viewpoints.
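The interplay of anchors and walk steps can be sketched as the loop below. The reset interval `reset_every` is an assumed value (the paper's number did not survive extraction), and `step_fn` stands for any single-step function mapping a location to the next (location, rotation):

```python
import random

def generate_viewpoints(anchors, step_fn, num_viewpoints, reset_every=20):
    """Sketch of the viewfinder loop: walk from manual anchors,
    periodically teleporting back to a random anchor so exploration
    covers the whole scene rather than drifting in one region."""
    viewpoints = []
    location = random.choice(anchors)
    for i in range(num_viewpoints):
        if i % reset_every == 0:
            location = random.choice(anchors)  # jump to a manual anchor
        location, rotation = step_fn(location)
        viewpoints.append((location, rotation))
    return viewpoints
```

In practice `step_fn` would be the physically-constrained random walk step, and each recorded viewpoint would be handed to the renderer.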

Figure 3: In the first row (1)-(4), we illustrate the physically-constrained 3D random walk. For better visualization, we use a camera object to represent the viewpoint (marked with green boxes and arrows). In the second row, we compare viewpoints from the proposed method with randomly sampled viewpoints.

3.3 Environment Randomization

To simulate real-world variations such as changing lighting conditions, we randomly change the intensity, color, and direction of all light sources in the scene. In addition to illumination, we also add fog and randomly adjust its intensity. Environment randomization proves to increase the diversity of the generated images and results in stronger detector performance. The proposed randomization can also benefit sim-to-real domain adaptation [40].
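A minimal sketch of this randomization step is shown below. The parameter ranges are illustrative assumptions; a real implementation would push these sampled values to the UE4 light and fog actors (e.g. via UnrealCV commands):

```python
import random

def randomize_environment(num_lights):
    """Sketch of environment randomization: sample new intensity,
    color, and direction for every light source, plus a fog density.
    All ranges are illustrative, not values from the paper."""
    lights = []
    for _ in range(num_lights):
        lights.append({
            "intensity": random.uniform(0.5, 2.0),                   # relative brightness
            "color": [random.uniform(0.7, 1.0) for _ in range(3)],   # warm-ish RGB
            "direction": [random.uniform(-1.0, 1.0) for _ in range(3)],
        })
    fog = {"density": random.uniform(0.0, 0.05)}  # clear air to thick haze
    return {"lights": lights, "fog": fog}
```

Calling this before each captured viewpoint yields a different lighting condition per image, which is what drives the diversity gains reported in the ablation (section 4.3).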

3.4 Text Region Generation

In the real world, text instances are usually embedded on well-defined surfaces, e.g. traffic signs, to maintain good legibility. Previous works find suitable regions using estimated scene information as an approximation, such as gPb-UCM [1] in SynthText [6] or saliency maps in VISD [49]. However, these methods are imprecise and often fail to find appropriate regions. Therefore, we propose to find text regions by probing object meshes in the 3D world. Since inspecting all object meshes is time-consuming, we propose a two-stage pipeline: (1) We retrieve the ground-truth surface normal map to generate initial text region proposals; (2) the initial proposals are then projected into and refined in the 3D world using object meshes. Finally, we sample a subset of the refined proposals to render. To avoid occlusion among proposals, we project them back to screen space and discard overlapping regions one by one in a shuffled order until occlusion is eliminated.
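The final occlusion-elimination pass can be sketched as a greedy filter over screen-space boxes. Axis-aligned boxes are an assumption made for simplicity; the engine actually works with projected quadrilaterals:

```python
import random

def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    if inter == 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def remove_occlusions(boxes):
    """Discard overlapping proposals in a shuffled order until no two
    kept boxes overlap, as in the paper's occlusion-elimination step."""
    boxes = list(boxes)
    random.shuffle(boxes)
    kept = []
    for box in boxes:
        if all(box_iou(box, k) == 0.0 for k in kept):
            kept.append(box)
    return kept
```

Shuffling before filtering means no proposal is systematically favored, so the surviving subset varies from image to image.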

3.4.1 Initial Proposals from Normal Maps

In computer graphics, normal values are unit vectors perpendicular to a surface. Therefore, when projected to 2D screen space, a region with similar normal values tends to be a well-defined region on which to embed text. We find valid image regions by applying sliding windows of pixels across the surface normal map, and retrieve those with smooth surface normals: the minimum cosine similarity between any two pixels must be larger than a threshold . We set to , which proves to produce reasonable results. We randomly sample at most non-overlapping valid image regions to make the initial proposals. Making proposals from normal maps is an efficient way to find potential, visible regions.
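The smoothness test for a single window can be sketched as below. The threshold `tau` is an assumed value, since the paper's exact number did not survive extraction, and the brute-force pairwise check is written for clarity rather than speed:

```python
def smooth_window(normals, x0, y0, w, h, tau=0.95):
    """Check whether an h-by-w window of a normal map (one unit 3-vector
    per pixel) is smooth: the minimum pairwise cosine similarity over all
    pixel pairs must exceed tau. O((wh)^2), for clarity."""
    pixels = [normals[y][x]
              for y in range(y0, y0 + h)
              for x in range(x0, x0 + w)]
    for i in range(len(pixels)):
        for j in range(i + 1, len(pixels)):
            # Dot product of unit vectors == cosine similarity.
            cos = sum(a * b for a, b in zip(pixels[i], pixels[j]))
            if cos < tau:
                return False
    return True
```

Sliding this test across the normal map and keeping the windows that pass yields the pool from which the non-overlapping initial proposals are sampled.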

3.4.2 Refining Proposals in 3D Worlds

As shown in Fig. 4, rectangular initial proposals in 2D screen space are distorted when projected into the 3D world. Thus, we first rectify the proposals in 3D. We project the center point of each initial proposal into 3D space, and re-initialize an orthogonal square on the corresponding mesh surface around the center point, with its horizontal sides orthogonal to the gravity direction. The side length is set to the shortest side of the quadrilateral created by projecting the four corners of the initial proposal into 3D space. Then we enlarge the width and height along the horizontal and vertical sides alternately. Expansion in one direction stops when the sides in that direction leave the surface (i.e., when the distances from the rectangular proposal's corners to the nearest point on the underlying surface mesh exceed a certain threshold), hit other meshes, or reach the preset maximum expansion ratio. The proposed refinement algorithm works in 3D world space, and is able to produce natural homography transformations in 2D screen space.
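The alternating expansion can be sketched as follows. Here `fits_surface` abstracts the mesh probing (the distance-to-surface and collision checks), and `max_ratio` and `step` are illustrative values rather than the paper's:

```python
def refine_proposal(square, fits_surface, max_ratio=4.0, step=0.05):
    """Sketch of the alternating width/height expansion. `square` is
    (cx, cy, w, h) on the surface plane; `fits_surface(rect)` is an
    assumed callback reporting whether all four corners still lie on
    the underlying mesh and collide with nothing else."""
    cx, cy, w, h = square
    w0, h0 = w, h
    grow = {"w": True, "h": True}
    while grow["w"] or grow["h"]:
        for axis in ("w", "h"):
            if not grow[axis]:
                continue
            nw = w * (1 + step) if axis == "w" else w
            nh = h * (1 + step) if axis == "h" else h
            over_limit = nw > max_ratio * w0 or nh > max_ratio * h0
            if over_limit or not fits_surface((cx, cy, nw, nh)):
                grow[axis] = False  # this direction left the surface
            else:
                w, h = nw, nh
    return (cx, cy, w, h)
```

Because each direction stops independently, a square seeded on a long, narrow surface (e.g. a sign board) naturally grows into an elongated rectangle that fits the surface.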

Figure 4: Illustration of the refinement of initial proposals. We draw green bounding boxes to represent proposals in 2D screen space, and use planar meshes to represent proposals in 3D space. (1) Initial proposals are made in 2D space. (2) When we project them into 3D world and inspect them from the front view, they are in distorted forms. (3) Based on the sizes of the distorted proposals and the positions of the center points, we re-initialize orthogonal squares on the same surfaces with horizontal sides orthogonal to the gravity direction. (5) Then we expand the squares. (6) Finally, we obtain text regions in 2D screen space with natural perspective distortion.

3.5 Text Rendering

Generating Text Images: Given the text regions proposed and refined in section 3.4, the text generation module samples text content and renders text images with sampled fonts and text colors. The numbers of lines and of characters per line are determined by the font size and the size of the refined proposal in 2D space, to ensure that the characters are not too small and remain legible. For a fairer comparison, we use the same font set from Google Fonts (https://fonts.google.com/) as SynthText does, as well as the same text corpus, Newsgroup20. The generated text images have zero alpha values on non-stroke pixels and nonzero values elsewhere.
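How the line and character counts might be derived from the proposal size and a sampled font size can be sketched as below; the average glyph aspect ratio is an assumption, as the paper does not give its layout formula:

```python
def layout_text(region_w, region_h, font_size, char_aspect=0.6):
    """Sketch: derive how many lines and characters per line fit a
    refined proposal of region_w x region_h pixels for a given font
    size, so glyphs stay legible. char_aspect is the assumed average
    glyph width as a fraction of the font size."""
    num_lines = max(1, region_h // font_size)
    chars_per_line = max(1, int(region_w / (font_size * char_aspect)))
    return int(num_lines), chars_per_line
```

For example, a 120x40-pixel region with a 20-pixel font yields two lines of roughly ten characters each under these assumptions.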

Rendering Text in 3D World: We first perform triangulation for the refined proposals to generate planar triangular meshes that are closely attached to the underlying surface. Then we load the text images as texture onto the generated meshes. We also randomly sample the texture attributes, such as the ratio of diffuse and specular reflection.

3.6 Implementation Details

The proposed synthesis engine is implemented on top of UE4.22 and the UnrealCV plugin. On an Ubuntu workstation with an 8-core Intel CPU, an NVIDIA GeForce RTX 2070 GPU, and 16 GB of RAM, the synthesis speed is - seconds per image at a resolution of , depending on the complexity of the scene model.

We collect scene models from the official UE4 marketplace. The engine is used to generate scene text images with English words. With the same configuration, we also generate a multilingual version, making it the largest multilingual scene text dataset.

4 Experiments on Scene Text Detection

4.1 Settings

We first verify the effectiveness of the proposed engine by training detectors on the synthesized images and evaluating them on real image datasets. We use a time-tested state-of-the-art model, EAST [52], which is fast and accurate. EAST also forms the basis of several widely recognized end-to-end text spotting models [17, 7]. We adopt an open-source implementation (https://github.com/argman/EAST). In all experiments, models are trained on GPU with a batch size of . During evaluation, test images are resized so that the short side measures pixels.

Benchmark Datasets We use the following scene text detection datasets for evaluation: (1) ICDAR 2013 Focused Scene Text (IC13) [14], containing horizontal text in zoomed-in views. (2) ICDAR 2015 Incidental Scene Text (IC15) [13], consisting of images captured incidentally with Google Glass; images are blurred and text is small. (3) MLT 2017 [26] for multilingual scene text detection, composed of scene text images in languages.

4.2 Experiments Results

Pure Synthetic Data We first train EAST models on each synthetic dataset alone, to compare our method with previous ones directly and quantitatively. Note that ours, SynthText, and VISD contain different numbers of images, so we also control the number of images used in the experiments. Results are summarized in Tab. 1.

Firstly, we control the total number of images to 10K, which is also the full size of the smallest synthetic dataset, VISD. We observe a considerable improvement on IC15 over the previous state of the art by in F1-score, and significant improvements on IC13 () and MLT 2017 (). Secondly, we also train models on the full sets of SynthText and ours, since scalability is also an important factor for synthetic data, especially considering the demand for training recognizers. The extra training images further improve F1-scores on IC15, IC13, and MLT by , , and . Models trained on our UnrealText data outperform those trained on all other synthetic datasets. Besides, the subset of images generated with our method even surpasses the SynthText images significantly on all datasets. The experimental results demonstrate the effectiveness of our proposed synthesis engine and datasets.

Training Data
IC15 IC13 MLT 2017
SynthText 10K 46.3 60.8 38.9
VISD 10K (full) 64.3 74.8 51.4
Ours 10K 65.2 78.3 54.2
SynthText 800K (full) 58.0 67.7 44.8
Ours 600K (full) 67.8 80.6 56.3
Ours 5K + VISD 5K 66.9 80.4 55.7

Table 1: Detection results (F1-scores) of EAST models trained on different synthetic data.

Complementary Synthetic Data One unique characteristic of the proposed UnrealText is that the images are generated from 3D scene models instead of real background images, resulting in a potential domain gap due to different artistic styles. We conduct experiments training on both UnrealText data () and VISD (), as shown in Tab. 1 (last row, marked in italics), which achieves better performance than the other synthetic datasets. This result demonstrates that UnrealText is complementary to existing synthetic datasets that use real images as backgrounds: while UnrealText simulates photo-realistic effects, synthetic data with real background images can help adapt to real-world datasets.

Combining Synthetic and Real Data One important role of synthetic data is to serve as pretraining data, further improving performance on domain-specific real datasets. We first pretrain the EAST models with different synthetic data, and then finetune the models on domain data. The results are summarized in Tab. 2. On all domain-specific datasets, models pretrained with our synthetic dataset surpass the others by considerable margins, verifying the effectiveness of our synthesis method for boosting performance on domain-specific datasets.

Evaluation on ICDAR 2015
Training Data P R F1
IC15 84.6 78.5 81.4
IC15 + SynthText 10K 85.6 79.5 82.4
IC15 + VISD 10K 86.3 80.0 83.1
IC15 + Ours 10K 86.9 81.0 83.8
IC15 + Ours 600K (full) 88.5 80.8 84.5
Evaluation on ICDAR 2013
Training Data P R F1
IC13 82.6 70.0 75.8
IC13 + SynthText 10K 85.3 72.4 78.3
IC13 + VISD 10K 85.9 73.1 79.0
IC13 + Ours 10K 88.5 74.7 81.0
IC13 + Ours 600K (full) 92.3 73.4 81.8
Evaluation on MLT 2017
Training Data P R F1
MLT 2017 72.9 67.4 70.1
MLT 2017 + SynthText 10K 73.1 67.7 70.3
MLT 2017 + VISD 10K 73.3 67.9 70.5
MLT 2017 + Ours 10K 74.6 68.7 71.6
MLT 2017 + Ours 600K (full) 82.2 67.4 74.1

Table 2: Detection performances of EAST models pretrained on synthetic and then finetuned on real datasets.

Pretraining on Full Dataset As shown in the last rows of Tab. 2, when we pretrain the detector models with our full dataset, performance improves significantly, demonstrating the advantage of our engine's scalability. In particular, the EAST model achieves an F1-score of on MLT17, which is even better than recent state-of-the-art results, including by CRAFT [2] and by LOMO [51]. Although the margin is not large, it suffices to claim that the EAST model revives and reclaims state-of-the-art performance with the help of our synthetic dataset.

4.3 Module Level Ablation Analysis

One reasonable concern about synthesizing from 3D virtual scenes is scene diversity. In this section, we demonstrate the importance of the proposed viewfinder module and the environment randomization module in increasing the diversity of the synthetic images.

Ablating Viewfinder Module We derive two baselines from the proposed viewfinder module: (1) Random Viewpoint + Manual Anchor, which randomly samples camera locations and rotations from norm-ball spaces centered at the auxiliary camera anchors. (2) Random Viewpoint Only, which randomly samples camera locations and rotations from the whole scene space, without checking their quality. In the experiments, we fix the number of scenes to to control scene diversity, generate different numbers of images, and compare the resulting performance curves. By fixing the number of scenes, we compare how well the different view finding methods can exploit the scenes.

Ablating Environment Randomization We remove the environment randomization module and keep the scene models unchanged during synthesis. In the experiments, we fix the total number of images to and use different numbers of scenes. In this way, we can compare the diversity of images generated with the different methods.

We train the EAST models with different numbers of images or scenes, evaluate them on the real datasets, and compute the arithmetic mean of the F1-scores. As shown in Fig. 5 (a), the proposed combination, i.e. Random Walk + Manual Anchor, consistently achieves significantly higher F1-scores across different numbers of images; larger training sets yield even greater performance gaps. We also inspect the images generated by each method. Starting from the same anchor point, the proposed random walk generates more diverse viewpoints and traverses a much larger area. In contrast, the Random Viewpoint + Manual Anchor method degenerates either into random rotation only, when we set a small norm-ball size for random locations, or into Random Viewpoint Only, when we set a large norm-ball size. As a result, the Random Viewpoint + Manual Anchor method requires careful manual selection of anchors, and the norm-ball sizes must be manually tuned for different scenes, which restricts the scalability of the synthesis engine. Meanwhile, our proposed random walk based method is more flexible and robust to the selection of manual anchors. As for the Random Viewpoint Only method, a large proportion of generated viewpoints are invalid, e.g. inside other object meshes, which is out-of-distribution for real images. This explains why it results in the worst performance.

From Fig. 5 (b), the major observation is that the environment randomization module consistently improves performance across different numbers of scenes. Besides, the improvement is more significant when fewer scenes are used. Therefore, we conclude that environment randomization helps increase image diversity and, at the same time, can reduce the number of scenes needed. Furthermore, the random lighting conditions realize different real-world variations, which we also consider a key factor.

Figure 5: Results of ablation tests: (a) ablating viewfinder module; (b) ablating environment randomization module.

5 Experiments on Scene Text Recognition

In addition to its superior performance in training scene text detection models, we also verify the proposed engine's effectiveness on the task of scene text recognition.

5.1 Recognizing Latin Scene Text

5.1.1 Settings

Model We select a widely accepted baseline method, ASTER [38], and adopt the implementation (https://github.com/Jyouhou/ICDAR2019-ArT-Recognition-Alchemy) by [19], which ranks top-1 on the ICDAR 2019 ArT competition on curved scene text recognition (Latin). The models are trained with a batch size of . A total of symbols are recognized, including an end-of-sentence mark, case-sensitive alphabet characters, digits, and printable punctuation symbols.

Training Datasets From the English synthetic images, we obtain a total of word-level image regions to make our training dataset. Also note that our synthetic dataset provides character-level annotations, which will be useful for some recognition algorithms.

Evaluation Datasets We evaluate models trained on different synthetic datasets on several widely used real image datasets: IIIT [24], SVT [44], ICDAR 2015 (IC15) [13], SVTP [31], CUTE [33], and Total-Text[3].

Some of these datasets, however, have incomplete annotations, namely IIIT, SVT, SVTP, and CUTE. While the images in these datasets contain punctuation symbols, digits, and both upper- and lower-case characters, the annotations are case-insensitive and ignore all punctuation symbols. To enable a more comprehensive evaluation of scene text recognition, we re-annotate these datasets in a case-sensitive way and include punctuation symbols. We also publish the new annotations, and we believe they will serve as better benchmarks for scene text recognition in the future.

Training Data Latin Arabic Bangla Chinese Hindi Japanese Korean Symbols Mixed Overall
ST (1.2M) 34.6 50.5 17.7 43.9 15.7 21.2 55.7 44.7 9.8 34.9
Ours (1.2M) 42.2 50.3 16.5 44.8 30.3 21.7 54.6 16.7 25.0 36.5
Ours (full, 4.1M) 44.3 51.1 19.7 47.9 33.1 24.2 57.3 25.6 31.4 39.5
MLT19-train (90K) 64.3 47.2 46.9 11.9 46.9 23.3 39.1 35.9 3.6 45.7
MLT19-train (90K) + ST (1.2M) 63.8 62.0 48.9 50.7 47.7 33.9 64.5 45.5 10.3 54.7
MLT19-train (90K) + Ours (1.2M) 67.8 63.0 53.7 47.7 64.0 35.7 62.9 44.3 26.3 57.9
Table 3: Multilingual scene text recognition results (word level accuracy). Latin aggregates English, French, German, and Italian, as they are all marked as Latin in the MLT dataset.

5.1.2 Experiment Results

Experiment results are summarized in Tab. 4. First, we compare our method with previous synthetic datasets. We have to limit the size of the training datasets to since VISD only publishes word images. Our synthetic data achieves consistent improvements on all datasets. In particular, it surpasses other synthetic datasets by a considerable margin on datasets with diverse text styles and complex backgrounds, such as SVTP (). These experiments verify the effectiveness of our synthesis method for scene text recognition, especially in complex cases.

Since small-scale experiments offer limited guidance on how researchers should use these datasets, we further train models on combinations of Synth90K, SynthText, and ours. We first cap the total number of training images. When we train on a combination of all three synthetic datasets, with an equal share from each, the model performs better than the model trained on Synth90K and SynthText only. We further observe that training on the three synthetic datasets together is comparable to training on the whole of Synth90K and SynthText, while using far fewer training images. This result suggests that the best practice is to combine the proposed synthetic dataset with previous ones.
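The capped-budget comparison above can be sketched as a simple balanced-sampling step. This is a minimal illustration, not the authors' released pipeline: `datasets` maps hypothetical dataset names to lists of sample identifiers, and each dataset contributes an equal share up to the total budget.

```python
import random

def mix_datasets(datasets, total):
    """Draw an (approximately) equal share of samples from each dataset.

    `datasets` maps a name (e.g. "Synth90K", "SynthText", "UnrealText")
    to a list of sample identifiers; `total` caps the combined training
    set, mirroring the fixed-budget comparisons in the experiments.
    """
    share = total // len(datasets)
    mixed = []
    for name, samples in datasets.items():
        # A dataset smaller than its share contributes everything it has.
        k = min(share, len(samples))
        mixed.extend(random.sample(samples, k))
    random.shuffle(mixed)
    return mixed
```

With three datasets and a budget of 9 samples, each contributes 3; sampling is without replacement within each dataset.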

Training Data IIIT SVT IC15 SVTP CUTE Total
90K [10] (1M) 51.6 39.2 35.7 37.2 30.9 30.5
ST [6] (1M) 53.5 30.3 38.4 29.5 31.2 31.1
VISD [49] (1M) 53.9 37.1 37.1 36.3 30.5 30.9
Ours (1M) 54.8 40.3 39.1 39.6 31.6 32.1
ST+90K() 80.5 70.1 58.4 60.0 63.9 43.2
ST+90K+ours() 81.6 71.9 61.8 61.7 67.7 45.7
ST+90K() 81.2 71.2 62.0 62.3 65.1 44.7
Table 4: Results on English datasets (word level accuracy).

5.2 Recognizing Multilingual Scene Text

5.2.1 Settings

Although MLT 2017 has been widely used as a benchmark for detection, the task of recognizing multilingual scene text still remains largely untouched, mainly due to the lack of a proper training dataset. To pave the way for future research, we also generate a multilingual version with images containing the 10 languages included in MLT 2019 [25]: Arabic, Bangla, Chinese, English, French, German, Hindi, Italian, Japanese, and Korean. Text content is sampled from corpora extracted from the Wikimedia dump (https://dumps.wikimedia.org).

Model We use the same model and implementation as Section 5.1, except that the symbols to recognize are expanded to all characters that appear in the generated dataset.

Training and Evaluation Data We crop word images from the proposed multilingual dataset. We discard crops whose widths fall below a pixel threshold, as they are too blurry, and obtain 4.1M word images in total. We compare with the multilingual version of SynthText provided by the MLT 2019 competition, which contains a total of 1.2M images. For evaluation, we randomly split a held-out set of images for each language (including symbols and mixed) from the training set of MLT 2019. The rest of the training set is used for training.
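Expanding the recognizer's symbol set to all characters in the generated data (as the Model paragraph describes) amounts to building a character vocabulary from the training labels. The sketch below is an assumption about how such a charset might be constructed; the special token names are hypothetical.

```python
def build_charset(labels, blank="<blank>", unknown="<unk>"):
    """Collect every character appearing in the training labels.

    Special tokens are placed first so their indices stay fixed even as
    the multilingual character set grows; `itos` maps index -> symbol
    and `stoi` maps symbol -> index.
    """
    chars = sorted({c for label in labels for c in label})
    itos = [blank, unknown] + chars
    stoi = {c: i for i, c in enumerate(itos)}
    return itos, stoi
```

Unseen characters at test time would map to the unknown token rather than crash the decoder.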

5.2.2 Experiment Results

Experiment results are shown in Tab. 3. When we only use synthetic data and control the number of images to 1.2M, ours yields a considerable improvement of 1.6% in overall accuracy, with significant improvements on some scripts, e.g. Latin (+7.6%) and Mixed (+15.2%). Using the whole training set of 4.1M images further improves overall accuracy to 39.5%. When we train models on combinations of synthetic data and our training split of MLT19, as shown at the bottom of Tab. 3, we still observe a considerable margin of our method over SynthText, 3.2% in overall accuracy. These results demonstrate that our method is also superior for multilingual scene text recognition, and we believe they will serve as a stepping stone for further research.

6 Limitation and Future Work

There are several aspects worth diving deeper into: (1) Overall, the engine is based on rules and human-selected parameters. Automating the selection of and search for these parameters would save human effort and help adapt to different scenarios. (2) While rendering small text helps train detectors, the low image quality of small text makes recognizers harder to train and harms their performance. Designing a method to mark illegible instances as difficult and exclude them from the loss calculation may help mitigate this problem. (3) For multilingual scene text, scripts other than Latin have far fewer easily accessible fonts. To improve performance on more languages, researchers may consider learning-based methods that transfer Latin fonts to other scripts.
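The mitigation proposed in point (2) above, excluding difficult samples from the loss, can be sketched framework-agnostically. This is an illustrative assumption, not the paper's implementation: `difficult` is a hypothetical per-sample flag (e.g. set for crops below a legibility threshold).

```python
def masked_mean_loss(losses, difficult):
    """Average per-sample losses, skipping samples flagged as difficult.

    Illegible small-text crops are marked difficult and dropped from the
    loss so they do not drag down recognizer training; the remaining
    samples are averaged as usual.
    """
    kept = [loss for loss, flag in zip(losses, difficult) if not flag]
    if not kept:
        # An all-difficult batch contributes no gradient signal.
        return 0.0
    return sum(kept) / len(kept)
```

In a deep learning framework the same idea is usually expressed with a boolean mask multiplied into the per-sample loss tensor before reduction.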

7 Conclusion

In this paper, we introduce a scene text image synthesis engine that renders images with 3D graphics engines, where text instances and scenes are rendered as a whole. In experiments, we verify the effectiveness of the proposed engine in both scene text detection and recognition models. We also study key components of the proposed engine. We believe our work will be a solid stepping stone towards better synthesis algorithms.


This research was supported by National Key R&D Program of China (No. 2017YFA0700800).


  • [1] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik (2011) Contour detection and hierarchical image segmentation. IEEE transactions on pattern analysis and machine intelligence 33 (5), pp. 898–916. Cited by: §3.4.
  • [2] Y. Baek, B. Lee, D. Han, S. Yun, and H. Lee (2019) Character region awareness for text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9365–9374. Cited by: §2.2, §4.2.
  • [3] C. K. Ch’ng and C. S. Chan (2017) Total-text: a comprehensive dataset for scene text detection and recognition. In Proc. ICDAR, Vol. 1, pp. 935–942. Cited by: §5.1.1.
  • [4] Z. Cheng, X. Liu, F. Bai, Y. Niu, S. Pu, and S. Zhou (2017) Arbitrarily-oriented text recognition. In Proc. CVPR 2018. Cited by: §2.2.
  • [5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Proc. NIPS, pp. 2672–2680. Cited by: §2.1.
  • [6] A. Gupta, A. Vedaldi, and A. Zisserman (2016) Synthetic data for text localisation in natural images. In Proc. CVPR, pp. 2315–2324. Cited by: §1, §1, §2.1, §3.4, Table 4.
  • [7] T. He, Z. Tian, W. Huang, C. Shen, Y. Qiao, and C. Sun (2018) An end-to-end textspotter with explicit alignment and attention. In Proc. CVPR, pp. 5020–5029. Cited by: §2.2, §4.1.
  • [8] S. Hinterstoisser, O. Pauly, H. Heibel, M. Marek, and M. Bokeloh (2019) An annotation saved is an annotation earned: using fully synthetic training for object instance detection. CoRR abs/1902.09967. Cited by: §2.1.
  • [9] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §2.2.
  • [10] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman (2014) Synthetic data and artificial neural networks for natural scene text recognition. arXiv preprint arXiv:1406.2227. Cited by: §1, §2.1, Table 4.
  • [11] Y. Jiang, X. Zhu, X. Wang, S. Yang, W. Li, H. Wang, P. Fu, and Z. Luo (2017) R2CNN: rotational region cnn for orientation robust scene text detection. arXiv preprint arXiv:1706.09579. Cited by: §2.2.
  • [12] A. Kar, A. Prakash, M. Liu, E. Cameracci, J. Yuan, M. Rusiniak, D. Acuna, A. Torralba, and S. Fidler (2019) Meta-sim: learning to generate synthetic datasets. arXiv preprint arXiv:1904.11621. Cited by: §2.1.
  • [13] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, et al. (2015) ICDAR 2015 competition on robust reading. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 1156–1160. Cited by: §4.1, §5.1.1.
  • [14] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, and L. P. de las Heras (2013) ICDAR 2013 robust reading competition. In 2013 12th International Conference on Document Analysis and Recognition (ICDAR), pp. 1484–1493. Cited by: §4.1.
  • [15] H. Li, P. Wang, C. Shen, and G. Zhang (2019) Show, attend and read: a simple and strong baseline for irregular text recognition. AAAI. Cited by: §1, §2.2.
  • [16] C. Lin, E. Yumer, O. Wang, E. Shechtman, and S. Lucey (2018) St-gan: spatial transformer generative adversarial networks for image compositing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9455–9464. Cited by: §2.1.
  • [17] X. Liu, D. Liang, S. Yan, D. Chen, Y. Qiao, and J. Yan (2018) FOTS: fast oriented text spotting with a unified network. Proc. CVPR. Cited by: §2.2, §4.1.
  • [18] Y. Liu and L. Jin (2017) Deep matching prior network: toward tighter multi-oriented text detection. In Proc. CVPR, Cited by: §2.2.
  • [19] S. Long, Y. Guan, B. Wang, K. Bian, and C. Yao (2019) Alchemy: techniques for rectification based irregular scene text recognition. arXiv preprint arXiv:1908.11834. Cited by: §1, §5.1.1.
  • [20] S. Long, X. He, and C. Yao (2018) Scene text detection and recognition: the deep learning era. arXiv preprint arXiv:1811.04256. Cited by: §2.2.
  • [21] S. Long, J. Ruan, W. Zhang, X. He, W. Wu, and C. Yao (2018) TextSnake: a flexible representation for detecting text of arbitrary shapes. In Proc. ECCV, Cited by: §2.2.
  • [22] P. Lyu, Z. Yang, X. Leng, X. Wu, R. Li, and X. Shen (2019) 2D attentional irregular scene text recognizer. arXiv preprint arXiv:1906.05708. Cited by: §2.2.
  • [23] J. McCormac, A. Handa, S. Leutenegger, and A. J. Davison (2016) SceneNet RGB-D: 5m photorealistic images of synthetic indoor trajectories with ground truth. CoRR abs/1612.05079. Cited by: §2.1.
  • [24] A. Mishra, K. Alahari, and C. Jawahar (2012) Scene text recognition using higher order language priors. In BMVC-British Machine Vision Conference, Cited by: §5.1.1.
  • [25] N. Nayef, Y. Patel, M. Busta, P. N. Chowdhury, D. Karatzas, W. Khlif, J. Matas, U. Pal, J. Burie, C. Liu, et al. (2019) ICDAR2019 robust reading challenge on multi-lingual scene text detection and recognition–rrc-mlt-2019. arXiv preprint arXiv:1907.00945. Cited by: §5.2.1.
  • [26] N. Nayef, F. Yin, I. Bizid, H. Choi, Y. Feng, D. Karatzas, Z. Luo, U. Pal, C. Rigaud, J. Chazalon, et al. (2017) ICDAR2017 robust reading challenge on multi-lingual scene text detection and script identification-rrc-mlt. In Proc. ICDAR, Vol. 1, pp. 1454–1459. Cited by: §4.1.
  • [27] J. Papon and M. Schoeler (2015) Semantic pose using deep networks trained on synthetic rgb-d. In Proc. ICCV, pp. 774–782. Cited by: §2.1.
  • [28] X. Peng, B. Sun, K. Ali, and K. Saenko (2015) Learning deep object detectors from 3d models. In Proc. ICCV, pp. 1278–1286. Cited by: §2.1.
  • [29] S. Qin, A. Bissacco, M. Raptis, Y. Fujii, and Y. Xiao (2019) Towards unconstrained end-to-end text spotting. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4704–4714. Cited by: §2.2.
  • [30] W. Qiu and A. Yuille (2016) Unrealcv: connecting computer vision to unreal engine. In Proc. ECCV, pp. 909–916. Cited by: §3.1.
  • [31] T. Quy Phan, P. Shivakumara, S. Tian, and C. Lim Tan (2013) Recognizing text with perspective distortion in natural scenes. In Proc. ICCV, pp. 569–576. Cited by: §5.1.1.
  • [32] S. R. Richter, V. Vineet, S. Roth, and V. Koltun (2016) Playing for data: ground truth from computer games. In European conference on computer vision, pp. 102–118. Cited by: §2.1.
  • [33] A. Risnumawan, P. Shivakumara, C. S. Chan, and C. L. Tan (2014) A robust arbitrary text detection system for natural scene images. Expert Systems with Applications 41 (18), pp. 8027–8048. Cited by: §5.1.1.
  • [34] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez (2016) The synthia dataset: a large collection of synthetic images for semantic segmentation of urban scenes. In Proc. CVPR, pp. 3234–3243. Cited by: §2.1.
  • [35] S. D. Roth (1982) Ray casting for modeling solids. Computer Graphics & Image Processing 18 (2), pp. 109–144. Cited by: §3.2.1.
  • [36] F. S. Saleh, M. S. Aliakbarian, M. Salzmann, L. Petersson, and J. M. Alvarez (2018) Effective use of synthetic data for urban scene semantic segmentation. In Proc. ECCV, pp. 86–103. Cited by: §2.1.
  • [37] B. Shi, X. Bai, and S. Belongie (2017) Detecting oriented text in natural images by linking segments. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.2.
  • [38] B. Shi, M. Yang, X. Wang, P. Lyu, X. Bai, and C. Yao (2018) ASTER: an attentional scene text recognizer with flexible rectification. IEEE transactions on pattern analysis and machine intelligence 31 (11), pp. 855–868. Cited by: §2.2, §5.1.1.
  • [39] Z. Tian, M. Shu, P. Lyu, R. Li, C. Zhou, X. Shen, and J. Jia (2019) Learning shape-aware embedding for scene text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4234–4243. Cited by: §2.2.
  • [40] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel (2017) Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 23–30. Cited by: §3.3.
  • [41] J. Tremblay, T. To, and S. Birchfield (2018) Falling things: a synthetic dataset for 3d object detection and pose estimation. In Proc. CVPR Workshops, pp. 2038–2041. Cited by: §2.1.
  • [42] G. Varol, J. Romero, X. Martin, N. Mahmood, M. J. Black, I. Laptev, and C. Schmid (2017) Learning from synthetic humans. In Proc. CVPR, pp. 109–117. Cited by: §2.1.
  • [43] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Proc. NIPS, pp. 5998–6008. Cited by: §2.2.
  • [44] K. Wang, B. Babenko, and S. Belongie (2011) End-to-end scene text recognition. In 2011 IEEE International Conference on Computer Vision (ICCV),, pp. 1457–1464. Cited by: §5.1.1.
  • [45] T. Wang, D. J. Wu, A. Coates, and A. Y. Ng (2012) End-to-end text recognition with convolutional neural networks. In 2012 21st International Conference on Pattern Recognition (ICPR), pp. 3304–3308. Cited by: §1, §2.1.
  • [46] X. Wang, Y. Jiang, Z. Luo, C. Liu, H. Choi, and S. Kim (2019) Arbitrary shape scene text detection with adaptive text region representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6449–6458. Cited by: §2.2.
  • [47] X. Wang, Z. Man, M. You, and C. Shen (2017) Adversarial generation of training examples: applications to moving vehicle license plate recognition. arXiv preprint arXiv:1707.03124. Cited by: §2.1.
  • [48] Q. Ye and D. Doermann (2015) Text detection and recognition in imagery: a survey. IEEE transactions on pattern analysis and machine intelligence 37 (7), pp. 1480–1500. Cited by: §2.2.
  • [49] F. Zhan, S. Lu, and C. Xue (2018) Verisimilar image synthesis for accurate detection and recognition of texts in scenes. In Proc. ECCV, Cited by: §1, §1, §2.1, §3.4, Table 4.
  • [50] F. Zhan, H. Zhu, and S. Lu (2019) Spatial fusion gan for image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3653–3662. Cited by: §2.1, §2.1.
  • [51] C. Zhang, B. Liang, Z. Huang, M. En, J. Han, E. Ding, and X. Ding (2019) Look more than once: an accurate detector for text of arbitrary shapes. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2.2, §4.2.
  • [52] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, and J. Liang (2017) EAST: an efficient and accurate scene text detector. In Proc. CVPR, Cited by: §2.2, §4.1.

A. Scene Models

In this work, we use a collection of 3D scene models, all obtained from the Internet. However, most of these models are not free, so we are not permitted to redistribute the models themselves. Instead, we list the models we use and their links in Tab. 5.

Scene Name Link
Urban City https://www.unrealengine.com/marketplace/en-US/product/urban-city
Medieval Village https://www.unrealengine.com/marketplace/en-US/product/medieval-village
Loft https://ue4arch.com/shop/complete-projects/archviz/loft/
Desert Town https://www.unrealengine.com/marketplace/en-US/product/desert-town
Archinterior 1 https://www.unrealengine.com/marketplace/en-US/product/archinteriors-vol-2-scene-01
Desert Gas Station https://www.unrealengine.com/marketplace/en-US/product/desert-gas-station
Modular School https://www.unrealengine.com/marketplace/en-US/product/modular-school-pack
Factory District https://www.unrealengine.com/marketplace/en-US/product/factory-district
Abandoned Factory https://www.unrealengine.com/marketplace/en-US/product/modular-abandoned-factory
Buddhist https://www.unrealengine.com/marketplace/en-US/product/buddhist-monastery-environment
Castle Fortress https://www.unrealengine.com/marketplace/en-US/product/castle-fortress
Desert Ruin https://www.unrealengine.com/marketplace/en-US/product/modular-desert-ruins
HALArchviz https://www.unrealengine.com/marketplace/en-US/product/hal-archviz-toolkit-v1
Hospital https://www.unrealengine.com/marketplace/en-US/product/modular-sci-fi-hospital
HQ House https://www.unrealengine.com/marketplace/en-US/product/hq-residential-house
Industrial City https://www.unrealengine.com/marketplace/en-US/product/industrial-city
Archinterior 2 https://www.unrealengine.com/marketplace/en-US/product/archinteriors-vol-4-scene-02
Office https://www.unrealengine.com/marketplace/en-US/product/retro-office-environment
Meeting Room https://drive.google.com/file/d/0B_mjKk7NOcnEUWZuRDVFQ09STE0/view
Old Village https://www.unrealengine.com/marketplace/en-US/product/old-village
Modular Building https://www.unrealengine.com/marketplace/en-US/product/modular-building-set
Modular Home https://www.unrealengine.com/marketplace/en-US/product/supergenius-modular-home
Dungeon https://www.unrealengine.com/marketplace/en-US/product/top-down-multistory-dungeons
Old Town https://www.unrealengine.com/marketplace/en-US/product/old-town
Root Cellar https://www.unrealengine.com/marketplace/en-US/product/root-cellar
Victorian https://www.unrealengine.com/marketplace/en-US/product/victorian-street
Spaceship https://www.unrealengine.com/marketplace/en-US/product/spaceship-interior-environment-set
Top-Down City https://www.unrealengine.com/marketplace/en-US/product/top-down-city
Utopian City https://www.unrealengine.com/marketplace/en-US/product/utopian-city
Table 5: The list of 3D scene models used in this work.

B. New Annotations for Scene Text Recognition Datasets

During our scene text recognition experiments on English scripts, we noticed that several of the most widely used benchmark datasets have incomplete annotations: IIIT5K, SVT, SVTP, and CUTE-80. The annotations of these datasets are case-insensitive and ignore punctuation marks.

The common practice in recent scene text recognition research is to convert both the predicted and ground-truth text strings to lower case and then compare them. This means the current evaluation is flawed: it ignores letter case and punctuation marks, which are crucial to understanding the text content. Besides, evaluating on a much smaller vocabulary produces over-optimistic estimates of recognition performance.
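The two evaluation protocols contrasted above can be made concrete with a small scoring function. This is a sketch of the protocols, not an official evaluation script: with `case_sensitive=False` and `keep_punct=False` it reproduces the common lower-case, punctuation-free comparison, while both flags set to `True` correspond to the stricter evaluation the new annotations enable.

```python
import string

def word_accuracy(preds, gts, case_sensitive=True, keep_punct=True):
    """Word-level accuracy under configurable normalization.

    Both prediction and ground truth are normalized identically before
    exact-match comparison, mirroring how recognition benchmarks score.
    """
    def norm(s):
        if not keep_punct:
            s = "".join(c for c in s if c not in string.punctuation)
        if not case_sensitive:
            s = s.lower()
        return s

    correct = sum(norm(p) == norm(g) for p, g in zip(preds, gts))
    return correct / len(gts)
```

A prediction like "Hello!" against ground truth "hello" counts as correct under the lenient protocol but wrong under the strict one, which is exactly the gap the re-annotation exposes.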

To aid further research, we use Amazon Mechanical Turk (AMT) to re-annotate the aforementioned datasets. Each word image is annotated by multiple workers, and we manually check and correct images where the annotations differ. The annotated datasets are released via GitHub at https://github.com/Jyouhou/Case-Sensitive-Scene-Text-Recognition-Datasets.

B.1 Samples

We select some samples from the datasets to demonstrate the new annotations in Fig. 6.

B.2 Benchmark Performances

As we are encouraging case-sensitive evaluation (including punctuation marks) for scene text recognition, we provide benchmark performances on these widely used datasets. We evaluate two implementations of the ASTER model, by Long et al. (https://github.com/Jyouhou/ICDAR2019-ArT-Recognition-Alchemy) and Baek et al. (https://github.com/clovaai/deep-text-recognition-benchmark), respectively. Results are summarized in Tab. 6.

The two benchmark implementations perform comparably, with Baek et al.'s better on straight text and Long et al.'s better on curved text. Compared with evaluating on lower-case characters and digits only, performance drops considerably for both models when we evaluate with all symbols. These results indicate that recognizing a larger vocabulary remains a challenge and is worth further research.

Figure 6: Examples of the new annotations.
Implementation Case IIIT SVT IC13 IC15 SVTP CUTE80 Total
Long et al. All 81.2 71.2 86.9 62.0 62.3 65.1 44.7
Baek et al. All 81.5 71.7 88.9 62.1 62.6 64.9 41.5
Long et al. lower case + digits 89.5 84.1 89.9 68.8 73.5 76.3 58.2
Baek et al. lower case + digits 86.5 83.5 93.0 70.3 75.1 68.4 46.0
Table 6: Results on English datasets (word level accuracy). All indicates that the evaluation considers lower case characters, upper case characters, numerical digits, and punctuation marks.