SynthTIGER: Synthetic Text Image GEneratoR Towards Better Text Recognition Models

07/20/2021 ∙ by Moonbin Yim, et al. ∙ NAVER Corp.

For successful scene text recognition (STR) models, synthetic text image generators have alleviated the lack of annotated text images from the real world. Specifically, they generate multiple text images with diverse backgrounds, font styles, and text shapes and enable STR models to learn visual patterns that might not be accessible from manually annotated data. In this paper, we introduce a new synthetic text image generator, SynthTIGER, by analyzing techniques used for text image synthesis and integrating effective ones under a single algorithm. Moreover, we propose two techniques that alleviate the long-tail problem in length and character distributions of training data. In our experiments, SynthTIGER achieves better STR performance than the combination of synthetic datasets, MJSynth (MJ) and SynthText (ST). Our ablation study demonstrates the benefits of using sub-components of SynthTIGER and the guideline on generating synthetic text images for STR models. Our implementation is publicly available at https://github.com/clovaai/synthtiger.

1 Introduction

Figure 1: Word box images generated by synthesis engines: MJ [5], ST [2], and ours. MJ provides diverse text styles, but its images contain no noise from other texts. The ST examples are cropped from a scene text image containing multiple text boxes, so they include parts of other texts. Although our synthesis engine generates word box images as MJ does, its examples include the text noise observed in ST examples.

Optical character recognition (OCR) is a technology that extracts machine-encoded text from text images. It is a fundamental function for visual understanding and has been used in diverse real-world applications such as automatic number plate recognition [16], business document recognition [15, 4, 3], and passport recognition [9]. In the deep learning era [10, 11], OCR performance has been dramatically improved by learning from large-scale data consisting of image and text pairs. In general, OCR uses large-scale data consisting of synthetic text images because it is virtually impossible to manually gather and annotate real text images that cover the exponential combinations of diverse characteristics such as text length, fonts, and backgrounds.

OCR in the wild consists of two sub-tasks, scene text detection (STD) and scene text recognition (STR). They require similar but different training data. Since STD has to localize text areas from backgrounds, its training example is a raw scene or document snapshot containing multiple texts. In contrast, STR identifies a character sequence from a word box image patch that contains a single word or a line of words. It requires a number of synthetic examples to cover the diversity of styles and texts that might exist in the real world. This paper focuses on synthetic data generation for STR to address the diversity of textual appearance in a word box image.

There are two popular synthesis engines, MJ [5] and ST [2]. MJ is a text image synthesis engine that generates word box images by running multiple rendering modules: font rendering, border/shadow rendering, base coloring, projective distortion, natural data blending, and noise injection. By focusing on generating word box images rather than scene text images, MJ can control all text styles used in its rendering modules, such as font color and size, but the generated word box images cannot fully represent text regions cropped from a real scene image. In contrast, ST [2] generates scene text images that include multiple word boxes on a single scene image. ST identifies text regions and writes texts on these regions through font rendering, border/shadow rendering, base coloring, and Poisson image editing. Since word boxes are cropped from a scene text image, the resulting word box images can include text noise from other word boxes, as real STR examples do. However, there are some constraints on choosing text styles because the background regions identified for text rendering may not be compatible with some text rendering functions (e.g., a region may be too small for a big font size).

To take advantage of both approaches, recent STR research [19, 1] simply integrates datasets generated by both MJ and ST. However, the simple data integration not only increases the total number of training data but also causes a bias on co-covered data distributions of both synthesis engines. Although the integration provides better STR performance than the individuals, there is still room for improvement by considering a better method combining the benefits of MJ and ST.

In this paper, we introduce a new synthesis engine, referred to as Synthetic Text Image GEneratoR (SynthTIGER), for better STR models. Like MJ, SynthTIGER generates word box images without the style constraints of ST. However, like ST, it adds noise from other text regions, which can occur when a text region is cropped from a scene image. Fig. 1 shows examples synthesized with MJ, ST, and SynthTIGER. The examples of SynthTIGER may contain parts of other texts, like those of ST. In our apples-to-apples comparison, SynthTIGER shows better performance than either MJ or ST alone. Moreover, SynthTIGER provides performance comparable to the integrated dataset of MJ and ST even though only one synthesis engine is used.

Furthermore, we propose two methods to alleviate skewed data distributions for infrequent characters and short/lengthy words. Previous synthesis engines generate text images by randomly sampling target texts from a pre-defined lexicon. Because infrequent characters and very short or long words have a low chance of being sampled, trained models often perform poorly on these kinds of words. SynthTIGER uses length augmentation and infrequent character augmentation methods to address this problem. As shown in the experimental results, these methods improve STR performance on rare and short/long words.

Finally, this paper provides an open-source synthesis engine and a new synthetic dataset that shows better STR performance than the combined dataset of MJ and ST. Our experiments under fair comparisons to baseline engines prove the superiority of SynthTIGER. Furthermore, ablative studies on rendering functions describe how rendering processes in SynthTIGER contribute to improving STR performance. The experiments on data distributions of lengths and characters show the importance of synthetic text balancing. The official implementation of SynthTIGER is open-source and the synthesized dataset is publicly available.

2 Related Work

In STR, it has become a standard practice to use synthetic datasets. We introduce previous studies providing synthetic data generation algorithms for STR [5] and other tasks that can be exploited for STR [2, 8, 12].

MJ [5] is one of the most popular data generation algorithms (and the dataset generated with that approach) for STR. It produces an image patch containing a single word. In detail, the algorithm consists of six stages. The font rendering stage randomly selects a font and font properties such as size, weight, and underline. Then, it samples a word from a pre-defined vocabulary and renders it on the foreground image layer following a horizontal line or a random curve. The border/shadow rendering stage optionally adds inset border, outset border, or shadow image layers with a random width. The following base coloring stage fills these image layers with different colors. The projective distortion stage applies a random, full-projective transformation to simulate the 3D world. The natural data blending stage blends these image layers with a randomly sampled crop of an image from the ICDAR 2003 and SVT datasets. Finally, the noise stage injects various kinds of noise, such as blur and JPEG compression artifacts, into the image. While MJ is known to generate text images useful enough to train STR models, it is not clear how much each stage contributes to its success.

Synthetic datasets for scene text detection (STD) can be used for STR by cropping text regions from synthesized images. The main difference between STD and STR data generation algorithms is that STD must consider the geometry of backgrounds to create realistic images. ST [2] is the most successful STD data generation algorithm. In detail, it first samples a text and a background image. Then, it segments the image based on color and texture to obtain well-defined regions (e.g., surfaces of objects). Next, it selects the region for the text and fills the text (and optionally an outline) with a color based on the region’s color. Finally, the text is rendered using a randomly selected font and a transformation according to the region’s orientation. However, the use of off-the-shelf segmentation techniques for text-background alignment can produce erroneous predictions and result in unrealistic text images. Recent studies such as SynthText3D [8] and UnrealText [12] address this problem by synthesizing images with 3D graphics engines. Their experimental results show that text detection performance can be notably improved by using synthesized text images without text alignment errors. However, it is not clear whether these datasets generated from a virtual 3D world can benefit the text recognition task.

3 SynthTIGER

SynthTIGER consists of two major components: text selection and text rendering modules. The text selection module samples a target text from a pre-defined lexicon. Then, the text rendering module generates a text image by using multiple fonts, backgrounds (textures), and a color map. In this section, we first describe the text rendering process and then the target text selection process.

Figure 2: Overview of the SynthTIGER rendering processes, consisting of (a) text shape selection, (b) text style selection, (c) transformation, (d) blending, and (e) post-processing. The foreground text is the ground-truth text of a generated image, and the mid-ground text is used as high-frequency visual noise, contributing lines and dots from the visual appearance of characters. Foreground and mid-ground texts are rendered independently in (a), (b), and (c), and the background and the texts are combined in (d). Finally, visual noise is added to the combined image in (e).

3.1 Text Rendering Process

Synthesized text images should reflect the realness of texts from both a “micro” perspective of a word box image and a “macro” perspective of a scene-level text image. The rendering process of SynthTIGER generates text-focused images for the micro-level perspective, but it additionally adopts noise from the macro-level perspective. Specifically, SynthTIGER renders a target text and a noise text and combines them to reflect the realness of text regions (in the wild, part of one word can appear in the region of another word). Fig. 2 gives an overview of the modules of the SynthTIGER engine. It consists of five procedures: (a) text shape selection, (b) text style selection, (c) transformation, (d) blending, and (e) post-processing. The first three processes, (a), (b), and (c), are applied separately to the foreground layer for a target text and the mid-ground layer for a noise text. In step (d), the two layers are combined with a background to form a single synthesized image. Finally, step (e) adds realistic noise. The following subsections introduce each module in detail.

3.1.1 (a) Text Shape Selection

Text shape selection decides the 2-dimensional shape of a 1-dimensional character sequence. This process first identifies the individual character shapes of a target text and then renders them along a certain line in 2D space in left-to-right order.

To reveal visual appearances of the characters, a font is randomly selected from a pool of font styles and each character is rendered upon individual boards with randomly chosen font size and thickness. To add diversity of font styles, elastic distortion [20] is applied to the rendered characters.

Defining a spatial order of characters is essential to map characters upon 2-dimensional space. For straight texts, SynthTIGER basically aligns character boards in the left-to-right order with a certain margin between the boards. For curved texts, SynthTIGER places the character boards on a parabolic curve. The curvature of the curve is identified by the maximum height-directional gaps between the centers of the boards. The maximum gap is randomly chosen and the middle points of the target text are allocated on the centroid of the parabolic curve. The character boards upon the curve are rotated with a slope of the curve under a certain probability.
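The curved layout described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the SynthTIGER implementation: the function name `layout_on_parabola`, the fixed board width, and the `rotate_prob` parameter are hypothetical. The parabola has its vertex at the middle of the text, and `max_gap` is the largest vertical offset between board centers.

```python
import math
import random

def layout_on_parabola(num_chars, board_width, max_gap, rotate_prob=0.5, rng=None):
    """Return one (x, y, angle_deg) tuple per character board on y = a * (x - cx)^2."""
    if num_chars == 0:
        return []
    rng = rng or random.Random()
    # Board centers are spaced evenly in left-to-right order.
    xs = [board_width * (i + 0.5) for i in range(num_chars)]
    cx = (xs[0] + xs[-1]) / 2.0              # the vertex sits at the middle of the text
    half_span = xs[-1] - cx
    # Pick `a` so the outermost boards are `max_gap` away from the vertex height.
    a = max_gap / (half_span ** 2) if half_span > 0 else 0.0
    rotate = rng.random() < rotate_prob      # boards follow the slope only sometimes
    placements = []
    for x in xs:
        y = a * (x - cx) ** 2
        slope = 2.0 * a * (x - cx)           # derivative of the parabola at x
        angle = math.degrees(math.atan(slope)) if rotate else 0.0
        placements.append((x, y, angle))
    return placements
```

A negative `max_gap` bends the text the other way, so sampling its sign and magnitude at random covers both upward and downward curves.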

3.1.2 (b) Text Style Selection

This part chooses colors and textures of a text, and injects additional text effects such as bordering, shadowing and extruding texts.

A color map is an estimate of the real distribution over colors of text images. It can be identified by clustering the colors of real text images. It usually consists of 2 or 3 clusters, each with a mean gray-scale color and its standard deviation (s.t.d.). MJ and ST also use such a color map, identified from the ICDAR03 dataset [13] and the IIIT dataset [14], respectively. In our work, we adopt the color map used in ST. Color selection from the color map is conducted sequentially: first a cluster is chosen, and then a color is sampled based on the cluster's mean and s.t.d. Once a color is selected, SynthTIGER changes the color of the character appearances.
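The two-step color selection (cluster first, then a color from that cluster's mean and s.t.d.) can be sketched as below. The uniform choice among clusters, the Gaussian draw, the function name `sample_gray_color`, and the values in `example_color_map` are all assumptions made for illustration; the text does not specify them.

```python
import random

def sample_gray_color(color_map, rng=None):
    """Pick a cluster, then draw a gray level around its mean, clipped to [0, 255]."""
    rng = rng or random.Random()
    mean, std = rng.choice(color_map)        # step 1: choose a cluster (uniformly here)
    value = rng.gauss(mean, std)             # step 2: sample around the cluster mean
    return int(min(255, max(0, round(value))))

# Hypothetical color map with two clusters given as (mean, s.t.d.) gray levels.
example_color_map = [(40.0, 10.0), (210.0, 15.0)]
```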

The colors of texts in the real world are not simply represented with a single color. SynthTIGER uses multiple texture sources to reflect the realness of text colors. Specifically, it picks a random texture from these sources, performs a random crop of the texture, and uses the crop as the texture of the text appearance in the synthetic image. In this process, the transparency of the texture is also chosen randomly to diversify the effect of textures.

In the real world, the characters’ boundary exhibits diverse patterns depending on text styles, text background, and environmental conditions. We can simulate the boundary styles by applying text border, shadow, and extruding effects. SynthTIGER randomly chooses one of these effects and applies it to the text. All required parameters such as effect size and color will be sampled randomly from a pre-defined range.

3.1.3 (c) Transformation

The visual appearance of the same scene text image can be significantly different depending on the view angle. Moreover, these text images detected by different OCR engines or labeled by multiple human annotators will have different patterns. SynthTIGER generates synthesized images reflecting these characteristics by utilizing multiple transformation functions.

In detail, SynthTIGER provides stretch, trapezoidate, skew and rotate transformations. Their functions are explained below.

  • Stretch adjusts the width or height of the text images.

  • Trapezoidate chooses an edge of the text image and then adjusts its length.

  • Skew tilts the text image to one of the four directions such as the right, left, top and bottom.

  • Rotate turns the text image clockwise or anticlockwise.

SynthTIGER applies one of these transformations to the text image with necessary parameter values randomly sampled.
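One way to picture the four transformations is as maps from the corners of the text image to a destination quadrilateral, which a perspective warp could then apply to the pixels. The sketch below is illustrative only: the function names, the parameter ranges in `random_transform`, and the exact meaning given to each skew direction are assumptions, not details taken from SynthTIGER.

```python
import math
import random

def rect_corners(w, h):
    """Corners of a w x h image: top-left, top-right, bottom-right, bottom-left."""
    return [(0.0, 0.0), (float(w), 0.0), (float(w), float(h)), (0.0, float(h))]

def stretch(w, h, sx=1.0, sy=1.0):
    """Scale the width by sx and the height by sy."""
    return [(x * sx, y * sy) for x, y in rect_corners(w, h)]

def trapezoidate(w, h, edge="top", ratio=0.8):
    """Rescale one edge to `ratio` times its length, keeping its midpoint fixed."""
    tl, tr, br, bl = rect_corners(w, h)
    dx, dy = w * (1.0 - ratio) / 2.0, h * (1.0 - ratio) / 2.0
    if edge == "top":
        tl, tr = (dx, 0.0), (w - dx, 0.0)
    elif edge == "bottom":
        bl, br = (dx, float(h)), (w - dx, float(h))
    elif edge == "left":
        tl, bl = (0.0, dy), (0.0, h - dy)
    elif edge == "right":
        tr, br = (float(w), dy), (float(w), h - dy)
    else:
        raise ValueError(edge)
    return [tl, tr, br, bl]

def skew(w, h, direction="right", angle_deg=10.0):
    """Shear: the top edge shifts for right/left, the right edge for top/bottom."""
    t = math.tan(math.radians(angle_deg))
    tl, tr, br, bl = rect_corners(w, h)
    if direction in ("right", "left"):
        shift = h * t if direction == "right" else -h * t
        tl, tr = (tl[0] + shift, tl[1]), (tr[0] + shift, tr[1])
    elif direction in ("top", "bottom"):
        shift = -w * t if direction == "top" else w * t
        tr, br = (tr[0], tr[1] + shift), (br[0], br[1] + shift)
    else:
        raise ValueError(direction)
    return [tl, tr, br, bl]

def rotate(w, h, angle_deg=5.0):
    """Rotate about the image center; positive angles are clockwise when y points down."""
    c, s = math.cos(math.radians(angle_deg)), math.sin(math.radians(angle_deg))
    cx, cy = w / 2.0, h / 2.0
    return [(cx + (x - cx) * c - (y - cy) * s, cy + (x - cx) * s + (y - cy) * c)
            for x, y in rect_corners(w, h)]

def random_transform(w, h, rng=None):
    """Apply exactly one transformation with randomly sampled parameters."""
    rng = rng or random.Random()
    name = rng.choice(["stretch", "trapezoidate", "skew", "rotate"])
    if name == "stretch":
        return name, stretch(w, h, rng.uniform(0.8, 1.2), rng.uniform(0.8, 1.2))
    if name == "trapezoidate":
        edge = rng.choice(["top", "bottom", "left", "right"])
        return name, trapezoidate(w, h, edge, rng.uniform(0.7, 1.0))
    if name == "skew":
        direction = rng.choice(["right", "left", "top", "bottom"])
        return name, skew(w, h, direction, rng.uniform(0.0, 15.0))
    return name, rotate(w, h, rng.uniform(-10.0, 10.0))
```

The returned quadrilateral can have negative coordinates, so a real pipeline would translate it into the positive quadrant and resize the canvas before warping.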

Finally, SynthTIGER adds random margins to simulate the diverse results of text detectors. The margins are independently applied to the top, bottom, left, and right of the image.

3.1.4 (d) Blending

The blending process first creates a background image by randomly sampling a color and a texture from the color map and the texture database. It randomly changes the transparency of the background texture to diversify the impact of the background. Second, it creates two text images, foreground and mid-ground, with the same rendering processes but different random parameters. The first one contains the target text and the second one carries a noise text. The next step is to combine the mid-ground and background images. The blending process first crops the background image to match the text image size. Then, it randomly shifts the noise text in the mid-ground and makes the non-textual area transparent. Finally, it merges the two images by using one of multiple blending methods: normal, multiply, screen, overlay, hard-light, soft-light, dodge, divide, addition, difference, darken-only, and lighten-only. The last step is to overlay the foreground text image on the merged background. The target text area, with a small margin, is kept non-transparent to distinguish the target text from the noise text. This step also uses one of the blending methods mentioned above.
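For readers unfamiliar with layer blending modes, the sketch below shows commonly used per-channel definitions for a subset of the modes listed above, with channel values in [0, 1]. These are the conventional formulas from image editing tools; whether SynthTIGER uses exactly these definitions is an assumption, and the names `blend_channel` and `blend_pixel` are hypothetical.

```python
def blend_channel(base, top, mode):
    """Blend one channel of the top layer onto the base layer (values in [0, 1])."""
    if mode == "normal":
        return top
    if mode == "multiply":
        return base * top
    if mode == "screen":
        return 1.0 - (1.0 - base) * (1.0 - top)
    if mode == "overlay":
        # Multiply in dark base regions, screen in bright ones.
        if base < 0.5:
            return 2.0 * base * top
        return 1.0 - 2.0 * (1.0 - base) * (1.0 - top)
    if mode == "addition":
        return min(1.0, base + top)
    if mode == "difference":
        return abs(base - top)
    if mode == "darken-only":
        return min(base, top)
    if mode == "lighten-only":
        return max(base, top)
    raise ValueError("unsupported mode: " + mode)

def blend_pixel(base_rgb, top_rgb, top_alpha, mode):
    """Blend a pixel, then composite the result using the top layer's alpha."""
    out = []
    for b, t in zip(base_rgb, top_rgb):
        mixed = blend_channel(b, t, mode)
        out.append(b * (1.0 - top_alpha) + mixed * top_alpha)
    return tuple(out)
```

With `top_alpha` set to 0 for non-textual pixels, the mid-ground layer leaves the background untouched there, which matches the transparent non-textual area described above.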

A synthesized image created through steps (a) to (d) might not be a good text-focused image for several reasons. For example, its text and background colors may happen to be indistinguishable because they are chosen independently. To address this problem, we adopt the Flood-Fill algorithm (https://en.wikipedia.org/wiki/Flood_fill). We apply this algorithm starting from a pixel inside the target text, count the number of text boundary pixels visited, and calculate the ratio of the visited text boundary pixels to the number of all boundary pixels. This process is repeated until all target text pixels are used. If this ratio exceeds a certain threshold, we conclude that the target text and background are indistinguishable and discard the generated image.
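The visibility check can be sketched as below for a gray-scale image. Several details are assumptions made for this illustration, since the text does not fix them: the boundary is taken to be the non-text pixels that touch a text pixel, the fill spreads between 4-connected neighbors whose gray levels differ by at most `tolerance`, and the names `leak_ratio` and `is_visible` and the default threshold are hypothetical.

```python
from collections import deque

NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))

def leak_ratio(image, text_mask, tolerance=10):
    """Fraction of text boundary pixels that a flood fill from the text can reach."""
    h, w = len(image), len(image[0])
    # Boundary: non-text pixels adjacent to at least one text pixel (assumption).
    boundary = set()
    for y in range(h):
        for x in range(w):
            if text_mask[y][x]:
                continue
            for dy, dx in NEIGHBORS:
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and text_mask[ny][nx]:
                    boundary.add((y, x))
                    break
    if not boundary:
        return 0.0
    # Seeding with every text pixel is equivalent to repeating the fill per pixel.
    visited = {(y, x) for y in range(h) for x in range(w) if text_mask[y][x]}
    queue = deque(visited)
    reached = set()
    while queue:
        y, x = queue.popleft()
        for dy, dx in NEIGHBORS:
            ny, nx = y + dy, x + dx
            if not (0 <= ny < h and 0 <= nx < w) or (ny, nx) in visited:
                continue
            # The fill leaks only into pixels with a similar gray level.
            if abs(image[ny][nx] - image[y][x]) <= tolerance:
                visited.add((ny, nx))
                if (ny, nx) in boundary:
                    reached.add((ny, nx))
                queue.append((ny, nx))
    return len(reached) / len(boundary)

def is_visible(image, text_mask, threshold=0.5, tolerance=10):
    """Keep the image only if the fill leaks through a small part of the boundary."""
    return leak_ratio(image, text_mask, tolerance) <= threshold
```

A high ratio means the fill escapes the text through most of its outline, i.e., the text color is close to the surrounding background almost everywhere.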

3.1.5 (e) Post-processing

Post-processing is conducted to finalize the synthetic data generation. In this process, SynthTIGER injects general visual noises such as gaussian noise, gaussian blur, resize, median blur and JPEG compression.

3.2 Text Selection Strategy

The previous methods, MJ and ST, randomly sample target texts from a user-provided lexicon. In contrast, SynthTIGER provides two additional strategies to control the text length distribution and character distribution of a synthesized dataset. It alleviates the long-tail problem inherited from the use of a lexicon.

3.2.1 Text Length Distribution Control

The length distribution of texts randomly sampled from a lexicon does not represent the true distribution of real-world text data. To alleviate this problem, SynthTIGER performs text length distribution augmentation with a given probability. The augmentation process first randomly chooses a target text length between 1 and a pre-defined maximum value. Then, it randomly samples a word from the lexicon. If the word matches the target length, SynthTIGER uses it as the target text. For longer words, it simply cuts off the extra rightmost characters. For shorter words, it samples a new word and attaches it to the right of the previous one until the concatenated word matches or exceeds the target length; any extra rightmost characters are then cut off. Text length augmentation, however, should be used with caution because the generated texts are mostly nonsensical. In the experiment section (Table 6), text length augmentation applied with a probability of 50% increases STR accuracy by more than 2 points (82.1 to 84.2), while applying it with a probability of 100% makes little difference (82.0).
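The augmentation procedure above translates almost directly into code. The sketch below follows the described steps; the function name `sample_text`, its parameters, and the uniform choice of target length are assumptions for illustration, and the lexicon is assumed to contain no empty strings (otherwise the loop might not terminate).

```python
import random

def sample_text(lexicon, max_length, aug_prob, rng=None):
    """Sample a target text, applying length augmentation with probability aug_prob."""
    rng = rng or random.Random()
    word = rng.choice(lexicon)
    if rng.random() >= aug_prob:
        return word                          # plain lexicon sampling, no augmentation
    target_len = rng.randint(1, max_length)  # inclusive on both ends
    text = word
    while len(text) < target_len:
        text += rng.choice(lexicon)          # attach new words on the right
    return text[:target_len]                 # cut off extra rightmost characters
```

Because words are concatenated and truncated without regard to meaning, most augmented texts are nonsensical strings, which is why the augmentation probability matters.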

3.2.2 Character Distribution Control

Languages such as Chinese and Japanese use a large number of characters. A synthesized dataset for such a language often lacks enough samples for rare characters. To deal with this problem, SynthTIGER conducts character distribution augmentation with a given probability. When the augmentation is triggered, it randomly chooses a character from the vocabulary and samples a word containing that character. In the experiments, we show that character distribution augmentation with a probability between 0.25 and 0.5 improves STR performance in both the scene and document domains.
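A minimal sketch of this character-driven sampling is shown below. It assumes the character vocabulary is the set of characters that occur in the lexicon and that the character is chosen uniformly; the names `build_char_index` and `sample_text_by_char` are hypothetical.

```python
import random
from collections import defaultdict

def build_char_index(lexicon):
    """Map each character to the list of lexicon words that contain it."""
    index = defaultdict(list)
    for word in lexicon:
        for ch in set(word):
            index[ch].append(word)
    return index

def sample_text_by_char(lexicon, char_index, aug_prob, rng=None):
    """With probability aug_prob, pick a character first and then a word containing it."""
    rng = rng or random.Random()
    if rng.random() < aug_prob:
        ch = rng.choice(sorted(char_index))   # uniform over characters, not over words
        return rng.choice(char_index[ch])
    return rng.choice(lexicon)                # plain lexicon sampling
```

Choosing the character uniformly gives a rare character such as ¥ the same chance of being picked as a common one, so words containing rare characters are sampled far more often than under plain lexicon sampling.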

4 Experimental Results

This section consists of (S4.1) our experimental settings for both synthetic data generation and STR model development, (S4.2) comparison to popular synthetic datasets, (S4.3) apples-to-apples comparison of synthesis engines with the common resources, (S4.4) ablative studies on rendering functions, (S4.5) experiments on controlling text distributions.

4.1 Experimental Settings

To compare synthetic data generation engines, synthetic datasets are built with them and then STR performances are evaluated from the models trained with the generated datasets. Here, we describe the resources used in synthetic data generation engines and the training and evaluating settings of STR models.

4.1.1 Resources for Synthetic Data Generation

To build synthetic datasets, multiple resources are required: a lexicon, fonts, textures, and a color map. Table 1 describes the resources used in MJ and ST as well as in our experiments. As can be seen, MJ and ST are built with their own resources. MJ utilizes a lexicon combining the Hunspell corpus (http://hunspell.github.io/) and the ground truths of real STR examples from the ICDAR (IC), SVT, and IIIT datasets. MJ also takes its textures and color map from IC03 and SVT. In contrast, ST does not use ground-truth information except for the color map from IIIT. SynthTIGER utilizes a lexicon consisting of the texts of the MJ and ST datasets and uses the same textures and color map as ST. The engines use different numbers of fonts that are available from Google Fonts (https://fonts.google.com/). Common* in the table uses another lexicon, from Wikipedia, to evaluate all synthesis engines without ground-truth information from real STR test examples and test sets, except for the color map. For our Japanese STR tasks, we utilize a Japanese lexicon (84M) from Wikipedia and Twitter, 382 fonts, a texture set, and a color map.

Engine       Lexicon                                          Font          Texture                      Color map
MJ [5]       Hunspell + test-sets of IC, SVT, IIIT (90K×3)    1,400 fonts   IC03, SVT train-set (358)    IC03 train-set
ST [2]       Newsgroup20 (366K)                               1,200 fonts   Crawling (8,010)             IIIT word dataset
SynthTIGER   MJ + ST (197K×3)                                 3,568 fonts   same as ST                   same as ST
Common*      Wikipedia (19M×3)
Table 1: Resources used to build synthetic datasets of Latin texts. Common* indicates a setting for apples-to-apples comparison between synthesis engines. “×3” indicates text augmentation for capitalized, upper-cased, and lower-cased words.

4.1.2 Experimental Settings for Training and Evaluating STR Models

In this paper, we evaluate synthetic datasets by training an STR model with them and evaluating the trained model on real STR examples. We choose BEST [1] as our base model since it is widely used and its implementation is publicly available. All synthetic datasets built for our experiments consist of 10M word box images. The public datasets, MJ and ST, contain 8.9M and 7M word box images, respectively, and they are evaluated with the same process.

The BEST model is trained only with synthetic datasets. Training and evaluation are conducted with the STR test-bed (https://github.com/clovaai/deep-text-recognition-benchmark). Most experimental settings follow the training protocol of Baek et al. [1], except for the input image size of 32 by 256.

The evaluation protocol is also the same as in [1]. Specifically, we test two STR scenarios depending on the language: one is Latin and the other is Japanese. For the Latin case, the character vocabulary consists of 94 characters, including both alphanumeric and special characters. STR models are evaluated on the test sets of STR benchmarks: 3,000 images of IIIT5k [14], 647 images of SVT [21], 1,110 images of IC03 [13], 1,095 images of IC13 [7], 2,077 images of IC15 [6], 645 images of SVTP [17], and 288 images of CUTE80 [18]. We also test performance on business documents with 38,493 in-house images. We only evaluate on alphabets and digits due to inconsistent labels in the benchmark datasets. For the Japanese case, the vocabulary consists of 6,723 characters, including alphanumeric, special, hiragana/katakana, and some Chinese characters. The evaluation is conducted on our in-house datasets: 40,938 scene images and 38,059 images of Japanese business documents.

4.2 Comparison on Synthetic Text Data

Dataset              Regular                       Irregular               Total
                     IIIT5k  SVT    IC03   IC13    IC15   SVTP   CUTE80
MJ [5]               83.4    84.5   85.6   83.5    66.0   73.0   64.6      78.3
ST [2]               86.1    82.5   90.7   89.8    64.5   69.1   60.1      79.7
MJ [5] + ST [2]      90.9    87.2   92.1   91.2    72.9   77.8   73.6      85.1
SynthTIGER (Ours)    93.2    87.3   90.5   92.9    72.1   77.7   80.6      85.9
Table 2: Benchmark performances of BEST [1] trained on synthetic text images. The amounts of MJ, ST, MJ+ST, and our data are 8.9M, 7M, 15.9M, and 10M, respectively.

Table 2 compares STR performances of BEST [1] models trained with synthetic text images from MJ, ST, and SynthTIGER. As reported in previous works, the combination of MJ and ST shows better performances than their single usages. SynthTIGER always provides a better STR performance than the single usages of MJ and ST. Interestingly, ours achieves comparable or better performance than combined data (MJ+ST). It should be noted that the amount of combined training data is 1.5 times larger than ours.
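The text does not define how the Total column of Table 2 is computed. The reported values appear consistent with an average of the per-benchmark accuracies weighted by the number of test images given in Section 4.1.2; the check below, for the MJ and SynthTIGER rows, is offered under that assumption, and the names `TEST_SET_SIZES` and `weighted_total` are hypothetical.

```python
# Test-set sizes as listed in Section 4.1.2.
TEST_SET_SIZES = {
    "IIIT5k": 3000, "SVT": 647, "IC03": 1110, "IC13": 1095,
    "IC15": 2077, "SVTP": 645, "CUTE80": 288,
}

def weighted_total(accuracies, sizes=TEST_SET_SIZES):
    """Average the per-benchmark accuracies, weighted by the number of test images."""
    total_images = sum(sizes[name] for name in accuracies)
    return sum(accuracies[name] * sizes[name] for name in accuracies) / total_images

# Rows copied from Table 2.
mj = {"IIIT5k": 83.4, "SVT": 84.5, "IC03": 85.6, "IC13": 83.5,
      "IC15": 66.0, "SVTP": 73.0, "CUTE80": 64.6}
synthtiger = {"IIIT5k": 93.2, "SVT": 87.3, "IC03": 90.5, "IC13": 92.9,
              "IC15": 72.1, "SVTP": 77.7, "CUTE80": 80.6}

print(round(weighted_total(mj), 1), round(weighted_total(synthtiger), 1))
```

Under this weighting, large benchmarks such as IIIT5k and IC15 dominate the total, so a gain on CUTE80 (288 images) moves it far less than the same gain on IIIT5k.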

Dataset                Regular                       Irregular               Total
                       IIIT5k  SVT    IC03   IC13    IC15   SVTP   CUTE80
MJ*                    87.1    81.9   83.6   86.1    62.9   69.8   57.3      78.3
ST*                    79.0    80.4   76.8   79.5    59.6   66.2   59.4      72.8
MJ* + ST*              89.5    83.9   87.0   89.3    68.1   74.6   72.9      82.1
SynthTIGER* (Ours*)    89.8    84.5   84.2   87.9    69.5   73.8   74.0      82.1
Table 3: Benchmark performances of BEST [1] trained on data from synthetic generators using the same resources. The total amounts of MJ*, ST*, MJ*+ST*, and Ours* are all 10M.

4.3 Comparison on Synthetic Text Image Generators with Same Resources

Since the outputs of the engines depend on resources such as fonts, textures, color maps, and a lexicon, we provide fair comparisons, referred to as ‘*’, by using the same resources in Table 3. To present a fair comparison, we set the total amount of data in each setting to 10M. For example, the total amount of MJ*+ST* is 10M, and the other settings, MJ*, ST*, and Ours*, are also 10M each. In Table 3, ours shows a clear improvement over MJ* and ST* used alone. Ours also has performance comparable to the combined dataset.

We collect some examples from the test benchmarks where ours provides correct predictions. In Fig. 3, ours gives robust predictions when the text images contain high-frequency noise such as lines, complex backgrounds, and parts of other characters. Although the combination of many functions could contribute to the correct predictions, we believe that employing the proposed mid-ground also results in robust performance. We describe the effects of the mid-ground further in the ablative study (S4.4).

Dataset      Latin               Japanese
             Scene   Document    Scene   Document
MJ* + ST*    82.1    82.5        60.0    83.8
Ours*        82.1    85.6        60.1    86.8
Table 4: Latin and Japanese recognition performance on scene and document images.
Figure 3: Cases correctly predicted by ours. The predictions are positioned on the right side of the images. The first row shows the predictions of MJ*+ST*; the second row shows the predictions of Ours* without the mid-ground text (“(-) Mid-ground text” in the ablative study); the third row shows the predictions of Ours*.

We extend the experiments to Japanese to compare the language generalization performance of ours and the previous generators. Moreover, we report performance on document images to confirm extensibility to another domain. In these experiments, all engines share the same resources because MJ and ST do not provide Japanese resources. Table 4 shows that ours achieves comparable or better performance than the combined datasets. Specifically, ours achieves much better performance on document images, with improvements of about 3.0 pp. Since document images usually contain high-frequency noise, such as scan noise and parts of characters from other lines or paragraphs, we believe the use of the mid-ground helps cope with this noise.

4.4 Ablative Studies on Rendering Functions

We investigate the effect of each rendering function on STR performance by excluding the functions one at a time. Although we neither propose nor optimize all of the rendering functions, we believe these ablative studies are significant for the STR field, because detailed investigations of rendering functions have not been reported and the results can guide subsequent research on data generation. To give an intuitive understanding of each function, we present visual examples of the rendering functions in Fig. 4. As presented in Table 5, texture blending, transformation, margin, and post-processing critically impact the performance. Moreover, the proposed mid-ground text improves results on both regular and irregular benchmarks. Most of the other rendering functions also contribute to STR performance.

To show the effects of the proposed mid-ground, we present some examples in Fig. 3 where baseline (Ours*) can correctly predict the results, but “(-)Mid-ground text” cannot. These figures show that the proposed mid-ground can help to handle more diverse real-world scenes that could degrade the recognition performances.

Figure 4: Rendering effect visualization of SynthTIGER modules. The images on the top show the cases when the effects are applied. The bottom images represent the cases when the effects are off. All functions are essential to reveal the realness of the synthetic text images.
Setting                    Regular        Irregular       Total
Baseline                   87.8           70.9            82.1
(-) Curved text            87.2 (-0.6)    70.2 (-0.7)     81.4 (-0.7)
(-) Elastic distortion     87.7 (-0.1)    68.4 (-2.5)     81.1 (-1.0)
(-) Color map              87.6 (-0.2)    70.7 (-0.2)     81.9 (-0.2)
(-) Texture blending       84.5 (-3.3)    66.0 (-4.9)     78.2 (-3.9)
(-) Text effect            87.3 (-0.5)    67.7 (-3.2)     80.6 (-1.5)
(-) Transformation         87.3 (-0.5)    64.4 (-6.5)     79.5 (-2.6)
(-) Margin                 87.1 (-0.7)    66.7 (-4.2)     80.2 (-1.9)
(-) Mid-ground text        87.4 (-0.4)    69.6 (-1.3)     81.4 (-0.7)
(-) Blending modes         88.0 (+0.2)    70.2 (-0.7)     81.9 (-0.2)
(-) Visibility check       87.1 (-0.7)    71.3 (+0.4)     81.8 (-0.3)
(-) Post-processing        86.1 (-1.7)    59.0 (-11.9)    76.9 (-5.2)
Table 5: The performance of rendering functions. (-) indicates exclusion from the baseline. The color map exclusion means using random color selection, and the blending modes exclusion means using only “normal” blending.
Figure 5: (a) Length distributions: text length distribution of the training data without augmentation (dashed), with augmentation (red), and of the evaluation data (blue). (b) STR performances: accuracy by text length for the model without length augmentation (dashed), the model with length augmentation applied at 50% (red), and the model with length augmentation optimized to the length distribution of the evaluation data (blue). The blue line drops sharply for long texts because texts longer than 15 characters rarely exist in the evaluation dataset.
Probability    0%      25%     50%     75%     100%    Optimized
Accuracy       82.1    83.9    84.2    82.7    82.0    84.9
Table 6: STR accuracy according to the probability of length distribution augmentation. “Optimized” indicates that the length augmentation is optimized to the length distribution of the evaluation data.

4.5 Experiments on Text Selection

4.5.1 Experiments on Text Length Distribution Control

As presented in Fig. 5(a), we found that short and long texts in the Latin training data are insufficient to cover real-world texts, and that the length distributions of the training and evaluation data are quite different. To alleviate these problems, we apply text length distribution augmentation to Latin, so that the augmented distribution covers a wide range of text lengths, as shown by the red graph. We also present the STR performance for different augmentation probabilities in Table 6. The best performance is achieved when the probability is set to 50%, and it is comparable with “Optimized”, which is regarded as an upper bound. Specifically, “Optimized” makes the length distribution of the training data similar to that of the evaluation data by controlling the target length. Fig. 5(b) shows that the proposed augmentation prevents critical performance degradation for very short and very long texts.
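The “Optimized” setting can be read as drawing the target length from the empirical length distribution of the evaluation texts instead of uniformly. The sketch below illustrates that reading; it is an interpretation rather than the authors' procedure, and the names `fit_length_distribution` and `sample_target_length` are hypothetical.

```python
import random
from collections import Counter

def fit_length_distribution(eval_texts):
    """Return the observed lengths and their counts, to be used as sampling weights."""
    counts = Counter(len(text) for text in eval_texts)
    lengths = sorted(counts)
    weights = [counts[n] for n in lengths]
    return lengths, weights

def sample_target_length(lengths, weights, rng=None):
    """Draw a target length in proportion to how often it occurs in the evaluation data."""
    rng = rng or random.Random()
    return rng.choices(lengths, weights=weights, k=1)[0]
```

Because this setting uses statistics of the evaluation data, it serves as an upper bound for comparison rather than as a practical training recipe.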

4.5.2 Experiments on Character Distribution Control

As presented in Fig. 6(a), the Japanese character set, which is composed of thousands of characters, suffers from an unbalanced long-tail distribution. However, characters that are under-represented in the training data appear frequently in the evaluation data. For example, words containing ¥, the currency sign circled in red in the figure, are rare in the training data but frequent in the evaluation data. To relieve this problem, we apply a character augmentation method that guarantees a minimum number of examples including rare characters, shown as the red histogram in Fig. 6(a). Table 7 reports the text recognition performance as the probability of applying the augmentation changes. At a probability of 50%, the character distribution augmentation improves scene accuracy from 60.1 to 62.6 and document accuracy from 86.8 to 87.8. Fig. 6(b) shows that the proposed augmentation helps recognize words that include rare characters.
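A character-level counterpart of this augmentation can be sketched in the same way. Again, this is only an illustrative sketch under stated assumptions, not the authors' code: the names `build_char_index` and `sample_word_by_char` are hypothetical, and the augmented branch is assumed to pick a target character uniformly from the character set and then a word containing it.

```python
import random
from collections import defaultdict


def build_char_index(vocab):
    """Map each character to the words that contain it (hypothetical helper)."""
    index = defaultdict(list)
    for word in vocab:
        # dict.fromkeys removes duplicate characters while keeping order,
        # so each word is listed once per distinct character.
        for ch in dict.fromkeys(word):
            index[ch].append(word)
    return dict(index)


def sample_word_by_char(vocab, char_index, prob, rng):
    """Sample one word; with probability `prob`, use character augmentation.

    Assumption: the augmented branch picks a target character uniformly,
    so a rare character is chosen as often as a common one.
    """
    if rng.random() < prob:
        target = rng.choice(list(char_index))
        return rng.choice(char_index[target])
    # Default branch: sample from the vocabulary as-is.
    return rng.choice(vocab)


if __name__ == "__main__":
    vocab = ["100"] * 99 + ["¥100"]
    index = build_char_index(vocab)
    rng = random.Random(0)
    words = [sample_word_by_char(vocab, index, prob=0.5, rng=rng) for _ in range(1000)]
    print(sum("¥" in w for w in words), "of 1000 samples contain the rare character")
```

Because the target character is drawn uniformly rather than in proportion to its corpus frequency, words containing a rare sign such as ¥ are sampled far more often than in the unaugmented corpus.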

(a) Character distributions
(b) STR performances
Figure 6: (a) The black dashed line shows the relationship between a specific Japanese character (X-axis) and the number of words involving that character (Y-axis) in the training data. Characters on the X-axis are sorted in descending order by their number of occurrences. The red and blue histograms show the same relationship for the training data with character distribution augmentation and for the evaluation data, respectively. (b) The red and gray histograms indicate the STR accuracy (Y-axis) for a subset of the text vocabulary (X-axis). Each X-axis value is a range (e.g., 0-1k) that stands for the set of words involving a character whose number of occurrences in the training data falls within that range.
Probability 0% 25% 50% 75% 100%
Scene 60.1 61.9 62.6 59.4 54.5
Document 86.8 88.1 87.8 87.2 83.2
Table 7: Recognition accuracy for different probabilities of applying character distribution augmentation.

5 Conclusion

Synthesizing text images is essential for learning a general STR model because it simulates the diversity of texts in the real world. However, there has been no guideline for the synthesis process. This paper addresses this issue by introducing a new synthesis engine for STR, SynthTIGER. Used alone, SynthTIGER shows better or comparable performance compared with existing synthetic datasets, and its rendering functions are evaluated under a fair comparison. SynthTIGER also addresses biases in the text distribution of synthetic datasets by providing two text selection methods, one over lengths and one over characters. Our experiments on rendering methods and text distributions show that controlling the text styles and text distributions of a synthetic dataset helps in learning more generalizable STR models. Finally, this paper contributes to the OCR community by providing an open-source synthesis engine and a new synthetic dataset.

References

  • [1] Baek, J., Kim, G., Lee, J., Park, S., Han, D., Yun, S., Oh, S.J., Lee, H.: What is wrong with scene text recognition model comparisons? dataset and model analysis. In: International Conference on Computer Vision (ICCV) (2019)
  • [2] Gupta, A., Vedaldi, A., Zisserman, A.: Synthetic data for text localisation in natural images. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2315–2324 (2016)
  • [3] Hwang, W., Kim, S., Seo, M., Yim, J., Park, S., Park, S., Lee, J., Lee, B., Lee, H.: Post-ocr parsing: building simple and robust parser via bio tagging. In: Workshop on Document Intelligence at NeurIPS 2019 (2019)
  • [4] Hwang, W., Yim, J., Park, S., Yang, S., Seo, M.: Spatial dependency parsing for 2d document understanding. arXiv preprint arXiv:2005.00642 (2020)
  • [5] Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Synthetic data and artificial neural networks for natural scene text recognition. In: Workshop on Deep Learning, NIPS (2014)
  • [6] Karatzas, D., Gomez-Bigorda, L., Nicolaou, A., Ghosh, S., Bagdanov, A., Iwamura, M., Matas, J., Neumann, L., Chandrasekhar, V.R., Lu, S., et al.: Icdar 2015 competition on robust reading. In: ICDAR. pp. 1156–1160 (2015)
  • [7] Karatzas, D., Shafait, F., Uchida, S., Iwamura, M., i Bigorda, L.G., Mestre, S.R., Mas, J., Mota, D.F., Almazan, J.A., De Las Heras, L.P.: Icdar 2013 robust reading competition. In: ICDAR. pp. 1484–1493 (2013)
  • [8] Liao, M., Song, B., Long, S., He, M., Yao, C., Bai, X.: Synthtext3d: synthesizing scene text images from 3d virtual worlds. Science China Information Sciences 63(2), 1–14 (2020)
  • [9] Limonova, E., Bezmaternykh, P., Nikolaev, D., Arlazarov, V.: Slant rectification in russian passport ocr system using fast hough transform. In: Ninth International Conference on Machine Vision (ICMV 2016). vol. 10341, p. 103410P. International Society for Optics and Photonics (2017)
  • [10] Liu, X., Meng, G., Pan, C.: Scene text detection and recognition with advances in deep learning: a survey. International Journal on Document Analysis and Recognition (IJDAR) 22(2), 143–162 (2019)
  • [11] Long, S., He, X., Yao, C.: Scene text detection and recognition: The deep learning era. International Journal of Computer Vision 129(1), 161–184 (2021)
  • [12] Long, S., Yao, C.: Unrealtext: Synthesizing realistic scene text images from the unreal world. arXiv preprint arXiv:2003.10608 (2020)
  • [13] Lucas, S.M., Panaretos, A., Sosa, L., Tang, A., Wong, S., Young, R.: Icdar 2003 robust reading competitions. In: ICDAR. pp. 682–687 (2003)
  • [14] Mishra, A., Alahari, K., Jawahar, C.: Scene text recognition using higher order language priors. In: BMVC (2012)
  • [15] Motahari, H., Duffy, N., Bennett, P., Bedrax-Weiss, T.: A report on the first workshop on document intelligence (di) at neurips 2019. ACM SIGKDD Explorations Newsletter 22(2), 8–11 (2021)
  • [16] Patel, C., Shah, D., Patel, A.: Automatic number plate recognition system (anpr): A survey. International Journal of Computer Applications 69(9) (2013)
  • [17] Phan, T.Q., Shivakumara, P., Tian, S., Tan, C.L.: Recognizing text with perspective distortion in natural scenes. In: ICCV. pp. 569–576 (2013)
  • [18] Risnumawan, A., Shivakumara, P., Chan, C.S., Tan, C.L.: A robust arbitrary text detection system for natural scene images. In: ESWA. vol. 41, pp. 8027–8048 (2014)
  • [19] Shi, B., Yang, M., Wang, X., Lyu, P., Yao, C., Bai, X.: Aster: An attentional scene text recognizer with flexible rectification. IEEE transactions on pattern analysis and machine intelligence 41(9), 2035–2048 (2018)
  • [20] Simard, P.Y., Steinkraus, D., Platt, J.C., et al.: Best practices for convolutional neural networks applied to visual document analysis. In: ICDAR. vol. 3. Citeseer (2003)
  • [21] Wang, K., Babenko, B., Belongie, S.: End-to-end scene text recognition. In: ICCV. pp. 1457–1464 (2011)