Advances of Scene Text Datasets

12/13/2018
by Masakazu Iwamura, et al.

This article introduces publicly available datasets in scene text detection and recognition. The information is as of 2017.


1 Introduction

Advances in pattern recognition and computer vision research are often brought about by advances in both techniques and datasets: a new technique requires a new dataset to prove its effectiveness, and a new dataset motivates researchers to develop new techniques. This is also true in the research field of scene text detection and recognition. In this field in particular, representative datasets have been provided through competitions held in conjunction with the International Conference on Document Analysis and Recognition (ICDAR) series, but various other datasets have been released as well. This article focuses on these publicly available datasets for scene text detection and recognition and gives an overview.

1.1 Roles of Datasets

The most important role of a dataset is to represent the recognition targets well, as they are (often referred to as “in the wild”). Because recognition targets vary greatly in appearance, large datasets are generally desired, and the demand for larger training datasets has grown even stronger in the era of deep learning. However, constructing a large dataset is not easy due to the large cost in labor and money, so there is a gap between the ideal and the reality. As a workaround, data synthesis has been considered a very useful and important technique; its effectiveness in scene text detection and recognition is shown in [1, 2]. However, using datasets containing synthesized data for evaluation is arguable because synthesized data are considered not to completely represent the nature of real recognition targets.

Another important role of datasets is to provide an opportunity to compare techniques fairly and easily. In this research field, the datasets provided for the ICDAR Robust Reading Competition (RRC) series and some other datasets are often used. Simply by running an experiment following the protocol and evaluation criterion defined for the selected dataset and task, a proposed method can be fairly compared with state-of-the-art methods. Hence, publicly available datasets contribute to encouraging the development of new methods.

1.2 Tasks and Evaluation

Figure 1: Tasks of scene text detection and recognition.

Four tasks are generally considered in the research field of scene text detection and recognition. See Fig. 1 for an illustration of the tasks. Typical evaluation criteria for these tasks can be found in [3, 4].

  1. Text Localization/Detection
    This task requires outputting the text regions of a given image in the form of bounding boxes, which are usually expected to be as tight as possible around the detected text. For the evaluation of static images, a standard precision and recall metric [5, 6, 7], DetEval [8] (ICDAR Robust Reading Competition “Born Digital Images” and “Focused Scene Text” use a slightly different implementation from the original one at http://liris.cnrs.fr/christian.wolf/software/deteval/; see http://rrc.cvc.uab.es/?com=faq for details) and the intersection-over-union (IoU) overlap method [9] are used; a minimal sketch of IoU-based matching is given after this list. For the evaluation of videos, CLEAR-MOT [10] and VACE [11] are used in ICDAR RRC “Text in Videos” [3, 4]. In addition, “video precision and recall” is proposed in [12].

  2. Text Segmentation
    This task requires outputting the text regions of a given image at the pixel level. For evaluation, a standard pixel-level precision and recall metric is used in [13, 14], and an atom-based metric [15] is used in ICDAR RRC “Born Digital Images” [13, 3] and “Focused Scene Text” [3].

  3. (Cropped) Word Recognition
    This task requires outputting the transcription of a given cropped word image. For evaluation, recognition accuracy and a standard edit distance metric are often used [16, 3]; letter case is sometimes ignored. A sketch of a normalized edit-distance score is also given after this list.

  4. End-to-end Recognition
    This task requires outputting the transcriptions of the text regions of a given image. The result is first evaluated in the same way as the localization task, and then wrongly recognized words are excluded [4].
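
To make the IoU-based evaluation of the localization task concrete, the following Python sketch computes the intersection over union of axis-aligned boxes and counts a detection as correct when its best overlap with a not-yet-matched ground-truth box reaches a threshold of 0.5, in the spirit of the PASCAL-style criterion [9]. This is a minimal illustrative sketch, not the official RRC evaluation code; the (x1, y1, x2, y2) box format, the greedy one-to-one matching and the 0.5 threshold are simplifying assumptions.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def detection_precision_recall(detections, ground_truths, threshold=0.5):
    """Precision and recall with greedy one-to-one matching at a fixed IoU threshold."""
    matched, true_positives = set(), 0
    for det in detections:
        # Find the best-overlapping ground-truth box that is not yet matched.
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truths):
            if idx in matched:
                continue
            overlap = iou(det, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= threshold:
            matched.add(best_idx)
            true_positives += 1
    precision = true_positives / len(detections) if detections else 0.0
    recall = true_positives / len(ground_truths) if ground_truths else 0.0
    return precision, recall
```

For instance, a single detection (10, 10, 60, 30) matched against a single ground-truth box (12, 11, 58, 29) has IoU of about 0.83, so it counts as a true positive and both precision and recall are 1.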

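Similarly, the edit-distance metric used for cropped word recognition can be sketched as a plain Levenshtein distance with optional case folding, normalized by the length of the longer string so that 0 means an exact match. The normalization and the case handling are assumptions made for illustration, not necessarily the exact scoring used in [16, 3].

```python
def edit_distance(pred, gt):
    """Levenshtein distance between a predicted and a ground-truth transcription."""
    m, n = len(pred), len(gt)
    dp = list(range(n + 1))  # dp[j] = distance between pred[:i] and gt[:j] for the current i
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            cost = 0 if pred[i - 1] == gt[j - 1] else 1
            dp[j] = min(dp[j] + 1,      # deletion
                        dp[j - 1] + 1,  # insertion
                        prev + cost)    # substitution or match
            prev = cur
    return dp[n]


def word_recognition_score(pred, gt, ignore_case=True):
    """Normalized edit distance in [0, 1]; 0 means an exact match."""
    if ignore_case:
        pred, gt = pred.lower(), gt.lower()
    if not pred and not gt:
        return 0.0
    return edit_distance(pred, gt) / max(len(pred), len(gt))
```

For example, word_recognition_score("availab1e", "available") returns 1/9, or about 0.11, reflecting a single character error.
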
| Name | #Image | #Word | Languages | Tasks | Note |
| ICDAR Robust Reading Competition / Challenge | | | | | |
| 2003 [5, 6], 2005 [7] | 529 | 2,434 | Eng. | L, R | |
| Born Digital Images 2011 [13] | 522 | 4,501 | Eng. | L, S, R | |
| Born Digital Images 2013 [3], 2015 [4] | 561 | 5,003 | Eng. | L, S, R (E added in 2015) | |
| Focused Scene Text 2011 [17] | 484 | 2,037 | Eng. | L, R | |
| Focused Scene Text 2013 [3], 2015 [4] | 462 | 2,524 | Eng. | L, S, R (E added in 2015) | |
| Text in Videos 2013 [3] | 15,277 | 93,598 | Eng., Fre., Spa. | L | Video (#WS=1,962). |
| Text in Videos 2015 [4] | 27,824 | 125,141 | Eng., Fre., Spa. | L, E | Video (#WS=3,562). |
| Incidental Scene Text 2015 [4] | 1,670 | 17,548 | Eng. | L, R, E | |
| COCO-Text 2017 [18, 19] | 63,686 | 173,589 | Eng., Ger., Fre., Spa., etc. | L, R, E | Text annotation of MS COCO Dataset [20]. |
| FSNS 2017 [21] | 1,081,422 | - | Fre. | E | Each image contains up to 4 views of a street name sign. |
| DOST 2017 [22, 23] | 32,147 | 797,919 | Jap., etc. | L, R, E | Video (#WS=22,398). 5 views in most frames. |
| MLT 2017 [24] | 18,000 | 107,547 | Ara., Ban., Chi., Eng., Fre., Ger., Ita., Jap., Kor. | L, R | Tasks also include script identification. #Word counts training and validation sets. |
| General | | | | | |
| Chars74k [25] | 74,107 | 74,107 | Eng., Kannada | R | Character image DB (natural, hand drawn and synthesised). #Word represents the number of English characters. |
| SVT [26, 16] | 349 | 904 | Eng. | L, R, E | |
| NEOCR [27] | 659 | 5,238 | Eng., Ger. | L, R | Text with various degradations (blur, perspective distortion, etc.). |
| KAIST [14] | 3,000 | 3,000 | Eng., Kor. | L, S | |
| SVHN [28] | 248,823 | 630,420 | Digits | L, R | Digit image DB. #Word represents the number of digits. |
| MSRA-TD500 [29] | 500 | 500 | Eng., Chi. | L | Text bounding boxes at various angles. |
| IIIT5K [30] | 5,000 | 5,000 | Eng. | R | Cropped word image DB. |
| YouTube Video Text [12] | 11,791 | 16,620 | Eng. | L, R | Videos from YouTube (#WS=245). |
| ICDAR2015 TRW [31] | 1,271 | 6,291 | Eng., Chi. | L, R | |
| ICDAR2017 RCTW [32] | 12,263 | 64,248 | Chi. | L, E | #Word counts the training data only. |
| Synth | | | | | |
| MJSynth [1] | 8,919,273 | 8,919,273 | Eng. | - | Synthesized cropped word image DB. |
| SynthText [2] | 800,000 | 800,000 | Eng. | - | Synthesized scene text image DB. |
Table 1: Summary of publicly available datasets. #Image is the total number of images, mostly for the detection tasks (for a video dataset, the total number of frames). #Word is the number of ground-truthed word regions. Tasks are Text Localization/Detection (L), Text Segmentation (S), Word Recognition (R) and End-to-end Recognition (E). #WS is the number of word sequences in a video dataset.
(a) ICDAR Robust Reading Competitions (RRC) Dataset in 2003 [5, 6] and 2005 [7]
(b) ICDAR RRC “Born Digital Images” (Challenge 1) Dataset in 2011 [13], 2013 [3] and 2015 [4]
(c) ICDAR RRC “Focused Scene Text” (Challenge 2) Dataset in 2011 [17], 2013 [3] and 2015 [4]
(d) ICDAR RRC “Text in Videos” (Challenge 3) Dataset in 2013 [3] and 2015 [4]
(e) ICDAR RRC “Incidental Scene Text” (Challenge 4) Dataset in 2015 [4]
(f) COCO-Text Dataset [18] / ICDAR2017 Robust Reading Challenge (RRC) on COCO-Text [19]
(g) French Street Name Signs (FSNS) Dataset [21] / ICDAR2017 Robust Reading Challenge (RRC) on End-to-End Recognition on the Google FSNS Dataset
(h) Downtown Osaka Scene Text (DOST) Dataset [22] / ICDAR 2017 Robust Reading Challenge (RRC) on Omnidirectional Video (DOST) [23]
Figure 2: Sample images of databases #1.
(a) ICDAR2017 Competition on Multi-lingual Scene Text Detection and Script Identification (MLT) dataset [24]
(b) Chars74k Dataset [25]
(c) Street View Text (SVT) Dataset [26, 16]
(d) Natural Environment OCR (NEOCR) Dataset [27]
(e) KAIST Scene Text Database [14]
(f) Street View House Numbers (SVHN) Dataset [28]
(g) MSRA Text Detection 500 (MSRA-TD500) Database [29]
(h) IIIT 5K-Word Dataset [30]
(i) YouTube Video Text (YVT) Dataset [12]
Figure 3: Sample images of databases #2.
(a) ICDAR2015 Competition on Text Reading in the Wild (TRW) Dataset [31]
(b) ICDAR2017 Competition on Reading Chinese Text in the Wild (RCTW) Dataset [32]
(c) MJSynth Dataset [1]
(d) SynthText in the Wild Dataset (SynthText) [2]
Figure 4: Sample images of databases #3.

2 Overview of Publicly Available Datasets

Publicly available datasets are summarized in Table 1, and their sample images are shown in Figs. 2-4. They comprise 21 datasets (the ICDAR Robust Reading Competition datasets "Born Digital Images" (2011-2015), "Focused Scene Text" (2011-2015) and "Text in Videos" (2013-2015) are each counted as a single dataset). Nine of them are related to the ICDAR Robust Reading Competitions (2003-2005 and 2011-2015) / Challenges (2017), ten are other general datasets (three of which focus on character, digit and cropped word images, respectively), and two are fully synthesized.

The first fully ground-truthed dataset for scene text detection and recognition tasks was provided in 2003 for the first ICDAR RRC [5, 6]. The dataset for the scene text detection task contained about 500 images captured with a variety of digital cameras, intentionally focusing on the words in the images. Keeping the same concept, the dataset was updated in 2011 [17] and 2013 [3]; these later versions are referred to as ICDAR RRC "Focused Scene Text." Although these datasets long served as the de facto standard for benchmarking, they are almost at the end of their lives. The primary reasons include their quality and size: high-quality word images are less challenging to detect and recognize, and about 500 images are too few.

To address these shortcomings, more challenging datasets have been created. The Street View Text (SVT) dataset [26, 16], released in 2010, harvests word images from Google Street View; the word images vary in appearance and are often of low resolution. The Natural Environment OCR (NEOCR) dataset [27], released in 2011, provides more challenging text images, including blurred, partially occluded, rotated and circularly laid out text. The MSRA Text Detection 500 (MSRA-TD500) database [29], released in 2012, contains text at various angles. While the datasets mentioned above contain text that was intentionally focused on during capture, the ICDAR RRC "Incidental Scene Text" dataset, released in 2015, provides images captured without intentionally focusing on the text. As a result, the images in the dataset are of low quality; they are often out of focus, blurred and of low resolution. The creation of this dataset was encouraged by improvements in imaging technology: whereas word images were previously assumed to be captured with a digital camera, capturing images with a wearable device has become realistic. The COCO-Text dataset [18, 19], released in 2016, is a text annotation of the MS COCO dataset [20], which was constructed for object recognition; hence, the text in the dataset is not intentionally focused. The Downtown Osaka Scene Text (DOST) dataset [22, 23], released in 2017, contains sequential images captured with an omnidirectional camera, which ensures that the text images are completely free from human intention. Regarding dataset size, generally speaking, more recently released datasets contain more data.

Another direction for enhancing datasets has been to handle scene text in videos (i.e., sequential images). Compared to static images, videos contain more information. For example, even if text in a single frame of a video is hard to read due to blur, we may be able to read it by watching the video for a while. This implies that more robust detection and recognition of scene text in videos can be expected by employing approaches slightly different from those used for static images. The ICDAR RRC "Text in Videos" dataset [3, 4], released in 2013 and extended in 2015, is the first dataset for scene text detection and recognition in videos. The YouTube Video Text (YVT) dataset [12], released in 2014, harvests image sequences from YouTube videos. The DOST dataset [22, 23] mentioned above is also a video dataset.

While a video is constructed by aligning static images along time, aligning static images along space yields multi-view images. The French Street Name Signs (FSNS) dataset [21], released in 2016, provides French street name signs with up to four views each. Similar to the video case, recognition performance is expected to improve by using the information contained in the multiple views. The DOST dataset [22, 23] can also be regarded as a dataset containing multi-view images.

A recent trend in datasets is to treat scene text in non-English, non-Latin and multiple languages. Back in 2011, the KAIST [14] and NEOCR [27] datasets, containing Korean and German text in addition to English, respectively, were released. The ICDAR RRC "Text in Videos" dataset [3, 4] contains French and Spanish text in addition to English. The MSRA-TD500 [29] and ICDAR2015 TRW [31] datasets contain Chinese and English. The ICDAR2017 RCTW dataset [32] contains Chinese only, and the FSNS dataset [21] contains French. The DOST dataset [22, 23] contains Japanese and English. The ICDAR2017 Competition on Multi-lingual Scene Text Detection and Script Identification (MLT) dataset [24] contains text in nine languages: Arabic, Bangla, Chinese, English, French, German, Italian, Japanese and Korean. Its tasks include "joint text detection and script identification" in addition to text detection and cropped word recognition.

Three datasets focus on character, digit and cropped word images, respectively. The Chars74k dataset [25], which focuses on character images, collects 74k English character images as well as Kannada characters. The Street View House Numbers (SVHN) dataset [28], which focuses on digit images, collects 630k digits of house numbers from Google Street View. The IIIT5K dataset [30] collects 5,000 cropped word images. In addition, although it does not treat scene text, the ICDAR RRC "Born Digital Images" dataset [13, 3, 4], released in 2011, which contains text images collected from Web and email images, is closely related.

Last but not least, synthesized datasets are expected to play very important roles. The MJSynth dataset [1], released in 2014, contains 8M cropped word images rendered by a synthetic data engine using 1,400 fonts and a variety of combinations of shadow, distortion, coloring and noise. The SynthText in the Wild dataset (SynthText) [2], released in 2016, contains 800k naturally rendered scene text images. Using these datasets, it has been shown that scene text can be detected and recognized very well even without real data for training.
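
To give a rough idea of how such synthetic data engines work, the toy Python sketch below renders a word with a random font size, random gray levels, blur and additive noise. It is only an illustrative simplification under assumed parameters (the font path is a placeholder); the actual MJSynth and SynthText pipelines are far more elaborate, with projective distortion, shadows, blending into natural images and so on.

```python
import random

import numpy as np
from PIL import Image, ImageDraw, ImageFilter, ImageFont


def synthesize_word_image(word, font_path, out_size=(100, 32)):
    """Render `word` on a plain background and apply simple degradations (toy example)."""
    font = ImageFont.truetype(font_path, size=random.randint(20, 28))
    bg = random.randint(160, 255)   # light background gray level
    fg = random.randint(0, 90)      # dark text gray level
    canvas = Image.new("L", (out_size[0] * 2, out_size[1] * 2), color=bg)
    ImageDraw.Draw(canvas).text((5, 5), word, font=font, fill=fg)
    img = canvas.resize(out_size)   # downscale to the target resolution
    img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.0, 1.2)))
    arr = np.asarray(img, dtype=np.float32)
    arr += np.random.normal(0.0, 6.0, arr.shape)   # additive sensor-like noise
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8)), word


# Example usage (the font path is a placeholder and must point to a real .ttf file):
# image, label = synthesize_word_image("text", "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf")
```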

3 Conclusion and information sources

This article gave an overview of publicly available datasets in scene text detection and recognition. Some useful information sources are as follows.

Acknowledgements

This work is partially supported by JSPS KAKENHI #17H01803.

References

  • [1] Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Synthetic data and artificial neural networks for natural scene text recognition. In: Proc. NIPS Deep Learning Workshop. (2014)
  • [2] Gupta, A., Vedaldi, A., Zisserman, A.: Synthetic data for text localisation in natural images. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition. (2016)
  • [3] Karatzas, D., Shafait, F., Uchida, S., Iwamura, M., Gomez i Bigorda, L., Mestre, S.R., Mas, J., Mota, D.F., Almazan, J.A., de las Heras, L.P.: ICDAR 2013 robust reading competition. In: Proc. International Conference on Document Analysis and Recognition. (2013) 1115–1124
  • [4] Karatzas, D., Gomez-Bigorda, L., Nicolaou, A., Ghosh, S., Bagdanov, A., Iwamura, M., Matas, J., Neumann, L., Chandrasekhar, V.R., Lu, S., Shafait, F., Uchida, S., Valveny, E.: ICDAR 2015 robust reading competition. In: Proc. International Conference on Document Analysis and Recognition. (2015) 1156–1160
  • [5] Lucas, S.M., Panaretos, A., Sosa, L., Tang, A., Wong, S., Young, R.: ICDAR 2003 robust reading competitions. In: Proc. International Conference on Document Analysis and Recognition. Volume 2. (2003) 682–687
  • [6] Lucas, S.M., Panaretos, A., Sosa, L., Tang, A., Wong, S., Young, R., Ashida, K., Nagai, H., Okamoto, M., Yamamoto, H., Miyao, H., Zhu, J., Ou, W., Wolf, C., Jolion, J.M., Todoran, L., Worring, M., Lin, X.: ICDAR 2003 robust reading competitions: Entries, results and future directions. International Journal on Document Analysis and Recognition 7(2-3) (2005) 105–122
  • [7] Lucas, S.M.: ICDAR 2005 text locating competition results. In: Proc. International Conference on Document Analysis and Recognition. Volume 1. (2005) 80–84
  • [8] Wolf, C., Jolion, J.M.: Object count/area graphs for the evaluation of object detection and segmentation algorithms. International Journal of Document Analysis and Recognition 8(4) (September 2006) 280–296
  • [9] Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C.K.I., Winn, J., Zisserman, A.: The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision 111(1) (June 2014) 98–136
  • [10] Bernardin, K., Stiefelhagen, R.: Evaluating multiple object tracking performance: The CLEAR MOT metrics. EURASIP Journal on Image and Video Processing 2008 (May 2008)
  • [11] Kasturi, R., Goldgof, D., Soundararajan, P., Manohar, V., Garofolo, J., Bowers, R., Boonstra, M., Korzhova, V., Zhang, J.: Framework for performance evaluation of face, text, and vehicle detection and tracking in video: Data, metrics, and protocol. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(2) (2009) 319–336
  • [12] Nguyen, P.X., Wang, K., Belongie, S.: Video text detection and recognition: Dataset and benchmark. In: Proc. IEEE Winter Conference on Applications of Computer Vision. (2014)
  • [13] Karatzas, D., Mestre, S.R., Mas, J., Nourbakhsh, F., Roy, P.P.: ICDAR 2011 robust reading competition challenge 1: Reading text in born-digital images (web and email). In: Proc. International Conference on Document Analysis and Recognition. (2011) 1485–1490
  • [14] Jung, J., Lee, S., Cho, M.S., Kim, J.H.: Touch TT: Scene text extractor using touchscreen interface. ETRI Journal 33(1) (2011) 78–88
  • [15] Clavelli, A., Karatzas, D., Lladós, J.: A framework for the assessment of text extraction algorithms on complex colour images. In: Proc. International Workshop on Document Analysis Systems. (2010)
  • [16] Wang, K., Babenko, B., Belongie, S.: End-to-end scene text recognition. In: Proc. International Conference on Computer Vision. (2011) 1457–1464
  • [17] Shahab, A., Shafait, F., Dengel, A.: ICDAR 2011 robust reading competition challenge 2: Reading text in scene images. In: Proc. International Conference on Document Analysis and Recognition. (2011) 1491–1496
  • [18] Veit, A., Matera, T., Neumann, L., Matas, J., Belongie, S.: COCO-Text: Dataset and benchmark for text detection and recognition in natural images. arXiv:1601.07140 [cs.CV] (2016)
  • [19] Gomez, R., Shi, B., Gomez, L., Neumann, L., Veit, A., Matas, J., Belongie, S., Karatzas, D.: ICDAR2017 robust reading challenge on COCO-Text. In: Proc. International Conference on Document Analysis and Recognition. (2017)
  • [20] Lin, T.Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C.L., Dollár, P.: Microsoft COCO: Common objects in context. arXiv:1405.0312 [cs.CV] (2014)
  • [21] Smith, R., Gu, C., Lee, D.S., Hu, H., Unnikrishnan, R., Ibarz, J., Arnoud, S., Lin, S.: End-to-end interpretation of the French Street Name Signs dataset. In: Proc. International Workshop on Robust Reading. (2016) 411–426
  • [22] Iwamura, M., Matsuda, T., Morimoto, N., Sato, H., Ikeda, Y., Kise, K.: Downtown Osaka Scene Text dataset. In: Proc. International Workshop on Robust Reading. (2016) 440–455
  • [23] Iwamura, M., Morimoto, N., Tainaka, K., Bazazian, D., Gomez, L., Karatzas, D.: ICDAR2017 robust reading challenge on omnidirectional video. In: Proc. International Conference on Document Analysis and Recognition. (2017)
  • [24] Nayef, N., Yin, F., Bizid, I., Choi, H., Feng, Y., Karatzas, D., Luo, Z., Pal, U., Rigaud, C., Chazalon, J., Khlif, W., Luqman, M.M., Burie, J.C., Liu, C.L., Ogier, J.M.: ICDAR2017 robust reading challenge on multi-lingual scene text detection and script identification – RRC-MLT. In: Proc. International Conference on Document Analysis and Recognition. (2017)
  • [25] de Campos, T.E., Babu, B.R., Varma, M.: Character recognition in natural images. In: Proc. International Conference on Computer Vision Theory and Applications. (2009)
  • [26] Wang, K., Belongie, S.: Word spotting in the wild. In: Proc. European Conference on Computer Vision: Part I. (2010) 591–604
  • [27] Nagy, R., Dicker, A., Meyer-Wegener, K.: NEOCR: A configurable dataset for natural image text recognition. In: Camera-Based Document Analysis and Recognition. Volume 7139 of Lecture Notes in Computer Science. (2012) 150–163
  • [28] Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning. In: Proc. NIPS Workshop on Deep Learning and Unsupervised Feature Learning. (2011)
  • [29] Yao, C., Bai, X., Liu, W., Ma, Y., Tu, Z.: Detecting texts of arbitrary orientations in natural images. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition. (2012) 1083–1090
  • [30] Mishra, A., Alahari, K., Jawahar, C.V.: Scene text recognition using higher order language priors. In: Proc. British Machine Vision Conference. (2012)
  • [31] Zhou, X., Zhou, S., Yao, C., Cao, Z., Yin, Q.: ICDAR 2015 text reading in the wild competition. arXiv preprint (2015)
  • [32] Shi, B., Yao, C., Liao, M., Yang, M., Xu, P., Cui, L., Belongie, S., Lu, S., Bai, X.: ICDAR2017 competition on reading Chinese text in the wild (RCTW-17). In: Proc. International Conference on Document Analysis and Recognition. (2017)