Advances in pattern recognition and computer vision research are often driven by advances in both techniques and datasets: a new technique requires a new dataset to prove its effectiveness, and a new dataset motivates researchers to develop new techniques. This is equally true in the research field of scene text detection and recognition. In this field, representative datasets have been provided through competitions held in conjunction with the International Conference on Document Analysis and Recognition (ICDAR) series, but various other datasets have also been released. This article focuses on these publicly available datasets for scene text detection and recognition and gives an overview of them.
1.1 Roles of Datasets
The most important role of datasets is to represent the recognition targets well as they are (which is often referred to as “in the wild”). Due to the wide variety of appearances of recognition targets, large datasets are generally desired, and in the era of deep learning the demand for larger training datasets is even stronger. However, constructing a large dataset is not an easy task due to the large cost in labor and money. Hence, there is a gap between the ideal and the reality. As a workaround, data synthesis has been considered a very useful and important technique. The effectiveness of data synthesis in scene text detection and recognition is shown in [1, 2]. However, the use of datasets containing synthesized data for evaluation is debatable, because synthesized data are not considered to completely represent the nature of real recognition targets.
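To give a flavor of what data synthesis involves, the following is a minimal sketch of the idea (in the spirit of [1, 2], not a reproduction of their actual engines): render a random word onto a background photo and keep its bounding box as ground truth. The font path and background file are illustrative assumptions, and a real engine additionally models geometry, lighting, blending and noise.

```python
import random
from PIL import Image, ImageDraw, ImageFont

def synthesize(background_path, word, font_path, out_path):
    """Render `word` at a random position/size and return its ground-truth box."""
    image = Image.open(background_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    font = ImageFont.truetype(font_path, size=random.randint(20, 60))
    # Random placement; a real engine also models perspective, lighting and noise.
    x = random.randint(0, max(0, image.width - 200))
    y = random.randint(0, max(0, image.height - 80))
    draw.text((x, y), word, font=font, fill=(random.randint(0, 255),) * 3)
    left, top, right, bottom = draw.textbbox((x, y), word, font=font)
    image.save(out_path)
    return (left, top, right, bottom)  # bounding-box annotation for the word
```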
Another important role of datasets is to provide an opportunity to compare techniques fairly and easily. In this research field, the datasets provided for the ICDAR Robust Reading Competition (RRC) series and some other datasets are often used. Simply by running an experiment that follows the protocol and evaluation criterion determined for the selected dataset and task, a proposed method can be fairly compared with state-of-the-art methods. Hence, publicly available datasets encourage the development of new methods.
1.2 Tasks and Evaluation
Four tasks are generally considered in the research field of scene text detection and recognition: text localization, text segmentation, word recognition and end-to-end recognition. See Fig. 1 for an illustration of the tasks. Typical evaluation criteria for the tasks can be found in [3, 4].
Text localization. This task requires outputting the text regions of a given image in the form of bounding boxes. Usually the bounding boxes are expected to be as tight as possible around the detected text. For the evaluation of static images, a standard precision and recall metric [5, 6, 7], DetEval [8] (ICDAR RRC “Born Digital Images” and “Focused Scene Text” use a slightly different implementation from the original, available at http://liris.cnrs.fr/christian.wolf/software/deteval/; see http://rrc.cvc.uab.es/?com=faq for more detail) and the intersection-over-union (IoU) overlap method [9] are used. For the evaluation of videos, CLEAR-MOT [10] and VACE [11] are used in ICDAR RRC “Text in Videos” [3, 4]. In addition, a “video precision and recall” metric is proposed in [12].
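To make the IoU-based criterion concrete, below is a minimal sketch of one-to-one box matching with the common 0.5 threshold, in the style of the PASCAL VOC protocol [9]. It is not any competition's official implementation; the function names and the greedy matching strategy are simplifying assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def precision_recall(detections, ground_truths, threshold=0.5):
    """Greedily match each detection to at most one unclaimed ground-truth box."""
    matched = set()
    tp = 0
    for det in detections:
        best_j, best_iou = None, threshold
        for j, gt in enumerate(ground_truths):
            if j in matched:
                continue
            overlap = iou(det, gt)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is not None:
            matched.add(best_j)
            tp += 1
    precision = tp / len(detections) if detections else 0.0
    recall = tp / len(ground_truths) if ground_truths else 0.0
    return precision, recall
```

For example, `precision_recall([(10, 10, 60, 30)], [(12, 11, 58, 29)])` returns `(1.0, 1.0)`, because the single detection overlaps the single ground-truth box with IoU of about 0.83, above the 0.5 threshold.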
Text segmentation. This task requires separating text from the background at the pixel level.

Word recognition. This task requires outputting the transcription of a given cropped word image.

End-to-end recognition. This task requires outputting the transcriptions of the text regions of a given image. The result is first evaluated in the same way as in the localization task, and then wrongly recognized words are excluded [3, 4].
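Continuing the sketch above (and reusing its `iou` helper), end-to-end evaluation can be illustrated as follows: a prediction counts only if its box matches a ground-truth box and the transcription also agrees. The case-insensitive comparison is a simplifying assumption, as official protocols differ in how transcriptions are normalized.

```python
def end_to_end_fscore(predictions, ground_truths, threshold=0.5):
    """predictions / ground_truths: lists of (box, text) pairs; uses iou() from above."""
    matched = set()
    correct = 0
    for box, text in predictions:
        for j, (gt_box, gt_text) in enumerate(ground_truths):
            if j in matched:
                continue
            # A match requires both sufficient box overlap and a correct transcription.
            if iou(box, gt_box) >= threshold and text.lower() == gt_text.lower():
                matched.add(j)
                correct += 1
                break
    precision = correct / len(predictions) if predictions else 0.0
    recall = correct / len(ground_truths) if ground_truths else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall > 0 else 0.0)
```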
Table 1: Publicly available datasets. In the Tasks column, L, S, R and E denote text localization, text segmentation, word recognition and end-to-end recognition, respectively; #WS denotes the number of word sequences in video datasets.

| Dataset | #Images | #Words | Languages | Tasks | Notes |
|---|---|---|---|---|---|
| **ICDAR Robust Reading Competition / Challenge** | | | | | |
| 2003 [5, 6], 2005 [7] | 529 | 2,434 | Eng. | LR | |
| Text in Videos 2013 [3] | 15,277 | 93,598 | Eng., Fre., Spa. | L | Video (#WS=1,962). |
| Text in Videos 2015 [4] | 27,824 | 125,141 | Eng., Fre., Spa. | LE | Video (#WS=3,562). |
| COCO-Text 2017 [18, 19] | 63,686 | 173,589 | Eng., Ger., Fre., Spa., etc. | LRE | Text annotation of the MS COCO dataset [20]. |
| FSNS 2017 [21] | 1,081,422 | - | Fre. | E | Each image contains up to 4 views of a street name sign. |
| DOST 2017 [22, 23] | 32,147 | 797,919 | Jap., etc. | LRE | Video (#WS=22,398). 5 views in most frames. |
| MLT 2017 [24] | 18,000 | 107,547 | Ara., Ban., Chi., Eng., Fre., Ger., Ita., Jap., Kor. | LR | Tasks also include script identification. #Words counts training and validation sets. |
| **Other datasets** | | | | | |
| Chars74k [25] | 74,107 | 74,107 | Eng., Kannada | R | Character image DB (natural, hand drawn and synthesised). #Words represents the number of English characters. |
| SVT [26, 16] | 349 | 904 | Eng. | LRE | |
| NEOCR [27] | 659 | 5,238 | Eng., Ger. | LR | Text with various degradations (blur, perspective distortion, etc.). |
| KAIST [14] | 3,000 | 3,000 | Eng., Kor. | LS | |
| SVHN [28] | 248,823 | 630,420 | Digits | LR | Digit image DB. #Words represents the number of digits. |
| MSRA-TD500 [29] | 500 | 500 | Eng., Chi. | L | Text bounding boxes are at various angles. |
| IIIT5K [30] | 5,000 | 5,000 | Eng. | R | Cropped word image DB. |
| YouTube Video Text [12] | 11,791 | 16,620 | Eng. | LR | Videos from YouTube (#WS=245). |
| ICDAR2015 TRW [31] | 1,271 | 6,291 | Eng., Chi. | LR | |
| ICDAR2017 RCTW [32] | 12,263 | 64,248 | Chi. | LE | #Words counts training data. |
| MJSynth [1] | 8,919,273 | 8,919,273 | Eng. | - | Synthesized cropped word image DB. |
| SynthText [2] | 800,000 | 800,000 | Eng. | - | Synthesized scene text image DB. |
2 Overview of Publicly Available Datasets
Publicly available datasets are summarized in Table 1, and their sample images are shown in Figs. 2–4. They comprise 21 datasets (the datasets of the ICDAR Robust Reading Competitions “Born Digital Images” (2011-2015), “Focused Scene Text” (2011-2015) and “Text in Videos” (2013-2015) are each counted as a single dataset). Nine of them are related to the ICDAR Robust Reading Competitions (2003-2005 and 2011-2015) / Challenges (2017), ten are other general datasets (of these ten, three focus on character, digit and cropped word images, respectively), and two are fully synthesized.
The first fully ground-truthed dataset for scene text detection and recognition tasks was provided in 2003 for the first ICDAR RRC [5, 6]. The dataset for the scene text detection task contained about 500 images captured with a variety of digital cameras, intentionally focusing on the words in the images. Keeping this concept, the dataset was updated in 2011 [17] and 2013 [3]; these versions are later referred to as ICDAR RRC “Focused Scene Text.” Though these datasets were long used as the de facto standard for benchmarking, they are almost at the end of their lives. The primary reasons are their quality and size: word images of high quality are less challenging to detect and recognize, and 500 images are too few.
To meet the demand for larger and more challenging datasets, new datasets have been created. The Street View Text (SVT) dataset [26, 16], released in 2010, harvests word images from Google Street View. The word images vary greatly in appearance and are often of low resolution. The Natural Environment OCR (NEOCR) dataset [27], released in 2011, provides more challenging text images, including blurred, partially occluded, rotated and circularly laid out text. The MSRA Text Detection 500 (MSRA-TD500) database [29], released in 2012, contains text at various angles. Though the datasets mentioned above contain text that the photographer intentionally focused on, the ICDAR RRC “Incidental Scene Text” dataset [4], released in 2015, provides images captured without intentional focus on text. As a result, the images in the dataset are of low quality; they are often out of focus, blurred and of low resolution. The creation of the dataset was encouraged by improvements in imaging technology: while in the past word images were assumed to be captured with a digital camera, capturing images with a wearable device has become realistic. The COCO-Text dataset [18, 19], released in 2016, is a text annotation of the MS COCO dataset [20], which was constructed for object recognition; hence, text in the dataset is not intentionally focused. The Downtown Osaka Scene Text (DOST) dataset [22, 23], released in 2017, contains sequential images captured with an omnidirectional camera, which ensures that the text images are completely free from human intention. Regarding dataset size, generally speaking, more recently released datasets contain more data.
Another direction for enhancing datasets is to handle scene text in videos (i.e., sequential images). Compared to static images, videos contain more information. For example, even if text in a single frame of a video is hard to read due to blur, it may become readable by watching it for a while. This implies that more robust detection and recognition of scene text in videos can be achieved by employing approaches slightly different from those used for static images. The ICDAR RRC “Text in Videos” dataset [3, 4], released in 2013 and extended in 2015, is the first dataset for scene text detection and recognition in videos. The YouTube Video Text (YVT) dataset [12], released in 2014, harvests image sequences from YouTube videos. The DOST dataset [22, 23] mentioned above is also a video dataset.
While a video is constructed by aligning static images in time, aligning static images in space yields multi-view images. The French Street Name Signs (FSNS) dataset [21], released in 2016, provides images of French street name signs in up to four views. Similarly to the video case, recognition performance is expected to increase by using the information contained in the multi-view images. The DOST dataset [22, 23] can also be considered a dataset containing multi-view images.
A recent trend in datasets is to treat scene text in non-English, non-Latin and multiple languages. Back in 2011, the KAIST [14] and NEOCR [27] datasets, containing Korean and German text in addition to English, respectively, were released. The ICDAR RRC “Text in Videos” dataset [3, 4] contains French and Spanish text in addition to English. The MSRA-TD500 [29] and ICDAR2015 TRW [31] datasets contain Chinese and English. The ICDAR2017 RCTW dataset [32] contains Chinese only. The FSNS dataset [21] contains French. The DOST dataset [22, 23] contains Japanese and English. The ICDAR2017 Competition on Multi-lingual Scene Text Detection and Script Identification (MLT) dataset [24] contains text in nine languages: Arabic, Bangla, Chinese, English, French, German, Italian, Japanese and Korean. Its tasks include “joint text detection and script identification” in addition to text detection and cropped word recognition.
Three datasets focus on character, digit and cropped word images, respectively. The Chars74k dataset [25], focusing on character images, collects 74k English character images as well as Kannada characters. The Street View House Numbers (SVHN) dataset [28], focusing on digit images, collects 630k digits of house numbers from Google Street View. The IIIT5K dataset [30] collects 5,000 cropped word images. In addition, while it does not treat scene text, the ICDAR RRC “Born Digital Images” dataset [13, 3, 4], released in 2011, is closely related; it contains text images collected from Web and email images.
Last but not least, synthesized datasets are expected to play very important roles. The MJSynth dataset [1], released in 2014, contains 8M cropped word images rendered by a synthetic data engine using 1,400 fonts and a variety of combinations of shadow, distortion, coloring and noise. The SynthText in the Wild (SynthText) dataset [2], released in 2016, contains 800k scene text images in which text is rendered to blend naturally into the scene. Using these datasets, it has been shown that scene text can be detected and recognized very well even without real data in training [1, 2].
3 Conclusion and Information Sources
This article gave an overview of publicly available datasets in scene text detection and recognition. Some useful information sources are as follows.
ICDAR Robust Reading Competition Portal:
The IAPR TC11 Dataset Repository:
Acknowledgment. This work is partially supported by JSPS KAKENHI #17H01803.
References

1. Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Synthetic data and artificial neural networks for natural scene text recognition. In: Proc. NIPS Deep Learning Workshop. (2014)
2. Gupta, A., Vedaldi, A., Zisserman, A.: Synthetic data for text localisation in natural images. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition. (2016)
3. Karatzas, D., Shafait, F., Uchida, S., Iwamura, M., Gomez i Bigorda, L., Mestre, S.R., Mas, J., Mota, D.F., Almazan, J.A., de las Heras, L.P.: ICDAR 2013 robust reading competition. In: Proc. International Conference on Document Analysis and Recognition. (2013) 1115–1124
4. Karatzas, D., Gomez-Bigorda, L., Nicolaou, A., Ghosh, S., Bagdanov, A., Iwamura, M., Matas, J., Neumann, L., Chandrasekhar, V.R., Lu, S., Shafait, F., Uchida, S., Valveny, E.: ICDAR 2015 robust reading competition. In: Proc. International Conference on Document Analysis and Recognition. (2015) 1156–1160
5. Lucas, S.M., Panaretos, A., Sosa, L., Tang, A., Wong, S., Young, R.: ICDAR 2003 robust reading competitions. In: Proc. International Conference on Document Analysis and Recognition. Volume 2. (2003) 682–687
6. Lucas, S.M., Panaretos, A., Sosa, L., Tang, A., Wong, S., Young, R., Ashida, K., Nagai, H., Okamoto, M., Yamamoto, H., Miyao, H., Zhu, J., Ou, W., Wolf, C., Jolion, J.M., Todoran, L., Worring, M., Lin, X.: ICDAR 2003 robust reading competitions: Entries, results and future directions. International Journal on Document Analysis and Recognition 7(2-3) (2005) 105–122
7. Lucas, S.M.: ICDAR 2005 text locating competition results. In: Proc. International Conference on Document Analysis and Recognition. Volume 1. (2005) 80–84
8. Wolf, C., Jolion, J.M.: Object count/area graphs for the evaluation of object detection and segmentation algorithms. International Journal of Document Analysis and Recognition 8(4) (September 2006) 280–296
9. Everingham, M., Eslami, S.M.A., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL visual object classes challenge: A retrospective. International Journal of Computer Vision 111(1) (June 2014) 98–136
10. Bernardin, K., Stiefelhagen, R.: Evaluating multiple object tracking performance: The CLEAR MOT metrics. EURASIP Journal on Image and Video Processing 2008 (May 2008)
11. Kasturi, R., Goldgof, D., Soundararajan, P., Manohar, V., Garofolo, J., Bowers, R., Boonstra, M., Korzhova, V., Zhang, J.: Framework for performance evaluation of face, text, and vehicle detection and tracking in video: Data, metrics, and protocol. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(2) (2009) 319–336
12. Nguyen, P.X., Wang, K., Belongie, S.: Video text detection and recognition: Dataset and benchmark. In: Proc. IEEE Winter Conference on Applications of Computer Vision. (2014)
13. Karatzas, D., Mestre, S.R., Mas, J., Nourbakhsh, F., Roy, P.P.: ICDAR 2011 robust reading competition challenge 1: Reading text in born-digital images (web and email). In: Proc. International Conference on Document Analysis and Recognition. (2011) 1485–1490
14. Jung, J., Lee, S., Cho, M.S., Kim, J.H.: Touch TT: Scene text extractor using touchscreen interface. ETRI Journal 33(1) (2011) 78–88
15. Clavelli, A., Karatzas, D., Lladós, J.: A framework for the assessment of text extraction algorithms on complex colour images. In: Proc. International Workshop on Document Analysis Systems. (2010)
16. Wang, K., Babenko, B., Belongie, S.: End-to-end scene text recognition. In: Proc. International Conference on Computer Vision. (2011) 1457–1464
17. Shahab, A., Shafait, F., Dengel, A.: ICDAR 2011 robust reading competition challenge 2: Reading text in scene images. In: Proc. International Conference on Document Analysis and Recognition. (2011) 1491–1496
18. Veit, A., Matera, T., Neumann, L., Matas, J., Belongie, S.: COCO-Text: Dataset and benchmark for text detection and recognition in natural images. arXiv:1601.07140 [cs.CV] (2016)
19. Gomez, R., Shi, B., Gomez, L., Neumann, L., Veit, A., Matas, J., Belongie, S., Karatzas, D.: ICDAR2017 robust reading challenge on COCO-Text. In: Proc. International Conference on Document Analysis and Recognition. (2017)
20. Lin, T.Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C.L., Dollár, P.: Microsoft COCO: Common objects in context. arXiv:1405.0312 [cs.CV] (2014)
21. Smith, R., Gu, C., Lee, D.S., Hu, H., Unnikrishnan, R., Ibarz, J., Arnoud, S., Lin, S.: End-to-end interpretation of the French street name signs dataset. In: Proc. International Workshop on Robust Reading. (2016) 411–426
22. Iwamura, M., Matsuda, T., Morimoto, N., Sato, H., Ikeda, Y., Kise, K.: Downtown Osaka scene text dataset. In: Proc. International Workshop on Robust Reading. (2016) 440–455
23. Iwamura, M., Morimoto, N., Tainaka, K., Bazazian, D., Gomez, L., Karatzas, D.: ICDAR2017 robust reading challenge on omnidirectional video. In: Proc. International Conference on Document Analysis and Recognition. (2017)
24. Nayef, N., Yin, F., Bizid, I., Choi, H., Feng, Y., Karatzas, D., Luo, Z., Pal, U., Rigaud, C., Chazalon, J., Khlif, W., Luqman, M.M., Burie, J.C., Liu, C.L., Ogier, J.M.: ICDAR2017 robust reading challenge on multi-lingual scene text detection and script identification - RRC-MLT. In: Proc. International Conference on Document Analysis and Recognition. (2017)
25. de Campos, T.E., Babu, B.R., Varma, M.: Character recognition in natural images. In: Proc. International Conference on Computer Vision Theory and Applications. (2009)
26. Wang, K., Belongie, S.: Word spotting in the wild. In: Proc. European Conference on Computer Vision: Part I. (2010) 591–604
27. Nagy, R., Dicker, A., Meyer-Wegener, K.: NEOCR: A configurable dataset for natural image text recognition. In: Camera-Based Document Analysis and Recognition. Volume 7139 of Lecture Notes in Computer Science. (2012) 150–163
28. Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning. In: Proc. NIPS Workshop on Deep Learning and Unsupervised Feature Learning. (2011)
29. Yao, C., Bai, X., Liu, W., Ma, Y., Tu, Z.: Detecting texts of arbitrary orientations in natural images. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition. (2012) 1083–1090
30. Mishra, A., Alahari, K., Jawahar, C.V.: Scene text recognition using higher order language priors. In: Proc. British Machine Vision Conference. (2012)
31. Zhou, X., Zhou, S., Yao, C., Cao, Z., Yin, Q.: ICDAR 2015 text reading in the wild competition. arXiv preprint (2015)
32. Shi, B., Yao, C., Liao, M., Yang, M., Xu, P., Cui, L., Belongie, S., Lu, S., Bai, X.: ICDAR2017 competition on reading Chinese text in the wild (RCTW-17). In: Proc. International Conference on Document Analysis and Recognition. (2017)