ICDAR2019 Robust Reading Challenge on Multi-lingual Scene Text Detection and Recognition -- RRC-MLT-2019

07/01/2019, by Nibal Nayef et al.

With the growing cosmopolitan culture of modern cities, the need for robust Multi-Lingual scene Text (MLT) detection and recognition systems has never been greater. With the goal of systematically benchmarking and pushing the state of the art forward, the proposed competition builds on top of RRC-MLT-2017 with an additional end-to-end task, an additional language in the real-images dataset, a large-scale multi-lingual synthetic dataset to assist the training, and a baseline end-to-end recognition method. The real dataset consists of 20,000 images containing text from 10 languages. The challenge has 4 tasks covering various aspects of multi-lingual scene text: (a) text detection, (b) cropped word script classification, (c) joint text detection and script classification and (d) end-to-end detection and recognition. In total, the competition received 60 submissions from the research and industrial communities. This paper presents the dataset, the tasks and the findings of the RRC-MLT-2019 challenge.


I Introduction and Related Work

Reading scene text in natural scene images is a key component in a diverse set of applications ranging from helping the visually impaired, to data mining of street-view-like images for information used in map services and geographic information systems. Scene text detection and recognition also finds its use in larger integrated systems such as those for autonomous driving, indoor navigation and visual search engines.

This competition is an extension of the RRC-MLT proposed in ICDAR-2017 [1], which was the first competition to offer the challenge of detecting multi-lingual scene text and identifying the different scripts of such text. The competition, named “RRC-MLT-2019”, offers the following novel aspects: 1) a new challenging task: End-to-End multi-lingual text detection and recognition; 2) a synthetic dataset that matches and complements the real one in order to provide more training data; 3) an additional language in the real dataset (Devanagari); 4) re-opening the 3 tasks of RRC-MLT-2017 on the new version of the real dataset; and 5) an End-to-End baseline method for the new recognition task.

Research on scene text detection and recognition has primarily focused on English text, which has a wide range of available datasets and well-defined benchmarks [2, 3, 4, 5]. Some other uni-lingual datasets focus on Arabic [6] or French [7]. The available datasets which could be considered multi-lingual have been built either for Indian languages only, as in [8], or they contain only 2 scripts, as in the DOST [9], ICPR-MTWI [10], MSRA-TD500 (http://www.iapr-tc11.org/mediawiki/index.php?title=MSRA_Text_Detection_500_Database_(MSRA-TD500)) and KAIST (http://www.iapr-tc11.org/mediawiki/index.php?title=KAIST_Scene_Text_Database) datasets. There exist many script identification datasets [11, 12, 13, 14] containing cropped scene text words from multiple languages; however, these have a relatively small number of images and their goal is limited to script classification in cropped images.

Unlike the above-mentioned datasets, we present a set of natural scene images containing text instances from Arabic, Bangla, Chinese, Devanagari, English, French, German, Italian, Japanese and Korean. We also provide a set of synthetically generated images with the same set of languages to assist the training. The objective of this competition is to promote the development of new methods for multi-lingual scene text understanding. The competition not only provides a large scale dataset, but also sets the evaluation protocols and standard benchmarks to promote future research.

This paper is organized as follows. Firstly, the organization of the MLT2019 challenge is outlined in Section II. The datasets used for the 4 tasks are described in Section III. Each task is then detailed in a separate section which contains the task’s description, its evaluation protocol, the list of participant methods and their obtained results (Sections IV to VII). We conclude the paper and discuss future work in Section VIII.

Due to space limitations, participant methods are listed only by their names in the tables of results, where each name is a clickable link to the details of the method (authors, affiliations, description and results). Only the winning methods for each task are described in more detail.

II MLT-2019 Challenge Organization

This challenge is comprised of four tasks related to text detection (Section IV), script classification (Section V), joint text detection and script identification (Section VI) and End-to-End text recognition (Section VII). The first three tasks have been re-opened from MLT-2017 [1] on the extended MLT-2019 dataset, while the fourth is a newly introduced task. The datasets created for this challenge can be used on their own to train participant methods. However, we have allowed the participants to use any other datasets to improve the training of their methods.

The web portal of the RRC platform (http://rrc.cvc.uab.es/) [3] was used for interacting with participants regarding the challenge information, schedule, downloads, online submissions and results viewing. Overall, we had 60 different submissions distributed as follows: 25 in Task-1, 15 in Task-2, 10 in Task-3 and 10 submissions in Task-4. Some participants submitted results for more than one task, and some participants have submitted multiple similar methods for the same task. In the cases where the submitted methods are not demonstrably different (i.e. reflecting only parameter tuning), the participants have been asked to choose one method – without knowing the results – as a final submission.

III The “RRC-MLT-2019” Datasets

We have created two datasets: 1) the real MLT-2019 dataset, which contains 20,000 real natural scene images with embedded text in 10 languages, and 2) the synthetic MLT-2019 dataset, which is prepared as an assistive training set only for Task-4. The synthetic dataset matches the scripts of the real one.

III-A The MLT-2019 Dataset of Real Images

III-A1 Type/source of images

The images of the dataset are natural scene images with embedded text, such as street signs, street advertisement boards, shop names, passing vehicles and user photos from microblogs. The images were captured using different mobile phone cameras or were collected from freely available images on the Internet. The images mainly contain intentional – i.e. focused – scene text; however, some unintentional text may appear in some images. Such text – usually very small, blurry and/or occluded – is marked to be ignored in the evaluation. We have imposed conditions on the collection of our dataset related to the type (example: natural scenes), content (example: mostly focused text) and capture conditions of the images (example: no dark images). This is to ensure – to some extent – the homogeneity of the collected images, as they have been collected by different people in different countries.

III-A2 Number of Images, Languages and Scripts

The dataset is comprised of 20,000 images containing text of 10 different languages (2,000 images per language). Most images contain text of more than one language, but each language is represented in at least 2,000 images. The ten languages are: Arabic, Bangla, Chinese, Devanagari, English, French, German, Italian, Japanese and Korean. Those languages belong to one of the following seven scripts: Arabic, Bangla, Chinese, Hindi, Japanese, Korean and Latin. An eighth script class named “Symbols” was added for characters such as + / > :) ’ . " - when they appear alone in a word (without any other alphabet characters of the languages). We also have a rare script class named “Mixed” used when characters of two or more scripts appear in the same word (without spaces). The images are divided as follows: 50% for training (a total of 10,000 images, 1,000 per language), and 50% for testing.

III-A3 Ground Truth (GT)

The text in the scene images of the dataset is annotated at word level. A GT-word is defined as a consecutive set of characters without spaces, i.e. words are separated by spaces, except in Chinese and Japanese where the text is labeled at line level. Each GT-word is labeled by a 4-corner bounding box, and is associated with a script class and a Unicode transcription of that GT-word. Some text regions in the images are not readable to the annotators due to low resolution and/or other distortions. Such regions are marked as “don’t care” and ignored in the evaluation process.
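For illustration, a GT entry could be represented and parsed as sketched below. The exact file layout is an assumption made for this example (one comma-separated line per word: eight corner coordinates, the script class, then the transcription, with a placeholder transcription marking “don’t care” regions); the downloadable GT files remain the authoritative reference.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class GTWord:
    corners: List[Tuple[int, int]]  # four (x, y) corner points of the bounding box
    script: str                     # e.g. "Arabic", "Latin", "Symbols"
    transcription: str              # Unicode transcription of the word
    dont_care: bool                 # region to be ignored in the evaluation

def parse_gt_line(line: str, dont_care_marker: str = "###") -> GTWord:
    """Parse one assumed GT line: x1,y1,x2,y2,x3,y3,x4,y4,script,transcription.
    The layout and the '###' marker are illustrative assumptions."""
    parts = line.rstrip("\n").split(",")
    coords = list(map(int, parts[:8]))
    corners = [(coords[i], coords[i + 1]) for i in range(0, 8, 2)]
    script = parts[8]
    transcription = ",".join(parts[9:])  # the transcription itself may contain commas
    return GTWord(corners, script, transcription, transcription == dont_care_marker)
```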

III-B Synthetic Multi-Language in Natural Scene Dataset

State-of-the-art scene text systems employ deep learning techniques which require a tremendous amount of labelled data. Hence, we have provided an additional synthetic dataset [15] to complement the real one for training purposes. We adapt the framework proposed by Gupta et al. [16] to a multi-language setup. The framework generates realistic images by overlaying synthetic text over existing natural background images, and it accounts for the 3D scene geometry.

Gupta et al. [16] proposed the following approach for scene-text image synthesis:

  • Text in the real world usually appears in well-defined regions, which can be characterized by uniform color and texture. Such regions are obtained by thresholding gPb-UCM contour hierarchies [17] using an efficient graph-cut implementation [18]. This gives prospective segmented regions for rendering text.

  • A dense depth map of the segmented regions is then obtained using [19], and planar facets are fitted to them using RANSAC [20]. This way, normals to the prospective regions for text rendering are estimated.

  • Finally, the text is aligned to a prospective image region for rendering. This is achieved by warping the image region to a fronto-parallel view using the estimated region normals. Then, a rectangle is fitted to this region, and the text is aligned to the larger side of this rectangle.

Note that the pipeline presented in [16] renders text character by character, which breaks the ligatures of Arabic, Bangla and Devanagari words. We have made appropriate changes to handle this issue.
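As an illustration of the plane-fitting step in the pipeline above, here is a minimal RANSAC sketch for estimating the normal of a roughly planar region from back-projected depth samples. The function, thresholds and toy data are illustrative assumptions, not the implementation used in [16].

```python
import numpy as np

def fit_plane_ransac(points, n_iters=200, inlier_thresh=0.01, seed=None):
    """Fit a plane n.x + d = 0 to an (N, 3) point cloud and return (n, d)
    for the candidate with the most inliers."""
    rng = np.random.default_rng(seed)
    best_inliers, best_model = -1, None
    for _ in range(n_iters):
        # Sample 3 distinct points and compute the plane they span.
        p0, p1, p2 = points[rng.choice(len(points), size=3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-9:              # degenerate (collinear) sample, skip it
            continue
        normal /= norm
        d = -normal @ p0
        # Count points whose distance to the candidate plane is small.
        inliers = int((np.abs(points @ normal + d) < inlier_thresh).sum())
        if inliers > best_inliers:
            best_inliers, best_model = inliers, (normal, d)
    return best_model

# Toy example: a noisy, roughly horizontal planar patch.
pts = np.random.rand(500, 3) * np.array([1.0, 1.0, 0.02])
normal, d = fit_plane_ransac(pts, seed=0)
print("estimated region normal:", normal)
```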

The generated dataset contains the same set of script classes as the real dataset: Arabic, Bangla, Chinese, Devanagari, Japanese, Korean and Latin. The Synthetic Multi-Language in Natural Scene Dataset contains text rendered over natural scene images selected from the set of background images collected by [16]. Annotations include word-level and character-level text bounding boxes along with the corresponding transcriptions and language classes. The dataset contains thousands of images for each language.

IV Task-1: Multi-Lingual Text Detection

IV-A Task-1 Description

The objective of this task is the correct localization of multi-lingual text at word level in an image. The training set consists of 10,000 scene images, where each image has a corresponding GT file that contains a list of bounding box coordinates for each text word in the image. Bounding boxes are represented by four corner points ordered clockwise. The test set has 10,000 images. For each image in the test set, participants are expected to produce a list of four-corner bounding boxes, one for each word detected in the image. This task was introduced in RRC-MLT-2017 [1], and it has been reopened in RRC-MLT-2019 on the 10-language dataset.

IV-B Evaluation Protocol for Task-1

The f-measure (Hmean) is used as the metric for ranking the participant methods. The standard f-measure is based on both the recall and precision of the detected word bounding boxes as compared to the ground truth in all the test images (the boxes are matched/processed image by image). A detection is counted as correct (true positive) if the detected bounding box has more than 50% overlap (intersection over union) with the GT box. At the image level, the evaluation procedure works as follows: let $D$ be the set of bounding boxes of the “don’t care” regions, $G$ be the set of bounding boxes in the ground truth, and $R$ be the set of bounding boxes in the results under evaluation.

First, the result bounding boxes from $R$ are matched against the “don’t care” regions set $D$ to eliminate noise. Each quadrilateral $r \in R$ is compared against each quadrilateral $d \in D$, and $r$ is discarded if the following condition is true:

$$\frac{A(r \cap d)}{A(r)} > 0.5 \qquad (1)$$

where $A(q)$ is the area of a quadrilateral $q$. Such an approach leads to some minor issues with ground truth regions overlapping with “don’t care” regions. However, only a few such cases were observed in the dataset, and there was no impact on the global evaluation of the methods. This highlights a possible improvement of the RRC evaluation methods [4] in the future.

Once the set $R$ of detected bounding boxes is filtered, the resulting filtered set $R_f$ is matched against the set of ground truth quadrilaterals $G$. A positive match is counted each time a couple of elements $(g, r)$ verifies the following condition:

$$\frac{A(g \cap r)}{A(g \cup r)} > 0.5 \qquad (2)$$

with $g \in G$ and $r \in R_f$. An extra test ensures that each element $g \in G$ and each element $r \in R_f$ can only be matched once.

At the whole test set level, the evaluation metrics are computed cumulatively from all the test images (detection results of all the images are pooled together). Extending the set of positive (relevant) matches $M$, the set of expected words $G$ and the set of filtered results $R_f$ to include all the test images, we can compute the precision, recall and f-measure as follows:

$$\text{precision} = \frac{|M|}{|R_f|}, \qquad \text{recall} = \frac{|M|}{|G|}, \qquad \text{Hmean} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \qquad (3)$$
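To make the protocol concrete, the following is a minimal sketch of the per-image filtering and matching and of the pooled metrics of Equations 1-3. It assumes quadrilaterals given as lists of four (x, y) corner points and uses shapely for polygon areas; the helper names are ours, and this is not the official RRC evaluation script.

```python
from shapely.geometry import Polygon

def filter_dont_care(results, dont_care, thresh=0.5):
    """Eq. (1): drop detections that mostly overlap a "don't care" region."""
    kept = []
    for r in results:
        pr = Polygon(r)
        if all(pr.intersection(Polygon(d)).area / pr.area <= thresh for d in dont_care):
            kept.append(r)
    return kept

def match_image(gt, results, iou_thresh=0.5):
    """Eq. (2): count one-to-one matches whose IoU exceeds the threshold."""
    matched_gt, matches = set(), 0
    for r in results:
        pr = Polygon(r)
        for i, g in enumerate(gt):
            if i in matched_gt:
                continue
            pg = Polygon(g)
            if pr.intersection(pg).area / pr.union(pg).area > iou_thresh:
                matched_gt.add(i)
                matches += 1
                break
    return matches

def corpus_metrics(per_image):
    """Eq. (3): per_image holds (matches, num_gt, num_filtered_results) per test image."""
    m = sum(x[0] for x in per_image)
    g = sum(x[1] for x in per_image)
    rf = sum(x[2] for x in per_image)
    precision = m / rf if rf else 0.0
    recall = m / g if g else 0.0
    hmean = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, hmean
```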

IV-C Participant Methods and Results for Task-1

We report here the results obtained by the participants for this task. The ranking of the participants according to Hmean is summarized in Table I. The name of each participant method in the table is a link to its online description and results.

Rank Method Hmean Precision Recall
1 Tencent-DPPR Team 83.61% 87.52% 80.05%
1 Multi-stage_Text_Detector 83.59% 87.75% 79.80%
2 NJU-ImagineLab 83.07% 87.85% 78.79%
3 PMTD [21] 82.53% 87.47% 78.12%
4 MaskRCNN 80.35% 82.64% 78.19%
5 IC_RL 80.11% 82.97% 77.44%
6 4Paradigm-Data-Intelligence 79.84% 83.44% 76.54%
7 Two-stage Text Detector based on Cascade-RCNN 78.38% 82.26% 74.85%
8 MM-MaskRCNN 76.79% 84.73% 70.21%
9 TH-DL 76.64% 84.55% 70.09%
10 SOT 74.24% 79.96% 69.28%
11 DISTILLED CRAFT 72.94% 81.22% 66.19%
12 Text-Mountain 71.95% 72.12% 71.77%
13 CRAFTS [22] 70.86% 81.42% 62.73%
14 Unicamp-SRBR-MLT2019-PELEEText 70.81% 81.58% 62.54%
15 RRPN 69.56% 77.71% 62.95%
16 Unicamp-SRBR-MLT2019-Fusion-PSENet-PELEEText 68.56% 77.00% 61.79%
17 Lomin OCR 67.65% 71.62% 64.09%
18 NXB OCR 65.96% 70.59% 61.90%
19 PSENet 65.83% 73.52% 59.59%
20 MLT2019 ETD 64.36% 78.71% 54.44%
21 CLTDR 63.53% 77.20% 53.97%
22 TP 58.01% 77.59% 46.32%
23 based on mask rcnn 49.45% 64.69% 40.02%
24 Cyberspace 47.09% 69.48% 35.61%
TABLE I: Results of the RRC-MLT-2019 Challenge for Task-1: Multi-Lingual Text Detection

Most of the participant methods – including the winning methods – are based on R-CNN (masked, cascaded, with a refinement stage, etc.). This shows that the R-CNN approach can be improved to achieve highly accurate detection. Other methods have used one or more deep nets previously applied to text detection and recognition, such as ResNet, EAST (based on FCN), RRPN and FPN, among others.

IV-C1 Winner Methods of Task-1


We have two winner methods for this task (both ranked 1). The difference in Hmean between the two methods is not significant, given the possibility of some errors in the GT. The first winner method is called “Tencent-DPPR Team”.
Authors: Longhuang Wu, Shangxuan Tian, Chang Liu, Wenjie Cai, Jiachen Li, Sicong Liu, Haoxi Li, Chunchao Guo, Hongfa Wang, Hongkai Chen, Qinglin Lu, Chun Yang, Xucheng Yin, Lei Xiao.
Affiliation: Tencent-DPPR (Data Platform Precision Recommendation) team.
Method description: The text detector follows the framework of Mask R-CNN, which employs a mask to detect multi-oriented scene texts. The text detector is trained using the RRC-MLT-2019 training set and the MSRA-TD500 dataset. A multi-scale approach is used during training. To obtain the final ensemble results, two different backbones and different multi-scale testing approaches are combined.

The other winner method is “Multi-stage Text Detector”.
Authors: Pengfei Wang, Mengyi En, Xiaoqiang Zhang, Chengqaun Zhang.
Affiliation: VIS-VAR Team at Baidu Inc. and Xidian University. Pengfei Wang carried out this work while interning at Baidu Inc.
Method description: The method relies on two stages. The first stage is a modified Mask R-CNN, where a rotated proposal module is introduced to make Mask R-CNN more suitable for detecting multi-oriented scene text. The second stage is a refinement to obtain the final detection results.

V Task-2: Cropped Word Script identification

V-A Task Description

The objective of this task is to identify the script of a cropped word image. The training and test sets of this task consist of cropped word images extracted from the full scene images of Task-1 based on the bounding boxes of the GT words. In total, there are 89,177 training images and 102,462 test images. The text in our dataset images appears in 10 different languages, some of which share the same script. Additionally, punctuation and some math symbols sometimes appear as separate words; those words are annotated as a special script class called “Symbols”. Hence, we have a total of 8 different scripts. We have excluded the words with the “Mixed” script from this task due to the very small number of samples. We have also excluded all the “don’t care” words, whether they have a recognizable script or not.

Given the test images of cropped words, participants are asked to identify the script ID of each word image file. A single script name (ID) per image is requested. The valid scripts for this task are: “Arabic”, “Bangla”, “Chinese”, “Hindi”, “Japanese”, “Korean”, “Latin” and “Symbols”. This task was introduced in RRC-MLT-2017 [1], and it has been reopened in RRC-MLT-2019 on the new dataset of 10 languages.

V-B Evaluation Protocol for Task-2

The evaluation and the ranking of results are based on classification accuracy. Participants provide a script ID for each word image, and if the result is correct, then the count of correct results is incremented. The overall accuracy of a given method is defined as follows: let $C = \{c_i\}$ be the set of correct script classes in the ground truth, and $S = \{s_i\}$ be the set of script classes returned by a given method, where $c_i$ and $s_i$ refer to the same original image. Then, the performance of a given method is expressed by:

$$\text{accuracy} = \frac{1}{|C|} \sum_{i=1}^{|C|} \mathbb{1}[s_i = c_i] \qquad (4)$$
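A minimal sketch of this accuracy computation, assuming the ground truth and the submission are given as dictionaries keyed by the cropped-word image name (an illustrative data layout, not the submission format of the RRC platform):

```python
def script_accuracy(gt_scripts: dict, pred_scripts: dict) -> float:
    """Eq. (4): fraction of cropped words whose predicted script equals the GT script."""
    correct = sum(1 for name, script in gt_scripts.items()
                  if pred_scripts.get(name) == script)
    return correct / len(gt_scripts)
```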

V-C Participant Methods and Results for Task-2

We report here the results obtained by the participants for this task. The ranking of the participants – according to script classification accuracy – is summarized in Table II.

Rank Method Accuracy
1 Tencent-DPPR Team 94.03%
2 SOT: CNN-based Classifier 91.66%
3 GSPA_HUST 91.02%
3 SCUT-DLVC-Lab 90.97%
4 TPS-ResNet [23] 90.90%
4 Conv-Transformer 90.88%
5 TH-DL 90.70%
6 TH-ML 88.85%
7 MultiScale_HUST 88.64%
8 USTC & IFLYTEK 88.54%
9 Conv_Attention 88.41%
10 Cold 87.98%
11 NXB OCR 84.88%
12 ELE-MLT based method 82.86%
13 Res_MUL_SPP_BUPT 71.31%
TABLE II: Results of the RRC-MLT-2019 Challenge for Task-2: Cropped Word Script identification

Most participants base their methods on well-known deep nets for text recognition such as ResNet, VGG16, Seq2Seq with CTC, CNN with self-attention, RNN and CNN-LSTM, adding improvements such as multi-scale techniques, attention, voting strategies for combining results from multiple nets, and training statistics of the scripts.

V-C1 Winner Method of Task-2


The winner method is called “Tencent-DPPR Team”.
Authors: Sicong Liu, Haoxi Li, Haibo Qin, Ben Xu, Chunchao Guo, Longhuang Wu, Shangxuan Tian, Hongfa Wang, Hongkai Chen, Qinglin Lu, Chun Yang, Xucheng Yin, Lei Xiao.
Affiliation: Tencent-DPPR (Data Platform Precision Recommendation) team.
Method description: In the first stage, the method recognizes text-lines and their character-level language types using the ensemble results of several recognition models which are based on Seq2Seq with CTC and CNN with self-attention & RNN. In the second stage, the language types of the recognized results are identified based on the statistics of the MLT-2019 training set and the Wikipedia corpus.

VI Task-3: Joint Text Detection and Script Identification

VI-A Task Description

The objective of this task is the correct localization of all the words in a full scene image, jointly with identifying the script ID of each localized (detected) word. The training and test sets are comprised of 10,000 images each; they are the same scene images described in Task-1. The ground truth file corresponding to an image contains the coordinates of the bounding boxes of all the words inside the image (including “don’t care” words), along with the transcription and the script ID for each word box.

Participants are required to output the list of the detected bounding boxes for each image and the script ID for each detected bounding box (word) in the list. This task was introduced in RRC-MLT-2017 [1], and it has been reopened in RRC-MLT-2019 on the new dataset of 10 languages.

VI-B Evaluation Protocol for Task-3

The evaluation of this task is a cascade of correct localization of a text box and its correct script classification. This only requires adding the check of correct script identification for a given text region to the matching condition of Equation 2:

$$\frac{A(g \cap r)}{A(g \cup r)} > 0.5 \quad \text{and} \quad \text{script}(r) = \text{script}(g) \qquad (5)$$

Except for the definition of a correct detection, Task-3 has the same ranking and evaluation protocol as Task-1. A correct detection here is counted when the box is both detected correctly and its correct script ID is identified.
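Continuing the Task-1 sketch above, only the match test changes: a detection counts as positive when the IoU condition of Equation 2 holds and the predicted script equals the GT script. A minimal sketch, under the same assumptions as before (shapely polygons, helper names of our own):

```python
from shapely.geometry import Polygon

def joint_match(gt_box, gt_script, det_box, det_script, iou_thresh=0.5):
    """Eq. (5): correct localization AND correct script identification."""
    pg, pr = Polygon(gt_box), Polygon(det_box)
    iou = pg.intersection(pr).area / pg.union(pr).area
    return iou > iou_thresh and det_script == gt_script
```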

VI-C Participant Methods and Results for Task-3

We report here the results obtained by the participants for this joint detection and classification task. The ranking of the participants’ methods – according to Hmean – is summarized in Table III.

Rank Method Hmean Precision Recall
1 Tencent-DPPR Team 80.84% 87.68% 74.99%
2 Mask_RCNN-transformer 75.12% 77.26% 73.10%
3 icdar2019_mlt_task3_test_lqj 72.13% 74.21% 70.16%
4 TH-DL 71.01% 78.34% 64.94%
5 DISTILLED CRAFT 68.69% 74.97% 63.39%
6 Cold 68.58% 77.79% 61.32%
7 CRAFTS [22] 68.34% 78.52% 60.50%
8 SOT + Classifier 65.66% 66.20% 65.13%
9 USTC & IFLYTEK: det+cls 63.14% 63.30% 62.98%
10 NXB OCR 57.74% 61.79% 54.18%
TABLE III: Results of the RRC-MLT-2019 Challenge for Task-3: Joint Text Detection and Script Identification

Almost all the methods that participated in Task-1 have also participated in Tasks 3 and 4, as the core task of detecting text words was already accomplished in Task-1. This shows that research has moved towards end-to-end approaches built on the same underlying deep learning-based methods.

VI-C1 Winner Method of Task-3


The winner method is called “Tencent-DPPR Team”, presented by the same team which also won Tasks 1 & 2. This is expected, as Task-3 is a joint of the first two tasks. Indeed, the winner method here is a cascade of the two methods presented by Tencent-DPPR for Tasks 1 & 2.
Authors: Longhuang Wu, Shangxuan Tian, Haoxi Li, Sicong Liu, Jiachen Li, Chunchao Guo, Haibo Qin, Chang Liu, Hongfa Wang, Hongkai Chen, Qinglin Lu, Chun Yang, Xucheng Yin, Lei Xiao.
Affiliation: Tencent-DPPR (Data Platform Precision Recommendation) team.
Method description: A cascade of the team’s methods for the first two tasks (see Subsections IV-C1 and V-C1).

VII Task-4: End-to-End Text Detection and Recognition

VII-A Task-4 Description

This newly introduced task is very challenging: a unified OCR for multiple languages. We present the task of end-to-end scene text detection and recognition in a multi-lingual setting, coherent with its English-only counterparts. Given an input scene image, the objective is to produce a set of bounding boxes and their corresponding transcriptions.

The training and test sets are comprised of 10,000 images each; they are the same scene images described in Task-1, with the same GT as in Task-3. The training data is unbalanced across the different languages of the real dataset (see Subsection III-A). Hence, to help with the training for this task, we provide participants with the synthetic dataset described in Subsection III-B in addition to the real dataset.

VII-B Evaluation Protocol for Task-4

The evaluation of this task is a cascade of correct localization (detection) of a text box and its correct transcription. This only requires adding the check of a matching transcription for a given text region to the matching condition of Equation 2:

$$\frac{A(g \cap r)}{A(g \cup r)} > 0.5 \quad \text{and} \quad \text{transcription}(r) = \text{transcription}(g) \qquad (6)$$

where the transcription is matched in a case-insensitive setting, and a given transcription result must exactly match the GT transcription (i.e. the edit distance between the two transcriptions is zero). Except for the definition of a correct detection, Task-4 has the same ranking and evaluation protocol as Task-1. A correct detection here is counted when the box is both detected correctly and its text is recognized correctly.

Note that the test set words which contain characters that did not appear in the training set are set as “don’t care” for both the detection and recognition. This means that whether a method detects or recognizes them correctly or not, they won’t be counted in the evaluation.
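As an illustration of this end-to-end matching rule, here is a minimal sketch, again assuming quadrilaterals given as lists of (x, y) corner points and shapely for the polygon areas; the helper names and the word representation (a dict with a "text" field) are ours, not the official evaluation code.

```python
from shapely.geometry import Polygon

def e2e_match(gt_box, gt_text, det_box, det_text, iou_thresh=0.5):
    """Eq. (6): correct localization AND case-insensitive exact transcription match."""
    pg, pr = Polygon(gt_box), Polygon(det_box)
    iou = pg.intersection(pr).area / pg.union(pr).area
    return iou > iou_thresh and det_text.lower() == gt_text.lower()

def mark_unseen_as_dont_care(gt_words, train_charset):
    """Treat GT words containing characters never seen in the training set as
    "don't care", so they are excluded from both detection and recognition scoring."""
    return [dict(word, dont_care=True)
            if any(ch not in train_charset for ch in word["text"]) else word
            for word in gt_words]
```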

VII-C Participant Methods and Results for Task-4

We report here the results obtained by the participants for this task. The ranking of the participants according to Hmean is summarized in Table IV. Note that the online results show additional evaluation metrics, including the edit-distance accuracy for the recognition part.

Rank Method Hmean Precision Recall
1 Tencent-DPPR Team & USTB-PRIR 59.15% 71.26% 50.55%
2 end2end 52.50% 55.34% 49.93%
3 CRAFTS [22] 51.74% 65.68% 42.68%
4 Mask_RCNN-transformer 51.04% 52.51% 49.64%
5 Three-stage method 40.19% 44.37% 36.73%
6 USTC & IFLYTEK 39.55% 39.71% 39.39%
7 icdar2019_mlt_test_lqj 38.75% 39.88% 37.67%
8 TH-DL 37.32% 41.22% 34.10%
9 RRPN+CLTDR 33.82% 38.62% 30.08%
10 NXB OCR 32.07% 34.37% 30.06%
E2E-MLT “Baseline” [15] 26.46% 37.44% 20.47%
TABLE IV: Results of the RRC-MLT-2019 Challenge for Task-4: End-to-End Text Detection and Recognition

For this new task, we provide a baseline method, listed in Table IV as “E2E-MLT” [15], with source code available at https://github.com/MichalBusta/E2E-MLT/blob/master/README.md. Since it has been submitted by the organizers as a baseline, we do not rank it.

Once again, we note that most of the methods have also participated in Tasks 1 & 3 and incorporated a recognition part into their networks (the recognition parts have been used, in some cases, to help with Task-2). Participant methods mostly rely on the detection and recognition deep nets mentioned in Subsection IV-C, adding to this list attention-based decoders, MORAN-v2, CRNN with CTC, and CRNN and convolutional transformers with a ResNet50 backbone. We have also noted that combining multiple nets can be effective in end-to-end tasks.

VII-C1 Winner Method of Task-4


The winner method is called “Tencent-DPPR Team & USTB-PRIR”, a collaboration between two teams. The Tencent-DPPR Team has thus won all 4 tasks of our MLT-2019 challenge; it shared the top rank with another team in Task-1 and collaborated with another team in Task-4.
Authors: Sicong Liu, Longhuang Wu, Shangxuan Tian, Haoxi Li, Chunchao Guo, Haibo Qin, Chang Liu, Hongfa Wang, Hongkai Chen, Qinglin Lu, Chun Yang, Xucheng Yin, Lei Xiao.
Affiliations: Tencent-DPPR & USTB-PRIR.
Method description: The detection part of the framework is the same as described in Subsection IV-C1, and the recognition part is the same as described in Subsection V-C1 as it was used for the script classification task of detected words.

VIII Conclusions and Future Directions

This report has summarized the organization and the findings of the multi-lingual scene text (MLT) challenge of the RRC competition. There was a total of 60 different submissions distributed over the four proposed tasks. This shows a strong interest of the community in the problem of multi-lingual scene text detection and recognition, an interest that has grown vastly since the 2017 edition of MLT.

Our work has extended the previous RRC-MLT-2017 edition in the following aspects: adding a new language (of a new script), introducing a new end-to-end task for text recognition and building a new synthetic dataset that matches the real one in terms of scripts for training purposes. All the details about the RRC-MLT-2019 challenge and its datasets are available on the RRC competition website: http://rrc.cvc.uab.es/?ch=15.

Future versions of this challenge could focus on increasing the number of languages in the dataset (of similar and also of different scripts) leading to very large-scale problems in multi-lingual scene text detection and recognition. Moreover, there is a need to design more robust evaluation protocols that can handle special appearances of text such as unfocused scene text, and also deal with sub-task evaluation for “don’t care” words in joint or end-to-end tasks. This work provides the base on which such future work could be built.

Acknowledgments

This work is partially funded by Agence Nationale de la Recherche (ANR) in France, National Natural Science Foundation of China (NSFC 61411136002) in China under the AUDINM project, and by the Visual Computing Competence Center TE01020415 of the Technology Agency of the Czech Republic.

References

  • [1] N. Nayef, F. Yin, I. Bizid, H. Choi, Y. Feng, D. Karatzas, Z. Luo, U. Pal, C. Rigaud, J. Chazalon, W. Khlif, M. M. Luqman, J.-C. Burie, C.-L. Liu, and J.-M. Ogier, “ICDAR2017 robust reading challenge on multi-lingual scene text detection and script identification - RRC-MLT,” in ICDAR, 2017.
  • [2] A. Veit, T. Matera, L. Neumann, J. Matas, and S. Belongie, “Coco-text: Dataset and benchmark for text detection and recognition in natural images,” arXiv preprint arXiv:1601.07140, 2016.
  • [3] D. Karatzas, L. G. i Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, and E. Valveny, “ICDAR 2015 competition on robust reading,” in ICDAR, 2015.
  • [4] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i. Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazàn, and L. P. de las Heras, “ICDAR 2013 robust reading competition,” in ICDAR, 2013.
  • [5] L. G. i Bigorda, A. Nicolaou, and D. Karatzas, “Improving patch-based scene text script identification with ensembles of conjoined networks,” Pattern Recognition, 2017.
  • [6] M. Jain, M. Mathew, and C. Jawahar, “Unconstrained scene text and video text recognition for arabic script,” in ASAR, 2017.
  • [7] R. Smith, C. Gu, D.-S. Lee, H. Hu, R. Unnikrishnan, J. Ibarz, S. Arnoud, and S. Lin, “End-to-end interpretation of the french street name signs dataset,” in Computer Vision – ECCV 2016 Workshops, 2016, pp. 411–426.
  • [8] M. Mathew, M. Jain, and C. V. Jawahar, “Benchmarking scene text recognition in devanagari, telugu and malayalam,” in ICDAR-MOCR Workshop, 2017.
  • [9] M. Iwamura, T. Matsuda, N. Morimoto, H. Sato, Y. Ikeda, and K. Kise, “Downtown Osaka scene text dataset,” in ECCV IWRR Workshop, 2016.
  • [10] M. He and Z. Yang. (2018) ICPR MTWI: multi-type web images. [Online]. Available: https://tianchi.aliyun.com/competition/entrance/231651/introduction
  • [11] B. Shi, X. Bai, and C. Yao, “Script identification in the wild via discriminative convolutional neural network,” Pattern Recognition, 2016.
  • [12] N. Sharma, R. Mandal, R. Sharma, U. Pal, and M. Blumenstein, “ICDAR2015 competition on video script identification (cvsi 2015),” in ICDAR, 2015.
  • [13] A. K. Singh, A. Mishra, P. Dabral, and C. V. Jawahar, “A simple and effective solution for script identification in the wild,” in DAS, 2016.
  • [14] L. G. i Bigorda and D. Karatzas, “A fine-grained approach to scene text script identification,” in DAS, 2016.
  • [15] M. Bušta, Y. Patel, and J. Matas, “E2E-MLT – an unconstrained end-to-end method for multi-language scene text,” ACCV IWRR Workshop, 2018.
  • [16] A. Gupta, A. Vedaldi, and A. Zisserman, “Synthetic data for text localisation in natural images,” in CVPR, 2016.
  • [17] P. Arbeláez, M. Maire, C. Fowlkes, and J. Malik, “Contour detection and hierarchical image segmentation,” PAMI, 2010.
  • [18] P. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik, “Multiscale combinatorial grouping,” in CVPR, 2014.
  • [19] Z. Liu, X. Li, P. Luo, C.-C. Loy, and X. Tang, “Semantic image segmentation via deep parsing network,” in ICCV, 2015.
  • [20] M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, 1981.
  • [21] J. Liu, X. Liu, J. Sheng, D. Liang, X. Li, and Q. Liu, “Pyramid mask text detector,” arXiv preprint arXiv:1903.11800, 2019.
  • [22] Y. Baek, B. Lee, D. Han, S. Yun, and H. Lee, “Character region awareness for text detection,” in CVPR, 2019.
  • [23] J. Baek, G. Kim, J. Lee, S. Park, D. Han, S. Yun, S. J. Oh, and H. Lee, “What is wrong with scene text recognition model comparisons? dataset and model analysis,” arXiv preprint arXiv:1904.01906, 2019.