
1st Place Solution to ECCV 2022 Challenge on Out of Vocabulary Scene Text Understanding: End-to-End Recognition of Out of Vocabulary Words

by Zhangzi Zhu, et al.

Scene text recognition has attracted increasing interest in recent years due to its wide range of applications in multilingual translation, autonomous driving, etc. In this report, we describe our solution to the Out of Vocabulary Scene Text Understanding (OOV-ST) Challenge, which aims to extract out-of-vocabulary (OOV) words from natural scene images. Our oCLIP-based model achieves an h-mean of 28.59%, which ranks 1st in the end-to-end OOV word recognition track of the OOV Challenge at the ECCV 2022 TiE Workshop.





1 Introduction

Recently, many scene text recognition techniques have been proposed that learn language knowledge to better recognize in-vocabulary (IV) words (i.e., words that appear in the training set). However, the language information of out-of-vocabulary (OOV) words is usually difficult to learn because these words are never seen during training, which makes it difficult for a model to recognize OOV words accurately.

The ECCV 2022 Challenge on Out of Vocabulary Scene Text Understanding, held together with the ECCV 2022 workshop on Text in Everything (TiE), aims to evaluate model performance on recognizing OOV words. In this challenge, the training, validation and test sets are composed of several commonly used datasets, including ICDAR13 [karatzas2013icdar], ICDAR15 [karatzas2015icdar], MLT19 [nayef2019icdar2019], COCO-Text [veit2016coco], TextOCR [singh2021textocr], HierText [long2022towards], and OpenImagesText [krylov2021open]. Two evaluation metrics are provided: (1) a metric on OOV words only, which evaluates model performance on recognizing OOV words; and (2) a metric on both IV and OOV words, obtained by averaging the IV and OOV scores.

In this report, we present our solution to the end-to-end OOV word recognition task. We first pre-train several commonly used network backbones with oCLIP [xue2022language]. We then fine-tune PAN [wang2019efficient], Mask TextSpotter-v3 (MTS-v3) [liao2020mask] and TESTR [Zhang_2022_CVPR] on the composed datasets for word detection. Finally, we recognize the detected words with a SCATTER-based [litman2020scatter] recognizer. Our method achieves an h-mean of 28.59%, which ranks 1st in the end-to-end recognition track.

2 Methods and Experimental Results

2.1 Text Detection

In our solution, we first pre-train different backbones, including VAN-large [guo2022visual] and Deformable ResNet-101 [dai2017deformable], with oCLIP [xue2022language]. Next, we fine-tune PAN [wang2019efficient], MTS-v3 [liao2020mask] and TESTR [Zhang_2022_CVPR] on the composed datasets, starting from the pre-trained models. Finally, we combine the detection results of the different models and recognize the words following [zhu2022oovrec]. All models are evaluated on the validation set of the composed dataset.

2.1.1 Model Pre-training

We first pre-train VAN-large [guo2022visual] and Deformable ResNet-101 [dai2017deformable] with oCLIP [xue2022language] on the SynthText [Gupta16] dataset as well as the provided composed dataset. Next, we fine-tune PAN [wang2019efficient] (with VAN-large as backbone), MTS-v3 [liao2020mask] (with Deformable ResNet-101 as backbone), and TESTR [Zhang_2022_CVPR] (with Deformable ResNet-101 as backbone) on the composed dataset, initialized with the pre-trained backbone weights. Pre-training with oCLIP improves the Fscore of the different models by roughly 0.5 to 3.7 points, as shown in Table 1.

| Method | OOV Precision | OOV Recall | OOV Fscore | All Precision | All Recall | All Fscore |
|---|---|---|---|---|---|---|
| PAN | 65.36 | 68.71 | 67.00 | 83.36 | 56.18 | 67.21 |
| PAN+oCLIP | 64.03 | 73.11 | 68.27 (+1.27) | 83.37 | 61.64 | 70.88 (+3.67) |
| MTS-v3 | 77.13 | 48.31 | 59.41 | 87.61 | 42.16 | 56.93 |
| MTS-v3+oCLIP | 77.55 | 48.83 | 59.93 (+0.52) | 87.73 | 43.09 | 57.87 (+0.94) |
| TESTR | 69.55 | 55.12 | 61.50 | 84.75 | 52.34 | 64.71 |
| TESTR+oCLIP | 71.47 | 56.22 | 62.93 (+1.43) | 85.93 | 55.83 | 65.73 (+1.02) |

Table 1: Text detection results by adopting oCLIP [xue2022language] for backbone pre-training.

2.1.2 Model Ensemble

Next, we collect the detection results of the different models, each run on input images at several scales (i.e., 512, 960, 1280, 1600), and combine them. We then apply Soft-NMS [bodla2017soft] to the combined results and filter the detected boxes with a threshold of 0.92. Table 2 shows the model ensemble results.
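The ensemble step described above can be sketched as follows. This is a minimal illustration under several assumptions, not the authors' implementation: detections are simplified to axis-aligned boxes `(x1, y1, x2, y2)` (the actual detectors may output polygons), each run is represented by a hypothetical `(scale_factor, detections)` pair used to map boxes back to original image coordinates, the Gaussian variant of Soft-NMS with `sigma=0.5` is used, and the 0.92 threshold is assumed to apply to the scores after Soft-NMS decay. The function names `rescale`, `soft_nms` and `ensemble` are illustrative only.

```python
import math


def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def rescale(box, scale):
    """Map a box predicted on a resized image back to original coordinates."""
    return tuple(v / scale for v in box)


def soft_nms(dets, sigma=0.5):
    """Gaussian Soft-NMS over a list of (box, score) pairs.

    Instead of deleting overlapping boxes, the score of each remaining box
    is decayed by exp(-IoU^2 / sigma) relative to the box just selected.
    """
    remaining = list(dets)
    kept = []
    while remaining:
        best = max(range(len(remaining)), key=lambda i: remaining[i][1])
        best_box, best_score = remaining.pop(best)
        kept.append((best_box, best_score))
        remaining = [
            (box, score * math.exp(-(iou(best_box, box) ** 2) / sigma))
            for box, score in remaining
        ]
    return kept


def ensemble(runs, score_thresh=0.92, sigma=0.5):
    """Pool detections from all runs, apply Soft-NMS, then threshold.

    runs: list of (scale_factor, [(box, score), ...]), one entry per
    model and input scale (hypothetical data layout).
    """
    pooled = [
        (rescale(box, scale), score)
        for scale, dets in runs
        for box, score in dets
    ]
    return [(b, s) for b, s in soft_nms(pooled, sigma) if s >= score_thresh]


# Example: the same word found by two runs, plus one word found only once.
runs = [
    (1.0, [((10, 10, 110, 50), 0.99)]),
    (2.0, [((20, 20, 220, 100), 0.95), ((400, 400, 600, 480), 0.93)]),
]
result = ensemble(runs)
```

In this example the duplicate detection from the second run overlaps the first one completely, so its score is decayed well below the threshold and it is dropped, while the non-overlapping box keeps its original score.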

| Method | OOV Precision | OOV Recall | OOV Fscore |
|---|---|---|---|
| PAN | 64.03 | 73.11 | 68.27 |
| MTS-v3 | 77.55 | 48.83 | 59.93 |
| TESTR | 71.47 | 56.22 | 62.93 |
| Ensemble | 69.85 | 76.20 | 72.89 |

Table 2: Text detection results by model ensemble.

| Split | Precision | Recall | Fscore |
|---|---|---|---|
| Validation | 41.08 | 41.73 | 41.40 |
| Test | 20.28 | 48.42 | 28.59 |

Table 3: End-to-end recognition results on validation and test set, respectively.

2.2 End-to-End Word Recognition

We pass the detected text regions to our recognition model [zhu2022oovrec] and discard the words that are recognized as ‘ignore’ text to obtain the final recognition results. Table 3 shows the end-to-end recognition results of our models.
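The Fscore column of Table 3 corresponds to the h-mean quoted for the final ranking. As a quick check, assuming the h-mean is the usual harmonic mean of precision and recall, the reported values can be reproduced from the precision and recall columns. The helper name `h_mean` is illustrative.

```python
def h_mean(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from Table 3.
validation = h_mean(41.08, 41.73)
test = h_mean(20.28, 48.42)
print(f"validation: {validation:.2f}, test: {test:.2f}")
# prints "validation: 41.40, test: 28.59"
```

The drop from validation to test comes mostly from precision (41.08 to 20.28), while recall is slightly higher on the test set.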

3 Conclusion

This report presents our solution to the end-to-end OOV word recognition task of the ECCV 2022 Challenge on OOV-ST. We adopt oCLIP for model pre-training and a model ensemble for better detection of text in scenes. The presented solution ranks first in the end-to-end recognition of out-of-vocabulary words in the ECCV 2022 Challenge on OOV-ST.