1 Introduction
Recently, many scene text recognition techniques have been proposed that learn language knowledge to better recognize in-vocabulary (IV) words (i.e., words that appear in the training set). However, the language information of out-of-vocabulary (OOV) words is difficult to learn when the words have never been seen during training, which makes it hard for models to recognize OOV words accurately.
The ECCV 2022 Challenge on Out-of-Vocabulary Scene Text Understanding (OOV-ST, https://rrc.cvc.uab.es/?ch=19), held together with the ECCV 2022 Workshop on Text in Everything (TiE, https://sites.google.com/view/tie-eccv2022/challenge), aims to evaluate model performance on recognizing OOV words. In this challenge, the training, validation, and test sets are composed of several commonly used datasets, including ICDAR13 [karatzas2013icdar], ICDAR15 [karatzas2015icdar], MLT19 [nayef2019icdar2019], COCO-Text [veit2016coco], TextOCR [singh2021textocr], HierText [long2022towards], and OpenImagesText [krylov2021open]. Two evaluation metrics are provided: (1) an OOV-only metric, which evaluates model performance on recognizing OOV words; and (2) a combined metric, which averages the IV and OOV scores so that both kinds of words are considered.
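For clarity, the sketch below illustrates the word-level quantities behind these metrics, assuming exact word matching; the helper names are illustrative, and the official protocol is the one defined by the challenge organizers.

```python
# A rough sketch of the two challenge metrics at word level, assuming exact
# word matching; helper names are illustrative, not the official scorer.

def f_score(precision, recall):
    # Harmonic mean (h-mean) of precision and recall.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def split_iv_oov(words, train_vocab):
    # A word is out-of-vocabulary (OOV) iff it never appears in training.
    iv = [w for w in words if w in train_vocab]
    oov = [w for w in words if w not in train_vocab]
    return iv, oov

def combined_score(iv_score, oov_score):
    # Metric (2): average the IV and OOV scores so both are considered.
    return (iv_score + oov_score) / 2.0
```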
In this report, we present our solution to the end-to-end OOV word recognition task. We first pre-train several commonly used network backbones with oCLIP [xue2022language]. We then fine-tune PAN [wang2019efficient], Mask TextSpotter-v3 (MTS-v3) [liao2020mask], and TESTR [Zhang_2022_CVPR] on the composed datasets for word detection. Finally, we recognize the detected words with a SCATTER-based [litman2020scatter] recognizer. Our method achieves an h-mean of 28.59%, ranking first on the end-to-end recognition track.
2 Methods and Experimental Results
2.1 Text Detection
In our solution, we first pre-train different backbones, including VAN-large [guo2022visual] and Deformable ResNet-101 [dai2017deformable], with oCLIP [xue2022language]. Next, we fine-tune PAN [wang2019efficient], MTS-v3 [liao2020mask], and TESTR [Zhang_2022_CVPR] on the composed datasets, starting from the pre-trained models. Finally, we combine the detection results from the different models and recognize the words following [zhu2022oovrec]. All models are evaluated on the validation set of the composed dataset.
2.1.1 Model Pre-training
We first pre-train VAN-large [guo2022visual] and Deformable ResNet-101 [dai2017deformable] with oCLIP [xue2022language] on the SynthText [Gupta16] dataset as well as the provided composed dataset. Next, we fine-tune PAN [wang2019efficient] (with VAN-large as backbone), MTS-v3 [liao2020mask] (with Deformable ResNet-101 as backbone), and TESTR [Zhang_2022_CVPR] (with Deformable ResNet-101 as backbone) on the composed dataset, initializing from the pre-trained backbone weights. As shown in Table 1, oCLIP pre-training improves the F-score of every model, with gains ranging from 0.52% to 3.67%.
Table 1: Text detection results on the validation set with and without oCLIP pre-training, evaluated on OOV words and on all words.

| Method       | OOV Precision | OOV Recall | OOV Fscore    | All Precision | All Recall | All Fscore    |
|--------------|---------------|------------|---------------|---------------|------------|---------------|
| PAN          | 65.36         | 68.71      | 67.00         | 83.36         | 56.18      | 67.21         |
| PAN+oCLIP    | 64.03         | 73.11      | 68.27 (+1.27) | 83.37         | 61.64      | 70.88 (+3.67) |
| MTS-v3       | 77.13         | 48.31      | 59.41         | 87.61         | 42.16      | 56.93         |
| MTS-v3+oCLIP | 77.55         | 48.83      | 59.93 (+0.52) | 87.73         | 43.09      | 57.87 (+0.94) |
| TESTR        | 69.55         | 55.12      | 61.50         | 84.75         | 52.34      | 64.71         |
| TESTR+oCLIP  | 71.47         | 56.22      | 62.93 (+1.43) | 85.93         | 55.83      | 65.73 (+1.02) |
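As a minimal illustration of how the fine-tuning stage can consume the oCLIP pre-trained weights, the PyTorch sketch below copies matching backbone tensors from a checkpoint into a detector before fine-tuning; the checkpoint layout, the "backbone." prefix, and the helper name are assumptions for illustration, not our exact training code.

```python
import torch

def load_pretrained_backbone(detector, ckpt_path, prefix="backbone."):
    # Copy oCLIP pre-trained backbone tensors into the detector; parameters
    # whose names or shapes do not match keep their random initialization.
    state = torch.load(ckpt_path, map_location="cpu")
    own = detector.state_dict()
    matched = {k: v for k, v in state.items()
               if k.startswith(prefix) and k in own and own[k].shape == v.shape}
    own.update(matched)
    detector.load_state_dict(own)
    return detector
```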
2.1.2 Model Ensemble
Next, we collect the detection results from the different models at several input image scales (512, 960, 1280, and 1600) and combine them. We then apply Soft-NMS [bodla2017soft] to the combined results and filter the detected boxes with a score threshold of 0.92. Table 2 shows the model ensemble results.
Table 2: Model ensemble results on the validation set (OOV words).

| Method   | Precision | Recall | Fscore |
|----------|-----------|--------|--------|
| PAN      | 64.03     | 73.11  | 68.27  |
| MTS-v3   | 77.55     | 48.83  | 59.93  |
| TESTR    | 71.47     | 56.22  | 62.93  |
| Ensemble | 69.85     | 76.20  | 72.89  |
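A simplified sketch of this ensembling step follows: detections pooled from all models and input scales are rescored with Gaussian-decay Soft-NMS and kept only if their decayed score stays above the 0.92 threshold. Axis-aligned boxes and the Gaussian decay with sigma 0.5 are assumptions for illustration; the actual pipeline may operate on polygon boxes.

```python
import numpy as np

def iou(box, boxes):
    # IoU between one box and an array of boxes, all as [x1, y1, x2, y2].
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def soft_nms(boxes, scores, sigma=0.5, score_thr=0.92):
    # Gaussian Soft-NMS: decay the scores of boxes overlapping the current
    # highest-scoring box; keep every box whose score stays >= score_thr.
    boxes, scores, keep = boxes.copy(), scores.copy(), []
    while len(boxes) > 0:
        i = int(scores.argmax())
        if scores[i] < score_thr:
            break
        keep.append(boxes[i])
        scores *= np.exp(-iou(boxes[i], boxes) ** 2 / sigma)
        mask = np.arange(len(boxes)) != i
        boxes, scores = boxes[mask], scores[mask]
    return np.stack(keep) if keep else np.empty((0, 4))
```

In practice, `boxes` and `scores` here would be the concatenation of the outputs of PAN, MTS-v3, and TESTR over the four input scales.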
Table 3: End-to-end recognition results on the validation and test sets (OOV words).

| Set        | Precision | Recall | Fscore |
|------------|-----------|--------|--------|
| Validation | 41.08     | 41.73  | 41.40  |
| Test       | 20.28     | 48.42  | 28.59  |
2.2 End-to-End Word Recognition
We pass the detected text regions to our recognition model [zhu2022oovrec] and filter out the words that are recognized as 'ignore' texts to obtain the final text recognition results. Table 3 shows the end-to-end recognition results of our models.
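The filtering step itself is straightforward; below is a minimal sketch, assuming a `recognizer` callable that maps an image crop to a text string (a stand-in for the recognition model of [zhu2022oovrec]).

```python
def recognize_and_filter(image, boxes, recognizer, ignore_label="ignore"):
    # Run the recognizer on each detected box and drop 'ignore' predictions.
    results = []
    for x1, y1, x2, y2 in boxes:
        text = recognizer(image[int(y1):int(y2), int(x1):int(x2)])
        if text != ignore_label:
            results.append(((x1, y1, x2, y2), text))
    return results
```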
3 Conclusion
This report presents our solution to the end-to-end OOV word recognition task of the ECCV 2022 Challenge on OOV-ST. We adopt oCLIP for model pre-training and a model ensemble for better detection of text in scenes. The presented solution ranks first in end-to-end recognition of out-of-vocabulary words in the ECCV 2022 Challenge on OOV-ST.