Scene text spotting has witnessed remarkable progress and achieved promising results 2018cvpr_liu_fots; 2018eccv_lyu_masktextspotterv1; 2020cvpr_liu_abcnet; 2020eccv_liao_masktextspotterv3 in recent years. However, there is still room for improvement in scene text spotting, due to the inconsistency between text detection and recognition, which involves the following two aspects.
The first aspect is the inconsistency of text recognition features during training and testing. Most existing methods 2018cvpr_liu_fots; 2018eccv_lyu_masktextspotterv1; 2020cvpr_liu_abcnet; 2020eccv_liao_masktextspotterv3 extract recognition features based on ground-truth annotations in the training phase and predicted bounding boxes in the testing phase, which often leads to inconsistent text recognition feature distributions (see Figure 1(a)).
Second, there is inconsistency between the optimization targets of text detection and recognition. Detection branches in existing methods 2018cvpr_liu_fots; 2018eccv_lyu_masktextspotterv1; 2020cvpr_liu_abcnet; 2020eccv_liao_masktextspotterv3 are typically optimized to produce high-IoU text detection results. However, a detection result with high IoU is not always suitable for the recognition task. As shown in Figure 1(c), the bounding box with higher IoU leads to a false recognition result, while the bounding box with lower IoU yields the correct one.
Due to this inconsistency between text detection and recognition, previous methods (e.g., ABCNet) suffer a significant performance drop (see Figure 1(b)) when the IoU of detection results is lower than 0.8. This indicates that although some detection results are considered "correct" under the detection evaluation protocol (e.g., IoU > 0.5), they may not be suitable for text recognition (see Figure 1(c)).
To address the aforementioned problems, we propose a new arbitrarily-shaped text spotting framework, termed Auto-Rectification Text Spotter (ARTS), which bridges the inconsistency between text detection and recognition. We carefully design three modules for ARTS: (1) a rectification control points detection (RCPD) branch to detect arbitrarily-shaped text lines; (2) a differentiable feature extractor, termed the auto-rectification module (ARM), for back-propagating the text recognition loss to optimize the detection branch; and (3) a lightweight text recognition branch to decode text contents. These modules complement each other, enabling ARTS to learn text detection results from both detection loss and recognition loss, which largely alleviates the inconsistency between text detection and recognition. As shown by the red columns in Figure 1(b) and the example in Figure 1(c), our method achieves much better performance, especially when the detection results are of lower quality (IoU < 0.8).
We conduct extensive experiments to further examine the effectiveness of ARTS on three challenging benchmark datasets, including Total-Text 2017icdar_totaltext, CTW1500 2017arxiv_liu_ctw1500 and ICDAR2015 2015icdar_ic15. As shown in Figure 2, our method surpasses prior arts in terms of both accuracy and efficiency. For example, our ARTS-S (ResNet50) achieves an end-to-end text spotting F-measure of 77.1% on Total-Text, surpassing ABCNet-MS 2020cvpr_liu_abcnet by 7.6 points, while keeping a faster inference speed (10.5 FPS vs. 6.9 FPS). Moreover, the real-time version ARTS-RT yields an F-measure of 65.9% at 28.0 FPS, which is 10 FPS faster and 1.7% better than the previous fastest ABCNet.
Our main contributions are listed as follows:
(1) We systematically analyze the inconsistency between text detection and recognition, and propose a new text spotting framework, termed ARTS, to address this problem. To our knowledge, our method is the first work to study and tackle the inconsistency problem in text spotting.
(2) We design a differentiable module named ARM to bridge the gap between the text detection and recognition branches, so that the recognition loss can be back-propagated to optimize the detection results, helping the detection branch predict detection results that are more accurate and more suitable for text recognition.
(3) The proposed ARTS achieves state-of-the-art performance in terms of both accuracy and efficiency. Extensive experiments demonstrate the superiority of our models. Notably, ARTS-S (ResNet50) yields 77.1% end-to-end text spotting F-measure at 10.5 FPS on Total-Text, which is significantly better and faster than previous state-of-the-art methods.
Existing text spotting methods can be roughly summarized into the following two categories:
Regular Text Spotters
are usually designed to process horizontal or multi-oriented scene text. DeepTextSpotter 2017iccv_busta_deeptextspotter used an RPN to generate rotated proposals and extracted text features for its recognizer with bilinear sampling. FOTS 2018cvpr_liu_fots adopted a one-stage text detector to produce rotated rectangular bounding boxes and used RoIRotate to extract text features for the following recognizer. He et al. 2018cvpr_he_e2etextspotter also developed a similar framework whose recognition head was implemented with an attention-based decoder. Though these methods have achieved promising results on standard benchmarks (e.g., 2015icdar_ic15), they fail to spot texts with arbitrary shapes.
Arbitrarily-Shaped Text Spotters
are designed for spotting texts with irregular layouts. TextDragon 2019iccv_feng_textdragonv1 developed a bottom-up framework that combines features extracted from multiple text segments by RoISlide. Mask TextSpotter v1/v2 2018eccv_lyu_masktextspotterv1; 2019pami_liao_masktextspotterv2 and Qin et al. 2019iccv_qin_unconstrained were based on Mask R-CNN 2017iccv_he_maskrcnn and extracted recognition features through RoIAlign or RoIMasking. Wang et al. 2020aaai_wang_boundary utilized a multi-stage anchor-based method that first generates axis-aligned rectangular proposals, then regresses their angles to produce rotated rectangular proposals, and finally regresses boundary points on top of the rotated proposals. ABCNet 2020cvpr_liu_abcnet proposed parametric Bezier control points as the representation for arbitrarily-shaped text instances to extract smooth text features.
Comparison with Similar Works.
Boundary 2020aaai_wang_boundary adopted a three-stage anchor-based detector as its detection branch and cannot back-propagate recognition losses to the first two detection stages. Differently, our method adopts a much simpler one-stage anchor-free pipeline as its detection branch, which can be jointly optimized by detection and recognition targets, as shown in Figure 3(b), largely alleviating the inconsistency between text detection and recognition.
ABCNet 2020cvpr_liu_abcnet adopted BezierAlign for feature extraction. However, BezierAlign also cannot propagate the recognition loss back to the detection branch, causing inconsistency between text detection and recognition. In our work, we use a different training strategy: during training, we extract text features for the recognition task with predicted polygons instead of ground-truth ones. This makes a further improvement possible: our ARM both extracts text features and enables loss back-propagation from the recognition branch to the detection branch, which is hard to achieve with ABCNet's BezierAlign.
ARTS is an efficient and accurate end-to-end framework for detecting and recognizing text lines with arbitrary shapes. The overall architecture of ARTS is presented in Figure 4, which consists of three components: (1) a Rectification Control Points Detection head (RCPD) to detect and predict control points for each text line, (2) a differentiable Auto-Rectification Module (ARM) to rectify curved text features into aligned ones and allow loss back-propagation from recognition to detection branch, and (3) a text recognition branch to decode text contents from extracted features.
In the forward phase, we first feed the input image to the backbone network to obtain shared feature maps. Secondly, on top of these feature maps, RCPD predicts the text locations and the rectification control points. Thirdly, the predicted rectification control points are sent to ARM for rectifying and extracting text features. Finally, the aligned features are fed into the text recognition head to obtain the final text contents.
During training, we use the joint loss of detection loss and recognition loss to optimize our model. Different from previous methods 2018cvpr_liu_fots; 2018eccv_lyu_masktextspotterv1; 2019pami_liao_masktextspotterv2; 2020cvpr_liu_abcnet, whose detection branches are supervised only by the detection loss L_det, our detection branch is jointly optimized by detection and recognition targets with the loss functions L_det and L_rec. Besides, unlike previous methods, which tended to directly use ground-truth annotations for feature extraction during training, our method adopts a new training strategy that uses predicted detection results instead. Concretely, we define the central region of a text instance as positive pixels and evenly sample N pixels from all positive pixels. Then, we take the N groups of predicted control points of these sampled pixels and send them to our ARM to obtain text recognition features, which are fed into the recognition branch for training. Here, N is set to 64 by default.
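The even sampling of positive pixels described above can be sketched in a few lines (a simplified illustration with our own function names, not the paper's implementation):

```python
def sample_positive_pixels(positive_pixels, n=64):
    """Evenly sample n pixels from the list of positive (central-region) pixels.

    positive_pixels: list of (row, col) coordinates inside a text instance's
    central region. If fewer than n positives exist, pixels are repeated.
    """
    total = len(positive_pixels)
    # Evenly spaced indices over the positive set (with repetition if total < n).
    return [positive_pixels[(k * total) // n] for k in range(n)]


pixels = [(0, c) for c in range(200)]  # a toy central region with 200 positives
sampled = sample_positive_pixels(pixels, n=64)  # 64 evenly spread pixels
```

Each sampled pixel contributes one group of predicted control points, so the recognition branch is always trained on the detector's own predictions rather than on ground truth.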
Rectification Control Points Detection
As presented in Figure 5, we adopt a one-stage anchor-free framework as our detection branch to densely regress rectification control points for all text lines. For each text line, we sample the central region as positive pixels and regress offsets from each positive pixel towards the control points of this text line. The size of the regression result is 4n × H × W, where n denotes the number of control points on each side, s denotes the downsampling scale relative to the input image, and H and W are the height and width of the feature map, respectively.
Ground-Truth Generation of RCPD.
We do not directly use the annotations as our ground-truth targets, because the annotations provided by the dataset are not accurate enough for extracting high-quality text features. As depicted in Figure 6, we recalculate the control point targets by first fitting cubic Bezier curves and then uniformly sampling points according to the following equation:

p_k = \sum_{i=0}^{3} b_i B_{i,3}(t_k), \quad t_k = k / (n - 1), \quad k = 0, 1, \ldots, n - 1,

where p_k indicates the k-th sampled control point, b_i indicates the i-th Bezier control point, and n is a hyper-parameter that determines how many rectification control points we sample on each side of the text. B_{i,3}(t) represents the Bernstein basis polynomial and is formulated as follows:

B_{i,3}(t) = \binom{3}{i} t^i (1 - t)^{3-i},

where \binom{3}{i} is the binomial coefficient.
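The uniform sampling of rectification control points from a fitted cubic Bezier curve can be sketched as follows (a minimal illustration; function names are ours):

```python
from math import comb


def bernstein(i, t, degree=3):
    """Bernstein basis polynomial B_{i,degree}(t)."""
    return comb(degree, i) * (t ** i) * ((1 - t) ** (degree - i))


def sample_bezier(ctrl, n):
    """Uniformly sample n rectification control points on a cubic Bezier curve.

    ctrl: four (x, y) Bezier control points; n: number of samples per side.
    """
    points = []
    for k in range(n):
        t = k / (n - 1)  # t_k evenly spaced in [0, 1]
        x = sum(bernstein(i, t) * ctrl[i][0] for i in range(4))
        y = sum(bernstein(i, t) * ctrl[i][1] for i in range(4))
        points.append((x, y))
    return points


# Toy curve: starts at (0, 0), ends at (4, 0), bulging upward in between.
ctrl = [(0.0, 0.0), (1.0, 2.0), (3.0, 2.0), (4.0, 0.0)]
pts = sample_bezier(ctrl, n=5)
```

Note that the curve interpolates its first and last control points, so the sampled sequence always starts and ends exactly at the text boundary.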
The sampled points are defined as the rectification control points for this text instance and are used for generating the training targets. Concretely, for a positive pixel at position (x, y), we generate the offset targets as follows:

(\Delta x_k, \Delta y_k) = (x_k - x, \; y_k - y),

where x_k and y_k are the coordinates of the k-th control point, while \Delta x_k and \Delta y_k denote the target offsets towards the k-th control point.
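Target generation then reduces to simple coordinate differences, e.g. (a hypothetical helper, not the paper's code):

```python
def offset_targets(pixel, control_points):
    """Regression targets: offsets from a positive pixel (x, y) to each
    rectification control point (x_k, y_k)."""
    x, y = pixel
    return [(xk - x, yk - y) for (xk, yk) in control_points]


# A positive pixel at (10, 5) regressing towards two control points.
targets = offset_targets((10, 5), [(12, 5), (14, 8)])
```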
Previous methods tended to use RoIAlign operator or its variants for feature extraction. However, these operators can only back-propagate recognition loss into the shared backbone but not into the detection branch. Thus their detection branches are supervised by detection targets only and are highly independent of recognition information. These detection branches cannot learn from recognition targets and thus cannot produce detection results that are suitable for text recognition, leading to inconsistency between text detection and recognition.
We propose a new feature extractor named Auto-Rectification Module (ARM) to eliminate the inconsistency. ARM receives the groups of predicted rectification control points for text instances and outputs aligned text features for all text instances. Our ARM is implemented mainly based on the differentiable Spatial Transformer Network (STN) 2015nips_stn. Note that we further upgrade the original version so that it can handle multiple text instances in the same image. Due to the page limit, the detailed mathematical formulation is provided in the supplementary materials, and we refer readers to 2015nips_stn for more details about STN. Compared with previous methods 2020cvpr_liu_abcnet; 2018eccv_lyu_masktextspotterv1; 2019pami_liao_masktextspotterv2; 2018cvpr_liu_fots, our proposed module has the following differentiability advantage.
Differentiability from Recognition to Detection.
Previous end-to-end methods like 2020cvpr_liu_abcnet; 2018cvpr_liu_fots only share backbone features and lack the ability to back-propagate the recognition loss into the detection branch. We argue that it is of vital importance for our RCPD head to learn from recognition losses in order to produce better detection results. In our framework, ARM is completely differentiable, enabling loss back-propagation from recognition to the RCPD head. As a result, our RCPD head, jointly optimized by detection and recognition targets, can predict results that are more suitable for the subsequent recognition task. Extensive results also verify our argument that learning from recognition losses helps the entire network reach a better global optimum and obtain better performance on the end-to-end text spotting metric.
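The differentiability argument rests on the fact that bilinear sampling, the core of STN-style rectification, is a smooth (piecewise-linear) function of the sampling coordinates. A toy single-channel sketch (our own simplification, not the actual ARM):

```python
def bilinear_sample(feature, x, y):
    """Bilinearly sample a 2-D feature map at fractional coordinates (x, y).

    The output is a weighted sum of four neighboring values, with weights
    that are linear in x and y within each cell -- hence differentiable with
    respect to the sampling location. This is what lets gradients from the
    recognition loss reach the predicted control points.
    """
    x0, y0 = int(x), int(y)
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0  # fractional parts act as interpolation weights
    return ((1 - wx) * (1 - wy) * feature[y0][x0]
            + wx * (1 - wy) * feature[y0][x1]
            + (1 - wx) * wy * feature[y1][x0]
            + wx * wy * feature[y1][x1])


feat = [[0.0, 1.0],
        [2.0, 3.0]]
v = bilinear_sample(feat, 0.5, 0.5)  # center point: average of the four values
```

In contrast, operators that snap sampling locations to boxes selected from ground truth give the detection branch no gradient signal from recognition.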
To validate the effectiveness and robustness of our spotting framework, we adopt two different recognizers, i.e., Parallel Recognizer and Serial Recognizer. Both the recognizers have the same feature extractor, but differ in their sequence modeling modules and decoders. The detailed structure of our recognition branch can be seen in Table 1.
| Layer | Kernel, Stride, Padding | Output Size |
|---|---|---|
| Conv layers ×2 | 3, 1, 1 | (n, 256, h, w) |
| Conv layers ×1 | 3, (2,1), 1 | (n, 256, h/2, w) |
| Conv layers ×2 | 3, 1, 1 | (n, 256, h/2, w) |
| Conv layers ×1 | 3, (2,1), 1 | (n, 256, h/4, w) |
| Avg & Permute | - | (w, n, 256) |
| BiLSTM (Serial) / Self-Attn (Parallel) | - | (w, n, 256) |
| Serial / Parallel decoder | - | (n, len, n_class) |
The overall loss function of our model consists of two parts: (1) the detection loss L_det and (2) the recognition loss L_rec. It is defined as follows:

L = L_det + L_rec.

The detection loss L_det is a multi-task loss function, which can be defined as:

L_det = L_cls + L_cen + \lambda L_rcpd,

where L_cls and L_cen are for classification and centerness prediction, respectively, similar to the loss functions used in 2019iccv_tian_fcos. L_rcpd is the loss function of our RCPD head, which is implemented with the Smooth L1 loss 2015iccv_girshick_fastrcnn and is formulated as follows:

L_rcpd = \frac{1}{4n} \sum_{k} \mathrm{SmoothL1}(\hat{\Delta}_k, \Delta_k),

where \hat{\Delta}_k and \Delta_k are the predicted and target offsets of the rectification control points defined in Eqn 3, respectively. Here, \lambda balances the loss terms and is set to 0.2 by default in our experiments. The recognition loss L_rec optimizes the recognition branch and follows loss functions similar to those used in 2018pami_shi_aster; 2019iccv_baek_whatiswrong.
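Assuming the standard Smooth L1 form (the paper's exact reduction and weighting may differ), the RCPD regression loss can be sketched as:

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth L1: quadratic for small errors, linear for large ones."""
    d = abs(pred - target)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta


def rcpd_loss(pred_offsets, target_offsets):
    """Mean Smooth L1 over all predicted control-point offsets."""
    terms = [smooth_l1(p, t) for p, t in zip(pred_offsets, target_offsets)]
    return sum(terms) / len(terms)


# Two offset errors: 0.5 (quadratic regime) and 2.0 (linear regime).
loss = rcpd_loss([0.5, 2.0], [0.0, 0.0])
```

The quadratic regime near zero keeps gradients small for nearly-correct control points, while the linear regime prevents outlier offsets from dominating training.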
Our training process is divided into two phases, that is, pretraining and finetuning. During pretraining, we use a mixed dataset consisting of SynthText150k 2020cvpr_liu_abcnet, Total-Text 2017icdar_totaltext and MLT 2017icdar_mlt17. As for finetuning, we finetune our network on target datasets, i.e., Total-Text, CTW1500 and ICDAR2015, respectively.
The backbone of our network follows a common setting as most of the previous papers 2020cvpr_liu_abcnet; 2020aaai_wang_boundary; 2018eccv_lyu_masktextspotterv1; 2019pami_liao_masktextspotterv2, i.e., ResNet-50 2016cvpr_he_resnet together with a Feature Pyramid Network (FPN) 2017cvpr_lin_fpn. Following the settings of previous papers, for detection branch, we conduct dense prediction on 5 feature maps with 1/8, 1/16, 1/32, 1/64, 1/128 resolution of the input image while for ARM and the subsequent recognition, we use 3 feature maps with 1/4, 1/8, 1/16 resolution.
We train our model with a batch size of 8, using Stochastic Gradient Descent (SGD) with a momentum of 0.9. The maximum iteration of pretraining is 260K, and the initial learning rate is set to 0.02, which is decayed to a tenth at two later iterations. As for finetuning, the maximum iteration is 10K for Total-Text and IC15, with the learning rate decayed to a tenth at two later iterations, and 130K for CTW1500, with one decay to a tenth. Following prior arts, we adopt widely used data augmentation strategies: (1) instance-aware random cropping, (2) random scaling with the shorter side randomly chosen from 640 to 896, and (3) random rotation with the angle randomly chosen from a fixed range.
We resize the shorter side of the input image to 1000 for Total-Text, 800 for CTW1500 and 1000 for ICDAR2015. We use NMS with a threshold of 0.5 to filter out overlapping predictions. All results are tested with a batch size of 1 on one Tesla V100 GPU. For the best detection metric, we use a confidence threshold of 0.4 to filter out texts with low detection scores. For the best end-to-end metric on Total-Text, we use a recognition threshold of 0.9 (for None) or 0.7 (for Full) to filter out texts with low recognition scores, where the recognition score is simply computed by averaging the scores of all characters.
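The recognition-score filtering described above can be sketched as follows (function names are ours):

```python
def recognition_score(char_scores):
    """Word-level recognition score: average of per-character confidences."""
    return sum(char_scores) / len(char_scores)


def filter_words(words, threshold=0.9):
    """Keep words whose averaged recognition score passes the threshold.

    words: list of (text, per-character confidence list) pairs.
    """
    return [text for text, scores in words
            if recognition_score(scores) >= threshold]


words = [("CAFE", [0.99, 0.98, 0.97, 0.96]),       # avg 0.975 -> kept
         ("b1urry", [0.5, 0.4, 0.6, 0.5, 0.7, 0.6])]  # avg 0.55 -> dropped
kept = filter_words(words, threshold=0.9)
```

Averaging over characters makes a single low-confidence character drag down the whole word, which is why the stricter 0.9 threshold is used for the lexicon-free (None) setting.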
Comparisons with State-of-the-Art methods
Arbitrarily-Shaped Text Spotting.
| Method | Venue | Backbone | Det-P | Det-R | Det-F | None | Full | FPS |
|---|---|---|---|---|---|---|---|---|
| ABCNet MS 2020cvpr_liu_abcnet | CVPR'20 | ResNet50 | - | - | - | 69.5 | 78.4 | 6.9 |

Table 2: Results on Total-Text. "None" and "Full" indicate results with no lexicon and the full lexicon, respectively. ARTS-P and ARTS-S indicate using the parallel and serial decoder, respectively. "RT" denotes a real-time ResNet18 version that shrinks the short side of the input image to 640. † indicates using private data for training.
Our network mainly focuses on arbitrarily-shaped text spotting. To verify its effectiveness, we conduct experiments on the challenging Total-Text dataset. We follow the official evaluation protocol in 2020cvpr_liu_abcnet to make a fair comparison.
The results on Total-Text can be seen in Table 2. Our method outperforms previous state-of-the-art methods by a large margin in terms of both accuracy and efficiency. Concretely, with a lightweight serial recognizer, our ARTS-S achieves an outstanding E2E F-measure of 77.1% without lexicons, surpassing existing methods by +5.9% (77.1% vs. 71.2%) while keeping a competitive running speed (10.5 FPS).
Moreover, our ARTS-P also outperforms previous methods by a large margin, achieving an E2E F-measure of 75.8% at 13.0 FPS. The faster ARTS-P R18 version, which adopts ResNet18 as the backbone, still achieves much better E2E performance than ABCNet (73.5% vs. 64.2%) at a comparable running speed (17.0 FPS vs. 17.9 FPS). Our real-time version achieves the fastest running speed of 28 FPS with a competitive E2E F-measure of 65.9%.
| Method | Venue | Backbone | Det-P | Det-R | Det-F | E2E-S | E2E-W | E2E-G | FPS |
|---|---|---|---|---|---|---|---|---|---|
| MaskTextSpotter v1 2018eccv_lyu_masktextspotterv1 | ECCV'18 | ResNet50 | 91.6 | 81.0 | 86.0 | 79.3 | 73.0 | 62.4 | 4.8 |
| He et al. 2018cvpr_he_e2etextspotter | CVPR'18 | PVA | 87.0 | 86.0 | 87.0 | 82.0 | 77.0 | 63.0 | - |
| CharNet R-50 2019iccv_xing_charnet | ICCV'19 | ResNet50 | 91.2 | 88.3 | 89.7 | 80.1 | 74.5 | 62.2 | - |
Long Arbitrarily-Shaped Text Spotting.
To verify the robustness of our method on long curved text, we also conduct experiments on a representative benchmark dataset, CTW1500. As shown in Table 4, our method achieves highly competitive results on both the end-to-end text spotting metric and the detection metric. Specifically, our proposed network, although regression-based, achieves better E2E results even compared with state-of-the-art segmentation-based methods 2020aaai_qiao_textperceptron (60.6% vs. 57.0%). Compared with previous regression-based methods (e.g., ABCNet 2020cvpr_liu_abcnet), our method achieves an even larger advantage (60.6% vs. 45.2%). These results demonstrate that our network remains robust to extremely long curved text instances, which can be very difficult for previous regression-based methods due to extreme aspect ratios.
Multi-Oriented Text Spotting.
Though our method mainly focuses on arbitrarily-shaped text spotting, it still achieves state-of-the-art performance on the multi-oriented dataset ICDAR2015. As shown in Table 3, our method surpasses most previous state-of-the-art methods while keeping the fastest running speed. Specifically, our ARTS-S achieves the highest E2E F-measure of 68.7% with the generic lexicon while running at 10.0 FPS. Our ARTS-P still achieves a competitive E2E F-measure of 66.6% with the generic lexicon at 12.0 FPS, which is 50% faster than previous methods.
Comparisons with BezierAlign.
We have theoretically emphasized the advantages of our ARM in the section above. Here we conduct experiments on Total-Text to show the performance differences between our ARM and a representative state-of-the-art feature extraction method, BezierAlign 2020cvpr_liu_abcnet. In our experiments, we directly replace our ARM with the BezierAlign operator as the feature extraction module in our pipeline and follow the training strategy provided by the official code repository (#1). All other settings, including decoder architectures and data augmentations, are fixed.
As shown in Table 5, using BezierAlign (#1) suffers a big performance drop (74.9% vs. 77.1%) compared with using our differentiable ARM (#2). Though the model with BezierAlign can produce smoother text features, its detection branch loses the ability to learn from recognition information, eventually leading to a performance drop. This indicates that using our proposed ARM to deal with the "inconsistency problem" is essential for robust text spotting.
Note that BezierAlign with our architecture achieves better results than the original ABCNet due to the use of an attention-based recognizer and different data augmentations.
Effectiveness of Back-Propagating Recognition Loss to Detection Branch.
To validate the effectiveness of back-propagating the recognition loss, we design three groups of ablation experiments on Total-Text. For the first group (#1), we train the recognition branch with features extracted by the ground-truth polygons, thus cutting off the loss back-propagation. For the second group (#2), we use the predicted control points to rectify text features and enable recognition loss back-propagation, but set \lambda to 0 so that the RCPD head is optimized by recognition targets only. For the last group (#3), we use predicted control points, enable recognition loss back-propagation, and set \lambda to 0.2 so that our RCPD head is jointly optimized by detection and recognition targets. The results can be seen in Table 6. With loss back-propagation, our method (#3) outperforms the method without it (#1) by +2.0% in E2E F-measure and +1.4% in detection F-measure, demonstrating the superiority of back-propagating the recognition loss to the detection branch. We also find a surprising result (#2): even without the supervision of control point targets, our network can still converge and achieve good performance under the supervision of recognition targets alone.
Visualization and Time Analysis
Qualitative results are illustrated in Figure 7. Our proposed network can handle arbitrarily-shaped texts and seamlessly rectify them into straight texts for better recognition.
Time Cost Analysis.
We analyze the time consumption of different components on Total-Text. All experiments follow the same training protocol. As shown in Table 7, using the parallel instead of the serial recognizer reduces the recognition time cost to less than a third (6.9 ms vs. 24.0 ms) with only a limited performance drop, making the recognition branch a time-saving component and removing a barrier towards real-time scene text spotting.
In this paper, we systematically analyze the inconsistency between text detection and recognition. To tackle this problem, we design a differentiable auto-rectification module (ARM) together with a new training strategy to allow loss back-propagation from recognition branch to detection branch so that our detection branch can be jointly optimized by detection and recognition targets, thus largely alleviating the inconsistency problem. Based on these, we propose a new arbitrarily-shaped text spotter, termed ARTS, to fast detect and recognize scene texts. Extensive experiments on both arbitrarily-shaped (Total-Text and CTW1500) and multi-oriented (ICDAR2015) benchmark datasets demonstrate that our proposed ARTS can achieve state-of-the-art performance in terms of both accuracy and efficiency.