A PyTorch implementation of "ABCNet: Real-time Scene Text Spotting with Adaptive Bezier-Curve Network" (CVPR 2020 oral)
Scene text detection and recognition has received increasing research attention. Existing methods can be roughly categorized into two groups: character-based and segmentation-based. These methods either are costly for character annotation or need to maintain a complex pipeline, which is often not suitable for real-time applications. Here we address the problem by proposing the Adaptive Bezier-Curve Network (ABCNet). Our contributions are three-fold: 1) For the first time, we adaptively fit arbitrarily-shaped text by a parameterized Bezier curve. 2) We design a novel BezierAlign layer for extracting accurate convolution features of a text instance with arbitrary shapes, significantly improving the precision compared with previous methods. 3) Compared with standard bounding box detection, our Bezier curve detection introduces negligible computation overhead, resulting in superiority of our method in both efficiency and accuracy. Experiments on arbitrarily-shaped benchmark datasets, namely Total-Text and CTW1500, demonstrate that ABCNet achieves state-of-the-art accuracy, meanwhile significantly improving the speed. In particular, on Total-Text, our realtime version is over 10 times faster than recent state-of-the-art methods with a competitive recognition accuracy. Code is available at https://tinyurl.com/AdelaiDet
Scene text detection and recognition has received increasing attention due to its numerous applications in computer vision. Despite the tremendous progress made recently [10, 41, 27, 35, 26, 42], detecting and recognizing text in the wild remains largely unsolved because of its diverse patterns in sizes, aspect ratios, font styles, perspective distortion, and shapes. Although the emergence of deep learning has significantly improved the performance of scene text spotting, a considerable gap still exists between current methods and real-world applications, especially in terms of efficiency.
Recently, many end-to-end methods [30, 36, 33, 23, 43, 20] have significantly improved the performance of arbitrarily-shaped scene text spotting. However, these methods either use segmentation-based approaches that maintain a complex pipeline or require a large amount of expensive character-level annotations. In addition, almost all of these methods are slow in inference, hampering deployment to real-time applications. Thus, our motivation is to design a simple yet effective end-to-end framework for spotting oriented or curved scene text in images [5, 26], which ensures fast inference while achieving on-par or even better performance compared with state-of-the-art methods.
To achieve this goal, we propose the Adaptive Bezier Curve Network (ABCNet), an end-to-end trainable framework for arbitrarily-shaped scene text spotting. ABCNet enables arbitrarily-shaped scene text detection with simple yet effective Bezier curve adaptation, which introduces negligible computation overhead compared with standard rectangular bounding box detection. In addition, we design a novel feature alignment layer—BezierAlign—to precisely calculate convolutional features of text instances in curved shapes, so that high recognition accuracy can be achieved with almost negligible computation overhead. For the first time, we represent oriented or curved text with parameterized Bezier curves, and the results show the effectiveness of our method. Examples of our spotting results are shown in Figure 1.
Note that previous methods such as TextAlign  and FOTS  can be viewed as special cases of ABCNet, because a quadrilateral bounding box can be seen as the simplest arbitrarily-shaped bounding box with 4 straight boundaries. In addition, ABCNet avoids complicated transformations such as 2D attention , making the design of the recognition branch considerably simpler.
We summarize our main contributions as follows.
In order to accurately localize oriented and curved scene text in images, for the first time, we introduce a new concise parametric representation of curved scene text using Bezier curves. It introduces negligible computation overhead compared with the standard bounding box representation.
We propose a sampling method, a.k.a. BezierAlign, for accurate feature alignment, and thus the recognition branch can be naturally connected to the overall structure. By sharing backbone features, the recognition branch can be designed with a light-weight structure.
The simplicity of our method allows it to perform inference in real time. ABCNet achieves state-of-the-art performance on two challenging datasets, Total-Text and CTW1500, demonstrating advantages in both effectiveness and efficiency.
Scene text spotting requires detecting and recognizing text simultaneously rather than addressing only one task. Recently, the emergence of deep-learning-based methods has significantly advanced text spotting, dramatically improving both detection and recognition performance. We summarize several representative deep-learning-based scene text spotting methods into the following two categories. Figure 2 shows an overview of typical works.
Regular End-to-end Scene Text Spotting Li et al.  propose the first deep-learning based end-to-end trainable scene text spotting method. The method successfully uses RoI Pooling  to join detection and recognition features via a two-stage framework, but it can only spot horizontal and focused text. Its improved version  significantly improves the performance, but the speed is limited. He et al.  and Liu et al.  adopt an anchor-free mechanism to improve both the training and inference speed. They use similar sampling strategies, i.e., Text-Align-Sampling and RoI-Rotate, respectively, to enable extracting features from quadrilateral detection results. Note that neither of these two methods is capable of spotting arbitrarily-shaped scene text.
Arbitrarily-shaped End-to-end Scene Text Spotting To detect arbitrarily-shaped scene text, Liao et al.  propose Mask TextSpotter, which subtly refines Mask R-CNN and uses character-level supervision to simultaneously detect and recognize characters and instance masks. The method significantly improves the performance of spotting arbitrarily-shaped scene text. However, character-level ground truth is expensive, and it is hard to produce character-level ground truth for real data from freely synthesized data in practice. Its improved version  significantly alleviates the reliance on character-level ground truth, but the method relies on a region proposal network, which restricts the speed to some extent. Sun et al.  propose TextNet, which produces quadrilateral detection bounding boxes in advance and then uses a region proposal network to feed the detection features for recognition. Although the method can directly recognize arbitrarily-shaped text from a quadrilateral detection, the performance is still limited.
Recently, Qin et al.  propose to use RoI Masking to focus on the arbitrarily-shaped text region. However, the results can easily be affected by outlier pixels. In addition, the segmentation branch increases the computation burden; the polygon fitting process introduces extra time consumption; and the grouping result is usually jagged and not smooth. The work in  is the first one-stage arbitrarily-shaped scene text spotting method, but it requires character-level ground-truth data for training. The authors of  propose a novel sampling method, RoISlide, which fuses features from the predicted segments of the text instances, and thus is robust to long arbitrarily-shaped text.
We adopt a single-shot, anchor-free convolutional neural network as the detection framework. Removing anchor boxes significantly simplifies detection for our task. Here the detection is densely predicted on the output feature maps of the detection head, which is constructed from 4 stacked convolution layers with a stride of 1, padding of 1, and 3×3 kernels. Next, we present the key components of the proposed ABCNet in two parts: 1) Bezier curve detection; and 2) BezierAlign and the recognition branch.
Compared to segmentation-based methods [40, 44, 1, 38, 45, 28], regression-based methods are more direct solutions to arbitrarily-shaped text detection, e.g., [26, 42]. However, previous regression-based methods require complicated parameterized predictions to fit the text boundary, which is neither efficient nor robust for the various text shapes encountered in practice.
To simplify arbitrarily-shaped scene text detection, following the regression method, we believe that the Bezier curve is an ideal concept for parameterizing curved text. A Bezier curve is a parametric curve that uses the Bernstein polynomials  as its basis. The definition is shown in Equation (1):

c(t) = \sum_{i=0}^{n} b_i B_{i,n}(t), \quad 0 \le t \le 1, \qquad (1)

where $n$ represents the degree, $b_i$ represents the $i$-th control point, and $B_{i,n}(t)$ represents the Bernstein basis polynomials, as shown in Equation (2):

B_{i,n}(t) = \binom{n}{i} t^i (1-t)^{n-i}, \quad i = 0, \dots, n, \qquad (2)

where $\binom{n}{i}$ is a binomial coefficient. To fit arbitrary shapes of text with Bezier curves, we comprehensively observe arbitrarily-shaped scene text in existing datasets and the real world, and we empirically show that a cubic Bezier curve (i.e., $n$ is 3) is sufficient to fit the different kinds of arbitrarily-shaped scene text seen in practice. An illustration of a cubic Bezier curve is shown in Figure 4.
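The two equations above can be sketched directly in NumPy; this is an illustrative evaluation of Equation (1) with the Bernstein basis of Equation (2), not the reference implementation (function names are ours):

```python
import numpy as np
from math import comb

def bernstein(n, i, t):
    """Bernstein basis polynomial B_{i,n}(t) = C(n, i) * t^i * (1 - t)^(n - i) (Equation (2))."""
    return comb(n, i) * t**i * (1 - t)**(n - i)

def bezier_points(control_points, num_samples=20):
    """Evaluate c(t) = sum_i b_i B_{i,n}(t) (Equation (1)) at equally spaced t in [0, 1]."""
    cp = np.asarray(control_points, dtype=float)      # (n + 1, 2) control points b_0 ... b_n
    n = len(cp) - 1                                   # curve degree
    t = np.linspace(0.0, 1.0, num_samples)
    basis = np.stack([bernstein(n, i, t) for i in range(n + 1)], axis=1)
    return basis @ cp                                 # (num_samples, 2) points on the curve
```

For the cubic case used in this paper, `control_points` is a (4, 2) array and the curve starts at `b_0` and ends at `b_3`.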
Based on the cubic Bezier curve, we can simplify arbitrarily-shaped scene text detection to a bounding box regression with eight control points in total. Note that a straight text, which has four control points (four vertexes), is a typical case of arbitrarily-shaped scene text. For consistency, we interpolate two additional control points at the tripartite points of each long side.
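The tripartite interpolation for straight text can be sketched as follows; the (tl, tr, br, bl) vertex ordering is an assumption for illustration:

```python
import numpy as np

def quad_to_bezier(quad):
    """Lift a 4-vertex straight-text box to 8 cubic Bezier control points by
    interpolating two extra points at the tripartite points (1/3 and 2/3) of
    each long side. Vertex order (tl, tr, br, bl) is assumed, not prescribed."""
    tl, tr, br, bl = [np.asarray(p, dtype=float) for p in quad]
    top = [tl, tl + (tr - tl) / 3, tl + 2 * (tr - tl) / 3, tr]
    bottom = [br, br + (bl - br) / 3, br + 2 * (bl - br) / 3, bl]
    return np.stack(top + bottom)                     # (8, 2) control points
```

Because the extra points sit exactly at the tripartite points, the resulting cubic Bezier curves trace the original straight sides exactly.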
To learn the coordinates of the control points, we first generate the Bezier curve ground truth described in Section 2.1.1 and follow a regression method similar to  to regress the targets. For each text instance, we use

\Delta x = b_{ix} - x_{min}, \quad \Delta y = b_{iy} - y_{min}, \qquad (3)

where $x_{min}$ and $y_{min}$ represent the minimum $x$ and $y$ values of the 4 vertexes, respectively. The advantage of predicting the relative distance is that it is irrelevant whether the Bezier curve control points lie beyond the image boundary. Inside the detection head, we only need one convolution layer with 16 output channels to learn $\Delta x$ and $\Delta y$, which is nearly cost-free while the results can still be accurate, as discussed in Section 3.
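A minimal sketch of the target encoding (the helper name is ours): the 8 control points become 16 offsets relative to the minimum x and y of the box, matching the 16-channel convolution output.

```python
import numpy as np

def encode_targets(control_points, vertexes):
    """Encode 8 Bezier control points as (dx, dy) offsets from the minimum x
    and y of the 4 box vertexes, yielding 16 regression targets. Sketch only."""
    cp = np.asarray(control_points, dtype=float)      # (8, 2) control points
    v = np.asarray(vertexes, dtype=float)             # (4, 2) box vertexes
    xmin, ymin = v[:, 0].min(), v[:, 1].min()
    return (cp - [xmin, ymin]).reshape(-1)            # (16,) targets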
In this section, we briefly introduce how to generate the Bezier curve ground truth from the original annotations. The arbitrarily-shaped datasets, e.g., Total-Text  and CTW1500 , use polygonal annotations for the text regions. Given the annotated points $\{p_i\}_{i=0}^{m}$ on a curved boundary, where $p_i$ represents the $i$-th annotating point, the main goal is to obtain the optimal parameters of the cubic Bezier curve in Equation (1). To achieve this, we can simply apply the standard least squares method, as shown in Equation (4):

\begin{bmatrix} B_{0,3}(t_0) & \cdots & B_{3,3}(t_0) \\ \vdots & \ddots & \vdots \\ B_{0,3}(t_m) & \cdots & B_{3,3}(t_m) \end{bmatrix} \begin{bmatrix} b_{x_0} & b_{y_0} \\ \vdots & \vdots \\ b_{x_3} & b_{y_3} \end{bmatrix} = \begin{bmatrix} p_{x_0} & p_{y_0} \\ \vdots & \vdots \\ p_{x_m} & p_{y_m} \end{bmatrix} \qquad (4)

Here $m$ represents the number of annotated points on a curved boundary. For Total-Text and CTW1500, $m$ is 5 and 7, respectively. $t$ is calculated as the ratio of the cumulative length to the perimeter of the polyline. According to Equation (1) and Equation (4), we convert the original polyline annotation to a parameterized Bezier curve. Note that we directly use the first and the last annotating points as the first ($b_0$) and the last ($b_3$) control points, respectively. A visual comparison is shown in Figure 5, which shows that the generated results can even be visually better than the original ground truth. In addition, based on the structured Bezier curve bounding box, we can easily use the BezierAlign described in Section 2.2 to warp the curved text into a horizontal format without dramatic deformation. More examples of Bezier curve generation results are shown in Figure 6. The simplicity of our method allows it to generalize to different kinds of text in practice.
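The least-squares fit of Equation (4) can be sketched with NumPy; this illustrative version (not the official code) fixes b_0 and b_3 to the endpoints and solves only for the middle control points, with t taken from cumulative polyline length as described:

```python
import numpy as np
from math import comb

def fit_cubic_bezier(points):
    """Fit one cubic Bezier curve to annotated boundary points with standard
    least squares (Equation (4)). Each point's parameter t is the ratio of its
    cumulative polyline length to the total length; the first and last points
    are kept as control points b_0 and b_3. Sketch only."""
    p = np.asarray(points, dtype=float)               # (m + 1, 2) boundary points
    seg = np.linalg.norm(np.diff(p, axis=0), axis=1)
    t = np.concatenate([[0.0], np.cumsum(seg)]) / seg.sum()
    # Bernstein design matrix of degree 3: B[j, i] = B_{i,3}(t_j)
    B = np.stack([comb(3, i) * t**i * (1 - t)**(3 - i) for i in range(4)], axis=1)
    # Fix b_0 = p_0 and b_3 = p_m; solve only for the middle control points.
    rhs = p - np.outer(B[:, 0], p[0]) - np.outer(B[:, 3], p[-1])
    mid, *_ = np.linalg.lstsq(B[:, 1:3], rhs, rcond=None)
    return np.vstack([p[0], mid, p[-1]])              # (4, 2) control points
```

For points lying on a straight line, the fit recovers control points on that line, consistent with the tripartite-point construction for straight text.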
For end-to-end scene text spotting methods, a massive amount of free synthesized data is always necessary, as shown in Table 2. However, the existing 800k SynthText dataset  only provides quadrilateral bounding boxes for mostly straight text. To diversify and enrich arbitrarily-shaped scene text, we synthesize a 150k dataset (94,723 images containing mostly straight text, and 54,327 images containing mostly curved text) with the VGG synthetic method . Specifically, we filter out 40k text-free background images from COCO-Text  and then prepare the segmentation mask and scene depth of each background image with  and  for the subsequent text rendering. To enlarge the shape diversity of the synthetic text, we modify the VGG synthetic method to synthesize scene text with various art fonts and corpora, and generate polygonal annotations for all text instances. The annotations are then used to produce the Bezier curve ground truth with the generation method described in Section 2.1.1. Examples of our synthesized data are shown in Figure 8.
To enable end-to-end training, most previous methods adopt various sampling (feature alignment) methods to connect the recognition branch. Typically, a sampling method represents an in-network region cropping procedure: given a feature map and a Region-of-Interest (RoI), the sampling method selects the RoI features and efficiently outputs a feature map of fixed size. However, the sampling methods of previous non-segmentation-based methods, e.g., RoI Pooling , RoI-Rotate , Text-Align-Sampling , or RoI Transform , cannot properly align the features of arbitrarily-shaped text (and RoISlide  must fuse numerous predicted segments). By exploiting the parameterized nature of a compact Bezier curve bounding box, we propose BezierAlign for feature sampling. BezierAlign is extended from RoIAlign . Unlike RoIAlign, the sampling grid of BezierAlign is not rectangular. Instead, each column of the arbitrarily-shaped grid is orthogonal to the Bezier curve boundary of the text. The sampling points have equidistant intervals in width and height, respectively, and are bilinearly interpolated with respect to the coordinates.
Formally, given an input feature map and the Bezier curve control points, we concurrently process all output pixels of the rectangular output feature map of size $h_{out} \times w_{out}$. Taking the pixel $g_i$ with position $(g_{iw}, g_{ih})$ from the output feature map as an example, we calculate $t$ by Equation (5):

t = \frac{g_{iw}}{w_{out}} \qquad (5)

We then use $t$ and Equation (1) to compute the point $tp$ on the upper Bezier curve boundary and the point $bp$ on the lower boundary. With $tp$ and $bp$, we linearly index the sampling point $op$:

op = bp \cdot \frac{g_{ih}}{h_{out}} + tp \cdot \left(1 - \frac{g_{ih}}{h_{out}}\right)

With the position of $op$, we can easily apply bilinear interpolation to calculate the result. Comparisons between previous sampling methods and BezierAlign are shown in Figure 7.
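The grid construction described above can be sketched as follows; this illustrative version (function names are ours) only produces the grid coordinates, whereas the real layer also reads each grid point from the feature map with bilinear interpolation:

```python
import numpy as np
from math import comb

def cubic_bezier(cp, t):
    """Evaluate a cubic Bezier curve (Equation (1)) at an array of parameters t."""
    B = np.stack([comb(3, i) * t**i * (1 - t)**(3 - i) for i in range(4)], axis=1)
    return B @ np.asarray(cp, dtype=float)            # (len(t), 2) boundary points

def bezier_align_grid(top_cp, bottom_cp, h_out, w_out):
    """Build the (h_out, w_out, 2) sampling grid of BezierAlign: t = g_iw / w_out
    (Equation (5)) indexes the top boundary point tp and bottom boundary point bp,
    and each sampling point op is linearly interpolated between them."""
    t = np.arange(w_out) / w_out
    tp = cubic_bezier(top_cp, t)                      # (w_out, 2) top boundary
    bp = cubic_bezier(bottom_cp, t)                   # (w_out, 2) bottom boundary
    r = (np.arange(h_out) / h_out)[:, None, None]     # vertical ratio g_ih / h_out
    return tp[None] * (1 - r) + bp[None] * r          # (h_out, w_out, 2) grid
```

Each grid row follows the curved boundaries, so a curved text region is unwarped into a horizontal feature map of fixed size.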
Benefiting from the shared backbone feature and BezierAlign, we design a light-weight recognition branch, shown in Table 1, for faster execution. It consists of 6 convolutional layers, 1 bidirectional LSTM  layer, and 1 fully connected layer. Based on the output classification scores, we use the classic CTC loss  for alignment with the ground-truth (GT) text string. Note that during training, we directly use the generated Bezier curve GT to extract the RoI features; therefore the detection branch does not affect the recognition branch. In the inference phase, the RoI region is replaced by the detected Bezier curve described in Section 2.1. Ablation studies in Section 3 demonstrate that the proposed BezierAlign can significantly improve the recognition performance.
| Layers | Kernel (size, stride) | Output (n, c, h, w) |
|---|---|---|
| conv layers ×4 | (3, 1) | (n, 256, h, w) |
| conv layers ×2 | (3, (2, 1)) | (n, 256, h/4, w) |
| average pool (height) | - | (n, 256, 1, w) |
| channels-permute | - | (w, n, 256) |
| BLSTM | - | (w, n, 512) |
| FC | - | (w, n, n_class) |
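Table 1 can be transcribed as a PyTorch module. The following is a minimal sketch, assuming 3×3 kernels with padding 1 and placing the two stride-(2, 1) layers last (details the table does not fix); it is not the official implementation:

```python
import torch
import torch.nn as nn

class RecognitionBranch(nn.Module):
    """Light-weight recognition head per Table 1: six 3x3 conv layers (the last
    two with stride (2, 1) over height), height average pooling, a bidirectional
    LSTM, and a fully connected classifier (trained with CTC loss)."""

    def __init__(self, in_channels=256, num_classes=97):
        super().__init__()
        convs, channels = [], in_channels
        for i in range(6):
            stride = (2, 1) if i >= 4 else (1, 1)     # last two convs shrink height
            convs += [nn.Conv2d(channels, 256, 3, stride=stride, padding=1),
                      nn.ReLU(inplace=True)]
            channels = 256
        self.convs = nn.Sequential(*convs)
        self.lstm = nn.LSTM(256, 256, bidirectional=True)
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x):                             # x: (n, 256, h, w)
        x = self.convs(x)                             # (n, 256, h/4, w)
        x = x.mean(dim=2)                             # average pool over height
        x = x.permute(2, 0, 1)                        # (w, n, 256)
        x, _ = self.lstm(x)                           # (w, n, 512)
        return self.fc(x)                             # (w, n, num_classes)
```

The per-column class scores along the width axis are exactly the sequence that the CTC loss aligns against the GT text string.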
We evaluate our method on two recently introduced arbitrarily-shaped scene text benchmarks, Total-Text  and CTW1500 , which also contain a large amount of straight text. We also conduct ablation studies on Total-Text to verify the effectiveness of our proposed method.
The backbone of this paper follows the common setting of most previous papers, i.e., ResNet-50  together with a Feature Pyramid Network (FPN) . For the detection branch, we utilize RoIAlign on 5 feature maps with 1/8, 1/16, 1/32, 1/64, and 1/128 resolution of the input image, while for the recognition branch, BezierAlign is conducted on three feature maps with 1/4, 1/8, and 1/16 sizes. The pretraining data is collected from publicly available English word-level datasets, including the 150k synthesized data described in Section 2.1.2, 15k images filtered from the original COCO-Text , and 7k ICDAR-MLT data . The pretrained model is then finetuned on the training set of the target datasets. In addition, we adopt data augmentation strategies, e.g., random-scale training, with the short side randomly chosen from 560 to 800 and the long side kept below 1333; and random cropping, for which we ensure that the crop size is larger than half of the original size and that no text is cut off (for some special cases where these conditions are hard to meet, we do not apply random cropping).
We train our model using 4 Tesla V100 GPUs with an image batch size of 32. The maximum number of iterations is 150K; the initial learning rate is 0.01, which is reduced to 0.001 at the 70K-th iteration and 0.0001 at the 120K-th iteration. The whole training process takes about 3 days.
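The schedule above is a standard step decay; a hypothetical PyTorch sketch (the optimizer choice and everything beyond the stated rates and milestones are assumptions):

```python
import torch

# Placeholder parameters stand in for the ABCNet model (sketch only).
params = [torch.nn.Parameter(torch.zeros(1))]
opt = torch.optim.SGD(params, lr=0.01)
# Reduce the learning rate 10x at iterations 70K and 120K (150K total).
sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[70_000, 120_000], gamma=0.1)
```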
| Method | Training data | Backbone | None (%) | Full (%) | FPS |
|---|---|---|---|---|---|
| TextBoxes  | SynText800k, IC13, IC15, TT | ResNet-50-FPN | 36.3 | 48.9 | 1.4 |
| Mask TextSpotter’18  | SynText800k, IC13, IC15, TT | ResNet-50-FPN | 52.9 | 71.8 | 4.8 |
| Two-stage  | SynText800k, IC13, IC15, TT | ResNet-50-SAM | 45.0 | - | - |
| TextNet  | SynText800k, IC13, IC15, TT | ResNet-50-SAM | 54.0 | - | 2.7 |
| Li et al.  | SynText840k, IC13, IC15, TT, MLT, AddF2k | ResNet-101-FPN | 57.8 | - | 1.4 |
| Mask TextSpotter’19  | SynText800k, IC13, IC15, TT, AddF2k | ResNet-50-FPN | 65.3 | 77.4 | 2.0 |
| Qin et al.  | | | | | |
| CharNet  | SynText800k, IC15, MLT, TT | | | | |
| TextDragon  | SynText800k, IC15, TT | | | | |
| ABCNet-F | SynText150k, COCO-Text, TT, MLT | ResNet-50-FPN | 61.9 | 74.1 | 22.8 |
The Total-Text dataset  is one of the most important arbitrarily-shaped scene text benchmarks, proposed in 2017 and collected from various scenes, including text-like scene complexity and low-contrast backgrounds. It contains 1,555 images, with 1,255 for training and 300 for testing. To resemble real-world scenarios, most images in this dataset contain a large amount of regular text, while each image is guaranteed to have at least one curved text instance. Each text instance is annotated with a word-level polygon. Its extended version  improves the annotation of the training set by annotating each text instance with a fixed ten points following the text recognition sequence. The dataset contains English text only. To evaluate the end-to-end results, we follow the same metric as previous methods, which use F-measure to measure word accuracy.
To evaluate the effectiveness of the proposed components, we conduct ablation studies on this dataset. We first conduct a sensitivity analysis of how the number of sampling points affects the end-to-end results, shown in Table 4. From the results, we can see that the number of sampling points significantly affects both the final performance and the efficiency. We find that (7, 32) achieves the best trade-off between F-measure and FPS, and we use it as the final setting in the following experiments. We further evaluate BezierAlign by comparing it with the previous sampling methods shown in Figure 7. The results in Table 3 demonstrate that BezierAlign can dramatically improve the end-to-end results. Qualitative examples are shown in Figure 9.
Another important component is Bezier curve detection, which enables arbitrarily-shaped scene text detection. Therefore, we also conduct experiments to evaluate the time consumption of Bezier curve detection. The result in Table 5 shows that the Bezier curve detection does not introduce extra computation compared with standard bounding box detection.
| Method | Sampling method | F-measure (%) |
|---|---|---|
| ABCNet | Horizontal Sampling | 38.4 |
| ABCNet | Quadrilateral Sampling | 44.7 |

| Method | Sampling points (h, w) | F-measure (%) | FPS |
|---|---|---|---|
| ABCNet | (6, 32) | 59.6 | 23.2 |
| ABCNet | (7, 32) | 61.9 | 22.8 |
| ABCNet | (14, 64) | 58.1 | 19.9 |
| ABCNet | (21, 96) | 54.8 | 18.0 |
| ABCNet | (28, 128) | 53.4 | 15.1 |
| ABCNet | (30, 30) | 59.9 | 21.4 |
| Configuration | FPS |
|---|---|
| without Bezier curve detection | 22.8 |
| with Bezier curve detection | 22.5 |
We further compare our method to previous methods. From Table 2, we can see that our single-scale result (short side 800) achieves competitive performance while running in real time, resulting in a better trade-off between speed and word accuracy. With multi-scale inference, ABCNet achieves state-of-the-art performance, significantly outperforming all previous methods, especially in running time. It is worth mentioning that our faster version is more than 11 times faster than the previous best method  with on-par accuracy.
Some qualitative results of ABCNet are shown in Figure 10. The results show that our method can accurately detect and recognize most arbitrarily-shaped text. In addition, our method also handles straight text well, producing nearly quadrilateral compact bounding boxes and correct recognition results. Some errors are also visualized in the figure; they are mainly caused by mistakenly recognizing one of the characters.
CTW1500  is another important arbitrarily-shaped scene text benchmark, proposed in 2017. Compared to Total-Text, this dataset contains both English and Chinese text. In addition, the annotation is at the text-line level, and the dataset includes some document-like text, i.e., numerous small texts that may stack together. CTW1500 contains 1k training images and 500 testing images.
Because the proportion of Chinese text in this dataset is very small, we directly regard all Chinese text as an "unseen" class during training, i.e., the 96-th class. Note that the last, 97-th class is "EOF" in our implementation. We follow the same evaluation metric as previous work. The experimental results, shown in Table 6, demonstrate that in terms of end-to-end scene text spotting, ABCNet can significantly surpass previous state-of-the-art methods. Example results on this dataset are shown in Figure 11. From the figure, we can see that some long text-line instances contain many words, which makes full-match word accuracy extremely difficult: mistakenly recognizing a single character results in a zero score for the whole text.
| Method | Training data | None (%) | Strong Full (%) |
|---|---|---|---|
| FOTS  | SynText800k, CTW1500 | 21.1 | 39.7 |
| Two-Stage*  | SynText800k, CTW1500 | 37.2 | 69.9 |
| RoIRotate*  | SynText800k, CTW1500 | 38.6 | 70.9 |
| LSTM*  | SynText800k, CTW1500 | 39.2 | 71.5 |
| TextDragon  | SynText800k, CTW1500 | 39.7 | 72.4 |

"None" represents lexicon-free. "Strong Full" represents that we use all the words that appear in the test set.
We have proposed ABCNet—a real-time end-to-end method that uses Bezier curves for arbitrarily-shaped scene text spotting. By reformulating arbitrarily-shaped scene text using parameterized Bezier curves, ABCNet can detect arbitrarily-shaped scene text with Bezier curves which introduces negligible computation cost compared with standard bounding box detection. With such regular Bezier curve bounding boxes, we can naturally connect a light-weight recognition branch via a new BezierAlign layer.
In addition, by using our Bezier curve synthesized dataset and publicly available data, experiments on two arbitrarily-shaped scene text benchmarks (Total-Text and CTW1500) demonstrate that our ABCNet can achieve state-of-the-art performance, which is also significantly faster than previous methods.
The authors would like to thank Huawei Technologies for the donation of GPU cloud computing resources.
A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proc. Int. Conf. Mach. Learn., pages 369–376. ACM, 2006.