Benefiting from the development of deep learning, scene text detection has made great progress[19, 7, 63, 51, 6, 54, 5]. However, due to unconstrained text variations in font, size, color, and orientation, arbitrary-shaped scene text detection remains a challenge.
Current scene text detection methods based on deep learning can be divided into two categories: regression based approaches[57, 61, 62] and segmentation based approaches[47, 40, 20, 38, 53]. Because they predict each pixel, segmentation based approaches do not need to explicitly process complex curved texts. However, such approaches are sensitive to noise, so they usually depend on pre-training on a large dataset. Besides, pixel-level processing significantly increases the computational cost, and the post-processing steps are typically very complicated. In contrast, regression based methods are often more concise and easier to train. However, two main problems remain unresolved for regression based methods.
On one hand, designing a compact text mask representation that can fit diverse geometry variances of arbitrary-shaped text instances is challenging. Because of the high complexity of directly regressing arbitrary-shaped text masks, most of the existing regression based methods regress contour point sequences of texts. However, point sequences are not sufficient to capture the details of highly curved texts, in which the represented text contour is usually unsmooth, as shown in Fig.1(a).
On the other hand, state-of-the-art regression based methods rely heavily on the divide-and-conquer strategy in feature pyramid networks (FPN) to regress multi-scale texts. However, all training samples are subject to the same supervision, leading to an imbalanced supervision issue among different pyramid layers, especially for single-stage detectors. Specifically, the number of training samples in P3 layer is times of that in P7 layer of FPN. Thus, connecting multi-level heads together will cause extremely imbalanced learning among samples on different layers, as the shallow layers like P3 receive much more supervision than deeper layers.
To tackle these two problems, we propose a novel arbitrary-shaped scene text detection framework, namely TextDCT. First, inspired by recent instance segmentation work, we model the high-resolution text instance mask in the frequency domain instead of the spatial domain via the discrete cosine transform (DCT), and keep its low-frequency components as the mask representation, which has low training complexity. Owing to the energy concentration characteristic of DCT, in which most of the energy of a natural signal is concentrated in the low-frequency components, the mask representation obtained by this transformation has high quality.
Moreover, although recent research efforts[52, 3] can mitigate the imbalanced supervision problem by independently supervising different pyramid layers, they do not go beyond the divide-and-conquer strategy, which makes the structure of single-stage detectors more complex. Intuitively, using only a single-level head can solve this problem with a much simpler model. However, directly replacing multi-level heads with a single-level head can lead to a dramatic drop in model performance. The first reason is that the single-level head is neither scale-aware nor spatial-aware, since the receptive field of a single-level feature cannot match the large range of text scales. Since arbitrary-shaped texts usually appear in vastly different fonts, rotations, and shapes, the detection performance for multi-scale text instances using a single-level head is limited. The second reason is that a positive sampling strategy suitable for single-level prediction is required. If the popular “center sampling” strategy is used, the masks of different text instances whose center points fall in the same region are easily confused.
Considering the above issues, we design a feature awareness module (FAM) to achieve spatial-awareness and scale-awareness by fusing rich contextual information, capturing larger receptive field and focusing on more significant features. We also introduce a text kernel sampling (TKS) strategy for the single-level prediction by treating the shrunk text kernel region as the positive samples, which can adaptively adjust the number of positive samples to balance text regression at different scales. Besides, based on the separation of text kernels from each other, we propose segmented non-maximum suppression (S-NMS) to effectively suppress false positives, especially for long text instances.
Our TextDCT is a single-shot, anchor-free, and light-weighted framework that performs joint optimization of text/non-text kernel classification, text bounding box regression, and text mask regression. The main contributions of this work are summarized as follows:
We propose a novel scene text detection framework by employing DCT to represent text masks, which can accurately approximate arbitrary-shaped text instances while having low training complexity.
We design a single-level head for top-down text prediction, with a feature awareness module, a text kernel sampling strategy, and a segmented non-maximum suppression method. These strategies are beneficial for avoiding imbalanced supervision, processing multi-scale text variations, and suppressing false positives.
Extensive experiments demonstrate that our TextDCT achieves competitive performance on both accuracy and efficiency. Particularly, TextDCT obtains F-measure of at frames per second (FPS) and F-measure of at FPS for CTW1500 and Total-Text datasets, respectively.
II Related Work
Current scene text detection methods can be divided into two categories: segmentation based scene text detection and regression based scene text detection. Besides, we discuss some current false positive suppression methods.
II-A Segmentation Based Scene Text Detection
Inspired by semantic segmentation methods[27, 36], some works regard arbitrary-shaped text detection as a segmentation problem, representing complex text instances with pixel-level predictions and rebuilding text instances through specific post-processing. PixelLink first predicted the linkage relationship between pixels, then extracted the text bounding boxes by separating the links belonging to different text instances. TextSnake treated text instances as a sequence of overlapping disks, and predicted the text area, the text center line, and some geometric attributes of the disks to rebuild the text instances. To effectively split close text instances, PSENet detected kernels of different scales in each text instance, and adopted a progressive scaling method to gradually expand the text kernels to obtain the final detections. Besides, TextField first learnt a direction field containing both the text mask and relative position information away from the text boundary, then linked neighboring pixels to generate candidate text instances. Tian et al. treated each text instance as a cluster and performed a two-step clustering strategy to segment dense text instances, where pixels of the same text usually lie in the same cluster. CRAFT detected character-level text regions by using two-dimensional Gaussian segmentation labels and exploring the affinity between characters. DBNet introduced a differentiable binarization module that gives a high threshold to text boundaries to distinguish adjacent texts. Moreover, TextBPN first obtained boundary proposals based on the distance field map and classification map, then deformed the boundary proposals into more accurate text boundaries with an adaptive boundary deformation module.
Most of these methods can adapt to curved texts, but are sensitive to text-like background noises and holes, resulting in false positive cases. In contrast, our TextDCT can deal with false positives well by the S-NMS strategy.
II-B Regression Based Scene Text Detection
Regression based methods typically rely on the object detection pipeline with bounding box regression, which is usually easier to train than segmentation based methods. CTPN first used a modified pipeline of Faster R-CNN to detect a set of partial text components with a fixed-size width, then connected them within different instances. RRPN also used a modified pipeline of Faster R-CNN, which adopted rotated proposals to detect multi-oriented texts. TextBoxes modified the shapes of convolutional filters and increased the proportion of default boxes of SSD to adapt to the aspect ratio of texts. TextBoxes++ extended TextBoxes by applying quadrilateral regression to effectively detect multi-oriented text. EAST adopted an anchor-free single-stage detection framework to directly regress the offsets between points within text instances and the corresponding bounding boxes or four corner points. MOST proposed a text feature alignment module (TFAM) and a position-aware non-maximum suppression (PA-NMS) module to achieve accurate detection of long texts. However, the above regression based text representations are horizontal or oriented rectangles, which have limited capacity to model irregular texts.
One method regressed the offsets between the top-left point of the bounding box and the key points on text contours, then smoothed the offsets with a recurrent neural network (RNN). LOMO introduced a shape representation module that uses center lines, text regions, and border offsets to represent texts, then proposed an iterative refinement module to regress long texts. DRRG treated text instances as a series of combinations of small rectangular components and introduced a graph convolutional network to learn the linkage relationships of the text components. TextRay formulated the text contours in the polar system, and regressed the distance from the polar coordinate to the point where the emitted N-rays intersect with the text boundary. ABCNet adopted an anchor-free network to predict a series of points, then used Bernstein polynomials to transform these points into Bezier curves that fit the text contours. PCR proposed the contour location mechanism (CLM) to regress text contours progressively. FCENet treated text contours as periodic functions, and used the discrete Fourier transform (DFT) to convert text contours to Fourier eigenvectors. Unlike FCENet, which uses DFT to encode text contours, our method adopts DCT to encode text masks as compact vectors.
II-C False Positive Suppression
To suppress false positives, SPCNet designed a text context module (TCM) and a re-score mechanism to improve the accuracy of the scores of tilted texts. ContourNet designed a local orthogonal texture-aware module (LOTM), which considers the local texture information in two orthogonal directions simultaneously. TextRay designed a central-weighted training strategy to give beneficial gradients for long text instances. ABCNet used the center-ness branch to get better text bounding boxes. TextFuseNet fused global-level, char-level and word-level features of texts to strengthen the capability of distinguishing texts and non-texts. PCR designed a contour localization mechanism to re-score the localized contours. However, these methods increase the training complexity or the post-processing time, which brings additional computational burden. In this paper, we propose the S-NMS to suppress false positives in long texts with low overhead.
III Arbitrary-Shaped Text Detection via Discrete Cosine Transform Mask
Our TextDCT is a single-shot, anchor-free detection framework that can fit arbitrary-shaped texts. As shown in Fig. 2, this framework mainly contains three parts: feature extraction, feature fusion, and three-branch joint optimization based on the single-level head. In the feature extraction module, we utilize ResNet50 as the backbone to generate shared feature maps with different receptive fields. We first adopt FPN to generate fused features with strong representation, and then add an efficient feature awareness module (FAM) between P4 and P3 of FPN to achieve spatial-awareness and scale-awareness for the single-level head. In the three-branch joint optimization module, we share the first four convolutions to improve the correlation between the classification and regression tasks, and then use three convolutions to predict the probability of each position corresponding to the text kernel, the regression value of each position corresponding to the text bounding box, and the text mask, respectively.
III-B DCT Mask Representation
Most of the existing contour points modeling methods have a limited capacity to represent highly curved or long texts, while the directly regressed high-resolution masks contain redundant information since the discriminative pixels are mainly distributed along the text boundaries. To obtain low-complexity and high-quality text mask representations, we encode the text masks into compact vectors by DCT. As shown in Fig. 3, the output of the mask regression branch in Fig. 2 is first reshaped to be the size of , where and are the height and width of , and denotes the vector dimension at each point in . Then, we regress only the mask vectors corresponding to the positive samples in , where is the number of positive sample points.
During training, the generation process of the ground-truth mask vector corresponding to each point consists of three steps. Firstly, we reshape the ground-truth mask of the -th text instance to be , where is the uniform mask size, , and is the number of text instances in the input image.
Secondly, we employ two-dimensional DCT to encode into the frequency domain:
where for , and otherwise. Due to the energy concentration characteristic of DCT, in which most of the energy of a natural signal is concentrated in the low-frequency components, the compact vector can be sampled from the first -dimensional vector of the in “zigzag” order.
Thirdly, we assign to by , where . is a label assignment strategy that matches the frequency-domain labels to corresponding positive sample points.
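The encoding step can be illustrated with a minimal sketch. It assumes the standard orthonormal 2D DCT-II and a JPEG-style zigzag scan, which the text suggests but whose exact form is not recoverable here. The function names (`dct2`, `zigzag_indices`, `encode_mask`), the 8×8 mask size, and the 16 kept coefficients are illustrative choices, not the paper's settings.

```python
import math


def dct2(mask):
    """Orthonormal 2D DCT-II of a square K x K array given as a list of lists."""
    k = len(mask)
    scale = [1 / math.sqrt(2) if w == 0 else 1.0 for w in range(k)]
    # basis[u][x] = cos((2x + 1) * u * pi / (2K))
    basis = [[math.cos((2 * x + 1) * u * math.pi / (2 * k)) for x in range(k)]
             for u in range(k)]
    return [[(2 / k) * scale[u] * scale[v]
             * sum(mask[x][y] * basis[u][x] * basis[v][y]
                   for x in range(k) for y in range(k))
             for v in range(k)]
            for u in range(k)]


def zigzag_indices(k):
    """(row, col) pairs of a K x K grid in JPEG-style zigzag order."""
    order = []
    for s in range(2 * k - 1):  # s = row + col selects one anti-diagonal
        diag = [(r, s - r) for r in range(k) if 0 <= s - r < k]
        order.extend(diag if s % 2 else reversed(diag))
    return order


def encode_mask(mask, n):
    """Keep the first n zigzag-ordered DCT coefficients as the mask vector."""
    coeffs = dct2(mask)
    return [coeffs[r][c] for r, c in zigzag_indices(len(mask))[:n]]


# Toy binary mask: a filled rectangle on an 8 x 8 grid (illustrative only).
K = 8
mask = [[1.0 if 2 <= x <= 5 and 1 <= y <= 6 else 0.0 for y in range(K)]
        for x in range(K)]
vec = encode_mask(mask, 16)

# With an orthonormal transform, energy is preserved (Parseval), so the share
# of energy captured by the kept coefficients can be measured directly.
total = sum(v * v for row in mask for v in row)
kept = sum(v * v for v in vec)
print(f"{len(vec)} of {K * K} coefficients keep {kept / total:.1%} of the energy")
```

Because the transform is orthonormal, the printed ratio is a direct measure of how much of the mask the truncated vector can represent. A production implementation would more likely rely on a fast transform from a numerical library than on this quadratic-time loop.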
During inference, there are three steps for the prediction of text instance masks. Firstly, we employ a segmented non-maximum suppression (S-NMS) strategy to obtain from the prediction , which will be elaborated in Sec. III-D. Secondly, we expand to -dimensions by padding zeros at the end, and then transform to by reshaping the size of to . Thirdly, can be generated by the two-dimensional inverse discrete cosine transform (IDCT):
Note that the computational cost of DCT and IDCT is negligible, since the complexity of computation through the fast cosine transform (FCT) is only .
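The inference-side decoding described above can be sketched in the same spirit. This is a hedged illustration that assumes an orthonormal 2D IDCT and a JPEG-style zigzag layout. The helper names, the 4×4 mask size, and the 0.5 binarization threshold are placeholders rather than values taken from the paper.

```python
import math


def zigzag_indices(k):
    """(row, col) pairs of a K x K grid in JPEG-style zigzag order."""
    order = []
    for s in range(2 * k - 1):
        diag = [(r, s - r) for r in range(k) if 0 <= s - r < k]
        order.extend(diag if s % 2 else reversed(diag))
    return order


def idct2(coeffs):
    """Inverse of the orthonormal 2D DCT-II for a square K x K coefficient grid."""
    k = len(coeffs)
    scale = [1 / math.sqrt(2) if w == 0 else 1.0 for w in range(k)]
    basis = [[math.cos((2 * x + 1) * u * math.pi / (2 * k)) for x in range(k)]
             for u in range(k)]
    return [[(2 / k) * sum(scale[u] * scale[v] * coeffs[u][v]
                           * basis[u][x] * basis[v][y]
                           for u in range(k) for v in range(k))
             for y in range(k)]
            for x in range(k)]


def decode_mask(vec, k, threshold=0.5):
    """Zero-pad the vector, undo the zigzag scan, apply the IDCT, then binarize."""
    padded = list(vec) + [0.0] * (k * k - len(vec))
    coeffs = [[0.0] * k for _ in range(k)]
    for value, (r, c) in zip(padded, zigzag_indices(k)):
        coeffs[r][c] = value
    soft = idct2(coeffs)
    return [[1 if v > threshold else 0 for v in row] for row in soft]


# A vector holding only the DC term decodes to a constant mask.
flat = decode_mask([4.0], k=4)
# Adding the first horizontal frequency makes the mask fade from left to right.
tilted = decode_mask([2.0, 2.0], k=4)
for row in tilted:
    print(row)
```

The two toy vectors show why low-frequency terms carry the coarse shape: the first coefficient sets the overall fill level, and each further coefficient adds a progressively finer spatial variation.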
III-C Single-Level Prediction
State-of-the-art regression based text detectors often adopt the divide-and-conquer strategy, which introduces an imbalanced supervision issue and leads to a more complex structure in single-stage detectors. To alleviate this issue, our TextDCT framework uses a single-level prediction structure. The two keys to single-level prediction are how to design the input feature of the head so that the head is scale-aware and spatial-aware, and how to design a positive sample assignment strategy.
III-C1 Feature Awareness Module
As shown in Fig. 2, we use the pyramid feature P3 as the input of the single-level head. Due to the diversity of text scale variations, the receptive field of P3 can only cover a limited scale range. As shown in Fig. 4(a), extremely long texts cannot be accurately regressed using P3 as the head’s input, so we design a feature awareness module (FAM).
The structure of FAM, shown in Fig. 2, consists of a skip connection and two deformable convolutions. Specifically, we first add a skip connection from P5 to enrich the contextual information of P3. Since a direct element-wise sum of P3 and upsampled P5 leads to misaligned contexts in the fused features, which may harm the prediction around text boundaries, we add a deformable convolution in the skip connection to learn offsets from the spatial differences between P3 and upsampled P5, so as to obtain an aligned feature with contextual information. Besides, considering that deformable convolution has the ability to focus on salient regions across various text appearances, we add a deformable convolution after the fused feature to capture more significant features. By obtaining features with rich contextual information and adaptively adjusting the receptive fields to achieve spatial-awareness and scale-awareness for the single-level head, our proposed FAM can overcome the limitations of the divide-and-conquer strategy.
III-C2 Positive Sampling Strategy
The definition of positive samples is crucial for text detection, and balanced positive and negative samples are beneficial for achieving accurate classification results. Most existing anchor-free single-stage detectors adopt the “center sampling” strategy to define positive samples. However, this strategy is not suitable for a single-level head, as potentially numerous ambiguous points harm the accuracy of regressing text shapes, as shown in Fig. 4(c). Besides, it is unreasonable to give the same number of positive samples to texts of different sizes. Therefore, we propose a text kernel sampling (TKS) strategy that treats the text kernel as the positive sample region, in which the text kernel can be obtained by the Vatti clipping algorithm. Since each point in the text kernel is located inside the text instance, there are few ambiguous samples, and the number of positive samples changes adaptively with the text scale. In this way, we can effectively alleviate the bias between texts of different scales during training. As shown in Fig. 4(d), our proposed TKS can generate more high-quality predictions.
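How the number of positive samples adapts to text scale can be sketched on axis-aligned boxes, where shrinking by a fixed offset needs no polygon clipper. The offset formula d = A(1 − r²)/L, with A the polygon area, L its perimeter, and r the shrinking rate, is the one commonly used with Vatti clipping in kernel-based detectors. That it matches the paper's setting is an assumption, as are the rate 0.5, the stride 8, and all function names.

```python
import math


def area_and_perimeter(polygon):
    """Shoelace area and perimeter of a simple polygon given as (x, y) vertices."""
    area, perimeter = 0.0, 0.0
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        area += x1 * y2 - x2 * y1
        perimeter += math.hypot(x2 - x1, y2 - y1)
    return abs(area) / 2.0, perimeter


def shrink_distance(polygon, rate):
    """Inward offset d = A * (1 - r^2) / L (assumed formula, see lead-in)."""
    area, perimeter = area_and_perimeter(polygon)
    return area * (1.0 - rate ** 2) / perimeter


def kernel_points(box, rate, stride=8):
    """Feature-map locations whose image-space centres fall inside the shrunk box.

    For an axis-aligned rectangle, offsetting every edge inward by d is the
    same as moving each side by d, so no general polygon clipper is needed.
    """
    x0, y0, x1, y1 = box
    d = shrink_distance([(x0, y0), (x1, y0), (x1, y1), (x0, y1)], rate)
    kx0, ky0, kx1, ky1 = x0 + d, y0 + d, x1 - d, y1 - d
    points = []
    for j in range(int(y1 // stride) + 1):
        for i in range(int(x1 // stride) + 1):
            cx, cy = i * stride + stride / 2, j * stride + stride / 2
            if kx0 <= cx <= kx1 and ky0 <= cy <= ky1:
                points.append((i, j))
    return points


small = kernel_points((0, 0, 64, 32), rate=0.5)
large = kernel_points((0, 0, 256, 128), rate=0.5)
print(len(small), len(large))
```

The larger box yields proportionally more positive locations than the smaller one, which is the adaptive behaviour the text attributes to TKS. Curved or rotated text polygons would need a real polygon-offsetting routine in place of the rectangle shortcut.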
III-D Segmented Non-Maximum Suppression
As shown in Fig. 2, a given image goes through our TextDCT, in which the classification scores predicted by the classification branch and the text boxes regressed by the box regression branch can be combined to obtain text box predictions. Typically, non-maximum suppression (NMS) is used to remove duplicated box predictions. However, the points near the text kernel edges of long text instances are generally far from the centers of the texts, making it hard to perceive the comprehensive text shape information accurately. Therefore, the regressed boxes corresponding to these points may be inaccurate and may have a low intersection over union (IoU) with the boxes corresponding to the points near the centers of the texts, which makes it difficult to be filtered by the NMS and thus leads to false positives, as illustrated in Fig. 5.
TextRay and ABCNet address this problem by using a center-weighted training strategy and adding a center-ness branch, respectively, both of which increase the training complexity. Instead, we adopt a segmented non-maximum suppression (S-NMS) strategy. In particular, since text kernels are used as positive samples, each text kernel can be easily distinguished. For each text kernel, we first select only the box corresponding to the point with the highest classification score, which allows us to directly filter out other boxes that may be of low quality without considering the IoU threshold. Then, we use NMS to filter the remaining potentially duplicate boxes.
Note that, as shown in Fig. 6, there are still duplicate boxes after the first step of S-NMS, because the widespread holes in the text may cause the text kernel to be split into multiple parts. However, because the regression task tends to focus on the edge parts, most of the points that belong to the same kernel but are split into different parts still regress to the same text instance, so the subsequent NMS step can remove these duplicates.
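The two S-NMS steps can be sketched as follows. This is a simplified illustration: it assumes each candidate point already carries a kernel id (for example from connected-component labelling of the predicted kernel map, which is an assumption here), and the boxes, scores, and IoU threshold are made-up values.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x0, y0, x1, y1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thr):
    """Greedy NMS; returns kept indices in descending score order."""
    keep = []
    for i in sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True):
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep


def segmented_nms(boxes, scores, kernel_ids, iou_thr=0.5):
    """Step 1: keep the top-scoring box per kernel. Step 2: NMS on the survivors."""
    best = {}
    for i, k in enumerate(kernel_ids):
        if k not in best or scores[i] > scores[best[k]]:
            best[k] = i
    candidates = sorted(best.values())
    kept = nms([boxes[i] for i in candidates],
               [scores[i] for i in candidates], iou_thr)
    return [candidates[i] for i in kept]


boxes = [
    (0, 0, 100, 10),   # 0: kernel 0, point near the text centre
    (60, 0, 100, 10),  # 1: kernel 0, edge point with a poorly regressed box
    (0, 0, 98, 10),    # 2: kernel 0, near-duplicate of box 0
    (1, 0, 100, 10),   # 3: kernel 1, same text, kernel split by a hole
    (0, 30, 50, 40),   # 4: kernel 2, a different text instance
]
scores = [0.9, 0.6, 0.8, 0.7, 0.85]
kernel_ids = [0, 0, 0, 1, 2]

print("NMS  :", sorted(nms(boxes, scores, 0.5)))
print("S-NMS:", sorted(segmented_nms(boxes, scores, kernel_ids, 0.5)))
```

In this toy case, box 1 overlaps box 0 with an IoU of only 0.4, so plain NMS keeps it as a false positive, whereas the per-kernel selection discards it before any IoU test. The split-kernel duplicate (box 3) is then removed by the ordinary NMS pass.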
III-E Full Loss Function
In our TextDCT framework, the full loss function is formulated as
where and are the trade-off factors of the loss function. , and denote classification loss, bounding box regression loss and mask vector regression loss of the three branches, respectively.
where we choose Dice Loss to optimize , and denote the -th pixel value in the ground-truth and the prediction of text kernels, respectively. Besides, we use GIoU loss for following ABCNet, where A, B are ground truth and predicted bounding boxes, respectively. C is the smallest convex box enclosing both A and B and .
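For readers who want a concrete reference for the two losses named above, a minimal sketch follows. The Dice loss is written in its squared-denominator form and the GIoU loss as 1 − GIoU with GIoU = IoU − |C \ (A ∪ B)| / |C|. Both are standard definitions, but whether they match the paper's exact variants is an assumption, as are the function names and the smoothing constant.

```python
def dice_loss(pred, target, eps=1e-6):
    """1 - 2*sum(p*t) / (sum(p^2) + sum(t^2)) over flattened kernel maps."""
    inter = sum(p * t for p, t in zip(pred, target))
    denom = sum(p * p for p in pred) + sum(t * t for t in target)
    return 1.0 - 2.0 * inter / (denom + eps)


def box_area(box):
    """Area of an axis-aligned box (x0, y0, x1, y1); zero if degenerate."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def giou_loss(a, b):
    """1 - GIoU for ground-truth box a and predicted box b."""
    inter = box_area((max(a[0], b[0]), max(a[1], b[1]),
                      min(a[2], b[2]), min(a[3], b[3])))
    union = box_area(a) + box_area(b) - inter
    # C: smallest axis-aligned box enclosing both a and b
    enclosing = box_area((min(a[0], b[0]), min(a[1], b[1]),
                          max(a[2], b[2]), max(a[3], b[3])))
    giou = inter / union - (enclosing - union) / enclosing
    return 1.0 - giou


print(dice_loss([0.9, 0.1, 0.8], [1.0, 0.0, 1.0]))
print(giou_loss((0, 0, 1, 1), (2, 0, 3, 1)))
```

Unlike a plain IoU loss, the GIoU term still provides a useful signal when the two boxes do not overlap, because the enclosing-box penalty grows with their separation.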
Our mask vector regression loss is defined as
where represent the -th element in ground-truth and prediction mask vectors, respectively. is the indicator function for text kernels. is the vector loss, in which we choose smooth-L1 loss for its effectiveness and stability in training.
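The mask vector regression loss can be sketched as a smooth-L1 penalty summed over vector elements and restricted to text kernel points by the indicator. Averaging over the number of positive points and using `beta = 1.0` are assumptions made for this illustration. The source only states that smooth-L1 is used together with an indicator function.

```python
def smooth_l1(x, beta=1.0):
    """Quadratic near zero, linear beyond beta."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta


def mask_vector_loss(pred_vecs, gt_vecs, is_kernel):
    """Smooth-L1 over vector elements, averaged over positive (kernel) points."""
    total, positives = 0.0, 0
    for pred, gt, positive in zip(pred_vecs, gt_vecs, is_kernel):
        if not positive:
            continue  # the indicator zeroes out non-kernel locations
        total += sum(smooth_l1(p - g) for p, g in zip(pred, gt))
        positives += 1
    return total / max(positives, 1)


pred = [[1.0, 0.5], [3.0, 0.0], [9.0, 9.0]]
gt = [[1.0, 0.0], [1.0, 0.0], [0.0, 0.0]]
print(mask_vector_loss(pred, gt, [True, True, False]))
```

The third point lies outside any text kernel, so its large error does not contribute to the loss.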
The post-processing of our TextDCT model consists of four steps. Firstly, the locations of the positive samples are obtained according to the classification threshold , and then their corresponding predicted text boxes and text mask vectors are obtained. Secondly, the S-NMS is utilized to filter the overlapped text boxes. Thirdly, the text mask vectors corresponding to the remaining text boxes are transformed to 2D text masks by IDCT, and then the spatial resolutions of these text masks are resized to the same sizes of the spatial resolutions of their corresponding text boxes. Finally, these resized text masks are binarized by a threshold , and then are projected into the complete map with the same spatial resolution as the input image based on the locations of their corresponding text boxes.
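The last two post-processing steps (resizing each decoded mask to its box, binarizing it, and projecting it onto an image-sized map) can be sketched as below. Nearest-neighbour resizing, integer box coordinates, the 0.5 threshold, and the function name are all assumptions. The paper does not specify the interpolation mode in this section.

```python
def paste_mask(soft_mask, box, image_hw, threshold=0.5):
    """Resize a K x K soft mask to `box`, binarize it, and paste it on a blank map.

    box is (x0, y0, x1, y1) in integer pixel coordinates, image_hw is (H, W).
    Nearest-neighbour lookup followed by thresholding gives the same result as
    nearest-neighbour resizing followed by binarization.
    """
    k = len(soft_mask)
    x0, y0, x1, y1 = box
    height, width = image_hw
    box_w, box_h = x1 - x0, y1 - y0
    canvas = [[0] * width for _ in range(height)]
    for y in range(max(y0, 0), min(y1, height)):
        for x in range(max(x0, 0), min(x1, width)):
            my = min(k - 1, (y - y0) * k // box_h)
            mx = min(k - 1, (x - x0) * k // box_w)
            if soft_mask[my][mx] > threshold:
                canvas[y][x] = 1
    return canvas


soft = [[0.9, 0.1],
        [0.9, 0.1]]
full_map = paste_mask(soft, box=(2, 1, 6, 5), image_hw=(6, 8))
for row in full_map:
    print("".join(str(v) for v in row))
```

Clipping the loop to the image bounds keeps boxes that extend past the border from raising index errors.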
IV-A Datasets and Settings
We evaluate our TextDCT on four popular text detection datasets: CTW1500, Total-Text, ICDAR2015, and MLT.
CTW1500 is a challenging dataset that contains irregular-shaped and multi-oriented texts. It consists of training images and test images. Texts in this dataset predominantly suffer from blurring, low resolution and perspective distortion, and the text regions are all annotated by key points.
Total-Text consists of images ( for training and for testing). It contains horizontal, multi-oriented, and curved texts. Unlike CTW1500, all texts in Total-Text are annotated by word-level polygons with an adaptive number of vertices.
ICDAR2015 consists of natural images for training and images for testing, including many multi-oriented, street-view text instances. The ground truth of each text is annotated with eight coordinates that enclose the text in clockwise order.
MLT was proposed for the ICDAR 2017 competition, and involves multi-lingual, multi-script and multi-oriented scene texts. It contains images for training ( training images and validation images) and images for testing.
IV-A2 Implementation Details
The backbone of our TextDCT framework is ResNet50 pre-trained on ImageNet. During training, stochastic gradient descent (SGD) is adopted as the optimizer with a batch size of for the CTW1500 and Total-Text datasets and for the MLT and ICDAR2015 datasets. The model is trained for up to iterations with an initial learning rate of , which is decreased to at the -th iteration. In addition, we employ a model pre-trained for iterations on the SynthText dataset. Following other methods [24, 20, 48], the text regions labeled as “DO NOT CARE” are ignored during training. Our TextDCT is trained on an NVIDIA Tesla V100 GPU. Training takes about hours on the CTW1500 and Total-Text datasets, and about hours on the ICDAR2015 and MLT datasets.
We set to , to , to , to , and the shrinking rate of the text kernels to for all datasets. Besides, we use the multi-scale training strategy, in which the short side of the input images is set to , , , , , , , , and , while the long side is maintained to on CTW1500 and Total-Text datasets, and is maintained to on ICDAR2015 and MLT datasets. In Eq. (3), , and . For data augmentation, we perform random horizontal flip, random crop, color jitter and contrast jitter for input images. In random cropping, we only crop the non-text regions, in which the crop area is larger than half of the original image area. During inference, the input images are resized as , , , and for CTW1500, Total-Text, ICDAR2015, and MLT, respectively. In the following sections, we omit for simplicity in the recall (R), precision (P), and F-measure (F) results.
IV-B Ablation Study
In this section, we conduct ablation studies on both Total-Text and CTW1500 datasets to validate the main modules of our TextDCT.
IV-B1 Input of the head
Table I shows the results of a baseline model using different pyramid feature layers on CTW1500 dataset. The baseline model is TextDCT without FAM and S-NMS, in which the first four convolutions of the head are not shared, and the positive sampling strategy is “center sampling”. Compared to multi-level head inputs, using only a single-level head has a significant performance drop, especially for the recall rate. This is mainly because the single-level head without FAM is not scale-aware and spatial-aware, and the “center sampling” strategy is not suitable for single-level prediction. Although the performance of using multi-level heads is higher than that of single-level head, the imbalanced supervision problem limits its performance. Besides, the F-measure of P3 is and higher than that of P4 and P5, respectively. It demonstrates that P3 makes a good trade-off between semantic information and boundary information compared to P4 and P5.
IV-B2 FAM and TKS
Table II shows the results of using FAM and TKS over the baseline model on CTW1500 and Total-Text datasets. We can see that the baseline model with FAM achieves a gain of on CTW1500 in terms of F-measure. In Fig. 7, the baseline model with FAM has high responses on all the text regions, while the baseline model highlights a few non-text regions and misses some text regions. This demonstrates that FAM is beneficial for accurately perceiving the overall layout of texts with different sizes and shapes. Besides, using TKS also significantly improves the model performance, with recall and F-measure improved by and on CTW1500, respectively. When both FAM and TKS are used, the performance is further improved on both CTW1500 and Total-Text datasets.
IV-B3 Components of FAM
Here we investigate the effectiveness of each component in FAM. FAM consists of a skip connection (SC) from P5 to P3 and two deformable convolutions. As shown in Table III, compared to TextDCT without FAM, adding only a skip connection from P5 to P3 has almost no effect on the performance. However, the combination of the skip connection and DCN1 improves recall and F-measure by 3.8 and 2.7, respectively, which shows that features with rich contextual information aligned by deformable convolution are helpful for detecting multi-scale texts. Besides, if only DCN2 is added after P3, the F-measure reaches 82.2, which shows that adaptively adjusting the receptive field to focus on important regions can significantly improve the performance of the single-level prediction model. We also implement a variant that uses both DCN2 and DCN3, in which DCN3 denotes replacing the convolution in FAM with a deformable convolution. We observe that DCN3 has almost no impact on the performance. Therefore, the performance gain from FAM is mainly due to the rich aligned contextual information and the adaptively adjusted receptive field, rather than the deformable convolution itself. When SC, DCN1, and DCN2 are combined, our TextDCT achieves the best performance.
IV-B4 DCT mask representation
To investigate the effectiveness of the DCT mask representation, we implement different variants on the CTW1500 dataset, as presented in Table IV. IoU refers to the intersection over union between reconstructed mask and ground-truth , which is used to evaluate the quality of the mask representation. Without the DCT mask representation, a high dimension makes the model difficult to optimize, leading to performance degradation. On the other hand, a low resolution leads to significant reconstruction errors, which makes it difficult for the model to fit complex text shapes accurately.
Firstly, by comparing the -dimensional mask vector without DCT and the -dimensional DCT vector under the same resolution of , we can see that the DCT mask representation, with its lower training complexity, obtains higher recall, precision, and F-measure results. Secondly, we compare the effect of different resolutions for the -dimensional DCT mask vector. As the resolution increases from to , the IoU increases from to , and the F-measure gains improvement. When the resolution is further increased to , the results are almost unchanged. Therefore, we set the resolution to in our TextDCT. Finally, we compare different dimensions of the DCT vectors. When the dimension is increased from to , the IoU only grows and the F-measure remains unchanged at . This is due to the energy concentration characteristic of DCT, in which the earlier dimensions are more important than the later dimensions for reconstructing masks.
IV-B5 Shared head convolutions
Here we investigate the effect of sharing head convolutions. The results on CTW1500 dataset are shown in Table V, in which “gap” refers to the absolute value of the difference between the classification score and the location score. The location score means the intersection over union between the predicted box and the ground-truth box. In Table V, we can see that only sharing the first four convolutions in the heads of the box regression and the mask regression branches has little effect on the performance. If further sharing the first four convolutions in the heads of both the classification and regression branches, the F-measure is improved from to . This is mainly due to the reduction of “gap”, resulting in a higher-quality regression to the positive sample with the highest classification score.
, COCO-Text, and ICDAR2019-MLT, respectively. indicates evaluating the performance on Total-Text with IoU@0.5, and the default evaluation metric of Total-Text is DetEval.
In Table VI, we filter the redundant boxes using NMS, S-NMS, and kernel-level NMS (K-NMS) in our TextDCT, respectively. K-NMS is a two-stage NMS, which first applies NMS separately to the regression boxes of each text kernel, and then applies NMS to the remaining boxes. Compared with NMS, K-NMS has a slightly different filtering order, resulting in slightly different final results ( vs. in terms of F-measure). Compared with NMS and K-NMS, S-NMS effectively improves the performance. Note that there may be adhesions between the predicted adjacent text kernels. Since S-NMS only selects the text box corresponding to the highest classification score for the adhesion region, the recall decreases slightly. Although the recall drops on CTW1500, S-NMS exceeds NMS by in terms of F-measure, which gives a convincing measurement by balancing precision and recall.
IV-C Comparison with State-of-the-Art Methods
For a fair comparison, we only evaluate our model on a single scale for all datasets. Furthermore, for the text spotting models[24, 23, 29], we only show the detection results without the recognition branch.
IV-C1 Evaluation on Long Curved Text Benchmark
The comparison results on long curved text dataset CTW1500 are given in Table VII, from which we can see that TextDCT achieves competitive performance and speed. Without pre-training, TextDCT can still achieve competitive results with , , and in recall, precision, and F-measure, respectively.
On one hand, TextDCT outperforms segmentation based methods[56, 20, 48, 47, 64] by at least in F-measure. These methods are more sensitive to noise and require pre-training on additional datasets to achieve good results. For example, the F-measure of PSENet after pre-training on the SynthText dataset is higher than that without pre-training. Although the precision of DBNet is higher than that of TextDCT, some small texts may be filtered out in the post-processing of DBNet, so its recall is much lower than that of our TextDCT ( vs. ).
On the other hand, among regression based methods, EAST only regresses quadrilaterals, so it cannot adapt to curved texts ( in F-measure). Although LOMO strives to achieve accurate text region representation through its iterative optimization module, TextDCT obtains higher performance ( vs. ). Moreover, compared with FCENet, TextDCT is more capable of modeling extremely long texts, which will be explored in detail in Sec. IV-D.
IV-C2 Evaluation on Curved Text Benchmark
We can observe that TextDCT at a single testing scale achieves competitive performance and speed ( in F-measure and 15.1 in FPS). As a single-stage regression based method, TextDCT significantly outperforms existing regression based methods[29, 61, 46]. As a two-stage method, TextSpotter performs end-to-end text detection and recognition by modifying the mask branch of Mask R-CNN, and it has much lower detection capability than the single-stage TextDCT ( vs. in F-measure). Compared with SPCNet, which is a two-stage regression based model with a text context module (TCM) and a re-score mechanism to improve model performance, TextDCT outperforms it by and in recall and F-measure, respectively. Besides, as a single-stage model with a single-level head, TextDCT has a much simpler pipeline than these two-stage models. CRAFT additionally uses character-level annotations to supervise the learning process, which greatly increases the difficulty of annotating datasets. Compared with CRAFT, our TextDCT is trained with only word-level supervision and shows advantages in F-measure.
Note that TextRay models text contours in polar coordinates and represents text contours by a finite number of contour points, which may have limited ability to model texts with highly curved shapes (see Fig. 1). However, TextDCT can model extremely curved texts well with DCT mask representation and achieve better results. Compared with  improving the discrimination of text feature representations through the multi-scale context aware feature aggregation module, TextDCT achieves much better performance ( vs. in F-measure) by our single-level framework and DCT mask representation. Besides, compared with the segmentation based methods[28, 56] that include complex post-processing and are susceptible to false positives, our TextDCT uses a simple pipeline and suppresses false positives through S-NMS, outperforming these methods by a large margin.
Some qualitative results on the Total-Text dataset are depicted in Fig. 8(b). As a lightweight single-shot regression-based detector, our TextDCT achieves satisfactory results across various degrees of curvature, aspect ratios, and complex backgrounds.
IV-C3 Evaluation on Oriented Text Benchmark
Test results on ICDAR2015 following the standard evaluation metric are shown in Table VIII. We can see that our TextDCT achieves , , and for recall, precision, and F-measure, respectively, without any extra datasets, and achieves satisfactory results ( in F-measure) when pre-trained on SynthText.
As shown in Table VIII, TextDCT surpasses most existing regression based methods[65, 7, 6, 51]. Specifically, compared with EAST, although both are single-stage models, our model can achieve spatial-awareness and scale-awareness through FAM, and has better detection capabilities for long texts in ICDAR2015 ( vs. ). R-Net achieves spatial-awareness and scale-awareness by a spatial relationship module (SPM) and a scale relationship module (SRM). However, TextDCT can achieve better performance since more accurate contours can be obtained by the mask regression branch and S-NMS eliminates some false positives. Compared with the two-stage anchor-based methods[6, 7], in which the latter contains a more complex network structure and sophisticated post-processing, and is sensitive to the anchor setting with more hyper-parameters, TextDCT achieves much better performance. Some example results on ICDAR2015 are shown in Fig. 8(c).
IV-C4 Evaluation on Multi-Lingual Benchmark
The comparison results on the MLT dataset, which evaluates the ability of multi-lingual text detection, are presented in Table VIII. Our TextDCT surpasses most existing methods such as LOMO, R-Net, and TextMountain. Besides, the example results on MLT in Fig. 8(d) show that our TextDCT can effectively detect multi-lingual texts.
IV-D DCT Mask vs. DFT Contour
Recently, FCENet adopted the DFT to encode text contours as compact vectors. To fairly compare our DCT mask representation with the DFT contour representation, we implement FCENet* by applying the idea of FCENet to our TextDCT framework. The only difference between FCENet* and TextDCT is that FCENet* uses a classification branch and a contour regression branch in the single-level head, and the box regression branch is discarded.
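To make the mask-side representation in this comparison concrete, the following sketch encodes a binary mask as a short vector of low-frequency 2D DCT coefficients and recovers the mask with the inverse transform. It is an illustration under stated assumptions rather than the TextDCT implementation: the mask is assumed to be square, the coefficients are selected as the top-left `keep × keep` block (a simplification; other selection orders are possible), and the names `dct_matrix`, `encode_mask`, and `decode_mask` are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis as an (n, n) matrix (rows are frequencies)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)  # DC row uses a different normalization
    return m

def encode_mask(mask, keep=8):
    """2D DCT of a square mask; keep the low-frequency keep x keep block."""
    d = dct_matrix(mask.shape[0])
    coeffs = d @ mask.astype(float) @ d.T
    return coeffs[:keep, :keep].ravel()

def decode_mask(vector, size, keep=8):
    """Zero-pad the kept coefficients, apply the inverse DCT, and binarize."""
    coeffs = np.zeros((size, size))
    coeffs[:keep, :keep] = vector.reshape(keep, keep)
    d = dct_matrix(size)
    return (d.T @ coeffs @ d) > 0.5

# Toy example: a 32 x 32 mask with a filled rectangle, encoded as 64 numbers.
mask = np.zeros((32, 32), dtype=bool)
mask[10:20, 4:28] = True
vector = encode_mask(mask, keep=8)
recovered = decode_mask(vector, size=32, keep=8)
print(vector.shape, recovered.shape)
```

Keeping all coefficients (`keep` equal to the mask size) reproduces the mask exactly, since the basis is orthonormal; a smaller `keep` trades contour detail for a shorter vector. With SciPy available, `scipy.fft.dctn(mask, norm="ortho")` computes the same forward transform.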
Since extremely long texts and extremely curved texts are the main challenges in scene text detection, we build a challenging subset with a total of samples from CTW1500, selecting the texts whose text mask area is less than half of the text box area or whose text box's longest edge is longer than three-quarters of the longest edge of the image. Qualitative and quantitative comparisons are shown in Fig. 9 and Table IX, respectively. Compared with our TextDCT, FCENet* tends to miss some corner pixels of long texts. Although the F-measure of FCENet* is higher than that of TextDCT under the evaluation protocol IOU@0.5, the F-measure of FCENet* drops more when the IOU threshold is increased from to . The F-measure of TextDCT is higher than that of FCENet* under the evaluation protocol IOU@0.8, mainly because TextDCT fits text contours more accurately than FCENet*.
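The selection rule for the challenging subset can be written as a small predicate. The sketch below is one possible reading of the criterion, not the authors' script: it assumes each text is given as a polygon of `(x, y)` vertices, takes the "text box" to be the axis-aligned bounding box of that polygon, and computes the mask area with the shoelace formula. The function names are hypothetical.

```python
def polygon_area(points):
    """Area of a simple polygon from its (x, y) vertices (shoelace formula)."""
    area = 0.0
    n = len(points)
    for j in range(n):
        x1, y1 = points[j]
        x2, y2 = points[(j + 1) % n]
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0

def is_challenging(points, image_w, image_h):
    """True if the text is highly curved or extremely long per the rule above."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    box_w = max(xs) - min(xs)   # axis-aligned bounding box (assumption)
    box_h = max(ys) - min(ys)
    highly_curved = polygon_area(points) < 0.5 * box_w * box_h
    extremely_long = max(box_w, box_h) > 0.75 * max(image_w, image_h)
    return highly_curved or extremely_long

# Made-up polygons in a 1000 x 600 image.
short_straight = [(0, 0), (100, 0), (100, 20), (0, 20)]
thin_diagonal = [(0, 0), (10, 0), (100, 90), (100, 100), (90, 100), (0, 10)]
very_long = [(0, 0), (800, 0), (800, 50), (0, 50)]
print(is_challenging(short_straight, 1000, 600))  # False
print(is_challenging(thin_diagonal, 1000, 600))   # True
print(is_challenging(very_long, 1000, 600))       # True
```

The area ratio serves as a cheap proxy for curvature: a strongly bent or diagonal text fills only a small fraction of its bounding box, whereas a straight horizontal text fills nearly all of it.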
IV-E Running Time Analysis
In the inference stage, the running time of our TextDCT includes network inference time and post-processing time, where post-processing mainly consists of S-NMS and IDCT. We set different input sizes for different datasets to achieve optimal model performance. The size of the input images and the number of text instances have a great impact on the model running time, as shown in Table X. We compute the average running time per image for each dataset on an NVIDIA Tesla V100 GPU. The results reported in Table X show that our TextDCT is able to detect arbitrary-shaped scene texts quickly.
| Dataset | Input size | Network Inference | Post-processing |
According to the above experimental results, our TextDCT performs well in most challenging scenarios and accurately detects highly curved texts. However, there are a few failure cases, as shown in Fig. 10. If the predicted text kernels stick to each other, some texts are missed because S-NMS only selects the point with the highest classification score for each kernel. Besides, some very text-like contexts are not filtered out by our model, which could be alleviated by hard example mining.
In this paper, we have proposed a novel arbitrary-shaped scene text detector named TextDCT. The geometric encodings can be effectively learned through FAM and TKS, and false positives can be effectively removed by S-NMS. To obtain high-quality and low-complexity text instances, we have applied DCT to encode text instances as compact vectors. Extensive experiments have shown that the proposed single-level prediction framework can effectively detect arbitrary-shaped scene texts.
In future work, the idea of mitigating the visual-semantic gap in  can be explored to suppress false positives like the failure cases in Fig. 10. For example, we can extend TextDCT to an end-to-end scene text spotting framework, in which the text recognition module is trained to recognize texts as well as to distinguish texts from text-like backgrounds. In this way, the detected text-like backgrounds can be removed based on the outputs of the text recognition module. Besides, our proposed idea of employing DCT to encode text masks as compact vectors is also promising for other object detection tasks.
-  (2019) Character region awareness for text detection. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 9365–9374. Cited by: §II-A, §IV-C2, TABLE VII, TABLE VIII.
-  (2017) Total-text: a comprehensive dataset for scene text detection and recognition. In Proc. Int. Conf. Document Anal. Recognit., Vol. 1, pp. 935–942. Cited by: 2nd item, §IV-C2, TABLE VII.
-  (2021) Disentangle your dense object detector. In Proc. ACM Int. Conf. Multimedia, pp. 4939–4948. Cited by: §I.
-  (2019) Icdar2019 robust reading challenge on arbitrary-shaped text-rrc-art. In Proc. Int. Conf. Document Anal. Recognit., pp. 1571–1576. Cited by: TABLE VII.
-  (2021) Comprehensive studies for arbitrary-shape scene text detection. arXiv preprint arXiv:2107.11800. Cited by: §I.
-  (2021) Accurate scene text detection via scale-aware data augmentation and shape similarity constraint. IEEE Trans. Multimedia, pp. 1883 – 1895. Cited by: §I, §IV-C3, TABLE VII.
-  (2019) Deep multi-scale context aware feature aggregation for curved scene text detection. IEEE Trans. Multimedia 22 (8), pp. 1969–1984. Cited by: §I, §IV-C2, §IV-C3, TABLE VIII.
-  (2021) Progressive contour regression for arbitrary-shape scene text detection. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 7393–7402. Cited by: §II-B, §II-C, TABLE VII.
-  (2018) Pixellink: detecting scene text via instance segmentation. In Proc. AAAI Conf. Artif. Intell., pp. 6773–6780. Cited by: §II-A.
-  (2016) Synthetic data for text localisation in natural images. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2315–2324. Cited by: §IV-A2, §IV-C1, TABLE VII.
-  (1985) A two-dimensional fast cosine transform. IEEE Trans. Acoust., Speech, Signal Process. 33 (6), pp. 1532–1539. Cited by: §III-B.
-  (2017) Mask r-cnn. In Proc. Int. Conf. Comput. Vis., pp. 2961–2969. Cited by: §IV-C2.
-  (2016) Deep residual learning for image recognition. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 770–778. Cited by: §III-A.
-  (2021) MOST: a multi-oriented scene text detector with localization refinement. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 8813–8822. Cited by: §II-B, TABLE VIII.
-  (2019) Textplace: visual place recognition and topological localization through reading scene texts. In Proc. Int. Conf. Comput. Vis., pp. 2861–2870. Cited by: §I.
-  (2015) ICDAR 2015 competition on robust reading. In Proc. 13th Int. Conf. Document Anal. Recognit., pp. 1156–1160. Cited by: 3rd item.
-  (2012) Imagenet classification with deep convolutional neural networks. In Proc. Adv. Neural Inf. Process. Syst., pp. 1097–1105. Cited by: §IV-A2.
-  (2017) Textboxes: a fast text detector with a single deep neural network. In Proc. AAAI Conf. Artif. Intell., pp. 4161–4167. Cited by: §II-B.
-  (2018) Textboxes++: a single-shot oriented scene text detector. IEEE Trans. Image Process. 27 (8), pp. 3676–3690. Cited by: §I, §II-B.
-  (2020) Real-time scene text detection with differentiable binarization. In Proc. AAAI Conf. Artif. Intell., pp. 11474–11481. Cited by: §I, §II-A, §IV-A2, §IV-C1, TABLE VII, TABLE VIII.
-  (2017) Feature pyramid networks for object detection. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2117–2125. Cited by: §I, §III-A.
-  (2016) Ssd: single shot multibox detector. In Proc. Eur. Conf. Comput. Vis., pp. 21–37. Cited by: §II-B.
-  (2018) Fots: fast oriented text spotting with a unified network. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 5676–5685. Cited by: §IV-C.
-  (2020) Abcnet: real-time scene text spotting with adaptive bezier-curve network. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 9809–9818. Cited by: §II-B, §II-C, §III-D, §III-E, §IV-A2, §IV-C, TABLE VII.
-  (2019) Arbitrarily shaped scene text detection with a mask tightness text detector. IEEE Trans. Image Process. 29, pp. 2918–2930. Cited by: TABLE VII.
-  (2019) Curved scene text detection via transverse and longitudinal sequence connection. Pattern Recognit. 90, pp. 337–345. Cited by: §II-B, 1st item, TABLE VII.
-  (2015) Fully convolutional networks for semantic segmentation. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 3431–3440. Cited by: §II-A.
-  (2018) Textsnake: a flexible representation for detecting text of arbitrary shapes. In Proc. Eur. Conf. Comput. Vis., pp. 20–36. Cited by: §II-A, §IV-C2, TABLE VII.
-  (2018) Mask textspotter: an end-to-end trainable neural network for spotting text with arbitrary shapes. In Proc. Eur. Conf. Comput. Vis., pp. 67–83. Cited by: §IV-C2, §IV-C.
-  (2018) Multi-oriented scene text detection via corner localization and region segmentation. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 7553–7563. Cited by: TABLE VIII.
-  (2021) ReLaText: exploiting visual relationships for arbitrary-shaped scene text detection with graph convolutional networks. Pattern Recognit. 111, pp. 107684. Cited by: TABLE VII.
-  (2018) Arbitrary-oriented scene text detection via rotation proposals. IEEE Trans. Multimedia 20 (11), pp. 3111–3122. Cited by: §II-B, TABLE VIII.
-  (2019) ICDAR2019 robust reading challenge on multi-lingual scene text detection and recognition—rrc-mlt-2019. In Proc. Int. Conf. Document Anal. Recognit., pp. 1582–1587. Cited by: TABLE VII.
-  (2017) Icdar2017 robust reading challenge on multi-lingual scene text detection and script identification-rrc-mlt. In Proc. 14th IAPR Int. Conf. Document Anal. Recognit., Vol. 1, pp. 1454–1459. Cited by: 4th item, TABLE VII.
-  (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Proc. Adv. Neural Inf. Process. Syst., pp. 91–99. Cited by: §II-B.
-  (2015) U-net: convolutional networks for biomedical image segmentation. In Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent., pp. 234–241. Cited by: §II-A.
-  (2021) Dct-mask: discrete cosine transform mask representation for instance segmentation. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 8720–8729. Cited by: §I.
-  (2019) Detecting dense and arbitrary-shaped scene text by instance-aware component grouping. Pattern Recognit. 96, pp. 106954. Cited by: §I.
-  (2019) Seglink++: detecting dense and arbitrary-shaped scene text by instance-aware component grouping. Pattern Recognit. 96, pp. 106954. Cited by: TABLE VII, TABLE VIII.
-  (2019) Learning shape-aware embedding for scene text detection. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 4234–4243. Cited by: §I.
-  (2016) Detecting text in natural image with connectionist text proposal network. In Proc. Eur. Conf. Comput. Vis., pp. 56–72. Cited by: §II-B.
-  (2020) Fcos: a simple and strong anchor-free object detector. IEEE Trans. Pattern Anal. Mach. Intell., pp. 1922 – 1933. Cited by: §I, §III-C2, §IV-B1.
-  (2019) Learning shape-aware embedding for scene text detection. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 4234–4243. Cited by: §II-A.
-  (1992) A generic solution to polygon clipping. Commun. ACM 35 (7), pp. 56–63. Cited by: §III-C2.
-  (2016) Coco-text: dataset and benchmark for text detection and recognition in natural images. arXiv preprint arXiv:1601.07140. Cited by: TABLE VII.
-  (2020) Textray: contour-based geometric modeling for arbitrary-shaped scene text detection. In Proc. ACM Int. Conf. Multimedia, pp. 111–119. Cited by: Fig. 1, 1(a), §II-B, §II-C, §III-D, §IV-C2, §IV-C2, TABLE VII.
-  (2019) Shape robust text detection with progressive scale expansion network. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 9336–9345. Cited by: §I, §II-A, §IV-C1, TABLE VII.
-  (2019) Efficient and accurate arbitrary-shaped text detection with pixel aggregation network. In Proc. Int. Conf. Comput. Vis., pp. 8440–8449. Cited by: §IV-A2, §IV-C1, TABLE VII.
-  (2019) Arbitrary shape scene text detection with adaptive text region representation. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 6449–6458. Cited by: TABLE VII, TABLE VIII.
-  (2020) Contournet: taking a further step toward accurate arbitrary-shaped scene text detection. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 11753–11762. Cited by: §II-C, TABLE VII, TABLE VIII.
-  (2020) R-net: a relationship network for efficient and accurate scene text detection. IEEE Trans. Multimedia 23, pp. 1316–1329. Cited by: §I, §IV-C3, §IV-C4, TABLE VIII.
-  (2021) Progressive hard-case mining across pyramid levels in object detection. arXiv preprint arXiv:2109.07217. Cited by: §I.
-  (2019) Scene text detection with supervised pyramid context network. In Proc. AAAI Conf. Artif. Intell., pp. 9038–9045. Cited by: §I, §II-C, §IV-C2, TABLE VIII.
-  (2021) Boundary-aware arbitrary-shaped scene text detector with learnable embedding network. IEEE Trans. Multimedia 24, pp. 3129 – 3143. Cited by: §I, TABLE VIII.
-  (2016) Text detection in stores using a repetition prior. In Proc. IEEE Winter Conf. Appl. Comput. Vis., pp. 1–9. Cited by: §I.
-  (2019) Textfield: learning a deep direction field for irregular scene text detection. IEEE Trans. Image Process. 28 (11), pp. 5566–5579. Cited by: §II-A, §IV-C1, §IV-C2, TABLE VII.
-  (2019) MSR: multi-scale shape regression for scene text detection. In Proc. Int. Joint Conf. Artif. Intell., pp. 989–995. Cited by: §I, TABLE VII.
-  (2020) Semantics-preserving graph propagation for zero-shot object detection. IEEE Trans. Image Process. 29, pp. 8163–8176. Cited by: §V.
-  (2021) Scene text detection with richer fused features. In Proc. Int. Joint Conf. Artif. Intell., pp. 7–15. Cited by: §II-C.
-  (2014) Scene text recognition in mobile applications by character descriptor and structure configuration. IEEE Trans. Image Process. 23 (7), pp. 2972–2982. Cited by: §I.
-  (2019) Look more than once: an accurate detector for text of arbitrary shapes. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 10552–10561. Cited by: §I, §II-B, §IV-C1, §IV-C2, §IV-C4, TABLE VII, TABLE VIII.
-  (2020) Deep relational reasoning graph network for arbitrary shape text detection. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 9696–9705. Cited by: §I, §II-B, TABLE VII, TABLE VIII.
-  (2020) OPMP: an omnidirectional pyramid mask proposal network for arbitrary-shape scene text detection. IEEE Trans. Multimedia 23, pp. 454–467. Cited by: §I, TABLE VII.
-  (2021) Adaptive boundary proposal network for arbitrary shape text detection. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1305–1314. Cited by: §II-A, §IV-C1, TABLE VII.
-  (2017) East: an efficient and accurate scene text detector. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 5551–5560. Cited by: §II-B, §IV-C1, §IV-C3.
-  (2021) Fourier contour embedding for arbitrary-shaped text detection. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 3123–3131. Cited by: §II-B, §IV-C1, §IV-D, TABLE VII, TABLE VIII.
-  (2021) Textmountain: accurate scene text detection via instance segmentation. Pattern Recognit. 110, pp. 107336. Cited by: §IV-C4, TABLE VIII.