Omnidirectional Scene Text Detection with Sequential-free Box Discretization (IJCAI 2019). Including competition model, online demo, etc.
Omnidirectional scene text detection has received increasing research attention. Previous methods directly predict words or text lines of quadrilateral shapes. However, most methods neglect the significance of consistent labeling, which is important for maintaining a stable training process, especially when a large amount of data is included. In this paper, we address this problem for the first time by proposing a novel method termed Sequential-free Box Discretization (SBD). SBD first discretizes the quadrilateral box into several key edges, which contain all potential horizontal and vertical positions. To decode accurate vertex positions, a simple yet effective matching procedure is proposed to reconstruct the quadrilateral bounding boxes. The method thereby avoids the learning ambiguity that otherwise has a significant influence on the learning process. Exhaustive ablation studies have been conducted to quantitatively validate the effectiveness of the proposed method. More importantly, building upon SBD, we provide a detailed analysis of the impact of a collection of refinements, in the hope of inspiring others to build state-of-the-art networks. Combining SBD and these refinements, we achieve state-of-the-art performance on various benchmarks, including ICDAR 2015 and MLT. Our method also won first place in the text detection task of the recent ICDAR2019 Robust Reading Challenge on Reading Chinese Text on Signboard, further demonstrating its generalization ability. Code is available at https://tinyurl.com/sbdnet.
Scene text detection in arbitrary orientations has received increasing attention in computer vision for its numerous potential applications, such as augmented reality and robot navigation. Moreover, scene text detection is the foundation and prerequisite for text recognition, which provides a reliable and straightforward approach to scene understanding. However, it remains an open problem, as text instances in natural images often appear with multiple orientations, low quality, perspective distortions, and various sizes and scales.
In the literature, a number of methods Jaderberg et al. (2016); Neumann and Matas (2012, 2015b, 2015a); Tian et al. (2015, 2016) have been developed for horizontal scene text detection. However, scene text in the wild usually appears in omnidirectional form, which has attracted numerous recent works Zhong et al. (2016); Liu and Jin (2017); Shi et al. (2017a); Xue et al. (2018, 2018); Xie et al. (2019a); Liu et al. (2019e); Liao et al. (2017, 2018a, 2018b); Liu et al. (2018); He et al. (2017b, c). These can be roughly categorized into two groups: segmentation based and regression based. The segmentation based methods usually try to group pixels into text instances by utilizing either Fully Convolutional Network (FCN) Long et al. (2015) or Mask R-CNN He et al. (2017a) architectures. These approaches commonly suffer from a number of limitations. First, segmented text instances often require several complicated post-processing steps. For example, the segmentation results from Mask R-CNN need to be fitted into rotated quadrilateral bounding boxes, a step that is easily affected by outliers or irregular segmentation boundaries. Second, these approaches often rely on a number of heuristic settings and geometric assumptions. These drawbacks limit their applicability in complicated scenarios.
Compared with segmentation based methods Zhang et al. (2016); Long et al. (2015); He et al. (2017a); Deng et al. (2018); Lyu et al. (2018b); Wu and Natarajan (2017); Wang et al. (2019b); He et al. (2016), regression based methods Zhu and Du (2018); Liu and Jin (2017); Xue et al. (2019); Liao et al. (2018b); Ma et al. (2018); Liao et al. (2018a); He et al. (2018b); Zhou et al. (2017); He et al. (2017c) are simpler and more straightforward. These methods directly predict vertex positions and have achieved promising results. However, they ignore the significance of consistent regression targets. Take EAST Zhou et al. (2017) as an example. Each feature within a text instance is responsible for regressing the corresponding quadrilateral bounding box by predicting four distances to the boundaries and a rotation angle from the viewpoint of that point. A pre-processing step is required to assign the regression targets. As shown in Figure 1, even a minor rotation can alter the regression targets dramatically. Such ambiguities can lead to an unstable training process, which greatly degrades performance. Our experiments show that the accuracy of EAST Zhou et al. (2017) dropped sharply (by more than 10%) when equipped with a random rotation technique for data augmentation, which is supposed to boost performance.
To address the problem, we propose a novel method termed Sequential-free Box Discretization (SBD), which disentangles the objective into two separate tasks: Key Edges Detection and Matching-Type Learning. The fundamental idea is to utilize invariant representations (e.g., minimum $x$, minimum $y$, maximum $x$, maximum $y$, the mean center point, and the intersection point of the diagonals) that are independent of the label sequence to inversely deduce the bounding box coordinates. To simplify the parameterization, SBD first locates all the discretized horizontal and vertical edges that contain a vertex. Then a sequence-labeling matching type is learned to find the best-fitting quadrilateral. By avoiding ambiguity in the training targets, our approach successfully improves performance when a large amount of rotated data is introduced.
In addition, we complement our method with a few key technical innovations that further enhance performance. We have conducted extensive experiments and ablation studies based on our method to explore the influence of six aspects (data arrangement, pre-processing, backbone, proposal generation, prediction head, and post-processing) and to determine the significance of the various components. Our hope is to save others effort and to provide some useful tips for designing state-of-the-art models. Building upon SBD and these refinements, we achieved first place in the Text Line Detection task of the ICDAR2019 Robust Reading Challenge on Reading Chinese Text on Signboard.
To summarize, our main contributions are listed as follows.
To our knowledge, we are the first to solve the text detection ambiguity in terms of the sequential order of the quadrilateral bounding box, which is of great importance for achieving good detection accuracy.
The flexibility of our proposed method allows it to make use of several key refinements that are critical to further boosting accuracy. Our method achieves state-of-the-art performance on various scene text benchmarks, including ICDAR 2015 Karatzas et al. (2015) and MLT Nayef et al. (2017). In addition, our method won the Text Detection task of the recent ICDAR2019 Robust Reading Challenge on Reading Chinese Text on Signboard.
Our method, with effective refinements, can also be generalized to ship detection in aerial images. The significant improvement in TIoU-Hmean further demonstrates the robustness of our approach.
Recent research suggests that the quadrilateral bounding box is one of the most important representations for multi-oriented scene text detection. However, introducing quadrilateral bounding boxes may cause issues for both current segmentation-based methods and non-segmentation-based methods.
Segmentation-based methods Zhang et al. (2016); Long et al. (2015); He et al. (2017a); Deng et al. (2018); Lyu et al. (2018b); Wu and Natarajan (2017); Wang et al. (2019b); He et al. (2016) usually require additional steps to group pixels into polygons, which involves time-consuming post-processing and is easily affected by outliers.
Non-segmentation-based methods Zhu and Du (2018); Xue et al. (2019); Liao et al. (2018b); Ma et al. (2018); Liao et al. (2018a); He et al. (2018b); Liu and Jin (2017); Zhou et al. (2017); He et al. (2017c) can directly learn the exact bounding box to localize text instances, but they are easily affected by the label sequence. Usually, they use a heuristic sequential protocol to alleviate the issue, but these solutions are not robust because the whole sequence may change completely under even tiny interference. To make this clear, we discuss some of the previous solutions as follows:
Given an annotation with the coordinates of 4 points, a common sequential protocol is to choose the point with minimum $x$ as the first point and then order the remaining points clockwise. However, this protocol is not robust. Take a horizontal rectangle as an example: under this protocol the first point is the top-left point and the fourth point is the bottom-left point. If the bottom-left point shifts leftward by one pixel (which is possible because of inconsistent labeling), the original fourth point becomes the first point and the whole sequence changes, resulting in an unstable learning relationship.
As shown in Figure 2(b), given four points, Textboxes++ Liao et al. (2018a) uses the distances between the annotation points and the vertices of the circumscribed horizontal rectangle to decide the sequence. However, if two annotation points are at the same distance from a vertex of the rectangle, a one-pixel rotation may also completely change the whole sequence.
QRN first finds the mean center point of the given 4 points and then constructs a Cartesian coordinate system. Using the positive x-axis as reference, QRN ranks the intersection angles of the four points and chooses the point with the minimum angle as the first point. But if the first point lies on the positive x-axis, a shift of one pixel upward or downward results in a totally different sequence.
Although these methods Liu and Jin (2017); Liao et al. (2018a); He et al. (2018b) can alleviate the confusion to some extent, the results can be significantly undermined when using pseudo samples with large rotation angles.
Different from these methods, our method is the first that can directly produce compact quadrilateral bounding boxes without complex post-processing, while completely avoiding the label confusion issue without any heuristic process.
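The fragility of the minimum-$x$ protocol discussed above can be illustrated with a minimal sketch. The function name `order_by_min_x` and the tie-breaking rule (smaller $y$ wins) are assumptions made for illustration, not part of any cited method's released code.

```python
import math

def order_by_min_x(points):
    """Order four (x, y) points: start at the point with minimum x
    (ties broken by smaller y, an assumption), then go clockwise
    as seen in image coordinates (y axis pointing down)."""
    cx = sum(p[0] for p in points) / 4.0
    cy = sum(p[1] for p in points) / 4.0
    # With y pointing down, increasing atan2 angle is visually clockwise.
    ring = sorted(points, key=lambda p: math.atan2(p[1] - cy, p[0] - cx))
    start = min(range(4), key=lambda i: (ring[i][0], ring[i][1]))
    return ring[start:] + ring[:start]

# A horizontal rectangle and the same rectangle with its bottom-left
# corner labeled one pixel further to the left.
rect = [(10, 10), (50, 10), (50, 30), (10, 30)]
noisy = [(10, 10), (50, 10), (50, 30), (9, 30)]

seq_rect = order_by_min_x(rect)
seq_noisy = order_by_min_x(noisy)
print(seq_rect)   # starts at the top-left corner
print(seq_noisy)  # starts at the bottom-left corner
```

A one-pixel change in a single coordinate shifts every slot of the sequence, so each regression output would be asked to learn a different corner for nearly identical inputs.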
Our proposed scene text detection system consists of three core components: Sequential-free Box Discretization (SBD) block, Match-Type Learning (MTL) block, and Re-scoring and Post Processing (RPP) block. Figure 3 illustrates the entire pipeline of the proposed detection framework, and more details are described in the following sections.
The purpose of omnidirectional scene text detection is to accurately localize textual content in the wild by generating outputs in the form of rectangular or quadrilateral bounding boxes. Compared to rectangular annotations, it is widely accepted that quadrilateral labels cover effective text regions more tightly, especially for rotated text. However, as introduced in Section 2, simply replacing rectangular bounding boxes with quadrilateral annotations can introduce inconsistency, because non-segmentation-based methods are sensitive to the label sequence. As shown in Figure 1, the detection model might fail to obtain accurate features for the corresponding points under a small disturbance, because the order of the points may change completely after rotating the target by a tiny angle. Therefore, instead of predicting sequence-sensitive distances or coordinates, the Sequential-free Box Discretization (SBD) block is proposed to discretize the quadrilateral box into 8 Key Edges (KEs) composed of order-irrelevant values, i.e., the minimum $x$ ($x_{min}$) and $y$ ($y_{min}$); the second smallest $x$ ($x_2$) and $y$ ($y_2$); the second largest $x$ ($x_3$) and $y$ ($y_3$); and the maximum $x$ ($x_{max}$) and $y$ ($y_{max}$) (see Figure 1). We use x-KEs and y-KEs in the following sections to represent [$x_{min}$, $x_2$, $x_3$, $x_{max}$] and [$y_{min}$, $y_2$, $y_3$, $y_{max}$] respectively.
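The order-independence of the Key Edges can be checked with a short sketch. The helper name `key_edges` is hypothetical; it simply sorts the four $x$ values and the four $y$ values of a quadrilateral, which is how the x-KEs and y-KEs are described above.

```python
from itertools import permutations

def key_edges(quad):
    """Return (x-KEs, y-KEs): the sorted x and y coordinates of the
    four vertices, independent of the order in which they are labeled."""
    xs = sorted(x for x, _ in quad)
    ys = sorted(y for _, y in quad)
    return xs, ys

quad = [(30, 10), (90, 25), (70, 60), (12, 40)]
x_kes, y_kes = key_edges(quad)

# Every one of the 24 possible labeling orders yields the same targets.
same_for_all_orders = all(
    key_edges(list(p)) == (x_kes, y_kes) for p in permutations(quad)
)
print(x_kes, y_kes, same_for_all_orders)
```

Because sorting discards the labeling order, the training targets stay stable no matter which vertex an annotator or a heuristic protocol designates as the first point.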
More specifically, the proposed approach is built upon the widely used generic object detection framework Mask R-CNN He et al. (2017a). As shown in Figure 4, the proposals processed by RoIAlign are fed into the SBD block, where the feature maps are forwarded through a series of convolution layers; these features are then upsampled by 2 bilinear upscaling layers, and the output feature maps from deconvolution are restricted to $M \times M$. Furthermore, two convolution kernels of shape $1 \times M$ and $M \times 1$ with 4 channels are employed to shrink the horizontal and vertical features for x-KEs and y-KEs respectively. Finally, the SBD model is trained by minimizing the cross-entropy loss over an $M$-way SoftMax output, where the corresponding positions of the ground-truth KEs are assigned to each output channel.
In practice, it is worth mentioning that SBD does not directly learn the x-KEs and y-KEs, due to the restriction of the RoI. Specifically, the original Mask R-CNN framework only learns to predict target objects inside the RoI areas, and it is unlikely to restore the missing pixels of object parts that lie outside the RoIs. To solve this problem, the x-KEs and y-KEs are encoded in the form of ‘half lines’ at training time. Suppose the x-KEs are $x_i \in \{x_{min}, x_2, x_3, x_{max}\}$ and the y-KEs are $y_i \in \{y_{min}, y_2, y_3, y_{max}\}$; then the ‘half lines’ are defined as follows:
$$x_{half} = \frac{x_i + x_{mean}}{2}, \qquad y_{half} = \frac{y_i + y_{mean}}{2}$$
where $x_{mean}$ and $y_{mean}$ represent the x-axis and y-axis values of the mean center point of the ground-truth bounding box. With this training strategy, the proposed SBD block can break the RoI restriction (see Figure 5): the integrity of the text instance is preserved because $x_{half}$ and $y_{half}$ fall inside the RoI in most cases, even if the border of the text instance lies outside the RoI.
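The following is a minimal sketch of the half-line encoding, under the assumption that each half line is the midpoint between a Key Edge and the mean center coordinate; the function names and the example RoI are hypothetical. It shows why the encoded targets tend to stay inside an RoI that clips the text instance, and that the original KEs can be recovered exactly.

```python
def encode_half_lines(kes, mean):
    """Assumed encoding: midpoint between each key edge and the mean."""
    return [(k + mean) / 2.0 for k in kes]

def decode_half_lines(half_lines, mean):
    """Invert the assumed encoding to recover the key edges."""
    return [2.0 * h - mean for h in half_lines]

# x-KEs of a text instance whose true extent is [10, 100].
x_kes = [10.0, 20.0, 80.0, 100.0]
x_mean = sum(x_kes) / len(x_kes)          # 52.5

# A hypothetical RoI that misses both ends of the instance.
roi_left, roi_right = 25.0, 85.0

x_half = encode_half_lines(x_kes, x_mean)
inside_roi = all(roi_left <= h <= roi_right for h in x_half)
recovered = decode_half_lines(x_half, x_mean)
print(x_half, inside_roi, recovered)
```

In this example the outermost KEs (10 and 100) fall outside the RoI, while all four encoded targets fall inside it, so the network is never asked to localize something it cannot see.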
Similar to Mask R-CNN, the overall detector is trained in a multi-task manner. Thus the loss function is composed of 4 parts:
$$L = L_{cls} + L_{box} + L_{mask} + L_{ke}$$
where the first three terms $L_{cls}$, $L_{box}$ and $L_{mask}$ follow the same settings as presented in He et al. (2017a), and $L_{ke}$ is the cross-entropy loss used for learning the Key Edges prediction task. The authors of He et al. (2017a) made the interesting observation that an additional keypoint branch can harm bounding box detection performance. However, based on our experiments (see the ablation study in Tables 3 and 4), the proposed SBD block is the key component that significantly boosts detection accuracy. Two reasons may explain this: a) unlike the keypoint detection task, which has to learn $M^2$ classes against each other, the number of competing positions in the SBD block is only $M$; b) compared to the training target in the keypoint detection task, which is relatively ill-posed due to the vague definition of the ground truth (neither a one-hot point nor a small area describes the target keypoint precisely), the KEs produced by SBD are more exclusive and thus provide better supervision for training the network.
It is noteworthy that the SBD block only learns to predict the numerical values of 8 KEs, but ignores the connection between x-KEs and y-KEs. Therefore, it is beneficial to design a proper matching procedure to reconstruct the quadrilateral bounding box from the key edges, otherwise the incorrect matching type may lead to completely unreasonable results (see Figure 6).
As described in Section 3.1, the SBD block outputs 4 x-KEs and 4 y-KEs. Each x-KE should be matched to exactly one of the y-KEs to construct a corner point. Then, the 4 constructed corner points are assembled into the final prediction, i.e., the quadrilateral bounding box. It is important to note that different orders of the corners produce different results; hence the total number of match-types between the x-KEs and y-KEs is simply $4! = 24$. An example of a predicted match-type is shown in Figure 6(a). Based on this, a simple yet effective module termed the Match-Type Learning (MTL) block is proposed to learn the connections between x-KEs and y-KEs. Specifically, as shown in Figure 4, the feature maps used for predicting the x-KEs and y-KEs are concatenated for classifying the match-types. In other words, the matching procedure is formulated as a 24-category classification task. In our method, the MTL head is trained by minimizing the cross-entropy loss, and the experiments demonstrate that it converges quickly.
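The 24 match-types can be enumerated as the permutations of four y-KE indices. The sketch below is illustrative only: the class numbering (the position of a permutation in `MATCH_TYPES`) and the helper names are assumptions, and it presumes the four $x$ values and four $y$ values are distinct.

```python
from itertools import permutations

# Assumed class numbering: index into the lexicographic list of the
# 4! = 24 ways to assign one y-KE to each x-KE.
MATCH_TYPES = list(permutations(range(4)))

def encode_match_type(quad):
    """Return (x-KEs, y-KEs, match-type id) for a quadrilateral."""
    xs = sorted(x for x, _ in quad)
    ys = sorted(y for _, y in quad)
    by_x = sorted(quad, key=lambda p: p[0])      # vertices in x-KE order
    perm = tuple(ys.index(y) for _, y in by_x)   # y rank of each vertex
    return xs, ys, MATCH_TYPES.index(perm)

def decode_corners(xs, ys, match_type):
    """Pair the i-th x-KE with the y-KE selected by the match-type."""
    perm = MATCH_TYPES[match_type]
    return [(xs[i], ys[perm[i]]) for i in range(4)]

quad = [(30, 10), (90, 25), (70, 60), (12, 40)]
xs, ys, mt = encode_match_type(quad)
corners = decode_corners(xs, ys, mt)
print(len(MATCH_TYPES), mt, corners)
```

The match-type depends only on which $x$ belongs with which $y$, so relabeling the same quadrilateral from a different starting vertex gives the same class.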
The fact that detectors can sometimes output high confidence scores for false positive samples has been a long-standing issue in the detection community, for both generic objects and text. One possible reason may be that the scoring head used in most of the current literature is supervised by a softmax loss, which is designed for classification rather than localization. Moreover, the classification score only considers whether the instance is foreground or background and is less sensitive to the compactness of the bounding box, which can also have a negative impact on the final performance.
Therefore, a confidence Re-scoring and Post Processing block, termed RPP, is proposed to suppress unreasonable false positives. Specifically, RPP adopts a policy similar to a multiple-experts system to reduce the risk of outputting high scores for negative samples. In RPP, an SBD score is first calculated from the 8 KEs (4 x-KEs and 4 y-KEs):
$$S_{SBD} = \frac{1}{K} \sum_{k=1}^{K} f(v_k)$$
where $K$ is the number of KEs. As shown in Figure 7(a), the distribution of $v_k$ exhibits a one-peak pattern in most cases; nonetheless, the peak value is still significantly lower than 1. Hence, for each key edge we add the 4 adjacent scores nearest to the peak to the peak value, to avoid an overly low confidence. Supposing $v_k = [v_{k,1}, \dots, v_{k,M}]$ is the output score vector of the $k$-th key edge and $p_k = \arg\max_m v_{k,m}$ is the position of its peak, the function $f$ sums the peak value and its neighbors:
$$f(v_k) = \sum_{m=\max(1,\,p_k-2)}^{\min(M,\,p_k+2)} v_{k,m}$$
Note that fewer than 4 adjacent values are available if the peak lies at the head or tail of the vector; in that case only the existing neighbors are counted. Finally, the refined confidence is obtained as a weighted combination of the SBD score and the original SoftMax confidence of the bounding box, with a weighting coefficient balancing the two. By counting the SBD score into the final score, the proposed detector draws on multiple agents (the 8 KE scores) and benefits from a tightness-aware confidence supervised by the KE prediction task.
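The re-scoring idea can be sketched as follows. The peak-sum window (two neighbors on each side of the peak) and the averaging over KEs follow the description above; the final combination is written here as a simple convex mix with a hypothetical weight `gamma`, which is an assumption and may differ from the exact formula used by the authors.

```python
def peak_sum(scores, radius=2):
    """Sum the peak value and up to `radius` neighbors on each side;
    fewer neighbors are used when the peak sits at the head or tail."""
    peak = max(range(len(scores)), key=scores.__getitem__)
    lo = max(0, peak - radius)
    hi = min(len(scores), peak + radius + 1)
    return sum(scores[lo:hi])

def sbd_score(ke_scores):
    """Average the peak sums over all key-edge score vectors."""
    return sum(peak_sum(v) for v in ke_scores) / len(ke_scores)

def rescore(box_conf, ke_scores, gamma=0.5):
    """Assumed combination: convex mix of box confidence and SBD score."""
    return (1.0 - gamma) * box_conf + gamma * sbd_score(ke_scores)

# One-peak pattern (typical true positive) vs. multi-peak pattern
# (typical hard negative); both vectors sum to 1 like a SoftMax output.
one_peak = [0.01, 0.04, 0.20, 0.50, 0.20, 0.04, 0.01]
multi_peak = [0.40, 0.05, 0.0, 0.0, 0.0, 0.0, 0.0, 0.15, 0.40]

good = rescore(0.9, [one_peak] * 8)
bad = rescore(0.9, [multi_peak] * 8)
print(round(good, 3), round(bad, 3))
```

With identical box confidence, the multi-peak instance ends up with a clearly lower final score, because its probability mass is split across distant positions and only one peak neighborhood is counted.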
It has been shown that detection performance can be boosted under a multi-task learning framework. For example, as shown in He et al. (2017a), simultaneously training a detection head with an instance segmentation head can significantly improve detection accuracy. Similarly, a segmentation head is employed in the proposed SBD network to predict the area inside the bounding box, which forces the model to regularize pixel-level features and enhances both performance and robustness. However, the segmentation head has some issues, as shown in Figure 8: a) the segmentation mask can sometimes produce false positive pixels while the SBD prediction is correct; b) the segmentation head fails to retain some positive samples that have been successfully detected by the SBD block. Therefore, compared to segmentation based approaches that directly reconstruct the bounding box from the segmentation mask, the MTL block can learn geometric constraints to avoid false positives caused by inaccurate segmentation output, which also reduces the heavy reliance on the segmentation task. To be specific, as shown in Figure 6(b), the blue dashed line corresponds to an invalid shape that violates the definition of a quadrilateral (each side should intersect the other sides only at its two end points). By simply removing these abnormal results, the MTL block can further eliminate some false positives that might fool the segmentation branch.
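The geometric constraint mentioned above (sides may meet only at their end points) amounts to rejecting self-intersecting quadrilaterals. A minimal sketch of such a check is given below; the function names are hypothetical, and it assumes the four corners are visited in the order implied by the decoded match-type.

```python
def _orient(a, b, c):
    """Signed area test: > 0 if c is to the left of the ray a -> b."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def _properly_cross(p1, p2, p3, p4):
    """True if segments p1p2 and p3p4 cross at an interior point."""
    d1 = _orient(p3, p4, p1)
    d2 = _orient(p3, p4, p2)
    d3 = _orient(p1, p2, p3)
    d4 = _orient(p1, p2, p4)
    return d1 * d2 < 0 and d3 * d4 < 0

def is_valid_quad(quad):
    """A quadrilateral a-b-c-d is valid if neither pair of opposite
    sides (ab vs. cd, bc vs. da) crosses."""
    a, b, c, d = quad
    return not (_properly_cross(a, b, c, d) or _properly_cross(b, c, d, a))

simple = [(0, 0), (4, 0), (4, 2), (0, 2)]   # ordinary rectangle
bow_tie = [(0, 0), (4, 2), (4, 0), (0, 2)]  # same corners, crossed order

print(is_valid_quad(simple), is_valid_quad(bow_tie))
```

Predictions whose decoded corners form a bow-tie can be dropped outright, independently of what the segmentation branch outputs.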
Another interesting observation is that the RPP block shows a strong capability for suppressing false positives and makes the predictions more reliable. To analyze this, we visualize the KE scores used in the RPP block (see Equation (5)) and find that the KE scores output by the SBD block mainly follow two typical patterns, as shown in Figure 7: a) a one-peak pattern; and b) a multi-peak pattern. In normal cases, the KE scores follow a regular pattern in which only one peak value is present in the output vector (see Figure 7(a)), while in hard negative samples two or more peak values appear (see Figure 7(b), 7(c), 7(d)). These multiple peaks share the confidence among them, since the total score is normalized to one. Therefore, based on Equations (3) and (5), the final score is lowered, which protects the proposed model from outputting high confidence for those false positive instances.
In this section, we conduct ablation studies on the ICDAR 2015 Karatzas et al. (2015) dataset to validate the effectiveness of each component of our method. First, we evaluate how the proposed modules influence performance. The results are shown in Table 3 and Figure 9. From Table 3, we can see that SBD and RPP lead to 2.4% and 0.6% improvements respectively in terms of Hmean. Figure 9 shows that our method substantially outperforms the baseline Mask R-CNN under different confidence thresholds of the detections, which further demonstrates its effectiveness. In addition, we compare the mask branch and the KE branch (including SBD and RPP) on the same network, i.e., keeping both branches but testing on only one of them. To this end, we simply use the provided training samples of IC15 without any data augmentation. The results are shown in Table 4 and demonstrate that the proposed modules effectively improve scene text detection performance.
More importantly, we also conduct experiments to demonstrate that introducing ambiguity into training is harmful to achieving good results. To be specific, we first trained Textboxes++ Liao et al. (2018a), EAST Zhou et al. (2017), CTD Liu et al. (2019d), and the proposed method with the original 1k training images of the ICDAR 2015 dataset. Then, we randomly rotated the training images over a set of angles and randomly picked an additional 2k images from the rotated dataset to finetune these models. The results are presented in Table 1. All the previous methods fail to account for the significance of consistent labeling, and their accuracy degrades dramatically when more data are included in the training process. Furthermore, as shown in Table 2, our proposed method demonstrates higher robustness under different rotation angles.
Table 3:
|Dataset||Method||Hmean|
|ICDAR2015||Mask R-CNN baseline||83.5%|
|ICDAR2015||Baseline + SBD||85.9% (+2.4%)|
|ICDAR2015||Baseline + SBD + RPP||86.5% (+3.0%)|
Table 4:
|KE branch without RPP||80.4% (+1.0%)|
|KE branch with RPP||81.0% (+1.6%)|
In this section, we provide a detailed analysis of the impact of refinements built on the proposed method, in order to evaluate the limits of our method and whether it can be complemented by existing modules. By accumulating the effective refinements, our method achieved first place in the detection task of the ICDAR 2019 Robust Reading Challenge on Reading Chinese Text on Signboard.
In the following sections, we present an extensive set of experiments that compare our baseline model, i.e., Mask R-CNN + KE branch, with alternative architectures and different strategies in six parts, including data arrangement, pre-processing, backbone, proposal generating, prediction head, and post-processing. The objective is to show that the proposed model corresponds to a local optimum in the space of architectures and parameters, and to evaluate the sensitivity of the final performance to each design choice. The following discussions follow the structure of Table 5. Note the significant breadth and exhaustivity of the following experiments represent more than 3,000 GPU-hours of training time.
The competition dataset, Reading Chinese Text on Signboard (ReCTS), is a practical and challenging omnidirectional natural scene text dataset with 25,000 signboard images. Of these, 20,000 images form the training set, with a total of 166,952 text instances, and the remaining 5,000 images form the test set. Examples from this dataset are shown in Figure 10. The layout and arrangement of Chinese characters in this dataset are clearly different from those in other benchmarks. As the function of a signboard is to attract customers, aesthetic designs are very common, and the Chinese characters can be arranged in any kind of layout with various fonts. In addition, characters from one word can have diverse orientations, fonts, or shapes, which aggravates the challenge. This dataset provides both text line and character annotations in order to inspire new algorithms that can take advantage of the arrangement of characters. To evaluate the function of each component, we split the original training set into 18,000 training images and 2,000 validation images.
Our model is implemented in PyTorch. Each experiment in this section uses a single network that is a variation of our baseline model (first row of Table 5). Each network is trained on the official training set of ReCTS unless otherwise specified. In addition, as the test scale may significantly influence the final detection result, the testing max size is fixed to 2000 pixels and the scale is fixed to 1400 pixels for strictly fair ablation experiments. The flip ratio is also fixed to 0.5. Results are reported on the validation set of ReCTS using the widely used main performance metric, Hmean. We also report the confidence threshold that leads to the best performance, which can itself reveal some important information.
For ablation studies, each network is trained for a fixed 80k iterations, with a batch size of 4 images per GPU on four 1080 Ti GPUs. The final cumulative model is trained for 160 epochs on four V100 GPUs, which takes approximately 6 days. The baseline model uses ResNet-101-FPN as the backbone, initialized from a model pretrained on the MLT Nayef et al. (2017) data. We now discuss the results of each ablation experiment from Table 5.
|Methods||Best threshold||Recall (%)||Precision (%)||Hmean (%)||ΔHmean|
|With data cleaning||0.93||77.7||80.3||79.0||-0.1|
|With only mlt pretrained data (100k iters)||0.97||53.4||56.1||54.7||-24.4|
|With only 60k pretrained data (200k iters)||0.81||50.8||61.0||55.5||-23.6|
|With defect data||0.91||75.8||72.5||74.1||-5.0|
|Without MLT data pretrain||0.85||75.5||81.9||78.6||-0.5|
|With 60k pretrained model||0.91||78.8||81.9||80.3||+1.2|
|With random crop (best ratio)||0.91||78.4||83.7||81.0||+1.9|
|With random rotate (best ratio)||0.91||77.6||81.8||79.7||+0.6|
|With color jittering||0.91||76.4||82.5||79.3||+0.2|
|With ASPP in KE head||0.91||76.1||80.1||78.0||-1.1|
|With ASPP in (backbone 1/16)||0.89||73.1||81.3||77.0||-2.1|
|With deformable convolution (C4-1)||0.87||79.5||83.9||81.7||+2.6|
|With deformable convolution (C4-2)||0.89||79.1||84.3||81.6||+2.5|
|With deformable convolution (C3-)||0.83||81.2||81.9||81.6||+2.5|
|With panoptic segmentation (dice loss)||0.67||77.7||80.3||79.0||-0.1|
|With pyramid attention network (PAN)||0.85||77.6||83.1||80.3||+1.2|
|With multi-scale network (MSN)||0.91||79.0||81.6||80.3||+1.2|
|With deformable PSROI pooling||0.91||80.7||79.4||80.0||+0.9|
|With character head||0.93||77.7||82.0||79.8||+0.7|
|With mask scoring||0.93||75.7||81.8||78.6||-0.5|
|With pyramid mask||0.91||78.3||80.0||79.1||0.0|
|With cascade r-cnn (ensemble)||-||77.7||80.3||79.0||-0.1|
|With polygonal non-maximum suppression||0.91||77.2||82.8||79.9||+0.8|
|With Key Edge RPP||0.91||78.5||79.9||79.2||+0.1|
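Hmean, the main metric reported in the table above, is the harmonic mean of precision and recall. The short sketch below (the function name `hmean` is chosen here for illustration) recomputes the metric for two rows of the table from their recall and precision columns.

```python
def hmean(recall, precision):
    """Harmonic mean of recall and precision (both in percent)."""
    if recall + precision == 0:
        return 0.0
    return 2.0 * recall * precision / (recall + precision)

# (recall %, precision %) taken from two rows of the ablation table.
rows = {
    "With data cleaning": (77.7, 80.3),
    "With random crop (best ratio)": (78.4, 83.7),
}

recomputed = {name: round(hmean(r, p), 1) for name, (r, p) in rows.items()}
print(recomputed)
```

Because recall and precision are themselves rounded to one decimal place in the table, a recomputed value can differ from the reported Hmean by 0.1 for some rows.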
Considering image diversity and the consistency and quality of the annotations, we collect a 60k dataset for pretraining. It consists of 30,000 images from the LSVT Sun et al. (2019) training set, 10,000 images from the MLT 2019 Nayef et al. (2019) training set, 5,603 images from ArT Chng et al. (2019), which contains all the images of SCUT-CTW1500 Liu et al. (2019d) and Total-text Chng and Chan (2017); Ch’ng et al. (2019), and a remaining 14,859 images selected from RCTW-17 Shi et al. (2017b), ICDAR 2015 Karatzas et al. (2015), ICDAR 2013 Karatzas et al. (2013), MSRA-TD500 Yao et al. (2012), COCO-Text Veit et al. (2016), and USTB-SV1K Yin et al. (2015). Note that we convert polygonal annotations to minimum-area rectangles for training.
The ablation results are shown in Table 5. If we use only the pretraining data without the split training data from ReCTS, the result on the ReCTS validation set is significantly worse than the baseline (even if the pretrained model is trained for more iterations). This is mainly because the diversity and annotation granularity of the selected pretraining dataset still differ greatly from those of the ReCTS dataset. However, starting from a model trained on the pretraining data is better than starting directly from an ImageNet model. For example, when the ImageNet ResNet-101 model is used instead of the MLT pretrained model of the baseline method, Hmean is reduced by 0.5%. If the model is pretrained on the 60k data and then finetuned on the split ReCTS training data, the result improves by 1.2% in terms of Hmean. In addition, to evaluate the importance of data quality, we mimic manual annotation errors by removing 5% of the training annotation instances and by leaving uncorrected some samples with annotation ambiguity in the original ReCTS training data. The results show that using defective training data significantly degrades performance.
Our baseline model uses pretrained model with only flip strategy for data augmentation. We compare the baseline to various different data augmentation methods.
Without introducing extra parameters or training/testing time, results shown in Table 5 demonstrate that both rotate and crop data augmentation strategies can improve the detection results. We further conduct sensitivity analysis of how the ratios of using these two strategies influence the performance, which are shown in Figure 11. From Figure 11(a), there are some useful findings as below.
Under appropriate ratios, all three rotation angles (30°, 15°, and 5°) outperform the baseline method at most ratios, by 0.5%, 0.6%, and 0.4%, respectively.
Under a rotation ratio of 0.1, the performance for all three rotation angles is worse than the baseline. This may be because the pseudo samples change the distribution of the original dataset, while so few pseudo samples are insufficient to improve generalization. On the other hand, the ratios that achieve the best results for the different rotation angles always lie between 0.3 and 0.8, which empirically suggests that a medium ratio may be a better choice for the rotation data augmentation strategy.
We can also see that performance with a rotation angle of 15° is consistently better than with 30° and 5°.
Compared to the rotation data augmentation strategy, the random cropping strategy improves detection performance significantly more. The best performance shown in Table 5 is a 1.9% improvement in Hmean over the baseline method. A sensitivity analysis is also conducted in Figure 11(b), from which we can see that as the crop ratio increases, performance also tends to improve, which suggests that always using the crop strategy benefits the detection results. Note that a crop ratio of 0.1 improves Hmean by only 0.5%, whereas the other ratios improve Hmean by more than 1%, similar to the behavior observed for a rotation ratio of 0.1.
We also conduct a simple ablation study to evaluate the performance of color jittering. Based on the same setting as the baseline method, we empirically set the ratios of brightness, contrast, saturation, and hue to 0.5, 0.5, 0.5, and 0.1, respectively. The ratio here represents the degree of perturbation of each specific transformation. The result in Table 5 shows that color jittering data augmentation slightly improves the result, by 0.2% in terms of Hmean.
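To make the meaning of the jitter ratios concrete, the sketch below samples one set of jitter factors and applies the brightness factor to a single pixel. It assumes a common convention in which a ratio r gives a multiplicative factor drawn uniformly from [1 - r, 1 + r] for brightness, contrast, and saturation, and an additive hue shift drawn from [-r, r]; whether the experiments above use exactly this convention is not stated, and the function names are hypothetical.

```python
import random


def sample_jitter_factors(brightness=0.5, contrast=0.5, saturation=0.5,
                          hue=0.1, rng=random):
    """Sample one set of colour-jitter factors.

    Assumption: multiplicative factors lie in [1 - r, 1 + r] (floored at 0)
    and the hue shift lies in [-hue, hue].
    """
    return {
        "brightness": rng.uniform(max(0.0, 1.0 - brightness), 1.0 + brightness),
        "contrast": rng.uniform(max(0.0, 1.0 - contrast), 1.0 + contrast),
        "saturation": rng.uniform(max(0.0, 1.0 - saturation), 1.0 + saturation),
        "hue": rng.uniform(-hue, hue),
    }


def adjust_brightness(pixel, factor):
    """Scale an (r, g, b) pixel with 0-255 channels and clamp to [0, 255]."""
    return tuple(min(255, max(0, round(c * factor))) for c in pixel)
```

Under this reading, a brightness ratio of 0.5 allows an image to become up to 50% darker or 50% brighter, while the smaller hue ratio of 0.1 keeps colour shifts mild.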
Training scale is important, especially for the scene text detection task. To evaluate how the training scale influences the results of our method, we use two parameters, a minimum-side size and a maximum-side limit, to control the training scale. The first resizes the minimum side of the image to the specified value (in our implementation, a set of values is used for random scaling), and the second restricts the maximum size of the sides of the image. The minimum-side size must be less than the maximum-side limit, and the whole scaling process strictly preserves the original aspect ratio. We mainly compare three different settings: (a) default training scale (minimum side: 560 to 920 with an interval of 40; maximum side: 1300); (b) medium training scale (minimum side: 680 to 1120 with an interval of 40; maximum side: 1800); and (c) large training scale (minimum side: 800 to 1400 with an interval of 40; maximum side: 2560). The results are shown in Table 6. They show that: (1) a larger training scale requires a larger testing scale for the best performance; and (2) the larger the training scale, the better the performance that can be achieved. Note that although a larger training scale can improve the performance, it is costly to train and may need significantly more GPU memory.
|(, )||Hmean (%)||Hmean (%)||Hmean|
A well-known hypothesis is that a deeper and wider network architecture performs better than a shallower and thinner one. However, naively increasing the network depth significantly increases the computational cost while yielding only a limited improvement. Therefore, we investigate different styles of backbone architecture. The results are shown in Table 5 and are summarized as follows:
By changing the backbone of the baseline model from ResNet-101-FPN to ResNeXt-152-32x8d-FPN-IN5k, the Hmean can be increased by 2.5%. Note that the ResNeXt-152-32x8d-FPN-IN5k model is pretrained on ImageNet by the Facebook Detectron framework and is then converted into a PyTorch model.
Atrous Spatial Pyramid Pooling (ASPP) Chen et al. (2017), which is known for increasing the receptive field, has proved effective in semantic segmentation. However, in this scene text detection task, using ASPP in the KE head or in the backbone reduces the performance by 1.1% and 2.1%, respectively. One possible reason is that a change of network architecture usually needs more iterations; however, the confidence thresholds that give the best performance with ASPP are 0.91 and 0.89, similar to the best threshold of the baseline model, which suggests that the network has already converged.
Deformable convolution Dai et al. (2017) is known as an effective module for many tasks. It adds 2D offsets to the regular sampling grid of the standard convolution, allowing free-form deformation of the convolutional operation. This suits scene text detection because of the highly variable appearance of text. We experiment with three ways of using deformable convolution, adding it starting from C4-1, C4-2, and C3 of the backbone, respectively. The results show that the performance is significantly improved in all cases, by 2.6%, 2.5%, and 2.5% in terms of Hmean, respectively.
Motivated by Panoptic Feature Pyramid Networks Kirillov et al. (2019), we also test whether a panoptic segmentation loss is useful for scene text detection. To this end, we apply a dice loss to the output of the FPN for the panoptic segmentation, which has two classes: background and text. The result in Table 5 shows that the Hmean decreases by 0.1%. Notably, the best threshold is 0.67, which suggests that background noise may reduce the confidence learned during the training procedure.
Pyramid Attention Network (PAN) Huang et al. (2019a) is a structure that combines the attention mechanism and a spatial pyramid to extract precise dense features for semantic segmentation tasks. Because it can effectively suppress false alarms caused by text-like backgrounds, we integrate it into the backbone and test its effect. The result shows that using PAN leads to a 1.2% improvement in terms of Hmean, but it also increases the computational cost, requiring an additional 2.4G of video memory.
Multi-Scale Network (MSN) Xue et al. (2019) is robust for scene text detection because it employs multiple network channels to extract and fuse features at different scales concurrently. In our experiment, integrating MSN into the backbone also increases the Hmean by 1.2%. Note that, compared to PAN, the recall of MSN is much higher even at a higher best threshold, which suggests that different architectures may contribute to the performance of the detector in different ways.
The proposed model is based on a two-stage framework, and the Region Proposal Network (RPN) Ren et al. (2015) is used as default proposal generating mechanism.
Previous studies usually modify the anchor generating mechanism to improve the result, such as DMPNet Liu and Jin (2017), DeRPN Xie et al. (2019b), Kmeans anchor Redmon and Farhadi (2017), scale-adaptive anchor Li et al. (2019), and guided anchor Wang et al. (2019a). For simplicity, we retain the default RPN structure, with the anchor boxes set according to statistics of the training set.
The other important part of the proposal generating stage is the sampling process, such as RoI Pooling Ren et al. (2015), RoI Align He et al. (2017a) (our default setting), and PSRoI Pooling Dai et al. (2016). We choose to evaluate Deformable PSRoI Pooling Dai et al. (2017) for our method, because it has proved effective for the scene text detection task Yang et al. (2018), and its flexible sampling may be beneficial to the proposed SBD. The result is shown in Table 5: using Deformable PSRoI Pooling improves the baseline method by 0.9% in terms of Hmean.
|Method||cf||R (%)||P (%)||H (%)||H|
The final part of a two-stage detection framework is the prediction head. To clearly evaluate the effectiveness of the components, ablation experiments are conducted separately on the different heads.
Empirically, online hard example mining (OHEM) Shrivastava et al. (2016) is not always effective across benchmarks; e.g., with the same framework and only the training data changed, it significantly improves the result on the ICDAR 2015 benchmark Karatzas et al. (2015) but reduces the result on the MLT benchmark Nayef et al. (2017). The result may be related to the data distribution, which is hard to trace. We thus test two versions of OHEM on the validation set. The first version, OHEMv1, is the same as the original implementation, while the second version, OHEMv2, simply ignores the top 5 hard examples to avoid outliers. Both versions use the same ratio of 0.25. The results in Table 5 show that both versions reduce the Hmean, by 0.7% and 0.8%, respectively. Note that using OHEM also lowers the best confidence threshold, which means that forcing the network to learn hard examples may reduce the confidence of normal examples. In addition, we evaluate the performance of Cascade R-CNN; the results are shown in Table 7. They show that using the cascade does not bring further improvement.
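The difference between the two OHEM variants can be illustrated with a small selection routine. This is a sketch under stated assumptions, not the implementation used in the experiments: it assumes the ratio of 0.25 is the fraction of candidate examples kept for back-propagation, and that OHEMv2 discards a fixed number of the highest-loss examples before selecting. The function name and the `skip_top` parameter are hypothetical.

```python
def select_hard_examples(losses, ratio=0.25, skip_top=0):
    """Return the indices of the examples kept for back-propagation.

    Examples are ranked by loss, hardest first. The `skip_top` hardest
    examples are dropped as potential outliers, and the next
    int(ratio * n) examples (at least one) are kept.

    Assumption: `ratio` is the fraction of examples kept.
    skip_top=0 mimics OHEMv1; skip_top=5 mimics OHEMv2.
    """
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    keep = max(1, int(len(losses) * ratio))
    return order[skip_top:skip_top + keep]
```

Skipping the very hardest examples trades some learning signal for robustness to mislabeled or ambiguous instances, which is the stated motivation for OHEMv2.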
To improve the mask head, we evaluate two methods, mask scoring Huang et al. (2019a) and pyramid mask Liu et al. (2019a), as shown in Table 5. The results show that these modifications to the mask head do not contribute to the detection performance. However, the mask prediction results are visually more compact and more accurate than those of the baseline.
|Baseline + character head||79.8||0.7|
It is commonly assumed that stronger supervision results in better performance. Because the competition also provides character-level ground truth, we build an auxiliary character head and evaluate its performance. The implementation of the character head is exactly the same as that of the box head except for the ground truth. Unlike the box, mask, and KE heads, the character head is built on a different RPN, i.e., it does not share proposals with the other heads. The KE head directly produces the quadrilateral bounding box (word box) that serves as the final detection, and we test whether the auxiliary head can indirectly (through the shared backbone) improve the word box detection performance. The ablation results in Table 8 support this idea: using a character head improves the Hmean by 0.7%. In addition, if we add a mask prediction head to the character head (mask character in Table 8), the result remains the same. Moreover, when we use a triplet loss to learn the connection between characters (the ground truth indicates whether characters belong to the same text instance), the improvement drops to 0.5%. This may be because the instance connection introduces an inconsistent labeling issue. We further test the performance of using only the character head (with instance connection) without the KE head. The Hmean is reduced significantly, by 3.9% compared to the baseline method, which suggests that using the character head as an auxiliary head rather than as the final prediction head is a suitable choice.
The last step is to apply post-processing methods for a final improvement. To this end, we compare the baseline to a series of standard and more effective post-processing methods.
Traditional NMS between horizontal rectangular bounding boxes may cause unnecessary suppression, and thus we conduct ablation experiments to evaluate the performance of PNMS. For a fair comparison, we use grid search to find the best threshold for both NMS and PNMS, which is 0.3 and 0.15, respectively. The result in Table 5 shows that PNMS is better than NMS by 0.8% in terms of Hmean. In addition, PNMS is much more effective when using test ensemble in practice.
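The idea of suppressing by polygon overlap rather than by horizontal-rectangle overlap can be sketched in pure Python as follows. This is a simplified illustration, not the implementation evaluated above: it assumes every detection is a convex polygon (so that Sutherland-Hodgman clipping yields the exact intersection), and all function names are hypothetical. The default threshold of 0.15 is the best PNMS threshold reported above.

```python
def polygon_area(poly):
    """Signed area by the shoelace formula (positive if counter-clockwise)."""
    area = 0.0
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        area += x1 * y2 - x2 * y1
    return area / 2.0


def ccw(poly):
    """Return the vertices in counter-clockwise order."""
    return list(poly) if polygon_area(poly) > 0 else list(reversed(poly))


def clip(subject, clipper):
    """Sutherland-Hodgman: clip `subject` by a convex, CCW `clipper`."""
    output = list(subject)
    for i in range(len(clipper)):
        if not output:
            break
        a, b = clipper[i], clipper[(i + 1) % len(clipper)]

        def side(p):
            # >= 0 when p lies on the inner (left) side of edge a -> b.
            return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

        points, output = output, []
        for j in range(len(points)):
            cur, prev = points[j], points[j - 1]
            s_cur, s_prev = side(cur), side(prev)
            if (s_cur >= 0) != (s_prev >= 0):
                # The segment crosses the clipping edge: add the crossing point.
                t = s_prev / (s_prev - s_cur)
                output.append((prev[0] + t * (cur[0] - prev[0]),
                               prev[1] + t * (cur[1] - prev[1])))
            if s_cur >= 0:
                output.append(cur)
    return output


def polygon_iou(p, q):
    """Intersection over union of two convex polygons."""
    p, q = ccw(p), ccw(q)
    inter_poly = clip(p, q)
    inter = abs(polygon_area(inter_poly)) if len(inter_poly) >= 3 else 0.0
    union = abs(polygon_area(p)) + abs(polygon_area(q)) - inter
    return inter / union if union > 0 else 0.0


def pnms(polygons, scores, iou_threshold=0.15):
    """Greedy polygon NMS; returns kept indices, highest score first."""
    order = sorted(range(len(polygons)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(polygon_iou(polygons[i], polygons[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```

For two slanted text lines that lie close together, the enclosing horizontal rectangles can overlap heavily even when the quadrilaterals themselves barely touch, which is why the polygon-based overlap avoids the unnecessary suppression mentioned above.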
The proposed key edge RPP has proved effective on the ICDAR 2015 benchmark, and thus we also test whether it is conducive to this competition dataset. The ablation result in Table 5 shows that it slightly improves the Hmean, by 0.1% compared to the baseline. It is worth noting that although the best confidence threshold is 0.91, the same as that of the baseline, the recall increases by 0.4% while the precision decreases by only 0.2%.
We also conduct experiments to evaluate how the testing scale influences the performance. The results are shown in Figure 12, which demonstrates that a proper setting of the minimum-side size and the maximum-side limit can significantly improve the detection performance. In addition, the results reveal that there is a limit to the testing scale: if it is raised beyond a certain value, the performance gradually decreases.
To evaluate the performance of test ensemble, we conduct ablation experiments on four different aspects: a) different backbone ensemble; b) multiple intermediate models ensemble; c) multi-scale ensemble; and d) independent models ensemble. Note that, to achieve the best performance, implementing ensemble or multi-scale testing requires some tricks; otherwise the results can be worse. We summarize them as follows.
Using a high confidence threshold. One weakness of multi-scale ensembling is that if a false positive detection exists at one of the testing scales, it cannot be avoided unless we set a high confidence threshold to exclude it in the ensemble phase. Therefore, for each scale, we first find its best confidence threshold (cf) on the validation set, and then use a higher confidence threshold for the model ensemble.
Varying the scale in multi-scale testing. The performance of a small scale (600 for the minimum side, 1200 for the maximum side), e.g., in the ReCTS competition, is much worse than that of a large scale (1600, 1600). However, a small scale is better than a large scale at detecting large instances, and the two complement each other in practice.
Using a strict PNMS threshold. A common outcome of the ensemble is that the recall improves significantly whereas the precision drops dramatically. When inspecting the final integrated detection boxes, it is easy to see that the reduction is caused by box-in-box detections and many stacked redundant boxes. Using a strict PNMS threshold can effectively solve this issue.
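The three principles can be combined into a short merging routine. The sketch below is illustrative only: the names `ensemble`, `best_cf`, `cf_margin`, and `suppress` are hypothetical, the size of the confidence margin is an assumption (no specific value is given above), and `suppress` stands for any suppression routine, such as a PNMS with a strict threshold, that maps boxes and scores to the indices to keep.

```python
def ensemble(per_scale_detections, best_cf, cf_margin, suppress):
    """Merge detections from several testing scales.

    per_scale_detections: dict mapping a scale name to a list of
        (box, score) pairs.
    best_cf: dict mapping the same scale names to the best confidence
        threshold found on the validation set.
    cf_margin: extra confidence required in the ensemble phase
        (hypothetical parameter; its value is an assumption).
    suppress: callable (boxes, scores) -> list of kept indices, e.g. a
        PNMS with a strict overlap threshold.
    """
    merged = []
    for scale, detections in per_scale_detections.items():
        # Principle 1: use a higher confidence threshold than the
        # per-scale optimum to filter out false positives.
        threshold = best_cf[scale] + cf_margin
        merged.extend((box, score) for box, score in detections
                      if score >= threshold)
    if not merged:
        return []
    boxes = [box for box, _ in merged]
    scores = [score for _, score in merged]
    # Principle 3: remove box-in-box and stacked duplicates.
    return [merged[i] for i in suppress(boxes, scores)]
```

Principle 2 enters through the choice of scales in `per_scale_detections`: small and large testing scales contribute complementary detections before the final suppression.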
|Method||Backbone ensemble||Intermediate model ensemble||Multi-scale ensemble (, )||Model ensemble|
|Ensemble||def-C4-1 & def-C4-2||def-C4-1 & defC3||def-C4-1 & def-C4-2 & def-C3||x152-60k & x152-70k & x152-80k||(600, 1600) & (1200, 1600) & (1600, 1600)||M1 & M2|
|Tian et al. Tian et al. (2016)||52.0||74.0||61.0|
|Shi et al. Shi et al. (2017a)||76.8||73.1||75.0|
|Liu et al. Liu and Jin (2017)||68.2||73.2||70.6|
|Zhou et al. Zhou et al. (2017)||73.5||83.6||78.2|
|Ma et al. Ma et al. (2018)||73.23||82.17||77.4|
|Hu et al. Hu et al. (2017)||77.0||79.3||78.2|
|Liao et al. Liao et al. (2018b)||79.0||85.6||82.2|
|Deng et al. Deng et al. (2018)||82.0||85.5||83.7|
|Ma et al. Ma et al. (2018)||82.2||73.2||77.4|
|Lyu et al. Lyu et al. (2018b)||79.7||89.5||84.3|
|He et al. He et al. (2017c)||80.0||82.0||81.0|
|Xu et al. Xu et al. (2019)||80.5||84.3||82.4|
|Tang et al. Tang et al. (2019)||80.3||83.7||82.0|
|Wang et al. Wang et al. (2019b)||84.5||86.9||85.7|
|Xie et al. Xie et al. (2019a)||85.8||88.7||87.2|
|Zhang et al. Zhang et al. (2019)||83.5||91.3||87.2|
|Liu et al. Liu et al. (2018)||87.92||91.85||89.84|
|Baek et al. Baek et al. (2019)||84.3||89.8||86.9|
|Huang et al. Huang et al. (2019b)||81.5||90.8||85.9|
|Zhong et al. Zhong et al. (2019b)||80.12||87.81||83.78|
|He et al. He et al. (2018a)||86.0||87.0||87.0|
|Liu et al. Liu et al. (2019b)||87.6||86.6||87.1|
|Liao et al. Liao et al. (2018a)||78.5||87.8||82.9|
|Long et al. Long et al. (2018)||80.4||84.9||82.6|
|He et al. He et al. (2020)||79.68||92.0||85.4|
|Lyu et al. Lyu et al. (2018a)||81.0||91.6||86.0|
|He et al. He et al. (2017b)||73.0||80.0||77.0|
|Liao et al. Liao et al. (2019)||87.3||86.6||87.0|
|Wang et al. Wang et al. (2019c)||81.9||84.0||82.9|
|Wang et al. Wang et al. (2019d)||86.0||89.2||87.6|
|Qin et al. Qin et al. (2019)||87.96||91.67||89.78|
|Feng et al. Wei et al. (2019)||83.75||92.45||87.88|
|Liu et al. Liu et al. (2019e)||83.8||89.4||86.5|
Based on these principles, we summarize the results of the four ensemble aspects as follows.
Different backbone ensemble. We train three models using the baseline setting with three ways of deformable convolution, starting from C4-1, C4-2, and C3 of ResNet-101, respectively. The ensemble results of the three models are shown in Table 9. From the table, we can see that integrating models that differ only by simple modifications of the backbone can improve the detection performance even on top of a relatively high baseline.
In addition, the result shows that integrating more components results in better performance.
Multiple intermediate models ensemble. We also evaluate the performance of integrating intermediate models. We use the model trained with the ResNeXt-152 backbone as a strong baseline and select the last three intermediate checkpoints (at intervals of 10k iterations) for the ensemble. The result in Table 9 demonstrates that, through model ensemble, the intermediate models also complement each other.
Multi-scale ensemble. To evaluate the performance of multi-scale ensemble, we use grid search to find the best PNMS threshold for three specified settings of (minimum side, maximum side), representing large, medium, and small text instances, respectively. Each detection result is then integrated with a PNMS threshold 0.02 higher than its original best threshold, which gives an approximately optimal integrated result, with a 0.6% improvement in terms of Hmean, as shown in Table 9.
Independent models ensemble. Lastly, we test the performance of integrating two final models. The first model is the baseline setting plus deformable convolution, and the second model is the baseline setting with the x152 backbone. We first integrate each model independently by intermediate model ensemble and multi-scale ensemble; we then ensemble the final results of the two models. As shown in Table 9, the detection result can still be improved.
|linkage-ER-Flow Nayef et al. (2017)||25.59||44.48||32.49|
|TH-DL Nayef et al. (2017)||34.78||67.75||45.97|
|SARI FDU RRPN v2 Ma et al. (2018)||67.0||55.0||61.0|
|SARI FDU RRPN v1 Ma et al. (2018)||55.5||71.17||62.37|
|Sensetime OCR Nayef et al. (2017)||69.0||67.75||45.97|
|SCUT_DLVClab1 Liu and Jin (2017)||62.3||80.28||64.96|
|AF-RNN Zhong et al. (2019a)||66.0||75.0||70.0|
|Lyu et al. Lyu et al. (2018b)||70.6||74.3||72.4|
|FOTS Liu et al. (2018)||62.3||81.86||70.75|
|CRAFT Baek et al. (2019)||68.2||80.6||73.9|
|Liu et al. Liu et al. (2019e)||70.1||83.6||76.3|
|Affiliation||Detection Result||End-to-End Result|
|Recall (%)||Precision (%)||Hmean (%)||Recall (%)||Precision (%)||Hmean (%)||1-NED (%)|
|Tian et al.||93.46||92.59||93.03||92.49||93.49||92.99||81.45|
|Liu et al.||93.41||91.62||92.50||-||-||-||-|
|Zhu et al.||93.51||89.15||91.27||92.36||91.87||92.12||79.38|
|Mei et al.||91.96||90.09||91.02||-||-||-||-|
|Li et al.||90.03||91.65||90.83||90.80||90.26||90.53||73.43|
|Zheng et al.||89.84||91.41||90.62||-||-||-||-|
|Zhou et al.||90.99||89.59||90.28||90.99||89.59||90.28||74.35|
|Zhang et al.||93.66||86.35||89.86||93.62||87.22||90.30||76.60|
|Zhao et al.||86.13||92.72||89.31||86.12||92.73||89.30||72.76|
|Xu et al.||-||-||-||91.54||90.28||90.91||71.89|
|Wang et al.||88.92||88.70||88.80||88.89||88.92||88.91||71.81|
|Baek et al.||85.33||89.38||87.31||75.89||78.44||77.14||41.68|
|Wang et al.||84.67||89.53||87.03||84.64||89.56||87.03||71.10|
|Wang et al.||-||-||-||69.49||89.52||78.24||50.36|
|Li et al.||82.27||88.49||85.27||-||-||-||-|
|Xu et al.||88.52||79.32||83.66||-||-||-||-|
|Lu et al.||85.18||79.66||82.33||-||-||-||-|
|Ma et al.||83.16||80.77||81.94||-||-||-||-|
|Tian et al.||96.17||69.20||80.48||-||-||-||-|
|Feng et al.||73.05||78.35||75.61||-||-||-||-|
|Luan et al.||70.35||80.19||74.95||-||-||-||-|
|Yang et al.||60.66||90.87||72.76||-||-||-||-|
|Liu et al.||66.83||75.87||71.07||-||-||-||-|
|Zhou et al.||72.54||56.44||63.48||-||-||-||-|
|Liu et al.||7.82||8.14||7.98||-||-||-||-|
To further evaluate the effectiveness of the proposed method, we conduct experiments and compare our final model with other state-of-the-art methods on three scene text datasets: ICDAR 2015 Karatzas et al. (2015), MLT Nayef et al. (2017), and ReCTS, described in Section 4.2.1. We also conduct an experiment on an aerial dataset, HRSC2016 Liu et al. (2017), to further demonstrate the generalization ability of our method.
ICDAR 2015 Incidental Scene Text Karatzas et al. (2015) is one of the most popular benchmarks for oriented scene text detection. The images are captured incidentally, mainly on streets and in shopping malls, and thus the challenges of this dataset lie in oriented, small, and low-resolution text. The dataset contains 1k training samples and 500 testing samples, with about 2k content-recognizable quadrilateral word-level bounding boxes. The results on ICDAR 2015 are given in Table 10, from which we can see that our method outperforms all previous methods.
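The comparison tables report recall (R), precision (P), and Hmean (H), where Hmean is the harmonic mean of recall and precision. As a small worked example, the helper below (the name `hmean` is ours) reproduces the H value of two rows of the ICDAR 2015 comparison from their reported R and P.

```python
def hmean(recall, precision):
    """Harmonic mean of recall and precision (both in percent)."""
    if recall + precision == 0:
        return 0.0
    return 2.0 * recall * precision / (recall + precision)


# Zhou et al. (2017): R = 73.5, P = 83.6
print(round(hmean(73.5, 83.6), 1))  # 78.2
# Liao et al. (2018b): R = 79.0, P = 85.6
print(round(hmean(79.0, 85.6), 1))  # 82.2
```

Because the harmonic mean is dominated by the smaller of the two values, a method cannot reach a high Hmean by trading away recall for precision or vice versa.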
ICDAR 2017 MLT Nayef et al. (2017) is the largest multi-lingual (9 languages) oriented scene text dataset, including 7.2k training samples, 1.8k validation samples, and 9k testing samples. The challenges of this dataset are manifold: 1) different languages have different annotating styles, e.g., most of the Chinese annotations are long (there is no specific word interval in a Chinese sentence) while most of the English annotations are short; 2) the annotations of Bangla or Arabic may be frequently entwined with each other; 3) there is more multi-oriented and perspective-distorted text on various complex backgrounds; 4) many images have more than 50 text instances. All instances are well annotated with compact quadrangles. As shown in Table 11, the proposed approach achieves the best performance on the MLT dataset.
ReCTS is the recent ICDAR 2019 Robust Reading Challenge (https://rrc.cvc.uab.es/?ch=12&com=introduction) described in Section 4.2.1. Competitors are restricted to submitting at most 5 results, and all results are evaluated only after the deadline. The competition attracted numerous competitors from well-known universities and high-tech companies. The results on ReCTS are shown in Table 12. Our method achieves first place in the ReCTS detection competition.
ReCTS End-to-End. One of the main goals of scene text detection is to recognize the text instances, and recognition is highly dependent on the performance of the detection system. To validate the effectiveness and robustness of our detection method, we build a recognition system by incorporating several state-of-the-art methods. All the models are trained on both real images and 600k extra synthetic images Jaderberg et al. (2016). During testing, we first crop the images according to the detected quadrilateral boxes and use the crops as inputs. The predictions with the highest confidence are then selected as the final End-to-End results. Quantitative and qualitative results are presented in Table 12 and Figure 14(b), respectively.
HRSC2016. To demonstrate the generalization ability of our method, we further evaluate its performance on the Level 1 task of the HRSC2016 dataset Liu et al. (2017), which shows our method's performance on multi-directional object detection. The ship instances in this dataset appear in various orientations, and the bounding boxes are annotated as rotated rectangles. There are 436, 181, and 444 images in the training, validation, and testing sets, respectively. Only the training and validation sets are used for training. The evaluation metric is the same as in Liu et al. (2019e); Karatzas et al. (2015). The result is shown in Table 13, with a significant improvement in TIoU-Hmean Liu et al. (2019c), demonstrating the robustness of our method. Qualitative examples of the detection results are shown in Figure 13.
|Algorithms||R (%)||P (%)||H (%)||TIoU-H (%)||mAP|
|Girshick (2015); Liao et al. (2018b)||-||-||-||-||55.7|
|Girshick (2015); Liao et al. (2018b)||-||-||-||-||69.6|
|Girshick (2015); Liao et al. (2018b)||-||-||-||-||75.7|
|Liao et al. (2018b)||-||-||-||-||84.3|
|Liu et al. Liu et al. (2019e)||94.8||46.0||61.96||51.1||93.7|
In this paper, we have addressed omnidirectional scene text detection with an effective SBD method. Using a discretization methodology, SBD solves the inconsistent labeling issue by discretizing the point-wise prediction into sequential-free key edges. To decode accurate vertex positions, we propose a simple but effective MTL method to reconstruct the quadrilateral bounding box. Benefiting from SBD, we can improve the reliability of the bounding box confidence and adopt more effective post-processing methods to improve performance.
In addition, based on our method, we conduct extensive ablation studies on six aspects, including data arrangement, pre-processing, backbone, proposal generating, prediction head, and post-processing, in order to explore the potential upper limit of our method. By accumulating the effective modules, we achieve state-of-the-art results on various benchmarks and win first place in the recent ICDAR 2019 Robust Reading Challenge on Reading Chinese Text on Signboard. Moreover, by using a recognition model, we also perform the best in the End-to-End detection and recognition task, demonstrating that our method is conducive to current recognition methods. To test the generalization ability, we conduct an experiment on the oriented general object dataset HRSC2016, and the results further show that our method outperforms recent state-of-the-art methods by a large margin.
Text-attentional convolutional neural network for scene text detection. IEEE Trans. Image Process. 25 (6), pp. 2529–2541.
Real-time lexicon-free scene text localization and recognition. IEEE Trans. Pattern Anal. Mach. Intell. 38 (9), pp. 1872–1885.