Scene text detection [tian2016detecting, liao2017textboxes, zhou2017east, textsnake, masktextspotter], which aims to locate text in the wild, has attracted much attention in recent years because of its numerous applications, e.g., instant translation, image retrieval, and scene parsing. Since text instances in real-world scenarios usually have various sizes, orientations and shapes, most scene text detection methods [psenet, textsnake, wang2019efficient] utilize polygon annotations to train robust text detectors. Although a polygon is more accurate than an upright bounding box (in this paper, "bounding box" refers to an upright bounding box defined by two points), the labeling cost of polygons is extremely high, limiting its large-scale extension in real-world applications.
To address the above-mentioned issue, in this paper, we propose a novel system termed SelfText Beyond Polygon (SBP) for unconstrained scene text detection with extremely low data collection and labeling costs, i.e., with limited bounding-box annotated data and a larger amount of unlabeled data. Our motivations are inspired by the following observations. First, although bounding box annotation is limited in localization accuracy, it is still a cheap way of labeling that roughly distinguishes foreground from background. As illustrated in Fig. 1 (a), according to the Amazon Mechanical Turk (MTurk), upright bounding box annotations are about 4× cheaper than polygon annotations [totaltext, yuliang2017detecting]. Thus, how to effectively utilize bounding box annotations to boost detection accuracy becomes critical in such a situation. Second, even if upright bounding box annotation is adopted, the cost is still very high once the data volume breaks through the million range. Fortunately, data collection is usually much cheaper than labeling as the amount of data increases, as shown in Fig. 1 (b). Thus, how to progressively mine the valuable information hidden in massive unlabeled data to further improve the accuracy of the detector is another problem addressed in this paper.
In practice, we deal with the above problems by proposing two novel schemes named Bounding Box Supervision (BBS) and Dynamic Self-Training (DST) to efficiently use a limited set of bounding-box annotated data and massive unlabeled data, respectively. For the former, we first apply synthetic data, whose character-level annotations can be generated automatically for free, to pre-train a newly proposed Skeleton Attention Segmentation Network (SASN). By exploiting text data with various geometric properties ( e.g., straight and curved text ) in the training phase, the SASN can effectively capture changes of text appearance across different scenes. Then, bounding box annotations are used to crop text regions in the real images, which are fed into the SASN to generate high-quality polygon-like pseudo labels. By splicing all of these local pseudo labels, the global label is obtained and can be used to train any text detector. For the latter, we propose a Dynamic Self-Training (DST) scheme, which adopts filtering and multi-scale inference strategies to reduce the effect of false positives and false negatives in the pseudo labels of the unlabeled data. In addition, a dynamic mixed training strategy is used to limit the use of pseudo labels in the late stage of training. As shown in Fig. 1 (c), by using BBS and DST, PSENet [psenet] achieves a lower data cost than using polygon annotation, and obtains F-score improvements over directly using bounding box annotation on Total-Text [totaltext]. The main contributions are four-fold:
(1) We introduce a Bounding Box Supervision strategy, which utilizes free synthetic data and box-level annotations to generate high-quality polygon-like annotations on real data (i.e., pseudo labels).
(2) We introduce a Skeleton Attention Segmentation Network for capturing the structure of the texts with arbitrary shapes and restraining noise interference.
(3) We propose a stable self-training scheme termed Dynamic Self-Training, which utilizes large-scale unlabeled data and limited labeled data to train robust detectors.
(4) The proposed SelfText Beyond Polygon is simple but surprisingly effective and practical, saving 70%+ of the cost (rough estimation) while achieving almost the same performance as fully-supervised methods.
2 Related Work
2.1 Supervised Text Detection
Scene text detection has achieved remarkable progress in the deep learning era. Previous methods [liao2017textboxes, zhang2016multi] focused on horizontal text detection. CTPN [tian2016detecting] adopted Faster RCNN [ren2015faster] and modified the RPN to detect texts. Many subsequent methods [rrpn, RRD] aim at multi-oriented text detection. EAST [zhou2017east] used an FCN [FCN] to predict the text score map, distance map and angle map in an anchor-free manner. Recent works focus on curved text detection [baek2019character, xu2019textfield]. TextSnake [textsnake] modeled a curved text instance as a series of ordered disks along a text center line. SPCNet [spcnet] and Mask Text Spotter [masktextspotter] are based on Mask RCNN and cast curved text detection as an instance segmentation problem. PSENet [psenet] and PAN [wang2019efficient] treat a text instance as kernels of different scales and reconstruct the whole text instance in post-processing.
2.2 Weak Supervised Object Detection
To alleviate the expensive data cost, various weakly supervised object detection (WSOD) works [wang2017learning] rely on image-level annotations. Bilen et al. [bilen2016weakly] proposed an architecture called WSDDN to select and classify region proposals simultaneously. ContextLocNet [kantorov2016contextlocnet] further improves WSDDN by taking contextual regions into consideration. Beyond that, some works tried to utilize the temporal information in videos. Kwak et al. [kwak2015unsupervised] discovered object appearance representations across videos and tracked the objects in the temporal space for supervision. Wang et al. [wang2015unsupervised] performed unsupervised tracking on videos and clustered similar features. Recently, Yang et al. [yang2019activity] and Kim et al. [kim2020tell] proposed activity-driven methods, which exploit action classes as contextual information to localize objects. Different from the above methods, we develop a novel pipeline for weakly supervised text detection, which generates polygon-like pseudo labels from bounding-box annotations (2 points) instead of polygon annotations (4-20 points) to train networks.
2.3 Self-Training with Noisy Labels
Self-training, where a teacher model generates labels on the unlabeled set for a student to train on, is a widely used method in semi-supervised learning. Xie et al. [xie2020self] used strong data augmentation in self-training, but only for image classification. Ilija et al. [radosavovic2018data] proposed a data distillation method for instance-level detection tasks, which votes on unlabeled data to obtain pseudo labels. Recently, noisy labels have been shown to have positive effects in self-training for object detection [zoph2020rethinking], which motivates our work. In this work, we introduce a novel dynamic self-training approach to minimize the negative effects of false negatives and false positives in pseudo labels.
3 Methodology
As shown in Fig. 2, we describe the details of the proposed SelfText Beyond Polygon (SBP), which has two key components: Bounding Box Supervision (BBS) and Dynamic Self-Training (DST).
3.1 Bounding Box Supervision
Following segmentation with bounding boxes [dai2015boxsup], we propose a weakly supervised method, termed Bounding Box Supervision (BBS), which achieves competitive accuracy with low-cost bounding box annotations for text detection. As shown in Fig. 2 (a), BBS utilizes the newly proposed Skeleton Attention Segmentation Network (SASN), shown in Fig. 3, to generate polygon pseudo labels from the given bounding box annotations. The process comprises three steps. (1) In the first step ( see the blue arrows in Fig. 2 ), we train the SASN with free synthetic data based on character annotations. (2) In the second step ( see the red arrows in Fig. 2 ), the box-level annotations are used to crop the real images, and the crops are fed into the SASN to generate polygon-like pseudo labels. (3) By splicing all of the local pseudo labels, the global one is obtained. In this way, bounding box annotations can be converted into high-quality polygon pseudo labels. Detectors ( e.g., PSENet [psenet] ) trained on these pseudo labels achieve almost the same performance as those trained on polygon annotations.
3.1.1 Skeleton Attention Segmentation Network
Network Architecture. As presented in Fig. 3 (a), we use ResNet50 [he2016deep] as the backbone network of the SASN, and extract three levels of features from different downsampled scales. After that, the skeleton stream fuses them to predict the skeleton attention map. At the same time, the regular stream segments the text by applying the Skeleton Attention Module. Although text segmentation in a bounding box is easier than in the whole image, some noise still exists, e.g., two texts in one bounding box, as illustrated by the green circle in Fig. 3 (a). To restrain such noise interference and segment the body text precisely, we carefully design the Skeleton Attention Module, which generates the skeleton map to drive the attention mechanism.
Skeleton Attention. Fig. 3 (b) illustrates the details of the Skeleton Attention Module (SA). Given an input sample , where and denote the -th image and its labels, we use to denote the -th pixel of the -th training image, with = for a background pixel and = for a text pixel. We also define the text skeleton ground truth as a soft label. For the -th pixel in the text region of the -th image, we first calculate the shortest distance between the -th pixel and its nearest background pixel, and then define the soft skeleton label of the -th pixel by normalizing this distance to [0, 1]:
where is the maximum value of the distance in the -th image.
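The soft skeleton label described above can be sketched as follows. This is a brute-force illustration, not the paper's implementation (which would typically use a fast distance transform):

```python
import numpy as np

def soft_skeleton_label(mask):
    """Soft skeleton label sketch: each text pixel stores its shortest
    Euclidean distance to a background pixel, normalized to [0, 1] by the
    per-image maximum. Brute force for clarity, not efficiency."""
    bg = np.argwhere(mask == 0)                     # background coordinates
    label = np.zeros(mask.shape, dtype=np.float64)
    for y, x in np.argwhere(mask == 1):             # text pixels only
        label[y, x] = np.sqrt(((bg - (y, x)) ** 2).sum(axis=1)).min()
    m = label.max()
    return label / m if m > 0 else label

# Toy 3-pixel-thick text bar: center-line pixels are farthest from the
# background, so they receive the largest soft labels.
mask = np.zeros((5, 9), dtype=np.int64)
mask[1:4, 1:8] = 1
soft = soft_skeleton_label(mask)
```

Pixels on the bar's center line get soft label 1.0, while boundary text pixels get smaller values, matching the intuition stated below.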
Intuitively, pixels close to the skeleton of a text instance should have greater values than boundary pixels ( see the text skeleton label in Fig. 3 (a) ). Since the soft label is a decimal representing the degree of distance, it is incompatible with the commonly used binary cross-entropy loss. Besides, the L1 and L2 losses are not sensitive to the distance distribution [ren2015faster]. Therefore, to handle the soft label, we modify the cross-entropy loss into a “soft” form. For a pixel , denotes the value of the -th pixel in the ground-truth skeleton map.
denotes the neural network. Thus, the text skeleton loss in Fig. 3 (a) is defined as follows:
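Since the exact loss formula is not reproduced in this text, the following is a minimal sketch of a "soft" cross-entropy against a continuous target in [0, 1], the standard way to extend binary cross-entropy to soft labels:

```python
import numpy as np

def soft_bce(pred, soft_target, eps=1e-7):
    """Cross-entropy against a soft (continuous) target in [0, 1].
    pred and soft_target are arrays of per-pixel values."""
    p = np.clip(pred, eps, 1.0 - eps)   # avoid log(0)
    return float(-np.mean(soft_target * np.log(p)
                          + (1.0 - soft_target) * np.log(1.0 - p)))
```

Unlike hard binary cross-entropy, this loss is minimized when the prediction matches the soft target exactly, which is the behavior the skeleton stream needs.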
In practice, the predicted skeleton map is downsampled to obtain multi-scale maps ( i.e., 1/4, 1/8, 1/16 ), which are transferred to the regular stream ( the thick yellow arrow in Fig. 3 (a) ). Then, the extracted feature map and the skeleton map of the corresponding scale are taken as the input of the Skeleton Attention Module. The two feature maps before and after the skeleton attention are concatenated and fed into a channel attention block, which outputs the refined feature map. Note that the proposed Skeleton Attention Module is shared by the high-level and low-level feature maps.
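The fusion described above can be sketched as follows. The spatial weighting and concatenation follow the text; the channel-attention internals (mean pooling plus a sigmoid) are simplified assumptions standing in for the paper's channel attention block:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def skeleton_attention(feat, skel):
    """feat: (C, H, W) feature map; skel: (H, W) skeleton map in [0, 1].
    Returns a (2C, H, W) refined map: features before and after the
    skeleton attention are concatenated and re-weighted channel-wise."""
    attended = feat * skel[None, :, :]        # spatial attention by the skeleton
    cat = np.concatenate([feat, attended], axis=0)
    w = sigmoid(cat.mean(axis=(1, 2)))        # toy channel-attention weights
    return cat * w[:, None, None]

# Toy usage: with an all-zero skeleton map, the attended half is suppressed.
feat = np.ones((4, 8, 8))
skel = np.zeros((8, 8))
refined = skeleton_attention(feat, skel)
```

In a real network the channel weights would come from learned layers; the point here is only the data flow: multiply, concatenate, re-weight.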
3.1.2 Suitable Synthetic Datasets
Existing real-world text datasets can be divided into two types, depending on the shape of the text instances: straight ( e.g., ICDAR2015 [karatzas2015icdar] ) and curved ( e.g., Total-Text [totaltext] ). However, curved synthetic text data is scarce. SynthText [gupta2016synthetic], the most widely used synthetic dataset, does not contain curved text, which leads to a serious domain gap between synthetic data and real data ( e.g., Total-Text ) in the curved text distribution, as shown in Fig. 4. Therefore, we also adopt Curved SynthText [long2019rethinking] to align the data distribution with curved datasets. In this work, we use SynthText [gupta2016synthetic] to train the SASN for straight text lines and Curved SynthText [long2019rethinking] to match curved text lines.
3.2 Dynamic Self Training for Text Detection
Inspired by previous works [xie2020self, zoph2020rethinking], we propose a novel self-training strategy, termed Dynamic Self Training, to reduce label cost by exploiting large-scale unlabeled data.
3.2.1 Overview of Dynamic Self-Training
Fig. 2 (b) gives an overview of the proposed dynamic self-training scheme. In this case, we have a large set of unlabeled images and a limited set of labeled images . Here can denote either the polygon-like labels generated by BBS or the manually labeled ground truth. We first use the labeled images to train an initial detector . Then we apply the detector to generate multi-scale prediction results as the foreground maps of the unlabeled images . Once is obtained, we can further calculate the background maps by filtering false negatives, as shown in Fig. 5. Specifically, false negatives are filtered by thresholding the distance score of each negative-sample pixel, which is calculated by a distance transform function [bradski2008learning] and edge detection ( i.e., Canny [ding2001canny] ). By leveraging the information from and , we can generate high-quality pseudo labels for the unlabeled data; please refer to Sec. 3.2.2 for more details. After that, we retrain the model on both and with dynamic mixed training, adaptively adjusting the number of unlabeled images in a mini-batch. Sec. 3.2.3 gives the detailed training process.
Algorithm 1 summarizes the scheme of DST. represents the gradient edge detector [ding2001canny], which is utilized to filter text regions with regular gradient changes. represents a distance transform function [bradski2008learning]. and refer to the multi-scale inference and the scale set, respectively. The shortest distance between the -th pixel and the gradient edge ( i.e., the white line in Fig. 5 (b) ) is calculated by the distance transform function. A pixel with a larger distance is usually farther from the gradient edge, and thus has a lower probability of being foreground.
3.2.2 High-quality Pseudo Label Generation
Multi-Scale Inference. As shown in Algorithm 1, we use multi-scale inference to exploit as many hard positive samples as possible. When generating pseudo labels , the unlabeled image is resized to a pre-defined set of scales to generate multi-scale predictions, where denotes the shorter side of an image and the scale unit is set as 32 in practice. Then, Locality-Aware NMS [zhou2017east] is used to ensemble the multi-scale inference results of each unlabeled image. Such an aggregation scheme often obtains a prediction superior to any of the model's single-scale predictions, especially for text detectors without FPN [lin2017feature].
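A possible reconstruction of the scale-set construction is shown below. The exact scale factors are elided in the text, so the factors here are assumptions; only the "multiple of 32" convention comes from the paper:

```python
def build_scale_set(short_side, step=32, factors=(0.5, 0.75, 1.0, 1.25, 1.5)):
    """Hypothetical scale set: resize the image's shorter side by a few
    factors and round each target length to a multiple of `step` (32,
    as stated in the paper). Duplicate scales are collapsed."""
    return sorted({max(step, int(round(short_side * f / step)) * step)
                   for f in factors})

scales = build_scale_set(640)
```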
Filtering Negative Samples. To obtain accurate background maps , we need to identify and discard false-negative regions. Considering the particularity of text, we use edge detection and the distance transform to filter pixels and select those far away from the image gradients as the final background pixels. As shown in Fig. 5, for an input image , we first obtain the corresponding edge map with an edge detector. Then, the distance transform ( i.e., in Algorithm 1 ) is used to calculate the distance between the -th pixel and the -th pixel, where the -th pixel is the nearest one to the -th pixel on the gradient edge ( i.e., the white parts in Fig. 5 (b) ). Similar to Eqn. 1, the distance is normalized to [0, 1], and Fig. 5 (c) visualizes the distance map. Finally, the background maps are calculated by thresholding the score map with a lower threshold , which is set as in all experiments.
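The filter above can be sketched end-to-end. This is a self-contained approximation: a simple gradient magnitude stands in for the Canny detector, the distance transform is computed brute-force rather than with an optimized routine, and the threshold 0.3 is a placeholder for the paper's (elided) value:

```python
import numpy as np

def background_map(gray, thr=0.3):
    """Sketch of negative-sample filtering: mark as background only the
    pixels far from image gradients, after normalizing the distance map
    to [0, 1] as in Eqn. 1."""
    gy, gx = np.gradient(gray.astype(np.float64))
    edges = np.argwhere(np.hypot(gx, gy) > 0.1)    # crude edge pixels
    h, w = gray.shape
    dist = np.zeros((h, w))
    for y in range(h):
        for x in range(w):                         # brute-force distance transform
            dist[y, x] = np.sqrt(((edges - (y, x)) ** 2).sum(axis=1)).min()
    dist /= max(dist.max(), 1e-9)                  # normalize to [0, 1]
    return dist > thr                              # True = confident background

# Toy usage: a vertical step edge between columns 4 and 5.
gray = np.zeros((5, 10))
gray[:, 5:] = 1.0
bg = background_map(gray)
```

Pixels near the step edge are rejected (possible text boundary), while pixels far from any gradient are kept as background.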
3.2.3 Dynamic Mixed Training
The generated pseudo labels usually contain false positives (FP) and false negatives (FN), which cause unsteady convergence of the network. To tackle this problem and make full use of unlabeled data, we propose dynamic mixed training, which optimizes the model with more unlabeled data in the early stage of training and fewer unlabeled data in the later stage. In the training stage, a mini-batch of volume is composed of labeled images and unlabeled images, where . Here can be calculated with the formula . The loss function in each iteration can be written as:
where denotes the text detector, and and denote the corresponding losses of the detector on labeled data and pseudo-labeled data, respectively.
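As a toy illustration of the dynamic schedule, the sketch below decays the unlabeled share of each mini-batch linearly over training. The paper's exact formula is elided from this text, so the linear decay and the "at least one labeled image" floor are assumptions, not the authors' definition:

```python
def unlabeled_per_batch(batch_size, epoch, total_epochs):
    """Hypothetical linear schedule for dynamic mixed training: start with
    mostly unlabeled images, end with none, keeping at least one labeled
    image per batch throughout."""
    frac = 1.0 - epoch / max(total_epochs - 1, 1)   # 1.0 -> 0.0 over training
    n_u = int(round((batch_size - 1) * frac))       # reserve one labeled slot
    return min(n_u, batch_size - 1)
```

Any monotonically decreasing schedule would realize the same idea: mine the unlabeled pool early, then shield the model from pseudo-label noise late.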
4 Experiments
4.1 Datasets and Experimental Settings
Pure Synthetic Datasets. SynthText [gupta2016synthetic] consists of 800k synthetic images generated by rendering variants of multi-oriented text with random fonts, sizes, and colors. Curved SynthText [long2019rethinking] (https://github.com/PkuDavidGuan/CurvedSynthText) generates 80M curved texts with character-level annotations by revising the text rendering module of the SynthText engine.
Real Datasets. ICDAR2015 [karatzas2015icdar] includes 1,000 training and 500 testing images with quadrilateral annotations. Total-Text [ch2017total] is an English curved text dataset containing 1,555 images with three different text orientations: horizontal, multi-oriented, and curved. MSRA-TD500 [msra] consists of 300 training and 200 testing images for detecting multi-lingual long texts of arbitrary orientation. ICDAR2017-MLT [icdar2017mlt] consists of 18,000 images with texts in 9 languages for multi-lingual text detection.
[Table 1: synthetic data type and size, with evaluation on Total-Text (%)]
[Table 2: method and synthetic data, with evaluation on Total-Text (%)]
[Table 3: method, with evaluation on ICDAR2015 (%)]
Implementation Details. Bounding Box Supervision. All experiments use the same strategy: (1) training the SASN with synthetic data based on character annotations; (2) generating pixel-level pseudo labels on real data with the SASN, based on bounding box annotations; (3) training the detectors ( i.e., EAST and PSENet ) with the pseudo labels. The stochastic gradient descent (SGD) optimizer is adopted with a momentum of 0.9 and a weight decay of 0.0005. The batch size is set to 8 per GPU. The learning rate is initialized to 0.02 and decayed with the power of 0.9 for 16 epochs. During training and inference, the cropped images are resized to a resolution of 128×128. Dynamic Self-Training. All experiments use the same training strategy: (1) first choosing a part of the training data as labeled data to train the detectors; (2) generating pseudo labels on the rest of the data with the proposed components; (3) fine-tuning the detectors on the whole data. To evaluate the effectiveness of DST, we randomly split the training set into labeled and unlabeled data, using a ratio to control the labeled data size within the total training set. PSENet [psenet] and EAST [zhou2017east] are adopted as the base detectors because of their popularity. In the PSENet [psenet] and EAST [zhou2017east] experiments, all settings follow the original papers.
4.2 Ablation Study
In this part, we conduct three groups of ablation experiments to analyze BBS and DST. More ablation experiments ( e.g., box supervision vs. strong supervision, labeled data size for DST ) are in the supplementary materials.
Synthetic data for BBS. To understand the impact of synthetic data with different text shape distributions, we compare Curved SynthText [long2019rethinking] and SynthText [synthtext]. As shown in Tab. 1, using Curved SynthText obtains better performance, an absolute improvement over using SynthText, because of the serious domain gap in text shape between Total-Text (curved text) and SynthText (straight text). The performance tends to stabilize when the synthetic data size exceeds , so we randomly select synthetic data to train the SASN in all experiments.
Skeleton Attention for BBS. Skeleton Attention is the key to the SASN. Tab. 2 gives the ablation study of the attention on Total-Text [totaltext]. No matter which synthetic data is used, Skeleton Attention obtains up to improvement compared with the baseline without attention. In addition, Fig. 6 visualizes the attention map, which shows that the attention makes the network focus on the text skeleton.
Components for DST. Dynamic mixed training, multi-scale inference and filtering are the keys to DST. We assume that these components are independent of each other and improve the model from different perspectives by minimizing the effects of false negatives and false positives; Tab. 3 indirectly supports this assumption. By adding dynamic mixed training, multi-scale inference and filtering, the F-score reaches , and , respectively, making , and absolute improvements over the baseline.
4.3 Experiments on Scene Text Detection
In this section, we first present the experiments on BBS and DST, respectively; the combined experiments are given in the supplementary materials.
4.3.1 Bounding Box Supervision
Quadrilateral-type datasets. Tab. 4 lists the experimental results of various methods on the ICDAR2015, ICDAR2017 and MSRA-TD500 datasets. For ICDAR2015, PSENet [psenet] trained with pseudo labels achieves almost the same performance ( and using the model pre-trained on SynthText ) as that trained with the ground truth, which proves the high quality of the pseudo labels generated by BBS. On the contrary, directly training the detector with bounding box annotations yields an unsatisfactory F-score ( and ), a gap of more than compared with using polygon annotations. EAST [zhou2017east] shows a similar trend to PSENet. Moreover, the F-score ( ) of using the pseudo label even improves over that ( ) of using the ground truth. For MSRA-TD500, annotations are provided at the line level, including the spaces between words in the box. Therefore, bounding box annotations on MSRA-TD500 usually contain plenty of background, which causes poor performance ( for EAST and for PSENet ). In this case, the pseudo labels from BBS still bring almost the same performance ( for EAST and for PSENet ) as polygon annotations ( for EAST and for PSENet ). In addition, after using the model pre-trained on SynthText, PSENet and EAST both obtain almost the same improvement no matter which labels are used (pseudo labels or ground truth). For ICDAR2017, the results are similar to ICDAR2015: the pseudo labels perform comparably to the ground truth. The difference is that the bounding box also brings relatively good performance ( for EAST and for PSENet ), mainly because texts in ICDAR2017 have small tilting angles and sizes.
Polygon-type dataset. Tab. 4 also lists the experimental results on Total-Text [totaltext]. The annotations on Total-Text are complex and polygonal in shape. The great performance ( ) of using pseudo labels further proves the significance of our work, and Fig. 7 visualizes some ground-truth and pseudo labels. Similar to MSRA-TD500, bounding box annotations on Total-Text also contain plenty of background, which causes poor performance ( and ). Using the pseudo labels generated by BBS still achieves great performance ( and ), with and improvements.
[Table 5 (excerpt): EAST + DST with Ratio 50%: 69.2 / 63.5 / 66.2 (+1.7)]
4.3.2 Dynamic Self Training
We present the experimental results on the ICDAR datasets in this subsection; more experiments ( i.e., on Total-Text ) are in the supplementary material. For ICDAR2015, Tab. 5 lists the experimental results of self-training, where 'Ratio' denotes the proportion of the training set used as labeled data. By using DST, EAST with half labeled and half unlabeled data obtains great performance ( and ), which is almost the same as that of strong supervision. After using the model pre-trained on SynthText, the performance can be further improved by 2.0%. Similar to EAST, PSENet also achieves competitive performance ( and ) with only half of the labeled training set. For ICDAR2017, Tab. 6 lists the experimental results of self-training. Similar to ICDAR2015, our method achieves competitive performance ( and ) no matter which detector is used.
5 Conclusion
In this paper, we present a simple but surprisingly effective and practical method termed SelfText Beyond Polygon (SBP), which includes two components: Bounding Box Supervision (BBS) and Dynamic Self-Training (DST). BBS with only bounding box annotations achieves almost the same performance as using expensive polygon annotations. DST trains with limited labeled data and massive unlabeled data to further reduce the data cost. The experiments show that our method achieves almost the same performance as strong supervision while saving 70%+ of the data cost, which provides a new perspective for weakly supervised text detection and can save considerable expense for the industry.
6.1 More Analysis for Dynamic Mixed Training
As shown in Fig. 8, full supervision (green line) usually has stable convergence ( ), while normal self-training (red line) shows performance fluctuation ( ) because of the adverse effects of false negatives (FNs) and false positives (FPs). Fortunately, Dynamic Mixed Training achieves competitive convergence performance ( ) by dynamically adjusting the number of unlabeled images in a mini-batch, as shown by the blue line. The basic idea is simple but effective: the network trains with massive unlabeled data in the early training stage to mine valuable information, and with few unlabeled data in the last training stage to minimize the adverse effects of FNs and FPs. We argue that self-training requires such an adjusting strategy for stable convergence.
[Table 7: method and annotation type, with evaluation on Total-Text (%) and MSRA-TD500 (%); e.g., PSENet with PL (Curved ST) reaches 81.7 / 75.6 / 78.5 (+33.5) on Total-Text]
[Table 8: method and ratio, with evaluation on ICDAR2015 (%)]
6.2 Ablation Study
Box supervision vs. strong supervision. Compared with strong supervision, Bounding Box Supervision (BBS) not only saves significant cost but also achieves almost the same performance. In Tab. 7, we adopt two detectors ( i.e., PSENet [psenet] and EAST [zhou2017east] ) and two datasets ( i.e., Total-Text [totaltext] and MSRA-TD500 [msra] ) to support this conclusion. On Total-Text, PSENet using the pseudo labels generated by BBS achieves a satisfactory F-score ( ), which is almost the same as using the ground truth ( ), with an improvement over using the bounding box ( ). On MSRA-TD500, compared with using the ground truth, EAST and PSENet both achieve competitive F-scores ( vs. and vs. ) with the pseudo labels from BBS, with great improvements ( and , respectively). The competitive performance of BBS proves the practicability and efficiency of the pseudo labels; Fig. 10 visualizes some of them.
Besides, the bounding box annotation as a location prior is necessary for BBS. Fig. 9 gives a visual comparison of pseudo labels generated with and without the box annotation: the pseudo label without the box prior contains large amounts of false negatives and false positives (the red circles), which causes unsatisfactory performance ( i.e., 59.4% vs. 78.0% for EAST [zhou2017east] on ICDAR2015 [karatzas2015icdar] ), as shown in Tab. 7.
Labeled data size for DST. The target of Dynamic Self-Training is to train detectors with limited labeled data and massive unlabeled data. However, how much labeled data is enough to train a competitive model? Tab. 8 gives the correlation between the labeled data proportion and the performance on ICDAR2015 [karatzas2015icdar]. The F-score increases as the labeled data proportion grows and achieves competitive performance ( ) with labeled data. Besides, DST brings a significant improvement of up to when using only of the labeled training set.