SelfText Beyond Polygon: Unconstrained Text Detection with Box Supervision and Dynamic Self-Training

by   Weijia Wu, et al.
Zhejiang University

Although a polygon is a more accurate representation than an upright bounding box for text detection, polygon annotations are extremely expensive and challenging to produce. Unlike existing works that employ fully-supervised training with polygon annotations, we propose a novel text detection system termed SelfText Beyond Polygon (SBP) with Bounding Box Supervision (BBS) and Dynamic Self-Training (DST), which trains a polygon-based text detector with only a limited set of upright bounding box annotations. For BBS, we first use synthetic data with character-level annotations to train a Skeleton Attention Segmentation Network (SASN). Then the box-level annotations are adopted to guide the generation of high-quality polygon-like pseudo labels, which can be used to train any detector. In this way, our method achieves the same performance as text detectors trained with polygon annotations (i.e., both reach an F-score of 85.0% for PSENet on ICDAR2015). For DST, by dynamically removing false alarms, it is able to leverage limited labeled data as well as massive unlabeled data to further outperform the expensive baseline. We hope SBP can provide a new perspective for text detection to save huge labeling costs.





1 Introduction

Scene text detection [tian2016detecting, liao2017textboxes, zhou2017east, textsnake, masktextspotter], which aims to locate text in the wild, has attracted much attention in recent years because of its numerous applications, e.g., instant translation, image retrieval, and scene parsing. Since text instances in real-world scenes usually have various sizes, directions and shapes, most scene text detection methods [psenet, textsnake, wang2019efficient] use polygon annotations to train robust text detectors. Although a polygon is more accurate than an upright bounding box (in this paper, "bounding box" means an upright bounding box defined by two points), the labeling cost of polygons is extremely high, which limits their large-scale use in real-world applications.

Figure 1: Comparison of the data cost and the performance of the proposed method. (a) Annotating more points is more expensive (the price of 100 text instances). (b) Data collection is much cheaper than annotation. (c) The proposed SBP achieves competitive performance while saving huge labeling costs for PSENet [psenet] on Total-Text [totaltext]. Note that the data cost information is obtained from Amazon Mechanical Turk (MTurk), and we make a rough cost evaluation of Total-Text.

To address the above-mentioned issue, in this paper we propose a novel system termed SelfText Beyond Polygon (SBP) for unconstrained scene text detection with extremely low data collection and labeling costs, i.e., with limited bounding-box-annotated data and a larger amount of unlabeled data. Our approach is motivated by the following observations. First, although detectors trained directly on bounding box annotations are limited in performance, bounding boxes are still a cheap way of labeling that roughly distinguishes foreground from background locations. As illustrated in Fig. 1 (a), according to Amazon Mechanical Turk (MTurk), upright bounding box annotations are about 4 times cheaper than polygon annotations [totaltext, yuliang2017detecting]. Thus, how to effectively use bounding box annotations to boost detection accuracy becomes critical in such a situation. Second, even if upright bounding box annotation is adopted, the cost is still very high when the data volume reaches the million range. Fortunately, data collection is usually much cheaper than labeling as the amount of data increases, as shown in Fig. 1 (b). Thus, how to progressively mine the valuable information hidden in massive unlabeled data to further improve the accuracy of the detector is another problem addressed in this paper.

In practice, we deal with the above problems by proposing two novel schemes, Bounding Box Supervision (BBS) and Dynamic Self-Training (DST), to efficiently use a limited set of bounding-box-annotated data and massive unlabeled data, respectively. For the former, we first use synthetic data, whose character-level annotations can be generated automatically for free, to pre-train a newly proposed Skeleton Attention Segmentation Network (SASN). By exploiting text data with various geometric properties (e.g., straight and curved text) in the training phase, the SASN can effectively capture changes in text appearance across different scenes. Then bounding box annotations are used to crop the text regions in the real images, which are fed into SASN to generate high-quality polygon-like pseudo labels. By splicing all of these local pseudo labels, a global one is obtained and can be used to train any text detector. For the latter, we propose a Dynamic Self-Training (DST) scheme, which adopts filtering and multi-scale inference strategies to reduce the effect of false positives and false negatives in the pseudo labels for the unlabeled data. In addition, a dynamic mixed training strategy is used to limit the use of pseudo labels in the late stage of training. As shown in Fig. 1 (c), by using BBS and DST, PSENet [psenet] has a lower data cost than when using polygon annotations, and obtains F-score improvements over directly using bounding box annotations on Total-Text [totaltext]. The main contributions are fourfold:

(1) We introduce a Bounding Box Supervision strategy, which uses free synthetic data and box-level annotations to obtain high-quality polygon-like annotations for real data (i.e., pseudo labels).

(2) We introduce a Skeleton Attention Segmentation Network for capturing the structure of texts with arbitrary shapes and restraining noise interference.

(3) We propose a stable self-training scheme termed Dynamic Self-Training, which uses large-scale unlabeled data and limited labeled data for robust detector training.

(4) The proposed SelfText Beyond Polygon is simple but surprisingly effective and practical: it saves 70%+ of the cost (rough estimation) and achieves almost the same performance as fully-supervised methods.

2 Related Work

2.1 Supervised Text Detection

Scene text detection has achieved remarkable progress in the deep learning era. Early methods [liao2017textboxes, zhang2016multi] focused on horizontal text detection. CTPN [tian2016detecting] adopted Faster RCNN [ren2015faster] and modified the RPN to detect text. Later, many well-known methods [rrpn, RRD] aimed at solving multi-oriented text detection. EAST [zhou2017east] used an FCN [FCN] to predict a text score map, a distance map and an angle map in an anchor-free manner. Recent works focus on curved text detection [baek2019character, xu2019textfield]. TextSnake [textsnake] modeled a curved text instance as a series of ordered disks and a text center line. SPCNet [spcnet] and Mask Text Spotter [masktextspotter] are based on Mask RCNN and treat curved text detection as an instance segmentation problem. PSENet [psenet] and PAN [wang2019efficient] treat a text instance as kernels with different scales, and reconstruct the whole text instance in post-processing.

Figure 2: Illustration of the whole SelfText Beyond Polygon. (a) The pipeline of Bounding Box Supervision, which trains on synthetic data and runs inference on real data to generate pseudo labels; details are shown in Fig. 3. (b) The pipeline of Dynamic Self-Training, which uses limited labeled data and massive unlabeled data to train detectors and save cost.

2.2 Weakly Supervised Object Detection

To alleviate the expensive data cost problem, there have been various weakly supervised object detection (WSOD) works [wang2017learning] based on image-level annotations. Bilen et al. [bilen2016weakly] proposed an architecture called WSDDN to select and classify region proposals simultaneously. ContextLocNet [kantorov2016contextlocnet] further improves WSDDN by taking contextual regions into consideration. Beyond that, some works tried to use the temporal information in videos. Kwak et al. [kwak2015unsupervised] discovered object appearance representations across videos and tracked the object in temporal space for supervision. Wang et al. [wang2015unsupervised] performed unsupervised tracking on videos and clustered similar features. Recently, Yang et al. [yang2019activity] and Kim et al. [kim2020tell] proposed activity-driven methods, which exploit action classes as contextual information to localize objects. Different from the above methods, we develop a novel pipeline for weakly supervised text detection, which generates polygon-like pseudo labels from bounding-box annotations (2 points) instead of polygon annotations (4-20 points) to train networks.

2.3 Self-training with noisy labels

Self-training, where a teacher model is used to generate labels for an unlabeled set that the student can train on, is a widely used method in semi-supervised learning. Xie et al. [xie2020self] use strong data augmentation in self-training, but only for image classification. Radosavovic et al. [radosavovic2018data] proposed a data distillation method for instance-level detection tasks, which votes on predictions for unlabeled data to obtain pseudo labels. Recently, noisy labels have been shown to have positive effects in self-training for object detection [zoph2020rethinking], which motivates our work. In this work, we introduce a novel dynamic self-training approach to minimize the negative effects of false negatives and false positives in pseudo labels.

3 Methods

As shown in Fig. 2, we describe the details of the proposed SelfText Beyond Polygon (SBP), which has two key components termed the Bounding Box Supervision (BBS) and the Dynamic Self-Training (DST).

Figure 3: The whole pipeline of Bounding Box Supervision. (a) The architecture of the Skeleton Attention Segmentation Network, which is composed of two streams: a regular stream and a skeleton stream. (b) The details of the Skeleton Attention Module, which refines the input feature map by weighting it with the attention map. (c) The detailed structure of the Decoder. Generating polygon-like pseudo labels includes two steps: (1) blue arrows show training SASN with synthetic data based on character annotations; (2) red arrows illustrate generating pseudo labels on real data based on bounding box annotations.

3.1 Bounding Box Supervision

Following work on segmentation with bounding boxes [dai2015boxsup], we propose a weakly supervised method, termed Bounding Box Supervision (BBS), which can achieve competitive accuracy with low-cost bounding box annotations for text detection. As shown in Fig. 2 (a), BBS uses a newly proposed Skeleton Attention Segmentation Network (SASN), shown in Fig. 3, to generate polygon pseudo labels from the given bounding box annotations. The process consists of three steps. (1) In the first step (see blue arrows in Fig. 2), we train SASN with free synthetic data based on character annotations. (2) In the second step (see red arrows in Fig. 2), the box-level annotations are used to crop the real images, and the crops are fed into the SASN to generate polygon-like pseudo labels. (3) By splicing all of the local pseudo labels, the global one is obtained. In this way, bounding box annotations can be converted to high-quality polygon pseudo labels. Detectors (e.g., PSENet [psenet]) trained on these pseudo labels can achieve almost the same performance as those trained on polygon annotations.
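The crop, segment and splice steps above can be sketched as follows. This is a minimal illustration and not the authors' implementation: `splice_pseudo_label` and `threshold_segmenter` are hypothetical names, the brightness threshold merely stands in for a trained SASN, and the resizing of crops is omitted.

```python
from typing import Callable, List, Tuple

Mask = List[List[int]]            # 2-D binary mask, 1 = text pixel
Box = Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max), max exclusive


def splice_pseudo_label(
    image: List[List[float]],
    boxes: List[Box],
    segment_crop: Callable[[List[List[float]]], Mask],
) -> Mask:
    """Crop each box, segment the crop, and paste the local masks into one global mask."""
    height, width = len(image), len(image[0])
    global_mask = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        crop = [row[x0:x1] for row in image[y0:y1]]
        local_mask = segment_crop(crop)  # hypothetical stand-in for a trained SASN
        for dy, mask_row in enumerate(local_mask):
            for dx, value in enumerate(mask_row):
                if value:
                    global_mask[y0 + dy][x0 + dx] = 1
    return global_mask


def threshold_segmenter(crop: List[List[float]]) -> Mask:
    """Toy segmenter (assumption): marks bright pixels as text."""
    return [[1 if pixel > 0.5 else 0 for pixel in row] for row in crop]


image = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.9, 0.0, 0.9],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
pseudo_label = splice_pseudo_label(image, [(1, 1, 3, 3)], threshold_segmenter)
```

Pixels outside every annotated box stay background regardless of their appearance, which is how the box annotation constrains the pseudo label.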

3.1.1 Skeleton Attention Segmentation Network

Network Architecture. As presented in Fig. 3 (a), we use ResNet50 [he2016deep] as the backbone network of SASN, and extract three levels of features at different downsampling scales. After that, the skeleton stream fuses two of these feature levels to predict the skeleton attention map. At the same time, the regular stream segments the text by applying the Skeleton Attention Module. Although text segmentation within a bounding box is easier than in the whole image, some noise still exists, for example two texts in one bounding box, as illustrated by the green circle in Fig. 3 (a). To restrain the noise interference and segment the body text precisely, we carefully design the Skeleton Attention Module, which generates the skeleton map to drive the attention mechanism.

Skeleton Attention. Fig. 3 (b) illustrates the details of the Skeleton Attention Module (SA). Consider an input sample $(X_n, Y_n)$, where $X_n$ and $Y_n$ denote the $n$-th image and its labels. We use $y_n^i$ to denote the label of the $i$-th pixel of the $n$-th training image, with $y_n^i = 0$ for a background pixel and $y_n^i = 1$ for a text pixel. We also define the text skeleton ground truth as a soft label. For the $i$-th pixel in the text region of the $n$-th image, we first calculate the shortest distance $d_n^i$ between the $i$-th pixel and its nearest background pixel, and then define the soft skeleton label $s_n^i$ of the $i$-th pixel by normalizing $d_n^i$ to $[0, 1]$:

$$ s_n^i = \frac{d_n^i}{d_n^{\max}} \qquad (1) $$

where $d_n^{\max}$ is the maximum value of $d_n^i$ in the $n$-th image.

Intuitively, pixels close to the skeleton of the text instance should have a greater value than boundary pixels (see the text skeleton label in Fig. 3 (a)). Since the soft label is a real value representing the degree of distance, it is incompatible with the commonly used binary cross-entropy loss. Besides, the L1 and L2 losses are not sensitive to the distance distribution [ren2015faster]. Therefore, to handle the soft label, we modify the cross-entropy loss into a "soft" form, in which the value of each pixel in the ground-truth skeleton map serves as the target for the network's prediction at that pixel. The text skeleton loss in Fig. 3 (a) is defined in this soft cross-entropy form.


In practice, the predicted skeleton map is downsampled to obtain multi-scale maps (i.e., 1/4, 1/8, 1/16), which are passed to the regular stream (the thick yellow arrow in Fig. 3 (a)). Then each extracted feature map and the skeleton map at the corresponding scale are used as the input of the Skeleton Attention Module. The two feature maps before and after the skeleton attention are concatenated and fed into a channel attention block, which outputs the refined feature map. Note that the proposed Skeleton Attention Module is shared between the high-level and low-level feature maps.
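The soft skeleton label and the soft cross-entropy loss described in this subsection can be sketched as below. This is only an illustration under stated assumptions: the distance is taken to be Euclidean, the loss is assumed to be a binary cross-entropy with soft targets averaged over pixels, and the function names `soft_skeleton_label` and `soft_cross_entropy` are hypothetical. The paper's exact loss may differ.

```python
import math
from typing import List

Grid = List[List[float]]


def soft_skeleton_label(mask: List[List[int]]) -> Grid:
    """Distance from each text pixel to its nearest background pixel, divided by the image maximum."""
    h, w = len(mask), len(mask[0])
    background = [(y, x) for y in range(h) for x in range(w) if mask[y][x] == 0]
    dist = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x] == 1 and background:
                dist[y][x] = min(math.hypot(y - by, x - bx) for by, bx in background)
    d_max = max(max(row) for row in dist)
    if d_max == 0:
        return dist  # no text pixels (or no background): nothing to normalize
    return [[d / d_max for d in row] for row in dist]


def soft_cross_entropy(pred: Grid, target: Grid, eps: float = 1e-7) -> float:
    """Binary cross-entropy with soft targets in [0, 1], averaged over pixels (assumed form)."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, s in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clip to keep the logarithms finite
            total -= s * math.log(p) + (1.0 - s) * math.log(1.0 - p)
            count += 1
    return total / count


mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
label = soft_skeleton_label(mask)
flat_prediction = [[0.5] * 5 for _ in range(5)]
loss_matching = soft_cross_entropy(label, label)
loss_flat = soft_cross_entropy(flat_prediction, label)
```

In this toy mask the center pixel receives the largest label and the pixels next to the boundary receive half of it, matching the intuition that skeleton pixels score higher than boundary pixels.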

3.1.2 Suitable Synthetic Datasets

Existing real-world text datasets can be divided into two types, straight (e.g., ICDAR2015 [karatzas2015icdar]) and curved (e.g., Total-Text [totaltext]), depending on the shape of the text instances. However, curved synthetic text data is scarce. SynthText [gupta2016synthetic], the most widely used synthetic dataset, does not contain curved text, which leads to a serious domain gap between synthetic data and real data (e.g., Total-Text) in the curved text distribution, as shown in Fig. 4. Therefore, we also adopt Curved SynthText [long2019rethinking] to align the data distribution with curved datasets. In this work, we use SynthText [gupta2016synthetic] to train SASN for straight text lines and Curved SynthText [long2019rethinking] for curved text lines.

Figure 4: The shape distribution difference between synthetic data and real data causes a serious domain gap.

3.2 Dynamic Self Training for Text Detection

Inspired by previous works [xie2020self, zoph2020rethinking], we propose a novel self-training strategy, termed Dynamic Self Training, to reduce label cost by exploiting large-scale unlabeled data.

3.2.1 Overview of Dynamic Self-Training

Fig. 2 (b) gives an overview of the proposed dynamic self-training scheme. In this setting, we have a large set of unlabeled images and a limited set of labeled images. Here the labels may be the polygon-like labels generated by BBS or manually labeled ground truth. We first use the labeled images to train an initial detector. Then we apply the detector to generate multi-scale prediction results as the foreground maps of the unlabeled images. Once the foreground maps are obtained, we further calculate the background maps by filtering false negatives, as shown in Fig. 5. Specifically, false negatives are filtered by thresholding the distance score of each negative-sample pixel, which is calculated with a distance transform function [bradski2008learning] and edge detection (i.e., Canny [ding2001canny]). By leveraging the information from the foreground and background maps, we can generate high-quality pseudo labels for the unlabeled data. Please refer to Sec. 3.2.2 for more details. After that, we conduct dynamic mixed retraining of the model using both the labeled and the pseudo-labeled data, and adaptively adjust the number of unlabeled images in a mini-batch. Sec. 3.2.3 gives the detailed training process.

Algorithm 1 summarizes the scheme of DST. It uses a gradient edge detector [ding2001canny], which is utilized to filter text regions with regular gradient changes, a distance transform function [bradski2008learning], and multi-scale inference over a pre-defined scale set. The shortest distance between each pixel and the gradient edge (i.e., the white line in Fig. 5 (b)) can be calculated by the distance transform function. A pixel with a larger distance is usually farther from the gradient edge, leading to a lower probability of being foreground.

Figure 5: Background filtering process. (a) The pseudo labels contain false negatives. (b) Edge map generated by the edge detector (i.e., Canny [ding2001canny]). (c) Distance map generated by the distance transform. (d) The final refined negative samples (i.e., white regions).

Algorithm 1 Dynamic Self-Training Method.
Input: labeled images; unlabeled images
Output: trained parameters
train the model on the labeled images (just as in normal supervised learning)
while epoch ... do
    multi-scale inference
    compute edge map
    compute distance map
    filtering negative samples
    if ... then
    end if
    dynamic mixed retraining of the model on the union of the labeled and pseudo-labeled images
end while
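The control flow of Algorithm 1 can be sketched as a short Python skeleton. This is a hedged outline, not the authors' code: every callable passed in (`train`, `infer_multi_scale`, `filter_background`, `retrain`) is a hypothetical placeholder, the stopping rule is assumed to be a fixed number of epochs, and the conditional step that was lost from the algorithm listing is not reproduced.

```python
def dynamic_self_training(model, labeled, unlabeled, epochs,
                          train, infer_multi_scale, filter_background, retrain):
    """Skeleton of the DST loop; all callables are caller-supplied placeholders."""
    # Initial supervised training on the labeled images only.
    model = train(model, labeled)
    for epoch in range(epochs):
        pseudo_labeled = []
        for image in unlabeled:
            # Foreground map from multi-scale inference with the current model.
            foreground = infer_multi_scale(model, image)
            # Background map from edge detection + distance transform + thresholding.
            background = filter_background(image, foreground)
            pseudo_labeled.append((image, foreground, background))
        # Dynamic mixed retraining on labeled and pseudo-labeled data.
        model = retrain(model, labeled, pseudo_labeled, epoch)
    return model


# Toy usage with counters instead of real models (purely illustrative).
log = []
final = dynamic_self_training(
    model=0, labeled=["l1"], unlabeled=["u1", "u2"], epochs=2,
    train=lambda m, data: m + 1,
    infer_multi_scale=lambda m, img: f"fg({img})",
    filter_background=lambda img, fg: f"bg({img})",
    retrain=lambda m, data, pseudo, epoch: log.append(len(pseudo)) or m + 1,
)
```

Regenerating the pseudo labels inside the loop mirrors the listing, where inference and filtering happen in every pass before retraining.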

3.2.2 High-quality Pseudo Label Generation

Multi-Scale Inference. As shown in Algorithm 1, we use multi-scale inference to exploit as many hard positive samples as possible. When generating pseudo labels, each unlabeled image is resized to a pre-defined set of scales to generate multi-scale predictions; the scales are specified in terms of the shorter side of the image, with the associated parameter set to 32 in practice. Then Locality-Aware NMS [zhou2017east] is used to ensemble the multi-scale inference results of each unlabeled image. Such an aggregation scheme often yields a prediction superior to any of the model's single-scale predictions, especially for text detectors without FPN [lin2017feature].
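The aggregation of multi-scale predictions can be illustrated with the sketch below. Note the substitution: it uses plain greedy IoU-based NMS on upright boxes as a simple stand-in for Locality-Aware NMS, and the function names, the 0.5 IoU threshold, and the example detections are all assumptions for illustration.

```python
from typing import Dict, List, Tuple

Detection = Tuple[float, float, float, float, float]  # (x0, y0, x1, y1, score)


def iou(a: Detection, b: Detection) -> float:
    """Intersection over union of two upright boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def multi_scale_merge(detections_per_scale: Dict[float, List[Detection]],
                      iou_threshold: float = 0.5) -> List[Detection]:
    """Map detections from each resized image back to original coordinates, then apply greedy NMS."""
    candidates = []
    for scale, detections in detections_per_scale.items():
        for x0, y0, x1, y1, score in detections:
            candidates.append((x0 / scale, y0 / scale, x1 / scale, y1 / scale, score))
    candidates.sort(key=lambda d: d[4], reverse=True)
    kept: List[Detection] = []
    for cand in candidates:
        if all(iou(cand, k) < iou_threshold for k in kept):
            kept.append(cand)
    return kept


# One text is found at both scales; a second one only at the larger scale.
detections = {
    1.0: [(10, 10, 50, 30, 0.9)],
    2.0: [(22, 20, 100, 60, 0.8), (200, 100, 260, 140, 0.7)],
}
merged = multi_scale_merge(detections)
```

The duplicate detection from the larger scale is suppressed, while the instance that only the larger scale found is kept, which is the kind of hard positive that multi-scale inference is meant to recover.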

Filtering Negative Samples. To obtain accurate background maps, we need to identify and discard the false negative regions. Considering the particular characteristics of text, we use edge detection and a distance transform to filter pixels, and select those far away from the image gradients as the final background pixels. As shown in Fig. 5, for an input image, we first obtain the corresponding edge map with the edge detector. Then the distance transform in Algorithm 1 is used to calculate the distance between each pixel and its nearest pixel on a gradient edge (i.e., the white parts in Fig. 5 (b)). Similar to Eqn. 1, the distance is normalized to $[0, 1]$, and Fig. 5 (c) visualizes the resulting distance map. Finally, the background maps are calculated by thresholding this score map with a lower threshold, which is kept fixed in all of the experiments.
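A minimal sketch of this filtering step is given below. It assumes the edge map has already been computed (no Canny detector is run here), uses a brute-force Euclidean distance in place of an optimized distance transform, and picks a threshold of 0.5 purely for illustration since the value used in the experiments is not stated in the text. The function name `background_map` is hypothetical.

```python
import math
from typing import List


def background_map(edge_map: List[List[int]],
                   foreground: List[List[int]],
                   threshold: float = 0.5) -> List[List[int]]:
    """Keep as background only non-foreground pixels whose normalized distance to the nearest edge exceeds the threshold."""
    h, w = len(edge_map), len(edge_map[0])
    edges = [(y, x) for y in range(h) for x in range(w) if edge_map[y][x]]
    if not edges:
        # Assumption: with no gradients at all, every non-foreground pixel is safe background.
        return [[0 if foreground[y][x] else 1 for x in range(w)] for y in range(h)]
    dist = [[min(math.hypot(y - ey, x - ex) for ey, ex in edges) for x in range(w)]
            for y in range(h)]
    d_max = max(max(row) for row in dist)
    background = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            score = dist[y][x] / d_max if d_max > 0 else 0.0
            if not foreground[y][x] and score > threshold:
                background[y][x] = 1
    return background


edge_map = [[1, 0, 0, 0, 0, 0]]    # a single edge pixel on the left
foreground = [[1, 1, 0, 0, 0, 0]]  # predicted text pixels
refined = background_map(edge_map, foreground)
```

In the toy row, the third pixel is neither predicted text nor far enough from the edge, so it is left out of both maps rather than being trained on as a negative sample.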

3.2.3 Dynamic Mixed Training

The generated pseudo labels usually contain false positives (FP) and false negatives (FN), which cause unstable convergence of the network. To tackle this problem and make full use of unlabeled data, we propose dynamic mixed training, which optimizes models with more unlabeled data in the early stage of training and uses more labeled data in the later stage. During training, each mini-batch is composed of labeled images and unlabeled images, and the number of each is computed from a formula that adjusts the split as training proceeds. The loss function in each iteration combines the loss of the text detector on the labeled data with its loss on the pseudo-labeled data.
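The idea of shifting a mini-batch from pseudo-labeled to labeled images over time can be sketched as below. The schedule here is an assumption: a linear ramp is used only for illustration, because the paper's actual formula is not recoverable from the text, and `labeled_count` and `min_labeled` are hypothetical names.

```python
def labeled_count(epoch: int, total_epochs: int, batch_size: int,
                  min_labeled: int = 1) -> int:
    """Number of labeled images per batch, growing linearly with training progress (assumed schedule)."""
    progress = epoch / max(total_epochs - 1, 1)
    count = int(min_labeled + progress * (batch_size - min_labeled))
    return max(min_labeled, min(batch_size, count))


batch_size, total_epochs = 8, 5
schedule = [labeled_count(e, total_epochs, batch_size) for e in range(total_epochs)]
# The remainder of each batch is filled with pseudo-labeled (unlabeled) images.
unlabeled_schedule = [batch_size - n for n in schedule]
```

Early batches are dominated by pseudo-labeled images, while late batches contain mostly ground-truth labels, which limits the influence of noisy pseudo labels near convergence.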

4 Experiments

4.1 Datasets and Experimental Settings

Pure Synthetic Datasets. SynthText [gupta2016synthetic] consists of 800k synthetic images generated by adding variants of multi-oriented text with random fonts, sizes, and colors. Curved SynthText [long2019rethinking] generates 80m curved texts with character-level annotations by revising the text rendering module of the SynthText engine.

Real Datasets. ICDAR2015 [karatzas2015icdar] includes 1,000 training and 500 testing images with quadrilateral annotations. Total-Text [ch2017total] is an English curved text dataset that contains 1,555 images and includes 3 different text orientations: horizontal, multi-oriented, and curved. MSRA-TD500 [msra] consists of 500 training and 200 testing images for detecting multi-lingual long texts of arbitrary orientation. ICDAR2017-MLT [icdar2017mlt] consists of 18,000 images with texts in 9 languages for multi-lingual text detection.

| Synthetic Data | Size | Precision | Recall | F-score |
|---|---|---|---|---|
| SynthText | 50k | 78.3 | 73.5 | 75.9 |
| SynthText | 500k | 80.3 | 73.5 | 76.8 |
| SynthText | 1m | 79.7 | 74.0 | 76.7 |
| Curved SynthText | 50k | 80.8 | 75.0 | 77.8 |
| Curved SynthText | 500k | 81.7 | 75.6 | 78.5 |
| Curved SynthText | 1m | 81.3 | 75.1 | 78.1 |

Table 1: Synthetic Data for BBS, evaluated on Total-Text (%): a suitable annotation method brings a large gain. The Skeleton Attention and PSENet [psenet] are used in the experiment.

| Method | Synthetic Data | Precision | Recall | F-score |
|---|---|---|---|---|
| Baseline | SynthText (500k) | 77.2 | 73.0 | 75.0 |
| Baseline+SA | SynthText (500k) | 80.3 | 73.5 | 76.8 |
| Baseline | Curved SynthText (500k) | 80.5 | 73.2 | 76.7 |
| Baseline+SA | Curved SynthText (500k) | 81.7 | 75.6 | 78.5 |

Table 2: Skeleton Attention, evaluated on Total-Text (%): SA brings a large gain regardless of dataset. ‘SA’ denotes ‘Skeleton Attention’. PSENet [psenet] is adopted as the detector.

| Method | Precision | Recall | F-score |
|---|---|---|---|
| Baseline | 59.2 | 59.3 | 59.2 |
| Baseline,ST | 70.1 | 67.3 | 68.6 |
| Baseline,ST,Iter | 70.8 | 68.1 | 69.2 |
| Baseline,ST,Iter,DM | 71.2 | 68.5 | 69.8 |
| Baseline,ST,Iter,DM,DS | 72.5 | 70.1 | 71.3 |
| Baseline,ST,Iter,DM,DS,Fil | 73.7 | 69.9 | 71.7 |

Table 3: Ablation study of DST, evaluated on ICDAR2015 (%): we use PSENet [psenet] as the base detector, and randomly split the ICDAR2015 training set into 100 labeled and 900 unlabeled images. ‘Baseline’ and ‘ST’ denote ‘training with only labeled data’ and ‘self-training on unlabeled data’, respectively. ‘DM’, ‘DS’, ‘Fil’, ‘Iter’ denote ‘dynamic mixed training’, ‘multi-scale inference’, ‘filtering’ and ‘iterative self-training’, respectively.

| Method | Annotation | Pre | ICDAR2015 P | ICDAR2015 R | ICDAR2015 F | MSRA-TD500 P | MSRA-TD500 R | MSRA-TD500 F | ICDAR2017-MLT P | ICDAR2017-MLT R | ICDAR2017-MLT F | Total-Text P | Total-Text R | Total-Text F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Strong Supervision | | | | | | | | | | | | | | |
| CTPN [tian2016detecting] | GT | - | 74.2 | 51.6 | 60.9 | - | - | - | - | - | - | - | - | - |
| SegLink [shi2017detecting] | GT | ✓ | 73.1 | 76.8 | 75.0 | 86.0 | 70.0 | 77.0 | - | - | - | 30.3 | 23.8 | 26.7 |
| EAST [zhou2017east] | GT | - | 80.5 | 72.8 | 76.4 | 81.7 | 61.6 | 70.2 | - | - | - | - | - | - |
| PixelLink [PixelLink] | GT | - | 82.9 | 81.7 | 82.3 | 83.0 | 73.2 | 77.8 | - | - | - | - | - | - |
| TextSnake [textsnake] | GT | ✓ | 84.9 | 80.4 | 82.6 | 83.2 | 73.9 | 78.3 | - | - | - | 82.7 | 74.5 | 78.4 |
| PSENet [psenet] | GT | - | 81.5 | 79.7 | 80.6 | - | - | - | 73.7 | 68.2 | 70.8 | 81.8 | 75.1 | 78.3 |
| PSENet [psenet] | GT | ✓ | 86.9 | 84.5 | 85.7 | - | - | - | - | - | - | 84.0 | 78.0 | 80.9 |
| EAST | GT | - | 76.9 | 77.1 | 77.0 | 71.8 | 69.1 | 70.4 | 68.1 | 63.2 | 65.6 | - | - | - |
| EAST | GT | ✓ | 82.0 | 82.4 | 82.2 | 77.9 | 76.5 | 77.2 | 70.3 | 62.8 | 66.4 | - | - | - |
| PSENet | GT | - | 81.6 | 79.5 | 80.5 | 80.6 | 77.7 | 79.1 | 73.1 | 67.3 | 70.1 | 80.4 | 76.5 | 78.4 |
| PSENet | GT | ✓ | 86.4 | 83.5 | 85.0 | 84.1 | 85.0 | 84.5 | 72.5 | 69.1 | 70.8 | 83.4 | 78.1 | 80.7 |
| Box Supervision | | | | | | | | | | | | | | |
| EAST | b-GT | - | 65.8 | 63.8 | 64.8 | 40.5 | 31.1 | 35.2 | 66.3 | 59.6 | 62.8 | - | - | - |
| EAST+BBS | PL | - | 77.8 | 78.2 | 78.0 | 71.3 | 70.2 | 70.7 | 67.3 | 64.1 | 65.7 | - | - | - |
| EAST | b-GT | ✓ | 70.8 | 72.0 | 71.4 | 48.3 | 42.4 | 45.2 | 67.2 | 60.1 | 63.5 | - | - | - |
| EAST+BBS | PL | ✓ | 81.3 | 82.2 | 81.8 | 77.4 | 75.5 | 76.4 | 67.6 | 64.9 | 66.3 | - | - | - |
| PSENet | b-GT | - | 70.2 | 69.1 | 69.6 | 47.2 | 36.9 | 41.4 | 67.2 | 61.4 | 64.2 | 46.5 | 43.6 | 45.0 |
| PSENet+BBS | PL | - | 82.9 | 77.6 | 80.2 | 80.3 | 77.5 | 78.9 | 72.4 | 69.3 | 70.8 | 81.7 | 75.6 | 78.5 |
| PSENet | b-GT | ✓ | 72.7 | 74.3 | 73.5 | 47.5 | 39.5 | 43.1 | 66.4 | 63.1 | 64.7 | 51.9 | 47.5 | 49.6 |
| PSENet+BBS | PL | ✓ | 86.8 | 83.3 | 85.0 | 84.4 | 84.7 | 84.5 | 73.8 | 67.7 | 70.6 | 82.5 | 77.6 | 80.1 |

Table 4: The results of BBS on ICDAR2015, MSRA-TD500, ICDAR2017-MLT and Total-Text (%). Methods listed without a citation refer to our testing performance. ‘GT’, ‘b-GT’ and ‘PL’ refer to ‘Ground Truth’, ‘Bounding-box annotation of ground truth’ and ‘Pseudo Label generated by the SASN trained on SynthText [synthtext] or Curved SynthText [long2019rethinking]’. ‘P’, ‘R’, ‘F’ and ‘Pre’ refer to ‘Precision’, ‘Recall’, ‘F-score’ and ‘pretraining on external data’ (✓ = used, - = not used or not reported), respectively.
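As a reading aid for the tables, the F-score is the harmonic mean of precision and recall. The short check below recomputes it from rounded table entries; `f_score` is a hypothetical helper name, and small mismatches with other rows are expected because the published precision and recall are already rounded.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in percent)."""
    return 2 * precision * recall / (precision + recall)


# PSENet+BBS with pre-training on ICDAR2015: P = 86.8, R = 83.3.
psenet_bbs_f = round(f_score(86.8, 83.3), 1)
# EAST+BBS without pre-training on ICDAR2015: P = 77.8, R = 78.2.
east_bbs_f = round(f_score(77.8, 78.2), 1)
```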

Implementation Details. Bounding Box Supervision. All of the experiments use the same strategy: (1) training SASN with synthetic data based on character annotations; (2) generating pixel-level pseudo labels on real data with SASN based on bounding box annotations; (3) training the detectors (i.e., EAST and PSENet) with the pseudo labels. The stochastic gradient descent (SGD) optimizer is adopted with a momentum of 0.9 and a weight decay of 0.0005. The batch size is set to 8 per GPU. The learning rate is initialized to 0.02 and decayed with a power of 0.9 for 16 epochs. During training and inference, the cropped images are resized to a resolution of 128 × 128. Dynamic Self-Training. All of the experiments use the same training strategy: (1) we first choose a part of the training data as labeled data to train the detectors; (2) we generate pseudo labels for the rest of the data with the proposed components; (3) we fine-tune the detectors on the whole data. To evaluate the effectiveness of DST, we randomly split the training set into labeled and unlabeled data, and use a ratio to control the share of labeled data in the total training set. PSENet [psenet] and EAST [zhou2017east] are adopted as the base detectors because of their popularity. For the experiments involving PSENet [psenet] and EAST [zhou2017east], all settings follow the original papers.
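The learning-rate decay for SASN training mentioned above ("decayed with a power of 0.9") reads like the common polynomial ("poly") policy. The sketch below assumes that interpretation and a per-epoch update; both are assumptions, and `poly_lr` is a hypothetical helper name.

```python
def poly_lr(base_lr: float, step: int, total_steps: int, power: float = 0.9) -> float:
    """Polynomial learning-rate decay: base_lr * (1 - step / total_steps) ** power."""
    return base_lr * (1.0 - step / total_steps) ** power


# Base learning rate 0.02 decayed over 16 epochs, as stated in the text.
lr_per_epoch = [poly_lr(0.02, epoch, 16) for epoch in range(16)]
```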

4.2 Ablation Study

In this part, we conduct three groups of ablation experiments to analyze BBS and DST. More ablation experiments (e.g., box supervision vs. strong supervision, labeled data size for DST) are in the supplementary materials.

Synthetic data for BBS. To understand the impact of synthetic data with different text shape distributions, we compare the performance of Curved SynthText [long2019rethinking] and SynthText [synthtext]. As shown in Tab. 1, using Curved SynthText obtains better performance, with a 1.7% absolute F-score improvement over SynthText at 500k images (78.5% vs. 76.8%), because of the serious domain gap in text shape between Total-Text (curved text) and SynthText (straight text). The performance tends to stabilize once the synthetic data size exceeds 500k, so we randomly select 500k synthetic images to train SASN in all experiments.

Skeleton Attention for BBS. Skeleton Attention is the key to SASN. Tab. 2 gives the ablation study of the attention on Total-Text [totaltext]. Regardless of which synthetic dataset is used, Skeleton Attention brings a 1.8% F-score improvement over the baseline without attention (75.0% to 76.8% with SynthText, 76.7% to 78.5% with Curved SynthText). In addition, Fig. 6 visualizes the attention map, which makes the network focus more on the text skeleton.

Components for DST. Dynamic mixed training, multi-scale inference and filtering are the keys to DST. We assume that these components are independent of each other and improve the performance of the model at different levels by minimizing the effects of false negatives and false positives. Tab. 3 indirectly supports this assumption. By successively adding dynamic mixed training, multi-scale inference and filtering, the F-score reaches 69.8%, 71.3% and 71.7%, respectively, i.e., 10.6%, 12.1% and 12.5% absolute improvements over the baseline F-score of 59.2%.

4.3 Experiments on Scene Text Detection

In this section, we firstly present the experiments of BBS and DST, respectively. And then, the combined experiments are given in supplementary materials.

4.3.1 Bounding Box Supervision

Quadrilateral-type datasets. Tab. 4 lists the experimental results of various methods on the ICDAR2015, ICDAR2017-MLT and MSRA-TD500 datasets. For ICDAR2015, PSENet [psenet] trained with pseudo labels achieves almost the same F-score (80.2%, and 85.0% when using the model pre-trained on SynthText) as PSENet trained with ground truth (80.5% and 85.0%), which demonstrates the high quality of the pseudo labels generated by BBS. In contrast, directly training the detector with bounding box annotations gives an unsatisfactory F-score (69.6% and 73.5%), a gap of more than 10% compared with polygon annotations. EAST [zhou2017east] behaves similarly to PSENet. Moreover, without pre-training, its F-score with pseudo labels (78.0%) is a 1.0% improvement over that with ground truth (77.0%). For MSRA-TD500, annotations are provided at the line level, including the spaces between words in the box. Therefore, bounding box annotations on MSRA-TD500 usually contain a large amount of background, which causes poor performance (35.2% for EAST and 41.4% for PSENet). In this case, the pseudo labels from BBS still bring almost the same performance (70.7% for EAST and 78.9% for PSENet) as polygon annotations (70.4% for EAST and 79.1% for PSENet). In addition, after using the model pre-trained on SynthText, PSENet and EAST obtain almost the same improvement regardless of which labels are used (pseudo labels or ground truth). For ICDAR2017-MLT, the behavior is similar to ICDAR2015: pseudo labels perform comparably to ground truth. The difference is that bounding box annotations also bring relatively good performance (62.8% for EAST and 64.2% for PSENet). The main reason is that texts in ICDAR2017-MLT have small tilt angles and sizes.

Polygon-type dataset. Tab. 4 lists the experimental results on Total-Text [totaltext]. The annotations on Total-Text are complex polygons. The strong performance obtained with pseudo labels further demonstrates the significance of our work, and Fig. 7 gives some visualizations of ground truth and pseudo labels. Similar to MSRA-TD500, bounding box annotations on Total-Text also contain plenty of background, which causes poor performance (F-scores of 45.0% and 49.6% without and with pre-training). Using the pseudo labels generated by BBS still achieves strong performance (78.5% and 80.1%), with 33.5% and 30.5% improvements.

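| Method | Ratio | Pre | P | R | F |
|---|---|---|---|---|---|
| Strong Supervision | | | | | |
| EAST | 100% | - | 76.9 | 77.1 | 77.0 |
| EAST | 100% | ✓ | 82.0 | 82.4 | 82.2 |
| PSENet | 100% | - | 81.6 | 79.5 | 80.5 |
| PSENet | 100% | ✓ | 86.4 | 83.5 | 85.0 |
| EAST | 50% | - | 77.1 | 75.7 | 76.4 |
| EAST+DST | 50% | - | 77.6 | 77.5 | 77.5 (+1.1) |
| EAST | 50% | ✓ | 79.1 | 79.2 | 79.1 |
| EAST+DST | 50% | ✓ | 82.6 | 79.7 | 81.1 (+2.0) |
| PSENet | 50% | - | 76.6 | 73.6 | 75.1 |
| PSENet+DST | 50% | - | 79.8 | 76.8 | 78.3 (+3.2) |
| PSENet | 50% | ✓ | 78.9 | 76.6 | 77.7 |
| PSENet+DST | 50% | ✓ | 84.0 | 82.9 | 83.5 (+5.8) |

Table 5: The results of DST on ICDAR2015 (%). ‘Ratio’, ‘Pre’, ‘P’, ‘R’, and ‘F’ refer to ‘the proportion of using labeled data’, ‘pretraining on external data’, ‘Precision’, ‘Recall’ and ‘F-score’, respectively. Values in parentheses are F-score gains over the same detector without DST; the gains are at least +1.1 points.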
| Method | Ratio | Pre | P | R | F |
|---|---|---|---|---|---|
| Strong Supervision | | | | | |
| EAST | 100% | - | 68.1 | 63.2 | 65.6 |
| PSENet | 100% | - | 73.1 | 67.3 | 70.1 |
| EAST | 50% | - | 70.4 | 60.2 | 64.5 |
| EAST+DST | 50% | - | 69.2 | 63.5 | 66.2 (+1.7) |
| PSENet | 50% | - | 69.1 | 68.4 | 68.7 |
| PSENet+DST | 50% | - | 70.6 | 69.8 | 70.2 (+1.5) |

Table 6: The results of DST on ICDAR2017-MLT (in %). ‘Ratio’, ‘P’, ‘R’, and ‘F’ refer to ‘the proportion of using labeled data’, ‘Precision’, ‘Recall’ and ‘F-score’, respectively. Values in parentheses are the F-score gains over the corresponding baseline without DST; the smallest gain is +1.5 points.
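The F column in Tab. 5 and Tab. 6 can be cross-checked against the P and R columns, assuming the usual definition of the F-score as the harmonic mean of precision and recall. The helper name `f_score` below is ours and is only a minimal sketch for checking table rows; it is not part of any released evaluation code.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R) pairs copied from Tab. 5: EAST with 50% labeled data, no pre-training.
baseline = f_score(77.1, 75.7)   # EAST
with_dst = f_score(77.6, 77.5)   # EAST+DST

# The reported gain is the difference of the rounded F-scores.
gain = round(with_dst, 1) - round(baseline, 1)
print(round(baseline, 1), round(with_dst, 1), round(gain, 1))
```

Under this assumption the two rows reproduce the tabulated F-scores and the gain reported for EAST+DST.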

4.3.2 Dynamic Self Training

We present the experimental results on the ICDAR datasets in this subsection; more experiments (e.g., on Total-Text) are in the supplementary material. For ICDAR2015, Tab. 5 lists the results of self-training, where ‘Ratio’ denotes the part of the training set used as labeled data. With DST, EAST trained on half labeled and half unlabeled data obtains F-scores of 77.5% without and 81.1% with the model pre-trained on SynthText, almost the same as strong supervision (77.0% and 82.2%). With the pre-trained model, the gain from DST grows to 2.0 points. Similarly, PSENet achieves competitive F-scores (78.3% and 83.5%) with only half of the training set labeled. For ICDAR2017, Tab. 6 lists the results of self-training. As on ICDAR2015, our method achieves competitive F-scores (66.2% for EAST and 70.2% for PSENet) regardless of which detector is used.
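The ‘Ratio’ setting can be illustrated with a small sketch that divides a training set into a labeled part and an unlabeled part. The function name `split_by_ratio`, the random shuffling, and the fixed seed are our assumptions for illustration; the text above does not state how the subsets were chosen.

```python
import random


def split_by_ratio(image_ids, ratio, seed=0):
    """Split image ids into (labeled, unlabeled) lists.

    `ratio` is the proportion of images kept as labeled data.
    Shuffling with a fixed seed is an assumption made for reproducibility.
    """
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n_labeled = round(len(ids) * ratio)
    return ids[:n_labeled], ids[n_labeled:]


# ICDAR2015 has 1,000 training images; Ratio = 50% gives 500 labeled images.
labeled, unlabeled = split_by_ratio(range(1000), 0.5)
print(len(labeled), len(unlabeled))
```

The unlabeled half keeps its images but drops its annotations, so it can only contribute through pseudo labels predicted by the detector.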

Figure 6: The visualization of Skeleton Attention in BBS. Skeleton Attention enables the network to focus on the skeleton of each text instance and thus segment it precisely.
Figure 7: The visualization of bounding-box annotation and the pseudo labels generated by BBS on Total-Text [totaltext].

5 Conclusion

In this paper, we present a simple but surprisingly effective and practical method termed SelfText Beyond Polygon (SBP), which includes two components: Bounding Box Supervision (BBS) and Dynamic Self-Training (DST). BBS, using only bounding box annotations, achieves almost the same performance as training with expensive polygon annotations. DST trains with limited labeled data together with unlabeled data to further reduce the data cost. The experiments show that our method achieves almost the same performance as strong supervision while substantially reducing the data cost, which can provide a new perspective for weakly supervised text detection and lower labeling costs for the industry.

6 Appendix

6.1 More Analysis for Dynamic Mixed Training

As shown in Fig. 8, full supervision (green line) usually converges stably, whereas normal self-training (red line) fluctuates in performance because of the adverse effects of false negatives (FNs) and false positives (FPs). Dynamic Mixed Training (blue line) achieves competitive convergence by dynamically adjusting the number of unlabeled samples in a mini-batch. The basic idea is simple but effective: the network trains with massive unlabeled data in the early training stage to mine valuable information, and with few unlabeled data in the last training stage to minimize the adverse effects of FNs and FPs. We argue that self-training requires such an adjustment strategy to converge stably.
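The adjustment strategy can be sketched as a schedule for the number of unlabeled samples per mini-batch. The linear decay, the `max_fraction` parameter, and both function names are our assumptions; this section only states that many unlabeled samples are used early in training and few in the last stage.

```python
def unlabeled_per_batch(step, total_steps, batch_size, max_fraction=0.5):
    """Number of unlabeled samples in the mini-batch at `step`.

    Starts at `max_fraction * batch_size` and decays linearly to 0.
    Both the linear shape and `max_fraction` are illustrative assumptions.
    """
    if total_steps <= 1:
        return 0
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    return int(round(batch_size * max_fraction * (1.0 - progress)))


def mixed_batch_sizes(step, total_steps, batch_size, max_fraction=0.5):
    """Return (n_labeled, n_unlabeled) for the mini-batch at `step`."""
    n_unlabeled = unlabeled_per_batch(step, total_steps, batch_size, max_fraction)
    return batch_size - n_unlabeled, n_unlabeled


# Early batches mix in many pseudo-labeled images, late batches almost none.
for step in (0, 50, 99):
    print(step, mixed_batch_sizes(step, total_steps=100, batch_size=16))
```

Any monotonically decreasing schedule would express the same idea; the linear form is chosen here only because it is the simplest to read.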

Figure 8: Dynamic Mixed Training. Dynamic Self-Training achieves much more stable convergence compared with common self-training. EAST [zhou2017east] is used as the base detector, and the training set of ICDAR2015 is divided into half labeled data and half unlabeled data. ‘Self-Training’ denotes traditional self-training without dynamic mixed training.
Figure 9: The pseudo labels generated without bounding box annotation contain large amounts of false negatives and false positives, and their performance falls short of that obtained with bounding boxes (i.e., 59.4% vs. 78.0% for EAST [zhou2017east] on ICDAR2015).
| Dataset | Method | Annotation | Precision | Recall | F-score |
|---|---|---|---|---|---|
| Total-Text | PSENet | GT | 80.4 | 76.5 | 78.4 |
| Total-Text | PSENet | b-GT | 46.5 | 43.6 | 45.0 |
| Total-Text | PSENet | PL(ST) | 80.3 | 73.5 | 76.8 (+31.8) |
| Total-Text | PSENet | PL(Curved ST) | 81.7 | 75.6 | 78.5 (+33.5) |
| MSRA-TD500 | PSENet | GT | 80.6 | 77.7 | 79.1 |
| MSRA-TD500 | PSENet | b-GT | 47.2 | 36.9 | 41.4 |
| MSRA-TD500 | PSENet | PL(ST) | 80.3 | 77.5 | 78.9 (+37.5) |
| MSRA-TD500 | EAST | GT | 71.8 | 69.1 | 70.4 |
| MSRA-TD500 | EAST | PL(ST) without box | 66.6 | 53.5 | 59.4 |
| MSRA-TD500 | EAST | b-GT | 40.5 | 31.1 | 35.2 |
| MSRA-TD500 | EAST | PL(ST) | 71.3 | 70.2 | 70.7 (+35.5) |

Table 7: Box supervision vs. strong supervision (in %). ‘GT’, ‘b-GT’, ‘PL(ST)’, ‘PL(Curved ST)’ denote the ‘Ground Truth’, ‘Bounding box annotation’, ‘Pseudo label generated by BBS with SynthText’, ‘Pseudo label generated by BBS with Curved SynthText’, respectively. ‘PL(ST) without box’ denotes ‘Pseudo label generated by the detector without bounding box annotation’. Values in parentheses are the F-score gains over the corresponding b-GT row; the smallest gain is +31.8 points.
| Method | Ratio | Precision | Recall | F-score |
|---|---|---|---|---|
| Baseline | 10% | 59.2 | 59.3 | 59.2 |
| Baseline+DST | 10% | 73.7 | 69.3 | 71.7 (+10.1) |
| Baseline | 30% | 71.8 | 69.3 | 70.5 |
| Baseline+DST | 30% | 76.6 | 76.0 | 76.3 (+5.5) |
| Baseline | 50% | 76.6 | 73.6 | 75.1 |
| Baseline+DST | 50% | 79.8 | 76.8 | 78.3 (+3.2) |
| Baseline | 70% | 77.4 | 76.4 | 76.9 |
| Baseline+DST | 70% | 79.1 | 77.4 | 78.2 (+1.3) |
| Baseline | 90% | 80.3 | 78.9 | 79.6 |
| Baseline+DST | 90% | 80.6 | 79.8 | 80.2 (+0.6) |
| Baseline | 100% | 81.6 | 79.5 | 80.5 |

Table 8: Labeled data size for DST, evaluated on ICDAR2015 (in %). We use PSENet [psenet] as the base detector. ‘Ratio’ denotes the proportion of labeled data in the total data, where the total data size is 1,000 training images for ICDAR2015 [karatzas2015icdar]. Values in parentheses are the reported F-score gains over the baseline at the same ratio; the smallest gain is +0.6 points.
Figure 10: The visualization of bounding-box annotation, polygon-based ground truth and polygon-based pseudo labels generated by BBS on two datasets: Total-Text [totaltext] and MSRA-TD500 [msra].

6.2 Ablation Study

Box supervision vs. strong supervision. Compared with strong supervision, Bounding Box Supervision (BBS) not only saves significant annotation cost but also achieves almost the same performance. In Tab. 7, we adopt two detectors (i.e., PSENet [psenet] and EAST [zhou2017east]) and two datasets (i.e., Total-Text [totaltext] and MSRA-TD500 [msra]) to support this conclusion. On Total-Text, PSENet trained with the pseudo labels generated by BBS achieves satisfactory F-scores (76.8% with SynthText and 78.5% with Curved SynthText), almost the same as with ground truth (78.4%) and 31.8 and 33.5 points higher than with bounding box annotations (45.0%). On MSRA-TD500, EAST and PSENet both achieve F-scores with the pseudo labels from BBS that are competitive with those using ground truth (70.7% vs. 70.4% and 78.9% vs. 79.1%), with large improvements over bounding box annotations (35.5 and 37.5 points, respectively). The competitive performance of BBS demonstrates the practicality and efficiency of the pseudo labels; Fig. 10 visualizes some of them.

Besides, bounding box annotation as a location prior is necessary for BBS. Fig. 9 compares pseudo labels generated with and without bounding box annotation: without the box prior, the pseudo labels contain large amounts of false negatives and false positives (the red circles), which causes unsatisfactory performance (i.e., 59.4% vs. 78.0% for EAST [zhou2017east] on ICDAR2015 [karatzas2015icdar]), as shown in Tab. 7.
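One way to picture the role of the box prior is to keep predicted text pixels only inside annotated boxes, which discards false positives in the background. The sketch below is a simplified illustration under that assumption: the function name `mask_with_boxes`, the 0.5 threshold, and the axis-aligned `(x0, y0, x1, y1)` box format are ours, and this is not the full BBS procedure.

```python
def mask_with_boxes(score_map, boxes, threshold=0.5):
    """Binarize a text score map, keeping only pixels inside annotated boxes.

    score_map: 2D list of floats in [0, 1] (rows of pixels).
    boxes: list of (x0, y0, x1, y1) upright boxes, end coordinates exclusive.
    Pixels outside every box are set to 0 regardless of their score.
    """
    height = len(score_map)
    width = len(score_map[0]) if height else 0
    mask = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        for y in range(max(y0, 0), min(y1, height)):
            for x in range(max(x0, 0), min(x1, width)):
                if score_map[y][x] >= threshold:
                    mask[y][x] = 1
    return mask


# A confident response at (0, 0) lies outside the annotated box and is removed.
scores = [
    [0.9, 0.1, 0.1, 0.1],
    [0.1, 0.2, 0.8, 0.1],
    [0.1, 0.7, 0.9, 0.1],
    [0.1, 0.1, 0.1, 0.1],
]
mask = mask_with_boxes(scores, boxes=[(1, 1, 3, 3)])
for row in mask:
    print(row)
```

Without the `boxes` argument there would be nothing to separate the spurious response at the top-left corner from real text, which is the failure mode shown in Fig. 9.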

Labeled data size for DST. The goal of Dynamic Self-Training is to train detectors with limited labeled data plus unlabeled data. How much labeled data, then, is enough to train a competitive model? Tab. 8 shows the relation between the proportion of labeled data and performance on ICDAR2015 [karatzas2015icdar]. The F-score increases with the proportion of labeled data and becomes competitive with only part of the training set labeled: for example, DST reaches 78.3% with 50% labeled data, against 80.5% for the fully labeled baseline. Besides, the improvement from DST is largest when labeled data is scarcest: with only 10% of the training set labeled, it raises the F-score from 59.2% to 71.7%.