Recently, segmentation-based methods have become quite popular in scene text detection, as segmentation results can more accurately describe scene text of various shapes, such as curved text. However, binarization is an essential post-processing step for segmentation-based detection: it converts the probability maps produced by a segmentation method into bounding boxes/regions of text. In this paper, we propose a module named Differentiable Binarization (DB), which performs the binarization process inside a segmentation network. Optimized along with a DB module, a segmentation network can adaptively set the thresholds for binarization, which not only simplifies the post-processing but also enhances the performance of text detection. Based on a simple segmentation network, we validate the performance improvements of DB on five benchmark datasets, where it consistently achieves state-of-the-art results in terms of both detection accuracy and speed. In particular, with a light-weight backbone, the performance improvements by DB are significant, so that we can look for an ideal tradeoff between detection accuracy and efficiency. Specifically, with a ResNet-18 backbone, our detector achieves an F-measure of 82.8, running at 62 FPS, on the MSRA-TD500 dataset. Code is available at: https://github.com/MhLiao/DB
In recent years, reading text in scene images has become an active research area, due to its wide practical applications such as image/video understanding, visual search, automatic driving, and assistance for the visually impaired.
As a key component of scene text reading, scene text detection, which aims to localize the bounding box or region of each text instance, is still a challenging task, since scene text often comes in various scales and shapes, including horizontal, multi-oriented and curved text. Segmentation-based scene text detection has attracted a lot of attention recently, as it can describe text of various shapes, benefiting from its pixel-level prediction results. However, most segmentation-based methods require complex post-processing for grouping the pixel-level prediction results into detected text instances, resulting in a considerable time cost in the inference procedure. Take two recent state-of-the-art methods for scene text detection as examples: PSENet [Wang et al.2019a] proposed the post-processing of progressive scale expansion for improving the detection accuracy; the pixel embedding in [Tian et al.2019] is used for clustering the pixels based on the segmentation results, which requires calculating the feature distances among pixels.
Most existing detection methods use a similar post-processing pipeline, as shown in Fig. 2 (following the blue arrows): first, they set a fixed threshold for converting the probability map produced by a segmentation network into a binary image; then, some heuristic techniques like pixel clustering are used for grouping pixels into text instances. Alternatively, our pipeline (following the red arrows in Fig. 2) aims to insert the binarization operation into a segmentation network for joint optimization. In this manner, the threshold value at every location of an image can be adaptively predicted, which can fully distinguish the pixels of the foreground from those of the background. However, because the standard binarization function is not differentiable, we instead present an approximate function for binarization called Differentiable Binarization (DB), which is fully differentiable when trained along with a segmentation network.
The major contribution of this paper is the proposed DB module, which is differentiable and makes the process of binarization end-to-end trainable in a CNN. By combining a simple network for semantic segmentation with the proposed DB module, we propose a robust and fast scene text detector. From the performance evaluation of the DB module, we find that our detector has several prominent advantages over the previous state-of-the-art segmentation-based approaches.
Our method achieves consistently better performances on five benchmark datasets of scene text, including horizontal, multi-oriented and curved text.
Our method performs much faster than the previous leading methods, as DB can provide a highly robust binarization map, significantly simplifying the post-processing.
DB works quite well when using a light-weight backbone, which significantly enhances the detection performance with the backbone of ResNet-18.
As DB can be removed in the inference stage without sacrificing the performance, there is no extra memory/time cost for testing.
Recent scene text detection methods can be roughly classified into two categories: Regression-based methods and segmentation-based methods.
Regression-based methods are a series of models which directly regress the bounding boxes of the text instances. TextBoxes [Liao et al.2017] modified the anchors and the scale of the convolutional kernels based on SSD [Liu et al.2016] for text detection. TextBoxes++ [Liao, Shi, and Bai2018] and DMPNet [Liu and Jin2017] applied quadrilateral regression to detect multi-oriented text. SSTD [He et al.2017a] proposed an attention mechanism to roughly identify text regions. RRD [Liao et al.2018] decoupled the classification and regression by using rotation-invariant features for classification and rotation-sensitive features for regression, for better effect on multi-oriented and long text instances. EAST [Zhou et al.2017] and DeepReg [He et al.2017b] are anchor-free methods, which applied pixel-level regression for multi-oriented text instances. SegLink [Shi, Bai, and Belongie2017] regressed the segment bounding boxes and predicted their links, to deal with long text instances. DeRPN [Xie et al.2019b] proposed a dimension-decomposition region proposal network to handle the scale problem in scene text detection. Regression-based methods usually enjoy simple post-processing algorithms (e.g. non-maximum suppression). However, most of them are limited in representing accurate bounding boxes for irregular shapes, such as curved shapes.
Segmentation-based methods usually combine pixel-level prediction and post-processing algorithms to get the bounding boxes. [Zhang et al.2016] detected multi-oriented text by semantic segmentation and MSER-based algorithms. Text border is used in [Xue, Lu, and Zhan2018] to split the text instances. Mask TextSpotter [Lyu et al.2018a, Liao et al.2019] detected arbitrary-shape text instances in an instance segmentation manner based on Mask R-CNN. PSENet [Wang et al.2019a] proposed progressive scale expansion by segmenting the text instances with kernels of different scales. Pixel embedding is proposed in [Tian et al.2019] to cluster the pixels from the segmentation results. PSENet [Wang et al.2019a] and SAE [Tian et al.2019] proposed new post-processing algorithms for the segmentation results, resulting in lower inference speed. Instead, our method focuses on improving the segmentation results by including the binarization process in the training period, without loss of inference speed.
Fast scene text detection methods focus on both the accuracy and the inference speed. TextBoxes [Liao et al.2017], TextBoxes++ [Liao, Shi, and Bai2018], SegLink [Shi, Bai, and Belongie2017], and RRD [Liao et al.2018] achieved fast text detection by following the detection architecture of SSD [Liu et al.2016]. EAST [Zhou et al.2017] proposed to apply PVANet [Kim et al.2016] to improve its speed. Most of them cannot deal with text instances of irregular shapes, such as curved shapes. Compared to the previous fast scene text detectors, our method not only runs faster but can also detect text instances of arbitrary shapes.
The architecture of our proposed method is shown in Fig. 3. First, the input image is fed into a feature-pyramid backbone. Second, the pyramid features are up-sampled to the same scale and cascaded to produce feature $F$. Then, feature $F$ is used to predict both the probability map ($P$) and the threshold map ($T$). After that, the approximate binary map ($\hat{B}$) is calculated from $P$ and $T$. In the training period, the supervision is applied on the probability map, the threshold map, and the approximate binary map, where the probability map and the approximate binary map share the same supervision. In the inference period, the bounding boxes can be obtained easily from the approximate binary map or the probability map by a box formulation module.
Given a probability map $P \in R^{H \times W}$ produced by a segmentation network, where $H$ and $W$ indicate the height and width of the map, it is essential to convert it to a binary map $B \in R^{H \times W}$, where pixels with value $1$ are considered as valid text areas. Usually, this binarization process can be described as follows:
$$B_{i,j} = \begin{cases} 1 & \text{if } P_{i,j} \ge t, \\ 0 & \text{otherwise,} \end{cases} \tag{1}$$
where $t$ is the predefined threshold and $(i, j)$ indicates the coordinate point in the map.
The standard binarization described in Eq. 1 is not differentiable. Thus, it cannot be optimized along with the segmentation network in the training period. To solve this problem, we propose to perform binarization with an approximate step function:
$$\hat{B}_{i,j} = \frac{1}{1 + e^{-k (P_{i,j} - T_{i,j})}} \tag{2}$$
where $\hat{B}$ is the approximate binary map; $T$ is the adaptive threshold map learned from the network; $k$ indicates the amplifying factor, which is set empirically. This approximate binarization function behaves similarly to the standard binarization function (see Fig. 4) but is differentiable, and thus can be optimized along with the segmentation network in the training period. The differentiable binarization with adaptive thresholds can not only help differentiate text regions from the background, but also separate text instances which are closely joined. Some examples are illustrated in Fig. 7.
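As an illustration only, the following minimal Python sketch contrasts a hard threshold with a steep sigmoid of the difference between probability and threshold, which is the behaviour described above. The function names, the amplifying factor `k = 50`, and the sample values are assumptions chosen for the example.

```python
import math


def standard_binarize(p, t):
    """Hard threshold: 1 if the probability reaches the threshold, else 0."""
    return 1.0 if p >= t else 0.0


def differentiable_binarize(p, thresh, k=50.0):
    """Steep sigmoid of (p - thresh); k = 50 is an assumed amplifying factor."""
    return 1.0 / (1.0 + math.exp(-k * (p - thresh)))


# Hypothetical probabilities along one row of a map, with a constant threshold.
prob_row = [0.10, 0.28, 0.32, 0.90]
thresh_row = [0.30, 0.30, 0.30, 0.30]

hard = [standard_binarize(p, t) for p, t in zip(prob_row, thresh_row)]
soft = [differentiable_binarize(p, t) for p, t in zip(prob_row, thresh_row)]

print(hard)
print([round(v, 4) for v in soft])
```

Values far from the threshold saturate towards 0 or 1, while values near the threshold stay in between, so the soft output carries a usable gradient exactly where the hard threshold has none.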
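The reasons that DB improves the performance can be explained by the backpropagation of the gradients. Let's take the binary cross-entropy loss as an example. Define $f(x) = \frac{1}{1 + e^{-kx}}$ as our DB function, where $x = P_{i,j} - T_{i,j}$. Then the losses $l_+$ for positive labels and $l_-$ for negative labels are:
$$l_+ = -\log \frac{1}{1 + e^{-kx}}, \qquad l_- = -\log \left( 1 - \frac{1}{1 + e^{-kx}} \right) \tag{3}$$
We can easily compute the differential of the losses with the chain rule:
$$\frac{\partial l_+}{\partial x} = -k f(x) e^{-kx}, \qquad \frac{\partial l_-}{\partial x} = k f(x) \tag{4}$$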
The derivatives of $l_+$ and $l_-$ are also shown in Fig. 4. We can perceive from the differential that (1) the gradient is augmented by the amplifying factor $k$; (2) the positive labels and negative labels are optimized with different scales. The amplification of the gradient facilitates the optimization and helps to produce more distinctive predictions. Moreover, as $x = P_{i,j} - T_{i,j}$, the gradient of $P$ is affected and rescaled between the foreground and the background by $T$.
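The gradient argument can be checked numerically. The sketch below assumes that the DB function is the sigmoid $f(x) = 1/(1 + e^{-kx})$ with binary cross-entropy losses, and compares the closed-form derivatives obtained by the chain rule with central finite differences. The value `K = 50` and all function names are assumptions made for the illustration.

```python
import math

K = 50.0  # assumed amplifying factor, for illustration only


def f(x, k=K):
    """Assumed DB function: a sigmoid with slope k."""
    return 1.0 / (1.0 + math.exp(-k * x))


def loss_pos(x, k=K):
    """Cross-entropy loss for a positive label."""
    return -math.log(f(x, k))


def loss_neg(x, k=K):
    """Cross-entropy loss for a negative label (keep |x| small to avoid log(0))."""
    return -math.log(1.0 - f(x, k))


def grad_pos(x, k=K):
    """Closed-form derivative of loss_pos with respect to x."""
    return -k * f(x, k) * math.exp(-k * x)


def grad_neg(x, k=K):
    """Closed-form derivative of loss_neg with respect to x."""
    return k * f(x, k)


def numeric_grad(fn, x, eps=1e-6):
    """Central finite difference, used only to check the closed forms."""
    return (fn(x + eps) - fn(x - eps)) / (2.0 * eps)


for x in (-0.05, 0.0, 0.05):
    print(x, grad_pos(x), numeric_grad(loss_pos, x),
          grad_neg(x), numeric_grad(loss_neg, x))
```

Both closed forms carry the factor `k`, which is the amplification discussed above; the positive-label gradient is largest in magnitude for `x < 0` and the negative-label gradient for `x > 0`.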
The threshold map in Fig. 1 is similar in appearance to the text border map in [Xue, Lu, and Zhan2018]. However, the motivation and usage of the threshold map are different from those of the text border map. The threshold map with/without supervision is visualized in Fig. 6. The threshold map highlights the text border region even without supervision. This indicates that the border-like threshold map is beneficial to the final results. Thus, we apply border-like supervision on the threshold map for better guidance. An ablation study of the supervision is discussed in the Experiments section. As for the usage, the text border map in [Xue, Lu, and Zhan2018] is used to split the text instances, while our threshold map serves as the thresholds for the binarization.
Deformable convolution [Dai et al.2017, Zhu et al.2019] can provide a flexible receptive field for the model, which is especially beneficial to the text instances of extreme aspect ratios. Following [Zhu et al.2019], modulated deformable convolutions are applied in all the convolutional layers in stages conv3, conv4, and conv5 in the ResNet-18 or ResNet-50 backbone [He et al.2016a].
The label generation for the probability map is inspired by PSENet [Wang et al.2019a]. Given a text image, each polygon of its text regions is described by a set of segments:
$$G = \{S_k\}_{k=1}^{n} \tag{5}$$
$n$ is the number of vertexes, which may differ across datasets, e.g., 4 for the ICDAR 2015 dataset [Karatzas et al.2015] and 16 for the CTW1500 dataset [Liu et al.2019a]. Then the positive area is generated by shrinking the polygon $G$ to $G_s$ using the Vatti clipping algorithm [Vati1992]. The offset $D$ of shrinking is computed from the perimeter $L$ and area $A$ of the original polygon:
$$D = \frac{A (1 - r^2)}{L} \tag{6}$$
where $r$ is the shrink ratio, set empirically.
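A small worked example may help to make the shrink offset concrete. The sketch below assumes the offset takes the form $D = A(1 - r^2)/L$, computes the area with the shoelace formula, and uses a hypothetical shrink ratio `r = 0.4` and a hypothetical rectangle; it only computes the offset and does not perform the Vatti clipping itself.

```python
import math


def polygon_area(points):
    """Shoelace formula; points is a list of (x, y) vertices in order."""
    n = len(points)
    twice_area = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def polygon_perimeter(points):
    """Sum of the edge lengths of the closed polygon."""
    n = len(points)
    total = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += math.hypot(x2 - x1, y2 - y1)
    return total


def shrink_offset(points, r=0.4):
    """Assumed form D = A * (1 - r^2) / L; r = 0.4 is an illustrative value."""
    return polygon_area(points) * (1.0 - r * r) / polygon_perimeter(points)


# Hypothetical 100 x 20 text box: A = 2000, L = 240.
box = [(0, 0), (100, 0), (100, 20), (0, 20)]
print(shrink_offset(box))  # close to 7.0 for these assumed numbers
```

Because the offset scales with area over perimeter, thin and long text lines receive a smaller offset than compact regions of the same area.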
With a similar procedure, we can generate labels for the threshold map. First, the text polygon $G$ is dilated with the same offset $D$ to $G_d$. We consider the gap between $G_s$ and $G_d$ as the border of the text regions, where the label of the threshold map can be generated by computing the distance to the closest segment in $G$.
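The distance computation behind the threshold-map label can be sketched as follows. The text only states that the label is derived from the distance to the closest polygon segment; the specific normalization used here, `1 - distance / offset` clipped at zero, is an assumption made for illustration, as are the function names and the sample polygon.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment with endpoints a and b."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def distance_to_polygon(p, polygon):
    """Distance from p to the closest segment of the closed polygon."""
    n = len(polygon)
    return min(
        point_segment_distance(p, polygon[i], polygon[(i + 1) % n])
        for i in range(n)
    )


def threshold_label(p, polygon, offset):
    """Assumed label: 1 on the polygon border, falling linearly to 0 at `offset`."""
    return max(0.0, 1.0 - distance_to_polygon(p, polygon) / offset)


box = [(0, 0), (100, 0), (100, 20), (0, 20)]  # hypothetical text polygon
for point in [(50, 0), (50, 3.5), (50, -7), (50, 10)]:
    print(point, threshold_label(point, box, offset=7.0))
```

Under this assumption the label peaks on the original border and vanishes both deep inside the text and beyond the dilated polygon, which gives the border-like appearance described above.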
The loss function $L$ can be expressed as a weighted sum of the loss for the probability map $L_s$, the loss for the binary map $L_b$, and the loss for the threshold map $L_t$:
$$L = L_s + \alpha \times L_b + \beta \times L_t \tag{7}$$
where $L_s$ is the loss for the probability map and $L_b$ is the loss for the binary map. The weights $\alpha$ and $\beta$ are set according to the numeric values of the losses.
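We apply a binary cross-entropy (BCE) loss for both $L_s$ and $L_b$. To overcome the imbalance between the numbers of positives and negatives, hard negative mining is used in the BCE loss by sampling the hard negatives.
$$L_s = L_b = -\sum_{i \in S_l} \left[ y_i \log x_i + (1 - y_i) \log (1 - x_i) \right] \tag{8}$$
where $x_i$ is the prediction and $y_i$ is the label. $S_l$ is the sampled set, in which the ratio of positives to negatives is fixed.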
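$L_t$ is computed as the sum of $L_1$ distances between the prediction and label inside the dilated text polygon $G_d$:
$$L_t = \sum_{i \in R_d} | y_i^* - x_i^* | \tag{9}$$
where $R_d$ is a set of indexes of the pixels inside the dilated polygon $G_d$; $y^*$ is the label for the threshold map and $x^*$ is the corresponding prediction.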
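The three loss terms can be sketched in plain Python over flattened pixel lists. This is a minimal illustration, not the training code: the negative-to-positive ratio of 3, the weights `alpha = 1.0` and `beta = 10.0`, the clamping constant `eps`, and all function names are assumptions, and hard negative mining is implemented here simply by keeping the negatives with the largest loss.

```python
import math


def bce(pred, label, eps=1e-6):
    """Binary cross-entropy of one pixel; pred is clamped to avoid log(0)."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(label * math.log(pred) + (1.0 - label) * math.log(1.0 - pred))


def balanced_bce(preds, labels, neg_ratio=3):
    """BCE over all positives plus the hardest negatives (assumed ratio 1:3)."""
    pos = [bce(p, 1.0) for p, y in zip(preds, labels) if y == 1]
    neg = sorted((bce(p, 0.0) for p, y in zip(preds, labels) if y == 0),
                 reverse=True)
    n_neg = min(len(neg), neg_ratio * len(pos))
    return sum(pos) + sum(neg[:n_neg])


def l1_threshold_loss(preds, labels, inside):
    """Sum of absolute differences over pixels flagged as inside the dilated polygon."""
    return sum(abs(y - x) for x, y, m in zip(preds, labels, inside) if m)


def total_loss(loss_s, loss_b, loss_t, alpha=1.0, beta=10.0):
    """Weighted sum of the three terms; alpha and beta are assumed weights."""
    return loss_s + alpha * loss_b + beta * loss_t


# Hypothetical flattened predictions: one positive pixel, five negatives.
preds = [0.9, 0.2, 0.8, 0.1, 0.3, 0.05]
labels = [1, 0, 0, 0, 0, 0]
print(balanced_bce(preds, labels))
```

With one positive and a ratio of 3, only the three negatives with the highest loss (predictions 0.8, 0.3 and 0.2) contribute, so easy background pixels do not dominate the sum.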
In the inference period, we can either use the probability map or the approximate binary map to generate text bounding boxes, which produces almost the same results. For better efficiency, we use the probability map so that the threshold branch can be removed. The box formation process consists of three steps: (1) the probability map/the approximate binary map is first binarized with a constant threshold (0.2) to get the binary map; (2) the connected regions (shrunk text regions) are obtained from the binary map; (3) the shrunk regions are dilated with an offset $D'$ using the Vatti clipping algorithm [Vati1992]. $D'$ is calculated as
$$D' = \frac{A' \times r'}{L'} \tag{10}$$
where $A'$ is the area of the shrunk polygon; $L'$ is the perimeter of the shrunk polygon; $r'$ is a ratio set empirically.
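The three box formation steps can be sketched as below. This is a simplified stand-in rather than the actual procedure: each connected region is represented by its axis-aligned bounding box and expanded by the offset, instead of dilating a polygon with Vatti clipping. The offset is assumed to be area times ratio over perimeter, the ratio `1.5` and 4-connectivity are assumptions, and only the threshold 0.2 comes from the text.

```python
from collections import deque


def binarize(prob_map, thresh=0.2):
    """Step 1: constant-threshold binarization of the probability map."""
    return [[1 if p >= thresh else 0 for p in row] for row in prob_map]


def connected_regions(binary):
    """Step 2: 4-connected regions found by breadth-first search."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] and not seen[y][x]:
                seen[y][x] = True
                queue = deque([(y, x)])
                pixels = []
                while queue:
                    cy, cx = queue.popleft()
                    pixels.append((cy, cx))
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                regions.append(pixels)
    return regions


def dilate_box(pixels, ratio=1.5):
    """Step 3 (simplified): expand the bounding box by area * ratio / perimeter."""
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    x0, x1 = min(xs), max(xs) + 1
    y0, y1 = min(ys), max(ys) + 1
    area = (x1 - x0) * (y1 - y0)
    perimeter = 2 * ((x1 - x0) + (y1 - y0))
    offset = area * ratio / perimeter
    return (x0 - offset, y0 - offset, x1 + offset, y1 + offset)


# Hypothetical 5 x 10 probability map with two shrunk text regions.
prob_map = [[0.05] * 10 for _ in range(5)]
for y in (1, 2):
    for x in range(1, 5):
        prob_map[y][x] = 0.9
    for x in (7, 8):
        prob_map[y][x] = 0.6

regions = connected_regions(binarize(prob_map))
boxes = [dilate_box(r) for r in regions]
print(boxes)
```

A production pipeline would extract contours and offset the polygon itself, which is what allows curved text to be recovered; the rectangle here only illustrates how the offset is derived from each shrunk region.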
SynthText [Gupta, Vedaldi, and Zisserman2016] is a synthetic dataset whose images are synthesized from 8k background images. This dataset is only used to pre-train our model.
MLT-2017 dataset (https://rrc.cvc.uab.es/?ch=8) is a multi-language dataset. It includes 9 languages representing 6 different scripts. There are 7,200 training images, 1,800 validation images and 9,000 testing images in this dataset. We use both the training set and the validation set in the fine-tuning period.
ICDAR 2015 dataset [Karatzas et al.2015] consists of 1000 training images and 500 testing images, which were captured by Google Glass. The text instances are labeled at the word level.
MSRA-TD500 dataset [Yao et al.2012] is a multi-language dataset that includes English and Chinese. There are 300 training images and 200 testing images. The text instances are labeled at the text-line level. Following the previous methods [Zhou et al.2017, Lyu et al.2018b, Long et al.2018], we include an extra 400 training images from HUST-TR400 [Yao, Bai, and Liu2014].
CTW1500 dataset [Liu et al.2019a] is a dataset which focuses on curved text. It consists of 1000 training images and 500 testing images. The text instances are annotated at the text-line level.
Total-Text dataset [Ch’ng and Chan2017] is a dataset that includes text of various shapes, including horizontal, multi-oriented, and curved. There are 1255 training images and 300 testing images. The text instances are labeled at the word level.
For all the models, we first pre-train them on the SynthText dataset. Then, we finetune the models on the corresponding real-world datasets. The training batch size is set to 16. We follow a “poly” learning rate policy, where the learning rate at the current iteration equals the initial learning rate multiplied by $(1 - \frac{iter}{max\_iter})^{power}$; the initial learning rate is set to 0.007. We use a weight decay of 0.0001 and a momentum of 0.9. Here $max\_iter$ is the maximum number of iterations, which depends on the maximum number of epochs.
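The “poly” policy can be written as a one-line schedule. The sketch below assumes the usual form of this policy, in which the initial learning rate is multiplied by `(1 - iter / max_iter) ** power`; the exponent `power = 0.9` and the iteration counts are assumed values for illustration, while the initial learning rate 0.007 is the one stated above.

```python
def poly_lr(iteration, max_iter, base_lr=0.007, power=0.9):
    """Learning rate at `iteration` under a "poly" decay (power is assumed)."""
    return base_lr * (1.0 - iteration / max_iter) ** power


# Hypothetical run of 1000 iterations, sampled at a few points.
for it in (0, 250, 500, 750, 1000):
    print(it, poly_lr(it, max_iter=1000))
```

The rate starts at the initial value and decays smoothly to zero at the final iteration, which is why the maximum iteration count must be fixed from the number of epochs before training starts.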
The data augmentation for the training data includes: (1) random rotation within a fixed angle range; (2) random cropping; (3) random flipping. All the processed images are re-sized to a fixed size for better training efficiency.
In the inference period, we keep the aspect ratio of the test images and re-size the input images by setting a suitable height for each dataset. The inference speed is tested with a single 1080ti GPU in a single thread. The inference time cost consists of the model forward time cost and the post-processing time cost.
We conduct an ablation study on the MSRA-TD500 dataset and the CTW1500 dataset to show the effectiveness of our proposed differentiable binarization, the deformable convolution, and different backbones. The detailed experimental results are shown in Tab. 1.
Differentiable binarization. In Tab. 1, we can see that our proposed DB improves the performance significantly for both ResNet-18 and ResNet-50, in terms of F-measure, on both the MSRA-TD500 dataset and the CTW1500 dataset. Moreover, since DB can be removed in the inference period, the speed is the same as the one without DB.
Deformable convolution. As shown in Tab. 1, the deformable convolution also brings a performance gain, since it provides a flexible receptive field for the backbone, with small extra time costs. It increases the F-measure with both ResNet-18 and ResNet-50 on the MSRA-TD500 dataset and on the CTW1500 dataset.
Supervision of threshold map. Although the threshold maps with/without supervision are similar in appearance, the supervision brings a performance gain. As shown in Tab. 2, the supervision improves the results of both ResNet-18 and ResNet-50 on the MLT-2017 dataset.
Backbone. The proposed detector with a ResNet-50 backbone achieves better performance than with ResNet-18 but runs slower. Specifically, the best ResNet-50 model outperforms the best ResNet-18 model on both the MSRA-TD500 dataset and the CTW1500 dataset, at approximately double the time cost.
We compare our proposed method with previous methods on five standard benchmarks, including two benchmarks for curved text, one benchmark for multi-oriented text, and two multi-language benchmarks for long text lines. Some qualitative results are visualized in Fig. 7.
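Tab. 3: Detection results on the Total-Text dataset. P, R, and F denote precision, recall, and F-measure.
| Method | P | R | F | FPS |
|---|---|---|---|---|
| TextSnake [Long et al.2018] | 82.7 | 74.5 | 78.4 | - |
| ATRR [Wang et al.2019b] | 80.9 | 76.2 | 78.5 | - |
| MTS [Lyu et al.2018a] | 82.5 | 75.6 | 78.6 | - |
| TextField [Xu et al.2019] | 81.2 | 79.9 | 80.6 | - |
| LOMO [Zhang et al.2019]* | 87.6 | 79.3 | 83.3 | - |
| CRAFT [Baek et al.2019] | 87.6 | 79.9 | 83.6 | - |
| CSE [Liu et al.2019b] | 81.4 | 79.1 | 80.2 | - |
| PSE-1s [Wang et al.2019a] | 84.0 | 78.0 | 80.9 | 3.9 |
Tab. 4: Detection results on the CTW1500 dataset.
| Method | P | R | F | FPS |
|---|---|---|---|---|
| TextSnake [Long et al.2018] | 67.9 | 85.3 | 75.6 | 1.1 |
| TLOC [Liu et al.2019a] | 77.4 | 69.8 | 73.4 | 13.3 |
| PSE-1s [Wang et al.2019a] | 84.8 | 79.7 | 82.2 | 3.9 |
| SAE [Tian et al.2019] | 82.7 | 77.8 | 80.1 | 3 |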
Curved text detection. We demonstrate the shape robustness of our method on two curved text benchmarks (Total-Text and CTW1500). As shown in Tab. 3 and Tab. 4, our method achieves state-of-the-art performance in both accuracy and speed. Specifically, “DB-ResNet-50” outperforms the previous state-of-the-art method on both the Total-Text and the CTW1500 datasets. “DB-ResNet-50” runs faster than all previous methods, and the speed can be further improved by using a ResNet-18 backbone, with a small performance drop. Compared to the recent segmentation-based detector [Wang et al.2019a], which runs at 3.9 FPS on Total-Text, both “DB-ResNet-50 (800)” and “DB-ResNet-18 (800)” are faster.
Multi-oriented text detection. The ICDAR 2015 dataset is a multi-oriented text dataset that contains lots of small and low-resolution text instances. In Tab. 5, we can see that “DB-ResNet-50 (1152)” achieves state-of-the-art accuracy. Compared to the previous fastest method [Zhou et al.2017], “DB-ResNet-50 (736)” is more accurate and runs twice as fast. The speed can be further improved by applying ResNet-18 to the backbone (“DB-ResNet-18 (736)”).
Tab. 5: Detection results on the ICDAR 2015 dataset.
| Method | P | R | F | FPS |
|---|---|---|---|---|
| CTPN [Tian et al.2016] | 74.2 | 51.6 | 60.9 | 7.1 |
| EAST [Zhou et al.2017] | 83.6 | 73.5 | 78.2 | 13.2 |
| SSTD [He et al.2017a] | 80.2 | 73.9 | 76.9 | 7.7 |
| WordSup [Hu et al.2017] | 79.3 | 77 | 78.2 | - |
| Corner [Lyu et al.2018b] | 94.1 | 70.7 | 80.7 | 3.6 |
| TB [Liao, Shi, and Bai2018] | 87.2 | 76.7 | 81.7 | 11.6 |
| RRD [Liao et al.2018] | 85.6 | 79 | 82.2 | 6.5 |
| MCN [Liu et al.2018] | 72 | 80 | 76 | - |
| TextSnake [Long et al.2018] | 84.9 | 80.4 | 82.6 | 1.1 |
| PSE-1s [Wang et al.2019a] | 86.9 | 84.5 | 85.7 | 1.6 |
| SPCNet [Xie et al.2019a] | 88.7 | 85.8 | 87.2 | - |
| LOMO [Zhang et al.2019] | 91.3 | 83.5 | 87.2 | - |
| CRAFT [Baek et al.2019] | 89.8 | 84.3 | 86.9 | - |
| SAE(720) [Tian et al.2019] | 85.1 | 84.5 | 84.8 | 3 |
| SAE(990) [Tian et al.2019] | 88.3 | 85.0 | 86.6 | - |
Multi-language text detection. Our method is robust on multi-language text detection. As shown in Tab. 6 and Tab. 7, “DB-ResNet-50” is superior to previous methods in accuracy and speed. For the accuracy, “DB-ResNet-50” surpasses the previous state-of-the-art method on both the MSRA-TD500 and MLT-2017 datasets. For the speed, “DB-ResNet-50” is faster than the previous fastest method [Liao et al.2018] on the MSRA-TD500 dataset. With a light-weight backbone, “DB-ResNet-18 (736)” achieves an accuracy comparable to the previous state-of-the-art method [Liu et al.2018] (82.8 vs 83.0) and runs at 62 FPS on MSRA-TD500, which is 6.2 times faster than the previous fastest method [Liao et al.2018]. The speed can be further accelerated to 82 FPS (“ResNet-18 (512)”) by decreasing the input size.
Tab. 6: Detection results on the MSRA-TD500 dataset.
| Method | P | R | F | FPS |
|---|---|---|---|---|
| [He et al.2016b] | 71 | 61 | 69 | - |
| DeepReg [He et al.2017b] | 77 | 70 | 74 | 1.1 |
| RRPN [Ma et al.2018] | 82 | 68 | 74 | - |
| RRD [Liao et al.2018] | 87 | 73 | 79 | 10 |
| MCN [Liu et al.2018] | 88 | 79 | 83 | - |
| PixelLink [Deng et al.2018] | 83 | 73.2 | 77.8 | 3 |
| Corner [Lyu et al.2018b] | 87.6 | 76.2 | 81.5 | 5.7 |
| TextSnake [Long et al.2018] | 83.2 | 73.9 | 78.3 | 1.1 |
| [Xue, Lu, and Zhan2018] | 83.0 | 77.4 | 80.1 | - |
| [Xue, Lu, and Zhang2019] | 87.4 | 76.7 | 81.7 | - |
| CRAFT [Baek et al.2019] | 88.2 | 78.2 | 82.9 | 8.6 |
| SAE [Tian et al.2019] | 84.2 | 81.7 | 82.9 | - |
Tab. 7: Detection results on the MLT-2017 dataset.
| Method | P | R | F | FPS |
|---|---|---|---|---|
| Corner [Lyu et al.2018b] | 83.8 | 55.6 | 66.8 | - |
| PSE [Wang et al.2019a] | 73.8 | 68.2 | 70.9 | - |
One limitation of our method is that it cannot deal with “text inside text” cases, in which a text instance is inside another text instance. Although the shrunk text regions are helpful in cases where the text instance is not in the center region of another text instance, the method fails when the text instance is located exactly in the center region of another text instance. This is a common limitation for segmentation-based scene text detectors.
In this paper, we have presented a novel framework for detecting arbitrary-shape scene text, which includes the proposed differentiable binarization process (DB) in a segmentation network. The experiments have verified that our method (ResNet-50 backbone) consistently outperforms the state-of-the-art methods on five standard scene text benchmarks, in terms of speed and accuracy. In particular, even with a lightweight backbone (ResNet-18), our method can achieve competitive performance on all the testing datasets with real-time inference speed. In the future, we are interested in extending our method for end-to-end text spotting.
This work was supported by National Key R&D Program of China (No. 2018YFB1004600), to Dr. Xiang Bai by the National Program for Support of Top-notch Young Professionals and the Program for HUST Academic Frontier Youth Team 2017QYTD08.
Text-attentional convolutional neural network for scene text detection. IEEE Trans. Image Processing 25(6):2529–2541.