FAST: Searching for a Faster Arbitrarily-Shaped Text Detector with Minimalist Kernel Representation

We propose an accurate and efficient scene text detection framework, termed FAST (i.e., faster arbitrarily-shaped text detector). Different from recent advanced text detectors that use hand-crafted network architectures and complicated post-processing, resulting in low inference speed, FAST has two new designs. (1) We search the network architecture by designing a search space and reward function carefully tailored for text detection, leading to more powerful features than most networks searched for image classification. (2) We design a minimalist kernel representation (with only a 1-channel output) to model text of arbitrary shape, as well as a GPU-parallel post-processing that assembles text lines with negligible time overhead. Benefiting from these two designs, FAST achieves an excellent trade-off between accuracy and efficiency on several challenging datasets. For example, FAST-A0 yields 81.4% F-measure at 152.8 FPS on Total-Text, outperforming the previous fastest method by 1.5 points in accuracy and 70 FPS in speed. With TensorRT optimization, the inference speed can be further accelerated to over 600 FPS.


Introduction

Scene text detection is a fundamental task in computer vision with wide practical applications, such as image understanding, instant translation, and autonomous driving. With the remarkable progress of deep learning, a considerable number of methods Long et al. (2018); Wang et al. (2019a, b); Liao et al. (2020) have been proposed to detect text of arbitrary shape, and the performance on public datasets is constantly being refreshed. However, we argue that these methods still have room for improvement due to two main sub-optimal designs: (1) hand-crafted network architectures and (2) inefficient post-processing.

Figure 1: Text detection F-measure and inference speed of different text detectors on Total-Text Ch’ng and Chan (2017). Our FAST models enjoy faster inference speed and better accuracy than counterparts.

First, most existing text detectors adopt a heavy hand-crafted backbone (e.g., ResNet50 He et al. (2016)) to achieve excellent performance at the expense of inference speed. For higher efficiency, some methods Wang et al. (2019b); Liao et al. (2020) developed text detectors based on ResNet18 He et al. (2016), but this backbone was originally designed for image classification and may not be the best choice for text detection. Although many auto-searched lightweight networks Cai, Zhu, and Han (2018); Cai et al. (2019); Howard et al. (2019) have been presented, they focus only on image classification or general object detection, and their application to text detection is rarely considered. Therefore, how to design an efficient and powerful network specifically for text detection is a topic worth exploring.

Second, the post-processing of previous works usually takes over 30% of the whole inference time Wang et al. (2019a, b); Liao et al. (2020). Moreover, these post-processing approaches are designed to run on the CPU (see Figure 2), which makes them difficult to parallelize with GPU resources and results in relatively low efficiency. Consequently, it is also important to develop a GPU-parallel post-processing for real-time text detectors.

Method | F | Network (ms) | Post-Proc (ms) | FPS
PSENet-1s Wang et al. (2019a) | 80.9 | 118.0 | 145.0 | 3.9
PAN-512 Wang et al. (2019b) | 84.3 | 13.7 | 3.5 | 57.1
DB-R18 Liao et al. (2020) | 82.8 | 15.3 | 4.7 | 50.0
FAST-A1-512 (Ours) | 84.9 | 7.5 | 1.1 | 115.5
Figure 2: Overall pipelines of representative arbitrarily-shaped text detectors. Our FAST achieves significantly faster speed than previous methods, benefiting from (1) the network searched for text detection, and (2) the minimalist kernel representation with a GPU-parallel post-processing.

In this work, we propose an efficient and powerful text detection framework, termed FAST (Faster Arbitrarily-Shaped Text detector). As illustrated in Figure 2, the proposed FAST contains the following two main improvements to achieve high efficiency. (1) We carefully design a NAS search space and reward function for the text detection task. The searched efficient networks, termed TextNAS, provide more powerful features for text detection than networks searched on image classification (e.g., OFA Cai et al. (2019)). (2) We propose a minimalist kernel representation that formulates a text line as an eroded text region surrounded by peripheral pixels. Compared to existing kernel representations Wang et al. (2019a, b); Liao et al. (2020), our representation not only reduces the network output to a single channel, but also enables a GPU-parallel post-processing (i.e., text dilation). Combining the advantages of these designs, our method achieves an excellent trade-off between accuracy and inference speed.

To demonstrate the effectiveness of our FAST, we conduct extensive experiments on four challenging benchmarks, including Total-Text Ch’ng and Chan (2017), CTW1500 Liu et al. (2019b), ICDAR 2015 Karatzas et al. (2015), and MSRA-TD500 Yao et al. (2012). According to the model size, we name our text detectors FAST-A0 to FAST-A2. As shown in Figure 1, on Total-Text, FAST-A0-448 (i.e., scaling the shorter side of input images to 448 pixels) achieves 81.4% F-measure at 152.8 FPS, which is 1.5% higher and 70 FPS faster than the previous fastest method PAN-320. Our best model, FAST-A2-800, achieves 86.3% F-measure while still keeping a real-time speed (46.0 FPS).

In summary, our contributions are as follows:

(1) We develop an accurate and efficient arbitrarily-shaped text detector, termed FAST, which is completely GPU-parallel, including network and post-processing.

(2) We design a search space and reward function specifically for text detection, and search for a series of networks friendly to text detection with different inference speeds.

(3) We propose a minimalist kernel representation together with a GPU-parallel post-processing, significantly reducing the post-processing time.

(4) Our FAST achieves an astonishing speed of 152.8 FPS while maintaining competitive accuracy on Total-Text. With TensorRT, it can be further accelerated to over 600 FPS.

Figure 3: Overall architecture of FAST. The over-parameterized network is divided into four stages, each of which contains learnable blocks, for architecture search of text detection. The multi-level features from the network are upsampled and concatenated into the final feature map, which is used to predict text kernels. The GPU-parallel post-processing (i.e., text dilation) is applied to reconstruct complete text lines.

Related Work

Scene Text Detection.

Inspired by general object detection methods Ren et al. (2016); Liu et al. (2016), many methods Liao et al. (2017); Zhou et al. (2017); Shi, Bai, and Belongie (2017); Liao, Shi, and Bai (2018); Ma et al. (2018); Liao et al. (2018) have been proposed to detect horizontal and multi-oriented text. However, most of them fail to locate curved text accurately. To remedy this defect, recent methods cast the text detection task as a segmentation problem. PixelLink Deng et al. (2018) separated adjacent text lines by performing text/non-text prediction and link prediction at the pixel level. SPCNet Xie et al. (2019) and Mask TextSpotter Lyu et al. (2018a) proposed to detect arbitrarily-shaped text in an instance segmentation manner. SAE Tian et al. (2019) introduced a shape-aware loss and new cluster post-processing to distinguish adjacent text lines with various aspect ratios and small gaps. PSENet Wang et al. (2019a) proposed the progressive scale expansion (PSE) algorithm to merge multi-scale text kernels. Although the above methods achieve excellent performance, most of them run at a slow inference speed due to the complicated network architecture and cumbersome post-processing.

Real-time Text Detection.

With the growing demand of real-time applications, efficient text detection has attracted increasing attention. EAST Zhou et al. (2017) applied a fully convolutional network (FCN) to directly produce rotated rectangles or quadrangles for text regions, and was the first text detector to run at 20 FPS. PAN Wang et al. (2019b) and DB Liao et al. (2020) are two representative real-time text detectors, both of which adopt a lightweight backbone (i.e., ResNet18 He et al. (2016)) to speed up inference. For post-processing, PAN developed a learnable post-processing algorithm, namely pixel aggregation (PA), which improves accuracy by using the predicted similarity vectors, while DB proposed a box formation process that utilizes the Vatti clipping algorithm Vatti (1992) to dilate the predicted text kernels. Although these methods have simplified the text detection pipeline compared to previous methods Wang et al. (2019a); Long et al. (2018), there is still room for improvement in real-time text detection due to their sub-optimal hand-crafted network architectures and CPU-based post-processing.

Neural Architecture Search.

Recently, owing to Neural Architecture Search (NAS) techniques, there has been a significant change in how neural networks are designed. Many auto-searched efficient networks, such as Proxyless Cai, Zhu, and Han (2018), EfficientNet Tan and Le (2019), OFA Cai et al. (2019), and MobileNetV3 Howard et al. (2019), play increasingly important roles in industry and the research community. Despite these developments, such NAS-based models are mainly limited to a few tasks such as image classification and general object detection, and therefore generalize poorly to other tasks. To compensate for these drawbacks, many researchers have explored applying NAS methods to their specific domains, including semantic segmentation Liu et al. (2019a), pose estimation Xu et al. (2021), and scene text recognition Zhang et al. (2020a); Hong, Kim, and Choi (2020). However, NAS approaches have rarely been extended to text detection.

Proposed Method

Overall Architecture

As illustrated in Figure 3, the proposed FAST contains (1) an over-parameterized network (see Figure 3(b)) with multiple learnable blocks (see Figure 3(c)) for architecture search of text detection; and (2) a GPU-parallel, differentiable post-processing to rebuild complete text lines (see Figure 3(f)) from text kernels (see Figure 3(e)).

In the inference phase, we first feed the input image into the searched network and obtain multi-level features at 1/4, 1/8, 1/16, and 1/32 of the original image resolution. Then, we reduce the dimension of each feature map to 128 channels via a 3×3 convolution, and these feature maps are upsampled and concatenated to obtain the final feature map (see Figure 3(d)). After that, the final feature map passes through two convolutional layers to perform text kernel segmentation. Finally, we rebuild the complete text regions via the text dilation process with negligible time overhead, as shown in Figure 3(e)(f).
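A minimal PyTorch sketch of this fusion and prediction step is given below; the module name KernelHead, the stage channel widths, and the exact composition of the two-layer prediction head (e.g., the intermediate ReLU) are illustrative assumptions, not the exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class KernelHead(nn.Module):
    """Sketch of the feature fusion and 1-channel text-kernel prediction described above."""

    def __init__(self, in_channels=(64, 128, 256, 512), mid_channels=128):
        super().__init__()
        # 3x3 convolutions that reduce each stage's feature map to 128 channels
        self.reduce = nn.ModuleList(
            [nn.Conv2d(c, mid_channels, kernel_size=3, padding=1) for c in in_channels]
        )
        # two convolutional layers that predict the 1-channel text kernel
        self.head = nn.Sequential(
            nn.Conv2d(4 * mid_channels, mid_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, 1, kernel_size=1),
        )

    def forward(self, feats):
        # feats: multi-level features at 1/4, 1/8, 1/16, and 1/32 resolution
        size = feats[0].shape[-2:]
        fused = torch.cat(
            [F.interpolate(conv(f), size=size, mode="bilinear", align_corners=False)
             for conv, f in zip(self.reduce, feats)],
            dim=1,
        )
        return self.head(fused)  # 1-channel text-kernel logits at 1/4 resolution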

During training, we perform architecture search for text detection based on ProxylessNAS Cai, Zhu, and Han (2018). For each batch of data, we stochastically sample subnets (i.e., binary gates) from the over-parameterized network, and use two loss functions to optimize the text kernels predicted by the network and the text regions generated by the post-processing, respectively. During search, we resample the subnets, calculate rewards according to accuracy and inference speed, and then use a REINFORCE-based strategy to update the architecture parameters. The two updating steps are performed in an alternating manner. Once the architecture search is finished, we prune the redundant paths and obtain the final architecture.

Neural Architecture Search for Text Detection

Search Space.

Following ProxylessNAS Cai, Zhu, and Han (2018), we build an over-parameterized network for text detection architecture search. As shown in Figure 3(c), each stage of the network is comprised of a stride-2 convolution and a set of learnable blocks, where the 3×3 convolution with stride 2 is used for downsampling feature maps, and each learnable block consists of a set of candidate operations, from which the most appropriate one is selected as the final operation after NAS.

Specifically, we present a layer-level candidate set, defined as {conv3×3, conv1×3, conv3×1, identity}. As the 1×3 and 3×1 convolutions have asymmetric kernels, they help capture the features of text lines with extreme aspect ratios. Besides, the identity operator indicates that a layer is skipped, which is used to control the depth and inference speed of the model. In summary, because the network contains L learnable blocks in total, and each of them has four candidates, the size of the search space is 4^L.
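As a minimal sketch, the candidate set of one learnable block could be constructed as below; this keeps only the raw convolutions, whereas the actual blocks may additionally include normalization and activation layers.

import torch.nn as nn

def make_candidates(channels):
    """Layer-level candidate set {conv3x3, conv1x3, conv3x1, identity} of one learnable block."""
    return nn.ModuleDict({
        "conv3x3": nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        "conv1x3": nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1)),  # asymmetric kernel
        "conv3x1": nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0)),  # asymmetric kernel
        "identity": nn.Identity(),  # skips the layer to reduce depth and latency
    })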

Reward Function.

In addition to the search space, we design a customized reward function to search network architectures for efficient text detection. Specifically, given a model m, we define the reward function as:

reward(m) = (IoU_ker(m) + β · IoU_tex(m)) × (speed(m) / T)^w,   (1)

where IoU_ker(m) and IoU_tex(m) denote the intersection-over-union (IoU) metric of the predicted text kernels and text regions, respectively, and β is the weight of IoU_tex(m), which is empirically set to 0.5. Besides, speed(m) is the inference speed of the entire text detector measured on the GPU with batch size 1, and T is the target inference speed. w is a hyper-parameter that balances accuracy and inference speed, and is set to 0.1 following ProxylessNAS Cai, Zhu, and Han (2018).
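As a minimal sketch (the function name and argument order are ours, not from the paper), the reward in Eq. (1) can be computed as:

def nas_reward(iou_ker, iou_tex, speed, target_speed, beta=0.5, w=0.1):
    # weighted IoU of text kernels and text regions, scaled by a speed-ratio
    # term raised to the power w (cf. Eq. (1))
    return (iou_ker + beta * iou_tex) * (speed / target_speed) ** w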

Discussion.

Our method is different from existing works on NAS in two main aspects:

(1) We introduce asymmetric convolutions Szegedy et al. (2016) into the search space to capture the features of extreme aspect-ratio text lines, whereas most existing NAS methods adopt MBConv Sandler et al. (2018) as the building block and do not consider the geometric characteristics of text.

(2) We propose a specialized reward function that considers the performance of both text kernels and text regions, enabling effective architecture search for text detection. In contrast, most previous reward functions Tan et al. (2019); Tan and Le (2019); Howard et al. (2019); Wang et al. (2020a) are designed for image classification or general object detection and are not suitable for arbitrarily-shaped text detection.

In addition, we adopt ProxylessNAS Cai, Zhu, and Han (2018) as the search algorithm in this work. Whether the NAS algorithm itself needs to be redesigned for text detection is an interesting topic that can be further explored in the future.

Minimalist Kernel Representation

Figure 4: Label generation of the minimalist kernel representation.

Definition.

As illustrated in Figure 4, our representation method formulates a given text line as an eroded text region (i.e., text kernel) with peripheral pixels. Compared to the existing kernel representations Wang et al. (2019a, b); Liao et al. (2020), our representation method has two main differences as follows:

(1) Because our text kernel is generated by a morphological erosion operation, it can be approximately restored to the complete text region by the reverse operation (i.e., dilation). Moreover, both erosion and dilation can be easily implemented in PyTorch with GPU acceleration.

(2) Our representation method only requires the network to predict a 1-channel output, which is simpler than previous methods that need multi-channel output, as shown in Figure 2. To our knowledge, it may be the simplest kernel representation for arbitrarily-shaped text detection.

Label Generation.

To learn this representation, we need to generate labels for text kernels and text regions. Specifically, for a given text image, the label of text regions can be directly produced by filling the bounding boxes, and we denote it as G_tex (see Figure 4(b)). Note that G_tex is a binary image; applying an erosion operator with an s×s kernel to G_tex converts the peripheral pixels of text regions to non-text pixels. We take this result as the label for text kernels and denote it as G_ker (see Figure 4(c)).
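A minimal sketch of this step, assuming the region label is stored as a binary float tensor of shape (N, 1, H, W), is given below; the erosion is implemented as the dual of max-pooling, which keeps the whole procedure on the GPU.

import torch.nn.functional as F

def generate_kernel_label(gt_text, s):
    # gt_text: binary text-region label G_tex of shape (N, 1, H, W), values in {0, 1}
    # morphological erosion with an s x s kernel: eroding the foreground is
    # equivalent to dilating the background, hence the double negation
    return -F.max_pool2d(-gt_text, kernel_size=s, stride=1, padding=s // 2)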

Post-Processing.

Based on the proposed representation method, we develop a GPU-parallel post-processing, termed text dilation, to recover complete text lines with negligible time overhead. The pseudo code is shown in Algorithm 1, in which we utilize a max-pooling function with an s×s kernel to implement the dilation operator equivalently.

During training, for a given prediction of text kernels, we directly apply the dilation operator to rebuild the whole text regions. Since this step is differentiable, we can supervise both text kernels and text regions, as shown in Figure 3. In the inference phase, we first binarize the predicted text kernels, and then apply a GPU-accelerated Connected Components Labeling (CCL) algorithm Allegretti, Bolelli, and Grana (2019) to distinguish different text kernels. Finally, we apply the dilation operator to reconstruct the complete text lines.

Figure 5: Searched architecture of TextNAS-A2. The four nodes in each column represent a learnable block, and the black arrows indicate the selected operations. TextNAS-A0 and TextNAS-A1 are shown in the supplementary materials.
# s: dilation size
# text dilation for post-processing
import torch.nn.functional as F

def text_dilation(text_kernel, s, training=True):
    if not training:  # in the inference phase
        # binarize the predicted text kernel
        text_kernel = text_kernel > 0
        # distinguish kernels with the connected components labeling (CCL) algorithm
        # (ccl_gpu: the GPU-accelerated CCL of Allegretti, Bolelli, and Grana (2019))
        text_kernel = ccl_gpu(text_kernel)
    # implement the dilation operation with max-pooling
    # args: input, kernel size, stride, padding
    text = F.max_pool2d(text_kernel, s, 1, s // 2)
    return text
Algorithm 1 PyTorch-like Pseudo Code of Text Dilation

Loss Function

The loss function of our FAST can be formulated as:

L = L_ker + α · L_tex,   (2)

where L_ker and L_tex are the losses for text kernels and text regions, respectively. Following common practices Wang et al. (2019a, b), we apply Dice loss Milletari et al. (2016) to supervise the network. Therefore, L_ker and L_tex can be expressed as follows:

L_ker = 1 − (2 Σ_{x,y} P_ker(x,y) G_ker(x,y)) / (Σ_{x,y} P_ker(x,y)^2 + Σ_{x,y} G_ker(x,y)^2),   (3)

L_tex = 1 − (2 Σ_{x,y} P_tex(x,y) G_tex(x,y)) / (Σ_{x,y} P_tex(x,y)^2 + Σ_{x,y} G_tex(x,y)^2),   (4)

where P(x,y) and G(x,y) represent the value at position (x,y) in the prediction and the ground truth, respectively. In addition, we apply Online Hard Example Mining Shrivastava, Gupta, and Girshick (2016) to L_tex to ignore simple non-text regions. α balances the importance of L_ker and L_tex, and is set to 0.5 in all experiments.
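A minimal sketch of the Dice loss in Eqs. (3) and (4) is given below; the OHEM mask applied to L_tex is omitted for brevity.

import torch

def dice_loss(pred, gt, eps=1e-6):
    # pred: predicted probability map P in [0, 1]; gt: binary ground truth G (same shape)
    pred = pred.flatten(1)
    gt = gt.flatten(1)
    inter = (pred * gt).sum(dim=1)
    union = (pred * pred).sum(dim=1) + (gt * gt).sum(dim=1) + eps
    return 1.0 - (2.0 * inter / union).mean()

# overall loss of Eq. (2): L = dice_loss(P_ker, G_ker) + 0.5 * dice_loss(P_tex, G_tex)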

Experiments

Datasets

Total-Text Ch’ng and Chan (2017) is a challenging dataset for arbitrarily-shaped text detection, including horizontal, multi-oriented, and curved text lines. It contains 1,255 training images and 300 testing images, all of which are labeled with polygons at the word level.

CTW1500 Liu et al. (2019b) is also a widely used dataset for arbitrarily-shaped text detection. It consists of 1,000 training images and 500 testing images. In this dataset, text lines are labeled with 14 points as polygons.

ICDAR 2015 Karatzas et al. (2015) is one of the challenges of the ICDAR 2015 Robust Reading Competition. It focuses on multi-oriented text in natural scenes and contains 1,000 training images and 500 testing images. The text lines are labeled by quadrangles at the word level.

MSRA-TD500 Yao et al. (2012) is a multi-lingual dataset that contains multi-oriented and long text lines. It has 300 training images and 200 testing images. Following the previous works Zhou et al. (2017); Long et al. (2018); Lyu et al. (2018b), we include the 400 images of HUST-TR400 Yao, Bai, and Liu (2014) as training data.

IC17-MLT Nayef et al. (2017) is a multi-language dataset that consists of 7,200 training images, 1,800 validation images, and 9,000 testing images. In this dataset, text lines are annotated with word-level quadrangles.

Implementation Details

NAS Settings.

During search, we consider a total of L learnable blocks in the network. Following the common practice of doubling the number of channels when halving the size of feature maps, we set the output channels of the four stages to 64, 128, 256, and 512, respectively. We adopt the widely-used ProxylessNAS Cai, Zhu, and Han (2018) as the search algorithm, and Adam Kingma and Ba (2014) with an initial learning rate of 0.01 as the optimizer. We set the target inference speed T to 100, 80, and 60 FPS for searching TextNAS-A0, A1, and A2, respectively. To preserve generalization ability, we take IC17-MLT as the training set during NAS, and construct a validation set that combines the training sets of ICDAR 2015 and Total-Text. The entire network is trained and searched for 200 epochs with batch size 16 on 4 GPUs, which takes around 200 GPU hours on 1080Ti GPUs.

Training Settings.

Following previous methods Xie et al. (2019); Wang et al. (2019a); Feng et al. (2019); Xie et al. (2021), we pre-train our models on IC17-MLT for 300 epochs, in which images are cropped and resized to 640 × 640 pixels. We then finetune the models for 600 epochs. The dilation size s is set to 9 in our experiments unless explicitly stated. All models are optimized by Adam with batch size 16 on 4 GPUs, using a “poly” learning rate schedule. Training data augmentations include random scale, random flip, random rotation, random crop, and random blur.
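As a sketch of the “poly” schedule (the decay power of 0.9 and the per-iteration update are common choices assumed here, not values stated in the text):

def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    # "poly" learning rate schedule: decay from base_lr toward 0 over max_iter iterations
    return base_lr * (1.0 - cur_iter / max_iter) ** power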

Inference Settings.

In the inference phase, we scale the shorter side of images to different sizes, and report the performance on each dataset. For fair comparison, we evaluate all testing images and calculate the average speed. All results are tested with a batch size of 1 on one 1080Ti GPU and a 2.20GHz CPU in a single thread unless explicitly stated.

Figure 6: Text detection F-measure and inference speed of different networks on Total-Text, where we scale the shorter side of images to 640 pixels. Our TextNAS models significantly outperform existing hand-crafted and auto-searched networks.

Ablation Study

Comparison with Hand-Crafted Networks.

We first study the differences between our TextNAS and representative hand-crafted networks, such as ResNets He et al. (2016) and VGG16 Simonyan and Zisserman (2014). Without loss of generality, we show the searched architecture of TextNAS-A2 in Figure 5, from which we make the following observations: (1) Asymmetric convolutions are the dominant operators in our network, which facilitates the detection of text lines with extreme aspect ratios. (2) TextNAS-A2 tends to stack more convolutions in the shallow stages (i.e., stages 1 and 2), which helps to capture richer low-level characteristics, such as colors, textures, and edges.

As shown in Figure 6, TextNAS achieves a significantly better trade-off between accuracy and inference speed than previous models. In addition, our TextNAS-A0, A1, and A2 have 6.8M, 8.0M, and 8.9M parameters respectively, making them more parameter-efficient than ResNets He et al. (2016) and VGG16 Simonyan and Zisserman (2014), though slightly larger than PVTv2-B0 Wang et al. (2021b, a). These results demonstrate that TextNAS models are effective for text detection on GPU devices.

Comparison with Other NAS Networks.

For fair comparison, all models are pre-trained on IC17-MLT and finetuned on Total-Text. As shown in Figure 6, our TextNAS models outperform existing NAS networks in terms of both accuracy and inference speed, including Proxyless-GPU Cai, Zhu, and Han (2018), OFA-1080Ti-12ms Cai et al. (2019), MobileNetV3 Howard et al. (2019), and EfficientNet-B0 Tan and Le (2019). For example, TextNAS-A1 achieves 85.2% F-measure at 85.3 FPS, being 1.0% more accurate and 1.5× faster than OFA-1080Ti-12ms. A major reason is that these NAS networks are mainly searched for image classification and do not generalize robustly to other tasks. Therefore, designing a search space and reward function for text detection is meaningful and necessary.

Upper Bound of Minimalist Representation.

We verify the upper bound of our text representation method by calculating the F-measure of text lines recovered from the ground-truth text kernels. The verification results under different erosion sizes are shown in Table 1, where we see that the upper-bound F-measure approaches 1 once the erosion size s is large enough, indicating that the concern about information loss caused by erosion is unnecessary.

Metric | Dataset | F-measure under increasing erosion/dilation size s →
Upper Bound | Total-Text | 95.9 | 97.8 | 99.2 | 99.3 | 99.0
Upper Bound | ICDAR 2015 | 81.5 | 92.4 | 99.6 | 99.6 | 99.5
Actual F-measure | Total-Text | 82.1 | 84.5 | 85.5 | 85.7 | 85.6
Actual F-measure | ICDAR 2015 | 56.8 | 77.1 | 83.0 | 83.7 | 83.6
Table 1: Ablation studies of the erosion/dilation size s.

Influence of the Dilation Size.

In this experiment, we study the effect of the dilation size (equal to the erosion size) based on our FAST-A2 model. We scale the shorter sides of images in Total-Text and ICDAR 2015 to 640 and 736 pixels, respectively. As shown in Table 1, the F-measure on both datasets drops when the dilation size is too small. Empirically, we set the dilation size to 9 by default. Note that if the size of the shorter side (denoted as S) is changed to S′, the dilation size s should be updated proportionally to obtain the best performance:

s′ = Round(s × S′ / S),   (5)

where Round(·) is a function that rounds the decimal portion.

Comparisons with State-of-the-Art Methods

Method Ext. P R F FPS
Non-real-time Methods
TextSnake Long et al. (2018) 82.7 74.5 78.4 12.4
TextField Xu et al. (2019) 81.2 79.9 80.6 -
CRAFT Baek et al. (2019) 87.6 79.9 83.6 4.8
LOMO Zhang et al. (2019) 88.6 75.7 81.6 4.4
SPCNet Xie et al. (2019) 83.0 82.8 82.9 4.6
PSENet Wang et al. (2019a) 84.0 78.0 80.9 3.9
ContourNet Wang et al. (2020b) - 86.9 83.9 85.4 3.8
Real-time Methods
EAST Zhou et al. (2017) - 50.0 36.2 42.0 -
DB-R50 Liao et al. (2020) 87.1 82.5 84.7 32.0
DB-R18 Liao et al. (2020) 88.3 77.9 82.8 50.0
PAN Wang et al. (2019b) 89.3 81.0 85.0 39.6
PAN++ Wang et al. (2021c) 89.9 81.0 85.3 38.3
FAST-A0-448 (Ours) 88.2 75.5 81.4 152.8
FAST-A1-512 (Ours) 90.4 80.0 84.9 115.5
FAST-A2-512 (Ours) 89.3 81.8 85.4 93.2
FAST-A2-640 (Ours) 90.5 81.4 85.7 67.5
FAST-A2-800 (Ours) 90.5 82.5 86.3 46.0
Table 2: Detection results on Total-Text Ch’ng and Chan (2017). The suffix of our method denotes the size of the shorter side. “*” indicates results from Long et al. (2018). “Ext.” denotes external data. “P”, “R”, and “F” indicate precision, recall, and F-measure, respectively.
Method Ext. P R F FPS
Non-real-time Methods
TextSnake Long et al. (2018). 67.9 85.3 75.6 -
TextField Xu et al. (2019) 83.0 79.8 81.4 -
CRAFT Baek et al. (2019) 86.0 81.1 83.5 7.6
LOMO Zhang et al. (2019) 89.2 69.6 78.4 4.4
PSENet Wang et al. (2019a) 84.8 79.7 82.2 3.9
ContourNet Wang et al. (2020b) - 83.7 84.1 83.9 4.5
Real-time Methods
EAST Zhou et al. (2017). - 78.7 49.1 60.4 21.2
DB-R50 Liao et al. (2020) 86.9 80.2 83.4 22.0
DB-R18 Liao et al. (2020) 84.8 77.5 81.0 55.0
PAN Wang et al. (2019b) 86.4 81.2 83.7 39.8
PAN++ Wang et al. (2021c) 87.1 81.1 84.0 36.0
FAST-A0-512 (Ours) 85.6 77.8 81.5 129.1
FAST-A1-512 (Ours) 86.2 78.1 82.0 112.9
FAST-A2-512 (Ours) 85.9 79.9 82.8 92.6
FAST-A2-640 (Ours) 87.2 80.4 83.7 66.5
Table 3: Detection results on CTW1500 Liu et al. (2017). The suffix of our method denotes the length of the shorter side. Results with “*” are collected from Liu et al. (2017). “Ext.” denotes external data. “P”, “R”, and “F” indicate precision, recall, and F-measure, respectively.

Curve Text Detection.

To show the advantages of FAST in detecting curved text, we compare it with existing state-of-the-art methods on the Total-Text and CTW1500 datasets, and report the results in Table 2 and Table 3.

On Total-Text, FAST-A0-448 yields an F-measure of 81.4% at 152.8 FPS, which is faster than all previous methods. Our FAST-A1-512 outperforms the real-time text detector DB-R18 Liao et al. (2020) by 2.1% in F-measure and runs 2.3× faster. Compared to PAN Wang et al. (2019b), FAST-A2-640 is 27.9 FPS faster while its F-measure is 0.7% better. Notably, when taking a larger input scale, FAST-A2-800 achieves the best F-measure of 86.3%, surpassing all counterparts by at least 1.0% while still keeping a fast inference speed (46.0 FPS).

Similar conclusions hold on CTW1500. For example, the inference speed of FAST-A0-512 is 129.1 FPS, at least 2.3× faster than prior works, while its F-measure is still very competitive (81.5%). The best F-measure of our method is 83.7%, the same as PAN Wang et al. (2019b), but our method runs at a faster speed (66.5 FPS vs. 39.8 FPS).

Method Ext. P R F FPS
Non-real-time Methods
PixelLink Deng et al. (2018) - 82.9 81.7 82.3 7.3
Corner Lyu et al. (2018b) 94.1 70.7 80.7 3.6
TextSnake Long et al. (2018) 84.9 80.4 82.6 1.1
CRAFT Baek et al. (2019) 89.8 84.3 86.9 -
LOMO Zhang et al. (2019) 91.3 83.5 87.2 3.4
SPCNet Xie et al. (2019) 88.7 85.8 87.2 4.6
PSENet Wang et al. (2019a) 86.9 84.5 85.7 1.6
PolarMask++ Xie et al. (2021) 87.3 83.5 85.4 10.0
Real-time Methods
EAST Zhou et al. (2017) - 83.6 73.5 78.2 13.2
DB-R50 Liao et al. (2020) 88.2 82.7 85.4 26.0
DB-R18 Liao et al. (2020) 86.8 78.4 82.3 48.0
PAN Wang et al. (2019b) 84.0 81.9 82.9 26.1
PAN++ Wang et al. (2021c) 85.9 80.4 83.1 28.2
FAST-A0-736 (Ours) 85.9 78.3 81.9 60.9
FAST-A1-736 (Ours) 85.7 80.0 82.7 53.9
FAST-A2-736 (Ours) 87.5 80.3 83.7 42.7
FAST-A2-896 (Ours) 88.6 82.2 85.3 31.8
FAST-A2-1280 (Ours) 89.9 84.4 87.0 15.7
Table 4: Detection results on ICDAR 2015 Karatzas et al. (2015). The suffix of our method denotes the size of the shorter side. “Ext.” denotes external data. “P”, “R”, and “F” indicate precision, recall, and F-measure, respectively.

Oriented Text Detection.

We evaluate the effectiveness of FAST in detecting oriented text lines on ICDAR 2015. From Table 4, we observe that our fastest model, FAST-A0-736, reaches 60.9 FPS while maintaining a competitive F-measure of 81.9%. Compared with PAN Wang et al. (2019b), FAST-A2-896 surpasses it by 2.4% in F-measure and is more efficient (31.8 FPS vs. 26.1 FPS). Because ICDAR 2015 contains many small text lines, previous methods Xie et al. (2019); Wang et al. (2019a) usually adopt high-resolution images to ensure detection performance. With this setting, FAST-A2-1280 achieves an F-measure of 87.0%, which outperforms PSENet Wang et al. (2019a) by 1.3% and runs 9.8× faster.

Method Ext. P R F FPS
Non-real-time Methods
PixelLink Deng et al. (2018) - 83.0 73.2 77.8 3.0
Corner Lyu et al. (2018b) 87.6 76.2 81.5 5.7
TextSnake Long et al. (2018) 83.2 73.9 78.3 1.1
CRAFT Baek et al. (2019) 88.2 78.2 82.9 8.6
DRRG Zhang et al. (2020b) 88.0 82.3 85.1 -
Real-time Methods
EAST Zhou et al. (2017) - 87.3 67.4 76.1 -
DB-R50 Liao et al. (2020) 91.5 79.2 84.9 32.0
DB-R18 Liao et al. (2020) 90.4 76.3 82.8 62.0
PAN Wang et al. (2019b) 84.4 83.8 84.1 30.2
PAN++ Wang et al. (2021c) 85.3 84.0 84.7 32.5
FAST-A0-512 (Ours) 90.6 78.9 84.4 137.2
FAST-A0-736 (Ours) 90.9 80.3 85.3 79.6
FAST-A1-736 (Ours) 90.2 82.3 86.1 72.0
FAST-A2-736 (Ours) 90.9 83.0 86.7 56.8
Table 5: Detection results on MSRA-TD500 Yao et al. (2012). The suffix of our method denotes the size of the shorter side. “Ext.” denotes external data. “P”, “R”, and “F” indicate precision, recall, and F-measure, respectively.
Figure 7: Qualitative text detection results of FAST on Total-Text Ch’ng and Chan (2017), CTW1500 Liu et al. (2017), ICDAR 2015 Karatzas et al. (2015) and MSRA-TD500 Yao et al. (2012). More examples are provided in the supplementary materials.

Long Straight Text Detection.

FAST is also robust for long straight text detection. As shown in Table 5, on MSRA-TD500, FAST-A1-736 and FAST-A2-736 achieve F-measures of 86.1% and 86.7% respectively, outperforming all previous works by a significant margin. More notably, FAST-A0-512 runs at 137.2 FPS with 84.4% F-measure, being 75 FPS faster and 1.6% better than the previous fastest method DB-R18 Liao et al. (2020). Some qualitative text detection results are shown in Figure 7.

Conclusion

In this work, we proposed FAST, a faster arbitrarily-shaped text detector. To achieve high efficiency, we designed a search space and reward function tailored for text detection, and searched for a series of efficient, text-detection-friendly networks. Moreover, we presented a minimalist kernel representation, together with a GPU-parallel post-processing, so that our models can run entirely on the GPU. Equipped with these two designs, our FAST achieves a significantly better trade-off between accuracy and inference speed than prior arts. We hope our method can serve as a cornerstone for text-related applications.

References

  • Allegretti, Bolelli, and Grana (2019) Allegretti, S.; Bolelli, F.; and Grana, C. 2019. Optimized block-based algorithms to label connected components on GPUs. IEEE Transactions on Parallel and Distributed Systems, 31(2): 423–438.
  • Baek et al. (2019) Baek, Y.; Lee, B.; Han, D.; Yun, S.; and Lee, H. 2019. Character region awareness for text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 9365–9374.
  • Cai et al. (2019) Cai, H.; Gan, C.; Wang, T.; Zhang, Z.; and Han, S. 2019. Once-for-All: Train One Network and Specialize it for Efficient Deployment. In Proceedings of the International Conference on Learning Representations (ICLR).
  • Cai, Zhu, and Han (2018) Cai, H.; Zhu, L.; and Han, S. 2018. ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware. In Proceedings of the International Conference on Learning Representations (ICLR).
  • Ch’ng and Chan (2017) Ch’ng, C. K.; and Chan, C. S. 2017. Total-text: A comprehensive dataset for scene text detection and recognition. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), 935–942.
  • Deng et al. (2018) Deng, D.; Liu, H.; Li, X.; and Cai, D. 2018. Pixellink: Detecting scene text via instance segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 32.
  • Feng et al. (2019) Feng, W.; He, W.; Yin, F.; Zhang, X.-Y.; and Liu, C.-L. 2019. TextDragon: An end-to-end framework for arbitrary shaped text spotting. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 9076–9085.
  • He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), 770–778.
  • Hong, Kim, and Choi (2020) Hong, S.; Kim, D.; and Choi, M.-K. 2020. Memory-efficient models for scene text recognition via neural architecture search. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACV), 183–191.
  • Howard et al. (2019) Howard, A.; Sandler, M.; Chu, G.; Chen, L.-C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. 2019. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 1314–1324.
  • Karatzas et al. (2015) Karatzas, D.; Gomez-Bigorda, L.; Nicolaou, A.; Ghosh, S.; Bagdanov, A.; Iwamura, M.; Matas, J.; Neumann, L.; Chandrasekhar, V. R.; Lu, S.; et al. 2015. ICDAR 2015 competition on robust reading. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), 1156–1160.
  • Kingma and Ba (2014) Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Liao, Shi, and Bai (2018) Liao, M.; Shi, B.; and Bai, X. 2018. Textboxes++: A single-shot oriented scene text detector. IEEE transactions on image processing, 27(8): 3676–3690.
  • Liao et al. (2017) Liao, M.; Shi, B.; Bai, X.; Wang, X.; and Liu, W. 2017. Textboxes: A fast text detector with a single deep neural network. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).
  • Liao et al. (2020) Liao, M.; Wan, Z.; Yao, C.; Chen, K.; and Bai, X. 2020. Real-time scene text detection with differentiable binarization. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 34, 11474–11481.
  • Liao et al. (2018) Liao, M.; Zhu, Z.; Shi, B.; Xia, G.-s.; and Bai, X. 2018. Rotation-sensitive regression for oriented scene text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 5909–5918.
  • Liu et al. (2019a) Liu, C.; Chen, L.-C.; Schroff, F.; Adam, H.; Hua, W.; Yuille, A. L.; and Fei-Fei, L. 2019a. Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 82–92.
  • Liu et al. (2016) Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; and Berg, A. C. 2016. Ssd: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV), 21–37.
  • Liu et al. (2019b) Liu, Y.; Jin, L.; Zhang, S.; Luo, C.; and Zhang, S. 2019b. Curved scene text detection via transverse and longitudinal sequence connection. Pattern Recognition, 90: 337–345.
  • Liu et al. (2017) Liu, Y.; Jin, L.; Zhang, S.; and Zhang, S. 2017. Detecting curve text in the wild: New dataset and new solution. arXiv preprint arXiv:1712.02170.
  • Long et al. (2018) Long, S.; Ruan, J.; Zhang, W.; He, X.; Wu, W.; and Yao, C. 2018. Textsnake: A flexible representation for detecting text of arbitrary shapes. In Proceedings of the European Conference on Computer Vision (ECCV), 20–36.
  • Lyu et al. (2018a) Lyu, P.; Liao, M.; Yao, C.; Wu, W.; and Bai, X. 2018a. Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. In Proceedings of the European Conference on Computer Vision (ECCV), 67–83.
  • Lyu et al. (2018b) Lyu, P.; Yao, C.; Wu, W.; Yan, S.; and Bai, X. 2018b. Multi-oriented scene text detection via corner localization and region segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 7553–7563.
  • Ma et al. (2018) Ma, J.; Shao, W.; Ye, H.; Wang, L.; Wang, H.; Zheng, Y.; and Xue, X. 2018. Arbitrary-oriented scene text detection via rotation proposals. IEEE Transactions on Multimedia, 20: 3111–3122.
  • Milletari et al. (2016) Milletari, F.; Navab, N.; and Ahmadi, S.-A. 2016. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the Fourth International Conference on 3D Vision (3DV), 565–571.
  • Nayef et al. (2017) Nayef, N.; Yin, F.; Bizid, I.; Choi, H.; Feng, Y.; Karatzas, D.; Luo, Z.; Pal, U.; Rigaud, C.; Chazalon, J.; et al. 2017. Icdar2017 robust reading challenge on multi-lingual scene text detection and script identification-rrc-mlt. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), volume 1, 1454–1459.
  • Ren et al. (2016) Ren, S.; He, K.; Girshick, R.; and Sun, J. 2016. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6): 1137–1149.
  • Sandler et al. (2018) Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; and Chen, L.-C. 2018. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4510–4520.
  • Shi, Bai, and Belongie (2017) Shi, B.; Bai, X.; and Belongie, S. 2017. Detecting oriented text in natural images by linking segments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2550–2558.
  • Shrivastava, Gupta, and Girshick (2016) Shrivastava, A.; Gupta, A.; and Girshick, R. 2016. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 761–769.
  • Simonyan and Zisserman (2014) Simonyan, K.; and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • Szegedy et al. (2016) Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2818–2826.
  • Tan et al. (2019) Tan, M.; Chen, B.; Pang, R.; Vasudevan, V.; Sandler, M.; Howard, A.; and Le, Q. V. 2019. Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2820–2828.
  • Tan and Le (2019) Tan, M.; and Le, Q. 2019. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning (ICML), 6105–6114.
  • Tian et al. (2019) Tian, Z.; Shu, M.; Lyu, P.; Li, R.; Zhou, C.; Shen, X.; and Jia, J. 2019. Learning shape-aware embedding for scene text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 4234–4243.
  • Vatti (1992) Vatti, B. R. 1992. A generic solution to polygon clipping. Communications of the ACM, 35(7): 56–63.
  • Wang et al. (2020a) Wang, N.; Gao, Y.; Chen, H.; Wang, P.; Tian, Z.; Shen, C.; and Zhang, Y. 2020a. NAS-FCOS: Fast neural architecture search for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 11943–11951.
  • Wang et al. (2021a) Wang, W.; Xie, E.; Li, X.; Fan, D.-P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; and Shao, L. 2021a. Pvtv2: Improved baselines with pyramid vision transformer. arXiv preprint arXiv:2106.13797.
  • Wang et al. (2021b) Wang, W.; Xie, E.; Li, X.; Fan, D.-P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; and Shao, L. 2021b. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. arXiv preprint arXiv:2102.12122.
  • Wang et al. (2019a) Wang, W.; Xie, E.; Li, X.; Hou, W.; Lu, T.; Yu, G.; and Shao, S. 2019a. Shape robust text detection with progressive scale expansion network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 9336–9345.
  • Wang et al. (2021c) Wang, W.; Xie, E.; Li, X.; Liu, X.; Liang, D.; Yang, Z.; Lu, T.; and Shen, C. 2021c. PAN++: Towards Efficient and Accurate End-to-End Spotting of Arbitrarily-Shaped Text. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Wang et al. (2019b) Wang, W.; Xie, E.; Song, X.; Zang, Y.; Wang, W.; Lu, T.; Yu, G.; and Shen, C. 2019b. Efficient and accurate arbitrary-shaped text detection with pixel aggregation network. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 8440–8449.
  • Wang et al. (2020b) Wang, Y.; Xie, H.; Zha, Z.-J.; Xing, M.; Fu, Z.; and Zhang, Y. 2020b. Contournet: Taking a further step toward accurate arbitrary-shaped scene text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 11753–11762.
  • Xie et al. (2021) Xie, E.; Wang, W.; Ding, M.; Zhang, R.; and Luo, P. 2021. PolarMask++: Enhanced Polar Representation for Single-Shot Instance Segmentation and Beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Xie et al. (2019) Xie, E.; Zang, Y.; Shao, S.; Yu, G.; Yao, C.; and Li, G. 2019. Scene text detection with supervised pyramid context network. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 33, 9038–9045.
  • Xu et al. (2021) Xu, L.; Guan, Y.; Jin, S.; Liu, W.; Qian, C.; Luo, P.; Ouyang, W.; and Wang, X. 2021. ViPNAS: Efficient Video Pose Estimation via Neural Architecture Search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 16072–16081.
  • Xu et al. (2019) Xu, Y.; Wang, Y.; Zhou, W.; Wang, Y.; Yang, Z.; and Bai, X. 2019. Textfield: Learning a deep direction field for irregular scene text detection. IEEE Transactions on Image Processing, 28(11): 5566–5579.
  • Yao, Bai, and Liu (2014) Yao, C.; Bai, X.; and Liu, W. 2014. A unified framework for multioriented text detection and recognition. IEEE Transactions on Image Processing, 23(11): 4737–4749.
  • Yao et al. (2012) Yao, C.; Bai, X.; Liu, W.; Ma, Y.; and Tu, Z. 2012. Detecting texts of arbitrary orientations in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1083–1090.
  • Zhang et al. (2019) Zhang, C.; Liang, B.; Huang, Z.; En, M.; Han, J.; Ding, E.; and Ding, X. 2019. Look more than once: An accurate detector for text of arbitrary shapes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10552–10561.
  • Zhang et al. (2020a) Zhang, H.; Yao, Q.; Yang, M.; Xu, Y.; and Bai, X. 2020a. AutoSTR: Efficient backbone search for scene text recognition. In Proceedings of the European Conference on Computer Vision (ECCV), 751–767.
  • Zhang et al. (2020b) Zhang, S.-X.; Zhu, X.; Hou, J.-B.; Liu, C.; Yang, C.; Wang, H.; and Yin, X.-C. 2020b. Deep relational reasoning graph network for arbitrary shape text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 9699–9708.
  • Zhou et al. (2017) Zhou, X.; Yao, C.; Wen, H.; Wang, Y.; Zhou, S.; He, W.; and Liang, J. 2017. East: an efficient and accurate scene text detector. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), 5551–5560.