
Text Growing on Leaf

09/07/2022
by Chuang Yang, et al.

Irregular-shaped texts bring challenges to Scene Text Detection (STD). Although existing contour point sequence-based approaches achieve comparable performances, they fail to cover some highly curved ribbon-like text lines. It leads to limited text fitting ability and STD technique application. Considering the above problem, we combine text geometric characteristics and bionics to design a natural leaf vein-based text representation method (LVT). Concretely, it is found that leaf vein is a generally directed graph, which can easily cover various geometries. Inspired by it, we treat text contour as leaf margin and represent it through main, lateral, and thin veins. We further construct a detection framework based on LVT, namely LeafText. In the text reconstruction stage, LeafText simulates the leaf growth process to rebuild text contour. It grows main vein in Cartesian coordinates to locate text roughly at first. Then, lateral and thin veins are generated along the main vein growth direction in polar coordinates. They are responsible for generating coarse contour and refining it, respectively. Considering the deep dependency of lateral and thin veins on main vein, the Multi-Oriented Smoother (MOS) is proposed to enhance the robustness of main vein to ensure a reliable detection result. Additionally, we propose a global incentive loss to accelerate the predictions of lateral and thin veins. Ablation experiments demonstrate LVT is able to depict arbitrary-shaped texts precisely and verify the effectiveness of MOS and global incentive loss. Comparisons show that LeafText is superior to existing state-of-the-art (SOTA) methods on MSRA-TD500, CTW1500, Total-Text, and ICDAR2015 datasets.



I Introduction

Reading scene text enables intelligent devices to accomplish many applications (such as unmanned systems, intelligent transport, and express delivery), which has dramatically improved production efficiency and people's quality of life. Scene Text Detection (STD) [44] is the key technique that lets intelligent devices simulate human reading of scene text; it has attracted a growing number of researchers and has become a hot topic in computer vision. In the past decade, deep learning has greatly promoted the development of many computer vision technologies. It helps to extract strongly expressive image features for many tasks (e.g., recognition, tracking, and regression). Benefiting from these advantages, STD performance has improved markedly for regular-shaped text detection [39, 29]. However, there are many irregular-shaped texts in real scenarios, which pose challenges to traditional approaches. To fit arbitrary-shaped text instances effectively, an increasing number of novel methods have been proposed, which can be roughly categorized into segmentation-based methods [55, 10, 2] and regression-based methods [9, 53, 50].

Fig. 1: Illustration of the proposed leaf vein-based text representation method. We treat the text contour as a leaf margin and construct it through main, lateral, and thin veins. The main vein locates the text instance roughly, the lateral veins determine the coarse contour, and the thin veins refine it into an accurate contour.

The former adopts a mask representation, which segments text regions directly and can detect irregular-shaped text instances naturally. However, these methods frequently require large amounts of training data, and the weaker supervision signal aggravates this issue. The latter represents text instances by contour point sequences: they sample point sequences by regressing the offsets between a center point or quadrilateral and the irregular-shaped contour, which has clear drawbacks. Specifically, one-stage regression-based methods fail to fit highly curved ribbon-like text lines because multiple contour points may reside in the same direction. For multi-stage methods, the intrinsically expensive post-processing leads to low detection efficiency and limited practical applications. Therefore, the design of an efficient and effective text representation method remains underexplored.

Considering the limitations above, we combine text geometric characteristics and bionics to design a natural text representation method, which can fit text instances with any shapes accurately, even for highly curved ones. As shown in Figure 1, it is found that leaf margins always enjoy irregular shapes, which is similar to scene texts. Importantly, the leaf margin can be covered precisely by a directed graph that is composed of main, lateral, and thin veins. Inspired by the leaf vein structure, we propose to represent text contour by the combination of main, lateral, and thin veins. We further construct a one-stage text detection framework (called LeafText) based on leaf vein. It rebuilds text contours by simulating the leaf growth process, which is an elegant and effective design. Concretely, for one text instance, LeafText first grows the corresponding main vein from the predicted kernel mask in Cartesian coordinates to locate the text roughly. Then, lateral and thin veins sprout along both sides of the main vein growth direction in polar coordinates. In the end, the text contour is drawn by connecting endpoints of lateral and thin veins in a clockwise direction. Particularly, the lateral veins are used for determining coarse contour, and the thin veins are responsible for refining the contour to obtain an accurate detection result. Considering the deep dependencies of lateral and thin vein endpoints on the main vein, it is important to ensure a reliable main vein for rebuilding contour. However, the main vein extracted from the predicted kernel mask by the existing middle sampling method is always unreliable, which leads to a bad contour point sequence. Therefore, we propose a Multi-Oriented Smoother (MOS) to ensure the main vein robustness even when encountering unstable kernel masks. Additionally, text instances enjoy a large aspect ratio range compared with common objects, which brings challenges to predicting lateral and thin veins of text instances. 
Therefore, global incentive loss is proposed to force our model to balance the importance of text instances with different scales and focus on the prediction of lateral and thin veins. The main contributions of this paper are as follows:

  1. By combining the text geometric characteristics and bionics, a leaf vein-based text representation method (LVT) is proposed. It explores a natural and effective way to fit arbitrary-shaped text instances, which enhances the model’s fitting ability.

  2. Thin vein is designed for refining text contours. It supports fitting texts accurately with lower model complexity, which accelerates the convergence of the training process effectively. Remarkably, the thin vein length is half that of the lateral vein, which eases the learning of the contour point sequence and ensures accurate detection results.

  3. A Multi-Oriented Smoother (MOS) is designed to ensure the robustness of the main vein extracted from the predicted kernel mask. It provides the correct growth directions to the lateral and thin veins, which ensures a reliable contour point sequence.

  4. Global incentive loss is proposed to help balance the importance of text instances with different scales and force our method to focus on the predictions of lateral and thin veins. Particularly, it can be integrated into other regression-based detectors seamlessly.

The rest of the paper is organized as follows. Section II introduces the related works on text detection. Section III describes the architecture, training process, and inference process details of LeafText. The experimental results are discussed in Section IV. Section V concludes the paper.

II Related Work

Recently, deep learning has greatly promoted the development of text detection techniques. According to the text representation method, previous text detectors can be roughly classified into segmentation-based methods and regression-based methods. In this section, we review the existing text detection methods.

II-A Segmentation-Based Methods

Segmentation technology [30] executes pixel-level classification on images, which provides an effective solution for text detection. Zhang [62] segmented rough text regions at first. Then, they extracted character components within text blocks by MSER [35]. In the end, the authors suppressed false hypotheses by the intensity and geometric criteria of character components to obtain the final detection results. Lyu [32] proposed to detect long text lines via a corner localization detection strategy. They generated candidate boxes by sampling and grouping corner points and filtered false-positive samples by the score of segmentation maps.

Deng [6] found that extracting text contours directly from segmentation maps leads to text adhesion problems. To alleviate this, link heat maps in eight directions were predicted to separate adhesive text instances. The works [42, 54] designed strategies similar to [6] to handle the phenomenon that many texts are very close to each other. Different from the above works, Wang [47, 49, 48] and Liao [22, 24] proposed expansion strategies to generate text regions from shrunk regions, which likewise avoids detecting multiple adhesive texts as one. The difference between them is that the former expands shrunk regions to text regions at the pixel level, while the latter executes the expansion at the instance level. The works [19, 36, 5] considered that the small amount of pixel-level annotated data limits model performance and proposed a two-stage detection framework to make full use of the large amount of data annotated with rectangles. In the inference process, the authors locate texts roughly by quadrilaterals and extract text contours precisely from the segmented text regions within those quadrilaterals. Zhang [59] observed that stacked omnidirectional texts bring significant challenges to text detection. They designed an LSTM-based module that generates omnidirectional text mask proposals from the vertical and horizontal directions simultaneously to solve this dilemma.

Fig. 2: Illustration of the vein growth process. It contains the following three stages: 1) growing potential lateral vein, which is responsible for determining potential growth directions according to the start points and the corresponding tangent slopes; 2) rectifying potential growth lateral vein directions; 3) growing thin vein based on the determined lateral vein.

Apart from predicting whole text instances directly, some approaches [1, 60, 31] detect texts at the character level. Baek [1] proposed a weakly-supervised framework that generates character-level labels to facilitate the training process. In the rebuilding process, the approach first predicts character regions and then links them by affinities to obtain the final detection results. Zhang [60] adopted a similar strategy to [1] to represent text instances. Moreover, they introduced a Graph Convolutional Network (GCN) to predict the affinities between different character regions, improving the reliability of the linked components. Different from them, Long [31] first segments the text center line and then predicts the local boundaries along it.

II-B Regression-Based Methods

Object detection methods [38, 26, 37] adopt a contour point sequence-based representation to rebuild object contours or boxes, which has greatly inspired research on text detection. Liao [20] inherited the framework of [26] directly to detect horizontal text. To improve the detection of multi-oriented texts, they proposed to predict text rotation angles in [21]. Different from the above anchor-based frameworks, Zhou [63] introduced the detection strategy proposed in [16] into text detection, predicting the corner points of multi-oriented texts and connecting them to obtain text boxes. Liao [23] focused on extracting strongly expressive features for multi-oriented texts, proposing to rotate the convolutional filters so that the model extracts rotation-sensitive features. He [14] extracted text features with strong representational capacity through a hierarchical inception module.

Though the above works achieve comparable performance in detecting multi-oriented text instances, they struggle to detect curved texts effectively. To improve the ability to detect arbitrary-shaped text instances, some researchers [41, 15, 8] separated word-level text blocks into multiple character-level regions: they regressed character boxes and linked those components to rebuild text blocks. Like [41], Ma [33] and Zhang [61] adopted a character-based detection strategy. Importantly, the authors utilized a GCN to evaluate the linkages of adjacent characters and improve the stability of the rebuilt text regions. Zhang [58] and Wang [46] designed two-stage contour point sequence representation methods. They extracted text quadrangles and further predicted contour points from the features within the quadrangles by regression. The former generates the text center line (TCL) region first and then regresses the offsets between the TCL and the text contour to sample the contour points; the latter predicts the distances between the quadrangle and the text contour directly to extract the contour points. Wang [51] proposed a more intuitive way to obtain contour points: the authors segmented those points directly in both vertical and horizontal directions and combined them to filter unreliable results. Inspired by [52], Wang [45] modeled text instances in the polar coordinate system and emitted multiple rays from the text center to the contour. The ray endpoints were sampled as contour points and connected to obtain the final detection results.

Some works [27, 40, 64] proposed novel regression strategies to detect text instances and achieved state-of-the-art performance. Specifically, Liu [27] introduced Bezier curves to represent text contours, exploring the possibility of fitting texts beyond standard bounding-box detection. Su [40] encoded text regions into compact vectors through the discrete cosine transform. Zhu [64] modeled texts in the Fourier domain and regressed the contour point sequence via Fourier signature vectors.

III Methodology

In this section, the leaf vein-based text representation method (LVT) is presented first. Then, we introduce the overall architecture of the proposed LeafText. Next, the details of the Multi-Oriented Smoother (MOS) and the Growth Process of Vein (GPV) are described. In the end, the optimization functions of the network are given.

Fig. 3: Overall architecture of the proposed LeafText. It is composed of the Backbone, FPN, MV header, LV-TV header, Multi-Oriented Smoother (MOS), and Vein Growth Process. MV, LV, and TV denote main vein, lateral vein, and thin vein, respectively. MOS extracts the main vein from the kernel mask in the Cartesian coordinate system. The Vein Growth Process is the text contour reconstruction process in Fig. 2.

III-A Leaf Vein-Based Text Representation Method

The proposed text representation method (LVT) treats the text contour as a leaf margin (as shown in Fig. 1) and represents it through main, lateral, and thin veins, which can fit text instances of any shape effectively, even highly curved ones. This section describes the growth process of the veins mathematically, in combination with Fig. 2.

For the main vein (as shown in Fig. 2), it is used for locating texts roughly. The corresponding growth process is modeled as a polynomial by MOS (described in Section III-C), which can be formulated as:

f(x) = Σ_{i=0}^{n} a_i x^i,   (1)

where n is the degree of f(x) and a_i is the coefficient of x^i.

Given a main vein f, N start points {p_1, ..., p_N} of lateral veins (N is set to 5 for better visualization) are sampled equidistantly along the growth direction of the main vein (the sampling process is given in Algorithm 1). At each start point p_j = (x_j, y_j) on f, the corresponding tangent angle θ_j (blue dotted arrow in Fig. 2) can be computed by:

θ_j = arctan(f'(x_j)).   (2)

After obtaining θ_j, we can determine the growth directions and lengths of the lateral veins, which are responsible for generating the coarse text contour.
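The tangent-angle step of Equation 2 can be sketched numerically. The helper below assumes the main vein is available as NumPy polynomial coefficients (highest degree first); the function name and that interface are assumptions of this sketch, not the paper's implementation.

```python
import numpy as np

def tangent_angles(coeffs, xs):
    """Tangent angle of the polynomial main vein y = f(x) at sample
    points xs, computed as arctan(f'(x)) per Equation 2."""
    deriv = np.polyder(np.poly1d(coeffs))  # derivative f'(x)
    return np.arctan(deriv(np.asarray(xs, dtype=float)))

# For the line y = x, the tangent angle is 45 degrees everywhere.
angles = tangent_angles([1.0, 0.0], [0.0, 2.0, 5.0])
```

A horizontal main vein (constant polynomial) yields a zero tangent angle at every start point, as expected.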

For the growth directions of lateral veins, there are two lateral veins (l_j^1 and l_j^2) on either side of the main vein growth direction at each start point (as shown in Fig. 2 (a)). Their potential growth directions θ_j^1 and θ_j^2 are defined as:

θ_j^1 = θ_j + π/2,   (3)
θ_j^2 = θ_j − π/2.   (4)

Since covering all 360 degrees of directions would incur expensive computational costs, we predefine a polar coordinate system with K directions (K is set to 8 for better visualization) to rectify the potential direction of a lateral vein, which avoids a highly complicated neural network while preserving a strong fitting ability to text contours (verified in Section IV-C). Concretely, as shown in Fig. 2 (b), supposing {d_1, ..., d_K} are all the directions in the predefined polar coordinate system and d_k ≤ θ_j^m ≤ d_{k+1}, the rectified direction θ̂_j^m in Fig. 2 (c) can be calculated as:

θ̂_j^m = d_k if α ≤ β, otherwise d_{k+1},   (5)

where α = |θ_j^m − d_k| and β = |θ_j^m − d_{k+1}| are the angles between the potential direction and its two adjacent directions in the predefined polar coordinate system, and |·| denotes the absolute value operator.
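The rectification of Equation 5 amounts to snapping a continuous angle to the nearest of the K predefined polar directions. A minimal sketch, assuming the directions are uniformly spaced at 2π/K (the uniform spacing is our assumption):

```python
import math

def rectify_direction(theta, K=8):
    """Snap a potential lateral-vein direction theta (radians) to the
    nearest of K uniformly spaced polar directions, keeping whichever
    adjacent direction has the smaller angular gap."""
    step = 2 * math.pi / K
    k = round(theta / step) % K  # index of the nearest predefined direction
    return k * step
```

With K = 8 the predefined directions are multiples of 45°, so a 50° potential direction is rectified to 45°.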

For the lengths of lateral veins, they are defined as the distances between the start points of the lateral veins and the text contour along the growth directions of the lateral veins (referred to Section III-D).

With the determined lateral veins, the growth directions and lengths of thin veins can be formulated. They are used for refining the coarse contour generated by the lateral veins to reconstruct accurate detection results. As shown in Fig. 2 (d), the middle points of the lateral veins are sampled as the start points of thin veins, and there are two thin veins (t_j^1 and t_j^2) along the growth direction of the lateral vein at each start point. The growth directions of thin veins are determined according to the rectified directions of the lateral veins. Specifically, given the lateral veins adjacent to l_j (l_{j−1} and l_{j+1}) with the rectified growth directions (θ̂_{j−1}, θ̂_j, and θ̂_{j+1}) computed by Equation 5, we obtain the growth directions (θ̂_{t_j}^1 and θ̂_{t_j}^2) of t_j^1 and t_j^2 by Equations 6 and 7, respectively:

θ̂_{t_j}^1 = (θ̂_{j−1} + θ̂_j) / 2,   (6)
θ̂_{t_j}^2 = (θ̂_j + θ̂_{j+1}) / 2.   (7)

For the lengths of thin veins, they are evaluated as the distances between the start points of the thin veins and the text contour along the growth directions of the thin veins (referred to Section III-D). With the determined main, lateral, and thin veins, the text contour can be drawn by connecting the endpoints of the lateral and thin veins in a clockwise direction.

III-B Overall Pipeline

The overall pipeline of LeafText is shown in Figure 3, which consists of the backbone, FPN, MV header, LV-TV header, Multi-Oriented Smoother (MOS), and Vein Growth Process. ResNet [13] is adopted as the backbone to extract basic features from the H × W × 3 input image, where H and W are the height and width of the input image with 3 channels. It outputs multiple coarse and fine feature maps simultaneously at different strides of the network. The coarse features capture the global correlation between texts, and the fine ones focus on local details. To extract strongly expressive features equipped with both global and local information for the following detection headers, FPN [25] is used to combine the multiple backbone features into a concatenated feature map F. As described in Section III-A, the text contour is represented by the combination of main, lateral, and thin veins. To extract the main vein of a text instance, LeafText first feeds F into the MV header to generate the kernel mask map. Then, it extracts the main vein from the kernel mask by MOS. For the growth of lateral and thin veins, the LV-TV header conducts a regression task on F to generate the length mask map with K channels, where K is the number of directions in the predefined polar coordinate system (as described in Section III-A). In the length mask map, the pixel values at the start points of the lateral and thin veins are the vein lengths in the K directions, respectively. With the determined main vein (referred to Equation 1) and the lengths of the lateral and thin veins in all directions, the text contour can be generated by the Vein Growth Process (described in Section III-A and Fig. 2).

Input: the kernel mask map M
Output: the coordinates P of the lateral vein start points and the corresponding tangent slopes S, len(P) = len(S) = N

 1: function Main(M)
 2:     f, θ ← MultiOrientedSmoother(M)
 3:     P′ ← equidistantSample(f, N)  // N rotated equidistant start points of lateral veins sampled from f, len(P′) = N
 4:     S′ ← tangentSlope(f, P′)      // tangent slopes at the start points, computed by Equation 2, len(S′) = N
 5:     P ← rotate(P′, −θ)            // rotate the start points back to the original coordinates
 6:     S ← rotate(S′, −θ)            // rotate the tangent slopes back accordingly
 7:     return P, S
 8: end function
 9:
10: function MultiOrientedSmoother(M)
11:     h ← size(M)
12:     M ← padding(M, h)             // extract the kernel mask from M and pad it by 0 on the top, bottom, left, and right with h
13:     C ← MiddleSample(M)           // multiple center points of the kernel mask, len(C) = m
14:     θ ← angle(C_1, C_m)           // θ is the angle between the vector from C_1 to C_m and the X-axis
15:     C′ ← rotate(C, θ)             // rotate C with C_1 as the origin
16:     f ← polyFit(C′)               // fit the main vein polynomial to the rotated center points C′
17:     return f, θ
18: end function
19:
20: function MiddleSample(M)
21:     C ← initial()
22:     Q ← coordinate(M)             // Q contains the point coordinates of the kernel region
23:     x_min, x_max ← min(Q_x), max(Q_x)   // Q_x are the x coordinates of Q
24:     y_min, y_max ← min(Q_y), max(Q_y)   // Q_y are the y coordinates of Q
25:     for each column (or row) of the kernel region do
26:         if x_max − x_min ≥ y_max − y_min then
27:             sample the midpoint of the kernel pixels in the current column
28:         else
29:             sample the midpoint of the kernel pixels in the current row
30:         end if
31:         append the sampled midpoint to C
32:     end for
33:     return C
34: end function
Algorithm 1 Growth of Lateral Vein
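The middle-sampling and rotate-then-fit steps of Algorithm 1 can be sketched in NumPy. The column-wise midpoint sampling and the rotation before fitting follow the algorithm, while the function names and the degree-3 default are our assumptions (the row-wise case for tall masks is omitted for brevity):

```python
import numpy as np

def middle_sample(mask):
    """MIDDLE SAMPLE sketch: for each occupied column of a binary kernel
    mask, take the midpoint between the lowest and highest kernel pixel."""
    ys, xs = np.nonzero(mask)
    pts = [(x, (ys[xs == x].min() + ys[xs == x].max()) / 2.0)
           for x in np.unique(xs)]
    return np.asarray(pts, dtype=float)

def fit_main_vein(mask, degree=3):
    """MOS sketch: rotate the sampled center points so the first-to-last
    direction lies on the X-axis, then fit the main vein polynomial."""
    pts = middle_sample(mask)
    angle = np.arctan2(pts[-1, 1] - pts[0, 1], pts[-1, 0] - pts[0, 0])
    c, s = np.cos(-angle), np.sin(-angle)
    rot = pts @ np.array([[c, s], [-s, c]])  # rotate all points by -angle
    coeffs = np.polyfit(rot[:, 0], rot[:, 1], degree)
    return coeffs, angle
```

For a horizontal strip-shaped mask, the fitted main vein is simply the horizontal center line of the strip.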

III-C Multi-Oriented Smoother

As shown in Fig. 2 and Fig. 3, accurately extracting the main vein (refer to Equation 1) from the predicted kernel mask is important for determining the lateral and thin veins, which is the key to rebuilding the text contour. However, generating the main vein by the existing middle sampling method always yields a discrete, jagged result, which leads to bad growth directions for the lateral and thin veins and unreliable reconstructed text contours.

Considering the above issue, the Multi-Oriented Smoother (MOS) is designed to improve the reliability of the main vein. Specifically, as shown in the function MULTI-ORIENTED SMOOTHER of Algorithm 1, MOS extracts the kernel mask region first. Meanwhile, considering that rotating the image leads to information loss for text instances at the image borders, MOS pads the kernel mask by 0 on the top, bottom, left, and right with h, where h is the height of the input image. Then, initial center points are sampled by the function MIDDLE SAMPLE in Algorithm 1. Next, the angle between the center-point sequence and the X-axis is computed, and the center points are rotated with the first center point as the origin. In the end, the main vein is fitted to the rotated center points. With a smooth main vein, reliable start points and growth directions of the lateral veins can be determined by the function MAIN in Algorithm 1, which improves the reliability of the detection results significantly (verified in Section IV-C).

Fig. 4: Visualization of the label generation process. The kernel mask (b) is used for extracting the main vein; it is the ground truth of the MV header in Fig. 3. The lengths of the lateral and thin veins ((e) and (f)) in all directions of the predefined polar coordinate system are responsible for supervising the LV-TV header of LeafText in the training process.

III-D Label Generation

As described in Section III-A, text contour is represented by the combination of main, lateral, and thin veins. In Fig. 3, LeafText predicts kernel mask to extract main veins. Meanwhile, it regresses the lengths of lateral and thin veins in all directions of the predefined polar coordinate system. In this section, we illustrate the label generation process of kernel mask and vein lengths.

For the label of kernel mask (Fig. 4 (b)), the corresponding boundary is generated by shrinking text contour through the algorithm proposed in [43]. The inner region of the boundary is regarded as the kernel mask.
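The paper shrinks the text contour with the clipping algorithm of [43]. As a rough illustration only, the stand-in below scales the polygon toward its centroid, which approximates the shrink for convex text boxes; the function name and ratio value are arbitrary choices of this sketch.

```python
import numpy as np

def shrink_polygon(poly, ratio=0.5):
    """Toy stand-in for the kernel-mask shrink: move every contour point
    toward the polygon centroid by `ratio`. The actual label generation
    uses the clipping algorithm of [43], which offsets edges instead."""
    poly = np.asarray(poly, dtype=float)
    centroid = poly.mean(axis=0)
    return centroid + ratio * (poly - centroid)

# Shrinking a 10 x 4 text box halves its extent around the center.
kernel = shrink_polygon([[0, 0], [10, 0], [10, 4], [0, 4]])
```

The region enclosed by the shrunk contour then serves as the kernel mask label, matching the description above.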

For the label of lateral vein length, main vein (Fig. 4 (c)) is extracted from kernel mask by the function MULTI-ORIENTED SMOOTHER in Algorithm 1 at first. Then, the start points and growth directions are determined (the details refer to function MAIN in Algorithm 1 and Section III-A, respectively). In the end, the lengths between start points and text contour in directions of the predefined polar coordinate system are computed. For the label of thin vein length, start points are sampled according to the lateral vein, and the lengths are computed in the same way as lateral vein.

Fig. 5: Illustration of the distributions of lateral and thin vein lengths on different public benchmarks. 'M', 'T', 'C', and 'I' indicate the MSRA-TD500, Total-Text, CTW1500, and ICDAR2015 datasets, respectively. 'r' and 'e' denote training and testing samples. 'A' denotes the lengths in all directions (referred to Fig. 4). 'L' and 'T' are the lengths in the lateral and thin vein directions.

III-E Loss Function

LeafText determines the main, lateral, and thin veins by the MV and LV-TV headers (as shown in Fig. 3). To optimize the proposed pipeline effectively, we propose a multi-task loss function L (referred to Equation 8). It consists of the MV header loss L_mv and the LV-TV header loss L_lt, which supervise the corresponding headers in the training stage:

L = λ_1 L_mv + λ_2 L_lt,   (8)

where λ_1 and λ_2 are the coefficients of L_mv and L_lt. They are set to 1 and 0.25 in the following experiments.

Optimization of MV header. Dice loss [34] is designed for segmentation tasks with a strong imbalance between positive and negative samples. Considering that the kernel mask region is much smaller than the background, Dice loss is adopted as the MV header loss L_mv:

L_mv = 1 − (2 Σ_i P_i G_i + ε) / (Σ_i P_i + Σ_i G_i + ε),   (9)

where P and G denote the predicted kernel mask and the corresponding ground truth. To handle the situation where there are no positive samples in G, we set ε to 1 to ensure that the denominator of L_mv is greater than 0.
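The Dice loss with the ε = 1 safeguard can be written as a minimal NumPy sketch (tensor-framework and batching details omitted):

```python
import numpy as np

def dice_loss(pred, gt, eps=1.0):
    """Dice loss for the MV header; eps = 1 keeps the denominator
    positive even when the crop contains no positive kernel pixels."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    inter = (pred * gt).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

# A perfect binary prediction drives the loss to 0.
loss = dice_loss([1.0, 1.0, 0.0], [1.0, 1.0, 0.0])
```

Note that with ε = 1 an all-background crop with an all-background prediction also yields zero loss instead of a division by zero.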

Optimization of LV-TV header. As shown in Fig. 3, the LV-TV header is responsible for regressing the lengths of the lateral and thin veins. To facilitate the optimization of the regression task, the global incentive loss L_gi is proposed to supervise the LV-TV header:

L_lt = L_gi.   (10)

The global incentive loss aims to force our model to balance the importance of text instances with different scales and to focus on the prediction of the lateral and thin veins.

Specifically, to keep the same sensitivity for multi-scale texts, L_gi replaces the L2 loss or Smooth-L1 loss used in [37, 16] with a negative logarithm loss (as shown in Equation 11). It scales the difference between the predicted length and the ground truth into the range of 0–1, which keeps our model equally effective for large and small text instances:

L_log(l̂, l) = −log( min(l̂, l) / max(l̂, l) ),   (11)

where l̂ and l are the predicted lengths of the lateral and thin veins and the corresponding ground truth.

Moreover, as described in Section III-A and Section III-B, LeafText determines one lateral or thin vein out of K directions. This leads to an overwhelming number of indirect samples and a small number of direct samples (actual lateral and thin veins). To make training more effective and efficient, we propose an incentive strategy for direct samples. As shown in Fig. 5, the lengths of direct samples are smaller than those of indirect ones for a specific dataset. Therefore, an incentive coefficient w is formulated as follows:

w = 1 − l / s,   (12)

where s denotes the shorter side of the resized input images in the training and testing stages. It is responsible for scaling l into the range of 0–1.

By combining Equations 11 and 12, the global incentive loss can be formulated as:

L_gi = (1 / (N_s K)) Σ_{i=1}^{N_s} Σ_{k=1}^{K} w_{i,k} L_log(l̂_{i,k}, l_{i,k}),   (13)

where N_s is the total number of lateral and thin vein start points and K denotes the number of directions of the predefined polar coordinate system.
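A sketch of the global incentive loss under two explicit assumptions: the scale-invariant term is taken as −log(min/max) of predicted versus ground-truth length, and the incentive coefficient as 1 − l/s. Both forms are illustrative reconstructions rather than the paper's verbatim definitions.

```python
import numpy as np

def global_incentive_loss(pred, gt, short_side):
    """Sketch of the global incentive loss. Assumed forms (both are
    reconstructions): the scale-invariant term is -log(min/max) of
    predicted vs. ground-truth length, and the incentive coefficient
    1 - gt/short_side up-weights short (direct) samples."""
    pred = np.maximum(np.asarray(pred, float), 1e-6)
    gt = np.maximum(np.asarray(gt, float), 1e-6)
    log_term = -np.log(np.minimum(pred, gt) / np.maximum(pred, gt))
    incentive = 1.0 - gt / short_side  # shorter samples weigh more
    return float(np.mean(incentive * log_term))

# Exact length predictions yield zero loss regardless of text scale.
loss = global_incentive_loss([12.0, 200.0], [12.0, 200.0], short_side=640)
```

Because the log term depends only on the length ratio, a small text mispredicted by 20% is penalized as heavily as a large one, which is the multi-scale balancing described above.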

IV Experiments

IV-A Datasets

To demonstrate the strong ability of LVT to fit arbitrary-shaped texts, we analyze the upper bound of the IoU between the generated label and the ground truth. Meanwhile, the effectiveness of MOS and the global incentive loss is verified. Moreover, the proposed LeafText is evaluated on multiple representative public benchmarks to show its superior performance.

SynthText [11] contains 800k synthetic training samples, generated by compositing varied text instances onto scene RGB images. It is used to pre-train the model to improve the robustness of the proposed LeafText.

MSRA-TD500 [56] includes line-level Chinese and English text instances simultaneously. It is composed of 300 training images and 200 testing images. To ensure a fair comparison, the 400 images of HUST-TR400 [57] are additionally introduced as training data.

Total-Text [3] consists of word-level arbitrary-shaped multilingual texts, which brings significant challenges for model generalization. There are 1255 images for training model and 300 images for evaluating performance.

CTW1500 [28] is composed of 1500 samples, including 1000 training images and 500 testing images. Particularly, CTW1500 mainly contains line-level arbitrary-shaped text instances, which requires a strong ability to handle objects with large scales and aspect ratios.

ICDAR2015 [17] was proposed in the ICDAR 2015 Robust Reading Competition and contains 1000 training images and 500 testing images. Different from the above three public benchmarks, the backgrounds of ICDAR2015 images are more complicated. Meanwhile, the text instances share similar basic features with the background, which brings many difficulties for text detection.

(a) Upper bound analysis of train dataset.
(b) Upper bound analysis of test dataset.
Fig. 6: Upper bound analysis. Using more start points and directions for the lateral veins models text contours with higher IoU against the ground truth. 'Directions' are the vein directions in the predefined polar coordinate system.

IV-B Implementation Details

The overall pipeline of the proposed LeafText is depicted in Fig. 3. The backbone adopts ResNet [13] directly, and the details of the FPN can be found in [25]. The MV header and the LV-TV header are composed of one convolutional layer each.

In the pre-processing stage, training samples are obtained through data augmentation and label generation. The former contains the following strategies: (1) random scaling (of both image size and aspect ratio); (2) random horizontal flipping; (3) random rotation in the range of (-10°, 10°); (4) random cropping and padding. For the latter, the kernel mask and the lengths of the lateral and thin veins in all directions are generated by the process in Fig. 4. Different from training samples, testing samples are produced only by resizing the input RGB images to specific sizes. Particularly, text instances labeled as DO NOT CARE are ignored during both the training and testing stages.

In the training stage, the weights of the CNN are initialized first. Specifically, the backbone is pre-trained on ImageNet [7], while the FPN and the headers are initialized with the strategy proposed in [12]. To ensure efficient and effective convergence, Adam [18] is adopted as the optimizer. The learning rate is set to 0.001 and decayed with the 'poly' strategy as the model converges. In the comparison experiments, our model is first trained on the SynthText dataset for 1 epoch and then fine-tuned on the official datasets (MSRA-TD500, Total-Text, CTW1500, and ICDAR2015) for 600 epochs with a batch size of 16. All experiments in this paper are conducted on a workstation with an RTX 1080Ti GPU.
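The 'poly' decay strategy is not spelled out in this section; a minimal sketch of its common form follows, assuming the usual exponent of 0.9 (an assumption, not a value stated in the paper).

```python
def poly_lr(base_lr, step, max_steps, power=0.9):
    """Polynomial ('poly') learning-rate decay: the rate shrinks smoothly
    from base_lr toward zero as training progresses."""
    return base_lr * (1.0 - step / max_steps) ** power

# Base LR 0.001 as stated above; 600 steps mirror the 600 fine-tuning epochs.
lrs = [poly_lr(1e-3, s, 600) for s in (0, 300, 599)]
```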

IV-C Ablation Study

To verify the effectiveness of LeafText, we conduct ablation experiments on multiple public benchmarks in this section. Specifically, to verify the strong fitting ability of LVT, we analyze the upper bound of the IoU between the reconstructed text contour and the ground truth. Meanwhile, the superiority of the proposed global incentive loss is demonstrated by comparing it with existing loss functions. Furthermore, the importance of MOS for rebuilding text contours is verified. The details of the experimental results are described in the following paragraphs.

Fig. 7: Visualization of the proposed leaf vein-based text representation method. We treat the text contour as the leaf margin and construct it through the combination of the main vein, lateral veins, and thin veins.
Directions  Points  MSRA-TD500                       Total-Text
                    Precision  Recall  F-measure    Precision  Recall  F-measure
4           2       86.9       77.8    82.1         78.6       67.4    72.6
4           3       88.3       78.9    83.3         85.5       73.4    79.0
4           5       88.5       79.1    83.5         89.0       76.3    82.1
4           7       88.5       79.1    83.5         88.6       75.9    81.8
4           9       88.2       78.8    83.2         89.1       76.3    82.2
8           2       90.7       80.1    85.1         88.8       76.6    82.3
8           3       91.5       80.4    85.6         90.5       78.2    83.9
8           5       92.0       80.5    85.9         91.1       78.6    84.4
8           7       91.9       80.4    85.8         91.2       78.7    84.5
8           9       92.1       80.6    86.0         91.1       78.6    84.4
16          2       88.7       80.7    84.5         88.6       83.0    85.7
16          3       89.1       81.0    84.9         88.9       83.4    86.1
16          5       89.1       81.0    84.9         89.0       83.5    86.2
16          7       88.9       80.8    84.7         89.0       83.5    86.2
16          9       89.1       81.0    84.9         88.8       83.4    86.0
24          2       86.6       80.9    83.7         89.3       81.5    85.2
24          3       86.6       80.9    83.7         89.6       81.8    85.5
24          5       87.2       81.4    84.2         89.3       81.6    85.3
24          7       87.0       81.2    84.0         89.4       81.7    85.4
24          9       87.2       81.4    84.2         89.4       81.7    85.4
32          2       87.9       79.8    83.7         89.4       76.2    82.3
32          3       88.5       80.3    84.2         89.6       76.5    82.5
32          5       88.7       80.5    84.4         89.7       77.4    83.1
32          7       88.7       80.5    84.4         89.1       76.3    82.2
32          9       88.7       80.5    84.4         89.3       76.4    82.3
(a) Table of experimental results
(b) Curves of training loss and IoU.
Fig. 8: Ablation study on the impact of the direction number and the sampled point number on detection performance. The direction number is the number of vein directions predefined in the polar coordinate system; the point number is the number of center points sampled on the main vein for reconstructing text contours. Red, green, and blue mark the experimental results of the three best groups of settings on the MSRA-TD500 and Total-Text datasets, respectively. 'IoU' in (b) indicates the Intersection over Union between the predicted shrink-mask and the corresponding ground truth.

Upper Bound Analysis of LVT. Considering that existing approaches fail to fit irregular-shaped texts accurately, a leaf vein-based text representation method is proposed.

To verify its effectiveness, we analyze the upper bound of the IoU between the text contour rebuilt from the generated labels and the ground truth. As shown in Fig. 6, the IoU reaches at least 96% on both the training and testing samples of the four public benchmarks (MSRA-TD500, Total-Text, CTW1500, and ICDAR2015). For some highly curved text instances (visualized in Fig. 7), LVT still achieves superior performance. These results demonstrate the strong fitting ability of the proposed leaf vein-based text representation method for arbitrary-shaped texts.
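The upper-bound IoU above can be measured by rasterizing the rebuilt contour and the ground-truth polygon to binary masks and comparing them; a minimal mask-based sketch (the rasterization step is assumed to have already produced the two masks):

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection over Union of two boolean masks of equal shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 0.0

# Toy example: two overlapping 50x50 squares on a 100x100 grid.
pred = np.zeros((100, 100), bool); pred[10:60, 10:60] = True
gt = np.zeros((100, 100), bool); gt[20:70, 20:70] = True
iou = mask_iou(pred, gt)
```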

Moreover, as described in Section III-A, the reconstruction process of LVT relies on the start points of the lateral veins and the directions of the predefined polar coordinate system. Therefore, we further explore the influence of the numbers of start points and directions (Fig. 6). Concretely, the IoU is evaluated while tuning each of the two numbers. There is a significant increase in IoU when the number of start points is tuned from 2 to 5, and the upper bound continues to grow slowly when it is set to 7 and 9, which shows that the start points of the lateral veins play an important role in representing texts. Furthermore, the relation between the IoU and the direction number is also visualized: a larger direction number improves the fitting ability of our method, which verifies the importance of the predefined directions for the leaf vein-based text representation method.
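The reconstruction scheme discussed above can be sketched as follows. Only the overall idea (center points on the main vein plus per-direction vein lengths in polar coordinates) comes from the text; the function name, the evenly spaced direction layout, and the toy inputs are illustrative assumptions.

```python
import math

def reconstruct_contour(centers, lengths, num_dirs):
    """Rebuild a coarse contour: for each center point on the main vein,
    cast rays in num_dirs evenly spaced polar directions and place a
    contour point at the predicted vein length along each ray."""
    points = []
    for (cx, cy), dists in zip(centers, lengths):
        for k, r in enumerate(dists):
            theta = 2 * math.pi * k / num_dirs
            points.append((cx + r * math.cos(theta), cy + r * math.sin(theta)))
    return points

# Toy example: two centers, 4 directions, unit lengths -> 8 contour points.
pts = reconstruct_contour([(0, 0), (10, 0)], [[1, 1, 1, 1]] * 2, 4)
```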

Performance Analysis under Different Direction and Point Numbers. We have analyzed the upper bound IoU of the proposed LVT on different kinds of text instances above. To further verify the model's performance, LeafText is trained and evaluated under different direction and point numbers on the MSRA-TD500 and Total-Text benchmarks.

Specifically, as we can see from Table (a) in Fig. 8, for multi-oriented texts (MSRA-TD500), LeafText achieves the best performance when the direction and point numbers are set to 8 and 9, respectively. Meanwhile, the F-measure begins to decrease as the direction number increases further. For irregular-shaped text instances, our method achieves 86.2% in F-measure when the direction and point numbers are set to 16 and 5, which outperforms the other settings. The above results indicate the best settings for detecting multi-oriented and irregular-shaped texts. Furthermore, we visualize the details of the training process in Fig. 8 (b). The curves of the training IoU on MSRA-TD500 show that the IoU is smaller than under the other settings when the direction number equals 8, which matches the results of Table (a) in Fig. 8. Meanwhile, the curves of the training IoU and loss on Total-Text show an unsatisfactory convergence process when the direction number is set to 4 and 8, which verifies the effectiveness of a larger direction number for irregular-shaped texts. These experimental results provide appropriate model settings for the following comparison experiments on different kinds of text instances.

MSRA-TD500 (Directions = 8, Points = 9)
MOS    Precision  Recall  F-measure
w/o    91.6       78.8    84.7
with   92.1       80.6    86.0

Total-Text (Directions = 16, Points = 5)
MOS    Precision  Recall  F-measure
w/o    88.3       82.1    85.1
with   89.0       83.5    86.2

TABLE I: Detection results of the models equipped with MOS and w/o MOS on the MSRA-TD500 and Total-Text datasets.
Fig. 9: Visualization of the differences between the text contour reconstruction processes with MOS and w/o MOS. The sample is picked from the MSRA-TD500 dataset, and the direction and point numbers of the model are set to 8 and 5, respectively.

Effectiveness of MOS. As described in Section III-C, to improve the accuracy of the reconstructed text instances, MOS is designed to ensure the reliability of the main vein extracted from the predicted, possibly unreliable kernel mask. To demonstrate its effectiveness, we analyze the improvements in detection performance brought by MOS and visualize some qualitative results. As shown in Table I, MOS brings improvements in F-measure on both the multi-oriented (MSRA-TD500) and irregular-shaped (Total-Text) datasets. Specifically, LeafText with MOS achieves 86.0% and 86.2% F-measure on the two benchmarks, surpassing LeafText without MOS by 1.3% and 1.1%, respectively. These results demonstrate the effectiveness of MOS in improving the quality of the rebuilt contours. To further explain how MOS smooths the main vein, we visualize the process in Fig. 9. Given a predicted kernel mask (Fig. 9 (b)), MOS helps our method determine the correct tangent direction at each start point (Fig. 9 (d)), which avoids a disordered contour point sequence (Fig. 9 (f)) and effectively improves the reliability of the reconstructed text contour (Fig. 9 (i)). The visualization vividly depicts the differences between the text contour reconstruction processes with and without MOS.
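The internals of MOS are given in Section III-C rather than here; purely as a hedged illustration of the underlying idea, tangent directions estimated along a noisy main vein can be smoothed by averaging unit vectors in a sliding window (averaging raw angles would break at the 2π wraparound). This sketch is not the authors' implementation.

```python
import math

def smooth_directions(angles, window=3):
    """Smooth a sequence of tangent angles (radians) by averaging the
    corresponding unit vectors in a sliding window, avoiding the 2*pi
    wraparound problem of averaging raw angles."""
    half = window // 2
    out = []
    for i in range(len(angles)):
        lo, hi = max(0, i - half), min(len(angles), i + half + 1)
        sx = sum(math.cos(a) for a in angles[lo:hi])
        sy = sum(math.sin(a) for a in angles[lo:hi])
        out.append(math.atan2(sy, sx))
    return out

# A noisy, roughly horizontal main vein: the outlier at index 2 is damped.
smoothed = smooth_directions([0.0, 0.1, 1.2, 0.0, -0.1])
```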

Dataset      Loss              Precision  Recall  F-measure
MSRA-TD500   Smooth-L1         89.2       77.3    82.8
MSRA-TD500   L2                90.0       71.1    79.4
MSRA-TD500   Global incentive  92.1       80.6    86.0
Total-Text   Smooth-L1         88.5       80.1    84.1
Total-Text   L2                88.7       74.5    81.0
Total-Text   Global incentive  89.0       83.5    86.2

TABLE II: The detection results of the models trained with different loss functions on the MSRA-TD500 and Total-Text datasets.

Effectiveness of the Global Incentive Loss. The existing L2 loss and Smooth-L1 loss mainly focus on large samples, which leads to the neglect of small objects. The global incentive loss is therefore designed to force our model to balance the importance of texts with different scales and to focus on the prediction of the lateral and thin veins.

As shown in Table II, compared with the existing L2 and Smooth-L1 losses, training LeafText with the proposed global incentive loss brings at least 3.2% and 2.1% improvements in F-measure on MSRA-TD500 and Total-Text, respectively. Since MSRA-TD500 contains many large and small texts simultaneously, these results demonstrate that the global incentive loss helps the model handle differently sized text instances. Meanwhile, the results in Table II verify that our method regresses the lateral and thin vein lengths more accurately when the LV-TV prediction header is supervised by it. Furthermore, we visualize the training processes of the different losses in Fig. 10. The global incentive loss fluctuates around 0.1 at the end of the convergence process on both MSRA-TD500 and Total-Text. Compared with the Smooth-L1 and L2 losses, the proposed loss accelerates convergence effectively and improves the model's ability to learn text features. These results demonstrate the effectiveness of the proposed global incentive loss for detecting multi-scaled texts.
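The exact formulation of the global incentive loss is defined elsewhere in the paper; purely as an illustration of the stated motivation (balancing instances of different scales), a scale-normalized regression loss could look like the following sketch. The normalization by mean vein length is an assumption, not the paper's definition.

```python
def scale_balanced_loss(pred_lengths, gt_lengths):
    """Illustrative scale-balanced regression loss: each instance's mean
    absolute error is normalized by its own scale, so large and small texts
    contribute comparably. A sketch of the stated motivation only, not the
    paper's exact global incentive loss."""
    total = 0.0
    for pred, gt in zip(pred_lengths, gt_lengths):
        scale = max(sum(gt) / len(gt), 1e-6)  # mean vein length as instance scale
        err = sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)
        total += err / scale
    return total / len(pred_lengths)

# A small and a large text with the same relative error contribute equally.
loss = scale_balanced_loss([[11.0, 9.0], [110.0, 90.0]],
                           [[10.0, 10.0], [100.0, 100.0]])
```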

Fig. 10: Visualization of the training processes on the MSRA-TD500 and Total-Text with different loss functions.

MSRA-TD500 (MAE: LV 11.3, TV 11.1)
TV     Precision  Recall  F-measure
w/o    91.8       80.2    85.6
with   92.1       80.6    86.0

Total-Text (MAE: LV 5.8, TV 5.3)
TV     Precision  Recall  F-measure
w/o    88.1       82.7    85.3
with   89.0       83.5    86.2

TABLE III: Impact of TV on detection results on the MSRA-TD500 and Total-Text datasets. 'LV' and 'TV' denote the lateral vein and thin vein, respectively. 'MAE' means Mean Absolute Error.

Superiority of the Thin Vein. As described in Section III-A, the thin vein is designed to refine text contours, which supports fitting texts accurately with lower model complexity. Benefiting from the fact that the thin vein length is half that of the lateral vein, the thin vein eases the learning of the contour point sequence and ensures accurate detection results. To verify the superiority of the thin vein, we evaluate the accuracy of the lateral and thin veins in Table III. We first evaluate the Mean Absolute Error (MAE) of the lateral vein and the thin vein. The MAE of the lateral vein exceeds that of the thin vein by 0.2 and 0.5 on MSRA-TD500 and Total-Text, respectively. This shows that predicting the thin vein is easier than predicting the lateral vein, verifying the advantage that the thin vein eases the learning of the contour point sequence. Meanwhile, the thin vein brings improvements of 0.4% and 0.9% in F-measure on MSRA-TD500 and Total-Text, respectively. These results prove that the thin vein effectively promotes the model's performance in detecting text instances.
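For clarity, the MAE figures in Table III are plain mean absolute errors over predicted vein lengths, as in this minimal sketch (the toy values are not from the paper):

```python
def mean_absolute_error(pred, gt):
    """MAE between predicted and ground-truth vein lengths (in pixels)."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)

# Toy example: errors of 1, 0, and 2 pixels average to an MAE of 1.0.
mae = mean_absolute_error([10.0, 12.0, 8.0], [11.0, 12.0, 10.0])
```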

IV-D Comparison with State-of-the-Art Methods

To demonstrate the superior performance of LeafText in detecting texts with arbitrary shapes, multiple scales, and multiple languages, we compare it with existing state-of-the-art (SOTA) approaches on four representative public benchmarks (MSRA-TD500, Total-Text, CTW1500, and ICDAR2015) in this section. Meanwhile, the advantages of our method over previous methods are analyzed based on the comparisons and qualitative detection results.

 

Methods Precision Recall F-measure

 

MOTD [62] (CVPR 2016) 83.0 67.0 74.0
EAST [63] (CVPR 2017) 87.3 67.4 76.1
SegLink [39] (CVPR 2017) 86.0 70.0 77.0
PixelLink [6] (AAAI 2018) 83.0 73.2 77.8
TextSnake [31] (ECCV 2018) 83.2 73.9 78.3
RRD [23] (CVPR 2018) 87.0 73.0 79.0
CornerNet [32] (CVPR 2018) 87.6 76.2 81.5
CRAFT [1] (CVPR 2019) 88.2 78.2 82.9
TextField [54] (TIP 2019) 87.4 75.9 81.3
SAE [42] (CVPR 2019) 84.2 81.7 82.9
ATRR [50] (CVPR 2019) 85.2 82.1 83.6
PAN [49] (ICCV 2019) 84.4 83.8 84.1
DB [22] (AAAI 2020) 90.4 76.3 82.8
DRRG [60] (CVPR 2020) 88.1 82.3 85.1
OPMP [59] (TMM 2021) 86.0 83.4 84.7
PAN++ [48] (TPAMI 2021) 85.3 84.0 84.7
SAVTD [9] (CVPR 2021) 89.2 81.5 85.2
GV [53] (TPAMI 2021) 88.8 84.3 86.5
ReLaText [33] (PR 2021) 90.5 83.2 86.7
LPAP [10] (TOMM 2022) 87.9 77.7 82.5
DC [2] (PR 2022) 87.9 83.1 85.4
Res18-DB++ [24] (TPAMI 2022) 87.9 82.5 85.1
Res50-DB++ [24] (TPAMI 2022) 91.5 83.3 87.2

 

Res50-Pre-Ours (736) 92.1 83.8 87.8

 

TABLE IV: Performance comparison on MSRA-TD500 dataset.

Evaluation on MSRA-TD500. To verify the performance on line-level multi-oriented text instances, we evaluate the proposed LeafText on the MSRA-TD500 dataset. As shown in Table IV, among existing state-of-the-art (SOTA) methods, ReLaText [33], GV [53], and DC [2] achieve 86.7%, 86.5%, and 85.4% in F-measure, respectively. Benefiting from decomposing long texts into multiple characters and the strong connection ability of the Graph Convolutional Network (GCN), ReLaText surpasses GV and DC in F-measure by 0.2% and 1.3%, respectively. Unlike ReLaText, LeafText models the whole text directly, which effectively avoids the character ignorance problem and improves detection performance. Specifically, our method achieves 87.8% in F-measure on MSRA-TD500, surpassing the best existing method, ReLaText, by 1.1%. As for DB++ [24], though it achieves significant improvement by embedding DConv [4] into the corresponding backbone, our method still outperforms it with a basic network. We show some qualitative results on MSRA-TD500 in Fig. 11 (a). The above results demonstrate the superior ability of LeafText to detect very long, multi-oriented, and multi-lingual texts.

Fig. 11: Qualitative detection results of LeafText on (a) MSRA-TD500, (b) Total-Text, (c) CTW1500, and (d) ICDAR2015.

 

Methods Total-Text CTW1500
Precision Recall F-measure Precision Recall F-measure

 

TextSnake [31] (ECCV 2018) 82.7 74.5 78.4 67.9 85.3 75.6
ATRR [50] (CVPR 2019) 80.9 76.2 78.5 80.1 80.2 80.1
CRAFT [1] (CVPR 2019) 87.6 79.9 83.6 86.0 81.1 83.5
CTD [36] (ICDAR 2019) 80.6 82.3 81.4 79.9 77.0 78.5
LOMO [58] (CVPR 2019) 87.6 79.3 83.3 85.7 76.5 80.8
PSE [47] (CVPR 2019) 84.0 78.0 80.9 84.8 79.7 82.2
SegLink++ [41] (PR 2019) 82.1 80.9 81.5 82.8 79.8 81.3
TextDragon [8] (ICCV 2019) 85.6 75.7 80.3 84.5 82.8 83.6
Boundary [46] (AAAI 2020) 85.2 83.5 84.3
ContourNet [51] (CVPR 2020) 86.9 83.9 85.4 83.7 84.1 83.9
TextRay [45] (ACMMM 2020) 83.5 77.9 80.6 82.8 80.4 81.6
Spotter [19] (TPAMI 2021) 88.3 82.4 85.2
FCENet [64] (CVPR 2021) 87.4 79.8 83.4 85.4 80.7 83.1
PSE+STKM [44] (CVPR 2021) 86.3 78.4 82.2 85.1 78.2 81.5
OPMP [59] (TMM 2021) 88.5 82.9 85.6 85.1 80.8 82.9
ASTD [5] (TMM 2022) 85.4 81.2 83.2 86.2 80.4 83.2
TextDCT [40] (TMM 2022) 87.2 82.7 84.9 85.0 85.3 85.1
LPAP [10] (TOMM 2022) 87.3 79.8 83.4 84.6 80.3 82.4
DC [2] (PR 2022) 90.5 82.7 86.4 86.9 82.7 84.7
Res50-DB++ [24] (TPAMI 2022) 88.9 83.2 86.0 87.9 82.8 85.3

 

Res18-Pre-Ours (640) 90.8 84.0 87.3 87.1 83.9 85.5

 

TABLE V: Performance comparison on Total-Text and CTW1500 datasets.

Evaluation on Total-Text and CTW1500. Irregular-shaped texts bring challenges to existing text detection methods. LeafText represents contours through point sequences to improve the text fitting ability. To verify the effectiveness of our method in detecting irregular-shaped texts, we make comparisons on Total-Text and CTW1500. We first resize the short sides of the images to 640 while keeping the original aspect ratio, and evaluate the model with ResNet-18 and ResNet-50 backbones, respectively.

As we can see from Table V, for the word-level text instances in Total-Text, DC [2] and DB++ [24] achieve 86.4% and 86.0% in F-measure, surpassing previous methods by up to 7.5%. On this challenging dataset, LeafText achieves SOTA performance with 87.3% in F-measure and exceeds DC [2] by 0.9%, which demonstrates the effectiveness of the proposed LVT and its superiority over existing text representation methods. Meanwhile, the thin vein is helpful for detecting large-scale instances, which further improves the detection performance on Total-Text.

Different from Total-Text, CTW1500 is composed of line-level text instances that contain large spaces between characters or words, which brings challenges to existing methods. As shown in Table V, DB++ [24] and TextDCT [40] are the latest SOTA methods on the CTW1500 benchmark, achieving 85.3% and 85.1% in F-measure, respectively. A similar conclusion can be drawn on CTW1500: our method is superior to previous methods. Specifically, our method achieves 85.5% in F-measure, surpassing DB++ [24] by 0.2% even though it is equipped with DConv [4] and a more complicated backbone (ResNet-50). The experimental results on Total-Text and CTW1500 prove the superiority of the proposed LVT for fitting irregular-shaped texts, and verify its strong ability to detect word-level and line-level instances simultaneously. Some qualitative results on Total-Text and CTW1500 are depicted in Fig. 11 (b) and (c) to further demonstrate the effectiveness of LeafText.

 

Methods Precision Recall F-measure

 

WordSup [15] (ICCV 2017) 79.3 77.0 78.2
MCN [29] (CVPR 2018) 72.0 80.0 76.0
PixelLink [6] (AAAI 2018) 85.5 82.0 83.7
TextBoxes++ [21] (TIP 2018) 87.8 78.5 82.9
PSE [47] (CVPR 2019) 86.9 84.5 85.7
RRD [23] (CVPR 2018) 88.0 80.0 83.8
SegLink++ [41] (PR 2019) 83.7 80.3 82.0
Boundary [46] (AAAI 2020) 88.1 82.2 85.0
FCENet [64] (CVPR 2021) 85.1 84.2 84.6
Spotter [19] (TPAMI 2021) 85.8 81.2 83.4
PAN++ [48] (TPAMI 2021) 85.9 80.4 83.1
EAST+STKM [44] (CVPR 2021) 88.7 84.9 86.8
PSE+STKM [44] (CVPR 2021) 87.8 84.1 85.9
ASTD [5] (TMM 2022) 87.2 81.3 84.1
TextDCT [40] (TMM 2022) 88.9 84.8 86.8
LPAP [10] (TOMM 2022) 88.7 84.4 86.5

 

Res50-Pre-Ours (1152) 88.9 82.3 86.1

 

TABLE VI: Performance comparison on ICDAR2015 Dataset.

 

Type Methods Training Testing P R F

 

word-level TextField [54] IC15 TT 61.5 65.2 63.3
CM-Net [55] 75.8 64.5 69.7
Res18-Pre-Ours 89.2 80.0 84.4
TextField [54] TT IC15 77.1 66.0 71.1
CM-Net [55] 76.5 68.1 72.1
Res18-Pre-Ours 83.0 69.9 75.9

 

line-level TextField [54] MSRA CTW 75.3 70.0 72.6
CM-Net [55] 77.2 69.7 72.8
Res18-Pre-Ours 83.8 75.0 79.2
TextField [54] CTW MSRA 85.3 75.8 80.3
CM-Net [55] 85.8 77.1 81.2
Res18-Pre-Ours 82.9 82.0 82.4

 

TABLE VII: Cross-dataset evaluations on word-level (ICDAR2015 and Total-Text) and line-level (MSRA-TD500 and CTW1500) datasets.

Evaluation on ICDAR2015. The images in the International Conference on Document Analysis and Recognition (ICDAR) 2015 dataset are captured in market scenes, which leads to complicated backgrounds and makes it challenging to distinguish texts from interference regions. Moreover, the multi-oriented and multi-scaled instance shapes aggravate the difficulty of text detection. To test the model's performance in a complex environment, we conduct comparison experiments on the ICDAR2015 benchmark. As exhibited in Table VI, our method achieves 86.1% in F-measure. Although LeafText is slightly lower (by 0.7% and 0.4%) than TextDCT [40] and LPAP [10] in F-measure, it exceeds most existing SOTA methods (such as PSE [47], Boundary [46], and ASTD [5]). This is mainly because of LeafText's strong ability to fit various instance shapes and recognize text features. The results in Table VI and Fig. 11 (d) demonstrate that our method can effectively recognize texts of various scales and orientations against complex backgrounds.

IV-E Cross-Dataset Text Detection

To evaluate LeafText's generalization across datasets, we conduct cross-train-test experiments. Specifically, the above four public benchmarks consist of word-level (Total-Text and ICDAR2015) and line-level (MSRA-TD500 and CTW1500) texts, and we conduct cross-train-test experiments within each type. As shown in Table VII, on the word-level datasets, our method achieves 84.4% and 75.9% in F-measure when trained on ICDAR2015 and Total-Text and tested on the other, respectively. On the line-level datasets, LeafText achieves 79.2% and 82.4% in F-measure when trained on MSRA-TD500 and CTW1500, respectively. These experiments show LeafText's superior generalization performance.

(a) False-positive sample
(b) Over-emitting
Fig. 12: Illustration of some challenging samples. The green bounding boxes are the detection results of our method; the red ones mark failed detection regions.

IV-F Limitations of Our Algorithm

We have analyzed the upper bound performance of LeafText for fitting arbitrary-shaped text instances and verified the effectiveness of LVT, MOS, and the thin vein through the ablation studies in Section IV-C. Meanwhile, the superior detection and generalization performance of our method on multiple benchmarks is demonstrated in Section IV-D and Section IV-E. In this section, we discuss the limitations of our method by visualizing some difficult samples. As depicted in Fig. 12, there are two typical failure cases. For the false-positive sample (Fig. 12 (a)), the highly similar visual features of texts and interference regions make them hard to distinguish effectively. For the case shown in Fig. 12 (b), two adjacent texts cause our method to over-emit into the interior of each other, which introduces interference into the detection results and affects the subsequent text recognition task. Solving these limitations will be our future work.

V Conclusion

In this paper, we explore the geometric characteristics of leaf veins and relate them to text contours to design an effective text representation method (LVT), which improves the text fitting ability and naturally avoids the disordered point sequence problem. Meanwhile, LVT refines the text contour through the thin vein, whose length is half that of the lateral vein, which reduces the model's complexity and eases training convergence while ensuring superior detection performance. Furthermore, considering that the lateral and thin veins, which are responsible for sampling the contour point sequence, depend heavily on the main vein, the Multi-Oriented Smoother (MOS) is proposed to enhance the robustness of the main vein and thereby ensure the correct growth directions of the lateral and thin veins. Finally, the proposed global incentive loss accelerates the prediction of the lateral and thin veins and balances the importance of texts with different scales. Extensive experiments verify the effectiveness of the proposed LVT, MOS, and global incentive loss, as well as the superiority of the thin vein. Comparisons on multiple public benchmarks demonstrate the superior detection performance of our approach.

References

  • [1] Y. Baek, B. Lee, D. Han, S. Yun, and H. Lee (2019) Character region awareness for text detection. In CVPR, pp. 9365–9374. Cited by: §II-A, TABLE IV, TABLE V.
  • [2] Y. Cai, Y. Liu, C. Shen, L. Jin, Y. Li, and D. Ergu (2022) Arbitrarily shaped scene text detection with dynamic convolution. Pattern Recognition 127, pp. 108608. Cited by: §I, §IV-D, §IV-D, TABLE IV, TABLE V.
  • [3] C. K. Ch’ng and C. S. Chan (2017) Total-text: a comprehensive dataset for scene text detection and recognition. In Proceedings of the International Conference on Document Analysis and Recognition, Vol. 1, pp. 935–942. Cited by: §IV-A.
  • [4] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. In ICCV, pp. 764–773. Cited by: §IV-D, §IV-D.
  • [5] P. Dai, Y. Li, H. Zhang, J. Li, and X. Cao (2021) Accurate scene text detection via scale-aware data augmentation and shape similarity constraint. IEEE Transactions on Multimedia 24, pp. 1883–1895. Cited by: §II-A, §IV-D, TABLE V, TABLE VI.
  • [6] D. Deng, H. Liu, X. Li, and D. Cai (2018) PixelLink: detecting scene text via instance segmentation. In AAAI, pp. 6773–6780. Cited by: §II-A, TABLE IV, TABLE VI.
  • [7] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li (2009) Imagenet: a large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §IV-B.
  • [8] W. Feng, W. He, F. Yin, X. Zhang, and C. Liu (2019) TextDragon: an end-to-end framework for arbitrary shaped text spotting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9076–9085. Cited by: §II-B, TABLE V.
  • [9] W. Feng, F. Yin, X. Zhang, and C. Liu (2021) Semantic-aware video text detection. In CVPR, pp. 1695–1705. Cited by: §I, TABLE IV.
  • [10] Z. Fu, H. Xie, S. Fang, Y. Wang, M. Xing, and Y. Zhang (2022) Learning pixel affinity pyramid for arbitrary-shaped text detection. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM). Cited by: §I, §IV-D, TABLE IV, TABLE V, TABLE VI.
  • [11] A. Gupta, A. Vedaldi, and A. Zisserman (2016) Synthetic data for text localisation in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2315–2324. Cited by: §IV-A.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034. Cited by: §IV-B.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §III-B, §IV-B.
  • [14] P. He, W. Huang, T. He, Q. Zhu, Y. Qiao, and X. Li (2017) Single shot text detector with regional attention. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3047–3055. Cited by: §II-B.
  • [15] H. Hu, C. Zhang, Y. Luo, Y. Wang, J. Han, and E. Ding (2017) Wordsup: exploiting word annotations for character based text detection. In Proceedings of the IEEE international conference on computer vision, pp. 4940–4949. Cited by: §II-B, TABLE VI.
  • [16] L. Huang, Y. Yang, Y. Deng, and Y. Yu (2015) Densebox: unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874. Cited by: §II-B, §III-E.
  • [17] D. Karatzas, L. Gomez, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. Chandrasekhar, and S. Lu (2015) ICDAR 2015 competition on robust reading. In Proceedings of the International Conference on Document Analysis and Recognition, pp. 1156–1160. Cited by: §IV-A.
  • [18] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §IV-B.
  • [19] M. Liao, P. Lyu, M. He, C. Yao, W. Wu, and X. Bai (2021) Mask textspotter: an end-to-end trainable neural network for spotting text with arbitrary shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (2), pp. 532–548. External Links: Document Cited by: §II-A, TABLE V, TABLE VI.
  • [20] M. Liao, B. Shi, X. Bai, X. Wang, and W. Liu (2016) Textboxes: a fast text detector with a single deep neural network. arXiv preprint arXiv:1611.06779. Cited by: §II-B.
  • [21] M. Liao, B. Shi, and X. Bai (2018) Textboxes++: a single-shot oriented scene text detector. IEEE Transactions on Image Processing 27 (8), pp. 3676–3690. Cited by: §II-B, TABLE VI.
  • [22] M. Liao, Z. Wan, C. Yao, K. Chen, and X. Bai (2020) Real-time scene text detection with differentiable binarization. In AAAI, pp. 11474–11481. Cited by: §II-A, TABLE IV.
  • [23] M. Liao, Z. Zhu, B. Shi, G. Xia, and X. Bai (2018) Rotation-sensitive regression for oriented scene text detection. In CVPR, pp. 5909–5918. Cited by: §II-B, TABLE IV, TABLE VI.
  • [24] M. Liao, Z. Zou, Z. Wan, C. Yao, and X. Bai (2022) Real-time scene text detection with differentiable binarization and adaptive scale fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §II-A, §IV-D, §IV-D, §IV-D, TABLE IV, TABLE V.
  • [25] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: §III-B, §IV-B.
  • [26] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. Berg (2016) Ssd: single shot multibox detector. In Proceedings of the European Conference on Computer Vision, pp. 21–37. Cited by: §II-B.
  • [27] Y. Liu, H. Chen, C. Shen, T. He, L. Jin, and L. Wang (2020) ABCNet: real-time scene text spotting with adaptive bezier-curve network. In CVPR, pp. 9809–9818. Cited by: §II-B.
  • [28] Y. Liu, L. Jin, S. Zhang, and S. Zhang (2017) Detecting curve text in the wild: new dataset and new solution. arXiv preprint arXiv:1712.02170. Cited by: §IV-A.
  • [29] Z. Liu, G. Lin, S. Yang, J. Feng, W. Lin, and W. L. Goh (2018) Learning markov clustering networks for scene text detection. In CVPR, pp. 6936–6944. Cited by: §I, TABLE VI.
  • [30] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440. Cited by: §II-A.
  • [31] S. Long, J. Ruan, W. Zhang, X. He, W. Wu, and C. Yao (2018) Textsnake: a flexible representation for detecting text of arbitrary shapes. In ECCV, pp. 20–36. Cited by: §II-A, TABLE IV, TABLE V.
  • [32] P. Lyu, C. Yao, W. Wu, S. Yan, and X. Bai (2018) Multi-oriented scene text detection via corner localization and region segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7553–7563. Cited by: §II-A, TABLE IV.
  • [33] C. Ma, L. Sun, Z. Zhong, and Q. Huo (2021) ReLaText: exploiting visual relationships for arbitrary-shaped scene text detection with graph convolutional networks. Pattern Recognition 111, pp. 107684. Cited by: §II-B, §IV-D, TABLE IV.
  • [34] F. Milletari, N. Navab, and S. Ahmadi (2016) V-net: fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision, pp. 565–571. Cited by: §III-E.
  • [35] L. Neumann and J. Matas (2012) Real-time scene text localization and recognition. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3538–3545. Cited by: §II-A.
  • [36] X. Qin, Y. Zhou, D. Yang, and W. Wang (2019) Curved text detection in natural scene images with semi- and weakly-supervised learning. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 559–564. Cited by: §II-A, TABLE V.
  • [37] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788. Cited by: §II-B, §III-E.
  • [38] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Proceedings of the Neural Information Processing Systems, pp. 91–99. Cited by: §II-B.
  • [39] B. Shi, X. Bai, and S. Belongie (2017) Detecting oriented text in natural images by linking segments. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2550–2558. Cited by: §I, TABLE IV.
  • [40] Y. Su, Z. Shao, Y. Zhou, F. Meng, H. Zhu, B. Liu, and R. Yao (2022) TextDCT: arbitrary-shaped text detection via discrete cosine transform mask. IEEE Transactions on Multimedia. Cited by: §II-B, §IV-D, §IV-D, TABLE V, TABLE VI.
  • [41] J. Tang, Z. Yang, Y. Wang, Q. Zheng, Y. Xu, and X. Bai (2019) Seglink++: detecting dense and arbitrary-shaped scene text by instance-aware component grouping. Pattern recognition 96, pp. 106954. Cited by: §II-B, TABLE V, TABLE VI.
  • [42] Z. Tian, M. Shu, P. Lyu, R. Li, C. Zhou, X. Shen, and J. Jia (2019) Learning shape-aware embedding for scene text detection. In CVPR, pp. 4234–4243. Cited by: §II-A, TABLE IV.
  • [43] R. Vatti (1992) A generic solution to polygon clipping. Communications of the ACM 35 (7), pp. 56–63. Cited by: §III-D.
  • [44] Q. Wan, H. Ji, and L. Shen (2021) Self-attention based text knowledge mining for text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5983–5992. Cited by: §I, TABLE V, TABLE VI.
  • [45] F. Wang, Y. Chen, F. Wu, and X. Li (2020) TextRay: contour-based geometric modeling for arbitrary-shaped scene text detection. In Proceedings of the ACM International Conference on Multimedia, pp. 111–119. Cited by: §II-B, TABLE V.
  • [46] H. Wang, P. Lu, H. Zhang, M. Yang, X. Bai, Y. Xu, M. He, Y. Wang, and W. Liu (2020) All you need is boundary: toward arbitrary-shaped text spotting. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 12160–12167. Cited by: §II-B, §IV-D, TABLE V, TABLE VI.
  • [47] W. Wang, E. Xie, X. Li, W. Hou, T. Lu, G. Yu, and S. Shao (2019) Shape robust text detection with progressive scale expansion network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9336–9345. Cited by: §II-A, §IV-D, TABLE V, TABLE VI.
  • [48] W. Wang, E. Xie, X. Li, X. Liu, D. Liang, Z. Yang, T. Lu, and C. Shen (2021) PAN++: towards efficient and accurate end-to-end spotting of arbitrarily-shaped text. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §II-A, TABLE IV, TABLE VI.
  • [49] W. Wang, E. Xie, X. Song, Y. Zang, W. Wang, T. Lu, G. Yu, and C. Shen (2019) Efficient and accurate arbitrary-shaped text detection with pixel aggregation network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8440–8449. Cited by: §II-A, TABLE IV.
  • [50] X. Wang, Y. Jiang, Z. Luo, C. Liu, H. Choi, and S. Kim (2019) Arbitrary shape scene text detection with adaptive text region representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6449–6458. Cited by: §I, TABLE IV, TABLE V.
  • [51] Y. Wang, H. Xie, Z. Zha, M. Xing, Z. Fu, and Y. Zhang (2020) ContourNet: taking a further step toward accurate arbitrary-shaped scene text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11753–11762. Cited by: §II-B, TABLE V.
  • [52] E. Xie, P. Sun, X. Song, W. Wang, X. Liu, D. Liang, C. Shen, and P. Luo (2020) PolarMask: single shot instance segmentation with polar representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12193–12202. Cited by: §II-B.
  • [53] Y. Xu, M. Fu, Q. Wang, Y. Wang, K. Chen, G. Xia, and X. Bai (2020) Gliding vertex on the horizontal bounding box for multi-oriented object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (4), pp. 1452–1459. Cited by: §I, §IV-D, TABLE IV.
  • [54] Y. Xu, Y. Wang, W. Zhou, Y. Wang, Z. Yang, and X. Bai (2019) TextField: learning a deep direction field for irregular scene text detection. IEEE Transactions on Image Processing 28 (11), pp. 5566–5579. Cited by: §II-A, TABLE IV, TABLE VII.
  • [55] C. Yang, M. Chen, Z. Xiong, Y. Yuan, and Q. Wang (2022) CM-net: concentric mask based arbitrary-shaped text detection. IEEE Transactions on Image Processing 31, pp. 2864–2877. Cited by: §I, TABLE VII.
  • [56] C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu (2012) Detecting texts of arbitrary orientations in natural images. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1083–1090. Cited by: §IV-A.
  • [57] C. Yao, X. Bai, and W. Liu (2014) A unified framework for multioriented text detection and recognition. IEEE Transactions on Image Processing 23 (11), pp. 4737–4749. Cited by: §IV-A.
  • [58] C. Zhang, B. Liang, Z. Huang, M. En, J. Han, E. Ding, and X. Ding (2019) Look more than once: an accurate detector for text of arbitrary shapes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10552–10561. Cited by: §II-B, TABLE V.
  • [59] S. Zhang, Y. Liu, L. Jin, Z. Wei, and C. Shen (2020) OPMP: an omnidirectional pyramid mask proposal network for arbitrary-shape scene text detection. IEEE Transactions on Multimedia 23, pp. 454–467. Cited by: §II-A, TABLE IV, TABLE V.
  • [60] S. Zhang, X. Zhu, J. Hou, C. Liu, C. Yang, H. Wang, and X. Yin (2020) Deep relational reasoning graph network for arbitrary shape text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9696–9705. Cited by: §II-A, TABLE IV.
  • [61] S. Zhang, X. Zhu, J. Hou, C. Liu, C. Yang, H. Wang, and X. Yin (2020) Deep relational reasoning graph network for arbitrary shape text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9699–9708. Cited by: §II-B.
  • [62] Z. Zhang, C. Zhang, W. Shen, C. Yao, W. Liu, and X. Bai (2016) Multi-oriented text detection with fully convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4159–4167. Cited by: §II-A, TABLE IV.
  • [63] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, and J. Liang (2017) EAST: an efficient and accurate scene text detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5551–5560. Cited by: §II-B, TABLE IV.
  • [64] Y. Zhu, J. Chen, L. Liang, Z. Kuang, L. Jin, and W. Zhang (2021) Fourier contour embedding for arbitrary-shaped text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3123–3131. Cited by: §II-B, TABLE V, TABLE VI.