Scene text detection in the wild has become a hot research topic in computer vision because of its many applications in document image analysis, scene understanding, autonomous driving, and so on. Recently, the scene text detection community has focused on arbitrary shape text, and many effective approaches have been proposed.
Arbitrary shape text detection methods can be classified into two types: segmentation-based and regression-based. Segmentation-based methods [wang2019PSENet, liao2020db] represent text regions with pixel-level classification masks that are not constrained by the shape of the text, but they suffer from computationally intensive post-processing and a lack of noise resistance. Regression-based methods regress the text boundaries directly, making the prediction process much simpler. For horizontal and multi-oriented straight text, regressing a quadrilateral is sufficient to represent the text shape [zhou2017east, liao2018textboxes++]. However, more complex representations must be designed for arbitrary shape text. Apart from directly increasing the number of points that represent text contours [wang2019ATRR, dai2021progressive], some methods apply parametric curves to fit the text boundaries [Liu2020ABCNet, Wang2020textray, zhu2021fourier], resulting in tighter and smoother contour curves.
In our opinion, a good representation for arbitrary shape text should: (1) be compact and complete, so that it includes as few background pixels and misses as few text pixels as possible; (2) be integral and not easily confused with other instances; (3) be effective in both detection and recognition. Although there are various representation methods for text, none of them satisfies all three requirements. As shown in Fig. 1, TextRay [Wang2020textray] is not compact for highly curved text, FCE [zhu2021fourier] misses some corner pixels of long text, and neither of them directly benefits subsequent recognition. The Bezier curve representation [Liu2020ABCNet] meets the first and third requirements, but it uses two separate curves for the upper and lower boundaries of the text, so it is not integral and may be confused with nearby instances.
To solve the problems of previous representations, we propose a better arbitrary shape text representation: the Thin-Plate-Spline (TPS) representation. The TPS transformation [bookstein1989principal] is typically applied in scene text recognition for rectification [shi2016robust, shi2018aster, zhan2019esir, yang2019symmetry, shang2020character], where the irregular text region is rectified to a regular horizontal region so that simple classical methods such as CRNN [crnn] can recognize it well. Although TPS is effective in scene text recognition, to the best of our knowledge it has not been applied in scene text detection. If we consider the use of TPS in reverse, as shown in Fig. 1 (a), it becomes a novel and simple representation for detection. When rectifying text for recognition, the TPS parameters are solved from corresponding control points, and with these parameters the source shape can be transformed into the target rectangular shape. In the reverse process, once we have the TPS parameters, every point on the target rectangle can be transformed to a point on the source arbitrary shape. Since the target rectangle can be fixed, the TPS parameters alone can serve as the text representation, and they meet all three criteria above. Because the TPS representation takes the rectangle as the basic shape of text, it adapts to the large aspect ratios and right-angle corners that characterize text, treats the text as an integral shape, and can be applied to recognition straightforwardly, as shown in Fig. 1.
The standard approach for calculating TPS parameters is to use corresponding control points on the source and target shapes. However, even if we have text boundary annotations, it is not easy to obtain control points from them, because no rule that selects a small number of control points is appropriate for all shapes. Using control points as a representation of shapes is therefore not reliable, so we choose to predict the TPS parameters directly. Without control points, the ground truth of the TPS parameters is not available, so we propose the boundary set loss and the shape alignment loss as supervision. The boundary points of the text are inferred from the TPS parameters, and the boundary set loss is calculated between the predicted boundary set and the ground truth boundary set. Furthermore, we apply the TPS parameters to rectify the text ground truth mask to a horizontal rectangle; the shape alignment loss is the difference between the rectified mask and the fiducial rectangle. With these two supervisions, the prediction of TPS parameters can be optimized robustly.
The contributions of this work are summarized as follows:
A TPS representation is proposed for arbitrary shape text detection, for the first time to the best of our knowledge. It is inspired by the TPS transformation widely used in scene text recognition, applied in reverse. It is compact, complete, integral, and can be reused directly in recognition.
To address the ambiguity of the boundary annotation and improve the supervision of the text shape, we design the boundary set loss and the shape alignment loss to ensure that the network can converge effectively and correctly.
TPSNet equipped with TPS representation, boundary set loss and shape alignment loss is presented, and is evaluated on two arbitrary shape text detection benchmarks. The performance is superior to previous counterparts.
2 Related Works
2.1 Scene Text Representation
As a special kind of object, scene text needs to be represented appropriately for accurate detection. The segmentation mask is the most common representation and has been widely used [deng2018pixellink, wang2019PSENet, lyu2018masktextspotter, liu2019arbitrarily, liao2020db, ye2020textfusenet, xiao2020SD]. A mask can naturally represent arbitrary shape text, but it tends to confuse different text instances. Most segmentation-based methods try to address this problem by linking pixels that belong to the same text [deng2018pixellink, liu2019CSE, liu2018MCN] or by distinguishing pixels of different texts [tian2019SAE, wang2019PAN]. Some methods convert the binary mask to continuous masks to represent more information [xue2019msr, xu2019textfield, zhu2021textmountain]. Although it is easy to represent arbitrary shape text with a segmentation mask, this representation suffers from a lack of noise resistance and computationally intensive post-processing. Some methods represent the text with a set of text components [tian2016CTPN, shi2017SegLink, tang2019seglink++, zhang2020DRRG]; these also belong to the segmentation-based methods, but the units are text blocks rather than pixels.
Regressing the geometry of the text shape and position is another kind of representation. For horizontal and multi-oriented straight text, a rectangle [zhang2016multi, zhou2017east, liao2017textboxes, he2017deep, liu2017deep, he2021most, wang2018ITN] or quadrilateral [liao2018rotation, liao2018textboxes++, xue2018border, lyu2018corner] is sufficient. For curved text, the representation becomes more complicated. TextSnake [long2018textsnake] regresses the radius along the text center line, which is similar to segmentation-based methods. [wang2019ATRR] employs an LSTM to predict a varying number of boundary points for different texts, while PCR [dai2021progressive] refines the contour points iteratively. Directly representing contours with points is inefficient: too many points are redundant, while too few cannot describe complicated shapes and make the text contour unsmooth. To address this issue, parametric curves have been applied for better representation. ABCNet [Liu2020ABCNet] formulates the long sides of the text with two Bezier curves to get a compact border fit, but the text shape is not regressed as a whole, which causes confusion with adjacent texts. TextRay [Wang2020textray] converts the coordinates of boundary points from the Cartesian coordinate system to the polar one and then employs Chebyshev polynomials to approximate the boundary. Since text instances usually have large aspect ratios, the sampling points are not distributed homogeneously on the boundary, making it hard for TextRay to fit long and highly curved texts. FCENet [zhu2021fourier] adopts trigonometric series, i.e. the Fourier transform, to fit text boundaries more compactly and simply. However, although the Fourier curve is an excellent fitter, it struggles to fit the right-angle corners of text with relatively few parameters, resulting in incomplete characters.
Our proposed TPS representation is fundamentally different from previous representations. The TPS parameters represent the text by transforming a rectangular shape into the text shape, treating the text as an integral whole. This makes it possible to represent arbitrary shapes while naturally adapting to the corners and large aspect ratios of text.
2.2 Text Rectification
Rectification is a pre-processing module for irregular text recognition [shi2016robust, liu2018charnet, zhan2019esir, yang2019symmetry, shi2019aster, luo2019moran, yunze2020progressive, gong2021unattached, lin2021stan, zhang2021spin]. Rectification methods are mainly based on the affine transformation [liu2016star, liu2018charnet, lin2021stan] or the TPS transformation [shi2016robust, shi2018aster, zhan2019esir, shang2020character]. [luo2019moran, zhang2021spin] predict offset maps to shift characters to the horizontal line. Most of these methods learn the transformation parameters in a weakly supervised manner, while character position labels help the network predict a better transformation [yang2019symmetry]. Iterative strategies are also applied for complex shape rectification [zhan2019esir, dai2021progressive].
Since rectification requires geometry and localization information, it is more effective if this process is conducted by the text detection model. TextSnake [long2018textsnake] exploits local geometries to sketch the structure of the text instance and transforms the predicted curved text instances into canonical form, but the transformation based on local geometries is discontinuous and causes distortion. ABCNet [Liu2020ABCNet] utilizes Bezier curves to formulate the text shape and obtains the text region with BezierAlign, which at the same time rectifies the curved text.
In this section, we first introduce the TPS representation for arbitrary shape text. Then the boundary set loss and shape alignment loss are presented in detail. Finally, the TPSNet based on the representation and loss functions is described.
3.1 Thin-Plate-Spline Representation
Unlike generic objects, text is a special kind of object. If a word or a text line is taken as an instance, its shape is a rectangle in the standard case, such as in plain documents. In scene images, texts appear in various shapes with bends and twists, but these shapes are still deformations of the basic rectangle and mostly retain its right-angle corners and large aspect ratio. From the perspective of deformation, we establish a mapping between arbitrary text shapes and regular rectangles, which gives us a representation for arbitrary shape text.
TPS has been widely used as a non-rigid transformation model in image alignment and shape matching. We apply TPS as the basic model to implement the deformation between an arbitrary shape and a regular rectangle, which requires the correspondence of each point on the rectangle to a point on the arbitrary shape. We denote the target shape, the rectangle, as $A$ and the source text shape as $B$; the rectangle is also called the fiducial shape. According to TPS [bookstein1989principal], the point $p'$ on $B$ that corresponds to a point $p$ on $A$ can be calculated by

$$p' = T \, [1,\; p^\top,\; \phi(r_1),\; \dots,\; \phi(r_K)]^\top, \qquad (1)$$

where $\phi$ is the radial basis function

$$\phi(r) = r^2 \log r,$$

and $r_i = \|p - c_i\|$ is the distance from $p$ to $c_i$. The points $c_1, \dots, c_K$ are fixed points on the fiducial shape $A$, called fiducial points, and $K$ is the number of fiducial points. Given the fiducial points, the basis function is defined as

$$\Phi(p) = [1,\; p^\top,\; \phi(\|p - c_1\|),\; \dots,\; \phi(\|p - c_K\|)]^\top,$$

then the TPS transform function is determined by the parameters $T$ with the shape of $2 \times (K + 3)$. Any fiducial point set can define a basis function, so it is specified as equidistant points on the boundary of the rectangle for simplicity; the number of fiducial points $K$ is 8, and the dimension of the TPS parameters is $2 \times (8 + 3) = 22$, which is sufficient for the deformation of arbitrary texts. The scale and aspect ratio of the fiducial shape have a great impact on the results, and we will conduct experiments to find the best setting.

With the TPS transform function,

$$P_B = T \, \Phi(P_A),$$

the grids $P_A$ on $A$ can be transformed into the corresponding points $P_B$ on $B$, where the text boundaries are naturally obtained, as shown in Fig. 2. Note that the grids on $A$ are predefined, so $\Phi(P_A)$ is also calculated in advance. The TPS parameters can be decoded quickly to the text shape with a single matrix multiplication $T \, \Phi(P_A)$.
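The decoding step described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the placement of the 8 fiducial points (four on the top edge and four on the bottom edge), the grid resolution, and the basis ordering [1, x, y, radial terms] are all assumptions made for the example.

```python
import numpy as np

def tps_basis(points, fiducial):
    """Return the (K+3, N) basis matrix for N query points and K fiducial points."""
    # Pairwise distances between query points and fiducial points: (N, K).
    r = np.linalg.norm(points[:, None, :] - fiducial[None, :, :], axis=-1)
    # Radial basis r^2 log r, using the limit value 0 at r = 0.
    phi = np.where(r > 0, r ** 2 * np.log(np.maximum(r, 1e-12)), 0.0)
    ones = np.ones((points.shape[0], 1))
    return np.concatenate([ones, points, phi], axis=1).T

def fiducial_points(width=1.0, height=0.25, k=8):
    """Assumed layout: k/2 equally spaced points on the top edge, k/2 on the bottom."""
    xs = np.linspace(-width / 2, width / 2, k // 2)
    top = np.stack([xs, np.full_like(xs, -height / 2)], axis=1)
    bottom = np.stack([xs, np.full_like(xs, height / 2)], axis=1)
    return np.concatenate([top, bottom], axis=0)

def decode(tps_params, basis):
    """Map the fixed grid to the text shape: (2, K+3) @ (K+3, N) -> (N, 2)."""
    return (tps_params @ basis).T

fid = fiducial_points()
gx, gy = np.meshgrid(np.linspace(-0.5, 0.5, 20), np.linspace(-0.125, 0.125, 5))
grid = np.stack([gx.ravel(), gy.ravel()], axis=1)
basis = tps_basis(grid, fid)  # computed once, reused for every prediction

# Parameters of the identity map: the affine part copies p, radial weights are zero.
identity = np.zeros((2, fid.shape[0] + 3))
identity[0, 1] = 1.0
identity[1, 2] = 1.0
shape = decode(identity, basis)
```

Because the grid and the fiducial points never change, only `decode` runs per prediction, which is why the decoding cost reduces to one matrix multiplication.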
We emphasize that, unlike previous rectification methods, we do not attempt to predict control points on the text shape, because the locations of control points on arbitrary shapes lack a strict definition; instead, we directly predict the TPS parameters with the neural network.
To verify the fitting ability of TPS, we directly solve for the TPS parameters by taking equidistant sampling points on the annotated text boundaries of the text shape as control points (this strategy is not used for network training in the experimental evaluation). The text boundary is then derived with Equation (1) and evaluated with the Tightness-IOU [liu2019tightness] against the ground truth annotations, where the value reveals the compactness and completeness of the fit. The fitting results of four typical representations are reported in Table 1, and the shape fitting is visualized in Fig. 1. The Chebyshev representation fails on highly curved shapes, and both the Chebyshev and Fourier representations fail on extreme aspect ratios and miss the text corners. The Bezier representation [Liu2020ABCNet] reaches the highest fitting performance, but it uses two separate curves to represent the text shape rather than taking it as a whole. Our proposed TPS representation is therefore the most appropriate for arbitrary shape text, which is characterized by large aspect ratios and right-angle corners.
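For reference, solving TPS parameters from control point pairs is a small linear system. The sketch below uses the textbook TPS formulation with the usual side conditions on the radial weights; the function names, the fiducial layout, and the synthetic "bent" control points are assumptions for illustration and are not taken from the paper's code.

```python
import numpy as np

def radial(r):
    """TPS radial basis r^2 log r, taking the limit value 0 at r = 0."""
    return np.where(r > 0, r ** 2 * np.log(np.maximum(r, 1e-12)), 0.0)

def solve_tps(fiducial, control):
    """Solve for T (2, K+3) such that fiducial point i maps onto control point i."""
    k = fiducial.shape[0]
    dist = np.linalg.norm(fiducial[:, None, :] - fiducial[None, :, :], axis=-1)
    affine = np.concatenate([np.ones((k, 1)), fiducial], axis=1)  # (K, 3)
    system = np.zeros((k + 3, k + 3))
    system[:k, :k] = radial(dist)
    system[:k, k:] = affine
    system[k:, :k] = affine.T  # side conditions on the radial weights
    target = np.concatenate([control, np.zeros((3, 2))], axis=0)
    solution = np.linalg.solve(system, target)  # K radial rows, then 3 affine rows
    # Reorder to [1, x, y, radial terms] so that T multiplies the basis vector.
    return np.concatenate([solution[k:], solution[:k]], axis=0).T

def transform(tps_params, fiducial, points):
    """Apply the TPS map to an (N, 2) array of points on the fiducial shape."""
    r = np.linalg.norm(points[:, None, :] - fiducial[None, :, :], axis=-1)
    basis = np.concatenate([np.ones((len(points), 1)), points, radial(r)], axis=1)
    return basis @ tps_params.T

xs = np.linspace(-0.5, 0.5, 4)
fiducial = np.array([[x, -0.125] for x in xs] + [[x, 0.125] for x in xs])
# Bend the rectangle into an arc to imitate a curved text boundary.
control = np.stack([fiducial[:, 0], fiducial[:, 1] + 0.5 * fiducial[:, 0] ** 2], axis=1)
params = solve_tps(fiducial, control)
recovered = transform(params, fiducial, fiducial)
```

The solved parameters interpolate the control points exactly, so the quality of such a fit depends entirely on how the control points were chosen, which is the ambiguity the text points out.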
3.2 Shape Losses
3.2.1 Boundary Set Loss
As noted in [Wang2020textray], designing the loss function in geometry space is more efficient than directly minimizing the distance between the TPS parameters and their ground truth in parameter space. In other words, the TPS parameters should first be decoded into the shape, and the loss can then be calculated as the distance between boundary point pairs of the decoded shape and its ground truth.
This distance can be simply defined as the distance between matching point pairs, which aligns every predicted point to a ground truth point. Nevertheless, due to the variety of shapes and the ambiguity of the annotation, it is hard to define the target position. Additionally, some studies report that a fixed matching between regressed points and annotations is not optimal for network optimization [Yang2020DenseRR, wei2020point].
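Based on the considerations above, we propose the boundary set loss, as shown in Fig. 3. We first decode the TPS parameters into the text shape and naturally get the predicted boundary points. We then take each boundary as a point set, and the distance between the predicted point set and the ground truth point set is estimated by the Chamfer distance [fan2017point]. The boundary set loss is formulated as

$$L_{bs} = \sum_{k \in \{l, r, b, t\}} \left( \frac{1}{|\hat{S}_k|} \sum_{\hat{p} \in \hat{S}_k} \min_{p \in S_k} \|\hat{p} - p\|_2 + \frac{1}{|S_k|} \sum_{p \in S_k} \min_{\hat{p} \in \hat{S}_k} \|p - \hat{p}\|_2 \right),$$

where $\hat{p}$ is a point on the predicted boundary $\hat{S}_k$, and $p$ is a point on the ground-truth boundary $S_k$; $l$, $r$, $b$ and $t$ denote the left, right, bottom and top boundaries.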
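A set-to-set loss of this kind can be sketched as below. The symmetric, mean-normalized form of the Chamfer distance used here is one common variant and is an assumption, as are the function names and the dictionary keys for the four sides.

```python
import numpy as np

def chamfer(pred, gt):
    """Symmetric Chamfer distance between point sets of shape (N, 2) and (M, 2)."""
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)  # (N, M)
    # Nearest ground-truth point for each prediction, and vice versa.
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def boundary_set_loss(pred_sides, gt_sides):
    """Sum the Chamfer distance over the four boundaries of one text instance."""
    sides = ("left", "right", "bottom", "top")
    return sum(chamfer(pred_sides[s], gt_sides[s]) for s in sides)

# A straight boundary and the same boundary shifted by one unit.
line = np.stack([np.arange(5.0), np.zeros(5)], axis=1)
shifted = line + np.array([0.0, 1.0])
same_set_other_order = line[::-1]
```

Since each point is matched to its nearest neighbour in the other set, the loss does not depend on the order or the number of sampled points, which avoids fixing an arbitrary point-to-point correspondence.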
The ground truth boundary points are sampled equidistantly and densely from the sparse annotation points.
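Equidistant dense sampling from sparse polygon vertices can be done by interpolating along the cumulative arc length. The helper below is a hypothetical sketch of that idea for one boundary polyline; the function name and the sample count are assumptions.

```python
import numpy as np

def resample_polyline(points, n):
    """Return n points spaced equally in arc length along an (M, 2) polyline."""
    segment = np.linalg.norm(np.diff(points, axis=0), axis=1)
    cumulative = np.concatenate([[0.0], np.cumsum(segment)])
    targets = np.linspace(0.0, cumulative[-1], n)
    # Interpolate x and y separately as functions of arc length.
    x = np.interp(targets, cumulative, points[:, 0])
    y = np.interp(targets, cumulative, points[:, 1])
    return np.stack([x, y], axis=1)

sparse = np.array([[0.0, 0.0], [2.0, 0.0], [2.0, 2.0]])
dense = resample_polyline(sparse, 5)
```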
3.2.2 Shape Alignment Loss
To further supervise the shape regression, we propose the shape alignment loss, which utilizes the interpolation capability of the TPS representation. The IOU loss [yu2016unitbox] is popular in object detection, and its essence is to measure the global match of the shape. However, the IOU loss can only be applied easily to regular shape regression; for arbitrary shapes, there is no convenient or differentiable algorithm to calculate the areas. The TPS representation provides an alternative that solves this problem. As in the text rectification process, we apply the TPS parameters to rectify the text ground truth mask. For every point on the text shape grid, we use bilinear interpolation on the text ground truth mask to obtain the value of the aligned mask. If the TPS parameters are completely correct, the rectification is perfect and the aligned text mask is rectangular, so we set this rectangular mask as the target mask. The difference between the aligned mask and the target mask is calculated by the MSE loss, as shown in Fig. 3. The shape alignment loss is formulated as

$$L_{sa} = \frac{\|M_a - M_t\|_2^2}{\mathrm{Area}(M)},$$

where $M_a$ is the shape alignment mask, $M_t$ is the default target mask and $M$ is the original text ground truth mask. The L2 distance between the two masks is divided by the area of the original text mask for balance: the text mask has been cropped and resized to the same scale, so an irregular curved shape covers fewer points and has a smaller area than a regular shape. Because $M_a$ is generated by bilinear interpolation, which is differentiable, the loss can be propagated to the TPS parameters and contribute to the optimization. To get more gradients at the mask border, we smooth the text ground truth mask with average pooling.
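The forward computation of such a loss can be sketched in NumPy as below. This only illustrates the sampling and normalization; an actual training setup would need an autodiff framework so that gradients reach the TPS parameters. The function names, the all-ones target mask, the squared-error form, and the clamping of out-of-range coordinates to the mask border are assumptions.

```python
import numpy as np

def bilinear_sample(mask, xy):
    """Sample an (H, W) mask at N float positions given as (x, y) pixel coordinates."""
    h, w = mask.shape
    x = np.clip(xy[:, 0], 0, w - 1)
    y = np.clip(xy[:, 1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = mask[y0, x0] * (1 - fx) + mask[y0, x1] * fx
    bottom = mask[y1, x0] * (1 - fx) + mask[y1, x1] * fx
    return top * (1 - fy) + bottom * fy

def shape_alignment_loss(gt_mask, grid_xy):
    """Squared error between the rectified mask and an all-ones target, over mask area."""
    aligned = bilinear_sample(gt_mask, grid_xy)
    target = np.ones_like(aligned)
    return np.sum((aligned - target) ** 2) / max(gt_mask.sum(), 1.0)

mask = np.zeros((8, 12))
mask[2:6, 1:11] = 1.0  # a horizontal text region
gx, gy = np.meshgrid(np.linspace(1, 10, 10), np.linspace(2, 5, 4))
inside = np.stack([gx.ravel(), gy.ravel()], axis=1)
shifted = inside + np.array([0.0, 3.0])  # decoded shape slides off the text
```

When the decoded grid lies entirely on the text region the loss is zero, and it grows as more grid points fall on background, which is what makes it behave like a global shape-matching term.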
The prediction of the TPS parameters is optimized by the boundary set loss $L_{bs}$ and the shape alignment loss $L_{sa}$ together, and the regression loss is given by:

$$L_{reg} = L_{bs} + \lambda L_{sa},$$

where $\lambda$ is the weight parameter.
Equipped with TPS representation, we propose our one-stage network TPSNet for arbitrary shape text detection.
Following previous regression-based text detection networks [zhou2017east, Wang2020textray, zhu2021fourier], we adopt a compact one-stage architecture, which consists of a backbone, a detection head, and a TPS decoder, as shown in Fig. 2. The backbone, followed by a Feature Pyramid Network (FPN), extracts multi-scale features. The multi-level feature maps are passed into the detection head, which consists of a classification branch and a regression branch. The classification branch predicts per-pixel masks of the text region and the text center, and these two masks are multiplied to give the detection confidence at every position. In the regression branch, we directly predict the TPS parameters at each feature bin, and the confidence from the classification branch is used for Non-Maximum Suppression (NMS) to remove duplicate predictions.
In the TPS decoder, the predicted TPS parameters are decoded to the text shape with the pre-defined fiducial shape and its precomputed basis matrix. The decoding process is defined by Equation (1). From the text shape, the text boundary can be obtained directly, and it is also convenient to rectify the text by sampling on the input image.
The optimization objectives of the classification branch and the regression branch are $L_{cls}$ and $L_{reg}$ respectively, and the whole TPSNet is optimized by:

$$L = L_{cls} + L_{reg}.$$

The classification loss $L_{cls}$ consists of the text region loss $L_{tr}$ and the text center region loss $L_{tcr}$:

$$L_{cls} = L_{tr} + L_{tcr}.$$

Both $L_{tr}$ and $L_{tcr}$ are cross-entropy losses. To address the sample imbalance problem, OHEM is adopted for $L_{tr}$ with a ratio between negative and positive samples of 3:1.
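Online hard example mining with a 3:1 negative-to-positive ratio can be sketched as follows: all positive pixels are kept, and only the negatives with the highest loss are kept, up to three times the number of positives. The function name, the binary cross-entropy form, and the fallback when an image has no positives are assumptions.

```python
import numpy as np

def ohem_cross_entropy(prob, label, neg_ratio=3):
    """Binary cross entropy over all positives and the hardest negatives only."""
    prob = np.clip(prob, 1e-7, 1 - 1e-7)
    loss = -(label * np.log(prob) + (1 - label) * np.log(1 - prob))
    positive = label > 0.5
    n_pos = int(positive.sum())
    # Sort negative losses in descending order and keep the hardest ones.
    neg_losses = np.sort(loss[~positive])[::-1]
    n_neg = min(len(neg_losses), neg_ratio * max(n_pos, 1))
    kept = np.concatenate([loss[positive], neg_losses[:n_neg]])
    return kept.mean()

# 2 positives and 10 negatives: 6 hard negatives (p = 0.5) and 4 easy ones (p = 0).
label = np.array([1, 1] + [0] * 10, dtype=float)
prob = np.array([0.5, 0.5] + [0.5] * 6 + [0.0] * 4)
loss_value = ohem_cross_entropy(prob, label)
```

In this toy case the four easy negatives are dropped, so the averaged loss equals log 2 instead of being diluted by background pixels the model already classifies correctly.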
In this section, we evaluate our proposed TPSNet on the CTW1500 and TotalText datasets to validate its effectiveness. We first conduct ablation studies to demonstrate the advantages of the proposed designs and to set the hyper-parameters. Then we compare the detection performance of our model with previous state-of-the-art methods. Finally, the rectified text instances are recognized to show the contribution of the representation to subsequent recognition.
CTW1500. CTW1500 [yuan2019ctw] is a dataset for curved text. It contains 1,000 training images and 500 test images. Text is represented by polygons with 14 points at text-line level.
TotalText. TotalText [ch2017total] includes curved, horizontal, and multi-oriented text. It consists of 1,255 training images and 300 test images. All text annotations are word-level. Similar to CTW1500, text areas are annotated with polygons.
ICDAR2015. ICDAR2015 [karatzas2015icdar2015] is a multi-oriented text detection dataset for English only, which includes 1,000 training images and 500 test images. The text regions are annotated with quadrilaterals.
SynthText. SynthText [gupta2016synthetic] is a synthetically generated dataset composed of 800,000 images. We use the dataset to pre-train our model.
4.2 Implementation Details
We implement our TPSNet based on the MMOCR [mmocr2021] library. The backbone is ResNet50 pretrained on ImageNet with DCN in stages 2, 3 and 4, followed by FPN. The feature maps P3, P4 and P5 are used in the classification and regression branches, where 4 convolutional layers are applied for the text region and text center region classification and for the TPS regression. Text instances are assigned to different feature maps according to their scale ratio (instance scale / image scale), and the ranges are [0, 0.25], [0.2, 0.65] and [0.55, 1.0] for P3, P4 and P5 respectively.
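The scale-based assignment can be illustrated with a small lookup. Because the ranges overlap, an instance whose ratio falls in an overlap is assigned to both neighbouring levels. The names below are hypothetical, and how the instance scale itself is measured is left to the caller.

```python
# Scale-ratio ranges per pyramid level, as listed in the text.
LEVEL_RANGES = {"P3": (0.0, 0.25), "P4": (0.2, 0.65), "P5": (0.55, 1.0)}

def assign_levels(instance_scale, image_scale):
    """Return every pyramid level whose range contains the instance's scale ratio."""
    ratio = instance_scale / image_scale
    return [name for name, (low, high) in LEVEL_RANGES.items() if low <= ratio <= high]

small = assign_levels(100, 1000)    # ratio 0.10
overlap = assign_levels(220, 1000)  # ratio 0.22, inside both P3 and P4 ranges
large = assign_levels(800, 1000)    # ratio 0.80
```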
The training images are resized, and data augmentation strategies are applied, including ColorJitter, RandomCrop, RandomRotate and RandomFlip. The training batch size is set to 8. Stochastic gradient descent (SGD) is adopted as the optimizer with a weight decay of 0.001 and a momentum of 0.9. The initial learning rate is 0.001, which is reduced every 150 epochs, and the total number of training epochs is 1000. The weight of the shape alignment loss is 0 before the 500th epoch and is raised to 1.0 after that. Our TPSNet is trained directly on the corresponding training set and is not pretrained on any other dataset.
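The two-phase weighting of the shape alignment loss amounts to a step schedule. A minimal sketch follows, assuming the switch is an abrupt step at epoch 500 and that epochs are counted from 0; the function names are hypothetical.

```python
def shape_alignment_weight(epoch, switch_epoch=500, final_weight=1.0):
    """Weight of the shape alignment loss: 0 before the switch epoch, 1.0 afterwards."""
    return 0.0 if epoch < switch_epoch else final_weight

def regression_loss(boundary_set, shape_alignment, epoch):
    """Boundary set loss plus the scheduled shape alignment term."""
    return boundary_set + shape_alignment_weight(epoch) * shape_alignment

early = regression_loss(2.0, 3.0, epoch=100)  # only the boundary set loss counts
late = regression_loss(2.0, 3.0, epoch=500)   # both terms count
```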
In the test stage, the long side of the test image is set to 1080, 1280 for CTW1500 and TotalText, while the short side is resized to keep the original aspect ratio. All experiments are conducted on a single NVIDIA RTX3090 GPU.
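The test-time resizing rule fixes the long side and scales the short side by the same factor. A small sketch, with a hypothetical helper name and rounding to the nearest pixel as an assumption:

```python
def resize_keep_ratio(width, height, long_side):
    """Scale (width, height) so the longer side equals long_side, keeping the ratio."""
    scale = long_side / max(width, height)
    return round(width * scale), round(height * scale)

landscape = resize_keep_ratio(1920, 1080, 1280)  # e.g. the TotalText setting
portrait = resize_keep_ratio(600, 800, 1080)     # e.g. the CTW1500 setting
```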
4.3 Ablation Study
The ablation study is conducted on CTW1500, and all experiments are conducted under the same training and test settings without pretraining.
4.3.1 Regression Target and Supervision
To verify the advantage of our design, we compare different regression targets and supervisions. The alternative regression target is the control points, although they are not reliable, as discussed above; the corresponding regression loss is the smooth L1 loss on the coordinates of the control points. For TPS parameter prediction, the available regression supervisions include: 1) the L1 distance between the predicted TPS parameters and their ground truth solved from equidistant control points, 2) the distance between boundary points under a simple point-to-point matching strategy, and 3) the proposed boundary set loss and shape alignment loss. The evaluation results are shown in Table 2.
The boundary set loss improves the F-measure by 2% compared with calculating the loss directly on the TPS parameters, because it removes the reliance on the unreliable ground truth of the TPS parameters and so reduces the noise caused by the ambiguity of the annotations and of the boundary point sampling strategy. The shape alignment loss further optimizes the TPS parameters through a global shape constraint, bringing a further 1.3% improvement in F-measure. Indeed, the control points are equivalent to the TPS parameters, since the latter can be solved directly from the former. If the proposed loss functions were applied to control point prediction, it might reach performance similar to that of TPS parameter prediction, but the process of obtaining the text shape would be more complicated: from control points to TPS parameters, and then from TPS parameters to text shape. Predicting the TPS parameters is a simple and efficient choice.
4.3.2 The Fiducial Shape Setting
The fiducial shape determines the basis function and has a great impact on the performance. To find the best setting of the fiducial shape, we test different aspect ratios and scales in the ablation shown in Table 4. As expected, a larger aspect ratio is more suitable for text shape representation. Although TPS parameters based on a fiducial shape with a small aspect ratio can also fit the text boundary, the inner points are uneven, so the shape alignment loss no longer works. The scale of the fiducial shape mainly affects the optimization of the network. Since the basis function contains the term $r^2 \log r$, where $r$ is the distance to a fiducial point, a larger scale produces larger basis values and therefore larger gradients, making the loss unstable; it even explodes when the scale is 4. According to the ablation, the aspect ratio is set to 1:4 and the scale is set to 1, which means that the width of the fiducial rectangle is 1 and the height is 0.25.
4.4 Comparisons with Previous Methods
We evaluate our TPSNet on benchmark datasets and compare it with previous methods, as shown in Table 3. Previous methods are divided into three categories: segmentation-based methods, regression-based methods, and hybrid methods that use both segmentation and regression for detection. The segmentation-based methods rely on external datasets to pretrain the model. As a regression-based method, our TPSNet significantly outperforms previous methods on the curved scene text detection datasets CTW1500 and TotalText, and for straight scene text in ICDAR2015 it also achieves the best performance. In addition, TPSNet is easy to train, as it reaches comparable performance without extra data for pretraining. FCENet is the closest competitor to TPSNet, but it cannot rectify the detected text region directly, and as shown in Fig. 1 and Fig. 4, it misses some corner pixels, which leads to poor recognition accuracy.
A qualitative comparison is shown in Fig. 4. ABCNet [Liu2020ABCNet] is easily confused by adjacent instances, since it predicts two separate curves rather than the whole shape together. TextRay [Wang2020textray] fails on highly curved text or text with a large aspect ratio, and FCENet [zhu2021fourier] tends to miss the corners of long text, which hinders subsequent recognition. By comparison, our proposed TPSNet obtains the most compact and complete detections.
4.5 Rectification for Recognition
To verify that the TPS representation can provide robust rectification for recognition, we conduct a recognition experiment with the following settings. We apply TPSNet to detect texts on the TotalText test images and then match the detection results to the ground truth with an IOU threshold of 0.5. All matched detection results take the corresponding text labels as the ground truth, where only Latin letters and digits are preserved, and the text regions are either rectified by our TPS parameters or directly cropped by bounding boxes (no rectification). We use ASTER [shi2018aster] as the recognition model, with the open-source trained weights. We switch the rectification module of ASTER off and on to recognize the bounding-box text and the text rectified by TPSNet. The recognition accuracy is shown in Table 5, and the rectified texts are shown in Fig. 5.
The rectification from TPSNet improves the recognition accuracy by 13.8%, superior to the rectification in ASTER.
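Word-level recognition accuracy under this label filtering can be sketched as below. Keeping only Latin letters and digits follows the text; lower-casing both strings before comparison and the function names are assumptions, and the sample strings are made up for illustration.

```python
import re

def normalize(text):
    """Keep only Latin letters and digits, compared case-insensitively (assumed)."""
    return re.sub(r"[^0-9a-z]", "", text.lower())

def word_accuracy(predictions, labels):
    """Fraction of matched detections whose normalized text equals the label."""
    correct = sum(normalize(p) == normalize(g) for p, g in zip(predictions, labels))
    return correct / len(labels)

accuracy = word_accuracy(["Hello!", "w0rld"], ["hello", "world"])
```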
In this paper, we have proposed a novel TPS representation for arbitrary shape text, which is the first application of the TPS transformation in scene text detection. The TPS representation is compact, complete, integral, and reusable for subsequent recognition. Equipped with the TPS representation, we implement TPSNet and design two shape losses for network training. Experimental results on CTW1500 and TotalText show its effectiveness. Since direct rectification is the main advantage of the TPS representation, we will extend TPSNet to end-to-end text spotting in future work.