Corner-based Region Proposal Network
Previous approaches for scene text detection usually rely on manually defined sliding windows. In this paper, an intuitive region-based method is presented to detect multi-oriented text without any prior knowledge regarding the textual shape. We first introduce a Corner-based Region Proposal Network (CRPN) that employs corners to estimate the possible locations of text instances instead of shifting a set of default anchors. The proposals generated by CRPN are geometry adaptive, which makes our method robust to various text aspect ratios and orientations. Moreover, we design a simple embedded data augmentation module inside the region-wise subnetwork, which not only ensures the model utilizes training data more efficiently, but also learns to find the most representative instance of the input images for training. Experimental results on public benchmarks confirm that the proposed method is capable of achieving comparable performance with the state-of-the-art methods. On the ICDAR 2013 and 2015 datasets, it obtains F-measure of 0.876 and 0.845 respectively. The code is publicly available at https://github.com/xhzdeng/crpn
Automatically reading text in the wild is a fundamental problem of computer vision since text in scene images commonly conveys valuable information. It has been widely used in various applications such as multilingual translation, automotive assistance and image retrieval. Normally, a reading system consists of two sub-tasks: detection and recognition. This work focuses on text detection, which is the essential prerequisite of the subsequent processes in the whole workflow.
Though extensively studied in recent years, scene text detection is still enormously challenging due to the diversity of text instances and undesirable image quality. Previous works on text detection utilized sliding windows or connected components with hand-crafted features. Although these methods have shown promising performance, they may be limited in complex situations such as non-uniform illumination and partial occlusion. Recently, benefiting from the significant achievements of generic object detection based on deep learning, high-performance detectors have been modified to detect horizontal scene text, and the results have amply demonstrated their effectiveness. In addition, to achieve multi-oriented text detection, some methods designed several rotated anchors to find the best-matched proposals for inclined text. However, as one of these works itself notes, methods based on manually designed sliding-window shapes may not be the optimal design.
This paper presents a new strategy for scene text detection that is mostly learned from the process of object annotation. The full process of the proposed method is depicted in Fig. 1. Generally, the most common way to make an annotation is to click on a corner of an imaginary rectangle enclosing the object, and then drag the mouse to the diagonally opposite corner to obtain a bounding box. If the box is not particularly accurate, users can make further amendments by adjusting the coordinates of the corners. Following this procedure, our motivation is to harness corner points to infer the locations of text bounding boxes. Essentially, our method is a two-stage detection method based on the R-CNN framework. In the first stage, we abandon the anchor scheme and employ corners to estimate the possible positions of text regions. Based on corner detection, our new proposal method is flexible enough to capture text in any aspect ratio and any orientation. Furthermore, it is accurate enough to achieve the desired recall with only 200 proposals. In the second stage, we design a built-in data augmentation inside the region-wise subnetwork, which combines feature extraction and multi-instance learning to find the most representative instance of the input images for training. The resulting model is trained end-to-end and evaluated on three public benchmarks, ICDAR 2013, ICDAR 2015 and COCO-Text, where it obtains F-measures of 0.876, 0.845 and 0.591 respectively. Besides, compared to recently published works, it is competitive in running speed.
We summarize our primary contributions as follows:
1. Unlike previous works that rely heavily on well-designed sliding windows, we present an anchor-free method for multi-oriented text detection.
2. Based on corner detection, our new region proposal network is capable of generating quadrilateral region proposals to capture various kinds of text instances.
3. To improve the utilization of training data, we provide a simple data augmentation module inside the region-wise subnetwork. This nearly cost-free module will result in better performance on benchmarks.
Detecting scene text has been widely studied in the past decade. Before the deep learning era, most related works were based on sliding windows or connected components with hand-crafted features. The sliding-window-based methods detect text by shifting a window over all positions at multiple scales. Although these methods are able to achieve high recall, the large number of candidates may result in low precision and heavy computation. Unlike sliding windows, approaches based on connected components focus on the detection of individual characters and the relationships between them. Those methods usually comprise several steps, including character candidate detection, negative candidate removal, text extraction and verification. However, errors that occur and accumulate throughout these sequential steps may degrade detection performance.
Recently, deep learning based algorithms have gradually become the mainstream in computer vision. State-of-the-art object detectors based on deep neural networks have been extended to detect scene text. Gupta et al. introduced a Fully Convolutional Regression Network which performs text detection and bounding box regression at all locations and scales in given images. Liu et al. designed several multi-oriented anchors inside the SSD framework for multi-oriented text detection. Tian et al. proposed a Connectionist Text Proposal Network which detects text by predicting a sequence of fine-scale text components. Later, Shi et al. modified SSD to predict the segments of text and the linkage of two adjacent segments. Different from these methods, Zhou et al. presented a deep regression network that directly predicts text regions with arbitrary orientations in full images. All these methods discard the text composition and aim to regress the text bounding boxes directly. However, they still struggle with the enormous variations in text aspect ratio and orientation.
This section describes the proposed method for scene text detection. Technically, it consists of two stages where region proposals are firstly generated to estimate the possible locations of text instances, and then a region-wise subnetwork is trained for further classification and regression over these proposals. Details of each stage will be delineated in the following.
Here we introduce our new region proposal algorithm named Corner-based Region Proposal Network (CRPN). It draws primarily on DeNet and PLN, which are novel generic object detection frameworks based on corner detection. Both of them discard the anchor strategy and use corner points to represent object bounding boxes. The CRPN combines these two methods and employs several innovations to generate quadrilateral region proposals for multi-oriented text. Briefly, two matched corners determine a diagonal, and two matched diagonals determine a quadrilateral bounding box. Following this idea, we divide the CRPN into two steps: Corner Detection and Proposal Sampling.
As described in DeNet, the corner detection task is performed by predicting the probability that each position belongs to one of the predefined corner types. Because texts are usually very close to each other in natural scenes, it is hard to separate corners that come from adjacent text regions. Thus, we use the one-versus-rest strategy and apply a four-branch ConvNet to predict an independent probability for each corner type. However, unlike horizontal rectangles, whose corners can easily be split into four types (top-left, top-right, bottom-right, bottom-left), the corners of an oriented rectangle or quadrangle may not have a clear category defined by their relative position. In our method, a sequential protocol of coordinates from prior work is adopted. Based on this protocol, we obtain one probability map per corner type, indexed by the sequence number of the corner type.
As mentioned before, the detected corners need to be associated into diagonals. SWT uses the gradient direction of each edge pixel to find another edge pixel whose gradient direction is roughly opposite. Similarly, we define a variable named link direction to indicate where to find the partner of each corner candidate. Consider two candidates of different types that can be linked into a diagonal. The orientation of the vector from one candidate to the other, measured against the horizontal, is discretized into a fixed number of values, as HOG did, and this discretized orientation is the link direction from the first candidate to the second (Eq. 1).
Then we convert the binary classification problem to a multi-class one, where each class corresponds to a link direction. To this end, each corner detector outputs a prediction map with one channel per link direction, plus one channel for background and other corner types. The independent probability that a position belongs to a given corner type is derived from this prediction map.
To suppress negative diagonals, we define the unmatched link by comparing the predicted link direction of a corner with the actual link direction between that corner and its candidate partner, calculated via Eq. 1. A diagonal with any unmatched link endpoint is discarded. As shown in Fig. 2, the link direction is not only indispensable for connecting two matched corners, but also helpful for separating two adjacent texts.
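As a rough illustration of the link direction idea, the sketch below discretizes the orientation of the vector between two corner candidates into a number of bins and checks whether a predicted direction agrees with the geometric one. The bin count, the binning formula, the tolerance `tol`, and the function names are all assumptions made for illustration; they are not the paper's exact definitions.

```python
import math

def link_direction(p, q, num_bins=8):
    """Discretize the orientation of the vector p -> q into 1..num_bins.

    Label 0 is left free for 'background / other corner type', mirroring the
    extra channel described in the text. The binning rule is an assumption.
    """
    dx, dy = q[0] - p[0], q[1] - p[1]
    theta = math.atan2(dy, dx) % (2.0 * math.pi)  # orientation in [0, 2*pi]
    bin_width = 2.0 * math.pi / num_bins
    index = min(int(theta / bin_width), num_bins - 1)  # guard the upper edge
    return index + 1

def is_unmatched(predicted_dir, p, q, num_bins=8, tol=1):
    """True if the direction predicted at p disagrees with the direction p -> q.

    `tol` (hypothetical) allows neighbouring bins to count as a match; the
    difference is taken circularly because bin 1 and bin num_bins are adjacent.
    """
    actual = link_direction(p, q, num_bins)
    diff = abs(predicted_dir - actual)
    diff = min(diff, num_bins - diff)
    return diff > tol
```

Under these assumptions, a diagonal would be kept only when `is_unmatched` is false at both of its endpoints.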
After corner detection, proposal sampling extracts corner candidates from the probability maps and assembles them into region proposals. We can easily estimate the probability that a proposal contains a text instance by applying a Naïve Bayesian classifier to the corners of the proposal, evaluating the probability map of each corner type at the coordinates of the corresponding corner. We develop a simple algorithm for searching and grouping corner candidates. Specifically, in order to improve recall, we only use three types of corners to generate a quadrilateral proposal. The working steps of the algorithm are as follows:
1. For each corner type, search the probability map and extract the candidates whose probability exceeds a threshold. A candidate close to a higher-scoring selected one is rejected. Only the top-ranked candidates are used in the next step.
2. Generate a set of unique diagonals by linking every selected corner of type 1 with every one of type 3. Diagonals with an unmatched link are filtered out.
3. For each diagonal, select any one corner from the other two types and rotate the diagonal until the three points (its two endpoints and the third corner) are collinear. The original diagonal and the rotated one then determine a quadrilateral proposal.
4. Calculate the probability of each proposal being non-null via Eq. 4.
5. Repeat steps 2 and 3 with corners of type 2 and 4.
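Steps 1 and 4 above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the released implementation: the function names, the Chebyshev-distance rejection radius `min_dist`, and the default values are assumptions, and the proposal score is simply the product of corner probabilities, which is one reading of the Naïve Bayesian description.

```python
def extract_candidates(prob_map, threshold=0.5, min_dist=2, top_n=20):
    """Pick corner candidates from one probability map (a list of rows).

    Keeps positions whose probability exceeds `threshold`, greedily rejects a
    candidate lying within `min_dist` of a higher-scoring kept one, and
    returns at most `top_n` candidates as (x, y, score), best first.
    """
    scored = [(p, x, y)
              for y, row in enumerate(prob_map)
              for x, p in enumerate(row) if p > threshold]
    scored.sort(reverse=True)  # highest probability first
    kept = []
    for p, x, y in scored:
        if all(max(abs(x - kx), abs(y - ky)) > min_dist for kx, ky, _ in kept):
            kept.append((x, y, p))
        if len(kept) == top_n:
            break
    return kept

def proposal_score(corner_probs):
    """Naive-Bayes-style score: product of the proposal's corner probabilities."""
    score = 1.0
    for p in corner_probs:
        score *= p
    return score
```

For example, `proposal_score([0.9, 0.8, 0.5])` multiplies the probabilities of the three corners used to build a proposal.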
To reduce redundancy, we apply non-maximum suppression (NMS) to the proposals based on their probabilities. Since most of the proposals are quadrangles, the standard NMS based on the IoU of axis-aligned rectangles is not accurate enough for our task. Thus, we use an IoU calculation method for quadrangles from prior work. The vast majority of proposals are discarded after NMS.
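To make the difference from rectangle-based NMS concrete, here is a minimal pure-Python sketch of IoU between two convex quadrilaterals, followed by greedy NMS. It uses Sutherland-Hodgman polygon clipping, which is a swapped-in standard technique and not necessarily the method cited in the paper; it assumes convex quadrilaterals, and the function names and the threshold value are illustrative.

```python
def polygon_area(poly):
    """Signed shoelace area (positive for counter-clockwise vertex order)."""
    area = 0.0
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        area += x1 * y2 - x2 * y1
    return area / 2.0

def clip_convex(subject, clipper):
    """Sutherland-Hodgman: clip `subject` by a convex, counter-clockwise `clipper`."""
    output = list(subject)
    for i in range(len(clipper)):
        a, b = clipper[i], clipper[(i + 1) % len(clipper)]
        inputs, output = output, []
        if not inputs:
            break

        def side(p):
            # >= 0 when p lies on or to the left of the directed edge a -> b
            return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

        for j in range(len(inputs)):
            cur, prev = inputs[j], inputs[j - 1]
            s_cur, s_prev = side(cur), side(prev)
            if (s_cur >= 0) != (s_prev >= 0):  # the edge prev -> cur crosses a -> b
                t = s_prev / (s_prev - s_cur)
                output.append((prev[0] + t * (cur[0] - prev[0]),
                               prev[1] + t * (cur[1] - prev[1])))
            if s_cur >= 0:
                output.append(cur)
    return output

def quad_iou(q1, q2):
    """IoU of two convex quadrilaterals given as lists of (x, y) vertices."""
    if polygon_area(q2) < 0:  # make the clipping polygon counter-clockwise
        q2 = q2[::-1]
    inter = clip_convex(q1, q2)
    inter_area = abs(polygon_area(inter)) if len(inter) >= 3 else 0.0
    union = abs(polygon_area(q1)) + abs(polygon_area(q2)) - inter_area
    return inter_area / union if union > 0 else 0.0

def quad_nms(quads, scores, iou_thresh=0.7):
    """Greedy NMS: keep a quad unless it overlaps a higher-scoring kept one."""
    order = sorted(range(len(quads)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(quad_iou(quads[i], quads[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep
```

`quad_nms` returns the indices of the surviving proposals, ordered by decreasing score.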
As presented in prior work, the RoI pooling layer uses max pooling to convert the features inside any valid RoI into a small feature map with a fixed spatial extent. Because traditional RoI pooling is defined only for rectangular windows and is not accurate for quadrangles, as shown in Fig. 3, we adopt the Rotation RoI (RRoI) pooling layer from prior work to address this issue. RRoI pooling takes as input a rotated rectangle represented by 5 parameters (x, y, w, h, θ), where x and y are the central coordinates of the axis-aligned rectangle, w and h are its width and height, and θ is the anticlockwise rotation angle. This operator works by mapping the axis-aligned rectangle to the feature map and dividing it into a grid of sub-windows, and then an affine transformation is used to calculate the rotated position of each sub-window. Consequently, the quadrilateral RoI must be converted into a rotated rectangular one in advance.
To improve the robustness of the model to various text orientations, existing methods rotate images to different angles to harvest sufficient training data. Despite the effectiveness of this approach, learning all the transformations of the data requires a large number of network parameters, and it may significantly increase the training cost and the risk of over-fitting. To alleviate these drawbacks, we design a simple module named Dual-RoI pooling, which embeds data augmentation inside the network. For each RoI, we calculate its oriented rectangle and then transform it into a second oriented rectangle. These two oriented rectangles are both fed into the RRoI pooling layer, and their output features are fused via element-wise addition. We call these two oriented rectangles the Dual-RoI, as shown in Fig. 3. We only use these two oriented RoIs because there are only two forms of arrangement within text after affine transformation: horizontal and vertical. In essence, Dual-RoI pooling employs multi-instance learning and learns to find the most representative instance of the input images for training, similar to TI-POOLING. Because the transformation is conducted directly on feature maps instead of on the source image, our module is more efficient than TI-POOLING. Moreover, considering the diversity of arrangements within text, we argue that element-wise addition is more appropriate than the maximum for our task.
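The fusion step can be illustrated with a deliberately simplified sketch. It assumes that the second rectangle is the first one with width and height swapped and the angle offset by 90 degrees, which is an interpretation of the "horizontal and vertical" remark and not a confirmed detail. It also replaces max pooling within each bin by a nearest-neighbour lookup at the bin centre. The function names and the 3 × 3 output size are hypothetical.

```python
import math

def rroi_sample(feat, rect, out_size=3):
    """Sample an out_size x out_size grid inside a rotated rectangle.

    feat: 2D list of floats; rect: (cx, cy, w, h, theta) with theta in radians.
    Simplification: nearest feature value at each bin centre, no max pooling.
    """
    cx, cy, w, h, theta = rect
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    rows, cols = len(feat), len(feat[0])
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            # bin centre in the rectangle's own (unrotated) frame
            u = ((j + 0.5) / out_size - 0.5) * w
            v = ((i + 0.5) / out_size - 0.5) * h
            # rotate by theta and translate to the rectangle centre
            x = cx + u * cos_t - v * sin_t
            y = cy + u * sin_t + v * cos_t
            xi = min(max(int(round(x)), 0), cols - 1)
            yi = min(max(int(round(y)), 0), rows - 1)
            row.append(feat[yi][xi])
        out.append(row)
    return out

def dual_roi_pool(feat, rect, out_size=3):
    """Pool the RoI and its assumed 90-degree counterpart, then add element-wise."""
    cx, cy, w, h, theta = rect
    dual = (cx, cy, h, w, theta + math.pi / 2.0)  # assumed form of the second RoI
    a = rroi_sample(feat, rect, out_size)
    b = rroi_sample(feat, dual, out_size)
    return [[a[i][j] + b[i][j] for j in range(out_size)] for i in range(out_size)]
```

Both pooled maps have the same fixed size, so the element-wise sum is well defined regardless of the RoI's aspect ratio.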
The network architecture of the proposed method is illustrated in Fig. 4. All of our experiments are implemented based on VGG16, though other networks are also applicable. In order to maintain spatial information for more accurate corner prediction, the output of conv4 is chosen as the final feature map, whose size is 1/8 of the input image. Following prior work, we also associate finer features from lower layers with coarse semantic features from higher layers. Unlike that work, we add a deconvolutional layer on conv5 to carry out upsampling, and a convolutional layer with stride 2 on conv3 to conduct subsampling.
Based on the above definitions, we give the multi-task loss of our network. The CNN model is trained to simultaneously minimize the losses on corner segmentation, proposal classification and regression. Overall, the loss function is a weighted sum of the three losses, where the weights are user-defined constants indicating the relative strength of each component. In our experiments, all weight constants are set to 1.
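Written out, a weighted sum of this kind has the following form. The symbol names are assumptions chosen here for illustration, since the text only states that the three loss terms are weighted and that all weights equal 1 in the experiments.

```latex
L = \lambda_{seg}\, L_{seg} + \lambda_{cls}\, L_{cls} + \lambda_{reg}\, L_{reg},
\qquad \lambda_{seg} = \lambda_{cls} = \lambda_{reg} = 1
```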
Technically, our goal in the segmentation task is to produce corner maps that approach the ground truth. The ground truth is obtained by mapping the corners of each text instance to a single position in the label map, and corners outside the boundary are simply discarded. Since the great majority of points are non-corners, the tremendous imbalance between positive and negative samples biases the model towards negative samples. We alleviate this issue in two ways. On one hand, we only choose 32 samples as a mini-batch to compute the loss function, consisting of all positive samples plus negative samples randomly sampled from the feature map. On the other hand, we use a weighted softmax loss function from prior work. For each corner type, the loss is computed over the samples of the input mini-batch (32 in our implementation) and over the link direction classes up to the maximum link direction value, comparing the ground truth of each sample with the predicted likelihood of each link direction. A balancing weight between positive and negative samples is applied.
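The first of the two remedies, a 32-sample mini-batch made of all positives plus randomly drawn negatives, can be sketched as follows. The label convention (0 for negatives, a positive link direction otherwise), the cap on positives when there are more than 32, and the function name are assumptions for illustration.

```python
import random

def sample_minibatch(labels, batch_size=32, rng=None):
    """Return sample indices: all positives, topped up with random negatives.

    labels: flat list where 0 marks a negative (non-corner) position and any
    value > 0 marks a positive position labelled with its link direction.
    """
    rng = rng or random.Random()
    pos = [i for i, label in enumerate(labels) if label > 0]
    neg = [i for i, label in enumerate(labels) if label == 0]
    pos = pos[:batch_size]  # assumed cap if positives alone exceed the batch
    n_neg = min(len(neg), batch_size - len(pos))
    return pos + rng.sample(neg, n_neg)
```

Passing a seeded `random.Random` instance makes the negative sampling reproducible.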
As in prior work, the softmax loss for binary classification of text or non-text and the smooth-L1 loss for bounding box regression are used. We sample a mini-batch of RoIs from each image and take 25% of the RoIs from proposals that have an intersection over union (IoU) overlap with a ground truth of at least 0.5. Each training RoI is labeled with a ground-truth class and a ground-truth bounding box regression target. The classification loss for each RoI is the softmax loss on its ground-truth class, and the regression loss for each RoI is the smooth-L1 loss between the predicted tuple and the target tuple. The 8 coordinates are parameterized corner by corner: for the i-th corner, the parameterization relates the x-coordinates of the predicted box, the proposal and the target bounding box (likewise for y), using the width and height of the minimal bounding rectangle. The target box is identified by selecting the ground truth with the largest IoU overlap.
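One plausible form of such an 8-coordinate parameterization is sketched below: each corner offset between the target and the proposal is normalized by the width or height of the proposal's axis-aligned minimal bounding rectangle. This normalization, the choice of the proposal's rectangle, and the function names are assumptions; the exact equations are not recoverable from the text.

```python
def bounding_size(quad):
    """Width and height of the axis-aligned minimal bounding rectangle."""
    xs = [p[0] for p in quad]
    ys = [p[1] for p in quad]
    return max(xs) - min(xs), max(ys) - min(ys)

def encode_quad(target, proposal):
    """Encode 4 target corners relative to 4 proposal corners as 8 offsets."""
    w, h = bounding_size(proposal)
    return [((tx - px) / w, (ty - py) / h)
            for (tx, ty), (px, py) in zip(target, proposal)]

def decode_quad(deltas, proposal):
    """Inverse of encode_quad: recover corner coordinates from the offsets."""
    w, h = bounding_size(proposal)
    return [(px + dx * w, py + dy * h)
            for (dx, dy), (px, py) in zip(deltas, proposal)]
```

At training time the encoded offsets would serve as regression targets, and at test time `decode_quad` would turn predicted offsets back into corner coordinates.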
SynthText in the Wild (SynthText) is a dataset of 800,000 images generated by blending natural images with artificial text rendered with random fonts, sizes, orientations and colors. Each image has about ten text instances annotated with character-level and word-level bounding boxes.
ICDAR 2013 Focused Scene Text (IC13) was released during the ICDAR 2013 Robust Reading Competition. It was captured by users explicitly directing the focus of the camera on the text content of interest. It contains 229 training images and 233 testing images. The dataset provides very detailed annotations for characters and words.
ICDAR 2015 Incidental Scene Text (IC15) was presented at the ICDAR 2015 Robust Reading Competition. The images were collected without any specific prior attention to the text. Therefore, it is more difficult than previous ICDAR challenges. The dataset provides 1,000 training images and 500 testing images, and the annotations are provided as word bounding boxes.
COCO-Text is currently the largest dataset for text detection and recognition in scene images. The original images are from the Microsoft COCO dataset, and it contains 43,686 training images and 20,000 images for validation and testing. The dataset provides word-level annotations.
Our training images are collected from SynthText and the training sets of ICDAR 2013 and 2015. We randomly pick 100,000 images from SynthText for pretraining, and then the real data from the training sets of IC13 and IC15 are used to finetune a unified model. The detection network is trained end-to-end using the standard SGD algorithm with a momentum of 0.9 and weight decay. Following multi-scale training, we resize the images in each training iteration such that the short side of the input is randomly sampled between 480 and 800 pixels. In pretraining, the learning rate is held fixed for the first 60k iterations and then decayed for the remaining 40k iterations. In finetuning, the learning rate is fixed throughout all 20k iterations. No extra data augmentation is used.
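The multi-scale resizing can be illustrated with a short sketch that draws a short-side length between 480 and 800 and scales the image while preserving its aspect ratio. Uniform integer sampling, rounding to whole pixels, and the function name are assumptions; the text only states the range.

```python
import random

def sample_resize(width, height, lo=480, hi=800, rng=None):
    """Return a new (width, height) whose short side is sampled in [lo, hi]."""
    rng = rng or random.Random()
    short = rng.randint(lo, hi)  # inclusive on both ends
    scale = short / min(width, height)
    return round(width * scale), round(height * scale)
```

For a 1280 × 720 input, for instance, the returned height lies between 480 and 800 and the width follows from the same scale factor.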
For the trade-off between efficiency and accuracy, we fix the number of discrete link directions and the number of top-ranked corner candidates in our implementation. Moreover, the corner probability threshold is set to 0.1 in the training phase for high recall and 0.5 in the testing phase for high precision. Only 200 proposals are used for further detection at test time. The proposed method is implemented using Caffe. All experiments are carried out on a standard PC with an Intel i7-6800K CPU and a single NVIDIA 1080Ti GPU.
We evaluate our method on three benchmarks, ICDAR 2013, ICDAR 2015 and COCO-Text, following the standard evaluation protocols for each. All of our results are reported on single-scale testing images with a single model. Fig. 5 shows some detection results from these datasets.
ICDAR 2015 Incidental Scene Text. In ICDAR 2015, we rescale all of the testing images such that their short side is 900 pixels. Table 1 compares our method with other recently published works. By incorporating both CRPN and Dual-RoI pooling, our method achieves an F-measure of 0.845, surpassing all of the sliding-window-based methods, including RRPN and RCNN, which are also extended from Faster R-CNN and employ VGG16 as the backbone. Besides, our method is faster, with a speed of 5.0 FPS. However, there is still room for further improvement in terms of recall.
ICDAR 2013 Focused Scene Text. In ICDAR 2013, the testing images are resized with a fixed short side of 640. Using the ICDAR 2013 standard, we obtain state-of-the-art performance with an F-measure of 0.876. As depicted in Table 2, our method also outperforms all the other compared methods, including DeepText, CTPN and TextBoxes, which are mainly designed for horizontal text detection. We further investigate the running time of various methods. The proposed method runs at 9.1 FPS, which is slightly faster than SSTD. However, it is still too slow for real-time or near real-time applications.
COCO-Text. Finally, we evaluate our method on COCO-Text, which is the largest benchmark to date. Similarly, the testing images are resized with a fixed short side of 900. We use the official online evaluation system. As shown in Table 3, our method achieves 0.633, 0.555 and 0.591 in recall, precision and F-measure respectively. It is worth noting that no images from COCO-Text are involved in the training phase. These results demonstrate that our method generalizes in practice to unseen contexts.
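As a quick consistency check on the reported COCO-Text numbers, the F-measure is the harmonic mean of precision and recall. The helper name below is illustrative.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# Reported COCO-Text precision and recall
print(round(f_measure(0.555, 0.633), 3))  # prints 0.591
```

The result matches the F-measure of 0.591 quoted for COCO-Text.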
This paper presents an intuitive region-based method for multi-oriented text detection, inspired by the process of object annotation. We discard the anchor strategy and employ corners to estimate the possible positions of text regions. Based on corner detection, the resulting model is flexible enough to capture various kinds of text instances. Moreover, we combine multi-instance learning with feature extraction and design a built-in data augmentation, which not only utilizes training data more efficiently, but also improves the robustness of the resulting model. Experiments on standard benchmarks demonstrate that the proposed method is effective and efficient in the detection of scene text. In the future, the performance of our method can be further improved by using much stronger backbone networks, such as ResNet and ResNeXt. Additionally, we are also interested in extending our method to an end-to-end text reading system.
Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Reading text in the wild with convolutional neural networks. IJCV (2016)