The code of "Mask TextSpotter v3: Segmentation Proposal Network for Robust Scene Text Spotting"
Recent end-to-end trainable methods for scene text spotting, integrating detection and recognition, showed much progress. However, most of the current arbitrary-shape scene text spotters use region proposal networks (RPN) to produce proposals. RPN relies heavily on manually designed anchors and its proposals are represented with axis-aligned rectangles. The former presents difficulties in handling text instances of extreme aspect ratios or irregular shapes, and the latter often includes multiple neighboring instances into a single proposal, in cases of densely oriented text. To tackle these problems, we propose Mask TextSpotter v3, an end-to-end trainable scene text spotter that adopts a Segmentation Proposal Network (SPN) instead of an RPN. Our SPN is anchor-free and gives accurate representations of arbitrary-shape proposals. It is therefore superior to RPN in detecting text instances of extreme aspect ratios or irregular shapes. Furthermore, the accurate proposals produced by SPN allow masked RoI features to be used for decoupling neighboring text instances. As a result, our Mask TextSpotter v3 can handle text instances of extreme aspect ratios or irregular shapes, and its recognition accuracy will not be affected by nearby text or background noise. Specifically, we outperform state-of-the-art methods by 21.9 percent on the Rotated ICDAR 2013 dataset (rotation robustness), 5.9 percent on the Total-Text dataset (shape robustness), and achieve state-of-the-art performance on the MSRA-TD500 dataset (aspect ratio robustness). Code is available at: https://github.com/MhLiao/MaskTextSpotterV3
Reading text in the wild is of great importance, with abundant real-world applications, including Photo OCR, reading menus, and geo-location. Systems designed for this task generally consist of text detection and recognition components: the goal of text detection is to localize text instances with their bounding boxes, whereas text recognition aims to recognize the detected text regions by converting them into a sequence of character labels. Scene text spotting/end-to-end recognition is a task that combines the two, requiring both detection and recognition.
The challenges of scene text reading mainly lie in the varying orientations, extreme aspect ratios, and diverse shapes of scene text instances, which bring difficulties to both text detection and recognition. Thus, rotation robustness, aspect ratio robustness, and shape robustness are necessary for accurate scene text spotters. Rotation robustness is important in scene text images, where text cannot be assumed to be well aligned with the image axes. Aspect ratio robustness is especially important for non-Latin scripts where the text is often organized in long text lines rather than words. Shape robustness is necessary for handling text of irregular shapes, which frequently appears in logos.
A recent popular trend is to perform scene text spotting by integrating both text detection and recognition into a unified model [3, 20], as the two tasks are naturally closely related. Some such scene text spotters are designed to detect and recognize multi-oriented text instances, such as Liu et al. and He et al. Mask TextSpotter v1, Qin et al., and Mask TextSpotter v2 can further handle text instances of arbitrary shapes. The Mask TextSpotter series adopts a Region Proposal Network (RPN) to generate proposals and extracts RoI features of the proposals for detection and recognition. Qin et al. directly apply Mask R-CNN for detection, which also uses RPN to produce proposals. These methods made great progress towards rotation robustness and shape robustness. Their architectures, however, were not designed to be fully robust to rotations, aspect ratios, and shapes: although these methods can deal with scattered text instances of various orientations and diverse shapes, they can fail on densely oriented text instances or text lines of extreme aspect ratios due to the limitations of RPN.
The limitations of RPN mainly lie in two aspects: (1) the manually pre-designed anchors are defined as axis-aligned rectangles, which cannot easily match text instances of extreme aspect ratios; (2) the generated axis-aligned rectangular proposals can contain multiple neighboring text instances when text instances are densely positioned. As evident in Fig. 1, the proposals produced by Mask TextSpotter v2 overlap with each other, so its RoI features include multiple neighboring text instances, causing errors for detection and recognition. As shown in Fig. 1, the errors can be one or several characters, which may not be reflected in the performance if a strong lexicon is given. Thus, evaluation without a lexicon or with a generic lexicon is more persuasive.
In this paper, we propose a Segmentation Proposal Network (SPN), designed to address the limitations of RPN-based methods. Our SPN is anchor-free and gives accurate polygonal representations of the proposals. Without restrictions by pre-designed anchors, SPN can handle text instances of extreme aspect ratios or irregular shapes. Its accurate proposals can then be fully utilized by applying our proposed hard RoI masking into the RoI features, which can suppress neighboring text instances or background noise. This is beneficial in cases of densely oriented or irregularly shaped texts, as shown in Fig. 1. Consequently, Mask TextSpotter v3 is proposed by adopting SPN into Mask TextSpotter v2.
Our experiments show that Mask TextSpotter v3 significantly improves robustness to rotations, aspect ratios, and shapes. On the Rotated ICDAR 2013 dataset where the images are rotated with various angles, our method surpasses the state-of-the-art on both detection and end-to-end recognition by more than 21.9%. On the Total-Text dataset  containing text instances of various shapes, our method outperforms the state-of-the-art by 5.9% on the end-to-end recognition task. Our method also achieves state-of-the-art performance on the MSRA-TD500 dataset  labeled with text lines of extreme aspect ratios, as well as the ICDAR 2015 dataset that includes many low-resolution small text instances with a generic lexicon. To summarize, our contributions are three-fold:
We describe Segmentation Proposal Network (SPN), for an accurate representation of arbitrary-shape proposals. The anchor-free SPN overcomes the limitations of RPN in handling text of extreme aspect ratios or irregular shapes, and provides more accurate proposals to improve recognition robustness. To our knowledge, it is the first arbitrary-shape proposal generator for end-to-end trainable text spotting.
We propose hard RoI masking to apply polygonal proposals to RoI features, effectively suppressing background noise or neighboring text instances.
Our proposed Mask TextSpotter v3 significantly improves robustness to rotations, aspect ratios, and shapes, beating/achieving state-of-the-art results on several challenging scene text benchmarks.
Current text spotting methods can be roughly classified into two categories: (1) two-stage scene text spotting methods, whose detector and recognizer are trained separately; (2) end-to-end trainable scene text spotting methods, which integrate detection and recognition into a unified model.
Two-stage scene text spotting Two-stage scene text spotting methods use two separate networks for detection and recognition. Wang et al. tried to detect and classify characters with CNNs. Jaderberg et al. proposed a scene text spotting method consisting of a proposal generation module, a random forest classifier to filter proposals, a CNN-based regression module for refining the proposals, and a CNN-based word classifier for recognition. TextBoxes and TextBoxes++ combined their proposed scene text detectors with CRNN and re-calculated the confidence score by integrating the detection confidence and the recognition confidence. Zhan et al. proposed to apply multi-modal spatial learning to the scene text detection and recognition system.
End-to-end trainable scene text spotting Recently, end-to-end trainable scene text spotting methods have dominated this area, benefiting from the complementarity of text detection and recognition. Li et al.  integrated a horizontal text detector and a sequence-to-sequence text recognizer into a unified network. Meanwhile, Bušta et al.  used a similar architecture while its detector can deal with multi-oriented text instances. After that, Liu et al.  and He et al.  further improved performance by adopting better detection and recognition methods, respectively.
Mask TextSpotter v1  is the first end-to-end trainable arbitrary-shape scene text spotter, consisting of a detection module based on Mask R-CNN  and a character segmentation module for recognition. Following Mask TextSpotter v1 , several arbitrary-shape scene text spotters appeared concurrently. Mask TextSpotter v2  further extends Mask TextSpotter v1 by applying a spatial attentional module for recognition, which alleviated the problem of character-level annotations and improved the performance significantly. Qin et al.  also combine a Mask R-CNN detector and an attention-based recognizer to deal with arbitrary-shape text instances. Xing et al.  propose to simultaneously detect/recognize the characters and the text instances, using the text instance detection results to group the characters. TextDragon  detects and recognizes text instances by grouping and decoding a series of local regions along with their centerline.
Qin et al.  use the mask maps from a Mask R-CNN detector to perform RoI masking on the RoI features, which is beneficial to recognition. However, the detector that adopts RPN to produce proposals may produce inaccurate mask maps, causing further recognition errors. Different from Qin et al. , our Mask TextSpotter v3 obtains accurate proposals and applies our hard RoI masking on the RoI features for both detection and recognition modules. Thus, it can detect and recognize densely oriented/curved text instances accurately.
Segmentation-based scene text detectors Zhang et al. first use an FCN to obtain the salient map of the text region, then estimate text line hypotheses by combining the salient map and character components (using MSER); finally, another FCN predicts the centroid of each character to remove false hypotheses. He et al. propose Cascaded Convolutional Text Networks (CCTN) for detecting text center lines and text regions. PSENet adopts a progressive scale expansion algorithm to get the bounding boxes from multi-scale segmentation maps. DB proposes a differentiable binarization module for a segmentation network. Compared to previous segmentation-based scene text detectors that adopt multiple cues or extra modules for the detection task, our method focuses on proposal generation with a segmentation network for an end-to-end scene text recognition model.
Mask TextSpotter v3 consists of a ResNet-50  backbone, a Segmentation Proposal Network (SPN) for proposal generation, a Fast R-CNN module  for refining proposals, a text instance segmentation module for accurate detection, a character segmentation module and a spatial attentional module for recognition. The pipeline of Mask TextSpotter v3 is illustrated in Fig. 2. It provides polygonal representations for the proposals and eliminates added noise for the RoI features, thus achieving accurate detection and recognition results.
As shown in Fig. 2, our proposed SPN adopts a U-Net structure to make it robust to scales. Unlike the FPN-based RPN [26, 35], which produces proposals of different scales from multiple stages, SPN generates proposals from segmentation masks, predicted from a fused feature map F that concatenates feature maps of various receptive fields. F is of size 256 x H/4 x W/4, where H and W are the height and width of the input image, respectively. The configuration of the segmentation prediction module that predicts from F is shown in the supplementary. The predicted text segmentation map S is of size 1 x H x W, whose values are in the range of [0, 1].
Segmentation label generation To separate the neighboring text instances, it is common for segmentation-based scene text detectors to shrink the text regions [49, 42]. Inspired by Wang et al. and DB, we adopt the Vatti clipping algorithm to shrink the text regions by clipping d pixels. The offset d can be determined as d = A(1 - r^2)/L, where A and L are the area and perimeter of the polygon that represents the text region, and r is the shrink ratio, which we empirically set to 0.4. An example of the label generation is shown in Fig. 3.
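The shrink offset above depends only on the polygon's area and perimeter. A minimal pure-Python sketch (helper names are ours; the actual clipping of the polygon inward by d pixels is done with the Vatti algorithm, e.g. via a polygon-clipping library, which is omitted here):

```python
import math

def polygon_area_perimeter(pts):
    """Shoelace area and perimeter of a polygon given as [(x, y), ...]."""
    area, perim = 0.0, 0.0
    n = len(pts)
    for i in range(n):
        x1, y1 = pts[i]
        x2, y2 = pts[(i + 1) % n]
        area += x1 * y2 - x2 * y1
        perim += math.hypot(x2 - x1, y2 - y1)
    return abs(area) / 2.0, perim

def shrink_offset(pts, r=0.4):
    """Offset d = A * (1 - r^2) / L used to shrink a text polygon."""
    area, perim = polygon_area_perimeter(pts)
    return area * (1.0 - r * r) / perim
```

For a 10x10 square (A = 100, L = 40) with r = 0.4, this gives d = 2.1 pixels of inward offset.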
Proposal generation Given a text segmentation map S, whose values are in the range of [0, 1], we first binarize S into a binary map B: B_(i,j) = 1 if S_(i,j) >= t, otherwise 0. Here, i and j are the indices of the segmentation or binary map, and the threshold t is set to 0.5. Note that B is of the same size as S and the input image.
We then group the connected regions in the binary map B. These connected regions can be considered as shrunk text regions since the text segmentation labels are shrunk, as described above. Thus, we dilate them by un-clipping d' pixels using the Vatti clipping algorithm, where d' is calculated as d' = A' x r' / L'. Here, A' and L' are the area and perimeter of the predicted shrunk text region, and r' is the un-clip ratio, set according to the value of the shrink ratio r.
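The binarize-and-group step can be sketched in plain Python (a simplified version with hypothetical helper names; the dilation of each region by the Vatti clipping algorithm is omitted):

```python
from collections import deque

def binarize(seg_map, t=0.5):
    """Threshold a [0, 1] segmentation map into a 0/1 binary map."""
    return [[1 if v >= t else 0 for v in row] for row in seg_map]

def connected_regions(bin_map):
    """Group 4-connected foreground pixels via BFS; each region is one
    shrunk text-instance proposal before dilation."""
    h, w = len(bin_map), len(bin_map[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for y in range(h):
        for x in range(w):
            if bin_map[y][x] and not seen[y][x]:
                queue, region = deque([(y, x)]), []
                seen[y][x] = True
                while queue:
                    cy, cx = queue.popleft()
                    region.append((cy, cx))
                    for ny, nx in ((cy + 1, cx), (cy - 1, cx),
                                   (cy, cx + 1), (cy, cx - 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and bin_map[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                regions.append(region)
    return regions
```

Two separated blobs in the thresholded map yield two independent proposals, which is what decouples densely positioned text instances.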
As explained above, the proposals produced by SPN can be accurately represented as polygons, which are the contours of text regions. Thus, SPN generates suitable proposals for text lines with extreme aspect ratios and densely oriented/irregularly shaped text instances.
Since the standard RoI Align operator only supports axis-aligned rectangular bounding boxes, we use the minimum axis-aligned rectangular bounding boxes of the polygon proposals to generate the RoI features, keeping the RoI Align operator simple.
Qin et al. 
proposed RoI masking which multiplies the mask probability map and the RoI feature, where the mask probability map is generated by a Mask R-CNN detection module. However, the mask probability maps may be inaccurate since they are predicted by the proposals from RPN. For example, it may contain multiple neighboring text instances for densely oriented text. In our case, accurate polygonal representations are designed for the proposals, thus we can directly apply the proposals to the RoI features through our proposed hard RoI masking.
Hard RoI masking multiplies binary polygon masks with the RoI features to suppress background noise or neighboring text instances, where a polygon mask M is an axis-aligned rectangular binary map whose values are 1 inside the polygon region and 0 outside it. Assuming that R is the RoI feature and M is the polygon mask of the same spatial size, the masked RoI feature R' can be calculated as R' = R * M, where * indicates element-wise multiplication. M can be easily generated by filling the polygon proposal region with 1 while setting the values outside the polygon to 0. We report an ablation study on the hard RoI masking in Sec. 4.7, where we compare the proposed hard RoI masking with other operators including the RoI masking in Qin et al.
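Hard RoI masking is a single element-wise multiplication. A minimal sketch over plain nested lists (in practice this is a tensor operation applied per channel of a C x H x W RoI feature; the function name is ours):

```python
def hard_roi_masking(roi_feature, polygon_mask):
    """Suppress everything outside the polygon proposal by multiplying
    every channel of the RoI feature element-wise with a 0/1 mask."""
    return [
        [[f * m for f, m in zip(f_row, m_row)]
         for f_row, m_row in zip(channel, polygon_mask)]
        for channel in roi_feature
    ]
```

Feature values at mask positions of 0 (background or a neighboring instance) are zeroed out; values inside the polygon pass through unchanged.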
After applying hard RoI masking, the background regions or neighboring text instances are suppressed in our masked RoI features, which significantly reduce the difficulties and errors in the detection and recognition modules.
We follow the main design of the text detection and recognition modules of Mask TextSpotter v2 for the following reasons: (1) Mask TextSpotter v2 is the current state of the art, with competitive detection and recognition modules. (2) Since Mask TextSpotter v2 is a representative RPN-based scene text spotter, we can fairly compare our method with it to verify the effectiveness and robustness of our proposed SPN.
For detection, the masked RoI features generated by the hard RoI masking are fed into the Fast R-CNN module for further refining the localizations and the text instance segmentation module for precise segmentation. The character segmentation module and spatial attentional module are adopted for recognition.
The loss function L is defined as below:

L = L_rcnn + a1 * L_mask + a2 * L_spn

L_rcnn and L_mask are defined in Fast R-CNN and Mask TextSpotter v2 respectively; L_mask consists of a text instance segmentation loss, a character segmentation loss, and a spatial attentional decoder loss. L_spn indicates the SPN loss. Finally, following Mask TextSpotter v2, we set a1 and a2 to 1.0.
We adopt dice loss for SPN. Assuming that S and G are the segmentation map and the target map, the segmentation loss L_spn can be calculated as:

L_spn = 1 - 2 * I / U

where I = Sum(S * G) and U = Sum(S) + Sum(G) indicate the intersection and union of the two maps, and * represents element-wise multiplication.
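The dice loss above can be sketched as follows (pure-Python version over 2-D maps; eps is a small constant we add for numerical stability, not specified in the paper):

```python
def dice_loss(pred, target, eps=1e-6):
    """Dice loss: 1 - 2*I/U, where I sums the element-wise product of the
    two maps and U sums all values of both maps."""
    inter = sum(p * t for p_row, t_row in zip(pred, target)
                for p, t in zip(p_row, t_row))
    union = sum(v for row in pred for v in row) + \
            sum(v for row in target for v in row)
    return 1.0 - 2.0 * inter / (union + eps)
```

A perfect prediction drives the loss toward 0, while completely disjoint maps give a loss of 1; unlike per-pixel cross-entropy, the ratio form is insensitive to the foreground/background imbalance typical of text maps.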
We evaluate our method, testing robustness to four types of variations: rotations, aspect ratios, shapes, and small text instances, on different standard scene text benchmarks. We further provide an ablation study of our hard RoI masking.
SynthText  is a synthetic dataset containing 800k text images. It provides annotations for word/character bounding boxes and text sequences.
Rotated ICDAR 2013 dataset (RoIC13) is generated from the ICDAR 2013 dataset, whose images are focused on the text content of interest. The text instances are horizontal and labeled with axis-aligned rectangular boxes; character-level segmentation annotations are given, so we can obtain character-level bounding boxes. The dataset contains 229 training and 233 testing images. To test rotation robustness, we create the Rotated ICDAR 2013 dataset by rotating the images and annotations in the test set of the ICDAR 2013 benchmark by specific angles: 15°, 30°, 45°, 60°, 75°, and 90°. Since all text instances in the ICDAR 2013 dataset are horizontally oriented, we can easily control the orientations of the text instances and study the relation between performance and text orientation. We use the evaluation protocols of the ICDAR 2015 dataset, because those of ICDAR 2013 only support axis-aligned bounding boxes.
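Generating such a rotated benchmark amounts to rotating each annotation box around the image center, turning the axis-aligned rectangle into a quadrilateral. A minimal sketch (function name and representation are ours, not from the paper):

```python
import math

def rotate_box(box, angle_deg, cx, cy):
    """Rotate the four corners of an axis-aligned box (x1, y1, x2, y2)
    around the image center (cx, cy), yielding a quadrilateral."""
    x1, y1, x2, y2 = box
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    quad = []
    for x, y in ((x1, y1), (x2, y1), (x2, y2), (x1, y2)):
        dx, dy = x - cx, y - cy
        quad.append((cx + dx * cos_a - dy * sin_a,
                     cy + dx * sin_a + dy * cos_a))
    return quad
```

This is also why the ICDAR 2015 evaluation protocol is needed: after rotation, the ground truth is a general quadrilateral rather than an axis-aligned rectangle.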
MSRA-TD500 dataset  is a multi-language scene text detection benchmark that contains English and Chinese text, including 300 training images and 200 testing images. Text instances are annotated in the text-line level, thus there are many text instances of extreme aspect ratios. This dataset does not contain recognition annotations.
Total-Text dataset [4, 5] includes 1,255 training and 300 testing images. It offers text instances of various shapes, including horizontal, oriented, and curved shapes, which are annotated with polygonal bounding boxes and transcriptions. Note that although character-level annotations are provided in the Total-Text dataset, we do not use them, for fair comparison with previous methods [31, 21].
ICDAR 2015 dataset (IC15)  consists of 1,000 training images and 500 testing images, which are annotated with quadrilateral bounding boxes. Most of the images are of low resolution and contain small text instances.
For a fair comparison with Mask TextSpotter v2, we use the same training data and training settings described below. Data augmentation follows the official implementation of Mask TextSpotter v2 (https://github.com/MhLiao/MaskTextSpotter), including multi-scale training and pixel-level augmentations. Since our proposed SPN can deal with text instances of arbitrary shapes and orientations without conflicts, we adopt a more aggressive rotation augmentation: the input images are randomly rotated within a much wider angle range than the original Mask TextSpotter v2 uses. Note that Mask TextSpotter v2 is trained with the same rotation augmentation as ours for the experiments on the RoIC13 dataset.
The model is optimized using SGD with weight decay and momentum. It is first pre-trained on SynthText and then fine-tuned on a mixture of SynthText, the ICDAR 2013 dataset, the ICDAR 2015 dataset, the SCUT dataset, and the Total-Text dataset for 250k iterations, with a fixed sampling ratio among these datasets for each mini-batch of eight.
During pre-training, the learning rate is decreased to a tenth of its value at 100k and 200k iterations respectively. During fine-tuning, we adopt the same training scheme with a smaller initial learning rate. We choose the model weights at 250k iterations for both pre-training and fine-tuning. During inference, the short sides of the input images are resized to fixed lengths (set per dataset for RoIC13, Total-Text, and IC15), keeping the aspect ratios.
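The short-side resizing above keeps the aspect ratio; a minimal sketch of that scaling rule (helper name and example values are ours, for illustration only):

```python
def resize_keep_aspect(height, width, short_side):
    """Scale (height, width) so the shorter side equals short_side
    while preserving the aspect ratio."""
    scale = short_side / min(height, width)
    return round(height * scale), round(width * scale)
```

For example, a 720x1280 frame resized to a short side of 1440 becomes 1440x2560; in practice the long side may additionally be rounded to a stride-aligned size by the network's preprocessing.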
We test for rotation robustness by conducting experiments on the RoIC13 dataset. We compare the proposed Mask TextSpotter v3 with two state-of-the-art methods, Mask TextSpotter v2 and CharNet (https://github.com/MalongTech/research-charnet), using their official implementations. For a fair comparison, Mask TextSpotter v2 is trained with the same data and data augmentation as ours. Qualitative comparisons on the RoIC13 dataset are shown in Fig. 4: Mask TextSpotter v2 fails to detect and recognize the densely oriented text instances, while Mask TextSpotter v3 handles such cases successfully.
We use the pre-trained model with a larger backbone (Hourglass-88) for CharNet, since the official implementation does not provide the ResNet-50 backbone. Note that the official pre-trained model of CharNet is trained with different training data, so it is not suitable for direct comparison with Mask TextSpotter v3. However, we can observe its performance variations under different rotation angles: the detection and end-to-end recognition performance of CharNet drop dramatically when the rotation angle is large.
Detection task As shown in Fig. 7, the detection performance of Mask TextSpotter v2 drops dramatically at the intermediate rotation angles, whereas the detection results of Mask TextSpotter v3 are much more stable across rotation angles. The maximum performance gap between Mask TextSpotter v3 and Mask TextSpotter v2 occurs at a rotation angle of 45°: as shown in Tab. 1, Mask TextSpotter v3 outperforms Mask TextSpotter v2 by 26.8 percent, 18.0 percent, and 22.0 percent in terms of Precision, Recall, and F-measure. Note that it is reasonable for the two methods to achieve almost the same results at 0° and 90°, since 0° means no rotation and the bounding boxes are also axis-aligned rectangles when the rotation angle is 90°.
End-to-end recognition task The trend of the end-to-end recognition results is similar to that of the detection results, as shown in Fig. 7. The performance gaps between Mask TextSpotter v2 and Mask TextSpotter v3 are especially large at the intermediate rotation angles, where Mask TextSpotter v3 surpasses Mask TextSpotter v2 by more than 19.2 percent in terms of F-measure. The detailed results for a rotation angle of 45° are listed in Tab. 1, where Mask TextSpotter v3 achieves 22.1, 21.0, and 21.9 percent performance gains compared to the previous state-of-the-art method Mask TextSpotter v2.
The qualitative and quantitative results on the detection and end-to-end recognition tasks prove the rotation robustness of Mask TextSpotter v3. The reason is that the RPN used in Mask TextSpotter v2 causes errors in both detection and recognition when dealing with densely oriented text instances. In contrast, the proposed SPN can generate accurate proposals and exclude the neighboring text instances by hard RoI masking in such cases. More qualitative and quantitative results are provided in the supplementary.
| Method | Precision | Recall | F-measure |
|---|---|---|---|
| He et al. | 71 | 61 | 69 |
| Xue et al. | 83.0 | 77.4 | 80.1 |
| Tian et al. | 84.2 | 81.7 | 82.9 |
| DB (without DCN) | 86.6 | 77.7 | 81.9 |
| Mask TextSpotter v2 | 80.8 | 68.6 | 74.2 |
| Mask TextSpotter v3 | 90.7 | 77.5 | 83.5 |
Aspect ratio robustness is verified by our experimental results on the MSRA-TD500 dataset, which contains many text lines of extreme aspect ratios. Since there are no recognition annotations, we disable our recognition module and evaluate only on the detection task. Our qualitative and quantitative results are shown in Fig. 5 and Tab. 2.
Although Mask TextSpotter v2 is the existing state-of-the-art end-to-end recognition method, it fails to detect long text lines due to the limitation of RPN. Compared with Mask TextSpotter v2, Mask TextSpotter v3 achieves a 9.3% performance gain, which proves its superiority in handling text lines of extreme aspect ratios. Moreover, Mask TextSpotter v3 even outperforms state-of-the-art methods designed for text line detection [29, 1, 38], further showing its robustness to aspect ratio variations.
Robustness to shape variations is evaluated with end-to-end recognition performance on the Total-Text dataset, which contains text instances of various shapes, including horizontal, oriented, and curved shapes. Some qualitative results are shown in Fig. 6, where we can see that our method obtains more accurate detection and recognition results compared with Mask TextSpotter v2, especially on text instances with irregular shapes or with large spaces between neighboring characters. The quantitative results listed in Tab. 3 show that our method outperforms Mask TextSpotter v2 by 5.9% in terms of F-measure when no lexicon is provided. Both the qualitative and quantitative results demonstrate the superior robustness to shape variations offered by our method.
| Method | None | Full |
|---|---|---|
| Mask TextSpotter v1 | 52.9 | 71.8 |
| CharNet (Hourglass-57) | 63.6 | - |
| Qin et al. (Inc-Res) | 63.9 | - |
| Boundary TextSpotter | 65.0 | 76.1 |
| Mask TextSpotter v2 | 65.3 | 77.4 |
| Mask TextSpotter v3 | 71.2 | 78.4 |
The challenges in the IC15 dataset mainly lie in the low-resolution and small text instances. As shown in Tab. 4, Mask TextSpotter v3 outperforms Mask TextSpotter v2 on all tasks with different lexicons, demonstrating the superiority of our method on handling small text instances in low-resolution images.
Although TextDragon achieves better results on some tasks with the strong/weak lexicons, our method outperforms it by large margins with the generic lexicon. We argue that strong/weak lexicons with only 100/1000+ words are unavailable in most real-world applications; thus, performance with a generic lexicon of 90k words is more meaningful and more challenging. The reason for the different behavior with different lexicons is that the attention-based recognizer in our method can learn language knowledge, while the CTC-based recognizer in TextDragon predicts characters more independently. Relying less on the correction of a strong lexicon is also one of the advantages of Mask TextSpotter v3.
| Method | Word Spotting (S) | Word Spotting (W) | Word Spotting (G) | E2E Recognition (S) | E2E Recognition (W) | E2E Recognition (G) | FPS |
|---|---|---|---|---|---|---|---|
| He et al. | 85.0 | 80.0 | 65.0 | 82.0 | 77.0 | 63.0 | - |
| Mask TextSpotter v1 (1600) | 79.3 | 74.5 | 64.2 | 79.3 | 73.0 | 62.4 | 2.6 |
| CharNet R-50 | - | - | - | 80.1 | 74.5 | 62.2 | - |
| Boundary TextSpotter | - | - | - | 79.7 | 75.2 | 64.1 | - |
| Mask TextSpotter v2 (1600) | 82.4 | 78.1 | 73.6 | 83.0 | 77.7 | 73.5 | 2.0 |
| Mask TextSpotter v3 (1440) | 83.1 | 79.1 | 75.1 | 83.3 | 78.1 | 74.2 | 2.5 |

S, W, and G denote the strong, weak, and generic lexicons, respectively.
It is important to apply polygon-based proposals to the RoI features. There are two attributes for such an operator: "direct/indirect" and "soft/hard". "Direct/indirect" means using the segmentation/binary map directly or through additional layers; "soft/hard" indicates a soft probability mask map whose values lie in [0, 1] or a binary polygon mask map whose values are 0 or 1. We conduct experiments on the four combinations, and the results show that our proposed hard RoI masking (Direct-hard) is simple yet achieves the best performance. Results and discussions are in the supplementary.
Although Mask TextSpotter v3 is far more robust to rotated text than the existing state-of-the-art scene text spotters, it still suffers minor performance disturbance at some extreme rotation angles, as shown in Fig. 7, since it is hard for the recognizer to judge the direction of the text sequence. In the future, we plan to make the recognizer more robust to such rotations.
We propose Mask TextSpotter v3, an end-to-end trainable arbitrary-shape scene text spotter. It introduces SPN to generate proposals, represented with accurate polygons. Thanks to the more accurate proposals, Mask TextSpotter v3 is much more robust on detecting and recognizing text instances with rotations or irregular shapes than previous arbitrary-shape scene text spotters that use RPN for proposal generation. Our experiment results on the Rotated ICDAR 2013 dataset with different rotation angles, the MSRA-TD500 dataset with long text lines, and the Total-Text dataset with various text shapes demonstrate the robustness to rotations, aspect ratios, and shape variations of Mask TextSpotter v3. Moreover, results on the IC15 dataset show that the proposed Mask TextSpotter v3 is also robust in detecting and recognizing small text instances. We hope the proposed SPN could extend the application of OCR to other challenging domains  and offer insights to proposal generators used in other object detection/instance segmentation tasks.
Deng, D., Liu, H., Li, X., Cai, D.: Pixellink: Detecting scene text via instance segmentation. In: AAAI Conf. on Artificial Intelligence (2018)
He, T., Huang, W., Qiao, Y., Yao, J.: Text-attentional convolutional neural network for scene text detection. Trans. Image Processing25(6), 2529–2541 (2016)
Li, H., Wang, P., Shen, C.: Towards end-to-end text spotting with convolutional recurrent neural networks. In: Proc. Int. Conf. Comput. Vision. pp. 5248–5256 (2017)
Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: European Conf. Comput. Vision. pp. 483–499 (2016)
| Layer | Kernel / Stride / Padding | In / Out Channels |
|---|---|---|
| Conv | k: 3; s: 1; p: 1 | 256 / 64 |
| DeConv | k: 2; s: 2; p: 0 | 64 / 64 |
| DeConv | k: 2; s: 2; p: 0 | 64 / 1 |
There are two attributes for the RoI masking operator: "direct/indirect" and "soft/hard". "Direct/indirect" means using the segmentation/binary map directly or through additional layers; "soft/hard" indicates a soft probability mask map whose values lie in [0, 1] or a binary polygon mask map whose values are 0 or 1. We conduct experiments with the following settings:
(1) Baseline: using the original RoI feature. (2) Direct-soft: similar to the RoI masking proposed in Qin et al., applying element-wise multiplication between the corresponding segmentation probability map and the RoI feature. (3) Direct-hard: our proposed hard RoI masking, applying element-wise multiplication between the corresponding binary polygon mask map and the RoI feature. (4) Indirect-soft: the corresponding segmentation probability map and the RoI feature are concatenated, and a mask prediction module consisting of two convolutional layers predicts a new mask map; element-wise multiplication is then applied between the new mask map and the RoI feature. (5) Indirect-hard: first, a masked RoI feature is obtained by hard RoI masking; then the masked RoI feature and the original RoI feature are concatenated; finally, a classifier on the concatenated feature chooses whether the masked or the original RoI feature is used as the output feature.
The experimental results in Tab. 8 show that "direct" is better than "indirect" and "hard" is better than "soft". The reason is that the "direct" and "hard" strategies provide the strictest mask, fully blocking background noise and neighboring text instances. Our proposed hard RoI masking is simple yet achieves the best performance.