A New Perspective for Flexible Feature Gathering in Scene Text Recognition Via Character Anchor Pooling

by Shangbang Long, et al.

Irregular scene text recognition has attracted much attention from the research community, mainly due to the complexity of text shapes in natural scenes. However, recent methods either rely on shape-sensitive modules such as bounding box regression, or discard sequence learning. To tackle these issues, we propose a pair of coupled modules, termed the Character Anchoring Module (CAM) and the Anchor Pooling Module (APM), to extract high-level semantics from two-dimensional space and form feature sequences. The proposed CAM localizes the text in a shape-insensitive way by anchoring characters individually. APM then interpolates and gathers features flexibly along the character anchors, which enables sequence learning. The complementary modules unify spatial information and sequence learning. With the proposed modules, our recognition system surpasses previous state-of-the-art scores on irregular and perspective text datasets, including ICDAR 2015, CUTE, and Total-Text, while matching state-of-the-art performance on regular text datasets.


1 Introduction

Scene text recognition has been an increasingly popular research topic in computer vision in the last few decades. Text carries high-level semantic information, so the ability to read it from natural images helps computer vision systems understand the surrounding scene [long2018scene]. There are various applications, including instant translation, robot navigation, industrial automation, and traffic sign reading for autonomous vehicles. More recently, the detection and recognition of irregular text, e.g., text arranged in a curved line, has attracted much attention [shi2016robust, long2018textsnake].

As deep learning is widely applied to this field, most recent methods follow an encoder-decoder framework in which images are decomposed into a sequence of pixel frames, from the left side of the image to the right [shi2016robust, cheng2017arbitrarily, yin2017scene, yang2017learning]. The framework can be summarized as Convolutional Recurrent Neural Networks (CRNN), where convolutional neural network (CNN) layers encode the image into deep features and compress the height of the feature maps to 1. The compressed feature map is therefore processed as a sequence of feature vectors, one per time step along the width, which is then further encoded and decoded by a Seq2Seq model [sutskever2014sequence]. We refer to this feature encoding process as feature gathering, which transforms 2D images into 1D feature sequences. These methods produce good results when the text in the image is horizontal.
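The feature gathering step described above can be sketched in a few lines. This is only an illustration, not the implementation of any cited method: the function name `gather_columns` and the use of mean pooling to collapse the height axis are assumptions (real CRNN-style models usually collapse the height with strided convolutions or pooling layers inside the CNN).

```python
import numpy as np

def gather_columns(feature_map: np.ndarray) -> np.ndarray:
    """Turn a (C, H, W) feature map into a (W, C) feature sequence.

    Assumption: the height axis is collapsed by averaging. The point is only
    that each horizontal position ends up as one time step.
    """
    if feature_map.ndim != 3:
        raise ValueError("expected a (C, H, W) array")
    pooled = feature_map.mean(axis=1)  # collapse height -> (C, W)
    return pooled.T                    # time-major layout -> (W, C)

# Toy input: 2 channels, height 4, width 5.
feature_map = np.arange(2 * 4 * 5, dtype=np.float32).reshape(2, 4, 5)
sequence = gather_columns(feature_map)  # 5 time steps, 2 features each
```

Because the time axis is tied to the image width, a sequence built this way follows a left-to-right scan regardless of how the text is actually laid out, which is the limitation the following paragraphs discuss.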

For oriented and irregular text, the use of a rectification layer [shi2016robust, shi2018aster] based on Spatial Transformer Networks [jaderberg2015spatial] alleviates the problem to some extent. The rectification layer first predicts a bounding polygon to precisely locate the text, then generates grids according to the polygon, and finally transforms the image accordingly. The idea is intuitive and has proven effective. However, the coordinates of the bounding polygon are predicted via fully connected networks, and the prediction fails when the text has a shape that is poorly represented in the training dataset. Because the polygon prediction is shape-sensitive and may not generalize well to unseen shapes, the potential of rectification-based methods is limited. A similar problem also exists in the 2D attention method [li2018show], as suggested by its lower scores on blurred datasets.

Recently, CA-FCN [liao2019two] takes the two-dimensional spatial distribution of text into consideration and reformulates text recognition as semantic segmentation, where character categories are segmented from the background. However, this method abandons the use of recurrent neural networks (RNN), and thus lacks a global view of the text. As a result, it is prone to missing characters, and its performance drops significantly when the text is blurred.

To tackle the challenges mentioned above, we design a flexible feature gathering method that deeply integrates sequence learning with the two-dimensional spatial distribution of text.

The key step is to gather feature vectors from the shared feature maps along the character anchor line (CAL), forming sequential features for subsequent sequence-to-sequence learning. To achieve this, we design two novel modules: the Character Anchoring Module (CAM), which anchors characters, and the Anchor Pooling Module (APM), which forms sequential features. The CAM detects character center anchors individually in the form of a heat map, which is inherently shape-agnostic and therefore adapts better to irregular text, even with unseen shapes. The APM then interpolates the feature vectors along the CAL and gathers them into a sequence. Instead of only considering detected character centers, the interpolation achieves robustness against challenging image conditions by filling in missing characters. While the shape-insensitive CAM robustly guides the APM along the text features to form a sequence, the APM can correct errors made by the CAM. Based on these two complementary modules, we propose a recognition model, termed Character Anchor Pooling Network (CAPNet). Compared with previous methods, our method unifies shape-agnostic localization and sequence learning. Our methodology is illustrated in Fig. 1.

The contributions of our paper can be summarized as follows. (1) We propose two complementary modules, the Character Anchoring Module and the Anchor Pooling Module, to harmonize sequence learning with the two-dimensional spatial arrangement of text. (2) Empirically, the proposed CAPNet outperforms previous state-of-the-art results on irregular and perspective datasets, including ICDAR 2015, CUTE and Total-Text. It also outperforms previous methods on several horizontal text datasets such as IIIT5K and ICDAR 2013, while matching them on other datasets. (3) We provide in-depth qualitative and quantitative analysis as well as ablation tests to further understand its strengths and dependencies.

2 Methodology

2.1 Pipeline

As shown in Fig. 1, the pipeline of CAPNet contains the following components: (1) A fully convolutional backbone module for feature encoding, producing shared feature maps for the following steps. (2) The Character Anchor Pooling Module (CAPM), which comprises the CAM and the APM. (3) An RNN-based recognition module that encodes the pooled features and decodes them into text symbols.

Figure 1: The pipeline of the proposed CAPNet: (a): ResNet50+FPN backbone; (b): the feature gathering module that we refer to as Character Anchor Pooling Module; (c): the sequence learning module. We apply an attentional RNN-based encoder-decoder network.

2.2 Model Architecture

Backbone: We use a standard 50-layer ResNet [He_2017_Res] as the backbone, with FPN [Lin_2017_CVPRPyramid] connections to integrate features from different stages. The upsampling in the FPN branch is performed by bilinear interpolation. Other settings follow [Lin_2017_CVPRPyramid]. All CNN layers in FPN have filters. The kernel size is for lateral connections, and for top-down connections. The network produces shared feature maps with channels, which are one-quarter of the size of the input image.

Character Anchoring Module: To flexibly and robustly localize a text instance, we propose to anchor each individual character instead of the text as a whole. Note that bounding polygon regression [shi2018aster] localizes the text in its entirety, so each control point depends on the overall shape. This results in shape sensitivity. In contrast, the localization of a character does not depend on the rest of the image. Therefore, detecting characters is insensitive to text shape by design. For easy separation of character anchors, we define character anchors as shrunken character boxes. The downsampling ratio is . The CAM contains a two-layer CNN, which we found strong enough to detect character centers. The first CNN layer has a kernel size of and filters in total. The second one is in essence a pixel-wise classification layer, which takes the shared feature maps as input, and produces a heat map indicating the probability for each pixel to be a character anchor. The prediction map has the same size as the shared feature maps.

Anchor Pooling Module for Feature Gathering: The details of the APM are illustrated in Fig. 2. First, we group adjacent positive responses on the predicted character anchor heat map, so that each group indicates one character anchor. Then we take the midpoint of each group as the coordinates of that character anchor. The coordinates are sorted from left to right, and the ordered list forms the basis of the CAL. To enrich the feature sequence, we evenly sample a fixed number of markers along the sorted coordinates, and these markers make up the CAL. The markers have floating-point coordinates. For each marker in the CAL, we bilinearly interpolate a feature vector at the corresponding floating-point position on the shared feature maps. The last step is to concatenate the extracted features in order, which yields sequential features with one time step per marker.
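The three APM steps can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names, the 0.5 threshold, 4-connectivity for grouping, the group mean as the "midpoint", and equal arc-length spacing of markers are all choices made here for the example.

```python
from collections import deque
import numpy as np

def anchor_centers(heat, thresh=0.5):
    """Group 4-connected positive pixels; return group centers (x, y) sorted by x."""
    mask = heat > thresh  # assumed threshold for a "positive response"
    seen = np.zeros(mask.shape, dtype=bool)
    h, w = mask.shape
    centers = []
    for sy, sx in zip(*np.nonzero(mask)):
        if seen[sy, sx]:
            continue
        seen[sy, sx] = True
        queue, pixels = deque([(sy, sx)]), []
        while queue:  # breadth-first flood fill of one group
            y, x = queue.popleft()
            pixels.append((x, y))
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                    seen[ny, nx] = True
                    queue.append((ny, nx))
        centers.append(np.mean(pixels, axis=0))  # group mean used as the midpoint
    centers.sort(key=lambda c: c[0])  # left to right
    return np.array(centers, dtype=float).reshape(-1, 2)

def sample_markers(centers, num_markers):
    """Place markers at equal arc-length spacing along the polyline through centers."""
    if len(centers) == 0:
        raise ValueError("no character anchors detected")
    if len(centers) == 1:
        return np.repeat(centers, num_markers, axis=0)
    seg = np.linalg.norm(np.diff(centers, axis=0), axis=1)
    cum = np.concatenate([[0.0], np.cumsum(seg)])  # arc length at each center
    targets = np.linspace(0.0, cum[-1], num_markers)
    xs = np.interp(targets, cum, centers[:, 0])
    ys = np.interp(targets, cum, centers[:, 1])
    return np.stack([xs, ys], axis=1)

def bilinear_pool(features, markers):
    """Bilinearly interpolate a (C, H, W) map at float (x, y) markers -> (K, C)."""
    _, h, w = features.shape
    x = np.clip(markers[:, 0], 0, w - 1)
    y = np.clip(markers[:, 1], 0, h - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    top = features[:, y0, x0] * (1 - wx) + features[:, y0, x1] * wx
    bottom = features[:, y1, x0] * (1 - wx) + features[:, y1, x1] * wx
    return (top * (1 - wy) + bottom * wy).T

# Toy example: two anchors on a 5x8 heat map, pooled into 3 time steps.
heat = np.zeros((5, 8))
heat[1, 1:3] = 0.9
heat[3, 6] = 0.8
ys, xs = np.mgrid[0:5, 0:8]
features = np.stack([xs, ys]).astype(float)  # channel 0 = x, channel 1 = y
markers = sample_markers(anchor_centers(heat), num_markers=3)
sequence = bilinear_pool(features, markers)  # shape (3, 2)
```

In the toy example the two feature channels simply store pixel coordinates, so the pooled sequence reproduces the marker positions, which makes the interpolation easy to check by hand. Markers that fall between two detected anchors still receive a feature vector, which is how interpolation can cover a character whose anchor was missed.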

In addition, experimental results show that extracting features directly from the shared feature maps produced by the backbone results in worse performance. Therefore we add two extra CNN layers to further encode the shared feature maps. Character anchor pooling is performed on the further encoded feature maps, while character anchoring is still performed on the shared feature maps. Most previous works follow a similar practice [shi2018aster]. This is mainly because localization and recognition require different aspects of the features. The two additional CNN layers also enlarge the receptive field to cover more visual features.

Attentional Encoder-Decoder: The recognition module is an RNN-based attentional encoder-decoder. The encoder is a one-layer bidirectional LSTM. It encodes the features extracted by character anchor pooling into a sequence of hidden states. The encoder captures long-term dependencies and maintains a global view over the pooled sequential features. This is especially important when individual characters are blurred or missing. The size of the hidden state is set as for each direction. The hidden states of both directions are concatenated.

Figure 2: Anchor Pooling Module: (1) separating and grouping character center pixels to form individual character anchor; (2) sampling the CAL along generated character anchors; (3) feature pooling along CAL, and concatenation into sequence.

For the decoder, we use a unidirectional LSTM. The size of the hidden state is . The initial hidden state is set to the last hidden state of the encoder. The decoder is equipped with the attention mechanism [bahdanau2014neural]. It consists of an alignment module and an RNN decoder module. At each time step, the alignment module calculates the attention weights, which indicate the importance of every item of the encoder output.

The RNN module then produces an output and a new state.

Finally, the decoder predicts the output symbol with a softmax over the alphabet.
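One alignment step of this kind can be sketched as below. The additive scoring form follows the general recipe of [bahdanau2014neural]; the parameter names (`W_enc`, `W_dec`, `v`), the shapes, and the omission of bias terms are assumptions made for this illustration, and the LSTM update and the final softmax over the alphabet are left out.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(enc_out, dec_state, W_enc, W_dec, v):
    """One additive-attention alignment step.

    enc_out:   (T, D) encoder outputs, one row per pooled time step
    dec_state: (S,)   previous decoder hidden state
    W_enc:     (D, A), W_dec: (S, A), v: (A,)  hypothetical learned parameters
    Returns the attention weights (T,) and the context vector (D,).
    """
    scores = np.tanh(enc_out @ W_enc + dec_state @ W_dec) @ v  # (T,)
    weights = softmax(scores)       # importance of every encoder item
    context = weights @ enc_out     # weighted sum of encoder outputs
    return weights, context

# Toy shapes: 5 encoder steps of size 4, decoder state of size 3.
rng = np.random.default_rng(0)
enc_out = rng.normal(size=(5, 4))
dec_state = rng.normal(size=3)
W_enc, W_dec, v = rng.normal(size=(4, 6)), rng.normal(size=(3, 6)), rng.normal(size=6)
weights, context = attention_step(enc_out, dec_state, W_enc, W_dec, v)
```

The context vector would then be fed, together with the previous symbol, into the decoder LSTM to produce the next output and state.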


2.3 Training Targets

The network is trained jointly to match the ground-truth labels of character centers and text sequences. The character anchor heat map is a binary prediction at the pixel level, so the localization loss is defined as the binary cross entropy over all pixels in the feature maps.

The recognition module predicts a symbol sequence. In training, we pad the ground-truth text sequence with end-of-sentence (EOS) symbols to handle variable lengths. The recognition loss is defined as the negative log-likelihood of the ground-truth sequence, averaged over time steps.

The training target is the weighted sum of the localization loss and the recognition loss. The two weights are set to by default in our experiments.
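The training target can be sketched numerically as follows. This is an illustration under assumptions: the function names are hypothetical, the binary cross entropy is averaged over pixels, and the default weights of 1.0 are placeholders rather than the values used in the experiments.

```python
import numpy as np

def bce_loss(pred, target, eps=1e-7):
    """Binary cross entropy between a predicted heat map and a 0/1 target map.

    Assumption: the loss is averaged over all pixels.
    """
    p = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(target * np.log(p) + (1 - target) * np.log(1 - p)))

def sequence_nll(probs, labels):
    """Negative log-likelihood averaged over time steps.

    probs:  (T, V) per-step softmax outputs over the alphabet plus EOS
    labels: (T,)   ground-truth symbol indices, already padded with EOS
    """
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked)))

def total_loss(loc_loss, rec_loss, w_loc=1.0, w_rec=1.0):
    """Weighted sum of both losses; the default weights are placeholders."""
    return w_loc * loc_loss + w_rec * rec_loss

# Toy check: an uninformative predictor on both tasks.
heat_pred = np.full((4, 4), 0.5)
heat_gt = np.zeros((4, 4))
heat_gt[1, 2] = 1.0
probs = np.full((3, 4), 0.25)
labels = np.array([0, 2, 3])
loss = total_loss(bce_loss(heat_pred, heat_gt), sequence_nll(probs, labels))
```

With uniform predictions the two terms reduce to ln 2 and ln 4 respectively, which gives a quick sanity check for an implementation.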

3 Experiment

Datasets: Following previous works, we train our network solely on synthetic datasets, SynthText [gupta2016synthetic] and Synth90K [jaderberg2014synthetic]. There are image crops in SynthText, which are annotated at the character level. Synth90K contains 9M greyscale images. It has a balanced distribution over a 90K-word vocabulary, and provides only word-level ground-truth annotations.

We evaluate the trained network on various real-world datasets: IIIT5K [mishra2012scene], SVT [wang2010word], SVT-Perspective [quy2013recognizing], IC03 [lucas2003icdar], IC13 [karatzas2013icdar], IC15, CUTE [risnumawan2014robust] and Total-Text [kheng2017total].

| Methods | Backbone, Data | IIIT5K (50) | IIIT5K (1K) | IIIT5K (0) | SVT (50) | SVT (0) | IC03 (50) | IC03 (Full) | IC03 (0) | IC13 (0) | IC15 (0) | SVT-P (0) | CUTE (0) | Total-Text (0) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Aster [shi2018aster] | ResNet, 90K+ST | 99.6 | 98.8 | 93.4 | 97.4 | 89.5 | 98.8 | 98.0 | 94.5 | 91.8 | 76.1 | 78.5 | 79.5 | - |
| CA-FCN [liao2019two] | Attentional VGG, ST + extra ST | 99.8 | 98.9 | 92.0 | 98.8 | 86.4 | - | - | - | 91.5 | - | - | 79.9 | 61.6 |
| 2D Attn [li2018show] | ResNet, ST + 90K + extra ST | - | - | 91.5 | - | 84.5 | - | - | - | 91.0 | 69.2 | 76.4 | 83.3 | - |
| CAPNet | ResNet, 90K+ST | 99.8 | 98.8 | 93.7 | 98.9 | 88.9 | 99.3 | 97.8 | 94.6 | 92.4 | 76.6 | 78.8 | 86.8 | 62.7 |

Table 1: Performance of different methods on the evaluation datasets. “50”, “1K”, “Full” are the sizes of lexicons. “0” means no lexicon. “90K” and “ST” are the Synth90k and the SynthText datasets, respectively. “ST ” means including character-level annotations. “Private” means private training data. “-” means the score is not reported in the paper.
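
| Methods | IIIT5K | SVT | IC03 | IC13 | IC15 | SVT-P | CUTE | Total-Text |
|---|---|---|---|---|---|---|---|---|
| CA-FCN + RNN | 92.1 | 87.0 | 93.9 | 91.6 | 71.3 | 72.2 | 80.6 | 62.0 |
| CAPNet + VGG | 93.1 | 87.5 | 94.3 | 92.1 | 74.9 | 77.1 | 86.3 | 62.9 |
| CAPNet | 93.7 | 88.9 | 94.6 | 92.4 | 76.6 | 78.8 | 86.8 | 62.7 |

Table 2: Results of ablation tests compared with CAPNet.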

Implementation: We use a two-stage strategy to jointly train the localization and recognition modules. In the first training stage, character anchor pooling is performed using ground-truth character centers. This stage requires character-level annotations, and therefore the network is trained only on SynthText, which provides character-level annotations at no extra labeling cost. In the second stage, character anchor pooling is performed using the output produced by the CAPM. In this stage, the network is trained on both SynthText and Synth90K and learns to calibrate its own outputs. The localization loss is computed and averaged only over SynthText data, while the recognition loss is computed over both. We step into the second stage of training once the moving average of falls below for the first time. We train epochs in total. The learning rate is set to for the first epoch, and decays to at the second epoch. At the last two epochs, the learning rate is set to .
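The switch between the two training stages can be sketched with a small helper. The class name `StageSwitch`, the exponential form of the moving average, the momentum value, and the threshold passed in the example are all hypothetical; the text above does not preserve which quantity is monitored or what threshold is used.

```python
class StageSwitch:
    """Track a moving average of a monitored loss and switch stage once.

    Stage 1: pool along ground-truth character centers.
    Stage 2: pool along the anchors predicted by the model itself.
    """

    def __init__(self, threshold, momentum=0.9):
        self.threshold = threshold  # placeholder value, chosen by the user
        self.momentum = momentum    # assumed exponential moving average
        self.average = None
        self.stage = 1

    def update(self, loss_value):
        """Feed one loss value; return the stage to use for the next step."""
        value = float(loss_value)
        if self.average is None:
            self.average = value
        else:
            self.average = self.momentum * self.average + (1 - self.momentum) * value
        # The switch is one-way: a later rise in the loss does not go back.
        if self.stage == 1 and self.average < self.threshold:
            self.stage = 2
        return self.stage

switch = StageSwitch(threshold=0.5, momentum=0.5)
stages = [switch.update(v) for v in (1.0, 0.4, 0.2, 2.0)]
```

In a training loop, the returned stage would decide whether the pooling step receives ground-truth centers or predicted anchors for the next batch.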

The network is implemented with PyTorch. The length of a pooled feature sequence is set to to match the length of feature sequences in most previous works. All words are padded to tokens with the special symbol EOS in training. The recognition module recognizes ten digits and 26 case-insensitive letters. During training and evaluation, images are resized to . For data augmentation, we apply random Gaussian noise and motion blur. To train our network, we use the ADADELTA [zeiler2012adadelta] optimizer with default parameters on mini-batches of randomly selected samples. All experiments are performed on NVIDIA GeForce 1080 Ti GPUs, each with 12GB memory.
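The label preparation described above can be sketched as follows. The 36-symbol alphabet follows the text (ten digits and 26 case-insensitive letters); the function name `encode_word`, the choice of index 36 for EOS, the handling of characters outside the alphabet, and the `max_len` value in the example are assumptions.

```python
import string

ALPHABET = string.digits + string.ascii_lowercase  # 10 digits + 26 letters
EOS = len(ALPHABET)                                 # assumed index for EOS: 36

def encode_word(word, max_len):
    """Map a word to symbol indices and pad with EOS up to max_len tokens.

    Assumptions: characters outside the alphabet are dropped, and at least
    one EOS is always kept, so long words are truncated to max_len - 1.
    """
    ids = [ALPHABET.index(ch) for ch in word.lower() if ch in ALPHABET]
    ids = ids[: max_len - 1]
    return ids + [EOS] * (max_len - len(ids))

encoded = encode_word("Ab1", max_len=6)  # [10, 11, 1, 36, 36, 36]
```

Padding every label to the same length lets a batch of words be stored as one integer tensor, with EOS marking where each word ends.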

Performance on Straight Text: We achieve better performance on some of the straight text datasets, including IIIT5K, IC03, and IC15. We also match previous methods on other straight text datasets, including SVT and IC13. Results are shown in Tab. 1. Our method is thus competitive with previous methods in recognizing straight text, and better on some datasets.

Performance on Curved Text: On curved text, we outperform the previous state-of-the-art rectification-based method [shi2018aster] by an absolute improvement of 7.3 points on CUTE (86.8 vs. 79.5 in Tab. 1). CAPNet also scores higher than the 2D attention baseline [li2018show] by 3.5 points on CUTE, 7.4 points on IC15, and 2.4 points on SVT-P. This performance supports the effectiveness of our method. For a more comprehensive comparison, we also evaluate our method on Total-Text, a large curved text dataset containing more data than CUTE. Our method still outperforms the previous best result reported in Tab. 1 by 1.1 points (62.7 vs. 61.6).

Ablation Study: We make two variants of CAPNet by (1) changing the backbone of CAPNet from ResNet-50 to VGG-16, and (2) adding the same RNN module as in CAPNet to CA-FCN. As shown in Table 2, CAPNet outperforms the two variants on most datasets, which shows that ResNet-50 is more suitable than VGG-16 in CAPNet and that the improvements come not only from sequence learning but also from the combination of CAM and APM.

The Quality of the Predicted CAL: To estimate the quality of the predicted CAL quantitatively, we consider the correlation coefficients of the x and y coordinates between the predicted CAL and the ground-truth CAL. The higher the correlation, the better the control points that CAPM produces.

We evaluate on Total-Text, which provides word bounding polygons from which the ground-truth CAL can be computed. We compute the CAL correlation coefficients for each image, and draw a 2D scatter diagram of all test samples. The diagram is shown in Fig. 3. In (a), we illustrate what character anchors look like at different levels of correlation. Nearly all scatter points fall in the top-right corner of the diagram, which indicates that, for most images, CAPM performs well enough to accurately sketch the shape of the text. In (b), we analyse the distances from the predicted CAL to the text head and tail regions. Most scatter points fall in the bottom-left corner of the diagram. These observations suggest that the predicted CALs are accurate in most cases.
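The per-image correlation measure can be sketched as below. This is an illustrative reading of the description, not the authors' evaluation script: it assumes that the predicted and ground-truth CAL are sampled with the same number of markers in the same order, and it uses the Pearson correlation from `np.corrcoef`.

```python
import numpy as np

def cal_correlation(pred, gt):
    """Pearson correlation of x and of y coordinates between two CALs.

    pred, gt: (K, 2) arrays of (x, y) markers in matching order.
    Returns (r_x, r_y), i.e. one point in the 2D scatter diagram.
    Note: if a coordinate is constant (e.g. y on perfectly horizontal
    text), the correlation is undefined and NumPy returns nan.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if pred.shape != gt.shape or pred.ndim != 2 or pred.shape[1] != 2:
        raise ValueError("pred and gt must both have shape (K, 2)")
    r_x = np.corrcoef(pred[:, 0], gt[:, 0])[0, 1]
    r_y = np.corrcoef(pred[:, 1], gt[:, 1])[0, 1]
    return float(r_x), float(r_y)

# Toy curved line and a prediction that follows it up to scale and offset.
gt_cal = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 4.0], [3.0, 9.0]])
pred_cal = gt_cal * 2.0 + 1.0
r_x, r_y = cal_correlation(pred_cal, gt_cal)
```

Because correlation is invariant to scale and offset, a value near 1 indicates that the predicted line has the right shape, but not necessarily the right extent; that is consistent with the head and tail distances being analysed separately in panel (b).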

Figure 3: (a): Correlation coefficients over all images in Total-Text, with samples marked in the diagram. (b): Distance (length of missing segments) from the CAL to the text head and tail (marked in red).

4 Conclusion

In this paper, we investigate how to realize sequence learning in two-dimensional space for text recognition, and propose the Character Anchor Pooling Module (CAPM). CAPM pools features along character anchors, which are formed based on the localization of character centers. Our proposed CAPM localizes text flexibly, and provides a better basis for the subsequent sequence learning. The localization module is shape-agnostic, and can therefore produce accurate outputs on curved text even if it is trained mostly on straight text. In experiments, we find that our method outperforms existing methods on both straight and curved text datasets, which demonstrates its effectiveness. We also perform an in-depth analysis of the predicted character anchors, showing that they remain reliable even in some difficult situations. In conclusion, our paper is an attempt to find a proper representation for irregular scene text recognition.