Scene text recognition has been an increasingly popular research topic in computer vision over the last few decades. As a carrier of high-level semantic information, text in natural images is valuable for understanding the surrounding scene [long2018scene]. Applications include instant translation, robot navigation, industrial automation, and traffic sign reading for autonomous vehicles. More recently, the detection and recognition of irregular text, e.g., text arranged along a curved line, has attracted much attention [shi2016robust, long2018textsnake].
As deep learning has been widely applied to this field, most recent methods follow an encoder-decoder framework in which images are decomposed into a sequence of pixel frames, from the left side of the image to the right [shi2016robust, cheng2017arbitrarily, yin2017scene, yang2017learning]. This framework can be summarized and termed Convolutional Recurrent Neural Networks (CRNN). The compressed feature map is treated as a sequence of feature vectors, one per time step, which is then further encoded and decoded by a Seq2Seq model [sutskever2014sequence]. We refer to this feature encoding process as feature gathering, which transforms 2D images into 1D feature sequences. These methods produce good results when the text in the image is horizontal.
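As an illustration of this feature gathering step, the following sketch (with illustrative shapes, not the exact dimensions of any particular model) collapses a 2D convolutional feature map into a width-major sequence of feature vectors:

```python
import numpy as np

# Illustrative shapes only: a backbone has compressed the input image
# into a feature map with C channels, height H, and width W.
C, H, W = 512, 1, 25
feature_map = np.random.randn(C, H, W)

# Feature gathering: each of the W columns becomes one time step,
# giving a sequence of W vectors of dimension H * C for the Seq2Seq model.
sequence = feature_map.transpose(2, 0, 1).reshape(W, H * C)
assert sequence.shape == (25, 512)
```

Each time step is simply one vertical slice of the feature map, which is why this scheme works well for horizontal text but breaks down when characters do not line up with the image columns.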
For oriented and irregular text, the use of a rectification layer [shi2016robust, shi2018aster]
based on Spatial Transformer Networks [jaderberg2015spatial] alleviates the problem to some extent. The rectification layer first predicts a bounding polygon to precisely locate the text, then generates sampling grids according to the polygon, and finally transforms the image. The idea is intuitive and has proven effective. However, the coordinates of the bounding polygon are predicted by fully connected networks, and the prediction fails when the text has a shape that is poorly represented in the training dataset. The fact that polygon prediction is shape-sensitive and may not generalize well to unseen shapes limits the potential of rectification-based methods. A similar problem also exists in the 2D attention method [li2018show], as evidenced by less competitive scores on blurred datasets.
Recently, CA-FCN [liao2019two] took the two-dimensional spatial distribution of text into consideration and reformulated text recognition as semantic segmentation, where character categories are segmented from the background. However, this method abandons recurrent neural networks (RNN) and thus lacks an overall view of the text. It is therefore prone to missing characters, and its performance drops significantly when the text is blurred.
To tackle the challenges mentioned above, we design a flexible feature gathering method that deeply integrates sequence learning with the two-dimensional spatial distribution of text.
The key step is to gather feature vectors from the shared feature maps along a character anchor line (CAL) to form sequential features for subsequent sequence-to-sequence learning. To achieve this, we design two novel modules: a Character Anchoring Module (CAM) that anchors characters and an Anchor Pooling Module (APM) that forms sequential features. The CAM detects character center anchors individually in the form of a heat map, which is inherently shape-agnostic and therefore adapts better to irregular text, even with unseen shapes. The APM then interpolates feature vectors along the CAL and gathers them into a sequence. Because the interpolation does not rely solely on detected character centers, it remains robust under challenging image conditions by filling in missing characters. While the shape-insensitive CAM robustly guides the APM along the text features to form the sequence, the APM can correct errors made by the CAM. Based on these two complementary modules, we propose a recognition model termed Character Anchor Pooling Network (CAPNet). Compared with previous methods, ours successfully and harmoniously unifies shape-agnostic localization and sequence learning. Our methodology is illustrated in Fig. 1.
The contributions of our paper can be summarized as follows. (1) We propose two innovative and complementary modules, the Character Anchoring Module and the Anchor Pooling Module, which harmonize sequence learning with the two-dimensional spatial arrangement of text. (2) Empirically, the proposed CAPNet outperforms previous state-of-the-art results on irregular and perspective datasets, including ICDAR 2015, CUTE, and Total-Text. It also outperforms previous methods on several horizontal text datasets such as IIIT5K and ICDAR 2013, while matching them on the others. (3) We provide in-depth qualitative and quantitative analysis as well as ablation studies to further understand its strengths and dependencies.
As shown in Fig. 1, the pipeline of CAPNet contains the following components: (1) a fully convolutional backbone for feature encoding, producing shared feature maps for the following steps; (2) the Character Anchor Pooling Module (CAPM), which comprises the CAM and the APM; (3) an RNN-based recognition module that encodes the pooled features and decodes them into text symbols.
2.2 Model Architecture
Backbone: We use a standard 50-layer ResNet [He_2017_Res] as the backbone, with FPN [Lin_2017_CVPRPyramid] connections to integrate features from different stages. Upsampling in the FPN branch is performed by bilinear interpolation; other settings follow [Lin_2017_CVPRPyramid]. All CNN layers in the FPN use the same number of filters, with one kernel size for lateral connections and another for top-down connections. The network produces shared feature maps that are one-quarter of the size of the input image.
Character Anchoring Module: To localize a text instance flexibly and robustly, we propose to anchor each individual character instead. Note that in bounding polygon regression [shi2018aster], which localizes the text as a whole, each control point depends on the overall shape. This results in shape sensitivity. In contrast, the localization of a character does not depend on the rest of the image, so detecting characters is insensitive to text shape by design. For easy separation of adjacent character anchors, we define character anchors as shrunk character boxes, scaled down by a fixed ratio. The CAM contains a two-layer CNN, which we found sufficient to detect character centers. The first layer is a standard convolution; the second is in essence a pixel-wise classification layer, which takes the shared feature maps as input and produces a heat map indicating the probability of each pixel being a character anchor. The prediction map has the same size as the shared feature maps.
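As a rough sketch of how such a heat-map target might be generated, the following marks pixels inside shrunk character boxes as positive. The shrink ratio and the axis-aligned box format are assumptions of this illustration, not values from the paper:

```python
import numpy as np

def shrink_box(x0, y0, x1, y1, ratio=0.25):
    """Shrink an axis-aligned character box toward its center.

    The shrink ratio here is an assumed value for illustration."""
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    w, h = (x1 - x0) * ratio, (y1 - y0) * ratio
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

def anchor_heatmap(boxes, height, width, ratio=0.25):
    """Binary CAM target: pixels inside shrunk character boxes are positive."""
    target = np.zeros((height, width), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]
    for box in boxes:
        x0, y0, x1, y1 = shrink_box(*box, ratio=ratio)
        target[(xs >= x0) & (xs <= x1) & (ys >= y0) & (ys <= y1)] = 1.0
    return target

# Two adjacent character boxes; after shrinking, their positive regions
# stay separated, which is the purpose of using shrunk boxes.
hm = anchor_heatmap([(2, 2, 10, 10), (14, 2, 22, 10)], height=12, width=24)
```

Shrinking keeps the positive regions of neighboring characters disjoint, so adjacent anchors can be separated by simple connected-component grouping.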
Anchor Pooling Module for Feature Gathering: The details of the APM are illustrated in Fig. 2. First, we separate and aggregate adjacent positive responses on the predicted character anchor heat map into groups, where each group indicates one character anchor. We then take the midpoint of each group to produce the coordinates of each character anchor. The coordinates are sorted from left to right, and the ordered list forms the basis of the CAL. To enrich the feature sequence, we evenly sample a fixed number of markers along the sorted coordinates, which forms the CAL. The markers have floating-point coordinates. For each marker in the CAL, we bilinearly interpolate a feature vector from the corresponding floating-point position on the shared feature maps. The last step is to concatenate the extracted features in order, yielding sequential features with one time step per marker.
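The pooling procedure can be sketched as follows; the helper names, the arc-length resampling, and all sizes are our own illustration of the steps described above, not the paper's implementation:

```python
import numpy as np

def bilinear_sample(fmap, x, y):
    """Bilinearly interpolate a C-dim vector at float coords (x, y)
    from a feature map of shape (C, H, W). Borders are not handled
    in this illustrative version."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = x0 + 1, y0 + 1
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * fmap[:, y0, x0]
            + dx * (1 - dy) * fmap[:, y0, x1]
            + (1 - dx) * dy * fmap[:, y1, x0]
            + dx * dy * fmap[:, y1, x1])

def anchor_pool(fmap, centers, n_markers=25):
    """Pool a feature sequence along the CAL.

    centers: character anchor centers as (x, y) pairs. They are sorted
    left-to-right, and n_markers points are evenly resampled along the
    resulting polyline by arc length."""
    pts = np.array(sorted(centers, key=lambda p: p[0]), dtype=np.float64)
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)   # segment lengths
    cum = np.concatenate([[0.0], np.cumsum(seg)])        # cumulative arc length
    targets = np.linspace(0.0, cum[-1], n_markers)       # evenly spaced markers
    markers = np.stack([np.interp(targets, cum, pts[:, i]) for i in (0, 1)],
                       axis=1)
    return np.stack([bilinear_sample(fmap, mx, my) for mx, my in markers])

fmap = np.random.default_rng(0).standard_normal((32, 16, 64))  # (C, H, W)
centers = [(5.0, 8.0), (20.0, 6.0), (40.0, 8.0)]               # detected anchors
seq = anchor_pool(fmap, centers, n_markers=25)
assert seq.shape == (25, 32)   # 25 time steps of C-dim feature vectors
```

Because the markers are resampled along the polyline rather than taken only at detected centers, the sequence stays dense even when a character detection is missed.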
Experimental results show that extracting features directly from the shared feature maps produced by the backbone results in worse performance. We therefore add two extra CNN layers to further encode the shared feature maps. Character anchor pooling is performed on the further encoded feature maps, while character anchoring is still performed on the shared feature maps. Most previous works follow a similar practice [shi2018aster], mainly because localization and recognition require different aspects of the features. The two additional CNN layers also enlarge the receptive field to cover more visual features.
The recognition module is an RNN-based attentional encoder-decoder. The encoder is a one-layer bidirectional LSTM. It encodes the features extracted by character anchor pooling into a sequence of hidden states. The encoder captures long-term dependencies and maintains an overall view of the pooled sequential features, which is especially important when individual characters are blurred or missing. The hidden states of the two directions are concatenated.
For the decoder, we use a unidirectional LSTM whose hidden state at the first time step is initialized to the last hidden state of the encoder. The decoder is equipped with the attention mechanism [bahdanau2014neural] and consists of an alignment module and an RNN decoder module. At time step $t$, the alignment module calculates the attention weights $\alpha_t$, which indicate the importance of every item of the encoder output $h$:

$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j}\exp(e_{t,j})}, \qquad e_{t,i} = v_a^\top \tanh\left(W_a s_{t-1} + U_a h_i\right)$$
The RNN module takes the context vector $c_t = \sum_i \alpha_{t,i} h_i$ together with the previous symbol as input, and produces an output $o_t$ and a new state $s_t$:

$$(o_t, s_t) = \mathrm{LSTM}\left([y_{t-1};\, c_t],\; s_{t-1}\right)$$
Given the alphabet set $\mathcal{A}$, the decoder takes $o_t$ as input and predicts the output symbol $y_t \in \mathcal{A}$ with softmax:

$$p(y_t) = \mathrm{softmax}\left(W_o\, o_t + b_o\right)$$
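A minimal numpy sketch of one such attention step follows, using additive scoring as in [bahdanau2014neural]. The matrix names (`Wa`, `Ua`, `va`) and all sizes are illustrative, and the LSTM update itself is omitted:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_step(enc_out, s_prev, Wa, Ua, va):
    """One additive-attention alignment step.

    enc_out: encoder outputs h_1..h_T, shape (T, D)
    s_prev:  previous decoder state, shape (S,)
    Returns attention weights alpha (T,) and context vector c (D,)."""
    scores = np.tanh(enc_out @ Ua.T + s_prev @ Wa.T) @ va  # e_{t,i}
    alpha = softmax(scores)                                # alignment weights
    context = alpha @ enc_out                              # weighted sum of h_i
    return alpha, context

T, D, S, A = 25, 512, 512, 256   # illustrative sizes
rng = np.random.default_rng(0)
enc_out = rng.standard_normal((T, D))
s_prev = rng.standard_normal(S)
Wa = rng.standard_normal((A, S))
Ua = rng.standard_normal((A, D))
va = rng.standard_normal(A)
alpha, context = attention_step(enc_out, s_prev, Wa, Ua, va)
```

The context vector is then concatenated with the previous output symbol and fed to the decoder LSTM.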
2.3 Training Targets
The network is trained jointly to match the ground-truth labels of character centers and the text sequence. The character anchor heat map is a binary prediction at the pixel level. Given $N$ pixels in the feature maps with predicted probabilities $p_i$ and binary labels $y_i$, the loss is the binary cross entropy:

$$L_{loc} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$$
The recognition module predicts a symbol sequence. We denote the ground truth of the text sequence as $y_1, \dots, y_T$. In training, we pad the sequence with the end-of-sentence (EOS) symbol to handle variable lengths. The loss is the negative log-likelihood averaged over time:

$$L_{rec} = -\frac{1}{T}\sum_{t=1}^{T} \log p(y_t)$$
The training target is the weighted sum of the localization loss and the recognition loss:

$$L = \lambda_{loc} L_{loc} + \lambda_{rec} L_{rec}$$

where $\lambda_{loc}$ and $\lambda_{rec}$ are weights that balance the two losses and are fixed by default in our experiments.
Datasets: Following previous works, we train our network solely on the synthetic datasets SynthText [gupta2016synthetic] and Synth90K [jaderberg2014synthetic]. SynthText provides image crops annotated at the character level. Synth90K contains 9M greyscale images; it has a balanced distribution over a 90K-word vocabulary, but only word-level ground-truth annotations.
We evaluate the trained network on various real-world datasets: IIIT5K [mishra2012scene], SVT [wang2010word], SVT-Perspective [quy2013recognizing], IC03 [lucas2003icdar], IC13 [karatzas2013icdar], IC15, CUTE [risnumawan2014robust] and Total-Text [kheng2017total].
| Aster [shi2018aster] | ResNet, 90K+ST | 99.6 | 98.8 | 93.4 | 97.4 | 89.5 | 98.8 | 98.0 | 94.5 | 91.8 | 76.1 | 78.5 | 79.5 | - |
| CA-FCN [liao2019two] | Attentional VGG, ST + extra ST | 99.8 | 98.9 | 92.0 | 98.8 | 86.4 | - | - | - | 91.5 | - | - | 79.9 | 61.6 |
| 2D Attn [li2018show] | ResNet, ST + 90K + extra ST | - | - | 91.5 | - | 84.5 | - | - | - | 91.0 | 69.2 | 76.4 | 83.3 | - |
“50”, “1K”, and “Full” are the sizes of lexicons; “0” means no lexicon. “90K” and “ST” are the Synth90K and SynthText datasets, respectively; “ST” means including character-level annotations. “Private” means private training data. “-” means the score is not reported in the paper.
| CAPNet + VGG | 93.1 | 87.5 | 94.3 | 92.1 | 74.9 | 77.1 | 86.3 | 62.9 |
Implementation: We use a two-stage strategy to jointly train the localization and recognition modules. In the first stage, character anchor pooling is performed using ground-truth character centers. This stage requires character-level annotations, so the network is trained only on SynthText, which provides them. In the second stage, character anchor pooling is performed using the output of the CAPM, and the network is trained on both SynthText and Synth90K, learning to calibrate its own outputs. The localization loss is computed and averaged only over SynthText data, while the recognition loss is computed with both. We switch to the second stage once the moving average of the loss falls below a threshold for the first time. The learning rate is set to an initial value for the first epoch, decayed at the second epoch, and reduced again for the last two epochs.
The network is implemented in PyTorch. The length of a pooled feature sequence is set to match the length of feature sequences in most previous works. All words are padded to a fixed length with the special EOS symbol during training. The recognition module recognizes the ten digits and 26 case-insensitive letters. During training and evaluation, images are resized to a fixed size. For data augmentation, we apply random Gaussian noise and motion blur. We train the network with the ADADELTA [zeiler2012adadelta] optimizer with default parameters on mini-batches of randomly selected samples. All experiments are performed on NVIDIA GeForce GTX 1080 Ti GPUs, each with 11GB of memory.
Performance on Straight Text: We achieve better performance on several of the straight text datasets, including IIIT5K, IC03, and IC15, and we match previous methods on the others, including SVT and IC13. Results are shown in Tab. 1. Our method is therefore no worse than previous methods at recognizing straight text, and better on some datasets.
Performance on Curved Text: On curved datasets, we outperform the previous state-of-the-art rectification-based method [shi2018aster] by an absolute margin on CUTE. CAPNet also achieves higher scores than the 2D attention baseline [li2018show] on CUTE, IC15, and SVT-P. This superior performance verifies the effectiveness of our method. For a more comprehensive comparison, we also evaluate on Total-Text, a large curved text dataset containing more data than CUTE; our method still outperforms the previous state-of-the-art result.
Ablation Study: We create two variants of CAPNet by (1) changing the backbone from ResNet-50 to VGG-16 and (2) adding the same RNN module used in CAPNet to CA-FCN. As shown in Table 2, CAPNet outperforms both variants, which shows that ResNet-50 is more suitable than VGG-16 for CAPNet, and that the improvement comes not only from sequence learning but also from the combination of CAM and APM.
The Quality of the Predicted CAL:
To estimate the quality of the predicted CAL quantitatively, we consider the correlation coefficients of the $x$ and $y$ coordinates between the predicted CAL and the ground-truth CAL. The higher the correlation, the better the control points that the CAPM produces.
We evaluate on Total-Text, which provides word bounding polygons from which the ground-truth CAL can be computed. We compute the CAL correlation coefficients for each image and draw a 2D scatter diagram over all test samples, shown in Fig. 3. In (a), we give an illustrative description of what character anchors look like at different levels of correlation. Nearly all scatter points fall in the top-right corner of the diagram, indicating that for most images the CAPM sketches the shape of the text accurately. In (b), we analyse the distances from the predicted CAL to the text head and tail regions; most scatter points fall in the bottom-left corner. These observations verify that the predicted CALs are nearly flawless in most cases.
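As a sketch of this measurement (the synthetic CAL coordinates and the noise level are illustrative; `np.corrcoef` computes the Pearson correlation):

```python
import numpy as np

def cal_correlation(pred_cal, gt_cal):
    """Pearson correlation of x and y coordinates between a predicted
    CAL and the ground-truth CAL, both given as (N, 2) marker arrays."""
    rx = np.corrcoef(pred_cal[:, 0], gt_cal[:, 0])[0, 1]
    ry = np.corrcoef(pred_cal[:, 1], gt_cal[:, 1])[0, 1]
    return rx, ry

# A synthetic curved ground-truth CAL and a slightly perturbed prediction.
t = np.linspace(0.0, 1.0, 25)
gt = np.stack([t * 100.0, 10.0 * np.sin(t * np.pi)], axis=1)
pred = gt + np.random.default_rng(0).normal(0.0, 0.5, gt.shape)
rx, ry = cal_correlation(pred, gt)
```

A prediction that closely follows the ground-truth line yields correlations near 1 in both coordinates, which is what the top-right cluster in Fig. 3 reflects.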
In this paper, we investigate how to realize sequence learning in two-dimensional space for text recognition, and propose the Character Anchor Pooling Module (CAPM). The CAPM pools features along a character anchor line formed from localized character centers. It localizes text flexibly and provides a better basis for the subsequent sequence learning. The localization module is shape-agnostic, and can therefore produce accurate outputs on curved text even when trained mostly on straight text. In experiments, our method outperforms existing methods on both straight and curved text datasets, demonstrating its effectiveness. We also perform an in-depth analysis of the CAM, showing that it remains reliable even in difficult situations. In conclusion, this paper takes a step toward finding a proper representation for irregular scene text recognition.