FACLSTM: ConvLSTM with Focused Attention for Scene Text Recognition

04/20/2019 ∙ by Qingqing Wang, et al.

Scene text recognition has recently been widely treated as a sequence-to-sequence prediction problem, where the traditional fully-connected-LSTM (FC-LSTM) has played a critical role. Due to the limitations of FC-LSTM, existing methods have to convert 2-D feature maps into 1-D sequential feature vectors, which severely damages the valuable spatial and structural information of text images. In this paper, we argue that scene text recognition is essentially a spatiotemporal prediction problem because of its 2-D image inputs, and propose a convolutional LSTM (ConvLSTM)-based scene text recognizer, namely FACLSTM, i.e., Focused Attention ConvLSTM, in which the spatial correlation of pixels is fully leveraged when performing sequential prediction with LSTM. In particular, the attention mechanism is incorporated into an efficient ConvLSTM structure via convolutional operations, and additional character center masks are generated to help focus attention on the right feature areas. The experimental results on the benchmark datasets IIIT5K, SVT and CUTE demonstrate that our proposed FACLSTM performs competitively on regular, low-resolution and noisy text images, and outperforms the state-of-the-art approaches on curved text by large margins.




1 Introduction

Scene text recognition has received considerable attention from the computer vision community, since text is an essential means of conveying information and knowledge. Although many efforts have been made in past decades, scene text recognition is still an unsolved task, owing to the challenges posed by poor image quality (e.g., low resolution, blur and uneven illumination) and varied text appearance (e.g., size, font, color, direction, perspective view and complex background), as shown in Fig. 1.

Figure 1: Challenging samples of scene text recognition.

Inspired by speech recognition and machine translation, most recent state-of-the-art approaches regard scene text recognition as a sequence-to-sequence prediction problem and widely adopt techniques like LSTM [10] and the attention mechanism [1, 5] in their sequential transcription module. However, the LSTM used in these recognizers is the fully-connected-LSTM (FC-LSTM), which only takes stream signals like sentences or audio as inputs and connects them in a fully connected way, while scene text recognition generates sequential outputs from 2-D images. To adapt FC-LSTM to scene text recognition, the most straightforward way is to pool 2-D feature maps to a height of one or to flatten them into 1-D sequential feature vectors [8, 3, 4, 21, 22], as shown in Fig. 2(a). Unfortunately, such operations can severely disrupt the valuable spatial correlations among pixels, which are essential to computer vision tasks, especially to scene text recognition, where the structures of strokes are the key factors in discriminating characters. To retain such important spatial and structural information, researchers have also explored alternative solutions. For example, STN-OCR [2] directly performed sequential prediction on 2-D feature maps with a fixed number of softmax classifiers, and CA-FCN [16] generated character-level confidence maps with a fully convolutional network, as shown in Fig. 2(b). However, compared with LSTM, these solutions often introduce additional parameters or post-processing steps.

Figure 2: Current solutions for scene text recognition. When LSTM is used, 2-D feature maps are usually converted to 1-D space by pooling or flattening operations. When LSTM is not used, additional parameters or post-processing steps are involved.

In this paper, we propose to address the issue of scene text recognition from the perspective of spatiotemporal prediction, where the spatial correlation information is taken into account when performing sequential prediction with LSTM. The ConvLSTM proposed by Shi et al. [24] for precipitation nowcasting provides some insights on how to achieve this. In ConvLSTM, all of the fully connected operations are replaced by convolutional ones, so input feature maps are allowed to keep their 2-D shape when being fed into the ConvLSTM. Given this advantage, for the first time, we introduce ConvLSTM to scene text recognition and apply it in the sequential transcription module of our proposed recognizer.

However, in existing models, both FC-LSTM and ConvLSTM are used only for frame-level prediction and are incapable of producing sequential outputs from one single input image unless the Connectionist Temporal Classification (CTC) [7, 8, 21] or attention mechanism [3, 4, 22, 27] is incorporated. To perform sequential prediction and, meanwhile, provide the model spatial awareness, we further improve ConvLSTM by embedding the attention mechanism into the structure. Notably, different from the existing attention-LSTM-based recognizers, where the attention mechanism and FC-LSTM are combined in a fully connected way, we properly integrate the attention mechanism into ConvLSTM with the convolutional operations. Moreover, as ConvLSTM extends 2-D operations into 3-D, the costs of computation and memory increase significantly. To achieve high efficiency, inspired by Liu et al. [17], we propose to assemble a bottleneck gate at the beginning of the proposed attention-equipped ConvLSTM, so that the internal feature map channels can be reduced.

Last but not least, since existing attention-based recognizers often suffer from the ‘attention drift’ problem [3], i.e., they fail to align target outputs to the proper feature areas, we propose to learn additional character center masks with a second decoder branch in the encoder-decoder feature extraction stage to help the proposed network focus attention on the right feature areas. The experimental results on benchmark datasets demonstrate that our proposed recognizer achieves performance comparable with the state-of-the-art approaches on regular, low-resolution and noisy text, and outperforms other methods significantly on the more challenging curved text.

The contributions made in this work are summarized as follows. (1) We propose to handle the scene text recognition problem from a spatiotemporal prediction perspective and for the first time introduce ConvLSTM to this application. (2) We design a ConvLSTM-based sequential transcription module, where the attention mechanism is harmoniously embedded into ConvLSTM with convolutional operations, and the bottleneck gate is assembled at the beginning of ConvLSTM to retain its efficiency. (3) We propose to learn additional character center masks to help the proposed network focus attention on the center of characters.

In the rest of this paper, we first review the most related works in Section 2. Then, the details of our proposed approach and designed experiments are presented in Sections 3 and 4, respectively. Finally, the conclusions are given in Section 5.

2 Related works

The existing scene text recognizers can be grouped into two categories, i.e., those utilizing traditional techniques and those based on deep learning techniques. Methods belonging to the first category were mainly proposed before 2015 and follow a bottom-up routine, i.e., detecting and recognizing individual characters first, followed by word formation. Ye et al. [28] provided a comprehensive survey of these methods. By contrast, the deep learning-based recognizers depend on end-to-end trainable deep networks, where feature extraction and sequential translation are integrated into one unified framework. According to the literature, the deep learning-based recognizers are now the dominant solutions to scene text recognition and surpass traditional ones by large margins. Therefore, in this section, we only review recognizers applying deep learning techniques, along with ConvLSTM and related variants.

Methods based on LSTM: LSTM is widely used in the existing state-of-the-art recognizers for three purposes, i.e., producing frame-level predictions required by the subsequent sequential transcription module [8, 21], encoding sequential features while taking historical information into account [22, 23], and directly generating sequential predictions in cooperation with the attention mechanism [3, 4, 27, 14, 23]. For example, CRNN proposed by Shi et al. [21] was composed of three parts, i.e., a convolution module used to extract features from input images, a bi-LSTM layer built to make predictions for individual frames, and a CTC-based sequential transcription component utilized to infer sequential outputs from frame-level predictions. Shi et al. [22] employed a bi-LSTM layer in their RARE to extract sequential feature vectors from input feature maps, and then fed these vectors into an attention-Gated Recurrent Unit (GRU) module to generate label sequences. A highlight of RARE was its usage of the Spatial Transformer Network (STN) [13], which was responsible for rectifying images containing irregular text and was widely adopted by subsequent recognizers like STN-OCR [2] and ASTER [23]. Afterwards, RARE was extended to ASTER [23] by modifying the architecture of the rectification network. Note that LSTM was used for both feature encoding and sequential transcription in ASTER. Lee et al. [14] combined a recursive CNN with a recurrent CNN in their RAM to capture long-term dependencies when extracting features from raw images, and then fed these features to an attention-RNN network for sequential transcription. Gao et al. [7, 8] designed two models to compare the performance of CNN and LSTM in terms of sequential feature encoding. According to their experiments, features extracted by LSTM were more powerful than those extracted by CNN. Cheng et al. [3, 4] combined LSTM with an attention mechanism in the sequential transcription module of their FAN and AON recognizers, but they pointed out that the existing attention-based models often failed to align attention to the right feature areas when performing prediction. Therefore, a focusing network was assembled in their FAN [3] to tackle this problem. AON [4] was specially designed for irregular text recognition. In this work, features were extracted from four directions, and then combined and filtered with a filter gate. Wojna et al. [27] utilized an attention-equipped LSTM to localize and recognize text from street view images. Their model was given location awareness by incorporating one-hot encoded spatial coordinates into the LSTM.

Note that, the LSTM used in all the methods mentioned above refers to traditional FC-LSTM, so the 2-D feature maps have to be mapped into 1-D space in order to adapt to the LSTM layers, and the attention mechanism has to be incorporated in a fully connected way. This severely damages the spatial and structural information of input images, which is essential to computer vision tasks such as scene text recognition.

Methods without LSTM: At the beginning of the deep learning era, a group of Deep CNN (DCNN) recognizers [11, 1] were developed and made breakthroughs over traditional recognizers. In these models, a CNN together with a softmax classifier was widely used for character or word classification. However, with the development of LSTM-based recognizers, the DCNN ones were quickly and significantly surpassed. Recently, some researchers argued that LSTM-based models were hard to train [7] and not able to achieve good performance on non-horizontal text [16], so exploration of models without LSTM started again. For instance, STN-OCR [2] utilized fully connected layers and a fixed number of softmax classifiers for sequential prediction; SqueezedText [18] employed a binary convolutional encoder-decoder network to generate salience maps for individual characters and then exploited a GRU-based bi-RNN for further correction; Liao et al. [16] proposed to address the scene text recognition issue from a 2-D perspective with a CA-FCN model, so that the spatial information could be taken into account when performing prediction. Concretely, they designed an FCN structure equipped with a character attention module to produce pixel-level confidence maps for target characters, and then fed these maps into a word formation module to generate word-level outputs.

Different from those LSTM-based approaches, recognizers without LSTM can better leverage the spatial information, but they also unavoidably introduce additional parameters or post processing steps in order to produce sequential outputs, such as the multiple classifiers used by STN-OCR [2] and the word formation module designed in [16].

ConvLSTM and Related Variants: As explained in [24], the main drawback of traditional FC-LSTM was its usage of full connections in the input-to-state and state-to-state transitions, which resulted in the neglect of spatial information. To retain such important information, ConvLSTM, proposed by Shi et al. [24], replaced all of the full connections of traditional FC-LSTM with convolutional operations, and extended the 2-D features and states into 3-D, as shown in Fig. 3. Their experimental results demonstrated the superiority of ConvLSTM over traditional FC-LSTM. Thereafter, some variants of ConvLSTM have been developed for action recognition [15], object detection in video [17], and gesture recognition [29, 30] etc. For example, Zhu et al. [30] combined ConvLSTM with the 3-D convolution in a multimodal model, and achieved promising gesture recognition performance. Li et al. [15] designed a motion-based attention mechanism and combined it with ConvLSTM in their VideoLSTM, which is proposed for action recognition in videos.

Figure 3: Illustration of the FC-LSTM (left) and the ConvLSTM (right). The FC-LSTM is performed in 1-D space, while the ConvLSTM is performed in 2-D space.

In our work, aiming to better consider the spatial and structural information of input images when performing sequential prediction with LSTM, for the first time, we propose an attention-equipped ConvLSTM structure in the sequential transcription module, and further design a focused attention module to help learn more accurate alignment between predicted characters and corresponding feature areas.

3 Methodology

Figure 4: Overview of the proposed FACLSTM. The feature extraction module outputs the extracted feature maps and the character center masks. The proposed attention-equipped ConvLSTM produces one group of feature maps per timestep, up to the maximal string length, and the subsequent softmax classifier is responsible for producing the predictions from these groups of feature maps. Note that the softmax classifier and the preceding fully connected layer are shared by all groups of feature maps.

As illustrated in Fig. 4, our proposed FACLSTM, i.e., Focused Attention ConvLSTM, consists of two components, i.e., the CNN-based feature extraction module and the ConvLSTM-based sequential transcription module. The feature extraction module is an encoder-decoder structure that takes VGG-16 as the backbone, while the sequential transcription module is a combination of ConvLSTM and attention mechanism. More details are presented as follows.

3.1 CNN-based Feature Extraction

Backbone: Similar to Liao’s work [16], we take VGG-16 as the encoder of our feature extraction module, and remove the fully connected layers and pooling layers from the last two encoding stages. We also assemble two deformable convolutional layers [6] at stage-4 and stage-5 of the decoder, given their flexible receptive fields. However, considering the memory and computation cost, the final feature maps in our FACLSTM are restored to a smaller resolution than that used in [16]. In addition, we remove their character attention module set in the encoder stage and instead design a focused attention module in the higher-level decoder stage, so that more abstract and powerful character center masks can be extracted.

Focused Attention Module: As pointed out in [3], current attention-based models suffer from the ‘attention drift’ problem, i.e., they fail to obtain accurate alignment between target characters and related feature areas, especially in complicated and low-quality images. To tackle this problem, in the feature extraction module of the proposed FACLSTM, we assemble two decoder branches, of which one is used as normal for feature extraction and the other is designed to learn additional character center masks. These masks are expected to guide the subsequent attention module regarding where to focus: at each timestep, the attention should be focused on the center of a certain character. Moreover, these masks can also help to enhance foreground pixels and suppress background pixels.

In other works [7, 8, 16], the feature maps and attention maps are always combined by element-wise multiplication. However, in our experiments we find that directly concatenating feature maps and character center masks achieves better performance, since the subsequent attention-based module prefers to learn patterns from the feature maps and the masks directly, rather than from their fused results. Therefore, direct concatenation is used in our FACLSTM.

3.2 Sequential Transcription Module

As shown in Fig. 4, our sequential transcription module starts with an attention-equipped ConvLSTM, which generates one group of feature maps per timestep, up to the predefined maximal string length. Afterwards, a convolutional layer is applied to reduce the feature map channels, followed by a fully connected layer and a softmax classifier that are employed to sequentially predict characters. Details of the proposed sequential transcription module are presented below.

ConvLSTM: The structure of the traditional FC-LSTM [10] is illustrated in Fig. 3 (left), and its key formulations are given in Eq. 1, which involves the Hadamard product (i.e., element-wise multiplication), the activation function of the input gate, output gate and forget gate, as well as the input features, cell states and cell outputs.


As we can see, FC-LSTM takes 1-D sequential feature vectors as input, and calculates both the input-to-state and state-to-state transitions in a fully connected manner. Therefore, when applying it to computer vision tasks, the 2-D feature maps have to be mapped into 1-D space, during which the spatial correlations among pixels are badly damaged.
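The loss of spatial information can be seen on a toy example. The sketch below is an illustration only, not code from the paper: the two 3×4 "feature maps" and the helper names `pool_to_height_one` and `flatten` are hypothetical. It shows that pooling to a height of one cannot distinguish a stroke at the top from one at the bottom, and that flattening moves vertically adjacent pixels far apart in the sequence.

```python
# Two toy 3x4 "feature maps" whose strokes differ only in vertical position.
top_stroke = [
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
bottom_stroke = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]


def pool_to_height_one(feature_map):
    """Max-pool each column, turning an H x W map into a length-W sequence."""
    return [max(column) for column in zip(*feature_map)]


def flatten(feature_map):
    """Row-major flattening of an H x W map into a length-(H*W) vector."""
    return [value for row in feature_map for value in row]


# After pooling, both maps give the same sequence: the vertical position of
# the stroke is no longer recoverable.
seq_top = pool_to_height_one(top_stroke)
seq_bottom = pool_to_height_one(bottom_stroke)
print(seq_top == seq_bottom)

# After flattening, the vertically adjacent pixels (0, 0) and (1, 0) end up
# `width` positions apart in the 1-D vector.
width = len(top_stroke[0])
index_gap = (1 * width + 0) - (0 * width + 0)
print(index_gap)
```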

To take advantage of such valuable spatial and structural information in computer vision tasks, Shi et al. [24] proposed ConvLSTM by incorporating convolutional structures into LSTM. As shown in Fig. 3 (right), all input features, gates, cell states and cell outputs are 3-D in ConvLSTM, and all of the input-to-state and state-to-state transitions are performed with convolutional operations instead of fully connected ones. Thus, the key formulations of ConvLSTM can be written as Eq. 2, in which the convolutional operation takes the place of the fully connected one.
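For reference, the ConvLSTM of Shi et al. [24] is commonly written as follows, with $*$ denoting convolution and $\circ$ the Hadamard product. The symbol names and the peephole terms ($W_{ci}$, $W_{cf}$, $W_{co}$) follow that work and are assumptions here; the notation of Eq. 2 may differ.

```latex
\begin{aligned}
i_t &= \sigma\left(W_{xi} * \mathcal{X}_t + W_{hi} * \mathcal{H}_{t-1} + W_{ci} \circ \mathcal{C}_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} * \mathcal{X}_t + W_{hf} * \mathcal{H}_{t-1} + W_{cf} \circ \mathcal{C}_{t-1} + b_f\right) \\
\mathcal{C}_t &= f_t \circ \mathcal{C}_{t-1} + i_t \circ \tanh\left(W_{xc} * \mathcal{X}_t + W_{hc} * \mathcal{H}_{t-1} + b_c\right) \\
o_t &= \sigma\left(W_{xo} * \mathcal{X}_t + W_{ho} * \mathcal{H}_{t-1} + W_{co} \circ \mathcal{C}_t + b_o\right) \\
\mathcal{H}_t &= o_t \circ \tanh\left(\mathcal{C}_t\right)
\end{aligned}
```

Replacing every convolution $*$ with a matrix product, and the 3-D tensors with vectors, gives the corresponding FC-LSTM form.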


Proposed Attention-equipped ConvLSTM: The attention mechanism has achieved excellent performance in sequential prediction tasks, such as machine translation [1], speech recognition [5] and scene text recognition [3, 4, 27, 14, 23]. In the field of scene text recognition in particular, it has been widely combined with FC-LSTM or GRU to produce more accurate predictions. On the other hand, LSTM is used only for frame-level prediction in the existing works and is seldom utilized for producing sequential outputs from one single input image unless combined with CTC or the attention mechanism.

Figure 5: Illustration of our proposed attention-equipped ConvLSTM, where the inputs are weighted by attention scores derived from previous cell state and cell output.

Therefore, in this work, to adapt ConvLSTM to scene text recognition and, meanwhile, provide the proposed network location awareness, we incorporate the attention mechanism into ConvLSTM by weighting the input feature maps with attention scores derived from the cell states and cell outputs obtained at the previous timestep, as illustrated in Fig. 5. In addition, to retain the efficiency of the proposed network, an additional bottleneck gate is assembled before the original input gate, forget gate and output gate to reduce the internal feature map channels.
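The effect of a bottleneck gate on the number of convolution weights can be estimated with simple arithmetic. The sketch below is a rough illustration under stated assumptions, not the authors' configuration: the kernel size of 3 and the 512 input channels are hypothetical, only the 256 internal channels are taken from the implementation details, and the bottleneck layout (one squeezing convolution followed by four gate convolutions on the squeezed tensor) is one common arrangement.

```python
def conv_params(kernel, in_channels, out_channels):
    """Weight count of a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * in_channels * out_channels


# Hypothetical sizes: only the 256 internal channels come from the text.
KERNEL = 3        # assumed kernel size
INPUT_CH = 512    # assumed channels of the concatenated features and masks
HIDDEN_CH = 256   # internal channels reported in the implementation details

# Plain ConvLSTM: four convolutions (input, forget and output gates plus the
# candidate state) each read the concatenation of input and previous output.
plain = 4 * conv_params(KERNEL, INPUT_CH + HIDDEN_CH, HIDDEN_CH)

# With a bottleneck gate: one convolution first squeezes the concatenation to
# HIDDEN_CH channels, and the four gate convolutions read that smaller tensor.
bottleneck = (conv_params(KERNEL, INPUT_CH + HIDDEN_CH, HIDDEN_CH)
              + 4 * conv_params(KERNEL, HIDDEN_CH, HIDDEN_CH))

print(plain, bottleneck, round(bottleneck / plain, 3))
```

Under these assumed sizes the bottleneck variant needs fewer weights; the saving grows as the number of input channels grows relative to the internal channels.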

Eqs. 3 and 4 provide more details on how the cell outputs and the attention scores are calculated. They involve channel-wise concatenation, the ReLU activation function and the Sigmoid function, and the weighted inputs computed by Eq. 4. Keep in mind that all of the gates, inputs, cell states and cell outputs in Eqs. 3 and 4 are 3-D. Moreover, the equations involve the network weights and biases, and the input is the concatenation of the feature maps and character center masks produced by the aforementioned encoder-decoder feature extraction module.
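The idea of weighting the inputs with attention scores derived from the previous timestep can be sketched on a single-channel map. This is a minimal illustration, not the formulation of Eqs. 3 and 4: the scoring rule (previous cell output plus previous cell state at each position) and the spatial softmax normalisation are assumptions that stand in for the learned convolutions, and the function names are hypothetical.

```python
import math


def spatial_softmax(scores):
    """Normalise an H x W score map so that all entries sum to 1."""
    peak = max(max(row) for row in scores)
    exps = [[math.exp(value - peak) for value in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[value / total for value in row] for row in exps]


def attend(inputs, prev_hidden, prev_cell):
    """Weight a single-channel H x W input by attention from the previous step.

    Assumption: the score at each position is prev_hidden + prev_cell; the
    paper derives the scores with learned convolutions instead.
    """
    scores = [[h + c for h, c in zip(h_row, c_row)]
              for h_row, c_row in zip(prev_hidden, prev_cell)]
    attention = spatial_softmax(scores)
    weighted = [[a * x for a, x in zip(a_row, x_row)]
                for a_row, x_row in zip(attention, inputs)]
    return attention, weighted


inputs = [[1.0, 2.0], [3.0, 4.0]]
prev_hidden = [[0.0, 0.0], [0.0, 5.0]]  # previous output peaks at bottom-right
prev_cell = [[0.0, 0.0], [0.0, 0.0]]
attention, weighted = attend(inputs, prev_hidden, prev_cell)
print(attention)
print(weighted)
```

Because the weighted inputs keep their 2-D layout, they can be passed to the convolutional gates without flattening.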


Once the cell outputs are obtained from the proposed attention-equipped ConvLSTM, a convolutional layer is applied to reduce their channels, which also improves the model’s efficiency, just as the bottleneck gate does. Afterwards, a fully connected layer and a softmax classifier are used to generate the final sequential outputs, where each predicted character is from the predefined charset. Compared with STN-OCR [2], where multiple fully connected layers and multiple softmax classifiers are assembled for sequential transcription, our FACLSTM employs only one fully connected layer and one softmax classifier, which are shared by all groups of feature maps.

3.3 Training

Loss function: The objective function of our proposed FACLSTM consists of two parts, i.e., the sequential prediction loss and the mask loss, as formulated in Eq. 5. It is computed from the ground truth masks, the predicted masks, the smoothed ground truth strings and the predicted sequential outputs, and a coefficient is used to balance the importance of the sequential prediction loss and the mask loss. Additionally, the label smoothing method proposed by Szegedy et al. [25] helps regularize the proposed model. Therefore, given the one-hot encoded ground truth, we convert it to the smoothed version with Eq. 6. Moreover, for the ground truth masks, we set the value of their foreground pixels (centers of characters) and background pixels to 1 and 0, respectively. The mask loss is then calculated as in Eq. 7.
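Label smoothing in the sense of Szegedy et al. [25] mixes the one-hot target with a uniform distribution over the classes. The sketch below is an illustration under assumptions: the smoothing factor `epsilon=0.1`, the class index 7 and the use of cross-entropy as the sequence loss are not taken from the source, and Eq. 6 may be parameterised differently.

```python
import math


def smooth_labels(one_hot, epsilon=0.1):
    """Label smoothing: mix the one-hot target with a uniform distribution."""
    num_classes = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / num_classes for y in one_hot]


def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


NUM_CLASSES = 39                 # charset size used in the experiments
one_hot = [0.0] * NUM_CLASSES
one_hot[7] = 1.0                 # hypothetical ground-truth class index
smoothed = smooth_labels(one_hot)

# Loss of a prediction that assigns the same probability to every class.
uniform_prediction = [1.0 / NUM_CLASSES] * NUM_CLASSES
loss = cross_entropy(smoothed, uniform_prediction)
print(smoothed[7], smoothed[0], loss)
```

The smoothed target still sums to one and keeps its maximum at the true class, but no class has a target of exactly zero, which discourages over-confident predictions.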


Generation of Ground Truth: To optimize the proposed network, the ground truth of the character center masks is required. Given the bounding box of an individual character, we use the same method as in [16] to calculate the ground truth of the corresponding mask, as shown in Eq. 8.


Note that, the shrink ratio is set to 0.25 in our experiments, instead of 0.5 used in [16].
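The mask generation can be sketched as follows. This is a hedged illustration rather than a transcription of Eq. 8: it assumes axis-aligned character boxes given as `(x_min, y_min, x_max, y_max)` that are shrunk about their centers by the shrink ratio, and the function name `center_mask` is hypothetical.

```python
def center_mask(height, width, boxes, shrink_ratio=0.25):
    """Binary H x W mask with 1s inside each box shrunk about its center."""
    mask = [[0] * width for _ in range(height)]
    for x_min, y_min, x_max, y_max in boxes:
        cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
        half_w = (x_max - x_min) * shrink_ratio / 2.0
        half_h = (y_max - y_min) * shrink_ratio / 2.0
        for y in range(height):
            for x in range(width):
                # A pixel is foreground when its center lies in the shrunk box.
                if abs(x + 0.5 - cx) <= half_w and abs(y + 0.5 - cy) <= half_h:
                    mask[y][x] = 1
    return mask


# Two adjacent 8x8 character boxes in an 8x16 image.
mask = center_mask(8, 16, [(0, 0, 8, 8), (8, 0, 16, 8)])
for row in mask:
    print("".join(str(value) for value in row))
```

A smaller shrink ratio yields smaller foreground regions, which keeps the masks of neighboring characters well separated.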

4 Experiments

4.1 Datasets

We train the proposed FACLSTM network with 7 million synthetic images from the SynthText dataset [9] without fine-tuning on individual real-world datasets, and evaluate the corresponding performance on three widely used benchmarks, including the regular text dataset IIIT5K, the low-resolution and noisy text dataset SVT, and the curved text dataset CUTE.

  • SynthText is proposed by Gupta et al. [9] for scene text detection. The original dataset is composed of 800,000 scene text images, each with multiple word instances. Texts in this dataset are rendered in different styles, and annotated with character-level bounding boxes. Overall, about 7 million text images are cropped for scene text recognition.

  • IIIT5K is built by Mishra et al. [19]. This dataset consists of 3000 text images obtained from the web. Most of these images are regular, and for individual images, two lexicons are provided, including one 50-word lexicon and one 1000-word lexicon.

  • SVT is a very challenging dataset collected by Wang et al. [26] from Google Street View. In total, 647 low-resolution and noisy text images are included.

  • CUTE is released by Risnumawan et al. [20]. There are only 288 word images in this dataset, but most of them are seriously curved. Therefore, compared with other datasets, CUTE is more challenging.

4.2 Implementation Details

In our experiments, all of the input images are scaled to a fixed size with the aspect ratio preserved. The maximal string length is set to 20, including one START token and one EOS token. This means that up to 18 real characters are allowed within individual words. Our charset is composed of 39 characters, i.e., 26 alphabet letters, 10 digits, 1 START token, 1 EOS token and 1 special token for any other symbols. The Adam optimizer with an initial learning rate of 1e-4 is employed to optimize the proposed network. In total, the proposed FACLSTM is trained for five epochs, with learning rates of 1e-4, 1e-4, 5e-5, 1e-5 and 1e-6, respectively. Moreover, the number of channels of the convolutional operations in Eqs. 3 and 4 (see Fig. 5) is set to 256. Finally, the proposed network is implemented using the TensorFlow framework.
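The charset and sequence-length settings above can be sketched in a few lines. The token names, the lower-casing of input words and the choice to pad the tail with EOS are assumptions made for illustration; the source specifies only the counts, the maximal length and the per-epoch learning rates.

```python
import string

# Hypothetical token names; the source only states that such tokens exist.
START, EOS, OTHER = "<START>", "<EOS>", "<OTHER>"
CHARSET = list(string.ascii_lowercase) + list(string.digits) + [START, EOS, OTHER]
CHAR_TO_ID = {ch: i for i, ch in enumerate(CHARSET)}
MAX_LEN = 20  # includes START and EOS, so at most 18 real characters

# One learning rate per epoch, as listed in the implementation details.
LEARNING_RATES = [1e-4, 1e-4, 5e-5, 1e-5, 1e-6]


def encode(word):
    """Map a word to MAX_LEN class ids: START, characters, EOS, then padding.

    Assumptions: case is folded to lower case and the tail is padded with EOS.
    """
    lowered = word.lower()
    if len(lowered) > MAX_LEN - 2:
        raise ValueError("word longer than 18 characters")
    ids = [CHAR_TO_ID[START]]
    ids += [CHAR_TO_ID.get(ch, CHAR_TO_ID[OTHER]) for ch in lowered]
    ids += [CHAR_TO_ID[EOS]] * (MAX_LEN - len(ids))
    return ids


ids = encode("Hello!")
print(len(CHARSET), ids)
```

Any symbol outside the 36 letters and digits, such as the exclamation mark here, falls back to the single special token.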

4.3 Experimental Results

We evaluate the performance of our proposed FACLSTM on the aforementioned three benchmark datasets, and compare it with those of the state-of-the-art approaches. Table 1 presents the details of the comparison results. Note that, in this table, CA-FCN [16] and SqueezedText [18] are the two latest recognizers recently published in AAAI2019.

Method | Usage of LSTM | IIIT5K_None | IIIT5K_50 | IIIT5K_1k | SVT | CUTE
FAN [3] | FC-LSTM | 87.4 | 99.3 | 97.5 | 85.9 | 63.9
AON [4] | FC-LSTM | 87.0 | 99.6 | 98.1 | 82.8 | 76.8
CRNN [21] | FC-LSTM | 78.2 | 97.6 | 94.4 | 80.8 | -
(Gao et al.)* [8] | FC-LSTM | 83.6 | 99.1 | 97.2 | 83.9 | -
RARE [22] | FC-LSTM | 81.9 | 96.2 | 93.8 | 81.9 | 59.2
RAM [14] | FC-LSTM | 78.4 | 96.8 | 94.4 | 80.7 | -
SqueezedText (binary) [18] | FC-LSTM | 86.6 | 96.9 | 94.3 | - | -
SqueezedText (full-precision) [18] | FC-LSTM | 87.0 | 97.0 | 94.1 | - | -
CA-FCN [16] | No | 92.0 | 99.8 | 98.9 | 82.1 | 78.1
(Gao et al.)* [7] | No | 81.8 | 99.1 | 97.9 | 82.7 | -
STN-OCR* [2] | No | 86.0 | - | - | 79.8 | -
FLSTM_base1 | FC-LSTM | 73.7 | 99.0 | 97.4 | 58.7 | 67.4
FAFLSTM_base2 | FC-LSTM | 87.8 | 99.3 | 98.1 | 78.2 | 75.7
FACLSTM (proposed) | ConvLSTM | 90.5 | 99.5 | 98.6 | 82.2 | 83.3
Table 1: Result comparison across different methods and datasets. Word-level recognition rate (%) is used here. IIIT5K_None, IIIT5K_50 and IIIT5K_1k denote that no lexicon, the 50-word lexicon and the 1k-word lexicon are used, respectively. * means that word images containing non-alphanumeric characters were eliminated from the test set when evaluating the related methods.

Comparison with Methods based on the Traditional FC-LSTM: As previously introduced, traditional FC-LSTM is widely used in the existing scene text recognizers. Among the methods listed in Table 1, RARE [22], AON [4] and FAN [3] combined FC-LSTM with the attention mechanism in the fully connected way when performing sequential transcription, while CRNN [21], RAM [14], Gao’s model [8] and SqueezedText [18] utilized FC-LSTM for frame-level prediction, sequential feature encoding or other purposes.

As shown in Table 1, our proposed FACLSTM outperforms these methods by large margins on the regular text dataset IIIT5K (90.5% vs. 87.4%) and the curved text dataset CUTE (83.3% vs. 76.8%) when no lexicon is used. Among these FC-LSTM-based methods, our performance also ranks first and second on the IIIT5K dataset when the 1k-word lexicon and the 50-word lexicon are used, respectively.

As for the low-resolution and noisy dataset SVT, our proposed FACLSTM also achieves competitive performance. Readers should keep in mind that both of Gao’s methods [7, 8] were evaluated on an incomplete SVT dataset, from which words containing non-alphanumeric characters or with a length smaller than three were removed. In addition, apart from the 7 million training images from SynthText, AON [4] and FAN [3] also employed an additional 4 million images provided by Jaderberg et al. [12] for training. These 4 million images are generated with a 50k-word lexicon that covers all the test words of the ICDAR and SVT datasets, and are blended with word images randomly sampled from these two datasets. Thus, the recognition performance on the SVT dataset would benefit largely from the usage of these 4 million images, which can also be seen from Liao’s work [16]. Unfortunately, Jaderberg et al. [12] did not provide character-level bounding boxes, so we cannot employ their dataset to train our proposed network unless we remove the focused attention module from our FACLSTM or, as Liao et al. [16] did, generate new synthetic images with their 50k-word lexicon. Moreover, since both synthetic datasets contain large numbers of images, our hardware is insufficient to train on both of them.

Comparison with Non-LSTM based Methods: Considering the limitations of the traditional FC-LSTM, such as the neglect of spatial and structural information and slow training convergence, CA-FCN [16], Gao’s model [7] and STN-OCR [2] have explored non-LSTM solutions. In particular, CA-FCN [16] also addressed the recognition issue from the 2-D perspective by utilizing an FCN structure.

From Table 1, we can see that the accuracy of our proposed FACLSTM is 1.5% lower than that of the best recognizer, i.e. CA-FCN [16], on the regular text dataset IIIT5K. However, on the more challenging curved text dataset CUTE, we achieve an accuracy of 83.3%, which is 5.2% higher than that of CA-FCN [16]. As for the low-resolution and noisy dataset SVT, our FACLSTM performs slightly better than CA-FCN [16] with an accuracy of 82.2% (vs. 82.1% of CA-FCN [16]). Note that, CA-FCN [16] is not an end-to-end trainable system because in order to infer the final sequential outputs from the pixel-level predictions generated by their network, an empirical rule-based word formation module is required. By contrast, our FACLSTM is able to directly produce the final sequential outputs via the proposed ConvLSTM-based sequential transcription module.

Effectiveness of the Proposed Focused Attention Module and ConvLSTM-based Sequential Transcription Module: Furthermore, to highlight the effectiveness of our proposed focused attention module and ConvLSTM-based sequential transcription module, we compare the performance of our proposed FACLSTM with that of the following two baseline models:

  • FLSTM_base1, which shares the same feature extraction module with our proposed FACLSTM, but removes the focused attention module. Besides, the sequential transcription module used in this model is the traditional attention-based FC-LSTM network, just as the one used in AON [4], FAN [3] and both Gao’s models [7, 8].

  • FAFLSTM_base2, which is built upon FLSTM_base1, but with the proposed focused attention module applied.

From the comparison of FLSTM_base1, FAFLSTM_base2 and our proposed FACLSTM, we can see that the recognition accuracies on the IIIT5K, SVT and CUTE datasets are improved by 14.1%, 19.5% and 8.4%, respectively, when the proposed focused attention module is added. When the traditional attention-based FC-LSTM is replaced by our proposed ConvLSTM-based sequential transcription module, further improvements of 2.7%, 3.6% and 7.6% are achieved. Therefore, both of the proposed modules are effective.
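The ablation gains can be recomputed directly from the no-lexicon columns of Table 1. The short script below only performs that subtraction; the dictionary layout and the helper name `gain` are illustrative. Recomputed from the rounded table entries, the CUTE gain from the focused attention module comes out as 8.3 and the SVT gain from ConvLSTM as 4.0, which differ from the 8.4 and 3.6 quoted in the text; the source does not explain the difference.

```python
# Word accuracy (%) without lexicon, copied from Table 1.
table = {
    "FLSTM_base1":   {"IIIT5K": 73.7, "SVT": 58.7, "CUTE": 67.4},
    "FAFLSTM_base2": {"IIIT5K": 87.8, "SVT": 78.2, "CUTE": 75.7},
    "FACLSTM":       {"IIIT5K": 90.5, "SVT": 82.2, "CUTE": 83.3},
}


def gain(better, worse):
    """Absolute accuracy gain in percentage points, rounded to one decimal."""
    return {name: round(table[better][name] - table[worse][name], 1)
            for name in table[better]}


# Effect of adding the focused attention module to the FC-LSTM baseline.
focused_attention_gain = gain("FAFLSTM_base2", "FLSTM_base1")
# Effect of replacing the attention-based FC-LSTM with the proposed ConvLSTM.
convlstm_gain = gain("FACLSTM", "FAFLSTM_base2")

print(focused_attention_gain)
print(convlstm_gain)
```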

In summary, on the regular text dataset, our proposed FACLSTM outperforms all of listed FC-LSTM-based and non-LSTM-based recognizers, except CA-FCN, but on the more challenging curved text dataset, our FACLSTM surpasses all of the listed methods significantly with an accuracy of 83.3%, including CA-FCN (78.1%). Moreover, the comparisons with other two baseline models demonstrate the effectiveness of our proposed focused attention module and ConvLSTM-based sequential transcription module. Finally, we also give the visualization results of the predicted masks and the attention shift procedure, as shown in Fig. 6.

Figure 6: Visualization results of predicted mask and attention shift procedure.

5 Conclusion

Scene text recognition has long been treated as a sequence-to-sequence prediction problem, and traditional FC-LSTM is widely used in current state-of-the-art recognizers. In this work, we have argued that scene text recognition is actually a spatiotemporal prediction problem and have proposed to tackle it from the spatiotemporal perspective. Toward this end, we have presented an effective scene text recognizer named FACLSTM, in which ConvLSTM is applied and improved by integrating the attention mechanism into the sequential transcription module, and a focused attention module is designed at the encoder-decoder feature extraction stage. Experimental results have shown that our proposed FACLSTM is able to handle both regular and irregular (low-resolution, noisy and curved) text well. For curved text in particular, our proposed FACLSTM has outperformed other advanced approaches by large margins. Thus, we conclude that ConvLSTM is more effective in scene text recognition than the widely used FC-LSTM, since the valuable spatial and structural information can be better leveraged when performing sequential prediction with ConvLSTM.


  • [1] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
  • [2] C. Bartz, H. Yang, and C. Meinel. Stn-ocr: a single neural network for text detection and recognition. CoRR, arXiv preprint arXiv: 1707.08831v1, 2017.
  • [3] Z. Cheng, F. Bai, Y. Xu, and G. Zheng. Focusing attention: towards accurate text recognition in natural images. In IEEE International Conference on Computer Vision, pages 5086–5094, 2017.
  • [4] Z. Cheng, Y. Xu, F. Bai, Y. Niu, S. Pu, and S. Zhou. Aon: towards arbitrarily-oriented text recognition. In International Conference on Computer Vision and Pattern Recognition, pages 5571–5579, 2018.
  • [5] J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio. Attention-based models for speech recognition. CoRR, abs/1506.07503, 2015.
  • [6] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In IEEE International Conference on Computer Vision, pages 764–773, 2017.
  • [7] Y. Gao, Y. Chen, J. Wang, and H. Lu. Reading scene text with attention convolutional sequence modeling. CoRR, arXiv preprint arXiv: 1709.04303v1, 2017.
  • [8] Y. Gao, Y. Chen, J. Wang, M. Tang, and H. Lu. Dense chained attention network for scene text recognition. In International Conference on Image Processing, pages 679–683, 2018.
  • [9] A. Gupta, A. Vedaldi, and A. Zisserman. Synthetic data for text localization in natural images. In International Conference on Computer Vision and Pattern Recognition, pages 2315–2324, 2016.
  • [10] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9:1735–1780, 1997.
  • [11] M. Jaderberg, A. Vedaldi, and A. Zisserman. Deep features for text spotting. In ECCV, 2014.
  • [12] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. CoRR, arXiv preprint arxiv:1412.1842, 2014.
  • [13] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. CoRR, abs/1506.02025, 2015.
  • [14] C. Lee and S. Osindero. Recursive recurrent nets with attention modeling for ocr in the wild. In International Conference on Computer Vision and Pattern Recognition, pages 2231–2239, 2016.
  • [15] Z. Li, K. Gavrilyuk, E. Gavves, M. Jain, and C. G. Snoek. Videolstm convolves, attends and flows for action recognition. Computer Vision and Image Understanding, 166:41–50, 2018.
  • [16] M. Liao, J. Zhang, Z. Wan, F. Xie, J. Liang, P. Lyu, C. Yao, and X. Bai. Scene text recognition from two-dimensional perspective. In AAAI, 2019.
  • [17] M. Liu and M. Zhu. Mobile video object detection with temporally-aware feature maps. In International Conference on Computer Vision and Pattern Recognition, pages 5571–5579, 2018.
  • [18] Z. Liu, Y. Li, F. Ren, W. Goh, and H. Yu. Squeezedtext: a real-time scene text recognition by binary convolutional encoder-decoder network. In AAAI, pages 7194–7201, 2018.
  • [19] A. Mishra, K. Alahari, and C. Jawahar. Top-down and bottom-up cues for scene text recognition. In International Conference on Computer Vision and Pattern Recognition, pages 2687–2694, 2012.
  • [20] A. Risnumawan, P. Shivakumara, C. Chan, and C. Tan. A robust arbitrary text detection system for natural scene images. Expert Systems with Applications, 41:8027–8048, 2014.
  • [21] B. Shi, X. Bai, and C. Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transaction on Pattern Analysis and Machine Intelligence, 39(11):2298–2304, 2017.
  • [22] B. Shi, X. Wang, P. Lyu, C. Yao, and X. Bai. Robust scene text recognition with automatic rectification. In International Conference on Computer Vision and Pattern Recognition, pages 4168–4176, 2016.
  • [23] B. Shi, M. Yang, X. Wang, P. Lyu, X. Bai, and C. Yao. Aster: an attention scene text recognizer with flexible rectification. IEEE Transaction on Pattern Analysis and Machine Intelligence (TPAMI), 31(11):855–868, 2018.
  • [24] X. Shi, Z. Chen, H. Wang, D. Yeung, W. Wong, and W. Woo. Convolutional lstm network: a machine learning approach for precipitation nowcasting. In NIPS, 2015.
  • [25] C. Szegedy, V. Vanhoucke, S. Ioffe, and Z. Wojna. Rethinking the inception architecture for computer vision. In International Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
  • [26] K. Wang, B. Babenko, and S. Belongie. End-to-end scene text recognition. In IEEE International Conference on Computer Vision, pages 1457–1464, 2011.
  • [27] Z. Wojna, A. Gorban, D. Lee, K. Murphy, Q. Yu, Y. Li, and J. Ibarz. Attention-based extraction of structured information from street view imagery. In International Conference on Document Analysis and Recognition, pages 844–850, 2017.
  • [28] Q. Ye and D. Doermann. Text detection and recognition in imagery: a survey. IEEE Transaction on Pattern Analysis and Machine Intelligence, 37(7):1480–1500, 2015.
  • [29] L. Zhang, G. Zhu, L. Mei, P. Shen, S. Shah, and M. Bennamoun. Attention in convolutional lstm for gesture recognition. In NIPS, 2018.
  • [30] G. Zhu, L. Zhang, P. Shen, and J. Song. Multimodal gesture recognition using 3-d convolution and convolutional lstm. IEEE Access, 5:4517–4524, 2017.