
Scale-Invariant Multi-Oriented Text Detection in Wild Scene Images

by Kinjal Dasgupta, et al.

Automatic detection of scene texts in the wild is a challenging problem, particularly because of the difficulty of handling (i) occlusions of varying percentages, (ii) widely different scales and orientations, and (iii) severe degradations in image quality. In this article, we propose a fully convolutional neural network architecture containing a novel Feature Representation Block (FRB) capable of efficient abstraction of information. The proposed network has been trained using curriculum learning with respect to the difficulty of image samples and gradual pixel-wise blurring. It is capable of detecting texts of different scales and orientations that are affected by blurring from multiple possible sources, non-uniform illumination and partial occlusions of varying percentages. The text detection performance of the proposed framework on benchmark databases including ICDAR 2015, ICDAR 2017 MLT, COCO-Text and MSRA-TD500 improves on the respective state-of-the-art results significantly. Source code of the proposed architecture will be made available on GitHub.


1 Introduction

Although existing state-of-the-art scene text detection models already perform well on moderately well-behaved scene images, efficient detection of incidental scene texts (such as texts captured by wearable cameras, where the capture is difficult to control) remains one of the most challenging tasks in the Computer Vision community. In 2015, the ICDAR Robust Reading Competition [11] introduced a few challenges on processing incidental scene texts and published a related sample database for the first time. COCO-Text [20], another benchmark dataset created later, also contains similar samples of complex everyday scenes. Such samples are often motion blurred, non-uniformly illuminated, partially occluded, multi-oriented or multi-scaled, and efficient detection of such texts needs further intensive study. The robustness of the detection procedure largely depends on how well the feature representation distinguishes text from non-text components. Since the superiority of deep learning-based feature representations of raw images over traditional hand-crafted filters has already been established [15], the present study centres on the development of an efficient deep convolutional neural network architecture for this purpose. The deep architecture developed in this study includes a novel Feature Representation Block (FRB) for robust detection of scene texts. The FRB is designed to abstract information at multiple levels, generating features that adapt to changes in orientation and scale. Its Gabor filter based convolutional component captures scale and orientation information; the channel-wise attention map enhances important features and attenuates noisy and unwanted background information; the 4Dir IRNN component [1] captures contextual information by moving RNNs laterally across the image; and the conditional random fields (CRFs) [13] based aggregation component refines incoming information at multiple scales. Figure 1 shows outputs of the proposed detection framework on two samples from the Robust Reading Competition datasets [11, 17] containing incidental scene texts.

Figure 1: Red rectangular boxes show widely varying scene texts detected by the proposed framework.

2 Related Works

Detection and recognition of texts in scene images have been studied for a long time. A recent survey of existing approaches and an analysis of results can be found in [25]. Among the traditional approaches to scene text detection, the stroke width transform was used in [5], the stroke feature transform was studied in [9], MSERs were used in [18], and text fragment detection based on the well-known FAST corner detector was reported in [3]. Recent studies of text detection use deep learning based strategies in which features are learned automatically with the help of convolutional neural networks (CNNs). The Edge Boxes region proposal algorithm [26] and an aggregate channel features detector [4] were combined in [10] for detecting words in scene images; this approach made use of the R-CNN object detection framework [6]. The objective of all these studies was the detection of texts in natural scene images where the photographer controls the camera. However, if the capturing device is a wearable one, detecting the incidental texts that appear in images captured without any such control becomes more difficult. In the 2015 edition of the ICDAR Robust Reading Competition [11], a challenge on Incidental Scene Text was introduced, based on a dataset of 1,670 sample images captured using Google Glass. A similar dataset, called COCO-Text [20], was later introduced in 2016. In the following year, Zhou et al. [24] proposed a fully convolutional network architecture and a non-maximum suppression strategy which can directly locate a word or a text line of arbitrary orientation in similar scene images. He et al. [8] proposed a direct regression strategy to detect incidental texts of multiple orientations with variations in size and perspective distortions. Later, Liu et al. [14] combined the low-level and high-level feature maps produced by their CNN architecture, which could detect incidental texts of different orientations. In [22], a pyramid context network was proposed for localization of text regions in natural scene images. Wang et al. [21] used a region proposal network together with an RNN for detection of scene text regions of arbitrary shapes. Mask R-CNN has recently been used in [19] for detection of scene texts of arbitrary shapes.

Figure 2: Schematic diagram of the proposed architecture. It contains (i) a ResNet-18 used as feature encoder, (ii) a Feature Representation Block (FRB), (iii) a Deconvolution Block and (iv) a Detection Branch. The FRB has a Multi-scale Feature Refinement Module (MFRM) consisting of a 4Dir IRNN, a channel-wise attention block and a CRFs-based aggregation block.

3 Proposed Methodology

The proposed methodology implements a neural network architecture capable of detecting incidental scene texts. The network architecture is shown in Figure 2. It is a fully convolutional neural network consisting of a block that extends the input image into 4 orientation channels as in [15], followed by a ResNet-18 [7] based feature extraction block modulated by Gabor Orientation Filters (GOF), an FRB containing multiple convolutional Gabor Orientation Filters and a Multi-scale Feature Refinement Module (MFRM), a decoder block consisting of a series of deconvolution operations, and finally a detection branch similar to the one used in [24].

3.1 Feature Representation Block (FRB)

The Feature Representation Block (FRB) receives the output of the layer of the backbone ResNet-18 architecture modulated by Gabor Orientation Filters, which are used to enhance the robustness of traditional convolution filters to image transformations such as scale variations and rotations. The FRB consists of multiple Gabor Convolutional layers arranged in multiple rows and a Multi-scale Feature Refinement Module (MFRM) that makes the learned feature representation more abstract. Feature maps computed at a higher layer of the FRB are downsampled before entering the Gabor Convolutional layers of the next row. We also use kernels of different sizes in each row to handle the enormous differences in the scales of scene texts. A larger kernel helps to capture more Gabor orientation information, while a smaller kernel is able to capture more locally distributed information.
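
As a rough illustration of how a bank of Gabor filters can modulate a learned convolution kernel, in the spirit of Gabor convolutional networks [15], the sketch below builds real-valued Gabor kernels at 4 orientations and multiplies a stand-in learned kernel by each of them element-wise. The function names, the 5x5 kernel size and the `sigma` and `wavelength` values are assumptions chosen for illustration; none of them is taken from the paper.

```python
import math

def gabor_kernel(size, theta, sigma=2.0, wavelength=4.0):
    """Real part of a Gabor filter on an odd size x size grid, oriented at theta."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the sinusoid varies along direction theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr * xr + yr * yr) / (2.0 * sigma * sigma))
            carrier = math.cos(2.0 * math.pi * xr / wavelength)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel

def modulate(learned, num_orientations=4):
    """Return one modulated copy of `learned` per orientation (element-wise product)."""
    size = len(learned)
    bank = []
    for k in range(num_orientations):
        theta = k * math.pi / num_orientations
        g = gabor_kernel(size, theta)
        bank.append([[learned[i][j] * g[i][j] for j in range(size)]
                     for i in range(size)])
    return bank

# Stand-in for a learned 5x5 filter (hypothetical values).
learned_kernel = [[1.0] * 5 for _ in range(5)]
filters = modulate(learned_kernel)
```

Each learned kernel yields one output per orientation, which is why the number of orientation channels appears as an extra dimension of the feature maps.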

3.2 Multi-scale Feature Refinement Module (MFRM)

The MFRM consists of a channel-wise attention map, a 4Dir-IRNN component and an aggregation component modulated by CRFs. The outputs of the last three Gabor Convolutional layers, one in each row of the FRB, are fed as input to the MFRM. Each output is indexed by the numbers of convolution and orientation channels and by the height and width of the input image; the network uses 4 orientation channels (Section 4.1). For every channel, the network produces a rank-3 matrix, and we concatenate these matrices to form the final output matrix. The number of channels is then reduced, and the result is fed into the distinct position for every feature. The channel-wise attention module helps the different channels of the feature map to enhance essential information while attenuating noise and background data before the features are passed to the task-specific layers of the model. We first pass the feature map originating from the last convolution block through a layer that produces a map of the same size and number of channels. This map is then passed through a sigmoid activation function and multiplied component-wise with the feature map to produce the attended feature map.
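
The gating step of the channel-wise attention module can be sketched as follows. The names `features`, `weights` and `gate` are hypothetical, and the per-pixel channel mixing (equivalent to a 1x1 convolution) is only an assumption about the layer that produces a map of the same size and number of channels. The sigmoid followed by component-wise multiplication is the part taken from the text.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def channel_attention(features, weights):
    """features: [C][H][W] nested lists; weights: [C][C] channel-mixing matrix.

    Mixes channels at every pixel (assumed 1x1 convolution), squashes the
    result with a sigmoid, and gates the input features component-wise.
    """
    channels = len(features)
    height = len(features[0])
    width = len(features[0][0])
    gated = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for c_out in range(channels):
        for y in range(height):
            for x in range(width):
                mixed = sum(weights[c_out][c_in] * features[c_in][y][x]
                            for c_in in range(channels))
                gate = sigmoid(mixed)  # value in (0, 1)
                gated[c_out][y][x] = gate * features[c_out][y][x]
    return gated

# Two channels on a 1x2 image; zero weights give a gate of exactly 0.5.
feats = [[[2.0, 4.0]], [[-1.0, 3.0]]]
zero_w = [[0.0, 0.0], [0.0, 0.0]]
out = channel_attention(feats, zero_w)
```

Because the gate lies strictly between 0 and 1, it can only attenuate a response, so channels with large gate values are emphasised relative to the rest.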


We follow the 4Dir-IRNN component [1] in this framework for computing the contextual features. The 4Dir-IRNN component is a kind of RNN composed of ReLUs. The channels coming from the FRB are divided by a factor of 4, and each resulting group of 128 channels is processed and fed to the 4Dir-IRNN block. Inside the IRNN there are four RNNs, which together sweep laterally across the image in 4 directions: up, down, left and right. The input-to-hidden transition is computed by a convolution layer that is shared by the 4 recurrent transitions. Each direction of the RNN collects contextual features from its own group of 128 channels. The hidden-to-output transition is merged into a single convolution layer, implemented as a concatenation followed by a convolution with a ReLU activation function. The aggregation block is a unit for merging multiple features, reinforced with CRFs. The different features coming from the FRB and from the preceding components are passed into the aggregation unit. The CRFs [13] in the aggregation component help to refine each incoming multi-scale feature by combining the corresponding information received from the other features. Only the learned features retained by the aggregation block are passed on to the decoder.
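
The four directional sweeps can be sketched on a single-channel grid as below. This is a simplification under stated assumptions: the recurrent weight is fixed to the identity, so each step is `h = relu(h_prev + x)`, and the shared input-to-hidden convolution and the hidden-to-output convolution described in the text are omitted. The function names are hypothetical.

```python
def sweep(grid, direction):
    """One IRNN pass with identity recurrence: h = relu(h_prev + x)."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    if direction in ("left", "right"):
        for r in range(rows):
            order = range(cols) if direction == "right" else range(cols - 1, -1, -1)
            h = 0.0
            for c in order:
                h = max(h + grid[r][c], 0.0)
                out[r][c] = h
    elif direction in ("up", "down"):
        for c in range(cols):
            order = range(rows) if direction == "down" else range(rows - 1, -1, -1)
            h = 0.0
            for r in order:
                h = max(h + grid[r][c], 0.0)
                out[r][c] = h
    else:
        raise ValueError(direction)
    return out

def four_dir_irnn(grid):
    """Run the four sweeps; a real model would concatenate these as channels."""
    return {d: sweep(grid, d) for d in ("up", "down", "left", "right")}

grid = [[1.0, -2.0, 3.0],
        [0.5, 0.5, -1.0]]
context = four_dir_irnn(grid)
```

After one pass, every position has accumulated information from all positions before it along the sweep direction, which is how the component gathers context across the whole image.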

3.3 Feature Learning

Our primary goal is a model that is invariant to different deformations. Instead of relying on a large and diverse dataset, we adopt the Mask and Predict approach [12] together with a pixel-wise blurring strategy, so that the network learns diverse deformations during training. This strategy addresses the difficulty of text detection under occlusion, motion blur and varying lighting conditions in the input image. In this approach, we gradually increase the percentage of pixel-wise blurring applied to the input image. We also use curriculum learning [2], manually arranging the images from simpler to more difficult ones.
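
A minimal sketch of such a training schedule is given below, assuming a linear ramp of the blurred-pixel fraction and a 3x3 box blur. The ramp shape, the blur kernel, the `max_fraction` value and the difficulty scores are all illustrative assumptions. The text only states that the blurring percentage is increased gradually and that images are ordered manually from simple to difficult.

```python
import random

def blur_fraction(epoch, total_epochs, max_fraction=0.5):
    """Linearly ramp the fraction of pixels to blur from 0 to max_fraction."""
    if total_epochs <= 1:
        return max_fraction
    return max_fraction * min(epoch, total_epochs - 1) / (total_epochs - 1)

def box_blur_pixel(image, r, c):
    """Mean of the 3x3 neighbourhood around (r, c), clipped at the borders."""
    rows, cols = len(image), len(image[0])
    vals = [image[i][j]
            for i in range(max(0, r - 1), min(rows, r + 2))
            for j in range(max(0, c - 1), min(cols, c + 2))]
    return sum(vals) / len(vals)

def blur_pixels(image, fraction, rng):
    """Blur a random `fraction` of pixels, reading only from the original image."""
    rows, cols = len(image), len(image[0])
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    chosen = rng.sample(coords, int(round(fraction * len(coords))))
    out = [row[:] for row in image]
    for r, c in chosen:
        out[r][c] = box_blur_pixel(image, r, c)
    return out

def curriculum_order(samples, difficulty):
    """Sort samples from easy to hard using a manually assigned difficulty score."""
    return sorted(samples, key=difficulty)

rng = random.Random(0)
image = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
blurred = blur_pixels(image, blur_fraction(5, 10), rng)
scores = {"easy": 0, "medium": 1, "hard": 2}  # hypothetical manual labels
ordered = curriculum_order(["hard", "easy", "medium"], scores.get)
```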

3.4 Feature Decoder

The information-rich feature map output by the aggregation block is fed into the decoder. At each feature merging point (Figure 2), the incoming feature map from the FRB is processed in the same way as described in Section 3.2 before being added to the current feature map. After two such successive feature additions, followed by deconvolutions of 256 and 128 channels, two more deconvolutions of 64 and 32 channels are applied before the result is fed to the detection branch.

Figure 3: Qualitative detection results of the proposed framework on samples containing texts of widely varying scales, multiple orientations and multiple scripts, including motion-blurred, non-uniformly illuminated and partially occluded texts.

3.5 Detection

The design of the detection branch is similar to the output layer of the network of [24]. It consists of several convolution operations producing feature maps of 1, 4, 1 and 8 channels, where the first provides the score map, the following two provide the RBOX geometry and the last provides the QUAD geometry. The loss function used for the detection branch, written in the notation of [24], is

$$L = L_s + \lambda_g L_g$$

where $L_s$ is the loss of the score map, $L_g$ is the loss of the geometry and $\lambda_g$ weighs the importance between the two losses.
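
The combination of a score-map loss and a geometry loss through a weighting factor can be sketched as below. The particular terms used here, a class-balanced cross-entropy for the score map and a negative-log IoU term for axis-aligned box distances, follow the formulation of [24] and are assumptions for illustration, as are the function names, the averaging over pixels and the default weight of 1.0.

```python
import math

def balanced_bce(pred, target, eps=1e-7):
    """Class-balanced cross-entropy over a flat list of score-map pixels."""
    beta = 1.0 - sum(target) / len(target)  # weight on positive pixels
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -beta * t * math.log(p) - (1.0 - beta) * (1.0 - t) * math.log(1.0 - p)
    return total / len(target)

def aabb_iou_loss(pred, gt):
    """-log IoU for boxes given as (top, right, bottom, left) distances from one pixel."""
    area_p = (pred[0] + pred[2]) * (pred[1] + pred[3])
    area_g = (gt[0] + gt[2]) * (gt[1] + gt[3])
    h = min(pred[0], gt[0]) + min(pred[2], gt[2])
    w = min(pred[1], gt[1]) + min(pred[3], gt[3])
    inter = h * w
    union = area_p + area_g - inter
    return -math.log(inter / union)

def total_loss(score_loss, geometry_loss, lambda_g=1.0):
    """Weighted sum of the score-map loss and the geometry loss."""
    return score_loss + lambda_g * geometry_loss

score = balanced_bce([0.5, 0.5], [1.0, 0.0])
geometry = aabb_iou_loss((1.0, 1.0, 1.0, 1.0), (1.0, 1.0, 1.0, 1.0))
loss = total_loss(score, geometry)
```

A larger weight makes the optimizer favour accurate box geometry over a clean score map, and a smaller one does the opposite.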

Methods            | ICDAR 2015 [11]            | ICDAR 2017 MLT [17]        | MSRA-TD500 [23]            | COCO-Text [20]
                   | Recall  Precision  F-Score | Recall  Precision  F-Score | Recall  Precision  F-Score | Recall  Precision  F-Score
EAST [24]          | 78.3    83.3       80.7    | -       -          -       | 67.4    87.3       76.0    | 32.4    50.3       39.4
He et al. [8]      | 80.0    82.0       80.9    | -       -          -       | -       -          -       | -       -          -
FOTS [14]          | 87.9    91.8       89.8    | 62.3    81.8       70.7    | -       -          -       | -       -          -
Xie et al. [22]    | 85.8    88.7       87.2    | 68.6    80.6       74.1    | -       -          -       | -       -          -
Wang et al. [21]   | 86.0    89.2       87.6    | -       -          -       | 82.1    85.2       83.6    | -       -          -
Qin et al. [19]    | 87.9    91.6       89.7    | -       -          -       | -       -          -       | -       -          -
SCUT_DLVClab [17]  | -       -          -       | 74.1    80.2       77.0    | -       -          -       | -       -          -
Lyu et al. [16]    | -       -          -       | 70.6    74.3       72.4    | 76.2    87.6       81.5    | 32.4    61.9       42.5
Ours               | 89.2    91.3       90.2    | 73.9    88.6       80.5    | 81.6    88.2       84.7    | 50.6    71.5       59.2
Table 1: Comparative detection results of the proposed framework and different SOTA models on various benchmark sample databases.

4 Experimental Details

For comparison purposes, the proposed text detection framework has been trained on four publicly available benchmark datasets: ICDAR 2015 [11], ICDAR 2017 MLT [17], COCO-Text [20] and MSRA-TD500 [23]. The numbers of training samples in these four datasets are 1000, 7200, 43686 and 300, respectively, while the respective test sets contain 500, 9000, 10000 and 200 samples.

4.1 Training Details

We employ curriculum learning [2] with respect to progressively increasing pixel-wise blurring, masking percentage and image complexity, so that the learned features become invariant to image degradation, occlusion and image complexity. We train the Gabor Convolutional layers in the proposed framework with 4 orientation channels and keep the scale setting at 4. A momentum optimizer is used during training, and the learning rate is decreased from its initial value by a factor of 10 after every 15k iterations. Data augmentation is also employed to improve the robustness of the proposed framework. The network has been trained on a computer with two NVIDIA P6 GPUs.
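
The step decay of the learning rate can be written as a small helper. The function name and the initial learning rate of 1.0 used in the example are placeholders, since no initial value is given here. Only the factor of 10 and the 15k-iteration interval come from the text.

```python
def step_lr(initial_lr, iteration, step=15000, factor=10.0):
    """Learning rate after dividing by `factor` once per completed `step` iterations."""
    return initial_lr / (factor ** (iteration // step))

# Placeholder initial rate of 1.0, so the values below are relative multipliers.
schedule = [step_lr(1.0, it) for it in (0, 14999, 15000, 30000)]
```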

4.2 Evaluation Results

For performance evaluation of the proposed method, experiments have been carried out on several benchmark datasets. Among these, ICDAR 2015 and ICDAR 2017 MLT contain image samples with multi-scaled texts affected by motion blur, low resolution and variable lighting. The experiments reported in [14] ignored blurred image samples during training, whereas our model has been purposefully trained on the entire training sets. The results of our extensive experiments are compiled in Table 1.
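
The F-Score column of Table 1 is the harmonic mean of recall and precision. As a quick consistency check, the sketch below recomputes it for two ICDAR 2015 rows of the table. The helper name `f_score` is hypothetical.

```python
def f_score(recall, precision):
    """Harmonic mean of recall and precision (both given in percent)."""
    if recall + precision == 0:
        return 0.0
    return 2.0 * recall * precision / (recall + precision)

# (recall, precision, reported F-Score) taken from the ICDAR 2015 columns of Table 1.
rows = {
    "EAST [24]": (78.3, 83.3, 80.7),
    "Ours": (89.2, 91.3, 90.2),
}
recomputed = {name: round(f_score(r, p), 1) for name, (r, p, _) in rows.items()}
```

For both rows the recomputed value matches the reported one after rounding to one decimal place.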

5 Conclusion

In this article, we have presented a novel Feature Representation Block (FRB) of a fully convolutional neural network for efficient abstraction of features representing scene texts. For effective learning, we used a curriculum learning strategy with respect to the percentage of pixel-wise blurring and the complexity of image samples. Additionally, we used the Mask and Predict strategy to enable the network to produce satisfactory results in cases of partial occlusion. We carried out extensive experiments on four benchmark datasets, namely ICDAR 2015, ICDAR 2017 MLT, MSRA-TD500 and COCO-Text, and compared the results with the available SOTA performances. The proposed framework can efficiently detect incidental scene texts affected by uneven lighting conditions, blurring, partial occlusions and varying scales. In future, we shall study the development of a strategy capable of detecting incidental curved texts, as well as a method to predict occluded texts from the visible partial information.

References


  • [1] S. Bell, C. Lawrence Zitnick, K. Bala, and R. Girshick (2016) Inside-Outside Net: detecting objects in context with skip pooling and recurrent neural networks. In CVPR, Cited by: §1, §3.2.
  • [2] Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In ICML, pp. 41–48. Cited by: §3.3, §4.1.
  • [3] M. Busta, L. Neumann, and J. Matas (2015) FASText: efficient unconstrained scene text detector. In ICCV, pp. 1206–1214. Cited by: §2.
  • [4] P. Dollár, R. D. Appel, S. J. Belongie, and P. Perona (2014) Fast feature pyramids for object detection. IEEE Trans. PAMI 36, pp. 1532–1545. Cited by: §2.
  • [5] B. Epshtein, E. Ofek, and Y. Wexler (2010) Detecting text in natural scenes with stroke width transform. In CVPR, pp. 2963–2970. Cited by: §2.
  • [6] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pp. 580–587. Cited by: §2.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §3.
  • [8] W. He, X. Zhang, F. Yin, and C. Liu (2017) Deep direct regression for multi-oriented scene text detection. In CVPR, pp. 745–753. Cited by: §2, Table 1.
  • [9] W. Huang, Z. Lin, J. Yang, and J. Wang (2013) Text localization in natural images using stroke feature transform and text covariance descriptors. In ICCV, pp. 1241–1248. Cited by: §2.
  • [10] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman (2016) Reading text in the wild with convolutional neural networks. IJCV 116 (1), pp. 1–20. Cited by: §2.
  • [11] D. Karatzas et al. (2015) ICDAR 2015 competition on robust reading. In ICDAR, pp. 1156–1160. Cited by: §1, §2, Table 1, §4.
  • [12] P. S. R. Kishore, S. Das, P. S. Mukherjee, and U. Bhattacharya (2019) ClueNet: a deep framework for occluded pedestrian pose estimation. In BMVC, Cited by: §3.3.
  • [13] L. Liu, Z. Qiu, G. Li, S. Liu, W. Ouyang, and L. Lin (2019) Crowd counting with deep structured scale integration network. In ICCV, Cited by: §1, §3.2.
  • [14] X. Liu, D. Liang, S. Yan, D. Chen, Y. Qiao, and J. Yan (2018) FOTS: fast oriented text spotting with a unified network. In CVPR, Cited by: §2, Table 1, §4.2.
  • [15] S. Luan, C. Chen, B. Zhang, J. Han, and J. Liu (2018) Gabor convolutional networks. IEEE Trans. IP 27 (9), pp. 4357–4366. Cited by: §1, §3.
  • [16] P. Lyu, C. Yao, W. Wu, S. Yan, and X. Bai (2018) Multi-oriented scene text detection via corner localization and region segmentation. In CVPR, pp. 7553–7563. Cited by: Table 1.
  • [17] N. Nayef et al. (2017) ICDAR 2017 robust reading challenge on multi-lingual scene text detection and script identification - RRC-MLT. In ICDAR, Vol. 1, pp. 1454–1459. Cited by: §1, Table 1, §4.
  • [18] L. Neumann and J. Matas (2010) A method for text localization and recognition in real-world images. In ACCV, pp. 770–783. Cited by: §2.
  • [19] S. Qin, A. Bissacco, M. Raptis, Y. Fujii, and Y. Xiao (2019) Towards unconstrained end-to-end text spotting. In ICCV, Cited by: §2, Table 1.
  • [20] A. Veit, T. Matera, L. Neumann, J. Matas, and S. Belongie (2016) COCO-Text: dataset and benchmark for text detection and recognition in natural images. arXiv:1601.07140. Cited by: §1, §2, Table 1, §4.
  • [21] X. Wang, Y. Jiang, Z. Luo, C. Liu, H. Choi, and S. Kim (2019) Arbitrary shape scene text detection with adaptive text region representation. In CVPR, Cited by: §2, Table 1.
  • [22] E. Xie, Y. Zang, S. Shao, G. Yu, C. Yao, and G. Li (2019) Scene text detection with supervised pyramid context network. In AAAI, Vol. 33, pp. 9038–9045. Cited by: §2, Table 1.
  • [23] C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu (2012) Detecting texts of arbitrary orientations in natural images. In CVPR, pp. 1083–1090. Cited by: Table 1, §4.
  • [24] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, and J. Liang (2017) EAST: an efficient and accurate scene text detector. In CVPR, Cited by: §2, §3.5, Table 1, §3.
  • [25] Y. Zhu, C. Yao, and X. Bai (2016) Scene text detection and recognition: recent advances and future trends. Frontiers of Computer Science 10 (1), pp. 19–36. Cited by: §2.
  • [26] C. L. Zitnick and P. Dollár (2014) Edge Boxes: locating object proposals from edges. In ECCV, pp. 391–405. Cited by: §2.