Enhanced Characterness for Text Detection in the Wild

12/04/2017 ∙ by Aarushi Agrawal, et al. ∙ Indian Institute of Technology Delhi

Text spotting is an interesting research problem because text may appear anywhere in a scene and may occur in various forms. Moreover, the ability to detect text opens avenues for improving many advanced computer vision problems. In this paper, we propose a novel language-agnostic text detection method for natural scenes that utilizes edge-enhanced Maximally Stable Extremal Regions and defines strong characterness measures. We show that a simple combination of characterness cues helps in rejecting non-text regions. The remaining regions are further refined by rejecting non-textual neighboring regions. Comprehensive evaluation of the proposed scheme shows that its generalization performance is comparable to or better than that of traditional methods for this task.




1 Introduction

Text co-occurring in images and videos serves as a warehouse of valuable information for image description and thus assists in providing suitable annotations. Typical practical applications involve extracting street names and numbers, and textual indications such as “diversion ahead”, from road signs in natural scenes. Such information can be further stored in geo-tagged databases [16]. Autonomous vehicles also depend heavily on the efficiency and accuracy of such methods to follow traffic rules effectively. Another application area is the indexing and tagging of images and videos, where text in images helps in better understanding the content [23]. Reading such text is trivial for humans, but segregating it from a challenging background remains a complicated task for machines. Traditional methods for text detection employ blob detection schemes like Maximally Stable Extremal Regions (MSERs) [1, 18], edge based analysis, Stroke Width Transform (SWT) [9, 24], strokelets [22], and features like Histogram of Oriented Gradients (HOG) [16, 7], Gabor based features [24], text covariance descriptors [9, 20] and shape descriptors (e.g. Fourier descriptors [3, 5], Zernike moments [12]). The great popularity of MSERs and SWT stems from their time complexity, which permits efficient segmentation and thereby helps in detecting text regions. MSERs are very effective in detecting text components, but they are extremely sensitive to noise. Most techniques therefore concentrate on pruning non-text regions using heuristics or geometric properties. Despite the advent of deep learning based techniques [10, 8], which have resulted in tremendous progress in machine driven text detection, the traditional methods still hold relevance, primarily owing to their simplicity and their comparable generalization capability across languages.

The authors of [15] utilize a text-specific saliency detection measure termed characterness. They demonstrate that, due to the presence of contrasting objects, saliency alone cannot be an effective indicator of a textual region. They overcome this limitation by introducing saliency cues which accentuate the boundary information in addition to saliency [17]. Deriving motivation from this work, we propose a simple combination of various characterness cues for generating candidate bounding boxes for text regions. We use these characterness cues (HOG, stroke width variance, pyramid histogram of oriented gradients (PHOG)) to refine the blobs generated by edge enhanced MSERs (eMSERs) [15] for generating text candidates. This is followed by rejection of non-text regions, using the difference of entropy as a discriminating factor. The last step is the refinement step, where we combine smaller blobs into a single text region by concatenating blobs with similar stroke width variance and characterness cue distribution. In line with the above discussion, the key contributions of the paper are listed below:

  1. We develop a language agnostic text identification framework using text candidates obtained from edge based MSERs and a combination of various characterness cues. This is followed by an entropy assisted non-text region rejection strategy. Finally, the blobs are refined by combining regions with similar stroke width variance and a similar distribution of characterness cues in the respective regions.

  2. We provide a comprehensive evaluation on popular text datasets against recent text detection techniques and show that the proposed technique provides equivalent or better results.

The paper is organized as follows. The proposed methodology is discussed in Section 2. The experimental analysis and the results are detailed in Section 3. Finally, the paper is concluded in Section 4.

2 Proposed Methodology

The workflow of the proposed method is shown in Fig. 1. In the following subsections, we describe in detail the components of the proposed method.

Figure 1: Workflow of the proposed methodology

2.1 Text candidate generation using eMSERs

We begin by generating an initial set of text candidates using the edge enhanced Maximally Stable Extremal Regions (eMSERs) approach [15]. MSER is a blob detection method which extracts covariant regions in the image. It aggregates regions which have similar intensity values at various thresholds, which makes it a suitable candidate for detecting regions with text in images. It detects characters efficiently when their boundaries are distinctive but fails in the presence of blur. To handle this, eMSERs are computed over the gradient amplitude based image. The method divides the image into two sets of regions, dark and bright: dark regions are those which have lower intensity than their surroundings, and bright regions are those which have higher intensity. Initially, non-text regions are rejected based on geometric properties like aspect ratio, number of pixels and skeleton length, followed by connected component analysis for combining the text regions. Fig. 2 shows instances of bright and dark regions formed during text candidate generation using eMSERs. As can be observed, in the bright regions the text is lighter than the dark background (red), while in the dark regions dark text is highlighted against the light colored background.

Figure 2: Left Column: (a) Original image (b) Bright regions (c) Dark regions (d) Final set of blobs detected by eMSERs after processing these regions; Right Column: Top Row: (a)-(c) eMSER region; Bottom Row: (a) Binarized region obtained from the original image (b) Binarized region neglected due to size constraints (c) Binarized image: refined object (alphabet) obtained with less disturbance, which gives better results
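The geometric pre-filtering mentioned in this subsection can be illustrated with a minimal sketch. It assumes each candidate blob is summarized by its bounding box and pixel count; the `Region` class and all threshold values are hypothetical, since the text does not state the values used, and the skeleton length test is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A candidate blob summarized by its bounding box and pixel count (hypothetical type)."""
    x: int
    y: int
    width: int
    height: int
    num_pixels: int


# Hypothetical thresholds for illustration only; the paper does not report its values.
MIN_PIXELS = 30
MAX_PIXELS = 20000
MAX_ASPECT_RATIO = 5.0


def aspect_ratio(region: Region) -> float:
    """Ratio of the longer bounding-box side to the shorter one (always >= 1)."""
    longer = max(region.width, region.height)
    shorter = max(1, min(region.width, region.height))
    return longer / shorter


def is_plausible_character(region: Region) -> bool:
    """Keep a blob only if its size and elongation are plausible for a character."""
    if not (MIN_PIXELS <= region.num_pixels <= MAX_PIXELS):
        return False
    return aspect_ratio(region) <= MAX_ASPECT_RATIO


def filter_candidates(regions):
    """Return the regions that survive the geometric tests, in their original order."""
    return [r for r in regions if is_plausible_character(r)]
```

With these assumed thresholds, a 20x30 blob of 300 pixels is kept, while a 5-pixel speck and a 400x10 line-like blob are rejected.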

2.2 Elimination of non-text regions

The regions are further refined based on the property that text usually appears against surroundings of a distinctive intensity. Utilizing this property, we retain textual regions while rejecting non-textual ones. To achieve this, we find the image patches corresponding to the blobs identified by eMSERs. As the image patches contain spurious data along with the text, we binarize each image patch using Otsu’s threshold [19] for that region and obtain the common region between the binarized image patch and the blob obtained by eMSERs. A blob is rejected if it is not contained in the binarized image patch. Fig. 2 shows some examples of this rejection strategy. We then define various characterness cues [15] for the common regions. Apart from the stroke width and HOG used in [15], we check the values of pyramid histogram of oriented gradients (PHOG) features and entropy for the blobs. During the experiments we found that PHOG is a better measure of similarity than HOG. For alphabets, i.e. textual regions, we observe that the HOG and PHOG values are very low. We now briefly explain these cues:

  1. Stroke width variance: A stroke is effectively a continuous band of nearly constant width in an image. The stroke width transform (SWT) [4] is a local operator which gives the most likely stroke width for every pixel in the image. In SWT, all pixels are initialized with infinity as their stroke width. A Canny edge map is then computed, followed by the gradient direction at every edge pixel. If the gradient direction ($d_p$) of an edge pixel $p$ is roughly opposite to the gradient direction ($d_q$) of the next edge pixel $q$ met along the ray, then the distance between $p$ and $q$ is the stroke width; otherwise the ray is discarded. Pixels having similar stroke widths are grouped using connected component analysis. The letter candidates are chosen after some post processing based on the stroke width variance and aspect ratio, and are then grouped to give text regions. The idea is to segregate text from other high frequency content that might be present in the scene, e.g. tree branches. We perform a bottom-up aggregation by merging pixels with similar stroke widths into connected components, which allows characters to be detected across a wide range of scales. The transform is able to identify near-horizontal text candidates.

    Stroke width of a region ($r$) is defined as [15],

    $$SW(r) = \frac{\mathrm{Var}(l)}{E(l)^2}$$

    where $l$ defines the shortest path between every pixel in the skeletal image of region ($r$) and the boundary of the region, $\mathrm{Var}(l)$ is the stroke width variance and $E(l)$ gives the stroke width mean. We utilize only the stroke width variance, which should be low for text candidates. We also store stroke width values, including the stroke width deviation, computed with respect to H and W, the height and width of the common region.

  2. HOG and PHOG: PHOG consists of a histogram of oriented gradients over every sub-region of the image at every resolution level. The HOG vectors computed over the grid cells at each pyramid level are concatenated. Compared to HOG, PHOG is more efficient. HOG is largely invariant to local geometric and photometric transformations. In addition, PHOG provides a spatial layout for the local shape of the image. Therefore, we utilize their combination as a characterness cue.

  3. Entropy: We calculate Shannon’s entropy for the common regions ($r$), given as

    $$\mathrm{Entropy}(r) = -\sum_{i=1}^{N} p_i \log p_i$$

    where $N$ denotes the number of gray levels and $p_i$ refers to the probability associated with gray level $i$. In information theory, entropy is the measure of the average information of a signal given its probability distribution. Higher entropy indicates higher disorder. In our scenario, text candidates show lower variation in color values, so the histogram typically has one sharp peak at a dominating color. For non-character candidates, however, the color values span the histogram as a result of color variation. Consequently, the entropy of text candidates is smaller than that of non-text candidates, which makes it an important cue for distinguishing between them and rejecting non-text candidates, as described in the next section.
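Two of the cues above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the stroke width cue is assumed to be the variance of the skeleton-to-boundary distances divided by their squared mean, following the Characterness formulation [15], and the entropy uses a base-2 logarithm, which the text does not specify. The function names and the flat-list inputs are hypothetical.

```python
import math
from collections import Counter


def stroke_width_cue(distances):
    """Variance of skeleton-to-boundary distances divided by the squared mean.

    Lower values indicate more uniform strokes, which is expected for text.
    """
    if not distances:
        raise ValueError("need at least one distance")
    mean = sum(distances) / len(distances)
    if mean == 0:
        raise ValueError("mean stroke width must be non-zero")
    variance = sum((d - mean) ** 2 for d in distances) / len(distances)
    return variance / (mean ** 2)


def shannon_entropy(gray_values):
    """Shannon entropy (in bits, an assumed base) of the gray-level histogram of a region."""
    if not gray_values:
        raise ValueError("need at least one pixel")
    counts = Counter(gray_values)
    total = len(gray_values)
    entropy = 0.0
    for count in counts.values():
        p = count / total  # probability of this gray level
        entropy -= p * math.log2(p)
    return entropy
```

A region dominated by one or two gray levels, as expected for a character on a uniform background, gives a low entropy, whereas a region whose pixels spread over many gray levels gives a high one.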

2.3 Bounding Box Refinement

The remaining regions are refined by calculating a set of parameters such as the stroke width distribution, the pretrained characterness cue distribution and the stroke width difference. We define a characterness cue distribution by computing the characterness cue values on the ICDAR 2013 dataset. We use this distribution to combine neighboring candidate regions and aggregate them into one larger text region. We recompute the neighbors if they have a similar distribution and reject them otherwise. Finally, we combine all the neighboring regions into a single text candidate. Fig. 3 shows the results of this post processing step.

Figure 3: (a) Smaller regions in the blobs detected by eMSERs (b) Final result after postprocessing.
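The merging of neighboring blobs can be sketched as follows. This is a simplified illustration: it compares only the mean stroke width of adjacent blobs, whereas the method described above also uses the characterness cue distribution. The `Blob` class, the gap and ratio thresholds, and the left-to-right greedy strategy are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Blob:
    """Axis-aligned box (x0, y0) to (x1, y1) with a mean stroke width (hypothetical type)."""
    x0: int
    y0: int
    x1: int
    y1: int
    stroke_width: float


# Hypothetical thresholds for illustration only.
MAX_GAP = 15            # largest horizontal gap, in pixels, between neighbors
MAX_STROKE_RATIO = 1.5  # largest allowed ratio between the two stroke widths


def similar_stroke(a: Blob, b: Blob) -> bool:
    lo, hi = sorted((a.stroke_width, b.stroke_width))
    return lo > 0 and hi / lo <= MAX_STROKE_RATIO


def are_neighbors(a: Blob, b: Blob) -> bool:
    """Blobs are neighbors if they are horizontally close and overlap vertically."""
    horizontal_gap = max(a.x0, b.x0) - min(a.x1, b.x1)
    vertical_overlap = min(a.y1, b.y1) - max(a.y0, b.y0)
    return horizontal_gap <= MAX_GAP and vertical_overlap > 0


def merge(a: Blob, b: Blob) -> Blob:
    """Union of the two boxes; the stroke width is a plain (unweighted) average."""
    return Blob(min(a.x0, b.x0), min(a.y0, b.y0),
                max(a.x1, b.x1), max(a.y1, b.y1),
                (a.stroke_width + b.stroke_width) / 2)


def refine(blobs):
    """Greedily merge blobs from left to right into larger text regions."""
    merged = []
    for blob in sorted(blobs, key=lambda b: b.x0):
        if merged and are_neighbors(merged[-1], blob) and similar_stroke(merged[-1], blob):
            merged[-1] = merge(merged[-1], blob)
        else:
            merged.append(blob)
    return merged
```

Because each blob is compared only with the most recently merged group, this sketch suits a single near-horizontal text line; handling several lines would need a proper grouping step such as connected components over the neighbor graph.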

3 Experimental Results and Discussions

3.1 Experimental Setup and Datasets

The experiments were performed on a machine with a Xeon processor and an NVIDIA graphics card, with Matlab as the programming platform. The datasets used for evaluation of the proposed methodology are publicly available text datasets: MSRATD500 [21] and KAIST [14]. MSRATD500 consists of 500 images of indoor and outdoor scenes, with varying image sizes. It contains scenes capturing signboards with text in Chinese, English and a mixture of both. The diversity and complex backgrounds of the images make the dataset challenging. The KAIST scene text dataset consists of images captured in different environmental settings (indoor and outdoor) with varying lighting conditions. It contains scenes with English, Korean and mixed texts. The majority of the scenes show shop signs and street numbers.

3.2 Evaluation Methodology

3.2.1 Metrics.

The proposed technique is evaluated with precision, recall and F-measure metrics on the chosen datasets. The input for computing these metrics is the Intersection over Union (IoU) score, given as

$$\mathrm{IoU} = \frac{|D \cap G|}{|D \cup G|}$$

where $D$ indicates the set of white pixels inside the blobs detected by our strategy before the elimination step (smaller individual blobs), $G$ indicates the set of white pixels inside the ground truth region and $|\cdot|$ is the cardinality. The performance metrics in this paper are reported on blobs whose region is mostly text, as determined by thresholding the IoU.
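As an illustration of these metrics, the sketch below computes IoU on sets of pixel coordinates and derives precision, recall and F-measure from match counts. The threshold value of 0.5 is an assumption, since the value used in the evaluation is not preserved in the text, and the F-measure is taken as the harmonic mean of precision and recall computed from pooled counts; the paper may aggregate differently (for example per image).

```python
IOU_THRESHOLD = 0.5  # assumed value; the threshold used in the paper is not stated here


def iou(detected, ground_truth):
    """Intersection over Union of two collections of (x, y) pixel coordinates."""
    detected, ground_truth = set(detected), set(ground_truth)
    union = detected | ground_truth
    if not union:
        return 0.0
    return len(detected & ground_truth) / len(union)


def is_match(detected, ground_truth, threshold=IOU_THRESHOLD):
    """A detection counts as correct when its IoU exceeds the threshold."""
    return iou(detected, ground_truth) > threshold


def precision_recall_f(true_positives, num_detections, num_ground_truth):
    """Precision, recall and their harmonic mean from pooled counts."""
    precision = true_positives / num_detections if num_detections else 0.0
    recall = true_positives / num_ground_truth if num_ground_truth else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

For example, two 4x4 pixel blocks that share half of their columns overlap in 8 pixels and cover 24 pixels in total, giving an IoU of 1/3, which would not count as a match under the assumed threshold.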

3.2.2 Training and Testing.

We perform training on the ICDAR 2013 [13] dataset, while the test set consists of the MSRATD and KAIST datasets. This is unlike earlier methods where, in general, the training and testing samples are drawn from the same dataset. Such a setting makes the evaluation more challenging and allows us to evaluate the generalization ability of the various techniques. The results for the Characterness [15] and Blob Detection [11] methods, with the training and testing sets described above, are reported using the publicly available source code.

3.3 Results

3.3.1 Qualitative Results.

Figure 4: Results on i) MSRATD and ii) KAIST dataset images: (a) Ground Truth (b) Characterness [15] (c) Blob detection [11] (d) Proposed Approach (before refinement step) (e) Proposed Approach

Method/Metric          Precision  Recall  F-measure
Proposed               0.85       0.33    0.46
Characterness [15]     0.53       0.25    0.31
Blob Detection [11]    0.80       0.47    0.55
Epshtein et al. [4]    0.25       0.25    0.25
Chen et al. [2]        0.05       0.05    0.05
TD-ICDAR [21]          0.53       0.52    0.50
Gomez et al. [6]       0.58       0.54    0.56
Table 1: Performance measures on MSRATD dataset

Figure 4 shows qualitative results on a few example images from the MSRATD and KAIST datasets. It can be observed that the images obtained after region refinement demonstrate better localization of textual regions, while those on the MSRATD dataset (Fig. 4 (i)) show tighter localization than other techniques. One of the aims of the proposed technique is to reduce false positives. This can be observed in the second row of Fig. 4 (i), where the proposed method provides a tight bounding box on text regions while all other techniques except Characterness produce false positives. The signboard in the image does not contain any text, yet the contemporary methods detect it as a text candidate. This could be due to the fact that the signboard contains a rounded sketch which may resemble alphabets such as ’O’, ’Q’ etc. Since the proposed technique strictly encodes the stroke width variance along with other characterness cues, we are able to avoid detecting such false candidates. Similar findings are observed for the KAIST dataset as well.

3.3.2 Quantitative Results.

Method/Metric          Precision  Recall  F-measure
Proposed               0.8485     0.3299  0.4562
Characterness          0.5299     0.2476  0.3136
Blob Detection         0.8047     0.4716  0.5547

Method/Metric          Precision  Recall  F-measure
Proposed               0.9545     0.3556  0.4994
Characterness          0.7263     0.3209  0.4083
Blob Detection [11]    0.9091     0.5141  0.6269

Method/Metric          Precision  Recall  F-measure
Proposed               0.9702     0.3362  0.4838
Characterness          0.8345     0.3043  0.4053
Blob Detection         0.9218     0.4826  0.5985

Method/Metric          Precision  Recall  F-measure
Proposed               0.9244     0.3407  0.4798
Characterness [15]     0.6969     0.2910  0.3757
Blob Detection [11]    0.8785     0.4894  0.5933
Gomez et al. [6]       0.66       0.78    0.71
Lee et al. [14]        0.69       0.60    0.64
Table 2: Performance measures on KAIST dataset

Tables 1 and 2 show empirical results on the MSRATD and KAIST scene datasets respectively. On the MSRATD dataset, the proposed method achieves significantly higher precision and F-measure than Characterness, and a lower recall rate than Blob Detection. The proposed technique outperforms the compared methods on precision while performing close to them in terms of F-measure and recall. It is important to note here that the proposed technique does not involve any explicit training, which allows it to be directly extended to domains such as symbol identification and road sign identification. On the KAIST dataset, the proposed method consistently outperforms Characterness on all benchmarks, with average improvements in precision, recall and F-measure. The proposed technique also achieves better precision than Blob Detection. The results show that the proposed method is able to generalize well to a test set while being trained on an entirely distinct character set. For completeness of comparison, we also provide the performance of other techniques on the KAIST dataset. However, it should be noted that the objective of these techniques is generally to maximize text detection for a specific script, or to attain script independence using curated training examples containing the mixture of scripts to be detected. This possibly makes the comparison tougher for the proposed technique, whose objective is to obtain better generalization ability.

4 Conclusion

This paper proposed an effective text detection scheme utilizing a stronger characterness measure. A post processing step is used to reject the non-textual blobs and to combine the smaller blobs obtained by eMSERs into one larger region. The effectiveness of the proposed scheme has been analyzed with the precision, recall and F-measure metrics, showing that the proposed scheme performs better than the traditional text detection schemes.


References

  • [1] Chen, H., Tsai, S.S., Schroth, G., Chen, D.M., Grzeszczuk, R., Girod, B.: Robust text detection in natural images with edge-enhanced maximally stable extremal regions. In: 2011 18th IEEE International Conference on Image Processing. pp. 2609–2612. IEEE (2011)
  • [2] Chen, X., Yuille, A.L.: Detecting and reading text in natural scenes. In: Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on. vol. 2, pp. II–II. IEEE (2004)
  • [3] De, S., Stanley, R.J., Cheng, B., Antani, S., Long, R., Thoma, G.: Automated text detection and recognition in annotated biomedical publication images. International Journal of Healthcare Information Systems and Informatics (IJHISI) 9(2), 34–63 (2014)
  • [4] Epshtein, B., Ofek, E., Wexler, Y.: Detecting text in natural scenes with stroke width transform. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. pp. 2963–2970. IEEE (2010)
  • [5] Fabrizio, J., Marcotegui, B., Cord, M.: Text detection in street level images. Pattern Analysis and Applications 16(4), 519–533 (2013)
  • [6] Gomez, L., Karatzas, D.: Multi-script text extraction from natural scenes. In: Document Analysis and Recognition (ICDAR), 2013 12th International Conference on. pp. 467–471. IEEE (2013)
  • [7] Hanif, S.M., Prevost, L.: Text detection and localization in complex scene images using constrained adaboost algorithm. In: 2009 10th International Conference on Document Analysis and Recognition. pp. 1–5. IEEE (2009)
  • [8] He, T., Huang, W., Qiao, Y., Yao, J.: Text-attentional convolutional neural network for scene text detection. IEEE Transactions on Image Processing 25(6), 2529–2541 (2016)
  • [9] Huang, W., Lin, Z., Yang, J., Wang, J.: Text localization in natural images using stroke feature transform and text covariance descriptors. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1241–1248 (2013)
  • [10] Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Reading text in the wild with convolutional neural networks. International Journal of Computer Vision 116(1), 1–20 (2016)
  • [11] Jahangiri, M., Petrou, M.: An attention model for extracting components that merit identification. In: 2009 16th IEEE International Conference on Image Processing (ICIP). pp. 965–968. IEEE (2009)
  • [12] Kan, C., Srinath, M.D.: Invariant character recognition with Zernike and orthogonal Fourier–Mellin moments. Pattern Recognition 35(1), 143–154 (2002)
  • [13] Karatzas, D., Shafait, F., Uchida, S., Iwamura, M., i Bigorda, L.G., Mestre, S.R., Mas, J., Mota, D.F., Almazan, J.A., de las Heras, L.P.: ICDAR 2013 robust reading competition. In: Document Analysis and Recognition (ICDAR), 2013 12th International Conference on. pp. 1484–1493. IEEE (2013)
  • [14] Lee, S., Cho, M.S., Jung, K., Kim, J.H.: Scene text extraction with edge constraint and text collinearity. In: Pattern Recognition (ICPR), 2010 20th International Conference on. pp. 3983–3986. IEEE (2010)
  • [15] Li, Y., Jia, W., Shen, C., van den Hengel, A.: Characterness: An indicator of text in the wild. IEEE Transactions on Image Processing 23(4), 1666–1677 (2014)
  • [16] Minetto, R., Thome, N., Cord, M., Leite, N.J., Stolfi, J.: Snoopertext: A text detection system for automatic indexing of urban scenes. Computer Vision and Image Understanding 122, 92–104 (2014)
  • [17] Mukherjee, P., Lall, B., Shah, A.: Saliency map based improved segmentation. In: Image Processing (ICIP), 2015 IEEE International Conference on. pp. 1290–1294. IEEE (2015)
  • [18] Neumann, L., Matas, J.: Real-time scene text localization and recognition. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. pp. 3538–3545. IEEE (2012)
  • [19] Otsu, N.: A threshold selection method from gray-level histograms. Automatica 11(285-296), 23–27 (1975)
  • [20] Sivic, J., Zisserman, A.: Video google: A text retrieval approach to object matching in videos. In: Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on. pp. 1470–1477. IEEE (2003)
  • [21] Yao, C., Bai, X., Liu, W., Ma, Y., Tu, Z.: Detecting texts of arbitrary orientations in natural images. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. pp. 1083–1090. IEEE (2012)
  • [22] Yao, C., Bai, X., Shi, B., Liu, W.: Strokelets: A learned multi-scale representation for scene text recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4042–4049 (2014)
  • [23] Ye, Q., Doermann, D.: Text detection and recognition in imagery: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(7), 1480–1500 (2015)
  • [24] Yi, C., Tian, Y.: Localizing text in scene images by boundary clustering, stroke segmentation, and string fragment classification. IEEE Transactions on Image Processing 21(9), 4256–4268 (2012)