Overlay text bands in TV broadcast news videos provide rich semantic information about news stories that is otherwise hard to estimate by processing audio-visual data. In Indian TV news broadcasts, overlay text has widely varying formats and occupies part of the total screen area during regular news presentations, and more while presenting headlines etc. Hence, overlay text is an important feature in a number of sub-tasks of news video analysis, viz. news summarization, story segmentation, indexing and linking, commercial detection etc.
A video text extraction pipeline generally involves text detection and localization in each frame, text tracking over the frames, and recognition of the text using an OCR engine. In TV news broadcasts, text is overlaid in the form of different text bands. Text bands contain single or multi-line sentences or semantically linked sets of words (e.g. the name of a person followed by the designation in the next line). Different text bands have different semantic meanings and are characterized by the on-screen position and style of the text band. For example, text relevant to a story is often overlaid in the upper part of the screen in a large font size and with high color contrast. Thus, instead of identifying discrete words, we propose to detect, track and recognize text from overlay text bands. We define a text band as a region bounded by a rectangle enclosing one or more adjoining text regions (words), subject to the following conditions. First, all the text regions should have almost the same stroke width. Second, there should be no sharp changes in the background or the foreground (text) color of the text regions. Third, all the text regions should have a common base line. Fourth, the text regions should not have any separator between them.
Overlay text detection schemes can be categorized as either patch based or geometrical property based. Patch based techniques extract features from image patches and identify text regions using pre-trained classifiers. These patches are grouped further to detect the text regions. These methods have shown excellent performance on various challenging real life problems, but at the cost of rigorous pre-training. On the other hand, geometrical property based methods make assumptions on representative features of text regions like high edge/corner density [3, 4], edge continuity, stroke consistency [1, 6], and color consistency. These methods are well known for off-the-shelf deployment (as no pre-training is required) and fast speed of operation.
The simplest of the geometrical property based approaches rely on edge or corner densities. Edge/corner density based methods assume a high density of strong edges in text regions and have been used extensively due to their simplicity and speed. However, these approaches require the selection of thresholds and suffer from high edge density in non-text regions, which results in false positives. Based on different properties of text regions, various curative measures have been proposed in the literature to suppress strong edges from non-text regions and hence the false positives. For example, Kim et al. have observed that text regions have a peculiar color transition pattern due to high contrast and hence suggested the use of color transition maps instead of edge images to calculate the density.
We believe that, for the comparatively simple task of overlaid text detection, patch based approaches and some of the complex methods like the stroke width transform are overkill in terms of resources. In this work we build upon basic edge density based text detection and propose the following to improve the video text extraction pipeline, specifically for TV news broadcast videos. First, we propose to reduce false positives in edge density based methods by selectively boosting overlaid text edges using a contrast enhancement based preprocessing scheme. Moreover, we use derivatives of edge projection profiles for threshold free detection of text bands. Second, we propose a spatial relations based reasoning framework for tracking multiple (detected) text bands, which is capable of identifying and handling various problems arising out of detection/tracking failures. Third, we propose to improve the performance of the Tesseract OCR engine by using synthetically generated training data and incorporating a dictionary of words derived from web news articles. The rest of this paper is organized as follows. The methodology for frame-wise text band detection is presented in Section II. The proposed approach for multiple text region tracking in videos is presented in Section III. Modifications to the Tesseract OCR engine to improve the recognition rate are described in Section IV. The experimental results are discussed in Section V. Finally, we conclude in Section VI and sketch the future scope of work.
II Text Band Detection
Overlay text bands usually have a clutter free background and high contrast between foreground (text) and background, and do not suffer from perspective distortions. Hence, comparatively faster and easy to deploy geometrical approaches are a natural choice for overlaid text detection. We have adopted and improved on a well established edge density based approach for detecting text bands instead of discrete words. The basic edge based method detects and localizes text regions by using horizontal and vertical projection profiles of the gradient magnitude (edge) image. The performance of the basic method deteriorates due to high edge densities in non-text regions, misalignment of different text bands and high sensitivity to thresholds. We propose to reduce the false positives by a preprocessing technique that selectively boosts text edges while suppressing non-text edges, based on the contrast of the gradient magnitude image. We use first and second derivatives of the projection profiles to reduce the dependence on projection profile thresholds. Our proposed preprocessing scheme is described next.
II-A Preprocessing Proposal
In edge density based text detection, the first step is to calculate the gradient magnitude/edge image, followed by binarization of the gradient magnitude image to get an edge map. The threshold for binarizing the gradient magnitude is selected such that the edge map has edges only from text regions. This binarization threshold is a critical parameter for basic text detection approaches [9, 4, 8]. However, gradient magnitudes from text and non-text regions have significant overlap (Figure 1(d)) and hence choosing a threshold to discriminate text and non-text gradients is not trivial. In cases of such overlaps, usually a lower threshold is chosen to eliminate definitely non-text regions. Further, local text specific features are used to eliminate strong edges from non-text regions.
Due to the high contrast of text bands, gradient magnitudes originating from text regions lie on the comparatively higher side of the edge magnitude histogram (Figure 1(d)). We propose to increase the dynamic range of the gradient magnitude histogram to increase the discrimination between text and non-text gradient magnitudes. We achieve this in two steps: first, by linearly stretching the gradient magnitude histogram (linear contrast enhancement) and second, by equalizing the stretched gradient histogram. We use the isotropic Scharr operator, rather than the Sobel operator proposed in the basic method, to compute the gradient magnitude image. We start by normalizing the gradient magnitude image by the maximum gradient magnitude value. Next, we eliminate edges from definitely non-text regions, which have small gradient magnitudes, by linearly stretching the gradient magnitude histogram (Figure 1(e)). The stretching is governed by a suppression factor and a normalizing constant. The factor decides the extent of suppression, and its value should be selected so as to nullify gradient magnitudes from “definitely non-text regions”. The lowest non-suppressed gradient magnitude value depends on both this factor and the maximum gradient value. Thus, the value of the factor is not independent of the data, and we select it using Otsu’s thresholding scheme. Otsu’s threshold gives an upper bound on the gradient magnitudes originating from definitely non-text regions, so we choose the factor such that the lowest non-suppressed gradient magnitude coincides with Otsu’s threshold. In the final expression for the stretched histogram image (Figure 1(b)), a unit step function sets the suppressed magnitudes to zero.
Histogram equalization is then applied to the stretched image to obtain the edge map, which we use for text band detection. The histogram equalized image (Figure 1(f)) now has well distributed gradient magnitudes. Also, it is to be noted that gradient magnitudes originating from text regions are concentrated towards the higher side of the pixel value histogram, while non-text magnitudes are concentrated towards the lower side. Hence, the preprocessing stage suppresses false positives originating from strong non-text edges. Our next proposal, threshold free overlaid text band detection using derivatives of the horizontal and vertical projection profiles, is described in Sub-section II-B.
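The two preprocessing steps can be illustrated with a minimal sketch in plain Python. The exact expressions of the paper are not reproduced here; the sketch assumes that the normalized magnitude g is mapped to (g - alpha) / (1 - alpha) when g exceeds a cutoff alpha and to zero otherwise, that alpha is Otsu's threshold computed on the normalized magnitudes, and that equalization maps each value through the rescaled cumulative histogram. The function names `otsu_threshold`, `stretch` and `equalize` are hypothetical, and the input is a flat list of gradient magnitudes rather than an image.

```python
def otsu_threshold(values, bins=256):
    """Otsu's threshold for values in [0, 1] (maximizes between-class variance)."""
    hist = [0] * bins
    for v in values:
        hist[min(int(v * bins), bins - 1)] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_bin, best_var = 0, -1.0
    w0 = sum0 = 0
    for i, h in enumerate(hist):
        w0 += h
        sum0 += i * h
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mean0, mean1 = sum0 / w0, (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_bin = var_between, i
    return (best_bin + 1) / bins  # upper edge of the last low-class bin


def stretch(magnitudes):
    """Normalize by the maximum, then suppress and rescale (assumed form)."""
    g_max = max(magnitudes)
    if g_max == 0:
        return [0.0] * len(magnitudes)
    norm = [g / g_max for g in magnitudes]
    alpha = otsu_threshold(norm)  # assumed: cutoff equals Otsu's threshold
    return [(g - alpha) / (1.0 - alpha) if g > alpha else 0.0 for g in norm]


def equalize(values, bins=256):
    """Histogram equalization of values in [0, 1] via the rescaled CDF."""
    idx = [min(int(v * bins), bins - 1) for v in values]
    hist = [0] * bins
    for i in idx:
        hist[i] += 1
    cdf, running = [], 0
    for h in hist:
        running += h
        cdf.append(running / len(values))
    cdf_min = min(c for c, h in zip(cdf, hist) if h)
    if cdf_min == 1.0:  # all values fall in one bin
        return [0.0] * len(values)
    return [(cdf[i] - cdf_min) / (1.0 - cdf_min) for i in idx]


# Six weak (non-text) magnitudes followed by four strong (text) ones.
stretched = stretch([1, 2, 1, 3, 2, 1, 40, 42, 45, 41])
edge_map = equalize(stretched)
```

In this toy input the weak magnitudes are nullified by the stretching step, while the strong ones are spread over the upper part of the range by equalization.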
II-B Text Band Detection
The basic method of text detection assumes high edge density in text regions. It uses the horizontal and vertical projection profiles of the gradient magnitude or edge image to locate high edge density text regions. Text bands are aligned horizontally. Thus, the horizontal projection profile is processed first to obtain bands having sufficient edge density. The horizontal projection profile of an edge image is obtained by summing the edge values along each row. This profile is compared against a threshold to discard horizontal lines having insufficient edge density: a run of consecutive rows is marked as a horizontal band if and only if the profile exceeds the threshold at every row of the run. Further, the vertical projection profile, i.e. the column-wise sum of edge values, is calculated within every horizontal band, followed by thresholding with a second threshold. A run of consecutive columns within a particular horizontal band is called a text region if and only if the vertical profile exceeds the second threshold at every column of the run.
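The basic thresholded procedure described above can be sketched as follows. This is an illustrative reading in plain Python, not the reference implementation of the basic method; the edge image is assumed to be a list of rows, and the names `runs_above`, `detect_text_regions`, `t_h` and `t_v` are hypothetical.

```python
def horizontal_profile(edge):
    """Row-wise sum of edge values."""
    return [sum(row) for row in edge]


def vertical_profile(edge, top, bottom):
    """Column-wise sum of edge values restricted to rows top..bottom-1."""
    width = len(edge[0])
    return [sum(edge[y][x] for y in range(top, bottom)) for x in range(width)]


def runs_above(profile, threshold):
    """Maximal runs (start, end), end exclusive, where profile > threshold."""
    runs, start = [], None
    for i, v in enumerate(profile):
        if v > threshold and start is None:
            start = i
        elif v <= threshold and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def detect_text_regions(edge, t_h, t_v):
    """Boxes (left, top, right, bottom) found by thresholding both profiles."""
    boxes = []
    for top, bottom in runs_above(horizontal_profile(edge), t_h):
        columns = vertical_profile(edge, top, bottom)
        for left, right in runs_above(columns, t_v):
            boxes.append((left, top, right, bottom))
    return boxes


# Two "words" on the same line, separated by a gap of two columns.
edge = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 1, 1],
    [0, 1, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
boxes = detect_text_regions(edge, t_h=2, t_v=1)
```

On this toy image the two words come out as two separate boxes, which illustrates why the thresholded procedure yields word-level regions rather than text bands, and why the result depends on the choice of both thresholds.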
The performance of the basic edge density based method is curtailed by its high dependency on the horizontal and vertical projection profile thresholds. Moreover, the basic approach detects words as text regions instead of text bands, which is not favorable for TV broadcast news.
We propose to first detect all possible horizontal lines (arising due to text band boundaries) instead of horizontal bands. Once all these horizontal lines are located, we treat the region between each pair of consecutive lines as a potential horizontal band and locate text regions by examining its vertical projection profile. Abrupt changes in the horizontal projection profile signify the presence of a horizontal line or boundary. The first difference of the horizontal profile captures these abrupt changes: it has very high values at boundary locations and zero or very small values elsewhere. Further, the first difference of the profile has local extrema at boundary locations. The second difference of the horizontal projection profile locates these local extrema in the first difference and hence the potential text band boundaries. The various steps in text band detection are illustrated in Figure 2.
The differentiated profile (Figure 2(b)) has non-zero values only at the locations where discontinuities or horizontal lines are present. A single horizontal line in the image produces multiple non-zero values in the differentiated profile. A connected component analysis (CCA) is performed on the differentiated profile to group the prominent non-zero values arising from a single horizontal line, by assigning the same label to all values in a group. The result of CCA is stored in a label array, and the number of distinct labels assigned by CCA is equal to the number of potential horizontal lines. The grouped non-zero values are shown marked in Figure 2(b). The second difference of the projection profile is thresholded by a local mean, computed over the set of points that have the same label in the label array. Identifying the local minimum in each such region gives the location of the horizontal line in the respective local region.
Localized horizontal lines are shown in Figure 2(c). The region between every consecutive pair of horizontal lines is considered to be a potential horizontal band. The vertical projection profile is calculated in every potential horizontal band. Next, we apply CCA (as done in the case of the horizontal profile) to the first difference of the vertical profile to locate the vertical band boundaries in each horizontal band. A second label array stores the result of this CCA. The bounding boxes containing the text bands are thus obtained from the horizontal and vertical label arrays.
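The idea of locating boundaries from the differentiated profile can be sketched as below. This is a simplified illustration under stated assumptions, not the procedure of the paper: the one-dimensional grouping stands in for the CCA step, and the boundary within each group is taken directly as the position of the largest absolute first difference instead of thresholding the second difference by a local mean. The names `group_nonzero`, `locate_boundaries` and the tolerance `eps` are hypothetical.

```python
def first_difference(profile):
    """diff[i] = profile[i + 1] - profile[i]."""
    return [profile[i + 1] - profile[i] for i in range(len(profile) - 1)]


def group_nonzero(diff, eps=1e-6):
    """Group consecutive indices with non-zero difference (1-D connected components)."""
    groups, current = [], []
    for i, d in enumerate(diff):
        if abs(d) > eps:
            current.append(i)
        elif current:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


def locate_boundaries(profile):
    """One boundary per group, placed just after the sharpest change in the group."""
    diff = first_difference(profile)
    return [max(g, key=lambda i: abs(diff[i])) + 1 for g in group_nonzero(diff)]


# A single band of high edge density occupying rows 3..5.
profile = [0, 0, 1, 9, 9, 9, 2, 0, 0]
lines = locate_boundaries(profile)
# Consecutive boundaries delimit potential bands as (top, bottom), bottom exclusive.
bands = list(zip(lines, lines[1:]))
```

No profile threshold is involved: the only tunable quantity in this sketch is the numerical tolerance used to decide that a difference is zero.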
The proposed approaches for image pre-processing (sub-section II-A) and text band detection have shown sufficient reduction of false positives in text band detection for individual frames (Table-III). We next introduce the framework for tracking (Section III) the text bands detected in individual frames.
III Multiple Text Band Tracking
Tracking aims at associating the detection results across frames to establish a time history of the image plane trajectory of objects. This reduces the cost of detection and recognition in every video frame. Moreover, false detections arising out of local artifacts, which do not persist for more than a few frames, can be filtered out by trajectory analysis. The multiple text region tracking is performed over two sets: first, the set consisting of the text bounding rectangles tracked up to the previous instant; and second, the set containing the bounding rectangles of the text regions detected in the present frame.
The association between the text regions from two consecutive frames is measured by their overlap. For example, if a tracked region and a detected region have significant overlap, then we can conclude that both of them indicate the same text region and hence the former can be updated by the latter. However, errors in text detection do persist, either due to algorithm failure or on account of image quality. In many cases, either text bands appear fragmented after detection or the detection itself fails. This calls for an in-depth and formal analysis for identifying all such problem situations that the process of tracking may encounter. The associations between the elements of the tracked set and the detected set are resolved in two steps. First, we construct the set of detected regions overlapping with each previously tracked region and, similarly, the set of previously tracked regions overlapping with each presently detected region. Next, we categorize the nature of these overlaps using spatial relations.
Each tracked text region is represented by its bounding rectangle and its color histogram. Thus, in ideal situations, the corresponding (detected) text region in the current frame can be identified by checking for maximum overlap with the tracked bounding rectangle and a color match with the tracked histogram. However, problem cases such as failure of text detection in the current frame (thereby losing correspondence), appearance of new text bands which have no association with previously tracked text bands, and disappearance of existing text regions call for the inclusion of reasoning through association analysis. The process of reasoning involves the estimation of qualitative spatial relations between text regions in the tracked and detected sets. In this context, we use the RCC-5 spatial relations, which are described next.
III-A The RCC-5 Relations
Qualitative spatial relations are a common way to represent knowledge about configurations of interacting objects, thereby reducing the burden of quantitative expressions of such relative arrangements. Besides, composition of existing relations allows the possibility of deducing newer relations among objects. There exists a vast literature on the spatial relations of overlapping objects, among which the region connection relations are most widely used. In the RCC-5 calculus these are discrete (DR), partially overlapping (PO), proper part (PP), inverse proper part (PPi) and equal (EQ). For the estimation of these relations, we define a fractional overlap measure between two regions. The predicates for detecting the object-blob relations by using the fractional overlap measure, evaluated in both directions between the two regions and with respect to a tolerance, are shown in Table I. We next describe the procedure for identifying the different cases in the context of multiple text region tracking.
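The following sketch shows one way such predicates could be evaluated for axis-aligned bounding rectangles. The predicates of Table I are not reproduced here, so everything in the sketch is an assumption: the fractional overlap of a with b is taken to be the intersection area divided by the area of a, the tolerance `tol` and the decision order in `rcc5` are illustrative, and rectangles are tuples (x1, y1, x2, y2).

```python
def area(r):
    """Area of rectangle (x1, y1, x2, y2); zero if the rectangle is empty."""
    return max(0, r[2] - r[0]) * max(0, r[3] - r[1])


def intersection(a, b):
    """Intersection rectangle (possibly empty) of two rectangles."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def fractional_overlap(a, b):
    """Assumed definition: fraction of the area of a that is covered by b."""
    if area(a) == 0:
        return 0.0
    return area(intersection(a, b)) / area(a)


def rcc5(a, b, tol=0.1):
    """Classify the relation of a to b as DR, PO, PP, PPi or EQ."""
    f_ab = fractional_overlap(a, b)
    f_ba = fractional_overlap(b, a)
    if f_ab <= tol and f_ba <= tol:
        return "DR"   # (almost) no overlap in either direction
    if f_ab >= 1 - tol and f_ba >= 1 - tol:
        return "EQ"   # each (almost) fully covers the other
    if f_ab >= 1 - tol:
        return "PP"   # a lies (almost) entirely inside b
    if f_ba >= 1 - tol:
        return "PPi"  # b lies (almost) entirely inside a
    return "PO"


tracked = (0, 0, 10, 10)
relations = [
    rcc5(tracked, (20, 0, 30, 10)),   # far apart
    rcc5(tracked, (0, 0, 10, 10)),    # identical
    rcc5((2, 2, 8, 8), tracked),      # contained
    rcc5(tracked, (2, 2, 8, 8)),      # containing
    rcc5(tracked, (5, 0, 15, 10)),    # half overlap
]
```

The tolerance makes the classification robust to small fluctuations in the localized band boundaries, so that two nearly coincident rectangles are still treated as equal.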
Unique Correspondence – A tracked text region is considered to have a unique correspondence with a detected text region if the detected region is the only detection overlapping the tracked region and the tracked region is the only tracked region overlapping that detection. In this case, the tracked region is updated with the bounding rectangle and color histogram of the detected region. The rule for updating the bounding box and color histogram is determined by the RCC-5 relation between the two regions. The four possible relations are shown in the first row of Table II. The fluctuations in localizing the band boundary in the detection stage are handled by these conditions.
Multiple Correspondences can be of three different types. First, multiple tracked text regions can uniquely overlap with a single detected text band. Second, multiple detected text regions can uniquely overlap with a single tracked band. Third, multiple tracked text regions can overlap with multiple detected text regions.
The possible cases are listed in Table II. The first case arises due to multiple track initializations on the same text band. The possibility of merging is checked and the bounding boxes are updated accordingly. The second case occurs when a tracker was initialized on a group of text regions due to detection failures. The tracked band is checked for splitting according to the newly detected regions. In the third case, there is a possibility of both splitting and merging, and both are checked.
Disappearance – A tracked text band is considered to disappear in the next frame if it does not have any correspondence with any detected text band in the current frame. This may be due to actual termination of the display of that text or due to a detection failure caused by local artifacts in the present frame.
In this case, we first examine the region of the current frame that lies inside the bounding rectangle of the tracked band. If the color histogram obtained from this region matches that of the tracked band, we call it a temporary detection failure and restore the track. Otherwise, we consider this as the termination of display of the tracked text and do not include it further in the tracked set.
New Entry – A detected text region is considered to be a new entry if it does not have any correspondence with the text bands in the tracked set. In this case, we form a new text region with the bounding rectangle and color histogram of the detected region and include it in the tracked set.
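The case analysis above can be summarized by a small sketch that builds the two overlap sets and sorts the tracked and detected rectangles into the four situations. It is a simplified illustration: any positive-area intersection counts as an overlap (the paper speaks of significant overlap), the three kinds of multiple correspondence are reported together, and the color histogram check for disappearance is omitted. The function names and the output dictionary layout are hypothetical.

```python
def overlaps(a, b):
    """True if rectangles (x1, y1, x2, y2) share a region of positive area."""
    return (min(a[2], b[2]) > max(a[0], b[0])
            and min(a[3], b[3]) > max(a[1], b[1]))


def associate(tracked, detected):
    """Categorize associations between tracked and detected rectangles."""
    # Detections overlapping each tracked region, and vice versa.
    det_of = {i: [j for j, d in enumerate(detected) if overlaps(t, d)]
              for i, t in enumerate(tracked)}
    trk_of = {j: [i for i, t in enumerate(tracked) if overlaps(t, d)]
              for j, d in enumerate(detected)}
    cases = {"unique": [], "multiple": [], "disappeared": [], "new": []}
    for i, js in det_of.items():
        if not js:
            cases["disappeared"].append(i)       # candidate for histogram check
        elif len(js) == 1 and trk_of[js[0]] == [i]:
            cases["unique"].append((i, js[0]))   # update track i from detection
        else:
            cases["multiple"].append((i, js))    # check for merge and/or split
    for j, tracked_ids in trk_of.items():
        if not tracked_ids:
            cases["new"].append(j)               # start a new track
    return cases


tracked = [(0, 0, 10, 2), (0, 5, 10, 7), (0, 10, 10, 12)]
detected = [(0, 0, 10, 2), (0, 5, 4, 7), (6, 5, 10, 7), (0, 20, 10, 22)]
cases = associate(tracked, detected)
```

In this example the first track has a unique correspondence, the second is fragmented into two detections, the third has no detection, and the last detection starts a new track.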
The system initializes with an empty set of tracked text bands. The first set of detections is inducted into this set, and the tracking process continues with the addition, update and removal of text bands by association analysis of tracked and detected text bounding rectangles. Each tracked text region carries its background and foreground color information. This color information is further used to binarize the text bands. These binarized text bands are then accumulated over the entire track before being passed to the OCR engine, so as to avoid multiple passes of OCR. Next, we describe the modifications made to Tesseract OCR to improve the text recognition.
IV Adapting Tesseract For Overlaid Text Recognition
The Tesseract OCR engine was primarily developed for optical character recognition on scanned documents at HP Labs between 1984 and 1994. In late 2005, HP released Tesseract as open source. Tesseract provides a basic OCR engine with tools for training. We have used Tesseract OCR on aggregated binarized images of consistently tracked bands to recognize the text content. The OCR engine uses adaptive character segmentation based on a linguistic model and a character classifier. A font dependent polygonal approximation of the character outline is used as the feature for the character classifier. Therefore, the linguistic model and the fonts used for training the character classifier play a crucial role in the overall performance of Tesseract. The Tesseract OCR engine is by default trained for scanned documents and uses a standard English language model, whereas overlaid text often contains proper nouns and hash tags presented in a variety of fonts. Hence, Tesseract with default models performs poorly on overlaid text recognition. To adapt the OCR engine for the task of overlaid text recognition, it is necessary to train the character classifier with overlaid text samples and update the language model to include proper nouns and hash tags.
First, sufficient ground truth data is required for training character classifiers. Marking character level ground truth data for text recognition is a very tedious job. In the literature, it is argued that the polygonal approximation feature used in Tesseract is only font dependent. Hence, instead of marking character level ground truth data, we have identified the commonly used overlaid text fonts across channels and synthetically generated the data for training Tesseract. Second, in order to enrich the linguistic model used during recognition, we have used the news articles available on various news websites. News articles are rich sources of proper nouns, hash tags (present in meta data) and abbreviations. We have used a corpus of web articles, along with the meta data provided on the websites (like hash tags and tweets), from three different sources, viz. NDTV, Times of India and First Post, published over a period from January to April. Tesseract requires a dictionary of words, a list of bi-grams, a list of frequently occurring words, a list of numbers and a list of punctuations, as directed acyclic word graphs, to build the linguistic model. From the corpus of web articles, all required directed acyclic word graphs are generated. In the next section, we present the results of experimentation.
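As an illustration of the corpus processing step, the sketch below derives a word dictionary, a list of frequent words and a list of bi-grams from raw article text. It only produces plain Python lists; converting such lists into directed acyclic word graphs is left to Tesseract's own training tools and is not shown. The tokenization rule, the function name `build_wordlists` and the parameter `top_n` are assumptions made for the example, and the sample sentences are invented.

```python
import re
from collections import Counter

# Assumed token rule: optional leading '#', then letters, digits, ' or -.
TOKEN = re.compile(r"#?[A-Za-z][A-Za-z0-9'-]*")


def build_wordlists(articles, top_n=2):
    """Return dictionary words, frequent words and bi-grams from article texts."""
    words, bigrams = Counter(), Counter()
    for text in articles:
        tokens = TOKEN.findall(text)   # case is preserved for proper nouns
        words.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return {
        "words": sorted(words),
        "frequent": [w for w, _ in words.most_common(top_n)],
        "bigrams": [" ".join(pair) for pair, _ in bigrams.most_common()],
    }


articles = [
    "Modi meets Obama. Modi addresses rally #Elections2014",
    "Obama hosts Modi",
]
lists = build_wordlists(articles)
```

Keeping the original capitalization and the leading hash sign is a deliberate choice here, since proper nouns and hash tags are exactly the entries that a standard English dictionary lacks.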
We have evaluated the performance of each sub-system separately. We have tested our preprocessing scheme (CE) with the stroke width transform (SWT) and the projection profile based method (PP) on image datasets of two different types, viz. natural scenes and born-digital images. Though the preprocessing method is primarily developed for overlaid text, the assumption of high contrast of text regions holds at large for natural scene text as well. Hence, we have experimented on natural scene images as well as born-digital images. Both the natural scene images (test and training sets) and the born-digital images are taken from ICDAR datasets. The performance of text detection/localization with and without preprocessing is compared using precision, recall and f-measure, calculated by Epshtein’s criterion. SWT works on a binary image and hence the histogram equalized gradient magnitude image is thresholded, and non-maximal suppression is used to obtain the binary image. Both text detection algorithms have shown improvement in performance with the proposed preprocessing method (Table III).
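For readers unfamiliar with rectangle-based detection scores, the sketch below computes precision, recall and f-measure with a soft match between rectangles. It assumes the ICDAR 2003 style definition, in which the match between two rectangles is the intersection area divided by the area of the smallest rectangle containing both, and the f-measure is the harmonic mean of precision and recall. Whether this coincides in every detail with the criterion used in the experiments is an assumption, and the function names are hypothetical.

```python
def area(r):
    """Area of rectangle (x1, y1, x2, y2); zero if empty."""
    return max(0, r[2] - r[0]) * max(0, r[3] - r[1])


def match(a, b):
    """Intersection area over the area of the minimum rectangle enclosing both."""
    inter = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    hull = (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))
    return area(inter) / area(hull) if area(hull) else 0.0


def best_match(r, rects):
    """Best match of rectangle r within a list of rectangles."""
    return max((match(r, q) for q in rects), default=0.0)


def evaluate(estimates, targets):
    """Return (precision, recall, f-measure) for estimated vs. ground truth boxes."""
    p = (sum(best_match(e, targets) for e in estimates) / len(estimates)
         if estimates else 0.0)
    r = (sum(best_match(t, estimates) for t in targets) / len(targets)
         if targets else 0.0)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# One correct detection and one false positive against a single ground truth box.
precision, recall, f_measure = evaluate(
    estimates=[(0, 0, 10, 10), (50, 50, 60, 60)],
    targets=[(0, 0, 10, 10)],
)
```

A false positive lowers precision without affecting recall, which is the effect the preprocessing scheme is meant to reduce.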
The performance of text band localization is evaluated on our own dataset of images containing challenging detection cases in news videos. For our dataset, the ground truth is marked for band detection instead of word detection. On our dataset of news video images, projection profile based text band detection (PP-TB) outperforms the other methods (Table III).
The drop in recall of SWT is due to the poor quality of the video frames. While evaluating the performance of SWT on our dataset, we relax the matching criterion in favor of SWT to compensate for the difference in ground truth. Moreover, the proposed method is much faster than SWT and other reported methods: our implementation of SWT took an average time on the order of seconds per image, while our approach took an average time on the order of milliseconds only.
The performance of tracking is evaluated using the purity of the tracks as well as track switches. A track is called pure when it contains a single properly tracked text band, while a track switch occurs when multiple tracks are generated on a single ground truth track. We have evaluated the tracking performance on videos of one hour each from different Indian news channels. Of the tracks obtained, we counted the pure tracks and the track switches; the remaining tracks were either initialized wrongly or tracked incorrectly.
We have evaluated the performance of OCR engine on three hours of video data from three different channels. The word level and character level error rates are presented in table IV. It is evident from the table that adapting Tesseract using web articles has improved performance of the OCR significantly.
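Character level and word level error rates are commonly computed from the edit distance between the recognized text and the ground truth. The sketch below follows that common convention, dividing the Levenshtein distance by the length of the reference; that the reported figures were computed in exactly this way is an assumption, and the sample strings are invented.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    return edit_distance(ref, hyp) / len(ref)


reference = "PM MODI IN DELHI"
recognized = "PM M0DI IN DELHI"   # the letter O read as the digit 0

char_error = error_rate(reference, recognized)
word_error = error_rate(reference.split(), recognized.split())
```

A single confused character costs one error in sixteen at the character level but one word in four at the word level, which is why the two rates are reported separately.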
We have proposed a methodology for overlay text extraction in TV broadcast news videos. In the context of text detection and localization, we have significantly improved over existing edge density based methods. We observed that this basic approach had high false positives on account of strong edges present in non-text regions. We have proposed a threshold free preprocessing scheme for suppression of non-text edges while boosting text edges. The effects of stronger edges coming from non-text regions are nullified further by using the first and second order derivatives of the edge density projection profiles while localizing the text bands. The detected text regions are tracked across the frames to extract the static and consistent text bands using a formal reasoning framework. The use of RCC-5 based reasoning framework allowed us to identify different problem cases in detection and tracking. Finally, text is extracted from consistently tracked text bands using Tesseract OCR engine trained using web news articles. All our proposals have shown significant improvement in the performance of each subsystem.
Table IV: Character level and word level error rates of the OCR engine.
This work was a first step towards the larger goal of broadcast (news) video analytics. The textual data acts as an important source of information for obtaining meaningful tags for displayed events, persons (faces), places and time. Thus, the extracted text data can be used further as features for applications like video event classification, news summarization, news story segmentation and association of names to places and persons.
This work is part of the ongoing project on “Multi-modal Broadcast Analytics: Structured Evidence Visualization for Events of Security Concern” funded by the Department of Electronics & Information Technology (DeitY), Govt. of India.
- [1] B. Epshtein, E. Ofek, and Y. Wexler, “Detecting text in natural scenes with stroke width transform,” in IEEE CVPR, 2010, pp. 2963–2970.
- [2] R. Minetto, N. Thome, M. Cord, J. Stolfi, and N. J. Leite, “Text detection and recognition in urban scenes,” in IEEE International Conference on Computer Vision Workshops, 2011, pp. 227–234.
- [3] R. Lienhart, “Video OCR: A survey and practitioner’s guide,” in Video Mining, A. Rosenfeld, D. Doermann, and D. DeMenthon, Eds., vol. 6 of The Springer International Series in Video Computing, pp. 155–183. Springer US, 2003.
- [4] X. Tang, B. Luo, X. Gao, and H. Zhan, “Multiscale edge-based text extraction from complex images,” in Proceedings of the IEEE International Conference on Multimedia and Expo, 2002, pp. 85–88.
- [5] H. Koo and D. H. Kim, “Scene text detection via connected component clustering and nontext filtering,” IEEE Transactions on Image Processing, vol. 22, no. 6, pp. 2296–2305, 2013.
- [6] C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu, “Detecting texts of arbitrary orientations in natural images,” in IEEE CVPR, 2012, pp. 1083–1090.
- [7] A. K. Jain and B. Yu, “Automatic text location in images and video frames,” Pattern Recognition, vol. 31, no. 12, pp. 2055–2076, 1998.
- [8] M. R. Lyu, J. Song, and M. Cai, “A comprehensive method for multilingual video text detection, localization, and extraction,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 2, pp. 243–255, 2005.
- [9] W. Kim and C. Kim, “A new approach for overlay text detection and extraction from complex video scene,” IEEE Transactions on Image Processing, vol. 18, no. 2, pp. 401–411, 2009.
- [10] R. Smith, “An overview of the Tesseract OCR engine,” in Ninth International Conference on Document Analysis and Recognition, Sept 2007, vol. 2, pp. 629–633.