Extracting textual overlays from social media videos using neural networks

by Adam Słucki, et al.

Textual overlays are often used in social media videos, as people who watch them without sound would otherwise miss essential information conveyed in the audio stream. Extracting those overlays can therefore serve as an important source of metadata, e.g. for content classification or retrieval tasks. In this work, we present a robust method for extracting textual overlays from videos that builds upon multiple neural network architectures. The proposed solution relies on several processing steps: keyframe extraction, text detection and text recognition. The main component of our system, i.e. the text recognition module, is inspired by a convolutional recurrent neural network architecture, and we improve its performance using a synthetically generated dataset of over 600,000 images with text, prepared by the authors specifically for this task. We also develop a filtering method that reduces the amount of overlapping text phrases using the Levenshtein distance and further boosts the system's performance. The final accuracy of our solution reaches over 80% and is on par with state-of-the-art methods.




1 Introduction

Videos published on social media are commonly described only with their title, short summary and unstructured keywords. Extracting additional information from textual overlays such as captions, key ideas or scene-level summaries can be a crucial component of a content retrieval system, video classifier or intelligent advertisement targeting. The problem of extracting this information is twofold. The first part of the problem is choosing the frames on which OCR will be performed, and the second is text detection and recognition on those frames. There are many domain-specific difficulties related to detection and recognition in social media videos. Backgrounds of textual overlays in those videos are rarely solid and contrastive. The overlays are often displayed as part of a background and have various font sizes, colors and combinations. We present examples of frames with text appearing in social media videos in Fig. 5.


In this paper, we present an entire working pipeline for text extraction tailored specifically for social media videos. We propose a multi-component system that consists of frame extraction, text detection, text recognition and post-processing by text merging and rectification. Fig. 6 shows an overview of our system. We also propose a method for generating synthetic training data designed for textual overlays commonly appearing in social media videos. Extending our training dataset with the synthetically generated data allows our text recognition model to reduce the word-level recognition error by 20% compared to a general, pre-trained CRNN [1] model for text recognition. The main contribution of this work is a complete system that allows its user to extract video overlays with a minimal amount of textual overlap and state-of-the-art text detection and recognition results. We also describe the details of a training data generation algorithm that takes into account visual characteristics of overlays in online videos, and we show how using this algorithm improves the accuracy of our system. Finally, we propose a new method based on the Levenshtein distance that allows us to filter out the text appearing in multiple frames and extract the most relevant information presented in a video.

The remainder of this paper is organized in the following manner: In Sec. 2, we discuss related works. Sec. 3 presents our system and in Sec. 4 we evaluate its performance against baselines. We conclude the paper in Sec. 5.

(a) Standard frame with text
(b) Superimposed subtitles
(c) Multiple text blocks with various fonts and sizes
(d) Text displayed as a part of a background
Figure 5: Sample frames extracted from social media videos with textual overlays displayed in challenging conditions.

2 Related Work

A significant amount of research has focused on addressing the problem of text detection and recognition in images [1, 2, 3, 4]. The work [2] presents a fully convolutional neural network that is trained for pixelwise classification of text regions in natural scene images. Text recognition is performed with a CRNN model inspired by [1], which is further improved through a dictionary-based correction. In our system, we also rely on a dictionary-based spellcheck functionality to improve the output of our system. Another approach to text detection in images is presented in [3], where the authors use a fast cascade boosting technique to detect single characters. The characters are then merged into lines with a min-cost flow network in a post-processing step, similarly to our rectification module. Although the selection of text detection and recognition methods presented above is far from complete, in this work we focus on extracting textual overlays from videos, not images, and we review below related works that address this exact problem.

Detecting and recognizing blocks of text in videos has also gained significant attention from the research community [5, 6, 7]. In [5], Sato et al. present an approach based on extracting and classifying hand-crafted features using a computer vision method. They rely on specific properties of letters to detect blocks of text, segment them into characters and recognize them individually using template matching. To enhance the quality of the input, they leverage temporal consistency of the text blocks displayed across several frames of video using a time-based minimum pixel value search. Although fairly effective for videos with high contrast, where the color of textual characters is significantly different than the background, the main limitation of the method is its lack of robustness against less contrastive frames. As presented in Fig. 5, such high contrast is often absent in social media videos, where various font colors are used and the contrast against the background cannot be guaranteed.

A recent work [6] presents a method that, similarly to [5], uses a computationally efficient text detection method, in this case the maximally stable extremal regions (MSER) [8], to generate a set of candidate regions. The regions are then filtered using a binary classifier based on a convolutional neural network architecture, and text recognition is done using a similar neural network model. The system is capable of providing real-time OCR in videos. Nevertheless, its main drawback is that the frames are processed individually, hence discarding temporal consistencies that are useful for obtaining a stable and robust overlay detection and recognition system. Furthermore, processing videos on a frame-by-frame basis introduces a significant computational overhead in the context of overlay extraction, the exact problem we address in this paper. This is mainly due to the fact that the goal of overlay extraction is to output a set of phrases or sentences that do not overlap significantly with each other, i.e. that can be read as a single block of text spread across several scenes. In our approach, we tackle this problem using an additional post-processing step that focuses on text rectification, and we demonstrate its effectiveness through a set of qualitative results.

The problem of video overlay extraction is also tackled in [7], where Kannao and Guha propose to detect entire lines of text instead of single words. To decrease the computational cost of detection and recognition, they use temporal tracking across multiple frames. For the recognition, they train the Tesseract OCR model [9] with synthetically generated images. Inspired by this approach, we also generate part of our training data synthetically. However, we use the resulting dataset to improve the performance of several of our system’s modules and not a Tesseract engine. Furthermore, contrary to the results presented in [7], our text recognition engine, which relies on a convolutional recurrent neural network architecture [1], significantly outperforms the competing methods, including the baseline Tesseract method.

3 Overlay extraction system

In this section, we present our system, whose goal is to extract complete sequences of text split across several frames of a social media video. An overview of the system is shown in Fig. 6. The proposed solution comprises several components, starting from the frame extractor through the text detector to the text recognizer and rectifier. Below, we outline the main features of those components along with the method for generating a synthetic dataset used to improve the performance of the text recognition model. We conclude this section with a description of the post-processing step that allows us to avoid redundancies in the overlays returned by our system. We present sample results in Fig. 7.

Figure 6: Architecture of the system for textual overlay extraction.

3.1 Frames extractor

The goal of this component of our system is to extract all frames containing unique overlays. Extracting too few frames leads to information loss, while extracting too many frames with overlapping overlays increases the processing time. Our frame extraction step is therefore an essential part of the whole system.

We use the functionality provided as a part of ffmpeg (https://www.ffmpeg.org/) as a frame extractor. More specifically, we input a video and extract intra-coded frames, the so-called I-frames, used by the codec as reference frames. According to the codec specifications, I-frames are stored as complete images, in contrast to the predictive P- and B-frames, which are encoded only through differences with respect to the reference frames. Although there may be some cases where the overlay text is visible only in the encoded P-frames, our preliminary results indicated that using only I-frames in those cases does not lead to a significant reduction in the information conveyed in the video.
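The I-frame extraction step can be illustrated with a minimal sketch. It assumes that the `ffmpeg` binary is available on the system path and uses the standard `select` filter with the `pict_type` variable; the function names, the output file pattern and the JPEG output format are hypothetical choices for illustration and not necessarily the exact invocation used in our system.

```python
import subprocess
from pathlib import Path


def build_iframe_command(video_path: str, out_dir: str) -> list[str]:
    """Build an ffmpeg command that writes only I-frames as numbered images."""
    out_pattern = str(Path(out_dir) / "iframe_%05d.jpg")  # hypothetical naming
    return [
        "ffmpeg",
        "-i", video_path,
        # Keep only frames whose picture type is I (intra-coded).
        "-vf", "select='eq(pict_type,I)'",
        # Write one image per selected frame instead of duplicating frames
        # (recent ffmpeg releases prefer the equivalent "-fps_mode vfr").
        "-vsync", "vfr",
        out_pattern,
    ]


def extract_iframes(video_path: str, out_dir: str) -> None:
    """Run ffmpeg; raises CalledProcessError if ffmpeg reports a failure."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_iframe_command(video_path, out_dir), check=True)
```

Calling `extract_iframes("clip.mp4", "frames")` would then leave one image per I-frame in the `frames` directory.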

Several alternative approaches to the problem of informative frame extraction exist. Since the overlays typically change when the video shot changes, selecting the last frame from every scene can be a viable solution. Unfortunately, a significant portion of the videos in our database consists of only one scene with multiple overlays, which reduces the applicability of this method in our use case. Another approach to frame extraction relies on a more complex method for highlight extraction based on neural network architectures [10]. Our initial experiments indicated, however, that this approach is too computationally expensive and therefore reduces the usability of the entire system. We therefore base our frame extraction on ffmpeg, which provides an efficient and effective method for selecting important video frames.

Figure 7: Results of text detection (left) and text recognition (right) modules used in the proposed framework.

3.2 Text detection

Our text detection component uses the TextBoxes method [11], based on an end-to-end trainable Single Shot Detector (SSD) [12]. Multiple layers of the network return coordinates of word bounding boxes along with a prediction score of text presence. Then, a non-maximum suppression algorithm is used to obtain optimum bounding box coordinates for each word. The publicly available implementation (https://github.com/MhLiao/TextBoxes) we use detects only horizontal text. Therefore, the text blocks that are less likely to be part of the typically horizontal overlays are automatically filtered out. In general, vertical texts are rarely used as overlays in social media videos, and this is confirmed within our evaluation dataset. Modifying the solution to also detect vertical text blocks could therefore lead to a higher rate of misclassifications and ultimately reduced accuracy of our system.
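To make the non-maximum suppression step concrete, the following is a minimal sketch of the standard greedy variant. It is not the TextBoxes implementation itself: the box format `(x1, y1, x2, y2)` and the IoU threshold of 0.5 are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, from highest to lowest score."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap strongly with a better one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

For two heavily overlapping word boxes, only the one with the higher text-presence score survives, while a distant box is kept independently.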

3.3 Text recognition

Our method for text recognition is based on the Convolutional Recurrent Neural Network (CRNN) model [1]. We use the architecture with seven convolutional layers followed by two bidirectional LSTM layers. The probability of a sequence is given by a Connectionist Temporal Classification layer [13]. As an input, the model takes an image displaying a single word. The image must be scaled to a fixed height, while the width of the image can vary. Sequences that are input to the recurrent layers are generated by concatenating columns of feature maps produced by the convolutional layers.
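As an illustration of how a Connectionist Temporal Classification output can be turned into a word, the sketch below implements best-path (greedy) decoding: take the most probable label per time step, collapse repeated labels and drop the blank symbol. The text does not state which decoding strategy is used, so greedy decoding, the blank label at index 0 and the function name are assumptions.

```python
def ctc_greedy_decode(frame_probs, alphabet):
    """Best-path CTC decoding.

    frame_probs: one probability vector per time step, where index 0 is the
    blank label (assumption) and index k > 0 corresponds to alphabet[k - 1].
    """
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    chars = []
    previous = None
    for label in best_path:
        # Emit a character only when the label changes and is not blank.
        if label != previous and label != 0:
            chars.append(alphabet[label - 1])
        previous = label
    return "".join(chars)
```

Note that a blank between two identical labels is what allows doubled letters to be decoded, e.g. the path `a a - a b` yields `aab`.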

Although the text recognition module based on the CRNN performs well in general scenarios, its performance can be further improved by adjusting the training dataset to the application scenario. In our case, the goal is to recognize textual overlays presented in social media videos. Those videos often contain text blocks with special characters that are frequently displayed in challenging conditions (various backgrounds, font colors and sizes, etc.). To address those challenges, we propose to improve the recognition model based on the CRNN by fine-tuning the network on a synthetically generated dataset. Below, we outline the details of the dataset generation procedure which, as shown in Sec. 4, leads to significant performance improvements.

3.3.1 Generation of a synthetic dataset

As shown in [4], training a text recognition model with synthetically generated datasets can improve its results. Furthermore, due to a specific application of our system for text recognition in social media videos, existing datasets typically used for training text recognition models may not be sufficient, as they mostly contain natural images, very much different from those published in social media. Therefore we propose to synthetically generate a dataset that can simulate the conditions observed in social media, such as diversified background of text blocks and various fonts and colors of the text displayed in the images. The generation of a synthetic dataset can be split into the following steps:

Text. We prepare transcripts of overlays from over 100 social media videos collected from several Facebook profiles and a list of the 5000 most frequent words in the Corpus of Contemporary American English [14] to create a set of unique single words for rendering on images. This dataset is augmented with digits and special characters, such as hyphens, commas and question marks. This augmentation is especially important, since the original CRNN model was trained on a dataset of alphanumerical characters only and its performance decreases significantly on a dataset of social media videos, as shown in Tab. 2.

Background images. To increase the diversity of the synthetically generated dataset, we superimpose text blocks over various backgrounds. To obtain those backgrounds, we use 50 frames from randomly selected videos and manually extract regions without blocks of text. We ensure that the extracted regions represent a wide range of colors, intensity values and texture types. We also extract regions whose dimensions are large enough that we can randomly crop them to increase the pool of potential background images.

Fonts. We collect 71 fonts out of 30 font families, with the Calibre font being the most popular one. The other fonts are picked to mimic the distribution of similar fonts in social media based on general guidelines for editors. We present the full list of font families used below.

  1. Alegreya

  2. Aleo

  3. AnonymousPro

  4. Archivo

  5. Arvo

  6. BioRhyme

  7. Bitter

  8. Cabin

  9. Calibre

  10. Cardo

  11. Chivo

  12. Cormorant

  13. CrimsonText

  14. Dosis

  15. Helvetica

  16. Karla

  17. Libre

  18. Lora

  19. Merriweather

  20. Montserrat

  21. Neuton

  22. OldStandard

  23. OpenSans

  24. PlayfairDisplay

  25. Poppins

  26. Raleway

  27. Roboto

  28. SourceSans

  29. SpaceMono

  30. Spectral

Random sampling. For each word in our dataset, we generate 100 samples by selecting a random font, size and color. Then, based on the size of the text, we crop a randomly selected background image and superimpose the text on the cropped image. All images are resized to 100x32 px with anti-aliasing and saved in JPEG format. Fig. 8 shows a comparison between real and synthetically generated frames with text.
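The random sampling step can be sketched as follows. The sketch only draws the rendering parameters and the background crop box and leaves the actual rendering to an imaging library. The font-size range, the rough text-box estimate (0.6 of the font size per character in width, 1.2 of the font size in height) and all function names are assumptions made for illustration; only the 100 samples per word and the 100x32 px target size come from the description above.

```python
import random

TARGET_SIZE = (100, 32)  # final width x height in px, as stated in the text


def sample_render_params(word, fonts, bg_width, bg_height, rng):
    """Draw a random font, size, color and background crop for one sample."""
    font = rng.choice(fonts)
    size = rng.randint(16, 64)  # hypothetical font-size range
    color = tuple(rng.randint(0, 255) for _ in range(3))
    # Rough text-box estimate (assumption), clipped to the background size.
    crop_w = min(bg_width, max(1, int(0.6 * size * len(word))))
    crop_h = min(bg_height, int(1.2 * size))
    left = rng.randint(0, bg_width - crop_w)
    top = rng.randint(0, bg_height - crop_h)
    return {
        "word": word,
        "font": font,
        "size": size,
        "color": color,
        "crop": (left, top, left + crop_w, top + crop_h),
        "resize_to": TARGET_SIZE,
    }


def generate_samples(word, fonts, bg_width, bg_height, n=100, seed=0):
    """Generate n parameter sets for a single word (100 per word in the text)."""
    rng = random.Random(seed)
    return [sample_render_params(word, fonts, bg_width, bg_height, rng)
            for _ in range(n)]
```

Each returned dictionary describes one synthetic image: the text would be drawn with the sampled font, size and color on the cropped background region, and the result resized to `TARGET_SIZE`.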

3.3.2 Fine-tuning

We use the synthetically generated dataset to fine-tune our CRNN model. We experiment with three different variants of the CRNN tuning procedure:

  1. We modify the output dimensions of the last LSTM layer to adapt it to the extended set of characters. Only the last LSTM layer is initialized with random weights and trained, while the parameters of all other layers are frozen.

  2. We modify the output dimensions of the last LSTM layer, but both LSTM layers are initialized with random weights and trained, while the other parameters are frozen.

  3. We initially load the weights from the pretrained model and change the output dimensions of the last LSTM layer. All network parameters are updated during training.

The comparison of the results obtained with different variants is shown in Sec. 4.
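The three fine-tuning variants can be summarized as a per-layer plan of which parameters are re-initialized and which are trainable. The sketch below is framework-independent; the layer names (`conv1` to `conv7`, `lstm1`, `lstm2`) are hypothetical labels for the seven convolutional and two bidirectional LSTM layers, and how the resized output of the last layer is initialized in variant 3 is not specified in the text.

```python
# Hypothetical layer names for seven convolutional and two LSTM layers.
LAYERS = [f"conv{i}" for i in range(1, 8)] + ["lstm1", "lstm2"]


def finetune_plan(variant):
    """Map each layer to its initialization and training mode for a variant."""
    if variant not in (1, 2, 3):
        raise ValueError("variant must be 1, 2 or 3")
    plan = {}
    for name in LAYERS:
        if variant == 1:
            retrained = name == "lstm2"          # only the last LSTM layer
        elif variant == 2:
            retrained = name.startswith("lstm")  # both LSTM layers
        else:
            retrained = False                    # start from pretrained weights
        plan[name] = {
            "random_init": retrained,
            # Variant 3 updates every layer; variants 1 and 2 freeze the rest.
            "trainable": True if variant == 3 else retrained,
            # The output size of the last layer changes in every variant.
            "resized_output": name == "lstm2",
        }
    return plan
```

For example, `finetune_plan(2)["conv3"]` marks the layer as frozen with pretrained weights, while `finetune_plan(2)["lstm1"]` marks it as randomly initialized and trainable.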

Figure 8: Comparison of sample images extracted from real videos (left column) and synthetic images generated with the same text block (right column).

3.4 Text merging and rectification

Although our frame extraction component is fairly robust, it does not prevent the text extracted from overlays from overlapping across consecutive frames. We propose a novel yet simple method for filtering out such overlaps. We can assume that components of a single overlay appear gradually and last until the end of a scene. Therefore, the version containing the greatest number of characters can be considered the final one. We sort the extracted overlays by the time of their appearance in reverse order. We then compare the texts of consecutive overlays using the normalized Levenshtein distance. If the result is below a given threshold, we consider the overlays to be overlapping and disregard the one with fewer characters. We also use an autocorrection toolkit (https://github.com/phatpiglet/autocorrect/) to further improve the results of the OCR model.
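The filtering procedure described above can be sketched as follows. The normalization by the length of the longer string and the threshold value of 0.4 are assumptions made for illustration, as neither is fixed in the description; the function names are hypothetical as well.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Levenshtein distance divided by the longer length (assumption)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def filter_overlaps(overlays, threshold=0.4):
    """overlays: list of (timestamp, text); returns texts in time order."""
    kept = []
    # Walk from the latest overlay to the earliest one.
    for item in sorted(overlays, key=lambda o: o[0], reverse=True):
        if kept and normalized_distance(kept[-1][1], item[1]) < threshold:
            # Overlapping pair: keep whichever text has more characters.
            if len(item[1]) > len(kept[-1][1]):
                kept[-1] = item
        else:
            kept.append(item)
    return [text for _, text in reversed(kept)]
```

With this sketch, an overlay that builds up as "The quick brown" and then "The quick brown fox" is reduced to its final, longest version, while an unrelated later overlay is kept as a separate phrase.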

4 Evaluation

In this section, we present the results of the evaluation of our method on a benchmark dataset. We first present the evaluation dataset along with the evaluation metrics. We then show the experimental results obtained for different preprocessing steps. Finally, we compare the results obtained with our method against the results of systems based on the Tesseract or CRNN models.

4.1 Dataset

To measure the accuracy of the OCR component of our system, we extract frames from 100 videos published on Facebook between June 2017 and January 2018 on the NowThisNews, NowThisPolitics, NowThisHer, thedodosite and SeekerMedia channels using ffmpeg, as described in Sec. 3.1. Then, using the method presented in Sec. 3.2, we detect and extract single-word images from a random subset of frames. We discard images with a height below 20 px. We also exclude images with fewer than 3 characters as well as images that are part of a media brand logo, as they would introduce overlaps in our test set. We randomly select 1000 of the remaining images and manually annotate them to use as the final testing set. To measure the end-to-end performance of the OCR and the text detection components, we annotate 100 randomly selected frames with 1128 words displayed in total. For each frame, we mark the location of the text and transcribe all the words shown in the frame. The set of videos used to extract those frames is separate from the set used to select the list of words for generating synthetic images. We have not explicitly excluded repetitions of other words. We assume that random selection of frames and of the words taken from them is enough to prevent including two identical images in our testing set.

4.2 Evaluation metrics

To evaluate our system and compare it with the baseline, we follow the evaluation protocol of [15] and compute several metrics: average precision, recall and F1 score of the system output. We also compare targets with the predictions using a similarity metric based on the normalized Levenshtein distance [16]. All metrics are calculated on a word level. A similar metric was used in [17] to evaluate OCR accuracy on distorted images, which may also be the case in our task due to the frame extraction process.

The metrics are computed according to the following formulas:
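For reference, the standard definitions of these metrics are sketched below. They are an assumption rather than a reproduction of the exact formulation: TP, FP and FN denote the numbers of correctly recognized, spurious and missed words, lev(t, p) is the Levenshtein distance between a target word t and a prediction p, and the normalization by the length of the longer string is one common choice.

```latex
\begin{align}
\mathrm{Precision} &= \frac{TP}{TP + FP}, \\
\mathrm{Recall} &= \frac{TP}{TP + FN}, \\
F_1 &= \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \\
\mathrm{Similarity}(t, p) &= 1 - \frac{\mathrm{lev}(t, p)}{\max(|t|, |p|)}.
\end{align}
```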

4.3 Preprocessing methods

The detection and recognition modules of our system expect grayscale images as their input, and we evaluate several preprocessing methods that aim to improve the quality of text recognition on grayscale images. To that end, we test the following preprocessing methods with a pretrained CRNN model [1] and the Tesseract OCR Engine [18]:

  • No preprocessing: raw, grayscale images are input to the recognition component.

  • Otsu’s binarization: binarization method based on dynamic thresholding. We use the OpenCV (http://www.opencv.org) implementation.

  • Gaussian blurring with px kernel followed by Otsu’s binarization: Additional blurring step can potentially increase the robustness of the system.

  • Gaussian blurring with Otsu’s binarization and opening: by adding the morphological opening operation, we expect to reduce the amount of noise in the images.

  • Max-RGB filter: we flatten the color channels by taking, for each pixel, the maximum value across the channels and using it as the output pixel value. This preprocessing method is based on the assumption that text and background have different colors, so using this filter should lead to an improved contrast of the image.
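The max-RGB flattening from the last item can be sketched in a few lines. The sketch assumes that an image is given as rows of `(R, G, B)` tuples; an array library would perform the same operation as a maximum over the channel axis.

```python
def max_rgb_to_gray(image):
    """Flatten an RGB image to one channel using the per-pixel channel maximum.

    image: list of rows, each row a list of (R, G, B) tuples (assumption).
    """
    return [[max(pixel) for pixel in row] for row in image]
```

For instance, a saturated red pixel `(255, 1, 2)` and a saturated green pixel `(10, 200, 30)` both map to bright values (255 and 200), whereas a dark background pixel such as `(5, 6, 7)` stays dark (7).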

Preprocessing method             | Tesseract | CRNN model
---------------------------------|-----------|-----------
None                             | 57.8%     | 75.4%
Gaussian blur + Otsu             | 52%       | 65.9%
Gaussian blur + Otsu + opening   | 57.2%     | 67.4%
Otsu                             | 57.8%     | 69.1%
Max RGB                          | 56.3%     | 75.7%

Table 1: Word-level recognition accuracy of the Tesseract OCR engine [18] and the pretrained CRNN model [1] for each preprocessing method.

Tab. 1 shows the results of the experiments with preprocessing methods. For the Tesseract OCR engine, the best preprocessing method is Otsu’s binarization, yet an identical result is obtained without any preprocessing. For the CRNN model, which we use in our system, the best results are obtained with max-RGB filtering, although the improvement over raw grayscale input is negligible (75.7% vs. 75.4%). We conclude that the convolutional layers of the CRNN model are able to learn transformations that substitute for the preprocessing steps. One can also see that the neural network-based model significantly outperforms the traditional Tesseract OCR system.

4.4 Results

The recognition accuracy of the original model and its fine-tuned versions, presented in Tab. 2, improves only for the variants where just the parameters of the LSTM layers are updated during training. This suggests that the generated set may be too small for training the entire network without overfitting. At the same time, updating the parameters of both LSTM layers turns out to be better than modifying the parameters of only the last layer. This indicates that the features encoded by the penultimate LSTM layer are not generic enough and that our synthetic training set is sufficient to learn new features specific to the task. The best model reduces the word recognition error from 24.3% to 19.9%, a relative reduction of roughly 20%.

The evaluation of text detection combined with recognition, also presented in Tab. 2, shows that all fine-tuned models perform better than the original version for this specific task. Using the best fine-tuned CRNN model leads to a relative increase of roughly 20% in precision, recall, F1 score and similarity compared to a generic CRNN model.

End-to-end results show that the text detection component plays an important role in the system. Imperfect detection lowers the quality of the CRNN input, which translates into a decrease in recognition accuracy. However, from a practical point of view, the system can already be used to extract meaningful information from social media videos. Overlays can be further processed using the presented rectification methods.

                                   | Recognition | Detection with recognition
Model                              | Accuracy    | Precision | Recall | F1    | Similarity
-----------------------------------|-------------|-----------|--------|-------|-----------
Tesseract [18]                     | 57.8%       | 0.284     | 0.266  | 0.274 | 0.42
CRNN [1]                           | 75.7%       | 0.368     | 0.343  | 0.352 | 0.52
Fine-tuned CRNN (all parameters)   | 74%         | 0.40      | 0.375  | 0.386 | 0.59
Fine-tuned CRNN (last LSTM layer)  | 76.6%       | 0.406     | 0.378  | 0.389 | 0.59
Fine-tuned CRNN (both LSTM layers) | 80.1%       | 0.45      | 0.42   | 0.432 | 0.62

Table 2: Performance comparison of baseline, original and fine-tuned models.

To further evaluate the different models used for text recognition, we visualize the outputs of various methods on a sample video frame with an overlay. The results are shown in Fig. 12. Those qualitative results confirm the numerical evaluation performed above: our fine-tuned CRNN model provides the most accurate transcription of the overlay.

(a) Tesseract
(b) Original CRNN model  [1]
(c) Fine-tuned CRNN model
Figure 12: Results of text recognition using different models.

5 Conclusions

In this paper, we presented a comprehensive system for video overlay text extraction that comprises several components: keyframe extraction, text detection, recognition and rectification. The system is specifically designed and evaluated in the context of social media videos, where textual overlays appear in particularly challenging conditions. Using a synthetically generated dataset allowed us to reduce the recognition error of our neural network-based text recognition model by roughly 20%. Overall, the proposed system provides an effective and robust method for video overlay extraction. It has been successfully implemented and integrated into a complex social media video analysis engine and is actively used as part of many services, including a content classifier and a retention analytics engine.


This work was partially funded by the Dean’s Grant nr II/2017/GD/1 of the Faculty of Electronics and Information Technology at Warsaw University of Technology.