Videos published on social media are commonly described only by a title, a short summary and unstructured keywords. Extracting additional information from textual overlays, such as captions, key ideas or scene-level summaries, can be a crucial component of a content retrieval system, a video classifier or intelligent advertisement targeting. The problem of extracting this information is twofold: the first part is choosing the frames on which OCR will be performed, and the second is text detection and recognition on those frames. Detection and recognition in social media videos pose many domain-specific difficulties. The backgrounds of textual overlays in those videos are rarely solid or high-contrast. Overlays are often displayed as part of the background and vary in font size, color and combinations thereof. We present examples of frames with text appearing in social media videos in Fig. 5.
In this paper, we present a complete working pipeline for text extraction tailored specifically to social media videos. We propose a multi-component system that consists of frame extraction, text detection, text recognition and post-processing by text merging and rectification. Fig. 6 shows an overview of our system. We also propose a method for generating synthetic training data designed for the textual overlays that commonly appear in social media videos. Extending our training dataset with the synthetically generated data allows our text recognition model to reduce the word-level recognition error by 20% compared to a general, pre-trained CRNN model for text recognition. The main contribution of this work is a complete system that allows its user to extract video overlays with a minimal amount of textual overlap and with state-of-the-art text detection and recognition results. We also describe the details of a training data generation algorithm that takes into account the visual characteristics of overlays in online videos, and we show how using this algorithm improves the accuracy of our system. Finally, we propose a new method based on the Levenshtein distance that makes it possible to filter out text appearing in multiple frames and to extract the most relevant information presented in a video.
The remainder of this paper is organized as follows. In Sec. 2, we discuss related work. Sec. 3 presents our system, and in Sec. 4 we evaluate its performance against baselines. We conclude the paper in Sec. 5.
2 Related Work
Yao et al. present a fully convolutional neural network that is trained for pixelwise classification of text regions in natural scene images. Text recognition is performed with a CRNN model inspired by Shi et al., which is further improved through a dictionary-based correction. In our system, we also rely on dictionary-based spellchecking to improve the output. Another approach to text detection in images is presented by Tian et al., who use a fast cascade boosting technique to detect single characters. The characters are then merged into lines with a min-cost flow network in a post-processing step, similarly to our rectification module. The selection of text detection and recognition methods presented above is far from complete. In this work, however, we focus on extracting textual overlays from videos rather than images, and below we review related works that address this exact problem.
Sato et al. present an approach based on extracting and classifying hand-crafted features using computer vision methods. They rely on specific properties of letters to detect blocks of text, segment them into characters and recognize the characters individually using template matching. To enhance the quality of the input, they leverage the temporal consistency of text blocks displayed across several video frames using a time-based minimum pixel value search. Although the method is fairly effective for high-contrast videos, where the color of the characters differs significantly from the background, its main limitation is a lack of robustness to low-contrast frames. As shown in Fig. 5, high contrast is often absent in social media videos, where various font colors are used and contrast against the background cannot be guaranteed.
A recent work by Yang et al. presents a method that, similarly to Sato et al., uses a computationally efficient text detection method, in this case maximally stable extremal regions (MSER), to generate a set of candidate regions. The regions are then filtered using a binary classifier based on a convolutional neural network architecture, and text recognition is done using a similar neural network model. The system is capable of real-time OCR in videos. Nevertheless, its main drawback is that frames are processed individually, which discards the temporal consistency that is useful for stable and robust overlay detection and recognition. Furthermore, processing videos frame by frame introduces a significant computational overhead in the context of overlay extraction, the exact problem we address in this paper. This is mainly because the goal of overlay extraction is to output a set of phrases or sentences that do not overlap significantly, i.e. that can be read as a single block of text spread across several scenes. In our approach, we tackle this problem with an additional post-processing step that focuses on text rectification, and we demonstrate its effectiveness through a set of qualitative results.
The problem of video overlay extraction is also tackled by Kannao and Guha, who propose to detect entire lines of text instead of single words. To decrease the computational cost of detection and recognition, they use temporal tracking across multiple frames. For recognition, they train the Tesseract OCR model with synthetically generated images. Inspired by this approach, we also generate part of our training data synthetically. However, we use the resulting dataset to improve the performance of several of our system’s modules rather than a Tesseract engine. Furthermore, contrary to the results they report, our text recognition engine, which relies on a convolutional recurrent neural network architecture, significantly outperforms the competing methods, including the baseline Tesseract method.
3 Overlay extraction system
In this section, we present our system, whose goal is to extract complete sequences of text split across several frames of a social media video. An overview of the system is shown in Fig. 6. The proposed solution comprises several components: a frame extractor, a text detector, a text recognizer and a rectifier. Below, we outline the main features of those components, along with the method for generating a synthetic dataset used to improve the performance of the text recognition model. We conclude this section with a description of the post-processing step that allows us to avoid redundancies in the overlays returned by our system. We present sample results in Fig. 7.
3.1 Frames extractor
The goal of this component is to extract all frames containing unique overlays. Extracting too few frames leads to information loss, while extracting too many frames with overlapping overlays increases the processing time. Frame extraction is therefore an essential part of the whole system.
We use ffmpeg (https://www.ffmpeg.org/) as a frame extractor. More specifically, we input a video and extract intra-coded frames, the so-called I-frames, which the codec uses as reference frames. According to the codec specifications, I-frames are stored as complete images, in contrast to the predictive P- and B-frames, which are encoded only as differences with respect to reference frames. Although there may be cases where the overlay text is visible only in P-frames, our preliminary results indicate that using only I-frames in those cases does not significantly reduce the information recovered from the video.
Several alternative approaches to informative frame extraction exist. Since overlays typically change when the video shot changes, selecting the last frame of every scene can be a viable solution. Unfortunately, a significant portion of the videos in our database consist of a single scene with multiple overlays, which limits the applicability of this method in our use case. Another approach relies on a more complex method for highlight extraction based on neural network architectures. Our initial experiments indicated, however, that this approach is too computationally expensive and would therefore reduce the usability of the entire system. We therefore base our frame extraction on ffmpeg, which provides an efficient and effective method for selecting important video frames.
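As an illustration of the I-frame selection described above, the following minimal sketch builds an ffmpeg command line from Python. The function names, the output file pattern and the choice of PNG output are assumptions made for this example and are not taken from our implementation; the `select` filter with `pict_type` and the `-vsync vfr` option are standard ffmpeg features (recent ffmpeg releases prefer `-fps_mode vfr` over `-vsync vfr`).

```python
import subprocess
from pathlib import Path


def iframe_extraction_command(video_path, output_dir):
    """Build an ffmpeg command that writes only I-frames as numbered PNG files."""
    # Hypothetical output naming scheme, chosen for this sketch only.
    output_pattern = str(Path(output_dir) / "iframe_%05d.png")
    return [
        "ffmpeg",
        "-i", str(video_path),
        # Keep only frames whose picture type is I (intra-coded).
        "-vf", "select='eq(pict_type,I)'",
        # Write one image per selected frame instead of padding with
        # duplicates to preserve the original frame rate.
        "-vsync", "vfr",
        output_pattern,
    ]


def extract_iframes(video_path, output_dir):
    """Run ffmpeg; requires the ffmpeg binary to be available on PATH."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(iframe_extraction_command(video_path, output_dir), check=True)
```

Calling `extract_iframes("clip.mp4", "frames")` would then write one image per I-frame into the `frames` directory.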
3.2 Text detection
Our text detection component uses the TextBoxes method, which is based on an end-to-end trainable Single Shot Detector (SSD). Multiple layers of the network return the coordinates of word bounding boxes along with a text presence score. A non-maximum suppression algorithm is then used to obtain the final bounding box coordinates for each word. The publicly available implementation (https://github.com/MhLiao/TextBoxes) we use detects only horizontal text. Text blocks that are unlikely to be part of the typically horizontal overlays are therefore filtered out automatically. In general, vertical text is rarely used in overlays in social media videos, which our evaluation dataset confirms. Modifying the solution to also detect vertical text blocks could therefore lead to a higher rate of misclassifications and ultimately reduce the accuracy of our system.
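The non-maximum suppression step can be sketched as the usual greedy procedure below. This is a generic illustration rather than the code of the TextBoxes implementation: the box format `(x1, y1, x2, y2)`, the function names and the IoU threshold of 0.5 are all assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_maximum_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept by greedy NMS, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep a box only if it does not overlap strongly with a better one.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept
```

For two heavily overlapping detections of the same word, only the higher-scoring box survives, while a distant box is unaffected.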
3.3 Text recognition
Our method for text recognition is based on the Convolutional Recurrent Neural Network (CRNN) model. We use the architecture with seven convolutional layers followed by two bidirectional LSTM layers. The probability of a sequence is given by a Connectionist Temporal Classification (CTC) layer. As input, the model takes an image displaying a single word. The image must be scaled to a fixed height, while its width can vary. The sequences that are input to the recurrent layers are generated by concatenating columns of the feature maps produced by the convolutional layers.
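To make the role of the CTC output more concrete, the sketch below shows greedy (best-path) decoding, a common way of turning per-column label probabilities into a word. The decoding strategy is an assumption made for illustration, as are the function name, the convention that label 0 is the CTC blank and the toy alphabet.

```python
def ctc_greedy_decode(frame_probs, alphabet):
    """Best-path CTC decoding.

    frame_probs: one probability list per time step; index 0 is the blank,
                 index i > 0 corresponds to alphabet[i - 1] (assumed layout).
    """
    blank = 0
    # Most likely label at every time step.
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    decoded = []
    previous = None
    for label in best_path:
        # Collapse repeated labels, then drop blanks.
        if label != previous and label != blank:
            decoded.append(alphabet[label - 1])
        previous = label
    return "".join(decoded)
```

A blank between two identical labels is what allows doubled letters to be decoded, so the path `a, a, blank, a, b` yields `aab`.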
Although the text recognition module based on the CRNN performs well in general scenarios, its performance can be further improved by adjusting the training dataset to the application scenario. In our case, the goal is to recognize textual overlays presented in social media videos. Those videos often contain text blocks with special characters, and the text is frequently displayed in challenging conditions (various backgrounds, font colors and sizes, etc.). To address those challenges, we propose to improve the CRNN-based recognition model by fine-tuning the network on a synthetically generated dataset. Below, we outline the details of the dataset generation procedure which, as shown in Sec. 4, leads to significant performance improvements.
3.3.1 Generation of a synthetic dataset
As shown by Jaderberg et al., training a text recognition model with synthetically generated datasets can improve its results. Furthermore, because our system targets text recognition in social media videos, the existing datasets typically used for training text recognition models may not be sufficient: they mostly contain natural images, which differ considerably from those published on social media. We therefore propose to synthetically generate a dataset that simulates the conditions observed in social media, such as diverse backgrounds of text blocks and various fonts and colors of the displayed text. The generation of the synthetic dataset can be split into the following steps:
Text. We prepare transcripts of overlays from over 100 social media videos collected from several Facebook profiles, together with a list of the 5000 most frequent words in the Corpus of Contemporary American English, to create a set of unique single words for rendering on images. This dataset is augmented with digits and special characters, such as hyphens, commas and question marks. This augmentation is especially important, since the original CRNN model was trained on a dataset of alphanumeric characters only and its performance drops significantly on a dataset of social media videos, as shown in Tab. 2.
Background images. To increase the diversity of the synthetically generated dataset, we superimpose text blocks over various backgrounds. To diversify those backgrounds, we use 50 frames from randomly selected videos and manually extract regions without blocks of text. We ensure that the extracted regions represent a wide range of colors, intensity values and texture types. We also extract regions whose dimensions are large enough to be randomly cropped, which increases the pool of potential background images.
Fonts. We collect 71 fonts from 30 font families, with the Calibre font being the most popular one. The other fonts are picked to mimic the distribution of similar fonts in social media, based on general guidelines for editors. We present the full list of font families used below.
Random sampling. For each word in our dataset, we generate 100 samples by selecting a random font, size and color. Then, based on the size of the text, we crop a randomly selected background image and superimpose the text on the cropped image. All images are resized to 100×32 px with anti-aliasing and saved in JPEG format. Fig. 8 shows a comparison between real and synthetically generated frames with text.
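The random sampling step can be sketched as follows. The sketch only draws the rendering parameters and the background crop box; the actual rendering needs an imaging library and is left out. The size range, the rough text-size estimate, the data layout and all names are assumptions made for illustration, while the 100 samples per word and the 100×32 px output size follow the description above.

```python
import random


def sample_render_spec(word, fonts, backgrounds, rng, size_range=(16, 64)):
    """Draw a random font, size, color and background crop for one word.

    backgrounds: list of (name, width, height) tuples (assumed layout).
    """
    font = rng.choice(fonts)
    size = rng.randint(size_range[0], size_range[1])
    color = tuple(rng.randint(0, 255) for _ in range(3))
    # Rough estimate of the rendered text extent; a real renderer would
    # measure the text with the chosen font instead.
    crop_w = max(1, int(0.6 * size * len(word)))
    crop_h = max(1, int(1.2 * size))
    candidates = [bg for bg in backgrounds if bg[1] >= crop_w and bg[2] >= crop_h]
    if not candidates:
        raise ValueError("no background region is large enough for this text")
    name, bg_w, bg_h = rng.choice(candidates)
    left = rng.randint(0, bg_w - crop_w)
    top = rng.randint(0, bg_h - crop_h)
    return {
        "word": word,
        "font": font,
        "size": size,
        "color": color,
        "background": name,
        "crop_box": (left, top, left + crop_w, top + crop_h),
        "output_size": (100, 32),  # final resize target in pixels
    }


def sample_specs_for_word(word, fonts, backgrounds, samples_per_word=100, seed=0):
    """Generate the rendering specifications for all samples of one word."""
    rng = random.Random(seed)
    return [sample_render_spec(word, fonts, backgrounds, rng)
            for _ in range(samples_per_word)]
```

Each returned specification can then be passed to a renderer that crops the background, draws the word and resizes the result.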
We use the synthetically generated dataset to fine-tune our CRNN model. We experiment with three different variants of the CRNN tuning procedure:
We modify the output dimensions of the last LSTM layer to adapt it to the extended set of characters. Only the last LSTM layer is initialized with random weights and updated, while the parameters of all other layers are frozen.
We modify the output dimensions of the last LSTM layer, but both LSTM layers are initialized with random weights, while the other parameters are frozen.
We load the weights from the pretrained model and change the output dimensions of the last LSTM layer. All network parameters are updated during training.
The comparison of the results obtained with different variants is shown in Sec. 4.
3.4 Text merging and rectification
Although our frame extraction component is fairly robust, it does not prevent the text extracted from overlays from overlapping across consecutive frames. We propose a novel yet simple method for filtering out such overlaps. We assume that the components of a single overlay appear gradually and remain visible until the end of a scene. The version containing the greatest number of characters can therefore be considered the final one. We sort the extracted overlays by the time of their appearance in reverse order. We then compare the texts of consecutive overlays using the normalized Levenshtein distance. If the distance is below a given threshold, we consider the overlays to be overlapping and discard the one with fewer characters. We also use an autocorrection toolkit (https://github.com/phatpiglet/autocorrect/) to further improve the results of the OCR model.
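The merging procedure can be sketched in a few lines. The threshold value of 0.5, the normalization by the longer string, the comparison against the most recently kept overlay and all function names are assumptions made for this illustration.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Levenshtein distance divided by the length of the longer string."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def merge_overlays(overlays, threshold=0.5):
    """Drop partial versions of the same overlay.

    overlays: list of (timestamp, text) pairs in any order.
    Returns the kept pairs in chronological order.
    """
    # Process overlays from the latest to the earliest.
    ordered = sorted(overlays, key=lambda item: item[0], reverse=True)
    kept = []
    for timestamp, text in ordered:
        if kept and normalized_distance(kept[-1][1], text) < threshold:
            # Overlapping overlays: keep the one with more characters.
            if len(text) > len(kept[-1][1]):
                kept[-1] = (timestamp, text)
            continue
        kept.append((timestamp, text))
    kept.reverse()
    return kept
```

For an overlay that builds up as "The quick brown", "The quick brown fox" and "The quick brown fox jumps", only the last, longest version is retained, while an unrelated later overlay is kept as a separate entry.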
In this section, we present the results of the evaluation of our method on a benchmark dataset. We first present the evaluation dataset along with the evaluation metrics. We then show the experimental results obtained for different preprocessing steps. Finally, we compare the results obtained with our method against those of systems based on the Tesseract or CRNN models.
To measure the accuracy of the OCR component of our system, we extract frames from 100 videos published on Facebook between June 2017 and January 2018 on the NowThisNews, NowThisPolitics, NowThisHer, thedodosite and SeekerMedia channels using ffmpeg, as described in Sec. 3.1. Then, using the method presented in Sec. 3.2, we detect and extract single-word images from a random subset of frames. We discard images less than 20 px in height. We also exclude images with fewer than 3 characters, as well as images that are part of a media brand logo, as they would introduce overlaps in our test set. We randomly select 1000 of the remaining images and manually annotate them to use as the final test set. To measure the end-to-end performance of the OCR and text detection components, we annotate 100 randomly selected frames displaying 1128 words in total. For each frame, we mark the location of the text and transcribe all the words shown in the frame. The set of videos used to extract those frames is separate from the set used to select the list of words for generating synthetic images. We have not explicitly excluded repetitions of other words. We assume that the random selection of frames and of the words taken from them is enough to prevent two identical images from being included in our test set.
4.2 Evaluation metrics
To evaluate our system and compare it with the baseline, we follow the evaluation protocol of Karatzas et al. and compute several metrics: the average precision, recall and F1 score of the system output. We also compare the targets with the predictions using a similarity metric based on the normalized Levenshtein distance. All metrics are calculated at the word level. A similar metric was used by Lundqvist and Wallberg to evaluate OCR accuracy on distorted images, which may also occur in our task due to the frame extraction process.
The metrics are computed according to the following formulas:
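Under the standard word-level definitions, which are assumed here rather than quoted from the evaluation protocol, the metrics can be written as below. $TP$, $FP$ and $FN$ denote the numbers of correctly recognized, spurious and missed words, $t$ and $p$ are a target and a predicted word, and $\mathrm{lev}$ is the Levenshtein distance; the normalization by the longer of the two strings is likewise an assumption.

```latex
\begin{align}
\mathrm{precision} &= \frac{TP}{TP + FP}, &
\mathrm{recall} &= \frac{TP}{TP + FN}, \\
F_1 &= \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}
            {\mathrm{precision} + \mathrm{recall}}, &
\mathrm{sim}(t, p) &= 1 - \frac{\mathrm{lev}(t, p)}{\max(|t|, |p|)}.
\end{align}
```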
4.3 Preprocessing methods
The detection and recognition modules of our system expect grayscale images as their input, and we evaluate several preprocessing methods that aim to improve the quality of text recognition on grayscale images. To that end, we test the following preprocessing methods with a pretrained CRNN model and the Tesseract OCR engine:
No preprocessing: raw, grayscale images are input to the recognition component.
Gaussian blurring with px kernel followed by Otsu’s binarization: an additional blurring step can potentially increase the robustness of the system.
Gaussian blurring with Otsu’s binarization and opening: by adding the morphological opening operation, we expect to reduce the amount of noise in the images.
Max-RGB filter: we flatten the color channels by selecting, for each pixel, the maximum value across the channels and using it as the output pixel value. This preprocessing method is based on the assumption that the text and the background have different colors, so that using this filter should improve the contrast of the image.
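The max-RGB flattening described in the last item can be sketched as follows. The image representation as nested lists of `(r, g, b)` tuples and the function name are assumptions chosen to keep the example free of external dependencies.

```python
def max_rgb_to_gray(image):
    """Flatten an RGB image to one channel by taking the per-pixel channel maximum.

    image: list of rows, each row a list of (r, g, b) tuples with values 0-255.
    """
    return [[max(pixel) for pixel in row] for row in image]
```

For example, a saturated yellow text pixel `(255, 255, 0)` maps to 255, while a dark background pixel `(40, 40, 60)` maps to 60.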
|Preprocessing method|Tesseract|CRNN model|
|---|---|---|
|Gaussian blur + Otsu|52%|65.9%|
|Gaussian blur + Otsu + opening|57.2%|67.4%|
Tab. 1 shows the results of the experiments with preprocessing methods. For the Tesseract OCR engine, the best preprocessing method is Otsu’s binarization, yet an identical result was obtained without preprocessing. For the CRNN model, which we use in our system in practice, the best results are obtained with max-RGB filtering. Nevertheless, the improvement achieved by the best preprocessing method is negligible. This suggests that the convolutional layers of the CRNN model learn transformations that make explicit preprocessing steps largely unnecessary. One can also see that the neural network-based model significantly outperforms the traditional Tesseract OCR system.
Accuracy tests performed with the original model and its fine-tuned versions, presented in Table 2, show an improvement for the cases where only the parameters of the LSTM layers were updated during training. This suggests that the generated set may be too small to train the entire network without overfitting. At the same time, updating the parameters of both LSTM layers turns out to be better than modifying only the last layer. This indicates that the features encoded by the penultimate LSTM layer are not generic enough and that our synthetic training set is sufficient to learn new features specific to the task. The best model reduced the word recognition error by 20%.
The evaluation of text detection and recognition presented in Table 2 shows that all fine-tuned models perform better than the original version on this specific task. Using the fine-tuned CRNN model leads to a 20% increase in the precision, recall, F1 score and similarity metrics compared to a generic CRNN model.
The end-to-end results show that the text detection component plays an important role in the system. Imperfect detection lowers the quality of the CRNN input, which translates into a decrease in recognition accuracy. From a practical point of view, however, the system can already be used to extract meaningful information from social media videos. Overlays can be further processed using the presented rectification methods.
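|Model|Recognition: accuracy|Detection with recognition: precision|Detection with recognition: recall|Detection with recognition: F1|Detection with recognition: similarity|
|---|---|---|---|---|---|
|Fine-tuned CRNN (all parameters)|74%|0.40|0.375|0.386|0.59|
|Fine-tuned CRNN (last LSTM layer)|76.6%|0.406|0.378|0.389|0.59|
|Fine-tuned CRNN (both LSTM layers)|80.1%|0.45|0.42|0.432|0.62|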
To further evaluate the different models used for text recognition, we visualize the outputs of various methods on a sample video frame with an overlay. The results are shown in Fig. 12. Those qualitative results confirm the numerical evaluation performed above: our fine-tuned CRNN model provides the most accurate transcription of the overlay.
In this paper, we presented a comprehensive system for video overlay text extraction that comprises several components: keyframe extraction, text detection, recognition and rectification. The system is specifically designed and evaluated in the context of social media videos, where textual overlays appear in particularly challenging conditions. Using a synthetically generated dataset allowed us to reduce the recognition error of our neural network-based text recognition model by over 20%. Overall, the proposed system provides an effective and robust method for video overlay extraction. It has been successfully implemented and integrated into a complex social media video analysis engine and is actively used as part of many services, including a content classifier and a retention analytics engine.
This work was partially funded by the Dean’s Grant nr II/2017/GD/1 of the Faculty of Electronics and Information Technology at Warsaw University of Technology.
-  Baoguang Shi, Xiang Bai, and Cong Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. CoRR, abs/1507.05717, 2015.
-  Cong Yao, Jia-Nan Wu, Xinyu Zhou, Chi Zhang, Shuchang Zhou, Zhimin Cao, and Qi Yin. Incidental scene text understanding: Recent progresses on ICDAR 2015 robust reading competition challenge 4. CoRR, abs/1511.09207, 2015.
-  Shangxuan Tian, Yifeng Pan, Chang Huang, Shijian Lu, Kai Yu, and Chew Lim Tan. Text flow: A unified text detection system in natural scene images. CoRR, abs/1604.06877, 2016.
-  M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. arXiv preprint arXiv:1406.2227, 2014.
-  Toshio Sato, Takeo Kanade, Ellen Hughes, Michael Smith, and Shin ichi Satoh. Video ocr: Indexing digital news libraries by recognition of superimposed caption. In ACM Multimedia Systems Special Issue on Video Libraries, February 1998.
-  Haojin Yang, Cheng Wang, Christian Bartz, and Christoph Meinel. Scenetextreg: A real-time video ocr system. In Proceedings of the 2016 ACM on Multimedia Conference, MM ’16, pages 698–700, New York, NY, USA, 2016. ACM.
-  Raghvendra Kannao and Prithwijit Guha. Overlay text extraction from TV news broadcast. CoRR, abs/1604.00470, 2016.
-  Michael Donoser and Horst Bischof. Efficient maximally stable extremal region (mser) tracking. In CVPR, 2006.
-  Ray Smith. An overview of the tesseract ocr engine. In Proc. Ninth Int. Conference on Document Analysis and Recognition (ICDAR), pages 629–633, 2007.
-  Huan Yang, Baoyuan Wang, Stephen Lin, David P. Wipf, Minyi Guo, and Baining Guo. Unsupervised extraction of video highlights via robust recurrent auto-encoders. CoRR, abs/1510.01442, 2015.
-  Minghui Liao, Baoguang Shi, Xiang Bai, Xinggang Wang, and Wenyu Liu. Textboxes: A fast text detector with a single deep neural network. CoRR, abs/1611.06779, 2016.
-  Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: single shot multibox detector. CoRR, abs/1512.02325, 2015.
-  Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, ICML ’06, pages 369–376, New York, NY, USA, 2006. ACM.
-  Mark Davies. The corpus of contemporary american english (coca): 560 million words, 1990-present., 2008.
-  D. Karatzas, S. R. Mestre, J. Mas, F. Nourbakhsh, and P. P. Roy. Icdar 2011 robust reading competition - challenge 1: Reading text in born-digital images (web and email). In 2011 International Conference on Document Analysis and Recognition, pages 1485–1490, Sept 2011.
-  V. I. Levenshtein. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady, 10:707, February 1966.
-  Lundqvist and O. Wallberg. Natural image distortions and optical character recognition accuracy. PhD thesis, KTH, School of Computer Science and Communication, 2016.
-  R. Smith. An overview of the tesseract ocr engine. In Proceedings of the Ninth International Conference on Document Analysis and Recognition - Volume 02, ICDAR ’07, pages 629–633, Washington, DC, USA, 2007. IEEE Computer Society.