Visual Summarization of Scholarly Videos using Word Embeddings and Keyphrase Extraction

11/25/2019 ∙ by Hang Zhou, et al. ∙ Technische Informationsbibliothek

Effective learning with audiovisual content depends on many factors. Besides the quality of the learning resource's content, it is essential to discover the most relevant and suitable video in order to support the learning process most effectively. Video summarization techniques facilitate this goal by providing a quick overview over the content. It is especially useful for longer recordings such as conference presentations or lectures. In this paper, we present an approach that generates a visual summary of video content based on semantic word embeddings and keyphrase extraction. For this purpose, we exploit video annotations that are automatically generated by speech recognition and video OCR (optical character recognition).






1 Introduction

The massive growth of online video platforms underlines the role of audio-visual content as one of the most commonly used sources of information, not only for entertainment but also in learning-related scenarios. Exploring a large collection of videos in order to find the most relevant candidate for a specific learning intent can be overwhelming and therefore inefficient. This is especially true for longer videos whose title alone cannot capture all parts and aspects of the content. Approaches for video summarization address this problem by analyzing the visual content and generating an overview from a combination of identified key sequences and frames. However, such approaches struggle with videos whose visual content lacks variance or mostly comprises concepts with low visualness [32], e.g., abstract concepts. Scientific and educational videos often share this characteristic, for example tutorials or lecture recordings in the STEM subjects (Science, Technology, Engineering, and Mathematics) like chemistry or computer science.

In this paper, we propose an interactive visualization approach to summarize the content of scientific or educational videos. The goal is to provide a tool that enhances the explorative search capabilities of respective video portals and thus makes learning more efficient and satisfying for the end user. Our approach makes use of automatically extracted video annotations and entities, which significantly enrich the usually available, conventional metadata. These entities are generated from 1) the speech transcript, 2) visual concept classification, and 3) text extracted via optical character recognition (OCR). This kind of metadata is available for videos of the TIB AV-Portal, which is run by the Leibniz Information Centre for Science and Technology (TIB), and it is also provided to the public as open data. For these reasons, we choose the TIB AV-Portal as the base platform and incorporate the proposed system there. Our system utilizes these data and generates a comprehensive, interactive visualization by combining semantic word embeddings and keyphrase extraction methods. We demonstrate how to display the visualization on the actual website with a GreaseMonkey script, which is also a prerequisite for our user study that investigates the usefulness of the proposed approach for video content visualization.

The paper is structured as follows: Section 2 discusses related work on video summarization and adjacent areas, while Section 3 introduces the different components of our system and the utilized dataset. Section 4 describes the experimental setup and discusses the results. Lastly, Section 5 concludes the paper and briefly outlines areas of future work.

2 Related Work

2.0.1 Video Summarization

The vast majority of video summarization algorithms rely on visual features and are very domain-specific (e.g., movies, sports, news, documentary, surveillance, etc.), resulting in a large number of different approaches. The focus of these approaches can be dominant concepts [25], user preferences [21], query context [31], or user attention [15]. A typical result of these approaches is a sequence of keyframes or a video excerpt comprising the most important parts of a video. More recent methods treat video summarization as an optimization problem [34, 10, 7] or utilize recurrent neural networks [35, 36] based on, for instance, long short-term memory cells (LSTMs), which are able to capture temporal or sequential information very well. Another use case for LSTMs is proposed by Mahasseni et al. [22], who suggest a generative adversarial network (GAN) consisting of an LSTM-based autoencoder and a discriminator.

There are also methods that include textual information (e.g., tags [14] or full documents [19]) and produce a storyboard that provides short titles for each keyshot. This is particularly useful for news summarization. Scientific or scholarly videos pose a greater challenge in this respect, since their visual content often lacks visualness. Consequently, summarization techniques focus even more on textual metadata. Chang et al. [6] combine image processing, text summarization, and keyword extraction techniques, resulting in a multimodal surrogate: a word cloud in which more important words are displayed with a bigger font size, plus a set of three to four thumbnails with a short transcription.

In this paper, we go one step further and show how to summarize the content solely based on textual information. The core techniques used in this paper to create a video summarization are keyphrase extraction and measures of semantic text similarity. The related work in these respective areas is described below.

2.0.2 Keyphrase Extraction

Hasan and Ng [12] describe that keyphrase extraction techniques generally comprise two steps: first, a list of possible candidate phrases is identified, and then these candidates are ranked according to their importance. This is realized by a wide range of approaches that can be categorized into supervised and unsupervised methods. Early supervised algorithms rely on, for instance, decision trees [28]. Hulth [16] extends this approach by adding linguistic features to a bagged decision tree classifier, while also extending previous work by filtering incorrectly assigned keywords with different feature pairs. Another approach [8] utilizes lexical chains based on a WordNet ontology, which are associated with features such as first occurrence position, last occurrence position, and word frequency. Additionally, support vector machines [29], maximum entropy classifiers [18, 20], conditional random field models [33], logistic regression [11], and neural networks [30, 17] have been used to solve the task of finding the most important phrases in a document.

All of the aforementioned techniques share the drawback that the training data require manual labeling, which is time-consuming and resource-intensive and can introduce annotation bias. Thus, unsupervised approaches moved into the focus of attention. Their task is to automatically discover the underlying structure of a dataset without human-labeled keyphrases. The two most popular directions are graph-based ranking and topic-based clustering. The idea behind graph-based algorithms is to construct a graph of phrases connected by weighted edges that describe their relation, derived from the frequency of their co-occurrence [23]. Topic-based clustering methods use statistical language models, which model the probability of possible sequences of words [1]. Recently, fusions of these two directions have gained attention, namely TopicRank [5], PositionRank [9], and MultipartiteRank [4]. The latter, which is also used in our approach, first builds a graph representation of the document and then ranks each keyphrase with a relevance score; in an intermediate step, edge weights are adjusted to capture information about a word's position in the document.
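To make the graph-based ranking idea concrete, the following is a minimal Python sketch (a simplified TextRank-style toy, not the implementation of any cited paper): it builds a co-occurrence graph over a token list with a sliding window and runs a few iterations of weighted PageRank. The window size, damping factor, and the toy token list are illustrative assumptions.

```python
from collections import defaultdict

def cooccurrence_graph(words, window=3):
    """Weighted undirected edges from co-occurrence counts within a sliding window."""
    weights = defaultdict(float)
    for i in range(len(words)):
        for j in range(i + 1, min(i + window, len(words))):
            if words[i] != words[j]:
                edge = tuple(sorted((words[i], words[j])))
                weights[edge] += 1.0
    return weights

def rank(words, window=3, d=0.85, iters=30):
    """A few power iterations of weighted PageRank over the word graph."""
    edges = cooccurrence_graph(words, window)
    nbrs = defaultdict(dict)
    for (a, b), wt in edges.items():
        nbrs[a][b] = wt
        nbrs[b][a] = wt
    score = {v: 1.0 for v in nbrs}
    for _ in range(iters):
        new = {}
        for v in nbrs:
            # Each neighbor u passes on a share of its score proportional
            # to the weight of the edge (u, v) among all of u's edges.
            s = sum(score[u] * wt / sum(nbrs[u].values())
                    for u, wt in nbrs[v].items())
            new[v] = (1 - d) + d * s
        score = new
    return sorted(score, key=score.get, reverse=True)

tokens = "video summary video content video analysis video".split()
print(rank(tokens)[0])  # 'video' is the most central word in this toy transcript
```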

2.0.3 Semantic Text Similarity

Corpus-based similarity algorithms determine the semantic relation between two textual phrases based on information learned from large corpora like Wikipedia. Particularly neural network approaches benefit greatly from huge amounts of data, leading to the current success of methods such as Word2Vec [24], GloVe [27], and fastText [2]. They all create word-vector spaces that cover a desired vocabulary size and embed semantically similar words close to one another, while also allowing for mathematical operations on these vectors to unveil relationships. For instance, the difference vectors Paris - France and Rome - Italy are almost identical, indicating that this offset encodes the relation "capital of". Adding it to the vector of Poland therefore leads to Warsaw.
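This vector arithmetic can be illustrated with a toy Python sketch. The hand-made 3-dimensional vectors below are assumptions for illustration only (real fastText embeddings have hundreds of dimensions and are learned from data); they are constructed so that capital - country is the same offset for each pair.

```python
import math

# Hand-made toy vectors (NOT real fastText embeddings).
vec = {
    "Paris":  [0.9, 0.8, 0.1],
    "France": [0.1, 0.8, 0.1],
    "Rome":   [0.9, 0.2, 0.7],
    "Italy":  [0.1, 0.2, 0.7],
    "Warsaw": [0.9, 0.5, 0.4],
    "Poland": [0.1, 0.5, 0.4],
}

def add(a, b): return [x + y for x, y in zip(a, b)]
def sub(a, b): return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Learn the "capital of" offset from one pair, apply it to another country,
# then look up the vocabulary word closest to the resulting vector.
offset = sub(vec["Paris"], vec["France"])
guess = add(vec["Poland"], offset)
best = max(vec, key=lambda w: cosine(vec[w], guess))
print(best)  # Warsaw
```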

3 Visual Summarization of Scientific Video Content

In this section, we describe our approach for video content summarization solely based on textual information. The process to summarize a scientific video and display this information efficiently (see Figure 1) consists of four steps: 1) pre-processing, 2) semantic embedding of content-related information to generate a bubble diagram, 3) creation of a keyphrase table from the speech transcript, and 4) combining diagram and table to form a visualization. The utilized video dataset from the TIB AV-Portal is publicly available, including the associated metadata as Resource Description Framework (RDF) triples (under Creative Commons license CC0 1.0 Universal).

Figure 1: Workflow diagram of the proposed visualization approach.

3.0.1 Preprocessing

To build the RDF graph we use Python 3.6 and the rdflib library. Next, we use the query language SPARQL to select videos that contain automatically extracted metadata (this applies only to videos related to the six core subjects of the TIB: engineering, architecture, chemistry, computer science, mathematics, and physics). An exemplary query can be seen in Listing 1.

PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX oa:      <http://www.w3.org/ns/oa#>
SELECT DISTINCT ?url WHERE {
    ?annotation oa:annotatedBy asr_link .
    ?annotation oa:hasTarget ?videofragment .
    ?videofragment dcterms:isPartOf ?url . }
Listing 1: SPARQL query that returns all videos which contain automatically analyzed speech transcripts (ASR) and recognized entities.

This yields a list of videos in multiple languages, which we then query further for the embedded metadata, in particular the key entities that result from visual concept classification, optical character recognition, and automatic speech recognition. Additionally, we crawl the unfiltered speech transcript from the website using the BeautifulSoup library.

3.0.2 Semantic Embedding of Key Entities

We use fastText to generate word embeddings from the extracted key entities. fastText's tri-gram technique embeds words by their substrings instead of the whole word; for instance, the word google is decomposed into the following tri-grams (with word-boundary markers): <go, goo, oog, ogl, gle, le>. This is a valuable feature for multiple reasons. First, it enables the system to encode misspelled or unknown words. Second, it also improves the quality of embeddings for the generally longer or compound words of the German language. We use the pre-trained model for German, which contains the vocabulary of the German Wikipedia and encodes each word in a 300-dimensional vector. The visualization of the embedded feature vectors requires dimensionality reduction to project the data onto a two-dimensional space. We apply a linear algorithm (principal component analysis) instead of a non-linear one like t-SNE, since we intend to keep the semantic arrangement laid out by fastText and refrain from clustering the keywords further.
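The two operations above can be sketched in a few lines of Python: the subword decomposition pads a word with boundary markers and slides a tri-gram window, and the PCA projection centers the vectors and uses an SVD. The 300-dimensional random vectors are stand-ins for real fastText output.

```python
import numpy as np

# fastText-style subword decomposition: pad with word-boundary markers
# "<" and ">" and slide a character tri-gram window across the word.
def char_trigrams(word):
    padded = f"<{word}>"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

print(char_trigrams("google"))  # ['<go', 'goo', 'oog', 'ogl', 'gle', 'le>']

# PCA projection of toy 300-d "embeddings" onto 2-D for plotting:
# center the data, then project onto the first two right singular vectors.
rng = np.random.default_rng(42)
X = rng.normal(size=(6, 300))        # six stand-in entity vectors
Xc = X - X.mean(axis=0)              # center each dimension
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
coords = Xc @ Vt[:2].T               # 2-D coordinates for the bubble diagram
print(coords.shape)                  # (6, 2)
```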

3.0.3 Keyphrase Extraction

The input for the keyphrase extraction process is the unfiltered speech transcript, which is already divided into time segments in TIB's AV-Portal. The textual information has to be provided in the format required by the pke toolkit [3], shown in Listing 2. Requirements are tokenization and part-of-speech (POS) tagging, that is, the assignment of lexical categories such as noun, verb, adjective, and adverb. For this process we use the Python Natural Language Toolkit (NLTK), in particular the Stanford POS tagger, which also comes with a pre-trained model for the German language.

wenig/PRON Speicher/NOUN es/PRON kommen/VERB [...]
Listing 2: POS-Tagged speech transcript labeled with lexical categories.

The results of the POS tagging process are then passed to the MultipartiteRank [4] algorithm of the pke library in order to perform keyphrase extraction. As stated in Section 2, this technique models topics and phrases in a single graph, and their mutual reinforcement, together with a specific mechanism to select the most important keyphrases, is used to generate candidate rankings. We only consider nouns, adjectives, proper nouns, and verbs ('NOUN', 'ADJ', 'PROPN', 'VERB') and dismiss all words contained in NLTK's collection of German stop words. The remaining parameters are the factor that controls the weight adjustment mechanism and the threshold that defines the minimum similarity for clustering; we adjust the latter from its default value to account for the high similarity of topics within a single video. The linkage method was set to average. Finally, we retrieve the highest-ranked keyphrases of every time segment for our keyphrase table, which becomes part of the visualization.
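The candidate-selection part of this step can be sketched as follows. This is a drastically simplified stand-in for illustration, not the actual pke/MultipartiteRank implementation: it filters the POS-tagged tokens from the Listing 2 format by the allowed categories, removes stop words, and ranks the survivors by frequency instead of graph centrality. The tiny stop-word list and the sample segment are made up.

```python
from collections import Counter

ALLOWED = {"NOUN", "ADJ", "PROPN", "VERB"}
# Tiny stand-in for NLTK's full German stop-word list.
STOPWORDS = {"es", "und", "der", "die", "das"}

def rank_candidates(tagged_segment, n=5):
    """Select candidates from a 'word/POS' tagged segment, rank by frequency."""
    tokens = [t.rsplit("/", 1) for t in tagged_segment.split()]
    candidates = [word.lower() for word, pos in tokens
                  if pos in ALLOWED and word.lower() not in STOPWORDS]
    return [word for word, _ in Counter(candidates).most_common(n)]

segment = "wenig/PRON Speicher/NOUN es/PRON kommen/VERB Speicher/NOUN voll/ADJ"
print(rank_candidates(segment))  # ['speicher', 'kommen', 'voll']
```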

3.0.4 Visualization of Results

Finally, we display the recognized, embedded entities in an interactive graph with the properties shown in Table 1 and combine it with the keyphrase table generated in the previous step.

| Component    | Meaning                                          | Approach             |
| ------------ | ------------------------------------------------ | -------------------- |
| Circle       | Key topic                                        | Recognized entities  |
| Circle size  | Importance of the topic                          | Entity frequency     |
| Arrangement  | Similarity between topics                        | Word embeddings      |
| Table        | Timestamp-based summary of the speech transcript | Keyphrase extraction |

Table 1: Overview of the properties of the visualization.

We choose a bubble diagram as opposed to Chang et al.'s [6] word cloud. This allows us to also illustrate and emphasize the distance between related or unrelated keywords, which reflects their (dis)similarity. In addition, small differences in area are visually easier to perceive than small differences in font size. We decided against alternative implementations such as TextArc [26] since we aimed for a more intuitive approach. Including the temporal dimension using ThemeRiver [13] did not deliver consistent results for videos that were short or contained only few keywords. In addition, ThemeRiver is less suitable for representing the similarity of several entities.
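The area-over-font-size argument implies a simple sizing rule, sketched below. The formula and frequencies are illustrative assumptions, not the paper's exact implementation: making the circle area proportional to entity frequency means the radius grows with the square root of the frequency.

```python
import math

# Assumed toy frequencies for entities of the "Bubblesort, Quicksort, Runtime" video.
freqs = {"Laufzeit": 12, "Algorithmus": 7, "Quicksort": 5, "Asymptote": 2}

def radius(freq, scale=8.0):
    """Area proportional to frequency -> radius proportional to sqrt(frequency)."""
    return scale * math.sqrt(freq)

radii = {entity: round(radius(f), 1) for entity, f in freqs.items()}
print(radii["Laufzeit"])  # largest circle belongs to the most frequent entity
```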

The actual implementation is done in JavaScript using a charting API. As displayed in Figures 2 and 3, the visualization is comprised of circles of different sizes, each representing a topic and its importance. An interactive toolbar is displayed at the upper right, allowing the user to explore the graph easily. At the bottom, a keyphrase table indicates the main topic of each time segment.

Figure 2: Visualization of the video titled "Bubblesort, Quicksort, Runtime", incorporated via GreaseMonkey into the live website as presented during the user study, comprising the visualization itself, a toolbar, and the keyphrase table. Note: Translated for better comprehensibility.

Figure 3: Visualization of the video titled "Eigenwerte, Eigenvektoren" (eng.: "eigenvalues, eigenvectors"). Note: Entities were translated for better comprehensibility.

4 Experiments and Results

We conducted a user study to evaluate the quality and usefulness of the proposed visualization approach. Ten participants were recruited, of which eight were male and two female. Their ages ranged from 21 to 30, and their educational levels from high school diploma to master's degree. Seven participants study computer science, one mechanical engineering, and one mathematics. All of them are fluent in German; four are native speakers. Task I of the study investigates how precisely the visual summary represents the video content. For this purpose, 10 videos with a duration of 5 to 30 minutes were randomly assigned to each participant. The participants then had to rate how well the presented visualization matches the video content, based on the following options: "0" - no correlation, "1" - slight match, "2" - good match, "3" - exact match. Task II aimed to evaluate whether the visualization is a useful tool for providing a quick overview of the video content, or whether it is no improvement over the current state of the website. The participants could choose one of the following options to rate the usefulness: "0" - not helpful at all, "1" - slightly helpful, "2" - moderately helpful, "3" - very helpful, "4" - extremely helpful, and had to give a short statement about their reasoning. Figure 3(a) shows the distribution of the 100 gathered ratings, while Figure 3(b) shows the results of Task II.


(a) Results of Task I of the user study evaluating the correlation of the visualization to the video content. From ”0” - no correlation to ”3” - exact match.


(b) Results of Task II of the user study showing the perceived helpfulness of the visualization. From ”0” - not helpful at all to ”4” - extremely helpful.

4.0.1 Discussion

Figure 3(a) shows that the majority of the visualizations were rated as a good or exact match, while only a small fraction provided a slight match or did not correlate at all with the video content. Positive examples, as can be seen in Figure 2, successfully provide the user with a summarization of the video content. The first example, video 9557, explains the runtime behavior of the sorting algorithms Bubblesort and Quicksort. The largest circle in the visualization is runtime ("Laufzeit") and represents the main topic well. Related topics from computer science covered in the video, like sorting methods ("Sortierverfahren"), algorithm ("Algorithmus"), and Quicksort itself, are closely arranged on the left, while related topics from mathematics, like factorization ("Faktorisierung"), asymptote ("Asymptote"), and statement ("Aussage <Mathematik>"), are grouped on the right. The second positive example (video 10234), which covers eigenvalues and eigenvectors, is mainly represented by the entities matrix multiplication ("Matrizenmultiplikation") and vector ("Vektor"), but also shows more detailed aspects of that topic, namely vector algebra ("Vektorrechnung"), inverse matrix, gradient, and of course eigenvector and eigenvalue.

The results of the keyphrase extraction, as can be seen in Figure 2, were less helpful. The main reason is most likely the nature of the automatic speech transcripts, which usually differ from written text. They often contain incomplete sentences, misspelled words, missing punctuation, and falsely recognized words that can change the interpretation of a sentence completely. Since common keyphrase models are designed for well-formed textual content, there is still room for improvement in our scenario.

Figure 4: Visualization of a video demonstrating the effect of the very common entity "Geschwindigkeit" (eng.: velocity), which was used frequently by the speaker during an example scenario but is misleading, since the video is about arc length computation. Note: Entities were translated for better comprehensibility.

In order to find out what led participants to give the rating "no correlation", we reviewed these 6 videos and found that they came from the subject of engineering and had very application-specific content, which might be a limitation of the system. One video, for instance, discusses the causes, consequences, and solutions of driftwood accumulation on bridges leading to overflowing rivers. A large number of technical terms, frequent context switches from the real world to model testing to technical considerations, and topic-specific phrases yielded a visualization that was only marginally helpful. Finally, the reason that more results present a "good match" rather than an "exact match" is most likely the nature of the entities extracted from the speech transcript. For example, videos and tutorials from the field of mathematics contain many terms that are important when explaining a concept but are rather general and not closely related to the topic itself, including words like "square", "point", and "integral". These words are captured by the system and present in the dataset, but they contribute only marginally to the comprehension of the video even though they appear very frequently. A correspondingly less useful summary result is shown as an example in Figure 4. This circumstance is also reflected in the results of Task II, where our participants agreed that the visualization would be more helpful if such redundant keywords were omitted.

5 Conclusion and Future Work

In this paper, we have presented a system that summarizes and displays the content of scholarly videos in order to support semantic search in video portals. Based on the entirely automatic video content analysis conducted in the TIB AV-Portal, we have proposed an approach that leverages the resulting metadata to generate an interactive visualization and a keyphrase table outlining the content of a video. Techniques like POS tagging, semantic word embeddings, and keyphrase extraction were exploited in our approach. The usefulness of the visualization was evaluated in a user study that demonstrated the feasibility of the proposed visual summarization, but also indicated areas for future work. For instance, we plan to implement reliable filters for keywords that are not closely related to the content in order to provide a better user experience.


  • [1] Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. Journal of Machine Learning Research 3(Jan), 993–1022 (2003)
  • [2] Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, 135–146 (2017)
  • [3] Boudin, F.: pke: an open source python-based keyphrase extraction toolkit. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations. pp. 69–73. Osaka, Japan (December 2016)
  • [4] Boudin, F.: Unsupervised keyphrase extraction with multipartite graphs. arXiv preprint arXiv:1803.08721 (2018)
  • [5] Bougouin, A., Boudin, F., Daille, B.: Topicrank: Graph-based topic ranking for keyphrase extraction. In: International Joint Conference on Natural Language Processing (IJCNLP). pp. 543–551 (2013)
  • [6] Chang, W.H., Yang, J.C., Wu, Y.C.: A keyword-based video summarization learning platform with multimodal surrogates. In: 2011 IEEE 11th International Conference on Advanced Learning Technologies. pp. 37–41. IEEE (2011)
  • [7] Elhamifar, E., Clara De Paolis Kaluza, M.: Online summarization via submodular and convex optimization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1783–1791 (2017)
  • [8] Ercan, G., Cicekli, I.: Using lexical chains for keyword extraction. Information Processing & Management 43(6), 1705–1714 (2007)
  • [9] Florescu, C., Caragea, C.: Positionrank: An unsupervised approach to keyphrase extraction from scholarly documents. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 1105–1115 (2017)
  • [10] Gygli, M., Grabner, H., Van Gool, L.: Video summarization by learning submodular mixtures of objectives. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3090–3098 (2015)
  • [11] Haddoud, M., Abdeddaïm, S.: Accurate keyphrase extraction by discriminating overlapping phrases. Journal of Information Science 40(4), 488–500 (2014)
  • [12] Hasan, K.S., Ng, V.: Automatic keyphrase extraction: A survey of the state of the art. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). vol. 1, pp. 1262–1273 (2014)
  • [13] Havre, S., Hetzler, E., Whitney, P., Nowell, L.: Themeriver: Visualizing thematic changes in large document collections. IEEE transactions on visualization and computer graphics 8(1), 9–20 (2002)
  • [14] Hong, R., Tang, J., Tan, H.K., Ngo, C.W., Yan, S., Chua, T.S.: Beyond search: Event-driven summarization for web videos. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 7(4),  35 (2011)
  • [15] Hua, X.S., Lu, L., Zhang, H.J.: A generic framework of user attention model and its application in video summarization. IEEE Transactions on Multimedia 7(5), 907–919 (2005)
  • [16] Hulth, A.: Reducing false positives by expert combination in automatic keyword indexing. Recent Advances in Natural Language Processing III: Selected Papers from RANLP 2003 260,  367 (2004)
  • [17] Jo, T.: Neural based approach to keyword extraction from documents. In: International Conference on Computational Science and Its Applications. pp. 456–461. Springer (2003)
  • [18] Kim, S.N., Kan, M.Y.: Re-examining automatic keyphrase extraction approaches in scientific articles. In: Proceedings of the workshop on multiword expressions: Identification, interpretation, disambiguation and applications. pp. 9–16. Association for Computational Linguistics (2009)
  • [19] Li, Z., Tang, J., Wang, X., Liu, J., Lu, H.: Multimedia news summarization in search. ACM Transactions on Intelligent Systems and Technology (TIST) 7(3),  33 (2016)
  • [20] Liu, F., Liu, F., Liu, Y.: A supervised framework for keyword extraction from meeting transcripts. IEEE Transactions on Audio, Speech, and Language Processing 19(3), 538–548 (2011)
  • [21] Lu, Z., Grauman, K.: Story-driven summarization for egocentric video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2714–2721 (2013)
  • [22] Mahasseni, B., Lam, M., Todorovic, S.: Unsupervised video summarization with adversarial lstm networks. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. pp. 202–211 (2017)
  • [23] Mihalcea, R., Tarau, P.: Textrank: Bringing order into text. In: Proceedings of the 2004 conference on empirical methods in natural language processing (2004)
  • [24] Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems. pp. 3111–3119 (2013)
  • [25] Over, P., Smeaton, A.F., Awad, G.: The trecvid 2008 bbc rushes summarization evaluation. In: Proceedings of the 2nd ACM TRECVid Video Summarization Workshop. pp. 1–20. ACM (2008)
  • [26] Paley, W.B.: Textarc: Showing word frequency and distribution in text. In: Poster presented at IEEE Symposium on Information Visualization. vol. 2002 (2002)
  • [27] Pennington, J., Socher, R., Manning, C.: Glove: Global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). pp. 1532–1543 (2014)
  • [28] Turney, P.D.: Learning algorithms for keyphrase extraction. Information retrieval 2(4), 303–336 (2000)
  • [29] Wang, J., Peng, H.: Keyphrases extraction from web document by the least squares support vector machine. In: The 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI’05). pp. 293–296. IEEE (2005)
  • [30] Wang, J., Peng, H., Hu, J.s.: Automatic keyphrases extraction from document using neural network. In: Advances in Machine Learning and Cybernetics, pp. 633–641. Springer (2006)
  • [31] Wang, M., Hong, R., Li, G., Zha, Z.J., Yan, S., Chua, T.S.: Event driven web video summarization by tag localization and key-shot identification. IEEE Transactions on Multimedia 14(4), 975–985 (2012)
  • [32] Yanai, K., Barnard, K.: Image region entropy: a measure of visualness of web images associated with one concept. In: Proceedings of the 13th annual ACM international conference on Multimedia. pp. 419–422. ACM (2005)
  • [33] Zhang, C.: Automatic keyword extraction from documents using conditional random fields. Journal of Computational Information Systems 4(3), 1169–1180 (2008)
  • [34] Zhang, K., Chao, W.L., Sha, F., Grauman, K.: Summary transfer: Exemplar-based subset selection for video summarization. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1059–1067 (2016)
  • [35] Zhang, K., Chao, W.L., Sha, F., Grauman, K.: Video summarization with long short-term memory. In: European conference on computer vision. pp. 766–782. Springer (2016)
  • [36] Zhao, B., Li, X., Lu, X.: Hsa-rnn: Hierarchical structure-adaptive rnn for video summarization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7405–7414 (2018)