Time-sync Video Tag Extraction Using Semantic Association Graph

05/03/2019 ∙ by Wenmian Yang, et al. ∙ University of Macau

Time-sync comments offer a new way of extracting online video tags. However, such comments are noisy because users comment on diverse subjects, which makes accurate and fast video tag extraction challenging. In this paper, we propose an unsupervised video tag extraction algorithm named Semantic Weight-Inverse Document Frequency (SW-IDF). Specifically, we first generate a semantic association graph (SAG) using the semantic similarities and timestamps of the time-sync comments. Second, we propose two graph clustering algorithms, i.e., a dialogue-based algorithm and a topic center-based algorithm, to deal with videos with different comment densities. Third, we design a graph iteration algorithm that assigns a weight to each comment based on the degrees of the clustered subgraphs, which differentiates meaningful comments from noise. Finally, we obtain the weight of each word by combining Semantic Weight (SW) and Inverse Document Frequency (IDF). In this way, video tags are extracted automatically without supervision. Extensive experiments show that SW-IDF (dialogue-based algorithm) achieves 0.4210 F1-score and 0.4932 MAP (Mean Average Precision) on high-density comments, and 0.4267 F1-score and 0.3623 MAP on low-density comments, while SW-IDF (topic center-based algorithm) achieves 0.4444 F1-score and 0.5122 MAP on high-density comments, and 0.4207 F1-score and 0.3522 MAP on low-density comments. It outperforms the state-of-the-art unsupervised algorithms in both F1-score and MAP.




1. Introduction

Recently, watching online news and entertainment videos has become a mainstream leisure activity. The boom in online video-sharing websites raises significant challenges for the effective management and retrieval of videos. To address this, many automatic video tagging techniques based on text retrieval have been proposed (Siersdorfer et al., 2009; Raamkumar et al., 2017; Ramaboa and Fish, 2018). However, these techniques can only provide video-level tags (Wu et al., 2014). The problem is that even if the generated tags perfectly summarize the video content, users have no idea how these tags are associated with the video playback time. If videos are associated with time-sync tags, users can preview the content with both thumbnails and text along the timeline, and this textual information can further enhance users’ search experience. Although many video content analysis algorithms can generate video tags with timestamps (Hussein and Piccardi, 2017; Chen et al., 2017a), their time complexities are too high for large-scale video retrieval. Fortunately, a new type of review data, i.e., time-sync comments (TSCs), has appeared on video websites such as Youku (www.youku.com), AcFun (www.acfun.tv) and BiliBili (www.bilibili.com) in China, and NicoNico (www.nicovideo.jp) in Japan.

In this paper, we focus on efficiently extracting time-sync video tags from TSCs, which can enhance users’ search experience. When watching a video, many people are willing to share their feelings and exchange ideas with others. TSCs are a new form of real-time, interactive, crowdsourced comments (Wang et al., 2017, 2016a; Gu et al., 2017; Hyung et al., 2017). TSCs are displayed as streams of moving subtitles overlaid on the video screen, and convey information about the content of the current video frame, the feelings of users, or replies to other TSCs. On TSC-enabled online video platforms, users can synchronize their comments to a video’s playback time. That is, once a user posts a TSC, it is synchronized to the associated video time and immediately displayed on the video. All viewers of the video (including the writer) can see the TSC when they watch around the associated video time. Moreover, each TSC has a timestamp that records the video time at which it was posted. Therefore, compared with traditional video reviews, it is much easier to obtain local tags with timestamps, rather than only video-level tags, from TSCs. Moreover, TSCs are more personalized than traditional reviews, so the tags generated from TSCs can better reflect the users’ perspective. Users can thereby get high-quality retrieval results when they search for videos with these tags (Wu et al., 2014).

Recently, some methods have been proposed to generate temporal tags or labels based on TSCs. Wu et al. (Wu et al., 2014) use statistics and a topic model to build Temporal and Personalized Topic Modeling (TPTM) to generate temporal tags. However, their approach is based on the Latent Dirichlet Allocation (LDA) model (Blei et al., 2003), which performs poorly on short and noisy text such as TSCs (Yan et al., 2013). Lv et al. (Lv et al., 2016) propose a Temporal Deep Structured Semantic Model (T-DSSM) to generate video labels in a supervised way. However, their approach does not consider the semantic association between TSCs, so some noise that is independent of the video content cannot be handled. In summary, TSCs have some features that distinguish them from common comments (Yang et al., 2017; Liao et al., 2018) and that make the above methods less effective on TSCs: (1) Semantic relevance. TSCs contain abundant semantic information that describes both local and global video content, depending on the selected time interval of the timestamps. (2) Real-time. TSCs are synchronous with the real-time content of the videos. Users may raise different topics in response to the same video content. (3) Herding effects. Herding effects are common in TSCs (He et al., 2016; Yu et al., 2015). That is, later TSCs may depend on earlier ones and have a semantic association with them. (4) Noise. TSCs include comments that are independent of the video content as well as internet slang, which complicates tag extraction. Because of these features, the major challenges in extracting video tags from TSCs are how to deal with the herding effects, how to distinguish the importance of each TSC, and consequently how to identify high-impact TSCs and noise.

To make full use of the features of TSCs and tackle the above challenges, we propose a graph-based algorithm named Semantic Weight-Inverse Document Frequency (SW-IDF) to generate time-sync video tags automatically. More precisely, we reduce the impact of noise by clustering semantically similar and time-related TSCs, and we identify high-impact TSCs by their semantic relationships. Intuitively, TSCs that include video tags usually belong to hot topics and influence the trend of their follow-up TSCs. In contrast, noise usually neither has similar semantic relationships with other TSCs over a period nor influences other TSCs (Yang et al., 2017). Moreover, we find that the density of TSCs (number of TSCs per unit time) affects how users communicate. When the density is low (the TSCs in a period are sparse), the user can clearly distinguish the content of each nearby TSC, and is therefore more likely to reply to a specific TSC when posting a new one. Conversely, when the density is high (the TSCs in a period are dense), the user cannot clearly distinguish the content of each TSC, but can only roughly distinguish the topic of these TSCs. Therefore, the user is more likely to reply to the entire topic instead of a specific TSC. Specifically, in the SW-IDF algorithm, we first treat the TSCs as vertices and generate the semantic association graph (SAG) based on the semantic similarities and timestamps of TSCs. Then, we cluster TSCs into different topics. For videos with low-density TSCs, we propose a dialogue-based clustering algorithm, which is inspired by community detection theory (Huang et al., 2017; Fortunato, 2010; Lancichinetti and Fortunato, 2009). For videos with high-density TSCs, we propose a topic center-based clustering algorithm, which is a novel hierarchical agglomerative clustering method (Pandove et al., 2018; Murtagh and Legendre, 2014; Murtagh and Contreras, 2012). These two clustering algorithms identify the topic of each TSC and distinguish the popularity of each topic. In the clustered subgraph, the in-degree of each TSC expresses the TSCs that affect it, while the out-degree expresses the TSCs that it affects. Therefore, we design a graph iteration algorithm that assigns a weight to each TSC according to its degrees so that we can differentiate meaningful TSCs from noise. Moreover, similar to the TF-IDF algorithm, we obtain the weight of each word by combining Semantic Weight (SW) and Inverse Document Frequency (IDF), and the video tags are extracted automatically.

Particularly, this paper is an extended version of (Yang et al., 2017). In this extended version, we propose a novel topic center-based cluster algorithm at first, which is more suitable for high-density TSCs. Then, we provide a greedy optimization for the topic center-based algorithm and prove this optimization will not delete any valid case. Finally, we add more experiments to verify the effectiveness of the algorithms. The main contributions of our paper are as follows:

  1. We propose a novel graph-based Semantic Weight-Inverse Document Frequency (SW-IDF) algorithm, which can extract video tags in an unsupervised way by mining TSCs.

  2. We design two graph clustering algorithms based on the density of the TSCs, i.e., dialogue-based clustering algorithm and topic center-based cluster algorithm, to cluster in the semantic association graph (SAG). These algorithms take the features of TSCs into account and effectively reduce the impact of noises.

  3. We evaluate our proposed algorithms with real-world datasets on mainstream video-sharing websites and compare results with classical keyword extraction methods. The results show that SW-IDF outperforms baselines in both precision and recall of video tag extraction.

In the rest of the paper, we introduce the related work in Section 2, and then formally propose our algorithm in Section 3. In Section 4, we verify the effectiveness of our algorithm with experimental results. Conclusions of this work are presented in Section 5.

2. Related Work

In this section, we introduce the related work from four aspects.

2.1. Analysis of time-sync video comments

Time-Sync Comments (TSCs) provide a new source of information about videos and have received growing research interest. Wu et al. (Wu et al., 2014) first introduce TSCs and propose Temporal and Personalized Topic Modeling (TPTM) to generate temporal tags. However, their approach is based on the Latent Dirichlet Allocation (LDA) model (Blei et al., 2003), which performs poorly on short text such as TSCs (Yan et al., 2013). To describe the video more specifically, Xu and Zhang (Xu and Zhang, 2017) extract representative TSCs based on a temporal summarization model. Their method needs keywords pre-extracted from the TSCs, so our algorithm can improve its effectiveness. There are also other applications based on TSCs. Lv et al. (Lv et al., 2016) propose a Temporal Deep Structured Semantic Model (T-DSSM) to represent comments as semantic vectors and recognize video highlights from these vectors in a supervised way. They are the first to analyze TSCs using a neural network. Then, Chen et al. (Chen et al., 2017b) propose neural network based collaborative filtering to recommend personalized keyframes from TSCs. However, the models of both (Lv et al., 2016) and (Chen et al., 2017b) rely on a large number of human-labeled video segments or predefined emotional tags for training, which limits their applicability to more general scenarios. In this paper, we design a novel graph-based algorithm tailored to the features of TSCs that extracts keywords efficiently, accurately, and automatically in an unsupervised way.

2.2. Tag/keyword extraction

Keyword extraction is a classical problem in the field of information retrieval. At present, there are three main categories of unsupervised keyword extraction methods. The first is based on word frequency statistics, where TF-IDF is the most commonly used and best-known method. However, methods of this kind only consider the frequency of words and ignore the semantics, so they may generate keywords that are not related to the video content. The second depends on the co-occurrence of words, such as TextRank (Mihalcea and Tarau, 2004), which is a graph-based ranking model. Like the first kind, these methods do not consider semantics either, so they cannot handle noise well. The last is based on topic models, which bring the document-topic and topic-word distributions together by simulating the document generation process. Blei et al. (Blei et al., 2003) propose the Latent Dirichlet Allocation (LDA) model, the most representative model. To better deal with short text, Yan et al. (Yan et al., 2013) propose the Bi-term Topic Model (BTM), which directly models the generation of word co-occurrence patterns (i.e., bi-terms) in the whole corpus. Yin and Wang (Yin and Wang, 2014, 2016) propose a Gibbs Sampling algorithm for the Dirichlet multinomial mixture model for short text clustering and keyword extraction. Although the topic model-based approaches consider semantics, their basic hypothesis is that the generation of each word is independent and identically distributed. However, some TSCs are generated by herding effects, which violates this assumption. Compared with the methods above, our algorithms are designed to identify noise by analyzing the semantic relationships between TSCs.

2.3. Semantic similarity

Semantic similarity calculation is an essential issue in natural language processing and is widely used in text classification (Wang et al., 2016b), fuzzy retrieval (Alhabashneh et al., 2017), and so on. Generally, there are two main kinds of approaches to measuring the similarity of documents. One is based on the similarity of the words in sentences. Representatives of this approach are proposed by (Kenter and De Rijke, 2015) on unsupervised learning and (Socher et al., 2011; Kusner et al., 2015) on supervised learning. Considering that time-sync comments contain a mass of newly coined internet slang, it is difficult to obtain accurate results in this way. The other is based on the sentence vector. Topic models such as LDA and embedding models such as word2vec (Mikolov et al., 2013; Levy et al., 2015) are representatives of this kind of method. Since the embedding model offers a much denser feature representation, embedding-based similarity computation is better suited to TSCs than the topic model-based methods. Kenter and De Rijke (Kenter and De Rijke, 2015) propose a supervised learning method based on external sources of semantic knowledge with word embedding, which considers the weight of the semantic feature. In this paper, we only consider the topics discussed by TSCs, and the word order does not change the topics discussed in the TSCs. Therefore, the word order is not important, and the sentence2vec (Iacobacci et al., 2015; Levy and Goldberg, 2014b) and deep learning (He et al., 2015; Mueller and Thyagarajan, 2016) based methods are not used in this paper.

2.4. Graph clustering algorithm

Graph clustering algorithms have attracted much research interest in the past. Two main theories inspired our work, i.e., community detection theory and hierarchical agglomerative clustering. Community detection theory was first proposed by (Newman and Girvan, 2004) to make natural divisions of network nodes into densely connected subgroups, which brought great inspiration to the graph clustering field. Recently, Ramezani et al. (Ramezani et al., 2018) exploit diffusion information and utilize conditional random fields to discover community structures. Li et al. (Li et al., 2018a) propose a novel local expansion via minimum one norm approach for finding overlapping communities, and provide a theoretical analysis of the local spectral properties. Chakraborty et al. (Chakraborty et al., 2016) find that the belongingness of nodes in a community is not uniform and design a new vertex-based metric to quantify the degree of belongingness within a community. To reduce the time complexity, Bae et al. (Bae et al., 2017) propose an algorithm to optimize the map equation, which makes each iteration take less time and the algorithm converge faster. These community detection based graph clustering algorithms inspired the design of our dialogue-based clustering algorithm. Besides, hierarchical agglomerative clustering is also a method of graph clustering (Pandove et al., 2018; Murtagh and Legendre, 2014; Murtagh and Contreras, 2012). Recently, Pang et al. (Pang et al., 2015) propose a topic-restricted similarity diffusion process to efficiently identify real topics from a large number of candidates. Although their method clusters well, it has a high time complexity and is not suitable for large-scale data. Compared with the aforementioned hierarchical agglomerative clustering algorithms, our novel topic center-based clustering algorithm has lower time complexity while maintaining accuracy.

3. Algorithms

In this section, we first introduce the construction of Semantic Association Graph (SAG) for TSCs with their semantic similarity in Section 3.1. Then, we propose two graph cluster algorithms, i.e., dialogue-based algorithm and topic center-based algorithm, to cluster the TSCs into subgraphs of different topics in Section 3.2. Moreover, we propose an out-in degree iterative algorithm to get the weight of each TSC and extract keywords as video tags automatically by combining Semantic Weight (SW) and inverse document frequency (IDF) in Section 3.3. Finally, we give the complexity analysis in Section 3.4.

The Notation list is shown in Table 1.

3.1. Preliminaries and Graph Construction

Directed graph
Set of nodes
Set of Edges
Number of nodes in
Number of edges in
node in
edge in
Timestamp of node
Topic set of node
Number of nodes in set
The first node of edge
The second node of edge
The weight of edge
Attenuation coefficient
Threshold of dialogue based intra-cluster density
Threshold of topic based intra-cluster density
The embedding vector of TSC
Topic center vector of the set
Start time of the set
Center time of the set
Universal set of topic sets
A set that matches
Max Affinity value of set
A priority queue with set pairs
A queue with sets to be updated
Popularity of comment
Total number of topics in SAG
Influence matrix
Influence value of comment i after k iterations
Weight of comment
Table 1. Notations

In this section, we construct the semantic association graph and define the attributes in the graph.

Since TSCs appear in chronological order, they can only affect the upcoming TSCs rather than prior TSCs. We use a directed graph to describe the relationships between TSCs and construct the semantic association graph (SAG).

In SAG, the vertices (nodes) are TSCs and the edges reflect their semantic association in a topic. Let $G$ denote the directed graph, represented by $G = (V, E)$, where $V$ and $E$ are the sets of nodes and edges. Specifically, $V = \{v_1, v_2, \ldots, v_N\}$ and $E = \{e_1, e_2, \ldots, e_M\}$, where $N$ is the number of nodes in $V$ and $M$ is the number of edges in $E$. Each TSC $v_i$ has a timestamp $t_i$, denoting its post time in the video, with TSCs indexed in chronological order. Since TSCs are short texts (Wu et al., 2014), in our algorithm we assume that each TSC has exactly one topic. For vertex $v_i$, $S_i$ denotes the set of vertices that have the same topic as $v_i$, and $|S_i|$ denotes the number of vertices in $S_i$. We use fields (domains) to describe the attributes of an edge: its first node, its second node, and its weight. Besides, $e_{i,j}$ also denotes the edge with vertices $v_i$ and $v_j$, where $i < j$, and $w_{i,j}$ denotes its weight. Next we provide the definition of edge weights.

As we mentioned in Section 2, an embedding-based method, word2vec (see Section 4.1 for more details), is selected to calculate the semantic similarity between each pair of TSCs. Since we only care about the topic of the TSC, the word order is not important. In this paper, we take the mean of the word vectors in a TSC as its sentence vector. All sentence vectors have the same dimension, which is a parameter of the embedding model. Therefore, the semantic similarity between TSCs $v_i$ and $v_j$, with sentence vectors $\mathbf{x}_i$ and $\mathbf{x}_j$, is calculated by the cosine of the angle between the vectors:

$$\mathrm{sim}(v_i, v_j) = \frac{\mathbf{x}_i \cdot \mathbf{x}_j}{\|\mathbf{x}_i\| \, \|\mathbf{x}_j\|} \qquad (1)$$

Besides, the greater the timestamp interval between two TSCs, the less likely they are to be in the same topic. So we use the exponential function to express the decay of TSC associations:

$$\mathrm{decay}(v_i, v_j) = \exp\left(-\gamma \, (t_j - t_i)\right) \qquad (2)$$

where $t_i \le t_j$ are the timestamps of the two TSCs and $\gamma$ is a hyperparameter that controls the decay speed.

Combining the semantic similarity and the time decay, the weight of the edge $e_{i,j}$ that links vertices $v_i$ and $v_j$ is defined by

$$w_{i,j} = \mathrm{sim}(v_i, v_j) \cdot \mathrm{decay}(v_i, v_j) \qquad (3)$$

Empirically, two TSCs joined by a negative-weight edge are weakly related semantically (because their angle in the semantic embedding space is greater than $\pi/2$), and negative edge weights are inconvenient to handle in graph algorithms. Therefore, when $w_{i,j} < 0$, we set $w_{i,j} = 0$ and delete this edge.
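The edge-weight construction described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`mean_vector`, `cosine`, `build_sag`) and the parameter name `gamma` for the decay hyperparameter are hypothetical, plain Python lists stand in for real word2vec vectors, the TSCs are assumed to be sorted by timestamp, and zero-weight edges are dropped together with negative ones.

```python
import math


def mean_vector(word_vectors):
    """Average equal-length word vectors into one sentence vector."""
    dim = len(word_vectors[0])
    count = len(word_vectors)
    return [sum(vec[k] for vec in word_vectors) / count for k in range(dim)]


def cosine(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_sag(vectors, timestamps, gamma):
    """Return directed edges (i, j, weight) from earlier TSC i to later TSC j.

    Assumes the TSCs are already sorted by timestamp. The weight is the
    cosine similarity multiplied by an exponential time decay; edges whose
    weight is not positive are discarded.
    """
    edges = []
    n = len(vectors)
    for i in range(n):
        for j in range(i + 1, n):
            decay = math.exp(-gamma * (timestamps[j] - timestamps[i]))
            weight = cosine(vectors[i], vectors[j]) * decay
            if weight > 0:
                edges.append((i, j, weight))
    return edges
```

The double loop visits every pair of TSCs, so the sketch is quadratic in the number of comments; a practical version would probably restrict `j` to a time window in which the decay is still non-negligible.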

Figure 1. An example of SAG Construction

For a more intuitive description, an example of SAG construction is shown in Fig. 1 (a), which is based on a UEFA Champions League video. We select 10 TSCs as nodes and construct the SAG. User A posted TSC 1⃝, “Great player Messi!”, when he saw the goal. Then user B responded with “Messi deserves MVP!” as TSC 3⃝. User C posted TSC 2⃝, “What is the BGM ?”, to ask about the background music, which deviates from the video content. So TSC 2⃝ has little semantic association with the other TSCs, while TSC 1⃝ and TSC 3⃝ share a semantic edge.

3.2. Topic Partitioning

In this section, we partition the TSCs into topics according to the semantic relationships in SAG. In our algorithm, TSCs with similar semantics and similar timestamps should belong to the same topic. However, the density of TSCs (number of TSCs per unit time) affects how users communicate. Therefore, we propose a dialogue-based clustering algorithm in Section 3.2.1 for videos with sparse TSCs and a topic center-based clustering algorithm in Section 3.2.2 for videos with dense TSCs.

3.2.1. Dialogue-based Algorithm

First, we provide a dialogue-based algorithm. When the density of TSCs is low, the user can clearly distinguish the content of each nearby TSC, and is therefore more likely to reply to a specific TSC when posting a new one. Therefore, we cluster the TSCs according to the semantic relationship between each pair. The main idea is that the mean weight of intra-topic edges is large while the mean weight of edges that link different topics is small, which is consistent with community detection theory (Lancichinetti et al., 2008).

Specifically, in the beginning, each TSC belongs to a unique topic. We use a unique set that only contains itself to achieve the objective. That is, for , . Then edges in set are sorted by descending order of weight. The new edge set is obtained, where . We process each edge from to . For edge , and represent the set and . The set and should be merged if and only if TSCs in two sets discuss the similar topics. Therefore, we merge and if




where is the threshold of intra-cluster density. That is, we merge S1 and S2 only if the average edge weight of their union is greater than the threshold. In this paper, the disjoint-set (union-find) algorithm (Tarjan, 1975) is used to merge the sets efficiently. When all the edges have been processed, TSCs with high semantic similarity are merged into a topic, and the intra-cluster density of each subgraph is higher than the threshold.

An example of dialogue-based topic partitioning is shown in Fig. 1 (b1). The SAG constructed in Fig. 1 (a) is finally partitioned into two topics marked as red and blue, and several noises marked as purple in Fig. 1 (c) . The TSC “Great player Messi!” and “Messi deserves MVP!” belong to the red topic, while the TSC “What is the BGM ?” is identified as a noise.

The full algorithm is shown in Algorithm 1.

Input:  the edge set
Output:  the topic set of each time-sync comment
1:  sort by descending order of , obtain
2:  for  = 1 to  do
3:     set as , as
4:     if  then
5:        merge and
6:     end if
7:  end for
ALGORITHM 1 Dialogue-based algorithm
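The merging procedure of the dialogue-based algorithm can be sketched in Python with a small union-find structure. This is only a sketch under assumptions: the name `dialogue_clusters` and the threshold parameter `rho` are hypothetical, and the intra-cluster density of a candidate union is approximated here by the average weight of the edges processed so far that fall inside it, which is one possible reading of the merge condition rather than the paper's exact formula.

```python
def dialogue_clusters(n, edges, rho):
    """Cluster n TSCs given edges (x, y, weight) and a density threshold rho.

    Returns a list where entry i is the representative of TSC i's topic.
    Assumption: the density of a candidate union is the mean weight of the
    already-processed edges that lie inside it, plus the current edge.
    """
    parent = list(range(n))
    weight_sum = [0.0] * n   # total weight of processed edges inside each set
    edge_count = [0] * n     # number of processed edges inside each set

    def find(x):
        # Find the set representative, halving paths along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Process edges from the heaviest to the lightest.
    for x, y, w in sorted(edges, key=lambda e: e[2], reverse=True):
        rx, ry = find(x), find(y)
        if rx == ry:
            # The edge is internal to an existing topic.
            weight_sum[rx] += w
            edge_count[rx] += 1
            continue
        total = weight_sum[rx] + weight_sum[ry] + w
        count = edge_count[rx] + edge_count[ry] + 1
        if total / count > rho:
            parent[ry] = rx
            weight_sum[rx] = total
            edge_count[rx] = count
    return [find(i) for i in range(n)]
```

For example, with edges `[(0, 1, 0.9), (1, 2, 0.8), (0, 2, 0.7), (2, 3, 0.1)]` and `rho = 0.7`, the first three TSCs end up in one topic while the fourth stays alone, because attaching it would pull the average edge weight below the threshold.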

3.2.2. Topic Center-based Algorithm

In the dialogue-based algorithm, we assume that TSCs are in the form of dialogues. However, when the density of TSCs is high, the user cannot clearly distinguish the content of each TSC, but only roughly distinguish the topic of these TSCs. Therefore, the user is more likely to reply to the entire topic instead of a specific TSC. The results of dialogue-based model will be disturbed by these situations. Therefore, we provide a Topic Center-based algorithm, which is inspired by Hierarchical Agglomerative Clustering (Pang et al., 2015; Murtagh and Legendre, 2014; Murtagh et al., 2008; Murtagh and Contreras, 2012).

Before proposing this algorithm, we first give the definition of the topic center. As we defined in Section 3.1, the set is used to describe the topic and each TSC can be expressed as an embedding vector by word2vec. The topic center is the average vector of all TSCs within the topic. We use to express the topic center vector, and and to express the start time and center time of topic set , respectively. Initially, each TSC belongs to a unique topic, so , , where is the sentence embedding vector of TSC . All these sets belong to , which is the universal set of topic sets.

Generally, this algorithm can be divided into two parts. (1) Find the nearest two topic centers. (2) Merge the two topic centers. It is actually a Nearest Neighbor Search (NNS) problem (Bentley, 1975; Alstrup et al., 2000), where the k-d tree (Bentley, 1990; Friedman et al., 1977; Bentley, 1975) is one of the most effective methods. However, analyses of binary search trees have found that the worst-case time for range search in a k-dimensional k-d tree containing N nodes is given by the following equation (Lee and Wong, 1977): $t_{worst} = O(k \cdot N^{1 - 1/k})$. Besides, the k-d tree has a larger constant factor.

In this paper, we propose a greedy algorithm to solve this problem efficiently. In the beginning, for each , we find that




The decay function is still added to avoid that the topics with large time interval are merged.

We use to express the set that matches with maximum value . And the pair is added to a queue , which is a priority queue where the pair with the maximum is the front.

Each time, we take out the front pair , merging and , and pop it, until . When merging sets, the following updates will be done: First, since and are merged, all pairs that contain or , for instance , should be deleted from . Then, these sets that matched or previously like are added into the update list . Next, the sets and are removed from , and a new set is added into and , where




That is, the center time and the center vector of are the weighted average of and , and the start time of is the minimum of and . Finally, for each set , we find a new according to Eq.(6) in to match it.
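The merge update described above can be illustrated with a short Python sketch. The `Topic` class and `merge_topics` function are hypothetical names introduced for illustration, and weighting by the number of TSCs in each set is an assumption; it follows from defining the topic center as the average vector of all TSCs in the topic, but the text itself only says "weighted average".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Topic:
    size: int            # number of TSCs in the topic set
    center: List[float]  # topic center vector (mean of member vectors)
    start_time: float    # earliest timestamp in the topic
    center_time: float   # mean timestamp of the topic


def merge_topics(a: Topic, b: Topic) -> Topic:
    """Merge two topics into a new one.

    Assumption: the weights of the weighted averages are the set sizes,
    so the merged center equals the mean over all member TSCs.
    """
    size = a.size + b.size
    center = [
        (a.size * x + b.size * y) / size
        for x, y in zip(a.center, b.center)
    ]
    center_time = (a.size * a.center_time + b.size * b.center_time) / size
    start_time = min(a.start_time, b.start_time)
    return Topic(size, center, start_time, center_time)
```

Storing the size with each topic means the merged center can be computed from the two old centers alone, without revisiting the individual TSC vectors.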

What is more, there exists a greedy optimization in the algorithm. Before giving the greedy optimization, we propose a lemma at first:

Lemma 3.1 ().

For the set , let . Then the pair will never be solved in if .


Since , we have , and . There exist two cases:

Case : Then, in the priority queue , the pair will be solved before because . Therefore, the pair will be removed from when solving .

Case :

Then we have (otherwise ). So the pair will be solved before in the priority queue . When solving , will be removed, and set will find a new in . If , then is re-added into (at that time, ). Otherwise, . In that case, the pair will be solved before , and will be removed when solving . Therefore, will always be removed and never be solved in any case. ∎

According to Lemma 3.1, we propose the greedy optimization: for the set , if and , then the pair is rejected and not added into .

The process of Topic Center-based Algorithm is described in Fig. 1 (b2) and the clustering results are the same with the dialogue-based algorithm in this example showing in Fig. 1 (c). The full algorithm is shown in Algorithm 2.

Input:  the vectors and timestamp of time-sync comments
Output:  the topic set of each time-sync comment
1:  for  = 1 to  do
5:  end for
6:  for  = 1 to  do
7:     find using Eq.(6)
8:     calculate using Eq.(7)
9:     if () and (then
10:        push the pair into
11:     end if
12:  end for
13:  while  not empty do
15:     Remove all the pairs and in
16:     if  and  then
17:        add into
18:     end if
19:     Remove all the pairs and in
20:     if  and  then
21:        add into
22:     end if
23:     calculate , , and and using Eq.(8), Eq.(9), Eq.(10)
24:     remove and from
25:     add into and
26:     while  not empty do
28:        find using Eq.(6)
29:        calculate using Eq.(7)
30:        if () and (then
31:           push the pair into
32:        end if
33:     end while
34:  end while
ALGORITHM 2 Topic center-based algorithm

3.3. Weight Distribution and Tag Extraction

In Section 3.2, we partitioned the TSCs into topics and obtained the topic of each TSC. In this section, we assign a weight to each TSC according to the influence of its topic and its relationships in the semantic graph.

The weight of a TSC is affected by its topic popularity, so we define the popularity of the TSC as:


where is the topic in SAG, and is the total number of topics in SAG. Obviously those topics with fewer TSCs are more likely to be noises and have less weight. According to Eq.(11), noises will have small values of popularity.

Within the topic, a TSC which affects more TSCs and is affected by fewer TSCs should have a higher weight. In order to quantitatively measure the weight of the TSC in a topic, we design a graph iterative algorithm below.

An influence matrix is established at first to express semantic relations within each topic. For the elements in the matrix,


we use to denote the influence value of TSC after iterations. For each TSC , initially. Then in the turn of iteration, there are two steps as follows:




In the iteration, we increase the influence value of TSC based on the values of TSCs that affected by TSC . We know that a TSC only affects the TSCs lagging behind it, so the TSCs are processed from down to . That is, before we process TSC , all the TSCs that have been processed. In the iteration, we reduce the influence value of TSC based on the values of the TSCs that affect TSC . Contrary to the iteration, we process the TSCs from to in the iteration.

The iteration process of SAG in Fig. 1 (c) is shown in Fig. 2. Fig. 2 (a) shows the calculation of the last two nodes (marked as red) that need to be processed in the iteration (ignore the noise node ), where the orange edges express their out-degree edges. While Fig. 2 (b) shows the calculation of the last two nodes (marked as red) that need to be processed in the iteration (ignore the noise node ), where the green edges express their in-degree edges.

Figure 2. The Iteration process of SAG in Fig. 1 (c).

The converged influence values of the 10 TSCs in Fig. 1 (c) is shown in Fig. 3. After 20 iterations, all TSCs converge to the interval .

Figure 3. The influence value of TSCs in Fig. 1

To combine the popularity and the influence value, the weight of TSC is obtained by


where is the number of turns of iterations and depends on the number of nonzero elements in the matrix . Therefore, the weight of each word is formulated as below:


where denotes the TSC that word appears and is the inverse document frequency as defined in TF-IDF method. We extract words with the highest SW-IDF value as video tags. After the above steps, those words which appear in the TSCs that are popular and have high impact will be extracted as tags. The complete algorithm is shown in Algorithm 3.
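The final word-scoring step can be sketched in Python as follows. This is a hedged illustration, not the paper's exact formula: the function name `sw_idf_tags` is hypothetical, the per-comment weights and the IDF table are taken as given inputs, a word's semantic weight is assumed to be the sum of the weights of the TSCs in which it appears, and each TSC is counted once per word regardless of repetitions.

```python
from collections import defaultdict


def sw_idf_tags(comments, weights, idf, top_k=3):
    """Rank words by (summed comment weight) x (inverse document frequency).

    comments: list of token lists, one per TSC
    weights:  one precomputed weight per TSC
    idf:      dict mapping word -> IDF value (words missing from it score 0)
    """
    semantic_weight = defaultdict(float)
    for tokens, weight in zip(comments, weights):
        for word in set(tokens):  # count each TSC once per word
            semantic_weight[word] += weight

    scores = {
        word: sw * idf.get(word, 0.0)
        for word, sw in semantic_weight.items()
    }
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_k]]
```

With the comments from the Fig. 1 example tokenized as `[["messi", "great"], ["messi", "mvp"], ["bgm"]]`, high weights on the first two TSCs and a low weight on the third push "messi" to the top even when "bgm" has a larger IDF value.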

Input:  Semantic Association Graph
Output:  Tags of video
1:  Assign time-sync comments to a set by Algorithm 1 or Algorithm 2
2:  Calculate the influence matrix using Eq.(12)
3:  for  = 1 to  do
5:     Calculate the popularity of TSC using Eq.(11)
6:  end for
7:  for  = 1 to  do
8:     for  = N downto  do
9:        Calculate using Eq.(13)
10:     end for
11:     for  = 1 to  do
12:        Calculate using Eq.(14)
13:     end for
14:  end for
15:  Calculate the SW-IDF of each word using Eq.(16)
16:  Select words with max SW-IDF as video tags
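
The final two steps of Algorithm 3 can be sketched as follows. This is a minimal illustration only: the equation for the word weight is not reproduced here, so the sketch assumes that the semantic weight of a word is the sum of the weights of the TSCs in which it appears (counted once per TSC), multiplied by its IDF. The function names, the precomputed `tsc_weights`, and the `idf` dictionary are all hypothetical inputs.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def sw_idf(
    tscs: Sequence[Sequence[str]],
    tsc_weights: Sequence[float],
    idf: Dict[str, float],
) -> Dict[str, float]:
    """Score each word as (sum of weights of TSCs containing it) * IDF.

    Assumption: this product form is inferred from the prose description,
    not copied from the paper's equation.
    """
    semantic_weight: Dict[str, float] = defaultdict(float)
    for words, weight in zip(tscs, tsc_weights):
        # Count each word once per TSC (assumption).
        for word in set(words):
            semantic_weight[word] += weight
    # Words missing from the IDF table get a score of 0.
    return {w: sw * idf.get(w, 0.0) for w, sw in semantic_weight.items()}


def top_tags(scores: Dict[str, float], k: int = 10) -> List[str]:
    """Return the k words with the highest score (ties broken alphabetically)."""
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


if __name__ == "__main__":
    tscs = [["goal", "nice"], ["goal"], ["lol"]]
    weights = [0.5, 0.3, 0.1]
    idf = {"goal": 2.0, "nice": 1.0, "lol": 0.5}
    print(top_tags(sw_idf(tscs, weights, idf), k=2))
```

In this toy input, a word that appears in several highly weighted TSCs outranks a word that appears once, which is the behaviour the prose describes.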

3.4. Complexity Analysis

In this section, we analyze the time complexity and the space complexity of each algorithm.

In Algorithm 1, the time complexity of the edge sorting algorithm in line 1 is by using quicksort, and the space complexity is . The amortized time complexity of merging sets by disjoint-set is (Tarjan, 1979) and the space complexity is , where is the inverse Ackermann function that . So the total time complexity of Algorithm 1 is , and the space complexity is .
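
The two ingredients named above, sorting the edges and merging sets with a disjoint-set structure, can be sketched as follows. This is not Algorithm 1 itself: the merge criterion used here (a plain weight threshold called `min_weight`) and all names are assumptions made for the illustration. Python's built-in sort is used in place of quicksort; both sort the edges in O(E log E) comparisons on average.

```python
from typing import List, Tuple


class DisjointSet:
    """Union-find with path halving and union by rank."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x: int) -> int:
        # Path halving: point each visited node to its grandparent.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> bool:
        """Merge the sets of a and b; return False if already merged."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return True


def cluster_by_sorted_edges(
    n: int, edges: List[Tuple[int, int, float]], min_weight: float
) -> List[int]:
    """Process edges from heaviest to lightest and merge their endpoints.

    min_weight is a hypothetical stopping threshold, not the paper's rule.
    """
    ds = DisjointSet(n)
    for u, v, w in sorted(edges, key=lambda e: e[2], reverse=True):
        if w < min_weight:
            break  # all remaining edges are lighter
        ds.union(u, v)
    return [ds.find(i) for i in range(n)]


if __name__ == "__main__":
    edges = [(0, 1, 0.9), (1, 2, 0.8), (3, 4, 0.2)]
    print(cluster_by_sorted_edges(5, edges, min_weight=0.5))
```

Each `find` and `union` call runs in amortized near-constant time, which is where the inverse Ackermann factor in the analysis comes from.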

In Algorithm 2, the time complexity of the initialization from line 1 to line 12 is , and the space complexity is . In , the number of merge operations is limited to (because there are at most sets), and the amortized number of removal operations is limited to 1 per merge operation. For each merge operation, the lookup and removal operations can be handled in by a naive algorithm, or by a balanced binary tree (Bentley, 1975). The worst-case time complexity of Algorithm 2 as a whole is . The total space complexity is just .

In Algorithm 3, the time complexity is and the space complexity is apparently. In our SAG, because two TSCs with a negative semantic similarity do not have an edge. Therefore, in real TSC data, and the dialogue-based algorithm has a lower time complexity than the topic center-based algorithm.

4. Experimental Study

In this section, we verify the effectiveness of our proposed method by comparing it with five unsupervised keyword extraction methods. The datasets are crawled from AcFun (www.acfun.cn) and Bilibili. We give the parameters required by our algorithms in Section 4.1 and then analyze the performance of our algorithms on video tag extraction in Section 4.2.

4.1. Experimental Setup and Datasets

We crawl TSCs from two well-known Chinese time-sync comment video websites, AcFun and Bilibili. The raw TSC texts are full of noise, so we manually remove non-textual TSCs (such as emojis) and establish a set of mapping rules for network slang, which is replaced by its real meaning in the text. For instance, 233… (a 2 followed by several 3s) means laughter, and 666… (several 6s) means playing games very well. After that, we segment the words and remove anomalous symbols (symbolic expressions such as the smiley face (^_^)) from the TSCs with the open-source Chinese-language processing toolbox Jieba (https://github.com/fxsjy/jieba). To analyze the algorithms from different aspects, we collect two datasets. In the first dataset (called D1), a total of 287 videos with 227,780 comments are collected randomly from the music, sports, and movie categories. To set the hyper-parameters in this paper, we use 167 videos with 126,146 TSCs as the validation set and 120 videos with 101,634 TSCs as the test set. In the second dataset (called D2), a total of 180 videos with 569,996 comments are collected from Japanese anime. We use D1 to compare our algorithms with the baselines, and use D2 to analyze in detail the effects of our two proposed algorithms at different densities.
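
The slang mapping step can be illustrated with a small sketch. The actual rule set is not published in this section, so the two rules below (built from the 233… and 666… examples) and the replacement strings are hypothetical; word segmentation with Jieba is not shown.

```python
import re

# Hypothetical mapping rules derived from the two examples in the text.
SLANG_RULES = [
    (re.compile(r"23{2,}"), "laughter"),    # 233, 2333, ...
    (re.compile(r"6{3,}"), "well played"),  # 666, 6666, ...
]

# The smiley face (^_^) mentioned in the text; other symbols are not covered.
EMOTICON = re.compile(r"\(\^_\^\)")


def normalize_tsc(text: str) -> str:
    """Drop the emoticon and replace slang tokens by their meaning."""
    text = EMOTICON.sub(" ", text)
    for pattern, meaning in SLANG_RULES:
        text = pattern.sub(f" {meaning} ", text)
    # Collapse the extra whitespace introduced by the substitutions.
    return " ".join(text.split())


if __name__ == "__main__":
    print(normalize_tsc("nice goal 2333"))
    print(normalize_tsc("666666 (^_^)"))
```

A production rule set would need to guard against ordinary numbers that happen to contain these digit runs; the sketch ignores that case.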

We define the density of TSCs as the average number of TSCs per minute. In D2, we divide the density into 5 levels: 0-30, 30-60, 60-90, 90-120, and more than 120 (the intervals are left-closed and right-open). More details, including the video length, the total number of TSCs, the density, and the number of videos, are shown in Table 2 for D1 and Table 3 for D2.
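
As a small worked example of the density definition and the five levels, the sketch below computes a density and assigns its level. The function names are illustrative, and the level labels follow the column headers of Table 3; because the intervals are left-closed and right-open, a density of exactly 120 falls in the last level.

```python
def tsc_density(num_tscs: int, length_minutes: float) -> float:
    """Average number of TSCs per minute."""
    return num_tscs / length_minutes


def density_level(density: float) -> str:
    """Map a density to one of the five left-closed, right-open levels."""
    upper_bounds = [30, 60, 90, 120]
    labels = ["0-30", "30-60", "60-90", "90-120", ">120"]
    for upper, label in zip(upper_bounds, labels):
        if density < upper:
            return label
    return labels[-1]


if __name__ == "__main__":
    # Validation set of D1: 126,146 TSCs over 1,573.29 minutes.
    d = tsc_density(126146, 1573.29)
    print(round(d, 2), density_level(d))
```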

| | Validation set | Test set |
|---|---|---|
| Total length (minute) | 1,573.29 | 1,441.38 |
| Total TSCs number | 126,146 | 101,634 |
| Density | 80.18 | 70.51 |
| Total video number | 167 | 120 |

Table 2. Data Description Table for D1

| | 0-30 | 30-60 | 60-90 | 90-120 | >120 |
|---|---|---|---|---|---|
| Total length (minute) | 644.37 | 433.01 | 855.40 | 883.61 | 1,221.55 |
| Total TSCs number | 11,489 | 19,368 | 60,152 | 99,671 | 379,316 |
| Density | 17.83 | 43.72 | 70.32 | 112.80 | 310.52 |
| Total video number | 29 | 21 | 37 | 42 | 51 |

Table 3. Data Description Table for D2

We select two undergraduate students and one Ph.D. student as volunteers. For each video, each volunteer chooses 15 words from the TSCs and votes for them as video tags. The words with two or more votes are selected as the standard tags, so the number of standard tags differs from video to video. The order of these tags is determined first by the number of votes: tags with more votes rank higher. When the number of votes is the same, the order is determined by the Ph.D. student. The code of our algorithm is available at https://github.com/sdq11111/SAG.
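
The voting rule can be sketched as below. How the Ph.D. student breaks ties is not specified, so this sketch assumes, purely for illustration, that ties follow the order in which the tie-breaking volunteer listed the words; the function and argument names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def standard_tags(
    votes_by_volunteer: Dict[str, List[str]],
    tie_breaker: str,
    min_votes: int = 2,
) -> List[str]:
    """Keep words with at least min_votes votes, ordered by vote count."""
    counts: Counter = Counter()
    for words in votes_by_volunteer.values():
        counts.update(set(words))  # one vote per volunteer per word
    kept = [w for w, c in counts.items() if c >= min_votes]
    # Assumed tie-break: position in the tie-breaking volunteer's own list;
    # words that volunteer did not choose go last, alphabetically.
    preference = {w: i for i, w in enumerate(votes_by_volunteer[tie_breaker])}
    return sorted(
        kept, key=lambda w: (-counts[w], preference.get(w, len(preference)), w)
    )


if __name__ == "__main__":
    votes = {
        "undergrad_1": ["a", "b", "c"],
        "undergrad_2": ["b", "a", "d"],
        "phd": ["b", "c", "a"],
    }
    print(standard_tags(votes, tie_breaker="phd"))
```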

In Section 3.1, we use the word2vec method to obtain the embedding vectors of TSCs. In this paper, we choose the skip-gram model of word2vec to pre-train the word embedding vectors, with hierarchical softmax as the training algorithm, because both the skip-gram model and hierarchical softmax are better for infrequent words (Mikolov et al., 2013), which suits the characteristics of TSCs. We use gensim (https://radimrehurek.com/gensim/models/word2vec.html) to train the model, and the training data, crawled from Bilibili, consists of TSCs containing 6,743,912 words. Since we have a sufficient training corpus, the dimension of word2vec is set to 300, as in (Li et al., 2018b).

To further justify using word2vec to calculate the similarities of the TSCs, we also calculate the semantic similarities with several traditional unsupervised learning methods and other word embedding methods, i.e.,

  1. LDA, a well-known topic model based method, Latent Dirichlet Allocation (Blei et al., 2003).

  2. PPMI, a co-occurrence probability based distributional model, Positive Pointwise Mutual Information (Levy and Goldberg, 2014a).

  3. HowNet, a HowNet hierarchical sememe tree based approach (Wu et al., 2012), where HowNet (Dong and Dong, 2003) is a common-sense knowledge base unveiling inter-conceptual relations and inter-attribute relations of concepts.

  4. GloVe, a well-known word embedding method, Global Vectors for Word Representation (Pennington et al., 2014).

We test the top 10 tag extraction results on the validation set, using each of the above methods to calculate the similarity and build the graph (the hyper-parameters used in the experiment are discussed later). In this paper, we use F1-score and MAP (Mean Average Precision, the mean of the average precision scores over all queries (Zhu, 2004)) to measure the performance of tag extraction. The results are shown in Table 4.

| Method | F1 (dialogue) | MAP (dialogue) | F1 (topic center) | MAP (topic center) |
|---|---|---|---|---|
| LDA | 0.3625 | 0.3372 | 0.3641 | 0.3224 |
| PPMI | 0.3919 | 0.3705 | 0.4101 | 0.3806 |
| HowNet | 0.3537 | 0.3423 | 0.3468 | 0.3194 |
| GloVe | 0.4045 | 0.4012 | 0.4202 | 0.4079 |
| Word2Vec | 0.4183 | 0.4041 | 0.4342 | 0.4160 |

Table 4. The effect of semantic similarity calculation method on the results

The experimental results show that, on the validation set, HowNet performs the worst on three of the four metrics because of its limited word list. LDA also performs poorly because it is not good at handling short texts. Among the word embedding based methods PPMI, GloVe, and word2vec, word2vec performs best, which indicates that the fully trained word2vec model is more robust and more suitable for calculating the similarity of TSCs.

What is more, in our algorithm, three parameters need to be determined, i.e., the threshold of intra-cluster density and , and the attenuation coefficient . The and control the accuracy of topic clustering. The is the attenuation coefficient of the interval between time-sync comments, which controls the value of the edge weights in the graph.

We first fix the attenuation coefficient and adjust the two thresholds so that the F1-score and MAP on the validation set are optimal. Then, with the optimal thresholds, we re-adjust the attenuation coefficient so that the F1-score and MAP on the validation set are optimal. On the Bilibili video site, the default time for each TSC to stay on the screen is 10 seconds. Therefore, we assume that the semantic half-life of each TSC is 5 seconds, and calculate the initial attenuation coefficient according to Eq. (2).
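
To make the half-life reasoning concrete, the following sketch derives an initial attenuation coefficient from a 5-second half-life. Eq. (2) is not reproduced in this section, so the sketch assumes an exponential time decay of the form exp(-gamma * dt); if Eq. (2) has a different form, the resulting value differs. The function names are illustrative.

```python
import math


def attenuation_from_half_life(half_life_seconds: float) -> float:
    """Solve exp(-gamma * half_life) = 0.5 for gamma.

    Assumption: the decay in Eq. (2) is exponential in the time gap.
    """
    return math.log(2) / half_life_seconds


def time_decay(gamma: float, dt_seconds: float) -> float:
    """Assumed decay factor applied to an edge with time gap dt_seconds."""
    return math.exp(-gamma * dt_seconds)


if __name__ == "__main__":
    gamma = attenuation_from_half_life(5.0)
    print(round(gamma, 4))                     # about 0.1386
    print(round(time_decay(gamma, 5.0), 4))    # 0.5 after one half-life
    print(round(time_decay(gamma, 10.0), 4))   # 0.25 after two half-lives
```

Under this assumed form, a 5-second half-life gives a coefficient of roughly 0.139.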

Figure 4. The effect of threshold in F1-score
Figure 5. The effect of threshold in MAP
Figure 6. The effect of threshold in F1-score
Figure 7. The effect of threshold in MAP

To determine the threshold of the dialogue-based algorithm, we fix the attenuation coefficient at its initial value and adjust the threshold from 0 to 0.5 in steps of 0.02, observing the F1-score and MAP of the top 10 tagging results generated by the dialogue-based algorithm. The results for F1-score and MAP on the validation set are shown in Fig. 4 and Fig. 5, respectively. For both F1-score and MAP, the threshold achieves better results in the range of 0.32 to 0.38 and reaches its optimum at 0.34. Therefore, we choose this value for the following experiments.

To determine the threshold of the topic center-based algorithm, we likewise fix the attenuation coefficient and adjust the threshold from 0 to 0.5 in steps of 0.02, observing the F1-score and MAP of the top 10 tagging results generated by the topic center-based algorithm. The results for F1-score and MAP on the validation set are shown in Fig. 6 and Fig. 7, respectively. For F1-score, the threshold achieves better results in the range of 0.34 to 0.42 and reaches its optimum at 0.40. For MAP, it achieves better results in the range of 0.34 to 0.40 and reaches its optimum at 0.38. Considering both F1-score and MAP, we choose for the following experiments.

Figure 8. The effect of attenuation coefficient on F1-score
Figure 9. The effect of attenuation coefficient on MAP

With the optimal thresholds obtained above, we re-adjust the attenuation coefficient from 0 to 0.2 in steps of 0.01 and observe the F1-score and MAP of the video tags generated by our algorithms. The results for F1-score and MAP on the validation set are shown in Fig. 8 and Fig. 9, respectively. For the dialogue-based algorithm, the attenuation coefficient performs better in the range of 0.10 to 0.13, with the optimum at 0.12 for F1-score and 0.11 for MAP. For the topic center-based algorithm, it performs better in the range of 0.10 to 0.14, with the optimum at 0.13 for both F1-score and MAP. Taking both F1-score and MAP into account, we choose for the dialogue-based algorithm, and for the topic center-based algorithm in the following experiments. In fact, when , the semantic association graph is independent of time; when , all edge weights equal 0, and our model is equivalent to TF-IDF.

In addition, the number of iterations needs to be determined. We count the number of iterations the algorithms need to converge at different densities (we consider the algorithm converged when the average of ); the results are shown in Table 5.

| | 0-30 | 30-60 | 60-90 | 90-120 | >120 |
|---|---|---|---|---|---|
| Dialogue | 7.32 | 13.59 | 27.59 | 35.15 | 43.82 |
| Topic center | 6.89 | 14.92 | 23.15 | 31.42 | 45.62 |

Table 5. The number of iterations when algorithms converged at different densities

As shown in Table 5, when the density of TSCs is low, the SAG generated by the two algorithms is sparse, and therefore few iterations are needed. As the density increases, the SAG becomes dense and the number of iterations increases. To simplify, we choose in the experiment.

4.2. Results

In this section, we first use D2 to analyze the clustering effect of our two proposed algorithms at different densities. Then, we use the test set of D1 to verify the effectiveness of the greedy optimization we proposed, and compare our algorithms with the existing methods TF-IDF, TextRank (Mihalcea and Tarau, 2004), BTM (Yan et al., 2013), GSDPMM (Yin and Wang, 2014, 2016), and TPTM (Wu et al., 2014).

In the beginning, an experiment was designed to compare the clustering effect of the two algorithms. Given a set of topics , two distance scores are introduced (Yan et al., 2013).

Average Intra-Cluster Distance:


Average Inter-Cluster Distance:


Both scores are computed with a function that measures the semantic similarity between two topics, so a higher similarity gives a greater function value. Intuitively, if the Average Intra-Cluster Distance is high and the Average Inter-Cluster Distance is low, then the algorithm clusters well. Therefore, we calculate


to evaluate the quality of clustering algorithms as (Guo et al., 2011; Bordino et al., 2010).

Due to the time decay function in the semantic association graph, the value, the IntraDis and the topic number (cluster number) of the videos vary greatly with the video duration. Therefore, we do not calculate the average value over all the videos directly but define an H-hit score instead. That is, for each video, we compare the score obtained by the two clustering algorithms, and the algorithm with the larger score obtains a hit. The H-hit obtained by the dialogue-based algorithm is called D-Hit, and the H-hit obtained by the topic center-based algorithm is called T-Hit.
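
The hit counting described above amounts to a per-video comparison, sketched below. The per-video clustering scores are taken as given inputs; the handling of ties (no hit for either algorithm) is an assumption, since the text does not say what happens when the scores are equal.

```python
from typing import Dict, Tuple


def count_hits(scores: Dict[str, Tuple[float, float]]) -> Tuple[int, int]:
    """Return (D-Hit, T-Hit) from per-video clustering scores.

    scores maps a video id to (dialogue_score, topic_center_score).
    """
    d_hit = 0
    t_hit = 0
    for dialogue_score, topic_center_score in scores.values():
        if dialogue_score > topic_center_score:
            d_hit += 1
        elif topic_center_score > dialogue_score:
            t_hit += 1
        # Equal scores award no hit (assumption).
    return d_hit, t_hit


if __name__ == "__main__":
    example = {"v1": (0.8, 0.5), "v2": (0.3, 0.6), "v3": (0.4, 0.4)}
    print(count_hits(example))
```

Counting wins per video instead of averaging raw scores keeps long videos, whose scores are on a different scale, from dominating the comparison.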

Figure 10. The comparison of two clustering algorithms

The results are shown in Fig. 10. The dialogue-based algorithm performs better when the density is lower than 60. As the density increases and exceeds 60, the topic center-based algorithm performs better than the dialogue-based model. Moreover, we directly compare the top 10 tag extraction results of two clustering algorithms at different densities. The results are shown in Table 6.

| | 0-30 | 30-60 | 60-90 | 90-120 | >120 |
|---|---|---|---|---|---|
| Dialogue F1-score | 0.4357 | 0.4412 | 0.4219 | 0.4108 | 0.4383 |
| Dialogue MAP | 0.3742 | 0.4027 | 0.4615 | 0.4013 | 0.4872 |
| Topic center F1-score | 0.4139 | 0.4276 | 0.4275 | 0.4216 | 0.4433 |
| Topic center MAP | 0.3615 | 0.3988 | 0.4747 | 0.4077 | 0.5093 |

Table 6. The tag extraction results at different densities.

The tag extraction results are similar to Fig. 10. From Fig. 10 and Table 6, we can conclude that the dialogue-based algorithm is better for videos with a density lower than 60, while topic center-based algorithm has significant advantages for videos with the density higher than 60, which fits our assumptions in Section 3.2. Based on the conclusions above, in the test set of D1, we consider the videos with the density of TSCs greater than 60 as high-density videos, and others are low-density videos. Then, the test set in D1 is divided into two parts: videos with high-density TSCs and with low-density TSCs. The details are shown in Table 7.

| | High-density | Low-density |
|---|---|---|
| Total length (minute) | 124.58 | 1,316.80 |
| Total TSCs number | 41,556 | 60,078 |
| Density | 333.56 | 45.62 |
| Total video number | 89 | 31 |

Table 7. Data Description Table for the test set of D1

We use the data in Table 7 to verify the effectiveness of the greedy optimization proposed in Section 3.2.2. Specifically, we run the code of Algorithm 2 10 times, measuring the running time from line 6 to line 34, with and without the greedy optimization (in line 9). The experiment platform is a 13-inch MacBook Pro with a 2.9 GHz Intel Core i5 and 8 GB of 2133 MHz LPDDR3 memory, running a single thread. We add up the running time of all the samples (since a single sample runs only for a short time). The average time over the 10 runs is shown in Table 8.

| | High-density | Low-density |
|---|---|---|
| Topic center only | 7.671 | 10.725 |
| Topic center with greed | 6.905 | 10.060 |

Table 8. Validation of greedy optimization

The results show that the greedy optimization reduces the running time by 9.99% on high-density data and by 6.20% on low-density data, which verifies the effectiveness of our greedy algorithm.

Then, we compare our algorithm with different existing methods using the test set of D1. To evaluate the performance of the proposed video tag extraction algorithm, we compare our method with 5 unsupervised keyword extraction methods, i.e.,

  1. TF-IDF, a classical keyword extraction algorithm.

  2. TX, a graph-based text ranking model, TextRank (Mihalcea and Tarau, 2004), which is inspired by PageRank.

  3. BTM, a topic model based algorithm, Biterm Topic Model (Yan et al., 2013), which is the improvement of LDA (Blei et al., 2003) for short texts. The number of topics is 20 in this experiment.

  4. GSDPMM, a collapsed Gibbs Sampling algorithm for the Dirichlet Process Multinomial Mixture model (Yin and Wang, 2014, 2016), which has good performance when dealing with short texts. We set ( is the number of documents in the dataset), , and in this experiment.

  5. TPTM, a Temporal and Personalized Topic Model (Wu et al., 2014), which is the first work on automatic TSC tagging. All parameters are set in accordance with (Wu et al., 2014).

| Method | Prec | Recall | F1-score | MAP |
|---|---|---|---|---|
| TF-IDF | 0.2674 | 0.5735 | 0.3648 | 0.4224 |
| TX | 0.2427 | 0.5205 | 0.3310 | 0.3696 |
| BTM | 0.2337 | 0.5012 | 0.3188 | 0.3094 |
| GSDPMM | 0.2445 | 0.5094 | 0.3302 | 0.3374 |
| TPTM | 0.2539 | 0.5446 | 0.3463 | 0.3824 |
| SW-IDF (dialogue) | 0.3079 | 0.6602 | 0.4210 | 0.4932 |
| SW-IDF (topic center) | 0.3258 | 0.6988 | 0.4444 | 0.5122 |

Table 9. Comparison of different methods on video tag extraction of the top 10 candidate tags with high-density TSCs.

| Method | Prec | Recall | F1-score | MAP |
|---|---|---|---|---|
| TF-IDF | 0.3411 | 0.4028 | 0.3694 | 0.3098 |
| TX | 0.3224 | 0.3709 | 0.3450 | 0.3147 |
| BTM | 0.3210 | 0.3662 | 0.3369 | 0.2927 |
| GSDPMM | 0.3440 | 0.4038 | 0.3715 | 0.3202 |
| TPTM | 0.3677 | 0.4334 | 0.3979 | 0.3359 |
| SW-IDF (dialogue) | 0.3912 | 0.4693 | 0.4267 | 0.3623 |
| SW-IDF (topic center) | 0.3877 | 0.4562 | 0.4207 | 0.3522 |

Table 10. Comparison of different methods on video tag extraction of the top 10 candidate tags with low-density TSCs.

For each method, we calculate the precision, recall, MAP (Mean Average Precision) and F1-score of top 10 tagging results at first. Results of high-density and low-density of TSCs are shown in Table 9 and Table 10, respectively.
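
The four metrics can be computed as in the following sketch. The exact normalisation the authors use for average precision is not stated, so dividing by min(number of standard tags, k) is an assumption, as are the function names. The predicted list is assumed to contain no duplicate tags.

```python
from typing import List, Sequence, Set, Tuple


def precision_recall_f1(
    predicted: Sequence[str], gold: Set[str], k: int = 10
) -> Tuple[float, float, float]:
    """Precision, recall, and F1-score of the top-k predicted tags."""
    top_k = predicted[:k]
    hits = sum(1 for tag in top_k if tag in gold)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def average_precision(predicted: Sequence[str], gold: Set[str], k: int = 10) -> float:
    """Average of precision@rank over the ranks where a correct tag occurs.

    Normalised by min(len(gold), k), which is an assumed convention.
    """
    hits = 0
    total = 0.0
    for rank, tag in enumerate(predicted[:k], start=1):
        if tag in gold:
            hits += 1
            total += hits / rank
    denom = min(len(gold), k)
    return total / denom if denom else 0.0


def mean_average_precision(
    runs: List[Tuple[Sequence[str], Set[str]]], k: int = 10
) -> float:
    """Mean of average_precision over (predicted, gold) pairs, one per video."""
    return sum(average_precision(p, g, k) for p, g in runs) / len(runs)


if __name__ == "__main__":
    pred, gold = ["a", "x", "b", "y"], {"a", "b", "c"}
    print(precision_recall_f1(pred, gold, k=4))
    print(average_precision(pred, gold, k=4))
```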

In the high-density condition, our topic center-based SW-IDF algorithm achieves the best results in both F1-score and MAP. It increases the F1-score by 21.82% and the MAP by 21.26% relative to TF-IDF, the strongest baseline in this condition. In the low-density condition, our dialogue-based SW-IDF algorithm achieves the best results in both F1-score and MAP. It increases the F1-score by 7.24% and the MAP by 7.86% relative to TPTM, the strongest baseline in this condition. Comparing the two algorithms, we find that the dialogue-based algorithm performs better in the low-density condition, while the topic center-based algorithm performs better in the high-density condition, which further supports our assumption in Section 3.2.
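
The reported percentages are relative gains over the baseline score, which can be checked directly from the values in Table 9 and Table 10. The helper name below is illustrative.

```python
def relative_gain(ours: float, baseline: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return (ours - baseline) / baseline * 100.0


if __name__ == "__main__":
    # High density: SW-IDF (topic center) vs. TF-IDF (Table 9).
    print(round(relative_gain(0.4444, 0.3648), 2))  # F1-score
    print(round(relative_gain(0.5122, 0.4224), 2))  # MAP
    # Low density: SW-IDF (dialogue) vs. TPTM (Table 10).
    print(round(relative_gain(0.4267, 0.3979), 2))  # F1-score
    print(round(relative_gain(0.3623, 0.3359), 2))  # MAP
```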

Moreover, when the density of TSCs becomes high, the noise increases. Therefore, the results of the topic model based methods BTM, GSDPMM, and TPTM are poor, even worse than those of the classical TF-IDF method. However, TF-IDF only counts words and does not consider the semantic relationships among TSCs, so its results are not as good as those of our algorithms. By contrast, in low-density comments, the graph is sparse and there is less noise. That is why our algorithms achieve greater improvement in the high-density condition than in the low-density condition.

| Method | H-Top 5 Prec | H-Top 5 Recall | H-Top 15 Prec | H-Top 15 Recall | L-Top 5 Prec | L-Top 5 Recall | L-Top 15 Prec | L-Top 15 Recall |
|---|---|---|---|---|---|---|---|---|
| TF-IDF | 0.4182 | 0.4483 | 0.1871 | 0.5997 | 0.4140 | 0.2434 | 0.2993 | 0.5255 |
| TX | 0.3012 | 0.3234 | 0.1810 | 0.5831 | 0.3838 | 0.2250 | 0.2814 | 0.5071 |
| BTM | 0.2715 | 0.2924 | 0.1771 | 0.5692 | 0.3678 | 0.2158 | 0.2609 | 0.4602 |
| GSDPMM | 0.2812 | 0.3013 | 0.1832 | 0.5930 | 0.4181 | 0.2486 | 0.3067 | 0.5390 |
| TPTM | 0.3627 | 0.3945 | 0.1805 | 0.5927 | 0.4365 | 0.2662 | 0.3183 | 0.5624 |
| SW-IDF(d) | 0.4935 | 0.5362 | 0.2273 | 0.7241 | 0.4654 | 0.2893 | 0.3556 | 0.6327 |
| SW-IDF(c) | 0.5300 | 0.5692 | 0.2345 | 0.7571 | 0.4518 | 0.2783 | 0.3410 | 0.6269 |

Table 11. Comparison of different methods on video tag extraction of the top 5 and top 15 candidate tags

To further validate our algorithms, we show the precision and recall of the top 5 and top 15 candidate tags in Table 11. The results of each algorithm are similar to those for the top 10, which shows that our two algorithms perform better at extracting video tags from time-sync comments in all the settings we tested.

| Video number | AcFun ac2643295_1 | AcFun ac2656362_6 | AcFun ac2474006_1 | AcFun ac2669229_1 |
|---|---|---|---|---|
| Timeline | 0:00:00–0:01:10 | 0:07:28–0:09:49 | 0:00:00–1:04:07 | 0:00:00–0:15:41 |
| Amount | 785 | 764 | 2933 | 2460 |
| Density | 672.84 | 325.08 | 45.78 | 156.84 |

Brief Encounter
Peng Julia
Wind and Cloud
the Twin Swords
Cheung Wai Kin
Jen Hsien-chi
Jimmy Lin
ID card
cell phone
acting skill
Cheung Ka Fai
Louis Cheung
Brief Encounter
Peng Julia
Jen Hsien-chi
Cheung Wai Kin
ID card
Cheung Ka Fai
Brief Encounter
Jen Hsien-chi
Cheung Ka Fai
Wind and Cloud
Jen Hsien-chi
Jimmy Lin
New Year
ID card
Brief Encounter
Peng Julia
Cheung Wai Kin
the Twin Swords
Jimmy Lin
cell phone
Cheung Ka Fai
dracula movie
Brief Encounter
Peng Julia
Wind and Cloud
the Twin Swords
Cheung Wai Kin
Jen Hsien-chi
Jimmy Lin
ID card
cell phone
Cheung Ka Fai
Louis Cheung
Brief Encounter
Peng Julia
Wind and Cloud
theme song
Cheung Wai Kin
the Twin Swords
Jimmy Lin
Jen Hsien-chi
ID card
cell phone
Cheung Ka Fai
dracula movie
Kuo Tsai-chieh
Table 12. The top5 results of video tags generated by different algorithms

Finally, we show the top 5 video tags generated by the above algorithms in Table 12. Words in bold italics indicate good tags (tags that all three volunteers voted for), while underlined words indicate bad tags (tags that fewer than two volunteers voted for). The results show that SW-IDF (topic center) and SW-IDF (dialogue) produce more good tags and fewer bad tags than the other algorithms, which intuitively illustrates the advantage of our algorithms.

5. Conclusion

In this paper, we proposed a novel video tag extraction algorithm to acquire video tags for time-sync videos. To deal with the features of time-sync comments, SW-IDF was designed to cluster comments in a semantic association graph by taking advantage of their semantic similarities and timestamps. In this way, noise could be differentiated from meaningful comments and thus effectively eliminated. Finally, video tags were recognized and extracted in an unsupervised way. Extensive experiments on real-world datasets showed that our algorithm extracts video tags with higher precision and recall than several baselines, which supports the potential of our algorithm for tag extraction and for handling the features of time-sync comments.

This work is supported by Chinese National Research Fund (NSFC) Key Project No. 61532013 and No. 61872239. NSFC Project No. 61872195 and No. 61702330. FDCT/0007/2018/A1, DCT-MoST Joint-project No. (025/2015/AMJ), University of Macau Grant Nos: MYRG2018-00237-RTO, CPG2018-00032-FST and SRG2018-00111-FST of SAR Macau, China.


  • Alhabashneh et al. (2017) Obada Alhabashneh, Rahat Iqbal, Faiyaz Doctor, and Anne James. 2017. Fuzzy rule based profiling approach for enterprise information seeking and retrieval. Information Sciences 394 (2017), 18–37.
  • Alstrup et al. (2000) Stephen Alstrup, G Stolting Brodal, and Theis Rauhe. 2000. New data structures for orthogonal range searching. In Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on. IEEE, 198–207.
  • Bae et al. (2017) Seung-Hee Bae, Daniel Halperin, Jevin D West, Martin Rosvall, and Bill Howe. 2017. Scalable and efficient flow-based community detection for large-scale graph analysis. ACM Transactions on Knowledge Discovery from Data (TKDD) 11, 3 (2017), 32.
  • Bentley (1975) Jon Louis Bentley. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 9 (1975), 509–517.
  • Bentley (1990) Jon Louis Bentley. 1990. K-d trees for semidynamic point sets. In Proceedings of the sixth annual symposium on Computational geometry. ACM, 187–197.
  • Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research 3 (2003), 993–1022.
  • Bordino et al. (2010) Ilaria Bordino, Carlos Castillo, Debora Donato, and Aristides Gionis. 2010. Query similarity by projecting the query-flow graph. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval. ACM, 515–522.
  • Chakraborty et al. (2016) Tanmoy Chakraborty, Sriram Srinivasan, Niloy Ganguly, Animesh Mukherjee, and Sanjukta Bhowmick. 2016. Permanence and community structure in complex networks. ACM Transactions on Knowledge Discovery from Data (TKDD) 11, 2 (2016), 14.
  • Chen et al. (2017a) Shizhe Chen, Jia Chen, Qin Jin, and Alexander Hauptmann. 2017a. Video captioning with guidance of multimodal latent topics. In Proceedings of the 2017 ACM on Multimedia Conference. ACM, 1838–1846.
  • Chen et al. (2017b) Xu Chen, Yongfeng Zhang, Qingyao Ai, Hongteng Xu, Junchi Yan, and Zheng Qin. 2017b. Personalized key frame recommendation. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 315–324.
  • Dong and Dong (2003) Zhendong Dong and Qiang Dong. 2003. HowNet-a hybrid language and knowledge resource. In Natural Language Processing and Knowledge Engineering, 2003. Proceedings. 2003 International Conference on. IEEE, 820–824.
  • Fortunato (2010) Santo Fortunato. 2010. Community detection in graphs. Physics reports 486, 3 (2010), 75–174.
  • Friedman et al. (1977) Jerome H Friedman, Jon Louis Bentley, and Raphael Ari Finkel. 1977. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software (TOMS) 3, 3 (1977), 209–226.
  • Gu et al. (2017) Liqiu Gu, Kun Wang, Xiulong Liu, Song Guo, and Bo Liu. 2017. A reliable task assignment strategy for spatial crowdsourcing in big data environment. In 2017 IEEE International Conference on Communications (ICC). IEEE, 1–6.
  • Guo et al. (2011) Jiafeng Guo, Xueqi Cheng, Gu Xu, and Xiaofei Zhu. 2011. Intent-aware query similarity. In Proceedings of the 20th ACM international conference on Information and knowledge management. ACM, 259–268.
  • He et al. (2015) Hua He, Kevin Gimpel, and Jimmy J Lin. 2015. Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks. In EMNLP. 1576–1586.
  • He et al. (2016) Ming He, Yong Ge, Le Wu, Enhong Chen, and Chang Tan. 2016. Predicting the Popularity of DanMu-enabled Videos: A Multi-factor View. In Proceedings of International Conference on Database Systems for Advanced Applications. Springer, 351–366.
  • Huang et al. (2017) Faliang Huang, Xuelong Li, Shichao Zhang, Jilian Zhang, Jinhui Chen, and Zhinian Zhai. 2017. Overlapping community detection for multimedia social networks. IEEE Transactions on Multimedia 19, 8 (2017), 1881–1893.
  • Hussein and Piccardi (2017) Fairouz Hussein and Massimo Piccardi. 2017. V-JAUNE: A Framework for Joint Action Recognition and Video Summarization. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 13, 2 (2017), 20.
  • Hyung et al. (2017) Ziwon Hyung, Joon-Sang Park, and Kyogu Lee. 2017. Utilizing context-relevant keywords extracted from a large collection of user-generated documents for music discovery. Information Processing & Management 53, 5 (2017), 1185–1200.
  • Iacobacci et al. (2015) Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. 2015. Sensembed: Learning sense embeddings for word and relational similarity. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Vol. 1. 95–105.
  • Kenter and De Rijke (2015) Tom Kenter and Maarten De Rijke. 2015. Short text similarity with word embeddings. In Proceedings of the 24th ACM international on conference on information and knowledge management. ACM, 1411–1420.
  • Kusner et al. (2015) Matt J Kusner, Yu Sun, Nicholas I Kolkin, and Kilian Q Weinberger. 2015. From word embeddings to document distances. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015). 957–966.
  • Lancichinetti and Fortunato (2009) Andrea Lancichinetti and Santo Fortunato. 2009. Community detection algorithms: a comparative analysis. Physical review E 80, 5 (2009), 056117.
  • Lancichinetti et al. (2008) Andrea Lancichinetti, Santo Fortunato, and Filippo Radicchi. 2008. Benchmark graphs for testing community detection algorithms. Physical review E 78, 4 (2008), 046110.
  • Lee and Wong (1977) Der-Tsai Lee and CK Wong. 1977. Worst-case analysis for region and partial region searches in multidimensional binary search trees and balanced quad trees. Acta Informatica 9, 1 (1977), 23–29.
  • Levy and Goldberg (2014a) Omer Levy and Yoav Goldberg. 2014a. Linguistic regularities in sparse and explicit word representations. In Proceedings of the eighteenth conference on computational natural language learning. 171–180.
  • Levy and Goldberg (2014b) Omer Levy and Yoav Goldberg. 2014b. Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems. 2177–2185.
  • Levy et al. (2015) Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics 3 (2015), 211–225.
  • Li et al. (2018b) Shen Li, Zhe Zhao, Renfen Hu, Wensi Li, Tao Liu, and Xiaoyong Du. 2018b. Analogical Reasoning on Chinese Morphological and Semantic Relations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). ACL, 138–143.
  • Li et al. (2018a) Yixuan Li, Kun He, Kyle Kloster, David Bindel, and John Hopcroft. 2018a. Local Spectral Clustering for Overlapping Community Detection. ACM Transactions on Knowledge Discovery from Data (TKDD) 12, 2 (2018), 17.
  • Liao et al. (2018) Zhenyu Liao, Yikun Xian, Xiao Yang, Qinpei Zhao, Chenxi Zhang, and Jiangfeng Li. 2018. TSCSet: A Crowdsourced Time-Sync Comment Dataset for Exploration of User Experience Improvement. In 23rd International Conference on Intelligent User Interfaces. ACM, 641–652.
  • Lv et al. (2016) Guangyi Lv, Tong Xu, Enhong Chen, Qi Liu, and Yi Zheng. 2016. Reading the Videos: Temporal Labeling for Crowdsourced Time-Sync Videos Based on Semantic Embedding. In Proceedings of the 30th AAAI Conference on Artificial Intelligence.

  • Mihalcea and Tarau (2004) Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into texts. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 8–15.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. 3111–3119.
  • Mueller and Thyagarajan (2016) Jonas Mueller and Aditya Thyagarajan. 2016. Siamese Recurrent Architectures for Learning Sentence Similarity.. In AAAI. 2786–2792.
  • Murtagh and Contreras (2012) Fionn Murtagh and Pedro Contreras. 2012. Algorithms for hierarchical clustering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2, 1 (2012), 86–97.
  • Murtagh et al. (2008) Fionn Murtagh, Geoff Downs, and Pedro Contreras. 2008. Hierarchical clustering of massive, high dimensional data sets by exploiting ultrametric embedding. SIAM Journal on Scientific Computing 30, 2 (2008), 707–730.
  • Murtagh and Legendre (2014) Fionn Murtagh and Pierre Legendre. 2014. Ward’s hierarchical agglomerative clustering method: which algorithms implement ward’s criterion? Journal of Classification 31, 3 (2014), 274–295.
  • Newman and Girvan (2004) Mark EJ Newman and Michelle Girvan. 2004. Finding and evaluating community structure in networks. Physical review E 69, 2 (2004), 026113.
  • Pandove et al. (2018) Divya Pandove, Shivan Goel, and Rinkl Rani. 2018. Systematic review of clustering high-dimensional and large datasets. ACM Transactions on Knowledge Discovery from Data (TKDD) 12, 2 (2018), 16.
  • Pang et al. (2015) Junbiao Pang, Fei Jia, Chunjie Zhang, Weigang Zhang, Qingming Huang, and Baocai Yin. 2015. Unsupervised web topic detection using a ranked clustering-like pattern across similarity cascades. IEEE Transactions on Multimedia 17, 6 (2015), 843–853.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Vol. 14. 1532–1543.
  • Raamkumar et al. (2017) Aravind Sesagiri Raamkumar, Schubert Foo, and Natalie Pang. 2017. Using author-specified keywords in building an initial reading list of research papers in scientific paper retrieval and recommender systems. Information Processing & Management 53, 3 (2017), 577–594.
  • Ramaboa and Fish (2018) Kutlwano KKM Ramaboa and Peter Fish. 2018. Keyword length and matching options as indicators of search intent in sponsored search. Information Processing & Management 54, 2 (2018), 175–183.
  • Ramezani et al. (2018) Maryam Ramezani, Ali Khodadadi, and Hamid R Rabiee. 2018. Community Detection Using Diffusion Information. ACM Transactions on Knowledge Discovery from Data (TKDD) 12, 2 (2018), 20.
  • Siersdorfer et al. (2009) Stefan Siersdorfer, Jose San Pedro, and Mark Sanderson. 2009. Automatic video tagging using content redundancy. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 395–402.
  • Socher et al. (2011) Richard Socher, Cliff C Lin, Chris Manning, and Andrew Y Ng. 2011. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11). 129–136.
  • Tarjan (1975) Robert Endre Tarjan. 1975. Efficiency of a good but not linear set union algorithm. Journal of the ACM (JACM) 22, 2 (1975), 215–225.
  • Tarjan (1979) Robert Endre Tarjan. 1979. A class of algorithms which require nonlinear time to maintain disjoint sets. Journal of Computer and System Sciences 18, 2 (1979), 110–127.
  • Wang et al. (2016b) Chenguang Wang, Yangqiu Song, Dan Roth, Ming Zhang, and Jiawei Han. 2016b. World knowledge as indirect supervision for document clustering. ACM Transactions on Knowledge Discovery from Data (TKDD) 11, 2 (2016), 13.
  • Wang et al. (2017) Kun Wang, Liqiu Gu, Song Guo, Hongbin Chen, Victor CM Leung, and Yanfei Sun. 2017. Crowdsourcing-based content-centric network: a social perspective. IEEE Network 31, 5 (2017), 28–34.
  • Wang et al. (2016a) Kun Wang, Xin Qi, Lei Shu, Der-jiunn Deng, and Joel JPC Rodrigues. 2016a. Toward trustworthy crowdsourcing in the social internet of things. IEEE Wireless Communications 23, 5 (2016), 30–36.
  • Wu et al. (2012) Benbin Wu, Jing Yang, and Liang He. 2012. Chinese hownet-based multi-factor word similarity algorithm integrated of result modification. In International Conference on Neural Information Processing. Springer, 256–266.
  • Wu et al. (2014) Bin Wu, Erheng Zhong, Ben Tan, Andrew Horner, and Qiang Yang. 2014. Crowdsourced time-sync video tagging using temporal and personalized topic modeling. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 721–730.
  • Xu and Zhang (2017) Linli Xu and Chao Zhang. 2017. Bridging Video Content and Comments: Synchronized Video Description with Temporal Summarization of Crowdsourced Time-Sync Comments. In AAAI. 1611–1617.
  • Yan et al. (2013) Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. 2013. A biterm topic model for short texts. In Proceedings of the 22nd International Conference on World Wide Web. ACM, 1445–1456.
  • Yang et al. (2017) Wenmian Yang, Na Ruan, Wenyuan Gao, Kun Wang, Wensheng Ran, and Weijia Jia. 2017. Crowdsourced time-sync video tagging using semantic association graph. In Multimedia and Expo (ICME), 2017 IEEE International Conference on. IEEE, 547–552.
  • Yin and Wang (2014) Jianhua Yin and Jianyong Wang. 2014. A dirichlet multinomial mixture model-based approach for short text clustering. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 233–242.
  • Yin and Wang (2016) Jianhua Yin and Jianyong Wang. 2016. A model-based approach for text clustering with outlier detection. In Proceedings of Data Engineering (ICDE), 2016 IEEE 32nd International Conference on. IEEE, 625–636.
  • Yu et al. (2015) Zhiwen Yu, Zhu Wang, Huilei He, Jilei Tian, Xinjiang Lu, and Bin Guo. 2015. Discovering information propagation patterns in microblogging services. ACM Transactions on Knowledge Discovery from Data (TKDD) 10, 1 (2015), 7.
  • Zhu (2004) Mu Zhu. 2004. Recall, precision and average precision. Department of Statistics and Actuarial Science, University of Waterloo, Waterloo 2 (2004), 30.