Unsupervised Multi-stream Highlight detection for the Game "Honor of Kings"

by Li Wang, et al.

With the increasing popularity of e-sports live streaming, Highlight Flashback has become a critical feature of live platforms: it aggregates the most exciting fighting scenes of a broadcast into a few seconds. In this paper, we introduce a novel training strategy that requires no additional annotation to automatically generate highlights for live game videos. Observing that existing manually edited clips contain a higher proportion of highlights than long live game videos, we impose pair-wise ranking constraints between clips from edited and long live videos. We also propose a multi-stream framework that fuses spatial, temporal, and audio features extracted from the videos. To evaluate our method, we test on long live game videos with an average length of about 15 minutes. Extensive experimental results demonstrate satisfying performance on highlight generation and the effectiveness of fusing the three streams.





1 Introduction

The past decade has seen rapid development of the live broadcast market, especially the e-sports market, with platforms such as ’Douyu’, ’Kuaishou’ and ’Penguin E-sports’ in China. One of the core functions of live platforms is Highlight Flashback, which aims to show the most exciting fighting clips from a lengthy live broadcast. However, current highlight flashbacks on these platforms are manually edited and uploaded, a content generation process that consumes considerable human resources. Hence, automatic highlight generation is an urgent demand for live platforms.

Figure 1: Examples of highlights in live videos of the game ’Honor of Kings’. 1) The raw long live video; 2) highlight segments corresponding to the raw video; 3) audio clips of the video.

To tackle this issue, previous works explored highlight detection at the frame or clip level [13, 4, 15, 3, 10, 14]. [3] addressed highlight detection as a classification task: highlight parts were regarded as the target class and the rest as background. This method requires accurate annotations for each frame or clip and is therefore typically used in a supervised setting. [10] regarded a highlight as a novel event for every frame in a video, and constructed a convolutional autoencoder for visual analysis of the game scene, face, and audio. On the other hand, [4, 15, 14] made use of the internal relationship between highlight and non-highlight clips, namely that highlight clips should receive higher scores than non-highlight ones. Based on this observation, a ranking net was employed to implement the relationship in both supervised and unsupervised settings.

In this paper, we focus on highlight generation for live videos of the game ’Honor of Kings’. We define the intense fight scenes, at which audiences get excited, as highlights; an example is shown in Figure 1. However, laborious annotations would normally be needed to train a network specially designed for game scenes. Thus, we adopt a novel training strategy that learns from existing videos downloaded from ’Penguin E-sports’ without any additional annotations.

Figure 2: Overview of our framework. The three streams are trained with the ranking loss, respectively. At inference time, we first obtain three scores for the target video clip and then fuse them into the final score.

Considering that the unlabeled videos are hard to classify as highlight or non-highlight, we adopt the ranking net as our basic model. Since a significant number of highlight videos are edited and uploaded to the platform by journalists and fans, we take clips from edited videos as positive samples and clips from long videos as negative samples for our network. Because the long videos also contain highlight clips, which may introduce noise during training, we use the Huber loss [4] to further reduce the impact of noisy data.
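The pair-wise sampling described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: clips are represented by plain string identifiers, and the random pairing scheme is an assumption.

```python
import random

def sample_pairs(edited_clips, live_clips, num_pairs, seed=0):
    """Draw (positive, negative) training pairs: positives come from
    manually edited highlight videos, negatives from clips of raw live
    videos. Hypothetical helper; the paper does not specify its exact
    sampling procedure."""
    rng = random.Random(seed)
    return [(rng.choice(edited_clips), rng.choice(live_clips))
            for _ in range(num_pairs)]

# Example with the dataset sizes reported in Section 2.
pairs = sample_pairs([f"edit_{i}" for i in range(450)],
                     [f"live_{i}" for i in range(200)], num_pairs=8)
```

Because some negatives drawn this way are in fact highlights, the Huber loss below is what keeps such mislabeled pairs from dominating training.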

Besides, as shown in Figure 2, a multi-stream network is constructed to make full use of the temporal, spatial, and audio information of the videos. We use simple convolutional layers to produce the main highlight scores from the temporal stream, while fusing the auxiliary audio and spatial components to further refine the results.

Our contributions can be summarized as follows:

  • A novel training strategy is proposed, which uses existing downloaded videos without any additional annotations.

  • A multi-stream network containing temporal, spatial and audio components is constructed, which makes full use of the information in the game videos.

  • Experimental results on game videos demonstrate the effectiveness of our approach for automatic highlight generation.

2 Dataset Collection

Existing highlight datasets [13, 11] contain a variety of videos of natural scenes, which differ greatly from game scenes. For this reason, we collect new videos for the target scenes: long raw live videos and highlight videos from the Penguin E-sports platform. For network training, we first randomly select 10 players and query their game videos; 450 edited highlight videos and 10 long raw videos are then downloaded. The highlight videos are 21 seconds long on average, while the raw videos range from 6 to 8 hours. Due to the extreme length of the raw live videos, we randomly intercept 20 video clips per video with an average length of 13 minutes each, so that positive and negative samples are balanced. Note that each intercepted clip contains both highlight and non-highlight intervals, where highlights account for about 20% of the whole clip. For testing, we download another four long videos from different players and take one video clip from each (60 minutes in total). Specifically, the evaluated clips contain different master heroes, so the scenarios vary dramatically and the task becomes challenging. To evaluate the effectiveness of our approach, we annotate the evaluated video clips at the second level. The annotated videos contain 55 highlight periods in total, with an average length of 7.83 seconds per period.
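The clip interception step can be sketched as below. This is an assumption-laden simplification: the paper reports an average clip length of 13 minutes and does not state whether clips may overlap, so this version draws independent random start times of fixed length.

```python
import random

def intercept_clips(video_len_s, clip_len_s=13 * 60, n_clips=20, seed=0):
    """Randomly intercept n_clips fixed-length clips from one long raw
    live video (all lengths in seconds). Overlap between clips is
    allowed here, which may or may not match the paper's procedure."""
    rng = random.Random(seed)
    starts = sorted(rng.uniform(0, video_len_s - clip_len_s)
                    for _ in range(n_clips))
    return [(start, start + clip_len_s) for start in starts]

# Example: a 7-hour raw broadcast yields 20 candidate negative clips.
clips = intercept_clips(video_len_s=7 * 3600)
```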

3 Methodology

In this section, we introduce our multi-stream architecture, shown in Figure 2. We combine three components for highlight generation: the Temporal Network extracts temporal information; the Spatial Network captures spatial context for each frame; and the Audio Network filters out unrelated scenes by exploiting the game's distinctive sound effects, which reveal the players' immersion. All three networks use the ranking framework, which constrains the scores produced for positive and negative samples.

Temporal Network. In this component, we exploit temporal dynamics using 3D features. We extract features from the output of the final pooling layer of a ResNet-34 model [6] with 3D convolution layers [5], pre-trained on the Kinetics dataset [1]. The inputs to the network are video clips of 16 frames each. After the features are obtained, three fully-connected layers with channels {512, 128, 1} are added to perform the ranking constraint.

Spatial Network. Spatial and context information plays an important role in object classification and detection, and is also critical for highlight detection because it provides distinguishing appearances for different scenes and objects. Therefore, we set up a stream to score highlight versus non-highlight video frames. Different from the Temporal Network, we train the Spatial Network at the frame level. A fixed-length feature is first extracted for each frame and then fed into the spatial ranking net. Here, we use AlexNet [8], pre-trained on 1.2 million images of the ImageNet challenge dataset [2], to generate a 9216-dimension feature from the last pooling layer, followed by seven fully-connected layers with channels {9216, 4096, 1024, 512, 256, 128, 64, 1}.

Figure 3: Qualitative highlight generation results. Each row shows inference scores for frames, from low to high, in a test video. The first two rows show non-game frames, which receive lower scores than game frames. The last two rows show ranking scores among game frames: frames from intense fight scenes obtain higher scores than those from non-fight scenes.

Audio Network. We observe that different scenes in a game live stream have different audio characteristics. For example, highlight clips tend to be immersed in audible screams and sounds of struggle, while non-highlight parts are quieter. Each one-second audio segment is first fed into a pre-trained Speaker Encoder [7] to generate a 256-dimension feature, followed by two fully-connected layers with channels {64, 1}.
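The three scoring heads share one pattern: a stack of fully-connected layers over a pre-extracted feature. A minimal PyTorch sketch of that pattern, under stated assumptions: the dropout rate 0.5 and the placement of ReLU/Dropout after every layer (including the last) follow the paper's implementation details as literally as possible, and `ranking_head` is a hypothetical helper, not the authors' code.

```python
import torch
import torch.nn as nn

def ranking_head(in_dim, out_channels, p_drop=0.5):
    """Build a fully-connected scoring head for one stream. ReLU and
    Dropout follow each fully-connected layer, per the implementation
    details; the dropout rate 0.5 is an assumption (not stated)."""
    layers, cin = [], in_dim
    for cout in out_channels:
        layers += [nn.Linear(cin, cout), nn.ReLU(), nn.Dropout(p_drop)]
        cin = cout
    return nn.Sequential(*layers)

# Channel specs from Section 3 (inputs: 512-d 3D-ResNet clip feature,
# 9216-d AlexNet frame feature, 256-d speaker embedding).
temporal_head = ranking_head(512, [512, 128, 1])
spatial_head = ranking_head(9216, [4096, 1024, 512, 256, 128, 64, 1])
audio_head = ranking_head(256, [64, 1])

score = temporal_head(torch.randn(4, 512))  # one score per clip
```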

Training & Inference process. We train the three streams separately, and samples from highlight and non-highlight videos must satisfy the constraint:

$F(x^+) > F(x^-), \quad \forall\, (x^+, x^-) \in \mathcal{D}$

where $x^+$ and $x^-$ are the features of input samples (frames or video clips) from highlight and non-highlight videos, respectively, $F$ denotes the ranking network, and $\mathcal{D}$ is the dataset.

Therefore, to optimize the networks, we employ a ranking loss between positive and negative samples:

$L_{rank} = \max\big(0,\; 1 - F(x^+) + F(x^-)\big)$

This loss assumes that the negative samples are non-highlights. However, in our case, clips drawn from live videos can also be highlights. Thus, we apply the Huber loss proposed in [4] to decrease the negative effect of such outliers:

$L_{Huber} = \begin{cases} \frac{1}{2}u^2, & u \le \delta \\ \delta u - \frac{1}{2}\delta^2, & u > \delta \end{cases}$

where $u = \max\big(0,\; 1 - F(x^+) + F(x^-)\big)$, so that the losses for outliers grow only linearly rather than quadratically. Here, $\delta$ is set to a fixed threshold that separates the outliers.
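The Huber-smoothed ranking loss can be written compactly as below. This is a sketch in the spirit of [4], operating on scalar scores for clarity; the default delta=1.5 is an assumed value, since the paper does not report its threshold.

```python
def huber_rank_loss(s_pos, s_neg, delta=1.5):
    """Huber-smoothed pairwise ranking loss. u is the usual margin
    ranking loss; beyond delta the loss grows linearly instead of
    quadratically, damping noisy negatives (live clips that are really
    highlights). delta=1.5 is an assumption, not the paper's value."""
    u = max(0.0, 1.0 - s_pos + s_neg)
    if u <= delta:
        return 0.5 * u * u                   # quadratic region: normal pairs
    return delta * u - 0.5 * delta * delta   # linear region: outliers
```

For example, a correctly ordered pair with margin satisfied (`s_pos=2.0, s_neg=0.0`) incurs zero loss, while a badly inverted pair falls into the linear region and is penalized less severely than a squared loss would.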

For inference, after the scores are obtained from the three streams, we fuse them into the final score. Here, a simple weighted summation is used with weights {0.7, 0.15, 0.15} for the temporal, spatial, and audio scores respectively, which reflects the importance of the 3D temporal information.
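The weighted fusion above amounts to a one-liner per clip; a minimal sketch:

```python
def fuse_scores(temporal, spatial, audio, weights=(0.7, 0.15, 0.15)):
    """Late fusion of per-clip scores from the three streams by the
    weighted summation described above."""
    wt, ws, wa = weights
    return [wt * t + ws * s + wa * a
            for t, s, a in zip(temporal, spatial, audio)]

fused = fuse_scores([1.0, 0.0], [1.0, 1.0], [1.0, 1.0])
```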

Implementation details. The framework is implemented in PyTorch. All three networks are trained with Stochastic Gradient Descent (SGD) with a weight decay of 0.00005 and a momentum of 0.9. The learning rate is 0.005 for the Temporal and Spatial networks and 0.1 for the Audio network. ReLU non-linearities [9] and Dropout [12] are applied after each fully-connected layer of the three streams.

4 Experiments

We evaluate our approach on four raw videos with an average length of 15 minutes. Specifically, we divide each video into clips of 5 seconds each and measure the average score of each clip. Since no other approach is trained and tested on our dataset, we only report ablation experiments and the overall results of our approach.
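The clip-level averaging at evaluation time can be sketched as follows. One score per second is an assumption for illustration; the paper does not state the rate at which per-frame scores are produced.

```python
def clip_scores(per_second_scores, clip_len_s=5):
    """Average per-second scores over consecutive 5-second clips, as
    done at evaluation time (scoring rate of 1/s is assumed)."""
    return [sum(chunk) / len(chunk)
            for chunk in (per_second_scores[i:i + clip_len_s]
                          for i in range(0, len(per_second_scores), clip_len_s))]
```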

Metrics. As described in [4], the commonly used mean Average Precision (mAP) metric is sensitive to video length: a longer video leads to a lower score. Considering that our test videos are about 3 to 5 times longer than those in other highlight datasets [11, 13], we use the normalized meaningful summary duration (nMSD) proposed in [4]. The nMSD measures the relative length of the selected highlight clips at a recall rate $\alpha$:

$\text{nMSD} = \dfrac{|V^*| - \alpha\,|V_{gt}|}{|V| - \alpha\,|V_{gt}|}$

where $|\cdot|$ denotes the length of the corresponding video, $V$ is the raw test video, and $V_{gt}$ and $V^*$ are the ground-truth and predicted highlights at recall rate $\alpha$. Note that a lower nMSD indicates better performance, with the best performance achieved when nMSD $= 0$.
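As a concrete check of the metric, a minimal implementation operating on segment lengths in seconds (variable names are illustrative):

```python
def nmsd(len_pred, len_gt, len_video, alpha=0.5):
    """Normalized meaningful summary duration (nMSD) from [4]: the
    relative length of the predicted highlights |V*| at recall alpha,
    normalized against the video length |V| and the ground-truth
    highlight length |V_gt|. Lower is better; a perfect result is 0."""
    return (len_pred - alpha * len_gt) / (len_video - alpha * len_gt)
```

For instance, if a 100-second video has 10 seconds of ground-truth highlights and the predictor reaches recall 0.5 after selecting only 5 seconds, the nMSD is 0, the best possible value.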

Results. The results are shown in Table 1. As the table shows, performance improves as more information is fused into the network. Notably, the single Audio Network performs poorly because the audio extracted from the videos is riddled with noise, such as the anchor's voice and background music played by the player. We conclude that our approach performs well even though no annotated video is available. More qualitative results are shown in Figure 3.

temporal  spatial  audio  |   nMSD     mAP
   ✓                      |  15.13%   21.51%
             ✓            |  14.41%   21.24%
                     ✓    |  43.28%   13.38%
   ✓         ✓            |  13.58%   21.74%
   ✓         ✓       ✓    |  13.36%   22.27%

Table 1: Highlight score fusion strategies. The best performance is achieved when the three streams are fused together.

5 Conclusion

In this paper, we propose a multi-stream architecture to automatically generate highlights for live videos of the game ’Honor of Kings’, saving considerable human effort. In particular, since edited highlight clips and long live videos satisfy the constraint that the former should score higher than the latter, we make use of existing highlight videos on the ’Penguin E-sports’ platform to optimize the network. Additionally, we exploit information from the spatial, temporal, and audio aspects, which further improves highlight generation. In the future, we will explore more effective techniques to make the best of the inherent characteristics of game videos, notably audio information; for example, the video and audio streams could learn from each other via a teacher-student mechanism. Besides, good performance on different categories of game videos could be achieved by applying transfer learning, such as transferring between MOBA and RPG games.


  • [1] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, pp. 6299–6308. Cited by: §3.
  • [2] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In CVPR, pp. 248–255. Cited by: §3.
  • [3] C. Fu, J. Lee, M. Bansal, and A. C. Berg (2017) Video highlight prediction using audience chat reactions. arXiv preprint arXiv:1707.08559. Cited by: §1.
  • [4] M. Gygli, Y. Song, and L. Cao (2016) Video2gif: automatic generation of animated gifs from video. In CVPR, pp. 1001–1009. Cited by: §1, §1, §3, §4.
  • [5] K. Hara, H. Kataoka, and Y. Satoh (2018) Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?. In CVPR, pp. 6546–6555. Cited by: §3.
  • [6] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §3.
  • [7] Y. Jia, Y. Zhang, R. Weiss, Q. Wang, J. Shen, F. Ren, P. Nguyen, R. Pang, I. L. Moreno, Y. Wu, et al. (2018) Transfer learning from speaker verification to multispeaker text-to-speech synthesis. In Advances in neural information processing systems, pp. 4480–4490. Cited by: §3.
  • [8] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §3.
  • [9] V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted boltzmann machines. In ICML, pp. 807–814. Cited by: §3.
  • [10] C. Ringer and M. A. Nicolaou (2018) Deep unsupervised multi-view detection of video game stream highlights. In Proceedings of the 13th International Conference on the Foundations of Digital Games, pp. 15. Cited by: §1.
  • [11] Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes (2015) Tvsum: summarizing web videos using titles. In CVPR, pp. 5179–5187. Cited by: §2, §4.
  • [12] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §3.
  • [13] M. Sun, A. Farhadi, and S. Seitz (2014) Ranking domain-specific highlights by analyzing edited videos. In ECCV, pp. 787–802. Cited by: §1, §2, §4.
  • [14] B. Xiong, Y. Kalantidis, D. Ghadiyaram, and K. Grauman (2019) Less is more: learning highlight detection from video duration. In CVPR, pp. 1258–1267. Cited by: §1.
  • [15] T. Yao, T. Mei, and Y. Rui (2016) Highlight detection with pairwise deep ranking for first-person video summarization. In CVPR, pp. 982–990. Cited by: §1.