Short-form video sharing platforms such as TikTok and Kwai are becoming increasingly popular, creating a growing demand for short-form video content. Users prefer to spend their time on compact short-form videos, while high-quality videos such as reality shows and TV series are usually long (e.g., 1 hour). In this context, developing new algorithms that can truncate long videos into short, attractive and unbroken stories is of special interest.
|Statistic||ActivityNet||TruNet|
|Total Video Number||19994||1470|
|Total Video Length||648.2h||2101.0h|
|Average Video Length||2.0min||80.0min|
|Total Story(Action) Length||315.3h||845h|
|Total Story(Action) Number||23064||16891|
|Average Story(Action) Length||0.8min||3min|
|Average Story(Action) Number per Video||1.2||11.0|
Although much progress has been made in video highlight detection and video summarization [32, 33, 6, 7, 5], many of these methods focus on producing a coherent whole story in the final combined highlight and do not require each sub-video to be an unbroken story. The problem of story-preserving long video truncation is still not well studied, owing to the limitations of existing datasets. Story integrity, or completeness, is a crucial measure for short-form videos in this scenario, but it is not yet considered in existing video highlight datasets. For example, a climactic fight fragment is interesting and important enough to be a ground-truth keyshot in SumMe [6] and TVSum [25], but as an individual short-form video it must also include the beginning and ending parts to clarify the cause and effect of the story. On the other hand, ActivityNet [16] provides action intervals with accurate temporal boundaries, but its data distribution does not match the requirements of producing short-form videos from a massive database of long videos. As Table 1 shows, the average video length of ActivityNet is too short (2 minutes), so that the average story number per video is only 1.2 and the average story length is only 0.8 minutes.
In this paper, we collect a new large dataset, named TruNet, which contains 1470 videos with a total length of 2101 hours and an average video length longer than 1 hour. It covers a wide range of popular topics including variety shows, reality shows, talk shows, and TV series. TruNet provides on average 11 short stories per long video, and each story is annotated with accurate temporal boundaries. Figure 1(a) shows a sample video from TruNet. The video is a variety show that contains 9 song and dance performances. The 9 short stories are indicated in different colors, and the third one is shown in Figure 1(b) at a higher temporal resolution.
With the new dataset, we further develop a baseline neural architecture for story-preserving long video truncation that consists of two components: a Boundary Aware Network (BAN) for proposal generation and a Fast-Forward Long Short-Term Memory (FF-LSTM) for story integrity classification. Unlike previous state-of-the-art methods [30], whose proposal generation depends only on actionness, BAN additionally uses frame-level boundaryness to generate proposals and achieves higher precision when the number of proposals is small. Unlike a traditional LSTM for sequence modeling, FF-LSTM [39] introduces fast-forward connections into a stack of LSTM layers to encourage stable and effective backpropagation in the deep recurrent topology, which leads to a clear performance improvement in our video story classification. To the best of our knowledge, this is the first time that FF-LSTM has been used for modeling sequences in the video domain.
In summary, our contributions are threefold: (1) We introduce a new practical problem in video truncation, story-preserving long video truncation, which requires truncating a long video into multiple short-form videos, each of which preserves a story. (2) We collect and annotate a new large dataset for studying this problem, which can serve as a complementary source to existing video datasets. (3) We propose a baseline framework that involves a new temporal proposal generation module and a new sequence modeling module, and that performs better than traditional methods. We will also release the dataset for public academic research use.
2 Related Work
2.1 Video Dataset
Here we briefly review typical video datasets that are related to our work. The SumMe dataset [6] consists of 25 videos covering 3 categories. The videos range from about 1 to 6 minutes in length, and a subset of the frames of each video is extracted as its summary. Similarly, the TVSum dataset [25] contains 50 videos from 10 categories. The video duration is between 2 and 10 minutes, and a limited portion of the frames is selected as the keyshots. Compared with SumMe and TVSum, our proposed TruNet dataset focuses on the story-preserving long video truncation problem, such that each summary is an integral short-form video with accurate temporal boundaries.
On the other hand, although THUMOS14 [12] and ActivityNet [16] also provide temporal boundaries, their data distributions are not suitable for our problem. Taking ActivityNet as an example, its average video length is too short and its average action number per video is only 1.2. Whether the daily activities collected by ActivityNet are suitable for video sharing is another consideration. In contrast, our proposed TruNet focuses on collecting high-quality long videos, such that the average number of short stories per video is as high as 11. The video topics in TruNet are also chosen to be suitable for sharing on video sharing platforms.
2.2 Video Summarization
Video summarization has been a long-standing problem in computer vision and multimedia. Previous studies typically treat it as extracting keyshots [32, 1, 17, 29, 13, 14], video skims [33, 31, 37, 6, 7, 20, 21], storyboards [5], time-lapses [15], montages [26], or video synopses [22]. An exhaustive review is beyond the scope of this paper; we refer the reader to [28] for a survey of early work on video summarization. Compared with previous approaches, this paper proposes story-preserving long video truncation, which formulates the video summarization problem in a different way. Each truncated short-form video should be an attractive and unbroken short story, so that it can be used on video sharing platforms.
2.3 Temporal Action Localization
Our work is also related to temporal action localization. Early methods [27, 11, 34] relied on hand-crafted features and sliding window search. ConvNet-based features such as C3D [24] and two-stream CNN [2, 35] achieve both higher efficiency and better performance than hand-crafted features. Other works [3, 4, 30] focused on stronger network structures to generate high-quality temporal proposals. Structured temporal modeling [38] and 1D temporal convolution [18] have also proved important for boosting the performance of temporal action localization. While conceptually similar, our proposed model is novel in that it generates high-quality temporal proposals by jointly considering frame-level attractiveness and boundaryness, which helps the truncated short-form videos remain unbroken. BSN [19] is also a boundary-sensitive framework, but it is designed for temporal action proposals, and its proposals are generated from frame-level action-start and action-end confidence scores. In contrast, our BAN generates proposals according to story-start, story-end and storyness scores at the group-of-frames level, to suit the extremely long input.
3 The TruNet Dataset
The story-preserving long video truncation problem is not well studied due to the lack of publicly available datasets. Therefore, we construct the TruNet dataset to quantitatively evaluate the proposed framework.
3.1 Dataset Setup
Since the short-form stories truncated from long videos should be suitable for sharing on video sharing platforms, we choose four types of long videos: variety shows, reality shows, talk shows and TV series. Most videos in TruNet are downloaded from the video website iQIYI.com, which hosts a large quantity of high-quality long videos. After data collection, crowdsourced annotation is carried out with a carefully designed annotation tool. Each annotation worker is trained by annotating a small number of videos, and can participate in the formal annotation only after passing the training program. During the annotation task, the worker is asked to 1) watch the whole video; 2) annotate the temporal boundaries of short stories; 3) adjust the boundaries. Each long video is annotated by multiple workers for quality assurance, and the annotations are finally reviewed by several experts. We randomly split the dataset into three parts: a training set with 1241 videos, a validation set with 115 videos and a testing set with 114 videos. The validation set is not used in our experiments.
3.2 Dataset Statistics
The TruNet dataset consists of 1470 long videos with an average duration of 80 minutes and a total duration of 2101 hours. A long video contains 11 stories on average, and there are 16891 manually labeled stories in total. The average duration of a short story is 3 minutes. Figure 2 compares TruNet with existing video summarization and temporal action localization datasets such as SumMe, TVSum, ActivityNet and THUMOS14. As can be seen, TruNet has the largest total video length and average proposal duration.
We show a group of statistics of the TruNet dataset in Figure 3. The video length distribution is shown in Figure 3(a). The distribution of the number of stories in each long video is shown in Figure 3(b). The distribution of story length is shown in Figure 3(c). The distribution of the ratio of story region length to long video length is shown in Figure 3(d).
4 Methodology
The mathematical formulation of the story-preserving long video truncation problem is similar to that of temporal action localization. The training dataset can be represented as a collection of long videos, each given by its sequence of frame-level features and paired with a set of ground-truth sliced short-form videos. Every ground-truth short-form video is specified by the beginning and ending indexes of its interval in the long video. The dataset is thus characterized by the number of training videos, the number of frames in each video, and the number of intervals in each video.
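In symbols, the training set described above could be written as follows. The notation is assumed here for illustration only: $N$ is the number of training videos, $x_{i,t}$ is the feature of frame $t$ of long video $V_i$, $T_i$ is the frame count of $V_i$, and $(s_{i,j}, e_{i,j})$ are the beginning and ending indexes of the $j$-th of the $M_i$ ground-truth intervals of $V_i$.

```latex
\mathcal{D} = \Big\{ \big( \{x_{i,t}\}_{t=1}^{T_i},\;
  \{(s_{i,j},\, e_{i,j})\}_{j=1}^{M_i} \big) \Big\}_{i=1}^{N},
\qquad 1 \le s_{i,j} < e_{i,j} \le T_i .
```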
Figure 4 illustrates the architecture of the proposed framework, which includes three major components: feature extraction, temporal proposal generation and sequential structure modeling. Given an input long video, RGB-based 2D convolutional features and audio features are extracted and concatenated. In the temporal proposal generation component, the features are fed into the BAN to predict the attractiveness and boundaryness of each frame. A dilated merge algorithm is then applied to generate temporal proposals from the frame-level attractiveness and boundaryness scores. In the sequential structure modeling component, the sequential structure of each proposal is modeled by the FF-LSTM, which outputs the classification confidence score of the proposal together with the refined boundaries.
4.1 Boundary Aware Network
A novel boundary aware network (BAN) is proposed in this paper to provide high-quality story proposals. As Figure 4 shows, BAN takes the features of 7 consecutive frames as the input of an LSTM layer. The output of the LSTM is average-pooled, and a linear layer is used to predict probability scores for four categories: within story, background, story beginning boundary and story ending boundary. The label of the center frame is the category with the largest score. We regard a frame sequence as a story candidate if every frame in the sequence belongs to the within story category. A simple dilated merge algorithm is then applied to merge adjacent story candidates that are separated by a small distance (5 frames). Finally, Non-Maximum Suppression (NMS) is applied to reduce redundancy and generate the final proposals.
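The post-processing described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the half-open `(start, end)` interval convention, the definition of "distance" as the number of frames between two candidates, and the NMS threshold of 0.5 are all hypothetical choices. How proposals are scored for NMS is not specified in the text, so `scores` is simply an input here.

```python
# Label ids for the four BAN categories (the numeric values are arbitrary).
WITHIN, BACKGROUND, BEGIN, END = range(4)


def story_candidates(labels):
    """Return maximal runs of WITHIN frames as half-open (start, end) pairs."""
    runs, start = [], None
    for t, lab in enumerate(labels):
        if lab == WITHIN:
            if start is None:
                start = t
        elif start is not None:
            runs.append((start, t))
            start = None
    if start is not None:
        runs.append((start, len(labels)))
    return runs


def dilated_merge(runs, max_gap=5):
    """Merge consecutive runs separated by at most max_gap frames (assumed rule)."""
    merged = []
    for s, e in runs:
        if merged and s - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged


def tiou(a, b):
    """Temporal intersection-over-union of two (start, end) intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(proposals, scores, iou_thr=0.5):
    """Greedy NMS: keep the best-scoring proposal, drop those overlapping it."""
    order = sorted(range(len(proposals)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(tiou(proposals[i], proposals[k]) <= iou_thr for k in keep):
            keep.append(i)
    return [proposals[i] for i in keep]


# Toy per-frame labels: two nearby story fragments, then a distant one.
B, W, E = BACKGROUND, WITHIN, END
labels = [B, B, W, W, W, B, W, W, E] + [B] * 7 + [W, W]
candidates = story_candidates(labels)             # [(2, 5), (6, 8), (16, 18)]
proposals = dilated_merge(candidates, max_gap=5)  # [(2, 8), (16, 18)]
```

Proposals merged from a single label sequence never overlap, so `temporal_nms` only has an effect when overlapping candidates are pooled, for example from more than one gap setting, which is a hypothetical usage here.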
Unlike previous popular methods [30, 3, 4, 38] that depend only on actionness for proposal generation, BAN additionally uses frame-level boundaryness to generate proposals. A comparison between the actionness score of TAG [30] (the upper row) and the proposed BAN (the middle row) is shown in Figure 5. For TAG, we show actionness scores larger than 0.5. For BAN, we show the maximum scores among the three foreground categories in different colors. The ground-truth proposal intervals are shown in the bottom row. As can be seen, the score curve of BAN is smoother than that of TAG, and its boundaries better match the ground-truth proposal intervals.
4.2 FF-LSTM
We first briefly review the LSTM, which serves as the basis of sequential structure modeling. Long Short-Term Memory (LSTM) [10] is an enhanced version of the recurrent neural network (RNN) with a set of memory cells. At each time step, the LSTM combines the current input with the output of the previous time step into a concatenation of four vectors of equal size. These vectors drive the input gate, the forget gate and the output gate, which use the input, forget and output activation functions respectively, and the gated memory cell yields the output of the current time step. All weights involved are learnable parameters. The LSTM computation, referred to as (1), can be equivalently split into two consecutive steps: the hidden block and the recurrent block.
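For readers who prefer the update in symbols, the following is a standard LSTM formulation that is consistent with the description above. The notation is assumed for illustration and is not taken from the paper's own equation: $x_t$ is the input at time step $t$, $h_t$ the output, $c_t$ the memory cell, $[z_i, z_g, z_c, z_o]$ the concatenation of four equal-sized vectors, and $W_f$, $W_r$, $b$ the learnable parameters. Peephole connections are omitted, and the usual sigmoid ($\sigma$) and $\tanh$ activations are assumed.

```latex
\begin{aligned}
[z_i,\, z_g,\, z_c,\, z_o] &= W_f\, x_t + W_r\, h_{t-1} + b, \\
c_t &= \sigma(z_i) \circ \tanh(z_c) + \sigma(z_g) \circ c_{t-1}, \\
h_t &= \sigma(z_o) \circ \tanh(c_t).
\end{aligned}
```

Under this notation, the hidden block is the linear map $f_t = W_f\, x_t$, and the recurrent block is everything else, $(h_t, c_t) = R(f_t, h_{t-1}, c_{t-1})$.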
A straightforward deep LSTM can be constructed by directly stacking multiple LSTM layers, so that each layer takes the output of the layer below it as its input. Figure 6(b) illustrates a deep LSTM with three stacked layers. As can be seen, the input of the hidden block is the output of the recurrent block of the previous layer. In FF-LSTM, a fast-forward connection is added to connect the hidden blocks of two adjacent layers, so the hidden block of a layer also receives the output of the hidden block of the previous layer. The added connections build a fast path that contains neither non-linear activations nor recurrent computations, so that information and gradients can be propagated easily. Figure 6(a) illustrates an FF-LSTM with three layers.
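The difference between the two stacks can be sketched in symbols. The notation below is assumed for illustration: $f^k_t$ is the output of the hidden block of layer $k$ at time step $t$, $h^k_t$ and $c^k_t$ are the output and memory cell of its recurrent block $R$, and $W^k_f$ are learnable weights. Concatenating $f^{k-1}_t$ with $h^{k-1}_t$ is one way to realize the fast-forward connection, in the spirit of [39]; it should be read as a sketch rather than as the exact equations of this paper.

```latex
\begin{aligned}
&\text{First layer (both models):} && f^1_t = W^1_f\, x_t, \\
&\text{Stacked LSTM, } k > 1: && f^k_t = W^k_f\, h^{k-1}_t, \\
&\text{FF-LSTM, } k > 1: && f^k_t = W^k_f\, [\, f^{k-1}_t,\; h^{k-1}_t \,], \\
&\text{Recurrent block (both models):} && (h^k_t,\, c^k_t) = R\big(f^k_t,\; h^k_{t-1},\; c^k_{t-1}\big).
\end{aligned}
```

In the FF-LSTM case, the chain $f^1_t \to f^2_t \to \dots$ passes only through the linear maps $W^k_f$, which corresponds to the fast path containing neither non-linear activations nor recurrent computations.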
Suppose the multi-layer FF-LSTM receives a proposal, i.e., a temporal range within a long video. The hidden block of the first FF-LSTM layer is then computed from the frame-level features inside this range. On top of the last FF-LSTM layer, a max-pooling layer is used to obtain a global representation of the proposal. A binary classifier for story/background classification operates on this global representation. We calculate the intersection-over-union (IoU) between each proposal and every ground-truth story: if the maximum IoU is larger than 0.7, the proposal is regarded as a positive sample, and if it is less than 0.3, the proposal is regarded as a negative sample. A boundary regressor is also computed from the max-pooled global representation. The multi-task loss over a training proposal consists of two terms, a classification term and a regression term.
The first term is a cross-entropy loss between the label of the proposal and the classification score produced by the multi-layer FF-LSTM, max-pooling and the binary classifier. The second term is scaled by a balancing weight parameter and gated by an indicator, meaning that it is active only when the label of the proposal is 1. It accumulates two smooth L1 losses, one for the beginning boundary and one for the ending boundary, where the smooth L1 function is quadratic for small residuals and linear for large ones. The regression targets are the beginning and ending indexes of the chosen ground-truth story of the proposal, i.e., the one with the maximum IoU.
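The labeling rule and the two-term loss can be illustrated with a small pure-Python sketch. It is written under assumptions that the text does not settle: all function names are hypothetical, proposals whose maximum IoU falls between 0.3 and 0.7 are ignored, the smooth L1 function takes its common form with a transition at 1, and the regressor is assumed to predict boundary indexes directly (a real system would more likely regress normalized offsets). The default balancing weight of 5 follows the value reported in Section 5.1.

```python
import math


def temporal_iou(a, b):
    """Temporal IoU of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def assign_label(proposal, stories, pos_thr=0.7, neg_thr=0.3):
    """Return (label, matched_story); label is 1, 0, or None if ignored."""
    if not stories:
        return 0, None
    best = max(stories, key=lambda s: temporal_iou(proposal, s))
    iou = temporal_iou(proposal, best)
    if iou > pos_thr:
        return 1, best
    if iou < neg_thr:
        return 0, best
    return None, best  # assumption: ambiguous proposals are not used


def smooth_l1(x):
    """Quadratic for |x| < 1, linear otherwise."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def multitask_loss(score, label, pred_bounds, gt_bounds, lam=5.0):
    """Cross-entropy plus lam * [label == 1] * (two smooth L1 boundary terms)."""
    eps = 1e-12  # guards against log(0)
    cls = -(label * math.log(score + eps)
            + (1 - label) * math.log(1.0 - score + eps))
    reg = (smooth_l1(pred_bounds[0] - gt_bounds[0])
           + smooth_l1(pred_bounds[1] - gt_bounds[1]))
    return cls + lam * label * reg


# A proposal that overlaps the first story with IoU 0.9 becomes a positive.
label, story = assign_label((0, 10), [(0, 9), (50, 60)])
loss = multitask_loss(0.5, label, pred_bounds=(1.0, 11.0), gt_bounds=story)
```

Multiplying the regression term by `label` plays the role of the indicator: for negative proposals the boundary error contributes nothing to the loss.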
5 Experiments
5.1 Implementation Details
Because the dataset is too large to process directly, we pre-process the videos and extract frame-level features. We decode the videos at 1 frame per second (FPS) and extract two kinds of features: the “pool5” feature of ResNet-50 [8] trained on ImageNet [23], and a convolutional audio feature [9].
We train the BAN using Stochastic Gradient Descent (SGD) with a momentum of 0.9, 70 epochs, a weight decay of 0.0005 and a mini-batch size of 256 on four K40 GPUs. One epoch means that all training samples are passed through once. All parameters are randomly initialized. The learning rate is set to 0.001. We tried reducing the learning rate during training but found no benefit. The four categories (within story, background, beginning boundary and ending boundary) are sampled at a fixed ratio. To enable boundary regression in the temporal structure modeling stage, we augment the BAN-generated proposals by extending the beginning and ending boundaries, in a manner similar to [38].
We train a 5-layer FF-LSTM using SGD with a momentum of 0.9, 40 epochs, a weight decay of 0.0008 and a mini-batch size of 256. The learning rate is kept at 0.001 throughout training. Positive and negative proposals are sampled at a fixed ratio. The balancing weight parameter of the loss is set to 5.
5.2 Evaluation Metrics
For story localization, we report the mean average precision (mAP) of different methods at three different IoU thresholds. We also report the average mAP over a range of IoU thresholds. To evaluate the quality of the generated temporal proposals, we report the average recall vs. average number of retrieved proposals (AR-AN) curve defined in [4].
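As an illustration of the localization metric, the sketch below computes average precision at a given temporal IoU threshold and averages it over several thresholds. It is one common, non-interpolated variant and not necessarily the exact protocol used here. The function names and the threshold values in the usage example are hypothetical. Since there is a single foreground class (story), the mean over classes reduces to a single AP.

```python
def tiou(a, b):
    """Temporal IoU of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(predictions, ground_truth, iou_thr):
    """Non-interpolated AP at one IoU threshold.

    predictions:  list of (video_id, start, end, score)
    ground_truth: dict mapping video_id -> list of (start, end)
    """
    n_gt = sum(len(gts) for gts in ground_truth.values())
    if n_gt == 0:
        return 0.0
    matched = {vid: [False] * len(gts) for vid, gts in ground_truth.items()}
    hits = []
    # Visit predictions from the most to the least confident.
    for vid, s, e, _ in sorted(predictions, key=lambda p: p[3], reverse=True):
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(ground_truth.get(vid, [])):
            iou = tiou((s, e), gt)
            if iou > best_iou:
                best_iou, best_j = iou, j
        # A hit needs enough overlap with a story that is not yet matched.
        hit = best_j >= 0 and best_iou >= iou_thr and not matched[vid][best_j]
        if hit:
            matched[vid][best_j] = True
        hits.append(hit)
    ap, tp = 0.0, 0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            tp += 1
            ap += tp / rank  # precision at this recall point
    return ap / n_gt


def mean_ap_over_thresholds(predictions, ground_truth, thresholds):
    """Average of the per-threshold AP values."""
    aps = [average_precision(predictions, ground_truth, t) for t in thresholds]
    return sum(aps) / len(aps)


# Toy example with one video, two stories and three scored proposals.
gt = {"v": [(0, 10), (20, 30)]}
preds = [("v", 0, 10, 0.9), ("v", 50, 60, 0.8), ("v", 21, 30, 0.7)]
ap_loose = average_precision(preds, gt, 0.5)
ap_strict = average_precision(preds, gt, 0.95)
avg = mean_ap_over_thresholds(preds, gt, [0.5, 0.95])  # hypothetical thresholds
```

In the toy example the third proposal overlaps its story with IoU 0.9, so it counts as a hit at the looser threshold but as a miss at the stricter one.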
5.3 Ablation Studies
To study the effectiveness of the proposed BAN, we compare it with sliding window search (SW), KTS [21] and TAG [30]. The comparison results are summarized in Figure 6. We can see that the average recall of the proposed BAN is significantly higher than that of the other methods when the number of proposals is small, but BAN cannot generate as many proposals as the others because of its selection standard. Since TAG and BAN are highly complementary, merging their proposals yields a substantially better curve.
Table 2 summarizes the ablation study results of temporal proposal generation and sequential structure modeling, showing that both components are crucial for the final performance. FF-LSTM consistently outperforms LSTM across the different proposal generation methods. We observe that methods with a training step (TAG and BAN) generate much better proposals than heuristic ones (sliding window and KTS). Using BAN proposals alone is slightly better than using TAG proposals alone, but because TAG can generate a larger number of proposals, BAN and TAG are highly complementary. As shown in Table 2, merging them brings a clear improvement over each individual method.
|TAG + BAN||52.55||57.34|
Table 3 summarizes the ablation study results of using different features. As can be seen, the RGB and audio features are highly complementary, and dropping either feature clearly decreases the performance. Another interesting observation is that using the audio feature alone achieves better performance than using the RGB feature alone. We conjecture that this is because audio patterns change rapidly at video story boundaries, especially in variety programs.
|RGB + Audio||57.34|
Table 4 summarizes the ablation study results of using different deep recurrent models for sequential structure modeling. Directly stacking multiple LSTM layers leads to a performance drop because of convergence difficulty. In contrast, using a deeper FF-LSTM yields noteworthy performance gains. A 3-layer FF-LSTM increases the mAP by 3.9 over a 1-layer LSTM, and a 5-layer FF-LSTM further increases the mAP by 0.9. We also tried FF-LSTMs with more than 5 layers, but found only marginal improvements.
Table 5 summarizes the ablation study results of the regression loss. From the results, we can see that adding the regression loss improves the mAP by 3.0. This suggests that, even though BAN already takes the story boundaries into account when generating proposals, our FF-LSTM component can capture high-order dependencies among frames for further boundary refinement.
|Cla. Loss + Reg. Loss||57.34|
5.4 Quantitative Comparison Results
We compare our proposed framework with state-of-the-art video summarization methods, including vsLSTM [36] and HD-VS [33]. We have to re-implement vsLSTM and HD-VS to make them suitable for our formulation, because the completeness of the storytelling is essential in story-preserving long video truncation but is not considered in previous video summarization papers.
For vsLSTM, we use the merged TAG and BAN proposals as the input of vsLSTM, which replaces the FF-LSTM and serves as the basis of temporal modeling for regression and classification. For HD-VS, we use the merged TAG and BAN proposals as the input of a 5-layer FF-LSTM, and the cross-entropy loss is replaced with the deep ranking loss proposed in HD-VS. The results are summarized in Table 6. As can be seen, the proposed framework outperforms all previous methods in all cases by a large margin.
5.5 User-Study Results
Finally, we conduct a subjective evaluation to compare the quality of the generated short video story summaries. 100 volunteers of different genders, educational backgrounds and ages are asked to complete the following steps independently:
Watch 18 randomly selected long videos.
For each long video, choose one of the video story summaries as the best one.
The answers of the 100 volunteers are accumulated, and the ratios at which the three methods were chosen are summarized in Table 7. As can be seen, the summaries generated by the proposed framework receive 46.4% of the votes, which is higher than vsLSTM (40.8%) and HD-VS (12.8%).
6 Conclusion
In this paper, we propose a new story-preserving video truncation problem that requires algorithms to truncate long videos into short, attractive, and unbroken stories. This problem is particularly important for resource production on video sharing platforms. We collect and annotate a new large dataset, TruNet, and propose a novel framework that combines BAN and FF-LSTM to address this problem.
References
- [1] (2006) Information theory-based shot cut/fade detection and video summarization. In IEEE Transactions on Circuits and Systems for Video Technology.
- [2] (2017) Temporal context network for activity localization in videos. In ICCV.
- [3] (2016) DAPs: deep action proposals for action understanding. In ECCV.
- [4] (2017) TURN TAP: temporal unit regression network for temporal action proposals. In ICCV.
- [5] (2016) Schematic storyboarding for video visualization and editing. In ACM Transactions on Graphics.
- [6] (2014) Creating summaries from user videos. In ECCV.
- [7] (2015) Video summarization by learning submodular mixtures of objectives. In CVPR.
- [8] (2016) Deep residual learning for image recognition. In CVPR.
- [9] (2017) CNN architectures for large-scale audio classification. In ICASSP.
- [10] (1997) Long short-term memory. In Neural Computation.
- [11] (2014) Action localization by tubelets from motion. In CVPR.
- [12] (2014) THUMOS challenge: action recognition with a large number of classes. http://crcv.ucf.edu/THUMOS14/.
- [13] (2013) Large-scale video summarization using web-image priors. In CVPR.
- [14] (2014) Joint summarization of large-scale collections of web images and videos for storyline reconstruction. In CVPR.
- [15] (2014) First-person hyperlapse videos. In ACM Transactions on Graphics.
- [16] (2012) ActivityNet: a large-scale video benchmark for human activity understanding. In CVPR.
- [17] (2012) Discovering important people and objects for egocentric video summarization. In CVPR.
- [18] (2017) Single shot temporal action detection. In ACM Multimedia.
- [19] (2018) BSN: boundary sensitive network for temporal action proposal generation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19.
- [20] (2013) Story-driven summarization for egocentric video. In CVPR.
- [21] (2014) Category-specific video summarization. In ECCV.
- [22] (2008) Nonchronological video synopsis and indexing. In IEEE Transactions on Pattern Analysis and Machine Intelligence.
- [23] (2015) ImageNet large scale visual recognition challenge. In IJCV.
- [24] (2016) Temporal action localization in untrimmed videos via multi-stage CNNs. In CVPR.
- [25] (2015) TVSum: summarizing web videos using titles. In CVPR.
- [26] (2014) Salient montages from unconstrained videos. In ECCV.
- [27] (2013) Combining the right features for complex event recognition. In CVPR.
- [28] (2007) Video abstraction. In ACM TOMCCAP.
- [29] (1996) Key frame selection by motion analysis. In Acoustics, Speech, and Signal Processing.
- [30] (2017) A pursuit of temporal accuracy in general activity detection. In CVPR.
- [31] (2015) Gaze-enabled egocentric video summarization via constrained submodular maximization. In CVPR.
- [32] (2015) Unsupervised extraction of video highlights via robust recurrent auto-encoders. In ICCV.
- [33] (2016) Highlight detection with pairwise deep ranking for first-person video summarization. In CVPR.
- [34] (2016) Temporal action localization with pyramid of score distribution features. In CVPR.
- [35] (2017) Temporal action localization by structured maximal sums. In CVPR.
- [36] (2016) Video summarization with long short-term memory. In ECCV.
- [37] (2014) Quasi real-time summarization for consumer videos. In CVPR.
- [38] (2017) Temporal action detection with structured segment networks. In ICCV.
- [39] Deep recurrent models with fast-forward connections for neural machine translation. In Transactions of the Association for Computational Linguistics.