
End-to-End Compressed Video Representation Learning for Generic Event Boundary Detection

Generic event boundary detection aims to localize the generic, taxonomy-free event boundaries that segment videos into chunks. Existing methods typically require video frames to be fully decoded before being fed into the network, which demands considerable computational power and storage space. To address this, we propose a new end-to-end compressed video representation learning method for event boundary detection that leverages the rich information in the compressed domain, i.e., RGB, motion vectors, residuals, and the internal group of pictures (GOP) structure, without fully decoding the video. Specifically, we first use ConvNets to extract features of the I-frames in the GOPs. After that, a lightweight spatial-channel compressed encoder is designed to compute the feature representations of the P-frames based on the motion vectors, residuals, and representations of their dependent I-frames. A temporal contrastive module is proposed to determine the event boundaries of video sequences. To remedy the ambiguities of annotations and speed up the training process, we use a Gaussian kernel to preprocess the ground-truth event boundaries. Extensive experiments conducted on the Kinetics-GEBD dataset demonstrate that the proposed method achieves comparable results to the state-of-the-art methods with 4.5× faster running speed.


1 Introduction

Video traffic is projected to account for 82% of all internet traffic by 2022, up from 75% in 2017 [11]. Understanding video content with AI technology has been an active area of research in recent years. However, it remains a challenging task due to the complex temporal evolution within raw video streams, which are enormous in size and highly temporally redundant.

Video understanding is one of the most fundamental problems in computer vision, covering tasks such as video tagging, action recognition, and video boundary detection. In contrast to static images, videos provide rich information involving temporal consistency across consecutive frames, which can be additionally exploited. Currently, the two-stream network [30, 9, 8] and the 3D convolutional network [33, 17, 34, 37] are two popular architectures in the video understanding field. The two-stream network incorporates both decoded RGB video frames and optical flow to exploit temporal information. However, extracting optical flow is very slow and dominates the overall pre-processing time in video understanding tasks. The 3D convolutional network is another choice to model temporal information using spatio-temporal filters; its drawback is the massive number of parameters in 3D convolution operations, which slows down inference. Besides the aforementioned methods, a new trend in video understanding is the use of transformers [5, 1, 24, 6, 51], which achieve competitive results.

In recent years, several methods [49, 45, 29, 42, 47, 16] have demonstrated the advantages of directly taking compressed videos as input for video understanding. Instead of operating on decoded RGB frames, these methods use the motion vectors and residuals already present in the compressed representation developed for video storage and transmission; they run two orders of magnitude faster than methods relying on optical flow while achieving competitive results [29]. Specifically, they use the almost compute-free motion vectors and residuals encoded in P-frames as an alternative to compute-intensive optical flow. For example, CoViAR [45] directly feeds motion vectors and residuals into 2D CNNs for action recognition, and DMC-Net [29] improves upon CoViAR by reconstructing the optical flow from motion vectors and residuals. Although these methods achieve promising results, they are still far from satisfactory, as they lack effective fusion strategies between the different modalities, such as decoded I-frames, motion vectors, and residuals.

In this paper, we focus on the generic event boundary detection (GEBD) [28] task, which aims to localize the moments where humans naturally perceive taxonomy-free event boundaries that segment a longer event into shorter temporal segments. The ability to divide a long-form video into short, meaningful clips makes this task valuable for several downstream video understanding tasks and industrial applications that require high accuracy and low latency. The previous attempt [28] formulates GEBD as a classification task by considering the context information around candidate boundaries. However, it neglects the temporal relations between consecutive frames and is inefficient during the feature extraction stage. Inspired by [49, 45, 29, 42, 47, 16], we design an end-to-end trained network that exploits discriminative features for GEBD directly in the compressed domain, i.e., MPEG-4, which saves decoding cost and improves feature extraction efficiency. Specifically, most modern codecs split a video into several groups of pictures (GOPs), where each GOP is formed by one I-frame followed by several P-frames. To resolve the difficulty arising from the long dependency chain of the P-frames, we follow [45] and use the back-tracing technique to compute the accumulated motion vectors and residuals in linear time, as sketched below. In this way, the consecutive P-frames in each GOP depend only on the reference I-frame and can be processed in parallel.
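As a concrete illustration of the back-tracing step, the following NumPy sketch accumulates motion vectors and residuals back to the reference I-frame in the spirit of CoViAR [45]. It assumes per-pixel motion vectors already expanded from macroblocks, each P-frame referencing only its immediate predecessor, and a particular sign convention for the displacements; it is an illustration, not the implementation used in the paper.

```python
import numpy as np

def accumulate_to_iframe(motion_vectors, residuals):
    """Back-trace motion vectors/residuals so every P-frame refers to the I-frame.

    Assumed inputs: motion_vectors is a list of (H, W, 2) integer arrays
    (dx, dy per pixel), residuals is a list of (H, W, 3) float arrays,
    both ordered from the first P-frame after the I-frame onwards.
    """
    H, W, _ = motion_vectors[0].shape
    ys, xs = np.mgrid[0:H, 0:W]
    acc_mv, acc_res = [], []
    prev_mv = np.zeros((H, W, 2), dtype=np.int64)     # I-frame: zero motion
    prev_res = np.zeros((H, W, 3), dtype=np.float64)  # I-frame: zero residual
    for mv, res in zip(motion_vectors, residuals):
        # Location in the previous frame that each pixel was predicted from.
        ref_x = np.clip(xs - mv[..., 0], 0, W - 1)
        ref_y = np.clip(ys - mv[..., 1], 0, H - 1)
        # Chain that location's accumulated motion/residual back to the I-frame.
        cur_mv = mv + prev_mv[ref_y, ref_x]
        cur_res = res + prev_res[ref_y, ref_x]
        acc_mv.append(cur_mv)
        acc_res.append(cur_res)
        prev_mv, prev_res = cur_mv, cur_res
    return acc_mv, acc_res
```

Because each P-frame is visited once and only looks up its predecessor, the accumulation is linear in the number of frames, which matches the "linear time" property mentioned above.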

In contrast to the I-frames, it is difficult to learn discriminative features for the P-frames. Refining the features of the reference I-frame based on the motion vectors and residuals is an intuitive option: motion vectors and residuals provide the information needed to reconstruct P-frames by referring to their dependent I-frames, and they also carry motion information obtained from the video encoding process. To that end, we design a lightweight spatial-channel compressed encoder that refines the features of the reference I-frame under the guidance of the motion vectors and residuals. In this way, the features of P-frames and I-frames are mapped into the same feature space, which benefits the subsequent processing. After that, a temporal contrastive module is proposed to capture the temporal context and predict the event boundaries of videos. Notably, our temporal contrastive module imitates how humans look back and forth around candidate frames to determine event boundaries, by comparing the extracted features before and after the candidate frames. In addition, to remedy the ambiguities of annotations and speed up the training process, we use a Gaussian kernel to preprocess the ground-truth event boundaries instead of using the "hard labels" of boundaries. Extensive experiments are conducted on the Kinetics-GEBD dataset to demonstrate the effectiveness of the proposed method. Specifically, the proposed method achieves comparable results to the state-of-the-art method at the CVPR'21 LOVEU Challenge [18] with a faster running speed, see Figure 1.

The main contributions of this paper are as follows. (1) We propose an end-to-end compressed video representation learning method to solve the challenging GEBD task. (2) We design the spatial-channel compressed encoder, which projects the features of the reference I-frame under the guidance of motion vectors and residuals to compute the features of P-frames at low cost. (3) We propose a temporal contrastive module that determines the event boundaries of videos by exploiting the temporal context information. (4) The proposed method achieves comparable results to the state-of-the-art methods at the CVPR'21 LOVEU Challenge [18] with a faster running speed, demonstrating its effectiveness.

Figure 2: The architecture of the proposed method. The spatial-channel compressed encoder (SCCE) obtains the refined P-frame representations from the reference I-frame features, motion vectors, and residuals. This module regards each GOP as a processing unit, which is efficient and can be parallelized with a large batch size. The temporal contrastive module then explicitly captures temporal dependencies based on the unified representation, providing strong cues for boundary detection. After that, a simple classifier trained with the Gaussian-smoothed soft labels makes the final predictions.

2 Related Work

Video recognition.

Over the last decade, video recognition has achieved great progress thanks to the emergence of deep learning. Early methods [38, 39, 26, 20] use hand-crafted features for video recognition. With the arrival of deep learning, the field was quickly dominated by CNN-based methods, such as the two-stream network and the 3D convolutional network. Two-stream methods [30, 9, 8] use an additional temporal stream to learn motion information and design various fusion strategies to combine the image stream and the temporal stream, achieving superior results. Optical flow is generally used to describe the motion information, but it is computationally expensive. Other methods [33, 17, 34, 37] use the 3D convolutional network with spatio-temporal filters to integrate temporal information; however, these models are hard to optimize and require large-scale datasets for training. The recent trend is the introduction of transformers [5, 1, 24, 6, 51], which achieve promising results on various video understanding datasets.

Meanwhile, some recent methods directly take raw compressed videos as input for different tasks in the video understanding field, such as action recognition [49, 45, 29, 47, 16], object detection [42], and video segmentation [10]. These methods use motion vectors and residuals obtained directly from the compressed videos as alternatives to optical flow and achieve comparable results in terms of both speed and accuracy.

Generic event boundary detection. Generic event boundary detection (GEBD) [28] aims to localize the moments where humans naturally perceive taxonomy-free event boundaries that break a longer event into shorter temporal segments. The previous method [28] takes the video frames before and after each candidate boundary as input and determines separately whether each candidate is an event boundary. Kang et al. [18] propose to use the temporal self-similarity matrix (TSM) as an intermediate representation and apply contrastive learning to exploit discriminative features for better performance. Hong et al. [14] use cascade classification heads and a dynamic sampling strategy to boost both recall and precision. Meanwhile, Rai et al. [27] attempt to learn spatiotemporal features using a two-stream inflated 3D convolution architecture. To the best of our knowledge, no prior work focuses on the GEBD task in the compressed domain.

Attention mechanism. To learn more discriminative features, numerous methods have been proposed that enhance feature representations using attention mechanisms along the spatial and/or channel dimensions. SENet [15] develops the "Squeeze-and-Excitation" (SE) block, which adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. The non-local network [43] captures long-range dependencies by computing the response at a position as a weighted sum of the features at all positions in the input feature maps. SKNet [22] adaptively adjusts the receptive field size of the input feature map by fusing multiple feature maps of different kernel sizes with softmax attention in a weighted manner. CBAM [44] sequentially infers attention maps along both the channel and spatial dimensions and then uses them to recalibrate the original input features. In contrast to the aforementioned methods, we refine the P-frame features under the guidance of motion vectors and residuals by considering both the spatial and channel dimensions of the I-frame features, which fully leverages the information of the decoded reference I-frame to enrich the features of P-frames.

3 Method

The existing method [28] formulates the GEBD task as binary classification, predicting the boundary label of each frame by considering the temporal contextual information. That is, the preceding and succeeding frames of each video frame are fed into a neural network to detect the boundaries. This is inefficient because duplicated computation is performed for consecutive frames. To remedy this, we propose an end-to-end compressed video representation method for GEBD, which regards each video clip as a whole. Specifically, we use MPEG-4 encoded videos as input. Each video clip is formed by groups of pictures (GOPs), and each GOP contains one I-frame and several P-frames, i.e.,

$G_i = \{I_i, P_i^1, P_i^2, \ldots, P_i^T\}, \quad I_i, P_i^t \in \mathbb{R}^{H \times W \times 3}$,   (1)

where $I_i$ denotes the reference I-frame and $P_i^t$ denotes the $t$-th P-frame of the $i$-th GOP, and $H$ and $W$ are the height and width of the video frame. For simplicity, we assume that all GOPs contain the same number of P-frames. The P-frame $P_i^t$ in the $i$-th GOP is formed by the motion vector $\mathcal{M}_i^t$ and residual $\mathcal{R}_i^t$, which can be obtained nearly cost-free from the compressed video stream. Notably, the motion vectors and residuals alone do not contain the full information of a P-frame: each P-frame depends on the reference I-frame or other P-frames, which makes it difficult to learn discriminative feature representations for P-frames. Following [45], we trace all motion vectors back to the reference I-frame and accumulate the residuals along the way to decouple the dependencies between consecutive P-frames. In this way, each P-frame depends only on the reference I-frame rather than on other P-frames. We then build our model on the back-traced motion vectors and residuals and regard each GOP as a processing unit. The overall network architecture is presented in Figure 2. The GOP is first encoded by the designed spatial-channel compressed encoder (SCCE) to generate the unified video representation. After that, a temporal contrastive module exploits the temporal context information to obtain discriminative feature representations. Finally, a classifier generates the event boundary predictions.
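For orientation, a rough PyTorch skeleton of this pipeline is given below. The tensor shapes, module interfaces, and the placeholder classes `SCCE` and `TemporalContrastiveModule` (sketched in Sections 3.1 and 3.2) are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class CompressedGEBDNet(nn.Module):
    """Sketch of the overall pipeline in Figure 2: per-GOP features from an
    I-frame backbone and the SCCE module, a temporal contrastive module, and
    a frame-level boundary classifier."""

    def __init__(self, iframe_backbone, scce, temporal_module, feat_dim):
        super().__init__()
        self.iframe_backbone = iframe_backbone   # e.g. a ResNet-50 trunk
        self.scce = scce                         # spatial-channel compressed encoder
        self.temporal_module = temporal_module   # temporal contrastive module
        self.classifier = nn.Conv1d(2 * feat_dim, 1, kernel_size=1)

    def forward(self, iframes, motion_vectors, residuals):
        # iframes:        (B, G, 3, H, W)    one decoded I-frame per GOP
        # motion_vectors: (B, G, T, 2, H, W) back-traced motion vectors
        # residuals:      (B, G, T, 3, H, W) accumulated residuals
        B, G = iframes.shape[:2]
        T = motion_vectors.shape[2]
        f_i = self.iframe_backbone(iframes.flatten(0, 1))          # (B*G, C, h, w)
        f_p = self.scce(f_i, motion_vectors.flatten(0, 1),
                        residuals.flatten(0, 1))                   # (B*G, T, C)
        f_i_vec = f_i.mean(dim=(2, 3))                             # pooled I-frame feature
        # Each GOP contributes its I-frame feature followed by T P-frame features.
        feats = torch.cat([f_i_vec.unsqueeze(1), f_p], dim=1)      # (B*G, 1+T, C)
        feats = feats.view(B, G * (1 + T), -1)                     # (B, N, C)
        contrastive = self.temporal_module(feats)                  # (B, N, 2C)
        scores = self.classifier(contrastive.transpose(1, 2))      # (B, 1, N)
        return torch.sigmoid(scores).squeeze(1)                    # per-frame boundary scores
```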

3.1 Spatial-Channel Compressed Encoder

Motion, uncovered regions, and lighting variations frequently occur in video sequences. Modern codecs use the macroblock as the basic unit of motion compensated prediction in a number of mainstream visual coding standards such as MPEG-4, H.263, and H.264. Motion vectors record the moving direction of each macroblock with respect to its reference frame(s); they describe the motion patterns of videos, which is important for the GEBD task. The residuals can be regarded as compensations for the motion information; they contain the boundary information of moving objects and play a crucial role in identifying the important regions of the I-frame. Thus, we propose to apply an attention mechanism to different regions of the I-frame, guided by the motion vectors, to enrich the features along both the channel and spatial dimensions. For simplicity, we omit the GOP index in the following sections.
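Because the motion vectors are stored per macroblock, they are typically expanded to pixel resolution before being fed to a CNN. The toy sketch below illustrates this expansion; the 16×16 block size, the cropping behaviour, and the helper name are assumptions for illustration.

```python
import numpy as np

def expand_macroblock_mv(block_mv, block_size=16, frame_hw=None):
    """Expand per-macroblock motion vectors to a per-pixel (H, W, 2) map.

    `block_mv` is assumed to have shape (H // block_size, W // block_size, 2),
    holding one (dx, dy) displacement per macroblock as exposed by a typical
    MPEG-4 parser.
    """
    mv = np.repeat(np.repeat(block_mv, block_size, axis=0), block_size, axis=1)
    if frame_hw is not None:        # crop if the frame size is not an exact
        H, W = frame_hw             # multiple of the macroblock size
        mv = mv[:H, :W]
    return mv
```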

Firstly, we use a convolutional neural network $\Phi_I$ that takes the decoded RGB image as input to extract the feature representation of the I-frame $I$, i.e., $f_I = \Phi_I(I)$, where $f_I \in \mathbb{R}^{C \times H' \times W'}$ is the feature of the I-frame $I$, and $C$, $H'$ and $W'$ are the channel, height and width of the feature $f_I$, respectively. $\Phi_I$ denotes the model used to extract features for the I-frame, which is pretrained on large-scale datasets (e.g., ResNet50 pretrained on ImageNet). Meanwhile, we can similarly compute the features for the P-frames with a more lightweight model than the one used for the I-frame, by directly taking the motion vectors $\mathcal{M}$ and residuals $\mathcal{R}$ as input, where $f_{\mathcal{M}}$ and $f_{\mathcal{R}}$ denote the resulting features of the motion vectors and residuals, respectively. In this way, a considerable amount of time is saved on extracting features for the P-frames. However, this simple strategy brings only limited performance gain [45]. The method of [29] integrates optical flow in the training phase, which further improves the accuracy, but there is still much room for improvement. Specifically, the motion vectors record the motion patterns of both the scenes and objects in videos, and the residuals provide the compensation information; neither contains the context information of the scenes. To that end, we design the spatial-channel compressed encoder module, which integrates the features of the reference I-frame when computing the features of the P-frames.
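To illustrate the asymmetric feature extractors, the sketch below pairs a ResNet-50 trunk for decoded I-frames with ResNet-18 trunks for motion vectors and residuals. The first-conv replacement that lets ResNet-18 accept 2-channel motion vectors and the removal of the pooling/classification head are assumptions, not the paper's released code.

```python
import torch.nn as nn
import torchvision.models as models

def _trunk(net):
    """Drop global pooling and the classification head, keeping spatial maps."""
    return nn.Sequential(*list(net.children())[:-2])

def build_feature_extractors(mv_channels=2):
    """Heavy backbone for decoded I-frames, lightweight ones for compressed signals."""
    iframe_net = _trunk(models.resnet50(weights="IMAGENET1K_V1"))   # (B, 2048, h, w)

    mv_backbone = models.resnet18(weights="IMAGENET1K_V1")
    mv_backbone.conv1 = nn.Conv2d(mv_channels, 64, kernel_size=7,
                                  stride=2, padding=3, bias=False)   # 2-channel input
    mv_net = _trunk(mv_backbone)                                     # (B, 512, h, w)

    res_net = _trunk(models.resnet18(weights="IMAGENET1K_V1"))       # (B, 512, h, w)
    return iframe_net, mv_net, res_net
```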

Figure 3: The architecture of the proposed spatial-channel compressed encoder (SCCE) module. We concatenate the I-frame features $f_I$, the motion vectors $\mathcal{M}$, and the motion-vector features $f_{\mathcal{M}}$ to modulate the features of the reference I-frame in both the channel and spatial dimensions. The modulated features are then residually added to the motion-vector features to obtain the refined representation $f'_{\mathcal{M}}$.

We first compute the refined features of the motion vectors by refining the features of the reference I-frame in both the channel and spatial dimensions. As indicated by [48], different regions of the feature maps focus on different parts of the image. Thus, we introduce an attention weight for each feature map of $f_I$ based on the information of the P-frame. Specifically, we concatenate the I-frame feature $f_I$, the motion-vector feature $f_{\mathcal{M}}$ and the motion vectors $\mathcal{M}$ along the channel dimension to compute the channel weight $A_c$ using a lightweight PWC-Net [32], i.e.,

$A_c = \sigma\big(W_2\,\delta(W_1\,\mathrm{GAP}([f_I; f_{\mathcal{M}}; \mathcal{M}]))\big)$,   (2)

where $\sigma(\cdot)$ is the sigmoid function, $\delta(\cdot)$ is the ReLU function, $\mathrm{GAP}(\cdot)$ denotes global average pooling, and $W_1$, $W_2$ are the learnable weights of the FC layers. After that, the features of the I-frame are updated based on $A_c$ as follows,

$\tilde{f}_I = A_c \odot f_I$,   (3)

where $\odot$ is the channel-wise multiplication. In this way, we compute the channel-weighted feature $\tilde{f}_I$ by updating $f_I$ along the channel dimension under the guidance of the motion vectors. Meanwhile, the channel-weighted feature $\tilde{f}_I$ is further updated in the spatial dimension, and the spatial dimension is reduced. That is, given the channel-weighted I-frame features $\tilde{f}_I$, the motion-vector features $f_{\mathcal{M}}$ and the motion vectors $\mathcal{M}$, we compute the 2D weight map $A_s$, i.e.,

$A_s = \sigma\big(\mathrm{Conv}([\tilde{f}_I; f_{\mathcal{M}}; \mathcal{M}])\big)$,   (4)

where $A_s \in \mathbb{R}^{H' \times W'}$ is the spatial weight map. After that, we use $A_s$ to weight the features in the spatial dimension and compute the enriched features $\hat{f}_{\mathcal{M}}$ of the motion vectors, i.e.,

$\hat{f}_{\mathcal{M}} = \sum_{p} A_s(p)\,\tilde{f}_I(p)$,   (5)

where $p$ enumerates all spatial positions of $\tilde{f}_I$. Finally, we add $\hat{f}_{\mathcal{M}}$ to the original features of the P-frame to obtain the refined features of the motion vectors $f'_{\mathcal{M}}$, i.e.,

$f'_{\mathcal{M}} = \hat{f}_{\mathcal{M}} + f_{\mathcal{M}}$.   (6)

The overall computing process of $f'_{\mathcal{M}}$ is presented in Figure 3. Similarly, we can compute the refined features $f'_{\mathcal{R}}$ for the residuals. The final feature representation $f_P$ of the P-frame is then computed by fusing the two refined features, i.e.,

$f_P = f'_{\mathcal{M}} + f'_{\mathcal{R}}$.   (7)

In this way, we compute the features of the P-frames in each GOP by taking the reference I-frame into account in both the channel and spatial dimensions. The overall process is efficient and can be parallelized across GOPs. After extracting discriminative features for both the I-frames and P-frames in the same feature space, we can predict the event boundaries efficiently and accurately.
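A minimal PyTorch sketch of one SCCE branch (the motion-vector branch; the residual branch is symmetric and the two outputs are fused as in (7)) is given below. The channel sizes, reduction ratio, 1×1 projection, and pooling choices are assumptions made to keep the sketch self-contained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SCCEBranch(nn.Module):
    """One branch of the spatial-channel compressed encoder (Section 3.1)."""

    def __init__(self, c_iframe=2048, c_side=512, reduction=16):
        super().__init__()
        self.proj = nn.Conv2d(c_side, c_iframe, kernel_size=1)    # align channel counts
        c_cat = 2 * c_iframe + 2                                  # [f_I; f_M; raw mv]
        self.channel_fc = nn.Sequential(                          # Eq. (2)
            nn.Linear(c_cat, c_iframe // reduction), nn.ReLU(inplace=True),
            nn.Linear(c_iframe // reduction, c_iframe), nn.Sigmoid())
        self.spatial_conv = nn.Conv2d(c_cat, 1, kernel_size=3, padding=1)  # Eq. (4)

    def forward(self, f_i, f_mv, mv):
        # f_i:  (B, c_iframe, h, w) I-frame features
        # f_mv: (B, c_side, h, w)   motion-vector features
        # mv:   (B, 2, H, W)        accumulated motion vectors
        f_mv = self.proj(f_mv)
        mv = F.interpolate(mv, size=f_i.shape[-2:], mode="bilinear",
                           align_corners=False)
        x = torch.cat([f_i, f_mv, mv], dim=1)
        a_c = self.channel_fc(x.mean(dim=(2, 3)))                  # channel weights
        f_i = f_i * a_c[:, :, None, None]                          # Eq. (3)
        x = torch.cat([f_i, f_mv, mv], dim=1)
        a_s = torch.sigmoid(self.spatial_conv(x))                  # spatial weight map
        pooled = (a_s * f_i).sum(dim=(2, 3))                       # Eq. (5)
        return pooled + f_mv.mean(dim=(2, 3))                      # Eq. (6), vector output
```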

3.2 Temporal Contrastive Module

Based on the extracted features of the video, we design the temporal contrastive module to predict the event boundaries. Inspired by how humans look back and forth around candidate boundary frames to determine event boundaries, we compute contrastive features before and after the candidate boundary frames in the temporal domain. Specifically, given the feature representations of the $k$ frames before the candidate boundary frame $t$, we compute the left features $f^l_t$ of the candidate boundary frame using a simple linear weighted summation, i.e.,

$f^l_t = \sum_{j=1}^{k} w_j\, f_{t-j}$,   (8)

where $w_j$ are the learnable weights, shared across different positions $t$. This simple linear weighted summation can be efficiently implemented with a 1D convolution. Meanwhile, the right features $f^r_t$ are computed analogously by weighted summation of the features after the candidate boundary frame $t$. After that, the contrastive feature is computed as the concatenation of $f^l_t$ and $f^r_t$, i.e., $f^c_t = [f^l_t; f^r_t]$. Finally, we use the contrastive representations to make the event boundary predictions.
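The module can be implemented with two 1D convolutions over the left and right temporal windows, as in the sketch below; the depthwise (per-channel) weights and the zero padding at the sequence ends are assumptions of this sketch rather than details confirmed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalContrastiveModule(nn.Module):
    """Summarize the k frames before and after each frame and concatenate."""

    def __init__(self, feat_dim, k=4):
        super().__init__()
        self.k = k
        # Depthwise 1D convolutions: one length-k kernel per channel, shared
        # across temporal positions (the assumed form of Eq. (8)).
        self.left = nn.Conv1d(feat_dim, feat_dim, kernel_size=k,
                              groups=feat_dim, bias=False)
        self.right = nn.Conv1d(feat_dim, feat_dim, kernel_size=k,
                               groups=feat_dim, bias=False)

    def forward(self, feats):
        # feats: (B, N, C) per-frame features in temporal order
        x = feats.transpose(1, 2)                                  # (B, C, N)
        # Left context of frame t: frames t-k .. t-1 (zero-padded at the start).
        left = self.left(F.pad(x, (self.k, 0))[:, :, :-1])
        # Right context of frame t: frames t+1 .. t+k (zero-padded at the end).
        right = self.right(F.pad(x, (0, self.k))[:, :, 1:])
        return torch.cat([left, right], dim=1).transpose(1, 2)     # (B, N, 2C)
```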

3.3 Loss Function

Given the feature representations of each video frame and the corresponding ground-truth labels, the event boundary detection task can be intuitively formulated as a binary classification task. However, the ambiguities of the annotations disrupt the learning process and lead to poor convergence. To solve this issue, we use a Gaussian kernel to preprocess the ground-truth event boundaries and obtain soft labels instead of using the "hard labels" of boundaries. Specifically, for each annotated boundary, the intermediate label at a neighboring position is computed as

$\hat{y}_t = \exp\!\big(-\tfrac{(t - t_b)^2}{2\sigma^2}\big)$,   (9)

where $\hat{y}_t$ indicates the intermediate label at time $t$ corresponding to the annotated boundary at time $t_b$. We use the same $\sigma$ in all our experiments. The final soft labels are computed as the summation of all intermediate labels. Finally, a simple nonlinear Conv1D classifier predicts the boundary scores, and the binary cross-entropy loss is used to guide the training.
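A small sketch of the soft-label generation described by (9) is shown below; the `sigma` default and the clipping of overlapping contributions to [0, 1] are assumptions of this sketch, since the paper's exact value of σ is not reproduced here.

```python
import torch

def gaussian_soft_labels(boundary_frames, num_frames, sigma=1.0):
    """Turn annotated boundary frame indices into Gaussian-smoothed soft labels.

    Each boundary t_b contributes exp(-(t - t_b)^2 / (2 * sigma^2)) at frame t;
    contributions from all boundaries are summed and clamped to [0, 1].
    """
    t = torch.arange(num_frames, dtype=torch.float32)
    labels = torch.zeros(num_frames)
    for t_b in boundary_frames:
        labels += torch.exp(-(t - float(t_b)) ** 2 / (2 * sigma ** 2))
    return labels.clamp(max=1.0)

# Usage with the binary cross-entropy loss on predicted boundary scores:
# targets = gaussian_soft_labels([12, 40], scores.numel())
# loss = torch.nn.functional.binary_cross_entropy(scores, targets)
```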

Figure 4: Visualization of the compressed information. The decoded RGB frames, motion vectors and residuals are presented in different columns. Best viewed in color.

4 Experiments

Implementation details. ResNet50 and ResNet18 [13] pretrained on ImageNet [4] are used to extract the features of the I-frames and P-frames, respectively, in all experiments unless otherwise indicated. Our method is implemented based on the MPEG-4 Part 2 specification [12], where each GOP contains one I-frame followed by a fixed number of P-frames; we sample a subset of P-frames in each GOP (i.e., $T$ in (1)) to reduce redundancy. The network is optimized with standard SGD with momentum and weight decay, trained on NVIDIA Tesla V100 GPUs, and the learning rate is dropped by a constant factor twice during training. We test the running speed of all methods on an NVIDIA Tesla V100 GPU. The source code of our method will be made publicly available after the paper is accepted.

Method HMDB-51 UCF-101
Decoded video based methods (RGB only)
ResNet-50 [13] 48.9 82.3
ResNet-152 [13] 46.7 83.4
ActionFlowNet (2-frames) [25] 42.6 71.0
ActionFlowNet [25] 56.4 83.9
PWC-Net (ResNet-18) + CoViAR [32] 62.2 90.6
TVNet [7] 71.0 94.5
C3D [34] 51.6 82.3
Res3D [35] 54.9 85.8
ARTNet [40] 70.9 94.3
MF-Net [3] 74.6 96.0
S3D [46] 75.9 96.8
I3D RGB [2] 74.8 95.6
Compressed video based methods
EMV-CNN [49] 51.2 (split1) 86.4
DTMV-CNN [50] 55.3 87.5
CoViAR [45] 59.1 90.4
DMC-Net(ResNet-18) [29] 62.8 90.9
DMC-Net(I3D) [29] 71.8 92.3
Ours (ResNet-18) 63.3 91.0
Ours (I3D) 72.1 92.5
Table 1: Accuracy on the HMDB-51 and UCF-101 datasets for both decoded video based methods and compressed video based methods. Our spatial-channel compressed encoder (SCCE) performs favorably against the state-of-the-art compressed video based methods.

Datasets. We conduct our experiments on the Kinetics-GEBD dataset [28], which contains the largest number of temporal boundaries among existing benchmarks. Kinetics-GEBD spans a broad spectrum of video domains in the wild and is open-vocabulary rather than built on a pre-defined taxonomy. In addition, to verify the generality and effectiveness of our method, we also conduct experiments on the popular action recognition datasets UCF101 [31] and HMDB51 [19], which consist of 101 and 51 action classes, respectively.

Rel.Dis. Threshold 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 avg
BMN[23] 0.186 0.204 0.213 0.220 0.226 0.230 0.233 0.237 0.239 0.241 0.223
BMN-StartEnd[28] 0.491 0.589 0.627 0.648 0.660 0.668 0.674 0.678 0.681 0.683 0.640
TCN-TAPOS[28] 0.464 0.560 0.602 0.628 0.645 0.659 0.669 0.676 0.682 0.687 0.627
TCN[21] 0.588 0.657 0.679 0.691 0.698 0.703 0.706 0.708 0.710 0.712 0.685
PC[28] 0.625 0.758 0.804 0.829 0.844 0.853 0.859 0.864 0.867 0.870 0.817
PC + Optical Flow 0.646 0.776 0.818 0.842 0.856 0.864 0.868 0.874 0.877 0.879 0.830
Ours 0.743 0.830 0.857 0.872 0.880 0.886 0.890 0.893 0.896 0.898 0.865
Table 2: The evaluation results on the Kinetics-GEBD validation set with different Rel.Dis. thresholds. Our method improves the F1 score over all thresholds by a large margin.

4.1 Discussion

Kinetics-GEBD. We first train and evaluate the proposed method on the Kinetics-GEBD [28] train-validation split. The evaluation protocol presented in [28] uses the Relative Distance (Rel.Dis., the error between the predicted and ground-truth timestamps relative to the length of the whole video) to determine whether a prediction is correct, and then reports precision, recall, and F1 scores. The results are shown in Table 2. Compared to the previous method PC [28], our method achieves an 11.8% absolute improvement in average F1 while running much faster. Meanwhile, we also add an additional optical flow input stream to PC. Only a slight improvement is observed after integrating optical flow, which indicates that the motion information (i.e., optical flow) alone provides only limited temporal cues for the generic event boundary detection task. Using motion vectors and residuals, our method extracts more informative features for the compressed P-frames by considering both the spatial and channel dimensions. The performance gap between PC with optical flow and the proposed method demonstrates that the proposed method provides strong temporal cues for GEBD.
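For reference, the sketch below shows how an F1 score under a Rel.Dis. threshold can be computed for a single video; the greedy one-to-one matching is an assumption of this sketch, and the official evaluation code should be used for reported numbers.

```python
def gebd_f1(pred_times, gt_times, duration, rel_dis=0.05):
    """F1 under a relative-distance tolerance: a prediction matches an unmatched
    ground-truth boundary if |pred - gt| / duration <= rel_dis."""
    tol = rel_dis * duration
    matched_gt = set()
    tp = 0
    for p in sorted(pred_times):
        # Match to the closest still-unmatched ground-truth boundary.
        candidates = [(abs(p - g), i) for i, g in enumerate(gt_times)
                      if i not in matched_gt]
        if candidates:
            dist, idx = min(candidates)
            if dist <= tol:
                matched_gt.add(idx)
                tp += 1
    precision = tp / max(len(pred_times), 1)
    recall = tp / max(len(gt_times), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return precision, recall, f1
```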

UCF101 and HMDB51. To validate the effectiveness of our method, we also conduct experiments on the action recognition datasets UCF-101 and HMDB-51. We follow the same settings as CoViAR [45], except that we use the spatial-channel compressed encoder to process the motion vectors and residuals. Note that our temporal contrastive module is designed to capture temporal dependencies and is tailored for event boundary detection; thus, it is not applied in the action recognition task. The results are shown in Table 1. Our method achieves competitive results compared with the state-of-the-art methods in the compressed domain, i.e., EMV-CNN [49], DTMV-CNN [50], CoViAR [45] and DMC-Net [29]. We believe this is because our method generates more discriminative P-frame representations with the help of the proposed SCCE module. In contrast to other methods, which process motion vectors and residuals in branches separate from the reference I-frame, we integrate the features of the I-frame under the guidance of the motion vectors and residuals along both the spatial and channel dimensions. In this way, the rich information from the I-frame features, motion vectors and residuals is effectively fused to generate high-quality P-frame features with little overhead. It is worth noting that DMC-Net [29] needs extra optical flow as supervision during the training phase, while our method directly learns discriminative features with the spatial-channel compressed encoder.

4.2 Ablation Study

We conduct several ablation studies to demonstrate the effectiveness of the different components of the proposed method. All experiments are trained on the Kinetics-GEBD train split with the ResNet50 backbone and tested on a local minval split to reduce the computation cost. The local minval split is constructed by randomly sampling videos from the Kinetics-GEBD validation split.

Rec Prec F1 Speed(ms)
PC [28] 0.611 0.631 0.621 46.4
+ E2E 0.629 0.640 0.634 9.3
+ GS 0.665 0.643 0.654 9.3
Table 3: The effectiveness of our proposed end-to-end architecture. "E2E" indicates the end-to-end training strategy and "GS" indicates using soft labels generated by the Gaussian kernel. To study the influence of these components, we simply replace the inputs of the PC method by down-sampling each video into a succession of frames and use a Gaussian kernel to smooth the labels. This strategy improves the accuracy of the PC [28] method with a much faster running speed by reducing redundant computations.

The end-to-end architecture. The previous PC method [28] formulates GEBD as a classification task, feeding the preceding and succeeding frames as inputs to provide temporal context. To verify the effectiveness and efficiency of the proposed end-to-end architecture, we conduct experiments with an architecture identical to PC [28] except for the feature inputs and target labels, as shown in Table 3. Simply replacing the feature inputs of PC [28] with continuous video frames gives a 1.3% absolute performance gain while increasing inference speed by a large margin, which indicates that sharing features between nearby frames is beneficial for the GEBD task. In addition, using the soft labels generated by the Gaussian kernel provides a further 2.0% absolute improvement. The ambiguous "hard labels" disrupt the learning process and lead to poor convergence; our soft-label strategy effectively solves this issue and speeds up the training process.

Method Repre. Rec Prec F1 Speed(ms)
PC [28] - 0.611 0.631 0.621 46.4
OF 0.635 0.658 0.646 69.3
Vanilla 0.643 0.641 0.642 33.2
SCCE 0.709 0.638 0.669 34.5
Ours - 0.665 0.643 0.654 9.3
OF 0.649 0.673 0.661 15.7
Vanilla 0.659 0.656 0.657 4.1
SCCE 0.725 0.651 0.686 4.5
Table 4: Ablation study of different compressed representations. "OF" indicates optical flow and "Vanilla" indicates using a vanilla ResNet-18 to extract the features of motion vectors and residuals. Both the methods derived from PC [28] and our method benefit from optical flow or from motion vectors and residuals. The proposed spatial-channel compressed encoder (SCCE) module further improves the accuracy at a similar running speed.

Compressed representation. We conduct ablation studies on various strategies for using the compressed representations, i.e., (1) using optical flow (OF) in PC [28], (2) replacing the RGB images of P-frames in PC [28] with the compressed representation (i.e., motion vectors and residuals), (3) using our spatial-channel compressed encoder in PC [28], (4) removing the compressed representation from our method, (5) using optical flow in our method, and (6) replacing the spatial-channel compressed encoder in our method with a vanilla encoder. The results are shown in Table 4, and visualization examples of the compressed information are shown in Figure 4. Both PC [28] and our method benefit from optical flow, the compressed representation, and the spatial-channel compressed encoder module, respectively. Specifically, the optical flow branch improves the F1 score compared to the original PC [28] method, but with a much slower inference speed. The compressed information alone brings only limited improvements to PC [28] and to our end-to-end method, with a relatively faster inference speed. This phenomenon indicates that the simple usage of motion vectors and residuals cannot fully exploit the rich information contained in the compressed domain. The proposed spatial-channel compressed encoder provides significant performance improvements without extra computation cost, indicating that the proposed encoder learns more discriminative features of P-frames. This module allows the proposed method to fully exploit the compressed representations and capture crucial motion information from the nearly cost-free motion vectors and residuals.

Window size Rec Prec F1
0.725 0.651 0.686
0.675 0.749 0.710
0.697 0.745 0.720
0.729 0.744 0.736
0.757 0.736 0.746
0.725 0.750 0.737
0.696 0.761 0.727
Table 5: Ablation study of our temporal contrastive module with varying window size $k$. The first row corresponds to removing the temporal contrastive module. This study shows that explicitly learning the temporal dependency is critical for event boundary detection, while the exact value of the window size has limited influence on the performance.

Temporal contrastive module. Besides the discriminative features of P-frames, temporal dependencies are also important for predicting accurate event boundaries. To validate the effectiveness of the temporal contrastive module, we conduct several experiments, shown in Table 5. Without the temporal contrastive module, the overall accuracy (F1 score) decreases dramatically; after adding the proposed temporal module, the F1 score improves sharply. To further analyze the effect of the window size $k$ on model accuracy, we also perform experiments with different values of $k$. Table 5 shows that the recall starts to drop once the window size becomes large. We believe this is because a larger window mixes temporal information across boundaries, blending multiple different predictions and decreasing the recall. Considering the overall performance, we choose a moderate window size as the default setting in our experiments.

Method Rec Prec F1 Speed(ms)
CLA [18] 0.815 0.768 0.791 90.2
CASTANET [14] 0.838 0.732 0.781 93.9
Ours (CSN+R18) 0.813 0.761 0.786 20.4
Ours (R50+R18) 0.751 0.742 0.746 4.7
Table 6: Comparisons with the state-of-the-art methods. The results are evaluated on the validation split; part of the results come from our implementations since the test server is unavailable now. CLA [18] uses the concatenation of pre-trained two-stream TSN [41] and SlowFast [8] features as input, while CASTANET [14] and ours use the pretrained CSN [36] as the backbone. The speed is computed by averaging per-frame decoding and inference time.

Comparisons with the state of the art. We compare the proposed method with the state-of-the-art methods from the CVPR'21 LOng-form VidEo Understanding (LOVEU) Challenge (https://sites.google.com/view/loveucvpr21). CLA [18] uses a contrastive learning based approach for GEBD and utilizes the temporal self-similarity matrix (TSM) as an intermediate representation. However, it relies on pre-extracted features and uses a global similarity matrix, which hurts the model's scalability. CASTANET [14] adopts the same framework as PC [28] except for the feature extractor and thus introduces redundant computation between nearby frames. Our method with the ResNet50 backbone runs extremely fast, much faster than CLA [18]. After replacing the I-frame feature extractor with the more powerful CSN [36] backbone, we achieve a competitive result, i.e., 0.787 compared with CLA 0.795 and CASTANET 0.784, while improving the inference speed by a large margin. These results show the efficiency of working in the compressed domain with a lightweight network as the P-frame feature extractor and the effectiveness of the proposed high-quality representation learning.

5 Conclusion

In this work, we propose an end-to-end compressed video representation learning method for GEBD. Specifically, we convert the video input into successive frames and use a Gaussian kernel to preprocess the annotations. Meanwhile, we design a spatial-channel compressed encoder that makes full use of the motion vectors and residuals to learn discriminative feature representations for P-frames. After that, we propose a temporal contrastive module to model the temporal dependencies between frames and generate accurate event boundaries. Extensive experiments conducted on the Kinetics-GEBD dataset demonstrate that the proposed method performs favorably against the state-of-the-art methods.

6 Acknowledgement

This work was supported by the Key Research Program of Frontier Sciences, CAS, Grant No. ZDBS-LY-JSC038. Libo Zhang was supported by the CAAI-Huawei MindSpore Open Fund and the Youth Innovation Promotion Association, CAS (2020111).

References