
Structured Context Transformer for Generic Event Boundary Detection

06/07/2022 · by Congcong Li, et al.

Generic Event Boundary Detection (GEBD) aims to detect moments that humans naturally perceive as event boundaries. In this paper, we present the Structured Context Transformer (SC-Transformer) to solve the GEBD task, which can be trained in an end-to-end fashion. Specifically, we use a backbone convolutional neural network (CNN) to extract the features of each video frame. To capture the temporal context information of each frame, we design the structured context transformer (SC-Transformer) by re-partitioning the input frame sequence; notably, its overall computational complexity is linear in the video length. After that, group similarities are computed to capture the differences between frames, and a lightweight fully convolutional network is used to determine the event boundaries based on the grouped similarity maps. To remedy the ambiguities of boundary annotations, a Gaussian kernel is adopted to preprocess the ground-truth event boundaries, which further boosts accuracy. Extensive experiments conducted on the challenging Kinetics-GEBD and TAPOS datasets demonstrate the effectiveness of the proposed method compared to state-of-the-art methods.


1 Introduction

Video has accounted for a large part of human life in recent years. Aided by the rapid development of hardware, video understanding has witnessed an explosion of newly designed architectures [13, 38, 6, 7, 29, 24] and datasets [16, 36, 34, 18, 31]. Cognitive science [39] suggests that humans naturally divide video into meaningful units. To enable machines to develop such an ability, Generic Event Boundary Detection (GEBD) [35] was recently proposed, which aims at localizing the moments where humans naturally perceive event boundaries.

Figure 1: Overview architecture of the proposed method. The proposed method can predict all boundaries of a video sequence in a single forward pass with high efficiency. We use a CNN backbone to extract the 2D features of each video frame. These features are then pooled into vectors and converted into a sequence. The structured partition of sequence (SPoS) mechanism is employed to re-partition the input frame sequence and provide structured context for each candidate frame. Based on this structured context, Transformer encoder blocks are used to learn high-level representations of each local sequence, which have linear computational complexity with respect to the video length and enable feature sharing. After that, we compute the group similarities to encode frame differences and use a lightweight fully convolutional network (FCN) to predict event boundaries based on the computed 2D grouped similarity maps.

Event boundaries in the GEBD task are taxonomy-free in nature and can be seen as a new attempt to connect human perception mechanisms to video understanding. Annotators are required to localize boundaries at a granularity “one level deeper” than the video-level event. To remedy the ambiguities of event boundaries arising from human perception, five different annotators are employed for each video to label the boundaries based on predefined principles. These characteristics differentiate GEBD from previous video localization tasks [42]: boundaries are determined by several high-level causes, for example, 1) Change of Subject, i.e., a new subject appears or an old subject disappears; 2) Change of Action, i.e., an old action ends or a new action starts; 3) Change in Environment, i.e., significant changes in the color or brightness of the environment; 4) Change of Object of Interaction, i.e., the subject starts to interact with a new object or finishes with an old object. These factors make GEBD a more challenging task than video localization.

Solving the GEBD task is not trivial, since detecting event boundaries relies heavily on temporal context information. Existing methods tackle this problem either by processing each frame individually [35, 37, 11] or by computing a global self-similarity matrix and using an extra parsing algorithm to find boundary patterns on it [15, 14]. The methods in the first category introduce substantial redundant computation for adjacent frames in a video sequence when predicting boundaries, and they have to address the class imbalance of event boundaries. The methods in the second category have quadratic computational complexity with respect to the length of the input video, due to the global computation of self-attention, and require an extra parsing algorithm to predict boundaries.

To this end, we propose an end-to-end method that predicts all boundaries of a video sequence in a single forward pass of the network with high efficiency. The overall architecture of the proposed method is shown in Figure 1. Specifically, the Structured Context Transformer (SC-Transformer) is designed for GEBD based on the proposed structured partition of sequence (SPoS) mechanism, which has linear computational complexity with respect to the input video length and enables feature sharing by design. SPoS assigns a local feature sequence to each frame in a one-to-one manner, which we term structured context. We also find that 1D CNNs effectively make the candidate frame attend to adjacent frames in a Gaussian-distributed manner [27], which is not optimal for boundary detection, as adjacent frames are equally important. Our proposed SC-Transformer can learn a high-level representation for each frame within its structured context, which is critical for boundary detection. After that, we use group similarity to exploit discriminative features that encode the differences between frames. The concept of groups as a dimension for model design has been widely studied, including group convolutions [17, 43], group normalization [41], multi-head self-attention [40], etc. However, to the best of our knowledge, there is still no study on grouped similarity learning; previous methods [15, 14, 37] effectively compute the similarity matrix with a single group. Our proposed group similarity allows the network to learn a varied set of similarities, which we find effective for GEBD. Following the group similarity, a lightweight fully convolutional network (FCN) [26] is used to predict event boundaries. Note that, to speed up the training phase, a Gaussian kernel is used to preprocess the ground-truth event boundaries. Extensive experiments conducted on the challenging Kinetics-GEBD and TAPOS datasets demonstrate the effectiveness of the proposed method compared to state-of-the-art methods. Specifically, compared to DDM-Net [37], our method produces a 1.3% absolute improvement, and compared to PC [35], it achieves a 15.2% absolute improvement with faster running speed. We also conduct several ablation studies to analyze the effectiveness of the different components of the proposed method. We hope the proposed method can inspire future work.

The main contributions of this paper are summarized as follows. (1) We propose the structured context transformer for GEBD, which can be trained in an end-to-end fashion. (2) To capture the differences between frames, we compute group similarities that exploit discriminative features and use a lightweight FCN to predict the event boundaries. (3) Extensive experiments conducted on the challenging Kinetics-GEBD and TAPOS datasets demonstrate the effectiveness of the proposed method compared to state-of-the-art methods.

2 Related Works

Generic Event Boundary Detection (GEBD). The goal of GEBD [35] is to localize the taxonomy-free event boundaries that break a long event into several short temporal segments. Different from temporal action localization (TAL), GEBD only requires predicting the boundaries of each continuous segment. Current methods [14, 11, 32] all follow a fashion similar to [35]: a fixed number of video frames before and after the candidate frame is taken as input, and each candidate frame is classified separately as an event boundary or not. Kang et al. [14] propose to use the temporal self-similarity matrix (TSM) as the intermediate representation and use a popular contrastive learning method to exploit discriminative features for better performance. Hong et al. [11] use cascaded classification heads and a dynamic sampling strategy to boost both recall and precision. Rai et al. [32] attempt to learn spatiotemporal features using a two-stream inflated 3D convolution architecture.

Temporal Action Localization (TAL). TAL aims to localize action segments in untrimmed videos. More specifically, for each action segment, the goal is to detect the start point, the end point, and the action class it belongs to. Most approaches can be categorized into two groups: two-stage methods [33, 30, 2, 46, 3] and single-stage methods [19, 22, 1, 25, 44, 28, 44, 45]. In the two-stage setting, the first stage generates action segment proposals; the actionness and action type of each proposal are then determined by the second stage, along with post-processing steps such as grouping [46] and non-maximum suppression (NMS) [21] to eliminate redundant proposals. For single-stage methods, classification is performed on pre-defined anchors [22, 25] or video frames [28, 44]. Even though the TAL task bears some similarity to GEBD, there is no straightforward way to directly apply these methods to the GEBD dataset, since GEBD requires event boundaries to be taxonomy-free and continuous, which differs from the TAL setting.

Transformers. The Transformer [40] is a prominent deep learning model that has achieved superior performance in various fields, such as natural language processing (NLP) and computer vision (CV). Despite its success, the computational complexity of its self-attention is quadratic in the image size, making it hard to apply to high-resolution images. To address this issue, Swin Transformer [23] proposes a hierarchical Transformer whose representation is computed with shifted windows and which has linear computational complexity with respect to image size. In this paper, we show that these Transformer variants are not suitable for GEBD.

3 Method

Existing methods [35, 37, 11] formulate the GEBD task as binary classification, predicting the boundary label of each frame by considering its temporal context information. However, this is inefficient because redundant computation is performed when generating the representations of consecutive frames. To remedy this, we propose an end-to-end, efficient, and straightforward method for GEBD, which regards each video clip as a whole. Specifically, given a video clip of arbitrary length, we first use a conventional CNN backbone to extract the 2D feature representation of each frame and obtain the frame sequence $V = \{f_1, f_2, \dots, f_T\}$, where $f_t \in \mathbb{R}^C$ and $T$ is the length of the video clip. Then the structured partition of sequence (SPoS) mechanism is employed to re-partition the input frame sequence and provide structured context for each candidate frame. Transformer encoder blocks [40] are then used to learn a high-level representation of each local sequence. After that, we compute the group similarities to capture temporal changes, and a subsequent lightweight fully convolutional network (FCN) [26] is used to recognize the different patterns of the grouped 2D similarity maps. We introduce the details of each module in the following sections. The overall architecture of the proposed method is presented in Figure 1.

3.1 Structured Context Transformer

The existence of an event boundary in a video clip implies a visual content change at that point, so it is very difficult to infer a boundary from a single frame. As a result, the key for event boundary detection is to localize changes in the temporal domain. Modeling in the temporal domain has long been explored by different approaches, including LSTMs [10], Transformers [40], 3D convolutional neural networks [38], etc. The Transformer [40] has recently demonstrated promising results on both natural language processing (NLP) and computer vision tasks. Despite its success, it is difficult to apply the Transformer directly to the GEBD task due to the quadratic computational complexity of self-attention: the computation cost and memory consumption increase dramatically as the video length increases. Previous methods [35, 37] regard each individual frame as one sample and feed its nearby frames into the network together to provide temporal information for that frame. This introduces redundant computation for adjacent frames, since each frame is fed into the network multiple times. In this paper, we seek a more general and efficient temporal representation for the GEBD task.

Structured Partition of Sequence. Given the video snippet $V = \{f_1, f_2, \dots, f_T\}$, where $T$ is the time span of the video snippet and can be of arbitrary length, and $f_t$ is the feature vector of frame $t$, generated by a ResNet-50 [9] backbone followed by a global average pooling layer, our goal is to obtain the $k$ adjacent frames before candidate frame $f_t$ and the $k$ adjacent frames after $f_t$, where $k$ is the adjacent window size. We term this local sequence centred on candidate frame $f_t$ the structured context of frame $f_t$. To accomplish this while enabling feature sharing and maintaining efficiency and parallelism, we propose the novel Structured Partition of Sequence (SPoS) mechanism. Specifically, we first pad the video $V$ with zero vectors at the end of the frame sequence so that the new video length is divisible by $k$. Then, given the padded video $V'$, we split it into $k$ slices, where each slice $P_i$ ($i$ is the slice index, starting from $0$) is responsible for providing structured context frames for all frames whose index starts from $i$ with a step of $k$. In this way, all video frames are covered by the $k$ slices, and these slices can be processed efficiently in parallel.

Figure 2: Illustration of the proposed structured partition of sequence (SPoS). To obtain the $k$ adjacent frames before candidate frame $f_t$ (denoted as $L_t$) and the $k$ frames after it (denoted as $R_t$), we split the input video sequence into $k$ slices. Each slice is responsible for producing $L_t$ and $R_t$ for the frames of specific indices (i.e., all frames that start from index $i$ with a step of $k$). All video frames are covered by the slices, which can be processed efficiently in parallel. Our SPoS differs from Swin-Transformer [23], which tends to learn a global representation after several stacks and is not structured, and from 1D CNNs, which make the candidate frame attend to adjacent frames in a Gaussian-distributed manner [27].

In each frame slice $P_i$, we obtain the structured context of frame $f_t$ in two directions, i.e., the $k$ frames before $f_t$ and the $k$ frames after $f_t$. We implement this through the efficient memory view methods provided by modern deep learning frameworks. Specifically, to obtain the $k$ structured context frames before $f_t$, we replicate the first frame of the padded video sequence, concatenate the copies to the beginning of the video, and drop an equal number of frames at the end, so that the number of frames is unchanged and still divisible by $k$. We denote this shifted video sequence as $V^l$. Then we view $V^l$ as consecutive $k$-frame chunks of shape $(N_i, k, C)$, where $N_i$ denotes the number of frames processed in slice $P_i$. In this way, we obtain the left structured context $L_t$ for all frames handled by the slice (i.e., all frames of the original video $V$ whose index starts from $i$ with a step of $k$). Similarly, to obtain the $k$ structured context frames after $f_t$, we replicate the last frame of the padded video sequence, concatenate the copies to the end of the video, and drop the same number of frames at the beginning to keep the number of frames; viewing the resulting sequence in the same way yields the right structured context $R_t$. Finally, we obtain the temporal context of all frames by repeating this procedure for the $k$ slices, so that each frame is represented by its adjacent frames in a local window.
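To make the effect of SPoS concrete, the following is a minimal PyTorch sketch (our own illustration, not the paper's released code): it gathers the $k$ left and $k$ right neighbor features of every frame in parallel, replacing the slice-and-view bookkeeping described above with an equivalent replicate-pad-and-unfold formulation.

import torch

def structured_context(x: torch.Tensor, k: int = 8):
    """Collect k left and k right neighbor features for every frame.

    x: (T, C) per-frame features from the CNN backbone.
    Returns (left, right), each of shape (T, k, C).
    Boundary frames are handled by replicating the first/last frame, so every
    frame receives a fixed-size local window that can be processed in parallel.
    """
    T, C = x.shape
    # Replicate-pad k frames on both ends: (T + 2k, C).
    padded = torch.cat([x[:1].expand(k, C), x, x[-1:].expand(k, C)], dim=0)
    # Sliding windows of length 2k + 1, one centred on each original frame.
    windows = padded.unfold(0, 2 * k + 1, 1)   # (T, C, 2k + 1)
    windows = windows.permute(0, 2, 1)         # (T, 2k + 1, C)
    left = windows[:, :k, :]                   # frames t-k .. t-1
    right = windows[:, k + 1:, :]              # frames t+1 .. t+k
    return left, right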

A key design element of our structured partition of sequence (SPoS) is its shared structured context information. We term this context information “structured” because SPoS maps each candidate frame to the individual frame sequences $L_t$ and $R_t$ in a one-to-one manner, which is the key to accurate boundary detection. Our SPoS differs from Swin-Transformer [23] in that Swin-Transformer allows each frame to attend to very distant frames (i.e., it tends to learn a global representation) due to its stacked shifted-window design. This is deleterious for boundary detection, as very distant frames may cross multiple boundaries and thus provide less useful information. Another advantage of SPoS is that, thanks to its locally shared and parallel nature, we can model the structured sequences with any sequential modeling method without worrying about computational complexity, since the computation remains linear in the video length.

Encoding with Transformer. We use a Transformer to model the structured context information. Given the structured context features $L_t$ and $R_t$ of frame $f_t$, we first concatenate them with $f_t$ in the temporal dimension to obtain the context sequence $F_t$ of frame $f_t$, i.e.,

$F_t = \mathrm{Concat}(L_t, f_t, R_t),$   (1)

where $F_t \in \mathbb{R}^{(2k+1) \times C}$ and $\mathrm{Concat}(\cdot)$ denotes the concatenation operation. Then, to model temporal information, we adapt a 6-layer Transformer [40] block to process the context sequence and obtain a temporal representation within this structured context window. Unlike other methods [14, 15], where multi-head self-attention (MSA) is computed over the global video frame sequence, our MSA computation is based only on the local temporal window. The computational complexity of the former is quadratic in the video length $T$, while the computational complexity of our method is linear in $T$ when $k$ is fixed (set to $k = 8$ by default). Global self-attention computation is generally unaffordable for a large video length $T$, while our local structured self-attention is scalable.
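As a sketch of this encoding step (assuming the context tensors gathered above, and with placeholder layer sizes rather than the paper's exact configuration), each frame's $2k+1$-token window is treated as an independent sequence for a standard Transformer encoder, so self-attention never crosses window boundaries and the total cost grows linearly with $T$.

import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Encode each frame's structured context with a small Transformer."""

    def __init__(self, dim: int = 256, num_layers: int = 6, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, left, center, right):
        # left, right: (T, k, dim); center: (T, dim), already projected to `dim`.
        tokens = torch.cat([left, center.unsqueeze(1), right], dim=1)  # (T, 2k+1, dim)
        # Each window is an independent "batch" element, so attention stays local.
        return self.encoder(tokens)                                    # (T, 2k+1, dim)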

3.2 Group Similarity

The event boundaries of the GEBD task can be located at moments where, for example, the action changes (e.g., run to jump), the subject changes (e.g., a new person appears), or the environment changes (e.g., the scene suddenly becomes bright). We experimentally observed that frames within an adjacent local window provide more cues for event boundary detection than distant frames. This is consistent with human intuition, since a change of visual content can be regarded as an event boundary only within a short time period. Based on this observation, we can naturally model local temporal information based on the structured context features extracted in Section 3.1.

The Transformer block aims at discovering relationships between frames and producing a high-level representation of the frame sequence. However, event boundaries emphasize the differences between adjacent frames, and neural networks tend to take shortcuts during learning [8]. Thus, classifying these frames directly into boundaries may lead to inferior performance due to non-explicit cues. Based on this intuition, we propose to guide the classification with the feature similarity of each frame pair in the structured temporal window. Instead of performing the similarity calculation over all $C$-dimensional channels, we find it beneficial to split the channels into several groups and calculate the similarity of each group independently. The concept of groups as a dimension for model design has been widely studied, including group convolutions [17, 43], group normalization [41], multi-head self-attention [40], etc. However, to the best of our knowledge, there is still no study on similarity learning with grouping. Formally, given $F_t$, we first split it into $G$ groups along the channel dimension:

$F_t = \{F_t^1, F_t^2, \dots, F_t^G\},$   (2)

where $F_t^g \in \mathbb{R}^{(2k+1) \times C/G}$ for $g = 1, \dots, G$. Then the group similarity map is calculated from the grouped features:

$S_t^g(i, j) = \mathrm{sim}\big(F_t^g[i], F_t^g[j]\big),$   (3)
Figure 3: Visualization of grouped similarity maps $S_t^g$. The first row indicates that there is a potential boundary in this local sequence, while the second row shows no boundary in this sequence. We can also observe slightly different patterns across groups for the same sequence, which may imply that each group learns a different aspect.

where $S_t^g \in \mathbb{R}^{(2k+1) \times (2k+1)}$ and $\mathrm{sim}(\cdot, \cdot)$ can be cosine similarity or Euclidean similarity. As the group similarity map $S_t^g$ contains the similarity patterns (the similarity score of each frame pair, i.e., a high response value when two frames are visually similar), it exhibits different patterns in different sequences (as shown in Figure 3), which are critical for boundary detection. To keep our model as simple as possible, we use a 4-layer fully convolutional network [26] to learn the similarity patterns, which we find works very well and is efficient enough. We then average-pool the output of the FCN to obtain a vector representation $e_t$, which is used for downstream classification:

$e_t = \mathrm{AvgPool}\big(\mathrm{FCN}(S_t)\big),$   (4)

where $e_t$ denotes the intermediate representation of frame $f_t$ and $S_t$ collects the $G$ grouped similarity maps. The design principle of this module is extremely simple: compute the group similarity patterns within the local structured context based on the previously encoded $F_t$, and use a small FCN to analyse the patterns.
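A minimal sketch of the group similarity computation and the downstream FCN follows, assuming the encoded windows `tokens` of shape (T, 2k+1, C) from the previous step; the number of groups and the FCN channel widths are illustrative choices rather than the paper's.

import torch
import torch.nn as nn
import torch.nn.functional as F

def group_similarity(tokens: torch.Tensor, groups: int = 4) -> torch.Tensor:
    """Grouped cosine similarity maps for each local window.

    tokens: (T, L, C) encoded structured context, L = 2k + 1.
    Returns: (T, G, L, L), one L x L similarity map per channel group.
    """
    T, L, C = tokens.shape
    g = tokens.view(T, L, groups, C // groups).permute(0, 2, 1, 3)  # (T, G, L, C/G)
    g = F.normalize(g, dim=-1)                                      # cosine similarity
    return torch.matmul(g, g.transpose(-1, -2))                     # (T, G, L, L)

# A small FCN reads boundary patterns from the 2D maps and is average-pooled
# into one vector per frame; channel widths here are placeholders.
fcn = nn.Sequential(
    nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
)
# e = fcn(group_similarity(tokens)).flatten(1)   # (T, 64) intermediate representations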

3.3 Optimization

Our SC-Transformer and group similarity module are fully end-to-end, lightweight, and in-place, i.e., there is no dimension change between input and output. Therefore, they can be used directly for the subsequent classification, which is straightforward to implement and optimize. After the group similarity module, the video frame sequence is represented by the intermediate features $E = \{e_1, e_2, \dots, e_T\}$. We then stack 3 layers of 1D convolution to predict the boundary scores and use a single binary cross entropy loss to optimize our network.
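For illustration, a sketch of this classification head (channel width and kernel size are placeholder choices, not the paper's): three 1D convolutions slide over the per-frame representations and output one boundary logit per frame.

import torch.nn as nn

# Input: (1, C', T) sequence of intermediate representations; output: (1, 1, T) logits.
head = nn.Sequential(
    nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv1d(64, 1, kernel_size=3, padding=1),
)
# Training: F.binary_cross_entropy_with_logits(head(feats).squeeze(1), soft_labels)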

GEBD is a taxonomy-free task and connects the mechanism of human perception to deep video understanding. The event boundary labels of each video are annotated by around 5 different annotators to capture differences in human perception and thereby ensure diversity. However, this introduces ambiguity in the annotations, which makes the network hard to optimize and may lead to poor convergence. To solve this issue and prevent the model from predicting event boundaries too confidently, we use a Gaussian distribution to smooth the ground-truth boundary labels and obtain soft labels instead of using the “hard labels” of boundaries. Specifically, for each annotated boundary, the intermediate label of a neighboring position is computed as

$\hat{y}_t = \exp\!\left(-\frac{(t - t_b)^2}{2\sigma^2}\right),$   (5)

where $\hat{y}_t$ denotes the intermediate label at time $t$ corresponding to the annotated boundary at time $t_b$, and the standard deviation $\sigma$ is kept fixed in all our experiments. The final soft labels are computed as the summation of all intermediate labels. Finally, a binary cross entropy loss is used to minimize the difference between the model predictions and the soft labels.
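A sketch of this label smoothing (the value of sigma and the clamp to [0, 1] are our assumptions for illustration): each annotated boundary spreads a Gaussian bump over neighboring frames, and the bumps are summed into soft targets for the binary cross entropy loss.

import torch

def soft_boundary_labels(boundary_frames, T: int, sigma: float = 1.0) -> torch.Tensor:
    """Convert annotated boundary frame indices into soft labels of length T."""
    t = torch.arange(T, dtype=torch.float32)
    labels = torch.zeros(T)
    for t_b in boundary_frames:
        # Gaussian bump centred on the annotated boundary, as in Equation (5).
        labels += torch.exp(-(t - float(t_b)) ** 2 / (2 * sigma ** 2))
    return labels.clamp(max=1.0)

# targets = soft_boundary_labels([12, 47], T=100)
# loss = F.binary_cross_entropy_with_logits(pred_logits, targets)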

4 Experiments

We show that our method achieves competitive results compared to previous methods in quantitative evaluations on Kinetics-GEBD [35] and TAPOS [34]. We then provide a detailed ablation study of different model designs with insights and quantitative results.

Dataset. We perform experiments on both the Kinetics-GEBD dataset [35] and the TAPOS dataset [34]. The Kinetics-GEBD dataset spans a broad spectrum of video domains in the wild and is open-vocabulary and taxonomy-free; its videos are randomly selected from Kinetics-400 [16]. The ratio of training, validation, and testing videos in Kinetics-GEBD is nearly 1:1:1. Since the ground-truth labels of the testing videos are not released, we train our model on the training set and evaluate on the validation set. The TAPOS dataset contains Olympic sports videos across 21 action classes, split into training and validation action instances. Following [35], we re-purpose TAPOS for the GEBD task by trimming each action instance with its action label hidden and conducting experiments on each action instance.

Evaluation Protocol. To quantitatively evaluate the results of the generic event boundary detection task, the F1 score is used as the measurement metric. As described in [35], Rel.Dis. (relative distance, the error between the detected and ground-truth timestamps divided by the length of the corresponding whole action instance) is used to determine whether a detection is correct (i.e., within the threshold) or incorrect (i.e., beyond the threshold). A detection result is compared against each rater’s annotation, and the highest F1 score is treated as the final result. We report F1 scores for thresholds ranging from 0.05 to 0.5 with a step of 0.05.
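For concreteness, a sketch of this protocol as we read it: a detection counts as correct if its relative distance to a not-yet-matched ground-truth boundary is within the threshold, and precision, recall, and F1 follow from the resulting one-to-one matching. The greedy matching order below is our assumption, not the official evaluation script.

def boundary_f1(preds, gts, duration, threshold=0.05):
    """F1 between predicted and ground-truth boundary timestamps (in seconds)."""
    matched = [False] * len(gts)
    tp = 0
    for p in preds:
        for i, g in enumerate(gts):
            if not matched[i] and abs(p - g) / duration <= threshold:
                matched[i] = True
                tp += 1
                break
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)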

Implementation Details. For a fair comparison with other methods, a ResNet-50 [9] pretrained on ImageNet [4] is used as the basic feature extractor in all experiments unless otherwise indicated; note that we do not freeze the parameters of the ResNet-50, and they are optimized through backpropagation. Images are resized to 224×224 following [35]. We uniformly sample 100 frames from each video for batching purposes, i.e., $T = 100$ in Section 3. We use standard SGD with momentum and weight decay. We set the batch size to 4 videos (equivalent to 400 frames) per GPU and train the network on NVIDIA Tesla V100 GPUs; automatic mixed precision training is used to reduce the memory burden. The network is trained for a fixed number of epochs, with the learning rate dropped by a constant factor at two milestones. All the source code of our method will be made publicly available after our paper is accepted.

Figure 4: Example qualitative results on Kinetics-GEBD validation split. Compared with PC [35], our SC-Transformer can generate more accurate boundaries which are consistent with ground truth.
Rel.Dis. threshold 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 avg
BMN [21] 0.186 0.204 0.213 0.220 0.226 0.230 0.233 0.237 0.239 0.241 0.223
BMN-StartEnd [21] 0.491 0.589 0.627 0.648 0.660 0.668 0.674 0.678 0.681 0.683 0.640
TCN-TAPOS [20] 0.464 0.560 0.602 0.628 0.645 0.659 0.669 0.676 0.682 0.687 0.627
TCN [20] 0.588 0.657 0.679 0.691 0.698 0.703 0.706 0.708 0.710 0.712 0.685
PC [35] 0.625 0.758 0.804 0.829 0.844 0.853 0.859 0.864 0.867 0.870 0.817
SBoCo-Res50 [15] 0.732 - - - - - - - - - 0.866
DDM-Net [37] 0.764 0.843 0.866 0.880 0.887 0.892 0.895 0.898 0.900 0.902 0.873
Ours 0.777 0.849 0.873 0.886 0.895 0.900 0.904 0.907 0.909 0.911 0.881
Table 1: F1 results on Kinetics-GEBD validation split with Rel.Dis. threshold set from 0.05 to 0.5 with 0.05 interval.
Rel.Dis. threshold 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 avg
ISBA [5] 0.106 0.170 0.227 0.265 0.298 0.326 0.348 0.369 0.382 0.396 0.302
TCN [20] 0.237 0.312 0.331 0.339 0.342 0.344 0.347 0.348 0.348 0.348 0.330
CTM [12] 0.244 0.312 0.336 0.351 0.361 0.369 0.374 0.381 0.383 0.385 0.350
TransParser [34] 0.289 0.381 0.435 0.475 0.500 0.514 0.527 0.534 0.540 0.545 0.474
PC [35] 0.522 0.595 0.628 0.646 0.659 0.665 0.671 0.676 0.679 0.683 0.642
DDM-Net [37] 0.604 0.681 0.715 0.735 0.747 0.753 0.757 0.760 0.763 0.767 0.728
Ours 0.618 0.694 0.728 0.749 0.761 0.767 0.771 0.774 0.777 0.780 0.742
Table 2: F1 results on TAPOS validation split with Rel.Dis. threshold set from 0.05 to 0.5 with 0.05 interval.

4.1 Main Results

Kinetics-GEBD. Table 1 presents the results of our models on the Kinetics-GEBD validation set. Our method surpasses all previous methods under all Rel.Dis. threshold settings, demonstrating the effectiveness of the structured partition of sequence and group similarity. Compared to PC [35], our method achieves a 15.2% absolute improvement with faster running speed (10.8 ms per frame for PC vs. 1.9 ms per frame for ours). Compared to DDM-Net [37], we also achieve a 1.3% absolute improvement. Since DDM-Net is not open-sourced yet, we are unable to compare its runtime with our method. However, it is worth noting that DDM-Net leverages the same input representation as PC [35], i.e., each frame and its adjacent frames are fed into the network individually, which introduces many redundant computations. For example, given a video clip of 100 frames and a window size of 11, as mentioned in their paper, they have to process 1,100 frames individually to obtain all boundary predictions for this single video. In contrast, our method avoids this redundancy and obtains all boundary predictions in a single forward pass by feeding only the necessary 100 frames. Example qualitative results on Kinetics-GEBD are shown in Figure 4.

TAPOS. We also conduct experiments on the TAPOS dataset [34], which contains Olympic sports videos of 21 action classes and is not directly suitable for the GEBD task. Following [35], we re-purpose TAPOS for GEBD by trimming each action instance with its action label hidden, resulting in a more fine-grained sub-action boundary detection dataset. The results are presented in Table 2. We boost the F1@0.05 score by 9.6% and 1.4% compared with PC [35] and DDM-Net [37], respectively. This verifies the effectiveness of our method and shows that it can learn robust feature representations in different scenes.

4.2 Ablations

The structured partition of sequence re-partitions the video frame sequence into a format better suited to the GEBD task. Based on this unified and shared representation, we use the simple yet effective group similarity to capture differences between frames. In our ablation analysis, we explore how each component of our method and the loss influence the final performance. For this study we conduct experiments on the Kinetics-GEBD dataset and use ResNet-50 as the backbone. In these experiments, we only present F1 scores at the 0.05, 0.25, and 0.5 Rel.Dis. thresholds due to limited space. The Average column indicates the average F1 score over Rel.Dis. thresholds from 0.05 to 0.5 with a 0.05 interval.

Importance of structured partition of sequence (SPoS). The structured partition of sequence provides shared local temporal context for each frame to predict event boundaries. To verify its effectiveness, we remove it completely and use a 1D convolutional neural network or the shifted-window (Swin) representation [23] as replacements; results are shown in Table 3. We observe a significant performance drop after replacing SPoS. This can be interpreted as follows: 1D CNNs only enlarge the receptive field of each candidate frame, and their impact is actually distributed as a Gaussian [27], which is not optimal for event boundary detection since nearby frames may have equal importance. As for Swin [23], it is designed to relieve the Transformer’s global self-attention computation burden by leveraging non-overlapping shifted windows, and each frame can attend to very distant frames after several Swin Transformer block stacks. We think this is not aligned with the GEBD task, since adjacent frames are more important, while distant frames may cross multiple different boundaries and thus disturb convergence. This also verifies that the structured representation is crucial for accurate boundary detection.

Representation 0.05 0.25 0.5 Average
1D CNN 0.609 -0.168 0.838 -0.057 0.864 -0.044 0.810 -0.071
Swin [23] 0.703 -0.074 0.870 -0.025 0.891 -0.017 0.849 -0.032
SPoS 0.777 - 0.895 - 0.911 - 0.881 -
Table 3: Importance of structured partition of sequence (SPoS). When replacing our SPoS with a 1D convolutional neural network or Swin-Transformer [23] (non-overlapping shifted-window representation), we observe a significant performance drop. The number after each score shows the difference with respect to our SPoS. This verifies that SPoS is crucial for boundary detection.

Adjacent window size $k$. The adjacent window size $k$ defines how far the subsequent modules can capture context information in the temporal domain. A smaller $k$ may not capture enough of the necessary context information for a boundary, while a larger $k$ will introduce noisy information when crossing two or more different boundaries. As presented in Table 4, we observe different F1 scores when varying $k$. We believe that event boundaries in a video may require different numbers of frames to be recognized; hence, intuitively, different kinds of boundaries may prefer different window sizes $k$. Although a more sophisticated mechanism such as an adaptive window size may further increase performance, we choose a fixed-length window in all our experiments for simplicity and leave this as future work. The performance gain diminishes as $k$ increases, and we choose $k = 8$ as the adjacent window size.

Window size 0.05 0.25 0.5 Average
0.745 0.854 0.869 0.842
0.762 0.881 0.897 0.867
0.771 0.889 0.904 0.875
0.777 0.895 0.911 0.881
0.776 0.894 0.912 0.880
0.777 0.896 0.912 0.882
Table 4: Effect of the adjacent window size $k$. Different F1 scores are observed when varying $k$. This can be interpreted as follows: a smaller $k$ may not capture enough of the necessary context information for a boundary, while a larger $k$ will introduce noisy information when crossing two or more different boundaries.

Effect of model width. In Table 5 we study the model width (number of channels). The default width gives the best performance.

0.05 0.25 0.5 Average
0.775 0.895 0.913 0.881
0.777 0.895 0.911 0.881
0.774 0.892 0.908 0.879
0.768 0.887 0.904 0.875
0.770 0.889 0.905 0.876
Table 6: Effect of the number of groups $G$.
0.05 0.25 0.5 Average
0.761 0.871 0.887 0.861
0.769 0.891 0.907 0.877
0.777 0.895 0.911 0.881
0.778 0.896 0.913 0.882
0.777 0.896 0.912 0.881
Table 5: Effect of the model width.

Number of groups. We evaluate the importance of group similarity by changing the number of groups $G$; the results are shown in Table 6. We observe steady performance improvements when increasing $G$, with saturation at larger values of $G$. This result shows the effectiveness of grouping channels when computing the similarity.

Effect of similarity function. We explore different distance metrics (which we call similarities, since the negative of the distance is used) in Table 7. The results show that our method is robust to the choice of metric, and we use the cosine metric in our experiments.

Function 0.05 0.25 0.5 Average
Chebyshev 0.770 0.887 0.905 0.872
Manhattan 0.774 0.894 0.907 0.878
Euclidean 0.776 0.895 0.910 0.881
Cosine 0.777 0.895 0.911 0.881
Table 7: Effect of the similarity function $\mathrm{sim}(\cdot, \cdot)$ in Equation 3.

Loss ablations. The GEBD task can be regarded as frame-wise binary classification (boundary or not) after capturing temporal context information. We train our model with the binary cross entropy (BCE) loss and the mean squared error (MSE) loss, with Gaussian smoothing (introduced in Section 3.3) turned on and off. As shown in Table 8, Gaussian smoothing improves the performance in both settings, which shows its effectiveness. We attribute this improvement to two aspects: 1) consecutive frames have similar feature representations in the latent space and thus tend to produce close responses; hard labels violate this property and lead to poor convergence; 2) the annotations of GEBD are inherently ambiguous, and Gaussian smoothing prevents the network from becoming overconfident. We use the “BCE + Gaussian” setting in all our experiments.

BCE MSE Gaussian 0.05 0.25 0.5 Average
0.758 0.881 0.899 0.865
0.771 0.893 0.910 0.877
0.763 0.887 0.905 0.872
0.777 0.895 0.911 0.881
Table 8: Effect of the loss function.

5 Conclusions

In this work, we presented the SC-Transformer, a fully end-to-end method for generic event boundary detection. The structured partition of sequence mechanism is proposed to provide structured context information for the GEBD task, and a Transformer encoder is adapted to learn high-level representations. Group similarity and an FCN are then used to exploit discriminative features and make accurate predictions. A Gaussian kernel is used to preprocess the ground-truth annotations to speed up the training process. The proposed method achieves state-of-the-art results on the challenging Kinetics-GEBD and TAPOS datasets with high running speed. We hope our method can inspire future work.

References

  • [1] Alwassel, H., Heilbron, F.C., Ghanem, B.: Action search: Spotting actions in videos and its application to temporal action localization. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 251–266 (2018)
  • [2] Caba Heilbron, F., Barrios, W., Escorcia, V., Ghanem, B.: Scc: Semantic context cascade for efficient action detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1454–1463 (2017)

  • [3] Chao, Y.W., Vijayanarasimhan, S., Seybold, B., Ross, D.A., Deng, J., Sukthankar, R.: Rethinking the faster r-cnn architecture for temporal action localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018)
  • [4] Deng, J., Dong, W., Socher, R., Li, L., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: CVPR. pp. 248–255. IEEE Computer Society (2009)

  • [5] Ding, L., Xu, C.: Weakly-supervised action segmentation with iterative soft boundary assignment. In: CVPR. pp. 6508–6516. Computer Vision Foundation / IEEE Computer Society (2018)
  • [6] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: ICCV. pp. 6201–6210. IEEE (2019)
  • [7] Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: CVPR. pp. 1933–1941. IEEE Computer Society (2016)
  • [8] Geirhos, R., Jacobsen, J., Michaelis, C., Zemel, R.S., Brendel, W., Bethge, M., Wichmann, F.A.: Shortcut learning in deep neural networks. Nat. Mach. Intell. 2(11), 665–673 (2020)
  • [9] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. pp. 770–778. IEEE Computer Society (2016)
  • [10] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation 9(8), 1735–1780 (1997)
  • [11] Hong, D., Li, C., Wen, L., Wang, X., Zhang, L.: Generic event boundary detection challenge at CVPR 2021 technical report: Cascaded temporal attention network (CASTANET). CoRR abs/2107.00239 (2021)
  • [12] Huang, D., Fei-Fei, L., Niebles, J.C.: Connectionist temporal modeling for weakly supervised action labeling. In: ECCV. vol. 9908, pp. 137–153 (2016)
  • [13] Ji, S., Xu, W., Yang, M., Yu, K.: 3d convolutional neural networks for human action recognition. In: ICML. pp. 495–502. Omnipress (2010)
  • [14] Kang, H., Kim, J., Kim, K., Kim, T., Kim, S.J.: Winning the cvpr’2021 kinetics-gebd challenge: Contrastive learning approach. CoRR abs/2106.11549 (2021)
  • [15] Kang, H., Kim, J., Kim, T., Kim, S.J.: Uboco : Unsupervised boundary contrastive learning for generic event boundary detection. CoRR abs/2111.14799 (2021)
  • [16] Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., Suleyman, M., Zisserman, A.: The kinetics human action video dataset. CoRR abs/1705.06950 (2017)
  • [17] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NIPS. pp. 1106–1114 (2012)
  • [18] Kuehne, H., Jhuang, H., Garrote, E., Poggio, T.A., Serre, T.: HMDB: A large video database for human motion recognition. In: ICCV. pp. 2556–2563. IEEE Computer Society (2011)
  • [19] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 156–165 (2017)
  • [20] Lea, C., Reiter, A., Vidal, R., Hager, G.D.: Segmental spatiotemporal cnns for fine-grained action segmentation. In: ECCV. vol. 9907, pp. 36–52 (2016)
  • [21] Lin, T., Liu, X., Li, X., Ding, E., Wen, S.: BMN: boundary-matching network for temporal action proposal generation. In: ICCV. pp. 3888–3897. IEEE (2019)
  • [22] Lin, T., Zhao, X., Shou, Z.: Single shot temporal action detection. In: Proceedings of the 25th ACM international conference on Multimedia. pp. 988–996 (2017)
  • [23] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV. pp. 10012–10022 (October 2021)
  • [24] Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., Hu, H.: Video swin transformer. CoRR abs/2106.13230 (2021)
  • [25] Long, F., Yao, T., Qiu, Z., Tian, X., Luo, J., Mei, T.: Gaussian temporal awareness networks for action localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 344–353 (2019)
  • [26] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015. pp. 3431–3440. IEEE Computer Society (2015)
  • [27] Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NIPS. pp. 4898–4906 (2016)
  • [28] Ma, S., Sigal, L., Sclaroff, S.: Learning activity progression in lstms for activity detection and early detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1942–1950 (2016)
  • [29] Neimark, D., Bar, O., Zohar, M., Asselmann, D.: Video transformer network. CoRR abs/2102.00719 (2021)
  • [30] Ni, B., Yang, X., Gao, S.: Progressively parsing interactional objects for fine grained action detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1020–1028 (2016)
  • [31] Perazzi, F., Pont-Tuset, J., McWilliams, B., Gool, L.V., Gross, M.H., Sorkine-Hornung, A.: A benchmark dataset and evaluation methodology for video object segmentation. In: CVPR. pp. 724–732. IEEE Computer Society (2016)
  • [32] Rai, A.K., Krishna, T., Dietlmeier, J., McGuinness, K., Smeaton, A.F., O’Connor, N.E.: Discerning generic event boundaries in long-form wild videos. CoRR abs/2106.10090 (2021)
  • [33] Richard, A., Gall, J.: Temporal action detection using a statistical language model. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3131–3140 (2016)
  • [34] Shao, D., Zhao, Y., Dai, B., Lin, D.: Intra- and inter-action understanding via temporal action parsing. In: CVPR. pp. 727–736. Computer Vision Foundation / IEEE (2020)
  • [35] Shou, M.Z., Ghadiyaram, D., Wang, W., Feiszli, M.: Generic event boundary detection: A benchmark for event segmentation. CoRR abs/2101.10511 (2021)
  • [36] Soomro, K., Zamir, A.R., Shah, M.: UCF101: A dataset of 101 human actions classes from videos in the wild. CoRR abs/1212.0402 (2012)
  • [37] Tang, J., Liu, Z., Qian, C., Wu, W., Wang, L.: Progressive attention on multi-level dense difference maps for generic event boundary detection. CoRR abs/2112.04771 (2021)
  • [38] Tran, D., Bourdev, L.D., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3d convolutional networks. In: ICCV. pp. 4489–4497. IEEE Computer Society (2015)
  • [39] Tversky, B., Zacks, J.M.: Event perception. Oxford handbook of cognitive psychology 1(2),  3 (2013)
  • [40] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA. pp. 5998–6008 (2017)
  • [41] Wu, Y., He, K.: Group normalization. In: ECCV. vol. 11217, pp. 3–19 (2018)
  • [42] Xia, H., Zhan, Y.: A survey on temporal action localization. IEEE Access 8, 70477–70487 (2020)
  • [43] Xie, S., Girshick, R.B., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: CVPR. pp. 5987–5995. IEEE Computer Society (2017)
  • [44] Yuan, Z., Stroud, J.C., Lu, T., Deng, J.: Temporal action localization by structured maximal sums. In: CVPR (July 2017)
  • [45] Zhao, P., Xie, L., Ju, C., Zhang, Y., Wang, Y., Tian, Q.: Bottom-up temporal action localization with mutual regularization. In: European Conference on Computer Vision. pp. 539–555. Springer (2020)
  • [46] Zhao, Y., Xiong, Y., Wang, L., Wu, Z., Tang, X., Lin, D.: Temporal action detection with structured segment networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2914–2923 (2017)