TransNet: A deep network for fast detection of common shot transitions

by Tomáš Souček, et al.
Charles University in Prague

Shot boundary detection (SBD) is an important first step in many video processing applications. This paper presents a simple modular convolutional neural network architecture that achieves state-of-the-art results on the RAI dataset with well above real-time inference speed even on a single mediocre GPU. The network employs dilated convolutions and operates just on small resized frames. The training process employed randomly generated transitions using selected shots from the TRECVID IACC.3 dataset. The code and a selected trained network will be available at




1. Introduction

A popular way to structure a video is by making use of a shot composition, where shots are delimited by transitions. Since information about the transitions is not available in the video format, automated shot boundary detection is an important step for video management and retrieval systems. For example, information about shots can be employed for video summarization, advanced browsing, and filtering in known-item search tasks (Cobârzan et al., 2017; Lokoč et al., 2018, 2019). Shot changes can be either immediate (hard cuts) or gradual, the latter spanning from basic linear interleaving of two shots over a certain number of video frames to more exotic geometric transformations from one shot to another. To make matters worse, shot boundary detectors must distinguish between shot transitions and sudden changes in a video caused by partial occlusion of the scene by an object passing close to the camera. Fast camera motion or motion of an object in the scene should also not be mistaken for a shot transition. This may indicate that some semantic representation of a scene is necessary to correctly segment a video.

In this work, we propose TransNet, a scalable architecture with multiple dilated 3D convolutional operations per layer (instead of only one, as is usual), resulting in a greater field of view with fewer trainable parameters. Even though the architecture is trained on just two common types of transitions (hard cuts and dissolves), it achieves state-of-the-art results on the RAI dataset (Baraldi et al., 2015).

2. Related work

The goal of shot boundary detection is to temporally segment a video into shots. One of the first methods to determine shot boundaries utilized thresholded pixel differences (Zhang et al., 1993), effective for stationary shots with a small number of moving objects. Since then, more robust techniques to compare images were developed based on local color histograms, color coherence vectors (Pass et al., 1996), or SIFT features. The work of Shao et al. (Shao et al., 2015) utilizes HSV and gradient histograms for shot boundary detection; Apostolidis et al. (Apostolidis and Mezaris, 2014) use not only the histogram but also a set of SURF descriptors to detect the differences between a pair of frames. Other approaches revolve around edge information (Huan et al., 2008) or motion vectors (Amel et al., 2010).

With the advent of deep learning, new methods for shot detection using convolutional neural networks (CNNs) emerged. Baraldi et al. (Baraldi et al., 2015) utilize spectral clustering over features extracted for every frame by a deep siamese network. Recently, Gygli (Gygli, 2018) used a relatively shallow neural network with 3D convolutions, the third dimension spanning time. Even though 3D convolutions significantly increase computational complexity and memory requirements over standard 2D convolutions due to the added dimension, Gygli beat the previous approach in both accuracy and speed. Another approach by Hassanien et al. (Hassanien et al., 2017) also uses a 3D CNN; however, its output is fed through an SVM classifier, and further post-processing is done to reduce false alarms of gradual transitions through histogram-driven temporal differencing. Our work partially overcomes the problem of computationally hungry 3D convolutions when a large field of view is required to cope with long gradual transitions by using dilated convolutions over the time dimension, which have proven useful in speech generation tasks (van den Oord et al., 2016).

Deep learning approaches depend on large annotated datasets. Until recently (Tang et al., 2018), the size of publicly available datasets for SBD was the limiting factor. Fortunately, synthetic training data can be easily generated from virtually any video content by interleaving randomly selected sequences from different videos, as is done in (Gygli, 2018) and elsewhere. The downside of this method, however, is that real data can contain cuts between shots of the same scene, which rarely occur in synthetic datasets due to the way they are generated.

3. Model architecture

The proposed TransNet architecture (Figure 1) follows the work of Gygli (Gygli, 2018) and other standard convolutional architectures. As an input, the network takes a sequence of consecutive video frames and applies a series of 3D convolutions, returning a prediction for every frame in the input. Each prediction expresses how likely a given frame is a shot boundary.

The main building block of the model (Dilated DCNN cell) is designed as four 3D 3×3×3 convolutional operations. The convolutions employ different dilation rates for the time dimension, and their outputs are concatenated in the channel dimension. This approach significantly reduces the number of trainable parameters compared to standard 3D convolutions with the same field of view. Multiple DDCNN cells on top of each other, followed by spatial max pooling, form a Stacked DDCNN block. The TransNet consists of multiple SDDCNN blocks, each subsequent block operating on a smaller spatial resolution but a greater channel dimension, further increasing the expressive power and the receptive field of the network.
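The parameter savings can be illustrated with a back-of-the-envelope calculation; the channel counts and dilation rates below match the cell design described above, but the concrete channel sizes are illustrative assumptions, not the paper's exact configuration:

```python
# Parameter comparison: a DDCNN cell (four parallel 3x3x3 convolutions with
# temporal dilations 1, 2, 4, 8, outputs concatenated over channels) versus a
# single standard 3D convolution with the same temporal field of view.

def conv3d_params(kt, kh, kw, c_in, c_out):
    """Weight count of one 3D convolution (biases omitted for simplicity)."""
    return kt * kh * kw * c_in * c_out

c_in, c_out = 64, 64  # illustrative channel counts

# DDCNN cell: each of the four dilated branches produces c_out // 4 channels.
ddcnn = 4 * conv3d_params(3, 3, 3, c_in, c_out // 4)

# The dilation-8 branch spans (3 - 1) * 8 + 1 = 17 frames in time, so a single
# standard convolution with the same temporal field of view needs a 17x3x3 kernel.
single = conv3d_params(17, 3, 3, c_in, c_out)

print(ddcnn, single)          # 110592 626688
print(round(single / ddcnn))  # the dilated cell uses roughly 6x fewer weights
```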

Two fully connected layers refine the features extracted by the convolutional layers and predict a possible shot boundary for every frame representation independently (the layers' weights are shared). The ReLU activation function is used in all layers except the last fully connected layer, which uses a softmax output. Stride 1 and 'same' padding are employed in all convolutional layers.


Figure 1. TransNet shot boundary detection network architecture. A DDCNN cell consists of four parallel 3×3×3 convolutions with temporal dilations 1, 2, 4, and 8; S cells stacked on top of each other, followed by 1×2×2 spatial max pooling, form an SDDCNN block; L such blocks are followed by a dense layer with D neurons and a final dense layer with 2 outputs. Note that the first input dimension represents the length of the video sequence, not the batch size; in our case it is 100 frames.

4. Training

This section describes the employed dataset and training settings.

4.1. Dataset

The TRECVID IACC.3 dataset (Awad et al., 2017) was utilized as it is provided with a set of predefined temporal segments. Hence, pairs of the predefined segments can be randomly selected from the pool to automatically create transitions for training purposes. More specifically, we considered segments of 3000 randomly selected IACC.3 videos. Furthermore, segments with fewer than 5 frames were excluded, and from the remaining set only every other segment was picked, resulting in 54,884 selected segments.

The training examples were generated on demand during training by randomly sampling two shots and joining them with a randomly selected type of transition. Only hard cuts and dissolves were considered for training. The position of the transition was generated randomly; for dissolves, its length was also generated randomly from the interval . The length of each training sequence was selected to be 100 frames. The size of the input frames was set to pixels.
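This on-the-fly generation can be sketched as follows; the function name, frame sizes, and the simple linear blending for dissolves are our assumptions for illustration:

```python
import numpy as np

def join_shots(shot_a, shot_b, transition="hard", dissolve_len=10):
    """Join two shots (arrays of frames, shape [T, H, W, C]) by a transition.

    Sketch of the on-the-fly training-example generation described above.
    """
    if transition == "hard":
        # Hard cut: frames of shot_b simply follow frames of shot_a.
        return np.concatenate([shot_a, shot_b], axis=0)
    # Dissolve: linearly blend the last `dissolve_len` frames of shot_a
    # with the first `dissolve_len` frames of shot_b.
    alphas = np.linspace(0, 1, dissolve_len)[:, None, None, None]
    blend = (1 - alphas) * shot_a[-dissolve_len:] + alphas * shot_b[:dissolve_len]
    return np.concatenate(
        [shot_a[:-dissolve_len], blend, shot_b[dissolve_len:]], axis=0)

a = np.zeros((20, 32, 32, 3), dtype=np.float32)  # 20 dark frames (dummy shot)
b = np.ones((20, 32, 32, 3), dtype=np.float32)   # 20 bright frames (dummy shot)
seq = join_shots(a, b, transition="dissolve", dissolve_len=10)
print(seq.shape)  # (30, 32, 32, 3): the 10 dissolve frames overlap both shots
```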

To validate the models, an additional 100 IACC.3 videos (i.e., different from the training set) were manually labeled, resulting in 3800 shots. For testing, the RAI dataset (Baraldi et al., 2015) was considered.

4.2. Training details

The proposed architecture provides the following meta-parameters that were investigated by a grid search:

  1. S, the number of DDCNN cells in an SDDCNN block,
  2. L, the number of SDDCNN blocks,
  3. F, the number of filters in the first set of DDCNN layers (doubled in each following SDDCNN block),
  4. D, the number of neurons in the dense layer.

For training, a batch size of 20 was used for all investigated networks. To prevent overfitting, only 30 epochs were considered, each with 300 batches. The Adam optimizer (Kingma and Ba, 2014) with the default learning rate and a cross-entropy loss function were used. According to our preliminary evaluations, dropout did not improve results. Nevertheless, we plan to investigate advanced forms of regularization and training-data augmentation in the future. Depending on the architecture, the whole training took approximately two to four hours on one Tesla V100 GPU.

Even in the case of dissolves, where the transition spans multiple frames, the network was trained to predict only the middle frame as a shot boundary. This creates a discrepancy between the number of 'transition' frames (each sequence contains only one) and frames without a transition (99 in our case). Increasing the weight of the transitions in the loss function did not produce better results than lowering the acceptance threshold below its commonly used value; therefore, the latter approach is used.
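The labeling scheme above can be sketched as follows (the function name is ours; for a hard cut the transition length is effectively 1):

```python
def make_targets(seq_len, trans_start, trans_len):
    """Per-frame 0/1 targets for one training sequence: only the middle
    frame of the transition is labeled as a shot boundary."""
    targets = [0] * seq_len
    targets[trans_start + trans_len // 2] = 1
    return targets

t = make_targets(100, trans_start=40, trans_len=10)
print(sum(t), t.index(1))  # 1 45 -- a single positive among 100 frames
```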

5. Evaluation

During validation and testing, the list of shots is constructed in the following way: a shot starts at the first frame where the prediction drops below a threshold and ends at the first frame where the prediction exceeds it. The evaluation metric described in Section 5.1 compares the generated shot list with the ground truth. Note that only predictions for frames 25-75 are used due to incomplete temporal information for the first/last frames. Therefore, when processing a video, the input window is shifted by 50 frames between individual forward passes through the network.
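The windowing scheme can be sketched as follows; the function names and the zero-filled handling of the very first and last frames are our assumptions:

```python
def predict_video(frames, model, window=100, keep=(25, 75)):
    """Sliding-window inference sketch: run `model` on 100-frame windows
    shifted by 50 frames and keep only the central predictions (frames
    25-75 of each window), where temporal context is complete.

    `model` returns one prediction per input frame.
    """
    lo, hi = keep
    shift = hi - lo
    preds = [0.0] * len(frames)  # edge frames keep 0.0; real code would pad
    start = 0
    while start + window <= len(frames):
        window_preds = model(frames[start:start + window])
        for i in range(lo, hi):
            preds[start + i] = window_preds[i]
        start += shift
    return preds

# Demo with a dummy model that flags every frame as a transition:
preds = predict_video(list(range(200)), lambda w: [1.0] * len(w))
print(sum(p > 0 for p in preds))  # 150 central frames receive predictions
```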

5.1. Evaluation metric

The F1 score is used as the evaluation metric, the same as in (Baraldi et al., 2015). The reported F1 score is computed as an average of the individual F1 scores for each video. Based on our analysis of the evaluation script (its source code is available at ), Figure 2 shows cases where detected shots are considered to be true positives, false positives, or false negatives. A true positive is detected only if the detected shot transition overlaps with the ground truth transition (3, 4 in green). A false positive is detected if the predicted transition has no overlap with the ground truth (1, 4 in red) or the transition is detected for a second time (3 in red). A false negative occurs if no detected transition overlaps with the ground truth (1, 2 dotted), i.e., the ground truth transition is missed.

Figure 2. Visualization of the evaluation approach. Predicted transitions shown with solid and missed with dotted rectangles.
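The matching rules described above can be sketched as a small re-implementation (our sketch, not the official evaluation script; transitions are given as (start, end) frame intervals):

```python
def evaluate(pred, gt):
    """Overlap-based TP/FP/FN counting. A prediction is a true positive if
    it overlaps a not-yet-matched ground truth transition; a repeated
    detection of the same transition counts as a false positive, and
    unmatched ground truth transitions are false negatives."""
    matched = set()
    tp = fp = 0
    for ps, pe in pred:
        hit = next((i for i, (gs, ge) in enumerate(gt)
                    if i not in matched and ps <= ge and gs <= pe), None)
        if hit is None:
            fp += 1
        else:
            matched.add(hit)
            tp += 1
    fn = len(gt) - len(matched)
    return tp, fp, fn

# One transition matched, one detected a second time (false positive),
# one spurious detection, and one ground truth transition missed:
print(evaluate([(11, 11), (11, 12), (90, 91)], [(10, 12), (50, 55)]))  # (1, 2, 1)
```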

5.2. Results

Figure 3. Observed average F1 scores of tested networks for the validation and test datasets.

Figure 3 presents the F1 scores of the investigated models on the validation and test datasets. Note that the top performing weights for each model configuration were selected based on validation results after each epoch. The confidence threshold indicating a transition was set to a fixed value as it performed reasonably well for most of the models. The effect of the threshold on precision, recall, and F1 score is depicted in Figure 4.

Figure 4. Precision/Recall curve for the best performing model with corresponding thresholds next to the points (in red) and F1 score dependency on threshold (in blue). Measured on RAI dataset.

Based on the evaluations presented in Figure 3, the best performing model is the one with 16 filters in the first layer, two stacked DDCNN cells in each of the three SDDCNN blocks, and 256 neurons in the dense layer (F=16, L=3, S=2, D=256). The average F1 score of the top performing model on the RAI dataset (see Table 1) is on par with the score reported by Hassanien et al. (Hassanien et al., 2017). The overall F1 score even slightly outperforms the work of Hassanien et al., even though they proposed a network with more than 40 times as many parameters, trained for a larger set of transition types. Furthermore, our model has the advantage that no additional post-processing is needed.

            Baraldi et al.  Gygli  Hassanien et al.  ours
average F1  0.84            0.88   -                 -
overall F1  -               -      0.934             0.943
Table 1. Average and overall F1 scores for the RAI test dataset of the best architectures (Baraldi et al., 2015; Gygli, 2018; Hassanien et al., 2017). The overall F1 scores are computed by calculating precision and recall over the whole dataset, not per single video.
Video #T TP FP FN P R F1
V1 80 57 2 23 0.966 0.713 0.820
V2 146 132 5 14 0.964 0.904 0.933
V3 112 111 4 1 0.965 0.991 0.978
V4 60 59 5 1 0.922 0.983 0.952
V5 104 101 8 3 0.927 0.971 0.948
V6 54 53 3 1 0.946 0.981 0.964
V7 109 103 1 6 0.990 0.945 0.967
V8 196 181 4 15 0.978 0.923 0.950
V9 61 55 2 6 0.965 0.902 0.932
V10 63 57 0 6 1.000 0.905 0.950
Overall 985 909 34 76 0.964 0.923 0.943
Table 2. Per video results on the RAI dataset. For each video the total number of transitions (#T), true positives (TP), false positives (FP), false negatives (FN), precision (P), recall (R) and F1 score (F1) are shown.
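As a sanity check, the overall precision, recall, and F1 in the last row of Table 2 follow directly from the TP/FP/FN totals:

```python
# Recompute the overall precision, recall and F1 from the Table 2 totals.
tp, fp, fn = 909, 34, 76
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.964 0.923 0.943
```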

Since the validation dataset contains various sequences of frames where even the annotators are not sure whether there is a shot transition, the reported scores for the validation data are lower. In addition, even the top performing TransNet model struggles with some transitions, for example producing false positives in dynamic shots and false negatives in gradual transitions.

The model detected 1058 false positives and 679 false negatives with respect to the annotation. After closer inspection, for about 20% of false negatives there was one very close false positive (shifted by one frame). This is in contrast to the RAI dataset results (Table 2) where the network achieves a lower number of false positives than false negatives. Based on manual inspection of the videos we conclude that RAI videos do not contain many highly dynamic shots (i.e. resulting in false positives) compared to the IACC.3 validation set.

6. Conclusion

In this paper, we present the TransNet neural network, the first shot detection model based on dilated 3D convolutions. The effectiveness of dilated 3D convolutions has been shown on the RAI dataset, with TransNet performing on par with the current state-of-the-art approach without any additional post-processing and with a fraction of the learnable parameters. The network also runs more than 100x faster than real time on a single powerful GPU (it took just 50 seconds to detect shot boundaries of preprocessed frames from the whole RAI dataset, about 98 minutes of video, using a Tesla V100 GPU).

In the future, we plan further evaluations and improvements to enable deeper and more robust models. The source code and our trained model will be available at

This paper has been supported by the Czech Science Foundation (GAČR), project No. 19-22071Y.