In computer vision, many existing video analysis algorithms process a fixed number of frames, for example optical flow or motion estimation techniques and human activity recognition [2, 3]. However, it would be more intuitive, and more efficient, to work with a perceptually meaningful entity obtained from a low-level grouping process, which we call a ‘superframe’.
Similar to superpixels, which are key building blocks of many algorithms and significantly reduce the number of image primitives compared to pixels, superframes do the same in the time domain. They can be used in many different applications such as video segmentation, video summarization, and video saliency detection. They are also useful in the design of a video database management system that manages a collection of video data and provides content-based access to users. Video data modeling, insertion, storage organization and management, and video data retrieval are among the basic problems addressed in a video database management system, and they can be solved more efficiently by using superframes. By temporal clustering of the video, it is easier to identify the significant segments of the video to achieve better representation, indexing, storage, and retrieval of the video data.
The main goal of this work is the automatic temporal clustering of a video by analyzing its visual content and partitioning it into a set of units called superframes. This process can also be referred to as video data segmentation. Each segment is defined as a continuous sequence of video frames that have no significant inter-frame difference in terms of their motion content. Motion is the main criterion we use in this work; therefore, we assume all the videos are taken from a single fixed camera.
There is little literature specifically on the temporal segmentation of video, but there is some work in related areas. In this section, we briefly discuss the techniques most relevant to this work: temporal superpixels, scene cut detection, and video segmentation.
The main idea of using superpixels as primitives in image processing was introduced by Ren and Malik. Using superpixels instead of raw pixel data is also beneficial for video applications. Although superpixel algorithms were until recently applied mainly to still images, researchers have started to apply them to video sequences. Some recent works exploit the temporal connection between consecutive frames. Reso et al. proposed a new method for generating superpixels in a video with temporal consistency. Their approach performs an energy-minimizing clustering using a hybrid clustering strategy for a multi-dimensional feature space. This space is separated into a global color subspace and multiple local spatial subspaces. A sliding window consisting of multiple consecutive frames is used, which makes the method suitable for processing arbitrarily long video sequences.
A generative probabilistic model is proposed by Chang et al. for temporally consistent superpixels in video sequences; it uses past and current frames and scales linearly with video length. They presented a low-level video representation that is closely related to volumetric voxels but differs in that temporal superpixels are designed mainly for video data, whereas supervoxels are designed for 3D volumetric data.
Lee et al. developed a temporal superpixel algorithm based on proximity-weighted patch matching. They estimated superpixel motion vectors by combining the patch matching distances of neighboring superpixels and the target superpixel. Then, they initialized the superpixel label of each pixel in a frame by mapping the superpixel labels in the previous frames using the motion vectors. Next, they refined the initial superpixels by iteratively updating the superpixel labels of boundary pixels based on a cost function. Finally, they performed postprocessing including superpixel splitting, merging, and relabeling.
Scene change detection plays an important role in video indexing, archiving, and video communication tasks such as rate control. It can be very challenging when scene changes are very small, and other changes such as brightness variation may cause false detections. Many methods have been proposed for scene change detection. Yi and Ling proposed a simple technique to detect sudden and unexpected scene changes based only on pixel values, without any motion estimation. They first screen out many non-scene-change frames and then normalize the remaining frames using a histogram equalization process.
Gygli used a fully convolutional neural network for the shot boundary detection task. He treated the task as a binary classification problem: predicting whether or not a frame is part of the same shot as the previous frame. He also created a new dataset of synthetic data with one million frames to train this network.
Video segmentation aims to group perceptually and visually similar video frames into spatio-temporal regions, a method applicable to many higher-level tasks in computer vision such as activity recognition, object tracking, content-based retrieval, and visual enhancement. Grundmann et al. presented a technique for spatio-temporal segmentation of long video sequences. Their work is a generalization of Felzenszwalb and Huttenlocher’s graph-based image segmentation technique. They use a hierarchical graph-based algorithm to make an initial over-segmentation of the video volume into relatively small space-time regions, with optical flow as a region descriptor for graph nodes.
Kotsia et al. proposed using the gradient correlation function operating in the frequency domain for action spotting in a video sequence. They used the Fourier transform, which is invariant to spatiotemporal changes and frame recording. In their work, the estimation of motion relies on detecting the maximum of the cross-correlation function between two blocks of video frames.
Similar to all these tasks, in this work we propose a simple and efficient motion-based method to segment a video over time into compact episodes. This grouping increases the computational efficiency of subsequent processing steps and allows for more complex algorithms that would be computationally infeasible at the frame level. Our algorithm detects major changes in video streams, such as when an action begins and ends.
The only inputs to our method are a video and the desired number of clusters in that video. The output is the set of frame numbers that mark the boundaries between the uniform clusters. Visual features such as color and texture per frame are of low importance compared to motion features. This motivates us to employ motion information to detect big changes over the video frames. We use FlowNet-2, a very accurate deep-network optical flow estimator, to extract the motion between every pair of consecutive frames. We then use both the average and the histogram of the flow over video frames and compare our results against a baseline.
II The proposed superframe technique
Our superframe segmentation algorithm detects the boundaries between temporal clusters in video frames. It takes the number of desired clusters as input and generates superframes based on motion similarity and proximity in the video frames. In other words, the histograms of the magnitude and direction of motion per frame, together with the frame position in the video, are used as features to cluster the video frames. Our superframe technique is motivated by SLIC, a superpixel algorithm for images, which we generalize to video data (see Figure 1).
II-A The proposed algorithm
Our model segments the video using motion cues. We assume the videos are taken with a fixed camera. Therefore, one can represent the type of motion of the foreground object by computing features from the optical flow. We first apply FlowNet-2 to get the flow per video frame, and then we build a histogram of magnitude (HOM) and a histogram of direction (HOD) of the flow per frame to initialize the cluster centers. Each cluster center is therefore described by its HOM values, its HOD values over the different direction bins, and the frame index in the video.
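The per-frame features described above can be sketched as follows. This is a minimal illustration, assuming the flow of one frame is given as a list of (u, v) displacement pairs; the function name `flow_histograms` and the parameters `mag_bins`, `dir_bins` and `max_mag` are hypothetical and their default values are not taken from this work.

```python
import math


def flow_histograms(flow, mag_bins=8, dir_bins=8, max_mag=10.0):
    """Return (hom, hod): normalized histograms of flow magnitude and direction.

    flow is an iterable of (u, v) displacement pairs for one frame.
    mag_bins, dir_bins and max_mag are illustrative assumptions.
    """
    hom = [0.0] * mag_bins
    hod = [0.0] * dir_bins
    count = 0
    for u, v in flow:
        mag = math.hypot(u, v)
        # Magnitudes at or above max_mag are clipped into the last bin.
        m_idx = min(int(mag / max_mag * mag_bins), mag_bins - 1)
        # Map the angle from (-pi, pi] to [0, 2*pi) before binning.
        angle = math.atan2(v, u) % (2.0 * math.pi)
        d_idx = min(int(angle / (2.0 * math.pi) * dir_bins), dir_bins - 1)
        hom[m_idx] += 1.0
        hod[d_idx] += 1.0
        count += 1
    if count:
        # Normalize so that each histogram sums to one.
        hom = [h / count for h in hom]
        hod = [h / count for h in hod]
    return hom, hod
```

Concatenating `hom` and `hod` gives one feature vector per frame; the frame index is kept separately so that it can be weighted on its own in the distance measure.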
In the initialization step, the video is divided into equally sized superframes, so that each superframe contains approximately the total number of frames divided by the number of clusters. Since every superframe initially has this same length, as in SLIC, we can safely restrict the search for the best position of a cluster center to a limited area around that center along the frame axis.
Following the cluster center initialization, a distance measure is used to decide which cluster each frame belongs to. The distance measure is defined as follows:
The distance between a cluster center and a frame is calculated by:
Here, the feature vector of each video frame contains the values of the histogram of the magnitude of flow and the values of the histogram of the direction of flow, and the frame location is considered separately as its own term. The location term is scaled by the sampling interval and weighted by a compactness parameter, which in our experiments is chosen relative to the input number of clusters to make the comparison of results easier.
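A distance of this kind can be sketched as below. This is only an illustration, assuming a SLIC-style combination in which the feature distance and the interval-normalized location distance are added in quadrature; the exact form, the function name `superframe_distance` and its parameter names are assumptions rather than the definition used in this work.

```python
import math


def superframe_distance(center_feat, center_pos, frame_feat, frame_pos,
                        interval, compactness):
    """SLIC-style distance between a cluster center and a frame (assumed form)."""
    # Euclidean distance between the HOM/HOD feature vectors.
    d_feat = math.sqrt(sum((c - f) ** 2 for c, f in zip(center_feat, frame_feat)))
    # Temporal distance in frames, normalized by the sampling interval.
    d_loc = abs(center_pos - frame_pos) / interval
    # Larger compactness gives more weight to temporal proximity.
    return math.sqrt(d_feat ** 2 + (d_loc * compactness) ** 2)
```

With a large compactness value the superframes stay close to their initial, equally sized layout; with a small value the motion features dominate the assignment.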
After initializing the cluster centers, we move each of them to the lowest-gradient position within a small neighborhood of frames. The size of this neighbourhood is chosen arbitrarily but reasonably. This is done to avoid placing a center on a noisy frame. The gradient for each frame is computed as in Eq. 2:
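The center perturbation step can be sketched as follows. This is a hedged illustration, assuming the gradient of a frame is the squared difference between the feature vectors of its two temporal neighbours, by analogy with the image gradient used in SLIC; the function names and the `radius` parameter are hypothetical.

```python
def frame_gradient(features, i):
    """Squared difference between the features of the neighbours of frame i.

    This form is an assumption modelled on the SLIC image gradient.
    At the ends of the video the frame itself stands in for the missing neighbour.
    """
    prev_f = features[max(i - 1, 0)]
    next_f = features[min(i + 1, len(features) - 1)]
    return sum((a - b) ** 2 for a, b in zip(next_f, prev_f))


def perturb_center(features, center, radius=1):
    """Move a center to the lowest-gradient frame within +/- radius frames."""
    lo = max(center - radius, 0)
    hi = min(center + radius, len(features) - 1)
    # min() returns the first index on ties, i.e. the earliest frame.
    return min(range(lo, hi + 1), key=lambda i: frame_gradient(features, i))
```

A frame with a low gradient lies inside a stretch of stable motion, so a center placed there is less likely to sit on a transition or on a noisy frame.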
We associate each frame in the video with the nearest cluster center within its search area. When all frames have been associated with their nearest cluster center, each new cluster center is computed as the average of the flow values of all the frames belonging to that cluster. We repeat this process until convergence, i.e. until the error is less than a threshold.
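The assignment and update iteration described above can be sketched as follows. This is a minimal sketch under stated assumptions: the distance is the SLIC-style form, the search window is twice the initial superframe length on each side (borrowed from SLIC), and convergence is measured by the total movement of the centers. The function name `assign_and_update` and all parameter names and defaults are hypothetical.

```python
import math


def assign_and_update(features, k, compactness=1.0, max_iter=20, tol=1e-6):
    """Cluster frames into at most k temporal groups; returns one label per frame."""
    n = len(features)
    dim = len(features[0])
    interval = n / k  # initial superframe length
    # Start with evenly spaced centers (temporal position, feature vector).
    positions = [(j + 0.5) * interval for j in range(k)]
    feats = [list(features[int(p)]) for p in positions]
    labels = [0] * n
    for _ in range(max_iter):
        # Assignment step: nearest center inside a window of +/- 2 * interval
        # (the window size is an assumption borrowed from SLIC).
        for i, f in enumerate(features):
            best_d = float("inf")
            for j in range(k):
                if abs(positions[j] - i) > 2 * interval:
                    continue
                d_feat2 = sum((a - b) ** 2 for a, b in zip(f, feats[j]))
                d_loc = (positions[j] - i) / interval
                d = math.sqrt(d_feat2 + (d_loc * compactness) ** 2)
                if d < best_d:
                    best_d, labels[i] = d, j
        # Update step: each center becomes the mean of its member frames.
        shift = 0.0
        for j in range(k):
            members = [i for i in range(n) if labels[i] == j]
            if not members:
                continue  # an empty cluster keeps its previous center
            new_pos = sum(members) / len(members)
            new_feat = [sum(features[i][d] for i in members) / len(members)
                        for d in range(dim)]
            shift += abs(new_pos - positions[j])
            shift += sum(abs(a - b) for a, b in zip(new_feat, feats[j]))
            positions[j], feats[j] = new_pos, new_feat
        if shift < tol:
            break  # centers stopped moving
    return labels
```

The superframe boundaries are then the frame indices at which the returned label changes.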
At the end of this process, a few clusters may be very short. We therefore apply a postprocessing step that merges each of these very small clusters into the closer of its left and right neighbouring clusters. The whole algorithm is summarized in Algorithm 1.
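The merging step can be sketched as follows. This is an illustration only: it assumes that "closer" means closer in mean feature value, which is one possible reading, and the function name `merge_short_segments` and the `min_len` parameter are hypothetical.

```python
def merge_short_segments(labels, features, min_len):
    """Return segment start indices after merging runs shorter than min_len."""
    # Split the label sequence into contiguous runs [start, end).
    starts = [0] + [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]
    segments = [[s, e] for s, e in zip(starts, starts[1:] + [len(labels)])]
    dim = len(features[0])

    def mean_feat(seg):
        s, e = seg
        return [sum(features[i][d] for i in range(s, e)) / (e - s) for d in range(dim)]

    def dist(a, b):
        # Squared distance between mean features (assumed notion of closeness).
        return sum((x - y) ** 2 for x, y in zip(mean_feat(a), mean_feat(b)))

    while len(segments) > 1:
        # Pick the shortest segment; stop when it is long enough.
        idx = min(range(len(segments)), key=lambda j: segments[j][1] - segments[j][0])
        if segments[idx][1] - segments[idx][0] >= min_len:
            break
        left = idx - 1 if idx > 0 else None
        right = idx + 1 if idx < len(segments) - 1 else None
        merge_left = right is None or (
            left is not None
            and dist(segments[idx], segments[left]) <= dist(segments[idx], segments[right])
        )
        if merge_left:
            segments[left][1] = segments[idx][1]   # absorb into the left neighbour
        else:
            segments[right][0] = segments[idx][0]  # absorb into the right neighbour
        del segments[idx]
    return [s for s, _ in segments]
```

Because short clusters are absorbed by their neighbours, the number of superframes returned can be smaller than the number requested.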
II-B Evaluation criteria for the performance of the algorithm
Superpixel algorithms are usually assessed using two error metrics for segmentation quality: boundary recall and under-segmentation error. Boundary recall measures how well a superframe segmentation adheres to the ground-truth boundaries. A higher boundary recall therefore indicates better adherence to video segment boundaries. Given a superframe segmentation and a ground-truth segmentation, each with its own number of clusters, boundary recall is defined as follows:
Here, the true positives are the boundary frames in the ground truth for which there is a boundary frame in the superframe segmentation within the tolerance range, and the false negatives are the boundary frames in the ground truth for which there is no such boundary frame within that range. In simple words, boundary recall is the fraction of ground-truth boundary frames that are correctly detected by the superframe segmentation. In this work, the tolerance range is set in proportion to the video length in frames, based on the experiments.
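Boundary recall as described above can be computed with a few lines of code. The sketch below assumes boundaries are given as lists of frame indices; the function and parameter names are hypothetical, and returning 1.0 for an empty ground truth is a convention chosen here.

```python
def boundary_recall(gt_boundaries, pred_boundaries, tolerance):
    """Fraction of ground-truth boundaries matched within +/- tolerance frames."""
    if not gt_boundaries:
        return 1.0  # nothing to detect (convention assumed for this sketch)
    # True positives: ground-truth boundaries with a detected boundary nearby.
    tp = sum(1 for g in gt_boundaries
             if any(abs(g - p) <= tolerance for p in pred_boundaries))
    # False negatives: ground-truth boundaries with no detected boundary nearby.
    fn = len(gt_boundaries) - tp
    return tp / (tp + fn)
```

For example, with ground-truth boundaries at frames 100, 200 and 300, detected boundaries at 98, 250 and 303, and a tolerance of 5 frames, two of the three ground-truth boundaries are matched.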
Under-segmentation error is another error metric, which measures the leakage of superframes with respect to the ground-truth segmentation. The lower the under-segmentation error, the better the match between the superframes and the ground-truth segments. We define the under-segmentation error as follows:
The definition uses the number of frames in the video, the number of ground-truth superframes, and the length of each segment in frames. An overlap threshold decides when a superframe counts as overlapping a ground-truth segment; through experiments, we found that this threshold works best when it is set as a proportion of the length of each superframe.
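One way to compute an under-segmentation error of this kind is sketched below. It assumes the SLIC-style definition, in which the lengths of all superframes that overlap a ground-truth segment by more than a threshold are summed and the excess over the video length is normalized by the video length. Segments are half-open (start, end) frame ranges; the function name and the default `overlap_ratio` of 0.05 are illustrative assumptions, not values from this work.

```python
def under_segmentation_error(gt_segments, superframes, n_frames, overlap_ratio=0.05):
    """SLIC-style under-segmentation error for 1D (temporal) segments."""
    total = 0
    for g_start, g_end in gt_segments:
        for s_start, s_end in superframes:
            # Number of frames shared by the ground-truth segment and the superframe.
            overlap = min(g_end, s_end) - max(g_start, s_start)
            length = s_end - s_start
            # Count the whole superframe if it overlaps by more than the threshold.
            if overlap > overlap_ratio * length:
                total += length
    # Frames counted beyond the video length are "leakage".
    return (total - n_frames) / n_frames
```

A perfect segmentation gives an error of zero; each superframe that straddles a ground-truth boundary is counted twice and increases the error by its length divided by the video length.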
The proposed algorithm requires two initial parameters: the number of desired clusters and the ‘compactness’. In our work, we fix the compactness to a constant value to make the evaluation of results easier.
Experiments are carried out using the ‘MAD’ (Multimodal Action Database) and ‘UMN’ databases. The MAD database is recorded using a Microsoft Kinect sensor in an indoor environment and contains video sequences of several subjects. Each subject performs a series of sequential actions twice. We also labelled the superframes manually for each video by marking the frames at which there is a big change of motion; these labels delimit the video segments with similar motion, i.e. the superframes.
UMN is an unusual crowd activity dataset, a staged dataset that depicts sparsely populated areas. Normal crowd activity is observed until a specified point in time, at which behavior rapidly evolves into an escape scenario where each individual runs out of camera view to simulate panic. The dataset comprises separate video samples that start by depicting normal behavior before changing to abnormal behavior. The panic scenario is filmed in three different locations, one indoors and two outdoors. All footage is recorded with a static camera at a fixed frame rate and resolution.
Each frame is represented by a feature vector consisting of the values of the HOM features, the values of the HOD features, and one value for the frame location in the video. The HOF features provide a normalized histogram at each frame of the video.
A dense optical flow is used in this work to detect motion for each video frame. We have employed FlowNet-2, which has improved quality and speed compared to other optical flow algorithms. On both the MAD and UMN databases, extracting the flow with FlowNet-2 took only a short time per frame. Figure 2 illustrates the results of FlowNet-2 on both databases.
As a baseline for comparison, we have used ‘phase correlation’ between two groups of frames to determine the relative translative movement between them. This method relies on estimating the maximum of the phase correlation, which is defined as the inverse Fourier transform of the normalized cross-spectrum between two space-time volumes in the video sequence. We call this phase-correlation technique PC-clustering in this work. We subsampled the frames of each video at a regular interval and chose a sub-section of each frame, using a block of pixels in the middle of the frame to carry out the phase correlation. Each space-time volume therefore consists of this block taken over a number of frames, which gives the temporal length of the volume.
Given two space-time volumes, we calculate the normalized cross-correlation by taking the Fourier transform of each volume and the complex conjugate of the second result, and then find the location of the peak using Equation 6.
Here, the product is taken element-wise.
The result is the correlation between the two space-time volumes. We calculate this correlation at the sampled frames of each video and conclude that there is a big change in the video when the cross-correlation is less than a threshold (see Figure 3). Figure 4 illustrates how boundary recall and under-segmentation error change with the number of clusters.
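For reference, phase correlation between two space-time volumes is commonly written in the following standard form. The symbols $v_a$, $v_b$, $F_a$, $F_b$, $R$ and $r$ are notation assumed here for illustration and may differ from the notation used elsewhere in this paper.

```latex
F_a = \mathcal{F}\{v_a\}, \qquad F_b = \mathcal{F}\{v_b\}, \qquad
R = \frac{F_a \circ F_b^{*}}{\left| F_a \circ F_b^{*} \right|}

r = \mathcal{F}^{-1}\{R\}, \qquad
(\Delta x, \Delta y, \Delta t) = \arg\max_{(x, y, t)} \, r(x, y, t)
```

Here $\mathcal{F}$ is the Fourier transform, $F_b^{*}$ is the complex conjugate of $F_b$, and $\circ$ is the element-wise product. The height of the peak of $r$ is the quantity that would be compared against a threshold to decide whether a big change has occurred.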
To test the performance of our method on clustering a video into superframes, we consider two groups of features: first, the averaged value of the optical flow, i.e. the average of its horizontal and vertical components, and second, the histogram of magnitude and direction of the flow. Figure 5 shows the boundary recall with respect to the input number of desired superframes for one of the videos in the MAD dataset. For this video, the HOF features show an outstanding improvement in boundary recall over the averaged features.
As discussed before, the output number of clusters is usually less than the input number of desired clusters, since in the postprocessing step some clusters may be merged with others when they are too short. Figure 6 illustrates the relation between the output number of clusters and the boundary recall for a video from the MAD database. The ground-truth number of clusters is shown as a red horizontal line in the figure. The figure shows that the boundary recall is high once the output number of clusters for this video is large enough.
The ground-truth and resulting boundaries for a video from the MAD database are shown in Figure 7. The difference between them is also shown in this figure; it is almost zero except for a few clusters within a limited range of frames.
As stated before, a combination of under-segmentation error and boundary recall is a good evaluation metric for the algorithm. Under-segmentation error evaluates the quality of a superframe segmentation by penalizing superframes that overlap with other superframes. A higher boundary recall indicates that fewer true boundaries are missed. Figure 8 shows the experimental results for both boundary recall and under-segmentation error, averaged over all videos in the MAD database. According to this figure, the boundary recall is high and the under-segmentation error is low once the number of clusters is large enough.
A quantitative evaluation of all videos in each dataset is carried out, and the results are given in Table I. These results are calculated by averaging over all the videos in each dataset for a fixed number of clusters. The experimental results show that our model works quite well on both datasets. According to this table, the histogram of flow is a better feature than averaging the flow or using phase correlation between volumes in the video. Boundary recall improves on both the MAD and UMN datasets. The boundary recall is quite low for the baseline on the UMN dataset, although the under-segmentation error is still very low. An interesting point in this table is that all methods achieve a higher boundary recall on the MAD dataset but a lower under-segmentation error on the UMN dataset.
Table I: Results on the MAD database and the UMN database for the histogram of flow (proposed) and the compared methods.
In this paper, we proposed using Histogram of Optical Flow (HOF) features to cluster video frames. These features are independent of the scale of the moving objects. Using such atomic video segments to speed up later-stage visual processing has recently been explored in some works. The number of desired clusters is the input to our algorithm. This parameter is very important, and a poor choice may cause over- or under-segmentation of the video.
Our superframe method divides a video into superframes by minimizing a distance-based cost function, so that each superframe belongs to a single motion pattern and does not overlap with others. We used FlowNet-2 to extract motion information per video frame to describe video sequences and quantitatively evaluated our method on two databases, MAD and UMN.
One interesting direction for future work is to estimate a saliency score for each of these superframes. This would help us rank the different episodes of a video by their saliency scores and detect abnormalities in videos as the most salient segments.
This work was supported in part by the European Union’s Horizon 2020 Programme for Research and Innovation Actions within IoT (2016): Large Scale Pilots: Wearables for smart ecosystem (MONICA).
-  E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, “FlowNet 2.0: Evolution of optical flow estimation with deep networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  V. Bloom, D. Makris, and V. Argyriou, “Clustered spatio-temporal manifolds for online action recognition,” in International Conference on Pattern Recognition, 2014, pp. 3963–3968.
-  F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles, “Activitynet: A large-scale video benchmark for human activity understanding,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 961–970.
-  R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk, “Slic superpixels compared to state-of-the-art superpixel methods,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 34, no. 11, pp. 2274–2282, 2012.
-  M. Grundmann, V. Kwatra, M. Han, and I. Essa, “Efficient hierarchical graph-based video segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
-  W.-S. Chu, Y. Song, and A. Jaimes, “Video co-summarization: Video summarization by visual co-occurrence,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3584–3592.
-  Q. Tu, A. Men, Z. Jiang, F. Ye, and J. Xu, “Video saliency detection incorporating temporal information in compressed domain,” Image Communication, vol. 38, no. C, pp. 32–44, 2015.
-  W. G. Aref, A. C. Catlin, J. Fan, A. K. Elmagarmid, M. A. Hammad, I. F. Ilyas, M. S. Marzouk, and X. Zhu, “A video database management system for advancing video database research,” in International Workshop on Management Information Systems, 2002, pp. 8–17.
-  A. Hampapur, “Designing video data management systems,” Ph.D. dissertation, The University of Michigan, Ann Arbor, MI, USA, 1995.
-  X. Ren and J. Malik, “Learning a classification model for segmentation,” in IEEE International Conference on Computer Vision (ICCV), 2003, pp. 10–17.
-  M. Reso, J. Jachalsky, B. Rosenhahn, and J. Ostermann, “Temporally consistent superpixels,” IEEE International Conference on Computer Vision (ICCV), pp. 385–392, 2013.
-  J. Chang, D. Wei, and J. W. Fisher III, “A video representation using temporal superpixels,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 2051–2058.
-  C. Xu and J. J. Corso, “Evaluation of super-voxel methods for early video processing,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1202–1209, 2012.
-  S.-H. Lee, W.-D. Jang, and C.-S. Kim, “Temporal superpixels based on proximity-weighted patch matching,” in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 3610–3618.
-  X. Yi and N. Ling, “Fast pixel-based video scene change detection,” in IEEE International Symposium on Circuits and Systems (ISCAS), 2005.
-  M. Gygli, “Ridiculously fast shot boundary detection with fully convolutional neural networks,” Computing Research Repository (CoRR), vol. abs/1705.08214, 2017.
-  P. F. Felzenszwalb and D. P. Huttenlocher, “Efficient graph-based image segmentation,” International Journal of Computer Vision (IJCV), vol. 59, no. 2, pp. 167–181, 2004.
-  I. Kotsia and V. Argyriou, “Action spotting exploiting the frequency domain,” in Workshop on CVPR, 2011, pp. 43–48.
-  V. Argyriou and T. Vlachos, “Quad-tree motion estimation in the frequency domain using gradient correlation,” IEEE Transactions on Multimedia, vol. 9, no. 6, pp. 1147–1154, 2007.
-  D. Huang, S. Yao, Y. Wang, and F. De La Torre, “Sequential max-margin event detectors,” in European Conference on Computer Vision (ECCV), 2014, pp. 410–424.
-  R. Mehran, A. Oyama, and M. Shah, “Abnormal crowd behaviour detection using social force model,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 935–942.