1 Introduction
Recommendation services, event detection, clustering and categorization of video data, and retrieval algorithms for large video databases depend on efficient and reliable similarity identification amongst video segments [1, 2, 3]. In a nutshell, given a query video, we wish to find all similar video segments within a large video database in the most reliable and efficient way. The state-of-the-art in similarity identification hinges on video fingerprinting algorithms [4, 5]. The aim of such algorithms is to provide distinguishable representations that remain robust under visual distortions such as rotation, compression, blur, resizing, flicker, etc. Such distortions are expected to be present within large video collections, or when dealing with content “in the wild” [3].
In a broad sense, video similarity identification can be seen as a spatio-temporal matching problem via an appropriate feature space or descriptor. Recent results have shown that similarity identification algorithms based on local descriptors, such as the scale-invariant feature transform (SIFT) [6] or dense SIFT [7], tend to significantly outperform previous approaches based on histogram methods [8] or fingerprinting algorithms [9], especially in the presence of distortions in the video data. Therefore, the state-of-the-art in this area is based on vectors of locally aggregated descriptors (VLAD) [10], or Bag-of-Words (BoW) methods [11], which merge feature descriptors in video frame regions. More recently, hyper-pooling approaches have been proposed [5], which perform two consecutive VLAD stages in order to compact entire video sequences into a unique aggregated descriptor vector.
In this paper, we focus on VLAD-based algorithms and examine the problem of creating compact representations that are suitable for efficient and accurate similarity identification of segments of videos within a large video collection. The paper makes the following contributions:

Instead of creating holistic hyper-pooling approaches for entire video sequences, we concentrate on groups of frames (GoFs) within a video sequence in order to allow for video segment search.

Instead of directly compacting feature descriptors, we follow a two-stage clustering approach: we first cluster features to obtain local feature centers (LFCs) and then encode the latter with respect to a given set of centers of local feature centers (CLFCs), computed from a training set.

Similar to VLAD, we encode the LFCs by aggregating their differences with respect to their corresponding CLFCs, thereby creating vectors of locally aggregated centers (VLAC).

Experiments using a 100-minute training set and a 1000-minute test set from the Open Video Project reveal that, for the same compaction factor, our proposal outperforms the state-of-the-art VLAD method [10] by more than 15% in terms of mean Average Precision (mAP).
The remainder of the paper is organized as follows. Section 2 summarizes the operation of VLAD and hyper-pooling, which constitute the state-of-the-art and form the basis of the proposed compaction algorithm. Section 3 presents the proposed VLAC approach. Section 4 presents experimental results, while Section 5 draws concluding remarks.
2 Background on VLAD and Hyper-Pooling
2.1 Visual Feature Description
Current solutions make use of image descriptors to represent individual frames within a video [4, 5]. After extracting the local feature descriptors of a given set of frames using an algorithm such as SIFT [6] or dense SIFT [7], these descriptors are accumulated to produce a compact frame representation. Recent work has advocated the use of pooling strategies instead of simple averaging methods, in order to minimize information loss. A common way to achieve this is by using BoW methods [11] or VLAD [10]. In this paper, we focus on the latter as it has been shown to achieve state-of-the-art results in terms of mAP in medium- and large-scale sets of image and video content.
2.2 Vector of Locally Aggregated Descriptors
VLAD [1, 10] is a vector aggregation algorithm that produces a fixed-size compact description of a set comprising a variable number of data points. VLAD was proposed as a novel approach aiming to optimize: (i) the representation of aggregated feature descriptors; (ii) the dimensionality reduction; (iii) the indexing of the output vectors.
These aspects are interrelated. For example, dimensionality reduction directly affects the way we index the output vectors. While high-dimensional vectors produce more accurate search results, low-dimensional vectors are easier to index and require fewer operations and less storage.
Consider a set of video frames to be used for training purposes. For the th training frame, , a visual feature detector and descriptor (e.g., the SIFT detector and descriptor [6]) is calculated, thereby producing feature vectors , , each with dimension . The ensemble of these features comprises the th training frame’s set of visual features . The concatenation of all these sets for all training frames, given by
, undergoes a clustering approach, such as K-means
[12], thereby grouping all vectors in into clusters, with centers denoted by set . VLAD then encodes the set of visual features, , of the th frame as the group of dimensional vectors () given by
(1) 
where is the quantization function that determines which cluster belongs to. Then, the VLAD of the th frame is given by the vector of aggregated local differences , with dimension . All these vectors are concatenated into the dimensional matrix
, which comprises the VLAD encoding of the training set. In order to allow for further dimensionality reduction (thereby accelerating the matching process), principal component analysis (PCA) is applied to
, and the most dominant eigenvectors are maintained in the
matrix in order to be used in the test set.
When considering a test video frame, once its set of visual features is produced by the SIFT descriptor (assuming points were detected), VLAD performs the following steps: (i) calculation of () via (1) with the precalculated center set ; (ii) aggregation of these into a composite vector and application of dimensionality reduction via the retained PCA coefficients in :
(2) 
where denotes the VLAD of the test video frame after compaction with PCA. The similarity between two VLAD vectors of two test video frames and is simply measured via . Thresholding the set of similarity (i.e., inner product) results between a test video frame and the entire test set of video frames provides the list of similar frames retrieved under the selected threshold value.
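To make the encoding and matching steps above concrete, the following is a minimal sketch in plain Python (standard library only). It assumes the standard VLAD formulation of [10]: each feature is assigned to its nearest center, the differences to that center are summed per cluster, and the per-cluster sums are concatenated. The function names, the optional L2 normalization and the omission of the PCA projection are simplifying assumptions made for illustration.

```python
def nearest_center(x, centers):
    """Quantization step: index of the center closest to x (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return min(range(len(centers)), key=lambda k: sq_dist(centers[k]))


def vlad_encode(features, centers):
    """Sum the residuals (feature minus assigned center) per cluster and concatenate."""
    dim = len(centers[0])
    residual_sums = [[0.0] * dim for _ in centers]
    for x in features:
        k = nearest_center(x, centers)
        for d in range(dim):
            residual_sums[k][d] += x[d] - centers[k][d]
    # The result has len(centers) * dim entries.
    return [v for cluster in residual_sums for v in cluster]


def l2_normalize(v):
    """Assumed normalization so that inner products are comparable across frames."""
    norm = sum(a * a for a in v) ** 0.5
    return [a / norm for a in v] if norm > 0 else list(v)


def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))


def retrieve(query_vec, database_vecs, threshold):
    """Indices of database vectors whose inner product with the query reaches the threshold."""
    return [i for i, v in enumerate(database_vecs)
            if inner_product(query_vec, v) >= threshold]
```

In a real pipeline the features would come from a SIFT or dense SIFT extractor, the centers from K-means on the training set, and the encoded vectors would be projected onto the retained PCA eigenvectors before matching.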
2.3 Hyper-Pooling
A recent method proposed by Douze et al. [5] makes use of hyper-pooling (HP) strategies on the video description level. Hyper-pooling works by using a second layer of data clustering and encoding a set of frame VLAD descriptors into a single vector. Hyper-pooling utilizes an enhanced hashing scheme by exploiting the temporal variance properties of VLAD vectors [5] that have been produced per frame. After performing PCA, the temporal variance of VLAD vectors is most prominent in the components associated with low eigenvalues. Hence, hyper-pooling postulates that we can get a more stable set of centers by applying a clustering algorithm (such as K-means) on the set of components relating to the highest eigenvalues. Indeed, hashing the components that vary less with time has been shown to provide better results in terms of stability and robustness to noise [5].
2.4 Motivation Behind the Proposed Concept
From the previous description, it is evident that the crucial aspects of VLAD and hyper-pooling are the clustering and the PCA process performed on the training set. Ideally, for a given set of video frames, we would like to produce principal component vectors for compaction of VLADs that do not change substantially when the video frames undergo real-world visual distortions. For example, consider two ensembles of training video frame sets, and , with the latter produced by distorting the video frames in via blurring, compression artifacts, rotation, gamma changes, etc. During the training stage, applying PCA on the vectors of local differences (obtained per frame) will produce dominant eigenvectors forming the matrices and . In the case of hyper-pooling, the aforementioned matrices will have a dimension of , where is the number of dimensions retained after the first VLAD stage. Ideally, the vectors in and should be reasonably well aligned, which is an indication that the compaction process is robust to noise. This can be tested by computing the sum of inner products between the dominant eigenvectors of both cases. For both VLAD and hyper-pooling, we obtain
(3) 
where denotes the element of . We carried out such an indicative test on a set of video frames taken from 10 video clips of 10-minute duration each. Each video underwent seven different visual distortions, as tabulated in Table 1 and detailed in Section 4. Using clusters for VLAD and for dense SIFT, we obtain and . However, utilizing the SIFT vectors directly, performing PCA decomposition to produce the two matrices and , and computing
(4) 
we get . The significant difference between and and represents the reduction in tolerance to distortions incurred when the vectors are projected to their principal components, which is performed in order to gain the benefit of compaction.
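The alignment test described above can be sketched as follows. This is one plausible reading of the sum-of-inner-products measure, under the assumptions that the two eigenvector sets are given in the same order, that every eigenvector has unit norm, and that raw (signed) inner products are summed. The function name and the input layout (a list of eigenvectors, each a list of floats) are hypothetical.

```python
def eigenvector_alignment(eigvecs_clean, eigvecs_distorted):
    """Sum of inner products between corresponding dominant eigenvectors.

    With unit-norm eigenvectors, the score is at most the number of
    eigenvectors, reached when the two sets are identical.
    """
    if len(eigvecs_clean) != len(eigvecs_distorted):
        raise ValueError("both sets must contain the same number of eigenvectors")
    return sum(
        sum(a * b for a, b in zip(u, v))
        for u, v in zip(eigvecs_clean, eigvecs_distorted)
    )
```

A score close to the number of retained eigenvectors would indicate that the projection changes little under distortion, while a score near zero would indicate that the dominant directions have moved substantially.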
In this paper, our aim is to design a method leading to the same compaction factor as VLAD, albeit having increased tolerance to distortion in the video frames, which will allow for high recall rates even when dealing with distorted versions of the input video content. A secondary aim is to design our approach in a way that directly deals with video segments rather than individual video frames, thus allowing for video segment similarity detection. These two aspects are elaborated in the next section.
3 Vector of Locally Aggregated Centers
3.1 VLAD per Video Frame
The similarity between two videos can be estimated by obtaining the VLAD inner products per frame and averaging. We consider this approach as the baseline for video similarity detection. This direct application of VLAD to video achieves good results in terms of retrieval accuracy, albeit at the expense of high complexity and storage requirements, even when the video is sampled at a substantially lower frame rate. All the solutions proposed are designed to approach the performance of this baseline as much as possible while requiring a fraction of its computational complexity and storage, or, alternatively, to significantly exceed the VLAD performance while incurring the same complexity and storage.
3.2 Temporal Compaction for Video Segment Searching
Video description algorithms such as hyper-pooling [5] were designed for holistic video description, namely, the derived vector describes the entire video information as a whole. Temporal coherency is lost when using such holistic description methods, thereby making the detection of video segments within longer videos impossible. This problem can be solved by modifying holistic solutions to work on groups of frames (GoFs) within each given video. GoFs can be viewed as fixed-size temporal windows, each of which is then compacted into a single VLAD, hyper-pooling or VLAC descriptor (referred to as VLAD-GoF, HP-GoF and VLAC-GoF, respectively). A video segment can then be matched by finding the maximum inner product between its VLAD-GoF, HP-GoF, or VLAC-GoF descriptor and the corresponding descriptor from a GoF in the video. Evidently, the length of the GoF controls the accuracy of the detection of video segments within longer videos. In addition, GoFs can also be overlapping to allow for better temporal resolution within the matching process.
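As an illustration of the windowing described above, the following sketch splits a sequence of sampled frames into fixed-size, optionally overlapping GoFs. The function name and the choice to drop a trailing partial window are assumptions made for this example.

```python
def split_into_gofs(num_frames, gof_size, overlap=0):
    """Return lists of frame indices, one list per group of frames (GoF).

    Consecutive GoFs share `overlap` frames; a trailing window shorter
    than `gof_size` is dropped (an assumption of this sketch).
    """
    if gof_size <= 0 or not 0 <= overlap < gof_size:
        raise ValueError("need gof_size > 0 and 0 <= overlap < gof_size")
    step = gof_size - overlap
    gofs = []
    start = 0
    while start + gof_size <= num_frames:
        gofs.append(list(range(start, start + gof_size)))
        start += step
    return gofs
```

For example, 13 sampled frames with a GoF size of 5 and an overlap of one frame give three windows starting at frames 0, 4 and 8. A larger overlap yields more descriptors per video and hence finer temporal resolution at higher storage cost.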
3.3 Proposed Vector of Locally Aggregated Centers
Instead of clustering the local descriptors found within each GoF, we propose to cluster the centers of clusters of local descriptors. The aim is to produce results that are more robust to distortions that may be found in a typical large video database. Encoding centers is expected to be more robust to such visual distortions since, compared to local feature descriptors, the centers of local feature descriptors will vary less when processing artifacts are introduced into the video frames.
Consider training GoFs stemming from a set of training videos. From the frames of each th GoF (), we extract a set of dense SIFT feature vectors , each having dimensions. From each , we calculate local feature centers (LFCs) . By concatenating the LFCs for each , we acquire the training set of LFCs . We then apply a second stage of clustering on to generate a set of centers of LFCs (CLFCs) , where each CLFC has dimensions.
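The two-stage training procedure above can be sketched as follows, assuming K-means is used for both clustering stages. The small K-means routine, its seeded random initialization, the fixed iteration count and all function and parameter names are illustrative assumptions.

```python
import random


def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns k centers (requires k <= len(points))."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the previous center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers


def train_clfcs(gof_feature_sets, num_lfcs, num_clfcs):
    """Stage 1: cluster each GoF's features into LFCs.
    Stage 2: cluster the pooled LFCs of all GoFs into CLFCs."""
    all_lfcs = []
    for features in gof_feature_sets:
        all_lfcs.extend(kmeans(features, num_lfcs))
    return kmeans(all_lfcs, num_clfcs)
```

Note that the second stage operates on far fewer points than the first (one set of LFCs per GoF rather than every raw descriptor), which keeps the CLFC training step comparatively cheap in this sketch.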
We now consider a test video query that contains GoFs. For every , we extract local features to obtain . Then, for every , we obtain a set of local feature centers . Using VLAD we encode each set of centers with the set of trained centers to generate a vector of locally aggregated centers (VLAC). Particularly, we first obtain the dimensional vector for each center in by applying
(5) 
The VLAC for is then obtained by concatenating for all into a single dimensional vector . We observe that does not affect the dimension of VLAC, but serves as a control variable for the coarseness of the description. After calculating for all , we project them on a trained set of principal eigenvectors to perform dimensionality reduction. We then concatenate these vectors to generate a compact dimensional vector for video . The similarity between two videos and is given by calculating . A threshold is then applied on to determine whether the videos are similar. If two videos contain a different number of GoFs (e.g., and GoFs with ), is calculated for all possible alignments of the vectors and . Finally, the maximum over is taken to be the similarity score. This can be expressed as
(6) 
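The encoding and alignment search described above can be sketched as follows. The sketch assumes that the LFCs are aggregated against the CLFCs exactly as features are aggregated against centers in VLAD, and that the score for one alignment is the sum of the per-GoF inner products, which equals the inner product of the concatenated vectors. The PCA projection is omitted, and all names are hypothetical.

```python
def vlac_encode(lfcs, clfcs):
    """Aggregate the differences between each LFC and its nearest CLFC."""
    dim = len(clfcs[0])
    sums = [[0.0] * dim for _ in clfcs]
    for c in lfcs:
        k = min(range(len(clfcs)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(c, clfcs[j])))
        for d in range(dim):
            sums[k][d] += c[d] - clfcs[k][d]
    # The length depends only on the number of CLFCs, not on the number of LFCs.
    return [v for cluster in sums for v in cluster]


def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))


def best_alignment_similarity(gofs_a, gofs_b):
    """Slide the shorter sequence of GoF descriptors over the longer one
    and return the maximum summed inner product over all alignments."""
    short, long_ = sorted((gofs_a, gofs_b), key=len)
    if not short:
        raise ValueError("both videos need at least one GoF descriptor")
    best = float("-inf")
    for offset in range(len(long_) - len(short) + 1):
        window = long_[offset:offset + len(short)]
        score = sum(inner_product(a, b) for a, b in zip(short, window))
        best = max(best, score)
    return best
```

The offset that attains the maximum also indicates where the shorter segment is located within the longer video, which is what makes segment search possible with per-GoF descriptors.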
Examining the performance of VLAC under the experiment of Section 2.4, we obtain , which is more than 13 times higher than . We therefore expect the proposed method to be significantly more robust than VLAD and hyper-pooling when assessing video similarity under noisy conditions. However, in order to be suitable for video retrieval, it must also be discriminative, i.e., be able to differentiate between dissimilar videos that would inherently lead to different features. This is assessed experimentally in the following section.
4 Evaluation of Video Descriptors
4.1 Dataset
We selected 100 random videos from the Open Video Project (OVP), comprising 1000 minutes of video. Seven types of distortions (Table 1) were applied to this footage to examine the performance of VLAD, hyper-pooling (HP) and VLAC under noise. Training for VLAD, VLAC and HP centers was done on OVP videos different from those used as test material.
To generate the queries, one-minute video segments were extracted from each original video. Then, the dataset and query videos were sampled at a rate of frames per second (fps). The sampling of the query videos, however, is shifted by 0.25 seconds with respect to the sampling of the videos in the dataset. In this way, sampling misalignments were also taken into account. First, we evaluate the similarity detection of the proposed VLAC versus the state-of-the-art VLAD when both are extracted from each sampled frame in the sequence (that is, ). For VLAD, we set , while for VLAC we use and . This provides an upper bound on the detection accuracy and assesses the performance of the proposed method versus the standard per-frame VLAD. Next, the proposed VLAC-GoF is compared against VLAD-GoF and HP-GoF, where one descriptor per GoF of 5 frames is derived and the overlap is set to one frame. Concerning the parameters for each method, we use for VLAD-GoF, and for VLAC-GoF. For HP-GoF, the number of centers used to encode the first-stage VLAD is and for the second stage , where we keep dimensions from the first-stage VLAD.
Table 1: Visual distortions and their parameters.
Distortion          Parameters
Scaling             FFMPEG: -vf scale=iw/2:-1
Rotation            FFMPEG: -vf rotate
Blurring            FFMPEG: -vf boxblur 1:2:2
Compression         FFMPEG: -crf 35
Gamma Correction    FFMPEG: -vf mp 1:1.2:0.5:1.25:1:1:1
Flicker             OpenCV: random brightness change (120%–170%)
Perspective Change  OpenCV: AffineTransform triangle [(0,0),(0.85,0.1),(0,1)]
4.2 Performance and Results
Fig. 1 depicts the precision versus the recall achieved with the proposed VLAC and the state-of-the-art VLAD [10], when both descriptors are extracted from each of the frames in the compared video segments. The results show that the proposed descriptor offers a substantial detection accuracy improvement compared to VLAD across the entire precision-recall range. The improved performance of VLAC can be explained by its improved tolerance to noise, i.e., , which indicates that the principal component projections do not vary substantially after the application of distortions. Therefore, VLAC retains more information after being projected on its trained principal components. Note that the training videos used to generate the principal components did not have any distortions applied to them; this is to simulate real-life conditions where we cannot predict the distortions in the dataset. In addition, all distortions were applied to all videos in the dataset, meaning that higher recall reflects higher tolerance to distortions. The same observations can be made from the results in Fig. 2, where our VLAC is compared against VLAD and hyper-pooling for a GoF of size 5 frames.
Table 2 shows the mean average precision (mAP) for the three compared methods, where is the number of dimensions after projection. The results show that, under the same , VLAC improves the mAP by compared to VLAD for frame-by-frame matching and for GoF-based matching. The improvement offered by VLAC-GoF over HP-GoF reaches up to .
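Table 2: Mean average precision (mAP) of the compared methods.
Method               Dimensions  mAP
VLAD [10]            256         0.7462
VLAD [10]            128         0.6761
Proposed VLAC        256         0.9600
Proposed VLAC        128         0.9330
VLAD-GoF [10]        256         0.5647
VLAD-GoF [10]        128         0.5262
Proposed VLAC-GoF    256         0.7147
Proposed VLAC-GoF    128         0.6493
HP-GoF [5]           256         0.4382
HP-GoF [5]           128         0.4135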
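For readers unfamiliar with the metric, the following sketch shows how mAP is commonly computed from ranked retrieval results. It assumes the usual ranked-list definition of average precision; the exact evaluation protocol behind the figures above is not detailed in the text, so the function names and input layout are illustrative assumptions.

```python
def average_precision(ranked_relevance, num_relevant):
    """Average precision for one query.

    ranked_relevance: booleans for the retrieved items, best match first.
    num_relevant: total number of relevant items in the dataset for the query.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / num_relevant if num_relevant else 0.0


def mean_average_precision(per_query):
    """per_query: iterable of (ranked_relevance, num_relevant) pairs."""
    per_query = list(per_query)
    return sum(average_precision(r, n) for r, n in per_query) / len(per_query)
```

Under this definition, a query whose relevant items are all ranked first scores 1.0, and every relevant item that is ranked low or missed entirely pulls the score down.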
5 Conclusion
We proposed a novel compact video representation method based on aggregating local feature centers. Our results show that encoding local feature centers yields significantly better results than simply encoding the features, which are less tolerant to visual distortions commonly found in video databases. The proposed approach is therefore suitable for video similarity detection with robustness to visual distortions. The recall-precision results were improved without incurring extra complexity in the signature matching process. Future work will assess the performance of the proposed approach under uncontrolled distortion conditions and on even larger datasets.
References

[1] R. Arandjelovic and A. Zisserman, “All about VLAD,” in IEEE Int. Conf. on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 1578–1585.
[2] C.-L. Chou et al., “Near-duplicate video retrieval and localization using pattern set based dynamic programming,” in IEEE Int. Conf. on Multimedia and Expo (ICME), 2013, pp. 1–6.
[3] M. Wang et al., “Large-scale image and video search: Challenges, technologies, and trends,” J. of Visual Communication and Image Representation, vol. 21, no. 8, pp. 771–772, 2010.
[4] J. Revaud, M. Douze, C. Schmid, and H. Jégou, “Event retrieval in large video collections with circulant temporal encoding,” in IEEE Int. Conf. on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 2459–2466.
[5] M. Douze, J. Revaud, C. Schmid, and H. Jégou, “Stable hyper-pooling and query expansion for event detection,” in IEEE Int. Conf. on Computer Vision (ICCV), 2013, pp. 1825–1832.
[6] D. G. Lowe, “Object recognition from local scale-invariant features,” in IEEE Int. Conf. on Computer Vision, 1999, vol. 2, pp. 1150–1157.
[7] A. Vedaldi and B. Fulkerson, “VLFeat: An open and portable library of computer vision algorithms,” in ACM Int. Conf. on Multimedia, 2010, pp. 1469–1472.
[8] A. Hampapur, K. Hyun, and R. M. Bolle, “Comparison of sequence matching techniques for video copy detection,” in Electronic Imaging 2002, 2001, pp. 194–201.
[9] M. M. Esmaeili, M. Fatourechi, and R. K. Ward, “A robust and fast video copy detection system using content-based fingerprinting,” IEEE Trans. on Information Forensics and Security, vol. 6, no. 1, pp. 213–226, 2011.
[10] H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez, and C. Schmid, “Aggregating local image descriptors into compact codes,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 34, no. 9, pp. 1704–1716, 2012.
[11] J. Yang, Y.-G. Jiang, A. G. Hauptmann, and C.-W. Ngo, “Evaluating bag-of-visual-words representations in scene classification,” in Int. Workshop on Multimedia Information Retrieval. ACM, 2007, pp. 197–206.
[12] C. M. Bishop et al., Pattern recognition and machine learning, vol. 1, Springer New York, 2006.
[13] C. Kim and B. Vasudev, “Spatiotemporal sequence matching for efficient video copy detection,” IEEE Trans. Circuits and Systems for Video Technology, vol. 15, no. 1, pp. 127–132, 2005.
[14] M. R. Naphade, M. M. Yeung, and B.-L. Yeo, “Novel scheme for fast and efficient video sequence matching using compact signatures,” in Electronic Imaging. International Society for Optics and Photonics, 1999, pp. 564–572.
[15] S. S. Cheung and A. Zakhor, “Estimation of web video multiplicity,” in Electronic Imaging. International Society for Optics and Photonics, 1999, pp. 34–46.
[16] J. Lu, “Video fingerprinting for copy identification: from research to industry applications,” in IS&T/SPIE Electronic Imaging. International Society for Optics and Photonics, 2009, pp. 725402–725402.