Vectors of Locally Aggregated Centers for Compact Video Representation

09/13/2015 · Alhabib Abbas et al. · UCL

We propose a novel vector aggregation technique for compact video representation, with application in accurate similarity detection within large video datasets. The current state-of-the-art in visual search is formed by the vector of locally aggregated descriptors (VLAD) of Jégou et al. VLAD generates compact video representations based on scale-invariant feature transform (SIFT) vectors (extracted per frame) and local feature centers computed over a training set. With the aim to increase robustness to visual distortions, we propose a new approach that operates at a coarser level in the feature representation. We create vectors of locally aggregated centers (VLAC) by first clustering SIFT features to obtain local feature centers (LFCs) and then encoding the latter with respect to given centers of local feature centers (CLFCs), extracted from a training set. The sum-of-differences between the LFCs and the CLFCs are aggregated to generate an extremely-compact video description used for accurate video segment similarity detection. Experimentation using a video dataset, comprising more than 1000 minutes of content from the Open Video Project, shows that VLAC obtains substantial gains in terms of mean Average Precision (mAP) against VLAD and the hyper-pooling method of Douze et al., under the same compaction factor and the same set of distortions.







1 Introduction

Recommendation services, event detection, clustering and categorization of video data, and retrieval algorithms for large video databases depend on efficient and reliable similarity identification amongst video segments [1, 2, 3]. In a nutshell, given a query video, we wish to find all similar video segments within a large video database in the most reliable and efficient way. The state-of-the-art in similarity identification hinges on video fingerprinting algorithms [4, 5]. The aim of such algorithms is to provide distinguishable representations that remain robust under visual distortions, such as rotation, compression, blur, resizing, flicker, etc. Such distortions are expected to be present within large video collections, or when dealing with content “in the wild” [3].

In a broad sense, video similarity identification can be seen as a spatio-temporal matching problem via an appropriate feature space or descriptor. Recent results have shown that similarity identification algorithms based on local descriptors, such as the scale invariant feature transform (SIFT) [6] or dense SIFT [7], tend to significantly outperform previous approaches based on histogram methods [8] or fingerprinting algorithms [9], especially under the presence of distortions in the video data. Therefore, the state-of-the-art in this area is based on vectors of locally aggregated descriptors (VLAD) [10], or Bag-of-Words (BoW) methods [11], which merge feature descriptors in video frame regions. More recently, hyper-pooling approaches have been proposed [5], which perform two consecutive VLAD stages in order to compact entire video sequences into a unique aggregated descriptor vector.

In this paper, we focus on VLAD-based algorithms and examine the problem of creating compact representations that are suitable for efficient and accurate similarity identification of segments of videos within a large video collection. The paper makes the following contributions:

  • Instead of creating holistic hyper-pooling approaches for entire video sequences, we concentrate on groups of frames (GoFs) within a video sequence in order to allow for video segment search.

  • Instead of directly compacting feature descriptors, we follow a two-stage clustering approach: we first cluster features to obtain local-feature-centers (LFCs) and then encode the latter with respect to a given set of centers of local-feature-centers (CLFCs), computed from a training set.

  • Similar to VLAD, we encode the LFCs by aggregating their differences with respect to their corresponding CLFCs, thereby creating vectors of locally aggregated centers (VLAC).

  • Experiments using a 100-minute training set and a 1000-minute test set from the Open Video Project reveal that, for the same compaction factor, our proposal outperforms the state-of-the-art VLAD method [10] by more than 15% in terms of mean Average Precision (mAP).

The remainder of the paper is organized as follows. Section 2 summarizes the operation of VLAD and hyper-pooling, which constitute the state-of-the-art and form the basis of the proposed compaction algorithm. Section 3 presents the proposed VLAC approach. Section 4 presents experimental results, while Section 5 draws concluding remarks.

2 Background on VLAD and Hyper-Pooling

2.1 Visual Feature Description

Current solutions make use of image descriptors to represent individual frames within a video [4, 5]. After extracting the local feature descriptors of a given set of frames using an algorithm such as SIFT [6] or dense SIFT [7], these descriptors are then accumulated to produce a compact frame representation. Recent work advocated the use of pooling strategies instead of simple averaging methods, in order to minimize information loss. A common way to achieve this is by using BoW methods [11] or VLAD [10]. In this paper, we focus on the latter as it has been shown to achieve state-of-the-art results in terms of mAP in medium and large-scale sets of image and video content.

2.2 Vector of Locally Aggregated Descriptors

VLAD [1, 10] is a vector aggregation algorithm that produces a fixed-size compact description of a set comprising a variable number of data points. VLAD was proposed as a novel approach aimed to optimize: (i) the representation of aggregated feature descriptors; (ii) the dimensionality reduction; (iii) the indexing of the output vectors.

These aspects are interrelated; for example, dimensionality reduction directly affects the way we index the output vectors. While high-dimensional vectors produce more accurate search results, low-dimensional vectors are easier to index and require fewer operations and less storage.

Consider a set of $F$ video frames to be used for training purposes. For the $n$th training frame ($n = 1, \ldots, F$), a visual feature detector and descriptor (e.g., the SIFT detector and descriptor [6]) is calculated, thereby producing feature vectors $\mathbf{x}_{n,1}, \ldots, \mathbf{x}_{n,J_n}$, each with dimension $d$. The ensemble of these features comprises the $n$th training frame’s set of visual features $X_n$. The concatenation of all these sets for all training frames, given by $X = X_1 \cup \cdots \cup X_F$, undergoes a clustering approach, such as K-means [12], thereby grouping all vectors in $X$ into $K$ clusters, with centers denoted by the set $C = \{\mathbf{c}_1, \ldots, \mathbf{c}_K\}$. VLAD then encodes the set of visual features, $X_n$, of the $n$th frame as the group of $d$-dimensional vectors $\mathbf{v}_{n,k}$ ($k = 1, \ldots, K$) given by

$$\mathbf{v}_{n,k} = \sum_{\mathbf{x} \in X_n : \, q(\mathbf{x}) = \mathbf{c}_k} (\mathbf{x} - \mathbf{c}_k), \qquad (1)$$

where $q(\cdot)$ is the quantization function that determines which cluster $\mathbf{x}$ belongs to. Then, the VLAD of the $n$th frame is given by the vector of aggregated local differences $\mathbf{v}_n = [\mathbf{v}_{n,1}^{\mathrm{T}}, \ldots, \mathbf{v}_{n,K}^{\mathrm{T}}]^{\mathrm{T}}$, with dimension $Kd$. All these vectors are concatenated into the $Kd \times F$ matrix $\mathbf{V}$, which comprises the VLAD encoding of the training set. In order to allow for further dimensionality reduction (thereby accelerating the matching process), principal component analysis (PCA) is applied to $\mathbf{V}$, and the $D$ most dominant eigenvectors are maintained in the $Kd \times D$ matrix $\mathbf{P}$ in order to be used in the test set.

When considering a test video frame, once its set of visual features $X$ is produced by the SIFT descriptor (assuming $J$ points were detected), VLAD performs the following steps: (i) calculation of $\mathbf{v}_k$ ($k = 1, \ldots, K$) via (1) with the precalculated center set $C$; (ii) aggregation of these into a composite vector $\mathbf{v}$ and application of dimensionality reduction via the retained PCA coefficients in $\mathbf{P}$:

$$\hat{\mathbf{v}} = \mathbf{P}^{\mathrm{T}} \mathbf{v}, \qquad (2)$$

where $\hat{\mathbf{v}}$ denotes the VLAD of the test video frame after compaction with PCA. The similarity between the VLAD vectors of two test video frames $a$ and $b$ is simply measured via the inner product $\langle \hat{\mathbf{v}}_a, \hat{\mathbf{v}}_b \rangle$. Thresholding the set of similarity (i.e., inner product) results between a test video frame and the entire test set of video frames provides the list of similar frames retrieved under the selected threshold value.
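As an illustration of the encoding and compaction steps above, the following is a minimal NumPy sketch. The toy dimensions, the brute-force nearest-center quantization, and the identity-based stand-in for the trained eigenvector matrix are our own assumptions; a real system would use 128-dimensional SIFT descriptors and a learned PCA basis.

```python
import numpy as np

def vlad_encode(X, C):
    """Encode descriptors X (J x d) against centers C (K x d) into a K*d VLAD."""
    # q(x): brute-force nearest-center quantization.
    assign = np.argmin(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1), axis=1)
    K, d = C.shape
    v = np.zeros((K, d))
    for k in range(K):
        members = X[assign == k]
        if len(members):
            v[k] = (members - C[k]).sum(axis=0)  # aggregated residuals, eq. (1)
    return v.ravel()  # dimension K*d

# Toy example: d = 2, K = 2 centers, J = 4 descriptors.
C = np.array([[0.0, 0.0], [10.0, 10.0]])
X = np.array([[1.0, 0.0], [0.0, 1.0], [11.0, 10.0], [10.0, 9.0]])
v = vlad_encode(X, C)                # -> [1, 1, 1, -1]

# Compaction, eq. (2): project onto the D retained eigenvectors.
P = np.eye(4)[:, :2]                 # identity stand-in for the trained Kd x D matrix
v_hat = P.T @ v
similarity = float(v_hat @ v_hat)    # inner-product similarity of a frame with itself
```

The per-cluster residual sums make the representation sensitive to *where* descriptors fall relative to their assigned centers, which is what distinguishes VLAD from simple histogram counting.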

2.3 Hyper-Pooling

A recent method proposed by Douze et al. [5] makes use of hyper-pooling (HP) strategies at the video description level. Hyper-pooling uses a second layer of data clustering to encode a set of per-frame VLAD descriptors into a single vector. It utilizes an enhanced hashing scheme that exploits the temporal variance properties of the VLAD vectors produced per frame. After performing PCA, the temporal variance of VLAD vectors is most prominent in the components associated with low eigenvalues. Hence, hyper-pooling postulates that a more stable set of centers can be obtained by applying a clustering algorithm (such as K-means) on the set of components relating to the highest eigenvalues. Indeed, hashing the components that vary less over time has been shown to provide better results in terms of stability and robustness to noise.
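The “stable components” idea can be sketched as follows, under our own assumptions (a toy eigen-decomposition of the VLAD covariance and a tiny illustrative K-means, not the authors’ implementation):

```python
import numpy as np

def stable_components(vlads, n_keep):
    """Project per-frame VLADs (rows) onto the PCA components associated with
    the highest eigenvalues, which hyper-pooling treats as temporally stable."""
    Xc = vlads - vlads.mean(axis=0)
    w, P = np.linalg.eigh(Xc.T @ Xc)     # eigenvalues ascending; columns = eigenvectors
    order = np.argsort(w)[::-1]          # highest eigenvalues first
    return Xc @ P[:, order[:n_keep]]     # coordinates for the second-stage clustering

def kmeans(X, K, iters=20, seed=0):
    """Tiny K-means for the second clustering stage (illustrative only)."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        C = np.array([X[assign == k].mean(axis=0) if np.any(assign == k) else C[k]
                      for k in range(K)])
    return C, assign

# Hypothetical per-frame VLADs (50 frames, 8 dims); keep 3 stable components.
rng = np.random.default_rng(1)
vlads = rng.normal(size=(50, 8))
coords = stable_components(vlads, n_keep=3)
centers, assign = kmeans(coords, K=4)
```

Clustering on the retained coordinates rather than the raw VLADs is the key point: the hashing then depends only on the components that vary least over time.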


2.4 Motivation Behind the Proposed Concept

From the previous description, it is evident that the crucial aspects of VLAD and hyper-pooling are the clustering and the PCA process performed on the training set. Ideally, for a given set of video frames, we would like to produce principal component vectors for the compaction of VLADs that do not change substantially when the video frames undergo real-world visual distortions. For example, consider two ensembles of training video frame sets, $\mathcal{F}$ and $\mathcal{F}'$, with the latter produced by distorting the video frames in $\mathcal{F}$ via blurring, compression artifacts, rotation, gamma changes, etc. During the training stage, applying PCA on the vectors of local differences (obtained per frame) will produce $D$ dominant eigenvectors forming the matrices $\mathbf{P}$ and $\mathbf{P}'$. In the case of hyper-pooling, the aforementioned matrices will have dimension $D' \times D$, where $D'$ is the number of dimensions retained after the first VLAD stage. Ideally, the vectors in $\mathbf{P}$ and $\mathbf{P}'$ should be reasonably well aligned, which is an indication that the compaction process is robust to noise. This can be tested by computing the sum of inner products between the dominant eigenvectors of both cases. For both VLAD and hyper-pooling, we obtain

$$\rho = \sum_{i=1}^{D} \left| \left\langle \mathbf{p}_i, \mathbf{p}'_i \right\rangle \right|,$$

where $\mathbf{p}_i$ denotes the $i$th column of $\mathbf{P}$. We carried out such an indicative test on a set of video frames taken from 10 video clips of 10-minute duration each. Each video underwent seven different visual distortions, as tabulated in Table 1 and detailed in Section 4. Using the same number of clusters for VLAD and the same dense SIFT features, we obtain low alignment values $\rho_{\mathrm{VLAD}}$ and $\rho_{\mathrm{HP}}$. However, utilizing the SIFT vectors directly, performing a PCA decomposition to produce the two matrices $\mathbf{P}_{\mathrm{SIFT}}$ and $\mathbf{P}'_{\mathrm{SIFT}}$, and computing

$$\rho_{\mathrm{SIFT}} = \sum_{i=1}^{D} \left| \left\langle \mathbf{p}_{\mathrm{SIFT},i}, \mathbf{p}'_{\mathrm{SIFT},i} \right\rangle \right|,$$

we get a substantially higher value. The significant difference between $\rho_{\mathrm{SIFT}}$ and both $\rho_{\mathrm{VLAD}}$ and $\rho_{\mathrm{HP}}$ represents the reduction in tolerance to distortions incurred when the vectors are projected onto their principal components, which is performed in order to gain the benefit of compaction.
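The alignment measure itself is simple to compute; the following is a minimal NumPy sketch of the sum of absolute inner products between corresponding eigenvectors (the function name and toy bases are our own):

```python
import numpy as np

def alignment_score(P, P_prime):
    """rho: sum of absolute inner products between corresponding columns
    (dominant eigenvectors) of the clean and distorted PCA bases."""
    return float(np.abs(np.sum(P * P_prime, axis=0)).sum())

# Sanity check: identical bases are perfectly aligned (rho = D),
# while orthogonal bases are maximally misaligned (rho = 0).
P = np.eye(4)[:, :2]                             # D = 2 dominant eigenvectors
rho_same = alignment_score(P, P)                 # -> 2.0
rho_orth = alignment_score(P, np.eye(4)[:, 2:])  # -> 0.0
```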

In this paper, our aim is to design a method leading to the same compaction factor as VLAD, albeit with increased tolerance to distortion in the video frames, which will allow for high recall rates even when dealing with distorted versions of the input video content. A secondary aim is to design our approach in a way that directly deals with video segments rather than individual video frames, thus allowing for video segment similarity detection. These two aspects are elaborated in the next section.

3 Vector of Locally Aggregated Centers

3.1 VLAD per Video Frame

The similarity between two videos can be estimated by obtaining the VLAD inner products per frame and averaging. We consider this approach as the baseline for video similarity detection. This direct application of VLAD to video achieves good results in terms of retrieval accuracy, albeit at the expense of high complexity and storage requirements, even when the video is sampled at a substantially lower frame-rate. All the solutions proposed are designed to approach the performance of this baseline as much as possible while requiring a fraction of its computational complexity and storage, or, alternatively, to significantly exceed the VLAD performance while incurring the same complexity and storage.
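For two equal-length frame sequences, this baseline reduces to averaging row-wise inner products of their per-frame compacted VLADs; a minimal sketch (function name and toy data are our own):

```python
import numpy as np

def baseline_similarity(V_a, V_b):
    """Average of per-frame VLAD inner products between two equal-length
    sequences; rows are compacted per-frame VLAD vectors."""
    return float(np.mean(np.sum(V_a * V_b, axis=1)))

# Two identical two-frame sequences with unit-norm per-frame VLADs.
V_a = np.array([[1.0, 0.0], [0.0, 1.0]])
V_b = np.array([[1.0, 0.0], [0.0, 1.0]])
score = baseline_similarity(V_a, V_b)
```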

3.2 Temporal Compaction for Video Segment Searching

Video description algorithms such as hyper-pooling [5] were designed for holistic video description, namely, the derived vector describes the entire video information as a whole. Temporal coherency is lost when using such holistic description methods, thereby making the detection of video segments within longer videos impossible. This problem can be solved by modifying holistic solutions to work on groups of frames (GoFs) within each given video. GoFs can be viewed as fixed-size temporal windows, each of which is then compacted into a single VLAD, hyper-pooling or VLAC descriptor (referred to as VLAD-GoF, HP-GoF and VLAC-GoF, respectively). A video segment can then be matched by finding the maximum inner product between its VLAD-GoF, HP-GoF, or VLAC-GoF descriptor and the corresponding descriptor of a GoF in the video. Evidently, the length of the GoF controls the accuracy of the detection of video segments within longer videos. In addition, GoFs can also be overlapping to allow for better temporal resolution within the matching process.
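The overlapping-window scheme can be sketched as follows (assuming, for illustration, the GoF of 5 frames with a 1-frame overlap used later in Section 4):

```python
def gof_windows(num_frames, gof_size=5, overlap=1):
    """Start/end frame indices of overlapping groups of frames (GoFs).
    Consecutive windows share `overlap` frames."""
    step = gof_size - overlap
    return [(s, s + gof_size) for s in range(0, num_frames - gof_size + 1, step)]

# A 13-frame video split into GoFs of 5 frames overlapping by 1 frame:
windows = gof_windows(13, gof_size=5, overlap=1)
# -> [(0, 5), (4, 9), (8, 13)]
```

A larger overlap improves the temporal resolution of segment matching at the cost of storing more descriptors per video.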

3.3 Proposed Vector of Locally Aggregated Centers

Instead of clustering the local descriptors found within each GoF, we propose to cluster the centers of clusters of local descriptors. The aim is to produce representations that are more robust to distortions that may be found in a typical large video database. Encoding centers is expected to be more robust to such visual distortions since, compared to local feature descriptors, the centers of local feature descriptors will vary less when artifacts from processing are incurred on video frames.

Consider $G$ training GoFs stemming from a set of training videos. From the frames of each $g$th GoF ($g = 1, \ldots, G$), we extract a set of dense SIFT feature vectors $X_g$, each vector having $d$ dimensions. From each $X_g$, we calculate $L$ local feature centers (LFCs) $M_g = \{\mathbf{m}_{g,1}, \ldots, \mathbf{m}_{g,L}\}$. By concatenating the LFCs for each $g$, we acquire the training set of LFCs $M = M_1 \cup \cdots \cup M_G$. We then apply a second stage of clustering on $M$ to generate a set of $K$ centers of LFCs (CLFCs) $C = \{\mathbf{c}_1, \ldots, \mathbf{c}_K\}$, where each CLFC has $d$ dimensions.

We now consider a test video query $a$ that contains $S$ GoFs. For every GoF $s$ ($s = 1, \ldots, S$), we extract local features to obtain $X_s$. Then, for every $X_s$, we obtain a set of $L$ local feature centers $M_s$. Using VLAD, we encode each set of centers $M_s$ with the set of trained centers $C$ to generate a vector of locally aggregated centers (VLAC). Particularly, we first obtain the $d$-dimensional vector $\mathbf{v}_{s,k}$ for each center $\mathbf{c}_k$ in $C$ by applying

$$\mathbf{v}_{s,k} = \sum_{\mathbf{m} \in M_s : \, q(\mathbf{m}) = \mathbf{c}_k} (\mathbf{m} - \mathbf{c}_k). \qquad (3)$$

The VLAC for $X_s$ is then obtained by concatenating $\mathbf{v}_{s,k}$ for all $k$ into a single $Kd$-dimensional vector $\mathbf{v}_s$. We observe that $L$ does not affect the dimension of VLAC, but serves as a control variable for the coarseness of the description. After calculating $\mathbf{v}_s$ for all $S$ GoFs, we project them on a trained set of principal eigenvectors to perform dimensionality reduction. We then concatenate these vectors to generate a compact descriptor $\hat{\mathbf{v}}_a$ for video $a$. The similarity between two videos $a$ and $b$ is given by calculating the inner product $r_{a,b} = \langle \hat{\mathbf{v}}_a, \hat{\mathbf{v}}_b \rangle$. A threshold is then applied on $r_{a,b}$ to determine whether the videos are similar. If two videos contain a different number of GoFs (e.g., $S_a$ and $S_b$ GoFs with $S_a < S_b$), $r_{a,b}$ is calculated for all possible alignments of the vectors $\hat{\mathbf{v}}_a$ and $\hat{\mathbf{v}}_b$. Finally, the maximum over all alignments is taken to be the similarity score. This can be expressed as

$$r_{a,b} = \max_{0 \le t \le S_b - S_a} \sum_{s=1}^{S_a} \left\langle \hat{\mathbf{v}}_{a,s}, \hat{\mathbf{v}}_{b,s+t} \right\rangle. \qquad (4)$$
Examining the performance of VLAC under the experiment of Section 2.4, we obtain a value $\rho_{\mathrm{VLAC}}$ that is more than 13 times higher than $\rho_{\mathrm{VLAD}}$. We therefore expect the proposed method to be significantly more robust than VLAD and hyper-pooling when assessing video similarity under noisy conditions. However, in order to be suitable for video retrieval, it must also be discriminative, i.e., able to differentiate between dissimilar videos that would inherently lead to different features. This is assessed experimentally in the following section.
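The two-stage construction and the alignment-maximizing similarity above can be sketched in NumPy. The toy K-means for the first clustering stage and the brute-force quantization are our own simplifications, not the authors' implementation:

```python
import numpy as np

def vlac_encode(X, L, C, seed=0):
    """VLAC for one GoF: cluster descriptors X (J x d) into L local feature
    centers (LFCs), then aggregate those centers against the CLFCs C (K x d)."""
    rng = np.random.default_rng(seed)
    # Stage 1: toy K-means producing L LFCs from the GoF's descriptors.
    M = X[rng.choice(len(X), L, replace=False)]
    for _ in range(10):
        a = np.argmin(((X[:, None] - M[None]) ** 2).sum(-1), axis=1)
        M = np.array([X[a == l].mean(axis=0) if np.any(a == l) else M[l]
                      for l in range(L)])
    # Stage 2: VLAD-style aggregation of the LFCs against the trained CLFCs.
    assign = np.argmin(((M[:, None] - C[None]) ** 2).sum(-1), axis=1)
    K, d = C.shape
    v = np.zeros((K, d))
    for k in range(K):
        members = M[assign == k]
        if len(members):
            v[k] = (members - C[k]).sum(axis=0)
    return v.ravel()  # K*d dimensions, independent of L

def best_alignment_similarity(A, B):
    """Maximum over temporal alignments of the summed GoF-wise inner products;
    A holds S_a GoF descriptors as rows, B holds S_b >= S_a."""
    S_a, S_b = len(A), len(B)
    return max(float(np.sum(A * B[t:t + S_a])) for t in range(S_b - S_a + 1))

# Hypothetical GoF with 8 two-dimensional descriptors, L = 2 LFCs, K = 2 CLFCs.
C = np.array([[0.0, 0.0], [10.0, 10.0]])
X = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, 0.0], [1.0, 1.0],
              [10.0, 11.0], [11.0, 10.0], [9.0, 10.0], [10.0, 10.0]])
v = vlac_encode(X, L=2, C=C)   # K*d = 4 dimensions, independent of L
```

Note that increasing `L` refines the coarseness of the description without growing the descriptor, exactly as stated above.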

4 Evaluation of Video Descriptors

4.1 Dataset

We selected 100 random videos from the Open Video Project (OVP), comprising 1000 minutes of video. Seven types of distortions (Table 1) were applied to this footage to examine the performance of VLAD, hyper-pooling (HP) and VLAC under noise. Training for the VLAD, VLAC and HP centers was done on OVP videos different from the utilized test material.

To generate the queries, one-minute video segments were extracted from each original video. The dataset and query videos were then sampled at a fixed frame rate, with the sampling of the query videos shifted by 0.25 seconds with respect to the sampling of the videos in the dataset; in this way, sampling misalignments were also taken into account. First, we evaluate the similarity detection of the proposed VLAC versus the state-of-the-art VLAD when both are extracted from each sampled frame in the sequence (that is, with a GoF size of one frame). This provides an upper bound on the detection accuracy and assesses the performance of the proposed method versus the standard per-frame VLAD. Next, the proposed VLAC-GoF is compared against VLAD-GoF and HP-GoF, where one descriptor per GoF of 5 frames is derived and the overlap is set to one frame. Concerning the parameters for each method, the number of clusters for VLAD-GoF and the numbers of LFCs and CLFCs for VLAC-GoF were chosen so that all methods operate under the same compaction factor; for HP-GoF, separate numbers of centers were used to encode the first-stage and second-stage VLADs, keeping a fixed number of dimensions from the first-stage VLAD.
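The shifted sampling can be sketched as follows (illustrative only: the frame rate below is an arbitrary stand-in, since the rate used in the experiments is not restated here):

```python
def sample_times(duration_s, fps, offset_s=0.0):
    """Timestamps (seconds) at which frames are sampled; query videos use a
    0.25 s offset to emulate sampling misalignment with the dataset."""
    n = int((duration_s - offset_s) * fps)
    return [offset_s + i / fps for i in range(n)]

dataset_ts = sample_times(2.0, fps=4)              # [0.0, 0.25, ..., 1.75]
query_ts = sample_times(2.0, fps=4, offset_s=0.25) # shifted by 0.25 s
```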

Distortion            Parameters
Scaling               FFMPEG: -vf scale=iw/2:-1
Rotation              FFMPEG: -vf rotate
Blurring              FFMPEG: -vf boxblur 1:2:2
Compression           FFMPEG: -crf 35
Gamma Correction      FFMPEG: -vf mp 1:1.2:0.5:1.25:1:1:1
Flicker               OpenCV: random brightness change (120%–170%)
Perspective Change    OpenCV: affine transform with triangle [(0,0),(0.85,0.1),(0,1)]
Table 1: Set of distortions applied to the videos in the database.
Figure 1: Precision versus recall for VLAD [10] and the proposed VLAC, when extracted per frame (GoF of one frame); (a) $D = 256$ and (b) $D = 128$.

4.2 Performance and Results

Fig. 1 depicts the precision versus the recall achieved with the proposed VLAC and the state-of-the-art VLAD [10], when both descriptors are extracted from each of the frames in the compared video segments. The results show that the proposed descriptor offers a substantial detection accuracy improvement compared to VLAD across the entire precision-recall range. The improved performance of VLAC can be explained by its improved tolerance to noise, i.e., the substantially higher $\rho$ value reported in Section 3.3, which indicates that the principal component projections do not vary substantially after the application of distortions. Therefore, VLAC retains more information after being projected on its trained principal components. Note that the training videos used to generate the principal components did not have any distortions applied to them; this is to simulate real-life conditions where we cannot predict the distortions in the dataset. In addition, all distortions were applied on all videos in the dataset, meaning that higher recall reflects higher tolerance to distortions. The same observations can be made from the results in Fig. 2, where our VLAC is compared against VLAD and hyper-pooling for a GoF of size 5 frames.

Figure 2: Precision versus recall for VLAD [10], HP [5], and the proposed VLAC with a GoF of 5 frames and an overlap of 1 frame; (a) $D = 256$ and (b) $D = 128$.

Table 2 shows the mean average precision (mAP) for the three compared methods, where $D$ is the number of dimensions after projection. The results show that, under the same $D$, VLAC improves the mAP by more than 0.21 compared to VLAD for frame-by-frame matching and by more than 0.12 for GoF-based matching. The improvement offered by VLAC-GoF over HP-GoF reaches up to about 0.28 in mAP.

Method               D     mAP
VLAD [10]            256   0.7462
                     128   0.6761
Proposed VLAC        256   0.9600
                     128   0.9330
VLAD-GoF [10]        256   0.5647
                     128   0.5262
Proposed VLAC-GoF    256   0.7147
                     128   0.6493
HP-GoF [5]           256   0.4382
                     128   0.4135
Table 2: Mean Average Precision (mAP) for VLAD [10], HP [5] and the proposed VLAC under frame-by-frame operation (top two) and GoF-based operation (bottom three).
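For reference, the per-query average precision underlying the mAP values in Table 2 follows the standard ranked-retrieval definition; a minimal sketch of our own:

```python
def average_precision(ranked_relevance):
    """Average precision of one query from a binary relevance list in ranked
    order; mAP is the mean of this quantity over all queries."""
    hits, total, ap = 0, sum(ranked_relevance), 0.0
    if total == 0:
        return 0.0
    for i, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            ap += hits / i   # precision at each relevant rank
    return ap / total

# Relevant items at ranks 1 and 3: precisions 1/1 and 2/3, so AP = 5/6.
ap = average_precision([1, 0, 1, 0])
```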

5 Conclusion

We proposed a novel compact video representation method based on aggregating local feature centers. Our results show that encoding local feature centers yields significantly better results than simply encoding the features, which are less tolerant to the visual distortions commonly found in video databases. The proposed approach is therefore suitable for video similarity detection with robustness to visual distortions. The recall-precision results were improved without incurring extra complexity in the signature matching process. Future work will assess the performance of the proposed approach under uncontrolled distortion conditions and on even larger datasets.


  • [1] R. Arandjelovic and A. Zisserman, “All about VLAD,” in IEEE Int. Conf. on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 1578–1585.
  • [2] C.-L. Chou et al., “Near-duplicate video retrieval and localization using pattern set based dynamic programming,” in IEEE Int. Conf. on Multimedia and Expo (ICME), 2013, pp. 1–6.
  • [3] M. Wang et al., “Large-scale image and video search: Challenges, technologies, and trends,” J. of Visual Communication and Image Representation, vol. 21, no. 8, pp. 771–772, 2010.
  • [4] J. Revaud, M. Douze, C. Schmid, and H. Jégou, “Event retrieval in large video collections with circulant temporal encoding,” in IEEE Int. Conf. on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 2459–2466.
  • [5] M. Douze, J. Revaud, C. Schmid, and H. Jégou, “Stable hyper-pooling and query expansion for event detection,” in IEEE Int. Conf. on Computer Vision (ICCV), 2013, pp. 1825–1832.
  • [6] D. G. Lowe, “Object recognition from local scale-invariant features,” in IEEE Int. Conf. on Computer vision, 1999, vol. 2, pp. 1150–1157.
  • [7] A. Vedaldi and B. Fulkerson, “Vlfeat: An open and portable library of computer vision algorithms,” in ACM Int. Conf. on Multimedia, 2010, pp. 1469–1472.
  • [8] A. Hampapur, K. Hyun, and R. M. Bolle, “Comparison of sequence matching techniques for video copy detection,” in Electronic Imaging 2002, 2001, pp. 194–201.
  • [9] M. M. Esmaeili, M. Fatourechi, and R. K. Ward, “A robust and fast video copy detection system using content-based fingerprinting,” IEEE Trans. on Information Forensics and Security, vol. 6, no. 1, pp. 213–226, 2011.
  • [10] H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez, and C. Schmid, “Aggregating local image descriptors into compact codes,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 34, no. 9, pp. 1704–1716, 2012.
  • [11] J. Yang, Y.-G. Jiang, A. G. Hauptmann, and C.-W. Ngo, “Evaluating bag-of-visual-words representations in scene classification,” in Int. Workshop on Multimedia Information Retrieval. ACM, 2007, pp. 197–206.
  • [12] C. M. Bishop, Pattern Recognition and Machine Learning, vol. 1, Springer, New York, 2006.
  • [13] C. Kim and B. Vasudev, “Spatiotemporal sequence matching for efficient video copy detection,” IEEE Trans. Circuits and Systems for Video Technology, vol. 15, no. 1, pp. 127–132, 2005.
  • [14] M. R. Naphade, M. M. Yeung, and B.-L. Yeo, “Novel scheme for fast and efficient video sequence matching using compact signatures,” in Electronic Imaging. International Society for Optics and Photonics, 1999, pp. 564–572.
  • [15] S. S. Cheung and A. Zakhor, “Estimation of web video multiplicity,” in Electronic Imaging. International Society for Optics and Photonics, 1999, pp. 34–46.
  • [16] J. Lu, “Video fingerprinting for copy identification: from research to industry applications,” in IS&T/SPIE Electronic Imaging. International Society for Optics and Photonics, 2009, pp. 725402–725402.