1 Introduction
Today, the amount of video content available in the public domain is huge, exceeding millions of hours, and is growing rapidly. Similar growth characterizes video-related metadata such as subtitle tracks and user-generated annotations and tags. However, these two types of information belong to two separate and largely unbridged domains. For example, English subtitles available on a DVD version of The Godfather are hard-wired to the timeline of the DVD video and cannot be used with a different version of the movie (e.g. downloaded from BitTorrent, streamed from YouTube, or broadcast over the air) that has a different timeline. Similarly, user-generated annotations and comments on a YouTube fragment of The Godfather are not accessible to a user watching the movie on DVD.
One way to reconcile the timelines of different versions of a video and the associated metadata is content-based synchronization. For this purpose, a time-dependent signature is computed for each video, which makes it possible to match and align similar parts in different versions of the video, thus giving a translation from one system of time coordinates to another. In a prototype application consisting of a client and a server, the signature is computed in real time during video playback on the client side and sent to the server, where it is matched against a database of video signatures. Once the correspondence to a database sequence has been established, the corresponding metadata on the server side is sent to the client. With this approach, it is sufficient to keep a database of video signatures computed from some prototype sequence with synchronized metadata. A new version of the video, previously unseen and coming from any source (e.g. read-only media, streaming, etc.), can be matched to the prototype timeline and the corresponding metadata retrieved. Thus, at least theoretically, any video can be enriched with metadata, provided that similar videos have signatures in the database.
The described application imposes several requirements on the signature construction and matching algorithms. First, they should be able to handle large amounts of data (thousands or millions of hours). This, in turn, requires that the signature be compact, easily indexable, and fast to search and match. Secondly, the signature computation should be efficient, ideally running in real time. Finally and most importantly, two versions of a video may differ significantly due to post-production processing (e.g. resolution and aspect ratio change, cropping, color and contrast modification, overlay of logos, compression artifacts, blur, etc.) and editing (e.g. advertisement insertion or adaptation of a movie for a certain rating category). The signature matching algorithm must therefore be able to cope with such modifications.
Surprisingly, similar problems are encountered in the apparently unrelated field of genetic research, where one of the main problems is the matching of DNA and protein sequences. Many recent efforts, including the well-known GenBank and Human Genome projects, have resulted in large collections of annotated DNA and protein sequences, in which newly discovered sequences can be looked up. The problem of post-processing distortions and editing is analogous to mutations occurring in biological DNA sequences. The scale of genetic data is comparable to that of video sequences (for example, the human genome contains sequences with nearly 3 billion symbols [11]). Over the past decades, many efficient methods have been developed for the analysis of genomic sequences, giving birth to the field of bioinformatics [22].
In this paper, we borrow well-established bioinformatic methods for the analysis of video, which can be treated analogously to DNA sequences, as shown in Section 2. The prototype application considered here and shown in the supplementary materials is content-based metadata mapping between versions of a video. The central problem in this application is finding correspondences between video sequences. In Section 3, we draw the analogy to genomic research, which allows us to employ dynamic programming sequence alignment [23, 27] and its fast heuristics [24, 1], as well as multiple sequence alignment and phylogenetic analysis. Exploring the analogy between mutations in genetic sequences and post-production processing and editing in video, we propose in Section 4 a generative approach for learning invariance to such mutations by means of metric learning. We obtain a very compact representation (a short bitcode for each second of video), which is robust to video transformations and allows efficient indexing and search. Section 5 presents experimental results demonstrating the robustness and efficiency of the proposed approach in a variety of applications, including video retrieval and alignment in a large-scale database (about 1,000 hours). Finally, Section 6 concludes the paper.
1.1 Related work
The problem of metadata mapping addressed in this paper is intimately related to content-based copy detection and search in video [17, 6]. There, one tries to find copies of a video that has undergone modifications (whether intentional or not) that potentially make it visually very different from the original. This problem should be distinguished from action and event recognition [31, 3, 16], where the similarity criterion is semantic. Broadly speaking, copy detection problems boil down to invariant retrieval (finding a video invariant to a certain class of transformations), whereas action recognition problems are problems of categorization (recognizing a certain class of behaviors in video). To illustrate the difference, imagine three video sequences: a movie-quality version of Star Wars, the same version broadcast on TV with ad insertion and captured off screen with a camcorder, and the lightsaber fight scene reenacted by amateur actors. The purpose of copy detection is to say that the first and the second video sequences are similar; action recognition, on the other hand, should find similarity between the second and third videos.
One of the cornerstone problems in content-based copy detection and search is the creation of a video representation that allows videos to be compared and matched across versions. Different representations based on mosaics, shot boundaries, motion, color, and spatio-temporal intensity distribution, color histograms, and ordinal measures [9] have been proposed. When considering the large variability of versions due to post-production modifications, methods based on spatial [20, 21, 2] and spatio-temporal [15] points of interest and local descriptors were shown to be advantageous. In addition, these methods proved to be very efficient in image search in very large databases [26, 4]. More recently, Willems et al. [30] proposed feature-based spatio-temporal video descriptors combining the visual information of single video frames with the temporal relations between subsequent frames.
One of the main disadvantages of existing video representations is their constructive approach to invariance to video transformations. Usually, the representation is designed around quantities and properties of video that are insensitive to typical transformations. For example, gradient-based descriptors [20, 2] are known to be insensitive to illumination and color changes. Such a construction may often fail to generalize to other classes of transformations, or may result in a suboptimal tradeoff between invariance and discriminativity.
An alternative approach, adopted in this paper, is to learn the invariance from examples of video transformations. By simulating the post-production and editing process, we are able to produce pairs of video sequences that are supposed to be similar (differing only by a transformation) and pairs of sequences from different videos that are supposed to be dissimilar. Such pairs are used as a training set for similarity-preserving hashing and metric learning algorithms [25, 13, 29] in order to create a metric between video sequences that achieves optimal invariance and discriminativity on the training set.
2 Video DNA
Biological DNA data encountered in bioinformatic applications are long sequences over an alphabet of four letters (representing the bases of the DNA molecule, denoted as A, T, C, G and referred to as nucleotides). Extending this example to our problem, one can conceptually think of video as a sequence of visual information units, which can be represented over some potentially very large alphabet of visual concepts, resulting in a sequence of “letters” (or visual nucleotides) which we call video DNA by analogy to genetic sequences. Video DNA sequencing, the process of creating a video DNA sequence out of a video, is performed by computing descriptors for each frame (or short sequence of frames) and arranging them on the video timeline (see Figure 1).
In this paper, we used a feature-based representation following the standard bag of features paradigm [26, 4]. First, each frame of the video is scaled down to a fixed horizontal resolution, feature points are detected, and local image descriptors are computed around these points using a modification of the speeded-up robust features (SURF) [2] feature detection and description algorithm (Figure 1, top). Only a fixed number of the strongest feature points are used, and each feature point is described by a grayscale descriptor and a color descriptor. Second, the local descriptors are quantized using the k-means clustering algorithm, separately for the grayscale and color feature descriptors, creating grayscale and color visual vocabularies, each with its own number of visual words. Each local feature descriptor is replaced by the index of the nearest visual word in the vocabulary. Third, each frame is divided into four quadrants with 10% overlap, and a bag of features (histogram of visual words) is computed in each quadrant. The four concatenated histograms yield the vector used as the frame descriptor (Figure 1, bottom). Fourth, a median of the frame descriptors over fixed time intervals is computed, creating the video DNA sequence. The intervals have a fixed size and are taken with a fixed step.
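The last two sequencing steps (quadrant histograms and temporal median) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names `quadrant_histograms` and `video_dna` are hypothetical, feature detection and quantization are assumed to have already produced `(x, y, word)` triples, and the 10% overlap is interpreted here as enlarging each quadrant by 5% of the frame size across the center lines.

```python
from statistics import median


def quadrant_histograms(points, width, height, vocab_size, overlap=0.10):
    """Concatenate four per-quadrant histograms of visual words for one frame.

    points: iterable of (x, y, word), word in range(vocab_size).
    A point lying in the overlap band is counted in every quadrant it
    falls into (an assumption about how the overlap is handled).
    """
    mx, my = overlap * width / 2, overlap * height / 2
    cx, cy = width / 2, height / 2
    # (x_min, x_max, y_min, y_max) for top-left, top-right,
    # bottom-left and bottom-right quadrants
    boxes = [
        (0, cx + mx, 0, cy + my),
        (cx - mx, width, 0, cy + my),
        (0, cx + mx, cy - my, height),
        (cx - mx, width, cy - my, height),
    ]
    hist = [0] * (4 * vocab_size)
    for x, y, word in points:
        for q, (x0, x1, y0, y1) in enumerate(boxes):
            if x0 <= x <= x1 and y0 <= y <= y1:
                hist[q * vocab_size + word] += 1
    return hist


def video_dna(frame_descriptors, fps, interval, step):
    """Coordinate-wise median of frame descriptors over sliding windows.

    interval and step are given in seconds; each output vector is one
    visual nucleotide.
    """
    win = max(1, round(interval * fps))
    hop = max(1, round(step * fps))
    dna = []
    for start in range(0, len(frame_descriptors) - win + 1, hop):
        window = frame_descriptors[start:start + win]
        dna.append([median(column) for column in zip(*window)])
    return dna
```

The coordinate-wise median makes each nucleotide insensitive to a few outlier frames inside the interval, such as a flash or a single corrupted frame.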
The resulting video DNA is a timed sequence of bag-of-features vectors, which we call visual nucleotides by analogy to biological DNA sequences. The similarity of two video sequences can be quantified by measuring the distance between the corresponding visual nucleotides, which we denote here by $d$. In the simplest case, a Euclidean distance between the vectors is used. In [26], it was shown that a Euclidean distance weighted by the statistical distribution of visual words (term frequency-inverse document frequency, or tf-idf) is a better way to compare bags of features. We will address the construction of an optimal distance between visual nucleotides in Section 4.
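One common way to realize a tf-idf weighted Euclidean distance between bags of features is sketched below in Python. The exact weighting used in the cited work may differ; the function names `idf_weights` and `tfidf_distance` and the choice of a natural logarithm are assumptions made for illustration.

```python
import math


def idf_weights(nucleotides):
    """Inverse document frequency of each visual word over a corpus.

    nucleotides: list of equal-length histograms (one per interval).
    Words that never occur get weight 0.
    """
    n = len(nucleotides)
    dim = len(nucleotides[0])
    weights = []
    for k in range(dim):
        df = sum(1 for h in nucleotides if h[k] > 0)
        weights.append(math.log(n / df) if df else 0.0)
    return weights


def tfidf_distance(h1, h2, idf):
    """Euclidean distance between tf-idf weighted histograms."""
    def tfidf(h):
        total = sum(h) or 1  # avoid division by zero for empty histograms
        return [count / total * w for count, w in zip(h, idf)]

    a, b = tfidf(h1), tfidf(h2)
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
```

Words that occur in every interval receive zero weight, so ubiquitous visual words do not contribute to the distance.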
3 Search and alignment
Dynamic programming methods used to align biological DNA sequences, notably the Needleman-Wunsch (NW) [23] and Smith-Waterman (SWAT) [27] algorithms, can be applied to finding the correspondence between versions of video sequences.
Let $x = (x_1, \dots, x_M)$ and $y = (y_1, \dots, y_N)$ be two video DNA sequences representing two versions of a video obtained by temporal editing. In this case, $x$ and $y$ will typically have locally similar sequences of nucleotides. In order to find such similarities, we look for an optimal local alignment between $x$ and $y$, i.e., a correspondence between the indices of $x$ and the indices of $y$ that on one hand makes the corresponding nucleotides as similar as possible and on the other contains gaps of minimum total length. The quality of the correspondence is represented by a similarity score, taking into consideration both the similarity of the nucleotides and the gaps.
The optimal score $S_{i,j}$ between the substring of $x$ of length $i$ and the substring of $y$ of length $j$ is given by the following recursive equation,
$$S_{i,j} = \max\left\{0,\; S_{i-1,j-1} + s(x_i, y_j),\; S_{i-1,j} - g,\; S_{i,j-1} - g\right\},$$
where $S_{i,0} = 0$ and $S_{0,j} = 0$ for all $i, j$. Here $s(x_i, y_j)$ is the similarity between nucleotides and $g$ is the gap penalty. The values of $S_{i,j}$ are determined by means of dynamic programming, and the optimal correspondence is established by backtracking.
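A compact Python sketch of Smith-Waterman local alignment with backtracking is given below for illustration. It uses the textbook form of the recursion, with zero boundary conditions and a linear gap penalty; the function name `local_alignment`, the 0-based index pairs it returns, and the toy similarity function in the usage note are assumptions and not part of the method described here.

```python
def local_alignment(x, y, sim, gap):
    """Smith-Waterman local alignment with a linear gap penalty.

    sim(a, b): similarity of two nucleotides (positive for similar ones).
    gap: positive penalty subtracted for each skipped nucleotide.
    Returns (best_score, pairs), pairs being 0-based aligned indices.
    """
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]  # S[i][0] = S[0][j] = 0
    best, best_ij = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(
                0,
                S[i - 1][j - 1] + sim(x[i - 1], y[j - 1]),  # match/substitution
                S[i - 1][j] - gap,                           # gap in y
                S[i][j - 1] - gap,                           # gap in x
            )
            if S[i][j] > best:
                best, best_ij = S[i][j], (i, j)

    # Backtrack from the best cell until the score drops to zero.
    pairs = []
    i, j = best_ij
    while i > 0 and j > 0 and S[i][j] > 0:
        if S[i][j] == S[i - 1][j - 1] + sim(x[i - 1], y[j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif S[i][j] == S[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs
```

With a toy similarity of +2 for equal letters and -1 otherwise, `local_alignment("ABCD", "ABXCD", sim, 1)` aligns `AB` and `CD` while skipping the inserted `X`. For visual nucleotides, `sim` would be derived from the nucleotide distance, for example a constant minus the distance.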
3.1 Fast heuristics
The main disadvantage of dynamic programming alignment methods is their high complexity, $O(MN)$ for sequences of lengths $M$ and $N$. In our application, a short query sequence is compared to a large database containing signatures of thousands or millions of hours of video, and such an approach may be computationally prohibitive. A similar complexity problem is encountered in gene search applications in bioinformatics, where typical databases contain sequences totalling millions or billions of letters.
To overcome this problem, fast heuristics such as FASTA [24] and BLAST [1] have been developed. The key idea of these approaches is to first locate matches of short combinations of nucleotides of fixed size (typically ranging between 2 and 10), which establish multiple coarse initial correspondences between regions in the two sequences. Using search engine terminology, the initial correspondences established by FASTA/BLAST algorithms form a short list of candidates. The correspondence is later refined using a banded version of the SWAT algorithm, applied to the sequences around the initial regions. At this stage, video DNA sequences at higher temporal resolution can be used.
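The seeding stage can be illustrated with the following Python sketch, which indexes all runs of `k` consecutive nucleotides and votes for the database offsets they hit. It is a simplified stand-in for FASTA/BLAST-style seeding, not either algorithm itself: the names `build_kmer_index` and `seed_candidates` are hypothetical, and exact matching of runs assumes the nucleotides are discrete (e.g. quantized or binary codes).

```python
from collections import defaultdict


def build_kmer_index(database, k):
    """Map every run of k consecutive nucleotides to its occurrences.

    database: dict mapping a video name to a sequence of hashable
    nucleotides. Returns dict: k-tuple -> list of (name, position).
    """
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[tuple(seq[pos:pos + k])].append((name, pos))
    return index


def seed_candidates(query, index, k, top=5):
    """Return a short list of ((name, offset), votes), best first.

    offset is the database position that query position 0 maps to;
    seeds on the same diagonal vote for the same offset.
    """
    votes = defaultdict(int)
    for qpos in range(len(query) - k + 1):
        for name, dpos in index.get(tuple(query[qpos:qpos + k]), ()):
            votes[(name, dpos - qpos)] += 1
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top]
```

Each returned candidate would then be refined by banded dynamic programming around the voted offset.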
3.2 Multiple sequence alignment
In many cases, it is desirable to find an alignment between more than two videos, a problem analogous to multiple sequence alignment (MSA) in bioinformatics. MSA is used in phylogenetic analysis in order to discover evolutionary relations between DNA sequences. In video, a similar problem is version control, where multiple versions of a video are given and one wishes to establish, for example, from which source they were derived and which sequence was the original.
A straightforward generalization of dynamic programming alignment algorithms to MSA results in exponential complexity. For this reason, sub-optimal heuristics such as progressive sequence alignment are used. For example, in CLUSTAL [28], all pairs of sequences are first aligned separately. The alignment cost acts as a measure of the pairwise sequence dissimilarity. Given the pairwise dissimilarity matrix, a guide tree is constructed by means of clustering (e.g. neighbor joining). Finally, a series of pairwise alignments following the branching order in the tree is performed. This way, the most similar sequences are aligned first and the most dissimilar last.
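To make the guide tree step concrete, the sketch below builds a tree from a pairwise dissimilarity matrix in Python. Note that it uses average-linkage (UPGMA-style) agglomerative clustering as a simpler stand-in for neighbor joining; the function name `guide_tree` and the nested-tuple tree format are assumptions made for illustration.

```python
def guide_tree(names, dist):
    """Average-linkage agglomerative clustering of sequences.

    names: list of sequence labels; dist: symmetric matrix (list of
    lists) of pairwise dissimilarities. Returns a nested tuple whose
    leaves are the labels; inner nodes give the merge order.
    """
    # cluster id -> (subtree, list of member indices)
    clusters = {i: (names[i], [i]) for i in range(len(names))}
    next_id = len(names)

    def avg(a, b):
        # mean dissimilarity between the members of two clusters
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        ids = sorted(clusters)
        p, q = min(
            ((u, v) for ai, u in enumerate(ids) for v in ids[ai + 1:]),
            key=lambda uv: avg(clusters[uv[0]][1], clusters[uv[1]][1]),
        )
        (tree_p, members_p) = clusters.pop(p)
        (tree_q, members_q) = clusters.pop(q)
        clusters[next_id] = ((tree_p, tree_q), members_p + members_q)
        next_id += 1
    return next(iter(clusters.values()))[0]
```

Traversing the resulting tree from the leaves upward gives the order in which pairwise alignments are performed, closest pairs first.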
4 Mutation-invariant metric
Post-production transformations in video are analogous to mutations in biological DNA sequences and can be manifested either as insertion or deletion of visual nucleotides (indel mutations) as a result of temporal editing, or as substitution mutations, in which the visual content is replaced by another as the result of spatial editing such as resolution or aspect ratio change, cropping, compression artifacts, overlay of subtitles or channel logo, etc. While local alignment is efficient in coping with insertion or deletion mutations by proper selection of the gap penalty, substitution mutations can be a major challenge, as they may have a global effect on the entire video DNA sequence (imagine, for example, that due to non-uniform scaling of the video, the bag of features changes in every frame).
In biological DNA sequence analysis, the exact mechanism of mutations is not completely understood and cannot be fully reproduced; therefore, empirical models of nucleotide mutation probability are used. In our case, on the other hand, it is easy to reproduce the post-production processing that causes mutations in video DNA. Ideally, our visual nucleotides should be discriminative (such that two intervals belonging to different videos are dissimilar) and invariant (such that two transformations of the same interval are similar). Though our construction of visual nucleotides relies on feature descriptors that are insensitive to certain transformations of the frame (scale, mild brightness and contrast variations), other transformations (e.g. cropping, subtitle overlay, etc.) may result in different visual nucleotides. As a consequence, the simple Euclidean metric would not be invariant under such transformations.
Yet, it is possible to learn the best mutation-invariant metric between nucleotides on a training set. Assume that we are given a set $X$ of nucleotides describing different intervals of video, and the class $\mathcal{T}$ of all transformations to which invariance is desired. We denote by $\mathcal{P}$ the set of all positive pairs (visual nucleotides of identical intervals, differing up to some transformation), and by $\mathcal{N}$ the set of all negative pairs (visual nucleotides of different intervals). Negative pairs are modeled by sampling numerous intervals from different videos, which are known to be distinct. For positive pairs, we generate representative transformations from the class $\mathcal{T}$. Our goal is to find a metric between nucleotides that ideally is as small as possible on the set of positives and as large as possible on the set of negatives.
Shakhnarovich [25] considered a metric parameterized as
$$d_{A,b}(x, y) = d_H\big(\mathrm{sign}(Ax + b),\, \mathrm{sign}(Ay + b)\big),$$
where $d_H$ is the Hamming metric in the $n$-dimensional Hamming space of binary sequences of length $n$. $A$ and $b$ are an $n \times m$ matrix and an $n \times 1$ vector, respectively, parameterizing the metric, with $m$ denoting the dimension of the visual nucleotides. Our goal is to find $A$ and $b$ such that $d_{A,b}$ reflects the desired similarity of pairs of visual nucleotides $x, y$ in the training set.
Ideally, we would like to achieve $d_{A,b}(x, y) \le d_0$ for $(x, y) \in \mathcal{P}$, and $d_{A,b}(x, y) > d_0$ for $(x, y) \in \mathcal{N}$, where $d_0$ is some threshold. In practice, this is rarely achievable, as the distributions of $d_{A,b}$ on $\mathcal{P}$ and $\mathcal{N}$ have cross-talks responsible for false positives ($d_{A,b} \le d_0$ on $\mathcal{N}$) and false negatives ($d_{A,b} > d_0$ on $\mathcal{P}$). Thus, optimal $A, b$ should minimize these cross-talks,
$$\min_{A,b}\; \frac{1}{|\mathcal{P}|} \sum_{(x,y) \in \mathcal{P}} e^{\mathrm{sign}(d_{A,b}(x,y) - d_0)} + \frac{1}{|\mathcal{N}|} \sum_{(x,y) \in \mathcal{N}} e^{\mathrm{sign}(d_0 - d_{A,b}(x,y))}. \qquad (4)$$
In [25], Shakhnarovich proposed considering the learning of optimal parameters $A, b$ as a boosted binary classification problem, where $\mathrm{sign}(d_0 - d_{A,b}(x, y))$ acts as a strong binary classifier, and each dimension of the linear projection, $\mathrm{sign}(A_i x + b_i)$, can be considered as a weak classifier. This way, the AdaBoost algorithm can be used to progressively construct $A$ and $b$, which would be a greedy solution of (4). At the $i$-th iteration, the $i$-th row $A_i$ of the matrix $A$ and the $i$-th element $b_i$ of the vector $b$ are found by minimizing a weighted version of (4). Weights of false positive and false negative pairs are increased, and weights of true positive and true negative pairs are decreased, using the standard AdaBoost reweighting scheme [7]. While it is difficult to find the $A_i$ minimizing (4) because of the non-linearity, we found that the minimizer of the exponential loss is related to another, simpler problem,
$$\max_{A_i}\; \frac{A_i C_{\mathcal{N}} A_i^{\mathrm{T}}}{A_i C_{\mathcal{P}} A_i^{\mathrm{T}}}, \qquad (5)$$
where $C_{\mathcal{P}}$ and $C_{\mathcal{N}}$ are the covariance matrices of the positive and negative pairs, respectively. It can be shown that the $A_i$ maximizing (5) is the largest generalized eigenvector of the matrix pair $(C_{\mathcal{N}}, C_{\mathcal{P}})$. Since the minimizers of (4) and (5) do not coincide exactly, in our implementation we select a subspace spanned by the largest ten eigenvectors, out of which the direction $A_i$ as well as the threshold parameter $b_i$ minimizing the exponential loss are selected.
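The following Python sketch illustrates how a learned projection would be applied: nucleotides are embedded into binary codes by thresholded linear projections, compared by Hamming distance, and the false negative and false positive rates are measured on labeled pairs. The training itself (boosting and the eigenvector selection) is not shown; the function names and the toy projection in the usage note are hypothetical.

```python
def embed(x, A, b):
    """Binary code of nucleotide x: one bit per row of A.

    Bit i is 1 when the projection A[i] . x + b[i] is positive.
    The code is packed into a Python integer, first row as the
    most significant bit.
    """
    code = 0
    for row, bi in zip(A, b):
        proj = sum(a * v for a, v in zip(row, x)) + bi
        code = (code << 1) | (1 if proj > 0 else 0)
    return code


def hamming(c1, c2):
    """Number of differing bits between two packed codes."""
    return bin(c1 ^ c2).count("1")


def error_rates(positives, negatives, A, b, d0):
    """False negative rate on positive pairs and false positive rate
    on negative pairs for the threshold d0."""
    fn = sum(hamming(embed(x, A, b), embed(y, A, b)) > d0
             for x, y in positives)
    fp = sum(hamming(embed(x, A, b), embed(y, A, b)) <= d0
             for x, y in negatives)
    return fn / len(positives), fp / len(negatives)
```

For instance, with the identity projection `A = [[1, 0], [0, 1]]` and `b = [0, 0]`, the vectors `(1, 1)` and `(1, -1)` receive codes that differ in exactly one bit.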
There are a few advantages to the described approach. First, the metric is constructed to achieve the best discriminativity and invariance on the training set. If the training set is sufficiently representative, such a metric generalizes well. It can be used as the nucleotide distance in the alignment and search algorithms described in Section 3. Secondly, the projection itself has the effect of dimensionality reduction and results in a very compact representation of visual nucleotides as bitcodes (for example, the frame shown in Figure 1 is represented by the 16-digit hexadecimal word 223E9DF01ADB3E00). Such bitcodes can be efficiently stored and manipulated in standard databases. Thirdly, modern CPU architectures allow very efficient computation of Hamming distances using bit counting and SIMD instructions. Since each of the bits can be computed independently, score computation in the alignment algorithm can be further parallelized on multiple CPUs using either shared or distributed memory. Due to the compactness of the bitcode representation, search can be performed in memory (a single 8 GB memory system is sufficient to store about 300,000 hours of video at one-second resolution with bitcodes of this size).
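The bit-counting computation and the memory estimate can be checked with a few lines of Python. The sketch assumes 64-bit codes (the example word above has 16 hexadecimal digits), one code per second of video, and 8 GB read as 8 × 2^30 bytes; the helper name `hamming64` is hypothetical.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def hamming64(u, v):
    """Hamming distance between two 64-bit codes via XOR and bit count."""
    return bin((u ^ v) & MASK64).count("1")


# Example bitcode from the text, parsed as a 64-bit integer.
code = int("223E9DF01ADB3E00", 16)

# Memory estimate: one 8-byte code per second of video (assumption).
bytes_per_second = 8
memory_bytes = 8 * 2**30
hours = memory_bytes / bytes_per_second / 3600
```

Under these assumptions `hours` comes out just under 300,000, in agreement with the estimate above.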
5 Results
In the experimental validation, we worked with a database containing 1013 hours of assorted video content (movies, 2D and 3D cartoons, talk shows, sports) taken from DVDs. Video DNA sequences were computed with a fixed interval size and step, as described in Section 2. A Hamming space of fixed dimension was used for the bitcode representation. Metric learning was performed offline on a training set containing positive and negative pairs. Positives were created using transformations simulated with the AviSynth frame server.
Large scale search. For the evaluation of search and alignment, we used a previously proposed evaluation scheme. Randomly selected short sequences from the database were used as queries. The queries were constructed in such a way that there was exactly one correct match in the database. In BLAST and FASTA-type algorithms, the queries represent the short nucleotide sequences used to establish initial matches. The queries underwent transformations (shown in Figure 2) typical of video post-production, including spatial and pixel transformations (cropping, letterbox and pillarbox, contrast and color balance, compression noise, resolution and aspect ratio change, subtitle overlay) and temporal transformations (frame rate change and time shift). Each transformation appeared at multiple strengths (denoted as 1–3).
Short sequences locally matching the queries were found in the database using a FASTA/BLAST-type algorithm described in Section 3. The matching precision was measured as the precision of the top-ranked result, i.e., the percentage of correct first matches. Matches were considered correct if they were within a 1 sec tolerance of the ground-truth match (i.e., falling within the temporal resolution of our representation). Search times were measured in milliseconds.
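The evaluation criterion can be written down as a short Python function. The tuple layout of `results` and the function name `top1_precision` are assumptions made for illustration; only the 1 sec tolerance and the use of the first match come from the description above.

```python
def top1_precision(results, tolerance=1.0):
    """Percentage of queries whose first match is correct.

    results: list of (matched_id, matched_time, true_id, true_time),
    times in seconds. A match is correct when it points to the right
    video and lies within `tolerance` seconds of the ground truth.
    """
    correct = sum(
        1
        for matched_id, matched_time, true_id, true_time in results
        if matched_id == true_id and abs(matched_time - true_time) <= tolerance
    )
    return 100.0 * correct / len(results)
```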
Table 1 shows the breakdown of search precision according to transformation types and strengths. 10,770 queries of 10 sec length were used. Table 2 shows the search precision as a function of the query length (varying from 5 to 30 sec) on a set of 20,160 queries, including all transformations of strength 1–3. It shows that 10 sec of video are sufficient to achieve less than 3% search error in a database of 1013 hours across versions including significant transformations. This number falls below 1% for a 20 sec query.
Local alignment. In order to evaluate the performance of local alignment, we aligned sequences from a subset of the database containing approximately 300 hours of video using the dynamic programming algorithm described in Section 3. Query sequences underwent the spatial transformations from the previous experiments, as well as different temporal transformations. The latter included deletion of portions of video, substitution with other videos, and insertion of black periods (with either sharp or gradual fade-in and fade-out transitions of different durations); local speeding up and slowing down of the video playback; and removal of significant parts of the original footage from the query sequence. Table 3 shows the breakdown of alignment precision according to transformation types and strengths. An example of two aligned versions of a sequence from the Desperate Housewives series is shown in Figure 3.
Phylogenetic analysis. Figure 4 shows a dendrogram representing the evolutionary relations between six versions derived from the Desperate Housewives sequence in Figure 3. Version x.y was obtained by removing a shot from sequence x. The dendrogram was constructed from the matrix of pairwise sequence distances (computed as the ratio of the total gap length to the total sequence length in aligned pairs of sequences) using the neighbor joining approach. One can clearly see how subsequent versions were derived.
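The pairwise distance used for the dendrogram can be sketched in Python as follows. The precise definition of the total sequence length is not spelled out above, so this sketch assumes it is the sum of the two sequence lengths and that every unaligned nucleotide counts as a gap; the function name `gap_ratio_distance` is hypothetical.

```python
def gap_ratio_distance(len_x, len_y, pairs):
    """Ratio of gap length to total length for one aligned pair.

    len_x, len_y: lengths of the two video DNA sequences.
    pairs: list of aligned index pairs (i, j) from the alignment.
    """
    matched = len(pairs)
    gaps = (len_x - matched) + (len_y - matched)
    total = len_x + len_y
    return gaps / total if total else 0.0
```

Filling a symmetric matrix with these values for all pairs of versions gives the input to the tree construction.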
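Table 2: Search precision as a function of query length (numerical values not preserved).
|Query length|5 sec|10 sec|15 sec|20 sec|30 sec|

Table 3: Alignment precision (%) by transformation type and strength.
|Transformation|Strength 1|Strength 2|Strength 3|
|Deletion & insertion| | | |
|Fade ins & outs|99.92|99.16|99.40|
|Local speed changes|99.91|99.90|99.90|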
6 Conclusions
We presented a framework for the construction of robust and compact video representations. By appealing to the analogy between genetic sequences and video, we employed bioinformatics algorithms that allow efficient search and alignment of video sequences. Also, we showed that using metric learning, it is possible to design an optimal metric on a training set of generated video transformations.
We believe that harvesting video and related metadata available in the public domain and creating a database of annotated video DNA sequences together with search and alignment tools could eventually have an impact similar to that of the Human Genome project in genomic research. Having, for example, a large database containing signatures of the most popular Hollywood movies would allow identifying and synchronizing any version of a movie no matter when, where, and from which source it is played. The database can be used for finding copies and versions of movies on the web, in order to cope with piracy, enhance video content with metadata such as subtitles, or provide keywords for contextual advertisement engines. Finally, human annotations and semantic information would enable video understanding by using matching annotations of similar videos from the database.
References
- [1] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. Basic local alignment search tool. J. Molecular Biology, 215:403–410, 1990.
- [2] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. Lecture Notes in Computer Science, 3951:404, 2006.
- [3] O. Boiman and M. Irani. Detecting irregularities in images and in video. IJCV, 74(1):17–31, 2007.
- [4] O. Chum, J. Philbin, M. Isard, and A. Zisserman. Scalable near identical image and shot detection. In Proc. 6th ACM International Conf. Image and Video Retrieval, pages 549–556, 2007.
- [5] M.O. Dayhoff, R. Schwartz, and B.C. Orcutt. A model of evolutionary change in proteins. Nat. Biomed. Res. Found., pages 345–358, 1978.
- [6] M. Douze, A. Gaidon, H. Jegou, M. Marszalek, and C. Schmid. INRIA-LEAR’s video copy detection system. In Proc. TRECVID, 2008.
- [7] Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Proc. European Conf. Computational Learning Theory, 1995.
- [8] A. Hampapur, K. Hyun, and R. Bolle. Comparison of sequence matching techniques for video copy detection. In Conf. on Storage and Retrieval for Media Databases, pages 194–201, 2002.
- [9] X.S. Hua, X. Chen, and H.J. Zhang. Robust video signature based on ordinal measure. In Proc. ICIP, 2004.
- [10] P. Indyk, G. Iyengar, and N. Shivakumar. Finding pirated video sequences on the internet. 1999.
- [11] International Human Genome Consortium. Finishing the euchromatic sequence of the human genome. Nature, 431(7011):931–945, 2004.
- [12] M. Irani, P. Anandan, J. Bergen, R. Kumar, and S. Hsu. Efficient representations of video sequences and their applications. Signal Processing: Image Communication, 1996.
- [13] P. Jain, B. Kulis, and K. Grauman. Fast image search for learned metrics. In Proc. CVPR, 2008.
- [14] A. Joly, C. Frelicot, and O. Buisson. Feature statistical retrieval applied to content based copy identification. In Proc. ICIP, volume 1, 2004.
- [15] I. Laptev. On space-time interest points. IJCV, 64(2):107–123, 2005.
- [16] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In Proc. CVPR, 2008.
- [17] J. Law-To, L. Chen, A. Joly, I. Laptev, O. Buisson, V. Gouet-Brunet, N. Boujemaa, and F. Stentiford. Video copy detection: a comparative study. In Proc. ACM International Conf. Image and Video Retrieval, pages 371–378, 2007.
- [18] Y. Li, J.S. Jin, and X. Zhou. Video matching using binary signature. In Proc. ISPACS, pages 317–320, 2005.
- [19] D.J. Lipman, S.F. Altschul, and J.D. Kececioglu. A tool for multiple sequence alignment. PNAS, 86:4412–4415, 1989.
- [20] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
- [21] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide-baseline stereo from maximally stable extremal regions. Image and Vision Computing, 22(10):761–767, 2004.
- [22] D.W. Mount. Bioinformatics: sequence and genome analysis. Cold Spring Harbor Press, 2004.
- [23] S.B. Needleman and C.D. Wunsch. A general method applicable to the search of similarities in the amino acid sequence of two proteins. J. Molecular Biology, 48:443–453, 1970.
- [24] W.R. Pearson. Rapid and sensitive sequence comparison with FASTP and FASTA. Methods in Enzymology, 183, 1990.
- [25] G. Shakhnarovich. Learning task-specific similarity. PhD thesis, MIT, 2005.
- [26] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In Proc. CVPR, 2003.
- [27] T.F. Smith and M.S. Waterman. Identification of common molecular subsequences. J. Molecular Biology, 147:195–197, 1981.
- [28] J.D. Thompson, D.G. Higgins, and T.J. Gibson. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acids Research, 22, 1994.
- [29] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition. In Proc. CVPR, 2008.
- [30] G. Willems, T. Tuytelaars, and L. Van Gool. Spatio-temporal features for robust content-based video copy detection. 2008.
- [31] L. Zelnik-Manor and M. Irani. Statistical analysis of dynamic actions. Trans. PAMI, 28(9):1530, 2006.