Convolutional Hashing for Automated Scene Matching

02/09/2018 ∙ by Martin Loncaric, et al. ∙ Hive 0

We present a powerful new loss function and training scheme for learning binary hash functions. In particular, we demonstrate our method by creating for the first time a neural network that outperforms state-of-the-art Haar wavelets and color layout descriptors at the task of automated scene matching. By accurately relating distance on the manifold of network outputs to distance in Hamming space, we achieve a 100-fold reduction in nontrivial false positive rate and significantly higher true positive rate. We expect our insights to provide large wins for hashing models applied to other information retrieval tasks as well.




1 Introduction

Many information retrieval tasks rely on high-dimensional searches, including K-nearest neighbors (KNN), approximate nearest neighbors (ANN), and exact $r$-neighbor lookup in Hamming space. At scale, these searches are enabled by indexes on binary hashes, such as locality-sensitive hashing (LSH) and multi-indexing [1]. Research on these topics has flourished recently due to enormous growth in data volume and industry applications [2]. We present a powerful new approach to a fundamental challenge in these tasks: learning a good binary hash function.

We demonstrate the effectiveness of our method by applying it to the task of automated scene matching (ASM) with a multi-index system. We call our model convolutional hashing for automated scene matching (CHASM). To the best of our knowledge, it is the first neural network to outperform state-of-the-art hash functions like Haar wavelets and color layout descriptors at ASM across the board.

Error type | Benchmarks with error
FN | all
FP | 192-bit CLD
FP | 64-bit wavelets, 256-bit wavelets
Figure 1: Select examples of false positives and false negatives that state-of-the-art hashes made but our 64-bit CHASM correctly handled.

1.1 Automated Scene Matching

ASM is an important information retrieval task, used to perform reverse video lookup for broadcasting, research, and copyright infringement monitoring [3, 4]. The goal of ASM is to take a query sequence of video frames and return all matching videos in a dataset, along with the start and end times of the matches. (Footnote 1: ASM sometimes encompasses more than this specific definition. In particular, some ASM research aims to retrieve footage of the same 3D scene based on videos taken from another angle [5].)

For instance, suppose a research library indexes all its documentaries for reverse video lookup. A researcher might then query the infamous Zapruder film, and her results should be all the documentaries containing a subset of it. The results should also include the specific time segments in which these documentaries matched the Zapruder film.

For large video datasets, this can be solved by implementing the procedure below. (Footnote 2: Other procedures exist, especially ones that downsample to heuristically selected keyframes rather than using a fixed frame rate. However, these approaches are plagued by low recall and cannot distinguish between time granularities finer than their keyframes [3].)

  1. For each video in the dataset:

    1. Downsample to a fixed frame rate (fps) and image size.

    2. Create a binary hash of each frame.

    3. Using a multi-index lookup table, index each binary hash, pointing back to the source video and timestamp (Section 1.2).

  2. For each query video:

    1. Apply the same fps and image size downsampling.

    2. Create a binary hash of each frame.

    3. Retrieve matches from the index for each binary hash.

    4. Based on the individual frame matches, use heuristics to decide which dataset videos match during which time segments.

Our work optimizes the binary hash function used in steps 1b and 2b of this procedure. An ideal hash function for ASM must satisfy many requirements:

  • Frames from the same video that are offset by up to a small time difference should map to hashes within the Hamming radius so that videos with a time shift still match together.

  • Frames offset by more time should map to hashes outside the Hamming radius so that the matching heuristics can determine precise start and end times.

  • Frames from different videos should map to hashes outside the Hamming radius to avoid false positive matches.

  • The false positive rate for each of the multi-index’s indices must be extremely low, since the dataset may be very large, and each false positive increases query time and the probability of mismatching scenes.

It is worth noting that as dataset size increases, an ASM hash function’s precision and recall drop, but its true positive and false positive rates stay the same. Therefore we used true positive and false positive rates as our test metrics.

We compare our approach to state-of-the-art ASM methods, as well as variants of our own method, trained with other binary hash loss functions from recent research [6, 7].

1.2 Multi-Indexing

Multi-indexing can enable search within a Hamming radius $r$ by splitting the $n$-bit hash into $r + 1$ substrings of length $n / (r + 1)$ [1]. Because two hashes within Hamming distance $r$ differ in at most $r$ bit positions, at least one of their $r + 1$ substrings must match exactly. Each of these substrings is inserted into its own index, pointing back to the full hash, video, and timestamp. (Footnote 3: In scenarios with a combination of extremely large datasets, short hashes, and large $r$, it may be more practical to use fewer than $r + 1$ substrings and make up for the missing Hamming radius with brute-force searches around each substring [1]. However, for ASM these conditions can be avoided by using larger hashes.)

Lookup is performed as follows:

  1. Taking an input hash $h$, split it into substrings $h_1, \ldots, h_{r+1}$.

  2. Initialize an empty list $M$.

  3. For $i = 1, \ldots, r + 1$, add exact matches for $h_i$ in the $i$th index to $M$.

  4. Filter duplicate results out of $M$.

  5. Filter results with Hamming distance greater than $r$ out of $M$.

  6. Return $M$.

The expected lookup runtime scales with the number of exact matches retrieved per substring. Therefore, with CHASM we seek to minimize not only the overall false positive rate, but also the false positive rate for each index.
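The lookup steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the names `split_hash`, `hamming`, and `MultiIndex` are hypothetical, hashes are stored as Python integers, and the sketch assumes radius + 1 equal-width substrings that divide the hash length evenly.

```python
from collections import defaultdict


def split_hash(h, n_bits, m):
    """Split an n_bits-wide integer hash into m equal-width substrings."""
    width = n_bits // m  # assumes m divides n_bits evenly
    mask = (1 << width) - 1
    return [(h >> (i * width)) & mask for i in range(m)]


def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


class MultiIndex:
    """Toy multi-index: one exact-match table per substring."""

    def __init__(self, n_bits, radius):
        self.n_bits = n_bits
        self.radius = radius
        # Assumption: radius + 1 substrings, so two hashes within `radius`
        # bits must agree exactly on at least one substring (pigeonhole).
        self.m = radius + 1
        self.tables = [defaultdict(list) for _ in range(self.m)]

    def insert(self, h, payload):
        """Index hash h; payload is e.g. a (video_id, timestamp) tuple."""
        for table, sub in zip(self.tables, split_hash(h, self.n_bits, self.m)):
            table[sub].append((h, payload))

    def lookup(self, query):
        candidates = set()  # a set removes duplicate results
        for table, sub in zip(self.tables, split_hash(query, self.n_bits, self.m)):
            candidates.update(table.get(sub, ()))
        # Drop candidates whose full hash is farther away than the radius.
        return sorted(c for c in candidates if hamming(c[0], query) <= self.radius)


index = MultiIndex(n_bits=64, radius=3)
index.insert(0x0123456789ABCDEF, ("video_a", 1.0))
index.insert(0xFFFFFFFFFFFFFFFF, ("video_b", 2.0))
near = index.lookup(0x0123456789ABCDEF ^ 0b0111)  # 3 bits flipped
far = index.lookup(0x0123456789ABCDEF ^ 0b1111)   # 4 bits flipped
```

In this example `near` contains the `video_a` entry, while `far` is empty: the 4-bit variant is still retrieved as a candidate through its unchanged substrings, then removed by the final distance filter. Every such discarded candidate costs query time, which is why the per-index false positive rate matters.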

2 Related Work

2.1 Learning Binary Hash Functions

Relevant to our method, some work has been done to find a good general method for learning binary hash codes. Thus far these methods have relaxed discrete Hamming distance losses into differentiable optimizations by using piecewise-linear transformations on the hash embeddings [8, 9]. In this work we take these ideas further and leverage a more natural transformation.

2.2 ASM

So far neural networks have failed to outperform hand-picked features at hashing for ASM. The main difference among existing state-of-the-art approaches comes from their hash functions, which are typically chosen from the frequency responses of some basis [3]. For wavelets, the discrete wavelet transform is run on images in grayscale, returning embeddings in the corresponding basis [3]. The most common color descriptor representation is the Color Layout Descriptor (CLD), which performs a discrete cosine transform on each channel of a smoothed image in YCbCr color space. Each of these embeddings is generally binarized with a 1 for each above-median response and a 0 for the others.
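As a minimal sketch of the median binarization just described, the following packs a list of real-valued responses into an integer hash. The function name `median_binarize` and the bit order (first response becomes the most significant bit) are illustrative assumptions, not details from the cited work.

```python
from statistics import median


def median_binarize(responses):
    """Return an integer hash with a 1 bit for each above-median response."""
    m = median(responses)
    bits = 0
    for r in responses:
        bits = (bits << 1) | (1 if r > m else 0)
    return bits


# Four hypothetical responses with median 0.6: only the second and fourth
# lie above the median.
example_hash = median_binarize([0.1, 0.9, 0.5, 0.7])
```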


The state of ASM research leaves a major gap: learned methods that can perform temporally accurate scene matching quickly on very large datasets. In this paper, we used 3 benchmarks: the 64-bit (8x8) Haar wavelet hash, the 256-bit (16x16) Haar wavelet hash, and the 192-bit CLD hash.

2.3 CBVR and CBIR

Content-based video retrieval (CBVR) is a broad topic that involves using video, audio, and/or metadata to retrieve similar videos from a dataset. This is an easier task than ASM in that an entire video is retrieved, rather than a specific video segment. There has been some recent research into learning a binary hash function for entire videos based on high-level, semantic labels [10, 11].

Similarly, the objective of content-based image retrieval (CBIR) is to take a query image as input and return a set of similar images in an image dataset. Many recent papers in this field have also used deep learning approaches to train embeddings that get binarized into hashes.

So far, deep learning papers in these topics have mainly used a combination of three loss terms:

  • terms that minimize or maximize the Euclidean distance between embeddings depending on whether they belong to similar or dissimilar content [12, 13, 6, 14, 15, 7, 10, 11, 16]

  • classification loss terms that use a bottleneck before the classification layer as the hash layer [17, 18, 10, 11]

  • binarization loss terms that punish embeddings for being far from $\pm 1$ [13, 6, 14, 15, 16]

Occasionally other loss terms are applied, including MSE from predefined target hash codes [19] and adversarial error [16].

We experimented heavily with these loss functions, but ultimately developed our own. However, ideas from CBVR and CBIR papers such as using loss terms between each pair of images in a batch [13] proved valuable in creating a good training scheme for CHASM.

Another notable trend in CBVR and CBIR research is the use of either binarization loss or learning by continuation [7]; that is, gradually sharpening sigmoids to force embeddings close to $\pm 1$. This draws inspiration from the iterative quantization (ITQ) approach, which solves an alternating optimization problem of improving the embedding based on other metrics, then updating a rotation matrix to minimize binarization loss [20]. Unlike ITQ, more recent papers now allow gradients caused by binarization loss and learning by continuation to backpropagate through their network.

We find that backpropagating binarization loss or using learning by continuation causes learning to plateau, with the model only learning from a shrinking gray area of data points in between disconnected regions of data points near the corners of the hypercube $[-1, 1]^n$. Moreover, the values in these regions do not binarize with the sign function any differently than less extreme values. The main blocker preventing researchers from abandoning these methods is that Euclidean distance becomes a bad approximation for Hamming distance otherwise.

3 Method

3.1 Interpretation of Embedding

We propose an alternative to binarization loss and learning by continuation that respects the geometry of the embedding without punishing intermediate values. We instead let our model produce embeddings following an approximately Gaussian distribution.

3.1.1 Distribution of Embedding


Let $z(x) \in \mathbb{R}^n$ be the vector of hash node outputs for an input frame $x$, and let $\mathcal{D}$ be the distribution of video frames to consider. We motivate our loss function with the following assumptions:

  • If $X \sim \mathcal{D}$ is a random video frame variable, then $z_i(X) \sim \mathcal{N}(0, 1)$ (enforced by batch normalization of $z_i$ and a loss term on skew).

  • $z_i(X)$ is independent of other $z_j(X)$.

Let $y = z / \|z\|_2$ be the $L_2$-normalized output vector. Since $z$ is a vector of independent random normal variables, $y$ is a random variable distributed uniformly on the hypersphere.

This $L_2$-normalization is the same as SphereNorm [21] and very similar to Riemannian Batch Normalization [22]. Liu et al. posed the question of why this technique works better in conjunction with batch norm than either approach alone, and our work bridges that gap. An $L_2$-normalized vector of IID random normal variables forms a uniform distribution on a hypersphere, whereas vectors drawn from most other distributions would not. An uneven distribution would limit the regions on the hypersphere where learning can happen and leave room for internal covariate shift toward different, unknown regions of the hypersphere.

3.1.2 Estimate of Distribution of Hamming Distance

To avoid the assumption that Euclidean distance translates to Hamming distance, we further study the distribution of Hamming distance given these $L_2$-normalized vectors. We derive the exact probability that two bits match, given two uniformly random points $y_1, y_2$ on the hypersphere, conditioned on the angle $\theta$ between them.

We know that $y_1 \cdot y_2 = \cos\theta$, so the arc length of the path on the unit hypersphere between them is $\theta = \arccos(y_1 \cdot y_2)$. A half loop around the unit hypersphere would cross each of the $n$ axis hyperplanes (i.e. $y_i = 0$) once, so a randomly positioned arc of length $\theta$ crosses $n\theta/\pi$ axis hyperplanes on average. Each axis hyperplane crossed corresponds to a bit flipped, so the probability that a random bit differs between these vectors is $\theta/\pi$.

Given this exact probability, we estimate the distribution of Hamming distance by making the approximation that each bit position between the two vectors differs independently from the others with probability $\theta/\pi$. Therefore, the probability of Hamming distance being within $r$ is approximately $F(r; n, \theta/\pi)$, where $F$ is the binomial cumulative distribution function. This approximation proves to be very close for large $n$ (Figure 3).
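The per-bit probability can be checked numerically. The sketch below assumes the bit-flip probability is $\theta/\pi$, where $\theta$ is the angle between the two unit vectors; the helper names are hypothetical. It draws random pairs of unit vectors at a fixed angle and counts how often the signs of corresponding components differ.

```python
import math
import random


def bit_flip_probability(theta):
    """Probability that one sign bit differs for unit vectors at angle theta."""
    return theta / math.pi


def random_pair_at_angle(n, theta, rng):
    """Draw unit vectors u, v in R^n with angle exactly theta between them."""
    u = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in u))
    u = [x / norm for x in u]
    # Random direction orthogonal to u (Gram-Schmidt on a Gaussian vector).
    w = [rng.gauss(0.0, 1.0) for _ in range(n)]
    dot = sum(a * b for a, b in zip(u, w))
    w = [b - dot * a for a, b in zip(u, w)]
    norm = math.sqrt(sum(x * x for x in w))
    w = [x / norm for x in w]
    v = [math.cos(theta) * a + math.sin(theta) * b for a, b in zip(u, w)]
    return u, v


def empirical_flip_rate(n, theta, trials, seed=0):
    """Fraction of components whose sign differs, averaged over random pairs."""
    rng = random.Random(seed)
    flips = 0
    for _ in range(trials):
        u, v = random_pair_at_angle(n, theta, rng)
        flips += sum((a >= 0) != (b >= 0) for a, b in zip(u, v))
    return flips / (trials * n)


theta = math.pi / 4
estimate = empirical_flip_rate(n=64, theta=theta, trials=2000)
expected = bit_flip_probability(theta)  # 0.25 for a 45-degree separation
```

With a few thousand trials, `estimate` should land close to `expected`. The independence of bit positions is the only approximation in the binomial model; the per-bit probability itself is exact.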

Figure 2: An arc of length $\theta$ on the unit hypersphere starting from a random point in a random direction has probability $\theta/\pi$ for the sign of a particular component to change along its course. In the 3D example above, crossing the great circle implies that the sign of one component differs between $y_1$ and $y_2$.

Prior hashing research has made inroads with a similar observation, but applied it in the limited context of choosing vectors to project an embedding onto for binarization [23]. We apply this idea directly in network training.

Figure 3: The distribution of Hamming distance for two uniformly random vectors on the hypersphere, conditioned on being separated by an angle $\theta$, shown for several settings from left to right. Each empirical distribution was calculated from the results of repeated random trials.

3.2 Classes of Time Differences

For brevity, we define four classes of pairs of frames, depending on how far separated in time they are (Table 1). Our goal in CHASM is to maximize how often frame pairs in the closest class match together while minimizing how often frame pairs in more distant classes do. Among the classes that should not match, it is by far most important that frame pairs from different videos do not match together, since by far most frames in a video index will be from videos different than the query.

name time difference same shot same video
yes yes
yes yes
no yes
no no
Table 1: Classes of frame pairs

3.3 Loss Function

With batch size $b$, let $Z$ be our batch-normalized logit layer for a batch of $b$ frames and $Y$ be the $L_2$-row-normalized version of $Z$; that is, each row of $Y$ is the corresponding row of $Z$ divided by its $L_2$ norm. Similarly, let the substring matrices be the $L_2$-row-normalized submatrices formed by splitting $Z$ into vertical slices; in other words, define one submatrix for the logits of each substring. Let $w$ be the vector of all our model’s learnable weights. Let the loss weight matrices depend on which class each pair of frames is in (Table 2). We define our loss as a combination of the following terms:

  • The class-weighted average log likelihood of each pair of frames to be within Hamming distance $r$ (Table 2).

  • The class-weighted average log likelihood of each pair of frames to be outside Hamming distance $r$ (Table 2).

  • The class-weighted log likelihood that substrings differ between frames, summed over each substring (Table 2). This term is particular to multi-indexing.

  • A skew term, penalizing high skew and enforcing our assumption that the embedding follows a Gaussian distribution.

  • A regularization term on the model’s learnable weights to minimize overfitting.

1 0 0 0
0 5
0 0
Table 2: Loss weights by frame pair class of

Note that the pairwise loss terms work on all pairwise combinations of images in the batch, providing us with a very accurate estimate of the true gradient.
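To make the pairwise structure concrete, here is a hedged sketch of how a within-radius term could be computed for every pair in a batch, using the binomial approximation from Section 3.1.2. It is not the authors' exact loss: the function names, the negation (so that lower is better), the clamping constant `eps`, and the example weights are all assumptions made for illustration.

```python
import math


def l2_normalize(row):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row]


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def pairwise_match_probability(embeddings, radius):
    """Estimated P(Hamming distance <= radius) for every pair of rows."""
    n_bits = len(embeddings[0])
    unit = [l2_normalize(row) for row in embeddings]
    probs = []
    for u in unit:
        row = []
        for v in unit:
            cos = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
            p_flip = math.acos(cos) / math.pi  # per-bit flip probability
            row.append(binom_cdf(radius, n_bits, p_flip))
        probs.append(row)
    return probs


def weighted_match_loss(probs, weights, eps=1e-12):
    """Weighted average negative log likelihood that pairs fall within the radius."""
    total_weight = sum(sum(row) for row in weights)
    total = sum(
        w * -math.log(max(p, eps))
        for p_row, w_row in zip(probs, weights)
        for p, w in zip(p_row, w_row)
    )
    return total / total_weight


# Two toy 4-bit embeddings at a right angle: per-bit flip probability 0.5.
toy = [[1.0, 1.0, 1.0, 1.0], [1.0, -1.0, 1.0, -1.0]]
toy_probs = pairwise_match_probability(toy, radius=1)
toy_loss = weighted_match_loss(toy_probs, weights=[[0, 1], [1, 0]])
```

Because the estimate is a smooth function of the angle between embeddings, a term of this shape can pass gradients to every pair in the batch without forcing embedding values toward binary.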

3.4 Dataset

We trained our model using frames from Google’s AVA video dataset [24], which consists of 154 training and 38 test videos annotated with activities. For our purpose of automated scene matching, we disregarded the activity annotations. We were able to obtain 136 of the training videos and 36 of the test videos.

To ensure that our model would learn meaningful similarities between frames, we selected the distinct cut-free shots of each video. Then we filtered down to shots at least 4 seconds long and cut them to a maximum of 8 seconds. We used a subset such that each was separated by at least 60 seconds from any other. We then took training and testing shots from videos in the respective category, downsampling each shot at 15fps and resolution to produce video frames. We used all frames from the training set in training and distinct subsets from the test videos for validation and testing.

To find the cuts in each video, we used a cut detection model defined by [25]. We will make our shot annotations publicly available.

3.5 Architecture

The network that learns the hash function is composed of 5 main blocks of convolutions (Table 3). Structurally it is similar to a wide resnet [26], with an additional block added to handle the input size. Additionally, the pooling, classification, and softmax layers are removed and a fully connected layer is added to specify the hash size. By removing the global pooling, we allow the network to learn information about the position of features in images, which is important for automated scene matching. We batch normalize the fully connected layer’s outputs, giving the embedding. From there, the embedding is either $L_2$-normalized during training or binarized with the sign function during inference.

Following [27], we used batch normalization before each convolutional layer and removed the activation function from the residual path. In all our experiments, the depth factor was 6, which gives the network 49 convolutional layers. We experimented with different width factors, but ultimately found no gains for width factors over 1. This means the bulk of our resnet is identical to that of


group name output size block
fc hash size
Table 3: Hash function architecture. Downsampling is performed by the first convolution in each block with a stride of 2. Batch normalization and ReLU activation precede each convolutional layer (except the first), and we add dropout between each convolutional weight layer. We used a depth factor of 6 and a width factor of 1.

3.6 Training Scheme

We chose the time difference threshold such that 2 frames sampled at 15fps left or right of the query frame belong in the matching class.

Using a batch size of 420 (35 videos × 2 shots × 3 frames × 2), we used what we call “hierarchical batches”, which include pairs of images from each class of frame pairs. To construct one, we

  • choose 35 random videos from our dataset with replacement,

  • choose 2 random shots from each video without replacement,

  • choose 3 random frames from each shot without replacement, and

  • choose 1 random additional frame within for each of those frames without replacement.

This ensures that even for very large datasets, each class is available enough to train on.
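The construction above can be sketched as follows. The data layout (a dict mapping each video id to a list of shot lengths in frames), the function names, and the retry when an anchor has no unused neighbour are assumptions for this illustration. The listed counts imply 35 × 2 × 3 × 2 = 420 frames per batch.

```python
import random


def sample_shot_frames(length, window, n_frames, rng):
    """Pick n_frames anchors plus one unused partner within `window` of each."""
    while True:  # retry in the rare case an anchor has no unused neighbour
        anchors = rng.sample(range(length), n_frames)
        used = set(anchors)
        frames = []
        for a in anchors:
            lo, hi = max(0, a - window), min(length - 1, a + window)
            candidates = [j for j in range(lo, hi + 1) if j not in used]
            if not candidates:
                break
            partner = rng.choice(candidates)
            used.add(partner)
            frames += [a, partner]
        else:
            return frames


def hierarchical_batch(videos, window, rng, n_videos=35, n_shots=2, n_frames=3):
    """Sample (video_id, shot_index, frame_index) triples.

    Each anchor frame is immediately followed by its nearby partner frame.
    """
    batch = []
    for video_id in rng.choices(sorted(videos), k=n_videos):  # with replacement
        shot_lengths = videos[video_id]
        for shot in rng.sample(range(len(shot_lengths)), n_shots):
            picked = sample_shot_frames(shot_lengths[shot], window, n_frames, rng)
            batch.extend((video_id, shot, frame) for frame in picked)
    return batch


# Hypothetical dataset: 5 videos, each with three shots of 60 to 120 frames.
videos = {f"video_{i}": [60, 90, 120] for i in range(5)}
batch = hierarchical_batch(videos, window=2, rng=random.Random(0))
```

Sampling in this nested way means every batch contains nearby same-shot pairs, distant same-shot pairs, different-shot pairs, and different-video pairs, regardless of how large the dataset is.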

We trained our model using stochastic gradient descent with momentum for 8M images. Our network’s weights were randomly initialized to configurations with very high loss terms, so we started our learning rate at a very low value for the first 1000 steps, gradually increasing it until we began a cosine decay from a more typical learning rate, with the learning rate expressed as a function of the global step.
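A schedule of this general shape can be sketched as below. The linear form of the warm-up, the function name, and the values of `start_lr`, `peak_lr`, and `total_steps` are illustrative assumptions; only the 1000-step warm-up followed by cosine decay comes from the text.

```python
import math


def learning_rate(step, total_steps, peak_lr, warmup_steps=1000, start_lr=1e-5):
    """Linear warm-up from start_lr to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        frac = step / warmup_steps
        return start_lr + frac * (peak_lr - start_lr)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Hypothetical run of 20000 steps with a peak learning rate of 0.1.
schedule = [
    learning_rate(s, total_steps=20000, peak_lr=0.1)
    for s in (0, 500, 1000, 10500, 20000)
]
```

The learning rate rises through the warm-up, peaks at step 1000, falls to half the peak midway through the decay, and reaches zero at the final step.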

To minimize overfitting, we used dropout with 30% probability and flipped each batch of images horizontally with 50% probability.

4 Results

Hash Function chosen TP rate FP rate FP rate FP rate
Haar Wavelets (64-bit) 3 0.810 0.261
Haar Wavelets (256-bit) 14 0.834 0.265
CLD (192-bit) 16 0.835 0.265
CHASM (64-bit) 3 0.885 0.334
CHASM (192-bit) 7 0.319
CHASM-C (192-bit) 8 0.621
CHASM-B (192-bit) 1 0.800 0.282
CHASM-N (192-bit) 7 0.878 0.288
Table 4: Positive rate by frame pair class and hash function. The matching class and the different-video class are by far the most important classes for these metrics. Values of $r$ were chosen by scanning the ROC curves of true positive rate vs. false positive rate for the best tradeoff (Figure 4).

We trained CHASM models for hashes of 64 and 192 bits, optimizing for a binary substring size of 32 bits. We tested our results on the distinct pairs of frames in our test set. CHASM achieved higher true positive rates and lower false positive rates for each class and hash size; in fact, even our 64-bit hash beat the 192- and 256-bit benchmark hashes on both true positive rate and false positive rate by a large margin (Figure 4).

The lowest possible false positive rate on our test set was nonzero, since various videos contained perfectly identical black frames. These results can be avoided in practice by ignoring any perfectly black frames, which are not very informative. Without these frames, the 192-bit CHASM at 88.8% true positive rate achieves over a 100-fold reduction of the nontrivial false positive rate relative to the best benchmark hash (256-bit wavelets) at 85.9% true positive rate.

In addition, we compare against two variants of CHASM, modified by removing batch and $L_2$ normalization on the embedding and using different loss functions:

  • CHASM-B, using a squared error loss term between each frame pair depending on similarity and a binarization loss term for how far the embedding is from binary.

  • CHASM-C, using a logistic loss term based on the dot product of embeddings. We also used learning by continuation, computing the embeddings by passing our final layer through a layer that periodically gets sharper throughout training.

We implemented the loss function from [6] for the former and that of [7] for the latter, along with appropriately tuned learning rate schedules and hyperparameters.

Neither approach performed on the same level as any of our benchmark hashes, let alone CHASM (Table 4). However, this is not in contradiction with their respective papers’ results; both worked for low-recall, high-precision tasks like finding nearest neighbors.

Finally, we compared against a model CHASM-N trained without skew loss. We found that it had lower true positive rate and higher false positive rate for every value of $r$.

Figure 4: True positive rate vs. false positive rate among frames from different videos, plotted at different values of $r$. Each star corresponds to a heuristically chosen value of $r$ on this dataset that has at least 60% TP rate and gives the best tradeoff between true positive rate and false positive rate. CHASM-N is omitted due to clutter.

5 Conclusion

Our results show for the first time that a neural network is capable of outperforming traditional hashing methods at the task of hashing for ASM. Our model was able to reduce nontrivial false-positive rate on a large dataset by a factor of 100, even at higher true positive rate. This constitutes a massive improvement in the speed and accuracy of ASM systems.

In contrast, we found that state-of-the-art approaches to CBIR were unable to beat even our benchmarks. We attribute our model’s comparative success to four main factors.

  • CHASM’s loss depends on the chance of misclassifying an image.

  • CHASM uses good estimates for the distribution of Hamming distance as a function of embeddings.

  • CHASM does not restrict the embedding’s values to be near $\pm 1$ during training, a restriction that (without actually changing the binarized values) can prevent the model from learning.

  • CBIR research has focused on low-recall, moderate-precision regimes like nearest neighbors, whereas ASM demands extremely low false positive rate.

We also shed light on why $L_2$-normalization of layer outputs improves learning in conjunction with batch norm. Perhaps most importantly, we provide a powerful new loss function and training scheme for learning binary hash functions in general.