This paper explores convolutional architectures as robust visual descriptors for image patches and evaluates them in the context of patch and image retrieval. We explore several levels of supervision for training such networks, ranging from fully supervised to unsupervised. In this context, requiring supervision may seem unusual, since data for retrieval tasks typically does not come with labels. Convolutional Neural Networks (CNNs) have achieved state-of-the-art results in many other computer vision tasks, but require abundant labels to learn their parameters. For this reason, previous work with CNN architectures on image retrieval has focused on using global (babenko2014neural) or aggregated local (razavian2014cnn; gong2014multi; ng2015exploiting; babenko2015aggregating; tolias2015particular) CNN descriptors that were learned on an unrelated classification task. To improve the performance of these transferred features, babenko2014neural showed that fine-tuning global descriptors on a dataset of landmarks results in improvements on retrieval datasets that contain buildings. It is unclear, however, whether the lower levels of a convolutional architecture, appropriate for local description, are impacted by such global fine-tuning (yosinsky2014transfer). Recent approaches have thus attempted to discriminatively learn low-level convolutional descriptors, either by enforcing a certain level of invariance through explicit transformations (fischer2014descriptor) or by training with a patch-matching dataset (zagoruyko2015learning; simoserra2015discriminative). In all cases, the link between the supervised classification objective and image retrieval is artificial, which motivates us to investigate the performance of new unsupervised learning techniques.
To do so, we propose an unsupervised patch descriptor based on Convolutional Kernel Networks (CKNs) (mairal2014convolutional). This required turning the proof of concept of mairal2014convolutional into a descriptor with state-of-the-art performance on large-scale benchmarks. This paper introduces significant improvements of the original model, algorithm, and implementation, as well as an adaptation of the approach to image retrieval. Our conclusion is that supervision might not be necessary to train convolutional networks for image and patch retrieval, since our unsupervised descriptor achieves the best performance on several standard benchmarks.
Another distinctive aspect of our work is the joint evaluation of our models on the problems of patch and image retrieval. Most works that study patch representations (brown2011discriminative; winder2009picking; zagoruyko2015learning; fischer2014descriptor) do so in the context of patch retrieval only, and do not test whether conclusions also generalize to image retrieval (typically after an aggregation step). In fact, the correlation between the two evaluation methods (patch- and image-level) is not clear beforehand, which motivated us to design a new dataset to answer this question. We call this dataset RomePatches; it consists of views of several locations in Rome (li2010location), for which a sparse groundtruth of patch matches is obtained through 3D reconstruction. This results in a dataset for patch and image retrieval, which enables us to quantify the performance of patch descriptors for both tasks.
To evaluate the descriptor performance, we adopt the following pipeline (see Fig. 1), described in detail in section 6.1.1. We use the popular Hessian-Affine detector of mikolajczyk2004scale, which has been shown to give state-of-the-art results (tinne2007). The regions around these points are encoded with the convolutional descriptors proposed in this work. We aggregate local features with VLAD-pooling (jegou2010aggregating)
on the patch descriptors to build an approximate matching technique. VLAD pooling has been shown to outperform Bag-of-Words, and to give performance similar to Fisher Vectors (perronnin2007fisher), another popular aggregation technique for image retrieval (jegou2012aggregating).
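As an illustration, VLAD pooling can be sketched in a few lines of NumPy (hard assignment of each local descriptor to its nearest k-means centroid, accumulation of the residuals per centroid, then global L2 normalization; the function name is ours):

```python
import numpy as np

def vlad(descriptors, centroids, normalize=True):
    """VLAD pooling: for each cluster, sum the residuals between the
    local descriptors assigned to it and the cluster centroid.

    descriptors: (n, d) array of local patch descriptors
    centroids:   (k, d) array of k-means cluster centers
    Returns a k*d-dimensional image-level vector.
    """
    n, d = descriptors.shape
    k = centroids.shape[0]
    # Hard-assign each descriptor to its nearest centroid.
    dists = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = dists.argmin(axis=1)
    v = np.zeros((k, d))
    for j in range(k):
        members = descriptors[assign == j]
        if len(members):
            v[j] = (members - centroids[j]).sum(axis=0)  # residual sum
    v = v.ravel()
    if normalize:
        v /= (np.linalg.norm(v) + 1e-12)  # global L2 normalization
    return v
```

The output dimension is the number of centroids times the local descriptor dimension, which is why a dimensionality reduction such as PCA is commonly applied afterwards.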
A preliminary version of this article has appeared in (paulin2015local). Here we extend the related work (Section 2) and describe in detail the convolutional kernel network (CKN), in particular its reformulation which leads to a fast learning algorithm (Section 4). We also add a number of experiments (see Section 6). We compare to two additional supervised local CNN descriptors. The first one is based on the network of fischer2014descriptor, but trained on our RomePatches dataset. The second one is trained with the Siamese architecture of zagoruyko2015learning. We also compare our approach to an AlexNet network fine-tuned on the Landmarks dataset. Furthermore, we provide an in-depth study of PCA compression and whitening, standard post-processing steps in image retrieval, which further improve the quality of our results. We also show that our method can be applied to dense patches instead of Hessian-Affine, which improves performance for some of the benchmarks.
The remainder of the paper is organized as follows. We discuss previous work that is most relevant to our approach in Section 2. We describe the framework for convolutional descriptors and convolutional kernel networks in Sections 3 and 4. We introduce the RomePatches dataset as well as standard benchmarks for patch and image retrieval in Section 5. Section 6 describes experimental results.
2 Related Work
In this section we first review the state of the art for patch description and then present deep learning approaches for image and patch retrieval. For deep patch descriptors, we first present supervised and, then, unsupervised approaches.
2.1 Patch descriptors
A patch is an image region extracted from an image. Patches can be extracted either densely or at interest points. The most popular patch descriptor is SIFT (lowe2004distinctive), which showed state-of-the-art performance for patch matching (mikolajczyk2005performance). It can be viewed as a three-layer CNN: the first layer computes gradient histograms using convolutions; the second, fully-connected, weights the gradients with a Gaussian; and the third pools across a 4x4 grid. Local descriptors that improve on SIFT include SURF (bay2006surf), BRIEF (brief2010) and LIOP (liop2011). Recently, dong2015domain built on SIFT, using local pooling over scale and location to reach state-of-the-art performance in patch retrieval.
All these descriptors are hand-crafted, and their relatively small numbers of parameters have been optimized by grid search. When the number of parameters to set is large, such an approach becomes infeasible and the optimal parametrization needs to be learned from data.
A number of approaches learn patch descriptors without relying on deep learning. Most of them use a strong supervision. brown2011discriminative (see also winder2009picking) design a matching dataset based on 3D models of landmarks and use it to train a descriptor consisting of several existing parts, including SIFT, GLOH (mikolajczyk2005performance) and Daisy (tola2010daisy). philbin2010descriptor
learn a Mahalanobis metric for SIFT descriptors to compensate for the binarization error, with excellent results in instance-level retrieval. simonyan2014learning propose the “Pooling Regions” descriptor and learn its parameters, as well as a linear projection, using stochastic optimization. Their learning objective can be cast as a convex optimization problem, which is not the case for classical convolutional networks.
An exception that departs from this strongly supervised setting is (bo2010kernel) which presents a match-kernel interpretation of SIFT, and a family of kernel descriptors whose parameters are learned in an unsupervised fashion. The Patch-CKN we introduce generalizes kernel descriptors; the proposed procedure for computing an explicit feature embedding is faster and simpler.
2.2 Deep learning for image retrieval
If a CNN is trained on a sufficiently large labeled set such as ImageNet (deng2009imagenet), its intermediate layers can be used as image descriptors for a wide variety of tasks, including image retrieval (babenko2014neural; razavian2014cnn). The output of one of the fully-connected layers is often chosen because it is compact, usually 4,096-dimensional. However, global CNN descriptors lack geometric invariance (gong2014multi), and produce results below the state of the art for instance-level image retrieval.
In (razavian2014cnn; gong2014multi), CNN responses at different scales and positions are extracted. We proceed similarly, yet we replace the (coarse) dense grid with a patch detector. There are important differences between (razavian2014cnn; gong2014multi) and our work. While they use the penultimate layer as patch descriptor, we show in our experiments that we can get improved results with preceding layers, that are cheaper to compute and require smaller input patches. Closely related is the work of ng2015exploiting which uses VLAD pooling on top of very deep CNN feature maps, at multiple scales with good performance on Holidays and Oxford. Their approach is similar to the one of gong2014multi, but faster as it factorizes computation using whole-image convolutions. Building on this, tolias2015particular uses an improved aggregation method compared to VLAD, that leverages the structure of convolutional feature maps.
babenko2014neural use a single global CNN descriptor for instance-level image retrieval and fine-tune the descriptor on an external landmark dataset. We experiment with their fine-tuned network and show improvements also with lower levels on the Oxford dataset. CKN descriptors still outperform this approach. Finally, (wang2014learning) propose a Siamese architecture to train image retrieval descriptors but do not report results on standard retrieval benchmarks.
2.3 Deep patch descriptors
Recently, several approaches (long2014corresp; fischer2014descriptor; simoserra2015discriminative; zagoruyko2015learning) have outperformed SIFT for patch matching or patch classification. These approaches use different levels of supervision to train a CNN. long2014corresp learn their patch CNNs using category labels of ImageNet. fischer2014descriptor create surrogate classes, where each class corresponds to a patch and distorted versions of it. Matching and non-matching pairs are used in (simoserra2015discriminative; zagoruyko2015learning). There are two key differences between those works and ours. First, they focus on patch-level metrics instead of actual image retrieval. Second, and more importantly, while all these approaches require some kind of supervision, we show that our Patch-CKN yields competitive performance in both patch matching and image retrieval without any supervision.
2.4 Unsupervised learning for deep representations
To avoid costly annotation, many works leverage unsupervised information to learn deep representations. Unsupervised learning can be used to initialize network weights, as in erhan2009difficulty; erhan2010does. Methods that directly use unsupervised weights include domain transfer (donahue2014decaf)
and k-means (coates2012learning). Most recently, some works have investigated using temporal coherence as supervision (goroshin2015learning; goroshin2015unsupervised). Closely related to our work, agrawal2015learning propose to train a network by learning the affine transformation between synchronized image pairs for which camera parameters are available. Similarly, jayaraman2015learning use a training objective that enforces that sequences of images derived from the same ego-motion behave similarly in the feature space. While these two works focus on a weakly supervised setting, we focus on a fully unsupervised one.
3 Local Convolutional Descriptors
In this section, we briefly review notations related to CNNs and the possible learning approaches.
3.1 Convolutional Neural Networks
In this work, we use convolutional features to encode patches extracted from an image. We call convolutional descriptor any feature representation that decomposes in a multi-layer fashion as
$$f(x) = \gamma_K(\sigma_K(W_K \, \gamma_{K-1}(\sigma_{K-1}(W_{K-1} \cdots \gamma_1(\sigma_1(W_1 x)) \cdots )))), \quad (1)$$
where $x$ is an input patch represented as a vector, the $W_k$'s are matrices corresponding to linear operations, the $\sigma_k$'s are pointwise non-linear functions, e.g., sigmoids or rectified linear units, and the functions $\gamma_k$ perform a downsampling operation called “feature pooling”. Each composition is called a “layer” and the intermediate representations of $x$, between each layer, are called “maps”. A map can be represented as pixels organized on a spatial grid, with a multidimensional representation for each pixel. Borrowing a classical terminology from neuroscience, it is also common to call “receptive field” the set of pixels from the input patch that may influence a particular pixel value from a higher-layer map. In traditional convolutional neural networks, the matrices $W_k$ have a particular structure corresponding to spatial convolutions performed by small square filters, which will need to be learned. In the case where there is no such structure, the layer is called “fully-connected”.
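To make the decomposition concrete, here is a minimal NumPy sketch of one such layer, with a valid convolution as the linear operation, a ReLU as the pointwise non-linearity, and 2x2 average pooling as the downsampling (all three are illustrative choices, not the ones used later in the paper):

```python
import numpy as np

def conv_layer(x, filters):
    """One 'layer' of Eq. (1): a linear convolution (the W_k), a pointwise
    non-linearity (sigma_k, here a ReLU), and 2x2 average pooling (gamma_k).
    x: (h, w) single-channel map; filters: (n_f, e, e) array of filters."""
    n_f, e, _ = filters.shape
    h, w = x.shape
    oh, ow = h - e + 1, w - e + 1
    out = np.zeros((n_f, oh, ow))
    for i in range(oh):                       # 'valid' convolution
        for j in range(ow):
            patch = x[i:i + e, j:j + e]
            out[:, i, j] = (filters * patch).sum(axis=(1, 2))
    out = np.maximum(out, 0)                  # pointwise non-linearity
    # 2x2 average pooling (spatial downsampling)
    oh2, ow2 = oh // 2, ow // 2
    pooled = out[:, :oh2 * 2, :ow2 * 2].reshape(n_f, oh2, 2, ow2, 2).mean(axis=(2, 4))
    return pooled
```

Stacking several such layers, each operating on the maps produced by the previous one, gives a descriptor of the form (1).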
The hyper-parameters of a convolutional architecture lie in the choice of non-linearities $\sigma_k$ and type of pooling $\gamma_k$, in the structure of the matrices $W_k$ (notably the size and number of filters), as well as in the number of layers.
The only parameters that are learned in an automated fashion are usually the filters, corresponding to the entries of the matrices $W_k$. In this paper we investigate the following ways of learning: (i) encoding local descriptors with a CNN that has been trained for an unrelated classification task (Sec. 3.2.1), (ii) using a CNN that has been trained for a classification problem that can be directly linked to the target task (e.g. buildings, see Sec. 3.2.1), (iii) devising a surrogate classification problem to enforce invariance (Sec. 3.2.2), (iv) directly learning the weights using patch-level groundtruth (Sec. 3.3) or (v) using unsupervised learning, such as convolutional kernel networks, which we present in Section 4.
3.2 Learning Supervised Convolutional Descriptors
The traditional way of learning the weights in (1) consists in using a training set of examples $x_1, \ldots, x_n$, equipped with labels $y_1, \ldots, y_n$, choosing a loss function $\ell$, and minimizing it over the weights using stochastic gradient optimization and back-propagation (lecun1989handwritten; bottou2012stochastic). The choice of examples, labelings, and loss function leads to different weights.
3.2.1 Learning with category labels
A now classical CNN architecture is AlexNet (krizhevsky2012imagenet). AlexNet consists of 8 layers: the first five are convolutional layers and the last ones are fully connected.
In this case, the training examples are images that have been hand-labeled into classes such as “bird” or “cow” and the loss function is the softmax loss:
$$\min_{W} \; -\sum_{i=1}^{n} \log\left( \frac{e^{f_{y_i}(x_i)}}{\sum_{c} e^{f_c(x_i)}} \right). \quad (2)$$
In Eq. (2) and throughout the paper, $f^k(x_i)$ denotes the output of the $k$-th layer of the network applied to example $x_i$, and $f = f^K$ the final output. The notation $f_c$ corresponds to the $c$-th element of the map.
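For concreteness, a small NumPy implementation of this loss (written in the numerically stable log-sum-exp form, averaged over the batch) could look as follows:

```python
import numpy as np

def softmax_loss(scores, labels):
    """Average softmax (cross-entropy) loss of Eq. (2).
    scores: (n, C) network outputs f(x_i); labels: (n,) class indices."""
    # Subtract the row-wise max for numerical stability of the exponentials.
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```

A quick sanity check: with uniform scores over C classes, the loss equals log(C), the entropy of a uniform prediction.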
Even though the network is designed to process images of size 224x224, each neuron of a map has a smaller “receptive field”; see the “coverage” column in Table 2 from Section 6. Using an input image of the size of the receptive field produces a 1x1 map that we can use as a low-dimensional patch descriptor. To ensure a fair comparison between all approaches, we rescale the fixed-size input patches so that they fit the required input of each network.
We explore two different sets of labellings for AlexNet: the first one, which we call AlexNet-ImageNet, is learned on the training set of ILSVRC 2012, as in the original paper (krizhevsky2012imagenet). This set of weights is a popular choice for off-the-shelf convolutional features, even though the initial task is unrelated to the target image retrieval application. Following babenko2014neural, we also fine-tune the same network on the Landmarks dataset, to introduce semantic information that is more related to the target task. The resulting network is called AlexNet-Landmarks.
3.2.2 Learning from surrogate labels
Most CNNs, such as AlexNet, augment the dataset with jittered versions of training data to learn the filters in (1). dosovitskiy2014discriminative; fischer2014descriptor use virtual patches, obtained as transformations of randomly extracted ones, to design a classification problem related to patch retrieval. For a set of patches $P$ and a set of transformations $T$, the dataset consists of all transformed patches $t(p)$ for $t$ in $T$ and $p$ in $P$. Transformed versions of the same patch share the same label, thus defining surrogate classes. Similarly to the previous setup, the network uses the softmax loss (2).
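The construction of surrogate classes can be sketched as follows; the particular transformation set used below (identity, horizontal flip, rotation) is purely illustrative:

```python
import numpy as np

def surrogate_dataset(patches, transformations):
    """Build surrogate classes: every seed patch defines one class, whose
    examples are its transformed versions t(p) for all t in the set.
    patches: list of (h, w) arrays; transformations: list of callables."""
    examples, labels = [], []
    for label, p in enumerate(patches):
        for t in transformations:
            examples.append(t(p))
            labels.append(label)     # all versions of p share p's label
    return examples, np.array(labels)
```

The resulting examples and labels can then be fed to a standard softmax classifier, even though no human annotation was involved in defining the classes.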
In this paper, we evaluate this strategy by using the same network, called PhilippNet, as in fischer2014descriptor. The network has three convolutional and one fully connected layers, takes as input 64x64 patches, and produces a 512-dimensional output.
3.3 Learning with patch-level groundtruth
When patch-level labels are available, obtained by manual annotation or 3D reconstruction (winder2009picking), it is possible to directly learn a similarity measure as well as a feature representation. The simplest way to do so is to replace the virtual patches in the architecture of dosovitskiy2014discriminative; fischer2014descriptor, described in the previous section, with labeled patches of RomeTrain. We call this version “FisherNet-Rome”.
It can also be achieved using a Siamese network (chopra2005learning), i.e. a CNN which takes as input the two patches to compare, and where the objective function enforces that the output descriptors’ similarity should reproduce the ground-truth similarity between patches.
Optimization can be conducted with either a metric-learning cost (simoserra2015discriminative):
$$\ell(x_1, x_2) = \begin{cases} \|f(x_1) - f(x_2)\|_2 & \text{if } x_1 \text{ and } x_2 \text{ match,} \\ \max\left(0, \, m - \|f(x_1) - f(x_2)\|_2\right) & \text{otherwise,} \end{cases}$$
where $m$ is a margin parameter, or as a binary classification problem (“match”/“not-match”) with a softmax loss as in Eq. (2) (zbontar2014computing; zagoruyko2015learning). For those experiments, we use the parameters of the siamese networks of zagoruyko2015learning, available online at https://github.com/szagoruyko/cvpr15deepcompare. Following their convention, we refer to these architectures as “DeepCompare”.
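A minimal sketch of such a metric-learning cost for a single descriptor pair, in the style of the hinge embedding loss of simoserra2015discriminative (the margin value here is arbitrary):

```python
import numpy as np

def hinge_embedding_loss(f1, f2, match, margin=4.0):
    """Metric-learning cost for a siamese pair: pull matching descriptor
    pairs together, push non-matching pairs beyond a margin.
    f1, f2: descriptor vectors; match: bool ground-truth label."""
    d = np.linalg.norm(f1 - f2)
    return d if match else max(0.0, margin - d)
```

Note that non-matching pairs already farther apart than the margin contribute zero loss, so training gradients concentrate on hard negatives.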
4 Convolutional Kernel Networks
In this paper, the unsupervised learning strategy for learning convolutional networks is based on the convolutional kernel networks (CKNs) of mairal2014convolutional. Similar to CNNs, these networks have a multi-layer structure with convolutional, pooling, and nonlinear operations at every layer. Instead of learning filters by optimizing a loss function, say for classification, they are trained layerwise to approximate a particular nonlinear kernel, and therefore require no labeled data.
The presentation of CKNs is divided into three stages: (i) introduction of the abstract model based on kernels (Sections 4.1, 4.2, and 4.3); (ii) approximation scheme and concrete implementation (Sections 4.4, 4.5, and 4.7); (iii) optimization (Section 4.6).
4.1 A Single-Layer Convolutional Kernel for Images
The basic component of CKNs is a match kernel that encodes a similarity between a pair of images $M$ and $M'$ of size $m \times m$ pixels, which are assumed to be square. The integer $d$ represents the number of channels, say $d=3$ for RGB images. Note that when applied to image retrieval, these images correspond to regions – patches – extracted from an image. We omit this fact for simplicity since this presentation of CKNs is independent of the image retrieval task.
We denote by $\Omega$ the set of pixel locations, which is of size $m^2$, and choose a patch size $e$. Then, we denote by $p_z$ (resp. $p'_{z'}$) the patch of $M$ (resp. $M'$) at location $z$ (resp. $z'$). Then, the single-layer match kernel is defined as follows:
Single-Layer Convolutional Kernel.
$$K(M, M') = \sum_{z, z' \in \Omega} \|p_z\| \, \|p'_{z'}\| \, e^{-\frac{\|z - z'\|^2}{2\beta^2}} \, e^{-\frac{\|\tilde{p}_z - \tilde{p}'_{z'}\|^2}{2\sigma^2}}, \quad (5)$$
where $\beta$ and $\sigma$ are two kernel hyperparameters, $\|\cdot\|$ denotes the usual Euclidean norm, and $\tilde{p}_z$ and $\tilde{p}'_{z'}$ are $\ell_2$-normalized versions of the patches $p_z$ and $p'_{z'}$.
$K$ is called a convolutional kernel; it can be interpreted as a match-kernel that compares all pairs of patches from $M$ and $M'$ with a nonlinear kernel, weighted by a Gaussian term that decreases with their relative distance. The kernel compares indeed all locations in $M$ with all locations in $M'$. It depends notably on the parameter $\sigma$, which controls the nonlinearity of the Gaussian kernel comparing two normalized patches $\tilde{p}_z$ and $\tilde{p}'_{z'}$, and on $\beta$, which controls the size of the neighborhood in which a patch is matched with another one. In practice, the comparison of two patches that have very different locations $z$ and $z'$ will be negligible in the sum (5) when $\beta$ is small enough. Hence, the parameter $\beta$ allows us to control the local shift-invariance of the kernel.
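A brute-force evaluation of this match kernel can be sketched as follows; for readability we restrict the sketch to single-channel images and 1x1 patches, so each patch reduces to one pixel value and its normalized version to its sign:

```python
import numpy as np

def conv_kernel(M1, M2, beta=1.0, sigma=0.5):
    """Direct evaluation of the single-layer convolutional kernel for toy
    single-channel images with 1x1 patches. Every location of M1 is
    compared with every location of M2, so cost is quadratic in image size."""
    h, w = M1.shape
    locs = [(i, j) for i in range(h) for j in range(w)]
    K = 0.0
    for (i, j) in locs:
        for (k, l) in locs:
            p, q = M1[i, j], M2[k, l]
            np_, nq = abs(p), abs(q)
            tp = p / np_ if np_ > 0 else 0.0   # normalized "patches"
            tq = q / nq if nq > 0 else 0.0
            K += np_ * nq \
                 * np.exp(-((i - k) ** 2 + (j - l) ** 2) / (2 * beta ** 2)) \
                 * np.exp(-((tp - tq) ** 2) / (2 * sigma ** 2))
    return K
```

Decreasing beta shrinks the spatial neighborhood in which patches are compared, which is exactly the local shift-invariance trade-off discussed above.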
4.2 From Kernels to Infinite-Dimensional Feature Maps
Designing a positive definite kernel on data is equivalent to defining a mapping of the data to a Hilbert space, called reproducing kernel Hilbert space (RKHS), where the kernel is an inner product (rkhs-book:2007); exploiting this mapping is sometimes referred to as the “kernel trick” (scholkopf2002learning). In this section, we will show how the kernel (5) may be used to generalize the concept of “feature maps” from the traditional neural network literature to kernels and Hilbert spaces. (Note that in the kernel literature, “feature map” denotes the mapping between data points and their representation in a reproducing kernel Hilbert space. Here, feature maps refer to spatial maps representing local image characteristics at every location, as usual in the neural network literature, lecun1998.) The kernel is indeed positive definite (see the appendix of mairal2014convolutional) and thus suits our needs.
Basically, feature maps from convolutional neural networks are spatial maps where every location carries a finite-dimensional vector representing information from a local neighborhood in the input image. Generalizing this concept in an infinite-dimensional context is relatively straightforward with the following definition:
Let $\mathcal{H}$ be a Hilbert space. The set of feature maps is the set of applications $\varphi : \Omega \to \mathcal{H}$.
Given an image $M$, it is now easy to build such a feature map. For instance, consider the nonlinear kernel for patches defined in Eq. (6). According to the Aronszajn theorem, there exists a Hilbert space $\mathcal{H}$ and a mapping $\phi$ such that, for two image patches $p$ and $p'$ – which may come from different images or not – the kernel value is the inner product $\langle \phi(p), \phi(p') \rangle_{\mathcal{H}}$. As a result, we may use this mapping to define a feature map $\varphi$ for image $M$ such that $\varphi(z) = \phi(p_z)$, where $p_z$ is the patch from $M$ centered at location $z$. The first property of feature maps from classical CNNs would be satisfied: at every location, the map carries information from a local neighborhood from the input image $M$.
We will see in the next subsection how to build sequences of feature maps in a multilayer fashion, with invariant properties that are missing from the simple example we have just described.
4.3 From Single-Layer to Multi-Layer Kernels
We now show how to build a sequence of feature maps $\varphi_1$, …, $\varphi_K$ for an input image initially represented as a finite-dimensional map $\varphi_0 : \Omega_0 \to \mathbb{R}^{d_0}$, where $\Omega_0$ is the set of pixel locations in $\varphi_0$ and $d_0$ is the number of channels. The choice of initial map $\varphi_0$ is important since it will be the input of our algorithms; it is thus discussed in Section 4.7. Here, we assume that we have already made this choice, and we explain how to build a map $\varphi_k$ from a previous map $\varphi_{k-1}$. Specifically, our goal is to design $\varphi_k$ such that
$\varphi_k(z)$ for $z$ in $\Omega_k$ carries information from a local neighborhood from $\varphi_{k-1}$ centered at location $z$;
the map $\varphi_k$ is “more invariant” than $\varphi_{k-1}$.
These two properties can be obtained by defining a positive definite kernel $K_k$ on patches from $\varphi_{k-1}$. Denoting by $\mathcal{H}_k$ its RKHS, we may call $\varphi_k(z)$ the mapping to $\mathcal{H}_k$ of a patch from $\varphi_{k-1}$ centered at $z$. The construction is illustrated in Figure 3.
Concretely, we choose a patch shape $\mathcal{P}_k$, which is a set of coordinates centered at zero, along with a set of pixel locations $\Omega_k$ such that for all $z$ in $\Omega_k$ and $u$ in $\mathcal{P}_k$, the location $z+u$ is in $\Omega_{k-1}$. Then, the kernel for comparing two patches from $\varphi_{k-1}$ and $\varphi'_{k-1}$ at respective locations $z, z'$ in $\Omega_k$ is defined as
$$K_k(z, z') = \sum_{u, u' \in \mathcal{P}_k} \|\varphi_{k-1}(z+u)\| \, \|\varphi'_{k-1}(z'+u')\| \, e^{-\frac{\|u - u'\|^2}{2\beta_k^2}} \, e^{-\frac{\|\tilde{\varphi}_{k-1}(z+u) - \tilde{\varphi}'_{k-1}(z'+u')\|^2}{2\sigma_k^2}}, \quad (7)$$
for all $z, z'$ in $\Omega_k$, where $\tilde{\varphi}_{k-1}$ and $\tilde{\varphi}'_{k-1}$ are normalized—that is, $\tilde{\varphi} = \varphi / \|\varphi\|$ if $\|\varphi\| \neq 0$ and $0$ otherwise. This kernel is similar to the convolutional kernel for images already introduced in (5), except that it operates on infinite-dimensional feature maps. It involves two parameters $\beta_k, \sigma_k$ to control the amount of invariance of the kernel. Then, by definition, $\varphi_k(z)$ is the mapping such that the value (7) is equal to the inner product $\langle \varphi_k(z), \varphi'_k(z') \rangle_{\mathcal{H}_k}$.
This framework yields a sequence of infinite-dimensional image representations but requires finite-dimensional approximations to be used in practice. Among different approximate kernel embedding techniques (williams2001using; Bach:Jordan:2002; Rahimi:Recht:2008; PerronninSL10; VedaldiZ12), we introduce a data-driven approach that exploits a simple expansion of the Gaussian kernel, and which provides a new way of learning convolutional neural networks without supervision.
4.4 Approximation of the Gaussian Kernel
Specifically, the previous approach relies on an approximation scheme for the Gaussian kernel, which is plugged in the convolutional kernels (7) at every layer; this scheme requires learning some weights that will be interpreted as the parameters of a CNN in the final pipeline (see Section 4.5).
More precisely, for all $x$ and $x'$ in $\mathbb{R}^d$, and $\sigma > 0$, the Gaussian kernel can be shown to be equal to
$$e^{-\frac{\|x - x'\|^2}{2\sigma^2}} = \left( \frac{2}{\pi \sigma^2} \right)^{d/2} \int_{w \in \mathbb{R}^d} e^{-\frac{\|x - w\|^2}{\sigma^2}} \, e^{-\frac{\|x' - w\|^2}{\sigma^2}} \, dw. \quad (8)$$
Furthermore, when the vectors $x$ and $x'$ are on the sphere—that is, have unit $\ell_2$-norm—we also have
$$e^{-\frac{\|x - x'\|^2}{2\sigma^2}} = \mathbb{E}_{w \sim q}\left[ f(w^\top x) \, f(w^\top x') \right],$$
where $f$ is a nonlinear function such that $f(u) = e^{(2u - 1)/\sigma^2}$ and $q$ is the density of the multivariate normal distribution $\mathcal{N}(0, (\sigma^2/4) I)$. Then, different strategies may be used to approximate the expectation by a finite weighted sum:
$$e^{-\frac{\|x - x'\|^2}{2\sigma^2}} \approx \sum_{j=1}^{n} b_j \, f(w_j^\top x) \, f(w_j^\top x'),$$
which can be further simplified, after appropriate changes of variables, into
$$e^{-\frac{\|x - x'\|^2}{2\sigma^2}} \approx \sum_{j=1}^{n} e^{\eta_j + w_j^\top x} \, e^{\eta_j + w_j^\top x'},$$
for some sets of parameters $w_j$ in $\mathbb{R}^d$ and $\eta_j$ in $\mathbb{R}$, $j = 1, \ldots, n$, which need to be learned. The approximation leads to the kernel approximation $e^{-\frac{\|x - x'\|^2}{2\sigma^2}} \approx \langle \psi(x), \psi(x') \rangle$, where $\psi(x) = \left[ e^{\eta_j + w_j^\top x} \right]_{j=1}^{n}$, which may be interpreted as the output of a one-layer neural network with $n$ neurons and exponential nonlinear functions.
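The expectation form of this expansion can be checked numerically by Monte-Carlo sampling. The sketch below assumes unit-norm inputs and uses the function $f(u) = e^{(2u-1)/\sigma^2}$ with samples drawn from $\mathcal{N}(0, (\sigma^2/4) I)$, as stated above:

```python
import numpy as np

def gaussian_kernel_mc(x, y, sigma=1.0, n_samples=500_000, seed=0):
    """Monte-Carlo estimate of exp(-||x-y||^2 / (2 sigma^2)) for unit-norm
    x and y, via E_w[f(w.x) f(w.y)] with w ~ N(0, (sigma^2/4) I) and
    f(u) = exp((2u - 1) / sigma^2)."""
    rng = np.random.default_rng(seed)
    d = len(x)
    w = rng.normal(scale=sigma / 2.0, size=(n_samples, d))
    f = lambda u: np.exp((2.0 * u - 1.0) / sigma ** 2)
    return (f(w @ x) * f(w @ y)).mean()
```

With a few hundred thousand samples the estimate matches the exact kernel value to two or three decimal places; CKNs replace this random sampling by a small set of learned anchor points $w_j$.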
The change of variables that we have introduced yields a simpler formulation than the original one of mairal2014convolutional. Given a set of training pairs $(x_i, x'_i)$ of normalized signals in $\mathbb{R}^d$, the weights $w_j$ and scalars $\eta_j$ may now be obtained by minimizing
$$\min_{w, \eta} \sum_{i=1}^{N} \left( e^{-\frac{\|x_i - x'_i\|^2}{2\sigma^2}} - \sum_{j=1}^{n} e^{\eta_j + w_j^\top x_i} \, e^{\eta_j + w_j^\top x'_i} \right)^2, \quad (12)$$
which is a non-convex optimization problem. How we address it will be detailed in Section 4.6.
4.5 Back to Finite-Dimensional Feature Maps
Convolutional kernel networks use the previous approximation scheme of the Gaussian kernel to build finite-dimensional image representations of an input image with the following properties:
There exists a patch size $e_k$ such that a patch $p$ of $M_k$ at location $z$—which is formally a vector—provides a finite-dimensional approximation of the kernel map $\varphi_k(z)$. In other words, given another patch $p'$ from a map $M'_k$, we have $\langle p, p' \rangle \approx K_k(z, z')$. (Note that to be more rigorous, the maps $M_k$ need to be slightly larger in spatial size than $\Omega_k$, since otherwise a patch at location $z$ from $M_k$ may take pixel values outside of $\Omega_k$. We omit this fact for simplicity.)
Computing a map $M_k$ from $M_{k-1}$ involves convolution with learned filters and linear feature pooling with Gaussian weights.
This construction admits two interpretations. The first one is to consider CKNs as an approximation of the infinite-dimensional feature maps presented in the previous sections (see mairal2014convolutional, for more details about the approximation principles).
The second one is to see CKNs as a particular type of convolutional neural network with contrast-normalization. Unlike traditional CNNs, the filters and nonlinearities are learned to approximate the Gaussian kernel on patches from layer $k-1$.
With both interpretations, this representation induces a change of paradigm in unsupervised learning with neural networks, where the network is not trained to reconstruct input signals, but where its non-linearities are derived from a kernel point of view.
In practice, computing the map $M_k$ from $M_{k-1}$ proceeds in the following steps:
Extract patches of size $e_k \times e_k$ from the input map $M_{k-1}$;
Compute contrast-normalized patches $\tilde{p}_z = p_z / \|p_z\|$ (with $\tilde{p}_z = 0$ when $p_z = 0$);
Produce an intermediate map with linear operations followed by a non-linearity:
$$a_z = \|p_z\| \left[ e^{\eta_j + w_j^\top \tilde{p}_z} \right]_{j=1}^{n_k}, \quad (13)$$
where the exponential function is meant “pointwise”;
Produce the output map by linear pooling with Gaussian weights:
$$M_k(z) = \sum_{u} a_u \, e^{-\|z - u\|^2 / \beta_k^2}.$$
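These steps can be sketched as follows for a single-channel map with 1x1 patches; the filters `W`, offsets `eta`, pooling width and subsampling factor are placeholders for quantities that would normally be learned or tuned, so this is an illustrative sketch rather than the authors' implementation:

```python
import numpy as np

def ckn_layer(M, W, eta, beta=2.0, subsample=2):
    """One CKN-style layer on a single-channel map M with 1x1 patches.
    W: (n,) filters, eta: (n,) offsets. Steps: normalize patches, apply the
    exponential non-linearity scaled by the patch norm, then Gaussian-weighted
    linear pooling evaluated on a subsampled grid of locations."""
    h, w = M.shape
    norms = np.abs(M)                                    # ||p_z|| for 1x1 patches
    tilde = np.where(norms > 0, M / np.maximum(norms, 1e-12), 0.0)
    # intermediate map: a_z[j] = ||p_z|| * exp(eta_j + w_j * p~_z)
    a = norms[None, :, :] * np.exp(eta[:, None, None] + W[:, None, None] * tilde[None, :, :])
    # Gaussian pooling over locations, on a subsampled output grid
    ys, xs = np.mgrid[0:h, 0:w]
    out_pts = [(i, j) for i in range(0, h, subsample) for j in range(0, w, subsample)]
    out = np.zeros((len(W), len(out_pts)))
    for idx, (ci, cj) in enumerate(out_pts):
        g = np.exp(-((ys - ci) ** 2 + (xs - cj) ** 2) / beta ** 2)
        out[:, idx] = (a * g[None, :, :]).sum(axis=(1, 2))
    return out
```

The subsampled Gaussian pooling is what provides the local shift-invariance of the resulting map.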
4.6 Large-Scale Optimization
One of the challenges we faced in applying CKNs to image retrieval was the lack of scalability of the original model introduced by mairal2014convolutional, which was a proof of concept with no effort towards scalability. A first improvement we made was to simplify the original objective function with changes of variables, resulting in the formulation (12) and leading to simpler and less expensive gradient computations.
The second improvement is to use stochastic optimization instead of the L-BFGS method used by mairal2014convolutional. This allows us to train every layer using a set of one million patches and to conduct learning on all of their possible pairs, which is a regime where stochastic optimization is unavoidable. Unfortunately, applying stochastic gradient descent directly on (12) turned out to be very ineffective due to the poor conditioning of the optimization problem. One solution is to make another change of variables and optimize in a space where the input data is less correlated.
More precisely, we proceed by (i) adding an “intercept” (the constant value 1) to the vectors $x_i$ in $\mathbb{R}^d$, yielding vectors in $\mathbb{R}^{d+1}$; (ii) computing the resulting (uncentered) covariance matrix $\Sigma = \frac{1}{N} \sum_{i=1}^{N} x_i x_i^\top$; (iii) computing the eigenvalue decomposition $\Sigma = U \Delta U^\top$, where $U$ is orthogonal and $\Delta$ is diagonal with non-negative eigenvalues; (iv) computing the preconditioning matrix $P = (\Delta + \lambda I)^{-1/2} U^\top$, where $\lambda$ is an offset that we choose to be the mean value of the eigenvalues. Then, the matrix $P$ may be used as a preconditioner since the covariance of the vectors $P x_i$ is close to the identity. In fact, it is equal to the identity matrix when $\lambda = 0$ and $\Sigma$ is invertible. Then, problem (12) with preconditioning becomes
$$\min_{\hat{w}, \eta} \sum_{i=1}^{N} \left( e^{-\frac{\|x_i - x'_i\|^2}{2\sigma^2}} - \sum_{j=1}^{n} e^{\eta_j + \hat{w}_j^\top P x_i} \, e^{\eta_j + \hat{w}_j^\top P x'_i} \right)^2,$$
obtained with the change of variables $w_j = P^\top \hat{w}_j$. Optimizing with respect to the $\hat{w}_j$'s to obtain a solution $w_j = P^\top \hat{w}_j$ turned out to be the key for fast convergence of the stochastic gradient optimization algorithm.
Note that our effort also consisted in implementing heuristics for automatically selecting the learning rate during optimization without requiring any manual tuning, following in part standard guidelines from bottou2012stochastic. More precisely, we select the initial learning rate among a small set of candidate values, by performing 1K iterations with fixed-size mini-batches and choosing the one that gives the lowest objective, evaluated on a validation dataset. After choosing the learning rate, we keep monitoring the objective on a validation set every 1K iterations, and perform backtracking in case of divergence. The learning rate is also divided by a constant factor every 50K iterations. The total number of iterations is set to 300K. Regarding initialization, weights are randomly initialized according to a standard normal distribution. These heuristics are fixed over all experiments and resulted in a stable parameter-free learning procedure, which we will release in an open-source software package.
4.7 Different Types of CKNs with Different Inputs
We have not discussed yet the initial choice of the map for representing an image . In this paper, we follow mairal2014convolutional and investigate three possible inputs:
CKN-raw: We use the raw RGB values. This captures hue information, which is discriminative for many applications.
CKN-white: It is similar to CKN-raw with the following modification: each time a patch is extracted from the image, it is first centered (we remove its mean color) and then whitened by computing a PCA on the set of extracted patches. The resulting patches are invariant to the mean color of the original patch and mostly respond to local color variations.
CKN-grad: The input simply carries the two-dimensional image gradient computed on grayscale values. The map has two channels, corresponding to the gradient computed along the x-direction and along the y-direction, respectively. These gradients are typically computed by finite differences.
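A minimal sketch of this two-channel input map, using centered finite differences (boundary pixels are left at zero for simplicity):

```python
import numpy as np

def grad_input_map(img):
    """CKN-grad input: a two-channel map holding the x- and y-gradients of
    a grayscale image, computed by centered finite differences."""
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = (img[:, 2:] - img[:, :-2]) / 2.0
    gy[1:-1, :] = (img[2:, :] - img[:-2, :]) / 2.0
    return np.stack([gx, gy])   # shape (2, h, w)
```

On a horizontal intensity ramp, the x-channel is constant and the y-channel is zero, as expected.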
Note that the first layer of CKN-grad typically uses patches of size 1x1, which are in $\mathbb{R}^2$, and which are encoded by the first layer into $n_1$ channels. This setting corresponds exactly to the kernel descriptors introduced by bo2010kernel, who have proposed a simple approximation scheme that does not require any learning. Interestingly, the resulting representation is akin to SIFT descriptors.
Denoting by the gradient components of image at location , the patch is simply the vector in . Then, the norm can be interpreted as the gradient magnitude , and the normalized patch represents a local orientation. In fact, there exists such that . Then, we may use the relation (8) to approximate the Gaussian kernel . We may now approximate the integral by sampling evenly distributed orientations , and we obtain, up to a constant scaling factor,
where . With such an approximation, the -th entry of the map from (13) should be replaced by
This formulation can be interpreted as a soft-binning of gradient orientations in a “histogram” of size at every location . To ensure an adequate distribution in each bin, we choose . After the pooling stage, the representation becomes very close to SIFT descriptors.
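This soft-binning can be sketched using the identity that, for unit orientation vectors, ||u - u_j||^2 = 2 - 2 cos(theta - theta_j), so the Gaussian kernel on orientations becomes exp(alpha (cos(theta - theta_j) - 1)) with alpha = 1/sigma^2. The value of `alpha` below is an arbitrary placeholder, not the paper's choice:

```python
import numpy as np

def soft_orientation_histogram(gx, gy, n_bins=12, alpha=None):
    """Soft-binning sketch for the first CKN-grad layer: every pixel votes
    for n_bins evenly spaced orientations with weight
    exp(alpha * (cos(theta - theta_j) - 1)), scaled by its gradient
    magnitude. alpha plays the role of 1/sigma^2 in the Gaussian kernel
    on unit gradient orientations."""
    if alpha is None:
        alpha = float(n_bins)                      # assumption: sharper bins for more orientations
    mag = np.hypot(gx, gy)                         # gradient magnitude
    theta = np.arctan2(gy, gx)                     # gradient orientation
    centers = 2.0 * np.pi * np.arange(n_bins) / n_bins
    weights = np.exp(alpha * (np.cos(theta[..., None] - centers) - 1.0))
    return mag[..., None] * weights                # shape (h, w, n_bins)
```

Each pixel's response peaks at the bin closest to its gradient orientation and decays smoothly for the neighboring bins, which is the "soft histogram" behavior described above.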
A visualization of all input types is shown in Figure 4. See (mairal2014sparse) for further analysis of image preprocessing.
In this section, we give details on the standard datasets we use to evaluate our method, as well as the protocol we used to create our new dataset.
5.1 Standard datasets
We give details on the commonly used benchmarks for which we report results.
5.1.1 Patch retrieval
The dataset introduced in (mikolajczyk2005comparison) is now standard for benchmarking patch retrieval methods. It consists of a set of 8 scenes viewed under 6 different conditions, with increasing transformation strength. In contrast to (winder2009picking; zagoruyko2015learning), where only DoG patches are available, the mikolajczyk2005comparison dataset allows custom detectors. We extract regions with the Hessian-Affine detector and match the corresponding descriptors with Euclidean nearest-neighbor search. A pair of ellipses is deemed to match if the projection of the first region onto the second image, using the ground-truth homography, overlaps the second ellipse by at least 50%. The performance is measured in terms of mean average precision (mAP).
5.1.2 Image retrieval
We selected three standard image retrieval benchmarks: Holidays, Oxford and the University of Kentucky Benchmark (UKB).
The Holidays dataset (jegou2008hamming) contains 500 different scenes or objects, for which 1,491 views are available; 500 images serve as queries. Following common practice, though in contrast to (babenko2014neural), we use the unrotated version, which allows certain views to display a rotation with respect to their query. While this has a non-negligible impact on performance for dense representations (the authors of (babenko2014neural) report a 3% global drop in mAP), it is of little consequence for our pipeline, which uses rotation-invariant keypoints. Externally learned parameters, such as the k-means clusters for VLAD and the PCA projection matrix, are learned on a subset of random Flickr images. The standard metric is mAP.
The Oxford dataset (philbin2007object) consists of 5,000 images of Oxford landmarks. 11 locations in the city are selected as queries, and 5 views per location are available. The standard benchmarking protocol, which we use, involves cropping the bounding box of the region of interest in the query view before retrieval. Some works, such as (babenko2014neural), forgo this cropping step; such a non-standard protocol yields a boost in performance. For instance, babenko2015aggregating report an improvement of close to 6% with non-cropped queries. mAP is the standard measure.
Containing 10,200 photos, the University of Kentucky Benchmark (UKB) (nister2006scalable) consists of groups of 4 different views of the same object, under radical viewpoint changes. All images are used as queries in turn, and the standard measure is the mean number of true positives among the four first retrieved images (recall@4).
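For concreteness, the UKB measure can be computed as follows. This is a sketch; `ranked_lists` and `labels` are hypothetical names:

```python
import numpy as np

def ukb_score(ranked_lists, labels):
    """UKB metric sketch: ranked_lists[q] lists database indices sorted by
    decreasing similarity to query q (the query itself counts, as on UKB
    where every image is also a query); labels[i] identifies the object.
    Returns the mean number of same-object images among the first 4
    results, so a perfect system scores 4.0."""
    hits = [sum(labels[i] == labels[q] for i in ranked[:4])
            for q, ranked in enumerate(ranked_lists)]
    return float(np.mean(hits))
```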
One of the goals of this work is to establish a link between performance in patch retrieval and performance in image retrieval. Because the way datasets are constructed differs (e.g. Internet-crawled images vs. successive shots with the same camera, range of viewpoint changes, semantic content of the dataset, type of keypoint detector), patch descriptors may perform differently on different datasets. We therefore want a dataset that contains a ground truth both at the patch and the image level, to jointly benchmark the two performances. Inspired by the seminal work of (winder2009picking), we introduce the Rome retrieval dataset, based on 3D reconstructions of landmarks. The Rome16K dataset (li2010location) is a Community-Photo-Collection dataset that consists of 16,179 images downloaded from photo-sharing sites, under the search term “Rome”. Images are partitioned into 66 “bundles”, each one containing a set of viewpoints of a given location in Rome (e.g. “Trevi Fountain”). Within a bundle, consistent camera parameters were automatically computed and are available online (www.cs.cornell.edu/projects/p2f). The set of reconstructed 3D points is also available, but we choose not to use them in favor of our Hessian-Affine keypoints. To determine matching points among images of a same bundle, we use the following procedure: i) we extract Hessian-Affine points in all images; ii) for each pair of images of a bundle, we match the corresponding SIFT descriptors, using Lowe's reverse-neighbor rule, as well as product quantization (jegou2011product) for speed-up; iii) we filter matches, keeping those that satisfy the epipolar constraint up to a fixed pixel tolerance. Pairwise point matches are then greedily aggregated to form larger groups of 2D points viewed from several cameras. Groups are merged only if the reprojection error from the estimated 3D position is below the same pixel threshold.
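The epipolar filtering step can be sketched as follows, assuming a fundamental matrix F derived from the bundle's camera parameters. The point-to-epipolar-line distance criterion and the tolerance value are assumptions for illustration:

```python
import numpy as np

def epipolar_inliers(pts1, pts2, F, tol):
    """Keep putative matches whose point-to-epipolar-line distance is
    below tol pixels. pts1, pts2 are (n, 2) pixel coordinates of matched
    keypoints, F the fundamental matrix from image 1 to image 2."""
    h1 = np.hstack([pts1, np.ones((len(pts1), 1))])   # homogeneous coords
    h2 = np.hstack([pts2, np.ones((len(pts2), 1))])
    lines = h1 @ F.T                                  # epipolar lines in image 2
    num = np.abs((lines * h2).sum(axis=1))            # |x2^T F x1|
    den = np.hypot(lines[:, 0], lines[:, 1])          # line normalization
    return num / den < tol
```

With a pure horizontal camera translation, the constraint reduces to matched points sharing the same row, which gives a simple sanity check.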
To allow safe parameter tuning, we split the set of bundles into a train and a test set, respectively containing 44 and 22 bundles.
5.2.1 Patch retrieval
We design our patch retrieval datasets by randomly sampling, in each of the train and test splits, a set of 3D points for which a minimum number of views is available. The sampling is uniform over the bundles, which means that we take roughly the same number of 3D points from each bundle. We then sample views for each point, using one as a query and the remaining ones as targets. Both datasets therefore contain a fixed number of queries and retrieved elements. We report mean average precision (mAP). An example of patch retrieval classes can be seen in Fig. 5.
5.2.2 Image retrieval
Using the same aforementioned train-test bundle split, we select 1,000 query images and 1,000 target images, evenly distributed over all bundles. Two images are deemed to match if they come from the same bundle, as illustrated in Fig. 6.
In this section, we describe the implementation of our pipelines, and report results on patch and image retrieval benchmarks.
6.1 Implementation details
We provide details on the patch and image retrieval pipelines. Our goal is to evaluate the performance of patch descriptors, and all methods are therefore given the same input patches (computed at Hessian-Affine keypoints), possibly resized to fit the required input size of the method. We also evaluate all methods with the same aggregation procedure (VLAD with 256 centroids). We believe that improvements in feature detection and aggregation are orthogonal to our contribution and would equally benefit all architectures.
We briefly review our image retrieval pipeline.
A popular design choice in image representations, inspired by text categorization methods, is to consider images as sets of local patches, taken at various locations. The choice of these locations is left to the interest point detector, for which multiple alternatives are possible.
In this work, we use the popular “Hessian-Affine” detector of mikolajczyk2004scale. It aims at finding reproducible points, meaning that selected 3D points of a scene should always belong to the set of detected image points when the camera undergoes small changes in settings (e.g. viewpoint, blur, lighting). Because of the “aperture” problem, the set of such points is limited to textured patches, to the exclusion of straight lines, for which precise localization is impossible. This leaves “blobs”, as used for instance in Lowe's Difference-of-Gaussians (DoG) detector (lowe2004distinctive), and corners, as is the case for our Hessian-Affine detector. Specifically, a “cornerness” measure is computed over the whole image, based on the Hessian of the pixel intensities, and the set of points whose cornerness is above a threshold is kept. The detector takes the points at their characteristic scale and estimates an affine-invariant local region. Rotation invariance is achieved by ensuring that the dominant gradient orientation always lies in a given direction. This results in a set of keypoints with locally affine-invariant regions. Fig. 7 shows the various steps for detecting keypoints. We sample each region at a fixed pixel resolution, a value that was found optimal for SIFT on Oxford. Pixels that fall outside the image are set to their nearest neighbor in the image. This strategy greatly increases patch retrieval performance compared to setting them to black, as it does not introduce strong artificial gradients.
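The nearest-neighbor border handling amounts to edge padding, e.g. as in this sketch for odd patch sizes (the actual patch resolution used in the paper is not reproduced here):

```python
import numpy as np

def extract_patch(image, cx, cy, size):
    """Extract a size x size patch centered at pixel (cx, cy); pixels that
    fall outside the image take the value of their nearest image pixel
    (np.pad 'edge' mode), avoiding the strong artificial gradients a black
    border would create. Assumes odd `size`."""
    r = size // 2
    padded = np.pad(image, r, mode="edge")
    # center (cx, cy) maps to (cx + r, cy + r) in padded coordinates
    return padded[cy:cy + size, cx:cx + size]
```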
Note that the choice of a particular interest point detector is arbitrary and our method is not specific to Hessian-Affine locations. To show this, we also experiment with dense patches in Section 6.3.5.
Because of the affine-invariant detectors, and as seen in Fig. 7, for a given 3D point seen in two different images, the resulting patches have small differences (e.g. lighting, blur, small rotation, skew). The goal of patch description, and the focus of this work, is to design a patch representation, i.e. a mapping of the space of fixed-size patches into some Hilbert space, that is robust to these changes.
Stereo vision uses this keypoint representation to establish correspondences between images of the same instance, with 3D reconstruction as an objective. The cost of this operation is quadratic in the number of keypoints, which is prohibitive in image retrieval systems that need to scan through large databases. Instead, we choose to aggregate the local patch descriptors into a fixed-length global image descriptor. For this purpose, we use the popular VLAD representation (jegou2012aggregating). Given a clustering in the form of a Voronoi diagram over the feature space (typically obtained by running k-means on an external set of points), VLAD encodes the set of local descriptors as their total shift with respect to their assigned centroids:
where is the assignment operator, which is 1 if is closer to centroid than to the others, and 0 otherwise.
The final VLAD descriptor is power-normalized with exponent 0.5 (signed square root), as well as ℓ2-normalized.
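Putting the encoding and both normalizations together, a minimal VLAD sketch looks like this. The naive O(nk) nearest-centroid assignment is for clarity; real systems use optimized search:

```python
import numpy as np

def vlad(descriptors, centroids):
    """VLAD sketch: hard-assign each local descriptor (n, d) to its nearest
    of k centroids, accumulate the residuals per centroid, then apply the
    signed square root (power 0.5) and L2 normalization."""
    k, d = centroids.shape
    dists = ((descriptors[:, None, :] - centroids[None]) ** 2).sum(-1)
    assign = dists.argmin(axis=1)                 # nearest-centroid assignment
    v = np.zeros((k, d))
    for j in range(k):
        if np.any(assign == j):
            v[j] = (descriptors[assign == j] - centroids[j]).sum(axis=0)
    v = v.ravel()                                 # fixed length k * d
    v = np.sign(v) * np.sqrt(np.abs(v))           # power normalization
    n = np.linalg.norm(v)
    return v / n if n > 0 else v                  # L2 normalization
```

The output length is k times the local descriptor dimension, which is why the 256-centroid vocabulary used in our experiments multiplies the descriptor size by 256.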
6.1.2 Convolutional Networks training
We use the Caffe package (jia2014caffe) and its provided weights for AlexNet, which were learned according to krizhevsky2012imagenet. Specifically, the network was trained on ILSVRC'12 data for 90 epochs, with a learning rate that was decreased three times prior to termination. Weight decay was fixed to 0.0005, momentum to 0.9 and the dropout rate to 50%. Training used three types of image jittering: random cropping, flipping and color variation.
Following babenko2014neural, we fine-tune AlexNet using images of the Landmarks dataset (http://sites.skoltech.ru/compvision/projects/neuralcodes/). Following their protocol, we strip the network of its last layer and replace it with a 671-dimensional fully-connected layer, initialized with random Gaussian noise (standard deviation 0.01). The other layers are initialized with the original AlexNet-ImageNet weights. We use a fixed learning rate for fine-tuning. Weight decay, momentum and dropout rate are kept at their default values (0.0005, 0.9 and 0.5). We use data augmentation at training time, with the same transformations as in the original paper (crop, flip and color). We decrease the learning rate by a factor of 10 when training saturates (around every 20 epochs). We report a validation accuracy of 59% on the Landmarks dataset, which was confirmed through discussion with the authors. On Holidays, we report a mAP of 77.5 for the sixth layer (against 79.3 in (babenko2014neural)), and 53.6 (against 54.5) on Oxford. Even though slightly below the results of the original paper, fine-tuning still significantly improves on ImageNet weights for retrieval.
For PhilippNet, we used the model provided by the authors. The model is learned on 16K surrogate classes (randomly extracted patches) with 150 representatives each (composite transformations, including crops, color and contrast variations, blur, flips, etc.). We were able to replicate their patch retrieval results on their dataset, as well as on Mikolajczyk et al.'s dataset when using MSER keypoints.
The patch retrieval dataset of RomePatches does not contain enough patches to learn a deep network. We augment it using patches extracted in a similar fashion, grouped in classes that correspond to 3D locations and contain at least 10 examples. We build two such training sets, one with 10K classes, and one with 100K classes. Training is conducted with the default parameters.
As previously described, we use the code released by zagoruyko2015learning. It consists of networks trained on the three distinct datasets of winder2009picking: Liberty, NotreDame and Yosemite. For our image retrieval experiments, we can only use the siamese networks, as the others do not provide a patch representation; these were observed in the original paper to give suboptimal results.
Convolutional Kernel Networks
To train the convolutional kernel networks, we randomly subsample a set of 100K patches from the train split of the Rome dataset. For each layer, we further extract 1M sub-patches of the required size, and feed all possible pairs as input to the CKN. The stochastic gradient optimization is run for 300K iterations with a batch size of 1000, following the procedure described in Section 4.6. Training a convolutional kernel network for a particular set of hyperparameters roughly takes 10 min on a GPU. This stands in contrast to the 2-3 days required by the L-BFGS implementation used in mairal2014convolutional. We show in Fig. 8 a visualization of the first convolutional filters of CKN-raw. The sub-patches for CKN-grad and CKN-white are too small and not suited to viewing.
6.2 Patch retrieval
6.2.1 CKN parametric exploration
The relatively low training time of CKNs (10 min on a recent GPU), together with the fact that training is layer-wise (lower-layer parameters can therefore be reused if only the top layer changes), allows us to test a relatively large number of parameter settings and select the ones that best suit our task. We tested convolution and pooling patch sizes over a small range, and numbers of features in powers of 2 from 128 to 1024. The optimal value of the remaining kernel parameter was found to be the same for all architectures. For the other parameters, we keep one set of optimal values for each input type, described in Table 1.
| Input | Layer 1 | Layer 2 | dim. |
|---|---|---|---|
| CKN-raw | 5x5, 5, 512 | — | 41,472 |
| CKN-white | 3x3, 3, 128 | 2x2, 2, 512 | 32,768 |
| CKN-grad | 1x1, 3, 16 | 4x4, 2, 1024 | 50,176 |
In Figure 9, we explore the impact of the various hyper-parameters of CKN-grad, by tuning one while keeping the others to their optimal values.
6.2.2 Dimensionality reduction
Since the final dimension of the CKN descriptor is prohibitive for most applications of practical value (see Table 1), we investigate dimensionality-reduction techniques. Note that this step is unnecessary for CNNs, whose feature dimension does not exceed 512 (PhilippNet). We only perform unsupervised dimensionality reduction through PCA, and investigate several forms of whitening. Denoting by X the matrix of CKN features computed on a training set, its singular value decomposition writes as X = U S V^T, where U and V are orthogonal matrices and S is diagonal with non-increasing values from upper left to bottom right. For a new matrix of observations Y, the dimensionality-reduction step writes as Y' = Y W, where W is a projection matrix built from V and S.
The three types of whitening we use are: i) no whitening, where W consists of the first columns of V; ii) full whitening, where these columns are additionally divided by the corresponding singular values; and iii) semi-whitening, where they are divided by the element-wise square roots of the singular values.
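A minimal sketch of the three projection variants, built from the SVD of the training features (the actual feature matrices and output dimension used in the paper are not reproduced here):

```python
import numpy as np

def pca_projection(X, dim, mode="semi"):
    """Build a D x dim projection from training features X (n x D) using
    the SVD X = U S V^T. 'none' keeps the raw principal directions,
    'full' divides their columns by the singular values, 'semi' by the
    square roots of the singular values. New observations Y are then
    reduced as Y @ W."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt.T[:, :dim]                 # top principal directions
    if mode == "none":
        return V
    if mode == "full":
        return V / S[:dim]            # full whitening
    if mode == "semi":
        return V / np.sqrt(S[:dim])   # semi-whitening
    raise ValueError(mode)
```

With full whitening, the projected training features are exactly the first left-singular vectors, so their covariance is (a multiple of) the identity.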
Results of the different methods on the RomePatches dataset are displayed in Fig. 10. We observe that semi-whitening works best for CKN-grad and CKN-white, while CKN-raw is slightly improved by full whitening. We keep these methods in the remainder of this article, as well as a final dimension of 1024.
We compare the convolutional architectures on our three patch datasets: RomePatches-train, RomePatches-test and Mikolajczyk et al.'s dataset. Results are given in Table 2. For AlexNet CNNs, we report results for the outputs of all 5 convolutional layers (after ReLU). We note that SIFT is an excellent baseline for these methods, and that CNN architectures that were designed for local invariance perform better than the ones used in AlexNet, as observed by fischer2014descriptor. The results of the PhilippNet on Mikolajczyk et al.'s dataset differ from the ones reported by fischer2014descriptor, as we evaluate on Hessian-Affine keypoints while they use MSER. To obtain a comparable setting, we use their network with an input of 64x64, which corresponds to the coverage of one top neuron, as well as their protocol that slides it over 91x91 patches. We notice that this last step only provides a small increase in performance (2% for patch retrieval and 1% for image retrieval). We observe that PhilippNet outperforms both SIFT and AlexNet, which was the conclusion of fischer2014descriptor; CKNs trained on whitened patches do, however, yield better results.
As the architectures of DeepCompare (zagoruyko2015learning) do not rely on an underlying patch descriptor representation but rather on similarities between pairs of patches, some architectures can only be tested for patch-retrieval. We give in Table 3 the performances of all their architectures on RomePatches.
We note that the only networks that produce descriptors that can be used for image retrieval are the architectures denoted “siam” here. They also perform quite poorly compared to the others. Even the best architectures are still below the SIFT baseline (91.6/87.9).
The networks of zagoruyko2015learning are optimized and tested on different patches (DoG keypoints, 64x64 patches, grayscale), which explains their poor performance when transferred to our RomePatches dataset. The common evaluation protocol on their original dataset is to sample the same number of matching and non-matching pairs, rank them according to the similarity measure, and report the false-positive rate at 95% recall (“FPR@95%”, lower is better). The patches used for this benchmark are available online, but not the actual split used for testing. With our own split of the Liberty dataset, we get the results in Table 4. We note that although our results differ slightly from those of zagoruyko2015learning, they do not change their conclusions.
| Method | FPR@95% |
|---|---|
| Best AlexNet (3) | 13.5 |
| siam-l2 (trained on Notre-Dame) | 14.7 |
| 2ch-2stream (trained on Notre-Dame) | 1.24 |
We note that our CKNs, whose hyperparameters were optimized on RomePatches, perform poorly here. However, re-optimizing the parameters of CKN-grad on a grid leads to a result of 14.4% (“best CKN”). To obtain this result, the number of gradient histogram bins was reduced to 6, and the pooling patch size was increased to 8 on the first layer and 3 on the second. These changes indicate that the DoG keypoints are less invariant to small deformations, and therefore require less rigid descriptors (more pooling, fewer histogram bins). The results are on par with all comparable architectures (siam-l2: 14.7%, AlexNet: 13.5%); the other architectures either use central-surround networks or methods that do not produce patch representations. The best method of zagoruyko2015learning, 2ch-2stream, reports an impressive 1.24% on our split. However, as already mentioned, this architecture does not produce a patch representation and is therefore difficult to scale to large-scale patch and image retrieval applications.
6.2.5 Impact of supervision
We study the impact of the different supervised trainings, between surrogate classes and real ones. Results are given in Table 5.
Figure 11 shows the first convolutional filters of the PhilippNet learned on surrogate classes and on Rome. As can be seen, the plain surrogate version responds to more diverse colors, as it was learned from diverse (ImageNet) input images. Both the Rome 10K and Rome 100K versions exhibit colors that correspond to skies and buildings, and focus more on gradients. It is interesting to note that the network trained with 100K classes seems to have captured finer levels of detail than its 10K counterpart.
We investigate in Fig. 12 the robustness of CKN descriptors to transformations of the input patches, specifically rotations, zooms and translations. We select 100 different images and extract the same patch while jittering the keypoint along the aforementioned transformations. We then plot the average distance between the descriptor of the original patch and that of the transformed one. Note the steps in the scale curve: the keypoint scale is quantized in powers of 1.2 for performance reasons.
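The jitter protocol can be sketched generically as follows; the descriptor function, the transformation and the parameter grid are placeholders to be supplied by the pipeline:

```python
import numpy as np

def robustness_curve(describe, patch, transform, params):
    """Sketch of the Fig. 12 protocol: describe() maps a patch to a
    descriptor, transform(patch, p) re-extracts it under a jittered
    keypoint (rotation angle, scale factor or translation offset p), and
    we record the descriptor distance to the untransformed patch for
    each parameter value."""
    ref = describe(patch)
    return [float(np.linalg.norm(describe(transform(patch, p)) - ref))
            for p in params]
```

In the actual experiment this curve is averaged over 100 patches per transformation type.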
6.3 Image retrieval
We learn a vocabulary of 256 centroids on a related database: for Holidays and UKB we use 5,000 Flickr images, and for Oxford we train on Paris (philbin2008lost). The vocabulary for RomePatches-Train is learned on RomePatches-Test and vice versa. The final VLAD descriptor size is 256 times the local descriptor dimension.
We compare all convolutional approaches as well as the SIFT baseline in the image retrieval settings. Results are summarized in Table 6.
On datasets where color is dominant (e.g. Holidays or UKB), the best individual CKN results are attained by CKN-white, and are further improved by combining the three channels. On images of buildings, gradients still perform best and the addition of color channels is harmful, which explains on the one hand the poor performance of AlexNet, and on the other hand the relatively good performance of PhilippNet, which was explicitly trained to be invariant to colorimetric transformations.
6.3.3 Influence of context
Through experiments with AlexNet-landmarks, we study the impact of context in the local features. Specifically, we test whether fine-tuning on a dataset that shares semantic information with the target retrieval dataset improves performance. We compare the same network architecture – AlexNet – in two settings: when parameters are learned on ImageNet (which involves a varied set of classes) and when they are learned on the Landmarks dataset which solely consists of buildings and places. Results, shown in Table 7, show clear improvement for Oxford, but not for Holidays. We explain this behavior by the fact that the network learns building-specific invariances and that fewer building structures are present in Holidays as opposed to Oxford.
6.3.4 Dimensionality reduction
As observed in previous work (jegou2012whiten), projecting the final VLAD descriptor to a lower dimension using PCA+whitening can lead to lower memory costs, and sometimes to slight increases in performance (e.g. on Holidays (gong2014multi)). We project to 4096-dim descriptors, the same output dimension as babenko2014neural and obtain the results in Table 8. We indeed observe a small improvement on Holidays but a small decrease on Oxford.
6.3.5 Dense keypoints
Our choice of Hessian-Affine keypoints is arbitrary and can be suboptimal for some image retrieval datasets. Indeed, we observe that sampling points on a regular grid at multiple scales can improve results. We learn the CKN parameters and the PCA projection matrix on a dense set of points in Rome, and apply them to image retrieval as before. We use SIFT descriptors as a baseline, extracted at the exact same locations. We observe that for CKN-grad and CKN-white, the previous models lead to suboptimal results. However, increasing the pooling size from 3 (resp. 2 for the second layer) to 4 (resp. 3) leads to superior performance. We attribute this to the fact that descriptors computed at dense points require more invariance than those computed at Hessian-Affine locations. Results on the Holidays dataset are given in Table 9. As observed on Hessian-Affine points, the gradient channel performs much better on Oxford; we therefore only evaluate this channel there. As explained in Section 5, there are two ways to evaluate on the Oxford dataset: the first crops queries, the second does not. While we only consider the first protocol to be valid, it is interesting to investigate the second, as the results in Table 10 tend to indicate that it favors dense keypoints (and therefore the global descriptors of babenko2014neural).
|Dense (same parameters)||70.3||68.5||72.3||76.8||.|
|Dense (changed pooling)||70.3||71.3||72.3||80.8||82.6|
|Hessian-Affine, no crop||45.7||49.0|
|Dense, no crop||51.1||55.4|
6.3.6 Comparison with state of the art
Table 11 compares our approach to recently published results. Approaches based on VLAD with SIFT (arandjelovic2013all; jegou2012aggregating) can be improved significantly by CKN local descriptors (+15% on Holidays). To compare to the state of the art with SIFT on Oxford (arandjelovic2013all), we use the same Hessian-Affine patches extracted with gravity assumption (perd2009efficient). Note that this alone results in a gain.
We also compare with global CNN (babenko2014neural). Our approach outperforms it on Oxford, UKB, and Holidays. For CNN features with sum-pooling encoding (babenko2015aggregating), we report better results on Holidays and UKB, and on Oxford with the same evaluation protocol. Note that their method works better than ours when used without cropping the queries (58.9%).
On Holidays, our approach is slightly below that of gong2014multi, which uses AlexNet descriptors and VLAD pooling on large, densely extracted patches. It is however possible to improve on this result by using the same dimensionality-reduction technique (PCA+whitening), which gives 82.9%, or dense keypoints (82.6%).
| Method | Holidays | UKB | Oxford |
|---|---|---|---|
| Sum-pooling OxfordNet (babenko2015aggregating) | 80.2 | 3.65 | 53.1 |
| Ours | 79.3 (82.9) | 3.76 | 49.8 (56.5*) |
We showed that Convolutional Kernel Networks (CKNs) offer similar and sometimes even better performance than classical Convolutional Neural Networks (CNNs) in the context of patch description, and that the good performance observed in patch retrieval translates into good performance for image retrieval, reaching state-of-the-art results on several standard benchmarks. The main advantages of CKNs over CNNs are their very fast training time and the fact that unsupervised training removes the need for manually labeled examples. It is still unclear whether their success is due to the particular level of invariance they induce, or to their low training time, which allows an efficient search through the space of hyperparameters. We leave this question open and hope to answer it in future work.
This work was partially supported by the projects “Allegro” (ERC), “Titan” (CNRS-Mastodon), “Macaron” (ANR-14-CE23-0003-01), the Moore-Sloan Data Science Environment at NYU and a Xerox Research Center Europe collaboration contract. We wish to thank the authors of fischer2014descriptor, babenko2014neural and gong2014multi for helpful discussions and comments. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research.