I Introduction
Comparing image patches is a key task in a number of computer vision applications such as object identification, photo stitching, and stereo baseline estimation. Patch comparison usually takes place by comparing local descriptors that are robust to changes in rotation, scale, illumination and, to some extent, perspective. Descriptors such as SIFT had to be handcrafted according to the specific task or considered signal. Recent advances in deep learning showed that it is possible to train a neural network to automatically learn and compare local descriptors without necessarily resorting to handcrafted descriptors.
Over the past years, a number of descriptor designs based upon Convolutional Neural Networks (CNN) have been proposed [1]. The work of [2] explores different CNN architectures, showing that the best results are achieved when pairs of patches are jointly encoded and a decision network is trained to learn an appropriate inter-patch distance metric. The authors of [3] address the problem of generating SIFT-like descriptors within a deep learning framework, showing that descriptors generated via deep learning can be used as a drop-in replacement for SIFT descriptors as they retain key properties such as invariance to rotations, illumination and perspective changes. In [4], the specific problem of matching wide-baseline stereo images by comparing local patches using a Siamese CNN is addressed. The authors show that the patch matching accuracy of the network can be greatly enhanced by augmenting the training set of patches with rotation and illumination changes. In [5], an approach based on fusing two complementary and asymmetric descriptors extracted from the convolutional domain is proposed. The authors of [6] address the related yet different problem of learning global hash codes over whole images. Their approach is limited to small images due to complexity constraints. In addition, global descriptors are inherently unsuited for geometric verification, which shows the advantage of approaches based on local descriptors. Finally, in [7] a complete image matching pipeline based on a deep learning framework is presented: whereas such an architecture goes beyond the scope of the present work, it clearly shows the potential of image matching architectures based on deep learning frameworks.

Our goal is to learn patch descriptors that are highly discriminative and, at the same time, rate efficient and computationally lightweight to generate and compare. With respect to these requirements, many of the above designs fall short in one or more aspects. In [3], descriptors are devised as a drop-in replacement for SIFT descriptors, thus they are encoded as real-valued vectors, whereas binary vectors are desirable because of their improved rate efficiency. In [2], the best performance is achieved by jointly encoding pairs of patches, an approach that is not suitable for the common case where the reference patch is available only through its descriptor at query time. Such an approach also requires replicating the decision network learned at training time when deployed in the field, with a significant impact on complexity. The approach of [5] shows state-of-the-art results, however at the expense of duplicating the complexity of the feature extraction pipeline. For such reasons, how to learn and compare patch descriptors that are at the same time discriminative and efficient both in terms of rate and complexity is still an open research issue.

In this work, we introduce the idea of fusing features from the convolutional layers and from the discrete cosine transform (DCT) to learn binary patch descriptors within a deep learning framework. DCT features have been previously considered for feature dimensionality and redundancy reduction in patch matching and face recognition problems [8, 9]. However, to the best of our knowledge, no previous work has considered the problem of fusing features from the convolutional and the DCT domains within a deep learning framework. We propose a framework that is designed around a Siamese network, which allows descriptors to be generated independently from single patches. We learn real-valued compact descriptors by minimizing a loss function of the cosine distance between pairs of descriptors, where the cosine distance happens to be the best-fitting real-valued relaxation of the Hamming distance, enabling straightforward descriptor quantization and subsequent comparison. We experiment with three challenging datasets, showing better performance than state-of-the-art competitors for identical bitrate and comparable performance to more complex approaches despite a lower bitrate.

The rest of this paper is organized as follows. In Sec. II, relevant background literature is surveyed. Sec. III describes the proposed convolutional approach to patch matching via feature fusion in terms of network architecture and training procedures. Sec. IV provides a thorough experimental validation of the framework over three sets of real image patches. Sec. V draws the conclusions and discusses potential directions for further investigations.

II Background
This section provides the relevant background on Siamese neural networks, a class of feedforward artificial neural networks. Siamese networks find application in a number of problems where it is required to learn a similarity or distance function between pairs of equally dimensioned signals, ranging from face verification [10] to real-time object tracking [11]. Siamese networks are composed of (at least) two topologically identical and independent subnetworks which share an identical set of learnable parameters. In most image and video applications, each subnetwork typically includes one or more convolutional layers for extracting features from the spatial domain of the input signal. Optionally, a number of fully connected layers project such features to a typically lower-dimensional space, yielding a vector of features representing the signal provided as input to the subnetwork. The output of each subnetwork can be seen as a (compact) description of the input signal, and is thus often referred to as a descriptor. Finally, a predefined or learnable distance function measures the similarity between the descriptors produced by the two subnetworks. In this context, the network produces as output a measure of the extent to which two input patches are similar or dissimilar. Concerning the patch matching problem addressed by this work, Siamese architectures make it possible to compute each patch descriptor independently, allowing for patch matching with precomputed descriptors. Therefore, in the following, we use a Siamese architecture as the cornerstone of our network, whereas we leave the application of our feature fusion approach to other network architectures for future research.
III Proposed Method
In this section, we first describe a Siamese convolutional network architecture suitable for convolutional and DCT feature fusion. Next, we describe a fully supervised training procedure with the associated loss function. Finally, we briefly describe the descriptor quantization process.
III-A Network Architecture
The architecture in Fig. 1 relies on a Siamese topology where the two subnetworks receive as input a pair of same-size image patches. Each subnetwork is structured as follows. The first part is composed of M convolutional modules, where the k-th module (k = 1, ..., M) is composed of one convolutional layer with a hyperbolic tangent nonlinearity and one 2×2 max-pooling layer. All the convolutional layers are composed of 5×5 filters (kernels), and convolutions are of the wide type with border padding and one-pixel stride. The number of filters in the first module's convolutional layer is set to and doubles at each module, thus the number of feature maps produced as output by each convolutional layer doubles with index k. However, the resolution of each feature map is reduced by a fourfold factor after each max-pooling layer, so the overall number of features generated by each module shrinks by a factor of two after each module. The number of convolutional features produced as output by the k-th module is referred to as in the following. Fig. 1 shows that, for each subnetwork, a parallel branch of the pipeline implements a 2D discrete cosine transform of the input signal. Next, a subset of coefficients from the top-left corner of the transformed patch matrix is selected in zigzag scanning order starting from the DC coefficient. The result is a number of additional DCT features that complement the convolutional features.

Finally, the convolutional features are concatenated with the DCT features, yielding a vector of fused features that represents a projection of the input signal to a dimensional space. Fused features are processed by a first fully connected layer with units (neurons) with hyperbolic tangent activation functions. A second fully connected layer with B units, to which we refer as the bottleneck layer in the following, generates a B-element patch descriptor. Accordingly, the two subnetworks receive as input the pair of image patches and produce as output the corresponding pair of descriptors of B real-valued elements each.

In order to establish the similarity between the two descriptors, denoted here as $y_1$ and $y_2$, we resort to the cosine distance function, which produces an output in the range [-1, 1] and is defined as

$$d_{\cos}(y_1, y_2) = \frac{\langle y_1, y_2 \rangle}{\|y_1\|_2 \, \|y_2\|_2} \qquad (1)$$
Such cosine distance represents the ultimate output of the proposed network. Hence, when a pair of patches is fed into the network, the associated cosine distance between its descriptors is produced. The cosine distance enjoys several useful properties: first, it is continuous and differentiable, thus it allows fully supervised, end-to-end training of the network as discussed in the following; moreover, the affinity of the cosine distance with the Hamming distance enables straightforward descriptor quantization and binary comparison as described in Sec. III-C.
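As an illustration of the similarity function described above, the following is a minimal sketch in plain Python of the L2-normalized inner product between two real-valued descriptors. The function name `cosine_distance` and the choice of raising an error for an all-zero descriptor are assumptions made for this example; they are not taken from the authors' Torch7 implementation.

```python
import math


def cosine_distance(y1, y2):
    """Cosine 'distance' as used in the text: L2-normalized inner product in [-1, 1]."""
    if len(y1) != len(y2):
        raise ValueError("descriptors must have the same length")
    dot = sum(a * b for a, b in zip(y1, y2))
    norm1 = math.sqrt(sum(a * a for a in y1))
    norm2 = math.sqrt(sum(b * b for b in y2))
    if norm1 == 0.0 or norm2 == 0.0:
        # Assumption of this sketch: the value is left undefined for a zero vector.
        raise ValueError("cosine distance is undefined for an all-zero descriptor")
    return dot / (norm1 * norm2)
```

With this convention, identical descriptors score 1, orthogonal descriptors score 0, and opposite descriptors score -1, so larger values indicate more similar patches.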
Finally, let us define the network complexity as the number of learnable parameters in the network. Concerning the architecture shown in Fig. 1, the complexity of the first fully connected layer ( units) dominates the overall network complexity. Thus, the network complexity is approximately equal to , i.e., it increases linearly with the number of fused features.
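The DCT branch of each subnetwork (2D transform followed by zigzag selection of the low-frequency coefficients) can be sketched as follows. This is a plain-Python illustration under stated assumptions: the orthonormal DCT-II scaling, the JPEG-style zigzag direction, and the names `dct2`, `zigzag_indices` and `dct_features` are choices made here and are not confirmed by the text, which does not specify the transform normalization.

```python
import math


def dct2(patch):
    """Orthonormal 2D DCT-II of a square patch given as a list of rows."""
    n = len(patch)

    def alpha(u):
        return math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)

    # 1D basis: basis[u][x] = alpha(u) * cos(pi * (2x + 1) * u / (2n))
    basis = [[alpha(u) * math.cos(math.pi * (2 * x + 1) * u / (2 * n))
              for x in range(n)] for u in range(n)]
    # Separable transform: first along rows index x, then along column index y.
    tmp = [[sum(basis[u][x] * patch[x][y] for x in range(n))
            for y in range(n)] for u in range(n)]
    return [[sum(tmp[u][y] * basis[v][y] for y in range(n))
             for v in range(n)] for u in range(n)]


def zigzag_indices(n):
    """(row, col) indices of an n x n matrix in zigzag order, DC term first."""
    order = []
    for s in range(2 * n - 1):  # s = row + col identifies one anti-diagonal
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        # Alternate the scanning direction on consecutive anti-diagonals.
        order.extend(diag if s % 2 else reversed(diag))
    return order


def dct_features(patch, num_coeffs):
    """First num_coeffs DCT coefficients of the patch in zigzag order."""
    coeffs = dct2(patch)
    return [coeffs[r][c] for r, c in zigzag_indices(len(patch))[:num_coeffs]]
```

For a constant patch, only the DC coefficient is non-zero, which is a quick sanity check for the transform. In the full pipeline these features would still need the mean and standard deviation normalization mentioned in the training section before being concatenated with the convolutional features.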
III-B Training
Let us define the i-th training sample as a pair of identically sized image patches. The two patches are said to be matching (equivalently, the sample is a matching sample) whenever they represent the same image detail, and non-matching otherwise. Let $o_i$ and $t_i$ be the network output and the corresponding target label (i.e., the expected network outcome), respectively. Our ultimate goal is to learn network parameters such that the network generates pairs of discriminative descriptors. However, as we aim at training the network end-to-end with a fully supervised approach, we recast the problem as learning parameters such that the network, given as input the i-th sample, generates an output $o_i$ that is as close as possible to $t_i$. To this end, the choice of the training label is pivotal. Concerning pairs of matching patches, the network is expected to produce similar descriptors whose cosine distance is close to 1, so we impose $t_i = 1$ for matching samples. Considering the comparison of non-matching pairs of patches, we observe that such a process is equivalent to comparing random i.i.d. patches, which in turn is equivalent to measuring the cosine distance between the associated i.i.d. descriptors. Given two zero-mean i.i.d. descriptors, we experimentally observed that their cosine distance has a zero-mean Normal distribution. For all the experiments we ran, we verified that the descriptors generated by our network are identically distributed and have zero mean, therefore we impose $t_i = 0$ for non-matching samples.

With the training labels defined as indicated, the network can be trained via error gradient backpropagation, finding the parameters minimizing a loss function that, for the i-th training sample, is defined as

$$L_i = (t_i - o_i)^2 \qquad (2)$$

That is, the network is trained to minimize a sample classification error function defined as the squared error between the desired network outcome (i.e., the sample label) and the actual one. Regarding the practical aspects of the training procedure, we verified that normalizing the input patches over their own norm increases the robustness and generalization capacity of the network with respect to illumination variations. Also, we follow the common practice of normalizing the input patches with respect to the mean pixel intensity and standard deviation as calculated over the entire training set. We also observed that Spatial Batch Normalization as defined in [12] speeds up the training and boosts performance. Similarly, we normalize the DCT features with respect to the mean and standard deviation values over the entire coefficient set, so that their values share the same distribution as the convolutional features.

III-C Descriptor Quantization
Binary descriptors enable lower bitrates and simple Hamming distance computation. Having verified that the trained network generates zero-mean descriptors, we quantize each element of the B-element descriptors to 1 bit via sign quantization, obtaining B-bit binary descriptors. Our choice of the cosine distance as similarity function is instrumental to comparing the binary descriptors via a simple Hamming distance. In fact, the cosine distance definition in Eq. 1 can be recast as the L2-normalized inner product between the two descriptors. Given the definition of the Hamming distance between two binary descriptors $b_1, b_2 \in \{-1, +1\}^B$

$$d_H(b_1, b_2) = \frac{B - \langle b_1, b_2 \rangle}{2} \qquad (3)$$

where $\langle \cdot, \cdot \rangle$ indicates the inner product and $B$ refers to the bit count of the sequence, it follows that the cosine distance is equivalent to the Hamming distance, up to a scaling factor and an additive term. Therefore, pairs of binary descriptors are simply compared via the normalized Hamming distance
(4) 
which is close to 0 for pairs of matching patches and close to 1 otherwise. Performance is computed as explained in the following section.
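The relation between the cosine distance and the Hamming distance stated in this section can be checked numerically with a short plain-Python sketch. It assumes that sign quantization maps each element to +1 or -1 (with 0 mapped to +1, an arbitrary tie-breaking choice of this example); the function names are hypothetical and the sketch does not reproduce the authors' implementation.

```python
def sign_quantize(descriptor):
    """1-bit quantization: keep only the sign of each element (0 is mapped to +1)."""
    return [1 if v >= 0 else -1 for v in descriptor]


def hamming_distance(b1, b2):
    """Number of positions in which two equally long binary descriptors differ."""
    if len(b1) != len(b2):
        raise ValueError("descriptors must have the same length")
    return sum(1 for a, b in zip(b1, b2) if a != b)


def cosine_from_hamming(b1, b2):
    """For +/-1 vectors of length B: <b1, b2> / B = 1 - 2 * H(b1, b2) / B."""
    return 1.0 - 2.0 * hamming_distance(b1, b2) / len(b1)
```

For ±1 vectors both norms equal the square root of B, so the cosine distance reduces to the inner product divided by B. This is why the binary descriptors can be compared with a bit count alone: the cosine distance is recovered from the Hamming distance by one scaling and one offset.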
IV Experimental Evaluation
We evaluate our proposed architecture for patch matching over three datasets [13] of 64×64 patches extracted from 3D reconstructions of the Liberty statue (L), the Notredame cathedral (N) and the Yosemite mountains (Y), centered upon Difference of Gaussians (DoG) key points with canonical scale and orientation, as shown in Fig. 2. Following the approach of [13], we train the network three times, once for each dataset, and we evaluate its performance on the two other datasets for a total of 6 different training/testing setups per experiment (e.g., the L/N setup indicates training on Liberty and testing on Notredame). For each setup, the network is trained with pairs of matching patches and pairs of non-matching patches, whereas testing is performed over pairs of matching and pairs of non-matching patches. We implemented the proposed architecture using the Torch7 framework and trained our network following the procedure described in Sec. III-B on an NVIDIA K80 GPU. We rely on gradient descent with adaptive gradient optimization as defined in [14], with an initial learning rate set to and over batches of pairs of samples ( matching samples and non-matching samples per batch). The training ends when the error on the testing set has not decreased over the past epochs or after epochs. Consistently with [13], we first compute ROC curves by thresholding the normalized Hamming distance between pairs of B-bit binary descriptors; then, for each setup, we measure the False Positive Rate (FPR) for a True Positive Rate (TPR) set to %. In the following, we refer to the FPR computed for a TPR of % simply as the patch classification error for the sake of brevity.
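The evaluation metric described above (the FPR measured at a fixed TPR on the ROC curve obtained by thresholding a distance) can be sketched in plain Python as follows. The function name `fpr_at_tpr`, the convention that a pair is declared matching when its distance is at most the threshold, and the target values used in the usage example are assumptions of this sketch; the evaluation code of [13] may handle ties and thresholds differently.

```python
import math


def fpr_at_tpr(distances, is_match, target_tpr):
    """False positive rate at the smallest threshold that reaches target_tpr.

    distances: one distance per pair (smaller means more similar).
    is_match:  one boolean per pair (True for matching pairs).
    target_tpr: desired true positive rate, in (0, 1].
    """
    pos = sorted(d for d, m in zip(distances, is_match) if m)
    neg = [d for d, m in zip(distances, is_match) if not m]
    if not pos or not neg:
        raise ValueError("need at least one matching and one non-matching pair")
    # Smallest threshold accepting at least target_tpr of the matching pairs.
    k = max(1, math.ceil(target_tpr * len(pos)))
    threshold = pos[k - 1]
    # Non-matching pairs accepted at that threshold are false positives.
    false_pos = sum(1 for d in neg if d <= threshold)
    return false_pos / len(neg)


# Toy usage with illustrative values only (not data from the paper).
toy_distances = [0.1, 0.2, 0.3, 0.6, 0.25, 0.5, 0.7, 0.9]
toy_labels = [True, True, True, True, False, False, False, False]
toy_error = fpr_at_tpr(toy_distances, toy_labels, 0.75)
```

In the toy example, accepting three of the four matching pairs requires a threshold of 0.3, at which one of the four non-matching pairs is wrongly accepted.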
Preliminarily, we experiment to find the convolutional configuration of the network architecture in Fig. 1 that yields the best baseline performance when bits. Namely, we want to find the best trade-off between the number and resolution of convolutional feature maps. To this end, we first vary the number of convolutional modules (as M increases, the number of feature maps doubles while their resolution is halved horizontally and vertically). Furthermore, we experiment with dropping the max-pooling layer in the M-th convolutional module, which additionally allows us to double the feature map resolution without affecting their count. As this experiment focuses on convolutional features, we deactivate the DCT branch of each subnetwork (i.e., ). Table I shows the performance of six different network configurations (mp rows account for the cases where the max-pooling layer of the M-th module is disabled). and columns indicate the number and resolution of the feature maps respectively, where . We observe that feature maps consistently yield the best results: our hypothesis is that larger feature maps do not convey enough semantic information, whereas smaller feature maps lack the spatial resolution to preserve texture details.
Next, we evaluate the entire network performance when DCT features are fused with convolutional features while accounting for network complexity. Input patch resolution is pixels, so the DCT of each input patch can be represented as a vector of real-valued coefficients. To keep the complexity low, we consider only the first lower-frequency coefficients selected as described in Sec. III-A. The additional network complexity due to the introduction of the DCT features, defined as , is negligible as it takes the values of 3.42% and 1.71% for the and cases, respectively. Table II shows the results of the experiment: the second column reports the actual number of convolutional features compared to the number of DCT features . As a general trend, we see that fusing DCT and convolutional features always improves the performance. Most importantly, the configuration with convolutional modules ( fused features) now yields better performance than the configuration ( fused features), even though the overall network complexity is reduced nearly by a twofold factor (about 9M parameters versus 17M parameters). The experiment shows that fusing convolutional features with transformed-domain features further improves performance while keeping the network complexity under control.
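Table III: Pairs of patches misclassified by both the DCT-only and the convolutional-only configurations.

                              N/L     N/Y
Common False Negatives [%]    29.0    28.5
Common False Positives [%]    16.5    14.7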
We now attempt to understand why feature fusion boosts performance. Without loss of generality, we repeat the previous experiment over a smaller training set of training pairs extracted from the Notredame dataset (% matching, % non-matching), testing over pairs of patches from the Liberty (i.e., N/L setup) and pairs from the Yosemite (i.e., N/Y setup) datasets. We aim at assessing the effect of convolutional and DCT features separately, on an equal number of features basis (i.e., ). Therefore, we first experiment with a configuration of our architecture in Fig. 1 where all the features generated by the DCT are provided as input to the fully connected layers without zigzag selection, whereas the convolutional features are dropped altogether (i.e., ). Next, we experiment with dropping all the DCT features and considering instead a convolutional architecture with modules (i.e., ). Table III reports the percentage of pairs of patches misclassified in both scenarios, quantified as the intersection of the two sets of misclassified pairs over their union. We break down such erroneously classified pairs of patches in terms of common false negatives and common false positives for a threshold on the normalized Hamming distance computed over the quantized descriptors equal to . The ratio of pairs of patches that are misclassified in both scenarios is significantly lower than %. This experiment suggests that convolutional and DCT features convey complementary information, thus explaining the increased ability to discriminate between matching and non-matching pairs of patches we observed when fusing convolutional and DCT features. Fig. 3 shows the top-5 misclassified non-matching samples (corresponding to a % of common false positives in Table III). Only the second column in the left image (convolutional features only) is also found as the third column in the right image (DCT features only), confirming that different types of features lead to complementary classification errors.
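The overlap measure used in this analysis, i.e., the intersection of the two sets of misclassified pairs over their union, can be sketched in plain Python as below. The function name `common_error_ratio` and the use of integer pair identifiers are assumptions of this example; how the authors index the test pairs is not stated in the text.

```python
def common_error_ratio(errors_a, errors_b):
    """Intersection over union of two collections of misclassified pair identifiers."""
    a, b = set(errors_a), set(errors_b)
    union = a | b
    if not union:
        # Convention of this sketch: no errors in either scenario means no overlap.
        return 0.0
    return len(a & b) / len(union)


# Toy usage with made-up identifiers: two of six distinct errors are shared.
toy_ratio = common_error_ratio([1, 2, 3, 4], [3, 4, 5, 6])
```

A low ratio means that most errors are made by only one of the two configurations, which is the behaviour one would expect from features that carry complementary information.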
Next, we experiment with varying the descriptor bitrate beyond the bits considered in our experiments so far. We experiment with the configuration with feature fusion (), which provided the best performance-complexity trade-off in Table II, and its counterpart with convolutional features only (). As expected, Fig. 4 shows that the patch classification error decreases when B increases. Most importantly, Fig. 4 shows that the error decreases faster when fusing features, i.e., feature fusion improves the performance-bitrate trade-off of our framework. Fig. 4 includes a third line that corresponds to the configuration where non-quantized descriptors are considered in place of binary descriptors. For , the performance loss due to descriptor quantization is lower than 1% and further decreases rapidly as B increases. Hence, sign quantization maps the descriptor space into the Hamming space while largely preserving the relative distances between vectors, and thus vector similarity.
Finally, Table IV compares our framework with several state-of-the-art approaches in binary patch matching. Our approach outperforms all the competitors on an equal bitrate basis. Our approach is outperformed on the N/L, Y/L, and N/Y setups only by DeepCD [5], which however relies on an aggregated descriptor rate of bits that largely exceeds our maximum considered bitrate of bits.
V Conclusions and Future Work
In this work, we proposed to fuse features from the convolutional layers with features from the discrete cosine transform in a deep neural network for binary patch matching. Qualitative experiments suggest that different types of features are complementary in discriminating patches. Quantitative experiments over three challenging datasets confirm that our feature fusion approach outperforms several existing state-of-the-art techniques based on convolutional features only. A careful design of the network topology and training procedure allowed us to capture distinctive features within the patches, yielding a competitive architecture whose performance is further improved by feature fusion. We leave for future investigation experimenting with different transform functions and evaluating our framework within a complete image matching pipeline.
References
[1] L. Zheng, Y. Yang, and Q. Tian, “SIFT meets CNN: A decade survey of instance retrieval,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[2] S. Zagoruyko and N. Komodakis, “Learning to compare image patches via convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4353–4361.
[3] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer, “Discriminative learning of deep convolutional feature point descriptors,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 118–126.
[4] J. Zbontar and Y. LeCun, “Stereo matching by training a convolutional neural network to compare image patches,” Journal of Machine Learning Research, vol. 17, no. 132, p. 2, 2016.
[5] T.-Y. Yang, J.-H. Hsu, Y.-Y. Lin, and Y.-Y. Chuang, “DeepCD: Learning deep complementary descriptors for patch representations,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3314–3322.
[6] H. Zhu, M. Long, J. Wang, and Y. Cao, “Deep hashing network for efficient similarity retrieval,” in AAAI, 2016, pp. 2415–2421.
[7] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua, “LIFT: Learned invariant feature transform,” in European Conference on Computer Vision. Springer, 2016, pp. 467–483.
[8] G. Sorwar, A. Abraham, and L. S. Dooley, “Texture classification based on DCT and soft computing,” in Fuzzy Systems, 2001. The 10th IEEE International Conference on, vol. 2. IEEE, 2001, pp. 545–548.
[9] Z. Pan, A. G. Rust, and H. Bolouri, “Image redundancy reduction for neural network classification using discrete cosine transforms,” in Neural Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on, vol. 3. IEEE, 2000, pp. 149–154.
[10] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric discriminatively, with application to face verification,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp. 539–546.
[11] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr, “Fully-convolutional Siamese networks for object tracking,” in European Conference on Computer Vision. Springer, 2016, pp. 850–865.
[12] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning, 2015, pp. 448–456.
[13] M. Brown, G. Hua, and S. Winder, “Discriminative learning of local image descriptors,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 43–57, 2011.
[14] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011.
[15] T. Trzcinski, M. Christoudias, P. Fua, and V. Lepetit, “Boosting binary keypoint descriptors,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2874–2881.
[16] Z. Liu, Z. Li, J. Zhang, and L. Liu, “Euclidean and Hamming embedding for image patch description with convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016, pp. 72–78.
[17] K. Simonyan, A. Vedaldi, and A. Zisserman, “Learning local feature descriptors using convex optimisation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 8, pp. 1573–1585, 2014.