1 Introduction
The need to model and compute similarity between objects is central to many applications, ranging from medical imaging to biometric security. In various problems in different fields we need to compare objects as different as functions, images, geometric shapes, probability distributions, or text documents. Each such problem has its own notion of data similarity.
A particularly challenging case of similarity arises in applications dealing with multimodal data, which have different representations, dimensionalities, and structures. Data of this kind are encountered prominently in medical imaging (e.g., fusion of different imaging modalities like PET and CT) [5] and multimedia retrieval (e.g., querying image databases by text keywords) [2]. Such data are as incomparable as apples and oranges by means of standard metrics and require the notion of multimodal similarity.
While such multimodal similarity is difficult to model, in many cases it is easy to learn from examples. For instance, in Internet vision applications we can easily obtain multiple examples of visual objects together with a binary similarity function telling whether two objects are similar or not. Learning and representing such similarities in a convenient way is a major challenge.
A particular setting of the similarity representation problem is similarity-sensitive hashing [7], which has attracted significant attention in the computer vision and pattern recognition communities. In [4], we extended the boosting-based similarity-sensitive hashing (SSH) method to the multimodal setting (referred to as cross-modality SSH or CM-SSH). This is, to the best of our knowledge, the first and only multimodal similarity-preserving hashing algorithm in the literature. The purpose of this paper is to develop a different, simpler, and more efficient multimodal hashing algorithm. The rest of the paper is organized as follows. In Section 2, we formulate the problem of multimodal hashing. In Section 3, we overview the CM-SSH algorithm. In Section 4, we propose our new method (cross-modality diff-hash or CM-DIF), and in Section 5 we discuss its extension (multimodal kernel diff-hash or MM-kDIF) using kernelization. Section 6 shows some experimental results.
2 Background
Let $X \subseteq \mathbb{R}^{n_1}$ and $Y \subseteq \mathbb{R}^{n_2}$ be two spaces representing data belonging to different modalities (e.g., $X$ are images and $Y$ are text descriptions). Note that even though we assume that the data can be represented in Euclidean spaces, the similarity of the data is not necessarily Euclidean and in general can be described by some metrics $d_X : X \times X \to \mathbb{R}_+$ and $d_Y : Y \times Y \to \mathbb{R}_+$, to which we refer as intramodal dissimilarities. Furthermore, we assume that there exists some intermodal dissimilarity $d_{XY} : X \times Y \to \mathbb{R}_+$ quantifying the "distance" between points in different modalities. The ensemble of intra- and intermodal structures is not necessarily a metric in the strict sense. In order to deal with these structures in a more convenient way, we try to represent them in a common metric space.
The broader problem of multimodal hashing is to represent the data from different modalities in a common space $\mathbb{H}^m = \{\pm 1\}^m$ of $m$-dimensional binary vectors with the Hamming metric $d_{\mathbb{H}^m}(h, h') = \frac{m}{2} - \frac{1}{2}\sum_{i=1}^{m} h_i h'_i$ by means of two embeddings, $\xi : X \to \mathbb{H}^m$ and $\eta : Y \to \mathbb{H}^m$, mapping similar points as close as possible to each other and dissimilar points as distant as possible from each other, such that $d_{\mathbb{H}^m} \circ (\xi \times \xi) \approx d_X$, $d_{\mathbb{H}^m} \circ (\eta \times \eta) \approx d_Y$, and $d_{\mathbb{H}^m} \circ (\xi \times \eta) \approx d_{XY}$. In a sense, the embeddings act as a metric coupling, trying to construct a single metric that preserves the intra- and intermodal similarities. A simplified setting of the multimodal hashing problem is cross-modality hashing, in which only the intermodal dissimilarity $d_{XY}$ is taken into consideration and $d_X, d_Y$ are ignored.
For simplicity, in the following discussion we assume the intermodal dissimilarity to be binary, $d_{XY} : X \times Y \to \{0, 1\}$, i.e., a pair of points can be either similar or dissimilar. This dissimilarity is usually unknown and hard to model; however, it should be possible to sample it on some subset of the data. This sample can be represented as a set of similar pairs of points (positives), $\mathcal{P} = \{(x, y) : d_{XY}(x, y) = 0\}$, and a set of dissimilar pairs of points (negatives), $\mathcal{N} = \{(x, y) : d_{XY}(x, y) = 1\}$.
The problem of cross-modality hashing thus boils down to finding two embeddings $\xi$ and $\eta$ such that $d_{\mathbb{H}^m}(\xi(x), \eta(y)) \approx d_{XY}(x, y)$. Alternatively, this can be expressed as having $\Pr(\xi(x) \neq \eta(y) \mid \mathcal{P}) \approx 0$ (i.e., the hash has high collision probability on the set of positives) and $\Pr(\xi(x) = \eta(y) \mid \mathcal{N}) \approx 0$. The former probability can be interpreted as the false negative rate (FNR) and the latter as the false positive rate (FPR).
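To make the FNR and FPR definitions concrete, here is a minimal numerical sketch (the one-bit hashes `xi` and `eta` and the toy pairs are hypothetical, purely for illustration): a positive pair whose hashes differ counts as a false negative, and a negative pair whose hashes collide counts as a false positive.

```python
import numpy as np

def hamming(u, v):
    """Hamming distance between two ±1 vectors."""
    return int(np.sum(u != v))

def fnr_fpr(xi, eta, positives, negatives):
    """Empirical rates: a positive pair whose hashes differ is a false
    negative; a negative pair whose hashes collide is a false positive."""
    fn = sum(hamming(xi(x), eta(y)) > 0 for x, y in positives)
    fp = sum(hamming(xi(x), eta(y)) == 0 for x, y in negatives)
    return fn / len(positives), fp / len(negatives)

# Hypothetical one-bit hashes on scalar "modalities": both simply
# threshold at zero, so a pair collides iff the two scalars share a sign.
xi = lambda x: np.sign(np.atleast_1d(x))
eta = lambda y: np.sign(np.atleast_1d(y))

positives = [(1.0, 2.0), (-1.0, -0.5), (0.5, -0.2)]  # last pair splits
negatives = [(1.0, -1.0), (2.0, 3.0)]                # last pair collides
fnr, fpr = fnr_fpr(xi, eta, positives, negatives)    # 1/3 and 1/2
```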
3 Cross-modality similarity-sensitive hashing (CM-SSH)
To further simplify the problem, consider embeddings given in parametric form as $\xi(x) = \mathrm{sign}(\mathbf{P}x + \mathbf{a})$ and $\eta(y) = \mathrm{sign}(\mathbf{Q}y + \mathbf{b})$ [7, 4]. Here, $\mathbf{P}$ and $\mathbf{Q}$ are projection matrices of size $m \times n_1$ and $m \times n_2$, respectively, and $\mathbf{a}, \mathbf{b}$ are threshold vectors of size $m \times 1$.
In [4], we introduced the cross-modality similarity-sensitive hashing (CM-SSH) method, which is, to the best of our knowledge, the first and only multimodal hashing algorithm existing to date. The idea closely follows the similarity-sensitive hashing (SSH) method [7], considering the hash construction as boosted binary classification, where each hash dimension acts as a weak binary classifier. For each dimension $i$, AdaBoost is used to minimize the following loss function
(1) $\displaystyle \min_{\mathbf{p}_i, a_i, \mathbf{q}_i, b_i} \; -\sum_{k} w_i(k)\, s_k\, \mathrm{sign}(\mathbf{p}_i^{\mathrm{T}} x_k + a_i)\, \mathrm{sign}(\mathbf{q}_i^{\mathrm{T}} y_k + b_i),$

where $s_k \in \{\pm 1\}$ is the binary intermodal similarity of the pair $(x_k, y_k)$ ($s_k = +1$ for positives and $s_k = -1$ for negatives) and $w_i(k)$ is the AdaBoost weight for pair $k$ at the $i$th iteration. Since the minimization problem (1) is difficult, it is relaxed in the following way [4]: first, removing the sign nonlinearity and setting $a_i = b_i = 0$, find the projection vectors $\mathbf{p}_i, \mathbf{q}_i$; then, fixing the projections, find the thresholds $a_i, b_i$.
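The relaxed projection step of a single boosting iteration can be sketched as follows (a sketch of the relaxation described above, not the authors' implementation; sizes and data are illustrative): with the sign nonlinearity removed and zero thresholds, maximizing the weighted correlation $\sum_k w_i(k)\, s_k\, (\mathbf{p}^{\mathrm{T}} x_k)(\mathbf{q}^{\mathrm{T}} y_k)$ over unit vectors reduces to a rank-one problem on a weighted cross-covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(1)

def cmssh_projection_step(X, Y, s, w):
    """Relaxed projection step of one boosting iteration: maximizing
    sum_k w_k s_k (p^T x_k)(q^T y_k) = p^T C q over unit vectors p, q,
    where C = sum_k w_k s_k x_k y_k^T, is solved by the top singular
    vector pair of C."""
    C = (X * (w * s)[:, None]).T @ Y   # n1 x n2 weighted cross-covariance
    U, S, Vt = np.linalg.svd(C)
    return U[:, 0], Vt[0, :]           # projection vectors p, q

# Toy data: 5 labeled pairs in R^3 x R^2 with uniform AdaBoost weights
X = rng.standard_normal((5, 3))
Y = rng.standard_normal((5, 2))
s = np.array([1, 1, -1, 1, -1])   # +1 positives, -1 negatives
w = np.full(5, 0.2)
p, q = cmssh_projection_step(X, Y, s, w)
```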
The disadvantages of the boosting-based CM-SSH are, first, its high computational complexity and, second, its tendency to produce unnecessarily long hashes (the second problem can be partially resolved by using sequential probability testing [1], which creates hashes of minimum expected length).
4 Cross-modality diff-hash (CM-DIF)
In [8], we proposed a different and simpler approach (dubbed diff-hash) to create similarity-sensitive hash functions in the unimodal setting. We adopt similar ideas here to develop multimodal similarity-sensitive hashing algorithms.
The optimal cross-modality hashing can be found by minimizing the loss

(2) $L(\xi, \eta) = \alpha\, \mathbb{E}\{ d_{\mathbb{H}^m}(\xi(x), \eta(y)) \mid \mathcal{P} \} - \mathbb{E}\{ d_{\mathbb{H}^m}(\xi(x), \eta(y)) \mid \mathcal{N} \}$

with respect to the embedding functions $\xi, \eta$. Using $d_{\mathbb{H}^m}(h, h') = \frac{m}{2} - \frac{1}{2} h^{\mathrm{T}} h'$, this is, up to constants, equivalent to minimizing the correlations

(3) $L'(\mathbf{P}, \mathbf{Q}, \mathbf{a}, \mathbf{b}) = \mathbb{E}\{ \mathrm{sign}(\mathbf{P}x + \mathbf{a})^{\mathrm{T}}\, \mathrm{sign}(\mathbf{Q}y + \mathbf{b}) \mid \mathcal{N} \} - \alpha\, \mathbb{E}\{ \mathrm{sign}(\mathbf{P}x + \mathbf{a})^{\mathrm{T}}\, \mathrm{sign}(\mathbf{Q}y + \mathbf{b}) \mid \mathcal{P} \}$

w.r.t. the projection matrices $\mathbf{P}, \mathbf{Q}$ and threshold vectors $\mathbf{a}, \mathbf{b}$. The first and second terms in (3) can be thought of as the FPR and FNR, respectively. The parameter $\alpha \geq 0$ controls the tradeoff between FPR and FNR. The limit case $\alpha \to \infty$ effectively considers only the positive pairs, ignoring the negative set.
Problem (3) is a highly nonconvex, nonlinear optimization problem that is difficult to solve straightforwardly. Similarly to [8, 4], we simplify the problem in the following way. First, we ignore the thresholds and solve a simplified problem without the sign nonlinearity for the projection matrices,

$(\hat{\mathbf{P}}, \hat{\mathbf{Q}}) = \arg\min_{\mathbf{P}, \mathbf{Q}} \; \mathbb{E}\{ (\mathbf{P}x)^{\mathrm{T}} \mathbf{Q}y \mid \mathcal{N} \} - \alpha\, \mathbb{E}\{ (\mathbf{P}x)^{\mathrm{T}} \mathbf{Q}y \mid \mathcal{P} \}.$

Second, fixing the projections, we find the optimal thresholds,

$(\hat{\mathbf{a}}, \hat{\mathbf{b}}) = \arg\min_{\mathbf{a}, \mathbf{b}} \; L'(\hat{\mathbf{P}}, \hat{\mathbf{Q}}, \mathbf{a}, \mathbf{b}).$

We detail each step in Sections 4.1–4.2. The whole method is summarized in Algorithm 1.
4.1 Projection computation
Dropping the sign function and the offsets, the loss function (3) becomes

(4) $\hat{L}(\mathbf{P}, \mathbf{Q}) = \mathrm{tr}\big(\mathbf{P} \boldsymbol{\Sigma}_{\mathcal{N}} \mathbf{Q}^{\mathrm{T}}\big) - \alpha\, \mathrm{tr}\big(\mathbf{P} \boldsymbol{\Sigma}_{\mathcal{P}} \mathbf{Q}^{\mathrm{T}}\big) = \mathrm{tr}\big(\mathbf{P} \boldsymbol{\Sigma}_{\mathcal{D}} \mathbf{Q}^{\mathrm{T}}\big),$

where $\boldsymbol{\Sigma}_{\mathcal{P}} = \mathbb{E}\{ x y^{\mathrm{T}} \mid \mathcal{P} \}$ and $\boldsymbol{\Sigma}_{\mathcal{N}} = \mathbb{E}\{ x y^{\mathrm{T}} \mid \mathcal{N} \}$ denote the covariance matrices of the positive and negative multimodal data, respectively, and $\boldsymbol{\Sigma}_{\mathcal{D}} = \boldsymbol{\Sigma}_{\mathcal{N}} - \alpha \boldsymbol{\Sigma}_{\mathcal{P}}$ is the weighted difference of these covariances. The name of the algorithm, cross-modality diff-hash (CM-DIF), refers in fact to this covariance difference matrix. Note that in order to avoid a trivial solution, we must constrain the projection matrices to be unitary, i.e., $\mathbf{P}\mathbf{P}^{\mathrm{T}} = \mathbf{I}$ and $\mathbf{Q}\mathbf{Q}^{\mathrm{T}} = \mathbf{I}$.

The difference of covariance matrices has a singular value decomposition of the form $\boldsymbol{\Sigma}_{\mathcal{D}} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{V}^{\mathrm{T}}$, where $\mathbf{U}$ and $\mathbf{V}$ are unitary matrices of singular vectors of size $n_1 \times n_1$ and $n_2 \times n_2$, respectively ($\mathbf{U}\mathbf{U}^{\mathrm{T}} = \mathbf{I}$, $\mathbf{V}\mathbf{V}^{\mathrm{T}} = \mathbf{I}$), and $\boldsymbol{\Lambda}$ is a diagonal matrix of singular values of size $n_1 \times n_2$. It can be easily shown that the loss is minimized by setting the projection matrices to the $m$ smallest left and right singular vectors of the matrix $\boldsymbol{\Sigma}_{\mathcal{D}}$, respectively: $\mathbf{P} = \mathbf{U}_m^{\mathrm{T}}$ and $\mathbf{Q} = \mathbf{V}_m^{\mathrm{T}}$, where $\mathbf{U}_m$ and $\mathbf{V}_m$ contain the $m$ singular vectors corresponding to the smallest singular values. From this result it also follows that the problem is separable, and each dimension can be treated independently.
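The closed-form solution above can be sketched in a few lines (a sketch with illustrative sizes; empirical averages stand in for the expectations defining the cross-covariances):

```python
import numpy as np

rng = np.random.default_rng(2)

def cmdif_projections(Xp, Yp, Xn, Yn, m, alpha=1.0):
    """CM-DIF projection step: take the m singular vector pairs of the
    covariance difference matrix that correspond to the smallest
    singular values."""
    Sigma_P = Xp.T @ Yp / len(Xp)        # positive cross-covariance, n1 x n2
    Sigma_N = Xn.T @ Yn / len(Xn)        # negative cross-covariance, n1 x n2
    Sigma_D = Sigma_N - alpha * Sigma_P  # weighted covariance difference
    U, S, Vt = np.linalg.svd(Sigma_D)    # singular values in decreasing order
    idx = np.argsort(S)[:m]              # indices of the m smallest ones
    return U[:, idx].T, Vt[idx, :]       # P (m x n1), Q (m x n2)

# Illustrative sizes: 50 positive and 60 negative pairs, n1 = 4, n2 = 3
Xp, Yp = rng.standard_normal((50, 4)), rng.standard_normal((50, 3))
Xn, Yn = rng.standard_normal((60, 4)), rng.standard_normal((60, 3))
P, Q = cmdif_projections(Xp, Yp, Xn, Yn, m=2)
```

The returned rows are orthonormal by construction (they are columns of the unitary SVD factors), so the constraints $\mathbf{P}\mathbf{P}^{\mathrm{T}} = \mathbf{I}$ and $\mathbf{Q}\mathbf{Q}^{\mathrm{T}} = \mathbf{I}$ hold automatically.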
4.2 Threshold selection
Having the projection matrices fixed, the loss function (3) can be written as

(5) $L'(\mathbf{a}, \mathbf{b}) = \sum_{i=1}^{m} \Big( \mathbb{E}\{ \mathrm{sign}(\mathbf{p}_i^{\mathrm{T}} x + a_i)\, \mathrm{sign}(\mathbf{q}_i^{\mathrm{T}} y + b_i) \mid \mathcal{N} \} - \alpha\, \mathbb{E}\{ \mathrm{sign}(\mathbf{p}_i^{\mathrm{T}} x + a_i)\, \mathrm{sign}(\mathbf{q}_i^{\mathrm{T}} y + b_i) \mid \mathcal{P} \} \Big),$

where $\mathbf{p}_i^{\mathrm{T}}$ and $\mathbf{q}_i^{\mathrm{T}}$ denote the $i$th rows of $\mathbf{P}$ and $\mathbf{Q}$. The problem is separable and can be solved independently in each dimension $i = 1, \ldots, m$. We express the false positive and false negative rates as functions of the thresholds as

(6) $\mathrm{FPR}_i(a_i, b_i) = \Pr\big( \mathrm{sign}(\mathbf{p}_i^{\mathrm{T}} x + a_i) = \mathrm{sign}(\mathbf{q}_i^{\mathrm{T}} y + b_i) \mid \mathcal{N} \big)$

and

(7) $\mathrm{FNR}_i(a_i, b_i) = \Pr\big( \mathrm{sign}(\mathbf{p}_i^{\mathrm{T}} x + a_i) \neq \mathrm{sign}(\mathbf{q}_i^{\mathrm{T}} y + b_i) \mid \mathcal{P} \big).$
The above probabilities can be estimated from histograms (cumulative distributions) of $\mathbf{p}_i^{\mathrm{T}} x$ and $\mathbf{q}_i^{\mathrm{T}} y$ on the positive and negative sets. Optimal thresholds

(8) $(\hat{a}_i, \hat{b}_i) = \arg\min_{a_i, b_i} \; \alpha\, \mathrm{FNR}_i(a_i, b_i) + \mathrm{FPR}_i(a_i, b_i)$

are obtained by means of exhaustive search. To reduce the complexity of this search, we define a set of grids on the threshold parameter space.
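The per-dimension search can be sketched as follows (a sketch with a hypothetical uniform grid; `zp, wp` denote the projections $\mathbf{p}_i^{\mathrm{T}}x$ and $\mathbf{q}_i^{\mathrm{T}}y$ on the positive pairs, `zn, wn` on the negative pairs):

```python
import numpy as np

rng = np.random.default_rng(3)

def best_thresholds(zp, wp, zn, wn, alpha=1.0, grid=16):
    """Exhaustive search for one hash dimension: scan a uniform grid of
    threshold pairs (a, b) and keep the pair minimizing
    alpha * FNR + FPR, with the rates estimated empirically."""
    a_grid = np.linspace(min(zp.min(), zn.min()), max(zp.max(), zn.max()), grid)
    b_grid = np.linspace(min(wp.min(), wn.min()), max(wp.max(), wn.max()), grid)
    best_cost, a_best, b_best = np.inf, 0.0, 0.0
    for a in a_grid:
        for b in b_grid:
            fnr = np.mean(np.sign(zp + a) != np.sign(wp + b))  # positives that split
            fpr = np.mean(np.sign(zn + a) == np.sign(wn + b))  # negatives that collide
            cost = alpha * fnr + fpr
            if cost < best_cost:
                best_cost, a_best, b_best = cost, a, b
    return a_best, b_best

# Toy projected data: positive pairs are strongly correlated,
# negative pairs are independent
zp = rng.standard_normal(200)
wp = zp + 0.1 * rng.standard_normal(200)
zn = rng.standard_normal(200)
wn = rng.standard_normal(200)
a_opt, b_opt = best_thresholds(zp, wp, zn, wn)
```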
4.3 Hash function application
Once the projections and thresholds are computed, given new data points $x \in X$ and $y \in Y$, we construct the corresponding $m$-dimensional binary hash vectors as $\xi(x) = \mathrm{sign}(\mathbf{P}x + \mathbf{a})$ and $\eta(y) = \mathrm{sign}(\mathbf{Q}y + \mathbf{b})$.
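Applying the learned hash to new data is a single affine map followed by a sign; the parameter values below are hypothetical placeholders, not learned ones:

```python
import numpy as np

def hash_points(P, a, X):
    """Apply the learned affine hash: each row of X is mapped to the
    m-dimensional binary vector sign(P x + a). The second modality is
    hashed the same way with (Q, b)."""
    return np.sign(X @ P.T + a)

# Hypothetical parameters for n1 = 3, m = 2 (placeholders only)
P = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, -1.0]])
a = np.array([0.0, 0.5])
X = np.array([[2.0, 1.0, 0.0],
              [-1.0, 0.0, 2.0]])
H = hash_points(P, a, X)  # one ±1 hash vector per row
```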
5 Multimodal kernel diff-hash (MM-kDIF)
An obvious disadvantage of diff-hash (and of spectral methods in general) compared to AdaBoost-based methods is that it must be dimensionality-reducing: since we compute the projections $\mathbf{P}$ and $\mathbf{Q}$ as singular vectors of a covariance matrix of size $n_1 \times n_2$, the dimensionality of the embedding space must satisfy $m \leq \min(n_1, n_2)$. In some cases, such a dimensionality may be too low and would not allow the data to be separated correctly. A second disadvantage, of the cross-modality hashing problem in general, is that it considers only the intermodal similarity $d_{XY}$, ignoring the intramodal similarities $d_X$ and $d_Y$.
A standard way to cope with the first problem is the kernel trick [6], which transforms the data into some feature space that is never dealt with explicitly (only inner products in this space, referred to as the kernel, are required). A kernel version of the unimodal diff-hash was described in [3]. Here, we show that the use of kernels also allows incorporating intramodal similarities into the problem.
Since the problem is separable (as we have seen, the projection in each dimension corresponds to a singular vector of the covariance difference matrix), we consider for simplicity one-dimensional projections.
The whole method is summarized in Algorithm 2. Since it considers (though implicitly) the intramodal dissimilarities in addition to the intermodal dissimilarity, we refer to it as multimodal kernel diff-hash (MM-kDIF).
5.1 Projection computation
Let $k_X : X \times X \to \mathbb{R}$ be a positive semidefinite kernel, and let $\varphi : X \to \mathcal{H}$ be the associated map into some feature space, which we represent here as a Hilbert space $\mathcal{H}$ (possibly of infinite dimension) with an inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$. It satisfies $k_X(x, x') = \langle \varphi(x), \varphi(x') \rangle_{\mathcal{H}}$. In the same way, we define the kernel $k_Y : Y \times Y \to \mathbb{R}$ and the associated map $\psi : Y \to \mathcal{H}'$ to some other Hilbert space $\mathcal{H}'$ for the second modality.
The idea of kernelization is to replace the original data with the corresponding feature vectors $\varphi(x)$ and $\psi(y)$, replacing the linear projections $\mathbf{p}^{\mathrm{T}} x$ and $\mathbf{q}^{\mathrm{T}} y$ with

$\xi(x) = \mathrm{sign}\Big( \sum_{j=1}^{p} u_j\, k_X(x_j, x) + a \Big), \qquad \eta(y) = \mathrm{sign}\Big( \sum_{j=1}^{q} v_j\, k_Y(y_j, y) + b \Big),$

respectively. Here, $u_1, \ldots, u_p$ and $v_1, \ldots, v_q$ are unknown linear combination coefficients, and $x_1, \ldots, x_p \in X$ and $y_1, \ldots, y_q \in Y$ denote some representative points of each modality, acting as respective bases of the subspaces used for the representation of the data in each modality.
In this formulation, the approximate loss becomes

$\hat{L}(\mathbf{u}, \mathbf{v}) = \mathbf{u}^{\mathrm{T}} \big( \mathbf{K}_{\mathcal{N}} \mathbf{L}_{\mathcal{N}}^{\mathrm{T}} - \alpha\, \mathbf{K}_{\mathcal{P}} \mathbf{L}_{\mathcal{P}}^{\mathrm{T}} \big) \mathbf{v},$

where $\mathbf{K}_{\mathcal{P}}$ and $\mathbf{K}_{\mathcal{N}}$ denote $p \times |\mathcal{P}|$ and $p \times |\mathcal{N}|$ matrices, and $\mathbf{L}_{\mathcal{P}}$ and $\mathbf{L}_{\mathcal{N}}$ denote $q \times |\mathcal{P}|$ and $q \times |\mathcal{N}|$ matrices, with elements $k_X(x_j, x_k)$ and $k_Y(y_j, y_k)$, respectively. The optimal projection coefficients minimizing $\hat{L}$ are given by the largest left and right singular vectors of the matrix

$\alpha\, \mathbf{K}_{\mathcal{P}} \mathbf{L}_{\mathcal{P}}^{\mathrm{T}} - \mathbf{K}_{\mathcal{N}} \mathbf{L}_{\mathcal{N}}^{\mathrm{T}}.$
The kernels can be selected in a way that incorporates the intramodal similarities, which are not accounted for in the previously discussed cross-modality hashing problem. For example, a classical choice is the Gaussian kernel, $k_X(x, x') = e^{-d_X^2(x, x') / 2\sigma^2}$ and $k_Y(y, y') = e^{-d_Y^2(y, y') / 2\sigma^2}$. This way, we account both for the intermodal similarity (through the definition of the positive set $\mathcal{P}$) and the intramodal similarities (through the definition of the kernels $k_X, k_Y$). Furthermore, the dimensionality of the hash is now bounded by the number of basis vectors, $m \leq \min(p, q)$, which can be arbitrary and in practice is limited only by the training set size and computational complexity. Finally, the use of kernels generalizes the embeddings to a more generic rather than affine form.
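The kernelized projection step can be sketched as follows (a sketch under the notation above, with illustrative sizes and Euclidean distances inside the Gaussian kernels; empirical averages replace the expectations):

```python
import numpy as np

rng = np.random.default_rng(4)

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix with entries exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma**2))

def mmkdif_projection(Xb, Yb, Xp, Yp, Xn, Yn, alpha=1.0, sigma=1.0):
    """One-dimensional MM-kDIF projection step: build the kernel matrices
    between the basis points and the positive/negative pairs, and take
    the largest singular vector pair of alpha * K_P L_P^T - K_N L_N^T."""
    K_P = gaussian_kernel(Xb, Xp, sigma)   # p x |P|
    L_P = gaussian_kernel(Yb, Yp, sigma)   # q x |P|
    K_N = gaussian_kernel(Xb, Xn, sigma)   # p x |N|
    L_N = gaussian_kernel(Yb, Yn, sigma)   # q x |N|
    M = alpha * K_P @ L_P.T / K_P.shape[1] - K_N @ L_N.T / K_N.shape[1]
    U, S, Vt = np.linalg.svd(M)
    return U[:, 0], Vt[0, :]               # coefficient vectors u, v

# Illustrative sizes: 6 and 5 basis points, 30 positive and 40 negative pairs
Xb, Yb = rng.standard_normal((6, 4)), rng.standard_normal((5, 3))
Xp, Yp = rng.standard_normal((30, 4)), rng.standard_normal((30, 3))
Xn, Yn = rng.standard_normal((40, 4)), rng.standard_normal((40, 3))
u, v = mmkdif_projection(Xb, Yb, Xp, Yp, Xn, Yn)
```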
5.2 Threshold selection
As previously, the thresholds should be selected to minimize the false positive and false negative rates for each dimension of the projection,

(9) $\mathrm{FPR}_i(a_i, b_i) = \Pr\Big( \mathrm{sign}\big( \sum_{j=1}^{p} u_{ij}\, k_X(x_j, x) + a_i \big) = \mathrm{sign}\big( \sum_{j=1}^{q} v_{ij}\, k_Y(y_j, y) + b_i \big) \mid \mathcal{N} \Big),$

(10) $\mathrm{FNR}_i(a_i, b_i) = \Pr\Big( \mathrm{sign}\big( \sum_{j=1}^{p} u_{ij}\, k_X(x_j, x) + a_i \big) \neq \mathrm{sign}\big( \sum_{j=1}^{q} v_{ij}\, k_Y(y_j, y) + b_i \big) \mid \mathcal{P} \Big).$

The optimal thresholds are obtained as

(11) $(\hat{a}_i, \hat{b}_i) = \arg\min_{a_i, b_i} \; \alpha\, \mathrm{FNR}_i(a_i, b_i) + \mathrm{FPR}_i(a_i, b_i).$
5.3 Hash function application
Once the linear combination coefficients and thresholds are computed, given new data points $x \in X$ and $y \in Y$, we construct the corresponding $m$-dimensional binary hash vectors as $\xi(x) = \mathrm{sign}(\mathbf{U} \mathbf{k}_X(x) + \mathbf{a})$ and $\eta(y) = \mathrm{sign}(\mathbf{V} \mathbf{k}_Y(y) + \mathbf{b})$, where $\mathbf{k}_X(x) = (k_X(x_1, x), \ldots, k_X(x_p, x))^{\mathrm{T}}$, $\mathbf{k}_Y(y)$ is defined analogously, and the rows of $\mathbf{U}$ and $\mathbf{V}$ contain the coefficient vectors of the hash dimensions.
6 Results
To test the performance of the algorithms, we created simulated multimodal data of two different dimensionalities, one per modality. In each modality, the data were created as follows: first, random vectors were generated as "centers"; to each center, i.i.d. Gaussian noise with a different standard deviation in each dimension was added. The binary intermodal similarity partitioned the dataset into classes. As the intramodal dissimilarity in each modality, we used the Mahalanobis metric with the respective diagonal covariance matrix.

We compared the boosting-based CM-SSH [4] with our CM-DIF and MM-kDIF methods. Hashes of different dimension $m$ were used for CM-SSH and MM-kDIF; for CM-DIF, $m$ is bounded by the data dimensionality. We used a fixed value of the tradeoff parameter $\alpha$. For MM-kDIF, we used bases of representative points in each modality and Gaussian kernels constructed from the intramodal dissimilarities.
For CM-SSH, the settings were chosen according to [4].
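The synthetic data generation described above can be sketched as follows (class counts, dimensions, and noise level here are illustrative, not the values used in the experiments):

```python
import numpy as np

rng = np.random.default_rng(5)

def make_multimodal_data(n_classes, n_per_class, dim1, dim2, noise=0.1):
    """Simulated multimodal data in the spirit of the experiment above:
    each class gets one random "center" per modality, and samples are
    centers plus i.i.d. Gaussian noise; a pair (x, y) is a positive iff
    both samples come from the same class."""
    cx = rng.standard_normal((n_classes, dim1))   # class centers, modality X
    cy = rng.standard_normal((n_classes, dim2))   # class centers, modality Y
    labels = np.repeat(np.arange(n_classes), n_per_class)
    X = cx[labels] + noise * rng.standard_normal((labels.size, dim1))
    Y = cy[labels] + noise * rng.standard_normal((labels.size, dim2))
    return X, Y, labels

X, Y, labels = make_multimodal_data(n_classes=10, n_per_class=20, dim1=8, dim2=16)
```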
The training set consisted of positive and negative pairs. Training times were measured for CM-SSH, CM-DIF, and MM-kDIF under identical conditions. Testing was performed on a separate set of pairs, using data from one modality as a query and data from the other modality as the database. Performance was measured as mean average precision (mAP) and equal error rate (EER); ideal performance is $100\%$ mAP and $0\%$ EER.
Figures 1–2 show the performance of the different multimodal hashing algorithms as a function of $m$ for datasets with a different number of classes. For comparison, we show the performance of unimodal retrieval (Euclidean distance). Our methods clearly outperform CM-SSH both in accuracy and training time. Moreover, the performance of CM-SSH appears to fall dramatically with increasing complexity of the dataset (more classes), while our methods continue to perform well.
References
[1] A. M. Bronstein, M. M. Bronstein, M. Ovsjanikov, and L. J. Guibas. WaldHash: sequential similarity-preserving hashing. Technical Report CIS-2010-03, Technion, Israel, 2010.
 [2] A. M. Bronstein, M. M. Bronstein, M. Ovsjanikov, and L. J. Guibas. Shape Google: geometric words and expressions for invariant shape retrieval. ACM Trans. Graphics (TOG), 30(1):1–20, 2011.
[3] M. M. Bronstein. Kernel diff-hash. Technical Report arXiv:1111.0466v1, 2011.
[4] M. M. Bronstein, A. M. Bronstein, F. Michel, and N. Paragios. Data fusion through cross-modality metric learning using similarity-sensitive hashing. In Proc. CVPR, 2010.
 [5] F. Michel, M. M. Bronstein, A. M. Bronstein, and N. Paragios. Boosted metric learning for 3D multimodal deformable registration. In Proc. ISBI, 2011.

[6] B. Schölkopf, A. Smola, and K.-R. Müller. Kernel principal component analysis. In Proc. ICANN, pages 583–588, 1997.
[7] G. Shakhnarovich. Learning task-specific similarity. PhD thesis, MIT, 2005.
 [8] C. Strecha, A. M. Bronstein, M. M. Bronstein, and P. Fua. LDAHash: improved matching with smaller descriptors. PAMI, 2011.