# Multimodal diff-hash

Many applications require comparing multimodal data that differ in structure and dimensionality and cannot be compared directly. Recently, there has been increasing interest in methods for learning and efficiently representing such multimodal similarity. In this paper, we present a simple algorithm for multimodal similarity-preserving hashing that maps multimodal data into the Hamming space while preserving the intra- and inter-modal similarities. We show that our method significantly outperforms the state-of-the-art method in the field.


## 1 Introduction

The need to model and compute similarity between objects is central to many applications, ranging from medical imaging to biometric security. In various problems in different fields we need to compare objects as different as functions, images, geometric shapes, probability distributions, or text documents. Each such problem has its own notion of data similarity.

A particularly challenging case of similarity arises in applications dealing with multimodal data, which have different representations, dimensionality, and structure. Data of this kind are encountered prominently in medical imaging (e.g., fusion of different imaging modalities like PET and CT) [5] and multimedia retrieval (e.g., querying image databases by text keywords) [2]. Such data are as incomparable as apples and oranges by means of standard metrics and require the notion of multimodal similarity.

While such multimodal similarity is difficult to model, in many cases it is easy to learn from examples. For instance, in Internet vision applications we can easily obtain multiple examples of visual objects together with a binary similarity function telling whether two objects are similar or not. Learning and representing such similarities in a convenient way is a major challenge.

A particular setting of the similarity representation problem is similarity-sensitive hashing [7], which has attracted significant attention in the computer vision and pattern recognition communities. In [4], we extended the boosting-based similarity-sensitive hashing (SSH) method to the multimodal setting (referred to as cross-modality SSH or CM-SSH). This is, to the best of our knowledge, the first and only multimodal similarity-preserving hashing algorithm in the literature.

The purpose of this paper is to develop a different, simpler, and more efficient multimodal hashing algorithm. The rest of the paper is organized as follows. In Section 2, we formulate the problem of multimodal hashing. In Section 3, we overview the CM-SSH algorithm. In Section 4, we propose our new method (cross-modality diff-hash or CM-DIF), and in Section 5 we discuss its kernelized extension (multimodal kernel diff-hash or MM-kDIF). Section 6 shows experimental results.

## 2 Background

Let $X \subseteq \mathbb{R}^{n}$ and $Y \subseteq \mathbb{R}^{n'}$ be two spaces representing data belonging to different modalities (e.g., $X$ are images and $Y$ are text descriptions). Note that even though we assume that the data can be represented in the Euclidean space, the similarity of the data is not necessarily Euclidean and in general can be described by some metrics $d_X : X \times X \to \mathbb{R}$ and $d_Y : Y \times Y \to \mathbb{R}$, to which we refer as intra-modal dissimilarities. Furthermore, we assume that there exists some inter-modal dissimilarity $d_{XY} : X \times Y \to \mathbb{R}$ quantifying the “distance” between points in different modalities. The ensemble of intra- and inter-modal structures is not necessarily a metric in the strict sense. In order to deal with these structures in a more convenient way, we try to represent them in a common metric space.

The broader problem of multimodal hashing is to represent the data from different modalities in a common space $\mathbb{H}^m = \{\pm 1\}^m$ of $m$-dimensional binary vectors with the Hamming metric $d_{\mathbb{H}^m}$ by means of two embeddings, $\xi : X \to \mathbb{H}^m$ and $\eta : Y \to \mathbb{H}^m$, mapping similar points as close as possible to each other and dissimilar points as distant as possible from each other, such that $d_{\mathbb{H}^m} \circ (\xi \times \xi) \approx d_X$, $d_{\mathbb{H}^m} \circ (\eta \times \eta) \approx d_Y$, and $d_{\mathbb{H}^m} \circ (\xi \times \eta) \approx d_{XY}$. In a sense, the embeddings act as a metric coupling, trying to construct a single metric that preserves the intra- and inter-modal similarities.

A simplified setting of the multimodal hashing problem is cross-modality hashing, in which only the inter-modal dissimilarity $d_{XY}$ is taken into consideration, while $d_X$ and $d_Y$ are ignored.

For simplicity, in the following discussion we assume the inter-modal dissimilarity to be binary, $d_{XY} \in \{0, 1\}$, i.e., a pair of points can be either similar or dissimilar. This dissimilarity is usually unknown and hard to model; however, it should be possible to sample it on some subset of the data. This sample can be represented as a set of similar pairs of points (positives) $P = \{(x, y) : d_{XY}(x, y) = 0\}$ and a set of dissimilar pairs of points (negatives) $N = \{(x, y) : d_{XY}(x, y) = 1\}$.

The problem of cross-modality hashing thus boils down to finding two embeddings $\xi$ and $\eta$ such that $d_{\mathbb{H}^m}(\xi(x), \eta(y)) \approx d_{XY}(x, y)$. Alternatively, this can be expressed as having $\xi(x) = \eta(y)$ with high probability on the positives (i.e., the hash has high collision probability on the set of positives) and $\xi(x) \neq \eta(y)$ with high probability on the negatives. The violation of the former can be interpreted as the false negative rate (FNR) and the violation of the latter as the false positive rate (FPR).
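To make the FNR/FPR terminology concrete, the following NumPy sketch (illustrative only, not from the paper; the random projections, the linear map `A` generating toy positive pairs, and the collision radius `r` are all hypothetical assumptions) estimates both rates for a candidate pair of hash maps:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, n2 = 8, 16, 24          # hash length and the two modality dimensions

# Hypothetical candidate embeddings xi(x) = sign(Px + a), eta(y) = sign(Qy + b)
P, Q = rng.standard_normal((m, n)), rng.standard_normal((m, n2))
a, b = np.zeros(m), np.zeros(m)
xi  = lambda x: np.sign(P @ x + a)
eta = lambda y: np.sign(Q @ y + b)

# Toy multimodal data: y = A @ x is a fixed linear view of the same latent
# point, so (x, A @ x) are positives and mismatched pairs are negatives.
A = rng.standard_normal((n2, n))
xs = rng.standard_normal((200, n))
positives = [(x, A @ x) for x in xs[:100]]
negatives = [(x, A @ x2) for x, x2 in zip(xs[100:], rng.standard_normal((100, n)))]

hamming = lambda u, v: int(np.sum(u != v))
r = 2  # Hamming radius counted as a "collision"
fnr = np.mean([hamming(xi(x), eta(y)) > r for x, y in positives])   # missed positives
fpr = np.mean([hamming(xi(x), eta(y)) <= r for x, y in negatives])  # colliding negatives
print(fnr, fpr)
```

A good pair of embeddings drives both estimated rates toward zero; the hashing algorithms below construct $P$, $Q$, $a$, $b$ to do exactly that.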

## 3 Cross-modality similarity-sensitive hashing (CM-SSH)

To further simplify the problem, consider embeddings given in parametric form as $\xi(x) = \mathrm{sign}(Px + a)$ and $\eta(y) = \mathrm{sign}(Qy + b)$ [7, 4]. Here, $P$ and $Q$ are projection matrices of size $m \times n$ and $m \times n'$, respectively, and $a$ and $b$ are threshold vectors of size $m \times 1$.

In [4], we introduced the cross-modality similarity-sensitive hashing (CM-SSH) method, which is, to the best of our knowledge, the first and only multimodal hashing algorithm existing to date. The idea closely follows the similarity-sensitive hashing (SSH) method [7], considering the hash construction as boosted binary classification, where each hash dimension acts as a weak binary classifier. For each dimension, AdaBoost is used to minimize the following loss function,

$$
\min_{p_i, q_i, a_i, b_i} \; \sum_{(x, y) \in P \cup N} w_i(x, y)\, s(x, y)\, \mathrm{sign}(p_i^{\mathrm{T}} x + a_i)\, \mathrm{sign}(q_i^{\mathrm{T}} y + b_i),
\tag{1}
$$

where $s(x, y) \in \{\pm 1\}$ is the binary inter-modal similarity and $w_i(x, y)$ is the AdaBoost weight of the pair $(x, y)$ at the $i$th iteration. Since the minimization problem (1) is difficult, it is relaxed in the following way [4]: first, removing the sign non-linearity and setting $a_i = b_i = 0$, find the projection vectors $p_i$ and $q_i$. Then, fixing the projections, find the thresholds $a_i$ and $b_i$.

The disadvantages of the boosting-based CM-SSH are, first, its high computational complexity and, second, its tendency to produce unnecessarily long hashes (the second problem can be partially resolved by using sequential probability testing [1], which creates hashes of minimum expected length).

## 4 Cross-modality diff-hash (CM-DIF)

In [8], we proposed a different and simpler approach (dubbed diff-hash) to create similarity-sensitive hash functions in the unimodal setting. We adopt similar ideas here to develop multimodal similarity-sensitive hashing algorithms.

The optimal cross-modality hashing can be found by minimizing the loss

$$
L = \gamma\, E\{d_{\mathbb{H}^m}(\xi(x), \eta(y)) \,|\, P\} - E\{d_{\mathbb{H}^m}(\xi(x), \eta(y)) \,|\, N\}
= \frac{m(\gamma - 1)}{2} + \frac{1}{2}\, E\{\xi(x)^{\mathrm{T}} \eta(y) \,|\, N\} - \frac{\gamma}{2}\, E\{\xi(x)^{\mathrm{T}} \eta(y) \,|\, P\}
\tag{2}
$$

with respect to the embedding functions $\xi$ and $\eta$, which is, up to constants, equivalent to minimizing the correlations

$$
L(P, Q, a, b) = E\{\mathrm{sign}(Px + a)^{\mathrm{T}}\, \mathrm{sign}(Qy + b) \,|\, N\} - \gamma\, E\{\mathrm{sign}(Px + a)^{\mathrm{T}}\, \mathrm{sign}(Qy + b) \,|\, P\}
\tag{3}
$$

w.r.t. the projection matrices $P, Q$ and threshold vectors $a, b$. The first and second terms in (3) can be thought of as the FPR and FNR, respectively. The parameter $\gamma$ controls the tradeoff between the FPR and the FNR. The limit case $\gamma \to \infty$ effectively considers only the positive pairs, ignoring the negative set.

Problem (3) is a highly non-convex, non-linear optimization problem that is difficult to solve straightforwardly. Similarly to [8, 4], we simplify the problem in the following way. First, we ignore the thresholds and solve a simplified problem without the sign non-linearity for the projection matrices,

$$
\min_{P, Q} \; E\{(Px)^{\mathrm{T}}(Qy) \,|\, N\} - \gamma\, E\{(Px)^{\mathrm{T}}(Qy) \,|\, P\}
\quad \text{s.t.} \;\; PP^{\mathrm{T}} = I_m, \;\; QQ^{\mathrm{T}} = I_m.
$$

Second, fixing the projections we find optimal thresholds,

$$
\min_{a, b} \; E\{\mathrm{sign}(Px + a)^{\mathrm{T}}\, \mathrm{sign}(Qy + b) \,|\, N\} - \gamma\, E\{\mathrm{sign}(Px + a)^{\mathrm{T}}\, \mathrm{sign}(Qy + b) \,|\, P\}.
$$

We detail each step in Sections 4.1–4.2. The whole method is summarized in Algorithm 1.

### 4.1 Projection computation

Dropping the sign function and the offset, the loss function (3) becomes

$$
\begin{aligned}
L(P, Q, a, b) \approx \hat{L}(P, Q) &= E\{(Px)^{\mathrm{T}}(Qy) \,|\, N\} - \gamma\, E\{(Px)^{\mathrm{T}}(Qy) \,|\, P\} \\
&= \operatorname{tr}\!\big(P\, E\{x y^{\mathrm{T}} | N\}\, Q^{\mathrm{T}}\big) - \gamma \operatorname{tr}\!\big(P\, E\{x y^{\mathrm{T}} | P\}\, Q^{\mathrm{T}}\big) \\
&= \operatorname{tr}\!\big(P (\Sigma^{N}_{XY} - \gamma \Sigma^{P}_{XY}) Q^{\mathrm{T}}\big) = \operatorname{tr}\!\big(P\, \Sigma^{D}_{XY}\, Q^{\mathrm{T}}\big),
\end{aligned}
\tag{4}
$$

where $\Sigma^{N}_{XY}$ and $\Sigma^{P}_{XY}$ denote the covariance matrices of the negative and positive multimodal data, respectively, and $\Sigma^{D}_{XY} = \Sigma^{N}_{XY} - \gamma \Sigma^{P}_{XY}$ is the weighted difference of these covariances. The name of the algorithm, cross-modality diff-hash (CM-DIF), refers in fact to this covariance difference matrix. Note that in order to avoid a trivial solution, we must constrain the projection matrices to be unitary, i.e., $PP^{\mathrm{T}} = I_m$ and $QQ^{\mathrm{T}} = I_m$.

The difference of covariance matrices has a singular value decomposition of the form $\Sigma^{D}_{XY} = U S V^{\mathrm{T}}$, where $U$ and $V$ are unitary matrices of singular vectors of size $n \times n$ and $n' \times n'$, respectively ($U^{\mathrm{T}} U = I_n$, $V^{\mathrm{T}} V = I_{n'}$), and $S$ is a diagonal matrix of singular values of size $n \times n'$.

It can be easily shown that the loss is minimized by setting the projection matrices to the $m$ largest left and right singular vectors of the matrix $\Sigma^{D}_{XY}$ (with an appropriate choice of sign), respectively: $P = U_m^{\mathrm{T}}$ and $Q = V_m^{\mathrm{T}}$. From this result it also follows that the problem is separable, and each dimension can be treated independently.
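Under plausible assumptions about the data layout (row-per-sample arrays of paired training points; the function and variable names are hypothetical), the projection step can be sketched in NumPy as follows. The sign flip on $P$ makes the trace in (4) equal to minus the sum of the $m$ leading singular values, i.e., the minimum of the relaxed loss:

```python
import numpy as np

def cmdif_projections(X_pos, Y_pos, X_neg, Y_neg, m, gamma=1.0):
    """Compute CM-DIF projection matrices from paired training data.

    X_pos, Y_pos: arrays of shape (|P|, n) and (|P|, n') with positive pairs
    X_neg, Y_neg: arrays of shape (|N|, n) and (|N|, n') with negative pairs
    Returns P (m x n) and Q (m x n') with orthonormal rows.
    """
    # Empirical cross-covariances E{x y^T | N} and E{x y^T | P}, each n x n'.
    sigma_N = X_neg.T @ Y_neg / len(X_neg)
    sigma_P = X_pos.T @ Y_pos / len(X_pos)
    sigma_D = sigma_N - gamma * sigma_P        # weighted covariance difference

    # SVD of the difference matrix; numpy returns singular values sorted
    # in decreasing order, so the leading m columns/rows are the largest.
    U, s, Vt = np.linalg.svd(sigma_D)
    P = -U[:, :m].T                            # sign flip: tr(P sigma_D Q^T) = -sum(s[:m])
    Q = Vt[:m]
    return P, Q
```

The thresholds $a$ and $b$ are then found separately, as described in the next subsection.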

### 4.2 Threshold selection

Having the projection matrices fixed, the loss function (3) can be written as

$$
\begin{aligned}
L(a, b) &= E\{\mathrm{sign}(Px + a)^{\mathrm{T}}\, \mathrm{sign}(Qy + b) \,|\, N\} - \gamma\, E\{\mathrm{sign}(Px + a)^{\mathrm{T}}\, \mathrm{sign}(Qy + b) \,|\, P\} \\
&= \sum_{i=1}^{m} E\{\mathrm{sign}(p_i^{\mathrm{T}} x + a_i)\, \mathrm{sign}(q_i^{\mathrm{T}} y + b_i) \,|\, N\} - \gamma \sum_{i=1}^{m} E\{\mathrm{sign}(p_i^{\mathrm{T}} x + a_i)\, \mathrm{sign}(q_i^{\mathrm{T}} y + b_i) \,|\, P\}.
\end{aligned}
\tag{5}
$$

The problem is separable and can be solved independently in each dimension $i$. We express the false negative and false positive rates as functions of the thresholds $a_i$ and $b_i$ as

$$
\begin{aligned}
\mathrm{FN}_i(a_i, b_i) &= \Pr(p_i^{\mathrm{T}} x + a_i < 0 \,|\, P) \cdot \Pr(q_i^{\mathrm{T}} y + b_i > 0 \,|\, P) + \Pr(p_i^{\mathrm{T}} x + a_i > 0 \,|\, P) \cdot \Pr(q_i^{\mathrm{T}} y + b_i < 0 \,|\, P) \\
&= \Pr(p_i^{\mathrm{T}} x < -a_i \,|\, P) \cdot \big(1 - \Pr(q_i^{\mathrm{T}} y < -b_i \,|\, P)\big) + \Pr(q_i^{\mathrm{T}} y < -b_i \,|\, P) \cdot \big(1 - \Pr(p_i^{\mathrm{T}} x < -a_i \,|\, P)\big)
\end{aligned}
\tag{6}
$$

and

$$
\begin{aligned}
\mathrm{FP}_i(a_i, b_i) &= \Pr(p_i^{\mathrm{T}} x + a_i < 0 \,|\, N) \cdot \Pr(q_i^{\mathrm{T}} y + b_i < 0 \,|\, N) + \Pr(p_i^{\mathrm{T}} x + a_i > 0 \,|\, N) \cdot \Pr(q_i^{\mathrm{T}} y + b_i > 0 \,|\, N) \\
&= \Pr(p_i^{\mathrm{T}} x < -a_i \,|\, N) \cdot \Pr(q_i^{\mathrm{T}} y < -b_i \,|\, N) + \big(1 - \Pr(p_i^{\mathrm{T}} x < -a_i \,|\, N)\big) \cdot \big(1 - \Pr(q_i^{\mathrm{T}} y < -b_i \,|\, N)\big).
\end{aligned}
\tag{7}
$$

The above probabilities can be estimated from histograms (cumulative distributions) of $p_i^{\mathrm{T}} x$ and $q_i^{\mathrm{T}} y$ on the positive and negative sets. The optimal thresholds

$$
(a_i^{*}, b_i^{*}) = \operatorname*{argmin}_{a, b} \; \gamma\, \mathrm{FN}_i(a, b) + \mathrm{FP}_i(a, b)
\tag{8}
$$

are obtained by means of exhaustive search. To reduce the complexity of this search, we define a set of grids on the threshold parameter space.
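A minimal sketch of this search for a single hash dimension (the helper name and grid construction are hypothetical; the projected training samples $p_i^{\mathrm{T}} x$ and $q_i^{\mathrm{T}} y$ are assumed given as 1-D arrays):

```python
import numpy as np

def select_thresholds(px_pos, qy_pos, px_neg, qy_neg, gamma=1.0, grid_size=32):
    """Exhaustive grid search for (a_i, b_i) minimizing gamma*FN + FP, as in (8).

    px_pos, qy_pos: projections p^T x and q^T y over the positive pairs
    px_neg, qy_neg: the same projections over the negative pairs
    """
    cdf = lambda s, t: float(np.mean(s < -t))      # empirical Pr(s < -t)
    all_px = np.concatenate([px_pos, px_neg])
    all_qy = np.concatenate([qy_pos, qy_neg])
    # Thresholds only matter where the empirical CDFs vary.
    a_grid = np.linspace(-all_px.max(), -all_px.min(), grid_size)
    b_grid = np.linspace(-all_qy.max(), -all_qy.min(), grid_size)

    best_cost, best_ab = np.inf, (0.0, 0.0)
    for a_ in a_grid:
        for b_ in b_grid:
            # False negative rate, cf. eq. (6)
            fn = (cdf(px_pos, a_) * (1 - cdf(qy_pos, b_))
                  + cdf(qy_pos, b_) * (1 - cdf(px_pos, a_)))
            # False positive rate, cf. eq. (7)
            fp = (cdf(px_neg, a_) * cdf(qy_neg, b_)
                  + (1 - cdf(px_neg, a_)) * (1 - cdf(qy_neg, b_)))
            if gamma * fn + fp < best_cost:
                best_cost, best_ab = gamma * fn + fp, (a_, b_)
    return best_ab
```

The search is repeated independently for each of the $m$ dimensions, consistent with the separability of (5).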

### 4.3 Hash function application

Once the projections and thresholds are computed, given new data points $x \in X$ and $y \in Y$, we construct the corresponding $m$-dimensional binary hash vectors as $\xi(x) = \mathrm{sign}(Px + a)$ and $\eta(y) = \mathrm{sign}(Qy + b)$.

## 5 Multimodal kernel diff-hash (MM-kDIF)

An obvious disadvantage of diff-hash (and spectral methods in general) compared to AdaBoost-based methods is that it must be dimensionality-reducing: since we compute the projections $P$ and $Q$ from the singular vectors of a covariance matrix of size $n \times n'$, the dimensionality of the embedding space must satisfy $m \leq \min(n, n')$. In some cases, such a dimensionality may be too low and would not allow correct separation of the data. A second disadvantage, of the cross-modality hashing problem in general, is that it considers only the inter-modal similarity $d_{XY}$, ignoring the intra-modal similarities $d_X$ and $d_Y$.

A standard way to cope with the first problem is the kernel trick [6], which maps the data into some feature space that is never dealt with explicitly (only inner products in this space, referred to as the kernel, are required). A kernel version of the uni-modal diff-hash was described in [3]. Here, we show that the use of kernels also allows incorporating intra-modal similarities into the problem.

Since the problem is separable (as we have seen, the projection in each dimension corresponds to a singular vector of the covariance difference matrix), we consider for simplicity one-dimensional projections.

The whole method is summarized in Algorithm 2. Since it considers (though implicitly) the intra-modal dissimilarities in addition to the inter-modal dissimilarity, we refer to it as multimodal kernel diff-hash (MM-kDIF).

### 5.1 Projection computation

Let $k_X : X \times X \to \mathbb{R}$ be a positive semi-definite kernel, and let $\phi : X \to V$ be the associated map. The map $\phi$ maps the data into some feature space $V$, which we represent here as a Hilbert space (possibly of infinite dimension) with an inner product $\langle \cdot, \cdot \rangle_V$. It satisfies $k_X(x, x') = \langle \phi(x), \phi(x') \rangle_V$. In the same way, we define the kernel $k_Y$ and the associated map $\psi : Y \to V'$ to some other Hilbert space $V'$ for the second modality.

The idea of kernelization is to replace the original data with the corresponding feature vectors $\phi(x)$ and $\psi(y)$, replacing the linear projections $p^{\mathrm{T}} x$ and $q^{\mathrm{T}} y$ with

$$
\begin{aligned}
p(x) &= \sum_{i=1}^{l} \alpha_i \langle \phi(x_i), \phi(x) \rangle_V = \alpha^{\mathrm{T}} \big[k_X(x_1, x), \dots, k_X(x_l, x)\big]^{\mathrm{T}}, \\
q(y) &= \sum_{j=1}^{l'} \beta_j \langle \psi(y_j), \psi(y) \rangle_{V'} = \beta^{\mathrm{T}} \big[k_Y(y_1, y), \dots, k_Y(y_{l'}, y)\big]^{\mathrm{T}},
\end{aligned}
$$

respectively. Here, $\alpha$ and $\beta$ are unknown linear combination coefficients, and $x_1, \dots, x_l$ and $y_1, \dots, y_{l'}$ denote some representative points of each modality acting as respective bases of the subspaces used for the representation of the data in each modality.

In this formulation, the approximate loss becomes

$$
\begin{aligned}
\hat{L}(\alpha, \beta) &= \frac{1}{|N|} \sum_{(x, y) \in N} p(x)\, q(y) - \frac{\gamma}{|P|} \sum_{(x, y) \in P} p(x)\, q(y) \\
&= \frac{1}{|N|} \sum_{(x, y) \in N} \sum_{i=1}^{l} \alpha_i k_X(x_i, x) \sum_{j=1}^{l'} \beta_j k_Y(y_j, y) - \frac{\gamma}{|P|} \sum_{(x, y) \in P} \sum_{i=1}^{l} \alpha_i k_X(x_i, x) \sum_{j=1}^{l'} \beta_j k_Y(y_j, y) \\
&= \frac{1}{|N|}\, \alpha^{\mathrm{T}} K^{N}_X (K^{N}_Y)^{\mathrm{T}} \beta - \frac{\gamma}{|P|}\, \alpha^{\mathrm{T}} K^{P}_X (K^{P}_Y)^{\mathrm{T}} \beta,
\end{aligned}
$$

where $K^{N}_X$ and $K^{N}_Y$ denote $l \times |N|$ and $l' \times |N|$ matrices, and $K^{P}_X$ and $K^{P}_Y$ denote $l \times |P|$ and $l' \times |P|$ matrices, with elements $k_X(x_i, x)$ and $k_Y(y_j, y)$, respectively. The optimal projection coefficients $\alpha$ and $\beta$ minimizing $\hat{L}$ are given by the largest left and right singular vectors of the matrix

$$
K = \frac{1}{|N|}\, K^{N}_X (K^{N}_Y)^{\mathrm{T}} - \frac{\gamma}{|P|}\, K^{P}_X (K^{P}_Y)^{\mathrm{T}}.
$$

The kernels can be selected in a way that incorporates the intra-modal similarities, which are not accounted for in the previously discussed cross-modality hashing problem. For example, a classical choice is the Gaussian kernel, $k_X(x, x') = e^{-d_X^2(x, x')}$ and $k_Y(y, y') = e^{-d_Y^2(y, y')}$. This way, we account both for the inter-modal similarity (through the definition of the positive set $P$) and the intra-modal similarities (through the definition of the kernels $k_X$ and $k_Y$). Furthermore, the dimensionality of the hash is now bounded by the number of basis vectors, $m \leq \min(l, l')$, which can be arbitrary and in practice is limited only by the training set size and computational complexity. Finally, the use of kernels generalizes the embeddings to a more generic rather than affine form.
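The kernelized projection step thus reduces to an SVD of the $l \times l'$ matrix $K$. A sketch under assumptions analogous to the CM-DIF example (hypothetical names; an unweighted Gaussian kernel with identity covariance stands in for the intra-modal dissimilarity):

```python
import numpy as np

def mm_kdif_coefficients(KX_N, KY_N, KX_P, KY_P, m, gamma=1.0):
    """Coefficient matrices alpha (m x l) and beta (m x l') for MM-kDIF.

    KX_N: l x |N| matrix [k_X(x_i, x)], KY_N: l' x |N| matrix [k_Y(y_j, y)];
    KX_P, KY_P: the analogous matrices over the positive pairs.
    """
    K = KX_N @ KY_N.T / KX_N.shape[1] - gamma * KX_P @ KY_P.T / KX_P.shape[1]
    U, s, Vt = np.linalg.svd(K)          # singular values in decreasing order
    return -U[:, :m].T, Vt[:m]           # sign flip minimizes the bilinear loss

def gaussian_gram(basis, samples):
    """l x N matrix of k(x_i, x) = exp(-||x_i - x||^2) (identity covariance)."""
    d2 = ((basis[:, None, :] - samples[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2)
```

Each row of `alpha`/`beta` plays the role of the one-dimensional coefficient vectors $\alpha, \beta$ above, one per hash dimension.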

### 5.2 Threshold selection

As previously, the thresholds should be selected to minimize the false negative and false positive rates for each dimension of the projection,

$$
\begin{aligned}
\mathrm{FN}(a, b) &= \Pr(p(x) < -a \,|\, P) \cdot \big(1 - \Pr(q(y) < -b \,|\, P)\big) + \Pr(q(y) < -b \,|\, P) \cdot \big(1 - \Pr(p(x) < -a \,|\, P)\big);
\end{aligned}
\tag{9}
$$
$$
\begin{aligned}
\mathrm{FP}(a, b) &= \Pr(p(x) < -a \,|\, N) \cdot \Pr(q(y) < -b \,|\, N) + \big(1 - \Pr(p(x) < -a \,|\, N)\big) \cdot \big(1 - \Pr(q(y) < -b \,|\, N)\big).
\end{aligned}
\tag{10}
$$

The optimal thresholds are obtained as

$$
(a^{*}, b^{*}) = \operatorname*{argmin}_{a, b} \; \gamma\, \mathrm{FN}(a, b) + \mathrm{FP}(a, b).
\tag{11}
$$

### 5.3 Hash function application

Once the linear combination coefficients and thresholds are computed, given new data points $x \in X$ and $y \in Y$, we construct the corresponding $m$-dimensional binary hash vectors element-wise as $\xi_i(x) = \mathrm{sign}(p_i(x) + a_i)$ and $\eta_i(y) = \mathrm{sign}(q_i(y) + b_i)$, $i = 1, \dots, m$.
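Hashing a new point therefore only requires evaluating the kernel against the basis points, as in the following sketch (hypothetical helper; `k` is any kernel function of two vectors):

```python
import numpy as np

def kernel_hash(x_new, basis, alpha, a, k):
    """m-dim hash: xi_i(x) = sign(alpha_i^T [k(x_1,x), ..., k(x_l,x)] + a_i)."""
    kvec = np.array([k(xb, x_new) for xb in basis])  # length-l kernel vector
    return np.sign(alpha @ kvec + a)                 # m-dimensional +/-1 vector
```

The second modality is hashed identically, substituting the basis $y_1, \dots, y_{l'}$, the coefficients `beta`, and the thresholds `b`.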

## 6 Results

To test the performance of the algorithms, we created simulated multimodal data of different dimensionality in the two modalities. In each modality, the data was created as follows: first, random vectors were generated as “centers”; to each “center”, i.i.d. Gaussian noise with a different standard deviation in each dimension was added. A binary inter-modal similarity partitioned the dataset into classes. As the intra-modal dissimilarity in each modality, we used the Mahalanobis metric with the respective diagonal covariance matrix.

We compared the boosting-based CM-SSH [4] with our CM-DIF and MM-kDIF methods. Hashes of different dimension $m$ were used for CM-SSH and MM-kDIF; for CM-DIF, a fixed hash dimension was used, together with a fixed tradeoff parameter $\gamma$. For MM-kDIF, we used bases of representative points and Gaussian kernels of the form

$$
k_X(x, x') = e^{-d_X^2(x, x')} = e^{-(x - x')^{\mathrm{T}} \Sigma_X^{-1/2} (x - x')};
\qquad
k_Y(y, y') = e^{-d_Y^2(y, y')} = e^{-(y - y')^{\mathrm{T}} \Sigma_Y^{-1/2} (y - y')}.
$$

For CM-SSH, the settings were according to [4].

The training set consisted of positive and negative pairs. Training CM-DIF and MM-kDIF was considerably faster than training CM-SSH. Testing was performed on a held-out set of pairs, using data from one modality as a query and data from the other modality as the database. Performance was measured as mean average precision (mAP) and equal error rate (EER); ideal performance corresponds to mAP = 100% and EER = 0.

Figures 1–2 show the performance of the different multimodal hashing algorithms as a function of $m$ for datasets with a different number of classes. For comparison, we also show the performance of unimodal retrieval (Euclidean distance). Our methods clearly outperform CM-SSH both in accuracy and training time. Moreover, the performance of CM-SSH falls dramatically with increasing complexity of the dataset (more classes), while our methods continue to perform well.

## References

• [1] A. M. Bronstein, M. M. Bronstein, M. Ovsjanikov, and L. J. Guibas. WaldHash: sequential similarity-preserving hashing. Technical Report CIS-2010-03, Technion, Israel, 2010.
• [2] A. M. Bronstein, M. M. Bronstein, M. Ovsjanikov, and L. J. Guibas. Shape Google: geometric words and expressions for invariant shape retrieval. ACM Trans. Graphics (TOG), 30(1):1–20, 2011.
• [3] M. M. Bronstein. Kernel diff-hash. Technical Report arXiv:1111.0466v1, 2011.
• [4] M. M. Bronstein, A. M. Bronstein, F. Michel, and N. Paragios. Data fusion through cross-modality metric learning using similarity-sensitive hashing. In Proc. CVPR, 2010.
• [5] F. Michel, M. M. Bronstein, A. M. Bronstein, and N. Paragios. Boosted metric learning for 3D multi-modal deformable registration. In Proc. ISBI, 2011.
• [6] B. Schölkopf, A. Smola, and K.-R. Müller. Kernel principal component analysis. In Proc. ICANN, pages 583–588, 1997.
• [7] G. Shakhnarovich. Learning task-specific similarity. PhD thesis, MIT, 2005.
• [8] C. Strecha, A. M. Bronstein, M. M. Bronstein, and P. Fua. LDAHash: improved matching with smaller descriptors. PAMI, 2011.