I. Introduction
Multimodal data refer to correlated data of different types, such as image-text pairs on Facebook and video-tag pairs on YouTube. Multimodal hashing aims at embedding multimodal data into binary codes in order to boost retrieval speed and reduce storage requirements.
Unsupervised multimodal hashing methods learn a hashing function to generate binary codes whose Hamming distances "simulate" the Euclidean distances between each pair of data features. They assume that the handcrafted or learned features of images or texts can be linearly separated by hyperplanes, so that the Euclidean distances between data features in the same category are smaller than those between data features in different categories. However, this assumption is impractical for large data sets with sophisticated data structures.
Supervised multimodal hashing methods incorporate label information to improve retrieval accuracy. With manually labeled data for training, these models are generally superior to unsupervised ones. However, they require a huge number of manually labeled data points, which places a heavy burden on human experts.
To the best of our knowledge, semi-supervised semantic factorization hashing (S3FH) [1] is the only semi-supervised multimodal hashing method. It generates a graph for each modality to avoid computing the pairwise distances in large data sets. Then the labels are estimated by a transformation of the unified hashing codes, which leads to incremental performance improvement as the number of available labels increases. In this paper, we propose a semi-supervised multimodal hashing (SSMH) method that introduces fuzzy logic to estimate labels. The overall scheme of SSMH is given in Fig. 1. SSMH first learns the hashing functions for the different modalities of the labeled data. Note that the labels, which are represented by a binary matrix, are treated as a special modality here. These hashing functions are used to generate candidate labels for the unlabeled data, but not the final hashing codes. Inspired by fuzzy c-means clustering, we introduce a membership variable for the hashing codes of unknown labels. Each membership variable represents the probability that the hashing code of an estimated label matches that of the corresponding modality. Finally, SSMH learns hashing functions to generate the hashing codes.
II. Related Works
II-A. Unimodal Hashing
Unimodal hashing methods embed a data matrix of one modality into a binary code matrix. Spectral hashing (SH) [2] defines the unimodal hashing problem as:

min_B sum_{i,j} W_{ij} ||b_i - b_j||^2, s.t. B ∈ {-1, 1}^{n×c}, B^T 1 = 0, B^T B = nI,    (1)

where X is the data matrix of which each row is a data point, x_i is the i-th row of X, B is the hashing code matrix, b_i corresponds to the hashing code of x_i, n is the number of data points and c is the code length. According to the inequality of arithmetic and geometric means, the objective function of Eq. (1) reaches its minimum when

(2)
In this case, the Hamming distance of b_i and b_j approximates the kernelized Euclidean distance of x_i and x_j, and the retrieval performance of the hashing codes is identical to that of the original data points. Apart from the binary constraint on B, the other two constraints are called the orthogonality constraint and the balance constraint, respectively. The orthogonality constraint decorrelates the bits of the hashing codes. For an extreme example, if the i-th and j-th columns of B are linearly correlated or even identical, the retrieval performance will not change when either of them is removed. The balance constraint requires each column of B to have the same number of +1 and -1 entries. For an extreme example, if all elements of a column are 1, this column becomes redundant because it does not affect the Hamming distances. Hence, these two constraints are considered necessary for good codes [2].
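For illustration, both constraints can be checked numerically on a toy code matrix. The following is a minimal sketch (the function name and measure are ours, not part of any cited method):

```python
import numpy as np

def constraint_gaps(B):
    """Measure how far a +/-1 code matrix B (n x c) is from the balance
    constraint (each column sums to 0) and the orthogonality constraint
    (B^T B = n I)."""
    n, c = B.shape
    balance_gap = np.abs(B.sum(axis=0)).max()          # 0 iff columns balanced
    ortho_gap = np.abs(B.T @ B - n * np.eye(c)).max()  # 0 iff bits uncorrelated
    return balance_gap, ortho_gap
```

A 2-bit code enumerating all sign patterns satisfies both constraints exactly, while a constant code matrix violates both.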
Eq. (1) is intractable for large data sets because it requires computing the pairwise distances over the whole data set to construct the affinity matrix W, whose element in the i-th row and j-th column is W_{ij} = exp(-||x_i - x_j||^2 / ε^2). Furthermore, the binary constraint makes it an NP-hard problem. The authors circumvent these problems by relaxing the binary constraint; the final hashing codes are generated by thresholding eigenfunctions that are designed to avoid computing pairwise distances. On the other hand, anchor graph hashing (AGH) [3] and discrete graph hashing [4] choose some special points as anchor points. The distances of data points to the anchor points are then computed to construct a highly sparse affinity matrix, so that Eq. (1) can be applied to large data sets.

Iterative quantization (ITQ) models unimodal hashing as a quantization loss minimization problem:
min_{B,R} ||B - VR||_F^2, s.t. B ∈ {-1, 1}^{n×c}, R^T R = I,    (3)

where V is comprised of the first c principal components of X and R is an orthogonal matrix. ITQ iteratively computes B and R to minimize Eq. (3); in each iteration, VR is thresholded at 0 to generate the binary codes. Isotropic hashing (IsoH) [5] equalizes the importances of the principal components. Harmonious hashing (HH) [6] puts an orthogonality constraint on an auxiliary variable for the code matrix. Unlike ITQ, IsoH and HH, which rotate the projected data matrix, ok-means [7] rotates the code matrix to minimize the quantization loss. Besides principal component analysis (PCA), linear discriminant analysis (LDA) can also be used [8]. Neighborhood discriminant hashing (NDH) [9] calculates the projection matrix during the minimization procedure rather than precomputing it by a linear transformation method.
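The ITQ alternation described above can be sketched as follows (a minimal illustration, not the authors' code): fixing R, the optimal B is the sign of VR; fixing B, the optimal R solves an orthogonal Procrustes problem via an SVD.

```python
import numpy as np

def itq(V, n_iter=50, seed=0):
    """Minimal ITQ sketch: alternate binary codes B and rotation R to
    reduce the quantization loss ||B - V R||_F^2."""
    rng = np.random.default_rng(seed)
    c = V.shape[1]
    R, _ = np.linalg.qr(rng.standard_normal((c, c)))  # random orthogonal init
    for _ in range(n_iter):
        B = np.where(V @ R >= 0, 1.0, -1.0)           # fix R, threshold at 0
        U, _, Wt = np.linalg.svd(V.T @ B)             # fix B, Procrustes step
        R = U @ Wt
    return np.where(V @ R >= 0, 1, -1), R
```

The Procrustes step maximizes tr(R^T V^T B) over orthogonal R, which is the standard closed-form rotation update.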
All the aforementioned unimodal models neglect the balance constraint. Spherical hashing (SpH) [10] and the global hashing system (GHS) [11] quantize the distance between a data point and an anchor point: the closer half of the points is denoted by 1, while the farther half is denoted by 0 or -1. Therefore, the balance constraint can be easily fulfilled. Their major difference lies in how they find the anchor points: SpH uses a heuristic algorithm, while GHS treats it as a satellite distribution problem of the global positioning system (GPS).
II-B. Multi-view Hashing
Zhang et al. [12] proposed an unsupervised multi-view hashing method by extending the unimodal hashing model AGH; it tunes the weight of each view to maximize performance. Song et al. [13] jointly consider the local structural information and the relations between local structures to design an unsupervised multi-view hashing method. Multiview alignment hashing (MAH) [14], also an unsupervised multi-view method, combines the ideas of ok-means and SH for each view.
Multi-graph hashing (MGH) [15] directly combines the graph for the whole data set and the graph for the labeled data to design a semi-supervised multi-view hashing method. Semi-supervised multi-view discrete hashing (SSMDH) [16] predicts the unknown labels by linearly transforming the hashing code matrix of the labeled data; it iteratively computes the linear transformation matrix, the hashing code matrix and the hashing functions. SSMDH requires the computation of pairwise distances over the whole data set, so it is infeasible for large data sets.
II-C. Multimodal Hashing
Existing multimodal hashing methods can be categorized into supervised and unsupervised ones. Similar to unsupervised unimodal hashing methods, unsupervised multimodal hashing methods aim at preserving the Euclidean distances between each pair of data points. Inter-media hashing (IMH) [17] exploits inter-media and intra-media consistency to generate hashing codes. Like what AGH has done to SH, linear cross-media hashing (LCMH) [18] uses the distances between each data point and each cluster centroid to construct a sparse affinity matrix. Collective matrix factorization hashing (CMFH) [19] can be treated as an extension of NDH. For each modality, CMFH consists of two terms: (1) a transformation matrix that maps the data matrix to the code matrix, and (2) a transformation matrix that maps the code matrix back to the data matrix.
By incorporating label information, supervised multimodal hashing can preserve semantic information and achieve higher accuracy. Cross-modality similarity-sensitive hashing (CMSSH) [20] treats hashing as a binary classification problem. Cross-view hashing (CVH) [21] assumes the hashing codes to be a linear embedding of the original data points. It extends SH by minimizing the weighted average Hamming distance over the hashing codes of the training data pairs; the minimization is solved as a generalized eigenvalue problem. The performance of CVH decreases as the bit number increases, because most of the variance is contained in the top few eigenvectors [22]. Multilatent binary embedding (MLBE) [23] treats hashing codes as the binary latent factors in a probabilistic model and maps data points from multiple modalities to a common Hamming space. Semantics-preserving hashing (SePH) [24] learns the hashing codes by minimizing the KL-divergence of the probability distribution in Hamming space from that in semantic space. CMSSH, MLBE and SePH need to compute the pairwise distances among all data points.
Semantic correlation maximization (SCM) [25] circumvents this by learning only one bit at a time; the explicit computation of the affinity matrix is avoided through several mathematical manipulations. Multimodal discriminative binary embedding (MDBE) [22] derives from CMFH: it transforms the data matrices and the label matrix into a latent space and then transforms the data matrices in the latent space to match the label matrix.
III. Methodology
Let us define the used notation first. X^{(k)} ∈ R^{n×d_k} is the data matrix of the k-th modality, where n is the number of data points and d_k is the dimension of a data point in the k-th modality. We assume X^{(k)} is shuffled and zero-centered. X_l^{(k)} is the labeled data matrix comprised of the first n_l rows of X^{(k)} and X_u^{(k)} is the unlabeled data matrix comprised of the remaining n_u rows of X^{(k)}. Hence, we have n = n_l + n_u. B ∈ {-1, 1}^{n×c} is the hashing code matrix, where c is the code length. Y is the label matrix, where n_l is the number of labeled data points and t is the number of classes. Ŷ is the estimated label matrix. m^{(k)} is a vector whose elements are memberships and M^{(k)} is a diagonal matrix whose diagonal elements are m^{(k)}. The number of modalities excluding the label modality is K. The transformation matrices for estimating labels are P^{(k)} and P^{(L)} for the k-th modality and the label matrix, respectively. The transformation matrices for generating hashing codes are Q^{(k)} and Q^{(L)}, respectively. The subscript l is short for "labeled", while u is short for "unlabeled"; they indicate the correspondence to labeled or unlabeled data. The letters P and Q distinguish the transformations used for generating the label matrix and for generating the binary code matrix, respectively.

As illustrated in Fig. 1, the core parts of the proposed semi-supervised multimodal hashing (SSMH) are supervised hashing for generating labels, label estimation and supervised hashing for generating hashing codes, which are highlighted by green blocks. We introduce them consecutively in the following subsections.
III-A. Supervised hashing for estimating labels
By treating the label matrix as a special modality, we formulate the supervised hashing method for estimating labels as:

min_{B_Y, P^{(1)}, ..., P^{(K)}, P^{(L)}} sum_{k=1}^{K} ||B_Y - X_l^{(k)} P^{(k)}||_F^2 + λ ||B_Y - Y P^{(L)}||_F^2,    (4)

where B_Y is a temporary variable and λ is a predefined real positive constant. Eq. (4) is generally ill-posed since we have K+2 unknown variables (P^{(1)}, ..., P^{(K)}, P^{(L)} and B_Y) and only K+1 known constant matrices (X_l^{(1)}, ..., X_l^{(K)} and Y). Hence, it should be regularized. Inspired by the orthogonality regularization proposed in [26], we can modify Eq. (4) as:

min sum_{k=1}^{K} ||B_Y - X_l^{(k)} P^{(k)}||_F^2 + λ ||B_Y - Y P^{(L)}||_F^2 + μ (sum_{k=1}^{K} ||P^{(k)T} P^{(k)} - I||_F^2 + ||P^{(L)T} P^{(L)} - I||_F^2),    (5)

where I is the identity matrix and μ is a predefined real positive constant. Eq. (5) is solved by iteratively calculating P^{(k)}, P^{(L)} and B_Y. Taking the first derivative with respect to P^{(k)}:
∂/∂P^{(k)} = 2 X_l^{(k)T} (X_l^{(k)} P^{(k)} - B_Y) + 4μ P^{(k)} (P^{(k)T} P^{(k)} - I).    (6)
Similarly, we have
∂/∂P^{(L)} = 2λ Y^T (Y P^{(L)} - B_Y) + 4μ P^{(L)} (P^{(L)T} P^{(L)} - I).    (7)
Taking the first derivative of Eq. (5) with respect to B_Y and setting it to 0, we have
B_Y = (sum_{k=1}^{K} X_l^{(k)} P^{(k)} + λ Y P^{(L)}) / (K + λ).    (8)
The gradient descent method is used for updating P^{(k)} and P^{(L)}. The optimization procedure is shown in Algorithm 1, where η is the step size. The first terms of the derivatives with respect to P^{(k)} and P^{(L)} are normalized so that we can fix η for all our experiments; otherwise, η would have to be tuned for different sizes of X_l^{(k)}, because the magnitude of the gradient increases dramatically with large n_l.
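One alternating step of this kind can be sketched as follows. This is a minimal illustration under an assumed least-squares objective that ties each projected modality and the projected label matrix to a shared code matrix, with orthogonality penalties; all variable names are ours, and the normalization of the data-term gradient follows the description above.

```python
import numpy as np

def step(Xs, Y, By, Ps, PL, lam=1.0, mu=1.0, eta=1e-3):
    """One alternating update (sketch): gradient steps on the projection
    matrices Ps (per modality) and PL (labels), then a closed-form update
    of the shared code matrix By. Ps and PL are updated in place."""
    K = len(Xs)
    for k in range(K):
        G = 2 * Xs[k].T @ (Xs[k] @ Ps[k] - By)
        G = G / np.linalg.norm(G)  # normalize the data term for a fixed step size
        G += 4 * mu * Ps[k] @ (Ps[k].T @ Ps[k] - np.eye(Ps[k].shape[1]))
        Ps[k] -= eta * G
    G = 2 * lam * Y.T @ (Y @ PL - By)
    G = G / np.linalg.norm(G)
    G += 4 * mu * PL @ (PL.T @ PL - np.eye(PL.shape[1]))
    PL -= eta * G
    # closed-form update: By is the weighted average of all projections
    By = (sum(X @ P for X, P in zip(Xs, Ps)) + lam * Y @ PL) / (K + lam)
    return By, Ps, PL
```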
III-B. Generating labels
We generate the candidate label matrix Ŷ^{(k)} = X_u^{(k)} P^{(k)} for the unlabeled data in the k-th modality. Unlike B_Y, which is a unified code matrix for all modalities, the candidate matrices Ŷ^{(k)} generally differ from each other. We assume that P^{(L)} can transform the unknown label matrix Ŷ to match Ŷ^{(k)} with a certain probability. The probability that Ŷ transformed by P^{(L)} matches Ŷ^{(k)} is denoted as m^{(k)}. The i-th element of m^{(k)} corresponds to that probability for the i-th label vector, i.e., the i-th row of Ŷ.
In fuzzy c-means clustering (FCM), the membership indicates the probability that a data point belongs to a centroid; this probability is determined by the distance between the data point and the centroid. Similar to FCM, we use the following objective function to estimate the labels,
min_{Ŷ, m^{(1)}, ..., m^{(K)}} sum_{k=1}^{K} sum_{i=1}^{n_u} (m_i^{(k)})^p d_i^{(k)}, s.t. sum_{k=1}^{K} m^{(k)} = 1, m^{(k)} ≥ 0,    (9)

where p is the fuzzifier that determines the level of fuzziness of the memberships, ŷ_i is the i-th row of Ŷ, m_i^{(k)} is the i-th element of m^{(k)}, 1 is an n_u-dimensional vector of ones, 0 is an n_u-dimensional vector of zeros and

d_i^{(k)} = ||ŷ_i P^{(L)} - ŷ_i^{(k)}||^2,    (10)

where ŷ_i^{(k)} is the i-th row of Ŷ^{(k)}.
Eq. (9) can be written in matrix form:
min_{Ŷ, M^{(1)}, ..., M^{(K)}} sum_{k=1}^{K} tr((Ŷ P^{(L)} - Ŷ^{(k)})^T (M^{(k)})^p (Ŷ P^{(L)} - Ŷ^{(k)})),    (11)
where tr(·) denotes the trace of a matrix and M^{(k)} is a diagonal matrix whose diagonal elements are m^{(k)}. Similar to FCM, we iteratively minimize Eq. (11). Taking the partial derivative with respect to Ŷ and setting it to 0, we have
Ŷ = (sum_{k=1}^{K} (M^{(k)})^p)^{-1} (sum_{k=1}^{K} (M^{(k)})^p Ŷ^{(k)}) P^{(L)T} (P^{(L)} P^{(L)T})^{-1},    (12)
where the superscript "-1" denotes the matrix inverse. P^{(L)} is a square matrix, and so is P^{(L)} P^{(L)T}. P^{(L)} is generally invertible when it is randomly initialized, and we empirically found that the hashing method in Subsection III-A also led to an invertible P^{(L)}. It is easy to see that if P^{(L)} is invertible, P^{(L)} P^{(L)T} is also invertible and its inverse matrix is (P^{(L)T})^{-1} (P^{(L)})^{-1}.
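When P^{(L)} is invertible, P^{(L)T} (P^{(L)} P^{(L)T})^{-1} collapses to (P^{(L)})^{-1}, so the update of the estimated label matrix reduces to a membership-weighted average of the candidate label matrices mapped through the inverse transformation. A sketch under this weighted least-squares reading (variable names are illustrative):

```python
import numpy as np

def update_estimated_labels(Yhats, M, PL, p=2.0):
    """Membership-weighted update of the estimated label matrix (sketch).
    Yhats: K candidate label matrices of shape (n_u, t);
    M: (K, n_u) memberships; PL: invertible (t, t) transformation."""
    W = M ** p                                         # fuzzified weights
    num = sum(W[k][:, None] * Yhats[k] for k in range(len(Yhats)))
    den = W.sum(axis=0)[:, None]                       # per-point normalizer
    return (num / den) @ np.linalg.inv(PL)
```

With uniform memberships and PL equal to the identity, the update is just the plain average of the candidates, which is a quick sanity check.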
By introducing Lagrange multipliers λ_i, Eq. (11) can be reformulated as an unconstrained minimization problem:
L = sum_{k=1}^{K} sum_{i=1}^{n_u} (m_i^{(k)})^p d_i^{(k)} + sum_{i=1}^{n_u} λ_i (sum_{k=1}^{K} m_i^{(k)} - 1).    (13)
Let us define

d^{(k)} = diag((Ŷ P^{(L)} - Ŷ^{(k)}) (Ŷ P^{(L)} - Ŷ^{(k)})^T),    (14)

i.e., the vector whose i-th element is d_i^{(k)}. It can be deduced that

m^{(k)} = 1 / sum_{j=1}^{K} (d^{(k)} / d^{(j)})^{1/(p-1)},    (15)
where the division and exponentiation of vectors are element-wise. In Eq. (15), it is unnecessary to compute the full matrix product in Eq. (14), because only its diagonal elements are needed. Hence, we just compute the squared norm of each row of Ŷ P^{(L)} - Ŷ^{(k)}. The key steps are summarized in Algorithm 2.
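The element-wise membership update can be sketched as follows (an assumed FCM-style form; D holds the per-modality squared distances, and the names are ours). The memberships of each point sum to 1 over the modalities by construction.

```python
import numpy as np

def membership_update(D, p=2.0):
    """FCM-style membership update (sketch). D[k, i] is the squared
    distance between the transformed estimated label of point i and its
    candidate label in modality k; returns a (K, n_u) membership matrix."""
    expo = 1.0 / (p - 1.0)
    ratio = (D[:, None, :] / D[None, :, :]) ** expo  # shape (K, K, n_u)
    return 1.0 / ratio.sum(axis=1)                   # columns sum to 1
```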
III-C. Supervised hashing for generating hashing codes
The hashing method proposed in this subsection is slightly different from that introduced in Subsection III-A. As the estimated labels are not actual labels, we should reduce their effect on calculating the transformation matrices. Let us define B_l and B_u as the hashing code matrices for the labeled and unlabeled data, respectively, and let Ỹ be the concatenation of Y and Ŷ. We modify Eq. (5) to

min_{B, Q^{(1)}, ..., Q^{(K)}, Q^{(L)}} sum_{k=1}^{K} ||B - X^{(k)} Q^{(k)}||_F^2 + β ||B - Ỹ Q^{(L)}||_F^2 + μ (sum_{k=1}^{K} ||Q^{(k)T} Q^{(k)} - I||_F^2 + ||Q^{(L)T} Q^{(L)} - I||_F^2), s.t. B ∈ {-1, 1}^{n×c},    (16)

where β and μ are predefined positive constants and B is the concatenation of B_l and B_u. The minimization procedure is similar to Algorithm 1. Therefore, to save space, we only give the substitutes of Eq. (6), Eq. (7) and Eq. (8). To generate hashing codes, Eq. (6) should be substituted by
∂/∂Q^{(k)} = 2 X^{(k)T} (X^{(k)} Q^{(k)} - B) + 4μ Q^{(k)} (Q^{(k)T} Q^{(k)} - I),    (17)
Eq. (7) should be substituted by
∂/∂Q^{(L)} = 2β Ỹ^T (Ỹ Q^{(L)} - B) + 4μ Q^{(L)} (Q^{(L)T} Q^{(L)} - I),    (18)
and Eq. (8) should be substituted by
B = sgn(sum_{k=1}^{K} X^{(k)} Q^{(k)} + β Ỹ Q^{(L)}).    (19)
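The final binarization step can be sketched as follows (an assumed form; under a binary constraint, the least-squares fit of a code matrix to a sum of weighted projections is the element-wise sign, and all names here are illustrative):

```python
import numpy as np

def fuse_codes(Xs, Qs, Ytilde, QL, beta=1.0):
    """Fuse the modality projections X^(k) Q^(k) and the concatenated label
    projection (weighted by beta), then binarize by sign thresholding."""
    S = sum(X @ Q for X, Q in zip(Xs, Qs)) + beta * Ytilde @ QL
    return np.where(S >= 0, 1, -1)  # ties mapped to +1 for determinism
```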
III-D. Implementation details
Initialization. In our experiments, all transformation matrices, including P^{(k)}, P^{(L)}, Q^{(k)} and Q^{(L)}, were randomly initialized and normalized. In the label generation step, Ŷ was initialized with an arbitrary candidate label matrix Ŷ^{(k)}.
Parameter setting. The constants λ and μ were fixed for the two supervised hashing methods, as were the fuzzifier p and the constant β. The step size η was set as 0.001. The maximum number of iterations for the two supervised hashing methods was set as 400, while it was set as 15 for the label generation method.
IV. Experimental Results
IV-A. Data sets and baselines
MIRFlickr [27] contains 25,000 entries, each of which consists of one image, several textual tags and labels. Following the literature [24], we only keep the textual tags appearing at least 20 times and remove entries that have no label, leaving 20,015 entries. For each entry, the image is represented by a 512-dimensional GIST descriptor [28] and the text is represented by a 500-dimensional feature vector derived from PCA on the index vectors of the textual tags. 5% of the entries are randomly selected for testing and the remaining entries are used as the training set. In the training set, we use 10%, 50% and 90% of the labels to construct three partially labeled data sets. The ground-truth semantic neighbors of a test entry, i.e., a query, are defined as those sharing at least one label.
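The preprocessing described above can be sketched as follows (not the authors' script; the entry structure and dictionary keys are illustrative assumptions):

```python
from collections import Counter

def filter_entries(entries, min_tag_count=20):
    """Keep only tags appearing at least `min_tag_count` times across the
    collection and drop entries that have no label."""
    counts = Counter(tag for e in entries for tag in e["tags"])
    kept = []
    for e in entries:
        if not e["labels"]:
            continue  # entries without labels are removed
        tags = [t for t in e["tags"] if counts[t] >= min_tag_count]
        kept.append({**e, "tags": tags})
    return kept
```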
The proposed method is compared with five state-of-the-art supervised multimodal hashing methods, CMSSH [20], CVH [21], SCM [25], SePH [24] and MDBE [22], and one semi-supervised method, S3FH [1]. For the supervised methods, 100% of the labels are used to train the models.
IV-B. Results and analysis
Mean average precision (MAP), which varies between 0 and 1, is a widely used evaluation metric for retrieval performance. Table I shows the MAP of the compared methods on MIRFlickr. "Image-text" means using images to search texts, while "text-image" means using texts to search images. In Table I, the methods are sorted in ascending order according to the criterion that if method I performs better than method II on at least three experiments, method I is considered superior. It can be seen that with only 10% of the labels, SSMH approximates the worst fully supervised method. With 50% of the labels, SSMH achieves a medium performance among the compared methods. With 90% of the labels, SSMH surpasses all compared methods except MDBE; however, the results of SSMH(90%) and MDBE do not differ significantly from each other.

Task  Method  16 bits  32 bits  64 bits  96 bits  128 bits
ImageText  S3FH(10%)  0.5894  0.5902  0.5951  0.5947  0.5912 
CMSSH  0.5966  0.5674  0.5581  0.5692  0.5701  
SSMH(10%)  0.5802  0.5945  0.5953  0.5956  0.5916  
S3FH(50%)  0.5954  0.6028  0.6057  0.6044  0.6099  
CVH  0.6591  0.6145  0.6133  0.6091  0.6052  
S3FH(90%)  0.6116  0.6273  0.6145  0.6125  0.6191  
SSMH(50%)  0.6326  0.6333  0.6372  0.6381  0.6344  
SCM  0.6251  0.6361  0.6417  0.6446  0.6480  
SePH  0.6505  0.6447  0.6453  0.6497  0.6612  
SSMH(90%)  0.6612  0.6654  0.6818  0.6906  0.6950  
MDBE  0.6784  0.6950  0.6983  0.7048  0.7056  
TextImage  S3FH(10%)  0.5743  0.5835  0.5976  0.5997  0.5903 
SSMH(10%)  0.5891  0.5994  0.6008  0.5924  0.5982  
CVH  0.6495  0.6213  0.6179  0.6050  0.5948  
S3FH(50%)  0.5892  0.6069  0.6120  0.6126  0.6161  
S3FH(90%)  0.6281  0.6299  0.6315  0.6350  0.6329  
SCM  0.6194  0.6302  0.6377  0.6377  0.6417  
SSMH(50%)  0.6318  0.6335  0.6402  0.6411  0.6395  
CMSSH  0.6613  0.6510  0.6756  0.6643  0.6471  
SePH  0.6745  0.6824  0.6917  0.7059  0.7110  
SSMH(90%)  0.6672  0.7146  0.7254  0.7255  0.7332  
MDBE  0.6723  0.7237  0.7353  0.7355  0.7387 
V. Conclusion
In this paper, we proposed a semi-supervised multimodal hashing (SSMH) method. SSMH first utilizes a supervised hashing method to generate, for the labeled data, a code matrix of the same dimension as the label matrix. Then, it transforms the unlabeled data matrices to generate candidate label matrices. In each modality, a membership variable is introduced to represent the probability that the transformed label matrix for the unlabeled data belongs to this modality. By iteratively calculating the membership variables and estimating the label matrix, SSMH generates a label matrix for the unlabeled data. Finally, the supervised hashing method of the first step is modified to generate a unified hashing code matrix. Experiments showed that the performance of SSMH roughly ranged between that of the worst and the best compared supervised methods as the percentage of available labels ranged from 10% to 90%.
References
[1] J. Wang, G. Li, P. Pan, and X. Zhao, "Semi-supervised semantic factorization hashing for fast cross-modal retrieval," Multimedia Tools and Applications, vol. 76, no. 19, pp. 20197–20215, Oct 2017.
 [2] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in Advances in Neural Information Processing Systems, 2008, pp. 1753–1760.

[3] W. Liu, J. Wang, and S.-F. Chang, "Hashing with graphs," in International Conference on Machine Learning, 2011.
[4] W. Liu, C. Mu, S. Kumar, and S.-F. Chang, "Discrete graph hashing," in Advances in Neural Information Processing Systems, 2014.
 [5] W. Kong and W.J. Li, “Isotropic hashing,” in Advances in Neural Information Processing Systems, 2012, pp. 1646–1654.

[6] B. Xu, J. Bu, Y. Lin, C. Chen, X. He, and D. Cai, "Harmonious hashing," in International Joint Conference on Artificial Intelligence, 2013, pp. 1820–1826.
[7] M. Norouzi and D. J. Fleet, "Cartesian k-means," in IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3017–3024.
[8] C. Strecha, A. M. Bronstein, M. M. Bronstein, and P. Fua, "LDAHash: Improved matching with smaller descriptors," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 1, pp. 66–78, May 2012.

[9] J. Tang, Z. Li, M. Wang, and R. Zhao, "Neighborhood discriminant hashing for large-scale image retrieval," IEEE Transactions on Image Processing, vol. 24, no. 9, pp. 2827–2840, Sept 2015.
[10] J.-P. Heo, Y. Lee, J. He, S.-F. Chang, and S.-E. Yoon, "Spherical hashing," in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2957–2964.
 [11] D. Tian and D. Tao, “Global hashing system for fast image search,” IEEE Transactions on Image Processing, vol. 26, no. 1, pp. 79–89, Jan 2017.
[12] D. Zhang, F. Wang, and L. Si, "Composite hashing with multiple information sources," in Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, 2011, pp. 225–234.
[13] J. Song, Y. Yang, Z. Huang, H. T. Shen, and J. Luo, "Effective multiple feature hashing for large-scale near-duplicate video retrieval," IEEE Transactions on Multimedia, vol. 15, no. 8, pp. 1997–2008, Dec. 2013.
[14] L. Liu, M. Yu, and L. Shao, "Multiview alignment hashing for efficient image search," IEEE Transactions on Image Processing, vol. 24, no. 3, pp. 956–966, March 2015.
[15] J. Cheng, C. Leng, P. Li, M. Wang, and H. Lu, "Semi-supervised multi-graph hashing for scalable similarity search," Computer Vision and Image Understanding, vol. 124, pp. 12–21, 2014.
[16] C. Zhang and W. S. Zheng, "Semi-supervised multi-view discrete hashing for fast image search," IEEE Transactions on Image Processing, vol. 26, no. 6, pp. 2604–2617, June 2017.
[17] J. Song, Y. Yang, Y. Yang, Z. Huang, and H. T. Shen, "Inter-media hashing for large-scale retrieval from heterogeneous data sources," in Proceedings of the ACM SIGMOD International Conference on Management of Data, 2013, pp. 785–796.
[18] X. Zhu, Z. Huang, H. T. Shen, and X. Zhao, "Linear cross-modal hashing for efficient multimedia search," in Proceedings of the ACM International Conference on Multimedia, 2013, pp. 143–152.
 [19] G. Ding, Y. Guo, and J. Zhou, “Collective matrix factorization hashing for multimodal data,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2083–2090.
[20] M. M. Bronstein, A. M. Bronstein, F. Michel, and N. Paragios, "Data fusion through cross-modality metric learning using similarity-sensitive hashing," in IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 3594–3601.
[21] S. Kumar and R. Udupa, "Learning hash functions for cross-view similarity search," in Proceedings of the International Joint Conference on Artificial Intelligence, July 2011.
[22] D. Wang, X. Gao, X. Wang, L. He, and B. Yuan, "Multimodal discriminative binary embedding for large-scale cross-modal retrieval," IEEE Transactions on Image Processing, vol. 25, no. 10, pp. 4540–4554, Oct. 2016.
[23] Y. Zhen and D.-Y. Yeung, "A probabilistic model for multimodal hash function learning," in Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2012, pp. 940–948.
[24] Z. Lin, G. Ding, J. Han, and J. Wang, "Cross-view retrieval via probability-based semantics-preserving hashing," IEEE Transactions on Cybernetics, vol. 47, no. 12, pp. 4342–4355, Dec 2017.
[25] D. Zhang and W.-J. Li, "Large-scale supervised multimodal hashing with semantic correlation maximization," in Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014, pp. 2177–2183.
[26] D. Wang, P. Cui, M. Ou, and W. Zhu, "Deep multimodal hashing with orthogonal regularization," in Proceedings of the International Joint Conference on Artificial Intelligence, 2015, pp. 2291–2297.
 [27] M. J. Huiskes and M. S. Lew, “The MIR flickr retrieval evaluation,” in Proceedings of the ACM International Conference on Multimedia Information Retrieval, 2008.
 [28] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” International Journal of Computer Vision, vol. 42, no. 3, pp. 145–175, May 2001.