1 Introduction
Similarity is a fundamental notion underlying a variety of computer vision, pattern recognition, and machine learning tasks ranging from retrieval, ranking, classification, and clustering to object detection, tracking, and registration. In all these problems, one has to quantify the degree of similarity between objects, usually represented as feature vectors. While in some cases domain-specific knowledge dictates a natural similarity function, in general a “natural” measure of similarity is rather elusive and cannot be constructed without side information provided, e.g., through human annotation. An even more challenging setting frequently arises in tasks involving multiple media or data coming from different modalities. For example, a medical image of the same organ can be obtained using different physical processes such as CT and MRI; a multimedia search engine may perform queries in a corpus consisting of audio, video, and textual information. While domain knowledge can be used to construct reasonable similarity functions for each data modality, it is much more challenging to create a consistent and meaningful similarity measure across them.
Previous work
The idea of constructing similarity measures suited to specific data has been thoroughly explored by the statistics and machine learning communities. One can roughly divide similarity learning methods into unsupervised and supervised. The former class uses only the data, with no additional side information. Unsupervised methods include PCA and its kernelized version (Schoelkopf et al. (1997)), which approximate the data globally by their second-order statistics either in the original Euclidean space or in a feature space represented by a kernel; and various local embedding methods such as locally linear embedding (Roweis & Saul (2000)), Laplacian eigenmaps (Belkin & Niyogi (2003)), and diffusion maps (Coifman & Lafon (2006)), which are all based on the assumption that data residing in a high-dimensional Euclidean space actually belong to a low-dimensional manifold, a parametrization of which is sought. Unsupervised methods are inherently limited by their inability to incorporate side information into the learning process.
Supervised methods can be further subdivided according to the type of side information they rely on. Class labels are the most straightforward form of side information, and are used in methods dating back to LDA (Johnson & Wichern (2002)) and its kernelized version (Mika et al. (1999)) as well as in the more modern approaches of Xing et al. (2002); Weinberger & Saul (2009). Other methods accept side information in the form of known similar and dissimilar pairs (Davis et al. (2007)) or triplets of the form “ is more similar to than ” (Shen et al. (2009); McFee & Lanckriet (2009)). A family of methods referred to as multidimensional scaling (MDS) relies on metric dissimilarity values supplied on a training set of pairs of data vectors, and seeks a Euclidean representation reproducing them as faithfully as possible (Borg & Groenen (2005)). Once the embedding into the representation space has been learned, similarity to new, unseen data is computed either directly (if the metric admits a parametric representation) or using an out-of-sample extension.
Similarity learning methods can also be classified by the type of similarity function they produce. A significant class of practical methods learns a linear projection making the Euclidean metric optimal; this is essentially equivalent to learning a Mahalanobis distance (Weinberger & Saul (2009); Shen et al. (2009)). Kernelized versions of these approaches are often available for cases where the data have an intricate structure that cannot be captured by a linear transformation.
Hashing approaches represent the data as binary codes to which the Hamming metric is subsequently applied as the measure of similarity. These methods include the family of locality-sensitive hashing (LSH) (Gionis et al. (1999)) and the recently introduced spectral hashing (Weiss et al. (2008)). These approaches are mainly used to construct an efficient approximation of some trusted standard similarity such as the Jaccard index or the cosine distance, and are inapplicable if side information has to be relied upon.
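To illustrate why binary codes make similarity computation cheap, the following minimal Python sketch packs a binary code into an integer and computes the Hamming distance with one XOR and a bit count. The helper names `pack_bits`, `hamming`, and `rank_by_hamming` are illustrative assumptions and are not taken from any of the cited works.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into a single Python integer."""
    code = 0
    for b in bits:
        code = (code << 1) | (1 if b else 0)
    return code


def hamming(a, b):
    """Hamming distance between two packed codes: number of set bits in a XOR b."""
    return bin(a ^ b).count("1")


def rank_by_hamming(query, database):
    """Return database indices sorted by increasing Hamming distance to the query."""
    return sorted(range(len(database)), key=lambda i: hamming(query, database[i]))
```

For instance, `rank_by_hamming(pack_bits([1, 0, 1, 1]), codes)` orders a list of packed codes by their distance to the query code; ties keep their original order because Python's sort is stable.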
Shakhnarovich et al. (2003) proposed to construct optimal LSH-like hashes (referred to as similarity-sensitive hashing, or SSH) using supervised learning. More efficient approaches have subsequently been proposed by Torralba et al. (2008) and Strecha et al. (2012).
The extension of similarity learning to multimodal data has been addressed in the literature only very recently. Bronstein et al. (2010) used a supervised learning algorithm based on boosting to construct hash functions of data belonging to different modalities in a way that makes them comparable using the Hamming metric. This method can be viewed as an extension of SSH to the multimodal setting, dubbed by the authors cross-modal SSH (CMSSH), and it enjoys the compactness of the representation and the low complexity involved in distance computation. McFee & Lanckriet (2011) proposed to learn multimodal similarity using ideas from multiple kernel learning (Bach et al. (2004); McFee & Lanckriet (2009)). Multimodal kernel learning approaches have been proposed by Lee et al. (2009) for medical image registration, and by Weston et al. (2010). The main disadvantage of the latter is that it is limited to linear projections. The framework proposed by McFee & Lanckriet (2011) can be kernelized, but it involves computationally expensive semidefinite programming, which limits scalability. Also, both algorithms produce continuous Mahalanobis metrics, which is disadvantageous in both computational and storage complexity, especially when dealing with large-scale data. The appealing property of similarity-preserving hashing methods like CMSSH (Bronstein et al. (2010)) is the compactness of the representation and the low complexity involved in distance computation.
Contributions
This paper is motivated by the work of Bronstein et al. (2010) on multimodal similarity-preserving hashing. We propose a novel multimodal similarity learning framework based on neural networks. Our approach has several advantages over the state of the art. First, we combine intra- and intermodal similarity into a single framework. This allows exploiting richer information about the data and tolerating missing modalities; the latter is especially important in sensor networks, where one or more sensors may fail, or in applications like multimedia retrieval, where it is hard to obtain reliable samples of cross-modal similarity. We show that previous works can be considered as particular cases of our model. Second, we solve the full optimization problem without resorting to relaxations as in SSH-like methods; it has recently been shown that such a relaxation degrades the hashing performance (see, e.g., Strecha et al. (2012); Masci et al. (2011)). Third, we introduce a novel coupled siamese neural network architecture to solve the optimization problem underlying our multimodal hashing framework. Fourth, the use of neural networks generalizes very naturally to more complex nonlinear projections using multilayered networks, which makes it possible to produce embeddings of arbitrarily high complexity. We show experimental results on several standard multimodal datasets demonstrating that our approach compares favorably to state-of-the-art algorithms.
2 Background
Let and be two spaces representing data belonging to different modalities (e.g., are images and are text descriptions). Note that even though we assume that the data can be represented in the Euclidean space, the similarity of the data is not necessarily Euclidean and in general can be described by some metrics and , to which we refer as intramodal dissimilarities. Furthermore, we assume that there exists some intermodal dissimilarity quantifying the “distance” between points in different modalities. The ensemble of intra- and intermodal structures is not necessarily a metric in the strict sense. In order to deal with these structures in a more convenient way, we try to represent them in a common metric space. In particular, the choice of the Hamming space offers significant advantages in the compact representation of the data as binary vectors and the efficient computation of their similarity.
Multimodal similarity-preserving hashing is the problem of representing the data from different modalities in a common space of dimensional binary vectors with the Hamming metric by means of two embeddings, and , mapping similar points as close as possible to each other and dissimilar points as distant as possible from each other, such that , , and . In a sense, the embeddings act as a metric coupling, trying to construct a single metric that preserves the intra- and intermodal similarities. A simplified setting of the multimodal hashing problem used in Bronstein et al. (2010) is cross-modality similarity-preserving hashing, in which only the intermodal dissimilarity is taken into consideration and are ignored. To the best of our knowledge, the full multimodal case has never been addressed before.
For simplicity, in the following discussion we assume the side information given as the intra- and intermodal dissimilarities to be binary, , i.e., a pair of points can be either similar or dissimilar. This dissimilarity is usually unknown and hard to model; however, it should be possible to sample on some subset of the data . This sample can be represented as sets of similar pairs of points (positives) , , and , and likewise defined sets , and of dissimilar pairs of points (negatives). In many practical applications such as image annotation or text-based image search, it might be hard to get the intermodal positive and negative pairs, but easy to get the intramodal ones.
The problem of multimodal similarity-preserving hashing boils down to finding two embeddings and that minimize the aggregate of false positive and false negative rates,
(1)  
Cross-modality similarity-sensitive hashing
Bronstein et al. (2010) studied the particular case of cross-modal similarity hashing (without incorporating intramodality similarity), with linear embeddings of the form and . Their CMSSH algorithm constructs the dimensions of and one by one using boosting. At each iteration, one-dimensional embeddings and are found using a two-stage scheme: first, the embeddings are linearized as and and the resulting objective is minimized to find the projection
(2) 
(here and are unit vectors representing the th row of the matrices and , respectively, and the expectations are weighted by per-sample weights adjusted by the boosting). With such an approximation, the optimal projection directions and have closed-form expressions using the SVD of the positive and negative covariance matrices. At the second stage, the thresholds and are found by one-dimensional search.
This approach has several drawbacks. First, CMSSH solves a particular setting of problem (1) with only, thus ignoring the intramodality similarities. Second, the assumption of separability (treating each dimension separately) and the linearization of the objective replace the original problem with a relaxed version, whose optimization produces suboptimal solutions that tend to increase the hash sizes (or alternatively, for a fixed hash length , the method manifests inferior performance; see Masci et al. (2011)). Finally, this approximation is limited to a relatively narrow class of linear embeddings that often do not capture the structure of the data well.
3 Multimodal NN hashing
Our approach for multimodal hashing is related to supervised methods for dimensionality reduction and in particular extends the framework of Schmidhuber & Prelinger (1993); Hadsell et al. (2006); Taylor et al. (2011), also known as the siamese architecture. These methods learn a mapping onto a usually low-dimensional feature space such that similar observations are mapped to nearby points in the new manifold and dissimilar observations are pulled apart. In our simplest setting, the linear embedding is realized as a neural network with a single layer (where represent the linear weights and is the bias) and a sign activation function (in practice, we use a smooth approximation ). The parameters of the embedding can be learned using the backpropagation algorithm (Werbos (1974)) minimizing the loss
(3) 
w.r.t. the network parameters . Note that for binary vectors (when ), the squared Euclidean distance in (3) is equivalent up to constants to the Hamming distance. The second term in (3) is a hinge loss that provides robustness to outliers and produces a mapping for which negatives are pulled apart. The system is fed with pairs of samples which share the same parametrization and for which a corresponding dissimilarity is known, for positives and for negatives (hence the name siamese network, i.e., two inputs and a common output vector).
Coupled siamese architecture
In the multimodal setting, we have two embeddings and , each cast as a siamese network with parameters and , respectively. Such an architecture makes it possible to learn similarity-sensitive hashing for each modality independently by minimizing the loss functions . In order to incorporate intermodal similarity, we couple the two siamese networks by the cross-modal loss
(4) 
thus jointly learning two sets of parameters, one for each modality. We refer to this model, which generalizes the siamese framework, as coupled siamese networks; a schematic representation is shown in Figure 1.
Our implementation differs from the original architecture of Hadsell et al. (2006) in the choice of the output activation function (we use activation that encourages binary representations rather than a linear output layer). This way the maximum distance is bounded by and by simply enlarging the margin between dissimilar pairs we enforce the learning of codes which differ by the sign of their components. Once the model is learned, hashes are produced by thresholding the output.
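The relation between thresholded outputs, the Hamming distance, and the squared Euclidean distance can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses tanh as the smooth stand-in for the sign activation and zero as the threshold, and the names `embed`, `binarize`, `hamming`, and `sq_euclidean` are hypothetical. For codes with entries in {-1, +1}, every differing component contributes 4 to the squared Euclidean distance, so the latter equals four times the Hamming distance.

```python
import math


def embed(x, W, b):
    """Single-layer embedding; tanh is an assumed smooth surrogate for sign."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]


def binarize(z):
    """Threshold real-valued outputs at zero (assumed) to obtain a +/-1 code."""
    return [1 if v > 0 else -1 for v in z]


def hamming(u, v):
    """Number of components in which two codes differ."""
    return sum(1 for a, c in zip(u, v) if a != c)


def sq_euclidean(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - c) ** 2 for a, c in zip(u, v))


# Toy weights and inputs, chosen only for illustration.
W = [[1.0, -1.0], [0.5, 0.5], [-2.0, 1.0]]
b = [0.0, 0.0, 0.0]
hx = binarize(embed([1.0, 2.0], W, b))
hy = binarize(embed([2.0, 1.0], W, b))
assert sq_euclidean(hx, hy) == 4 * hamming(hx, hy)
```

Because the tanh outputs are bounded, a sufficiently large margin between dissimilar pairs can only be satisfied by outputs of opposite sign, which is the behavior the thresholding step relies on.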
Training
The training of our coupled siamese network is performed by minimizing
(5) 
where are weights determining the relative importance of each modality. The loss (5) can be considered as a generalization of the loss in (1), which is obtained by setting , margins = 0, and . Furthermore, setting , we obtain the particular setting of the cross-modal loss, whose relaxed version is minimized by the CMSSH algorithm of Bronstein et al. (2010). It is also worth repeating that in many practical cases, it is very hard to obtain reliable cross-modal training samples () but much easier to obtain intramodal samples (). In the full multimodal setting (), the terms can be considered as a regularization, preventing the algorithm from overfitting.
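As a rough illustration of how a cross-modal term and weighted intramodal terms combine, the sketch below assumes a contrastive loss in the style of Hadsell et al. (2006): half the squared distance for positive pairs and half the squared hinge on the margin for negative pairs. The exact form of (3)–(5) may differ; the function and parameter names (`contrastive_loss`, `coupled_loss`, `alpha_x`, `alpha_y`, `m_xy`, `m_x`, `m_y`) are assumptions made for this example.

```python
import math


def sq_dist(u, v):
    """Squared Euclidean distance between two embedded vectors."""
    return sum((a - c) ** 2 for a, c in zip(u, v))


def contrastive_loss(pairs, margin):
    """Assumed Hadsell-style loss over (u, v, is_similar) triples of embedded vectors."""
    total = 0.0
    for u, v, is_similar in pairs:
        d2 = sq_dist(u, v)
        if is_similar:
            total += 0.5 * d2                                   # pull positives together
        else:
            total += 0.5 * max(0.0, margin - math.sqrt(d2)) ** 2  # push negatives past the margin
    return total


def coupled_loss(cross_pairs, intra_x_pairs, intra_y_pairs,
                 m_xy, m_x, m_y, alpha_x, alpha_y):
    """Cross-modal term plus weighted intramodal terms (assumed weighting scheme)."""
    return (contrastive_loss(cross_pairs, m_xy)
            + alpha_x * contrastive_loss(intra_x_pairs, m_x)
            + alpha_y * contrastive_loss(intra_y_pairs, m_y))
```

In this sketch, setting both weights to zero leaves only the cross-modal term, mirroring the reduction to the cross-modal setting described above.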
We apply the backpropagation algorithm (Werbos (1974); LeCun (1985); Rumelhart et al. (1986)) to get the gradient of our model w.r.t. the embedding parameters. The gradients of the intra- and intermodal loss functions w.r.t. the parameters of are given by
where the term is the usual backpropagation step of a neural network. An equivalent derivation applies to the parameters of . The model can easily be learned jointly using any gradient-based technique such as conjugate gradient or stochastic gradient descent.
Nonlinear embeddings
Our model straightforwardly generalizes to nonlinear embeddings using a multilayered network architecture. The proposed framework is in fact general, and any class of neural networks can be applied to arbitrarily increase the complexity of the embedding. Deep and hierarchical models are able to model highly nonlinear embeddings and scale well to large-scale data by means of fully online learning, where the parameters are updated after every input tuple presentation. This makes it possible to sample a possibly huge space with constant memory requirements.
4 Results
We tested our algorithm on cross-modal data retrieval tasks using standard datasets from the shape retrieval and multimedia retrieval communities. We compared three algorithms: our coupled siamese framework in the full multimodal setting (MMNN) and its reduced version (CMNN), as well as CMSSH. The single-layer version (denoted L1) of CMNN and MMNN realizes a linear embedding function and compares directly with CMSSH. The two-layer version (L2) yields more complex nonlinear embeddings. For training the neural networks, we used conjugate gradients. The hash functions learned by each of the methods were applied to the data in the datasets, and the Hamming distance was used to rank the matches. Retrieval performance was evaluated using mean average precision , where is the relevance of a given rank (one if relevant and zero otherwise), is the number of retrieved results, and is precision at , defined as the percentage of relevant results in the first top-ranked retrieved matches.
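The evaluation measure can be sketched as follows. This is a minimal illustration that assumes the common convention of normalizing average precision by the number of relevant items among the retrieved results; the normalization used in the experiments may differ, and the function names are hypothetical.

```python
def average_precision(relevance):
    """Average precision of one ranked list.

    relevance: list of 0/1 flags in ranked order (1 = relevant).
    Precision at rank r is accumulated only at ranks holding a relevant item,
    then divided by the number of relevant retrieved items (assumed convention).
    """
    hits = 0
    total = 0.0
    for r, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / r  # precision at rank r
    return total / hits if hits else 0.0


def mean_average_precision(relevance_lists):
    """Mean of the per-query average precisions."""
    return sum(average_precision(r) for r in relevance_lists) / len(relevance_lists)
```

For a ranked list with relevance flags `[1, 0, 1, 0]`, precision is 1 at rank 1 and 2/3 at rank 3, so the average precision under this convention is their mean, 5/6.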
ShapeGoogle
In the first experiment, we reproduced the multimodal shape retrieval experiment of Bronstein et al. (2010) using the ShapeGoogle dataset (Bronstein et al. (2011)), containing 583 geometric shapes of 12 different classes subjected to synthetic transformations as well as 456 unrelated shapes (“distractors”). The goal was to correctly match a transformed shape from the query set to the rest of the dataset. The shapes were represented using 32-dimensional bags of geometric features (BoF) and 64-dimensional spatially-sensitive bags of geometric features (SSBoF). To learn the hashing functions, we used positive and negative sets of size and , respectively. For CMSSH, we used the code with settings provided by Bronstein et al. (2010). For MMNN, we used a single-layer architecture with margins , and , which we empirically found to be the best combination (additional results with different parameters are shown in Table 3 and in supplementary materials). In addition, we also show a two-layer architecture with hidden nodes. For CMNN we used as for the single-layer case. MMNN used , and , .
Figure 2 and Table 1 show the performance of the different methods as a function of hash length . First, we can see that NN-based methods (CMNN and MMNN) dramatically outperform the boosting-based CMSSH for a fixed hash length. MMNN achieves almost perfect performance using only bits (for comparison, CMSSH requires almost bits to achieve similar performance). The reason is likely to be the fact that CMSSH resorts to a relaxation of the problem, thus producing a suboptimal solution, while NNhash solves the “true” optimization problem. Secondly, by adding another layer to the neural network we obtain a nonlinear hashing function, which performs dramatically better than a single-layer architecture, achieving near-perfect performance with bits. Thirdly, the fully multimodal method (MMNN) consistently outperforms the cross-modal version (CMNN). We attribute this fact to the use of the intramodal losses, acting as regularization.
Figure 3 visually exemplifies a retrieval experiment for MMNNhash, where the query shape (leftmost) is compared with a similar shape (middle) and a dissimilar one (rightmost). The produced hash vectors are shown in the bottom row along with the BoF and the SSBoF descriptors.
The importance of intramodal regularization is exemplified in Table 2. In this experiment, we trained a single-layer network using a subset of the cross-modal data (), while keeping the intramodal data in the MMNN. The CMNN method manifested a significant performance drop (most likely attributable to overfitting), while the performance of MMNN remained practically unchanged.
BoF–SSBoF  BoF  SSBoF
MMNN  99.53%  99.28%  99.67%  99.63%  99.53%  99.26%
CMNN  98.44%  95.03%  98.96%  94.35%  98.51%  95.83%
Choice of parameters
The theoretically smallest hash length must be . However, since we are using a simple embedding, in practice an about 5–10 times larger may be required to achieve satisfactory results. Table 3 and Figure 4 show the performance of the NNhash methods under different choices of the parameters. We can see that the addition of intramodal regularization makes the cross-modal performance less sensitive to the choice of the parameters, and that MMNN produces higher cross-modal performance than CMNN for similar margin settings.
BoF–SSBoF  BoF  SSBoF  
12 bit  16 bit  12 bit  16 bit  12 bit  16 bit  
0.0  0.5  1  1  2  90.67%  99.38%  91.26%  98.87%  90.67%  99.38% 
0.5  0.0  2  2  3  93.12%  99.39%  94.83%  99.35%  91.79%  99.25% 
0.1  0.1  1  1  3  95.62%  99.00%  96.05%  98.93%  95.62%  99.00% 
0.3  0.3  2  2  2  92.54%  99.62%  92.51%  99.45%  92.54%  99.62% 
0.3  0.3  1  1  3  98.88%  99.01%  98.97%  99.24%  98.88%  99.01% 
0.3  0.3  2  2  3  98.28%  98.55%  98.74%  98.51%  98.28%  99.48% 
0.5  0.5  3  3  3  95.59%  99.07%  95.46%  99.13%  95.59%  98.93% 
0  0  0  0  2  96.03%  96.96%  96.52%  96.97%  95.51%  96.72% 
3  87.97%  98.77%  88.60%  99.64%  87.97%  98.99%  
5  91.22%  95.60%  93.75%  97.06%  91.24%  95.60% 
NUS
In the second experiment, we used the NUS dataset of Chua et al. (2009), containing about 250K annotated images from Flickr. The images are manually categorized into 81 classes (one image can belong to more than a single class) and represented as 500-dimensional bags of SIFT features (BoF, used as the first modality) and 1000-dimensional bags of text tags (Tags, used as the second modality). The dataset was split into approximately equal parts for testing and training. We used positive and negative sets of size . Positive pairs were images belonging to at least one common class; negative pairs were images belonging to disjoint sets of classes. For MMNN, we used the margins and . CMNN used . Testing was performed using query and database sets of size approximately and , respectively. The first ten matches were found using approximate nearest neighbors (Arya et al. (1998)). Matches that had at least one class in common with the query were considered correct.
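The pair construction and the correctness criterion described above can be sketched as follows. This is a minimal illustration over multi-label data given as sets of class identifiers; the function names are assumptions, and the exhaustive enumeration shown here is quadratic in the number of items, so at the scale of the dataset a random sample of pairs would presumably be drawn instead.

```python
from itertools import combinations


def label_pairs(labels):
    """Split all index pairs into positives and negatives.

    labels: list of sets of class ids, one set per item.
    A pair is positive if the two items share at least one class,
    and negative if their class sets are disjoint.
    """
    positives, negatives = [], []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] & labels[j]:
            positives.append((i, j))
        else:
            negatives.append((i, j))
    return positives, negatives


def is_correct_match(query_labels, match_labels):
    """A retrieved match counts as correct if it shares a class with the query."""
    return bool(query_labels & match_labels)
```

With labels `[{1, 2}, {2, 3}, {4}, {1}]`, items 0 and 1 share class 2 and items 0 and 3 share class 1, so those two pairs are positive and the remaining four pairs are negative.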
Table 4 compares the performance of different methods. MMNN outperforms other approaches in all quality criteria. Figure 5 shows examples of top matches using MMNN.
BoF–Tags  Tags–BoF  BoF  Tags
Raw  –  –  57.0%  78.9%
MMNN  64.3%  55.00%  79.39%  87.41%
CMNN  61.1%  51.3%  75.62%  86.23%
CMSSH  53.7%  50.2%  54.0%  76.0%
Figure 6 shows retrieval results using as queries artificially created Tag vectors containing specific words such as “cloud”. These tags are hashed using and matched to BoFs hashed using . The retrieved results are meaningful and most of them belong to the same class. It is especially interesting to note that MMNN produced relevant results in both cases, irrespective of the ground-truth labels, which are in general noisy. Figure 7 shows image annotation results. We retrieve the top five Tags matches from a BoF query and assign the ten most frequent annotations to the image. We clearly see that MMNN produces better annotations than CMSSH in this case as well.
Example of text-based image retrieval on the NUS dataset using multimodal hashing. Shown are the top five image matches produced by CMSSH (top) and MMNN (bottom) in response to two different queries: cloud (left) and red+food (right). Relevant matches are shown in green.
Wiki
In the third experiment, we reproduced the results of Rasiwasia et al. (2010) using the dataset of 2866 annotated images from Wikipedia. The images are categorized into 10 classes and represented as 128-dimensional bags of SIFT features (Image modality) and a 10-dimensional LDA topic model (Text modality). The dataset was split into disjoint subsets of 2173 and 693 for training and testing, respectively. We used positive and negative sets of size , . Table 5 shows the mAP for the Image–Text and Text–Image cross-modal retrieval experiments. For reference, we also reproduce the results reported in Rasiwasia et al. (2010) using correlation matching (CM), semantic matching (SM), and semantic correlation matching (SCM). MMNN slightly outperforms SCM on average. We should stress, however, that these results are not directly comparable with ours: while Rasiwasia et al. (2010) find a Euclidean embedding, we use a Hamming embedding (in general, a more difficult problem). While having similar performance to SCM, the significant advantage of our approach is that it produces much smaller, compact binary codes (at least 10 times smaller) that can be searched very efficiently.
Method  Layers  Image–Text  Text–Image  Average
MMNN  L1  27.8%  21.2%  24.5%
MMNN  L2  28.5%  22.0%  25.3%
CMNN  L1  26.7%  20.9%  23.8%
CMNN  L2  27.1%  21.1%  24.1%
CMSSH  –  22.2%  18.4%  20.3%
CM  –  24.9%  19.6%  22.3%
SM  –  22.5%  22.3%  22.4%
SCM  –  27.7%  22.6%  25.2%
5 Conclusions
We introduced a novel learning framework for multimodal similarity-preserving hashing based on the coupled siamese neural network architecture. Unlike existing cross-modal similarity learning methods, our approach does not assume linear projections; in fact, by increasing the number of layers in the network, mappings of arbitrary complexity can be trained (our experiments showed that using a multilayer architecture results in a significant improvement in performance). We also solve the exact optimization problem during training, without the approximations made by the boosting-based CMSSH. Our method does not involve semidefinite programming and is scalable to a very large number of dimensions and training samples. Experimental results on standard multimedia retrieval datasets showed performance superior to state-of-the-art hashing approaches.
References
 Arya et al. (1998) Arya, S., Mount, D. M., Netanyahu, N. S., Silverman, R., and Wu, A. Y. An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. J. ACM, 45(6):891–923, 1998.
 Bach et al. (2004) Bach, F. R., Lanckriet, G. R. G., and Jordan, M. I. Multiple kernel learning, conic duality, and the SMO algorithm. In Proc. ICML, 2004.
 Belkin & Niyogi (2003) Belkin, M. and Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6):1373–1396, 2003.
 Borg & Groenen (2005) Borg, I. and Groenen, P. J. F. Modern multidimensional scaling: Theory and applications. Springer, 2005.
 Bronstein et al. (2011) Bronstein, A. M., Bronstein, M. M., Guibas, L. J., and Ovsjanikov, M. Shape google: Geometric words and expressions for invariant shape retrieval. TOG, 30(1):1–20, 2011.
 Bronstein et al. (2010) Bronstein, M. M., Bronstein, A. M., Michel, F., and Paragios, N. Data fusion through cross-modality metric learning using similarity-sensitive hashing. In Proc. CVPR, 2010.
 Chua et al. (2009) Chua, T.-S., Tang, J., Hong, R., Li, H., Luo, Z., and Zheng, Y.-T. NUS-WIDE: A real-world web image database from National University of Singapore. In Proc. CIVR, 2009.
 Coifman & Lafon (2006) Coifman, R. R. and Lafon, S. Diffusion maps. App. Comp. Harmonic Analysis, 21(1):5–30, 2006.
 Davis et al. (2007) Davis, J. V., Kulis, B., Jain, P., Sra, S., and Dhillon, I. S. Information-theoretic metric learning. In Proc. ICML, 2007.
 Gionis et al. (1999) Gionis, A., Indyk, P., and Motwani, R. Similarity search in high dimensions via hashing. In Proc. Conf. Very Large Data Bases, 1999.
 Hadsell et al. (2006) Hadsell, R., Chopra, S., and LeCun, Y. Dimensionality reduction by learning an invariant mapping. In Proc. CVPR, 2006.
 Johnson & Wichern (2002) Johnson, R. A. and Wichern, D. W. Applied multivariate statistical analysis, volume 4. Prentice Hall, 2002.
 LeCun (1985) LeCun, Y. Une procédure d’apprentissage pour réseau à seuil asymétrique. Proceedings of Cognitiva 85, Paris, pp. 599–604, 1985.
 Lee et al. (2009) Lee, D., Hofmann, M., Steinke, F., Altun, Y., Cahill, N. D., and Scholkopf, B. Learning similarity measure for multimodal 3d image registration. In Proc. CVPR, 2009.
 Masci et al. (2011) Masci, J., Migliore, D., Bronstein, M. M., and Schmidhuber, J. Descriptor learning for omnidirectional image matching. Technical Report arXiv:1112.6291, 2011.
 McFee & Lanckriet (2009) McFee, B. and Lanckriet, G. R. G. Partial order embedding with multiple kernels. In Proc. ICML, 2009.
 McFee & Lanckriet (2011) McFee, B. and Lanckriet, G. R. G. Learning multimodal similarity. JMLR, 12:491–523, 2011.
 Mika et al. (1999) Mika, S., Ratsch, G., Weston, J., Schoelkopf, B., and Mueller, K. R. Fisher discriminant analysis with kernels. In Proc. Neural Networks for Signal Processing, 1999.
 Rasiwasia et al. (2010) Rasiwasia, N., Costa Pereira, J., Coviello, E., Doyle, G., Lanckriet, G. R. G., Levy, R., and Vasconcelos, N. A new approach to cross-modal multimedia retrieval. In Proc. ICM, pp. 251–260, 2010.
 Roweis & Saul (2000) Roweis, S. T. and Saul, L. K. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323, 2000.
 Rumelhart et al. (1986) Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
 Schmidhuber & Prelinger (1993) Schmidhuber, J. and Prelinger, D. Discovering predictable classifications. Neural Computation, 5(4):625–635, 1993.

 Schoelkopf et al. (1997) Schoelkopf, B., Smola, A., and Mueller, K. R. Kernel principal component analysis. Artificial Neural Networks, pp. 583–588, 1997.
 Shakhnarovich et al. (2003) Shakhnarovich, G., Viola, P., and Darrell, T. Fast pose estimation with parameter-sensitive hashing. In Proc. CVPR, 2003.
 Shen et al. (2009) Shen, C., Kim, J., Wang, L., and Hengel, A. Positive semidefinite metric learning with boosting. In Proc. NIPS, 2009.
 Strecha et al. (2012) Strecha, C., Bronstein, A. M., Bronstein, M. M., and Fua, P. LDAHash: Improved matching with smaller descriptors. PAMI, 34(1):66–78, 2012.
 Taylor et al. (2011) Taylor, G. W., Spiro, I., Bregler, C., and Fergus, R. Learning invariance through imitation. In Proc. CVPR, 2011.
 Torralba et al. (2008) Torralba, A., Fergus, R., and Weiss, Y. Small codes and large image databases for recognition. In Proc. CVPR, 2008.
 Weinberger & Saul (2009) Weinberger, K. Q. and Saul, L. K. Distance metric learning for large margin nearest neighbor classification. JMLR, 10:207–244, 2009.
 Weiss et al. (2008) Weiss, Y., Torralba, A., and Fergus, R. Spectral hashing. In Proc. NIPS, 2008.
 Werbos (1974) Werbos, P. J. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, 1974.
 Weston et al. (2010) Weston, J., Bengio, S., and Usunier, N. Large scale image annotation: learning to rank with joint word-image embeddings. Machine learning, 81(1):21–35, 2010.
 Xing et al. (2002) Xing, E. P., Ng, A. Y., Jordan, M. I., and Russell, S. Distance metric learning with application to clustering with side-information. In Proc. NIPS, 2002.