1 Introduction
Consider a retrieval system that, given features of an object, searches a database for similar objects. Such a system requires a distance metric for assessing similarity. One way to produce a distance metric is to learn from similarity ratings that representative users have assigned to pairs of objects. Given data of this kind, ratings can be regressed onto differences between object features.
In this paper, we consider the use of class labels in addition to similarity ratings to learn a distance metric. Labels may be available, for example, if each object is assigned a class when entered into the database. The class label does not serve as an additional feature because when searching for objects similar to a new one, the class of the new object is usually unknown. In fact, the purpose of the retrieval system may be to supply similar objects and their class labels to assist the user in classifying the new object. However, class labels provide information useful to learning the distance metric because they may relate to similarity ratings in ways not captured by extracted features.
While distance metric learning has attracted much attention in recent years, approaches that have been proposed generally learn from either similarity/difference data or class labels but not both. We will refer to these two types of approaches as similarity-based and class-based methods, respectively. In the former category are multidimensional scaling methods (Cox and Cox, 2000), which embed vectors in a Euclidean space so that distances between pairs are close to available estimates; ordinal regression (McCullagh and Nelder, 1989; Herbrich et al., 2000), which learns a function that maps feature differences to discrete levels of measured similarity; and convex optimization formulations (Xing et al., 2002; Schultz and Joachims, 2004; Frome et al., 2006), which learn metrics that tend to make data pairs classified as similar close and others distant. As for class-based methods, examples include relevant component analysis (Bar-Hillel et al., 2003), which aims to learn a metric that makes data points that share a class close and others distant; neighborhood component analysis (Goldberger et al., 2005), which learns a distance metric by optimizing the probability of correct classification based on a softmax model and nearest neighbors; and the algorithms of Weinberger et al. (2006), Weinberger and Tesauro (2007), and Weinberger and Saul (2009), which minimize the distances between objects in each neighborhood that share the same class while separating those from different classes.
Our hybrid method of distance metric learning advances the aforementioned literature by providing an effective algorithm that makes use of both kinds of data simultaneously. It consists of two stages: a soft classifier is learned from the class label data and then used together with the similarity rating data by any similarity-based distance metric learning algorithm. Although this method can make use of any algorithm for learning a soft classifier and any similarity-based distance metric learning algorithm, to best illustrate our idea we will focus on the combination of a kernel density estimation algorithm similar to neighborhood component analysis and the aforementioned convex optimization approach to learning from similarity ratings. Results from experiments with synthetic data as well as a real medical image retrieval problem demonstrate that this hybrid method improves retrieval performance significantly.
2 Problem Formulation
2.1 Data
Suppose features of each object are encoded in a vector . We are given a data set consisting of similarity ratings for pairs of objects and class labels for individual objects. The ratings data is comprised of a set of quintuplets , each consisting of two object identifiers and , associated feature vectors and , and a similarity rating . We assume that each similarity rating takes one of three values, in particular, , , and , conveying dissimilarity, neutrality, and similarity, respectively. Denote the number of classes by and index each class by an integer from through . The class label data is a set of triplets , each consisting of an object identifier , a feature vector , and a class . The reason that object identifiers are included in the data is so that we know when a given class label is associated with the same object as a given similarity rating. In order to compress notation, when the object identifiers are not relevant to a discussion, we will refer to data samples in as triplets and data in as pairs .
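As an illustration of the data layout described above, the two data sets can be held in small record types. This is only a sketch: the names `Rating`, `Label`, and `unlabeled_ids` are hypothetical, and the integer coding 1, 2, 3 for dissimilar, neutral, and similar is an assumption that follows the coding shown in Table 1.

```python
from dataclasses import dataclass
from typing import Iterable, Set, Tuple


@dataclass(frozen=True)
class Rating:
    """One similarity-rating sample: two object ids, their features, a rating."""
    id_a: int
    id_b: int
    x_a: Tuple[float, ...]
    x_b: Tuple[float, ...]
    rating: int  # assumed coding: 1 = dissimilar, 2 = neutral, 3 = similar


@dataclass(frozen=True)
class Label:
    """One class-label sample: object id, features, class index."""
    obj_id: int
    x: Tuple[float, ...]
    cls: int


def unlabeled_ids(ratings: Iterable[Rating], labels: Iterable[Label]) -> Set[int]:
    """Ids of objects that appear in the ratings data but carry no class label."""
    labeled = {lab.obj_id for lab in labels}
    rated = set()
    for r in ratings:
        rated.add(r.id_a)
        rated.add(r.id_b)
    return rated - labeled
```

Keeping the object identifiers in both record types is what allows a rated object to be matched with its class label when one exists.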
2.2 Distance Metric
A distance metric is a mapping from to which assesses the distance of any given pair of objects. Given a class of distance metrics , which is parameterized by a vector , we wish to compute so that the resulting distance metric accurately reflects perceived distances. Though the methods we present apply to a variety of distance metrics, much of our discussion will focus on the popular choice of a weighted Euclidean norm:
(1) 
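As a concrete illustration, and assuming the weighted Euclidean norm in (1) takes the squared form, that is, a sum over features of a nonnegative coefficient times the squared feature difference, the distance can be computed as below. The function name `weighted_sq_distance` and the choice of the squared form are assumptions made for this sketch.

```python
def weighted_sq_distance(r, x, x_prime):
    """Return sum_k r[k] * (x[k] - x_prime[k])**2 (assumed form of the metric)."""
    if not (len(r) == len(x) == len(x_prime)):
        raise ValueError("r, x and x_prime must have the same length")
    if any(w < 0 for w in r):
        raise ValueError("coefficients must be nonnegative")
    return sum(w * (a - b) ** 2 for w, a, b in zip(r, x, x_prime))
```

For example, with coefficients `[1.0, 2.0]` the points `[0.0, 0.0]` and `[3.0, 1.0]` are at distance `1*9 + 2*1 = 11`. Because this form is linear in the coefficients, it fits naturally into likelihood and convex optimization formulations.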
3 Algorithms
Our goal is to learn a distance metric that helps us retrieve similar objects in the database. We now discuss three existing algorithms for doing so and propose a new hybrid algorithm.
3.1 Ordinal Regression
Ordinal regression (McCullagh and Nelder, 1989) offers a simple approach to learning coefficients from the similarity rating data . Ordinal regression typically assumes that, given a pair of objects , the similarity rating obeys the conditional distribution
where denotes the level of similarity, and are boundary parameters (we have implicitly ). These parameters, together with the coefficients , are computed by solving a maximum likelihood problem:
s.t  
Constraints are imposed on because, given the way our distance metric is defined in (1), coefficients of any suitable distance metric should be nonnegative. Note that this algorithm only makes use of the rating data .
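To make the likelihood concrete, the sketch below uses one common parameterization of ordinal regression, the cumulative-logit (proportional odds) model, applied to a weighted squared distance. The logistic link, the two boundary parameters `b1 < b2`, and the rating coding 1, 2, 3 (dissimilar, neutral, similar) are assumptions of this sketch and may differ from the exact model used here.

```python
import math


def _sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def rating_probs(d, b1, b2):
    """P(rating = 1, 2, 3 | distance d) under an assumed cumulative-logit model.

    Requires b1 < b2. Smaller distances make the 'similar' rating (3) more likely.
    """
    p_similar = _sigmoid(b1 - d)            # P(rating = 3)
    p_at_least_neutral = _sigmoid(b2 - d)   # P(rating >= 2)
    return {1: 1.0 - p_at_least_neutral,
            2: p_at_least_neutral - p_similar,
            3: p_similar}


def log_likelihood(r, b1, b2, samples):
    """Sum of log-probabilities of observed ratings.

    samples: iterable of (x, x_prime, rating) triplets.
    """
    total = 0.0
    for x, x_prime, rating in samples:
        d = sum(w * (a - b) ** 2 for w, a, b in zip(r, x, x_prime))
        total += math.log(rating_probs(d, b1, b2)[rating])
    return total
```

Maximizing `log_likelihood` over the coefficients (kept nonnegative) and the boundary parameters gives one instance of the maximum likelihood problem described above.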
3.2 Convex Optimization
Another approach, proposed in Xing et al. (2002), computes by solving a convex optimization problem:
s.t.  
This formulation results in a distance metric that aims to minimize the distances between similar objects while keeping dissimilar ones sufficiently far apart. As with ordinal regression, this algorithm only makes use of the rating data .
3.3 Neighborhood Component Analysis
Neighborhood component analysis (NCA) learns a distance metric from class labels based on an assumption that similar objects are more likely to share the same class than dissimilar ones. NCA employs a model in which a feature vector is assigned class label with probability
(2) 
NCA computes coefficients that would lead to accurate classification of objects in the training set . We will define accuracy here in terms of log-likelihood. In particular, we consider an implementation that aims to produce coefficients by maximizing the average leave-one-out log-likelihood. That is,
(3) 
This optimization problem is not convex, but in our experience a local optimum can be found efficiently via projected gradient ascent. In many practical cases the number of training samples is not much larger than the number of parameters , and NCA consequently suffers from overfitting. Therefore, we consider regularization in our application of NCA. In particular, we subtract a penalty term from (3), where the parameter is selected by cross-validation. Further details about our implementation can be found in the appendix.
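The objective in (3) can be sketched as follows, assuming the softmax-over-neighbors model of Goldberger et al. (2005) with a weighted squared distance in the exponent. The function name `loo_log_likelihood` is hypothetical, the penalty term is left out because its form is not assumed here, and the small floor on the numerator is a numerical safeguard added for this sketch.

```python
import math


def loo_log_likelihood(r, xs, cs):
    """Average leave-one-out log-likelihood of class labels.

    For each point i, the probability of its own class is the share of
    kernel weight exp(-d_r(x_i, x_j)) carried by the other points j that
    have the same class (assumed softmax form).
    """
    n = len(xs)
    total = 0.0
    for i in range(n):
        same, everything = 0.0, 0.0
        for j in range(n):
            if j == i:
                continue  # leave point i out of its own neighborhood
            d = sum(w * (a - b) ** 2 for w, a, b in zip(r, xs[i], xs[j]))
            k = math.exp(-d)
            everything += k
            if cs[j] == cs[i]:
                same += k
        # Floor avoids log(0) when a point has no same-class neighbor.
        total += math.log(max(same, 1e-300) / max(everything, 1e-300))
    return total / n
```

With all coefficients at zero every neighbor is weighted equally, so the objective reduces to the log of the same-class fraction; coefficients that separate the classes raise it toward zero.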
3.4 A Hybrid Method
We now introduce a hybrid method that simultaneously makes use of similarity ratings and class labels. Our approach is motivated by an assumption that similarity ratings are driven by a weighted Euclidean norm distance metric, but that the observed feature vectors may not express all relevant information about objects being compared. In particular, there may be “missing features” that influence the underlying distance metric. Given objects and with observed feature vectors and missing feature vectors , we assume the underlying distance metric is given by
where and .
Another important assumption we will make concerning the missing feature vector is that it is conditionally independent from the observed feature vector when conditioned on the class label. In other words, given an object with observed and missing feature vectors and and a class label , we have . This assumption is justifiable since, if there exists any correlation between and , then we can subtract this dependence from , resulting in another random variable , and replace by without loss of generality.
Now suppose we are given a learning algorithm that learns the conditional class probabilities from class data . In other words, is a function that maps into an estimate . Using these conditional class probabilities , we generate a soft class label for each unlabeled object represented in , our similarity ratings data set, that is not labeled in the class data set . In particular, for an unlabeled object with feature vector , we generate a vector , with each th component given by . For uniformity of notation, we also define for each object from , the set with class labels, a vector . In this case, if is the class label assigned to then and for .
We now discuss how the similarity ratings data is used together with these class probability vectors to produce a distance metric. The main idea is to generate an estimate of that is consistent with observed similarity ratings. The conditioning on and here indicates that these vectors are taken to be the class probabilities associated with the two objects.
Note that
and using the conditional independence assumption we have
where is defined as
We can view as a matrix that encodes distance information relating to missing features. This motivates the following parameterization of a distance metric, which is what we will use:
Note that in the event that class labels are not provided for and , the class probability vectors depend only on and . Therefore, with some abuse of notation, when there are no class labels, we can write the distance metric as
Our hybrid method estimates the vector and matrix so that they are consistent with similarity ratings. To do so, it makes use of a similarity-based learning algorithm that learns the coefficients of a distance metric from feature differences and similarity ratings, such as the ordinal regression or convex optimization methods we have described.
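One reading of the parameterization above is a feature term plus a bilinear term in the two class probability vectors, where the matrix entry for a pair of classes stands for the expected missing-feature distance between objects of those classes. The sketch below follows that reading; the names `hybrid_distance` and `one_hot`, the zero-based class indices, and the squared form of the feature term are assumptions made for illustration.

```python
def one_hot(c, num_classes):
    """Class probability vector for an object whose class c is known."""
    return [1.0 if k == c else 0.0 for k in range(num_classes)]


def hybrid_distance(r, Q, x, x_prime, p, p_prime):
    """Feature distance plus expected missing-feature distance (assumed form).

    r: nonnegative feature coefficients.
    Q: symmetric, elementwise nonnegative matrix (list of lists) indexed by class.
    p, p_prime: class probability vectors of the two objects.
    """
    feature_term = sum(w * (a - b) ** 2 for w, a, b in zip(r, x, x_prime))
    class_term = sum(p[c] * Q[c][e] * p_prime[e]
                     for c in range(len(p))
                     for e in range(len(p_prime)))
    return feature_term + class_term
```

For a labeled object the probability vector is one-hot, so the class term simply picks out one matrix entry; for an unlabeled object it averages entries according to the soft class label.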
To provide a concrete version of our hybrid method, we consider the case where is a kernel density estimation procedure similar to NCA and is the algorithm based on convex optimization, discussed in Section 3.2. In this case, the method first generates a feature vector density for each class according to
where is a Gaussian kernel, defined by
To produce conditional class probabilities, we estimate the marginal distribution of classes according to
and apply Bayes’ rule to arrive at
The Gaussian kernel parameters can be estimated by an approach similar to that described in (3). Then, to compute estimates and , we solve the following convex optimization problem:
s.t.  
This is the hybrid method we use in our experiments. Note that we only require to be elementwise nonnegative, not positive semidefinite, and as such our method does not require solving a semidefinite program (SDP).
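The first stage of this concrete version, the soft classifier, can be sketched as follows. The sketch assumes a diagonal Gaussian kernel with per-feature weights `s` and uses class frequencies as the marginal distribution; the kernel's normalizing constant is omitted because it is shared by all classes and cancels under Bayes' rule. The function names and zero-based class indices are hypothetical.

```python
import math
from collections import Counter


def gaussian_kernel(x, x_i, s):
    """Unnormalized diagonal Gaussian kernel (assumed parameterization)."""
    return math.exp(-sum(w * (a - b) ** 2 for w, a, b in zip(s, x, x_i)))


def soft_class_probs(x, xs, cs, s, num_classes):
    """Estimate P(class | x) from labeled data via kernel densities and Bayes' rule."""
    n = len(xs)
    counts = Counter(cs)
    joint = []
    for c in range(num_classes):
        n_c = counts.get(c, 0)
        if n_c == 0:
            joint.append(0.0)  # class never observed
            continue
        # Class-conditional density: average kernel over samples of class c.
        density = sum(gaussian_kernel(x, xi, s)
                      for xi, ci in zip(xs, cs) if ci == c) / n_c
        prior = n_c / n  # marginal class frequency
        joint.append(prior * density)
    z = sum(joint)
    if z == 0.0:
        raise ValueError("x is too far from all training points (underflow)")
    return [j / z for j in joint]
```

The returned vector plays the role of the soft class label for an unlabeled object; it is then passed, together with the features, to the similarity-based stage.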
4 Experiments
We evaluate the aforementioned four algorithms, namely ordinal regression (OR), convex optimization (CO), neighborhood component analysis (NCA), and the hybrid method (HYB), in two experiments. In the first experiment, we generate 100 synthetic data sets by a sampling process. For the second experiment, a real data set consisting of feature vectors derived from computed tomography (CT) scans of liver lesions, along with diagnoses and comparison ratings provided by radiologists, is considered. The data was collected as part of a project that seeks to develop a similarity-based image retrieval system for radiological decision support (Napel et al., 2010). We now describe the settings and empirical results of both experiments in detail.
It is worth mentioning that relative to other algorithms we consider, the hybrid method increases the number of free variables by , which is the number of numerical values used to represent the symmetric matrix . Since the number of classes is usually much smaller than the number of features , we do not expect this increase in degrees of freedom to drive differences in empirical results. For instance, in the medical image dataset we study, we have and , so our hybrid method only introduces new variables to the variables used by other methods.
4.1 Synthetic Data
The following procedure explains how we generate and conduct experiments with synthetic data:

Sample a generative model and coefficient vectors and . Further details about this sampling process can be found in the appendix.

Generate data points from the resulting generative model; denote it by a set .

For each integer pair , let
where is sampled iid from to represent the random noise in rating. This results in distance values. Let be their first quintile and be their median. We set

Let be the training set and be the testing set. Take to be the label data set.

Let and . will be used for testing, and for training we sample 5 subsets of , namely , such that the sizes of these sets are equal to and of the size of , respectively. The reason for using as our training sets is that in many practical contexts it is not feasible to gather an exhaustive set of comparison data that rates all pairs of feature vectors as does .

For , run OR, CO, NCA, and HYB on the datasets , resulting in four distance measures. Then for every , apply each distance measure to retrieve the top 10 closest objects in , and evaluate the retrieved list by normalized discounted cumulative gain at position 10 ( NDCG), defined as
where is the th most similar object to based on the distance measure in test and is the th most similar object based on the ratings in . We use NDCG as our evaluation criterion since it is a widely used measure for assessing the relevance of ranked retrieval results.
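The evaluation step can be sketched with a standard form of NDCG at position 10. The exponential gain `2**rating - 1` and the logarithmic position discount used below are the common textbook choices and are assumptions of this sketch; the exact definition used in the experiments may differ. The function names are hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance scores."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances))


def ndcg_at_k(retrieved_rels, all_rels, k=10):
    """NDCG at position k.

    retrieved_rels: ratings of the retrieved objects, in retrieved order.
    all_rels: ratings of every candidate object, used to build the ideal list.
    """
    ideal = sorted(all_rels, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    if ideal_dcg == 0:
        return 0.0
    return dcg(retrieved_rels[:k]) / ideal_dcg
```

A retrieved list that matches the rating-based ordering scores 1; placing dissimilar objects ahead of similar ones lowers the score, with mistakes near the top of the list penalized most.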
The above procedure was repeated 100 times, resulting in 100 different generative models and data sets. Figure 1 plots the average NDCG delivered by OR, CO, NCA, and HYB. The advantage of HYB becomes significant as the size of the rating data set grows.
4.2 Real Data
Our real data set consists of thirty medical images, each corresponding to a distinct CT scan. Features of each image included semantic annotations given by a radiologist (Rubin et al., 2008) using a controlled vocabulary and quantitative features such as lesion border sharpness, histogram statistics (Bilello et al., 2004; Rubin et al., 2008), Haar wavelets (Strela et al., 1999), and Gabor textures (Zhao et al., 2004). A total of 479 features were extracted from each image, many of which are linearly dependent. To simplify the computation, we removed those features whose correlations are above 0.95, and normalized the remaining ones. This resulted in 60 features which we used in our study.
For each pair among the thirty CT scans, we collected two ratings of image similarity from two different radiologists. Each image was classified with one of three diagnoses: cyst, metastasis, or hemangioma. Figure 2 shows some sample images from our data set.
To connect the aforementioned quantities to notation we have introduced, note that the number of features is , and the number of classes is . Denote the set of image-feature pairs by , the class label data by , and the similarity rating data by . Tables 1 and 2 provide frequencies with which different ratings and classes appear in the data set.
Table 1: Frequencies of similarity ratings.
Rating            Frequency
1 (Dissimilar)    58.6%
2 (Neutral)       16.2%
3 (Similar)       25.2%

Table 2: Frequencies of classes.
Class         Frequency
Cyst          44%
Metastasis    33%
Hemangioma    23%
Since data points are scarce in this case, we use leave-one-out cross-validation to evaluate the performance. More specifically, for , we do the following:

Let .

Let .

Let

Apply the four methods OR, CO, NCA, and HYB to .

Use each of the resulting distance measures to retrieve the top 10 images from that are closest to .

Evaluate the NDCG of the retrieved lists.
Figure 3 plots the average NDCG delivered by OR, CO, NCA, and HYB. As we can see, HYB leads the other methods by a clear margin of about 0.08 in average NDCG (0.75 vs. NCA’s 0.67).
5 Conclusion
We have presented a hybrid method that learns a distance measure by fusing similarity ratings and class labels. This approach consists of two elements: an algorithm that learns class probabilities conditioned on features from label data, and an algorithm that fits model coefficients so that the resulting distance measure is consistent with similarity ratings. In our implementation, a kernel density estimation procedure similar to NCA and CO are chosen for these two elements, respectively. We tried the algorithm on synthetic data as well as a data set collected for the purpose of developing a medical image retrieval system, and demonstrated that it provides substantial gains over various methods that learn distance metrics exclusively from class or similarity data.
As a parting thought, it is worth mentioning that our hybrid method combines elements of generative and discriminative learning. There has been a growing literature that explores such combinations (Jaakkola and Haussler, 1998; Raina et al., 2004; Kao et al., 2009) and it would be interesting to explore the relationship of our hybrid method to other work on this broad topic.
Appendix: Implementation Details
Regularized NCA
In our implementation, we randomly partition class label data set into a training set and a validation set , whose sizes are roughly and of , respectively. For each , we solve
by projected gradient ascent. We then compute the log-likelihood of the validation set, given by
and select the value of that results in the highest log-likelihood. The resulting value of is subsequently applied as the regularization parameter when we solve for with the complete training set . The range of is determined through trial and error and chosen so that in our experiments the optima rarely took on extreme values.
Sampling Generative Model
We take , , and for the synthetic data experiment. Algorithm 1 is the procedure we use to sample the generative models. Here we set and as mixtures of Gaussian distributions. This procedure was repeated 100 times to produce 100 generative models.
References
 Bar-Hillel et al. (2003) A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall. Learning distance functions using equivalence relations. In ICML, pages 11–18, 2003.
 Bilello et al. (2004) M. Bilello, S. B. Gokturk, T. Desser, S. Napel, R. B. Jeffrey Jr., and C. F. Beaulieu. Automatic detection and classification of hypodense hepatic lesions on contrast-enhanced venous-phase CT. Med Phys, 31:2584–2593, 2004.
 Cox and Cox (2000) T. Cox and M. A. A. Cox. Multidimensional Scaling. Chapman & Hall/CRC, 2000.
 Frome et al. (2006) A. Frome, Y. Singer, and J. Malik. Image retrieval and classification using local distance functions. In Advances in Neural Information Processing Systems 19, pages 417–424, 2006.
 Goldberger et al. (2005) J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 513–520. MIT Press, Cambridge, MA, 2005.
 Herbrich et al. (2000) R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In A.J. Smola, P.L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 115–132, Cambridge, MA, 2000. MIT Press.
 Jaakkola and Haussler (1998) T. S. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In Advances in Neural Information Processing Systems 11. MIT Press, Cambridge, MA, 1998.
 Kao et al. (2009) Y.H. Kao, B. Van Roy, and X. Yan. Directed regression. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 889–897. 2009.
 McCullagh and Nelder (1989) P. McCullagh and J. A. Nelder. Generalized linear models (Second edition). London: Chapman & Hall, 1989.
 Napel et al. (2010) S. Napel, C. F. Beaulieu, C. Rodriguez, J. Cui, J. Xu, A. Gupta, D. Korenblum, H. Greenspan, Y. Ma, and D. L. Rubin. Automated retrieval of CT images of liver lesions based on image similarity: Method and preliminary results. Radiology, 2010.
 Raina et al. (2004) R. Raina, Y. Shen, A. Y. Ng, and A. McCallum. Classification with hybrid generative/discriminative models. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.
 Rubin et al. (2008) D. L. Rubin, C. Rodriguez, P. Shah, and C. Beaulieu. iPad: Semantic annotation and markup of radiological images. In AMIA Annu Symp Proc, pages 626–630, 2008.
 Schultz and Joachims (2004) M. Schultz and T. Joachims. Learning a distance metric from relative comparisons. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.
 Strela et al. (1999) V. Strela, P. N. Heller, G. Strang, P. Topiwala, and C. Heil. The application of multiwavelet filterbanks to image processing. IEEE Trans Image Process, 8:548–563, 1999.

 Weinberger and Saul (2009) K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, pages 207–244, 2009.
 Weinberger and Tesauro (2007) K. Q. Weinberger and G. Tesauro. Metric learning for kernel regression. In Eleventh International Conference on Artificial Intelligence and Statistics, pages 608–615. 2007.
 Weinberger et al. (2006) K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. In Advances in Neural Information Processing Systems 19. MIT Press, 2006.
 Xing et al. (2002) E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems 15, pages 505–512. MIT Press, 2002.
 Zhao et al. (2004) C. G. Zhao, H. Y. Cheng, Y. L. Huo, and T. G. Zhuang. Liver CT-image retrieval based on Gabor texture. In IEMBS: 26th Annual International Conference of the IEEE, pages 1491–1494, 2004.