1 Introduction
Given data pairs labeled as either “similar” or “dissimilar”, distance metric learning (DML) [58, 51, 11] learns a distance measure under which similar examples are placed close to each other while dissimilar ones are kept far apart. The learned distance metrics are important to many downstream tasks, such as retrieval [9], classification [51] and clustering [58]. One commonly used distance metric between two examples $x$ and $y$ is $\|A(x-y)\|_2^2$ [51, 55, 9], which is parameterized by the $m$ projection vectors forming the rows of a matrix $A \in \mathbb{R}^{m\times d}$.
Many works [50, 55, 48, 42, 9] have proposed orthogonality-promoting DML to learn distance metrics that are (1) balanced: performing equally well on data instances belonging to frequent and infrequent classes; (2) compact: using a small number of projection vectors to achieve a “good” metric (i.e., one capturing well the relative distances of the data pairs); (3) generalizable: reducing overfitting to the training data. Regarding balancedness, under many circumstances the frequency of classes, defined as the number of examples belonging to each class, can be highly imbalanced. Classic DML methods are sensitive to this skewness: they perform favorably on frequent classes but less well on infrequent classes, a phenomenon also confirmed in our experiments in Section 7. However, infrequent classes are of crucial importance in many applications and should not be ignored. For example, in a clinical setting, many diseases occur infrequently but are life-threatening. Regarding compactness, the number $m$ of projection vectors entails a tradeoff between performance and computational complexity [16, 55, 42]. On one hand, more projection vectors bring more expressiveness in measuring distance. On the other hand, a larger $m$ incurs a higher computational overhead, since the number of weight parameters in $A$ grows linearly with $m$. It is therefore desirable to keep $m$ small without much hurting DML performance. Regarding generalization performance, in the case where the sample size is small but the size of $A$ is large, overfitting can easily happen.

To address these three issues, many studies [58, 51, 11, 20, 61, 25, 63] propose to regularize the projection vectors to be close to orthogonal. For balancedness, they argue that, without orthogonality-promoting regularization (OPR), the majority of projection vectors learn latent features for frequent classes, since these classes have dominant signals in the dataset; through OPR, the projection vectors uniformly “spread out”, giving both infrequent and frequent classes a fair treatment and thus leading to a more balanced distance metric (see [57] for details). For compactness, they claim that “diversified” projection vectors bear less redundancy and are mutually complementary; as a result, a small number of such vectors is sufficient to achieve a “good” distance metric. For generalization performance, they posit that OPR imposes a structured constraint on the function class of DML and hence reduces model complexity.
While these orthogonality-promoting DML methods have shown promising results, they have three problems. First, they involve solving nonconvex optimization problems whose global solution is extremely difficult, if not impossible, to obtain. Second, no formal analysis has been conducted regarding why OPR can promote balancedness. Third, while the generalization error (GE) analysis of OPR has been studied in [57], that analysis is incomplete: it first shows that the upper bound of the GE is a function of cosine similarity (CS), then shows that CS and the regularizer are somewhat aligned in shape, but it does not establish a direct relationship between the GE bound and the regularizer.
In this paper, we aim at addressing these problems by making the following contributions:
- We relax the nonconvex, orthogonality-promoting DML problems into convex problems and develop efficient proximal gradient descent algorithms. The algorithms run only once with a single initialization, and hence are much more efficient than existing nonconvex methods.
- We perform a theoretical analysis which formally reveals the relationship between OPR and balancedness: stronger OPR leads to more balancedness.
- We perform a generalization error (GE) analysis which shows that reducing the convex orthogonality-promoting regularizers reduces the upper bound of the GE.
- We apply the learned distance metrics to information retrieval on healthcare, text, image, and sensory data. Compared with nonconvex baseline methods, our approaches achieve higher computational efficiency and are more capable of improving balancedness, compactness and generalizability.
2 Related Works
2.1 Distance Metric Learning
Many studies [58, 51, 11, 20, 61, 25, 63] have investigated DML; please refer to [28, 49] for a detailed review. Xing et al. [58] learn a Mahalanobis distance by minimizing the sum of distances over all similar data pairs, subject to the constraint that the sum over all dissimilar pairs is no less than 1. Weinberger et al. [51] propose large-margin metric learning, which is applied to k-nearest-neighbor classification: for each data example, they first obtain its nearest neighbors under the Euclidean distance; among these neighbors, some share the example's class label while others do not; a projection matrix is then learned such that each neighbor with a different label is pushed farther away than the same-label neighbors by a margin. Davis et al. [11] learn a Mahalanobis distance such that the distance between similar pairs is upper-bounded by one threshold and the distance between dissimilar pairs is lower-bounded by another. Guillaumin et al. [20] define the probability of the similarity/dissimilarity label conditioned on the Mahalanobis distance between a pair, where the binary label equals 1 if the two examples share a class; the Mahalanobis matrix is learned by maximizing the conditional likelihood of the training data. Kostinger et al. [25] learn a Mahalanobis distance metric from equivalence constraints based on a likelihood ratio test; the Mahalanobis matrix is computed in one shot, without going through an iterative optimization procedure. Ying and Li [61] formulate DML as an eigenvalue optimization problem. Zadeh et al. [63] propose a geometric mean metric learning approach based on the Riemannian geometry of positive definite matrices; similar to [25], the Mahalanobis matrix has a closed-form solution without iterative optimization.

To avoid overfitting in DML, various regularization approaches have been explored. Davis et al. [11] regularize the Mahalanobis matrix to be close to another matrix that encodes prior information, where the closeness is measured using the log-determinant divergence. Qi et al. [40] use $\ell_1$ regularization to learn sparse distance metrics for high-dimensional, small-sample problems. Ying et al. [60] use the $\ell_{2,1}$ norm to simultaneously encourage low-rankness and sparsity. The trace norm is leveraged to encourage low-rankness in [37, 32]. Qian et al. [41] apply dropout to DML. Many works [50, 16, 55, 59, 42, 9] study diversity-promoting regularization in DML or hashing: they define regularizers based on the squared Frobenius norm [50, 13, 16, 9] or angles [55, 59] to encourage the projection vectors to approach orthogonality. Several works [31, 52, 18, 22, 48] impose a strict orthogonality constraint on the projection vectors; as observed in previous works [50, 13] and in our experiments, strict orthogonality hurts performance. Isotropic hashing [24, 15] encourages the variances of different projected dimensions to be equal in order to achieve balance. Carreira-Perpiñán and Raziperchikolaei [8] propose a diversity hashing method which first trains hash functions independently and then introduces diversity among them based on classifier ensembles.
2.2 OrthogonalityPromoting Regularization
Orthogonality-promoting regularization has been studied in other problems as well, including ensemble learning, latent variable modeling, classification and multitask learning. In ensemble learning, many studies [30, 2, 39, 62] promote orthogonality among the coefficient vectors of base classifiers or regressors, with the aim of improving generalization performance and reducing computational complexity. Recently, several works [65, 3, 10, 56] have studied orthogonality-promoting regularization of latent variable models (LVMs), which encourages the components in LVMs to be mutually orthogonal, for the sake of capturing infrequent patterns and reducing the number of components without sacrificing modeling power. In these works, various orthogonality-promoting regularizers have been proposed, based on the determinantal point process [27, 65] and cosine similarity [62, 3, 56]. In multiway classification, Malkin and Bilmes [33] propose to use the determinant of a covariance matrix to encourage orthogonality among classifiers. Jalali et al. [21] propose a class of variational Gram functions (VGFs) to promote pairwise orthogonality among vectors. While these VGFs are convex, they can only be applied to nonconvex DML formulations; as a result, the overall regularized DML problem remains nonconvex and is not amenable to convex relaxation.
In the sequel, we review two families of orthogonalitypromoting regularizers.
Determinantal Point Process
The work of [65] employed the determinantal point process (DPP) [27] as a prior to induce orthogonality in latent variable models. A DPP over a set of vectors $\{a_i\}_{i=1}^m$ assigns probability $p(\{a_i\}) \propto \det(K)$, where $K$ is the kernel matrix with $K_{ij} = k(a_i, a_j)$ for a kernel function $k(\cdot,\cdot)$, and $\det(\cdot)$ denotes the determinant of a matrix. A configuration of vectors with larger probability is deemed to be closer to orthogonal. The underlying intuition is that $\det(K)$ represents the squared volume of the parallelepiped formed by the vectors in the kernel-induced feature space: if the vectors are closer to being orthogonal, the volume is larger, which results in a larger probability. The shortcoming of the DPP is that it is sensitive to vector scaling: enlarging the magnitudes of the vectors results in a larger volume, but does not essentially affect the orthogonality of the vectors.
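Both the volume intuition and the scaling sensitivity are easy to check numerically. A minimal sketch, assuming a linear kernel so that $K$ is simply the Gram matrix of the vectors:

```python
import numpy as np

def dpp_score(A):
    """Unnormalized DPP probability det(K), with K_ij = a_i . a_j
    (a linear kernel is assumed here for illustration)."""
    K = A @ A.T
    return np.linalg.det(K)

# Orthonormal vectors span the largest volume ...
A_orth = np.eye(3)
# ... while correlated unit-norm vectors span a smaller one.
A_corr = np.array([[1.0, 0.0, 0.0],
                   [0.8, 0.6, 0.0],   # cosine 0.8 with the first row
                   [0.0, 0.0, 1.0]])
print(dpp_score(A_orth))       # 1.0
print(dpp_score(A_corr))       # 0.36: less orthogonal => smaller score
# Scale sensitivity: doubling the vectors inflates the score by
# (2^2)^3 = 64 without changing their orthogonality at all.
print(dpp_score(2 * A_orth))   # 64.0
```

The last line illustrates the shortcoming discussed above: rescaling changes the score even though the angles between vectors are unchanged.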
Pairwise Cosine Similarity
Several works define orthogonality-promoting regularizers based on the pairwise cosine similarity among component vectors: if the cosine similarity scores are close to zero, then the components are closer to being orthogonal. Given $m$ component vectors, the cosine similarity between each pair of components $a_i$ and $a_j$ is computed as $s_{ij} = \frac{a_i^\top a_j}{\|a_i\|_2 \|a_j\|_2}$. These scores are then aggregated into a single score; [62] and [3] each propose a different aggregation of the $s_{ij}$, and in [56] the aggregated score is defined as the mean of the scores minus their variance.
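A minimal sketch of this family of regularizers, using the mean absolute cosine similarity as one simple aggregation choice (each cited paper uses its own variant):

```python
import numpy as np

def pairwise_cosine(A):
    """Cosine similarity s_ij for every pair of rows of A (i < j)."""
    R = A / np.linalg.norm(A, axis=1, keepdims=True)
    S = R @ R.T
    i, j = np.triu_indices(A.shape[0], k=1)
    return S[i, j]

def orthogonality_penalty(A):
    """One simple aggregation of the scores (mean absolute cosine);
    the papers cited above each aggregate differently."""
    return np.abs(pairwise_cosine(A)).mean()

A = np.array([[1.0, 0.0],
              [0.0, 2.0],    # orthogonal to the first row
              [1.0, 1.0]])   # correlated with both
print(orthogonality_penalty(A))          # ~0.471: partially correlated rows
print(orthogonality_penalty(np.eye(3)))  # 0.0 for exactly orthogonal rows
```

Note that, unlike the DPP, cosine similarity is invariant to rescaling the vectors, since each vector is normalized before the inner product is taken.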
3 Preliminaries
Distance Metric Learning
Given data pairs labeled either as “similar” or “dissimilar”, DML [58, 51, 11] aims to learn a distance metric under which similar examples are close to each other and dissimilar ones are separated far apart. There are many ways to define a distance metric; here we present two popular choices. One is based on linear projection [51, 55, 9]. Given two examples $x, y \in \mathbb{R}^d$, a linear projection matrix $A \in \mathbb{R}^{m\times d}$ can be utilized to map them into an $m$-dimensional latent space. The distance metric is then defined as their squared Euclidean distance in the latent space: $\|Ax - Ay\|_2^2$. $A$ can be learned by minimizing an objective of the form [58]
$$\min_A \sum_{(x,y)\in\mathcal{S}} \|Ax - Ay\|_2^2 + \lambda \sum_{(x,y)\in\mathcal{D}} \max\big(0,\, \tau - \|Ax - Ay\|_2^2\big),$$
where $\mathcal{S}$ and $\mathcal{D}$ denote the similar and dissimilar pairs respectively; this aims at making the distances between similar examples as small as possible while separating dissimilar examples by a margin $\tau$ using a hinge loss. We call this formulation projection matrix-based DML (PDML). PDML is a nonconvex problem whose global optimum is difficult to achieve. Moreover, one needs to manually tune the number $m$ of projection vectors, typically via cross-validation, which incurs substantial computational overhead.

The other popular choice of distance metric is $(x-y)^\top M (x-y)$, which is cast from $\|Ax - Ay\|_2^2$ by replacing $A^\top A$ with a positive semidefinite (PSD) matrix $M$. This is known as the Mahalanobis distance [58]. Correspondingly, the PDML formulation can be transformed into a Mahalanobis distance-based DML (MDML) problem, which is convex, so the global solution is guaranteed to be attainable; it also avoids tuning the number of projection vectors. However, the drawback of this approach is that, in order to satisfy the PSD constraint, one needs to perform an eigendecomposition of $M$ in each iteration, which incurs $O(d^3)$ complexity.
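The two parameterizations measure the same distance whenever $M = A^\top A$; a quick numerical check on arbitrary random data:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 6, 3
A = rng.normal(size=(m, d))   # PDML: projection matrix
M = A.T @ A                   # MDML: PSD Mahalanobis matrix induced by A

x, y = rng.normal(size=d), rng.normal(size=d)
d_proj = np.sum((A @ x - A @ y) ** 2)   # ||Ax - Ay||_2^2
d_maha = (x - y) @ M @ (x - y)          # (x - y)^T M (x - y)
print(np.isclose(d_proj, d_maha))       # True: the two views coincide

# The PSD constraint on M is what makes the MDML problem convex;
# any PSD M of rank m factorizes back into some A with m rows.
```

This equivalence is what later allows the nonconvex PDML objective to be rewritten over $M$ and convexified.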
OrthogonalityPromoting Regularization
Among the various orthogonality-promoting regularizers, we choose the Bregman matrix divergence (BMD) [29] based regularizers [57] in this study, since they are amenable to convex relaxation and facilitate theoretical analysis.
To encourage orthogonality between two vectors $a_i$ and $a_j$, one can make their inner product $a_i^\top a_j$ close to zero and their $\ell_2$ norms $\|a_i\|_2$, $\|a_j\|_2$ close to one. For a set of vectors $\{a_i\}_{i=1}^m$ (the rows of $A$), near-orthogonality can be achieved by computing the Gram matrix $G = AA^\top$, where $G_{ij} = a_i^\top a_j$, and then encouraging $G$ to be close to an identity matrix $I$. Off the diagonal, the entries of $G$ and $I$ are $a_i^\top a_j$ and zero, respectively; on the diagonal, they are $\|a_i\|_2^2$ and one, respectively. Making $G$ close to $I$ effectively encourages $a_i^\top a_j$ to be close to zero and $\|a_i\|_2$ close to one, which therefore encourages the vectors to be close to orthogonal.

BMDs can be used to measure the “closeness” between two matrices. Let $X, Y$ denote real symmetric matrices. Given a strictly convex, differentiable function $\phi$, a BMD is defined as $D_\phi(X, Y) = \phi(X) - \phi(Y) - \operatorname{tr}\big((\nabla\phi(Y))^\top (X - Y)\big)$, where $\operatorname{tr}(\cdot)$ denotes the trace of a matrix. Different choices of $\phi$ lead to different divergences. When $\phi(X) = \|X\|_F^2$, the BMD is specialized to the squared Frobenius norm (SFN) $\|X - Y\|_F^2$. If $\phi(X) = \operatorname{tr}(X \log X - X)$, where $\log X$ denotes the matrix logarithm of $X$, the divergence becomes $\operatorname{tr}(X \log X - X \log Y - X + Y)$, which is referred to as the von Neumann divergence (VND) [46]. If $\phi(X) = -\log\det X$, where $\det X$ denotes the determinant of $X$, we get the log-determinant divergence (LDD) [29]: $\operatorname{tr}(XY^{-1}) - \log\det(XY^{-1}) - d$.
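When the second argument is the identity, all three divergences reduce to simple functions of the eigenvalues of the first argument. A small sketch, using the standard closed forms of the scalar specializations:

```python
import numpy as np

def bmd_to_identity(G):
    """SFN, VND and LDD divergences D(G, I) for a positive definite
    Gram matrix G, evaluated through its eigenvalues."""
    lam = np.linalg.eigvalsh(G)
    sfn = np.sum((lam - 1.0) ** 2)                # ||G - I||_F^2
    vnd = np.sum(lam * np.log(lam) - lam + 1.0)   # tr(G log G - G + I)
    ldd = np.sum(lam - np.log(lam) - 1.0)         # tr(G) - log det G - m
    return sfn, vnd, ldd

A1 = np.eye(2, 4)                          # exactly orthonormal rows
print(bmd_to_identity(A1 @ A1.T))          # all three are zero

A2 = np.array([[1.0, 0.0, 0.0, 0.0],
               [0.9, 0.5, 0.0, 0.0]])      # correlated rows
print(bmd_to_identity(A2 @ A2.T))          # all three strictly positive
```

All three quantities vanish exactly when the rows of $A$ are orthonormal (so that $G = I$) and grow as the rows become more correlated or badly scaled.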
In PDML, to encourage orthogonality among the projection vectors (the row vectors of $A$), Xie et al. [57] define a family of regularizers $\Omega(A) = D_\phi(AA^\top, I)$ which encourage the BMD between the Gram matrix $AA^\top$ and an identity matrix to be small. $\Omega(A)$ can be specialized to different instances, based on the choice of $\phi$. Under SFN, it becomes $\|AA^\top - I\|_F^2$, which is used in [50, 13, 16, 9] to promote orthogonality. Under VND, it becomes $\operatorname{tr}\big(AA^\top \log(AA^\top) - AA^\top\big) + m$. Under LDD, it becomes $\operatorname{tr}(AA^\top) - \log\det(AA^\top) - m$.
4 Convex Relaxation
The PDML-BMD problem is nonconvex, and its global optimal solution is very difficult to achieve. We therefore seek a convex relaxation and solve the relaxed problem instead. The basic idea is to transform PDML into MDML and approximate the BMD regularizers with convex functions.
4.1 Convex Approximations of the BMD Regularizers
The approximations are based on the properties of eigenvalues. Given a full-rank matrix $A \in \mathbb{R}^{m\times d}$ ($m < d$), we know that $AA^\top$ is a full-rank matrix with $m$ positive eigenvalues, and $M = A^\top A$ is a rank-deficient matrix with $d - m$ zero eigenvalues and $m$ positive eigenvalues that equal those of $AA^\top$. For a positive semidefinite matrix $M$ with eigenvalues $\{\lambda_k\}_{k=1}^d$, we have $\operatorname{tr}(M) = \sum_k \lambda_k$, $\|M\|_F^2 = \sum_k \lambda_k^2$, and $\det(M) = \prod_k \lambda_k$. Next, we leverage these facts to seek convex relaxations of the BMD regularizers.
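These eigenvalue facts are straightforward to verify numerically (a random $A$ is full-rank with probability one):

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 3, 6
A = rng.normal(size=(m, d))              # full rank with probability 1

ev_small = np.sort(np.linalg.eigvalsh(A @ A.T))  # m positive eigenvalues
ev_big = np.sort(np.linalg.eigvalsh(A.T @ A))    # d-m zeros + the same m values
print(np.allclose(ev_big[:d - m], 0.0, atol=1e-9))   # True
print(np.allclose(ev_big[d - m:], ev_small))         # True

M = A.T @ A
lam = ev_big
print(np.isclose(np.trace(M), lam.sum()))                           # tr(M)
print(np.isclose(np.linalg.norm(M, 'fro') ** 2, (lam ** 2).sum()))  # ||M||_F^2
```

The shared positive spectrum of $AA^\top$ and $A^\top A$ is exactly what lets the regularizers on $AA^\top$ be rewritten in terms of $M = A^\top A$ below.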
A convex SFN regularizer
The eigenvalues of $AA^\top$ are $\{\lambda_k\}_{k=1}^m$, and those of $M = A^\top A$ are the same $m$ values plus $d - m$ zeros. Then $\|AA^\top - I\|_F^2 = \sum_{k=1}^m (\lambda_k - 1)^2 = \sum_{k=1}^m \lambda_k^2 - 2\sum_{k=1}^m \lambda_k + m = \|M\|_F^2 - 2\operatorname{tr}(M) + \operatorname{rank}(M)$, where $M$ is a Mahalanobis matrix. It is well known that the trace norm $\|M\|_*$ of a matrix is a convex envelope of its rank [44]. Using $\|M\|_*$ to approximate $\operatorname{rank}(M)$, we get the convex SFN (CSFN) regularizer defined over $M$:
$$\hat{\Omega}_{\mathrm{CSFN}}(M) = \|M\|_F^2 - 2\operatorname{tr}(M) + \|M\|_* \qquad (1)$$
A convex VND regularizer
Given the eigendecomposition $AA^\top = U \Lambda U^\top$ with eigenvalues $\{\lambda_k\}_{k=1}^m$, based on the property of the matrix logarithm we have $\log(AA^\top) = U \log(\Lambda) U^\top$, and hence $\operatorname{tr}\big(AA^\top \log(AA^\top)\big) = \sum_{k=1}^m \lambda_k \log \lambda_k$. Now consider the matrix $M + \epsilon I$, where $\epsilon > 0$ is a small scalar; its eigenvalues are $\{\lambda_k + \epsilon\}_{k=1}^m$ together with $d - m$ copies of $\epsilon$. Using a similar calculation, we have $\operatorname{tr}\big((M + \epsilon I)\log(M + \epsilon I)\big) = \sum_{k=1}^m (\lambda_k + \epsilon)\log(\lambda_k + \epsilon) + (d - m)\,\epsilon\log\epsilon$. Performing certain algebra (see Appendix A), the VND regularizer approximately equals $\operatorname{tr}\big((M + \epsilon I)\log(M + \epsilon I)\big) - \operatorname{tr}(M) + \operatorname{rank}(M)$, up to an additive constant. Replacing $\operatorname{rank}(M)$ with the trace norm $\|M\|_*$ and dropping the constant, we get the convex VND (CVND) regularizer:
$$\hat{\Omega}_{\mathrm{CVND}}(M) = \operatorname{tr}\big((M + \epsilon I)\log(M + \epsilon I)\big) - \operatorname{tr}(M) + \|M\|_* \qquad (2)$$
whose convexity is shown in [36].
A convex LDD regularizer
We have $\operatorname{tr}(AA^\top) = \operatorname{tr}(M)$ and $\log\det(M + \epsilon I) = \sum_{k=1}^m \log(\lambda_k + \epsilon) + (d - m)\log\epsilon$. Certain algebra shows that the LDD regularizer approximately equals $\operatorname{tr}(M) - \log\det(M + \epsilon I) + \log(1/\epsilon)\operatorname{rank}(M)$, up to an additive constant. After replacing $\operatorname{rank}(M)$ with the trace norm $\|M\|_*$ and discarding constants, we obtain the convex LDD (CLDD) regularizer:
$$\hat{\Omega}_{\mathrm{CLDD}}(M) = \operatorname{tr}(M) - \log\det(M + \epsilon I) + \log(1/\epsilon)\,\|M\|_* \qquad (3)$$
where the convexity of $\hat{\Omega}_{\mathrm{CLDD}}$ is proved in [6]. Note that in [11, 40], an information-theoretic regularizer based on the log-determinant divergence is applied to encourage the Mahalanobis matrix to be close to the identity matrix. That regularizer requires $M$ to be full-rank; in contrast, by associating a large weight with the trace norm $\|M\|_*$, our CLDD regularizer encourages $M$ to be low-rank. Since $M = A^\top A$, reducing the rank of $M$ reduces the number of projection vectors in $A$.
We now discuss the errors in the convex approximation, which come from two sources. One is the approximation of $M$ by $M + \epsilon I$, where the error is controlled by $\epsilon$ and can be made arbitrarily small (by setting $\epsilon$ to be very small). The other is the approximation of the matrix rank by the trace norm. Though the error of the second approximation can be large, it has been demonstrated both empirically and theoretically [7] that decreasing the trace norm can effectively reduce the rank. We empirically verify that decreasing the convexified CSFN, CVND and CLDD regularizers decreases their original nonconvex counterparts SFN, VND and LDD (see Appendix D.3). A rigorous analysis is left for future study.
4.2 DML with a Convex BMD Regularization
Given these convex BMD (CBMD) regularizers (denoted by $\hat{\Omega}(M)$), we relax the nonconvex PDML-BMD problems into convex MDML-CBMD formulations by replacing $\|Ax - Ay\|_2^2$ with $(x-y)^\top M (x-y)$ and replacing the nonconvex BMD regularizers with $\hat{\Omega}(M)$:
$$\min_{M \succeq 0}\; \sum_{(x,y)\in\mathcal{S}} (x-y)^\top M (x-y) + \lambda \sum_{(x,y)\in\mathcal{D}} \max\big(0,\, \tau - (x-y)^\top M (x-y)\big) + \gamma\, \hat{\Omega}(M) \qquad (4)$$
5 Optimization
We use a stochastic proximal subgradient descent algorithm [38] to solve the MDML-CBMD problems. The algorithm iteratively performs the following steps until convergence: (1) randomly sample a mini-batch of data pairs, compute the subgradient $\nabla$ of the data-dependent loss (the first and second terms in the objective function) on the mini-batch together with the smooth part of the regularizer, and perform a subgradient descent update $M \leftarrow M - \eta \nabla$, where $\eta$ is a small step size; and (2) apply the proximal operators associated with the nonsmooth regularizer terms to $M$. The gradient of the CVND term $\operatorname{tr}\big((M + \epsilon I)\log(M + \epsilon I)\big)$ is $\log(M + \epsilon I) + I$. To compute $\log(M + \epsilon I)$, we first perform an eigendecomposition $M + \epsilon I = U \Lambda U^\top$, then take the log of every eigenvalue in $\Lambda$, which gives a new diagonal matrix $\tilde{\Lambda}$, and finally compute $\log(M + \epsilon I) = U \tilde{\Lambda} U^\top$. In the CLDD regularizer, the gradient of $-\log\det(M + \epsilon I)$ is $-(M + \epsilon I)^{-1}$, which can also be computed by eigendecomposition. Next, we present the proximal operators.
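The eigendecomposition recipe for the matrix functions above can be sketched as follows (the value of $\epsilon$ is an illustrative assumption):

```python
import numpy as np

EPS = 1e-3  # small smoothing constant (epsilon in the text; value assumed)

def matrix_log_psd(M, eps=EPS):
    """log(M + eps*I): eigendecompose, take log of each eigenvalue, recompose."""
    lam, U = np.linalg.eigh(M + eps * np.eye(M.shape[0]))
    return U @ np.diag(np.log(lam)) @ U.T

def matrix_inv_psd(M, eps=EPS):
    """(M + eps*I)^{-1} via the same eigendecomposition."""
    lam, U = np.linalg.eigh(M + eps * np.eye(M.shape[0]))
    return U @ np.diag(1.0 / lam) @ U.T

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 5))
M = A.T @ A                     # PSD and rank-deficient, like a learned metric

# Sanity checks: exponentiating the log recovers M + eps*I,
# and the computed inverse really inverts M + eps*I.
lam, U = np.linalg.eigh(matrix_log_psd(M))
print(np.allclose(U @ np.diag(np.exp(lam)) @ U.T, M + EPS * np.eye(5)))   # True
print(np.allclose(matrix_inv_psd(M) @ (M + EPS * np.eye(5)), np.eye(5))) # True
```

Since $M$ is symmetric, `eigh` applies; the $\epsilon I$ shift guarantees all eigenvalues are strictly positive so the log and inverse are well defined.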
5.1 Proximal Operators
Given a nonsmooth regularizer $r(M)$ with weight $\rho$, the associated proximal operator is defined as $\operatorname{prox}(\hat{M}) = \arg\min_M \frac{1}{2}\|M - \hat{M}\|_F^2 + \rho\, r(M)$, subject to $M \succeq 0$. Let $\{\hat{\lambda}_k\}_{k=1}^d$ be the eigenvalues of $\hat{M}$ and $\{\lambda_k\}_{k=1}^d$ be the eigenvalues of $M$; the above problem can then be equivalently written as
$$\min_{\{\lambda_k\}}\; \sum_{k=1}^d \Big(\tfrac{1}{2}(\lambda_k - \hat{\lambda}_k)^2 + \rho\, f(\lambda_k)\Big) \quad \text{s.t. } \lambda_k \ge 0,\; k = 1, \dots, d, \qquad (5)$$
where $f(\cdot)$ is a regularizer-specific scalar function. This problem can be decomposed into $d$ independent ones, (P): $\min_{\lambda_k \ge 0}\; \tfrac{1}{2}(\lambda_k - \hat{\lambda}_k)^2 + \rho\, f(\lambda_k)$, for $k = 1, \dots, d$, which can be solved individually.
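The decomposition into independent scalar problems suggests the following generic sketch, with a numerical scalar solver standing in for the closed forms derived below. As a check, the trace-norm case $f(t) = t$, whose proximal operator is known to be eigenvalue soft-thresholding, is included (the solver bound and the example values are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def prox_spectral(M_hat, f, rho):
    """Proximal operator of a spectral regularizer with scalar function f:
    eigendecompose M_hat, solve one problem (P) per eigenvalue, recompose."""
    lam_hat, U = np.linalg.eigh(M_hat)
    lam = np.empty_like(lam_hat)
    for k, lh in enumerate(lam_hat):
        obj = lambda t: 0.5 * (t - lh) ** 2 + rho * f(t)
        res = minimize_scalar(obj, bounds=(0.0, abs(lh) + 10.0),
                              method='bounded')
        # Compare against the boundary t = 0 to respect lam_k >= 0.
        lam[k] = res.x if obj(res.x) < obj(0.0) else 0.0
    return U @ np.diag(lam) @ U.T

# Known special case: f(t) = t (trace norm) gives soft-thresholding
# of the eigenvalues, max(lam_hat - rho, 0).
M_hat = np.diag([3.0, 0.5, -1.0])
P = prox_spectral(M_hat, f=lambda t: t, rho=1.0)
print(np.round(np.sort(np.diag(P)), 3))   # [0. 0. 2.]
```

Clamping each eigenvalue at zero is what enforces the PSD constraint on the reassembled matrix; the closed forms in the next paragraphs replace the numerical solver for the specific CBMD regularizers.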
SFN
For SFN, $f$ is a quadratic function of $\lambda_k$, so problem (P) is simply a univariate quadratic program over $\lambda_k \ge 0$; the optimal solution is the unconstrained minimizer clipped at zero.
VND
For VND, taking the derivative of the objective function in problem (P) w.r.t. $\lambda_k$ and setting it to zero yields an equation whose root can be expressed in closed form via the Wright omega function [19]. If this root is negative, then the optimal $\lambda_k$ is 0; if it is positive, the optimum is either this root or 0, whichever yields the lower objective value.
LDD
For LDD, taking the derivative of the objective in problem (P) w.r.t. $\lambda_k$ and setting it to zero yields a quadratic equation. The optimal solution is attained either at a positive root of this equation (if any) or at 0; we pick the candidate that yields the lowest objective value.
Computational Complexity
In this algorithm, the major computational workload is the eigendecomposition of $d$-by-$d$ matrices, with $O(d^3)$ complexity. In our experiments, since $d$ is no more than 1000, this is not a major bottleneck. Besides, these matrices are symmetric, and this structure can be leveraged to speed up the eigendecomposition. In our implementation, we use the MAGMA library (http://icl.cs.utk.edu/magma/), which supports efficient eigendecomposition of symmetric matrices on GPUs. Note that the unregularized MDML also requires an eigendecomposition (of $M$) to maintain the PSD constraint; hence adding the CBMD regularizers does not substantially increase the computational cost.
6 Theoretical Analysis
In this section, we present theoretical analysis of balancedness and generalization error.
6.1 Analysis of Balancedness
In this section, we analyze how the nonconvex BMD regularizers that promote orthogonality affect the balancedness of the distance metrics learned by PDML-BMD (the analysis of the convex BMD regularizers in MDML-CBMD is left for future work). The analysis focuses on the projection matrix $A$ learned by PDML-BMD. We assume there are $c$ classes, where class $k$ has a distribution $\mathcal{D}_k$ with expectation $\mu_k$, and each data sample is drawn from the distribution of one specific class. Further, we assume $AA^\top$ has full rank $m$ (the number of projection vectors), and let $AA^\top = U \Lambda U^\top$ denote its eigendecomposition, with eigenvalues $\lambda_1 \ge \dots \ge \lambda_m > 0$.
We define an imbalance factor (IF) to measure the (im)balancedness. For each class $k$, we use the corresponding expectation $\mu_k$ to characterize this class. We define the distance between two classes $j$ and $k$ under the learned metric as $d(j, k) = \|A\mu_j - A\mu_k\|_2^2$, and the IF among all classes as the ratio between the largest and the smallest between-class distance:
$$\mathrm{IF} = \frac{\max_{j \ne k} d(j, k)}{\min_{j \ne k} d(j, k)} \qquad (6)$$
The motivation for such a definition is as follows. For two frequent classes, since they have more training examples and hence contribute more to learning $A$, DML tends to make their distance large; whereas for two infrequent classes, since they contribute less to learning $A$ (and DML is constrained by similar pairs, which need to have small distances), their distance may end up being small. Consequently, if the classes are imbalanced, some between-class distances can be large while others are small, resulting in a large IF. The following theorem gives upper bounds on the IF.
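To make the definition concrete, here is a small sketch computing the IF from class means under a projection matrix, assuming the IF is taken as the ratio of the largest to the smallest between-class distance:

```python
import numpy as np

def imbalance_factor(mus, A):
    """IF sketch: ratio of the largest to the smallest between-class
    distance, computed from the class means `mus` under projection A
    (this ratio form of the IF is an assumption of the sketch)."""
    dists = [np.sum((A @ (mi - mj)) ** 2)
             for i, mi in enumerate(mus) for mj in mus[i + 1:]]
    return max(dists) / min(dists)

# Three class means: classes 1 and 2 are close, class 3 is far away.
mus = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 3.0])]
print(imbalance_factor(mus, np.eye(2)))   # 10.0: distances are 1, 9, 10
```

A perfectly balanced metric would place all class pairs at comparable distances, driving this ratio toward 1.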
Theorem 1.
Suppose the regularization parameter and the distance margin are sufficiently large (please refer to Appendix B.1 for the precise conditions, the definitions of the quantities involved, and the detailed proof). Then the following bounds hold for the IF.
- For the VND regularizer $\Omega_{\mathrm{VND}}(A)$, the IF is upper-bounded by an increasing function $g\big(\Omega_{\mathrm{VND}}(A)\big)$, where $g$ is constructed via the inverse of an auxiliary function that is strictly increasing on one interval and strictly decreasing on another (the explicit construction is given in Appendix B.1).
- For the LDD regularizer $\Omega_{\mathrm{LDD}}(A)$, an analogous upper bound holds that is an increasing function of $\Omega_{\mathrm{LDD}}(A)$.
As can be seen, the bounds are increasing functions of the BMD regularizers $\Omega_{\mathrm{VND}}(A)$ and $\Omega_{\mathrm{LDD}}(A)$. Decreasing these regularizers reduces the upper bounds of the imbalance factor, hence leading to more balancedness. For SFN, such a bound cannot be derived.
6.2 Analysis of Generalization Error
In this section, we analyze how the convex BMD regularizers affect the generalization error in MDML-CBMD problems. Following [47], we use a distance-based error to measure the quality of a Mahalanobis distance matrix $M$. Given $n$ training data pairs, the empirical error $\hat{L}(M)$ is defined as the average distance-based loss over the training pairs, and the expected error $L(M)$ is the expectation of this loss over the underlying data distribution. Let $\hat{M}$ be the optimal matrix learned by minimizing the empirical error. We are interested in how well $\hat{M}$ performs on unseen data, measured by the generalization error $L(\hat{M}) - \hat{L}(\hat{M})$. To incorporate the impact of the CBMD regularizers $\hat{\Omega}(M)$, we define the hypothesis class of $M$ to be $\{M \succeq 0 : \hat{\Omega}(M) \le C\}$. The upper bound $C$ controls the strength of regularization: a smaller $C$ entails stronger promotion of orthogonality. $C$ is controlled by the regularization parameter $\gamma$ in Eq. (4); increasing $\gamma$ reduces $C$. For the different CBMD regularizers, we have the following generalization error bounds.
Theorem 2.
With probability at least $1 - \delta$, for each of the CVND, CLDD and CSFN regularizers, the generalization error is upper-bounded by an increasing function of $C$ (the explicit bounds are given in the appendix).
From these generalization error bounds (GEBs), we can see two major implications. First, the CBMD regularizers can effectively control the GEBs: increasing the strength of CBMD regularization (by enlarging $\gamma$) reduces $C$, which decreases the GEBs since they are all increasing functions of $C$. Second, the GEBs converge at rate $O(1/\sqrt{n})$, where $n$ is the number of training data pairs. This rate matches those in [5, 47].
Table 1: Dataset statistics.
#Train  #Test  Dim.  #Class  

MIMIC  40K  18K  1000  2833 
EICU  53K  39K  1000  2175 
Reuters  4152  1779  1000  49 
News  11307  7538  1000  20 
Cars  8144  8041  1000  196 
Birds  9000  2788  1000  200 
Act  7352  2947  561  6 
MIMIC  EICU  Reuters  News  Cars  Birds  Act  
AAll  AIF  BS  AAll  AIF  BS  AAll  AIF  BS  AAll  AAll  AAll  AAll  
PDML  0.634  0.608  0.070  0.671  0.637  0.077  0.949  0.916  0.049  0.757  0.714  0.851  0.949 
MDML  0.641  0.617  0.064  0.677  0.652  0.055  0.952  0.929  0.034  0.769  0.722  0.855  0.952 
LMNN  0.628  0.609  0.054  0.662  0.633  0.066  0.943  0.913  0.040  0.731  0.728  0.832  0.912 
LDML  0.619  0.594  0.068  0.667  0.647  0.046  0.934  0.906  0.042  0.748  0.706  0.847  0.937 
MLEC  0.621  0.605  0.045  0.679  0.656  0.053  0.927  0.916  0.021  0.761  0.725  0.814  0.917 
GMML  0.607  0.588  0.053  0.668  0.648  0.045  0.931  0.905  0.035  0.738  0.707  0.817  0.925 
ILHD  0.577  0.560  0.051  0.637  0.610  0.064  0.905  0.893  0.028  0.711  0.686  0.793  0.898 
MDML  0.648  0.627  0.055  0.695  0.676  0.042  0.955  0.930  0.037  0.774  0.728  0.872  0.958 
MDML  0.643  0.615  0.074  0.701  0.677  0.053  0.953  0.948  0.020  0.791  0.725  0.868  0.961 
MDML  0.646  0.630  0.043  0.703  0.661  0.091  0.963  0.936  0.035  0.783  0.728  0.861  0.964 
MDMLTr  0.659  0.642  0.044  0.696  0.673  0.051  0.961  0.934  0.036  0.785  0.731  0.875  0.955 
MDMLIT  0.653  0.626  0.070  0.692  0.668  0.053  0.954  0.920  0.046  0.771  0.724  0.858  0.967 
MDMLDrop  0.647  0.630  0.045  0.701  0.670  0.067  0.959  0.937  0.032  0.787  0.729  0.864  0.962 
MDMLOS  0.649  0.626  0.059  0.689  0.679  0.045  0.957  0.938  0.031  0.779  0.732  0.869  0.963 
PDMLDC  0.652  0.639  0.035  0.706  0.686  0.044  0.962  0.943  0.034  0.773  0.736  0.882  0.964 
PDMLCS  0.661  0.641  0.053  0.712  0.670  0.089  0.967  0.954  0.020  0.803  0.742  0.895  0.971 
PDMLDPP  0.659  0.632  0.069  0.714  0.695  0.041  0.958  0.937  0.036  0.797  0.751  0.891  0.969 
PDMLIC  0.660  0.642  0.047  0.711  0.685  0.057  0.972  0.954  0.030  0.801  0.740  0.887  0.967 
PDMLDeC  0.648  0.625  0.063  0.698  0.675  0.050  0.965  0.960  0.017  0.786  0.728  0.860  0.958 
PDMLVGF  0.657  0.634  0.059  0.718  0.697  0.045  0.974  0.952  0.036  0.806  0.747  0.894  0.974 
PDMLMA  0.659  0.644  0.040  0.721  0.703  0.038  0.975  0.959  0.024  0.815  0.743  0.898  0.968 
PDMLOC  0.651  0.636  0.041  0.705  0.685  0.043  0.955  0.931  0.036  0.779  0.727  0.875  0.956 
PDMLOS  0.639  0.614  0.067  0.675  0.641  0.072  0.951  0.928  0.038  0.764  0.716  0.855  0.950 
PDMLSFN  0.662  0.642  0.051  0.724  0.701  0.045  0.973  0.947  0.038  0.808  0.749  0.896  0.970 
PDMLVND  0.667  0.655  0.032  0.733  0.706  0.057  0.976  0.971  0.012  0.814  0.754  0.902  0.972 
PDMLLDD  0.664  0.651  0.035  0.731  0.711  0.043  0.973  0.964  0.017  0.816  0.751  0.904  0.971 
MDMLCSFN  0.668  0.653  0.039  0.728  0.705  0.049  0.978  0.968  0.023  0.813  0.753  0.905  0.972 
MDMLCVND  0.672  0.664  0.022  0.735  0.718  0.035  0.984  0.982  0.012  0.822  0.755  0.908  0.973 
MDMLCLDD  0.669  0.658  0.029  0.739  0.719  0.042  0.981  0.980  0.011  0.819  0.759  0.913  0.971 
On the three imbalanced datasets (MIMIC, EICU, Reuters), we show the mean AUC (averaged over 5 random train/test splits) on all classes (AAll) and on infrequent classes (AIF), together with the balance score (BS). On the remaining 4 balanced datasets, AAll is shown. The AUC on frequent classes and the standard errors are given in Appendix D.3.
7 Experiments
Datasets
We used 7 datasets in the experiments: two electronic health record datasets, MIMIC (version III) [23] and EICU (version 1.1) [17]; two text datasets, Reuters (http://www.daviddlewis.com/resources/testcollections/reuters21578/) and 20-Newsgroups (News, http://qwone.com/~jason/20Newsgroups/); two image datasets, Stanford-Cars (Cars) [26] and Caltech-UCSD-Birds (Birds) [53]; and one sensory dataset, 6-Activities (Act) [1]. The MIMIC-III dataset contains 58K hospital admissions of 47K patients who stayed in intensive care units (ICUs). Each admission has a primary diagnosis (a disease), which acts as the class label of the admission; there are 2833 unique diseases. We extract 7207-dimensional features from demographics, clinical notes, and lab tests. The EICU dataset contains 92K ICU admissions diagnosed with 2175 unique diseases; 3743-dimensional features are extracted from demographics, lab tests, vital signs, and past medical history. For the Reuters dataset, after removing documents that have more than one label and classes that have fewer than 3 documents, we are left with 5931 documents and 48 classes. Documents in Reuters and News are represented with tf-idf vectors with a vocabulary size of 5000. For the two image datasets, Birds and Cars, we use the VGG16 [43] convolutional neural network trained on the ImageNet [12] dataset to extract features, which are the 4096-dimensional outputs of the second fully-connected layer. The 6-Activities dataset contains sensory recordings of 30 subjects performing 6 activities (which are the class labels); the features are 561-dimensional sensory signals. For the first six datasets, the features are normalized using min-max normalization along each dimension and the feature dimension is reduced to 1000 using PCA. Since there is no standard training/test split, we perform five random splits and average the results over the five runs. Dataset statistics are summarized in Table 1.
More details of the datasets and feature extraction are deferred to Appendix D.1.
Experimental Settings
Two examples are considered similar if they belong to the same class and dissimilar otherwise. The learned distance metrics are applied to retrieval (using each test example to query the rest of the test examples), whose performance is evaluated using the area under the precision-recall curve (AUC) [34]; higher is better. Note that the learned distance metrics can also be applied to other tasks such as clustering and classification; due to space limits, we focus on retrieval. We apply the proposed convex regularizers CSFN, CVND and CLDD to MDML and compare them with two sets of baseline regularizers. The first set aims at promoting orthogonality and is based on the determinant of covariance (DC) [33], cosine similarity (CS) [62], determinantal point process (DPP) [27, 65], InCoherence (IC) [3], variational Gram function (VGF) [64, 21], decorrelation (DeC) [10], mutual angles (MA) [56], squared Frobenius norm (SFN) [50, 13, 16, 9], von Neumann divergence (VND) [57], log-determinant divergence (LDD) [57], and the orthogonal constraint (OC) [31, 48]. All these regularizers are applied to PDML. The other set of regularizers is not designed specifically for promoting orthogonality but is commonly used, including the $\ell_2$ norm, the $\ell_1$ norm [40], the $\ell_{2,1}$ norm [60], the trace norm (Tr) [32], the information-theoretic (IT) regularizer [11], and Dropout (Drop) [45]. All these regularizers are applied to MDML. One common way of dealing with class imbalance is oversampling (OS) [14], which repetitively draws samples from the empirical distributions of infrequent classes until all classes have the same number of samples; we apply this technique to PDML and MDML.
In addition, we compare with the vanilla Euclidean distance (EUC) and other distance learning methods, including large-margin nearest neighbor (LMNN) metric learning, information-theoretic metric learning (ITML) [11], logistic discriminant metric learning (LDML) [20], metric learning from equivalence constraints (MLEC) [25], geometric mean metric learning (GMML) [63], and independent Laplacian hashing with diversity (ILHD) [8]. The PDML-based methods except PDML-OC are solved with stochastic subgradient descent (SSD); PDML-OC is solved using the algorithm proposed in [54]. The MDML-based methods are solved with proximal SSD. The learning rate is set to 0.001. The mini-batch size is set to 100 (50 similar pairs and 50 dissimilar pairs). We use 5-fold cross-validation to tune the regularization parameter and the number of projection vectors (of the PDML methods). In CVND and CLDD, $\epsilon$ is set to a small constant. The margin is set to 1. In the MDML-based methods, after the Mahalanobis matrix $M$ (of rank $r$) is learned, we factorize it as $M = A^\top A$ with $A \in \mathbb{R}^{r \times d}$ (see Appendix D.2), then perform retrieval based on $A$, which is more efficient than retrieval based on $M$. Each method is implemented on GPUs using the MAGMA library. The experiments are conducted on a GPU cluster with 40 machines.
Method  MIMIC  EICU  Reuters  News  Cars  Birds  Act
PDML  62.1  66.6  5.2  11.0  8.4  10.1  3.4
MDML  3.4  3.7  0.3  0.6  0.5  0.6  0.2
PDML-DC  424.7  499.2  35.2  65.6  61.8  66.2  17.2
PDML-CS  263.2  284.8  22.6  47.2  34.5  42.8  14.4
PDML-DPP  411.8  479.1  36.9  61.9  64.2  70.5  16.5
PDML-IC  265.9  281.2  23.4  46.1  37.5  45.2  15.3
PDML-DeC  458.5  497.5  41.8  78.2  78.9  80.7  19.9
PDML-VGF  267.3  284.1  22.3  48.9  35.8  38.7  15.4
PDML-MA  271.4  282.9  23.6  50.2  30.9  39.6  17.5
PDML-OC  104.9  118.2  9.6  14.3  14.8  17.0  3.9
PDML-SFN  261.7  277.6  22.9  46.3  36.2  38.2  15.9
PDML-VND  401.8  488.3  33.8  62.5  67.5  73.4  17.1
PDML-LDD  407.5  483.5  34.3  60.1  61.8  72.6  17.9
MDML-CSFN  41.1  43.9  3.3  7.3  6.5  6.9  1.8
MDML-CVND  43.8  46.2  3.6  8.1  6.9  7.8  2.0
MDML-CLDD  41.7  44.5  3.4  7.5  6.6  7.2  1.8
Results
The training time taken by different methods to reach convergence is shown in Table 7. For the nonconvex, PDML-based methods, we report the total time taken by the following computation: tuning the regularization parameter (4 choices) and the number of projection vectors (NPVs, 6 choices) on a two-dimensional grid via 3-fold cross validation (4 × 6 × 3 = 72 experiments in total); for each of the 72 experiments, the algorithm restarts 5 times, each with a different initialization, and picks the run yielding the lowest objective value.^6 In total, the number of runs is 72 × 5 = 360. For the MDML-based methods, there is no need to restart multiple times or to tune the NPVs; the total number of runs is 4 × 3 = 12. As can be seen from the table, the proposed convex methods are much faster than the nonconvex ones, owing to the greatly reduced number of experimental runs, even though each single run of a convex method is less efficient than that of a nonconvex method due to the overhead of eigendecomposition. The unregularized MDML takes the least time to train since it has no parameters to tune and runs only once. On average, the time of a single run of MDML-(CSFN, CVND, CLDD) is close to that of unregularized MDML, since an eigendecomposition is required regardless of the presence of the regularizers.

^6 Our experiments show that, for nonconvex methods, multiple restarts are necessary for good performance. For example, for PDML-VND on MIMIC with 100 projection vectors, the AUC is non-decreasing in the number of restarts: after 1, 2, 3, 4, and 5 restarts it is 0.651, 0.651, 0.658, 0.667, and 0.667, respectively.
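The restart scheme used for the nonconvex baselines can be sketched as follows (function names and the toy objective are illustrative, not from the paper's code): run the solver several times from random initializations and keep the solution with the lowest training objective.

```python
import numpy as np

def train_with_restarts(objective, init_and_fit, n_restarts=5, seed=0):
    """Run a nonconvex solver n_restarts times from different random
    initializations and keep the solution with the lowest objective."""
    rng = np.random.default_rng(seed)
    best_params, best_obj = None, np.inf
    for _ in range(n_restarts):
        params = init_and_fit(rng)     # one full optimization run
        obj = objective(params)
        if obj < best_obj:
            best_params, best_obj = params, obj
    return best_params, best_obj

# Toy usage: "fitting" just draws a random candidate; the restart loop
# keeps whichever candidate minimizes the toy objective.
obj = lambda p: float(np.sum(p ** 2))
fit = lambda rng: rng.standard_normal(3)
params, value = train_with_restarts(obj, fit)
```

Convex formulations make this loop unnecessary: a single run reaches the global optimum regardless of initialization, which is the source of the run-count reduction described above.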
Method  MIMIC  EICU  Reuters  News  Cars  Birds  Act
        NPV  CS  NPV  CS  NPV  CS  NPV  CS  NPV  CS  NPV  CS  NPV  CS
PDML  300  2.1  400  1.7  300  3.2  300  2.5  300  2.4  500  1.7  200  4.7
MDML  247  2.6  318  2.1  406  2.3  336  2.3  376  1.9  411  2.1  168  5.7
LMNN  200  3.1  400  1.7  400  2.4  300  2.4  400  1.8  500  1.7  300  3.0
LDML  300  2.1  400  1.7  400  2.3  200  3.7  300  2.4  400  2.1  300  3.1
MLEC  487  1.3  493  1.4  276  3.4  549  1.4  624  1.2  438  1.9  327  2.8
GMML  1000  0.6  1000  0.7  1000  0.9  1000  0.7  1000  0.7  1000  0.8  1000  0.9
ILHD  100  5.8  100  6.4  50  18.1  100  7.1  100  6.9  100  7.9  50  18.0
MDML  269  2.4  369  1.9  374  2.6  325  2.4  332  2.2  459  1.9  179  5.4
MDML  341  1.9  353  2.0  417  2.3  317  2.5  278  2.6  535  1.6  161  6.0
MDML  196  3.3  251  2.8  288  3.3  316  2.5  293  2.5  326  2.6  135  7.1
MDML-Tr  148  4.5  233  3.0  217  4.4  254  3.1  114  6.4  286  3.1  129  7.4
MDML-IT  1000  0.7  1000  0.7  1000  1.0  1000  0.8  1000  0.7  1000  0.9  1000  1.0
MDML-Drop  183  3.5  284  2.5  315  3.0  251  3.1  238  3.1  304  2.8  147  6.5
PDML-DC  100  6.5  300  2.4  100  9.6  200  3.9  200  3.7  300  2.9  100  9.6
PDML-CS  200  3.3  200  3.6  200  4.8  100  8.0  100  7.4  200  4.5  50  19.4
PDML-DPP  100  6.6  200  3.6  100  9.6  100  8.0  200  3.8  200  4.5  100  9.7
PDML-IC  200  3.3  200  3.6  200  4.9  100  8.0  200  3.7  100  8.9  100  9.7
PDML-DeC  200  3.2  300  2.3  200  4.8  200  3.9  200  3.6  200  4.3  100  9.6
PDML-VGF  200  3.3  200  3.6  200  4.9  100  8.1  200  3.7  200  4.5  100  9.7
PDML-MA  200  3.3  200  3.6  100  9.8  100  8.2  100  7.4  200  4.5  50  19.4
PDML-SFN  100  6.6  200  3.6  100  9.7  100  8.1  100  7.5  200  4.5  50  19.4
PDML-OC  100  6.5  100  7.1  50  19.1  50  15.6  100  7.3  100  8.8  50  19.1
PDML-VND  100  6.7  100  7.3  50  19.5  100  8.1  100  7.5  100  9.0  50  19.4
PDML-LDD  100  6.6  200  3.7  100  9.7  100  8.2  100  7.5  100  9.0  50  19.4
MDML-CSFN  143  4.7  209  3.5  174  5.6  87  9.3  62  12.1  139  6.5  64  15.2
MDML-CVND  53  12.7  65  11.3  61  16.0  63  13.0  127  5.9  92  9.9  68  14.3
MDML-CLDD  76  8.8  128  5.8  85  11.5  48  17.1  91  8.3  71  12.9  55  17.7
Next, we verify whether CSFN, CVND and CLDD are able to learn more balanced distance metrics. On the three datasets with imbalanced classes, MIMIC, EICU and Reuters, we consider a class "frequent" if it contains more than 1000 examples and "infrequent" otherwise. We measure AUCs on all classes (AAll), infrequent classes (AIF) and frequent classes (AF), then define a balance score (BS) as . A smaller BS indicates more balancedness. As shown in Table 2, MDML-(CSFN, CVND, CLDD) achieve the highest AAll on 6 datasets and the highest AIF on all 3 imbalanced datasets. In terms of BS, our convex methods outperform all baseline DML methods. These results demonstrate that our methods learn more balanced metrics. By encouraging the projection vectors to be close to orthogonal, our methods reduce the redundancy among vectors. Mutually complementary vectors achieve a broader coverage of latent features, including those associated with infrequent classes; as a result, they improve the performance on infrequent classes and lead to better balancedness. Thanks to their convexity, our methods reach the globally optimal solution and outperform the nonconvex methods, which reach only a local and hence suboptimal solution. Comparing (PDML, MDML)-OS with the unregularized PDML/MDML, we see that oversampling does improve balancedness, but the improvement is less significant than that achieved by our methods. In general, the orthogonality-promoting (OP) regularizers outperform the non-OP regularizers, confirming the effectiveness of promoting orthogonality. The orthogonal constraint (OC) [31, 48] imposes strict orthogonality, which may be so restrictive that it hurts performance. ILHD [8] learns binary hash codes, which makes retrieval more efficient; however, it achieves lower AUCs due to quantization errors.
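A minimal sketch of this per-group evaluation, assuming the precision-recall AUC is computed as average precision and using a toy frequency threshold (the paper's balance-score formula is elided above and not reproduced here; all helper names are illustrative):

```python
import numpy as np

def average_precision(relevance):
    """Average precision (area under the precision-recall curve) for a
    ranked 0/1 relevance list: mean of precision@k over relevant ranks."""
    n_rel = int(np.sum(relevance))
    if n_rel == 0:
        return 0.0
    hits, ap = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            ap += hits / rank
    return ap / n_rel

def frequent_vs_infrequent_auc(labels, dist, freq_threshold=1000):
    """Leave-one-out retrieval: each example queries all others, ranked by
    distance; AP is averaged separately over queries from frequent classes
    (> freq_threshold examples) and infrequent ones."""
    labels = np.asarray(labels)
    n = len(labels)
    counts = {c: int(np.sum(labels == c)) for c in np.unique(labels)}
    ap_freq, ap_infreq = [], []
    for q in range(n):
        others = np.delete(np.arange(n), q)
        order = others[np.argsort(dist[q, others])]
        rel = (labels[order] == labels[q]).astype(int)
        bucket = ap_freq if counts[labels[q]] > freq_threshold else ap_infreq
        bucket.append(average_precision(rel))
    return float(np.mean(ap_freq)), float(np.mean(ap_infreq))
```

A large gap between the two returned averages signals the imbalance phenomenon described above: a metric that serves frequent classes well but infrequent classes poorly.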
MDML(CSFN,CVND,CLDD) outperform popular DML approaches including LMNN, LDML, MLEC and GMML, demonstrating their competitive standing in the DML literature.
Next, we verify whether the distance metrics learned by MDML-(CSFN, CVND, CLDD) are compact. Table 4 shows the numbers of projection vectors (NPVs) that achieve the AUCs in Table 2. For MDML-based methods, the NPV equals the rank of the Mahalanobis matrix since . We define a compactness score (CS) as the ratio between AAll (given in Table 2) and NPV. A higher CS indicates achieving a higher AUC with fewer projection vectors. From Table 4, we can see that on 5 datasets MDML-(CSFN, CVND, CLDD) achieve larger CSs than the baseline methods, demonstrating a better ability to learn compact distance metrics. Similar to the observations in Table 2, CSFN, CVND and CLDD perform better than the nonconvex regularizers, and CVND and CLDD perform better than CSFN. Reducing the NPV improves the efficiency of retrieval, since the computational complexity grows linearly with this number. Together, these results demonstrate that MDML-(CSFN, CVND, CLDD) outperform the other methods in learning distance metrics that are both compact and balanced.
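A small sketch of how NPV and CS can be read off a learned Mahalanobis matrix (the rank tolerance and helper name are illustrative assumptions): the NPV is the number of eigenvalues of the symmetric PSD matrix above a small tolerance, and CS divides the overall AUC by that count.

```python
import numpy as np

def npv_and_cs(M, auc_all, tol=1e-8):
    """NPV = rank of the learned Mahalanobis matrix M (eigenvalues above
    a small tolerance); CS = AUC-All divided by NPV."""
    eigvals = np.linalg.eigvalsh(M)       # M is symmetric PSD
    npv = int(np.sum(eigvals > tol))
    return npv, auc_all / npv

# Toy usage: a rank-3 PSD matrix built from a 6 x 3 factor.
L = np.random.default_rng(1).standard_normal((6, 3))
M = L @ L.T
npv, cs = npv_and_cs(M, auc_all=0.66)     # npv == 3, cs == 0.22
```

The same eigendecomposition also yields the factor used for efficient retrieval, which is why the regularized and unregularized MDML runs have similar per-run cost.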
As can be seen from Table 2, our methods MDML-(CVND, CLDD) achieve the best AUC-All. Table 10 (Appendix D.3) shows that MDML-(CVND, CLDD) also have the smallest gap between training and testing AUC, indicating that our methods are better at reducing overfitting and improving generalization performance.
8 Conclusions
In this paper, we have addressed three issues of existing orthogonality-promoting DML methods: computational inefficiency and the lack of theoretical analysis of balancedness and generalization. To address the computational issue, we perform a convex relaxation of these regularizers and develop a proximal gradient descent algorithm to solve the convex problems. To address the analysis issue, we define an imbalance factor (IF) to measure (im)balancedness and prove that decreasing the Bregman matrix divergence regularizers (which promote orthogonality) reduces the upper bound of the IF, hence leading to more balancedness. We also provide a generalization error (GE) analysis showing that decreasing the convex regularizers reduces the GE upper bound. Experiments on datasets from different domains demonstrate that our methods are computationally more efficient and more capable of learning balanced, compact and generalizable distance metrics than other approaches.
Appendix A Convex Approximations of BMD Regularizers
Approximation of VND regularizer
Given , according to the property of matrix logarithm, , where . Then , where the eigenvalues are . Since , we have . Now we consider a matrix , where is a small scalar. The eigenvalues of this matrix are
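The elided expressions in this derivation cannot be reconstructed here, but a standard fact used in perturbation arguments of this kind, that adding a small multiple of the identity to a symmetric matrix shifts every eigenvalue by exactly that scalar while leaving the eigenvectors unchanged, can be checked numerically (the matrix below is an arbitrary example, not the appendix's):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
A = (A + A.T) / 2                          # make A symmetric
eps = 1e-3                                 # small scalar perturbation

# eig(A + eps * I) = eig(A) + eps, elementwise in sorted order
shifted = np.linalg.eigvalsh(A + eps * np.eye(4))
assert np.allclose(shifted, np.linalg.eigvalsh(A) + eps)
```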