Orthogonality-Promoting Distance Metric Learning: Convex Relaxation and Theoretical Analysis

02/16/2018 ∙ by Pengtao Xie, et al. ∙ 0

Distance metric learning (DML), which learns a distance metric from labeled "similar" and "dissimilar" data pairs, is widely utilized. Recently, several works have investigated orthogonality-promoting regularization (OPR), which encourages the projection vectors in DML to be close to orthogonal, to achieve three effects: (1) high balancedness -- achieving comparable performance on both frequent and infrequent classes; (2) high compactness -- using a small number of projection vectors to achieve a "good" metric; (3) good generalizability -- alleviating overfitting to training data. While showing promising results, these approaches suffer from three problems. First, they involve solving non-convex optimization problems where achieving the global optimum is NP-hard. Second, there is no theoretical understanding of why OPR can lead to balancedness. Third, the current generalization error analysis of OPR is not directly on the regularizer. In this paper, we address these three issues by (1) seeking convex relaxations of the original nonconvex problems so that the global optimum is guaranteed to be achievable; (2) providing a formal analysis of OPR's capability of promoting balancedness; (3) providing a theoretical analysis that directly reveals the relationship between OPR and generalization performance. Experiments on various datasets demonstrate that our convex methods are more effective in promoting balancedness, compactness, and generalization, and are computationally more efficient, compared with the nonconvex methods.




1 Introduction

Given data pairs labeled as either “similar” or “dissimilar”, distance metric learning (DML) [58, 51, 11] learns a distance measure under which similar examples are placed close to each other while dissimilar ones are pushed apart. The learned distance metrics are important to many downstream tasks, such as retrieval [9], classification [51] and clustering [58]. One commonly used distance metric between two examples $x, y \in \mathbb{R}^D$ is $\|Ax - Ay\|_2^2$ [51, 55, 9], which is parameterized by $R$ projection vectors (in $A \in \mathbb{R}^{R \times D}$).

Many works [50, 55, 48, 42, 9] have proposed orthogonality-promoting DML to learn distance metrics that are (1) balanced: performing equally well on data instances belonging to frequent and infrequent classes; (2) compact: using a small number of projection vectors to achieve a “good” metric (i.e., one that captures the relative distances of the data pairs well); (3) generalizable: reducing overfitting to training data. Regarding balancedness, under many circumstances, the frequency of classes, defined as the number of examples belonging to each class, can be highly imbalanced. Classic DML methods are sensitive to the skewness of the class frequencies: they perform favorably on frequent classes but less well on infrequent classes, a phenomenon also confirmed in our experiments in Section 7. However, infrequent classes are of crucial importance in many applications and should not be ignored. For example, in a clinical setting, many diseases occur infrequently but are life-threatening. Regarding compactness, the number $R$ of projection vectors entails a tradeoff between performance and computational complexity [16, 55, 42]. On one hand, more projection vectors bring more expressiveness in measuring distance. On the other hand, a larger $R$ incurs a higher computational overhead since the number of weight parameters in the projection matrix $A$ grows linearly with $R$. It is therefore desirable to keep $R$ small without substantially hurting performance. Regarding generalization performance, when the sample size is small but the size of $A$ is large, overfitting can easily happen.

To address these three issues, many studies [58, 51, 11, 20, 61, 25, 63] propose to regularize the projection vectors to be close to being orthogonal. For balancedness, they argue that, without orthogonality-promoting regularization (OPR), the majority of projection vectors learn latent features for frequent classes since these classes have dominant signals in the dataset; through OPR, the projection vectors uniformly “spread out”, giving both infrequent and frequent classes a fair treatment and thus leading to a more balanced distance metric (see [57] for details). For compactness, they claim that: “diversified” projection vectors bear less redundancy and are mutually complementary; as a result, a small number of such vectors are sufficient to achieve a “good” distance metric. For generalization performance, they posit that OPR imposes a structured constraint on the function class of DML, hence reduces model complexity.

While these orthogonality-promoting DML methods have shown promising results, they have three problems. First, they involve solving non-convex optimization problems where the global solution is extremely difficult, if not impossible, to obtain. Second, no formal analysis has been conducted regarding why OPR can promote balancedness. Third, while the generalization error (GE) of OPR has been analyzed in [57], the analysis is incomplete: it first shows that the upper bound of GE is a function of cosine similarity (CS), then shows that CS and the regularizer are somewhat aligned in shape, but it does not establish a direct relationship between the GE bound and the regularizer.

In this paper, we aim at addressing these problems by making the following contributions:


  • We relax the nonconvex, orthogonality-promoting DML problems into convex problems and develop efficient proximal gradient descent algorithms. The algorithms only run once with a single initialization, and hence are much more efficient than existing non-convex methods.

  • We perform theoretical analysis which formally reveals the relationship between OPR and balancedness: stronger OPR leads to more balancedness.

  • We perform generalization error (GE) analysis which shows that reducing the convex orthogonality-promoting regularizers can reduce the upper bound of GE.

  • We apply the learned distance metrics for information retrieval to healthcare, texts, images, and sensory data. Compared with non-convex baseline methods, our approaches achieve higher computational efficiency and are more capable of improving balancedness, compactness and generalizability.

2 Related Works

2.1 Distance Metric Learning

Many studies [58, 51, 11, 20, 61, 25, 63] have investigated DML. Please refer to [28, 49] for a detailed review. Xing et al. [58] learn a Mahalanobis distance by minimizing the sum of distances of all similar data pairs subject to the constraint that the sum of distances of all dissimilar pairs is no less than 1. Weinberger et al. [51] propose large margin metric learning, which is applied to k-nearest neighbor classification. For each data example, they first obtain its nearest neighbors based on Euclidean distance. Among the neighbors, some have the same class label as the example and others do not. A projection matrix is then learned such that the example is closer to its same-class neighbors than to its different-class neighbors by a margin. Davis et al. [11] learn a Mahalanobis distance such that the distance between similar pairs is no more than one threshold and the distance between dissimilar pairs is no less than another threshold. Guillaumin et al. [20] define a probability of the similarity/dissimilarity label conditioned on the Mahalanobis distance, where the binary label equals 1 if the two examples have the same class label. The Mahalanobis matrix is learned by maximizing the conditional likelihood of the training data. Kostinger et al. [25] learn a Mahalanobis distance metric from equivalence constraints based on a likelihood ratio test. The Mahalanobis matrix is computed in one shot, without going through an iterative optimization procedure. Ying and Li [61] formulate DML as an eigenvalue optimization problem. Zadeh et al. [63] propose a geometric mean metric learning approach, based on the Riemannian geometry of positive definite matrices. Similar to [25], the Mahalanobis matrix has a closed-form solution without iterative optimization.

To avoid overfitting in DML, various regularization approaches have been explored. Davis et al. [11] regularize the Mahalanobis matrix to be close to another matrix that encodes prior information, where the closeness is measured using log-determinant divergence. Qi et al. [40] use $\ell_1$ regularization to learn sparse distance metrics for high-dimensional, small-sample problems. Ying et al. [60] use the $\ell_{2,1}$ norm to simultaneously encourage low-rankness and sparsity. The trace norm is leveraged to encourage low-rankness in [37, 32]. Qian et al. [41] apply dropout to DML. Many works [50, 16, 55, 59, 42, 9] study diversity-promoting regularization in DML or hashing. They define regularizers based on the squared Frobenius norm [50, 13, 16, 9] or angles [55, 59] to encourage the projection vectors to approach orthogonality. Several works [31, 52, 18, 22, 48] impose a strict orthogonality constraint on the projection vectors. As observed in previous works [50, 13] and our experiments, strict orthogonality hurts performance. Isotropic hashing [24, 15] encourages the variances of different projected dimensions to be equal to achieve balance. Carreira-Perpinán and Raziperchikolaei [8] propose a diversity hashing method which first trains hash functions independently and then introduces diversity among them based on classifier ensembles.

2.2 Orthogonality-Promoting Regularization

Orthogonality-promoting regularization has been studied in other problems as well, including ensemble learning, latent variable modeling, classification and multitask learning. In ensemble learning, many studies [30, 2, 39, 62] promote orthogonality among the coefficient vectors of base classifiers or regressors, with the aim to improve generalization performance and reduce computational complexity. Recently, several works [65, 3, 10, 56] study orthogonality-promoting regularization of latent variable models (LVMs), which encourages the components in LVMs to be mutually orthogonal, for the sake of capturing infrequent patterns and reducing the number of components without sacrificing modeling power. In these works, various orthogonality-promoting regularizers have been proposed, based on determinantal point process [27, 65] and cosine similarity [62, 3, 56]. In multi-way classification, Malkin and Bilmes [33] propose to use the determinant of a covariance matrix to encourage orthogonality among classifiers. Jalali et al. [21] propose a class of variational Gram functions (VGFs) to promote pairwise orthogonality among vectors. While these VGFs are convex, they can only be applied to non-convex DML formulations. As a result, the overall regularized DML is non-convex and is not amenable for convex relaxation.

In the sequel, we review two families of orthogonality-promoting regularizers.

Determinantal Point Process

[65] employed the determinantal point process (DPP) [27] as a prior to induce orthogonality in latent variable models. DPP is defined over $K$ vectors: $p(\{a_i\}_{i=1}^K) \propto \det(L)$, where $L$ is a $K \times K$ kernel matrix with $L_{ij} = k(a_i, a_j)$ and $k(\cdot, \cdot)$ as a kernel function. $\det(\cdot)$ denotes the determinant of a matrix. A configuration of $\{a_i\}_{i=1}^K$ with larger probability is deemed to be more orthogonal. The underlying intuition is that $\det(L)$ represents the volume of the parallelepiped formed by vectors in the kernel-induced feature space. If these vectors are closer to being orthogonal, the volume is larger, which results in a larger $\det(L)$. The shortcoming of DPP is that it is sensitive to vector scaling: enlarging the magnitudes of vectors results in a larger volume, but does not affect the orthogonality of the vectors.
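The scale sensitivity described above can be checked numerically. The following is a minimal sketch, assuming a linear kernel (the inner product of two vectors); the kernel choice and the helper names `linear_kernel_det` and `pairwise_cosines` are illustrative assumptions and are not taken from [65]. Doubling three vectors leaves every pairwise angle unchanged, yet multiplies the determinant of the 3-by-3 kernel matrix by $4^3 = 64$.

```python
import numpy as np

def linear_kernel_det(vectors):
    """Determinant of the linear-kernel (Gram) matrix of the row vectors."""
    gram = vectors @ vectors.T
    return np.linalg.det(gram)

def pairwise_cosines(vectors):
    """Matrix of cosine similarities between the row vectors."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms
    return unit @ unit.T

rng = np.random.default_rng(0)
vectors = rng.normal(size=(3, 5))  # 3 vectors in 5 dimensions
scaled = 2.0 * vectors             # same directions, larger magnitudes

# Angles are identical, but the determinant-based score is not.
same_angles = np.allclose(pairwise_cosines(vectors), pairwise_cosines(scaled))
det_ratio = linear_kernel_det(scaled) / linear_kernel_det(vectors)
```

Here `same_angles` confirms that orthogonality is unaffected by the rescaling, while `det_ratio` shows how strongly the determinant reacts to it.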

Pairwise Cosine Similarity

Several works define orthogonality-promoting regularizers based on the pairwise cosine similarity among component vectors: if the cosine similarity scores are close to zero, then the components are closer to being orthogonal. Given component vectors, the cosine similarity between each pair of components and is computed: . Then these scores are aggregated as a single score. In [62], these scores are aggregated as . In [3], the aggregation is performed as where . In [56], the aggregated score is defined as mean of minus the variance of .
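As a small illustration of the pairwise scores that these regularizers aggregate, the sketch below computes the cosine similarity for every pair of component vectors. Taking the absolute value of the cosine is an assumption made here so that a score of zero means orthogonal regardless of sign; the function name `pairwise_abs_cosine` is hypothetical, and the specific aggregation rules of [62], [3] and [56] are not reproduced.

```python
import numpy as np

def pairwise_abs_cosine(vectors):
    """Absolute cosine similarity for every unordered pair of row vectors."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    cos = np.abs(unit @ unit.T)
    upper = np.triu_indices(len(vectors), k=1)  # index pairs (i, j) with i < j
    return cos[upper]

# Two orthogonal vectors of different lengths: the score ignores magnitude.
orthogonal = np.array([[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
# Two vectors at a 45-degree angle.
correlated = np.array([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])

scores_orthogonal = pairwise_abs_cosine(orthogonal)
scores_correlated = pairwise_abs_cosine(correlated)
```

Unlike the determinant-based score, these values do not change when the vectors are rescaled.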

3 Preliminaries

We review a DML method [57] that uses the Bregman matrix divergence (BMD) [29] to promote orthogonality.

Distance Metric Learning

Given data pairs labeled either as “similar” $\mathcal{S} = \{(x_i, y_i)\}_{i=1}^{|\mathcal{S}|}$ or “dissimilar” $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{|\mathcal{D}|}$, DML [58, 51, 11] aims to learn a distance metric under which similar examples are close to each other and dissimilar ones are separated far apart. There are many ways to define a distance metric. Here, we present two popular choices. One is based on linear projection [51, 55, 9]. Given two examples $x, y \in \mathbb{R}^D$, a linear projection matrix $A \in \mathbb{R}^{R \times D}$ can be utilized to map them into an $R$-dimensional latent space. The distance metric is then defined as their squared Euclidean distance in the latent space: $\|Ax - Ay\|_2^2$. $A$ can be learned by minimizing [58]: $\frac{1}{|\mathcal{S}|} \sum_{(x,y) \in \mathcal{S}} \|Ax - Ay\|_2^2 + \frac{1}{|\mathcal{D}|} \sum_{(x,y) \in \mathcal{D}} \max(0, \tau - \|Ax - Ay\|_2^2)$, which aims at making the distances between similar examples as small as possible while separating dissimilar examples with a margin $\tau$ using a hinge loss. We call this formulation projection matrix-based DML (PDML). PDML is a non-convex problem where the global optimum is difficult to achieve. Moreover, one needs to manually tune the number of projection vectors $R$, typically via cross-validation, which incurs substantial computational overhead.
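As a rough illustration of the PDML objective described above, the sketch below evaluates the loss for a given projection matrix. It assumes that the similar-pair term and the hinge term are each averaged over their own set of pairs, and that pairs are stored as arrays of shape `(n_pairs, 2, D)`; the function name `pdml_loss` and this data layout are illustrative assumptions rather than part of the original formulation.

```python
import numpy as np

def pdml_loss(A, similar, dissimilar, margin=1.0):
    """Mean squared projected distance on similar pairs plus a hinge loss
    that penalizes dissimilar pairs closer than `margin`.

    A:          projection matrix of shape (R, D)
    similar:    array of shape (n_similar, 2, D)
    dissimilar: array of shape (n_dissimilar, 2, D)
    """
    def squared_distances(pairs):
        diff = pairs[:, 0, :] - pairs[:, 1, :]   # x - y for every pair
        projected = diff @ A.T                   # A(x - y)
        return np.sum(projected ** 2, axis=1)    # ||Ax - Ay||^2

    similar_term = squared_distances(similar).mean()
    hinge_term = np.maximum(0.0, margin - squared_distances(dissimilar)).mean()
    return similar_term + hinge_term
```

A dissimilar pair contributes nothing once its projected squared distance exceeds the margin, so only pairs that are still too close influence the second term.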

The other popular choice of distance metric is $(x - y)^\top M (x - y)$, which is cast from $\|Ax - Ay\|_2^2$ by replacing $A^\top A$ with a positive semidefinite (PSD) matrix $M$. This is known as the Mahalanobis distance [58]. Correspondingly, the PDML formulation can be transformed into a Mahalanobis distance-based DML (MDML) problem: $\min_{M \succeq 0} \frac{1}{|\mathcal{S}|} \sum_{(x,y) \in \mathcal{S}} (x - y)^\top M (x - y) + \frac{1}{|\mathcal{D}|} \sum_{(x,y) \in \mathcal{D}} \max(0, \tau - (x - y)^\top M (x - y))$, which is a convex problem where the global solution is guaranteed to be achievable. It also avoids tuning the number of projection vectors. However, the drawback of this approach is that, in order to satisfy the PSD constraint, one needs to perform eigen-decomposition of $M$ in each iteration, which incurs $O(D^3)$ complexity.
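The relationship between the two metrics can be sketched as follows. The first two functions check that the projection-based distance equals the Mahalanobis distance when the Mahalanobis matrix is the product of the transposed projection matrix with itself. The third shows one common way of enforcing the PSD constraint through eigen-decomposition, namely clipping negative eigenvalues to zero; this particular projection and all function names are assumptions for illustration, not a description of the authors' implementation.

```python
import numpy as np

def projection_distance(A, x, y):
    """Squared Euclidean distance between x and y after projecting with A."""
    return float(np.sum((A @ x - A @ y) ** 2))

def mahalanobis_distance(M, x, y):
    """Mahalanobis distance (x - y)^T M (x - y)."""
    diff = x - y
    return float(diff @ M @ diff)

def project_to_psd(M):
    """Euclidean projection of a symmetric matrix onto the PSD cone,
    obtained by zeroing out its negative eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(M)
    clipped = np.clip(eigvals, 0.0, None)
    return (eigvecs * clipped) @ eigvecs.T   # U diag(clipped) U^T
```

The `eigh` call is the cubic-cost step: it is what makes each PSD-constrained iteration expensive for large feature dimensions.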

Orthogonality-Promoting Regularization

Among the various orthogonality-promoting regularizers, we choose the BMD [29] regularizer [57] in this study since it is amenable for convex relaxation and facilitates theoretical analysis.

To encourage orthogonality between two vectors $a_i$ and $a_j$, one can make their inner product $a_i^\top a_j$ close to zero and their norms $\|a_i\|_2$, $\|a_j\|_2$ close to one. For a set of vectors $\{a_i\}_{i=1}^R$, near-orthogonality can be achieved by computing the Gram matrix $G$ where $G_{ij} = a_i^\top a_j$, then encouraging $G$ to be close to an identity matrix $I$. The off-diagonal entries of $G$ and $I$ are $a_i^\top a_j$ and zero, respectively. The diagonal entries of $G$ and $I$ are $\|a_i\|_2^2$ and one, respectively. Making $G$ close to $I$ effectively encourages $a_i^\top a_j$ to be close to zero and $\|a_i\|_2$ to be close to one, which therefore encourages $a_i$ and $a_j$ to be close to orthogonal.

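BMDs can be used to measure the “closeness” between two matrices. Let $\mathbb{S}^n$ denote real symmetric $n \times n$ matrices. Given a strictly convex, differentiable function $\phi: \mathbb{S}^n \to \mathbb{R}$, a BMD is defined as $D_\phi(X, Y) = \phi(X) - \phi(Y) - \mathrm{tr}((\nabla \phi(Y))^\top (X - Y))$, where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix. Different choices of $\phi$ lead to different divergences. When $\phi(X) = \|X\|_F^2$, the BMD is specialized to the squared Frobenius norm (SFN) $\|X - Y\|_F^2$. If $\phi(X) = \mathrm{tr}(X \log X - X)$, where $\log X$ denotes the matrix logarithm of $X$, the divergence becomes $\mathrm{tr}(X \log X - X \log Y - X + Y)$, which is referred to as von Neumann divergence (VND) [46]. If $\phi(X) = -\log \det X$, where $\det X$ denotes the determinant of $X$, we get the log-determinant divergence (LDD) [29]: $\mathrm{tr}(X Y^{-1}) - \log \det(X Y^{-1}) - n$.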

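In PDML, to encourage orthogonality among the projection vectors (row vectors in $A$), Xie et al. [57] define a family of regularizers $\Omega_\phi(A) = D_\phi(AA^\top, I)$ which encourage the BMD between the Gram matrix $AA^\top$ and an identity matrix $I$ to be small. $\Omega_\phi(A)$ can be specialized to different instances, based on the choices of $D_\phi(\cdot, \cdot)$. Under SFN, $\Omega_\phi(A)$ becomes $\Omega_{\mathrm{sfn}}(A) = \|AA^\top - I\|_F^2$, which is used in [50, 13, 16, 9] to promote orthogonality. Under VND, $\Omega_\phi(A)$ becomes $\Omega_{\mathrm{vnd}}(A) = \mathrm{tr}(AA^\top \log(AA^\top) - AA^\top) + R$. Under LDD, $\Omega_\phi(A)$ becomes $\Omega_{\mathrm{ldd}}(A) = \mathrm{tr}(AA^\top) - \log \det(AA^\top) - R$.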
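To make the three regularizers concrete, the sketch below evaluates them through the eigenvalues of the Gram matrix of the projection vectors. It assumes the standard closed forms of the SFN, VND and LDD divergences between that Gram matrix and the identity, and it assumes the projection matrix has full row rank so that all eigenvalues are positive; the function names are hypothetical.

```python
import numpy as np

def gram_eigenvalues(A):
    """Eigenvalues of the R-by-R Gram matrix A A^T (positive if A has full row rank)."""
    return np.linalg.eigvalsh(A @ A.T)

def sfn_regularizer(A):
    """Squared Frobenius norm of A A^T - I, i.e. sum of (lambda - 1)^2."""
    lam = gram_eigenvalues(A)
    return float(np.sum((lam - 1.0) ** 2))

def vnd_regularizer(A):
    """von Neumann divergence to the identity: sum of (lambda log lambda - lambda) + R."""
    lam = gram_eigenvalues(A)
    return float(np.sum(lam * np.log(lam) - lam) + len(lam))

def ldd_regularizer(A):
    """Log-determinant divergence to the identity: sum of (lambda - log lambda) - R."""
    lam = gram_eigenvalues(A)
    return float(np.sum(lam) - np.sum(np.log(lam)) - len(lam))
```

Each value is non-negative and reaches zero exactly when every eigenvalue of the Gram matrix equals one, that is, when the projection vectors are orthonormal.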

4 Convex Relaxation

The PDML-BMD problem is non-convex, and the globally optimal solution of $A$ is very difficult to achieve. We seek a convex relaxation and solve the relaxed problem instead. The basic idea is to transform PDML into MDML and approximate the BMD regularizers with convex functions.

4.1 Convex Approximations of the BMD Regularizers

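The approximations are based on the properties of eigenvalues. Given a full-rank matrix $A \in \mathbb{R}^{R \times D}$ ($R < D$), we know that $AA^\top \in \mathbb{R}^{R \times R}$ is a full-rank matrix with $R$ positive eigenvalues $\lambda_1, \dots, \lambda_R$ and $A^\top A \in \mathbb{R}^{D \times D}$ is a rank-deficient matrix with $D - R$ zero eigenvalues and $R$ positive eigenvalues that equal $\lambda_1, \dots, \lambda_R$. For a general positive definite matrix $Z \in \mathbb{R}^{R \times R}$ whose eigenvalues are $\gamma_1, \dots, \gamma_R$, we have $\|Z\|_F^2 = \sum_{j=1}^R \gamma_j^2$, $\mathrm{tr}(Z) = \sum_{j=1}^R \gamma_j$, and $\log \det Z = \sum_{j=1}^R \log \gamma_j$. Next, we leverage these facts to seek convex relaxations of the BMD regularizers.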

A convex SFN regularizer

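The eigenvalues of $AA^\top - I_R$ are $\lambda_1 - 1, \dots, \lambda_R - 1$ and those of $A^\top A - I_D$ are $\lambda_1 - 1, \dots, \lambda_R - 1, -1, \dots, -1$ (with $D - R$ copies of $-1$). Then $\|A^\top A - I_D\|_F^2 = \sum_{j=1}^R (\lambda_j - 1)^2 + (D - R) = \|AA^\top - I_R\|_F^2 + D - R$. Therefore, the SFN regularizer $\|AA^\top - I_R\|_F^2$ equals $\|M - I_D\|_F^2 + \mathrm{rank}(M) - D$, where $M = A^\top A$ is a Mahalanobis matrix and $R = \mathrm{rank}(M)$. It is well known that the trace norm of a matrix is a convex envelope of its rank [44]. We use $\mathrm{tr}(M)$ to approximate $\mathrm{rank}(M)$ and get $\|AA^\top - I_R\|_F^2 \approx \|M - I_D\|_F^2 + \mathrm{tr}(M) - D$, where the right hand side is a convex function. Dropping the constant, we get the convex SFN (CSFN) regularizer defined over $M$:

$$\hat{\Omega}_{\mathrm{sfn}}(M) = \|M - I_D\|_F^2 + \mathrm{tr}(M)$$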
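The eigenvalue argument behind the SFN relaxation can be verified numerically. The sketch below is only a sanity check under stated assumptions: the Mahalanobis matrix is taken to be the transposed projection matrix times itself, the exact identity uses the matrix rank, and the relaxed regularizer replaces the rank by the trace and drops the constant. The function names are hypothetical.

```python
import numpy as np

def sfn_nonconvex(A):
    """Original SFN regularizer on the R-by-R Gram matrix A A^T."""
    R = A.shape[0]
    return np.linalg.norm(A @ A.T - np.eye(R), "fro") ** 2

def sfn_via_mahalanobis(A):
    """Same quantity written in terms of M = A^T A: ||M - I||_F^2 + rank(M) - D."""
    D = A.shape[1]
    M = A.T @ A
    return np.linalg.norm(M - np.eye(D), "fro") ** 2 + np.linalg.matrix_rank(M) - D

def csfn(M):
    """Convex surrogate: rank(M) replaced by tr(M), constant -D dropped."""
    D = M.shape[0]
    return np.linalg.norm(M - np.eye(D), "fro") ** 2 + np.trace(M)
```

The first two functions agree for any full-row-rank projection matrix, while `csfn` is a convex function of the Mahalanobis matrix and can therefore be minimized globally.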


A convex VND regularizer

Given the eigen-decomposition where the eigenvalue equals to , based on the property of the matrix logarithm, we have where . Then , where the eigenvalues are . Then . Now we consider a matrix , where is a small scalar. Using similar calculation, we have . Performing certain algebra (see Appendix A), we get . Replacing with , approximating with and dropping constant , we get the convex VND (CVND) regularizer:


whose convexity is shown in [36].

A convex LDD regularizer

We have and . Certain algebra shows that . After replacing with , approximating with and discarding constants, we obtain the convex LDD (CLDD) regularizer:


where the convexity of is proved in [6]. Note that in [11, 40], an information theoretic regularizer based on log-determinant divergence is applied to encourage the Mahalanobis matrix to be close to the identity matrix. This regularizer requires to be full rank; in contrast, by associating a large weight to the trace norm , our CLDD regularizer encourages to be low-rank. Since , reducing the rank of reduces the number of projection vectors in .

We discuss the errors in convex approximation, which are from two sources: one is the approximation of using where the error is controlled by and can be arbitrarily small (by setting to be very small); the other is the approximation of the matrix rank using the trace norm. Though the error of the second approximation can be large, it has been both empirically and theoretically [7] demonstrated that decreasing the trace norm can effectively reduce rank. We empirically verify that decreasing the convexified CSFN, CVND and CLDD regularizers can decrease the original non-convex counterparts SFN, VND and LDD (see Appendix D.3). A rigorous analysis is left for future study.

4.2 DML with a Convex BMD Regularization

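Given these convex BMD (CBMD) regularizers (denoted by $\hat{\Omega}_\phi(M)$), we relax the non-convex PDML-BMD problems into convex MDML-CBMD formulations by replacing $\|Ax - Ay\|_2^2$ with $(x - y)^\top M (x - y)$ and replacing the non-convex BMD regularizers $\Omega_\phi(A)$ with $\hat{\Omega}_\phi(M)$:

$$\min_{M \succeq 0} \; \frac{1}{|\mathcal{S}|} \sum_{(x,y) \in \mathcal{S}} (x - y)^\top M (x - y) + \frac{1}{|\mathcal{D}|} \sum_{(x,y) \in \mathcal{D}} \max(0, \tau - (x - y)^\top M (x - y)) + \gamma \, \hat{\Omega}_\phi(M)$$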


5 Optimization

We use stochastic proximal subgradient descent algorithm [38] to solve the MDML-CBMD problems. The algorithm iteratively performs the following steps until convergence: (1) randomly sampling a mini-batch of data pairs, computing the subgradient of the data-dependent loss (the first and second term in the objective function) defined on the mini-batch, then performing a subgradient descent update: , where is a small stepsize; and (2) applying proximal operators associated with the regularizers to . The gradient of the CVND regularizer is . To compute , we first perform an eigen-decomposition: , then take the log of every eigenvalue in which gets us a new diagonal matrix , and finally compute as . In the CLDD regularizer, the gradient of is , which can also be computed by eigen-decomposition. Next, we present the proximal operators.
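The eigen-decomposition route for the matrix logarithm and the matrix inverse mentioned above can be sketched as follows. The input `X` stands for a symmetric positive definite matrix (for example, a Mahalanobis matrix shifted by a small multiple of the identity); the function names are hypothetical, and this is a plain NumPy illustration rather than the GPU implementation used in the experiments.

```python
import numpy as np

def sym_matrix_log(X):
    """Matrix logarithm of a symmetric positive definite matrix:
    decompose X = U diag(w) U^T, take the log of each eigenvalue,
    and recompose U diag(log w) U^T."""
    eigvals, eigvecs = np.linalg.eigh(X)
    return (eigvecs * np.log(eigvals)) @ eigvecs.T

def sym_matrix_inverse(X):
    """Inverse of a symmetric positive definite matrix from the same
    decomposition; this is the gradient of log det X with respect to X."""
    eigvals, eigvecs = np.linalg.eigh(X)
    return (eigvecs / eigvals) @ eigvecs.T
```

Both quantities reuse a single `eigh` call per matrix, so their cost is dominated by the same cubic-time decomposition that the PSD constraint already requires.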

5.1 Proximal Operators

Given the regularizer , the associated proximal operator is defined as: , subject to . Let be the eigenvalues of and be the eigenvalues of , then the above problem can be equivalently written as:


where is a regularizer-specific scalar function. This problem can be decomposed into independent ones: (P) , subject to , for , which can be solved individually.


For SFN where and , the problem (P) is simply a quadratic programming problem. The optimal solution is


For VND where and , by taking the derivative of the objective function in problem (P) w.r.t and setting the derivative to zero, we get . The root of this equation is: , where is the Wright omega function [19]. If this root is negative, then the optimal is 0; if this root is positive, then the optimal could be either this root or 0. We pick the one that yields the lowest . Formally, , where .


For LDD where and , by taking the derivative of w.r.t and setting the derivative to zero, we get a quadratic equation: , where and . The optimal solution is achieved either at the positive roots (if any) of this equation or 0. We pick the one that yields the lowest . Formally, , where .
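To illustrate how a proximal operator of this kind reduces to independent scalar problems on the eigenvalues, here is a minimal sketch for the SFN case. It rests on explicit assumptions that may differ from the exact scaling used in the paper: the proximal problem is taken to be the minimization of $\frac{1}{2}\|M - \tilde{M}\|_F^2 + c\,(\|M - I\|_F^2 + \mathrm{tr}(M))$ over PSD matrices $M$, where $\tilde{M}$ is the matrix produced by the subgradient step and the constant $c$ absorbs the step size and the regularization weight. Under that assumption, each eigenvalue $t$ of $\tilde{M}$ is mapped to the minimizer of $\frac{1}{2}(x - t)^2 + c\,((x - 1)^2 + x)$ subject to $x \ge 0$, which is $\max(0, (t + c)/(1 + 2c))$. The function names are hypothetical.

```python
import numpy as np

def csfn_prox_eigenvalue(t, c):
    """Closed-form minimizer over x >= 0 of
    0.5 * (x - t)**2 + c * ((x - 1)**2 + x)."""
    return max(0.0, (t + c) / (1.0 + 2.0 * c))

def csfn_prox(M_tilde, c):
    """Apply the scalar solution to every eigenvalue of the symmetric
    matrix M_tilde while keeping its eigenvectors."""
    eigvals, eigvecs = np.linalg.eigh(M_tilde)
    new_vals = np.array([csfn_prox_eigenvalue(t, c) for t in eigvals])
    return (eigvecs * new_vals) @ eigvecs.T
```

Because the scalar objective is a convex quadratic, clipping the unconstrained minimizer at zero gives the constrained optimum, and the resulting matrix is PSD by construction. The VND and LDD cases follow the same eigenvalue-wise pattern but need the root-finding steps described above instead of a single closed form.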

Computational Complexity

In this algorithm, the major computation workload is the eigen-decomposition of $D$-by-$D$ matrices, with a complexity of $O(D^3)$. In our experiments, since $D$ is no more than 1000, $O(D^3)$ is not a big bottleneck. Besides, these matrices are symmetric, and this structure can be leveraged to speed up eigen-decomposition. In implementation, we use the MAGMA library (http://icl.cs.utk.edu/magma/), which supports efficient eigen-decomposition of symmetric matrices on GPU. Note that the unregularized MDML also requires eigen-decomposition (of $M$), hence adding these CBMD regularizers does not substantially increase the computation cost.

6 Theoretical Analysis

In this section, we present theoretical analysis of balancedness and generalization error.

6.1 Analysis of Balancedness

In this section, we analyze how the nonconvex BMD regularizers that promote orthogonality affect the balancedness of the distance metrics learned by PDML-BMD (the analysis of convex BMD regularizers in MDML-CBMD is left for future work). Specifically, the analysis focuses on the following projection matrix: . We assume there are classes, where class has a distribution and the corresponding expectation is . Each data sample in and is drawn from the distribution of one specific class. We define and . Further, we assume has full rank (which is the number of the projection vectors), and let denote the eigen-decomposition of , where with .

We define an imbalance factor (IF) to measure the (im)balancedness. For each class , we use the corresponding expectation to characterize this class.

We define the Mahalanobis distance between two classes and as: . We define the IF among all classes as:


The motivation of such a definition is: for two frequent classes, since they have more training examples and hence contributing more in learning , DML intends to make their distance large; whereas for two infrequent classes, since they contribute less in learning (and DML is constrained by similar pairs which need to have small distances), their distance may end up being small. Consequently, if classes are imbalanced, some between-class distances can be large while others small, resulting in a large IF. The following theorem shows the upper bounds of IF.

Theorem 1.

Let denote the ratio between and and assume . Suppose the regularization parameter and distance margin are sufficiently large: and , where and depend on and . If and , then we have the following bounds for the IF (please refer to Appendix B.1 for the definition of and the detailed proof).

  • For the VND regularizer , if , the following bound of the IF holds:

    where is an increasing function defined in the following way. Let , which is strictly increasing on and strictly decreasing on and let be the inverse function of on , then for .

  • For the LDD regularizer , we have

As can be seen, the bounds are increasing functions of the BMD regularizers and . Decreasing these regularizers would reduce the upper bounds of the imbalance factor, hence leading to more balancedness. For SFN, such a bound cannot be derived.

6.2 Analysis of Generalization Error

In this section, we analyze how the convex BMD regularizers affect the generalization error in MDML-CBMD problems. Following [47], we use distance-based error to measure the quality of a Mahalanobis distance matrix . Given the sample and where the total number of data pairs is , the empirical error is defined as and the expected error is . Let be optimal matrix learned by minimizing the empirical error: . We are interested in how well performs on unseen data. The performance is measured using generalization error: . To incorporate the impact of the CBMD regularizers , we define the hypothesis class of to be . The upper bound controls the strength of regularization. A smaller entails stronger promotion of orthogonality. is controlled by the regularization parameter in Eq.(4). Increasing reduces . For different CBMD regularizers, we have the following generalization error bound.

Theorem 2.

Suppose , then with probability at least , we have:


  • For the CVND regularizer,

  • For the CLDD regularizer,

  • For the CSFN regularizer,

From these generalization error bounds (GEBs), we can see two major implications. First, CBMD regularizers can effectively control the GEBs. Increasing the strength of CBMD regularization (by enlarging ) reduces , which decreases the GEBs since they are all increasing functions of . Second, the GEBs converge with rate , where is the number of training data pairs. This rate matches with that in [5, 47].

Dataset   #Train   #Test   Dim.   #Class
MIMIC     40K      18K     1000   2833
EICU      53K      39K     1000   2175
Reuters   4152     1779    1000   49
News      11307    7538    1000   20
Cars      8144     8041    1000   196
Birds     9000     2788    1000   200
Act       7352     2947    561    6
Table 1: Dataset Statistics
Method MIMIC(A-All) MIMIC(A-IF) MIMIC(BS) EICU(A-All) EICU(A-IF) EICU(BS) Reuters(A-All) Reuters(A-IF) Reuters(BS) News(A-All) Cars(A-All) Birds(A-All) Act(A-All)
PDML 0.634 0.608 0.070 0.671 0.637 0.077 0.949 0.916 0.049 0.757 0.714 0.851 0.949
MDML 0.641 0.617 0.064 0.677 0.652 0.055 0.952 0.929 0.034 0.769 0.722 0.855 0.952
LMNN 0.628 0.609 0.054 0.662 0.633 0.066 0.943 0.913 0.040 0.731 0.728 0.832 0.912
LDML 0.619 0.594 0.068 0.667 0.647 0.046 0.934 0.906 0.042 0.748 0.706 0.847 0.937
MLEC 0.621 0.605 0.045 0.679 0.656 0.053 0.927 0.916 0.021 0.761 0.725 0.814 0.917
GMML 0.607 0.588 0.053 0.668 0.648 0.045 0.931 0.905 0.035 0.738 0.707 0.817 0.925
ILHD 0.577 0.560 0.051 0.637 0.610 0.064 0.905 0.893 0.028 0.711 0.686 0.793 0.898
MDML-$\ell_2$ 0.648 0.627 0.055 0.695 0.676 0.042 0.955 0.930 0.037 0.774 0.728 0.872 0.958
MDML-$\ell_1$ 0.643 0.615 0.074 0.701 0.677 0.053 0.953 0.948 0.020 0.791 0.725 0.868 0.961
MDML-$\ell_{2,1}$ 0.646 0.630 0.043 0.703 0.661 0.091 0.963 0.936 0.035 0.783 0.728 0.861 0.964
MDML-Tr 0.659 0.642 0.044 0.696 0.673 0.051 0.961 0.934 0.036 0.785 0.731 0.875 0.955
MDML-IT 0.653 0.626 0.070 0.692 0.668 0.053 0.954 0.920 0.046 0.771 0.724 0.858 0.967
MDML-Drop 0.647 0.630 0.045 0.701 0.670 0.067 0.959 0.937 0.032 0.787 0.729 0.864 0.962
MDML-OS 0.649 0.626 0.059 0.689 0.679 0.045 0.957 0.938 0.031 0.779 0.732 0.869 0.963
PDML-DC 0.652 0.639 0.035 0.706 0.686 0.044 0.962 0.943 0.034 0.773 0.736 0.882 0.964
PDML-CS 0.661 0.641 0.053 0.712 0.670 0.089 0.967 0.954 0.020 0.803 0.742 0.895 0.971
PDML-DPP 0.659 0.632 0.069 0.714 0.695 0.041 0.958 0.937 0.036 0.797 0.751 0.891 0.969
PDML-IC 0.660 0.642 0.047 0.711 0.685 0.057 0.972 0.954 0.030 0.801 0.740 0.887 0.967
PDML-DeC 0.648 0.625 0.063 0.698 0.675 0.050 0.965 0.960 0.017 0.786 0.728 0.860 0.958
PDML-VGF 0.657 0.634 0.059 0.718 0.697 0.045 0.974 0.952 0.036 0.806 0.747 0.894 0.974
PDML-MA 0.659 0.644 0.040 0.721 0.703 0.038 0.975 0.959 0.024 0.815 0.743 0.898 0.968
PDML-OC 0.651 0.636 0.041 0.705 0.685 0.043 0.955 0.931 0.036 0.779 0.727 0.875 0.956
PDML-OS 0.639 0.614 0.067 0.675 0.641 0.072 0.951 0.928 0.038 0.764 0.716 0.855 0.950
PDML-SFN 0.662 0.642 0.051 0.724 0.701 0.045 0.973 0.947 0.038 0.808 0.749 0.896 0.970
PDML-VND 0.667 0.655 0.032 0.733 0.706 0.057 0.976 0.971 0.012 0.814 0.754 0.902 0.972
PDML-LDD 0.664 0.651 0.035 0.731 0.711 0.043 0.973 0.964 0.017 0.816 0.751 0.904 0.971
MDML-CSFN 0.668 0.653 0.039 0.728 0.705 0.049 0.978 0.968 0.023 0.813 0.753 0.905 0.972
MDML-CVND 0.672 0.664 0.022 0.735 0.718 0.035 0.984 0.982 0.012 0.822 0.755 0.908 0.973
MDML-CLDD 0.669 0.658 0.029 0.739 0.719 0.042 0.981 0.980 0.011 0.819 0.759 0.913 0.971
Table 2: On the three imbalanced datasets (MIMIC, EICU, Reuters), we show the mean AUC (averaged over 5 random train/test splits) on all classes (A-All) and infrequent classes (A-IF), and the balance score (BS). On the remaining 4 balanced datasets, A-All is shown. The AUC on frequent classes and the standard errors are reported in the appendix.


7 Experiments


We used 7 datasets in the experiments: two electronic health record datasets MIMIC (version III) [23] and EICU (version 1.1) [17]; two text datasets Reuters (http://www.daviddlewis.com/resources/testcollections/reuters21578/) and 20-Newsgroups (News) (http://qwone.com/~jason/20Newsgroups/); two image datasets Stanford-Cars (Cars) [26] and Caltech-UCSD-Birds (Birds) [53]; and one sensory dataset 6-Activities (Act) [1]. The MIMIC-III dataset contains 58K hospital admissions of 47K patients who stayed within the intensive care units (ICU). Each admission has a primary diagnosis (a disease), which acts as the class label of this admission. There are 2833 unique diseases. We extract 7207-dimensional features from demographics, clinical notes, and lab tests. The EICU dataset contains 92K ICU admissions diagnosed with 2175 unique diseases. 3743-dimensional features are extracted from demographics, lab tests, vital signs, and past medical history. For the Reuters dataset, after removing documents that have more than one label and removing classes that have fewer than 3 documents, we are left with 5931 documents and 48 classes. Documents in Reuters and News are represented with tf-idf vectors where the vocabulary size is 5000. For the two image datasets Birds and Cars, we use the VGG16 [43] convolutional neural network trained on the ImageNet [12] dataset to extract features, which are the 4096-dimensional outputs of the second fully-connected layer. The 6-Activities dataset contains sensory recordings of 30 subjects performing 6 activities (which are the class labels). The features are 561-dimensional sensory signals. For the first six datasets, the features are normalized using min-max normalization along each dimension and the feature dimension is reduced to 1000 using PCA. Since there is no standard split of the training/test set, we perform five random splits and average the results of the five runs. Dataset statistics are summarized in Table 1. More details of the datasets and feature extraction are deferred to the appendix.

Experimental Settings

Two examples are considered similar if they belong to the same class and dissimilar otherwise. The learned distance metrics are applied to retrieval (using each test example to query the rest of the test examples), whose performance is evaluated using the area under the precision-recall curve (AUC) [34]; higher is better. Note that the learned distance metrics can also be applied to other tasks such as clustering and classification. Due to the space limit, we focus on retrieval. We apply the proposed convex regularizers CSFN, CVND, CLDD to MDML. We compare them with two sets of baseline regularizers. The first set aims at promoting orthogonality and is based on determinant of covariance (DC) [33], cosine similarity (CS) [62], determinantal point process (DPP) [27, 65], InCoherence (IC) [3], variational Gram function (VGF) [64, 21], decorrelation (DeC) [10], mutual angles (MA) [56], squared Frobenius norm (SFN) [50, 13, 16, 9], von Neumann divergence (VND) [57], log-determinant divergence (LDD) [57], and orthogonal constraint (OC) [31, 48]. All these regularizers are applied to PDML. The other set of regularizers is not designed particularly for promoting orthogonality but is commonly used, including the $\ell_2$ norm, $\ell_1$ norm [40], $\ell_{2,1}$ norm [60], trace norm (Tr) [32], information theoretic (IT) regularizer [11], and Dropout (Drop) [45]. All these regularizers are applied to MDML. One common way of dealing with class imbalance is over-sampling (OS) [14], which repeatedly draws samples from the empirical distributions of infrequent classes until all classes have the same number of samples. We apply this technique to PDML and MDML.
In addition, we compare with the vanilla Euclidean distance (EUC) and other distance learning methods, including large margin nearest neighbor (LMNN) metric learning, information theoretic metric learning (ITML) [11], logistic discriminant metric learning (LDML) [20], metric learning from equivalence constraints (MLEC) [25], geometric mean metric learning (GMML) [63], and independent Laplacian hashing with diversity (ILHD) [8]. The PDML-based methods except PDML-OC are solved with stochastic subgradient descent (SSD). PDML-OC is solved using the algorithm proposed in [54]. The MDML-based methods are solved with proximal SSD. The learning rate is set to 0.001. The mini-batch size is set to 100 (50 similar pairs and 50 dissimilar pairs). We use 5-fold cross validation to tune the regularization parameter among and the number of projection vectors (of the PDML methods) among . In CVND and CLDD, is set to be . The margin is set to be 1. In the MDML-based methods, after the Mahalanobis matrix $M$ (rank $R$) is learned, we factorize it into $M = L^\top L$ where $L \in \mathbb{R}^{R \times D}$ (see Appendix D.2), then perform retrieval based on $L$, which is more efficient than that based on $M$. Each method is implemented on GPUs using the MAGMA library. The experiments are conducted on a GPU cluster with 40 machines.
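The factorization step at the end of the settings above can be illustrated with a short NumPy sketch. It assumes the learned Mahalanobis matrix is symmetric positive semidefinite and low-rank, and it uses an eigen-decomposition to build the factor; the function name `factorize_psd`, the tolerance, and the toy matrix are assumptions for illustration, and the procedure in Appendix D.2 may differ.

```python
import numpy as np

def factorize_psd(M, tol=1e-10):
    """Return L such that M is approximately L.T @ L.

    Only eigenpairs whose eigenvalue exceeds tol times the largest
    eigenvalue are kept, so L has as many rows as the numerical rank of M.
    """
    M = (M + M.T) / 2.0                    # guard against round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(M)   # eigenvalues in ascending order
    keep = eigvals > tol * eigvals.max()
    # Rows of L are sqrt(lambda_j) * u_j^T for the retained eigenpairs.
    return np.sqrt(eigvals[keep])[:, None] * eigvecs[:, keep].T

rng = np.random.default_rng(0)
D, R = 20, 4
A = rng.normal(size=(R, D))
M = A.T @ A                                # toy rank-R PSD matrix
L = factorize_psd(M)

x, y = rng.normal(size=D), rng.normal(size=D)
d_full = (x - y) @ M @ (x - y)             # quadratic form with the D x D matrix
d_low = np.sum((L @ x - L @ y) ** 2)       # squared Euclidean distance after projection
print(L.shape)  # prints: (4, 20)
```

Both distances agree up to round-off. The practical gain is that each example can be projected once to R dimensions, after which every pairwise distance costs O(R) instead of O(D^2).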

Method MIMIC EICU Reuters News Cars Birds Act
PDML 62.1 66.6 5.2 11.0 8.4 10.1 3.4
MDML 3.4 3.7 0.3 0.6 0.5 0.6 0.2
PDML-DC 424.7 499.2 35.2 65.6 61.8 66.2 17.2
PDML-CS 263.2 284.8 22.6 47.2 34.5 42.8 14.4
PDML-DPP 411.8 479.1 36.9 61.9 64.2 70.5 16.5
PDML-IC 265.9 281.2 23.4 46.1 37.5 45.2 15.3
PDML-DeC 458.5 497.5 41.8 78.2 78.9 80.7 19.9
PDML-VGF 267.3 284.1 22.3 48.9 35.8 38.7 15.4
PDML-MA 271.4 282.9 23.6 50.2 30.9 39.6 17.5
PDML-OC 104.9 118.2 9.6 14.3 14.8 17.0 3.9
PDML-SFN 261.7 277.6 22.9 46.3 36.2 38.2 15.9
PDML-VND 401.8 488.3 33.8 62.5 67.5 73.4 17.1
PDML-LDD 407.5 483.5 34.3 60.1 61.8 72.6 17.9
MDML-CSFN 41.1 43.9 3.3 7.3 6.5 6.9 1.8
MDML-CVND 43.8 46.2 3.6 8.1 6.9 7.8 2.0
MDML-CLDD 41.7 44.5 3.4 7.5 6.6 7.2 1.8
Table 3: Training time (hours) on seven datasets. The training times of the other baseline methods are deferred to Appendix D.3.


The training time taken by different methods to reach convergence is shown in Table 3. For the non-convex, PDML-based methods, we report the total time taken by the following computation: tuning the regularization parameter (4 choices) and the number of projection vectors (NPVs, 6 choices) on a two-dimensional grid via 3-fold cross validation ($4 \times 6 \times 3 = 72$ experiments in total); for each of the 72 experiments, the algorithm restarts 5 times, each with a different initialization, and picks the one yielding the lowest objective value. (Footnote 6: Our experiments show that for non-convex methods, multiple re-starts are necessary to improve performance. For example, for PDML-VND on MIMIC with 100 projection vectors, the AUC is non-decreasing with the number of re-starts: the AUCs after successive re-starts are 0.651, 0.651, 0.658, 0.667, and 0.667.) In total, the number of runs is $72 \times 5 = 360$. For the MDML-based methods, there is no need to restart multiple times or tune the NPVs. The total number of runs is $4 \times 3 = 12$. As can be seen from the table, the proposed convex methods are much faster than the non-convex ones, due to the greatly reduced number of experimental runs, although for each single run the convex methods are less efficient than the non-convex methods due to the overhead of eigen-decomposition. The unregularized MDML takes the least time to train since it has no parameters to tune and runs only once. On average, the time of each single run in MDML-(CSFN,CVND,CLDD) is close to that in the unregularized MDML, since an eigen-decomposition is required anyway regardless of the presence of the regularizers.
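The run counts in the paragraph above follow from simple multiplication, and combining them with two entries of Table 3 gives a rough per-run cost. The sketch below is only a back-of-the-envelope check: the variable names are illustrative, and the per-run averages assume that the reported total hours are split evenly across runs, which the source does not state.

```python
n_reg_values = 4    # candidate regularization parameters
n_npv_values = 6    # candidate numbers of projection vectors (PDML only)
n_folds = 3         # cross-validation folds
n_restarts = 5      # random restarts per non-convex experiment

pdml_experiments = n_reg_values * n_npv_values * n_folds
pdml_runs = pdml_experiments * n_restarts
mdml_runs = n_reg_values * n_folds   # no NPV grid and no restarts

print(pdml_experiments, pdml_runs, mdml_runs)  # prints: 72 360 12

# Total training hours on MIMIC, taken from Table 3.
pdml_vnd_total_hours = 401.8
mdml_cvnd_total_hours = 43.8

# Average hours per run under the even-split assumption.
pdml_vnd_hours_per_run = pdml_vnd_total_hours / pdml_runs
mdml_cvnd_hours_per_run = mdml_cvnd_total_hours / mdml_runs
```

Under that assumption a single MDML-CVND run on MIMIC averages about 3.65 hours against roughly 1.12 hours for a PDML-VND run, which matches the statement that the convex methods are slower per run but faster overall.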

Method MIMIC (NPV, CS) EICU (NPV, CS) Reuters (NPV, CS) News (NPV, CS) Cars (NPV, CS) Birds (NPV, CS) Act (NPV, CS)
PDML 300 2.1 400 1.7 300 3.2 300 2.5 300 2.4 500 1.7 200 4.7
MDML 247 2.6 318 2.1 406 2.3 336 2.3 376 1.9 411 2.1 168 5.7
LMNN 200 3.1 400 1.7 400 2.4 300 2.4 400 1.8 500 1.7 300 3.0
LDML 300 2.1 400 1.7 400 2.3 200 3.7 300 2.4 400 2.1 300 3.1
MLEC 487 1.3 493 1.4 276 3.4 549 1.4 624 1.2 438 1.9 327 2.8
GMML 1000 0.6 1000 0.7 1000 0.9 1000 0.7 1000 0.7 1000 0.8 1000 0.9
ILHD 100 5.8 100 6.4 50 18.1 100 7.1 100 6.9 100 7.9 50 18.0
MDML-$\ell_2$ 269 2.4 369 1.9 374 2.6 325 2.4 332 2.2 459 1.9 179 5.4
MDML-$\ell_1$ 341 1.9 353 2.0 417 2.3 317 2.5 278 2.6 535 1.6 161 6.0
MDML-$\ell_{2,1}$ 196 3.3 251 2.8 288 3.3 316 2.5 293 2.5 326 2.6 135 7.1
MDML-Tr 148 4.5 233 3.0 217 4.4 254 3.1 114 6.4 286 3.1 129 7.4
MDML-IT 1000 0.7 1000 0.7 1000 1.0 1000 0.8 1000 0.7 1000 0.9 1000 1.0
MDML-Drop 183 3.5 284 2.5 315 3.0 251 3.1 238 3.1 304 2.8 147 6.5
PDML-DC 100 6.5 300 2.4 100 9.6 200 3.9 200 3.7 300 2.9 100 9.6
PDML-CS 200 3.3 200 3.6 200 4.8 100 8.0 100 7.4 200 4.5 50 19.4
PDML-DPP 100 6.6 200 3.6 100 9.6 100 8.0 200 3.8 200 4.5 100 9.7
PDML-IC 200 3.3 200 3.6 200 4.9 100 8.0 200 3.7 100 8.9 100 9.7
PDML-DeC 200 3.2 300 2.3 200 4.8 200 3.9 200 3.6 200 4.3 100 9.6
PDML-VGF 200 3.3 200 3.6 200 4.9 100 8.1 200 3.7 200 4.5 100 9.7
PDML-MA 200 3.3 200 3.6 100 9.8 100 8.2 100 7.4 200 4.5 50 19.4
PDML-SFN 100 6.6 200 3.6 100 9.7 100 8.1 100 7.5 200 4.5 50 19.4
PDML-OC 100 6.5 100 7.1 50 19.1 50 15.6 100 7.3 100 8.8 50 19.1
PDML-VND 100 6.7 100 7.3 50 19.5 100 8.1 100 7.5 100 9.0 50 19.4
PDML-LDD 100 6.6 200 3.7 100 9.7 100 8.2 100 7.5 100 9.0 50 19.4
MDML-CSFN 143 4.7 209 3.5 174 5.6 87 9.3 62 12.1 139 6.5 64 15.2
MDML-CVND 53 12.7 65 11.3 61 16.0 63 13.0 127 5.9 92 9.9 68 14.3
MDML-CLDD 76 8.8 128 5.8 85 11.5 48 17.1 91 8.3 71 12.9 55 17.7
Table 4: Number of projection vectors (NPV) and compactness score (CS, $\times 10^{-3}$).

Next, we verify whether CSFN, CVND and CLDD are able to learn more balanced distance metrics. On the three datasets MIMIC, EICU and Reuters, where the classes are imbalanced, we consider a class as “frequent” if it contains more than 1000 examples, and “infrequent” otherwise. We measure AUCs on all classes (A-All), infrequent classes (A-IF) and frequent classes (A-F), then define a balance score (BS) as . A smaller BS indicates more balancedness. As shown in Table 2, MDML-(CSFN,CVND,CLDD) achieve the highest A-All on 6 datasets and the highest A-IF on all 3 imbalanced datasets. In terms of BS, our convex methods outperform all baseline DML methods. These results demonstrate that our methods can learn more balanced metrics. By encouraging the projection vectors to be close to orthogonal, our methods can reduce the redundancy among vectors. Mutually complementary vectors can achieve a broader coverage of latent features, including those associated with infrequent classes. As a result, these vectors improve the performance on infrequent classes and lead to better balancedness. Thanks to their convexity, our methods can achieve the globally optimal solution and outperform the non-convex ones, which can only achieve a local optimum and hence a sub-optimal solution. Comparing (PDML,MDML)-OS with the unregularized PDML/MDML, we can see that over-sampling indeed improves balancedness. However, this improvement is less significant than that achieved by our methods. In general, the orthogonality-promoting (OP) regularizers outperform the non-OP regularizers, suggesting the effectiveness of promoting orthogonality. The orthogonal constraint (OC) [31, 48] imposes strict orthogonality, which may be so restrictive that it hurts performance. ILHD [8] learns binary hash codes, which makes retrieval more efficient; however, it achieves lower AUCs due to quantization errors. 
MDML-(CSFN,CVND,CLDD) outperform popular DML approaches including LMNN, LDML, MLEC and GMML, demonstrating their competitive standing in the DML literature.

Next, we verify whether the distance metrics learned by MDML-(CSFN,CVND,CLDD) are compact. Table 4 shows the numbers of projection vectors (NPVs) that achieve the AUCs in Table 2. For MDML-based methods, the NPV equals the rank of the Mahalanobis matrix, since $M = L^\top L$, where the rows of $L$ are the projection vectors. We define a compactness score (CS), which is the ratio between A-All (given in Table 2) and NPV. A higher CS indicates achieving a higher AUC with fewer projection vectors. From Table 4, we can see that on 5 datasets, MDML-(CSFN,CVND,CLDD) achieve larger CSs than the baseline methods, demonstrating their better capability in learning compact distance metrics. Similar to the observations in Table 2, CSFN, CVND and CLDD perform better than the non-convex regularizers, and CVND and CLDD perform better than CSFN. The reduction of NPV improves the efficiency of retrieval, since the computational complexity grows linearly with this number. Together, these results demonstrate that MDML-(CSFN,CVND,CLDD) outperform other methods in terms of learning both compact and balanced distance metrics.
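As a quick illustration of the compactness score defined above, the following sketch computes the ratio of A-All to NPV. The function name `compactness_score` is hypothetical, and the scaling by 1000 reflects the observation that the Table 4 entries appear to be expressed in units of $10^{-3}$. The example input reuses the AUC of 0.667 reported for PDML-VND on MIMIC with 100 projection vectors, on the assumption that this is the A-All value behind the table entry.

```python
def compactness_score(a_all, npv):
    """Ratio of overall AUC (A-All) to number of projection vectors, in units of 1e-3."""
    if npv <= 0:
        raise ValueError("npv must be positive")
    return 1000.0 * a_all / npv

# AUC 0.667 with 100 projection vectors (PDML-VND on MIMIC).
cs = compactness_score(0.667, 100)
print(round(cs, 1))  # prints: 6.7
```

The result agrees with the 6.7 listed for PDML-VND on MIMIC in Table 4. Because the score divides by NPV, halving the number of projection vectors at the same AUC doubles the score.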

As can be seen from Table 2, our methods MDML-(CVND,CLDD) achieve the best A-All. Table 10 (Appendix D.3) shows that MDML-(CVND,CLDD) have the smallest gap between training and testing AUC. This indicates that our methods are better able to reduce overfitting and improve generalization performance.

8 Conclusions

In this paper, we have attempted to address three issues of existing orthogonality-promoting DML methods, namely computational inefficiency and the lack of theoretical analysis of balancedness and generalization. To address the computational issue, we perform a convex relaxation of these regularizers and develop a proximal gradient descent algorithm to solve the convex problems. To address the analysis issue, we define an imbalance factor (IF) to measure (im)balancedness and prove that decreasing the Bregman matrix divergence regularizers (which promote orthogonality) can reduce the upper bound of the IF, hence leading to more balancedness. We provide a generalization error (GE) analysis showing that decreasing the convex regularizers can reduce the GE upper bound. Experiments on datasets from different domains demonstrate that our methods are computationally more efficient and more capable of learning balanced, compact and generalizable distance metrics than other approaches.

Appendix A Convex Approximations of BMD Regularizers

Approximation of VND regularizer

Given , according to the property of matrix logarithm, , where . Then , where the eigenvalues are . Since , we have . Now we consider a matrix , where is a small scalar. The eigenvalues of this matrix are