Metric (or distance metric) learning is an essential task in many machine learning problems. Relying on an appropriate distance metric can boost the performance of many learning algorithms, such as k-Nearest Neighbor, whose success largely depends on the distances to the points closest to a given point. Similarly, in k-means clustering, the shortest distance between a data point and the cluster centers determines its cluster assignment. Several important studies have been conducted in this area, including the information theoretic metric learning (ITML) approach, the large margin nearest neighbour (LMNN) approach, and the pseudo-metric online learning algorithm (POLA).
However, in many cases, numerical features come along with categorical ones that also carry discriminative information. For instance, the categorical features of educational level and marital status represent valuable information in the credit card fraud detection problem. The standard way to deal with categorical features is to treat them as numerical ones by transforming them into binary vectors. However, the number of features then increases at a polynomial rate. Unsupervised learning methods for categorical distances usually rely on simple overlapping similarity, which varies from simple counting, through co-occurrence frequency, to entropy. When label information is available, supervised learning of categorical measures [6, 10] has been further developed. However, these methods either ignore the correlation between data samples, or come at a heavy computational cost. In addition, none of these studies provides theoretical guarantees on the generalization bound of the learned metric.
To address the above problems, we put forward a new method, categorical projected metric learning (CPML), to efficiently learn metrics on categorical features and utilize them in real classification tasks. First, we employ the standard value distance metric (VDM) to project each feature value into a class-based vector. These vectors are then re-arranged to define new distances relying on the correlation between features. The newly defined distances are further utilized in k-Nearest Neighbor classification tasks. Compared to previous methods, our approach is superior in terms of computational cost, without loss of classification accuracy, and it comes with theoretical guarantees that ensure its reliability.
To achieve this, we apply the Schatten p-norm to regularize the eigenvalues of the metric and promote low-rank solutions. Several popular regularizers are special cases of this Schatten norm: p = 1 corresponds to the trace norm, p = 2 to the Frobenius norm, and p = ∞ to the maximum eigenvalue (spectral) norm. Correspondingly, we provide a generalization bound for this Schatten p-norm, as a supplement to the standard generalization bounds in the metric learning literature.
On the experimental side, we test the performance of our model in different scenarios. By adding different numbers of noisy features, our model is shown to correctly identify the noisy features and "denoise" them. By measuring the running time for different data sizes and class numbers, we show that the number of classes hardly influences our model's running time. Lastly, detailed results obtained on synthetic and real-world data sets confirm our model's competitive results against other benchmark models.
The remainder of this paper is organized as follows. In Section II, we introduce the value distance metric (VDM) method and the general framework of metric learning. In Section III, we propose the categorical projected metric learning (CPML) framework, which aims at efficiently learning metrics on categorical features for classification tasks. A generalization bound for the general Schatten p-norm is provided in Section IV. After a literature review in Section V, we provide and discuss experimental results in Section VI. The conclusion and future work can be found in Section VII.
II Preliminary Knowledge
Throughout this study, we will use the notations provided in Table I.
|Number of data points|
|Number of features|
|Number of classes|
|Metric to be learned|
|Number of possible values for a feature|
|Maximum number of possible values over all features|
|VDM-based projection of an example|
|VDM-based projection of the dataset|
|Number of times a value is observed for a feature in a class|
|Set of positive semi-definite matrices|
As is common, our training data consists of observations together with their corresponding labels. Furthermore, each observation contains categorical features, each taking its value in a finite set whose cardinality is the number of possible values for that feature.
We now introduce the representation we use for categorical features and provide the general framework for metric learning.
II-A Value Distance Metric
The value distance metric (VDM) is a method for representing each categorical feature as a normalized vector whose dimension is the number of classes (see Table I). For each feature, VDM partitions the whole dataset into subgroups such that the data points in a subgroup share the same value of that feature. VDM then histograms the data points of each subgroup according to their class labels, and the histogram is normalized to represent the feature value's class distribution.
More specifically, each feature is first transformed into a vector whose length is the number of classes: the entry associated with a class is the number of times the feature value occurs in that class, divided by the total number of appearances of that feature value. VDM is thus a class-based projection, inspired by the original value distance metric [20, 6]. Recent work has also employed this projection.
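As a concrete illustration, the normalized class histogram described above can be sketched as follows (a minimal sketch; the function name `vdm_project` and its interface are our own, not from the paper):

```python
from collections import Counter

def vdm_project(values, labels, classes):
    """Project one categorical feature into per-value class distributions.

    values:  categorical value of this feature for each sample
    labels:  class label for each sample
    classes: ordered list of the class labels
    Returns a dict mapping each feature value to a vector whose k-th entry
    is N(value, class_k) / N(value), i.e. the normalized histogram.
    """
    pair_counts = Counter(zip(values, labels))   # N(value, class)
    value_counts = Counter(values)               # N(value)
    return {
        v: [pair_counts[(v, c)] / value_counts[v] for c in classes]
        for v in value_counts
    }
```

For instance, if two of three accountants in the data carry a "low" credit risk label, the projection of the value Accountant is the vector (2/3, 1/3) over the classes (low, high).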
We take credit risk data as an example. In Table II, we have a set of persons. Their occupations, education levels and marital statuses are taken as features, and the credit risk level is taken as the label. For a person whose occupation feature is Accountant, we can first estimate the probability of each credit risk level among accountants as follows:
Correspondingly, we can represent Eq. (1) as:
II-B Metric Learning
Metric learning naturally arises from the question of how to assess the similarity of different objects. The corresponding distance function is usually the Mahalanobis distance, in which the inverse covariance matrix plays the role of the unknown variable. Given prior knowledge in the form of class labels or side information, we try to find an optimal metric that minimizes the number of errors made.
In this formulation, the first term is the loss function, the second is the regularization function, and a tuning parameter balances the loss incurred against the model complexity.
Our approach, as well as most metric learning approaches, fits within this general setting.
III Categorical Metric Learning
VDM transforms each data point into a matrix. Each row of this matrix estimates a class distribution and therefore sums to one. Each column represents the popularity of one class across the different features. The whole projected dataset is the collection of these matrices.
Based on this representation, we define two distances that take into account the correlations between features inside each class. They differ in the way of treating the metric learned for different classes (different metrics for different classes are learned in one case, whereas a single metric for all classes is learned in the other case).
CPm: The categorical projected multi (CPm) distance considers the features’ individual metrics among different classes and is defined by:
CPs: The categorical projected single (CPs) distance considers the features’ correlation by assuming that all the classes share the same metric, and is defined by:
We assume that the metric matrices are positive semi-definite to ensure the non-negativity of the distances. The following property (proved in the appendix) furthermore shows that the above definitions correspond to valid distances.
Assume , then , and, , we have .
III-A Objective Function & Optimization
From the class information in the data set, one can derive a constraint set based on triplets of points, indicating that any point should be closer to points of the same class than to points of other classes. From this constraint set, our task is to find an optimal metric such that the empirical loss is minimized. The empirical loss is defined as:
In this loss, the hinge function penalizes each violated triplet constraint, and a margin parameter, often set to a fixed constant, controls how strictly the constraints must be satisfied.
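The triplet hinge loss can be sketched as follows. This is a minimal sketch, assuming the standard squared Mahalanobis form for the learned distance; the helper names are hypothetical:

```python
import numpy as np

def mahalanobis_sq(M, u, v):
    """Squared Mahalanobis distance (u - v)^T M (u - v) for a PSD matrix M."""
    d = u - v
    return float(d @ M @ d)

def triplet_hinge_loss(M, X, triplets, margin=1.0):
    """Empirical hinge loss over triplet constraints (i, j, k):
    point i should be closer to j (same class) than to k (different
    class) by at least `margin`; satisfied triplets contribute zero."""
    loss = 0.0
    for i, j, k in triplets:
        loss += max(0.0, margin
                         + mahalanobis_sq(M, X[i], X[j])
                         - mahalanobis_sq(M, X[i], X[k]))
    return loss
```

A triplet whose same-class distance already beats the different-class distance by more than the margin contributes nothing to the loss, which is what makes the subgradient sparse in practice.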
The above setting can be used for constraint sets based on pairs of points, , in which case the empirical loss takes the form:
It can also be used for quadratic constraint sets, based on 4-tuples of data points, in which case the empirical loss takes the form:
In real-world applications, no single type of constraint dominates the other two in all scenarios. Two points from different classes can indeed be very close to, or far away from, each other; the triplet and pair constraint sets make no strong assumption in this case, whereas the quadratic constraints might. We believe that both constraint sets define valuable information on which to learn a new metric. We present here the solution of the optimization problem based on the triplet constraints, but the same development applies to the other cases.
By choosing a suitable metric regularizer, our problem amounts to minimizing the objective function:
where is the regularization parameter.
The choice of the metric regularizer affects the structure of the learned solution. For instance, the element-wise L1-norm promotes sparse metrics, while the trace norm encourages metrics with low rank; the Frobenius norm, on the other hand, tends to yield robust solutions.
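The Schatten p-norm underlying these regularizers is simply the l_p norm of the singular values, which can be sketched numerically (the function name is our own):

```python
import numpy as np

def schatten_norm(M, p):
    """Schatten p-norm: the l_p norm of the singular values of M.
    p = 1 gives the trace norm, p = 2 the Frobenius norm, and the
    limit p -> infinity the maximum singular value (spectral norm)."""
    s = np.linalg.svd(M, compute_uv=False)
    if np.isinf(p):
        return float(s.max())
    return float((s ** p).sum() ** (1.0 / p))
```

For a PSD metric matrix the singular values coincide with the eigenvalues, so this regularizer indeed acts on the eigenvalue spectrum as stated earlier.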
If the regularizer is a convex function, the objective function is also convex with respect to the metric. As the empirical hinge loss is non-differentiable at zero, we apply the projected subgradient descent method to seek the optimal metric. The subgradient of the objective is composed of two terms: one from the regularization, and the other from the empirical loss function. The subgradient of the empirical loss is the sum of the subgradients of the hinge loss over each constraint in the constraint set. For each triplet, the subgradient direction is zero when the loss is zero, i.e. when the constraint is satisfied with the required margin. Otherwise, we get:
Thus, the subgradient of is:
where denotes the set of triplets for which the constraint is violated.
After each gradient step, we need to project the metric back onto the positive semi-definite cone. This is done by setting the negative eigenvalues of the matrix to zero.
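The eigenvalue-clipping projection can be sketched as follows (a minimal sketch; the function name is our own):

```python
import numpy as np

def project_psd(M):
    """Project a symmetric matrix onto the PSD cone by clipping its
    negative eigenvalues to zero, as described in the text."""
    M = (M + M.T) / 2.0                 # symmetrize against round-off
    w, V = np.linalg.eigh(M)            # eigendecomposition M = V diag(w) V^T
    return (V * np.clip(w, 0.0, None)) @ V.T
```

This projection is the Euclidean (Frobenius-norm) projection onto the PSD cone, so the overall procedure is a standard projected subgradient method.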
The complete process is described in Algorithm 1. It is important to note that the values can be computed before the iteration, which reduces the computational cost.
IV Rademacher Complexity and Schatten p-Norms
For the regularizer, we here rely on a Schatten p-norm of the learned metric, which generalizes several well-known metric regularizers. A previous study gives a generalization bound on the pair-comparison empirical loss, defined by:
Here, the pair label equals 1 if the two points belong to the same class, and -1 otherwise; a margin parameter is again used. One expects the distance between two points to be small if they share the same class, and large otherwise.
Thus, the objective function to be minimized is:
where the second term is the metric regularizer, weighted by the regularization parameter.
From this, the study in  derives the following generalization bound, that holds with probability at least :
where denotes the empirical Rademacher complexity, defined as:
where one constant measures the diameter of the domain of the data, and the bound involves the dual of the chosen norm.
However, that study does not consider the Schatten p-norm as a metric regularizer, and we provide here a bound on the Rademacher complexity of the Schatten p-norm regularizer. This bound complements the existing results in the pair-comparison case. To do so, we first define the expectation of the empirical Rademacher complexity:
Then, we have the following theorem.
The Rademacher complexity of our distances with the Schatten p-norm, in the pair-comparison empirical loss case, is bounded by:
The dual norm of the Schatten p-norm is the Schatten q-norm, where q satisfies 1/p + 1/q = 1. Let us first assume that p > 1, so that q is finite. Denoting the eigenvalues of the matrix under consideration, we have
Let us now assume that p = 1; then q = ∞, and the dual norm is the maximum eigenvalue (spectral) norm.
It has to be noted that, as we use the union bound in the latter case, the corresponding result may be loose.
Armed with this result, we can now state a generalization bound for our distances with the Schatten p-norm in the pair-comparison case.
Then, with the same probability guarantee as above, we have that
V Related Work
V-A Metric Learning
Since the seminal work in this area, the majority of studies in metric learning has focused on numerical data. An optimal metric is usually learned, from some labeled information, within the family of Mahalanobis distances. Several metric learning methods have led to classifiers significantly better than those based on standard metrics such as the Euclidean distance (i.e. without learning). Among such methods we can cite the large margin nearest neighbour (LMNN) approach, which first determines the nearest neighbors of each point in the Euclidean space, then tries to move each point closer to its neighbors of the same class while pushing it away from neighbors of other classes. The information theoretic metric learning (ITML) approach tries to minimize the relative entropy between two multivariate Gaussian distributions under distance constraints. The maximum-eigenvalue metric learning approach [25, 26] uses the popular Frank-Wolfe algorithm to formulate the problem as a constrained maximum-eigenvalue problem, which avoids computing the full eigendecomposition at each iteration.
Metric learning on numerical data usually involves a linear transformation of the original Euclidean space. Some studies, however, rely on non-linear transformations, such as the χ²-LMNN and GB-LMNN approaches, which respectively make use of the χ² distance and of regression trees. Other non-linear transformations involve the use of kernels [8, 21]. Hamming distance metric learning has recently been proposed to learn a mapping from real-valued inputs to binary ones, with which hash functions can be fully utilized to enable large-scale applications.
Regarding distances between categorical data, an extensive comparison of various unsupervised measures has been conducted. The closest works to ours are one that measures the correlation structure within each feature, and one that considers the full class-feature correlation structure. However, the former neglects the potential correlations between features, and the scalability of both is questionable, as they optimize their matrices individually.
We note two recently proposed algorithms for learning metrics on categorical data. The heterogeneous metric learning with hierarchical couplings (HELIC) approach mainly focuses on an attribute-to-value and attribute coupling framework; its insufficient use of the labels leads to degraded performance. The DM3 method considers only the frequency information within the attributes; the lack of class label incorporation again makes its performance unappealing. We show this in the experimental part.
On theoretical generalization guarantees, the uniform stability concept was first used to bound the deviation of the true risk from the empirical risk in the Frobenius norm case. Another study gives a generalization bound without regularizers, under strong assumptions on the points' distribution. It has also been shown that robustness is necessary and sufficient for good generalization. Finally, the notion of Rademacher complexity has been used to derive generalization bounds for several regularizers; our generalization bound on the Schatten p-norm supplements this last line of work.
V-B Distance between categorical data
In addition to the VDM method we have used, there are other existing methods in the literature to quantify the distance between categorical data. Among them, the Hamming distance is widely known for its intuitive interpretation and simplicity of implementation. The idea is to count a distance of 0 when two categorical values match, and 1 otherwise. However, the Hamming distance cannot model dependences between features, nor their potential connection to the class information.
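The Hamming distance between two categorical vectors is a one-liner:

```python
def hamming_distance(x, y):
    """Hamming distance between two equal-length categorical vectors:
    the number of positions whose values differ (0 for a match, 1 otherwise)."""
    assert len(x) == len(y), "vectors must have the same length"
    return sum(a != b for a, b in zip(x, y))
```

Its simplicity is also its weakness: every mismatch counts the same, regardless of how informative the feature is for the class labels.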
The association-based metric uses an indirect probabilistic method. In particular, the distance between two values of an attribute is estimated as the sum of distances between the conditional probability density functions (cpdf) of the other attributes given these two values, where the distance between two probability density functions can be taken as the Kullback-Leibler divergence. However, this distance is zero if all the attributes are independent of each other.
The context-based metric [11, 12] is determined by a measure of symmetrical uncertainty (SU) between pairwise attributes. In particular, SU is computed from the entropies of the attributes and their information gain, and the metric between two attribute values is then derived from SU. As with the association-based metric, the context-based metric does not work well when the attributes are independent.
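A sketch of the SU computation follows, assuming the common definition SU(X, Y) = 2 · IG(X | Y) / (H(X) + H(Y)), where the information gain IG(X | Y) = H(X) + H(Y) − H(X, Y) is the mutual information (the paper's exact formula is not reproduced here):

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in Counter(xs).values())

def symmetrical_uncertainty(xs, ys):
    """SU(X, Y) = 2 * IG(X | Y) / (H(X) + H(Y)), a symmetric, normalized
    dependence measure in [0, 1]: 0 for independent attributes, 1 when
    one attribute fully determines the other."""
    ig = entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))
    denom = entropy(xs) + entropy(ys)
    return 2.0 * ig / denom if denom > 0 else 0.0
```

The value 0 for independent attributes illustrates the failure mode noted in the text: when attributes carry no information about each other, the derived metric degenerates.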
Fig. 1 displays four different feature correlation structures. As we can see, the CP-m and CP-s distances are in between the full correlation considered in KDML and the simple one used in the Euclidean distance.
V-C Computational complexity
The computational complexity of our categorical projection method is determined by two parts: calculating the VDM projection, and the metric learning itself. For the VDM projection, we compute the corresponding projection of all categorical features of all data points; for each feature value, we use the corresponding class information. The total computational cost thus scales with the number of points, the number of features, and the maximum number of values per feature.
In analyzing the computational complexity of the metric learning part, we take the projected matrices as fixed, since they are computed in advance. Thus, for a given number of iterations and a given constraint set size, the computational time of the CPML-m method (i.e. metric learning with the categorical projected multi-distance) and of the CPML-s method (i.e. metric learning with the categorical projected single-distance) is dominated by the spectral decomposition needed to project the metric back onto the positive semi-definite cone. In most other approaches (e.g. KDML), the computational complexity additionally scales with the number of classes, which becomes almost infeasible when that number is large.
As a result, the overall computational complexity of the CPML-m and CPML-s methods is the sum of these two parts.
VI Experiments
The performance of our CPML framework is validated by experiments on a synthetic dataset as well as on real-world datasets. On the synthetic dataset, we mainly test the properties of CPML, e.g., the influence of the number of features, the presence of noisy features, and the running time. The real-world datasets are mainly used for comparing the performance of the different approaches. In particular, we implement three baseline methods ourselves, i.e. LMNN, KDML and DM3, to the best of our understanding. For HELIC, we use the implementation kindly provided by the authors.
Further, we use triplet comparison accuracy and classification accuracy to assess the performance of the methods. The triplet constraints are built by requiring that any pair of points from the same class should have a smaller distance than any pair of points from different classes. For the triplet comparison accuracy, we also construct these triplets on the test data and evaluate the proportion correctly predicted. For the classification accuracy, we use Nearest Neighbor classification as the default classifier. Further, we randomly divide the data into training, validation and testing parts, and the trade-off parameter is selected over a range of values. For each scenario, the experiments are run 50 times and averaged; the summary statistics (mean, standard deviation) are reported.
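The triplet comparison accuracy described above can be sketched as a brute-force evaluation (a minimal sketch; `dist` is any learned distance function, and the exhaustive enumeration is only for illustration, not for large test sets):

```python
def triplet_accuracy(dist, X, y):
    """Fraction of triplets (i, j, k) with y[i] == y[j] != y[k] for which
    the distance ranks the same-class pair closer: dist(i,j) < dist(i,k)."""
    n = len(y)
    correct = total = 0
    for i in range(n):
        for j in range(n):
            if j == i or y[j] != y[i]:
                continue                      # j must be a same-class point
            for k in range(n):
                if y[k] == y[i]:
                    continue                  # k must be a different-class point
                total += 1
                correct += dist(X[i], X[j]) < dist(X[i], X[k])
    return correct / total if total else 0.0
```

An accuracy of 1.0 means every same-class pair is ranked closer than every cross-class pair, which is exactly the constraint the learned metric is trained to satisfy.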
VI-A Synthetic data
The synthetic data is generated as follows. The values of each categorical feature are generated by a multinomial distribution parameterized by uniform random variables. To distinguish the classes, we manually add a weight to one of the components for each class. After normalizing the parameters, the generated feature values exhibit a favored value in each feature for each class. Thus, one multinomial distribution is required per (class, feature) pair. More specifically, we fix the size of the dataset, vary the number of features, and choose the weights sequentially from a fixed range.
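The generation scheme above can be sketched as follows. This is a hedged sketch of our understanding of the scheme: the function name, the number of values per feature, and the choice of which component each class favors are illustrative assumptions, not the paper's exact protocol:

```python
import numpy as np

def make_synthetic(n, n_features, n_classes, n_values=4, weight=0.5, seed=0):
    """Generate categorical data: each (class, feature) pair gets a
    multinomial over `n_values` values drawn from uniform random
    parameters, with `weight` added to one class-specific component
    before normalization, so each class favors one value per feature."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, n_classes, size=n)
    # one multinomial per (class, feature): n_classes * n_features in total
    probs = rng.uniform(size=(n_classes, n_features, n_values))
    for c in range(n_classes):
        probs[c, :, c % n_values] += weight   # favored component per class
    probs /= probs.sum(axis=2, keepdims=True)
    X = np.empty((n, n_features), dtype=int)
    for i in range(n):
        for f in range(n_features):
            X[i, f] = rng.choice(n_values, p=probs[y[i], f])
    return X, y
```

Larger `weight` values make the favored value dominate each feature's distribution, which is why performance improves with the weight in the experiments below.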
VI-A1 Impact of the Number of Features and of the Weights
Fig. 2 and Fig. 3 show the triplet comparison accuracy and the pair-comparison classification accuracy of our CPML-s and CPML-m methods. In these comparisons, we test cases with different numbers of classes, and use the trace norm (our Schatten 1-norm) as the regularizer.
From these figures, we can easily see that, in general, the performance (i.e. triplet comparison accuracy and pair-comparison classification accuracy) improves as the number of features increases. This coincides with the common knowledge that the more informative features we have, the better the performance will be. Regarding the weights, it is also clear that performance improves with larger weight values.
These figures also show that CPML-s and CPML-m have similar performance (we did not find any significant differences between the two models). In summary, the performance tends to stabilize once the number of features is large enough. In that case, one obtains satisfying results even with the least informative features (i.e. weight = 0.1). The cases with fewer and with more classes show similar performance trends; the score of the latter looks slightly degraded, which might be due to the larger number of classes.
VI-A2 Impact of Noisy Features
We assess here whether the presence of noisy features impacts the valid calculation of the distance. We use CPML-s as the exploratory method, with the trace norm, the Frobenius norm and a further Schatten p-norm as regularizers. A set of informative features with a favored weight is used together with a set of noisy features (# N.F. denotes the number of noisy features). We then compute the ratio between the norm of the metric restricted to the noisy features and the norm of the metric on the whole feature space. The results are shown in Table III.
From Table III, we can see that, even when the number of noisy features is much larger than the number of meaningful features, our CPML-s method with various Schatten p-norms successfully controls the noisy features, as their influence on the metric remains small. This resistance to noisy features may be explained by our simple distance definitions.
VI-A3 Running Time Comparison
Fig. 4 compares the logarithm of the running time of the different methods. The left part shows the performance of CPML-s, CPML-m and KDML for different numbers of classes. As we can see, CPML-s is the fastest among the three, and its running time does not differ significantly across the different numbers of classes. In contrast, both CPML-m and KDML require heavier computation, and their running times depend on the number of classes. An interesting observation is that KDML is faster than CPML-m when the number of classes is small. The right part shows the performance of CPML-s, CPML-m, KDML, LMNN, HELIC and POLA for a fixed number of classes. Due to its online learning nature, POLA is the fastest, as it only needs to scan the whole dataset once. Among all the other methods, CPML-s requires the smallest running time. While HELIC requires less running time than CPML-m, KDML and LMNN usually require more, especially when the dataset is large.
VI-B Real world datasets
We select real-world datasets to test the performance of the CPML framework: Car, Balance, Mushroom, Voting, Nursery, Monks1, Monks2, Tic-tac-toe, Krkopt, Adult, Connect4, Census, Zoo, DNAPromoter, Lymphography, Audiology, Housevotes, Spect, Soybeanlarge, DNANomial, Splice, Krvskp and Led24. Detailed information about these datasets, including the number of instances (# I.), the number of categorical features (# C.F.) and the number of classes (# C.), is shown in Table IV.
|Dataset|# I.|# C.F.|# C.|
As shown by the results in Tables V and VI, our methods achieve very competitive performance against the baseline methods, obtaining the best results in most cases. Even on datasets such as Voting, where they may not be the best, their performance is quite close to it. What is more, their performance is consistent across datasets. Among the other methods, HELIC and DM3 usually perform better than KDML and LMNN; this might be because HELIC and DM3, like CPML, are specially designed for categorical data.