Many machine learning and data mining tasks rely on a good similarity or distance metric that captures the appropriate relationships between items. The most notable examples are k-nearest neighbors (k-NN) and k-means. Having an appropriate distance is crucial in many applications such as classification, regression, clustering, information retrieval, and recommender systems. It also plays an important role in psychological science and cognitive neuroscience. For instance, one common psychology experiment is to have subjects rate similarities/distances between many pairs of items. The researcher then attempts to understand how features of the items drive the distance judgements [tversky1977, oswal2016].
Motivated by the above, many researchers have demonstrated how learning an appropriate distance can improve performance on a variety of tasks. [xing2003, goldberger2004, weinberger2009, bar2005, davis2007] propose various ways of learning a Mahalanobis distance metric to improve clustering and k-NN prediction. [oswal2016] learns a bilinear similarity for inference tasks in psychology and cognitive neuroscience. All of these methods have nice properties such as global optimality guarantees and interpretability, but they are limited in expressive capacity. More expressive methods have been proposed to address this limitation. The random forest distance (RFD) [xiong2012] views distance metric learning as a pairwise classification problem: items in a pair are similar or they are not. This method is really a data transformation method, employing standard classification forests on the transformed data. This data transformation is costly, as it doubles the number of features and squares the number of data points that need to be partitioned, potentially leading to very deep trees for large training sample sizes. [liu2019] proposes a neural similarity learning method for improving visual recognition with CNNs. Various similarities based on siamese networks have been proposed for tasks such as signature verification [bromley1994] and one-shot learning [koch2015].
While the more expressive methods may demonstrate high accuracy on complex tasks, they lose computational efficiency and/or interpretability. Furthermore, many are tailored for specific structured problems, which limits flexibility. To bridge the gap, we propose a robust, interpretable, and computationally efficient method for learning distances, Similarity and Metric Random Forests (SMERF). Given an observed set of pairwise distances, SMERF trees attempt to partition the points into disjoint regions such that the average pairwise distance between points within each region is minimal. Our method directly generalizes standard classification and regression trees. Like classification and regression forests, SMERF is embarrassingly parallelizable, and unlike RFD, SMERF partitions $n$ points in $p$ dimensions rather than $n^2$ points in $2p$ dimensions. We show that SMERF can approximate a range of different notions of distances. Last, we demonstrate its flexibility and real-world utility by using it to predict links in networks.
2 Similarity and Metric Random Forests
Suppose we observe a set of points $x_1, \ldots, x_n \in \mathcal{X} \subseteq \mathbb{R}^p$, along with a symmetric matrix $Z \in \mathbb{R}^{n \times n}$ whose $(i, j)$th element $z_{ij}$ represents some notion of dissimilarity or distance between $x_i$ and $x_j$. We wish to learn a function $g(x, x')$ which predicts the distance for a new pair of observations $x$ and $x'$. (Typically a distance or dissimilarity is nonnegative, but SMERF can operate on negative-valued distances, for example a distance defined by taking the negative of a nonnegative similarity measure.)
To this end, we introduce an ensemble decision tree-based method called Similarity and Metric Random Forests (SMERF) (technically, it learns a semi-pseudometric). Starting at the root node of a tree, which is the entire input space $\mathcal{X}$, the training observations are partitioned into disjoint regions of the input space via a series of recursive binary splits. The orientation and location of each split are found by maximizing the reduction in average pairwise distance of points in the resulting child nodes, relative to the parent node. For convenience, we assume splits are made orthogonal to the axes of the input space, although in practice we allow arbitrarily oriented splits. Let $S$ be the set of points at a particular split node of a tree and $n_S = |S|$. Writing $z_{ij}$ for the observed distance between points $i$ and $j$, the average pairwise distance is:

$$I(S) = \frac{1}{n_S^2} \sum_{i \in S} \sum_{j \in S} z_{ij}.$$
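Let $\theta = (j, \tau)$ denote the tuple of split parameters at a split node, where $j$ indexes a dimension to split and $\tau$ specifies where to split along the $j$th dimension. Furthermore, let $S^L_\theta = \{i \in S : x_i^{(j)} \le \tau\}$ and $S^R_\theta = \{i \in S : x_i^{(j)} > \tau\}$ be the subsets of $S$ to the left and right of the splitting threshold, respectively, where $x_i^{(j)}$ denotes the $j$th dimension of $x_i$. Denote by $n_L$ and $n_R$ the number of observations in the left and right child nodes, respectively. Then a split is made via:

$$\theta^* = \arg\max_{\theta} \; n_S \, I(S) - n_L \, I(S^L_\theta) - n_R \, I(S^R_\theta), \tag{1}$$

where $I(\cdot)$ is the average pairwise distance of the points in a node and $n_S$ is the number of points in $S$.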
Eq. (1) finds the split that maximally reduces the average pairwise distance of points in the child nodes, relative to that of the parent node. This optimization is performed exhaustively. Nodes are recursively split until a stopping criterion has been reached, either a maximum depth or a minimum number of points in a node. The end result is a set of leaf nodes, which are disjoint regions of the feature space each containing one or more training points. An ensemble of randomized trees is constructed, where randomization occurs via the following two procedures: 1) using a random subsample or bootstrap of the training points for each tree and 2) restricting the search in Eq. (1) to a random subsample of the input feature dimensions.
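The exhaustive split search can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes axis-aligned splits, candidate thresholds at midpoints between consecutive distinct feature values, and an objective that weights each node's average pairwise distance by its size. The function names `avg_pairwise_distance` and `best_split` are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def avg_pairwise_distance(dist: Sequence[Sequence[float]], idx: Sequence[int]) -> float:
    """Mean of dist[i][j] over all ordered pairs (i, j) in idx, diagonal included."""
    n = len(idx)
    if n == 0:
        return 0.0
    return sum(dist[i][j] for i in idx for j in idx) / (n * n)


def best_split(
    points: Sequence[Sequence[float]],
    dist: Sequence[Sequence[float]],
    idx: Sequence[int],
) -> Optional[Tuple[int, float, float]]:
    """Exhaustively search axis-aligned splits of the node holding points idx.

    Returns (dimension, threshold, gain), or None if the node cannot be split.
    The gain is the size-weighted reduction in average pairwise distance
    (an assumed form of the split objective).
    """
    parent = len(idx) * avg_pairwise_distance(dist, idx)
    best = None
    for dim in range(len(points[idx[0]])):
        values = sorted({points[i][dim] for i in idx})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [i for i in idx if points[i][dim] <= thr]
            right = [i for i in idx if points[i][dim] > thr]
            gain = (parent
                    - len(left) * avg_pairwise_distance(dist, left)
                    - len(right) * avg_pairwise_distance(dist, right))
            if best is None or gain > best[2]:
                best = (dim, thr, gain)
    return best


# Toy data: dimension 0 is noise, dimension 1 separates two groups.
points = [[0.1, 0.0], [0.9, 0.1], [0.2, 1.0], [0.8, 1.1]]
groups = [0, 0, 1, 1]
dist = [[0.0 if a == b else 1.0 for b in groups] for a in groups]
print(best_split(points, dist, [0, 1, 2, 3]))  # selects a split on dimension 1
```

In a randomized tree, the loop over `dim` would run over a random subsample of the feature dimensions, and `idx` would come from a subsample or bootstrap of the training points.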
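In order to predict the distance for a new pair of points, a particular notion of distance between all pairs of leaf nodes is computed, defined as follows. (The pairwise leaf node distance we adopt is one of many possible sensible distances.) Let $l_k^{(t)}$ be the $k$th leaf node of the $t$th tree, and let $S_k^{(t)}$ be the subset of the training data contained in $l_k^{(t)}$. The distance between leaves $l_k^{(t)}$ and $l_m^{(t)}$ is the average observed distance between their training points:

$$h^{(t)}(k, m) = \frac{1}{|S_k^{(t)}| \, |S_m^{(t)}|} \sum_{i \in S_k^{(t)}} \sum_{j \in S_m^{(t)}} z_{ij}.$$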
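A new pair of points $x$ and $x'$ is passed down each tree until each point falls into a leaf node. Their predicted distance is simply the distance between the leaves that $x$ and $x'$ fall into. Letting $l^{(t)}(x)$ and $l^{(t)}(x')$ be the leaf nodes that $x$ and $x'$ fall into for the $t$th tree, the $t$th tree prediction $\hat{g}^{(t)}(x, x')$ is the leaf distance between $l^{(t)}(x)$ and $l^{(t)}(x')$. The prediction made by the ensemble of $T$ trees is the average of the individual tree predictions:

$$\hat{g}(x, x') = \frac{1}{T} \sum_{t=1}^{T} \hat{g}^{(t)}(x, x').$$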
The utility of SMERF is diverse. For instance, the distance matrix can represent distances between items for information retrieval, or it can represent links between nodes in a network. For knowledge discovery, one can use standard computationally efficient tree-based variable importance methods, enabling identification of features driving the observed distances/dissimilarities between points. In classification, regression, and clustering, one can learn distances to improve k-NN or k-means.
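The prediction step of this section can be sketched as follows. The sketch assumes that the distance between two leaves is the average observed distance between the training points they contain, which is one plausible reading of the leaf distance and not necessarily the exact definition used. A tree is represented, purely for illustration, as a pair `(leaves, assign)`: a list of training-index lists and a function mapping a query point to a leaf index. All names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Tree = Tuple[List[List[int]], Callable[[Sequence[float]], int]]


def leaf_distance(dist: Sequence[Sequence[float]],
                  leaf_a: Sequence[int],
                  leaf_b: Sequence[int]) -> float:
    """Average observed distance between training points of two leaves (assumed form)."""
    total = sum(dist[i][j] for i in leaf_a for j in leaf_b)
    return total / (len(leaf_a) * len(leaf_b))


def tree_predict(dist: Sequence[Sequence[float]], tree: Tree,
                 x: Sequence[float], x_prime: Sequence[float]) -> float:
    """Distance predicted by one tree: the distance between the two leaves reached."""
    leaves, assign = tree
    return leaf_distance(dist, leaves[assign(x)], leaves[assign(x_prime)])


def forest_predict(dist: Sequence[Sequence[float]], trees: Sequence[Tree],
                   x: Sequence[float], x_prime: Sequence[float]) -> float:
    """Ensemble prediction: the average of the individual tree predictions."""
    preds = [tree_predict(dist, tree, x, x_prime) for tree in trees]
    return sum(preds) / len(preds)


# Four training points in two groups; the observed distance is 1 across groups.
groups = [0, 0, 1, 1]
dist = [[0.0 if a == b else 1.0 for b in groups] for a in groups]

# Two hand-built depth-one trees (stumps), one per input dimension.
tree_1: Tree = ([[0, 1], [2, 3]], lambda x: 0 if x[1] <= 0.55 else 1)
tree_2: Tree = ([[0, 2], [1, 3]], lambda x: 0 if x[0] <= 0.5 else 1)

print(forest_predict(dist, [tree_1, tree_2], (0.3, 0.05), (0.7, 1.2)))
```

Because the leaf-to-leaf distances depend only on the training data, they can be computed once per tree after training and looked up at prediction time.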
3 SMERF Generalizes Classification and Regression Trees
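We show that the classification and regression tree procedures of Breiman et al. [breiman1984] are special cases of SMERF trees, instantiated by specifying particular notions of pairwise distances. Both classification and regression trees recursively split the training points by optimizing a split objective function. The classification tree procedure constructs a tree from points $x_1, \ldots, x_n$ and associated class labels $y_1, \ldots, y_n \in \{1, \ldots, K\}$. The Gini impurity for a tree node sample $S$ is defined as

$$I_G(S) = 1 - \sum_{k=1}^{K} f_k^2,$$

where $f_k$ is the fraction of points in $S$ whose class label is $k$. A classification tree finds the optimal orientation and location to split the training points using the following optimization:

$$\theta^* = \arg\max_{\theta} \; n_S \, I_G(S) - n_L \, I_G(S^L_\theta) - n_R \, I_G(S^R_\theta), \tag{4}$$

where $S^L_\theta$ and $S^R_\theta$ are the left and right child samples induced by the split parameters $\theta$, and $n_S$, $n_L$, and $n_R$ are the numbers of points in the parent, left child, and right child.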
Proposition 1. The optimization in Eq. (4) is a special case of Eq. (1) when the pairwise distance is defined as the indicator of points $x_i$ and $x_j$ belonging to different classes, that is, $z_{ij} = \mathbb{1}[y_i \neq y_j]$.
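The identity underlying this claim is easy to check numerically. The sketch below assumes that the average pairwise distance is taken over all $n^2$ ordered pairs, including each point paired with itself, as in the proof in Appendix A.1; the function names are hypothetical. Exact rational arithmetic avoids floating-point noise in the comparison.

```python
from collections import Counter
from fractions import Fraction
from typing import Hashable, Sequence


def gini(labels: Sequence[Hashable]) -> Fraction:
    """Gini impurity: one minus the sum of squared class fractions."""
    n = len(labels)
    return 1 - sum(Fraction(count, n) ** 2 for count in Counter(labels).values())


def avg_indicator_distance(labels: Sequence[Hashable]) -> Fraction:
    """Average over all n^2 ordered pairs of the indicator that labels differ."""
    n = len(labels)
    mismatches = sum(1 for a in labels for b in labels if a != b)
    return Fraction(mismatches, n * n)


labels = ["a", "a", "b", "c", "c", "c"]
print(gini(labels), avg_indicator_distance(labels))  # the two values coincide
```

Since the two impurity measures agree on every node sample, any split objective built from one is identical to the same objective built from the other.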
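Similarly, we claim a regression tree is a special case of a SMERF tree. A regression tree is constructed from points $x_1, \ldots, x_n$ and associated continuous responses $y_1, \ldots, y_n \in \mathbb{R}$. Denote by $I_V(S)$ the (biased) sample variance of the responses of the points in a node sample $S$. Specifically,

$$I_V(S) = \frac{1}{n_S} \sum_{i \in S} (y_i - \bar{y}_S)^2, \qquad \bar{y}_S = \frac{1}{n_S} \sum_{i \in S} y_i,$$

where $n_S$ is the number of points in $S$.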
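A regression tree finds the best split parameters, which maximally reduce the sample variance in the child nodes relative to the parent node. Specifically, the optimization is

$$\theta^* = \arg\max_{\theta} \; n_S \, I_V(S) - n_L \, I_V(S^L_\theta) - n_R \, I_V(S^R_\theta),$$

where $I_V(\cdot)$ denotes the (biased) sample variance of the responses in a node, $S^L_\theta$ and $S^R_\theta$ are the left and right child samples induced by the split parameters $\theta$, and $n_S$, $n_L$, and $n_R$ are the numbers of points in the parent, left child, and right child.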
Proposition 2. The regression tree optimization is a special case of Eq. (1) when the distance of a pair of points is defined as one-half the squared difference in the responses of the points, that is, $z_{ij} = \tfrac{1}{2}(y_i - y_j)^2$.
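As with the classification case, the key identity can be verified numerically: the biased sample variance equals the average, over all $n^2$ ordered pairs, of one-half the squared difference in responses. The sketch below uses exact rational arithmetic on integer responses; the function names are hypothetical.

```python
from fractions import Fraction
from typing import Sequence


def biased_variance(y: Sequence[int]) -> Fraction:
    """Biased sample variance: mean squared deviation from the sample mean."""
    n = len(y)
    mean = Fraction(sum(y), n)
    return sum((Fraction(v) - mean) ** 2 for v in y) / n


def avg_half_squared_difference(y: Sequence[int]) -> Fraction:
    """Average over all n^2 ordered pairs of one-half the squared difference."""
    n = len(y)
    return sum(Fraction((a - b) ** 2, 2) for a in y for b in y) / (n * n)


responses = [1, 4, 4, 7, 10]
print(biased_variance(responses), avg_half_squared_difference(responses))
```

Replacing the pairwise distance in `avg_half_squared_difference` with another function of the two responses yields a different node impurity, and hence a different flavor of regression tree.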
Proofs for these propositions are in the appendix.
One implication of Propositions 1 and 2 is that different flavors of classification and regression trees can be constructed, simply by changing the notion of distance. Doing so is equivalent to changing the split objective function. For example, one can construct a more robust regression tree by defining the pairwise distance .
4 Examining SMERF Under a Statistical Learning Framework
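In this section, a statistical learning framework is developed, which will ultimately shed light on the mathematical driving forces of SMERF. Suppose a pair of i.i.d. random vectors $X, X' \in \mathbb{R}^p$ is observed, and the goal is to predict a random variable $Z \in \mathbb{R}$ which represents some notion of distance between $X$ and $X'$. Formally, we wish to find a function $g(X, X')$ that minimizes $\mathbb{E}[(g(X, X') - Z)^2]$. To this end, we assume a training sample $\mathcal{D}_n = \{(X_i, X_j, Z_{ij}) : 1 \le i, j \le n\}$ distributed as the prototype $(X, X', Z)$ is observed, where each $X_i$ is i.i.d. ($Z_{ij}$ and $Z_{ik}$ may be dependent, since they represent distances for two pairs which share the same sample $X_i$.) Furthermore, assume both the distance and $g$ are symmetric, so that $Z_{ij} = Z_{ji}$ and $g(X, X') = g(X', X)$. The objective is to use $\mathcal{D}_n$ to construct an estimate $g_n(X, X'; \mathcal{D}_n)$ of the Bayes optimal distance function $g^*(X, X') = \mathbb{E}[Z \mid X, X']$, which is the true but generally unknown minimizer of $\mathbb{E}[(g(X, X') - Z)^2]$. For convenience in notation, we will omit $\mathcal{D}_n$ from $g_n$ when appropriate. An estimate $g_n$ is said to be consistent if $g_n(X, X') \to g^*(X, X')$ in probability as $n \to \infty$.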
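We analyze consistency of procedures under a simple class of distributions over $(X, X', Z)$. Suppose there exists an additional response variable $Y \in \mathbb{R}$ associated with each $X$, where $Y$ may be observed or latent within the training sample. This induces a distribution over the joint set of random variables $(X, X', Y, Y', Z)$. Assume the joint distribution of $(X, X', Y, Y', Z)$ is described by

$$Y = \sum_{j=1}^{p} m_j(X^{(j)}) + \epsilon, \tag{6}$$

$$Z = \tfrac{1}{2}(Y - Y')^2, \tag{7}$$

$$g^*(X, X') = \tfrac{1}{2}\big(m(X) - m(X')\big)^2 + \sigma^2, \tag{8}$$

where $\epsilon$ is zero-mean noise with variance $\sigma^2$, $m(X) = \mathbb{E}[Y \mid X]$ is the Bayes optimal regression function for predicting $Y$ under squared error loss, and $g^*$ is the Bayes optimal distance predictor if $\sigma^2$ in (6) is a constant (see Appendix B.3 for a detailed derivation). If $Y$ is observable in the training sample and (6, 7) are assumed, then one obvious approach for constructing an estimate of $g^*$ would be to use the training sample to construct estimates of $m$ and $\sigma^2$, and plug them into (8). Denoting by $m_n$ and $\sigma_n^2$ the estimates of $m$ and $\sigma^2$, respectively, consider the estimate

$$g_n(X, X') = \tfrac{1}{2}\big(m_n(X) - m_n(X')\big)^2 + \sigma_n^2. \tag{9}$$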
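The form of the Bayes optimal distance predictor in (8) can be checked with a short calculation. The derivation below is a sketch under stated assumptions: the responses follow $Y = m(X) + \epsilon$ and $Y' = m(X') + \epsilon'$, the noise terms $\epsilon$ and $\epsilon'$ are independent of each other and of $(X, X')$ with mean zero and constant variance $\sigma^2$, and the distance is $Z = \tfrac{1}{2}(Y - Y')^2$.

```latex
\begin{align*}
\mathbb{E}[Z \mid X, X']
  &= \tfrac{1}{2}\,\mathbb{E}\big[(Y - Y')^2 \mid X, X'\big] \\
  &= \tfrac{1}{2}\,\mathbb{E}\Big[\big((m(X) - m(X')) + (\epsilon - \epsilon')\big)^2 \,\Big|\, X, X'\Big] \\
  &= \tfrac{1}{2}\big(m(X) - m(X')\big)^2
     + \big(m(X) - m(X')\big)\,\mathbb{E}[\epsilon - \epsilon']
     + \tfrac{1}{2}\,\mathbb{E}\big[(\epsilon - \epsilon')^2\big] \\
  &= \tfrac{1}{2}\big(m(X) - m(X')\big)^2 + \sigma^2 .
\end{align*}
```

The cross term vanishes because the noise has mean zero, and $\mathbb{E}[(\epsilon - \epsilon')^2] = 2\sigma^2$ by independence, which is why the noise variance enters the predictor as an additive constant.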
We can show that a consistent estimate of exists, using random forests. Scornet et al. [scornet2015] proved consistency of regression random forests in the context of the additive regression model described by Eqn. (6). This result, which we refer to as Scornet Theorem 2 (ST2), is reviewed in Appendix B.2. We will build off of ST2 to analyze asymptotic performance of distance estimates constructed using random forest procedures. First, we define more notation.
A regression forest estimate is an ensemble of randomized regression trees constructed on training sample , where the randomization procedure is the same as that defined for SMERF in Section 2. The goal is to minimize . Each tree is constructed from a subsample of the original training points. Denote by the specified size of this subsample, and denote by the specified number of leaves in each tree. A tree is fully grown if , meaning each leaf contains exactly one of the points. The prediction of the response at query point for the tree is denoted by , where are i.i.d random variables which are used to randomize each decision tree. The forest estimate is the average of the tree estimates
To make the analysis more tractable, we take the limit as , obtaining the infinite random forest estimate
Here, expectation is taken with respect to conditioned on and .
Suppose the conditions in ST2 are satisfied. Then the estimate
So far, we have considered estimates constructed from regression estimates . Such procedures require that is accessible in the training sample, which is not always the case. Now we assume is latent, and thus can only be constructed by observing each in the training sample. In this case, we use SMERF to construct . Under assumption (7), Proposition 2 states that a SMERF tree constructed from is identical to a regression tree constructed from . Thus, we may examine the SMERF estimate in terms of the regression tree estimates and , for which ST2 provides consistency.
SMERF builds randomized trees constructed using Eq. (1). The prediction of the distance at query pair made by the tree is denoted by , where as before is the randomization parameter for each tree. Now, assume trees are fully grown (meaning one point per leaf). By Proposition 2, the SMERF tree is equivalent to the fully grown regression tree constructed with the same randomization parameter. Having equivalence of leaves between the two trees, denote by the leaf of the tree containing the single training point . Similarly denote by and the leaves that and fall into at prediction time. The fully grown regression tree makes the following prediction at query point :
That is, the tree makes prediction when falls into the same leaf containing the single training point . The fully grown SMERF tree makes the distance prediction at query pair :
That is, the tree makes prediction when falls into the leaf containing and falls into the leaf containing . Thus, the following relationship between , , and holds:
The SMERF estimate for trees is
As was done for regression random forests, we analyze the infinite SMERF estimate
Comparing the last line of (15) to (10), we see that is implicitly an estimate of the form . Thus, SMERF estimates two contributions to : The term estimates one-half the squared deviation of and , while the term estimates . By Theorem 1, we know that . Thus, Lemma 1 (Appendix B.1) tells us half of the work is done in establishing consistency of ; the second half is establishing that . Unfortunately, this term is difficult to analyze due to the fact that and are not independent with respect to the distribution of . However, our own numerical experiments for various additive regression settings suggest that gets arbitrarily close to with large (Appendix B.4). Based on these empirical findings, we conjecture that SMERF is consistent under our framework. Confirmation of this is left for future work.
5.1 Simulations for Distance Learning
We evaluate the ability of SMERF to learn distances in three very different simulated settings. SMERF’s performance is compared to two other methods which are designed to learn quite different notions of distances. The first method we compare to learns a symmetric squared Mahalanobis distance by solving the following optimization problem:
where and are the same as in Section 2. This can be seen as a regression form of [xing2003]. We refer to this method as Mahalanobis. The second method we compare to learns a bilinear similarity [chechik2010, oswal2016, kulis2012] via the optimization:
where is the matrix form of and is a symmetric matrix whose element is the similarity between and . The learned matrix can be viewed as a linear mapping to a new inner product space, such that the dot product between two points and when mapped to the new space is (hopefully) close to . We refer to this method as Bilinear.
We implement Mahalanobis and Bilinear in Matlab using the CVX package with the Mosek solver, while SMERF was implemented in native R. In all three experiments, the number of training examples ranged from 20 to 320 and the number of test examples was 200. The dimensionality of the input space, , for all three experiments is 20, but only the first two dimensions contribute to the distance and the other dimensions are irrelevant. Each experiment was repeated ten times. The experiments are described below, with additional details in Appendix D.2.
Regression Distance models the pairwise distances as the squared deviation between additive regression responses. Specifically, each dimension of each is sampled i.i.d. from . Then regression responses are computed according to
where is the dimension of and . The pairwise distance is defined as , which is bounded in . This boundedness was specified purposefully since Bilinear requires a similarity matrix for training and testing. Thus, we have a natural conversion from distance to similarity using the formula .
Bilinear Distance models the similarities between two points as the product of their regression responses. Specifically, each dimension of each is sampled i.i.d. from . Regression responses are derived from (16), but with removed. Then similarity , which is bounded in . We obtain as input to SMERF and Mahalanobis.
Radial Distance models the distance between two points constrained to the unit ball by the squared deviation of the vector norms of their first two dimensions. Specifically, each $x_i$ is uniformly distributed within the 20-dimensional unit ball. Letting $\tilde{x}_i$ denote the first two dimensions of $x_i$, the distance is $z_{ij} = (\|\tilde{x}_i\| - \|\tilde{x}_j\|)^2$, which is bounded in $[0, 1]$. Again, we obtain the corresponding similarity by converting from the distance.
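A data generator for the Radial Distance setting can be sketched as follows. It assumes the distance is the squared difference between the Euclidean norms of the first two coordinates, and it samples uniformly from the unit ball by scaling a random Gaussian direction by a radius drawn as $U^{1/p}$ with $U$ uniform on $[0, 1]$. The function names and the seed handling are illustrative assumptions, not the authors' simulation code.

```python
import math
import random
from typing import List, Sequence, Tuple


def sample_unit_ball(dim: int, rng: random.Random) -> List[float]:
    """Draw one point uniformly from the dim-dimensional unit ball."""
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in direction))
    radius = rng.random() ** (1.0 / dim)  # the radius has CDF r**dim on [0, 1]
    return [radius * v / norm for v in direction]


def radial_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared difference between the norms of the first two coordinates."""
    return (math.hypot(a[0], a[1]) - math.hypot(b[0], b[1])) ** 2


def simulate_radial(n: int, dim: int = 20, seed: int = 0) -> Tuple[List[List[float]], List[List[float]]]:
    """Return n points in the unit ball and their n-by-n radial distance matrix."""
    rng = random.Random(seed)
    points = [sample_unit_ball(dim, rng) for _ in range(n)]
    dist = [[radial_distance(a, b) for b in points] for a in points]
    return points, dist


points, dist = simulate_radial(n=5)
print(len(points), len(points[0]), max(max(row) for row in dist) <= 1.0)
```

Only the first two coordinates enter the distance, so the remaining 18 coordinates act as irrelevant features that a learner must ignore.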
Figure 1 shows three performance measures: 1) mAP-10 (top row) is the mean average precision, using the ten ground-truth closest points to each test point as the relevant items; 2) Spearman (middle row) is the average Spearman correlation between the predicted and ground-truth distances between each test point and every other point; 3) RMSE (bottom row) is the root-mean-squared-error between the predicted and ground-truth distances.
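Two of these measures, RMSE and Spearman correlation, can be sketched with the standard library alone. The helper names are hypothetical, ties receive average ranks, and the Spearman coefficient is computed as the Pearson correlation of the ranks; the sketch does not cover mAP-10 and is not the evaluation code used for Figure 1.

```python
import math
from typing import List, Sequence


def rmse(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Root-mean-squared error between predicted and ground-truth distances."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred))


def average_ranks(values: Sequence[float]) -> List[float]:
    """One-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks.

    Undefined (division by zero) if either input is constant.
    """
    ra, rb = average_ranks(a), average_ranks(b)
    mean_a, mean_b = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / math.sqrt(var_a * var_b)


predicted = [0.10, 0.40, 0.35, 0.90]
truth = [0.05, 0.50, 0.30, 0.80]
print(rmse(predicted, truth), spearman(predicted, truth))
```

Spearman correlation depends only on the ordering of the predicted distances, whereas RMSE also penalizes errors in scale, so a method can rank neighbors well while still scoring poorly on RMSE.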
The Regression Distance experiment was designed specifically for Mahalanobis to perform well. The left column of Figure 1 shows that Bilinear performs poorly in all three measures, while SMERF performs comparably to Mahalanobis, perhaps even better for small sample sizes. The Bilinear Distance experiment was designed specifically for Bilinear to perform well. Similarly, we see in the middle column that Mahalanobis completely fails in all three measures while SMERF eventually performs comparably to Bilinear. The right column shows that SMERF is the only method that can learn the Radial Distance. Furthermore, SMERF is able to correctly identify the first two dimensions as the important dimensions for the Radial Distance (Appendix D.4 and Figure 5). Overall, these results highlight the robustness and interpretability of SMERF.
5.2 Network Link Prediction
Related to the notion of distances/similarities between items is the notion of interactions between items. Here, we demonstrate the flexibility and real-world utility of SMERF by using it to predict links in a network when node attribute information is available. We compare our method to two state-of-the-art methods for predicting links in a network. One is the Edge Partition Model (EPM) [zhou2015], which is a Bayesian latent factor relational model. It is purely relational, meaning it does not account for node attributes. The second is the Node Attribute Relational Model (NARM) [zhao2017], which builds off of EPM by incorporating node attribute information into the model priors. Thus, the methods span a gradient from only using network structural information (EPM) to only using node attribute information (SMERF). We compare the three methods on three real-world network data sets used in [zhao2017]. Lazega-cowork is a network of cowork relationships among 71 attorneys, containing 378 links. Each attorney is associated with eight ordinal and binary attributes. Facebook-ego is a network of 347 Facebook users with 2419 links. Each user is associated with 227 binary attributes. NIPS234 is a network of 234 NIPS authors with 598 links indicating co-authorship. Each author is associated with 100 binary attributes. Additional details regarding the network data sets and experimental setup can be found in Appendix E.
For each data set, the proportion of nodes used for training (TP) was varied from 0.1 to 0.9, with the rest used for testing. We note that this is different from [zhao2017], in which the data was split by node-pairs rather than by nodes. For EPM, however, we split the data by node-pairs. This is because EPM does not leverage node attributes, so it cannot predict links for newly observed nodes any better than chance. Experiments for each TP were repeated five times. For SMERF, the input distance matrix was computed from the adjacency matrix. Thus, SMERF predictions represent scores between zero and one reflecting the belief that a link exists. EPM and NARM explicitly model such beliefs. Thus, we use the area under the ROC (AUC-ROC) and Precision-Recall (AUC-PR) curves as measures of performance.
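Splitting a network by nodes and scoring link predictions with AUC-ROC can be sketched as below. This is an illustrative sketch only: `split_by_nodes` and `auc_roc` are hypothetical helper names, the rounding rule for the training set size is an assumption, and AUC-ROC is computed from its pairwise-ranking definition (the probability that a linked pair receives a higher score than an unlinked pair, counting ties as one half).

```python
import random
from typing import List, Sequence, Tuple


def split_by_nodes(adj: Sequence[Sequence[int]], train_prop: float,
                   seed: int = 0) -> Tuple[List[int], List[int], List[List[int]]]:
    """Randomly assign nodes to train and test sets.

    Returns the training node indices, the test node indices, and the
    adjacency sub-matrix restricted to the training nodes.
    """
    nodes = list(range(len(adj)))
    random.Random(seed).shuffle(nodes)
    n_train = max(1, round(train_prop * len(nodes)))
    train, test = sorted(nodes[:n_train]), sorted(nodes[n_train:])
    train_adj = [[adj[i][j] for j in train] for i in train]
    return train, test, train_adj


def auc_roc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


adjacency = [
    [0, 1, 0, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 1, 0],
]
train, test, train_adj = split_by_nodes(adjacency, train_prop=0.6)
print(train, test)
print(auc_roc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

Under a node-based split, every test pair involves at least one node never seen in training, so only a model that uses node attributes can score such pairs in an informed way.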
Figure 2 shows AUC-ROC (top) and AUC-PR (bottom) for the three networks. The left column shows that for small TP on Lazega-cowork, SMERF outperforms NARM for AUC-ROC, while NARM eventually catches up. They perform comparably in terms of AUC-PR. Both methods outperform EPM until the TP equals 70%. On Facebook-ego, EPM outperforms the other methods. SMERF performs slightly better than NARM for small TP and comparably otherwise. On NIPS234, SMERF outperforms NARM in both measures for nearly all values of TP. SMERF substantially outperforms EPM for small TP, but EPM quickly begins to win once TP equals 50%. Overall, the results indicate that the general purpose SMERF, which disregards network structural information, is highly competitive with dedicated relational models for the purpose of link prediction.
We have presented a novel tree ensemble-based method called SMERF for learning distances, which generalizes classification and regression trees. Via its connection to regression forests, analysis of SMERF under a statistical learning framework is performed through the lens of regression forest estimates. We show that SMERF can robustly learn many notions of distance, and also demonstrate how it can be used for network link prediction. Future work will build off of theoretical work here for establishing consistency of SMERF. Second, we will explore how different notions of distance constructed from labeled data can impact k-NN accuracy using the learned SMERF distance function. Third, SMERF will be extended to handle missing values in the distance matrix during training. Last, we will investigate ways to improve performance of SMERF for link prediction by leveraging the addition of network structural information.
Appendix A Proofs of Propositions 1 and 2
A.1 Proof of Proposition 1
Consider two trees partitioning a set of points , constructed with the same stopping criteria. Without loss of generality, assume the trees are deterministic. Then the trees will be identical if the split optimizations are identical. Thus, it suffices to show that if , then in (1) is identical to in (4).
First, compute for a sample of points . The sum of all pairwise distances is just the total number of pairs of points in that don’t share the same class label. Denote by the number of points in whose class label is . For any pair of distinct class labels , there are pairs of points whose members have labels that are either or , but who do not share the same label. Thus, the total number of pairs of points in not sharing the same label is . Since the distance matrix for the set double counts each pair of distinct points in (because it is symmetric), we multiply this sum by . Noting that there are total pairs of points, the average pairwise distance is
We show that this expression is equivalent to :
where in the third to last equality we used the identity .
A.2 Proof of Proposition 2
Appendix B Supplementary Results for Section 4
B.1 Lemma 1 and Corollary 1
Recall the Bayes distance predictor
Further recall the estimate defined by (9):
We have the following lemma.
Suppose and . Then .
The function is continuous in and . Therefore, it follows from the continuous mapping theorem (CMT) that . Similarly, the function is continuous in and . Using the result of the first application of CMT, another application of CMT yields . This completes the proof.
The proof of Lemma 1 leads to the following corollary.
Let . Then .
B.2 Review of Scornet et al. [scornet2015] Theorem 2 (ST2)
Scornet et al. Theorem 2
Assume Equation (6) holds. Let . Then, provided , , and