Supervised Visualization for Data Exploration

06/15/2020 ∙ by Jake S. Rhodes, et al. ∙ Université de Montréal Utah State University 13

Dimensionality reduction is often used as an initial step in data exploration, either as preprocessing for classification or regression or for visualization. Most dimensionality reduction techniques to date are unsupervised; they do not take class labels into account (e.g., PCA, MDS, t-SNE, Isomap). Such methods require large amounts of data and are often sensitive to noise that may obfuscate important patterns in the data. Various attempts at supervised dimensionality reduction methods that take into account auxiliary annotations (e.g., class labels) have been successfully implemented with goals of increased classification accuracy or improved data visualization. Many of these supervised techniques incorporate labels in the loss function in the form of similarity or dissimilarity matrices, thereby creating over-emphasized separation between class clusters, which does not realistically represent the local and global relationships in the data. In addition, these approaches are often sensitive to parameter tuning, which may be difficult to configure without an explicit quantitative notion of visual superiority. In this paper, we describe a novel supervised visualization technique based on random forest proximities and diffusion-based dimensionality reduction. We show, both qualitatively and quantitatively, the advantages of our approach in retaining local and global structures in data, while emphasizing important variables in the low-dimensional embedding. Importantly, our approach is robust to noise and parameter tuning, thus making it simple to use while producing reliable visualizations for data exploration.



There are no comments yet.


page 19

page 20

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Large, high-dimensional data sets are now regularly obtained in nearly every field of research and business. Despite the high dimensions of the data, it is typically assumed that the data can be accurately described using a much smaller set of latent variables. Dimensionality reduction methods aim to find these latent variables and are commonly a key step in the data analysis pipeline. For example, dimensionality reduction is often performed as a preprocessing step as many downstream methods of analysis suffer in performance and accuracy from the ‘curse of dimensionality’. Principal components analysis (PCA) 

Pearson (1901); Hotelling (1933) is a popular choice for preprocessing because of its simplicity and computational speed. Nonnegative matrix factorization (NMF) Lee and Seung (1999) and classical multi-dimensional scaling (MDS) Kruskal and Wish (1978) are also commonly used. However, these three methods assume a linear model on the data and thus are inefficient at representing nonlinear relationships in the latent space. Therefore, much work has focused on nonlinear dimensionality reduction methods including local linear embedding (LLE) Roweis and Saul (2000), Isomap Tenenbaum et al. (2000), diffusion maps Coifman and Lafon (2006); Nadler et al. (2006)

, and autoencoders 

Hinton and Zemel (1994).

Recent dimensionality reduction methods have focused on reducing dimensions for data visualization. Data visualization is an important aspect of exploratory data analysis for aiding humans in developing an understanding of the underlying structure within the data, which can enable hypothesis generation and data interpretation. Recent, successful visualization methods include t-SNE (t-Distributed Stochastic Neighbor Embedding) Maaten and Hinton (2008), UMAP (Uniform Manifold Approximation and Projection) McInnes et al. (2018), and PHATE (Potential of Heat-diffusion for Affinity-based Transition Embedding) Moon et al. (2019). These methods have found applications in computational biology Amir et al. (2013); Macosko et al. (2015); Gopalakrishnan et al. (2018); Becht et al. (2019); Zhao et al. (2020); Karthaus et al. (2020); Baccin et al. (2019), visualizing text Chen and Salman (2011), time series Duque et al. (2019)

, and the internal nodes in neural networks 

Horoi et al. (2020); Gigante et al. (2019).

However, all of these dimensionality reduction and visualization methods are unsupervised, where class labels are unavailable or ignored. Supervised adaptations have been created for some of these methods. The majority of these Bair et al. (2006); Yu et al. (2006); qing Zhang (2009); Li et al. (2018); Zhang et al. (2016); Jia et al. (2019); Vlachos et al. (2002); Ribeiro et al. (2008) are used for classification or regression preprocessing, and a few are used for visualization Vlachos et al. (2002); Tapson and Greene (2005); Babaee et al. (2016); Barshan et al. (2011). Many of these approaches incorporate class labels directly in a loss function, which can create an exaggerated separation between classes and introduce other distortions in the visualization. Additionally, some of these adaptations are limited to classification and not easily adapted to regression problems where the labels are not categorical. Finally, these approaches are often sensitive to parameter tuning, which may be difficult to perform without a specific measure of visualization quality.

In this paper we introduce a new approach to supervised visualization based on random forests Breiman (2001) and diffusion-based dimensionality reduction Coifman and Lafon (2006); Nadler et al. (2006). In particular, we leverage advances made by PHATE Moon et al. (2019) in extracting the information learned through a diffusion process for visualization. Hence we name our visualization technique RF-PHATE. The approach incorporates variable importance as measured from the trained random forests and gives a noise-resilient visual of the predictor space which may be used in the context of data exploration with known labels. We show empirically that RF-PHATE retains both the local and global structure of the data while emphasizing the important variables in the visualization. We also show that RF-PHATE is robust to noise and parameter tuning.

2 Related work on supervised dimensionality reduction

Supervised dimensionality reduction methods can generally be categorized into three major groups: PCA, NMF, and manifold learning Chao et al. (2019). The original implementation of supervised PCA (SPCA) selects a subset of features that are most highly correlated with the labels. PCA is applied to these features to generate a new feature space to be used in the regression problem Bair et al. (2006). The primary motivation for this approach was to handle problems where the number of features exceeds the number of observations () Bair et al. (2006). Another variation of SPCA was introduced in Yu et al. (2006), which may be kernelized and is capable of handling missing values. SPCA was applied for data visualization in Barshan et al. (2011)

. However, SPCA is linear and does not therefore accurately capture non-linear relationships in the data structure in low dimensions. In addition, the number of components selected is usually determined based on global variance explained, which does not adequately capture local relationships 

Yu et al. (2006).

NMF Lee and Seung (1999) seeks to decompose a data matrix into two, non-negative, matrices and , by minimizing the Frobenius norm of the difference between and while restricting the entries of and to be nonnegative. The rows of

can be regarded as basis vectors while the columns of

form the axes of the lower-dimensional space Tsuge et al. (2001). A number of supervised and semi-supervised modifications to NMF (SNMF or SSNMF) have been proposed, such as constrained NMF Haifeng Liu et al. (2012), structured NMF Li et al. (2018), and NMF for constrained clustering Zhang et al. (2016). Most of these approaches use the labels in a regularization term in the optimization problem. Jia et al. (2019) proposed a semi-supervised NMF with both similarity and dissimilarity regularization terms. In Liu et al. (2008); Tapson and Greene (2005); Babaee et al. (2016) NMF has also been used for low-dimensional visualization. However, it is still a linear method and therefore does not capture the intrinsic geometric structure of nonlinear data Cai et al. (2008). In addition, supervised versions of NMF tend to accentuate class differences in clusters, providing inflated separation between groups, and tight clustering within groups.

Manifold-based dimensionality reduction methods assume that the high-dimensional data lie on a low-dimensional manifold. Examples of manifold-based algorithms include Isomap Tenenbaum et al. (2000), UMAP McInnes et al. (2018), Diffusion Maps Coifman and Lafon (2006), t-SNE van der Maaten and Hinton (2008), LLE Roweis and Saul (2000), Laplacian Eigenmaps Belkin and Niyogi (2003), and PHATE Moon et al. (2019). Supervised versions of some of these methods have been proposed. WeightedIso, Iso+Ada Vlachos et al. (2002), and Enhanced-supervised Isomap (ES-Isomap) Ribeiro et al. (2008)

are supervised versions of Isomap, which estimates the geodesic distances between points. These supervised variations use the class labels to modify the distance metric to accentuate closeness in like-class samples while emphasizing different classes. Supervised versions of LLE (SLLE) have also been proposed that modify the dissimilarity metric using the class information 

de Ridder et al. (2003); qing Zhang (2009). However, like LLE, SLLE is sensitive to parameters. It was also shown in Vlachos et al. (2002); Ribeiro et al. (2008); Polito and Perona (2002) that LLE and Isomap are subject to “overclustering” unless the data are comprised of a single, well-clustered sample.

3 Supervised neighborhoods via random forest proximities

Manifold learning techniques are typically based on the construction of a kernel that computes some measure of pairwise local similarity between data points. For example, classical diffusion maps applies a Gaussian kernel with a fixed bandwidth to pairwise Euclidean distances Coifman and Lafon (2006) while PHATE applies the -decay kernel with an adaptive -nearest neighbor (NN) bandwidth Moon et al. (2019). Isomap similarly builds a -NN graph to measure similarity between points Tenenbaum et al. (2000). Trees are also commonly used to construct these kernel measures such as the use of -d trees for computing -NN graphs Friedman et al. (1977). Long range relationships between points are then learned by chaining together the local similarities, for example by diffusion Coifman and Lafon (2006); Moon et al. (2019) or shortest path algorithms Tenenbaum et al. (2000).

To create supervised versions of manifold learning techniques, the label information can be incorporated in the kernel construction. We propose to do this using random forests. Random forests Breiman (2001)

are a tree-based ensemble method commonly used for classification and regression. They are widely considered one of the best out-of-the-box supervised learning techniques as they often achieve very good results with relatively little tuning 

Cutler et al. (2012)

. In addition to classification and regression, random forests have a number of additional uses, such as variable selection, imputation, outlier detection, and unsupervised learning

Cutler et al. (2012).

Random forests were created as an extension to Leo Breiman’s bagging idea Breiman (1996)

, which uses bootstrap samples (sampling with replacement) to train base learners (decision trees in this case). The out-of-sample data is said to be “out of bag”

Breiman (2001), which is often used as a test set to estimate the generalization error of the model as well as to assess variable importance Cutler et al. (2012). The aggregate results of the base learners are combined to make a final prediction Breiman (1996)

. While individual decision trees tend to be sensitive to the data, random forests are very stable due to the two-part randomness of the algorithm (boostrapping and random variable selection), which ensures low correlation among the base learners 

Segal and Xiao (2011). Random forests are also robust to noise as splits are not likely to be determined by noise variables when meaningful variables are considered at a given node. Overall, as the number of trees increases, meaningful variables are more likely to be shown as important while noise variables will show less importance, averaging over all trees.

Definition 3.1.

Consider two out-of-bag observations from the training data and . The random forest proximity between these points is defined as the proportion of trees in the trained random forest for which the observations and share the same terminal (leaf) node Breiman and Cutler ((Accessed on 04/15/2020).

From Definition 3.1, it is clear that two observations that always end in the same terminal node will have a proximity of 1, while those that never fall into the same terminal node will have a proximity of 0. The proximities applied to all pairs of training points form a symmetric, positive definite kernel matrix with ones along the diagonal. The random forest proximities provide a measure of how similar observations are in the predictor space and naturally take into account the weight of variable importance relative to the training task. That is to say, two observations may be near one another in the predictor space due to having similar values in relatively few, important variables, but may otherwise be far apart in a Euclidean space Cutler et al. (2012). Thus using the proximities as inputs to visualization methods will produce visualizations that emphasize the variables important to the supervised task while largely ignoring the irrelevant variables Lin and Jeon (2006).

We propose to use the random forest proximities as the kernel for RF-PHATE because of their desirable properties described above. Random forest proximities have been used previously for visualization by applying MDS Cutler et al. (2012) or graph-based methods Golino and Gomes (2014). However, applying MDS directly to the proximities loses much of the geometric structure of the data (see figure 9 in appendix F), while force-directed layout methods (such as the Fruchterman-Reingold algorithm Fruchterman and Reingold (1991) used in Golino and Gomes (2014)) do not scale well with large data sets Gajer et al. (2004). In contrast, our approach preserves the geometric structure (both locally and globally) due to our use of diffusion and information distances and scales well to both small and large data sets.

4 Rf-Phate

Kernels typically contain mostly local information about the data points used for training rather than global. This includes the random forest proximities. Consider a classification task with two points and that have the same class label. Suppose and are similar to each other in the variables that are important for classification. Then and will have a high proximity as measured by a trained random forest as they are likely to end up in the same terminal node of each tree. This is true even if the points are highly dissimilar in variables that are not important for the classification. Now, suppose and have the same class label but are dissimilar in one or more variables that are relevant for classification. Then and will have a lower proximity between them compared to the first scenario as it is less likely they will end up in the same terminal nodes. However, the proximities between sufficiently dissimilar points will all be zero, thus obscuring some of the long-range relationships in the data. Therefore, directly embedding the proximities leads to global distortion.

A robust way of learning global relations from local proximities is via diffusion (see Coifman and Lafon (2006); Nadler et al. (2006); Moon et al. (2019)). Because of recent successes in adapted, diffusion-based visualizations (see Moon et al. (2019); Duque et al. (2019); Gigante et al. (2019); Horoi et al. (2020)), we create a new, diffusion-based, supervised visualization tool, called RF-PHATE, which combines the locally-adaptive random forest kernel with recent advances in manifold learning; all of the steps are provided in Algorithm 1. Random forests can be viewed as a

-nearest neighbor (NN) classifier with an adaptive metric 

Lin and Jeon (2006). The proximities produced by the random forest can therefore be viewed as locally-adapted affinities in the predictor space, where local variable importance is assessed using a measure of class purity at each node, making the affinities insensitive to data density. This further motivates the use of random forest proximities to encode local structure in data that is dependent on the label space.

Input: Data matrix , class labels , output dimension
Output: , an -dimensional () embedding
  1. Compute the random forest proximities,

  2. row-normalize to form the diffusion operator

  3. compute time scale via Von Neumann Entropy

  4. diffuse for time steps

  5. Compute potential representations

  6. Compute potential distances from

  7. apply classical MDS to

  8. apply metric MDS to with initialization

Algorithm 1 The RF-PHATE algorithm

Diffusion maps constructs a graph from local similarities to learn the global data geometry. Typically, the graph is constructed by applying a kernel function (e.g. Gaussian) to pairwise Euclidean distances using a fixed kernel bandwidth resulting in an kernel matrix , where

is the number of points we wish to embed, giving a notion of similarity between data points. Instead of applying a kernel function to Euclidean distances, our method uses the random forest proximity (local-affinity) matrix. The local affinity matrix is then row-sum normalized to create a Markov transition matrix or diffusion operator,

, which is used to capture global relationships. An entry, , of row in

represents the probability of transitioning or “walking” from point

to point in a single step of a random walk on a graph constructed from the affinity matrix . Thus the probability of transitioning from point to point on this graph in a single time step is high if the distance between them is small. This prevents an exiting of the intrinsic manifold of the data. is raised to the power of to simulate diffusion steps. This has the effect of denoising as transitions with low probabilities are filtered out, while transitions with high probability retain their importance. In diffusion maps, the local and global information learned in is often extracted into lower dimensions using eigendecomposition.

Diffusion maps has several weaknesses that prevent it from being effective at visualization in most settings. First, in many applications a fixed bandwidth for all points is not appropriate as the data may not be sampled uniformly. Second, choosing a good time scale for the diffusion process is difficult and largely overlooked in classical diffusion maps. A small value of can lead to insufficient denoising and an overemphasis on local structure. In contrast, a large value of can lead to oversmoothing and an overemphasis on global structure. Third, the eigendecomposition in diffusion maps tends to place the information learned in into different dimensions Moon et al. (2019); Haghverdi et al. (2016), which is not amenable for visualization.

Since the random forest proximity matrix, , is comprised of locally-adaptive affinities, it does not rely on uniformly-sampled data, overcoming this weakness in diffusion maps. The von Neumann Entropy (VNE) of the diffused operator provides a good choice of for visualization. The VNE of the diffused operator

is the Shannon entropy of the normalized eigenvalues of

. Since the entropy of a discrete random variable is maximized with a uniform distribution, the VNE is a soft proxy for the number of significant eigenvalues of

. The more significant eigenvalues there are, the closer the normalized eigenvalues are to a uniform distribution, and the higher the VNE. Since is a probability transition matrix with a stationary distribution, the VNE converges to zero as . The rate of decay of the VNE as increases is used to select . Typically, is chosen to be around the transition from rapid to slow decay in the VNE as this is considered to be point in the diffusion process where noise has been eliminated and oversmoothing begins Moon et al. (2019). We show empirically (see appendix E) that diffusion over proximities is not sensitive to the choice of around the VNE-selected value for local variable preservation.

To counter the third weakness of diffusion maps, we apply an information distance to the powered diffusion operator to create the potential distance Moon et al. (2019). The potential distance is calculated by log- or square-root-transforming the powered diffusion operator and then calculating the Euclidean distance between rows, although other information distances can also be used Duque et al. (2019). The potential distance is sensitive to differences in both the tails and the more dense regions of the diffused probabilities, resulting in a distance which preserves both local and global relationships. These distances are then embedded into low-dimensions using metric MDS with classical MDS as an initialization. This extracts the information in low dimensions for better visualization.

5 Experimental Results

To evaluate our supervised visualization method, we apply RF-PHATE to multiple data sets and compare it to other methods described in Sections 1 and 2. Full details on the used data sets, computational environment, method implementations, and parameter tuning can be found in the appendices. We first demonstrate qualitatively how RF-PHATE can be used in exploratory data analysis to visualize variable importance with respect to the supervised task on both low- and high-dimensional data sets. We then quantitatively establish this capability compared to other supervised dimensionality reduction methods. Finally, we show that RF-PHATE is able to accurately visualize data with noisy dimensions and outperforms both supervised and unsupervised dimensionality reduction methods.

5.1 Visualizing Variable Importance

Figure 1: RF-PHATE on the Titanic data set colored by sex, class, and age. Passenger deaths are denoted by light-colored diamonds while survivors are marked by dark-colored dots in (a) and (b). Sex and class were the top two important variables. The visualization shows clear groups and trends within the data.

We demonstrate the ability of RF-PHATE to capture local variable importance in a low-dimensional representation of the popular Titanic data set Dua and Graff (2017), which has 12 variables and 891 observations. The variables Name and Ticket were removed as they are nearly unique to each observation. In addition, observations with missing values were removed. The proximities were created using the randomForest Liaw and Wiener (2002) package in R R Core Team (2019), with the variable Survived as the response. The forests’ classification accuracy was 81.46% using 5000 trees and otherwise default parameters. Variable importance was computed using the forest’s built-in importance measure (mean decrease in accuracy). To compute the importance for the variable using a random forest, the error rate (for classification) or MSE (for regression) is computed using the out-of-bag (oob) observations. The values of for the oob data are randomly permuted and the error rate (or MSE) is computed using the permuted data. The importance is the difference between the oob error rate and the permuted oob error rate Cutler et al. (2012). For the Titanic data set, the top two important variables were Sex and Class. Figure 1 shows the results of applying RF-PHATE to this data. Instead of creating distinct clusters based solely on the response, RF-PHATE arranges points and clusters by variable importance. The right-most cluster includes mostly survivors and contains female first- and second-class passengers. Most passengers in the left-most cluster did not survive. This cluster contains male, second- and third-class passengers. The small cluster in the center is entirely composed of second-class, male children who all survived. At least some of the variation within clusters appears to be driven by age. This demonstrates how RF-PHATE can be used in data exploration by visually identifying groupings and trends within the variables important for the supervised task.

5.2 Visualization of Higher-Dimensional Data

Figure 2: (a) RF-PHATE on the Optical Digits data set Xu et al. (1992); Dua and Graff (2017). Some numbers (0, 6) are easily separated, while others are more difficult to classify, such as 1 and 9. The color-bars in (a)ii and (a)iii indicate the count of pixels made up by non-overlapping blocks. Numbers 0, 3, and 6 have higher counts in the pixels which make up the block (the top variable) as can be seen in (a)ii. (b) RF-PHATE on the Ames Housing data set Cock (2011). By comparing (b)i, (b)ii, and (b)iii it is easy to see the relationship between higher sale prices with overall house quality and ground-floor living area.

To demonstrate the application of RF-PHATE to high dimensional data we apply it to the Optical Recognition of Handwritten Digits data from Xu et al. (1992); Dua and Graff (2017) and Ames, Iowa, house pricing data from Cock (2011), with classification and regression supervision, respectively. The Optical Digits data contains 5620 observations of handwritten numerical digits recorded in bit images. The images are divided into blocks resulting in 64 dimensional arrays Xu et al. (1992); Dua and Graff (2017). The random forest accuracy for the data set was 82.7%. The variables in this data set are not as straightforward to analyze as they consist of individual pixels in the images, but we can see in Figure 2(a) that high values in the top variable (block 43) are important in classifying 0’s and 6’s, while high values in the second variable provide further separation between 4’s, 7’s, and 9’s from the 3’s, 5’s, and 6’s. On the Ames, Iowa house pricing data set Cock (2011), we adapted the random forest in RF-PHATE to perform regression on house prices, resulting in overall quality and ground living area (square footage) being designated as the most important variables, as emphasized by the RF-PHATE visualization in Figure 2(b).

5.3 Variable Importance Preservation

We now numerically assess the quality of the embeddings provided by RF-PHATE. To be useful in supervised exploratory data analysis, the embedding should preserve local and global structure in the most important variables for the supervised task. To numerically assess this structure preservation, we applied a -nearest neighbor classifier or regressor with the embedding dimensions as inputs (i.e. features) and the important variables as the response output. The key idea is that if the embedding preserves the local and global structure in the important variables, then the embedding dimensions should be sufficient to predict the important variables.

For classification, the mean prediction error, and for regression, the root mean-squared error was used as the criterion, applying 10-fold cross-validation. We compared RF-PHATE with six supervised methods: Enhanced Supervised Isomap (ESIso) Ribeiro et al. (2008), Kernalized Supervised PCA (KSPCA) Yu et al. (2006), SPCA Yu et al. (2006), Enhanced Supervised LLE (ESLLE) qing Zhang (2009), SNMF Jia et al. (2019), and Partial Least Squares Discriminant Analysis (PLSDA) Wold et al. (1984).

Titanic Sex (2) 0.0000 0.0028 0.0084 0.0056 0.0070 0.1419 0.0028
Titanic Sex (3) 0.0000 0.0028 0.0084 0.0028 0.0042 0.1489 0.0042
Titanic Class (2) 0.0154 0.1657 0.1025 0.0688 0.0702 0.3174 0.0197
Titanic Class (3) 0.0112 0.0492 0.1011 0.0281 0.0969 0.3048 0.0239
Optical Top Var (2) 3.612 3.554 5.478 5.237 5.910 6.659 5.842
Optical Top Var (3) 3.473 3.484 4.013 4.922 5.887 6.659 5.407
Optical Second Var (2) 3.425 3.869 4.101 4.862 5.441 6.540 5.730
Optical Second Var (3) 3.361 3.851 3.902 4.007 5.194 6.540 5.060
Optical Third Var (2) 4.147 3.997 5.107 4.416 5.839 7.147 6.217
Optical Third Var (3) 3.766 3.810 4.608 4.039 5.810 7.147 5.968
Optical Fourth Var (2) 4.108 4.118 5.373 5.687 5.965 6.161 5.985
Optical Fourth Var (3) 4.001 4.037 4.035 4.949 5.867 6.161 5.320
Ames Quality (2) 0.644 N/A 0.815 0.764 N/A N/A 0.767
Ames Quality (3) 0.631 N/A 0.788 0.759 N/A N/A 0.736
Ames GrLvArea (2) 263.75 N/A 402.08 413.36 N/A N/A 415.90
Ames GrLvArea (3) 224.57 N/A 393.64 411.53 N/A N/A 293.18
Table 1: Numerical comparison of RF-PHATE with other supervised methods. The top two to four variables for classification (Variable Titanic and Optical Digits) or regression (Ames Housing

) were used as the response, using the low-dimensional embeddings (2 or 3) from 6 supervised dimensionality reduction algorithms. For categorical variables (Sex, Class) the classification error was used; for continuous variables (all of the remaining), the root mean-squared error (RMSE) was used. Therefore, lower is better. ESIso, ESLLE, and SNMF have not been adapted for regression, hence the missing values in the Ames dataset (N/A). RF-PHATE is either first or second at each task.

Table 1 shows the results for the Titanic, Optical Digits, and Ames Housing data sets, respectively. For the Titanic data set, RF-PHATE performs the best out of all methods when predicting either sex or class (the two most important variables) using either a 2 or 3 dimensional embedding. For the Optical Digits, RF-PHATE is either the best or second-best method at preserving the relevant structure when using 2 or 3 dimensions. ESIso is generally in second and is sometimes significantly below RF-PHATE in terms of quality. We performed similar comparisons on other data sets. We also note that RF-PHATE continues to do well when embedding with higher dimensions. See appendices A and B.

5.4 Supervised Visualization in the Presence of Noisy Variables

We now consider the problem where only a few variables are relevant for the supervised task and all other variables contain noise. This is a case where unsupervised visualization methods are very likely to fail, demonstrating the utility of supervised visualization methods. To simulate this, we took the Iris data set Anderson (1936); Dua and Graff (2017) and added 1000 dimensions generated from a Gaussian random variable with means uniformly generated between -1 and 1 and with a variance of 1. The simulation was repeated 10 times. This also creates a case with few data points as the Iris data set has only 150 samples. The visualizations are shown in Figure 3. We note that the unsupervised methods (PCA, Isomap, and t-SNE) perform poorly in this setting as their visualizations consist of a single cloud of points without any distinguishable features.

In contrast, the supervised visualization methods are able to reveal some structure by using the class labels. RF-PHATE (Figure 3(b)) separates the setosa class from the other two classes while the versicolor and virginica classes are shown with some overlap. Furthermore, the setosa class is shown to be approximately the same distance from each of the other two classes. This overall structure is consistent with the properties of the sepal measurements in the Iris data set as seen in figure 3(a) in appendix A.

Figure 3: Iris data set Anderson (1936); Dua and Graff (2017) with additional 1000 Gaussian noise variables using various dimensionality reduction techniques. The ground truth (without additional noise) is shown in (a). RF-PHATE (b) effectively denoises the data and produces a comparable visualization. It is seen that unsupervised methods (c)[(i), (ii), and (iii)] do not perform in this scenario. Other supervised methods (c)[(iv), (v), and (vi)] separate the data into classes but lose some of the general data structure.

PLSDA and ES-Isomap (Figures 3(c)iv and vi, respectively) separate the classes well. However, these methods do not show the overlap between the versicolor and virginica species. Also, the ES-Isomap visualization shows the data points for each class collapsed onto a single point, completely eliminating any structure that may be present within the classes. This suggests that these methods do not preserve global relationships between classes well.

Supervised PCA (Figure 3(c)v) separates the setosa class from the other two classes while the versicolor and virginica species share some overlap. The setosa species is shown in the visualization to be much closer to the versicolor class than the virginica species. While this is not consistent with the global structure of the sepal measurements (Figure 3(a)), it is consistent with the structure of the petal measurements (see figure 4). However, SPCA does not show a large separation between the setosa and versicolor classes, suggesting that SPCA may not always accurately reflect the magnitude of class separations.

These results are corroborated numerically in Table 2. Here we performed a similar experiment as in Section 5.3 where we performed regression on the embeddings to predict each of the 4 variables in the data set. Again, RF-PHATE outperforms the other methods in all settings. While this is a simple data set, its simplicity allows us to evaluate qualitatively and quantitatively the performance of the various methods, and thus reflects a generalizable trend that will be studied further in future work. See the appendix B for experiments with higher dimensions.

petal len. (2) 0.330 0.01 0.348 0.01 0.340 0.01 0.381 0.01 0.349 0.01 1.475 0.03 0.387 0.01
petal len. (3) 0.334 0.01 0.353 0.01 0.360 0.01 0.400 0.01 0.353 0.01 2.233 0.04 0.363 0.01
petal wid. (2) 0.291 0.01 0.298 0.01 0.297 0.01 0.330 0.01 0.304 0.01 1.297 0.02 0.330 0.01
petal wid. (3) 0.292 0.01 0.305 0.01 0.312 0.01 0.345 0.01 0.308 0.01 1.905 0.03 0.308 0.01
sepal len. (2) 0.459 0.01 0.522 0.02 0.514 0.02 0.514 0.02 0.528 0.02 0.885 0.02 0.514 0.02
sepal len. (3) 0.455 0.01 0.536 0.02 0.540 0.02 0.520 0.02 0.534 0.02 1.272 0.02 0.516 0.02
sepal wid. (2) 0.320 0.01 0.348 0.01 0.351 0.01 0.357 0.01 0.349 0.01 0.462 0.01 0.357 0.01
sepal wid. (3) 0.318 0.01 0.354 0.01 0.358 0.01 0.356 0.01 0.354 0.01 0.501 0.02 0.357 0.01
Table 2: Variable regression on Iris

with 1000 additional random noise variables. The low-dimensional embeddings of seven supervised dimensionality reduction methods were used as features to regress on the original iris data set variables. The noise variables were simulated ten times and the results were averaged. The standard deviation is also reported. RF-PHATE outperforms all other methods.

6 Conclusion

Large, high-dimensional data sets are commonplace and can be difficult to analyze and interpret. Supervised dimensionality reduction algorithms are used for preprocessing and visualizing high-dimensional data sets when labels are present. Current methods for supervised visualization either over-emphasize class separation or ignore the data’s intrinsic geometric structure altogether. In contrast, RF-PHATE incorporates labels directly in its kernel matrix using random forests’ out-of-bag samples, thereby robustly learning local relationships without overstressing class separation. The global structure is discovered by means of diffusion and then visualized by embedding an information distance. RF-PHATE outperforms other supervised dimensionality reduction algorithms, is noise-resilient, robust to parameter tuning, and handles both continuous and discrete labels.

Broader Impact

Data exploration plays a role in virtually all fields of science and technology nowadays. In this era of big data, there is an ever-increasing need for means to make sense of the information overload we face daily. In instances where labeled data is present, RF-PHATE has the potential to visually unlock patterns and signals in excessively noisy data. This holds the potential to lead to new discoveries in social, medical, natural, and business settings. RF-PHATE’s implementation allows enables identification of trends or clusters based on important variables. These clusters are inherently meaningful and not subject to arbitrary subjective interpretation that is often encountered in an unsupervised setting. This, in turn, may lead to the discovery of meaningful relationships not only between the variables and classes, but also between the variables themselves. We do note that like many other data processing tools, data visualization is not immune to misuse, and may be used as a means of deceit when not presented properly. However, this work is computational in nature and addresses fundamental development in data science, agnostic to specific application. As such, by itself, it is not expected to raise ethical concerns nor to have adverse effects on society.


  • Pearson [1901] Karl Pearson. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.
  • Hotelling [1933] Harold Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 1933.
  • Lee and Seung [1999] Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, Oct 1999. ISSN 1476-4687.
  • Kruskal and Wish [1978] J.B. Kruskal and M. Wish. Multidimensional Scaling. Sage Publications, 1978.
  • Roweis and Saul [2000] Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000. ISSN 0036-8075.
  • Tenenbaum et al. [2000] Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000. ISSN 0036-8075.
  • Coifman and Lafon [2006] Ronald R. Coifman and Stéphane Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5 – 30, 2006. ISSN 1063-5203. Special Issue: Diffusion Maps and Wavelets.
  • Nadler et al. [2006] Boaz Nadler, Stephane Lafon, Ioannis Kevrekidis, and Ronald R Coifman.

    Diffusion maps, spectral clustering and eigenfunctions of fokker-planck operators.

    In Advances in neural information processing systems, pages 955–962, 2006.
  • Hinton and Zemel [1994] Geoffrey E Hinton and Richard S Zemel. Autoencoders, minimum description length and helmholtz free energy. In Advances in neural information processing systems, pages 3–10, 1994.
  • Maaten and Hinton [2008] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE.

    Journal of machine learning research

    , 9(Nov):2579–2605, 2008.
  • McInnes et al. [2018] Leland McInnes, John Healy, and James Melville. UMAP: Uniform manifold approximation and projection for dimension reduction, 2018.
  • Moon et al. [2019] Kevin R. Moon, David van Dijk, Zheng Wang, Scott Gigante, Daniel B. Burkhardt, William S. Chen, Kristina Yim, Antonia van den Elzen, Matthew J. Hirn, Ronald R. Coifman, Natalia B. Ivanova, Guy Wolf, and Smita Krishnaswamy. Visualizing structure and transitions in high-dimensional biological data. Nature Biotechnology, 37(12):1482–1492, Dec 2019. ISSN 1546-1696.
  • Amir et al. [2013] El-ad David Amir, Kara L Davis, Michelle D Tadmor, Erin F Simonds, Jacob H Levine, Sean C Bendall, Daniel K Shenfeld, Smita Krishnaswamy, Garry P Nolan, and Dana Pe’er. visne enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia. Nature Biotechnology, 31(6):545, 2013.
  • Macosko et al. [2015] Evan Z Macosko, Anindita Basu, Rahul Satija, James Nemesh, Karthik Shekhar, Melissa Goldman, Itay Tirosh, Allison R Bialas, Nolan Kamitaki, Emily M Martersteck, et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell, 161(5):1202–1214, 2015.
  • Gopalakrishnan et al. [2018] Vancheswaran Gopalakrishnan, Christine N Spencer, Luigi Nezi, Alexandre Reuben, MC Andrews, TV Karpinets, PA Prieto, D Vicente, Karen Hoffman, SC Wei, et al. Gut microbiome modulates response to anti–pd-1 immunotherapy in melanoma patients. Science, 359(6371):97–103, 2018.
  • Becht et al. [2019] Etienne Becht, Leland McInnes, John Healy, Charles-Antoine Dutertre, Immanuel WH Kwok, Lai Guan Ng, Florent Ginhoux, and Evan W Newell. Dimensionality reduction for visualizing single-cell data using umap. Nature Biotechnology, 37(1):38, 2019.
  • Zhao et al. [2020] Yujiao Zhao, Matthew Amodio, Brent Vander Wyk, Bram Gerritsen, Mahesh M Kumar, David van Dijk, Kevin Moon, Xiaomei Wang, Anna Malawista, Monique M Richards, Megan Cahill, Anita Desai, Jayasree Sivadasan, Manjunatha Venkataswamy, Vasanthapuram Ravi, Erol Fikrig, Priti Kumar, Steven Kleinstein, Smita Krishnaswamy, and Ruth Montgomery. Single cell immune profiling of dengue virus patients reveals intact immune responses to Zika virus with enrichment of innate immune signatures. PLoS Neglected Tropical Diseases, 14(3):e0008112, 2020.
  • Karthaus et al. [2020] Wouter R Karthaus, Matan Hofree, Danielle Choi, Eliot L Linton, Mesruh Turkekul, Alborz Bejnood, Brett Carver, Anuradha Gopalan, Wassim Abida, Vincent Laudone, Moshe Biton, Ojasvi Chaudhary, Tianhao Xu, Ignas Masilionis, Katia Manova, Linas Mazutis, Dana Pe’er, Aviv Regev, and Charles Sawyers. Regenerative potential of prostate luminal cells revealed by single-cell analysis. Science, 368(6490):497–505, 2020.
  • Baccin et al. [2019] Chiara Baccin, Jude Al-Sabah, Lars Velten, Patrick M Helbling, Florian Grünschläger, Pablo Hernández-Malmierca, César Nombela-Arrieta, Lars M Steinmetz, Andreas Trumpp, and Simon Haas. Combined single-cell and spatial transcriptomics reveal the molecular, cellular and spatial bone marrow niche organization. Nature Cell Biology, pages 1–11, 2019.
  • Chen and Salman [2011] Ke Chen and Ahmad Salman. Extracting speaker-specific information with a regularized siamese deep network. In Advances in Neural Information Processing Systems, pages 298–306, 2011.
  • Duque et al. [2019] Andrés F. Duque, Guy Wolf, and Kevin R. Moon. Visualizing high dimensional dynamical processes. In IEEE International Workshop on Machine Learning for Signal Processing, 2019.
  • Horoi et al. [2020] Stefan Horoi, Victor Geadah, Guy Wolf, and Guillaume Lajoie.

    Low-dimensional dynamics of encoding and learning in recurrent neural networks.


    Canadian Conference on Artificial Intelligence

    , pages 276–282. Springer, 2020.
  • Gigante et al. [2019] Scott Gigante, Adam S Charles, Smita Krishnaswamy, and Gal Mishne. Visualizing the phate of neural networks. In Advances in Neural Information Processing Systems, pages 1840–1851, 2019.
  • Bair et al. [2006] Eric Bair, Trevor Hastie, Debashis Paul, and Robert Tibshirani. Prediction by supervised principal components. Journal of the American Statistical Association, 101(473):119–137, 2006.
  • Yu et al. [2006] Shipeng Yu, Kai Yu, Volker Tresp, Hans-Peter Kriegel, and Mingrui Wu. Supervised probabilistic principal component analysis. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06, page 464–473, New York, NY, USA, 2006. Association for Computing Machinery. ISBN 1595933395.
  • qing Zhang [2009] Shi qing Zhang. Enhanced supervised locally linear embedding. Pattern Recognition Letters, 30(13):1208 – 1218, 2009. ISSN 0167-8655.
  • Li et al. [2018] Z. Li, J. Tang, and X. He. Robust structured nonnegative matrix factorization for image representation. IEEE Transactions on Neural Networks and Learning Systems, 29(5):1947–1960, 2018.
  • Zhang et al. [2016] X. Zhang, L. Zong, X. Liu, and J. Luo. Constrained clustering with nonnegative matrix factorization. IEEE Transactions on Neural Networks and Learning Systems, 27(7):1514–1526, 2016.
  • Jia et al. [2019] Y. Jia, S. Kwong, J. Hou, and W. Wu. Semi-supervised non-negative matrix factorization with dissimilarity and similarity regularization. IEEE Transactions on Neural Networks and Learning Systems, pages 1–12, 2019.
  • Vlachos et al. [2002] Michail Vlachos, Carlotta Domeniconi, Dimitrios Gunopulos, George Kollios, and Nick Koudas. Non-linear dimensionality reduction techniques for classification and visualization. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’02, page 645–651, New York, NY, USA, 2002. Association for Computing Machinery. ISBN 158113567X.
  • Ribeiro et al. [2008] Bernardete Ribeiro, Armando Vieira, and João Carvalho das Neves. Supervised isomap with dissimilarity measures in embedding learning. In José Ruiz-Shulcloper and Walter G. Kropatsch, editors, Progress in Pattern Recognition, Image Analysis and Applications, pages 389–396, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg. ISBN 978-3-540-85920-8.
  • Tapson and Greene [2005] J. Tapson and J.R. Greene. Plant data visualization using non-negative matrix factorization. IFAC Proceedings Volumes, 38(1):73 – 78, 2005. ISSN 1474-6670. 16th IFAC World Congress.
  • Babaee et al. [2016] Mohammadreza Babaee, Stefanos Tsoukalas, Gerhard Rigoll, and Mihai Datcu. Immersive visualization of visual data using nonnegative matrix factorization. Neurocomputing, 173:245 – 255, 2016. ISSN 0925-2312.
  • Barshan et al. [2011] Elnaz Barshan, Ali Ghodsi, Zohreh Azimifar, and Mansoor Zolghadri Jahromi. Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds. Pattern Recogn., 44(7):1357–1371, July 2011. ISSN 0031-3203.
  • Breiman [2001] Leo Breiman. Random forests. Mach. Learn., 45(1):5–32, October 2001. ISSN 0885-6125.
  • Chao et al. [2019] Guoqing Chao, Yuan Luo, and Weiping Ding. Recent advances in supervised dimension reduction: A survey. Machine Learning and Knowledge Extraction, 1(1):341–358, Jan 2019. ISSN 2504-4990.
  • Tsuge et al. [2001] S. Tsuge, M. Shishibori, S. Kuroiwa, and K. Kita. Dimensionality reduction using non-negative matrix factorization for information retrieval. In 2001 IEEE International Conference on Systems, Man and Cybernetics. e-Systems and e-Man for Cybernetics in Cyberspace (Cat.No.01CH37236), volume 2, pages 960–965 vol.2, 2001.
  • Haifeng Liu et al. [2012] Haifeng Liu, Zhaohui Wu, Xuelong Li, Deng Cai, and T. S. Huang. Constrained nonnegative matrix factorization for image representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7):1299–1311, 2012.
  • Liu et al. [2008] Weixiang Liu, Kehong Yuan, and Datian Ye.

    Reducing microarray data via nonnegative matrix factorization for visualization and clustering analysis.

    Journal of Biomedical Informatics, 41(4):602 – 606, 2008. ISSN 1532-0464.
  • Cai et al. [2008] D. Cai, X. He, X. Wu, and J. Han. Non-negative matrix factorization on manifold. In 2008 Eighth IEEE International Conference on Data Mining, pages 63–72, 2008.
  • van der Maaten and Hinton [2008] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
  • Belkin and Niyogi [2003] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
  • de Ridder et al. [2003] Dick de Ridder, Olga Kouropteva, Oleg Okun, Matti Pietikäinen, and Robert P. W. Duin. Supervised locally linear embedding. In Okyay Kaynak, Ethem Alpaydin, Erkki Oja, and Lei Xu, editors, Artificial Neural Networks and Neural Information Processing — ICANN/ICONIP 2003, pages 333–341, Berlin, Heidelberg, 2003. Springer Berlin Heidelberg. ISBN 978-3-540-44989-8.
  • Polito and Perona [2002] Marzia Polito and Pietro Perona. Grouping and dimensionality reduction by locally linear embedding. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 1255–1262. MIT Press, 2002.
  • Friedman et al. [1977] Jerome H. Friedman, Jon Louis Bentley, and Raphael Ari Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Trans. Math. Softw., 3(3):209–226, September 1977. ISSN 0098-3500.
  • Cutler et al. [2012] Adele Cutler, D. Richard Cutler, and John R. Stevens. Random Forests, pages 157–175. Springer US, Boston, MA, 2012. ISBN 978-1-4419-9326-7.
  • Breiman [1996] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, Aug 1996. ISSN 1573-0565.
  • Segal and Xiao [2011] Mark Segal and Yuanyuan Xiao. Multivariate random forests. WIREs Data Mining and Knowledge Discovery, 1(1):80–87, 2011.
  • Breiman and Cutler [(Accessed on 04/15/2020] Leo Breiman and Adele Cutler. Random forests for scientific discovery., (Accessed on 04/15/2020).
  • Lin and Jeon [2006] Yi Lin and Yongho Jeon. Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101(474):578–590, 2006.
  • Golino and Gomes [2014] Hudson F. Golino and Cristiano Mauro Assis Gomes. Visualizing Random Forest’s Prediction Results. Psychology, 5:2084–2098, 2014.
  • Fruchterman and Reingold [1991] Thomas M. J. Fruchterman and Edward M. Reingold. Graph drawing by force-directed placement. Software: Practice and Experience, 21(11):1129–1164, 1991.
  • Gajer et al. [2004] Pawel Gajer, Michael T. Goodrich, and Stephen G. Kobourov. A multi-dimensional approach to force-directed layouts of large graphs. Computational Geometry, 29(1):3 – 18, 2004. ISSN 0925-7721. Special Issue on the 10th Fall Workshop on Computational Geometry, SUNY at Stony Brook.
  • Duque et al. [2019] Andrés F Duque, Guy Wolf, and Kevin R Moon. Visualizing high dimensional dynamical processes. In 2019 IEEE 29th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6. IEEE, 2019.
  • Haghverdi et al. [2016] Laleh Haghverdi, Maren Buettner, F Alexander Wolf, Florian Buettner, and Fabian J Theis. Diffusion pseudotime robustly reconstructs lineage branching. Nature Methods, 13(10):845, 2016.
  • Dua and Graff [2017] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
  • Liaw and Wiener [2002] Andy Liaw and Matthew Wiener. Classification and regression by randomforest. R News, 2(3):18–22, 2002.
  • R Core Team [2019] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2019.
  • Xu et al. [1992] Lei Xu, A. Krzyzak, and Ching Suen. Methods of combining multiple classifiers and their applications to handwriting recognition. IEEE Transactions on Systems Man and Cybernetics Part B (Cybernetics), 22:418 – 435, 06 1992.
  • Cock [2011] Dean De Cock. Ames, iowa: Alternative to the boston housing data as an end of semester regression project. Journal of Statistics Education, 19(3):null, 2011.
  • Wold et al. [1984] S. Wold, A. Ruhe, H. Wold, and W. J. Dunn, III.

    The collinearity problem in linear regression. the partial least squares (pls) approach to generalized inverses.

    SIAM Journal on Scientific and Statistical Computing, 5(3):735–743, 1984.
  • Anderson [1936] Edgar Anderson. The species problem in iris. Annals of the Missouri Botanical Garden, 23(3):457–509, 1936. ISSN 00266493.
  • Gorman and Sejnowski [1988] R.Paul Gorman and Terrence J. Sejnowski. Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks, 1(1):75 – 89, 1988. ISSN 0893-6080.
  • Kuhn [2020] Max Kuhn. caret: Classification and Regression Training, 2020. R package version 6.0-86.
  • Moon et al. [2017] K. R. Moon, K. Sricharan, and A. O. Hero. Ensemble estimation of mutual information. In 2017 IEEE International Symposium on Information Theory (ISIT), pages 3030–3034, 2017.
  • Tibshirani [1994] Robert Tibshirani. Regression shrinkage and selection via the lasso. JOURNAL OF THE ROYAL STATISTICAL SOCIETY, SERIES B, 58:267–288, 1994.
  • Díaz-Uriarte and Alvarez de Andrés [2006] Ramón Díaz-Uriarte and Sara Alvarez de Andrés. Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7(1):3, Jan 2006. ISSN 1471-2105.


In these appendices, we provide further details about RF-PHATE. In Section A, we provide details about each of the data sets used in the experiments. We then provide details on the metric we use to quantify our results and show extended results in Section B. In Section C, we present alternate measures of variable importance which are consistent with the random forest measure of variable importance. We compare RF-PHATE to unsupervised visualization methods in Section D. We then perform an ablation study in Section E, demonstrating that RF-PHATE is robust to the choice of parameters. Finally, we demonstrate that other approaches to embedding the random forest proximities result in inferior visualizations in Section F.

Appendix A Description of Data Sets

Descriptions of each the data sets used in the experiments and their preprocessing steps are given here. In addition to steps listed particular to each data set, all continuous variables in each data set were standardized to have a mean of zero and a variance of one.

The Optical Digits Xu et al. [1992], Dua and Graff [2017] data contains 5620 observations of handwritten numerical digits (0 - 9) recorded in bit images. The images were collect from 43 individuals and the pixels are summed into blocks resulting in 64 dimensional arrays. Each variable takes on integer values from 0 to 16 which denote the number of original pixels with writing included in that particular block. There are no missing values.

The Ames Housing Cock [2011]

data set consists of 1460 house listings with 80 variable with a mixture of nominal, ordinal, discrete, and continuous values. The response variable is the sales price. Missing values were manually imputed (N/A values were replaced with “none” for missing categorical variables, or 0 for continuous variables, where applicable). One-hot encoding was applied to categorical variables for compatibility with the compared dimensionality reduction techniques.

The Titanic data set has 12 variables and 891 observations. The variables Ticket and Name are unique to each observations and were thus removed. In addition, observations with missing values were also removed, resulting in a total of 712 observations used in the comparisons. One-hot encoding was applied to the categorical variables.

Iris Anderson [1936], Dua and Graff [2017] contains 150 observations of three iris species: virginica, versicolor, and setosa. There are 50 observations from each species. Each observation has four measurements: sepal length, sepal width, petal length, and petal width. The data set has no missing values. See a pair-wise variable plot in figure 4.

Figure 4: Pairwise variable plot of Iris

In addition to the previous data sets used in the main paper, we include results on other data sets. This includes the Sonar Gorman and Sejnowski [1988] data set, which consists of 208 instances of sonar signals reflected off of rocks or metal cylinders positioned at various angles. This results in 60 variables (60 sonar bands). Classification is done on the material type (rock or metal). The data set contains no missing values. (See figure 5).

Figure 5: RF-PHATE embeddings of the Sonar data set showing (a) class labels (M for metal, R for rock), (b) the top important variable (Band 11), and (c) the second important variable (Band 12). More densely clustered metal points on the left (blue) and rock points on the right (red) give an indication of random forest confidence in the predictions. Sparsely scattered points in the center are more difficult to classify.

Additionally, we show results for the German Credit data set Dua and Graff [2017], consisting of 1000 instances of 20 categorical and numeric data sets. An alternative version (that we use) contains 24 numeric variables representative of the original data. The variables (attributes) are proprietary and therefore variable names are not included. The data set has two classes, bad credit (0) or good credit (1). RF-PHATE demonstrates variables which may be used to clearly distinguish customers with good credit from those with bad. See figure 6.

Figure 6: RF-PHATE embeddings of the German data set. In (a) it is clear that a result of 1 in attribute 23 guarantees bad credit. In (b) group 1 of attribute 6 further separates the majority of the remaining bad credit customers from those with good credit. Attribute values are denoted by different colors.

Appendix B Low-Dimensional Variable Regression

To quantify the degree in which the low-dimensional embeddings captured important variables, we used the embeddings as feature matrices for a -NN classifier/regressor on the original variables (see Tables 1 and 2). The value of used in each case was the square-root of the number of observations in the data set. For discrete variables, the average error rate was computed using 10-fold cross validation. For continuous variables, the average root mean squared error (RMSE) was the criterion, also using 10-fold cross validation. We compared results using 7 supervised dimensionality reduction techniques, RF-PHATE, Enhanced Supervised Isomap Ribeiro et al. [2008], Kernalized Supervised PCA Yu et al. [2006], Supervised PCA Yu et al. [2006], Enhanced Supervised LLE qing Zhang [2009], Supervised NMF Jia et al. [2019], and Partial Least Squares Discriminant Analysis (PLSDA) Wold et al. [1984].

Here we show extended results from those in the paper by including other datasets and by assessing the embeddings with dimensions of 4 and 5. The results are contained in Tables 3 to 7. Here we see that RF-PHATE again outperforms the other methods, as it is mostly first and either second or third for all comparisons. No other method is as consistent in its performance across all datasets.

Quality (2) 0.645 N/A 0.815 0.765 N/A N/A 0.767
Quality (3) 0.631 N/A 0.789 0.760 N/A N/A 0.736
Quality (4) 0.619 N/A 0.785 0.747 N/A N/A 0.723
Quality (5) 0.619 N/A 0.783 0.743 N/A N/A 0.728
GrLivArea (2) 263.751 N/A 402.084 413.366 N/A N/A 415.903
GrLivArea (3) 224.579 N/A 393.641 411.534 N/A N/A 293.184
GrLivArea (4) 226.168 N/A 390.608 385.940 N/A N/A 293.578
GrLivArea (5) 226.585 N/A 392.209 385.254 N/A N/A 297.176
Ext. Quality (2) 0.114 N/A 0.132 0.134 N/A N/A 0.125
Ext. Quality (3) 0.114 N/A 0.136 0.127 N/A N/A 0.125
Ext. Quality (4) 0.115 N/A 0.124 0.128 N/A N/A 0.112
Ext. Quality (5) 0.112 N/A 0.123 0.129 N/A N/A 0.112
Table 3: Variable regression on Ames Housing data set. ESIso, ESLLE, and SNMF have not been adapted for continuous labels, hence the missing values (N/A).
Sex (2) 0.0000 0.0028 0.0084 0.0056 0.0070 0.1419 0.0028
Sex (3) 0.0000 0.0028 0.0084 0.0028 0.0042 0.1489 0.0042
Sex (4) 0.0000 0.0070 0.0056 0.0028 0.0028 0.1320 0.0042
Sex (5) 0.0000 0.0056 0.0056 0.0028 0.0028 0.1629 0.0028
Class (2) 0.0154 0.1657 0.1025 0.0688 0.0702 0.3174 0.0197
Class (3) 0.0112 0.0492 0.1011 0.0281 0.0969 0.3048 0.0239
Class (4) 0.0169 0.0421 0.0421 0.0253 0.0702 0.3666 0.0211
Class (5) 0.0126 0.0463 0.0197 0.0253 0.0337 0.2893 0.0197
Table 4: Variable regression on Titanic. Sex and Class are both categorical; the average 10-fold cross-validation error is reported.
V43 (2) 3.612 3.554 5.478 5.237 5.910 6.659 5.842
V43 (3) 3.473 3.484 4.013 4.922 5.887 6.659 5.407
V43 (4) 3.221 3.365 3.266 4.153 5.729 6.659 4.704
V43 (5) 3.313 3.386 2.895 3.970 5.750 6.659 4.120
V22 (2) 3.425 3.869 4.101 4.862 5.441 6.540 5.730
V22 (3) 3.361 3.851 3.902 4.007 5.194 6.540 5.060
V22 (4) 3.381 3.765 3.258 3.750 5.049 6.540 4.832
V22 (5) 3.321 3.712 3.156 3.754 5.077 6.540 4.478
V44 (2) 4.147 3.997 5.107 4.416 5.839 7.147 6.217
V44 (3) 3.766 3.810 4.608 4.039 5.810 7.147 5.968
V44 (4) 3.549 3.735 3.082 3.681 5.501 7.147 5.183
V44 (5) 3.614 3.724 3.015 3.459 5.579 7.147 4.756
Table 5: Variable regression on Optical Digits. The variables correspond to block locations in the condensed ( images.
Band 11 (2) 0.0889 0.1224 0.0918 0.0978 0.1173 0.1158 0.0958
Band 11 (3) 0.0874 0.1217 0.0907 0.0985 0.1143 0.1209 0.0924
Band 11 (4) 0.0873 0.1179 0.0898 0.0973 0.1157 0.1242 0.0905
Band 11 (5) 0.0874 0.1185 0.0877 0.0972 0.1080 0.1232 0.0880
Band 12 (2) 0.0957 0.1327 0.1119 0.1104 0.1278 0.1195 0.1158
Band 12 (3) 0.0949 0.1309 0.1112 0.1097 0.1229 0.1248 0.1139
Band 12 (4) 0.0946 0.1304 0.1110 0.1089 0.1214 0.1308 0.1053
Band 12 (5) 0.0951 0.1316 0.1103 0.1088 0.1102 0.1356 0.0984
Table 6: Variable regression on Sonar using the top two important variables (see section C).
V23 (2) 0.000 0.061 0.027 0.025 0.198 0.170 0.025
V23 (3) 0.000 0.059 0.027 0.024 0.176 0.170 0.028
V23 (4) 0.000 0.042 0.027 0.024 0.063 0.170 0.026
V23 (5) 0.000 0.036 0.027 0.025 0.063 0.170 0.027
V6 (2) 0.551 0.621 0.596 0.666 0.684 0.657 0.680
V6 (3) 0.541 0.590 0.577 0.668 0.594 0.560 0.674
V6 (4) 0.510 0.594 0.510 0.646 0.568 0.611 0.637
V6 (5) 0.510 0.567 0.515 0.643 0.579 0.518 0.606
Table 7: Variable regression on the German data set.

Appendix C Alternative Assessments of Variable Importance

To show that our measure of variable importance using random forests is robust, we computed variable importance using several other methods: ROC curve variable importance (for discrete labels) or importance (for continuous labels) using the caret Kuhn [2020] package in R R Core Team [2019] as well as using a mutual information (MI) estimator based on Moon et al. [2017]. The results are given in table 8. The varImp function from the caret packageKuhn [2020] used repeated 10-fold cross validation with 3 repeats. The method used was LVQ (learning vector quantization) for discrete responses, and LASSO Tibshirani [1994] ( importance) for continuous responses.

We note that the results are generally consistent with each other, especially when comparing the random forest results with the ROC or measure. While the MI gives some slight differences in some of the datasets, they are still generally the same. For example, MI determines the top three variables for the Titanic dataset to be Age, Fare, and Sex, respectively. The other two methods include Sex and Fare in their top three as well, albeit in slightly different orders. Overall, the results indicate that we are selecting important variables for the experiments.

Data Set Rank RF ROC / MI
Iris 1 Petal Length Petal Length Petal Length
Iris 2 Petal Width Petal Width Petal Width
Iris 3 Sepal Length Sepal Width Sepal Length
Optical Digits 1 V43 V22 V43
Optical Digits 2 V22 V43 V22
Optical Digits 3 V44 V31 V31
Titanic 1 Sex Sex Age
Titanic 2 Class Fare Fare
Titanic 3 Fare Class Sex
Ames Housing 1 Quality Quality N/A
Ames Housing 2 GrLivArea GrLivArea N/A
Ames Housing 3 Ext. Quality TotalBsmtSF N/A
Sonar 1 Band 11 Band 11 Band 9
Sonar 2 Band 12 Band 12 Band 11
Sonar 3 Band 9 Band 10 Band 12
German 1 V23 V23 V4
German 2 V6 V10 V10
German 3 V22 V6 V23
Table 8: Variable importance ranking as assessed by random forests, ROC curve importance (or importance for continuous responses), and mutual information estimation. This mutual information estimation implementation is currently only written for discrete response variables.

Appendix D Unsupervised Results on Noisy Data

Data exploration with excessive noise variables provides an example of a use-case scenario for supervised dimensionality reduction. In this section, we include regression results for the Iris

dataset with 1000 additional noise variables. The noise variables were randomly generated from Gaussian distributions with means uniformly sampled from values between -1 and 1. Each had a variance of 1. All variables (noise and otherwise) were scaled and centered prior to performing dimensionality reduction. The low-dimensional embeddings from 7 supervised and 7 unsupervised dimensionality techniques were used as predictors while the original data variables were used as the response for

-NN regression (unweighted). We used the square root of the number of observations as the parameter. The average RMSE using 10-fold cross validation was recorded for each regression problem with the experiment repeated 10 times. The same ten data sets were used for both the supervised and unsupervised cases.

Tables 9 and 10 show the supervised and unsupervised results, respectively. First, we see that RF-PHATE universally outperforms all of the supervised and unsupervised methods, indicating that RF-PHATE is very useful when there are many noise variables. We also observe that all of the unsupervised methods perform worse than all of the supervised methods except for SNMF. Thus supervised methods for visualization are preferred when the data contain noise variables.

petal len. (2) 0.330 0.01 0.348 0.01 0.340 0.01 0.381 0.01 0.349 0.01 1.475 0.03 0.387 0.01
petal len. (3) 0.334 0.01 0.353 0.01 0.360 0.01 0.400 0.01 0.353 0.01 2.233 0.04 0.363 0.01
petal len. (4) 0.337 0.01 0.3550.01 0.3610.01 0.4120.01 0.352 0.01 2.2330.04 0.3690.01
petal wid. (2) 0.291 0.01 0.298 0.01 0.297 0.01 0.330 0.01 0.304 0.01 1.297 0.02 0.330 0.01
petal wid. (3) 0.292 0.01 0.305 0.01 0.312 0.01 0.345 0.01 0.308 0.01 1.905 0.03 0.308 0.01
petal wid. (4) 0.290 0.01 0.304 0.01 0.3140.01 0.3580.01 0.3060.01 1.9050.03 0.3210.01
sepal len. (2) 0.459 0.01 0.522 0.02 0.514 0.02 0.514 0.02 0.528 0.02 0.885 0.02 0.514 0.02
sepal len. (3) 0.455 0.01 0.536 0.02 0.540 0.02 0.520 0.02 0.534 0.02 1.272 0.02 0.516 0.02
sepal len. (4) 0.456 0.01 0.5370.02 0.5410.02 0.5220.02 0.5330.02 1.2720.02 0.517 0.02
sepal wid. (2) 0.320 0.01 0.348 0.01 0.351 0.01 0.357 0.01 0.349 0.01 0.462 0.01 0.357 0.01
sepal wid. (3) 0.318 0.01 0.354 0.01 0.358 0.01 0.356 0.01 0.354 0.01 0.501 0.02 0.357 0.01
sepal wid. (4) 0.321 0.01 0.3550.01 0.3610.01 0.3570.01 0.354 0.01 0.5010.02 0.3580.01
Table 9: Variable regression on Iris with 1000 additional random noise variables. The low-dimensional embeddings of seven supervised dimensionality reduction methods were used as features to regress on the original iris data set variables. The noise variables were simulated ten times and were averaged. The standard deviation is also reported. RF-PHATE outperforms all other methods.
petal len. (2) 1.393 0.03 1.398 0.03 1.366 0.03 1.365 0.03 1.768 0.03 1.322 0.03 1.355 0.03
petal len. (3) 1.372 0.03 1.393 0.03 1.366 0.03 1.339 0.03 1.539 0.03 1.342 0.03 1.355 0.03
petal len. (4) 1.355 0.03 1.365 0.03 1.366 0.03 1.34 0.03 1.388 0.03 1.307 0.03 1.328 0.03
petal wid. (2) 1.196 0.02 1.181 0.02 1.120 0.02 1.159 0.02 1.502 0.03 1.119 0.02 1.164 0.02
petal wid. (3) 1.170 0.02 1.183 0.02 1.120 0.02 1.157 0.02 1.300 0.02 1.113 0.02 1.156 0.02
petal wid. (4) 1.163 0.02 1.172 0.02 1.120 0.02 1.159 0.02 1.177 0.02 1.103 0.02 1.141 0.02
sepal len. (2) 0.847 0.02 0.848 0.02 0.816 0.02 0.832 0.02 0.979 0.02 0.809 0.02 0.833 0.02
sepal len. (3) 0.830 0.02 0.847 0.02 0.816 0.02 0.826 0.02 0.904 0.02 0.816 0.02 0.821 0.02
sepal len. (4) 0.822 0.02 0.840 0.02 0.816 0.02 0.828 0.02 0.844 0.02 0.801 0.02 0.813 0.02
sepal wid. (2) 0.450 0.01 0.445 0.01 0.439 0.01 0.452 0.01 0.523 0.02 0.444 0.01 0.441 0.01
sepal wid. (3) 0.446 0.01 0.451 0.01 0.439 0.01 0.440 0.01 0.463 0.01 0.440 0.01 0.449 0.01
sepal wid. (4) 0.445 0.01 0.443 0.01 0.439 0.01 0.444 0.01 0.442 0.01 0.437 0.01 0.445 0.01
Table 10: Variable regression on Iris with 1000 additional random Gaussian noise variables. The low-dimensional embeddings of seven unsupervised dimensionality reduction methods were used as features to regress on the original iris data set variables as a comparison to the supervised methods in table 9. The experiments were repeated 10 times. Compared to the supervised results, regression using the unsupervised embeddings performed much worse than all supervised cases with the exception of SNMF.

Appendix E Parameter Selection Robustness

In this section we show that RF-PHATE is robust to parameter tuning. Random forests are robust to overfitting the data; increasing the number of trees can only improve random forest performance Cutler et al. [2012] and the random forest generalization error converges almost surely to a limit Breiman [2001]. Random forests use, by default, fully-grown trees. In the context of proximity generation, decreasing the number of terminal nodes (i.e. pruning the tree) or setting a minimum node size, inflates the proximity values since the probability of two observations falling into the same terminal node increases as the number of terminal nodes decreases. This is demonstrated in Figure 7.

Figure 7: Scaled images of the proximity matrices of the Iris Anderson [1936], Dua and Graff [2017] data set setting the minimum node size (parameter nodesize) at different levels. Increasing the minimum node size inflates the proximity values, as can be seen by viewing the increased proximities in (c) as compared to those in (a).

It has been empirically shown Cutler et al. [2012], Díaz-Uriarte and Alvarez de Andrés [2006] that the only parameter to which random forests are sensitive is the number of randomly selected variables to be considered at a given terminal node (the parameter mtry). The randomForest Liaw and Wiener [2002] default for classification is , where is the number of observations. For regression, the default is . We selected values of mtry centered at the default to test RF-PHATE’s robustness to this parameter.

Diffusion-based dimensionality reduction is sensitive to the number of time steps, , considered when conducting a “random walk” across all possible transitions Moon et al. [2019], Duque et al. [2019], Gigante et al. [2019], Horoi et al. [2020]. We therefore test this parameter as well. The optimal , as selected using VNE, serves as a center point to the values of considered for parameter tuning. The results are shown in Figure 8. For each of the datasets in this figure, we ran RF-PHATE using ranges of mtry and centered about the default values. We recorded the error rate (for categorical variables) or RMSE (for continuous variables) regressing on the top variable using the 2-dimensional RF-PHATE embeddings as features. This is the same metric used in Tables 1 and 2.

Figure 8: Heatmaps of the results of regression over the top important variable using the 2-dimensional RF-PHATE embeddings (the same metrics used in Tables 1 and 2) over a range of and mtry values centered at their defaults on the (a) Sonar, (b) Titanic, (c) Optical Digits, and (d) Ames Housing datasets. The results are consistent across all values of of the parameters, demonstrating that RF-PHATE is robust to the choice of and mtry.

The results show that RF-PHATE is not sensitive to mtry or in regards to capturing important variables in a low-dimensional space with a categorical response. However, we have observed that high values of (much greater than the optimally selected ) tend to visually collapse clusters. We therefore recommend that the default choice of , or a value close to it, should be used.

Appendix F Embedding random forest proximities with other methods

Because of its abilities to accurately preserve local and global structure, we chose to adapt the PHATE algorithm to embed the random forest proximities. Here we show the results when embedding the proximities using other dimensionality reduction methods with varying success. We compared results embedding the proximities using Isomap, LLE, and MDS in addition to our proposed method. These are each denoted as RF-“method”.

RF-Isomap gives meaningful results in small, low-dimensional data sets even in the presence of noise variables, but less desirable results for higher-dimensional data sets (see Figure 9), while RF-LLE does not perform well without significant parameter tuning. For higher-dimensional data sets, including when noise variables are present, RF-MDS does not capture the geometric (global) structure of the data. RF-Isomap captures more of the global structure, but does not sufficiently denoise the data. In contrast, RF-PHATE shows the global and local structure of the data, even when many of the variables are irrelevant for the supervised task.

Figure 9: Visualizations using random forest proximities as a kernel for Isomap, MDS, and LLE on the (a) Iris with noise, (b) Optical Digits, and (c) Ames Housing data sets. For low-dimensional data sets, RF-Isomap (a)ii gives meaningful results; however, it can be seen in (a)iii (b)iii and (c)iii that MDS applied to the proximities does not accurately capture the global structure of the data on high dimensional data sets. Isomap with proximities (a)ii and (b)ii captures some of the global structure but does not reduce noise. LLE with proximities does not produce meaningful results on (a)iv or (b)iv.

Appendix G Computational Environment

All computations were done using an Intel Xeon CPU E5-1650 v2 @ 3.50 GHz with 6 cores. The random forest proximities were computing using R version 3.6.2. All other computations were done in Matlab 2019b.