1 Introduction
Large, high-dimensional data sets are now regularly obtained in nearly every field of research and business. Despite the high dimensionality of the data, it is typically assumed that the data can be accurately described by a much smaller set of latent variables. Dimensionality reduction methods aim to find these latent variables and are commonly a key step in the data analysis pipeline. For example, dimensionality reduction is often performed as a preprocessing step since many downstream methods of analysis suffer in performance and accuracy from the ‘curse of dimensionality’. Principal components analysis (PCA) Pearson (1901); Hotelling (1933) is a popular choice for preprocessing because of its simplicity and computational speed. Nonnegative matrix factorization (NMF) Lee and Seung (1999) and classical multidimensional scaling (MDS) Kruskal and Wish (1978) are also commonly used. However, these three methods assume a linear model on the data and are thus inefficient at representing nonlinear relationships in the latent space. Therefore, much work has focused on nonlinear dimensionality reduction methods, including locally linear embedding (LLE) Roweis and Saul (2000), Isomap Tenenbaum et al. (2000), diffusion maps Coifman and Lafon (2006); Nadler et al. (2006), and autoencoders Hinton and Zemel (1994).

Recent dimensionality reduction methods have focused on reducing dimensions for data visualization. Data visualization is an important aspect of exploratory data analysis, aiding humans in developing an understanding of the underlying structure of the data, which can enable hypothesis generation and data interpretation. Recent, successful visualization methods include t-SNE (t-Distributed Stochastic Neighbor Embedding) Maaten and Hinton (2008), UMAP (Uniform Manifold Approximation and Projection) McInnes et al. (2018), and PHATE (Potential of Heat-diffusion for Affinity-based Transition Embedding) Moon et al. (2019). These methods have found applications in computational biology Amir et al. (2013); Macosko et al. (2015); Gopalakrishnan et al. (2018); Becht et al. (2019); Zhao et al. (2020); Karthaus et al. (2020); Baccin et al. (2019), visualizing text Chen and Salman (2011), time series Duque et al. (2019), and the internal nodes of neural networks Horoi et al. (2020); Gigante et al. (2019).

However, all of these dimensionality reduction and visualization methods are unsupervised: class labels are unavailable or ignored. Supervised adaptations have been created for some of these methods. The majority of these Bair et al. (2006); Yu et al. (2006); qing Zhang (2009); Li et al. (2018); Zhang et al. (2016); Jia et al. (2019); Vlachos et al. (2002); Ribeiro et al. (2008) are used for classification or regression preprocessing, and a few are used for visualization Vlachos et al. (2002); Tapson and Greene (2005); Babaee et al. (2016); Barshan et al. (2011). Many of these approaches incorporate class labels directly in a loss function, which can create an exaggerated separation between classes and introduce other distortions in the visualization. Additionally, some of these adaptations are limited to classification and are not easily adapted to regression problems where the labels are not categorical. Finally, these approaches are often sensitive to parameter tuning, which may be difficult to perform without a specific measure of visualization quality.
In this paper we introduce a new approach to supervised visualization based on random forests Breiman (2001) and diffusion-based dimensionality reduction Coifman and Lafon (2006); Nadler et al. (2006). In particular, we leverage advances made by PHATE Moon et al. (2019) in extracting the information learned through a diffusion process for visualization. Hence we name our visualization technique RF-PHATE. The approach incorporates variable importance as measured by the trained random forest and gives a noise-resilient visualization of the predictor space that may be used for data exploration with known labels. We show empirically that RF-PHATE retains both the local and global structure of the data while emphasizing the important variables in the visualization. We also show that RF-PHATE is robust to noise and parameter tuning.
2 Related work on supervised dimensionality reduction
Supervised dimensionality reduction methods can generally be categorized into three major groups: PCA-based, NMF-based, and manifold-learning methods Chao et al. (2019). The original implementation of supervised PCA (SPCA) selects a subset of features that are most highly correlated with the labels. PCA is applied to these features to generate a new feature space to be used in the regression problem Bair et al. (2006). The primary motivation for this approach was to handle problems where the number of features exceeds the number of observations ($p \gg n$) Bair et al. (2006). Another variation of SPCA was introduced in Yu et al. (2006), which may be kernelized and is capable of handling missing values. SPCA was applied for data visualization in Barshan et al. (2011). However, SPCA is linear and therefore does not accurately capture nonlinear relationships in the data in low dimensions. In addition, the number of components is usually selected based on the global variance explained, which does not adequately capture local relationships Yu et al. (2006).

NMF Lee and Seung (1999) seeks to decompose a data matrix $X$ into two nonnegative matrices $W$ and $H$ by minimizing the Frobenius norm of the difference between $X$ and $WH$ while restricting the entries of $W$ and $H$ to be nonnegative. The rows of $H$ can be regarded as basis vectors while the columns of $W$ form the axes of the lower-dimensional space Tsuge et al. (2001). A number of supervised and semi-supervised modifications to NMF (SNMF or SSNMF) have been proposed, such as constrained NMF Haifeng Liu et al. (2012), structured NMF Li et al. (2018), and NMF for constrained clustering Zhang et al. (2016). Most of these approaches use the labels in a regularization term in the optimization problem. Jia et al. (2019) proposed a semi-supervised NMF with both similarity and dissimilarity regularization terms. NMF has also been used for low-dimensional visualization Liu et al. (2008); Tapson and Greene (2005); Babaee et al. (2016). However, it is still a linear method and therefore does not capture the intrinsic geometric structure of nonlinear data Cai et al. (2008). In addition, supervised versions of NMF tend to accentuate class differences, producing inflated separation between groups and overly tight clustering within groups.

Manifold-based dimensionality reduction methods assume that the high-dimensional data lie on a low-dimensional manifold. Examples of manifold-based algorithms include Isomap Tenenbaum et al. (2000), UMAP McInnes et al. (2018), diffusion maps Coifman and Lafon (2006), t-SNE van der Maaten and Hinton (2008), LLE Roweis and Saul (2000), Laplacian eigenmaps Belkin and Niyogi (2003), and PHATE Moon et al. (2019). Supervised versions of some of these methods have been proposed. WeightedIso, Iso+Ada Vlachos et al. (2002), and Enhanced-supervised Isomap (ES-Isomap) Ribeiro et al. (2008) are supervised versions of Isomap, which estimates the geodesic distances between points. These supervised variations use the class labels to modify the distance metric, accentuating closeness among same-class samples while emphasizing separation between classes. Supervised versions of LLE (SLLE) have also been proposed that modify the dissimilarity metric using the class information de Ridder et al. (2003); qing Zhang (2009). However, like LLE, SLLE is sensitive to parameters. It was also shown in Vlachos et al. (2002); Ribeiro et al. (2008); Polito and Perona (2002) that LLE and Isomap are subject to “over-clustering” unless the data consist of a single, well-clustered sample.

3 Supervised neighborhoods via random forest proximities
Manifold learning techniques are typically based on the construction of a kernel that computes some measure of pairwise local similarity between data points. For example, classical diffusion maps applies a Gaussian kernel with a fixed bandwidth to pairwise Euclidean distances Coifman and Lafon (2006), while PHATE applies an $\alpha$-decay kernel with an adaptive $k$-nearest neighbor ($k$NN) bandwidth Moon et al. (2019). Isomap similarly builds a $k$NN graph to measure similarity between points Tenenbaum et al. (2000). Trees are also commonly used to construct these kernel measures, such as the use of $k$-d trees for computing $k$NN graphs Friedman et al. (1977). Long-range relationships between points are then learned by chaining together the local similarities, for example by diffusion Coifman and Lafon (2006); Moon et al. (2019) or shortest-path algorithms Tenenbaum et al. (2000).
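To make the contrast concrete, below is a minimal sketch, assuming NumPy/SciPy, of these two kernel constructions: a fixed-bandwidth Gaussian kernel and an adaptive $k$NN-bandwidth decay kernel in the spirit of PHATE. The helper names are ours, and the exact kernel used by PHATE differs in implementation details.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def gaussian_kernel(X, eps):
    # Fixed-bandwidth Gaussian kernel on pairwise Euclidean distances.
    D = squareform(pdist(X))
    return np.exp(-(D ** 2) / eps)

def knn_decay_kernel(X, k=5, alpha=10):
    # Adaptive-bandwidth decay kernel: each point's bandwidth is its
    # distance to its k-th nearest neighbor; averaging with the
    # transpose symmetrizes the result.
    D = squareform(pdist(X))
    eps = np.sort(D, axis=1)[:, k]  # k-th neighbor distance per point
    K = np.exp(-((D / eps[:, None]) ** alpha))
    return (K + K.T) / 2
```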
To create supervised versions of manifold learning techniques, the label information can be incorporated in the kernel construction. We propose to do this using random forests. Random forests Breiman (2001) are a tree-based ensemble method commonly used for classification and regression. They are widely considered one of the best out-of-the-box supervised learning techniques, as they often achieve very good results with relatively little tuning Cutler et al. (2012). In addition to classification and regression, random forests have a number of additional uses, such as variable selection, imputation, outlier detection, and unsupervised learning Cutler et al. (2012).

Random forests were created as an extension of Leo Breiman’s bagging idea Breiman (1996), which uses bootstrap samples (sampling with replacement) to train the base learners (decision trees in this case). The out-of-sample data are said to be “out of bag” Breiman (2001) and are often used as a test set to estimate the generalization error of the model as well as to assess variable importance Cutler et al. (2012). The results of the base learners are aggregated to make a final prediction Breiman (1996). While individual decision trees tend to be sensitive to the data, random forests are very stable due to the two-part randomness of the algorithm (bootstrapping and random feature selection at each split), which ensures low correlation among the base learners Segal and Xiao (2011). Random forests are also robust to noise, as splits are unlikely to be determined by noise variables when meaningful variables are considered at a given node. Overall, as the number of trees increases, meaningful variables are more likely to be identified as important while noise variables, averaged over all trees, will show less importance.

Definition 3.1. Consider two out-of-bag observations $x_i$ and $x_j$ from the training data. The random forest proximity between these points is defined as the proportion of trees in the trained random forest for which $x_i$ and $x_j$ share the same terminal (leaf) node Breiman and Cutler (Accessed on 04/15/2020).
From Definition 3.1, it is clear that two observations that always end in the same terminal node will have a proximity of 1, while those that never fall into the same terminal node will have a proximity of 0. The proximities applied to all pairs of training points form a symmetric, positive definite kernel matrix with ones along the diagonal. The random forest proximities provide a measure of how similar observations are in the predictor space and naturally account for variable importance relative to the training task. That is, two observations may be near one another in the predictor space due to having similar values in relatively few, important variables, but may otherwise be far apart in Euclidean space Cutler et al. (2012). Thus, using the proximities as inputs to visualization methods will produce visualizations that emphasize the variables important to the supervised task while largely ignoring the irrelevant variables Lin and Jeon (2006).
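As an illustrative sketch (not the implementation used in the paper), the proximities of Definition 3.1 can be approximated with a scikit-learn forest via its `apply` method, which returns the leaf index of each observation in each tree. For simplicity, this version counts co-occurrences over all training pairs rather than restricting to out-of-bag pairs as the definition specifies.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_proximities(X, y, n_trees=500, random_state=0):
    # Proximity of (i, j) = fraction of trees in which points i and j
    # land in the same terminal (leaf) node.
    rf = RandomForestClassifier(n_estimators=n_trees,
                                random_state=random_state).fit(X, y)
    leaves = rf.apply(X)  # shape (n_samples, n_trees): leaf indices
    n = leaves.shape[0]
    prox = np.zeros((n, n))
    for t in range(leaves.shape[1]):
        col = leaves[:, t]
        prox += (col[:, None] == col[None, :])
    return prox / leaves.shape[1]
```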
We propose to use the random forest proximities as the kernel for RF-PHATE because of their desirable properties described above. Random forest proximities have been used previously for visualization by applying MDS Cutler et al. (2012) or graph-based methods Golino and Gomes (2014). However, applying MDS directly to the proximities loses much of the geometric structure of the data (see figure 9 in appendix F), while force-directed layout methods (such as the Fruchterman-Reingold algorithm Fruchterman and Reingold (1991) used in Golino and Gomes (2014)) do not scale well with large data sets Gajer et al. (2004). In contrast, our approach preserves the geometric structure (both locally and globally) due to our use of diffusion and information distances and scales well to both small and large data sets.
4 RF-PHATE
Kernels, including the random forest proximities, typically contain mostly local information about the data points used for training rather than global information. Consider a classification task with two points $x_i$ and $x_j$ that have the same class label. Suppose $x_i$ and $x_j$ are similar to each other in the variables that are important for classification. Then $x_i$ and $x_j$ will have a high proximity as measured by a trained random forest, as they are likely to end up in the same terminal node of each tree. This is true even if the points are highly dissimilar in variables that are not important for the classification. Now suppose $x_i$ and $x_j$ have the same class label but are dissimilar in one or more variables that are relevant for classification. Then $x_i$ and $x_j$ will have a lower proximity than in the first scenario, as it is less likely they will end up in the same terminal nodes. However, the proximities between sufficiently dissimilar points will all be zero, obscuring some of the long-range relationships in the data. Therefore, directly embedding the proximities leads to global distortion.
A robust way of learning global relations from local proximities is via diffusion (see Coifman and Lafon (2006); Nadler et al. (2006); Moon et al. (2019)). Because of recent successes in adapted, diffusion-based visualizations (see Moon et al. (2019); Duque et al. (2019); Gigante et al. (2019); Horoi et al. (2020)), we create a new, diffusion-based, supervised visualization tool, called RF-PHATE, which combines the locally-adaptive random forest kernel with recent advances in manifold learning; all of the steps are provided in Algorithm 1. Random forests can be viewed as a $k$-nearest neighbor ($k$NN) classifier with an adaptive metric Lin and Jeon (2006). The proximities produced by the random forest can therefore be viewed as locally-adapted affinities in the predictor space, where local variable importance is assessed using a measure of class purity at each node, making the affinities insensitive to data density. This further motivates the use of random forest proximities to encode local structure in data that is dependent on the label space.

Diffusion maps constructs a graph from local similarities to learn the global data geometry. Typically, the graph is constructed by applying a kernel function (e.g. Gaussian) with a fixed bandwidth to pairwise Euclidean distances, resulting in an $n \times n$ kernel matrix $K$, where $n$ is the number of points we wish to embed, giving a notion of similarity between data points. Instead of applying a kernel function to Euclidean distances, our method uses the random forest proximity (local-affinity) matrix. The local affinity matrix is then row-sum normalized to create a Markov transition matrix, or diffusion operator, $P$, which is used to capture global relationships. An entry $P_{ij}$ of row $i$ in $P$ represents the probability of transitioning, or “walking,” from point $x_i$ to point $x_j$ in a single step of a random walk on a graph constructed from the affinity matrix. Thus the probability of transitioning from point $x_i$ to point $x_j$ on this graph in a single time step is high if the distance between them is small. This discourages the random walk from exiting the intrinsic manifold of the data. $P$ is raised to the power $t$ to simulate $t$ diffusion steps. This has a denoising effect, as transitions with low probabilities are filtered out while transitions with high probabilities retain their importance. In diffusion maps, the local and global information learned in $P^t$ is extracted into lower dimensions using an eigendecomposition.

Diffusion maps has several weaknesses that prevent it from being effective for visualization in most settings. First, in many applications a fixed bandwidth for all points is not appropriate, as the data may not be sampled uniformly. Second, choosing a good time scale $t$ for the diffusion process is difficult and largely overlooked in classical diffusion maps. A small value of $t$ can lead to insufficient denoising and an overemphasis on local structure. In contrast, a large value of $t$ can lead to oversmoothing and an overemphasis on global structure. Third, the eigendecomposition in diffusion maps tends to spread the information learned in $P^t$ across many dimensions Moon et al. (2019); Haghverdi et al. (2016), which is not amenable to visualization.
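A minimal sketch of this step, assuming the proximity matrix has already been computed (e.g. by the `rf_proximities` sketch above):

```python
import numpy as np

def diffusion_operator(K):
    # Row-sum normalize an affinity matrix (here, RF proximities) into
    # a Markov transition matrix P; P[i, j] is the probability of
    # walking from point i to point j in one step.
    return K / K.sum(axis=1, keepdims=True)

def diffuse(P, t):
    # Simulate t steps of the random walk by raising P to the power t.
    return np.linalg.matrix_power(P, t)
```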
Since the random forest proximity matrix consists of locally-adaptive affinities, it does not rely on uniformly sampled data, overcoming the first weakness of diffusion maps. The von Neumann entropy (VNE) of the diffused operator provides a good choice of $t$ for visualization. The VNE of the diffused operator $P^t$ is the Shannon entropy of the normalized eigenvalues of $P^t$. Since the entropy of a discrete random variable is maximized by a uniform distribution, the VNE is a soft proxy for the number of significant eigenvalues of $P^t$: the more significant eigenvalues there are, the closer the normalized eigenvalues are to a uniform distribution, and the higher the VNE. Since $P$ is a probability transition matrix with a stationary distribution, the VNE converges to zero as $t \to \infty$. The rate of decay of the VNE as $t$ increases is used to select $t$. Typically, $t$ is chosen to be around the transition from rapid to slow decay in the VNE, as this is considered to be the point in the diffusion process where noise has been eliminated and oversmoothing begins Moon et al. (2019). We show empirically (see appendix E) that diffusion over proximities is not sensitive to the choice of $t$ around the VNE-selected value for local variable preservation.

To counter the third weakness of diffusion maps, we apply an information distance to the powered diffusion operator to create the potential distance Moon et al. (2019). The potential distance is calculated by log- or square-root-transforming the powered diffusion operator and then calculating the Euclidean distance between rows, although other information distances can also be used Duque et al. (2019). The potential distance is sensitive to differences in both the tails and the denser regions of the diffused probabilities, resulting in a distance that preserves both local and global relationships. These distances are then embedded into low dimensions using metric MDS with classical MDS as an initialization, extracting the information in low dimensions for better visualization.
5 Experimental Results
To evaluate our supervised visualization method, we apply RF-PHATE to multiple data sets and compare it to the other methods described in Sections 1 and 2. Full details on the data sets used, computational environment, method implementations, and parameter tuning can be found in the appendices. We first demonstrate qualitatively how RF-PHATE can be used in exploratory data analysis to visualize variable importance with respect to the supervised task on both low- and high-dimensional data sets. We then quantitatively establish this capability compared to other supervised dimensionality reduction methods. Finally, we show that RF-PHATE is able to accurately visualize data with noisy dimensions, outperforming both supervised and unsupervised dimensionality reduction methods.
5.1 Visualizing Variable Importance
We demonstrate the ability of RF-PHATE to capture local variable importance in a low-dimensional representation of the popular Titanic data set Dua and Graff (2017), which has 12 variables and 891 observations. The variables Name and Ticket were removed, as they are nearly unique to each observation. In addition, observations with missing values were removed. The proximities were created using the randomForest Liaw and Wiener (2002) package in R R Core Team (2019), with the variable Survived as the response. The forest’s classification accuracy was 81.46% using 5000 trees and otherwise default parameters. Variable importance was computed using the forest’s built-in importance measure (mean decrease in accuracy). To compute the importance of a variable $v$ using a random forest, the error rate (for classification) or MSE (for regression) is first computed using the out-of-bag (oob) observations. The values of $v$ for the oob data are then randomly permuted and the error rate (or MSE) is recomputed on the permuted data. The importance is the difference between the permuted oob error rate and the original oob error rate Cutler et al. (2012). For the Titanic data set, the top two important variables were Sex and Class. Figure 1 shows the results of applying RF-PHATE to this data. Instead of creating distinct clusters based solely on the response, RF-PHATE arranges points and clusters by variable importance. The rightmost cluster includes mostly survivors and contains female first- and second-class passengers. Most passengers in the leftmost cluster did not survive; this cluster contains male second- and third-class passengers. The small cluster in the center is entirely composed of second-class male children, all of whom survived. At least some of the variation within clusters appears to be driven by age. This demonstrates how RF-PHATE can be used in data exploration by visually identifying groupings and trends within the variables important for the supervised task.
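As a rough sketch of this permutation-importance idea (assuming a scikit-learn-style fitted forest `rf` and a held-out set standing in for the per-tree out-of-bag samples that the randomForest package actually uses):

```python
import numpy as np
from sklearn.metrics import zero_one_loss

def permutation_importance(rf, X_holdout, y_holdout, j, n_repeats=10, seed=0):
    # Importance of variable j = increase in classification error after
    # randomly permuting column j, relative to the unpermuted error.
    rng = np.random.default_rng(seed)
    base_err = zero_one_loss(y_holdout, rf.predict(X_holdout))
    deltas = []
    for _ in range(n_repeats):
        X_perm = X_holdout.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        deltas.append(zero_one_loss(y_holdout, rf.predict(X_perm)) - base_err)
    return float(np.mean(deltas))
```

scikit-learn also ships a maintained version of this idea in `sklearn.inspection.permutation_importance`.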
5.2 Visualization of Higher-Dimensional Data
To demonstrate the application of RF-PHATE to high-dimensional data, we apply it to the Optical Recognition of Handwritten Digits data from Xu et al. (1992); Dua and Graff (2017) and the Ames, Iowa, house pricing data from Cock (2011), with classification and regression supervision, respectively. The Optical Digits data contains 5620 observations of handwritten numerical digits recorded as $32\times32$-bit images. The images are divided into $4\times4$ blocks, resulting in 64-dimensional arrays Xu et al. (1992); Dua and Graff (2017). The random forest accuracy for the data set was 82.7%. The variables in this data set are not as straightforward to analyze, as they consist of individual pixel blocks in the images, but we can see in Figure 2(a) that high values in the top variable (block 43) are important in classifying 0’s and 6’s, while high values in the second variable further separate 4’s, 7’s, and 9’s from the 3’s, 5’s, and 6’s. On the Ames, Iowa house pricing data set Cock (2011), we adapted the random forest in RF-PHATE to perform regression on house prices, resulting in overall quality and ground living area (square footage) being designated as the most important variables, as emphasized by the RF-PHATE visualization in Figure 2(b).
5.3 Variable Importance Preservation
We now numerically assess the quality of the embeddings provided by RF-PHATE. To be useful in supervised exploratory data analysis, the embedding should preserve local and global structure in the variables most important for the supervised task. To numerically assess this structure preservation, we applied a $k$-nearest neighbor classifier or regressor with the embedding dimensions as inputs (i.e. features) and the important variables as the response output. The key idea is that if the embedding preserves the local and global structure in the important variables, then the embedding dimensions should be sufficient to predict the important variables.
For classification, the mean prediction error, and for regression, the root mean-squared error was used as the criterion, applying 10-fold cross-validation. We compared RF-PHATE with six supervised methods: Enhanced Supervised Isomap (ES-Iso) Ribeiro et al. (2008), Kernelized Supervised PCA (KSPCA) Yu et al. (2006), SPCA Yu et al. (2006), Enhanced Supervised LLE (ES-LLE) qing Zhang (2009), SNMF Jia et al. (2019), and Partial Least Squares Discriminant Analysis (PLS-DA) Wold et al. (1984).
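A hedged sketch of this evaluation for a continuous response, assuming NumPy/scikit-learn and the $k = \sqrt{n}$ choice described in Appendix B:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

def structure_preservation_rmse(embedding, important_var):
    # Regress an important variable on the embedding coordinates; a low
    # cross-validated RMSE indicates the embedding preserved the local
    # and global structure of that variable.
    k = int(np.sqrt(len(embedding)))
    knn = KNeighborsRegressor(n_neighbors=k)
    scores = cross_val_score(knn, embedding, important_var, cv=10,
                             scoring="neg_root_mean_squared_error")
    return -scores.mean()
```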
Data Set  Variable (dims)  RF-PHATE  ES-Iso  KSPCA  SPCA  ES-LLE  SNMF  PLS-DA

Titanic  Sex (2)  0.0000  0.0028  0.0084  0.0056  0.0070  0.1419  0.0028 
Titanic  Sex (3)  0.0000  0.0028  0.0084  0.0028  0.0042  0.1489  0.0042 
Titanic  Class (2)  0.0154  0.1657  0.1025  0.0688  0.0702  0.3174  0.0197 
Titanic  Class (3)  0.0112  0.0492  0.1011  0.0281  0.0969  0.3048  0.0239 
Optical  Top Var (2)  3.612  3.554  5.478  5.237  5.910  6.659  5.842 
Optical  Top Var (3)  3.473  3.484  4.013  4.922  5.887  6.659  5.407 
Optical  Second Var (2)  3.425  3.869  4.101  4.862  5.441  6.540  5.730 
Optical  Second Var (3)  3.361  3.851  3.902  4.007  5.194  6.540  5.060 
Optical  Third Var (2)  4.147  3.997  5.107  4.416  5.839  7.147  6.217 
Optical  Third Var (3)  3.766  3.810  4.608  4.039  5.810  7.147  5.968 
Optical  Fourth Var (2)  4.108  4.118  5.373  5.687  5.965  6.161  5.985 
Optical  Fourth Var (3)  4.001  4.037  4.035  4.949  5.867  6.161  5.320 
Ames  Quality (2)  0.644  N/A  0.815  0.764  N/A  N/A  0.767 
Ames  Quality (3)  0.631  N/A  0.788  0.759  N/A  N/A  0.736 
Ames  GrLvArea (2)  263.75  N/A  402.08  413.36  N/A  N/A  415.90 
Ames  GrLvArea (3)  224.57  N/A  393.64  411.53  N/A  N/A  293.18 
Table 1: The most important variables for each data set (with the embedding dimension in parentheses) were used as the response, using the low-dimensional embeddings (2 or 3 dimensions) from the seven supervised dimensionality reduction algorithms. For categorical variables (Sex, Class), the classification error was used; for continuous variables (all remaining), the root mean-squared error (RMSE) was used. Therefore, lower is better. ES-Iso, ES-LLE, and SNMF have not been adapted for regression, hence the missing values for the Ames data set (N/A). RF-PHATE is either first or second at each task.
Table 1 shows the results for the Titanic, Optical Digits, and Ames Housing data sets. For the Titanic data set, RF-PHATE performs the best out of all methods when predicting either Sex or Class (the two most important variables) using either a 2- or 3-dimensional embedding. For the Optical Digits data, RF-PHATE is either the best or second-best method at preserving the relevant structure when using 2 or 3 dimensions. ES-Iso is generally second and is sometimes significantly below RF-PHATE in quality. We performed similar comparisons on other data sets, and we note that RF-PHATE continues to do well when embedding with higher dimensions. See appendices A and B.
5.4 Supervised Visualization in the Presence of Noisy Variables
We now consider the problem where only a few variables are relevant for the supervised task and all other variables contain noise. This is a case where unsupervised visualization methods are very likely to fail, demonstrating the utility of supervised visualization methods. To simulate this, we took the Iris data set Anderson (1936); Dua and Graff (2017) and added 1000 dimensions generated from Gaussian random variables with means uniformly generated between −1 and 1 and with a variance of 1. The simulation was repeated 10 times. This also creates a case with few data points, as the Iris data set has only 150 samples. The visualizations are shown in Figure 3. We note that the unsupervised methods (PCA, Isomap, and t-SNE) perform poorly in this setting, as their visualizations consist of a single cloud of points without any distinguishable features.
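A sketch of the noise-augmentation step under the stated assumptions (NumPy/scikit-learn; the paper's experiments were run in R, so this is only illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris

def noisy_iris(n_noise=1000, seed=0):
    # Append n_noise Gaussian noise dimensions with unit variance and
    # means drawn uniformly from [-1, 1], as in the experiment above.
    rng = np.random.default_rng(seed)
    X, y = load_iris(return_X_y=True)
    means = rng.uniform(-1, 1, size=n_noise)
    noise = rng.normal(loc=means, scale=1.0, size=(X.shape[0], n_noise))
    return np.hstack([X, noise]), y
```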
In contrast, the supervised visualization methods are able to reveal some structure by using the class labels. RF-PHATE (Figure 3(b)) separates the setosa class from the other two classes while the versicolor and virginica classes are shown with some overlap. Furthermore, the setosa class is shown to be approximately the same distance from each of the other two classes. This overall structure is consistent with the properties of the sepal measurements in the Iris data set as seen in figure 3(a) in appendix A.
PLS-DA and ES-Isomap (Figures 3(c)iv and vi, respectively) separate the classes well. However, these methods do not show the overlap between the versicolor and virginica species. Moreover, the ES-Isomap visualization collapses the data points for each class onto a single point, completely eliminating any structure that may be present within the classes. This suggests that these methods do not preserve the global relationships between classes well.
Supervised PCA (Figure 3(c)v) separates the setosa class from the other two classes while the versicolor and virginica species share some overlap. The setosa species is shown in the visualization to be much closer to the versicolor class than the virginica species. While this is not consistent with the global structure of the sepal measurements (Figure 3(a)), it is consistent with the structure of the petal measurements (see figure 4). However, SPCA does not show a large separation between the setosa and versicolor classes, suggesting that SPCA may not always accurately reflect the magnitude of class separations.
These results are corroborated numerically in Table 2. Here we performed an experiment similar to that in Section 5.3, performing regression on the embeddings to predict each of the 4 variables in the data set. Again, RF-PHATE outperforms the other methods in all settings. While this is a simple data set, its simplicity allows us to evaluate the performance of the various methods both qualitatively and quantitatively, and thus reflects a generalizable trend that will be studied further in future work. See appendix B for experiments with higher dimensions.
Var (dims)  RF-PHATE  ES-Iso  KSPCA  SPCA  ES-LLE  SNMF  PLS-DA

petal len. (2)  0.330 ± 0.01  0.348 ± 0.01  0.340 ± 0.01  0.381 ± 0.01  0.349 ± 0.01  1.475 ± 0.03  0.387 ± 0.01
petal len. (3)  0.334 ± 0.01  0.353 ± 0.01  0.360 ± 0.01  0.400 ± 0.01  0.353 ± 0.01  2.233 ± 0.04  0.363 ± 0.01
petal wid. (2)  0.291 ± 0.01  0.298 ± 0.01  0.297 ± 0.01  0.330 ± 0.01  0.304 ± 0.01  1.297 ± 0.02  0.330 ± 0.01
petal wid. (3)  0.292 ± 0.01  0.305 ± 0.01  0.312 ± 0.01  0.345 ± 0.01  0.308 ± 0.01  1.905 ± 0.03  0.308 ± 0.01
sepal len. (2)  0.459 ± 0.01  0.522 ± 0.02  0.514 ± 0.02  0.514 ± 0.02  0.528 ± 0.02  0.885 ± 0.02  0.514 ± 0.02
sepal len. (3)  0.455 ± 0.01  0.536 ± 0.02  0.540 ± 0.02  0.520 ± 0.02  0.534 ± 0.02  1.272 ± 0.02  0.516 ± 0.02
sepal wid. (2)  0.320 ± 0.01  0.348 ± 0.01  0.351 ± 0.01  0.357 ± 0.01  0.349 ± 0.01  0.462 ± 0.01  0.357 ± 0.01
sepal wid. (3)  0.318 ± 0.01  0.354 ± 0.01  0.358 ± 0.01  0.356 ± 0.01  0.354 ± 0.01  0.501 ± 0.02  0.357 ± 0.01
Table 2: Results on the Iris data set with 1000 additional random noise variables. The low-dimensional embeddings of seven supervised dimensionality reduction methods were used as features to regress on the original Iris variables. The noise variables were simulated ten times and the results were averaged; the standard deviation is also reported. RF-PHATE outperforms all other methods.
6 Conclusion
Large, high-dimensional data sets are commonplace and can be difficult to analyze and interpret. Supervised dimensionality reduction algorithms are used for preprocessing and visualizing high-dimensional data sets when labels are present. Current methods for supervised visualization either overemphasize class separation or ignore the data’s intrinsic geometric structure altogether. In contrast, RF-PHATE incorporates labels directly in its kernel matrix using random forests’ out-of-bag samples, thereby robustly learning local relationships without overstressing class separation. The global structure is discovered by means of diffusion and is then visualized by embedding an information distance. RF-PHATE outperforms other supervised dimensionality reduction algorithms, is noise-resilient, is robust to parameter tuning, and handles both continuous and discrete labels.
Broader Impact
Data exploration plays a role in virtually all fields of science and technology. In this era of big data, there is an ever-increasing need for means to make sense of the information overload we face daily. In instances where labeled data are present, RF-PHATE has the potential to visually unlock patterns and signals in excessively noisy data, which may lead to new discoveries in social, medical, natural, and business settings. RF-PHATE enables the identification of trends or clusters based on important variables. These clusters are inherently meaningful and are not subject to the arbitrary, subjective interpretation often encountered in an unsupervised setting. This, in turn, may lead to the discovery of meaningful relationships not only between the variables and classes, but also between the variables themselves. We do note that, like many other data processing tools, data visualization is not immune to misuse and may be used as a means of deceit when not presented properly. However, this work is computational in nature and addresses fundamental development in data science, agnostic to specific applications. As such, by itself, it is not expected to raise ethical concerns or to have adverse effects on society.
References
 Pearson [1901] Karl Pearson. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.
 Hotelling [1933] Harold Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 1933.
 Lee and Seung [1999] Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by nonnegative matrix factorization. Nature, 401(6755):788–791, Oct 1999. ISSN 14764687.
 Kruskal and Wish [1978] J.B. Kruskal and M. Wish. Multidimensional Scaling. Sage Publications, 1978.
 Roweis and Saul [2000] Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000. ISSN 00368075.
 Tenenbaum et al. [2000] Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000. ISSN 00368075.
 Coifman and Lafon [2006] Ronald R. Coifman and Stéphane Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5–30, 2006. ISSN 10635203. Special Issue: Diffusion Maps and Wavelets.
 Nadler et al. [2006] Boaz Nadler, Stephane Lafon, Ioannis Kevrekidis, and Ronald R Coifman. Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators. In Advances in Neural Information Processing Systems, pages 955–962, 2006.
 Hinton and Zemel [1994] Geoffrey E Hinton and Richard S Zemel. Autoencoders, minimum description length and Helmholtz free energy. In Advances in Neural Information Processing Systems, pages 3–10, 1994.
 Maaten and Hinton [2008] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
 McInnes et al. [2018] Leland McInnes, John Healy, and James Melville. UMAP: Uniform manifold approximation and projection for dimension reduction, 2018.
 Moon et al. [2019] Kevin R. Moon, David van Dijk, Zheng Wang, Scott Gigante, Daniel B. Burkhardt, William S. Chen, Kristina Yim, Antonia van den Elzen, Matthew J. Hirn, Ronald R. Coifman, Natalia B. Ivanova, Guy Wolf, and Smita Krishnaswamy. Visualizing structure and transitions in highdimensional biological data. Nature Biotechnology, 37(12):1482–1492, Dec 2019. ISSN 15461696.
 Amir et al. [2013] El-ad David Amir, Kara L Davis, Michelle D Tadmor, Erin F Simonds, Jacob H Levine, Sean C Bendall, Daniel K Shenfeld, Smita Krishnaswamy, Garry P Nolan, and Dana Pe’er. viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia. Nature Biotechnology, 31(6):545, 2013.
 Macosko et al. [2015] Evan Z Macosko, Anindita Basu, Rahul Satija, James Nemesh, Karthik Shekhar, Melissa Goldman, Itay Tirosh, Allison R Bialas, Nolan Kamitaki, Emily M Martersteck, et al. Highly parallel genomewide expression profiling of individual cells using nanoliter droplets. Cell, 161(5):1202–1214, 2015.
 Gopalakrishnan et al. [2018] Vancheswaran Gopalakrishnan, Christine N Spencer, Luigi Nezi, Alexandre Reuben, MC Andrews, TV Karpinets, PA Prieto, D Vicente, Karen Hoffman, SC Wei, et al. Gut microbiome modulates response to anti–PD-1 immunotherapy in melanoma patients. Science, 359(6371):97–103, 2018.
 Becht et al. [2019] Etienne Becht, Leland McInnes, John Healy, Charles-Antoine Dutertre, Immanuel WH Kwok, Lai Guan Ng, Florent Ginhoux, and Evan W Newell. Dimensionality reduction for visualizing single-cell data using UMAP. Nature Biotechnology, 37(1):38, 2019.
 Zhao et al. [2020] Yujiao Zhao, Matthew Amodio, Brent Vander Wyk, Bram Gerritsen, Mahesh M Kumar, David van Dijk, Kevin Moon, Xiaomei Wang, Anna Malawista, Monique M Richards, Megan Cahill, Anita Desai, Jayasree Sivadasan, Manjunatha Venkataswamy, Vasanthapuram Ravi, Erol Fikrig, Priti Kumar, Steven Kleinstein, Smita Krishnaswamy, and Ruth Montgomery. Single cell immune profiling of dengue virus patients reveals intact immune responses to Zika virus with enrichment of innate immune signatures. PLoS Neglected Tropical Diseases, 14(3):e0008112, 2020.
 Karthaus et al. [2020] Wouter R Karthaus, Matan Hofree, Danielle Choi, Eliot L Linton, Mesruh Turkekul, Alborz Bejnood, Brett Carver, Anuradha Gopalan, Wassim Abida, Vincent Laudone, Moshe Biton, Ojasvi Chaudhary, Tianhao Xu, Ignas Masilionis, Katia Manova, Linas Mazutis, Dana Pe’er, Aviv Regev, and Charles Sawyers. Regenerative potential of prostate luminal cells revealed by singlecell analysis. Science, 368(6490):497–505, 2020.
 Baccin et al. [2019] Chiara Baccin, Jude AlSabah, Lars Velten, Patrick M Helbling, Florian Grünschläger, Pablo HernándezMalmierca, César NombelaArrieta, Lars M Steinmetz, Andreas Trumpp, and Simon Haas. Combined singlecell and spatial transcriptomics reveal the molecular, cellular and spatial bone marrow niche organization. Nature Cell Biology, pages 1–11, 2019.
 Chen and Salman [2011] Ke Chen and Ahmad Salman. Extracting speakerspecific information with a regularized siamese deep network. In Advances in Neural Information Processing Systems, pages 298–306, 2011.
 Duque et al. [2019] Andrés F. Duque, Guy Wolf, and Kevin R. Moon. Visualizing high dimensional dynamical processes. In 2019 IEEE 29th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6. IEEE, 2019.
 Horoi et al. [2020] Stefan Horoi, Victor Geadah, Guy Wolf, and Guillaume Lajoie. Low-dimensional dynamics of encoding and learning in recurrent neural networks. In Canadian Conference on Artificial Intelligence, pages 276–282. Springer, 2020.
 Gigante et al. [2019] Scott Gigante, Adam S Charles, Smita Krishnaswamy, and Gal Mishne. Visualizing the PHATE of neural networks. In Advances in Neural Information Processing Systems, pages 1840–1851, 2019.
 Bair et al. [2006] Eric Bair, Trevor Hastie, Debashis Paul, and Robert Tibshirani. Prediction by supervised principal components. Journal of the American Statistical Association, 101(473):119–137, 2006.
 Yu et al. [2006] Shipeng Yu, Kai Yu, Volker Tresp, HansPeter Kriegel, and Mingrui Wu. Supervised probabilistic principal component analysis. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06, page 464–473, New York, NY, USA, 2006. Association for Computing Machinery. ISBN 1595933395.
 qing Zhang [2009] Shi-qing Zhang. Enhanced supervised locally linear embedding. Pattern Recognition Letters, 30(13):1208–1218, 2009. ISSN 01678655.
 Li et al. [2018] Z. Li, J. Tang, and X. He. Robust structured nonnegative matrix factorization for image representation. IEEE Transactions on Neural Networks and Learning Systems, 29(5):1947–1960, 2018.
 Zhang et al. [2016] X. Zhang, L. Zong, X. Liu, and J. Luo. Constrained clustering with nonnegative matrix factorization. IEEE Transactions on Neural Networks and Learning Systems, 27(7):1514–1526, 2016.
 Jia et al. [2019] Y. Jia, S. Kwong, J. Hou, and W. Wu. Semisupervised nonnegative matrix factorization with dissimilarity and similarity regularization. IEEE Transactions on Neural Networks and Learning Systems, pages 1–12, 2019.
 Vlachos et al. [2002] Michail Vlachos, Carlotta Domeniconi, Dimitrios Gunopulos, George Kollios, and Nick Koudas. Nonlinear dimensionality reduction techniques for classification and visualization. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’02, page 645–651, New York, NY, USA, 2002. Association for Computing Machinery. ISBN 158113567X.
 Ribeiro et al. [2008] Bernardete Ribeiro, Armando Vieira, and João Carvalho das Neves. Supervised isomap with dissimilarity measures in embedding learning. In José RuizShulcloper and Walter G. Kropatsch, editors, Progress in Pattern Recognition, Image Analysis and Applications, pages 389–396, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg. ISBN 9783540859208.
 Tapson and Greene [2005] J. Tapson and J.R. Greene. Plant data visualization using nonnegative matrix factorization. IFAC Proceedings Volumes, 38(1):73 – 78, 2005. ISSN 14746670. 16th IFAC World Congress.
 Babaee et al. [2016] Mohammadreza Babaee, Stefanos Tsoukalas, Gerhard Rigoll, and Mihai Datcu. Immersive visualization of visual data using nonnegative matrix factorization. Neurocomputing, 173:245 – 255, 2016. ISSN 09252312.
 Barshan et al. [2011] Elnaz Barshan, Ali Ghodsi, Zohreh Azimifar, and Mansoor Zolghadri Jahromi. Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds. Pattern Recogn., 44(7):1357–1371, July 2011. ISSN 00313203.
 Breiman [2001] Leo Breiman. Random forests. Mach. Learn., 45(1):5–32, October 2001. ISSN 08856125.
 Chao et al. [2019] Guoqing Chao, Yuan Luo, and Weiping Ding. Recent advances in supervised dimension reduction: A survey. Machine Learning and Knowledge Extraction, 1(1):341–358, Jan 2019. ISSN 25044990.
 Tsuge et al. [2001] S. Tsuge, M. Shishibori, S. Kuroiwa, and K. Kita. Dimensionality reduction using nonnegative matrix factorization for information retrieval. In 2001 IEEE International Conference on Systems, Man and Cybernetics. eSystems and eMan for Cybernetics in Cyberspace (Cat.No.01CH37236), volume 2, pages 960–965 vol.2, 2001.
 Haifeng Liu et al. [2012] Haifeng Liu, Zhaohui Wu, Xuelong Li, Deng Cai, and T. S. Huang. Constrained nonnegative matrix factorization for image representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7):1299–1311, 2012.
 Liu et al. [2008] Weixiang Liu, Kehong Yuan, and Datian Ye. Reducing microarray data via nonnegative matrix factorization for visualization and clustering analysis. Journal of Biomedical Informatics, 41(4):602–606, 2008. ISSN 15320464.
 Cai et al. [2008] D. Cai, X. He, X. Wu, and J. Han. Non-negative matrix factorization on manifold. In 2008 Eighth IEEE International Conference on Data Mining, pages 63–72, 2008.
 van der Maaten and Hinton [2008] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using tSNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
 Belkin and Niyogi [2003] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
 de Ridder et al. [2003] Dick de Ridder, Olga Kouropteva, Oleg Okun, Matti Pietikäinen, and Robert P. W. Duin. Supervised locally linear embedding. In Okyay Kaynak, Ethem Alpaydin, Erkki Oja, and Lei Xu, editors, Artificial Neural Networks and Neural Information Processing — ICANN/ICONIP 2003, pages 333–341, Berlin, Heidelberg, 2003. Springer Berlin Heidelberg. ISBN 9783540449898.
 Polito and Perona [2002] Marzia Polito and Pietro Perona. Grouping and dimensionality reduction by locally linear embedding. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 1255–1262. MIT Press, 2002.
 Friedman et al. [1977] Jerome H. Friedman, Jon Louis Bentley, and Raphael Ari Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Trans. Math. Softw., 3(3):209–226, September 1977. ISSN 00983500.
 Cutler et al. [2012] Adele Cutler, D. Richard Cutler, and John R. Stevens. Random Forests, pages 157–175. Springer US, Boston, MA, 2012. ISBN 9781441993267.
 Breiman [1996] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, Aug 1996. ISSN 15730565.
 Segal and Xiao [2011] Mark Segal and Yuanyuan Xiao. Multivariate random forests. WIREs Data Mining and Knowledge Discovery, 1(1):80–87, 2011.
 Breiman and Cutler [Accessed on 04/15/2020] Leo Breiman and Adele Cutler. Random forests for scientific discovery. http://www.math.usu.edu/adele/RandomForests/ENAR.pdf, accessed on 04/15/2020.
 Lin and Jeon [2006] Yi Lin and Yongho Jeon. Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101(474):578–590, 2006.
 Golino and Gomes [2014] Hudson F. Golino and Cristiano Mauro Assis Gomes. Visualizing Random Forest’s Prediction Results. Psychology, 5:2084–2098, 2014.
 Fruchterman and Reingold [1991] Thomas M. J. Fruchterman and Edward M. Reingold. Graph drawing by forcedirected placement. Software: Practice and Experience, 21(11):1129–1164, 1991.
 Gajer et al. [2004] Pawel Gajer, Michael T. Goodrich, and Stephen G. Kobourov. A multidimensional approach to forcedirected layouts of large graphs. Computational Geometry, 29(1):3 – 18, 2004. ISSN 09257721. Special Issue on the 10th Fall Workshop on Computational Geometry, SUNY at Stony Brook.
 Haghverdi et al. [2016] Laleh Haghverdi, Maren Buettner, F Alexander Wolf, Florian Buettner, and Fabian J Theis. Diffusion pseudotime robustly reconstructs lineage branching. Nature Methods, 13(10):845, 2016.
 Dua and Graff [2017] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
 Liaw and Wiener [2002] Andy Liaw and Matthew Wiener. Classification and regression by randomForest. R News, 2(3):18–22, 2002.
 R Core Team [2019] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2019.
 Xu et al. [1992] Lei Xu, A. Krzyzak, and Ching Suen. Methods of combining multiple classifiers and their applications to handwriting recognition. IEEE Transactions on Systems Man and Cybernetics Part B (Cybernetics), 22:418 – 435, 06 1992.
 Cock [2011] Dean De Cock. Ames, Iowa: Alternative to the Boston housing data as an end of semester regression project. Journal of Statistics Education, 19(3), 2011.
 Wold et al. [1984] S. Wold, A. Ruhe, H. Wold, and W. J. Dunn, III. The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing, 5(3):735–743, 1984.
 Anderson [1936] Edgar Anderson. The species problem in Iris. Annals of the Missouri Botanical Garden, 23(3):457–509, 1936. ISSN 00266493.
 Gorman and Sejnowski [1988] R.Paul Gorman and Terrence J. Sejnowski. Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks, 1(1):75 – 89, 1988. ISSN 08936080.
 Kuhn [2020] Max Kuhn. caret: Classification and Regression Training, 2020. R package version 6.086.
 Moon et al. [2017] K. R. Moon, K. Sricharan, and A. O. Hero. Ensemble estimation of mutual information. In 2017 IEEE International Symposium on Information Theory (ISIT), pages 3030–3034, 2017.
 Tibshirani [1994] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1994.
 DíazUriarte and Alvarez de Andrés [2006] Ramón DíazUriarte and Sara Alvarez de Andrés. Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7(1):3, Jan 2006. ISSN 14712105.
Appendices
In these appendices, we provide further details about RF-PHATE. In Section A, we provide details about each of the data sets used in the experiments. We then provide details on the metric we use to quantify our results and show extended results in Section B. In Section C, we present alternative measures of variable importance, which are consistent with the random forest measure of variable importance. We compare RF-PHATE to unsupervised visualization methods in Section D. We then perform an ablation study in Section E, demonstrating that RF-PHATE is robust to the choice of parameters. Finally, in Section F we demonstrate that other approaches to embedding the random forest proximities result in inferior visualizations.
Appendix A Description of Data Sets
Descriptions of the data sets used in the experiments and their preprocessing steps are given here. In addition to the steps listed for each particular data set, all continuous variables in each data set were standardized to have a mean of zero and a variance of one.
The Optical Digits Xu et al. [1992], Dua and Graff [2017] data contains 5620 observations of handwritten numerical digits (0–9) recorded as $32\times32$-bit images. The images were collected from 43 individuals, and the pixels are summed over $4\times4$ blocks, resulting in 64-dimensional arrays. Each variable takes on integer values from 0 to 16, denoting the number of original pixels with writing included in that particular block. There are no missing values.
The Ames Housing Cock [2011] data set consists of 1460 house listings with 80 variables comprising a mixture of nominal, ordinal, discrete, and continuous values. The response variable is the sale price. Missing values were manually imputed (N/A values were replaced with “none” for missing categorical variables, or 0 for continuous variables, where applicable). One-hot encoding was applied to categorical variables for compatibility with the compared dimensionality reduction techniques.
The Titanic data set has 12 variables and 891 observations. The variables Ticket and Name are nearly unique to each observation and were thus removed. In addition, observations with missing values were also removed, resulting in a total of 712 observations used in the comparisons. One-hot encoding was applied to the categorical variables.
Iris Anderson [1936], Dua and Graff [2017] contains 150 observations of three iris species: virginica, versicolor, and setosa. There are 50 observations from each species. Each observation has four measurements: sepal length, sepal width, petal length, and petal width. The data set has no missing values. See a pairwise variable plot in figure 4.
In addition to the previous data sets used in the main paper, we include results on other data sets. This includes the Sonar Gorman and Sejnowski [1988] data set, which consists of 208 instances of sonar signals reflected off of rocks or metal cylinders positioned at various angles. This results in 60 variables (60 sonar bands). Classification is done on the material type (rock or metal). The data set contains no missing values. (See figure 5).
Additionally, we show results for the German Credit data set Dua and Graff [2017], consisting of 1000 instances with 20 categorical and numeric attributes. An alternative version (which we use) contains 24 numeric variables representative of the original data. The variables (attributes) are proprietary, and therefore variable names are not included. The data set has two classes: bad credit (0) or good credit (1). RF-PHATE highlights variables which may be used to clearly distinguish customers with good credit from those with bad credit. See figure 6.
Appendix B Low-Dimensional Variable Regression
To quantify the degree to which the low-dimensional embeddings capture the important variables, we used the embeddings as feature matrices for a $k$NN classifier/regressor on the original variables (see Tables 1 and 2). The value of $k$ used in each case was the square root of the number of observations in the data set. For discrete variables, the average error rate was computed using 10-fold cross-validation. For continuous variables, the average root mean-squared error (RMSE) was the criterion, also using 10-fold cross-validation. We compared results using 7 supervised dimensionality reduction techniques: RF-PHATE, Enhanced Supervised Isomap Ribeiro et al. [2008], Kernelized Supervised PCA Yu et al. [2006], Supervised PCA Yu et al. [2006], Enhanced Supervised LLE qing Zhang [2009], Supervised NMF Jia et al. [2019], and Partial Least Squares Discriminant Analysis (PLS-DA) Wold et al. [1984].
Here we extend the results from the paper by including other data sets and by assessing the embeddings with dimensions of 4 and 5. The results are contained in Tables 3 to 7. We again see that RF-PHATE outperforms the other methods, as it is mostly first, and otherwise second or third, across all comparisons. No other method is as consistent in its performance across all data sets.
Var (dims)  RF-PHATE  ES-Iso  KSPCA  SPCA  ES-LLE  SNMF  PLS-DA
Quality (2)  0.645  N/A  0.815  0.765  N/A  N/A  0.767 
Quality (3)  0.631  N/A  0.789  0.760  N/A  N/A  0.736 
Quality (4)  0.619  N/A  0.785  0.747  N/A  N/A  0.723 
Quality (5)  0.619  N/A  0.783  0.743  N/A  N/A  0.728 
GrLivArea (2)  263.751  N/A  402.084  413.366  N/A  N/A  415.903 
GrLivArea (3)  224.579  N/A  393.641  411.534  N/A  N/A  293.184 
GrLivArea (4)  226.168  N/A  390.608  385.940  N/A  N/A  293.578 
GrLivArea (5)  226.585  N/A  392.209  385.254  N/A  N/A  297.176 
Ext. Quality (2)  0.114  N/A  0.132  0.134  N/A  N/A  0.125 
Ext. Quality (3)  0.114  N/A  0.136  0.127  N/A  N/A  0.125 
Ext. Quality (4)  0.115  N/A  0.124  0.128  N/A  N/A  0.112 
Ext. Quality (5)  0.112  N/A  0.123  0.129  N/A  N/A  0.112 
Var (dims)  RF-PHATE  ES-Iso  KSPCA  SPCA  ES-LLE  SNMF  PLS-DA
Sex (2)  0.0000  0.0028  0.0084  0.0056  0.0070  0.1419  0.0028 
Sex (3)  0.0000  0.0028  0.0084  0.0028  0.0042  0.1489  0.0042 
Sex (4)  0.0000  0.0070  0.0056  0.0028  0.0028  0.1320  0.0042 
Sex (5)  0.0000  0.0056  0.0056  0.0028  0.0028  0.1629  0.0028 
Class (2)  0.0154  0.1657  0.1025  0.0688  0.0702  0.3174  0.0197 
Class (3)  0.0112  0.0492  0.1011  0.0281  0.0969  0.3048  0.0239 
Class (4)  0.0169  0.0421  0.0421  0.0253  0.0702  0.3666  0.0211 
Class (5)  0.0126  0.0463  0.0197  0.0253  0.0337  0.2893  0.0197 
Var (dims)  RF-PHATE  ES-Iso  KSPCA  SPCA  ES-LLE  SNMF  PLS-DA
V43 (2)  3.612  3.554  5.478  5.237  5.910  6.659  5.842 
V43 (3)  3.473  3.484  4.013  4.922  5.887  6.659  5.407 
V43 (4)  3.221  3.365  3.266  4.153  5.729  6.659  4.704 
V43 (5)  3.313  3.386  2.895  3.970  5.750  6.659  4.120 
V22 (2)  3.425  3.869  4.101  4.862  5.441  6.540  5.730 
V22 (3)  3.361  3.851  3.902  4.007  5.194  6.540  5.060 
V22 (4)  3.381  3.765  3.258  3.750  5.049  6.540  4.832 
V22 (5)  3.321  3.712  3.156  3.754  5.077  6.540  4.478 
V44 (2)  4.147  3.997  5.107  4.416  5.839  7.147  6.217 
V44 (3)  3.766  3.810  4.608  4.039  5.810  7.147  5.968 
V44 (4)  3.549  3.735  3.082  3.681  5.501  7.147  5.183 
V44 (5)  3.614  3.724  3.015  3.459  5.579  7.147  4.756 
Var (dims)  RF-PHATE  ES-Iso  KSPCA  SPCA  ES-LLE  SNMF  PLS-DA

Band 11 (2)  0.0889  0.1224  0.0918  0.0978  0.1173  0.1158  0.0958 
Band 11 (3)  0.0874  0.1217  0.0907  0.0985  0.1143  0.1209  0.0924 
Band 11 (4)  0.0873  0.1179  0.0898  0.0973  0.1157  0.1242  0.0905 
Band 11 (5)  0.0874  0.1185  0.0877  0.0972  0.1080  0.1232  0.0880 
Band 12 (2)  0.0957  0.1327  0.1119  0.1104  0.1278  0.1195  0.1158 
Band 12 (3)  0.0949  0.1309  0.1112  0.1097  0.1229  0.1248  0.1139 
Band 12 (4)  0.0946  0.1304  0.1110  0.1089  0.1214  0.1308  0.1053 
Band 12 (5)  0.0951  0.1316  0.1103  0.1088  0.1102  0.1356  0.0984 
Var (dims)  RF-PHATE  ES-Iso  KSPCA  SPCA  ES-LLE  SNMF  PLS-DA

V23 (2)  0.000  0.061  0.027  0.025  0.198  0.170  0.025 
V23 (3)  0.000  0.059  0.027  0.024  0.176  0.170  0.028 
V23 (4)  0.000  0.042  0.027  0.024  0.063  0.170  0.026 
V23 (5)  0.000  0.036  0.027  0.025  0.063  0.170  0.027 
V6 (2)  0.551  0.621  0.596  0.666  0.684  0.657  0.680 
V6 (3)  0.541  0.590  0.577  0.668  0.594  0.560  0.674 
V6 (4)  0.510  0.594  0.510  0.646  0.568  0.611  0.637 
V6 (5)  0.510  0.567  0.515  0.643  0.579  0.518  0.606 
Appendix C Alternative Assessments of Variable Importance
To show that our measure of variable importance using random forests is robust, we computed variable importance using several other methods: ROC-curve variable importance (for discrete labels) or LASSO-based importance (for continuous labels) using the caret Kuhn [2020] package in R R Core Team [2019], as well as a mutual information (MI) estimator based on Moon et al. [2017]. The results are given in Table 8. The varImp function from the caret package Kuhn [2020] used repeated 10-fold cross-validation with 3 repeats. The model used was LVQ (learning vector quantization) for discrete responses and LASSO Tibshirani [1994] for continuous responses.
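For an illustrative MI-based ranking, scikit-learn's $k$NN-based MI estimator can serve as a stand-in for the ensemble estimator of Moon et al. [2017] used here:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

# Rank variables by estimated mutual information with the class labels.
X, y = load_iris(return_X_y=True)
mi = mutual_info_classif(X, y, random_state=0)
ranking = np.argsort(mi)[::-1]  # variable indices, most informative first
```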
We note that the results are generally consistent with one another, especially when comparing the random forest results with the ROC or $R^2$ measures. While the MI rankings show slight differences on some of the datasets, they are still largely in agreement. For example, MI ranks the top three variables for the Titanic dataset as Age, Fare, and Sex, respectively. The other two methods also include Sex and Fare in their top three, albeit in different orders. Overall, the results indicate that we are selecting important variables for the experiments.
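For concreteness, the caret-based computation described above can be sketched as follows. This is a minimal, illustrative sketch on the built-in iris data, not the exact script behind Table 8; varImp on a train object falls back to model-free (ROC-based) importance when the fitted model has no importance measure of its own.

    # Sketch of the caret importance computation (illustrative only)
    library(caret)

    ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

    # Discrete response: LVQ model; importance is ROC-curve based
    fit_lvq <- train(Species ~ ., data = iris, method = "lvq", trControl = ctrl)
    print(varImp(fit_lvq))

    # Continuous response: LASSO model and its importance
    fit_lasso <- train(Sepal.Length ~ ., data = iris[, 1:4],
                       method = "lasso", trControl = ctrl)
    print(varImp(fit_lasso))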
Data Set  Rank  RF  ROC / $R^2$  MI
Iris  1  Petal Length  Petal Length  Petal Length 
Iris  2  Petal Width  Petal Width  Petal Width 
Iris  3  Sepal Length  Sepal Width  Sepal Length 
Optical Digits  1  V43  V22  V43 
Optical Digits  2  V22  V43  V22 
Optical Digits  3  V44  V31  V31 
Titanic  1  Sex  Sex  Age 
Titanic  2  Class  Fare  Fare 
Titanic  3  Fare  Class  Sex 
Ames Housing  1  Quality  Quality  N/A 
Ames Housing  2  GrLivArea  GrLivArea  N/A 
Ames Housing  3  Ext. Quality  TotalBsmtSF  N/A 
Sonar  1  Band 11  Band 11  Band 9 
Sonar  2  Band 12  Band 12  Band 11 
Sonar  3  Band 9  Band 10  Band 12 
German  1  V23  V23  V4 
German  2  V6  V10  V10 
German  3  V22  V6  V23 
Appendix D Unsupervised Results on Noisy Data
Data exploration in the presence of many noise variables provides an example use case for supervised dimensionality reduction. In this section, we include regression results for the Iris dataset with 1000 additional noise variables. The noise variables were randomly generated from Gaussian distributions with means uniformly sampled from between -1 and 1, each with a variance of 1. All variables (noise and otherwise) were scaled and centered prior to performing dimensionality reduction. The low-dimensional embeddings from 7 supervised and 7 unsupervised dimensionality reduction techniques were used as predictors, while the original data variables were used as the response for unweighted $k$-NN regression with $k$ set to the square root of the number of observations. The average RMSE using 10-fold cross-validation was recorded for each regression problem, and the experiment was repeated 10 times. The same ten data sets were used for both the supervised and unsupervised cases. A sketch of this pipeline is given below.

Tables 9 and 10 show the supervised and unsupervised results, respectively. First, we see that RF-PHATE outperforms all of the other supervised and unsupervised methods in every case, indicating that RF-PHATE is especially useful when there are many noise variables. We also observe that all of the unsupervised methods perform worse than all of the supervised methods except SNMF. Thus, supervised methods for visualization are preferred when the data contain noise variables.
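The following sketch outlines this pipeline on the Iris data. PCA stands in here for the embedding step (in the experiments the embeddings come from each method under comparison), and FNN::knn.reg performs the unweighted $k$-NN regression; both of these substitutions are assumptions for illustration.

    library(FNN)

    set.seed(1)
    x <- scale(as.matrix(iris[, 1:4]))
    n <- nrow(x)

    # 1000 noise variables: each with a mean drawn uniformly from [-1, 1], sd 1
    noise   <- sapply(runif(1000, -1, 1), function(m) rnorm(n, mean = m))
    x_noisy <- scale(cbind(x, noise))

    emb   <- prcomp(x_noisy)$x[, 1:2]    # stand-in 2-D embedding (PCA)
    k     <- round(sqrt(n))              # k = sqrt(number of observations)
    folds <- sample(rep(1:10, length.out = n))

    # Average 10-fold CV RMSE when predicting one original variable
    rmse <- sapply(1:10, function(f) {
      fit <- knn.reg(train = emb[folds != f, , drop = FALSE],
                     test  = emb[folds == f, , drop = FALSE],
                     y     = x[folds != f, "Sepal.Length"], k = k)
      sqrt(mean((fit$pred - x[folds == f, "Sepal.Length"])^2))
    })
    mean(rmse)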
Var (dims)  RF-PHATE  ES-Iso  KSPCA  SPCA  ES-LLE  SNMF  PLS-DA
petal len. (2)  0.330 ± 0.01  0.348 ± 0.01  0.340 ± 0.01  0.381 ± 0.01  0.349 ± 0.01  1.475 ± 0.03  0.387 ± 0.01
petal len. (3)  0.334 ± 0.01  0.353 ± 0.01  0.360 ± 0.01  0.400 ± 0.01  0.353 ± 0.01  2.233 ± 0.04  0.363 ± 0.01
petal len. (4)  0.337 ± 0.01  0.355 ± 0.01  0.361 ± 0.01  0.412 ± 0.01  0.352 ± 0.01  2.233 ± 0.04  0.369 ± 0.01
petal wid. (2)  0.291 ± 0.01  0.298 ± 0.01  0.297 ± 0.01  0.330 ± 0.01  0.304 ± 0.01  1.297 ± 0.02  0.330 ± 0.01
petal wid. (3)  0.292 ± 0.01  0.305 ± 0.01  0.312 ± 0.01  0.345 ± 0.01  0.308 ± 0.01  1.905 ± 0.03  0.308 ± 0.01
petal wid. (4)  0.290 ± 0.01  0.304 ± 0.01  0.314 ± 0.01  0.358 ± 0.01  0.306 ± 0.01  1.905 ± 0.03  0.321 ± 0.01
sepal len. (2)  0.459 ± 0.01  0.522 ± 0.02  0.514 ± 0.02  0.514 ± 0.02  0.528 ± 0.02  0.885 ± 0.02  0.514 ± 0.02
sepal len. (3)  0.455 ± 0.01  0.536 ± 0.02  0.540 ± 0.02  0.520 ± 0.02  0.534 ± 0.02  1.272 ± 0.02  0.516 ± 0.02
sepal len. (4)  0.456 ± 0.01  0.537 ± 0.02  0.541 ± 0.02  0.522 ± 0.02  0.533 ± 0.02  1.272 ± 0.02  0.517 ± 0.02
sepal wid. (2)  0.320 ± 0.01  0.348 ± 0.01  0.351 ± 0.01  0.357 ± 0.01  0.349 ± 0.01  0.462 ± 0.01  0.357 ± 0.01
sepal wid. (3)  0.318 ± 0.01  0.354 ± 0.01  0.358 ± 0.01  0.356 ± 0.01  0.354 ± 0.01  0.501 ± 0.02  0.357 ± 0.01
sepal wid. (4)  0.321 ± 0.01  0.355 ± 0.01  0.361 ± 0.01  0.357 ± 0.01  0.354 ± 0.01  0.501 ± 0.02  0.358 ± 0.01
Var (dims)  PHATE  Isomap  PCA  LLE  NMF  MDS  t-SNE
petal len. (2)  1.393 ± 0.03  1.398 ± 0.03  1.366 ± 0.03  1.365 ± 0.03  1.768 ± 0.03  1.322 ± 0.03  1.355 ± 0.03
petal len. (3)  1.372 ± 0.03  1.393 ± 0.03  1.366 ± 0.03  1.339 ± 0.03  1.539 ± 0.03  1.342 ± 0.03  1.355 ± 0.03
petal len. (4)  1.355 ± 0.03  1.365 ± 0.03  1.366 ± 0.03  1.340 ± 0.03  1.388 ± 0.03  1.307 ± 0.03  1.328 ± 0.03
petal wid. (2)  1.196 ± 0.02  1.181 ± 0.02  1.120 ± 0.02  1.159 ± 0.02  1.502 ± 0.03  1.119 ± 0.02  1.164 ± 0.02
petal wid. (3)  1.170 ± 0.02  1.183 ± 0.02  1.120 ± 0.02  1.157 ± 0.02  1.300 ± 0.02  1.113 ± 0.02  1.156 ± 0.02
petal wid. (4)  1.163 ± 0.02  1.172 ± 0.02  1.120 ± 0.02  1.159 ± 0.02  1.177 ± 0.02  1.103 ± 0.02  1.141 ± 0.02
sepal len. (2)  0.847 ± 0.02  0.848 ± 0.02  0.816 ± 0.02  0.832 ± 0.02  0.979 ± 0.02  0.809 ± 0.02  0.833 ± 0.02
sepal len. (3)  0.830 ± 0.02  0.847 ± 0.02  0.816 ± 0.02  0.826 ± 0.02  0.904 ± 0.02  0.816 ± 0.02  0.821 ± 0.02
sepal len. (4)  0.822 ± 0.02  0.840 ± 0.02  0.816 ± 0.02  0.828 ± 0.02  0.844 ± 0.02  0.801 ± 0.02  0.813 ± 0.02
sepal wid. (2)  0.450 ± 0.01  0.445 ± 0.01  0.439 ± 0.01  0.452 ± 0.01  0.523 ± 0.02  0.444 ± 0.01  0.441 ± 0.01
sepal wid. (3)  0.446 ± 0.01  0.451 ± 0.01  0.439 ± 0.01  0.440 ± 0.01  0.463 ± 0.01  0.440 ± 0.01  0.449 ± 0.01
sepal wid. (4)  0.445 ± 0.01  0.443 ± 0.01  0.439 ± 0.01  0.444 ± 0.01  0.442 ± 0.01  0.437 ± 0.01  0.445 ± 0.01
Appendix E Parameter Selection Robustness
In this section we show that RF-PHATE is robust to parameter tuning. Random forests are resistant to overfitting: increasing the number of trees can only improve random forest performance Cutler et al. [2012], and the random forest generalization error converges almost surely to a limit Breiman [2001]. By default, random forests use fully grown trees. In the context of proximity generation, decreasing the number of terminal nodes (i.e., pruning the trees) or setting a minimum node size inflates the proximity values, since the probability of two observations falling into the same terminal node increases as the number of terminal nodes decreases. This is demonstrated in Figure 7 and illustrated in the sketch below.
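A small illustration of this inflation effect, assuming the randomForest package and using its nodesize argument to coarsen the trees (this sketch is not the code behind Figure 7):

    library(randomForest)

    set.seed(1)
    # Default: fully grown trees (nodesize = 1 for classification)
    rf_deep    <- randomForest(iris[, 1:4], iris$Species, proximity = TRUE)
    # Coarsened trees: fewer, larger terminal nodes
    rf_shallow <- randomForest(iris[, 1:4], iris$Species, proximity = TRUE,
                               nodesize = 50)
    mean(rf_deep$proximity)      # lower average pairwise proximity
    mean(rf_shallow$proximity)   # inflated average pairwise proximity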
It has been shown empirically Cutler et al. [2012], Díaz-Uriarte and Alvarez de Andrés [2006] that the only parameter to which random forests are sensitive is the number of variables randomly selected as split candidates at each node (the parameter mtry). The randomForest Liaw and Wiener [2002] default for classification is $\sqrt{p}$, where $p$ is the number of predictor variables; for regression, the default is $p/3$. We selected values of mtry centered at the default to test RF-PHATE's robustness to this parameter.
Diffusion-based dimensionality reduction is sensitive to the number of time steps, $t$, used when conducting the random walk across all possible transitions Moon et al. [2019], Duque et al. [2019], Gigante et al. [2019], Horoi et al. [2020]. We therefore test this parameter as well. The optimal $t$, as selected using the von Neumann entropy (VNE) of the diffusion operator, serves as a center point for the values of $t$ considered in the parameter tuning. The results are shown in Figure 8. For each of the datasets in this figure, we ran RF-PHATE over ranges of mtry and $t$ centered about the default values. We recorded the error rate (for categorical variables) or RMSE (for continuous variables) when regressing on the top variable using the 2-dimensional RF-PHATE embeddings as features. This is the same metric used in Tables 1 and 2.
The results show that RF-PHATE is not sensitive to mtry or $t$ with regard to capturing the important variables in a low-dimensional space with a categorical response. However, we have observed that high values of $t$ (much greater than the optimally selected $t$) tend to visually collapse clusters. We therefore recommend using the VNE-selected default choice of $t$, or a value close to it. A toy sketch of the VNE selection follows.
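The sketch below illustrates the VNE heuristic for choosing $t$. As an assumption for illustration, the diffusion operator is built from a plain Gaussian kernel rather than from random forest proximities, so only the selection step itself is shown; eigenvalues are taken from the symmetric conjugate of the operator, which shares its spectrum.

    set.seed(1)
    x  <- scale(as.matrix(iris[, 1:4]))
    d2 <- as.matrix(dist(x))^2
    K  <- exp(-d2 / median(d2))          # Gaussian affinities
    # Spectrum of P = D^{-1} K via the symmetric conjugate D^{-1/2} K D^{-1/2}
    Dh  <- 1 / sqrt(rowSums(K))
    lam <- eigen(K * outer(Dh, Dh), symmetric = TRUE)$values
    lam <- pmax(lam, 0)                  # clip small negative eigenvalues

    vne <- sapply(1:60, function(t) {
      eta <- lam^t / sum(lam^t)          # normalized powered spectrum
      eta <- eta[eta > 1e-12]
      -sum(eta * log(eta))               # von Neumann entropy at time t
    })
    plot(1:60, vne, type = "l", xlab = "t", ylab = "VNE")  # pick t at the knee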
Appendix F Embedding Random Forest Proximities with Other Methods
Because of its ability to accurately preserve both local and global structure, we chose to adapt the PHATE algorithm to embed the random forest proximities. Here we show the results of embedding the proximities using other dimensionality reduction methods, which meet with varying success. In addition to our proposed method, we embedded the proximities using Isomap, LLE, and MDS; each variant is denoted RF-"method".
RF-Isomap gives meaningful results on small, low-dimensional data sets, even in the presence of noise variables, but less desirable results on higher-dimensional data sets (see Figure 9), while RF-LLE does not perform well without significant parameter tuning. On higher-dimensional data sets, including those with noise variables, RF-MDS does not capture the geometric (global) structure of the data. RF-Isomap captures more of the global structure but does not sufficiently denoise the data. In contrast, RF-PHATE shows both the global and local structure of the data, even when many of the variables are irrelevant to the supervised task.
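As a concrete example, the RF-MDS variant can be sketched as follows, assuming the common $\sqrt{1 - \mathrm{proximity}}$ conversion from proximities to dissimilarities; the Isomap and LLE variants apply those methods to the same dissimilarities in place of classical MDS.

    library(randomForest)

    set.seed(1)
    rf  <- randomForest(iris[, 1:4], iris$Species, proximity = TRUE)
    d   <- sqrt(1 - rf$proximity)        # proximity -> dissimilarity
    emb <- cmdscale(as.dist(d), k = 2)   # classical MDS: the RF-MDS embedding
    plot(emb, col = iris$Species, pch = 19, xlab = "MDS 1", ylab = "MDS 2")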
Appendix G Computational Environment
All computations were performed using an Intel Xeon CPU E5-1650 v2 @ 3.50 GHz with 6 cores. The random forest proximities were computed using R version 3.6.2. All other computations were done in MATLAB 2019b.