The random forest (RF) algorithm (Breiman 1996) was one of the first ensemble classifiers developed. It combines the predictions from individual classification and regression trees (CART) (Breiman et al. 1984), built by bagging observations (Breiman 1996). It also samples variables at each tree node. These produce diagnostics in the form of uncertainty in predictions for each observation, importance of variables for the prediction, predictive error for future samples based on out-of-bag (OOB) case predictions, and similarity of observations based on how often they group together in the trees.
The basic ideas behind the random forest can be applied to virtually any type of model. The benefits for classification are reduced variability in predictive error, and the suite of diagnostics provides the potential for a better understanding of the class structure in the high-dimensional data space. The use of visualization on these diagnostics, in association with multivariate data plots, completes the process to support a better understanding of the underlying problem.
A conceptual framework for model visualization can be summarized in three strategies: (1) visualize the model in the data space, (2) look at all members of a collection of models and (3) explore the complete process of model fitting (Wickham et al. 2015). The first strategy is to explore how well the model captures the data characteristics (model in the data space), which contrasts with determining if the model assumptions hold (data in the model space). The second strategy is to look at a group of models instead of only the best. This strategy can offer a broad understanding of the problem by comparing and contrasting possible models. The last strategy focuses on the exploration of the process of the model fit in addition to the end result.
There has been some, but not much, research on visualizing classification models. Urbanek (2008) presents interactive tree visualization implemented in the Java software KLIMT, which includes zooming, selection, multiple views, interactive pruning and tree construction, as well as the interactive analysis of forests of trees using treemaps. Cutler and Breiman (2011) developed a Java package called RAFT to visualize a forest classifier, which includes variable selection, parallel coordinate plots, heat maps and scatter plots of some diagnostics; linking between plots is limited. Quach (2012) presents interactive forest visualization using the R package iPlots eXtreme (Urbanek 2011), where several displays are shown in one window with some linking between them available. Silva and Ribeiro (2016) describe visualizing components of an ensemble classifier.
This paper describes structuring interactive graphics to facilitate visual exploration of ensemble classifiers, using RFs and projection pursuit forests (PPF) (da Silva et al. 2017) as examples. The PPF algorithm builds on the projection pursuit tree (PPtree) (Lee et al. 2013) algorithm, which uses projection pursuit at each tree node to find the best linear combination of variables to separate the classes. The visualization approach is consistent with the framework in Wickham et al. (2015), and the implementation is built on the newest interactive graphics available in R. The purpose is to provide readily available tools for users to explore and improve ensemble fits, and obtain an intuition for the underlying class structure in data. Interactive plots are a key component for model visualization that help the user see multivariate relationships and be more efficient in the model diagnosis. Multiple levels of data are constructed for exploration: observation, model and ensemble summaries.
2 Diagnostics in forest classifiers
The diagnostics typically available are:
OOB error: For each model in the ensemble, some cases of the original data are not used in the fit. Predicting the response for these out-of-bag cases gives a better estimate of the error of the model with future data. The OOB error rate is computed for each model that is combined in the ensemble, and is used to provide the overall error of the ensemble.
Uncertainty measure for each observation: Across individual (classification) models we can compute the proportion of times that a case is predicted to be each class. If a case is always predicted to be the true class, there is no uncertainty about that observation. Cases that are predicted to be multiple classes, in varying proportions, are difficult to classify. They may be important by indicating neighborhoods of the data space that would benefit from a more complex model, or more simply, they may be measurement errors in the data.
Variable importance: Each model uses samples of variables. With this, the accuracy of the models can be compared when the variable is included or omitted. There are several versions of statistics that use this to provide a measure of the variable importance for prediction.
Similarity measure for pairs of observations: In each model, each pair of observations will be either in the same terminal node or not. This is used to compute a proximity matrix. Cluster analysis on this matrix can be used to follow up the classification to assess the original labeling. It may suggest improvements or mistakes in original labels.
In addition to these overall ensemble statistics, each component model has its own diagnostics, measuring error, variables utilized, and class predictions. Visualization enables the individual models to be examined, and to be related to the data and to their contribution to the ensemble.
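The first two diagnostics can be made concrete with a small sketch. It is written in Python rather than R and is not the PPforest code; the function names, the toy labels and the four hypothetical trees are assumptions made only for illustration. Each tree contributes predictions only for the cases that were out-of-bag for it.

```python
from collections import Counter

def oob_vote_matrix(oob_predictions, classes):
    """Proportion of OOB votes per class for every case.

    oob_predictions[t] maps a case id to the class predicted by tree t,
    and only contains the cases that were out-of-bag for tree t.
    """
    counts = {}
    for tree_pred in oob_predictions:
        for case, label in tree_pred.items():
            counts.setdefault(case, Counter())[label] += 1
    votes = {}
    for case, tally in counts.items():
        total = sum(tally.values())
        votes[case] = [tally[k] / total for k in classes]
    return votes

def oob_error(votes, truth, classes):
    """Share of cases whose most-voted class differs from the true class."""
    wrong = 0
    for case, props in votes.items():
        predicted = classes[props.index(max(props))]
        wrong += predicted != truth[case]
    return wrong / len(votes)

# Toy example: two classes, four cases, four hypothetical trees.
classes = ["A", "B"]
truth = {0: "A", 1: "A", 2: "B", 3: "B"}
oob_predictions = [
    {0: "A", 2: "B"},
    {1: "B", 2: "B", 3: "A"},
    {0: "A", 1: "A", 3: "A"},
    {1: "A", 3: "B"},
]
votes = oob_vote_matrix(oob_predictions, classes)
error = oob_error(votes, truth, classes)
```

In this toy example, case 0 has no uncertainty (all of its votes are for class A), while cases 1 and 3 split their votes two to one, which is the kind of pattern the uncertainty displays are designed to expose.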
3 Mapping ensemble diagnostics to visual components
This section describes the mapping of diagnostics to visualizations. These are illustrated using the Australian crabs data (Campbell and Mahon 1974). The data has 200 cases, 5 predictors and 4 classes (combinations of species and sex: blue male, blue female, orange male and orange female). The predictors are: FL (frontal lobe length, in mm), RW (rear width, in mm), CL (length of mid-line of the carapace, in mm), CW (maximum width of carapace, in mm), BD (depth of the body; for females, measured after displacement of the abdomen, in mm). This is old data but it provides a good illustration of the visual methods.
3.1 Individual models: PPtree
The PPF is composed of individual projection pursuit trees. Figure 1 shows a visual ensemble of plots of a tree model on the crab data. There are three nodes for the four-class problem. The nodes of this tree are based on projections of the data, the coefficients of which form the building blocks to calculate the variable importance. The density plot displays the data projection at each node, and the mosaic plot shows the confusion matrix for the nodes. The package PPtreeViz provides visual tools to diagnose a PPtree model. The PPF displays build on these, with small modifications. The PPtree model is simpler than a regular classification tree, because the classes are mostly separated by combinations of variables; just three projections are needed to see the differences between the four classes.
3.2 Variable importance
The PPF algorithm calculates variable importance in two ways: (1) permuted importance using accuracy, and (2) importance based on projection coefficients of standardized variables.
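The permuted variable importance is comparable with the measure defined in the classical random forest algorithm. It is computed using the OOB sample $\bar{B}^{(t)}$ for the tree $t$, for each predictor variable $X_j$. Then the permuted importance of the variable $X_j$ in the tree $t$ can be defined as:

$$IMP^{(t)}(X_j) = \frac{\sum_{i \in \bar{B}^{(t)}} \left[ I(y_i = \hat{y}_i^{(t)}) - I(y_i = \hat{y}_{i,P_j}^{(t)}) \right]}{|\bar{B}^{(t)}|}$$

where $y_i$ is the true class of observation $i$, $\hat{y}_i^{(t)}$ is the predicted class for observation $i$ in tree $t$ and $\hat{y}_{i,P_j}^{(t)}$ is the predicted class for observation $i$ in tree $t$ after permuting the values for variable $X_j$. The global permuted importance measure is the average importance over all the trees in the forest. This measure compares the accuracy of classifying OOB observations using the original values of the variable with the accuracy obtained after its values have been permuted (made nonsense).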
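The permuted importance for a single tree can be sketched as the drop in OOB accuracy after shuffling one variable. This is a minimal Python illustration, not the PPforest implementation; `toy_tree` is a hypothetical stand-in for a fitted tree, and the data values are made up.

```python
import random

def permuted_importance(predict, X_oob, y_oob, j, rng):
    """Drop in OOB accuracy of one tree after permuting variable j."""
    n = len(y_oob)
    correct = sum(predict(x) == y for x, y in zip(X_oob, y_oob))
    # Shuffle the values of variable j across the OOB cases.
    shuffled = [x[j] for x in X_oob]
    rng.shuffle(shuffled)
    X_perm = [x[:j] + [v] + x[j + 1:] for x, v in zip(X_oob, shuffled)]
    correct_perm = sum(predict(x) == y for x, y in zip(X_perm, y_oob))
    return (correct - correct_perm) / n

def toy_tree(x):
    """Hypothetical fitted tree that splits on variable 0 only."""
    return "A" if x[0] < 0.5 else "B"

X_oob = [[0.1, 3.0], [0.2, 1.0], [0.8, 2.0], [0.9, 4.0]]
y_oob = ["A", "A", "B", "B"]
rng = random.Random(1)
imp_used = permuted_importance(toy_tree, X_oob, y_oob, 0, rng)
imp_unused = permuted_importance(toy_tree, X_oob, y_oob, 1, rng)
```

Permuting a variable the tree never uses leaves every prediction unchanged, so `imp_unused` is exactly zero; `imp_used` depends on the particular shuffle. Averaging these per-tree values over all trees gives the global measure.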
For the second importance measure, the coefficients of each projection are examined. The magnitude of these values indicates importance, if the variables have been standardized. The variable importance for a single tree is computed as a weighted sum of the absolute values of the coefficients across nodes. The weights take the number of classes in each node into account (Lee et al. 2013). Then the importance of the variable $X_j$ in the PPtree $t$ can be defined as:

$$IMP_{pptree}^{(t)}(X_j) = \sum_{nd=1}^{nn} w_{nd} \, |\alpha_{nd,j}^{(t)}|$$

where $\alpha_{nd,j}^{(t)}$ is the projected coefficient for node $nd$ and variable $j$, $w_{nd}$ is the weight for node $nd$, and $nn$ is the total number of node partitions in the tree $t$.
The global variable importance in a PPforest then can be defined in different ways. The most intuitive is the average variable importance from each PPtree across all the trees in the forest.
Alternatively, we have defined a global importance measure for the forest as a weighted mean of the absolute values of the projection coefficients across all nodes in every tree. The weights are based on the projection pursuit index in each node, and on 1 - (OOB error of each tree).
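The per-tree weighted sum and the simple forest average described above can be sketched as follows. This is an illustrative Python fragment, not the PPforest code: the coefficient values and the node weights are placeholders, and the actual weights (based on the number of classes per node, or on the projection pursuit index and OOB accuracy) are not reproduced here.

```python
def tree_importance(coefs, weights):
    """Weighted sum over nodes of |projection coefficient| for each variable.

    coefs[nd][j] is the coefficient of variable j at node nd;
    weights[nd] is the (placeholder) weight of node nd.
    """
    p = len(coefs[0])
    return [sum(w * abs(node[j]) for node, w in zip(coefs, weights))
            for j in range(p)]

def forest_importance(per_tree):
    """Average the per-tree importance of each variable over all trees."""
    n_trees = len(per_tree)
    p = len(per_tree[0])
    return [sum(tree[j] for tree in per_tree) / n_trees for j in range(p)]

# Two hypothetical trees on two standardized variables.
tree_1 = tree_importance([[0.5, -0.5], [1.0, 0.0]], weights=[0.5, 1.0])
tree_2 = tree_importance([[0.0, 1.0]], weights=[1.0])
forest = forest_importance([tree_1, tree_2])
```

The absolute value matters: the sign of a projection coefficient only indicates direction, so a coefficient of -0.5 contributes as much as one of 0.5.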
Figure 2 shows the absolute projection coefficient of the top three nodes for all the trees in a forest model. This information is displayed by a side-by-side jittered dot plot. The red dots correspond to the absolute coefficient values for the tree model of Figure 1. The forest was built using random samples of two variables for each node, hence there are two coefficients for each node. At node 1, BD has a high value and CW contributes much less. The scatterplot at right shows these two variables and the resulting boundary between groups that this would produce. Node 2 uses CL and RW, and RW contributes the most to the separation. The plot at right shows the boundary that is induced. Node 3 uses FL and RW, and the two variables contribute much more evenly. Different decision rules can be defined for each tree in the forest; the resulting boundaries on the previous plots are based on Rule 1, $c = (m_L + m_R)/2$, where $m_L$ and $m_R$ are the means of the left and right groups at each node.
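A split of this kind, where the cutoff is the midpoint between the projected means of the two groups, can be sketched in a few lines of Python. The projection vector and the data points are hypothetical, and the function names are chosen for illustration only.

```python
def project(x, alpha):
    """Project one observation onto the direction alpha."""
    return sum(a * v for a, v in zip(alpha, x))

def midpoint_cutoff(proj_left, proj_right):
    """Midpoint between the means of the projected left and right groups."""
    m_left = sum(proj_left) / len(proj_left)
    m_right = sum(proj_right) / len(proj_right)
    return (m_left + m_right) / 2

alpha = [0.6, 0.8]                      # hypothetical unit-length projection
left = [[1.0, 1.0], [3.0, 1.0]]         # hypothetical cases of the left group
right = [[5.0, 5.0], [7.0, 5.0]]        # hypothetical cases of the right group
cutoff = midpoint_cutoff([project(x, alpha) for x in left],
                         [project(x, alpha) for x in right])
# Assumes the left group has the smaller projected mean.
side = "left" if project([2.0, 2.0], alpha) < cutoff else "right"
```

Because the cutoff is applied to a linear combination of two variables, the induced boundary in a scatterplot of those variables is a straight line perpendicular to the projection direction.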
3.3 Similarity of cases
For each tree, every pair of observations can be in the same terminal node or not. Tallying this up across all trees in a forest gives the proximity matrix, an $n \times n$ matrix of the proportion of trees in which each pair shares a terminal node. It can be considered to be a similarity matrix.
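The tallying can be sketched as below, assuming the terminal node of every case in every tree is already known. This is a minimal Python illustration with made-up node labels, not the PPforest implementation.

```python
def proximity_matrix(terminal_nodes):
    """Proportion of trees in which each pair of cases shares a terminal node.

    terminal_nodes[t][i] is the terminal node of case i in tree t.
    """
    n_trees = len(terminal_nodes)
    n = len(terminal_nodes[0])
    counts = [[0] * n for _ in range(n)]
    for nodes in terminal_nodes:
        for i in range(n):
            for k in range(n):
                if nodes[i] == nodes[k]:
                    counts[i][k] += 1
    return [[c / n_trees for c in row] for row in counts]

# Three cases in four hypothetical trees.
terminal_nodes = [
    [1, 1, 2],
    [1, 1, 1],
    [1, 2, 2],
    [3, 3, 4],
]
prox = proximity_matrix(terminal_nodes)
# A dissimilarity (1 - proximity) is the usual input for scaling methods.
dissim = [[1 - v for v in row] for row in prox]
```

The matrix is symmetric with ones on the diagonal, since every case always shares a terminal node with itself.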
Multidimensional scaling (MDS) is used to reduce the dimension of this matrix, to view the similarity between observations. MDS transforms the data set into a low-dimensional space where the distances are approximately the same as in the full dimensions. With $G$ groups, the low-dimensional space should be no more than $G-1$ dimensions. Figure 3 shows the MDS plots for the 3D space induced by the four groups of the crab data. Color indicates the true species and sex. For this data two dimensions are enough to see the four groups separated quite well. Some crabs are clearly more similar to a different group, though, especially in examining the sex differences.
3.4 Uncertainty of cases
The vote matrix ($n \times G$) contains the proportion of times each observation was classified to each class while out-of-bag (OOB). Two approaches to visualize the vote matrix information are used.
A ternary plot is a triangular diagram used to display compositional data with three components. More generally, compositional data can have any number of components, say $p$, and hence is constrained to a $(p-1)$-D simplex in $p$-space. The vote matrix is an example of compositional data, with $G$ components.
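One way to obtain coordinates in the lower-dimensional simplex is to project each row of the vote matrix onto an orthonormal basis that is orthogonal to the vector of ones. The Python sketch below uses a Helmert-type basis for this; that choice and the function names are assumptions for illustration, and the coordinates used in the app may be scaled or rotated differently.

```python
import math

def helmert_rows(p):
    """p - 1 orthonormal rows, each orthogonal to the vector of ones."""
    rows = []
    for i in range(1, p):
        scale = 1.0 / math.sqrt(i * (i + 1))
        rows.append([scale] * i + [-i * scale] + [0.0] * (p - i - 1))
    return rows

def simplex_coords(votes):
    """Map p vote proportions (summing to 1) to p - 1 coordinates."""
    p = len(votes)
    return [sum(h * v for h, v in zip(row, votes)) for row in helmert_rows(p)]

# Four classes, as in the crab data: points live in a 3D tetrahedron.
centre = simplex_coords([0.25, 0.25, 0.25, 0.25])
vertex_a = simplex_coords([1.0, 0.0, 0.0, 0.0])
vertex_b = simplex_coords([0.0, 1.0, 0.0, 0.0])
edge_length = math.dist(vertex_a, vertex_b)
```

A case with equal votes for every class lands at the centre of the simplex, and a case that is always predicted as one class lands on a vertex. All vertices are the same distance apart, so no class is visually favoured.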
Figure 4 shows the tetrahedron structure for the crab vote matrix shown in three pairwise views. With well-separated classes, the colored points will each concentrate into one of the vertices. The crab data is close to this ideal but not perfect, indicating that some crabs are commonly incorrectly predicted.
Because visualizing the vote matrix with a $(G-1)$-D tetrahedron requires dynamic graphics, a low-dimensional option is also provided. For each class, each case has a value between 0 and 1. A side-by-side jittered dotplot is used for the display, where class is displayed on one axis and proportion is displayed on the other. For each dotplot, the ideal arrangement is that points are concentrated at 0 or 1, and only at 1 for their true class. This data is close to the ideal but not perfect, e.g. there are a few blue male crabs (orange) that are frequently predicted to be blue females (green), and a few blue female crabs predicted to be another class.
4 Interactive web app
Interaction is added to the plots described in Section 3 and other plots, and they are organized into an interactive web app using shiny (Chang et al. 2015) for exploring the ensemble model. The app is organized into three tabs, individual cases, models, and performance comparison, to provide a model diagnostic tool. Interaction is provided as mouse-over labeling, mouse-click selection, and brushing, with results linked across multiple plots. The app takes advantage of new tools provided in the plotly (Sievert et al. 2017) package, developed as a part of Sievert’s PhD thesis research (Sievert 2017).
As Sievert (2017) describes, one of the biggest difficulties in managing the linking between plots in the app is the data structure management for each widget. Each widget has its own data structure and interaction. Putting them into the structure of a shiny app facilitates access to the widget data, and coordinates selections across multiple plots.
The fishcatch data (Puranen 2017) is used to illustrate the shiny app characteristics. It contains 159 observations, with 6 physical measurement variables, and 7 types of fish, all caught from the same lake (Laengelmavesi) near Tampere in Finland. There are 35 bream, 11 parkki, 56 perch, 17 pike, 20 roach, 14 smelt and 6 whitefish. The shiny app showing fishcatch data can be accessed at https://natydasilva.shinyapps.io/shinyppforest.
4.1 Individual cases
This tab is designed to examine uncertainty in the classification of observations, and also to explore the similarity between pairs of observations. The data feeding the display is a data frame with one row per observation, containing the original data, and the model statistics generated from the full vote matrix, along with its generalized ternary coordinates, and the first two MDS projections of the proximity matrix. Figure 6 shows the arrangement of plots. The plots in the tab are (1) a parallel coordinate plot (PCP) of the data, (2) the MDS display of the proximity matrix, (3) a side-by-side jittered dotplot and (4) a generalized ternary plot of the vote matrix. Each of these plots is interactive in the sense that each one presents individual interactions (mouse-over), and they are linked so that selections in one display are propagated to the other plots (clicking and selecting).
This selection of plots enables aspects of the model, relating to performance for individual cases, to be examined in the data space. The data plot is an essential element following the model-in-the-data-space philosophy of Wickham et al. (2015). The choice was made to use a parallel coordinate plot because it provides a space-efficient display of the data. Alternatives include the tour, a dynamic plot, or a scatterplot matrix. Theoretically, any of these could be substituted or added.
The diagram in Figure 7 illustrates the data pipeline (Buja et al. 1988; Wickham et al. 2009) for the interactive graphics in the case level tab. Solid lines indicate notifications from the source data to the plots, and dashed lines indicate notifications of user actions on a plot, which tell the data source what action to take. The data table is a reactive object that has a listener associated with it. Each of the plots is reactive, and has numerous listeners. When users make selections on a plot, either by clicking or group selection, a change to the data is made in terms of an update on the selected cases. This notifies the other plots to re-draw themselves. The linking between plots is effectively one-to-one, based on the row id of the data. The side-by-side jittered dotplot has one point per case for each class, but selection can only be done within a dotplot. Selecting in one of the dotplots notifies the data table of the selection, which triggers a re-draw of the other dotplots. Mouse-overs on a plot pull additional information about the point or line under the cursor but are not linked between plots.
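The notification pattern in this pipeline can be sketched independently of shiny and plotly. The Python classes below are hypothetical and only mimic the idea: a shared data source holds the selected row ids, a plot reports a user selection to the source, and the source tells every other plot to re-draw.

```python
class SelectionModel:
    """Shared data source: holds selected row ids and notifies listeners."""

    def __init__(self):
        self.selected = set()
        self._listeners = []

    def subscribe(self, listener):
        self._listeners.append(listener)

    def select(self, row_ids, source=None):
        self.selected = set(row_ids)
        for listener in self._listeners:
            if listener is not source:      # the originating plot is up to date
                listener.redraw(self.selected)


class Plot:
    """A linked view; linking is one-to-one on row id."""

    def __init__(self, name, model):
        self.name = name
        self.model = model
        self.highlighted = set()
        self.redraws = 0
        model.subscribe(self)

    def brush(self, row_ids):
        """User action on this plot: report the selection to the data source."""
        self.highlighted = set(row_ids)
        self.model.select(row_ids, source=self)

    def redraw(self, selected):
        """Notification from the data source: highlight the selected rows."""
        self.highlighted = set(selected)
        self.redraws += 1


model = SelectionModel()
pcp = Plot("pcp", model)
mds = Plot("mds", model)
ternary = Plot("ternary", model)
mds.brush({3, 7})
```

After the brush on the MDS view, the other two views highlight rows 3 and 7 and have each been re-drawn once, while the MDS view itself is not re-drawn.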
Two alternatives can be selected in shiny to draw the parallel coordinate plot: parallel or enhanced. Parallel draws the classic PCP and enhanced draws a modified version where variables are repeated (Hurley and Oldford 2011). Because reading a PCP is really only possible for neighboring variables, the variables are repeated so that all variables are neighboring.
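An axis ordering in which every pair of variables is adjacent at least once can be built in several ways. Hurley and Oldford (2011) derive such orderings from Eulerian tours and Hamiltonian decompositions; the Python sketch below uses a standard zigzag construction as a stand-in, and is not the PairViz implementation. Function names are illustrative.

```python
from itertools import combinations

def zigzag_path(start, n):
    """Visit start, start+1, start-1, start+2, start-2, ... modulo n."""
    path = [start % n]
    for step in range(1, n):
        offset = (step + 1) // 2
        sign = 1 if step % 2 == 1 else -1
        path.append((start + sign * offset) % n)
    return path

def all_pairs_order(n_vars):
    """Axis order (with repeats) in which every pair of variables is adjacent."""
    # For an even count, n/2 zigzag paths cover every pair exactly once.
    # An odd count is padded with a dummy variable that is then dropped.
    n = n_vars if n_vars % 2 == 0 else n_vars + 1
    order = []
    for start in range(n // 2):
        order.extend(v for v in zigzag_path(start, n) if v < n_vars)
    return order

def adjacent_pairs(order):
    """Set of unordered pairs that are neighbours in the axis order."""
    return {frozenset(pair) for pair in zip(order, order[1:])
            if pair[0] != pair[1]}

# Six variables, as in the fishcatch data: 18 axes are drawn.
order = all_pairs_order(6)
all_pairs = {frozenset(c) for c in combinations(range(6), 2)}
covers_all = adjacent_pairs(order) == all_pairs
```

The cost of the enhanced display is width: six variables need 18 axes instead of 6, in exchange for every pairwise relationship being readable between neighbouring axes.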
4.2 Models
This second tab in the app focuses on teasing apart the forest to examine the qualities of each tree. For each tree, information on the variable importance, the projections used at each node, and the OOB error is available. The data feeding into this tab is a list of models, along with the original data frame. The tree id is displayed when we mouse over the jittered side-by-side plot. This information is useful because, based on their accuracy, some trees could be pruned from the forest outside of the app.
Figure 8 is a screenshot of the models tab. There are five plots, with varying levels of interaction: (1) a jittered side-by-side dotplot showing variable importance for the top three nodes of all trees in the forest, (2) a static display of one tree, (3) a boxplot of OOB error for all trees, (4) a faceted density plot of projected data at each node of the tree, with the split point indicated by a vertical line, and (5) a mosaic plot showing the confusion matrix for each node of the tree. The interaction is driven from the variable importance plot: when the user selects a point in that display, the corresponding tree, density displays and mosaic plots are drawn. The tree plot from the PPtreeViz package is used to visualize the selected tree structure. Also highlighted are the variable importance values for each variable for each of the top three nodes, and the OOB error value for the tree on the boxplot.
The diagram in Figure 9 illustrates the data pipeline for interactive graphics. The data source is a PPforest object. Interaction is driven by the variable importance plot. Selecting a point triggers a change in the data, which cascades to re-draws of the other displays. Each plot has some information available on mouse over.
4.3 Performance comparison
The third tab (Figure 10) examines the PPF fit, and compares the result with an RF fit. There are four displays for each type of model: (1) variable importance for all trees in the forest (same as in the models tab), (2) a receiver operating characteristic (ROC) curve comparing sensitivity and specificity for each class, (3) OOB error by number of trees, to assess complexity, and (4) overall variable importance. There is very little interaction on this tab. Users can select to focus on a subset of classes, or choose the importance measure to show. Being able to focus on a class can help to better understand how well the model performs across classes, and can be especially useful for unbalanced data. Examining the OOB error by number of trees enables an assessment of how few trees might be used to provide an equally accurate prediction of future data.
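The OOB error by number of trees can be sketched as a running majority vote: after each additional tree, the error is recomputed over the cases that have received at least one OOB vote so far. This is an illustrative Python fragment with made-up predictions, not the code behind the app.

```python
from collections import Counter

def oob_error_by_trees(oob_predictions, truth):
    """Cumulative OOB error after 1, 2, ..., B trees (majority vote).

    oob_predictions[t] maps a case id to the class predicted by tree t,
    for the cases that were out-of-bag for tree t. Ties are broken in
    favour of the class that was voted for first.
    """
    tallies = {case: Counter() for case in truth}
    errors = []
    for tree_pred in oob_predictions:
        for case, label in tree_pred.items():
            tallies[case][label] += 1
        voted = [case for case in truth if tallies[case]]
        wrong = sum(tallies[case].most_common(1)[0][0] != truth[case]
                    for case in voted)
        errors.append(wrong / len(voted))
    return errors

truth = {0: "A", 1: "A", 2: "B", 3: "B"}
oob_predictions = [
    {0: "A", 2: "B"},
    {1: "B", 2: "B", 3: "A"},
    {0: "A", 1: "A", 3: "A"},
    {1: "A", 3: "B"},
]
errors = oob_error_by_trees(oob_predictions, truth)
```

With only four toy trees the curve is erratic; in a real forest it typically levels off, and the point where it does so suggests how few trees would give an equally accurate prediction.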
The ROC is used to summarize the trade-off between sensitivity and specificity. The plot shows the sensitivity and specificity when a parameter of the classifier is varied (Hastie et al. 2011). The specificity and sensitivity were computed with the pROC package. If more than two classes are available, a multi-class ROC analysis is needed. Several solutions have been proposed for multi-class ROC. Some of the proposed solutions reduce the multi-class problem to a set of binary problems. The approach used for a multi-class ROC analysis in this paper is called one-against-all (Allwein et al. 2000).
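The one-against-all reduction can be sketched as follows: for each class in turn, that class is treated as positive and all others as negative, and the vote proportion for the class is thresholded. This Python fragment is only an illustration of the idea with made-up vote proportions; it does not reproduce the pROC package, which was used for the computations in the paper.

```python
def roc_points(scores, is_positive):
    """(threshold, sensitivity, specificity) at every observed score."""
    n_pos = sum(is_positive)
    n_neg = len(is_positive) - n_pos
    points = []
    for thr in sorted(set(scores), reverse=True):
        # A case is called positive when its score reaches the threshold.
        tp = sum(s >= thr and pos for s, pos in zip(scores, is_positive))
        tn = sum(s < thr and not pos for s, pos in zip(scores, is_positive))
        points.append((thr, tp / n_pos, tn / n_neg))
    return points

def one_vs_all(votes, truth, classes):
    """One ROC curve per class: that class against all the others."""
    curves = {}
    for k, label in enumerate(classes):
        scores = [row[k] for row in votes]
        is_positive = [t == label for t in truth]
        curves[label] = roc_points(scores, is_positive)
    return curves

classes = ["A", "B", "C"]
truth = ["A", "A", "B", "B", "C"]
votes = [                   # hypothetical vote proportions per class
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.2, 0.7],
]
curves = one_vs_all(votes, truth, classes)
```

Each class gets its own curve, which is why being able to focus on a subset of classes is useful when some classes are much harder to separate than others.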
5 Discussion
Having better tools to open up black box models will provide for a better understanding of the data, the model strengths and weaknesses, and how the model performs for future data. This visualisation app provides a selection of interactive plots to diagnose PPF models. This shell could be used to make an app for other ensemble classifiers. The philosophy underlying the collection of displays is “show the model in the data space”, explained in Wickham et al. (2015). It is not easy to do this, and to completely take this on would require plotting the model in the $p$-dimensional data space. In the simplest approach, as taken here, it means to link the model diagnostics to displays of the data. Then it is possible to probe and query, to obtain a better understanding, such as finding regions in the data that prove difficult to fit, and detract from the predictive accuracy, or that don’t adhere to model assumptions.
The app is implemented with new technology for interactive graphics provided by the plotly package. It is one of the first uses of these new tools.
One challenge in using plotly is that when layers with different data are created in a ggplot2 plot, it is difficult to specify the unique keys required for linking with another plot.
There are many possible extensions to the app that could help it to be a tool for model refinement: (1) using the diagnostics to weed out under-performing models in the ensemble; (2) identifying and boosting models that perform well, particularly if they do well for problematic subsets of the data; (3) removing problematic cases and re-fitting ensembles; (4) aggregating or re-organising classes as a whole, as suggested by the model diagnostics, to produce a more effective hierarchical approach to the multiple class problem. Working within the R environment makes all of these possible from the command line outside the app, given that the unique ids of models and cases can be exported from the app.
The app has helped to identify ways to improve the PPtree algorithm, and consequently the PPF model. These especially apply to multiclass problems. Multiple splits for the same class would enable nonlinear classifications. Split criteria tend to place boundaries too close to some groups, due to heteroskedasticity being induced by aggregating classes. Forests are not always better than their constituent trees, and if the trees can be built better, the forest will provide stronger predictions.
References
- Allwein et al. (2000) Allwein, E. L., Schapire, R. E., and Singer, Y. (2000), “Reducing multiclass to binary: A unifying approach for margin classifiers,” Journal of Machine Learning Research, 1, 113–141.
- Breiman (1996) Breiman, L. (1996), “Bagging predictors,” Machine learning, 24, 123–140.
- Breiman et al. (1984) Breiman, L., Friedman, J., Stone, C. J., and Olshen, R. A. (1984), Classification and regression trees, CRC press.
- Buja et al. (1988) Buja, A., Asimov, D., Hurley, C., and McDonald, J. A. (1988), “Elements of a viewing pipeline for data analysis,” in Dynamic graphics for statistics, eds. Cleveland, W. S. and McGill, M. E., Monterey, CA: Wadsworth, pp. 277–308.
- Campbell and Mahon (1974) Campbell, N. A. and Mahon, R. J. (1974), “A multivariate study of variation in two species of rock crab of genus Leptograpsus,” Australian Journal of Zoology, 22, 417–425.
- Chang et al. (2015) Chang, W., Cheng, J., Allaire, J., Xie, Y., and McPherson, J. (2015), “shiny: Web application framework for R,” R package version 0.11.
- Cutler and Breiman (2011) Cutler, A. and Breiman, L. (2011), “RAFT: Random forest tool.”
- da Silva et al. (2017) da Silva, N., Cook, D., and Lee, E.-K. (2017), “Projection pursuit classification random forest,” https://github.com/natydasilva/PPforestpaper.
- Dietterich (2000) Dietterich, T. G. (2000), Ensemble methods in machine learning, New York: Springer Verlag, pp. 1–15.
- Hastie et al. (2011) Hastie, T. J., Tibshirani, R. J., and Friedman, J. H. (2011), The elements of statistical learning: data mining, inference, and prediction, Springer.
- Hurley and Oldford (2011) Hurley, C. B. and Oldford, R. (2011), “Eulerian tour algorithms for data visualization and the PairViz package,” Computational Statistics, 26, 613–633.
- Lee et al. (2013) Lee, Y. D., Cook, D., Park, J.-w., Lee, E.-K., et al. (2013), “PPtree: Projection pursuit classification tree,” Electronic Journal of Statistics, 7, 1369–1386.
- Puranen (2017) Puranen, J. (2017), “Finland fish catch,” https://ww2.amstat.org/publications/jse/jse_data_archive.htm.
- Quach (2012) Quach, A. T. (2012), “Interactive random forests plots.”
- Schloerke et al. (2017) Schloerke, B., Wickham, H., Cook, D., and Hofmann, H. (2017), “Escape from Boxland: Generating a library of high-dimensional geometric shapes,” The R Journal, https://journal.r-project.org/archive/accepted.
- Sievert (2017) Sievert, C. (2017), “Interfacing R with the web for accessible, portable, and interactive data science.”
- Sievert et al. (2017) Sievert, C., Parmer, C., Hocking, T., Chamberlain, S., Ram, K., Corvellec, M., and Despouy, P. (2017), plotly: Create interactive web-based graphs via plotly’s API, R package version 1.1.0.
- Silva and Ribeiro (2016) Silva, C. and Ribeiro, B. (2016), Visualization of individual ensemble classifier contributions, Cham: Springer International Publishing, pp. 633–642.
- Sutherland et al. (2000) Sutherland, P., Rossini, A., Lumley, T., Lewin-Koh, N., Dickerson, J., Cox, Z., and Cook, D. (2000), “Orca: A visualization toolkit for high-dimensional data,” Journal of Computational and Graphical Statistics, 9, 509–529.
- Talbot et al. (2009) Talbot, J., Lee, B., Kapoor, A., and Tan, D. S. (2009), “EnsembleMatrix: Interactive visualization to support machine learning with multiple classifiers,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’09), New York, NY, USA: Association for Computing Machinery, pp. 1283–1292.
- Urbanek (2008) Urbanek, S. (2008), “Visualizing trees and forests,” in Handbook of Data Visualization, eds. Chen, C., Härdle, W., and Unwin, A., Springer, Springer Handbooks of Computational Statistics, chap. III.2, pp. 243–264.
- Urbanek (2011) Urbanek, S. (2011), “iPlots eXtreme: next-generation interactive graphics design and implementation of modern interactive graphics,” Computational Statistics, 26, 381–393.
- Wickham et al. (2015) Wickham, H., Cook, D., and Hofmann, H. (2015), “Visualizing statistical models: Removing the blindfold,” Statistical Analysis and Data Mining: The ASA Data Science Journal, 8, 203–225.
- Wickham et al. (2011) Wickham, H., Cook, D., Hofmann, H., Buja, A., et al. (2011), “tourr: An R package for exploring multivariate data with projections,” Journal of Statistical Software, 40, 1–18.
- Wickham et al. (2009) Wickham, H., Lawrence, M., Cook, D., Buja, A., Hofmann, H., and Swayne, D. F. (2009), “The plumbing of interactive graphics,” Computational Statistics, 24, 207–215.
- Xie et al. (2014) Xie, Y., Hofmann, H., Cheng, X., et al. (2014), “Reactive programming for interactive graphics,” Statistical Science, 29, 201–213.