A demonstration of Manifold Oblique Random Forests paper simulation and example datasets.
Decision forests (DF), in particular random forests and gradient boosting trees, have demonstrated state-of-the-art accuracy compared to other methods in many supervised learning scenarios. In particular, DFs dominate other methods in tabular data, that is, when the feature space is unstructured, so that the signal is invariant to permuting feature indices. However, in structured data lying on a manifold, such as images, text, and speech, neural nets (NN) tend to outperform DFs. We conjecture that at least part of the reason for this is that the input to NN is not simply the feature magnitudes, but also their indices (for example, the convolution operation uses "feature locality"). In contrast, naïve DF implementations fail to explicitly consider feature indices. A recently proposed DF approach demonstrates that DFs, for each node, implicitly sample a random matrix from some specific distribution. Here, we build on that to show that one can choose distributions in a manifold-aware fashion. For example, for image classification, rather than randomly selecting pixels, one can randomly select contiguous patches. We demonstrate the empirical performance on data living on three different manifolds: images, time-series, and a torus. In all three cases, our Manifold Forest algorithm empirically dominates other state-of-the-art approaches that ignore feature space structure, achieving a lower classification error on all sample sizes. This dominance extends to the MNIST data set as well. Moreover, both training and test times are significantly faster for manifold forests as compared to deep nets. This approach, therefore, has promise to enable DFs and other machine learning methods to close the gap with deep nets on manifold-valued data.
Decision forests, including random forests and gradient boosting trees, have solidified themselves in the past couple of decades as a powerful ensemble learning method in supervised settings [JMLR:v15:delgado14a, Caruana:2006:ECS:1143844.1143865], including both classification and regression [hastie01statisticallearning]. In classification, each forest is a collection of decision trees whose individual classifications of a data point are aggregated together using majority vote. One of the strengths of this approach is that each decision tree need only perform better than chance for the forest to be a strong learner, given a few assumptions [Schapire:1990:SWL:83637.83645, Biau:2008:CRF:1390681.1442799]. Additionally, decision trees are relatively interpretable because they can provide an understanding of which features are most important for correct classification [Breiman2001]. Breiman originally proposed decision trees that partition the data set using hyperplanes aligned to feature axes [Breiman2001]. Yet this limits the flexibility of the forest and requires deep trees to classify some data sets, leading to overfitting. He also suggested that algorithms which partition based on sparse linear combinations of the coordinate axes can improve performance [Breiman2001]. More recently, Sparse Projection Oblique Randomer Forest (SPORF) partitions a random projection of the data and has shown impressive improvement over other methods [SPORF].
Yet random forests and other machine learning algorithms frequently operate in a tabular setting, viewing an observation $x \in \mathbb{R}^p$ as an unstructured feature vector. In doing so, they neglect the indices in settings where the indices encode additional information. For structured data, e.g. images or time series, traditional decision forests are not able to incorporate known continuity between features to learn new features. For decision forests to utilize known local structure in data, new features encoding this information must be manually constructed. Prior research has extended random forests to a variety of computer vision tasks [rf_keypoint_recog, rf_hough_detection, rf_image_classification, kinect_rf] and augmented random forests with structured pixel label information [rf_structured]. Yet these methods either generate features a priori from individual pixels, and thus do not take advantage of the local topology, or lack the flexibility to learn relevant patches. Decision forests have been used to learn distance metrics on unknown manifolds [Criminisi:2012:DFU:2185837.2185838], but such manifold forest algorithms are unsupervised and aim to learn a low dimensional representation of the data.
Inspired by SPORF, we propose a projection distribution that takes into account continuity between neighboring features while incorporating enough randomness to learn relevant projections. At each node in the decision tree, sets of spatially contiguous features are randomly selected using knowledge of the underlying manifold. Summing the intensities of the sampled features yields a set of projections which can then be evaluated to partition the observations. We describe this proposed classification algorithm, Manifold Forests, in detail and show its effectiveness in three simulation settings as compared to common classification algorithms. Furthermore, the optimized and parallelizable open source implementation of Manifold Forest in R and Python is available at https://neurodata.io/sporf/. This addition makes for an effective and flexible learner across a wide range of manifold structures.
In the two-class classification setting, there is a data set $D_n = \{(x_i, y_i)\}_{i=1}^{n}$ of $n$ pairs drawn from an unknown distribution $F_{XY}$, where $x_i \in \mathbb{R}^p$ and $y_i \in \{0, 1\}$. Our goal is to train a classifier $g: \mathbb{R}^p \to \{0, 1\}$ based on our observations that generalizes to correctly predict the class $y$ of an observed $x$. The performance of this classifier is evaluated via the 0-1 loss function $L(g(x), y) = \mathbb{I}[g(x) \neq y]$ to find the optimal classifier $g^* = \arg\min_g P[g(X) \neq Y]$, which minimizes the probability of an incorrect classification.
Originally popularized by Breiman, the random forest (RF) classifier is empirically very effective [JMLR:v15:delgado14a] while maintaining strong theoretical guarantees [Breiman2001]. A random forest is an ensemble of decision trees whose individual classifications of a data point are aggregated together using majority vote. Each decision tree consists of split nodes and leaf nodes. A split node is associated with a subset $S$ of the data and splits into two child nodes, each associated with one part of a binary partition of $S$. Let $e_j$ denote a unit vector in the standard basis (that is, a vector with a single one and the rest of the entries zero) and $\tau$ a threshold value. Then $S$ is partitioned into two subsets, $S_L = \{x_i \in S : e_j^\top x_i \le \tau\}$ and $S_R = \{x_i \in S : e_j^\top x_i > \tau\}$, given the pair $(e_j, \tau)$.
To choose the partition, the optimal pair $(e_j^*, \tau^*)$ is selected via a greedy search from among a set of $d$ randomly selected standard basis vectors $e_j$. The selected partition is that which maximizes some measure of information gain. A typical measure is a decrease in impurity, calculated by the Gini impurity score $I(S)$, of the resulting partitions [hastie01statisticallearning]. Let $\hat{p}_k$ be the fraction of elements of class $k$ in partition $S$, so that $I(S) = \sum_k \hat{p}_k (1 - \hat{p}_k)$; then the optimal split is found as
$$(e_j^*, \tau^*) = \arg\max_{e_j, \tau} \; |S|\, I(S) - |S_L|\, I(S_L) - |S_R|\, I(S_R).$$
A leaf node is created once the partition reaches a stopping criterion, typically either falling below an impurity score threshold or a minimum number of observations [hastie01statisticallearning]. The leaf nodes of the tree form a disjoint partition of the feature space in which each partition of observations is assigned a class label corresponding to the class majority.
A decision tree classifies a new observation by assigning it the class of the partition into which the observation falls. The forest averages the classifications over all decision trees to make the final classification [hastie01statisticallearning]. For good performance of the ensemble and strong theoretical guarantees, the individual decision trees must be relatively uncorrelated from one another. Breiman’s random forest algorithm does this in two ways:
1. At every node in the decision tree, the optimal split is determined over a random subset of the total collection of features.
2. Each tree is trained on a randomly bootstrapped sample of data points from the full training data set.
Applying these techniques protects random forests against overfitting and lowers the upper bound on the generalization error [Breiman2001].
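The greedy, Gini-based split search described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names `gini` and `best_axis_aligned_split` are hypothetical, and taking candidate thresholds at midpoints between consecutive distinct feature values is an assumption made here for simplicity.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: sum_k p_k * (1 - p_k)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())


def best_axis_aligned_split(X, y, feature_ids):
    """Greedy search over the candidate features and thresholds.

    X is a list of feature vectors, y the matching class labels, and
    feature_ids the (randomly chosen) subset of features to consider.
    Returns (feature, threshold, gain); feature is None if no split helps.
    """
    parent = len(y) * gini(y)
    best = (None, None, 0.0)
    for j in feature_ids:
        values = sorted(set(row[j] for row in X))
        # Assumption: thresholds are midpoints between distinct values.
        for lo, hi in zip(values, values[1:]):
            tau = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= tau]
            right = [yi for row, yi in zip(X, y) if row[j] > tau]
            # Size-weighted decrease in impurity.
            gain = parent - len(left) * gini(left) - len(right) * gini(right)
            if gain > best[2]:
                best = (j, tau, gain)
    return best
```

In a random forest, `feature_ids` would be a fresh random subset at every node, for example `random.sample(range(p), d)`.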
SPORF is a recent modification to random forest that has shown improvement over other versions [SPORF, tomita2]. Recall that RF split nodes partition data along the coordinate axes by comparing the projection $e_j^\top x$ of observation $x$ on standard basis vector $e_j$ to a threshold value $\tau$. SPORF generalizes the set of possible projections, allowing for the data to be partitioned along axes specified by any sparse vector $a$.
Rather than partitioning the data solely along the coordinate axes (i.e. the standard basis), SPORF creates partitions along axes specified by sparse vectors. In other words, let the dictionary $\mathcal{A}$ be the set of atoms $\{a_j\}$, each atom a $p$-dimensional vector defining a possible projection $a_j^\top x$. In axis-aligned forests, $\mathcal{A}$ is the set of standard basis vectors $\{e_j\}$. In SPORF, the dictionary can be much larger, because it includes, for example, all 2-sparse vectors. At each split node, SPORF samples $d$ atoms from $\mathcal{A}$ according to a specified distribution. By default, each of the $d$ atoms is randomly generated with a sparsity level drawn from a Poisson distribution with a specified rate $\lambda$. Then, each of the non-zero elements is uniformly randomly assigned either $+1$ or $-1$. Note that the size of the dictionary for SPORF is $3^p$ (because each of the $p$ elements could be $-1$, $0$, or $+1$), although the atoms are sampled from a distribution heavily skewed towards sparsity.
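The default sampling scheme described above (a Poisson-distributed sparsity level, then random signs on the chosen entries) might look roughly like the following. This is a sketch under stated assumptions, not the SPORF library's code: the helper names are hypothetical, the Poisson draw uses Knuth's multiplication method to stay within the standard library, and clipping the sparsity to at least one nonzero entry is an assumption made here so that no atom is all zeros.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw from Poisson(rate) with Knuth's method (fine for small rates)."""
    limit = math.exp(-rate)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def sample_sparse_atom(p, rate, rng):
    """Return a length-p atom with entries in {-1, 0, +1}."""
    # Assumption: clip the sparsity to [1, p] so the atom is never empty.
    k = min(max(sample_poisson(rate, rng), 1), p)
    atom = [0] * p
    for idx in rng.sample(range(p), k):
        atom[idx] = rng.choice((-1, 1))
    return atom


def project(atom, x):
    """Projection of observation x onto the atom (their inner product)."""
    return sum(a * xi for a, xi in zip(atom, x))
```

Each split node would call `sample_sparse_atom` once per candidate projection and then search thresholds on the projected values.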
In the structured setting, the dictionary of projection vectors is modified to take advantage of the underlying manifold on which the data lies. We term this method the Manifold Forest.
Each atom $a$ projects an observation $x$ to a real number $a^\top x$ and is designed with respect to prior knowledge of the data manifold. Nonzero elements of $a$ effectively select and weight features. Since the feature space is structured, each element of $a$ maps to a location on the underlying manifold. Thus, patterns of contiguous points on the manifold define the atoms of the dictionary $\mathcal{A}$; the distribution of those patterns yields a distribution over the atoms. At each node in the decision tree, Manifold Forest samples $d$ atoms, yielding $d$ new features per observation. Manifold Forest then proceeds just like SPORF, selecting the best split according to the Gini index. Algorithm pseudocode, essentially equivalent to that of SPORF, can be found in the Appendix.
In the case of two-dimensional arrays, such as images, an observation $x \in \mathbb{R}^p$ is a vectorized representation of a data-matrix $X \in \mathbb{R}^{H \times W}$, with $p = HW$. To capture the relevance of neighboring pixels, Manifold Forest creates projections by summing the intensities of pixels in rectangular patches. Thus the atoms of the dictionary $\mathcal{A}$ are the vectorized representations of these rectangular patches.
A rectangular patch is fully parameterized by the location of its upper-left corner $(u, v)$, its height $h$, and width $w$. To generate a patch, first the index of the upper-left corner is uniformly sampled. Then its height and width are independently sampled from separate uniform distributions. Manifold Forest hyperparameters determine the minimum and maximum heights $\{h_{\min}, h_{\max}\}$ and widths $\{w_{\min}, w_{\max}\}$, respectively, to sample from. Let $\mathcal{U}\{\cdot, \cdot\}$ denote the discrete uniform distribution. An atom is sampled as follows; note that the patch cannot exceed the data-matrix boundaries.
$$u \sim \mathcal{U}\{0, W - w_{\min}\}, \quad v \sim \mathcal{U}\{0, H - h_{\min}\},$$
$$w \sim \mathcal{U}\{w_{\min}, \min(w_{\max}, W - u)\}, \quad h \sim \mathcal{U}\{h_{\min}, \min(h_{\max}, H - v)\}.$$
The vectorized atom $a$ yields a projection of the data $a^\top x$, effectively selecting and summing pixel intensities in the sampled rectangular patch.
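The patch sampling just described can be illustrated with a short sketch. The function name `sample_patch_atom`, the zero-based indexing, the row-major vectorization, and the choice to restrict the corner so that a minimum-size patch always fits are all assumptions made for this example; the source only states that the corner, height, and width are drawn uniformly and that the patch stays inside the data-matrix.

```python
import random


def sample_patch_atom(H, W, h_min, h_max, w_min, w_max, rng):
    """Sample a rectangular patch and return it as a 0/1 atom of length H*W.

    Also returns (v, u, h, w): top row, left column, height, width.
    """
    # Upper-left corner, restricted so a minimum-size patch still fits.
    v = rng.randint(0, H - h_min)
    u = rng.randint(0, W - w_min)
    # Height and width, truncated so the patch stays inside the matrix.
    h = rng.randint(h_min, min(h_max, H - v))
    w = rng.randint(w_min, min(w_max, W - u))
    atom = [0] * (H * W)
    for r in range(v, v + h):
        for c in range(u, u + w):
            atom[r * W + c] = 1  # assumption: row-major vectorization
    return atom, (v, u, h, w)


def project(atom, x):
    """Sum of the pixel intensities selected by the atom."""
    return sum(a * xi for a, xi in zip(atom, x))
```

Projecting a vectorized image with such an atom sums the intensities inside the patch, so an all-ones image projects to the patch area `h * w`.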
By constructing features in this way, Manifold Forest learns low-level features in the structured data, such as edges or corners in images. The forest can therefore learn the features that best distinguish a class. The structure of these atoms is flexible and task dependent. In the case of data lying on a cyclic manifold, the atoms can wrap around borders to capture the added continuity. Atoms can also be used in one-dimensional arrays, such as univariate time-series data, in which case each patch reduces to a contiguous interval of features.
One of the benefits of decision trees is that their results are fairly interpretable, in that they allow for estimation of the relative importance of each feature. Many approaches have been suggested [Breiman2001, Lundberg2017AUA]; here a projection-forest-specific metric is used that counts the number of times a given feature was used in projections across the ensemble of decision trees. A decision tree $T$ is composed of many nodes $j$, each one associated with an atom $a_j^*$ and threshold $\tau_j^*$ that partition the feature space according to the projection $a_j^{*\top} x$. Thus, the indices corresponding to nonzero elements of $a_j^*$ indicate important features used in the projection. For each feature $k$, the number of times $\pi_k$ it is used in a projection, across all split nodes and decision trees, is counted:
$$\pi_k = \sum_{T} \sum_{j \in T} \mathbb{I}(a_{jk}^* \neq 0).$$
These normalized counts represent the relative importance of each feature in making a correct classification. Such a method applies to both SPORF and Manifold Forest, although different results between them would be expected due to different projection distributions yielding different hyperplanes.
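The counting metric above is simple to express in code. The sketch below assumes a hypothetical forest representation, a list of trees where each tree is just the list of atoms chosen at its split nodes, and assumes the counts are normalized to sum to one; the source does not specify the normalization.

```python
def projection_feature_importance(forest, p):
    """Count how often each of the p features appears in a chosen atom.

    forest: list of trees; each tree is a list of atoms (one per split
    node), and each atom is a length-p sequence of weights.
    Returns counts normalized to sum to one (an assumption made here).
    """
    counts = [0] * p
    for tree in forest:
        for atom in tree:
            for k, weight in enumerate(atom):
                if weight != 0:
                    counts[k] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * p
    return [c / total for c in counts]
```

For image data, reshaping the returned list back to the image dimensions gives a per-pixel importance map.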
To test Manifold Forest, we evaluate its performance in three simulation settings as compared to logistic regression (Log. Reg), linear support vector machine (Lin. SVM), support vector machine with a radial basis function kernel (SVM), k-nearest neighbors (kNN), random forest (RF), Multi-layer Perceptron (MLP), and SPORF. For each experiment, we used our open source implementation of Manifold Forest and that of SPORF. All decision forest algorithms used 100 decision trees on the simulations. Each of the other classifiers was run from the Scikit-learn Python package [scikit-learn]. A convolutional neural network (CNN) [paszke2017automatic] was also evaluated, built with two convolution layers, ReLU activations, and maxpooling, followed by dropout and two densely connected layers. The CNN results were averaged over 5 runs for the simulations and training was stopped early if the loss plateaued.
Experiment (A) is a non-Euclidean example inspired by [Younes2018DiffeomorphicL]. Each observation is a discretization of a circle into 100 features with two non-adjacent segments of 1’s in two differing patterns: class 1 features two segments of length five while class 2 features one segment of length four and one of length six. Manifold Forest used one-dimensional rectangles in this setting, as the observations were one-dimensional in nature. These projection patches had a width between one and fifteen pixels and each split node of Manifold Forest and SPORF considered 40 random projections. Figure 1(A) shows examples from the two classes and classification results across various sample sizes.
Experiment (B) consists of a simple binary image classification problem. Images in class 0 contain randomly sized and spaced horizontal bars while those in class 1 contain randomly sized and spaced vertical bars. For each sampled image, bars were distributed among the rows or columns, depending on the class. The distributions of the two classes are identical if a 90 degree rotation is applied to one of the classes. Projection patches were between one and four pixels in both width and height and each split node of Manifold Forest and SPORF considered 28 random projections. Figure 1(B) shows examples from the two classes and classification results across various sample sizes.
Experiment (C) is a signal classification problem. One class consists of 100 values of Gaussian noise while the second class has an added exponentially decaying unit step beginning at time 20.
Projection patches were 1D with a width between one and five timesteps. Each split node of Manifold Forest and SPORF considered the default number of random projections, the square root of the number of features. Figure 1(C) shows examples from the two classes and classification results across various sample sizes.
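To make the setup of experiment (C) concrete, the two signal classes could be simulated roughly as follows. Only the signal length (100), the onset time (20), and the use of Gaussian noise come from the text; the function name `simulate_signal`, the decay rate, and the noise standard deviation are hypothetical values chosen for illustration.

```python
import math
import random


def simulate_signal(label, rng, n_steps=100, onset=20, decay=0.1, noise_sd=1.0):
    """Simulate one observation for the signal classification problem.

    Class 0 is pure Gaussian noise; class 1 adds an exponentially
    decaying unit step starting at `onset`.
    Assumptions: decay=0.1 and noise_sd=1.0 are illustrative only.
    """
    signal = [rng.gauss(0.0, noise_sd) for _ in range(n_steps)]
    if label == 1:
        for t in range(onset, n_steps):
            signal[t] += math.exp(-decay * (t - onset))
    return signal
```

Setting `noise_sd=0.0` exposes the underlying class-1 template, which is zero before the onset, one at the onset, and decays afterwards.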
In all three simulation settings, Manifold Forest outperforms all other classifiers except the CNN, against which there is no clear winner, and its advantage is largest at low sample sizes. The performance of Manifold Forest is particularly good in the discretized circle simulation, for which most other classifiers perform at chance levels. Manifold Forest also performs well in the signal classification problem, although all the classifiers are close in performance. This may be because the exponential signal is prevalent throughout most of the time-steps and so perfect continuity is less relevant.
All experiments were run on a single core CPU. Manifold Forest has train and test times on par with those of SPORF and RF and so is not particularly more computationally intensive to run. The CNN, however, took noticeably longer to run in two of the three simulations, especially in training but also in testing. Thus its strong performance in those settings comes at an added computational cost, a typical issue for deep learning methods [dl_cost].
Manifold Forest’s performance was evaluated on the MNIST dataset, a collection of handwritten digits stored in 28 by 28 square images [mnist], and compared to the algorithms used in the simulations. 10,000 images were held out for testing and a subset of the remaining images was used for training. The results are displayed in Figure 3. All three forest algorithms were composed of 500 decision trees and Manifold Forest was restricted to use patches up to only 3 pixels in height and width. Manifold Forest showed an improvement over the other algorithms, especially for smaller sample sizes. Thus, even this trivial modification can improve performance by several percentage points. Specifically, Manifold Forest achieved a better (lower) classification error than all other algorithms besides CNNs for all sample sizes on this real data problem.
To evaluate the capability of Manifold Forest to identify important features in manifold-valued data as compared to RF and SPORF, all methods were run on a subset of the MNIST dataset: we only used threes and fives, 100 images from each class.
The feature importance of each pixel is shown in Figure 4. Manifold Forest visibly results in a smoother pixel importance, most likely a result of the continuity of neighboring pixels in selected projections. Although SPORF demonstrated empirical improvement over RF on the MNIST data, its projection distribution yields scattered importance of unimportant background pixels as compared to RF. Since projections in SPORF have no continuity constraint, those that select high importance pixels will also select pixels of low importance by chance. This may be a nonissue asymptotically, but is a relevant problem in low sample size settings. Manifold Forest, however, shows little or no importance of these background pixels by virtue of the modified projection distribution.
The success of sparse oblique projections in decision forests has opened up many possible ways to improve axis-aligned decision forests (including random forests and gradient boosting trees) by way of specialized projection distributions. Traditional decision forests have already been applied to structured data, using predefined features to classify images or pixels. Decision forest algorithms like that implemented in the Microsoft Kinect showed great success but ignore pixel continuity and specialize for a specific data modality, namely images [kinect_rf].
We expanded upon sparse oblique projections and introduced a structural projection distribution that uses prior knowledge of the topology of a feature space. The open source implementation of SPORF (https://neurodata.io/sporf/) has allowed for a relatively easy implementation of Manifold Forest, creating a flexible classification method for a variety of data modalities. We showed in various simulated settings that appropriate domain knowledge can improve the projection distribution to yield impressive results that challenge the strength of deep learning techniques on manifold-valued data. On the MNIST data set, Manifold Forest showed modest improvements over the other algorithms besides CNNs and smoother importance plots than the other decision forest algorithms. This is in spite of the data set’s low resolution images, which are harder for the modified projection distributions to take advantage of.
Research into other, task-specific convolution kernels may lead to improved results in real-world computer vision tasks. Such structured projection distributions, while incorporated into SPORF here, may also be incorporated into other state-of-the-art algorithms such as XGBoost [XGboost].