Manifold Forests: Closing the Gap on Neural Networks

09/25/2019 ∙ by Ronan Perry, et al.

Decision forests (DF), in particular random forests and gradient boosting trees, have demonstrated state-of-the-art accuracy compared to other methods in many supervised learning scenarios. In particular, DFs dominate other methods on tabular data, that is, when the feature space is unstructured, so that the signal is invariant to permuting the feature indices. However, on structured data lying on a manifold (such as images, text, and speech), neural nets (NN) tend to outperform DFs. We conjecture that at least part of the reason for this is that the input to a NN is not simply the feature magnitudes, but also their indices (for example, the convolution operation uses "feature locality"). In contrast, naïve DF implementations fail to explicitly consider feature indices. A recently proposed DF approach demonstrates that DFs, for each node, implicitly sample a random matrix from some specific distribution. Here, we build on that work to show that one can choose distributions in a manifold-aware fashion. For example, for image classification, rather than randomly selecting individual pixels, one can randomly select contiguous patches. We demonstrate the empirical performance of this approach on data living on three different manifolds: images, time series, and a torus. In all three cases, our Manifold Forest (MF) algorithm empirically dominates other state-of-the-art approaches that ignore feature space structure, achieving a lower classification error at all sample sizes. This dominance extends to the MNIST data set as well. Moreover, both training and test times are significantly faster for manifold forests than for deep nets. This approach therefore has promise to enable DFs and other machine learning methods to close the gap with deep nets on manifold-valued data.


1 Introduction

Decision forests, including random forests and gradient boosting trees, have established themselves over the past two decades as a powerful ensemble learning method in supervised settings [JMLR:v15:delgado14a, Caruana:2006:ECS:1143844.1143865], including both classification and regression [hastie01statisticallearning]. In classification, a forest is a collection of decision trees whose individual classifications of a data point are aggregated together using majority vote. One of the strengths of this approach is that each decision tree need only perform better than chance for the forest to be a strong learner, given a few assumptions [Schapire:1990:SWL:83637.83645, Biau:2008:CRF:1390681.1442799]. Additionally, decision trees are relatively interpretable because they can provide an understanding of which features are most important for correct classification [Breiman2001].

Breiman originally proposed decision trees that partition the data set using hyperplanes aligned to the feature axes [Breiman2001]. Yet this limits the flexibility of the forest and requires deep trees to classify some data sets, leading to overfitting. He also suggested that algorithms which partition based on sparse linear combinations of the coordinate axes can improve performance [Breiman2001]. More recently, the Sparse Projection Oblique Randomer Forest (SPORF) partitions the data along sparse random projections and has shown impressive improvements over other methods [SPORF].

Yet random forests and other machine learning algorithms frequently operate in a tabular setting, viewing an observation as an unstructured feature vector. In doing so, they neglect the indices in settings where the indices encode additional information. For structured data, e.g. images or time series, traditional decision forests are not able to exploit the known continuity between features to learn new features. For decision forests to utilize known local structure in data, new features encoding this information must be manually constructed. Prior research has extended random forests to a variety of computer vision tasks [rf_keypoint_recog, rf_hough_detection, rf_image_classification, kinect_rf] and augmented random forests with structured pixel label information [rf_structured]. Yet these methods either generate features a priori from individual pixels, and thus do not take advantage of the local topology, or lack the flexibility to learn relevant patches. Decision forests have also been used to learn distance metrics on unknown manifolds [Criminisi:2012:DFU:2185837.2185838], but such manifold forest algorithms are unsupervised and aim to learn a low-dimensional representation of the data.

Inspired by SPORF, we propose a projection distribution that takes into account continuity between neighboring features while incorporating enough randomness to learn relevant projections. At each node in the decision tree, sets of spatially contiguous features are randomly selected using knowledge of the underlying manifold. Summing the intensities of the sampled features yields a set of projections which can then be evaluated to partition the observations. We describe this proposed classification algorithm, Manifold Forests (MF), in detail and show its effectiveness in three simulation settings as compared to common classification algorithms. Furthermore, an optimized and parallelizable open source implementation of MF in R and Python is available at https://neurodata.io/sporf/. This addition makes for an effective and flexible learner across a wide range of manifold structures.

2 Background and Related Work

2.1 Classification

In the two-class classification setting, there is a data set $\mathcal{D}_n = \{(x_i, y_i)\}_{i=1}^{n}$ of $n$ pairs drawn from an unknown distribution $F_{XY}$, where $x_i \in \mathbb{R}^p$ and $y_i \in \{0, 1\}$. Our goal is to train a classifier $h: \mathbb{R}^p \to \{0, 1\}$ based on our observations that generalizes to correctly predict the class of a newly observed $x$. The performance of this classifier is evaluated via the 0-1 loss $L(h) = P[h(X) \neq Y]$, the aim being the optimal classifier $h^* = \operatorname*{argmin}_h L(h)$, which minimizes the probability of an incorrect classification.

2.2 Random Forests

Originally popularized by Breiman, the random forest (RF) classifier is empirically very effective [JMLR:v15:delgado14a] while maintaining strong theoretical guarantees [Breiman2001]. A random forest is an ensemble of decision trees whose individual classifications of a data point are aggregated together using majority vote. Each decision tree consists of split nodes and leaf nodes. A split node is associated with a subset of the data $\mathcal{S} \subseteq \mathcal{D}_n$ and splits into two child nodes, each associated with a binary partition of $\mathcal{S}$. Let $e_j \in \mathbb{R}^p$ denote a unit vector in the standard basis (that is, a vector with a single one and the rest of the entries zero) and $\tau \in \mathbb{R}$ a threshold value. Then $\mathcal{S}$ is partitioned into two subsets given the pair $(e_j, \tau)$:

$$\mathcal{S}_L = \{x \in \mathcal{S} : e_j^T x \leq \tau\}, \qquad \mathcal{S}_R = \{x \in \mathcal{S} : e_j^T x > \tau\}.$$

To choose the partition, the optimal pair $(e_j^*, \tau^*)$ is selected via a greedy search from among a set of randomly selected standard basis vectors. The selected partition is that which maximizes some measure of information gain. A typical measure is the decrease in impurity, calculated via the Gini impurity score $I(\mathcal{S}) = \sum_k f_k (1 - f_k)$ of the resulting partitions [hastie01statisticallearning]. Let $f_k$ be the fraction of elements of class $k$ in partition $\mathcal{S}$; then the optimal split is found as

$$(e_j^*, \tau^*) = \operatorname*{argmax}_{(e_j, \tau)} \; I(\mathcal{S}) - \frac{|\mathcal{S}_L|}{|\mathcal{S}|} I(\mathcal{S}_L) - \frac{|\mathcal{S}_R|}{|\mathcal{S}|} I(\mathcal{S}_R).$$
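To make the split search concrete, the following minimal Python sketch scores axis-aligned candidate splits by the decrease in Gini impurity. The function names and the exhaustive scan over observed thresholds are illustrative assumptions, not the reference implementation.

import numpy as np

def gini(y):
    # Gini impurity of a label vector: sum_k f_k * (1 - f_k)
    _, counts = np.unique(y, return_counts=True)
    f = counts / counts.sum()
    return float(np.sum(f * (1.0 - f)))

def best_axis_aligned_split(X, y, feature_indices):
    # Greedy search over candidate features and thresholds,
    # maximizing the decrease in Gini impurity.
    n = len(y)
    best_gain, best_pair = -np.inf, None
    for j in feature_indices:
        for tau in np.unique(X[:, j])[:-1]:          # candidate thresholds between observed values
            left, right = y[X[:, j] <= tau], y[X[:, j] > tau]
            gain = gini(y) - len(left) / n * gini(left) - len(right) / n * gini(right)
            if gain > best_gain:
                best_gain, best_pair = gain, (j, tau)
    return best_pair, best_gain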

A leaf node is created once the partition reaches a stopping criterion, typically either falling below an impurity score threshold or a minimum number of observations [hastie01statisticallearning]. The leaf nodes of the tree form a disjoint partition of the feature space in which each partition of observations is assigned a class label corresponding to the class majority.

A decision tree classifies a new observation by assigning it the class of the partition into which the observation falls. The forest averages the classifications over all decision trees to make the final classification [hastie01statisticallearning]. For good performance of the ensemble and strong theoretical guarantees, the individual decision trees must be relatively uncorrelated from one another. Breiman’s random forest algorithm does this in two ways:

  1. At every node in the decision tree, the optimal split is determined over a random subset of the total collection of features.

  2. Each tree is trained on a randomly bootstrapped sample of data points from the full training data set.

Applying these techniques decorrelates the individual trees; Breiman showed that this prevents overfitting as more trees are added and lowers the upper bound on the generalization error [Breiman2001].
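For reference, both decorrelation mechanisms correspond to standard options of the RandomForestClassifier in scikit-learn, which is also the RF baseline used in Section 4; the settings below simply make the two sources of randomness explicit.

from sklearn.ensemble import RandomForestClassifier

# bootstrap=True trains each tree on a bootstrap resample of the data (technique 2);
# max_features="sqrt" searches a random subset of features at every split (technique 1).
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", bootstrap=True)
# Usage: rf.fit(X_train, y_train); y_hat = rf.predict(X_test)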

2.3 Sparse Projection Oblique Randomer Forests

SPORF is a recent modification to random forests that has shown improvement over other versions [SPORF, tomita2]. Recall that RF split nodes partition data along the coordinate axes by comparing the projection $e_j^T x$ of an observation $x$ onto a standard basis vector $e_j$ to a threshold value $\tau$. SPORF generalizes the set of possible projections, allowing the data to be partitioned along axes specified by any sparse vector $a \in \mathbb{R}^p$.

Rather than partitioning the data solely along the coordinate axes (i.e. the standard basis), SPORF creates partitions along axes specified by sparse vectors. In other words, let the dictionary $\mathcal{A}$ be the set of atoms $a \in \mathbb{R}^p$, each atom a $p$-dimensional vector defining a possible projection $a^T x$. In axis-aligned forests, $\mathcal{A}$ is the set of standard basis vectors $\{e_1, \ldots, e_p\}$. In SPORF, the dictionary can be much larger, because it includes, for example, all 2-sparse vectors. At each split node, SPORF samples $d$ atoms from $\mathcal{A}$ according to a specified distribution. By default, each of the $d$ atoms is randomly generated with a sparsity level drawn from a Poisson distribution with a specified rate $\lambda$. Then each of the nonzero elements is uniformly randomly assigned either $+1$ or $-1$. Note that the size of the dictionary for SPORF is $3^p$ (because each of the $p$ elements could be $-1$, $0$, or $+1$), although the atoms are sampled from a distribution heavily skewed towards sparsity.
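The default sampling scheme described above can be sketched in a few lines of Python; the function name and the handling of degenerate Poisson draws (forcing at least one nonzero entry and capping at p) are assumptions rather than details of the SPORF implementation.

import numpy as np

def sample_sporf_atoms(p, d, rate=1.0, rng=None):
    # Sample a p x d matrix whose columns are sparse atoms with +/-1 nonzero entries.
    rng = np.random.default_rng() if rng is None else rng
    A = np.zeros((p, d))
    for j in range(d):
        k = min(max(1, rng.poisson(rate)), p)        # sparsity level drawn from Poisson(rate)
        idx = rng.choice(p, size=k, replace=False)   # which features receive nonzero weight
        A[idx, j] = rng.choice([-1.0, 1.0], size=k)  # uniformly random signs
    return A

# Projected features at a split node: X_tilde = X @ A, with X of shape (n, p).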

3 Methods

3.1 Random Projection Forests on Manifolds

In the structured setting, the dictionary of projection vectors is modified to take advantage of the underlying manifold on which the data lie. We term this method the Manifold Forest (MF).

Each atom $a \in \mathcal{A}$ projects an observation $x$ to a real number $a^T x$ and is designed with respect to prior knowledge of the data manifold. Nonzero elements of $a$ effectively select and weight features. Since the feature space is structured, each element of $x$ maps to a location on the underlying manifold. Thus, patterns of contiguous points on the manifold define the atoms of $\mathcal{A}$; the distribution of those patterns yields a distribution over the atoms. At each node in the decision tree, MF samples $d$ atoms, yielding $d$ new features per observation. MF then proceeds just like SPORF by optimizing the best split according to the Gini index. Algorithm pseudocode, essentially equivalent to that of SPORF, can be found in the Appendix.

In the case of two-dimensional arrays, such as images, an observation $x \in \mathbb{R}^{HW}$ is a vectorized representation of a data matrix $X \in \mathbb{R}^{H \times W}$. To capture the relevance of neighboring pixels, MF creates projections by summing the intensities of pixels in rectangular patches. Thus the atoms of $\mathcal{A}$ are the vectorized representations of these rectangular patches.

A rectangular patch is fully parameterized by the location of its upper-left corner $(u, v)$, its height $h$, and its width $w$. To generate a patch, the location of the upper-left corner is first uniformly sampled. Then the height and width are independently sampled from separate discrete uniform distributions; MF hyperparameters determine the minimum and maximum heights $(h_{\min}, h_{\max})$ and widths $(w_{\min}, w_{\max})$ to sample from. Let $U(a, b)$ denote the discrete uniform distribution on $\{a, a+1, \ldots, b\}$. An atom is then sampled as

$$u \sim U(1, H), \quad v \sim U(1, W), \quad h \sim U(h_{\min}, h_{\max}), \quad w \sim U(w_{\min}, w_{\max}),$$

with the sampled heights and widths truncated so that the patch cannot exceed the data-matrix boundaries.

The vectorized atom $a$ then yields a projection $a^T x$ of the data, effectively selecting and summing the pixel intensities in the sampled rectangular patch.

By constructing features in this way, MF learns low-level features in the structured data, such as edges or corners in images. The forest can therefore learn the features that best distinguish the classes. The structure of these atoms is flexible and task dependent. In the case of data lying on a cyclic manifold, the atoms can wrap around borders to capture the added continuity. Atoms can also be used with one-dimensional arrays, such as univariate time series, in which case a patch is simply a contiguous segment of indices parameterized by its starting location and width.
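The patch sampling above can be sketched as follows for an H x W image; the exact way the boundary constraint is enforced (here, by truncating the sampled height and width and keeping the corner where a minimum-size patch still fits) is an assumption, as is every name in the sketch.

import numpy as np

def sample_patch_atom(H, W, h_range=(1, 4), w_range=(1, 4), rng=None):
    # Return a vectorized atom that sums pixel intensities in a random rectangular patch.
    rng = np.random.default_rng() if rng is None else rng
    u = rng.integers(0, H - h_range[0] + 1)          # row of the upper-left corner
    v = rng.integers(0, W - w_range[0] + 1)          # column of the upper-left corner
    h = rng.integers(h_range[0], min(h_range[1], H - u) + 1)  # height, truncated at the boundary
    w = rng.integers(w_range[0], min(w_range[1], W - v) + 1)  # width, truncated at the boundary
    atom = np.zeros((H, W))
    atom[u:u + h, v:v + w] = 1.0                     # ones select the patch
    return atom.reshape(-1)                          # vectorize to match x in R^{HW}

# Projection of a vectorized image x: atom @ x, i.e. the sum of intensities inside the patch.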

3.2 Feature Importance

One of the benefits of decision trees is that their results are fairly interpretable in that they allow for estimation of the relative importance of each feature. Many approaches have been suggested [Breiman2001, Lundberg2017AUA]; here a projection-forest-specific metric is used, in which we count the number of times a given feature is used in projections across the ensemble of decision trees. A decision tree is composed of many split nodes, each one associated with an atom $a$ and threshold $\tau$ that partition the feature space according to the projection $a^T x$. Thus, the indices corresponding to nonzero elements of $a$ indicate the features used in the projection. For each feature, the number of times it is used in a projection, across all split nodes and decision trees, is counted.

These normalized counts represent the relative importance of each feature in making a correct classification. Such a method applies to both MF and SPORF, although different results would be expected between them because their different projection distributions yield different hyperplanes.
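A sketch of this counting procedure is given below; it assumes each fitted tree exposes the atoms used at its split nodes, and the container names are hypothetical.

import numpy as np

def projection_feature_importance(forest_atoms, p):
    # forest_atoms: iterable over trees, each an iterable of length-p atom vectors,
    # one per split node (a hypothetical data structure used only for illustration).
    counts = np.zeros(p)
    for tree_atoms in forest_atoms:
        for a in tree_atoms:
            counts[np.flatnonzero(a)] += 1           # features with nonzero projection weight
    total = counts.sum()
    return counts / total if total > 0 else counts   # normalized relative importances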

4 Simulation Results

To test MF, we evaluate its performance in three simulation settings as compared to logistic regression (Log. Reg), linear support vector machine (Lin. SVM), support vector machine with a radial basis function kernel (SVM), k-nearest neighbors (kNN), random forest (RF), multi-layer perceptron (MLP), and SPORF. For each experiment, we used our open source implementations of MF and of SPORF. All decision forest algorithms used 100 decision trees in the simulations. Each of the other classifiers was run from the Scikit-learn Python package [scikit-learn] with default parameters. Additionally, we tested against a convolutional neural network (CNN) built using PyTorch [paszke2017automatic] with two convolution layers, ReLU activations, and max-pooling, followed by dropout and two densely connected layers. The CNN results were averaged over 5 runs for the simulations, and training was stopped early if the loss plateaued.
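A hedged PyTorch sketch of such a network is given below; the paper does not report channel counts, kernel sizes, or input shape, so those are assumptions (a single-channel 28 x 28 input is assumed for concreteness).

import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    # Two convolution layers with ReLU and max-pooling, then dropout and two dense layers.
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(32 * 7 * 7, 128), nn.ReLU(),   # assumes a 28 x 28 single-channel input
            nn.Linear(128, n_classes),
        )

    def forward(self, x):
        x = self.features(x)                         # (N, 32, 7, 7) for 28 x 28 inputs
        return self.classifier(torch.flatten(x, 1))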

4.1 Simulation Settings

Experiment (A) is a non-Euclidean example inspired by [Younes2018DiffeomorphicL]. Each observation is a discretization of a circle into 100 features with two non-adjacent segments of 1's in two differing patterns: class 1 features two segments of length five, while class 2 features one segment of length four and one of length six. MF used one-dimensional patches in this setting, as the observations are one-dimensional in nature. These projection patches had a width between one and fifteen features, and each split node of MF and SPORF considered 40 random projections. Figure 1(A) shows examples from the two classes and classification results across various sample sizes.

Experiment (B) consists of a simple binary image classification problem. Images in class 0 contain randomly sized and spaced horizontal bars, while those in class 1 contain randomly sized and spaced vertical bars. For each sampled image, bars were distributed among the rows or the columns, depending on the class. The distributions of the two classes are identical if a 90 degree rotation is applied to one of them. Projection patches were between one and four pixels in both width and height, and each split node of MF and SPORF considered 28 random projections. Figure 1(B) shows examples from the two classes and classification results across various sample sizes.

Experiment (C) is a signal classification problem. One class consists of 100 values of Gaussian noise while the second class has an added exponentially decaying unit step beginning at time 20.

Projection patches were one-dimensional with a width between one and five time steps. Each split node of MF and SPORF considered the default number of random projections, the square root of the number of features. Figure 1(C) shows examples from the two classes and classification results across various sample sizes.
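For concreteness, experiment (C) data could be generated as in the sketch below; the noise scale, step amplitude, and decay rate are assumptions, since the paper does not specify them.

import numpy as np

def sample_experiment_c(n_samples, length=100, t0=20, decay=0.1, rng=None):
    # Class 0: Gaussian noise. Class 1: noise plus an exponentially decaying
    # unit step starting at time t0 (decay rate and amplitude are assumed).
    rng = np.random.default_rng() if rng is None else rng
    y = rng.integers(0, 2, size=n_samples)
    X = rng.standard_normal((n_samples, length))
    t = np.arange(length)
    step = np.where(t >= t0, np.exp(-decay * (t - t0)), 0.0)
    X[y == 1] += step
    return X, y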

4.2 Classification Accuracy

Figure 1: MF outperforms other algorithms in three two-class classification settings. Upper row shows examples of simulated data from each setting and class. Lower row shows misclassification rate in each setting, tested on 10,000 test samples. (A) Two segments in a discretized circle; segment lengths vary by class. (B) Image setting with uniformly distributed horizontal or vertical bars. (C) White noise (class 0) vs. exponentially decaying unit impulse plus white noise (class 1).

In all three simulation settings, MF outperforms all other classifiers, doing especially well at low sample sizes, with the exception of the CNN, for which there is no clear winner. The performance of MF is particularly good in the discretized circle simulation, for which most other classifiers perform at chance levels. MF also performs well in the signal classification problem, although all of the classifiers are close in performance there. This may be because the exponential signal is present throughout most of the time steps, so exploiting local continuity is less relevant.

Figure 2: Algorithm train times (above) and test times (below). MF runtime is not particularly costly and is well below CNN runtime in most examples.

4.3 Run Time

All experiments were run on a single CPU core. MF has train and test times on par with those of RF and SPORF, and so is not particularly more computationally intensive to run. The CNN, however, took noticeably longer to run, especially in terms of training but also in testing, in two of the three simulations. Thus its strong performance in those settings comes at an added computational cost, a typical issue for deep learning methods [dl_cost].

5 Real Data Results

5.1 Classification Accuracy

Figure 3: MF improves classification accuracy over all other non-CNN algorithms at every sample size, especially at small sample sizes.

MF's performance was evaluated on the MNIST dataset, a collection of handwritten digits stored as 28 by 28 pixel images [mnist], and compared to the algorithms used in the simulations. 10,000 images were held out for testing, and subsets of the remaining images were used for training. The results are displayed in Figure 3. All three forest algorithms were composed of 500 decision trees, and MF was restricted to patches of at most 3 pixels in height and width. MF showed an improvement over the other algorithms, especially at smaller sample sizes. Thus, even this simple modification can improve performance by several percentage points. Specifically, MF achieved a better (lower) classification error than all other algorithms besides the CNN at all sample sizes on this real data problem.

5.2 Feature Importance

To evaluate the capability of MF to identify important features in manifold-valued data, we compared its feature importances to those of RF and SPORF. All methods were run on a subset of the MNIST dataset containing only threes and fives, with 100 images from each class.

Figure 4: Averages of images in the two classes and their difference (above). Feature importance from MF (bottom right) shows less noise than SPORF (bottom middle) and is smoother than RF (bottom left).

The feature importance of each pixel is shown in Figure 4. MF visibly yields a smoother pixel importance map, most likely because of the continuity of neighboring pixels in the selected projections. Although SPORF demonstrated an empirical improvement over RF on the MNIST data, its projection distribution assigns scattered importance to unimportant background pixels as compared to MF. Since projections in SPORF have no continuity constraint, those that select high-importance pixels will also select low-importance pixels by chance. This may be a non-issue asymptotically, but it is a relevant problem in low-sample-size settings. MF, however, assigns little or no importance to these background pixels by virtue of the modified projection distribution.

6 Discussion

The success of sparse oblique projections in decision forests has opened up many possible ways to improve axis-aligned decision forests (including random forests and gradient boosting trees) by way of specialized projection distributions. Traditional decision forests have already been applied to structured data, using predefined features to classify images or pixels. Decision forest algorithms like the one implemented in the Microsoft Kinect have shown great success, but they ignore pixel continuity and are specialized for a specific data modality, namely images [kinect_rf].

We expanded upon sparse oblique projections and introduced a structured projection distribution that uses prior knowledge of the topology of a feature space. The open source implementation of SPORF (https://neurodata.io/sporf/) has allowed for a relatively easy implementation of MF, creating a flexible classification method for a variety of data modalities. We showed in various simulated settings that appropriate domain knowledge can improve the projection distribution to yield impressive results that challenge the strength of deep learning techniques on manifold-valued data. On the MNIST data set, MF showed modest improvements over the other algorithms besides the CNN and smoother importance plots than the other decision forest algorithms. This is in spite of the data set's low-resolution images, which give the modified projection distribution less local structure to exploit.

Research into other, task-specific convolution kernels may lead to improved results in real-world computer vision tasks. Such structured projection distributions, while incorporated into MF here, may also be incorporated into other state-of-the-art algorithms such as XGBoost [XGboost].

References

7 Appendix

1: Input: (1) D_n = {(x_i, y_i)}_{i=1..n}: training data, (2) d: dimensionality of the projected space, (3) f_A: distribution of the atoms, (4) Θ: set of split eligibility criteria
2: Output: A MF decision tree T
3: function growtree(D_n, d, f_A, Θ)
4:     c = 1 ▷ c is the current node index
5:     M = 1 ▷ M is the number of nodes currently existing
6:     S^(c) = bootstrap({1, ..., n}) ▷ S^(c) is the indices of the observations at node c
7:     while c ≤ M do ▷ visit each of the existing nodes
8:         (X', y') = {(x_i, y_i) : i ∈ S^(c)} ▷ data at the current node; compute class counts (for classification)
9:         if Θ satisfied then ▷ do we split this node?
10:            A = [a_1, ..., a_d] ~ f_A ▷ sample random matrix of atoms
11:            X~ = A^T X' ▷ random projection into new feature space
12:            (j*, t*) = findbestsplit(X~, y') ▷ Algorithm 2
13:            S^(M+1) = {i ∈ S^(c) : a_{j*}^T x_i ≤ t*} ▷ assign to left child node
14:            S^(M+2) = {i ∈ S^(c) : a_{j*}^T x_i > t*} ▷ assign to right child node
15:            a*^(c) = a_{j*} ▷ store best projection for current node
16:            τ*^(c) = t* ▷ store best split threshold for current node
17:            κ^(c) = {M+1, M+2} ▷ node indices of children of current node
18:            M = M + 2 ▷ update the number of nodes that exist
19:        else
20:            (a*^(c), τ*^(c), κ^(c)) = NULL
21:        end if
22:        c = c + 1 ▷ move to next node
23:     end while
24:     return {(S^(c), a*^(c), τ*^(c), κ^(c))}_{c=1..M}
25: end function
Algorithm 1 Learning a Manifold Forest (MF) decision tree.
1: Input: (1) (X~, y), where X~ is the d x n matrix of projected data
2: Output: (1) dimension j*, (2) split value τ*
3: function findbestsplit(X~, y)
4:     for j = 1, ..., d do
5:         Let x~^(j) be the j-th row of X~.
6:         {m_i}_{i=1..n} = sort(x~^(j)) ▷ m_i is the index of the i-th smallest value in x~^(j)
7:         t = 0 ▷ initialize split to the left of all observations
8:         n' = 0 ▷ number of observations left of the current split
9:         n'' = n ▷ number of observations right of the current split
10:        if (task is classification) then
11:            for each class k do
12:                n_k = Σ_i I[y_i = k] ▷ total number of observations in class k
13:                n'_k = 0 ▷ number of observations in class k left of the current split
14:                n''_k = n_k ▷ number of observations in class k right of the current split
15:            end for
16:        end if
17:        for t = 1, ..., n do ▷ assess split location, moving right one observation at a time
18:            (n', n'', {n'_k}, {n''_k}) = increment(n', n'', {n'_k}, {n''_k}, y_{m_t}) ▷ update the counts
19:            Q^(j,t) = score(n', n'', {n'_k}, {n''_k}) ▷ measure of split quality
20:        end for
21:    end for
22:    (j*, t*) = argmax_{j,t} Q^(j,t)
23:    τ* = split value between the t*-th and (t*+1)-th sorted values of x~^(j*) ▷ compute the actual split location from the index
24:    return (j*, τ*)
25: end function
Algorithm 2 Finding the best node split. This function is called by growtree (Alg. 1) at every split node. For each of the d dimensions in X~, a binary split is assessed at each location between adjacent observations. The dimension j* and split value τ* that best split the data are selected. The notion of "best" means maximizing some choice of scoring function. In classification, the scoring function is typically the reduction in Gini impurity or entropy. The increment function called within this function updates the counts in the left and right partitions as the split is incrementally moved to the right.
Notes
  • 1 Department of Biomedical Engineering, Johns Hopkins University
  • 2 Center for Imaging Science, Johns Hopkins University
  • 3 Kavli Neuroscience Discovery Institute, Johns Hopkins University