The performance of classifiers can be significantly improved by aggregating the decisions of several classifiers instead of relying on a single classifier. This approach is generally known as an ensemble of classifiers, or a multiple classifier system. The ensemble is obtained by perturbing and combining several individual classifiers. Specifically, it is obtained by perturbing the training set or injecting some randomness into each classifier and aggregating the outputs of these classifiers in a suitable way.
Decision trees (DT) and neural networks (NN) are generally used for ensemble generation. Both decision trees and randomized neural networks are unstable classifiers whose performance varies greatly even under a small perturbation of the training set or of some classifier parameters. Thus, they are ideal candidates for the base classifier of an ensemble framework. Random Forest (RaF), an ensemble of decision trees, is an exemplar of such ensembles. It is the top-ranked classifier in a comparison of 179 classifiers on 121 datasets. The standard random forest is, however, superseded by oblique random forest, an ensemble of decision trees that employs a linear hyperplane at each node to split the data instead of a single feature, in a recent exhaustive comparison among 183 classifiers. An ensemble of random vector functional link (RVFL) networks, a popular single-layer feed-forward neural network, also ranks amongst the top 20.
Decision trees in random forest recursively partition the training data into smaller subsets that aid classification, by optimizing an impurity criterion such as information gain or the Gini index. Classical RaFs achieve this by using a single feature at each node to split the training set into two partitions, which generates an axis-parallel (orthogonal) hyperplane at each node. Such hyperplanes may not always approximate complex decision boundaries [6, 7, 8]. In a variant of random forest known as oblique random forest (obRaF), an oblique hyperplane (a linear decision boundary) is used at each node. Such a decision boundary uses a linear combination of features to split the training data. The decision trees in random forest exhaustively search for a single feature among a random subset of features at each node. However, such an exhaustive search for the best oblique hyperplane is computationally expensive. Thus, the search for oblique splits is generally based on heuristic approaches and is non-optimal. To circumvent this issue, we present an oblique random forest that searches for an optimal linear hyperplane at each node from a finite search space while optimizing the Gini impurity criterion, as in RaF.
The Random Vector Functional Link (RVFL) network, on the other hand, is a randomized variant of the Functional Link Neural Network (FLNN). The weight and bias vectors of the hidden layer in RVFL are randomly generated, which makes the learning algorithm less complicated and faster to train than a conventional back-propagation based single-layer feed-forward network (SLFN) [11, 12].
It is generally cumbersome to learn from large datasets. Most classifiers, such as random forest and support vector machine (SVM), do not scale well to datasets with a large sample size, feature dimension or number of classes. Divide-and-conquer strategies and dimension reduction are common techniques employed in such cases. However, the number of classes still poses a constraint on the application of decision tree and SVM based classifiers. The approaches used to handle such scenarios trade off performance against computational complexity. RVFL, on the other hand, can be effectively utilized for such large datasets. In this paper, we extend a previously proposed idea by fusing RVFL with our proposed oblique random forest and show that such ensembles can improve the accuracy while incurring less computational cost than random forest based ensembles. The RVFL partitions the training dataset into several subsets in which confusing or difficult samples are grouped together. This allows us to employ finer classification rules that focus on confusing or difficult-to-classify samples. Through experiments on several datasets with varying sample size and number of classes, we demonstrate that our proposed oblique random forest is superior to standard random forest and its oblique variant in terms of both performance and computational requirements. We then create a hybrid ensemble of RVFL and oblique random forest to further boost the performance of the oblique random forest classifier.
The rest of the paper is organized as follows: we present a brief review of the related works in the following section. In Section III, we elucidate our approach for the hybrid ensemble. In Section IV, we present experimental results and comparison of our proposed hybrid ensemble with different classifiers. Finally, we present our conclusions in Section V.
II Related Works
Before proposing our hybrid ensemble classifier based on decision trees and neural networks, in this section we briefly review decision trees, random forest, random vector functional link networks and some hybridization based classification techniques.
II-A Decision Trees.
A decision tree consists of nodes and edges. The nodes are either internal (split) or leaf (terminal). The internal nodes are split into two child nodes until a stopping criterion is met, after which they become a leaf. Each internal node is associated with a test function $f(\mathbf{x}, \theta)$ defined as:

$$f(\mathbf{x}, \theta) = \begin{cases} 1, & \text{if } \phi_{\theta}(\mathbf{x}) > \tau \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

where $\phi_{\theta}(\mathbf{x}) \in \mathbb{R}$ is a scalar function of the sample $\mathbf{x}$ with parameter $\theta$, and $\tau$ is a threshold. The outcome $f(\mathbf{x}, \theta)$ determines the child node to which $\mathbf{x}$ is routed. For instance, 0 represents the left child node while 1 represents the right child node. Each node chooses the best test function from a pool of potential test functions by optimizing a metric known as Gini impurity. The objective is to make the resulting child nodes as pure as possible, i.e., containing training samples of a single class only.

Based on the nature of the test (split) functions, decision trees are categorized into two types: univariate (axis-parallel or orthogonal) and multivariate (oblique). In a univariate decision tree, the parameter $\theta$ of the test function is based on a single feature, i.e., the node selects the single feature from a random subset of features that best minimizes Gini impurity. The final decision boundary produced by such a tree is of staircase type, as shown in Fig. 1. In multivariate or oblique decision trees, $\theta$ depends on a linear combination of the features. Since the decision boundary can be oriented in any direction relative to the axes, trees with such hyperplanes are also known as oblique trees. Thus, (1) can be reformulated as:

$$f(\mathbf{x}, \theta) = \begin{cases} 1, & \text{if } \sum_{i=1}^{d} w_i x_i > \tau \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

where $w_i$ is the weight coefficient for each feature $x_i$ in $\mathbf{x}$.
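As a minimal illustration of the two kinds of test function, the sketch below assumes the convention stated above (an outcome of 0 routes a sample to the left child, 1 to the right child) and a "greater than threshold" comparison. The function names and the example values are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def univariate_test(x: Sequence[float], feature: int, tau: float) -> int:
    """Axis-parallel split: compare one selected feature with the threshold."""
    return 1 if x[feature] > tau else 0


def oblique_test(x: Sequence[float], w: Sequence[float], tau: float) -> int:
    """Oblique split: compare a linear combination of all features with the threshold."""
    projection = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if projection > tau else 0


x = [0.2, 0.9]
print(univariate_test(x, feature=0, tau=0.5))   # decision uses x[0] only
print(oblique_test(x, w=[1.0, 1.0], tau=1.0))   # decision uses x[0] + x[1]
```

The univariate test is the special case of the oblique test in which the weight vector has a single non-zero entry.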
A random forest is an ensemble of decision trees that are trained independently on several instances of the training data obtained using bagging. In RaF, the optimal split at each node is chosen from the $mtry \times n_u$ possible splits, where $n_u$ is the number of unique feature values in the training samples at the node and $mtry$, the number of randomly selected features, is typically set to the square root of the feature dimension $d$. That means, in the worst case, the exhaustive search for the ‘optimal’ split grows linearly in the number of training samples at each node and in $mtry$. However, there are at most $2^{d}\binom{n}{d}$ distinct oblique splits of $n$ samples at each node, and the exhaustive search for the ‘optimal’ oblique split is NP-hard.

Several heuristic approaches for generating oblique splits have been proposed in the literature. Generally, linear classifiers such as SVM, MPSVM, LDA and logistic regression are used to generate oblique hyperplanes [7, 15, 16, 17, 18]. However, most of these approaches are used without optimizing an impurity criterion or searching for the optimal oblique hyperplane. A multi-class classification problem is usually reformulated as a binary problem by defining two or more hyperclasses [16, 18], either via clustering, some distance metric or possible combinations of classes, and a linear decision boundary is then learned. These approaches either do not optimize an impurity criterion or are computationally expensive. However, impurity optimization for the oblique hyperplane search can still be integrated in obRaF, as in RaF, without incurring great computational cost by implementing a simple and effective approach.
II-B Random Vector Functional Link Networks.
An RVFL is a single-layer feed-forward neural network that is mainly characterized by the absence of backpropagation (BP) and the presence of direct links between the input and output nodes (see Fig. 2) [10, 11]. The weights between the input and hidden neurons in RVFL are randomly generated from a suitable range. The direct links in RVFL regularize the network against the effects of randomization, leading to a simpler model with a small number of hidden neurons while improving the generalization performance of the neural network [12, 19]. The output layer of RVFL consists of nodes corresponding to the number of classes, with each node assigning a score to its class. The predicted class for a sample $\mathbf{x}$ is the class represented by the node with $\max_{c} s_c(\mathbf{x})$, where $s_c(\mathbf{x})$ is the score given by output node $c$. Since the hidden layer parameters are randomly generated and kept fixed, only the output weights $\boldsymbol{\beta}$ need to be computed. The learning objective of RVFL is:

$$\min_{\boldsymbol{\beta}} \; \|\mathbf{D}\boldsymbol{\beta} - \mathbf{Y}\|^{2} + \lambda \|\boldsymbol{\beta}\|^{2} \tag{3}$$

where $\mathbf{D} = [\mathbf{X} \;\; \mathbf{H}]$ is the stacked feature matrix obtained from the direct links ($\mathbf{X}$) and the hidden layer ($\mathbf{H}$), $\mathbf{Y}$ is the vector of class labels, and $\lambda$ is the regularization parameter.

A closed form solution of (3) can be obtained by using either least squares or the Moore-Penrose pseudoinverse. Using the Moore-Penrose pseudoinverse ($\lambda = 0$), the solution is given by:

$$\boldsymbol{\beta} = \mathbf{D}^{+}\mathbf{Y} \tag{4}$$

while using least squares (ridge regression), the closed form solution is given by:

$$\boldsymbol{\beta} = (\mathbf{D}^{T}\mathbf{D} + \lambda \mathbf{I})^{-1}\mathbf{D}^{T}\mathbf{Y} \tag{5}$$

where $\mathbf{I}$ is the identity matrix.
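The closed-form training step can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the function names, the sine activation, the uniform sampling ranges, the one-hot encoding of the labels and the toy data are all illustrative choices. Only the output weights are solved for, using the ridge-regression closed form on the stacked matrix of direct-link inputs, hidden-layer outputs and a bias column.

```python
import numpy as np


def rvfl_features(X, W, b):
    """Stack direct links, hidden-layer outputs and an output bias column."""
    H = np.sin(X @ W + b)                      # sine activation (assumed choice)
    ones = np.ones((X.shape[0], 1))
    return np.hstack([X, H, ones])


def train_rvfl(X, y, n_hidden=10, lam=1e-2, scale=1.0, seed=0):
    """Draw the hidden layer at random, then solve for the output weights only."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-scale, scale, size=(X.shape[1], n_hidden))
    b = rng.uniform(0.0, scale, size=n_hidden)
    D = rvfl_features(X, W, b)
    Y = np.eye(int(y.max()) + 1)[y]            # one-hot class targets
    # Ridge-regression closed form: beta = (D^T D + lam * I)^(-1) D^T Y
    beta = np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]), D.T @ Y)
    return W, b, beta


def rvfl_scores(X, W, b, beta):
    """One score per class; the arg-max is the predicted class."""
    return rvfl_features(X, W, b) @ beta


# Toy data: two well-separated Gaussian blobs (illustrative only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 0.5, (30, 2)), rng.normal(2.0, 0.5, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
W, b, beta = train_rvfl(X, y)
accuracy = float(np.mean(rvfl_scores(X, W, b, beta).argmax(axis=1) == y))
print(f"training accuracy: {accuracy:.2f}")
```

Because the hidden layer is fixed after sampling, training reduces to a single linear solve whose size depends on the number of input features and hidden neurons, not on the number of samples.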
II-C Hybridization based classification techniques
Many classifiers (heterogeneous or homogeneous) can be used as base classifiers in an ensemble framework. Here, we refer to some classification techniques that employ both neural networks and decision trees. In one approach, decision trees are used to empirically determine the number of neurons needed in the three layers of a neural network. Richmond et al. extend this idea by mapping stacked RF to Convolutional Neural Networks (CNN) for semantic segmentation. Similarly, Jerez et al. use decision trees to identify the most important variables in a breast cancer data set and use those variables as input to their neural network architecture. Another work integrates NN and DT by replacing the final softmax layer of a CNN with DT, while a further one uses multi-layer perceptrons as split functions in each node of the trees. Different from these works, we present a hybrid (heterogeneous) ensemble of classifiers in which we exploit the probability-like outputs of a neural network to quickly partition the training data for efficient multi-class classification by decision trees (or random forest).
Apart from the hybridization techniques discussed above, we also review some ideas relevant to our proposed method. Binary splits are the norm in decision trees, and there is comparatively little research on multi-way splits. Multi-way (multi-branch) splits in decision trees have previously been studied in [25, 26, 27]. In one of these studies, correlation is used to find the best single feature, and the thresholds that split the training data into multiple branches are computed by SVM. However, such multi-way splits are cumbersome to determine and do not improve the performance of decision trees.

Another work closely related to ours uses a deep neural network to perform a hierarchical partition of the data, as in decision trees, while creating clusters of confusing classes. The classes are clustered by applying a spectral co-clustering algorithm to a confusion matrix computed on the validation dataset. Thus, it is computationally expensive and requires a large dataset. In contrast, we employ a simple, fast neural network to partition the data without incurring large computational cost. We extend this idea of RVFL-based data partitioning by proposing an improved oblique random forest. Our method is based on an ensemble framework that boosts the capacity of random forest to handle multi-class classification.
III RVFL and oblique random forest for many class problems
An RVFL followed by oblique decision trees is the base classifier in our ensemble framework. Each bag of the original training data obtained using bagging is partitioned by RVFL into several subsets such that the decision trees employed afterwards can improve the classification performance by learning to separate the confusing training samples. Decision trees, or more specifically ensembles of decision trees, are among the best classification algorithms in terms of generalization ability and robustness. The partitions obtained using RVFL make it possible to employ more fine-grained classification rules via decision trees, as the classification algorithm focuses on difficult-to-classify samples. In this section, we first describe the data partitioning step by RVFL and then present our oblique random forest.
III-A Data Partition by RVFL
We employ RVFL at the top node to divide the data into $C$ partitions, where $C$ is the number of classes in a dataset. Our proposed oblique decision tree is thereafter trained on each partition separately to improve the accuracy. In each partition, the class distribution of samples is unique, i.e., the majority of the samples are from one class and the rest are from other classes. The samples from the other classes are those that are “hard” for RVFL to classify. Such partitioning is possible by utilizing the output scores given by RVFL.

In the training phase, each training sample $\mathbf{x}$ is passed to RVFL. The output of RVFL is a probability-like score for each class in that particular data set. Generally, the class with the highest score is the class predicted by RVFL. In our case, however, the two classes with the highest and second-highest scores are selected as the potential classes, which indicate the most confusing classes for that particular training sample. Each partition by RVFL corresponds to a class. Thus, $\mathbf{x}$ is used as training data for the oblique decision trees associated with those two classes (subsets). This procedure is repeated for all training samples, creating a training set for each decision tree. The final model is an ensemble of such base classifiers. In cases where the true class of $\mathbf{x}$ is neither the highest nor the second-highest scoring class, the training sample is still placed in the partition of its true class. Further details are available in the earlier work on this partitioning scheme.
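The training-phase partitioning described above can be sketched as follows. This is a hedged illustration rather than the authors' code: the function name and the toy scores are hypothetical, and the sketch assumes that a sample whose true class is outside the top two is added to its true-class partition in addition to the two top-scoring partitions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def partition_by_top_two(scores: Sequence[Sequence[float]],
                         labels: Sequence[int]) -> Dict[int, List[int]]:
    """Map each class to the indices of the training samples placed in its partition."""
    partitions: Dict[int, List[int]] = defaultdict(list)
    for idx, (row, true_class) in enumerate(zip(scores, labels)):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        targets = set(ranked[:2])      # highest and second-highest scoring classes
        targets.add(true_class)        # assumption: always keep the sample in its true class
        for c in sorted(targets):
            partitions[c].append(idx)
    return dict(partitions)


# Toy class scores for three samples and three classes (made-up values).
scores = [
    [0.7, 0.2, 0.1],   # classes 0 and 1 score highest
    [0.1, 0.5, 0.4],   # classes 1 and 2 are confused
    [0.6, 0.3, 0.1],   # the true class (2) is ranked last by the network
]
labels = [0, 2, 2]
print(partition_by_top_two(scores, labels))
```

Each partition can then be used to train a separate oblique decision tree, so the trees see mostly the samples that the network finds hard to tell apart.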
III-B Improved Oblique Random Forest
Decision trees employ recursive partitioning so that the child nodes are purer than the parent node. The objective is to separate the training samples into different partitions such that these partitions contain samples of one class only. In RaF, such partitions are obtained by an exhaustive search for the best orthogonal hyperplane. This problem, however, can be reformulated using the information of the class labels of the training samples.
Many popular binary classifiers such as the Support Vector Machine (SVM) use the “one-vs-all” approach to break down a multi-class classification problem into several binary classification problems. Specifically, for each class, a single SVM is trained with the samples of that class as positive samples and all others as negative samples. A caveat is that this may not always be the best method for dealing with multi-class problems. Such methods can, however, be integrated directly at the internal nodes of decision trees. As stated earlier, a linear classifier at a node need not be a perfect classifier; it simply has to aid further classification. Thus, the above objective of random forest can be restated as separating one class from all other classes at each node. That means, instead of performing an exhaustive search, one can search among $c$ hyperplanes by transforming a multi-class classification problem into $c$ binary classification problems, where $c$ is the number of classes at each node. (The total number of classes in a dataset is denoted by $C$ and the number of classes at each node by $c$, where $c = C$ at the root node.) This restricts the hyperplane search space to $c$ candidates at each node, and the linear classifier that best optimizes the impurity criterion can be selected. In the best scenario, a linear classifier or decision boundary places all samples of one class in one child node and the rest of the training samples in the other child node, which is exactly the objective of decision trees, i.e., to make the child nodes purer than the parent node. Thus, employing linear hyperplanes with impurity optimization may help to capture the geometric structure of the data better than axis-parallel hyperplanes.
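For our oblique decision trees, we employ an MPSVM based linear classifier. MPSVM generates two non-parallel planes based on the proximity to each class, and the final decision boundary employed at each node is based on the angle bisector of these two planes. Writing a plane as $\mathbf{x}^{T}\mathbf{w} - b = 0$ with $\mathbf{z} = [\mathbf{w}; b]$, the two planes are given by the eigenvectors corresponding to the smallest eigenvalues of the generalized eigenvalue problems

$$\mathbf{G}\mathbf{z} = \mu \mathbf{H}\mathbf{z}, \qquad \mathbf{H}\mathbf{z} = \nu \mathbf{G}\mathbf{z} \tag{6}$$

where $\mathbf{G} = [\mathbf{A} \;\; {-\mathbf{e}}]^{T}[\mathbf{A} \;\; {-\mathbf{e}}]$, $\mathbf{H} = [\mathbf{B} \;\; {-\mathbf{e}}]^{T}[\mathbf{B} \;\; {-\mathbf{e}}]$, and $\mathbf{A}$, $\mathbf{B}$ are the matrices of class 1 and 2 respectively and $\mathbf{e}$ is the vector of ones. The linear hyperplane at each node is the one that passes in between them.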
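At each node of the tree, oblique hyperplanes based on MPSVM are obtained and the one that maximizes (7) is selected as the node splitting hyperplane:

$$\Delta \mathrm{Gini} = \mathrm{Gini}_{p} - \mathrm{Gini}_{c}, \qquad \mathrm{Gini}_{c} = \frac{n^{l}}{n^{t}}\left[1 - \sum_{i=1}^{c}\left(\frac{n^{l}_{i}}{n^{l}}\right)^{2}\right] + \frac{n^{r}}{n^{t}}\left[1 - \sum_{i=1}^{c}\left(\frac{n^{r}_{i}}{n^{r}}\right)^{2}\right] \tag{7}$$

where $\mathrm{Gini}_{p}$ and $\mathrm{Gini}_{c}$ are the values of Gini impurity at the parent node and child nodes respectively, $n^{t}$ is the number of data samples in the parent node, $n^{l}$ and $n^{r}$ are the numbers of data samples that reach the left and right child nodes of the current parent node, and $n^{l}_{i}$ and $n^{r}_{i}$ are the numbers of samples of class $i$ in the left and right child nodes respectively.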
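The impurity criterion used to score a candidate split can be sketched in a few lines of Python. The sketch assumes the standard definition, in which the gain is the Gini impurity of the parent node minus the size-weighted Gini impurity of the two child nodes; the function names are illustrative.

```python
from collections import Counter
from typing import Sequence


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_gain(parent: Sequence[int], left: Sequence[int], right: Sequence[int]) -> float:
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - children


parent = [0, 0, 1, 1]
print(gini_gain(parent, [0, 0], [1, 1]))   # split that separates the classes
print(gini_gain(parent, [0, 1], [0, 1]))   # split that leaves both children mixed
```

A split that isolates the classes recovers the full parent impurity as gain, whereas a split that leaves the class proportions unchanged yields no gain.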
The construction algorithm of the proposed oblique random forest is presented in Algorithm 1.
One issue with the “one-vs-all” technique is that it is computationally demanding when the number of classes is very large. Thus, generating $c$ hyperplanes at each node can be computationally expensive in such cases. However, it also offers an advantage. Since the linear hyperplane at each node is selected from a pool of hyperplanes based on the impurity criterion (more specifically, (7)), it results in pure child nodes faster than previous exhaustive approaches (in the case of RaF) and non-impurity-optimizing approaches (in the case of obRaF; here obRaF refers to the older variants of oblique random forest that do not employ impurity optimization techniques). Thus, the trees in our proposed oblique random forest variant are generally shallower than standard trees. This may offset the complexity associated with generating many hyperplanes at each node. In Section IV, we validate this through experiments on several datasets.
When RVFL is employed at the top (root) node, each subset or partition contains a majority of samples from a few classes and the rest from the others. Because linear classifiers trained with a greater number of training samples are more likely to optimize the impurity (Gini) criterion than classifiers trained with very few samples, we can skip the classes with few samples and instead choose the best hyperplane from a pool of hyperplanes obtained using only the classes with larger training data. This is intuitive, since a hyperplane generated using the “one-vs-all” method attempts to separate one class from the rest and thus favours classes with larger training data, as it better optimizes (7). This observation is also supported by our experiments, in which the best hyperplane is usually one obtained for a class with a larger number of training samples. Thus, when the number of classes is very large, we apply the “one-vs-all” approach to only the top $k$ classes, where $k$ is a hyperparameter whose value we set based on our experiments. This further decreases the computational complexity of the model without incurring any significant loss in performance.
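The node-level search can be sketched as below: one candidate hyperplane per class, restricted to the most frequent classes at the node, with the candidate of largest Gini gain kept. This is a sketch under stated assumptions. In particular, the linear classifier is a simple stand-in that bisects the two class means and is not MPSVM; the function names, the parameter `k` and the toy data are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def gini(labels: Sequence[int]) -> float:
    n = len(labels)
    return 1.0 - sum((m / n) ** 2 for m in Counter(labels).values()) if n else 0.0


def mean(rows: List[Vector]) -> List[float]:
    return [sum(col) / len(rows) for col in zip(*rows)]


def centroid_hyperplane(X, y, positive) -> Tuple[List[float], float]:
    """Stand-in linear classifier (NOT MPSVM): plane halfway between the two class means."""
    pos = mean([x for x, c in zip(X, y) if c == positive])
    neg = mean([x for x, c in zip(X, y) if c != positive])
    w = [p - q for p, q in zip(pos, neg)]
    tau = sum(wi * (p + q) / 2 for wi, p, q in zip(w, pos, neg))
    return w, tau


def best_one_vs_all_split(X, y, k: Optional[int] = None):
    """Try one hyperplane per candidate class; return (gain, class, w, tau) of the best."""
    # Only the k most frequent classes at the node act as the "one" class.
    candidates = [c for c, _ in Counter(y).most_common(k)]
    best = None
    for positive in candidates:
        w, tau = centroid_hyperplane(X, y, positive)
        side = [sum(wi * xi for wi, xi in zip(w, x)) > tau for x in X]
        left = [c for c, s in zip(y, side) if not s]
        right = [c for c, s in zip(y, side) if s]
        gain = gini(y) - (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best is None or gain > best[0]:
            best = (gain, positive, w, tau)
    return best


# Toy node with three classes of unequal size (made-up points).
X = [(0, 0), (1, 0), (0, 1), (5, 0), (6, 0), (0, 6)]
y = [0, 0, 0, 1, 1, 2]
gain, positive, w, tau = best_one_vs_all_split(X, y, k=2)
print(positive, round(gain, 4))
```

With `k=2`, the class with a single sample never acts as the positive class, so only two hyperplanes are generated at this node instead of three.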
Our method is particularly suitable for multi-core or distributed environments, where, after the partitioning by RVFL, each partition can be processed in parallel or distributed across different cores. Similarly, the hyperplane generation using “one-vs-all” can also be distributed. This method is also suitable for large datasets. However, a caveat associated with our ensemble classifier is that it can only be employed for many-class classification problems. For binary problems, the partitions provided by RVFL are just duplicated versions of the same data.
In this section, we compare the performance of random forest variants and our hybrid ensemble. We compare four classifiers on 10 UCI multi-class datasets. The number of classes in these datasets varies from 7 to 100. The datasets are selected based on their size and number of classes, and on the previously reported performance of oblique random forest and the hybrid ensemble. The properties of these datasets are summarized in Table IV.
Dataset         #Patterns  #Features  #Classes
Chess-krvk      28056      6          18
Letter          20000      16         26
Optical         3823       62         10
Pendigits       7494       16         10
Plant margin    1600       64         100
Plant shape     1600       64         100
Statlog-image   2310       18         7
USPS            9298       252        10
W-qua-white     4898       11         7
Yeast           1484       8          10
IV-A Experimental setup
We follow the experimental setup of [3, 13]. For a fair comparison between all the ensemble methods, we use the same values for the common parameters. Thus, for each classifier, we set the ensemble size (the number of trees) to 500 and the number of random features at each node ($mtry$) to $\sqrt{d}$, where $d$ is the feature dimension. If the feature vector has not been normalized, each feature is normalized by removing the mean and dividing by its variance. In all the ensembles, the trees are fully grown until the terminating criterion is met (it is no longer possible to optimize (7)).

RVFL Configuration. The objective of our proposed method is to obtain diverse RVFL models in each base classifier, which in turn results in diverse data partitions and hence diverse decision trees. We use a previously reported parameter setting, in which each RVFL randomly picks the activation function and network parameters from the settings listed below:
Number of hidden neurons, $N$ = 3:203
$\lambda$ (= $1/C$) in ridge regression, $C = 2^{x}$ with $x$ = -5:14
Activation functions: radbas, sine and tribas
Range of the randomization for weights [$-S$, $+S$] and bias [0, $S$], where $S = 2^{t}$ with $t$ = -1.5:0.5:1.5
The RVFL has direct links from input layer to output layer with bias term in the output neuron.
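Drawing one random configuration per base classifier can be sketched as follows. The sketch rests on assumed readings of the ranges listed above: any integer number of hidden neurons in 3:203 is allowed, the regularization parameter is the reciprocal of a power of two with exponent in -5:14, and the randomization scale is a power of two with exponent in -1.5:0.5:1.5. The radbas and tribas definitions follow the usual radial-basis and triangular-basis transfer functions; all names are illustrative.

```python
import math
import random

# Candidate activation functions (assumed standard definitions).
ACTIVATIONS = {
    "radbas": lambda s: math.exp(-s * s),          # radial basis
    "sine": math.sin,
    "tribas": lambda s: max(0.0, 1.0 - abs(s)),    # triangular basis
}


def sample_rvfl_config(rng: random.Random) -> dict:
    """Pick one activation function and one set of network parameters at random."""
    t = rng.choice([-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5])
    x = rng.randint(-5, 14)                        # exponent of the ridge parameter
    return {
        "n_hidden": rng.randint(3, 203),           # assumption: any integer in 3:203
        "lam": 1.0 / 2.0 ** x,                     # assumption: lam = 1 / 2^x
        "activation": rng.choice(sorted(ACTIVATIONS)),
        "scale": 2.0 ** t,                         # weights in [-scale, scale], bias in [0, scale]
    }


rng = random.Random(0)
for _ in range(3):
    print(sample_rvfl_config(rng))
```

Sampling a fresh configuration for every base classifier is what makes the RVFL models, and therefore the data partitions, differ across the ensemble.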
IV-B Comparison between random forest variants
In Table III, we present the classification accuracies of each classifier in each dataset. First we compare the performance of random forest variants: RaF, obRaF and obRaF(M). In almost all the datasets except Plant margin, our proposed oblique random forest (obRaF(M)) outperforms both standard random forest and oblique random forest. Although both obRaF and obRaF(M) employ linear decision boundary at each node using MPSVM, only obRaF(M) performs a search for the optimal linear boundary. This suggests that oblique random forests that employ linear decision boundaries with impurity optimization generalize better than other random forest variants.
In Tables I and II, we compare the training time and the average number of nodes of the random forest classifiers. For the comparison, we select two datasets: Plant shape, with 100 classes, and Pendigits, a medium-sized dataset. Even though our proposed oblique random forest employs the “one-vs-all” approach, it still offers computational advantages over RaF and obRaF, as is evident from its shorter training time and fewer nodes.
IV-C Performance comparison between all classifiers
The average classification accuracies (in %) of RaF, obRaF, obRaF(M) and obRaFL are 83.07, 83.68, 84.36 and 84.6 respectively. The hybrid ensemble (obRaFL) has the highest accuracy, followed by our proposed oblique random forest, obRaF(M). However, average accuracy is susceptible to outliers and may compensate for a classifier's poor performance on one dataset with an excellent performance on another. Thus, we follow a previously used procedure and use the rank of each classifier to assess its performance. In this approach, each classifier is ranked on each dataset based on its performance: the highest performing classifier is ranked 1, the second highest is ranked 2, and so on. The mean rank of each classifier over all the datasets is presented in Table II. Our hybrid ensemble, obRaFL, is the top-ranked classifier, followed by our proposed oblique random forest, obRaF(M).
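The rank-based comparison can be sketched as follows. The accuracy values in the example are made up purely to exercise the code and are not results from this paper; the function names are illustrative, and the handling of ties (tied classifiers share the average of their ranks) is an assumption, since the text does not specify it.

```python
from typing import List, Sequence


def ranks_desc(values: Sequence[float]) -> List[float]:
    """Rank values so that the largest gets rank 1; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1          # positions i..j hold equal values
        for pos in range(i, j + 1):
            ranks[order[pos]] = average
        i = j + 1
    return ranks


def mean_ranks(table: Sequence[Sequence[float]]) -> List[float]:
    """Rows are datasets, columns are classifiers; returns one mean rank per classifier."""
    per_dataset = [ranks_desc(row) for row in table]
    return [sum(col) / len(col) for col in zip(*per_dataset)]


# Made-up accuracies (%) for three classifiers on three datasets.
toy_accuracies = [
    [90.0, 91.0, 92.0],
    [80.0, 85.0, 85.0],
    [70.0, 60.0, 75.0],
]
print(mean_ranks(toy_accuracies))
```

Because each dataset contributes only a rank, a single dataset with an unusually large accuracy gap cannot dominate the comparison.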
Thus, from the experimental results, we can infer that employing obRaF(M) on the partitions provided by RVFL improves the performance. The RVFL provides partitions with confusing samples so classification rules that focus on such difficult-to-classify samples can be employed. Such rules can be easily implemented by decision trees or random forest owing to their superior and robust performance. Furthermore, our proposed oblique random forest improves the random forest and by employing RVFL at the top node, we can obtain more robust and superior performance.
obRaF is the earlier MPSVM-based oblique random forest without impurity optimization. obRaF(M) is our proposed oblique random forest, while obRaFL is the hybrid ensemble of RVFL and obRaF(M). Bold values indicate the best performance.
Lower rank reflects better performance.
IV-D Analysis of Common Parameters
Tree depth, the number of features randomly selected at each node and the number of trees are the common parameters of random forest based methods. We evaluate the influence of each parameter on our proposed oblique random forest and the hybrid classifier using the Pendigits dataset. Similar conclusions hold for the other datasets as well. For the analysis, the maximum depth, the number of trees and the number of random features are varied in the ranges [1,15], [1,500] and [1,16] respectively.

Tree depth. Generally, the trees in random forest are fully grown. From Fig. 3, we can see that there is a sharp increase in accuracy when the tree depth increases from 1 to 6. This implies that, to obtain good performance, the trees should be sufficiently deep. As the trees in the hybrid classifier are grown on a reduced data set, they are usually shallow. Nevertheless, the hybrid classifier performs better than oblique random forest even when the maximum depth is set to a small value, because of the nature of the data partitions (small size and fine-grained classification rules).

Number of features. The number of features randomly selected at each node, $mtry$, controls the diversity of the trees in the forest. A small value of $mtry$ results in uncorrelated trees, whereas large values of $mtry$ may result in correlated trees. Generally, setting $mtry$ to the square root of the feature dimension gives good performance.

Number of trees. As the number of trees (base classifiers) increases, the generalization ability of random forest based methods also increases. However, a large number of trees (ensemble size) also increases the computational cost. There is only a slight improvement in performance beyond an ensemble size of 300. In Fig. 3, we can observe that even for small tree depth and ensemble size, our hybrid ensemble classifier provides near-maximal performance. For large datasets, random forest requires a large number of very deep trees to provide acceptable performance. This may be computationally intractable. However, by employing our hybrid ensemble with shallow trees and a small ensemble size, we can avoid expensive computational requirements and still obtain good performance.
In this paper, we first proposed an oblique decision tree that uses impurity optimization similar to the trees in random forests. We then combined the oblique decision trees with a fast RVFL network to create a hybrid ensemble. In each base classifier, the decision trees were trained on the samples partitioned by RVFL. Such a marriage of decision trees and a fast neural network further enhances the capability of decision trees to handle multi-class classification problems, as is evident from the performance of the hybrid ensemble on several machine learning datasets. Even with small tree depth and ensemble size, our hybrid ensemble can achieve superior performance compared to standard random forest classifiers. This can save significant computational resources and time when dealing with large datasets. One interesting trait of our hybrid ensemble is its suitability for parallel or distributed computation, as several jobs can be effectively distributed over several cores or machines. However, the gain can only be realized if the communication overhead is reduced. Even though the proposed oblique random forest has a faster training time, the partitioning and the application of RVFL add complexity to the ensemble. Thus, our future work is to improve the training time of our hybrid ensemble through efficient use of distributed computing.
-  L. Breiman, “Bias, variance, and arcing classifiers,” Tech. Rep. 460, Statistics Department, University of California, Berkeley, CA, USA, 1996.
-  ——, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5–32, 2001.
-  M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim, “Do we need hundreds of classifiers to solve real world classification problems?” Journal of Machine Learning Research, vol. 15, pp. 3133–3181, 2014. [Online]. Available: http://jmlr.org/papers/v15/delgado14a.html
-  L. Zhang and P. N. Suganthan, “Benchmarking ensemble classifiers with novel co-trained kernel ridge regression and random vector functional link ensembles [research frontier],” IEEE Computational Intelligence Magazine, vol. 12, no. 4, pp. 61–72, 2017.
-  A. Criminisi, J. Shotton, E. Konukoglu et al.
-  S. K. Murthy, S. Kasif, and S. Salzberg, “A system for induction of oblique decision trees,” Journal of Artificial Intelligence Research, vol. 2, pp. 1–32, 1994.
-  B. H. Menze, B. M. Kelm, D. N. Splitthoff, U. Koethe, and F. A. Hamprecht, “On oblique random forests,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2011, pp. 453–469.
-  L. Zhang, J. Varadarajan, P. N. Suganthan, N. Ahuja, and P. Moulin, “Robust visual tracking using oblique random forests,” in
-  L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and regression trees. CRC press, 1984.
-  S. Dehuri and S.-B. Cho, “A comprehensive survey on functional link neural networks and an adaptive pso–bp learning for cflnn,” Neural Computing and Applications, vol. 19, no. 2, pp. 187–205, 2010.
-  Y.-H. Pao, S. M. Phillips, and D. J. Sobajic, “Neural-net computing and the intelligent control of systems,” International Journal of Control, vol. 56, no. 2, pp. 263–289, 1992.
-  L. Zhang and P. N. Suganthan, “A comprehensive evaluation of random vector functional link networks,” Information sciences, vol. 367, pp. 1094–1105, 2016.
-  R. Katuwal, P. Suganthan, and L. Zhang, “An ensemble of decision trees with random vector functional link networks for multi-class classification,” Applied Soft Computing, 2017.
-  L. Breiman, “Bagging predictors,” Machine learning, vol. 24, no. 2, pp. 123–140, 1996.
-  L. Zhang and P. N. Suganthan, “Oblique decision tree ensemble via multisurface proximal support vector machine,” IEEE Transactions on Cybernetics, vol. 45, no. 10, pp. 2165–2176, Oct 2015.
-  L. Zhang, W.-D. Zhou, T.-T. Su, and L.-C. Jiao, “Decision tree support vector machine,” International Journal on Artificial Intelligence Tools, vol. 16, no. 01, pp. 1–15, 2007.
-  T. D. Lemmond, B. Y. Chen, A. O. Hatch, and W. G. Hanley, “An extended study of the discriminant random forest,” in Data Mining. Springer, 2010, pp. 123–146.
-  A. K. Y. Truong, “Fast growing and interpretable oblique trees via logistic regression models,” Ph.D. dissertation, University of Oxford, 2009.
-  Y. Ren, P. N. Suganthan, N. Srikanth, and G. Amaratunga, “Random vector functional link network for short-term electricity load demand forecasting,” Information Sciences, vol. 367, pp. 1078–1093, 2016.
-  I. K. Sethi, “Entropy nets: from decision trees to neural networks,” Proceedings of the IEEE, vol. 78, no. 10, pp. 1605–1613, 1990.
-  D. L. Richmond, D. Kainmueller, M. Y. Yang, E. W. Myers, and C. Rother, “Relating cascaded random forests to deep convolutional neural networks for semantic segmentation,” arXiv preprint arXiv:1507.07583, 2015.
-  J. M. Jerez-Aragonés, J. A. Gómez-Ruiz, G. Ramos-Jiménez, J. Muñoz-Pérez, and E. Alba-Conejo, “A combined neural network and decision trees model for prognosis of breast cancer relapse,” Artificial intelligence in medicine, vol. 27, no. 1, pp. 45–63, 2003.
-  P. Kontschieder, M. Fiterau, A. Criminisi, and S. Rota Bulo, “Deep neural decision forests,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1467–1475.
-  S. Rota Bulo and P. Kontschieder, “Neural decision forests for semantic image labelling,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 81–88.
-  Y. Mansour and D. A. McAllester, “Boosting with multi-way branching in decision trees,” in Advances in Neural Information Processing Systems, 2000, pp. 300–306.
-  H. Sadoghi Yazdi, N. Salehi Moghaddami, and H. Poostchi Mohammadabadi, “Correlation based splitting criterion in multi branch decision tree,” Central European Journal of Computer Science, vol. 1, 2011.
-  E. Frank and I. H. Witten, “Selecting multiway splits in decision trees,” 1996.
-  V. N. Murthy, V. Singh, T. Chen, R. Manmatha, and D. Comaniciu, “Deep decision network for multi-class image classification,” in Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on. IEEE, 2016, pp. 2240–2248.
-  C.-W. Hsu and C.-J. Lin, “A comparison of methods for multiclass support vector machines,” IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 415–425, Mar 2002.