Random Forests

What is a Random Forest?

Random Forest is a robust machine learning algorithm that can be used for a variety of tasks including regression and classification. It is an ensemble method, meaning that a random forest model is made up of a large number of small decision trees, called estimators, which each produce their own predictions. The random forest model combines the predictions of the estimators to produce a more accurate prediction.

Standard decision tree classifiers have the disadvantage that they are prone to overfitting to the training set. The random forest's ensemble design allows the random forest to compensate for this and generalize well to unseen data, including data with missing values. Random forests are also good at handling large datasets with high dimensionality and heterogeneous feature types (for example, if one column is categorical and another is numerical).

Random forests are very good for classification problems but are slightly less good at regression problems. In contrast to linear regression, a random forest regressor is unable to make predictions outside the range of its training data.

Random forests are also black boxes: in contrast to some more traditional machine learning algorithms, it is difficult to look inside a random forest classifier and understand the reasoning behind its decisions. In addition, they can be slow to train and run, and produce large file sizes.

Because they are extremely robust, easy to get started with, good at heterogeneous data types, and have very few hyperparameters, random forests are often a data scientist's first port of call when developing a new machine learning system, as they allow data scientists to get a quick overview of what kind of accuracy can reasonably be achieved on a problem, even if the final solution may not involve a random forest.

Decision Tree Learning

In order to understand Random Forests, it is necessary to first understand how decision trees are built.

A decision tree is a simple way of classifying examples. For example, a common dataset used for testing machine learning algorithms is the Iris Dataset, which is a set of measurements of 150 flowers belonging to three species. A decision tree can be built to identify which is the most likely species that a specimen belongs to, from its petal length, petal width, and sepal length.

There are a number of widely used decision tree algorithms. The best known are the C4.5 algorithm, the ID3 algorithm, and the CART algorithm. For all three, the basic process to build a decision tree model for a problem such as classifying iris flowers is as follows:

1. Choose the attribute which can most effectively split the dataset into different classes. For example, splitting at petal length = 2.45 could be a good choice if most training examples above 2.45 belong to one species and most species below 2.45 belong to another.

2. Split the dataset on this attribute. This corresponds to creating a new node in the decision tree.

3. For each split of the dataset, repeat the process of splitting the dataset on the best attribute.

4. Stop creating new nodes in the tree if all the samples in the current node already belong to the same class, or if no feature provides any value, or if the tree has already reached its maximum allowed depth.

Example decision tree of depth 2 for the Iris dataset

Decision trees are useful models because they allow a human to instantly visualize the decision-making process. However, in their basic form, they come with a large number of limitations, which are the reason why plain single decision trees are rarely used in machine learning today.

The major problem is that decision trees that are grown very deep tend to learn patterns that are very particular to the training dataset. For example, there could be one single example of Iris setosa in the training set with a given set of dimensions which would normally indicate a different species. This could be an anomaly, but a decision tree algorithm might create a node just for that specimen.

This means that a single decision tree is likely to be very strongly adapted to its training set and generalize poorly to unseen examples. In other words, the decision tree overfits the training set.

Random Forests were introduced as a modification to the basic decision tree algorithms which makes them more robust and corrects for the problem of overfitting.

Random Forest Algorithm

There are a number of variants of the random forest algorithm, but the most widely used version in use today is based on Leo Breiman's 2001 paper, so we will follow Breiman's implementation.

Let us assume we have a training set of N training examples, and for each example we have N features. A random forest will consist of N_tree decision trees, or estimators.

1. Bagging

Given the training set of N examples, we repeatedly sample subsets of the training data of size n where n is less than N. Sampling is done at random but with replacement. This subsampling of a training set is called bootstrap aggregating, or bagging, for short.

2. Random subspace method

If each training example has M features, we take a subset of them of size m < M to train each estimator. So no estimator sees the full training set, each estimator sees only m features of n training examples.

3. Training estimators

We create N_tree decision trees, or estimators, and train each one on a different set of m features and n training examples. The trees are not pruned, as they would be in the case of training a simple decision tree classifier.

4. Perform inference by aggregating predictions of estimators

To make a prediction for a new incoming example, we pass the relevant features of this example to each of the N_treeestimators. We will obtain N_tree predictions, which we need to combine to produce the overall prediction of the random forest. In the case of classification, we will use majority voting to decide on the predicted class, and in the case of regression, we will take the mean value of the predictions of all the estimators.

Random forest inference for a simple classification example with N_tree = 3

This use of many estimators is the reason why the random forest algorithm is called an ensemble method. Each individual estimator is a weak learner, but when many weak estimators are combined together they can produce a much stronger learner. Ensemble methods take a 'strength in numbers' approach, where the output of many small models is combined to produce a much more accurate and powerful prediction.

Example training random forest for regression

Let us consider the example of the Boston Housing dataset. This is a well-known dataset of information about different houses in Boston. For each house, 13 values are known, such as the crime rate in that area, industrialization value, average age of residents, and so on. Our task is to train a model to predict the value of a house given these values.

Sample of the Boston Housing dataset. The property value is given in multiples of $1000 as the last column in the table.

We split the entire Boston housing dataset into 67% of examples in the training set, and 33% as our test set.

Training a decision tree of depth 2 on the Boston Housing dataset

If we train a single decision tree classifier on the training set using the C4.5 algorithm (the commonest decision tree algorithm), and we set the maximum depth of the decision tree to 2, we will get a tree looking something like this:

We can note that of the 13 original features, this decision tree has used only LSTAT (the percentage of the population in low income groups) and RM (average number of rooms per dwelling) to generate a prediction. The four leaf nodes show us that this single tree classifier can produce four possible outputs: $30k, $44k, $22k and $14k, even though we are solving a regression problem and the true number could be one of many continuous values.

This simple decision tree has a mean absolute error of $3.6k on the training set, and $3.8k on the test set. This means that although it is not a powerful model, it performs similarly on seen and unseen data, and so it has generalized well and has not overfit the training data.

Training a deeper single decision tree

To make the decision tree more accurate, we can increase the maximum depth to 200. We can produce a much more complex tree with many nodes.

However, if we evaluate this tree, we find that its accuracy has not improved as much we would like.

The mean absolute error of this deep decision tree on the training set is $0.02k, but its mean absolute error on the test set is $2.9k. This means that the decision tree has strongly overfit our training data. In fact, many of the nodes in the decision tree were created because of only one training example. For example, one node is defined by the following path:

This set of conditions is extremely specific and means that the tree will predict the price of a particular house in the training set extremely well, but it is unlikely to generalize to unseen properties in the test set.

Making a random forest ensemble model

We can move on from single decision trees and start to leverage ensemble methods, building a random forest regression model.

We create a random forest regression model using N_tree = 10,000 estimators. We do not limit the maximum depth of the trees, so that each decision tree will be as complex as the large one shown above.

For each of the 10,000 estimators, we use bagging to select a subset of the features and a subset of the training examples. Each estimator is trained on this small subset of the training data. For prediction, we take the mean value of the predictions of each of the 10,000 trees.

Prediction example

Let us take the example of a single house in the test set:

This house has a true value of $23.6k.

To calculate the random forest model's prediction for this house, we can calculate each constituent decision tree's prediction, putting the 13 features into each of the estimators:

The first 3 of the 10,000 estimators, and their predictions for an unseen example

The first three estimators predict a price of $20.3k, $18.6k and $22.6k.

Overall, the 10,000 estimators produce predictions averaging $22.8k, which is an absolute error of $0.8k from the true value.

Performance of the model

The mean absolute error on the training set of the random forest model is $0.9k, and its error on the test set is $2.1k. This means that some overfitting has taken place, since the performance has gone down on unseen data, but the difference is much less extreme than in the case of a single deep decision tree.

We can summarize the performances of the three models as follows:

Model	Mean absolute error on training set	Mean absolute error on test set
Decision tree depth 2	3.6	3.8
Decision tree max depth 200	0.02	2.9
Random forest with 10,000 estimators	0.9	2.1

We can see that the ensemble design of the random forest model has enabled it to improve on the single deep decision tree on both the training and test set. The error has increased slightly on the training set but decreased on the test set, indicating that the random forest model is better at generalizing from training data to unseen data.

Feature importance in a random forest model

Although it is not easy to understand the inner workings of a particular random forest model, it is quite easy to understand which features are most important. This can be done by introducing small perturbations in the input to a model and analysing which features have most influence on the predictions.

It is possible to quantify the feature importances both for the individual estimators and for the ensemble model as a whole. Most random forest software implementation will provide feature importances out of the box.

In the case of the Boston housing dataset model, the features RM (rooms per dwelling) and LSTAT (percentage lower income residents) clearly have a much bigger contribution to the predictions than the remaining 11 features.

Random Forest vs. Gradient Boosted Tree

Gradient Boosted Trees are an alternative ensemble-based design that combines multiple decision trees. In contrast to a random forest, which combines the outputs of many trees which have been trained independently, a gradient boosted tree uses a cascade of decision trees, where each tree helps to correct errors made by the previous tree.

Random Forests and Gradient Boosted Trees differ both in the way that the trees are built, and the way the results are combined for inference.

In Random Forest, the results of all the estimators in the ensemble are averaged together to produce a single output. In Gradient Boosting, a simple, smaller tree is run, and then a series of other estimators are also run in order, to correct the errors of previous estimators. Each estimator's output is weighted by a parameter, so later estimators have a lower weighting than earlier estimators. Gradient Boosted Trees tend to use smaller trees for their estimators than Random Forests.

A Random Forest averages the outputs of its constituent estimators, while a Gradient Boosted Tree assigns different weights to its estimators' outputs

Gradient Boosted Trees have the advantage that they use a single loss function in training. This means that Gradient Boosted Trees can be used for any kind of task which can be expressed in terms of a loss function, making them more versatile than Random Forests.

Because Random Forests involve training each tree independently, they are very robust and less likely to overfit on the training data. Gradient Boosted Trees are harder to tune, and have more complex hyperparameters, but can deliver more powerful models if used correctly.

Applications of Random Forests

Random Forests are widely used across domains and industries thanks to their robustness and ability to deal with different types of features. With some preprocessing, they are very good at processing data that contains both continuous and categorical variables. Provided that accountability and transparency, or speed, are not stringent requirements of the business case, random forests are a popular choice in many business and research applications.

Random Forests in Finance and Banking

Many financial institutions have used random forest models for use cases such as identifying customers of certain types, and predicting customer churn.

For example, it is possible to construct a data representation of each customer of an institution, where a column indicates the date of joining the institution, another column indicates the average volume of transactions up until a certain date in the past, and the final column indicates whether the customer closed their account in a period of three months following the observations.

A random forest classifier can be trained to predict the probability of a customer closing their account, based on observations of their transaction history, and can be applied to current users to predict customer churn over the next three months. This provides highly valuable business intelligence to the company.

Random forests can also be used to identify likely fraudulent transactions. For example, each transaction in a bank has a series of features such as the deviation from the mean transaction volume of the customer, the time of day, the location, and how these values differ from that customer's usual habits. This allows a bank to build a sophisticated model to predict the likelihood of a given transaction being fraudulent. If the probability of fraud exceeds a threshold, such as 50%, the bank can take action, such as freezing the card.

Random Forests in Medicine

In medicine, random forest models have been built to identify and classify diseases based on a series of symptoms. In addition, random forests can be used to derive predictions from patients' electronic health records, which are typically a file containing a series of data points about that patient. A random forest model can be trained on past patients' symptoms and later health or disease progression, and generalized to new patients.

Random Forest History

Decision trees in their non-ensemble form have been around in various guises since at least the 1950s, although no one researcher can be said to be the discoverer and they are likely to have been discovered several times. One of the earliest papers to mention a decision tree is a 1959 article by the British statistician on an algorithm to construct a decision tree for classifying biological organisms.

Over the following decades, decision trees were gradually refined by the statistics community. The best-known algorithms were C4.5, ID3, and CART. However, neither approach involved ensemble methods, and so decision trees were still prone to overfitting, as we have seen with the Boston Housing example above.

In 1995, the Hong Kong-American researcher Ho Tin-Kam developed the first algorithm for random forests, while she was working at Bell Labs in New Jersey. She used the random subspace method to reduce the correlation between estimators, exposing each estimator to a subset of the entire feature set, but still using the entire training set to train each estimator.

In 2001 the American computer scientist Leo Breiman (one of the original discoverers of the CART algorithm) refined Ho's ensemble algorithm to introduce bagging, or bootstrap aggregation. Breiman's idea was to take subsamples from the original training set to train each estimator.

Most modern implementations of the random forest algorithm, such as the implementation in Python's Scikit-Learn library, are based on Breiman's version.

References

Belson, Matching and Prediction on the Principle of Biological Classification (1959)

Breiman, Classification and Regression Trees (1984)

Ho, Random decision forests (1995)

Breiman, Random forests (2001)