Random Forests

What is a Random Forest in Machine Learning?

A random forest is a supervised learning algorithm that builds multiple decision trees and merges them into one “forest.” The goal is to rely not on a single model but on a collection of decision trees, which together improve accuracy. The primary difference from standard decision tree algorithms is that the candidate features considered at each splitting node are selected at random, rather than all features being evaluated at every split.

Why Use a Random Forest?

Although it requires far more processing power than a single decision tree, this approach offers four key advantages:

  • Can be used for both classification and regression tasks.

  • Overfitting is less likely to occur as more decision trees are added to the forest.

  • The classifier can handle missing values.

  • The classifier can also model categorical features.
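Both task types map onto the same ensemble idea. As a minimal sketch (assuming scikit-learn, which the article does not name, and synthetic data):

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: predict a discrete class label.
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.predict(X[:3]))

# Regression: predict a continuous target with the same ensemble approach.
Xr, yr = make_regression(n_samples=100, n_features=4, random_state=0)
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(Xr, yr)
print(reg.predict(Xr[:3]))
```

The only change between the two tasks is the estimator class; the forest-building machinery is shared.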

How Does a Random Forest Work?

The first step is to create the random forest. The exact code varies by implementation, but the general process can be described in pseudocode as:

  1. Randomly select “K” features from the total “m” features, where K < m

  2. Among the “K” features, calculate the node “d” using the best split point

  3. Split the node into daughter nodes using that best split

  4. Repeat the previous steps until “l” number of nodes has been reached

  5. Build the forest by repeating steps 1–4 “n” times to create “n” trees
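The steps above can be sketched from scratch. The following is an illustrative, unoptimized Python version; using Gini impurity as the split criterion and a depth limit as the stopping rule are assumptions, since the pseudocode names neither:

```python
import random
from collections import Counter

def gini(labels):
    # Gini impurity of a set of class labels (0 = pure).
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, feature_idxs):
    # Step 2: among the K candidate features, find the split point
    # with the lowest weighted Gini impurity.
    best = None
    for f in feature_idxs:
        for t in set(r[f] for r in rows):
            left = [i for i, r in enumerate(rows) if r[f] <= t]
            right = [i for i, r in enumerate(rows) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini([labels[i] for i in left]) +
                     len(right) * gini([labels[i] for i in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    return best

def build_tree(rows, labels, k, depth=0, max_depth=3):
    # Step 4: stop when the node is pure or the depth limit is reached.
    if len(set(labels)) == 1 or depth == max_depth:
        return Counter(labels).most_common(1)[0][0]
    # Step 1: randomly select K of the m features.
    feats = random.sample(range(len(rows[0])), k)
    split = best_split(rows, labels, feats)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    _, f, t, left, right = split
    # Step 3: split the node into daughter nodes and recurse.
    return (f, t,
            build_tree([rows[i] for i in left], [labels[i] for i in left],
                       k, depth + 1, max_depth),
            build_tree([rows[i] for i in right], [labels[i] for i in right],
                       k, depth + 1, max_depth))

def build_forest(rows, labels, n_trees, k):
    # Step 5: repeat the whole procedure to grow n trees.
    return [build_tree(rows, labels, k) for _ in range(n_trees)]
```

Internal nodes are tuples `(feature, threshold, left, right)` and leaves are plain class labels, so a trained tree can be walked recursively at prediction time.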

After the random forest of decision trees and classifiers is created, predictions can be made with the following steps:

  1. Run the test features through the rules of each decision tree to predict an outcome, then store each predicted target

  2. Calculate the votes for each predicted target

  3. Choose the most highly voted predicted target as the final prediction
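Taken together, the voting procedure is only a few lines. A minimal sketch in which each "tree" is any callable that returns a predicted label; the lambda rules below are hypothetical stand-ins for trained trees, not a real training result:

```python
from collections import Counter

def forest_predict(trees, x):
    # Step 1: run the sample through each tree's rules and store the results.
    votes = [tree(x) for tree in trees]
    # Steps 2-3: tally the votes and return the majority label.
    return Counter(votes).most_common(1)[0][0]

# Hypothetical stand-in trees: three simple threshold rules.
trees = [lambda x: int(x[0] > 0.5),
         lambda x: int(x[1] > 0.5),
         lambda x: 1]

print(forest_predict(trees, [0.9, 0.1]))  # votes [1, 0, 1] -> 1
print(forest_predict(trees, [0.1, 0.1]))  # votes [0, 0, 1] -> 0
```

For regression forests the same aggregation step averages the per-tree predictions instead of taking a majority vote.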