Let’s Talk about Random Forests!

In my previous article, we discussed the decision tree algorithm. Although the decision tree is a very cool and intuitive algorithm, it carries with it the curse of overfitting. In this article, we'll build upon the idea of a decision tree and learn about the random forest algorithm, which is widely used in the machine-learning world.

Random Forest Algorithm

  • According to Wikipedia:

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean/average prediction (regression) of the individual trees.

Steps

  1. First, we take the training data and create bootstrapped datasets from it by sampling examples with replacement. Have a look at the image below to get a better idea of it.
[Image: Bootstrapping]
  • What we do is take the complete dataset and draw multiple bootstrapped samples from it, each typically the same size as the original, by picking examples at random with replacement. Remember that an example can appear more than once within a single bootstrapped sample, while other examples may not appear at all (see the sketch after this list).
  • We then train one decision tree on each bootstrapped sample. The resulting trees behave differently since they are built from different samples (and, in a random forest, each split also considers only a random subset of the features).
  • The variety we get from these many different trees makes the forest more effective than any individual decision tree.
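
To make the bootstrapping step concrete, here is a minimal NumPy sketch. The function name bootstrap_sample and the toy dataset are just illustrative, not part of any library.

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw a bootstrapped dataset: sample n rows with replacement,
    so some examples repeat and others are left out."""
    n = len(X)
    idx = rng.integers(0, n, size=n)  # indices sampled with replacement
    return X[idx], y[idx]

rng = np.random.default_rng(42)
X = np.arange(10).reshape(-1, 1)  # toy dataset with 10 examples
y = np.arange(10)

X_boot, y_boot = bootstrap_sample(X, y, rng)
print(y_boot)  # some labels appear more than once, some not at all
```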

How do we use them now?

  • Now that we have created our forest of decision trees, how do we evaluate it or use it to make predictions?
  • For evaluation, we perform the following trick. Roughly one-third of the training examples don't make it into any given bootstrapped dataset. These left-out examples form the 'out-of-bag dataset', and we can use them to evaluate the random forest.
  • To make a prediction, we take an example and pass it through every decision tree in the forest. If it's a regression task, we take the average of the outputs; if it's a classification task, we return the class predicted by the majority of the trees (see the sketch after this list).
  • The proportion of out-of-bag samples that are incorrectly classified is the 'out-of-bag error'.
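
Here is a small sketch of the aggregation step. The tree predictions and true labels below are made up purely for illustration.

```python
import numpy as np

# Suppose each row holds one tree's predictions for the same four examples.
tree_preds = np.array([
    [0, 1, 1, 0],   # tree 1
    [0, 1, 0, 0],   # tree 2
    [1, 1, 1, 0],   # tree 3
])

# Classification: take the majority vote across trees for each example.
votes = (tree_preds.mean(axis=0) >= 0.5).astype(int)
print(votes)        # [0 1 1 0]

# Regression: simply average the outputs instead of voting.
avg = tree_preds.mean(axis=0)
print(avg)          # [0.33 1.   0.67 0.  ]

# Out-of-bag error: fraction of out-of-bag examples the vote got wrong.
y_true = np.array([0, 1, 1, 1])
oob_error = (votes != y_true).mean()
print(oob_error)    # 0.25
```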

Bagging Technique

  • Bagging is the core technique used by random forests.
  • Bagging can be summed up as Bootstrapping followed by an aggregation step.
  • In a random forest, we apply this bagging technique by first bootstrapping the dataset to train multiple decision trees and then aggregating their results to give the final output (a scikit-learn sketch follows below).
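
If you just want to use bagging-based random forests in practice, scikit-learn wraps all of this up for you. A minimal sketch, assuming scikit-learn is installed and using the built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# bootstrap=True (the default) gives each tree its own bootstrapped sample;
# oob_score=True evaluates the forest on the out-of-bag examples for free.
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)

print(forest.oob_score_)       # out-of-bag accuracy
print(forest.predict(X[:5]))   # majority-vote predictions for a few examples
```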

Pros of Random Forest

  • Random forests are robust to outliers, since their effect gets averaged out when aggregating the outputs of multiple trees.
  • They work well with non-linear data.
  • There is a lower risk of overfitting, since the final result is based on the output of many decision trees rather than a single one.

Hope this article gives you a better idea of how random forests work.

Till then happy learning!!!
