Clever Understanding Random Forest

madan
5 min read · Dec 27, 2019


Before diving deep into Random Forest, you should first understand Decision Trees; see this link.

What is a Random Forest?

Random forest is a solid choice for nearly any prediction problem, including non-linear ones. It is a relatively recent machine learning method (it came out of Bell Labs in the 1990s) and it can be used for just about anything. It belongs to a larger class of machine learning algorithms called ensemble methods.

Random Forest = a number of decision trees

Random forest is a tree-based algorithm that builds several decision trees. It is an ensemble technique capable of performing both regression and classification tasks by combining multiple decision trees through a technique called Bootstrap Aggregation, commonly known as Bagging. Ensembling is simply the combination of weak learners (the individual trees) to produce a strong learner.

For example: say you want to join a data science training program, but you are uncertain about the training institutes in Hyderabad. You ask 10 people who have attended the training, and 8 of them say, “the training at Analytics Path (HYD) is fantastic.” Since the majority is in favor, you decide to join. This is how we use ensemble techniques in our daily lives too.

Before we go deep into RF, we first need to understand Bagging. Bagging is a simple and very powerful ensemble method. It is a general procedure that can be used to reduce our model’s variance. A higher variance means that your model is overfitted. Certain algorithms, such as decision trees, usually suffer from high variance. In other words, decision trees are extremely sensitive to the data on which they have been trained: if the underlying data is changed even a little bit, the resulting decision tree can be very different and, as a result, our model’s predictions will change drastically. Bagging offers a solution to the problem of high variance. It can systematically reduce overfitting by taking an average of several decision trees. Bagging uses bootstrap sampling and finally aggregates the individual models by averaging to get the ultimate predictions. Bootstrap sampling simply means sampling rows at random from the training dataset with replacement.

‘ Bootstrapping the data plus using the aggregate to make a decision is called Bagging ’

Bootstrap Random samples
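
Below is a minimal sketch of bootstrap aggregation along these lines; the synthetic dataset, the 25-tree count, and the variable names are illustrative assumptions, not part of the algorithm itself.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Illustrative synthetic data (any regression dataset would do).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

n_trees = 25
rng = np.random.RandomState(42)
trees = []

for _ in range(n_trees):
    # Bootstrap sample: draw rows at random *with replacement*.
    idx = rng.randint(0, len(X), size=len(X))
    tree = DecisionTreeRegressor()      # fully grown, not pruned
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Aggregate: average the individual predictions (majority vote for classification).
preds = np.mean([t.predict(X) for t in trees], axis=0)
```

This is essentially what bagging does internally: many trees, each seeing a slightly different bootstrapped view of the data, averaged into one smoother prediction.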

Approach:-

  • Random forest has been, and remains, one of the strongest off-the-shelf models.
  • Pick k data points at random (with replacement) from the training set and build the decision tree associated with those k data points; repeat for every tree.
  • Each tree is fully grown, not pruned (see the sketch after this list).
  • Random forest has low explainability but high predictive power.
  • Decision trees and linear regression have high explainability but low predictive power.
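
A quick sketch of these points with scikit-learn; the iris dataset and the 10-tree forest size are illustrative choices. Because max_depth is None by default, each tree is grown on a bootstrap sample and left fully grown rather than pruned.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Each tree is built on a bootstrap sample and fully grown (max_depth=None by default).
forest = RandomForestClassifier(n_estimators=10, bootstrap=True, random_state=0)
forest.fit(X, y)

# Inspect the individual trees: depths vary because each saw a different sample.
for i, tree in enumerate(forest.estimators_):
    print(f"tree {i}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}")
```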

What is the difference between Bagging and Random Forest?

Many a time, we fail to appreciate that bagging is not the same as random forest. To understand the difference, let’s see how bagging works:

  1. It creates randomized samples of the data set (just like random forest) and grows a tree on each bootstrapped sample of the original data. The roughly 1/3 of rows left out of each sample are used to estimate the unbiased out-of-bag (OOB) error.
  2. It considers all the features at a node (for splitting).
  3. Once the trees are fully grown, it uses averaging or voting to combine the resultant predictions.

The main difference between random forest and bagging is that random forest considers only a subset of predictors at each split. This results in trees with different predictors at the top split, thereby producing decorrelated trees and a more reliable averaged output. That’s why we say random forest is robust to correlated predictors.
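
This difference is easy to see in scikit-learn: setting max_features=None makes every split consider all predictors (bagging-style), while "sqrt" gives the usual random forest behavior. The dataset and tree count below are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# "Bagging": every split may use all 20 features, so the trees stay correlated.
bagging = RandomForestClassifier(n_estimators=100, max_features=None, random_state=0)

# Random forest: each split sees only sqrt(20) ~ 4 features, decorrelating the trees.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)

print("bagging       :", cross_val_score(bagging, X, y, cv=5).mean())
print("random forest :", cross_val_score(forest, X, y, cv=5).mean())
```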

Advantages and Disadvantages of Random Forest:-

Advantages:

  1. It is robust to correlated predictors.
  2. It is used to solve both regression and classification problems.
  3. It can also be used to solve unsupervised ML problems.
  4. It can handle thousands of input variables without variable selection.
  5. It can be used as a feature selection tool using its variable importance plot.
  6. It takes care of missing data internally in an effective manner.

Disadvantages:

  1. The Random Forest model is difficult to interpret.
  2. It tends to return erratic predictions for observations outside the range of the training data. For example, say the training data contains two variables, x and y, and the range of the x variable is 30 to 70. If the test data has x = 200, random forest would give an unreliable prediction (see the sketch after this list).
  3. It can take longer than expected to compute a large number of trees.
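
Here is a short sketch of disadvantage 2; the y = 2x relationship and the test value x = 200 are made up for illustration. A forest can only average leaf values it saw during training, so it cannot extrapolate beyond the training range.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Train on x in [30, 70] with a simple linear relationship y = 2x.
x_train = np.linspace(30, 70, 200).reshape(-1, 1)
y_train = 2 * x_train.ravel()

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(x_train, y_train)

# The true value at x = 200 would be 400, but the prediction stays
# near the training maximum (about 140), because the trees only
# average y-values seen during training.
print(model.predict([[200]]))
```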

RANDOM FOREST USE CASES:-

The random forest algorithm is used in a lot of different fields, like banking, the stock market, medicine, and e-commerce.

In finance, for example, it is used to detect customers more likely to repay their debt on time or use a bank’s services more frequently. In this domain, it is also used to detect fraudsters out to scam the bank.

In trading, the algorithm can be used to determine a stock’s future behavior.

In the healthcare domain, it is used to identify the correct combination of components in medicine and to analyze a patient’s medical history to identify diseases.

In e-commerce, it is used to determine whether a customer will actually like a product or not.

Hyperparameters of Sklearn Random forest :

bootstrap : boolean, optional (default=True)

  • Whether bootstrap samples are used when building trees.

min_samples_leaf : int, float, optional (default=1)

The minimum number of samples required to be at a leaf node:

  • If int, then consider min_samples_leaf as the minimum number.
  • If float, then min_samples_leaf is a percentage and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.

n_estimators : integer, optional (default=10)

  • The number of trees in the forest.

min_samples_split : int, float, optional (default=2)

The minimum number of samples required to split an internal node:

  • If int, then consider min_samples_split as the minimum number.
  • If float, then min_samples_split is a percentage and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.

max_features: int, float, string or None, optional (default=”auto”)

The number of features to consider when looking for the best split:

  • If int, then consider max_features features at each split.
  • If float, then max_features is a percentage and int(max_features * n_features) features are considered at each split.
  • If “auto”, then max_features=sqrt(n_features).
  • If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).
  • If “log2”, then max_features=log2(n_features).
  • If None, then max_features=n_features.

max_depth : integer or None, optional (default=None)

  • The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

max_leaf_nodes : int or None, optional (default=None)

  • Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
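
Putting the hyperparameters above together, here is a hedged example of how they might be set on a scikit-learn forest; the dataset and the specific values are illustrative, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

model = RandomForestClassifier(
    n_estimators=100,       # number of trees in the forest
    bootstrap=True,         # build each tree on a bootstrap sample
    max_features="sqrt",    # features considered when looking for the best split
    max_depth=None,         # grow trees until leaves are pure
    min_samples_split=2,    # minimum samples needed to split an internal node
    min_samples_leaf=1,     # minimum samples required at a leaf node
    max_leaf_nodes=None,    # no cap on the number of leaf nodes
    random_state=0,
)
model.fit(X, y)
print(model.score(X, y))
```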

I hope that this article helped you to get a basic understanding of how the algorithm works.

My GitHub Link
