Decision Tree in Machine Learning

madan · 8 min read · Dec 23, 2019

I have briefly covered an introduction to decision trees before; here we come to the topic itself.

A decision tree is a supervised learning technique and one of the most popular tools for both classification and regression. It is also a powerful mental tool for making smart decisions: you lay out the possible paths and the outcome each one leads to.

A decision tree is a decision support tool with a flowchart-like structure: each internal node denotes a test on an attribute, each branch represents an outcome of that test, and each leaf node holds a class label (for example, Survived (Yes) or Not survived (No)).

Decision trees are easy to build, easy to use, and easy to interpret.

Figure: a decision tree's flowchart-like structure spreading out (from Wikipedia).

  • For a clear video explanation of decision trees, I suggest the videos by Josh Starmer.

Let's look at the basic terminology of decision trees:

  1. Root Node: the very first (parent) node. It represents the entire population or sample, which then gets divided into two or more homogeneous sets.
  2. Internal Node: a node that has one or more child nodes; equivalently, any node that is not a leaf.
  3. Leaf / Terminal Node: a node that does not split further.
  4. Splitting: the process of dividing a node into two or more sub-nodes.
  5. Decision Node: a sub-node that splits into further sub-nodes.
  6. Pruning: reducing the size of a decision tree by removing nodes (the opposite of splitting).
  7. Branch / Sub-Tree: a subsection of the decision tree.
  8. Parent and Child Node: a node that is divided into sub-nodes is the parent of those sub-nodes, and the sub-nodes are its children.

How to Build a Decision Tree:-

  1. ID3 (uses entropy and information gain)
  2. Gini Index
  3. Chi-Square
  4. Reduction in Variance

ID3

ID3 (Iterative Dichotomiser 3) is used to generate a decision tree from a dataset. It is the precursor to the C4.5 algorithm and is typically used in the machine learning and NLP domains. ID3 uses entropy and information gain to construct the tree.

Entropy

Entropy is a measure of the amount of uncertainty or randomness (disorder) in the data. A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). The ID3 algorithm uses entropy to calculate the homogeneity of a sample.

Example: compare an LKG (kindergarten) classroom with a class of engineers; there is far more disorder, and hence higher entropy, in the LKG classroom.

Entropy = -Σᵢ pᵢ log₂(pᵢ), where pᵢ is the proportion of samples that belong to class i.

Information Gain: information gain is the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).
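
To make this concrete, here is a small Python sketch (my own illustration, not code from the article; the helper names entropy and information_gain are only illustrative) that scores a candidate binary split:

import numpy as np

def entropy(labels):
    # Entropy = -sum(p_i * log2(p_i)) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    # decrease in entropy after splitting the parent node into two children
    n = len(parent)
    child_entropy = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - child_entropy

# toy labels: 1 = survived, 0 = did not survive
parent = np.array([1, 1, 1, 0, 0, 0, 0, 0])
left, right = np.array([1, 1, 1, 0]), np.array([0, 0, 0, 0])
print(information_gain(parent, left, right))  # the split with the highest gain wins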

Gini Index

The Gini index says that if we select two items from a population at random, they must be of the same class; the probability of this is 1 if the population is pure.

  1. It works with a categorical target variable ("Success" or "Failure").
  2. It performs only binary splits.
  3. The higher the value of Gini, the higher the homogeneity.
  4. CART (Classification and Regression Tree) uses the Gini method to create binary splits.

Steps to Calculate Gini for a split

  1. Calculate Gini for each sub-node using the formula: sum of squares of the probabilities of success and failure (p² + q²).
  2. Calculate Gini for the split as the weighted Gini score of each node of that split (a sketch follows below).
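
As a rough illustration (my own sketch, using the p² + q² formulation above, not code from the article), the Gini score of a split can be computed like this:

def gini_node(p_success):
    # Gini score of a node as defined above: p^2 + q^2
    return p_success ** 2 + (1 - p_success) ** 2

def gini_split(node_sizes, node_success_rates):
    # weighted Gini score: each sub-node weighted by its share of the samples
    total = sum(node_sizes)
    return sum(size / total * gini_node(p)
               for size, p in zip(node_sizes, node_success_rates))

# candidate split producing sub-nodes of 30 and 20 samples
# with success rates of 0.8 and 0.1 respectively
print(gini_split([30, 20], [0.8, 0.1]))  # closer to 1 means more homogeneous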

Chi-Square

Chi-square is an algorithm that finds the statistical significance of the differences between sub-nodes and the parent node. It is measured as the sum of squares of the standardised differences between the observed and expected frequencies of the target variable.

  1. It works with a categorical target variable ("Success" or "Failure").
  2. It can produce two or more splits.
  3. The higher the value of Chi-square, the higher the statistical significance of the difference between a sub-node and its parent node.
  4. The Chi-square of each node is calculated using the formula below (a sketch of the calculation follows this list):
  5. Chi-square = √((Actual - Expected)² / Expected)
  6. It generates a tree called CHAID (Chi-square Automatic Interaction Detector).
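
Here is a minimal sketch of that calculation (my own, not from the article) for a single sub-node with a "Success"/"Failure" target:

import math

def chi_square_node(actual_success, actual_failure, expected_success, expected_failure):
    # Chi-square contribution of one sub-node, per the formula above
    chi_success = math.sqrt((actual_success - expected_success) ** 2 / expected_success)
    chi_failure = math.sqrt((actual_failure - expected_failure) ** 2 / expected_failure)
    return chi_success + chi_failure

# a 20-sample sub-node; if the parent is 50% success / 50% failure,
# the expected counts under "no difference from the parent" are 10 and 10
print(chi_square_node(actual_success=14, actual_failure=6,
                      expected_success=10, expected_failure=10))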

Reduction in Variance

Reduction in variance is an algorithm used for continuous target variables (regression problems). It uses the standard formula of variance to choose the best split: the split with the lower variance is selected as the criterion to split the population.

Variance = Σ(X - X̄)² / n, where X̄ is the mean of the values, X is an actual value and n is the number of values.

Steps to calculate Variance:

  1. Calculate the variance for each node.
  2. Calculate the variance of the split as the weighted average of the node variances (see the sketch below).
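
A quick sketch of this criterion (my own illustration, not code from the article):

import numpy as np

def variance(values):
    # Variance = sum((X - X_bar)^2) / n, as in the formula above
    values = np.asarray(values, dtype=float)
    return np.mean((values - values.mean()) ** 2)

def split_variance(left_values, right_values):
    # weighted average of the variances of the two sub-nodes
    n = len(left_values) + len(right_values)
    return (len(left_values) / n * variance(left_values)
            + len(right_values) / n * variance(right_values))

# a continuous target; the candidate split with the lowest weighted variance wins
print(split_variance([7.0, 8.0, 7.5], [70.0, 90.0, 80.0, 60.0]))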

When to stop splitting?

You might ask when to stop growing a tree. A problem usually has a large set of features, which results in a large number of splits and, in turn, a huge tree. Such trees are complex and can lead to overfitting, so we need to know when to stop. One way is to set a minimum number of training inputs for each leaf; for example, we can require at least 10 passengers to reach a decision (died or survived) and ignore any leaf that receives fewer than 10 passengers. Another way is to set the maximum depth of the model, where maximum depth refers to the length of the longest path from the root to a leaf.
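
In scikit-learn, both stopping rules correspond to constructor parameters. The snippet below is a small sketch on synthetic data (the dataset and parameter values are only illustrative):

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=1)

tree = DecisionTreeClassifier(
    max_depth=3,          # stop once the longest root-to-leaf path has 3 splits
    min_samples_leaf=10,  # never create a leaf holding fewer than 10 samples
    random_state=1)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())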

Pruning

The performance of a tree can be further increased by pruning. It involves removing the branches that make use of features with low importance. This way, we reduce the complexity of the tree and thus increase its predictive power by reducing overfitting.
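
Recent versions of scikit-learn support post-pruning through cost-complexity pruning (the ccp_alpha parameter). The sketch below uses synthetic data and an illustrative alpha value; it is not the article's code:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

full_tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=1).fit(X_train, y_train)

# the pruned tree has fewer leaves and often generalises better
print(full_tree.get_n_leaves(), full_tree.score(X_test, y_test))
print(pruned_tree.get_n_leaves(), pruned_tree.score(X_test, y_test))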

Advantages:-

  • easy to use and understand
  • can handle both categorical and numerical data
  • robust to outliers, hence requires little data preprocessing

Disadvantages:-

  • prone to overfitting
  • need to be careful with parameter tuning

How to avoid overfitting in Dtree:-

  • Overfitting is one of the major problems for every model in machine learning.
  • To keep a decision tree from overfitting, we remove the branches that make use of features with low importance; this is called pruning.

Decision Tree for Classification:-

In this section, I predict whether a passenger on the Titanic would have survived or not, using the decision tree algorithm in Python.

Dataset: the dataset for this task can be downloaded from this link:

Importing Libraries

The following script imports required libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Importing the Dataset

Since our file is in CSV format, we will use pandas' read_csv method to read the data file. Execute the following script to do so:

df = pd.read_csv('titanic_train.csv', index_col='PassengerId')

Data Analysis

Execute the following command to see the number of rows and columns in our dataset:

df.shape

The output will show "(714, 6)", which means that our dataset has 714 records and 6 attributes.

df.head()  # preview the first five rows

Feature Selection

Here, you need to divide the given columns into two types of variables: the dependent (target) variable and the independent (feature) variables.

X = df.drop('Survived', axis=1)  # drop the target column to keep only the feature columns
y = df['Survived']               # the target variable

Splitting Data

To understand model performance, dividing the dataset into a training set and a test set is a good strategy.

Let's split the dataset using the train_test_split() function. You pass it the features and the target; you can also pass the test set size (by default, 25% of the data is held out for testing).

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)
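
A quick check of the resulting shapes can confirm the split; the exact numbers below assume the 714-row, 5-feature matrix from earlier:

print(X_train.shape, X_test.shape)  # e.g. (535, 5) (179, 5)
print(y_train.shape, y_test.shape)  # e.g. (535,) (179,)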

Building Decision Tree Model

Let’s create a Decision Tree Model using Scikit-learn.

from sklearn.tree import DecisionTreeClassifier

# Create a Decision Tree classifier object
clf = DecisionTreeClassifier()

# Train the Decision Tree classifier
clf = clf.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = clf.predict(X_test)
y_pred

Out[34]:

array([1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1,
1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0,
0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0,
0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0,
0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0,
0, 1, 1])

Evaluating Model

Let's estimate how accurately the classifier can predict whether a passenger survived.

Accuracy can be computed by comparing actual test set values and predicted values.

from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)

Out[37]:

0.8268156424581006
  • For classification tasks, another commonly used metric is the confusion matrix.

from sklearn.metrics import confusion_matrix

pd.DataFrame(
    confusion_matrix(y_test, y_pred),
    columns=['Predicted Not Survival', 'Predicted Survival'],
    index=['True Not Survival', 'True Survival'])

Out[55]:

                    Predicted Not Survival  Predicted Survival
True Not Survival                       97                  15
True Survival                           16                  51

Decision Tree Parameters:-

DecisionTreeClassifier(): this is the scikit-learn class used to build a decision tree model in Python. Its constructor, with a few of its most important parameters, looks like this:

DecisionTreeClassifier(criterion='gini', random_state=None, max_depth=None, min_samples_leaf=1)

Here are a few important parameters:

  • criterion: used to measure the quality of a split. By default it is 'gini'; 'entropy' is also supported.
  • max_depth: limits the maximum depth of the tree. By default (None), nodes are expanded until all leaves are pure.
  • min_samples_leaf: the minimum number of samples required to be present at a leaf node. (An example using these parameters follows below.)
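
As a hedged example (the parameter values are only illustrative, and the resulting accuracy depends on the data), the parameters above could be applied to the Titanic split built earlier like this:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

tuned_clf = DecisionTreeClassifier(
    criterion='entropy',   # use entropy / information gain instead of the default 'gini'
    max_depth=3,           # limit the depth of the tree to reduce overfitting
    min_samples_leaf=5,    # require at least 5 passengers in every leaf
    random_state=1)
tuned_clf.fit(X_train, y_train)
print(accuracy_score(y_test, tuned_clf.predict(X_test)))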

Resources

Want to learn more about Scikit-Learn and other useful machine learning algorithms? I’d recommend checking out some more detailed resources, like an online course:

Conclusion

In this article, I showed how you can use Python's popular Scikit-Learn library to build decision trees for a classification task. While the decision tree is a fairly simple algorithm in itself, implementing it with Scikit-Learn is even easier.


Thanks for reading. :)
If this was a good read, enjoy! And don't forget to clap.

Editor: Madan Maram

If you find this article interesting, feel free to say hello over LinkedIn; I'm always happy to connect with other professionals in the field.
