Decision Tree

The decision tree is one of the fundamental tree-based algorithms in machine learning. It is highly intuitive and easy to interpret, which makes it very useful for solving many classic machine-learning problems.

So, without wasting much time, let's dive into the nitty-gritty details of this algorithm.

What is a Decision Tree?

  • A decision tree can handle both categorical and numerical data.
Image source: StatQuest channel on YouTube
  • Let's have a look at the dataset above: it has three features, namely Chest Pain, Good Blood Circulation, and Blocked Arteries, and a label, Heart Disease.
  • To create a decision tree, we first create a root node, and for the root node we need to choose the feature that divides our dataset in the best way possible.
  • But how do we choose this feature? There are several metrics for this, such as entropy, Information Gain, and Gini Impurity.
  • For the sake of this article, we’ll focus mainly on Gini Impurity.
  • Gini Impurity is a metric that measures how impure a node is, and it is given by the following formula:

Gini Impurity = 1 − Σᵢ pᵢ²

where pᵢ is the fraction of samples in the node that belong to class i. For a yes/no label this is simply 1 − (p_yes)² − (p_no)².
Calculating Gini Impurity
  • Suppose we take Chest Pain as our node. Of the 144 patients who have chest pain, 105 have heart disease and 39 do not. Similarly, of the 159 patients who don't have chest pain, 34 have heart disease and 125 do not.
  • Now let's calculate the Gini Impurity for node 1, the patients who have chest pain:

Gini Impurity (node 1) = 1 − (105/144)² − (39/144)²

  • The above equation gives a value of 0.394965.
  • Next, we calculate the Gini Impurity for node 2, the patients who don't have chest pain:

Gini Impurity (node 2) = 1 − (34/159)² − (125/159)²

  • The above equation gives a value of 0.33622.
  • Now, to calculate the Gini Index for Chest Pain, we take a weighted average of the Gini Impurity of both nodes, weighting each node by the fraction of the 303 patients it contains:

Gini Index = (144/303) × Gini Impurity(node 1) + (159/303) × Gini Impurity(node 2)

  • Putting in the values gives (144/303) × 0.394965 + (159/303) × 0.33622 = 0.36413.
  • Similarly, we calculate the Gini Index for all the features, and the feature with the lowest Gini Index becomes our root node.
  • The root node divides our dataset into two parts. We apply the same procedure recursively to each part until we get pure leaf nodes, or until splitting no longer helps (i.e., the Gini Impurity of the parent node is already lower than that of its children). The code sketch below reproduces the Chest Pain calculation.
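
To make the arithmetic concrete, here is a minimal Python sketch of the calculation above. The helper names (gini_impurity, weighted_gini) are just illustrative, not from any library; the counts are the ones from the worked example.

```python
def gini_impurity(counts):
    """Gini Impurity of a node: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def weighted_gini(left_counts, right_counts):
    """Gini Index of a split: child impurities weighted by node size."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return (n_left / n) * gini_impurity(left_counts) + \
           (n_right / n) * gini_impurity(right_counts)

# Chest Pain split, counts given as [heart disease, no heart disease]
node1 = [105, 39]   # patients with chest pain
node2 = [34, 125]   # patients without chest pain

print(gini_impurity(node1))          # ≈ 0.3950
print(gini_impurity(node2))          # ≈ 0.3362
print(weighted_gini(node1, node2))   # ≈ 0.3641
```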

Our tree is finally ready!!

  • Now that our tree is fully constructed, we can pass in our test data and check its predictions, for example with scikit-learn, as sketched below.
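
In practice you would rarely grow the tree by hand. Here is a minimal sketch using scikit-learn's DecisionTreeClassifier, which uses Gini Impurity as its default splitting criterion; the data is randomly generated stand-in data shaped like the heart disease example, not the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up binary data shaped like the example:
# columns = [chest pain, good blood circulation, blocked arteries]
rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(303, 3))
y = rng.integers(0, 2, size=303)  # heart disease label (random stand-in)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

tree = DecisionTreeClassifier(criterion="gini")  # Gini is the default criterion
tree.fit(X_train, y_train)
print(tree.predict(X_test[:5]))    # predictions for the first five test patients
print(tree.score(X_test, y_test))  # accuracy on the test set
```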

How to Calculate Gini Impurity for Different Types of Data?

  • We have seen above how to calculate Gini Impurity for yes/no (categorical) data.

Handling Numerical Data

  • For continuous data, suppose we use a numerical feature such as weight to predict heart disease. We need to ask: which value of weight should be used as the threshold that separates the data into two nodes?
  • First, we sort the weight values in increasing order. Then we take the mean of each pair of adjacent values and treat each mean as a candidate threshold, calculating the Gini Impurity of the split it produces.
  • The candidate threshold with the minimum Gini Impurity becomes the split point, and its impurity is used as the Gini Index of this feature. The sketch below illustrates the procedure.
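
Here is a minimal sketch of this thresholding procedure; best_numeric_split is a hypothetical helper name, not a library function, and the weights and labels are made up for illustration.

```python
def gini_impurity(labels):
    """Gini Impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def best_numeric_split(values, labels):
    """Try the midpoint of every adjacent pair of sorted values as a threshold."""
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(len(pairs) - 1):
        threshold = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [l for v, l in pairs if v <= threshold]
        right = [l for v, l in pairs if v > threshold]
        if not left or not right:
            continue
        # weighted Gini Impurity of the split produced by this threshold
        score = (len(left) * gini_impurity(left)
                 + len(right) * gini_impurity(right)) / len(pairs)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

weights = [155, 180, 190, 220, 225]         # made-up weight values
labels = ["no", "no", "yes", "yes", "yes"]  # heart disease label
print(best_numeric_split(weights, labels))  # (best threshold, its Gini Index)
```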

Handling Ranked Data

  • Suppose we have a feature containing ranks or levels (ordinal data). Then we can do the following.
  • Divide the data into two parts: one node with the data for rank ≤ p, and the other node with the data for rank > p (where p is any valid rank value).
  • Now we calculate the Gini Index for each candidate rank p and select the rank for which the Gini Index is minimum, as in the sketch below.
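
Ranked data works essentially like the numerical case, except the candidate split points are the rank values themselves. Again a sketch with made-up data; best_rank_split is a hypothetical helper, not a library function.

```python
def gini_impurity(labels):
    """Gini Impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def best_rank_split(ranks, labels):
    """Try every 'rank <= p' split and keep the one with the lowest Gini Index."""
    best = None
    for p in sorted(set(ranks))[:-1]:  # skip the max rank: it leaves one side empty
        left = [l for r, l in zip(ranks, labels) if r <= p]
        right = [l for r, l in zip(ranks, labels) if r > p]
        score = (len(left) * gini_impurity(left)
                 + len(right) * gini_impurity(right)) / len(ranks)
        if best is None or score < best[1]:
            best = (p, score)
    return best

ranks = [1, 1, 2, 3, 3, 4]                        # e.g., an ordinal 1-4 feature
labels = ["no", "no", "no", "yes", "yes", "yes"]  # heart disease label
print(best_rank_split(ranks, labels))             # (best p, its Gini Index)
```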

Extras

  • Decision Trees perform well for non-linear data.
  • They overfit easily, i.e., they perform well on training data but noticeably worse on test data.
  • The Decision Tree forms the basis of other tree-based algorithms such as Random Forest.

Till then, Happy Learning!!
