We just saw how a Decision Tree operates at a high level: from the top down, it creates a series of sequential rules that split the data into well-separated regions for classification. But given the large number of potential options, how exactly does the algorithm determine where to partition the data? Before we learn how that works, we need to understand entropy.
Given a certain set of events that occur with probabilities $p_1, p_2, \ldots, p_n$, the total entropy $H$ can be written as the negative sum of weighted probabilities:

$$H = -\sum_{i=1}^{n} p_i \log_2 p_i$$
The quantity $H$ has a number of interesting properties:

- $H \geq 0$: entropy is never negative.
- $H = 0$ exactly when a single event occurs with probability 1, i.e., when the outcome is certain.
- $H$ is largest when all events are equally likely, i.e., when the outcome is most uncertain.
The entropy can be used to quantify the impurity of a collection of labeled data points: a node containing multiple classes is impure whereas a node including only one class is pure.
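This is easy to check directly. Below is a minimal pure-Python sketch (the `entropy` helper and the class labels are illustrative, not from the article):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

pure = ["cat"] * 8                 # a single class: the node is pure
mixed = ["cat"] * 4 + ["dog"] * 4  # a 50/50 mix: maximally impure

print(entropy(pure))   # 0.0
print(entropy(mixed))  # 1.0
```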
Above, you can compute the entropy of a collection of labeled data points belonging to two classes, which is typical for binary classification problems. Click on the Add and Remove buttons to modify the composition of the bubble.
Did you notice that pure samples have zero entropy whereas impure ones have larger entropy values? This is what entropy is doing for us: measuring how pure (or impure) a set of samples is. We'll use it in the algorithm to train Decision Trees by defining the Information Gain.
With the intuition gained from the animation above, we can now describe the logic used to train Decision Trees. As the name implies, information gain measures the amount of information that we gain. It does so using entropy. The idea is to subtract from the entropy of our data before the split the weighted entropy of each partition thereafter. We then select the split that yields the largest reduction in entropy, or equivalently, the largest increase in information.
To be specific, the algorithm's steps are as follows:

1. Compute the entropy of the labeled examples at the current node.
2. For each candidate split, compute the weighted average entropy of the partitions it creates, and subtract it from the current entropy to obtain the information gain.
3. Choose the split with the highest information gain and partition the data accordingly.
4. Recurse on each partition, stopping when a node is pure (or another stopping criterion is met).
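The central computation in these steps can be sketched in a few lines of Python (function names are mine, chosen for clarity):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, partitions):
    """Entropy of the parent node minus the size-weighted entropy
    of the child partitions produced by a candidate split."""
    n = len(parent)
    children = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(parent) - children

parent = ["poisonous"] * 5 + ["edible"] * 5
# A perfect split removes all uncertainty: gain equals the parent's entropy.
print(information_gain(parent, [["poisonous"] * 5, ["edible"] * 5]))  # 1.0
# A split that leaves every child 50/50 gains nothing.
print(information_gain(parent, [parent[:2] + parent[5:7], parent[2:5] + parent[7:]]))  # 0.0
```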
Of course, reading the steps of an algorithm isn't always the most intuitive thing. To make things easier to understand, let's revisit how information gain was used to determine the first decision node in our tree.
Recall our first decision node split on Diameter ≤ 0.45. How did we choose this condition? It was the result of maximizing information gain.
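Since the article's fruit data isn't reproduced here, the sketch below uses hypothetical diameters chosen so that the best midpoint happens to land at 0.45. The `best_threshold` helper (my name, not the article's) scans candidate midpoints between adjacent sorted values and keeps the one with the highest information gain; it assumes distinct feature values:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, partitions):
    """Parent entropy minus the size-weighted entropy of the partitions."""
    n = len(parent)
    return entropy(parent) - sum(len(p) / n * entropy(p) for p in partitions)

def best_threshold(values, labels):
    """Try a threshold at each midpoint between adjacent sorted feature
    values; return the threshold with the highest gain, plus that gain."""
    pairs = sorted(zip(values, labels))
    ordered = [label for _, label in pairs]
    best_t, best_gain = None, -1.0
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        gain = information_gain(ordered, [ordered[:i], ordered[i:]])
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Hypothetical stand-in for the article's diameter feature:
diameters = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
labels = ["small", "small", "small", "large", "large", "large"]
print(best_threshold(diameters, labels))  # (0.45, 1.0)
```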
An alternative to entropy for constructing Decision Trees is the Gini impurity. This quantity is also a measure of information and can be seen as a variation of Shannon's entropy. Decision Trees trained with entropy or with Gini impurity are comparable; only in a few cases do the results differ considerably. For imbalanced data sets, entropy may be the more prudent choice, but Gini impurity can be faster to compute since it avoids logarithms.
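For comparison, here are both impurity measures side by side in a small sketch (helper names and class labels are illustrative):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: the chance of mislabeling a random sample if labels
    were assigned according to the node's class distribution."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

mixed = ["edible"] * 5 + ["poisonous"] * 5
print(entropy(mixed))        # 1.0 -- maximal for two classes
print(gini(mixed))           # 0.5 -- maximal for two classes
print(gini(["edible"] * 5))  # 0.0 -- pure nodes score zero under both
```

Both measures peak on 50/50 mixtures and vanish on pure nodes, which is why trees trained with either criterion usually look alike.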
Let's recap what we've learned so far. First, we saw how a Decision Tree classifies data by repeatedly partitioning the feature space into regions according to some conditional series of rules. Second, we learned about entropy, a popular metric used to measure the purity (or lack thereof) of a given sample of data. Third, we learned how Decision Trees use entropy in information gain and the ID3 algorithm to determine the exact conditional series of rules to select. Taken together, the three sections detail the typical Decision Tree algorithm.
Without question, Decision Trees have a lot of things going for them. They're simple models that are easy to interpret. They're fast to train and require minimal data preprocessing. And they handle outliers with ease. Yet they suffer from a major limitation: instability. Compared with other predictors, they can be extremely sensitive to small perturbations in the data: a minor change in the training examples can result in a drastic change in the structure of the Decision Tree.
Check for yourself how small random Gaussian perturbations on just 5% of the training examples create a set of completely different Decision Trees:
Perhaps ironically, one way to alleviate the instability induced by perturbations is to introduce an extra layer of randomness in the training process. In practice this can be achieved by creating collections of Decision Trees trained on slightly different versions of the data set, the combined predictions of which do not suffer so heavily from high variance. This approach opens the door to one of the most successful Machine Learning algorithms thus far: random forests.
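The article doesn't detail the ensemble construction, but the bagging idea behind random forests can be sketched with one-split "stumps" standing in for full trees (all function names and the toy data below are hypothetical):

```python
import random
from collections import Counter

def train_stump(xs, ys):
    """A one-split 'tree': choose the threshold whose two sides best
    agree with a majority-vote label on each side."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        l_lab = Counter(left).most_common(1)[0][0]
        r_lab = Counter(right).most_common(1)[0][0]
        correct = sum(y == (l_lab if x <= t else r_lab) for x, y in zip(xs, ys))
        if best is None or correct > best[0]:
            best = (correct, t, l_lab, r_lab)
    if best is None:  # e.g. the bootstrap sample held a single x value
        majority = Counter(ys).most_common(1)[0][0]
        return lambda x: majority
    _, t, l_lab, r_lab = best
    return lambda x: l_lab if x <= t else r_lab

def bagged_ensemble(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap resample; predict by majority vote."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]
        stumps.append(train_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: Counter(s(x) for s in stumps).most_common(1)[0][0]

xs = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8]
ys = ["a", "a", "a", "b", "b", "b"]
predict = bagged_ensemble(xs, ys)
print(predict(0.15), predict(0.75))
```

Because each stump sees a slightly different bootstrap sample, their individual quirks differ, and the majority vote averages them out; the ensemble's prediction is far less sensitive to any single perturbed example than one tree would be.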
Thanks for reading! We hope that the article is insightful no matter where you are along your Machine Learning journey, and that you came away with a better understanding of the Decision Tree algorithm.
A special thanks goes out to Brent Werness for valuable contributions to this article.