When building a decision tree, the way we will decide what feature to split on at a node will be based on what choice of feature reduces entropy the most. (reduces entropy or reduces impurity or maximize purity)

In decision tree learning, the reduction of entropy is called information gain.

In this section, we will look at how to compute information gain and therefore, choose what features to use to split on at each node in a decision tree. We will use an example of deciding what feature to use at the root node of the decision tree we were building just now for recognizing cats vs not cats. Let's take a look at our choices:

Rather than looking at these entropy numbers and comparing them, it would be useful to take a weighted average of them. If there's a node with a lot of examples in it with high entropy that seems worse than if there was a node with just a few examples in it with high entropy.

Because entropy, as a measure of impurity, is worse if you have a very large and impure dataset compared to just a few examples and a branch of the tree that's very impure.

Back to our example, in order to pick from them, we need to combine the left and right branch into a single number, we do this by combining the two numbers and take a weighted average. The way we will choose a split is by computing the reduction in entropy of 3 trees (3 in this example) and picking the lowest number.

Note:

P1 left = the fraction of examples in the left sub-tree that have a positive label

W left = the fraction of examples of all the examples of the root node that went to the left sub-branch

If we go to the root node, remember that in the root node, we have started off with all 10 examples with 5 cats and 5 dogs, and so, at the root node, we had P1 = 0.5. The entropy of 0.5 is 1. This was maximum in purity because it was 5 cats and 5 dogs. Let's calculate all 3 examples with the formula given above:

Information gainmeasures the reduction in entropy that you get in your tree resulting from making a split

Because the entropy was originally 1 at the root node and by making the split, you end up with a lower value of entropy and the difference between those 2 values is a reduction in entropy (0.28 in the case of ear shape example)

So, why do we compute reduction in entropy rather than entropy at the left and right sub-branches?

One of the stopping criteria for deciding when to stop splitting further is if the reduction in entropy is too small. In which case, you could decide, to increase the size of the tree and risk overfitting or stop if the reduction is too small (below threshold)

In this example, splitting on ear shape (0.28) is the biggest reduction in entropy, so we will choose that.

#### Putting it all together

The information gain criteria lets you decide how to choose one feature to split a one-node.

We will use that in multiple places through a decision tree to figure out how to build a large decision tree with multiple nodes.

Here's the overall process of building a decision tree:

Start with all examples at the root node

Calculate information gain for all possible features, and pick the one with the highest information gain

Split dataset according to the selected feature, and create left and right branches of the tree

Keep repeating the splitting process until a stopping criteria is met.

A stopping criteria is met:

When a node is 100% one class

When splitting a node will result in the tree exceeding a maximum depth

Information gain from additional splits is less than threshold

When a number of examples in a node is below a threshold

## Comments