In this section, we will learn about a practical and very commonly used learning algorithm: the decision tree. We will also learn about variations of the decision tree.

**Learning Objectives:**

see what a decision tree looks like and how ti can be used to make predictions

learn how a decision tree learns from training data

learn the "impurity" metric "entropy" and how it's used when building a decision tree

learn when to use decision trees and neural network.

#### Decision Tree Model

To explain how decision trees work, let's take a look at a cat classification example:

## Ear Shape (x1) | ## Face Shape (x2) | ## Whiskers (x3) | ## Cat (y) |

Pointy (PT) | Round (R) | Present (P) | 1 |

Floppy (FP) | NR (Not Round) | P | 1 |

FP | R | Absent (A) | 0 |

PT | NR | P | 0 |

PT | R | P | 1 |

PT | R | A | 1 |

FP | NR | A | 0 |

PT | R | A | 1 |

FP | R | A | 0 |

FP | R | A | 0 |

This is a binary classification task because the labels are one or zero, and here's an example of a decision tree model that you might get after training a decision tree algorithm on the data above:

**Decision Nodes: **they look at a particular feature and based on the value of the feature, causes you to decide whether to go left or right down the tree

**Leaf Nodes: **They make a prediction

There are many different form of decision trees. Among these different decision trees, some will do better and some will do worse on the training sets or cross-validation (cv) sets and test sets.

The job of the decision tree learning algorithm is, out of all possible decision trees, to try to pick one that hopefully does well on the training set, and then also ideally generalizes well to new data such as your cv and test sets as well.

##### Learning Process

The first step in decision tree learning is to decide what feature to use at the root node. The feature we choose for the nodes are usually done through an algorithm.

Let's say, we pick the ear shape as the feature on the root node. What we'll do then, is to look at all of our training examples, and split them according to the value of the ear shape feature: pointy or floppy.

The second step is focusing on just the left branch of the decision tree to decide what nodes to put there. In particular, what feature we want to split on or what feature to use next. Let's say, through the algorithm, we decide to use the face shape: round or not round.

We will then take the examples that fit the features and move them, we will continue doing this until we reach a prediction, in which we will then repeat a similar process on the right branch. This is the process of building a decision tree.

Through this process, there are 2 key decisions that we had to make at various steps during the algorithm:

How to choose what feature to split on at each node?

When do you stop splitting

To choose what feature to split on:

decision trees will choose what feature to split on to maximize purity

purity means: you want to get what subsets? all dogs or all cats?

When to stop splitting:

when a node is 100% one class (100% dogs or cats)

when splitting a node will result in the tree exceeding a maximum depth: this makes sure the tree doesn't get too big or unwieldy and makes it less prone to overfitting

when improvements in purity score are below a threshold

when number of examples in a node is below a threshold

##### Decision Tree Learning

Let's take a look at a way to measure the purity of a set of examples. If the examples are all cats of a single class, then that's very pure, and so is "all not cats".

Entropy is a measure of the impurity of a set of data.

Let's say, given a set of 6 examples, 3 cats and 3 dogs:

P1 = fraction of examples that are all cats

In this example P1 is 3/6 = 0.5

The function of Entropy = **H(P1)**

Let's take a look at the plot:

In the illustration above, the vertical axis is the __H__

Thus in our example,

H(P1) = H(3/6) = 1

If you notice in the curve of entropy:

It's most impure when your set is 50/50

It's most pure (entropy=0) when your set is 0.0 or 1.0

P0 if a fraction of examples that are not P1, so if P1 = 2/3 cats, P0 = 1/3 cat

P0 = 1 - P1

The function of entropy:

image taken from the course: __Deeplearning.ai__

Technically log 0 is undefined or infinity, but by convention for the purposes of computing entropy, we'll take "0log(0) = 0", and that will comput the entropy as 0 or as 1=0.

To summarize, the entropy function is a measure of the impurity of a set of data. It starts from 0, goes up to 1, and then comes back down to 0 as a function of the fraction of positive examples in your sample.

## Comments