
Day 38: Categorical and Continuous valued features

One Hot Encoding for Categorical Features

In the examples we have seen so far, each feature could take on only one of two possible values. In this section, we look at how one-hot encoding handles features that can take on more than 2 discrete values.

For example, if we have a feature with more than 2 possible values (3 in this example), one-hot encoding replaces it with 3 binary features, each taking the value 1 or 0. For any given example, exactly one of these features is 1 and the rest are 0.

One-hot encoding: if a categorical feature can take on k values, create k binary features (0 or 1 valued)
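As a minimal sketch of that rule, the snippet below turns one categorical value into k binary features. The helper name `one_hot` and the three category names are made up for illustration; they are not taken from the example above.

```python
def one_hot(value, categories):
    """Return k binary (0/1) features, one per possible category."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


# Hypothetical 3-valued feature, chosen only for illustration.
categories = ["pointy", "floppy", "oval"]

encoded = [one_hot(v, categories) for v in ["floppy", "oval", "pointy"]]
print(encoded)  # [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

Each row contains exactly one 1, which is where the name "one-hot" comes from.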

One-hot encoding is not specific to decision tree learning. Because it encodes categorical features as ones and zeros, the same features can also be fed as inputs to a neural network, which expects numbers as inputs.

Continuous Value Features

In this section, we look at how to modify a decision tree to work with features that take on continuous values, not just discrete ones.

For example, let's say we add a new feature, weight, to our data:

Take a look at the image above. If we split on weight at 8 lbs (blue line and formula), the information gain would be 0.24:

  • 2 out of 10 of the examples weigh less than 8 lbs

  • 2 out of 2 in the left branch are cats

  • 8 out of 10 of the examples weigh more than 8 lbs

  • 3 out of 8 in the right branch are cats

If we split on weight at 9 lbs, the information gain would be 0.61.

If we split on weight at 13 lbs, the information gain would be 0.40.
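The 0.24 figure for the split at 8 lbs can be checked with a short sketch. It assumes the usual binary entropy and the standard definition of information gain (root entropy minus the weighted entropy of the two branches), and it uses only the counts listed above. The function names are illustrative.

```python
import math


def entropy(p):
    """Binary entropy (in bits) of a node where a fraction p are cats."""
    if p == 0 or p == 1:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def information_gain(n_left, cats_left, n_right, cats_right):
    """Root entropy minus the weighted average entropy of the two branches."""
    n = n_left + n_right
    p_root = (cats_left + cats_right) / n
    weighted = (n_left / n) * entropy(cats_left / n_left) \
        + (n_right / n) * entropy(cats_right / n_right)
    return entropy(p_root) - weighted


# Split at 8 lbs: 2 examples on the left (2 cats), 8 on the right (3 cats).
gain_at_8 = information_gain(2, 2, 8, 3)
print(round(gain_at_8, 2))  # 0.24
```

The left branch is pure, so it contributes no entropy; almost all of the remaining uncertainty sits in the right branch.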

In the more general case, we try not just 3 values but many candidate thresholds along the x axis. One convention is to sort all of the examples by the value of this feature and take the midpoints between consecutive values in the sorted list as the candidate thresholds.

This way, if you have 10 training examples, you test 9 candidate thresholds, pick the one with the highest information gain, and split the node on that feature at that threshold.
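The midpoint convention can be sketched as follows. The weights and labels here are invented purely for illustration (they are not the data from the image), and `best_threshold` is a hypothetical helper name.

```python
import math


def entropy(p):
    """Binary entropy (in bits) of a node where a fraction p are cats."""
    if p == 0 or p == 1:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def best_threshold(values, labels):
    """Try every midpoint between consecutive sorted values.

    labels are 1 (cat) or 0 (not cat). Returns (threshold, gain).
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    root = entropy(sum(labels) / n)
    best_t, best_gain = None, -1.0
    for i in range(n - 1):
        lo, hi = pairs[i][0], pairs[i + 1][0]
        if lo == hi:
            continue  # identical values give no usable midpoint
        t = (lo + hi) / 2
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        weighted = (len(left) / n) * entropy(sum(left) / len(left)) \
            + (len(right) / n) * entropy(sum(right) / len(right))
        gain = root - weighted
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain


# Made-up data: 6 examples, so 5 candidate thresholds are tested.
weights = [12, 7, 18, 8, 15, 10]
is_cat = [0, 1, 0, 1, 0, 1]

threshold, gain = best_threshold(weights, is_cat)
print(threshold, gain)  # 11.0 1.0
```

With n distinct values there are n - 1 midpoints, which matches the count of 9 thresholds for 10 training examples.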

In this example, assuming we only try these 3 candidate thresholds, we choose to split at 9 lbs because it has the highest information gain. We can then recursively build the subtrees on each branch from there.
