
Day 8: Feature Scaling and Learning Rate

Feature Scaling

Feature scaling is a technique that will enable gradient descent to run faster.

Let's start by looking at the relationship between the size of a feature and the size of its associated parameter.

price-prediction = W1*X1 + W2*X2 + b


X1 = size = 300 - 2000 sqft

X2 = number of bedrooms = 0 - 5

In this example, consider a house with X1 = 2000, X2 = 5, and a price of $500k.

What do you think would be the appropriate parameters W (W1 and W2) and b?

Let's look at a couple examples:

Example 1

W1 = 50, W2 = 0.1, b = 50

price-prediction = 50*2000 + 0.1*5 + 50 = 100,000k + 0.5k + 50k

price-prediction = $100,050.5k (about $100 million)

The prediction is far off: with prices and b expressed in thousands of dollars, it is much higher than the actual price of $500,000.

Example 2

W1 = 0.1, W2 = 50, b = 50

price-prediction = 0.1*2000 + 50*5 + 50 = 200k + 250k + 50k

price-prediction = $500,000

In this scenario, the prediction matches the actual price of $500,000.
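The two examples can be checked with a few lines of Python. This is a minimal sketch that assumes prices (and b) are expressed in thousands of dollars, as the $500k target suggests; the function name predict_price is made up for illustration.

```python
def predict_price(x1_sqft, x2_bedrooms, w1, w2, b):
    """Linear model from the text; the result is in thousands of dollars."""
    return w1 * x1_sqft + w2 * x2_bedrooms + b


# House from the example: 2000 sqft, 5 bedrooms, actual price $500k.
example_1 = predict_price(2000, 5, w1=50, w2=0.1, b=50)
example_2 = predict_price(2000, 5, w1=0.1, w2=50, b=50)

print(f"Example 1: {example_1:,.1f} thousand dollars")
print(f"Example 2: {example_2:,.1f} thousand dollars")
```

The large parameter on the large-range feature (W1 = 50 on the size in sqft) is what pushes the first prediction far away from the target.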

You may notice from the examples above that when the possible range of values of a feature is large, for instance the size in sqft, which goes all the way up to 2000, a good model is more likely to learn a relatively small parameter value.

Likewise, when the possible values of a feature are small, such as the number of bedrooms, a reasonable value for its parameter will be relatively large.

So, how does this affect gradient descent?

Let's say you're trying to run gradient descent with two features that have very different ranges. For example, X1 = 0 - 1, X2 = 10 - 100.

If you were to use your data "as-is" in this situation, the contour plot of the cost function may be shaped more like a skinny oval than a circle. Because of that, gradient descent may end up bouncing back and forth for a long time before it finally reaches a minimum.

In these situations, it's usually a good idea to scale the features, which means performing some transformation of your training data so that X1 and X2 both range from about 0 to 1.

The key point is that the re-scaled X1 and X2 now take ranges of values that are comparable to each other.

How to scale features?

There are a few different ways to scale features:

  • divide by max

  • mean normalization

  • z-score normalization

divide by max

Let's take a look at an example:

x1 = 300 - 2000 # x1 is a value between 300 and 2000

300 ≤ x1 ≤ 2000

x1-scaled = x1 / 2000 # x1 divided by maximum value (in this case 2000)

0.15 ≤ x1-scaled ≤ 1 # x1-scaled is now a value between 0.15 and 1

x2 = 0 - 5 # x2 is a value between 0 and 5

0 ≤ x2 ≤ 5

x2-scaled = x2 / 5 # x2 divided by maximum value (in this case 5)

0 ≤ x2-scaled ≤ 1
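The same calculation can be sketched with NumPy. The three training examples below are made up for illustration; only the feature ranges (300 to 2000 sqft, 0 to 5 bedrooms) come from the text.

```python
import numpy as np

# Hypothetical training data: columns are size (sqft) and number of bedrooms.
X = np.array([[300.0, 0.0],
              [1200.0, 3.0],
              [2000.0, 5.0]])

# Divide each column by its own maximum value.
X_scaled = X / X.max(axis=0)

print(X_scaled.min(axis=0))  # smallest scaled value of each feature
print(X_scaled.max(axis=0))  # largest scaled value of each feature
```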

mean normalization

Another way to scale features is mean normalization.

In mean normalization, you start with the original features and re-scale them so that both of them are centered around zero. Where they previously only had values greater than zero, they now have values ranging from about -1 to 1.

The formula for mean normalization:

x-scaled = (x - μ) / (max - min)

μ = mean of the feature

An example:

300 ≤ x1 ≤ 2000

x1-scaled = (x1 - μ1) / (2000 - 300)

-0.18 ≤ x1-scaled ≤ 0.82

0 ≤ x2 ≤ 5

x2-scaled = (x2 - μ2) / (5 - 0)

-0.46 ≤ x2-scaled ≤ 0.54
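A minimal NumPy sketch of mean normalization follows. The function name mean_normalize_features and the sample data are assumptions for illustration, so the resulting bounds will differ from the ones in the example above, which depend on the actual training set means.

```python
import numpy as np


def mean_normalize_features(X):
    # mean of each column/feature
    mu = np.mean(X, axis=0)
    # range (max - min) of each column/feature
    value_range = np.max(X, axis=0) - np.min(X, axis=0)
    # subtract the column mean, then divide by the column range
    X_norm = (X - mu) / value_range
    return X_norm, mu, value_range


# Hypothetical training data: columns are size (sqft) and number of bedrooms.
X = np.array([[300.0, 0.0],
              [600.0, 2.0],
              [900.0, 3.0],
              [2000.0, 5.0]])

X_norm, mu, value_range = mean_normalize_features(X)
print(X_norm.min(axis=0), X_norm.max(axis=0))
```

Each scaled feature has a mean of zero and spans a total width of 1, so every value lies between -1 and 1. A feature that is constant across all examples would give a range of zero and would need separate handling.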

z-score normalization

z-score normalization is another re-scaling method.

To implement z-score normalization, you'll need to calculate the mean (μ) and the standard deviation (σ) of each feature.

The formula for z-score normalization:

x-scaled = (x - μ) / σ

An example:

300 ≤ x1 ≤ 2000

x1-scaled = (x1 - μ1) / σ1

-0.67 ≤ x1-scaled ≤ 3.1

z-score normalization: implementation

After z-score normalization, all features will have a mean of 0 and a standard deviation of 1. Using the above formula, let's code this:

# import libraries
import numpy as np
def z_score_normalize_features(X):
    # find the mean of each column/feature
    mu = np.mean(X, axis=0)
    # find the standard deviation of each column/feature
    sigma = np.std(X, axis=0)
    # element-wise, subtract mu for that column from each example, divide by std for that column
    X_norm = (X - mu) / sigma
    return (X_norm, mu, sigma)
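As a rough usage sketch, the same three steps can be applied to a small made-up dataset to confirm the mean and standard deviation after scaling. Reusing mu and sigma from the training set to scale a new example is common practice and is the likely reason the function returns them, but that use is an assumption here; the data and the name x_new are hypothetical.

```python
import numpy as np

# Hypothetical training data: columns are size (sqft) and number of bedrooms.
X_train = np.array([[300.0, 0.0],
                    [600.0, 2.0],
                    [900.0, 3.0],
                    [2000.0, 5.0]])

# Same steps as the function above: per-column mean, std, then scale.
mu = np.mean(X_train, axis=0)
sigma = np.std(X_train, axis=0)
X_norm = (X_train - mu) / sigma

print(np.mean(X_norm, axis=0))  # approximately 0 for each feature
print(np.std(X_norm, axis=0))   # approximately 1 for each feature

# A new example is scaled with the training set's mu and sigma,
# not with statistics computed from the new example itself.
x_new = np.array([1200.0, 3.0])
x_new_norm = (x_new - mu) / sigma
print(x_new_norm)
```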

Note that the Scikit-learn library can also perform feature scaling, so you don't have to code it yourself with NumPy as shown here.

Take a look at the distribution of features before, during, and after normalization:

The plot above shows the relationship between two of the training set's features, "age" on the vertical axis and "size(sqft)" on the horizontal axis.

Rule of thumb for feature scaling

Aim for about:

-1 ≤ Xj ≤ 1 # for each feature Xj

-3 ≤ Xj ≤ 3 # acceptable ranges

-0.3 ≤ Xj ≤ 0.3 # acceptable ranges

0 ≤ x1 ≤ 3 # okay, no rescaling

-2 ≤ x2 ≤ 0.5 # okay, no rescaling

-100 ≤ x3 ≤ 100 # too large, re-scale

-0.001 ≤ x4 ≤ 0.001 # too small, re-scale

98.6 ≤ x5 ≤ 105 # too large, re-scale

when in doubt, re-scale
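The rule of thumb can be turned into a rough check. This is only a sketch: the helper needs_rescaling and its cutoffs (largest magnitude below 0.3 or above 3) are assumptions chosen to reproduce the verdicts listed above, not fixed rules.

```python
def needs_rescaling(low, high, too_small=0.3, too_large=3.0):
    """Flag a feature whose largest magnitude is far from the -1 to 1 target.

    The cutoffs are illustrative assumptions, not part of any library.
    """
    largest = max(abs(low), abs(high))
    return largest < too_small or largest > too_large


# Feature ranges from the list above.
feature_ranges = {
    "x1": (0, 3),
    "x2": (-2, 0.5),
    "x3": (-100, 100),
    "x4": (-0.001, 0.001),
    "x5": (98.6, 105),
}

for name, (low, high) in feature_ranges.items():
    verdict = "re-scale" if needs_rescaling(low, high) else "okay"
    print(name, verdict)
```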

Checking gradient descent for convergence

Recall that the objective of gradient descent is to reach the minimum of the cost function J.

J(w, b) should decrease after every iteration. The number of iterations needed varies widely, from as few as 30 to as many as 100,000.

Take a look at the learning curve graph above: once the curve has flattened (the red part of the curve), J(w, b) has most likely converged.

Another way to see if gradient descent has reached a minimum is through the 'Automatic convergence test'.

let ε "epsilon" be 0.001

if J(w, b) decreases by ≤ ε in one iteration, declare convergence

It's usually pretty difficult to choose the right threshold ε, so it may be easier to look at the learning curve graph.
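The automatic convergence test can be sketched inside a gradient descent loop as follows. This is a minimal sketch that assumes J is the squared-error cost of linear regression; the function names, the toy data, and the default values of alpha and max_iters are all chosen for illustration.

```python
import numpy as np


def compute_cost(X, y, w, b):
    """Squared-error cost J(w, b) for linear regression (assumed form of J)."""
    errors = X @ w + b - y
    return np.sum(errors ** 2) / (2 * len(y))


def gradient_descent(X, y, alpha=0.1, epsilon=0.001, max_iters=10_000):
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    cost_history = [compute_cost(X, y, w, b)]
    for _ in range(max_iters):
        errors = X @ w + b - y
        # simultaneous update of w and b from the same errors
        w = w - alpha * (X.T @ errors) / m
        b = b - alpha * np.sum(errors) / m
        cost_history.append(compute_cost(X, y, w, b))
        # Automatic convergence test: stop once a single iteration lowers
        # J by no more than epsilon. An increase in J is not treated as
        # convergence.
        decrease = cost_history[-2] - cost_history[-1]
        if 0 <= decrease <= epsilon:
            break
    return w, b, cost_history


# Toy data with one already-scaled feature, generated from y = 2x + 1.
X = np.array([[0.0], [0.5], [1.0]])
y = np.array([1.0, 2.0, 3.0])

w, b, cost_history = gradient_descent(X, y)
print(len(cost_history) - 1, "iterations, final cost", cost_history[-1])
```

Plotting cost_history against the iteration number gives the learning curve, which is the visual alternative to picking ε.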

Choose the learning rate (α)

Your learning algorithm will run much better with an appropriate choice of learning rate. Recall that if the learning rate is too small, gradient descent will take a very long time to converge, and if it is too large, it may never converge.

Concretely, if you plot the cost against the number of iterations and notice that the cost sometimes goes up and sometimes goes down, that could be a sign of either a learning rate that is too large or a bug in the code.

It's recommended to start with a very small learning rate to make sure there's no bug in the code. If the cost decreases on every iteration (even if slowly), then there is likely no bug, and we can gradually increase the learning rate.

You may start with '0.001', and increase the learning rate by 3x to '0.003', and again, and so forth, with each value of alpha roughly 3x bigger than the previous value.

After trying this range of values, pick the largest learning rate for which the cost still decreases consistently, or something slightly smaller than that value.
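This search can be sketched as a small sweep over candidate values of alpha, each roughly 3x the previous one. The sketch assumes a squared-error linear regression cost, uses made-up toy data, and relies on the hypothetical helpers run_gradient_descent and cost_always_decreases; the number of iterations is arbitrary.

```python
import numpy as np


def run_gradient_descent(X, y, alpha, num_iters=50):
    """Run a fixed number of iterations and return the cost at each one."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    costs = []
    for _ in range(num_iters):
        errors = X @ w + b - y
        costs.append(np.sum(errors ** 2) / (2 * m))
        w = w - alpha * (X.T @ errors) / m
        b = b - alpha * np.sum(errors) / m
    return costs


def cost_always_decreases(costs):
    """True if the cost went down on every single iteration."""
    return all(later < earlier for earlier, later in zip(costs, costs[1:]))


# Toy data with one already-scaled feature, generated from y = 2x + 1.
X = np.array([[0.0], [0.5], [1.0]])
y = np.array([1.0, 2.0, 3.0])

# Each candidate is roughly 3x the previous one.
candidate_alphas = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0, 3.0]
well_behaved = [
    alpha for alpha in candidate_alphas
    if cost_always_decreases(run_gradient_descent(X, y, alpha))
]
best_alpha = max(well_behaved)
print("largest alpha with a steadily decreasing cost:", best_alpha)
```

In practice you would also plot the cost curves for a few of these values and might settle on a rate slightly below the largest well-behaved one.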
