
Day 7: Gradient Descent for multiple linear regression

Today, we will put together what we've previously learned to implement gradient descent for multiple linear regression using vectorization.

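To recap:

Parameters: $w_1, \dots, w_n$ and $b$

Model:

$$f_{\mathbf{w},b}(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b$$

Cost function:

$$J(\mathbf{w},b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right)^2$$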

Normal Equation

Before moving on, we will make a quick note on an alternative way to find w and b for linear regression. This method is called the normal equation:

  • is only used for linear regression

  • solves for w and b without iteration

  • doesn't generalize to other learning algorithms

  • is slow when the number of features is large (> 10,000)

What you need to know:

  • The normal equation method may be used in machine learning libraries that implement linear regression

  • Gradient descent is the recommended method for finding the parameters w and b
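
For reference, the normal equation can be sketched in a few lines of NumPy. This is a minimal illustration only, not how any particular library implements it: the function name `normal_equation` and the toy data are made up for this example, and handling b by appending a column of ones to X is an assumption of this sketch.

```python
import numpy as np

def normal_equation(X, y):
    """Solve for w and b in closed form (hypothetical helper for illustration)."""
    m = X.shape[0]
    # Append a column of ones so the last coefficient plays the role of b
    X_aug = np.hstack([X, np.ones((m, 1))])
    # Solve (X^T X) theta = X^T y instead of forming the inverse explicitly
    theta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
    return theta[:-1], theta[-1]

# Toy data generated from y = 2*x1 + 3*x2 + 1 (made up for this example)
X_toy = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y_toy = np.array([9.0, 8.0, 19.0, 18.0])

w, b = normal_equation(X_toy, y_toy)
print(w, b)
```

Because the toy targets follow the linear rule exactly, the recovered w should be close to [2, 3] and b close to 1, with no iteration and no learning rate to tune.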

Vector-vector dot product

The dot product is a mainstay of linear algebra and NumPy. For two vectors a and b with n elements each, it is defined as:

$$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i$$

The dot product multiplies the values in two vectors element-wise and then sums the results. It requires the two vectors to have the same dimensions.

Let's implement our own version of the dot product using a for loop. The function below returns the dot product of two vectors (assume a and b have the same shape):

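import numpy as np

def my_dot(a, b):
    """Compute the dot product of two 1-D vectors of equal length with a loop."""
    x = 0
    for i in range(a.shape[0]):
        # accumulate the product of the i-th elements
        x = x + a[i] * b[i]
    return x

# test 1-D
a = np.array([1, 2, 3, 4])
b = np.array([-1, 4, 3, 2])
my_dot(a, b)
# result of my_dot(a, b) = 24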

Note, the dot product is expected to return a scalar value.

Let's try the same operation using np.dot:

# test 1-D
a = np.array([1, 2, 3, 4])
b = np.array([-1, 4, 3, 2])
c = np.dot(a, b)
c = np.dot(b, a)
# both calls give c = 24, since the dot product is commutative
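
As a quick sanity check, the loop, `np.dot`, and the `@` operator can be compared side by side. The sketch below also shows what happens when the two vectors differ in length; the variable names `short` and `mismatch_raised` are introduced only for this illustration.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([-1, 4, 3, 2])

# Three equivalent ways to compute the same dot product
loop_result = sum(a[i] * b[i] for i in range(a.shape[0]))
dot_result = np.dot(a, b)
matmul_result = a @ b
print(loop_result, dot_result, matmul_result)  # all three are 24

# Vectors of different lengths cannot be combined with np.dot
short = np.array([1, 2, 3])
try:
    np.dot(a, short)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)  # True
```

The vectorized calls avoid the Python-level loop, which is why they are generally preferred once the vectors get large.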

Compute Cost with Multiple Variables

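The equation for the cost function with multiple variables is:

$$J(\mathbf{w},b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right)^2$$

where

$$f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w} \cdot \mathbf{x}^{(i)} + b$$

In contrast to previous posts, w and x^(i) are vectors rather than scalars, which supports multiple features. Below is an implementation of the above equations (the code indexes examples from 0, while the equations index them from 1):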

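def compute_cost(X, y, w, b):
    m = X.shape[0]                        # number of training examples
    cost = 0.0                            # running sum of squared errors
    for i in range(m):
        f_wb_i = np.dot(X[i], w) + b      # prediction for example i
        cost = cost + (f_wb_i - y[i]) ** 2
    cost = cost / (2*m)
    return cost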
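
Since the goal of this post is to use vectorization, the same cost can also be computed without an explicit loop. The following is one possible sketch, not the course's reference code; the name `compute_cost_vectorized` and the toy arrays are made up for this example.

```python
import numpy as np

def compute_cost_vectorized(X, y, w, b):
    """Squared-error cost over all examples, computed without an explicit loop."""
    m = X.shape[0]
    errors = X @ w + b - y          # predictions minus targets, shape (m,)
    return np.dot(errors, errors) / (2 * m)

# Toy data, made up for this example
X_toy = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y_toy = np.array([1.0, 2.0, 3.0])
w_toy = np.array([0.5, -0.5])
b_toy = 1.0

cost = compute_cost_vectorized(X_toy, y_toy, w_toy, b_toy)
print(cost)  # 8.75 / 6, roughly 1.458
```

Here every prediction equals 0.5, so the errors are -0.5, -1.5 and -2.5, and the sum of their squares (8.75) divided by 2m = 6 gives the cost.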

Gradient descent with Multiple Variables

Please note that I use the terms partial derivative, derivative term, and derivative symbol interchangeably in my posts, as was done in the course; here they refer to the same thing. Strictly speaking, in mathematics a partial derivative applies to functions of more than one variable, while a derivative applies to functions of a single variable.

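Gradient descent for multiple variables:

repeat until convergence:

$$w_j = w_j - \alpha \frac{\partial J(\mathbf{w},b)}{\partial w_j} \quad \text{for } j = 1, \dots, n$$

$$b = b - \alpha \frac{\partial J(\mathbf{w},b)}{\partial b}$$

where n is the number of features, $\alpha$ is the learning rate, the parameters w_j and b are updated simultaneously, and:

$$\frac{\partial J(\mathbf{w},b)}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

$$\frac{\partial J(\mathbf{w},b)}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right)$$

Let's implement the equations above (there are many ways to implement them; this is one version):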

# let's first compute the partial derivative terms
def compute_gradient(X, y, w, b):
    m, n = X.shape                        # m examples, n features
    dj_dw = np.zeros((n,))
    dj_db = 0.
    for i in range(m):
        err = (np.dot(X[i], w) + b) - y[i]    # prediction error for example i
        for j in range(n):
            dj_dw[j] = dj_dw[j] + err * X[i, j]
        dj_db = dj_db + err
    dj_dw = dj_dw / m
    dj_db = dj_db / m
    return dj_db, dj_dw
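
The two nested loops can also be replaced by matrix operations, and the result can be checked against a finite-difference approximation of the cost. The sketch below is one way to do this under stated assumptions: the helper names `cost`, `compute_gradient_vectorized` and `numerical_gradient`, as well as the toy data, are introduced here for illustration and do not come from the course.

```python
import numpy as np

def cost(X, y, w, b):
    """Squared-error cost, used here only to check the gradient numerically."""
    errors = X @ w + b - y
    return np.dot(errors, errors) / (2 * X.shape[0])

def compute_gradient_vectorized(X, y, w, b):
    """Partial derivatives of the cost with respect to b and w, without loops."""
    m = X.shape[0]
    err = X @ w + b - y          # shape (m,)
    dj_dw = X.T @ err / m        # shape (n,)
    dj_db = err.sum() / m
    return dj_db, dj_dw

def numerical_gradient(X, y, w, b, eps=1e-6):
    """Central finite differences as an independent check of the formulas."""
    dj_dw = np.zeros_like(w)
    for j in range(w.shape[0]):
        step = np.zeros_like(w)
        step[j] = eps
        dj_dw[j] = (cost(X, y, w + step, b) - cost(X, y, w - step, b)) / (2 * eps)
    dj_db = (cost(X, y, w, b + eps) - cost(X, y, w, b - eps)) / (2 * eps)
    return dj_db, dj_dw

# Toy data, made up for this example
X_toy = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y_toy = np.array([1.0, 2.0, 3.0])
w_toy = np.array([0.5, -0.5])
b_toy = 1.0

dj_db, dj_dw = compute_gradient_vectorized(X_toy, y_toy, w_toy, b_toy)
num_db, num_dw = numerical_gradient(X_toy, y_toy, w_toy, b_toy)
print(dj_db, dj_dw)
print(num_db, num_dw)
```

If the analytic and numerical values agree to several decimal places, the derivative formulas and their implementation are very likely consistent with the cost function.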

After computing the derivative terms, let's run gradient descent:

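import copy
import math

def gradient_descent(X, y, w_in, b_in, cost_function, gradient_function, alpha, num_iters):
    J_history = []                        # cost at each iteration, for later inspection
    w = copy.deepcopy(w_in)               # avoid modifying the caller's array
    b = b_in
    for i in range(num_iters):
        # compute the gradient, then update w and b simultaneously
        dj_db, dj_dw = gradient_function(X, y, w, b)
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
        if i < 100000:                    # cap the history to limit memory use
            J_history.append(cost_function(X, y, w, b))
        # print the cost roughly ten times over the run
        if i%math.ceil(num_iters/10) == 0:
            print(f"Iteration {i:4d}: Cost {J_history[-1]:8.2f}")
    return w, b, J_history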

To test the implementation:

Note that the code below shows how to run gradient descent, but no actual data is defined in this post; it is for reference purposes only.

# X_train, y_train and w_init are assumed to be defined elsewhere (not shown in this post)

# initialize parameters
initial_w = np.zeros_like(w_init)
initial_b = 0.

# set gradient descent settings
iterations = 1000
alpha = 5.0e-7

# run gradient descent
w_final, b_final, J_hist = gradient_descent(X_train, y_train, initial_w, initial_b, compute_cost, compute_gradient, alpha, iterations)

print(f"b, w found by gradient descent: {b_final:0.2f}, {w_final}")
m, _ = X_train.shape
for i in range(m):
    print(f"prediction: {np.dot(X_train[i], w_final) + b_final:0.2f}, target value: {y_train[i]}")

# expected result:
# b, w found by gradient descent: -0.00,[0.2   0.  -0.01 -0.07]
# prediction: 426.19, target value: 460
# prediction: 286.17, target value: 232
# prediction: 171.47, target value: 178
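
Because the snippet above relies on data that is not defined in this post, here is a minimal self-contained run on synthetic data. Everything in it is an assumption made for illustration: the data is randomly generated from a known linear rule, the names `cost_fn`, `grad_fn`, `X_syn`, `y_syn`, `true_w` and `true_b` are made up, and the learning rate and iteration count were chosen for this toy problem only.

```python
import numpy as np

def cost_fn(X, y, w, b):
    """Squared-error cost (compact vectorized stand-in for illustration)."""
    err = X @ w + b - y
    return np.dot(err, err) / (2 * X.shape[0])

def grad_fn(X, y, w, b):
    """Gradient of the cost with respect to b and w (vectorized stand-in)."""
    err = X @ w + b - y
    return err.mean(), X.T @ err / X.shape[0]

# Synthetic, noise-free data from a known rule: y = 1.5*x1 - 2.0*x2 + 0.5
rng = np.random.default_rng(0)
X_syn = rng.random((100, 2))              # features drawn uniformly from [0, 1)
true_w, true_b = np.array([1.5, -2.0]), 0.5
y_syn = X_syn @ true_w + true_b

w, b = np.zeros(2), 0.0
alpha, iterations = 0.5, 5000             # settings chosen for this toy problem
J_history = []
for _ in range(iterations):
    dj_db, dj_dw = grad_fn(X_syn, y_syn, w, b)
    w = w - alpha * dj_dw                 # update w and b simultaneously
    b = b - alpha * dj_db
    J_history.append(cost_fn(X_syn, y_syn, w, b))

print(w, b, J_history[-1])
```

On this small, well-scaled problem the learned parameters should end up very close to the rule used to generate the data, which is a handy way to confirm the update loop works before trying it on real data.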

Our example result shows that the predictions are not very accurate compared with the target values. We'll explore how to improve on this in tomorrow's post.
