Day 2: Train the model with gradient descent


Previously, we saw visualizations of the cost function J and how trying different choices of the parameters w and b gives different cost values.

Gradient descent is an algorithm that finds values of w and b in a more systematic way, aiming for the smallest possible cost J(w,b).

Gradient descent is used in many places in machine learning, including some of the most advanced neural network models.

Here is an overview of what gradient descent does. Once we have a cost function J(w,b) that we want to minimize, the algorithm keeps changing the parameters w and b by a small amount at each step to reduce J(w,b), until J settles at or near a minimum.

Before we dive into gradient descent: previously, we visualized the cost function J using only one parameter, w. Let's look at what the plots can look like when b is included as well. With this cost function, we try to get to the bottom of the convex (bowl-shaped) surface, which is the minimum.

We can see this in the contour plot as well, where the minimum sits at the center of the innermost ellipse or circle.

It is also possible to have a more complex surface plot J, such as this one:

As we can see from the plot above, it is possible for the cost function J to have more than one minimum, and it is not always shaped like a bowl or a hammock.

The point at which the surface reaches its lowest value is called the global minimum. During gradient descent, you may instead reach a minimum that is not the lowest point of the whole surface; this is called a local minimum. A cost function can have more than one local minimum.
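To make the difference between a local and a global minimum concrete, here is a minimal Python sketch. The one-parameter function `cost` is a made-up, non-convex example (not the cost function from this series), and the step size, step count, and starting points are arbitrary assumptions. The idea is simply that the valley gradient descent settles in depends on where it starts.

```python
def cost(w):
    # Hypothetical non-convex cost with two valleys (illustration only).
    return w**4 - 3 * w**2 + w


def d_cost(w):
    # Derivative of the cost above with respect to w.
    return 4 * w**3 - 6 * w + 1


def descend(w, alpha=0.01, steps=1000):
    # Repeatedly take a small step against the slope.
    for _ in range(steps):
        w = w - alpha * d_cost(w)
    return w


w_left = descend(-2.0)   # starts on the left slope
w_right = descend(2.0)   # starts on the right slope

print("from -2.0:", w_left, "cost:", cost(w_left))
print("from  2.0:", w_right, "cost:", cost(w_right))
```

Both runs stop at a point where the slope is close to zero, but the left valley is deeper than the right one. For this toy function the left result is the global minimum and the right result is only a local minimum.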

Implementing gradient descent

Gradient descent algorithm (update rule for w):

w = w - α · ∂J(w,b)/∂w

Please note that in the update rule above, the equals sign '=' is used as an assignment operator, not as a truth assertion ('==').

On each step, the parameter w is overwritten with the new value computed on the right side of the formula.
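A single update step can be sketched in Python, where '=' and '==' have the same two meanings described above. The numbers here (`alpha`, the slope `dj_dw`, and the starting `w`) are assumed values chosen only for illustration.

```python
alpha = 0.1   # learning rate (assumed value)
dj_dw = 4.0   # assumed slope of J at the current w
w = 3.0       # current parameter value

# Assignment: compute the right-hand side, then store the result back in w.
w = w - alpha * dj_dw
print(w)  # roughly 2.6

# Equality test: asks whether two values are the same and gives True or False.
print(w == 3.0)  # False, because w has just been updated
```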

Let's take a deeper look into what the symbols mean in the equation:

  • In this equation, α (alpha) is called the learning rate.

  • The learning rate is usually a small positive number between 0 and 1.

  • α controls how big a step the algorithm takes downhill on its way to the minimum.

  • If α is very large, each step downhill is very big. Gradient descent may then overshoot and never reach the minimum (more on this later).

  • Conversely, a very small α may make gradient descent very slow.
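The effect of the learning rate can be tried out on a toy problem. The sketch below assumes a one-parameter cost J(w) = w², whose derivative is 2w and whose minimum is at w = 0. The three α values, the starting point, and the number of steps are arbitrary choices for illustration; the largest α is deliberately picked to be too big.

```python
def run(alpha, w=1.0, steps=20):
    # Gradient descent on the toy cost J(w) = w**2 (derivative: 2 * w).
    for _ in range(steps):
        w = w - alpha * 2 * w
    return w


w_small = run(alpha=0.001)  # tiny steps: barely moves in 20 steps
w_good = run(alpha=0.1)     # moderate steps: ends up close to w = 0
w_large = run(alpha=1.1)    # too large: overshoots and moves further away

for name, value in [("small", w_small), ("good", w_good), ("large", w_large)]:
    print(name, value)
```

With the moderate α the result is close to 0 after 20 steps, with the tiny α it is still near the starting point, and with the oversized α each step jumps past the minimum and lands further away than before.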

The derivative term, ∂J(w,b)/∂w (the term in the pink box):

  • This is the derivative of the cost function J with respect to w. We'll look at it more deeply later, but for now, know that this term tells you in which direction to take your baby step.

  • In combination with the learning rate α, it also determines the size of the steps our model takes downhill.

  • As a reminder, the model has 2 parameters, w and b. The parameter b is updated by an assignment with a very similar expression:

b = b - α · ∂J(w,b)/∂b
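As a rough illustration of what the two derivative terms look like in practice, the sketch below assumes a linear model f(x) = w·x + b with a squared-error cost J(w,b) = (1/(2m)) · Σ (f(xᵢ) - yᵢ)². That choice of model and cost, the helper name `compute_gradient`, and the two training points are assumptions made for this example; they are not defined in this post.

```python
def compute_gradient(x, y, w, b):
    # Derivatives of the assumed cost J(w, b) = (1 / (2m)) * sum((w*x_i + b - y_i)**2).
    m = len(x)
    dj_dw = 0.0
    dj_db = 0.0
    for x_i, y_i in zip(x, y):
        error = (w * x_i + b) - y_i   # prediction minus target
        dj_dw += error * x_i
        dj_db += error
    return dj_dw / m, dj_db / m


# Made-up training data for illustration.
x_train = [1.0, 2.0]
y_train = [300.0, 500.0]

dj_dw, dj_db = compute_gradient(x_train, y_train, w=0.0, b=0.0)
print(dj_dw, dj_db)  # -650.0 -400.0
```

Both derivatives are negative at w = 0, b = 0, so subtracting α times each of them increases w and b, which is the downhill direction for this data.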

In summary:

  • On the surface plot, our model takes baby steps downhill until it gets to the bottom of a valley, which is a local or global minimum.

  • In the gradient descent algorithm, we repeat the two update steps (for w and for b) until the algorithm converges.

  • Converging means the algorithm has reached a point at or near a local minimum, where the parameters w and b no longer change much with each additional step.

  • w and b have to be updated simultaneously: both new values are computed from the old w and b before either one is overwritten.
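The points in the summary can be put together in one short loop. This is a minimal sketch, not a definitive implementation: it assumes a linear model f(x) = w·x + b with a squared-error cost, made-up training data, and hypothetical names and defaults (`compute_gradient`, `gradient_descent`, `alpha`, `tol`, `max_steps`) chosen for the example.

```python
def compute_gradient(x, y, w, b):
    # Derivatives of the assumed cost J(w, b) = (1 / (2m)) * sum((w*x_i + b - y_i)**2).
    m = len(x)
    dj_dw = sum(((w * x_i + b) - y_i) * x_i for x_i, y_i in zip(x, y)) / m
    dj_db = sum((w * x_i + b) - y_i for x_i, y_i in zip(x, y)) / m
    return dj_dw, dj_db


def gradient_descent(x, y, w=0.0, b=0.0, alpha=0.1, tol=1e-9, max_steps=10_000):
    for _ in range(max_steps):
        # Both derivatives are evaluated at the current (old) w and b.
        dj_dw, dj_db = compute_gradient(x, y, w, b)
        tmp_w = w - alpha * dj_dw
        tmp_b = b - alpha * dj_db

        # Converged when neither parameter changes by more than tol.
        converged = abs(tmp_w - w) < tol and abs(tmp_b - b) < tol

        # Simultaneous update: overwrite both parameters together.
        w, b = tmp_w, tmp_b
        if converged:
            break
    return w, b


# Made-up training data that lies exactly on the line y = 200 * x + 100.
x_train = [1.0, 2.0]
y_train = [300.0, 500.0]

w_final, b_final = gradient_descent(x_train, y_train)
print(w_final, b_final)
```

Using the temporary variables `tmp_w` and `tmp_b` is what makes the update simultaneous. If w were overwritten first, the derivative for b would be computed with the new w instead of the old one.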

We will be taking a look at an example of gradient descent tomorrow.

Further reading:

  • To gain further understanding on the mathematics behind gradient descent algorithm, please click here
