
Day 17: Cost Function with Regularization

Note: this article uses the terms "hyperparameters" and "parameters" interchangeably. Lambda and the learning rate are usually called hyperparameters of an ML model, but to keep things simple for first-time readers, they are referred to as parameters here.

The idea behind regularization is that if the parameters have smaller values, it's almost like having a simpler model, perhaps one with fewer features, which is therefore less prone to overfitting.

Regularization tends to be used when you have a lot of features, for example 100, and you may not know which are the most important and which ones to penalize. Regularization therefore usually penalizes all of the features (more precisely, all of the weight parameters) by adding a penalty term, scaled by lambda, to the cost function. You may apply regularization to the bias parameter as well, but this is done less often in practice.

Cost function with a regularization term on the weight parameters:
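Written out in LaTeX, and assuming the usual squared-error cost for linear regression with m training examples, n features and model f_{w,b}(x) = w · x + b (this notation is an assumption based on the description in this article), the cost function is:

$$
J(\vec{w}, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2
$$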

Optional regularization term on the bias parameters that you may add:
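Assuming the bias penalty uses the same 1/(2m) scaling as the weight penalty, a sketch of the extra term that would be added to the cost function is:

$$
\frac{\lambda}{2m} b^2
$$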

Lambda is the Greek letter λ, and here it is called the regularization parameter. Just as with the learning rate, you have to choose a value for lambda. We also divide lambda by 2m, where m is the number of training examples, so that both the first and second terms are scaled by 1/(2m). Scaling both terms the same way makes it easier to choose a good value for lambda.

By convention, the parameter b is usually not penalized for being large, because penalizing it makes very little difference in practice.
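As a minimal sketch of the regularized cost described above, the NumPy function below computes the two terms separately and leaves b unpenalized. The function name `regularized_cost`, the argument names and the tiny dataset are illustrative assumptions, not something taken from the article.

```python
import numpy as np

def regularized_cost(X, y, w, b, lam):
    """Return (fit_term, reg_term, total) for regularized linear regression.

    X: (m, n) feature matrix, y: (m,) targets, w: (n,) weights,
    b: scalar bias, lam: the regularization parameter lambda.
    """
    m = X.shape[0]
    predictions = X @ w + b
    # First term: squared error, scaled by 1/(2m)
    fit_term = np.sum((predictions - y) ** 2) / (2 * m)
    # Second term: penalty on the weights only (b is not penalized), scaled by lambda/(2m)
    reg_term = lam * np.sum(w ** 2) / (2 * m)
    return fit_term, reg_term, fit_term + reg_term

# Tiny made-up example: 2 training examples, 2 features
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
w = np.array([1.0, -1.0])
b = 0.5

fit_term, reg_term, total = regularized_cost(X, y, w, b, lam=2.0)
print(fit_term, reg_term, total)
```

Returning the two terms separately makes it easy to see how much of the total cost comes from fitting the data and how much comes from the size of the weights.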

To summarize, this modified cost function trades off two goals we may have:

  • The first term encourages the algorithm to fit the training data well by minimizing the squared differences between the predictions and the actual values.

  • The second term encourages the algorithm to keep the parameters w_j small, which tends to reduce overfitting.

To gain better intuition on the parameter lambda, let's look at how different values of lambda affect your learning algorithm:

  • If lambda = 0, then you're not using the regularization term at all, and the model may overfit.

  • If lambda is too large (for example 100, though how large is too large depends on the data), the learning algorithm will choose the weights w_j to be extremely close to 0, so f(x) is basically b. The learning algorithm will then fit a horizontal straight line and underfit.
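To see the effect of lambda numerically, here is a small sketch on synthetic data. It is an assumption-laden illustration: the helper `fit_ridge` is hypothetical, the data is randomly generated, and instead of gradient descent it uses the closed-form minimizer of the same cost function (with b left unpenalized) so that the result does not depend on a learning rate.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Minimize (1/2m)*sum((X@w + b - y)**2) + (lam/2m)*sum(w**2) in closed form.

    b is not penalized, so it is recovered from the feature and target means.
    """
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean          # centered features
    yc = y - y_mean          # centered targets
    n = X.shape[1]
    # Setting the gradient to zero gives (Xc^T Xc + lam * I) w = Xc^T yc
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(n), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

# Synthetic data (made up for illustration): 20 examples, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0 + 0.1 * rng.normal(size=20)

lambdas = [0.0, 1.0, 100.0, 1e6]
results = [fit_ridge(X, y, lam) for lam in lambdas]
norms = [float(np.linalg.norm(w)) for w, _ in results]

for lam, (w, b), norm in zip(lambdas, results, norms):
    print(f"lambda={lam:g}  ||w||={norm:.4f}  b={b:.4f}")
```

As lambda grows, the size of the weight vector shrinks toward 0 and b approaches the mean of y, which is the horizontal-line fit described above.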

Gradient descent for linear regression with the regularization term:
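Assuming standard gradient descent with learning rate α on the regularized cost function (with b not penalized), the update rules can be sketched in LaTeX as follows. Both parameters are updated simultaneously and the updates repeat until convergence:

$$
\begin{aligned}
w_j &:= w_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} w_j \right] \quad \text{for } j = 1, \dots, n \\
b &:= b - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right)
\end{aligned}
$$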

Essentially, on every single iteration regularization multiplies w_j by a number slightly less than 1 before the usual gradient update is applied, which shrinks the value of w_j by a little bit each time.
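The shrinking effect can be made explicit by rearranging the weight update for the regularized cost into w_j := w_j * (1 - alpha * lambda / m) - alpha * (1/m) * sum((f(x) - y) * x_j). The sketch below, which assumes standard gradient descent with an unpenalized b, implements one step in both forms and checks that they agree. The function names, the learning rate and the synthetic data are illustrative assumptions.

```python
import numpy as np

def gradient_step(X, y, w, b, alpha, lam):
    """One gradient descent step on the regularized cost (b is not penalized)."""
    m = X.shape[0]
    error = X @ w + b - y
    grad_w = (X.T @ error) / m + (lam / m) * w
    grad_b = error.mean()
    return w - alpha * grad_w, b - alpha * grad_b

def gradient_step_shrink_form(X, y, w, b, alpha, lam):
    """The same step, written as 'shrink w first, then apply the usual update'."""
    m = X.shape[0]
    error = X @ w + b - y
    shrink = 1 - alpha * lam / m   # a number slightly less than 1
    w_new = shrink * w - alpha * (X.T @ error) / m
    b_new = b - alpha * error.mean()
    return w_new, b_new, shrink

# Synthetic data (made up for illustration): 50 examples, 2 features
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = X @ np.array([3.0, -2.0]) + 1.0
w0, b0 = np.array([0.5, 0.5]), 0.0

w_a, b_a = gradient_step(X, y, w0, b0, alpha=0.01, lam=1.0)
w_b, b_b, shrink = gradient_step_shrink_form(X, y, w0, b0, alpha=0.01, lam=1.0)
print("shrink factor:", shrink)
print("forms agree:", np.allclose(w_a, w_b) and np.isclose(b_a, b_b))
```

With alpha = 0.01, lambda = 1 and m = 50, the factor is 1 - 0.01 * 1 / 50 = 0.9998, so each iteration scales w_j down by 0.02% before the ordinary gradient step.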
