
Day 26: Activation Functions

Alternatives to the Sigmoid Function

So far, we have been using the sigmoid function in all the nodes in the hidden layers, but our neural network can be more powerful if we use other activation functions as well.

A very common choice of activation function in neural networks is ReLU (Rectified Linear Unit), defined as g(z) = max(0, z).

Another activation function worth mentioning is the linear activation function, g(z) = z. Because it leaves z unchanged, it is sometimes described as using no activation function.
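To make the definitions concrete, here is a minimal sketch of the sigmoid, ReLU, and linear activation functions in plain Python. The function names and the sample values of z are chosen only for illustration; a library such as TensorFlow provides its own implementations.

```python
import math

def sigmoid(z):
    # Squashes any real z into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    # g(z) = max(0, z): zero for negative z, z itself otherwise.
    return max(0.0, z)

def linear(z):
    # g(z) = z: the input passes through unchanged.
    return z

# Evaluate each function at a few illustrative values of z.
for z in (-2.0, 0.0, 2.0):
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.3f}  relu={relu(z):.1f}  linear={linear(z):+.1f}")
```

Note how ReLU and the linear function agree for positive z and differ only when z is negative, while sigmoid always stays between 0 and 1.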


Take a look at the comparison between the three activation functions (sigmoid, ReLU, and linear) below:



Later on, we will also look at a fourth activation function called softmax.


Choosing Activation Functions

When choosing the activation function for your output layer, there is usually a fairly natural choice, depending on the label y you're trying to predict.

Let's look at a few scenarios:

  • If you're working with a classification problem where y is either zero or one, a binary classification problem, then the sigmoid function would be a natural choice.

  • If you're solving a regression problem where y can be either positive or negative, for example predicting how tomorrow's stock price will change compared to today's, then we would recommend the linear activation function.

  • If y can only take non-negative values, such as in housing price prediction, then the ReLU activation function is a natural choice, since its output is never negative.
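The three scenarios above can be sketched as a small decision rule. The helper below is hypothetical (it is not part of TensorFlow or any library) and simply inspects a few example labels. In practice you would decide based on what values y can take in principle, not only on the samples you happen to have.

```python
def suggest_output_activation(labels):
    """Hypothetical helper: suggest an output-layer activation from example labels."""
    values = list(labels)
    if all(v in (0, 1) for v in values):
        return "sigmoid"  # binary classification: y is either 0 or 1
    if all(v >= 0 for v in values):
        return "relu"     # y can only take non-negative values
    return "linear"       # y can be positive or negative

print(suggest_output_activation([0, 1, 1, 0]))         # binary labels
print(suggest_output_activation([250.0, 310.5, 0.0]))  # e.g. housing prices
print(suggest_output_activation([-1.2, 0.4, 3.0]))     # e.g. price changes
```

The binary check comes first because labels of 0 and 1 are also non-negative and would otherwise be matched by the ReLU rule.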

For hidden layers, the ReLU activation function is by far the most common choice among practitioners today. Let's take a look at how this is implemented in code:



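from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Hidden layers use ReLU; the output layer uses sigmoid for binary classification.
model = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=1, activation='sigmoid'),
])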

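We don't use the linear activation function in the hidden layers because it defeats the purpose of using a neural network. A linear function of a linear function is itself a linear function, so the whole network collapses into a single linear model. With a linear output layer, the result is the same as using linear regression; with a sigmoid output layer, it is the same as using logistic regression.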
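Here is a small numeric sketch of why stacking linear layers adds nothing. It assumes a tiny network with one input, one hidden unit, and one output, and the weights are made-up values for illustration only.

```python
# Hypothetical weights for a tiny network: one input, one hidden unit, one output.
w1, b1 = 2.0, 1.0    # hidden layer
w2, b2 = 3.0, -4.0   # output layer

def two_linear_layers(x):
    a1 = w1 * x + b1     # hidden layer with linear activation g(z) = z
    return w2 * a1 + b2  # output layer with linear activation

# Expanding w2 * (w1 * x + b1) + b2 gives a single linear function w * x + b.
w = w2 * w1
b = w2 * b1 + b2

def single_linear_model(x):
    return w * x + b

# Both functions give the same output for every input tried here.
for x in (-1.0, 0.0, 2.5):
    print(x, two_linear_layers(x), single_linear_model(x))
```

The same algebra holds for layers with many units: multiplying two weight matrices just produces another weight matrix, so extra linear layers never let the network fit anything beyond a linear model.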
