In this article we’ll address the following topics:

  • Why we need activation functions
  • Different kinds of activation functions
  • Advantages and disadvantages of different activation functions

Why do we need Activation Functions?

Let’s take an example of a single neuron. Each neuron performs two functions:

  • Calculate the weighted sum of all input features
  • Pass this sum through an activation function.
z = w1*x1 + w2*x2 + w3*x3   (Eq. 1), where x1, x2, x3 are the input features and w1, w2, w3 are the weights
a = f(z), where f is the activation function.
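As a concrete illustration of these two steps, here is a minimal NumPy sketch; the feature values, the weights, and the choice of sigmoid as f are made up for the example.

```python
import numpy as np

def sigmoid(z):
    # One possible choice for the activation function f
    return 1.0 / (1.0 + np.exp(-z))

# Example feature values and weights (arbitrary numbers, just for illustration)
x = np.array([0.5, -1.2, 3.0])   # x1, x2, x3
w = np.array([0.4,  0.7, -0.2])  # w1, w2, w3

z = np.dot(w, x)   # Eq. 1: the weighted sum of the inputs
a = sigmoid(z)     # pass z through the activation function

print(f"z = {z:.4f}, a = {a:.4f}")
```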

So the activation function basically provides a non-linearity on top of z, which is what lets the network learn complex functions. If we remove all the activation functions, the network can only learn linear functions, no matter how many layers it has, and that won’t be of much help. To get more clarity on this, you can refer to my article on neural networks.
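To make the linearity point concrete, here is a small sketch (the layer sizes and random weights are arbitrary) showing that two stacked layers without activation functions collapse into a single linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)           # an arbitrary input vector

W1 = rng.normal(size=(4, 3))     # first layer's weights, no activation applied
W2 = rng.normal(size=(2, 4))     # second layer's weights, no activation applied

two_layers = W2 @ (W1 @ x)       # output of the "deep" network without activations
one_layer = (W2 @ W1) @ x        # a single linear layer with weights W2 @ W1

# The two outputs are identical, so the extra layer added no expressive power
print(np.allclose(two_layers, one_layer))   # True
```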

You’ll get a better understanding of this when we discuss the different activation functions.

Different Kinds of Activation Functions

There are a variety of activation functions, and we can choose one for our model based on the type of problem we are solving.

Sigmoid Function

  • The sigmoid function is given by σ(z) = 1 / (1 + e^(-z)), and its graph has an S-shape.
  • For any input z, it squashes the value into the range (0, 1). Because of this, the sigmoid is commonly used as the activation function for the output layer in binary classification problems, since its output can be interpreted as the probability that a particular class is present.
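Here is a quick sketch of how the sigmoid squashes values into (0, 1) and how that output can be read as a class probability; the sample inputs and the 0.5 decision threshold are just the usual convention, not something dictated by the function itself.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real-valued z into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

for z in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    p = sigmoid(z)
    label = 1 if p >= 0.5 else 0   # common decision rule for binary classification
    print(f"z = {z:6.1f} -> sigmoid(z) = {p:.4f} -> predicted class {label}")
```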

Problems with the Sigmoid Function

  • Vanishing gradient problem: Suppose our network is very deep and we calculate the gradients for updating its weights. The gradient of the sigmoid is at most 0.25, so when many of these gradients are multiplied together via the chain rule, the final gradient becomes extremely small. Since the weight update is w = w - αΔw, a tiny gradient makes the update negligible, and the initial layers of the network effectively stop getting updated (think about why the earliest layers suffer the most!). A rough numeric sketch follows this list.
  • Sigmoid is not zero-centered: From the graph (and the formula) we can see that the sigmoid only produces positive outputs. So whatever z = w*x + b we feed in, the activations passed on to the next layer are always positive rather than centered around zero, which restricts the range of values the output can take.
  • Exponential functions are expensive to compute.
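Below is a rough numeric illustration of the vanishing-gradient argument; the 20-layer depth and the assumption that every local gradient takes its best-case value of 0.25 are chosen purely to make the effect visible.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # largest possible value is 0.25, reached at z = 0

print(sigmoid_grad(0.0))          # 0.25 -> the best case for a single layer

# Chain rule across a deep stack: even in the best case the product shrinks fast
n_layers = 20
print(0.25 ** n_layers)           # ~9e-13 -> updates to the earliest layers become negligible
```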

Tanh

(Figure: the tanh activation function)
  • The tanh function was used in early architectures such as LeNet and generally performs better than the sigmoid function.
  • It is zero-centered: its output lies between -1 and 1 (i.e. -1 < tanh(z) < 1), so the activations passed to the next layer are spread around zero rather than being all positive.

Problems with Tanh Function

  • Saturation problem: As with the sigmoid, for large values of |z| the gradient of tanh becomes practically zero, so the affected neurons act as if they were dead and contribute nothing to predicting the output. Look at the image above and think about why this happens; a small numeric sketch follows this list.
  • The exponential function is expensive to compute.
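Here is a small numeric sketch of the saturation problem; the sample z values are arbitrary, chosen to show how quickly the tanh gradient collapses once |z| is large.

```python
import numpy as np

def tanh_grad(z):
    # Derivative of tanh: 1 - tanh(z)^2
    return 1.0 - np.tanh(z) ** 2

for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z = {z:4.1f} -> tanh(z) = {np.tanh(z): .6f} -> gradient = {tanh_grad(z):.2e}")
# For large |z| the gradient is essentially zero, so such a neuron barely updates its weights
```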

ReLU

(Figure: the ReLU activation function)
  • ReLU stands for Rectified Linear Unit and is one of the most widely used activation functions in the industry.
  • ReLU largely solves the vanishing-gradient problem: for positive inputs its gradient is exactly one (its maximum value), so gradients do not shrink as they are multiplied back through many layers.
  • It also avoids the saturation problem, since for positive inputs the slope stays at one and never decays towards zero, no matter how large z becomes. A short sketch follows this list.
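A minimal sketch of ReLU and its gradient; the sample values of z are arbitrary, and the gradient at exactly z = 0 (where it is technically undefined) is taken as 0 here.

```python
import numpy as np

def relu(z):
    # ReLU: max(0, z), applied element-wise
    return np.maximum(0.0, z)

def relu_grad(z):
    # Gradient is 1 for positive z and 0 for negative z (taken as 0 at exactly z = 0)
    return (z > 0).astype(float)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))        # [0.  0.  0.  0.5 3. ]
print(relu_grad(z))   # [0. 0. 0. 1. 1.] -> gradient stays at 1 for active units, no saturation
```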

Problem with ReLU

  • Suppose the input to our ReLU is z = w*x + b, and during training “b” is pushed to a very large negative value. This in turn makes z negative for essentially every input, and since ReLU outputs zero (with zero gradient) for negative values, the neuron gets stuck producing zero, which again leads to dead neurons in our network. A toy sketch of this follows the list.
  • A large learning rate makes this more likely, because a larger L.R. leads to bigger updates to b and hence a greater chance of it becoming strongly negative.
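Here is a toy sketch of the dying-ReLU scenario described above; the inputs, weights, and the exaggerated negative bias are all made up so that z ends up negative.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

x = np.array([0.8, 1.5, 2.0])     # example input features
w = np.array([0.3, -0.1, 0.2])    # example weights
b = -50.0                         # bias pushed to a large negative value during training

z = np.dot(w, x) + b              # z is strongly negative for this (and most) inputs
print(relu(z))                    # 0.0 -> the neuron outputs zero ...
print(float(z > 0))               # 0.0 -> ... and its gradient is zero, so w and b stop updating
```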

Leaky ReLU

(Figure: the Leaky ReLU activation function)
  • Leaky ReLU is a slight modification of the ReLU function.
  • It addresses the dead-neuron problem we saw with ReLU for negative inputs by outputting 0.01x instead of 0 for x < 0, so the gradient there is small but never exactly zero. A minimal sketch follows below.
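A minimal sketch of Leaky ReLU with the 0.01 slope mentioned above (the slope is a hyperparameter; 0.01 is just a common default):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # Uses alpha * z instead of 0 for negative inputs, so the gradient never dies completely
    return np.where(z > 0, z, alpha * z)

z = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
print(leaky_relu(z))   # [-1.   -0.01   0.     1.   100.  ]
# Even for strongly negative z the output is small but non-zero, and the slope there is
# alpha rather than 0, so the neuron can still recover during training.
```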

That is it for this article. There are other activation functions that you can explore, some of which are:

  • Softmax Activation Function
  • GELU
  • ELU
