Deep Dive into Logistic Regression

Reader's Warning: Hold on to your seats, folks, since this is going to be a long and comprehensive article covering all the nitty-gritty details of Logistic Regression.

We’ll first discuss all the ingredients required and then see how these ingredients come together to prepare an awesome algorithm.

So, without wasting any time let’s start with our article.

When to use Logistic Regression?

You all must have heard of classic machine learning problems like cats vs. dogs or benign vs. malignant tumors. Basically, wherever we need to classify between two classes, we have a binary classification problem, where the output class is either 0 or 1.

Identifying cats v/s dogs(Fig1)

Maths Session

To keep things simple while understanding logistic regression, we’ll take a dataset containing one feature and two output classes. The general equation will be

z = w.x + b, where w is the weight, x is the feature, and b is the bias term.

When we solve the above equation using any values of w and b, it should give us our class output. But there is a little problem with the above equation: the class labels in binary classification can only be 0 or 1, while the equation can give us any real number. But don’t worry, we have a solution to this problem.

Sigmoid Function to the rescue

Sigmoid Function(Fig 2)

Now let’s discuss the sigmoid function. The sigmoid function takes any real number as input and outputs a value between 0 and 1, which can be treated as a probability. As you can see in the sigmoid graph, as x -> ∞, y -> 1, and as x -> -∞, y -> 0. So the value of y always stays between 0 and 1. The sigmoid curve crosses the y-axis at 0.5, i.e. when z is zero.

So, how do we use it in Logistic Regression? The sigmoid, also called the logistic function, is actually what gives Logistic Regression its name. The equation z = w.x + b gives an output that can be any real number; if we pass z through the sigmoid function, we get an output between 0 and 1, which is our class probability.
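As a quick sketch in Python (the function name `sigmoid` and the values of w, x, and b below are our own illustrative choices, not from the article):

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothesis: z = w.x + b, then squash z into a probability.
w, b = 2.0, -1.0   # illustrative values, not learned ones
x = 0.5
z = w * x + b      # can be any real number
p = sigmoid(z)     # always between 0 and 1: the probability of class 1
```

Note that `sigmoid(0)` is exactly 0.5, matching the point where the curve crosses the y-axis in Fig 2.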

Cost Function

Now we have our equation in place, but how do we know whether it is working as expected? For this, we need a mechanism to measure how close the predicted y value is to the actual y value. The cost function helps in estimating this difference. The cost function we’ll be using is the log loss function.

Log Loss function(Fig 3)

In the above loss function, y represents the actual value and p represents the predicted probability, i.e. the output of the sigmoid applied to z in our hypothesis equation.

Now, we’ll try to interpret the above equation by breaking it down into two parts.

When the actual label (y) is zero, the cost function becomes -log(1-p), and the graph of -log(1-p) is

graph of -log(1-p). Cost function on y-axis and hypothesis on the x-axis

As you can see in the above graph, if p (the sigmoid of z in our hypothesis equation) is zero, then the error is zero, since the actual and predicted values are the same. But if the predicted value is one while the actual value is 0, the cost function shoots up to a very large value.
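A tiny sketch of the log loss in code (the helper name `log_loss` and the `eps` clipping trick are our own additions; the clipping just avoids taking log of exactly 0):

```python
import numpy as np

def log_loss(y, p, eps=1e-15):
    """Binary cross-entropy for one example.
    y is the actual label (0 or 1), p the predicted probability."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Confident and correct -> tiny loss; confident and wrong -> huge loss.
print(log_loss(0, 0.01))  # small: prediction agrees with the label
print(log_loss(0, 0.99))  # large: the cost "shoots up"
```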

A Little homework for you: Try to plot the cost function log graph when y=1 and derive your conclusion.

Some of you must be wondering why we are not using the mean squared error function. Well, there is a valid reason behind it, and we’ll discuss that with our next ingredient.

Now we have the cost function, but we need a mechanism to reduce its value, and for that we’ll need to find optimized values of “w” and “b” in our equation that give us the most accurate predictions.

But how do we find the optimal values for “w” and “b”? Well, our maths researchers figured out a way for it long back.

Optimization using Gradient Descent

Gradient descent is one of the most important algorithms in the machine learning world. Gradient descent will help us find the values of “w” and “b” at which our cost function becomes minimum, and we’ll see this with the help of a graph.

Graph of cost function v/s our parameters w and b(Fig 4)

A little recap: we earlier said that we’ll not use mean squared error for logistic regression, and the reason is that it does not give us a bowl-shaped, i.e. convex, graph like the one above. The advantage of this shape is that there is only one global minimum (the red dot at the bottom) and no local minima, so no matter what our initial values of w and b are, gradient descent will converge to the global minimum.

The equation for gradient descent is given as

Gradient descent equation for parameter Qj(Fig 5)

Let’s understand this equation:

  1. Qj represents any parameter in our hypothesis. In our case, we have “w” and “b”.
  2. α represents the learning rate, i.e. how fast or slow you want to update the parameters.
  3. ∂J(Q)/∂Qj is the derivative that tells us how the cost function changes when we change our parameter by a tiny bit (it is also the slope of the tangent at that point).

Now let’s see why gradient descent works.

Let’s understand the above graph. Suppose we calculate the gradient and it comes out to be positive. In our gradient descent equation (refer to figure 5), if the gradient is positive, then the value of Q (here w) will decrease, i.e. move to a less positive value, and vice versa for a negative gradient. Check this out with pen and paper and try to work it out.
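One gradient-descent step can be sketched in code. A well-known property of sigmoid plus log loss (not derived in this article, so take it as a stated fact) is that the gradients simplify to dJ/dw = mean((p - y) * x) and dJ/db = mean(p - y); the helper name `gradient_step` and the toy data are our own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(w, b, x, y, alpha=0.1):
    """One gradient-descent update for w and b (sigmoid + log loss)."""
    p = sigmoid(w * x + b)        # predicted probabilities
    dw = np.mean((p - y) * x)     # dJ/dw
    db = np.mean(p - y)           # dJ/db
    # Move *against* the gradient: a positive gradient decreases the
    # parameter, a negative gradient increases it.
    return w - alpha * dw, b - alpha * db

# Toy data: positive x tends to mean class 1.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = 0.0, 0.0
w, b = gradient_step(w, b, x, y)  # w moves in the direction that reduces the loss
```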

Finally, We have covered all the ingredients. Lets prepare the recipe now.

  1. Firstly we’ll see if the problem is binary classification. If yes, then we know we can use Logistic Regression.
  2. We’ll decide a hypothesis i.e. z = w*x+b
  3. Now we’ll pass z through the sigmoid activation function to get an output in the range (0, 1).
  4. We now define our cost function, which in our case is log loss.
  5. Finally, we’ll use our gradient descent algorithm and keep iterating until we get values of w and b for which the loss is minimum.
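The whole recipe above can be sketched end to end in a few lines of Python (a minimal sketch under our own assumptions: the function name, learning rate, iteration count, and toy data are illustrative, not from the article):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(x, y, alpha=0.5, iters=1000):
    """Fit w and b by repeating the gradient-descent update."""
    w, b = 0.0, 0.0
    for _ in range(iters):
        p = sigmoid(w * x + b)     # steps 2 & 3: hypothesis + sigmoid
        dw = np.mean((p - y) * x)  # gradient of log loss w.r.t. w
        db = np.mean(p - y)        # gradient of log loss w.r.t. b
        w -= alpha * dw            # step 5: gradient descent
        b -= alpha * db
    return w, b

# Step 1: a toy binary-classification dataset with a single feature.
x = np.array([-2.0, -1.5, -1.0, 1.0, 1.5, 2.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
w, b = train_logistic_regression(x, y)
# Predict class 1 whenever the probability crosses 0.5.
preds = (sigmoid(w * x + b) >= 0.5).astype(int)
```

On this separable toy data the learned w ends up positive, and thresholding the probabilities at 0.5 recovers the labels.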

And that’s it, we are ready with our Logistic Regression recipe. There is a lot of math going on behind the scenes that is out of scope for this article. If you want to learn more about it, you can check how computation graphs are used in gradient descent or how calculus is used to compute the gradients in our algorithm.
