Evaluating Our Machine Learning Model's Performance

Model evaluation is one of the most important steps in developing a machine learning pipeline. Just imagine designing a model and deploying it straight to production. Now suppose your model is being used in the medical domain; if it performs poorly, it could cost lives. But don’t worry, this article will give you the tools you need to evaluate and improve your model’s performance.

Classification Problems

As we know, classification problems are those in which the output is a discrete value, e.g. spam detection, cancer detection, etc.

For the scope of this article, we’ll mainly focus on binary classification in which the output is either 0 or 1.

Let us take a machine learning model that predicts whether a person has a benign (positive class) or malignant (negative class) tumour. We will apply all our evaluation metrics to this model.

Evaluation Metrics

We’ll divide the evaluation metrics into two categories based on when to use them.

  • Evaluation metrics when the dataset is balanced.
  • Evaluation metrics when the dataset is imbalanced.

Imbalanced Dataset

Any dataset with an unequal class distribution is technically imbalanced. However, a dataset is said to be imbalanced when there is a significant, or in some cases extreme, disproportion among the number of examples of each class of the problem.

Suppose we have a dataset with 900 examples of benign tumours and 100 examples of malignant tumours.

Accuracy: The accuracy of a model is calculated by the following formula:

Accuracy = (Number of correct predictions) / (Total number of predictions)

Now suppose our model predicts every example as a benign tumour. By the formula above, its accuracy is 90% (think about it: 900 out of the 1,000 predictions are correct). Yet the model completely fails to detect malignant tumours. Now imagine how big a disaster that would be if the model were used in a real-world scenario.
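To see this failure mode concretely, here is a minimal sketch of that calculation on the 900/100 toy dataset above (scikit-learn is my assumption here; the article doesn’t prescribe a library):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels matching the article's dataset:
# 1 = benign (positive class), 0 = malignant (negative class)
y_true = np.array([1] * 900 + [0] * 100)

# A "model" that blindly predicts benign for every example
y_pred = np.ones_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.9 -> 90% accuracy, yet every malignant tumour is missed
```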

Learning: Accuracy should only be used as an evaluation metric when the dataset is balanced.

Evaluation Metrics for an Imbalanced Dataset

  • Confusion matrix: As the name suggests, a confusion matrix is a matrix that tells us where our model is getting confused between the different classes. For binary classification it has the following layout:

                      Predicted Positive    Predicted Negative
    Actual Positive           TP                    FN
    Actual Negative           FP                    TN

Definitions related to confusion matrix

  • TP: True Positives are those examples that were actually positive and were predicted as positive. E.g. the actual output was a benign tumour (positive class) and the model also predicted a benign tumour.
  • FP: False Positives, also known as Type 1 errors, are those examples that were actually negative but our model predicted them as positive. E.g. the actual output was a malignant tumour (negative class) but our model predicted a benign tumour.
  • FN: False Negatives, also known as Type 2 errors, are those examples that were actually positive but our model predicted them as negative. E.g. the actual output was a benign tumour (positive class) but our model predicted a malignant tumour.
  • TN: True Negatives are those examples that were actually negative and our model predicted them as negative. E.g. the actual output was a malignant tumour (negative class) and our model also predicted a malignant tumour.

Whenever we train our model, we should try to reduce false positives and false negatives so that our model makes as many correct predictions as possible.

By looking at the confusion matrix, we can get an idea of how our model performs on each class, unlike accuracy, which only gives an overall estimate of the model.
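As a quick sketch of reading these four numbers off a confusion matrix (again assuming scikit-learn and the same toy labels as in the accuracy example):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Same toy labels: 1 = benign (positive class), 0 = malignant (negative class)
y_true = np.array([1] * 900 + [0] * 100)
y_pred = np.ones_like(y_true)  # the all-benign "model"

# labels=[1, 0] puts the positive class first, so the matrix reads:
# [[TP, FN],
#  [FP, TN]]
(tp, fn), (fp, tn) = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")  # TP=900, FP=100, FN=0, TN=0
```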

Precision and Recall

These are two very important metrics for evaluating the performance of a model. Neither is better than the other; it depends on the use case and the business requirement. Let’s first look at their definitions and then develop an intuition for which one to use when.

Precision: The precision of a model is given by the following formula:

Precision = TP / (TP + FP)

Precision can be defined as “out of all the values predicted as positive, how many are actually positive.”

Recall: The recall of a model is given by the following formula:

Recall = TP / (TP + FN)

Recall can be defined as “out of all the actual positive values, how many were correctly predicted as positive.”
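A minimal sketch of both formulas in code (scikit-learn assumed, same toy labels as before):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Same toy labels: 1 = benign (positive class), 0 = malignant (negative class)
y_true = np.array([1] * 900 + [0] * 100)
y_pred = np.ones_like(y_true)  # the all-benign "model"

# Precision = TP / (TP + FP) = 900 / (900 + 100) = 0.9
# Recall    = TP / (TP + FN) = 900 / (900 + 0)   = 1.0
print(precision_score(y_true, y_pred, pos_label=1))  # 0.9
print(recall_score(y_true, y_pred, pos_label=1))     # 1.0
```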

A Little Confused?

It took me some time to wrap my head around these definitions. But don’t worry, we’ll understand them with an example.

  • Suppose you have a spam-detection model where spam is the positive class and not-spam is the negative class. Now think for a moment: which metric (precision or recall) would you use to evaluate this model?

Thinking…………………..

Okay, so let’s understand: in this problem, we don’t want any important mail to be lost because it was wrongly predicted as spam. So we want to reduce false positives (non-spam mails that were predicted as spam), which in turn will increase our precision. So, in this case, we’ll focus more on precision than on recall.

  • Now, let’s take another example in which we want to predict whether a person has cancer. In this case, having cancer is the positive class and not having cancer is the negative class. Now take a break and think about which evaluation metric you’d use for this model.

Thinking…………………..

Let’s see: in this example, we can’t afford to classify a person who has cancer as not having cancer (it’s a matter of life and death). So we’ll try to reduce false negatives (predicting the negative class for actual positives), which in turn will increase our recall. In this case, we’ll focus more on recall than on precision.
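To make this trade-off tangible, here is a hypothetical sketch (synthetic data and scikit-learn are my assumptions, not part of the article) showing how lowering the decision threshold raises recall at the cost of precision:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data: 1 = has cancer (positive class), 0 = healthy
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]  # predicted probability of cancer

for threshold in (0.5, 0.2):
    preds = (probs >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"precision={precision_score(y_test, preds):.2f}, "
          f"recall={recall_score(y_test, preds):.2f}")
# A lower threshold catches more real cancer cases (higher recall)
# but flags more healthy people as sick (lower precision).
```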

Task for you

Think of more examples and see for yourself which metric you would apply to each.

F1-Score

Although precision and recall are good metrics, they don’t give us an easy way to compare two models. If one model has good recall and the other has good precision, it becomes confusing which one to use for our task (unless we are completely sure that we only need to focus on one metric).

F1-Score to the Rescue

The F1-score takes the harmonic mean of precision and recall and gives a single value to evaluate our model. It is given by the following formula:

F1 = 2 * (Precision * Recall) / (Precision + Recall)

We use the harmonic mean instead of a simple average because it punishes extreme values. A classifier with a precision of 1.0 and a recall of 0.0 has a simple average of 0.5 but an F1 score of 0.
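A tiny sketch of the formula in plain Python (no library needed) that reproduces the example above:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (defined as 0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print((1.0 + 0.0) / 2)   # 0.5 -> the simple average hides the useless recall
print(f1(1.0, 0.0))      # 0.0 -> the harmonic mean punishes the extreme value
print(f1(0.9, 0.8))      # ~0.85 -> balanced precision and recall score well
```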

Now you can use the F1 score to compare different models and select the one which works best for you.

Homework Time

  • Study about ROC and AUC evaluation metrics.
  • Can we use a confusion matrix for multiclass classification? If yes, how would we calculate precision and recall in that case?
  • Play with different models like decision trees and logistic regression, and evaluate them using different metrics.

Happy Learning!!!
