F1 score vs AUC, which is the best classification metric?

F1 score and AUC are machine learning metrics for classification models, but which should you use for your project? In this post I will explain what they are, their differences, and help you decide which is the better choice for you.

Is F1 the same as AUC?

F1 and AUC are often discussed in similar contexts and have the same end goal, but they are not the same and have very different approaches to measuring model performance.

What is F1 score?

F1 score (also known as F-measure, or balanced F-score) is a performance metric whose score ranges from 0 to 1, where 0 is the worst and 1 is the best possible score.

It is a popular metric for classification models because it holds up well on imbalanced datasets and accounts for both the precision and the recall of the model. It can do this because it is defined as the harmonic mean of the two, which is as follows:

F1 = 2 * (precision * recall) / (precision + recall), where precision = TP / (TP + FP) and recall = TP / (TP + FN)
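To make the definition concrete, here is a minimal sketch that computes precision, recall, and F1 by hand and compares the result with scikit-learn's `f1_score`. The labels are made up for illustration, and the variable names (`tp`, `fp`, `fn`, `f1_manual`) are my own.

```python
from sklearn.metrics import f1_score

# Illustrative labels only, not taken from a real dataset
y_true = [0, 1, 0, 0, 1, 1]
y_pred = [0, 0, 1, 0, 0, 1]

# Count the confusion-matrix cells for the positive class (label 1)
pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)

# The library call should agree with the manual calculation
f1_sklearn = f1_score(y_true, y_pred)

print(f"precision={precision:.3f} recall={recall:.3f}")
print(f"manual F1={f1_manual:.3f} sklearn F1={f1_sklearn:.3f}")
```

Note that true negatives never appear in the calculation, which is worth keeping in mind when comparing F1 with other metrics.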

What is AUC?

AUC, or ROC AUC, stands for Area Under the Receiver Operating Characteristic Curve. The score it produces ranges from 0 to 1, where 1 is the best score, 0.5 means the model is no better than random guessing, and values below 0.5 mean it ranks predictions worse than random.

The metric is calculated as the area underneath the Receiver Operating Characteristic Curve (ROC). The ROC is a graph which maps the relationship between the true positive rate (TPR) of the model and the false positive rate (FPR). Each point on the curve corresponds to one classification threshold, showing the TPR we can expect for a given FPR.

The area under this ROC curve, AUC, therefore reflects the model's ability to separate the classes: the larger the area, the more often the model can achieve a high true positive rate while keeping the false positive rate low.

To make this concrete, let's look at a diagram of the metric:

[Figure: ROC AUC score illustration]
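The same idea can be sketched in code. The example below uses made-up labels and scores, builds the ROC curve with scikit-learn's `roc_curve`, and integrates it with `auc` (trapezoidal rule). It then checks the result against `roc_auc_score`, which does both steps in one call.

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Illustrative labels and positive-class scores, not from a real model
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_score = [0.1, 0.2, 0.35, 0.4, 0.6, 0.65, 0.8, 0.9]

# Each (fpr, tpr) pair is the result of applying one threshold to y_score
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the piecewise-linear ROC curve
area = auc(fpr, tpr)

# One-step equivalent
direct = roc_auc_score(y_true, y_score)

print(f"AUC from curve: {area:.4f}, AUC direct: {direct:.4f}")
```

Another way to read the number: it is the fraction of (positive, negative) pairs in which the positive example receives the higher score, which here is 15 out of 16 pairs.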

What is the difference between F1 and AUC?

The key differences between F1 and AUC are how they handle imbalanced datasets, the input they take, and their approach to calculating the resulting metrics.

Difference between F1 and AUC metric definitions

The ways these two metrics are calculated are very different. AUC is the area under the ROC curve, which plots the True Positive Rate against the False Positive Rate across all classification thresholds, whilst F1 is a straightforward calculation involving the recall and precision of the model at a single threshold.

In this regard they are extremely different and tackle the problem of assessing performance from very different angles.

Difference between prediction inputs for F1 and AUC

Another difference between F1 and AUC is the inputs they take. F1 requires predicted classes whilst AUC needs predicted probabilities (or another continuous score) for the positive class.

Because of this you will need to define the decision threshold (the probability boundary between classes) before using F1, unlike with AUC, and you will find that your F1 score changes depending on where it is set.
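Here is a small sketch of that effect, using made-up scores. The thresholds 0.3, 0.5, and 0.7 are arbitrary choices for illustration. AUC is computed once from the scores, while F1 has to be recomputed for every threshold.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Illustrative labels and positive-class scores, not from a real model
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.35, 0.4, 0.6, 0.65, 0.8, 0.9])

# AUC uses the scores directly, so no threshold is involved
auc_value = roc_auc_score(y_true, y_score)

# F1 needs hard class predictions, so we must pick a threshold first
f1_by_threshold = {}
for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_score >= threshold).astype(int)
    f1_by_threshold[threshold] = f1_score(y_true, y_pred)

print(f"AUC (threshold-free): {auc_value:.4f}")
for threshold, value in f1_by_threshold.items():
    print(f"F1 at threshold {threshold}: {value:.4f}")
```

With this data the same model scores an F1 of roughly 0.80, 0.75, or 0.67 depending only on where the boundary is placed.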

F1 score vs AUC on imbalanced datasets

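The last but possibly most important difference to be aware of is their behaviour on imbalanced datasets. AUC can give misleadingly optimistic results when the positive class is rare, because the false positive rate is measured against the large negative class and so barely moves even when there are many false positives. F1 ignores true negatives and looks only at precision and recall for the positive class, so it still reflects performance on the minority class when the class balance is skewed.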
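To see how this can play out, here is a deliberately constructed example rather than real data: 1000 negatives, 10 positives, and a hypothetical model that gives 50 of the negatives a high score. It is a single hand-built case, so treat it as an illustration of the mechanism and not as proof of a general rule.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Constructed, heavily imbalanced data: 1000 negatives and 10 positives
y_true = np.array([0] * 1000 + [1] * 10)

# Hypothetical scores: most negatives score low, but 50 of them score
# higher than every positive example
y_score = np.array(
    [0.1] * 950 + [0.8] * 50  # negatives
    + [0.7] * 8 + [0.2] * 2   # positives
)

auc_value = roc_auc_score(y_true, y_score)

# Hard predictions at a 0.5 threshold (an arbitrary but common choice)
y_pred = (y_score >= 0.5).astype(int)
f1_value = f1_score(y_true, y_pred)

print(f"AUC: {auc_value:.3f}")
print(f"F1:  {f1_value:.3f}")
```

The 50 false positives are only 5% of the negative class, so AUC stays at 0.95. They outnumber the 8 true positives by more than six to one, though, which drags precision down and leaves F1 at roughly 0.24.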

When should I use F1 or AUC?

Now that we have looked at their key differences, how does this impact when you should use one or the other?

F1 should be used for situations when you either have an imbalanced dataset or you need to communicate your results to end users, due to the relatively simple definition of F1 in comparison with AUC. AUC should be used when you have a balanced dataset or you don’t want to set a probability boundary between classes, which is required for F1.

How to implement F1 score and AUC in Python

These metrics are easy to implement in Python using the scikit-learn package. Let’s look at a simple example of the two in action:

from sklearn.metrics import f1_score, roc_auc_score

y_true = [0, 1, 0, 0, 1, 1]
y_pred = [0, 0, 1, 0, 0, 1]

# Uses the predicted class as input
f1 = f1_score(y_true, y_pred)

y_true = [0, 1, 0, 0, 1, 1]
y_score = [0.4, 0.2, 0.8, 0.3, 0.1, 0.9]

# Uses the predicted probability for the positive class as input
auc = roc_auc_score(y_true, y_score)
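In practice the predictions come from a model, not from hand-written lists. The sketch below assumes a scikit-learn classifier that exposes `predict` and `predict_proba`. The synthetic dataset, the choice of `LogisticRegression`, and the variable names are my own for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, for illustration only
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# F1 takes the predicted classes (predict applies a 0.5 threshold here)
y_pred = model.predict(X_test)
model_f1 = f1_score(y_test, y_pred)

# AUC takes the predicted probability of the positive class (column 1)
y_score = model.predict_proba(X_test)[:, 1]
model_auc = roc_auc_score(y_test, y_score)

print(f"F1: {model_f1:.3f}, AUC: {model_auc:.3f}")
```

A common slip is passing the output of `predict` to `roc_auc_score`. That runs without error, but it collapses the ROC curve to a single threshold, so the score no longer reflects the model's ranking ability.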

Is AUC better than F1 score?

The metric which is best depends on your use case and the dataset, but if I had to recommend just one of F1 and AUC then I would suggest F1 score. It is the go-to metric for classification models, and will provide reliable scores for a wide array of projects due to its performance on imbalanced datasets and its simpler interpretability.

Stephen Allwright

I'm a Data Scientist currently working for Oda, an online grocery retailer, in Oslo, Norway. These posts are my way of sharing some of the tips and tricks I've picked up along the way.