What are baseline models and metrics in machine learning?
When creating a machine learning model, you need to track error metrics to understand how accurate it is. However, one of the biggest mistakes data scientists make is throwing everything at the first attempt, which makes it difficult to know what made the model perform well, or even whether the model is worth the time and effort invested.
So, what is a baseline model in machine learning?
A baseline model is your first simple attempt at modelling which will provide you with a baseline metric that you will use as a reference point throughout development. This baseline model is often a heuristic (rule based) model, but could equally be a simple machine learning model.
What are the benefits of creating a baseline model?
The benefits of creating a baseline model at the start of your work are two-fold:
- Understanding if the benefit is worth the cost
- Assigning performance improvements
Understanding the benefit vs cost tradeoff is the main benefit of creating a baseline model at the beginning of your project. Machine learning models are expensive, both in the time it takes to develop and maintain them and in the cost of the tooling required to run them. So if the baseline model is only, for example, 5% less accurate than your fully fledged XGBoost model, is it worth the cost? Without the baseline for guidance, the XGBoost model would look impressive on its own; with the baseline, you have the context needed to judge whether it really is.
The second key benefit, assigning performance improvements, is part of a wider effort to iterate over your model, where the baseline is the starting point. Knowing which feature engineering change or parameter tweak led to which performance improvement is important for improving your understanding and knowing where to focus your efforts.
What models work well as baseline models?
So we've seen that baseline models are an important part of your workflow, but what actually is a baseline model? Is it a machine learning model, a statistic, or a set of rules? Well, the answer is that it could be any of the above. However, the most common approach is to try and find a rule based approach as this is more often than not the most basic type of model that can be created.
How do I create a rule based baseline model?
A rule based approach will change depending on whether it is a classification or a regression problem, with regression problems often favoring a statistic and classification problems favoring a manually created decision tree. I will show examples of both these approaches.
Creating a rule based baseline model for regression
Say we are trying to predict 12 month revenue for customers on our eCommerce site. A baseline model could consist of calculating the sum of revenue for the previous 12 months for each customer and using this as your prediction.
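As a rough illustration of this regression baseline, the sketch below sums each customer's revenue over the previous 12 months and uses that total as the forecast. The order records, customer names, date window and "actual" values are all made up for the example, and mean absolute error is just one possible choice of baseline metric.

```python
from collections import defaultdict
from datetime import date

# Hypothetical order history: (customer_id, order_date, revenue).
orders = [
    ("alice", date(2023, 2, 10), 120.0),
    ("alice", date(2023, 11, 5), 80.0),
    ("bob", date(2022, 6, 1), 300.0),  # outside the 12 month window, ignored
    ("bob", date(2023, 9, 20), 50.0),
]


def baseline_revenue_forecast(orders, window_start, window_end):
    """Predict next 12 month revenue as the sum over the previous 12 months."""
    totals = defaultdict(float)
    for customer_id, order_date, revenue in orders:
        if window_start <= order_date < window_end:
            totals[customer_id] += revenue
    return dict(totals)


# Previous 12 months: 1 Jan 2023 (inclusive) to 1 Jan 2024 (exclusive).
predictions = baseline_revenue_forecast(orders, date(2023, 1, 1), date(2024, 1, 1))

# Hypothetical revenue actually observed over the following 12 months.
actuals = {"alice": 180.0, "bob": 90.0}

# Baseline metric: mean absolute error. Customers with no orders in the
# window get a prediction of 0.
mae = sum(abs(predictions.get(c, 0.0) - actuals[c]) for c in actuals) / len(actuals)
print(predictions, mae)
```

The resulting error is the reference point that any later model would need to beat.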
Creating a rule based baseline model for classification
Imagine we are trying to predict whether a customer will churn from our subscription platform. A rule based baseline model for this project could be: customers who have not been on our platform in the past 30 days and have been a customer for less than 1 year are predicted to churn, and those who don't meet these criteria are predicted not to churn.
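The churn rule can be written as a one-line function. This is only a sketch: the field names, the reading of "1 year" as 365 days, the treatment of exactly 30 days as inactive, and the small labelled sample are all assumptions made for illustration.

```python
def predict_churn(days_since_last_visit, tenure_days):
    """Rule from the text: inactive for 30 days or more AND a customer for under 1 year."""
    return days_since_last_visit >= 30 and tenure_days < 365


# Hypothetical labelled customers:
# (days_since_last_visit, tenure_days, actually_churned)
customers = [
    (45, 200, True),
    (10, 100, False),
    (60, 800, False),
    (35, 300, False),
    (5, 900, False),
]

predicted = [predict_churn(days, tenure) for days, tenure, _ in customers]
actual = [churned for _, _, churned in customers]

# Baseline metric: accuracy, the share of customers the rule classifies correctly.
accuracy = sum(p == a for p, a in zip(predicted, actual)) / len(customers)
print(accuracy)
```

Accuracy is used here only because it is the simplest metric to show; whichever metric you choose for the baseline should be the same one you later report for the machine learning model.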
How do I create a machine learning baseline model?
If the rule based approach does not work for your project, then your next choice is to use a machine learning model for your baseline instead. As we outlined earlier, the goal of this baseline model is to provide the simplest reference point to begin your development with, so this model should be simple both in terms of the features used and the model type.
If you're solving a regression problem then I would recommend using Linear Regression, and if you're looking at classification then Logistic Regression would be a good baseline model to use.
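To make "simple" concrete, here is a minimal sketch of a one-feature linear regression fitted with the closed-form least squares solution, using only the Python standard library. The feature (previous 12 month revenue), the target (next 12 month revenue) and the numbers are hypothetical. In a real project you would more likely reach for a library implementation such as scikit-learn's `LinearRegression` or `LogisticRegression`.

```python
from statistics import mean


def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares for one feature: y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical training data, in thousands:
# previous 12 month revenue -> next 12 month revenue.
previous_revenue = [1, 2, 3, 4]
next_revenue = [3, 5, 7, 9]

slope, intercept = fit_simple_linear_regression(previous_revenue, next_revenue)

# Predict for a new customer with 5 (thousand) in previous 12 month revenue.
prediction = slope * 5 + intercept
print(slope, intercept, prediction)
```

Keeping the baseline to a single, obvious feature makes it easier to see what each later feature or model change adds.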
What should I do if my baseline model is better than my final machine learning model?
It is extremely unlikely that a baseline model would outperform your final production ready model. If this is the case, then there is likely an error within the dataset you are using for your machine learning model. In this situation I would look to see if you're either:
- Leaking data into your baseline model predictions
- Preparing your features incorrectly
How much better does my machine learning model need to be than my baseline model?
Assuming that your machine learning model is indeed outperforming your baseline model, this raises the question: "How much better should it be?" As I mentioned earlier, the main reason for creating a baseline model is to understand whether the machine learning model, and all the cost that comes with it, is worthwhile. Of course it completely depends on the context and the problem you are tackling, but I have a general rule of thumb that I use when looking at the final model's performance metric improvement over the baseline:
- Below 5% - Stick with the baseline model
- 5% to 10% - OK, but depends on the use case
- Over 10% - Good, stick with the machine learning model
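The rule of thumb above can be sketched as a small helper. Two assumptions are baked in that the text does not specify: the metric is one where higher is better (such as accuracy), and the improvement is measured relative to the baseline score, not in absolute percentage points. The function names and example scores are hypothetical.

```python
def relative_improvement(baseline_score, model_score):
    """Percentage improvement of the model over the baseline (higher score = better)."""
    return (model_score - baseline_score) / baseline_score * 100


def recommend(improvement_pct):
    """Map an improvement percentage onto the rule of thumb bands."""
    if improvement_pct < 5:
        return "stick with the baseline model"
    if improvement_pct <= 10:
        return "OK, but depends on the use case"
    return "good, stick with the machine learning model"


# Hypothetical accuracy scores for the baseline and the final model.
improvement = relative_improvement(baseline_score=0.80, model_score=0.92)
print(round(improvement, 1), recommend(improvement))
```

For error metrics where lower is better, such as mean absolute error, the sign of the calculation would need to be flipped.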