XGBoost vs Random Forest
This post explains the key differences between the XGBoost and Random Forest models and outlines which model is best suited to different use cases.
1. What is Random Forest
2. What is XGBoost
3. What are the similarities
4. What are the differences
5. When should you use them
6. Which is faster
7. Which is better
What is Random Forest?
Random Forest is a machine learning algorithm that can be used for both classification and regression problems.
It is an ensemble method that combines multiple independently trained decision trees, then aggregates their outputs, by majority vote for classification or by averaging for regression, to produce a final prediction.
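Here's a minimal sketch of what that looks like with scikit-learn (the synthetic dataset and parameter values below are purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative synthetic dataset
X, y = make_classification(n_samples=1_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 100 trees are trained independently; predictions are combined by majority vote
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out set
```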
What is XGBoost?
XGBoost is a machine learning algorithm that can also be used for both classification and regression problems.
It is also an ensemble model, but it differs by training many decision trees sequentially. Each tree is shallow and is fitted to correct the errors of the trees before it, producing many weak learners that, when combined, create a highly performant model.
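As a sketch, the equivalent model with the xgboost package looks like this (assuming xgboost is installed; the shallow max_depth and the learning_rate values are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each shallow tree (max_depth=3) is fitted to the errors of the previous
# trees, with its contribution scaled down by the learning rate
model = XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```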
What are the similarities between Random Forest and XGBoost?
- They are ensemble models that use decision trees as their foundation
- They can be used for classification and regression problems
- They both perform well with large numbers of features
What are the differences between Random Forest and XGBoost?
- The decision trees in XGBoost are trained sequentially with adjustments made from the prior tree's error, whilst in Random Forest they are created in parallel and independently
- Random Forest counters overfitting by averaging many deep, independently grown trees, whilst XGBoost counters it by keeping each tree shallow and using built-in regularisation
- XGBoost is a more complex model, with many more hyperparameters that can be tuned to improve performance
- Random Forest is more interpretable as it produces a set of decision trees which can be visualised
- XGBoost often works better with imbalanced datasets (see the sketch after this list)
- Random Forest is implemented in scikit-learn, whilst XGBoost is its own Python package
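On the imbalance point, XGBoost exposes a scale_pos_weight parameter that up-weights the minority class. A small sketch, where the 90/10 class split and the negative-to-positive ratio heuristic are illustrative:

```python
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# Illustrative imbalanced dataset: roughly 90% negative, 10% positive
X, y = make_classification(n_samples=1_000, weights=[0.9], random_state=42)

# A common heuristic: scale_pos_weight = n_negative / n_positive
ratio = (y == 0).sum() / (y == 1).sum()
model = XGBClassifier(scale_pos_weight=ratio)
model.fit(X, y)
```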
When should I use XGBoost vs Random Forest?
This depends on your dataset and the outcome you want. But, generally, it comes down to the following:
Use XGBoost when
- The dataset is large
- The data is imbalanced
- You can spend time optimising performance through parameter tuning
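As a sketch of what that tuning step might look like using scikit-learn's GridSearchCV (the parameter grid below is illustrative, not a recommendation):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1_000, random_state=42)

# Illustrative grid over a few of XGBoost's many tunable parameters
param_grid = {
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [100, 300],
}
search = GridSearchCV(XGBClassifier(), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)  # best combination found by cross-validation
```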
Use Random Forest when
- You want to interpret your model's results
- You want to create a simple and quick baseline model, as it's easily implemented in scikit-learn with just a few lines of code
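To illustrate that "few lines of code" claim, a cross-validated Random Forest baseline can be as short as this (synthetic data used for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1_000, random_state=42)

# Default-parameter Random Forest, scored with 5-fold cross-validation
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5)
print(scores.mean())
```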
Is Random Forest faster than XGBoost?
Both XGBoost and Random Forest are fast to train: Random Forest's trees are independent, so they can all be built in parallel, whilst XGBoost builds its trees sequentially but parallelises the split-finding within each tree across cores.
In practice, whether one is faster depends more on the parameters you choose, such as the number of trees and their depth, than on the underlying model architecture.
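Both libraries let you set the number of worker threads directly; a small sketch (the thread counts are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1_000, random_state=42)

# Random Forest: each tree is independent, so all 500 can be built in
# parallel (n_jobs=-1 means use every available core)
rf = RandomForestClassifier(n_estimators=500, n_jobs=-1).fit(X, y)

# XGBoost: trees are built one after another, but split-finding within
# each tree is spread across the given number of threads
xgb = XGBClassifier(n_estimators=500, n_jobs=4).fit(X, y)
```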
XGBoost or Random Forest, which is better?
XGBoost is widely used by data scientists for good reason: it consistently performs well on a wide variety of datasets. If you don't have any specific technical requirements, I would suggest using XGBoost, as in most situations it will provide higher accuracy and less overfitting than Random Forest.
Related articles
Choosing your model
LightGBM vs XGBoost, which is better?
XGBoost vs Catboost
Machine learning basics
What is a baseline machine learning model?
Which metrics are best for imbalanced data?
What is imbalanced data?