XGBoost vs Random Forest

This post explains the key differences between the XGBoost and Random Forest models and outlines which is the better choice for different use cases.

Stephen Allwright

What is Random Forest?

Random Forest is a machine learning algorithm that can be used for both classification and regression problems.

It is an ensemble method that combines many independently trained decision trees, taking a majority vote across their outputs (or, for regression, the average) to produce a final prediction.
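
As a quick illustration, here is a minimal Random Forest sketch using scikit-learn; the synthetic dataset and parameter values are illustrative assumptions, not recommendations:

    # Minimal Random Forest sketch on synthetic data
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Each of the 100 trees is trained independently on a bootstrap sample;
    # the final prediction is a majority vote across the trees
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))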

What is XGBoost?

XGBoost is a machine learning algorithm that can also be used for both classification and regression problems.

It is also an ensemble model, but it differs by training its decision trees sequentially. Each tree is kept shallow and fitted to the errors of the trees before it, producing many weak learners which, when combined, create a highly performant model.
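
For comparison, here is the same task using the xgboost Python package's scikit-learn-style API; again, the dataset and parameter values are illustrative assumptions:

    # Minimal XGBoost sketch on synthetic data
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Shallow trees (max_depth=3) are added one at a time, each fitted to
    # the errors of the ensemble built so far
    model = XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))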

What are the similarities between Random Forest and XGBoost?

  1. They are ensemble models that use decision trees as their foundation
  2. They can be used for classification and regression problems
  3. They both perform well with large numbers of features

What are the differences between Random Forest and XGBoost?

  1. The decision trees in XGBoost are trained sequentially with adjustments made from the prior tree's error, whilst in Random Forest they are created in parallel and independently
  2. Random Forest grows deep, fully grown trees, which can overfit noisy data, whilst XGBoost counters overfitting by keeping each tree shallow and applying regularisation
  3. XGBoost is a more complex model, with many more hyperparameters that can be optimised through parameter tuning
  4. Random Forest is more interpretable, as any of its individual decision trees can be visualised (see the sketch after this list)
  5. XGBoost often works better with imbalanced datasets
  6. Random Forest is implemented in scikit-learn, whilst XGBoost is distributed as its own Python package
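
To illustrate point 4, scikit-learn can plot any individual tree from a fitted forest. This sketch assumes the fitted model from the Random Forest example above:

    import matplotlib.pyplot as plt
    from sklearn.tree import plot_tree

    # Visualise the first tree in the fitted forest, truncated to two levels
    plt.figure(figsize=(12, 6))
    plot_tree(model.estimators_[0], max_depth=2, filled=True)
    plt.show()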

When should I use XGBoost vs Random Forest?

This depends on your dataset and the outcome you want, but generally it comes down to the following:

Use XGBoost when

  • The dataset is large
  • The data is imbalanced
  • You can spend time optimising performance through parameter tuning (see the tuning sketch below)
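
As a sketch of what that tuning might look like, here is one way to search over XGBoost's hyperparameters with scikit-learn's RandomizedSearchCV. The search space is an illustrative assumption, scale_pos_weight is included as one common lever for imbalanced data, and X_train and y_train are assumed from the earlier sketch:

    from scipy.stats import randint, uniform
    from sklearn.model_selection import RandomizedSearchCV
    from xgboost import XGBClassifier

    # Illustrative search space; sensible ranges depend on your data
    param_distributions = {
        "n_estimators": randint(100, 500),
        "max_depth": randint(2, 8),
        "learning_rate": uniform(0.01, 0.3),
        "subsample": uniform(0.5, 0.5),
        "scale_pos_weight": [1, 5, 10],  # up-weights the positive class on imbalanced data
    }

    search = RandomizedSearchCV(
        XGBClassifier(), param_distributions, n_iter=20, cv=3, scoring="roc_auc"
    )
    search.fit(X_train, y_train)
    print(search.best_params_)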

Use Random Forest when

  • You want to interpret your model's results
  • You want to create a simple and quick baseline model, as it's easily implemented in scikit-learn with just a few lines of code (see the sketch below)
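
As an illustration of how few lines that baseline takes, here is a cross-validated sketch, with synthetic data standing in for your own:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

    # A default Random Forest as a quick cross-validated baseline
    scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5)
    print(scores.mean())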

Is Random Forest faster than XGBoost?

Both XGBoost and Random Forest can train in parallel on multiple cores and are therefore fast to train.

Whether one is faster than the other depends on the parameter values you have chosen, such as the number and depth of trees, rather than on the underlying model architecture.
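
If you want to check this on your own data, both libraries accept an n_jobs parameter for multi-core training, so a rough timing comparison takes only a few lines. This is a sketch on synthetic data; real timings depend entirely on your dataset and parameter choices:

    import time
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=10000, n_features=20, random_state=42)

    # n_jobs=-1 uses all available cores in both libraries
    for name, model in [
        ("Random Forest", RandomForestClassifier(n_estimators=200, n_jobs=-1)),
        ("XGBoost", XGBClassifier(n_estimators=200, n_jobs=-1)),
    ]:
        start = time.perf_counter()
        model.fit(X, y)
        print(name, round(time.perf_counter() - start, 2), "seconds")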

XGBoost or Random Forest, which is better?

XGBoost is widely used by data scientists for good reason: it consistently performs well on a wide variety of datasets. If you don't have any specific technical requirements, I would therefore suggest using XGBoost, as in most situations it will provide higher accuracy and less overfitting than Random Forest.




I'm a Data Scientist currently working for Oda, an online grocery retailer, in Oslo, Norway. These posts are my way of sharing some of the tips and tricks I've picked up along the way.