What is gradient boosting?
Both CatBoost and LightGBM are gradient boosting models, so let's have a quick recap of what this means.
Gradient boosting is a machine learning technique where many weak learners, typically shallow decision trees, are trained iteratively and combined to create a highly performant model. The trees are trained sequentially: each new tree is fitted to the errors (more precisely, the negative gradient of the loss function) left by the trees before it, and its scaled output is added to the ensemble, so the loss is gradually minimised.
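To make the sequential idea concrete, here is a minimal sketch in plain Python. It assumes squared-error loss, a single numeric feature and one-split trees (stumps) as the weak learners. The function names and toy data are made up for illustration, and real libraries add many refinements on top of this.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree (a stump) to the residuals."""
    best = None
    # Try every distinct value except the largest as a split threshold,
    # so that both sides of the split are never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Boost stumps on squared-error loss; return a predictor and the MSE history."""
    base = sum(ys) / len(ys)          # start from the mean prediction
    preds = [base] * len(ys)
    stumps, history = [], []
    for _ in range(n_rounds):
        # For the loss 0.5 * (y - p)^2 the negative gradient is the residual y - p.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, history


# Toy data (an assumption for this example): y = x squared on ten points.
xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]
predict, history = gradient_boost(xs, ys)
print(f"MSE after round 1: {history[0]:.2f}, after round 20: {history[-1]:.2f}")
```

Each round only has to correct what the previous rounds got wrong, which is why many weak trees can add up to a strong model.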
What is LightGBM?
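LightGBM is an open-source gradient boosting framework developed by Microsoft for classification, regression and ranking problems.
It's an ensemble method which trains a series of decision trees sequentially but grows each tree leaf-wise (a.k.a. vertically): at every step it splits whichever leaf gives the largest reduction in loss, rather than expanding a whole level of the tree at once. The resulting trees can be deep and unbalanced, and for the same number of leaves they usually reach a lower loss than level-wise trees. Together with histogram-based split finding, this approach creates a highly performant boosting model whilst being fast to train.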
What is CatBoost?
CatBoost is also an open-source gradient boosting library, developed by Yandex, that can be used for classification and regression problems.
It too is a highly performant model out-of-the-box, but a few key features make CatBoost unique. Firstly, it builds symmetric trees, where every node on the same level uses the same split. Secondly, it handles categorical and text features itself: you only need to tell it which columns they are, without undertaking pre-processing to convert them to numerical features.
What are the similarities between CatBoost and LightGBM?
- Model framework. Both use the gradient boosting method to train many weak decision trees in an ensemble model
- Performance. Both models perform very well out of the box with standard parameters on most datasets
- Use case. They can be used for classification and regression
- Datasets. Both can handle large datasets with ease
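What are the differences between CatBoost and LightGBM?
- Training time. LightGBM is known for having fast training times, and will often be faster to train and predict than CatBoost
- Categorical and text data. CatBoost can handle raw categorical and text data without pre-processing. LightGBM can split on categorical features natively, but only once they are integer-encoded (or stored as a pandas category dtype) and declared as categorical, and it has no built-in handling for text features
- Null values. Both models handle missing values in numerical features without the need for pre-processing. Missing values in categorical features typically need replacing with a placeholder category before training CatBoost
- Tree construction. LightGBM constructs trees leaf-wise, whilst CatBoost builds balanced symmetrical trees
- Overfitting. CatBoost's use of ordered boosting and balanced trees acts as a form of regularisation, so it tends to be less prone to overfitting, particularly on small datasets. LightGBM's leaf-wise trees can overfit small datasets unless the number of leaves or the depth is constrained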
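To illustrate the tree construction difference: in a symmetric (oblivious) tree every node on a level shares the same split, so a row's leaf can be found from a short sequence of yes/no answers. The sketch below is a simplified pure-Python illustration with hypothetical feature names, thresholds and leaf values. It is not CatBoost's internal implementation.

```python
def oblivious_leaf_index(row, splits):
    """Return the leaf index of `row` in a symmetric (oblivious) tree.

    `splits` holds one (feature_name, threshold) pair per level. Every node
    on a level uses that same pair, so the path to a leaf is just one bit
    per level.
    """
    index = 0
    for feature, threshold in splits:
        bit = 1 if row[feature] > threshold else 0
        index = (index << 1) | bit
    return index


# A depth-2 tree: one split per level, 2 ** 2 = 4 leaves (all values hypothetical).
splits = [("age", 30), ("income", 50_000)]
leaf_values = [0.1, 0.4, 0.3, 0.9]

row = {"age": 42, "income": 38_000}
leaf = oblivious_leaf_index(row, splits)   # age > 30 gives bit 1, income > 50000 gives bit 0
prediction = leaf_values[leaf]
print(leaf, prediction)                    # 2 0.3
```

A leaf-wise tree has no such shared structure: each node picks its own split, which is more flexible but also makes the tree easier to overfit.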
When should you use CatBoost or LightGBM?
Both CatBoost and LightGBM are well-performing boosting models, but which one you should use depends upon your dataset and technical constraints.
As a rough rule of thumb, I would suggest:
- Use CatBoost when you have a significant number of categorical or text features
- Use LightGBM if you have a mixture of feature types and model speed is important for your use case
CatBoost vs LightGBM, which is better?
Which model is better depends primarily upon your dataset: the more categorical variables you have, the stronger the case for choosing CatBoost.
However, if you are unable to decide between the two based on your dataset needs, then LightGBM is generally a sensible default. This is because it works well on a wide range of datasets, is fast to train and score, and is well-documented online if you ever need help.