One hot encoding vs label encoding, which is best?

In machine learning, there are two types of features: continuous features, such as revenue or number of orders, and categorical features, such as product category or city. Feature encoding is the process of taking a categorical variable and transforming it into a numerical feature for a machine learning model to learn from. This is a required step of feature engineering as machine learning models can only take in numerical features as input.

What methods of feature encoding are available?

There are several methods of encoding that one can use on a categorical feature, each with its own tradeoffs, but the most common are Label Encoding and One Hot Encoding. In the coming sections I will go through each of these approaches.

What is One Hot Encoding?

One Hot Encoding is the process of taking a categorical variable and transforming it into several numeric features with a binary flag to mark the correct categorical value. Each of the new numeric features is one of the possible unique values in the original categorical feature.

As an example, consider the following dataset where we have a categorical feature, City, which needs to be encoded.

CustomerID City
1 Oslo
2 Stockholm
3 Copenhagen

By using the One Hot Encoding method we will end up with the following result:

CustomerID Oslo Stockholm Copenhagen
1 1 0 0
2 0 1 0
3 0 0 1

We see that this method creates 3 new columns, each with a binary flag marking the correct value. This preserves the information conveyed by the original string categorical feature whilst making it readable for a machine learning model.
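As a quick illustration, the table above can be reproduced with the pandas get_dummies function, a common one-liner for One Hot Encoding. This is a minimal sketch using made-up data matching the example; note that the new columns come out in alphabetical order.

```python
import pandas as pd

# Sample data matching the table above
df = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "City": ["Oslo", "Stockholm", "Copenhagen"],
})

# pd.get_dummies creates one binary column per unique City value
encoded = pd.get_dummies(df, columns=["City"], prefix="", prefix_sep="", dtype=int)
print(encoded)
```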

What is Label Encoding?

Label Encoding is a feature encoding method which takes a categorical variable and converts the unique possibilities to a sequence of numerical values.

As an example let us use the data we looked at in the previous example:

CustomerID City
1 Oslo
2 Stockholm
3 Copenhagen

By applying the Label Encoding method, we would end up with the following:

CustomerID City
1 0
2 1
3 2

As you can see, we maintain all of the original information but represent it instead in a numerical format which is readable by our machine learning model. The mapping between the original categorical values and the new numerical ones can be stored, which means that we can reverse the transformation when required.
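The idea of storing the mapping and reversing it can be sketched in a few lines of plain Python, using the City values from the example above:

```python
# Minimal sketch of label encoding by hand
cities = ["Oslo", "Stockholm", "Copenhagen"]

# Assign each unique value a sequential integer code
mapping = {city: code for code, city in enumerate(cities)}

encoded = [mapping[c] for c in cities]
print(encoded)  # [0, 1, 2]

# Storing the reverse mapping lets us recover the original values
reverse = {code: city for city, code in mapping.items()}
decoded = [reverse[code] for code in encoded]
print(decoded)  # ['Oslo', 'Stockholm', 'Copenhagen']
```

In practice a library encoder (shown later) handles this bookkeeping for you, but the principle is the same.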

What are the positives and negatives of Label Encoding vs One Hot Encoding?

We have looked at the two most common methods for encoding categorical features, and have seen that both achieve the goal of transforming our data into a usable format for machine learning models. But what are the positives and negatives of Label Encoding vs One Hot Encoding? The answer comes down to two areas: the type of model you are using and the data you have.

What are the positives and negatives of One Hot Encoding?

One Hot Encoding a categorical variable is a good universal method which works for all commonly used machine learning models. The tradeoff is that if a feature has a large number of possible values, your feature set can become very large, which can cause memory or learning problems depending on the model you use.
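To make the size problem concrete, here is a small sketch with a hypothetical high-cardinality feature: every unique value becomes its own column.

```python
import pandas as pd

# Hypothetical data: 2,000 rows with 1,000 unique city values
df = pd.DataFrame({"City": [f"City_{i % 1000}" for i in range(2000)]})

# One Hot Encoding creates one column per unique value
encoded = pd.get_dummies(df["City"])
print(encoded.shape)  # (2000, 1000)
```

A feature with a thousand unique values turns into a thousand mostly-zero columns, which is why high-cardinality features are where One Hot Encoding starts to hurt.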

What are the positives and negatives of Label Encoding?

Label Encoding a categorical variable is a method which works best for tree-based models, as these do a good job of splitting values in a feature. The tradeoff with Label Encoding is that it's not appropriate for all types of machine learning models, and analysis such as feature importance can be difficult to interpret after training.

One hot encoding vs label encoding, which is best?

Given everything that we have looked at, the main question is of course:

One hot encoding vs label encoding, which should you use?

That answer depends very much on your context. However, given that One Hot Encoding can be used across all machine learning models whilst Label Encoding tends to work best only on tree-based models, I would suggest starting with One Hot Encoding and looking at Label Encoding if you see a specific need.

Implementing Label Encoding in Python

In order to implement Label Encoding in Python, you can use the popular data science Python package, scikit-learn.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample data to encode
df = pd.DataFrame({"CustomerID": [1, 2, 3],
                   "City": ["Oslo", "Stockholm", "Copenhagen"]})
col = "City"

# Create a dictionary to store the fitted encoders for later use
encoders = {}

# Create the LabelEncoder and fit it to the column
# (note: LabelEncoder assigns codes in alphabetical order)
le = LabelEncoder().fit(df[col])

# Overwrite the column with our transformed data
df[col] = le.transform(df[col])

# Store the fitted encoder so the mapping can be reversed later
encoders[col] = le
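Because the fitted encoder stores the mapping, you can reverse the transformation with inverse_transform. A minimal sketch, using made-up city values:

```python
from sklearn.preprocessing import LabelEncoder

# Fit on sample city values (codes are assigned in alphabetical order)
le = LabelEncoder().fit(["Oslo", "Stockholm", "Copenhagen"])

codes = le.transform(["Oslo", "Copenhagen"])

# The stored encoder can recover the original values when required
original = le.inverse_transform(codes)
print(list(original))  # ['Oslo', 'Copenhagen']
```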

Implementing One Hot Encoding in Python

In order to implement One Hot Encoding in Python, you use the same scikit-learn package but with a slightly more complex process, given that you are creating multiple columns rather than adjusting an existing one.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Sample data to encode
df = pd.DataFrame({"CustomerID": [1, 2, 3],
                   "City": ["Oslo", "Stockholm", "Copenhagen"]})
col = "City"

enc = OneHotEncoder()

# Fit and transform the feature
encoded = enc.fit_transform(df[[col]]).toarray()

# Add the transformed features to the dataframe with correct naming
df[enc.get_feature_names_out()] = encoded

# Drop the initial categorical feature
df = df.drop([col], axis=1)



Stephen Allwright

I'm a Data Scientist currently working for Oda, an online grocery retailer, in Oslo, Norway. These posts are my way of sharing some of the tips and tricks I've picked up along the way.