
Label encode unseen values in a Pandas DataFrame
Label encoding is a popular method when preparing data for machine learning, but once a label encoder has been fitted to a data set, it returns an error when asked to transform a value it did not see during fitting.
Label encode unseen values
Label encoding is a technique used to transform a categorical (often string) feature into numerical values for use in machine learning. For example, a feature called "Property type" might have the potential values "House", "Apartment" and "Cabin"; after label encoding, these would be transformed to the values 0, 1 and 2. Saving the fitted label encoder then allows us to transform this feature in the same way when scoring our machine learning model.
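As a minimal sketch of the example above, the snippet below fits scikit-learn's LabelEncoder on the three property types and then saves and reloads it. Using pickle for the saving step is an assumption for illustration; the article does not say how the encoder is stored. Note that scikit-learn assigns codes in sorted order of the values, not in order of appearance.

```python
import pickle

from sklearn.preprocessing import LabelEncoder

property_types = ["House", "Apartment", "Cabin"]
le = LabelEncoder().fit(property_types)

# The fitted classes are stored sorted, so codes follow alphabetical order
print(le.classes_.tolist())   # ['Apartment', 'Cabin', 'House']

codes = le.transform(property_types)
print(codes.tolist())         # [2, 0, 1]

# Round-trip through pickle (an assumed storage choice) to mimic
# saving the encoder at training time and reloading it at scoring time
restored = pickle.loads(pickle.dumps(le))
print(restored.transform(["Cabin"]).tolist())   # [1]
```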
How do I use label encoding on unseen data?
Once a label encoder has been fitted, asking it to transform a value that was absent from the fitting data raises an error (scikit-learn's LabelEncoder raises a ValueError). There are two options to solve this error:
- Re-train the model and label encoder on the new data set
- Add an "Unseen" value when fitting your label encoder and map new values to this "Unseen" value when scoring
Retraining the model could be a viable option; however, you don't know how often new values will arise, so it may only be a short-term fix for a long-term problem. Because of this, it's often best to use option 2.
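The error itself is easy to reproduce. The sketch below fits an encoder on the three property types used earlier and then tries to transform "Villa", a hypothetical value chosen only for illustration that was not part of the fitting data.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder().fit(["House", "Apartment", "Cabin"])

try:
    # "Villa" was not present during fitting, so transform fails
    le.transform(["Villa"])
    raised = False
except ValueError as err:
    raised = True
    print(f"Transform failed: {err}")
```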
Assigning unseen values when label encoding
By marking unseen values rather than retraining, you can score your data regardless of how recently the model was trained. To do this, add an extra value when fitting your label encoder (in our example, the word "Unseen") so that when the encoder is called later it can transform values it has not seen before. The function below fits new encoders when none are passed in; when given fitted encoders, it replaces unknown values with "Unseen" before transforming. An example function to undertake this can be seen here:
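import pandas as pd
from sklearn.preprocessing import LabelEncoder


def label_encode_columns(df, columns, encoders=None):
    """Label encode columns; values not seen during fitting map to 'Unseen'."""
    if encoders is None:
        # Fitting: learn one encoder per column, reserving an 'Unseen' class
        encoders = {}
        for col in columns:
            unique_values = list(df[col].unique())
            unique_values.append('Unseen')
            le = LabelEncoder().fit(unique_values)
            # Pass the 1-D column; LabelEncoder does not expect a 2-D frame
            df[col] = le.transform(df[col])
            encoders[col] = le
    else:
        # Scoring: reuse the fitted encoders and mark unknown values as 'Unseen'
        for col in columns:
            le = encoders[col]
            known = set(le.classes_)  # set lookup is faster than scanning the array
            df[col] = [x if x in known else 'Unseen' for x in df[col]]
            df[col] = le.transform(df[col])
    # Note: df is modified in place as well as returned
    return df, encoders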
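To see the effect on a single column without the helper function, here is a minimal sketch of the same "Unseen" technique. The column name `property_type`, the two small DataFrames and the value "Villa" are hypothetical, and the list comprehension is swapped for pandas' `isin` and `where`, which do the same replacement in a vectorised way.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical training and scoring data for illustration only
train = pd.DataFrame({"property_type": ["House", "Apartment", "Cabin"]})
score = pd.DataFrame({"property_type": ["Cabin", "Villa"]})

# Fit on the training values plus the reserved "Unseen" class
le = LabelEncoder().fit(list(train["property_type"].unique()) + ["Unseen"])
print(le.classes_.tolist())   # ['Apartment', 'Cabin', 'House', 'Unseen']

# Keep known values, replace everything else with "Unseen"
known = score["property_type"].isin(le.classes_)
cleaned = score["property_type"].where(known, "Unseen")

encoded = le.transform(cleaned)
print(encoded.tolist())       # [1, 3]
```

"Cabin" keeps its fitted code, while "Villa" falls into the "Unseen" class instead of raising an error. The model has still never seen a training row carrying the "Unseen" code, so it is worth checking how it treats that code.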