Label encode unseen values in a Pandas DataFrame


Label encoding is a popular method when preparing data for machine learning, but once the label encoder is fitted to a set of data, it raises an error when asked to transform a value not seen during the fitting.

Stephen Allwright

Label encode unseen values

Label encoding is a technique used to transform a categorical (often string) feature into numerical values that can be used in machine learning. For example, a feature called "Property type" with the potential values "House", "Apartment" and "Cabin" would be transformed into the integers 0, 1 and 2. With scikit-learn's LabelEncoder the integers follow the sorted order of the values, so "Apartment" becomes 0, "Cabin" becomes 1 and "House" becomes 2. Saving the fitted label encoder then allows us to transform this feature in the same way when scoring our machine learning model.
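As a minimal sketch of the example above, the snippet below fits a LabelEncoder on a small, made-up list of property types and then round-trips the encoder through pickle to mimic saving and reloading it. The variable names and the sample data are assumptions for illustration only, and pickle is just one possible way to persist the encoder.

```python
import pickle

from sklearn.preprocessing import LabelEncoder

# Hypothetical sample data for the "Property type" feature
property_types = ["House", "Apartment", "Cabin", "House"]

# Fit the encoder; classes_ holds the sorted unique values
encoder = LabelEncoder().fit(property_types)
classes = list(encoder.classes_)

# Each value is replaced by its index in classes_
encoded = encoder.transform(property_types).tolist()

# Round-trip through pickle to mimic saving and reloading the encoder
restored = pickle.loads(pickle.dumps(encoder))
restored_encoded = restored.transform(["Cabin", "House"]).tolist()
```

Because the reloaded encoder carries the same classes, it assigns the same integers at scoring time as it did at training time.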

How do I use label encoding on unseen data?

Once a label encoder has been fitted to a set of data, it raises an error (a ValueError in scikit-learn) when asked to transform a value that was not seen during fitting. There are two options to solve this error:

  1. Re-train the model and label encoder on the new data set
  2. Add an "Unseen" value when fitting your label encoder and map new values to this "Unseen" value when scoring

Retraining the model can be a viable option, but you don't know how often new values will appear, so it may only be a short-term fix for a long-term problem. For this reason, option 2 is often the better choice.
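To see the failure that motivates both options, the short sketch below fits an encoder on three values and then asks it to transform a value it has never seen. The value "Castle" and the variable names are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Fit on the values available at training time
encoder = LabelEncoder().fit(["House", "Apartment", "Cabin"])

try:
    # "Castle" was not part of the fitted data, so transform fails
    encoder.transform(["House", "Castle"])
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

After running this, error_message holds the text of the exception, which names the offending value.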

Assigning unseen values when label encoding

By marking unseen values rather than retraining, you can always score your data, regardless of how recently the model was trained. To do this, add an extra placeholder value when fitting the label encoder (this example uses the word "Unseen"). When the encoder is used later on, any value it did not see during fitting is first replaced with this placeholder and can then be transformed. The example function below does this: called without encoders, it fits and returns one encoder per column; called with previously fitted encoders, it reuses them.

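import pandas as pd
from sklearn.preprocessing import LabelEncoder

def label_encode_columns(df, columns, encoders=None):
    # Note: the columns of df are overwritten in place
    if encoders is None:
        # Fitting: build one encoder per column, including the 'Unseen' placeholder
        encoders = {}

        for col in columns:
            unique_values = list(df[col].unique())
            unique_values.append('Unseen')
            le = LabelEncoder().fit(unique_values)
            # Pass a 1-D Series; LabelEncoder expects 1-D input
            df[col] = le.transform(df[col])
            encoders[col] = le

    else:
        # Scoring: reuse the fitted encoders and map unknown values to 'Unseen'
        for col in columns:
            le = encoders[col]
            known_values = set(le.classes_)
            df[col] = [x if x in known_values else 'Unseen' for x in df[col]]
            df[col] = le.transform(df[col])

    return df, encoders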
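To make the behaviour concrete, here is a single-column walk-through of the same idea, written out without the function so that it runs on its own. The training and scoring values are made up, and using Series.where with isin is just one way to do the replacement, not necessarily how the function above does it.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical training and scoring data; "Castle" never appears in training
train = pd.Series(["House", "Apartment", "Cabin"])
score = pd.Series(["House", "Castle", "Cabin"])

# Fit on the training values plus the "Unseen" placeholder
encoder = LabelEncoder().fit(list(train.unique()) + ["Unseen"])

# Keep values the encoder knows, replace everything else with the placeholder
cleaned = score.where(score.isin(encoder.classes_), "Unseen")

# Transform to integers, then decode again to inspect the result
codes = encoder.transform(cleaned).tolist()
decoded = encoder.inverse_transform(codes).tolist()
```

Here "Castle" receives the same integer as the "Unseen" placeholder, so the scoring data can be transformed without an error.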




I'm a Data Scientist currently working for Oda, an online grocery retailer, in Oslo, Norway. These posts are my way of sharing some of the tips and tricks I've picked up along the way.