Label encode multiple columns in a Pandas DataFrame



Stephen Allwright

Label encode multiple columns

Label encoding is a feature engineering method for categorical features, where a column with values ['egg','flour','bread'] would be turned into [0,1,2], which is usable by a machine learning model. This method differs from One Hot Encoding because the values are converted within the original column, rather than creating a separate column for each value in the original feature.
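To make the difference concrete, here is a minimal sketch on a made-up `ingredient` column (the data and variable names are illustrative assumptions, not taken from a real dataset). One detail worth knowing: scikit-learn's `LabelEncoder` assigns codes in sorted order of the values, so the exact integers depend on the values present rather than on the order they appear in.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data, invented purely for illustration
df = pd.DataFrame({"ingredient": ["egg", "flour", "bread", "egg"]})

# Label encoding: one integer code per distinct value, kept in a single column
le = LabelEncoder()
label_encoded = le.fit_transform(df["ingredient"])

# One hot encoding: one new indicator column per distinct value
one_hot = pd.get_dummies(df["ingredient"])

print(le.classes_.tolist())    # ['bread', 'egg', 'flour']
print(label_encoded.tolist())  # [1, 2, 0, 1]
print(list(one_hot.columns))   # ['bread', 'egg', 'flour']
```

The label encoded result is still a single column of four values, while the one hot encoded result has three columns, one per distinct ingredient.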

Label encoding multiple columns in production

When working with a data science product that is going to be run in production, it's important to remember that the encoders fitted on your training data must also be applied to your scoring data, so that each category maps to the same integer in both. Because of this requirement, the function I use for label encoding multiple columns returns a dictionary of the fitted encoders, keyed by column name, making it easy to apply the same encoders later on.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

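#Fit a separate LabelEncoder for each column, returning the encoded data and the fitted encoders
def label_encode_columns(df, columns):
    #Work on a copy so the DataFrame passed in is not modified in place
    df = df.copy()
    encoders = {}
    for col in columns:
        le = LabelEncoder().fit(df[col])
        df[col] = le.transform(df[col])
        #Keep the fitted encoder, keyed by column name, for reuse on the scoring data
        encoders[col] = le
    return df, encoders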

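#Create the function to take in the fitted encoders and transform the scoring dataset
def label_encode_columns_w_fit_encoders(df, columns, encoders):
    #Work on a copy so the DataFrame passed in is not modified in place
    df = df.copy()
    for col in columns:
        #Index the dictionary directly so a missing encoder raises a clear KeyError
        le = encoders[col]
        df[col] = le.transform(df[col])
    return df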
 
#Define the columns we want to encode
encode_columns = ['ingredient','gender','city']
 
#Fit and transform the training dataset, returning both the new training dataset and the fitted encoders to use on the scoring dataset
train_df, encoders = label_encode_columns(df=train_df, columns=encode_columns)
 
#Transform the scoring dataset using the encoders we fit previously
score_df = label_encode_columns_w_fit_encoders(df=score_df, columns=encode_columns, encoders=encoders)
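
The snippet above assumes `train_df` and `score_df` already exist. As a rough end-to-end sketch, the version below runs the same two functions on small made-up DataFrames. The sample data, the `train_encoded`, `score_encoded` and `loaded_encoders` names, and the use of `pickle` to mimic saving the encoders at training time and loading them at scoring time are all assumptions for illustration; the final lines show that `LabelEncoder.transform` raises a `ValueError` for a value it did not see during fitting.

```python
import pickle

import pandas as pd
from sklearn.preprocessing import LabelEncoder


def label_encode_columns(df, columns):
    # Fit one encoder per column on a copy of the training data
    df = df.copy()
    encoders = {}
    for col in columns:
        le = LabelEncoder().fit(df[col])
        df[col] = le.transform(df[col])
        encoders[col] = le
    return df, encoders


def label_encode_columns_w_fit_encoders(df, columns, encoders):
    # Reuse the already fitted encoders on a copy of the scoring data
    df = df.copy()
    for col in columns:
        df[col] = encoders[col].transform(df[col])
    return df


# Made-up training and scoring data, for illustration only
train_df = pd.DataFrame({
    "ingredient": ["egg", "flour", "bread"],
    "city": ["Oslo", "Bergen", "Oslo"],
})
score_df = pd.DataFrame({
    "ingredient": ["bread", "egg"],
    "city": ["Bergen", "Oslo"],
})
encode_columns = ["ingredient", "city"]

train_encoded, encoders = label_encode_columns(train_df, encode_columns)

# Round-trip through pickle to mimic storing the encoders between training and scoring
loaded_encoders = pickle.loads(pickle.dumps(encoders))

score_encoded = label_encode_columns_w_fit_encoders(
    score_df, encode_columns, loaded_encoders
)

print(train_encoded)
print(score_encoded)

# A value that was not present when fitting is rejected by transform
try:
    loaded_encoders["ingredient"].transform(["sugar"])
except ValueError as err:
    print("Unseen value:", err)
```

Because the scoring data goes through the encoders fitted on the training data, 'bread' and 'Oslo' receive the same codes in both DataFrames.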


References

Scikit-learn LabelEncoder documentation

Pandas


I'm a Data Scientist currently working for Oda, an online grocery retailer, in Oslo, Norway. These posts are my way of sharing some of the tips and tricks I've picked up along the way.