Scale multiple columns in a Pandas DataFrame

The practical problem when working with feature scaling in a real-world project is that you are often required to scale multiple features, and also apply the same scaling that was fit on your training data to your scoring data later on.

Stephen Allwright

Scale multiple columns for model training

Scaling is a data transformation technique used in feature engineering to prepare data for the training or scoring of a machine learning model. There are several methods for scaling your data, the two most popular in the scikit-learn library being Min Max Scaling and Standard Scaling. This article focuses on Min Max Scaling.
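To make the difference between the two methods concrete, here is a minimal sketch that applies both scalers to the same column. The values are toy data invented for illustration, not taken from a real project.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy single-feature data, invented for illustration
values = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])

# Min Max Scaling maps the smallest value to 0 and the largest to 1
min_max_scaled = MinMaxScaler().fit_transform(values)

# Standard Scaling centres the data on 0 and rescales it to unit variance
standard_scaled = StandardScaler().fit_transform(values)

print(min_max_scaled.ravel())
print(standard_scaled.ravel())
```

Min Max Scaling gives a bounded range, whereas Standard Scaling gives values centred on zero with no fixed bounds.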

Using Min Max Scaling in feature engineering

The aim of Min Max Scaling is to transform the range of the data to be within a given boundary (by default between 0 and 1). The benefit of scaling your data in this way is that some machine learning models perform better when the features are on a similar scale. Linear models are particularly affected, whilst tree-based models, for example, are not affected by differences in scale between features.
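As a small sketch of what the transformation does, the example below computes (x - min) / (max - min) by hand and compares it with the output of MinMaxScaler using its default range of 0 to 1. The column name and values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy feature, invented for illustration
df = pd.DataFrame({"A": [2.0, 4.0, 6.0, 10.0]})

# Manual Min Max Scaling: (x - min) / (max - min)
manual = (df["A"] - df["A"].min()) / (df["A"].max() - df["A"].min())

# The same transformation using scikit-learn (default feature_range=(0, 1))
scaled = MinMaxScaler().fit_transform(df[["A"]]).ravel()

print(manual.tolist())
print(scaled.tolist())
```

Both approaches should give the same values, with the column minimum mapped to 0 and the column maximum mapped to 1.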

Scaling multiple features across training and scoring data

The practical problem when working with feature scaling in a real-world project is that you are often required to scale multiple features, and also apply the same scaling that was fit on your training data to your scoring data later on. One way of achieving this is to create a function which fits a scaler to each feature in the training dataset, stores these scalers in a dictionary keyed by column name, and then uses this dictionary to transform the scoring data. Note: always fit your scalers on the training data only, and then apply those fitted scalers to the scoring data, so that both datasets are transformed using the same minimum and maximum values.

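from sklearn.preprocessing import MinMaxScaler
import pandas as pd

# train_df: DataFrame to be used for training your model (assumed to be loaded already)
# test_df: DataFrame to be used for testing your model (assumed to be loaded already)

# Columns to scale in both of the DataFrames
# (named differently from the function below so the two do not clash)
columns_to_scale = ['A', 'B', 'C']

def scale_columns(df, columns, scalers=None):
    # Work on a copy so the DataFrame passed in is not modified in place
    df = df.copy()

    if scalers is None:
        # Training data: fit one scaler per column and store it for later use
        scalers = {}
        for col in columns:
            scaler = MinMaxScaler().fit(df[[col]])
            df[col] = scaler.transform(df[[col]]).ravel()
            scalers[col] = scaler

    else:
        # Scoring data: reuse the scalers that were fit on the training data
        for col in columns:
            scaler = scalers[col]
            df[col] = scaler.transform(df[[col]]).ravel()

    return df, scalers

# Fit the scalers on the training data, then apply them to the test data
train_df, scalers = scale_columns(train_df, columns=columns_to_scale, scalers=None)
test_df, _ = scale_columns(test_df, columns=columns_to_scale, scalers=scalers)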
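
To see the effect of reusing the training scalers, here is a self-contained sketch with toy data. The helper scale_features is a condensed, hypothetical variant of the function above, and the DataFrames and their values are invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def scale_features(df, columns, scalers=None):
    """Fit one MinMaxScaler per column (scalers=None) or reuse fitted ones."""
    df = df.copy()
    if scalers is None:
        scalers = {col: MinMaxScaler().fit(df[[col]]) for col in columns}
    for col in columns:
        df[col] = scalers[col].transform(df[[col]]).ravel()
    return df, scalers


# Toy data, invented for illustration
train_df = pd.DataFrame({"A": [0.0, 5.0, 10.0], "B": [100.0, 150.0, 200.0]})
test_df = pd.DataFrame({"A": [2.5, 20.0], "B": [50.0, 200.0]})

# Fit on the training data, then reuse the same scalers on the test data
train_scaled, scalers = scale_features(train_df, ["A", "B"])
test_scaled, _ = scale_features(test_df, ["A", "B"], scalers=scalers)

print(train_scaled)
print(test_scaled)
```

Because the scalers were fit on the training data, test values that fall outside the training range end up outside 0 to 1 (here A = 20.0 scales to 2.0 and B = 50.0 scales to -0.5). This is expected with the default settings and shows that the test data was transformed with the training minimum and maximum rather than its own.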

I'm a Data Scientist currently working for Oda, an online grocery retailer, in Oslo, Norway. These posts are my way of sharing some of the tips and tricks I've picked up along the way.