
Scale multiple columns in a Pandas DataFrame
The practical problem when working with feature scaling in a real-world project is that you often need to scale multiple features, and also apply the same scaling that was fit on your training data to your scoring data later on.
Scale multiple columns for model training
Scaling is a data transformation technique used in feature engineering to prepare data for the training or scoring of a machine learning model. There are several methods for scaling your data. The two most popular within the scikit-learn library are Min Max Scaling and Standard Scaling. This article focuses on Min Max Scaling.
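To make the difference between the two methods concrete, here is a minimal sketch that applies both scalers to the same feature. The values are made up purely for illustration and are not taken from any real dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature, shaped (n_samples, 1) as scikit-learn expects
values = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])

# Min Max Scaling rescales the values into the range 0 to 1
min_max_scaled = MinMaxScaler().fit_transform(values)

# Standard Scaling centres the values on 0 with unit variance
standard_scaled = StandardScaler().fit_transform(values)

print(min_max_scaled.ravel())
print(standard_scaled.ravel())
```

Min Max Scaling keeps the result inside a fixed boundary, whereas Standard Scaling has no fixed minimum or maximum.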
Using Min Max Scaling in feature engineering
The aim of Min Max Scaling is to transform the range of the data to be within a given boundary (by default between 0 and 1). The benefit of scaling your data in this way is that some machine learning models perform better when the features are on a similar scale. Linear models are particularly affected, whilst tree-based models, for example, are not affected by features being on different scales.
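The transformation itself is simple: each value has the column minimum subtracted and is then divided by the column range. The sketch below, using made-up numbers, computes this by hand and compares it with scikit-learn's MinMaxScaler. It also shows the feature_range parameter, which changes the boundary from the default of 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature values, chosen only for illustration
x = np.array([[2.0], [4.0], [6.0], [10.0]])

# Manual Min Max Scaling: (x - min) / (max - min)
manual = (x - x.min()) / (x.max() - x.min())

# The same transformation with scikit-learn, default feature_range=(0, 1)
sklearn_scaled = MinMaxScaler().fit_transform(x)

# A different boundary can be requested through feature_range
custom = MinMaxScaler(feature_range=(-1, 1)).fit_transform(x)

print(manual.ravel())
print(sklearn_scaled.ravel())
print(custom.ravel())
```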
Scaling multiple features across training and scoring data
The practical problem when working with feature scaling in a real-world project is that you often need to scale multiple features, and also apply the same scaling that was fit on your training data to your scoring data later on. One way of achieving this is to create a function which fits a scaler to each feature in the training dataset, stores these scalers in a dictionary keyed by column name, and then uses this dictionary to transform the scoring data. Note: always fit your scalers on the training data and only apply them to the scoring data, so that both datasets are transformed in exactly the same way.
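from sklearn.preprocessing import MinMaxScaler
import pandas as pd

# train_df: DataFrame used for training your model (assumed to be already loaded)
# test_df: DataFrame used for testing your model (assumed to be already loaded)

# Columns to scale in both of the DataFrames
# (named differently from the function below so the two do not collide)
columns_to_scale = ['A', 'B', 'C']

def scale_columns(df, columns, scalers=None):
    """Scale the given columns, fitting new scalers only when none are supplied."""
    df = df.copy()  # avoid modifying the caller's DataFrame in place
    if scalers is None:
        # Training data: fit one scaler per column and store it for later reuse
        scalers = {}
        for col in columns:
            scaler = MinMaxScaler().fit(df[[col]])
            df[col] = scaler.transform(df[[col]])
            scalers[col] = scaler
    else:
        # Scoring data: reuse the scalers that were fit on the training data
        for col in columns:
            scaler = scalers[col]  # raises KeyError if no scaler was fit for this column
            df[col] = scaler.transform(df[[col]])
    return df, scalers

# Fit the scalers on the training data, then apply the same scalers to the test data
train_df, scalers = scale_columns(train_df, columns=columns_to_scale, scalers=None)
test_df, scalers = scale_columns(test_df, columns=columns_to_scale, scalers=scalers)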
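To see the approach end to end, here is a self-contained sketch that runs a condensed variant of the function on two small DataFrames. The data and the name columns_to_scale are made up for illustration. The scoring data deliberately contains a value larger than anything in the training data.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def scale_columns(df, columns, scalers=None):
    """Fit one scaler per column when scalers is None, otherwise reuse them."""
    df = df.copy()  # leave the input DataFrame untouched
    if scalers is None:
        scalers = {col: MinMaxScaler().fit(df[[col]]) for col in columns}
    for col in columns:
        df[col] = scalers[col].transform(df[[col]]).ravel()
    return df, scalers


# Made-up data purely for illustration
train_df = pd.DataFrame({"A": [0.0, 5.0, 10.0], "B": [100.0, 150.0, 200.0]})
test_df = pd.DataFrame({"A": [5.0, 20.0], "B": [100.0, 175.0]})
columns_to_scale = ["A", "B"]

# Fit on the training data, then reuse the fitted scalers on the scoring data
train_scaled, scalers = scale_columns(train_df, columns_to_scale)
test_scaled, _ = scale_columns(test_df, columns_to_scale, scalers=scalers)

print(train_scaled)
print(test_scaled)
```

Because the scalers were fit on the training data only, the value 20 in column A of the scoring data lies above the training maximum of 10 and is scaled to a value greater than 1. MinMaxScaler does not clip by default, so out-of-range values in scoring data are worth checking for.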