Remove outliers from Pandas DataFrame

Should you remove outliers from a dataset?

Outliers are data points in a dataset that are extreme, erroneous, or not representative of what the data is describing. They can be caused either by incorrect data collection or by genuine outlying observations. Removing these outliers will often help your model to generalize better, as these long-tail observations can skew the learning.

Outliers should be removed from your dataset if you believe that the data point is incorrect, or that it is so unrepresentative of the real-world situation that it would prevent your machine learning model from generalizing.

Methods for handling outliers in a DataFrame

Removing outliers from your dataset is not the only possible approach. As a rule of thumb, there are three choices available when dealing with outliers in your dataset:

  1. Remove - The observations are incorrect or not representative of what you are modelling
  2. Re-scale - You want to keep the observations but need to reduce their extreme nature
  3. Mark - Label the outliers to understand if they had an effect on the model afterwards
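As a sketch of the last two options, re-scaling can be done with Pandas' clip method and marking with a boolean flag column (the data and thresholds below are hypothetical):

```python
import pandas as pd

# Hypothetical data: twenty typical prices and one extreme value
df = pd.DataFrame({"price": [10.0] * 10 + [12.0] * 10 + [500.0]})

# Re-scale: cap values at the 5th and 95th percentiles (winsorising)
lower, upper = df["price"].quantile([0.05, 0.95])
df["price_clipped"] = df["price"].clip(lower, upper)

# Mark: flag values more than 3 standard deviations from the mean
mean, sd = df["price"].mean(), df["price"].std()
df["is_outlier"] = (df["price"] - mean).abs() > 3 * sd
```

Both options keep every row, so a model trained afterwards still sees the full dataset, just with the extreme values tamed or labelled.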

Methods to detect outliers in a Pandas DataFrame

Once you have decided to remove the outliers from your dataset, the next step is to choose a method for finding them. Assuming that your dataset is too large to remove the outliers manually line by line, you will need a statistical method. Several approaches are commonly used:

  1. Standard deviation - Remove the values which are a certain number of standard deviations away from the mean, if the data has a Gaussian distribution
  2. Automatic outlier detection - Train a machine learning model on a smaller normal set of observations which can then predict data points outside of this normal set
  3. Interquartile range - Remove the values which fall more than 1.5 times the interquartile range above the 75th percentile or below the 25th percentile, which doesn't require the data to be Gaussian
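As a sketch of the interquartile range approach, using made-up data and the conventional 1.5 × IQR fences:

```python
import pandas as pd

# Hypothetical data with one extreme value
df = pd.DataFrame({"age": [22, 25, 27, 30, 31, 33, 35, 38, 40, 120]})

# Compute the 25th and 75th percentiles and the interquartile range
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1

# Keep only the rows inside the 1.5 * IQR fences
mask = df["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_no_outliers = df[mask]
```

Because the fences are built from percentiles rather than the mean, a single extreme value barely moves them, which is why this method works on skewed, non-Gaussian data.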

There are trade-offs for each of these options; however, the method most commonly used in industry is the standard deviation, or z-score, approach.

How many standard deviations away from the mean should I use to detect outliers?

The standard deviation approach to removing outliers requires the user to choose a number of standard deviations at which to differentiate outlier from non-outlier.

This raises the question: how many standard deviations should you choose?

The common industry practice is to use 3 standard deviations from the mean to separate outliers from non-outliers. With 3 standard deviations we remove roughly the most extreme 0.3% of cases. Depending on your use case, you may want to consider using 4 standard deviations instead, which removes only around the most extreme 0.006%.
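These tail fractions can be checked against the normal distribution with scipy (assuming scipy is available in your environment):

```python
from scipy.stats import norm

# Two-tailed fraction of a Gaussian lying beyond k standard deviations
for k in (3, 4):
    tail = 2 * norm.sf(k)  # sf is the survival function, 1 - cdf
    print(f"{k} standard deviations: {tail:.4%} of observations removed")
```

Note that these percentages only hold if the column really is approximately Gaussian; on heavily skewed data the z-score cut-offs will remove a different fraction.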

Remove outliers in Pandas DataFrame using standard deviations

The most common approach for removing data points from a dataset is the standard deviation, or z-score, approach. In this example I will show how to create a function to remove outliers that lie more than 3 standard deviations away from the mean:

import pandas as pd

def remove_outliers(df, columns, n_std):
    """Remove rows whose value in any of the given columns lies more
    than n_std standard deviations from that column's mean."""
    for col in columns:
        print(f'Working on column: {col}')

        mean = df[col].mean()
        sd = df[col].std()

        # Keep only the rows within n_std standard deviations of the mean,
        # filtering both the lower and the upper tail
        df = df[(df[col] >= mean - (n_std * sd)) & (df[col] <= mean + (n_std * sd))]

    return df
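An equivalent filter can also be written with scipy's zscore function, which computes every row's distance from the mean in one step (the data below is hypothetical; note that scipy.stats.zscore uses the population standard deviation by default):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: 29 typical heights plus one extreme value
df = pd.DataFrame({"height": list(range(160, 189)) + [400]})

# Absolute z-score of every row in the column
z = np.abs(stats.zscore(df["height"]))

# Keep only the rows within 3 standard deviations of the mean
df_filtered = df[z < 3]
```

Both versions remove the same kind of row; the z-score form is just more compact when you only have one column to filter.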



Stephen Allwright

I'm a Data Scientist currently working for Oda, an online grocery retailer, in Oslo, Norway. These posts are my way of sharing some of the tips and tricks I've picked up along the way.