Pandas groupby aggregate functions

What is a Pandas groupby aggregate function?

Groupby is a Pandas DataFrame method that lets you aggregate a DataFrame up to a higher level of abstraction. For example, if you have row-level order data but want to calculate metrics at the customer level, you can group by the customer identifier and then compute figures such as total revenue and mean revenue per order.
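
To make the customer example concrete, here is a minimal sketch. The `orders` DataFrame, its column names and its values are made up for illustration; only `groupby` and `agg` come from Pandas itself.

```python
import pandas as pd

# Hypothetical row-level order data: one row per order.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "customer_id": ["A", "A", "B", "B", "B"],
    "revenue": [10.0, 30.0, 5.0, 15.0, 40.0],
})

# Collapse to one row per customer, with total revenue and mean revenue per order.
per_customer = orders.groupby("customer_id")["revenue"].agg(["sum", "mean"])
print(per_customer)
```

The result is indexed by `customer_id`, with one column per aggregation.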

What are the possible Pandas groupby aggregate functions?

When using groupby you must define which columns will be aggregated and which aggregation calculations should be applied. You can pass functions from separate packages such as NumPy to a groupby aggregation, but there are a number of built-in aggregations that are very simple to use. These are:

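  • count() – Number of non-null observations
  • nunique() – Number of unique values
  • sum() – Sum of values
  • mean() – Mean of values
  • median() – Median of values
  • mad() – Mean absolute deviation of values (deprecated in Pandas 1.5 and removed in 2.0)
  • prod() – Product of values
  • min() – Minimum
  • max() – Maximum
  • mode() – Mode (available on a Series rather than as a groupby method, so it is typically applied through agg with a custom function)
  • std() – Standard deviation (sample, ddof=1 by default)
  • var() – Variance (sample, ddof=1 by default)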
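
As a rough sketch of how the built-in aggregations and custom functions fit together, the example below calls two built-ins directly and then passes three helper functions to `agg`. The data and the helper names (`p90`, `mean_abs_dev`, `first_mode`) are my own illustrations, not part of Pandas. The helpers show one possible way to use NumPy and to cover mean absolute deviation and mode.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "customer_id": ["A", "A", "A", "B", "B"],
    "revenue": [10.0, 20.0, 20.0, 5.0, 15.0],
})
grouped = df.groupby("customer_id")["revenue"]

# Built-in aggregations can be called directly on the grouped object.
total = grouped.sum()
spread = grouped.std()  # sample standard deviation (ddof=1)

# Hypothetical helpers: each takes the Series for one group and returns a scalar.
def p90(s):
    # 90th percentile, computed with NumPy
    return np.percentile(s.to_numpy(), 90)

def mean_abs_dev(s):
    # mean absolute deviation around the group mean
    return (s - s.mean()).abs().mean()

def first_mode(s):
    # Series.mode() can return several values; keep the smallest one
    return s.mode().iloc[0]

# Passing a list of functions gives one output column per function name.
custom = grouped.agg([p90, mean_abs_dev, first_mode])
print(custom)
```

Custom Python functions are generally slower than the built-in aggregations, so it is worth reaching for the built-ins first.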

You can use these aggregations in the following way:

df.groupby('customer_id').agg({'revenue': ['sum', 'mean', 'std'], 'product_id': ['count', 'nunique']})
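
Here is a runnable sketch of that call on a small made-up DataFrame (the data is hypothetical). Passing a dictionary of lists produces two-level column names such as ('revenue', 'sum'). If you would rather have flat column names, named aggregation is one alternative; the output names chosen here (`total_revenue` and so on) are just examples.

```python
import pandas as pd

# Hypothetical order data for illustration.
df = pd.DataFrame({
    "customer_id": ["A", "A", "B", "B", "B"],
    "product_id": ["p1", "p2", "p1", "p1", "p3"],
    "revenue": [10.0, 30.0, 5.0, 15.0, 40.0],
})

# Dictionary form: the result has MultiIndex columns like ('revenue', 'sum').
summary = df.groupby("customer_id").agg(
    {"revenue": ["sum", "mean", "std"], "product_id": ["count", "nunique"]}
)
print(summary)

# Named aggregation: output_name=(column, aggregation) gives flat column names.
flat = df.groupby("customer_id").agg(
    total_revenue=("revenue", "sum"),
    mean_revenue=("revenue", "mean"),
    orders=("product_id", "count"),
    unique_products=("product_id", "nunique"),
)
print(flat)
```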

Stephen Allwright

I'm a Data Scientist currently working for Oda, an online grocery retailer, in Oslo, Norway. These posts are my way of sharing some of the tips and tricks I've picked up along the way.