Advanced GroupBy Operations

Advanced GroupBy Operations in Pandas (Step-by-Step Guide)

GroupBy operations split data into groups based on some criteria, apply a function to each group independently, and combine the results. Advanced operations include custom aggregations, transformations, and filtering.

Key Concepts

  • Split-Apply-Combine: The three-step process of groupby
  • Aggregation: Compute summary statistics
  • Transformation: Perform group-specific operations
  • Filtration: Filter groups based on conditions

1. Sample Dataset – Our Starting Point


First, let’s build a small sales dataset to use in all examples:

import pandas as pd
import streamlit as st

# Sample sales data
sales_data = {
    'Region': ['North', 'North', 'South', 'South', 'East', 'East', 'West', 'West'],
    'Product': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [250, 300, 220, 280, 290, 310, 240, 270],
    'Profit': [50, 60, 40, 55, 58, 62, 48, 54],
    'Units': [25, 30, 22, 28, 29, 31, 24, 27]
}

df_sales = pd.DataFrame(sales_data)
st.subheader("Original Sales Data:")
st.write(df_sales)

Output:

We’ll use this same DataFrame for all examples so you can see how each function changes or summarizes the data.

2. Basic GroupBy Aggregation – groupby() + sum()

What it does

This is the simplest form of groupby: group rows by a column and calculate a summary statistic (like sum).

Code

# Basic groupby with aggregation
st.subheader("Sales by Region:")
st.write(df_sales.groupby('Region')['Sales'].sum())
print("\n")

Explanation

  • df_sales.groupby('Region') splits the data into four groups: East, North, South, West.
  • ['Sales'].sum() calculates the total sales for each region.

Output

3. Multiple Aggregations with agg()

What it does

agg() lets you compute several statistics at once for different columns. This is great for building quick summary tables.

Code

# Multiple aggregations
st.subheader("Multiple aggregations by Region:")
st.write(df_sales.groupby('Region').agg({
    'Sales': ['sum', 'mean', 'max'],
    'Profit': ['sum', 'mean'],
    'Units': 'sum'
}))

Explanation

For each region, we’re asking pandas to:

  • For Sales: calculate sum, mean, and max.
  • For Profit: calculate sum and mean.
  • For Units: calculate sum.

Output

This gives a compact summary per region, perfect for reporting or dashboards.

4. Custom Aggregation with groupby().apply()

What it does

apply() lets you define your own function and run it for each group. Here we’ll calculate profit margin by region.

Code

# Custom aggregation function
def profit_margin(group):
    return (group['Profit'].sum() / group['Sales'].sum()) * 100

st.subheader("Profit Margin by Region:")
st.write(df_sales.groupby('Region').apply(profit_margin))

Explanation

  • The function profit_margin() takes a subset of the DataFrame (one region at a time).
  • It calculates:
    Profit Margin = (Total Profit ÷ Total Sales) × 100
  • groupby('Region').apply(profit_margin) runs this function for each region.

Output

Now you’re not just summarizing numbers – you’re computing meaningful business metrics with your own logic.

5. Using transform() to Add Group Statistics Back to the Data

What it does

transform() returns a result that has the same length as the original DataFrame. That means you can add group-level statistics as new columns, aligned with each row.

Code

# Transform - add group statistics to original DataFrame
df_sales['Region_Avg_Sales'] = df_sales.groupby('Region')['Sales'].transform('mean')
df_sales['Sales_vs_Regional_Avg'] = df_sales['Sales'] - df_sales['Region_Avg_Sales']
st.subheader("With Regional Average:")
st.write(df_sales)

Explanation

  • transform('mean') computes the mean for each region and repeats that value for every row in that region.
  • We then subtract the regional average from each row’s sales to see whether that row is above or below the region’s typical value.

Output

This is extremely useful in real-world analysis: you can flag rows that overperform or underperform relative to their group.

6. Filtering Groups with groupby().filter()

What it does

filter() allows you to keep only the groups that meet a condition. Here we keep only regions whose total sales are greater than 500.

Code

# Filter - keep only groups meeting criteria
high_sales_regions = df_sales.groupby('Region').filter(lambda x: x['Sales'].sum() > 500)
st.subheader("Regions with total sales > 500:")
st.write(high_sales_regions)
print("\n")

Or inline with a lambda:

high_sales_regions = df_sales.groupby('Region').filter(
    lambda x: x['Sales'].sum() > 500
)

Explanation

  • The function returns True if a region’s total sales are above 500.
  • filter() keeps all rows from groups that return True and drops the others.

Output

Here, South is removed because its total sales are exactly 500, not greater than 500.

7. Grouping by Multiple Columns

What it does

You can group by more than one column to get multi-level summaries. Here we compute sales by Region and Product together.

Code

# Multiple groupby columns
st.subheader("Sales by Region and Product:")
st.write(df_sales.groupby(['Region', 'Product'])['Sales'].sum())

8. Clean Summary Tables with Named Aggregations

What it does

Named aggregations let you give readable names to the resulting columns, making your summary tables ready for reports.

Code

# Named aggregations (Pandas 0.25+)
st.subheader("Named aggregations:")
result = df_sales.groupby('Region').agg(
    total_sales=('Sales', 'sum'),
    avg_profit=('Profit', 'mean'),
    max_units=('Units', 'max')
)
st.write(result)

Explanation

  • Each argument to agg() has the form:
    new_column_name = ('existing_column', 'aggregation_function')
  • This avoids the multi-level column names that you get with dict-style agg.

Output

Now you have a clean, nicely labeled summary DataFrame – ideal for exporting to Excel, sending to stakeholders, or visualizing.

Complete code:

import pandas as pd
import streamlit as st

# Sample sales data
sales_data = {
    'Region': ['North', 'North', 'South', 'South', 'East', 'East', 'West', 'West'],
    'Product': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [250, 300, 220, 280, 290, 310, 240, 270],
    'Profit': [50, 60, 40, 55, 58, 62, 48, 54],
    'Units': [25, 30, 22, 28, 29, 31, 24, 27]
}

df_sales = pd.DataFrame(sales_data)
st.subheader("Original Sales Data:")
st.write(df_sales)


# Basic groupby with aggregation
st.subheader("Sales by Region:")
st.write(df_sales.groupby('Region')['Sales'].sum())
print("\n")

# Multiple aggregations
st.subheader("Multiple aggregations by Region:")
st.write(df_sales.groupby('Region').agg({
    'Sales': ['sum', 'mean', 'max'],
    'Profit': ['sum', 'mean'],
    'Units': 'sum'
}))
print("\n")

# Custom aggregation function
def profit_margin(group):
    return (group['Profit'].sum() / group['Sales'].sum()) * 100

st.subheader("Profit Margin by Region:")
st.write(df_sales.groupby('Region').apply(profit_margin))

# Transform - add group statistics to original DataFrame
df_sales['Region_Avg_Sales'] = df_sales.groupby('Region')['Sales'].transform('mean')
df_sales['Sales_vs_Regional_Avg'] = df_sales['Sales'] - df_sales['Region_Avg_Sales']
st.subheader("With Regional Average:")
st.write(df_sales)

# Filter - keep only groups meeting criteria
high_sales_regions = df_sales.groupby('Region').filter(lambda x: x['Sales'].sum() > 500)
st.subheader("Regions with total sales > 500:")
st.write(high_sales_regions)
print("\n")

# Multiple groupby columns
st.subheader("Sales by Region and Product:")
st.write(df_sales.groupby(['Region', 'Product'])['Sales'].sum())
print("\n")

# Named aggregations (Pandas 0.25+)
st.subheader("Named aggregations:")
result = df_sales.groupby('Region').agg(
    total_sales=('Sales', 'sum'),
    avg_profit=('Profit', 'mean'),
    max_units=('Units', 'max')
)
st.write(result)

Final Thoughts

In this article, we explored several powerful patterns for advanced GroupBy usage in pandas:

  • Simple aggregation with groupby() + sum()
  • Multiple metrics with agg()
  • Custom metrics with apply()
  • Adding group stats back to rows with transform()
  • Filtering whole groups with filter()
  • Multi-column grouping
  • Clean summary tables with named aggregations