Advanced Merging and Joining in Pandas: A Complete Tutorial

Understanding Data Merging: Combining Multiple DataFrames

When working with real-world data, information is rarely contained in a single table. You’ll often need to combine data from multiple sources—customer information from one database, order details from another, and product specifications from a third. This is where pandas’ powerful merging and joining capabilities become essential.

In this tutorial, we’ll explore different ways to combine DataFrames using a practical e-commerce scenario. We’ll work with customer data, order information, and product details to demonstrate each merge type and when to use it.


Setting Up Our Sample Data

Let’s start by creating three sample datasets that represent a typical e-commerce database structure.

Customer Information Dataset

import pandas as pd
import streamlit as st

# Customer data - Contains basic customer information
customers = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'City': ['NYC', 'LA', 'Chicago', 'Boston', 'Seattle']
})

st.subheader("Customers:")
st.write(customers)

What we have here:

  • Five registered customers in our system
  • Each customer has a unique ID, name, and city
  • This represents our master customer database

Output:

Order Transaction Dataset:

# Orders data
orders = pd.DataFrame({
    'OrderID': [101, 102, 103, 104, 105, 106],
    'CustomerID': [1, 2, 2, 3, 6, 1],
    'Amount': [250, 300, 150, 400, 200, 350],
    'Product': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Mouse', 'Keyboard']
})
st.subheader("\nOrders:")
st.write(orders)

Output:

Merge Type 1: Inner Join – Only Matching Records

An inner join returns only the records that exist in both DataFrames. Think of it as the intersection in a Venn diagram.

# Inner Join (default)
inner_merge = pd.merge(customers, orders, on='CustomerID', how='inner')
st.subheader("Inner Join (only matching customers):")
st.write(inner_merge)
print("\n")


When to use inner join:

  • When you only want complete records
  • When missing data on either side should be excluded
  • For analytics where you need guaranteed relationships
  • Example: “Show me all customers who have actually made purchases”

Output:

Merge Type 2: Left Join – Keep All Left Table Records

A left join keeps ALL records from the left DataFrame (customers) and matches records from the right DataFrame (orders) where possible. Unmatched records get NaN values.

# Left Join (all customers, even without orders)
left_merge = pd.merge(customers, orders, on='CustomerID', how='left')
st.subheader("Left Join (all customers):")
st.write(left_merge)
print("\n")

When to use left join:

  • When the left table is your “master” reference
  • When you want to preserve all records from the primary dataset
  • For analysis that includes entities without relationships
  • Example: “Show me all customers, including those who haven’t ordered yet”

Output:

Merge Type 3: Right Join – Keep All Right Table Records
A right join is the opposite of a left join—it keeps ALL records from the right DataFrame (orders) and matches records from the left DataFrame (customers) where possible.

# Right Join (all orders, even if customer not in customers table)
right_merge = pd.merge(customers, orders, on='CustomerID', how='right')
st.subheader("Right Join (all orders):")
st.write(right_merge)
print("\n")

Output:

When to use right join:

  • When the right table is more important
  • When you want to preserve all transactions
  • For data quality audits
  • Example: “Show me all orders, even those with invalid customer IDs”

Merge Type 4: Outer Join – Keep Everything

An outer join (also called a full outer join) keeps ALL records from BOTH DataFrames, regardless of whether there’s a match. This is like taking the union of both datasets.

# Outer Join (all records from both)
outer_merge = pd.merge(customers, orders, on='CustomerID', how='outer')
st.subheader("Outer Join (all records):")
st.write(outer_merge)
print("\n")

Output:

When to use outer join:

  • When you need to see everything
  • For comprehensive data audits
  • When both datasets might have unique records
  • Example: “Show me all customers and all orders, highlighting any gaps”

Advanced Technique: Merge with Indicator Column

The indicator parameter adds a special column that tells you where each row came from. This is incredibly useful for data quality analysis.

# Merge with indicator
merge_indicator = pd.merge(customers, orders, on='CustomerID', how='outer', indicator=True)
st.subheader("Merge with indicator:")
st.write(merge_indicator)
print("\n")

Output:

What the indicator column shows:

  • left_only: Record exists only in the left DataFrame (customers without orders)
  • right_only: Record exists only in the right DataFrame (orders without customers)
  • both: Record exists in both DataFrames (matched records)

Practical uses:

# Find customers without orders
inactive_customers = merge_indicator[merge_indicator['_merge'] == 'left_only']

# Find orders without valid customers (data quality issue)
orphaned_orders = merge_indicator[merge_indicator['_merge'] == 'right_only']

# Find successful matches
matched_records = merge_indicator[merge_indicator['_merge'] == 'both']

Real-world application: This is perfect for monthly reporting where you need to identify:

  • New customers who haven’t made their first purchase yet
  • Data integrity issues (orders with invalid customer references)
  • Active customer-order relationships

Multi-Table Merging: Joining Three DataFrames

In real applications, you often need to combine more than two tables. Let’s add product information and merge all three datasets together.

# Multiple DataFrames
products = pd.DataFrame({
    'Product': ['Laptop', 'Phone', 'Tablet', 'Monitor'],
    'Category': ['Electronics', 'Electronics', 'Electronics', 'Electronics'],
    'Cost': [800, 500, 300, 350]
})

# Merge multiple DataFrames
multi_merge = pd.merge(
    pd.merge(customers, orders, on='CustomerID'),
    products,
    on='Product',
    how='left'
)
st.subheader("Multiple DataFrame merge:")
st.write(multi_merge)
print("\n")

Step-by-step process:

  1. Inner merge customers and orders (only matching records)
  2. Left merge the result with products (keep all orders, add product info)
  3. Result: Customer-Order-Product combined view

Why left join for the second merge?

  • We want to keep all orders even if product details are missing
  • Products like “Mouse” and “Keyboard” aren’t in our products table
  • This prevents data loss

Output:

Merging on Different Column Names
Sometimes your DataFrames use different names for the same concept. Pandas handles this elegantly.

# Merging on different column names
df1 = pd.DataFrame({
    'emp_id': [1, 2, 3],
    'emp_name': ['John', 'Jane', 'Jack']
})

df2 = pd.DataFrame({
    'employee_id': [1, 2, 4],
    'salary': [50000, 60000, 55000]
})

diff_cols_merge = pd.merge(df1, df2, left_on='emp_id', right_on='employee_id', how='outer')
st.subheader("Merge on different column names:")
st.write(diff_cols_merge)
print("\n")

Output:

Key parameters:

  • left_on: Specifies the joining column from the left DataFrame
  • right_on: Specifies the joining column from the right DataFrame
  • Both columns remain in the result (unlike using on=)

Join Method: Using Index-Based Joining

The join() method is a shortcut for merging on index values. This is faster and cleaner when your DataFrames are already indexed properly.

# Join using index
df_indexed1 = customers.set_index('CustomerID')
df_indexed2 = orders.set_index('CustomerID')

joined = df_indexed1.join(df_indexed2, how='inner', lsuffix='_cust', rsuffix='_order')
st.subheader("Join using index:")
st.write(joined)

How join() differs from merge():

  • join() works on index by default
  • merge() works on columns by default
  • join() is more concise when using indexes
  • merge() is more flexible for complex cases

The suffix parameters: When both DataFrames have columns with the same name, suffixes help distinguish them:

  • lsuffix='_cust': Adds ‘_cust’ to left DataFrame column names
  • rsuffix='_order': Adds ‘_order’ to right DataFrame column names

Complete code:

import pandas as pd
import streamlit as st


# Customer data
customers = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'City': ['NYC', 'LA', 'Chicago', 'Boston', 'Seattle']
})

# Orders data
orders = pd.DataFrame({
    'OrderID': [101, 102, 103, 104, 105, 106],
    'CustomerID': [1, 2, 2, 3, 6, 1],
    'Amount': [250, 300, 150, 400, 200, 350],
    'Product': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Mouse', 'Keyboard']
})

st.subheader("Customers:")
st.write(customers)
st.subheader("\nOrders:")
st.write(orders)
print("\n")

# Inner Join (default)
inner_merge = pd.merge(customers, orders, on='CustomerID', how='inner')
st.subheader("Inner Join (only matching customers):")
st.write(inner_merge)
print("\n")

# Left Join (all customers, even without orders)
left_merge = pd.merge(customers, orders, on='CustomerID', how='left')
st.subheader("Left Join (all customers):")
st.write(left_merge)
print("\n")

# Right Join (all orders, even if customer not in customers table)
right_merge = pd.merge(customers, orders, on='CustomerID', how='right')
st.subheader("Right Join (all orders):")
st.write(right_merge)
print("\n")

# Outer Join (all records from both)
outer_merge = pd.merge(customers, orders, on='CustomerID', how='outer')
st.subheader("Outer Join (all records):")
st.write(outer_merge)
print("\n")

# Merge with indicator
merge_indicator = pd.merge(customers, orders, on='CustomerID', how='outer', indicator=True)
st.subheader("Merge with indicator:")
st.write(merge_indicator)
print("\n")

# Multiple DataFrames
products = pd.DataFrame({
    'Product': ['Laptop', 'Phone', 'Tablet', 'Monitor'],
    'Category': ['Electronics', 'Electronics', 'Electronics', 'Electronics'],
    'Cost': [800, 500, 300, 350]
})

# Merge multiple DataFrames
multi_merge = pd.merge(
    pd.merge(customers, orders, on='CustomerID'),
    products,
    on='Product',
    how='left'
)
st.subheader("Multiple DataFrame merge:")
st.write(multi_merge)
print("\n")

# Merging on different column names
df1 = pd.DataFrame({
    'emp_id': [1, 2, 3],
    'emp_name': ['John', 'Jane', 'Jack']
})

df2 = pd.DataFrame({
    'employee_id': [1, 2, 4],
    'salary': [50000, 60000, 55000]
})

diff_cols_merge = pd.merge(df1, df2, left_on='emp_id', right_on='employee_id', how='outer')
st.subheader("Merge on different column names:")
st.write(diff_cols_merge)
print("\n")

# Join using index
df_indexed1 = customers.set_index('CustomerID')
df_indexed2 = orders.set_index('CustomerID')

joined = df_indexed1.join(df_indexed2, how='inner', lsuffix='_cust', rsuffix='_order')
st.subheader("Join using index:")
st.write(joined)

Performance note: Index-based joins are faster than column-based merges because indexes are optimized for lookups. If you’re doing many merges on the same column, set it as the index first.