Advanced Pandas Operations Guide

1. String Operations and Regex

Definition

Pandas provides vectorized string operations through the .str accessor, allowing you to perform string manipulations on entire Series efficiently. Regular expressions (regex) enable powerful pattern matching and text extraction.

Key Concepts

  • Vectorized Operations: Apply string methods to entire columns at once
  • Regex Patterns: Use pattern matching for complex text operations
  • String Accessor (.str): Gateway to string methods in Pandas
  • Extract and Split: Parse structured text data

Example

import pandas as pd
import numpy as np
import re

# Sample data with various string formats
data = {
    'Name': ['John Doe', 'Jane Smith', 'Bob JOHNSON', 'alice brown', 'Charlie Davis'],
    'Email': ['john.doe@email.com', 'jane_smith@company.org', 'bob.j@test.net', 
              'alice@example.com', 'charlie.davis@work.co.uk'],
    'Phone': ['(555) 123-4567', '555-234-5678', '(555)345-6789', 
              '555 456 7890', '+1-555-567-8901'],
    'Product_Code': ['ELEC-2024-001', 'FURN-2023-052', 'ELEC-2024-003',
                     'CLTH-2024-015', 'FURN-2023-089'],
    'Description': ['  Premium Laptop  ', 'Office Chair', '  SMART TV  ',
                    'Cotton T-Shirt', '  Wooden Desk  ']
}

df = pd.DataFrame(data)
print("Original Data:")
print(df)
print("\n")

# Basic String Operations
# Convert to lowercase
df['Name_Lower'] = df['Name'].str.lower()
# Convert to uppercase
df['Name_Upper'] = df['Name'].str.upper()
# Title case
df['Name_Title'] = df['Name'].str.title()
print("Case conversions:")
print(df[['Name', 'Name_Lower', 'Name_Upper', 'Name_Title']])
print("\n")

# Strip whitespace
df['Description_Clean'] = df['Description'].str.strip()
print("Stripped descriptions:")
print(df[['Description', 'Description_Clean']])
print("\n")

# String length
df['Name_Length'] = df['Name'].str.len()
print("Name lengths:")
print(df[['Name', 'Name_Length']])
print("\n")

# Contains (case-sensitive by default)
df['Has_John'] = df['Name'].str.contains('John')
df['Has_John_CaseInsensitive'] = df['Name'].str.contains('john', case=False)
print("Contains 'John':")
print(df[['Name', 'Has_John', 'Has_John_CaseInsensitive']])
print("\n")

# Starts with / Ends with
df['Starts_With_J'] = df['Name'].str.startswith('J')
df['Ends_With_n'] = df['Name'].str.endswith('n')
print("Starts/Ends with:")
print(df[['Name', 'Starts_With_J', 'Ends_With_n']])
print("\n")

# Replace
df['Phone_Clean'] = df['Phone'].str.replace(r'[^\d]', '', regex=True)
print("Phone numbers cleaned (digits only):")
print(df[['Phone', 'Phone_Clean']])
print("\n")

# Split strings
df['First_Name'] = df['Name'].str.split().str[0]
df['Last_Name'] = df['Name'].str.split().str[-1]
print("Split names:")
print(df[['Name', 'First_Name', 'Last_Name']])
print("\n")

# Split into columns
name_parts = df['Name'].str.split(' ', expand=True)
name_parts.columns = ['First', 'Last']
print("Split into separate columns:")
print(name_parts)
print("\n")
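The expand=True split above assumes every name has exactly two parts. A hedged sketch (with hypothetical names) that tolerates middle names by splitting only once, from the right:

```python
import pandas as pd

# Hypothetical names where a plain split on ' ' would produce ragged columns
names = pd.Series(['Mary Ann Lee', 'John Doe', 'Ana Maria de Souza'])
# rsplit with n=1 splits only at the last space, keeping the rest as "First"
parts = names.str.rsplit(' ', n=1, expand=True)
parts.columns = ['First', 'Last']
print(parts)
```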

# Extract using regex
# Extract domain from email (expand=False returns a Series for the single group)
df['Email_Domain'] = df['Email'].str.extract(r'@([a-zA-Z0-9.-]+)', expand=False)
print("Extracted email domains:")
print(df[['Email', 'Email_Domain']])
print("\n")

# Extract multiple groups from Product_Code
product_parts = df['Product_Code'].str.extract(r'([A-Z]+)-(\d+)-(\d+)')
product_parts.columns = ['Category', 'Year', 'Serial']
print("Extracted product code parts:")
print(product_parts)
print("\n")

# Extract all matches (returns list)
text_data = pd.Series(['Item A123 and B456', 'Product C789', 'No codes here'])
codes = text_data.str.findall(r'[A-Z]\d+')
print("Extract all codes:")
print(codes)
print("\n")
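findall returns a list per row; str.extractall, sketched here on the same text_data, instead returns one row per match, indexed by (original row, match number):

```python
import pandas as pd

text_data = pd.Series(['Item A123 and B456', 'Product C789', 'No codes here'])
# extractall requires a capture group; rows with no match are simply absent
all_codes = text_data.str.extractall(r'([A-Z]\d+)')
print(all_codes)
```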

# String slicing (slice bounds must be fixed positions, so extract the username with split)
df['Email_Username'] = df['Email'].str.split('@').str[0]
df['Area_Code'] = df['Phone_Clean'].str[:3]
print("String slicing:")
print(df[['Email', 'Email_Username']])
print(df[['Phone_Clean', 'Area_Code']])
print("\n")

# Concatenate strings
df['Full_Info'] = df['Name'] + ' - ' + df['Email']
print("Concatenated strings:")
print(df['Full_Info'])
print("\n")

# Count occurrences
test_series = pd.Series(['apple, banana, apple', 'orange, apple', 'banana'])
test_series_count = test_series.str.count('apple')
print("Count 'apple' occurrences:")
print(test_series_count)
print("\n")

# Regex pattern matching with match (anchored at beginning)
df['Email_Valid'] = df['Email'].str.match(r'^[a-zA-Z0-9._]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')
print("Email validation:")
print(df[['Email', 'Email_Valid']])
print("\n")

# Using regex flags
text_with_html = pd.Series(['<p>Hello</p>', '<div>World</div>', 'Plain text'])
cleaned_html = text_with_html.str.replace(r'<[^>]+>', '', regex=True)
print("Remove HTML tags:")
print(cleaned_html)
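As a performance note, the .str methods also accept a precompiled pattern, so a regex reused across several calls is compiled only once (a small sketch reusing the HTML example):

```python
import re
import pandas as pd

tag_pattern = re.compile(r'<[^>]+>')  # compiled once, reused below
text_with_html = pd.Series(['<p>Hello</p>', '<div>World</div>', 'Plain text'])
stripped = text_with_html.str.replace(tag_pattern, '', regex=True)
has_tag = text_with_html.str.contains(tag_pattern)
print(stripped)
print(has_tag)
```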

2. Categorical Data and Memory Optimization

Definition

A categorical data type is used for variables that take on a limited number of distinct values. Converting to categorical can significantly reduce memory usage and improve performance, especially with large datasets.

Key Concepts

  • Memory Efficiency: Categorical data stores values as integers internally
  • Ordered Categories: Define order for sorting and comparison
  • Category Operations: Add, remove, rename categories
  • Performance: Faster groupby and value_counts operations

Example

# Create sample data with repetitive values
np.random.seed(42)
size = 100000

data = {
    'ID': range(size),
    'Country': np.random.choice(['USA', 'UK', 'Canada', 'Australia', 'Germany'], size),
    'Product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard'], size),
    'Size': np.random.choice(['Small', 'Medium', 'Large', 'XLarge'], size),
    'Rating': np.random.choice(['Poor', 'Fair', 'Good', 'Excellent'], size),
    'Sales': np.random.randint(100, 1000, size)
}

df_large = pd.DataFrame(data)

print("Original DataFrame info:")
print(df_large.info(memory_usage='deep'))
print("\n")

# Check memory usage before conversion
print("Memory usage before categorical conversion:")
print(df_large.memory_usage(deep=True))
print(f"Total: {df_large.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print("\n")

# Convert to categorical
df_categorical = df_large.copy()
df_categorical['Country'] = df_categorical['Country'].astype('category')
df_categorical['Product'] = df_categorical['Product'].astype('category')
df_categorical['Size'] = df_categorical['Size'].astype('category')
df_categorical['Rating'] = df_categorical['Rating'].astype('category')

print("Memory usage after categorical conversion:")
print(df_categorical.memory_usage(deep=True))
print(f"Total: {df_categorical.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print("\n")

# Calculate memory savings
original_memory = df_large.memory_usage(deep=True).sum()
categorical_memory = df_categorical.memory_usage(deep=True).sum()
savings = (1 - categorical_memory / original_memory) * 100
print(f"Memory savings: {savings:.2f}%")
print("\n")

# Working with smaller dataset for demonstration
df_small = df_large.head(20).copy()

# Create ordered categorical
size_order = ['Small', 'Medium', 'Large', 'XLarge']
df_small['Size_Ordered'] = pd.Categorical(
    df_small['Size'],
    categories=size_order,
    ordered=True
)

print("Ordered categorical:")
print(df_small[['Size', 'Size_Ordered']].head(10))
print(f"Is ordered: {df_small['Size_Ordered'].cat.ordered}")
print("\n")

# Sorting with ordered categorical
df_sorted = df_small.sort_values('Size_Ordered')
print("Sorted by Size (ordered):")
print(df_sorted[['ID', 'Size_Ordered']].head(10))
print("\n")

# Access category properties
print("Categories:", df_small['Size_Ordered'].cat.categories)
print("Codes (internal representation):", df_small['Size_Ordered'].cat.codes[:10].tolist())
print("\n")

# Add new category
df_small['Size_Ordered'] = df_small['Size_Ordered'].cat.add_categories(['XXLarge'])
print("Categories after adding XXLarge:", df_small['Size_Ordered'].cat.categories)
print("\n")

# Remove category
df_small['Size_Ordered'] = df_small['Size_Ordered'].cat.remove_categories(['XXLarge'])
print("Categories after removing XXLarge:", df_small['Size_Ordered'].cat.categories)
print("\n")

# Rename categories
rating_cat = pd.Categorical(df_small['Rating'])
df_small['Rating_Cat'] = rating_cat
df_small['Rating_Cat'] = df_small['Rating_Cat'].cat.rename_categories({
    'Poor': '1-Poor',
    'Fair': '2-Fair',
    'Good': '3-Good',
    'Excellent': '4-Excellent'
})
print("Renamed categories:")
print(df_small[['Rating', 'Rating_Cat']].head(10))
print("\n")

# Reorder categories
df_small['Rating_Cat'] = df_small['Rating_Cat'].cat.reorder_categories([
    '1-Poor', '2-Fair', '3-Good', '4-Excellent'
], ordered=True)
print("Reordered categories:")
print(df_small['Rating_Cat'].cat.categories)
print("\n")
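With the ordered categories in place, comparison operators and min/max respect the defined order; a standalone sketch (rebuilding the renamed categories so it runs on its own):

```python
import pandas as pd

ratings = pd.Series(pd.Categorical(
    ['2-Fair', '4-Excellent', '1-Poor', '3-Good'],
    categories=['1-Poor', '2-Fair', '3-Good', '4-Excellent'],
    ordered=True
))
print(ratings[ratings >= '3-Good'])   # order-aware filtering
print(ratings.min(), ratings.max())   # min/max only work when ordered=True
```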

# Value counts on categorical (faster)
print("Value counts on categorical data:")
print(df_categorical['Country'].value_counts())
print("\n")

# GroupBy on categorical (faster; observed=True skips empty categories and
# silences the FutureWarning in recent Pandas versions)
print("GroupBy on categorical data:")
print(df_categorical.groupby('Country', observed=True)['Sales'].mean().head())
print("\n")

# Convert multiple columns at once
df_multi_cat = df_large.head(1000).copy()
categorical_columns = ['Country', 'Product', 'Size', 'Rating']
df_multi_cat[categorical_columns] = df_multi_cat[categorical_columns].astype('category')

print("Multiple columns converted:")
print(df_multi_cat.dtypes)
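A CategoricalDtype lets you declare the categories (and their order) up front, including at load time; a sketch with a hypothetical 'sales.csv':

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

size_type = CategoricalDtype(['Small', 'Medium', 'Large', 'XLarge'], ordered=True)
s = pd.Series(['Medium', 'Small', 'XLarge']).astype(size_type)
print(s.dtype, s.cat.ordered)
# The same dtype works when reading files, avoiding a separate conversion pass:
# pd.read_csv('sales.csv', dtype={'Size': size_type})  # hypothetical file
```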

3. Apply, Map, and ApplyMap

Definition

These methods let you apply custom functions to DataFrames and Series. apply() works along rows or columns, map() transforms a Series element-wise, and applymap() (replaced by DataFrame.map() in Pandas 2.1+) transforms every element of a DataFrame.

Key Concepts

  • apply(): Apply function along axis (rows or columns)
  • map(): Element-wise transformation for Series
  • applymap(): Element-wise transformation for DataFrame (deprecated, use map())
  • Lambda Functions: Anonymous functions for quick operations
  • Vectorization: Prefer built-in methods when possible for speed

Example

# Sample data
employees = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Department': ['Sales', 'IT', 'HR', 'Sales', 'IT'],
    'Salary': [50000, 60000, 45000, 55000, 65000],
    'Years_Experience': [3, 5, 2, 4, 7],
    'Performance_Score': [85, 92, 78, 88, 95]
})

print("Employee Data:")
print(employees)
print("\n")

# ========== APPLY ON SERIES ==========
# Apply function to a single column
def categorize_salary(salary):
    if salary < 50000:
        return 'Low'
    elif salary < 60000:
        return 'Medium'
    else:
        return 'High'

employees['Salary_Category'] = employees['Salary'].apply(categorize_salary)
print("Apply function to Series:")
print(employees[['Name', 'Salary', 'Salary_Category']])
print("\n")

# Using lambda function
employees['Bonus'] = employees['Salary'].apply(lambda x: x * 0.1)
print("Lambda function on Series:")
print(employees[['Name', 'Salary', 'Bonus']])
print("\n")

# ========== APPLY ON DATAFRAME (axis=1 for rows) ==========
# Apply function across rows
def calculate_performance_bonus(row):
    base_bonus = row['Salary'] * 0.05
    performance_multiplier = row['Performance_Score'] / 100
    return base_bonus * performance_multiplier

employees['Performance_Bonus'] = employees.apply(calculate_performance_bonus, axis=1)
print("Apply function across rows:")
print(employees[['Name', 'Salary', 'Performance_Score', 'Performance_Bonus']])
print("\n")

# Multiple return values using apply
def employee_summary(row):
    return pd.Series({
        'Total_Comp': row['Salary'] + row['Performance_Bonus'],
        'Experience_Level': 'Senior' if row['Years_Experience'] >= 5 else 'Junior',
        'High_Performer': row['Performance_Score'] >= 90
    })

summary_df = employees.apply(employee_summary, axis=1)
print("Apply with multiple return values:")
print(summary_df)
print("\n")

# ========== APPLY ON DATAFRAME (axis=0 for columns) ==========
numeric_df = employees[['Salary', 'Years_Experience', 'Performance_Score']]

# Apply function to each column
column_stats = numeric_df.apply(lambda x: pd.Series({
    'mean': x.mean(),
    'std': x.std(),
    'min': x.min(),
    'max': x.max()
}))
print("Apply function to columns:")
print(column_stats)
print("\n")

# ========== MAP ON SERIES ==========
# Map with dictionary
department_codes = {
    'Sales': 'SLS',
    'IT': 'INF',
    'HR': 'HRM'
}
employees['Dept_Code'] = employees['Department'].map(department_codes)
print("Map with dictionary:")
print(employees[['Name', 'Department', 'Dept_Code']])
print("\n")

# Map with function
employees['Experience_Doubled'] = employees['Years_Experience'].map(lambda x: x * 2)
print("Map with function:")
print(employees[['Name', 'Years_Experience', 'Experience_Doubled']])
print("\n")

# Map with Series (useful for lookups)
dept_budgets = pd.Series({
    'Sales': 1000000,
    'IT': 1500000,
    'HR': 500000
})
employees['Dept_Budget'] = employees['Department'].map(dept_budgets)
print("Map with Series:")
print(employees[['Name', 'Department', 'Dept_Budget']])
print("\n")

# ========== APPLYMAP (deprecated) / MAP for DataFrame ==========
# In newer Pandas versions, use .map() instead of .applymap()
sample_df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
})

print("Original DataFrame:")
print(sample_df)
print("\n")

# Apply function to every element (use map in Pandas 2.1+)
# For older versions: sample_df.applymap(lambda x: x ** 2)
try:
    squared_df = sample_df.map(lambda x: x ** 2)
except AttributeError:
    # Fallback for older Pandas versions
    squared_df = sample_df.applymap(lambda x: x ** 2)

print("Squared values (element-wise):")
print(squared_df)
print("\n")

# ========== PERFORMANCE COMPARISON ==========
print("Performance Comparison:")
print("-" * 50)

# Create larger dataset
large_df = pd.DataFrame({
    'values': np.random.randn(10000)
})

# Method 1: Apply (slower)
import time
start = time.time()
result1 = large_df['values'].apply(lambda x: x ** 2)
apply_time = time.time() - start

# Method 2: Vectorized operation (faster)
start = time.time()
result2 = large_df['values'] ** 2
vectorized_time = time.time() - start

print(f"Apply time: {apply_time:.4f} seconds")
print(f"Vectorized time: {vectorized_time:.4f} seconds")
print(f"Speedup: {apply_time/vectorized_time:.2f}x")
print("\n")

# ========== ADVANCED APPLY USAGE ==========
# Using apply with args
def add_value(x, value_to_add):
    return x + value_to_add

employees['Salary_Plus_5000'] = employees['Salary'].apply(add_value, args=(5000,))
print("Apply with arguments:")
print(employees[['Name', 'Salary', 'Salary_Plus_5000']])
print("\n")

# Conditional apply
def conditional_bonus(row):
    if row['Department'] == 'Sales':
        return row['Salary'] * 0.15
    elif row['Department'] == 'IT':
        return row['Salary'] * 0.12
    else:
        return row['Salary'] * 0.10

employees['Dept_Specific_Bonus'] = employees.apply(conditional_bonus, axis=1)
print("Conditional apply across rows:")
print(employees[['Name', 'Department', 'Salary', 'Dept_Specific_Bonus']])
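Per the vectorization advice above, a row-wise conditional like conditional_bonus can usually be replaced with np.select; a standalone sketch of the same rate logic:

```python
import numpy as np
import pandas as pd

emp = pd.DataFrame({'Department': ['Sales', 'IT', 'HR'],
                    'Salary': [50000, 60000, 45000]})
conditions = [emp['Department'] == 'Sales', emp['Department'] == 'IT']
rates = [0.15, 0.12]
# np.select picks the first matching rate per row; default covers everyone else
emp['Dept_Specific_Bonus'] = emp['Salary'] * np.select(conditions, rates, default=0.10)
print(emp)
```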

4. Advanced Missing Data Handling

Definition

Missing data (NaN, None, NaT) requires careful handling. Pandas provides various methods to detect, remove, fill, and interpolate missing values based on different strategies.

Key Concepts

  • Detection: isna(), notna(), isnull(), notnull()
  • Removal: dropna() with various parameters
  • Filling: fillna(), ffill(), bfill()
  • Interpolation: Linear, polynomial, time-based methods
  • Indicators: Add columns to track which values were missing

Example

# Create dataset with missing values
data_missing = {
    'Date': pd.date_range('2024-01-01', periods=15, freq='D'),
    'Temperature': [32, 35, np.nan, 38, 40, np.nan, np.nan, 45, 48, 50, np.nan, 52, 55, np.nan, 58],
    'Humidity': [65, np.nan, 70, 72, np.nan, 75, 78, 80, np.nan, np.nan, 85, 87, 90, 92, np.nan],
    'Rainfall': [0, 0, np.nan, 5, 10, np.nan, 0, 0, 15, np.nan, 20, 0, 0, np.nan, 5],
    'City': ['NYC', 'NYC', 'NYC', np.nan, 'NYC', 'LA', 'LA', np.nan, 'LA', 'LA', 'Chicago', 'Chicago', np.nan, 'Chicago', 'Chicago']
}

df_missing = pd.DataFrame(data_missing)
print("DataFrame with missing values:")
print(df_missing)
print("\n")

# ========== DETECTION ==========
# Check for missing values
print("Missing values per column:")
print(df_missing.isna().sum())
print("\n")

print("Percentage of missing values per column:")
print((df_missing.isna().sum() / len(df_missing) * 100).round(2))
print("\n")

# Check if any value is missing in each row (a boolean mask avoids adding a helper column)
rows_with_missing = df_missing.isna().any(axis=1)
print("Rows with any missing value:")
print(df_missing[rows_with_missing])
print("\n")

# ========== REMOVAL ==========
# Drop rows with any missing values
df_dropped_any = df_missing.dropna()
print(f"Rows after dropping any NaN: {len(df_dropped_any)} (from {len(df_missing)})")
print(df_dropped_any)
print("\n")

# Drop rows where all values are missing
df_all_missing = pd.DataFrame({
    'A': [1, np.nan, 3, np.nan],
    'B': [np.nan, np.nan, 6, 7],
    'C': [np.nan, np.nan, 9, 10]
})
print("Before dropping all-NaN rows:")
print(df_all_missing)
df_dropped_all = df_all_missing.dropna(how='all')
print("\nAfter dropping all-NaN rows:")
print(df_dropped_all)
print("\n")

# Drop rows with missing values in specific columns
df_dropped_subset = df_missing.dropna(subset=['Temperature', 'City'])
print(f"Rows after dropping NaN in Temperature and City: {len(df_dropped_subset)}")
print(df_dropped_subset)
print("\n")

# Drop columns with missing values
df_dropped_cols = df_missing.dropna(axis=1)
print("Columns after dropping those with NaN:")
print(df_dropped_cols.columns.tolist())
print("\n")

# Drop columns with more than X% missing
threshold = 0.3  # 30%
# thresh is the minimum number of NON-missing values a column needs to be kept
df_threshold = df_missing.dropna(thresh=int(np.ceil((1 - threshold) * len(df_missing))), axis=1)
print(f"Columns with at most {threshold*100:.0f}% missing:")
print(df_threshold.columns.tolist())
print("\n")

# ========== FILLING ==========
# Fill with a constant value
df_filled_constant = df_missing.copy()
df_filled_constant['Temperature'] = df_filled_constant['Temperature'].fillna(0)
print("Fill Temperature with 0:")
print(df_filled_constant[['Date', 'Temperature']])
print("\n")

# Fill with mean/median/mode
df_filled_stats = df_missing.copy()
df_filled_stats['Temperature'] = df_filled_stats['Temperature'].fillna(
    df_filled_stats['Temperature'].mean()
)
df_filled_stats['Humidity'] = df_filled_stats['Humidity'].fillna(
    df_filled_stats['Humidity'].median()
)
print("Fill with mean (Temperature) and median (Humidity):")
print(df_filled_stats[['Temperature', 'Humidity']])
print("\n")

# Forward fill (propagate last valid observation forward)
df_ffill = df_missing.copy()
df_ffill['Temperature'] = df_ffill['Temperature'].ffill()
print("Forward fill Temperature:")
print(df_ffill[['Date', 'Temperature']])
print("\n")

# Backward fill
df_bfill = df_missing.copy()
df_bfill['Temperature'] = df_bfill['Temperature'].bfill()
print("Backward fill Temperature:")
print(df_bfill[['Date', 'Temperature']])
print("\n")

# Fill with limit (only fill certain number of consecutive NaNs)
df_limit = df_missing.copy()
df_limit['Temperature'] = df_limit['Temperature'].ffill(limit=1)
print("Forward fill with limit=1:")
print(df_limit[['Date', 'Temperature']])
print("\n")

# Fill different columns with different values
df_filled_dict = df_missing.copy()
df_filled_dict = df_filled_dict.fillna({
    'Temperature': df_filled_dict['Temperature'].mean(),
    'Humidity': df_filled_dict['Humidity'].median(),
    'Rainfall': 0,
    'City': 'Unknown'
})
print("Fill with dictionary (different values per column):")
print(df_filled_dict)
print("\n")

# ========== INTERPOLATION ==========
# Linear interpolation
df_interp = df_missing.copy()
df_interp['Temperature'] = df_interp['Temperature'].interpolate(method='linear')
print("Linear interpolation:")
print(df_interp[['Date', 'Temperature']])
print("\n")

# Polynomial interpolation (method='polynomial' requires SciPy)
df_interp_poly = df_missing.copy()
df_interp_poly['Temperature'] = df_interp_poly['Temperature'].interpolate(
    method='polynomial',
    order=2
)
print("Polynomial interpolation (order=2):")
print(df_interp_poly[['Date', 'Temperature']].round(2))
print("\n")

# Time-based interpolation
df_time = df_missing.set_index('Date').copy()
df_time['Temperature'] = df_time['Temperature'].interpolate(method='time')
print("Time-based interpolation:")
print(df_time['Temperature'].round(2))
print("\n")

# ========== MISSING DATA INDICATORS ==========
# Create indicator for missing values
df_indicator = df_missing.copy()
df_indicator['Temperature_Was_Missing'] = df_indicator['Temperature'].isna()
df_indicator['Temperature'] = df_indicator['Temperature'].fillna(
    df_indicator['Temperature'].mean()
)
print("With missing data indicator:")
print(df_indicator[['Date', 'Temperature', 'Temperature_Was_Missing']])
print("\n")

# ========== GROUP-BASED FILLING ==========
# Fill missing values with group mean
df_group = df_missing.copy()
df_group['Temperature'] = df_group.groupby('City')['Temperature'].transform(
    lambda x: x.fillna(x.mean())
)
print("Fill with group (City) mean:")
print(df_group[['Date', 'City', 'Temperature']])
print("\n")

# ========== REPLACING SPECIFIC VALUES ==========
# Replace specific values with NaN
df_replace = pd.DataFrame({
    'A': [1, 2, -999, 4, -999],
    'B': [5, -999, 7, 8, 9]
})
print("Before replacing -999 with NaN:")
print(df_replace)
df_replace = df_replace.replace(-999, np.nan)
print("\nAfter replacing:")
print(df_replace)
print("\n")

# ========== ADVANCED: MULTIPLE IMPUTATION CONCEPT ==========
# Simple demonstration of multiple strategies
df_multi_strategy = df_missing.copy()

# Strategy 1: Numerical columns - interpolate
numeric_cols = df_multi_strategy.select_dtypes(include=[np.number]).columns
df_multi_strategy[numeric_cols] = df_multi_strategy[numeric_cols].interpolate()

# Strategy 2: Categorical columns - forward fill
categorical_cols = df_multi_strategy.select_dtypes(include=['object']).columns
df_multi_strategy[categorical_cols] = df_multi_strategy[categorical_cols].ffill()

print("Multi-strategy imputation:")
print(df_multi_strategy)
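A related option, shown here as a brief sketch, is Pandas' nullable extension dtypes: with 'Int64' (capital I), missing values are stored as pd.NA and integer columns are not silently upcast to float:

```python
import pandas as pd

s = pd.Series([1, None, 3], dtype='Int64')  # nullable integer, holds pd.NA
print(s)
print(s.isna().sum())

# convert_dtypes() infers nullable dtypes for an existing DataFrame
df = pd.DataFrame({'a': [1, None, 3]}).convert_dtypes()
print(df.dtypes)
```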

5. Query and Eval for Performance

Definition

query() and eval() provide efficient ways to filter and compute on DataFrames using string expressions. They can be faster than traditional methods for large datasets and offer cleaner, more readable syntax.

Key Concepts

  • query(): Filter rows using boolean expressions as strings
  • eval(): Evaluate expressions and assign results
  • Performance: Uses numexpr engine for optimized computation
  • Variable References: Use @ to reference Python variables
  • Backtick Quoting: Handle column names with spaces or special characters

Example

# Create sample dataset
np.random.seed(42)
n_rows = 100000

large_data = pd.DataFrame({
    'Product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Monitor'], n_rows),
    'Region': np.random.choice(['North', 'South', 'East', 'West'], n_rows),
    'Sales': np.random.randint(100, 1000, n_rows),
    'Profit': np.random.randint(10, 200, n_rows),
    'Quantity': np.random.randint(1, 50, n_rows),
    'Year': np.random.choice([2022, 2023, 2024], n_rows),
    'Customer_Rating': np.random.uniform(1, 5, n_rows)
})

# Create smaller version for display
df_query = large_data.head(20).copy()
print("Sample Data:")
print(df_query)
print("\n")

# ========== BASIC QUERY ==========
# Traditional filtering
traditional = df_query[(df_query['Sales'] > 500) & (df_query['Region'] == 'North')]
print("Traditional filtering:")
print(traditional)
print("\n")

# Query method (cleaner syntax)
query_result = df_query.query('Sales > 500 and Region == "North"')
print("Query method:")
print(query_result)
print("\n")

# ========== QUERY WITH MULTIPLE CONDITIONS ==========
# Complex conditions
result = df_query.query('Sales > 500 and (Region == "North" or Region == "South") and Year >= 2023')
print("Complex query:")
print(result)
print("\n")

# Using in operator
result = df_query.query('Region in ["North", "East"] and Sales > 400')
print("Query with 'in' operator:")
print(result)
print("\n")

# ========== QUERY WITH VARIABLES ==========
# Reference Python variables with @
min_sales = 500
max_sales = 800
target_region = 'West'

result = df_query.query('Sales >= @min_sales and Sales <= @max_sales and Region == @target_region')
print("Query with Python variables (@):")
print(result)
print("\n")

# Using list variables
target_products = ['Laptop', 'Monitor']
result = df_query.query('Product in @target_products and Sales > 500')
print("Query with list variable:")
print(result)
print("\n")

# ========== QUERY WITH SPACES IN COLUMN NAMES ==========
# Handling columns with spaces
df_spaces = pd.DataFrame({
    'Product Name': ['A', 'B', 'C', 'D'],
    'Sales Amount': [100, 200, 300, 400],
    'Profit Margin': [10, 20, 30, 40]
})

result = df_spaces.query('`Sales Amount` > 150 and `Profit Margin` >= 20')
print("Query with spaces in column names (use backticks):")
print(result)
print("\n")

# ========== EVAL FOR COMPUTATIONS ==========
# Traditional computation
df_eval = df_query.copy()
df_eval['Profit_Margin'] = (df_eval['Profit'] / df_eval['Sales']) * 100
print("Traditional computation:")
print(df_eval[['Sales', 'Profit', 'Profit_Margin']].head())
print("\n")

# Using eval (cleaner and potentially faster)
df_eval = df_query.copy()
df_eval.eval('Profit_Margin = (Profit / Sales) * 100', inplace=True)
print("Eval computation:")
print(df_eval[['Sales', 'Profit', 'Profit_Margin']].head())
print("\n")

# Multiple computations
df_eval.eval('''
    Revenue = Sales * Quantity
    Cost = Revenue - Profit
    ROI = (Profit / Cost) * 100
''', inplace=True)
print("Multiple eval computations:")
print(df_eval[['Sales', 'Quantity', 'Revenue', 'Cost', 'Profit', 'ROI']].head())
print("\n")

# ========== EVAL WITH ASSIGNMENT ==========
# Direct assignment
df_eval = df_query.copy()
df_eval.eval('Total_Revenue = Sales * Quantity', inplace=True)
df_eval.eval('High_Value = Total_Revenue > 5000', inplace=True)
print("Eval with assignment:")
print(df_eval[['Sales', 'Quantity', 'Total_Revenue', 'High_Value']].head())
print("\n")

# ========== PERFORMANCE COMPARISON ==========
print("Performance Comparison (100,000 rows):")
print("-" * 50)

# Method 1: Traditional filtering
start = time.time()
result1 = large_data[(large_data['Sales'] > 500) & 
                     (large_data['Profit'] > 100) & 
                     (large_data['Year'] >= 2023)]
traditional_time = time.time() - start

# Method 2: Query
start = time.time()
result2 = large_data.query('Sales > 500 and Profit > 100 and Year >= 2023')
query_time = time.time() - start

print(f"Traditional filtering: {traditional_time:.4f} seconds")
print(f"Query method: {query_time:.4f} seconds")
print(f"Speedup: {traditional_time/query_time:.2f}x")
print("\n")

# Computation performance
# Method 1: Traditional
start = time.time()
large_data_copy = large_data.copy()
large_data_copy['Margin'] = (large_data_copy['Profit'] / large_data_copy['Sales']) * 100
traditional_comp_time = time.time() - start

# Method 2: Eval
start = time.time()
large_data_copy2 = large_data.copy()
large_data_copy2.eval('Margin = (Profit / Sales) * 100', inplace=True)
eval_time = time.time() - start

print(f"Traditional computation: {traditional_comp_time:.4f} seconds")
print(f"Eval method: {eval_time:.4f} seconds")
print(f"Speedup: {traditional_comp_time/eval_time:.2f}x")
print("\n")

# ========== ADVANCED QUERY OPERATIONS ==========
# Query with string methods
df_string = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Email': ['alice@email.com', 'bob@test.com', 'charlie@email.com', 
              'david@company.com', 'eve@test.com'],
    'Age': [25, 30, 35, 28, 32]
})

# Note: .str methods are not supported by query's default numexpr engine;
# use boolean indexing, or pass engine='python' to query()
result = df_string[df_string['Email'].str.contains('email')]
print("String filtering (use traditional method):")
print(result)
print("\n")

# Query with numerical ranges
result = df_query.query('500 < Sales < 800')
print("Query with range:")
print(result)
print("\n")

# ========== COMBINING QUERY AND EVAL ==========
# Chain query and eval
df_combined = df_query.copy()
result = (df_combined
          .query('Sales > 400 and Region in ["North", "East"]')
          .eval('Profit_Pct = (Profit / Sales) * 100')
          .query('Profit_Pct > 15')
          .sort_values('Profit_Pct', ascending=False))

print("Combined query and eval:")
print(result[['Product', 'Region', 'Sales', 'Profit', 'Profit_Pct']].head())
print("\n")

# ========== EVAL FOR INDEX OPERATIONS ==========
# Using eval with index
df_indexed = df_query.set_index('Product').copy()
df_indexed.eval('Total = Sales + Profit', inplace=True)
print("Eval with indexed DataFrame:")
print(df_indexed[['Sales', 'Profit', 'Total']].head())
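Beyond the DataFrame methods, the top-level pd.eval function evaluates expressions that span multiple frames; a small sketch with hypothetical df1/df2:

```python
import pandas as pd

df1 = pd.DataFrame({'x': [1, 2, 3]})
df2 = pd.DataFrame({'x': [10, 20, 30]})
# pd.eval resolves df1/df2 from the calling scope and can use the numexpr engine
total = pd.eval('df1 + df2')
print(total)
```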

Summary

When to Use Each Technique

  1. String Operations: When working with text data – cleaning, parsing, pattern matching, or extracting information
  2. Categorical Data: When dealing with repetitive string values in large datasets to save memory and improve performance
  3. Apply/Map: When you need custom logic that can’t be achieved with vectorized operations (but always consider vectorization first)
  4. Missing Data Handling: Essential for real-world datasets – choose strategy based on data type and missingness pattern
  5. Query/Eval: For cleaner syntax and potential performance gains on large datasets with complex filtering or calculations

Best Practices

  • String Operations: Use vectorized .str methods instead of loops; compile regex patterns for reuse
  • Categorical: Convert before analysis, especially for columns with <50% unique values
  • Apply/Map: Prefer vectorized operations when possible; use apply only when necessary
  • Missing Data: Understand why data is missing before choosing a strategy; document imputation methods
  • Query/Eval: Use for readability and performance; test on your specific dataset to confirm speed benefits

Performance Tips

  1. Always profile your code to identify actual bottlenecks
  2. Vectorized operations > Query/Eval > Apply > Python loops
  3. For string operations, compile regex patterns outside loops
  4. Categorical conversion pays off with >100K rows and low cardinality
  5. Missing data: Simpler methods (fillna) are faster than complex interpolation