Definition
query() and eval() provide efficient ways to filter and compute on DataFrames using string expressions. They can be faster than traditional methods for large datasets and offer cleaner, more readable syntax.
Key Concepts
- query(): Filter rows using boolean expressions as strings
- eval(): Evaluate expressions and assign results
- Performance: Can use the numexpr engine (when installed) for optimized computation
- Variable References: Use @ to reference Python variables
- Column Names with Spaces: Wrap names containing spaces or special characters in backticks
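The concepts above can be sketched in a few lines; the column names and values here are illustrative, not taken from the dataset below:

```python
import pandas as pd

df = pd.DataFrame({
    'Sales': [120, 640, 880],
    'Unit Price': [10, 20, 30],   # column name with a space
})

threshold = 500  # Python variable, referenced with @ inside query()
filtered = df.query('Sales > @threshold and `Unit Price` >= 20')

# eval() computes an expression and assigns the result as a new column
df = df.eval('Revenue = Sales * `Unit Price`')
```

Note how backticks quote the column name with a space and `@` pulls in the local variable `threshold`.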
Example
python
# Imports needed for these examples
import time

import numpy as np
import pandas as pd

# Create sample dataset
np.random.seed(42)
n_rows = 100000
large_data = pd.DataFrame({
    'Product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Monitor'], n_rows),
    'Region': np.random.choice(['North', 'South', 'East', 'West'], n_rows),
    'Sales': np.random.randint(100, 1000, n_rows),
    'Profit': np.random.randint(10, 200, n_rows),
    'Quantity': np.random.randint(1, 50, n_rows),
    'Year': np.random.choice([2022, 2023, 2024], n_rows),
    'Customer_Rating': np.random.uniform(1, 5, n_rows)
})
# Create smaller version for display
df_query = large_data.head(20).copy()
print("Sample Data:")
print(df_query)
print("\n")
# ========== BASIC QUERY ==========
# Traditional filtering
traditional = df_query[(df_query['Sales'] > 500) & (df_query['Region'] == 'North')]
print("Traditional filtering:")
print(traditional)
print("\n")
# Query method (cleaner syntax)
query_result = df_query.query('Sales > 500 and Region == "North"')
print("Query method:")
print(query_result)
print("\n")
# ========== QUERY WITH MULTIPLE CONDITIONS ==========
# Complex conditions
result = df_query.query('Sales > 500 and (Region == "North" or Region == "South") and Year >= 2023')
print("Complex query:")
print(result)
print("\n")
# Using in operator
result = df_query.query('Region in ["North", "East"] and Sales > 400')
print("Query with 'in' operator:")
print(result)
print("\n")
# ========== QUERY WITH VARIABLES ==========
# Reference Python variables with @
min_sales = 500
max_sales = 800
target_region = 'West'
result = df_query.query('Sales >= @min_sales and Sales <= @max_sales and Region == @target_region')
print("Query with Python variables (@):")
print(result)
print("\n")
# Using list variables
target_products = ['Laptop', 'Monitor']
result = df_query.query('Product in @target_products and Sales > 500')
print("Query with list variable:")
print(result)
print("\n")
# ========== QUERY WITH COLUMN NAME SPACES ==========
# Handling columns with spaces
df_spaces = pd.DataFrame({
    'Product Name': ['A', 'B', 'C', 'D'],
    'Sales Amount': [100, 200, 300, 400],
    'Profit Margin': [10, 20, 30, 40]
})
result = df_spaces.query('`Sales Amount` > 150 and `Profit Margin` >= 20')
print("Query with spaces in column names (use backticks):")
print(result)
print("\n")
# ========== EVAL FOR COMPUTATIONS ==========
# Traditional computation
df_eval = df_query.copy()
df_eval['Profit_Margin'] = (df_eval['Profit'] / df_eval['Sales']) * 100
print("Traditional computation:")
print(df_eval[['Sales', 'Profit', 'Profit_Margin']].head())
print("\n")
# Using eval (cleaner and potentially faster)
df_eval = df_query.copy()
df_eval.eval('Profit_Margin = (Profit / Sales) * 100', inplace=True)
print("Eval computation:")
print(df_eval[['Sales', 'Profit', 'Profit_Margin']].head())
print("\n")
# Multiple computations
df_eval.eval('''
    Revenue = Sales * Quantity
    Cost = Revenue - Profit
    ROI = (Profit / Cost) * 100
''', inplace=True)
print("Multiple eval computations:")
print(df_eval[['Sales', 'Quantity', 'Revenue', 'Cost', 'Profit', 'ROI']].head())
print("\n")
# ========== EVAL WITH ASSIGNMENT ==========
# Direct assignment
df_eval = df_query.copy()
df_eval.eval('Total_Revenue = Sales * Quantity', inplace=True)
df_eval.eval('High_Value = Total_Revenue > 5000', inplace=True)
print("Eval with assignment:")
print(df_eval[['Sales', 'Quantity', 'Total_Revenue', 'High_Value']].head())
print("\n")
# ========== PERFORMANCE COMPARISON ==========
print("Performance Comparison (100,000 rows):")
print("-" * 50)
# Method 1: Traditional filtering
start = time.time()
result1 = large_data[(large_data['Sales'] > 500) &
                     (large_data['Profit'] > 100) &
                     (large_data['Year'] >= 2023)]
traditional_time = time.time() - start
# Method 2: Query
start = time.time()
result2 = large_data.query('Sales > 500 and Profit > 100 and Year >= 2023')
query_time = time.time() - start
print(f"Traditional filtering: {traditional_time:.4f} seconds")
print(f"Query method: {query_time:.4f} seconds")
print(f"Speedup: {traditional_time/query_time:.2f}x")
print("\n")
# Computation performance
# Method 1: Traditional
start = time.time()
large_data_copy = large_data.copy()
large_data_copy['Margin'] = (large_data_copy['Profit'] / large_data_copy['Sales']) * 100
traditional_comp_time = time.time() - start
# Method 2: Eval
start = time.time()
large_data_copy2 = large_data.copy()
large_data_copy2.eval('Margin = (Profit / Sales) * 100', inplace=True)
eval_time = time.time() - start
print(f"Traditional computation: {traditional_comp_time:.4f} seconds")
print(f"Eval method: {eval_time:.4f} seconds")
print(f"Speedup: {traditional_comp_time/eval_time:.2f}x")
print("\n")
# ========== ADVANCED QUERY OPERATIONS ==========
# Query with string methods
df_string = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Email': ['alice@email.com', 'bob@test.com', 'charlie@email.com',
              'david@company.com', 'eve@test.com'],
    'Age': [25, 30, 35, 28, 32]
})
# Note: .str methods are not supported by the default numexpr engine;
# use boolean indexing, or pass engine='python' to query()
result = df_string[df_string['Email'].str.contains('email')]
# Equivalent: df_string.query('Email.str.contains("email")', engine='python')
print("String filtering (boolean indexing):")
print(result)
print("\n")
# Query with numerical ranges
result = df_query.query('500 < Sales < 800')
print("Query with range:")
print(result)
print("\n")
# ========== COMBINING QUERY AND EVAL ==========
# Chain query and eval
df_combined = df_query.copy()
result = (df_combined
          .query('Sales > 400 and Region in ["North", "East"]')
          .eval('Profit_Pct = (Profit / Sales) * 100')
          .query('Profit_Pct > 15')
          .sort_values('Profit_Pct', ascending=False))
print("Combined query and eval:")
print(result[['Product', 'Region', 'Sales', 'Profit', 'Profit_Pct']].head())
print("\n")
# ========== EVAL FOR INDEX OPERATIONS ==========
# Using eval with index
df_indexed = df_query.set_index('Product').copy()
df_indexed.eval('Total = Sales + Profit', inplace=True)
print("Eval with indexed DataFrame:")
print(df_indexed[['Sales', 'Profit', 'Total']].head())
Summary
When to Use Each Technique
- String Operations: When working with text data – cleaning, parsing, pattern matching, or extracting information
- Categorical Data: When dealing with repetitive string values in large datasets to save memory and improve performance
- Apply/Map: When you need custom logic that can’t be achieved with vectorized operations (but always consider vectorization first)
- Missing Data Handling: Essential for real-world datasets – choose strategy based on data type and missingness pattern
- Query/Eval: For cleaner syntax and potential performance gains on large datasets with complex filtering or calculations
Best Practices
- String Operations: Use vectorized .str methods instead of loops; compile regex patterns for reuse
- Categorical: Convert before analysis, especially for columns with <50% unique values
- Apply/Map: Prefer vectorized operations when possible; use apply only when necessary
- Missing Data: Understand why data is missing before choosing a strategy; document imputation methods
- Query/Eval: Use for readability and performance; test on your specific dataset to confirm speed benefits
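Following the last point, a quick sketch of how one might confirm speed benefits on a specific dataset: `timeit` repeats each statement many times, giving steadier numbers than a single `time.time()` delta. The column name and sizes here are placeholders for your own data.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'Sales': np.random.randint(100, 1000, 100_000)})

# Run each filter 50 times and compare total wall time
t_mask = timeit.timeit(lambda: df[df['Sales'] > 500], number=50)
t_query = timeit.timeit(lambda: df.query('Sales > 500'), number=50)
print(f"boolean mask: {t_mask:.4f}s  query: {t_query:.4f}s")
```

Whichever is faster depends on data size and whether numexpr is installed, which is exactly why measuring on your own dataset matters.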
Performance Tips
- Always profile your code to identify actual bottlenecks
- Vectorized operations > Query/Eval > Apply > Python loops
- For string operations, compile regex patterns outside loops
- Categorical conversion pays off with >100K rows and low cardinality
- Missing data: Simpler methods (fillna) are faster than complex interpolation
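The performance ordering above can be checked empirically. A small sketch comparing a vectorized computation against row-wise apply (absolute timings will vary by machine; the gap is the point):

```python
import time

import numpy as np
import pandas as pd

df = pd.DataFrame({'Profit': np.random.randint(10, 200, 200_000),
                   'Sales': np.random.randint(100, 1000, 200_000)})

start = time.time()
vec = (df['Profit'] / df['Sales']) * 100  # vectorized: one pass in compiled code
t_vec = time.time() - start

start = time.time()
# apply with axis=1 invokes a Python function once per row
looped = df.apply(lambda r: r['Profit'] / r['Sales'] * 100, axis=1)
t_apply = time.time() - start

print(f"vectorized: {t_vec:.4f}s  apply: {t_apply:.4f}s")
```

Both produce the same values; the vectorized version is typically orders of magnitude faster at this row count.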
