Avoiding the UserWarning error while using Pandas group_by
Introduction
If you have ever seen the message
UserWarning: Found complex group-by expression, which can't be expressed efficiently with the pandas API. If you can, please rewrite your query such that group-by aggregations are simple (e.g. mean, std, min, max, ...)
while using the narwhals group_by() method, this is for you. If you haven't, this is also for you: you may well run into it, and it helps to know how to avoid it.
The warning means that the pandas API cannot efficiently express the aggregation you are trying to run. Take the following two snippets as an example.
import narwhals as nw
import pandas as pd
data = {"a": [1, 2, 3, 4, 5], "b": [5, 4, 3, 2, 1], "c": [10, 20, 30, 40, 50]}
df_pd = pd.DataFrame(data)
@nw.narwhalify
def approach_1(df):
    # Pay attention to this next line
    df = df.group_by("a").agg(d=(nw.col("b") + nw.col("c")).sum())
    return df
print(approach_1(df_pd))
   a   d
0  1  15
1  2  24
2  3  33
3  4  42
4  5  51
import narwhals as nw
import pandas as pd
data = {"a": [1, 2, 3, 4, 5], "b": [5, 4, 3, 2, 1], "c": [10, 20, 30, 40, 50]}
df_pd = pd.DataFrame(data)
@nw.narwhalify
def approach_2(df):
    # Pay attention to this next line
    df = df.with_columns(d=nw.col("b") + nw.col("c")).group_by("a").agg(nw.sum("d"))
    return df
print(approach_2(df_pd))
   a   d
0  1  15
1  2  24
2  3  33
3  4  42
4  5  51
Both approaches shown above return exactly the same result, but Approach 1 is inefficient and triggers the warning message shown at the top.
What makes the first approach inefficient and the second approach efficient? It comes down to what the pandas API lets us express.
Approach 1
# The key line of approach_1, written equivalently with .alias()
return df.group_by("a").agg((nw.col("b") + nw.col("c")).sum().alias("d"))
To translate this to pandas, we would do:
df.groupby("a").apply(
    lambda df: pd.Series([(df["b"] + df["c"]).sum()], index=["d"]), include_groups=False
)
Any time you use apply in pandas, that's a performance footgun - best to avoid it and use vectorised operations instead.
Let's take a look at how "approach 2" gets translated to pandas to see the difference.
Approach 2
# The key line of approach_2
return df.with_columns(d=nw.col("b") + nw.col("c")).group_by("a").agg(nw.sum("d"))
This gets roughly translated to:
df.assign(d=lambda df: df["b"] + df["c"]).groupby("a").agg({"d": "sum"})
Because this uses pandas' own API, as opposed to apply and a custom lambda function, it is going to be much more efficient.
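As a quick sanity check, the two pandas translations can be run side by side on the same toy data. This is only a sketch: the names via_apply and via_assign are illustrative, and the apply variant selects the "b" and "c" columns explicitly instead of passing include_groups=False, an assumption made here so the snippet does not depend on a recent pandas version.

```python
import pandas as pd

data = {"a": [1, 2, 3, 4, 5], "b": [5, 4, 3, 2, 1], "c": [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Approach 1 style: a Python lambda is called once per group.
via_apply = df.groupby("a")[["b", "c"]].apply(
    lambda g: pd.Series([(g["b"] + g["c"]).sum()], index=["d"])
)

# Approach 2 style: one vectorised addition, then a built-in aggregation.
via_assign = df.assign(d=lambda df: df["b"] + df["c"]).groupby("a").agg({"d": "sum"})

# Both routes should produce the same per-group totals.
same = via_apply["d"].tolist() == via_assign["d"].tolist()
print(same)
```

On five rows the difference in speed is invisible; the cost of the per-group lambda only becomes noticeable as the number of groups grows.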
Tips for Avoiding the UserWarning
To ensure efficiency and avoid warnings similar to those seen in Approach 1, we recommend that you follow these practices:
- Decompose complex operations: break down complex transformations into simpler steps. In this case, keep the .agg method simple. Compute new columns first, then use these columns in aggregation or other operations.
- Avoid redundant computations: if an operation (like addition) is used multiple times, compute it once and store the result in a new column.
- Leverage built-in functions: use built-in functions provided by the DataFrame library. In this case, using the with_columns() method allows you to pre-compute before grouping and aggregation.
By following these guidelines, you can be sure to avoid the aforementioned warning.
Happy grouping! 🫡