Skip to content

Empty aggregations

What is the sum of zero values? As it turns out, tools disagree:

import narwhals as nw
import polars as pl
import duckdb

polars_df = pl.DataFrame({"a": [None], "b": [1]}, schema={"a": pl.Int64, "b": pl.Int64})
print("Polars result")
print(polars_df.group_by("b").agg(pl.col("a").sum()))

print("DuckDB result")
print(duckdb.sql("""from polars_df select b, sum(a) as a group by b"""))
Polars result
shape: (1, 2)
┌─────┬─────┐
 b    a   
 ---  --- 
 i64  i64 
╞═════╪═════╡
 1    0   
└─────┴─────┘
DuckDB result
┌───────┬────────┐
   b      a    
 int64  int128 
├───────┼────────┤
     1    NULL 
└───────┴────────┘

Polars, pandas, and PyArrow think the result is zero. SQL engines think it's NULL. Who's correct?

For now, we respect each backend's opinion and leave this result backend-specific, to avoid interfering with how aggregations compose with other operations. If it's crucial to you that an empty sum returns 0 for all backends, you can always follow the sum with fill_null(0).

from narwhals.typing import IntoFrameT


def custom_group_by_sum(df_native: IntoFrameT) -> IntoFrameT:
    return (
        nw.from_native(df_native)
        .group_by("b")
        .agg(nw.col("a").sum())
        .with_columns(nw.col("a").fill_null(0))
    )


print("Polars result:")
print(custom_group_by_sum(polars_df))
print("DuckDB result:")
print(custom_group_by_sum(duckdb.table("polars_df")))
Polars result:
┌──────────────────┐
|Narwhals DataFrame|
|------------------|
|  shape: (1, 2)   |
|  ┌─────┬─────┐   |
|   b    a      |
|   ---  ---    |
|   i64  i64    |
|  ╞═════╪═════╡   |
|   1    0      |
|  └─────┴─────┘   |
└──────────────────┘
DuckDB result:
┌──────────────────┐
|Narwhals LazyFrame|
|------------------|
|┌───────┬────────┐|
|   b      a    |
| int64  int128 |
|├───────┼────────┤|
|     1       0 |
|└───────┴────────┘|
└──────────────────┘