# Null/NaN handling
**TL;DR**: All dataframe tools, except those that piggyback off pandas, make a clear distinction between NaN and null values.

**Tip**: We recommend only handling null values in applications, and leaving NaN values as an edge case resulting from users having performed undefined mathematical operations.
## What's the difference?
Most data tools, except pandas, make a clear distinction between:

- Null values, representing missing data.
- NaN values, resulting from "illegal" mathematical operations like `0/0`.
In Narwhals, this is reflected in separate methods for Null/NaN values:
| Operation | Null | NaN |
|---|---|---|
| is | `Expr.is_null` | `Expr.is_nan` |
| fill | `Expr.fill_null` | `Expr.fill_nan` |
| drop | `Expr.drop_nulls` | Not yet implemented (see discussion; cf. `polars.Expr.drop_nans`) |
| count | `Expr.null_count` | No upstream equivalent |
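For instance, here is a minimal sketch (using Polars as the native backend, with a made-up column name) showing that `fill_null` and `fill_nan` target different values:

```python
import narwhals as nw
import polars as pl

# A column containing both a NaN and a null (missing) value.
df = nw.from_native(pl.DataFrame({"a": [1.0, float("nan"), None]}))

# fill_null replaces only the missing value; the NaN is left untouched.
print(df.with_columns(filled=nw.col("a").fill_null(0.0)).to_native())

# fill_nan replaces only the NaN; the missing value stays null.
print(df.with_columns(filled=nw.col("a").fill_nan(0.0)).to_native())
```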
In pandas, however, the concepts are muddied: which sentinel value represents missing data depends on the data type.
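A quick illustration of the dtype-dependent sentinels in pandas:

```python
import pandas as pd

# The sentinel used for missing data depends on the dtype:
print(pd.Series([1.0, None]).iloc[1])                         # nan  (float64)
print(pd.Series([pd.Timestamp("2020-01-01"), None]).iloc[1])  # NaT  (datetime64[ns])
print(pd.Series([1, None], dtype="Int64").iloc[1])            # <NA> (nullable Int64)
```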
Check how different tools distinguish them (or don't) in the following example:
```python
import narwhals as nw
from narwhals.typing import IntoFrameT

data = {"a": [1.0, 0.0, None]}


def check_null_behavior(df: IntoFrameT) -> IntoFrameT:
    return (
        nw.from_native(df)
        .with_columns(a=nw.col("a") / nw.col("a"))
        .with_columns(
            a_is_null=nw.col("a").is_null(),
            a_is_nan=nw.col("a").is_nan(),
        )
    ).to_native()
```
```python
import pandas as pd

df = pd.DataFrame(data)
print(check_null_behavior(df))
```

```
     a  a_is_null  a_is_nan
0  1.0      False     False
1  NaN       True      True
2  NaN       True      True
```
```python
import pandas as pd

df = pd.DataFrame(data).convert_dtypes(dtype_backend="pyarrow")
print(check_null_behavior(df))
```

```
      a  a_is_null  a_is_nan
0   1.0      False     False
1   NaN      False      True
2  <NA>       True      <NA>
```
```python
import polars as pl

df = pl.DataFrame(data)
print(check_null_behavior(df))
```

```
shape: (3, 3)
┌──────┬───────────┬──────────┐
│ a    ┆ a_is_null ┆ a_is_nan │
│ ---  ┆ ---       ┆ ---      │
│ f64  ┆ bool      ┆ bool     │
╞══════╪═══════════╪══════════╡
│ 1.0  ┆ false     ┆ false    │
│ NaN  ┆ false     ┆ true     │
│ null ┆ true      ┆ null     │
└──────┴───────────┴──────────┘
```
```python
import pyarrow as pa

df = pa.table(data)
print(check_null_behavior(df))
```

```
pyarrow.Table
a: double
a_is_null: bool
a_is_nan: bool
----
a: [[1,nan,null]]
a_is_null: [[false,false,true]]
a_is_nan: [[false,true,null]]
```
Notice how the classic pandas dtypes make no distinction between the two concepts, whereas the other libraries do. Note, however, that discussion on what PyArrow-backed pandas dataframes should do is ongoing.
## NaN comparisons
According to the IEEE-754 standard, NaN compares as not equal to itself, and ordered comparisons (`<`, `<=`, `>`, `>=`) involving NaN always evaluate to false. Python and PyArrow follow these rules:
```python
import pyarrow as pa
import pyarrow.compute as pc

print("Python result:")
print(float("nan") == float("nan"), 0.0 == 0.0)
print()
print("PyArrow result:")
arr = pa.array([float("nan"), 0.0])
print(pc.equal(arr, arr))
```

```
Python result:
False True

PyArrow result:
[
  false,
  true
]
```
Polars and DuckDB, however, don't follow this rule: they treat NaN as equal to itself.
```python
import polars as pl
import duckdb

print("Polars result")
df = pl.DataFrame({"a": [float("nan"), 0.0]})
print(df.with_columns(a_equals_a=pl.col("a") == pl.col("a")))
print()
print("DuckDB result")
print(duckdb.sql("from df select a, a == a as a_equals_a"))
```
```
Polars result
shape: (2, 2)
┌─────┬────────────┐
│ a   ┆ a_equals_a │
│ --- ┆ ---        │
│ f64 ┆ bool       │
╞═════╪════════════╡
│ NaN ┆ true       │
│ 0.0 ┆ true       │
└─────┴────────────┘

DuckDB result
┌────────┬────────────┐
│   a    │ a_equals_a │
│ double │  boolean   │
├────────┼────────────┤
│    nan │ true       │
│    0.0 │ true       │
└────────┴────────────┘
```
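Given this divergence, comparing a column against NaN with `==` is not portable across backends. A minimal sketch of a backend-agnostic alternative, using `Expr.is_nan` from the table above (with Polars as the native backend):

```python
import narwhals as nw
import polars as pl

df = nw.from_native(pl.DataFrame({"a": [float("nan"), 0.0]}))

# Testing for NaN explicitly avoids relying on backend-specific
# equality semantics for NaN.
print(df.with_columns(a_is_nan=nw.col("a").is_nan()).to_native())
```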
Furthermore, Polars excludes NaN values when computing `max`, whereas DuckDB treats NaN as larger than any other floating-point value.
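A minimal sketch illustrating this difference (the DuckDB query reads the local `df` via a replacement scan, as in the example above):

```python
import polars as pl
import duckdb

df = pl.DataFrame({"a": [1.0, float("nan")]})

# Polars skips the NaN, so the maximum is 1.0 ...
print(df.select(pl.col("a").max()))

# ... whereas DuckDB treats NaN as the largest double, so the maximum is nan.
print(duckdb.sql("select max(a) from df"))
```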
For all these reasons, it bears reiterating that our recommendation is to handle only null values in applications, and to leave NaN values as an edge case resulting from users having performed undefined mathematical operations.