# Null/NaN handling
**TL;DR**: All dataframe tools, except those that piggyback off pandas, make a clear distinction between NaN and null values.

**Tip**: We recommend only handling null values in applications, and leaving NaN values as an edge case resulting from users having performed undefined mathematical operations.
## What's the difference?
Most data tools, except pandas, make a clear distinction between:

- Null values, representing missing data.
- NaN values, resulting from "illegal" mathematical operations like `0/0`.
In Narwhals, this is reflected in separate methods for Null/NaN values:
| Operation | Null | NaN |
|---|---|---|
| is | `Expr.is_null` | `Expr.is_nan` |
| fill | `Expr.fill_null` | `Expr.fill_nan` |
| drop | `Expr.drop_nulls` | Not yet implemented (see discussion; cf. `polars.Expr.drop_nans`) |
| count | `Expr.null_count` | No upstream equivalent |
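For instance, here is a minimal sketch (using Polars as the native backend, with a made-up column name) showing that `fill_null` and `fill_nan` target different values:

```python
import narwhals as nw
import polars as pl

# A column containing both a NaN and a null (missing) value.
df = nw.from_native(pl.DataFrame({"a": [1.0, float("nan"), None]}))

# fill_null replaces only the missing value; the NaN is left untouched.
print(df.with_columns(filled=nw.col("a").fill_null(0.0)).to_native())

# fill_nan replaces only the NaN; the missing value stays null.
print(df.with_columns(filled=nw.col("a").fill_nan(0.0)).to_native())
```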
In pandas, however, the concepts are muddied: which sentinel value represents missing data depends on the data type.
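A quick illustration of the dtype-dependent sentinels in pandas:

```python
import pandas as pd

# The sentinel used for missing data depends on the dtype:
print(pd.Series([1.0, None]).iloc[1])                         # nan  (float64)
print(pd.Series([pd.Timestamp("2020-01-01"), None]).iloc[1])  # NaT  (datetime64[ns])
print(pd.Series([1, None], dtype="Int64").iloc[1])            # <NA> (nullable Int64)
```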
Check how different tools distinguish them (or don't) in the following example:
```python
import narwhals as nw
from narwhals.typing import IntoFrameT

data = {"a": [1.0, 0.0, None]}


def check_null_behavior(df: IntoFrameT) -> IntoFrameT:
    return (
        nw.from_native(df)
        .with_columns(a=nw.col("a") / nw.col("a"))
        .with_columns(
            a_is_null=nw.col("a").is_null(),
            a_is_nan=nw.col("a").is_nan(),
        )
    ).to_native()
```
```python
import pandas as pd

df = pd.DataFrame(data)
print(check_null_behavior(df))
```

```
     a  a_is_null  a_is_nan
0  1.0      False     False
1  NaN       True      True
2  NaN       True      True
```
```python
import pandas as pd

df = pd.DataFrame(data).convert_dtypes(dtype_backend="pyarrow")
print(check_null_behavior(df))
```

```
      a  a_is_null  a_is_nan
0   1.0      False     False
1   NaN      False      True
2  <NA>       True      <NA>
```
```python
import polars as pl

df = pl.DataFrame(data)
print(check_null_behavior(df))
```

```
shape: (3, 3)
┌──────┬───────────┬──────────┐
│ a    ┆ a_is_null ┆ a_is_nan │
│ ---  ┆ ---       ┆ ---      │
│ f64  ┆ bool      ┆ bool     │
╞══════╪═══════════╪══════════╡
│ 1.0  ┆ false     ┆ false    │
│ NaN  ┆ false     ┆ true     │
│ null ┆ true      ┆ null     │
└──────┴───────────┴──────────┘
```
```python
import pyarrow as pa

df = pa.table(data)
print(check_null_behavior(df))
```

```
pyarrow.Table
a: double
a_is_null: bool
a_is_nan: bool
----
a: [[1,nan,null]]
a_is_null: [[false,false,true]]
a_is_nan: [[false,true,null]]
```
Notice how the classic pandas dtypes make no distinction between the two concepts, whereas the other libraries do. Note, however, that discussion on what PyArrow-backed pandas dataframes should do is ongoing.
## NaN comparisons
According to the IEEE-754 standard, NaN compares as not equal to itself, and ordered comparisons (`<`, `<=`, `>`, `>=`) involving NaN always evaluate to false. Python and PyArrow follow these rules:
```python
import pyarrow as pa
import pyarrow.compute as pc

print("Python result:")
print(float("nan") == float("nan"), 0.0 == 0.0)
print()
print("PyArrow result:")
arr = pa.array([float("nan"), 0.0])
print(pc.equal(arr, arr))
```

```
Python result:
False True

PyArrow result:
[
  false,
  true
]
```
Polars and DuckDB, however, don't follow this rule: they treat NaN as equal to itself.
```python
import polars as pl
import duckdb

print("Polars result")
df = pl.DataFrame({"a": [float("nan"), 0.0]})
print(df.with_columns(a_equals_a=pl.col("a") == pl.col("a")))
print()
print("DuckDB result")
print(duckdb.sql("from df select a, a == a as a_equals_a"))
```
```
Polars result
shape: (2, 2)
┌─────┬────────────┐
│ a   ┆ a_equals_a │
│ --- ┆ ---        │
│ f64 ┆ bool       │
╞═════╪════════════╡
│ NaN ┆ true       │
│ 0.0 ┆ true       │
└─────┴────────────┘

DuckDB result
┌────────┬────────────┐
│   a    │ a_equals_a │
│ double │  boolean   │
├────────┼────────────┤
│    nan │ true       │
│    0.0 │ true       │
└────────┴────────────┘
```
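Given this divergence, comparing a column against NaN with `==` is not portable across backends. A minimal sketch of a backend-agnostic alternative, using `Expr.is_nan` from the table above (with Polars as the native backend):

```python
import narwhals as nw
import polars as pl

df = nw.from_native(pl.DataFrame({"a": [float("nan"), 0.0]}))

# Testing for NaN explicitly avoids relying on backend-specific
# equality semantics for NaN.
print(df.with_columns(a_is_nan=nw.col("a").is_nan()).to_native())
```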
Furthermore, Polars excludes NaN values when computing `max`, whereas DuckDB treats NaN as larger than any other floating-point value.
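A minimal sketch illustrating this difference (the DuckDB query reads the local `df` via a replacement scan, as in the example above):

```python
import polars as pl
import duckdb

df = pl.DataFrame({"a": [1.0, float("nan")]})

# Polars skips the NaN, so the maximum is 1.0 ...
print(df.select(pl.col("a").max()))

# ... whereas DuckDB treats NaN as the largest double, so the maximum is nan.
print(duckdb.sql("select max(a) from df"))
```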
For all these reasons, it bears reiterating that our recommendation is to handle only null values in applications, and to leave NaN values as an edge case resulting from users having performed undefined mathematical operations.