Boolean columns
Generally speaking, Narwhals operations preserve null values.
For example, if you do nw.col('a')*2, then:
- Values which were non-null get multiplied by 2.
- Null values stay null.
import narwhals as nw
from narwhals.typing import IntoFrameT
data = {"a": [1.4, None, 4.2]}
def multiplication(df: IntoFrameT) -> IntoFrameT:
    return nw.from_native(df).with_columns((nw.col("a") * 2).alias("a*2")).to_native()
import pandas as pd
df = pd.DataFrame(data)
print(multiplication(df))
     a  a*2
0  1.4  2.8
1  NaN  NaN
2  4.2  8.4
import polars as pl
df = pl.DataFrame(data)
print(multiplication(df))
shape: (3, 2)
┌──────┬──────┐
│ a    ┆ a*2  │
│ ---  ┆ ---  │
│ f64  ┆ f64  │
╞══════╪══════╡
│ 1.4  ┆ 2.8  │
│ null ┆ null │
│ 4.2  ┆ 8.4  │
└──────┴──────┘
import pyarrow as pa
table = pa.table(data)
print(multiplication(table))
pyarrow.Table
a: double
a*2: double
----
a: [[1.4,null,4.2]]
a*2: [[2.8,null,8.4]]
What do we do, however, when the result column is boolean? For example, nw.col('a') > 0?
Unfortunately, this is backend-dependent:
- for all backends except pandas, null values are preserved
- for pandas, this depends on the dtype backend:
  - for PyArrow dtypes and pandas nullable dtypes, null values are preserved
  - for the classic NumPy dtypes, null values are typically filled in with False.
pandas is generally moving towards nullable dtypes, and they may become the default in the future. We therefore hope that the classic NumPy dtypes' lack of support for null values is a temporary legacy pandas issue which will eventually go away.
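from narwhals.typing import IntoFrameT
# IntoFrameT (not FrameT) is the right hint here: the function takes and returns native frames.
def comparison(df: IntoFrameT) -> IntoFrameT:
    return nw.from_native(df).with_columns((nw.col("a") > 2).alias("a>2")).to_native()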
import pandas as pd
df = pd.DataFrame(data)
print(comparison(df))
     a    a>2
0  1.4  False
1  NaN  False
2  4.2   True
import polars as pl
df = pl.DataFrame(data)
print(comparison(df))
shape: (3, 2)
┌──────┬───────┐
│ a    ┆ a>2   │
│ ---  ┆ ---   │
│ f64  ┆ bool  │
╞══════╪═══════╡
│ 1.4  ┆ false │
│ null ┆ null  │
│ 4.2  ┆ true  │
└──────┴───────┘
import pyarrow as pa
table = pa.table(data)
print(comparison(table))
pyarrow.Table
a: double
a>2: bool
----
a: [[1.4,null,4.2]]
a>2: [[false,null,true]]