Boolean columns
Null preservation
Generally speaking, Narwhals operations preserve null values.
For example, if you do nw.col('a') * 2, then:
- Values which were non-null get multiplied by 2.
- Null values stay null.
import narwhals as nw
from narwhals.typing import IntoFrameT
data = {"a": [1.4, None, 4.2]}
def multiplication(df: IntoFrameT) -> IntoFrameT:
    return nw.from_native(df).with_columns((nw.col("a") * 2).alias("a*2")).to_native()
import pandas as pd
df = pd.DataFrame(data)
print(multiplication(df))
     a  a*2
0  1.4  2.8
1  NaN  NaN
2  4.2  8.4
import polars as pl
df = pl.DataFrame(data)
print(multiplication(df))
shape: (3, 2)
┌──────┬──────┐
│ a    ┆ a*2  │
│ ---  ┆ ---  │
│ f64  ┆ f64  │
╞══════╪══════╡
│ 1.4  ┆ 2.8  │
│ null ┆ null │
│ 4.2  ┆ 8.4  │
└──────┴──────┘
import pyarrow as pa
table = pa.table(data)
print(multiplication(table))
pyarrow.Table
a: double
a*2: double
----
a: [[1.4,null,4.2]]
a*2: [[2.8,null,8.4]]
What do we do, however, when the result column is boolean? For example, nw.col('a') > 0?
Unfortunately, this is backend-dependent:
- for all backends except pandas, null values are preserved
- for pandas, this depends on the dtype backend:
    - for PyArrow dtypes and pandas nullable dtypes, null values are preserved
    - for the classic NumPy dtypes, null values are typically filled in with False.
pandas is generally moving towards nullable dtypes, and they may become the default in the future. We therefore hope that the lack of null support in the classic NumPy dtypes is a temporary legacy pandas issue that will eventually go away.
from narwhals.typing import IntoFrameT
def comparison(df: IntoFrameT) -> IntoFrameT:
    return nw.from_native(df).with_columns((nw.col("a") > 2).alias("a>2")).to_native()
import pandas as pd
df = pd.DataFrame(data)
print(comparison(df))
     a    a>2
0  1.4  False
1  NaN  False
2  4.2   True
import polars as pl
df = pl.DataFrame(data)
print(comparison(df))
shape: (3, 2)
┌──────┬───────┐
│ a    ┆ a>2   │
│ ---  ┆ ---   │
│ f64  ┆ bool  │
╞══════╪═══════╡
│ 1.4  ┆ false │
│ null ┆ null  │
│ 4.2  ┆ true  │
└──────┴───────┘
import pyarrow as pa
table = pa.table(data)
print(comparison(table))
pyarrow.Table
a: double
a>2: bool
----
a: [[1.4,null,4.2]]
a>2: [[false,null,true]]
Kleene logic
Generally speaking, if we have two boolean columns 'a' and 'b', then nw.col('a') | nw.col('b') and nw.col('a') & nw.col('b') follow Kleene logic. That is to say:
nw.col('a') | nw.col('b') | nw.col('a') \| nw.col('b') | nw.col('a') & nw.col('b') |
---|---|---|---|
True | True | True | True |
True | False | True | False |
True | None | True | None |
False | True | True | False |
False | False | False | False |
False | None | None | False |
None | True | True | None |
None | False | None | False |
None | None | None | None |
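The truth table above can be reproduced with a small plain-Python model of Kleene logic, using None for null. This is only an illustrative sketch: kleene_or and kleene_and are hypothetical helpers, not part of Narwhals, and they say nothing about how any backend implements these operators.

```python
from itertools import product
from typing import Optional


def kleene_or(a: Optional[bool], b: Optional[bool]) -> Optional[bool]:
    """Three-valued OR: a single True decides the result, otherwise null propagates."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def kleene_and(a: Optional[bool], b: Optional[bool]) -> Optional[bool]:
    """Three-valued AND: a single False decides the result, otherwise null propagates."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


# Enumerate all nine input combinations, in the same order as the table.
values = [True, False, None]
truth_table = [
    (a, b, kleene_or(a, b), kleene_and(a, b)) for a, b in product(values, values)
]
for row in truth_table:
    print(row)
```

The key idea is that a null only "wins" when the other operand cannot decide the result on its own.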
Here, too, pandas backed by NumPy types differs, as its boolean columns cannot store null values:
- For nw.col('a') | nw.col('b'), pandas returns True if at least one column contains a True value, and False otherwise.
- For nw.col('a') & nw.col('b'), pandas returns True if both columns contain True values, and False otherwise.
In any_horizontal and all_horizontal there is an ignore_nulls argument, which behaves as follows:
- If True, then null values are ignored and contribute nothing to the final result. If there are no values, the result is:
    - False for any_horizontal.
    - True for all_horizontal.
- If False, then Kleene logic is followed. If using pandas backed by classic NumPy types, then this option is not supported.