DataFrame
To write a dataframe-agnostic function, the steps you'll want to follow are:
1. Initialise a Narwhals DataFrame or LazyFrame by passing your dataframe to nw.from_native. All the calculations stay lazy if we start with a lazy dataframe - Narwhals will never automatically trigger computation without you asking it to. Note: if you need eager execution, make sure to pass eager_only=True to nw.from_native.
2. Express your logic using the subset of the Polars API supported by Narwhals.
3. If you need to return a dataframe to the user in its original library, call nw.to_native.
Steps 1 and 3 are so common that we provide a utility decorator, @nw.narwhalify, which allows you to explicitly write only step 2.
Let's explore this with some simple examples.
Example 1: descriptive statistics
Just like in Polars, we can pass expressions to DataFrame.select or LazyFrame.select.
Make a Python file with the following content:
import narwhals as nw
from narwhals.typing import FrameT


@nw.narwhalify
def func(df: FrameT) -> FrameT:
    return df.select(
        a_sum=nw.col("a").sum(),
        a_mean=nw.col("a").mean(),
        a_std=nw.col("a").std(),
    )
Let's try it out:
import pandas as pd
df = pd.DataFrame({"a": [1, 1, 2]})
print(func(df))
   a_sum    a_mean    a_std
0      4  1.333333  0.57735
import polars as pl
df = pl.DataFrame({"a": [1, 1, 2]})
print(func(df))
shape: (1, 3)
┌───────┬──────────┬─────────┐
│ a_sum ┆ a_mean   ┆ a_std   │
│ ---   ┆ ---      ┆ ---     │
│ i64   ┆ f64      ┆ f64     │
╞═══════╪══════════╪═════════╡
│ 4     ┆ 1.333333 ┆ 0.57735 │
└───────┴──────────┴─────────┘
import polars as pl
df = pl.LazyFrame({"a": [1, 1, 2]})
print(func(df).collect())
shape: (1, 3)
┌───────┬──────────┬─────────┐
│ a_sum ┆ a_mean   ┆ a_std   │
│ ---   ┆ ---      ┆ ---     │
│ i64   ┆ f64      ┆ f64     │
╞═══════╪══════════╪═════════╡
│ 4     ┆ 1.333333 ┆ 0.57735 │
└───────┴──────────┴─────────┘
import pyarrow as pa
table = pa.table({"a": [1, 1, 2]})
print(func(table))
pyarrow.Table
a_sum: int64
a_mean: double
a_std: double
----
a_sum: [[4]]
a_mean: [[1.3333333333333333]]
a_std: [[0.5773502691896257]]
Alternatively, we could have opted for the more explicit version:
import narwhals as nw
from narwhals.typing import IntoFrameT


def func(df_native: IntoFrameT) -> IntoFrameT:
    df = nw.from_native(df_native)
    df = df.select(
        a_sum=nw.col("a").sum(),
        a_mean=nw.col("a").mean(),
        a_std=nw.col("a").std(),
    )
    return nw.to_native(df)
Despite being more verbose, it has the advantage of preserving the type annotation of the native object - see typing for more details.
In general, in this tutorial, we'll use the former (the @nw.narwhalify version).
Example 2: group-by and mean
Just like in Polars, we can pass expressions to GroupBy.agg.
Make a Python file with the following content:
import narwhals as nw
from narwhals.typing import FrameT


@nw.narwhalify
def func(df: FrameT) -> FrameT:
    return df.group_by("a").agg(nw.col("b").mean()).sort("a")
Let's try it out:
import pandas as pd
df = pd.DataFrame({"a": [1, 1, 2], "b": [4, 5, 6]})
print(func(df))
   a    b
0  1  4.5
1  2  6.0
import polars as pl
df = pl.DataFrame({"a": [1, 1, 2], "b": [4, 5, 6]})
print(func(df))
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 1   ┆ 4.5 │
│ 2   ┆ 6.0 │
└─────┴─────┘
import polars as pl
df = pl.LazyFrame({"a": [1, 1, 2], "b": [4, 5, 6]})
print(func(df).collect())
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 1   ┆ 4.5 │
│ 2   ┆ 6.0 │
└─────┴─────┘
import pyarrow as pa
table = pa.table({"a": [1, 1, 2], "b": [4, 5, 6]})
print(func(table))
pyarrow.Table
a: int64
b: double
----
a: [[1,2]]
b: [[4.5,6]]
Example 3: horizontal sum
Expressions can be free-standing functions which accept other expressions as inputs.
For example, we can compute a horizontal sum using nw.sum_horizontal.
Make a Python file with the following content:
import narwhals as nw
from narwhals.typing import FrameT


@nw.narwhalify
def func(df: FrameT) -> FrameT:
    return df.with_columns(a_plus_b=nw.sum_horizontal("a", "b"))
Let's try it out:
import pandas as pd
df = pd.DataFrame({"a": [1, 1, 2], "b": [4, 5, 6]})
print(func(df))
   a  b  a_plus_b
0  1  4         5
1  1  5         6
2  2  6         8
import polars as pl
df = pl.DataFrame({"a": [1, 1, 2], "b": [4, 5, 6]})
print(func(df))
shape: (3, 3)
┌─────┬─────┬──────────┐
│ a   ┆ b   ┆ a_plus_b │
│ --- ┆ --- ┆ ---      │
│ i64 ┆ i64 ┆ i64      │
╞═════╪═════╪══════════╡
│ 1   ┆ 4   ┆ 5        │
│ 1   ┆ 5   ┆ 6        │
│ 2   ┆ 6   ┆ 8        │
└─────┴─────┴──────────┘
import polars as pl
df = pl.LazyFrame({"a": [1, 1, 2], "b": [4, 5, 6]})
print(func(df).collect())
shape: (3, 3)
┌─────┬─────┬──────────┐
│ a   ┆ b   ┆ a_plus_b │
│ --- ┆ --- ┆ ---      │
│ i64 ┆ i64 ┆ i64      │
╞═════╪═════╪══════════╡
│ 1   ┆ 4   ┆ 5        │
│ 1   ┆ 5   ┆ 6        │
│ 2   ┆ 6   ┆ 8        │
└─────┴─────┴──────────┘
import pyarrow as pa
table = pa.table({"a": [1, 1, 2], "b": [4, 5, 6]})
print(func(table))
pyarrow.Table
a: int64
b: int64
a_plus_b: int64
----
a: [[1,1,2]]
b: [[4,5,6]]
a_plus_b: [[5,6,8]]
Example 4: multiple inputs
nw.narwhalify can also be used to decorate functions that take multiple inputs and return an object that is neither a dataframe nor a series.
For example, let's compute how many rows are left in a dataframe after filtering it based on a series.
Make a Python file with the following content:
from typing import Any

import narwhals as nw


@nw.narwhalify(eager_only=True)
def func(df: nw.DataFrame[Any], s: nw.Series, col_name: str) -> int:
    return df.filter(nw.col(col_name).is_in(s)).shape[0]
We require eager_only=True here because LazyFrame doesn't support .shape.
Let's try it out:
import pandas as pd
df = pd.DataFrame({"a": [1, 1, 2, 2, 3], "b": [4, 5, 6, 7, 8]})
s = pd.Series([1, 3])
print(func(df, s.to_numpy(), "a"))
3
import polars as pl
df = pl.DataFrame({"a": [1, 1, 2, 2, 3], "b": [4, 5, 6, 7, 8]})
s = pl.Series([1, 3])
print(func(df, s.to_numpy(), "a"))
3
import pyarrow as pa
table = pa.table({"a": [1, 1, 2, 2, 3], "b": [4, 5, 6, 7, 8]})
a = pa.array([1, 3])
print(func(table, a.to_numpy(), "a"))
3