DataFrame

To write a dataframe-agnostic function, the steps you'll want to follow are:

  1. Initialise a Narwhals DataFrame or LazyFrame by passing your dataframe to nw.from_native. If you start with a lazy dataframe, all calculations stay lazy: Narwhals never triggers computation unless you ask it to.

    Note: if you need eager execution, make sure to pass eager_only=True to nw.from_native.

  2. Express your logic using the subset of the Polars API supported by Narwhals.

  3. If you need to return a dataframe to the user in its original library, call nw.to_native.

Steps 1 and 3 are so common that we provide a utility @nw.narwhalify decorator, which allows you to only explicitly write step 2.

Let's explore this with some simple examples.

Example 1: descriptive statistics

Just like in Polars, we can pass expressions to DataFrame.select or LazyFrame.select.

Make a Python file with the following content:

import narwhals as nw
from narwhals.typing import FrameT


@nw.narwhalify
def func(df: FrameT) -> FrameT:
    return df.select(
        a_sum=nw.col("a").sum(),
        a_mean=nw.col("a").mean(),
        a_std=nw.col("a").std(),
    )

Let's try it out:

import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2]})
print(func(df))
   a_sum    a_mean    a_std
0      4  1.333333  0.57735
import polars as pl

df = pl.DataFrame({"a": [1, 1, 2]})
print(func(df))
shape: (1, 3)
┌───────┬──────────┬─────────┐
│ a_sum ┆ a_mean   ┆ a_std   │
│ ---   ┆ ---      ┆ ---     │
│ i64   ┆ f64      ┆ f64     │
╞═══════╪══════════╪═════════╡
│ 4     ┆ 1.333333 ┆ 0.57735 │
└───────┴──────────┴─────────┘
import polars as pl

df = pl.LazyFrame({"a": [1, 1, 2]})
print(func(df).collect())
shape: (1, 3)
┌───────┬──────────┬─────────┐
│ a_sum ┆ a_mean   ┆ a_std   │
│ ---   ┆ ---      ┆ ---     │
│ i64   ┆ f64      ┆ f64     │
╞═══════╪══════════╪═════════╡
│ 4     ┆ 1.333333 ┆ 0.57735 │
└───────┴──────────┴─────────┘
import pyarrow as pa

table = pa.table({"a": [1, 1, 2]})
print(func(table))
pyarrow.Table
a_sum: int64
a_mean: double
a_std: double
----
a_sum: [[4]]
a_mean: [[1.3333333333333333]]
a_std: [[0.5773502691896257]]

Alternatively, we could have opted for the more explicit version:

import narwhals as nw
from narwhals.typing import IntoFrameT


def func(df_native: IntoFrameT) -> IntoFrameT:
    df = nw.from_native(df_native)
    df = df.select(
        a_sum=nw.col("a").sum(),
        a_mean=nw.col("a").mean(),
        a_std=nw.col("a").std(),
    )
    return nw.to_native(df)

Despite being more verbose, it has the advantage of preserving the type annotation of the native object - see typing for more details.

In the rest of this tutorial, we'll generally use the decorator version.

Example 2: group-by and mean

Just like in Polars, we can pass expressions to GroupBy.agg. Make a Python file with the following content:

import narwhals as nw
from narwhals.typing import FrameT


@nw.narwhalify
def func(df: FrameT) -> FrameT:
    return df.group_by("a").agg(nw.col("b").mean()).sort("a")

Let's try it out:

import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [4, 5, 6]})
print(func(df))
   a    b
0  1  4.5
1  2  6.0
import polars as pl

df = pl.DataFrame({"a": [1, 1, 2], "b": [4, 5, 6]})
print(func(df))
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 1   ┆ 4.5 │
│ 2   ┆ 6.0 │
└─────┴─────┘
import polars as pl

df = pl.LazyFrame({"a": [1, 1, 2], "b": [4, 5, 6]})
print(func(df).collect())
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 1   ┆ 4.5 │
│ 2   ┆ 6.0 │
└─────┴─────┘
import pyarrow as pa

table = pa.table({"a": [1, 1, 2], "b": [4, 5, 6]})
print(func(table))
pyarrow.Table
a: int64
b: double
----
a: [[1,2]]
b: [[4.5,6]]

Example 3: horizontal sum

Expressions can be free-standing functions which accept other expressions as inputs. For example, we can compute a horizontal sum using nw.sum_horizontal.

Make a Python file with the following content:

import narwhals as nw
from narwhals.typing import FrameT


@nw.narwhalify
def func(df: FrameT) -> FrameT:
    return df.with_columns(a_plus_b=nw.sum_horizontal("a", "b"))

Let's try it out:

import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [4, 5, 6]})
print(func(df))
   a  b  a_plus_b
0  1  4         5
1  1  5         6
2  2  6         8
import polars as pl

df = pl.DataFrame({"a": [1, 1, 2], "b": [4, 5, 6]})
print(func(df))
shape: (3, 3)
┌─────┬─────┬──────────┐
│ a   ┆ b   ┆ a_plus_b │
│ --- ┆ --- ┆ ---      │
│ i64 ┆ i64 ┆ i64      │
╞═════╪═════╪══════════╡
│ 1   ┆ 4   ┆ 5        │
│ 1   ┆ 5   ┆ 6        │
│ 2   ┆ 6   ┆ 8        │
└─────┴─────┴──────────┘
import polars as pl

df = pl.LazyFrame({"a": [1, 1, 2], "b": [4, 5, 6]})
print(func(df).collect())
shape: (3, 3)
┌─────┬─────┬──────────┐
│ a   ┆ b   ┆ a_plus_b │
│ --- ┆ --- ┆ ---      │
│ i64 ┆ i64 ┆ i64      │
╞═════╪═════╪══════════╡
│ 1   ┆ 4   ┆ 5        │
│ 1   ┆ 5   ┆ 6        │
│ 2   ┆ 6   ┆ 8        │
└─────┴─────┴──────────┘
import pyarrow as pa

table = pa.table({"a": [1, 1, 2], "b": [4, 5, 6]})
print(func(table))
pyarrow.Table
a: int64
b: int64
a_plus_b: int64
----
a: [[1,1,2]]
b: [[4,5,6]]
a_plus_b: [[5,6,8]]

Example 4: multiple inputs

nw.narwhalify can also decorate functions that take multiple inputs and that return something other than a dataframe or series.

For example, let's compute how many rows are left in a dataframe after filtering it based on a series.

Make a Python file with the following content:

from typing import Any

import narwhals as nw


@nw.narwhalify(eager_only=True)
def func(df: nw.DataFrame[Any], s: nw.Series, col_name: str) -> int:
    return df.filter(nw.col(col_name).is_in(s)).shape[0]

We need eager_only=True here because LazyFrame does not support .shape.

Let's try it out:

import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 2, 3], "b": [4, 5, 6, 7, 8]})
s = pd.Series([1, 3])
print(func(df, s.to_numpy(), "a"))
3
import polars as pl

df = pl.DataFrame({"a": [1, 1, 2, 2, 3], "b": [4, 5, 6, 7, 8]})
s = pl.Series([1, 3])
print(func(df, s.to_numpy(), "a"))
3
import pyarrow as pa

table = pa.table({"a": [1, 1, 2, 2, 3], "b": [4, 5, 6, 7, 8]})
a = pa.array([1, 3])
print(func(table, a.to_numpy(), "a"))
3